A detailed discussion of the codebase of FindResearch.online
In the world of academic research, finding relevant papers across multiple sources can be a daunting task. FindResearch.online aims to simplify this process by providing a unified interface to search and explore research papers from various academic databases. In this blog post, we'll take an in-depth look at the frontend and the backend implementation of this open-source project, exploring how it leverages Next.js with the App Router to create a research discovery tool.
The Insights API is a cornerstone feature of FindResearch.online, providing AI-powered summarization of paper abstracts. This feature allows users to quickly grasp the main findings and methodologies of research papers without having to read through entire abstracts. Let's dive deep into its implementation and the technology behind it.
The Insights API uses the DistilBERT model, specifically the "Xenova/distilbert-base-cased-distilled-squad" variant. Let's break down what this means:
DistilBERT: This is a smaller, faster, cheaper, and lighter version of BERT (Bidirectional Encoder Representations from Transformers). BERT is a transformer-based machine learning technique for natural language processing (NLP) pre-training developed by Google.
Distillation: The process used to create DistilBERT. It's a compression technique in which a compact model (the student) is trained to reproduce the behavior of a larger model (the teacher) or an ensemble of models.
SQuAD: Stanford Question Answering Dataset, which this model is fine-tuned on. It's a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles.
Question Answering: The specific NLP task this model is designed for. It can understand a given context (in our case, a paper abstract) and answer questions about it.
Efficiency: DistilBERT is a compressed version of BERT, designed to be smaller and faster while retaining much of BERT's performance. According to the original paper by Sanh et al. (2019), DistilBERT retains about 97% of BERT's performance on the GLUE language understanding benchmark while being about 40% smaller and 60% faster. This makes it suitable for deployment in production environments, especially for web applications where response time is crucial.
Source: Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Accuracy: Despite being smaller, DistilBERT still provides high-quality results for many NLP tasks, including question answering.
Versatility: The model can be used to extract various types of information from text by framing the extraction as a question-answering task.
The model is loaded asynchronously and cached for reuse. This ensures that we only load the model once, reducing initialization time for subsequent requests.
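A minimal sketch of that load-once pattern, assuming the @xenova/transformers package (the file path and function name here are illustrative, not the project's exact code):

```typescript
// lib/insightsModel.ts (illustrative path) — load the QA model once and reuse it.
import { pipeline } from '@xenova/transformers';

// Cache the in-flight promise so concurrent requests share a single model load.
// Typed loosely to keep the sketch short.
let qaModelPromise: Promise<any> | null = null;

export function getQAModel(): Promise<any> {
  if (!qaModelPromise) {
    qaModelPromise = pipeline(
      'question-answering',
      'Xenova/distilbert-base-cased-distilled-squad'
    );
  }
  return qaModelPromise;
}
```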
Feature Extraction:
This function uses the model to answer specific questions about the abstract. By framing our information extraction as a question-answering task, we leverage the model's capabilities to understand and summarize text.
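In sketch form, such a helper might look like this (it reuses the getQAModel helper from the previous sketch; the function name is an assumption):

```typescript
// Ask a single question against the abstract and return the extracted answer text.
async function extractFeature(question: string, abstract: string): Promise<string> {
  const qa = await getQAModel();
  // The question-answering pipeline returns an object with `answer` and `score`.
  const result = await qa(question, abstract);
  return result?.answer ?? '';
}
```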
Predefined Questions:
We define specific questions to extract key information from the abstract. This approach allows us to target the most relevant information for researchers.
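For example (the exact wording used in the project may differ; these two reflect the goals stated earlier, main findings and methodology):

```typescript
// Illustrative question set, keyed by the insight each question is meant to extract.
const QUESTIONS = {
  mainFindings: 'What are the main findings of this research?',
  methodology: 'What methodology does this study use?',
} as const;
```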
Validation and Cleaning:
This function ensures that the extracted features are clean and valid. It removes HTML tags and filters out responses that don't contain any alphanumeric characters.
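A sketch of that cleaning step (the function name is assumed):

```typescript
// Strip HTML tags and reject answers that contain no alphanumeric characters.
function cleanFeature(raw: string): string | null {
  const withoutTags = raw.replace(/<[^>]*>/g, '').trim();
  return /[a-zA-Z0-9]/.test(withoutTags) ? withoutTags : null;
}
```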
API Endpoint:
The main POST handler orchestrates the entire process:
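Putting the pieces together, a hedged sketch of the route handler (the route path is an assumption, and it leans on the helpers sketched above):

```typescript
// app/api/insights/route.ts (illustrative path)
import { NextResponse } from 'next/server';

export async function POST(request: Request) {
  const { abstract } = await request.json();
  if (!abstract || typeof abstract !== 'string') {
    return NextResponse.json({ error: 'Abstract is required' }, { status: 400 });
  }

  // Run each predefined question against the abstract and keep only clean answers.
  const insights: Record<string, string> = {};
  for (const [key, question] of Object.entries(QUESTIONS)) {
    const answer = cleanFeature(await extractFeature(question, abstract));
    if (answer) insights[key] = answer;
  }

  return NextResponse.json({ insights });
}
```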
Model Size: Even though DistilBERT is smaller than BERT, it's still a substantial model. Loading it can take time and resources. We mitigate this by caching the model after the first load.
Accuracy vs Speed: There's always a trade-off between model accuracy and inference speed. DistilBERT strikes a good balance, but for some specific use cases, a larger model might be more accurate, or a smaller model might be faster.
Context Length: Transformer models like DistilBERT have a maximum input length (usually 512 tokens). For very long abstracts, we might need to implement truncation or chunking strategies.
Generalization: The model's performance can vary depending on the domain of the research papers. It might perform better on some fields than others based on its training data.
Fine-tuning: We could potentially fine-tune the model on a dataset of scientific abstracts to improve its performance on this specific task.
Multi-lingual Support: Implementing support for abstracts in multiple languages could make the tool more globally accessible.
Extended Features: We could extract more features like research implications, dataset information, or key citations.
Model Quantization: To further optimize performance, we could explore quantized versions of the model that maintain accuracy while reducing size and increasing speed.
This endpoint serves as a proxy for the Papers with Code API. Here's a breakdown of its functionality, with a sketch of the handler after the list:
It extracts the query and page parameters from the request URL using searchParams.
It validates the presence of these parameters, returning a 400 error if either is missing.
It constructs a URL for the Papers with Code API, encoding the query parameter to ensure special characters are properly handled.
It uses Axios to make a GET request to the Papers with Code API.
If successful, it returns the response data directly to the client.
If an error occurs, it logs the error and returns a 500 error response.
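A minimal sketch of a handler with this behavior (the route path and error messages are assumptions):

```typescript
// app/api/papers-with-code/route.ts (illustrative path)
import axios from 'axios';
import { NextResponse } from 'next/server';

export async function GET(request: Request) {
  const { searchParams } = new URL(request.url);
  const query = searchParams.get('query');
  const page = searchParams.get('page');

  // Both parameters are required; bail out early with a 400 if either is missing.
  if (!query || !page) {
    return NextResponse.json({ error: 'Missing query or page parameter' }, { status: 400 });
  }

  try {
    const response = await axios.get(
      `https://paperswithcode.com/api/v1/papers/?q=${encodeURIComponent(query)}&page=${page}`
    );
    // Pass the Papers with Code payload straight through to the client.
    return NextResponse.json(response.data);
  } catch (error) {
    console.error('Papers with Code request failed:', error);
    return NextResponse.json({ error: 'Failed to fetch from Papers with Code' }, { status: 500 });
  }
}
```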
This implementation allows the frontend to interact with the Papers with Code API without exposing the API endpoint directly, providing an additional layer of abstraction and control.
The CORE API endpoint follows a similar pattern to the others but with some unique aspects (a sketch of the handler follows the list):
It requires an API key, which is stored as an environment variable (process.env.CORE_API_KEY).
The URL construction includes pagination parameters (limit and offset).
The response structure is validated before processing to ensure it matches the expected format.
Like the arXiv endpoint, it maps the CORE-specific data structure to the standardized Article format.
It handles some CORE-specific fields like downloadUrl and fullTextIdentifier.
The referenceCount is set to 0 as the CORE API doesn't provide this information.
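In sketch form, assuming the endpoint talks to CORE's v3 search API (the exact field names and the Article shape are illustrative):

```typescript
// app/api/core/route.ts (illustrative path)
import axios from 'axios';
import { NextResponse } from 'next/server';

const PAGE_SIZE = 10;

export async function GET(request: Request) {
  const { searchParams } = new URL(request.url);
  const query = searchParams.get('query');
  const page = Number(searchParams.get('page') ?? '1');

  if (!query) {
    return NextResponse.json({ error: 'Missing query parameter' }, { status: 400 });
  }

  try {
    const response = await axios.get('https://api.core.ac.uk/v3/search/works', {
      params: { q: query, limit: PAGE_SIZE, offset: (page - 1) * PAGE_SIZE },
      headers: { Authorization: `Bearer ${process.env.CORE_API_KEY}` },
    });

    // Validate the response shape before mapping it.
    const results = Array.isArray(response.data?.results) ? response.data.results : [];

    // Map CORE-specific fields onto the app's standardized Article format.
    const articles = results.map((item: any) => ({
      title: item.title ?? 'Untitled',
      abstract: item.abstract ?? '',
      authors: (item.authors ?? []).map((a: any) => a.name).filter(Boolean),
      doi: item.doi ?? '',
      downloadUrl: item.downloadUrl ?? '',
      fullTextIdentifier: item.fullTextIdentifier ?? '',
      referenceCount: 0, // CORE does not provide reference counts.
    }));

    return NextResponse.json({ articles });
  } catch (error) {
    console.error('CORE request failed:', error);
    return NextResponse.json({ error: 'Failed to fetch from CORE' }, { status: 500 });
  }
}
```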
This implementation ensures that data from the CORE API is presented in the same format as other sources, maintaining consistency across the application.
Unlike the other APIs we've discussed, the Crossref API is implemented directly in the frontend service layer. This approach allows for more flexibility in handling the Crossref-specific data structure and pagination. Let's examine the implementation in detail.
File: app/services/articleService.js
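The function looks roughly like the sketch below (a TypeScript-flavored illustration: the Article shape, row count, and field selection are assumptions based on the breakdown that follows):

```typescript
// Minimal Article shape used throughout these sketches; the real type has more fields.
interface Article {
  doi: string;
  title: string;
  abstract: string;
  authors: string[];
  journal: string;
  publishedDate: string;
  citationCount: number;
  source?: string;
  relevanceScore?: number;
}

const CROSSREF_SELECT =
  'DOI,title,abstract,author,issued,container-title,is-referenced-by-count';

export async function fetchCrossrefArticles(query: string): Promise<Article[]> {
  const base = 'https://api.crossref.org/works';
  const common = `rows=20&sort=relevance&filter=type:journal-article&select=${CROSSREF_SELECT}`;

  // One URL for exact bibliographic matching, one for broader fuzzy search.
  const exactUrl = `${base}?query.bibliographic=${encodeURIComponent(query)}&${common}`;
  const fuzzyUrl = `${base}?query=${encodeURIComponent(query)}&${common}`;

  try {
    // Fire both requests in parallel.
    const [exactRes, fuzzyRes] = await Promise.all([fetch(exactUrl), fetch(fuzzyUrl)]);
    const [exactData, fuzzyData] = await Promise.all([exactRes.json(), fuzzyRes.json()]);

    // Transform Crossref items into the standardized Article format.
    const processItems = (items: any[]): Article[] =>
      items
        // Skip entries without abstracts or titles, and skip figure/table records.
        .filter(
          (item) =>
            item.abstract &&
            item.title?.[0] &&
            !/^(figure|table)\b/i.test(item.title[0])
        )
        .map((item) => ({
          doi: item.DOI,
          title: item.title[0],
          abstract: item.abstract,
          authors: (item.author ?? []).map((a: any) =>
            [a.given, a.family].filter(Boolean).join(' ')
          ),
          journal: item['container-title']?.[0] ?? '',
          publishedDate: item.issued?.['date-parts']?.[0]?.join('-') ?? '',
          citationCount: item['is-referenced-by-count'] ?? 0,
          source: 'crossref',
        }));

    // Combine both result sets and deduplicate by DOI.
    const combined = [
      ...processItems(exactData.message?.items ?? []),
      ...processItems(fuzzyData.message?.items ?? []),
    ];
    return Array.from(new Map(combined.map((a) => [a.doi, a])).values());
  } catch (error) {
    console.error('Error fetching from Crossref:', error);
    return [];
  }
}
```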
Let's break down this implementation:
URL Construction:
Two URLs are constructed: one for exact matching and another for fuzzy searching.
The exact match URL uses query.bibliographic to search for an exact match in bibliographic fields.
The fuzzy search URL uses a general query parameter for broader matching.
Both URLs include parameters for sorting, filtering (only journal articles), and selecting specific fields to reduce response size.
Parallel Requests:
The function uses Promise.all to send both requests (exact match and fuzzy search) simultaneously, improving efficiency.
Data Processing:
The processItems function is defined to transform Crossref items into the standardized Article format used by the application.
It filters out items without abstracts or titles, and excludes figures and tables (which sometimes appear as separate entries in Crossref).
The function handles various edge cases, such as missing author names or publication dates.
Result Combination:
Results from both the exact match and fuzzy search are combined.
Duplicate entries are removed based on DOI to ensure unique results.
Error Handling:
Any errors during the fetch process are caught, logged, and an empty array is returned to prevent the application from crashing.
This implementation offers several advantages:
Comprehensive Search: By combining exact match and fuzzy search, it increases the chances of finding relevant articles.
Efficiency: Parallel requests and selective field retrieval optimize the API calls.
Flexibility: Direct implementation in the service layer allows for complex data transformation and error handling.
Standardization: The Crossref-specific data structure is transformed into a consistent format used across the application.
However, there are also some considerations:
Rate Limiting: Crossref has rate limits, which this implementation doesn't explicitly handle. For a production application, implementing proper rate limiting and error handling for 429 (Too Many Requests) responses would be crucial.
Pagination: The current implementation fetches a fixed number of results. For large result sets, implementing proper pagination would be important.
Error Handling: While basic error handling is implemented, more sophisticated error handling (e.g., retrying failed requests, handling specific HTTP status codes) could improve reliability.
The articleService.js file is the backbone of FindResearch.online, powering the application's search, ranking, and filtering capabilities. This crucial component employs several algorithms and data structures to provide users with relevant and diverse research results. Let's dive deep into the key technical aspects of this service.
FindResearch.online sets itself apart by aggregating results from multiple scholarly databases, providing a comprehensive view of available research. The fetchAndEnhanceArticles function demonstrates this capability:
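A hedged sketch of that function (the per-source helper names and the downstream deduplication and ranking calls are assumptions):

```typescript
async function fetchAndEnhanceArticles(query: string): Promise<Article[]> {
  // Query all four sources concurrently rather than one after another.
  const [crossref, core, arxiv, papersWithCode] = await Promise.all([
    fetchCrossrefArticles(query),
    fetchCoreArticles(query),
    fetchArxivArticles(query),
    fetchPapersWithCodeArticles(query),
  ]);

  // Merge the result sets, drop duplicates, and score what remains.
  const merged = deduplicateArticles([...crossref, ...core, ...arxiv, ...papersWithCode]);
  return rankArticles(merged, query);
}
```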
By using Promise.all, the service makes concurrent API requests to Crossref, CORE, arXiv, and Papers With Code. This approach significantly reduces the overall response time by parallelizing the API calls, enhancing the user experience by providing faster search results.
The heart of our search functionality is the BM25 (Best Matching 25) ranking algorithm. BM25 is a probabilistic retrieval framework that has been widely used in information retrieval systems due to its effectiveness and efficiency (Robertson & Zaragoza, 2009).
Let's break down the implementation of BM25 in FindResearch.online:
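The sketch below shows the shape of such a scorer over an article's title and abstract; the constants and the CorpusStats fields follow the description in this post rather than the project's code verbatim:

```typescript
// Corpus-wide statistics needed to compute BM25 for a single document.
interface CorpusStats {
  docCount: number;                 // total number of articles in the result set
  avgDocLength: number;             // average length (in terms) of title + abstract
  docFreq: Record<string, number>;  // how many documents contain each term
}

const k1 = 1.5; // term-frequency saturation
const b = 0.75; // document-length normalization

function bm25Score(article: Article, terms: string[], stats: CorpusStats): number {
  const tokens = `${article.title} ${article.abstract}`
    .toLowerCase()
    .split(/\W+/)
    .filter(Boolean);
  const docLength = tokens.length;

  let score = 0;
  for (const term of terms) {
    // tf: occurrences of the term in this document.
    const tf = tokens.filter((t) => t === term).length;
    if (tf === 0) continue;

    // idf: rarer terms across the corpus contribute more (smoothed, non-negative variant).
    const n = stats.docFreq[term] ?? 0;
    const idf = Math.log((stats.docCount - n + 0.5) / (n + 0.5) + 1);

    // Standard BM25 term contribution.
    score +=
      idf *
      ((tf * (k1 + 1)) /
        (tf + k1 * (1 - b + b * (docLength / stats.avgDocLength))));
  }
  return score;
}
```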
This implementation calculates a relevance score for each article based on the frequency of search terms in the article's title and abstract, as well as the inverse document frequency (IDF) of these terms across the entire corpus. Let's break down the key components:
k1 and b are tuning parameters. k1 controls term frequency saturation, while b controls document length normalization.
The CorpusStats interface defines the structure for storing corpus-wide statistics needed for BM25 calculation.
The function iterates over each search term, calculating its contribution to the overall score:
tf (term frequency) is calculated by counting the occurrences of the term in the document.
idf (inverse document frequency) is calculated using the classic IDF formula, which gives higher weight to rarer terms.
The final score for each term is calculated using the BM25 formula, which balances term frequency, inverse document frequency, and document length normalization.
To use this function effectively, we need to calculate corpus statistics:
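A sketch of that pass over the result set:

```typescript
function calculateCorpusStats(articles: Article[]): CorpusStats {
  const docFreq: Record<string, number> = {};
  let totalLength = 0;

  for (const article of articles) {
    const tokens = `${article.title} ${article.abstract}`
      .toLowerCase()
      .split(/\W+/)
      .filter(Boolean);
    totalLength += tokens.length;

    // Count each distinct term once per document for document frequency.
    for (const term of new Set(tokens)) {
      docFreq[term] = (docFreq[term] ?? 0) + 1;
    }
  }

  return {
    docCount: articles.length,
    avgDocLength: articles.length ? totalLength / articles.length : 0,
    docFreq,
  };
}
```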
This function calculates the necessary statistics across the entire corpus, including document count, average document length, and term frequencies.
The BM25 scores are then normalized and combined with other factors to produce a final ranking score:
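An illustrative blend might look like the following; the weights and the recency window here are placeholders, not the production values:

```typescript
// Illustrative weighting only: normalize BM25, dampen citation counts, reward recency.
function finalScore(article: Article, bm25: number, maxBm25: number): number {
  const relevance = maxBm25 > 0 ? bm25 / maxBm25 : 0;                        // scale to [0, 1]
  const citations = Math.min(1, Math.log1p(article.citationCount) / 10);     // dampen large counts
  const ageYears =
    (Date.now() - new Date(article.publishedDate).getTime()) / (365 * 24 * 3600 * 1000);
  const recency = Number.isFinite(ageYears) ? Math.max(0, 1 - ageYears / 10) : 0;

  return 0.7 * relevance + 0.2 * citations + 0.1 * recency;
}
```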
This approach allows us to balance relevance with other important factors like citation count and recency.
To provide unique and diverse results, especially when aggregating from multiple sources, the service implements an efficient deduplication mechanism:
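In sketch form (the source field used to recognize Papers With Code records is an assumption):

```typescript
function deduplicateArticles(articles: Article[]): Article[] {
  const byKey = new Map<string, Article>();
  for (const article of articles) {
    // Prefer DOI as the identity key; fall back to a normalized title.
    const key = article.doi || article.title.toLowerCase();
    const existing = byKey.get(key);
    // Keep the first record seen, but let a Papers With Code entry enrich or
    // replace it, since it may carry extras such as repository URLs.
    if (!existing || article.source === 'paperswithcode') {
      byKey.set(key, existing ? { ...existing, ...article } : article);
    }
  }
  return Array.from(byKey.values());
}
```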
This approach uses Map data structures for O(1) lookup time, ensuring efficient deduplication even with large result sets. It also prioritizes data from Papers With Code, which often contains valuable additional information such as repository URLs.
FindResearch.online offers users the ability to sort and filter results according to their preferences. The sortArticles function provides three sorting options:
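A sketch of the sorter (the option names are assumptions based on the factors discussed above):

```typescript
type SortOption = 'relevance' | 'citations' | 'date';

function sortArticles(articles: Article[], sortBy: SortOption): Article[] {
  return [...articles].sort((a, b) => {
    switch (sortBy) {
      case 'citations':
        return b.citationCount - a.citationCount;
      case 'date':
        return new Date(b.publishedDate).getTime() - new Date(a.publishedDate).getTime();
      default:
        return (b.relevanceScore ?? 0) - (a.relevanceScore ?? 0);
    }
  });
}
```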
The filterArticles function allows users to filter results based on date range, journals, and minimum citation count:
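And a sketch of the filter, which leans on the tolerant date parser discussed in the next section (the Filters shape is an assumption):

```typescript
interface Filters {
  startDate?: Date;
  endDate?: Date;
  journals?: string[];
  minCitations?: number;
}

function filterArticles(articles: Article[], filters: Filters): Article[] {
  return articles.filter((article) => {
    const published = parseFlexibleDate(article.publishedDate);
    if (filters.startDate && (!published || published < filters.startDate)) return false;
    if (filters.endDate && (!published || published > filters.endDate)) return false;
    if (filters.journals?.length && !filters.journals.includes(article.journal)) return false;
    if (filters.minCitations && article.citationCount < filters.minCitations) return false;
    return true;
  });
}
```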
This flexibility enables researchers to quickly narrow down their search results to the most relevant papers for their needs.
Handling dates from multiple sources can be challenging due to inconsistent formats. We address this issue with a robust date parsing function:
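Something along these lines (the exact formats handled by the project may differ):

```typescript
// Parse the mixed date formats returned by the sources, e.g. "2023-05-17", "2023-5", "2023".
function parseFlexibleDate(value: string | undefined): Date | null {
  if (!value) return null;

  // The native parser handles full ISO strings and most human-readable dates.
  const direct = new Date(value);
  if (!Number.isNaN(direct.getTime())) return direct;

  // Fall back to year / year-month / year-month-day fragments.
  const match = value.match(/^(\d{4})(?:-(\d{1,2}))?(?:-(\d{1,2}))?/);
  if (!match) return null;
  const [, year, month = '1', day = '1'] = match;
  return new Date(Number(year), Number(month) - 1, Number(day));
}
```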
This function attempts to parse dates using multiple formats, ensuring that filtering by date range works consistently across all data sources.
As we continue to develop the project, we're exploring ways to further enhance these capabilities, such as implementing more advanced natural language processing techniques for improved relevance scoring and introducing personalized recommendations based on user behavior and preferences.
The ArticleCard component is a crucial part of FindResearch.online's user interface, responsible for displaying individual research paper information. Let's break down its implementation and examine its key features.
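A stripped-down sketch conveys the idea: render the core metadata and fetch AI insights only when the user asks for them. The prop names, class names, and the /api/insights call below are illustrative, not the component's actual code:

```tsx
'use client';

import { useState } from 'react';

interface ArticleCardProps {
  article: {
    title: string;
    authors: string[];
    abstract: string;
    journal: string;
    citationCount: number;
  };
}

export function ArticleCard({ article }: ArticleCardProps) {
  const [insights, setInsights] = useState<Record<string, string> | null>(null);
  const [loading, setLoading] = useState(false);

  // Insights are only requested when the user asks for them (on-demand loading).
  async function loadInsights() {
    setLoading(true);
    try {
      const res = await fetch('/api/insights', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ abstract: article.abstract }),
      });
      const data = await res.json();
      setInsights(data.insights ?? null);
    } finally {
      setLoading(false);
    }
  }

  return (
    <div className="rounded-lg border p-4">
      <h3 className="font-semibold">{article.title}</h3>
      <p className="text-sm text-muted-foreground">
        {article.authors.join(', ')} · {article.journal} · {article.citationCount} citations
      </p>
      <p className="mt-2 text-sm">{article.abstract}</p>
      <button onClick={loadInsights} disabled={loading} className="mt-2 underline">
        {loading ? 'Generating insights…' : 'Show AI insights'}
      </button>
      {insights && (
        <ul className="mt-2 text-sm">
          {Object.entries(insights).map(([key, value]) => (
            <li key={key}>
              <strong>{key}:</strong> {value}
            </li>
          ))}
        </ul>
      )}
    </div>
  );
}
```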
FindResearch.online represents an attempt to improve the academic paper search process. Through our analysis of its architecture and components, we've seen both its potential strengths and areas that require further development.
Key Aspects of the Project:
Backend Architecture:
The articleService.js file implements multi-source data fetching and ranking with BM25. While this approach provides diverse results, it also introduces complexity in data consistency and may require ongoing optimization to handle large-scale queries efficiently.
State Management:
Zustand is used for state management, offering a simpler alternative to more complex solutions. However, as the application grows, careful consideration will be needed to prevent state management from becoming a bottleneck.
Component Design:
Components like ArticleCard, FilterPopover, SearchBar, and SortSelect are modular, which aids in maintenance. Yet, there's room for improvement in terms of reusability and potential over-fragmentation of the UI.
User Interface:
The filtering and sorting options provide flexibility, but user testing will be crucial to ensure they meet real-world researcher needs without overwhelming users with choices.
Performance Considerations:
On-demand loading of AI insights is a step towards performance optimization. However, real-world usage with large datasets will be needed to identify and address potential performance bottlenecks.
Accessibility:
While shadcn UI components provide a base level of accessibility, a comprehensive accessibility audit and user testing with assistive technologies are necessary to ensure broad usability.
Challenges and Areas for Improvement:
Data Quality and Consistency: Integrating multiple sources raises questions about data consistency and quality control across different platforms.
Scalability: The current implementation may face challenges with very large datasets or high concurrent user loads.
Advanced Search Features: Researchers often need complex boolean searches and field-specific queries, which are not yet fully implemented.
User Personalization: The current version lacks user accounts and personalized recommendations, which could significantly enhance its utility.
Mobile Responsiveness: While using Tailwind CSS aids in creating a responsive design, dedicated mobile optimization may be necessary for a truly mobile-friendly experience.