FindResearch.online: A Deep Dive into the Implementation of Multi-Source Academic Search

A detailed discussion of the codebase of FindResearch.online


In the world of academic research, finding relevant papers across multiple sources can be a daunting task. FindResearch.online aims to simplify this process by providing a unified interface to search and explore research papers from various academic databases. In this blog post, we'll take an in-depth look at the frontend and the backend implementation of this open-source project, exploring how it leverages Next.js with the App Router to create a research discovery tool.

API Implementation

1. Insights API: AI-Powered Research Summary

File: app/api/insights/route.ts

The Insights API is a cornerstone feature of FindResearch.online, providing AI-powered summarization of paper abstracts. This feature allows users to quickly grasp the main findings and methodologies of research papers without having to read through entire abstracts. Let's dive deep into its implementation and the technology behind it.

"use server";
 
import { env, pipeline } from "@xenova/transformers";
import { NextRequest, NextResponse } from "next/server";
 
env.backends.onnx.wasm.wasmPaths =
  "https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/";
 
type QuestionAnsweringPipeline = {
  (question: string, context: string): Promise<{ answer: string }>;
};
 
let qa: QuestionAnsweringPipeline | null = null;
 
async function setupModel(): Promise<void> {
  if (!qa) {
    qa = (await pipeline(
      "question-answering",
      "Xenova/distilbert-base-cased-distilled-squad"
    )) as QuestionAnsweringPipeline;
  }
}
 
async function extractFeature(
  abstract: string,
  question: string
): Promise<string> {
  if (!qa) {
    await setupModel();
  }
  const result = await qa!(question, abstract);
  return result.answer;
}
 
async function extractFeatures(
  abstract: string
): Promise<{ [key: string]: string }> {
  const features = {
    main_outcome: "What is the main finding or outcome of this research?",
    methodology: "What methodology or approach was used in this research?",
  };
 
  const results: { [key: string]: string } = {};
 
  for (const [feature, question] of Object.entries(features)) {
    results[feature] = await extractFeature(abstract, question);
  }
 
  return results;
}
 
function validateAndCleanFeatures(features: { [key: string]: string }): {
  [key: string]: string;
} {
  const cleanFeatures: { [key: string]: string } = {};
 
  for (const [key, value] of Object.entries(features)) {
    const cleanedValue = value.replace(/<\/?s>/g, "").trim();
 
    if (cleanedValue && !/^[^\w\s]+$/.test(cleanedValue)) {
      cleanFeatures[key] = cleanedValue;
    }
  }
 
  return cleanFeatures;
}
 
export async function POST(request: NextRequest) {
  try {
    const body = await request.json();
    const { abstract } = body;
 
    if (!abstract) {
      return NextResponse.json(
        { error: "Abstract is required" },
        { status: 400 }
      );
    }
 
    await setupModel(); // Ensure model is set up before processing
    const rawFeatures = await extractFeatures(abstract);
    const cleanFeatures = validateAndCleanFeatures(rawFeatures);
 
    return NextResponse.json(cleanFeatures);
  } catch (error) {
    console.error("Error:", error);
    return NextResponse.json(
      { error: "An error occurred while processing the request" },
      { status: 500 }
    );
  }
}
 

Understanding the Model: DistilBERT for Question Answering

The Insights API uses the DistilBERT model, specifically the "Xenova/distilbert-base-cased-distilled-squad" variant. Let's break down what this means:

  1. DistilBERT: This is a smaller, faster, cheaper, and lighter version of BERT (Bidirectional Encoder Representations from Transformers), a transformer-based language representation model developed by Google and pre-trained on large text corpora.
  2. Distillation: The process used to create DistilBERT. It's a compression technique in which a compact model (the student) is trained to reproduce the behavior of a larger model (the teacher) or an ensemble of models.
  3. SQuAD: Stanford Question Answering Dataset, which this model is fine-tuned on. It's a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles.
  4. Question Answering: The specific NLP task this model is designed for. It can understand a given context (in our case, a paper abstract) and answer questions about it.

Why Use This Model?

  1. Efficiency: DistilBERT is a compressed version of BERT, designed to be smaller and faster while retaining much of BERT's performance. According to the original paper by Sanh et al. (2019), DistilBERT retains about 97% of BERT's performance on the GLUE language understanding benchmark while being about 40% smaller and 60% faster. This makes it suitable for deployment in production environments, especially for web applications where response time is crucial.

    Source: Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

  2. Accuracy: Despite being smaller, DistilBERT still provides high-quality results for many NLP tasks, including question answering.

  3. Versatility: The model can be used to extract various types of information from text by framing the extraction as a question-answering task.

Implementation Details

  1. Model Setup:

    async function setupModel(): Promise<void> {
      if (!qa) {
        qa = (await pipeline(
          "question-answering",
          "Xenova/distilbert-base-cased-distilled-squad"
        )) as QuestionAnsweringPipeline;
      }
    }

    The model is loaded asynchronously and cached for reuse. This ensures that we only load the model once, reducing initialization time for subsequent requests.

  2. Feature Extraction:

    async function extractFeature(
      abstract: string,
      question: string
    ): Promise<string> {
      if (!qa) {
        await setupModel();
      }
      const result = await qa!(question, abstract);
      return result.answer;
    }

    This function uses the model to answer specific questions about the abstract. By framing our information extraction as a question-answering task, we leverage the model's capabilities to understand and summarize text.

  3. Predefined Questions:

    const features = {
      main_outcome: "What is the main finding or outcome of this research?",
      methodology: "What methodology or approach was used in this research?",
    };

    We define specific questions to extract key information from the abstract. This approach allows us to target the most relevant information for researchers.

  4. Validation and Cleaning:

    function validateAndCleanFeatures(features: { [key: string]: string }): {
      [key: string]: string,
    } {
      const cleanFeatures: { [key: string]: string } = {};
     
      for (const [key, value] of Object.entries(features)) {
        const cleanedValue = value.replace(/<\/?s>/g, "").trim();
     
        if (cleanedValue && !/^[^\w\s]+$/.test(cleanedValue)) {
          cleanFeatures[key] = cleanedValue;
        }
      }
     
      return cleanFeatures;
    }

    This function ensures that the extracted features are clean and valid. It removes HTML tags and filters out responses that don't contain any alphanumeric characters.

  5. API Endpoint:

    export async function POST(request: NextRequest) {
      try {
        const body = await request.json();
        const { abstract } = body;
     
        if (!abstract) {
          return NextResponse.json(
            { error: "Abstract is required" },
            { status: 400 }
          );
        }
     
        await setupModel(); // Ensure model is set up before processing
        const rawFeatures = await extractFeatures(abstract);
        const cleanFeatures = validateAndCleanFeatures(rawFeatures);
     
        return NextResponse.json(cleanFeatures);
      } catch (error) {
        console.error("Error:", error);
        return NextResponse.json(
          { error: "An error occurred while processing the request" },
          { status: 500 }
        );
      }
    }

    The main POST handler orchestrates the entire process:

    • It extracts the abstract from the request body.
    • Ensures the model is set up.
    • Extracts features from the abstract.
    • Cleans and validates the extracted features.
    • Returns the features as a JSON response.
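
    To see the endpoint end to end, here is a minimal sketch of a client-side call. This is illustrative code, not from the project; it only assumes the route path and the request/response shapes shown above.

    async function fetchInsights(abstract: string): Promise<Record<string, string>> {
      // POST the abstract to the Insights route defined above
      const res = await fetch("/api/insights", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ abstract }),
      });
      if (!res.ok) {
        throw new Error(`Insights request failed with status ${res.status}`);
      }
      // e.g. { main_outcome: "...", methodology: "..." }, minus any answers
      // filtered out during cleaning
      return res.json();
    }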

Challenges and Considerations

  1. Model Size: Even though DistilBERT is smaller than BERT, it's still a substantial model. Loading it can take time and resources. We mitigate this by caching the model after the first load.
  2. Accuracy vs Speed: There's always a trade-off between model accuracy and inference speed. DistilBERT strikes a good balance, but for some specific use cases, a larger model might be more accurate, or a smaller model might be faster.
  3. Context Length: Transformer models like DistilBERT have a maximum input length (usually 512 tokens). For very long abstracts, we might need to implement truncation or chunking strategies (a simple truncation sketch follows this list).
  4. Generalization: The model's performance can vary depending on the domain of the research papers. It might perform better on some fields than others based on its training data.
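
To make the truncation idea concrete, here is a hedged sketch that caps input by word count before it reaches the model. This is not code from the project, and word count is only a crude stand-in for subword tokens; a real implementation would use the model's own tokenizer.

// Crude word-based truncation as a placeholder for token-aware chunking.
// DistilBERT's 512-token limit counts subword tokens, so the word cap
// here is deliberately conservative.
function truncateAbstract(abstract: string, maxWords = 350): string {
  const words = abstract.split(/\s+/);
  return words.length <= maxWords
    ? abstract
    : words.slice(0, maxWords).join(" ");
}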

Future Improvements

  1. Fine-tuning: We could potentially fine-tune the model on a dataset of scientific abstracts to improve its performance on this specific task.
  2. Multi-lingual Support: Implementing support for abstracts in multiple languages could make the tool more globally accessible.
  3. Extended Features: We could extract more features like research implications, dataset information, or key citations.
  4. Model Quantization: To further optimize performance, we could explore quantized versions of the model that maintain accuracy while reducing size and increasing speed.

2. Papers with Code API

File: app/api/paperswithcode/route.ts

import { NextResponse } from "next/server";
import axios from "axios";
 
export async function GET(request: Request) {
  const { searchParams } = new URL(request.url);
  const query = searchParams.get("query");
  const page = searchParams.get("page");
 
  if (!query || !page) {
    return NextResponse.json(
      { error: "Missing query or page parameter" },
      { status: 400 }
    );
  }
 
  try {
    const url = `https://paperswithcode.com/api/v1/search/?q=${encodeURIComponent(
      query
    )}&page=${page}`;
    const response = await axios.get(url);
    return NextResponse.json(response.data);
  } catch (error) {
    console.error("Error fetching Papers with Code articles:", error);
    return NextResponse.json(
      { error: "Error fetching articles" },
      { status: 500 }
    );
  }
}

This endpoint serves as a proxy for the Papers with Code API. Here's a breakdown of its functionality:

  1. It extracts the query and page parameters from the request URL using searchParams.
  2. It validates the presence of these parameters, returning a 400 error if either is missing.
  3. It constructs a URL for the Papers with Code API, encoding the query parameter to ensure special characters are properly handled.
  4. It uses Axios to make a GET request to the Papers with Code API.
  5. If successful, it returns the response data directly to the client.
  6. If an error occurs, it logs the error and returns a 500 error response.

This implementation allows the frontend to interact with the Papers with Code API without exposing the API endpoint directly, providing an additional layer of abstraction and control.
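
For completeness, a client-side caller of this proxy might look like the following sketch; the helper name is hypothetical, but the route and parameters match the handler above.

async function searchPapersWithCode(query: string, page: number) {
  // Hypothetical helper; hits the proxy route defined above
  const params = new URLSearchParams({ query, page: String(page) });
  const res = await fetch(`/api/paperswithcode?${params}`);
  if (!res.ok) throw new Error(`Search failed with status ${res.status}`);
  return res.json(); // Papers with Code response, passed through unchanged
}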

3. arXiv API

File: app/api/arxiv/route.ts

import { NextResponse } from "next/server";
import axios from "axios";
import { parseString } from "xml2js";
import { promisify } from "util";
import { ITEMS_PER_API } from "@/app/lib/constants";
import { cleanAbstract } from "@/app/lib/utils";
import { Article, ArxivResponse, ArxivEntry } from "@/app/lib/types";
 
const parseXML = promisify(parseString);
 
export async function GET(request: Request) {
  const { searchParams } = new URL(request.url);
  const query = searchParams.get("query");
  const page = searchParams.get("page");
 
  if (!query || !page) {
    return NextResponse.json(
      { error: "Missing query or page parameter" },
      { status: 400 }
    );
  }
 
  const encodedQuery = encodeURIComponent(query);
  const url = `https://export.arxiv.org/api/query?search_query=${encodedQuery}&start=${
    (parseInt(page) - 1) * ITEMS_PER_API
  }&max_results=${ITEMS_PER_API}`;
 
  try {
    const response = await axios.get(url);
    const result = (await parseXML(response.data)) as ArxivResponse;
 
    const entries = result.feed.entry || [];
    const articles: Article[] = entries.map(
      (entry: ArxivEntry): Article => ({
        title: entry.title[0] || "No title available",
        authors: entry.author
          ? entry.author.map((author) => author.name[0]).join(", ")
          : "No authors available",
        date: entry.published[0] || "No date available",
        journal: entry["arxiv:journal_ref"]
          ? entry["arxiv:journal_ref"][0]
          : "arXiv",
        tags: entry.category ? entry.category.map((cat) => cat.$.term) : [],
        abstract: entry.summary
          ? cleanAbstract(entry.summary[0])
          : "No abstract available",
        doi: entry.id
          ? entry.id[0].replace("http://arxiv.org/abs/", "arxiv:")
          : "No DOI available",
        citationCount: 0,
        referenceCount: 0,
        arxivId: entry.id
          ? entry.id[0].replace("http://arxiv.org/abs/", "")
          : "No arXiv ID available",
      })
    );
 
    return NextResponse.json(articles);
  } catch (error) {
    console.error("Error fetching arXiv articles:", error);
    return NextResponse.json(
      { error: "Error fetching articles" },
      { status: 500 }
    );
  }
}

The arXiv API endpoint is more complex due to the XML format of the arXiv API response. Here's a detailed breakdown:

  1. It uses the xml2js library to parse the XML response from the arXiv API.
  2. The parseXML function is promisified to allow for easier async/await usage.
  3. Like the Papers with Code endpoint, it extracts and validates the query and page parameters.
  4. It constructs the arXiv API URL, including pagination parameters (start and max_results).
  5. After fetching the data, it parses the XML response into a JavaScript object.
  6. It then maps the arXiv-specific data structure to a standardized Article format.
  7. This mapping includes:
    • Extracting title, authors, publication date, and journal reference.
    • Mapping arXiv categories to tags.
    • Cleaning the abstract using a utility function.
    • Converting the arXiv ID to a DOI-like format for consistency with other sources.
  8. The transformed articles are then returned as a JSON response.

This implementation allows for a consistent article format across different data sources, simplifying frontend development.
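
One detail worth noting: the cleanAbstract utility imported from @/app/lib/utils is not reproduced in this post. As a hedged sketch of what such a helper typically does for arXiv and Crossref abstracts, it might look roughly like this (the project's actual implementation may differ):

// Hypothetical sketch of a cleanAbstract helper, not the project's code.
// Crossref abstracts often arrive wrapped in JATS XML (e.g. <jats:p>),
// and arXiv summaries contain hard line breaks.
export function cleanAbstract(raw: string): string {
  return raw
    .replace(/<[^>]+>/g, " ") // strip any XML/HTML tags
    .replace(/\s+/g, " ") // collapse newlines and runs of whitespace
    .trim();
}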

4. CORE API

File: app/api/core/route.ts

import { ITEMS_PER_API } from "@/app/lib/constants";
import { Article, CoreApiResponse, CoreApiResult } from "@/app/lib/types";
import axios from "axios";
import { NextResponse } from "next/server";
 
export async function GET(request: Request) {
  const { searchParams } = new URL(request.url);
  const query = searchParams.get("query");
  const page = searchParams.get("page");
 
  if (!query || !page) {
    return NextResponse.json(
      { error: "Missing query or page parameter" },
      { status: 400 }
    );
  }
 
  const encodedQuery = encodeURIComponent(query);
  const url = `https://api.core.ac.uk/v3/search/works/?q=${encodedQuery}&limit=${ITEMS_PER_API}&offset=${
    (parseInt(page) - 1) * ITEMS_PER_API
  }&api_key=${process.env.CORE_API_KEY}`;
 
  try {
    const response = await axios.get<CoreApiResponse>(url);
 
    if (!response.data || !Array.isArray(response.data.results)) {
      console.error("Unexpected CORE API response structure:", response.data);
      return NextResponse.json(
        { error: "Unexpected API response" },
        { status: 500 }
      );
    }
 
    const articles: Article[] = response.data.results.map(
      (item: CoreApiResult) => ({
        title: item.title || "No title available",
        authors:
          item.authors.map((author) => author.name).join(", ") ||
          "No authors available",
        date: item.datePublished || "No date available",
        journal: item.publisher || "No journal available",
        tags: item.subjects || [],
        abstract: item.abstract || "No abstract available",
        doi: item.doi || "No DOI available",
        citationCount: item.citationCount || 0,
        referenceCount: 0, // CORE API doesn't provide this information
        downloadUrl: item.downloadUrl || item.fullTextIdentifier || "",
      })
    );
 
    return NextResponse.json(articles);
  } catch (error) {
    console.error("Error fetching CORE articles:", error);
    return NextResponse.json(
      { error: "Failed to fetch articles" },
      { status: 500 }
    );
  }
}

The CORE API endpoint follows a similar pattern to the others but with some unique aspects:

  1. It requires an API key, which is stored as an environment variable (process.env.CORE_API_KEY).
  2. The URL construction includes pagination parameters (limit and offset).
  3. The response structure is validated before processing to ensure it matches the expected format.
  4. Like the arXiv endpoint, it maps the CORE-specific data structure to the standardized Article format.
  5. It handles some CORE-specific fields like downloadUrl and fullTextIdentifier.
  6. The referenceCount is set to 0 as the CORE API doesn't provide this information.

This implementation ensures that data from the CORE API is presented in the same format as other sources, maintaining consistency across the application.
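
Following standard Next.js conventions, the CORE_API_KEY referenced above would be defined in a local environment file that is never committed, for example:

# .env.local (kept out of version control)
CORE_API_KEY=your_core_api_key_here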

5. Crossref API

Unlike the other APIs we've discussed, the Crossref API is implemented directly in the frontend service layer. This approach allows for more flexibility in handling the Crossref-specific data structure and pagination. Let's examine the implementation in detail.

File: app/services/articleService.ts

export const fetchCrossrefArticles = async (
  query: string,
  page: number
): Promise<Article[]> => {
  const encodedQuery = encodeURIComponent(query);
  const exactMatchUrl = `https://api.crossref.org/works?query.bibliographic=${encodedQuery}&rows=1&sort=relevance&order=desc&select=DOI,title,author,container-title,published,abstract,subject,type,is-referenced-by-count,reference&filter=type:journal-article`;
  const fuzzySearchUrl = `https://api.crossref.org/works?query=${encodedQuery}&rows=${ITEMS_PER_API}&offset=${
    (page - 1) * ITEMS_PER_API
  }&sort=relevance&order=desc&select=DOI,title,author,container-title,published,abstract,subject,type,is-referenced-by-count,reference&filter=type:journal-article`;
 
  try {
    const [exactMatchResponse, fuzzySearchResponse] = await Promise.all([
      axios.get<CrossrefResponse>(exactMatchUrl),
      axios.get<CrossrefResponse>(fuzzySearchUrl),
    ]);
 
    const processItems = (items: CrossrefItem[]): Article[] => {
      return items
        .filter(
          (item: CrossrefItem) =>
            item.abstract &&
            item.title &&
            item.title.length > 0 &&
            !item.DOI?.includes("/fig-") &&
            !item.DOI?.includes("/table-")
        )
        .map((item: CrossrefItem) => ({
          title: item.title ? item.title[0] : "No title available",
          authors: item.author
            ? item.author
                .map(
                  (author) =>
                    author.name ||
                    `${author.given || ""} ${author.family || ""}`.trim()
                )
                .filter((name) => name.length > 0)
                .join(", ")
            : "No authors available",
          date:
            item.published &&
            item.published["date-parts"] &&
            item.published["date-parts"][0]
              ? item.published["date-parts"][0].slice(0, 3).join("-")
              : "No date available",
          journal: item["container-title"]
            ? item["container-title"][0]
            : "No journal available",
          tags: item.subject || [],
          abstract: item.abstract
            ? cleanAbstract(item.abstract)
            : "No abstract available",
          doi: item.DOI || "No DOI available",
          citationCount:
            typeof item["is-referenced-by-count"] === "number"
              ? item["is-referenced-by-count"]
              : 0,
          referenceCount: item.reference ? item.reference.length : 0,
        }));
    };
 
    const exactMatches = processItems(exactMatchResponse.data.message.items);
    const fuzzyMatches = processItems(fuzzySearchResponse.data.message.items);
 
    return [...exactMatches, ...fuzzyMatches].filter(
      (article, index, self) =>
        index === self.findIndex((t) => t.doi === article.doi)
    );
  } catch (error) {
    console.error("Error fetching Crossref articles:", error);
    return [];
  }
};

Let's break down this implementation:

  1. URL Construction:
    • Two URLs are constructed: one for exact matching and another for fuzzy searching.
    • The exact match URL uses query.bibliographic to search for an exact match in bibliographic fields.
    • The fuzzy search URL uses a general query parameter for broader matching.
    • Both URLs include parameters for sorting, filtering (only journal articles), and selecting specific fields to reduce response size.
  2. Parallel Requests:
    • The function uses Promise.all to send both requests (exact match and fuzzy search) simultaneously, improving efficiency.
  3. Data Processing:
    • The processItems function is defined to transform Crossref items into the standardized Article format used by the application.
    • It filters out items without abstracts or titles, and excludes figures and tables (which sometimes appear as separate entries in Crossref).
    • The function handles various edge cases, such as missing author names or publication dates.
  4. Result Combination:
    • Results from both the exact match and fuzzy search are combined.
    • Duplicate entries are removed based on DOI to ensure unique results.
  5. Error Handling:
    • Any errors during the fetch process are caught, logged, and an empty array is returned to prevent the application from crashing.

This implementation offers several advantages:

  • Comprehensive Search: By combining exact match and fuzzy search, it increases the chances of finding relevant articles.
  • Efficiency: Parallel requests and selective field retrieval optimize the API calls.
  • Flexibility: Direct implementation in the service layer allows for complex data transformation and error handling.
  • Standardization: The Crossref-specific data structure is transformed into a consistent format used across the application.

However, there are also some considerations:

  • Rate Limiting: Crossref has rate limits, which this implementation doesn't explicitly handle. For a production application, implementing proper rate limiting and error handling for 429 (Too Many Requests) responses would be crucial (a sketch follows this list).
  • Pagination: The current implementation fetches a fixed number of results. For large result sets, implementing proper pagination would be important.
  • Error Handling: While basic error handling is implemented, more sophisticated error handling (e.g., retrying failed requests, handling specific HTTP status codes) could improve reliability.
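
To illustrate the retry idea raised above, here is a hedged sketch of a backoff wrapper around axios. The retry count and delays are arbitrary choices for the example, not values from the project.

import axios, { AxiosError } from "axios";

// Hypothetical retry-with-backoff helper, not part of the project's code.
// Retries only on HTTP 429, doubling the wait before each new attempt.
async function getWithBackoff<T>(
  url: string,
  retries = 3,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      const response = await axios.get<T>(url);
      return response.data;
    } catch (err) {
      const status = (err as AxiosError).response?.status;
      if (status !== 429 || attempt >= retries) throw err;
      await new Promise((resolve) =>
        setTimeout(resolve, baseDelayMs * 2 ** attempt)
      );
    }
  }
}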

Article Service: The Core of FindResearch.online

The articleService.js file is the backbone of FindResearch.online, powering the application's search, ranking, and filtering capabilities. This crucial component employs several algorithms and data structures to provide users with relevant and diverse research results. Let's dive deep into the key technical aspects of this service.

1. Multi-Source Integration

FindResearch.online sets itself apart by aggregating results from multiple scholarly databases, providing a comprehensive view of available research. The fetchAndEnhanceArticles function demonstrates this capability:

export async function fetchAndEnhanceArticles(
  searchInput: string,
  currentPage: number
): Promise<EnhancedArticle[]> {
  const results = await Promise.all([
    fetchCrossrefArticles(searchInput, currentPage),
    fetchCoreArticles(searchInput, currentPage),
    fetchArxivArticles(searchInput, currentPage),
    fetchPapersWithCodeArticles(searchInput, currentPage),
  ]);
 
  const allArticles = results.flat();
 
  // ... (deduplication and enhancement logic)
 
  return enhancedArticles;
}

By using Promise.all, the service makes concurrent API requests to Crossref, CORE, arXiv, and Papers With Code. This approach significantly reduces the overall response time by parallelizing the API calls, enhancing the user experience by providing faster search results.
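
One caveat: Promise.all rejects as soon as any single promise rejects. The Crossref fetcher shown earlier already catches its own errors and returns an empty array, but if any fetcher were allowed to throw, a defensive variant using Promise.allSettled would preserve partial results, as in this sketch:

// Hedged alternative: tolerate individual source failures so one flaky
// API doesn't empty the entire result set.
const settled = await Promise.allSettled([
  fetchCrossrefArticles(searchInput, currentPage),
  fetchCoreArticles(searchInput, currentPage),
  fetchArxivArticles(searchInput, currentPage),
  fetchPapersWithCodeArticles(searchInput, currentPage),
]);
const allArticles = settled.flatMap((r) =>
  r.status === "fulfilled" ? r.value : []
);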

2. BM25 Ranking Algorithm

The heart of our search functionality is the BM25 (Best Matching 25) ranking algorithm. BM25 is a probabilistic retrieval framework that has been widely used in information retrieval systems due to its effectiveness and efficiency (Robertson & Zaragoza, 2009).

Let's break down the implementation of BM25 in FindResearch.online:

const k1 = 1.2;
const b = 0.75;
 
interface CorpusStats {
  docCount: number;
  avgDocLength: number;
  termFrequency: { [term: string]: number };
}
 
function calculateBM25Score(
  article: Article,
  searchTerms: string[],
  corpusStats: CorpusStats
): number {
  const doc = article.title + " " + article.abstract;
  const docLength = doc.split(/\s+/).length;
 
  return searchTerms.reduce((score, term) => {
    const tf = (doc.match(new RegExp(term, "gi")) || []).length;
    const idf = Math.log(
      (corpusStats.docCount - corpusStats.termFrequency[term] + 0.5) /
        (corpusStats.termFrequency[term] + 0.5) +
        1
    );
    const numerator = tf * (k1 + 1);
    const denominator =
      tf + k1 * (1 - b + b * (docLength / corpusStats.avgDocLength));
    return score + idf * (numerator / denominator);
  }, 0);
}

This implementation calculates a relevance score for each article based on the frequency of search terms in the article's title and abstract, as well as the inverse document frequency (IDF) of these terms across the entire corpus. Let's break down the key components:

  1. k1 and b are tuning parameters. k1 controls term frequency saturation, while b controls document length normalization.
  2. The CorpusStats interface defines the structure for storing corpus-wide statistics needed for BM25 calculation.
  3. The function iterates over each search term, calculating its contribution to the overall score:
    • tf (term frequency) is calculated by counting the occurrences of the term in the document.
    • idf (inverse document frequency) is calculated using the classic IDF formula, which gives higher weight to rarer terms.
    • The final score for each term is calculated using the BM25 formula, which balances term frequency, inverse document frequency, and document length normalization.
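
For reference, calculateBM25Score is a direct implementation of the standard BM25 scoring function. In conventional notation:

score(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}

\mathrm{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)

where f(q_i, D) is the frequency of term q_i in document D (here, the concatenated title and abstract), |D| is the document length in words, avgdl is the average document length, N is docCount, and n(q_i) is termFrequency[term].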

To use this function effectively, we need to calculate corpus statistics:

function getCorpusStats(
  articles: Article[],
  searchTerms: string[]
): CorpusStats {
  let totalLength = 0;
  const termFrequency: { [term: string]: number } = {};
 
  searchTerms.forEach((term) => (termFrequency[term] = 0));
 
  articles.forEach((article) => {
    const doc = article.title + " " + article.abstract;
    totalLength += doc.split(/\s+/).length;
 
    searchTerms.forEach((term) => {
      if (doc.toLowerCase().includes(term.toLowerCase())) {
        termFrequency[term]++;
      }
    });
  });
 
  return {
    docCount: articles.length,
    avgDocLength: totalLength / articles.length,
    termFrequency,
  };
}

This function calculates the necessary statistics across the entire corpus, including document count, average document length, and term frequencies.

The BM25 scores are then normalized and combined with other factors to produce a final ranking score:

function calculateRankingScore(article: EnhancedArticle): number {
  const citationScore = Math.log(article.citationCount + 1) * 10;
  const dateScore =
    (new Date().getFullYear() - new Date(article.date).getFullYear() + 1) * 5;
  return (article.relevanceScore ?? 0) + citationScore + dateScore;
}

This approach lets us blend textual relevance with other signals such as citation count and publication date.

3. Efficient Deduplication

To provide unique and diverse results, especially when aggregating from multiple sources, the service implements an efficient deduplication mechanism:

const uniqueDoiMap = new Map<string, EnhancedArticle>();
const papersWithCodeMap = new Map<string, EnhancedArticle>();
 
allArticles.forEach((article) => {
  let doi = article.doi;
 
  if (doi && doi.startsWith("arxiv:")) {
    doi = doi.replace(/v\d+$/, ""); // Remove version number (e.g., v1, v2, etc.)
  }
 
  if (doi) {
    const isPapersWithCode =
      "repositoryUrl" in article && article.repositoryUrl !== undefined;
 
    if (isPapersWithCode) {
      papersWithCodeMap.set(doi, article as EnhancedArticle);
    } else if (!uniqueDoiMap.has(doi)) {
      uniqueDoiMap.set(doi, article as EnhancedArticle);
    }
  }
});
 
papersWithCodeMap.forEach((article, doi) => {
  uniqueDoiMap.set(doi, article);
});
 
const uniqueArticles = Array.from(uniqueDoiMap.values());

This approach uses Map data structures for O(1) lookup time, ensuring efficient deduplication even with large result sets. It also prioritizes data from Papers With Code, which often contains valuable additional information such as repository URLs.

4. Flexible Sorting and Filtering

FindResearch.online offers users the ability to sort and filter results according to their preferences. The sortArticles function provides three sorting options:

export const sortArticles = (
  articles: EnhancedArticle[],
  option: SortOption
) => {
  return [...articles].sort((a, b) => {
    switch (option) {
      case "relevance":
        return (b.relevanceScore ?? 0) - (a.relevanceScore ?? 0);
      case "citationCount":
        return b.citationCount - a.citationCount;
      case "date":
        return new Date(b.date).getTime() - new Date(a.date).getTime();
      default:
        return 0;
    }
  });
};

The filterArticles function allows users to filter results based on date range, journals, and minimum citation count:

export const filterArticles = (
  articles: EnhancedArticle[],
  startDate: Date | undefined,
  endDate: Date | undefined,
  selectedJournals: string[],
  minCitations: string
) => {
  return articles.filter((article) => {
    const articleDate = parseArticleDate(article.date);
    const isAfterStartDate = startDate
      ? isAfter(articleDate, startDate) ||
        articleDate.getTime() === startDate.getTime()
      : true;
    const isBeforeEndDate = endDate
      ? isBefore(articleDate, endDate) ||
        articleDate.getTime() === endDate.getTime()
      : true;
 
    const isInSelectedJournals =
      selectedJournals.length === 0 ||
      selectedJournals.includes(article.journal);
 
    const meetsMinCitations =
      minCitations === "" || article.citationCount >= parseInt(minCitations);
 
    return (
      isAfterStartDate &&
      isBeforeEndDate &&
      isInSelectedJournals &&
      meetsMinCitations
    );
  });
};

This flexibility enables researchers to quickly narrow down their search results to the most relevant papers for their needs.

5. Date Parsing and Normalization

Handling dates from multiple sources can be challenging due to inconsistent formats. The service addresses this issue with a robust date parsing function:

const parseArticleDate = (dateString: string): Date => {
  let date = parseISO(dateString);
  if (isValid(date)) return date;
 
  const formats = ["yyyy-MM-dd", "yyyy-MM", "yyyy"];
  for (const format of formats) {
    date = parse(dateString, format, new Date());
    if (isValid(date)) return date;
  }
 
  console.warn(
    `Unable to parse date: ${dateString}. Using current date instead.`
  );
  return new Date();
};

This function attempts to parse dates using multiple formats, ensuring that filtering by date range works consistently across all data sources.
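
A few illustrative calls (the inputs are made-up examples) show the intent: well-formed and partial ISO strings resolve to valid Date objects, and anything unparseable degrades gracefully to the current date with a warning.

parseArticleDate("2016-08-29"); // standard ISO date string
parseArticleDate("2019-05"); // year and month only
parseArticleDate("unknown"); // logs a warning, returns new Date()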

As we continue to develop the project, we're exploring ways to further enhance these capabilities, such as implementing more advanced natural language processing techniques for improved relevance scoring and introducing personalized recommendations based on user behavior and preferences.

Breakdown of the Frontend Components

ArticleCard

The ArticleCard component is a crucial part of FindResearch.online's user interface, responsible for displaying individual research paper information. Let's break down its implementation and examine its key features.

const ArticleCard: React.FC<ArticleCardProps> = ({ article }) => {
  const [aiInsights, setAiInsights] = useState<ExtractedFeatures | null>(null);
  const [isLoading, setIsLoading] = useState(false);
  const [error, setError] = useState<string | null>(null);
  const [mathJaxReady, setMathJaxReady] = useState(false);
 
  // ... useEffect and handler functions
 
  return (
    <Card className="transition-all duration-300 hover:shadow-lg flex flex-col">
      <CardHeader>{/* Title, authors, date */}</CardHeader>
      <CardContent className="flex-grow">{/* Abstract */}</CardContent>
      <CardFooter className="flex flex-col items-start space-y-2">
        {/* Journal, citations, relevance score, buttons, AI insights */}
      </CardFooter>
    </Card>
  );
};

Key Features

  1. State Management: The component uses React's useState hook to manage local state for AI insights, loading status, errors, and MathJax readiness.

    const [aiInsights, setAiInsights] = useState<ExtractedFeatures | null>(null);
    const [isLoading, setIsLoading] = useState(false);
    const [error, setError] = useState<string | null>(null);
    const [mathJaxReady, setMathJaxReady] = useState(false);
     
     
  2. MathJax Integration: The component includes a MathMLRenderer to handle mathematical equations in article titles.

    const MathMLRenderer: React.FC<{ mathml: string }> = ({ mathml }) => {
      const ref = useRef<HTMLSpanElement>(null);

      useEffect(() => {
        if (ref.current && window.MathJax) {
          window.MathJax.typesetPromise([ref.current])
            .then(() => {
              // MathJax typesetting is complete
            })
            .catch((err: Error) =>
              console.error("MathJax typesetting failed:", err)
            );
        }
      }, [mathml]);

      return (
        <span
          ref={ref}
          dangerouslySetInnerHTML={{ __html: mathml }}
          className="inline-block align-middle"
        />
      );
    };
     
  3. Dynamic Content Rendering: The component handles various data scenarios, providing fallback text when certain information is unavailable.

    <CardTitle className="text-lg font-bold">
      <div className="max-w-[250px] break-words">
        {typeof article.title === "string"
          ? renderTitle(article.title)
          : "No title available"}
      </div>
    </CardTitle>
     
  4. Interactive Elements: The component includes buttons for accessing the full text and code repository, and an accordion for AI-generated insights.

    <Button
      variant="outline"
      size="sm"
      className="w-full"
      onClick={() => {/* ... */}}
    >
      <FileText className="mr-2 h-4 w-4" />
      Read Full Text
    </Button>

    <Accordion type="single" collapsible className="mt-4 w-full">
      <AccordionItem value="ai-insights">
        {/* AI Insights content */}
      </AccordionItem>
    </Accordion>
     
  5. On-Demand AI Insights: AI insights are loaded only when the user expands the accordion, improving initial load performance.

    const handleAIInsightsClick = async () => {
      if (aiInsights) return; // Don't fetch if we already have insights
      setIsLoading(true);
      setError(null);
      try {
        const features = await extractFeaturesFromAbstract(article.abstract);
        setAiInsights(features);
      } catch (err) {
        setError("Failed to load AI insights");
        console.error(err);
      } finally {
        setIsLoading(false);
      }
    };
     
  6. Loading States: The component uses a Skeleton component to provide a loading state while AI insights are being fetched.

    {isLoading ? (
      <div className="space-y-2">
        <Skeleton className="h-4 w-full" />
        <Skeleton className="h-4 w-full" />
      </div>
    ) : /* ... */}
     
  7. Error Handling: The component includes error handling for AI insight fetching, displaying an error message if the fetch fails.

    {error ? (
      <p className="text-red-500 text-sm">{error}</p>
    ) : /* ... */}
     
  8. Styling with Tailwind CSS: The component extensively uses Tailwind CSS classes for styling, allowing for rapid development and easy customization.

1. FilterPopover Component

The FilterPopover component provides a comprehensive filtering interface for research papers.

import { ArxivJournal } from "@/app/lib/types";
import { useResearchStore } from "@/app/store/researchStore";
import { Button } from "@/components/ui/button";
import { Calendar } from "@/components/ui/calendar";
import { Checkbox } from "@/components/ui/checkbox";
import { Input } from "@/components/ui/input";
import {
  Popover,
  PopoverContent,
  PopoverTrigger,
} from "@/components/ui/popover";
import { format } from "date-fns";
import { Filter } from "lucide-react";
import React from "react";
 
const FilterPopover: React.FC = () => {
  const {
    isFiltersOpen,
    setIsFiltersOpen,
    startDate,
    setStartDate,
    endDate,
    setEndDate,
    selectedJournals,
    setSelectedJournals,
    minCitations,
    setMinCitations,
    availableJournals,
    applyFilters,
    clearFilters,
  } = useResearchStore();
 
  const handleStartDateSelect = (date: Date | undefined) => {
    setStartDate(date);
  };
 
  const handleEndDateSelect = (date: Date | undefined) => {
    setEndDate(date);
  };
 
  const getJournalName = (journal: string | ArxivJournal): string => {
    if (typeof journal === "string") {
      return journal;
    }
    if (journal && typeof journal === "object" && "_" in journal) {
      return journal._;
    }
    return "Unknown Journal";
  };
 
  const isJournalSelected = (journal: string | ArxivJournal): boolean => {
    const journalName = getJournalName(journal);
    return selectedJournals.includes(journalName);
  };
 
  return (
    <Popover open={isFiltersOpen} onOpenChange={setIsFiltersOpen}>
      <PopoverTrigger asChild>
        <Button variant="outline" className="flex items-center gap-2">
          <Filter className="h-4 w-4" />
          Filters
        </Button>
      </PopoverTrigger>
      <PopoverContent className="w-80">
        <div className="space-y-4">
          {/* Date Range Filter */}
          <div className="space-y-2">
            <h3 className="font-medium">Date Range</h3>
            <div className="flex gap-2">
              {/* Start Date Popover */}
              <Popover>
                <PopoverTrigger asChild>
                  <Button
                    variant="outline"
                    className="w-[120px] justify-start text-left font-normal"
                  >
                    {startDate && !isNaN(new Date(startDate).getTime())
                      ? format(new Date(startDate), "dd/MM/yyyy")
                      : "Start Date"}
                  </Button>
                </PopoverTrigger>
                <PopoverContent className="w-auto p-0" align="start">
                  <Calendar
                    mode="single"
                    selected={startDate || undefined}
                    onSelect={handleStartDateSelect}
                    initialFocus
                  />
                </PopoverContent>
              </Popover>
 
              {/* End Date Popover */}
              <Popover>{/* Similar structure to Start Date Popover */}</Popover>
            </div>
          </div>
 
          {/* Journal/Source Filter */}
          <div className="space-y-2">
            <h3 className="font-medium">Journal/Source</h3>
            <div className="space-y-2 max-h-40 overflow-y-auto">
              {availableJournals.map((journal, index) => {
                const journalName = getJournalName(journal);
                return (
                  <label key={index} className="flex items-center space-x-2">
                    <Checkbox
                      checked={isJournalSelected(journal)}
                      onCheckedChange={(checked: boolean) => {
                        setSelectedJournals(
                          checked
                            ? [...selectedJournals, journalName]
                            : selectedJournals.filter((j) => j !== journalName)
                        );
                      }}
                    />
                    <span>{journalName}</span>
                  </label>
                );
              })}
            </div>
          </div>
 
          {/* Minimum Citations Filter */}
          <div className="space-y-2">
            <h3 className="font-medium">Minimum Citations</h3>
            <Input
              type="number"
              value={minCitations}
              onChange={(e) => setMinCitations(e.target.value)}
              placeholder="Enter minimum citations"
            />
          </div>
 
          {/* Apply and Clear Buttons */}
          <div className="flex justify-between">
            <Button onClick={applyFilters}>Apply Filters</Button>
            <Button variant="outline" onClick={clearFilters}>
              Clear Filters
            </Button>
          </div>
        </div>
      </PopoverContent>
    </Popover>
  );
};

Key aspects:

  1. Component Structure: The component is structured as a Popover, with a trigger button and content area.
  2. Date Range Selection: Uses nested Popover components with Calendar for date selection. The format function from date-fns is used for date formatting.
  3. Journal Selection: Implements a scrollable list of checkboxes for journal selection. It handles both string and ArxivJournal types.
  4. Citation Filtering: Uses an Input component for entering minimum citation count.
  5. Helper Functions:
    • getJournalName: Extracts journal name from different data structures.
    • isJournalSelected: Checks if a journal is currently selected.
  6. UI Components: Utilizes shadcn components (Button, Popover, Calendar, Checkbox, Input) for a consistent UI.
  7. Styling: Uses Tailwind CSS classes for layout and styling (e.g., space-y-4, flex gap-2).
  8. Accessibility: Employs semantic HTML (e.g., <h3> for section headers) and ARIA attributes via shadcn components.

2. SearchBar Component

The SearchBar component provides the main search input for the application.

import { Input } from "@/components/ui/input";
import React from "react";
import { useResearchStore } from "../app/store/researchStore";
 
const SearchBar: React.FC = () => {
  const { searchInput, setSearchInput, handleSearch } = useResearchStore();
 
  const handleInputChange = (e: React.ChangeEvent<HTMLInputElement>) => {
    setSearchInput(e.target.value);
  };
 
  const handleKeyDown = (e: React.KeyboardEvent<HTMLInputElement>) => {
    if (e.key === "Enter" && searchInput.trim()) {
      e.preventDefault();
      handleSearch();
    }
  };
 
  return (
    <Input
      type="text"
      placeholder="Enter search terms (searches titles and abstracts)..."
      value={searchInput}
      onChange={handleInputChange}
      onKeyDown={handleKeyDown}
    />
  );
};

Key aspects:

  1. Component Structure: A simple functional component wrapping an Input component.
  2. Event Handlers:
    • handleInputChange: Updates the search input in the store.
    • handleKeyDown: Triggers a search when the Enter key is pressed.
  3. UI Component: Uses the shadcn Input component for consistency with the application's design.
  4. Placeholder Text: Provides clear instructions on what the input searches (titles and abstracts).
  5. Controlled Input: The input value is controlled by the searchInput state from the store.

3. SortSelect Component

The SortSelect component allows users to choose how search results are sorted.

import { SortOption } from "@/app/lib/types";
import {
  Select,
  SelectContent,
  SelectItem,
  SelectTrigger,
  SelectValue,
} from "@/components/ui/select";
import React from "react";
import { useResearchStore } from "../app/store/researchStore";
 
const SortSelect: React.FC = () => {
  const { sortOption, handleSortChange } = useResearchStore();
 
  return (
    <Select
      onValueChange={(value) => handleSortChange(value as SortOption)}
      defaultValue={sortOption}
    >
      <SelectTrigger className="w-[180px]">
        <SelectValue placeholder="Sort by" />
      </SelectTrigger>
      <SelectContent>
        <SelectItem value="relevance">Relevance</SelectItem>
        <SelectItem value="citationCount">Citation Count</SelectItem>
        <SelectItem value="date">Publication Date</SelectItem>
      </SelectContent>
    </Select>
  );
};

Key aspects:

  1. Component Structure: A functional component wrapping a Select component from shadcn.
  2. Sort Options: Provides three sorting options - Relevance, Citation Count, and Publication Date.
  3. UI Components: Uses shadcn's Select, SelectContent, SelectItem, SelectTrigger, and SelectValue for a cohesive UI.
  4. Event Handling: The onValueChange prop triggers the handleSortChange function when a new option is selected.
  5. Default Value: Sets the default value to the current sortOption from the store.
  6. Type Safety: Uses the SortOption type to ensure type safety when changing the sort option.

Conclusion: Current State and Future Challenges

FindResearch.online represents an attempt to improve the academic paper search process. Through our analysis of its architecture and components, we've seen both its potential strengths and areas that require further development.

Key Aspects of the Project:

  1. Backend Architecture: The articleService.ts module implements multi-source data fetching and ranking with BM25. While this approach provides diverse results, it also introduces complexity in data consistency and may require ongoing optimization to handle large-scale queries efficiently.

  2. State Management: Zustand is used for state management, offering a simpler alternative to more complex solutions. However, as the application grows, careful consideration will be needed to prevent state management from becoming a bottleneck.

  3. Component Design: Components like ArticleCard, FilterPopover, SearchBar, and SortSelect are modular, which aids in maintenance. Yet, there's room for improvement in terms of reusability and potential over-fragmentation of the UI.

  4. User Interface: The filtering and sorting options provide flexibility, but user testing will be crucial to ensure they meet real-world researcher needs without overwhelming users with choices.

  5. Performance Considerations: On-demand loading of AI insights is a step towards performance optimization. However, real-world usage with large datasets will be needed to identify and address potential performance bottlenecks.

  6. Accessibility: While shadcn UI components provide a base level of accessibility, a comprehensive accessibility audit and user testing with assistive technologies are necessary to ensure broad usability.

Challenges and Areas for Improvement:

  1. Data Quality and Consistency: Integrating multiple sources raises questions about data consistency and quality control across different platforms.
  2. Scalability: The current implementation may face challenges with very large datasets or high concurrent user loads.
  3. Advanced Search Features: Researchers often need complex boolean searches and field-specific queries, which are not yet fully implemented.
  4. User Personalization: The current version lacks user accounts and personalized recommendations, which could significantly enhance its utility.
  5. Mobile Responsiveness: While using Tailwind CSS aids in creating a responsive design, dedicated mobile optimization may be necessary for a truly mobile-friendly experience.