This repository has been archived on 2025-02-10. You can view files and clone it, but cannot push or open issues or pull requests.

kpcto ec6a6b0cd0 Initial commit: Academic paper processing system

2025-01-31 02:00:36 +00:00

AI Paper Review Agent Architecture

Overview

A modular system that fetches, processes, and summarizes academic papers using LangChain.

Core Components

Data Acquisition Layer
- arXiv API Client (REST interface)
- PDF Downloader Service
- Metadata Extraction Module
Processing Pipeline
- PDF Text Extractor (PyPDF2/Unstructured)
- Semantic Chunker
- Metadata Enricher (author institutions, citations)
Analysis Engine
- LangChain Document Loaders
- Multi-stage Summary Chain (DeepSeek-R1)
- Technical Concept Extractor
- Cross-Paper Insight Aggregator
Storage Layer
- Relational Storage (PostgreSQL)
- Vector Store (Chroma)
- Cache System (Redis)
Orchestration
- Agent Controller Class
- Retry Mechanism with Exponential Backoff
- Quality Assurance Checks
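
The retry mechanism listed under Orchestration could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name `retry_with_backoff`, its parameters, and the default delays are hypothetical and do not come from this repository. It uses only the standard library and adds random jitter, which is a common companion to exponential backoff rather than something this document specifies.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,          # hypothetical default
    base_delay: float = 1.0,        # hypothetical default, in seconds
    max_delay: float = 30.0,        # hypothetical cap, in seconds
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds or max_attempts is exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay doubles on each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter (50-100% of the delay) keeps parallel workers from retrying in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
    raise AssertionError("unreachable")
```

A caller would wrap a network operation in a zero-argument callable, for example `retry_with_backoff(lambda: fetch(paper_id), retry_on=(ConnectionError, TimeoutError))`, where `fetch` stands in for whatever download function the project uses. The `sleep` parameter exists so that tests can replace the real delay.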

Architectural Diagram

```mermaid
graph LR
    A[User Query] --> B(arXiv API)
    B --> C[PDF Storage]
    C --> D{Processing Queue}
    D --> E[Text Extraction]
    E --> F[Chunking]
    F --> G[Embedding]
    G --> H[Vector Store]
    H --> I[LLM Analysis]
    I --> J[Report Generation]
```

Key Decisions

- Modular Design: Components communicate via clean interfaces for easy replacement
- Batch Processing: Asynchronous pipeline for parallel paper processing
- Caching Layer: Reduces API calls and improves performance
- Fallback Strategies: Multiple PDF parsers with automatic fallback
- Security: Environment variables for credentials, encrypted storage
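
The batch-processing and fallback decisions above can be sketched together as follows. This is one possible shape, not the repository's implementation: `Parser`, `extract_with_fallback`, `process_batch`, and the `max_concurrency` default are hypothetical names and values. Parsers are modeled as plain callables so that wrappers around PyPDF2 or Unstructured could be plugged in without changing the control flow.

```python
import asyncio
from typing import Callable, Dict, List, Optional, Sequence

# Assumption: a parser takes a PDF path and returns extracted text, raising on failure.
Parser = Callable[[str], str]


def extract_with_fallback(path: str, parsers: Sequence[Parser]) -> str:
    """Try each parser in order and return the first non-empty result."""
    errors: List[str] = []
    for parser in parsers:
        name = getattr(parser, "__name__", repr(parser))
        try:
            text = parser(path)
        except Exception as exc:  # one failing parser must not abort the chain
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return text
        errors.append(f"{name}: empty output")
    raise RuntimeError(f"all parsers failed for {path}: {'; '.join(errors)}")


async def process_batch(
    paths: Sequence[str],
    parsers: Sequence[Parser],
    max_concurrency: int = 4,  # hypothetical default
) -> Dict[str, Optional[str]]:
    """Extract text from many PDFs concurrently; failed papers map to None."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: str) -> Optional[str]:
        async with semaphore:  # bound the number of papers in flight
            try:
                # Parsers are blocking, so run them in a worker thread.
                return await asyncio.to_thread(extract_with_fallback, path, parsers)
            except RuntimeError:
                return None

    results = await asyncio.gather(*(worker(p) for p in paths))
    return dict(zip(paths, results))
```

Calling `asyncio.run(process_batch(paths, [primary_parser, backup_parser]))` returns a mapping from each path to its extracted text. Mapping failures to `None` instead of raising lets one unreadable PDF fail without discarding the rest of the batch. `asyncio.to_thread` requires Python 3.9 or later.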

Dependencies

langchain==0.2.1
arxiv==2.1.0
unstructured==0.12.2
openai==1.30.1
faiss-cpu==1.8.0
sqlalchemy==2.0.30
