lastin-ai/architecture.md

# AI Paper Review Agent Architecture

## Overview
A modular system that fetches, processes, and summarizes academic papers from arXiv, built on LangChain.

### Core Components

1. **Data Acquisition Layer**
   - arXiv API Client (REST interface)
   - PDF Downloader Service
   - Metadata Extraction Module

2. **Processing Pipeline**
   - PDF Text Extractor (PyPDF2/Unstructured)
   - Semantic Chunker
   - Metadata Enricher (author institutions, citations)

3. **Analysis Engine**
   - LangChain Document Loaders
   - Multi-stage Summary Chain (DeepSeek-R1)
   - Technical Concept Extractor
   - Cross-Paper Insight Aggregator

4. **Storage Layer**
   - Relational Storage (PostgreSQL)
   - Vector Store (Chroma)
   - Cache System (Redis)

5. **Orchestration**
   - Agent Controller Class
   - Retry Mechanism with Exponential Backoff
   - Quality Assurance Checks
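
The retry mechanism listed under Orchestration could look roughly like the sketch below. It is an illustration under assumptions, not the project's actual implementation: the function name `retry_with_backoff`, its parameters, the default exception types, and the use of full jitter are all hypothetical choices.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on the listed exceptions with exponentially growing delays.

    All names and defaults here are hypothetical; tune them to the service being called.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Upper bound doubles each attempt (base, 2*base, 4*base, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random fraction of the bound so parallel workers
            # do not all retry at the same moment.
            sleep(random.uniform(0, delay))
    raise RuntimeError("unreachable")  # keeps static type checkers satisfied
```

The `sleep` parameter is injectable so the behaviour can be tested without real waiting. A call site would wrap a single network operation, for example `retry_with_backoff(lambda: download(url))`, where `download` stands for whatever function the PDF Downloader Service exposes.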

## Architectural Diagram
```mermaid
graph LR
    A[User Query] --> B(arXiv API)
    B --> C[PDF Storage]
    C --> D{Processing Queue}
    D --> E[Text Extraction]
    E --> F[Chunking]
    F --> G[Embedding]
    G --> H[Vector Store]
    H --> I[LLM Analysis]
    I --> J[Report Generation]
```
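
To make the chunking, embedding, and vector store stages of the diagram concrete, here is a minimal, dependency-free sketch. Everything in it is a stand-in: `chunk_text` uses fixed-size overlapping character windows rather than a true semantic chunker, `embed` is a toy hashed bag-of-words vector rather than a real embedding model, and `InMemoryVectorStore` only mimics the role of the vector store. All names are hypothetical.

```python
import math
import zlib
from dataclasses import dataclass, field


@dataclass
class Chunk:
    paper_id: str
    text: str
    embedding: list[float] = field(default_factory=list)


def chunk_text(paper_id: str, text: str, size: int = 800, overlap: int = 100) -> list[Chunk]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the previous window already reaches the end of the text.
    stop = max(len(text) - overlap, 1)
    return [Chunk(paper_id, text[i:i + size]) for i in range(0, stop, step)]


def embed(text: str, dims: int = 64) -> list[float]:
    """Toy embedding: hashed bag-of-words, L2-normalised. Placeholder for a real model."""
    vec = [0.0] * dims
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class InMemoryVectorStore:
    """Brute-force cosine search over stored chunks (vectors are already normalised)."""

    def __init__(self) -> None:
        self._chunks: list[Chunk] = []

    def add(self, chunks: list[Chunk]) -> None:
        self._chunks.extend(chunks)

    def search(self, query: str, k: int = 3) -> list[Chunk]:
        q = embed(query)
        ranked = sorted(
            self._chunks,
            key=lambda c: sum(a * b for a, b in zip(q, c.embedding)),
            reverse=True,
        )
        return ranked[:k]


def index_paper(paper_id: str, text: str, store: InMemoryVectorStore) -> int:
    """Run chunking -> embedding -> storage for one paper; return the chunk count."""
    chunks = chunk_text(paper_id, text)
    for chunk in chunks:
        chunk.embedding = embed(chunk.text)
    store.add(chunks)
    return len(chunks)
```

In this sketch the LLM analysis stage would call `store.search(question)` to pull the most relevant chunks into its prompt. Swapping in a real embedding model or vector store only requires replacing `embed` and `InMemoryVectorStore`; `index_paper` stays the same.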

## Key Decisions
1. **Modular Design**: Components communicate through narrow, well-defined interfaces, so any one of them can be replaced without changing the others.
2. **Batch Processing**: An asynchronous pipeline processes multiple papers in parallel.
3. **Caching Layer**: Cached results reduce repeated API calls and lower latency.
4. **Fallback Strategies**: Several PDF parsers are available; if one fails, the pipeline automatically falls back to the next.
5. **Security**: Credentials are read from environment variables, and stored data is encrypted.
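
The modular-design and fallback decisions above can be illustrated together. The sketch below treats every parser as a plain callable from PDF bytes to text and tries them in order. The names `Parser`, `ExtractionError`, and `extract_with_fallback` are hypothetical, and the wrappers around PyPDF2 or Unstructured are deliberately left out; each would just need to match the `Parser` signature.

```python
from typing import Callable, Sequence

# Hypothetical interface: any function that turns raw PDF bytes into text.
Parser = Callable[[bytes], str]


class ExtractionError(Exception):
    """Raised when every configured parser fails or returns no text."""


def extract_with_fallback(
    pdf_bytes: bytes, parsers: Sequence[tuple[str, Parser]]
) -> tuple[str, str]:
    """Try each (name, parser) pair in order; return (parser_name, text) from the first success."""
    errors: list[str] = []
    for name, parse in parsers:
        try:
            text = parse(pdf_bytes)
        except Exception as exc:  # parser libraries raise many unrelated error types
            errors.append(f"{name}: {exc!r}")
            continue
        if text.strip():
            return name, text
        # An empty result (e.g. a scanned PDF with no text layer) counts as a failure.
        errors.append(f"{name}: empty output")
    raise ExtractionError("; ".join(errors) or "no parsers configured")
```

Returning the parser name alongside the text lets the quality-assurance step record which parser produced each document. Because the parsers are passed in as a list, adding, removing, or reordering them needs no change to the function itself.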

## Dependencies
```text
langchain==0.2.1
arxiv==2.1.0
unstructured==0.12.2
openai==1.30.1
faiss-cpu==1.8.0
sqlalchemy==2.0.30