This repository has been archived on 2025-02-10. You can view files and clone it, but cannot push or open issues or pull requests.

# AI Paper Review Agent Architecture
## Overview
Modular system for fetching, processing, and summarizing academic papers using LangChain.
### Core Components
1. **Data Acquisition Layer**
   - arXiv API Client (REST interface)
   - PDF Downloader Service
   - Metadata Extraction Module
2. **Processing Pipeline**
   - PDF Text Extractor (PyPDF2/Unstructured)
   - Semantic Chunker
   - Metadata Enricher (author institutions, citations)
3. **Analysis Engine**
   - LangChain Document Loaders
   - Multi-stage Summary Chain (DeepSeek-R1)
   - Technical Concept Extractor
   - Cross-Paper Insight Aggregator
4. **Storage Layer**
   - Relational Storage (PostgreSQL)
   - Vector Store (Chroma)
   - Cache System (Redis)
5. **Orchestration**
   - Agent Controller Class
   - Retry Mechanism with Exponential Backoff
   - Quality Assurance Checks
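
The retry mechanism listed under Orchestration could look roughly like the sketch below. It is a minimal illustration, not the project's actual code: the function name `retry_with_backoff`, its parameters, the default delays, and the added jitter are all assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds or max_attempts calls have failed.

    Hypothetical helper: names and defaults are illustrative only.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles on each failure (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(base_delay * 2 ** attempt, max_delay)
            # Up to 10% jitter (an assumption) keeps parallel workers from
            # retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

A call such as `retry_with_backoff(lambda: fetch(paper_id))`, where `fetch` is whatever function hits the arXiv API, would wrap a single request. The `sleep` parameter is injectable so the helper can be tested without real waiting.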
## Architectural Diagram
```mermaid
graph LR
A[User Query] --> B(arXiv API)
B --> C[PDF Storage]
C --> D{Processing Queue}
D --> E[Text Extraction]
E --> F[Chunking]
F --> G[Embedding]
G --> H[Vector Store]
H --> I[LLM Analysis]
I --> J[Report Generation]
```
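
As a rough illustration of the flow in the diagram, the middle stages (text extraction, chunking, embedding) can be modeled as functions applied in order to a shared state object. Everything in this sketch is hypothetical: `PaperState`, the stage functions, and `run_pipeline` are invented names, and each stage body is a deliberately trivial placeholder for the real parser, semantic chunker, and embedding model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class PaperState:
    """Carries one paper through the pipeline stages (hypothetical type)."""
    paper_id: str
    pages: List[str]  # stand-in for the pages of a downloaded PDF
    text: str = ""
    chunks: List[str] = field(default_factory=list)
    vectors: List[List[float]] = field(default_factory=list)


Stage = Callable[[PaperState], PaperState]


def extract_text(state: PaperState) -> PaperState:
    # Placeholder: a real stage would call a PDF parser here.
    state.text = " ".join(page.strip() for page in state.pages)
    return state


def chunk(state: PaperState, words_per_chunk: int = 4) -> PaperState:
    # Placeholder: fixed-size word windows instead of semantic chunking.
    words = state.text.split()
    state.chunks = [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]
    return state


def embed(state: PaperState) -> PaperState:
    # Placeholder: a toy two-number vector; a real stage would call a model.
    state.vectors = [[float(len(c)), float(len(c.split()))] for c in state.chunks]
    return state


STAGES: Sequence[Stage] = (extract_text, chunk, embed)


def run_pipeline(state: PaperState, stages: Sequence[Stage] = STAGES) -> PaperState:
    """Apply each stage in order, mirroring the arrows in the diagram."""
    for stage in stages:
        state = stage(state)
    return state
```

Because every stage has the same signature, swapping one out (for example, a different chunker) only requires passing a different `stages` sequence to `run_pipeline`.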
## Key Decisions
1. **Modular Design**: Components communicate through narrow interfaces, so any one of them can be replaced without changes to the others.
2. **Batch Processing**: An asynchronous pipeline processes multiple papers in parallel.
3. **Caching Layer**: A cache reduces repeated API calls and speeds up repeat requests.
4. **Fallback Strategies**: Multiple PDF parsers are available; if one fails, the next is tried automatically.
5. **Security**: Credentials are read from environment variables, and storage is encrypted.
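
Decisions 2 and 4 can be sketched together: each paper is parsed with a chain of fallback parsers, and a batch of papers is processed concurrently. This is a sketch under assumptions, not the project's implementation. `extract_with_fallback`, `process_batch`, and the two stub parsers are hypothetical names; real parsers (such as the PyPDF2 and Unstructured extractors named above) would replace the stubs.

```python
import asyncio
from typing import Callable, Dict, List, Sequence, Union

Parser = Callable[[bytes], str]


def strict_parser(data: bytes) -> str:
    # Stub standing in for a primary PDF parser; rejects non-PDF input.
    if not data.startswith(b"%PDF"):
        raise ValueError("missing PDF header")
    return data.decode("latin-1")


def lenient_parser(data: bytes) -> str:
    # Stub standing in for a more forgiving secondary parser.
    return data.decode("latin-1")


def extract_with_fallback(pdf_bytes: bytes, parsers: Sequence[Parser]) -> str:
    """Try each parser in order and return the first non-empty result."""
    errors: List[str] = []
    for parser in parsers:
        try:
            text = parser(pdf_bytes)
            if text.strip():
                return text
            errors.append(f"{parser.__name__}: empty output")
        except Exception as exc:
            errors.append(f"{parser.__name__}: {exc}")
    raise RuntimeError("all parsers failed: " + "; ".join(errors))


async def process_batch(
    pdfs: Dict[str, bytes],
    parsers: Sequence[Parser],
    max_concurrency: int = 4,
) -> Dict[str, Union[str, BaseException]]:
    """Parse many papers concurrently; failures are returned, not raised."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def process_one(pdf_bytes: bytes) -> str:
        async with semaphore:
            # Parsing is blocking work, so it runs in a worker thread.
            return await asyncio.to_thread(extract_with_fallback, pdf_bytes, parsers)

    results = await asyncio.gather(
        *(process_one(data) for data in pdfs.values()),
        return_exceptions=True,
    )
    return dict(zip(pdfs.keys(), results))
```

Running `asyncio.run(process_batch(pdfs, [strict_parser, lenient_parser]))` returns one entry per paper ID. With `return_exceptions=True`, a paper that defeats every parser yields an exception object in the result rather than aborting the whole batch.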
## Dependencies
```text
langchain==0.2.1
arxiv==2.1.0
unstructured==0.12.2
openai==1.30.1
faiss-cpu==1.8.0
sqlalchemy==2.0.30