# AI Paper Review Agent Architecture

## Overview
A modular system for fetching, processing, and summarizing academic papers from arXiv using LangChain.

### Core Components

1. **Data Acquisition Layer**
   - arXiv API Client (REST interface)
   - PDF Downloader Service
   - Metadata Extraction Module

2. **Processing Pipeline**
   - PDF Text Extractor (PyPDF2/Unstructured)
   - Semantic Chunker
   - Metadata Enricher (author institutions, citations)

3. **Analysis Engine**
   - LangChain Document Loaders
   - Multi-stage Summary Chain (DeepSeek-R1)
   - Technical Concept Extractor
   - Cross-Paper Insight Aggregator

4. **Storage Layer**
   - Relational Storage (PostgreSQL)
   - Vector Store (Chroma)
   - Cache System (Redis)

5. **Orchestration**
   - Agent Controller Class
   - Retry Mechanism with Exponential Backoff
   - Quality Assurance Checks
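
The orchestration items above (controller, retry with exponential backoff, quality checks) could fit together roughly as follows. This is a minimal sketch, not the project's implementation: `retry_with_backoff`, `AgentController`, `passes_quality_check`, and the length-based QA rule are all hypothetical names and assumptions introduced for illustration, and the `sleep` parameter exists only so the waiting can be swapped out in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise ValueError("max_attempts must be at least 1")


class AgentController:
    """Hypothetical controller: runs one step with retries, then a QA check."""

    def __init__(
        self,
        min_summary_chars: int = 20,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_summary_chars = min_summary_chars
        self._sleep = sleep

    def passes_quality_check(self, summary: str) -> bool:
        # Placeholder QA rule (an assumption): reject empty or very short output.
        return len(summary.strip()) >= self.min_summary_chars

    def run_step(self, step: Callable[[], str]) -> str:
        # Transient failures are retried; a bad result is not.
        summary = retry_with_backoff(step, sleep=self._sleep)
        if not self.passes_quality_check(summary):
            raise ValueError("summary failed the quality check")
        return summary
```

A production version would likely retry only on specific exception types (network errors, rate limits) and add random jitter to the delay so that parallel workers do not retry in lockstep.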

## Architectural Diagram
```mermaid
graph LR
    A[User Query] --> B(arXiv API)
    B --> C[PDF Storage]
    C --> D{Processing Queue}
    D --> E[Text Extraction]
    E --> F[Chunking]
    F --> G[Embedding]
    G --> H[Vector Store]
    H --> I[LLM Analysis]
    I --> J[Report Generation]
```
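
To make the Chunking, Embedding, and Vector Store stages of the diagram concrete, here is a toy, standard-library-only sketch of that part of the flow. Everything in it is a stand-in and an assumption: `chunk_text` uses fixed-size overlapping windows rather than semantic chunking, `embed` is a bag-of-words count rather than a learned embedding model, and `InMemoryVectorStore` only imitates the role that Chroma plays in the architecture.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
        start += size - overlap
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for the vector store stage."""

    def __init__(self) -> None:
        self._items: list[tuple[Counter, str]] = []

    def add(self, chunks: list[str]) -> None:
        for chunk in chunks:
            self._items.append((embed(chunk), chunk))

    def search(self, query: str, k: int = 1) -> list[str]:
        # Rank stored chunks by similarity to the query, best first.
        query_vec = embed(query)
        ranked = sorted(
            self._items,
            key=lambda item: cosine(query_vec, item[0]),
            reverse=True,
        )
        return [chunk for _, chunk in ranked[:k]]
```

In the real pipeline, the chunks returned by `search` would be the context passed on to the LLM Analysis stage.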

## Key Decisions

1. **Modular Design**: Components communicate through well-defined interfaces, so each can be replaced independently.
2. **Batch Processing**: An asynchronous pipeline processes multiple papers in parallel.
3. **Caching Layer**: Caching reduces repeated API calls and lowers latency.
4. **Fallback Strategies**: Multiple PDF parsers are tried in order, with automatic fallback when one fails.
5. **Security**: Credentials are read from environment variables, and stored data is encrypted.
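
The batch-processing and fallback decisions above could be combined along these lines. This is a hedged sketch under stated assumptions: `extract_with_fallback`, `process_batch`, and the two decoder-based parsers are hypothetical placeholders, and in the real system the parser list would wrap the PDF libraries named earlier (PyPDF2, Unstructured) instead.

```python
import asyncio
from typing import Callable, Sequence

Parser = Callable[[bytes], str]


def extract_with_fallback(pdf_bytes: bytes, parsers: Sequence[Parser]) -> str:
    """Try each parser in order and return the first non-empty result."""
    errors = []
    for parser in parsers:
        try:
            text = parser(pdf_bytes)
            if text.strip():
                return text
            errors.append(f"{parser.__name__}: empty output")
        except Exception as exc:
            errors.append(f"{parser.__name__}: {exc}")
    raise RuntimeError("all parsers failed: " + "; ".join(errors))


async def process_batch(
    papers: dict[str, bytes],
    parsers: Sequence[Parser],
    max_concurrency: int = 4,
) -> dict[str, str]:
    """Extract text from many papers concurrently, with bounded parallelism."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def process_one(paper_id: str, pdf_bytes: bytes) -> tuple[str, str]:
        async with semaphore:
            # Parsing is blocking work, so run it in a worker thread.
            text = await asyncio.to_thread(
                extract_with_fallback, pdf_bytes, parsers
            )
            return paper_id, text

    results = await asyncio.gather(
        *(process_one(pid, data) for pid, data in papers.items())
    )
    return dict(results)


# Placeholder parsers (assumptions): plain decoders standing in for PDF libraries.
def strict_parser(pdf_bytes: bytes) -> str:
    return pdf_bytes.decode("utf-8")  # raises on invalid UTF-8


def lenient_parser(pdf_bytes: bytes) -> str:
    return pdf_bytes.decode("latin-1")  # accepts any byte sequence
```

Calling `asyncio.run(process_batch(papers, [strict_parser, lenient_parser]))` returns a mapping from paper ID to extracted text. As written, one paper failing every parser makes the whole batch raise; passing `return_exceptions=True` to `asyncio.gather` would be one way to isolate failures per paper.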

## Dependencies
```python
langchain==0.2.1
arxiv==2.1.0
unstructured==0.12.2
openai==1.30.1
faiss-cpu==1.8.0
sqlalchemy==2.0.30
```