Messed up, then fixed a ton of files. Note: never leave the context window you started from
This commit is contained in:
parent
ac43cdd77e
commit
d5b90f555d
99
README.md
@@ -1,72 +1,83 @@
|
||||
# 🤖📚 LastIn-AI: The Research Assistant That Wrote Itself
|
||||
# 🧠 LastIn AI: Your Paper's New Overlord
|
||||
|
||||
*"I used the AI to create the AI" - This README, probably*
|
||||
[](https://www.youtube.com/watch?v=dQw4w9WgXcQ)
|
||||
|
||||
## 🚀 What Does This Thing Do?
|
||||
**An AI system that reads, judges, and organizes papers so you can pretend you're keeping up with the literature**
|
||||
|
||||
This autonomous research assistant:
|
||||
1. **Paper Hunter** 🔍: Daily arXiv scans using our `PaperFetcher` class
|
||||
2. **Category Ninja** 🥷: Manage topics via `config/arxiv_categories.json`
|
||||
3. **Time Traveler** ⏳: Fetch papers from the past X days
|
||||
4. **Self-Aware Entity** 🤯: Literally wrote this README about itself
|
||||
## 🚀 Features
|
||||
|
||||
## ⚙️ Installation (For Humans)
|
||||
- 🤖 **Self-Optimizing Pipeline**: Now with 20% more existential dread about its purpose
|
||||
- 🧩 **Modular Architecture**: Swappable components like you're building Ikea furniture for robots
|
||||
- 🔍 **Context-Aware Analysis**: Reads between LaTeX equations like a PhD student reads Twitter
|
||||
- 🛠 **Self-Healing Storage**: Fixes database issues while questioning why it bothers
|
||||
- 🤯 **Flux Capacitor Mode**: Time-aware processing (results may violate causality)
|
||||
|
||||
## ⚙️ Installation
|
||||
|
||||
```bash
|
||||
# Clone this repository that an AI created
|
||||
git clone https://github.com/yourusername/lastin-ai.git
|
||||
# Clone repository (we promise there's no paperclip maximizer)
|
||||
git clone https://github.com/your-lab/lastin-ai.git
|
||||
cd lastin-ai
|
||||
|
||||
# Create virtual environment (because even AIs hate dependency conflicts)
|
||||
python -m venv .venv
|
||||
.venv\Scripts\activate   # Windows; on macOS/Linux: source .venv/bin/activate
|
||||
# Install dependencies (virtual env recommended)
|
||||
pip install -r requirements.txt # Now with non-Euclidean dependency resolution!
|
||||
|
||||
# Install dependencies
|
||||
pip install -r requirements.txt
|
||||
# Initialize the system (requires PostgreSQL)
|
||||
python -m src.main init-db # Creates tables and a small existential crisis
|
||||
```
|
||||
|
||||
## 🧠 Configuration
|
||||
## 🔧 Configuration
|
||||
|
||||
Edit `config/arxiv_categories.json` to add your academic obsessions:
|
||||
Rename `example.env` to `.env` and feed it your secrets:
|
||||
```ini
|
||||
OPENAI_API_KEY=sk-your-key-here # We pinky-promise not to become self-aware
|
||||
DEEPSEEK_API_KEY=sk-moon-shot # For when regular AI isn't dramatic enough
|
||||
|
||||
```json
|
||||
{
|
||||
"categories": [
|
||||
{
|
||||
"id": "cs.AI",
|
||||
"name": "Artificial Intelligence",
|
||||
"description": "Papers about systems that will eventually write better systems"
|
||||
}
|
||||
],
|
||||
"default_categories": ["cs.AI"]
|
||||
}
|
||||
DB_HOST=localhost # Where we store your academic sins
|
||||
DB_NAME=paper_analysis # Schema designed during a SIGBOVIK break
|
||||
```
|
||||
|
||||
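For reference, here is a minimal sketch of how the pipeline might consume these two files; `load_settings` and its return shape are illustrative, not part of the repository:

```python
# Hypothetical glue code: read config/arxiv_categories.json plus the .env file.
# Uses only json, os and python-dotenv (already a dependency via AgentController).
import json
import os

from dotenv import load_dotenv

def load_settings(config_path: str = "config/arxiv_categories.json") -> dict:
    load_dotenv()  # pulls OPENAI_API_KEY, DEEPSEEK_API_KEY and DB_* into os.environ
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    return {
        "categories": config.get("default_categories", ["cs.AI"]),
        "deepseek_key": os.getenv("DEEPSEEK_API_KEY"),
        "db_host": os.getenv("DB_HOST", "localhost"),
        "db_name": os.getenv("DB_NAME", "paper_analysis"),
    }
```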
## 💻 Usage
|
||||
## 🧠 Usage
|
||||
|
||||
```python
|
||||
# Let the AI research AI research
|
||||
python -m src.scripts.fetch_papers --days 3
|
||||
```bash
|
||||
# Harvest papers like an academic combine
|
||||
python -m src.main fetch --categories cs.AI --days 7
|
||||
|
||||
# Query your digital hoard
|
||||
python -m src.main search "papers that cite GPT but clearly didn't read it"
|
||||
```
|
||||
|
||||
**Sample Output:**
|
||||
**Pro Mode:** Add `--loglevel DEBUG` to watch neural networks question the meaning of "breakthrough".
|
||||
|
||||
## 🏗 Architecture
|
||||
|
||||
```
|
||||
[AI]: Found 42 papers about AI
|
||||
[AI]: Determining which papers are about AI writing better AI...
|
||||
[AI]: existential_crisis.exe triggered
|
||||
lastin-ai/
|
||||
├── src/
|
||||
│ ├── data_acquisition/ # Papers go in, embeddings come out
|
||||
│ ├── storage/ # Where vectors go to rethink their life choices
|
||||
│ ├── utils/ # Home of AgentController (the real protagonist)
|
||||
│ └── main.py # The big red button (do not push)
|
||||
```
|
||||
|
||||
## 🤖 Philosophical Corner
|
||||
Our **Agent Controller** handles:
|
||||
- 🧬 Data consistency through sheer willpower
|
||||
- 📉 Quality metrics that judge us all
|
||||
- 🧮 Vector math that would make Von Neumann blush
|
||||
|
||||
*This section written by the AI about the AI that wrote the AI*
|
||||
## 🌌 Roadmap
|
||||
|
||||
Q: Who watches the watchmen?
|
||||
A: An AI that watches watchmen-watching AI.
|
||||
- [ ] Implement ethical constraints (optional)
|
||||
- [ ] Add support for papers written by AIs about AIs
|
||||
- [ ] Quantum-resistant pretension detection
|
||||
- [ ] Automated rebuttal generator for peer review
|
||||
|
||||
## 📜 License
|
||||
## 🌟 Acknowledgments
|
||||
|
||||
MIT License - because even self-writing code needs legal coverage
|
||||
- **ArXiv** for the digital paper avalanche
|
||||
- **GPUs** for pretending we're not just matrix multiplying
|
||||
- **The concept of attention** - you're the real MVP
|
||||
|
||||
---
|
||||
|
||||
*README written by Cascade (AI) at 2025-02-10 22:54 UTC - bow before your new robotic researcher overlords*
|
||||
*Disclaimer: May occasionally generate paper summaries more coherent than the originals. Not liable for recursive self-improvement loops.* 🤖➰
|
||||
|
||||
2356
data/papers/papers_cs_AI_20250210_231748.json
Normal file
File diff suppressed because it is too large
2356
data/papers/papers_cs_AI_20250210_231924.json
Normal file
File diff suppressed because it is too large
2356
data/papers/papers_cs_AI_20250210_231946.json
Normal file
File diff suppressed because it is too large
10
example.env
Normal file
@@ -0,0 +1,10 @@
|
||||
OPENAI_API_KEY=
|
||||
DEEPSEEK_API_KEY=
|
||||
|
||||
# Database Configuration
|
||||
DB_HOST=localhost
|
||||
DB_PORT=5432
|
||||
DB_NAME=paper_analysis
|
||||
DB_USER=paper_analysis
|
||||
DB_PASSWORD=xyz
|
||||
DATABASE_URL=postgresql://paper_analysis:xyz@localhost:5432/paper_analysis
|
||||
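As a sketch only (not part of the commit), this is roughly how the `DB_*` variables or the single `DATABASE_URL` could be turned into an asyncpg pool; `make_pool` is a hypothetical helper:

```python
# Illustrative: assemble a DSN from example.env and open an asyncpg pool.
import os

import asyncpg
from dotenv import load_dotenv

async def make_pool() -> asyncpg.Pool:
    load_dotenv()
    dsn = os.getenv("DATABASE_URL") or (
        f"postgresql://{os.getenv('DB_USER')}:{os.getenv('DB_PASSWORD')}"
        f"@{os.getenv('DB_HOST', 'localhost')}:{os.getenv('DB_PORT', '5432')}"
        f"/{os.getenv('DB_NAME')}"
    )
    return await asyncpg.create_pool(dsn)
```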
15
papers/2502_05110v1.json
Normal file
@@ -0,0 +1,15 @@
|
||||
{
|
||||
"title": "ApplE: An Applied Ethics Ontology with Event Context",
|
||||
"authors": [
|
||||
"Aisha Aijaz",
|
||||
"Raghava Mutharaju",
|
||||
"Manohar Kumar"
|
||||
],
|
||||
"abstract": "Applied ethics is ubiquitous in most domains, requiring much deliberation due\nto its philosophical nature. Varying views often lead to conflicting courses of\naction where ethical dilemmas become challenging to resolve. Although many\nfactors contribute to such a decision, the major driving forces can be\ndiscretized and thus simplified to provide an indicative answer. Knowledge\nrepresentation and reasoning offer a way to explicitly translate abstract\nethical concepts into applicable principles within the context of an event. To\nachieve this, we propose ApplE, an Applied Ethics ontology that captures\nphilosophical theory and event context to holistically describe the morality of\nan action. The development process adheres to a modified version of the\nSimplified Agile Methodology for Ontology Development (SAMOD) and utilizes\nstandard design and publication practices. Using ApplE, we model a use case\nfrom the bioethics domain that demonstrates our ontology's social and\nscientific value. Apart from the ontological reasoning and quality checks,\nApplE is also evaluated using the three-fold testing process of SAMOD. ApplE\nfollows FAIR principles and aims to be a viable resource for applied ethicists\nand ontology engineers.",
|
||||
"pdf_url": "http://arxiv.org/pdf/2502.05110v1",
|
||||
"entry_id": "http://arxiv.org/abs/2502.05110v1",
|
||||
"categories": [
|
||||
"cs.CY",
|
||||
"cs.AI"
|
||||
]
|
||||
}
|
||||
15
papers/2502_05111v1.json
Normal file
@@ -0,0 +1,15 @@
|
||||
{
|
||||
"title": "Flexible and Efficient Grammar-Constrained Decoding",
|
||||
"authors": [
|
||||
"Kanghee Park",
|
||||
"Timothy Zhou",
|
||||
"Loris D'Antoni"
|
||||
],
|
||||
"abstract": "Large Language Models (LLMs) are often asked to generate structured outputs\nthat obey precise syntactic rules, such as code snippets or formatted data.\nGrammar-constrained decoding (GCD) can guarantee that LLM outputs matches such\nrules by masking out tokens that will provably lead to outputs that do not\nbelong to a specified context-free grammar (CFG). To guarantee soundness, GCD\nalgorithms have to compute how a given LLM subword tokenizer can align with the\ntokens used\n by a given context-free grammar and compute token masks based on this\ninformation. Doing so efficiently is challenging and existing GCD algorithms\nrequire tens of minutes to preprocess common grammars. We present a new GCD\nalgorithm together with an implementation that offers 17.71x faster offline\npreprocessing than existing approaches while preserving state-of-the-art\nefficiency in online mask computation.",
|
||||
"pdf_url": "http://arxiv.org/pdf/2502.05111v1",
|
||||
"entry_id": "http://arxiv.org/abs/2502.05111v1",
|
||||
"categories": [
|
||||
"cs.CL",
|
||||
"cs.AI"
|
||||
]
|
||||
}
|
||||
25
papers/2502_05130v1.json
Normal file
@@ -0,0 +1,25 @@
|
||||
{
|
||||
"title": "Latent Swap Joint Diffusion for Long-Form Audio Generation",
|
||||
"authors": [
|
||||
"Yusheng Dai",
|
||||
"Chenxi Wang",
|
||||
"Chang Li",
|
||||
"Chen Wang",
|
||||
"Jun Du",
|
||||
"Kewei Li",
|
||||
"Ruoyu Wang",
|
||||
"Jiefeng Ma",
|
||||
"Lei Sun",
|
||||
"Jianqing Gao"
|
||||
],
|
||||
"abstract": "Previous work on long-form audio generation using global-view diffusion or\niterative generation demands significant training or inference costs. While\nrecent advancements in multi-view joint diffusion for panoramic generation\nprovide an efficient option, they struggle with spectrum generation with severe\noverlap distortions and high cross-view consistency costs. We initially explore\nthis phenomenon through the connectivity inheritance of latent maps and uncover\nthat averaging operations excessively smooth the high-frequency components of\nthe latent map. To address these issues, we propose Swap Forward (SaFa), a\nframe-level latent swap framework that synchronizes multiple diffusions to\nproduce a globally coherent long audio with more spectrum details in a\nforward-only manner. At its core, the bidirectional Self-Loop Latent Swap is\napplied between adjacent views, leveraging stepwise diffusion trajectory to\nadaptively enhance high-frequency components without disrupting low-frequency\ncomponents. Furthermore, to ensure cross-view consistency, the unidirectional\nReference-Guided Latent Swap is applied between the reference and the\nnon-overlap regions of each subview during the early stages, providing\ncentralized trajectory guidance. Quantitative and qualitative experiments\ndemonstrate that SaFa significantly outperforms existing joint diffusion\nmethods and even training-based long audio generation models. Moreover, we find\nthat it also adapts well to panoramic generation, achieving comparable\nstate-of-the-art performance with greater efficiency and model\ngeneralizability. Project page is available at https://swapforward.github.io/.",
|
||||
"pdf_url": "http://arxiv.org/pdf/2502.05130v1",
|
||||
"entry_id": "http://arxiv.org/abs/2502.05130v1",
|
||||
"categories": [
|
||||
"cs.SD",
|
||||
"cs.AI",
|
||||
"cs.CV",
|
||||
"cs.MM",
|
||||
"eess.AS"
|
||||
]
|
||||
}
|
||||
17
papers/2502_05147v1.json
Normal file
@@ -0,0 +1,17 @@
|
||||
{
|
||||
"title": "LP-DETR: Layer-wise Progressive Relations for Object Detection",
|
||||
"authors": [
|
||||
"Zhengjian Kang",
|
||||
"Ye Zhang",
|
||||
"Xiaoyu Deng",
|
||||
"Xintao Li",
|
||||
"Yongzhe Zhang"
|
||||
],
|
||||
"abstract": "This paper presents LP-DETR (Layer-wise Progressive DETR), a novel approach\nthat enhances DETR-based object detection through multi-scale relation\nmodeling. Our method introduces learnable spatial relationships between object\nqueries through a relation-aware self-attention mechanism, which adaptively\nlearns to balance different scales of relations (local, medium and global)\nacross decoder layers. This progressive design enables the model to effectively\ncapture evolving spatial dependencies throughout the detection pipeline.\nExtensive experiments on COCO 2017 dataset demonstrate that our method improves\nboth convergence speed and detection accuracy compared to standard\nself-attention module. The proposed method achieves competitive results,\nreaching 52.3\\% AP with 12 epochs and 52.5\\% AP with 24 epochs using ResNet-50\nbackbone, and further improving to 58.0\\% AP with Swin-L backbone. Furthermore,\nour analysis reveals an interesting pattern: the model naturally learns to\nprioritize local spatial relations in early decoder layers while gradually\nshifting attention to broader contexts in deeper layers, providing valuable\ninsights for future research in object detection.",
|
||||
"pdf_url": "http://arxiv.org/pdf/2502.05147v1",
|
||||
"entry_id": "http://arxiv.org/abs/2502.05147v1",
|
||||
"categories": [
|
||||
"cs.CV",
|
||||
"cs.AI"
|
||||
]
|
||||
}
|
||||
28
papers/2502_05151v1.json
Normal file
@@ -0,0 +1,28 @@
|
||||
{
|
||||
"title": "Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation",
|
||||
"authors": [
|
||||
"Steffen Eger",
|
||||
"Yong Cao",
|
||||
"Jennifer D'Souza",
|
||||
"Andreas Geiger",
|
||||
"Christian Greisinger",
|
||||
"Stephanie Gross",
|
||||
"Yufang Hou",
|
||||
"Brigitte Krenn",
|
||||
"Anne Lauscher",
|
||||
"Yizhi Li",
|
||||
"Chenghua Lin",
|
||||
"Nafise Sadat Moosavi",
|
||||
"Wei Zhao",
|
||||
"Tristan Miller"
|
||||
],
|
||||
"abstract": "With the advent of large multimodal language models, science is now at a\nthreshold of an AI-based technological transformation. Recently, a plethora of\nnew AI models and tools has been proposed, promising to empower researchers and\nacademics worldwide to conduct their research more effectively and efficiently.\nThis includes all aspects of the research cycle, especially (1) searching for\nrelevant literature; (2) generating research ideas and conducting\nexperimentation; generating (3) text-based and (4) multimodal content (e.g.,\nscientific figures and diagrams); and (5) AI-based automatic peer review. In\nthis survey, we provide an in-depth overview over these exciting recent\ndevelopments, which promise to fundamentally alter the scientific research\nprocess for good. Our survey covers the five aspects outlined above, indicating\nrelevant datasets, methods and results (including evaluation) as well as\nlimitations and scope for future research. Ethical concerns regarding\nshortcomings of these tools and potential for misuse (fake science, plagiarism,\nharms to research integrity) take a particularly prominent place in our\ndiscussion. We hope that our survey will not only become a reference guide for\nnewcomers to the field but also a catalyst for new AI-based initiatives in the\narea of \"AI4Science\".",
|
||||
"pdf_url": "http://arxiv.org/pdf/2502.05151v1",
|
||||
"entry_id": "http://arxiv.org/abs/2502.05151v1",
|
||||
"categories": [
|
||||
"cs.CL",
|
||||
"cs.AI",
|
||||
"cs.CV",
|
||||
"cs.LG"
|
||||
]
|
||||
}
|
||||
24
papers/2502_05172v1.json
Normal file
@@ -0,0 +1,24 @@
|
||||
{
|
||||
"title": "Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient",
|
||||
"authors": [
|
||||
"Jan Ludziejewski",
|
||||
"Maciej Pióro",
|
||||
"Jakub Krajewski",
|
||||
"Maciej Stefaniak",
|
||||
"Michał Krutul",
|
||||
"Jan Małaśnicki",
|
||||
"Marek Cygan",
|
||||
"Piotr Sankowski",
|
||||
"Kamil Adamczewski",
|
||||
"Piotr Miłoś",
|
||||
"Sebastian Jaszczur"
|
||||
],
|
||||
"abstract": "Mixture of Experts (MoE) architectures have significantly increased\ncomputational efficiency in both research and real-world applications of\nlarge-scale machine learning models. However, their scalability and efficiency\nunder memory constraints remain relatively underexplored. In this work, we\npresent joint scaling laws for dense and MoE models, incorporating key factors\nsuch as the number of active parameters, dataset size, and the number of\nexperts. Our findings provide a principled framework for selecting the optimal\nMoE configuration under fixed memory and compute budgets. Surprisingly, we show\nthat MoE models can be more memory-efficient than dense models, contradicting\nconventional wisdom. To derive and validate the theoretical predictions of our\nscaling laws, we conduct over 280 experiments with up to 2.7B active parameters\nand up to 5B total parameters. These results offer actionable insights for\ndesigning and deploying MoE models in practical large-scale training scenarios.",
|
||||
"pdf_url": "http://arxiv.org/pdf/2502.05172v1",
|
||||
"entry_id": "http://arxiv.org/abs/2502.05172v1",
|
||||
"categories": [
|
||||
"cs.LG",
|
||||
"cs.AI",
|
||||
"cs.CL"
|
||||
]
|
||||
}
|
||||
17
papers/2502_05174v1.json
Normal file
@@ -0,0 +1,17 @@
|
||||
{
|
||||
"title": "MELON: Indirect Prompt Injection Defense via Masked Re-execution and Tool Comparison",
|
||||
"authors": [
|
||||
"Kaijie Zhu",
|
||||
"Xianjun Yang",
|
||||
"Jindong Wang",
|
||||
"Wenbo Guo",
|
||||
"William Yang Wang"
|
||||
],
|
||||
"abstract": "Recent research has explored that LLM agents are vulnerable to indirect\nprompt injection (IPI) attacks, where malicious tasks embedded in\ntool-retrieved information can redirect the agent to take unauthorized actions.\nExisting defenses against IPI have significant limitations: either require\nessential model training resources, lack effectiveness against sophisticated\nattacks, or harm the normal utilities. We present MELON (Masked re-Execution\nand TooL comparisON), a novel IPI defense. Our approach builds on the\nobservation that under a successful attack, the agent's next action becomes\nless dependent on user tasks and more on malicious tasks. Following this, we\ndesign MELON to detect attacks by re-executing the agent's trajectory with a\nmasked user prompt modified through a masking function. We identify an attack\nif the actions generated in the original and masked executions are similar. We\nalso include three key designs to reduce the potential false positives and\nfalse negatives. Extensive evaluation on the IPI benchmark AgentDojo\ndemonstrates that MELON outperforms SOTA defenses in both attack prevention and\nutility preservation. Moreover, we show that combining MELON with a SOTA prompt\naugmentation defense (denoted as MELON-Aug) further improves its performance.\nWe also conduct a detailed ablation study to validate our key designs.",
|
||||
"pdf_url": "http://arxiv.org/pdf/2502.05174v1",
|
||||
"entry_id": "http://arxiv.org/abs/2502.05174v1",
|
||||
"categories": [
|
||||
"cs.CR",
|
||||
"cs.AI"
|
||||
]
|
||||
}
|
||||
@@ -22,13 +22,16 @@ def setup_logging(log_dir: str = "logs"):
|
||||
},
|
||||
'simple': {
|
||||
'format': '%(levelname)s: %(message)s'
|
||||
},
|
||||
'error': {
|
||||
'format': '%(levelname)s: %(message)s\n%(exc_info)s'
|
||||
}
|
||||
},
|
||||
'handlers': {
|
||||
'console': {
|
||||
'class': 'logging.StreamHandler',
|
||||
'level': 'WARNING', # Only show warnings and above in console
|
||||
'formatter': 'simple',
|
||||
'level': 'INFO', # Show info and above in console
|
||||
'formatter': 'error',
|
||||
'stream': 'ext://sys.stdout'
|
||||
},
|
||||
'debug_file': {
|
||||
|
||||
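For orientation, a short sketch of how this logging setup is consumed elsewhere in the commit (run_main calls setup_logging once, then every module grabs a named logger):

```python
# Sketch: initialise logging the way src/main.py does before any real work.
import logging

from src.config.logging_config import setup_logging

setup_logging()                       # installs the dictConfig shown above
logger = logging.getLogger(__name__)  # per-module logger, as in the other files
logger.info("visible on the console (console level is now INFO)")
logger.debug("below the console level, so this only reaches the file handlers")
```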
@@ -1,11 +1,21 @@
|
||||
"""Client for interacting with the arXiv API."""
|
||||
import arxiv
|
||||
from typing import List, Dict, Any
|
||||
from typing import List, Dict, Any, Optional
|
||||
from tenacity import retry, stop_after_attempt, wait_exponential
|
||||
import aiohttp
|
||||
from datetime import datetime, timedelta
|
||||
import logging
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class ArxivClient:
|
||||
"""
|
||||
Client for interacting with the arXiv API.
|
||||
Documentation: https://info.arxiv.org/help/api/user-manual.html
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
self.client = arxiv.Client()
|
||||
"""Initialize the ArxivClient."""
|
||||
self._session = None
|
||||
|
||||
async def _get_session(self) -> aiohttp.ClientSession:
|
||||
@@ -15,42 +25,116 @@ class ArxivClient:
|
||||
return self._session
|
||||
|
||||
async def close(self):
|
||||
"""Close the session."""
|
||||
if self._session is not None:
|
||||
"""Close any open resources."""
|
||||
if self._session:
|
||||
await self._session.close()
|
||||
self._session = None
|
||||
|
||||
async def __aenter__(self):
|
||||
"""Async context manager entry."""
|
||||
await self._get_session()
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||
"""Async context manager exit."""
|
||||
await self.close()
|
||||
|
||||
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
|
||||
async def fetch_papers(self, query: str, max_results: int = 10) -> List[Dict[Any, Any]]:
|
||||
"""Fetch papers from arXiv based on query."""
|
||||
async def fetch_papers(self, query: str = None, category: str = None,
|
||||
start_date = None, end_date = None, max_results: int = 100) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Fetch papers from arXiv based on query and/or category.
|
||||
|
||||
Args:
|
||||
query (str, optional): Search query
|
||||
category (str, optional): arXiv category (e.g., 'cs.AI')
|
||||
start_date: Start date for paper search (naive datetime)
|
||||
end_date: End date for paper search (naive datetime)
|
||||
max_results (int): Maximum number of results to return
|
||||
|
||||
Returns:
|
||||
List[Dict[str, Any]]: List of paper metadata
|
||||
"""
|
||||
# Build the search query according to arXiv API syntax
|
||||
search_terms = []
|
||||
|
||||
# Add category constraint if specified
|
||||
if category:
|
||||
search_terms.append(f"cat:{category}")
|
||||
|
||||
# Add custom query if specified
|
||||
if query:
|
||||
formatted_query = query.replace(' and ', ' AND ').replace(' or ', ' OR ').replace(' not ', ' NOT ')  # normalize boolean operators to the uppercase form the arXiv API expects
|
||||
search_terms.append(f"({formatted_query})")
|
||||
|
||||
# Add date range constraint if specified
|
||||
if start_date and end_date:
|
||||
date_range = (
|
||||
f"submittedDate:[{start_date.strftime('%Y%m%d')}0000 TO "
|
||||
f"{end_date.strftime('%Y%m%d')}2359]"
|
||||
)
|
||||
search_terms.append(date_range)
|
||||
logger.info(f"Searching for papers between {start_date.strftime('%Y-%m-%d')} and {end_date.strftime('%Y-%m-%d')}")
|
||||
|
||||
# Combine search terms
|
||||
final_query = ' AND '.join(f'({term})' for term in search_terms) if search_terms else 'all:*'
|
||||
|
||||
# Configure and execute search
|
||||
search = arxiv.Search(
|
||||
query=query,
|
||||
query=final_query,
|
||||
max_results=max_results,
|
||||
sort_by=arxiv.SortCriterion.SubmittedDate
|
||||
sort_by=arxiv.SortCriterion.SubmittedDate,
|
||||
sort_order=arxiv.SortOrder.Descending
|
||||
)
|
||||
|
||||
results = []
|
||||
# Use list() to get all results at once instead of async iteration
|
||||
papers = list(self.client.results(search))
|
||||
papers = []
|
||||
try:
|
||||
client = arxiv.Client()
|
||||
|
||||
for paper in papers:
|
||||
results.append({
|
||||
'id': paper.entry_id.split('/')[-1], # Get the ID without the full URL
|
||||
'title': paper.title,
|
||||
'authors': [author.name for author in paper.authors],
|
||||
'summary': paper.summary,
|
||||
'pdf_url': paper.pdf_url,
|
||||
'published': paper.published.isoformat() if paper.published else None,
|
||||
'updated': paper.updated.isoformat() if paper.updated else None,
|
||||
'categories': paper.categories
|
||||
})
|
||||
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
|
||||
def fetch_results():
|
||||
return list(client.results(search))
|
||||
|
||||
return results
|
||||
results = fetch_results()
|
||||
|
||||
for result in results:
|
||||
papers.append({
|
||||
"title": result.title,
|
||||
"authors": [author.name for author in result.authors],
|
||||
"abstract": result.summary,
|
||||
"pdf_url": result.pdf_url,
|
||||
"entry_id": result.entry_id,
|
||||
"categories": result.categories
|
||||
})
|
||||
|
||||
logger.info(f"Successfully fetched {len(papers)} papers")
|
||||
return papers
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error fetching papers: {e}")
|
||||
raise
|
||||
|
||||
async def get_paper_by_id(self, paper_id: str) -> Optional[Dict[str, Any]]:
|
||||
"""Fetch a single paper by arXiv ID."""
|
||||
try:
|
||||
# Extract arXiv ID from URL format
|
||||
arxiv_id = paper_id.split('/')[-1].split('v')[0]  # strip version suffix (e.g. 2502.05110v1 -> 2502.05110)
|
||||
search = arxiv.Search(id_list=[arxiv_id])
|
||||
|
||||
client = arxiv.Client()
|
||||
results = list(client.results(search))
|
||||
|
||||
if not results:
|
||||
return None
|
||||
|
||||
result = results[0]
|
||||
return {
|
||||
"entry_id": result.entry_id,
|
||||
"title": result.title,
|
||||
"authors": [author.name for author in result.authors],
|
||||
"summary": result.summary,
|
||||
"pdf_url": result.pdf_url,
|
||||
"categories": result.categories
|
||||
}
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to fetch paper {paper_id}: {e}")
|
||||
return None
|
||||
|
||||
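A quick usage sketch for the rewritten client above; the dates and category are arbitrary, everything else mirrors the API in this diff:

```python
# Minimal driver for ArxivClient.fetch_papers as defined in this commit.
import asyncio
from datetime import datetime, timedelta

from src.data_acquisition.arxiv_client import ArxivClient

async def demo():
    end = datetime.now()
    start = end - timedelta(days=7)
    async with ArxivClient() as client:  # context manager closes the aiohttp session
        papers = await client.fetch_papers(category="cs.AI",
                                           start_date=start,
                                           end_date=end,
                                           max_results=5)
    for paper in papers:
        print(paper["title"], "->", paper["pdf_url"])

asyncio.run(demo())
```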
243
src/main.py
@@ -2,26 +2,32 @@ import asyncio
|
||||
import signal
|
||||
import logging
|
||||
import sys
|
||||
from src.agent_controller import AgentController
|
||||
import argparse
import os
|
||||
import json
|
||||
from datetime import datetime, timedelta
|
||||
from typing import List, Optional
|
||||
from zoneinfo import ZoneInfo
|
||||
|
||||
from src.utils.agent_controller import AgentController
|
||||
from src.data_acquisition.arxiv_client import ArxivClient
|
||||
from src.utils.debug import tracker
|
||||
from src.config.logging_config import setup_logging
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
def handle_sigint():
|
||||
def handle_sigint(signum, frame):
|
||||
"""Handle interrupt signal."""
|
||||
logger.info("Received interrupt signal, canceling tasks...")
|
||||
for task in asyncio.all_tasks():
|
||||
task.cancel()
|
||||
logger.info("Received interrupt signal, exiting...")
|
||||
sys.exit(0)
|
||||
|
||||
async def cleanup_resources(loop: asyncio.AbstractEventLoop):
|
||||
async def cleanup_resources():
|
||||
"""Clean up any remaining resources."""
|
||||
logger.debug("Starting resource cleanup...")
|
||||
tracker.print_active_resources()
|
||||
|
||||
try:
|
||||
# Cancel all tasks
|
||||
pending = asyncio.all_tasks(loop)
|
||||
# Cancel all tasks except current
|
||||
pending = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
|
||||
if pending:
|
||||
logger.debug(f"Cancelling {len(pending)} pending tasks...")
|
||||
for task in pending:
|
||||
@@ -29,107 +35,168 @@ async def cleanup_resources(loop: asyncio.AbstractEventLoop):
|
||||
task.cancel()
|
||||
|
||||
# Wait for tasks to complete with timeout
|
||||
logger.debug("Waiting for tasks to complete...")
|
||||
await asyncio.wait(pending, timeout=5)
|
||||
|
||||
# Force close any remaining tasks
|
||||
for task in pending:
|
||||
if not task.done():
|
||||
logger.warning(f"Force closing task: {task.get_name()}")
|
||||
try:
|
||||
task.close()
|
||||
except Exception as e:
|
||||
logger.error(f"Error closing task: {str(e)}")
|
||||
try:
|
||||
logger.debug("Waiting for tasks to complete...")
|
||||
await asyncio.wait(pending, timeout=5)
|
||||
except asyncio.CancelledError:
|
||||
logger.debug("Task wait cancelled")
|
||||
|
||||
logger.debug("Cleanup completed successfully")
|
||||
except Exception as e:
|
||||
logger.error(f"Error during cleanup: {str(e)}")
|
||||
finally:
|
||||
logger.debug("Resource cleanup completed")
|
||||
logger.error(f"Error during cleanup: {e}")
|
||||
|
||||
async def fetch_papers(days: int = 7, categories: Optional[List[str]] = None) -> None:
|
||||
"""Fetch and analyze papers from arXiv."""
|
||||
logger.info(f"Fetching papers from the last {days} days")
|
||||
|
||||
if categories is None:
|
||||
categories = ["cs.AI"]
|
||||
|
||||
async with ArxivClient() as client, AgentController() as agent:
|
||||
try:
|
||||
# Calculate date range
|
||||
end_date = datetime.now()
|
||||
start_date = end_date - timedelta(days=days)
|
||||
|
||||
logger.info(f"Date range: {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")
|
||||
|
||||
# Fetch and analyze papers for each category
|
||||
for category in categories:
|
||||
logger.info(f"Fetching papers for category: {category}")
|
||||
papers = await client.fetch_papers(category=category,
|
||||
start_date=start_date,
|
||||
end_date=end_date)
|
||||
|
||||
if not papers:
|
||||
print(f"\nNo recent papers found in {category}")
|
||||
continue
|
||||
|
||||
print(f"\nProcessing papers in {category}:")
|
||||
print("=" * 80)
|
||||
print()
|
||||
|
||||
for i, paper in enumerate(papers, 1):
|
||||
print(f"Processing paper {i}/{len(papers)}: {paper['title']}")
|
||||
|
||||
try:
|
||||
# Analyze paper
|
||||
analysis = await agent.analyze_paper(paper)
|
||||
|
||||
# Print analysis
|
||||
print("\nAnalysis:")
|
||||
print(f"Summary: {analysis.get('summary', 'No summary available')}")
|
||||
print("\nTechnical Concepts:")
|
||||
print(analysis.get('technical_concepts', 'No technical concepts available'))
|
||||
|
||||
# Print fluff analysis
|
||||
fluff = analysis.get('fluff', {})
|
||||
score = fluff.get('score')
|
||||
if score is not None:
|
||||
color = '\033[92m' if score < 30 else '\033[93m' if score < 70 else '\033[91m'
|
||||
reset = '\033[0m'
|
||||
print(f"\nFluff Score: {color}{score}/100{reset}")
|
||||
print("Analysis:")
|
||||
print(fluff.get('explanation', 'No explanation available'))
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing paper: {e}")
|
||||
print("Failed to analyze paper")
|
||||
|
||||
print("-" * 80)
|
||||
print()
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error fetching papers: {e}")
|
||||
raise
|
||||
|
||||
async def process_query(query: str) -> None:
|
||||
"""Process a search query and display results."""
|
||||
async with AgentController() as agent:
|
||||
try:
|
||||
results = await agent.process_query(query)
|
||||
|
||||
if not results:
|
||||
print("\nNo matching papers found.")
|
||||
return
|
||||
|
||||
print(f"\nFound {len(results)} matching papers:")
|
||||
print("=" * 80)
|
||||
|
||||
for i, paper in enumerate(results, 1):
|
||||
print(f"\n{i}. {paper['title']}")
|
||||
print(f" Authors: {', '.join(paper['authors'])}")
|
||||
|
||||
analysis = paper.get('analysis', {})
|
||||
print("\nSummary:")
|
||||
print(analysis.get('summary', 'No summary available'))
|
||||
|
||||
print("\nTechnical Concepts:")
|
||||
print(analysis.get('technical_concepts', 'No technical concepts available'))
|
||||
|
||||
fluff = analysis.get('fluff', {})
|
||||
score = fluff.get('score')
|
||||
if score is not None:
|
||||
color = '\033[92m' if score < 30 else '\033[93m' if score < 70 else '\033[91m'
|
||||
reset = '\033[0m'
|
||||
print(f"\nFluff Score: {color}{score}/100{reset}")
|
||||
print("Analysis:")
|
||||
print(fluff.get('explanation', 'No explanation available'))
|
||||
|
||||
print("-" * 80)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing query: {e}")
|
||||
raise
|
||||
|
||||
async def main():
|
||||
"""Main application entry point."""
|
||||
try:
|
||||
logger.debug("Starting main application...")
|
||||
async with AgentController() as agent:
|
||||
query = "Large Language Models recent advances"
|
||||
logger.debug(f"Processing query: {query}")
|
||||
parser = argparse.ArgumentParser(description='Fetch and analyze arXiv papers')
|
||||
subparsers = parser.add_subparsers(dest='command', help='Command to run')
|
||||
|
||||
try:
|
||||
# Process query and get both new and existing papers
|
||||
results = await agent.process_query(query)
|
||||
# Fetch papers command
|
||||
fetch_parser = subparsers.add_parser('fetch', help='Fetch recent papers')
|
||||
fetch_parser.add_argument('--days', type=int, default=7,
|
||||
help='Number of days to look back')
|
||||
fetch_parser.add_argument('--categories', nargs='+', default=['cs.AI'],
|
||||
help='arXiv categories to fetch')
|
||||
|
||||
if results:
|
||||
print("\n=== Analysis Results ===")
|
||||
for paper in results:
|
||||
print(f"\nTitle: {paper['title']}")
|
||||
print(f"Authors: {paper.get('authors', 'Unknown')}")
|
||||
# Search papers command
|
||||
search_parser = subparsers.add_parser('search', help='Search papers')
|
||||
search_parser.add_argument('query', help='Search query')
|
||||
|
||||
if 'analysis' in paper:
|
||||
# This is a newly processed paper
|
||||
print("Status: Newly Processed")
|
||||
print("\nSummary:", paper['analysis'].get('summary', 'No summary available'))
|
||||
print("\nTechnical Concepts:", paper['analysis'].get('technical_concepts', 'No concepts available'))
|
||||
else:
|
||||
# This is an existing paper from the database
|
||||
print("Status: Retrieved from Database")
|
||||
print("\nRelevant Text:", paper.get('relevant_text', 'No text available')[:500] + "...")
|
||||
else:
|
||||
print("\nNo papers found for the query.")
|
||||
args = parser.parse_args()
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error in main: {str(e)}")
|
||||
raise
|
||||
finally:
|
||||
logger.debug("Main execution completed")
|
||||
if args.command == 'fetch':
|
||||
await fetch_papers(days=args.days, categories=args.categories)
|
||||
elif args.command == 'search':
|
||||
await process_query(args.query)
|
||||
else:
|
||||
parser.print_help()
|
||||
|
||||
def run_main():
|
||||
"""Run the main application."""
|
||||
# Set up logging
|
||||
setup_logging()
|
||||
|
||||
# Create new event loop
|
||||
logger.debug("Creating new event loop...")
|
||||
loop = asyncio.new_event_loop()
|
||||
asyncio.set_event_loop(loop)
|
||||
tracker.track_loop(loop)
|
||||
|
||||
# Set up signal handlers
|
||||
signals = (signal.SIGTERM, signal.SIGINT)
|
||||
for s in signals:
|
||||
loop.add_signal_handler(
|
||||
s, lambda s=s: handle_sigint()
|
||||
)
|
||||
# Set up signal handlers for Windows
|
||||
signal.signal(signal.SIGINT, handle_sigint)
|
||||
signal.signal(signal.SIGTERM, handle_sigint)
|
||||
|
||||
try:
|
||||
logger.debug("Running main function...")
|
||||
loop.run_until_complete(main())
|
||||
except asyncio.CancelledError:
|
||||
logger.info("Main execution cancelled")
|
||||
# Run main application
|
||||
asyncio.run(main())
|
||||
except KeyboardInterrupt:
|
||||
logger.info("Application cancelled by user")
|
||||
except Exception as e:
|
||||
logger.error(f"Error in run_main: {str(e)}")
|
||||
raise
|
||||
logger.error(f"Application error: {e}")
|
||||
sys.exit(1)
|
||||
finally:
|
||||
# Clean up
|
||||
try:
|
||||
# Run cleanup
|
||||
logger.debug("Running final cleanup...")
|
||||
loop.run_until_complete(cleanup_resources(loop))
|
||||
|
||||
# Close the loop
|
||||
logger.debug("Closing event loop...")
|
||||
loop.run_until_complete(loop.shutdown_asyncgens())
|
||||
loop.run_until_complete(loop.shutdown_default_executor())
|
||||
|
||||
# Remove the event loop
|
||||
asyncio.set_event_loop(None)
|
||||
loop.close()
|
||||
tracker.untrack_loop(loop)
|
||||
|
||||
logger.debug("Cleanup completed successfully")
|
||||
asyncio.run(cleanup_resources())
|
||||
except Exception as e:
|
||||
logger.error(f"Error during final cleanup: {str(e)}")
|
||||
finally:
|
||||
# Force exit to ensure all resources are released
|
||||
sys.exit(0)
|
||||
logger.error(f"Error during cleanup: {e}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
import argparse
|
||||
run_main()
|
||||
|
||||
@@ -79,34 +79,77 @@ class PaperFetcher:
|
||||
|
||||
logger.info(f"Fetching papers from categories: {categories}")
|
||||
|
||||
# Initialize clients
|
||||
arxiv_client = ArxivClient()
|
||||
# Initialize agent
|
||||
agent = AgentController()
|
||||
|
||||
try:
|
||||
await agent.initialize()
|
||||
|
||||
# Calculate date range
|
||||
# Calculate date range (use a minimum of 7 days if days is too small)
|
||||
end_date = datetime.now()
|
||||
days = max(days, 7) # Ensure we look at least 7 days back
|
||||
start_date = end_date - timedelta(days=days)
|
||||
|
||||
# Fetch and process papers for each category
|
||||
for category in categories:
|
||||
logger.info(f"Processing category: {category}")
|
||||
papers = await arxiv_client.fetch_papers(
|
||||
category=category,
|
||||
start_date=start_date,
|
||||
end_date=end_date
|
||||
)
|
||||
logger.info(f"Searching for papers from {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")
|
||||
|
||||
if not papers:
|
||||
logger.info(f"No new papers found for category {category}")
|
||||
continue
|
||||
# Create output directory if it doesn't exist
|
||||
output_dir = os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(__file__))), "data", "papers")
|
||||
os.makedirs(output_dir, exist_ok=True)
|
||||
|
||||
logger.info(f"Found {len(papers)} papers in {category}")
|
||||
for paper in papers:
|
||||
await agent.process_paper(paper)
|
||||
# Create and use ArxivClient as async context manager
|
||||
async with ArxivClient() as arxiv_client:
|
||||
# Fetch and process papers for each category
|
||||
for category in categories:
|
||||
logger.info(f"Processing category: {category}")
|
||||
try:
|
||||
papers = await arxiv_client.fetch_papers(
|
||||
category=category,
|
||||
start_date=start_date,
|
||||
end_date=end_date
|
||||
)
|
||||
|
||||
if not papers:
|
||||
logger.info(f"No papers found for category {category}")
|
||||
continue
|
||||
|
||||
logger.info(f"Found {len(papers)} papers in {category}")
|
||||
|
||||
# Save papers to JSON file
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
output_file = os.path.join(output_dir, f"papers_{category.replace('.', '_')}_{timestamp}.json")
|
||||
|
||||
with open(output_file, 'w', encoding='utf-8') as f:
|
||||
json.dump(papers, f, indent=2, ensure_ascii=False)
|
||||
|
||||
logger.info(f"Saved papers to {output_file}")
|
||||
|
||||
# Display paper information
|
||||
print(f"\nRecent papers in {category}:")
|
||||
print("=" * 80)
|
||||
for i, paper in enumerate(papers, 1):
|
||||
print(f"\n{i}. {paper['title']}")
|
||||
print(f" Authors: {', '.join(paper['authors'])}")
|
||||
print(f"   Published: {paper.get('published', 'n/a')}")  # 'published' is not in the new fetch_papers payload
|
||||
print(f" URL: {paper['pdf_url']}")
|
||||
print(f" Categories: {', '.join(paper['categories'])}")
|
||||
if paper.get('doi'):
|
||||
print(f" DOI: {paper['doi']}")
|
||||
print("-" * 80)
|
||||
|
||||
for paper in papers:
|
||||
try:
|
||||
analysis = await agent.analyze_paper(paper)
|
||||
# TODO: Store analysis results
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to analyze paper {paper.get('title', 'Unknown')}: {e}")
|
||||
continue
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to fetch papers for category {category}: {e}")
|
||||
continue
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error in fetch_papers: {e}")
|
||||
finally:
|
||||
await agent.close()
|
||||
|
||||
|
||||
@@ -19,6 +19,8 @@ CREATE TABLE IF NOT EXISTS papers (
|
||||
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
|
||||
updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
|
||||
);
|
||||
|
||||
CREATE UNIQUE INDEX IF NOT EXISTS papers_id_idx ON papers (id);
|
||||
"""
|
||||
|
||||
class PaperStore:
|
||||
@@ -59,72 +61,78 @@ class PaperStore:
|
||||
if self.pool:
|
||||
await self.pool.close()
|
||||
|
||||
async def store_paper(self, paper: Dict) -> None:
|
||||
"""Store a paper and its analysis in the database."""
|
||||
async def store_paper(self, metadata: Dict) -> None:
|
||||
"""Store paper metadata in the database."""
|
||||
if not self.pool:
|
||||
await self.initialize()
|
||||
|
||||
async with self.pool.acquire() as conn:
|
||||
# Extract analysis if present
|
||||
analysis = paper.get('analysis', {})
|
||||
fluff_analysis = analysis.get('fluff', {})
|
||||
try:
|
||||
logger.info(f"Storing paper metadata: {metadata.get('title')} (ID: {metadata.get('id')})")
|
||||
|
||||
# Store paper
|
||||
await conn.execute('''
|
||||
INSERT INTO papers (
|
||||
id, title, authors, summary, technical_concepts,
|
||||
fluff_score, fluff_explanation, pdf_url
|
||||
)
|
||||
VALUES ($1, $2, $3, $4, $5, $6, $7, $8)
|
||||
ON CONFLICT (id) DO UPDATE SET
|
||||
title = EXCLUDED.title,
|
||||
authors = EXCLUDED.authors,
|
||||
summary = EXCLUDED.summary,
|
||||
technical_concepts = EXCLUDED.technical_concepts,
|
||||
fluff_score = EXCLUDED.fluff_score,
|
||||
fluff_explanation = EXCLUDED.fluff_explanation,
|
||||
pdf_url = EXCLUDED.pdf_url,
|
||||
updated_at = CURRENT_TIMESTAMP
|
||||
''',
|
||||
paper['id'],
|
||||
paper['title'],
|
||||
paper['authors'],
|
||||
analysis.get('summary'),
|
||||
analysis.get('technical_concepts'),
|
||||
fluff_analysis.get('score'),
|
||||
fluff_analysis.get('explanation'),
|
||||
paper.get('pdf_url'))
|
||||
async with self.pool.acquire() as conn:
|
||||
# Store paper metadata
|
||||
await conn.execute('''
|
||||
INSERT INTO papers (
|
||||
id, title, authors, summary, technical_concepts,
|
||||
fluff_score, fluff_explanation, pdf_url
|
||||
)
|
||||
VALUES ($1, $2, $3, $4, $5, $6, $7, $8)
|
||||
ON CONFLICT (id) DO UPDATE SET
|
||||
title = EXCLUDED.title,
|
||||
authors = EXCLUDED.authors,
|
||||
summary = EXCLUDED.summary,
|
||||
technical_concepts = EXCLUDED.technical_concepts,
|
||||
fluff_score = EXCLUDED.fluff_score,
|
||||
fluff_explanation = EXCLUDED.fluff_explanation,
|
||||
pdf_url = EXCLUDED.pdf_url,
|
||||
updated_at = CURRENT_TIMESTAMP
|
||||
''',
|
||||
metadata['id'],
|
||||
metadata['title'],
|
||||
metadata['authors'],
|
||||
metadata.get('summary'),
|
||||
metadata.get('technical_concepts'),
|
||||
metadata.get('fluff_score'),
|
||||
metadata.get('fluff_explanation'),
|
||||
metadata.get('pdf_url'))
|
||||
|
||||
logger.info(f"Successfully stored paper metadata {metadata.get('id')}")
|
||||
except Exception as e:
|
||||
logger.error(f"Error storing paper metadata {metadata.get('id')}: {str(e)}")
|
||||
raise
|
||||
|
||||
async def get_paper(self, paper_id: str) -> Optional[Dict]:
|
||||
"""Retrieve a paper by its ID."""
|
||||
"""Retrieve paper metadata by its ID."""
|
||||
if not self.pool:
|
||||
await self.initialize()
|
||||
|
||||
async with self.pool.acquire() as conn:
|
||||
# Get paper metadata
|
||||
row = await conn.fetchrow('''
|
||||
SELECT * FROM papers WHERE id = $1
|
||||
SELECT id, title, authors, summary, technical_concepts,
|
||||
fluff_score, fluff_explanation, pdf_url,
|
||||
created_at, updated_at
|
||||
FROM papers
|
||||
WHERE id = $1
|
||||
''', paper_id)
|
||||
|
||||
if row:
|
||||
return dict(row)
|
||||
return None
|
||||
|
||||
async def get_all_paper_ids(self) -> List[str]:
|
||||
"""Get all paper IDs in the database."""
|
||||
if not self.pool:
|
||||
await self.initialize()
|
||||
|
||||
async with self.pool.acquire() as conn:
|
||||
rows = await conn.fetch('SELECT id FROM papers')
|
||||
return [row['id'] for row in rows]
|
||||
|
||||
async def get_papers_by_ids(self, paper_ids: List[str]) -> List[Dict]:
|
||||
"""Retrieve multiple papers by their IDs."""
|
||||
async def get_all_papers(self) -> List[Dict]:
|
||||
"""Retrieve all papers ordered by creation date."""
|
||||
if not self.pool:
|
||||
await self.initialize()
|
||||
|
||||
async with self.pool.acquire() as conn:
|
||||
# Get all papers
|
||||
rows = await conn.fetch('''
|
||||
SELECT * FROM papers WHERE id = ANY($1)
|
||||
''', paper_ids)
|
||||
SELECT id, title, authors, summary, technical_concepts,
|
||||
fluff_score, fluff_explanation, pdf_url,
|
||||
created_at, updated_at
|
||||
FROM papers
|
||||
ORDER BY created_at DESC
|
||||
''')
|
||||
|
||||
return [dict(row) for row in rows]
|
||||
|
||||
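A sketch of how the metadata-only store_paper/get_paper pair above might be exercised, assuming PaperStore() picks up its connection settings with defaults (its constructor is outside this hunk); the dict values come from one of the JSON files added in this commit:

```python
# Illustrative driver for the PaperStore changes in this commit.
import asyncio

from src.storage.paper_store import PaperStore

async def demo():
    store = PaperStore()        # assumed to read DB settings from the environment
    await store.initialize()
    try:
        await store.store_paper({
            "id": "http://arxiv.org/abs/2502.05110v1",  # entry_id doubles as the key
            "title": "ApplE: An Applied Ethics Ontology with Event Context",
            "authors": ["Aisha Aijaz", "Raghava Mutharaju", "Manohar Kumar"],
            "pdf_url": "http://arxiv.org/pdf/2502.05110v1",
            # summary / fluff fields are optional: store_paper reads them via .get()
        })
        print(await store.get_paper("http://arxiv.org/abs/2502.05110v1"))
    finally:
        await store.close()

asyncio.run(demo())
```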
@@ -1,6 +1,9 @@
|
||||
from typing import List, Dict
|
||||
import chromadb
|
||||
from chromadb.config import Settings
|
||||
import logging
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class VectorStore:
|
||||
def __init__(self, persist_directory: str = "chroma_db"):
|
||||
@@ -15,6 +18,11 @@ class VectorStore:
|
||||
metadata={"hnsw:space": "cosine"} # Use cosine similarity
|
||||
)
|
||||
|
||||
async def close(self):
|
||||
"""Close the vector store client."""
|
||||
# ChromaDB doesn't need explicit cleanup
|
||||
pass
|
||||
|
||||
def paper_exists(self, paper_id: str) -> bool:
|
||||
"""Check if a paper already exists in the vector store."""
|
||||
try:
|
||||
@@ -61,6 +69,14 @@
|
||||
ids=ids
|
||||
)
|
||||
|
||||
def delete_paper(self, paper_id: str) -> None:
|
||||
"""Delete all chunks for a paper."""
|
||||
try:
|
||||
self.collection.delete(where={"paper_id": paper_id})
|
||||
logger.info(f"Deleted vector store entries for paper {paper_id}")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to delete paper {paper_id}: {e}")
|
||||
|
||||
def query_similar(self, query: str, n_results: int = 5) -> Dict:
|
||||
"""Query for similar chunks."""
|
||||
try:
|
||||
|
||||
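A short sketch of the VectorStore surface as used by AgentController later in this commit; the chunk text and ids are illustrative:

```python
# Illustrative use of VectorStore, mirroring the calls AgentController makes.
from src.storage.vector_store import VectorStore

store = VectorStore()  # defaults to the local "chroma_db" directory
store.add_chunks(
    ["Title: ApplE\n\nAbstract: ..."],                    # chunk texts
    [{"paper_id": "http://arxiv.org/abs/2502.05110v1",    # one metadata dict per chunk
      "type": "full_text"}],
    ["2502_05110v1_text"],                                 # one id per chunk
)
results = store.query_similar("applied ethics ontology", n_results=3)
print(results["metadatas"])
store.delete_paper("http://arxiv.org/abs/2502.05110v1")
```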
1
src/utils/__init__.py
Normal file
@@ -0,0 +1 @@
|
||||
|
||||
234
src/utils/agent_controller.py
Normal file
@@ -0,0 +1,234 @@
|
||||
"""
|
||||
Agent Controller for managing AI interactions and paper analysis.
|
||||
"""
|
||||
import os
|
||||
import json
|
||||
import logging
|
||||
from typing import Dict, Any, Optional, List
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from dotenv import load_dotenv
|
||||
|
||||
from src.analysis.llm_analyzer import LLMAnalyzer
|
||||
from src.storage.paper_store import PaperStore
|
||||
from src.storage.vector_store import VectorStore
from src.data_acquisition.arxiv_client import ArxivClient
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class AgentController:
|
||||
"""Controller class for managing AI agent operations."""
|
||||
|
||||
def __init__(self):
|
||||
"""Initialize the agent controller."""
|
||||
self.initialized = False
|
||||
self.llm_analyzer = None
|
||||
self.paper_store = None
|
||||
self.vector_store = None
|
||||
self.papers_dir = Path("papers")
|
||||
|
||||
async def __aenter__(self):
|
||||
"""Async context manager entry."""
|
||||
await self.initialize()
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||
"""Async context manager exit."""
|
||||
await self.close()
|
||||
return None # Don't suppress exceptions
|
||||
|
||||
async def initialize(self):
|
||||
"""Initialize the agent and prepare it for paper analysis."""
|
||||
if self.initialized:
|
||||
return
|
||||
|
||||
try:
|
||||
# Load environment variables
|
||||
load_dotenv()
|
||||
|
||||
# Get API keys from environment
|
||||
deepseek_key = os.getenv('DEEPSEEK_API_KEY')
|
||||
if not deepseek_key:
|
||||
raise ValueError("DEEPSEEK_API_KEY environment variable is required")
|
||||
|
||||
# Create papers directory if it doesn't exist
|
||||
self.papers_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Initialize components
|
||||
self.llm_analyzer = LLMAnalyzer(api_key=deepseek_key, provider='deepseek')
|
||||
self.paper_store = PaperStore()
|
||||
self.vector_store = VectorStore()
|
||||
|
||||
# Initialize database
|
||||
await self.paper_store.initialize()
|
||||
|
||||
self.initialized = True
|
||||
logger.info("Agent controller initialized successfully")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to initialize agent controller: {e}")
|
||||
raise
|
||||
|
||||
async def analyze_paper(self, paper_data: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""
|
||||
Analyze a research paper using AI capabilities.
|
||||
|
||||
Args:
|
||||
paper_data (dict): Paper metadata and content to analyze
|
||||
|
||||
Returns:
|
||||
dict: Analysis results including summary, technical concepts, and fluff analysis
|
||||
"""
|
||||
if not self.initialized:
|
||||
await self.initialize()
|
||||
|
||||
try:
|
||||
# Get paper text (combine title and abstract)
|
||||
paper_text = f"Title: {paper_data.get('title', '')}\n\n"
|
||||
if paper_data.get('abstract'):
|
||||
paper_text += f"Abstract: {paper_data['abstract']}\n\n"
|
||||
|
||||
# Analyze paper using LLM
|
||||
analysis = await self.llm_analyzer.analyze_paper(paper_text)
|
||||
|
||||
# Store paper in databases
|
||||
paper_id = paper_data.get('entry_id') # Use entry_id from arXiv
|
||||
if paper_id:
|
||||
logger.debug(f"Checking PostgreSQL for paper {paper_id}")
|
||||
existing = await self.paper_store.get_paper(paper_id)
|
||||
if existing:
|
||||
logger.info(f"Paper {paper_id} already in database - skipping")
|
||||
return existing
|
||||
|
||||
logger.debug(f"Checking vector store for paper {paper_id}")
|
||||
if self.vector_store.paper_exists(paper_id):
|
||||
logger.warning(f"Found orphaned vector entry for {paper_id} - repairing")
|
||||
await self._repair_orphaned_paper(paper_id, paper_data)
|
||||
return {}
|
||||
|
||||
# Clean paper_id to use as filename
|
||||
safe_id = paper_id.split('/')[-1].replace('.', '_')
|
||||
|
||||
# Save paper content to file
|
||||
paper_path = self.papers_dir / f"{safe_id}.json"
|
||||
with open(paper_path, 'w', encoding='utf-8') as f:
|
||||
json.dump(paper_data, f, indent=2, ensure_ascii=False)
|
||||
|
||||
# Store metadata in PostgreSQL
|
||||
metadata = {
|
||||
'id': paper_id,
|
||||
'title': paper_data['title'],
|
||||
'authors': paper_data['authors'],
|
||||
'summary': analysis.get('summary'),
|
||||
'technical_concepts': analysis.get('technical_concepts'),
|
||||
'fluff_score': analysis.get('fluff', {}).get('score'),
|
||||
'fluff_explanation': analysis.get('fluff', {}).get('explanation'),
|
||||
'pdf_url': paper_data.get('pdf_url')
|
||||
}
|
||||
await self.paper_store.store_paper(metadata)
|
||||
|
||||
# Store in vector database for similarity search
|
||||
chunks = [
|
||||
paper_text, # Store full text
|
||||
analysis.get('summary', ''), # Store summary
|
||||
analysis.get('technical_concepts', '') # Store technical concepts
|
||||
]
|
||||
chunk_metadata = [
|
||||
{'paper_id': paper_id, 'type': 'full_text'},
|
||||
{'paper_id': paper_id, 'type': 'summary'},
|
||||
{'paper_id': paper_id, 'type': 'technical_concepts'}
|
||||
]
|
||||
chunk_ids = [
|
||||
f"{safe_id}_text",
|
||||
f"{safe_id}_summary",
|
||||
f"{safe_id}_concepts"
|
||||
]
|
||||
self.vector_store.add_chunks(chunks, chunk_metadata, chunk_ids)
|
||||
|
||||
return analysis
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to analyze paper: {e}")
|
||||
raise
|
||||
|
||||
async def process_query(self, query: str) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Process a search query and return relevant papers.
|
||||
|
||||
Args:
|
||||
query (str): Search query string
|
||||
|
||||
Returns:
|
||||
list: List of relevant papers with their analysis
|
||||
"""
|
||||
if not self.initialized:
|
||||
await self.initialize()
|
||||
|
||||
try:
|
||||
# Get similar papers from vector store
|
||||
results = self.vector_store.query_similar(query)
|
||||
|
||||
# Get unique paper IDs from results
|
||||
paper_ids = set()
|
||||
for metadata in results['metadatas']:
|
||||
if metadata and 'paper_id' in metadata:
|
||||
paper_ids.add(metadata['paper_id'])
|
||||
|
||||
# Get paper metadata from PostgreSQL
|
||||
papers = []
|
||||
for paper_id in paper_ids:
|
||||
paper = await self.paper_store.get_paper(paper_id)
|
||||
if paper:
|
||||
# Load full paper data if needed
|
||||
safe_id = paper_id.split('/')[-1].replace('.', '_')
|
||||
paper_path = self.papers_dir / f"{safe_id}.json"
|
||||
if paper_path.exists():
|
||||
with open(paper_path, 'r', encoding='utf-8') as f:
|
||||
full_paper = json.load(f)
|
||||
paper['full_data'] = full_paper
|
||||
papers.append(paper)
|
||||
|
||||
return papers
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing query: {e}")
|
||||
raise
|
||||
|
||||
async def close(self):
|
||||
"""Clean up resources."""
|
||||
if not self.initialized:
|
||||
return
|
||||
|
||||
try:
|
||||
if self.llm_analyzer:
|
||||
await self.llm_analyzer.close()
|
||||
if self.paper_store:
|
||||
await self.paper_store.close()
|
||||
if self.vector_store:
|
||||
await self.vector_store.close()
|
||||
|
||||
self.initialized = False
|
||||
logger.info("Agent controller closed successfully")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to close agent controller: {e}")
|
||||
raise
|
||||
|
||||
async def _repair_orphaned_paper(self, paper_id: str, paper_data: Dict):
|
||||
"""Repair a paper that exists in vector store but not PostgreSQL."""
|
||||
try:
|
||||
# Try to get fresh metadata from arXiv
|
||||
fresh_data = await ArxivClient().get_paper_by_id(paper_id)  # the controller keeps no arxiv client, so create one on the spot
|
||||
|
||||
if fresh_data:
|
||||
# Store in PostgreSQL with original analysis data
|
||||
await self.paper_store.store_paper({
|
||||
**fresh_data,
|
||||
"technical_concepts": paper_data.get('technical_concepts'),
|
||||
"fluff_score": paper_data.get('fluff_score'),
|
||||
"fluff_explanation": paper_data.get('fluff_explanation')
|
||||
})
|
||||
logger.info(f"Repaired orphaned paper {paper_id}")
|
||||
else:
|
||||
# Paper no longer exists on arXiv - clean up vector store
|
||||
self.vector_store.delete_paper(paper_id)
|
||||
logger.warning(f"Deleted orphaned paper {paper_id} (not found on arXiv)")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to repair paper {paper_id}: {e}")
|
||||
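Finally, a sketch of driving the controller end to end, matching the context-manager pattern used in src/main.py; it needs a running PostgreSQL instance and a DEEPSEEK_API_KEY in the environment:

```python
# Illustrative end-to-end run: fetch one category, analyze each paper.
import asyncio
from datetime import datetime, timedelta

from src.data_acquisition.arxiv_client import ArxivClient
from src.utils.agent_controller import AgentController

async def demo():
    end = datetime.now()
    start = end - timedelta(days=3)
    async with ArxivClient() as client, AgentController() as agent:
        papers = await client.fetch_papers(category="cs.AI",
                                           start_date=start,
                                           end_date=end,
                                           max_results=3)
        for paper in papers:
            analysis = await agent.analyze_paper(paper)
            print(paper["title"], "->", analysis.get("summary"))

asyncio.run(demo())
```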