Messed up, then fixed a ton of files. Note: never leave the context window from where you start

kpcto 2025-02-11 01:21:39 +00:00
parent ac43cdd77e
commit d5b90f555d
20 changed files with 7921 additions and 235 deletions


@ -1,72 +1,83 @@
-# 🤖📚 LastIn-AI: The Research Assistant That Wrote Itself
-*"I used the AI to create the AI" - This README, probably*
-## 🚀 What Does This Thing Do?
-This autonomous research assistant:
-1. **Paper Hunter** 🔍: Daily arXiv scans using our `PaperFetcher` class
-2. **Category Ninja** 🥷: Manage topics via `config/arxiv_categories.json`
-3. **Time Traveler** ⏳: Fetch papers from past X days
-4. **Self-Aware Entity** 🤯: Literally wrote this README about itself
-## ⚙️ Installation (For Humans)
+# 🧠 LastIn AI: Your Paper's New Overlord
+[![AI Overlord](https://img.shields.io/badge/Powered%20By-Sentient%20Sparks-blue?style=for-the-badge)](https://www.youtube.com/watch?v=dQw4w9WgXcQ)
+**An AI system that reads, judges, and organizes papers so you can pretend you're keeping up with the literature**
+## 🚀 Features
+- 🤖 **Self-Optimizing Pipeline**: Now with 20% more existential dread about its purpose
+- 🧩 **Modular Architecture**: Swappable components like you're building Ikea furniture for robots
+- 🔍 **Context-Aware Analysis**: Reads between LaTeX equations like a PhD student reads Twitter
+- 🛠 **Self-Healing Storage**: Fixes database issues while questioning why it bothers
+- 🤯 **Flux Capacitor Mode**: Time-aware processing (results may violate causality)
+## ⚙️ Installation
 ```bash
-# Clone this repository that an AI created
-git clone https://github.com/yourusername/lastin-ai.git
+# Clone repository (we promise there's no paperclip maximizer)
+git clone https://github.com/your-lab/lastin-ai.git
 cd lastin-ai
-# Create virtual environment (because even AIs hate dependency conflicts)
-python -m venv .venv
-.venv\Scripts\activate
-# Install dependencies
-pip install -r requirements.txt
+# Install dependencies (virtual env recommended)
+pip install -r requirements.txt # Now with non-Euclidean dependency resolution!
+# Initialize the system (requires PostgreSQL)
+python -m src.main init-db # Creates tables and a small existential crisis
 ```
-## 🧠 Configuration
-Edit `config/arxiv_categories.json` to add your academic obsessions:
-```json
-{
-  "categories": [
-    {
-      "id": "cs.AI",
-      "name": "Artificial Intelligence",
-      "description": "Papers about systems that will eventually write better systems"
-    }
-  ],
-  "default_categories": ["cs.AI"]
-}
+## 🔧 Configuration
+Rename `.env.example` to `.env` and feed it your secrets:
+```ini
+OPENAI_API_KEY=sk-your-key-here # We pinky-promise not to become self-aware
+DEEPSEEK_API_KEY=sk-moon-shot # For when regular AI isn't dramatic enough
+DB_HOST=localhost # Where we store your academic sins
+DB_NAME=paper_analysis # Schema designed during a SIGBOVIK break
 ```
-## 💻 Usage
-```python
-# Let the AI research AI research
-python -m src.scripts.fetch_papers --days 3
+## 🧠 Usage
+```bash
+# Harvest papers like an academic combine
+python -m src.main fetch --categories cs.AI --days 7
+# Query your digital hoard
+python -m src.main query "papers that cite GPT but clearly didn't read it"
 ```
-**Sample Output:**
-```
-[AI]: Found 42 papers about AI
-[AI]: Determining which papers are about AI writing better AI...
-[AI]: existential_crisis.exe triggered
-```
-## 🤖 Philosophical Corner
-*This section written by the AI about the AI that wrote the AI*
-Q: Who watches the watchmen?
-A: An AI that watches watchmen-watching AI.
-## 📜 License
-MIT License - because even self-writing code needs legal coverage
+**Pro Mode:** Add `--loglevel DEBUG` to watch neural networks question the meaning of "breakthrough".
+## 🏗 Architecture
+```
+lastin-ai/
+├── src/
+│   ├── data_acquisition/ # Papers go in, embeddings come out
+│   ├── storage/          # Where vectors go to rethink their life choices
+│   ├── utils/            # Home of AgentController (the real protagonist)
+│   └── main.py           # The big red button (do not push)
+```
+Our **Agent Controller** handles:
+- 🧬 Data consistency through sheer willpower
+- 📉 Quality metrics that judge us all
+- 🧮 Vector math that would make Von Neumann blush
+## 🌌 Roadmap
+- [ ] Implement ethical constraints (optional)
+- [ ] Add support for papers written by AIs about AIs
+- [ ] Quantum-resistant pretension detection
+- [ ] Automated rebuttal generator for peer review
+## 🌟 Acknowledgments
+- **ArXiv** for the digital paper avalanche
+- **GPUs** for pretending we're not just matrix multiplying
+- **The concept of attention** - you're the real MVP
 ---
-*README written by Cascade (AI) at 2025-02-10 22:54 UTC - bow before your new robotic researcher overlords*
+*Disclaimer: May occasionally generate paper summaries more coherent than the originals. Not liable for recursive self-improvement loops.* 🤖➰

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large

example.env Normal file

@ -0,0 +1,10 @@
OPENAI_API_KEY=
DEEPSEEK_API_KEY=
# Database Configuration
DB_HOST=localhost
DB_PORT=5432
DB_NAME=paper_analysis
DB_USER=paper_analysis
DB_PASSWORD=xyz
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/paper_analysis
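
For reference, a minimal sketch of how these settings might be consumed at startup, assuming the `python-dotenv` package (the commit's `AgentController` calls `load_dotenv()`) and treating `DATABASE_URL` as authoritative when present; the helper name is illustrative, not part of the repo:

```python
import os
from dotenv import load_dotenv  # python-dotenv; reads .env into the process environment

def build_db_dsn() -> str:
    """Return a PostgreSQL DSN, preferring DATABASE_URL when it is set."""
    load_dotenv()  # by default this loads `.env`, so example.env must be copied/renamed first
    url = os.getenv("DATABASE_URL")
    if url:
        return url
    # Otherwise assemble the DSN from the individual DB_* variables above.
    return (
        f"postgresql://{os.getenv('DB_USER', 'paper_analysis')}:"
        f"{os.getenv('DB_PASSWORD', '')}@"
        f"{os.getenv('DB_HOST', 'localhost')}:{os.getenv('DB_PORT', '5432')}/"
        f"{os.getenv('DB_NAME', 'paper_analysis')}"
    )

if __name__ == "__main__":
    print(build_db_dsn())
```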

papers/2502_05110v1.json Normal file

@ -0,0 +1,15 @@
{
"title": "ApplE: An Applied Ethics Ontology with Event Context",
"authors": [
"Aisha Aijaz",
"Raghava Mutharaju",
"Manohar Kumar"
],
"abstract": "Applied ethics is ubiquitous in most domains, requiring much deliberation due\nto its philosophical nature. Varying views often lead to conflicting courses of\naction where ethical dilemmas become challenging to resolve. Although many\nfactors contribute to such a decision, the major driving forces can be\ndiscretized and thus simplified to provide an indicative answer. Knowledge\nrepresentation and reasoning offer a way to explicitly translate abstract\nethical concepts into applicable principles within the context of an event. To\nachieve this, we propose ApplE, an Applied Ethics ontology that captures\nphilosophical theory and event context to holistically describe the morality of\nan action. The development process adheres to a modified version of the\nSimplified Agile Methodology for Ontology Development (SAMOD) and utilizes\nstandard design and publication practices. Using ApplE, we model a use case\nfrom the bioethics domain that demonstrates our ontology's social and\nscientific value. Apart from the ontological reasoning and quality checks,\nApplE is also evaluated using the three-fold testing process of SAMOD. ApplE\nfollows FAIR principles and aims to be a viable resource for applied ethicists\nand ontology engineers.",
"pdf_url": "http://arxiv.org/pdf/2502.05110v1",
"entry_id": "http://arxiv.org/abs/2502.05110v1",
"categories": [
"cs.CY",
"cs.AI"
]
}

papers/2502_05111v1.json Normal file

@ -0,0 +1,15 @@
{
"title": "Flexible and Efficient Grammar-Constrained Decoding",
"authors": [
"Kanghee Park",
"Timothy Zhou",
"Loris D'Antoni"
],
"abstract": "Large Language Models (LLMs) are often asked to generate structured outputs\nthat obey precise syntactic rules, such as code snippets or formatted data.\nGrammar-constrained decoding (GCD) can guarantee that LLM outputs matches such\nrules by masking out tokens that will provably lead to outputs that do not\nbelong to a specified context-free grammar (CFG). To guarantee soundness, GCD\nalgorithms have to compute how a given LLM subword tokenizer can align with the\ntokens used\n by a given context-free grammar and compute token masks based on this\ninformation. Doing so efficiently is challenging and existing GCD algorithms\nrequire tens of minutes to preprocess common grammars. We present a new GCD\nalgorithm together with an implementation that offers 17.71x faster offline\npreprocessing than existing approaches while preserving state-of-the-art\nefficiency in online mask computation.",
"pdf_url": "http://arxiv.org/pdf/2502.05111v1",
"entry_id": "http://arxiv.org/abs/2502.05111v1",
"categories": [
"cs.CL",
"cs.AI"
]
}

papers/2502_05130v1.json Normal file

@ -0,0 +1,25 @@
{
"title": "Latent Swap Joint Diffusion for Long-Form Audio Generation",
"authors": [
"Yusheng Dai",
"Chenxi Wang",
"Chang Li",
"Chen Wang",
"Jun Du",
"Kewei Li",
"Ruoyu Wang",
"Jiefeng Ma",
"Lei Sun",
"Jianqing Gao"
],
"abstract": "Previous work on long-form audio generation using global-view diffusion or\niterative generation demands significant training or inference costs. While\nrecent advancements in multi-view joint diffusion for panoramic generation\nprovide an efficient option, they struggle with spectrum generation with severe\noverlap distortions and high cross-view consistency costs. We initially explore\nthis phenomenon through the connectivity inheritance of latent maps and uncover\nthat averaging operations excessively smooth the high-frequency components of\nthe latent map. To address these issues, we propose Swap Forward (SaFa), a\nframe-level latent swap framework that synchronizes multiple diffusions to\nproduce a globally coherent long audio with more spectrum details in a\nforward-only manner. At its core, the bidirectional Self-Loop Latent Swap is\napplied between adjacent views, leveraging stepwise diffusion trajectory to\nadaptively enhance high-frequency components without disrupting low-frequency\ncomponents. Furthermore, to ensure cross-view consistency, the unidirectional\nReference-Guided Latent Swap is applied between the reference and the\nnon-overlap regions of each subview during the early stages, providing\ncentralized trajectory guidance. Quantitative and qualitative experiments\ndemonstrate that SaFa significantly outperforms existing joint diffusion\nmethods and even training-based long audio generation models. Moreover, we find\nthat it also adapts well to panoramic generation, achieving comparable\nstate-of-the-art performance with greater efficiency and model\ngeneralizability. Project page is available at https://swapforward.github.io/.",
"pdf_url": "http://arxiv.org/pdf/2502.05130v1",
"entry_id": "http://arxiv.org/abs/2502.05130v1",
"categories": [
"cs.SD",
"cs.AI",
"cs.CV",
"cs.MM",
"eess.AS"
]
}

papers/2502_05147v1.json Normal file

@ -0,0 +1,17 @@
{
"title": "LP-DETR: Layer-wise Progressive Relations for Object Detection",
"authors": [
"Zhengjian Kang",
"Ye Zhang",
"Xiaoyu Deng",
"Xintao Li",
"Yongzhe Zhang"
],
"abstract": "This paper presents LP-DETR (Layer-wise Progressive DETR), a novel approach\nthat enhances DETR-based object detection through multi-scale relation\nmodeling. Our method introduces learnable spatial relationships between object\nqueries through a relation-aware self-attention mechanism, which adaptively\nlearns to balance different scales of relations (local, medium and global)\nacross decoder layers. This progressive design enables the model to effectively\ncapture evolving spatial dependencies throughout the detection pipeline.\nExtensive experiments on COCO 2017 dataset demonstrate that our method improves\nboth convergence speed and detection accuracy compared to standard\nself-attention module. The proposed method achieves competitive results,\nreaching 52.3\\% AP with 12 epochs and 52.5\\% AP with 24 epochs using ResNet-50\nbackbone, and further improving to 58.0\\% AP with Swin-L backbone. Furthermore,\nour analysis reveals an interesting pattern: the model naturally learns to\nprioritize local spatial relations in early decoder layers while gradually\nshifting attention to broader contexts in deeper layers, providing valuable\ninsights for future research in object detection.",
"pdf_url": "http://arxiv.org/pdf/2502.05147v1",
"entry_id": "http://arxiv.org/abs/2502.05147v1",
"categories": [
"cs.CV",
"cs.AI"
]
}

papers/2502_05151v1.json Normal file

@ -0,0 +1,28 @@
{
"title": "Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation",
"authors": [
"Steffen Eger",
"Yong Cao",
"Jennifer D'Souza",
"Andreas Geiger",
"Christian Greisinger",
"Stephanie Gross",
"Yufang Hou",
"Brigitte Krenn",
"Anne Lauscher",
"Yizhi Li",
"Chenghua Lin",
"Nafise Sadat Moosavi",
"Wei Zhao",
"Tristan Miller"
],
"abstract": "With the advent of large multimodal language models, science is now at a\nthreshold of an AI-based technological transformation. Recently, a plethora of\nnew AI models and tools has been proposed, promising to empower researchers and\nacademics worldwide to conduct their research more effectively and efficiently.\nThis includes all aspects of the research cycle, especially (1) searching for\nrelevant literature; (2) generating research ideas and conducting\nexperimentation; generating (3) text-based and (4) multimodal content (e.g.,\nscientific figures and diagrams); and (5) AI-based automatic peer review. In\nthis survey, we provide an in-depth overview over these exciting recent\ndevelopments, which promise to fundamentally alter the scientific research\nprocess for good. Our survey covers the five aspects outlined above, indicating\nrelevant datasets, methods and results (including evaluation) as well as\nlimitations and scope for future research. Ethical concerns regarding\nshortcomings of these tools and potential for misuse (fake science, plagiarism,\nharms to research integrity) take a particularly prominent place in our\ndiscussion. We hope that our survey will not only become a reference guide for\nnewcomers to the field but also a catalyst for new AI-based initiatives in the\narea of \"AI4Science\".",
"pdf_url": "http://arxiv.org/pdf/2502.05151v1",
"entry_id": "http://arxiv.org/abs/2502.05151v1",
"categories": [
"cs.CL",
"cs.AI",
"cs.CV",
"cs.LG"
]
}

papers/2502_05172v1.json Normal file

@ -0,0 +1,24 @@
{
"title": "Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient",
"authors": [
"Jan Ludziejewski",
"Maciej Pióro",
"Jakub Krajewski",
"Maciej Stefaniak",
"Michał Krutul",
"Jan Małaśnicki",
"Marek Cygan",
"Piotr Sankowski",
"Kamil Adamczewski",
"Piotr Miłoś",
"Sebastian Jaszczur"
],
"abstract": "Mixture of Experts (MoE) architectures have significantly increased\ncomputational efficiency in both research and real-world applications of\nlarge-scale machine learning models. However, their scalability and efficiency\nunder memory constraints remain relatively underexplored. In this work, we\npresent joint scaling laws for dense and MoE models, incorporating key factors\nsuch as the number of active parameters, dataset size, and the number of\nexperts. Our findings provide a principled framework for selecting the optimal\nMoE configuration under fixed memory and compute budgets. Surprisingly, we show\nthat MoE models can be more memory-efficient than dense models, contradicting\nconventional wisdom. To derive and validate the theoretical predictions of our\nscaling laws, we conduct over 280 experiments with up to 2.7B active parameters\nand up to 5B total parameters. These results offer actionable insights for\ndesigning and deploying MoE models in practical large-scale training scenarios.",
"pdf_url": "http://arxiv.org/pdf/2502.05172v1",
"entry_id": "http://arxiv.org/abs/2502.05172v1",
"categories": [
"cs.LG",
"cs.AI",
"cs.CL"
]
}

papers/2502_05174v1.json Normal file

@ -0,0 +1,17 @@
{
"title": "MELON: Indirect Prompt Injection Defense via Masked Re-execution and Tool Comparison",
"authors": [
"Kaijie Zhu",
"Xianjun Yang",
"Jindong Wang",
"Wenbo Guo",
"William Yang Wang"
],
"abstract": "Recent research has explored that LLM agents are vulnerable to indirect\nprompt injection (IPI) attacks, where malicious tasks embedded in\ntool-retrieved information can redirect the agent to take unauthorized actions.\nExisting defenses against IPI have significant limitations: either require\nessential model training resources, lack effectiveness against sophisticated\nattacks, or harm the normal utilities. We present MELON (Masked re-Execution\nand TooL comparisON), a novel IPI defense. Our approach builds on the\nobservation that under a successful attack, the agent's next action becomes\nless dependent on user tasks and more on malicious tasks. Following this, we\ndesign MELON to detect attacks by re-executing the agent's trajectory with a\nmasked user prompt modified through a masking function. We identify an attack\nif the actions generated in the original and masked executions are similar. We\nalso include three key designs to reduce the potential false positives and\nfalse negatives. Extensive evaluation on the IPI benchmark AgentDojo\ndemonstrates that MELON outperforms SOTA defenses in both attack prevention and\nutility preservation. Moreover, we show that combining MELON with a SOTA prompt\naugmentation defense (denoted as MELON-Aug) further improves its performance.\nWe also conduct a detailed ablation study to validate our key designs.",
"pdf_url": "http://arxiv.org/pdf/2502.05174v1",
"entry_id": "http://arxiv.org/abs/2502.05174v1",
"categories": [
"cs.CR",
"cs.AI"
]
}


@ -22,13 +22,16 @@ def setup_logging(log_dir: str = "logs"):
         },
         'simple': {
             'format': '%(levelname)s: %(message)s'
+        },
+        'error': {
+            'format': '%(levelname)s: %(message)s\n%(exc_info)s'
         }
     },
     'handlers': {
         'console': {
             'class': 'logging.StreamHandler',
-            'level': 'WARNING',  # Only show warnings and above in console
-            'formatter': 'simple',
+            'level': 'INFO',  # Show info and above in console
+            'formatter': 'error',
             'stream': 'ext://sys.stdout'
         },
         'debug_file': {
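
As a standalone illustration of the `dictConfig` pattern being tweaked here, a minimal sketch with a console and a file handler (handler and formatter names are illustrative, not the project's exact configuration). One caveat: `%(exc_info)s` in a format string renders the raw exception tuple (or `None`), so tracebacks are normally attached via `logger.exception(...)` rather than the formatter.

```python
import logging
import logging.config
import os

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "simple": {"format": "%(levelname)s: %(message)s"},
        "detailed": {"format": "%(asctime)s %(name)s %(levelname)s: %(message)s"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "level": "INFO",  # show INFO and above on stdout
            "formatter": "simple",
            "stream": "ext://sys.stdout",
        },
        "debug_file": {
            "class": "logging.FileHandler",
            "level": "DEBUG",
            "formatter": "detailed",
            "filename": "logs/debug.log",
        },
    },
    "root": {"level": "DEBUG", "handlers": ["console", "debug_file"]},
}

if __name__ == "__main__":
    os.makedirs("logs", exist_ok=True)  # FileHandler needs the directory to exist
    logging.config.dictConfig(LOGGING)
    logging.getLogger(__name__).info("logging configured")
```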


@ -1,11 +1,21 @@
"""Client for interacting with the arXiv API."""
import arxiv import arxiv
from typing import List, Dict, Any from typing import List, Dict, Any, Optional
from tenacity import retry, stop_after_attempt, wait_exponential from tenacity import retry, stop_after_attempt, wait_exponential
import aiohttp import aiohttp
from datetime import datetime, timedelta
import logging
logger = logging.getLogger(__name__)
class ArxivClient: class ArxivClient:
"""
Client for interacting with the arXiv API.
Documentation: https://info.arxiv.org/help/api/user-manual.html
"""
def __init__(self): def __init__(self):
self.client = arxiv.Client() """Initialize the ArxivClient."""
self._session = None self._session = None
async def _get_session(self) -> aiohttp.ClientSession: async def _get_session(self) -> aiohttp.ClientSession:
@ -15,42 +25,116 @@ class ArxivClient:
return self._session return self._session
async def close(self): async def close(self):
"""Close the session.""" """Close any open resources."""
if self._session is not None: if self._session:
await self._session.close() await self._session.close()
self._session = None self._session = None
async def __aenter__(self): async def __aenter__(self):
"""Async context manager entry.""" """Async context manager entry."""
await self._get_session()
return self return self
async def __aexit__(self, exc_type, exc_val, exc_tb): async def __aexit__(self, exc_type, exc_val, exc_tb):
"""Async context manager exit.""" """Async context manager exit."""
await self.close() await self.close()
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10)) async def fetch_papers(self, query: str = None, category: str = None,
async def fetch_papers(self, query: str, max_results: int = 10) -> List[Dict[Any, Any]]: start_date = None, end_date = None, max_results: int = 100) -> List[Dict[str, Any]]:
"""Fetch papers from arXiv based on query.""" """
Fetch papers from arXiv based on query and/or category.
Args:
query (str, optional): Search query
category (str, optional): arXiv category (e.g., 'cs.AI')
start_date: Start date for paper search (naive datetime)
end_date: End date for paper search (naive datetime)
max_results (int): Maximum number of results to return
Returns:
List[Dict[str, Any]]: List of paper metadata
"""
# Build the search query according to arXiv API syntax
search_terms = []
# Add category constraint if specified
if category:
search_terms.append(f"cat:{category}")
# Add custom query if specified
if query:
formatted_query = query.replace(' AND ', ' AND ').replace(' OR ', ' OR ').replace(' NOT ', ' NOT ')
search_terms.append(f"({formatted_query})")
# Add date range constraint if specified
if start_date and end_date:
date_range = (
f"submittedDate:[{start_date.strftime('%Y%m%d')}0000 TO "
f"{end_date.strftime('%Y%m%d')}2359]"
)
search_terms.append(date_range)
logger.info(f"Searching for papers between {start_date.strftime('%Y-%m-%d')} and {end_date.strftime('%Y-%m-%d')}")
# Combine search terms
final_query = ' AND '.join(f'({term})' for term in search_terms) if search_terms else 'all:*'
# Configure and execute search
search = arxiv.Search( search = arxiv.Search(
query=query, query=final_query,
max_results=max_results, max_results=max_results,
sort_by=arxiv.SortCriterion.SubmittedDate sort_by=arxiv.SortCriterion.SubmittedDate,
sort_order=arxiv.SortOrder.Descending
) )
results = [] papers = []
# Use list() to get all results at once instead of async iteration try:
papers = list(self.client.results(search)) client = arxiv.Client()
for paper in papers: @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
results.append({ def fetch_results():
'id': paper.entry_id.split('/')[-1], # Get the ID without the full URL return list(client.results(search))
'title': paper.title,
'authors': [author.name for author in paper.authors], results = fetch_results()
'summary': paper.summary,
'pdf_url': paper.pdf_url, for result in results:
'published': paper.published.isoformat() if paper.published else None, papers.append({
'updated': paper.updated.isoformat() if paper.updated else None, "title": result.title,
'categories': paper.categories "authors": [author.name for author in result.authors],
"abstract": result.summary,
"pdf_url": result.pdf_url,
"entry_id": result.entry_id,
"categories": result.categories
}) })
return results logger.info(f"Successfully fetched {len(papers)} papers")
return papers
except Exception as e:
logger.error(f"Error fetching papers: {e}")
raise
async def get_paper_by_id(self, paper_id: str) -> Optional[Dict[str, Any]]:
"""Fetch a single paper by arXiv ID."""
try:
# Extract arXiv ID from URL format
arxiv_id = paper_id.split('/')[-1].replace('v', '') # Remove version suffix
search = arxiv.Search(id_list=[arxiv_id])
client = arxiv.Client()
results = list(client.results(search))
if not results:
return None
result = results[0]
return {
"entry_id": result.entry_id,
"title": result.title,
"authors": [author.name for author in result.authors],
"summary": result.summary,
"pdf_url": result.pdf_url,
"categories": result.categories
}
except Exception as e:
logger.error(f"Failed to fetch paper {paper_id}: {e}")
return None
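
Boiled down, the query construction above joins optional `cat:`, free-text, and `submittedDate:` terms with `AND` and hands the result to the `arxiv` package. A minimal sketch of that idea, assuming the `arxiv` PyPI package (2.x) whose `Client`/`Search`/`SortCriterion` API the class wraps; the helper name is illustrative:

```python
from datetime import datetime, timedelta
from typing import Optional

import arxiv

def build_query(category: Optional[str] = None, query: Optional[str] = None,
                start: Optional[datetime] = None, end: Optional[datetime] = None) -> str:
    """Combine optional category, free-text, and date-range terms with AND."""
    terms = []
    if category:
        terms.append(f"cat:{category}")
    if query:
        terms.append(f"({query})")
    if start and end:
        terms.append(f"submittedDate:[{start:%Y%m%d}0000 TO {end:%Y%m%d}2359]")
    return " AND ".join(f"({t})" for t in terms) if terms else "all:*"

if __name__ == "__main__":
    end = datetime.now()
    start = end - timedelta(days=7)
    search = arxiv.Search(
        query=build_query(category="cs.AI", start=start, end=end),
        max_results=5,
        sort_by=arxiv.SortCriterion.SubmittedDate,
        sort_order=arxiv.SortOrder.Descending,
    )
    for result in arxiv.Client().results(search):
        print(result.entry_id, result.title)
```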


@ -2,26 +2,32 @@ import asyncio
import signal import signal
import logging import logging
import sys import sys
from src.agent_controller import AgentController import os
import json
from datetime import datetime, timedelta
from typing import List, Optional
from zoneinfo import ZoneInfo
from src.utils.agent_controller import AgentController
from src.data_acquisition.arxiv_client import ArxivClient
from src.utils.debug import tracker from src.utils.debug import tracker
from src.config.logging_config import setup_logging from src.config.logging_config import setup_logging
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
def handle_sigint(): def handle_sigint(signum, frame):
"""Handle interrupt signal.""" """Handle interrupt signal."""
logger.info("Received interrupt signal, canceling tasks...") logger.info("Received interrupt signal, exiting...")
for task in asyncio.all_tasks(): sys.exit(0)
task.cancel()
async def cleanup_resources(loop: asyncio.AbstractEventLoop): async def cleanup_resources():
"""Clean up any remaining resources.""" """Clean up any remaining resources."""
logger.debug("Starting resource cleanup...") logger.debug("Starting resource cleanup...")
tracker.print_active_resources() tracker.print_active_resources()
try: try:
# Cancel all tasks # Cancel all tasks except current
pending = asyncio.all_tasks(loop) pending = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
if pending: if pending:
logger.debug(f"Cancelling {len(pending)} pending tasks...") logger.debug(f"Cancelling {len(pending)} pending tasks...")
for task in pending: for task in pending:
@ -29,107 +35,168 @@ async def cleanup_resources(loop: asyncio.AbstractEventLoop):
task.cancel() task.cancel()
# Wait for tasks to complete with timeout # Wait for tasks to complete with timeout
try:
logger.debug("Waiting for tasks to complete...") logger.debug("Waiting for tasks to complete...")
await asyncio.wait(pending, timeout=5) await asyncio.wait(pending, timeout=5)
except asyncio.CancelledError:
logger.debug("Task wait cancelled")
# Force close any remaining tasks logger.debug("Cleanup completed successfully")
for task in pending: except Exception as e:
if not task.done(): logger.error(f"Error during cleanup: {e}")
logger.warning(f"Force closing task: {task.get_name()}")
async def fetch_papers(days: int = 7, categories: Optional[List[str]] = None) -> None:
"""Fetch and analyze papers from arXiv."""
logger.info(f"Fetching papers from the last {days} days")
if categories is None:
categories = ["cs.AI"]
async with ArxivClient() as client, AgentController() as agent:
try: try:
task.close() # Calculate date range
except Exception as e: end_date = datetime.now()
logger.error(f"Error closing task: {str(e)}") start_date = end_date - timedelta(days=days)
logger.info(f"Date range: {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")
# Fetch and analyze papers for each category
for category in categories:
logger.info(f"Fetching papers for category: {category}")
papers = await client.fetch_papers(category=category,
start_date=start_date,
end_date=end_date)
if not papers:
print(f"\nNo recent papers found in {category}")
continue
print(f"\nProcessing papers in {category}:")
print("=" * 80)
print()
for i, paper in enumerate(papers, 1):
print(f"Processing paper {i}/{len(papers)}: {paper['title']}")
try:
# Analyze paper
analysis = await agent.analyze_paper(paper)
# Print analysis
print("\nAnalysis:")
print(f"Summary: {analysis.get('summary', 'No summary available')}")
print("\nTechnical Concepts:")
print(analysis.get('technical_concepts', 'No technical concepts available'))
# Print fluff analysis
fluff = analysis.get('fluff', {})
score = fluff.get('score')
if score is not None:
color = '\033[92m' if score < 30 else '\033[93m' if score < 70 else '\033[91m'
reset = '\033[0m'
print(f"\nFluff Score: {color}{score}/100{reset}")
print("Analysis:")
print(fluff.get('explanation', 'No explanation available'))
except Exception as e: except Exception as e:
logger.error(f"Error during cleanup: {str(e)}") logger.error(f"Error processing paper: {e}")
finally: print("Failed to analyze paper")
logger.debug("Resource cleanup completed")
print("-" * 80)
print()
except Exception as e:
logger.error(f"Error fetching papers: {e}")
raise
async def process_query(query: str) -> None:
"""Process a search query and display results."""
async with AgentController() as agent:
try:
results = await agent.process_query(query)
if not results:
print("\nNo matching papers found.")
return
print(f"\nFound {len(results)} matching papers:")
print("=" * 80)
for i, paper in enumerate(results, 1):
print(f"\n{i}. {paper['title']}")
print(f" Authors: {', '.join(paper['authors'])}")
analysis = paper.get('analysis', {})
print("\nSummary:")
print(analysis.get('summary', 'No summary available'))
print("\nTechnical Concepts:")
print(analysis.get('technical_concepts', 'No technical concepts available'))
fluff = analysis.get('fluff', {})
score = fluff.get('score')
if score is not None:
color = '\033[92m' if score < 30 else '\033[93m' if score < 70 else '\033[91m'
reset = '\033[0m'
print(f"\nFluff Score: {color}{score}/100{reset}")
print("Analysis:")
print(fluff.get('explanation', 'No explanation available'))
print("-" * 80)
except Exception as e:
logger.error(f"Error processing query: {e}")
raise
async def main(): async def main():
"""Main application entry point.""" """Main application entry point."""
try: parser = argparse.ArgumentParser(description='Fetch and analyze arXiv papers')
logger.debug("Starting main application...") subparsers = parser.add_subparsers(dest='command', help='Command to run')
async with AgentController() as agent:
query = "Large Language Models recent advances"
logger.debug(f"Processing query: {query}")
try: # Fetch papers command
# Process query and get both new and existing papers fetch_parser = subparsers.add_parser('fetch', help='Fetch recent papers')
results = await agent.process_query(query) fetch_parser.add_argument('--days', type=int, default=7,
help='Number of days to look back')
fetch_parser.add_argument('--categories', nargs='+', default=['cs.AI'],
help='arXiv categories to fetch')
if results: # Search papers command
print("\n=== Analysis Results ===") search_parser = subparsers.add_parser('search', help='Search papers')
for paper in results: search_parser.add_argument('query', help='Search query')
print(f"\nTitle: {paper['title']}")
print(f"Authors: {paper.get('authors', 'Unknown')}")
if 'analysis' in paper: args = parser.parse_args()
# This is a newly processed paper
print("Status: Newly Processed") if args.command == 'fetch':
print("\nSummary:", paper['analysis'].get('summary', 'No summary available')) await fetch_papers(days=args.days, categories=args.categories)
print("\nTechnical Concepts:", paper['analysis'].get('technical_concepts', 'No concepts available')) elif args.command == 'search':
await process_query(args.query)
else: else:
# This is an existing paper from the database parser.print_help()
print("Status: Retrieved from Database")
print("\nRelevant Text:", paper.get('relevant_text', 'No text available')[:500] + "...")
else:
print("\nNo papers found for the query.")
except Exception as e:
logger.error(f"Error in main: {str(e)}")
raise
finally:
logger.debug("Main execution completed")
def run_main(): def run_main():
"""Run the main application.""" """Run the main application."""
# Set up logging # Set up logging
setup_logging() setup_logging()
# Create new event loop # Set up signal handlers for Windows
logger.debug("Creating new event loop...") signal.signal(signal.SIGINT, handle_sigint)
loop = asyncio.new_event_loop() signal.signal(signal.SIGTERM, handle_sigint)
asyncio.set_event_loop(loop)
tracker.track_loop(loop)
# Set up signal handlers
signals = (signal.SIGTERM, signal.SIGINT)
for s in signals:
loop.add_signal_handler(
s, lambda s=s: handle_sigint()
)
try: try:
logger.debug("Running main function...") # Run main application
loop.run_until_complete(main()) asyncio.run(main())
except asyncio.CancelledError: except KeyboardInterrupt:
logger.info("Main execution cancelled") logger.info("Application cancelled by user")
except Exception as e: except Exception as e:
logger.error(f"Error in run_main: {str(e)}") logger.error(f"Application error: {e}")
raise sys.exit(1)
finally: finally:
# Clean up
try: try:
# Run cleanup asyncio.run(cleanup_resources())
logger.debug("Running final cleanup...")
loop.run_until_complete(cleanup_resources(loop))
# Close the loop
logger.debug("Closing event loop...")
loop.run_until_complete(loop.shutdown_asyncgens())
loop.run_until_complete(loop.shutdown_default_executor())
# Remove the event loop
asyncio.set_event_loop(None)
loop.close()
tracker.untrack_loop(loop)
logger.debug("Cleanup completed successfully")
except Exception as e: except Exception as e:
logger.error(f"Error during final cleanup: {str(e)}") logger.error(f"Error during cleanup: {e}")
finally:
# Force exit to ensure all resources are released
sys.exit(0)
if __name__ == "__main__": if __name__ == "__main__":
import argparse
run_main() run_main()
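
The new command-line surface is a standard `argparse` sub-command dispatch wrapped in `asyncio.run`. A stripped-down sketch of that shape with placeholder handlers (the real ones call `ArxivClient` and `AgentController`); note the sketch imports `argparse` at module level, which the diff above only shows inside the `__main__` guard:

```python
import argparse
import asyncio
from typing import List

async def fetch_papers(days: int, categories: List[str]) -> None:
    print(f"would fetch {categories} from the last {days} days")  # placeholder

async def process_query(query: str) -> None:
    print(f"would search stored papers for: {query}")  # placeholder

async def main() -> None:
    parser = argparse.ArgumentParser(description="Fetch and analyze arXiv papers")
    sub = parser.add_subparsers(dest="command", help="Command to run")

    fetch = sub.add_parser("fetch", help="Fetch recent papers")
    fetch.add_argument("--days", type=int, default=7, help="Number of days to look back")
    fetch.add_argument("--categories", nargs="+", default=["cs.AI"], help="arXiv categories")

    search = sub.add_parser("search", help="Search stored papers")
    search.add_argument("query", help="Search query")

    args = parser.parse_args()
    if args.command == "fetch":
        await fetch_papers(args.days, args.categories)
    elif args.command == "search":
        await process_query(args.query)
    else:
        parser.print_help()

if __name__ == "__main__":
    asyncio.run(main())
```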


@ -79,20 +79,29 @@ class PaperFetcher:
logger.info(f"Fetching papers from categories: {categories}") logger.info(f"Fetching papers from categories: {categories}")
# Initialize clients # Initialize agent
arxiv_client = ArxivClient()
agent = AgentController() agent = AgentController()
try: try:
await agent.initialize() await agent.initialize()
# Calculate date range # Calculate date range (use a minimum of 7 days if days is too small)
end_date = datetime.now() end_date = datetime.now()
days = max(days, 7) # Ensure we look at least 7 days back
start_date = end_date - timedelta(days=days) start_date = end_date - timedelta(days=days)
logger.info(f"Searching for papers from {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")
# Create output directory if it doesn't exist
output_dir = os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(__file__))), "data", "papers")
os.makedirs(output_dir, exist_ok=True)
# Create and use ArxivClient as async context manager
async with ArxivClient() as arxiv_client:
# Fetch and process papers for each category # Fetch and process papers for each category
for category in categories: for category in categories:
logger.info(f"Processing category: {category}") logger.info(f"Processing category: {category}")
try:
papers = await arxiv_client.fetch_papers( papers = await arxiv_client.fetch_papers(
category=category, category=category,
start_date=start_date, start_date=start_date,
@ -100,13 +109,47 @@ class PaperFetcher:
) )
if not papers: if not papers:
logger.info(f"No new papers found for category {category}") logger.info(f"No papers found for category {category}")
continue continue
logger.info(f"Found {len(papers)} papers in {category}") logger.info(f"Found {len(papers)} papers in {category}")
for paper in papers:
await agent.process_paper(paper)
# Save papers to JSON file
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = os.path.join(output_dir, f"papers_{category.replace('.', '_')}_{timestamp}.json")
with open(output_file, 'w', encoding='utf-8') as f:
json.dump(papers, f, indent=2, ensure_ascii=False)
logger.info(f"Saved papers to {output_file}")
# Display paper information
print(f"\nRecent papers in {category}:")
print("=" * 80)
for i, paper in enumerate(papers, 1):
print(f"\n{i}. {paper['title']}")
print(f" Authors: {', '.join(paper['authors'])}")
print(f" Published: {paper['published']}")
print(f" URL: {paper['pdf_url']}")
print(f" Categories: {', '.join(paper['categories'])}")
if paper.get('doi'):
print(f" DOI: {paper['doi']}")
print("-" * 80)
for paper in papers:
try:
analysis = await agent.analyze_paper(paper)
# TODO: Store analysis results
except Exception as e:
logger.error(f"Failed to analyze paper {paper.get('title', 'Unknown')}: {e}")
continue
except Exception as e:
logger.error(f"Failed to fetch papers for category {category}: {e}")
continue
except Exception as e:
logger.error(f"Error in fetch_papers: {e}")
finally: finally:
await agent.close() await agent.close()
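
The persistence step above reduces to dumping each category's results into a timestamped JSON file. A short sketch of just that piece, with an illustrative output directory:

```python
import json
import os
from datetime import datetime
from typing import Dict, List

def save_papers(papers: List[Dict], category: str, output_dir: str = "data/papers") -> str:
    """Write fetched paper metadata to a timestamped JSON file and return its path."""
    os.makedirs(output_dir, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"papers_{category.replace('.', '_')}_{timestamp}.json"
    path = os.path.join(output_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(papers, f, indent=2, ensure_ascii=False)
    return path

# Example:
# save_papers([{"title": "Joint MoE Scaling Laws", "authors": ["..."]}], "cs.LG")
```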


@ -19,6 +19,8 @@ CREATE TABLE IF NOT EXISTS papers (
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP, created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
); );
CREATE UNIQUE INDEX IF NOT EXISTS papers_id_idx ON papers (id);
""" """
class PaperStore: class PaperStore:
@ -59,17 +61,16 @@ class PaperStore:
if self.pool: if self.pool:
await self.pool.close() await self.pool.close()
async def store_paper(self, paper: Dict) -> None: async def store_paper(self, metadata: Dict) -> None:
"""Store a paper and its analysis in the database.""" """Store paper metadata in the database."""
if not self.pool: if not self.pool:
await self.initialize() await self.initialize()
async with self.pool.acquire() as conn: try:
# Extract analysis if present logger.info(f"Storing paper metadata: {metadata.get('title')} (ID: {metadata.get('id')})")
analysis = paper.get('analysis', {})
fluff_analysis = analysis.get('fluff', {})
# Store paper async with self.pool.acquire() as conn:
# Store paper metadata
await conn.execute(''' await conn.execute('''
INSERT INTO papers ( INSERT INTO papers (
id, title, authors, summary, technical_concepts, id, title, authors, summary, technical_concepts,
@ -86,45 +87,52 @@ class PaperStore:
pdf_url = EXCLUDED.pdf_url, pdf_url = EXCLUDED.pdf_url,
updated_at = CURRENT_TIMESTAMP updated_at = CURRENT_TIMESTAMP
''', ''',
paper['id'], metadata['id'],
paper['title'], metadata['title'],
paper['authors'], metadata['authors'],
analysis.get('summary'), metadata.get('summary'),
analysis.get('technical_concepts'), metadata.get('technical_concepts'),
fluff_analysis.get('score'), metadata.get('fluff_score'),
fluff_analysis.get('explanation'), metadata.get('fluff_explanation'),
paper.get('pdf_url')) metadata.get('pdf_url'))
logger.info(f"Successfully stored paper metadata {metadata.get('id')}")
except Exception as e:
logger.error(f"Error storing paper metadata {metadata.get('id')}: {str(e)}")
raise
async def get_paper(self, paper_id: str) -> Optional[Dict]: async def get_paper(self, paper_id: str) -> Optional[Dict]:
"""Retrieve a paper by its ID.""" """Retrieve paper metadata by its ID."""
if not self.pool: if not self.pool:
await self.initialize() await self.initialize()
async with self.pool.acquire() as conn: async with self.pool.acquire() as conn:
# Get paper metadata
row = await conn.fetchrow(''' row = await conn.fetchrow('''
SELECT * FROM papers WHERE id = $1 SELECT id, title, authors, summary, technical_concepts,
fluff_score, fluff_explanation, pdf_url,
created_at, updated_at
FROM papers
WHERE id = $1
''', paper_id) ''', paper_id)
if row: if row:
return dict(row) return dict(row)
return None return None
async def get_all_paper_ids(self) -> List[str]: async def get_all_papers(self) -> List[Dict]:
"""Get all paper IDs in the database.""" """Retrieve all papers ordered by creation date."""
if not self.pool:
await self.initialize()
async with self.pool.acquire() as conn:
rows = await conn.fetch('SELECT id FROM papers')
return [row['id'] for row in rows]
async def get_papers_by_ids(self, paper_ids: List[str]) -> List[Dict]:
"""Retrieve multiple papers by their IDs."""
if not self.pool: if not self.pool:
await self.initialize() await self.initialize()
async with self.pool.acquire() as conn: async with self.pool.acquire() as conn:
# Get all papers
rows = await conn.fetch(''' rows = await conn.fetch('''
SELECT * FROM papers WHERE id = ANY($1) SELECT id, title, authors, summary, technical_concepts,
''', paper_ids) fluff_score, fluff_explanation, pdf_url,
created_at, updated_at
FROM papers
ORDER BY created_at DESC
''')
return [dict(row) for row in rows] return [dict(row) for row in rows]
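
Under the hood this storage layer is an `asyncpg` connection pool plus an idempotent upsert (`INSERT ... ON CONFLICT ... DO UPDATE`). A reduced sketch of that pattern with only two columns and a placeholder DSN (the real table also carries the summary, technical-concepts, and fluff columns shown above):

```python
import asyncio
import asyncpg

SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    id TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
"""

UPSERT = """
INSERT INTO papers (id, title) VALUES ($1, $2)
ON CONFLICT (id) DO UPDATE
    SET title = EXCLUDED.title,
        updated_at = CURRENT_TIMESTAMP;
"""

async def main() -> None:
    # DSN is illustrative; in the project it comes from the DB_* / DATABASE_URL settings.
    pool = await asyncpg.create_pool("postgresql://paper_analysis:xyz@localhost:5432/paper_analysis")
    try:
        async with pool.acquire() as conn:
            await conn.execute(SCHEMA)
            await conn.execute(UPSERT, "http://arxiv.org/abs/2502.05110v1",
                               "ApplE: An Applied Ethics Ontology with Event Context")
            row = await conn.fetchrow("SELECT id, title, updated_at FROM papers WHERE id = $1",
                                      "http://arxiv.org/abs/2502.05110v1")
            print(dict(row))
    finally:
        await pool.close()

if __name__ == "__main__":
    asyncio.run(main())
```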


@ -1,6 +1,9 @@
 from typing import List, Dict
 import chromadb
 from chromadb.config import Settings
+import logging
+
+logger = logging.getLogger(__name__)
 
 class VectorStore:
     def __init__(self, persist_directory: str = "chroma_db"):
@ -15,6 +18,11 @@ class VectorStore:
             metadata={"hnsw:space": "cosine"}  # Use cosine similarity
         )
 
+    async def close(self):
+        """Close the vector store client."""
+        # ChromaDB doesn't need explicit cleanup
+        pass
+
     def paper_exists(self, paper_id: str) -> bool:
         """Check if a paper already exists in the vector store."""
         try:
@ -61,6 +69,14 @@ class VectorStore:
             ids=ids
         )
 
+    def delete_paper(self, paper_id: str) -> None:
+        """Delete all chunks for a paper."""
+        try:
+            self.collection.delete(where={"paper_id": paper_id})
+            logger.info(f"Deleted vector store entries for paper {paper_id}")
+        except Exception as e:
+            logger.error(f"Failed to delete paper {paper_id}: {e}")
+
     def query_similar(self, query: str, n_results: int = 5) -> Dict:
         """Query for similar chunks."""
         try:
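
For context, the Chroma calls used above (cosine-similarity collection, add chunks, query, delete by `paper_id`) look roughly like this when driven directly. This is a sketch assuming chromadb 0.4+ with `PersistentClient`; the repo itself wires the client up through `chromadb.config.Settings`, and the collection name here is illustrative:

```python
import chromadb

client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection(
    name="papers",                      # illustrative name
    metadata={"hnsw:space": "cosine"},  # use cosine similarity
)

paper_id = "http://arxiv.org/abs/2502.05110v1"
collection.add(
    documents=["Title: ApplE ... Abstract: ...", "Summary: an applied ethics ontology ..."],
    metadatas=[{"paper_id": paper_id, "type": "full_text"},
               {"paper_id": paper_id, "type": "summary"}],
    ids=["2502_05110v1_text", "2502_05110v1_summary"],
)

# Similarity search over the stored chunks.
results = collection.query(query_texts=["ontologies for applied ethics"], n_results=5)
print(results["ids"], results["distances"])

# Remove every chunk belonging to one paper (what delete_paper does above).
collection.delete(where={"paper_id": paper_id})
```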

src/utils/__init__.py Normal file

@ -0,0 +1 @@


@ -0,0 +1,234 @@
"""
Agent Controller for managing AI interactions and paper analysis.
"""
import os
import json
import logging
from typing import Dict, Any, Optional, List
from datetime import datetime
from pathlib import Path
from dotenv import load_dotenv
from src.analysis.llm_analyzer import LLMAnalyzer
from src.storage.paper_store import PaperStore
from src.storage.vector_store import VectorStore
logger = logging.getLogger(__name__)
class AgentController:
"""Controller class for managing AI agent operations."""
def __init__(self):
"""Initialize the agent controller."""
self.initialized = False
self.llm_analyzer = None
self.paper_store = None
self.vector_store = None
self.papers_dir = Path("papers")
async def __aenter__(self):
"""Async context manager entry."""
await self.initialize()
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
"""Async context manager exit."""
await self.close()
return None # Don't suppress exceptions
async def initialize(self):
"""Initialize the agent and prepare it for paper analysis."""
if self.initialized:
return
try:
# Load environment variables
load_dotenv()
# Get API keys from environment
deepseek_key = os.getenv('DEEPSEEK_API_KEY')
if not deepseek_key:
raise ValueError("DEEPSEEK_API_KEY environment variable is required")
# Create papers directory if it doesn't exist
self.papers_dir.mkdir(parents=True, exist_ok=True)
# Initialize components
self.llm_analyzer = LLMAnalyzer(api_key=deepseek_key, provider='deepseek')
self.paper_store = PaperStore()
self.vector_store = VectorStore()
# Initialize database
await self.paper_store.initialize()
self.initialized = True
logger.info("Agent controller initialized successfully")
except Exception as e:
logger.error(f"Failed to initialize agent controller: {e}")
raise
async def analyze_paper(self, paper_data: Dict[str, Any]) -> Dict[str, Any]:
"""
Analyze a research paper using AI capabilities.
Args:
paper_data (dict): Paper metadata and content to analyze
Returns:
dict: Analysis results including summary, technical concepts, and fluff analysis
"""
if not self.initialized:
await self.initialize()
try:
# Get paper text (combine title and abstract)
paper_text = f"Title: {paper_data.get('title', '')}\n\n"
if paper_data.get('abstract'):
paper_text += f"Abstract: {paper_data['abstract']}\n\n"
# Analyze paper using LLM
analysis = await self.llm_analyzer.analyze_paper(paper_text)
# Store paper in databases
paper_id = paper_data.get('entry_id') # Use entry_id from arXiv
if paper_id:
logger.debug(f"Checking PostgreSQL for paper {paper_id}")
existing = await self.paper_store.get_paper(paper_id)
if existing:
logger.info(f"Paper {paper_id} already in database - skipping")
return existing
logger.debug(f"Checking vector store for paper {paper_id}")
if self.vector_store.paper_exists(paper_id):
logger.warning(f"Found orphaned vector entry for {paper_id} - repairing")
await self._repair_orphaned_paper(paper_id, paper_data)
return {}
# Clean paper_id to use as filename
safe_id = paper_id.split('/')[-1].replace('.', '_')
# Save paper content to file
paper_path = self.papers_dir / f"{safe_id}.json"
with open(paper_path, 'w', encoding='utf-8') as f:
json.dump(paper_data, f, indent=2, ensure_ascii=False)
# Store metadata in PostgreSQL
metadata = {
'id': paper_id,
'title': paper_data['title'],
'authors': paper_data['authors'],
'summary': analysis.get('summary'),
'technical_concepts': analysis.get('technical_concepts'),
'fluff_score': analysis.get('fluff', {}).get('score'),
'fluff_explanation': analysis.get('fluff', {}).get('explanation'),
'pdf_url': paper_data.get('pdf_url')
}
await self.paper_store.store_paper(metadata)
# Store in vector database for similarity search
chunks = [
paper_text, # Store full text
analysis.get('summary', ''), # Store summary
analysis.get('technical_concepts', '') # Store technical concepts
]
chunk_metadata = [
{'paper_id': paper_id, 'type': 'full_text'},
{'paper_id': paper_id, 'type': 'summary'},
{'paper_id': paper_id, 'type': 'technical_concepts'}
]
chunk_ids = [
f"{safe_id}_text",
f"{safe_id}_summary",
f"{safe_id}_concepts"
]
self.vector_store.add_chunks(chunks, chunk_metadata, chunk_ids)
return analysis
except Exception as e:
logger.error(f"Failed to analyze paper: {e}")
raise
async def process_query(self, query: str) -> List[Dict[str, Any]]:
"""
Process a search query and return relevant papers.
Args:
query (str): Search query string
Returns:
list: List of relevant papers with their analysis
"""
if not self.initialized:
await self.initialize()
try:
# Get similar papers from vector store
results = self.vector_store.query_similar(query)
# Get unique paper IDs from results
paper_ids = set()
for metadata in results['metadatas']:
if metadata and 'paper_id' in metadata:
paper_ids.add(metadata['paper_id'])
# Get paper metadata from PostgreSQL
papers = []
for paper_id in paper_ids:
paper = await self.paper_store.get_paper(paper_id)
if paper:
# Load full paper data if needed
safe_id = paper_id.split('/')[-1].replace('.', '_')
paper_path = self.papers_dir / f"{safe_id}.json"
if paper_path.exists():
with open(paper_path, 'r', encoding='utf-8') as f:
full_paper = json.load(f)
paper['full_data'] = full_paper
papers.append(paper)
return papers
except Exception as e:
logger.error(f"Error processing query: {e}")
raise
async def close(self):
"""Clean up resources."""
if not self.initialized:
return
try:
if self.llm_analyzer:
await self.llm_analyzer.close()
if self.paper_store:
await self.paper_store.close()
if self.vector_store:
await self.vector_store.close()
self.initialized = False
logger.info("Agent controller closed successfully")
except Exception as e:
logger.error(f"Failed to close agent controller: {e}")
raise
async def _repair_orphaned_paper(self, paper_id: str, paper_data: Dict):
"""Repair a paper that exists in vector store but not PostgreSQL."""
try:
# Try to get fresh metadata from arXiv
fresh_data = await self.arxiv_client.get_paper_by_id(paper_id)
if fresh_data:
# Store in PostgreSQL with original analysis data
await self.paper_store.store_paper({
**fresh_data,
"technical_concepts": paper_data.get('technical_concepts'),
"fluff_score": paper_data.get('fluff_score'),
"fluff_explanation": paper_data.get('fluff_explanation')
})
logger.info(f"Repaired orphaned paper {paper_id}")
else:
# Paper no longer exists on arXiv - clean up vector store
self.vector_store.delete_paper(paper_id)
logger.warning(f"Deleted orphaned paper {paper_id} (not found on arXiv)")
except Exception as e:
logger.error(f"Failed to repair paper {paper_id}: {e}")