Advanced AI-powered recursive web scraper utilizing Groq LLMs, Playwright, and Puppeteer for intelligent, structured content extraction and risk assessment.
A high-performance recursive web scraper designed for intelligent content extraction. It features concurrent multi-threaded crawling, token bucket rate limiting, and a content analysis pipeline that uses Google Gemini LLMs for topic-based and semantic chunking. The system drives browsers through Playwright or Puppeteer for multi-browser support, filters out content from restricted domains, performs risk analysis (security, content, technical), and outputs structured data as JSON, CSV, or Markdown. Code quality is enforced with Biome, ESLint, and cspell, with CI/CD through GitHub Actions and publishing to npm.
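
To illustrate the token bucket rate limiting mentioned above, here is a minimal sketch of the general technique. It is not taken from the project's source: the class name `TokenBucket`, its constructor parameters, and the `tryAcquire` method are all hypothetical, and the real implementation may differ. The bucket holds up to `capacity` tokens, refills at a steady rate, and a request proceeds only if a token is available.

```typescript
// Hypothetical token bucket sketch; names and parameters are illustrative,
// not the project's actual API.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly capacity: number, // maximum burst size, in tokens
    private readonly refillPerSecond: number, // sustained rate, tokens per second
    private readonly now: () => number = Date.now, // injectable clock, in milliseconds
  ) {
    this.tokens = capacity; // start with a full bucket
    this.lastRefill = now();
  }

  // Add the tokens earned since the last refill, capped at capacity.
  private refill(): void {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond,
    );
    this.lastRefill = current;
  }

  // Take `cost` tokens if available; return false if the caller should wait.
  tryAcquire(cost: number = 1): boolean {
    this.refill();
    if (this.tokens < cost) {
      return false;
    }
    this.tokens -= cost;
    return true;
  }
}
```

A crawler would typically keep one bucket per target host and call `tryAcquire()` before each request, delaying or re-queueing the URL when it returns `false`. Injecting the clock keeps the limiter deterministic under test.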