
Enterprise AI Recursive Web Scraper

AI / Data Engineering / Web Scraping

Completed / Active Maintenance

Mar 2024 - Nov 2025

Description

AI-powered recursive web scraper that uses Groq and Google Gemini LLMs with Playwright and Puppeteer for structured content extraction and risk assessment.

Project Overview

A high-performance recursive web scraper built for enterprise use and intelligent content extraction. It crawls concurrently across multiple threads, throttles requests with token bucket rate limiting, and runs extracted pages through a content analysis pipeline that uses Google Gemini LLMs for topic-based and semantic chunking. Playwright and Puppeteer provide multi-browser support. The system filters out content from restricted domains, performs risk analysis in three areas (security, content, and technical), and outputs structured data as JSON, CSV, or Markdown. Code quality is enforced by Biome, ESLint, and cspell. CI/CD runs on GitHub Actions, and the package is published to npm.
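
The token bucket rate limiting mentioned above can be illustrated with a minimal sketch. This is not the project's actual implementation: the class name `TokenBucket`, the methods `tryRemove` and `msUntilAvailable`, and the injectable clock are all hypothetical names chosen for illustration. The idea is that the bucket holds up to `capacity` tokens, refills at a steady rate, and a request may proceed only if it can take a token.

```typescript
// Illustrative sketch only. All names here (TokenBucket, tryRemove,
// msUntilAvailable, Clock) are hypothetical and not taken from the project.
type Clock = () => number; // returns the current time in milliseconds

class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private readonly now: Clock;
  private tokens: number;
  private lastRefill: number;

  constructor(capacity: number, refillPerSecond: number, now: Clock = () => Date.now()) {
    if (capacity <= 0 || refillPerSecond <= 0) {
      throw new RangeError("capacity and refillPerSecond must be positive");
    }
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity; // start full so an initial burst is allowed
    this.lastRefill = now();
  }

  // Add the tokens earned since the last call, capped at capacity.
  private refill(): void {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = current;
  }

  // Take `count` tokens if available; return false without blocking otherwise.
  tryRemove(count: number = 1): boolean {
    this.refill();
    if (this.tokens < count) return false;
    this.tokens -= count;
    return true;
  }

  // Milliseconds a caller would need to wait before `count` tokens exist.
  msUntilAvailable(count: number = 1): number {
    this.refill();
    const missing = count - this.tokens;
    return missing <= 0 ? 0 : Math.ceil((missing / this.refillPerSecond) * 1000);
  }
}
```

A crawler built along these lines might keep one bucket per target host, call `tryRemove()` before each page fetch, and sleep for `msUntilAvailable()` when the call returns false. Passing the clock as a parameter is a design choice for this sketch that keeps the limiter deterministic under test.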

Technologies Used

TypeScript

Node.js

Bun

Playwright

Puppeteer

Google Gemini LLM

Groq (LLMs)

Token Bucket Rate Limiting

LRU Cache

Commander CLI

BiomeJS

Vitest / Codecov

GitHub Actions
