cli-web-scrapper: Clean Web Content for LLM Workflows

cli-web-scrapper extracts clean content from Reddit, YouTube, and documentation sites for LLM workflows. I built it because the standard Python requests library gets blocked by modern platforms almost immediately, and paid scraping APIs felt wrong for a local utility.

Here’s the technical approach, the platform-specific challenges, and what I learned about browser fingerprinting along the way.

Why Standard Tools Fail

Modern websites don’t want bots. Cloudflare, DataDome, and similar services block anything that looks like automated traffic. A plain requests call gets rejected with a 403 or is served a JavaScript challenge.

Even when you can access the content, extraction is messy: navigation bars, ads, cookie banners, and boilerplate HTML all get in the way. Different platforms also need specialized parsing. A Reddit thread has a completely different structure from a YouTube page.

I tried the obvious approaches. requests directly: instant 403. Selenium: too slow for a CLI tool. Paid APIs: not interested in subscriptions for a local utility.

The Two-Layer Architecture

The breakthrough came from combining two libraries: curl_cffi for access and trafilatura for extraction.

curl_cffi impersonates real browsers using proper TLS fingerprints. This goes beyond setting a User-Agent header: it reproduces the actual TLS handshake, HTTP/2 behavior, and browser-specific characteristics, which makes requests very hard to distinguish from Chrome 124 or Safari 18. Different platforms accept different fingerprints, so the tool supports 20+ browser versions.
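
As a rough illustration of the access layer, here is a minimal sketch that tries several fingerprints in order until one is accepted. It is not the tool's actual code: the function names, the fallback strategy, and the two target names are assumptions for illustration, and which impersonation targets are valid depends on the installed curl_cffi version.

```python
from typing import Callable, Iterable, Optional

# Impersonation targets to try, in order. These names are assumptions;
# check the curl_cffi release you have installed for the supported list.
DEFAULT_TARGETS = ("chrome124", "safari17_0")


def curl_cffi_get(url: str, target: str) -> tuple[int, str]:
    """Fetch url with curl_cffi while impersonating the given browser target."""
    from curl_cffi import requests  # third-party; imported lazily

    response = requests.get(url, impersonate=target, timeout=30)
    return response.status_code, response.text


def fetch_with_fallback(
    url: str,
    targets: Iterable[str] = DEFAULT_TARGETS,
    get: Callable[[str, str], tuple[int, str]] = curl_cffi_get,
) -> Optional[tuple[str, str]]:
    """Return (target, html) for the first fingerprint that gets a 200, else None."""
    for target in targets:
        status, body = get(url, target)
        if status == 200:
            return target, body
    return None
```

The `get` parameter exists only so the fallback logic can be exercised without network access; in normal use the default curl_cffi-backed function is called.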

trafilatura strips everything except the main content. It is the same engine used by HuggingFace, IBM Research, and Stanford for web corpus building, with a reported F1 score of 0.937 and support for 500+ languages.
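
To make the two layers concrete, the sketch below composes a fetch step and an extraction step. It is an assumption about how such a pipeline could be wired, not the tool's real implementation: `fetch_html`, `extract_markdown`, and `scrape` are hypothetical names, and the Markdown output format requires a reasonably recent trafilatura release.

```python
from typing import Callable, Optional


def fetch_html(url: str) -> str:
    """Layer 1 (access): download the page with a browser-like fingerprint."""
    from curl_cffi import requests  # third-party; imported lazily

    return requests.get(url, impersonate="chrome", timeout=30).text


def extract_markdown(html: str) -> Optional[str]:
    """Layer 2 (extraction): keep the main content, drop navigation and ads."""
    import trafilatura  # third-party; imported lazily

    return trafilatura.extract(html, output_format="markdown")


def scrape(
    url: str,
    fetch: Callable[[str], str] = fetch_html,
    extract: Callable[[str], Optional[str]] = extract_markdown,
) -> str:
    """Run both layers and fail loudly if no main content was found."""
    html = fetch(url)
    content = extract(html)
    if not content:
        raise ValueError(f"no main content found at {url}")
    return content
```

Keeping the layers as separate callables means either one can be swapped out, for example to plug in a platform-specific parser in place of the generic extractor.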

cli-web-scrapper https://docs.example.com -f markdown
cli-web-scrapper -b safari https://protected-site.com
cli-web-scrapper --list-browsers

Here’s what it looks like extracting a Reddit thread:

[Image: CLI Web Scraper demo, Reddit thread extraction]

And YouTube with comments:

[Image: CLI Web Scraper demo, YouTube with comments]

Platform-Specific Parsing

Generic scrapers miss platform-specific metadata. For Reddit, that means losing comment structure, authors, and voting information. For YouTube, it’s video statistics, channel data, and description links. YouTube makes those links surprisingly hard to extract: they are shortened URLs buried in JSON-LD metadata that need resolving and deduplication.
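
The description-link problem can be sketched with the standard library alone. The code below is an illustration under assumptions, not the tool's parser: it assumes the description sits in a JSON-LD `description` field, it only unwraps `youtube.com/redirect?q=...` wrapper links rather than resolving third-party shorteners (which would need a network request), and all function names are hypothetical.

```python
import json
import re
from urllib.parse import parse_qs, urlparse

JSON_LD_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)
URL_RE = re.compile(r"https?://[^\s\"'<>)]+")


def unwrap_redirect(url: str) -> str:
    """Return the target of a youtube.com/redirect?q=... link, else the URL itself."""
    parsed = urlparse(url)
    if parsed.netloc.endswith("youtube.com") and parsed.path == "/redirect":
        target = parse_qs(parsed.query).get("q")
        if target:
            return target[0]
    return url


def description_links(html: str) -> list[str]:
    """Collect unique links from JSON-LD descriptions, keeping first-seen order."""
    seen: dict[str, None] = {}  # dict keys preserve insertion order
    for block in JSON_LD_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed metadata blocks
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict):
                continue
            description = item.get("description")
            if not isinstance(description, str):
                continue
            for raw in URL_RE.findall(description):
                # Drop trailing sentence punctuation before unwrapping.
                seen.setdefault(unwrap_redirect(raw.rstrip(".,;")), None)
    return list(seen)
```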

The tool has dedicated parsers for both:

Reddit URL → curl_cffi → BeautifulSoup parser → Extract post + comments → JSON/Markdown
YouTube URL → curl_cffi → YouTube-specific parser → Metadata + description + comments → JSON/Markdown
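
The routing step in front of those two pipelines could look roughly like this. The host lists and the `pick_parser` name are assumptions for illustration; the actual tool may detect platforms differently.

```python
from urllib.parse import urlparse

# Hypothetical host lists; extend as needed.
REDDIT_HOSTS = {"reddit.com", "old.reddit.com"}
YOUTUBE_HOSTS = {"youtube.com", "m.youtube.com", "youtu.be"}


def pick_parser(url: str) -> str:
    """Choose a parser name from the URL's host, falling back to the generic one."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in REDDIT_HOSTS:
        return "reddit"
    if host in YOUTUBE_HOSTS:
        return "youtube"
    return "generic"
```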

Four output formats: Rich (terminal colors), Markdown (LLM context), JSON (full metadata), Plain text (pipelines).
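
To show why structured parsing matters for these formats, here is a small sketch that renders one parsed comment tree as Markdown, JSON, or plain text. The `Comment` fields and the exact output layout are assumptions, not the tool's real schema, and the Rich terminal format is left out because it needs a third-party library.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Comment:
    author: str
    score: int
    body: str
    replies: list["Comment"] = field(default_factory=list)


def to_markdown(comment: Comment, depth: int = 0) -> str:
    """Nested bullet list that keeps author, score, and reply structure."""
    indent = "  " * depth
    lines = [f"{indent}- **{comment.author}** ({comment.score} points): {comment.body}"]
    lines += [to_markdown(reply, depth + 1) for reply in comment.replies]
    return "\n".join(lines)


def to_plain(comment: Comment, depth: int = 0) -> str:
    """Indented text without markup, suitable for shell pipelines."""
    indent = "  " * depth
    lines = [f"{indent}{comment.author}: {comment.body}"]
    lines += [to_plain(reply, depth + 1) for reply in comment.replies]
    return "\n".join(lines)


def render(comment: Comment, fmt: str) -> str:
    """Dispatch on the requested output format."""
    if fmt == "markdown":
        return to_markdown(comment)
    if fmt == "json":
        return json.dumps(asdict(comment), indent=2)  # full metadata
    if fmt == "plain":
        return to_plain(comment)
    raise ValueError(f"unsupported format: {fmt}")
```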

cli-web-scrapper -f markdown -o output.md https://blog.example.com/article
cli-web-scrapper -f json https://reddit.com/r/programming/comments/xyz123

Performance

Trafilatura is fast. Typical article: under 2 seconds. YouTube videos: 3-5 seconds for full metadata. Reddit threads: 2-4 seconds depending on comment count. The extraction accuracy (F1 score 0.937) means reliable content without manual cleanup.

Installation:

uv pip install git+https://github.com/amlucas0xff/cli-web-scrapper.git

What I Learned

Browser fingerprinting is more nuanced than I expected. It goes well beyond User-Agent headers: TLS handshakes, HTTP/2 behavior, cipher suites, and dozens of other signals all count. The curl_cffi library handles all of this, but understanding why it works required diving into how modern bot detection actually operates.

The current version does what I need: access modern web platforms, extract clean content, output LLM-friendly formats. Batch processing and rate limiting are on the list for future iterations.
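
Until batch processing and rate limiting land in the tool itself, a thin wrapper script can approximate them. This is only a sketch under assumptions: it shells out to the CLI using the `-f` and `-o` flags shown earlier, the fixed delay and the output file naming are arbitrary choices, and `build_command` and `run_batch` are hypothetical helpers.

```python
import subprocess
import time
from pathlib import Path


def build_command(url: str, out_file: Path) -> list[str]:
    """Build one CLI invocation using only the -f and -o flags shown above."""
    return ["cli-web-scrapper", "-f", "markdown", "-o", str(out_file), url]


def run_batch(urls, out_dir, delay_seconds=2.0, runner=subprocess.run, sleep=time.sleep):
    """Scrape URLs one at a time, pausing between requests; return exit codes."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    exit_codes = []
    for index, url in enumerate(urls):
        if index:
            sleep(delay_seconds)  # crude rate limit: fixed pause between calls
        result = runner(build_command(url, out_dir / f"page-{index:03d}.md"))
        exit_codes.append(result.returncode)
    return exit_codes
```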

If you’re building LLM workflows that need web content, this might save you some time. Repository: github.com/amlucas0xff/cli-web-scrapper.