Data Tools
Scraping, ETL, notebooks. Production-grade where we'd ship it, dev-only where we wouldn't.
Scraping
Apify
freemium: Managed scraping infra + open-source SDK.
— Hosted or self-run. Their SDK alone is worth a look.
cheerio
OSS: jQuery-style server-side HTML parsing.
— We use it for every non-SPA crawl in the KB pipeline.
linkedom
OSS: Lightweight, standards-oriented DOM for worker/Node environments (it does not aim for full spec compliance).
— When you need a real DOM and cheerio's jQuery style isn't enough.
Playwright
OSS: Headless browser for when fetch+parse isn't enough.
— Ships its own browsers; no driver drama. Our default for SPA scraping.
Puppeteer
OSS: Playwright's older sibling. Still fine, still maintained.
— Reach for this when the team already knows it; Playwright when starting fresh.
Scrapy
OSS: Python-native crawling framework with built-in queue + middlewares.
— The right tool at 10K+ pages. Scales without becoming your project.
ETL
Dagster
freemium: Type-first data pipelines with asset lineage.
— More opinionated than Prefect. Pick it when asset thinking fits your data.
DuckDB
OSS: Local analytical SQL on files.
— We use it for CSV wrangling in consulting engagements. One binary, no server.
Prefect
freemium: Python workflow orchestration with UI.
— What VORLUX's orchestrator would look like if we used off-the-shelf scheduling.