Data Tools

Scraping, ETL, notebooks. Production-grade where we'd ship it, dev-only where we wouldn't.

Scraping

Apify

Managed scraping infra + open-source SDK.

— Hosted or self-run. Their SDK alone is worth a look.

scraping, managed, sdk

cheerio

jQuery-style server-side HTML parsing.

— We use it for every non-SPA crawl in the KB pipeline.

scraping, parsing, node

linkedom

Lightweight, standards-based DOM for worker/Node environments. Built for speed and low memory use rather than full spec compliance.

— When you need a real DOM and cheerio's jQuery style isn't enough.

scraping, dom, parsing

Playwright

Headless browser for when fetch+parse isn't enough.

— Ships its own browsers; no driver drama. Our default for SPA scraping.

scraping, headless-browser, e2e

Puppeteer

Chrome-first headless browser automation that predates Playwright. Still fine, still maintained.

— Reach for this when the team already knows it; Playwright when starting fresh.

scraping, headless-browser, chrome

Scrapy

Python-native crawling framework with built-in queue + middlewares.

— The right tool at 10K+ pages. Scales without becoming your project.

scraping, python, framework

ETL

Dagster

Type-first data pipelines with asset lineage.

— More opinionated than Prefect. Pick it when asset thinking fits your data.

etl, pipelines, lineage

DuckDB

Local analytical SQL on files.

— We use it for CSV wrangling in consulting engagements. One binary, no server.

sql, analytics, embedded

Prefect

Python workflow orchestration with UI.

— What VORLUX's orchestrator would look like if we used off-the-shelf scheduling.

etl, orchestration, python

Notebooks

marimo

Reactive Python notebooks that aren't a pile of state.

— Open-source, modern. Pick it over Jupyter for anything shared.

notebooks, python, reactive

Polars

DataFrames in Rust. Fast.

— Port your slow pandas cells for a sizable speedup on many workloads. Expect to rewrite them against Polars' expression API; it is not a drop-in replacement.

dataframes, rust, performance