A Rust implementation of Apache Spark that runs as a single 73 MiB binary in-process (think DuckDB), serves Spark Connect from a single node, scales out across a cluster as a Spark-compatible distributed engine, or runs entirely in your browser via WebAssembly. Same code, same plans, four deployment shapes.
DuckDB set the bar for embedded analytics: a single binary with vectorized execution. Apache Spark defined the API surface and deployment model for distributed compute. spark-rust covers both lanes with one engine: a single Rust query engine in four packagings that share the same analyzer, optimizer rules, physical plans, spill operators, and MLlib bindings.
pip install sparkrust, then sparkrust.connect(). No JVM, no daemon,
no network — the engine runs inside your Python process and reads parquet directly.
DuckDB-shaped ergonomics with the Spark API surface.
The same spark-connect-server binary on localhost. Point
unmodified PySpark (3.5 and 4.0 dialects), notebooks, or JDBC clients at
sc://localhost:50051 and share one engine across clients.
Same code, same query plans, pointed at a cluster driver. Executors receive encoded query plans, run them with their own Arrow-batched DataFusion, and shuffle over gRPC + Arrow IPC. Unmodified PySpark just works.
Distributed query plans · gRPC shuffle

The same SQL core compiled to wasm32 and run client-side in a Web Worker: no
server, no network, your data never leaves the tab. Spark dialect rewrites and UDFs all
execute in WASM against in-memory Arrow batches.
Prototype in-process on a laptop, then deploy the exact same SQL or DataFrame code to a cluster through Spark Connect — without changing a line.
Single-node, 20 vCPU / 157 GiB RAM, SNAPPY parquet on local disk (read directly — no catalog, no blob). Cold single-shot process per query, both engines on the same box reading the same files, full 99-query sweep at four scale factors.
| Scale factor | Data size | spark-rust total | DuckDB total | Ratio (spark-rust / DuckDB) | spark-rust queries completed | DuckDB queries completed |
|---|---|---|---|---|---|---|
| SF=1 | 818 MB | 16.5s | 9.1s | 1.82× | 99/99 | 99/99 |
| SF=10 | 6.7 GB | 64.2s | 29.3s | 2.19× | 99/99 | 99/99 |
| SF=100 | 38 GB | 540.6s | 189.5s | 2.85× | 99/99 | 99/99 |
| SF=1000 | 561 GB | 13,693s | 17,805s | 0.77× (we win) | 98/99 | 96/99 |
0 row mismatches at every scale. The ranking inverts at 1 TB: DuckDB's tight operator fusion wins at small scale, but at SF=1000 spark-rust's design (runtime filters, bloom pushdown, and spilling joins that cut O(fact) scans down to a fraction of the fact table) pulls ahead in aggregate, with a 0.77× ratio and 48 of 96 per-query wins over the set both engines complete. spark-rust also finishes Q67 and Q85, which time out on DuckDB. SF=1000 totals cover only the 96 queries both engines complete.
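To make the "runtime filters, bloom pushdown" idea above concrete, here is a minimal sketch of a Bloom-filter runtime filter in plain Python. It is an illustration of the general technique, not spark-rust's implementation: the `BloomFilter` class, its sizing defaults, and the toy tables are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive k bit positions from one digest via double hashing.
        digest = hashlib.blake2b(repr(key).encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key) -> None:
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, key) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(key))


# Build side: join keys of the small, already-filtered dimension table.
dim_keys = {3, 17, 42}
runtime_filter = BloomFilter()
for key in dim_keys:
    runtime_filter.add(key)

# Probe side: drop fact rows that cannot match before they reach the join.
fact_rows = [(key, key * 10) for key in range(1000)]
survivors = [row for row in fact_rows if runtime_filter.might_contain(row[0])]

# The join itself still checks keys exactly, so false positives are harmless.
joined = [row for row in survivors if row[0] in dim_keys]
```

The filter only ever discards rows that are guaranteed not to match, so the join result is unchanged while the probe side shrinks from the full fact table to roughly the matching rows.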
Hardware caveat: SF≤100 ran on an AMD EPYC 9V74 (Zen 4) host; SF=1000 ran on AMD EPYC 7452 (Zen 2) workers. The vCPU/RAM class is the same but the CPU generation differs, so the 1 TB win was earned on the slower silicon, which makes it a conservative result. Within each scale factor the ratio is apples-to-apples: both engines, same box, same files.
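For readers who want to check the Ratio column, the sketch below recomputes it from the totals in the table above as spark-rust total divided by DuckDB total, so a value below 1.0 means spark-rust was faster. Because the published totals are rounded, a recomputed ratio may differ from the table in the last digit.

```python
# (spark-rust total seconds, DuckDB total seconds), copied from the table above.
totals = {
    "SF=1": (16.5, 9.1),
    "SF=10": (64.2, 29.3),
    "SF=100": (540.6, 189.5),
    "SF=1000": (13693.0, 17805.0),
}


def ratio(spark_rust_s: float, duckdb_s: float) -> float:
    """Wall-time ratio; below 1.0 means spark-rust finished the sweep faster."""
    return spark_rust_s / duckdb_s


ratios = {sf: ratio(*pair) for sf, pair in totals.items()}

for sf, r in ratios.items():
    faster = "spark-rust" if r < 1.0 else "DuckDB"
    print(f"{sf}: {r:.2f}x ({faster} faster)")
```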
These numbers are a moving target. Performance work is active and ongoing; wall times keep improving as we land optimizer, join-ordering, vectorization, and spill improvements. Expect meaningfully better numbers in future sweeps — these figures are a snapshot of an MVP under active optimization, not a ceiling.
8-executor cluster, identical hardware. spark-rust's distributed path
runs every query byte-identically to Spark and, on warmed
timings, beats Spark's own best-tuned config on both the typical query and the
aggregate total. (Sequential per-query walls — apples-to-apples with Spark's
parallelism=1 sweep totals.)
| Metric | spark-rust | Apache Spark 4.2.0 |
|---|---|---|
| Queries passing | 99 / 99 | 99 / 99 |
| Σ per-query wall (warm) | 28.2 min | 40.0 min (optimized-8×20c) |
| Queries won (head-to-head) | 75 / 99 | 24 / 99 |
| Geomean speedup | 2.71× | — |
| Median speedup | 2.85× | — |
spark-rust's 28.2 min beats all three Spark 4.2 configs — including
Spark's own best (cbo-stats, 35.4 min, with table statistics pre-computed)
and its un-tuned baseline (93 min). 60 of 99 queries are ≥ 2× faster; the biggest
win (q14, Spark's own slowest at 123 s) is 20× faster. The 24 head-to-head losses form
a bounded tail of compute-heavy theta-join / rollup queries.
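The table reports a summed wall time alongside geomean and median speedups, and these answer different questions: the sum is dominated by the slowest queries, while the geomean weights every query equally. A minimal sketch of how each metric is computed, using hypothetical per-query timings (the query names and numbers below are made up for illustration and are not measured results):

```python
import statistics

# Hypothetical per-query wall times in seconds (NOT benchmark data).
spark_s = {"qA": 10.0, "qB": 40.0, "qC": 6.0, "qD": 123.0}
spark_rust_s = {"qA": 5.0, "qB": 10.0, "qC": 12.0, "qD": 6.15}

# Per-query speedup: above 1.0 means spark-rust won that query.
speedups = {q: spark_s[q] / spark_rust_s[q] for q in spark_s}

geomean = statistics.geometric_mean(list(speedups.values()))
median = statistics.median(list(speedups.values()))
wins = sum(1 for s in speedups.values() if s > 1.0)

# Ratio of summed wall times: a separate metric from the geomean.
total_ratio = sum(spark_s.values()) / sum(spark_rust_s.values())

print(f"wins={wins}/{len(speedups)} geomean={geomean:.2f}x "
      f"median={median:.2f}x total_ratio={total_ratio:.2f}x")
```

A single very large win or loss moves the summed ratio far more than it moves the geomean, which is why the two figures in a results table need not agree.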
pyspark (3.5 & 4.0) connects and runs.

A real-time topology page visualizes the cluster live: executors appear as nodes sized by active task count; shuffle edges light up with animated particles showing the direction and volume of in-flight transfers.
The driver accepts executor registrations and per-heartbeat counter snapshots, fanned out via event logs to the history server's dashboard — so you can watch tasks, shuffle transfers, and cluster-wide totals update live as a query runs.
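One way to picture the heartbeat flow described above is a driver-side view that keeps the latest counter snapshot per executor and derives cluster-wide totals from it. The sketch below is a guess at the shape of such a component, not spark-rust's code: `Heartbeat`, `ClusterView`, and every field name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Heartbeat:
    """Hypothetical per-heartbeat counter snapshot sent by one executor."""
    executor_id: str
    seq: int                 # increases with every heartbeat from an executor
    active_tasks: int        # gauge: tasks running right now
    shuffle_bytes_sent: int  # cumulative counter since executor start


class ClusterView:
    """Keeps the newest snapshot per executor and sums cluster totals."""

    def __init__(self) -> None:
        self._latest: dict[str, Heartbeat] = {}

    def apply(self, hb: Heartbeat) -> None:
        current = self._latest.get(hb.executor_id)
        # Ignore stale or duplicated heartbeats that arrive out of order.
        if current is None or hb.seq > current.seq:
            self._latest[hb.executor_id] = hb

    def totals(self) -> dict[str, int]:
        snapshots = self._latest.values()
        return {
            "executors": len(self._latest),
            "active_tasks": sum(h.active_tasks for h in snapshots),
            "shuffle_bytes_sent": sum(h.shuffle_bytes_sent for h in snapshots),
        }


view = ClusterView()
view.apply(Heartbeat("exec-1", seq=1, active_tasks=3, shuffle_bytes_sent=100))
view.apply(Heartbeat("exec-2", seq=1, active_tasks=1, shuffle_bytes_sent=50))
view.apply(Heartbeat("exec-1", seq=2, active_tasks=5, shuffle_bytes_sent=400))
view.apply(Heartbeat("exec-1", seq=1, active_tasks=3, shuffle_bytes_sent=100))  # stale
cluster_totals = view.totals()
```

Because each snapshot carries cumulative counters, replacing the previous snapshot (rather than adding to it) keeps the totals correct even when a heartbeat is dropped.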
1 · In-process Python (no daemon · DuckDB-shaped)

    # pip install sparkrust
    import sparkrust

    con = sparkrust.connect()
    con.sql("CREATE VIEW sales AS SELECT * FROM read_parquet('sales/*.parquet')")
    df = con.sql("SELECT region, SUM(revenue) AS rev FROM sales GROUP BY region")

    df.show()             # PySpark-style tabular print
    rows = df.collect()   # list of tuples
    table = df.arrow()    # pyarrow.Table (zero-copy ready)
    pdf = df.toPandas()   # pandas DataFrame
2 · Single-node Spark Connect server (laptop-scale)

    # same spark-connect-server binary, run locally
    import pyspark.sql

    spark = pyspark.sql.SparkSession.builder.remote("sc://localhost:50051").getOrCreate()
    df = spark.read.parquet("s3://my-bucket/sales/")
    df.groupBy("region").agg({"revenue": "sum"}).show()
3 · Distributed Spark Connect cluster (same code, point at a driver)

    # identical pipeline, now executed across N executors
    spark = pyspark.sql.SparkSession.builder.remote("sc://driver.example.com:50051").getOrCreate()
4 · In-browser WebAssembly (zero install · runs in the tab)

    // same SQL core compiled to wasm32, loaded in a Web Worker
    import init, { run_sql_json } from "./pkg/spark_wasm.js";

    await init();   // fetch + instantiate the .wasm engine

    const rows = '[{"region":"west","revenue":10},{"region":"west","revenue":5}]';
    const out = run_sql_json(
      "SELECT region, SUM(revenue) AS rev FROM data GROUP BY region", rows);
    // no server, no network; your data never leaves the tab
No JVM, no bundled deps, no installer — dynamically linked against the same standard system libs. The Connect server and Python extension are different packagings of the same Rust engine.
| Artifact | Size |
|---|---|
| spark-connect-server (full engine + Connect protocol) | 72.9 MiB |
| sparkrust._sparkrust.abi3.so (in-process Python ext) | ~75 MiB |
| duckdb-1.5.4 (reference CLI) | 59 MiB |
For context, Apache Spark 4.2's distribution is ~330 MB compressed (~1 GB extracted) plus the JVM — reflecting its much broader feature surface.
spark-rust uses a local fork of DataFusion,
the Rust query engine that also powers ParadeDB, InfluxDB v3 and Sail, for
Arrow-columnar vectorized execution, SQL parsing, optimization, and Parquet I/O. On top we add
the Spark-shaped pieces: a Spark-dialect SQL layer, analyzer rewrites, spill operators,
skew-join split, the gRPC shuffle service, the Connect server, MLlib over Connect, an
Apache Iceberg / Delta metastore writer, a wasm32 build of the SQL core that
runs Spark SQL client-side in the browser, and the topology dashboard.
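Of the Spark-shaped pieces listed above, the skew-join split is the easiest to illustrate in a few lines. The sketch below shows the general technique only: hash-partition both sides, detect a partition much larger than the median, and split it into chunks that each join against the full matching partition of the other side. Function names, the `skew_factor` threshold, and the toy data are assumptions for this example and say nothing about how spark-rust implements it.

```python
import statistics
from collections import defaultdict


def hash_partition(rows, num_partitions):
    """Route (key, value) rows to partitions by key hash, as a shuffle would."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def split_if_skewed(partition, median_size, skew_factor=5.0):
    """Split a partition into median-sized chunks if it is far above the median."""
    if median_size == 0 or len(partition) <= skew_factor * median_size:
        return [partition]
    chunk = max(int(median_size), 1)
    return [partition[i:i + chunk] for i in range(0, len(partition), chunk)]


def hash_join(left, right):
    """Plain inner hash join on the key of (key, value) rows."""
    table = defaultdict(list)
    for key, value in right:
        table[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in table.get(key, [])]


def skew_aware_join(left_rows, right_rows, num_partitions=4):
    left_parts = hash_partition(left_rows, num_partitions)
    right_parts = hash_partition(right_rows, num_partitions)
    median_size = statistics.median([len(p) for p in left_parts])
    out, tasks = [], 0
    for left_part, right_part in zip(left_parts, right_parts):
        # Each chunk is an independent task joined against the whole right partition.
        for chunk in split_if_skewed(left_part, median_size):
            out.extend(hash_join(chunk, right_part))
            tasks += 1
    return out, tasks


# Key 0 is "hot": 100 rows, versus 2 rows for each of the other keys.
left = [(0, i) for i in range(100)] + [(k, i) for k in (1, 2, 3) for i in range(2)]
right = [(k, f"dim{k}") for k in range(4)]

joined, tasks = skew_aware_join(left, right)
```

Splitting changes only how the work is scheduled: the joined rows are the same as an unsplit hash join, but the hot partition becomes many small tasks instead of one straggler.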
Coming soon — the natural next direction once the core engine ships.
Ask for queries in English, get them planned and run against your data.
The engine learns from prior runs — cost-model population, plan-shape histograms, adaptive bloom-filter gating — and applies it to new queries automatically.
Ask why a query is slow; get a plain-language summary of the physical plan and per-operator metrics.
The exact same Rust engine, compiled to WebAssembly. Your data never leaves the tab.
spark-rust is currently in private development; open-source release is planned for the near future.