A Rust implementation of Apache Spark that runs in-process as a single 73 MiB binary (think DuckDB), as a local Spark Connect server, as a Spark-compatible distributed engine across a cluster, or entirely in your browser via WebAssembly. Same code, same plans, four deployment shapes.
DuckDB set the bar for embedded analytics: a single binary with vectorized execution. Apache Spark defined the API surface and deployment model for distributed compute. spark-rust puts the same engine in both lanes: one Rust query engine in four packagings that share the same analyzer, optimizer rules, physical plans, spill operators, and MLlib bindings.
pip install sparkrust, then sparkrust.connect(). No JVM, no daemon,
no network — the engine runs inside your Python process and reads parquet directly.
DuckDB-shaped ergonomics with the Spark API surface.
The same spark-connect-server binary on localhost. Point
unmodified PySpark (3.5 and 4.0 dialects), notebooks, or JDBC clients at
sc://localhost:50051 and share one engine across clients.
Same code, same query plans, pointed at a cluster driver. Executors receive Substrait-encoded plans, execute them on their own DataFusion instance over Arrow batches, and shuffle over gRPC + Arrow IPC. Unmodified PySpark just works.
Substrait plans · gRPC shuffle

The same SQL core compiled to wasm32 and run client-side in a Web Worker: no
server, no network, your data never leaves the tab. Spark dialect rewrites and UDFs all
execute in WASM against in-memory Arrow batches.
Prototype in-process on a laptop, then deploy the exact same SQL or DataFrame code to a cluster through Spark Connect — without changing a line.
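One way to keep pipeline code unchanged between a laptop and a cluster is to choose the Spark Connect endpoint from configuration instead of hard-coding it. The sketch below is illustrative only: the `PIPELINE_TARGET` variable, the `TARGETS` mapping, and both helper functions are hypothetical names, and the two URLs are simply the ones used in the examples on this page.

```python
import os

# Hypothetical mapping from a deployment name to a Spark Connect URL.
# Both URLs are the ones used in the examples on this page.
TARGETS = {
    "local": "sc://localhost:50051",
    "cluster": "sc://driver.example.com:50051",
}


def connect_url(env=None):
    """Return the Spark Connect URL for the target named in PIPELINE_TARGET."""
    env = os.environ if env is None else env
    target = env.get("PIPELINE_TARGET", "local")  # hypothetical variable name
    if target not in TARGETS:
        raise ValueError(f"unknown target {target!r}; expected one of {sorted(TARGETS)}")
    return TARGETS[target]


def build_session(env=None):
    """Create a PySpark session against the selected endpoint (requires pyspark)."""
    # Imported lazily so connect_url() stays usable without pyspark installed.
    from pyspark.sql import SparkSession
    return SparkSession.builder.remote(connect_url(env)).getOrCreate()
```

With this shape, the DataFrame or SQL code that follows `build_session()` does not need to know which deployment it is talking to.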
Single node: 32 cores, 251 GB RAM, 1.2 TB local NVMe, Parquet on disk. Cold cache, single-shot, full 99-query TPC-DS sweep at four scale factors.
| Scale factor | spark-rust total | DuckDB total | Ratio (spark-rust / DuckDB) | spark-rust queries completed | DuckDB queries completed |
|---|---|---|---|---|---|
| SF=1 | 18.1s | 6.2s | 2.90× | 99/99 | 99/99 |
| SF=10 | 61.4s | 23.0s | 2.67× | 99/99 | 99/99 |
| SF=100 | 358.7s | 201.5s | 1.78× | 99/99 | 99/99 |
| SF=1000 | 4,146.6s | 1,859.0s | 2.23× | 99/99 | 98/99* |
* At SF=1000, DuckDB hits a working-set limit on Q85 where the intermediate exceeds
the 547 GB local spill disk on this hardware. DuckDB's optimizer is excellent and faster on
most queries in the sweep; this case is a plan-shape difference where spark-rust's join-order
selectivity keeps the intermediate small enough to finish in 8s without spilling. Full per-query
table in docs/tpcds-single-node-benchmark.md.
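The Ratio column appears to be the spark-rust total divided by the DuckDB total, so values above 1 mean spark-rust took longer. The short sketch below recomputes it from the totals in the table. It assumes that definition, and because the published totals are rounded to 0.1s, a recomputed ratio can differ from the table in the last digit.

```python
# Wall-clock totals in seconds, copied from the table: (spark-rust, DuckDB).
totals = {
    1: (18.1, 6.2),
    10: (61.4, 23.0),
    100: (358.7, 201.5),
    1000: (4146.6, 1859.0),
}


def slowdown(spark_rust_s, duckdb_s):
    """Ratio of spark-rust wall time to DuckDB wall time (assumed definition)."""
    return spark_rust_s / duckdb_s


ratios = {sf: slowdown(*pair) for sf, pair in totals.items()}

for sf, ratio in ratios.items():
    print(f"SF={sf}: {ratio:.2f}x")
```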
These numbers are a moving target. Performance work is active and ongoing; wall times keep improving as we land optimizer, join-ordering, vectorization, and spill improvements. Expect meaningfully better numbers in future sweeps — these figures are a snapshot of an MVP under active optimization, not a ceiling.
pyspark (3.5 & 4.0) connects and runs.

A real-time topology page visualizes the cluster live: executors appear as nodes sized by active task count; shuffle edges light up with animated particles showing the direction and volume of in-flight transfers.
The driver accepts executor registrations and per-heartbeat counter snapshots, fanned out via
JSONL event logs to the history server's dashboard. See
crates/spark-history-server/src/dashboard.html and the topology-demo-*
workload generators.
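To make the data flow concrete, here is a minimal sketch of how a dashboard could fold a JSONL event log into nodes and edges. The event schema is an assumption made up for illustration (the `event`, `executor`, `active_tasks`, and `shuffle_bytes_out` fields are hypothetical); the real log format is whatever the history server defines.

```python
import json

# Hypothetical event log: one JSON object per line. Field names are illustrative only.
EVENT_LOG = """\
{"event": "executor_registered", "executor": "exec-1"}
{"event": "executor_registered", "executor": "exec-2"}
{"event": "heartbeat", "executor": "exec-1", "active_tasks": 4, "shuffle_bytes_out": {"exec-2": 1024}}
{"event": "heartbeat", "executor": "exec-2", "active_tasks": 1, "shuffle_bytes_out": {}}
{"event": "heartbeat", "executor": "exec-1", "active_tasks": 2, "shuffle_bytes_out": {"exec-2": 4096}}
"""


def build_topology(jsonl_text):
    """Fold events into (nodes, edges), keeping the latest snapshot per executor."""
    nodes = {}  # executor id -> most recent active task count (node size)
    edges = {}  # (source, destination) -> most recent shuffle byte counter
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the log
        event = json.loads(line)
        executor = event["executor"]
        if event["event"] == "executor_registered":
            nodes.setdefault(executor, 0)
        elif event["event"] == "heartbeat":
            # Heartbeats carry snapshots, so a newer value replaces the older one.
            nodes[executor] = event["active_tasks"]
            for dst, nbytes in event["shuffle_bytes_out"].items():
                edges[(executor, dst)] = nbytes
    return nodes, edges


nodes, edges = build_topology(EVENT_LOG)
```

Treating each heartbeat as a snapshot rather than a delta keeps the fold idempotent: replaying the same log always yields the same topology.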
1 · In-process Python (no daemon · DuckDB-shaped)

    # pip install sparkrust
    import sparkrust

    con = sparkrust.connect()
    con.sql("CREATE VIEW sales AS SELECT * FROM read_parquet('sales/*.parquet')")
    df = con.sql("SELECT region, SUM(revenue) AS rev FROM sales GROUP BY region")

    df.show()              # PySpark-style tabular print
    rows = df.collect()    # list of tuples
    table = df.arrow()     # pyarrow.Table (zero-copy ready)
    pdf = df.toPandas()    # pandas DataFrame
2 · Single-node Spark Connect server (laptop-scale)

    # same spark-connect-server binary, run locally
    import pyspark.sql

    spark = pyspark.sql.SparkSession.builder.remote("sc://localhost:50051").getOrCreate()
    df = spark.read.parquet("s3://my-bucket/sales/")
    df.groupBy("region").agg({"revenue": "sum"}).show()
3 · Distributed Spark Connect cluster (same code, point at a driver)

    # identical pipeline, now executed across N executors
    import pyspark.sql

    spark = pyspark.sql.SparkSession.builder.remote("sc://driver.example.com:50051").getOrCreate()
4 · In-browser WebAssembly (zero install · runs in the tab)

    // same SQL core compiled to wasm32, loaded in a Web Worker
    import init, { run_sql_json } from "./pkg/spark_wasm.js";

    await init(); // fetch + instantiate the .wasm engine

    const rows = '[{"region":"west","revenue":10},{"region":"west","revenue":5}]';
    const out = run_sql_json(
      "SELECT region, SUM(revenue) AS rev FROM data GROUP BY region",
      rows);
    // no server, no network; your data never leaves the tab
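To sanity-check what the aggregation in the browser example should compute, the same SQL can be run over the same JSON rows with Python's built-in `sqlite3` module. This is a stand-in engine, not spark-rust, and `run_sql_reference` is a hypothetical helper; the exact return format of `run_sql_json` is not specified on this page, so the list-of-dicts result here is only a convenient shape for comparison.

```python
import json
import sqlite3

rows_json = '[{"region":"west","revenue":10},{"region":"west","revenue":5}]'
query = "SELECT region, SUM(revenue) AS rev FROM data GROUP BY region"


def run_sql_reference(sql, rows_json):
    """Load JSON rows into an in-memory SQLite table named `data` and run `sql`."""
    rows = json.loads(rows_json)
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE data (region TEXT, revenue INTEGER)")
        # Named placeholders let each JSON object be passed as a dict.
        con.executemany(
            "INSERT INTO data (region, revenue) VALUES (:region, :revenue)", rows
        )
        cur = con.execute(sql)
        columns = [col[0] for col in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]
    finally:
        con.close()


result = run_sql_reference(query, rows_json)
print(result)
```

Both input rows belong to the `west` region, so the query collapses them into a single group whose `rev` is the sum of the two revenues.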
No JVM, no bundled deps, no installer. Both artifacts link dynamically against the same standard system libs: the Connect server and the Python extension are different packagings of the same Rust engine.
| Artifact | Size |
|---|---|
| spark-connect-server (full engine + Connect protocol) | 72.9 MiB |
| sparkrust._sparkrust.abi3.so (in-process Python ext) | ~75 MiB |
| duckdb-1.5.2 (reference CLI) | 58.9 MiB |
For context, Apache Spark 4.2's distribution is ~330 MB compressed (~1 GB extracted) plus the JVM — reflecting its much broader feature surface.
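The table reports sizes in MiB while the Spark figure is quoted in MB, so a like-for-like comparison needs a unit conversion. The sketch below does that arithmetic, assuming the "MB" in the Spark figure means decimal megabytes (10^6 bytes) and treating the ~330 MB value as approximate.

```python
MIB = 1024 ** 2  # bytes in one mebibyte
MB = 1000 ** 2   # bytes in one (decimal) megabyte


def mib_to_mb(mib):
    """Convert a size in MiB to decimal MB."""
    return mib * MIB / MB


# Sizes in MiB, copied from the artifact table.
sizes_mib = {"spark-connect-server": 72.9, "duckdb-1.5.2": 58.9}
sizes_mb = {name: round(mib_to_mb(size), 1) for name, size in sizes_mib.items()}

# Approximate compressed size of the Apache Spark distribution quoted above.
spark_compressed_mb = 330
ratio = spark_compressed_mb / mib_to_mb(sizes_mib["spark-connect-server"])

print(sizes_mb)
print(f"Spark distribution is about {ratio:.1f}x the Connect server binary")
```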
spark-rust uses a local fork of DataFusion,
the Rust query engine that also powers ParadeDB, InfluxDB v3 and Sail, for
Arrow-columnar vectorized execution, SQL parsing, optimization, and Parquet I/O. On top we add
the Spark-shaped pieces: a Spark-dialect SQL layer, analyzer rewrites, spill operators,
skew-join split, the gRPC shuffle service, the Connect server, MLlib over Connect, a metastore
writer, a wasm32 build of the SQL core that runs Spark SQL client-side in the
browser, and the topology dashboard — ~294k LOC across 29 crates.
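For readers unfamiliar with skew-join splitting, the sketch below shows one common form of the idea, key salting: rows of a hot key on the large side are spread across several tasks, and the matching rows on the small side are replicated to each of them. This is a generic illustration in plain Python, not spark-rust's implementation; the function name and parameters are hypothetical.

```python
from collections import defaultdict
from itertools import count


def split_skewed_join(big, small, hot_keys, splits):
    """Inner-join two (key, value) lists, spreading hot keys over `splits` tasks."""
    # Large side: deal rows of hot keys round-robin across salts 0..splits-1;
    # every other key stays in salt 0.
    counters = defaultdict(count)
    big_parts = defaultdict(list)
    for key, value in big:
        salt = next(counters[key]) % splits if key in hot_keys else 0
        big_parts[(key, salt)].append(value)

    # Small side: replicate rows of hot keys to every salt so each task
    # sees all of its possible matches.
    small_parts = defaultdict(list)
    for key, value in small:
        salts = range(splits) if key in hot_keys else (0,)
        for salt in salts:
            small_parts[(key, salt)].append(value)

    # Each (key, salt) pair is now an independent, smaller join task.
    tasks = {}
    for part, left_values in big_parts.items():
        right_values = small_parts.get(part, [])
        tasks[part] = [(part[0], left, right) for left in left_values for right in right_values]
    return tasks


big = [("a", i) for i in range(6)] + [("b", 100)]
small = [("a", "x"), ("b", "y")]
tasks = split_skewed_join(big, small, hot_keys={"a"}, splits=3)
```

The trade-off is extra copies of the small side in exchange for bounded task sizes: the union of all task outputs equals the ordinary join result, but no single task has to process every row of the hot key.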
Coming soon — the natural next direction once the core engine ships.
Ask for queries in English, get them planned and run against your data.
The engine learns from prior runs — cost-model population, plan-shape histograms, adaptive bloom-filter gating — and applies it to new queries automatically.
Ask why a query is slow; get a plain-language summary of the physical plan and per-operator metrics.
The exact same Rust engine, compiled to WebAssembly. Your data never leaves the tab.
spark-rust is currently in private development; open-source release is planned for the near future.