Advanced MVP · Rust-native

One Spark engine.
Embeddable and distributed.

A Rust implementation of Apache Spark that runs in-process as a single 73 MiB binary (think DuckDB), serves local clients as a single-node Spark Connect server, scales out across a cluster as a Spark-compatible distributed engine, or runs entirely in your browser via WebAssembly. Same code, same plans, four deployment shapes.

73 MiB
Single dynamically-linked binary — no JVM, no installer
90.84%
of Apache Spark's own SQL test suite (9,296 / 10,233)
99/99
TPC-DS queries at every scale through 1 TB, single-node
2.7×
geomean faster than Apache Spark 4.2 — full 99-query TPC-DS at SF=100 (cluster)
4 ways
In-process · single-node Connect · distributed cluster · in-browser WASM
The pitch

Standing on the shoulders of DuckDB and Apache Spark

DuckDB set the bar for embedded analytics: a single binary with vectorized execution. Apache Spark defined the API surface and deployment model for distributed compute. spark-rust covers both lanes with one Rust query engine in four packagings, all sharing the same analyzer, optimizer rules, physical plans, spill operators, and MLlib bindings.

In-process Python

pip install sparkrust, then sparkrust.connect(). No JVM, no daemon, no network: the engine runs inside your Python process and reads Parquet directly. DuckDB-shaped ergonomics with the Spark API surface.

Faster than DuckDB @ SF=1000 TPC-DS

Single-node Connect server

The same spark-connect-server binary on localhost. Point unmodified PySpark (3.5 and 4.0 dialects), notebooks, or JDBC clients at sc://localhost:50051 and share one engine across clients.

Full gRPC Connect protocol

Distributed cluster

Same code, same query plans, pointed at a cluster driver. Executors receive encoded query plans, run them on their own local DataFusion instance over Arrow batches, and shuffle over gRPC + Arrow IPC. Unmodified PySpark just works.

Distributed query plans · gRPC shuffle

In-browser WebAssembly

The same SQL core compiled to wasm32 and run client-side in a Web Worker — no server, no network, your data never leaves the tab. Spark dialect rewrites and UDFs all execute in WASM against in-memory Arrow batches.

Zero install · runs in the tab

Prototype in-process on a laptop, then deploy the exact same SQL or DataFrame code to a cluster through Spark Connect — without changing a line.

The benchmark headline

Full TPC-DS 99-query sweep — within 2–3× of DuckDB, ahead at 1 TB

Single-node, 20 vCPU / 157 GiB RAM, Snappy-compressed Parquet on local disk (read directly: no catalog, no blob storage). Each query runs cold in its own single-shot process, with both engines on the same box reading the same files, for a full 99-query sweep at four scale factors.

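| Scale factor | Data | spark-rust total | DuckDB total | Ratio | spark-rust completed | DuckDB completed |
|---|---|---|---|---|---|---|
| SF=1 | 818 MB | 16.5s | 9.1s | 1.82× | 99/99 | 99/99 |
| SF=10 | 6.7 GB | 64.2s | 29.3s | 2.19× | 99/99 | 99/99 |
| SF=100 | 38 GB | 540.6s | 189.5s | 2.85× | 99/99 | 99/99 |
| SF=1000 | 561 GB | 13,693s | 17,805s | 0.77× (we win) | 98/99 | 96/99 |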

0 row mismatches at every scale. The ranking inverts at 1 TB: DuckDB's tight operator fusion wins at small scale, but at SF=1000 spark-rust's design (runtime filters, Bloom-filter pushdown, and spilling joins, which together cut full fact-table scans to a fraction of the rows) pulls ahead in aggregate, with a 0.77× ratio and 48 of 96 per-query wins over the both-completed set. It is also the only engine to finish Q67 and Q85, where DuckDB times out. SF=1000 totals cover the 96 queries both engines complete.
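The Ratio column is the spark-rust total divided by the DuckDB total, so a value below 1.0 means spark-rust finished the sweep faster. A minimal sketch that recomputes it from the totals in the table; the names are illustrative, and because the published totals are rounded, a recomputed ratio can differ from the published one in the last digit:

```python
# Totals in seconds, copied from the table: (spark-rust, DuckDB, published ratio).
SWEEPS = {
    1: (16.5, 9.1, 1.82),
    10: (64.2, 29.3, 2.19),
    100: (540.6, 189.5, 2.85),
    1000: (13_693.0, 17_805.0, 0.77),
}


def ratio(spark_rust_s: float, duckdb_s: float) -> float:
    """spark-rust wall time relative to DuckDB; below 1.0 means spark-rust is faster."""
    return spark_rust_s / duckdb_s


recomputed = {sf: ratio(a, b) for sf, (a, b, _) in SWEEPS.items()}

for sf, r in recomputed.items():
    winner = "spark-rust" if r < 1.0 else "DuckDB"
    print(f"SF={sf}: {r:.2f}x ({winner} faster in aggregate)")
```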

Hardware caveat: SF≤100 ran on an AMD EPYC 9V74 (Zen 4) host; SF=1000 ran on AMD EPYC 7452 (Zen 2) workers. Both are in the same vCPU/RAM class but differ in CPU generation, so the 1 TB win was earned on the slower silicon (a conservative result). Within each scale the ratio is a clean apples-to-apples comparison (both engines, same box, same files).

These numbers are a moving target. Performance work is active and ongoing; wall times keep improving as we land optimizer, join-ordering, vectorization, and spill improvements. Expect meaningfully better numbers in future sweeps — these figures are a snapshot of an MVP under active optimization, not a ceiling.

Head-to-head vs Apache Spark 4.2

Beats Apache Spark 4.2.0 on the full 99-query suite at SF=100

8-executor cluster, identical hardware. spark-rust's distributed path runs every query byte-identically to Spark and, on warmed timings, beats Spark's own best-tuned config on both the typical query and the aggregate total. (Sequential per-query walls — apples-to-apples with Spark's parallelism=1 sweep totals.)

| Metric | spark-rust | Apache Spark 4.2.0 |
|---|---|---|
| Queries passing | 99 / 99 | 99 / 99 |
| Σ per-query wall (warm) | 28.2 min | 40.0 min (optimized-8×20c) |
| Queries won (head-to-head) | 75 / 99 | 24 / 99 |
| Geomean speedup | 2.71× | |
| Median speedup | 2.85× | |

spark-rust's 28.2 min beats all three Spark 4.2 configs — including Spark's own best (cbo-stats, 35.4 min, with table statistics pre-computed) and its un-tuned baseline (93 min). 60 of 99 queries are ≥ 2× faster; the biggest win — q14, Spark's own slowest at 123 s — is 20× faster. The remaining 9 losses are a bounded compute-heavy theta-join / rollup tail.
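The aggregate and geomean figures answer different questions. The ratio of totals (40.0 / 28.2 ≈ 1.42× against the optimized-8×20c config) is dominated by the longest queries, while the geomean weights every query equally. A small sketch of how such metrics are typically derived from per-query wall times; the function name and the sample timings are hypothetical placeholders, not measurements from this sweep:

```python
import math
import statistics


def speedup_summary(baseline_s: list[float], candidate_s: list[float]) -> dict[str, float]:
    """Summarize per-query speedups of candidate over baseline (both in seconds)."""
    if len(baseline_s) != len(candidate_s) or not baseline_s:
        raise ValueError("need two non-empty timing lists of equal length")
    speedups = [b / c for b, c in zip(baseline_s, candidate_s)]
    return {
        # Geometric mean: exp of the mean log-speedup, so each query counts equally.
        "geomean": math.exp(sum(math.log(s) for s in speedups) / len(speedups)),
        "median": statistics.median(speedups),
        # Ratio of totals: dominated by the slowest queries.
        "total_ratio": sum(baseline_s) / sum(candidate_s),
        "wins": sum(1 for s in speedups if s > 1.0),
    }


# Hypothetical timings for four queries, for illustration only.
baseline = [8.0, 2.0, 90.0, 4.0]
candidate = [2.0, 1.0, 100.0, 1.0]
summary = speedup_summary(baseline, candidate)

# Ratio of the published totals (minutes), taken from the comparison table.
published_total_ratio = 40.0 / 28.2
```

In the toy data, one slow query that regresses slightly is enough to pull the ratio of totals down to parity even though three of four queries are at least 2× faster, which is why both numbers are worth reporting.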

Where it is in its lifecycle

All the major Spark capabilities, verified end-to-end

Spark SQL — full ANSI surface plus Spark extensions, 90.84% of Spark's golden test suite.
DataFrames & Datasets — window functions, UDFs, grouping sets, ROLLUP/CUBE, lateral & recursive CTEs.
Spark Connect — full gRPC server; unmodified PyPI pyspark (3.5 & 4.0) connects and runs.
MLlib via Connect — LogisticRegression, RandomForest, k-means, FPGrowth, Pipelines, CrossValidator, model persistence.
Joins — broadcast hash, sort-merge, shuffle hash, broadcast nested loop, semi/anti, mark; adaptive Bloom-filter pushdown.
Adaptive Query Execution — partition coalescing, skew-join split, runtime broadcast switch, SMJ→BHJ swap.
Operator & shuffle spill — aggregate, sort, hash join, cross join, window, symmetric hash join → Arrow IPC on disk.
Structured Streaming — watermarks, session windows, state stores (RocksDB/file/memory), Kafka/socket/rate sources.
Storage & lakehouse formats — Parquet, ORC, CSV, JSON, Avro with predicate pushdown & bloom pruning; Apache Iceberg & Delta tables with REST/Glue/Hive catalogs and snapshot writes.
Object stores — S3, Azure Blob, GCS, local FS; metastore + DDL persistence across restarts.
Statistics & cost model — column histograms, ANALYZE TABLE FOR COLUMNS, planner hooks consuming stats.
Deployment — standalone gRPC executors, Kubernetes manifests, dynamic allocation, KEDA, graceful shutdown.
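
The Bloom-filter pushdown listed under Joins follows a general pattern: build a compact filter from the small side's join keys, then discard probe-side rows that cannot match before they reach the join. A toy sketch of that idea in plain Python; the class, sizes, and hashing scheme are illustrative assumptions and do not reflect spark-rust's actual implementation:

```python
import hashlib
from collections.abc import Iterator


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: object) -> Iterator[int]:
        # Derive k bit positions by salting a stable hash with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(f"{i}:{key!r}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key: object) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: object) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


# Build side: a small dimension table (hypothetical data).
dim_keys = [3, 7, 42]
bloom = BloomFilter()
for k in dim_keys:
    bloom.add(k)

# Probe side: a larger fact table; rows are (join_key, payload).
fact_rows = [(k, f"row-{k}") for k in range(100)]

# Pre-filter the probe side; only the survivors go on to the exact join.
candidates = [row for row in fact_rows if bloom.might_contain(row[0])]
build_set = set(dim_keys)
joined = [row for row in candidates if row[0] in build_set]
```

False positives only cost a little extra work in the exact join; they never change the result, which is what makes the filter safe to push down to the scan.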
Live topology dashboard

Watch data flow across the cluster as queries run

A real-time topology page visualizes the cluster live: executors appear as nodes sized by active task count; shuffle edges light up with animated particles showing the direction and volume of in-flight transfers.

Per-executor counters: active tasks, shuffle read/write, result bytes
Cluster-wide cumulative pills: rows scanned, shuffle bytes, query count
Stale-ring indicators when an executor misses heartbeats
Playback pause for capturing the moment a query executes

Powered by a TopologyService gRPC server

The driver accepts executor registrations and per-heartbeat counter snapshots, fanned out via event logs to the history server's dashboard — so you can watch tasks, shuffle transfers, and cluster-wide totals update live as a query runs.
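
The registration and heartbeat flow described above can be pictured as a small in-memory registry on the driver. The sketch below is a hypothetical model, not the TopologyService API: the class names, counter fields, and the 10-second stale threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutorState:
    executor_id: str
    last_heartbeat: float                      # seconds, from any monotonic clock
    counters: dict[str, int] = field(default_factory=dict)


class TopologyRegistry:
    """Tracks executors and flags those that have missed heartbeats."""

    def __init__(self, stale_after_s: float = 10.0) -> None:
        self.stale_after_s = stale_after_s     # assumed threshold
        self.executors: dict[str, ExecutorState] = {}

    def register(self, executor_id: str, now: float) -> None:
        self.executors[executor_id] = ExecutorState(executor_id, last_heartbeat=now)

    def heartbeat(self, executor_id: str, now: float, counters: dict[str, int]) -> None:
        state = self.executors[executor_id]
        state.last_heartbeat = now
        state.counters = dict(counters)        # keep only the latest snapshot

    def stale(self, now: float) -> list[str]:
        return sorted(
            e.executor_id
            for e in self.executors.values()
            if now - e.last_heartbeat > self.stale_after_s
        )

    def cluster_total(self, counter: str) -> int:
        return sum(e.counters.get(counter, 0) for e in self.executors.values())


registry = TopologyRegistry()
registry.register("exec-1", now=0.0)
registry.register("exec-2", now=0.0)
registry.heartbeat("exec-1", now=12.0, counters={"active_tasks": 4, "shuffle_write_bytes": 2048})
```

Passing `now` in explicitly keeps the model deterministic; a dashboard would call `stale()` on each refresh to decide which executors get the stale-ring indicator.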

Four ways to use it

One Rust codebase, four entry points

1 · In-process Python (no daemon · DuckDB-shaped)

# pip install sparkrust
import sparkrust

con = sparkrust.connect()
con.sql("CREATE VIEW sales AS SELECT * FROM read_parquet('sales/*.parquet')")
df = con.sql("SELECT region, SUM(revenue) AS rev FROM sales GROUP BY region")

df.show()        # PySpark-style tabular print
rows = df.collect()   # list of tuples
table = df.arrow()   # pyarrow.Table (zero-copy ready)
pdf = df.toPandas()   # pandas DataFrame

2 · Single-node Spark Connect server (laptop-scale)

# same spark-connect-server binary, run locally
import pyspark.sql
spark = pyspark.sql.SparkSession.builder.remote("sc://localhost:50051").getOrCreate()

df = spark.read.parquet("s3://my-bucket/sales/")
df.groupBy("region").agg({"revenue": "sum"}).show()

3 · Distributed Spark Connect cluster (same code, point at a driver)

# identical pipeline, now executed across N executors
import pyspark.sql
spark = pyspark.sql.SparkSession.builder.remote("sc://driver.example.com:50051").getOrCreate()

4 · In-browser WebAssembly (zero install · runs in the tab)

// same SQL core compiled to wasm32, loaded in a Web Worker
import init, { run_sql_json } from "./pkg/spark_wasm.js";
await init();   // fetch + instantiate the .wasm engine

const rows = '[{"region":"west","revenue":10},{"region":"west","revenue":5}]';
const out = run_sql_json(
  "SELECT region, SUM(revenue) AS rev FROM data GROUP BY region", rows);
// no server, no network — your data never leaves the tab
Binary footprint

The whole engine in 73 MiB

No JVM, no bundled deps, no installer, just a binary dynamically linked against standard system libs. The Connect server and Python extension are different packagings of the same Rust engine.

| Artifact | Size |
|---|---|
| spark-connect-server (full engine + Connect protocol) | 72.9 MiB |
| sparkrust._sparkrust.abi3.so (in-process Python ext) | ~75 MiB |
| duckdb-1.5.4 (reference CLI) | 59 MiB |

For context, Apache Spark 4.2's distribution is ~330 MB compressed (~1 GB extracted) plus the JVM — reflecting its much broader feature surface.

How does this exist?

Built on Apache DataFusion

spark-rust uses a local fork of DataFusion (the Rust query engine that also powers ParadeDB, InfluxDB v3 and Sail) for Arrow-columnar vectorized execution, SQL parsing, optimization, and Parquet I/O. On top we add the Spark-shaped pieces: a Spark-dialect SQL layer, analyzer rewrites, spill operators, skew-join split, the gRPC shuffle service, the Connect server, MLlib over Connect, an Apache Iceberg / Delta metastore writer, a wasm32 build of the SQL core that runs Spark SQL client-side in the browser, and the topology dashboard.

On the roadmap

AI-native features ahead

Coming soon — the natural next direction once the core engine ships.

Natural-language SQL

Ask for queries in English, get them planned and run against your data.

Self-tuning execution

The engine learns from prior runs — cost-model population, plan-shape histograms, adaptive bloom-filter gating — and applies it to new queries automatically.

AI-assisted plan explanation

Ask why a query is slow; get a plain-language summary of the physical plan and per-operator metrics.

Try it right now — no install

Five demos run entirely in your browser

The exact same Rust engine, compiled to WebAssembly. Your data never leaves the tab.

spark-rust is currently in private development; open-source release is planned for the near future.