Advanced MVP · Rust-native

One Spark engine.
Embeddable and distributed.

A Rust implementation of Apache Spark that runs as a single 73 MiB binary in-process — think DuckDB — or scales out across a cluster as a Spark-compatible distributed engine. Same code, same plans, three deployment shapes.

73 MiB: single dynamically-linked binary, no JVM, no installer
75.5% of Apache Spark's own SQL test suite (7,177 / 9,508)
99/99 TPC-DS queries at SF=1000, single-node
3 modes: in-process · single-node Connect · distributed cluster

The pitch

Standing on the shoulders of DuckDB and Apache Spark

DuckDB set the bar for embedded analytics: a single binary with vectorized execution. Apache Spark defined the API surface and deployment model for distributed compute. spark-rust puts the same engine in both lanes: one Rust query engine, three packagings, all sharing the same analyzer, optimizer rules, physical plans, spill operators, and MLlib bindings.

In-process Python

pip install sparkrust, then sparkrust.connect(). No JVM, no daemon, no network — the engine runs inside your Python process and reads parquet directly. DuckDB-shaped ergonomics with the Spark API surface.

~2× DuckDB wall time @ SF=1000

Single-node Connect server

The same spark-connect-server binary on localhost. Point unmodified PySpark (3.5 and 4.0 dialects), notebooks, or JDBC clients at sc://localhost:50051 and share one engine across clients.

Full gRPC Connect protocol

Distributed cluster

Same code, same query plans, pointed at a cluster driver. Executors receive Substrait-encoded plans, run them with their own Arrow-batched DataFusion, and shuffle over gRPC + Arrow IPC. Unmodified PySpark just works.

Substrait plans · gRPC shuffle
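
To make the shuffle step concrete, here is a minimal sketch of hash partitioning, the generic technique distributed engines use to route rows with the same key to the same downstream executor. It is a plain-Python illustration, not spark-rust's implementation: the function name `partition_rows`, the CRC32 hash, and the tuple row layout are assumptions made for this example, and the real engine ships Arrow batches over gRPC rather than Python tuples.

```python
import zlib
from collections import defaultdict

def partition_rows(rows, key_index, num_partitions):
    """Group rows by crc32(key) % num_partitions (hypothetical helper)."""
    partitions = defaultdict(list)
    for row in rows:
        key = str(row[key_index]).encode("utf-8")
        # CRC32 is stable across processes, unlike Python's salted hash().
        pid = zlib.crc32(key) % num_partitions
        partitions[pid].append(row)
    return dict(partitions)

rows = [("emea", 10), ("apac", 7), ("emea", 5), ("amer", 3)]
parts = partition_rows(rows, key_index=0, num_partitions=4)

# Every row with the same key lands in the same partition, so a
# downstream aggregate such as SUM(revenue) GROUP BY region can run
# independently on each partition.
for pid, chunk in sorted(parts.items()):
    print(pid, chunk)
```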

Prototype in-process on a laptop, then deploy the exact same SQL or DataFrame code to a cluster through Spark Connect — without changing a line.

The benchmark headline

Full TPC-DS 99-query sweep at 1.8–2.9× DuckDB's wall time

Single-node, 32 cores, 251 GB RAM, 1.2 TB local NVMe, parquet on disk. Cold cache, single-shot, full 99-query sweep at four scale factors.

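| Scale factor | spark-rust total | DuckDB total | Ratio | spark-rust queries completed | DuckDB queries completed |
|---|---|---|---|---|---|
| SF=1 | 18.1s | 6.2s | 2.90× | 99/99 | 99/99 |
| SF=10 | 61.4s | 23.0s | 2.67× | 99/99 | 99/99 |
| SF=100 | 358.7s | 201.5s | 1.78× | 99/99 | 99/99 |
| SF=1000 | 4,146.6s | 1,859.0s | 2.23× | 99/99 | 98/99* |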

* At SF=1000, DuckDB hits a working-set limit on Q85 where the intermediate exceeds the 547 GB local spill disk on this hardware. DuckDB's optimizer is excellent and faster on most queries in the sweep; this case is a plan-shape difference where spark-rust's join-order selectivity keeps the intermediate small enough to finish in 8s without spilling. Full per-query table in docs/tpcds-single-node-benchmark.md.
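
The Ratio column is spark-rust's total divided by DuckDB's total. The short check below recomputes it from the rounded totals quoted in the benchmark table. The results agree with the published ratios to within about 0.02; any residual difference presumably comes from rounding of the totals (an assumption, since the unrounded timings are not shown here).

```python
# Totals in seconds, copied from the benchmark table: (spark-rust, DuckDB).
totals = {
    1: (18.1, 6.2),
    10: (61.4, 23.0),
    100: (358.7, 201.5),
    1000: (4146.6, 1859.0),
}

# Ratio = spark-rust wall time / DuckDB wall time, per scale factor.
ratios = {sf: sr / duck for sf, (sr, duck) in totals.items()}

for sf, ratio in ratios.items():
    print(f"SF={sf}: {ratio:.2f}x")
```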

Where it is in its lifecycle

All the major Spark capabilities, verified end-to-end

Spark SQL — full ANSI surface plus Spark extensions, 75.5% of Spark's golden test suite.
DataFrames & Datasets — window functions, UDFs, grouping sets, ROLLUP/CUBE, lateral & recursive CTEs.
Spark Connect — full gRPC server; unmodified PyPI pyspark (3.5 & 4.0) connects and runs.
MLlib via Connect — LogisticRegression, RandomForest, k-means, FPGrowth, Pipelines, CrossValidator, model persistence.
Joins — broadcast hash, sort-merge, shuffle hash, broadcast nested loop, semi/anti, mark; adaptive Bloom-filter pushdown.
Adaptive Query Execution — partition coalescing, skew-join split, runtime broadcast switch, SMJ→BHJ swap.
Operator & shuffle spill — aggregate, sort, hash join, cross join, window, symmetric hash join → Arrow IPC on disk.
Structured Streaming — watermarks, session windows, state stores (RocksDB/file/memory), Kafka/socket/rate sources.
Storage formats — Parquet, ORC, Iceberg, Delta, CSV, JSON, Avro; predicate pushdown & bloom pruning.
Object stores — S3, Azure Blob, GCS, local FS; metastore + DDL persistence across restarts.
Statistics & cost model — column histograms, ANALYZE TABLE FOR COLUMNS, planner hooks consuming stats.
Deployment — standalone gRPC executors, Kubernetes manifests, dynamic allocation, KEDA, graceful shutdown.
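
As an illustration of the partition-coalescing item in the list above, the sketch below greedily merges adjacent small shuffle partitions until a target size would be exceeded. This is the general idea behind coalescing in adaptive execution, written as a standalone example; the function name `coalesce_partitions` and the 64 MiB target are assumptions for this sketch and do not describe spark-rust's actual code or defaults.

```python
def coalesce_partitions(sizes, target_bytes):
    """Greedily merge adjacent partitions; returns lists of partition indices."""
    groups, current, current_size = [], [], 0
    for idx, size in enumerate(sizes):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups

MiB = 1024 * 1024
sizes = [10 * MiB, 20 * MiB, 40 * MiB, 5 * MiB, 70 * MiB, 1 * MiB]
groups = coalesce_partitions(sizes, target_bytes=64 * MiB)
print(groups)  # [[0, 1], [2, 3], [4], [5]]
```

Only adjacent partitions are merged, which keeps each coalesced group a contiguous range of the original shuffle output. An oversized partition (index 4 here) stays on its own; splitting it is the job of skew handling, not coalescing.
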

Live topology dashboard

Watch data flow across the cluster as queries run

A real-time topology page visualizes the cluster live: executors appear as nodes sized by active task count; shuffle edges light up with animated particles showing the direction and volume of in-flight transfers.

Per-executor counters: active tasks, shuffle read/write, result bytes
Cluster-wide cumulative pills: rows scanned, shuffle bytes, query count
Stale-ring indicators when an executor misses heartbeats
Playback pause for capturing the moment a query executes

Powered by a TopologyService gRPC server

The driver accepts executor registrations and per-heartbeat counter snapshots, then fans them out via JSONL event logs to the history server's dashboard. See crates/spark-history-server/src/dashboard.html and the topology-demo-* workload generators.
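
The heartbeat-to-JSONL flow described above can be sketched in a few lines. Everything specific here is hypothetical: the field names, the `heartbeat_event` and `stale_executors` helpers, and the 10-second timeout are invented for illustration and are not the TopologyService schema. The sketch only shows the shape of the idea: one JSON object per line per heartbeat, and a staleness check based on the most recent timestamp seen for each executor.

```python
import json

def heartbeat_event(executor_id, ts, active_tasks, shuffle_read, shuffle_write):
    """Serialize one counter snapshot as a single JSONL line (hypothetical schema)."""
    return json.dumps({
        "executor_id": executor_id,
        "ts": ts,
        "active_tasks": active_tasks,
        "shuffle_read_bytes": shuffle_read,
        "shuffle_write_bytes": shuffle_write,
    })

def stale_executors(jsonl_lines, now, timeout_s):
    """Return sorted ids of executors whose latest heartbeat is older than timeout_s."""
    last_seen = {}
    for line in jsonl_lines:
        event = json.loads(line)
        eid = event["executor_id"]
        last_seen[eid] = max(last_seen.get(eid, 0.0), event["ts"])
    return sorted(eid for eid, ts in last_seen.items() if now - ts > timeout_s)

log = [
    heartbeat_event("exec-1", 100.0, 4, 1024, 2048),
    heartbeat_event("exec-2", 101.0, 2, 512, 0),
    heartbeat_event("exec-1", 110.0, 3, 4096, 2048),
]
stale = stale_executors(log, now=115.0, timeout_s=10.0)
print(stale)  # ['exec-2']
```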

Three ways to use it

One Rust codebase, three entry points

1 · In-process Python (no daemon, DuckDB-shaped)

# pip install sparkrust
import sparkrust

con = sparkrust.connect()
con.sql("CREATE VIEW sales AS SELECT * FROM read_parquet('sales/*.parquet')")
df = con.sql("SELECT region, SUM(revenue) AS rev FROM sales GROUP BY region")

df.show()        # PySpark-style tabular print
rows = df.collect()   # list of tuples
table = df.arrow()   # pyarrow.Table (zero-copy ready)
pdf = df.toPandas()   # pandas DataFrame

2 · Single-node Spark Connect server (laptop-scale)

# same spark-connect-server binary, run locally
import pyspark.sql
spark = pyspark.sql.SparkSession.builder.remote("sc://localhost:50051").getOrCreate()

df = spark.read.parquet("s3://my-bucket/sales/")
df.groupBy("region").agg({"revenue": "sum"}).show()

3 · Distributed Spark Connect cluster (same code, point at a driver)

# identical pipeline, now executed across N executors
import pyspark.sql
spark = pyspark.sql.SparkSession.builder.remote("sc://driver.example.com:50051").getOrCreate()
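
Since the only difference between the single-node and cluster snippets is the Connect URL, one way to keep pipeline code identical across environments is to resolve that URL from configuration. The helper below is a minimal sketch under stated assumptions: `connect_url` and the `SPARKRUST_REMOTE` variable name are invented for this example and are not documented spark-rust settings.

```python
import os

def connect_url(default="sc://localhost:50051"):
    """Return the Spark Connect URL, preferring a hypothetical env override."""
    # SPARKRUST_REMOTE is an illustrative variable name, not a documented setting.
    return os.environ.get("SPARKRUST_REMOTE", default)

url = connect_url()
print(url)

# With PySpark installed, the session is then created exactly as in the
# snippets above, with the resolved URL passed to remote():
#   spark = pyspark.sql.SparkSession.builder.remote(url).getOrCreate()
```

On a laptop the default points at the local server; in a deployment, setting the variable to the driver address (for example `sc://driver.example.com:50051`) switches the same script to the cluster.
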

Binary footprint

The whole engine in 73 MiB

No JVM, no bundled dependencies, no installer: the binary is dynamically linked against standard system libraries. The Connect server and the Python extension are different packagings of the same Rust engine.

| Artifact | Size |
|---|---|
| spark-connect-server (full engine + Connect protocol) | 72.9 MiB |
| sparkrust._sparkrust.abi3.so (in-process Python extension) | ~75 MiB |
| duckdb-1.5.2 (reference CLI) | 58.9 MiB |

For context, Apache Spark 4.2's distribution is ~330 MB compressed (~1 GB extracted) plus the JVM — reflecting its much broader feature surface.

How does this exist?

Built on Apache DataFusion

spark-rust uses a local fork of DataFusion, the Rust query engine that also powers ParadeDB, InfluxDB v3 and Sail, for Arrow-columnar vectorized execution, SQL parsing, optimization, and Parquet I/O. On top we add the Spark-shaped pieces: a Spark-dialect SQL layer, analyzer rewrites, spill operators, skew-join split, the gRPC shuffle service, the Connect server, MLlib over Connect, a metastore writer, and the topology dashboard. In total that is ~250k LOC across 22+ crates.

On the roadmap

AI-native features ahead

Not in the v0 release — the natural next direction once the core engine ships.

Natural-language SQL

Ask for queries in English, get them planned and run against your data.

Self-tuning execution

The engine learns from prior runs — cost-model population, plan-shape histograms, adaptive bloom-filter gating — and applies it to new queries automatically.

AI-assisted plan explanation

Ask why a query is slow; get a plain-language summary of the physical plan and per-operator metrics.

Try it right now — no install

Both demos run entirely in your browser

The exact same Rust engine, compiled to WebAssembly. Your data never leaves the tab.

spark-rust is currently in private development; open-source release is planned for the near future.