Arrow-native · embeddable · distributed

One Spark engine.
Embeddable and distributed.

A Rust implementation of Apache Spark with first-class Rust, Python, Java, Scala, SparkR-compatible R, TypeScript/Node, Ruby, and SQL APIs. Run it in-process with DuckDB-style simplicity, as a Spark Connect service, across a distributed executor cluster, or entirely in your browser via WebAssembly. Same code, same plans, four deployment shapes.

Try Spark SQL in your browser · Learn Spark SQL: hands-on tutorial · PySpark playground · Hands-on PySpark tutorial · Rust API playground · Scala API playground · TypeScript API playground · TPC-DS explorer · TPC-H explorer · BI dashboard · What-if simulator · SparkR API playground · Ruby API playground · Structured Streaming · ML classification · Geospatial queries · Mission control

Embedded

Arrow-native analytics inside Rust and Python processes

Connect

Spark-compatible service for PySpark and native clients

Distributed

Native executors, adaptive plans, shuffle, spill, and recovery

Browser

Spark SQL and DataFrames compiled to WebAssembly

Polyglot

Rust, Python, Java, Scala, R, TypeScript, Ruby, and SQL

The pitch

Embedded simplicity, distributed scale

spark-rust combines the in-process experience and analytical performance associated with DuckDB with the APIs and distributed deployment model associated with Apache Spark. It is competitive with DuckDB for embedded and single-node analytics and with Apache Spark in distributed environments. Every deployment, from laptop to cluster, uses the same analyzer, optimizer, physical plans, spill operators, streaming runtime, and ML APIs.

Native playground · Docker

Start the native notebook experience in one command

Install Docker Desktop or Docker Engine. The public multi-platform image automatically selects native AMD64 or ARM64 layers.

Recommended: medium edition · 12 GiB · six kernels

# Pull a reproducible release
docker pull ghcr.io/tomz/sparkrust-playground:medium-v0.42.1

# Bind to localhost because this is a local trial Hub
docker run --rm --name sparkrust-playground \
  --memory=12g --memory-swap=12g \
  -p 127.0.0.1:8000:8000 \
  ghcr.io/tomz/sparkrust-playground:medium-v0.42.1

Open http://127.0.0.1:8000. Sign in with username sparkrust and password sparkrust.

Running Docker on a remote server?

Keep the server port private and connect through SSH. Run the container command above on the server, then run this on your laptop:

ssh -N -L 8000:127.0.0.1:8000 YOUR_USER@YOUR_SERVER

Open http://127.0.0.1:8000 on your laptop and stop the tunnel with Ctrl-C. The browser traffic travels inside SSH; JupyterHub is never exposed publicly. This image is a single-user trial. For multiple untrusted users, use a production JupyterHub deployment with per-user OIDC/SSO, access policies, and HTTPS.
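Before opening the browser, it can help to confirm that something is actually listening on the forwarded port. The following is a minimal sketch using only the Python standard library; the helper name `port_is_open` is hypothetical and not part of any spark-rust tooling.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Calling `port_is_open()` on the laptop while the tunnel is running should return `True`; a `False` result suggests the tunnel or the container is not up. It only checks TCP reachability, not that the Hub itself is healthy.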

Medium · recommended

medium-v0.42.1 · 12 GiB. Python, Spark SQL, Scala, SparkR, TypeScript, and Ruby, plus seeded lakehouse catalogs.

Large · full SDK

large-v0.42.1 or v0.42.1 · 20 GiB. Everything in medium plus full and instant Rust kernels and JavaScript.

Small · minimal

small-v0.42.1 · 8 GiB. Python/PySpark and Spark SQL for the smallest native evaluation image.

Replace the image tag and memory limit in the command to choose another edition. latest tracks large; pin v0.42.1 or an edition-qualified tag for reproducible use. See the playground guide for JupyterLab and Spark Connect ports, custom credentials, and source builds.
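As a rough illustration of swapping the tag and memory limit, the sketch below assembles the `docker run` arguments for each edition. The tags and memory sizes are copied from the edition list above; the `EDITIONS` table and the `docker_run_args` helper are hypothetical names introduced only for this example.

```python
# Edition tags and memory limits as listed above (small 8 GiB, medium 12 GiB, large 20 GiB).
EDITIONS = {
    "small": ("small-v0.42.1", "8g"),
    "medium": ("medium-v0.42.1", "12g"),
    "large": ("large-v0.42.1", "20g"),
}
IMAGE = "ghcr.io/tomz/sparkrust-playground"


def docker_run_args(edition: str = "medium") -> list[str]:
    """Build the argument list for the local-trial `docker run` command."""
    tag, memory = EDITIONS[edition]  # raises KeyError for an unknown edition
    return [
        "docker", "run", "--rm", "--name", "sparkrust-playground",
        f"--memory={memory}", f"--memory-swap={memory}",
        "-p", "127.0.0.1:8000:8000",  # bind to localhost only, as in the command above
        f"{IMAGE}:{tag}",
    ]
```

Passing the result to `shlex.join` yields a copy-pasteable command line, and passing it to `subprocess.run` would start the container directly.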

In-process Python

Build the sparkrust wheel from source, then call sparkrust.connect(). No JVM, daemon, or network hop — the engine runs inside your Python process and reads Parquet directly with Arrow-native execution.

Competitive with DuckDB

Single-node Connect server

The same spark-connect-server binary on localhost. Point unmodified PySpark (3.5 and 4.0 dialects), notebooks, or JDBC clients at sc://localhost:15002 and share one engine across clients.

Full gRPC Connect protocol
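For readers unfamiliar with `sc://` addresses, here is a minimal sketch of how such a connection string breaks down. It follows the upstream Apache Spark Connect convention of `sc://host:port/;key=value;...` with 15002 as the default port; whether spark-rust honors every upstream parameter (such as `token` or `use_ssl`) is an assumption, and `parse_connect_url` is a hypothetical helper, not a spark-rust API.

```python
from urllib.parse import urlsplit

DEFAULT_PORT = 15002  # default Spark Connect port in upstream Apache Spark


def parse_connect_url(url: str) -> dict:
    """Split an sc:// connection string into host, port, and ';key=value' parameters."""
    parts = urlsplit(url)
    if parts.scheme != "sc" or not parts.hostname:
        raise ValueError(f"expected sc://host[:port], got {url!r}")
    params = {}
    # Parameters follow the slash as ';'-separated pairs, e.g. '/;use_ssl=true;token=abc'.
    for item in parts.path.lstrip("/").split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"malformed parameter {item!r}")
        params[key] = value
    return {"host": parts.hostname, "port": parts.port or DEFAULT_PORT, "params": params}
```

For `sc://localhost:15002` this returns the host `localhost`, port `15002`, and no parameters.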

Distributed cluster

Same code, same query plans, pointed at a cluster driver. Executors receive encoded query plans, run them on their own Arrow-batched DataFusion instances, and shuffle over gRPC + Arrow IPC. Unmodified PySpark connects without code changes.

Distributed query plans · gRPC shuffle
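The general idea behind a shuffled aggregation can be sketched in a few lines. This is a toy, single-process model under stated assumptions: plain Python dicts stand in for Arrow batches, function calls stand in for gRPC transfers, and all function names are hypothetical. It is not the engine's implementation.

```python
import zlib
from collections import defaultdict


def map_side(rows, num_partitions):
    """Partially aggregate one executor's rows, bucketed by target shuffle partition."""
    buckets = [defaultdict(int) for _ in range(num_partitions)]
    for region, revenue in rows:
        # A stable hash sends every row for a given key to the same partition.
        partition = zlib.crc32(region.encode("utf-8")) % num_partitions
        buckets[partition][region] += revenue
    return buckets


def reduce_side(partials):
    """Merge the partial aggregates that were shuffled to one partition."""
    totals = defaultdict(int)
    for partial in partials:
        for region, revenue in partial.items():
            totals[region] += revenue
    return dict(totals)


def distributed_sum(executor_inputs, num_partitions=2):
    """SUM(revenue) GROUP BY region across several simulated executors."""
    shuffled = [map_side(rows, num_partitions) for rows in executor_inputs]
    result = {}
    for p in range(num_partitions):
        # Each partition sees every executor's bucket p and nothing else.
        result.update(reduce_side(out[p] for out in shuffled))
    return result
```

Because each key lands in exactly one partition, the per-partition results can be combined without further merging, which is the property a hash shuffle is meant to provide.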

In-browser WebAssembly

The same SQL core compiled to wasm32 and run client-side in a Web Worker — no server, no network, your data never leaves the tab. Spark dialect rewrites and UDFs all execute in WASM against in-memory Arrow batches.

Zero install · runs in the tab

Prototype in-process on a laptop, then deploy the exact same SQL or DataFrame code to a cluster through Spark Connect — without changing a line.

Language APIs

One relational engine, the same workloads in every API

Embed the engine directly from Rust or Python, or use Spark Connect from Rust, PySpark, Java, Scala, SparkR-compatible R, TypeScript/Node, and Ruby. Rust and TypeScript also run against the embedded browser-WASM engine. SQL is the shared semantic layer across every surface.

Rust

Native SparkSession or remote SparkConnectSession, with the same lazy DataFrames, Columns, typed FromSparkRow decoding, and Arrow results. The Rust API also runs interactively in browser WebAssembly.

Embedded · Spark Connect · browser WASM

Python & PySpark

Embedded sparkrust with lazy relational methods, Arrow/pandas results, and UDFs; or unmodified PySpark 3.5/4.0 over Connect.

Embedded or Connect

Java

Spark-shaped DataFrame, typed relational Dataset, Arrow results, DataFrameWriter, catalog operations, TLS/auth, cancellation, and Connect transport.

Java 17 · Connect

Scala 2.13

Scala facade over the JVM transport with DataFrames, typed relational decoding, writers, catalogs, and familiar column/function syntax.

Scala · Connect

SparkR-compatible R

SQL, lazy DataFrame verbs, grouping, joins, actions, Arrow LocalRelation upload, writes, and %>% workflows.

R · Connect

TypeScript / Node.js

Strict TypeScript SDK with gRPC, TLS/auth, Spark Connect plan builders, chunked Arrow IPC decoding, DataFrames, Columns, and actions.

Node 20+ · Connect

Ruby

Native Ruby DataFrames and Columns with TLS/auth, chunked Arrow IPC decoding, joins, aggregation, actions, and Spark-style horizontal or vertical result display.

Ruby · Connect

Rust · embedded · Arrow-native

use spark_rust::prelude::*;

let spark = SparkSession::builder().build();
let rows = spark.table("sales").await?
  .filter(col("amount").geq(lit(3_i64)))?
  .group_by([col("region")])
  .agg([sum(col("amount")).alias("total")])?
  .order_by([col("total").desc()])?
  .collect_as::<(String, i64)>().await?;

TypeScript · Connect or embedded WASM

import { SparkSession, col, lit, sum }
  from "@spark-rust/connect";

const spark = SparkSession.builder()
  .remote("sc://localhost:15002").getOrCreate();
const rows = await spark.table("sales")
  .filter(col("amount").geq(lit(3)))
  .groupBy("region")
  .agg(sum(col("amount")).alias("total"))
  .orderBy(col("total").desc()).collect();

Side-by-side, not hand-waved

The same filter/aggregate, join, and null/distinct workloads are implemented in the established Rust, Python, Java, Scala, SparkR, TypeScript, Ruby, and SQL APIs, producing identical results through one shared relational engine. The same plans can execute embedded, through Spark Connect, or across distributed executors.
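To make the expected result of the filter/aggregate workload concrete, here is a plain-Python reference of what the Rust and TypeScript snippets above compute: keep rows with `amount >= 3`, sum `amount` per `region`, and order by the total descending. It is a sketch of the intended semantics only, assuming SQL-style handling of NULL amounts; the function name `region_totals` is hypothetical.

```python
from collections import defaultdict


def region_totals(sales, min_amount=3):
    """Reference semantics: filter amount >= min_amount, group by region, sum, order desc."""
    totals = defaultdict(int)
    for row in sales:
        amount = row["amount"]
        # A NULL amount fails the comparison, so the row is dropped (SQL semantics).
        if amount is not None and amount >= min_amount:
            totals[row["region"]] += amount
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
```

A reference like this can serve as an oracle when comparing outputs across language APIs on small inputs. Ties in the total are left in insertion order here, whereas SQL leaves their order unspecified.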

Performance positioning

Competitive from in-process analytics to distributed SQL

spark-rust is engineered to compete with DuckDB for embedded and single-node analytical workloads and with Apache Spark in distributed environments. The same Arrow-native engine, optimizer, SQL semantics, and DataFrame APIs run in every deployment shape.

In-process and single node

Vectorized Arrow execution, Parquet pushdown, runtime filtering, native joins, aggregation, windows, and spill deliver a DuckDB-competitive local analytics experience without a JVM or service hop.

Competitive with DuckDB
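Spill is what lets an operator finish when its input does not fit in memory. The sketch below shows the general technique with an external merge sort: sort bounded chunks, write each to disk as a run, then stream-merge the runs. It is an illustration under stated assumptions only: `pickle` and temporary files stand in for the Arrow IPC files the engine is described as using, and `external_sort` is a hypothetical name.

```python
import heapq
import pickle
import tempfile


def _write_run(sorted_chunk):
    """Spill one sorted chunk to a temporary file and rewind it for reading."""
    f = tempfile.TemporaryFile()
    for item in sorted_chunk:
        pickle.dump(item, f)
    f.seek(0)
    return f


def _read_run(f):
    """Stream items back from a spilled run, closing the file when done."""
    try:
        while True:
            yield pickle.load(f)
    except EOFError:
        pass  # end of the run
    finally:
        f.close()


def external_sort(items, max_in_memory=1000):
    """Yield items in sorted order while buffering at most max_in_memory at a time."""
    runs, chunk = [], []
    for item in items:
        chunk.append(item)
        if len(chunk) >= max_in_memory:
            runs.append(_write_run(sorted(chunk)))
            chunk = []
    # The final partial chunk is merged straight from memory.
    streams = [_read_run(f) for f in runs] + [iter(sorted(chunk))]
    yield from heapq.merge(*streams)
```

Memory use is bounded by the chunk size plus one buffered item per run, at the cost of extra disk I/O.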

Distributed environment

Native executors combine adaptive planning, Arrow IPC shuffle, broadcast and sort-merge joins, skew handling, external shuffle, bounded spill, retries, and executor recovery.

Competitive with Apache Spark

One workload, one engine

Develop with embedded Rust or Python, serve the same plan through Spark Connect, and scale it across executors without changing SQL, DataFrames, formats, catalogs, or application semantics.

Local-to-cluster continuity

Major features

A broad Spark-compatible analytics platform

Spark SQL — ANSI SQL plus Spark extensions, complex types, windows, grouping sets, CTEs, generators, DDL/DML, intervals, JSON, variant, sketches, and Spark functions.

DataFrames & Datasets — window functions, UDFs, grouping sets, ROLLUP/CUBE, lateral & recursive CTEs.

Spark Connect — full gRPC server; unmodified PyPI pyspark (3.5 & 4.0) connects and runs. Opt-in bearer-token auth + TLS.

HiveServer2 JDBC/ODBC — Thrift TCLIService (binary + HTTP + TLS + SASL); unmodified beeline / DBeaver / Tableau connect over jdbc:hive2://.

MLlib via Connect — LogisticRegression, RandomForest, k-means, FPGrowth, Pipelines, CrossValidator, model persistence.

Joins — broadcast hash, sort-merge, shuffle hash, broadcast nested loop, semi/anti, mark; adaptive Bloom-filter pushdown.

Adaptive Query Execution — partition coalescing, skew-join split, runtime broadcast switch, SMJ→BHJ swap.

Operator & shuffle spill — aggregate, sort, hash join, cross join, window, symmetric hash join → Arrow IPC on disk.

Structured Streaming — watermarks, session windows, state stores (RocksDB/file/memory), Kafka/socket/rate sources.

Storage & lakehouse formats — Parquet, ORC, CSV, JSON, Avro with predicate pushdown & bloom pruning; Apache Iceberg (REST/Glue/Hive catalogs, row-level DELETE/UPDATE/MERGE, merge-on-read, snapshot writes), Delta Lake (Liquid Clustering, OPTIMIZE/ZORDER, MERGE, DELETE/UPDATE, VACUUM, Change Data Feed), and Apache Hudi (COW/MOR snapshots, log-block updates/deletes, read-optimized, time travel, incremental reads, atomic local COW writes).

GPU acceleration (opt-in) — CUDA star-join & dense-group-by (spark-gpu) and RAPIDS cuDF GPU-native Parquet scan (spark-cudf), bit-exact vs CPU.

Object stores — S3, Azure Blob, GCS, local FS; metastore + DDL persistence across restarts.

Statistics & cost model — column histograms, ANALYZE TABLE FOR COLUMNS, planner hooks consuming stats.

Native polyglot playground — source-built JupyterHub with seven real kernels, validated notebooks, persistent cross-language magics, live cell progress, and explicit onboarding or control-plane modes backed by one catalog server.

Deployment — standalone gRPC executors, Kubernetes manifests, dynamic allocation, KEDA, graceful shutdown.
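The partition coalescing listed under Adaptive Query Execution can be illustrated with a small greedy pass over observed shuffle partition sizes. This is a sketch of the general idea as it is known from Apache Spark's AQE, not spark-rust's planner code; the function name and the target size are hypothetical.

```python
def coalesce_partitions(sizes, target_bytes):
    """Greedily group adjacent shuffle partitions so each group stays near target_bytes.

    `sizes` holds the observed byte size of each partition; the return value is a
    list of groups, each a list of original partition indices.
    """
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

Only adjacent partitions are merged, so each downstream task still reads a contiguous range of shuffle output. A partition larger than the target stays in a group of its own, which is where a skew-join split would take over.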

Live topology dashboard

Watch data flow across the cluster as queries run

A real-time topology page visualizes the cluster live: executors appear as nodes sized by active task count; shuffle edges light up with animated particles showing the direction and volume of in-flight transfers. One command can start a detached executor cluster that continuously runs analytical workloads through this path.

Per-executor counters: active tasks, shuffle read/write, result bytes

Cluster-wide cumulative pills: rows scanned, shuffle bytes, query count

Stale-ring indicators when an executor misses heartbeats

Playback pause for capturing the moment a query executes

Powered by a TopologyService gRPC server

The driver accepts executor registrations and per-heartbeat counter snapshots and fans them out via event logs to the history server's dashboard, so you can watch task counters, shuffle transfers, and cluster-wide totals update live. The UI combines Spark-style jobs, stages, storage, environment, executors, SQL, and streaming views with native Topology and Alerts pages.
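The bookkeeping behind heartbeat snapshots and stale indicators can be sketched as a small in-memory registry. Everything here is hypothetical: the class names, counter names, heartbeat interval, and missed-heartbeat threshold are illustrative assumptions, not the TopologyService API.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutorState:
    last_heartbeat: float                      # timestamp of the latest snapshot
    counters: dict = field(default_factory=dict)


class TopologyRegistry:
    """Tracks the latest counter snapshot per executor and flags missed heartbeats."""

    def __init__(self, heartbeat_interval: float = 5.0, missed_before_stale: int = 3):
        self.stale_after = heartbeat_interval * missed_before_stale
        self.executors: dict[str, ExecutorState] = {}

    def heartbeat(self, executor_id: str, now: float, counters: dict) -> None:
        # The first heartbeat doubles as registration in this simplified model.
        self.executors[executor_id] = ExecutorState(now, dict(counters))

    def stale(self, now: float) -> list[str]:
        """Executors whose last heartbeat is older than the allowed window."""
        return sorted(
            eid for eid, state in self.executors.items()
            if now - state.last_heartbeat > self.stale_after
        )

    def totals(self) -> dict:
        """Cluster-wide sums over the latest snapshot from every executor."""
        combined = {}
        for state in self.executors.values():
            for name, value in state.counters.items():
                combined[name] = combined.get(name, 0) + value
        return combined
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the staleness rule easy to test.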

Four deployment shapes

One engine, multiple clients

The snippets below show deployment topology. The language section above covers the Rust, Python, Java, Scala, SparkR, TypeScript, Ruby, and SQL APIs that target those deployments.

1 · In-process Python (no daemon · DuckDB-shaped)

# pip install the sparkrust wheel you built from source
import sparkrust

con = sparkrust.connect()
con.sql("CREATE VIEW sales AS SELECT * FROM read_parquet('sales/*.parquet')")
df = con.sql("SELECT region, SUM(revenue) AS rev FROM sales GROUP BY region")

df.show()        # PySpark-style tabular print
rows = df.collect()   # list of tuples
table = df.arrow()   # pyarrow.Table (zero-copy ready)
pdf = df.toPandas()   # pandas DataFrame

2 · Single-node Spark Connect server (laptop-scale)

# same spark-connect-server binary, run locally
import pyspark.sql
spark = pyspark.sql.SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.read.parquet("s3://my-bucket/sales/")
df.groupBy("region").agg({"revenue": "sum"}).show()

3 · Distributed Spark Connect cluster (same code, point at a driver)

# identical pipeline, now executed across N executors
import pyspark.sql
spark = pyspark.sql.SparkSession.builder.remote("sc://driver.example.com:15002").getOrCreate()

4 · In-browser WebAssembly (zero install · runs in the tab)

// same SQL core compiled to wasm32, loaded in a Web Worker
import init, { run_sql_json } from "./pkg/spark_wasm.js";
await init();   // fetch + instantiate the .wasm engine

const rows = '[{"region":"west","revenue":10},{"region":"west","revenue":5}]';
const out = run_sql_json(
  "SELECT region, SUM(revenue) AS rev FROM data GROUP BY region", rows);
// no server, no network — your data never leaves the tab

Native architecture

One Arrow-native engine

The Connect server, embedded Python extension, native Rust API, distributed executors, and browser WebAssembly build are different packagings of the same Rust query engine. They share Spark SQL semantics, optimizer rules, physical operators, formats, catalogs, and UDFs without bundling a JVM.

Foundation

Built on Apache DataFusion

spark-rust builds on DataFusion and Apache Arrow for columnar execution, SQL planning, optimization, and Parquet I/O. It adds the Spark dialect and APIs, adaptive distributed execution, native shuffle and spill, Spark Connect, HiveServer2, MLlib, Structured Streaming, lakehouse federation, GPU acceleration, browser WebAssembly, and operational dashboards.

Native polyglot learning

Seven real Jupyter kernels over one spark-rust engine

Run the published large image for embedded Python and Spark SQL plus Scala 2.13, SparkR-compatible R, Rust, TypeScript, Ruby, and JavaScript over a private Connect service. The notebook suite covers language APIs, TPC-H/TPC-DS, ML, streaming, concurrency, Delta Liquid Clustering, and seven-catalog federation.

Validated notebooks, not screenshots

Every code cell runs against its real kernel in the playground gate. Persistent polyglot magics, live per-cell progress, anchored Run All status, and bounded output make long courses observable and repeatable. Use the published Docker images for a source-free native trial, or build examples/playground locally when developing the image itself.

Try it right now — no install

Explore the engine entirely in your browser

The exact same Rust engine, compiled to WebAssembly. Your data never leaves the tab.

Spark SQL console · Hands-on SQL tutorial · PySpark playground · Hands-on PySpark tutorial · Rust API playground · Scala API playground · TypeScript API playground · TPC-DS explorer · TPC-H explorer · BI dashboard · What-if simulator · SparkR API playground · Ruby API playground · Structured Streaming · ML classification · Geospatial queries · Mission control

Public browser demos and versioned native playground images are available now.