I ran a GroupBy on 100 million rows last week. Pandas churned for 107 seconds before dying with an out-of-memory error. Polars finished in 11 seconds. DuckDB finished in 9 seconds — and it never loaded the full dataset into RAM. Same data, same machine, three completely different experiences.
This isn't a contrived benchmark. It's the daily reality for anyone working with datasets that outgrew "fits in a Jupyter notebook" territory. And in 2026, the tooling has finally caught up to the problem.
The Three Contenders
Three tools. Three different architectures. Three different philosophies. Before we argue about ergonomics, let's get the numbers out of the way.
The Benchmarks That Actually Matter
Most benchmark articles test trivial operations on small datasets. That's not useful. Here's what happens when you push these tools on operations that reflect real data engineering work.
GroupBy Aggregation (100M Rows)
This is the bread and butter of analytics — "give me the sum of sales by region by quarter."
| Tool | Time | Memory Usage | Notes |
|---|---|---|---|
| Pandas | 100s+ or OOM crash | Entire dataset in RAM | Single-threaded |
| Polars | Under 30s | 30-60% less than Pandas | All cores used |
| DuckDB | Under 30s | Lowest (spill-to-disk) | Vectorized execution |
Memory Efficiency
This is where things get interesting. Polars showed 30-60% lower peak memory than Pandas on large joins. But DuckDB? DuckDB uses the least memory of all three because it can spill intermediate results to disk automatically. You never hit an OOM error — it just gets slower.
Polars achieves its memory efficiency differently: through lazy evaluation. When you build a Polars query lazily, it optimizes the execution plan before running anything. It pushes filters down, eliminates unnecessary columns, and can process data in streaming batches when you call collect(engine="streaming").
How They Actually Work (The Architecture That Matters)
Pandas: The Incumbent
Pandas loads your entire dataset into memory as a collection of NumPy arrays (or, since version 3.0, optionally as Arrow arrays). Every operation creates a copy of the data (though 3.0's copy-on-write default helps here). It's single-threaded — your 16-core machine uses one core for .groupby().
Pandas 3.0, released January 21, 2026, made real improvements:
- Copy-on-Write is now default — eliminates the infamous SettingWithCopyWarning and reduces unnecessary memory copies
- PyArrow string backend — string operations like .str.contains() run 5-10x faster
- Arrow PyCapsule interface — enables zero-copy data exchange with Polars and DuckDB
- Microsecond datetime default — fixes the old nanosecond limitation that broke dates outside 1678-2262
These are genuine improvements. But they don't change the fundamental architecture: pandas is still single-threaded, still loads everything into memory, and still copies data more than it needs to.
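To see what Copy-on-Write buys you in practice, here is a small sketch; it opts into the behavior explicitly on pandas 2.x, where it is not yet the default:

```python
import pandas as pd

# Copy-on-Write is the default in pandas 3.0; on 2.x it's opt-in
if pd.__version__.startswith("2"):
    pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
subset = df[["a"]]       # no eager copy happens here
subset.loc[0, "a"] = 99  # this write copies only the affected data

print(df.loc[0, "a"])    # original df is untouched: still 1
```

The write to `subset` can no longer silently mutate `df`, which is exactly the class of bug SettingWithCopyWarning used to (noisily and unreliably) flag.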
Polars: The Speed Demon
Polars is written in Rust and exposed to Python via PyO3. It uses Apache Arrow's columnar memory format natively. Every operation parallelizes across all available CPU cores automatically.
The key insight in Polars is its lazy evaluation engine. Instead of executing operations immediately (like pandas), you build a query plan:
```python
import polars as pl

# This builds a plan — nothing executes yet
result = (
    pl.scan_parquet("sales_data/*.parquet")
    .filter(pl.col("year") >= 2024)
    .group_by("region", "quarter")
    .agg(pl.col("revenue").sum())
    .sort("revenue", descending=True)
    .collect()  # NOW it executes, with optimizations applied
)
```
Before execution, Polars optimizes the plan: it pushes the year filter down to the file scan (so it never reads rows it doesn't need), projects only the columns used in the query, and parallelizes the group-by across cores.
For datasets larger than RAM, you swap .collect() for .collect(engine="streaming"), and Polars processes the data in batches without loading everything at once. There's a documented case of processing a 31GB CSV file on a machine with far less RAM.
DuckDB: The SQL Engine
DuckDB is not a DataFrame library. It's an embedded analytical database — "SQLite for analytics." It runs inside your Python process (no server, no network calls) and executes SQL queries using a vectorized, columnar engine.
```python
import duckdb

# Query Parquet files directly — no loading step
result = duckdb.sql("""
    SELECT region, quarter, SUM(revenue) AS total_revenue
    FROM 'sales_data/*.parquet'
    WHERE year >= 2024
    GROUP BY region, quarter
    ORDER BY total_revenue DESC
""").fetchdf()  # Returns a pandas DataFrame
```
DuckDB's magic is that it queries files directly — Parquet, CSV, JSON, even remote files on S3 — without an explicit loading step. Its out-of-core processing means it automatically spills data to disk when memory fills up. You don't configure this. It just works.
Version 1.5, released March 2026, brought further improvements. The 1.4 LTS release added AES-256 encryption at rest — a requirement for healthcare, finance, and legal use cases that previously forced teams onto heavier solutions.
The Zero-Copy Bridge: Apache Arrow
Here's what most comparison articles miss entirely: you don't have to choose just one.
All three tools now speak Apache Arrow, which means data can move between them with zero serialization cost:
```python
import duckdb
import polars as pl

# Start with DuckDB for heavy SQL aggregation
heavy_result = duckdb.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM 'transactions/*.parquet'
    GROUP BY customer_id
    HAVING total > 10000
""")

# Convert to Polars for complex transformations — zero copy via Arrow
polars_df = heavy_result.pl()

# Apply Polars-specific operations: keep the top 100 customers by total
enriched = (
    polars_df
    .with_columns(
        pl.col("total").rank(descending=True).alias("rank"),  # rank 1 = largest
        pl.col("total").log().alias("log_total"),
    )
    .filter(pl.col("rank") <= 100)
)

# Convert to pandas for sklearn or visualization — Arrow-backed
pandas_df = enriched.to_pandas()
```
The .pl() call converts a DuckDB result to a Polars DataFrame via Arrow with zero copy. The .to_pandas() call from Polars is also Arrow-backed. No serialization. No data duplication. The same memory buffers get reused.
DuckDB can also query Polars DataFrames directly:
```python
import duckdb
import polars as pl

# Create a Polars DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "score": [95, 87, 92],
})

# Query it with SQL — zero copy, no data movement
result = duckdb.sql("SELECT * FROM df WHERE score > 90")
```
This is the real power move in 2026: use DuckDB for SQL-heavy work, Polars for DataFrame transformations, and pandas for the last mile (visualization, sklearn integration). Arrow makes the handoffs free.
What Other Articles Get Wrong
"Polars is just faster pandas"
No. Polars has a fundamentally different execution model. Lazy evaluation, query optimization, and streaming are not speed improvements on the same approach — they're a different approach. You can't add lazy evaluation to pandas with a library. It's architectural.
"DuckDB is a database, so it needs infrastructure"
DuckDB runs in-process with zero infrastructure. There's no server. No daemon. No port. You import duckdb and write SQL. Calling it a "database" and comparing it to PostgreSQL misses the point. It competes with pandas and Polars, not with Postgres.
"Pandas is dead"
Pandas has 200 million monthly downloads. Every data science tutorial, every Kaggle notebook, every sklearn example uses pandas. The ecosystem integration is unmatched — seaborn, plotly, matplotlib, scikit-learn, and hundreds of other libraries expect pandas DataFrames. Pandas 3.0 with copy-on-write and Arrow strings is genuinely better. Pandas isn't dead. It's the last mile.
"Just use the fastest one"
Speed isn't the only axis. DuckDB's SQL interface matters if your team thinks in SQL. Polars' type safety matters if you're building production pipelines. Pandas' ecosystem matters if you need to plug into sklearn or create a quick visualization. The right tool depends on the job, not the benchmark.
The Decision Framework
Here's how I'd actually decide:
| Your Situation | Use This | Why |
|---|---|---|
| Quick EDA, small data (under 1GB) | Pandas | Fastest to write, best ecosystem |
| Production data pipeline | Polars | Type-safe, lazy execution, testable |
| SQL-heavy analytics | DuckDB | Native SQL, zero infrastructure |
| Data larger than RAM | DuckDB or Polars (streaming) | Both handle it; DuckDB is simpler |
| Team knows SQL, not Python | DuckDB | They can be productive immediately |
| Team knows pandas, wants speed | Polars | Easier migration than learning SQL |
| ML preprocessing | Pandas (last mile) + Polars or DuckDB | sklearn expects pandas |
| Ad-hoc file queries | DuckDB | SELECT * FROM 'file.parquet' — done |
| Complex window functions in code | Polars | Expression API is more composable |
| Regulated industry (encryption) | DuckDB | AES-256 at rest since v1.4 |
The Hybrid Stack (What I Actually Recommend)
For most data teams in 2026, the answer isn't one tool. It's this:
- DuckDB for initial data exploration and SQL-heavy aggregation on raw files
- Polars for production pipeline transformations (lazy mode, type safety, testability)
- Pandas for the final mile — feeding data to visualization libraries and scikit-learn
The Apache Arrow integration between all three makes handoffs essentially free. You're not choosing a religion. You're building a toolbox.
Migration: Pandas to Polars in 10 Minutes
If you're currently using pandas and want to try Polars, here's the translation table for common operations:
```python
# PANDAS
import pandas as pd

df = pd.read_csv("data.csv")
result = (
    df[df["age"] > 30]
    .groupby("city")["salary"]
    .mean()
    .sort_values(ascending=False)
    .head(10)
)
```
```python
# POLARS (eager mode — pandas-like)
import polars as pl

df = pl.read_csv("data.csv")
result = (
    df.filter(pl.col("age") > 30)
    .group_by("city")
    .agg(pl.col("salary").mean())
    .sort("salary", descending=True)
    .head(10)
)
```
```python
# POLARS (lazy mode — optimized)
import polars as pl

result = (
    pl.scan_csv("data.csv")
    .filter(pl.col("age") > 30)
    .group_by("city")
    .agg(pl.col("salary").mean())
    .sort("salary", descending=True)
    .head(10)
    .collect()
)
```
The mental model shift: in pandas, you chain method calls that each produce a new DataFrame. In Polars, you build expressions using pl.col() and compose them. The lazy version (scan_csv + collect) is what you should use in production — it lets Polars optimize the entire query before executing.
Migration: Pandas to DuckDB in 5 Minutes
```python
# PANDAS
import pandas as pd

df = pd.read_parquet("data.parquet")
result = df.groupby(["region", "year"])["revenue"].sum().reset_index()
result = result[result["revenue"] > 100000].sort_values("revenue", ascending=False)
```
```sql
-- DUCKDB (just SQL)
SELECT region, year, SUM(revenue) AS revenue
FROM 'data.parquet'
GROUP BY region, year
HAVING revenue > 100000
ORDER BY revenue DESC
```
That's it. No import, no read_parquet, no reset_index. DuckDB reads the Parquet file directly in the SQL query. If you know SQL, you already know DuckDB.
The Governance Question
One thing I care about that most comparison articles ignore: who controls these tools?
Pandas is a NumFOCUS fiscally sponsored project. Community-governed, volunteer-maintained, and stable for over a decade. Not going anywhere.
DuckDB is owned by the DuckDB Foundation, a non-profit. The intellectual property was purposefully moved to the foundation to ensure DuckDB remains MIT-licensed in perpetuity, independent of any commercial entity. DuckDB Labs is the commercial company, and MotherDuck is the cloud offering — but the core project is protected. This is the gold standard for open source governance.
Polars is backed by Polars Inc., a VC-funded company with $25 million raised ($21M Series A from Accel in September 2025). The project is MIT-licensed, but the company controls development direction. This is the same model that worked for companies like HashiCorp (until it didn't — remember the BSL license switch?). I don't think Polars will pull a HashiCorp, but the structural risk exists.
This matters if you're choosing a tool for a 10-year production system. DuckDB's nonprofit foundation structure gives it the strongest long-term guarantee. Pandas has community inertia. Polars has VC money, which is great for development velocity but comes with expectations of returns.
What I Actually Think
Pandas is not dead. But pandas is now the wrong default for new data projects.
For the last decade, "learn data science" meant "learn pandas." Every tutorial, every bootcamp, every YouTube video started with import pandas as pd. That made sense when pandas was the only real option. It doesn't make sense anymore.
Here's my position: new data projects in 2026 should start with Polars as the default DataFrame library and add DuckDB for SQL-heavy work. Pandas should be the integration layer — the tool you convert to when you need sklearn, seaborn, or another library that hasn't adopted the Arrow standard yet.
I think Polars has a slight edge over DuckDB for most data engineering work because:
- Expression-based API is more testable. You can unit test a Polars expression. You can't easily unit test a SQL string.
- Type safety catches errors earlier. Polars' schema validation happens at query plan time, not at runtime. SQL errors happen when you run the query.
- Lazy evaluation is automatic optimization. You write code, Polars optimizes it. With DuckDB, the SQL engine optimizes too, but you have to write SQL first.
But I reach for DuckDB when:
- I'm doing exploratory analysis and want to write quick SQL against files
- The team is SQL-native (data analysts, BI folks)
- I need out-of-core processing with zero configuration
- I need encryption at rest for compliance
The hybrid approach — DuckDB for SQL, Polars for transformations, pandas for the last mile — is not a compromise. It's the optimal architecture. Apache Arrow makes it work without performance penalties.
One more thing: if you're still using pandas 2.x, upgrade to 3.0. Copy-on-write alone will save you hours of debugging SettingWithCopyWarning. The Arrow string backend is a free performance win. And the Arrow PyCapsule interface means your pandas DataFrames can now talk to Polars and DuckDB without serialization overhead.
The data tooling ecosystem in Python finally makes sense. Three tools, three specializations, one memory format connecting them all. It's the best it's ever been.