Ismat Samadov

ClickHouse Processes 1 Billion Rows Per Second on a Single Node — OLAP for Engineers Who Hate Complexity

A 47-second Postgres query took 120ms on ClickHouse. Columnar storage, vectorized execution, and why your analytics belong in OLAP.

Database · Data Engineering · Performance · Analytics




© 2026 Ismat Samadov


A query that took 47 seconds on PostgreSQL took 120 milliseconds on ClickHouse. Same data. Same columns. Same WHERE clause. The only difference: Postgres stores data row by row. ClickHouse stores it column by column. That architectural choice — made decades ago by different engineers for different reasons — is a 400x performance gap in 2026. And most backend engineers have never touched a columnar database.

I was one of those engineers until last year. I'd heard of OLAP, knew it stood for Online Analytical Processing, and promptly forgot about it because "just use Postgres" had served me fine. Then our analytics dashboard hit 200 million rows and the GROUP BY queries started taking 30+ seconds. Postgres wasn't broken. I was using the wrong tool.

Why Your Database Is Slow (And It's Not Your Database's Fault)

Here's the thing most tutorials get wrong: they compare databases like they're interchangeable. They're not. PostgreSQL and ClickHouse aren't competitors. They're different tools for fundamentally different problems.

Row-oriented databases (PostgreSQL, MySQL, SQLite) store all columns of a row together on disk. When you SELECT * FROM users WHERE id = 42, the database reads one contiguous block — fast. This is perfect for OLTP: transactional workloads where you're reading or writing individual records. Banking systems, shopping carts, user profiles.

Column-oriented databases (ClickHouse, DuckDB, BigQuery) store each column separately. When you SELECT avg(price) FROM orders WHERE date > '2025-01-01', the database reads only the price and date columns — ignoring the other 30 columns entirely. For analytical queries that scan millions of rows but touch few columns, this means 10-100x faster execution than row stores.

The compression benefits compound this. A column of integers compresses far better than a row containing an integer, a string, a timestamp, a JSON blob, and a boolean. ClickHouse typically achieves 7-12x compression ratios, meaning your 1 TB dataset fits in 80-140 GB on disk. Less data to read = faster queries.
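To make the compression argument concrete, here's a toy sketch in Python (zlib standing in for ClickHouse's codecs, with made-up prices and country codes). The same values compress noticeably better when stored column by column than when interleaved row by row:

```python
import random
import struct
import zlib

random.seed(0)
n = 100_000
prices = [random.randint(100, 999) for _ in range(n)]
countries = [("US", "DE", "FR", "JP", "BR")[i % 5] for i in range(n)]

# Columnar layout: each column stored (and compressed) contiguously
price_col = b"".join(struct.pack("<i", p) for p in prices)
country_col = "".join(countries).encode()
columnar = len(zlib.compress(price_col)) + len(zlib.compress(country_col))

# Row layout: the same values interleaved record by record
rows = b"".join(struct.pack("<i", p) + c.encode()
                for p, c in zip(prices, countries))
row_store = len(zlib.compress(rows))
```

The homogeneous columns give the compressor long runs of similar bytes (repeated country codes, zeroed high bytes of small integers); interleaving breaks those runs up.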

This isn't magic. It's physics. And once you understand it, the question stops being "should I use ClickHouse?" and becomes "why am I running analytical queries on a row store?"

ClickHouse: The Numbers

ClickHouse isn't just fast. It's absurdly fast.

Tesla reported ingesting 1 billion rows per second into ClickHouse. Tinybird independently demonstrated 1.72 billion rows ingested in a single second. Altinity analyzed 500 billion rows on an Intel NUC — a $300 mini-PC.

The company behind it just raised a $400 million Series D at a $15 billion valuation in January 2026. They now serve over 3,000 customers on ClickHouse Cloud, with ARR growing more than 250% year over year. The customer list reads like a tech industry hall of fame: OpenAI, Netflix, Tesla, Anthropic, Meta, Lyft, Instacart, Sony, and Capital One.

And in a move that tells you everything about where the market is headed, ClickHouse acquired Langfuse (the leading open-source LLM observability platform) and launched a native Postgres service with built-in CDC. They're not just an analytics database anymore. They're building the data infrastructure layer for the AI era.

How ClickHouse Actually Works (The 5-Minute Architecture)

Three things make ClickHouse fast. Understanding them will save you from the most common mistakes.

1. MergeTree Engine

MergeTree is the default storage engine, and it's unlike anything in Postgres. Data arrives in "parts" — each part is a directory containing column files sorted by primary key. ClickHouse merges these parts in the background, similar to how LSM trees work but with an important difference: data is written directly to the filesystem, not to an in-memory buffer first. This makes ClickHouse better suited for batch inserts.

The primary key isn't a unique constraint — it's a sort order. Data within each part is physically sorted by the primary key columns. This means range queries on primary key columns skip vast amounts of data using sparse indexing: ClickHouse stores one index entry per 8,192 rows by default, not one per row.
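The sparse index idea fits in a few lines. This is a simplified model, not ClickHouse's implementation: keep one "mark" (the first key) per granule of sorted data, then binary-search the marks to find which granules a range query has to read:

```python
import bisect

GRANULE = 8192
# A sorted primary-key column (here, synthetic integer keys)
keys = list(range(0, 1_000_000, 2))

# Sparse index: one entry (the first key) per granule, not one per row
marks = [keys[i] for i in range(0, len(keys), GRANULE)]

def granules_for_range(lo, hi):
    """Return the granule indexes that might contain keys in [lo, hi].
    Everything outside this range is skipped without being read."""
    first = max(bisect.bisect_right(marks, lo) - 1, 0)
    last = max(bisect.bisect_right(marks, hi) - 1, 0)
    return range(first, last + 1)

g = granules_for_range(100_000, 101_000)
```

Half a million rows, sixty-two index entries, and a range query narrows to a single granule. That's why the index fits in memory even for billions of rows.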

2. Vectorized Execution

Instead of processing one row at a time, ClickHouse processes data in batches called "chunks". Each chunk contains thousands of values from a single column. This lets the CPU use SIMD instructions — Single Instruction, Multiple Data — to perform the same operation across many values simultaneously.

Think of it this way: a row-store database is a cashier scanning groceries one item at a time. ClickHouse is a barcode scanner running across an entire conveyor belt. Same operations, orders of magnitude less overhead.
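A scalar-Python sketch of the same shape, assuming a made-up column of prices: the column lives in one contiguous typed array, and the engine walks it in large chunks rather than dispatching per row. (Real ClickHouse does this in C++ with SIMD; this only illustrates the batching structure.)

```python
from array import array

# One "column file": 1M prices stored contiguously as a typed array
prices = array("d", (float(i % 100) for i in range(1_000_000)))

CHUNK = 65_536  # process values in batches, never one row at a time

def chunked_sum(col, chunk=CHUNK):
    """Sum a column chunk by chunk; each sum() call is a tight loop
    over contiguous memory, the scalar analogue of a SIMD pass."""
    total = 0.0
    for start in range(0, len(col), chunk):
        total += sum(col[start:start + chunk])
    return total

result = chunked_sum(prices)
```

The per-value overhead (dispatch, branching, cache misses) is paid once per chunk instead of once per row, which is where the order-of-magnitude gap comes from.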

3. Compression and Data Skipping

Each column file is compressed independently using codecs tuned to the data type. Timestamps get delta encoding (store the difference between consecutive values). Integers get LZ4. Low-cardinality strings get dictionary encoding. The result: your data is 7-12x smaller on disk, which means 7-12x less I/O for every query.
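Delta encoding is simple enough to show in full. A minimal round-trip sketch: monotonically increasing timestamps become a stream of tiny, highly repetitive differences, which a general-purpose compressor then crushes:

```python
def delta_encode(values):
    """Store the first value plus differences between consecutive values."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Reverse the encoding with a running sum."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

# Timestamps one second apart collapse to a base value plus 1s
ts = [1735689600 + i for i in range(10)]
enc = delta_encode(ts)
```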

On top of compression, ClickHouse maintains min/max statistics for each data "granule" (8,192 rows). A query with WHERE date > '2026-01-01' can skip entire granules whose max date is before 2026 without reading a single byte.
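The min/max pruning logic, sketched with toy numbers (four-value granules instead of 8,192): the engine keeps one (min, max) pair per granule and discards any granule whose statistics can't satisfy the predicate, before touching the column data:

```python
GRANULE = 4  # tiny granules for illustration; ClickHouse uses 8192
col = [3, 5, 7, 9, 11, 12, 14, 15, 20, 21, 22, 25]

# Per-granule (min, max) statistics, maintained at write time
stats = [(min(col[i:i + GRANULE]), max(col[i:i + GRANULE]))
         for i in range(0, len(col), GRANULE)]

def granules_matching(threshold):
    """Granules that may contain values > threshold; the rest are
    skipped without reading a byte of column data."""
    return [g for g, (_, mx) in enumerate(stats) if mx > threshold]

g = granules_matching(15)  # only the last granule can match
```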

ClickHouse vs DuckDB: The Real Comparison

Every ClickHouse article mentions DuckDB. Most get the comparison wrong by treating them as direct competitors. They're not.

DuckDB is the SQLite of OLAP — an in-process database that runs inside your Python, R, or Node.js process. No server, no port, no configuration. It has over 6 million downloads per month, hit 30,000 GitHub stars in 2025, and jumped from 1.4% to 3.3% adoption in the Stack Overflow Developer Survey.

Here's when to use each:

| Dimension | ClickHouse | DuckDB |
|---|---|---|
| Architecture | Client-server (or cloud) | In-process (embedded) |
| Best for | Production analytics, dashboards | Ad-hoc analysis, data science |
| Concurrency | Hundreds of concurrent queries | Single user / few queries |
| Data scale | Petabytes | Gigabytes to low terabytes |
| Ingestion | Real-time streaming | Batch / file-based |
| Setup | Docker/cloud deployment | pip install duckdb |
| Latency at scale | Sub-second on billions of rows | Sub-second on millions |
| Multi-node | Yes (distributed) | No (single node) |
| Cost | $200-100K+/month (Cloud) | Free |
| Ideal user | Platform/infra team | Data scientist / analyst |

The heuristic I use: if you're asking "which should I use for my data science work?" start with DuckDB. If you're asking "which should I use to power my analytics product?" start with ClickHouse.

Many teams use both — DuckDB for development and exploration, ClickHouse for production serving. They share enough SQL compatibility that queries written in DuckDB usually work in ClickHouse with minor adjustments.

The Cost Equation: ClickHouse vs The Data Warehouse Giants

Before ClickHouse, if you needed analytical performance on billions of rows, your options were Snowflake, BigQuery, or Redshift. All excellent. All expensive.

Here's the cost comparison that matters:

| Platform | 1 TB Storage/Mo | Compute for Heavy Analytics | Egress Fees |
|---|---|---|---|
| Snowflake | ~$23/TB | $2-4/credit, scales quickly | Yes |
| BigQuery | ~$20/TB active | $6.25/TB scanned (on-demand) | Yes |
| Redshift | ~$25/TB | $0.25-4.05/hr per node | Yes (AWS) |
| ClickHouse Cloud | ~$35-50/TB compressed | $0.22-0.75/hr | Yes (new in 2025) |
| ClickHouse Self-Hosted | Disk cost only | Server cost only | None |

The compression advantage changes the storage math dramatically. ClickHouse's 7-12x compression means your 1 TB of raw data is 80-140 GB stored. At $35-50/TB compressed, that's $3-7 for what would cost $20-25 on BigQuery or Snowflake.

For small-to-medium workloads processing under 10 GB daily, ClickHouse Cloud costs $200-400/month. Self-hosted on a modest server runs $200-400/month in compute. Compare that to Snowflake, where a moderately active warehouse can easily hit $1,000-3,000/month.

The catch: ClickHouse Cloud introduced egress fees in January 2025, making migrations out more expensive. And the 30% price increase for typical production workloads in that same pricing update pushed some teams toward self-hosting or alternatives like Altinity Cloud.

When Postgres Actually Isn't Enough

Look, I'm not going to tell you to migrate everything to ClickHouse. Postgres is great. I use it for this blog. But there are clear signals that you've outgrown Postgres for analytics:

Signal 1: Your dashboards take more than 3 seconds to load. If users are staring at spinners while Postgres scans a 50-million-row events table, you have an OLAP problem. Adding indexes helps temporarily but doesn't fix the fundamental row-store penalty for wide-table scans.

Signal 2: You're running pre-aggregation jobs to make reports fast. If you're computing hourly rollups, daily summaries, or materialized views just to keep dashboard queries under a second — you're building an ad-hoc OLAP system on top of an OLTP database. ClickHouse does this natively and better.

Signal 3: Your analytics queries touch under 10% of columns but over 10 million rows. This is the columnar sweet spot. SELECT count(*), avg(amount) FROM transactions WHERE date BETWEEN '2025-01-01' AND '2025-12-31' GROUP BY merchant_category reads 3 columns out of 40. In Postgres, it reads all 40. In ClickHouse, it reads 3.

Signal 4: You're spending more on Postgres read replicas than you would on ClickHouse. If you've added 2-3 read replicas specifically for analytics queries, you're already paying ClickHouse-level costs with Postgres-level performance.

Getting Started: From Zero to Querying Billions

Here's the practical deployment path.

Option 1: ClickHouse Cloud (Fastest)

# Sign up at clickhouse.com, create a service, then connect:
clickhouse-client --host your-service.clickhouse.cloud \
  --user default --password your-password --secure

Development tier starts at roughly $1/month. Production clusters range from $200 to $100K+/month depending on compute and storage.

Option 2: Self-Hosted with Docker (Most Control)

# Single-node ClickHouse in 30 seconds
docker run -d \
  --name clickhouse \
  -p 8123:8123 \
  -p 9000:9000 \
  -v clickhouse_data:/var/lib/clickhouse \
  clickhouse/clickhouse-server

# Connect
docker exec -it clickhouse clickhouse-client

Your First Table

-- Create a table for event analytics
CREATE TABLE events (
    event_id UInt64,
    user_id UInt32,
    event_type LowCardinality(String),
    page_url String,
    country LowCardinality(String),
    revenue Decimal(10, 2),
    created_at DateTime
) ENGINE = MergeTree()
ORDER BY (event_type, country, created_at)
PARTITION BY toYYYYMM(created_at);

Key points:

  • LowCardinality(String) for columns with few unique values — dictionary encoding makes this dramatically faster
  • ORDER BY is your primary sort key, not a unique constraint. Pick columns you filter on most frequently
  • PARTITION BY month keeps old partitions from being scanned for recent queries

Loading Data

-- Insert from a CSV file
INSERT INTO events
SELECT * FROM file('events.csv', CSVWithNames);

-- Or from Postgres via ClickHouse's built-in Postgres engine
CREATE TABLE pg_events
ENGINE = PostgreSQL('postgres-host:5432', 'mydb', 'events', 'user', 'pass');

INSERT INTO events SELECT * FROM pg_events;

The Query That Changes Your Mind

-- Count events by country and type for the last 30 days
-- On 500 million rows, this returns in ~200ms
SELECT
    country,
    event_type,
    count() AS events,
    uniqExact(user_id) AS unique_users,
    sum(revenue) AS total_revenue
FROM events
WHERE created_at >= now() - INTERVAL 30 DAY
GROUP BY country, event_type
ORDER BY total_revenue DESC
LIMIT 20;

Try this same query on a 500-million-row Postgres table. Make coffee while you wait.

The Mistakes That'll Cost You

Mistake 1: Treating ClickHouse like Postgres. Don't INSERT one row at a time. ClickHouse is optimized for batch inserts — aim for batches of 10,000-100,000 rows. Single-row inserts create one "part" per insert, and the background merge process will struggle to keep up. This is the #1 mistake new users make.
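The fix is mechanical: buffer rows and flush in large batches. A minimal batching helper, with the actual insert call shown only as a comment (the `client.insert` shape is from the clickhouse-connect library, so treat that part as an assumption to verify against your client):

```python
from itertools import islice

def batches(rows, size=50_000):
    """Group an iterable of rows into insert-sized batches so each
    INSERT creates one part instead of thousands of tiny ones."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

# Hypothetical usage with a ClickHouse client:
# for batch in batches(event_stream, 50_000):
#     client.insert("events", batch)

sizes = [len(b) for b in batches(range(120_000), 50_000)]
```

If your producers emit single events, put a buffer (or Kafka, or ClickHouse's own async_insert setting) between them and the INSERT.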

Mistake 2: Wrong ORDER BY. Your ORDER BY clause determines the physical sort order of data on disk. It's the single most impactful decision for query performance. Lead with the columns your queries filter on most often, and as a general rule place lower-cardinality columns before higher-cardinality ones so the sparse index can skip more granules. If nearly every query filters by user_id, putting user_id early in the ORDER BY makes sense despite its cardinality. Getting this wrong means full table scans on queries that should skip 99% of data.

Mistake 3: Updating and deleting like it's an OLTP database. ClickHouse supports UPDATE and DELETE, but they're expensive "mutations" that rewrite entire parts. If your workload requires frequent updates, you either need a different approach (ReplacingMergeTree, CollapsingMergeTree) or a different database. ClickHouse is append-optimized.

Mistake 4: Ignoring LowCardinality. If a String column has fewer than 10,000 unique values — country codes, status enums, event types — wrapping it in LowCardinality() can improve query performance by 10-50x. Dictionary encoding replaces repeated strings with integer IDs internally. Most tutorials skip this. Don't.
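Dictionary encoding itself is trivial; what matters is that comparisons and group-bys then run on small integers instead of strings. A toy sketch of the encoding (not ClickHouse's internal format):

```python
def dict_encode(strings):
    """Replace repeated strings with small integer ids plus a dictionary,
    the idea behind LowCardinality(String)."""
    dictionary, ids, index = [], [], {}
    for s in strings:
        if s not in index:
            index[s] = len(dictionary)
            dictionary.append(s)
        ids.append(index[s])
    return dictionary, ids

dictionary, ids = dict_encode(["US", "DE", "US", "US", "AZ", "DE"])
```

A billion-row country column with 200 unique values becomes a billion one-byte ids plus a 200-entry dictionary.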

Mistake 5: Self-hosting before you need to. Self-hosted ClickHouse requires 1-3 days for initial setup and 4-8 hours per week of ongoing maintenance. At $150/hour engineering cost, that's $2,400-4,800 per month in labor alone. Start with ClickHouse Cloud or Altinity Cloud. Self-host only when you've outgrown managed pricing or have strict data sovereignty requirements.

Mistake 6: Not using materialized views. ClickHouse materialized views aren't like Postgres materialized views — they're not periodic refreshes. They're real-time aggregation pipelines. When data is inserted into a source table, the materialized view transforms and inserts into a target table automatically. This is how you build real-time rollups, pre-aggregations, and derived tables without ETL jobs:

-- Pre-aggregate hourly revenue as data arrives
CREATE MATERIALIZED VIEW hourly_revenue
ENGINE = SummingMergeTree()
ORDER BY (event_type, hour)
AS SELECT
    event_type,
    toStartOfHour(created_at) AS hour,
    sum(revenue) AS total_revenue,
    count() AS event_count
FROM events
GROUP BY event_type, hour;

Every INSERT into events automatically updates hourly_revenue. No cron jobs. No Airflow DAGs. No stale data. This is one of ClickHouse's most powerful features and it's wildly underused.

The OLAP Decision Framework

Don't pick a database. Pick an architecture. Here's the framework:

Under 10 million rows: Stay on Postgres. Add indexes. Seriously. ClickHouse is overkill and adds operational complexity for no benefit.

10-100 million rows, infrequent analytics: Try DuckDB first. Export your Postgres tables to Parquet, query them with DuckDB. Zero infrastructure, instant results.

import duckdb

# Query a Parquet file with 50M rows — returns in 200ms
result = duckdb.sql("""
    SELECT country, count(*) as events
    FROM 'events.parquet'
    WHERE created_at >= '2025-01-01'
    GROUP BY country
    ORDER BY events DESC
""").fetchall()

100 million to 10 billion rows, real-time dashboards: ClickHouse. This is its sweet spot. Set up CDC from Postgres using ClickHouse's native Postgres integration or Debezium, and run analytics on ClickHouse while keeping OLTP on Postgres.

Over 10 billion rows, multi-team analytics: ClickHouse Cloud or a managed service like Altinity. At this scale, you need auto-scaling, team isolation, and professional support.

What I Actually Think

The OLAP market has been dominated by expensive enterprise tools for decades — Snowflake, BigQuery, Redshift. ClickHouse is eating their lunch from below, and the reason is simple: it's faster and cheaper for the workloads that matter.

I think ClickHouse is the most important database technology of the last five years. Not because it's the fanciest — it's not. The SQL dialect is quirky, the ecosystem is smaller than Postgres, and the learning curve is real. But it solves a problem that every growing company hits: "my analytics are too slow and my database bill is too high." And it solves it with performance that feels unfair.

I think DuckDB is brilliant but overhyped for production serving. It's the perfect tool for data scientists and for prototyping analytics pipelines. But the moment you need concurrent queries, real-time ingestion, or distributed storage, you're back to ClickHouse. DuckDB is the exploration tool. ClickHouse is the production engine. Use both.

I think the ClickHouse acquisition of Langfuse signals something bigger. AI applications generate enormous amounts of observability data — traces, token counts, latencies, evaluations — at exactly the scale that ClickHouse handles best. Every LLM-powered product will need an OLAP backend for its operational data. ClickHouse is positioning itself to be that backend. Smart.

I think the canonical architecture for 2026 is Postgres for OLTP + ClickHouse for OLAP + CDC in between. Not Postgres for everything. Not ClickHouse for everything. Two tools, connected by a data pipeline, each doing what it's best at. The ClickHouse team seems to agree — they literally launched a native Postgres service with built-in CDC to make this architecture turnkey.

And I think the reason most backend engineers haven't tried ClickHouse is the same reason they haven't tried Redis for caching or Elasticsearch for search: they try to make Postgres do everything because it's the tool they know. Postgres is great. It's just not great at everything. The fastest way to fix slow analytics isn't a better Postgres index. It's a columnar database.

The barrier to entry has never been lower. Five years ago, running ClickHouse meant managing ZooKeeper clusters and debugging replication lag. Today, docker run clickhouse/clickhouse-server gives you a fully functional analytical database in 30 seconds. ClickHouse Cloud gives you a managed cluster with zero operational overhead for the price of a couple of coffees per day.

If you're a backend engineer who's never used a columnar database, this is the one to learn. Not because it's trendy — because it fills a gap in your toolkit that Postgres can't. Every application eventually outgrows row-store analytics. When yours does, you'll want to know ClickHouse already.

Load your data. Run a GROUP BY on 100 million rows. Watch it finish before your finger leaves the Enter key. Then tell me you don't need this.


Sources

  1. SQLite.org — Most Widely Deployed Database
  2. Tinybird — How to Ingest 1B Rows Per Second in ClickHouse
  3. Altinity — Analyzing 500 Billion Rows on an Intel NUC
  4. ClickHouse — $400M Series D, Langfuse Acquisition, Postgres Service
  5. ClickHouse — 2025 Roundup
  6. ClickHouse — Architecture Overview
  7. ClickHouse — Official Benchmarks
  8. ClickBench — Benchmark for Analytical DBMS
  9. ClickHouse — Pricing Documentation
  10. DeepWiki — MergeTree Engine Architecture
  11. Tinybird — Fastest Database for Analytics 2026
  12. Tinybird — Self-Hosted ClickHouse Cost
  13. Tinybird — Best OLAP Databases 2026
  14. Tinybird — ClickHouse Streaming Analytics
  15. DEV Community — Columnar vs Row-Based Data Structures
  16. CloudQuery — Six Months with ClickHouse
  17. DuckDB — Official Website
  18. Medium — DuckDB Is Eating the Data World
  19. CloudRaft — ClickHouse vs DuckDB
  20. ClickHouse — What Really Matters for Performance