Last year I killed a 2,400-line Python ETL pipeline and replaced it with 300 lines of SQL. The Python version had Airflow DAGs, pandas transformations, custom error handlers, retry logic, and a Slack alerting system. The SQL version has CTEs, materialized views, and a cron job. It runs faster, fails less often, and I can actually read it six months later.
I'm not saying SQL is the answer to every data engineering problem. But for the kind of mid-size analytical workloads most of us actually deal with — pulling data from staging tables, cleaning it, aggregating it, and landing it in a reporting schema — SQL does the job with a fraction of the complexity.
This is the story of how I got there, the actual SQL patterns I use in production, and what I honestly think about the state of data engineering tooling in 2026.
The Modern Data Stack Burned Me
I bought into the modern data stack narrative hard. Fivetran for ingestion, Snowflake for storage, dbt for transformation, Looker for dashboards, Airbyte as a backup connector, Airflow for orchestration. It looked clean on the architecture diagram. Six boxes, six arrows, one direction. Simple.
In practice, it was a nightmare.
Six different billing dashboards. Six different credential stores. Six different upgrade cycles that could break each other. When Airbyte pushed a connector update that changed a column type, it broke three dbt models, which broke two Looker dashboards, which generated a dozen Slack alerts at 3 AM. The fix took ten minutes. Figuring out where the break was took two hours.
I'm not the only one who hit this wall. The criticism has been piling up. "Best-of-breed worked with 5 tools. It broke with 15." The tool explosion got so bad that people started calling the modern data stack dead outright. Conflicting dashboards, wasted resources, teams drowning in integration work instead of analysis — that was the reality behind the slick diagrams.
2026 is the era of consolidation. Keep the core tools, kill the redundant ones, combine the overlapping ones. That's exactly what I did. I looked at my stack and asked: what if I just... didn't need half of these tools?
The ELT Shift Already Pointed the Way
Before I explain what I replaced everything with, it helps to understand why this was even possible.
The industry already moved from ETL (extract, transform, load) to ELT (extract, load, transform). ELT became the standard as data systems shifted to the cloud. The key difference: you load raw data into your warehouse first, then transform it in place using SQL. No more Python transformation scripts that run on an application server, shuffling data between systems.
ELT relies heavily on SQL, and that democratized transformation work. Analysts who know SQL can write transforms. You don't need a Python developer for every pipeline change.
dbt is the poster child of this shift. 57,177 companies now use dbt. dbt Labs hit $100M ARR in February 2025, up from $2M just four years earlier, with 85% year-over-year growth in Fortune 500 adoption and a 9.08% share of the data-warehousing tools category (per 6Sense). The Fivetran + dbt Labs merger announced in October 2025, with roughly $600M in combined revenue, only confirms that this is the most popular production stack for mid-size companies in 2026.
dbt proved that SQL transformations work. My question was simpler: what if I don't need dbt either? What if the database itself is enough?
The SQL-Only Approach
Here's what my pipeline looks like now. Raw data lands in a staging schema via COPY commands or simple INSERT statements. Transformations happen in SQL — CTEs for multi-step logic, window functions for analytics, materialized views for pre-computed aggregates. A pg_cron job refreshes everything on schedule.
No Airflow. No dbt. No pandas. No Python at all.
Let me walk through each piece.
CTEs for Multi-Step Transformations
Common Table Expressions are the backbone. They let you write readable, step-by-step transformations in a single query. Each CTE is a named intermediate result that the next step can reference.
Here's a real example from my pipeline. I have a staging.raw_orders table with messy e-commerce data. I need to clean it, deduplicate it, enrich it with customer segments, and land it in a reporting table.
WITH cleaned AS (
SELECT
order_id,
LOWER(TRIM(customer_email)) AS customer_email,
COALESCE(order_total, 0) AS order_total,
order_date::date AS order_date,
CASE
WHEN status IN ('paid', 'completed', 'shipped') THEN 'active'
WHEN status IN ('refunded', 'cancelled') THEN 'inactive'
ELSE 'unknown'
END AS status_group
FROM staging.raw_orders
WHERE order_date >= CURRENT_DATE - INTERVAL '90 days'
),
deduplicated AS (
SELECT DISTINCT ON (order_id) *
FROM cleaned
ORDER BY order_id, order_date DESC
),
enriched AS (
SELECT
d.*,
c.segment AS customer_segment,
c.lifetime_value,
CASE
WHEN c.lifetime_value > 1000 THEN 'high_value'
WHEN c.lifetime_value > 200 THEN 'mid_value'
ELSE 'low_value'
END AS value_tier
FROM deduplicated d
LEFT JOIN reporting.customers c
ON d.customer_email = c.email
)
INSERT INTO reporting.orders
SELECT * FROM enriched
ON CONFLICT (order_id) DO UPDATE SET
status_group = EXCLUDED.status_group,
customer_segment = EXCLUDED.customer_segment,
value_tier = EXCLUDED.value_tier,
updated_at = NOW();
That's the entire pipeline for this table. Cleaning, deduplication, enrichment, and idempotent loading — all in one query. Each CTE is named. You can read it top to bottom. Three months from now, you'll still understand what it does.
Compare this to the Python equivalent: a pandas script that reads from staging, runs a chain of DataFrame operations, handles nulls with fillna, deduplicates with drop_duplicates, merges with another DataFrame, and writes back with to_sql. Same logic, triple the code, and you also have to manage database connections, error handling, and deployment yourself.
Window Functions for Analytics
Window functions are where SQL really shines for analytical work. Anything that involves ranking, running totals, period-over-period comparisons, or gap analysis — window functions handle it natively.
WITH daily_revenue AS (
SELECT
order_date,
customer_segment,
SUM(order_total) AS revenue,
COUNT(*) AS order_count
FROM reporting.orders
WHERE status_group = 'active'
GROUP BY order_date, customer_segment
),
with_trends AS (
SELECT
*,
LAG(revenue, 7) OVER (
PARTITION BY customer_segment
ORDER BY order_date
) AS revenue_7d_ago,
AVG(revenue) OVER (
PARTITION BY customer_segment
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS revenue_7d_avg,
SUM(revenue) OVER (
PARTITION BY customer_segment
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_revenue,
ROW_NUMBER() OVER (
PARTITION BY customer_segment
ORDER BY order_date DESC
) AS recency_rank
FROM daily_revenue
)
SELECT
order_date,
customer_segment,
revenue,
revenue_7d_avg,
ROUND(
(revenue - revenue_7d_ago) / NULLIF(revenue_7d_ago, 0) * 100, 2
) AS wow_growth_pct,
cumulative_revenue
FROM with_trends
WHERE recency_rank <= 30
ORDER BY customer_segment, order_date DESC;
That query gives you daily revenue with 7-day moving averages, week-over-week growth percentages, and cumulative totals — all segmented by customer type. In Python, you'd need pandas groupby, rolling windows, shift operations, and cumsum. The SQL version runs inside the database, close to the data, with no serialization overhead.
PostgreSQL processes data more efficiently when you keep the work inside the database rather than pulling it into an application layer. CTEs, window functions, and subqueries simplify transformations that would otherwise require multiple passes in application code.
Materialized Views for Pre-Computed Reports
Materialized views are the SQL equivalent of a cached transformation. You define a query, PostgreSQL runs it and stores the result as a physical table. When the underlying data changes, you refresh the view.
CREATE MATERIALIZED VIEW reporting.monthly_summary AS
WITH monthly_orders AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
customer_segment,
value_tier,
COUNT(*) AS total_orders,
SUM(order_total) AS total_revenue,
COUNT(DISTINCT customer_email) AS unique_customers,
AVG(order_total) AS avg_order_value
FROM reporting.orders
WHERE status_group = 'active'
GROUP BY 1, 2, 3
),
with_growth AS (
SELECT
*,
LAG(total_revenue) OVER (
PARTITION BY customer_segment, value_tier
ORDER BY month
) AS prev_month_revenue
FROM monthly_orders
)
SELECT
month,
customer_segment,
value_tier,
total_orders,
total_revenue,
unique_customers,
avg_order_value,
ROUND(
(total_revenue - prev_month_revenue)
/ NULLIF(prev_month_revenue, 0) * 100, 2
) AS mom_growth_pct
FROM with_growth
ORDER BY month DESC, customer_segment, value_tier
WITH DATA;
CREATE UNIQUE INDEX idx_monthly_summary_pk
ON reporting.monthly_summary (month, customer_segment, value_tier);
That index on the materialized view is important. It enables REFRESH MATERIALIZED VIEW CONCURRENTLY, which allows non-blocking refreshes so your reporting queries keep working during the refresh.
One limitation to know: PostgreSQL only supports full refresh for materialized views natively. There's a pg_ivm extension for incremental view maintenance, but it's not production-ready yet. For most reporting workloads, a full refresh every 15 minutes or every hour is completely fine. If your base table has millions of rows and your refresh takes too long, that's when you start looking at incremental approaches — or consider whether you've outgrown this pattern.
COPY for Bulk Loading
If you're loading data from flat files — CSVs, TSVs, whatever your upstream system exports — the COPY command is dramatically faster than row-by-row inserts. I use it for the initial load step, getting raw data from files into staging tables.
-- Load a CSV file directly into a staging table
COPY staging.raw_orders (order_id, customer_email, order_total, order_date, status)
FROM '/data/exports/orders_20250701.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',', NULL '');
-- Or use \copy from psql for client-side files
-- \copy staging.raw_orders FROM 'orders.csv' WITH (FORMAT csv, HEADER true)
COPY streams rows in bulk, avoiding the per-statement parsing, planning, and client round-trips that individual INSERTs incur. On a local test with 500,000 rows, COPY finished in 2.3 seconds. The equivalent INSERT ... VALUES approach took over 40 seconds. That's not a marginal improvement; it's a different category of performance.
For recurring loads, I keep the staging table as a landing zone. Each run truncates staging, loads fresh data via COPY, then the CTE pipeline transforms and upserts into reporting tables. Clean separation, fast loads, no leftover state.
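The truncate-and-reload step is worth making atomic. A minimal sketch, using the staging table from the earlier examples and a hypothetical file path:

```sql
-- Reload staging atomically: TRUNCATE and COPY in one transaction,
-- so readers never observe a half-loaded staging table.
BEGIN;
TRUNCATE staging.raw_orders;
COPY staging.raw_orders (order_id, customer_email, order_total, order_date, status)
FROM '/data/exports/orders_latest.csv'   -- hypothetical path
WITH (FORMAT csv, HEADER true);
COMMIT;
```

TRUNCATE is transactional in PostgreSQL, so if the COPY fails midway, the rollback restores the previous staging contents and the downstream transform simply reruns against yesterday's data.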
UPSERT for Idempotent Loads
Every pipeline should be idempotent. Run it once, run it ten times — same result. The INSERT ON CONFLICT pattern gives you this for free.
INSERT INTO reporting.customer_metrics (
customer_email,
total_orders,
total_revenue,
first_order_date,
last_order_date,
avg_order_value,
updated_at
)
SELECT
customer_email,
COUNT(*) AS total_orders,
SUM(order_total) AS total_revenue,
MIN(order_date) AS first_order_date,
MAX(order_date) AS last_order_date,
AVG(order_total) AS avg_order_value,
NOW() AS updated_at
FROM reporting.orders
WHERE status_group = 'active'
GROUP BY customer_email
ON CONFLICT (customer_email) DO UPDATE SET
total_orders = EXCLUDED.total_orders,
total_revenue = EXCLUDED.total_revenue,
last_order_date = EXCLUDED.last_order_date,
avg_order_value = EXCLUDED.avg_order_value,
updated_at = EXCLUDED.updated_at;
No need for a "check if exists, then update, else insert" dance. No need for a staging-then-merge pattern. One statement, always correct, regardless of how many times you run it.
pg_cron for Scheduling
pg_cron turns PostgreSQL into its own scheduler. No Airflow, no crontab on a separate server, no Lambda triggers.
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS pg_cron;
-- Refresh materialized views every 15 minutes
SELECT cron.schedule(
'refresh-monthly-summary',
'*/15 * * * *',
$$REFRESH MATERIALIZED VIEW CONCURRENTLY reporting.monthly_summary$$
);
-- Run the full orders pipeline every hour
SELECT cron.schedule(
'orders-pipeline',
'0 * * * *',
$$
WITH cleaned AS (
SELECT
order_id,
LOWER(TRIM(customer_email)) AS customer_email,
COALESCE(order_total, 0) AS order_total,
order_date::date AS order_date,
CASE
WHEN status IN ('paid', 'completed', 'shipped') THEN 'active'
WHEN status IN ('refunded', 'cancelled') THEN 'inactive'
ELSE 'unknown'
END AS status_group
FROM staging.raw_orders
WHERE order_date >= CURRENT_DATE - INTERVAL '90 days'
),
deduplicated AS (
SELECT DISTINCT ON (order_id) *
FROM cleaned
ORDER BY order_id, order_date DESC
)
INSERT INTO reporting.orders
SELECT * FROM deduplicated
ON CONFLICT (order_id) DO UPDATE SET
status_group = EXCLUDED.status_group,
updated_at = NOW();
$$
);
-- Update customer metrics daily at 2 AM
SELECT cron.schedule(
'customer-metrics',
'0 2 * * *',
$$
INSERT INTO reporting.customer_metrics (
customer_email, total_orders, total_revenue,
first_order_date, last_order_date, avg_order_value, updated_at
)
SELECT
customer_email,
COUNT(*),
SUM(order_total),
MIN(order_date),
MAX(order_date),
AVG(order_total),
NOW()
FROM reporting.orders
WHERE status_group = 'active'
GROUP BY customer_email
ON CONFLICT (customer_email) DO UPDATE SET
total_orders = EXCLUDED.total_orders,
total_revenue = EXCLUDED.total_revenue,
last_order_date = EXCLUDED.last_order_date,
avg_order_value = EXCLUDED.avg_order_value,
updated_at = EXCLUDED.updated_at;
$$
);
-- Check what's scheduled
SELECT * FROM cron.job;
That's the entire orchestration layer. Three cron entries. No DAG definitions. No YAML files. No separate orchestration server.
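Monitoring is a query too. Recent pg_cron versions record run history in a details table, which answers the "did last night's job succeed" question without any external tooling (assuming your pg_cron version ships cron.job_run_details):

```sql
-- Inspect recent job runs; a 'failed' status plus return_message is
-- usually enough to spot a broken pipeline.
SELECT jobid, status, return_message, start_time, end_time
FROM cron.job_run_details
ORDER BY start_time DESC
LIMIT 20;
```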
Python ETL vs SQL ETL: Side by Side
Here's an honest comparison of where I was and where I ended up.
| Dimension | Python ETL (Airflow + pandas) | SQL ETL (PostgreSQL-native) |
|---|---|---|
| Lines of code | ~2,400 | ~300 |
| Dependencies | 47 Python packages | 0 (just PostgreSQL) |
| Infrastructure | Airflow server + workers + DB | PostgreSQL only |
| Deployment | Docker + CI/CD pipeline | Run SQL files |
| Scheduling | Airflow DAGs (YAML/Python) | pg_cron (one-liners) |
| Error handling | Custom retry logic, dead letter queues | Transaction rollback |
| Monitoring | Airflow UI + Slack integration | pg_cron job logs + pg_stat |
| Debugging | Read Python tracebacks, check Airflow logs | Run the CTE in isolation |
| Onboarding new devs | "Learn Airflow, pandas, our custom utils" | "You know SQL? Great." |
| Cost (monthly) | ~$200 (Airflow hosting) | $0 (runs in existing DB) |
| Data movement | DB -> Python -> DB | Stays in DB |
| Idempotency | Manual implementation | ON CONFLICT built-in |
The cost difference alone justified the switch. But the real win was debugging. When a CTE-based pipeline fails, you can run each CTE independently to find exactly where the problem is. Copy the CTE, add a SELECT * at the bottom, run it. You see the intermediate result instantly. When an Airflow DAG fails, you're reading Python tracebacks, checking task logs, figuring out which retry attempt you're looking at, and wondering whether the state in XCom is stale.
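The debug loop in practice: keep the pipeline's CTEs up to the suspect step, select from that step directly, and eyeball the intermediate rows. Using the cleaned CTE from the orders pipeline above:

```sql
-- Debugging sketch: run the pipeline only up to the step under suspicion
-- and inspect its output directly, instead of reading orchestrator logs.
WITH cleaned AS (
    SELECT
        order_id,
        LOWER(TRIM(customer_email)) AS customer_email,
        COALESCE(order_total, 0) AS order_total,
        order_date::date AS order_date
    FROM staging.raw_orders
    WHERE order_date >= CURRENT_DATE - INTERVAL '90 days'
)
SELECT * FROM cleaned
ORDER BY order_date DESC
LIMIT 50;
```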
The onboarding story also changed dramatically. My previous team had a 40-page internal wiki on "how to set up the local Airflow environment." Docker Compose files, environment variables, connection strings, DAG folder symlinks, fernet key configuration. The SQL pipeline onboarding doc is three lines: install psql, connect to the database, run the SQL file. Every analyst on the team can now read, debug, and modify the pipeline. That wasn't true when it was Python.
When This Approach Falls Apart
I'd be lying if I said SQL handles everything. It doesn't. Here are the cases where I still reach for Python.
API integrations. If your pipeline starts with pulling data from a REST API, you need code. SQL can't make HTTP requests. I still use Python scripts for the "extract" part — hitting APIs, parsing JSON responses, writing to staging tables. But the transformation and loading? That's SQL.
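The extract script doesn't need to be fancy. Here's a minimal sketch of the JSON-to-staging-rows step, standard library only; the field names (order_id, email, total, status) are hypothetical stand-ins for whatever your upstream API actually returns:

```python
import json

def payload_to_rows(payload: str) -> list[tuple]:
    """Convert a JSON API response into tuples ready for a staging insert.

    Field names here are hypothetical; adapt them to your API's schema.
    """
    records = json.loads(payload)
    rows = []
    for rec in records:
        rows.append((
            rec["order_id"],
            rec.get("email", "").strip().lower(),  # normalize like the SQL side does
            float(rec.get("total") or 0),          # missing totals become 0
            rec.get("status", "unknown"),
        ))
    return rows

# In the real script these rows would be written to staging.raw_orders
# (e.g. via psycopg2's executemany); transformation stays in SQL from there.
sample = '[{"order_id": 1, "email": " A@B.com ", "total": "19.99", "status": "paid"}]'
print(payload_to_rows(sample))
# → [(1, 'a@b.com', 19.99, 'paid')]
```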
Complex data parsing. If you're parsing nested JSON with irregular schemas, or processing unstructured text, or doing NLP preprocessing, SQL gets ugly fast. PostgreSQL's JSON functions are powerful but painful for deeply nested structures. Python with jq-style libraries is simply better here.
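To make the contrast concrete, here's a stdlib sketch that flattens arbitrarily nested JSON into dot-path keys. This kind of recursion is a few lines of Python but an unpleasant pile of jsonb_each calls in SQL (the function and key format are my own, not any library's API):

```python
def flatten(obj, prefix=""):
    """Recursively flatten nested dicts/lists into a flat dict of dot-path keys."""
    out = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            out.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            out.update(flatten(value, f"{prefix}{i}."))
    else:
        out[prefix[:-1]] = obj  # leaf: strip the trailing dot
    return out

doc = {"user": {"id": 7, "tags": ["a", "b"]}, "total": 42}
print(flatten(doc))
# → {'user.id': 7, 'user.tags.0': 'a', 'user.tags.1': 'b', 'total': 42}
```

The flattened keys map naturally onto staging-table columns, so Python handles the irregular structure and SQL takes over once the data is tabular.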
Machine learning feature engineering. If your transformations involve statistical operations that don't exist in SQL — like custom feature scaling, text vectorization, or time-series decomposition — you need a programming language. SQL has some statistical functions (percentile_cont, stddev, correlation), but it can't replace scikit-learn.
Cross-database orchestration. If your pipeline needs to move data between a PostgreSQL database, a MongoDB collection, and an S3 bucket, SQL alone won't do it. You need something that can talk to multiple systems.
Volume beyond what PostgreSQL can handle. If you're processing billions of rows per hour, PostgreSQL will struggle. That's Spark territory. But be honest with yourself: are you actually processing billions of rows, or are you using Spark because someone told you it was the right tool?
For the 80% of pipelines that are "read from staging, clean, aggregate, write to reporting" — SQL is not just adequate. It's better than the alternative.
What I Actually Think
I think the data engineering industry has a complexity addiction. We reach for distributed systems when a single Postgres instance would do. We build DAG-based orchestration when a cron job would work. We write Python transformations when a CTE would be clearer.
The modern data stack was supposed to simplify things. Instead it gave us six tools that each need their own deployment, their own monitoring, their own upgrade cycle. The Fivetran-Snowflake-dbt-Looker-Airbyte-Airflow stack was "clean in diagrams, nightmare in practice."
I'm not anti-dbt. dbt is good software. It gives you testing, documentation, lineage tracking, and version control for SQL transforms. If your team has 50 models with complex dependencies, dbt's DAG and ref system earn their keep. But for my workload — a dozen transforms, straightforward dependencies — dbt was overhead I didn't need. Adding a Jinja templating layer on top of SQL doesn't make simple SQL simpler. It makes it more complicated.
I'm not anti-Airflow either. Airflow is excellent for orchestrating complex workflows with branching logic, retries, SLAs, and cross-system dependencies. But running a SQL query every hour isn't a complex workflow. pg_cron handles it with one line.
Here's my rule of thumb: start with the simplest thing that works.
- Can you do it in a single SQL query? Do that.
- Do you need multi-step logic? Use CTEs.
- Do you need cached results? Use materialized views.
- Do you need scheduling? Use pg_cron.
- Do you need testing? Write SQL assertions: count the rows that violate a rule, and fail the job if the count is non-zero.
- Do you need Python-specific functionality (API calls, ML, parsing)? Add Python for that specific step. Don't rewrite the whole pipeline.
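The assertion idea from the list above can be sketched as a DO block that errors out when a data-quality rule is violated, which in turn marks the pg_cron run as failed. The specific rule here is illustrative:

```sql
-- Sketch of a SQL assertion: raise an error when a quality check fails,
-- so the scheduled job that runs it is recorded as failed.
DO $$
DECLARE
    bad_rows bigint;
BEGIN
    SELECT COUNT(*) INTO bad_rows
    FROM reporting.orders
    WHERE order_total < 0;  -- illustrative rule: no negative order totals

    IF bad_rows > 0 THEN
        RAISE EXCEPTION 'assertion failed: % rows with negative order_total', bad_rows;
    END IF;
END $$;
```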
The data industry is consolidating because people are tired. Tired of managing tool sprawl. Tired of debugging integration layers. Tired of spending more time on infrastructure than on actual analysis. 2026 is the year of "keep core, kill redundant, combine overlapping." My SQL-only pipeline is one version of that philosophy.
PostgreSQL has been around since 1996. It will outlast every tool in your current data stack. Writing your transformations in SQL means they'll run on any PostgreSQL-compatible database — Neon, Supabase, RDS, Aurora, your laptop. No vendor lock-in. No framework lock-in. Just SQL.
I replaced my entire ETL pipeline with SQL. It took a weekend. I got rid of an Airflow server, 47 Python dependencies, and a monthly infrastructure bill. The pipeline runs faster because the data never leaves the database. It fails less often because there are fewer moving parts. And when it does fail, I can find the bug in minutes, not hours.
Your mileage will vary. Your data might be messier than mine. Your volume might be higher. Your requirements might genuinely demand Python or Spark or dbt. But before you spin up the next tool in your stack, try writing the SQL version first. You might be surprised how far it gets you.
Sources
- ETL vs ELT — dbt Labs
- Best ELT Tools in 2026 — dbt Labs
- dbt Company Usage Stats — Landbase
- dbt Labs $100M ARR Milestone — dbt Labs
- dbt Market Share — 6Sense
- Fivetran + dbt Labs Merger Analysis — Sacra
- The 2026 Data Product Stack: What to Keep, Kill, and Combine — Ellie.ai
- The Modern Data Stack Is Dead — Zion Cloud
- The Modern Data Stack's Final Act — Modern Data 101
- SQL for Data Engineering — Integrate.io
- Incremental View Maintenance — PostgreSQL Wiki
- REFRESH MATERIALIZED VIEW — PostgreSQL Docs