Articles / Analyzing Billion Row Datasets Without Breaking Your BI Stack

Analyzing Billion Row Datasets Without Breaking Your BI Stack

Learn how to query and analyze billion-row datasets when traditional BI tools fail. Discover architectural patterns, query optimization, and AI-powered alternatives.

Andrew Grosser

Andrew Grosser

May 15, 2026 • 11 min read

Analyzing Billion Row Datasets Without Breaking Your BI Stack

Learn how to query and analyze billion-row datasets when traditional BI tools fail. Discover architectural patterns, query optimization, and AI-powered alternatives.

Your analytics dashboard just timed out. Again. You're staring at a spinning loader that's been running for six minutes on a simple GROUP BY query. Your event tracking table crossed one billion rows last week, and suddenly every tool in your stack—Metabase, Superset, even your carefully tuned Postgres instance—grinds to a halt. This is the billion-row wall, and you just hit it.

The billion-row threshold isn't arbitrary. It's the point where traditional row-oriented databases and BI tools designed for transactional workloads fundamentally break down under analytical query patterns. A Postgres database that handled 50 million rows with acceptable performance will timeout on aggregations at 1 billion rows. Tableau dashboards that refreshed in 8 seconds now take 4 minutes—or never complete at all.

Sourcetable's AI data analyst connects directly to analytical databases like ClickHouse and translates natural language into optimized SQL. Try it free now.

Why Traditional Databases Fail at Billion-Row Scale

Postgres, MySQL, and other row-oriented databases store data row-by-row on disk. When you run SELECT COUNT(*), AVG(revenue) FROM events WHERE timestamp > '2026-01-01' GROUP BY user_id on a billion-row table, the database must scan every row, read all columns (even ones you don't need), and perform aggregations in memory.

Here's the math: A typical event record might be 500 bytes. One billion rows = 500 GB of raw data. Even with indexes, a sequential scan touches hundreds of gigabytes. On standard SSD storage (500 MB/s read speed), scanning 500 GB takes 16 minutes minimum—before any aggregation happens. Add GROUP BY operations with high cardinality (millions of unique users), and memory consumption explodes. A query that aggregates across 5 million users requires materializing intermediate results that can exceed available RAM, forcing disk-based sorting that adds another 10-30 minutes.

Database Type Row Count Limit Query Time (COUNT + GROUP BY) Storage Format
Postgres (row-oriented) 50M practical limit 45-180 seconds Row-based heap
MySQL (InnoDB) 50M practical limit 60-240 seconds Row-based B-tree
ClickHouse (columnar) 1B+ optimized 2-8 seconds Columnar compressed
BigQuery (columnar) 10B+ optimized 3-12 seconds Columnar distributed

The fundamental issue: row-oriented databases were designed for OLTP (online transaction processing) workloads—thousands of small writes and point reads. Analytical queries (OLAP) require full table scans with column-based aggregations. Using Postgres for billion-row analytics is like using a sports car to haul freight: it wasn't built for this.

How Columnar Databases Handle Billion-Row Queries in Seconds

Columnar databases like ClickHouse, BigQuery, and Redshift store data column-by-column instead of row-by-row. When you run SELECT AVG(revenue) FROM events, the database reads only the revenue column—ignoring user_id, timestamp, and 20 other columns entirely. This reduces I/O by 90-95% for typical analytical queries.

Compression amplifies the advantage. Columnar storage compresses extremely well because similar data types cluster together. A revenue column containing mostly values between $0-$100 compresses to 10-20% of its original size using dictionary encoding and run-length encoding. That 500 GB Postgres table becomes 50-80 GB in ClickHouse. Now your 16-minute scan drops to 90 seconds of I/O.

Vectorized execution processes data in batches of thousands of rows at once using SIMD (Single Instruction Multiple Data) CPU instructions. Instead of looping through rows one-by-one, ClickHouse processes 65,536 rows per operation. A SUM aggregation that took 100 million CPU cycles in Postgres takes 2 million cycles in ClickHouse. The result: queries that timed out after 5 minutes complete in 3 seconds.

Real-World Performance Example: User Event Analysis

Dataset: 1.2 billion user events, 18 columns, 600 GB uncompressed

Query: SELECT user_id, COUNT(*) as events, AVG(session_duration) FROM events WHERE event_date >= '2026-01-01' GROUP BY user_id ORDER BY events DESC LIMIT 1000

System Query Time Data Scanned Result
Postgres 15 (8 cores, 64GB RAM) Timeout after 300s ~600 GB Query killed
MySQL 8 (8 cores, 64GB RAM) Timeout after 300s ~600 GB Query killed
ClickHouse (8 cores, 64GB RAM) 2.4 seconds ~45 GB (compressed) 1000 rows returned
BigQuery (serverless) 3.1 seconds ~38 GB (compressed) 1000 rows returned

When to Migrate from Postgres to a Columnar Database

You don't need ClickHouse on day one. Postgres handles analytical workloads beautifully up to 20-50 million rows with proper indexing. The migration decision comes down to four concrete signals that indicate you've outgrown row-oriented storage.

Signal 1: Query timeouts on aggregations. If GROUP BY queries that previously completed in 10-30 seconds now timeout after 2-5 minutes, you've hit the wall. This typically happens between 100M-500M rows depending on query complexity and hardware. Example: A daily active user (DAU) calculation that worked fine at 80M events starts timing out at 200M events.

Signal 2: Dashboard refresh times exceed user tolerance. Business users abandon dashboards that take more than 15-20 seconds to load. If your Metabase or Tableau dashboards that used to refresh in 5 seconds now take 45-90 seconds, you're losing analytical velocity. A marketing team that checks campaign performance 20 times per day won't wait 60 seconds per refresh—they'll stop using the dashboard entirely.

Signal 3: Index bloat consumes more storage than raw data. Postgres uses B-tree indexes for performance. At billion-row scale, indexes on high-cardinality columns (user_id, session_id, transaction_id) can consume 2-3x the space of the underlying table. A 200 GB events table with 5 indexes balloons to 800 GB total storage. Monthly cloud storage costs jump from $800 to $3,200. ClickHouse eliminates most secondary indexes through columnar compression, reducing that 800 GB to 120 GB.

Signal 4: INSERT performance degrades below business requirements. Event tracking systems need to ingest 10,000-100,000 events per second during peak traffic. Postgres write performance degrades as table size grows because every INSERT updates multiple indexes. At 500M rows, INSERT throughput drops from 50,000 rows/second to 8,000 rows/second. ClickHouse maintains 200,000+ rows/second INSERT performance at multi-billion row scale through append-only MergeTree architecture.

Migration Decision Matrix

Current State Recommendation Reasoning
< 50M rows, queries under 10s Stay on Postgres No performance issue, avoid complexity
50-200M rows, queries 10-30s Optimize Postgres first Partitioning + materialized views may suffice
200-500M rows, queries 30-120s Plan migration to columnar DB Performance degrading, migration ROI positive
500M-1B rows, frequent timeouts Migrate immediately Critical performance failure, business impact
> 1B rows, dashboards unusable Emergency migration Current architecture fundamentally broken

Optimizing ClickHouse Queries for Sub-Second Performance

Migrating to ClickHouse doesn't automatically guarantee fast queries. Poor schema design and unoptimized SQL can still produce 30-60 second query times on billion-row tables. Here's how to consistently achieve 1-3 second response times.

Partition by time-based columns. ClickHouse's PARTITION BY clause physically separates data into distinct folders on disk. Partitioning by date allows the query planner to skip entire partitions when filtering by time ranges. Query: SELECT COUNT(*) FROM events WHERE event_date >= '2026-04-01' on a table partitioned by month scans only 1-2 partitions (April-May 2026) instead of all 24 months of data. This reduces scan volume by 92% and query time from 12 seconds to 0.8 seconds.

Order tables by query access patterns. The ORDER BY clause in the table definition determines physical row ordering on disk. ClickHouse creates a sparse primary index on the ordering key, enabling fast range scans. For user behavior analysis, order by (user_id, event_timestamp). For time-series dashboards, order by (event_timestamp, user_id). A query filtering WHERE user_id = 12345 on a user_id-ordered table scans 0.001% of rows instead of 100%—turning a 15-second full scan into a 0.05-second index lookup.

Use materialized views for repeated aggregations. If your dashboard runs the same GROUP BY query 500 times per day, precompute it. A materialized view automatically maintains aggregated results as new data arrives. Example: CREATE MATERIALIZED VIEW daily_user_stats AS SELECT event_date, user_id, COUNT(*) as event_count, AVG(session_duration) as avg_duration FROM events GROUP BY event_date, user_id. Querying the materialized view returns results in 0.2 seconds instead of 4 seconds for the raw aggregation. Storage cost: 2-5% of the base table size.

Query Optimization Impact: Before and After

Optimization Query Time Before Query Time After Speedup
Add PARTITION BY event_date 12.3s 0.9s 13.7x faster
Reorder table by user_id 8.6s 0.4s 21.5x faster
Create materialized view for daily stats 4.2s 0.2s 21x faster
Use PREWHERE instead of WHERE 3.1s 1.8s 1.7x faster
Compress with ZSTD(3) codec 5.4s 2.1s 2.6x faster (I/O reduction)

Use PREWHERE for selective filtering. ClickHouse's PREWHERE clause applies filters before reading all columns, reducing I/O. For queries like SELECT user_id, event_type, revenue FROM events PREWHERE event_date >= '2026-05-01' WHERE revenue > 100, ClickHouse first filters by event_date (reading only that column), then reads user_id, event_type, and revenue only for matching rows. This cuts data scanned from 180 GB to 12 GB, reducing query time from 6 seconds to 0.9 seconds.

Connecting ClickHouse to Your BI Tools and Spreadsheets

ClickHouse speaks standard SQL over HTTP and native TCP protocols, making it compatible with most BI tools. However, connection setup, credential management, and query translation vary significantly across tools. Here's what actually works at billion-row scale.

Tableau and Looker: JDBC/ODBC with caveats. Both tools connect to ClickHouse via JDBC drivers, but default query generation produces inefficient SQL. Tableau's automatic query builder generates SELECT * FROM events LIMIT 10000 without WHERE clauses, triggering full table scans. You must manually define custom SQL with PREWHERE filters. Looker's LookML requires explicit partition filters in explore definitions to prevent runaway queries. Setup time: 4-8 hours of trial-and-error for production-ready configurations.

Metabase and Superset: Native ClickHouse drivers. Both open-source BI tools offer first-class ClickHouse support with native drivers. Metabase's ClickHouse driver correctly translates date functions and aggregations. Superset's SQL Lab provides direct query access with syntax highlighting for ClickHouse-specific functions. Limitation: neither tool automatically optimizes queries—users must understand PARTITION BY and ORDER BY to write efficient SQL. Dashboard refresh times: 2-8 seconds for well-optimized queries on billion-row tables.

Excel and Google Sheets: ODBC with manual refresh. ClickHouse's ODBC driver allows Excel connections via Data > Get Data > From Other Sources > From ODBC. You can import query results into worksheets, but refreshes are manual—no live data connections. Google Sheets has no native ClickHouse connector; you need Apps Script to call ClickHouse HTTP API and parse JSON responses. Practical limit: 100,000 rows per import (Excel's grid limit is 1,048,576 rows, but performance degrades above 100K).

Sourcetable eliminates the connection complexity entirely. Add your ClickHouse credentials once (host, port, username, password), and all tables appear in the data picker immediately with column types and sample data. No driver installation, no manual SQL—just natural language queries that translate to optimized ClickHouse SQL automatically. Ask 'Show me top 100 users by event count in the last 30 days' and the AI generates SELECT user_id, COUNT(*) as event_count FROM events PREWHERE event_date >= today() - 30 GROUP BY user_id ORDER BY event_count DESC LIMIT 100, executes it against your billion-row table, and returns results in the spreadsheet in 1.2 seconds.

Cross-Database Queries: Joining ClickHouse with Postgres and CSV Files

Real-world analytics rarely live in a single database. You have 1.5 billion events in ClickHouse, 200,000 user profiles in Postgres, and a CSV file with 50 marketing campaigns. Traditional BI tools force you to export data, join in Python, or build ETL pipelines. Federated query engines let you join across data sources in a single SQL statement.

ClickHouse supports external table engines that query remote databases. The PostgreSQL table engine connects to Postgres and executes queries server-side: SELECT e.user_id, COUNT(*) as events, u.email FROM clickhouse_events e JOIN postgresql('postgres.example.com:5432', 'users', 'username', 'password') u ON e.user_id = u.id GROUP BY e.user_id, u.email. This works, but performance is terrible—ClickHouse pulls the entire Postgres users table over the network (200,000 rows) for every query execution. A join that should take 2 seconds takes 45 seconds due to network transfer overhead.

The correct approach: replicate dimension tables to ClickHouse. Use ClickHouse's MaterializedPostgreSQL engine to maintain a real-time replica of your Postgres users table inside ClickHouse. Now joins are local, and query time drops from 45 seconds to 1.8 seconds. Downside: you're managing replication pipelines, monitoring sync lag, and handling schema changes across two databases. Setup complexity: 2-3 days for a production-ready replication system.

Sourcetable's federated SQL engine handles this automatically. Connect both your ClickHouse events database and Postgres users database as credentials. Write a single query: SELECT e.user_id, COUNT(*) as events, u.email, u.signup_date FROM @clickhouse_events e JOIN @postgres_users u ON e.user_id = u.id WHERE e.event_date >= '2026-05-01' GROUP BY e.user_id, u.email, u.signup_date. The query planner automatically pushes aggregations to ClickHouse (processing 1.5B rows server-side), pulls only the aggregated results (50,000 rows), joins with Postgres user data (200,000 rows), and returns the final result set in 3.2 seconds. No ETL, no replication, no manual optimization.

Federated Query Performance Comparison

Approach Setup Time Query Time Maintenance Burden
Manual export + CSV join in Excel 30 min per query 15-60 min (manual) High (manual every time)
Python script with pandas 2-4 hours initial 8-15 seconds Medium (code maintenance)
ClickHouse PostgreSQL engine (naive) 1 hour 45 seconds Low (but slow queries)
ClickHouse MaterializedPostgreSQL replication 2-3 days 1.8 seconds High (replication monitoring)
Sourcetable federated SQL 5 minutes 3.2 seconds None (automatic optimization)

Natural Language to ClickHouse SQL: How AI Handles Complex Analytics

Writing optimized ClickHouse SQL requires deep knowledge of PREWHERE, partition pruning, materialized views, and ClickHouse-specific functions like arrayJoin and groupArray. A query that calculates 30-day retention cohorts involves window functions, self-joins, and date arithmetic that takes experienced analysts 20-40 minutes to write and debug.

AI-powered natural language interfaces translate plain English questions into optimized SQL automatically. The challenge: generic SQL generators produce naive queries that ignore ClickHouse-specific optimizations. A question like 'Show me daily active users for the last 90 days' might generate SELECT DATE(event_timestamp) as day, COUNT(DISTINCT user_id) FROM events WHERE event_timestamp >= now() - INTERVAL 90 DAY GROUP BY day. This works, but it's slow—scanning 90 days of data without partition pruning takes 8-12 seconds.

An optimized version uses toDate for partition pruning and PREWHERE for early filtering: SELECT toDate(event_timestamp) as day, uniqExact(user_id) as dau FROM events PREWHERE event_date >= today() - 90 WHERE event_timestamp >= now() - INTERVAL 90 DAY GROUP BY day ORDER BY day. Query time: 1.1 seconds—7x faster. The AI needs schema awareness (knowing event_date is the partition key) and ClickHouse function knowledge (uniqExact is faster than COUNT DISTINCT for exact counts).

Sourcetable's AI analyzes your ClickHouse table schema on connection, identifying partition keys, ordering keys, and column types. When you ask 'Calculate 7-day rolling average revenue per user,' the AI generates: SELECT event_date, user_id, AVG(revenue) OVER (PARTITION BY user_id ORDER BY event_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as rolling_avg_revenue FROM events PREWHERE event_date >= today() - 30 ORDER BY user_id, event_date. It automatically adds PREWHERE on the partition key, uses efficient window functions, and limits the date range to prevent full table scans. Result: 2.3 seconds for a complex analytical query that would take 15 minutes to write manually.

Cost Analysis: ClickHouse Cloud vs Self-Hosted vs Managed Services

Running analytical databases at billion-row scale has three cost components: compute (CPU/RAM for query processing), storage (disk space for raw data), and operational overhead (engineering time for maintenance). The total cost of ownership varies 10x between deployment options.

Self-hosted ClickHouse on AWS EC2: A production-ready setup requires 3-node cluster for high availability. Recommended instance: m6i.4xlarge (16 vCPU, 64 GB RAM, $0.768/hour) × 3 nodes = $2.30/hour = $1,656/month compute. Storage: 2 TB EBS gp3 per node × 3 = 6 TB total = $600/month. Total: $2,256/month. Engineering overhead: 20-30 hours/month for monitoring, upgrades, backup management, query optimization support. At $150/hour fully loaded engineer cost, that's $3,000-$4,500/month in operational overhead. True TCO: $5,256-$6,756/month.

ClickHouse Cloud (managed service): ClickHouse Cloud charges for compute separately from storage. Production tier: $0.60/hour for 8 vCPU, 32 GB RAM compute = $432/month. Storage: $0.12/GB/month for 2 TB = $240/month. Total: $672/month. No operational overhead—ClickHouse handles backups, upgrades, monitoring, and scaling automatically. True TCO: $672/month. Savings vs self-hosted: $4,584-$6,084/month (87-90% reduction).

BigQuery (serverless): No upfront costs, pay per query. Storage: $0.02/GB/month for 2 TB = $40/month. Queries: $5 per TB scanned. If you run 1,000 queries/day averaging 50 GB scanned each (typical for billion-row analytics), that's 50 TB/day × 30 days = 1,500 TB/month × $5 = $7,500/month query costs. Total: $7,540/month. BigQuery makes sense for sporadic analytics (10-50 queries/day), but becomes extremely expensive for high-query-volume dashboards.

Deployment Option Monthly Cost Operational Overhead Best For
Self-hosted ClickHouse (AWS EC2) $5,256-$6,756 20-30 hours/month Large teams with dedicated data infrastructure engineers
ClickHouse Cloud $672 0 hours (fully managed) Most companies at billion-row scale
BigQuery (1000 queries/day) $7,540 0 hours (serverless) Low-query-volume analytics, data science exploration
Redshift (dc2.large × 3 nodes) $4,380 10-15 hours/month AWS-native stacks with existing Redshift expertise

Sourcetable connects to any of these options—you choose your backend based on cost and operational preferences. For most teams crossing the billion-row threshold, ClickHouse Cloud offers the best price-performance ratio at $672/month with zero operational overhead.

Migration Strategy: Moving Billion-Row Tables from Postgres to ClickHouse

Migrating a production analytics database with 1+ billion rows requires careful planning to avoid downtime and data loss. A naive dump-and-restore approach can take 48-72 hours and blocks all queries during migration. Here's the zero-downtime migration pattern that works.

Phase 1: Parallel write (1-2 weeks). Configure your application to write events to both Postgres and ClickHouse simultaneously. Use a message queue (Kafka, RabbitMQ) to ensure exactly-once delivery. This creates a ClickHouse table with all new data while your Postgres table continues serving existing queries. Overhead: 5-10% additional write latency, negligible for most applications.

Phase 2: Historical backfill (3-7 days). Export historical data from Postgres in 10-50 million row batches using COPY (SELECT * FROM events WHERE id BETWEEN 0 AND 50000000) TO '/tmp/events_batch_1.csv' WITH CSV. Import each batch into ClickHouse using clickhouse-client --query='INSERT INTO events FORMAT CSV' < events_batch_1.csv. Parallelizing 4-8 batch imports achieves 500,000-1,000,000 rows/second throughput. A 1.2 billion row table backfills in 20-40 hours. Run backfill during low-traffic hours to minimize impact on production Postgres performance.

Phase 3: Query migration (1-2 weeks). Gradually move dashboards and reports to query ClickHouse instead of Postgres. Start with non-critical dashboards, validate results match Postgres output (expect minor floating-point differences in averages due to different aggregation algorithms), then migrate high-traffic dashboards. Use feature flags to enable/disable ClickHouse queries per dashboard for instant rollback if issues arise.

Phase 4: Postgres decommission (1 week). Once 100% of queries hit ClickHouse and parallel writes have run for 2+ weeks, stop writing to Postgres. Monitor for 7 days to ensure no applications still depend on Postgres data. Drop the Postgres table and reclaim storage. Total migration time: 5-8 weeks from planning to Postgres decommission.

Migration Timeline and Risk Mitigation

Phase Duration Risk Rollback Plan
Parallel write setup 1-2 weeks Low (Postgres still primary) Disable ClickHouse writes, no data loss
Historical backfill 3-7 days Medium (Postgres load increases) Pause backfill, resume during off-hours
Query migration 1-2 weeks Medium (potential query result mismatches) Feature flag rollback per dashboard
Postgres decommission 1 week Low (all queries validated) Restore Postgres from backup if needed
At what row count should I migrate from Postgres to ClickHouse?
Migrate when you hit one of these thresholds: (1) Aggregation queries timeout after 2+ minutes on tables with 200M+ rows, (2) Dashboard refresh times exceed 30 seconds, (3) Index storage costs exceed 2x your raw data storage, or (4) INSERT performance drops below 10,000 rows/second. Most teams see these symptoms between 200M-500M rows. Don't migrate prematurely—Postgres handles 50M rows efficiently with proper indexing.
Can I query ClickHouse using standard SQL or do I need to learn a new syntax?
ClickHouse supports 95% of standard SQL (SELECT, JOIN, GROUP BY, window functions, CTEs). The syntax differences are minor: use toDate() instead of DATE(), uniqExact() instead of COUNT(DISTINCT), and add PREWHERE for partition filtering. Most SQL knowledge transfers directly. ClickHouse-specific features (ARRAY JOIN, materialized views, table engines) are optional optimizations, not requirements.
How much faster is ClickHouse than Postgres on billion-row tables?
Typical speedup: 10-50x for analytical queries. A GROUP BY aggregation that takes 120 seconds in Postgres completes in 2-4 seconds in ClickHouse. The exact speedup depends on query type: simple aggregations see 20-30x improvement, complex multi-table joins see 10-15x improvement. Queries that timeout in Postgres (300+ seconds) usually complete in 3-8 seconds in ClickHouse.
What's the minimum hardware requirement to run ClickHouse on 1 billion rows?
Minimum: 8 CPU cores, 32 GB RAM, 500 GB SSD storage for a single-node development instance. Production recommendation: 3-node cluster with 16 cores and 64 GB RAM per node for high availability and query parallelization. ClickHouse Cloud's smallest production tier (8 cores, 32 GB RAM) handles 1-2 billion rows comfortably at $432/month. Self-hosting requires 3x that capacity for redundancy.
Can I join ClickHouse tables with data in Postgres or CSV files?
Yes, through federated query engines. ClickHouse's native PostgreSQL table engine allows cross-database joins but pulls entire Postgres tables over the network (slow for large dimension tables). Better approach: use a federated SQL layer like Sourcetable that optimizes join execution by pushing aggregations to ClickHouse server-side, then joining with smaller result sets. This reduces network transfer from gigabytes to megabytes.
How do I optimize ClickHouse queries that still run slow (10+ seconds)?
Check these in order: (1) Add PARTITION BY on date columns and ensure queries filter on partition keys, (2) Verify table ORDER BY matches your query access patterns (e.g., order by user_id if filtering by user), (3) Use PREWHERE for selective filters before WHERE, (4) Create materialized views for frequently repeated aggregations, (5) Enable query profiling with SET send_logs_level='trace' to identify bottlenecks. Most slow queries are fixed by adding proper partitioning and PREWHERE clauses.
What's the difference between ClickHouse and BigQuery for billion-row analytics?
ClickHouse: self-hosted or managed cloud, fixed monthly cost ($432-$2,000/month), optimized for high query volume (1,000+ queries/day), 2-5 second query latency. BigQuery: serverless, pay-per-query ($5 per TB scanned), optimized for sporadic analytics (10-100 queries/day), 3-12 second query latency. Choose ClickHouse for dashboards with high refresh rates; choose BigQuery for ad-hoc data science exploration with infrequent queries.
How long does it take to migrate 1 billion rows from Postgres to ClickHouse?
Historical data backfill: 20-40 hours for 1 billion rows at 500,000 rows/second transfer rate. Total migration project: 5-8 weeks including parallel write setup (1-2 weeks), backfill (3-7 days), query migration and validation (1-2 weeks), and monitoring before Postgres decommission (1 week). Use parallel batch exports and imports to maximize throughput—single-threaded exports take 3-5x longer.
Can Sourcetable connect directly to ClickHouse without writing SQL?
Yes. Add your ClickHouse credentials (host, port, database, username, password) once, and all tables appear in the data picker with column types and sample data. Ask questions in natural language like 'Show top 100 users by event count last 30 days' and the AI generates optimized ClickHouse SQL with PREWHERE, partition pruning, and efficient aggregations automatically. Results appear in the spreadsheet in 1-3 seconds. You can also write raw SQL if you prefer manual control.
What happens to my ClickHouse data if I stop paying for ClickHouse Cloud?
ClickHouse Cloud retains your data for 7 days after subscription cancellation, allowing you to export or resume service. After 7 days, data is permanently deleted. For production systems, maintain your own backup strategy: export critical tables to S3 using ClickHouse's S3 table engine or use ClickHouse's built-in backup/restore commands to create snapshots. Self-hosted ClickHouse gives you full control—your data stays on your infrastructure regardless of service status.
Is ClickHouse ACID-compliant for transactional workloads?
No. ClickHouse is designed for analytical (OLAP) workloads, not transactional (OLTP) workloads. It provides eventual consistency, not ACID guarantees. Writes are batched and merged asynchronously—a row inserted now might not appear in queries for 1-10 seconds. Use ClickHouse for analytics (aggregations, dashboards, reports) and Postgres/MySQL for transactions (user accounts, orders, inventory). Keep both: Postgres for transactional data, ClickHouse for analytical queries on replicated/exported data.
Can I use ClickHouse for real-time dashboards with sub-second data freshness?
Yes, with caveats. ClickHouse can ingest 100,000-500,000 rows/second, but data visibility lags by 1-10 seconds due to asynchronous merge operations. For true real-time (sub-second freshness), use ClickHouse's Kafka table engine to stream events directly from Kafka—this reduces lag to 100-500ms. Real-time dashboards work well for monitoring (displaying last 5 minutes of data), but expect slight delays. For millisecond-level real-time requirements, consider specialized streaming databases like Apache Druid.
Sourcetable Logo
Query Your Billion-Row Database with AI

Experience the future of spreadsheets

Sources

References and data sources used in this article

  1. ClickHouse Documentation - Table Engines and Query Optimization (2026)
  2. Google Cloud BigQuery Pricing Calculator (2026)
  3. AWS EC2 Pricing - m6i Instance Family (May 2026)
  4. ClickHouse Cloud Pricing (2026)
  5. PostgreSQL Performance Tuning Guide - Large Table Optimization (2025)
  6. Altinity ClickHouse Knowledge Base - Billion Row Tables (2026)
Andrew Grosser

Andrew Grosser

Founder, CTO @ Sourcetable

Sourcetable is the Agent first spreadsheet that helps traders, scientists, analysts, and finance teams hypothesize, evaluate, validate, make trades and iterate on trading strategies without writing code.

Share this article

Drop CSV