What is columnar?
Columnar (or column‑oriented) refers to a way of storing data where all the values of a single column are kept together, instead of storing complete rows one after another. Think of a spreadsheet turned on its side: each column’s data is stored in a continuous block.
Let's break it down
- Row‑based storage: A row contains all fields for one record (e.g., name, age, city) stored together.
- Columnar storage: All values for the same field (e.g., all names, then all ages, then all cities) are stored in separate, contiguous blocks.
- Because similar data types sit side by side, the system can compress them more efficiently and read only the columns needed for a query.
Why does it matter?
When you run analytical queries that only need a few columns from a huge table, a columnar layout lets the engine skip the irrelevant data, dramatically reducing I/O and speeding up results. It also enables better compression, saving storage space and further improving performance.
Where is it used?
- Data warehouses (e.g., Amazon Redshift, Snowflake, Google BigQuery)
- Analytical databases (e.g., ClickHouse, Apache Parquet files, Apache ORC)
- Business intelligence tools that query large datasets for reporting and dashboards
- Big‑data processing frameworks like Apache Spark when reading columnar file formats.
Good things about it
- Fast read performance for analytics and reporting
- High compression ratios because similar values are stored together
- Selective column reading reduces I/O and network traffic
- Efficient for aggregations (sums, averages, counts) on large datasets
- Works well with modern hardware (SSD, parallel processing).
Not-so-good things
- Slower write and update performance; inserting or modifying rows can be costly.
- Not ideal for transactional workloads (OLTP) that need frequent single‑row inserts/updates.
- Complexity in data loading pipelines; may require batch loading or special tools.
- Potentially higher memory usage during query execution if many columns are accessed simultaneously.