Row Storage (CSV) vs Column Storage (Parquet)
CSV: Row by Row
name   age  city
Alice  25   NYC
Bob    30   LA
Carol  28   CHI
← Must read ALL rows + columns
SELECT AVG(age) FROM data
Reads 9 cells to get 3 values
vs
Parquet: Column by Column
name: Alice, Bob, Carol
age:  25, 30, 28
city: NYC, LA, CHI
← Only reads the age column
SELECT AVG(age) FROM data
Reads 3 cells to get 3 values
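The cell counts in the two panels can be reproduced with a toy model. This is only a sketch in plain Python: the in-memory lists stand in for the file layouts, "cells read" stands in for I/O, and the function names are made up for illustration. It does not use a real CSV or Parquet reader.

```python
# Toy model of the two layouts; counting cells touched stands in for I/O.
header = ("name", "age", "city")
rows = [
    ("Alice", 25, "NYC"),
    ("Bob", 30, "LA"),
    ("Carol", 28, "CHI"),
]

def avg_age_row_layout(rows, header):
    """Row layout: every row is read in full to reach its age field."""
    age_idx = header.index("age")
    cells_read = 0
    total = 0
    for row in rows:
        cells_read += len(row)  # the whole row is read, not just one field
        total += row[age_idx]
    return total / len(rows), cells_read

# Column layout: the values of each column are stored together.
columns = {name: [row[i] for row in rows] for i, name in enumerate(header)}

def avg_age_column_layout(columns):
    """Column layout: only the age column is touched."""
    ages = columns["age"]
    return sum(ages) / len(ages), len(ages)

row_avg, row_cells = avg_age_row_layout(rows, header)
col_avg, col_cells = avg_age_column_layout(columns)
print("row layout:   ", row_avg, "cells read:", row_cells)
print("column layout:", col_avg, "cells read:", col_cells)
```

Both functions return the same average; the row layout touches 9 cells and the column layout touches 3.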
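With 50 columns and 1M rows, the same query makes Parquet read 1 column,
while CSV reads all 50. If the columns are similar in size, that is roughly a 50x difference in I/O, before compression is taken into account.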
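The arithmetic behind the 50x figure can be checked in a few lines. The 8 bytes per value is an assumption made only to keep the numbers simple; real columns differ in width, and Parquet's encoding and compression change the absolute sizes.

```python
NUM_COLUMNS = 50
NUM_ROWS = 1_000_000
BYTES_PER_VALUE = 8  # assumption: every value has the same fixed width

# Row storage: a scan reads every column of every row.
row_scan_bytes = NUM_COLUMNS * NUM_ROWS * BYTES_PER_VALUE

# Column storage: the query needs only the one column it references.
column_scan_bytes = 1 * NUM_ROWS * BYTES_PER_VALUE

ratio = row_scan_bytes / column_scan_bytes
print(row_scan_bytes, column_scan_bytes, ratio)
```

Under this assumption the ratio equals the number of columns, whatever the per-value width is. When the queried column is narrower or wider than the average column, the ratio shifts accordingly.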