Column vs row-oriented databases
The key difference between column and row-oriented databases is the way values are stored. Column-oriented databases store all the values for a given column in the same place, while row-oriented databases store all the values in a row (no matter the number of columns) together.
This means that if one needs to query a lot of rows but only a few columns, using a database that keeps columns stored together can provide a huge performance advantage, as data can be retrieved based only on the columns that need to be projected. If we were to use a row-oriented database, since each row is stored together we would always have to go over the entire row even when not required.
Let’s consider the following example:
id | name | age |
---|---|---|
1 | john | 32 |
2 | alexa | 29 |
3 | dinis | 25 |
Row-oriented database
In a row-oriented database, data is stored row by row, i.e. the data is represented in the disk by a sequential list of rows, like the following example:
1 | john | 32 | 2 | alexa | 29 | 3 | dinis | 25 |
This is usually how traditional databases like Postgres and MySQL work and it works quite well for OLTP (Online Transactional Processing) applications, as they can provide good performance for writes - since writing new data is just a matter of appending a new row after the last one.
When it comes to reads, it depends on whether we need to read the entire row or just a few columns - i.e. when dealing with complex operations usually the entire row has to be kept in memory even though only part of its columns are needed.
Column-oriented database
Column-oriented databases, on the other hand, were created to support fast reads (and ad-hoc queries) even if that requires compromising writes as we will see. This is why they are usually used by OLAP (Online Analytical Processing) applications. In a column-oriented database, the values for a given column are grouped, like the following example:
1 | 2 | 3 | john | alexa | dinis | 32 | 29 | 25 |
This means that adding a new row forces the database to add a new value after all the values for each of the columns the table has - this can be simplified by having each column stored in a different file/disk.
When it comes to reads, this database is more appropriate for fast reads as it’s a lot easier to go over the values for a specific column without having to go through all the other columns.