Column Stores, Column-Oriented, NoSQL Data Structures, etc.

In 2020, while interviewing with various “FAANG” companies, I found that my knowledge of column-store databases was lacking, so I wrote this draft blog post with a bunch of links, planning to polish it later. I don’t recall what the original vision for this post was. Instead, I will call this post “a list of good links and some definitions pulled from various websites for learning about column stores and related database technologies”:

https://en.wikipedia.org/wiki/List_of_column-oriented_DBMSes

https://www.geeksforgeeks.org/aggregate-functions-in-sql/

https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview?view=sql-server-ver15

Parquet

“Parquet isn’t a database. Instead, it’s a file format that can be used to store database tables on distributed file systems like HDFS, Ceph or AWS S3. Data is stored in Parquet in chunks that contain blocks of column data, which makes it possible to break up a Parquet file, store it on many distributed hosts and process it in parallel. You can access Parquet files using Apache Spark, Hive, Pig, Apache Drill and Cloudera’s Impala.”

For more information on Parquet: https://parquet.apache.org/documentation/latest/
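
To make the "chunks that contain blocks of column data" idea concrete, here is a minimal sketch in plain Python. It is a toy model only, not the real Parquet format or any Parquet library API: the helper names `to_row_groups` and `sum_column_parallel` and the sample table are made up for illustration. It splits a column-oriented table into row groups, each holding one chunk per column, and then sums a single column by handling each group independently.

```python
from concurrent.futures import ThreadPoolExecutor


def to_row_groups(columns, rows_per_group):
    """Split a column-oriented table into row groups of column chunks.

    `columns` maps a column name to a list of values; all lists have
    the same length. Each returned group holds one slice per column.
    """
    n_rows = len(next(iter(columns.values())))
    groups = []
    for start in range(0, n_rows, rows_per_group):
        end = start + rows_per_group
        groups.append({name: values[start:end] for name, values in columns.items()})
    return groups


def sum_column_parallel(groups, column):
    """Sum one column by processing every row group independently."""
    with ThreadPoolExecutor() as pool:
        # Each task reads only the chunk of the requested column.
        partial_sums = pool.map(lambda group: sum(group[column]), groups)
        return sum(partial_sums)


# Hypothetical sample table, stored column by column.
table = {
    "user_id": [1, 2, 3, 4, 5, 6, 7],
    "amount": [10, 20, 30, 40, 50, 60, 70],
}

row_groups = to_row_groups(table, rows_per_group=3)
total = sum_column_parallel(row_groups, "amount")
print(len(row_groups), total)
```

Because every group is self-contained, the groups could just as well live on different hosts, which is the property the quote above describes.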

“Column store databases store data in columns instead of rows. They make it possible to compute statistics on those columns one to two orders of magnitude (or more) faster than on traditional row-oriented databases.

A column-oriented table is very good for analytics but usually terrible for traditional transactional workloads.

Most, although not all, column stores are designed to operate on a distributed cluster of servers.”
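
A small Python sketch of the difference between the two layouts, assuming a made-up three-column table (`id`, `name`, `amount`). It only illustrates the access pattern and is not how any particular database stores data: in the row layout a statistic on one field has to walk every full record, while in the column layout it reads one contiguous array.

```python
from array import array

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    (1, "alice", 30.0),
    (2, "bob", 45.5),
    (3, "carol", 12.25),
]

# Column-oriented layout: each field is stored as its own sequence.
columns = {
    "id": array("q", (r[0] for r in rows)),
    "name": [r[1] for r in rows],
    "amount": array("d", (r[2] for r in rows)),
}


def mean_from_rows(records, field_index):
    """Average one field; every whole record is visited to reach it."""
    return sum(record[field_index] for record in records) / len(records)


def mean_from_column(values):
    """Average one field; only that field's values are read."""
    return sum(values) / len(values)


row_mean = mean_from_rows(rows, 2)
column_mean = mean_from_column(columns["amount"])
print(row_mean, column_mean)
```

The flip side shows up on writes: inserting one new record touches a single place in `rows` but every entry in `columns`, which hints at why column layouts suit analytics better than transactional workloads.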

https://www.quora.com/Which-NoSQL-database-is-most-suitable-for-GROUP-BY-Aggregation-queries-on-large-dataset

“Which NoSQL database is most suitable for GROUP BY/aggregation queries on a large dataset?

Answer by Andrew Patterson, Software Engineer (April 25, 2018):

Column-oriented databases are your best bet because fields are stored together on disk, thereby minimising seeks and hopefully making queries faster.”
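
As a rough illustration of why a column layout suits GROUP BY queries, here is a minimal sketch that assumes a hypothetical table with `country`, `revenue` and `comment` columns. The function `group_by_sum` is made up for this example and plays the role of `SELECT country, SUM(revenue) ... GROUP BY country`. Only the two columns named in the query are read, and the third is never touched.

```python
from collections import defaultdict


def group_by_sum(keys, values):
    """Sum `values` per distinct entry in `keys` (a hash aggregation)."""
    totals = defaultdict(int)
    for key, value in zip(keys, values):
        totals[key] += value
    return dict(totals)


# Hypothetical table stored as separate columns.
country = ["US", "DE", "US", "FR", "DE", "US"]
revenue = [100, 50, 25, 70, 30, 5]
comment = ["...", "...", "...", "...", "...", "..."]  # never read by the query

totals = group_by_sum(country, revenue)
print(totals)
```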

Free and open-source software (FOSS)

| Database Name | Language Implemented in | Notes |
|---|---|---|
| Apache Druid | Java | started in 2011 for low-latency massive ingestion and queries |
| Apache Kudu | C++ | released in 2016 to complete the Apache Hadoop ecosystem |
| Calpont InfiniDB | C++ | |
| ClickHouse | C++ | released in 2016 to analyze data that is updated in real time |
| CrateDB | Java | |
| C-Store | | |
| Greenplum Database | C | |
| PostgreSQL cstore_fdw, vops | C | cstore_fdw uses ORC format |
| MariaDB ColumnStore | C & C++ | formerly Calpont InfiniDB |
| MapD | C++ | |
| Metakit | C++ | |
| MonetDB | C | |
| Scylla Open Source | C++ | |

Platform as a Service (PaaS)

Proprietary
