The Cassandra data model defines
Cassandra 1.2+ relies on CQL schema, concepts, and terminology, though the older Thrift API remains available.
Table (CQL API terms) | Column Family (Thrift API terms) |
---|---|
Table is a set of partitions | Column family is a set of rows |
Partition may be single-row or multi-row | Row may be skinny or wide |
Partition key uniquely identifies a partition, and may be simple or composite | Row key uniquely identifies a row, and may be simple or composite |
Column uniquely identifies a cell in a partition, and may be regular or clustering | Column key uniquely identifies a cell in a row, and may be simple or composite |
Primary key consists of a partition key plus clustering columns, if any, and uniquely identifies a row in both its partition and table | |
Row is the smallest unit that stores related data in Cassandra.
Rows may be described as skinny or wide.
A compound primary key consists of the partition key and one or more additional columns that determine clustering. The partition key determines which node stores the data. It is responsible for data distribution across the nodes. The additional columns determine per-partition clustering. Clustering is a storage engine process that sorts data within the partition.
In a simple primary key, Apache Cassandra™ uses the first column name as the partition key. (Note that Cassandra can use multiple columns in the definition of a partition key.) In the music service's playlists table, id is the partition key. The remaining columns can be defined as clustering columns. In the playlists table below, song_order is defined as the clustering column:
PRIMARY KEY (id, song_order);
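For context, a primary key clause like the one above sits at the end of a full table definition. The following is a minimal sketch of what such a definition might look like. Only id and song_order come from this page; the other column names and all of the types are assumptions made for illustration.

```cql
-- Sketch of a playlists table. Columns other than id and song_order are assumed.
CREATE TABLE playlists (
  id uuid,          -- partition key: determines which node stores the playlist
  song_order int,   -- clustering column: sorts rows within the partition
  song_id uuid,     -- assumed column
  title text,       -- assumed column
  album text,       -- assumed column
  artist text,      -- assumed column
  PRIMARY KEY (id, song_order)
);
```

The first name inside the PRIMARY KEY parentheses is the partition key, and any names after it are clustering columns.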
The data for each partition is clustered by the remaining column or columns of the primary key definition. On a physical node, when rows for a partition key are stored in order based on the clustering columns, retrieval of rows is very efficient. For example, because the id in the playlists table is the partition key, all the songs for a playlist are clustered in the order of the remaining song_order column. The other columns are displayed in alphabetical order by Cassandra.
Insertion, update, and deletion operations on rows sharing the same partition key for a table are performed atomically and in isolation.
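One way to group several such operations is a batch whose statements all target the same partition. The sketch below is illustrative only: the title column and the values are assumptions, and the UUID is reused from the query example on this page.

```cql
-- Both statements use the same id, so they touch a single partition.
-- The title column and all values are placeholders for illustration.
BEGIN BATCH
  INSERT INTO playlists (id, song_order, title)
    VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 5, 'Example Title');
  DELETE FROM playlists
    WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204 AND song_order = 4;
APPLY BATCH;
```

If the statements referred to different id values, they would span partitions and the per-partition isolation described above would no longer cover the whole batch.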
You can query a single sequential set of data on disk to get the songs for a playlist.
SELECT * FROM playlists WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204
ORDER BY song_order DESC LIMIT 50;
The output looks something like this:
Cassandra stores an entire row of data on a node by partition key. If you have too much data in a partition and want to spread the data over multiple nodes, use a composite partition key.
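As a rough sketch of a composite partition key, the table below adds a second component to the partition key so that one playlist can be split across several partitions. The table name playlists_bucketed, the bucket column, and the title column are hypothetical and are not part of the original schema.

```cql
-- The inner parentheses group id and bucket into a composite partition key.
CREATE TABLE playlists_bucketed (
  id uuid,
  bucket int,       -- hypothetical column, e.g. derived from song_order
  song_order int,   -- still the clustering column within each partition
  title text,       -- assumed column
  PRIMARY KEY ((id, bucket), song_order)
);

-- A read must supply every component of the partition key.
SELECT * FROM playlists_bucketed
WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204 AND bucket = 0;
```

The trade-off is that reading an entire playlist now takes one query per bucket, so the bucket size should reflect how the data is actually read.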
Data can also end up unevenly distributed if your data model is not designed for your data profile. You must understand your domain well, and your data model must fit it. In Cassandra, the partition key is responsible for distributing your data across the cluster and must be chosen carefully.