Data Models

Understand the Cassandra data model

The Cassandra data model defines:

  • Column family as a way to store and organize data
  • Table as a two-dimensional view of a multi-dimensional column family
  • Operations on tables using the Cassandra Query Language (CQL)

Cassandra 1.2+ relies on CQL schema, concepts, and terminology, though the older Thrift API remains available.

Table (CQL API terms) | Column Family (Thrift API terms)
Table is a set of partitions | Column family is a set of rows
Partition may be single-row or multi-row | Row may be skinny or wide
Partition key uniquely identifies a partition, and may be simple or composite | Row key uniquely identifies a row, and may be simple or composite
Column uniquely identifies a cell in a partition, and may be regular or clustering | Column key uniquely identifies a cell in a row, and may be simple or composite
Primary key consists of a partition key plus clustering columns, if any, and uniquely identifies a row in both its partition and table

Row (Partition)

Row is the smallest unit that stores related data in Cassandra

  • Rows: individual rows constitute a column family
  • Row key: uniquely identifies a row in a column family
  • Row: stores pairs of column keys and column values
  • Column key: uniquely identifies a column value in a row
  • Column value: stores one value or a collection of values

Rows may be described as skinny or wide

  • Skinny row: has a fixed, relatively small number of column keys
  • Wide row: has a relatively large number of column keys (hundreds or thousands); this number may increase as new data values are inserted
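The row structure described above can be pictured as a nested map. The following is a minimal Python sketch of that picture, not Cassandra's storage format; the row keys (`user:42`, `sensor:7`), the column keys, and the `put` helper are made up for illustration.

```python
# Illustrative model only: a column family as a map of row key -> row,
# where each row is a map of column key -> column value.
column_family = {}

def put(row_key, column_key, value):
    """Store one column value under (row key, column key)."""
    column_family.setdefault(row_key, {})[column_key] = value

# Skinny row: a fixed, small set of column keys.
put("user:42", "name", "Ada")
put("user:42", "email", "ada@example.com")

# Wide row: the set of column keys grows as new values are inserted.
for minute in range(1000):
    put("sensor:7", f"reading:{minute:04d}", 20.0 + minute * 0.01)

skinny_width = len(column_family["user:42"])
wide_width = len(column_family["sensor:7"])
print(skinny_width, wide_width)
```

The only difference between the two rows in this model is how many column keys they hold and whether that number keeps growing.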

Key

A compound primary key consists of the partition key and one or more additional columns that determine clustering. The partition key determines which node stores the data. It is responsible for data distribution across the nodes. The additional columns determine per-partition clustering. Clustering is a storage engine process that sorts data within the partition.
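As a rough illustration of the two roles, the sketch below hashes the partition key to pick a node and then sorts each partition by its clustering column. It is a simplification under stated assumptions: the node names, the sample rows, and the MD5-modulo placement are hypothetical, whereas a real cluster uses a partitioner and token ranges.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster

def node_for(partition_key: str) -> str:
    """Pick a node from a hash of the partition key (simplified, not a token ring)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token % len(NODES)]

# Rows as (partition key, clustering column, value); all values are made up.
rows = [
    ("playlist-1", 3, "song C"),
    ("playlist-2", 1, "song X"),
    ("playlist-1", 1, "song A"),
    ("playlist-1", 2, "song B"),
]

# The partition key alone decides the node; the clustering column only
# decides the order of rows inside the partition.
cluster = {}
for pk, ck, value in rows:
    cluster.setdefault(node_for(pk), {}).setdefault(pk, []).append((ck, value))
for partitions in cluster.values():
    for partition in partitions.values():
        partition.sort()
```

Every row with the same partition key lands on the same node, regardless of its clustering column value.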

In a simple primary key, Apache Cassandra™ uses the first column name as the partition key. (Note that Cassandra can use multiple columns in the definition of a partition key.) In the music service's playlists table, id is the partition key. The remaining columns can be defined as clustering columns. In the playlists table below, song_order is defined as the clustering column:

PRIMARY KEY (id, song_order);

The data for each partition is clustered by the remaining column or columns of the primary key definition. On a physical node, when rows for a partition key are stored in order based on the clustering columns, retrieval of rows is very efficient. For example, because the id in the playlists table is the partition key, all the songs for a playlist are clustered in the order of the remaining song_order column. The other columns are displayed in alphabetical order by Cassandra.
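To see why ordered storage makes retrieval cheap, here is a small Python sketch of a single partition kept sorted by song_order. It is an in-memory analogy rather than Cassandra's storage engine; the `insert` and `read_desc` helpers and the placeholder titles are assumptions made for the example.

```python
import bisect

# One partition, kept sorted by the clustering column (song_order) on every insert.
partition = []  # list of (song_order, title); titles are placeholders

def insert(song_order, title):
    """Place the row at its sorted position instead of appending and re-sorting."""
    bisect.insort(partition, (song_order, title))

for order, title in [(3, "third"), (1, "first"), (4, "fourth"), (2, "second")]:
    insert(order, title)

def read_desc(limit):
    """Return up to `limit` rows in descending song_order by walking the sorted run backwards."""
    return partition[::-1][:limit]

latest_two = read_desc(2)
```

Because the rows are already in clustering order, a descending read with a limit is just a slice from the end of the run, with no sorting at query time.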

Insertion, update, and deletion operations on rows sharing the same partition key for a table are performed atomically and in isolation.

You can query a single sequential set of data on disk to get the songs for a playlist.

SELECT * FROM playlists WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204
  ORDER BY song_order DESC LIMIT 50;

The output looks something like this:

Cassandra stores an entire row of data on a node by partition key. If you have too much data in a partition and want to spread the data over multiple nodes, use a composite partition key.

[Figure: unbalanced Cassandra cluster]

Data can also end up unevenly distributed if your data model is not designed for your data profile. You must understand your domain well, and your data model must fit it. In Cassandra, the partition key is responsible for distributing your data across the cluster and must be chosen carefully.
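The effect of the partition key choice on distribution can be sketched in a few lines of Python. This is an illustration under assumptions, not a model of a real partitioner: the cluster size, the `sensor-1` key, the day buckets, and the MD5-modulo placement are all hypothetical.

```python
import hashlib
from collections import Counter

NODES = 4  # hypothetical cluster size

def node_for(*key_parts):
    """Hash all partition key components together (simplified stand-in for a partitioner)."""
    joined = ":".join(str(p) for p in key_parts)
    digest = hashlib.md5(joined.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODES

# 10,000 events for one busy sensor, spread over 100 hypothetical day buckets.
events = [("sensor-1", day) for day in range(100) for _ in range(100)]

# Simple partition key: every event hashes to the same node.
simple = Counter(node_for(sensor) for sensor, _ in events)

# Composite partition key (sensor, day): each bucket is hashed separately.
composite = Counter(node_for(sensor, day) for sensor, day in events)

print("simple:", dict(simple))
print("composite:", dict(composite))
```

With the simple key, all 10,000 rows form one partition on one node; adding the day to the partition key splits them into 100 partitions that can be placed independently.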

Illustration

  • Composite row key: multiple components separated by a colon
  • Composite column key: multiple components separated by a colon
  • Column family (Table): set of rows with a similar structure
  • Table with single-row partitions
  • Table with multi-row partitions
