Not only SQL (NoSQL)

NoSQL – Not only SQL

NoSQL is a term used to describe a subset of databases that differ in various ways from traditional relational databases (RDBMS). The most notable differences usually concern schemas, or the lack thereof, and the ability to scale horizontally. The term was used in 1998 by Carlo Strozzi and again in 2009 by Eric Evans, and it is the 2009 usage to which the current “NoSQL movement” can be attributed.

So what are NoSQL databases and why have they emerged and progressed so rapidly?
Traditional SQL servers let users create a schema and define its structure, which often has very rigid rules. These systems usually promise ACID compliance.

The promise: ACID

  • Atomicity
  • Consistency
  • Isolation
  • Durability

The challenges presented by modern web applications fall outside the considerations that were taken into account when ACID compliance was being devised. A non-exhaustive list of “new” considerations is:

  • Web-scale data
  • High read-write rates
  • Frequent schema changes
  • “Social” apps, which are not banks and don’t need the same level of ACID

NoSQL databases attempt to provide the flexibility and scalability to serve these demands. The progress made in the last few years has been astounding, considering the production-ready NoSQL options that are currently available. Cassandra, MongoDB, Jackrabbit, CouchDB, BigTable and Dynamo are some of the major players. The number of such systems, together with the user base and community each project now has, makes it obvious that there has been a need for them. Front-line companies that saw this need include Facebook, Google (BigTable), Amazon (Dynamo), Digg, Twitter and many others.

Apache Cassandra

I am a Cassandra fan, so the rest of this article is geared towards Cassandra’s approach to providing a NoSQL solution.

Background

Cassandra’s inception happened while the Facebook team was seeking a solution for their inbox search functionality. Jeff Hammerbacher, who led the team, described Cassandra as a Bigtable-like data model on top of Dynamo’s infrastructure. Apache Cassandra is described, in the project’s own words, as:

“a highly scalable second-generation distributed database, bringing together Dynamo’s fully distributed design and Bigtable’s ColumnFamily-based data model.”

In a nutshell, it gives you a horizontally scalable, high-performance, distributed data store with a very rich data model.

Architecture & Data Model

Arin Sarkissian (Digg) did a brilliant piece on Cassandra’s data model. In this section, I’ll give a quick overview of Cassandra’s data model and its architecture. Unfortunately this topic alone could span an entire article, so this will be a very succinct introduction.

Cassandra’s data model can be seen as having various layers. The layers build on top of each other, each layer in the hierarchy encapsulating the smaller layers.

Columns

At the bottom of the hierarchy is the column. A column has three parts: a name, a value and a timestamp. The name and value are stored as raw byte arrays (byte[]) and can be of any size. A column could be represented as shown below:

Column name   Username
Value         Courts
Timestamp     123456789

Try to visualise this in the JSON(ish) notation that many others use to describe Cassandra’s data model. For example, if we are representing a user’s first name in a column, we may have:

{
    name: "firstName",
    value: "Courtney",
    timestamp: 123456789
}
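
The same column can also be sketched in code. The snippet below is only a minimal Python illustration, not how Cassandra itself is implemented; the `Column` class and its field names are assumptions chosen to mirror the description above, with the name and value held as raw bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical model of a column: a name, a value and a timestamp."""
    name: bytes      # stored as a raw byte array
    value: bytes     # stored as a raw byte array
    timestamp: int


# The firstName column from the example above.
first_name = Column(name=b"firstName", value=b"Courtney", timestamp=123456789)

print(first_name.name.decode("utf-8"), "=", first_name.value.decode("utf-8"))
```

Because both parts are plain bytes, it is up to the application to decide how to encode and decode them; UTF-8 is used here purely for readability.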

 

Moving up the encapsulation chain, Cassandra has what is known as a super column. A super column is similar in that it has a name and a value; however, it does not have a timestamp. The major difference between a column and a super column is that a column maps to the binary representation of a string value, whereas a super column maps to a number of columns. To visualise this, see the table below:

Super column name     contactDetails
Super column values
  Column name         mobile        home
  Value               07745361723   02043422457
  Timestamp           123456789     123456789

Carrying on with our JSON notation, the same super column can be represented as:

{
    name: "contactDetails",
    value: {
        mobile: {name: "mobile", value: "07745361723", timestamp: 123456789},
        home: {name: "home", value: "02043422457", timestamp: 123456789}
    }
}

Looking at the examples above, “contactDetails” is the name of the super column, and within the super column there can be an arbitrary number of columns. A good point to note is that the key to each column within the super column is the same as the name of that column.
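
As a rough sketch of that relationship, the Python below models a super column as a name that maps to a set of columns keyed by their own names. The `Column` and `SuperColumn` classes are hypothetical names used for illustration only and are not part of any Cassandra client.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Column:
    name: bytes
    value: bytes
    timestamp: int


@dataclass
class SuperColumn:
    """Hypothetical model: a name mapping to columns, with no timestamp."""
    name: bytes
    columns: Dict[bytes, Column] = field(default_factory=dict)

    def add(self, column: Column) -> None:
        # The key inside the super column is the column's own name.
        self.columns[column.name] = column


contact_details = SuperColumn(name=b"contactDetails")
contact_details.add(Column(b"mobile", b"07745361723", 123456789))
contact_details.add(Column(b"home", b"02043422457", 123456789))

print(sorted(contact_details.columns))
```

Note that only the inner columns carry timestamps; the super column itself is just a named container.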

Column Family (CF)

To group columns and super columns, Cassandra has what is known as a Column Family (CF). A column family can be of type STANDARD or SUPER. Standard column families store the smaller column type, which has a name, value and timestamp. Column families of type SUPER, however, store the super columns you create. To relate this to what most of us already know from relational databases, a Column Family can be thought of as a table. A CF can contain an arbitrary number of rows, and these rows can be imagined in the same way as rows in an RDBMS.

Within a CF, columns are grouped within a row (similar to the relationship between rows and columns in an RDBMS). A row has a key, and this key maps to the columns in that row. Within the row, the column names again act as the internal keys to which the row key maps. The columns contained within a row can be of type SUPER or STANDARD, and the column family’s type matches whichever one they are. In other words, if your columns are all of type SUPER then your column family is also of type SUPER, and the same holds for STANDARD.

A very important point to note here is that no strict schema rules are enforced between a CF and its rows and columns. That means a row can have any number of columns without influencing the structure of another row within the CF. For example, suppose you design a system to accept only a user’s first name and password, and you store these in two columns for early users. If you later decide you want to collect the user’s surname, address and so on, you can add the additional columns for new users without having to modify any existing data. Likewise, if you decide you want the old users to provide the additional information, no changes are necessary beyond collecting the extra data and updating the row.

Because the structure is getting more complicated, I am going to stop writing name=xyz, value=abc and timestamp=1…9; instead I’ll use xyz : abc : 123 as the format, which is easier to visualise. The table below pulls all of this together:

CF = User

rowKey1
fName       mobile        home
Adrian      09434535623   01683299348
123456789   123456789     123456789

rowKey2
fName       lName       mobile        home
courtney    Robinson    07745361723   02043422457
123456789   123456789   123456789     123456789

rowKey3
fName
Damion
123456789

If you find the JSON notation easier then this should help:

User = {
    rowKey1: {
        fName: "Adrian" : 123456789,
        mobile: "09434535623" : 123456789,
        home: "01683299348" : 123456789
    },
    rowKey2: {
        fName: "courtney" : 123456789,
        lName: "Robinson" : 123456789,
        mobile: "07745361723" : 123456789,
        home: "02043422457" : 123456789
    },
    rowKey3: {
        fName: "Damion" : 123456789
    }
}
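
To make the lack of a fixed schema concrete, here is a small Python sketch that treats the User column family as nested dictionaries (row key, then column name, then a value and timestamp pair). The `insert` helper and the surname given to rowKey3 are made up for illustration; this is an in-memory stand-in, not a Cassandra client.

```python
# Hypothetical in-memory stand-in for the User column family:
# row key -> column name -> (value, timestamp)
user_cf = {
    "rowKey1": {
        "fName": ("Adrian", 123456789),
        "mobile": ("09434535623", 123456789),
        "home": ("01683299348", 123456789),
    },
    "rowKey3": {
        "fName": ("Damion", 123456789),
    },
}


def insert(cf, row_key, columns, timestamp):
    """Add or overwrite columns in one row, creating the row if needed."""
    row = cf.setdefault(row_key, {})
    for name, value in columns.items():
        row[name] = (value, timestamp)


# A later user supplies more details; no other row is touched.
insert(user_cf, "rowKey2",
       {"fName": "courtney", "lName": "Robinson",
        "mobile": "07745361723", "home": "02043422457"},
       123456789)

# An early user adds a surname afterwards (made-up value and timestamp).
insert(user_cf, "rowKey3", {"lName": "ExampleSurname"}, 123456790)

for row_key in sorted(user_cf):
    print(row_key, sorted(user_cf[row_key]))
```

Each row ends up with its own set of column names, and adding lName to one row required no change to rowKey1.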

Keyspace

Cassandra’s data model has another layer within its encapsulation hierarchy, the Keyspace. A Keyspace is the outermost layer of Cassandra’s data model; it encapsulates all the column families you create. There is no relationship between the column families in a Keyspace. As seen earlier, each row within a column family can have a different set of columns, so having one set of columns in a column family named User does not imply any relationship to the columns or rows in another column family, Profile.

Sorting

Cassandra cannot be queried the same way you would query an SQL database. There is no possibility of a join, and you cannot tell Cassandra how you want data sorted at the time you fetch it. Sorting is a decision you make when creating your data structure: once you tell Cassandra how you want your data sorted, it remains in that order. There are many ways to sort your data, and you can even write your own class which tells Cassandra how to sort it. Columns are sorted by their type; the default types currently included are:

  • BytesType 
  • UTF8Type
  • AsciiType
  • LongType
  • LexicalUUIDType
  • TimeUUIDType
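
Why the chosen type matters can be seen in a small Python sketch. It is only an approximation: Python’s built-in sort stands in for Cassandra’s comparison classes, and it assumes a byte-wise comparison for the text-like case and 8-byte big-endian values for the numeric case (in the spirit of LongType).

```python
import struct

# Three column names holding the numbers 9, 10 and 100.
numbers = [10, 9, 100]

# Text-like ordering: names are compared byte by byte, so "100" sorts
# before "9".
as_text = sorted(str(n).encode("ascii") for n in numbers)

# Numeric ordering: names are packed as 8-byte big-endian longs and
# compared by their numeric value.
as_longs = sorted(
    (struct.pack(">q", n) for n in numbers),
    key=lambda raw: struct.unpack(">q", raw)[0],
)

print([name.decode("ascii") for name in as_text])        # ['10', '100', '9']
print([struct.unpack(">q", raw)[0] for raw in as_longs])  # [9, 10, 100]
```

The same three names come back in a different order depending on how they are compared, which is why the choice has to be made up front.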

Rows are sorted according to the partitioner in use:

  • RandomPartitioner
  • OrderPreservingPartitioner
  • CollatingOrderPreservingPartitioner

A partitioner is responsible for distributing rows (by key) across nodes in the cluster. You may also provide a class you have written to decide where a particular key should be stored within the cluster.
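
The idea of mapping a row key to a node can be sketched as follows. This is a toy model under stated assumptions: the three node names and their tokens are invented, the ring logic is greatly simplified, and hashing the key with MD5 is used to show the random style of placement in which key order is lost.

```python
import bisect
import hashlib

# Hypothetical three-node ring; each node owns the range of tokens
# up to and including its own token. Tokens here are made up.
RING_SIZE = 2 ** 127
ring = [
    (RING_SIZE // 3, "node-a"),
    (2 * RING_SIZE // 3, "node-b"),
    (RING_SIZE - 1, "node-c"),
]
tokens = [token for token, _ in ring]


def random_token(row_key: str) -> int:
    """Hash the key so rows spread evenly, losing any key ordering."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def node_for(token: int) -> str:
    """Pick the first node whose token is >= the row's token."""
    index = bisect.bisect_left(tokens, token) % len(ring)
    return ring[index][1]


for key in ("rowKey1", "rowKey2", "rowKey3"):
    print(key, "->", node_for(random_token(key)))
```

An order-preserving scheme would instead derive the token from the key itself, keeping neighbouring keys together at the cost of a less even spread.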

Benefits for the Web

How can Cassandra benefit web developers? Lone developers and even small teams often become victims of their own success. By that I mean that when we build an application which then attracts a large audience, we can’t always grow with that audience, whether because of a lack of resources (including money and servers) or because we lack the technology to scale with demand. Although Cassandra is written to be a distributed server, it runs just as well on a single node. So you get:

  • Horizontal scalability (Add new hardware when required)
  • Super fast responses even as demands grow
  • Even faster write speeds to handle your growing user population
  • Distributed storage (No sharding required!)
  • Flexibility to change schema as users demand more functionality
  • A nice simple and clean API in your favourite language
  • Automatic failure detection
  • No single point of failure (every node knows about the others)
  • Decentralised
  • Fault tolerant
  • Durable
  • Hadoop support for Map Reduce
  • Hinted handoff

Hinted handoff deserves a quick explanation. When Cassandra detects that a node is potentially down and then receives data destined for that node, it writes the data to another node instead. Once the failed node is back online and has rejoined the cluster, the data that was meant for it is handed off from the node that held it temporarily.
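
The flow just described can be sketched as a toy simulation in Python. Everything here is hypothetical (the `Node` class, the `write` and `hand_off` functions, and the two node names); it only mirrors the steps in the paragraph above and leaves out failure detection, replication and everything else a real cluster does.

```python
from collections import defaultdict


class Node:
    """Hypothetical node: a name, an up/down flag and a local store."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        # Writes held on behalf of other nodes: target name -> [(key, value)]
        self.hints = defaultdict(list)


def write(target, stand_in, key, value):
    """Write to the target, or leave a hint on the stand-in if it is down."""
    if target.up:
        target.data[key] = value
    else:
        stand_in.hints[target.name].append((key, value))


def hand_off(stand_in, target):
    """Replay the stored hints once the target has rejoined the cluster."""
    for key, value in stand_in.hints.pop(target.name, []):
        target.data[key] = value


node_a, node_b = Node("node-a"), Node("node-b")

node_a.up = False                            # node-a fails
write(node_a, node_b, "rowKey1", "Adrian")   # the write is kept as a hint
node_a.up = True                             # node-a comes back online
hand_off(node_b, node_a)

print(node_a.data)
```

After the hand-off, node-a holds the row it missed and node-b no longer keeps a hint for it.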

Drawbacks

Like everything in this world, there are trade-offs that had to be made for the benefits you enjoy. For Cassandra these include:

  • No joins (trade off for speed)
  • Not able to sort at query time
  • No SQL (If you’ve been using SQL all your life, you might miss it)

Supported language bindings

Cassandra uses the Thrift library (http://thrift.apache.org/) to provide language-specific versions of its API. Thrift supports a large array of languages, including C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, and OCaml. I have also been working on a language-independent client for Cassandra which will allow you to access data using a REST interface.

PHP Hello World 

The fastest way to get started with Cassandra is to use one of the many client libraries available. Libraries can be useful if you don’t like using the raw Thrift interface, or if you need some advanced features such as connection pooling. The library we’ll use for our hello world program is called phpcassa (https://github.com/thobbs/phpcassa), and the code samples build on the tutorial phpcassa provides for new users (http://thobbs.github.com/phpcassa/tutorial.html).

Installing Cassandra is straightforward, and the “Getting started with Cassandra” tutorial (http://wiki.apache.org/cassandra/GettingStarted) walks you through downloading and installing it.

Import the sample schema

Once you have installed and started Cassandra, import the included sample structure by running the following from a terminal (for example, an SSH session):

bin/schematool localhost 8080 import

Create a file, say index.php, and put the following piece of code in it:

[php]
<?php
// Load the phpcassa classes first; the require_once paths depend on
// where phpcassa is installed, so they are not shown here.

// specify an array of servers that could be used
$servers[0] = array('host' => '127.0.0.1', 'port' => 9160);
// open a new connection and use the keyspace named "Keyspace1"
$conn = new Connection('Keyspace1', $servers);
// get a column family object to perform inserts/updates/deletes
$column_family = new ColumnFamily($conn, 'Standard1');
// to insert we use the insert method on the column family object
$column_family->insert('row_key1', array('col_name' => 'Hello world'));
// insert more than one column
$column_family->insert('row_key2', array('name1' => 'Hello world1', 'name2' => 'Hello world 2'));
// insert multiple rows with multiple columns (one array keyed by row key)
$column_family->batch_insert(array(
    'row1' => array('name2' => 'val1', 'name3' => 'val2'),
    'row2' => array('foo' => 'bar')));
// get all columns in a row, up to the column count (default = 100)
$column_family->get('row_key1');
// returns: array('col_name' => 'Hello world')
// get a set of columns you know the names of, i.e. name1 and name2
$column_family->get('row_key2', array('name1', 'name2'));
// returns: array('name1' => 'Hello world1', 'name2' => 'Hello world 2')
// get a slice, i.e. a subset of columns between a start and a finish name
// (assumes a row 'row_key' that has columns named '5', '6' and '7')
$column_family->get('row_key', null, '5', '7');
// returns: array('5' => 'foo', '6' => 'bar', '7' => 'baz')
// get multiple rows
$column_family->multiget(array('row_key1', 'row_key2'));
// returns: array('row_key1' => array('col_name' => 'Hello world'),
//                'row_key2' => array('name1' => 'Hello world1', 'name2' => 'Hello world 2'))
[/php]
This block of code has done quite a lot. The method signatures may be specific to the phpcassa library, but you will be performing similar operations no matter which library you use. The examples cover what are possibly the most common operations.

Hopefully the inline comments made sense, but in short the code first creates an array of servers. Notice that I only specified one; in production this array can be as large as you like, listing many nodes in your cluster for better connection pooling. The code then creates a connection object, which is in turn used to create a column family object. The column family object is what you use in your code to perform CRUD operations. At this point you should have the basics of Cassandra and phpcassa sorted and be ready to start playing with the DB. Enjoy…
