NoSQL databases

NoSQL database types (part 2): key-value, document and columnar

Following the post that compared NoSQL and SQL databases, in this post we will explain the first three types of NoSQL databases: key-value, document and columnar.

1. Key-value

General characteristics, advantages and disadvantages

This is the simplest and most flexible model, based on key-value pairs. The key can be synthetic or self-generated and can take different formats, but in all cases it has to be unique. In a partitioned model, however, the data is divided into buckets, and different buckets can contain the same key, so each element is uniquely identified by the tuple (bucket, key). Values, on the other hand, usually have a simple structure, but can be strings, numbers, JSON or even more complex structures. Usage is based on three basic operations: get (obtain the data associated with a key), put (associate a value with a key) and delete (delete the entry with a specific key).
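As a minimal sketch of the model described above, the following toy in-memory store implements get, put and delete over entries identified by (bucket, key). The class name `KeyValueStore` and the sample buckets are assumptions made for illustration; they do not correspond to any particular product's API.

```python
from typing import Any, Dict, Optional, Tuple


class KeyValueStore:
    """Toy in-memory key-value store partitioned into buckets (illustrative only)."""

    def __init__(self) -> None:
        # Each entry is uniquely identified by the tuple (bucket, key).
        self._data: Dict[Tuple[str, str], Any] = {}

    def put(self, bucket: str, key: str, value: Any) -> None:
        """Associate a value with a key, overwriting any previous value."""
        self._data[(bucket, key)] = value

    def get(self, bucket: str, key: str) -> Optional[Any]:
        """Return the value stored under (bucket, key), or None if absent."""
        return self._data.get((bucket, key))

    def delete(self, bucket: str, key: str) -> bool:
        """Delete the entry; return True if it existed, False otherwise."""
        if (bucket, key) in self._data:
            del self._data[(bucket, key)]
            return True
        return False


store = KeyValueStore()
store.put("sessions", "42", {"user": "ana"})
store.put("carts", "42", ["book", "pen"])  # same key, different bucket
```

The same key "42" lives in two buckets without conflict, because uniqueness is only required for the (bucket, key) pair.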

Its three main advantages are simplicity, efficiency and flexibility, which allow very fast lookups by key across the entire database and, in some implementations, efficient aggregation functions. On the other hand, this simplicity is also the source of its main disadvantages: because the values have no structure that the database understands, it is generally not possible to query by the content of the values, and because the data consists of a single collection, implementing complex models becomes complicated.

Possible applications

  1. Web page caches, where the URL is the key and the content is the value.
  2. Transaction logs, with timestamp as key and content as value.
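The two applications above can be sketched with plain dictionaries standing in for the key-value store. The names `cached_fetch`, `fake_fetch` and `log_transaction` are hypothetical helpers introduced for this example, and the sketch assumes that timestamps are unique enough to serve as keys.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List

# Web page cache: the URL is the key, the page content is the value.
page_cache: Dict[str, str] = {}


def cached_fetch(url: str, fetch: Callable[[str], str]) -> str:
    """Return the cached page for url, calling fetch only on a cache miss."""
    if url not in page_cache:
        page_cache[url] = fetch(url)
    return page_cache[url]


calls: List[str] = []


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request; records every call it receives."""
    calls.append(url)
    return f"<html>content of {url}</html>"


first = cached_fetch("https://example.com/", fake_fetch)
second = cached_fetch("https://example.com/", fake_fetch)  # served from cache

# Transaction log: the timestamp is the key, the content is the value.
transaction_log: Dict[str, str] = {}


def log_transaction(content: str, when: datetime) -> str:
    """Store content under an ISO 8601 timestamp key and return that key."""
    key = when.isoformat()
    transaction_log[key] = content
    return key


log_key = log_transaction(
    "transfer 10 EUR", datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
)
```

In a real system two transactions could share a timestamp, so the key would normally be extended with a sequence number or another unique component.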

Main key-value DBs

Riak KV (AP)

Available under an open source (Apache) licence and an Enterprise licence, it is designed for tracking information related to sessions and users. It supports searches for specific values, secondary indexes and map-reduce operations, and allows automatic deletion of old data. It offers Spark connectors, integration with the Apache Mesos framework and Redis integration. Partitioning is done through sharding.

Redis (CP)

Open source database with in-memory processing. It is useful for caching, user sessions and message monitoring, and has additional modules for data processing, such as search, secondary indexes, transactions or machine learning modelling. It supports partitioning through sharding, as well as master-slave and multi-master replication. Access control is based on user and password.
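Both databases above partition their data through sharding. As a generic sketch of the idea (not the actual algorithm used by Riak KV or Redis), a key can be hashed and the hash reduced modulo the number of shards, so that the same key always lands on the same shard. The function names and the choice of three shards are assumptions for illustration.

```python
import hashlib
from typing import Any, Dict, List, Optional


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Three dictionaries stand in for three separate servers.
shards: List[Dict[str, Any]] = [dict() for _ in range(3)]


def put(key: str, value: Any) -> None:
    """Store the value on the shard that owns the key."""
    shards[shard_for(key, len(shards))][key] = value


def get(key: str) -> Optional[Any]:
    """Read the value from the shard that owns the key."""
    return shards[shard_for(key, len(shards))].get(key)


put("user:1", "ana")
put("user:2", "luis")
```

A plain modulo scheme forces most keys to move when the number of shards changes, which is why production systems tend to use more elaborate schemes such as consistent hashing or fixed hash slots.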

2. Document

General characteristics, advantages and disadvantages

They are derived from key-value DBs, but allow a higher level of complexity through the use of metadata. The unit of data organisation is the document, which is made up of a series of key-value pairs whose values can take different formats. Each document has a unique ID to facilitate indexing, and often follows a pre-defined schema, although this schema is flexible. Documents are grouped into collections, whose documents usually share a similar schema. In a SQL analogy, collections are equivalent to tables, and documents to rows. These databases generally use one of two formats, JSON or XML, with JSON being the most common.
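A minimal sketch of this structure, assuming a collection can be modelled as a dictionary of JSON-like documents keyed by a generated ID. The `insert` helper and the `_id` field name are illustrative choices, not the API of a specific database.

```python
import json
import uuid
from typing import Any, Dict

# A collection maps each document's unique ID to the document itself.
Collection = Dict[str, Dict[str, Any]]


def insert(collection: Collection, document: Dict[str, Any]) -> str:
    """Store a copy of the document under a newly generated unique ID."""
    doc_id = str(uuid.uuid4())
    collection[doc_id] = {**document, "_id": doc_id}
    return doc_id


customers: Collection = {}

# Documents in the same collection share a similar, but not identical, schema.
ana_id = insert(customers, {"name": "Ana", "email": "ana@example.com"})
luis_id = insert(
    customers,
    {"name": "Luis", "phones": ["555-0100"], "address": {"city": "Madrid"}},
)

# Each document serialises naturally to JSON.
luis_json = json.dumps(customers[luis_id], indent=2)
```

The second document carries nested and list-valued fields that the first one lacks, which is the flexibility of the schema in practice.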

The main advantage of these databases is their organisation. By having predefined structures, many vendors have implemented SQL-like languages for querying, in some cases even allowing the use of joins between collections. Moreover, thanks to the structured nature of the documents and the use of indexes, these databases respond well to queries and filtering and aggregation operations.
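To illustrate why indexes help with filtering and aggregation, the sketch below builds a simple secondary index over one field and uses it to aggregate without scanning every document. The sample product documents and the `build_index` helper are assumptions for this example.

```python
from collections import defaultdict
from typing import Any, Dict, List

products: List[Dict[str, Any]] = [
    {"_id": "p1", "category": "tools", "price": 12.5},
    {"_id": "p2", "category": "tools", "price": 30.0},
    {"_id": "p3", "category": "garden", "price": 8.0, "colour": "green"},
]


def build_index(documents: List[Dict[str, Any]], field: str) -> Dict[Any, List[str]]:
    """Map each value of field to the IDs of the documents that contain it."""
    index: Dict[Any, List[str]] = defaultdict(list)
    for doc in documents:
        if field in doc:  # documents lacking the field are simply not indexed
            index[doc[field]].append(doc["_id"])
    return index


by_id = {doc["_id"]: doc for doc in products}
by_category = build_index(products, "category")

# Filter through the index, then aggregate over the matching documents only.
tools = [by_id[doc_id] for doc_id in by_category["tools"]]
total_tools_price = sum(doc["price"] for doc in tools)
```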

On the other hand, and especially when comparing these databases with SQL, the use of a flexible schema makes them prone to data entry errors, establishing the need to implement sanitisation and data cleansing procedures.
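A sanitisation step of this kind can be sketched as a validation function run before a document is stored. The required fields and types below are hypothetical and would depend on the application.

```python
from typing import Any, Dict, List

# Hypothetical minimal schema: field name -> expected Python type.
REQUIRED_FIELDS = {"sensor_id": str, "temperature": float}


def validate(document: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the document is valid."""
    problems: List[str] = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in document:
            problems.append(f"missing field: {field}")
        elif not isinstance(document[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    return problems


good = {"sensor_id": "s-1", "temperature": 21.5}
bad = {"sensor_id": "s-2", "temperture": "21.5"}  # misspelled field name

good_problems = validate(good)
bad_problems = validate(bad)
```

A flexible schema would accept the misspelled document silently, so a check like this is what catches the error at write time instead of at query time.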

Possible applications

  1. Sensor data from different manufacturers
  2. Customer files with different characteristics
  3. Inventory catalogues of products for a shop or a factory

Main document databases

MongoDB (CP)

One of the most widely used on the market. Open source, with data stored in BSON (binary JSON) format. It allows secondary indexes, partitioning through sharding and replication through master-slave systems. Newer versions allow joins between collections, ad hoc queries and the use of two frameworks for parallel processing: map-reduce and the aggregation framework.

CouchDB (AP)

Open source database that uses JSON natively, although it also allows binary formats. It specialises in master-master replication across different devices and platforms, with variants for web browsers (PouchDB) and for iOS and Android systems (Couchbase Lite). It is accessed over HTTP. Thanks to this replication system, it is well suited to platforms that normally work offline. Partitioning is done through sharding.

Couchbase (AP)

Derived from CouchDB with an integrated memcached-style cache, it is also a document database based on JSON documents. It is defined as an "engagement database", where high accessibility on different types of devices and apps is a priority. As with CouchDB, it offers master-master replication as well as partitioning via sharding.

MarkLogic (CP)

Cross-platform database based mainly on XML and JSON documents. It is a commercial, paid product. It supports ACID transactions and role-based security at document and sub-document level. Partitioning is done through sharding, and map-reduce routines can be applied.

3. Columnar

General characteristics, advantages and disadvantages

Conceptually, they are the model most similar to SQL databases (along with document databases), as the data follows a row and column structure. However, unlike SQL databases, which organise their data in rows, they group cells by column, so that each column is a tuple of values (one per row). Although map-reduce routines were later extended to other NoSQL formats, they were designed with this type of DB in mind, so if our queries are based on this paradigm, this option will be the best suited.
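The difference between the two layouts can be sketched as follows, assuming a small table of sensor readings in which every row has the same fields. The data and the `to_columns` helper are illustrative only.

```python
from typing import Any, Dict, List

# Row-oriented layout: each record keeps all of its fields together.
rows: List[Dict[str, Any]] = [
    {"sensor": "a", "temp": 20.0, "humidity": 40},
    {"sensor": "b", "temp": 22.0, "humidity": 45},
    {"sensor": "c", "temp": 21.0, "humidity": 50},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Regroup records so that each column holds one value per row, in row order."""
    columns: Dict[str, List[Any]] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented layout: each column is a sequence of values.
columns = to_columns(rows)

# An aggregate over one column touches only that column's values.
mean_temp = sum(columns["temp"]) / len(columns["temp"])
```

With the column layout, computing the mean temperature never reads the sensor or humidity values, which is why this structure favours aggregations over whole tables.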

As advantages, columnar databases offer a conceptually simple, yet still flexible, schema that allows the use of SQL-like languages for queries. Their columnar structure favours queries that require full table reads, such as data extraction and aggregation, and these queries can be streamlined through map-reduce routines. They also allow the use of joins, which are more effective than in SQL (although these databases are still not optimised for them). The main disadvantage is that they allow unstructured data, and the resulting inconsistencies can be problematic when performing operations and queries. In addition, they are generally designed as persistent databases that perform reads over the entire database, so they are not optimised for real-time queries (although they can be very effective in establishing transactions given their similarity to SQL).
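The map-reduce paradigm mentioned above can be sketched in three steps: a map step that emits (key, value) pairs, a grouping step that collects values by key, and a reduce step that combines them. This single-process example only illustrates the paradigm, with hypothetical sensor readings; a real database would distribute the steps across nodes.

```python
from collections import defaultdict
from functools import reduce
from typing import Dict, List, Tuple

readings: List[Tuple[str, float]] = [
    ("a", 20.0),
    ("b", 22.0),
    ("a", 24.0),
    ("b", 18.0),
]


def map_reading(record: Tuple[str, float]) -> Tuple[str, Tuple[float, int]]:
    """Map step: emit the sensor as key and a (sum, count) pair as value."""
    sensor, temp = record
    return sensor, (temp, 1)


def reduce_pairs(
    left: Tuple[float, int], right: Tuple[float, int]
) -> Tuple[float, int]:
    """Reduce step: combine two partial (sum, count) pairs into one."""
    return left[0] + right[0], left[1] + right[1]


# Group step: collect all emitted values under their key.
grouped: Dict[str, List[Tuple[float, int]]] = defaultdict(list)
for key, value in map(map_reading, readings):
    grouped[key].append(value)

totals = {key: reduce(reduce_pairs, values) for key, values in grouped.items()}
means = {key: total / count for key, (total, count) in totals.items()}
```

Emitting (sum, count) pairs instead of raw values keeps the reduce step associative, so partial results computed on different nodes can be merged in any order.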

Possible applications

  1. Product catalogues with predefined characteristics
  2. Homogeneous sensor data with high sampling rates
  3. Messaging applications

Main columnar DBs

Cassandra (AP)

Open source under the Apache licence. Its main asset is robust and flexible scalability, with continuous availability and object-level security. It allows the application of map-reduce routines and is easy to query using a SQL-like language (CQL).

HBase (CP)

Open source under the Apache licence, it runs on top of HDFS, the Hadoop file system. Like Cassandra, its strength is scalability, with partitioning through sharding and replication across region servers. It also allows the use of map-reduce routines.

In the next post, we will look at three more types of NoSQL databases: network, time series and content repositories.



