Since their development began in the early 1970s, SQL databases have become a standard in the field of data storage. However, the growth in stored volumes and the need to run real-time queries on large amounts of data have gradually led to the emergence of new storage models, known as NoSQL. Because SQL databases are already widespread and well established in the industry, this document focuses on presenting the other type of database: NoSQL models.
1. Comparison of NoSQL vs SQL. The CAP theorem
Before going into their characteristics, it is important to introduce another concept: the CAP theorem. The theorem concerns three pillars of a database: consistency, availability and partition tolerance.
- Consistency: every query returns the same data regardless of which node answers it, and in case of an error the system prefers to return nothing rather than something wrong. This matters in databases where transactions are critical, such as those used by banks.
- Availability: every query returns data despite possible errors in the system, although the data may not be consistent (two different clients may receive different outputs).
- Partition tolerance: data can be stored in a distributed manner on different nodes, and the system keeps working even if communication between those nodes fails.
Although these are the three pillars of a database, it is not possible to fully guarantee all three at the same time. In every case, two of the pillars have to be prioritised over the third. That is why the theorem is usually represented as a triangle, the CAP triangle, in which each side joins two of the properties. Which side to choose depends on our needs.
SQL databases sit on the CA side: they provide consistency and availability. However, this is achieved by keeping all the data on the same machine, which makes them intolerant to partitioning and therefore not horizontally scalable. Databases that require transactions typically sit on this side.
NoSQL databases are designed to be horizontally scalable, and therefore possess property P. Thus, if horizontal scalability is a necessity, we will have to opt for a CP or an AP database, depending on which other property we need.
CP databases prioritise consistency, so the answer is always the same whichever node is queried. This is achieved by blocking queries when two operations arrive simultaneously or when there is a failure (until that failure has been fixed), which carries the risk that a query returns no result. These solutions are ideal for banks or companies where economic transactions take place, as they prioritise the correctness of the transaction even at the risk that, in case of failure, the transaction does not take place.
AP databases, on the other hand, prioritise availability. In case of failure (or if the primary node is busy), the system answers from one of the replicas on the secondary nodes. They therefore always return results, but we run the risk that the failure (or a simultaneous operation) has left the replicas out of sync, so that two queries give different results. These solutions are ideal when we need an answer to every query at all times, even if the information is not 100% accurate, so they suit social networks or systems that require real-time queries.
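The difference between the two behaviours can be sketched with a toy model. The code below is only an illustration under strong simplifying assumptions: `ToyStore`, `Unavailable` and `demo` are hypothetical names invented for this example, there are just two replicas, and the CP mode simply refuses every operation while the replicas cannot communicate. Real databases use far more elaborate protocols.

```python
class Unavailable(Exception):
    """Raised by the CP store when it refuses to answer."""


class ToyStore:
    """Two replicas of the same data; `mode` is either "CP" or "AP"."""

    def __init__(self, mode):
        self.mode = mode
        self.primary = {}
        self.secondary = {}
        self.partitioned = False  # True when the replicas cannot communicate

    def write(self, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the write rather than let the replicas diverge.
                raise Unavailable("cannot replicate during a partition")
            # AP: accept the write on the primary only; the secondary goes stale.
            self.primary[key] = value
            return
        # Normal operation: both replicas receive the write.
        self.primary[key] = value
        self.secondary[key] = value

    def read(self, key, from_secondary=False):
        if self.partitioned and self.mode == "CP":
            # Simplification: the CP store blocks reads too until the failure is fixed.
            raise Unavailable("cannot guarantee a consistent answer")
        replica = self.secondary if from_secondary else self.primary
        return replica.get(key)


def demo(mode):
    """Write, simulate a failure, write again, then read from both replicas."""
    store = ToyStore(mode)
    store.write("balance", 100)
    store.partitioned = True
    try:
        store.write("balance", 50)
    except Unavailable:
        return "refused"
    return store.read("balance"), store.read("balance", from_secondary=True)


print(demo("CP"))  # refused
print(demo("AP"))  # (50, 100)
```

In CP mode the second write is rejected, so no client can ever see two different balances. In AP mode the write succeeds, but a client reading from the secondary replica still sees the old value.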
2. NoSQL databases. General characteristics
As discussed in the previous section, NoSQL databases were created to fill the gap left by SQL in the CAP triangle: the possibility of partitioning data and offering horizontal scalability. In addition, the rigid, predefined schema of SQL databases, although functionally very useful, becomes a problem when the data is heterogeneous and does not fit a predefined schema. As a result, NoSQL databases build their properties around these two factors: horizontal scalability and flexibility. These properties can be summarised in the following points:
- Flexible schemas: the table structure is dropped, different hierarchical levels can be established, and even different types of data can be stored in the same field.
- No predefined schemas: it is not necessary to define the database schema in advance, and new fields can be added a posteriori. Even so, it is generally advisable to start from a defined structure, even if it is changed later (this is known as semi-structured data).
- Horizontal scalability: they can operate in clusters. It is no longer necessary to have all the data on a single machine, so when more space or processing capacity is required, a new machine can be added. These databases are generally designed to make this process simple and dynamic, without the need for restructuring or migration.
- Replicability and high availability: by operating in clusters, it is possible to keep multiple copies of the same data on multiple computers, so that if one computer fails, the others automatically take over, ideally without loss of data or performance.
- Partitioning: one of the biggest problems with storing large amounts of data is that an entire database cannot be stored as a single unit on one disk because of its sheer size. Partitioning means that a dataset can be distributed over several computers, eliminating the need for a single disk with the capacity to store the whole dataset.
- Query speed: as databases grow in size, SQL query times increase, especially when joins are required. Different data structures and clustered architectures reduce query times when the amount of data handled is large.
- Processing speed: for operations over the entire database, many NoSQL databases have built-in parallel processing engines, which reduce read and write times on large tables.
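The first two points of the list, flexible and non-predefined schemas, can be illustrated with plain Python dictionaries standing in for documents. This is only a sketch: the `users` collection, its fields and the `cities` helper are hypothetical, and no real database is involved.

```python
# A hypothetical "users" collection: each document is a plain dict,
# and no schema is declared anywhere.
users = [
    {"name": "Ana", "email": "ana@example.com"},
    {"name": "Luis", "phones": ["555-0101", "555-0102"]},             # different fields
    {"name": "Marta", "address": {"city": "Madrid", "zip": "28001"}},  # nested level
    {"name": "Pablo", "address": "unknown"},                          # same field, other type
]


def cities(docs):
    """Return the city of every document that has a nested address."""
    result = []
    for doc in docs:
        address = doc.get("address")
        # Readers of flexible data must cope with missing or differently typed fields.
        if isinstance(address, dict) and "city" in address:
            result.append(address["city"])
    return result


# A new field can be added a posteriori to one document only.
users[0]["signup_year"] = 2021

print(cities(users))  # ['Madrid']
```

The flexibility has a cost: because nothing enforces a structure, the code that reads the data has to check what each document actually contains, which is one reason why starting from an agreed structure is still advisable.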
It is important to note that, although NoSQL databases generally revolve around these properties, not every NoSQL database fulfils all of them, nor are they all equally effective at each one. In fact, NoSQL databases are divided into several types, each oriented towards different priorities according to different needs. We will go into more detail on these aspects in the following sections.
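As one example of how the partitioning and replication properties listed above can work together, the sketch below distributes keys over several nodes by hashing and keeps a copy of each key on the next node. `ToyCluster` and `shard_for` are hypothetical names, the nodes are simulated with in-memory dictionaries, and the hash-modulo placement is an assumption chosen for brevity rather than the scheme of any particular database.

```python
import hashlib


def shard_for(key, num_nodes):
    """Map a key to a node index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


class ToyCluster:
    """In-memory stand-in for a cluster: one dict per node."""

    def __init__(self, num_nodes, replicas=2):
        if not 1 <= replicas <= num_nodes:
            raise ValueError("replicas must be between 1 and num_nodes")
        self.nodes = [{} for _ in range(num_nodes)]
        self.alive = [True] * num_nodes
        self.replicas = replicas

    def owners(self, key):
        """Nodes holding the key: the hashed node plus the following ones."""
        first = shard_for(key, len(self.nodes))
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        # Simplification: writes always reach every owner.
        for node in self.owners(key):
            self.nodes[node][key] = value

    def get(self, key):
        # Answer from the first owner that is still running.
        for node in self.owners(key):
            if self.alive[node]:
                return self.nodes[node].get(key)
        raise RuntimeError("all replicas for this key are down")


cluster = ToyCluster(num_nodes=3, replicas=2)
cluster.put("user:1", {"name": "Ana"})
cluster.alive[cluster.owners("user:1")[0]] = False  # simulate a node failure
print(cluster.get("user:1"))  # {'name': 'Ana'}
```

No single node holds the whole dataset, and the read still succeeds after one node fails. Note that with a plain modulo, adding a node changes the placement of most keys; systems that aim to add machines without migration typically rely on other placement schemes, such as consistent hashing.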
3. Types of NoSQL databases. Division by groups
- Time series.
- Content repositories.