Databases

At the heart of many web applications is some form of database, but the choice of what type to use isn't at all clear. A lot of modern systems like to use NoSQL, claiming that it's more scalable, but why do the say that? It turns out that different databases make different decisions about the tradeoffs involved, in practice they all lie on a continuum between scalability and consistency.

The Scalability/Consistency Continuum

In theoretical computer science, there's something called the CAP Theorem. The formal language is a bit difficult to digest, but the up-shot is that it's only possible for a database to achieve two out of three of the following:

Consistency: Every read receives the most recent write or an error
Availability: Every request receives a non-error response
Partition Tolerance: The system is tolerant of networking issues between machines that are part of the database

In practice, networking issues are impossible to prevent, so databases are forced to choose between consistency and availability.

Race Conditions

To understand these concepts, let's consider logic to increment a number. More specifically, let's say these records are already in the system:

Value
1
2
3
4

Now we want to write the next value into the database. To accomplish this, you have a query to identify the largest value, in this case 4, and insert a record that is 1 more than that value.

Now suppose this same code is called twice at exactly the same time, we'll call these processes A and B. This is a fairly classic example of a race condition, and before dealing with how it relates to databases let me illustrate how it can go wrong. Consider the following sequence of events:

A gets the largest value in the database (a = 4)
B gets the largest value in the database (b = 4)
A writes a + 1 to the database (5)
B writes b + 1 to the database (5)

As you can see, the two processes step on each others toes and both end up writing the same value, leaving the final database in this form:

Value
1
2
3
4
5
5

Consistency

A consistent database is one where the situation I outlined above is impossible. It provides the guarantee that all data is up to date, typically through a mechanism known as a transaction.

A full understanding of transactions is outside the scope of this class, but this is how they would behave in our example above:

A opens a transaction on the table
A gets the largest value in the database (a = 4)
B opens a transaction on the table
B has to wait because another transaction is already open
A writes a + 1 to the database (5)
A closes the transaction
B gets the largest value in the database (b = 5)
B writes b + 1 to the database (6)
B closes the transaction

Most databases are actually doing much more complicated things, but this sequence can serve as a reasonable mental model when thinking about them. The transaction forces the second process to wait until the first completes to be able to complete its task.

Availability

At this point, consistency probably sounds amazing, and in fact that quality dominated the industry for a long time. The problem, though, arises when the system served by the database begins to grow.

Ideally, as the load begins to climb on the server hosting the database, you'd be able to introduce a new host and balance the load between them. Let's return to our example, now adding in servers X and Y:

A opens a transaction on the table through X
A gets the largest value in the database (a = 4)
B opens a transaction on the table through Y
Y communicates with X to determine that a transaction is already open, so B has to wait
A writes a + 1 to the database (5)
A closes the transaction
B gets the largest value in the database (b = 5)
B closes the transaction
B writes b + 1 to the database (6)

As you can see, to maintain consistency communication must occur between the two servers. Maybe this doesn't sound so bad, but now consider what happens when you have 1000 machines. Whenever a transaction is opened on any one of them, it must somehow be aware of transactions on all 999 other machines to continue to provide this guarantee.

It's debatable if this assurance is even possible at such a large scale, but even if it were the amount of time necessary to check completely would be astronomical. During this time, the host would be unavailable, and the only way to provide availability at this scale is to sacrifice this consistency.

Horizontal vs Vertical Scaling

Because consistency dominated the industry for so long, the only reasonable way of handling more load as systems scaled was to upgrade to bigger and bigger machines. This is also known as vertical scaling, and leads to individual extremely expensive machines.

Systems that prefer availability give us a different option. Instead of buying bigger and bigger machines it's possible to buy lots of cheap ones. This is what's known as horizontal scaling, and is what happens at large copanies.

Types of Databases

This tension between consistency and scalability means different types of databases are appropriate in different situations. When considering what database to use, you should think about these tradeoffs and which direction is better for the situation.

Relational Databases

When you don't think things will scale to millions of transactions, you can't beat the reliability of a traditional database. These systems are rock solid and you never have to worry about issues resulting from inconsistent state.

Typical options here include Oracle and PostgreSQL. There are a ton of different implementations, though, and they all conform to a common language called SQL. Most will build extensions to the language that allow more complex queries, but if you limit use of those special forms it's not an inordinate amount of effort to move between databases.

NoSQL

If you think the system is going to need to eventually handle a massive amount of traffic, it's worth considering a NoSQL database. Unlike relational databases, these will all look extremely different, and each has its own specific query language. Popular choices in this category include Redis, Mongo and Cassandra, each providing different performance tradeoffs.

One term that's thrown around in these communities is eventual consistency. All this really means is that, even though they can't provide full consistency, the database eventually becomes consistent through replicating the data across machines.

There are also some more specialized databases out there that fall under the umbrella of NoSQL. These include things like graph and object databases, and generally serve more specialized purposes.

Testing

When it comes to databases, testing can get fairly complicated. If you use the database that runs in production during tests, it can incur substantial performance penalties that make it difficult to run your tests frequently. In addition, getting a database set up in a Continuous Integration system can be relatively complicated.

To combat this, I'll be adding some facilities to the codebase which allow you to isolate the database interactions. You'll still need to design the queries, but writing unit tests to ensure the queries work correctly is going to be too complex for this course. This concept will be covered more thoroughly in the assignment.