Lately, I’ve been observing the emergence of radically innovative approaches to data persistence. This article starts with a quick summary of the now-classic NoSQL domain, then introduces the emerging event-based persistence approach.
NoSQL databases have now been talked about, and maturing through field experience, for several years — sometimes with difficulty. The concept of “polyglot persistence” is spreading. This term, coined by Martin Fowler, means that several types of databases are used within the same piece of software, each one fitting a particular need.
I use that approach myself with ZenModeler (the Neo4j graph database, the MongoDB document database, a classical relational database — and I am watching event-based databases with strong interest). Each database type fulfills a precise need. In my view, encapsulating data access behind a Repository object is essential: it strongly helps separate business logic from persistence concerns, and it hides the persistence machinery behind a clean business interface.
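To make the Repository idea concrete, here is a minimal sketch in Python. The names (`Cart`, `CartRepository`, `InMemoryCartRepository`) are hypothetical, chosen for this article’s shopping-cart example; the point is that business code depends only on the abstract interface, while the storage technology stays swappable behind it.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional

@dataclass
class Cart:
    cart_id: str
    items: dict  # product id -> quantity

class CartRepository(ABC):
    """Business-facing interface: callers never see the storage technology."""
    @abstractmethod
    def find_by_id(self, cart_id: str) -> Optional[Cart]: ...
    @abstractmethod
    def save(self, cart: Cart) -> None: ...

class InMemoryCartRepository(CartRepository):
    """One concrete implementation; a MongoDB- or Neo4j-backed one
    would plug in behind the same interface."""
    def __init__(self):
        self._store = {}
    def find_by_id(self, cart_id):
        return self._store.get(cart_id)
    def save(self, cart):
        self._store[cart.cart_id] = cart
```

Swapping the persistence technology then means writing a new subclass, not touching the business logic.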
In short, the typology of NoSQL databases is:
- Column-oriented databases: store data by columns rather than by rows, allow complex and efficient queries, and are useful for read-heavy processing.
- Key-value databases: need more effort to query, but are very easy to scale; typical databases for parallel processing such as Map-Reduce. There are many players here, like Cassandra, Riak, Dynamo, and the French team at Bureau 14 with wrpme.
- Document-oriented databases: store documents and access them by key, the current trend being to store documents as JSON, as with MongoDB (if you plan to use MongoDB, you should check the “MongoDB gotchas”). The DDD Aggregate pattern is very useful with this kind of storage.
- Graph databases: use the concepts of nodes and edges, each holding properties (typical example: Neo4j).
- Object databases or distributed caches: I feel nostalgic about this one, as I worked with the ancestor Versant in 1999 during one of my internships :-). Nowadays, most object-oriented databases have shifted to serving as distributed caches in front of relational databases.
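To illustrate why the DDD Aggregate pattern fits document stores so well, here is a small sketch. The order structure is a made-up example for this article, and `json` stands in for whatever serialization a real document store performs: the aggregate root and everything it owns travel together as one document under one key.

```python
import json

# An order aggregate: the root plus the line items it owns,
# persisted as a single document rather than as joined rows.
order = {
    "_id": "order-42",
    "customer": "alice",
    "lines": [
        {"product": "book", "qty": 2, "unit_price": 15.0},
        {"product": "pen",  "qty": 1, "unit_price": 2.5},
    ],
}

doc = json.dumps(order)      # what a document store would persist under key "order-42"
restored = json.loads(doc)   # loading the whole aggregate is a single key lookup
total = sum(l["qty"] * l["unit_price"] for l in restored["lines"])  # 32.5
```

The aggregate boundary tells you what belongs inside the document and what should be a reference to another document.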
That’s it for the technical aspects. There are big differences in the data structures each of them uses, ranging from the narrower to the wider: maps, trees (such as JSON or XML), or graphs. Keeping in mind that the wider subsumes the narrower, a graph database can represent a tree, and each node or relationship can also store a map. In substance, though, I see no real difference, because none of these tools embodies a paradigm shift in its approach to persistence…
Data Persistence Fundamentals
Let’s get back to the fundamentals of what we store: data that describe reality!
A datum is the relationship formed by a concept and its measure. A measure is characterized by a type: either quantitative (weight, amount, age) or qualitative (name, city, date). Composing several data into a structure leads to the description of a fact.
For example, a user browses an e-commerce website and interacts with it: she adds some products to her cart and removes others; once she has finalized her purchases, she validates her order and then pays. Each of the user’s actions is processed by the system. Overall, the latter guarantees the ACID properties for her actions: atomicity, consistency, isolation, and durability. The last point, durability, is essential in case any link of that extremely complex chain fails. Durability is guaranteed by redundancy mechanisms and by writing to media that “persist” through a power outage (sorry, I needed to get back to basics before moving on :-)).
There are now two ways to analyze the user’s actions:
- From the cart’s point of view: it changes along with the user’s actions (add a product, remove one, modify its quantity, recalculate the total, etc.). The cart is always in a coherent state characterized by its identifier, the product list with quantities, and the total amount. This is more or less the “objects and states” vision of the OO paradigm.
- From the user’s point of view: the user performs a sequence of actions (add, remove, change quantity, etc.). Of course, each action is received by the cart, which performs the necessary state transitions. This vision is more functionally oriented: events trigger functions that make the state transitions. The whole process is deterministic and the events are, by definition, immutable… so it is more or less the “function” vision of the FP paradigm.
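The two points of view above can be sketched side by side. This is a minimal illustration with hypothetical names: the OO view mutates a cart in place, while the FP view keeps the user’s actions as immutable events and derives the state as a left fold over that sequence.

```python
from functools import reduce

# OO view: the cart mutates in place; only the latest state exists.
class StatefulCart:
    def __init__(self):
        self.items = {}          # product -> quantity
    def add(self, product, qty=1):
        self.items[product] = self.items.get(product, 0) + qty
    def remove(self, product):
        self.items.pop(product, None)

# FP view: immutable events; state = fold(apply, events, initial).
def apply(items, event):
    kind, product, qty = event
    items = dict(items)          # never mutate: each event yields a new state
    if kind == "added":
        items[product] = items.get(product, 0) + qty
    elif kind == "removed":
        items.pop(product, None)
    return items

events = [("added", "book", 2), ("added", "pen", 1), ("removed", "pen", 0)]
state = reduce(apply, events, {})   # {"book": 2}
```

Both views end with the same cart contents; the difference is that the event sequence retains the full history while the mutable object keeps only the result.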
Event Oriented Persistence
An event describes a change that occurred in the past. Each event triggers an action that modifies the state of the object receiving it, through a state transition. The receiving object thus ends up in a known state, determined by the transitions performed in the functions associated with each event. Arriving in a specific state can in turn trigger another internal event.
For example: adding 3 products at $100 each results in a cart total of $300.
If these events are replayed from the initial state in the same order, the final state is deterministically identical. Moreover, if every event can be associated with an inverse that compensates for it, the initial state can be recovered from the final state by replaying the compensating events in reverse order.
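The replay-and-compensate idea can be shown with the $300 cart example above. This is a toy sketch (the event shape and `invert` helper are made up for illustration): replaying the events forward rebuilds the total, and replaying their inverses backward recovers the starting state.

```python
def apply(balance, event):
    """Fold one event into the running cart total."""
    kind, amount = event
    return balance + amount if kind == "added" else balance - amount

def invert(event):
    """Build the compensating event for a given event."""
    kind, amount = event
    return ("removed", amount) if kind == "added" else ("added", amount)

events = [("added", 100), ("added", 100), ("added", 100)]

# Replaying from the initial state always yields the same final state.
final = 0
for e in events:
    final = apply(final, e)            # 300, deterministically

# Replaying the compensating events in reverse order undoes everything.
start = final
for e in reversed(events):
    start = apply(start, invert(e))    # back to 0
```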
Events are an implicit way of measuring passing time. The order of the events is a relative measure of time; the instant at which an event occurred, as read on our clocks, is an absolute measure of time.
The mutable objects, those performing state transitions, are the ones whose life cycles interest us. The DDD approach calls such objects “Entities” rather than “Value Objects” (see this post about identity and state).
The following diagram describes the relationship between the identity of the entity, its states, and the events it received.
Events are the “atoms” that can be stored in order to retrieve our object’s state at any point in time, instead of the traditional way of persisting only the “head” — the last known state of the object.
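A minimal sketch of what such storage looks like, with hypothetical names (`EventStore`, `state_at`) — this is not the API of any real product, only the idea: an append-only log per entity, where the state at any version is rebuilt by replaying a prefix of the log.

```python
from collections import defaultdict

class EventStore:
    """Append-only event log per entity; any past state can be
    rebuilt by replaying a prefix of the log."""
    def __init__(self):
        self._streams = defaultdict(list)

    def append(self, entity_id, event):
        self._streams[entity_id].append(event)

    def state_at(self, entity_id, version, apply_fn, initial):
        state = initial
        for event in self._streams[entity_id][:version]:
            state = apply_fn(state, event)
        return state

store = EventStore()
for amount in (100, 100, 100):
    store.append("cart-1", ("added", amount))

def apply(balance, event):
    kind, amount = event
    return balance + amount if kind == "added" else balance - amount

head = store.state_at("cart-1", 3, apply, 0)     # the "head": 300
earlier = store.state_at("cart-1", 2, apply, 0)  # the state one event earlier: 200
```

With a head-only store, `earlier` would simply not exist; keeping the events makes every intermediate state recoverable.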
In practice, the databases I have in mind while writing this article are Datomic and EventStore. EDIT: Rich Hickey just posted an article on InfoQ describing the design of Datomic’s architecture.
In conclusion, a fundamental paradigm shift is at work with new-kid-on-the-block databases like Datomic, as they force us to think about persistence in a completely new way: storing the state transitions represented by events, not merely the last known data. The approaches related to this event-oriented way of storing data go by names such as Event Sourcing, CQRS, etc.
My next article on this persistence topic will be about the blurring boundary between fast memory and mass memory brought by the memristor, and its consequences for our software designs.