The NoSQL movement continues to gain momentum as developers continue to grow weary of traditional SQL based database management and look for advancements in storage technology. A recent article provided a great roundup of some of the great new technologies in this area, particularly focusing on the different approaches to replication and partitioning. There are excellent new technologies available, but using a NoSQL database is not just a straight substitute for a SQL server. NoSQL changes the rules in many ways, and using a NoSQL database is best accompanied by a corresponding change in application architecture.
The NoSQL database approach is characterized by a move away from the complexity of SQL based servers. The logic of validation, access control, mapping querieable indexed data, correlating related data, conflict resolution, maintaining integrity constraints, and triggered procedures is moved out of the database layer. This enables NoSQL database engines to focus on exceptional performance and scalability. Of course, these fundamental data concerns of an application don’t go away, but rather must move to a programmatic layer. One of the key advantages of the NoSQL-driven architecture is that this logic can now be codified in our own familiar, powerful, flexible turing-complete programming languages, rather than relying on the vast assortment of complex APIs and languages in a SQL server (data column, definitions, queries, stored procedures, etc).
In this article, we’ll explore the different aspects of data management and suggest an architecture that uses a data management tier on top of NoSQL databases, where this tier focuses on the concerns of handling and managing data like validation, relation correlation, and integrity maintenance. Further, I believe this architecture also suggests a more user-interface-focused lightweight version of the model-viewer-controller (MVC) for the next tier. I then want to demonstrate how the Persevere 2.0 framework is well suited to be a data management layer on top of NoSQL databases. Lets look at the different aspects of databases and how NoSQL engines affect our handling of data and architecture.
Architecture with NoSQL
In order to understand how to properly architect applications with NoSQL databases you must understand the separation of concerns between data management and data storage. The past era of SQL based databases attempted to satisfy both concerns with databases. This is very difficult, and inevitably applications would take on part of the task of data management, providing certain validation tasks and adding modeling logic. One of the key concepts of the NoSQL movement is to have DBs focus on the task of high-performance scalable data storage, and provide low-level access to a data management layer in a way that allows data management tasks to be conveniently written in the programming language of choice rather than having data management logic spread across Turing-complete application languages, SQL, and sometimes even DB-specific stored procedure languages.
Complex Data Structures
One important capability that most NoSQL databases provide is hierarchical nested structures in data entities. Hierarchical data and data with list type structures are easily described with JSON and other formats used by NoSQL databases, where multiple tables with relations would be necessary in traditional SQL databases to describe these data structures. Furthermore, JSON (or alternatives) provide a format that much more closely matches the common programming languages data structure, greatly simplifying object mapping. The ability to easily store object-style structures without impedance mismatch is a big attractant of NoSQL.
Nested data structures work elegantly in situations where the children/substructures are always accessed from within a parent document. Object oriented and RDF databases also work well with data structures that are uni-directional, one object is accessed from another, but not vice versa. However, if the data entities may need to be individually accessed and updated or relations are bi-directional, real relations become necessary. For example, if we had a database of employees and employers, we could easily envision scenarios where we would start with an employee and want to find their employer, or start with an employer and find all their employees. It may also be desirable to individually update an employee or employer without having to worry about updating all the related entities.
In some situations, nested structures can eliminate unnecessary bi-directional relations and greatly simplify database design, but there are still critical parts of real applications where relations are essential.
Handling Relational Data
The NoSQL style databases has often been termed non-relational databases. This is an unfortunate term. These databases can certainly be used with data that has relations, which is actually extremely important. In fact, real data almost always has relations. Truly non-relational data management would be virtually worthless. Understanding how to deal with relations has not always been well-addressed by NoSQL discussions and is perhaps one of the most important issues for real application development on top of NoSQL databases.
The handling of relations with traditional RDBMSs is very well understood. Table structures are defined by data normalization, and data is retrieved through SQL queries that often make extensive use of joins to leverage the relations of data to aggregate information from multiple normalized tables. The benefits of normalization are also clear. How then do we model relations and utilize them with NoSQL databases?
There are a couple approaches. First, we can retain normalization strategies and avoid any duplication of data. Alternately, we can choose to de-normalize data which can have benefits for improved query performance.
With normalized data we can preserve key invariants, making it easy to maintain consistent data, without having to worry about keeping duplicated data in sync. However, normalization can often push the burden of effort on to queries to aggregate information from multiple records and can often incur substantial performance costs. Substantial effort has been put into providing high-performance JOINs in RDBMSs to provide optimally efficient access to normalized data. However, in the NoSQL world, most DBs do not provide any ad-hoc JOIN type of query functionality. Consequently, to perform a query that aggregates information across tables often requires application level iteration, or creative use of map-reduce functions. Queries that utilize joining for filtering across different mutable records often cannot be properly addressed with map-reduce functions, and must use application level iteration.
NoSQL advocates might suggest that the lack of JOIN functionality is beneficial; it encourages de-normalization that provides much more efficient query-time data access. All aggregation happens for each (less frequent) write, thus allowing queries to avoid any O(n) aggregation operations. However, de-normalization can have serious consequences. De-normalization means that data is prone to inconsistencies. Generally, this means duplication of data; when that data is mutated, applications must rely on synchronization techniques to avoid having copies become inconsistent. This invariant can easily be violated by application code. While it is typically suitable for multiple applications to access database management servers, with de-normalized data, database access becomes fraught with invariants that must be carefully understood.
These hazards do not negate the value of database de-normalization as an optimization and scalability technique. However, with such an approach, database access should be viewed as an internal aspect of implementation rather than a reusable API. The management of data consistency becomes an integral compliment to the NoSQL storage as part of the whole database system.
The NoSQL approach is headed in the wrong direction if it is attempting to invalidate the historic pillars of data management, established by Edgar Codd. These basic rules for maintaining consistent data are timeless, but with the proper architecture a full NoSQL-based data management system does not need to contradict these ideas. Rather it couples NoSQL data storage engines with database management logic, allowing for these rules to be fulfilled in much more natural ways. In fact, Codd himself, the undisputed father of relational databases, was opposed to SQL. Most likely, he would find a properly architected database management application layer combined with a NoSQL storage engine to fit much closer to his ideals of a relational database then the traditional SQL database.
Network or In-process Programmatic Interaction?
With the vastly different approach of NoSQL servers, it is worth considering if the traditional network-based out-of-process interaction approach of SQL servers is truly optimal for NoSQL servers. Interestingly, both of the approaches to relational data point to the value of more direct in-process programmatic access to indexes rather than the traditional query-request-over-tcp style communication. JOIN style queries over normalized data is very doable with NoSQL databases, but it relies on iterating through data sets with lookups during each loop. These lookups can be very cheap at the index level, but can incur a lot of overhead at the TCP handling and query parsing level. Direct programmatic interaction with the database sidesteps the unnecessary overhead, allowing for reasonably fast ad-hoc relational queries. This does not hinder clustering or replication across multiple machines, the data management layer can be connected to the storage system on each box.
De-normalization approaches also work well with in-process programmatic access. Here the reasons are different. Now, access to the database should be funneled through a programmatic layer that handles all data synchronization needs to preserve invariants so that multiple higher level application modules can safely interact with the database (whether programmatically or a higher level TCP/IP based communication such as HTTP). With programmatic-only access, the data can be more safely protected from access that might violate integrity expectations.
Browser vendors have also come to similar conclusions of programmatic access to indexes rather than query-based access in the W3C process to define the browser-based database API. Earlier efforts to provide browser-based databases spurred by Google Gears and later implemented in Safari were SQL-based. But the obvious growing dissatisfaction with SQL among developers and the impedance mismatches between RDBMS style data structures and JavaScript style data structures, has led the W3C, with a proposal from Oracle (and supported by Mozilla and Microsoft), to orient towards a NoSQL-style indexed key-value document database API modeled after the Berkeley DB API.
Schemas/Validation
Most NoSQL databases could also be called schema-free databases as this is often one of the most highly touted aspects of these type of databases. The key advantage of schema-free design is that it allows applications to quickly upgrade the structure of data without expensive table rewrites. It also allows for greater flexibility in storing heterogeneously structured data. But while applications may benefit greatly from freedom from storage schemas, this certainly does not eliminate the need to enforce data validity and integrity constraints.
Moving the validity/integrity enforcement to the data management layer has significant advantages. SQL databases had very limited stiff schemas, whereas we have much more flexibility enforcing constraints with a programming language. We can enforce complex rules, mix strict type enforcements on certain properties, and leave other properties free to carry various types or be optional. Validation can even employ access to external systems to verify data. By moving validation out of the storage layer, we can centralize validation in our data management layer and have the freedom to create rich data structures and evolve our applications without storage system induced limitations.
ACID/BASE and Relaxing Consistency Constraints
One aspect of the NoSQL movement has been a move away from trying to maintain completely perfect consistency across distributed servers (everyone has the same view of data) due to the burden this places on databases, particularly in distributed systems. The now famous CAP theorem states that of consistency, availability, and network partitioning, only two can be guaranteed at any time. Traditional relational databases have kept strict transactional semantics to preserve consistency, but many NoSQL databases are moving towards a more scalable architecture that relaxes consistency. Relaxing consistency is often called eventual consistency. This permits much more scalable distributed storage systems where writes can occur without using two phase commits or system-wide locks.
However, relaxing consistency does lead to the possibility of conflicting writes. When multiple nodes can accept modifications without expensive lock coordination, concurrent writes can occur in conflict. Databases like CouchDB will put objects into a conflict state when this occurs. However, it is inevitably the responsibility of the application to deal with these conflicts. Again, our suggested data management layer is naturally the place for the conflict resolution logic.
Data management can also be used to customize the consistency level. In general, one can implement more relaxed consistency-based replication systems on top of individual database storage systems based on stricter transactional semantics. Customized replication and consistency enforcements can be very useful for applications where some updates may require higher integrity and some may require the higher scalability of relaxed consistency.
Customizing replication can also be useful for determining exactly what constitutes a conflict. Multi-Version Concurency Control (MVCC) style conflict resolution like that of CouchDB can be very naive. MVCC assumes the precondition for any update is the version number of the previous version of the document. This certainly is not necessarily always the correct precondition, and many times unexpected inconsistent data may be due to updates that were based on other record/document states. Creating the proper update logs and correctly finding conflicts during synchronization can often involve application level design decisions that a storage can’t make on its own.
Persevere
Persevere is a RESTful persistence framework, version 2.0 is designed for NoSQL databases while maintaining the architecture principles of the relational model, providing a solid complementary approach. Persevere’s persistence system, Perstore, uses a data store API that is actually directly modeled after W3C’s No-SQL-inspired indexed database API. Combined with Persevere’s RESTful HTTP handler (called Pintura), data can be efficiently and scalably stored in NoSQL storage engines and accessed through a data modeling logic layer that allows users to access data in familiar RESTful terms with appropriate data views, while preserving data consistency. Persevere provides JSON Schema based validation, data modification, creation, and deletion handlers, notifications, and a faceted-based security system.
Persevere’s evolutionary schema approach leans on the convenience of JSON schema to provide a powerful set of validation tools. In Persevere 2.0, we can define a data model:
var Model = require("model").Model;
// productStore provides the API for accessing the storage DB.
Model("Product", productStore, {
properties: {
name: String, // we can easily define type constraints
price: { // we can create more sophisticated constraints
type: "number",
miminum: 0,
},
productCode: {
set: function(value){
// or even programmatic validation
}
}
},
// note that we have not restricted additional properties from being added
// we could restrict additional properties with:
// additionalProperties: false
// we can specify different action handlers. These are optional, they will
// pass through to the store if not defined.
query: function(query, options){
// we can specify how queries should be handled and
//delivered to the storage system
productStore.query(query, options
},
put: function(object, directives){
// we could specify any access checks, or updates to other objects
// that need to take place
productStore.put(object, directives);
}
});
The Persevere 2.0 series of articles provides more information about creating data models as well as using facets for controlling access to data.
Persevere’s Relational Links
Persevere also provides relation management. This is also based on the JSON Schema specification, and has a RESTful design based on the link relation mechanism that is used in HTML and Atom. JSON Schema link definitions provide a clean technique for defining bi-directional links in a way that gives link visibility to clients. Building on our product example, we could define a Manufacturer model, and link products to their manufacturers:
Model("Product", productStore, {
properties: {
...
manufacturerId: String
},
links: [
{
rel: "manufacturer",
href: "Manufacturer/{manufacterId}"
}
],
...
});
Model("Manufacturer", manufacturerStore, {
properties: {
name: String,
id: String,
...
},
links: [
{
rel: "products",
href: "Product/?manufacterId={id}"
}
]
});
With this definition we have explicitly defined how one can traverse back and forth (bi-directionally) between a product and a manufacturer, using a standard normalized foreign key (no extra effort in synchronizing bi-direction references).
The New mVC
By implementing a logically complete data management system, we have effectively implemented the “model” of the MVC architecture. This actually allows the MVC layer to stay more focused and lightweight. Rather than having to handle data modeling and management concerns, the MVC can focus on the user interface (the viewer and controller), and simply a minimal model connector that interfaces with the full model implementation, the data management layer. I’d suggest this type of user interface layer be recapitalized as mVC to denote the de-emphasis on data modeling concerns in the user interface logic. This type of architecture facilitates multiple mVC UIs connecting to a single data management system, or vice versa, a single mVC could connect to multiple data management systems, aggregating data for the user. This decoupling improves the opportunity for independent evolution of components.
Conclusion
The main conceptual ideas that I believe are key to the evolution of NoSQL-based architectures:
- Database management needs to move to a two layer architecture, separating the concerns of data modeling and data storage. Persevere demonstrates this data modeling type of framework with a web-oriented RESTful design that complements the new breed of NoSQL servers.
- With this two layered approach, the data storage server should be coupled to a particular data model manager that ensures consistency and integrity. All access must go through this data model manager to protect the invariants enforced by the managerial layer.
- With a coupled management layer, storage servers are most efficiently accessed through a programmatic API, preferably keeping the storage system in-process to minimize communication overhead.
- The W3C Indexed Database API fits this model well, with an API that astutely abstracts the storage (including indexing storage) concerns from the data modeling concerns. This is applicable on the server just as well as the client-side. Kudos to the W3C for an excellent NoSQL-style API.