Friday, January 25, 2013

Jurisdictional Nodes: Role of Relational Databases

Firstly, let me start by affirming the great power and overall quality of relational databases. All praise be them. I'm not being sarcastic, I just want to load up on the appreciation prior to possibly criticizing the overuse of them. Or I might merely be fantasizing about a world where they were not so necessary in some circumstances, and it might even be in vain. As I say later, I consider them worthy, reliable, and necessary tools which excel at indexing data in a general way, which is probably crucial for arbitrary data mining, especially for business research. There may be no other way to understand how your resources are expended and utilized as an operational entity.

RDBMSs are reliable repositories for millions of records at a time, they are generally well written, they are very well researched in academic computer science, and they are generally constructed to use some of the most powerful indexing and searching techniques from computer science in general. The first relational database I ever worked with was Ingres in the 80s; since then I've worked directly with Sybase, MySQL, Postgres, SQLite, and various ORMs (Object Relational Mappers). In addition I've worked indirectly with many more, from Oracle to IBM databases. I don't expect to ever stop using them throughout my work. But this essay is about where I would prefer to use them, and an attempt to relegate them to that position. My current goal is to delay use of them chronologically, ideally to an end point where the vertical application structure does not depend on one, but all useful data can be exported to a database tool to enable analysis with standard tools. At this point I would rely on an RDBMS for long, massive queries, but not for the small everyday queries needed to support individual users.

So this is not a hit piece, nor is it uninformed about the topic or the practical utility of an RDBMS. This essay is me exploring an idea that frankly has an escape hatch. I can map what I'm doing to a relational DB at any point, and the later I have to do that the more successful my experiment will have been. The less runtime dependence, the better. As such, it is open research, an in-progress communication in the spirit of R&D; I seek, if anything, rational proof for and against my ideas to guide me. I expect drawbacks which will have to be met head-on and mitigated or, ideally, solved.

Having said that, I think I can afford to put the goal bluntly: I want to make a mission-critical-capable stack that does not rely on an RDBMS at runtime, but which will export data to one for use by standard-issue tools. Therefore I don't intend to duplicate all the abilities of a DB by any means; I merely mean to free the application from having to work with one on a constant basis for mundane tasks, like finding a small set of recently created objects, or objects associated with a particular user.

Here is a sketch of the sorts of things that motivate me to want to do without an RDBMS, at least during development, with its use ideally minimized and aimed only at after-the-fact data mining. Why I'd prefer not to let RDBMS concerns affect the entire design:
  • RDBMSs cooperate poorly with iterative development. Say I have an object type A for which I want to develop feature sets X, Y, and Z in the long run. I am inclined to develop these one at a time, of course. Assuming I created a table to store information about objects of type A, it will incorporate first the needs of feature set X. Thus the table in an RDBMS will at first reflect the needs of X, then of X+Y, and then of X+Y+Z. Since there are likely to be relationships between X, Y, and Z, not only will properties/columns need to be added (something which was not supported when I started with relational databases), but I expect that along the way I will probably find ways to consolidate properties needed by each feature set. That is, I might create something for X that needs to be generalized to support Y instead of adding a separate column/property altogether, creating more abstract properties with more flexibility. The RDBMS is an uncooperative dead weight when making these changes. It's quite clear why it's necessary, and indeed it's amazing how flexible they have been made in modern times, such that columns can be added after the fact. But to do this properly, without clearing the DB and starting from scratch each time, requires special software, or special plans, to survive each transition. I have worked on a variety of ways to help regularize such processes, and I hope such work can be lessened greatly. Forward compatibility issues are crucial for iteratively developed software.
  • RDBMSs are general purpose. That is great, especially when needed, but the general danger of such tools is that they solve every job with average performance, when a special-purpose tool solves a smaller domain of problems more efficiently. While an RDBMS speeds up a massive query, it will never be faster than a custom-written search using appropriate indexing, though in many cases it will be on a par with that result. This means one of their strengths is that they save development effort spent thinking about the logical particulars of one's own problem domain, and in general developer time is worth a lot of efficiency in the code, which can, after all, be optimized later. But obviously the drawback is less-than-optimal performance, and the value of that difference depends on the task at hand. The RDBMS stays prepared to execute novel queries on any given property. The fact is, however, you are never going to search for a blog entry stored in an SQL DB using the full text of the entry as the search parameter. Some of this generality is overkill. If job responsibility is partitioned well, shedding unnecessary generality is an opportunity for new mission-critical systems. The generality is regained over time, and one must estimate the likelihood that such an outcome is possible, and that the approach doesn't have internal contradictions or bottlenecks that cannot be reasonably mitigated.
  • Use of an RDBMS requires one to model data as objects made of exactly the same properties, distinguished only by different settings. This is never an ideal model for the program consuming the data. A program necessarily needs its own model and, if at all well built, will communicate with the RDBMS by querying into and out of that model. That means I still have to develop the appropriate data structure model for my data as used in the program throughout the stack, while also creating a separate RDBMS-oriented model. I worry that efficiency concerns lead people to subsequently design entirely around what one views as a required RDBMS model.
  • The RDBMS hides the storage from us at the shell level. During development, and even in deployment, I find it useful to have data access available to me at the file system level. While I respect the model incarnate in the RDBMS, I also appreciate the unix named-pipe/file-system model as a very useful heterogeneous data abstraction, albeit one that is poorly indexed, relying only on lookup namespaces which are generally invented before the data itself, in the reverse of the proper order. File system elements are easier to inspect and to hack from time to time. I can also use standard tools not designed to work with a database. There is a considerable amount of power in the unix shell, and none of it is of much use dealing with data stored in a locked binary cabinet that cannot be poked around in. This means that solutions to problems that might have a quick remedy may be long in coming because a more formal fix is required. A custom system can react to any peculiarity of the data or circumstance, and then plan a path to a formal, steady-state answer. This is the kind of flexibility I want in development and afterward.

Situation

This is an experiment, but not an academic one, more of a standard R&D ambition, based on a hope that the benefits outweigh any drawbacks, at least for the problem domain targeted. I'm using it in development right now, so let me start with a chronology of events that will show where I am so far, and which ideas I hope can carry this to a successful end. The narrative should also reveal any errors in my thinking that I, or another, might thus find visible and rectifiable. I go in not with conviction but in a spirit of exploration.

Over the last many years I have built various sets of software as projects on my own time, sometimes to explore new technologies or to tinker with long-term ideas of mine. Most valuable from my point of view are my alternate YouTube interface and notation app, a JavaScript graphical two-dimensional tile-piece and board-game interface, and a modular client-server application model based on a geometrical graphical design orientation. Most recently I have built a video and image processing system designed for Linux virtual servers. I am now converting all of the above to the Linux virtual server marketed as the cloud.

While the data "cloud" did not strike me as significant, the ability to sell CPU as a commodity does impress me, especially when the host is a standard OS rather than a proprietary system like Google App Engine. Thus I am converting the prior code, which runs on GAE, to Ubuntu- and Apache-based server code. The first program I converted was a script that allows one to specify six images; the images are cut up and composited into a single image which, when printed, can be folded into an origami box with the six source images appearing on its six sides.

I obviously did not need an RDBMS for this application, which I called photogami; it was nothing more than a stand-alone script converted to run on a web server. However, if I were to promote the site and it became heavily used, such a need would naturally arise even for such a simple application: I let people upload their images, for example, and I would in that case be well advised to keep track of who was using what.

Subsequent to this I developed a video server, with the initial issue being mastering ffmpeg, which is used to convert videos on the back end. I needed a login system so my server was not abused, and this is something one would likely address by keeping a user table in an RDBMS. But instead I used Apache authentication, and put off use of a DB.

This was possible through to the point where the videos were successfully uploaded and converted (and for good measure the server could automatically upload them to YouTube as well, if desired). But at that point I hit the first stumbling block. The issue is this: while I can safely let the web server upload a video and put its meta-information in a file on disk for the first time, I cannot safely do subsequent manipulation of the data. E.g. if I want to put that video id in an index of "recently uploaded" files, I cannot do this with the web server app, because it is running multiple processes and those processes have to cooperate somehow to ensure they don't overwrite each other's work, which is what would happen if they simply wrote to files at will.

It's crucial, when modifying shared data, that something controls access to the official data and ensures a transactional, ordered nature to the modifications; this is required to keep the data complete and coherent. Also, such a manager can cache the data, so you often don't have to go back to disk at all.
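For concreteness, here is a minimal sketch of one conventional POSIX pattern for this: an exclusive advisory lock to serialize writers, plus a write-to-temp-file-then-rename so readers never see a half-written record. This is not the JCN itself, just an illustration of the kind of care required; the paths and the helper name are invented:

```cpp
#include <sys/file.h>   // flock
#include <fcntl.h>      // open and its flags
#include <unistd.h>     // write, fsync, close
#include <cstdio>       // rename
#include <string>

// Hypothetical helper: replace a record file transactionally.
bool update_record(const std::string& path, const std::string& new_contents) {
    // Serialize writers: every cooperating process takes this lock first.
    int lock_fd = open((path + ".lock").c_str(), O_CREAT | O_RDWR, 0644);
    if (lock_fd < 0) return false;
    flock(lock_fd, LOCK_EX);              // blocks until we hold the lock

    // Write the full new contents to a temporary file...
    std::string tmp = path + ".tmp";
    int fd = open(tmp.c_str(), O_CREAT | O_TRUNC | O_WRONLY, 0644);
    bool ok = fd >= 0
        && write(fd, new_contents.data(), new_contents.size())
               == (ssize_t)new_contents.size()
        && fsync(fd) == 0;
    if (fd >= 0) close(fd);

    // ...then atomically replace the original, so a reader sees either the
    // old record or the new one, never a partial write.
    if (ok) ok = (rename(tmp.c_str(), path.c_str()) == 0);

    flock(lock_fd, LOCK_UN);
    close(lock_fd);
    return ok;
}
```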

Note, this is the least of what DBs do, managing concurrent access to common data. For that purpose alone they are certainly overkill, BUT if one is of a mind that they will have to have a DB eventually to support their system, this is one of the first reasons to use one sooner rather than later. At this point I decided again to defer the need until later, and instead made a C++ component that would do this part of the work for me. I named the program a Jurisdictional Clerk, or Jurisdictional Control Node. Data under its jurisdiction, somewhere in a file system or file-system-like data store, can only be modified by it. Theoretically readers (the web server processes) could read the data themselves, but must modify and write through the JCN. In practice, they of course should (and do) also read the data using the JCN to control I/O.

But this means I as a developer can inspect the data with shell commands while the system runs.  I cannot modify the data safely unless I am sure it is not being written, which I can ensure by making the JCN react to a command which locks that file from writing.  Or, by brute force, I can simply stop the JCN, ensuring no one can write to it, allowing me to perform developmental surgery on the data as need be, perhaps running scripts to handle a transition, as above, from feature set X to feature set X+Y.

This is relatively straightforward. While we often use a database to look things up by name, it's not needed for that. A file system is already such a database, indexed by path specifiers which allow quick access to arbitrary data. When clients ask the JCN for data, it is read from disk only once, when first requested; subsequently the data comes from JCN memory (this takes about a thousandth of a second for data structures of the current size, which is comparable to a row of information). In this case the path is LITERALLY the key (in C++, a map<string, Morsel *>), where Morsel is a class I have created which uses boost property trees to keep heterogeneous trees of data, and can write them to JSON, XML, or a native format.
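To make that shape concrete, here is a minimal sketch of a Morsel-like class built on boost property trees, as described above. The method names and the store variable are illustrative guesses, not the actual implementation:

```cpp
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/json_parser.hpp>
#include <boost/property_tree/xml_parser.hpp>
#include <map>
#include <sstream>
#include <string>

class Morsel {
public:
    // Dotted paths "path" through the tree, e.g. "_obj.objectClass".
    void put(const std::string& path, const std::string& value) {
        tree_.put(path, value);
    }
    std::string get(const std::string& path) const {
        return tree_.get<std::string>(path);
    }
    // Serialization is isolated here, so the on-disk format can be swapped
    // (JSON now; XML or a binary native format later) without touching callers.
    std::string toJson() const {
        std::ostringstream out;
        boost::property_tree::write_json(out, tree_);
        return out.str();
    }
    std::string toXml() const {
        std::ostringstream out;
        boost::property_tree::write_xml(out, tree_);
        return out.str();
    }
private:
    boost::property_tree::ptree tree_;
};

// The object path is literally the key, as described above.
std::map<std::string, Morsel*> store;
```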

Which format to use is thus generalized, and can be changed throughout development in reaction to differences in performance (including to an RDBMS, aka the escape hatch). Because I'm maximizing access to the data during development, for inspection and manual manipulation, I'm using JSON at the moment. This is also useful because I can send the data unparsed to the client, instead of doing a query and formatting the result as JSON, since the client is a JavaScript program which of course uses JSON natively. Naturally, I suspect that for sites with a lot of traffic a text format for the disk storage is not going to cut it, and the Morsel is the interface which allows us to put data in less convenient but more efficient binary formats by changing only the Morsel class.

A nice thing about the JCN approach so far, however, is that even when that optimization becomes necessary for a particular object class, it needn't apply to them all. While object classes with a lot of instances, like comments on videos for example, may need a more efficient format or storage system, something like the site configuration file needn't. The JCN and its service classes can use the less convenient but more efficient system selectively, only for things that call for it, and only at the point during development where it has really become necessary. Indeed, it's possible that throughout development everything can be kept in the easier format, and changing this in deployment can be merely a matter of changing a configuration flag in the object class definition files.

So the JCN started as a central point of contact for the video bank software data, especially the meta-data, such as the description, uploader, view counts, and so on. Rather than have the web server scripts contact the JCN after a video had been uploaded, I instead made the JCN use inotify to watch for files appearing within its jurisdiction, which it can then react to. Initially these were specific to the video server's needs, but I have since, as part of converting the Novem interface system to this new server, started to generalize the JCN and give the design more definition. The Morsel is a data abstraction which contains a nested map (or "dictionary") of property:value pairs (that is, values can themselves be maps) and which borrows from JavaScript the concept that such a thing actually IS an object.
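As a rough sketch of the inotify mechanism, a watcher loop might look like the following; the jurisdiction path, the event mask, and the handling are assumptions for illustration, not the JCN's actual configuration:

```cpp
#include <sys/inotify.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = inotify_init();
    if (fd < 0) { perror("inotify_init"); return 1; }

    // Watch for files that have finished being written, or were moved in.
    int wd = inotify_add_watch(fd, "/srv/jcn/jurisdiction",
                               IN_CLOSE_WRITE | IN_MOVED_TO);
    if (wd < 0) { perror("inotify_add_watch"); return 1; }

    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    for (;;) {
        ssize_t len = read(fd, buf, sizeof(buf));
        if (len <= 0) break;
        // A single read can return several packed events.
        for (char* p = buf; p < buf + len; ) {
            struct inotify_event* ev = (struct inotify_event*)p;
            if (ev->len > 0) {
                // A real JCN would ingest the new file here (e.g. a .mrsl).
                printf("new file in jurisdiction: %s\n", ev->name);
            }
            p += sizeof(struct inotify_event) + ev->len;
        }
    }
    close(fd);
    return 0;
}
```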

While the JCN currently performs its video services by watching for files of a particular extension to ingest (let's say it's .vidinfo), the more general solution that will come from the recent modifications is that if any file of type ".mrsl" appears on the disk, the JCN will ingest it and expect to find type information in the .mrsl. The dictionary can (and, to be handled by the JCN directly, must) have an "_obj" member, which contains properties generally intended to be known by the JCN or another control system: meta-data such as the object's current namespace, the object's name, its object classification, and so on.

A video information note would have the objectClass member of _obj set to "videonote", allowing the JCN to dispatch the data to specific handlers that actually know what to do with such a thing (e.g. that know what the date, author, or other fields might actually mean). I have another class, the Prandium, to act as the collection, the database if you will, of these Morsel instances. This class is what keeps the map to all the morsels in memory.
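Here is a hedged illustration of that dispatch, with a hypothetical .mrsl payload inlined as a string; the handler, the field names, and the sample data are invented for the example:

```cpp
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/json_parser.hpp>
#include <functional>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

using boost::property_tree::ptree;

// This handler knows what the fields of a videonote actually mean.
void handleVideoNote(const ptree& obj) {
    std::cout << "video note from "
              << obj.get<std::string>("author", "unknown") << "\n";
}

int main() {
    // A hypothetical .mrsl file: a nested dictionary whose _obj member
    // carries the meta-data the JCN itself understands.
    std::istringstream mrsl(R"({
        "_obj": { "namespace": "videos", "name": "v42",
                  "objectClass": "videonote" },
        "author": "someone", "title": "my video"
    })");

    ptree morsel;
    boost::property_tree::read_json(mrsl, morsel);

    std::map<std::string, std::function<void(const ptree&)>> handlers;
    handlers["videonote"] = handleVideoNote;

    // The JCN only inspects _obj; everything else is the handler's business.
    std::string cls = morsel.get<std::string>("_obj.objectClass");
    if (handlers.count(cls)) handlers[cls](morsel);
}
```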

I feel confident with this design. Right now I can let the Prandium keep track of data that is in principle on disk, and cache it in memory to save I/O, confident that I can make the store a better-indexed file if needed, or from here execute the escape hatch and link the Prandium to an RDBMS while keeping the Prandium's interface, which is designed to my convenience.

This all works quite well, and I can see the implementation will satisfy the requirement mentioned above and will act as a competent (efficient) shared datastore manager for retrieving/modifying data by "name and namespace" (i.e. the "object path"). My current system supports a path language within the data morsel as well, so the JCN can allow arbitrary modification of these types of structures (heterogeneous trees).

I also said above that while an RDBMS is often used to do this sort of lookup, such a job is not its real strength. The real strength of the RDBMS is its ability to do queries efficiently, which is especially important for very large queries. The ability to do queries is something I need right now. The JCN will currently provide a list of current videos, but it will not cull them for you; you can't ask it for the latest dozen videos, for example.

The RDBMS excels at doing such things, and is ready to do them from the moment you put the data in. Generally, any property of a row (aka "column") can be used in a query. Thus, especially with the wanton aid of ORMs (which indulge the OO programmer), one creates a column/property for each property one desires to have stored in the database. This means you can query on any of these in the future, without needing forethought about what you will search on.

But the fact is, many of the properties in the database won't ever be used in a query. The text of a blog entry will not be used as a query for the blog object in the database. It has occurred to me that for business reasons it's valuable to be able to search on any given property, because this supports arbitrary business analysis, and for this reason I do expect to export Prandium and Morsel data to an RDBMS for such studies. I am not trying to remove the RDBMS from the whole endeavor; I'm just trying to keep my server system from having one as an operational dependency for the runtime system.

A flexibility of the JCN store (again, remembering it can be put in various types of storage on disk and is already more or less ideally efficient when in memory, indexed by path) is that it has no column concept, and thus you can put any data in these dictionaries. You can easily add extra sub-objects, which will be safely ignored by systems that don't know where to look for them, but which can nevertheless be handled by the infrastructure by iterating over all the present keys, and also through the special _obj member.
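A small illustration of that tolerance, using standard boost::property_tree facilities; the field names are hypothetical:

```cpp
#include <boost/property_tree/ptree.hpp>
#include <boost/optional.hpp>
#include <iostream>
#include <string>

void inspect(const boost::property_tree::ptree& morsel) {
    // Code that doesn't know about an optional sub-object simply doesn't ask;
    // code that does can probe for it without any schema.
    boost::optional<std::string> rating =
        morsel.get_optional<std::string>("extras.rating");
    if (rating) std::cout << "rating: " << *rating << "\n";

    // Generic infrastructure can still walk every key that is present.
    for (const auto& child : morsel)
        std::cout << "key: " << child.first << "\n";
}
```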

However, this makes it difficult to use these things as arbitrary search parameters, and by difficult I mean impossible. While most properties do not need to be used as search parameters and never will be, what about those that do need to be used for searching?

The current situation: I support the storing of "blobs" in the JCN by pathnames. The JCN can inspect objects, but currently only does so to find control information in the _obj member. It will check the type and creator for authorization purposes, but in every other sense does not have to know what the type actually is (just that it's allowed).

What I am developing next is the ability to use internal properties as search terms. The idea that I will implement is that one can do the JCN equivalent of "create table" AFTER a bunch of such objects have been stored. However, the command will not be creating tables, but rather storing object class definitions which drive the creation of indexes. For those familiar with RDBMSs, the create table action, which is done to tell the database that you have a type of data you would like it to accept and the format of that data, specifies both what can and cannot be used as a search parameter, and what's stored. In the JCN model, what can be stored is more or less open and does not have to be pre-specified. For binary types that cannot be stored in the actual meta-information file, a field can simply point to a file (or URL), accomplishing much the same thing in effect. Thus anything that can be stored in the file system can be stored (in a file system, or in an optimized store able to retrieve arbitrary data by namespace and path). Any file-system-level utility, like revision control, can also easily be used.

The object class definition specifies the things that can be used as search parameters. For each of these an index will be created, and when creating and modifying that type of object, these indexes will be maintained and stored efficiently on disk. The index is a simple map from the given property, as key, to the Morsels with that value. The properties are given as path names which "path" through the property tree in memory.
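A minimal sketch of how such definition-driven indexes might look, with the object class definition reduced to a hypothetical list of indexed property paths; none of these names are the actual implementation:

```cpp
#include <boost/property_tree/ptree.hpp>
#include <map>
#include <string>
#include <vector>

using boost::property_tree::ptree;

// Hypothetical stand-in for a stored object class definition.
std::vector<std::string> indexedPaths = {"author", "_obj.namespace"};

// One index per indexed path: property value -> object paths with that value.
std::map<std::string, std::multimap<std::string, std::string>> indexes;

// Maintain the indexes whenever a morsel is created or modified.
void indexMorsel(const std::string& objectPath, const ptree& morsel) {
    for (const auto& prop : indexedPaths) {
        // The paths "path" through the property tree; absent values are skipped.
        if (auto value = morsel.get_optional<std::string>(prop))
            indexes[prop].insert({*value, objectPath});
    }
}

// Lookup: all object paths whose indexed property has the given value.
std::vector<std::string> find(const std::string& prop, const std::string& value) {
    std::vector<std::string> result;
    auto range = indexes[prop].equal_range(value);
    for (auto it = range.first; it != range.second; ++it)
        result.push_back(it->second);
    return result;
}
```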

In the long run I could be swamped by a need for indices on all fields, but I know that at least in development I will have only a gradually growing and limited number of things I need to be able to search and filter on. As long as the indices can get me to small sets (hundreds or thousands at most), I can also support final filtering on arbitrary values in the property tree for which there are no indexes, as sketched below. This means that if I get through development without an RDBMS but find my system cannot perform in deployment (perhaps it never will; perhaps it's just not ready), I can take the escape hatch at that future point. Not only can I still benefit from the RDBMS (by pointing key parts of the storage model to an RDBMS instead of a file system, and changing the "create object class" flow so that it has to precede ingestion of data), but I can also continue to develop to my general model: a server capable of replacing the RDBMS in development and possibly production use, while using the RDBMS for what it truly excels at, REALLY BIG LONG MASSIVE queries, at which it is a rock star.
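The fallback filter mentioned above might be sketched like this, checking an arbitrary, unindexed property path against the small candidate set an index has already produced; again, the names are illustrative:

```cpp
#include <boost/property_tree/ptree.hpp>
#include <map>
#include <string>
#include <vector>

using boost::property_tree::ptree;

// candidates come from an index lookup (see the index sketch above);
// store is the in-memory path -> morsel map; propertyPath may be unindexed.
std::vector<std::string> filterCandidates(
        const std::map<std::string, ptree>& store,
        const std::vector<std::string>& candidates,
        const std::string& propertyPath,
        const std::string& wanted) {
    std::vector<std::string> out;
    for (const auto& path : candidates) {
        auto it = store.find(path);
        if (it == store.end()) continue;
        // Probe the arbitrary path directly in the property tree.
        auto v = it->second.get_optional<std::string>(propertyPath);
        if (v && *v == wanted) out.push_back(path);
    }
    return out;
}
```

This is cheap precisely because the index has already reduced the set to hundreds or thousands of morsels that are, per the design above, already cached in memory.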

One advantage is that at this point it is much more fun to work on the JCN, and it seems to require less effort than making yet another RDBMS data model in parallel with the data model that actually gets the work of the applications done. I suppose I will go straight for high performance in storage of the indexes, and store them in some very efficient binary format. Perhaps I hope to redeem the filesystem as the ultimate data store... one I have often wandered in. It could be, I'm like that sometimes.

 



