Wednesday, April 20, 2016

A strategy to assimilate unstructured information



There is a crisis in information technology, and it has been developing for a while. This crisis has a name, and its name is data. There is so much of it, it is everywhere, it is constantly being produced and constantly changing, and, worst of all, it makes no sense.

Let me explain:

Some time back there was an interesting Innocentive challenge called "Strategy to Assimilate Unstructured Information", from which this article takes its title. One imagines a vast amount of data that is readily available but whose lack of structure makes it useless. This could be a stream of data from mobile devices like cell phones or tablets, or from some Arduino application on the Internet of Things (IoT). One could also imagine trying to make sense of a data dump from some old system, or from some server on the Invisible Web. The challenge is non-trivial, and it is only set to become more pronounced as more and more data is generated by more and more devices and more and more of our lives become digitized.

There are other compelling reasons to be interested in organizing unstructured data into useful information. Organizations, be they corporations, non-profits, governmental or non-governmental, require governance. Good governance and leadership require statistics: not just opinion-polling data, but real statistics about everything related to the institution and its environment.

Having up-to-date statistics, and the tools to analyze them, can help to track trends, encourage success and monitor progress. When decisions about policy or business are based on facts borne out by data, those decisions are more likely to be effective, and success is that much more likely. Good decisions lead to better outcomes, but good decisions depend on excellent information.

Collecting, collating and analyzing data is an industry in itself. Databases of personal, educational, health, infrastructure, geographic, economic and financial information are valuable (in fact, some believe, dangerous in the wrong hands).

The sheer volume of unstructured information being generated, and the need to store it, actually led to the development of NoSQL databases. According to DataStax, the driving force behind NoSQL is Big Data, characterized by:

1. Data velocity: lots of data coming in very quickly, possibly from different locations
2. Data variety: storage of data that is structured, semi-structured, and unstructured
3. Data volume: data that involves many terabytes or petabytes in size
4. Data complexity: data that is stored and managed in different locales, data centers, or cloud geo-zones

Couchbase, makers of a NoSQL database of their own, have an almost identical list of the motivators for NoSQL:

  • Support large numbers of concurrent users (tens of thousands, perhaps millions)
  • Deliver highly responsive experiences to a globally distributed base of users
  • Be always available – no downtime
  • Handle semi- and unstructured data
  • Rapidly adapt to changing requirements with frequent updates and new features

NoSQL is here to stay; in fact, it will probably grow by leaps and bounds as more "things" are connected to each other and to the Internet as industry builds out the Internet of Things.

The upside of NoSQL is that we achieve high scalability and throughput, but there is a downside: we accumulate lots of unstructured and semi-structured data, data that is not easy to manipulate in the powerful ways we once could with relational databases.

We could call this the "Data-Relational impedance mismatch", a parallel to the "Object-Relational impedance mismatch" that led to the development of Object-Relational Mapping (ORM) technology.

It is unlikely that this problem will be solved by making relational databases faster; the problem is inherent in the complexity of arranging massive amounts of data into tables, or relations, which represent information as the attributes and values of related data.

Therefore the challenge is how to make sense of this data, and I have a suggestion.

MDX

Let's use MDX!

In particular, let's use MDX over NoSQL. MDX stands for MultiDimensional eXpressions. Microsoft played a big part in developing MDX as a language. MDX syntax is very similar to SQL's, but it was developed primarily for OLAP and Business Intelligence (BI). In my opinion, this is why it would be ideal for this kind of application.

MongoDB (and, I suspect, other NoSQL databases) typically builds out its business intelligence capabilities by providing a connector/driver that exports the data into a SQL-compatible table format for further BI processing. This kind of processing is typical of BI applications, which have traditionally included an Extract-Transform-Load (ETL) step to shape the data for BI processing. These ETL steps could be enhanced and/or simplified by MDX.
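To make the transform step concrete, the core of such an export is flattening nested documents into flat, SQL-style rows. The sketch below is a minimal, hypothetical illustration of that flattening; the document shape and column names are invented, and it is not the actual connector code.

```python
# Minimal sketch of the "transform" step of an ETL pipeline:
# flatten a nested, JSON-like document into a flat, SQL-style row.
# The document shape here is invented for illustration.

def flatten(doc, prefix=""):
    """Recursively flatten a nested document into a single flat row."""
    row = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{name}."))
        else:
            row[name] = value
    return row

student = {
    "name": "Ada",
    "course": {"title": "Mathematics", "credits": 3},
}

print(flatten(student))
# {'name': 'Ada', 'course.title': 'Mathematics', 'course.credits': 3}
```

Once documents are flattened this way, each dotted column name maps naturally onto a SQL column, which is exactly the shape BI tools expect.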

MongoDB actually created a schema format called DRDL (Document Relational Definition Language), produced by the mongodrdl tool, which the MongoDB Connector for BI uses to extract and transform data from a MongoDB database into a suitable SQL schema. MDX could be used in place of this mapping, and it would be more intuitive for the many developers who use SQL on a daily basis today.

How?

The implementation that I have in mind is basically an abstraction of map-reduce. There is an uncanny correspondence between the concepts underpinning map-reduce and those of Business Intelligence. Consider this:

In BI we are typically dealing with summarized data organised as: cubes, dimensions and measures.

A dimension is an attribute of an "entity". For instance, if I had a table of Students, a dimension might be their Gender, or the Course they are taking.

A measure is typically a numeric value, and it represents some operation over dimensions. For instance, given the table of Students above and the dimension Course, a measure might be a count of the number of Students taking a particular Course. So an MDX query might reveal how many students are taking the Mathematics course; another MDX query might show how many of the Students are male or female.
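To make that concrete, here is roughly what such a query could look like in MDX. The cube and member names ([Enrollment], [Measures].[Student Count], and so on) are invented for illustration, assuming a cube had been built over the Students data described above.

```
-- Hypothetical cube and member names, for illustration only:
-- count students per course, restricted to female students.
SELECT
  [Measures].[Student Count] ON COLUMNS,
  [Course].[Course Name].MEMBERS ON ROWS
FROM [Enrollment]
WHERE ([Gender].[Gender].[Female])
```

Note how close this is to a SQL GROUP BY query in spirit, which is exactly why it should feel familiar to developers coming from relational databases.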

Map-reduce is used for exactly the same purpose; indeed, given the same dataset and the same problem, a MongoDB developer would immediately reach for map-reduce to answer the same questions.
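The correspondence can be seen in a toy map-reduce, sketched here in Python rather than MongoDB's JavaScript mapReduce, that answers the students-per-course question from the example above. The dataset is invented for illustration.

```python
from collections import defaultdict

# A toy map-reduce that counts students per course, mirroring the
# "measure over a dimension" idea from the Students example above.
# The dataset is invented for illustration.

students = [
    {"name": "Ada",   "gender": "F", "course": "Mathematics"},
    {"name": "Alan",  "gender": "M", "course": "Mathematics"},
    {"name": "Grace", "gender": "F", "course": "Physics"},
]

def map_phase(docs):
    """Map: emit a (key, 1) pair for each document, keyed by course."""
    for doc in docs:
        yield doc["course"], 1

def reduce_phase(pairs):
    """Reduce: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

print(reduce_phase(map_phase(students)))
# {'Mathematics': 2, 'Physics': 1}
```

The map phase plays the role of selecting a dimension (Course), and the reduce phase plays the role of computing a measure (a count); an MDX driver would translate queries into exactly this kind of pipeline.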

Implementation

The pilot project is to build out an MDX driver for MongoDB.

A feature of C# that has been around since C# 4.0 is the Dynamic Language Runtime (DLR). The DLR enables other (usually dynamic) languages to be implemented and run on the .NET platform; for instance, this is how IronRuby, IronPython and IronJS are implemented. It's pretty amazing. This is the technology stack that will be leveraged to create an MDX driver for MongoDB. Fortunately, there is also a robust, open-source C# driver for MongoDB.





