There is a crisis in information technology, and it has been developing for a while. This crisis has a name, and its name is data. There is so much of it; it is everywhere, constantly being produced and constantly changing, and, worst of all, it makes no sense.
Let me explain:
Some time back there was an interesting InnoCentive challenge called
"Strategy to Assimilate Unstructured Information", from which this
article takes its title. One imagines a vast amount of data that is
readily available but whose lack of structure makes it useless. This
could be a stream of data from mobile devices like cell phones or
tablets, or from some Arduino application on the Internet of Things (IoT).
One could also imagine trying to make sense of a data dump from some
old system, or from some server on the Invisible Web. The challenge is
non-trivial, and it is only set to become more pronounced as more and more
data is generated by more and more devices and more and more of our
lives become digitized.
There are other compelling
reasons to be interested in organizing unstructured data into useful
information: organizations, be they corporations, non-profits,
governmental or non-governmental, require governance. Good governance and
leadership require statistics: not just opinion-polling data but real
statistics about everything related to the institution and its
environment.
Having up-to-date statistics, along with the data and
analysis tools behind them, can help to track trends,
encourage success and monitor progress. When decisions about policy
or business are based on facts borne out by data, those
decisions are more likely to be effective, and success is that much
more likely. Good decisions lead to better outcomes, but good
decisions depend on excellent information.
Collecting, collating and
analyzing data is an industry in itself. Databases about people,
education, health, infrastructure, geography, and economic and financial
information are valuable (in fact, some believe, dangerous in the
wrong hands).
The sheer volume of
unstructured information being generated, and the need to store this
information, actually led to the development of NoSql databases.
According to DataStax,
the driving force behind NoSql is Big Data, characterized by:
1. Data velocity: lots of data coming in very quickly, possibly from different locations
2. Data variety: storage of data that is structured, semi-structured, and unstructured
3. Data volume: data that involves many terabytes or petabytes in size
4. Data complexity: data that is stored and managed in different locales, data centers, or cloud geo-zones
Couchbase, makers of a NoSql
database of their own, have an almost identical list of the motivators
for NoSql:
- Support large numbers of concurrent users (tens of thousands, perhaps millions)
- Deliver highly responsive experiences to a globally distributed base of users
- Be always available – no downtime
- Handle semi- and unstructured data
- Rapidly adapt to changing requirements with frequent updates and new features
NoSql is here to stay, in fact it will probably grow by leaps and bounds as more "things"
are connected together and to the Internet as industry builds out the
Internet of Things.
The upside of NoSql is that
we achieve high scalability and throughput, but there is a downside:
we accumulate lots of unstructured and semi-structured data,
data that isn't easy to manipulate in the powerful ways that we once
could with relational databases.
We could call this the
"Data-Relational impedance mismatch", a parallel to the
"Object-Relational impedance mismatch", the mismatch that led to the
development of Object-Relational Mapping (ORM) technology.
It isn't likely that this
problem will be solved by making relational databases faster; the
problem is inherent in the complexity of arranging massive amounts of
data into tables or relations which represent the information as
attributes and values of related data.
Therefore the challenge is
how to make sense of this data, and I have a suggestion.
MDX
Let's use MDX!
In particular, let's use MDX
over NoSql. MDX stands for MultiDimensional
eXpressions, a query language that Microsoft played a big part in
developing. MDX syntax is very similar to SQL,
but it was developed primarily for OLAP and Business
Intelligence (BI). In my opinion, that is exactly why it would be ideal for
this kind of application.
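To give a feel for that SQL-like syntax, here is a minimal MDX query; the cube, dimension and measure names are invented for illustration:

```
SELECT { [Measures].[Student Count] } ON COLUMNS,
       { [Course].Members }           ON ROWS
FROM   [Enrollment]
```

It reads much like a SQL SELECT, but the result is a grid of cells addressed by dimension members rather than a flat table of rows.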
MongoDB (and, I suspect, other
NoSql databases) typically builds out its business intelligence
capabilities by providing a connector/driver that exports the data
into a SQL-compatible table format for further BI processing. This
kind of processing is typical of BI applications, which have
traditionally included an Extract-Transform-Load (ETL) step to shape
the data for BI processing. These ETL steps could be enhanced and/or
simplified by MDX.
MongoDB actually created a
schema language called DRDL (Document Relational
Definition Language), which the MongoDB Connector for BI uses (via
the mongodrdl tool) to extract and transform data from a MongoDB
database into a suitable SQL schema. MDX could be used in place of
this language, and it would be more intuitive for the many developers
who use SQL on a daily basis today.
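For context, a DRDL mapping is a YAML document describing how a collection maps onto a SQL table. The sketch below is illustrative only; the database, collection and field names are hypothetical, and the exact keys should be checked against the BI Connector documentation:

```yaml
# Hypothetical DRDL mapping exposing a "students" collection as a table.
schema:
- db: school
  tables:
  - table: students
    collection: students
    pipeline: []
    columns:
    - Name: name
      MongoType: string
      SqlName: name
      SqlType: varchar
    - Name: course
      MongoType: string
      SqlName: course
      SqlType: varchar
```

An MDX-based approach would replace this intermediate relational mapping with queries expressed directly against the documents.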
How?
The implementation that I
have in mind is basically an abstraction over map-reduce. There is an
uncanny correspondence between the concepts underpinning map-reduce
and those of Business Intelligence. Consider this:
in BI we are typically
dealing with summarized data organised as cubes, dimensions and
measures.
A dimension is an attribute
of an "entity". For instance, if I had a table of Students,
a dimension might be their Gender, or the Course they are taking.
A measure is typically a
numeric value representing some operation over dimensions. For
instance, given the table of Students above and the Course dimension,
a measure might be a count of the number of
Students taking a particular Course. So one MDX query might reveal how
many students are taking the Mathematics course, while
another might show how many of the Students are Male or
Female.
Map-reduce is used for
exactly the same purpose; indeed, given the same dataset and the same
questions, a MongoDB developer would immediately reach for map-reduce
to answer them.
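To make the correspondence concrete, here is a minimal sketch in plain Python (no database required; the sample documents are invented) of the map-reduce view of the Students/Course measure described above:

```python
from functools import reduce

# Sample "collection" of student documents, as MongoDB might store them.
students = [
    {"name": "Ada",   "gender": "F", "course": "Mathematics"},
    {"name": "Alan",  "gender": "M", "course": "Mathematics"},
    {"name": "Grace", "gender": "F", "course": "Physics"},
]

# Map step: emit (dimension value, 1) for the Course dimension.
mapped = [(s["course"], 1) for s in students]

# Reduce step: sum the emitted values per key, producing the "measure"
# (a count of students per Course).
def reducer(acc, pair):
    key, value = pair
    acc[key] = acc.get(key, 0) + value
    return acc

course_counts = reduce(reducer, mapped, {})
print(course_counts)  # {'Mathematics': 2, 'Physics': 1}
```

The map step chooses the dimension, and the reduce step computes the measure; swapping `course` for `gender` answers the Male/Female question the same way.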
Implementation
The pilot project is to
build out an MDX driver for MongoDB.
A feature of C# that has
been around since C# 4.0 is the Dynamic Language Runtime (DLR). The
DLR enables other (usually dynamic) languages to be implemented and
run on the .NET platform; for instance, this is how IronRuby,
IronPython and IronJS are implemented. It's pretty amazing.
This is the technology stack that will be leveraged to create an MDX
driver for MongoDB. Fortunately, there is also a robust, open-source
C# driver for MongoDB.
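The heart of such a driver is a translator from MDX to MongoDB aggregation operations. As a rough sketch of the idea (in Python rather than C#, with a deliberately tiny, hypothetical subset of MDX), a count-by-dimension query could be lowered to a `$group` pipeline stage like this:

```python
import re

def translate_count_query(mdx: str) -> list:
    """Translate a tiny, fixed-shape MDX count query into a MongoDB
    aggregation pipeline (built as plain dicts; no driver required here).

    Only handles the illustrative shape:
        SELECT {[Measures].[Count]} ON COLUMNS,
               {[<Dimension>].Members} ON ROWS
        FROM [<Cube>]
    This is a hypothetical sketch, not the real driver.
    """
    match = re.search(r"\{\[(\w+)\]\.Members\}\s+ON\s+ROWS", mdx, re.IGNORECASE)
    if not match:
        raise ValueError("unsupported MDX shape")
    dimension = match.group(1).lower()
    # Group by the dimension field and count documents per group,
    # mirroring the map-reduce count.
    return [{"$group": {"_id": f"${dimension}", "count": {"$sum": 1}}}]

pipeline = translate_count_query(
    "SELECT {[Measures].[Count]} ON COLUMNS, "
    "{[Course].Members} ON ROWS FROM [Students]"
)
print(pipeline)  # [{'$group': {'_id': '$course', 'count': {'$sum': 1}}}]
```

A real implementation would parse full MDX expressions via the DLR and hand the resulting pipeline to the C# MongoDB driver, but the translation step would follow this same shape.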