Wednesday, April 20, 2016

A strategy to assimilate unstructured information



There is a crisis in information technology, and it has been developing for a while. This crisis has a name, and its name is data. There is so much of it, it is everywhere, it is constantly being produced and constantly changing, but worst of all, it makes no sense.

Let me explain:

Some time back there was an interesting InnoCentive challenge called "Strategy to Assimilate Unstructured Information", like the title of this article. One imagines a vast amount of data that is readily available but whose lack of structure makes it useless. This could be a stream of data from mobile devices like cell phones or tablets, or from some Arduino application on the Internet of Things (IoT). One could also imagine trying to make sense of a data dump from some old systems, or from some server on the Invisible Web. The challenge is nontrivial and is only set to become more pronounced as more data is generated by more devices and more of our lives become digitized.

There are other compelling reasons to be interested in organizing unstructured data into useful information. Organizations, be they corporations, non-profits, governmental or non-governmental, require governance. Good governance and leadership require statistics: not just opinion-polling data, but real statistics about everything related to the institution and its environment.

Having and using data, analysis tools and up-to-date statistics can help to track trends, encourage success and monitor progress. When policy or business decisions are based on facts borne out by data, they are more likely to be effective. Good decisions lead to better outcomes, but good decisions depend on excellent information.

Collecting, collating and analyzing data is an industry in itself. Databases about people, education, health, infrastructure, geography, and economic and financial information are valuable (in fact, some believe they are dangerous in the wrong hands).

The fact that so much unstructured information is generated, together with the need to store it, actually led to the development of NoSQL databases. According to DataStax, the driving force behind NoSQL is Big Data, characterized by:

1. Data velocity: lots of data coming in very quickly, possibly from different locations
2. Data variety: storage of data that is structured, semi-structured, and unstructured
3. Data volume: data that involves many terabytes or petabytes in size
4. Data complexity: data that is stored and managed in different locales, data centers, or cloud geo-zones

Couchbase, makers of a NoSQL database of their own, have a similar list of motivators for NoSQL:

  • Support large numbers of concurrent users (tens of thousands, perhaps millions)
  • Deliver highly responsive experiences to a globally distributed base of users
  • Be always available – no downtime
  • Handle semi- and unstructured data
  • Rapidly adapt to changing requirements with frequent updates and new features

NoSQL is here to stay. In fact, it will probably grow by leaps and bounds as more "things" are connected to each other and to the Internet while industry builds out the Internet of Things.

The upside of NoSQL is that we achieve high scalability and throughput. The downside is that we accumulate lots of unstructured and semi-structured data that isn't easy to manipulate in the powerful ways that we once could with relational databases.

We could call this the "Data-Relational impedance mismatch", a parallel to the "Object-Relational impedance mismatch" that led to the development of Object-Relational Mapping (ORM) technology.

It isn't likely that this problem will be solved by making relational databases faster. The problem is inherent in the complexity of arranging massive amounts of data into tables or relations, which represent the information as attributes and values of related data.

Therefore the challenge is how to make sense of this data, and I have a suggestion.

MDX

Let's use MDX!

In particular, let's use MDX over NoSQL. MDX stands for MultiDimensional eXpressions, a query language that Microsoft played a big part in developing. MDX syntax resembles SQL, but the language was developed primarily for OLAP and Business Intelligence (BI). In my opinion, this is why it would be ideal for this kind of application.

MongoDB (and, I suspect, other NoSQL databases) typically builds out its business intelligence capabilities by creating a connector/driver that exports the data into a SQL-compatible table format for further BI processing. This kind of processing is typical of BI applications, which have traditionally included an Extract-Transform-Load (ETL) step to shape the data for BI processing. These ETL steps can be enhanced and/or simplified by MDX.

MongoDB actually created a schema language, the Document Relational Definition Language (DRDL, generated by the "mongodrdl" tool), which the MongoDB Connector for BI uses to extract and transform data from a MongoDB database into a suitable SQL schema. MDX could be used in place of this language, and it would be more intuitive for the many developers who use SQL on a daily basis today.

How?

The implementation that I have in mind is basically an abstraction of map-reduce. There is an uncanny correspondence between the concepts underpinning map-reduce and those of Business Intelligence. Consider this:

In BI we are typically dealing with summarized data organized as cubes, dimensions and measures.

A dimension is an attribute of an "entity". For instance, if I had a table of Students, a dimension might be their Gender, or the Course they are taking.

A measure is typically a numeric value that results from some operation over a dimension. For instance, given the Students table above and the dimension Course, a measure might be a count of the number of Students taking a particular Course. So an MDX query might reveal how many Students are taking the Mathematics course. Alternatively, another MDX query might show how many of the Students are Male or Female.

Map-reduce is used for exactly the same purpose. Indeed, given the same dataset and the same problem, a MongoDB developer would immediately reach for map-reduce to answer the same questions.
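To make the correspondence concrete, here is a minimal C# sketch that uses LINQ over an in-memory list as a stand-in for map-reduce. It does not use the MongoDB driver or any MDX engine. The Student class, the StudentCube helper and the sample data are all hypothetical names invented for illustration. The "map" step picks the value of a dimension for each Student, and the "reduce" step collapses each group into a measure (a count).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type standing in for a document in a Students collection.
public sealed class Student
{
    public string Name { get; }
    public string Gender { get; }
    public string Course { get; }

    public Student(string name, string gender, string course)
    {
        Name = name;
        Gender = gender;
        Course = course;
    }
}

public static class StudentCube
{
    // Counts students per value of the chosen dimension.
    // GroupBy plays the role of "map" (emit a key per student) plus grouping;
    // Count plays the role of "reduce" (one measure per key).
    public static Dictionary<string, int> CountBy(
        IEnumerable<Student> students, Func<Student, string> dimension)
    {
        return students
            .GroupBy(dimension)
            .ToDictionary(group => group.Key, group => group.Count());
    }

    // Made-up sample data, for illustration only.
    public static List<Student> SampleStudents()
    {
        return new List<Student>
        {
            new Student("Ann", "Female", "Mathematics"),
            new Student("Ben", "Male", "Mathematics"),
            new Student("Cara", "Female", "Physics"),
            new Student("Dan", "Male", "History"),
            new Student("Eve", "Female", "Physics")
        };
    }
}
```

Calling StudentCube.CountBy(StudentCube.SampleStudents(), s => s.Course) answers the "how many Students take each Course" question, and swapping the lambda for s => s.Gender answers the Male/Female question without touching the reduce step. That interchangeability of dimensions is what an MDX layer would expose declaratively.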

Implementation

The pilot project is to build out an MDX driver for MongoDB.

A feature of .NET that has been around since C# 4.0 is the Dynamic Language Runtime (DLR). The DLR enables other (usually dynamic) languages to be implemented and run on .NET. For instance, this is how IronRuby, IronPython and IronJS are implemented. It's pretty amazing. This is the technology stack that will be leveraged to create an MDX driver for MongoDB. Fortunately, there is also a robust and open source C# driver for MongoDB.






Saturday, May 16, 2009

UPCA Assembly for dotNET

Format UPC-A Barcodes using .NET assembly



I created a .NET assembly which, when used with the associated UPC-A TrueType font, allows a developer to format UPC-A barcodes from any .NET project with just a few lines of code, as demonstrated here.

The assembly, barcode font and example project are available for free, for private or non-commercial use. Just email me and I'll send you the entire package.

Or download the zipped files from this link

This demonstration assumes you're using Visual Studio and C#. Enjoy!

Highly skippable backgrounder


The Universal Product Code (UPC) specifications include three versions: A, D, and E. Version A, the regular version, is used to encode a twelve-digit number. Version E, the zero-suppressed version, is a six-digit code used for marking small packages. Version D, the variable-length version, is not commonly used for package marking. It is used in limited special applications.


Both Version A and Version E may include either a 2-digit or a 5-digit supplemental encodation. These extra digits are primarily used on periodicals and books. Supplemental encodations are supported.


Version A encodes a twelve-digit number. The first digit encoded is the number system character, the next ten digits are the data characters, and the last digit is the check character.
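As an illustration of how the check character relates to the other eleven digits, here is a minimal sketch of the standard UPC-A check digit calculation. It is not taken from the UPCA assembly, whose internals are not shown here, and the UpcCheckDigit class name is hypothetical. Digits in odd positions (counting from 1) are weighted by 3, digits in even positions by 1, and the check digit is whatever brings the weighted sum up to the next multiple of 10.

```csharp
using System;

public static class UpcCheckDigit
{
    // Computes the check character for the first eleven digits of a UPC-A code:
    // the number system character followed by the ten data characters.
    public static int Compute(string firstEleven)
    {
        if (firstEleven == null || firstEleven.Length != 11)
            throw new ArgumentException("Expected exactly 11 digits.", nameof(firstEleven));

        int sum = 0;
        for (int i = 0; i < firstEleven.Length; i++)
        {
            char c = firstEleven[i];
            if (c < '0' || c > '9')
                throw new ArgumentException("Only the digits 0-9 are allowed.", nameof(firstEleven));

            int digit = c - '0';
            // Index 0, 2, 4, ... are positions 1, 3, 5, ... and carry a weight of 3;
            // the remaining positions carry a weight of 1.
            sum += (i % 2 == 0) ? digit * 3 : digit;
        }

        // The check digit rounds the weighted sum up to the next multiple of 10.
        return (10 - sum % 10) % 10;
    }
}
```

For example, UpcCheckDigit.Compute("03600029145") returns 2: the odd positions sum to 14 (weighted: 42), the even positions sum to 16, the total is 58, and 2 brings it up to 60. The full code is therefore 036000291452.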


The number system character is printed in human readable form to the left of the UPC symbol. Seven of the ten possible numbers have been assigned.


UPC Number System Characters:

Number system   Use
0               Regular UPC Codes
1               Reserved
2               Random weight items which are symbol marked at the store level
3               National Drug Code and National Health Related Items Code
4               For use without code format restrictions and with check digit protection for in-store marking of non-food items
5               For use on coupons
6               Regular UPC Codes
7               Regular UPC Codes
8               Reserved
9               Reserved


These number system characters are accessible as the UPCA.NumberSystem property of the UPCA object in the assembly.


Assembly Usage


In order to use the assembly and format UPC-A barcodes, download the attached zip files, unzip them and follow the steps outlined below:


  1. First ensure that the UPCA font is installed on the target system:

     i. Navigate to Control Panel and click on it to open.

     ii. Double-click Fonts in Control Panel to open the Fonts control panel applet.

     iii. Either drag the UPCA font into the fonts cache (the Fonts control panel applet) or click File->Install new font…, navigate to the UPCA font and select it.

  2. Add the UPCA assembly to your project in the following manner:

     i. Copy the assembly to your project folder.

     ii. Add a reference to the assembly in your project.

  3. Add the using clause as below:

        using ako.UPC_A;

  4. You can now construct a UPCA object and call its methods in the following manner (also see the attached example project):

     i. Construct a UPCA object with a number system and barcode data (assuming the rich text box font is set to the UPCA font):

        richTextBox1.Text = new UPCA(number_system, textBox1.Text).upca;

     or…

     ii. Construct a UPCA object, then set the NumberSystem and Data properties (assuming the rich text box font is set to the UPCA font):

        UPCA upc = new UPCA();
        upc.NumberSystem = "4";
        upc.Data = textBox1.Text;
        richTextBox1.Text = upc.upca;

     iii. Obtain the formatted barcode from the upca property of the UPCA object, e.g. (if declared as in (ii.) above):

        richTextBox1.Text = upc.upca;


Notes:

The UPCA object is declared as a partial class, allowing a user to extend its functionality as required (see .NET partial classes). The actual declaration is below:

public partial class UPCA : System.Object

The read-only UPCA.upca property is formatted to include all the data required for a properly formatted UPC-A barcode, including the leading and trailing barcode stop characters, the distinct left and right barcode characters, and the middle barcode characters.
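To show why the left and right characters have to be distinct, here is a minimal sketch of the standard UPC-A symbol layout expressed as a string of modules ('1' for a bar, '0' for a space). It does not reproduce the character mapping used by the UPCA font or the UPCA.upca property, neither of which is documented here, and the UpcAPattern class is a hypothetical name. In the standard layout, the six left-hand digits use one set of 7-module patterns, the six right-hand digits use the bitwise inverse of those patterns, and guard patterns mark the start, middle and end.

```csharp
using System;
using System.Text;

public static class UpcAPattern
{
    // Standard left-hand 7-module patterns for the digits 0-9.
    private static readonly string[] LeftCodes =
    {
        "0001101", "0011001", "0010011", "0111101", "0100011",
        "0110001", "0101111", "0111011", "0110111", "0001011"
    };

    // Builds the 95-module pattern for a complete twelve-digit UPC-A code.
    // The check digit is taken as given and is not validated here.
    public static string ToModules(string twelveDigits)
    {
        if (twelveDigits == null || twelveDigits.Length != 12)
            throw new ArgumentException("Expected exactly 12 digits.", nameof(twelveDigits));

        var modules = new StringBuilder(95);
        modules.Append("101");               // start guard
        for (int i = 0; i < 12; i++)
        {
            if (i == 6)
                modules.Append("01010");     // middle guard between the two halves

            char c = twelveDigits[i];
            if (c < '0' || c > '9')
                throw new ArgumentException("Only the digits 0-9 are allowed.", nameof(twelveDigits));

            string left = LeftCodes[c - '0'];
            // Right-hand digits use the inverse of the left-hand pattern.
            modules.Append(i < 6 ? left : Invert(left));
        }
        modules.Append("101");               // end guard
        return modules.ToString();
    }

    // Swaps bars and spaces in a module pattern.
    private static string Invert(string pattern)
    {
        char[] chars = pattern.ToCharArray();
        for (int i = 0; i < chars.Length; i++)
            chars[i] = chars[i] == '0' ? '1' : '0';
        return new string(chars);
    }
}
```

The result is always 95 modules long: 3 for the start guard, 42 for the left half, 5 for the middle guard, 42 for the right half and 3 for the end guard. A barcode font has to provide separate glyphs for the left and right forms of each digit plus the guards, which is why the formatted string is not simply the twelve digits.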

The UPCA font is free for personal, non-commercial use, but requires licensing for commercial use. If you require the font for commercial purposes, please contact me (agolakisira@gmail.com) for details.