Friday, June 17, 2016

KEEP CALM and JSON


JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for software to parse and generate. Here’s a JSON representation of customer data. This has served well for interchange and integration.
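For illustration, a minimal customer document might look like this (the fields are just placeholders):

{
  "id": 1001,
  "name": "Jane Smith",
  "email": "jane.smith@example.com",
  "joined": "2016-05-14",
  "balance": 245.50
}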



So far, so good.

As long as JSON was used for data interchange between the layers of an application, everything was good. The moment people started using JSON as the database storage format, all hell broke loose. When I first saw JSON as the data format in a database, I was surprised that databases would pay the cost of keeping both key names and values for every document and processing them for every query. I've heard numerous questions about this at my talks and in response to my articles.

Here are the common objections:
JSON is text. It’s inefficient.
JSON has no enforceable structure. Data quality is gone.
Key-value pairs for every document (row)? You’re nuts.

1. JSON is TEXT. It’s inefficient
Online transactions provided by RDBMSs have been key to life as we know it. Even with the advent of Web 2.0, RDBMSs were running all the websites in the back. The cost of a transaction had come down from $5 when mankind went to the moon to $0.02 by the time Jim Gray took his last sailing trip. This was the result of tremendous effort by RDBMS and hardware vendors to continuously improve and win the TPC benchmark wars of the ‘90s. Oracle would publish on SUN hardware. Sybase would beat the numbers on HP. Informix would beat it on SGI. IBM would beat it on IBM. Then Oracle would come back. Everyone would spend $25,000 to boast in a full-page ad in the WSJ. Yes, people read newspapers on paper in the ‘90s.

For this effort, databases would optimize the physical design, access paths, I/O, data loading, transactions, et cetera. This drove many innovations in hardware, from microprocessor instruction sets to SMP system design to InfiniBand. Databases squeezed every clock cycle and every I/O, and took every locking and logging advantage. Every bottleneck was attacked vigorously. Every hardware vendor tried to have the best TPC number.

RDBMS is efficient.

RDBMS is cost conscious -- space, time, spacetime, everything.

Compared to RDBMS storage formats, JSON is inefficient.
All the data is in the character form of JSON and requires conversion.
Even numerical data is stored as text.
Each value comes with its key name (key-value) in each document. More storage and more processing.
It allows complex structures to be represented, resulting in runtime overhead.
It officially supports only a small set of data types. No decimal, no timestamp, etc.

Having JSON in a database means:
The database engine has to understand each document’s schema and process it, instead of handling regular tuples and attributes (rows & columns).
The data types for each key-value pair in JSON have to be interpreted and processed.
JSON may be a skinnier version of XML, but it’s still fat.

With all these inefficiencies, why should any database use JSON as the data model? Or implement a JSON data type?

Yes. JSON was invented as a data interchange format and is now the preferred format for public APIs and data integration scenarios. Most devices and most layers of software can consume and produce data in the form of JSON. Storing and publishing data in JSON means APIs and applications can interact with the database in the medium they’re familiar with, in addition to the advantages below. You have to think in terms of n-tier application efficiency instead of comparing it to a tuple insert or update. Most RDBMSs have added JSON as a data type now. As JSON usage grows, the storage and processing efficiency will increase.

2. JSON has no enforceable structure in the database. Data quality is gone.
Since JSON databases do not enforce any structure, applications, intentionally or unintentionally, can insert data in any format as long as it’s valid JSON.

JSON-based databases prefer a “schema on read” model. This means the reader has the responsibility to understand and interpret the schema of each document or query result. Query languages like N1QL have many features for introspecting the document itself.
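For example (a sketch against a hypothetical customer bucket; TYPE() and OBJECT_NAMES() are the kind of introspection functions N1QL provides):

select meta(c).id,
       object_names(c) as top_level_keys,
       type(c.balance) as balance_type
from customer c
limit 5;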

Having said that, tools like Ottoman and Mongoose help you validate a given JSON document against a predefined structure. While this isn’t a foolproof method for controlling the data loaded into a JSON database, the right controls in the application layer help you enforce consistency.

On the other hand, the flexible, nested structure of JSON helps you map object data into a single JSON document.

Using JSON also avoids the overhead of object-relational mapping. Applications model and modify objects. To store these in a relational schema, you need to normalize the data into multiple relations and store them in distinct tables. When you need to reconstruct the object, you have to join the data in these tables and construct the object back. While object-relational mapping layers like Hibernate automate this process, you still pay in performance because you issue multiple queries for each step.

By representing the object in the hierarchical, nested structure of JSON, you avoid most of the overhead of object-relational mapping.
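As a rough sketch (table and column names are invented for illustration), the relational route described above looks like this:

-- normalize the object into distinct tables
create table customer         (customer_id int primary key, name varchar(100));
create table customer_phone   (customer_id int, phone varchar(20));
create table customer_address (customer_id int, street varchar(100), city varchar(50));

-- reconstructing the object means joining them back
select c.name, p.phone, a.street, a.city
from customer c
left join customer_phone   p on p.customer_id = c.customer_id
left join customer_address a on a.customer_id = c.customer_id;

The JSON version keeps the phones and addresses nested inside the customer document, so a single key lookup returns the whole object.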




3. Key-value pairs for every document? You’re nuts!
The answer is that the bottleneck has moved: from database core performance to database flexibility and change management for iterative application development.

In the ‘80s and ‘90s, as RDBMSs were used for many more applications and use cases, it was important for them to keep up with the scale and all the hardware innovations. Higher clock speeds, new instructions, more cores, larger memory: all were exploited. As businesses grew, the demand for more automation, applications, and processes increased. As the apps got more popular and valuable, businesses wanted to do more with them. Business applications went from back-office automation to real business change drivers.

RDBMSs are great at many things. But not at CHANGING themselves. The issues are both organizational and technical.

RDBMSs use a “schema on write” approach. The schema is determined at CREATE TABLE time and checked at every INSERT, UPDATE, and MERGE. Changing the table structure typically involves multiple stakeholders, multiple applications, and testing all the dependent modules.

A typical ALTER TABLE to add columns or change data types requires an exclusive lock on the table, resulting in scheduled application and business downtime. If you need a column to store distinct types of data or structures, you’re out of luck. To address this trend, there have been attempts to alleviate a small subset of the issues by allowing users to add columns ONLINE.

All this takes time, slowing the pace of development and delivery. You can see further requirements of schema evolution in this presentation [starting slide 17].

Storing the data as {“key”: value} pairs makes each document self-describing, so you can avoid enforcing the schema on write. When the documents are read by the query processing engine or the application, they have to read and interpret each document. Applications typically version the documents so they know what to expect as the schema of the document evolves. Query languages like N1QL and SQL++ have been extended to introspect and process flexible schemas.
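For instance (a sketch; the version field, the bucket name, and the two layouts are assumptions), a query can branch on the document’s own version:

select meta(c).id,
       case when c.version = 2 then c.contact.email
            else c.email
       end as email
from customer c;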

Summary
Yes. JSON is text, makes schema enforcement challenging, and takes more space. But you get easier application development, schema evolution, and object mapping.

Is Oracle's Larry Ellison Wrong on Object Stores?

It's All About the DATA

Here's Larry Ellison's critique of Workday. "Workday does not use a database, they use an object store. So, they can't really do reporting. They use Flash. So, they can't run on iPhones or iPads. Besides that, they're great!"


Here's an overview of the Workday application architecture (from Workday).
Workday uses an object model and stores the objects in a MySQL table as a blob instead of normalizing and storing the data in multiple tables, tuples, and columns. An object store makes object-relational mapping and object disassembly/assembly unnecessary. Because you store opaque objects, you lose some of the benefits of an RDBMS: query processing (select, join, project), schema, and an optimizer to rewrite complex queries and help them perform better.  Complex query processing is essential for canned or ad hoc reporting.  Each task has to be written as a program that retrieves all the objects, explodes them into runtime objects, and then does the analysis.  It's slow. It's expensive.
Larry Ellison may have been right back in 2012.  BLOB or TEXT were the options to store objects in most RDBMSs. DB2 and others had an XML store with XQuery, but it never really took off in any serious enterprise application. Back in 2012, MongoDB provided an object store with its flexible JSON data model and basic query processing, and was focused on developers rather than enterprises.  Reporting or complex query processing was not possible.
Workday would have to support very limited reporting on their object store or provide ways to extract the data into a normalized form to enable reporting on it. Additional setup, maintenance, work, and cost.
Document-oriented databases like MongoDB, Couchbase, DynamoDB, and DocumentDB have all provided a way to store data in JSON.  JSON stores hierarchical data as nested basic types, objects, and arrays.  So, we can store hierarchical objects in JSON, limiting the object-to-store mapping work. Here’s an example of a customer object in JSON.
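A representative sketch of such a document (the exact fields are illustrative):

{
  "type": "customer",
  "id": 1001,
  "name": "Jane Smith",
  "phones": ["+1-212-555-0101", "+1-212-555-0102"],
  "address": {
    "street": "1 Main Street",
    "city": "San Francisco",
    "zip": "94105"
  },
  "orders": [
    { "order_id": 9001, "total": 249.99 },
    { "order_id": 9002, "total": 39.95 }
  ]
}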
Now that we have the object and its persistence defined, you can develop applications on this object store.  The Couchbase and MongoDB SDKs help you develop applications. So far, so good.
The criticism from Larry Ellison back in 2012 was that while you can develop applications, trying to report on the data these applications create is impossible.
While document databases started out with the NoSQL approach of basic GET/SET APIs, they’ve seen the need for complex queries and provided them to various degrees. Mongo’s query API and Couchbase’s N1QL are inspired by SQL.  Some more than others :-)
MongoDB provides SQL-inspired basic versions of SELECT, INSERT, UPDATE, DELETE, and UPSERT.  For reporting, MongoDB’s find() provides the basic select-project operation, and you can even get the plan using explain() on find().  Aggregation and joins are provided via the aggregate() method, which gives you the basic filtering, grouping, aggregation, and sorting functionality.  Pipelining of the aggregation stages helps you with multi-level filtering/ordering/grouping.  The aggregation comparison page shows you how simple SQL statements can be written in MongoDB form.
On the other hand, Couchbase N1QL stays close to SQL, extending it to access & manipulate JSON. In Couchbase N1QL, you can issue queries like this on JSON data.  Select, join, project, nest, unnest, subqueries, aggregation, pagination, and sorting are all available just like SQL. You can also generate reports using Couchbase N1QL.  Look at the example below.
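A representative report query might look like this (a sketch only; the customer bucket and its fields are assumptions, matching the document sketched above):

select c.address.city,
       count(*)     as orders,
       sum(o.total) as revenue
from customer c unnest c.orders o
group by c.address.city
order by revenue desc
limit 10;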
So simple and so SQL. 
Check out more examples at querycouchbase.com.  Couchbase N1QL also has ODBC and JDBC drivers from Simba to enable BI tools and Excel to directly access the data and create complex reports and visualizations. Check out the talk and the blog on using this with your data.
Is Larry Ellison wrong for saying “you can’t do reporting on an object store” in 2012?  Fortunately, you’re not in 2012 anymore.

SQL on Twitter: Analyzing Twitter Data Made Simple

SQL on Twitter

"If I had more time, I would have written shorter letter"
-- Blaise Pascal

There have been lengthy articles on analyzing Twitter data by Cloudera here, here and here. More from Hortonworks here and here. This article is going to be short, thanks to features in Couchbase 4.5!


Step 1: Install Couchbase 4.5. Use the Couchbase console to create a bucket called twitter and CREATE PRIMARY INDEX on twitter using the query workbench or any other tool.
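For example, the index statement in the query workbench is just:

CREATE PRIMARY INDEX ON twitter;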


Step 2: Request your Twitter archive. Once you receive it, unzip it. (You can use larger Twitter archives as well.)


cd <to the unzipped location>/data/tweets
Step 3: $ for i in `ls`; 
        do 
          grep -i ^Grailbird $i > $i.out ; 
        done

Step 4: Update the IP address, username, and password below before you run this.

$ for i in `ls *.out`;
  do 
   /opt/couchbase/bin/cbbackupmgr json --host http://172.23.123.107:8091 --username Administrator --password password --bucket twitter --dataset file:///root/$i --format list --generate-key %id_str%; 
 done

Step 5: There is no step 5.
Log in to the Couchbase query workbench and start playing! Simply use the SQL-based N1QL to query and play with the data. The online interactive tutorial makes this even easier.

Here are some observations from my own Twitter data:

1. In the query tab, simply click on the bucket name twitter; it'll automatically analyze the data and give you the structure of the documents with sample data. Move your cursor over the fields to see the sample data.




2. Get me a sample twitter document.

select * 
from twitter 
limit 1;



3. Give me the count of my tweets.

select count(*) as my_tweet_count 
from twitter;



4. Give me the top 5 hashtags and counts in my tweets.


select ht.text, count(1) 
from twitter unnest entities.hashtags ht 
group by ht 
order by count(1) desc limit 5;





Yes. I worked for Informix and IBM :-)


5. What days did I tweet most?

select substr(created_at, 0, 10), count(1)
from twitter
group by substr(created_at, 0, 10)
order by count(1) desc
limit 5;





Couchbase 4.5 makes it very easy to ingest JSON so you can get insights.

Try it out with your own Twitter data or a public JSON archive. Create indexes on fields and arrays. Ask more questions, find more insights!
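For instance, an array index on the hashtags (a sketch; the index name is mine) lets queries that filter with ANY ... SATISFIES on entities.hashtags use the index:

create index idx_hashtags on twitter(distinct array ht.text for ht in entities.hashtags end);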

On Par with Window Functions

Use golf analogy when explaining to executives. Use a car analogy for all others.  — Confucius. Th...