Tuesday, November 18, 2014

Notes on Real Time Data Processing

Real time is real hot right now. Days became hours, hours became minutes, minutes became seconds, seconds became nanoseconds. Time is money. Go ahead, show me the money. That's the mantra from all the hardware and software vendors: do the analytics in real time, and your decisions, results, and revenues will automatically jump. When you perform a database transaction, events are triggered by an exchange of messages over some protocol, and the systems record the changes as the events unfold. Analysis of this data should happen immediately after the event. So, colloquially, real-time processing is the analysis of data as the event is unfolding.

But what is real time?

In all of the cosmos, there is one, just one, phenomenon we know of that happens in real time. The EPR thought experiment was devised by Einstein, Podolsky, and Rosen to show that quantum theory was an incomplete explanation of reality. Simply put, it shows that information can travel the length of the universe instantly. This violates Einstein's special theory of relativity, which states that nothing can travel faster than light. Experiments have shown that EPR's REAL TIME effect, seemingly faster than light, does happen. Such is the strange quantum world. See http://en.wikipedia.org/wiki/EPR_paradox if you're curious about the details.

So, in physics, you hardly hear about real-time travel or real-time information processing. The speed at which information travels, except in the quantum world, is restricted by the speed of light.


Real Time in Software

Computer science and software, of course, won't be bothered by such constraints. Everything from streaming, fast data ingestion, and messaging to business intelligence and queries on the stream claims real-time computing. There is, however, a restrictive and reasonable definition of real-time computing on Wikipedia: hardware and software systems that are subject to a "real-time constraint", for example operational deadlines from event to system response. Real-time programs must guarantee a response within strict time constraints, often referred to as "deadlines". The real-time constraint can be defined exactly in cases where a violation has an immediate effect, such as a fire extinguisher or a furnace temperature monitor.
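To make the deadline notion concrete, here is a minimal sketch of a hard-real-time-style control loop for the furnace example. It is illustrative only: the sensor read is a made-up stub, the 50 ms deadline is arbitrary, and an ordinary Python process on a general-purpose OS cannot actually guarantee deadlines the way a real-time OS can.

    import random
    import time

    DEADLINE_MS = 50   # hypothetical hard deadline: event to response in 50 ms
    MAX_TEMP_C = 900   # hypothetical shutdown threshold

    def read_temperature():
        # Stand-in for a real sensor read; returns degrees Celsius.
        return random.uniform(850, 910)

    def shut_down_furnace():
        print("furnace shut down")

    while True:
        start = time.monotonic()
        if read_temperature() > MAX_TEMP_C:
            shut_down_furnace()
            break
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms > DEADLINE_MS:
            # In a hard real-time system, a missed deadline is a failure,
            # not merely a slow response.
            raise RuntimeError("deadline missed: %.1f ms" % elapsed_ms)
        time.sleep(0.01)  # sample every 10 ms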

In other cases, e.g. stock or options trading, real time means doing it faster than the competition to maximize revenue and profit. The faster you do it, the higher the possible revenue and profit. So, for that, what's the limit? A nanosecond? A Planck time? Here, every nanosecond counts. Every bit of bandwidth counts. Every optimization counts. All to save precious time.

So, it's important to realize which use case your real-time system is handling: whether you're time-bound to avoid catastrophe, or chasing time's tail to maximize profit. Once you determine your constraint, it's easier to design the system and select the right architecture and tools to achieve it.

For an interesting history of the term real time, see Phillip Laplante's article "It Isn't Your Father's Realtime Anymore" at http://queue.acm.org/detail.cfm?id=1117409

Saturday, November 15, 2014

When to go from SQL to NoSQL databases?

While SQL has become the lingua franca of relational database systems, SQL <> RDBMS. Still, since SQL is overwhelmingly the query language on RDBMSs, the terms SQL database and RDBMS are used interchangeably. I developed SQL features in RDBMSs at Sybase, Informix & IBM, so I feel good about that.

On the other hand, the rebels use NoSQL as an antonym to SQL. While the popular expansion of NoSQL is Not-Only-SQL, it usually means the database does not support SQL... or at least won't support a good bit of SQL operations like joins, aggregations, etc.

Let's assume you have an application running on a SQL database. Your boss returns from another big data conference, where someone has invariably pitched the power of NoSQL, and asks you: "Josh, why aren't you moving to NoSQL?"

While the three Vs (volume, velocity, variety) offer quick reasons to move to NoSQL, let's look at the reasons and scenarios in a bit more detail.

Let's look at each scenario:

1. Critical Stable Application: You're running a mission-critical application developed in-house that hasn't changed in years. The database is reliable. Report queries run reliably. Backup & restore scripts are verified. A hot standby is always ready. Your hardware can store & handle the data for the next two years, and all the users you expect.

So, there is really no obvious reason to consider the alternative approach. 

2. A Very Dynamic Application: You're attending meeting after meeting on schema re-design. You're adding spare columns to tables in case they're needed later: unnamed_int, unnamed_varchar32, etc., so when the request comes to add a new attribute, you can provide it instantly just by renaming a column. You're a hero! However, just to decide the need for a particular column, there are numerous meetings. A sketch of that spare-column trick appears below.
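Here is a minimal sketch of the spare-column trick using SQLite (any RDBMS with ALTER TABLE works similarly; the table and column names are made up, and RENAME COLUMN needs SQLite 3.25+):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")

    # Pre-provision a spare column "just in case" (the pattern described above).
    conn.execute("ALTER TABLE customers ADD COLUMN unnamed_varchar32 TEXT")

    # Months later, the real requirement arrives; rename instead of redesigning.
    conn.execute(
        "ALTER TABLE customers RENAME COLUMN unnamed_varchar32 TO loyalty_code")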

The application keeps adjusting to market requirements, so you keep changing the data stored in tables: new data, new forms of data. Every schema change means application downtime. And the changes never end.

This is definitely a major symptom that it's time to consider NoSQL databases, which usually have a flexible schema. In NoSQL, you typically create only the table and don't specify the columns and types.

In fact, in MongoDB, you don't need to create the database or table (collection) ahead of time. Simply USE a database and SAVE data into a collection; the database and collections are created for you automatically. No column names to decide, no type lengths to think about.
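For instance, here is a minimal pymongo sketch; it assumes a mongod running locally on the default port, and the database and collection names are made up:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["crm"]            # database is created lazily, on first write
    customers = db["customers"]   # so is the collection ("table")

    # No CREATE DATABASE, no CREATE TABLE, no column declarations:
    customers.insert_one({"name": "Josh", "tier": "gold"})
    customers.insert_one({"name": "Ana", "twitter": "@ana", "orders": [101, 102]})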

3. Dynamic Data Sources: Many applications, like a single view of the customer, need to integrate new data sources: e.g. the customer's Twitter feed, or sales data from a company you just acquired. New data sources mean schema changes. And changing the schema doesn't just mean adding columns; there can also be conflicts that cannot be resolved cleanly.

Say one data source defines customer-id as an integer and another as alphanumeric. That forces you to keep all the data in one superset data type, in this case alphanumeric. It also means migrating any existing data, changing the SQL that runs on this schema, and testing everything.
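As a sketch of that migration (here in SQLite, where changing a column type means rebuilding the table; other engines have their own ALTER TABLE paths, and the names are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
    conn.execute("INSERT INTO customers VALUES (42, 'Josh')")

    # The new source uses alphanumeric ids, so widen to the superset type (TEXT)
    # by rebuilding the table and migrating the existing rows.
    conn.execute("CREATE TABLE customers_new (customer_id TEXT, name TEXT)")
    conn.execute("INSERT INTO customers_new "
                 "SELECT CAST(customer_id AS TEXT), name FROM customers")
    conn.execute("DROP TABLE customers")
    conn.execute("ALTER TABLE customers_new RENAME TO customers")

    conn.execute("INSERT INTO customers VALUES ('A-7X', 'Ana')")  # now legal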

4. Scale-up and Scale-out: I was watching Shark Tank yesterday. One of the contestants said, "year-to-date sales are 2 million dollars, and we expect to close the year at 7 million dollars". Even the experienced sharks were surprised. The reason: 75% of their sales come during the Christmas season.

Each business has variability due to seasons and events. If you can't handle the increased demand seamlessly, you risk losing revenue at the point of greatest profitability.

RDBMSs can scale. They've scaled from single-processor systems to tens or even hundreds of processor cores in a single machine, known as an SMP (symmetric multiprocessor) machine. This is scale-up. Every time you scale up, you need planning, downtime, data migration, and power-on procedures. That is costly, slow, and risky.

NoSQL systems were designed to scale out. When you reach system capacity, you simply add new machines to the cluster; data and workload get redistributed across the new configuration seamlessly, with no application downtime. This is well suited to modern commodity clusters and instant provisioning in the cloud.
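In MongoDB, for example, growing a sharded cluster is one admin command against the mongos router; the balancer then migrates data to the new shard in the background. A hedged sketch (the hostnames and replica-set name are made up, and it assumes a sharded cluster is already running):

    from pymongo import MongoClient

    mongos = MongoClient("mongodb://mongos.example.com:27017")

    # Add a new machine (a replica set) to the cluster; data and workload
    # get rebalanced onto it automatically, with no application downtime.
    mongos.admin.command("addShard", "rs4/node4.example.com:27018")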

5. TCO (Total Cost of Ownership & Operation): Roughly, RDBMS vendors charge PER CORE; Intel's Ivy Bridge line has 12-core and 15-core models. NoSQL vendors charge PER NODE (a machine within certain configuration limits). The list price per node (defined as a machine with up to 24 cores) is typically the price of 2 cores. So, a 12x difference right there!
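The arithmetic behind that claim, with a made-up per-core list price purely for illustration:

    cores_per_node = 24
    price_per_core = 10_000                  # hypothetical RDBMS per-core price
    price_per_node = 2 * price_per_core      # NoSQL node priced like ~2 cores

    rdbms_cost = cores_per_node * price_per_core   # 240,000 per machine
    nosql_cost = price_per_node                    #  20,000 per machine
    print(rdbms_cost / nosql_cost)                 # 12.0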

So, it should be an easy call, right? Unfortunately, no. There is a huge feature difference between SQL and NoSQL systems.

An RDBMS does more than simple get and put of a row. Once you design a good schema, the number of SQL queries you can issue on it is practically infinite. Data architects understand the data relationships and general application requirements and design the schema; applications then have flexibility in the queries they issue against it.

In NoSQL, you're often encouraged to use the primary key. Often (e.g. in HBase), the primary key is the only way to access the data efficiently.
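A sketch with the happybase client for HBase (it assumes an HBase Thrift server on localhost, and the table and row-key layout are made up): a get by row key is a fast point lookup, while any other predicate degenerates into a scan.

    import happybase

    conn = happybase.Connection("localhost")  # assumes an HBase Thrift server
    table = conn.table("customers")

    # Fast: point lookup by the row key (the "primary key").
    row = table.row(b"cust#0000042")

    # Slow: any other predicate means scanning the table.
    gold = [key for key, data in table.scan()
            if data.get(b"info:tier") == b"gold"]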

While an RDBMS provides rich functionality, by the 80/20 rule the majority of applications use under 20% of its features. Even that 20% of RDBMS features is often considered advanced in the NoSQL world! However, many of those features have workarounds, and when you consider the cost difference, the workarounds can make migration worthwhile.

6. Read-Write Patterns: An RDBMS generally assumes you write the data once and read it many, many times. A typical pattern assumes 80% SELECTs and 20% INSERT/UPDATE/DELETEs, so a slower WRITE throughput on the system isn't heavily penalized. There are data warehousing situations where you load large amounts of data, and there are many tools and techniques for this. One is a special loader that loads without the overhead of the transaction log. Another is to prepare the data in a separate table and then simply ATTACH that table to the live RDBMS table online.
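SQLite has no online ATTACH of a prepared table (PostgreSQL, for instance, spells it ALTER TABLE ... ATTACH PARTITION), so this sketch only approximates the idea: prepare the bulk data in a staging table off to the side, then publish it in one short transaction. Table names are made up.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")

    # Prepare the bulk data off to the side, without touching the live table.
    conn.execute("CREATE TABLE sales_staging (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO sales_staging VALUES (?, ?)",
                     [(i, i * 1.5) for i in range(100_000)])

    with conn:  # one transaction: readers see the new rows all at once
        conn.execute("INSERT INTO sales SELECT * FROM sales_staging")
        conn.execute("DROP TABLE sales_staging")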


NoSQL databases are designed to handle a continuous stream of data at very high speed. And since you're advised to create minimal additional indexes, the writes are even faster. The WRITE share in NoSQL-suitable applications is typically much higher than 20%.
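A minimal pymongo sketch of that write-heavy pattern (a local mongod is assumed, and the names are made up): batching amortizes round trips, and with no secondary indexes defined, each insert only maintains the default _id index.

    from pymongo import MongoClient

    events = MongoClient("mongodb://localhost:27017")["telemetry"]["events"]

    batch = [{"sensor": i % 8, "value": i * 0.1} for i in range(10_000)]
    events.insert_many(batch, ordered=False)  # unordered: server can parallelize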

SQL or NoSQL?

Going from a SQL to a NoSQL system is a lengthy process, and the pros and cons should be weighed carefully. NoSQL systems offer exciting new technologies to solve many business problems now, and they will eventually mature to support more complex operations and applications.
