IT Management

Feb 16

Two recent Sloan Management Review articles noted that a) sometimes good-enough data delivered fast is better, and b) an overload of information delivered faster to the wrong target doesn't help. This is echoed by a recent IBM survey of Chief Marketing Officers (CMOs), which showed that CMOs are keenly concerned that the data explosion is resulting in information overload. Better, rather than faster, analytics is one of the top three corporate strategies today, according to the survey.

Much of the problem with today’s fast but ineffective Business Intelligence (BI) stems from legacy patterns of data analysis that are difficult to replace outright but easy to adapt to new needs. To adapt them, you need a central controller to reroute analysis, and the mainframe is best qualified to be that controller.

The Big Data Challenge

A key component of upcoming challenges to the enterprise’s BI is Big Data. Some of Big Data is a relabeling of the incessant scaling of existing corporate queries and their extension to internal semi-structured (e.g., corporate documents) and unstructured (e.g., check images) data. This is generally an easier problem to solve. The part that matters to today’s enterprise, however, is the typically unstructured data (video, audio, graphics) that’s an integral part of customers’ social media channels. This is global, enormous, fast-growing, and a vital piece (again, according to the CMO survey) of the overall task of engaging with a customer of one throughout a long-term relationship.

Technically, handling this kind of Big Data is quite different from handling a data warehouse. The best way to understand the place of Big Data platforms (such as Hadoop) in the computing universe is to view the history of data processing as a constant battle between parallelism and concurrency. Think of the database as a data store plus a protective layer of software that’s constantly being bombarded by transactions, and often another transaction on a piece of data arrives before the first is finished. To handle all the transactions, databases have two choices at each stage in computation: parallelism, in which two transactions are literally processed at the same time on separate processors, and concurrency, in which a single processor rapidly switches between the two while each is still in progress.

Pure parallelism is obviously faster, but to avoid inconsistencies in the results of the transactions, you often need coordinating software. That coordinating software is hard to operate in parallel because it involves frequent communication between the parallel threads of the two transactions. At a global level (like that of the Internet), this translates into a choice between distributed processing and scale-up, single-system processing.
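The role of the coordinating software can be illustrated with a minimal Python sketch. It is not taken from any particular database: the `Account` class and the `run_transactions` helper are hypothetical names chosen for illustration, and a simple lock stands in for the coordination layer. Several threads play the part of transactions arriving at the same piece of data before earlier ones have finished.

```python
import threading


class Account:
    """Hypothetical data item protected by a lock (the coordinating software)."""

    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def deposit(self, amount):
        # The read-modify-write must not interleave with another
        # transaction, so each deposit takes the lock first.
        with self._lock:
            current = self.balance
            self.balance = current + amount


def run_transactions(account, n_threads, deposits_per_thread, amount):
    """Fire several overlapping streams of deposits at one account."""

    def worker():
        for _ in range(deposits_per_thread):
            account.deposit(amount)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return account.balance


final = run_transactions(Account(0), n_threads=4, deposits_per_thread=1000, amount=1)
print(final)  # 4000: no deposit is lost because every update is coordinated
```

The lock keeps the result consistent, but it also forces the transactions to take turns on that data item, which is the cost of coordination described above.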

Consider the relative performance merits of networks of microcomputers vs. machines with a fixed number of parallel processors. Here, some general rules are applicable, and two key factors are especially relevant: data locality and number of connections used. This means you can get away with parallelism if, say, you can operate on a small chunk of the overall data stored on each node, and if you don’t have to coordinate too many nodes at one time.
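As a rough illustration of these two factors, the sketch below counts words across three partitions of data. It assumes a setup not described in the article: worker threads stand in for nodes, each `local_count` call touches only its own chunk (data locality), and the single `merge` call is the only point where results from different nodes meet (few connections). All function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def local_count(chunk):
    """Work done on one node, using only the chunk stored on that node."""
    return Counter(word for line in chunk for word in line.split())


def merge(partials):
    """The only coordination step: combine the small per-node summaries."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_words(partitions):
    """Run one local count per partition, then merge the results once."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(local_count, partitions))
    return merge(partials)


# Each inner list is the data held by one (simulated) node.
partitions = [
    ["big data big"],
    ["data locality"],
    ["big locality"],
]
totals = count_words(partitions)
print(totals["big"], totals["data"], totals["locality"])  # 3 2 2
```

Because no node needs another node's data until the final merge, adding partitions adds very little coordination. A workload that required nodes to consult each other during `local_count` would not scale this way.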

Enter the problems of cost and scalability. The server farms that grew like Topsy during Web 1.0 had hundreds or thousands of PC-like servers that were set up to handle transactions in parallel. This had obvious cost advantages since PCs were relatively cheap. But data locality was a serious problem in trying to scale. Even when data was partitioned correctly between clusters of PCs at the outset, data copies and data links proliferated over time, requiring more coordination. Meanwhile, in the High Performance Computing (HPC) area, users of grids of small PC-type machines operating in parallel found that scaling required all sorts of caching and coordination tricks, even when they could minimize the need for coordination by choosing the transaction type carefully.

In certain instances, however, relational databases designed for scale-up systems and structured data did even less well. A relational database would insist on careful consistency between data copies in a distributed configuration, and so couldn’t squeeze the last ounce of parallelism out of transaction streams involving:

  • Indexing and serving massive amounts of rich text (text plus graphics, audio, and video)
  • Data such as Facebook pages
  • Streaming media
  • HPC.

To minimize costs and maximize the parallelism of these types of transactions, Google, the open source movement, and various others turned to MapReduce, Hadoop, and various other non-relational approaches. These efforts combined:
