Posted January 20, 2015 by Jennifer Costley

SQL-on-Hadoop: Still chasing the dream

Some people learn their word-of-the-day from calendars and web sites; I am lucky enough to interact with financial services technologists who bring their own colorful language. As an example, the response I got from one when I asked about SQL-on-Hadoop: chimera. I later looked up the dictionary definition, an illusion or fabrication of the mind; especially: an unrealizable dream. SQL-on-Hadoop advocates are hoping to prove this skeptic wrong.

There are many SQL-on-Hadoop solutions competing for market share. By my count, at least 30 open source or proprietary products enable SQL queries to get at HDFS data. (I’m deliberately side-stepping the thorny issue of SQL conformance and compatibility. Suffice it to say that most of these products enable a large set of SQL commands to function.) This product menagerie offers a variety of architectural approaches, some very innovative features, and no lack of ambition.

Here I’ll focus just on interactive analytics (as opposed to batch analytics or transactions) and discuss just three products that came up during our interviews for last year’s big data white paper: Stinger-enabled Hive, Impala and Spark SQL. Each has a fundamentally different architectural approach and faces different hurdles to becoming the SQL-on-Hadoop champion for interactive analytics.

Stinger-enabled Hive

Once the only choice for SQL-on-Hadoop, Hive has been the de-facto standard, is widely adopted, and is among the top 20 database management systems. However, Hive’s performance for interactive queries has been a major issue for analytics. The initial versions of Hive compiled SQL-like Hive Query Language (HiveQL) statements into MapReduce jobs. MapReduce is a batch-oriented job execution framework with high latency and job processing overhead. Interactive query performance was a problem not just for SQL developers but also for business intelligence (BI) tools with interfaces to Hive.

Hortonworks has taken a head-on approach to addressing Hive performance, supporting the Apache Hive community in a releasing a series of enhancements. The releases up to and including Hive 0.13 were designated “the Stinger Initiative” and included Tez, which replaces MapReduce and translates SQL statements into data processing graphs. Hortonworks claims major performance improvements for interactive queries. Stinger.next is the continuation of this initiative with impressive goals including improved performance for complex queries and allowing “rich reporting to be deployed on Hive faster, more simply and reliably using standard SQL.”

A technologist from a major financial services firm running Stinger (Hive 0.13) in production for regulatory and compliance use cases against archived data expressed satisfaction: “Tez has sped things up a lot; the users are happy with it.” For him, performance is not an issue although he does caveat this with “at the moment.” They have also integrated with existing BI tools (e.g. Tableau) with good results. He expects the solution to continue to be used within this set of use cases for at least the next few years. Others are skeptical that Hive performance will get to a level where it can be a general replacement for an RDBMS, limiting it to certain use cases. Time will tell.

Impala

Cloudera has taken a different approach to performance. Rather than replacing the plumbing beneath Hive, they have spearheaded the development of Impala, a special-purpose run-time designed for SQL-on-Hadoop. Impala provides a massively parallel processing (MPP) database engine optimized for interactive SQL via queries or BI tools operating directly on data in HDFS (rather than through an intermediary such as Spark, Tez, or MapReduce), leveraging caching in HDFS, and optimizing in-memory data transfers during queries.

Customers I’ve spoken with feel that this approach does indeed provide good performance for their financial services workloads. But if these same conversations are any indicator, the biggest challenge for Impala may be a concern that despite being open source, it will not be embraced by other Hadoop vendors or the broader open source community. That said, most market analysts put Cloudera significantly ahead of its competitors in Hadoop market share, so it’s possible that Impala could become a market leader even without the support of other Hadoop vendors.

Spark SQL

Databricks is taking yet another approach to the problem: building a SQL engine as an integral part of Spark, a general-purpose run-time for large-scale data processing designed to support interactive queries. Spark, a cluster computing framework for processing complex workloads, can both run in memory and persist data to disk-based storage such as HDFS. (The members of the STAC Big Data Special Interest Group were introduced to Spark and the rest of the open-source Berkeley Data Analytics Stack back in March 2013.) Most of the big Hadoop vendors now support Spark or have announced plans to.

Shark, the original SQL query facility for Spark, was suboptimal since it was a version of Hive reengineered to compile HiveQL into Spark and constrained by that legacy. Spark SQL, the successor to Shark, is a SQL engine designed for and integrated tightly into Spark. We have heard mostly positive reactions to Spark SQL; there is support for its open architecture and high expectation for its efficiency (even potential to replace RDBMS in commit-heavy OLTP use cases, which is something else altogether). Spark SQL’s main challenge is maturity, with Databricks founded in October 2013 and Spark SQL first delivered in March 2014. As one technologist expressed it “Everyone is getting behind it and wants it to be successful, despite the relative immaturity of the product and its main vendor.”

Closing thoughts

It seems that each of these three solutions - Stinger-enabled Hive, Impala, and Spark SQL- has challenges to overcome to reach the ideal combination of maturity, performance, and community adoption. It's possible one of them will emerge as a clear winner in the next year or two. Or perhaps one of the many other SQL-on-Hadoop solutions will establish a big lead in the market. But if I were to place bets, I'd bet against both outcomes. Given how many uses (and associated expectations) have accumulated for SQL over the past 40 years, perhaps no solution will cater to every case, even just interactive analytics cases. The perfect SQL-on-Hadoop solution may always be a chimera.

About the STAC Blog

STAC and members of the STAC community post blogs from time to time on issues related to technology selection, development, engineering, and operations in financial services.

About the blogger

Jennifer Costley

Jennifer Costley is a scientifically-trained technologist with broad multi-disciplinary experience in enterprise architecture, software development, line management, and infrastructure operations. Following 31 years of technology leadership in organizations like Credit Suisse, Bankers Trust, and Double Click, Jennifer now consults to companies, non-profit organizations, and individuals in areas related to data, governance, and sustainability. This includes roles with the IEEE and the STAC Benchmark Council. Jennifer has a PhD in Chemical Physics from Columbia University and a Bachelors in Physics and Chemistry from Brandeis University.