You have listed almost all of the open-source MPP, real-time SQL-on-Hadoop engines. I prefer Tajo, which recently released version 0.9.0 and is still working toward 1.0.

On Mon, Jan 26, 2015 at 10:19 PM, Samuel Marks <[email protected]> wrote:
> Since Hadoop <https://hive.apache.org> came out, there have been various
> commercial and/or open-source attempts to expose some compatibility with
> SQL <http://drill.apache.org>.
>
> I am seeking one which is good for low-latency querying, and supports the
> most common CRUD <https://spark.apache.org>, including [the basics!]
> along these lines: CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE Table
> SET C1=2 WHERE, DELETE FROM, and DROP TABLE.
>
> I will be utilising them from Python, however there does seem to be a Python
> JDBC wrapper <https://spark.apache.org/sql>. Additionally it needs to be
> scalable for big and small data (starting on a single-node "cluster").
>
> Here is what I've found thus far:
>
>  - Apache Hive <https://hive.apache.org> (SQL-like, with interactive
>    SQL thanks to the Stinger initiative)
>  - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>  - Apache Spark <https://spark.apache.org> (Spark SQL
>    <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
>    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>    or Paraquet <http://parquet.io/>)
>  - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
>    <http://hbase.apache.org>, lacks full transaction
>    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>    operators <http://en.wikipedia.org/wiki/Relational_operators> and some
>    built-in functions)
>  - Presto <https://github.com/facebook/presto> from Facebook (can query
>    Hive, Cassandra <http://cassandra.apache.org>, relational DBs &etc.
>    Doesn't seem to be designed for low-latency responses across small
>    clusters, or support UPDATE operations. It is optimized for data
>    warehousing or analytics¹
>    <http://prestodb.io/docs/current/overview/use-cases.html>)
>  - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
>    community edition <https://www.mapr.com/products/hadoop-download>
>    (seems to be a packaging of Hive, HP Vertica
>    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
>    Drill and a native ODBC wrapper
>    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>  - Apache Kylin <http://www.kylin.io> from Ebay (provides an SQL
>    interface and multi-dimensional analysis [OLAP
>    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop and
>    supports most ANSI SQL query functions". It depends on HDFS, MapReduce,
>    Hive and HBase; and seems targeted at very large data-sets though maintains
>    low query latency)
>  - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
>    compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver
>    support [benchmarks against Hive and Impala
>    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
>    ])
>  - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>    Lingual <http://docs.cascading.org/lingual/1.0/>²
>    <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual
>    provides JDBC Drivers, a SQL command shell, and a catalog manager for
>    publishing files [or any resource] as schemas and tables.")
>
> Which—from this list or elsewhere—would you recommend, and why?
> Thanks for all suggestions,
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
