Samuel,

You've come and asked your question on the Apache Drill group, so of course the answer is that Apache Drill is best for everything, right?

The reality is that each tool has a set of strengths and weaknesses for each particular use case. An Apache user support mailing list is definitely NOT the place to have this discussion. You're really asking for technology selection advice, and this entire topic is very subjective. The people in any one community would never do full justice to all the options. As such, I suggest you use another forum such as Quora or LinkedIn to get advice. (There is also a helpful article on Gigaom that just came out yesterday, and all sorts of friendly salespeople at companies like MapR and IBM who love giving this kind of advice.)

What we can do here is tell you how Drill can solve or not solve your different use cases and help you work through those. If you want to go into more detail on those, we'd be happy to help.

Thanks again for the interest. Sorry if this seems abrupt, but these threads generally aren't productive and tend to be very divisive.

Welcome to the community :)

Jacques

On Jan 30, 2015 3:28 AM, "Samuel Marks" <[email protected]> wrote:

> Since Hadoop <https://hive.apache.org> came out, there have been various
> commercial and/or open-source attempts to expose some compatibility with
> SQL <http://drill.apache.org>. Obviously by posting here I am not
> expecting an unbiased answer.
>
> Seeking an SQL-on-Hadoop offering which provides: low-latency querying, and
> supports the most common CRUD <https://spark.apache.org>, including [the
> basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT * FROM,
> UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE. Transactional
> support would be nice also, but is not a must-have.
>
> Essentially I want a full replacement for the more traditional RDBMS, one
> which can scale from 1 node to a serious Hadoop cluster.
>
> Python is my language of choice for interfacing, however there does seem to
> be a Python JDBC wrapper <https://spark.apache.org/sql>.
>
> Here is what I've found thus far:
>
> - Apache Hive <https://hive.apache.org> (SQL-like, with interactive SQL
>   thanks to the Stinger initiative)
> - Apache Drill <http://drill.apache.org> (ANSI SQL support)
> - Apache Spark <https://spark.apache.org> (Spark SQL
>   <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
>   <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>   or Parquet <http://parquet.io/>)
> - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
>   <http://hbase.apache.org>, lacks full transaction
>   <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>   operators <http://en.wikipedia.org/wiki/Relational_operators> and some
>   built-in functions)
> - Cloudera Impala
>   <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>   (significant HiveQL support, some SQL language support, no support for
>   indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
>   amongst others)
> - Presto <https://github.com/facebook/presto> from Facebook (can query
>   Hive, Cassandra <http://cassandra.apache.org>, relational DBs &etc.
>   Doesn't seem to be designed for low-latency responses across small
>   clusters, or support UPDATE operations. It is optimized for data
>   warehousing or analytics¹
>   <http://prestodb.io/docs/current/overview/use-cases.html>)
> - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
>   community edition <https://www.mapr.com/products/hadoop-download>
>   (seems to be a packaging of Hive, HP Vertica
>   <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
>   Drill and a native ODBC wrapper
>   <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
> - Apache Kylin <http://www.kylin.io> from Ebay (provides an SQL
>   interface and multi-dimensional analysis [OLAP
>   <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop and
>   supports most ANSI SQL query functions". It depends on HDFS, MapReduce,
>   Hive and HBase; and seems targeted at very large data-sets though
>   maintains low query latency)
> - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard compliance
>   with JDBC <http://en.wikipedia.org/wiki/JDBC> driver support
>   [benchmarks against Hive and Impala
>   <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>])
> - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>   Lingual <http://docs.cascading.org/lingual/1.0/>²
>   <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual provides
>   JDBC Drivers, a SQL command shell, and a catalog manager for publishing
>   files [or any resource] as schemas and tables.")
>
> Which, from this list or elsewhere, would you recommend, and why?
>
> Thanks for all suggestions,
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
