Hi Uli,

My use case is two-fold: generic and "high-powered" analytics.
There are various offerings which I could use that will push data back to
HDFS at regular intervals; even Apache Sqoop <http://sqoop.apache.org> can
do that. However, I was thinking it would be better to keep everything in
the Hadoop (or cache-atop-Hadoop) space, to reduce levels of indirection
and to ease explanation, debugging, tracing, profiling and orchestration.

The generic components would just include CRUD and basic related queries
(such as propagated updates utilising joins). More interesting is the
analytics side, where I'll be executing a variety of machine learning,
natural language processing, recommender, time-series sequence-matching
and related tasks. Some of these require near-real-time responses, whereas
others can be delayed significantly.

I haven't actually looked at Splice Machine. It's just Apache Derby and
Apache HBase married together cleanly, right? It doesn't seem to be
open-source, though. Definitely an interesting project.

Best,

Samuel Marks
http://linkedin.com/in/samuelmarks

On Fri, Jan 30, 2015 at 10:54 PM, Uli Bethke <uli.bet...@sonra.io> wrote:

> What exactly is your use case? Analytics or OLTP?
> Have you looked at Splice Machine? If your use case is OLTP, have you
> looked at NewSQL offerings (outside Hadoop)?
> Cheers
> uli
>
>
> On 30/01/2015 11:26, Samuel Marks wrote:
>
> Since Hadoop <http://hadoop.apache.org> came out, there have been various
> commercial and/or open-source attempts to expose some compatibility with
> SQL. Obviously by posting here I am not expecting an unbiased answer.
>
> Seeking an SQL-on-Hadoop offering which provides: low-latency querying,
> and supports the most common CRUD, including [the basics!] along these
> lines: CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE Table SET C1=2
> WHERE, DELETE FROM, and DROP TABLE. Transactional support would be nice
> also, but is not a must-have.
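[To be concrete about "the basics" above: this is the CRUD cycle I have in
mind, sketched in Python against the standard DB-API, with sqlite3 merely
standing in for whichever JDBC/ODBC driver the chosen engine exposes. The
table and column names are illustrative only.]

```python
# The basic CRUD cycle via Python's DB-API 2.0. sqlite3 is only a
# stand-in here for the driver of whichever SQL-on-Hadoop engine is
# eventually chosen; the statements are the point, not the backend.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, c1 INTEGER)")  # Create
cur.execute("INSERT INTO t (id, c1) VALUES (1, 10), (2, 20)")       # Insert
cur.execute("UPDATE t SET c1 = 2 WHERE id = 1")                     # Update
rows = cur.execute("SELECT * FROM t ORDER BY id").fetchall()        # Read
print(rows)  # [(1, 2), (2, 20)]
cur.execute("DELETE FROM t WHERE id = 2")                           # Delete
cur.execute("DROP TABLE t")                                         # Drop
conn.close()
```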
>
> Essentially I want a full replacement for the more traditional RDBMS,
> one which can scale from 1 node to a serious Hadoop cluster.
>
> Python is my language of choice for interfacing; however, there does
> seem to be a Python JDBC wrapper.
>
> Here is what I've found thus far:
>
> - Apache Hive <https://hive.apache.org> (SQL-like, with interactive SQL
>   thanks to the Stinger initiative)
> - Apache Drill <http://drill.apache.org> (ANSI SQL support)
> - Apache Spark <https://spark.apache.org> (Spark SQL
>   <https://spark.apache.org/sql>: queries only; add data via Hive,
>   SchemaRDD
>   <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>   or Parquet <http://parquet.io/>)
> - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
>   <http://hbase.apache.org>; lacks full transaction
>   <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>   operators <http://en.wikipedia.org/wiki/Relational_operators> and some
>   built-in functions)
> - Cloudera Impala
>   <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>   (significant HiveQL support and some SQL language support; no support
>   for indexes on its tables; importantly missing DELETE, UPDATE and
>   INTERSECT, amongst others)
> - Presto <https://github.com/facebook/presto> from Facebook (can query
>   Hive, Cassandra <http://cassandra.apache.org>, relational DBs, etc.
>   Doesn't seem to be designed for low-latency responses across small
>   clusters, or to support UPDATE operations.
>   It is optimized for data warehousing or analytics¹
>   <http://prestodb.io/docs/current/overview/use-cases.html>)
> - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
>   community edition <https://www.mapr.com/products/hadoop-download>
>   (seems to be a packaging of Hive, HP Vertica
>   <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
>   Drill and a native ODBC wrapper
>   <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
> - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
>   interface and multi-dimensional analysis [OLAP
>   <http://en.wikipedia.org/wiki/OLAP>]; "… offers ANSI SQL on Hadoop and
>   supports most ANSI SQL query functions". It depends on HDFS, MapReduce,
>   Hive and HBase, and seems targeted at very large data-sets, though it
>   maintains low query latency)
> - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard compliance
>   with JDBC <http://en.wikipedia.org/wiki/JDBC> driver support
>   [benchmarks against Hive and Impala
>   <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>])
> - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>   Lingual <http://docs.cascading.org/lingual/1.0/>²
>   <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual
>   provides JDBC Drivers, a SQL command shell, and a catalog manager for
>   publishing files [or any resource] as schemas and tables.")
>
> Which—from this list or elsewhere—would you recommend, and why?
>
> Thanks for all suggestions,
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
>
>
> --
> ___________________________
> Uli Bethke
> Co-founder Sonra
> p: +353 86 32 83 040
> w: www.sonra.io
> l: linkedin.com/in/ulibethke
> t: twitter.com/ubethke