Re: Which [open-source] SQL engine atop Hadoop?

Samuel Marks Fri, 30 Jan 2015 09:14:41 -0800

Dear Jacques,

The support for SQL:2003 syntax, nested objects, and schema-free SQL
in Apache Drill is quite impressive, not to mention the useful ODBC
interface alongside the expected JDBC one. Additionally, on the scalability
side, your documentation claims: "Scales from a single laptop to a 1000-node
cluster".


You mention that this entire topic is subjective. I suppose with
insufficient information about my use-case, you may just be right.

Without giving away my full use-case—FYI: I will be open-sourcing what I'm
building—I will tell you a little bit about the components.

The generic components would just include CRUD, and basic related queries
(such as propagated updates utilising joins).

More interesting is the analytics side, wherein I'll be executing a
variety of machine learning, information filtering (recommender systems,
internal search engine; most with some element of natural language
processing), time-series sequence matching and related tasks. Some of these
require near-realtime responses, whereas others can be delayed
significantly.

I posted something similar to this on StackOverflow, but it was very quickly
removed. I haven't tried LinkedIn or Quora; probably worth a shot. I'm worried
about speaking to enterprise salespeople, as they're being paid to push
their own offering (and I doubt they have extensive benchmarks across all
their competitors).

Thanks for your continuing advice,

Samuel Marks
http://linkedin.com/in/samuelmarks

On Sat, Jan 31, 2015 at 12:22 AM, Jacques Nadeau <[email protected]> wrote:

> Samuel,
>
> You've come and asked your question on the Apache Drill group so of course
> the answer is Apache Drill is best for everything, right?
>
> The reality is that each tool has a set of strengths and weaknesses for
> each particular use case. An Apache user support mailing list is definitely
> NOT the place to have this discussion.  You're really asking for technology
> selection advice and this entire topic is very subjective. The people in
> any one community would never do full justice to all the options. As such I
> suggest you use another forum such as Quora or LinkedIn to get advice.
> (There is also a helpful article on Gigaom that just came out yesterday and
> all sorts of friendly sales people at companies like MapR and IBM who love
> giving this kind of advice.)
>
> What we can do here is tell you how Drill can solve or not solve your
> different use cases and help you work through those. If you want to go into
> more detail on those, we'd be happy to help.
>
> Thanks again for the interest. Sorry if this seems abrupt but these threads
> generally aren't productive and tend to be very divisive.
>
> Welcome to the community :)
>
> Jacques
> On Jan 30, 2015 3:28 AM, "Samuel Marks" <[email protected]> wrote:
>
> > Since Hadoop <https://hive.apache.org> came out, there have been various
> > commercial and/or open-source attempts to expose some compatibility with
> > SQL <http://drill.apache.org>. Obviously by posting here I am not expecting
> > an unbiased answer.
> >
> > Seeking an SQL-on-Hadoop offering which provides low-latency querying
> > and supports the most common CRUD <https://spark.apache.org> operations,
> > including [the basics!] along these lines: CREATE TABLE, INSERT INTO,
> > SELECT * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE.
> > Transactional support would be nice also, but is not a must-have.
> >
> > Essentially I want a full replacement for the more traditional RDBMS, one
> > which can scale from 1 node to a serious Hadoop cluster.
> >
> > Python is my language of choice for interfacing; however, there does seem
> > to be a Python JDBC wrapper <https://spark.apache.org/sql>.
> >
> > Here is what I've found thus far:
> >
> >    - Apache Hive <https://hive.apache.org> (SQL-like, with interactive SQL
> >    thanks to the Stinger initiative)
> >    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
> >    - Apache Spark <https://spark.apache.org> (Spark SQL
> >    <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
> >    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
> >    or Parquet <http://parquet.io/>)
> >    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
> >    <http://hbase.apache.org>, lacks full transaction
> >    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
> >    operators <http://en.wikipedia.org/wiki/Relational_operators> and some
> >    built-in functions)
> >    - Cloudera Impala
> >    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
> >    (significant HiveQL support, some SQL language support, no support for
> >    indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT,
> >    amongst others)
> >    - Presto <https://github.com/facebook/presto> from Facebook (can query
> >    Hive, Cassandra <http://cassandra.apache.org>, relational DBs, etc.
> >    Doesn't seem to be designed for low-latency responses across small
> >    clusters, or to support UPDATE operations. It is optimized for data
> >    warehousing or analytics¹
> >    <http://prestodb.io/docs/current/overview/use-cases.html>)
> >    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
> >    community edition <https://www.mapr.com/products/hadoop-download> (seems
> >    to be a packaging of Hive, HP Vertica
> >    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
> >    Drill and a native ODBC wrapper
> >    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
> >    - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
> >    interface and multi-dimensional analysis [OLAP
> >    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop and
> >    supports most ANSI SQL query functions". It depends on HDFS, MapReduce,
> >    Hive and HBase, and seems targeted at very large data-sets, though it
> >    maintains low query latency)
> >    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard compliance
> >    with JDBC <http://en.wikipedia.org/wiki/JDBC> driver support [benchmarks
> >    against Hive and Impala
> >    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>
> >    ])
> >    - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
> >    Lingual <http://docs.cascading.org/lingual/1.0/>²
> >    <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual provides
> >    JDBC Drivers, a SQL command shell, and a catalog manager for publishing
> >    files [or any resource] as schemas and tables.")
> >
> > Which—from this list or elsewhere—would you recommend, and why?
> > Thanks for all suggestions,
> >
> > Samuel Marks
> > http://linkedin.com/in/samuelmarks
> >
>
