Re: Which [open-source] SQL engine atop Hadoop?

Tomer Shiran Fri, 30 Jan 2015 15:58:11 -0800

Yes, Drill is currently focused on querying data as opposed to inserting or
updating. While most of the systems you listed take a traditional approach
to SQL in which a DBA must create and manage schemas, Drill is designed for
Hadoop and NoSQL databases. In these systems, most of the data is usually
self-describing (JSON, Parquet, etc.) and sometimes even schema-less (as in
JSON, HBase, MongoDB) so it doesn't make sense to require schemas to be
created and managed manually, and data to be transformed before it can be
queried. Drill's architecture makes it unique in its ability to
enable self-service data exploration where agility is essential.
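
To make "self-describing" concrete, here is a minimal Python sketch of reading
JSON records without declaring a schema first. This is not Drill's API or
implementation; the sample data and the helper names (read_records, get_path,
select) are made up purely for illustration. It only shows the general idea
that field names and nesting come from the data itself, and that a field
missing from a record can simply read as null.

```python
import json

# Hypothetical sample: newline-delimited JSON whose records do not all
# share the same fields. No schema is declared anywhere.
RAW = """\
{"name": "alice", "age": 30}
{"name": "bob", "address": {"city": "Oslo"}}
{"name": "carol", "age": 41, "address": {"city": "Lima"}}
"""


def read_records(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def get_path(record, path):
    """Follow a dotted path such as 'address.city'.

    Returns None when any step of the path is absent, so records with
    different shapes can be queried side by side.
    """
    value = record
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def select(records, paths, where=None):
    """Project the given paths from each record that passes the filter."""
    rows = []
    for rec in records:
        if where is None or where(rec):
            rows.append(tuple(get_path(rec, p) for p in paths))
    return rows


records = read_records(RAW)

# Roughly "SELECT name, address.city FROM records"
all_rows = select(records, ["name", "address.city"])

# Roughly "... WHERE age > 35"; a missing age is treated as not matching.
older = select(
    records,
    ["name", "address.city"],
    where=lambda r: (get_path(r, "age") or 0) > 35,
)
```

Here all_rows holds one tuple per record, with None where a record has no
address, and older keeps only the record whose age is above 35. The point is
that nothing had to be created or transformed before the data could be
queried.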


On Fri, Jan 30, 2015 at 10:10 AM, Andrew Brust <
[email protected]> wrote:

> Not sure Drill -- or any of the other SQL-on-Hadoop engines -- are truly
> well-suited to CRUD.  They excel at the "R" -- the "CUD" is not their forte.
>
> -----Original Message-----
> From: Samuel Marks [mailto:[email protected]]
> Sent: Friday, January 30, 2015 8:50 AM
> To: [email protected]
> Subject: Re: Which [open-source] SQL engine atop Hadoop?
>
> Dear Jacques,
>
> Seeing the support for SQL:2003 syntax, nested objects, and schema-free SQL
> in Apache Drill is quite impressive, not to mention the useful ODBC
> interface alongside the expected JDBC one. Additionally on the scalability
> side your documentation claims: "Scales from a single laptop to a 1000-node
> cluster".
>
> You mention that this entire topic is subjective. I suppose with
> insufficient information about my use-case, you may just be right.
>
> Without giving away my full use-case—FYI: I will be open-sourcing what I'm
> building—I will tell you a little bit about the components.
>
> The generic components would just include CRUD, and basic related queries
> (such as propagated updates utilising joins).
>
> More interesting is the analytics side, where I'll be executing a
> variety of machine learning, information filtering (recommender systems,
> internal search engine; most with some element of natural language
> processing), time series sequence matching and related tasks. Some of these
> require near-realtime responses, whereas others can be delayed
> significantly.
>
> I posted something similar to this on StackOverflow, but it was very quickly
> removed. I haven't tried LinkedIn or Quora; probably worth a shot. I'm worried
> about speaking to enterprise sales people, as they're being paid to push
> their own offering (and I doubt they have extensive benchmarks across all
> their competitors).
>
> Thanks for your continuing advice,
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
>
> On Sat, Jan 31, 2015 at 12:22 AM, Jacques Nadeau <[email protected]>
> wrote:
>
> > Samuel,
> >
> > You've come and asked your question on the Apache Drill group so of
> > course the answer is Apache Drill is best for everything, right?
> >
> > The reality is that each tool has a set of strengths and weaknesses
> > for each particular use case. An Apache user support mailing list is
> > definitely NOT the place to have this discussion.  You're really
> > asking for technology selection advice and this entire topic is very
> > subjective. The people in any one community would never do full
> > justice to all the options. As such I suggest you use another forum such
> > as Quora or LinkedIn to get advice.
> > (There is also a helpful article on Gigaom that just came out
> > yesterday and all sorts of friendly sales people at companies like
> > MapR and IBM who love giving this kind of advice.)
> >
> > What we can do here is tell you how Drill can solve or not solve your
> > different use cases and help you work through those.  If you want to go
> > into more detail on those, we'd be happy to help.
> >
> > Thanks again for the interest. Sorry if this seems abrupt but these
> > threads generally aren't productive and tend to be very divisive.
> >
> > Welcome to the community :)
> >
> > Jacques
> > On Jan 30, 2015 3:28 AM, "Samuel Marks" <[email protected]> wrote:
> >
> > > Since Hadoop <https://hive.apache.org> came out, there have been
> > > various commercial and/or open-source attempts to expose some
> > > compatibility with SQL <http://drill.apache.org>. Obviously by
> > > posting here I am not expecting an unbiased answer.
> > >
> > > Seeking an SQL-on-Hadoop offering which provides: low-latency querying, and
> > > supports the most common CRUD <https://spark.apache.org>, including
> > > [the basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT *
> > > FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE.
> > > Transactional support would be nice also, but is not a must-have.
> > >
> > > Essentially I want a full replacement for the more traditional
> > > RDBMS, one which can scale from 1 node to a serious Hadoop cluster.
> > >
> > > Python is my language of choice for interfacing, however there does
> > > seem to be a Python JDBC wrapper <https://spark.apache.org/sql>.
> > >
> > > Here is what I've found thus far:
> > >
> > >    - Apache Hive <https://hive.apache.org> (SQL-like, with interactive SQL
> > >    thanks to the Stinger initiative)
> > >    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
> > >    - Apache Spark <https://spark.apache.org> (Spark SQL
> > >    <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
> > >    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
> > >    or Parquet <http://parquet.io/>)
> > >    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
> > >    <http://hbase.apache.org>, lacks full transaction
> > >    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
> > >    operators <http://en.wikipedia.org/wiki/Relational_operators> and some
> > >    built-in functions)
> > >    - Cloudera Impala
> > >    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
> > >    (significant HiveQL support, some SQL language support, no support for
> > >    indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT,
> > >    amongst others)
> > >    - Presto <https://github.com/facebook/presto> from Facebook (can query
> > >    Hive, Cassandra <http://cassandra.apache.org>, relational DBs, etc.
> > >    Doesn't seem to be designed for low-latency responses across small
> > >    clusters, or to support UPDATE operations. It is optimized for data
> > >    warehousing or analytics¹
> > >    <http://prestodb.io/docs/current/overview/use-cases.html>)
> > >    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
> > >    community edition <https://www.mapr.com/products/hadoop-download>
> > >    (seems to be a packaging of Hive, HP Vertica
> > >    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
> > >    Drill and a native ODBC wrapper
> > >    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
> > >    - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
> > >    interface and multi-dimensional analysis [OLAP
> > >    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop and
> > >    supports most ANSI SQL query functions". It depends on HDFS, MapReduce,
> > >    Hive and HBase, and seems targeted at very large data-sets though it
> > >    maintains low query latency)
> > >    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard compliance
> > >    with JDBC <http://en.wikipedia.org/wiki/JDBC> driver support [benchmarks
> > >    against Hive and Impala
> > >    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>])
> > >    - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
> > >    Lingual <http://docs.cascading.org/lingual/1.0/>²
> > >    <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual provides
> > >    JDBC Drivers, a SQL command shell, and a catalog manager for publishing
> > >    files [or any resource] as schemas and tables.")
> > >
> > > Which—from this list or elsewhere—would you recommend, and why?
> > > Thanks for all suggestions,
> > >
> > > Samuel Marks
> > > http://linkedin.com/in/samuelmarks
> > >
> >
>
