It seems the metastore Thrift service supports SASL; that's great. So if I understand it correctly, all I need is the metastore Thrift definition to query the metastore. Is the metastore Thrift definition stable across Hive versions? If so, then I can build my app once without worrying about which Hive version is deployed. In that case I admit it's not as bad as I thought. Let's see!
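For reference, a minimal sketch of what that pure-Thrift route might look like, using only the client generated from hive_metastore.thrift. This is untested; the host name and database are hypothetical placeholders, it assumes the usual metastore port 9083 and an unsecured metastore (a kerberized one would additionally need the socket wrapped in a SASL transport before opening it):

    // Sketch only: talk to the metastore with the Thrift-generated client,
    // without HiveMetaStoreClient. Host name is a placeholder.
    import org.apache.hadoop.hive.metastore.api.FieldSchema;
    import org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class MetastoreThriftSketch {
        public static void main(String[] args) throws Exception {
            // 9083 is the usual metastore Thrift port; unsecured transport here.
            TTransport transport = new TSocket("metastore-host.example.com", 9083);
            transport.open();
            ThriftHiveMetastore.Client client =
                    new ThriftHiveMetastore.Client(new TBinaryProtocol(transport));

            // List tables in the "default" database and dump their columns.
            for (String table : client.get_all_tables("default")) {
                System.out.println(table);
                for (FieldSchema col : client.get_fields("default", table)) {
                    System.out.println("  " + col.getName() + ": " + col.getType());
                }
            }
            transport.close();
        }
    }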
On Sat, Jan 31, 2015 at 2:41 PM, Koert Kuipers <ko...@tresata.com> wrote:

> Oh sorry Edward, I misread your post. It seems we agree that "SQL constructs
> inside Hive" are not for other systems.
>
> On Sat, Jan 31, 2015 at 2:38 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> Edward,
>> I would not call "SQL constructs inside Hive" accessible for other
>> systems. It's inside Hive, after all.
>>
>> It is true that I can contact the metastore in Java using
>> HiveMetaStoreClient, but then I need to bring in a whole slew of
>> dependencies (the minimum seems to be hive-metastore, hive-common,
>> hive-shims, libfb303, libthrift and a few Hadoop dependencies, by trial and
>> error). These jars need to be "provided" and added to the classpath on the
>> cluster, unless someone is willing to build versions of an application for
>> every Hive version out there. And even when you do all this you can only
>> pray it's going to be compatible with the next Hive version, since backwards
>> compatibility is... well, let's just say lacking. The attitude seems to be
>> that Hive does not have a Java API, so there is nothing that needs to be
>> stable.
>>
>> You are right that I could go the pure Thrift road. I haven't tried that yet;
>> that might just be the best option. But how easy is it to do this with a
>> secure Hadoop/Hive ecosystem? Now I need to handle Kerberos myself and
>> somehow pass tokens into Thrift, I assume?
>>
>> Contrast all of this with an Avro file on Hadoop with metadata baked in,
>> and I think it's safe to say Hive metadata is not easily accessible.
>>
>> I will take a look at your book. I hope it has an example of using Thrift
>> on a secure cluster to contact the Hive metastore (without using the
>> HiveMetaStoreClient); that would be awesome.
>>
>> On Sat, Jan 31, 2015 at 1:32 PM, Edward Capriolo <edlinuxg...@gmail.com>
>> wrote:
>>
>>> "with the metadata in a special metadata store (not on hdfs), and its
>>> not as easy for all systems to access hive metadata." I disagree.
>>>
>>> Hive's metadata is not only accessible through SQL constructs like
>>> "describe table". The entire metastore is actually a Thrift service,
>>> so you have programmatic access to determine things like what columns
>>> are in a table, etc. Thrift creates RPC clients for almost every
>>> major language.
>>>
>>> In the Programming Hive book
>>> http://www.amazon.com/dp/1449319335/?tag=mh0b-20&hvadid=3521269638&ref=pd_sl_4yiryvbf8k_e
>>> there are even examples where I show how to iterate all the tables inside
>>> the database from a Java client.
>>>
>>> On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers <ko...@tresata.com>
>>> wrote:
>>>
>>>> Yes, you can run whatever you like with the data in HDFS. Keep in mind
>>>> that Hive makes this general access pattern just a little harder, since
>>>> Hive has a tendency to store data and metadata separately, with the
>>>> metadata in a special metadata store (not on HDFS), and it's not as easy
>>>> for all systems to access Hive metadata.
>>>>
>>>> I am not familiar at all with Tajo or Drill.
>>>>
>>>> On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks <samuelma...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks for the advice.
>>>>>
>>>>> Koert: when everything is in the same essential data store (HDFS),
>>>>> can't I just run whatever complex tools in whichever paradigm they like?
>>>>>
>>>>> E.g. GraphX, Mahout, etc.
>>>>>
>>>>> Also, what about Tajo or Drill?
>>>>>
>>>>> Best,
>>>>>
>>>>> Samuel Marks
>>>>> http://linkedin.com/in/samuelmarks
>>>>>
>>>>> PS: Spark-SQL is read-only IIRC, right?
>>>>>
>>>>> On 31 Jan 2015 03:39, "Koert Kuipers" <ko...@tresata.com> wrote:
>>>>>
>>>>>> Since you require high-powered analytics, and I assume you want to
>>>>>> stay sane while doing so, you require the ability to "drop out of SQL"
>>>>>> when needed. So Spark-SQL and Lingual would be my choices.
>>>>>>
>>>>>> Low latency indicates Phoenix or Spark-SQL to me.
>>>>>>
>>>>>> So I would say Spark-SQL.
>>>>>>
>>>>>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks <samuelma...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and
>>>>>>> exposing both JDBC and ODBC interfaces. However, although Pivotal does
>>>>>>> open-source a lot of software <http://www.pivotal.io/oss>, I don't
>>>>>>> believe they open source Pivotal HD: HAWQ.
>>>>>>>
>>>>>>> So that doesn't meet my requirements. I should note that the project
>>>>>>> I am building will also be open-source, which heightens the importance
>>>>>>> of having all components also be open-source.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Samuel Marks
>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>
>>>>>>> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <
>>>>>>> siddharth.tiw...@live.com> wrote:
>>>>>>>
>>>>>>>> Have you looked at HAWQ from Pivotal?
>>>>>>>>
>>>>>>>> Sent from my iPhone
>>>>>>>>
>>>>>>>> On Jan 30, 2015, at 4:27 AM, Samuel Marks <samuelma...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Since Hadoop <http://hadoop.apache.org> came out, there have been
>>>>>>>> various commercial and/or open-source attempts to expose some
>>>>>>>> compatibility with SQL. Obviously, by posting here I am not expecting
>>>>>>>> an unbiased answer.
>>>>>>>>
>>>>>>>> I am seeking an SQL-on-Hadoop offering which provides low-latency
>>>>>>>> querying and supports the most common CRUD operations, including
>>>>>>>> [the basics!] along these lines: CREATE TABLE, INSERT INTO,
>>>>>>>> SELECT * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and
>>>>>>>> DROP TABLE. Transactional support would be nice also, but is not a
>>>>>>>> must-have.
>>>>>>>>
>>>>>>>> Essentially I want a full replacement for the more traditional
>>>>>>>> RDBMS, one which can scale from 1 node to a serious Hadoop cluster.
>>>>>>>>
>>>>>>>> Python is my language of choice for interfacing; however, there does
>>>>>>>> seem to be a Python JDBC wrapper.
>>>>>>>>
>>>>>>>> Here is what I've found thus far:
>>>>>>>>
>>>>>>>> - Apache Hive <https://hive.apache.org> (SQL-like, with interactive
>>>>>>>> SQL thanks to the Stinger initiative)
>>>>>>>> - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>>>>>> - Apache Spark <https://spark.apache.org> (Spark SQL
>>>>>>>> <https://spark.apache.org/sql>, queries only; add data via Hive, RDDs
>>>>>>>> <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>>>>>>> or Parquet <http://parquet.io/>)
>>>>>>>> - Apache Phoenix <http://phoenix.apache.org> (built atop Apache
>>>>>>>> HBase <http://hbase.apache.org>; lacks full transaction
>>>>>>>> <http://en.wikipedia.org/wiki/Database_transaction> support,
>>>>>>>> relational operators <http://en.wikipedia.org/wiki/Relational_operators>
>>>>>>>> and some built-in functions)
>>>>>>>> - Cloudera Impala
>>>>>>>> <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>>>>>>> (significant HiveQL support, some SQL language support, no support
>>>>>>>> for indexes on its tables; importantly missing DELETE, UPDATE and
>>>>>>>> INTERSECT, amongst others)
>>>>>>>> - Presto <https://github.com/facebook/presto> from Facebook (can
>>>>>>>> query Hive, Cassandra <http://cassandra.apache.org>, relational DBs,
>>>>>>>> etc. Doesn't seem to be designed for low-latency responses across
>>>>>>>> small clusters, or to support UPDATE operations. It is optimized for
>>>>>>>> data warehousing and analytics¹
>>>>>>>> <http://prestodb.io/docs/current/overview/use-cases.html>)
>>>>>>>> - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via the
>>>>>>>> MapR community edition <https://www.mapr.com/products/hadoop-download>
>>>>>>>> (seems to be a packaging of Hive, HP Vertica
>>>>>>>> <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
>>>>>>>> Drill and a native ODBC wrapper
>>>>>>>> <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>>>>>>> - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
>>>>>>>> interface and multi-dimensional analysis [OLAP
>>>>>>>> <http://en.wikipedia.org/wiki/OLAP>]; "… offers ANSI SQL on Hadoop
>>>>>>>> and supports most ANSI SQL query functions". It depends on HDFS,
>>>>>>>> MapReduce, Hive and HBase, and seems targeted at very large
>>>>>>>> data-sets, though it maintains low query latency)
>>>>>>>> - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
>>>>>>>> compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver
>>>>>>>> support [benchmarks against Hive and Impala
>>>>>>>> <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>])
>>>>>>>> - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>>>>>>>> Lingual <http://docs.cascading.org/lingual/1.0/>²
>>>>>>>> <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual
>>>>>>>> provides JDBC Drivers, a SQL command shell, and a catalog manager
>>>>>>>> for publishing files [or any resource] as schemas and tables.")
>>>>>>>>
>>>>>>>> Which, from this list or elsewhere, would you recommend, and why?
>>>>>>>> Thanks for all suggestions,
>>>>>>>>
>>>>>>>> Samuel Marks
>>>>>>>> http://linkedin.com/in/samuelmarks
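For comparison with the Thrift-only sketch above, this is roughly what the HiveMetaStoreClient route discussed in the quoted thread could look like. It is only a rough sketch (not the example from the Programming Hive book); the metastore URI and Kerberos principal are hypothetical placeholders, the two security properties are only needed on a secure cluster (with a keytab login or kinit done beforehand), and it needs the hive-metastore, hive-common, hive-shims, libfb303, libthrift and Hadoop jars on the classpath, as discussed:

    // Rough sketch of the HiveMetaStoreClient route; URI and principal are
    // hypothetical placeholders, not values from this thread.
    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.FieldSchema;

    public class MetastoreClientSketch {
        public static void main(String[] args) throws Exception {
            HiveConf conf = new HiveConf();
            conf.set("hive.metastore.uris", "thrift://metastore-host.example.com:9083");
            // Typically needed on a secure cluster, plus a kinit'ed ticket
            // or keytab login before this code runs:
            conf.set("hive.metastore.sasl.enabled", "true");
            conf.set("hive.metastore.kerberos.principal", "hive/_HOST@EXAMPLE.COM");

            HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
            try {
                // Walk every database and table, printing column names and types.
                for (String db : client.getAllDatabases()) {
                    for (String table : client.getAllTables(db)) {
                        System.out.println(db + "." + table);
                        for (FieldSchema col : client.getFields(db, table)) {
                            System.out.println("  " + col.getName() + ": " + col.getType());
                        }
                    }
                }
            } finally {
                client.close();
            }
        }
    }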