1: "SQL constructs inside hive" <--use jdbc driver "describe table" read result set 2: "use thrift" 3: web hcat https://cwiki.apache.org/confluence/display/Hive/WebHCat+InstallWebHCat#WebHCatInstallWebHCat-WebHCatInstalledwithHive 4: Just go the mysql db that backs the metastore and query directly
That gives you 4 ways to get at hive's metadata.

>> "since backwards compatibility is... well lets just say lacking"

Welcome to open source software. Or all software in general, really. All I
am getting at is that there are 4 ways right there to get at the metadata.

>> "but how easy is it to do this with a secure hadoop/hive ecosystem? now
>> i need to handle kerberos myself and somehow pass tokens into thrift i
>> assume?"

Frankly I do not give a crud about the "secure bla bla", but I have seen
several tickets on thrift/sasl so I assume someone does. My only point was
that hive gives you 4 ways to get at the metadata, which is better than,
say, mysql or vertica, which only really give you option #1 over jdbc.

Hive actually works with avro formats, where it can read the schema from
the data https://cwiki.apache.org/confluence/display/Hive/AvroSerDe -- so
other than pointing your "table" at a folder, the metadata is magic. Which
is what you are basically describing. So again, it depends on your
definition of easily accessible. But the fact that I have a thrift API
which I can use to walk through the tables in a database seems more
accessible than many other databases I am aware of.

On Sat, Jan 31, 2015 at 2:38 PM, Koert Kuipers <ko...@tresata.com> wrote:

> edward,
> i would not call "SQL constructs inside hive" accessible for other
> systems. it's inside hive after all.
>
> it is true that i can contact the metastore in java using
> HiveMetaStoreClient, but then i need to bring in a whole slew of
> dependencies (the minimum seems to be hive-metastore, hive-common,
> hive-shims, libfb303, libthrift and a few hadoop dependencies, by trial
> and error). these jars need to be "provided" and added to the classpath
> on the cluster, unless someone is willing to build versions of an
> application for every hive version out there. and even when you do all
> this you can only pray it's going to be compatible with the next hive
> version, since backwards compatibility is... well let's just say lacking.
> the attitude seems to be that hive does not have a java api, so there is
> nothing that needs to be stable.
>
> you are right that i could go the pure thrift road. i haven't tried that
> yet. that might just be the best option. but how easy is it to do this
> with a secure hadoop/hive ecosystem? now i need to handle kerberos myself
> and somehow pass tokens into thrift, i assume?
>
> contrast all of this with an avro file on hadoop with the metadata baked
> in, and i think it's safe to say hive metadata is not easily accessible.
>
> i will take a look at your book. i hope it has an example of using thrift
> on a secure cluster to contact the hive metastore (without using
> HiveMetaStoreClient), that would be awesome.
>
> On Sat, Jan 31, 2015 at 1:32 PM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>> "with the metadata in a special metadata store (not on hdfs), and its
>> not as easy for all systems to access hive metadata." I disagree.
>>
>> Hive's metadata is not only accessible through SQL constructs like
>> "describe table". The entire metastore is also a thrift service, so you
>> have programmatic access to determine things like what columns are in a
>> table, etc. Thrift generates RPC clients for almost every major
>> language.
>>
>> In the Programming Hive book http://www.amazon.com/dp/1449319335/ there
>> are even examples where I show how to iterate over all the tables
>> inside the database from a java client.
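>>
>> As a rough sketch of that (the metastore uri and database name here are
>> assumptions for illustration, not the book's example verbatim):
>>
>>     import org.apache.hadoop.hive.conf.HiveConf;
>>     import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
>>
>>     public class ListTables {
>>         public static void main(String[] args) throws Exception {
>>             HiveConf conf = new HiveConf();
>>             // Point the client at the metastore thrift service.
>>             conf.setVar(HiveConf.ConfVars.METASTOREURIS,
>>                         "thrift://localhost:9083");
>>             HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
>>             try {
>>                 // Walk every table in the "default" database.
>>                 for (String table : client.getAllTables("default")) {
>>                     System.out.println(table);
>>                 }
>>             } finally {
>>                 client.close();
>>             }
>>         }
>>     }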
>>
>> On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers <ko...@tresata.com>
>> wrote:
>>
>>> yes you can run whatever you like with the data in hdfs. keep in mind
>>> that hive makes this general access pattern just a little harder, since
>>> hive has a tendency to store data and metadata separately, with the
>>> metadata in a special metadata store (not on hdfs), and it's not as
>>> easy for all systems to access hive metadata.
>>>
>>> i am not familiar at all with tajo or drill.
>>>
>>> On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks <samuelma...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for the advice
>>>>
>>>> Koert: when everything is in the same essential data-store (HDFS),
>>>> can't I just run whatever complex tools I want, in whichever paradigm
>>>> I like?
>>>>
>>>> E.g.: GraphX, Mahout &etc.
>>>>
>>>> Also, what about Tajo or Drill?
>>>>
>>>> Best,
>>>>
>>>> Samuel Marks
>>>> http://linkedin.com/in/samuelmarks
>>>>
>>>> PS: Spark-SQL is read-only IIRC, right?
>>>>
>>>> On 31 Jan 2015 03:39, "Koert Kuipers" <ko...@tresata.com> wrote:
>>>>
>>>>> since you require high-powered analytics, and i assume you want to
>>>>> stay sane while doing so, you require the ability to "drop out of
>>>>> sql" when needed. so spark-sql and lingual would be my choices.
>>>>>
>>>>> low latency indicates phoenix or spark-sql to me.
>>>>>
>>>>> so i would say spark-sql.
>>>>>
>>>>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks <samuelma...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and
>>>>>> its exposing both JDBC and ODBC interfaces. However, although
>>>>>> Pivotal does open-source a lot of software
>>>>>> <http://www.pivotal.io/oss>, I don't believe they open-source
>>>>>> Pivotal HD: HAWQ.
>>>>>>
>>>>>> So that doesn't meet my requirements. I should note that the project
>>>>>> I am building will also be open-source, which heightens the
>>>>>> importance of having all components also be open-source.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Samuel Marks
>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>
>>>>>> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <
>>>>>> siddharth.tiw...@live.com> wrote:
>>>>>>
>>>>>>> Have you looked at HAWQ from Pivotal?
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>> On Jan 30, 2015, at 4:27 AM, Samuel Marks <samuelma...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Since Hadoop came out, there have been various commercial and/or
>>>>>>> open-source attempts to expose some compatibility with SQL.
>>>>>>> Obviously by posting here I am not expecting an unbiased answer.
>>>>>>>
>>>>>>> I am seeking an SQL-on-Hadoop offering which provides low-latency
>>>>>>> querying and supports the most common CRUD operations, including
>>>>>>> [the basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT
>>>>>>> * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE
>>>>>>> (see the sketch below). Transactional support would be nice also,
>>>>>>> but is not a must-have.
>>>>>>>
>>>>>>> Essentially I want a full replacement for the more traditional
>>>>>>> RDBMS, one which can scale from 1 node to a serious Hadoop cluster.
>>>>>>>
>>>>>>> Python is my language of choice for interfacing; however, there
>>>>>>> does seem to be a Python JDBC wrapper.
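>>>>>>>
>>>>>>> As a concrete sketch of that round trip (the jdbc:hive2:// URL,
>>>>>>> driver class and table name are placeholders; whether each
>>>>>>> statement actually runs depends on the engine, and UPDATE/DELETE
>>>>>>> are exactly what several of these engines lack):
>>>>>>>
>>>>>>>     import java.sql.Connection;
>>>>>>>     import java.sql.DriverManager;
>>>>>>>     import java.sql.Statement;
>>>>>>>
>>>>>>>     public class CrudSmokeTest {
>>>>>>>         public static void main(String[] args) throws Exception {
>>>>>>>             // Placeholder driver and endpoint; swap in the engine
>>>>>>>             // under evaluation (Hive, Phoenix, Tajo, Lingual, ...).
>>>>>>>             Class.forName("org.apache.hive.jdbc.HiveDriver");
>>>>>>>             try (Connection conn = DriverManager.getConnection(
>>>>>>>                      "jdbc:hive2://localhost:10000/default");
>>>>>>>                  Statement stmt = conn.createStatement()) {
>>>>>>>                 stmt.execute("CREATE TABLE t (c1 INT)");
>>>>>>>                 stmt.execute("INSERT INTO t VALUES (1)");
>>>>>>>                 stmt.executeQuery("SELECT * FROM t");
>>>>>>>                 stmt.execute("UPDATE t SET c1 = 2 WHERE c1 = 1");
>>>>>>>                 stmt.execute("DELETE FROM t WHERE c1 = 2");
>>>>>>>                 stmt.execute("DROP TABLE t");
>>>>>>>             }
>>>>>>>         }
>>>>>>>     }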
>>>>>>>
>>>>>>> Here is what I've found thus far:
>>>>>>>
>>>>>>> - Apache Hive <https://hive.apache.org> (SQL-like, with interactive
>>>>>>> SQL thanks to the Stinger initiative)
>>>>>>> - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>>>>> - Apache Spark <https://spark.apache.org> (Spark SQL
>>>>>>> <https://spark.apache.org/sql>, queries only; add data via Hive, RDD
>>>>>>> <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>>>>>> or Parquet <http://parquet.io/>)
>>>>>>> - Apache Phoenix <http://phoenix.apache.org> (built atop Apache
>>>>>>> HBase <http://hbase.apache.org>; lacks full transaction
>>>>>>> <http://en.wikipedia.org/wiki/Database_transaction> support,
>>>>>>> relational operators
>>>>>>> <http://en.wikipedia.org/wiki/Relational_operators> and some
>>>>>>> built-in functions)
>>>>>>> - Cloudera Impala
>>>>>>> <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>>>>>> (significant HiveQL support, some SQL language support, no support
>>>>>>> for indexes on its tables; importantly missing DELETE, UPDATE and
>>>>>>> INTERSECT, amongst others)
>>>>>>> - Presto <https://github.com/facebook/presto> from Facebook (can
>>>>>>> query Hive, Cassandra <http://cassandra.apache.org>, relational DBs
>>>>>>> &etc. Doesn't seem to be designed for low-latency responses across
>>>>>>> small clusters, or to support UPDATE operations. It is optimized
>>>>>>> for data warehousing or analytics¹
>>>>>>> <http://prestodb.io/docs/current/overview/use-cases.html>)
>>>>>>> - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via the
>>>>>>> MapR community edition
>>>>>>> <https://www.mapr.com/products/hadoop-download> (seems to be a
>>>>>>> packaging of Hive, HP Vertica
>>>>>>> <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
>>>>>>> Drill and a native ODBC wrapper
>>>>>>> <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>>>>>> - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
>>>>>>> interface and multi-dimensional analysis [OLAP
>>>>>>> <http://en.wikipedia.org/wiki/OLAP>]; "… offers ANSI SQL on Hadoop
>>>>>>> and supports most ANSI SQL query functions". It depends on HDFS,
>>>>>>> MapReduce, Hive and HBase, and seems targeted at very large
>>>>>>> data-sets, though it maintains low query latency)
>>>>>>> - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
>>>>>>> compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver
>>>>>>> support [benchmarks against Hive and Impala
>>>>>>> <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>])
>>>>>>> - Cascading
>>>>>>> <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s Lingual
>>>>>>> <http://docs.cascading.org/lingual/1.0/>²
>>>>>>> <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual
>>>>>>> provides JDBC Drivers, a SQL command shell, and a catalog manager
>>>>>>> for publishing files [or any resource] as schemas and tables.")
>>>>>>>
>>>>>>> Which—from this list or elsewhere—would you recommend, and why?
>>>>>>>
>>>>>>> Thanks for all suggestions,
>>>>>>>
>>>>>>> Samuel Marks
>>>>>>> http://linkedin.com/in/samuelmarks