Re: NPE when reading Parquet using Hive on Tez

2016-02-02 Thread Adam Hunt
Hi Gopal, With the release of 0.8.2, I thought I would give Tez another shot. Unfortunately, I got the same NPE. I dug a little deeper and it appears that the configuration property "columns.types", which is used by org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(), is not

Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Mich Talebzadeh
Hi, My understanding is that with the Hive on Spark engine, one gets the Hive optimizer and the Spark query engine. With Spark using the Hive metastore, Spark does both the optimization and the query execution. The only value add is that one can access the underlying Hive tables from spark-sql etc. Is

Re: NPE when reading Parquet using Hive on Tez

2016-02-02 Thread Gopal Vijayaraghavan
> I dug a little deeper and it appears that the configuration property "columns.types", which is used by org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(), is not being set. When I manually set that property in hive, your example works fine.
Good to know more about the

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Xuefu Zhang
When comparing performance, you need to compare apples to apples. In another thread, you mentioned that Hive on Spark is much slower than Spark SQL. However, you configured Hive such that only two tasks can run in parallel, and you didn't provide information on how much Spark SQL is

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Xuefu Zhang
I think the difference is not only about which one does the optimization but more about feature parity. Hive on Spark offers all the functional features that Hive offers, and these features play out faster. However, Spark SQL is far from offering this parity as far as I know.
On Tue, Feb 2, 2016 at 2:38 PM, Mich

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Mich Talebzadeh
Thanks Jeff. Obviously Hive is much more feature rich compared to Spark. Having said that, in certain areas, for example where the SQL feature is available in Spark, Spark seems to deliver faster. This may be because:
1. Spark does both the optimisation and execution seamlessly
2. Hive

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Philip Lee
From my experience, Spark SQL has its own optimizer to support Hive queries and the metastore. After Spark 1.5.2, its optimizer is named Catalyst.
On Feb 3, 2016 at 12:12 AM, "Xuefu Zhang" wrote:
> I think the diff is not only about which does optimization but more on feature parity.

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Mich Talebzadeh
Hi, Are you referring to spark-shell with Scala, Python and others? Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Mich Talebzadeh
Hi Jeff, In below …. You should be able to see the resource usage in YARN resource manager URL. Just to be clear, we are talking about port 8088/cluster?

Re: GenericUDF

2016-02-02 Thread Jason Dere
- Created once when registering the function to the FunctionRegistry.
- The UDF is copied from the version in the registry during query compilation.
- The query plan is serialized, then deserialized by the tasks during query execution, which constructs another instance of the UDF.
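The three stages listed above can be mimicked without any Hive dependency. The following is only a rough sketch: `CountingUdf` and `UdfLifecycleSketch` are hypothetical stand-ins for a GenericUDF subclass and for Hive's registration, compilation and task stages, and the task-side rebuild is modelled as a plain construction rather than real plan deserialization.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a GenericUDF subclass; it has no Hive dependency.
class CountingUdf {
    static final AtomicInteger CONSTRUCTED = new AtomicInteger();

    CountingUdf() {
        CONSTRUCTED.incrementAndGet();
    }

    // Stand-in for copying the registered UDF during query compilation.
    CountingUdf copy() {
        return new CountingUdf();
    }
}

class UdfLifecycleSketch {
    // Walks through the three stages and returns how many instances were built.
    static int simulate() {
        CountingUdf.CONSTRUCTED.set(0);
        CountingUdf registered = new CountingUdf();  // 1. function registration
        CountingUdf compiled = registered.copy();    // 2. query compilation
        // 3. A task rebuilds the UDF from the serialized plan; modelled here
        //    as a fresh construction, not as real plan deserialization.
        CountingUdf inTask = new CountingUdf();
        return CountingUdf.CONSTRUCTED.get();
    }

    public static void main(String[] args) {
        System.out.println("constructor calls: " + simulate());
    }
}
```

Under these assumptions the counter ends at three, one construction per stage. The sketch says nothing about how often initialize is called.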

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers
uuuhm with spark using Hive metastore you actually have a real programming environment and you can write real functions, versus just being boxed into some version of sql and limited udfs?
On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang wrote:
> When comparing the performance,

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Ryan Harris
https://github.com/myui/hivemall as long as you are comfortable with java UDFs, the sky is really the limit...it's not for everyone and spark does have many advantages, but they are two tools that can complement each other in numerous ways. I don't know that there is necessarily a universal

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Jörn Franke
Check HiveMall
> On 03 Feb 2016, at 05:49, Koert Kuipers wrote:
> yeah but have you ever seen someone write a real analytical program in hive? how? where are the basic abstractions to wrap up a large amount of operations (joins, groupby's) into a single function

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers
yes. the ability to start with sql but when needed expand into more full blown programming languages, machine learning etc. is a huge plus. after all this is a cluster, and just querying or extracting data to move it off the cluster into some other analytics tool is going to be very inefficient

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers
yeah but have you ever seen someone write a real analytical program in hive? how? where are the basic abstractions to wrap up a large amount of operations (joins, groupby's) into a single function call? where are the tools to write nice unit tests for that? for example in spark i can write a

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers
ok i am sure there is some way to do it. i am going to guess snippets of hive code stuck together with oozie jobs or whatever. the oozie jobs become the re-usable pieces perhaps? now you got sql and xml, completely lacking any benefits of a compiler to catch errors. unit tests will be slow if even

Re: GenericUDF

2016-02-02 Thread Anirudh Paramshetti
Thanks Jason for your input. I believe you are talking about the number of instances created, which explains why the constructor was called three times. But I'm still unclear about the two calls made to the initialize method when I use the temporary function in the query. Can you shed some more light

Hive Query Timeout in hive-jdbc

2016-02-02 Thread Satya Harish Appana
Hi Team, I am trying to connect to HiveServer via hive-jdbc. Can we configure a client-side timeout for each query executed inside each JDBC connection? (When I looked at the HiveStatement.setQueryTimeout method, it says the operation is unsupported.) Is there any other way of timing out and cancelling the

GenericUDF

2016-02-02 Thread Anirudh Paramshetti
Hi, I have written a custom UDF in Java extending the GenericUDF class. I have some print statements in the constructor and initialize method, to understand the number of calls made to them. From what I have read about GenericUDF, I was expecting the constructor and initialize method to be

Re: Hive Query Timeout in hive-jdbc

2016-02-02 Thread 董亚军
Hive does not support a timeout on the client side, and I don't think one is recommended: if the client exits with a timeout exception, the HiveServer side may still be running the job, which will result in an inconsistent state.
On Tue, Feb 2, 2016 at 4:49 PM, Satya Harish Appana <
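To make the caveat above concrete, here is a minimal sketch of a purely client-side timeout. It is not a Hive feature: `ClientTimeoutSketch`, `runWithTimeout` and the `cancelHook` parameter are hypothetical names, and the query is represented by an ordinary `Callable`. With a real JDBC statement the hook might call `java.sql.Statement.cancel()`, but whether the driver and server honour that call is an assumption that would need checking for the Hive version in use.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ClientTimeoutSketch {
    // Runs `query` on a worker thread. If it takes longer than `timeoutMs`,
    // runs `cancelHook`, interrupts the worker and rethrows the timeout.
    static <T> T runWithTimeout(Callable<T> query, Runnable cancelHook, long timeoutMs)
            throws Exception {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<T> result = worker.submit(query);
        try {
            return result.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            cancelHook.run();     // e.g. stmt.cancel() on a real JDBC statement
            result.cancel(true);  // interrupt the worker thread
            throw e;
        } finally {
            worker.shutdownNow();
        }
    }
}
```

The timeout only frees the client. Unless the cancel hook really stops the work on the server, the job may keep running there, which is exactly the inconsistent state described above.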

Re: Hive Query Timeout in hive-jdbc

2016-02-02 Thread Loïc Chanel
Actually, Hive doesn't support timeout, but Tez and MapReduce do. Therefore, you can set a timeout on these tools to kill failed queries. Hope this helps, Loïc Loïc CHANEL System & virtualization engineer TO - XaaS Ind - Worldline (Villeurbanne, France) 2016-02-02 11:10 GMT+01:00 董亚军

Re: Hive Query Timeout in hive-jdbc

2016-02-02 Thread Satya Harish Appana
The queries I am running over Hive JDBC are DDL statements (none of them are SELECT or INSERT, which would result in an execution engine (Tez/MR) job being launched; all the queries are CREATE EXTERNAL TABLE .., DROP TABLE .. and ALTER TABLE ADD PARTITION).
On Tue, Feb 2, 2016 at 3:54 PM,

Re: ORC format

2016-02-02 Thread Lefty Leverenz
Can't resist teasing Mich about this: "Indeed one often demoralises data taking advantages of massive parallel processing in Hive." Surely he meant "denormalizes". Nobody would want to demoralise their data -- performance would suffer. ;) -- Lefty

Re: Hive Query Timeout in hive-jdbc

2016-02-02 Thread Loïc Chanel
Then indeed Tez and MR timeout won't be any help, sorry. I would be very interested in your solution though. Regards, Loïc
2016-02-02 11:27 GMT+01:00 Satya Harish Appana

How to run multiple queries from one tool

2016-02-02 Thread Riesland, Zack
I'm sure this is a total rookie question, but I'm months into using Hive and it hasn't become obvious to me yet: When I use a tool like Aqua Data Studio and point it at an MSSQL Server database, I can run multiple queries, separated by a semicolon character ';'. So: select blah from blah where
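One possible client-side workaround, when a tool or driver accepts only one statement per call, is to split the script on semicolons and submit the statements one at a time. The sketch below is an assumption about how that could be done and is not something the thread settled on. `StatementSplitterSketch` is a hypothetical helper that ignores semicolons inside single- or double-quoted literals but does not handle escaped quotes or SQL comments.

```java
import java.util.ArrayList;
import java.util.List;

class StatementSplitterSketch {
    // Splits a script on ';' while ignoring semicolons inside quoted literals.
    static List<String> split(String script) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char quote = 0; // 0 means "not inside a quoted literal"
        for (char c : script.toCharArray()) {
            if (quote != 0) {
                if (c == quote) quote = 0;   // closing quote
                current.append(c);
            } else if (c == '\'' || c == '"') {
                quote = c;                   // opening quote
                current.append(c);
            } else if (c == ';') {
                addIfNotBlank(statements, current);
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        addIfNotBlank(statements, current);  // trailing statement without ';'
        return statements;
    }

    private static void addIfNotBlank(List<String> out, StringBuilder sb) {
        String s = sb.toString().trim();
        if (!s.isEmpty()) out.add(s);
    }
}
```

Each returned string could then be passed to a separate execute call on the connection.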

Re: Hive table over S3 bucket with s3a

2016-02-02 Thread Terry Siu
Yeah, that’s what I thought. I found this: https://issues.apache.org/jira/browse/HADOOP-3733. Posted a couple of questions there, but prior to that, the last comment was over a year ago. Thanks for the response! -Terry From: Elliot West > Reply-To:

RE: ORC format

2016-02-02 Thread Mich Talebzadeh
You are welcome, Phil

RE: ORC format

2016-02-02 Thread Mich Talebzadeh
Correct :). Lord knows how these spell checkers work sometimes! Perish the thought of demoralising the data. Regards,

Hive table over S3 bucket with s3a

2016-02-02 Thread Terry Siu
Hi, I’m wondering if anyone has found a workaround for defining a Hive table over an S3 bucket when the secret access key has ‘/’ characters in it. I’m using Hive 0.14 in HDP 2.2.4 and the statement that I used is: CREATE EXTERNAL TABLE IF NOT EXISTS s3_foo ( key INT, value STRING ) ROW

Re: ORC format

2016-02-02 Thread Philip Lee
I really appreciate what you told me through this mailing list. Best, Phil
On Tue, Feb 2, 2016 at 12:16 PM, Mich Talebzadeh wrote:
> Correct :).
> Lord knows how these spell checkers work sometime! Perish the thought of demoralising the data.
> Regards,

Re: Hive table over S3 bucket with s3a

2016-02-02 Thread Elliot West
When I last looked at this it was recommended to simply regenerate the key as you suggest.
On 2 February 2016 at 15:52, Terry Siu wrote:
> Hi,
> I’m wondering if anyone has found a workaround for defining a Hive table over a S3 bucket when the secret access key has ‘/’

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Edward Capriolo
Hive has numerous extension points; you are not boxed in by a long shot.
On Tuesday, February 2, 2016, Koert Kuipers wrote:
> uuuhm with spark using Hive metastore you actually have a real programming environment and you can write real functions, versus just being boxed

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Xuefu Zhang
Yes, regardless of which Spark mode you're running in, from the Spark AM web UI you should be able to see how many tasks are running concurrently. I'm a little surprised to see that your Hive configuration only allows 2 map tasks to run in parallel. If your cluster has the capacity, you should parallelize