Next gen metastore

2022-04-02 Thread Edward Capriolo
While not as active in the development community these days, I have been using
Hive in the field, as well as Spark and Impala, for some time.

My anecdotal opinion is that the current metastore needs a significant
rewrite to deal with "next generation" workloads. By next generation I
actually mean last generation.

Currently Cloudera's Impala advice is: no more than 1k rows in a table, and
tables with lots of partitions are problematic.

That really "won't get it done" at the "new" web scale. HiveServer can have
memory problems with tables with 2k columns and 5k partitions.

It feels like design assumptions such as "surely we can fetch all the columns
of a table in one go" don't make sense universally.

Amazon has Glue, which can scale to Amazon scale. The Hive metastore can't
even really scale to a single organization. So what are the next steps? I
don't think it's as simple as "move it to NoSQL"; I think it has to be
reworked from the ground up.


-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: How can I know use execute or executeQuery

2021-09-15 Thread Edward Capriolo
In most SQL drivers, you can always use executeQuery even if the query has
no result set.
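
If you do not want to guess at all, plain JDBC also has a generic path:
Statement.execute() returns true when the statement produced a ResultSet and
false when there is only an update count. A minimal sketch along those lines
(the jdbc:hive2 URL is just a placeholder for your own HiveServer2):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RunAnySql {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; point this at your own HiveServer2.
        String url = "jdbc:hive2://localhost:10000/default";
        String sql = args.length > 0 ? args[0] : "show tables";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // execute() accepts any statement type: it returns true when the
            // statement produced a ResultSet, false when there is only an
            // update count (or nothing at all).
            boolean hasResultSet = stmt.execute(sql);
            if (hasResultSet) {
                try (ResultSet rs = stmt.getResultSet()) {
                    int cols = rs.getMetaData().getColumnCount();
                    while (rs.next()) {
                        StringBuilder row = new StringBuilder();
                        for (int i = 1; i <= cols; i++) {
                            if (i > 1) row.append('\t');
                            row.append(rs.getString(i));
                        }
                        System.out.println(row);
                    }
                }
            } else {
                System.out.println("No result set; update count = " + stmt.getUpdateCount());
            }
        }
    }
}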

On Wednesday, September 15, 2021, Alessandro Solimando <
alessandro.solima...@gmail.com> wrote:

> Hi Igyu,
> sending different SQL statements is exactly what beeline has to handle,
> I'd have a look at how they handle this:
> https://github.com/apache/hive/tree/master/beeline (this
> seems a good starting point).
>
> HTH,
> Alessandro
>
> On Wed, 15 Sept 2021 at 03:50, igyu  wrote:
>
>> The SQL is written by the user.
>> The user can write "show tables", "use db", or "select * from table",
>> so I don't know the SQL before it is sent to the server.
>> When the SQL reaches the server, how can I know whether to use execute or executeQuery?
>>
>> There are too many SQL statement types.
>>
>> --
>> igyu
>>
>

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: [EXTERNAL] Re: Any plan for new hive 3 or 4 release?

2021-03-11 Thread Edward Capriolo
"My hope has been that Hive 4.x would be built on Java 11.  However, I've
hit many stumbling blocks over the past year towards this goal."

There is not much value in holding back a release.

Funny life story: I work at a bank and there are some people still on Java 7,
which was end-of-life in 2015. I can't let the "lowest common denominator"
be a blocker for everything. If you want to support Java 7, you have to use
a log4j from 7 years ago and a Mockito from 8. You have to make weird
interfaces and shims for everything you want to be newer. I made a costly
mistake because the Javadoc for something had the return type backwards;
that is what I get for reading Javadoc from 6 years ago.

For something like Druid or Hive on Spark that is aging: move it to contrib,
give folks some time to fix it, and if no one maintains it, toss it. The
number of people using Druid can't be that large. I see a lot of debates
like this: Hive-on-Spark, Druid, whatever.

Latest News

   - Spark 3.1.1 released
   <http://spark.apache.org/news/spark-3-1-1-released.html> (Mar 02, 2021)
   - Spark 3.0.2 released
   <http://spark.apache.org/news/spark-3-0-2-released.html> (Feb 19, 2021)
   - Next official release: Spark 3.1.1
   <http://spark.apache.org/news/next-official-release-spark-3.1.1.html> (Jan
   07, 2021)
   - Spark 2.4.7 released
   <http://spark.apache.org/news/spark-2-4-7-released.html> (Sep 12, 2020)

News <http://hive.apache.org/downloads.html#news>

17 January 2021: release 2.3.8 available
<http://hive.apache.org/downloads.html#17-january-2021-release-238-available>
This release works with Hadoop 2.x.y. You can look at the complete JIRA
change log for this release
<https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12349428=Text=12310843>.

18 April 2020: release 2.3.7 available
<http://hive.apache.org/downloads.html#18-april-2020-release-237-available>
This release works with Hadoop 2.x.y. You can look at the complete JIRA
change log for this release
<https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12346056=Text=12310843>.

26 August 2019: release 3.1.2 available
<http://hive.apache.org/downloads.html#26-august-2019-release-312-available>
This release works with Hadoop 3.x.y. You can look at the complete JIRA
change log for this release
<https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12344397=Html=12310843>.

23 August 2019: release 2.3.6 available
<http://hive.apache.org/downloads.html#23-august-2019-release-236-available>
This release works with Hadoop 2.x.y. You can look at the complete JIRA
change log for this release
<https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12345603=Text=12310843>.

14 May 2019: release 2.3.5 available
<http://hive.apache.org/downloads.html#14-may-2019-release-235-available>
This release works with Hadoop 2.x.y. You can look at the complete JIRA
change log for this release
<https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12345394=Text=12310843>.

7 November 2018: release 2.3.4 available



On Sat, Feb 27, 2021 at 10:02 PM David  wrote:

> Hello,
>
> My hope has been that Hive 4.x would be built on Java 11.  However, I've
> hit many stumbling blocks over the past year towards this goal.  I've been
> able to make some progress, but several things are still stuck.  It mostly
> stems from the fact that hive has many big-ticket dependencies like HDFS,
> Kafka, Druid, HBase and those are in different degrees of Java 11
> readiness.  I got close, but druid has some very outdated dependencies that
> clash with Hive and I got into a game of updating a dependency broke one
> project, downgrading it broke a different project.
>
> https://github.com/apache/druid/pull/10683
>
> HDFS-15790
>
> On Sat, Feb 27, 2021, 7:31 PM Matt McCline
>  wrote:
>
>> Yes to Hive 4 release. Plenty of changes (1,500+).
>> Yes to regular release cadence (e.g. 3 month).
>>
>> -Original Message-
>> From: Edward Capriolo 
>> Sent: Saturday, February 27, 2021 12:16 PM
>> To: Michel Sumbul 
>> Cc: d...@hive.apache.org; user@hive.apache.org
>> Subject: [EXTERNAL] Re: Any plan for new hive 3 or 4 release?
>>
>> The challenge is the vendors. They almost always want to tie a release to
>> some offering of theirs.
>>
>> Healthy software is released all the time. Just ship it.
>>
>> Call a vote and propose a release. I'll +1 it if the tests pass!
>>
>>
>> On Friday, February 26, 2021, Michel Sumbul 
>> wrote:
>>
>> > It will be amazing if the community could produce a release every
>> > quarter/6months. :-)
>> >
>> > On Fri, 26 Feb 2021 at 14:30, Edward Capriolo 
>> > wrote:

Re: Any plan for new hive 3 or 4 release?

2021-02-27 Thread Edward Capriolo
The challenge is the vendors. They almost always want to tie a release to
some offering of theirs.

Healthy software is released all the time. Just ship it.

Call a vote and propose a release. I'll +1 it if the tests pass!


On Friday, February 26, 2021, Michel Sumbul  wrote:

> It will be amazing if the community could produce a release every
> quarter/6months. :-)
>
> On Fri, 26 Feb 2021 at 14:30, Edward Capriolo  wrote:
>
>> Hive was releasable trunk for the longest time. Facebook days. Then the
>> big data vendors got more involved. Then it became a pissing match about
>> features. This vendor likes Tez, that vendor doesn't; this vendor likes
>> Hive on Spark, that one doesn't.
>>
>> Then this vendor wants to tell everyone Hive stinks, use Impala. Then this
>> vendor acquired that vendor...
>>
>> The best thing for hive is to have one branch master and do quarterly
>> releases.
>>
>>
>>
>> On Friday, February 26, 2021, Peter Vary 
>> wrote:
>>
>>> Hi Lee,
>>>
>>> When I started to work on Hive around 4 years ago, MR was already set as
>>> deprecated. So you definitely should scan even older archives.
>>>
>>> For Iceberg integration, it would be good to have more frequent releases
>>> for Hive as well.
>>>
>>> Thanks, Peter
>>>
>>>
>>>
>>> Lee Ming-Ta  wrote (on Wed, 24 Feb 2021, 4:34):
>>>
>>> > Dear all,
>>> >
>>> > I probably didn't follow that much and would like to ask if anyone can
>>> > point me to some resources about the reason to remove MR?
>>> > Or what kind of keywords to search for on Google?
>>> >
>>> > Thank you very much! Wish everyone a happy Lunar New Year.
>>> >
>>> > --
>>> > *From:* Mass Dosage 
>>> > *Date:* 23 February 2021, 9:49 PM
>>> > *To:* d...@hive.apache.org 
>>> > *Cc:* Michel Sumbul ; user@hive.apache.org <
>>> > user@hive.apache.org>
>>> > *Subject:* Re: Any plan for new hive 3 or 4 release?
>>> >
>>> > I would love to see a Hive 3.1 release which is capable of being used
>>> on
>>> > Java 11 like Hive 2 is.
>>> >
>>> > What is the main difference going to be between Hive 3 and 4? The
>>> removal
>>> > of MR?
>>> >
>>> > On Mon, 22 Feb 2021 at 16:46, Zoltan Haindrich  wrote:
>>> >
>>> > Hey Michel!
>>> >
>>> > Yes it was a long time ago we had a release; we have quite a few new
>>> > features in master.
>>> > I think we are scaring people for some time now that we will be
>>> dropping
>>> > MR support...I think we should do that.
>>> >
>>> > I would really like to see a new Hive release in the near future as
>>> well -
>>> > there is no way for users to even try out new features.
>>> > I was planning to add nightly builds to package the latest master's
>>> state
>>> > into a deployable artifact - I think a service like may help pretest
>>> our
>>> > next release; I think it
>>> > won't take much to do it so I'll probably throw it together in the next
>>> > couple days!
>>> >
>>> > cheers,
>>> > Zoltan
>>> >
>>> > On 2/21/21 2:27 PM, Michel Sumbul wrote:
>>> > > Hi Guys,
>>> > >
>>> > > If I'm not wrong, the last release of Hive 3.x is 18 months old.
>>> > > I wanted to ask if you had any roadmap / plan to release a new
>>> version of
>>> > > Hive 3.x or Hive 4?
>>> > >
>>> > > Thanks,
>>> > > Michel
>>> > >
>>> >
>>> >
>>>
>>
>>
>> --
>> Sorry this was sent from mobile. Will do less grammar and spell check
>> than usual.
>>
>

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: Any plan for new hive 3 or 4 release?

2021-02-26 Thread Edward Capriolo
Hive was releasable trunk for the longest time. Facebook days. Then the big
data vendors got more involved. Then it became a pissing match about
features. This vendor likes Tez, that vendor doesn't; this vendor likes Hive
on Spark, that one doesn't.

Then this vendor wants to tell everyone Hive stinks, use Impala. Then this
vendor acquired that vendor...

The best thing for hive is to have one branch master and do quarterly
releases.



On Friday, February 26, 2021, Peter Vary  wrote:

> Hi Lee,
>
> When I started to work on Hive around 4 years ago, MR was already set as
> deprecated. So you definitely should scan even older archives.
>
> For Iceberg integration, it would be good to have more frequent releases
> for Hive as well.
>
> Thanks, Peter
>
>
>
> Lee Ming-Ta  wrote (on Wed, 24 Feb 2021, 4:34):
>
> > Dear all,
> >
> > I probably didn't follow that much and would like to ask if anyone can
> > point me to some resources about the reason to remove MR?
> > Or what kind of keywords to search for on Google?
> >
> > Thank you very much! Wish everyone a happy Lunar New Year.
> >
> > --
> > *From:* Mass Dosage 
> > *Date:* 23 February 2021, 9:49 PM
> > *To:* d...@hive.apache.org 
> > *Cc:* Michel Sumbul ; user@hive.apache.org <
> > user@hive.apache.org>
> > *Subject:* Re: Any plan for new hive 3 or 4 release?
> >
> > I would love to see a Hive 3.1 release which is capable of being used on
> > Java 11 like Hive 2 is.
> >
> > What is the main difference going to be between Hive 3 and 4? The removal
> > of MR?
> >
> > On Mon, 22 Feb 2021 at 16:46, Zoltan Haindrich  wrote:
> >
> > Hey Michel!
> >
> > Yes it was a long time ago we had a release; we have quite a few new
> > features in master.
> > I think we are scaring people for some time now that we will be dropping
> > MR support...I think we should do that.
> >
> > I would really like to see a new Hive release in the near future as well
> -
> > there is no way for users to even try out new features.
> > I was planning to add nightly builds to package the latest master's state
> > into a deployable artifact - I think a service like may help pretest our
> > next release; I think it
> > won't take much to do it so I'll probably throw it together in the next
> > couple days!
> >
> > cheers,
> > Zoltan
> >
> > On 2/21/21 2:27 PM, Michel Sumbul wrote:
> > > Hi Guys,
> > >
> > > If I'm not wrong, the last release of Hive 3.x is 18 months old.
> > > I wanted to ask if you had any roadmap / plan to release a new version
> of
> > > Hive 3.x or Hive 4?
> > >
> > > Thanks,
> > > Michel
> > >
> >
> >
>


-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Avro tables with 5k columns any tips?

2021-02-24 Thread Edward Capriolo
Hello all,

It has been a long time. I have been forced to use Avro and create a table
with over 5k columns. It's helluva slow. I warned folks that all the best
practices say "don't make a table with more than 1k or 2k columns" (Impala,
Hive, Cloudera). No one listened to me, so now the table is a mess. Impala
works, but my show stats and refresh table take ages.

Spark SQL might take an hour going back and forth getting the metadata. It's
a HiveServer / Hive Thrift / Oracle-backed metastore type of setup.

I have literally tried upping my Spark heap to like 10GB. Does anyone have
any tips for this insanity? Client or server side? Client would be easier
because, as you can guess, if I can't stop folks from making a 5k column
table, I won't be able to get a server setting changed without selling my
left leg.

Also note this is using Cloudera, so it's probably not Hive 3.x; it's
whatever version they are backporting.

Thanks,
Edward


Re: Article on the correctness of Hive on MR3, Presto, and Impala

2019-06-26 Thread Edward Capriolo
I like the approach of applying an arbitrary limit. Hive's q files tend to
add an ordering to everything. Would it make sense to simply order by
multiple columns in the result set and conduct a large diff on them?
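
Roughly what I have in mind, sketched over plain JDBC: order by every
projected column so the output is deterministic, dump the rows to a file per
engine, and diff the files. The URL and the TPC-DS-style table/column names
below are only placeholders:

import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DumpOrderedResult {
    public static void main(String[] args) throws Exception {
        String url = args[0];      // e.g. jdbc:hive2://host:10000/tpcds (placeholder)
        String outFile = args[1];  // file to write the ordered rows to
        // Ordering by every projected column makes the dump deterministic,
        // so a plain textual diff of two engines' dumps is meaningful.
        String sql = "SELECT ss_item_sk, ss_customer_sk, ss_net_paid "
                   + "FROM store_sales "
                   + "ORDER BY ss_item_sk, ss_customer_sk, ss_net_paid";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql);
             PrintWriter out = new PrintWriter(outFile, "UTF-8")) {
            int cols = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                StringBuilder row = new StringBuilder();
                for (int i = 1; i <= cols; i++) {
                    if (i > 1) row.append('\t');
                    row.append(rs.getString(i));
                }
                out.println(row);
            }
        }
    }
}

Run it once per engine and a plain diff of the two dumps shows any row-level
disagreement.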

On Wednesday, June 26, 2019, Sungwoo Park  wrote:

> I have published a new article on the correctness of Hive on MR3, Presto,
> and Impala:
>
> https://mr3.postech.ac.kr/blog/2019/06/26/correctness-
> hivemr3-presto-impala/
>
> Hope you enjoy reading the article.
>
> --- Sungwoo
>
>

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: Hive on Tez vs Impala

2019-04-16 Thread Edward Capriolo
I have changed jobs 3 times since Tez was introduced. It is a true waste of
compute resources and time that it was never patched in. So I either have to
waste my time patching it in, waste my time running a side deployment, or
not install it and waste money having queries run longer on the MR/Spark
engine.

Imagine how many compute hours have been lost worldwide.
On Tuesday, April 16, 2019, Manoj Murumkar  wrote:

> If we install our own build of Hive, we'll be out of support from CDH.
>
> Tez is not supported anyway and we're not touching any CDH bits, so it's
> not a big issue to have our own build of Tez engine.
>
> > On Apr 15, 2019, at 9:20 PM, Gopal Vijayaraghavan 
> wrote:
> >
> >
> > Hi,
> >
> >>> However, we have built Tez on CDH and it runs just fine.
> >
> > Down that path you'll also need to deploy a slightly newer version of
> Hive as well, because Hive 1.1 is a bit ancient & has known bugs with the
> tez planner code.
> >
> > You effectively end up building the hortonworks/hive-release builds, by
> undoing the non-htrace tracing impl & applying the htrace one back etc.
> >
> >> Lol. I was hoping that the merger would unblock the "saltiness".
> >
> > Historically, I've unofficially supported folks using Tez on CDH in prod
> (assuming they buy me enough coffee), though I might have to discontinue that.
> >
> > https://github.com/t3rmin4t0r/tez-autobuild/blob/llap/
> vendor-repos.xml#L11
> >
> > Cheers,
> > Gopal
> >
> >
>


-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: Hive on Tez vs Impala

2019-04-15 Thread Edward Capriolo
Lol. I was hoping that the merger would unblock the "saltiness". I wonder
what the official position is now, because back in the day there was a puff
piece produced to the effect that Hive was not the way forward and Impala is
the bee's knees.

On Monday, April 15, 2019, Manoj Murumkar  wrote:

> No, not yet. However, we have built Tez on CDH and it runs just fine.
> Following blog summarizes part of the work (bit old, we currently run Tez
> 0.9.1 on CDH 5.16.1).
>
> https://blog.upala.com/2017/03/04/setting-up-tez-on-cdh-cluster/
>
> Blog says use ATS from open source hadoop, which will not work if you've
> kerberized the cluster. You'll have to build a version of ATS against CDH
> libraries that provides the classes needed to run the engine. We have done
> this work as well and it runs pretty smoothly.
>
>
>
> On Mon, Apr 15, 2019 at 8:33 AM Edward Capriolo 
> wrote:
>
>> Out of band question. Given:
>> https://hortonworks.com/blog/welcome-brand-new-cloudera/
>>
>> Does CDH finally ship with a Tez you don't have to manually patch in?
>> On Monday, April 15, 2019, Sungwoo Park  wrote:
>>
>>> I tested the performance of Impala 2.12.0+cdh5.15.2+0 in Cloudera CDH
>>> 5.15.2 a while ago. I compared it with Hive 3.1.1 on MR3 (where MR3 is a
>>> new execution engine for Hadoop and Kubernetes). You can find the result at:
>>>
>>> https://mr3.postech.ac.kr/blog/2019/03/22/performance-evaluation-0.6/
>>>
>>> On average, Hive on MR3 is about 30% faster than Hive on Tez on
>>> sequential queries. For concurrent queries, the throughput of Hive on MR3
>>> is about three times higher than Hive on Tez (when tested with 16
>>> concurrent queries). You can find the result at:
>>>
>>> https://mr3.postech.ac.kr/blog/2018/10/30/performance-evaluation-0.4/
>>>
>>> --- Sungwoo Park
>>>
>>> On Mon, Apr 15, 2019 at 8:44 PM Artur Sukhenko 
>>> wrote:
>>>
>>>> Hi,
>>>> We are using CDH 5, with Impala  2.7.0-cdh5.9.1  and Hive 1.1
>>>> (MapReduce)
>>>> I can't find the info regarding Hive on Tez performance compared to
>>>> Impala.
>>>> Does someone know or compared it?
>>>>
>>>> Thanks
>>>>
>>>> Artur Sukhenko
>>>>
>>>
>>
>> --
>> Sorry this was sent from mobile. Will do less grammar and spell check
>> than usual.
>>
>

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: Hive on Tez vs Impala

2019-04-15 Thread Edward Capriolo
Out of band question. Given:
https://hortonworks.com/blog/welcome-brand-new-cloudera/

Does CDH finally ship with a Tez you don't have to manually patch in?
On Monday, April 15, 2019, Sungwoo Park  wrote:

> I tested the performance of Impala 2.12.0+cdh5.15.2+0 in Cloudera CDH
> 5.15.2 a while ago. I compared it with Hive 3.1.1 on MR3 (where MR3 is a
> new execution engine for Hadoop and Kubernetes). You can find the result at:
>
> https://mr3.postech.ac.kr/blog/2019/03/22/performance-evaluation-0.6/
>
> On average, Hive on MR3 is about 30% faster than Hive on Tez on sequential
> queries. For concurrent queries, the throughput of Hive on MR3 is about
> three times higher than Hive on Tez (when tested with 16 concurrent
> queries). You can find the result at:
>
> https://mr3.postech.ac.kr/blog/2018/10/30/performance-evaluation-0.4/
>
> --- Sungwoo Park
>
> On Mon, Apr 15, 2019 at 8:44 PM Artur Sukhenko 
> wrote:
>
>> Hi,
>> We are using CDH 5, with Impala  2.7.0-cdh5.9.1  and Hive 1.1 (MapReduce)
>> I can't find the info regarding Hive on Tez performance compared to
>> Impala.
>> Does someone know or compared it?
>>
>> Thanks
>>
>> Artur Sukhenko
>>
>

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: Creating temp tables in select statements

2019-03-28 Thread Edward Capriolo
I made a UDTF a while back that lets you specify lists of tuples; from there
you can explode them into rows.
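
Not the exact UDTF I wrote, but a rough sketch of the idea: a GenericUDTF
that takes a flat argument list, treats it as (key, value) pairs, and
forwards one row per pair. The class name and output column names are made
up for illustration:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// Hypothetical pairs(k1, v1, k2, v2, ...) UDTF: emits one (k, v) row per pair.
public class PairsUDTF extends GenericUDTF {

  @Override
  public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
    if (args.length == 0 || args.length % 2 != 0) {
      throw new UDFArgumentException("pairs() expects an even, non-zero number of arguments");
    }
    // Two output columns, both strings for simplicity.
    List<String> names = new ArrayList<>();
    List<ObjectInspector> ois = new ArrayList<>();
    names.add("k");
    ois.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    names.add("v");
    ois.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    return ObjectInspectorFactory.getStandardStructObjectInspector(names, ois);
  }

  @Override
  public void process(Object[] args) throws HiveException {
    // Forward one output row per (key, value) pair of arguments.
    for (int i = 0; i + 1 < args.length; i += 2) {
      forward(new Object[] { String.valueOf(args[i]), String.valueOf(args[i + 1]) });
    }
  }

  @Override
  public void close() throws HiveException {
    // Nothing buffered, nothing to flush.
  }
}

Registered as a temporary function, it can then be exploded into rows with
LATERAL VIEW in the usual way.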

On Thursday, March 28, 2019, Jesus Camacho Rodriguez <
jcamachorodrig...@hortonworks.com> wrote:

> Depending on the version you are using, table + values syntax is supported.
>
> https://issues.apache.org/jira/browse/HIVE-18416
>
>
>
> SELECT a, b FROM TABLE(VALUES(1,2),(3,4)) AS x(a,b);
>
>
>
> -Jesús
>
>
>
>
>
> *From: *David Lavati 
> *Reply-To: *"user@hive.apache.org" 
> *Date: *Thursday, March 28, 2019 at 4:44 AM
> *To: *"user@hive.apache.org" 
> *Subject: *Re: Creating temp tables in select statements
>
>
>
> Hi Mainak,
>
>
>
> For select queries the only way I know of for multiple records is through
> using union:
>
>
>
> 0: jdbc:hive2://localhost:1> with x as (select 1 num union select 2
> union select 3) select * from x;
> ++
> | x.num  |
> ++
> | 1  |
> | 2  |
> | 3  |
> ++
>
>
>
> For table insertion you can use a syntax somewhat similar to VALUES
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#
> LanguageManualDML-InsertingvaluesintotablesfromSQL
>
>
>
> Kind Regards,
>
> David
>
>
>
>
>
> On Wed, Mar 27, 2019 at 12:40 AM Mainak Ghosh  wrote:
>
> Hello,
>
> We want to create temp tables at a select query level. For example:
>
> with x as (1, 2, 3) select * from x;
>
> Or
>
> Select * from table where id in ; Here list of integers
> is an input and can change.
>
> Currently Postgres VALUES syntax is not supported in Hive. Is there some
> easy workarounds which does not involved explicitly creating temporary
> tables and can be specified at the select query level?
>
> Thanks and Regards,
> Mainak
>
>
>
> --
>
> *David Lavati* | Software Engineer
>
> t. (+3620) 951-7468 <0036209517468>
>
> cloudera.com 
>
> --
>
>
>


-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: Announce: MR3 0.6 released

2019-03-23 Thread Edward Capriolo
Thanks! Very cool.

On Sat, Mar 23, 2019 at 1:33 PM Sungwoo Park  wrote:

> I am pleased to announce the release of MR3 0.6. New key features are:
>
> - In Hive on Kubernetes, DAGAppMaster can run in its own Pod.
> - MR3-UI requires only Timeline Server.
> - Hive on MR3 is much more stable because it supports memory monitoring
> when loading hash tables for Map-side join.
>
> You can download MR3 0.6 at:
>
> https://mr3.postech.ac.kr/download/home/
>
> With the release of MR3 0.6, I ran experiments to compare the performance
> of Impala, Presto, and Hive on MR3. The result can be found in a new
> article:
>
> https://mr3.postech.ac.kr/blog/2019/03/22/performance-evaluation-0.6/
>
> I hope you enjoy reading the article.
>
> --- Sungwoo
>


Re: just released: Docker image of a minimal Hive server

2019-02-21 Thread Edward Capriolo
On Thu, Feb 21, 2019 at 6:34 PM Thai Bui  wrote:

> Great work!!
>
> Just curious, is it possible to take it one step further to provide a
> standalone local Hive that requires no hdfs (local filesystem instead) with
> embedded metastore and beeline?
>
> Would love to collaborate to make this happen similar to how spark-shell
> works.
>
> On Thu, Feb 21, 2019 at 3:10 PM Prasanth Jayachandran <
> j.prasant...@gmail.com> wrote:
>
>> Hi
>>
>> I did this a while back to run hive on tez
>> https://github.com/prasanthj/docker-hive-on-tez
>>
>> Some forks have updated versions of hive. With minimal changes it should
>> work with current hive master too.
>>
>> Thanks
>> Prasanth
>>
>> --
>> *From:* Prasanth Jayachandran 
>> *Sent:* Thursday, February 21, 2019 12:57 PM
>> *To:* user@hive.apache.org; user@hive.apache.org
>> *Subject:* Re: just released: Docker image of a minimal Hive server
>>
>> Hi
>>
>> I did this a while back to run hive on tez
>> https://github.com/prasanthj/docker-hive-on-tez
>> Some forks have updated versions of hive. With some minimal changes it
>> should work with current hive master too.
>>
>> Thanks
>> Prasanth
>>
>> --
>> *From:* Furcy Pin 
>> *Sent:* Thursday, February 21, 2019 11:59 AM
>> *To:* user@hive.apache.org
>> *Subject:* Re: just released: Docker image of a minimal Hive server
>>
>> Hello!
>>
>> If that might help, I did this repo a while ago:
>> https://github.com/FurcyPin/docker-hive-spark
>> It provides a pre-installed Hive Metastore and a HiveServer running on
>> Spark (Spark-SQL not Hive on Spark)
>>
>> I also did some config to acces AWS s3 data with it.
>>
>> Cheers,
>>
>> Furcy
>>
>> On Thu, 21 Feb 2019 at 18:30, Edward Capriolo 
>> wrote:
>>
>>> Good deal and great name!
>>>
>>> On Thu, Feb 21, 2019 at 11:31 AM Aidan L Feldman (CENSUS/ADEP FED) <
>>> aidan.l.feld...@census.gov> wrote:
>>>
>>>> Hi there-
>>>>
>>>> I am a new Hive user, working at the US Census Bureau. I was interested
>>>> in getting Hive running locally, but wanted to keep the dependencies
>>>> isolated. I could find Hadoop and Hive Docker images, but not one that had
>>>> both. Therefore, I present:  WeeHive <https://github.com/xdgov/weehive>
>>>> , *a minimal-as-possible Hive deployment*! This allows getting Hive up
>>>> and running (for development, not production) in a handful of steps.
>>>>
>>>>
>>>> I'm new to Hadoop and Hive, so I'm sure there are improvements that
>>>> could be made. Feedback (email ,issues
>>>> <https://github.com/xdgov/weehive/issues>, or pull requests) welcome.
>>>>
>>>>
>>>> Enjoy!
>>>>
>>>>
>>>> Aidan Feldman
>>>>
>>>> xD (Experimental Data) team
>>>>
>>>> Office of Program, Performance, and Stakeholder Integration (PPSI)
>>>>
>>>> Office of the Director
>>>>
>>>> Census Bureau
>>>>
>>>> --
> Thai
>

"Just curious, is it possible to take it one step further to provide a
standalone local Hive that requires no hdfs (local filesystem instead) with
embedded metastore and beeline?"
If you simply extract the Hive tarball into a directory, that is what you
get.


Re: just released: Docker image of a minimal Hive server

2019-02-21 Thread Edward Capriolo
Good deal and great name!

On Thu, Feb 21, 2019 at 11:31 AM Aidan L Feldman (CENSUS/ADEP FED) <
aidan.l.feld...@census.gov> wrote:

> Hi there-
>
> I am a new Hive user, working at the US Census Bureau. I was interested in
> getting Hive running locally, but wanted to keep the dependencies isolated.
> I could find Hadoop and Hive Docker images, but not one that had both.
> Therefore, I present:  WeeHive , *a
> minimal-as-possible Hive deployment*! This allows getting Hive up and
> running (for development, not production) in a handful of steps.
>
>
> I'm new to Hadoop and Hive, so I'm sure there are improvements that could
> be made. Feedback (email ,issues
> , or pull requests) welcome.
>
>
> Enjoy!
>
>
> Aidan Feldman
>
> xD (Experimental Data) team
>
> Office of Program, Performance, and Stakeholder Integration (PPSI)
>
> Office of the Director
>
> Census Bureau
>
>


Re: Is 'application' a reserved word?

2018-05-30 Thread Edward Capriolo
We got bit pretty hard when "exchange partitions" was added. How many
people in ad-tech work with exchanges? Everyone!

On Wed, May 30, 2018 at 1:38 PM, Alan Gates  wrote:

> It is.  You can see the definitive list of keywords at
> https://github.com/apache/hive/blob/master/ql/src/java/
> org/apache/hadoop/hive/ql/parse/HiveLexer.g (Note this is for the master
> branch, you can switch the branch around to find the list for a particular
> release.)  It would be good to file a JIRA on this so we fix the
> documentation.
>
> Alan.
>
> On Wed, May 30, 2018 at 7:48 AM Matt Burgess  wrote:
>
>> I tried the following simple statement in beeline (Hive 3.0.0):
>>
>> create table app (application STRING);
>>
>> And got the following error:
>>
>> Error: Error while compiling statement: FAILED: ParseException line
>> 1:18 cannot recognize input near 'application' 'STRING' ')' in column
>> name or constraint (state=42000,code=4)
>>
>> I checked the Wiki [1] but didn't see 'application' on the list of
>> reserved words. However if I change the column name to anything else
>> (even 'applicatio') it works. Can someone confirm whether this is a
>> reserved word?
>>
>> Thanks in advance,
>> Matt
>>
>> [1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#
>> LanguageManualDDL-Keywords,Non-reservedKeywordsandReservedKeywords
>>
>


Re: Hive, Tez, clustering, buckets, and Presto

2018-04-03 Thread Edward Capriolo
True. The spec does not mandate that the bucket files be there if they
are empty (missing directories are 0-row tables).
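
For what it's worth, here is how I read that contract, as a small sketch;
the modulus is meant to mirror Hive's (hash & Integer.MAX_VALUE) % numBuckets
convention and the 000000_0-style file names, but treat the details as my
assumption and check Utilities.getBucketIdFromFile in your Hive version:

// Sketch only: illustrates the "file name -> hash % buckets" contract; check
// Utilities.getBucketIdFromFile in your Hive version for the real parsing.
public class BucketContract {

    // For Integer keys, Java's hashCode is the identity function, so
    // hash % buckets behaves like value % buckets.
    static int bucketFor(Object key, int numBuckets) {
        int hash = (key == null) ? 0 : key.hashCode();
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    // Conventional bucket file name for a bucket id, e.g. 2 -> "000002_0".
    static String bucketFileName(int bucketId) {
        return String.format("%06d_0", bucketId);
    }

    // Reverse direction: recover the bucket id from a bucket file name.
    static int bucketIdFromFileName(String fileName) {
        return Integer.parseInt(fileName.substring(0, fileName.indexOf('_')));
    }

    public static void main(String[] args) {
        int numBuckets = 4;
        for (int key : new int[] { 42, 7, 1024 }) {
            int b = bucketFor(key, numBuckets);
            System.out.println(key + " -> bucket " + b + " (" + bucketFileName(b) + ")");
        }
        // A bucket file that is absent simply means that bucket holds zero rows.
    }
}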

Thanks,
Edward

On Tue, Apr 3, 2018 at 4:42 PM, Richard A. Bross  wrote:

> Gopal,
>
> The Presto devs say they are willing to make the changes to adhere to the
> Hive bucket spec.  I quoted
>
> "Presto could fix their fail-safe for bucketing implementation to actually
> trust the Hive bucketing spec & get you out of this mess - the bucketing
> contract for Hive is actual file name -> hash % buckets (Utilities::
> getBucketIdFromFile)."
>
> so they're asking "where is the Hive bucketing spec".  Is it just to read
> the code for that function?  They were looking for something more explicit,
> I think.
>
> Thanks
>
> - Original Message -
> From: "Gopal Vijayaraghavan" 
> To: user@hive.apache.org
> Sent: Tuesday, April 3, 2018 3:15:46 AM
> Subject: Re: Hive, Tez, clustering, buckets, and Presto
>
> >* I'm interested in your statement that CLUSTERED BY does not CLUSTER
> BY.  My understanding was that this was related to the number of buckets,
> but you are relating it to ORC stripes.  It is odd that no examples that
> I've seen include the SORTED BY statement other than in relation to ORC
> indexes (that I understand).  So the question is; regardless of whether
> efficient ORC stripes are created (wouldn't I have to also specify
> 'orc.create.index’=’true’ for this to have much of an effect)
>
> ORC + bucketing has been something I've spent a lot of time with - a lot
> of this has to do with secondary characteristics of data (i.e same device
> has natural progressions for metrics), which when combined with a columnar
> format & ordering within files produces better storage and runtimes
> together (which I guess is usually a trade-off).
>
> Without a SORTED BY, the organizing function for the data-shuffle does not
> order in any specific way - the partition key for the shuffle is the
> modulus, while the order key is 0 bytes long, so it sorts by (modulus,)
> which for a quick-sort also loses the input order into the shuffle & each
> bucket file is produced in random order within itself.
>
> An explicit sort with bucketing is what I recommend to most of the HDP
> customers who have performance problems with ORC.
>
> This turns the shuffle key into (modulus, key1, key2) producing more
> predictable order during shuffle.
>
> Then the key1 can be RLE encoded so that ORC vector impl will pass it on
> as key1x1024 repetitions & do 1000x fewer comparisons when filtering rows
> for integers.
>
> https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5
>
> was written as a warning to customers who use bucketing to try & solve
> performance problems, but have ended up bucketing as their main problem.
>
> Most of what I have written above was discussed a few years back and in
> general, bucketing on a high cardinality column + sorting on a low
> cardinality together has given good results to my customers.
>
> >I hadn't thought of the even number issue, not having looked at the
> function; I had assumed that it was a hash, not a modulus; shame on me.
> Reading the docs I see that hash is only used on string columns
>
> Actually a hash is used in theory, but I entirely blame Java for it - the
> Java hash is an identity function for Integers.
>
> scala> 42.hashCode
> res1: Int = 42
>
> scala> 42L.hashCode
> res2: Int = 42
>
> > Finally, I'm not sure that I got a specific answer to my original
> question, which is can I force Tez to create all bucket files so Presto
> queries can succeed?  Anyway, I will be testing today and the solution will
> either be to forgo buckets completely or to simply rely on ORC indexes.
>
> There's no config to do that today & Presto is already incompatible with
> Hive 3.0 tables (Update/Delete support).
>
> Presto could fix their fail-safe for bucketing implementation to actually
> trust the Hive bucketing spec & get you out of this mess - the bucketing
> contract for Hive is actual file name -> hash % buckets (Utilities::
> getBucketIdFromFile).
>
> The file-count is a very flaky way to check if the table is bucketed
> correctly - either you trust the user to have properly bucketed the table
> or you don't use it. Failing to work on valid tables does look pretty bad,
> instead of soft fallbacks.
>
> I wrote a few UDFs which was used to validate suspect tables and fix them
> for customers who had bad historical data, which was loaded with
> "enforce.bucketing=false" or for the short hive-0.13 period with HIVE-12945.
>
> https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/
> master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27
>
> LLAP has a bucket pruning implementation if Presto wants to copy from it
> (LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).
>
> Optimizer: https://github.com/apache/hive/blob/master/ql/src/java/
> 

Re: Proposal: File based metastore

2018-01-30 Thread Edward Capriolo
On Tue, Jan 30, 2018 at 1:16 PM, Ryan Blue <b...@apache.org> wrote:

> Thanks, Owen.
>
> I agree, Iceberg addresses a lot of the problems that you're hitting here.
> It doesn't quite go as far as moving all metadata into the file system. You
> can do that in HDFS and implementations that support atomic rename, but not
> in S3 (Iceberg has an implementation of the HDFS one strategy). For S3 you
> need some way of making commits atomic, for which we are using a metastore
> that is far more light-weight. You could also use a ZooKeeper cluster for
> write-side locking, or maybe there are other clever ideas out there.
>
> Even if Iceberg is agnostic to the commit mechanism, it does almost all of
> what you're suggesting and does it in a way that's faster than the current
> metastore while providing snapshot isolation.
>
> rb
>
> On Mon, Jan 29, 2018 at 9:10 AM, Owen O'Malley <owen.omal...@gmail.com>
> wrote:
>
>> You should really look at what the Netflix guys are doing on Iceberg.
>>
>> https://github.com/Netflix/iceberg
>>
>> They have put a lot of thought into how to efficiently handle tabular
>> data in S3. They put all of the metadata in S3 except for a single link to
>> the name of the table's root metadata file.
>>
>> Other advantages of their design:
>>
>>- Efficient atomic addition and removal of files in S3.
>>- Consistent schema evolution across formats
>>- More flexible partitioning and bucketing.
>>
>>
>> .. Owen
>>
>> On Sun, Jan 28, 2018 at 12:02 PM, Edward Capriolo <edlinuxg...@gmail.com>
>> wrote:
>>
>>> All,
>>>
>>> I have been bouncing around the earth for a while and have had the
>>> privilege of working at 4-5 places. On arrival each place was in a variety
>>> of states in their hadoop journey.
>>>
>>> One large company that I was at had a ~200 TB hadoop cluster. They
>>> actually ran PIG and there ops group REFUSED to support hive, even though
>>> they had written thousands of lines of pig macros to deal with selecting
>>> from a partition, or a pig script file you would import so you would know
>>> what the columns of the data at location /x/y/z is.
>>>
>>> In another lifetime I have been at a shop that used SCALDING. Again lots
>>> of custom effort there with avro and parquet, all to do things that hive
>>> would do our of the box. Again the biggest challenge is the thrift service
>>> and metastore.
>>>
>>> In the cloud many people will use a bootstrap script
>>> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hado
>>> op-script.html or 'msck repair'
>>>
>>> The "rise of the cloud" has changed us all the metastore is being a
>>> database is a hard paradigm to support. Imagine for example I created data
>>> to an s3 bucket with hive, and another group in my company requires read
>>> only access to this data for an ephemeral request. Sharing the data is
>>> easy, S3 access can be granted, sharing the metastore and thrift services
>>> are much more complicated.
>>>
>>> So lets think out of the box:
>>>
>>> https://www.datastax.com/2011/03/brisk-is-here-hadoop-and-ca
>>> ssandra-together-at-last
>>>
>>> Datastax was able to build a platform where the filesystem and the
>>> metastore were backed into Cassandra. Even though a HBase user would not
>>> want that, the novel thing about that approach is that the metastore was
>>> not "some extra thing in a database" that you had to deal with.
>>>
>>> What I am thinking is that for the user of s3, the metastore should be
>>> in s3. Probably in hidden files inside the warehouse/table directory(ies).
>>>
>>> Think of it as msck repair "on the fly" "https://www.ibm.com/support/k
>>> nowledgecenter/SSPT3X_4.2.5/com.ibm.swg.im.infosphere.bigins
>>> ights.commsql.doc/doc/biga_msckrep.html"
>>>
>>> The implementation could be something like this:
>>>
>>> On startup read hive.warehouse.dir look for "_warehouse" That would help
>>> us locate the databases and in the databases we can locate tables, with the
>>> tables we can locate partitions.
>>>
>>> This will of course scale horribly across tables with 9000
>>> partitions but that would not be our use case. For all the people with
>>> "msck repair" in the bootstrap they have a much cleaner way of using hive.
>>>
>>> The implementations 

Re: Proposal: File based metastore

2018-01-29 Thread Edward Capriolo
On Mon, Jan 29, 2018 at 12:44 PM, Owen O'Malley <owen.omal...@gmail.com>
wrote:

>
>
> On Jan 29, 2018, at 9:29 AM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>
>
> On Mon, Jan 29, 2018 at 12:10 PM, Owen O'Malley <owen.omal...@gmail.com>
> wrote:
>
>> You should really look at what the Netflix guys are doing on Iceberg.
>>
>> https://github.com/Netflix/iceberg
>>
>> They have put a lot of thought into how to efficiently handle tabular
>> data in S3. They put all of the metadata in S3 except for a single link to
>> the name of the table's root metadata file.
>>
>> Other advantages of their design:
>>
>>- Efficient atomic addition and removal of files in S3.
>>- Consistent schema evolution across formats
>>- More flexible partitioning and bucketing.
>>
>>
>> .. Owen
>>
>> On Sun, Jan 28, 2018 at 12:02 PM, Edward Capriolo <edlinuxg...@gmail.com>
>> wrote:
>>
>>> All,
>>>
>>> I have been bouncing around the earth for a while and have had the
>>> privilege of working at 4-5 places. On arrival each place was in a variety
>>> of states in their hadoop journey.
>>>
>>> One large company that I was at had a ~200 TB hadoop cluster. They
>>> actually ran PIG and there ops group REFUSED to support hive, even though
>>> they had written thousands of lines of pig macros to deal with selecting
>>> from a partition, or a pig script file you would import so you would know
>>> what the columns of the data at location /x/y/z is.
>>>
>>> In another lifetime I have been at a shop that used SCALDING. Again lots
>>> of custom effort there with avro and parquet, all to do things that hive
>>> would do our of the box. Again the biggest challenge is the thrift service
>>> and metastore.
>>>
>>> In the cloud many people will use a bootstrap script
>>> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hado
>>> op-script.html or 'msck repair'
>>>
>>> The "rise of the cloud" has changed us all the metastore is being a
>>> database is a hard paradigm to support. Imagine for example I created data
>>> to an s3 bucket with hive, and another group in my company requires read
>>> only access to this data for an ephemeral request. Sharing the data is
>>> easy, S3 access can be granted, sharing the metastore and thrift services
>>> are much more complicated.
>>>
>>> So lets think out of the box:
>>>
>>> https://www.datastax.com/2011/03/brisk-is-here-hadoop-and-ca
>>> ssandra-together-at-last
>>>
>>> Datastax was able to build a platform where the filesystem and the
>>> metastore were backed into Cassandra. Even though a HBase user would not
>>> want that, the novel thing about that approach is that the metastore was
>>> not "some extra thing in a database" that you had to deal with.
>>>
>>> What I am thinking is that for the user of s3, the metastore should be
>>> in s3. Probably in hidden files inside the warehouse/table directory(ies).
>>>
>>> Think of it as msck repair "on the fly" "https://www.ibm.com/support/k
>>> nowledgecenter/SSPT3X_4.2.5/com.ibm.swg.im.infosphere.bigins
>>> ights.commsql.doc/doc/biga_msckrep.html"
>>>
>>> The implementation could be something like this:
>>>
>>> On startup read hive.warehouse.dir look for "_warehouse" That would help
>>> us locate the databases and in the databases we can locate tables, with the
>>> tables we can locate partitions.
>>>
>>> This will of course scale horribly across tables with 9000
>>> partitions but that would not be our use case. For all the people with
>>> "msck repair" in the bootstrap they have a much cleaner way of using hive.
>>>
>>> The implementations could even be "Stacked" files first metastore
>>> lookback second.
>>>
>>> It would be also wise to have a tool available in the CLI "metastore
>>>  toJson" making it drop dead simple to export the schema
>>> definitions.
>>>
>>> Thoughts?
>>>
>>>
>>>
>>
> Close!
>
> They ultimately have many concepts right but the dealbreaker is they have
> there own file format. This ultimately will be a downfall. Hive needs to
> continue working with a variety of formats. This seems like a non-starter
> as everyone is already divided into camps on no

Re: Proposal: File based metastore

2018-01-29 Thread Edward Capriolo
On Mon, Jan 29, 2018 at 12:10 PM, Owen O'Malley <owen.omal...@gmail.com>
wrote:

> You should really look at what the Netflix guys are doing on Iceberg.
>
> https://github.com/Netflix/iceberg
>
> They have put a lot of thought into how to efficiently handle tabular data
> in S3. They put all of the metadata in S3 except for a single link to the
> name of the table's root metadata file.
>
> Other advantages of their design:
>
>- Efficient atomic addition and removal of files in S3.
>- Consistent schema evolution across formats
>- More flexible partitioning and bucketing.
>
>
> .. Owen
>
> On Sun, Jan 28, 2018 at 12:02 PM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>> All,
>>
>> I have been bouncing around the earth for a while and have had the
>> privilege of working at 4-5 places. On arrival each place was in a variety
>> of states in their hadoop journey.
>>
>> One large company that I was at had a ~200 TB hadoop cluster. They
>> actually ran PIG and there ops group REFUSED to support hive, even though
>> they had written thousands of lines of pig macros to deal with selecting
>> from a partition, or a pig script file you would import so you would know
>> what the columns of the data at location /x/y/z is.
>>
>> In another lifetime I have been at a shop that used SCALDING. Again lots
>> of custom effort there with avro and parquet, all to do things that hive
>> would do our of the box. Again the biggest challenge is the thrift service
>> and metastore.
>>
>> In the cloud many people will use a bootstrap script
>> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hado
>> op-script.html or 'msck repair'
>>
>> The "rise of the cloud" has changed us all the metastore is being a
>> database is a hard paradigm to support. Imagine for example I created data
>> to an s3 bucket with hive, and another group in my company requires read
>> only access to this data for an ephemeral request. Sharing the data is
>> easy, S3 access can be granted, sharing the metastore and thrift services
>> are much more complicated.
>>
>> So lets think out of the box:
>>
>> https://www.datastax.com/2011/03/brisk-is-here-hadoop-and-ca
>> ssandra-together-at-last
>>
>> Datastax was able to build a platform where the filesystem and the
>> metastore were backed into Cassandra. Even though a HBase user would not
>> want that, the novel thing about that approach is that the metastore was
>> not "some extra thing in a database" that you had to deal with.
>>
>> What I am thinking is that for the user of s3, the metastore should be in
>> s3. Probably in hidden files inside the warehouse/table directory(ies).
>>
>> Think of it as msck repair "on the fly" "https://www.ibm.com/support/k
>> nowledgecenter/SSPT3X_4.2.5/com.ibm.swg.im.infosphere.bigins
>> ights.commsql.doc/doc/biga_msckrep.html"
>>
>> The implementation could be something like this:
>>
>> On startup read hive.warehouse.dir look for "_warehouse" That would help
>> us locate the databases and in the databases we can locate tables, with the
>> tables we can locate partitions.
>>
>> This will of course scale horribly across tables with 9000 partitions
>> but that would not be our use case. For all the people with "msck repair"
>> in the bootstrap they have a much cleaner way of using hive.
>>
>> The implementations could even be "Stacked" files first metastore
>> lookback second.
>>
>> It would be also wise to have a tool available in the CLI "metastore
>>  toJson" making it drop dead simple to export the schema
>> definitions.
>>
>> Thoughts?
>>
>>
>>
>
Close!

They ultimately have many concepts right, but the dealbreaker is that they
have their own file format. This ultimately will be a downfall. Hive needs to
continue working with a variety of formats. This seems like a non-starter,
as everyone is already divided into camps on not-invented-here file formats.

Potentially we could implement this as a StorageHandler; this interface has
been flexible and has had success
(https://github.com/mongodb/mongo-hadoop/wiki/Hive-Usage). A storage handler
can delegate to Iceberg or something else.

I was thinking of this problem as more of a "Docker" type solution. For
example, let's say you have built a 40GB dataset divided into partitions by
day. Imagine we build a Docker image; the image would launch with an
embedded Derby DB (read only) and a start script that completely describes
the data and the partitions. (You need some way to connect it to your
processing.) But now we have a one-shot "shippable" Hive.

Another approach: we have a JSON format with files that live in each of the
40 partitions. If you are running a Hive metastore and your system admins
are smart, you can run:

hive> scan /data/sent/to/me/data.bundle

The above command would scan and import that data into your datastore. It
could be a wizard, it could be headless. But now I can share datasets on
clouds and use them easily.


Proposal: File based metastore

2018-01-28 Thread Edward Capriolo
All,

I have been bouncing around the earth for a while and have had the
privilege of working at 4-5 places. On arrival, each place was at a
different point in its Hadoop journey.

One large company that I was at had a ~200 TB Hadoop cluster. They actually
ran Pig, and their ops group REFUSED to support Hive, even though they had
written thousands of lines of Pig macros to deal with selecting from a
partition, or a Pig script file you would import so you would know what the
columns of the data at location /x/y/z are.

In another lifetime I was at a shop that used Scalding. Again, lots of
custom effort there with Avro and Parquet, all to do things that Hive would
do out of the box. Again, the biggest challenge is the Thrift service and
metastore.

In the cloud many people will use a bootstrap script
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html
or 'msck repair'

The "rise of the cloud" has changed us all the metastore is being a
database is a hard paradigm to support. Imagine for example I created data
to an s3 bucket with hive, and another group in my company requires read
only access to this data for an ephemeral request. Sharing the data is
easy, S3 access can be granted, sharing the metastore and thrift services
are much more complicated.

So let's think outside the box:

https://www.datastax.com/2011/03/brisk-is-here-hadoop-and-cassandra-together-at-last

Datastax was able to build a platform where the filesystem and the
metastore were baked into Cassandra. Even though an HBase user would not
want that, the novel thing about that approach is that the metastore was
not "some extra thing in a database" that you had to deal with.

What I am thinking is that for the S3 user, the metastore should be in S3,
probably in hidden files inside the warehouse/table directory(ies).

Think of it as msck repair "on the fly" "
https://www.ibm.com/support/knowledgecenter/SSPT3X_4.2.5/com.ibm.swg.im.infosphere.biginsights.commsql.doc/doc/biga_msckrep.html
"

The implementation could be something like this:

On startup, read hive.warehouse.dir and look for "_warehouse". That would
help us locate the databases; in the databases we can locate tables, and
with the tables we can locate partitions.

This will of course scale horribly for tables with 9000 partitions, but
that would not be our use case. All the people with "msck repair" in their
bootstrap would have a much cleaner way of using Hive.

The implementations could even be "stacked": files first, metastore lookup
second.

It would also be wise to have a tool available in the CLI, "metastore
 toJson", making it drop-dead simple to export the schema
definitions.
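
To make the startup scan concrete, here is a rough sketch using the plain
Hadoop FileSystem API. The "_warehouse" marker is the one proposed above;
the per-table "_table.json" descriptor name is purely hypothetical:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WarehouseScanner {
    public static void main(String[] args) throws Exception {
        // Placeholder warehouse location; could be s3a://, hdfs://, or file://.
        String warehouseDir = args.length > 0 ? args[0] : "s3a://my-bucket/warehouse";
        FileSystem fs = FileSystem.get(URI.create(warehouseDir), new Configuration());
        Path root = new Path(warehouseDir);

        // A marker file announces that this directory is a file-backed warehouse.
        if (!fs.exists(new Path(root, "_warehouse"))) {
            System.out.println("No _warehouse marker found under " + root);
            return;
        }
        // One directory per database, one directory per table underneath it.
        for (FileStatus db : fs.listStatus(root)) {
            if (!db.isDirectory()) {
                continue;
            }
            for (FileStatus table : fs.listStatus(db.getPath())) {
                if (!table.isDirectory()) {
                    continue;
                }
                Path descriptor = new Path(table.getPath(), "_table.json");
                if (fs.exists(descriptor)) {
                    System.out.println("table " + db.getPath().getName() + "."
                            + table.getPath().getName() + " -> " + descriptor);
                    // A real implementation would parse the descriptor here and
                    // register the schema and partition list with the engine.
                }
            }
        }
    }
}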

Thoughts?


Re: Format dillema

2017-06-23 Thread Edward Capriolo
"You're off by a couple of orders of magnitude - in fact, that was my last
year's Hadoop Summit demo, 10 terabytes of Text on S3, converted to ORC +
LLAP."

"We've got sub-second SQL execution, sub-second compiles, sub-second
submissions … with all of it adding up to a single or double digit seconds
over a billion rows of data."

I guess I see different things, having used all the tech. In particular, for
large Hive queries I see OOMs simply SCANNING THE INPUT of a data directory,
after 20 seconds!

I have also used Impala (which is no magic bullet), but it has three extra
components (a catalog server, an Impala daemon on each node, and a
statestore server). I put in a query AGAINST A TEXT FILE and the results
come back in milliseconds for the same query.

"Magically" jk. Impala allow me to query those TEXT files in milliseconds,
so logical deduction says the format of the data ORC/TEXT can't be the most
important factor here.

On Fri, Jun 23, 2017 at 2:53 PM, Gopal Vijayaraghavan 
wrote:

>
> > It is not that simple. The average Hadoop user has years 6-7 of data.
> They do not have a "magic" convert everything button. They also have legacy
> processes that don't/can't be converted.
> …
> > They do not want the "fastest format" they want "the fastest hive for
> their data".
>
> I've yet to run into that sort of luddite yet - maybe engineers can hold
> onto an opinion like that in isolation, but businesses are in general cost
> sensitive when it comes to storage & compute.
>
> The cynic in me says that if there are a few more down rounds, ORC
> adoption will suddenly skyrocket in companies which hoard data.
>
> ORC has massive compression advantages over Text, especially for
> attribute+metric SQL data. A closer look at this is warranted.
>
> Some of this stuff literally blows my mind - customer_demographics in
> TPC-DS is a great example of doing the impossible.
>
> tpcds_bin_partitioned_orc_1000.customer_demographics [numFiles=1,
> numRows=1920800, totalSize=46194, rawDataSize=726062400]
>
> which makes it 0.19 bit per-row (not byte, *BIT*).
>
> Compare to Parquet (which is still far better than text)
>
> tpcds_bin_partitioned_parq_1000.customer_demographics  [numFiles=1,
> numRows=1920800, totalSize=16813614, rawDataSize=17287200]
>
> which uses 70 bits per-row.
>
> So as companies "age" in their data over years, they tend to be very
> receptive to the idea of switching their years old data to ORC (and then
> use tiered HDFS etc).
>
> Still no magic button, but apparently money is a strong incentive to solve
> hard problems.
>
> > They get data dumps from potentially non sophisticated partners maybe
> using S3 and csv and, cause maybe their partner uses vertica or redshift. I
> think you understand this.
>
> That is something I'm painfully aware of - after the first few months, the
> second request is "Can you do Change-Data-Capture, so that we can reload
> every 30 mins? Can we do every 5 minutes?".
>
> And that's why Hive ACID has got SQL MERGE statements, so that you can
> grab a ChangeLog and apply it over with an UPSERT/UPDATE LATEST. And unlike
> the old lock manager, writes don't lock out any readers.
>
> Then as the warehouse gets bigger, "can you prevent the UPSERT from
> thrashing my cache & IO? Because the more data I have in the warehouse the
> longer the update takes."
>
> And that's what the min-max SemiJoin reduction in Tez does (i.e the
> min/max from the changelog goes pushed into the ORC index on the target
> table scan, so that only the intersection is loaded into cache). We gather
> a runtime range from the updates and push it to the ACID base, so that we
> don't have to read data into memory that doesn't have any updates.
>
> Also, if you have a sequential primary key on the OLTP side, this comes in
> as a ~100x speed up for such a use-case … because ACID ORC has
> transaction-consistent indexes built-in.
>
> > Suppose you have 100 GB text data in an S3 bucket, and say queying it
> takes lets just say "50 seconds for a group by type query".
> …
> > Now that second copy..Maybe I can do the same group by in 30 seconds.
>
> You're off by a couple of orders of magnitude - in fact, that was my last
> year's Hadoop Summit demo, 10 terabytes of Text on S3, converted to ORC +
> LLAP.
>
> http://people.apache.org/~gopalv/LLAP-S3.gif (GIANT 38Mb GIF)
>
> That's doing nearly a billion rows a second across 9 nodes, through a join
> + group-by - a year ago. You can probably hit 720M rows/sec with plain Text
> with latest LLAP on the same cluster today.
>
> And with LLAP, adding S3 SSE (encrypted data on S3) adds a ~4% overhead
> for ORC, which is another neat trick. And with S3Guard, we have the
> potential to get the consistency needed for ACID.
>
> The format improvements are foundational to the cost-effectiveness on the
> cloud - you can see the impact of the format on the IO costs when you use a
> non-Hive engine like AWS Athena with ORC and Parquet [1].
>
> > 1) io bound
> > 2) have 

Re: Format dillema

2017-06-23 Thread Edward Capriolo
"Yes, it's a tautology - if you cared about performance, you'd use ORC,
because ORC is the fastest format."

It is not that simple. The average Hadoop user has 6-7 years of data. They
do not have a "magic" convert-everything button. They also have legacy
processes that don't/can't be converted. They do not want the "fastest
format"; they want "the fastest Hive for their data". They get data dumps in
S3 and CSV from potentially unsophisticated partners, maybe because their
partner uses Vertica or Redshift. I think you understand this.

Suppose you have 100 GB of text data in an S3 bucket, and say querying it
takes, let's just say, "50 seconds for a group-by type query".

It takes a "70 second CTAS query" and maybe 40GB more storage to create a
second copy in ORC.  Now that second copy..Maybe I can do the same group by
in 30 seconds. But in reality, you are
1) io bound
2) have 10 seconds of startup time anyway.
3) now have two copies of data 2x metastore 2 to cleanup

So its great that ORC is great but the reality is I can not make my
webserver spit out a log in ORC format :)


On Thu, Jun 22, 2017 at 7:30 PM, Gopal Vijayaraghavan 
wrote:

>
> > I kept hearing about vectorization, but later found out it was going to
> work if i used ORC.
>
> Yes, it's a tautology - if you cared about performance, you'd use ORC,
> because ORC is the fastest format.
>
> And doing performance work to support folks who don't quite care about it,
> is not exactly "see a need, fill a need".
>
> > Litterally years have come and gone and we are talking like 3.x is going
> to vectorize text.
>
> Literally years have gone by since the feature came into Hive. Though it
> might have crept up on you - if Vectorization had been enabled by default,
> it would've been immediately obvious.
>
> HIVE-9937 is so old, that I'd say the first line towards Text
> vectorization came in in Q1 2015.
>
> In the current master, you can get a huge boost out of it - if you want
> you can run BI over 100Tb of text.
>
> https://www.slideshare.net/Hadoop_Summit/llap-building-cloudfirst-bi/27
>
> > … where some not negligible part of the features ONLY work with ORC.
>
> You've got it backwards - ORC was designed to support those features.
>
> Parquet could be following ORC closely, but at least the Java
> implementation hasn't.
>
> Cheers,
> Gopal
>
>
>
>
>


Re: Format dillema

2017-06-20 Thread Edward Capriolo
"Hive 3.x branch has text vectorization and LLAP cache support for it, so
hopefully the only relevant concern about Text will be the storage costs
due to poor compression (& the lack of updates)."

I kept hearing about vectorization, but later found out it was only going to
work if I used ORC. Literally years have come and gone and we are talking
like 3.x is going to vectorize text. I get that LazySimpleSerDe is in
many ways the polar opposite of a batched approach that attempts to pull
down N thousand rows and process them 'in a batch'. I also get the dynamics
of the situation, that people ultimately work on what they want, etc.

You're going to laugh, but from my personal experience in a number of
environments from small to mid sized, I have actually just had the best
luck with gzip text files. When ORC was still a twinkle in someone's eye I
was playing with the original RCFiles!

"The start of this thread is the exact opposite - trying to suggest ORC is
better for storage & wanting to use it."

Right, I am not trying to say that any one format is better than the other
on a case-by-case basis. I'm happy that we have something better than
RCFile, but I am generally trying to avoid Hive becoming the quasi ORC
datastore where some non-negligible part of the features ONLY work with ORC.

https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest

"Currently only ORC is supported for the format of the destination table."

Say what? I can do "INSERT INTO AVROTABLE AS SELECT * FROM JSON_TABLE",
but somehow I can ONLY stream-ingest into a table if it is this one type.
If only a given feature were supported by more than one format (what a
world it would be)!
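
For reference, the kind of destination table the streaming ingest API wants is
roughly this (a sketch; the names are made up, but ORC + bucketed +
transactional is the point):

CREATE TABLE web_events (
  ts STRING,
  user_id STRING,
  payload STRING
)
CLUSTERED BY (user_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

Swap STORED AS ORC for anything else and the streaming ingest refuses the
table.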

On Tue, Jun 20, 2017 at 5:05 PM, Gopal Vijayaraghavan 
wrote:

>
> > 1) both do the same thing.
>
> The start of this thread is the exact opposite - trying to suggest ORC is
> better for storage & wanting to use it.
>
> > As it relates the columnar formats, it is silly arms race.
>
> I'm not sure "silly" is the operative word - we've lost a lot of
> fragmentation of the community and are down to 2 good choices, neither of
> them wrong.
>
> Impala's original format was Trevni, which lives on in Avro docs. And
> there was RCFile - a sequence file format, which stored columnar data in a
>  pair. And then there was LazySimple SequenceFile, LazyBinary
> SequenceFile, Avro and Text with many SerDes.
>
> Purely speculatively, we're headed into more fragmentation again, with
> people rediscovering that they need updates.
>
> Uber's Hoodie is the Parquet fork, but for Spark, not Impala. While ORC
> ACID is getting much easier to update with MERGE statements and a deadlock
> aware txn manager.
>
> > Parquet had C/C++ right off the bat of course because impala has to work
> in C/C++.
>
> I think that is the primary reason why the Java Parquet readers are still
> way behind in performance.
>
> Nobody sane wants to work on performance tuning a data reader library in
> Java, when it is so much easier to do it in C++.
>
> Doing C++ after tuning the format for optimal performance in Java8 makes a
> lot of sense, in hindsight. The marshmallow test is easier if you can't
> have a marshmallow now.
>
> > 1) uses text file anyway because it is the ONLY format all tools support
>
> I see this often, folks who just throw in plain text into S3 and querying
> it.
>
> Hive 3.x branch has text vectorization and LLAP cache support for it, so
> hopefully the only relevant concern about Text will be the storage costs
> due to poor compression (& the lack of updates).
>
> Cheers,
> Gopal
>
>
>


Re: Format dillema

2017-06-20 Thread Edward Capriolo
"Hive and LLAP do support Parquet precisely because the developers want to
be able to process everyone's data."

Yes. But there are a number of optimizations on the Hive ORC side that we
know are not implemented for the Parquet support, which is why I made my
statement: Impala (Parquet=yes, ORC=no), Hive (ORC=yes, Parquet=lame). E.g.

https://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/

This requires a reader that is smart enough to understand the predicates.
Fortunately ORC has had the corresponding improvements to allow predicates
to be pushed into it, and takes advantages of its inline indexes to deliver
performance benefits.

I.e., universal improvements won't happen.
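
To make that concrete, the ORC-side win being described is roughly this
(a sketch; the table and column are made up):

SET hive.optimize.ppd=true;
SET hive.optimize.index.filter=true;

-- With ORC, the pushed-down predicate lets the reader skip stripes and row
-- groups whose min/max stats exclude the value, instead of decoding everything.
SELECT COUNT(*) FROM sales_orc WHERE customer_id = 12345;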

"Part of having a thriving ecosystem is that there are competitors, which
creates some user confusion, but makes the ecosystem stronger. "

True in many cases. But the fork-happy not-invented-here-ness is too much.
To the average user:
1) both do the same thing.
2) each vendor has some white paper or PowerPoint selling you on why their
solution is naturally better/smaller/faster.

As it relates to the columnar formats, it is a silly arms race. Parquet had
C/C++ right off the bat, of course, because Impala has to work in C/C++. But
hey, maybe 2.3 years later someone has a GitHub repo that does that for ORC,
and maybe 3.2 years later someone adds predicate pushdown to Parquet in Hive.

In the meantime actual users are stuck in the middle. They either:
1) use text files anyway because it is the ONLY format all tools support
2) make two outputs for each query, using 2x the space


(Can someone please make a competitor for Oozie? *grin*)
https://github.com/apache/incubator-airflow , mrjob, luigi, azkaban :)

On Tue, Jun 20, 2017 at 1:45 PM, Owen O'Malley <owen.omal...@gmail.com>
wrote:

>
>
> On Tue, Jun 20, 2017 at 10:12 AM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>> It is whack that two optimized row columnar formats exists and each
>> respective project (hive/impala) has good support for one and lame/no
>> support for the other.
>>
>
> We have two similar formats because they were designed at roughly the same
> time by different teams with similar, but not identical goals. Part of
> having a thriving ecosystem is that there are competitors, which creates
> some user confusion, but makes the ecosystem stronger. (Can someone please
> make a competitor for Oozie? *grin*)
>
> Hive and LLAP do support Parquet precisely because the developers want to
> be able to process everyone's data. The Impala project is free to make
> their own choices about what to work on.
>
> .. Owen
>
>


Re: Format dillema

2017-06-20 Thread Edward Capriolo
It is whack that two optimized row columnar formats exists and each
respective project (hive/impala) has good support for one and lame/no
support for the other.

Impala is now an Apache project.  Also 'whack' and 'lame' are technical
terms often used by the people in the real world that have to use TEXT
format because they care about interoperability.

As the world's hugest hive fan I can say: Impala is a really nice tool.
Many queries work at interactive speeds on large datasets. (Anecdotal)  I
highly doubt Hive + LLAP will be in that ball-park of performance for maybe
2 years.

Presto. Ha ha, I call Presto the "tease". It teases you by letting you think
you will not have to re-write your queries; then you do, to deal with nulls
and try_cast. It "teases" you because some queries work at interactive
speeds. Then it reaches the point where, based on your data size, it goes
from "interactive speed" to "kinda slow". Then it reaches the point where it
goes from "kinda slow" to "fail after 20 minutes". Then you just switch back
to Hive because, regardless of whether the speed is 4 minutes / 10 minutes /
whatever, you are about 99.999% certain the query will actually run.

On Tue, Jun 20, 2017 at 12:51 PM, Owen O'Malley 
wrote:

> You should also try LLAP. With ORC or text, it will cache the hot columns
> and partitions in memory. I can't seem to find the slides yet, but the
> Comcast team had good results with LLAP:
>
> https://dataworkssummit.com/san-jose-2017/sessions/hadoop-
> query-performance-smackdown/
>
> https://twitter.com/thejasn/status/875065727056715776
>
> Now that ORC has a C++ reader (and soon a writer), someone could write a
> patch for Impala to support ORC. You'd need to talk to the Impala project
> though.
>
> .. Owen
>
> On Tue, Jun 20, 2017 at 1:00 AM, Furcy Pin  wrote:
>
>> Another option would be to try Facebook's Presto https://prestodb.io/
>>
>> Like Impala, Presto is designed for fast interactive querying over Hive
>> tables, but it is also capable of querying data from many other SQL sources
>> (mySQL, postgreSQL, Kafka, Cassandra, ... https://prestodb.io/docs/curre
>> nt/connector.html)
>>
>> In terms of performances on small queries, it seems to be as fast as
>> Impala, a league over Spark-SQL, and of course two leagues over Hive.
>>
>> Unlike Impala, Presto is also able to read ORC file format, and make the
>> most of it (e.g. read pre-aggregated values from ORC headers).
>>
>> It can also make use of Hive's bucketing feature, while Impala still
>> cannot:
>> https://github.com/prestodb/presto/issues/
>> https://issues.apache.org/jira/browse/IMPALA-3118
>>
>> Regards,
>>
>> Furcy
>>
>>
>>
>>
>>
>> On Tue, Jun 20, 2017 at 5:36 AM, Sruthi Kumar Annamneedu <
>> sruthikumar...@gmail.com> wrote:
>>
>>> Try using Parquet with Snappy compression and Impala will work with this
>>> combination.
>>>
>>> On Sun, Jun 18, 2017 at 3:35 AM, rakesh sharma <
>>> rakeshsharm...@hotmail.com> wrote:
>>>
 We are facing an issue of format. We would like to do bi style queries
 from hive using impala and that supports parquet but we would like the data
 to be compressed to the best ratio like orc. But impala cannot query orc
 formats. What can be a design consideration for this. Any help

 Thanks
 Rakesh

 Get Outlook for Android 


>>>
>>
>


Re: Pro and Cons of using HBase table as an external table in HIVE

2017-06-09 Thread Edward Capriolo
Think about it like this: one system is scanning a local ORC file, while the
other is going through an HBase scanner (over the network) and reading the
data in SSTable format.
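
A minimal sketch of the HBase side of that comparison (the column family,
table, and column names are made up):

CREATE EXTERNAL TABLE customer_hbase (
  rowkey STRING,
  name STRING,
  city STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:name,cf:city')
TBLPROPERTIES ('hbase.table.name' = 'customer');

-- Only fast if the row-key predicate is pushed down as a point/range scan;
-- anything else becomes a full table scan pulled over the network.
SELECT * FROM customer_hbase WHERE rowkey = 'BAR';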

On Fri, Jun 9, 2017 at 5:50 AM, Amey Barve  wrote:

> Hi Michael,
>
> "If there is predicate pushdown, then you will be faster, assuming that
> the query triggers an implied range scan"
> ---> Does this bring results faster than plain hive querying over ORC /
> Text file formats
>
> In other words Is querying over plain hive (ORC or Text) *always* faster
> than through HiveStorageHandler?
>
> Regards,
> Amey
>
> On 9 June 2017 at 15:08, Michael Segel  wrote:
>
>> The pro’s is that you have the ability to update a table without having
>> to worry about duplication of the row.  Tez is doing some form of
>> compaction for you that already exists in HBase.
>>
>> The cons:
>>
>> 1) Its slower. Reads from HBase have more overhead with them than just
>> reading a file.  Read Lars George’s book on what takes place when you do a
>> read.
>>
>> 2) HBase is not a relational store. (You have to think about what that
>> implies)
>>
>> 3) You need to query against your row key for best performance, otherwise
>> it will always be a complete table scan.
>>
>> HBase was designed to give you fast access for direct get() and limited
>> range scans.  Otherwise you have to perform full table scans.  This means
>> that unless you’re able to do a range scan, your full table scan will be
>> slower than if you did this on a flat file set.  Again the reason why you
>> would want to use HBase if your data set is mutable.
>>
>> You also have to trigger a range scan when you write your hive query and
>> you have make sure that you’re querying off your row key.
>>
>> HBase was designed as a  store. Plain and simple.  If you
>> don’t use the key, you have to do a full table scan. So even though you are
>> partitioning on row key, you never use your partitions.  However in Hive or
>> Spark, you can create an alternative partition pattern.  (e.g your key is
>> the transaction_id, yet you partition on month/year portion of the
>> transaction_date)
>>
>> You can speed things up a little by using an inverted table as a
>> secondary index. However this assumes that you want to use joins. If you
>> have a single base table with no joins then you can limit your range scans
>> based on making sure you are querying against the row key.  Note: This will
>> mean that you have limited querying capabilities.
>>
>> And yes, I’ve done this before but can’t share it with you.
>>
>> HTH
>>
>> P.S.
>> I haven’t tried Hive queries where you have what would be the equivalent
>> of a get() .
>>
>> In earlier versions of hive, the issue would be “SELECT * FROM foo where
>> rowkey=BAR”  would still do a full table scan because of the lack of
>> predicate pushdown.
>> This may have been fixed in later releases of hive. That would be your
>> test case.   If there is predicate pushdown, then you will be faster,
>> assuming that the query triggers an implied range scan.
>> This would be a simple thing. However keep in mind that you’re going to
>> generate a map/reduce job (unless using a query engine like Tez) where you
>> wouldn’t if you just wrote your code in Java.
>>
>>
>>
>>
>> > On Jun 7, 2017, at 5:13 AM, Ramasubramanian Narayanan <
>> ramasubramanian.naraya...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > Can you please let us know Pro and Cons of using HBase table as an
>> external table in HIVE.
>> >
>> > Will there be any performance degrade when using Hive over HBase
>> instead of using direct HIVE table.
>> >
>> > The table that I am planning to use in HBase will be master table like
>> account, customer. Wanting to achieve Slowly Changing Dimension. Please
>> through some lights on that too if you have done any such implementations.
>> >
>> > Thanks and Regards,
>> > Rams
>>
>>
>


Re: FYI: Backports of Hive UDFs

2017-06-06 Thread Edward Capriolo
I don't care about 'security issues' over the ability to work quickly. One
popular compute system lets you define mappers in Scala in a shell, for
example.

On Monday, June 5, 2017, Makoto Yui <m...@apache.org> wrote:

> Alan,
>
> Putting Hive backported UDFs to Hive branch-1 will cause dependencies
> to the specific Hive branch-1, the next stable release of v1.x.
> Artifact should be a distinct jar that only includes backported UDFs
> to use it in exiting Hive clusters.
>
> Better to support possibly all Hive versions since v0.13.0 or later.
> So, better to be a distinct Maven submodule.
>
> Edward,
>
> Gems-like dynamic plugin loading from Maven repository (or github
> repos by using jitpack.io) is possible by using Eclipse Aether but
> dynamic plugin/class loading involves security issues.
> https://stackoverflow.com/questions/35598239/load-maven-
> artifact-via-classloader
> https://github.com/treasure-data/digdag/tree/master/
> digdag-core/src/main/java/io/digdag/core/plugin
>
> Thanks,
> Makoto
>
> 2017-06-03 3:26 GMT+09:00 Edward Capriolo <edlinuxg...@gmail.com
> <javascript:;>>:
> > Don't we currently support features that load functions from external
> places
> > like maven http server etc? I wonder if it would be easier to back port
> that
> > back port a handful of functions ?
> >
> > On Fri, Jun 2, 2017 at 2:22 PM, Alan Gates <alanfga...@gmail.com
> <javascript:;>> wrote:
> >>
> >> Rather than put that code in hive/contrib I was thinking that you could
> >> just backport the Hive 2.2 UDFs into the same locations in Hive 1
> branch.
> >> That seems better than putting them into different locations on
> different
> >> branches.
> >>
> >> If you are willing to do the porting and post the patches (including
> >> relevant unit tests so we know they work) I and other Hive committers
> can
> >> review the patches and commit them to branch-1.
> >>
> >> Alan.
> >>
> >> On Thu, Jun 1, 2017 at 6:36 PM, Makoto Yui <m...@apache.org
> <javascript:;>> wrote:
> >>>
> >>> That's would be a help for existing Hive users.
> >>> Welcome to put it into hive/contrib or something else.
> >>>
> >>> Minimum dependancies are hive 0.13.0 and hadoop 2.4.0.
> >>> It'll work for any Hive environment, version 0.13.0 or later.
> >>> https://github.com/myui/hive-udf-backports/blob/master/pom.xml#L49
> >>>
> >>> Thanks,
> >>> Makoto
> >>>
> >>> --
> >>> Makoto YUI 
> >>> Research Engineer, Treasure Data, Inc.
> >>> http://myui.github.io/
> >>>
> >>> 2017-06-02 2:24 GMT+09:00 Alan Gates <alanfga...@gmail.com
> <javascript:;>>:
> >>> > I'm curious why these can't be backported inside Hive.  If someone is
> >>> > willing to do the work to do the backport we can check them into the
> >>> > Hive 1
> >>> > branch.
> >>> >
> >>> > On Thu, Jun 1, 2017 at 1:44 AM, Makoto Yui <m...@apache.org
> <javascript:;>> wrote:
> >>> >>
> >>> >> Hi,
> >>> >>
> >>> >> I created a repository for backporting recent Hive UDFs (as of
> v2.2.0)
> >>> >> to legacy Hive environment (v0.13.0 or later).
> >>> >>
> >>> >>https://github.com/myui/hive-udf-backports
> >>> >>
> >>> >> Hope this helps for those who are using old Hive env :-(
> >>> >>
> >>> >> FYI
> >>> >>
> >>> >> Makoto
> >>> >>
> >>> >> --
> >>> >> Makoto YUI 
> >>> >> Research Engineer, Treasure Data, Inc.
> >>> >> http://myui.github.io/
> >>> >
> >>> >
> >>
> >>
> >
>
>
>
> --
> Makoto YUI 
> Research Engineer, Treasure Data, Inc.
> http://myui.github.io/
>


-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: FYI: Backports of Hive UDFs

2017-06-02 Thread Edward Capriolo
Don't we currently support features that load functions from external
places like Maven, an HTTP server, etc.? I wonder if it would be easier to
back-port that than to back-port a handful of functions?
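
I was thinking of something along these lines (a sketch; the jar location
and class name are made up):

CREATE FUNCTION mydb.date_add2 AS 'com.example.udf.GenericUDFDateAdd2'
USING JAR 'hdfs:///libs/hive-udf-backports.jar';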

On Fri, Jun 2, 2017 at 2:22 PM, Alan Gates  wrote:

> Rather than put that code in hive/contrib I was thinking that you could
> just backport the Hive 2.2 UDFs into the same locations in Hive 1 branch.
> That seems better than putting them into different locations on different
> branches.
>
> If you are willing to do the porting and post the patches (including
> relevant unit tests so we know they work) I and other Hive committers can
> review the patches and commit them to branch-1.
>
> Alan.
>
> On Thu, Jun 1, 2017 at 6:36 PM, Makoto Yui  wrote:
>
>> That's would be a help for existing Hive users.
>> Welcome to put it into hive/contrib or something else.
>>
>> Minimum dependancies are hive 0.13.0 and hadoop 2.4.0.
>> It'll work for any Hive environment, version 0.13.0 or later.
>> https://github.com/myui/hive-udf-backports/blob/master/pom.xml#L49
>>
>> Thanks,
>> Makoto
>>
>> --
>> Makoto YUI 
>> Research Engineer, Treasure Data, Inc.
>> http://myui.github.io/
>>
>> 2017-06-02 2:24 GMT+09:00 Alan Gates :
>> > I'm curious why these can't be backported inside Hive.  If someone is
>> > willing to do the work to do the backport we can check them into the
>> Hive 1
>> > branch.
>> >
>> > On Thu, Jun 1, 2017 at 1:44 AM, Makoto Yui  wrote:
>> >>
>> >> Hi,
>> >>
>> >> I created a repository for backporting recent Hive UDFs (as of v2.2.0)
>> >> to legacy Hive environment (v0.13.0 or later).
>> >>
>> >>https://github.com/myui/hive-udf-backports
>> >>
>> >> Hope this helps for those who are using old Hive env :-(
>> >>
>> >> FYI
>> >>
>> >> Makoto
>> >>
>> >> --
>> >> Makoto YUI 
>> >> Research Engineer, Treasure Data, Inc.
>> >> http://myui.github.io/
>> >
>> >
>>
>
>


Re: Migrating Variable Length Files to Hive

2017-06-02 Thread Edward Capriolo
On Fri, Jun 2, 2017 at 12:07 PM, Nishanth S  wrote:

> Hello hive users,
>
> We are looking at migrating  files(less than 5 Mb of data in total) with
> variable record lengths from a mainframe system to hive.You could think of
> this as metadata.Each of these records can have columns  ranging from 3 to
>  n( means  each record type have different number of columns) based on
> record type.What would be the best strategy to migrate this  to hive .I was
> thinking of converting these files  into one  variable length csv file and
> then importing them to a hive table .Hive table will consist of 4 columns
> with the 4th column having comma separated list of  values from column
> column 4 to n.Are there other alternative or better approaches for this
> solution.Appreciate any  feedback on this.
>
> Thanks,
> Nishanth
>

Hive supports complex types like List, Map, and Struct, and they can be
arbitrarily nested. If the nested data has a schema, that may be your best
option, potentially using the Thrift/Avro/Parquet/Protobuf support.

Otherwise you can store the data as JSON and, at read time, parse things out
using the JSON UDFs.
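
A rough sketch of both options (all table and column names are made up):

-- Option 1: model the variable tail explicitly with a complex type.
CREATE TABLE mainframe_records (
  record_type STRING,
  col1 STRING,
  col2 STRING,
  extra_cols ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|';

-- Option 2: keep the variable part as a JSON string and parse at read time.
SELECT record_type,
       get_json_object(json_payload, '$.some_field') AS some_field
FROM mainframe_records_json;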

Edward


Re: drop table - external - aws

2017-05-17 Thread Edward Capriolo
I'm pretty sure the schema tool does this for people who convert to an HA name node.

On Wednesday, May 17, 2017, Neil Jonkers  wrote:

> Hi,
>
> Inspecting the Hive Metastore tables.
> Table SDS has a location field.
>
> If for reason this does not work:
> "ALTER TABLE ... SET LOCATION ... ?"
>
> Manually updating the SDS metadata table is an option :
>
> update SDS set  location =  "hdfs://Node:8020/user/hive/warehouse/t"
> where ...
>
> On Wed, May 17, 2017 at 8:41 PM, Furcy Pin  > wrote:
>
>> for that, sublime text + multi-line edit is your friend !
>>
>> https://www.youtube.com/watch?v=-paR5m6m-Nw
>>
>> On Wed, May 17, 2017 at 7:24 PM, Stephen Sprague > > wrote:
>>
>>> yeah. that's a potential idea too.  gotta put the time in to script it
>>> with 200+ tables though.
>>>
>>> On Wed, May 17, 2017 at 10:07 AM, Furcy Pin >> > wrote:
>>>
 Did you try ALTER TABLE ... SET LOCATION ... ? maybe it could have
 worked.


 On Wed, May 17, 2017 at 6:57 PM, Vihang Karajgaonkar <
 vih...@cloudera.com
 > wrote:

> This is interesting and possibly a bug. Did you try changing them to
> managed tables and then dropping or truncating them? How do we reproduce
> this on our setup?
>
> On Tue, May 16, 2017 at 6:38 PM, Stephen Sprague  > wrote:
>
>> fwiw. i ended up re-creating the ec2 cluster with that same host name
>> just so i could drop those tables from the metastore.
>>
>> note to self.  be careful - be real careful - with "sharing" hive
>> metastores between different compute paradigms.
>>
>> Regards,
>> Stephen.
>>
>> On Tue, May 16, 2017 at 6:38 AM, Stephen Sprague > > wrote:
>>
>>> hey guys,
>>> here's something bizarre.   i created about 200 external tables with
>>> a location something like this 'hdfs:///path'.  this was three
>>> months ago and now i'm revisiting and want to drop these tables.
>>>
>>> ha! no can do!
>>>
>>> that  is long gone.
>>>
>>> Upon issuing the drop table command i get this:
>>>
>>> Error while processing statement: FAILED: Execution Error, return
>>> code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
>>> MetaException(message:java.lang.IllegalArgumentException:
>>> java.net.UnknownHostException: )
>>>
>>> where  is that old host name.
>>>
>>> so i ask is there a work around for this?  given they are external
>>> tables i'm surprised it "checks" that that location exists (or not.)
>>>
>>> thanks,
>>> Stephen
>>>
>>
>>
>

>>>
>>
>

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: How can i merge multiple rows to one row in sparksql or hivesql?

2017-05-15 Thread Edward Capriolo
Here is a similar, but not exact, approach from something I did. I had two
data files in different formats, and the different columns needed to become
different features. I wanted to feed them into Spark's:
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_FP-Growth_Algorithm

This only works because I have a few named features, and they become fields
in the model object AntecedentUnion. This would be a crappy solution for a
large sparse matrix.

Also my Scala code is crap too so there is probably a better way to do this!


val b = targ.as[TargetingAntecedent]
val b1 = b.map(c => (c.tdid, c)).rdd.groupByKey()
val bgen = b1.map(f =>
  (f._1 , f._2.map
  ( x => AntecedentUnion("targeting", "", x.targetingdataid,
"", "") )
  ) )

val c = imp.as[ImpressionAntecedent]
val c1 = c.map(k => (k.tdid, k)).rdd.groupByKey()
val cgen = c1.map (f =>
  (f._1 , f._2.map
  ( x => AntecedentUnion("impression", "", "", x.campaignid,
x.adgroupid) ).toSet.toIterable
  ) )

val bgen = TargetingUtil.targetingAntecedent(sparkSession, sqlContext,
targ)
val cgen = TargetingUtil.impressionAntecedent(sparkSession, sqlContext,
imp)
val joined = bgen.join(cgen)

val merged = joined.map(f => (f._1, f._2._1++:(f._2._2) ))
val fullResults : RDD[Array[AntecedentUnion]] = merged.map(f =>
f._2).map(_.toArray[audacity.AntecedentUnion])


So essentially I am converting everything into AntecedentUnion, where the
first field is the type of the tuple and the other fields are supplied or
not. Then I merge all those and run FP-Growth on them. Hope that helps!
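
For comparison, the pure-HiveQL version of what you are after (union the
inputs, group by user, collect the features) would look roughly like this
(a sketch, assuming all three tables share a (user_id, feature) layout):

SELECT user_id,
       collect_list(feature) AS features
FROM (
  SELECT user_id, feature FROM data1
  UNION ALL
  SELECT user_id, feature FROM data2
  UNION ALL
  SELECT user_id, feature FROM data3
) unioned
GROUP BY user_id;

Use collect_set() instead of collect_list() if you need to drop duplicates,
as mentioned elsewhere in the thread.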



On Mon, May 15, 2017 at 12:06 PM, goun na  wrote:
>
> I mentioned it opposite. collect_list generates duplicated results.
>
> 2017-05-16 0:50 GMT+09:00 goun na :
>>
>> Hi, Jone Zhang
>>
>> 1. Hive UDF
>> You might need collect_set or collect_list (to eliminate duplication),
but make sure reduce its cardinality before applying UDFs as it can cause
problems while handling 1 billion records. Union dataset 1,2,3 -> group by
user_id1 -> collect_set (feature column) would works.
>>
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
>>
>> 2.Spark Dataframe Pivot
>>
https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
>>
>> - Goun
>>
>> 2017-05-15 22:15 GMT+09:00 Jone Zhang :
>>>
>>> For example
>>> Data1(has 1 billion records)
>>> user_id1  feature1
>>> user_id1  feature2
>>>
>>> Data2(has 1 billion records)
>>> user_id1  feature3
>>>
>>> Data3(has 1 billion records)
>>> user_id1  feature4
>>> user_id1  feature5
>>> ...
>>> user_id1  feature100
>>>
>>> I want to get the result as follow
>>> user_id1  feature1 feature2 feature3 feature4 feature5...feature100
>>>
>>> Is there a more efficient way except join?
>>>
>>> Thanks!
>>
>>
>


Re: Hive LLAP with Parquet format

2017-05-04 Thread Edward Capriolo
The Parquet vs. ORC thing has to be the biggest detractor. You're forced to
choose between a format good for Impala or one good for Hive.

On May 4, 2017 3:57 PM, "Gopal Vijayaraghavan"  wrote:

> Hi,
>
>
> > Does Hive LLAP work with Parquet format as well?
>
>
>
> LLAP does work with the Parquet format, but it does not work very fast,
> because the java Parquet reader is slow.
>
> https://issues.apache.org/jira/browse/PARQUET-131
> +
>
> https://issues.apache.org/jira/browse/HIVE-14826
>
> In particular to your question, Parquet's columnar data reads haven't been
> optimized for Azure/S3/GCS.
>
> There was a comparison of ORC vs Parquet for NYC taxi data and it found
> that for simple queries Parquet read ~4x more data over the network - your
> problem might be bandwidth related.
>
> You might want to convert a small amount to ORC and see whether the
> BYTES_READ drops or not.
>
> In my tests with a recent LLAP, Text data was faster on LLAP on S3 & Azure
> than Parquet, because Text has a vectorized reader & cache support.
>
> Cheers,
>
> Gopal
>


Re: Error with Hive 2.1.1 and Spark 2.1

2017-04-18 Thread Edward Capriolo
On Tue, Apr 18, 2017 at 3:32 PM, hernan saab 
wrote:

> The effort of configuring an apache big data system by hand for your
> particular needs is equivalent to herding rattlesnakes and cats into one
> small room.
> The documentation is poor and most of the time the community developers
> don't really feel like helping you.
> Use Ambari or any other orchestration tool you can find. It will save you
> a lot of angry moments and time.
>
>
>
>
> On Tuesday, April 18, 2017 11:45 AM, Vihang Karajgaonkar <
> vih...@cloudera.com> wrote:
>
>
> +sergio
>
> Thank you for pointing this out. Based on what I see here
> https://github.com/apache/hive/blob/branch-2.1/pom.xml#L179 Hive 2.1
> supports Sparks 1.6. There is a JIRA to add support for Spark 2.0 
> https://issues.apache.org/jira/browse/HIVE-14029
> but that is available from Hive 2.2.x
>
> I have created https://issues.apache.org/jira/browse/HIVE-16472 to fix the wiki for
> documentation issues and any bugs in the code if needed.
>
> On Mon, Apr 17, 2017 at 6:19 PM, hernan saab  > wrote:
>
> IMO, that page is a booby trap for the newbies to make them waste their
> time needlessly.
> As far as I know Hive on Spark does not work today.
> I would be the reason that page still stays on is because there is a level
> of shame in the Hive dev community that a feature like this should be
> functional by now.
> DO NOT USE SPARK ON HIVE.
> Instead use Tez on Hive.
>
> Hernan
>
>
>
> On Monday, April 17, 2017 3:45 PM, Krishnanand Khambadkone <
> kkhambadk...@yahoo.com> wrote:
>
>
> Hi,   I am trying to run Hive queries by using Spark as the execution
> engine.   I am following the instructions on this page,
>
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
> 
>
> When I try to run my query which is. a simple count(*) command, I get this
> error,
>
> Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.
> metadata.HiveException(Failed to create spark client.)'
> FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.
> exec.spark.SparkTask
>
>
>
>
>
>
>
When you choose a package or a tool you do not always get the version you
want. We (hive pmc) discussed "support" in our private list. In short, you
can not expect software released 1 year ago to be drop-in-replaced by
something released a month ago.

The plan I am going to put forward is that Hive binaries will come shipped
with all of their dependencies. Then one version of Hive supports one version
of X explicitly, and for all other versions it is use at your own risk.

Edward


Re: [ANNOUNCE] Apache Hive 1.2.2 Released

2017-04-08 Thread Edward Capriolo
Nice job

On Saturday, April 8, 2017, Vaibhav Gumashta 
wrote:

> The Apache Hive team is proud to announce the release of Apache Hive version 
> 1.2.2.
>
> The Apache Hive (TM) data warehouse software facilitates querying and
> managing large datasets residing in distributed storage. Built on top
> of Apache Hadoop (TM), it provides, among others:
>
> * Tools to enable easy data extract/transform/load (ETL)
>
> * A mechanism to impose structure on a variety of data formats
>
> * Access to files stored either directly in Apache HDFS (TM) or in other
>   data storage systems such as Apache HBase (TM)
>
> * Query execution via Apache Hadoop MapReduce, Apache Tez and Apache Spark 
> frameworks.
>
> For Hive release details and downloads, please 
> visit:https://hive.apache.org/downloads.html
>
> Hive 1.2.2 Release Notes are available here:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12332952&styleName=Text&projectId=12310843
>
> We would like to thank the many contributors who made this release
> possible.
>
> Regards,
>
> The Apache Hive Team
>
>

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: Hive SerDe maven dependency

2017-03-29 Thread Edward Capriolo
You should match your hive versions as close as possible. It makes sense
that both hive and hadoop dependencies use a PROVIDED scope, this way if
you are building an assembly/fat/shaded jar the jar is as thin as possible.

On Wed, Mar 29, 2017 at 3:01 PM, srinu reddy  wrote:

>
>
> Hi
>
> I want to implement custom SerDe. But I confused to select the Hive SerDe
> version for maven dependency and also hadoop-core dependency
>
> Could any one please suggest me
>
>
> Below are the hadoop and hive versions which we are using
>
> HDP : 2.2.0
> Hadoop : 2.6.0
> Hive-Hcatalog : 0.14.0
>
>
> Thanks
> Srinu
>


Re: hive on spark - version question

2017-03-17 Thread Edward Capriolo
On Fri, Mar 17, 2017 at 2:56 PM, hernan saab 
wrote:

> I have been in a similar world of pain. Basically, I tried to use an
> external Hive to have user access controls with a spark engine.
> At the end, I realized that it was a better idea to use apache tez instead
> of a spark engine for my particular case.
>
> But the journey is what I want to share with you.
> The big data apache tools and libraries such as Hive, Tez, Spark, Hadoop ,
> Parquet etc etc are not interchangeable as we would like to think. There
> are very limited combinations for very specific versions. This is why tools
> like Ambari can be useful. Ambari sets a path of combos of versions known
> to work and the dirty work is done under the UI.
>
> More often than not, when you try a version that few people tried, you
> will get error messages that will derailed you and cause you to waste a lot
> of time.
>
> In addition, this group, as well as many other apache big data user
> groups,  provides extremely poor support for users. The answers you usually
> get are not even hints to a solution. Their answers usually translate to
> "there is nothing I am willing to do about your problem. If I did, I should
> get paid" in many cryptic ways.
>
> If you ask your question to the Spark group they will take you to the Hive
> group and viceversa (I can almost guarantee it based on previous
> experiences)
>
> But in hindsight, people who work on this kinds of things typically make
> more money that the average developers. If you make more $$s it makes sense
> learning this stuff is supposed to be harder.
>
> Conclusion, don't try it. Or try using Tez/Hive instead of Spark/Hive  if
> you are querying large files.
>
>
>
> On Friday, March 17, 2017 11:33 AM, Stephen Sprague 
> wrote:
>
>
> :(  gettin' no love on this one.   any SME's know if Spark 2.1.0 will work
> with Hive 2.1.0 ?  That JavaSparkListener class looks like a deal breaker
> to me, alas.
>
> thanks in advance.
>
> Cheers,
> Stephen.
>
> On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague 
> wrote:
>
> hi guys,
> wondering where we stand with Hive On Spark these days?
>
> i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental
> versions) and running up against this class not found:
>
> java.lang. NoClassDefFoundError: org/apache/spark/ JavaSparkListener
>
>
> searching the Cyber i find this:
> 1. http://stackoverflow.com/questions/41953688/setting-spark-as-default-execution-engine-for-hive
> 
>
> which pretty much describes my situation too and it references this:
>
>
> 2. https://issues.apache.org/jira/browse/SPARK-17563
> 
>
> which indicates a "won't fix" - but does reference this:
>
>
> 3. https://issues.apache.org/jira/browse/HIVE-14029
> 
>
> which looks to be fixed in hive 2.2 - which is not released yet.
>
>
> so if i want to use spark 2.1.0 with hive am i out of luck - until hive
> 2.2?
>
> thanks,
> Stephen.
>
>
>
>
>
Stephan,

I understand some of your frustration.  Remember that many in open source
are volunteering their time. This is why if you pay a vendor for support of
some software you might pay 50K a year or $200.00 an hour. If I was your
vendor/consultant I would have started the clock 10 minutes ago just to
answer this email :). The only "pay" I ever got from Hive is that I can use
it as a resume bullet point, and I wrote a book which pays me royalties.

As it relates specifically to your problem, when you see the trends you are
seeing it probably means you are in a minority of the user base. Either
you're doing something no one else is doing, you are too cutting edge, or no
one has an easy solution. Hive is making the move away from the classic
MapReduce; two other execution engines have been made: Tez and HiveOnSpark.
Because we are open source we allow people to "scratch an itch"; that is the
Apache way. From time to time it means something that was added stops being
viable because of lack of support.

I agree with your final assessment which is Tez is the most viable engine
for Hive. This is by no means a put down of the HiveOnSpark work and it
does not mean it will never be the most viable. By the same token, if the
versions fall out of sync and all that exists is complaints, the viability
speaks for itself.

Remember that keeping two fast moving things together is no easy chore. I
used to run Hive + Cassandra. Seems easy; crap, two versions of common CLI,
shade one version, everything works; crap, new Hive release has different
versions of Thrift, shade + patch; crap, now one of the other dependencies
is incompatible, fork + shade + patch. At some point you have to say to
yourself if I can not make critical mass of this solution such that I am
the only one doing/patching it then 

Databricks loves showing charts saying they are faster than hive

2017-02-28 Thread Edward Capriolo
https://databricks.com/blog/2017/02/28/voice-facebook-using-apache-spark-large-scale-language-model-training.html?utm_campaign=Open%20Source_content=47640295_medium=social_source=twitter


They always neglect to include the fact that Spark has a complete copy of
Hive inside of it!


[Discuss] tez jars ship with hive in indivisible fashion

2017-02-24 Thread Edward Capriolo
There are a few Hadoop vendors that make it an unnecessary burden on users
to get Tez running.

This forces users to compile and patch in Tez support.

IMHO this is shameful. These same vendors include all types of extra add-ons
like, say, HBase or even Mongo support.

This 'creative packaging' only serves to drive users away from Hive. Users
have to patch to get the performance they should get for free.

I am proposing we engineer Hive in such a way that it installs the execution
engines it wants and we manage those jars directly, so that a vendor can not
alter the execution engine offerings.


-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Only External tables can have an explicit location

2017-01-25 Thread Edward Capriolo
[Error 40003]: Only External tables can have an explicit location

Using Hive 1.2, I got this error. This was definitely not a requirement
before.

Why was this added? EXTERNAL used to mean only that dropping the table will
not drop the physical files.


Re: Maintaining big and complex Hive queries

2016-12-21 Thread Edward Capriolo
I have been contemplating attaching metadata for the query lineage to each
table, such that I can know where the data came from and have a one-click
regenerate button.
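
Even something as crude as stamping the generating query onto the table with
TBLPROPERTIES would cover most of it (a sketch; the table, columns, and the
script path are made up):

CREATE TABLE daily_revenue
STORED AS ORC
TBLPROPERTIES ('lineage.source' = 'orders JOIN payments',
               'lineage.generated_by' = 'etl/daily_revenue.hql')
AS
SELECT o.order_date, SUM(p.amount) AS revenue
FROM orders o JOIN payments p ON o.order_id = p.order_id
GROUP BY o.order_date;

The properties are free-form, so the regenerate button just has to read them
back and re-run whatever they point at.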

On Wed, Dec 21, 2016 at 3:02 PM, Stephen Sprague  wrote:

> my 2 cents. :)
>
> as soon as you say "complex query" i would submit you've lost the
> upperhand and you're behind the eight-ball right off the bat.  And you know
> this too otherwise you wouldn't have posted here. ha!
>
> i use cascading CTAS statements so that i can examine the intermediate
> tables.  Another approach is to use CTE's but while that makes things
> easier to read it's still one big query and you don't get insight to the
> "work" tables.
>
> yes, it could take longer execution time if those intermediate tables
> can't be run in parallel but small price to pay compared to human debug
> time in my book anyway.
>
> thoughts?
>
> Cheers,
> Stephen.
>
>
>
>
>
> On Wed, Dec 21, 2016 at 10:07 AM, Saumitra Shahapure <
> saumitra.offic...@gmail.com> wrote:
>
>> Hi Elliot,
>>
>> Thanks for letting me know. HPL-SQL sounded particularly interesting. But
>> in the documentation I could not see any way to pass output generated by
>> one Hive query to the next one. The tool looks good as a homogeneous PL-SQL
>> platform for multiple big-data systems (http://www.hplsql.org/about).
>>
>> However in order to break single complex hive query, DDLs look to be only
>> way in HPL-SQL too. Or is there any alternate way that I might have missed?
>>
>> -- Saumitra S. Shahapure
>>
>> On Thu, Dec 15, 2016 at 6:21 PM, Elliot West  wrote:
>>
>>> I notice that HPL/SQL is not mentioned on the page I referenced, however
>>> I expect that is another approach that you could use to modularise:
>>>
>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pa
>>> geId=59690156
>>> http://www.hplsql.org/doc
>>>
>>> On 15 December 2016 at 17:17, Elliot West  wrote:
>>>
 Some options are covered here, although there is no definitive guidance
 as far as I know:

 https://cwiki.apache.org/confluence/display/Hive/Unit+Testin
 g+Hive+SQL#UnitTestingHiveSQL-Modularisation

 On 15 December 2016 at 17:08, Saumitra Shahapure <
 saumitra.offic...@gmail.com> wrote:

> Hello,
>
> We are running and maintaining quite big and complex Hive SELECT query
> right now. It's basically a single SELECT query which performs JOIN of
> about ten other SELECT query outputs.
>
> A simplest way to refactor that we can think of is to break this query
> down into multiple views and then join the views. There is similar
> possibility to create intermediate tables.
>
> However creating multiple DDLs in order to maintain a single DML is
> not very smooth. We would end up polluting metadata database by creating
> views / intermediate tables which are used in just this ETL.
>
> What are the other efficient ways to maintain complex SQL queries
> written in Hive? Are there better ways to break Hive query into multiple
> modules?
>
> -- Saumitra S. Shahapure
>


>>>
>>
>


Re: Hive Serialization issues

2016-11-23 Thread Edward Capriolo
I believe JSON itself has encoding rules. What I suggest you do is build
your own input format or SerDe and escape those fields, possibly by
converting them to hex.

On Wednesday, November 23, 2016, Dana Ram Meghwal  wrote:

> Hey,
> Any leads?
>
> On Tue, Nov 22, 2016 at 5:35 PM, Dana Ram Meghwal  > wrote:
>
>> Hey All,
>>
>> I am using Hive 2.0 with external meta-store on EMR-5.0.0 and TEZ as
>> execution engine.
>> Our data are stored in json format so for serialization and
>> deserialization purpose we are planning to use lazy serde
>> (classname is  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' ).
>>
>> My table definition is
>>
>> CREATE EXTERNAL TABLE IF NOT EXISTS 
>> daily_active_users_summary_json_partition_dt_paths_v1
>> (uid string, city string, user string, songcount string, songid_list
>> array  ) PARTITIONED BY ( dt string)
>>
>>  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
>>
>>  WITH SERDEPROPERTIES ('paths'='uid,city,user,songcount,songid_list')
>>
>>  LOCATION 's3:///users/daily_active_us
>> ers_summary_json_partition_dt';
>>
>>
>> and data look like this---
>>
>> {"uid":"xx","listening_user_flag":"non_listening","
>> platform":"android","model":"micromax a110q","aquisition_channel":"o
>> rganic","state":"delhi","app_version":"3.2:","country":"IN","city":"new
>> delhi","new_listening_user_flag":"non_listening","manufactur
>> er":"Micromax","login_mode":"loggedout","new_user_flag":"
>> returning","digital_channel":"Not Source"}
>>
>>
>> Note: I have pasted here one record in table.
>>
>>
>> Now, When I do query
>>
>> select * from daily_active_users_summary_json_partition_dt_paths_v1
>> limit 5;
>>
>>
>> the first field of table takes the complete record and rest of field are
>> showing to be NULL.
>>
>> When I use different serde  'org.apache.hive.hcatalog.data.JsonSerDe'
>>
>> then I can see the above query works fine and able to serialize data
>> perfectly fine. We want to user the lazy serde because our data contains
>> non-utf-8 character and the later serde does not support non-utf-8
>> character serialization/deserialization.
>>
>>
>> Can you please help me solve this, we mostly want to use lazy serde only
>> as we have already experimented with other serde's none of them is working
>> for us Is there any configuration which enable
>> serialization/deserialization while using lazy Serde.
>>
>> Or is there any other serde which can fine process non-utf-8 character in
>> hive-2 and tez.
>>
>> Thank you
>>
>>
>> Best Regards,
>> Dana Ram Meghwal
>> Software Engineer
>> dana...@saavn.com 
>>
>>
>
>
> --
> Dana Ram Meghwal
> Software Engineer
> dana...@saavn.com 
>
>

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: Adding a New Primitive Type in Hive

2016-11-19 Thread Edward Capriolo
Technically it is very doable; timestamps and the decimal types have been
added over the years. It actually turns out to be a fair amount of work,
mostly due to the proliferation of SerDes that need to be able to read/write
the new type.
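
The usual workaround, for what it is worth, is to keep the column a STRING
and enforce the shape on the way in with the same regex (a sketch; the table
names are made up):

CREATE TABLE awesome (
  users STRING,
  id STRING
);

INSERT INTO TABLE awesome
SELECT users, id
FROM staging_awesome
WHERE lower(id) RLIKE '^[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$';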

On Sat, Nov 19, 2016 at 4:11 PM, Juan Delard de Rigoulières <
j...@datarepublic.io> wrote:

> Hi,
> We'd like to extend Hive to support a new primitive type. For simplicity
> sake, think of UUID. (https://en.wikipedia.org/wiki
> /Universally_unique_identifier)
> UUIDs are string with a particular/simple structure - known regex
> matchable. (/^[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{
> 3}-[0-9a-f]{12}$/i)
> We've looked into serde & udf but it doesn't seem elegant enough, so that
> it's possible to write DDLs like:
> *CREATE TABLE `awesome` {*
> *  users STRING,*
> *  id UUID*
> *};*
> We are looking to validation of values on ingestion (INSERT); so in the
> example, values for the second column will get validated as UUID records.
> Thanks in advance.
>
> Juan
>
>


Re: Big Data Event London, 3-4th November 2016 from Tomorrow

2016-11-02 Thread Edward Capriolo
Mich,

 Looking through the event, only a few talks seem to be about Hadoop and none
mention Hive.

I understand how hive and this conference relate but I believe this is off
topic for the hive mailing list.

Thank you,
Edward

On Wednesday, November 2, 2016, Mich Talebzadeh 
wrote:

> Hi,
>
> For those in London there is this Big Data  event
>
>
> There are some interesting talks as per attached schedule.
>
> Anyone coming?
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: Quota for rogue ad-hoc queries

2016-09-01 Thread Edward Capriolo
I have written nagios scripts that watch the job tracker UI and report when
things take too long.

On Thu, Sep 1, 2016 at 11:08 AM, Loïc Chanel 
wrote:

> On the topic of timeout, if I may say, they are a dangerous way to deal
> with requests as a "good" request may last longer than an "evil" one.
> Be sure timeouts won't kill any important job before putting them into
> place. You can set these things on in the components (Tez, MapReduce ...)
> parameters, but not directly into YARN. At least it was the case when I
> tried this (one year ago).
>
> Regards,
>
> Loïc CHANEL
> System & virtualization engineer
> TO - XaaS Ind - Worldline (Villeurbanne, France)
>
> 2016-09-01 16:52 GMT+02:00 Stephen Sprague :
>
>> > rogue queries
>>
>> so this really isn't limited to just hive is it?  any dbms system perhaps
>> has to contend with this.  even malicious rogue queries as a matter of fact.
>>
>> timeouts are cheap way systems handle this - assuming time is related to
>> resource. i'm sure beeline or whatever client you use has a timeout feature.
>>
>> maybe one could write a separate service - say a governor - that watches
>> over YARN (or hdfs or whatever resource is rare) - and terminates the
>> process if it goes beyond a threshold.  think OOM killer.
>>
>> but, yeah, i admittedly don't know of something out there already you can
>> just tap into but YARN's Resource Manager seems to be place i'd research
>> for starters. Just look look at its name. :)
>>
>> my unsolicited 2 cents.
>>
>>
>>
>> On Wed, Aug 31, 2016 at 10:24 PM, ravi teja  wrote:
>>
>>> Thanks Mich,
>>>
>>> Unfortunately we have many insert queries.
>>> Are there any other ways?
>>>
>>> Thanks,
>>> Ravi
>>>
>>> On Wed, Aug 31, 2016 at 9:45 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Trt this

 hive.limit.optimize.fetch.max

- Default Value: 5
- Added In: Hive 0.8.0

 Maximum number of rows allowed for a smaller subset of data for simple
 LIMIT, if it is a fetch query. Insert queries are not restricted by this
 limit.


 HTH

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 31 August 2016 at 13:42, ravi teja  wrote:

> Hi Community,
>
> Many users run adhoc hive queries on our platform.
> Some rogue queries managed to fill up the hdfs space and causing
> mainstream queries to fail.
>
> We wanted to limit the data generated by these adhoc queries.
> We are aware of strict param which limits the data being scanned, but
> it is of less help as huge number of user tables aren't partitioned.
>
> Is there a way we can limit the data generated from hive per query,
> like a hve parameter for setting HDFS quotas for job level *scratch*
> directory or any other approach?
> What's the general approach to gaurdrail such multi-tenant cases.
>
> Thanks in advance,
> Ravi
>


>>>
>>
>


Re: does Hive implement any Combiner by default?

2016-08-08 Thread Edward Capriolo
Hive uses map side aggregation instead of combiners

http://dev.bizo.com/2013/02/map-side-aggregations-in-apache-hive.html
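
In other words, the "combining" happens inside the map task's hash
aggregation, so the COMBINE_* counters stay at zero even though less data is
shuffled. A sketch with a TPC-H style query (hive.map.aggr is the relevant
switch):

SET hive.map.aggr=true;

SELECT l_returnflag, SUM(l_quantity) AS qty
FROM lineitem
GROUP BY l_returnflag;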

On Mon, Aug 8, 2016 at 2:59 PM, Edson Ramiro  wrote:

> hi all,
>
> I'm executing TPC-H on Hive 2.0.1, using Yarn 2.7, and I'm wondering if
> Hive implements any Combiner by default? If so, how do I enable it?
>
> I am asking this because I checked the values of COMBINE_INPUT_RECORDS and
> COMBINE_OUTPUT_RECORDS and they are always zero.
>
> Thanks in advance,
>
>   Edson Ramiro
>


Re: Re: hive will die or not?

2016-08-07 Thread Edward Capriolo
A few entities were going to "kill/take out/be better than hive".
I seem to remember HadoopDB, Impala, Redshift, VoltDB...

But apparently hive is still around and probably faster:
http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final




On Sun, Aug 7, 2016 at 9:49 PM, 理  wrote:

> in  my opinion, multiple  engine  is not  advantage,  but reverse.  it
>  disperse  the dev energy.
>   consider  the activity ,sparksql  support  all  tpc ds without modify
> syntax!  but  hive cannot.
> consider the tech,   dag, vectorization,   etc sparksql also has,   seems
> the  code  is  more   efficiently.
>
>
> regards
> On 08/08/2016 08:48, Will Du  wrote:
>
> First, hive supports different engines. Look forward it's dynamic engine
> switch
> Second, look forward hadoop 3rd gen and map reduce on memory will fill the
> gap
>
> Thanks,
> Will
>
> On 2016年8月7日, at 20:27, 理  wrote:
>
> hi,
>   sparksql improve  so fast,   both  hive and sparksql  are similar,  so
> hive  will  lost  or not?
>
> regards
>
>
>
>
>
>


Re: hue / hive issue with sqlite

2016-08-07 Thread Edward Capriolo
The "database" that is locked has nothing to do with Hive; the problem is
completely a Hue problem. Find the appropriate Hue mailing list to get help.

On Sun, Aug 7, 2016 at 9:11 AM, Sumit Khanna <sumit.kha...@askme.in> wrote:

> Hey Edward,
>
> would still the issue ? How often? As in the storage format in here is
> parquet, and am able to view sample sets for each column and raw select
> queries are working just fine, but none of min / max / distinct / where 'd
> work.
>
> Thanks,
>
> On Sun, Aug 7, 2016 at 6:38 PM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>> You need to take this up with the appropriate hue/cloudera user group.
>> One issue is that SQL lite is a embedded single user database and does not
>> work well with more than one user. We switched to postges in our deployment
>> and would still hit this issue. I never got it resolved,
>>
>> On Sun, Aug 7, 2016 at 4:42 AM, Sumit Khanna <sumit.kha...@askme.in>
>> wrote:
>>
>>> Hello,
>>>
>>> we have hue configured against sqlite as default database.
>>>
>>> *queries like select height from students limit 222;* work
>>>
>>> but *queries like select max(height) from students;* wont.
>>>
>>> in fact that displays " database is locked " error message.
>>>
>>> is sqlite  / hue and as in not migrating to mysql the only reason?
>>>
>>> Kindly let me know. The sample data (which is nothing but select
>>> queries) are working / displaying up from hue UI too. just that any queries
>>> which actually involve M/R ( a reducer) arent working.
>>>
>>> Thanks,
>>> Sumit
>>>
>>
>>
>


Re: hue / hive issue with sqlite

2016-08-07 Thread Edward Capriolo
You need to take this up with the appropriate Hue/Cloudera user group. One
issue is that SQLite is an embedded single-user database and does not work
well with more than one user. We switched to Postgres in our deployment and
would still hit this issue. I never got it resolved.

On Sun, Aug 7, 2016 at 4:42 AM, Sumit Khanna  wrote:

> Hello,
>
> we have hue configured against sqlite as default database.
>
> *queries like select height from students limit 222;* work
>
> but *queries like select max(height) from students;* wont.
>
> in fact that displays " database is locked " error message.
>
> is sqlite  / hue and as in not migrating to mysql the only reason?
>
> Kindly let me know. The sample data (which is nothing but select queries)
> are working / displaying up from hue UI too. just that any queries which
> actually involve M/R ( a reducer) arent working.
>
> Thanks,
> Sumit
>


Re: A dedicated Web UI interface for Hive

2016-07-15 Thread Edward Capriolo
I built one a long time ago; it is still in the tree but very defunct. The
best bet is working with Hue or whatever Hortonworks is pushing, so as not to
fragment 4 ways.
On Jul 15, 2016 12:56 PM, "Mich Talebzadeh" 
wrote:

> Hi Marcin,
>
> For Hive on Spark I can use Spark 1.3.1 UI which does not have DAG diagram
> (later versions like 1.6.1 have it). But yes you are correct.
>
> However, I was certain that Gopal was working on a UI interface if my
> memory serves right.
>
> Cheers,
>
> Mich
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 15 July 2016 at 16:08, Marcin Tustin  wrote:
>
>> I was thinking of query and admin interfaces.
>>
>> There's ambari, which has plugins for introspecting what's up with tez
>> sessions. I can't use those because I don't use the yarn history server (I
>> find it very flaky).
>>
>> There's also hue, which is a query interface.
>>
>> If you're running on spark as the execution engine, can you not use the
>> spark UI for those applications to see what's up with hive?
>>
>> On Fri, Jul 15, 2016 at 3:19 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Marcin,
>>>
>>> Which two web interfaces are these. I know the usual one on 8088 any
>>> other one?
>>>
>>> I want something in line with what Spark provides. I thought Gopal has
>>> got something:
>>>
>>> [image: Inline images 1]
>>>
>>>
>>> Cheers
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 14 July 2016 at 23:29, Marcin Tustin  wrote:
>>>
 What do you want it to do? There are at least two web interfaces I can
 think of.

 On Thu, Jul 14, 2016 at 6:04 PM, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Hi Gopal,
>
> If I recall you were working on a UI support for Hive. Currently the
> one available is the standard Hadoop one on port 8088.
>
> Do you have any timelines which release of Hive is going to have this
> facility?
>
> Thanks,
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any loss, damage or destruction of data or any other property which may
> arise from relying on this email's technical content is explicitly
> disclaimed. The author will in no case be liable for any monetary damages
> arising from such loss, damage or destruction.
>
>
>




>>>
>>
>>
>>
>


Re: Analyzing Bitcoin blockchain data with Hive

2016-05-01 Thread Edward Capriolo
Good stuff!

On Fri, Apr 29, 2016 at 1:30 PM, Jörn Franke  wrote:

> Dear all,
>
> I prepared a small Serde to analyze Bitcoin blockchain data with Hive:
>
> https://snippetessay.wordpress.com/2016/04/28/hive-bitcoin-analytics-on-blockchain-data-with-sql/
>
> There are some example queries, but I will add some in the future.
> Additionally, more unit tests will be added.
>
> Let me know if this is useful for you and of course please report bugs ;)
>
> Thank you.
>
> Cheers
>


Re: [VOTE] Bylaws change to allow some commits without review

2016-04-22 Thread Edward Capriolo
+1

On Friday, April 22, 2016, Lars Francke  wrote:

> Yet another update. I went through the PMC list.
>
> These seven have not been active (still the same list as Vikram posted
> during the last vote):
> Ashish Thusoo
> Kevin Wilfong
> He Yongqiang
> Namit Jain
> Joydeep Sensarma
> Ning Zhang
> Raghotham Murthy
>
> There are 29 PMCs in total - 7 = 22 active * 2/3 = 15 votes required
>
> So far the following PMCs have voted:
>
> Alan Gates
> Jason Dere
> Sushanth Sowmyan
> Lefty Leverenz
> Navis Ryu
> Owen O'Malley
> Prasanth J
> Sergey Shelukhin
> Thejas Nair
>
> = 9 +1s
>
> So I'm hoping for six more. I've contacted a bunch of PMCs (sorry for the
> spam!) and hope to get a few more.
>
> In addition there have been six non-binding +1s. Thank you everyone for
> voting.
>
>
>
>
>
> On Fri, Apr 22, 2016 at 10:42 PM, Lars Francke  > wrote:
>
>> Hi everyone, thanks for the votes. I've been held up by personal stuff
>> this week but as there have been no -1s or other objections I'd like to
>> keep this vote open a bit longer until I've had time to go through the PMCs
>> and contact those that have not yet voted.
>>
>> On Thu, Apr 21, 2016 at 9:12 PM, Denise Rogers > > wrote:
>>
>>> +1
>>>
>>> Regards,
>>> Denise
>>> Cell - (860)989-3431
>>>
>>> Sent from mi iPhone
>>>
>>> On Apr 21, 2016, at 2:56 PM, Sergey Shelukhin >> > wrote:
>>>
>>> +1
>>>
>>> From: Tim Robertson >> >
>>> Reply-To: "user@hive.apache.org
>>> " <
>>> user@hive.apache.org
>>> >
>>> Date: Wednesday, April 20, 2016 at 06:17
>>> To: "user@hive.apache.org
>>> " <
>>> user@hive.apache.org
>>> >
>>> Subject: Re: [VOTE] Bylaws change to allow some commits without review
>>>
>>> +1
>>>
>>> On Wed, Apr 20, 2016 at 1:24 AM, Jimmy Xiang >> > wrote:
>>>
 +1

 On Tue, Apr 19, 2016 at 2:58 PM, Alpesh Patel > wrote:
 > +1
 >
 > On Tue, Apr 19, 2016 at 1:29 PM, Lars Francke >
 > wrote:
 >>
 >> Thanks everyone! Vote runs for at least one more day. I'd appreciate
 it if
 >> you could ping/bump your colleagues to chime in here.
 >>
 >> I'm not entirely sure how many PMC members are active and how many
 votes
 >> we need but I think a few more are probably needed.
 >>
 >> On Mon, Apr 18, 2016 at 8:02 PM, Thejas Nair >
 >> wrote:
 >>>
 >>> +1
 >>>
 >>> 
 >>> From: Wei Zheng >
 >>> Sent: Monday, April 18, 2016 10:51 AM
 >>> To: user@hive.apache.org
 
 >>> Subject: Re: [VOTE] Bylaws change to allow some commits without
 review
 >>>
 >>> +1
 >>>
 >>> Thanks,
 >>> Wei
 >>>
 >>> From: Siddharth Seth >
 >>> Reply-To: "user@hive.apache.org
 " <
 user@hive.apache.org
 >
 >>> Date: Monday, April 18, 2016 at 10:29
 >>> To: "user@hive.apache.org
 " <
 user@hive.apache.org
 >
 >>> Subject: Re: [VOTE] Bylaws change to allow some commits without
 review
 >>>
 >>> +1
 >>>
 >>> On Wed, Apr 13, 2016 at 3:58 PM, Lars Francke <
 lars.fran...@gmail.com
 >
 >>> wrote:
 
  Hi everyone,
 
  we had a discussion on the dev@ list about allowing some forms of
  contributions to be committed without a review.
 
  The exact sentence I propose to add is: "Minor issues (e.g. typos,
 code
  style issues, JavaDoc changes. At committer's discretion) can be
 committed
  after soliciting feedback/review on the mailing list and not
 receiving
  feedback within 2 days."
 
  The proposed bylaws can also be seen 

Re: Column type conversion in Hive

2016-03-21 Thread Edward Capriolo
Explicit conversion is done using cast(x AS BIGINT).

You said: As a matter of interest what is the underlying storage for
Integer?

The on-disk storage is dictated by the InputFormat; the "temporary in-memory
format" is dictated by the SerDe. An integer could be stored as the text "1" or
in some binary encoding, depending on the InputFormat and SerDe/storage handler
in use.
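
For reference, a minimal sketch of doing the same conversion explicitly, using
the s and o tables from the example below; note that a string that does not
parse as a number becomes NULL rather than raising an error:

create table o (col1 INT);
insert into o select cast(col1 AS INT) from s;

-- a non-numeric string yields NULL instead of failing
select cast('abc' AS INT) from s limit 1;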

On Sun, Mar 20, 2016 at 6:27 PM, Mich Talebzadeh 
wrote:

>
> As a matter of interest how does how cast columns from say String to
> Integer implicitly?
>
> For example the following shows this
>
> create table s(col1 String);
> insert into s values("1");
> insert into s values("2");
>
> Now create a target table with col1 being integer
>
>
> create table o (col1 Int);
> insert into o select * from s;
>
> select * from o;
> +-+--+
> | o.col1  |
> +-+--+
> | 1   |
> | 2   |
> +-+--+
>
> So this implicit column conversion from String to Integer happens without
> intervention in the code. As a matter of interest what is the underlying
> storage for Integer. In a conventional RDBMS this needs to be done through
> cast (CHAR AS INT) etc?
>
> Thanks
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


Re: Simple UDFS and IN Operator

2016-03-08 Thread Edward Capriolo
The IN UDF is a special one in that, unlike many others, there is support for it
in the ANTLR grammar and parser. The rough answer is that it can be done, but it
is not as direct as writing other UDFs.
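
One possible workaround (a sketch only, not tested against a heavily modified
0.11 build): rather than feeding a list-returning UDF to the IN operator, the
built-in array_contains() UDF can often express the same predicate, since it
takes the array and the probe value as ordinary arguments. getList() here is the
hypothetical UDF from the question, assumed to return array<int> whose element
inspector matches the column being probed.

SELECT *
FROM some_table
WHERE array_contains(getList(), some_column);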


On Tue, Mar 8, 2016 at 2:32 PM, Lavelle, Shawn 
wrote:

> Hello All,
>
>I hope that this question isn’t too rudimentary – but I’m relatively
> new to HIVE.
>
>
>
>In Hive 0.11, I’ve written a UDF that returns a list of Integers. I’d
> like to use this in a WHERE clause of a query, something like SELECT * FROM
>  WHERE   in ( getList() ). (Extra parenthesis needed to pass
> the parser.)  Is such a thing possible?  Keying in values for the list
> parameter works, but they have WritableConstantIntObjectInspectors whereas
> what is returned by my list (despite my best efforts) has an element
> inspector of WritabeIntObjectInspector. This doesn’t work.
>
>   So, two questions – Should It? (The HIVE I’m working on is heavily
> modified :/ ) and how might I accomplish this?  Joins would be ideal, but
> we haven’t upgraded yet.
>
>   Thank you for your insight,
>
>
>
> ~ Shawn M Lavelle
>
>
>
>
>
>
> Shawn Lavelle
> Software Development
>
> 4101 Arrowhead Drive
> Medina, Minnesota 55340-9457
> Phone: 763 551 0559
> Fax: 763 551 0750
> *Email:* shawn.lave...@osii.com
> *Website: **www.osii.com* 
>


Re: Hive and Impala

2016-03-01 Thread Edward Capriolo
My knocks on Impala (not intended to be a post knocking Impala):

Impala really has not delivered on the complex types that Hive has (after
promising them for quite a while), and it only works with the 'blessed'
input formats: Parquet, Avro, text.

It is very annoying to work with Impala: in my version, if you create a
partition in Hive, Impala does not see it. You have to run "refresh".

In Impala I do not have all the UDFs that Hive has, like percentile, etc.

Impala is fast. Many data-analyst / data-scientist types can't wait 10 seconds
for a query, so when I need to produce something for them I make sure the data
has no complex types and uses a table type that Impala understands.

But for my own work I still work primarily in Hive, because I do not want to
deal with all the things that Impala does not have (or might have), and when I
need something special like my own UDFs it is easier to whip up the solution
in Hive.

Having worked with M$ SQL Server and Vertica, Impala is on par with them, but I
don't think of it the way I think of Hive. To me it just feels like a Vertica
that I can sometimes cheat on loading because it is backed by HDFS.

Hive is something different: I am making pipelines, transforming data, doing
streaming, writing custom UDFs, querying JSON directly. It is simply not the
same thing as Impala.

::random message of the day::




On Tue, Mar 1, 2016 at 4:38 PM, Ashok Kumar  wrote:

>
> Dr Mitch,
>
> My two cents here.
>
> I don't have direct experience of Impala but in my humble opinion I share
> your views that Hive provides the best metastore of all Big Data systems.
> Looking around almost every product in one form and shape use Hive code
> somewhere. My colleagues inform me that Hive is one of the most stable Big
> Data products.
>
> With the capabilities of Spark on Hive and Hive on Spark or Tez plus of
> course MR, there is really little need for many other products in the same
> space. It is good to keep things simple.
>
> Warmest
>
>
> On Tuesday, 1 March 2016, 11:33, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> I have not heard of Impala anymore. I saw an article in LinkedIn titled
>
> "Apache Hive Or Cloudera Impala? What is Best for me?"
>
> "We can access all objects from Hive data warehouse with HiveQL which
> leverages the map-reduce architecture in background for data retrieval and
> transformation and this results in latency."
>
> My response was
>
> This statement is no longer valid as you have choices of three engines now
> with MR, Spark and Tez. I have not used Impala myself as I don't think
> there is a need for it with Hive on Spark or Spark using Hive metastore
> providing whatever needed. Hive is for Data Warehouse and provides what is
> says on the tin. Please also bear in mind that Hive offers ORC storage
> files that provide store Index capabilities further optimizing the queries
> with additional stats at file, stripe and row group levels.
>
> Anyway the question is with Hive on Spark or Spark using Hive metastore
> what we cannot achieve that we can achieve with Impala?
>
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
> http://talebzadehmich.wordpress.com
>
>
>
>


Re: TBLPROPERTIES K/V Comprehensive List

2016-02-19 Thread Edward Capriolo
There is no comprehensive list; each SerDe can use whatever parameters it
desires, while other SerDes use none at all.
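
For example (a sketch; the exact keys depend on the storage format and SerDe you
choose), the ORC writer reads its settings from table properties such as
orc.compress, while an Avro-backed table looks at keys like avro.schema.url:

CREATE TABLE t (id INT, name STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");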

On Fri, Feb 19, 2016 at 3:23 PM, mahender bigdata <
mahender.bigd...@outlook.com> wrote:

> +1, Any information available ?
>
> On 2/10/2016 1:26 AM, Mathan Rajendran wrote:
>
>> Hi ,
>>
>> Is there any place where I can see a list of Key/Value Pairs used in Hive
>> while creating a Table.
>>
>> I went through the code and find the java doc
>> hive_metastoreConstants.java is having few constants list but not the
>> complete list.
>>
>>
>> Eg. Compression like orc.compression and other properties are missing.
>>
>>
>> Regards,
>> Madhan
>>
>
>


Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Edward Capriolo
Lol, very offbeat convo for the Hive list. Let's not drag ourselves too far
down here.

On Wednesday, February 3, 2016, Stephen Sprague  wrote:

> i refuse to take anybody seriously who has a sig file longer than one line
> and  that there is just plain repugnant.
>
> On Wed, Feb 3, 2016 at 1:47 PM, Mich Talebzadeh 
> wrote:
>
> I just did some further tests joining a 5 million rows FACT tables with 2
> DIMENSION tables.
>
>
>
> SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS
> TotalSales
>
> FROM sales s, times t, channels c
>
> WHERE s.time_id = t.time_id
>
> AND   s.channel_id = c.channel_id
>
> GROUP BY t.calendar_month_desc, c.channel_desc
>
> ;
>
>
>
>
>
> Hive on Spark crashes, Hive with MR finishes in 85 sec and Spark on Hive
> finishes in 267 sec. I am trying to understand this behaviour
>
>
>
> OK I changed the three below parameters as suggested by Jeff
>
>
>
> export SPARK_EXECUTOR_CORES=12 ##, Number of cores for the workers
> (Default: 1).
>
> export SPARK_EXECUTOR_MEMORY=5G ## , Memory per Worker (e.g. 1000M, 2G)
> (Default: 1G)
>
> export SPARK_DRIVER_MEMORY=2G ## , Memory for Master (e.g. 1000M, 2G)
> (Default: 512 Mb)
>
>
>
>
>
> *1)**Hive 1.2.1 on Spark 1.3.1*
>
> It fails. Never completes.
>
>
>
> ERROR : Status: Failed
>
> Error: Error while processing statement: FAILED: Execution Error, return
> code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
> (state=08S01,code=3)
>
>
>
> *2)**Hive 1.2.1 on MR engine Looks good and completes in 85 sec*
>
>
>
> 0: jdbc:hive2://rhes564:10010/default> SELECT t.calendar_month_desc,
> c.channel_desc, SUM(s.amount_sold) AS TotalSales
>
> 0: jdbc:hive2://rhes564:10010/default> FROM sales s, times t, channels c
>
> 0: jdbc:hive2://rhes564:10010/default> WHERE s.time_id = t.time_id
>
> 0: jdbc:hive2://rhes564:10010/default> AND   s.channel_id = c.channel_id
>
> 0: jdbc:hive2://rhes564:10010/default> GROUP BY t.calendar_month_desc,
> c.channel_desc
>
> 0: jdbc:hive2://rhes564:10010/default> ;
>
> INFO  : Execution completed successfully
>
> INFO  : MapredLocal task succeeded
>
> INFO  : Number of reduce tasks not specified. Estimated from input data
> size: 1
>
> INFO  : In order to change the average load for a reducer (in bytes):
>
> INFO  :   set hive.exec.reducers.bytes.per.reducer=
>
> INFO  : In order to limit the maximum number of reducers:
>
> INFO  :   set hive.exec.reducers.max=
>
> INFO  : In order to set a constant number of reducers:
>
> INFO  :   set mapreduce.job.reduces=
>
> WARN  : Hadoop command-line option parsing not performed. Implement the
> Tool interface and execute your application with ToolRunner to remedy this.
>
> INFO  : number of splits:1
>
> INFO  : Submitting tokens for job: job_1454534517374_0002
>
> INFO  : The url to track the job:
> http://rhes564:8088/proxy/application_1454534517374_0002/
>
> INFO  : Starting Job = job_1454534517374_0002, Tracking URL =
> http://rhes564:8088/proxy/application_1454534517374_0002/
>
> INFO  : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
> job_1454534517374_0002
>
> INFO  : Hadoop job information for Stage-3: number of mappers: 1; number
> of reducers: 1
>
> INFO  : 2016-02-03 21:25:17,769 Stage-3 map = 0%,  reduce = 0%
>
> INFO  : 2016-02-03 21:25:29,103 Stage-3 map = 2%,  reduce = 0%, Cumulative
> CPU 7.52 sec
>
> INFO  : 2016-02-03 21:25:32,205 Stage-3 map = 5%,  reduce = 0%, Cumulative
> CPU 10.19 sec
>
> INFO  : 2016-02-03 21:25:35,295 Stage-3 map = 7%,  reduce = 0%, Cumulative
> CPU 12.69 sec
>
> INFO  : 2016-02-03 21:25:38,392 Stage-3 map = 10%,  reduce = 0%,
> Cumulative CPU 15.2 sec
>
> INFO  : 2016-02-03 21:25:41,502 Stage-3 map = 13%,  reduce = 0%,
> Cumulative CPU 17.31 sec
>
> INFO  : 2016-02-03 21:25:44,600 Stage-3 map = 16%,  reduce = 0%,
> Cumulative CPU 21.55 sec
>
> INFO  : 2016-02-03 21:25:47,691 Stage-3 map = 20%,  reduce = 0%,
> Cumulative CPU 24.32 sec
>
> INFO  : 2016-02-03 21:25:50,786 Stage-3 map = 23%,  reduce = 0%,
> Cumulative CPU 26.3 sec
>
> INFO  : 2016-02-03 21:25:52,858 Stage-3 map = 27%,  reduce = 0%,
> Cumulative CPU 28.52 sec
>
> INFO  : 2016-02-03 21:25:55,948 Stage-3 map = 31%,  reduce = 0%,
> Cumulative CPU 30.65 sec
>
> INFO  : 2016-02-03 21:25:59,032 Stage-3 map = 35%,  reduce = 0%,
> Cumulative CPU 32.7 sec
>
> INFO  : 2016-02-03 21:26:02,120 Stage-3 map = 40%,  reduce = 0%,
> Cumulative CPU 34.69 sec
>
> INFO  : 2016-02-03 21:26:05,217 Stage-3 map = 43%,  reduce = 0%,
> Cumulative CPU 36.67 sec
>
> INFO  : 2016-02-03 21:26:08,310 Stage-3 map = 47%,  reduce = 0%,
> Cumulative CPU 38.78 sec
>
> INFO  : 2016-02-03 21:26:11,408 Stage-3 map = 52%,  reduce = 0%,
> Cumulative CPU 40.7 sec
>
> INFO  : 2016-02-03 21:26:14,512 Stage-3 map = 56%,  reduce = 0%,
> Cumulative CPU 42.69 sec
>
> INFO  : 2016-02-03 21:26:17,607 Stage-3 map = 60%,  reduce = 0%,
> Cumulative CPU 44.69 sec
>
> INFO  : 2016-02-03 21:26:20,722 Stage-3 map = 64%,  reduce = 0%,
> Cumulative CPU 

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Edward Capriolo
Hive is not the correct tool for every problem. Use the tool that makes the
most sense for your problem and your experience.

Many people like Hive because it is generally applicable. In my case study
for the Hive book I highlighted that many smart, capable organizations use Hive.

Your argument is totally valid. You like X better because X works for you.
You don't need to 'preach' here; we all know Hive has its limits.

On Thu, Feb 4, 2016 at 10:55 AM, Koert Kuipers <ko...@tresata.com> wrote:

> Is the sky the limit? I know udfs can be used inside hive, like lambas
> basically i assume, and i will assume you have something similar for
> aggregations. But that's just abstractions inside a single map or reduce
> phase, pretty low level stuff. What you really need is abstractions around
> many map and reduce phases, because that is the level an algo is expressed
> at.
>
> For example when doing logistic regression you want to be able to do
> something like:
> read("somefile").train(settings).write("model")
> Here train is an eternally defined method that is well tested and could do
> many map and reduce steps internally (or even be defined at a higher level
> and compile into those steps). What is the equivalent in hive? Copy pasting
> crucial parts of the algo around while using udfs is just not the same
> thing in terms of reusability and abstraction. Its the opposite of keeping
> it DRY.
> On Feb 3, 2016 1:06 AM, "Ryan Harris" <ryan.har...@zionsbancorp.com>
> wrote:
>
>> https://github.com/myui/hivemall
>>
>>
>>
>> as long as you are comfortable with java UDFs, the sky is really the
>> limit...it's not for everyone and spark does have many advantages, but they
>> are two tools that can complement each other in numerous ways.
>>
>>
>>
>> I don't know that there is necessarily a universal "better" for how to
>> use spark as an execution engine (or if spark is necessarily the **best**
>> execution engine for any given hive job).
>>
>>
>>
>> The reality is that once you start factoring in the numerous tuning
>> parameters of the systems and jobs there probably isn't a clear answer.
>> For some queries, the Catalyst optimizer may do a better job...is it going
>> to do a better job with ORC based data? less likely IMO.
>>
>>
>>
>> *From:* Koert Kuipers [mailto:ko...@tresata.com]
>> *Sent:* Tuesday, February 02, 2016 9:50 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>>
>>
>>
>> yeah but have you ever seen somewhat write a real analytical program in
>> hive? how? where are the basic abstractions to wrap up a large amount of
>> operations (joins, groupby's) into a single function call? where are the
>> tools to write nice unit test for that?
>>
>> for example in spark i can write a DataFrame => DataFrame that internally
>> does many joins, groupBys and complex operations. all unit tested and
>> perfectly re-usable. and in hive? copy paste round sql queries? thats just
>> dangerous.
>>
>>
>>
>> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo <edlinuxg...@gmail.com>
>> wrote:
>>
>> Hive has numerous extension points, you are not boxed in by a long shot.
>>
>>
>>
>> On Tuesday, February 2, 2016, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> uuuhm with spark using Hive metastore you actually have a real
>> programming environment and you can write real functions, versus just being
>> boxed into some version of sql and limited udfs?
>>
>>
>>
>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>>
>> When comparing the performance, you need to do it apple vs apple. In
>> another thread, you mentioned that Hive on Spark is much slower than Spark
>> SQL. However, you configured Hive such that only two tasks can run in
>> parallel. However, you didn't provide information on how much Spark SQL is
>> utilizing. Thus, it's hard to tell whether it's just a configuration
>> problem in your Hive or Spark SQL is indeed faster. You should be able to
>> see the resource usage in YARN resource manage URL.
>>
>> --Xuefu
>>
>>
>>
>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk>
>> wrote:
>>
>> Thanks Jeff.
>>
>>
>>
>> Obviously Hive is much more feature rich compared to Spark. Having said
>> that in certain areas for example where the SQL feature is available in
>> Spark, Spark seems to deliver faster.
>>
>>

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-03 Thread Edward Capriolo
Thank you for the speech. There is an infinite list of things Hive does not
do / can't do well.
There is an infinite list of things Spark does not do / can't do well.

Some facts:
1) Spark has a complete fork of Hive inside it, so it's hard to trash Hive
without at least noting the fact that it's a portion of Spark's guts.
2) There were lots of people touting benchmarks about Spark SQL beating
Hive, lots of FUD about Catalyst awesome sauce. But then it seems like Hive
and Tez made Spark say uncle...
https://www.slideshare.net/mobile/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final


On Wednesday, February 3, 2016, Koert Kuipers <ko...@tresata.com> wrote:

> ok i am sure there is some way to do it. i am going to guess snippets of
> hive code stuck together with oozie jobs or whatever. the oozie jobs become
> the re-usable pieces perhaps? now you got sql and xml, completely lacking
> any benefits of a compiler to catch errors. unit tests will be slow if even
> available at all. so yeah
> yeah i am sure it can be made to *work*. just like you can get a nail into
> a wall with a screwdriver if you really want.
>
> On Tue, Feb 2, 2016 at 11:49 PM, Koert Kuipers <ko...@tresata.com
> <javascript:_e(%7B%7D,'cvml','ko...@tresata.com');>> wrote:
>
>> yeah but have you ever seen somewhat write a real analytical program in
>> hive? how? where are the basic abstractions to wrap up a large amount of
>> operations (joins, groupby's) into a single function call? where are the
>> tools to write nice unit test for that?
>>
>> for example in spark i can write a DataFrame => DataFrame that internally
>> does many joins, groupBys and complex operations. all unit tested and
>> perfectly re-usable. and in hive? copy paste round sql queries? thats just
>> dangerous.
>>
>> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo <edlinuxg...@gmail.com
>> <javascript:_e(%7B%7D,'cvml','edlinuxg...@gmail.com');>> wrote:
>>
>>> Hive has numerous extension points, you are not boxed in by a long shot.
>>>
>>>
>>> On Tuesday, February 2, 2016, Koert Kuipers <ko...@tresata.com
>>> <javascript:_e(%7B%7D,'cvml','ko...@tresata.com');>> wrote:
>>>
>>>> uuuhm with spark using Hive metastore you actually have a real
>>>> programming environment and you can write real functions, versus just being
>>>> boxed into some version of sql and limited udfs?
>>>>
>>>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com>
>>>> wrote:
>>>>
>>>>> When comparing the performance, you need to do it apple vs apple. In
>>>>> another thread, you mentioned that Hive on Spark is much slower than Spark
>>>>> SQL. However, you configured Hive such that only two tasks can run in
>>>>> parallel. However, you didn't provide information on how much Spark SQL is
>>>>> utilizing. Thus, it's hard to tell whether it's just a configuration
>>>>> problem in your Hive or Spark SQL is indeed faster. You should be able to
>>>>> see the resource usage in YARN resource manage URL.
>>>>>
>>>>> --Xuefu
>>>>>
>>>>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk>
>>>>> wrote:
>>>>>
>>>>>> Thanks Jeff.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Obviously Hive is much more feature rich compared to Spark. Having
>>>>>> said that in certain areas for example where the SQL feature is available
>>>>>> in Spark, Spark seems to deliver faster.
>>>>>>
>>>>>>
>>>>>>
>>>>>> This may be:
>>>>>>
>>>>>>
>>>>>>
>>>>>> 1.Spark does both the optimisation and execution seamlessly
>>>>>>
>>>>>> 2.Hive on Spark has to invoke YARN that adds another layer to
>>>>>> the process
>>>>>>
>>>>>>
>>>>>>
>>>>>> Now I did some simple tests on a 100Million rows ORC table available
>>>>>> through Hive to both.
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> spark-sql> select * from dummy where id in (1, 5, 10);
>&g

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Edward Capriolo
Hive has numerous extension points; you are not boxed in by a long shot.

On Tuesday, February 2, 2016, Koert Kuipers  wrote:

> uuuhm with spark using Hive metastore you actually have a real
> programming environment and you can write real functions, versus just being
> boxed into some version of sql and limited udfs?
>
> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  > wrote:
>
>> When comparing the performance, you need to do it apple vs apple. In
>> another thread, you mentioned that Hive on Spark is much slower than Spark
>> SQL. However, you configured Hive such that only two tasks can run in
>> parallel. However, you didn't provide information on how much Spark SQL is
>> utilizing. Thus, it's hard to tell whether it's just a configuration
>> problem in your Hive or Spark SQL is indeed faster. You should be able to
>> see the resource usage in YARN resource manage URL.
>>
>> --Xuefu
>>
>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh > > wrote:
>>
>>> Thanks Jeff.
>>>
>>>
>>>
>>> Obviously Hive is much more feature rich compared to Spark. Having said
>>> that in certain areas for example where the SQL feature is available in
>>> Spark, Spark seems to deliver faster.
>>>
>>>
>>>
>>> This may be:
>>>
>>>
>>>
>>> 1.Spark does both the optimisation and execution seamlessly
>>>
>>> 2.Hive on Spark has to invoke YARN that adds another layer to the
>>> process
>>>
>>>
>>>
>>> Now I did some simple tests on a 100Million rows ORC table available
>>> through Hive to both.
>>>
>>>
>>>
>>> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>>>
>>>
>>>
>>>
>>>
>>> spark-sql> select * from dummy where id in (1, 5, 10);
>>>
>>> 1   0   0   63
>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>>> xx
>>>
>>> 5   0   4   31
>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
>>> xx
>>>
>>> 10  99  999 188
>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
>>> xx
>>>
>>> Time taken: 50.805 seconds, Fetched 3 row(s)
>>>
>>> spark-sql> select * from dummy where id in (1, 5, 10);
>>>
>>> 1   0   0   63
>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>>> xx
>>>
>>> 5   0   4   31
>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
>>> xx
>>>
>>> 10  99  999 188
>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
>>> xx
>>>
>>> Time taken: 50.358 seconds, Fetched 3 row(s)
>>>
>>> spark-sql> select * from dummy where id in (1, 5, 10);
>>>
>>> 1   0   0   63
>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>>> xx
>>>
>>> 5   0   4   31
>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
>>> xx
>>>
>>> 10  99  999 188
>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
>>> xx
>>>
>>> Time taken: 50.563 seconds, Fetched 3 row(s)
>>>
>>>
>>>
>>> So three runs returning three rows just over 50 seconds
>>>
>>>
>>>
>>> *Hive 1.2.1 on spark 1.3.1 execution engine*
>>>
>>>
>>>
>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>>> (1, 5, 10);
>>>
>>> INFO  :
>>>
>>> Query Hive on Spark job[4] stages:
>>>
>>> INFO  : 4
>>>
>>> INFO  :
>>>
>>> Status: Running (Hive on Spark job[4])
>>>
>>> INFO  : Status: Finished successfully in 82.49 seconds
>>>
>>>
>>> +---+--+--+---+-+-++--+
>>>
>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
>>> | dummy.random_string | dummy.small_vc  |
>>> dummy.padding  |
>>>
>>>
>>> +---+--+--+---+-+-++--+
>>>
>>> | 1 | 0| 0| 63|
>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  |
>>> xx |
>>>
>>> | 5 | 0| 4| 31|
>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  |
>>> xx |
>>>
>>> | 10| 99   | 999  | 188   |
>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  |
>>> xx |
>>>
>>>
>>> +---+--+--+---+-+-++--+
>>>
>>> 3 rows selected (82.66 seconds)
>>>
>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>>> (1, 

Re: Convert string to map

2016-01-20 Thread Edward Capriolo
In the CREATE TABLE DDL you can specify the map delimiters:
create table <name> (<columns>) ...
[COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char]
COLLECTION ITEMS TERMINATED BY ',' MAP KEYS TERMINATED BY ':' works in many
cases.
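
Two sketches, assuming a string column named json_col and a table named
some_table (the brickhouse json_map UDF mentioned below is the better choice for
real JSON). The first is the DDL form described above, for data that arrives as
delimited text; the second flattens the JSON-ish string at query time with the
built-in str_to_map(), which breaks if values themselves contain commas, colons
or braces:

CREATE TABLE kv_table (kv MAP<STRING,STRING>)
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

-- naive query-time conversion: strip {, } and " then split on ',' and ':'
SELECT str_to_map(regexp_replace(json_col, '[{}"]', ''), ',', ':') AS kv
FROM some_table;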

On Wed, Jan 20, 2016 at 9:07 PM, Buntu Dev  wrote:

> I found the brickhouse Hive udf `json_map' that seems to convert to map of
> given type.
>
> Thanks!
>
> On Wed, Jan 20, 2016 at 2:03 PM, Buntu Dev  wrote:
>
>> I got json string of the form:
>>
>>   {"k1":"v1","k2":"v2,"k3":"v3"}
>>
>> How would I go about converting this to a map?
>>
>> Thanks!
>>
>
>


Re: adding jars - hive on spark cdh 5.4.3

2016-01-08 Thread Edward Capriolo
You cannot 'add jar' InputFormats and SerDes. They need to be part of
your auxlib.

On Fri, Jan 8, 2016 at 12:19 PM, Ophir Etzion  wrote:

> I tried now. still getting
>
> 16/01/08 16:37:34 ERROR exec.Utilities: Failed to load plan: 
> hdfs://hadoop-alidoro-nn-vip/tmp/hive/hive/c2af9882-38a9-42b0-8d17-3f56708383e8/hive_2016-01-08_16-36-41_370_3307331506800215903-3/-mr-10004/3c90a796-47fc-4541-bbec-b196c40aefab/map.xml:
>  org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find 
> class: com.foursquare.hadoop.hive.io.HiveThriftSequenceFileInputFormat
> Serialization trace:
> inputFileFormatClass (org.apache.hadoop.hive.ql.plan.PartitionDesc)
> aliasToPartnInfo (org.apache.hadoop.hive.ql.plan.MapWork)
> org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find 
> class: com.foursquare.hadoop.hive.io.HiveThriftSequenceFileInputFormat
>
>
> HiveThriftSequenceFileInputFormat is in one of the jars I'm trying to add.
>
>
> On Thu, Jan 7, 2016 at 9:58 PM, Prem Sure  wrote:
>
>> did you try -- jars property in spark submit? if your jar is of huge
>> size, you can pre-load the jar on all executors in a common available
>> directory to avoid network IO.
>>
>> On Thu, Jan 7, 2016 at 4:03 PM, Ophir Etzion 
>> wrote:
>>
>>> I' trying to add jars before running a query using hive on spark on cdh
>>> 5.4.3.
>>> I've tried applying the patch in
>>> https://issues.apache.org/jira/browse/HIVE-12045 (manually as the patch
>>> is done on a different hive version) but still hasn't succeeded.
>>>
>>> did anyone manage to do ADD JAR successfully with CDH?
>>>
>>> Thanks,
>>> Ophir
>>>
>>
>>
>


Re: adding jars - hive on spark cdh 5.4.3

2016-01-08 Thread Edward Capriolo
Yes, you can add UDFs via ADD JAR. But strangely the classpath of 'the
driver' of the Hive process does not seem to be able to utilize
InputFormats and SerDes that have been added to the session via ADD JAR.
At one point I understood why. This is probably something we should ticket
and come up with a more elegant solution for.
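
A sketch of the auxlib route (paths and jar names are hypothetical); the point
is that the jar has to be on the classpath when the Hive/HiveServer2 JVM starts,
so the service needs a restart after the change:

# hive-env.sh
export HIVE_AUX_JARS_PATH=/opt/hive/auxlib/my-serde-and-inputformat.jar

# or the equivalent hive-site.xml property
#   hive.aux.jars.path = file:///opt/hive/auxlib/my-serde-and-inputformat.jar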

On Fri, Jan 8, 2016 at 12:26 PM, Ophir Etzion <op...@foursquare.com> wrote:

> Thanks!
> In certain use cases you could but forgot about the aux thing, thats
> probably it.
>
> On Fri, Jan 8, 2016 at 12:24 PM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>> You can not 'add jar' input formats and serde's. They need to be part of
>> your auxlib.
>>
>> On Fri, Jan 8, 2016 at 12:19 PM, Ophir Etzion <op...@foursquare.com>
>> wrote:
>>
>>> I tried now. still getting
>>>
>>> 16/01/08 16:37:34 ERROR exec.Utilities: Failed to load plan: 
>>> hdfs://hadoop-alidoro-nn-vip/tmp/hive/hive/c2af9882-38a9-42b0-8d17-3f56708383e8/hive_2016-01-08_16-36-41_370_3307331506800215903-3/-mr-10004/3c90a796-47fc-4541-bbec-b196c40aefab/map.xml:
>>>  org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find 
>>> class: com.foursquare.hadoop.hive.io.HiveThriftSequenceFileInputFormat
>>> Serialization trace:
>>> inputFileFormatClass (org.apache.hadoop.hive.ql.plan.PartitionDesc)
>>> aliasToPartnInfo (org.apache.hadoop.hive.ql.plan.MapWork)
>>> org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find 
>>> class: com.foursquare.hadoop.hive.io.HiveThriftSequenceFileInputFormat
>>>
>>>
>>> HiveThriftSequenceFileInputFormat is in one of the jars I'm trying to add.
>>>
>>>
>>> On Thu, Jan 7, 2016 at 9:58 PM, Prem Sure <premsure...@gmail.com> wrote:
>>>
>>>> did you try -- jars property in spark submit? if your jar is of huge
>>>> size, you can pre-load the jar on all executors in a common available
>>>> directory to avoid network IO.
>>>>
>>>> On Thu, Jan 7, 2016 at 4:03 PM, Ophir Etzion <op...@foursquare.com>
>>>> wrote:
>>>>
>>>>> I' trying to add jars before running a query using hive on spark on
>>>>> cdh 5.4.3.
>>>>> I've tried applying the patch in
>>>>> https://issues.apache.org/jira/browse/HIVE-12045 (manually as the
>>>>> patch is done on a different hive version) but still hasn't succeeded.
>>>>>
>>>>> did anyone manage to do ADD JAR successfully with CDH?
>>>>>
>>>>> Thanks,
>>>>> Ophir
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Seeing strange limit

2015-12-30 Thread Edward Capriolo
This message means the garbage collector is running but is unable to free
memory after trying for a while.

This can happen for a lot of reasons. With Hive it usually happens when a
query has a lot of intermediate data.

For example, imagine a few months ago count(distinct(ip)) returned 20k and
everything worked; then your data changes and suddenly you have issues.

Try tuning, mostly raising your -Xmx.

On Wednesday, December 30, 2015, Gary Clark  wrote:

> Hello,
>
>
>
> I have a multi-node cluster (hadoop 2.6.0) and am seeing the below message
> causing the hive workflow to fail:
>
>
>
> Looking at the hadoop logs I see the below:
>
>
>
> 45417 [main] ERROR org.apache.hadoop.hive.ql.Driver  - FAILED: Execution
> Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask.
> GC overhead limit exceeded
>
>
>
> I have been running for months without problems. When I removed a large
> amount of the files from the directory which I was running a query on the
> query succeeded. It looks like I’m hitting a limit not sure how to remedy
> this.
>
>
>
> Has anybody else seen this problem?
>
>
>
> Thanks,
>
> Gary C
>


-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: Seeing strange limit

2015-12-30 Thread Edward Capriolo
In the old days mapred.child.java.opts was the one. Knowing the query and the
dataset helps as well.
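
A sketch of the knobs in question, set from a Hive session (values are only
examples; the MR1 and YARN/MR2 property names differ):

SET mapred.child.java.opts=-Xmx2048m;        -- old MR1-style name
SET mapreduce.map.java.opts=-Xmx2048m;       -- YARN / MR2 equivalents
SET mapreduce.reduce.java.opts=-Xmx2048m;
SET mapreduce.map.memory.mb=2560;            -- container sizes should exceed the heap
SET mapreduce.reduce.memory.mb=2560;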

On Wednesday, December 30, 2015, Gary Clark <gcl...@neces.com> wrote:

> -Xmx1024m -XX:-UseGCOverheadLimit
>
>
>
> I think this is the limit I need to tweak.
>
>
>
> *From:* Gary Clark [mailto:gcl...@neces.com
> <javascript:_e(%7B%7D,'cvml','gcl...@neces.com');>]
> *Sent:* Wednesday, December 30, 2015 8:59 AM
> *To:* user@hive.apache.org
> <javascript:_e(%7B%7D,'cvml','user@hive.apache.org');>
> *Subject:* RE: Seeing strange limit
>
>
>
> Thanks, currently  have the below:
>
>
>
> export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"
>
>
>
> # The following applies to multiple commands (fs, dfs, fsck, distcp etc)
>
> export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
>
>
>
> and HADOOP_HEAPSIZE=4096
>
>
>
> I’m assuming just raising the above would work.
>
>
>
> Much Appreciated,
>
> Gary C
>
>
>
> *From:* Edward Capriolo [mailto:edlinuxg...@gmail.com
> <javascript:_e(%7B%7D,'cvml','edlinuxg...@gmail.com');>]
> *Sent:* Wednesday, December 30, 2015 8:55 AM
> *To:* user@hive.apache.org
> <javascript:_e(%7B%7D,'cvml','user@hive.apache.org');>
> *Subject:* Re: Seeing strange limit
>
>
>
> This message means the garbage collector runs but is unable to free memory
> after trying for a while.
>
>
>
> This can happen for a lot of reasons. With hive it usually happens when a
> query has a lot of intermediate data.
>
>
>
> For example imaging a few months ago count (distinct(ip)) returned 20k.
> Everything works, then your data changes and suddenly you have issues.
>
>
>
> Try tuning mostly raising your xmx.
>
> On Wednesday, December 30, 2015, Gary Clark <gcl...@neces.com
> <javascript:_e(%7B%7D,'cvml','gcl...@neces.com');>> wrote:
>
> Hello,
>
>
>
> I have a multi-node cluster (hadoop 2.6.0) and am seeing the below message
> causing the hive workflow to fail:
>
>
>
> Looking at the hadoop logs I see the below:
>
>
>
> 45417 [main] ERROR org.apache.hadoop.hive.ql.Driver  - FAILED: Execution
> Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask.
> GC overhead limit exceeded
>
>
>
> I have been running for months without problems. When I removed a large
> amount of the files from the directory which I was running a query on the
> query succeeded. It looks like I’m hitting a limit not sure how to remedy
> this.
>
>
>
> Has anybody else seen this problem?
>
>
>
> Thanks,
>
> Gary C
>
>
>
> --
> Sorry this was sent from mobile. Will do less grammar and spell check than
> usual.
>


-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: hacking the hive ql parser?

2015-12-29 Thread Edward Capriolo
hive --service lineage 'hql' exists, I believe.

On Tue, Dec 29, 2015 at 3:05 PM, Yang  wrote:

> I'm trying to create a utility to parse out the data lineage (i.e. DAG
> dependency graph) among all my hive scripts.
>
> to do this I need to parse out the input and output tables from a query.
> does this ability existing already? if not, I'm going to hack the parser.
> I am not very familiar with the parser code structure of hive, could
> anybody give me some tips on where to start?
> (I see the .g files, but not sure where is the rest  I am more
> familiar with the ASTvisitor paradigm in antlr, but can't find similar
> files in the parser dir)
>
>
> thanks
> Yang
>


Re: Null Representation in Hive tables

2015-12-27 Thread Edward Capriolo
Your best bet is to take the SerDe you are using, copy it, and change the
code to accept both null representations.
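
If forking the SerDe is not an option, a query-time workaround (a sketch, not
the SerDe-level fix described above) is to normalize both representations to
NULL when reading from a single external table; how the '\000' values actually
survive in your strings may differ, so that predicate is an assumption:

SELECT CASE WHEN col = '' OR col = '\000' THEN NULL ELSE col END AS col
FROM external_table_1;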

On Sunday, December 27, 2015, mahender bigdata <mahender.bigd...@outlook.com>
wrote:

> Can any one update on this
>
> On 12/23/2015 9:37 AM, mahender bigdata wrote:
>
> Our Files are not text Files, they are csv and dat. Any possibility to
> include 2 serialization.null format in table property
>
> On 12/23/2015 9:16 AM, Edward Capriolo wrote:
>
> In text formats the null is accepted as \N.
>
> On Wed, Dec 23, 2015 at 12:00 PM, mahender bigdata <
> <javascript:_e(%7B%7D,'cvml','mahender.bigd...@outlook.com');>
> mahender.bigd...@outlook.com
> <javascript:_e(%7B%7D,'cvml','mahender.bigd...@outlook.com');>> wrote:
>
>> Hi,
>>
>> Is there any possibility of mentioning both*
>> "serialization.null.format"=""  and  **"serialization.null.format"="\000"
>> *has table properties, currently we are creating external table, where
>> there is chance of having data with empty string or \000,  As a  work
>> around, we have created 2 external tables, one with 
>> "serialization.null.format"=""
>> has table property and another with "serialization.null.format"="\000"
>> where we insert data from external table 1 to table 2. Is there way to
>> reduce to single step having mentioning both *"serialization.null.format"=""
>> and  **"serialization.null.format"="\000"* in the same table property.
>>
>> Thanks,
>> Mahender
>>
>
>
>
>

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: Null Representation in Hive tables

2015-12-23 Thread Edward Capriolo
In text formats the null is accepted as \N.

On Wed, Dec 23, 2015 at 12:00 PM, mahender bigdata <
mahender.bigd...@outlook.com> wrote:

> Hi,
>
> Is there any possibility of mentioning both*
> "serialization.null.format"=""  and  **"serialization.null.format"="\000"
> *has table properties, currently we are creating external table, where
> there is chance of having data with empty string or \000,  As a  work
> around, we have created 2 external tables, one with 
> "serialization.null.format"=""
> has table property and another with "serialization.null.format"="\000"
> where we insert data from external table 1 to table 2. Is there way to
> reduce to single step having mentioning both *"serialization.null.format"=""
> and  **"serialization.null.format"="\000"* in the same table property.
>
> Thanks,
> Mahender
>


Re: how to search the archive

2015-12-04 Thread Edward Capriolo
2) Sometimes I find that managed tables are not removed from HDFS even
after I drop them from the Hive shell. After a "drop table foo", foo does
not show up in a "show tables" listing however that table is present in
HDFS. These are not external tables.

I have noticed this as well. Sometimes this can be about the permissions of
the user that created the table. For example, based on the user/group/all
permission bits plus the sticky bit, a user might be able to create a table
but another user might not be able to drop it.

I had one weird issue where I could not drop a table as a different user
because one of the subdirectories had the sticky bit set, and attempting to
move it caused a failed file operation.

On Fri, Dec 4, 2015 at 10:33 AM, Timothy Garza <
timothy.ga...@collinsongroup.com> wrote:

> No, definitely not. A Hive table with Sequence Files stored in hdfs:
> /user/warehouse/
>
>
>
>
>
> *Kind Regards *
>
>
> *Timothy Garza*
>
> Data Integration Developer
> *Collinson Technology Services*
>
> Skype: timothy.garza.cts
>
> collinsongroup.com 
>
>
>
> [image: Collinson Group] 
>
>
>
>
>
> *From:* Takahiko Saito [mailto:tysa...@gmail.com]
> *Sent:* 04 December 2015 15:12
> *To:* user@hive.apache.org
> *Subject:* Re: how to search the archive
>
>
>
> Could a table be an external table?
>
>
>
> On Fri, Dec 4, 2015 at 5:56 AM, Timothy Garza <
> timothy.ga...@collinsongroup.com> wrote:
>
> I find the same thing, especially with Hive v1.2.1 that I am currently
> trialling. It does lead to issues with the Metastore when trying to re-use
> the same Hive Table name and I find manually deleting the files in HDFS
> serves as a workaround.
>
>
>
> Q. What does that have to do with the text in the Subject line in your
> email?
>
>
>
>
>
> *From:* Awhan Patnaik [mailto:aw...@spotzot.com]
> *Sent:* 04 December 2015 10:26
> *To:* user@hive.apache.org
> *Subject:* how to search the archive
>
>
>
> Hey all!
>
> I have two questions:
>
> 1) How do I search the entire mailing list archive?
>
> 2) Sometimes I find that managed tables are not removed from HDFS even
> after I drop them from the Hive shell. After a "drop table foo", foo does
> not show up in a "show tables" listing however that table is present in
> HDFS. These are not external tables.
>
>
>
>
>
>
>
>
> --
>
> Takahiko Saito
>


Strict mode and joins

2015-10-15 Thread Edward Capriolo
So I have strict mode on and I like to keep it that way.

I am trying to do this query.

INSERT OVERWRITE TABLE vertical_stats_recent PARTITION (dt=2015101517)
SELECT ...

FROM entry_hourly_v3 INNER JOIN article_meta ON
entry_hourly_v3.entry_id = article_meta.entry_id
INNER JOIN channel_meta ON
channel_meta.section_name = article_meta.channel

WHERE entry_hourly_v3.dt=2015101517
AND article_meta.dt=2015101517
AND channel_meta.hitdate=20151015
AND article_meta.publish_timestamp > ((unix_timestamp() * 1000) - (1000 *
60 * 60 * 24 * 2))
GROUP

entry_hourly_v3, channel_meta and article_meta are partitioned tables.

*Your query has the following error(s):*

Error while compiling statement: FAILED: SemanticException [Error 10041]:
No partition predicate found for Alias "entry_hourly_v3" Table
"entry_hourly_v3"

I also tried putting views on the table and I had no luck.

Is there any way I can do this query without turning strict mode off?


Re: Better way to do UDF's for Hive

2015-10-01 Thread Edward Capriolo
You can define them in groovy from inside the CLI...

https://gist.github.com/mwinkle/ac9dbb152a1e10e06c16

On Thu, Oct 1, 2015 at 12:57 PM, Ryan Harris 
wrote:

> If you want to use python...
>
> The python script should expect tab-separated input on stdin and it should
> return tab-separated delimited columns for the output...
>
>
>
> add file mypython.py;
>
> SELECT TRANSFORM (tbl.id, tbl.name, tbl.city)
>
> USING 'python mypython.py'
>
> AS (id, name, city, state)
>
> FROM my_db.my_table ;
>
>
>
> *From:* Daniel Lopes [mailto:dan...@bankfacil.com.br]
> *Sent:* Thursday, October 01, 2015 7:12 AM
> *To:* user@hive.apache.org
> *Subject:* Better way to do UDF's for Hive
>
>
>
> Hi,
>
>
>
> I'd like to know the good way to do a a UDF for a single field, like
>
>
>
> SELECT
>
>   tbl.id AS id,
>
>   tbl.name AS name,
>
>   tbl.city AS city,
>
>   state_from_city(tbl.city) AS state
>
> FROM
>
>   my_db.my_table tbl;
>
>
>
> *Native Java*? *Python *over *Hadoop* *Streaming*?
>
>
>
> I prefer Python, but I don't know how to do in a good way.
>
>
>
> Thanks,
>
>
> *Daniel Lopes, B.Eng*
>
> Data Scientist - BankFacil
>
> CREA/SP 5069410560
> 
>
> Mob +55 (18) 99764-2733 
>
> Ph +55 (11) 3522-8009
>
> http://about.me/dannyeuu
>
>
>
> Av. Nova Independência, 956, São Paulo, SP
>
> Bairro Brooklin Paulista
>
> CEP 04570-001
>
> https://www.bankfacil.com.br
>
>
>


Re: Hive Macros roadmap

2015-09-11 Thread Edward Capriolo
Macros are in and tested. No one will remove them. The unit tests ensure
they keep working.

On Fri, Sep 11, 2015 at 3:38 PM, Elliot West  wrote:

> Hi,
>
> I noticed some time ago the Hive Macro feature. To me at least this seemed
> like an excellent addition to HQL, allowing the user to encapsulate complex
> column logic as an independent HQL, reusable macro while avoiding the
> complexities of Java UDFs. However, few people seem to be aware of them or
> use them. If you are unfamiliar with macros they look like this:
>
> hive> create temporary macro MYSIGMOID(x DOUBLE)
> > 2.0 / (1.0 + exp(-x));
> OK
>
> hive> select MYSIGMOID(1.0) from dual;
> OK
>
> 1.4621171572600098
>
>
> As far as I can tell, they are no longer documented on the Hive wiki.
> There is a tiny reference to them in the O'Reilly 'Programming Hive' book
> (page 185). Can anyone advise me on the following:
>
>- Are there are plans to keep or remove this functionality?
>- Are there are plans to document this functionality?
>- Aside from limitations of HQL are there compelling reasons not to
>use macros?
>
> Thanks - Elliot.
>
>


Re: Is it possible to set the data schema on a per-partition basis?

2015-08-31 Thread Edward Capriolo
Yes. Specifically, SerDes like the Avro SerDe support an "evolving schema".
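
A sketch of an Avro-backed partitioned table (schema URL, location and names are
hypothetical); older partitions written with an earlier schema stay readable as
long as fields added later carry defaults in the Avro schema:

CREATE EXTERNAL TABLE events
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/events'
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/events_v2.avsc');

-- some setups also point individual partitions at their own schema via
-- ALTER TABLE ... PARTITION (...) SET SERDEPROPERTIES; verify that on your Hive version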

On Mon, Aug 31, 2015 at 5:15 PM, Dominik Choma 
wrote:

> I have external hcat structures over lzo-compressed datafiles , data is
> partitioned by date string
> Is it possible to handle schema changes by setting diffrent schema(column
> names & datatypes) per-partition?
>
> Thanks,
> Dominik.
>
>


Hive 1.1 arg!

2015-07-07 Thread Edward Capriolo
Hey all. I am using Cloudera 5.4.something, which uses (almost) Hive 1.1.

I am getting bit by this error:
https://issues.apache.org/jira/browse/HIVE-10437

So I am trying to update my test setup to 1.1 so I can include the
annotation.


@SerDeSpec(schemaProps = {serdeConstants.LIST_COLUMNS,
  serdeConstants.LIST_COLUMN_TYPES,
  serdeConstants.TIMESTAMP_FORMATS})

I added this annotation. Now during my testing I am seeing this:

My serde does not read any table meta-data. It always returns the same list
of columns.

There are a lot of deeply nested columns. I have a unit test that is
creating a table using this serde.

Hive is angry:

Caused by: java.sql.SQLDataException: A truncation error was encountered
trying to shrink VARCHAR 'Video fields in beacon: vidId, vidAdViewed,
vidTime, vidStat' to length 256.
at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown
Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)

Does anyone understand why Hive is attempting to edit the metastore? It
should just always read the values from this serde and not need to persist
the columns.

AFAIK there is NO documentation anywhere as to what schema props should be
set when

What does serdeConstants.LIST_COLUMNS do? When should someone use it? When
should someone not use it?


Re: hive locate from s3 - query

2015-07-03 Thread Edward Capriolo
You probably need to make your own serde/input format that trims the line.

On Fri, Jul 3, 2015 at 8:15 AM, ram kumar ramkumarro...@gmail.com wrote:

 when i map the hive table to locate the s3 path,
 it throws exception for the* new line at the beginning of line*.
 Is there a solution to trim the new line at the beginning in hive?
 Or any alternatives?


 CREATE EXTERNAL TABLE work (
 time BIGINT,
 uid STRING,
 type STRING
 )
 ROW FORMAT SERDE 'com.proofpoint.hive.serde.JsonSerde'
 LOCATION 's3n://work/';



 hive> select * from work;
 Failed with exception java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: error parsing JSON



 Thanks



Re: join 2 tables located on different clusters

2015-06-24 Thread Edward Capriolo
I do not know what your exact problem is. Turn your debug logging on. This
can be done, however, assuming both clusters have network access to each other.
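
A sketch of the external-table approach being described (the namenode host/port
and columns are hypothetical; the full hdfs:// URI must point at cluster_2's
namenode, and the querying cluster needs access to its datanodes too):

CREATE EXTERNAL TABLE a_2 (id BIGINT, payload STRING)
LOCATION 'hdfs://cluster2-nn.example.com:8020/user/hive/warehouse/a';

SELECT a.*, a_2.*
FROM a
FULL OUTER JOIN a_2 ON (a.id = a_2.id);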

On Wed, Jun 24, 2015 at 4:33 PM, Alexander Pivovarov apivova...@gmail.com
wrote:

 Hello Everyone

 Can I define external table on cluster_1 pointing to hdfs location on
 cluster_2?
 I tried and got some strange exception in hive
 FAILED: Execution Error, return code 1 from
 org.apache.hadoop.hive.ql.exec.DDLTask.
 MetaException(message:java.lang.reflect.InvocationTargetException)

 I want to do full outer join btw table A which exist on cluster_1 and
 table A on cluster_2.

 My idea was to create external table A_2 (on cluster_1) which points to
 cluster_2 and run hive query on cluster_1

 select a.*, a_2.*
 from a
 full outer join a_2 on (a.id = a_2.id)



Re: Merging small files in partitions

2015-06-16 Thread Edward Capriolo
https://github.com/edwardcapriolo/filecrush
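
Another option for text partitions is to rewrite each partition onto itself with
Hive's merge settings turned on — a sketch (the thresholds are examples and the
table/partition names are hypothetical):

SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=134217728;   -- ~128 MB
SET hive.merge.size.per.task=268435456;        -- ~256 MB

INSERT OVERWRITE TABLE my_table PARTITION (dt='2015-06-16')
SELECT col1, col2, col3                        -- all non-partition columns
FROM my_table
WHERE dt='2015-06-16';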

On Tue, Jun 16, 2015 at 5:05 PM, Chagarlamudi, Prasanth 
prasanth.chagarlam...@epsilon.com wrote:

  Hello,

 I am looking for an optimized way to merge small files in hive partitions
 into one big file.

 I came across *Alter Table/Partition Concatenate *
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate.
 Doc says this only works for RCFiles. I wish there is something similar for
 TEXT FILE format.

 Any suggestions?



 Thanks in advance

 Prasanth








Hive-1.2.0 does not work with stock hadoop 2.6.0

2015-06-07 Thread Edward Capriolo
[edward@jackintosh apache-hive-1.2.0-bin]$ export
HADOOP_HOME=/home/edward/Downloads/hadoop-2.6.0
[edward@jackintosh apache-hive-1.2.0-bin]$ bin/hive

Logging initialized using configuration in
jar:file:/home/edward/Downloads/apache-hive-1.2.0-bin/lib/hive-common-1.2.0.jar!/hive-log4j.properties
[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but
interface was expected
at jline.TerminalFactory.create(TerminalFactory.java:101)
at jline.TerminalFactory.get(TerminalFactory.java:158)
at jline.console.ConsoleReader.init(ConsoleReader.java:229)
at jline.console.ConsoleReader.init(ConsoleReader.java:221)
at jline.console.ConsoleReader.init(ConsoleReader.java:209)
at
org.apache.hadoop.hive.cli.CliDriver.setupConsoleReader(CliDriver.java:787)
at
org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:721)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Exception in thread main java.lang.IncompatibleClassChangeError: Found
class jline.Terminal, but interface was expected
at jline.console.ConsoleReader.init(ConsoleReader.java:230)
at jline.console.ConsoleReader.init(ConsoleReader.java:221)
at jline.console.ConsoleReader.init(ConsoleReader.java:209)
at
org.apache.hadoop.hive.cli.CliDriver.setupConsoleReader(CliDriver.java:787)
at
org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:721)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)


Re: Hive-1.2.0 does not work with stock hadoop 2.6.0

2015-06-07 Thread Edward Capriolo
Should we add
HADOOP_USER_CLASSPATH_FIRST=true

to the hive scripts?
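
A sketch of the workaround under discussion (run it in the shell before
launching the CLI, or put it in hive-env.sh); the alternative is removing the
old jline jar from Hadoop's yarn/lib directory:

export HADOOP_USER_CLASSPATH_FIRST=true
bin/hive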

On Sun, Jun 7, 2015 at 11:06 AM, Edward Capriolo edlinuxg...@gmail.com
wrote:

 [edward@jackintosh apache-hive-1.2.0-bin]$ export
 HADOOP_HOME=/home/edward/Downloads/hadoop-2.6.0
 [edward@jackintosh apache-hive-1.2.0-bin]$ bin/hive

 Logging initialized using configuration in
 jar:file:/home/edward/Downloads/apache-hive-1.2.0-bin/lib/hive-common-1.2.0.jar!/hive-log4j.properties
 [ERROR] Terminal initialization failed; falling back to unsupported
 java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but
 interface was expected
 at jline.TerminalFactory.create(TerminalFactory.java:101)
 at jline.TerminalFactory.get(TerminalFactory.java:158)
 at jline.console.ConsoleReader.init(ConsoleReader.java:229)
 at jline.console.ConsoleReader.init(ConsoleReader.java:221)
 at jline.console.ConsoleReader.init(ConsoleReader.java:209)
 at
 org.apache.hadoop.hive.cli.CliDriver.setupConsoleReader(CliDriver.java:787)
 at
 org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:721)
 at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
 at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

 Exception in thread "main" java.lang.IncompatibleClassChangeError: Found
 class jline.Terminal, but interface was expected
 at jline.console.ConsoleReader.init(ConsoleReader.java:230)
 at jline.console.ConsoleReader.init(ConsoleReader.java:221)
 at jline.console.ConsoleReader.init(ConsoleReader.java:209)
 at
 org.apache.hadoop.hive.cli.CliDriver.setupConsoleReader(CliDriver.java:787)
 at
 org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:721)
 at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
 at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:136)




Re: Keys in Hive

2015-06-02 Thread Edward Capriolo
Hive does not support primary key or other types of index constraints.
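
For reference, a minimal sketch of the table without the unsupported constraint, plus a hypothetical duplicate check run after loading:

create table Hivetable (name string);

-- hypothetical check that the intended key is in fact unique
select name, count(*) as cnt
from Hivetable
group by name
having count(*) > 1;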

On Tue, Jun 2, 2015 at 4:37 AM, Ravisankar Mani 
ravisankarm...@syncfusion.com wrote:

  Hi everyone,



 I am unable to create an table in hive with primary key

 Example :



 create table Hivetable((name string),primary key(name));



 Could please help about the primary key query?



 Regards,



 Ravisankar M R



Re: Hive on Spark VS Spark SQL

2015-05-20 Thread Edward Capriolo
What about outer lateral view?

On Wed, May 20, 2015 at 11:28 AM, matshyeq matsh...@gmail.com wrote:

 From my experience SparkSQL is still way faster than tez.
 Also, SparkSQL (even 1.2.1 which I'm on) supports *lateral view*

 On Wed, May 20, 2015 at 3:41 PM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

 Beyond window queries, hive still has concepts like cube or lateral view
 that many 'better than Hive' systems don't have.

 Also now many people went around broadcasting SparkSQL/SparkSQL was/is
 better/faster than hive but now that tez has whooped them in a benchmark
 they are very quiet.


 http://www.quora.com/What-do-the-people-who-answered-Quora-questions-about-Spark-being-faster-than-Hive-say-now-that-Hortonworks-claims-that-Hive-on-Tez-is-faster-than-Spark




 On Wed, May 20, 2015 at 9:50 AM, Dragga, Christopher 
 chris.dra...@netapp.com wrote:

  While I’ve not experimented with the most recent versions of SparkSQL,
 earlier releases could not cope with intermediate result sets that exceeded
 the available memory; Hive handles this sort of situation much more
 gracefully.  If you have a smallish cluster and large data, this could pose
 a problem.  Still, it’s worth looking into SparkSQL to see if this is still
 an issue.



 -Chris Dragga



 *From:* Uli Bethke [mailto:uli.bet...@sonra.io]
 *Sent:* Wednesday, May 20, 2015 7:04 AM
 *To:* user@hive.apache.org
 *Subject:* Re: Hive on Spark VS Spark SQL



 Interesting question and one that I have asked myself. If you are
 already heavily invested in the Hive ecosystem in terms of code and skills
 I would look at Hive on Spark as my engine. In theory swapping out engines
 (MR, TEZ, Spark) should be easy. Even though the devil is in the detail.
 SparkSQL supports a broad subset of HiveQL (some esoteric features are
 not supported). Crucially in my opinion SparkSQL 1.4 will also introduce
 windowing functions. If starting out on a greenfield site I would
 exclusively look at SparkSQL.

  On 20/05/2015 06:38, guoqing0...@yahoo.com.hk wrote:

  Hive on Spark and SparkSQL which should be better , and what are the
 key characteristics and the advantages and the disadvantages between ?


  --

 guoqing0...@yahoo.com.hk



  --

 ___

 Uli Bethke

 Co-founder Sonra

 p: +353 86 32 83 040

 w: www.sonra.io

 l: linkedin.com/in/ulibethke

 t: twitter.com/ubethke



 Chair Hadoop User Group Ireland:

 http://www.meetup.com/hadoop-user-group-ireland/






Re: Hive on Spark VS Spark SQL

2015-05-20 Thread Edward Capriolo
Beyond window queries, hive still has concepts like cube or lateral view
that many 'better than Hive' systems don't have.

Also now many people went around broadcasting SparkSQL/SparkSQL was/is
better/faster than hive but now that tez has whooped them in a benchmark
they are very quiet.

http://www.quora.com/What-do-the-people-who-answered-Quora-questions-about-Spark-being-faster-than-Hive-say-now-that-Hortonworks-claims-that-Hive-on-Tez-is-faster-than-Spark
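
A rough HiveQL sketch of both, using hypothetical table and column names (pages, author, site, tags):

-- CUBE: one aggregate row for every combination of the grouping columns
SELECT author, site, COUNT(*) AS pv
FROM pages
GROUP BY author, site WITH CUBE;

-- LATERAL VIEW: one output row per element of the array column tags
SELECT author, tag
FROM pages
LATERAL VIEW explode(tags) t AS tag;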




On Wed, May 20, 2015 at 9:50 AM, Dragga, Christopher 
chris.dra...@netapp.com wrote:

  While I’ve not experimented with the most recent versions of SparkSQL,
 earlier releases could not cope with intermediate result sets that exceeded
 the available memory; Hive handles this sort of situation much more
 gracefully.  If you have a smallish cluster and large data, this could pose
 a problem.  Still, it’s worth looking into SparkSQL to see if this is still
 an issue.



 -Chris Dragga



 *From:* Uli Bethke [mailto:uli.bet...@sonra.io]
 *Sent:* Wednesday, May 20, 2015 7:04 AM
 *To:* user@hive.apache.org
 *Subject:* Re: Hive on Spark VS Spark SQL



 Interesting question and one that I have asked myself. If you are already
 heavily invested in the Hive ecosystem in terms of code and skills I would
 look at Hive on Spark as my engine. In theory swapping out engines (MR,
 TEZ, Spark) should be easy. Even though the devil is in the detail.
 SparkSQL supports a broad subset of HiveQL (some esoteric features are not
 supported). Crucially in my opinion SparkSQL 1.4 will also introduce
 windowing functions. If starting out on a greenfield site I would
 exclusively look at SparkSQL.

  On 20/05/2015 06:38, guoqing0...@yahoo.com.hk wrote:

  Hive on Spark and SparkSQL which should be better , and what are the key
 characteristics and the advantages and the disadvantages between ?


  --

 guoqing0...@yahoo.com.hk



  --

 ___

 Uli Bethke

 Co-founder Sonra

 p: +353 86 32 83 040

 w: www.sonra.io

 l: linkedin.com/in/ulibethke

 t: twitter.com/ubethke



 Chair Hadoop User Group Ireland:

 http://www.meetup.com/hadoop-user-group-ireland/




Re: Hive documentation update for isNull, isNotNull etc.

2015-04-18 Thread Edward Capriolo
show functions returns =, etc.  I believe I added NVL (
https://issues.apache.org/jira/browse/HIVE-2288) and hive also has
coalesce. Even if you can access isNull as a function, I think it is
clearer to just write the query as 'column IS NULL'; that would be a more
portable query.
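
A quick sketch of the three spellings side by side (hypothetical table t with column c):

SELECT NVL(c, 0),       -- NULL replaced with a default
       COALESCE(c, 0),  -- same idea, standard SQL and variadic
       c IS NULL        -- plain predicate, the most portable form
FROM t;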

On Sat, Apr 18, 2015 at 6:26 PM, Moore, Douglas 
douglas.mo...@thinkbiganalytics.com wrote:

   Dmitry Lefty the Hive docs updated
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-ConditionalFunctions

  NVL has been around since 0.5, maybe earlier.
 One day I may do a full audit (compare show functions vs. the
 documentation).

  - Douglas

   From: Lefty Leverenz leftylever...@gmail.com
 Reply-To: user@hive.apache.org
 Date: Fri, 17 Apr 2015 19:19:47 -0400
 To: user@hive.apache.org
 Subject: Re: Hive documentation update for isNull, isNotNull etc.

  Hooray for updating the docs!  Douglas, if you tell me your Confluence
 username I'll grant you write access to the wiki (see About This Wiki
 https://cwiki.apache.org/confluence/display/Hive/AboutThisWiki#AboutThisWiki-Howtogetpermissiontoedit).


  Thanks.

  -- Lefty

 On Fri, Apr 17, 2015 at 6:11 PM, Dmitry Tolpeko dmtolp...@gmail.com
 wrote:

 I also recently realized that NVL function is available, but not
 documented :(

  Dmitry Tolpeko

  --
 PL/HQL - Procedural SQL-on-Hadoop - www.plhql.org


 On Sat, Apr 18, 2015 at 12:22 AM, Moore, Douglas 
 douglas.mo...@thinkbiganalytics.com wrote:

   I'm having major trouble finding documentation on hive functions
 isNull and isNotNull.
 At first I was assuming the function just wasn't available, now I
 believe these functions are not documented.

  I believe that the
 LanguageManual+UDF#LanguageManualUDF-Built-inFunctions
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inFunctions
  is
 the right wiki page:

  Interestingly, I found a HIVE-521 (resolved in 0.4) to be the first
 reference for isNull within JIRA.

  Is Lefty volunteering for this? If not I can make the edits if given
 permissions.

  Thanks
  Douglas






Re: [Hive] Slow Loading Data Process with Parquet over 30k Partitions

2015-04-14 Thread Edward Capriolo
That is too many partitions. Way too much overhead in anything that has that
many partitions.
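
A rough sketch of the usual workaround, reusing the (obfuscated) schema from the thread: partition on the coarse column only and demote the fine-grained one to a regular column.

CREATE EXTERNAL TABLE `table_2`(
  `keyword` string,
  `domain` string,
  `url` string,
  `partition2` string   -- demoted from partition key to plain column
)
PARTITIONED BY (yearmonth INT)
STORED AS Parquet;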

On Tue, Apr 14, 2015 at 12:53 PM, Tianqi Tong tt...@brightedge.com wrote:

  Hi Slava and Ferdinand,

 Thanks for the reply! Later when I was looking at the hive.log, I found
 Hive was indeed calculating the partition stats, and the log looks like:

 ….

 2015-04-14 09:38:21,146 WARN  [main]: hive.log
 (MetaStoreUtils.java:updatePartitionStatsFast(296)) - Updating partition
 stats fast for: parquet_table

 2015-04-14 09:38:21,147 WARN  [main]: hive.log
 (MetaStoreUtils.java:updatePartitionStatsFast(299)) - Updated size to
 5533480

 2015-04-14 09:38:44,511 WARN  [main]: hive.log
 (MetaStoreUtils.java:updatePartitionStatsFast(296)) - Updating partition
 stats fast for: parquet_table

 2015-04-14 09:38:44,512 WARN  [main]: hive.log
 (MetaStoreUtils.java:updatePartitionStatsFast(299)) - Updated size to 66246

 2015-04-14 09:39:07,554 WARN  [main]: hive.log
 (MetaStoreUtils.java:updatePartitionStatsFast(296)) - Updating partition
 stats fast for: parquet_table

 2015-04-14 09:39:07,555 WARN  [main]: hive.log
 (MetaStoreUtils.java:updatePartitionStatsFast(299)) - Updated size to 418925

 ….



 One interesting thing is, it's getting slower and slower. Right after I
 launched the job, it took less than 1s to calculate for one partition. Now
 it's taking 20+s for each one.

 I tried hive.stats.autogather=false, but somehow it didn't seem to work. I
 also ended up hard coding a little bit to the Hive source code.



 In my case, I have around 4 partitions with one file (varies from 1M
 to 1G) in each of them. Now it's been 4 days and the first job I launched
 is still not done yet, with partition stats.



 Thanks

 Tianqi Tong



 *From:* Slava Markeyev [mailto:slava.marke...@upsight.com]
 *Sent:* Monday, April 13, 2015 11:00 PM
 *To:* user@hive.apache.org
 *Cc:* Sergio Pena
 *Subject:* Re: [Hive] Slow Loading Data Process with Parquet over 30k
 Partitions



 This is something I've encountered when doing ETL with hive and having it
 create 10's of thousands partitions. The issue is each partition needs to
 be added to the metastore and this is an expensive operation to perform. My
 work around was adding a flag to hive that optionally disables the
 metastore partition creation step. This may not be a solution for everyone
 as that table then has no partitions and you would have to run msck repair
 but depending on your use case, you may just want the data in hdfs.

 If there is interest in having this be an option I'll make a ticket and
 submit the patch.

 -Slava



 On Mon, Apr 13, 2015 at 10:40 PM, Xu, Cheng A cheng.a...@intel.com
 wrote:

 Hi Tianqi,

 Can you attach hive.log as more detailed information?

 +Sergio



 Yours,

 Ferdinand Xu



 *From:* Tianqi Tong [mailto:tt...@brightedge.com]
 *Sent:* Friday, April 10, 2015 1:34 AM
 *To:* user@hive.apache.org
 *Subject:* [Hive] Slow Loading Data Process with Parquet over 30k
 Partitions



 Hello Hive,

 I'm a developer using Hive to process TB level data, and I'm having some
 difficulty loading the data to table.

 I have 2 tables now:



 -- table_1:

 CREATE EXTERNAL TABLE `table_1`(

   `keyword` string,

   `domain` string,

   `url` string

   )

 PARTITIONED BY (yearmonth INT, partition1 STRING)

 STORED AS RCfile



 -- table_2:

 CREATE EXTERNAL TABLE `table_2`(

   `keyword` string,

   `domain` string,

   `url` string

   )

 PARTITIONED BY (yearmonth INT, partition2 STRING)

 STORED AS Parquet



 I'm doing an INSERT OVERWRITE to table_2 from SELECT FROM table_1 with
 dynamic partitioning, and the number of partitions grows dramatically from
 1500 to 40k (because I want to use something else as partitioning).

 The mapreduce job was fine.

 Somehow the process stucked at  Loading data to table default.table_2
 (yearmonth=null, domain_prefix=null) , and I've been waiting for hours.



 Is this expected when we have 40k partitions?



 --

 Refs - Here are the parameters that I used:

 export HADOOP_HEAPSIZE=16384

 set PARQUET_FILE_SIZE=268435456;

 set parquet.block.size=268435456;

 set dfs.blocksize=268435456;

 set parquet.compression=SNAPPY;

 SET hive.exec.dynamic.partition.mode=nonstrict;

 SET hive.exec.max.dynamic.partitions=50;

 SET hive.exec.max.dynamic.partitions.pernode=5;

 SET hive.exec.max.created.files=100;





 Thank you very much!

 Tianqi Tong




 --

 Slava Markeyev | Engineering | Upsight



hive-jdbc do set commands work on the connection or statement level

2015-04-07 Thread Edward Capriolo
I am setting compression variables in multiple statements
conn.createStatement().execute(set compression.type=5=snappy);
conn.createStatement().execute(select into X ...);

Does the set statement set a connection level variable or a statement level
variable? Or are things set in other ways?

TX
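
My working assumption, to be confirmed: HiveServer2 applies SET to the session behind the connection, not to the individual Statement, so the flow I am after would look like this in plain HiveQL (tables x and y are placeholders):

-- issued from one Statement of the connection
SET hive.exec.compress.output=true;
-- a later Statement on the same connection still sees the setting
INSERT OVERWRITE TABLE x SELECT * FROM y;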


Re: Is it possible to do a LEFT JOIN LATERAL in Hive?

2015-04-05 Thread Edward Capriolo
Lateral view does support outer if that helps.
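
A rough sketch of the syntax, with a hypothetical table subjects holding an array column events:

SELECT s.subject_id, t.ev
FROM subjects s
LATERAL VIEW OUTER explode(s.events) t AS ev;

With OUTER, rows whose array is empty or NULL are kept (ev comes back NULL) instead of being dropped.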

On Sunday, April 5, 2015, @Sanjiv Singh sanjiv.is...@gmail.com wrote:

 Hi Jeremy,

 Adding to my response 

 1. Hive doesn't support named insertion , so need to use other ways of
 insertion data in hive table ..

 2.  As you know , hive doesn't support LEFT JOIN LATERAL.  Query , I given
 , is producing same result . hope that it can help you formulate things and
 achieve the same in hive.
 On Apr 5, 2015 3:55 PM, @Sanjiv Singh sanjiv.is...@gmail.com
 javascript:_e(%7B%7D,'cvml','sanjiv.is...@gmail.com'); wrote:

 --  create table lhs

 create table lhs (
 subject_id int,
 date_time  BIGINT
 );

 --  insert some records in table lhs , named insertion will not work
 in case of  hive

 insert into table lhs select 1,1000 from tmpTableWithOneRecords limit 1;
 insert into table  lhs select 1,1100 from tmpTableWithOneRecords limit 1;
 insert into table  lhs select 1,2000 from tmpTableWithOneRecords limit 1;
 insert into table  lhs select 2,1002 from tmpTableWithOneRecords limit 1;
 insert into table  lhs select 2,1998 from tmpTableWithOneRecords limit 1;

 create table events (
 subject_id  int,
 date_time   BIGINT,
 event_val   int
 );

insert into table events select 1,999, 1 from
 tmpTableWithOneRecords limit 1;
insert into table events select 1,1000, 2 from
 tmpTableWithOneRecords limit 1;
insert into table events select 1,1001, 3 from
 tmpTableWithOneRecords limit 1;
insert into table events select 1,1999, 4 from
 tmpTableWithOneRecords limit 1;
insert into table events select 1,2000, 5 from
 tmpTableWithOneRecords limit 1;
insert into table events select 1,2001, 6 from
 tmpTableWithOneRecords limit 1;

insert into table events select 2,999, 10 from
 tmpTableWithOneRecords limit 1;
insert into table events select 2,1000, 20 from
 tmpTableWithOneRecords limit 1;
insert into table events select 2,1001, 30 from
 tmpTableWithOneRecords limit 1;
insert into table events select 2,1999, 40 from
 tmpTableWithOneRecords limit 1;
insert into table events select 2,2000, 50 from
 tmpTableWithOneRecords limit 1;
insert into table events select 2,2001, 60 from
 tmpTableWithOneRecords limit 1;


 select subject_id,adate,SUM(event_val),COUNT(event_val) from (SELECT
 a.subject_id as subject_id ,a.date_time as adate , b.date_time as
 bdate , b.event_val as event_val  FROM events b LEFT OUTER JOIN lhs a
 ON b.subject_id = a.subject_id) abc where bdate  adate group by
 subject_id,adate;



 1   10001   1
 1   11006   3
 1   200010  4
 2   100260  3
 2   199860  3


 On 4/5/15, Jeremy Davis jda...@datasong.com
 javascript:_e(%7B%7D,'cvml','jda...@datasong.com'); wrote:
  Hello!
  I would like to do a LEFT JOIN LATERAL .. Which is using values on the
 LHS
  as parameters on the RHS. Is this sort of thing possible in Hive?
 
 
  -JD
 
 
   Some example SQL:
 
 
  create table lhs (
  subject_id integer,
  date_time  BIGINT
  );
 
 —Subjects and responses at Arbitrary response times:
 insert into lhs (subject_id, date_time) values (1,1000);
 insert into lhs (subject_id, date_time) values (1,1100);
 insert into lhs (subject_id, date_time) values (1,2000);
 insert into lhs (subject_id, date_time) values (2,1002);
 insert into lhs (subject_id, date_time) values (2,1998);
 
  create table events (
  subject_id  integer,
  date_time   BIGINT,
  event_val   integer
  );
 
  SELECT * from lhs LEFT JOIN LATERAL ( select SUM(event_val) as val_sum,
  count(event_val) as ecnt from events WHERE date_time  lhs.date_time and
  subject_id = lhs.subject_id ) rhs1 ON true;
 
 
 insert into events (subject_id, date_time, event_val) values
 (1,999,
  1);
 insert into events (subject_id, date_time, event_val) values
 (1,1000,
  2);
 insert into events (subject_id, date_time, event_val) values
 (1,1001,
  3);
 insert into events (subject_id, date_time, event_val) values
 (1,1999,
  4);
 insert into events (subject_id, date_time, event_val) values
 (1,2000,
  5);
 insert into events (subject_id, date_time, event_val) values
 (1,2001,
  6);
 
 insert into events (subject_id, date_time, event_val) values
 (2,999,
  10);
 insert into events (subject_id, date_time, event_val) values
 (2,1000,
  20);
 insert into events (subject_id, date_time, event_val) values
 (2,1001,
  30);
 insert into events (subject_id, date_time, event_val) values
 (2,1999,
  40);
 insert into events (subject_id, date_time, event_val) values
 (2,2000,
  50);
 insert into events (subject_id, date_time, event_val) values
 (2,2001,
  60);
 
 SELECT * from lhs LEFT JOIN LATERAL ( select SUM(event_val) as
  val_sum, count(event_val) as ecnt from events WHERE date_time 
  lhs.date_time 

Re: How to read Protobuffers in Hive

2015-03-25 Thread Edward Capriolo
You may be able to use:
https://github.com/edwardcapriolo/hive-protobuf

(Use the branch not master)

This code is based on the avro support. It works well even with nested
objects.




On Wed, Mar 25, 2015 at 12:28 PM, Lukas Nalezenec 
lukas.naleze...@firma.seznam.cz wrote:

  Hi,
 I am trying to write Serde + ObjectInspectors for reading Protobuffers in
 Hive.
 I tried to use class ProtocolBuffersStructObjectInspector from Hive but it
 last worked with old protobuffer version 2.3.
 I tried to use ObjectInspector from Twitter Elephant-bird but it does not
 work too.

 It looks like that it could help if I havent used DynamicMessage$Builder .
 The problem is that DynamicMessage$Builder cannot be reused.
 When message is build from builder the builder field fields is set tu
 null and it throws NPE on second build() call.

 ...
   at 
 com.lukas.AbstractProtobufStructObjectInspector.setStructFieldData_default(AbstractProtobufStructObjectInspector.java:253)
   at 
 com.lukas.ProtobufStructObjectInspector.setStructFieldData(ProtobufStructObjectInspector.java:61)
   at 
 org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.convert(ObjectInspectorConverters.java:325)
   at 
 org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.convert(ObjectInspectorConverters.java:324)
   at 
 org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:630)
   ... 9 more
 Caused by: java.lang.NullPointerException
   at 
 com.google.protobuf.DynamicMessage$Builder.clearField(DynamicMessage.java:386)
   at 
 com.google.protobuf.DynamicMessage$Builder.clearField(DynamicMessage.java:252)
   at 
 com.lukas.AbstractProtobufStructObjectInspector.setStructFieldData_default(AbstractProtobufStructObjectInspector.java:177)
   ... 13 more



 I am using Hive 10 but I am also interested in solution for Hive 13.

 Does anybody have working Protobuffer Serde ?

 Thanks
 Best Regards

 Lukas



The curious case of the hive-server-2 empty partitions.

2015-03-24 Thread Edward Capriolo
Hey all,

I have cloudera 5.3, and an issue involving HiveServer2 and Hive.

We have a process that launches Hive JDBC queries, hourly. This process
selects from one table and builds another.

It looks something like this (slightly obfuscated query)

 FROM beacon INSERT OVERWRITE TABLE author_article_hourly PARTITION
(dt=2015032412) SELECT author, article_id, sum(if(referrer_fields.type !=
'seed' AND clicktype='beauty',1,0)) AS viral_count,
sum(if(referrer_fields.type = 'seed' AND clicktype='beauty',1,0)) AS
seed_count, sum(if(clicktype='beauty',1,0)) AS pageview,
sum(if(clicktype='click',1,0)) AS clicks WHERE dt=2015032412 AND (author IS
NOT null OR article_id IS NOT NULL) GROUP by author,article_id,dt ORDER BY
viral_count DESC LIMIT 10 INSERT OVERWRITE TABLE author_hourly
PARTITION (dt=2015032412) SELECT author, sum(if(referrer_fields.type !=
'seed' AND clicktype='beauty',1,0)) AS viral_count,
sum(if(referrer_fields.type = 'seed' AND clicktype='beauty',1,0)) AS
seed_count, sum(if(clicktype='beauty',1,0)) AS pageview,
sum(if(clicktype='click',1,0)) AS clicks WHERE dt=2015032412 AND (author IS
NOT null OR article_id IS NOT NULL) GROUP by author,dt ORDER BY viral_count
DESC LIMIT 10


1) I have confirmed that the source table had data at the time of the query
2) The jdbc statement did not throw an exception.
3) The jobs that produced one empty output file ran as long as those that
produced data.

   11 655720
/user/hive/warehouse/author_hourly/dt=201503
   11  0
/user/hive/warehouse/author_hourly/dt=2015032223
   11 644289
/user/hive/warehouse/author_hourly/dt=2015032300
   11  0
/user/hive/warehouse/author_hourly/dt=2015032301
   11 640076
/user/hive/warehouse/author_hourly/dt=2015032302
   11  0
/user/hive/warehouse/author_hourly/dt=2015032303
   11  0
/user/hive/warehouse/author_hourly/dt=2015032304
   11 715033
/user/hive/warehouse/author_hourly/dt=2015032320
   11 691352
/user/hive/warehouse/author_hourly/dt=2015032321
   11  0
/user/hive/warehouse/author_hourly/dt=2015032322
   11 653690
/user/hive/warehouse/author_hourly/dt=2015032323
   11  0
/user/hive/warehouse/author_hourly/dt=2015032400
   11 650930
/user/hive/warehouse/author_hourly/dt=2015032401
   11 639389
/user/hive/warehouse/author_hourly/dt=2015032402
   11  0
/user/hive/warehouse/author_hourly/dt=2015032403
   11 544848
/user/hive/warehouse/author_hourly/dt=2015032404
   11 495953
/user/hive/warehouse/author_hourly/dt=2015032405
   11  0
/user/hive/warehouse/author_hourly/dt=2015032406
   11  0
/user/hive/warehouse/author_hourly/dt=2015032407
   11 425209
/user/hive/warehouse/author_hourly/dt=2015032408
   11 443696
/user/hive/warehouse/author_hourly/dt=2015032409
   11 472888
/user/hive/warehouse/author_hourly/dt=2015032410
   11  0
/user/hive/warehouse/author_hourly/dt=2015032411
   11  0
/user/hive/warehouse/author_hourly/dt=2015032412
   11  0
/user/hive/warehouse/author_hourly/dt=2015032413

I have turned hiveserver2 logging up to debug. There are no logs at level
error.

The folders with 0 bytes have a single empty file in them named 0_00

Drifting through job history server logs I have found this:

2015-03-24 16:16:35,446 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received
completed container container_1422629510062_294146_01_05
2015-03-24 16:16:35,447 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After
Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:1
AssignedMaps:0 AssignedReds:0 CompletedMaps:1 CompletedReds:0
ContAlloc:1 ContRel:0 HostLocal:0 RackLocal:1
2015-03-24 16:16:35,448 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
Diagnostics report from attempt_1422629510062_294146_m_00_0:
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

Can anyone explain why queries launched through HiveServer2 sometimes move
empty files to the final directory?  I am pretty clueless as to the cause.
I am assuming load on the cluster is killing 

Call for case studies for Programming Hive, 2nd edition

2015-03-22 Thread Edward Capriolo
Hello all,

Work is getting underway for Programming Hive 2nd Edition! One of the parts
I enjoyed most is the case studies. They showed hive used in a number of
enterprises and for different purposes.

Since the 2nd edition is on the way I want to make another call for case
studies and use cases of hive. For the last book many of these case studies
came in at the buzzer (very close to publishing time). We will be avoiding
this for the second edition so that we have more time to iterate on them so
they have a more consistent feel.

Some of the existing case studies need to be reworked for the second
edition, for example case studies that address doing RANK queries before
Hive had windowing. If you did a case study for the book I will try to
track you down individually to see if it can be updated, but it would be
easier if you found me.


Thank you,
Edward


Re: Why hive 0.13 will initialize derby database if the metastore parameters are not set in hive-site.xml?

2015-03-06 Thread Edward Capriolo
Make sure hive autogather stats is false, or set up the stats db.
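
A minimal sketch of what I mean, e.g. at the top of .hiverc (or the equivalent property in hive-site.xml):

set hive.stats.autogather=false;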

On Friday, March 6, 2015, Jim Green openkbi...@gmail.com wrote:

 Hi Team,

 Starting from hive 0.13, if the metastore parameters are not set in
 hive-site.xml, but we set in .hiverc, hive will try to initialize derby
 database in current working directory.
 This behavior did not exist in hive 0.12.
 Is it a known bug? or behavior change?

 I have the repro as below:

 *Env:*
 Hive 0.13
 MySQL as backend metastore database.
 No hive metastore service.

 *Case 1:*
 .hiverc is not used and hive-site.xml has below 4 parameters:
 <property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:mysql://localhost/metastore</value>
   <description>the URL of the MySQL database</description>
 </property>

 <property>
   <name>javax.jdo.option.ConnectionDriverName</name>
   <value>com.mysql.jdbc.Driver</value>
 </property>

 <property>
   <name>javax.jdo.option.ConnectionUserName</name>
   <value>hive</value>
 </property>

 <property>
   <name>javax.jdo.option.ConnectionPassword</name>
   <value>mypassword</value>
 </property>

 In this case, if we run hive and it works fine and connects to mysql as
 the backend metastore database.
 It will NOT initialize the derby database in current directory.

 *Case 2:*
 hive-site.xml is empty and .hiverc has below 3 parameters:
 [root@~]# cat .hiverc
 set javax.jdo.option.ConnectionURL=jdbc:mysql://localhost/metastore;
 set javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver;
 set javax.jdo.option.ConnectionUserName=hive;
 set javax.jdo.option.ConnectionPassword=mypassword;

 In this case, if we run hive and it also works fine and connects to
 mysql as the backend metastore database.
 However it initialized the derby database in current working directory
 where you run hive command:

 drwxr-xr-x   5 root root   4096 Mar  6 12:18 metastore_db
 -rw-r--r--   1 root root  70754 Mar  6 12:18 derby.log

 If we open another putty session and run hive in the same directory, it
 will fail with below error:
 Caused by: ERROR XSDB6: Another instance of Derby may have already booted
 the database /xxx/xxx/xxx/metastore_db.
 This is because derby database only allows one connection.

 We do not understand why after we moved the 4 parameters from
 hive-site.xml to .hiverc, hive will try to initialize the derby database?


 --
 Thanks,
 www.openkb.info
 (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)



-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: Which [open-souce] SQL engine atop Hadoop?

2015-01-31 Thread Edward Capriolo
"with the metadata in a special metadata store (not on hdfs), and its not
as easy for all systems to access hive metadata." I disagree.

Hive's metadata is not only accessible through SQL constructs like
describe table; the entire metastore is also a thrift service, so you have
programmatic access to determine things like what columns are in a table,
etc. Thrift creates RPC clients for almost every major language.

In the programming hive book
http://www.amazon.com/dp/1449319335/?tag=mh0b-20hvadid=3521269638ref=pd_sl_4yiryvbf8k_e
there are even examples where I show how to iterate over all the tables inside
the database from a java client.
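
The SQL-construct route is as simple as (my_table being a placeholder):

SHOW TABLES;
DESCRIBE FORMATTED my_table;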

On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers ko...@tresata.com wrote:

 yes you can run whatever you like with the data in hdfs. keep in mind that
 hive makes this general access pattern just a little harder, since hive has
 a tendency to store data and metadata separately, with the metadata in a
 special metadata store (not on hdfs), and its not as easy for all systems
 to access hive metadata.

 i am not familiar at all with tajo or drill.

 On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks samuelma...@gmail.com
 wrote:

 Thanks for the advice

 Koert: when everything is in the same essential data-store (HDFS), can't
 I just run whatever complex tools I'm whichever paradigm they like?

 E.g.: GraphX, Mahout etc.

 Also, what about Tajo or Drill?

 Best,

 Samuel Marks
 http://linkedin.com/in/samuelmarks

 PS: Spark-SQL is read-only IIRC, right?
 On 31 Jan 2015 03:39, Koert Kuipers ko...@tresata.com wrote:

 since you require high-powered analytics, and i assume you want to stay
 sane while doing so, you require the ability to drop out of sql when
 needed. so spark-sql and lingual would be my choices.

 low latency indicates phoenix or spark-sql to me.

 so i would say spark-sql

 On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks samuelma...@gmail.com
 wrote:

 HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and
 exposing both JDBC and ODBC interfaces. However, although Pivotal does 
 open-source
 a lot of software http://www.pivotal.io/oss, I don't believe they
 open source Pivotal HD: HAWQ.

 So that doesn't meet my requirements. I should note that the project I
 am building will also be open-source, which heightens the importance of
 having all components also being open-source.

 Cheers,

 Samuel Marks
 http://linkedin.com/in/samuelmarks

 On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari 
 siddharth.tiw...@live.com wrote:

 Have you looked at HAWQ from Pivotal ?

 Sent from my iPhone

 On Jan 30, 2015, at 4:27 AM, Samuel Marks samuelma...@gmail.com
 wrote:

 Since Hadoop https://hive.apache.org came out, there have been
 various commercial and/or open-source attempts to expose some 
 compatibility
 with SQL http://drill.apache.org. Obviously by posting here I am
 not expecting an unbiased answer.

 Seeking an SQL-on-Hadoop offering which provides: low-latency
 querying, and supports the most common CRUD https://spark.apache.org,
 including [the basics!] along these lines: CREATE TABLE, INSERT INTO, 
 SELECT
 * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE.
 Transactional support would be nice also, but is not a must-have.

 Essentially I want a full replacement for the more traditional RDBMS,
 one which can scale from 1 node to a serious Hadoop cluster.

 Python is my language of choice for interfacing, however there does
 seem to be a Python JDBC wrapper https://spark.apache.org/sql.

 Here is what I've found thus far:

- Apache Hive https://hive.apache.org (SQL-like, with
interactive SQL thanks to the Stinger initiative)
- Apache Drill http://drill.apache.org (ANSI SQL support)
- Apache Spark https://spark.apache.org (Spark SQL
https://spark.apache.org/sql, queries only, add data via Hive,
RDD

 https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD
or Paraquet http://parquet.io/)
- Apache Phoenix http://phoenix.apache.org (built atop Apache
HBase http://hbase.apache.org, lacks full transaction
http://en.wikipedia.org/wiki/Database_transaction support, relational
operators http://en.wikipedia.org/wiki/Relational_operators and
some built-in functions)
- Cloudera Impala

 http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html
(significant HiveQL support, some SQL language support, no support for
indexes on its tables, importantly missing DELETE, UPDATE and 
 INTERSECT;
amongst others)
- Presto https://github.com/facebook/presto from Facebook (can
query Hive, Cassandra http://cassandra.apache.org, relational
DBs etc. Doesn't seem to be designed for low-latency responses across
small clusters, or support UPDATE operations. It is optimized for
data warehousing or analytics¹
http://prestodb.io/docs/current/overview/use-cases.html)
- SQL-Hadoop 

Re: Hive JSON Serde question

2015-01-25 Thread Edward Capriolo
Nested lists require nested lateral views.
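
A rough, untested sketch of the pattern against the table above, using the names from the thread:

SELECT object, e.id, e.time, c.field, c.value.item
FROM datafeed_json
LATERAL VIEW explode(entry) t1 AS e
LATERAL VIEW explode(e.changes) t2 AS c;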

On Sun, Jan 25, 2015 at 11:02 AM, Sanjay Subramanian 
sanjaysubraman...@yahoo.com wrote:

 hey guys

 This is the Hive table definition I have created based on the JSON
 I am using this version of hive json serde
 https://github.com/rcongiu/Hive-JSON-Serde

 ADD JAR
 /home/sanjay/mycode/jar/jsonserde/json-serde-1.3.1-SNAPSHOT-jar-with-dependencies.jar
 ;
 DROP TABLE IF EXISTS
   datafeed_json
 ;
 CREATE EXTERNAL TABLE IF NOT EXISTS
datafeed_json (
object STRING,
 entry array<struct<
   id:STRING,
   time:BIGINT,
   changes:array<struct<
     field:STRING,
     value:struct<
       item:STRING,
       verb:STRING,
       parent_id:STRING,
       sender_id:BIGINT,
       created_time:BIGINT
     >
   >>
 >>
 ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE
 LOCATION '/data/sanjay/datafeed'
 ;


 QUERY 1
 ===
 ADD JAR
 /home/sanjay/mycode/jar/jsonserde/json-serde-1.3.1-SNAPSHOT-jar-with-dependencies.jar
 ;
  SELECT
 object,
 entry[0].id,
 entry[0].time,
 entry[0].changes[0].field,
 entry[0].changes[0].value.item,
 entry[0].changes[0].value.verb,
 entry[0].changes[0].value.parent_id,
 entry[0].changes[0].value.sender_id,
 entry[0].changes[0].value.created_time
   FROM
 datafeed_json
 ;

 RESULT1
 ==
 foo123  113621765320467 1418608223 leads song1 rock
 113621765320467_1107142375968396 14748082019 1418608223


 QUERY2
 ==
 ADD JAR
 /home/sanjay/mycode/jar/jsonserde/json-serde-1.3.1-SNAPSHOT-jar-with-dependencies.jar
 ;
  SELECT
 object,
 entry.id,
 entry.time,
 ntry
   FROM
 datafeed_json
   LATERAL VIEW EXPLODE
 (datafeed_json.entry.changes) oc1 AS ntry
 ;

 RESULT2
 ===
 This gives 4 rows but I was not able to iteratively do the LATERAL VIEW
 EXPLODE


 I tried various combinations of LATERAL VIEW , LATERAL VIEW EXPLODE,
 json_tuple to extract all fields in an exploded view from the JSON in tab
 separated format but no luck.

 Any thoughts ?



 Thanks

 sanjay





Getting Tez working against cdh 5.3

2015-01-20 Thread Edward Capriolo
It seems that CDH does not ship with enough jars to run tez out of the box.

I have found the related cloudera forked pom.

In this pom hive is built against tez 0.4.1-incubating-tez2.0-SHAPSHOT

Thus I followed the instructions here:

http://tez.apache.org/install_pre_0_5_0.html

hive dfs -lsr /apps
 ;
lsr: DEPRECATED: Please use 'ls -R' instead.
drwxr-xr-x   - ecapriolo supergroup  0 2015-01-16 23:00
/apps/tez-0.4.1-incubating
drwxr-xr-x   - ecapriolo supergroup  0 2015-01-16 23:00
/apps/tez-0.4.1-incubating/lib
-rw-r--r--   3 ecapriolo supergroup 303139 2015-01-16 23:00
/apps/tez-0.4.1-incubating/lib/avro-1.7.4.jar
-rw-r--r--   3 ecapriolo supergroup  41123 2015-01-16 23:00
/apps/tez-0.4.1-incubating/lib/commons-cli-1.2.jar
-rw-r--r--   3 ecapriolo supergroup 610259 2015-01-16 23:00
/apps/tez-0.4.1-incubating/lib/commons-collections4-4.0.jar
-rw-r--r--   3 ecapriolo supergroup1648200 2015-01-16 23:00
/apps/tez-0.4.1-incubating/lib/guava-11.0.2.jar
-rw-r--r--   3 ecapriolo supergroup 710492 2015-01-16 23:00
/apps/tez-0.4.1-incubating/lib/guice-3.0.jar
-rw-r--r--   3 ecapriolo supergroup 656365 2015-01-16 23:00
/apps/tez-0.4.1-incubating/lib/hadoop-mapreduce-client-common-2.2.0.jar
-rw-r--r--   3 ecapriolo supergroup1455001 2015-01-16 23:00
/apps/tez-0.4.1-incubating/lib/hadoop-mapreduce-client-core-2.2.0.jar
-rw-r--r--   3 ecapriolo supergroup  21537 2015-01-16 23:00
/apps/tez-0.4.1-incubating/lib/hadoop-mapreduce-client-shuffle-2.2.0.jar
-rw-r--r--   3 ecapriolo supergroup  81743 2015-01-16 23:00
/apps/tez-0.4.1-incubating/lib/jettison-1.3.4.jar
-rw-r--r--   3 ecapriolo supergroup 533455 2015-01-16 23:00
/apps/tez-0.4.1-incubating/lib/protobuf-java-2.5.0.jar
-rw-r--r--   3 ecapriolo supergroup 995968 2015-01-16 23:00
/apps/tez-0.4.1-incubating/lib/snappy-java-1.0.4.1.jar
-rw-r--r--   3 ecapriolo supergroup 752332 2015-01-16 23:00
/apps/tez-0.4.1-incubating/tez-api-0.4.1-incubating.jar
-rw-r--r--   3 ecapriolo supergroup  34089 2015-01-16 23:00
/apps/tez-0.4.1-incubating/tez-common-0.4.1-incubating.jar
-rw-r--r--   3 ecapriolo supergroup 980132 2015-01-16 23:00
/apps/tez-0.4.1-incubating/tez-dag-0.4.1-incubating.jar
-rw-r--r--   3 ecapriolo supergroup 246395 2015-01-16 23:00
/apps/tez-0.4.1-incubating/tez-mapreduce-0.4.1-incubating.jar
-rw-r--r--   3 ecapriolo supergroup 199984 2015-01-16 23:00
/apps/tez-0.4.1-incubating/tez-mapreduce-examples-0.4.1-incubating.jar
-rw-r--r--   3 ecapriolo supergroup 114676 2015-01-16 23:00
/apps/tez-0.4.1-incubating/tez-runtime-internals-0.4.1-incubating.jar
-rw-r--r--   3 ecapriolo supergroup 352835 2015-01-16 23:00
/apps/tez-0.4.1-incubating/tez-runtime-library-0.4.1-incubating.jar
-rw-r--r--   3 ecapriolo supergroup   6832 2015-01-16 23:00
/apps/tez-0.4.1-incubating/tez-tests-0.4.1-incubating.jar

This is my tez-site.xml

<configuration>
  <property>
    <name>tez.lib.uris</name>
    <value>${fs.default.name}/apps/tez-0.4.1-incubating,${fs.default.name}/apps/tez-0.4.1-incubating/lib/</value>
  </property>
</configuration>

[ecapriolo@production-hadoop-cdh-69-7 ~]$ ls -lahR
/home/ecapriolo/tez-0.4.1-incubating/
/home/ecapriolo/tez-0.4.1-incubating/:
total 2.7M
drwxrwxr-x 3 ecapriolo ecapriolo 4.0K Jan 16 22:54 .
drwx-- 7 ecapriolo ecapriolo  20K Jan 20 15:20 ..
drwxrwxr-x 2 ecapriolo ecapriolo 4.0K Jan 16 22:54 lib
-rw-rw-r-- 1 ecapriolo ecapriolo 735K Jan 16 22:54
tez-api-0.4.1-incubating.jar
-rw-rw-r-- 1 ecapriolo ecapriolo  34K Jan 16 22:54
tez-common-0.4.1-incubating.jar
-rw-rw-r-- 1 ecapriolo ecapriolo 958K Jan 16 22:54
tez-dag-0.4.1-incubating.jar
-rw-rw-r-- 1 ecapriolo ecapriolo 241K Jan 16 22:54
tez-mapreduce-0.4.1-incubating.jar
-rw-rw-r-- 1 ecapriolo ecapriolo 196K Jan 16 22:54
tez-mapreduce-examples-0.4.1-incubating.jar
-rw-rw-r-- 1 ecapriolo ecapriolo 112K Jan 16 22:54
tez-runtime-internals-0.4.1-incubating.jar
-rw-rw-r-- 1 ecapriolo ecapriolo 345K Jan 16 22:54
tez-runtime-library-0.4.1-incubating.jar
-rw-rw-r-- 1 ecapriolo ecapriolo 6.7K Jan 16 22:54
tez-tests-0.4.1-incubating.jar

/home/ecapriolo/tez-0.4.1-incubating/lib:
total 6.8M
drwxrwxr-x 2 ecapriolo ecapriolo 4.0K Jan 16 22:54 .
drwxrwxr-x 3 ecapriolo ecapriolo 4.0K Jan 16 22:54 ..
-rw-rw-r-- 1 ecapriolo ecapriolo 297K Jan 16 22:54 avro-1.7.4.jar
-rw-rw-r-- 1 ecapriolo ecapriolo  41K Jan 16 22:54 commons-cli-1.2.jar
-rw-rw-r-- 1 ecapriolo ecapriolo 596K Jan 16 22:54
commons-collections4-4.0.jar
-rw-rw-r-- 1 ecapriolo ecapriolo 1.6M Jan 16 22:54 guava-11.0.2.jar
-rw-rw-r-- 1 ecapriolo ecapriolo 694K Jan 16 22:54 guice-3.0.jar
-rw-rw-r-- 1 ecapriolo ecapriolo 641K Jan 16 22:54
hadoop-mapreduce-client-common-2.2.0.jar
-rw-rw-r-- 1 ecapriolo ecapriolo 1.4M Jan 16 22:54
hadoop-mapreduce-client-core-2.2.0.jar
-rw-rw-r-- 1 ecapriolo ecapriolo  22K Jan 16 22:54
hadoop-mapreduce-client-shuffle-2.2.0.jar
-rw-rw-r-- 1 ecapriolo ecapriolo  80K Jan 16 22:54 jettison-1.3.4.jar
-rw-rw-r-- 1 ecapriolo ecapriolo 521K Jan 16 22:54 

Re: Getting Tez working against cdh 5.3

2015-01-20 Thread Edward Capriolo
I see. That helped a lot.

java.lang.UnsatisfiedLinkError:
org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native
Method)
at
org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63)
at
org.apache.hadoop.io.compress.SnappyCodec.getCompressorType(SnappyCodec.java:132)
at
org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:148)
at
org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:163)
at
org.apache.tez.runtime.library.common.sort.impl.IFile$Writer.init(IFile.java:128)
at
org.apache.tez.runtime.library.common.sort.impl.dflt.DefaultSorter.spill(DefaultSorter.java:749)
at
org.apache.tez.runtime.library.common.sort.impl.dflt.DefaultSorter.sortAndSpill(DefaultSorter.java:723)
at
org.apache.tez.runtime.library.common.sort.impl.dflt.DefaultSorter.flush(DefaultSorter.java:610)
at
org.apache.tez.runtime.library.output.OnFileSortedOutput.close(OnFileSortedOutput.java:134)
at
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:331)
at
org.apache.hadoop.mapred.YarnTezDagChild$5.run(YarnTezDagChild.java:567)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at
org.apache.hadoop.mapred.YarnTezDagChild.main(YarnTezDagChild.java:553)

Which I see looks like this:

https://issues.apache.org/jira/browse/AMBARI-7994

I am not sure what this fix is doing exactly. Does this mean the JVM being
launched is not being passed the path to the hadoop native libraries? Am I
better off just compiling tez against the cdh jars like you did?


'If you're trying to install this for a private install, I keep updating
this github which installs all of this under your user instead of as
hive.'

No I am not trying for a private install. I want the default hive-server to
be able to submit tez jobs.

My goal is to have a quick recipe for getting tez to work with cdh 5.3 with
minimal hacking of the install.

Edward



On Tue, Jan 20, 2015 at 6:39 PM, Gopal V gop...@apache.org wrote:

 On 1/20/15, 12:34 PM, Edward Capriolo wrote:

 Actually more likely something like this:

 https://issues.apache.org/jira/browse/TEZ-1621


 I have a working Hive-13 + Tez install on CDH-5.2.0-1.cdh5.2.0.p0.36.

 Most of the work needed to get that to work was to build all of Hive+Tez
 against the CDH jars instead of the Apache 2.4.0.

 You need to do

 yarn logs -applicationId app-id

 which should give you the stderr as well, because like the JIRA
 referenced, this looks like an ABI compatibility issue.

 Tez hasn't got any diagnostic for those cases since it never saw the
 container come up and send a heartbeat.

 If you're trying to install this for a private install, I keep updating
 this github which installs all of this under your user instead of as hive.

 https://github.com/t3rmin4t0r/tez-autobuild

 Cheers,
 Gopal

  On Tue, Jan 20, 2015 at 2:02 PM, Prasanth Jayachandran 
 pjayachand...@hortonworks.com wrote:

  My guess is..
  java binary is not in PATH of the shell script that launches the
 container.. try creating a symbolic link in /bin/ to point to java..

 On Tue, Jan 20, 2015 at 7:22 AM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

  It seems that CDH does not ship with enough jars to run tez out of the
 box.

 I have found the related cloudera forked pom.

 In this pom hive is built against tez 0.4.1-incubating-tez2.0-SHAPSHOT

 Thus I followed the instructions here:

 http://tez.apache.org/install_pre_0_5_0.html

 hive dfs -lsr /apps
  ;
 lsr: DEPRECATED: Please use 'ls -R' instead.
 drwxr-xr-x   - ecapriolo supergroup  0 2015-01-16 23:00
 /apps/tez-0.4.1-incubating
 drwxr-xr-x   - ecapriolo supergroup  0 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib
 -rw-r--r--   3 ecapriolo supergroup 303139 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/avro-1.7.4.jar
 -rw-r--r--   3 ecapriolo supergroup  41123 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/commons-cli-1.2.jar
 -rw-r--r--   3 ecapriolo supergroup 610259 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/commons-collections4-4.0.jar
 -rw-r--r--   3 ecapriolo supergroup1648200 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/guava-11.0.2.jar
 -rw-r--r--   3 ecapriolo supergroup 710492 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/guice-3.0.jar
 -rw-r--r--   3 ecapriolo supergroup 656365 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/hadoop-mapreduce-client-common-2.2.0.jar
 -rw-r--r--   3 ecapriolo supergroup1455001 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/hadoop-mapreduce-client-core-2.2.0.jar
 -rw-r--r--   3 ecapriolo supergroup  21537 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/hadoop-mapreduce-client-
 shuffle-2.2.0.jar
 -rw-r--r--   3

Re: Getting Tez working against cdh 5.3

2015-01-20 Thread Edward Capriolo
Actually more likely something like this:

https://issues.apache.org/jira/browse/TEZ-1621

On Tue, Jan 20, 2015 at 2:02 PM, Prasanth Jayachandran 
pjayachand...@hortonworks.com wrote:

 My guess is..
  java binary is not in PATH of the shell script that launches the
 container.. try creating a symbolic link in /bin/ to point to java..

 On Tue, Jan 20, 2015 at 7:22 AM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

 It seems that CDH does not ship with enough jars to run tez out of the
 box.

 I have found the related cloudera forked pom.

 In this pom hive is built against tez 0.4.1-incubating-tez2.0-SHAPSHOT

 Thus I followed the instructions here:

 http://tez.apache.org/install_pre_0_5_0.html

 hive dfs -lsr /apps
  ;
 lsr: DEPRECATED: Please use 'ls -R' instead.
 drwxr-xr-x   - ecapriolo supergroup  0 2015-01-16 23:00
 /apps/tez-0.4.1-incubating
 drwxr-xr-x   - ecapriolo supergroup  0 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib
 -rw-r--r--   3 ecapriolo supergroup 303139 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/avro-1.7.4.jar
 -rw-r--r--   3 ecapriolo supergroup  41123 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/commons-cli-1.2.jar
 -rw-r--r--   3 ecapriolo supergroup 610259 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/commons-collections4-4.0.jar
 -rw-r--r--   3 ecapriolo supergroup1648200 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/guava-11.0.2.jar
 -rw-r--r--   3 ecapriolo supergroup 710492 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/guice-3.0.jar
 -rw-r--r--   3 ecapriolo supergroup 656365 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/hadoop-mapreduce-client-common-2.2.0.jar
 -rw-r--r--   3 ecapriolo supergroup1455001 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/hadoop-mapreduce-client-core-2.2.0.jar
 -rw-r--r--   3 ecapriolo supergroup  21537 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/hadoop-mapreduce-client-shuffle-2.2.0.jar
 -rw-r--r--   3 ecapriolo supergroup  81743 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/jettison-1.3.4.jar
 -rw-r--r--   3 ecapriolo supergroup 533455 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/protobuf-java-2.5.0.jar
 -rw-r--r--   3 ecapriolo supergroup 995968 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/snappy-java-1.0.4.1.jar
 -rw-r--r--   3 ecapriolo supergroup 752332 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-api-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup  34089 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-common-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup 980132 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-dag-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup 246395 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-mapreduce-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup 199984 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-mapreduce-examples-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup 114676 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-runtime-internals-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup 352835 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-runtime-library-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup   6832 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-tests-0.4.1-incubating.jar

 This is my tez-site.xml

 <configuration>
   <property>
     <name>tez.lib.uris</name>
     <value>${fs.default.name}/apps/tez-0.4.1-incubating,${fs.default.name}/apps/tez-0.4.1-incubating/lib/</value>
   </property>
 </configuration>

 [ecapriolo@production-hadoop-cdh-69-7 ~]$ ls -lahR
 /home/ecapriolo/tez-0.4.1-incubating/
 /home/ecapriolo/tez-0.4.1-incubating/:
 total 2.7M
 drwxrwxr-x 3 ecapriolo ecapriolo 4.0K Jan 16 22:54 .
 drwx-- 7 ecapriolo ecapriolo  20K Jan 20 15:20 ..
 drwxrwxr-x 2 ecapriolo ecapriolo 4.0K Jan 16 22:54 lib
 -rw-rw-r-- 1 ecapriolo ecapriolo 735K Jan 16 22:54
 tez-api-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo  34K Jan 16 22:54
 tez-common-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo 958K Jan 16 22:54
 tez-dag-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo 241K Jan 16 22:54
 tez-mapreduce-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo 196K Jan 16 22:54
 tez-mapreduce-examples-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo 112K Jan 16 22:54
 tez-runtime-internals-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo 345K Jan 16 22:54
 tez-runtime-library-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo 6.7K Jan 16 22:54
 tez-tests-0.4.1-incubating.jar

 /home/ecapriolo/tez-0.4.1-incubating/lib:
 total 6.8M
 drwxrwxr-x 2 ecapriolo ecapriolo 4.0K Jan 16 22:54 .
 drwxrwxr-x 3 ecapriolo ecapriolo 4.0K Jan 16 22:54 ..
 -rw-rw-r-- 1 ecapriolo ecapriolo 297K Jan 16 22:54 avro-1.7.4.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo  41K Jan 16 22:54 commons-cli-1.2.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo 596K Jan 16 22:54
 commons-collections4-4.0.jar
 -rw-rw-r-- 1

Re: Getting Tez working against cdh 5.3

2015-01-20 Thread Edward Capriolo
Silly question but...

2015-01-20 15:01:45,366 INFO [IPC Server handler 0 on 41329]
org.apache.tez.dag.app.rm.container.AMContainerImpl: AMContainer
container_1420748315294_70716_01_02 transitioned from IDLE to
RUNNING via event C_PULL_TA
2015-01-20 15:01:45,366 INFO [IPC Server handler 0 on 41329]
org.apache.tez.dag.app.TaskAttemptListenerImpTezDag: Container with
id: container_1420748315294_70716_01_02 given task:
attempt_1420748315294_70716_1_01_00_0
2015-01-20 15:01:45,367 INFO [AsyncDispatcher event handler]
org.apache.hadoop.yarn.util.RackResolver: Resolved
production-hadoop-cdh-64-77.use1.huffpo.net to /default
2015-01-20 15:01:45,369 INFO [AsyncDispatcher event handler]
org.apache.tez.dag.app.dag.impl.TaskAttemptImpl: TaskAttempt:
[attempt_1420748315294_70716_1_01_00_0] started. Is using
containerId: [container_1420748315294_70716_01_02] on NM:
[production-hadoop-cdh-64-77.use1.huffpo.net:8041]
2015-01-20 15:01:45,374 INFO [AsyncDispatcher event handler]
org.apache.tez.dag.history.HistoryEventHandler:
[HISTORY][DAG:dag_1420748315294_70716_1][Event:TASK_ATTEMPT_STARTED]:
vertexName=Map 1,
taskAttemptId=attempt_1420748315294_70716_1_01_00_0,
startTime=1421766105367,
containerId=container_1420748315294_70716_01_02,
nodeId=production-hadoop-cdh-64-77.use1.huffpo.net:8041,
inProgressLogs=production-hadoop-cdh-64-77.use1.huffpo.net:8042/node/containerlogs/container_1420748315294_70716_01_02/ecapriolo,
completedLogs=
2015-01-20 15:01:45,374 INFO [AsyncDispatcher event handler]
org.apache.tez.dag.app.dag.impl.TaskAttemptImpl:
attempt_1420748315294_70716_1_01_00_0 TaskAttempt Transitioned
from START_WAIT to RUNNING due to event TA_STARTED_REMOTELY
2015-01-20 15:01:45,375 INFO [AsyncDispatcher event handler]
org.apache.tez.common.counters.Limits: Counter limits initialized with
parameters:  GROUP_NAME_MAX=128, MAX_GROUPS=500, COUNTER_NAME_MAX=64,
MAX_COUNTERS=1200
2015-01-20 15:01:45,384 INFO [AsyncDispatcher event handler]
org.apache.tez.dag.app.dag.impl.TaskImpl:
task_1420748315294_70716_1_01_00 Task Transitioned from SCHEDULED
to RUNNING
2015-01-20 15:01:45,758 INFO [IPC Server handler 2 on 41329]
org.apache.tez.dag.app.dag.impl.TaskImpl:
TaskAttempt:attempt_1420748315294_70716_1_01_00_0 sent events:
(0-1)
2015-01-20 15:01:49,495 INFO [AMRM Callback Handler Thread]
org.apache.tez.dag.app.rm.TaskScheduler: Allocated container
completed:container_1420748315294_70716_01_02 last allocated to
task: attempt_1420748315294_70716_1_01_00_0
2015-01-20 15:01:49,498 INFO [AsyncDispatcher event handler]
org.apache.tez.dag.app.rm.container.AMContainerImpl: Container
container_1420748315294_70716_01_02 exited with diagnostics set to
Exception from container-launch.
Container id: container_1420748315294_70716_01_02
Exit code: 255
Stack trace: ExitCodeException exitCode=255:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Is there any way to log the command being run that causes the shell to fail?



On Tue, Jan 20, 2015 at 2:09 PM, Edward Capriolo edlinuxg...@gmail.com
wrote:

 Java is on the PATH of our datanode/nodemanager systems

 [ecapriolo@production-hadoop-cdh-67-142 ~]$ which java
 /usr/bin/java

 [ecapriolo@production-hadoop-cdh-67-142 ~]$ java -version
 java version 1.7.0_65
 OpenJDK Runtime Environment (rhel-2.5.1.2.el6_5-x86_64 u65-b17)
 OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)


 On Tue, Jan 20, 2015 at 2:02 PM, Prasanth Jayachandran 
 pjayachand...@hortonworks.com wrote:

 My guess is..
  java binary is not in PATH of the shell script that launches the
 container.. try creating a symbolic link in /bin/ to point to java..

 On Tue, Jan 20, 2015 at 7:22 AM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

 It seems that CDH does not ship with enough jars to run tez out of the
 box.

 I have found the related cloudera forked pom.

 In this pom hive is built against tez 0.4.1-incubating-tez2.0-SHAPSHOT

 Thus I followed the instructions here:

 http://tez.apache.org/install_pre_0_5_0.html

 hive dfs -lsr /apps
  ;
 lsr: DEPRECATED: Please use 'ls -R' instead.
 drwxr-xr-x   - ecapriolo supergroup

Re: Getting Tez working against cdh 5.3

2015-01-20 Thread Edward Capriolo
Java is on the PATH of our datanode/nodemanager systems

[ecapriolo@production-hadoop-cdh-67-142 ~]$ which java
/usr/bin/java

[ecapriolo@production-hadoop-cdh-67-142 ~]$ java -version
java version 1.7.0_65
OpenJDK Runtime Environment (rhel-2.5.1.2.el6_5-x86_64 u65-b17)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)


On Tue, Jan 20, 2015 at 2:02 PM, Prasanth Jayachandran 
pjayachand...@hortonworks.com wrote:

 My guess is..
  java binary is not in PATH of the shell script that launches the
 container.. try creating a symbolic link in /bin/ to point to java..

 On Tue, Jan 20, 2015 at 7:22 AM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

 It seems that CDH does not ship with enough jars to run tez out of the
 box.

 I have found the related cloudera forked pom.

 In this pom hive is built against tez 0.4.1-incubating-tez2.0-SHAPSHOT

 Thus I followed the instructions here:

 http://tez.apache.org/install_pre_0_5_0.html

 hive dfs -lsr /apps
  ;
 lsr: DEPRECATED: Please use 'ls -R' instead.
 drwxr-xr-x   - ecapriolo supergroup  0 2015-01-16 23:00
 /apps/tez-0.4.1-incubating
 drwxr-xr-x   - ecapriolo supergroup  0 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib
 -rw-r--r--   3 ecapriolo supergroup 303139 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/avro-1.7.4.jar
 -rw-r--r--   3 ecapriolo supergroup  41123 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/commons-cli-1.2.jar
 -rw-r--r--   3 ecapriolo supergroup 610259 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/commons-collections4-4.0.jar
 -rw-r--r--   3 ecapriolo supergroup1648200 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/guava-11.0.2.jar
 -rw-r--r--   3 ecapriolo supergroup 710492 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/guice-3.0.jar
 -rw-r--r--   3 ecapriolo supergroup 656365 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/hadoop-mapreduce-client-common-2.2.0.jar
 -rw-r--r--   3 ecapriolo supergroup1455001 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/hadoop-mapreduce-client-core-2.2.0.jar
 -rw-r--r--   3 ecapriolo supergroup  21537 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/hadoop-mapreduce-client-shuffle-2.2.0.jar
 -rw-r--r--   3 ecapriolo supergroup  81743 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/jettison-1.3.4.jar
 -rw-r--r--   3 ecapriolo supergroup 533455 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/protobuf-java-2.5.0.jar
 -rw-r--r--   3 ecapriolo supergroup 995968 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/snappy-java-1.0.4.1.jar
 -rw-r--r--   3 ecapriolo supergroup 752332 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-api-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup  34089 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-common-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup 980132 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-dag-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup 246395 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-mapreduce-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup 199984 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-mapreduce-examples-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup 114676 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-runtime-internals-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup 352835 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-runtime-library-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup   6832 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-tests-0.4.1-incubating.jar

 This is my tez-site.xml

  <configuration>
    <property>
      <name>tez.lib.uris</name>
      <value>${fs.default.name}/apps/tez-0.4.1-incubating,${fs.default.name}/apps/tez-0.4.1-incubating/lib/</value>
    </property>
  </configuration>
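
(For context, a hedged sketch of how jars like those in the listing above would be
staged to match that tez.lib.uris value; the local directory name mirrors the listing
below and is otherwise an assumption.)

hadoop fs -mkdir -p /apps/tez-0.4.1-incubating/lib
hadoop fs -copyFromLocal tez-0.4.1-incubating/*.jar /apps/tez-0.4.1-incubating/
hadoop fs -copyFromLocal tez-0.4.1-incubating/lib/*.jar /apps/tez-0.4.1-incubating/lib/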

 [ecapriolo@production-hadoop-cdh-69-7 ~]$ ls -lahR
 /home/ecapriolo/tez-0.4.1-incubating/
 /home/ecapriolo/tez-0.4.1-incubating/:
 total 2.7M
 drwxrwxr-x 3 ecapriolo ecapriolo 4.0K Jan 16 22:54 .
 drwx-- 7 ecapriolo ecapriolo  20K Jan 20 15:20 ..
 drwxrwxr-x 2 ecapriolo ecapriolo 4.0K Jan 16 22:54 lib
 -rw-rw-r-- 1 ecapriolo ecapriolo 735K Jan 16 22:54
 tez-api-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo  34K Jan 16 22:54
 tez-common-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo 958K Jan 16 22:54
 tez-dag-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo 241K Jan 16 22:54
 tez-mapreduce-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo 196K Jan 16 22:54
 tez-mapreduce-examples-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo 112K Jan 16 22:54
 tez-runtime-internals-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo 345K Jan 16 22:54
 tez-runtime-library-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo 6.7K Jan 16 22:54
 tez-tests-0.4.1-incubating.jar

 /home/ecapriolo/tez-0.4.1-incubating/lib:
 total 6.8M
 drwxrwxr-x 2 ecapriolo ecapriolo 4.0K Jan 16 22:54 .
 drwxrwxr-x 3 ecapriolo ecapriolo 4.0K Jan 16 22:54 ..
 -rw

Re: Getting Tez working against cdh 5.3

2015-01-20 Thread Edward Capriolo
Actually you are correct. Java is not universally on the path. Is there a
way to make tez/hadoop respect/use whatever java the node manager is using? If I
am running the node manager I must have java, right? :)
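
A minimal sketch of two possible routes (the symlink suggested in the quoted reply
below, or pinning an explicit JAVA_HOME so the launch script no longer depends on
PATH); the paths are assumptions based on the "which java" output above and a
typical OpenJDK install, so adjust for your hosts.

# Option 1: the symlink suggested below, on every NodeManager host.
sudo ln -s /usr/bin/java /bin/java

# Option 2: export an explicit JAVA_HOME in hadoop-env.sh / yarn-env.sh on every
# NodeManager host (path is an example for OpenJDK 1.7 on a RHEL-style box).
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64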

On Tue, Jan 20, 2015 at 2:02 PM, Prasanth Jayachandran 
pjayachand...@hortonworks.com wrote:

 My guess is..
  java binary is not in PATH of the shell script that launches the
 container.. try creating a symbolic link in /bin/ to point to java..

 On Tue, Jan 20, 2015 at 7:22 AM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

 It seems that CDH does not ship with enough jars to run tez out of the
 box.

 I have found the related cloudera forked pom.

 In this pom hive is built against tez 0.4.1-incubating-tez2.0-SNAPSHOT

 Thus I followed the instructions here:

 http://tez.apache.org/install_pre_0_5_0.html

 hive> dfs -lsr /apps;
 lsr: DEPRECATED: Please use 'ls -R' instead.
 drwxr-xr-x   - ecapriolo supergroup  0 2015-01-16 23:00
 /apps/tez-0.4.1-incubating
 drwxr-xr-x   - ecapriolo supergroup  0 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib
 -rw-r--r--   3 ecapriolo supergroup 303139 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/avro-1.7.4.jar
 -rw-r--r--   3 ecapriolo supergroup  41123 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/commons-cli-1.2.jar
 -rw-r--r--   3 ecapriolo supergroup 610259 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/commons-collections4-4.0.jar
 -rw-r--r--   3 ecapriolo supergroup1648200 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/guava-11.0.2.jar
 -rw-r--r--   3 ecapriolo supergroup 710492 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/guice-3.0.jar
 -rw-r--r--   3 ecapriolo supergroup 656365 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/hadoop-mapreduce-client-common-2.2.0.jar
 -rw-r--r--   3 ecapriolo supergroup1455001 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/hadoop-mapreduce-client-core-2.2.0.jar
 -rw-r--r--   3 ecapriolo supergroup  21537 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/hadoop-mapreduce-client-shuffle-2.2.0.jar
 -rw-r--r--   3 ecapriolo supergroup  81743 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/jettison-1.3.4.jar
 -rw-r--r--   3 ecapriolo supergroup 533455 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/protobuf-java-2.5.0.jar
 -rw-r--r--   3 ecapriolo supergroup 995968 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/lib/snappy-java-1.0.4.1.jar
 -rw-r--r--   3 ecapriolo supergroup 752332 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-api-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup  34089 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-common-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup 980132 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-dag-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup 246395 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-mapreduce-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup 199984 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-mapreduce-examples-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup 114676 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-runtime-internals-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup 352835 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-runtime-library-0.4.1-incubating.jar
 -rw-r--r--   3 ecapriolo supergroup   6832 2015-01-16 23:00
 /apps/tez-0.4.1-incubating/tez-tests-0.4.1-incubating.jar

 This is my tez-site.xml

  <configuration>
    <property>
      <name>tez.lib.uris</name>
      <value>${fs.default.name}/apps/tez-0.4.1-incubating,${fs.default.name}/apps/tez-0.4.1-incubating/lib/</value>
    </property>
  </configuration>

 [ecapriolo@production-hadoop-cdh-69-7 ~]$ ls -lahR
 /home/ecapriolo/tez-0.4.1-incubating/
 /home/ecapriolo/tez-0.4.1-incubating/:
 total 2.7M
 drwxrwxr-x 3 ecapriolo ecapriolo 4.0K Jan 16 22:54 .
 drwx-- 7 ecapriolo ecapriolo  20K Jan 20 15:20 ..
 drwxrwxr-x 2 ecapriolo ecapriolo 4.0K Jan 16 22:54 lib
 -rw-rw-r-- 1 ecapriolo ecapriolo 735K Jan 16 22:54
 tez-api-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo  34K Jan 16 22:54
 tez-common-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo 958K Jan 16 22:54
 tez-dag-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo 241K Jan 16 22:54
 tez-mapreduce-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo 196K Jan 16 22:54
 tez-mapreduce-examples-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo 112K Jan 16 22:54
 tez-runtime-internals-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo 345K Jan 16 22:54
 tez-runtime-library-0.4.1-incubating.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo 6.7K Jan 16 22:54
 tez-tests-0.4.1-incubating.jar

 /home/ecapriolo/tez-0.4.1-incubating/lib:
 total 6.8M
 drwxrwxr-x 2 ecapriolo ecapriolo 4.0K Jan 16 22:54 .
 drwxrwxr-x 3 ecapriolo ecapriolo 4.0K Jan 16 22:54 ..
 -rw-rw-r-- 1 ecapriolo ecapriolo 297K Jan 16 22:54 avro-1.7.4.jar
 -rw-rw-r-- 1 ecapriolo ecapriolo  41K Jan 16 22:54 commons-cli

jdbc:hive vs jdbc:hive2

2015-01-14 Thread Edward Capriolo
Just a heads up. For anyone that has been using jdbc:hive

I noticed a recent hive...

jdbc:hive2://myhost:port

SQL exception Invalid URL

It might be better if the exception said: Invalid URL. URL must start with
jdbc:hive
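
For reference, a hedged example of connecting with the jdbc:hive2 scheme via
beeline; the host, port, database, and user below are placeholders for a
HiveServer2 setup.

# myhost, 10000, default, and hive are placeholders for your HiveServer2
# host, port, database, and user.
beeline -u "jdbc:hive2://myhost:10000/default" -n hive -e "show tables;"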


Re: Hive parquet vs Vertica vs Impala

2015-01-03 Thread Edward Capriolo
 Hive is the only system that can store and query XML directly, with the
help of different SerDes or input formats.

Impala and Vertica have more standard schema systems that do not support
collections like List, Map, Struct, or the nested collections you might need to
store and process a complex XML document.

Parquet (a storage format that works with Hive and Impala) can support
List, Map, and Struct, but the Impala engine cannot access these at the
moment. Last I checked, Impala refuses to read tables that have one of these
elements (instead of skipping them).

It sounds like you want to do one of a few things:
1) Normalize your XML into a table, and then you can use Vertica, Hive, or
Impala.
2) Write your data using Parquet (to handle nested objects) and Hive to
query it. (Hopefully when Impala adds collection support you can
switch over.)

But mostly you need to do more research.

Edward
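
As an illustration of option 2 above, a minimal sketch of a Parquet-backed Hive
table with the nested collection types being discussed; the table and column
names are invented for the example.

# Invented schema: a Parquet table mixing a scalar with ARRAY, MAP, and STRUCT columns.
hive -e "
CREATE TABLE events_parquet (
  id      STRING,
  tags    ARRAY<STRING>,
  attrs   MAP<STRING,STRING>,
  payload STRUCT<name:STRING, value:STRING>
)
STORED AS PARQUET;
"

Hive can query all four columns; per the discussion above, Impala at the time
would refuse the table because of the complex columns.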

On Sat, Jan 3, 2015 at 2:15 PM, Shashidhar Rao raoshashidhar...@gmail.com
wrote:

 Hi,

 Can someone help me with insights into a Hive-with-Parquet vs Vertica
 comparison.

 I need to store large XML data into one of these databases, so please help me
 with query performance.

 Is Impala open source, and can we use it without a Cloudera license?

 Thanks
 Shashi





Re: Hive parquet vs Vertica vs Impala

2015-01-03 Thread Edward Capriolo
Shashi,

Your questions are too broad, and you are asking questions that are
impossible to answer.
Q. What is faster, X or Y?
A. This depends on countless variables and cannot be answered.

For one example, even databases that are very similar in nature, like
mysql/postgres, might execute a query differently based on their query
planners or even the characteristics of the data.

How can you show a query is faster than Vertica if you do not have
access to Vertica to prove it?

I understand some of what you are trying to determine, but you should
really attempt to install these things and build a prototype to determine
what is the best fit for your application. This will grow your
understanding of the systems, help you ask better questions, and
potentially give you the ability to answer those questions yourself and
make better decisions.

The right way to ask this question might be: "Hello, I have loaded 50 million
rows of data into Hive and I am running this query: 'select X from bla
bla'. My Vertica instance runs this query in X seconds and Hive runs it
in Y seconds. Can this be optimized further?"

The software license for Impala is included here:
https://github.com/cloudera/Impala/blob/master/LICENSE.txt

Edward
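
A hedged sketch of the kind of measurement described above (events_parquet is
an invented table name; shell wall-clock time is a rough but repeatable
comparison point):

# Time the same query on each system and compare the wall-clock numbers.
time hive -e "SELECT count(*) FROM events_parquet"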


On Sat, Jan 3, 2015 at 3:29 PM, Shashidhar Rao raoshashidhar...@gmail.com
wrote:

 Edward,

 Thanks for your reply.
 Can you please tell me the query performance of Hive with Parquet against
 Vertica. Can Hive with Parquet match Vertica's retrieval performance,
 as I have been told Vertica is also a compressed columnar format and is fast?
 What if I query against some 50 million rows, which one will be
 faster?

 And moreover, is Impala open source? In some blogs I have seen Impala described
 as open source, but in others it says Impala is a Cloudera proprietary engine.

 Ultimately, I want to use Hive with Parquet but need to show that it is better
 than Vertica; a few microseconds here and there would be fine. I don't have
 access to Vertica.

 Thanks
 shashi

 On Sun, Jan 4, 2015 at 1:07 AM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

  Hive is the only system that can store and query XML directly, with the
 help of different SerDes or input formats.

 Impala and Vertica have more standard schema systems that do not support
 collections like List, Map, Struct, or the nested collections you might need to
 store and process a complex XML document.

 Parquet (a storage format that works with Hive and Impala) can support
 List, Map, and Struct, but the Impala engine cannot access these at the
 moment. Last I checked, Impala refuses to read tables that have one of these
 elements (instead of skipping them).

 It sounds like you want to do one of a few things:
 1) Normalize your XML into a table, and then you can use Vertica, Hive, or
 Impala.
 2) Write your data using Parquet (to handle nested objects) and Hive to
 query it. (Hopefully when Impala adds collection support you can
 switch over.)

 But mostly you need to do more research.

 Edward

 On Sat, Jan 3, 2015 at 2:15 PM, Shashidhar Rao 
 raoshashidhar...@gmail.com wrote:

 Hi,

 Can someone help me with insights into a Hive-with-Parquet vs Vertica
 comparison.

 I need to store large XML data into one of these databases, so please help me
 with query performance.

 Is Impala open source, and can we use it without a Cloudera license?

 Thanks
 Shashi






