Re: does anyone care about list bucketing stored as directories?

2017-10-08 Thread Xuefu Zhang
Lack of a response doesn't necessarily mean "don't care". Maybe you could give
a good description of the problem and the proposed solution; frankly, I cannot
make much sense of the previous email.

Thanks,
Xuefu

On Fri, Oct 6, 2017 at 5:05 PM, Sergey Shelukhin 
wrote:

> Looks like nobody does… I’ll file a ticket to remove it shortly.
>
> From: Sergey Shelukhin
> Date: Tuesday, October 3, 2017 at 12:59
> To: "user@hive.apache.org", "d...@hive.apache.org"
> Subject: does anyone care about list bucketing stored as directories?
>
> 1) There seem to be some bugs and limitations in LB (e.g. incorrect
> cleanup - https://issues.apache.org/jira/browse/HIVE-14886) and nobody
> appears to as much as watch JIRAs ;) Does anyone actually use this stuff?
> Should we nuke it in 3.0, and by 3.0 I mean I’ll remove it from master in a
> few weeks? :)
>
> 2) I actually wonder, on top of the same SQL syntax, wouldn’t it be much
> easier to add logic to partitioning to write skew values into partitions
> and non-skew values into a new type of default partition? It won’t affect
> nearly as many low level codepaths in obscure and unobvious ways, instead
> keeping all the logic in metastore and split generation, and would
> integrate with Hive features like PPD automatically.
> Esp. if we are ok with the same limitations - e.g. if you add a new skew
> value right now, I’m not sure what happens to the rows with that value
> already sitting in the non-skew directories, but I don’t expect anything
> reasonable...
>
>
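For context, the list-bucketing feature under discussion ("list bucketing stored as directories") is declared with DDL along these lines (a minimal sketch; the table name, columns, and skew values are hypothetical):

-- hypothetical example of the feature under discussion
CREATE TABLE skewed_example (key STRING, value STRING)
SKEWED BY (key) ON ('heavy_key_1', 'heavy_key_2')
STORED AS DIRECTORIES;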


Re: Aug. 2017 Hive User Group Meeting

2017-08-21 Thread Xuefu Zhang
Dear Hive users and developers,

As a reminder, the next Hive User Group Meeting will take place this Thursday,
Aug. 24. The agenda is available on the event page (
https://www.meetup.com/Hive-User-Group-Meeting/events/242210487/).

See you all there!

Thanks,
Xuefu

On Tue, Aug 1, 2017 at 7:18 PM, Xuefu Zhang <xu...@apache.org> wrote:

> Hi all,
>
> It's an honor to announce that Hive community is launching a Hive user
> group meeting in the bay area this month. The details can be found at
> https://www.meetup.com/Hive-User-Group-Meeting/events/242210487/.
>
> We are inviting talk proposals from Hive users as well as developers at
> this time. We currently have 5 openings.
>
> Please let me know if you have any questions or suggestions.
>
> Thanks,
> Xuefu
>
>


Aug. 2017 Hive User Group Meeting

2017-08-01 Thread Xuefu Zhang
Hi all,

It's an honor to announce that the Hive community is launching a Hive user
group meeting in the Bay Area this month. The details can be found at
https://www.meetup.com/Hive-User-Group-Meeting/events/242210487/.

We are inviting talk proposals from Hive users as well as developers at
this time. We currently have 5 openings.

Please let me know if you have any questions or suggestions.

Thanks,
Xuefu


Welcome Rui Li to Hive PMC

2017-05-24 Thread Xuefu Zhang
Hi all,

It's an honor to announce that the Apache Hive PMC has recently voted to invite
Rui Li as a new Hive PMC member. Rui is a long-time Hive contributor and
committer, and has made significant contributions to Hive, especially Hive
on Spark. Please join me in congratulating him; we look forward to the
bigger role he will play in the Apache Hive project.

Thanks,
Xuefu


Jimmy Xiang now a Hive PMC member

2017-05-24 Thread Xuefu Zhang
Hi all,

It's an honor to announce that the Apache Hive PMC has recently voted to invite
Jimmy Xiang as a new Hive PMC member. Please join me in congratulating him;
we look forward to the bigger role he will play in the Apache Hive project.

Thanks,
Xuefu


Welcome new Hive committer, Zhihai Xu

2017-05-05 Thread Xuefu Zhang
Hi all,

I'm very pleased to announce that the Hive PMC has recently voted to offer
Zhihai committership, which he accepted. Please join me in congratulating him
on this recognition and thanking him for his contributions to Hive.

Regards,
Xuefu


Re: [ANNOUNCE] Apache Hive 2.0.0 Released

2016-02-16 Thread Xuefu Zhang
Congratulations, guys!!!

--Xuefu

On Tue, Feb 16, 2016 at 11:54 AM, Prasanth Jayachandran <
pjayachand...@hortonworks.com> wrote:

> Great news! Thanks Sergey for the effort.
>
> Thanks
> Prasanth
>
> > On Feb 16, 2016, at 1:44 PM, Sergey Shelukhin  wrote:
> >
> > The Apache Hive team is proud to announce the release of Apache Hive
> > version 2.0.0.
> >
> > The Apache Hive (TM) data warehouse software facilitates querying and
> > managing large datasets residing in distributed storage. Built on top of
> > Apache Hadoop (TM), it provides:
> >
> > * Tools to enable easy data extract/transform/load (ETL)
> >
> > * A mechanism to impose structure on a variety of data formats
> >
> > * Access to files stored either directly in Apache HDFS (TM) or in other
> > data storage systems such as Apache HBase (TM)
> >
> > * Query execution via Apache Hadoop MapReduce and Apache Tez frameworks.
> >
> > For Hive release details and downloads, please visit:
> > https://hive.apache.org/downloads.html
> >
> > Hive 2.0.0 Release Notes are available here:
> >
> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12332641&projectId=12310843
> >
> > We would like to thank the many contributors who made this release
> > possible.
> >
> > Regards,
> >
> > The Apache Hive Team
> >
> >
> >
>
>


Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Xuefu Zhang
> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | dummy.random_string                                 | dummy.small_vc  | dummy.padding  |
> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  | 1               | xx             |
> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  | 5               | xx             |
> | 10        | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10              | xx             |
>
> 3 rows selected (76.835 seconds)
>
> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1,
> 5, 10);
>
> INFO  : Status: Finished successfully in 80.54 seconds
>
>
> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | dummy.random_string                                 | dummy.small_vc  | dummy.padding  |
> | 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  | 1               | xx             |
> | 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  | 5               | xx             |
> | 10        | 99               | 999              | 188               | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10              | xx             |
>
> 3 rows selected (80.718 seconds)
>
>
>
> Three runs returning the same rows in 80 seconds.
>
>
>
> It is possible that My Spark engine with Hive is 1.3.1 which is out of
> date and that causes this lag.
>
>
>
> There are certain queries that one cannot do with Spark. Besides it does
> not recognize CHAR fields which is a pain.
>
>
>
> spark-sql> *CREATE TEMPORARY TABLE tmp AS*
>
>  > SELECT t.calendar_month_desc, c.channel_desc,
> SUM(s.amount_sold) AS TotalSales
>
>  > FROM sales s, times t, channels c
>
>  > WHERE s.time_id = t.time_id
>
>  > AND   s.channel_id = c.channel_id
>
>  > GROUP BY t.calendar_month_desc, c.channel_desc
>
>  > ;
>
> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
>
> .
>
> You are likely trying to use an unsupported Hive feature.";
>
>
>
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries no

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Xuefu Zhang
I think the difference is not only about which component does the optimization
but more about feature parity. Hive on Spark offers all the functional features
that Hive offers, and those features now run faster. However, Spark SQL is far
from offering this parity, as far as I know.

On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh  wrote:

> Hi,
>
>
>
> My understanding is that with Hive on Spark engine, one gets the Hive
> optimizer and Spark query engine
>
>
>
> With spark using Hive metastore, Spark does both the optimization and
> query engine. The only value add is that one can access the underlying Hive
> tables from spark-sql etc
>
>
>
>
>
> Is this assessment correct?
>
>
>
>
>
>
>
> Thanks
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>


Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Xuefu Zhang
Yes, regardless of which Spark mode you're running in, the Spark AM web UI
should show you how many tasks are running concurrently. I'm a little
surprised to see that your Hive configuration only allows 2 map tasks to
run in parallel. If your cluster has the capacity, you should parallelize
all the tasks to achieve optimal performance. Since I don't know your Spark
SQL configuration, I cannot tell how much parallelism you have over there.
Thus, I'm not sure your comparison is valid.
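
For reference, the parallelism of a Hive on Spark session on YARN is usually governed by the standard Spark executor settings, which can be set per session; a sketch with purely illustrative values:

set hive.execution.engine=spark;
set spark.executor.instances=10;  -- illustrative; size to your cluster's capacity
set spark.executor.cores=4;       -- illustrative
set spark.executor.memory=4g;     -- illustrative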

--Xuefu

On Tue, Feb 2, 2016 at 5:08 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

> Hi Jeff,
>
>
>
> In below
>
>
>
> …. You should be able to see the resource usage in YARN resource manage
> URL.
>
>
>
> Just to be clear we are talking about Port 8088/cluster?
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>
> *From:* Koert Kuipers [mailto:ko...@tresata.com]
> *Sent:* 03 February 2016 00:09
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>
>
>
> uuuhm with spark using Hive metastore you actually have a real
> programming environment and you can write real functions, versus just being
> boxed into some version of sql and limited udfs?
>
>
>
> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>
> When comparing the performance, you need to do it apple vs apple. In
> another thread, you mentioned that Hive on Spark is much slower than Spark
> SQL. However, you configured Hive such that only two tasks can run in
> parallel. However, you didn't provide information on how much Spark SQL is
> utilizing. Thus, it's hard to tell whether it's just a configuration
> problem in your Hive or Spark SQL is indeed faster. You should be able to
> see the resource usage in YARN resource manage URL.
>
> --Xuefu
>
>
>
> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk>
> wrote:
>
> Thanks Jeff.
>
>
>
> Obviously Hive is much more feature rich compared to Spark. Having said
> that in certain areas for example where the SQL feature is available in
> Spark, Spark seems to deliver faster.
>
>
>
> This may be:
>
>
>
> 1.Spark does both the optimisation and execution seamlessly
>
> 2.Hive on Spark has to invoke YARN that adds another layer to the
> process
>
>
>
> Now I did some simple tests on a 100Million rows ORC table available
> through Hive to both.
>
>
>
> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>
>
>
>
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63    rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1    xx
>
> 5   0   4   31    vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5    xx
>
> 10  99  999 188   abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10    xx
>
> Time taken: 50.805 seconds, Fetched 3 row(s)
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63    rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>

Re: Running Spark-sql on Hive metastore

2016-01-31 Thread Xuefu Zhang
For Hive on Spark, there is a startup cost, so the second run should be
faster. More importantly, it looks like you have 18 map tasks but your
cluster only runs two of them at a time; your cluster effectively has only
two-way parallelism. If you configure your cluster to give more capacity to
Hive, the speed should improve as well. Note that each of your map tasks
takes only seconds to complete.

On Sun, Jan 31, 2016 at 3:07 PM, Mich Talebzadeh 
wrote:

> Hi,
>
>
>
> · Spark 1.5.2 on Hive 1.2.1
>
> · Hive 1.2.1 on Spark 1.3.1
>
> · Oracle Release 11.2.0.1.0
>
> · Hadoop 2.6
>
>
>
> I am running spark-sql using Hive metastore and I am pleasantly surprised
> by the speed by which Spark performs certain queries on Hive tables.
>
>
>
> I imported a 100 Million rows table from Oracle into a Hive staging table
> via Sqoop and then did an insert/select into an ORC table in Hive as
> defined below.
>
>
>
> ++--+
>
> |   createtab_stmt   |
>
> ++--+
>
> | CREATE TABLE `dummy`(  |
>
> |   `id` int,|
>
> |   `clustered` int, |
>
> |   `scattered` int, |
>
> |   `randomised` int,|
>
> |   `random_string` varchar(50), |
>
> |   `small_vc` varchar(10),  |
>
> |   `padding` varchar(10))   |
>
> | CLUSTERED BY ( |
>
> |   id)  |
>
> | INTO 256 BUCKETS   |
>
> | ROW FORMAT SERDE   |
>
> |   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'  |
>
> | STORED AS INPUTFORMAT  |
>
> |   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'|
>
> | OUTPUTFORMAT   |
>
> |   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'   |
>
> | LOCATION   |
>
> |   'hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy'  |
>
> | TBLPROPERTIES (|
>
> |   'COLUMN_STATS_ACCURATE'='true',  |
>
> |   'numFiles'='35', |
>
> |   'numRows'='1',   |
>
> |   'orc.bloom.filter.columns'='ID', |
>
> |   'orc.bloom.filter.fpp'='0.05',   |
>
> |   'orc.compress'='SNAPPY', |
>
> |   'orc.create.index'='true',   |
>
> |   'orc.row.index.stride'='1',  |
>
> |   'orc.stripe.size'='16777216',|
>
> |   'rawDataSize'='338', |
>
> |   'totalSize'='5660813776',|
>
> |   'transient_lastDdlTime'='1454234981')|
>
> ++--+
>
>
>
> I am doing simple min,max functions on columns scattered and randomised
> from the above table that are not part of cluster etc in Hive. In addition,
> in Oracle there is no index on these columns as well.
>
>
>
> *If I use Hive 1.2.1 on Spark 1.3.1 it comes back in 50.751 seconds*
>
>
>
> *select min(scattered), max(randomised) from dummy;*
>
> INFO  :
>
> Query Hive on Spark job[0] stages:
>
> INFO  : 0
>
> INFO  : 1
>
> INFO  :
>
> Status: Running (Hive on Spark job[0])
>
> INFO  : Job Progress Format
>
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
>
> INFO  : 2016-01-31 22:55:05,114 Stage-0_0: 0/18 Stage-1_0: 0/1
>
> INFO  : 2016-01-31 22:55:06,122 Stage-0_0: 0(+2)/18 Stage-1_0: 0/1
>
> INFO  : 2016-01-31 22:55:09,165 Stage-0_0: 0(+2)/18 Stage-1_0: 0/1
>
> INFO  : 2016-01-31 22:55:12,190 Stage-0_0: 2(+2)/18 Stage-1_0: 0/1
>
> INFO  : 2016-01-31 22:55:14,201 Stage-0_0: 3(+2)/18 Stage-1_0: 0/1
>
> INFO  : 2016-01-31 22:55:15,209 Stage-0_0: 4(+2)/18 Stage-1_0: 0/1
>
> INFO  : 2016-01-31 22:55:17,218 Stage-0_0: 6(+2)/18 Stage-1_0: 0/1
>
> INFO  : 2016-01-31 22:55:20,234 Stage-0_0: 8(+2)/18 Stage-1_0: 0/1
>
> INFO  : 2016-01-31 22:55:22,245 Stage-0_0: 10(+2)/18Stage-1_0: 0/1
>
> INFO  : 2016-01-31 22:55:25,257 Stage-0_0: 12(+2)/18Stage-1_0: 0/1
>
> INFO  : 2016-01-31 22:55:27,270 Stage-0_0: 14(+2)/18Stage-1_0: 0/1
>
> INFO  : 2016-01-31 22:55:30,289 Stage-0_0: 16(+2)/18Stage-1_0: 0/1
>
> INFO  : 2016-01-31 22:55:31,294 Stage-0_0: 17(+1)/18Stage-1_0: 

Re: Two results are inconsistent when i use Hive on Spark

2016-01-27 Thread Xuefu Zhang
Hi Jone,

Did you mean that you get different results from time to time? If so, could
you run "explain" on the query multiple times to see if there is any
difference? Also, could you try it without the map-join hint?
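
One way to take the hint out of the picture without editing the query is to have Hive ignore it, using the hive.ignore.mapjoin.hint property already present in your hive-site.xml; a sketch, where s and t are hypothetical stand-ins for the real tables:

set hive.ignore.mapjoin.hint=true;  -- the /*+mapjoin(...)*/ hint is ignored; the optimizer picks the join strategy
explain
select /*+mapjoin(t)*/ count(1)
from s join t on s.id = t.id;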

Without the dataset, it's hard to reproduce the problem, so it would be great
if you could provide the DML and the dataset.

Thanks,
Xuefu

On Tue, Jan 26, 2016 at 11:25 PM, Jone Zhang  wrote:

> *Some properties on hive-site.xml is*
>
> 
> hive.ignore.mapjoin.hint
> false
> 
> 
> hive.auto.convert.join
> true
> 
> 
> hive.auto.convert.join.noconditionaltask
> true
>
>
> *If more information is required,please let us know.*
>
> *Thanks.*
>
> 2016-01-27 15:20 GMT+08:00 Jone Zhang :
>
>> *I have run a query many times, there will be two results without
>> regular.*
>> *One is 36834699 and other is 18464706.*
>>
>> *The query is *
>> set spark.yarn.queue=soft.high;
>> set hive.execution.engine=spark;
>> select /*+mapjoin(t3,t4,t5)*/
>>   count(1)
>> from
>>   (
>>   select
>> coalesce(t11.qua,t12.qua,t13.qua) qua,
>> coalesce(t11.scene,t12.lanmu_id,t13.lanmu_id) scene,
>> coalesce(t11.app_id,t12.appid,t13.app_id) app_id,
>> expos_pv,
>> expos_uv,
>> dload_pv,
>> dload_uv,
>> dload_cnt,
>> dload_user,
>> evil_dload_cnt,
>> evil_dload_user,
>> update_dcnt,
>> update_duser,
>> hand_suc_incnt,
>> hand_suc_inuser,
>> day_hand_suc_incnt,
>> day_hand_suc_inuser
>>   from
>> (select * from t_ed_soft_assist_useraction_stat where ds=20160126)t11
>> full outer join
>> (select * from t_md_soft_lanmu_app_dload_detail where ds=20160126)t12
>> on t11.qua=t12.qua and t11.app_id=t12.appid and t11.scene=t12.lanmu_id
>> full outer join
>> (select * from t_md_soft_client_install_lanmu  where ds=20160126)t13
>> on t11.qua=t13.qua and t11.app_id=t13.app_id and
>> t11.scene=t13.lanmu_id
>>   )t1
>>   left outer join t_rd_qua t3 on t3.ds=20160126 and t1.qua=t3.qua
>>   left outer join t_rd_soft_appnew_last t4 on t4.ds=20160126 and
>> t1.app_id=t4.app_id
>>   left outer join t_rd_soft_page_conf t5 on t5.ds=20160126 and
>> t1.scene=t5.pageid and t3.client_type_id=t5.ismtt;
>>
>>
>> *Explain query is*
>> STAGE DEPENDENCIES:
>>   Stage-2 is a root stage
>>   Stage-1 depends on stages: Stage-2
>>   Stage-0 depends on stages: Stage-1
>>
>> STAGE PLANS:
>>   Stage: Stage-2
>> Spark
>>   DagName: mqq_20160127151826_e8197f40-18d7-430c-9fc8-993facb74534:2
>>   Vertices:
>> Map 6
>> Map Operator Tree:
>> TableScan
>>   alias: t3
>>   Statistics: Num rows: 1051 Data size: 113569 Basic
>> stats: COMPLETE Column stats: NONE
>>   Spark HashTable Sink Operator
>> keys:
>>   0 _col0 (type: string)
>>   1 qua (type: string)
>> Local Work:
>>   Map Reduce Local Work
>> Map 7
>> Map Operator Tree:
>> TableScan
>>   alias: t4
>>   Statistics: Num rows: 2542751 Data size: 220433659
>> Basic stats: COMPLETE Column stats: NONE
>>   Spark HashTable Sink Operator
>> keys:
>>   0 UDFToDouble(_col2) (type: double)
>>   1 UDFToDouble(app_id) (type: double)
>> Local Work:
>>   Map Reduce Local Work
>> Map 8
>> Map Operator Tree:
>> TableScan
>>   alias: t5
>>   Statistics: Num rows: 143 Data size: 28605 Basic stats:
>> COMPLETE Column stats: NONE
>>   Spark HashTable Sink Operator
>> keys:
>>   0 _col1 (type: string), UDFToDouble(_col20) (type:
>> double)
>>   1 pageid (type: string), UDFToDouble(ismtt) (type:
>> double)
>> Local Work:
>>   Map Reduce Local Work
>>
>>   Stage: Stage-1
>> Spark
>>   Edges:
>> Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 5), Map 4
>> (PARTITION-LEVEL SORT, 5), Map 5 (PARTITION-LEVEL SORT, 5)
>> Reducer 3 <- Reducer 2 (GROUP, 1)
>>   DagName: mqq_20160127151826_e8197f40-18d7-430c-9fc8-993facb74534:1
>>   Vertices:
>> Map 1
>> Map Operator Tree:
>> TableScan
>>   alias: t_ed_soft_assist_useraction_stat
>>   Statistics: Num rows: 16368107 Data size: 651461220
>> Basic stats: COMPLETE Column stats: NONE
>>   Select Operator
>> expressions: qua (type: string), scene (type:
>> string), app_id (type: string)
>> outputColumnNames: _col0, _col1, _col2
>> Statistics: Num rows: 16368107 Data size: 651461220
>> Basic stats: COMPLETE Column stats: NONE
>> Reduce Output Operator
>>   key expressions: 

Re: January Hive User Group Meeting

2016-01-21 Thread Xuefu Zhang
For those who cannot attend in person, here is the webex info:

https://cloudera.webex.com/meet/xzhang
1-650-479-3208  Call-in toll number (US/Canada)
623 810 662 (access code)

Thanks,
Xuefu

On Wed, Jan 20, 2016 at 9:45 AM, Xuefu Zhang <xzh...@cloudera.com> wrote:

> Hi all,
>
> As a reminder, the meeting will be held tomorrow as scheduled. Please
> refer to the meetup page[1] for details. Looking forward to meeting you all!
>
> Thanks,
> Xuefu
>
> [1] http://www.meetup.com/Hive-User-Group-Meeting/events/227463783/
>
> On Wed, Dec 16, 2015 at 3:38 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>
>> Dear Hive users and developers,
>>
>> Hive community is considering a user group meeting[1] January 21, 2016
>> at Cloudera facility in Palo Alto, CA. This will be a great opportunity
>> for vast users and developers to find out what's happening in the
>> community and share each other's experience with Hive. Therefore, I'd urge
>> you to attend the meetup. Please RSVP and the list will be closed a few
>> days ahead of the event.
>>
>> At the same time, I'd like to solicit light talks (15 minutes max) from
>> users and developers. If you have a proposal, please let me or Thejas know.
>> Your participation is greatly appreciated.
>>
>> Sincerely,
>> Xuefu
>>
>> [1] http://www.meetup.com/Hive-User-Group-Meeting/events/227463783/
>>
>
>


Re: January Hive User Group Meeting

2016-01-20 Thread Xuefu Zhang
Hi all,

As a reminder, the meeting will be held tomorrow as scheduled. Please refer
to the meetup page[1] for details. Looking forward to meeting you all!

Thanks,
Xuefu

[1] http://www.meetup.com/Hive-User-Group-Meeting/events/227463783/

On Wed, Dec 16, 2015 at 3:38 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:

> Dear Hive users and developers,
>
> Hive community is considering a user group meeting[1] January 21, 2016 at
> Cloudera facility in Palo Alto, CA. This will be a great opportunity for
> vast users and developers to find out what's happening in the community
> and share each other's experience with Hive. Therefore, I'd urge you to
> attend the meetup. Please RSVP and the list will be closed a few days ahead
> of the event.
>
> At the same time, I'd like to solicit light talks (15 minutes max) from
> users and developers. If you have a proposal, please let me or Thejas know.
> Your participation is greatly appreciated.
>
> Sincerely,
> Xuefu
>
> [1] http://www.meetup.com/Hive-User-Group-Meeting/events/227463783/
>


Re: Hive on Spark task running time is too long

2016-01-11 Thread Xuefu Zhang
You should check the executor log to find out why it failed. There might be
more explanation there.

--Xuefu

On Sun, Jan 10, 2016 at 11:21 PM, Jone Zhang  wrote:

> *I have submited a application many times.*
> *Most of applications running correctly.See attach 1.*
> *But one of the them breaks as expected.See attach 2.1 and 2.2.*
>
> *Why a small data size task running so long, and can't find any helpful
> information in yarn logs.*
>
> *Part of the log information is as follows*
> 16/01/11 12:45:19 INFO storage.BlockManagerMasterEndpoint: Trying to
> remove executor 1 from BlockManagerMaster.
> 16/01/11 12:45:19 INFO storage.BlockManagerMasterEndpoint: Removing block
> manager BlockManagerId(1, 10.226.148.160, 44366)
> 16/01/11 12:45:19 INFO storage.BlockManagerMaster: Removed 1 successfully
> in removeExecutor
> 16/01/11 12:50:32 INFO storage.BlockManagerInfo: Removed
> broadcast_2_piece0 on 10.219.58.123:39594 in memory (size: 92.2 KB, free:
> 441.4 MB)
> 16/01/11 12:55:20 WARN spark.HeartbeatReceiver: Removing executor 2 with
> no recent heartbeats: 604535 ms exceeds timeout 60 ms
> 16/01/11 12:55:20 ERROR cluster.YarnClusterScheduler: Lost an executor 2
> (already removed): Executor heartbeat timed out after 604535 ms
> 16/01/11 12:55:20 WARN spark.HeartbeatReceiver: Removing executor 1 with
> no recent heartbeats: 609228 ms exceeds timeout 60 ms
> 16/01/11 12:55:20 ERROR cluster.YarnClusterScheduler: Lost an executor 1
> (already removed): Executor heartbeat timed out after 609228 ms
> 16/01/11 12:55:20 WARN spark.HeartbeatReceiver: Removing executor 4 with
> no recent heartbeats: 615098 ms exceeds timeout 60 ms
> 16/01/11 12:55:20 ERROR cluster.YarnClusterScheduler: Lost an executor 4
> (already removed): Executor heartbeat timed out after 615098 ms
> 16/01/11 12:55:20 WARN spark.HeartbeatReceiver: Removing executor 3 with
> no recent heartbeats: 616730 ms exceeds timeout 60 ms
> 16/01/11 12:55:20 INFO cluster.YarnClusterSchedulerBackend: Requesting to
> kill executor(s) 2
> 16/01/11 12:55:20 ERROR cluster.YarnClusterScheduler: Lost an executor 3
> (already removed): Executor heartbeat timed out after 616730 ms
> 16/01/11 12:55:20 WARN cluster.YarnClusterSchedulerBackend: Executor to
> kill 2 does not exist!
> 16/01/11 12:55:20 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested
> to kill executor(s) .
> 16/01/11 12:55:20 INFO cluster.YarnClusterSchedulerBackend: Requesting to
> kill executor(s) 1
> 16/01/11 12:55:20 WARN cluster.YarnClusterSchedulerBackend: Executor to
> kill 1 does not exist!
> 16/01/11 12:55:20 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested
> to kill executor(s) .
> 16/01/11 12:55:20 INFO cluster.YarnClusterSchedulerBackend: Requesting to
> kill executor(s) 4
> 16/01/11 12:55:20 WARN cluster.YarnClusterSchedulerBackend: Executor to
> kill 4 does not exist!
> 16/01/11 12:55:20 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested
> to kill executor(s) .
> 16/01/11 12:55:20 INFO cluster.YarnClusterSchedulerBackend: Requesting to
> kill executor(s) 3
> 16/01/11 12:55:20 WARN cluster.YarnClusterSchedulerBackend: Executor to
> kill 3 does not exist!
> 16/01/11 12:55:20 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested
> to kill executor(s) .
> 16/01/11 14:29:55 WARN client.RemoteDriver: Shutting down driver because
> RPC channel was closed.
> 16/01/11 14:29:55 INFO client.RemoteDriver: Shutting down remote driver.
> 16/01/11 14:29:55 INFO scheduler.DAGScheduler: Asked to cancel job 1
> 16/01/11 14:29:55 INFO client.RemoteDriver: Failed to run job
> 2fbbb881-988b-4454-ad9e-a20783aaf38e
> java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:503)
> at
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:371)
> at
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:335)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 16/01/11 14:29:55 INFO cluster.YarnClusterScheduler: Cancelling stage 2
> 16/01/11 14:29:55 INFO cluster.YarnClusterScheduler: Removed TaskSet 2.0,
> whose tasks have all completed, from pool
> 16/01/11 14:29:55 INFO cluster.YarnClusterScheduler: Stage 2 was cancelled
> 16/01/11 14:29:55 INFO scheduler.DAGScheduler: ShuffleMapStage 2
> (mapPartitionsToPair at MapTran.java:31) failed in 6278.824 s
> 16/01/11 14:29:55 INFO handler.ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/metrics/json,null}
> 16/01/11 14:29:55 INFO handler.ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
> 16/01/11 14:29:55 INFO handler.ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/api,null}
> 

Re: How to ensure that the record value of Hive on MapReduce and Hive on Spark are completely consistent?

2016-01-07 Thread Xuefu Zhang
If the numbers of records are in sync, then the chance of any value
disagreement is very low, because Hive on Spark and Hive on MR are basically
running the same byte code. If there were anything wrong specific to Spark,
the disparity would be much bigger than that. I suggest you test your
production queries on the same data set using the two engines and make sure
they agree. Once you have done that, you can be more confident about your
workload.
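
One lightweight way to compare values rather than just counts is to run the same checksum-style aggregate under both engines and diff the two result rows; a sketch, with t and its columns standing in for your real tables:

set hive.execution.engine=mr;
select count(*), sum(hash(c1, c2, c3)) from t;  -- hypothetical table and columns

set hive.execution.engine=spark;
select count(*), sum(hash(c1, c2, c3)) from t;  -- should match the run above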

Thanks,
Xuefu

On Thu, Jan 7, 2016 at 7:37 PM, Jone Zhang  wrote:

>
>
> 2016-01-08 11:37 GMT+08:00 Jone Zhang :
>
>> We made a comparison of the number of records between Hive on MapReduce
>> and Hive on Spark.And they are in good agreement.
>> But how to ensure that the record values of Hive on MapReduce and Hive on
>> Spark are completely consistent?
>> Do you have any suggestions?
>>
>> Best wishes.
>> Thanks.
>>
>
>


Re: It seems that result of Hive on Spark is mistake And result of Hive and Hive on Spark are not the same

2015-12-22 Thread Xuefu Zhang
It seems that the plan isn't quite right, possibly due to union all
optimization in Spark. Could you create a JIRA for this?

CC Chengxiang as he might have some insight.

Thanks,
Xuefu

On Tue, Dec 22, 2015 at 3:39 AM, Jone Zhang  wrote:

> Hive 1.2.1 on Spark1.4.1
>
> 2015-12-22 19:31 GMT+08:00 Jone Zhang :
>
>> *select  * from staff;*
>> 1 jone 22 1
>> 2 lucy 21 1
>> 3 hmm 22 2
>> 4 james 24 3
>> 5 xiaoliu 23 3
>>
>> *select id,date_ from trade union all select id,"test" from trade ;*
>> 1 201510210908
>> 2 201509080234
>> 2 201509080235
>> 1 test
>> 2 test
>> 2 test
>>
>> *set hive.execution.engine=spark;*
>> *set spark.master=local;*
>> *select /*+mapjoin(t)*/ * from staff s join *
>> *(select id,date_ from trade union all select id,"test" from trade ) t on
>> s.id =t.id ;*
>> 1 jone 22 1 1 201510210908
>> 2 lucy 21 1 2 201509080234
>> 2 lucy 21 1 2 201509080235
>>
>> *set hive.execution.engine=mr;*
>> *select /*+mapjoin(t)*/ * from staff s join *
>> *(select id,date_ from trade union all select id,"test" from trade ) t on
>> s.id =t.id ;*
>> FAILED: SemanticException [Error 10227]: Not all clauses are supported
>> with mapjoin hint. Please remove mapjoin hint.
>>
>> *I have two questions*
>> *1.Why result of hive on spark not include the following record?*
>> 1 jone 22 1 1 test
>> 2 lucy 21 1 2 test
>> 2 lucy 21 1 2 test
>>
>> *2.Why there are two different ways of dealing same query?*
>>
>>
>> *explain 1:*
>> *set hive.execution.engine=spark;*
>> *set spark.master=local;*
>> *explain *
>> *select id,date_ from trade union all select id,"test" from trade;*
>> OK
>> STAGE DEPENDENCIES:
>>   Stage-1 is a root stage
>>   Stage-0 depends on stages: Stage-1
>>
>> STAGE PLANS:
>>   Stage: Stage-1
>> Spark
>>   DagName:
>> jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1
>>   Vertices:
>> Map 1
>> Map Operator Tree:
>> TableScan
>>   alias: trade
>>   Statistics: Num rows: 6 Data size: 48 Basic stats:
>> COMPLETE Column stats: NONE
>>   Select Operator
>> expressions: id (type: int), date_ (type: string)
>> outputColumnNames: _col0, _col1
>> Statistics: Num rows: 6 Data size: 48 Basic stats:
>> COMPLETE Column stats: NONE
>> File Output Operator
>>   compressed: false
>>   Statistics: Num rows: 12 Data size: 96 Basic stats:
>> COMPLETE Column stats: NONE
>>   table:
>>   input format:
>> org.apache.hadoop.mapred.TextInputFormat
>>   output format:
>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>   serde:
>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>> Map 2
>> Map Operator Tree:
>> TableScan
>>   alias: trade
>>   Statistics: Num rows: 6 Data size: 48 Basic stats:
>> COMPLETE Column stats: NONE
>>   Select Operator
>> expressions: id (type: int), 'test' (type: string)
>> outputColumnNames: _col0, _col1
>> Statistics: Num rows: 6 Data size: 48 Basic stats:
>> COMPLETE Column stats: NONE
>> File Output Operator
>>   compressed: false
>>   Statistics: Num rows: 12 Data size: 96 Basic stats:
>> COMPLETE Column stats: NONE
>>   table:
>>   input format:
>> org.apache.hadoop.mapred.TextInputFormat
>>   output format:
>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>   serde:
>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>
>>   Stage: Stage-0
>> Fetch Operator
>>   limit: -1
>>   Processor Tree:
>> ListSink
>>
>>
>> *explain 2:*
>> *set hive.execution.engine=spark;*
>> *set spark.master=local;*
>> *explain *
>> *select /*+mapjoin(t)*/ * from staff s join *
>> *(select id,date_ from trade union all select id,"209" from trade
>> ) t on s.id =t.id ;*
>> OK
>> STAGE DEPENDENCIES:
>>   Stage-2 is a root stage
>>   Stage-1 depends on stages: Stage-2
>>   Stage-0 depends on stages: Stage-1
>>
>> STAGE PLANS:
>>   Stage: Stage-2
>> Spark
>>   DagName:
>> jonezhang_20151222191716_be7eac84-b5b6-4478-b88f-9f59e2b1b1a8:3
>>   Vertices:
>> Map 1
>> Map Operator Tree:
>> TableScan
>>   alias: trade
>>   Statistics: Num rows: 6 Data size: 48 Basic stats:
>> COMPLETE Column stats: NONE
>>   Filter Operator
>> predicate: id is not null (type: boolean)
>> Statistics: Num rows: 3 Data size: 

Re: Hive on Spark throw java.lang.NullPointerException

2015-12-18 Thread Xuefu Zhang
Could you create a JIRA with a repro case?

Thanks,
Xuefu

On Thu, Dec 17, 2015 at 9:21 PM, Jone Zhang  wrote:

> *My query is *
> set hive.execution.engine=spark;
> select
>
> t3.pcid,channel,version,ip,hour,app_id,app_name,app_apk,app_version,app_type,dwl_tool,dwl_status,err_type,dwl_store,dwl_maxspeed,dwl_minspeed,dwl_avgspeed,last_time,dwl_num,
> (case when t4.cnt is null then 0 else 1 end) as is_evil
> from
> (select /*+mapjoin(t2)*/
> pcid,channel,version,ip,hour,
> (case when t2.app_id is null then t1.app_id else t2.app_id end) as app_id,
> t2.name as app_name,
> app_apk,
>
> app_version,app_type,dwl_tool,dwl_status,err_type,dwl_store,dwl_maxspeed,dwl_minspeed,dwl_avgspeed,last_time,dwl_num
> from
> t_ed_soft_downloadlog_molo t1 left outer join t_rd_soft_app_pkg_name t2 on
> (lower(t1.app_apk) = lower(t2.package_id) and t1.ds = 20151217 and t2.ds =
> 20151217)
> where
> t1.ds = 20151217) t3
> left outer join
> (
> select pcid,count(1) cnt  from t_ed_soft_evillog_molo where ds=20151217
>  group by pcid
> ) t4
> on t3.pcid=t4.pcid;
>
>
> *And the error log is *
> 2015-12-18 08:10:18,685 INFO  [main]: spark.SparkMapJoinOptimizer
> (SparkMapJoinOptimizer.java:process(79)) - Check if it can be converted to
> map join
> 2015-12-18 08:10:18,686 ERROR [main]: ql.Driver
> (SessionState.java:printError(966)) - FAILED: NullPointerException null
> java.lang.NullPointerException
> at
> org.apache.hadoop.hive.ql.optimizer.spark.SparkMapJoinOptimizer.getConnectedParentMapJoinSize(SparkMapJoinOptimizer.java:312)
> at
> org.apache.hadoop.hive.ql.optimizer.spark.SparkMapJoinOptimizer.getConnectedMapJoinSize(SparkMapJoinOptimizer.java:292)
> at
> org.apache.hadoop.hive.ql.optimizer.spark.SparkMapJoinOptimizer.getMapJoinConversionInfo(SparkMapJoinOptimizer.java:271)
> at
> org.apache.hadoop.hive.ql.optimizer.spark.SparkMapJoinOptimizer.process(SparkMapJoinOptimizer.java:80)
> at
> org.apache.hadoop.hive.ql.optimizer.spark.SparkJoinOptimizer.process(SparkJoinOptimizer.java:58)
> at
> org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:92)
> at
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:97)
> at
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:81)
> at
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:135)
> at
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:112)
> at
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeOperatorPlan(SparkCompiler.java:128)
> at
> org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:102)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10238)
> at
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:210)
> at
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:233)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:425)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
> at
> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1123)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1171)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1060)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1050)
> at
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:208)
> at
> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:160)
> at
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:447)
> at
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:357)
> at
> org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:795)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:767)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:704)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
>
>
> *Some properties on hive-site.xml is *
> 
>hive.ignore.mapjoin.hint
>false
> 
> 
> hive.auto.convert.join
> true
> 
> 
>hive.auto.convert.join.noconditionaltask
>true
> 
>
>
> *The error relevant code is *
> long mjSize = ctx.getMjOpSizes().get(op);
> *I think it should be checked whether or not * ctx.getMjOpSizes().get(op) *is
> null.*
>
> *Of course, more strict logic need to you to decide.*
>
>
> *Thanks.*
> *Best Wishes.*
>


Re: Hive on Spark - Error: Child process exited before connecting back

2015-12-17 Thread Xuefu Zhang
These missing classes are in the Hadoop jar. If you have HADOOP_HOME set,
they should be on Hive's classpath.

--Xuefu

On Thu, Dec 17, 2015 at 10:12 AM, Ophir Etzion <op...@foursquare.com> wrote:

> it seems like the problem is that the spark client needs FSDataInputStream
> but is not included in the hive-exec-1.1.0-cdh5.4.3.jar that is passed in
> the class path.
> I need to look more in spark-submit / org.apache.spark.deploy to see if
> there is a way to include more jars.
>
>
> 2015-12-17 17:34:01,679 INFO org.apache.hive.spark.client.SparkClientImpl:
> Running client driver with argv:
> /export/hdb3/data/cloudera/parcels/CDH-5.4.3-1.cdh5.4.3.p0.6/lib/spark/bin/spark-submit
> --executor-cores 1 --executor-memory 268435456 --proxy-user anonymous
> --properties-file /tmp/spark-submit.1508744664719491459.properties --class
> org.apache.hive.spark.client.RemoteDriver
> /export/hdb3/data/cloudera/parcels/CDH-5.4.3-1.cdh5.4.3.p0.6/jars/hive-exec-1.1.0-cdh5.4.3.jar
> --remote-host ezaq6.prod.foursquare.com --remote-port 44306 --conf
> hive.spark.client.connect.timeout=1000 --conf
> hive.spark.client.server.connect.timeout=9 --conf
> hive.spark.client.channel.log.level=null --conf
> hive.spark.client.rpc.max.size=52428800 --conf
> hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl:
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/hadoop/fs/FSDataInputStream
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
> org.apache.spark.deploy.SparkSubmitDriverBootstrapper$.main(SparkSubmitDriverBootstrapper.scala:71)
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
> org.apache.spark.deploy.SparkSubmitDriverBootstrapper.main(SparkSubmitDriverBootstrapper.scala)
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl:
> Caused by: java.lang.ClassNotFoundException:
> org.apache.hadoop.fs.FSDataInputStream
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
> java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
> java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
> java.security.AccessController.doPrivileged(Native Method)
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
> java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
> java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
> java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: ...
> 2 more
> 2015-12-17 17:34:02,438 WARN org.apache.hive.spark.client.SparkClientImpl:
> Child process exited with code 1.
>
> On Tue, Dec 15, 2015 at 11:15 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>
>> As to the spark versions that are supported. Spark has made
>> non-compatible API changes in 1.5, and that's the reason why Hive 1.1.0
>> doesn't work with Spark 1.5. However, the latest Hive in master or branch-1
>> should work with spark 1.5.
>>
>> Also, later CDH 5.4.x versions have already supported Spark 1.5. CDH 5.7,
>> which is coming so, will support Spark 1.6.
>>
>> --Xuefu
>>
>> On Tue, Dec 15, 2015 at 3:50 PM, Mich Talebzadeh <m...@peridale.co.uk>
>> wrote:
>>
>>> To answer your point:
>>>
>>>
>>>
>>> “why would spark 1.5.2 specifically would not work with hive?”
>>>
>>>
>>>
>>> Because I tried Spark 1.5.2 and it did not work and unfortunately the
>>> only version seem to work (albeit requires messaging around) is version
>>> 1.3.1 of Spark.
>>>
>>>
>>>
>>> Look at the threads on “Managed to make Hive run on Spark engine” in
>>> user@hive.apache.org
>>>
>>>
>>>
>>>
>>>
>>> HTH,
>>>
>>>
>>>
>>>
>>>
>>> Mich Talebzadeh
>>>
>>>
>>>
>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>
>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>
>>>
>>> http://login.sybase.co

Re: making session setting "set spark.master=yarn-client" for Hive on Spark

2015-12-16 Thread Xuefu Zhang
Mich,

By switching the value of spark.master, you're basically asking Hive to
use your YARN cluster rather than your Spark standalone cluster. Both modes
are supported, in addition to local, local-cluster, and yarn-cluster;
yarn-cluster is the recommended mode.
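
In session terms the switch is just the spark.master value; a sketch (the remaining settings from the quoted configuration below stay the same):

set hive.execution.engine=spark;
set spark.master=yarn-cluster;  -- recommended mode; yarn-client, as used below, also works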

Thanks,
Xuefu

On Wed, Dec 16, 2015 at 1:39 PM, Mich Talebzadeh 
wrote:

> Hi,
>
>
>
> My environment:
>
>
>
> Hadoop 2.6.0
>
> Hive 1.2.1
>
> spark-1.3.1-bin-hadoop2.6 (downloaded from prebuild 
> spark-1.3.1-bin-hadoop2.6.gz
>
>
> The Jar file used in $HIVE_HOME/lib to link Hive to spark was à
> spark-assembly-1.3.1-hadoop2.4.0.jar
>
>(built from the source downloaded as zipped file spark-1.3.1.gz and
> built with command line make-distribution.sh --name
> "hadoop2-without-hive" --tgz
> "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"
>
>
>
> I try to use Hive on Spark.
>
>
>
> Before I had:
>
>
>
> set spark.home=/usr/lib/spark-1.3.1-bin-hadoop2.6;
>
> set hive.execution.engine=spark;
>
> set spark.master=spark://50.140.197.217:7077;
>
> set spark.eventLog.enabled=true;
>
> set spark.eventLog.dir=/usr/lib/spark-1.3.1-bin-hadoop2.6/logs;
>
> set spark.executor.memory=512m;
>
> set spark.executor.cores=2;
>
> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
>
> set hive.spark.client.server.connect.timeout=22ms;
>
> set spark.io.compression.codec=org.apache.spark.io.LZFCompressionCodec;
>
> set spark.SPARK_PID_DIR=/work/hadoop/tmp/spark;
>
>
>
>
>
> And It sporadically worked
>
>
>
>
>
> Today I changed spark.master to
>
>
>
> set spark.master=yarn-client;
>
>
>
> and it works fine without any intermittent connectivity issue. The Haddop
> application UI shows the job as “Hive on Spark”  and the application type
> as SPARK as well.
>
>
>
>
>
> What are the implications of this please?
>
>
>
>
>
>
>
> Thanks
>
>
>
>
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>


Re: January Hive User Group Meeting

2015-12-16 Thread Xuefu Zhang
Yeah, I can try to set up a WebEx for this. However, I'd encourage folks to
attend in person to get the full live experience, especially those from the
local Bay Area.

Thanks,
Xuefu

On Wed, Dec 16, 2015 at 3:42 PM, Mich Talebzadeh <m...@peridale.co.uk>
wrote:

> Thanks for heads up.
>
>
>
> Will it be possible to remote to this meetings for live sessions?
>
>
>
> Regards,
>
>
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com]
> *Sent:* 16 December 2015 23:39
> *To:* user@hive.apache.org; d...@hive.apache.org; Thejas M Nair <
> thejas.n...@yahoo.com>
> *Subject:* January Hive User Group Meeting
>
>
>
> Dear Hive users and developers,
>
> Hive community is considering a user group meeting[1] January 21, 2016 at
> Cloudera facility in Palo Alto, CA. This will be a great opportunity for
> vast users and developers to find out what's happening in the community and
> share each other's experience with Hive. Therefore, I'd urge you to attend
> the meetup. Please RSVP and the list will be closed a few days ahead of the
> event.
>
> At the same time, I'd like to solicit light talks (15 minutes max) from
> users and developers. If you have a proposal, please let me or Thejas know.
> Your participation is greatly appreciated.
>
> Sincerely,
>
> Xuefu
>
> [1] http://www.meetup.com/Hive-User-Group-Meeting/events/227463783/
>


Re: Pros and cons -Saving spark data in hive

2015-12-15 Thread Xuefu Zhang
You might want to consider Hive on Spark, where you work directly with
Hive and your query execution is powered by Spark as the engine.
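
A sketch of that route, assuming the CSV files already sit in HDFS (the path and columns below are hypothetical):

set hive.execution.engine=spark;

create external table staging_csv (id int, name string, amount double)
row format delimited fields terminated by ','
location '/data/input_csv';  -- hypothetical HDFS path holding the CSV files

create table final_orc (id int, name string, amount double)
stored as orc;

insert overwrite table final_orc
select id, name, amount from staging_csv;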

--Xuefu

On Tue, Dec 15, 2015 at 6:04 PM, Divya Gehlot 
wrote:

> Hi,
> I am new bee to Spark and  I am exploring option and pros and cons which
> one will work best in spark and hive context.My  dataset  inputs are CSV
> files, using spark to process the my data and saving it in hive using
> hivecontext
>
> 1) Process the CSV file using spark-csv package and create temptable and
> store the data in hive using hive context.
> 2) Process the file as normal text file in sqlcontext  ,register its as
> temptable in sqlcontext and store it as ORC file and read that ORC file in
> hive context and store it in hive.
>
> Is there any other best options apart from mentioned above.
> Would really appreciate the inputs.
> Thanks in advance.
>
> Thanks,
> Regards,
> Divya
>


Re: Hive on Spark - Error: Child process exited before connecting back

2015-12-15 Thread Xuefu Zhang
Ophir,

Can you provide your hive.log here? Also, have you checked your spark
application log?

When this happens, it usually means that Hive is not able to launch a
Spark application. In the case of Spark on YARN, this application is the
application master. If Hive fails to launch it, or the application master
fails before it can connect back, you would see such error messages. To get
more information, you should check the Spark application log.

--Xuefu

On Tue, Dec 15, 2015 at 2:26 PM, Ophir Etzion  wrote:

> Hi,
>
> when trying to do Hive on Spark on CDH5.4.3 I get the following error when
> trying to run a simple query using spark.
>
> I've tried setting everything written here (
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started)
> as well as what the cdh recommends.
>
> any one encountered this as well? (searching for it didn't help much)
>
> the error:
>
> ERROR : Failed to execute spark task, with exception
> 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark
> client.)'
> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
> client.
> at
> org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:57)
> at
> org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:114)
> at
> org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:120)
> at
> org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:97)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
> at
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1640)
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1399)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1183)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1044)
> at
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:144)
> at
> org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:69)
> at
> org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:196)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> at
> org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:208)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException:
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: Cancel
> client '2b2d7314-e0cc-4933-82a1-992a3299d109'. Error: Child process exited
> before connecting back
> at com.google.common.base.Throwables.propagate(Throwables.java:156)
> at
> org.apache.hive.spark.client.SparkClientImpl.(SparkClientImpl.java:109)
> at
> org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
> at
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.(RemoteHiveSparkClient.java:91)
> at
> org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:65)
> at
> org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:55)
> ... 22 more
> Caused by: java.util.concurrent.ExecutionException:
> java.lang.RuntimeException: Cancel client
> '2b2d7314-e0cc-4933-82a1-992a3299d109'. Error: Child process exited before
> connecting back
> at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
> at
> org.apache.hive.spark.client.SparkClientImpl.(SparkClientImpl.java:99)
> ... 26 more
> Caused by: java.lang.RuntimeException: Cancel client
> '2b2d7314-e0cc-4933-82a1-992a3299d109'. Error: Child process exited before
> connecting back
> at
> org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcServer.java:179)
> at
> org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClientImpl.java:427)
> ... 1 more
>
> ERROR : Failed to execute spark task, with exception
> 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark
> client.)'
> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
> client.
> at
> org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:57)
> at
> org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:114)
> at
> org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:120)
> at
> 

Re: Hive on Spark application will be submited more times when the queue resources is not enough.

2015-12-09 Thread Xuefu Zhang
Hi Jone,

Thanks for reporting the problem. When you say there is not enough resource,
do you mean that you cannot launch Yarn application masters?

I feel that we should error out right away if the application cannot be
submitted. Any attempt to resubmit seems problematic. I'm not sure if
there is such control over this, but I think that's a good direction to
look at. I will check with our spark expert on this.

Thanks,
Xuefu

On Wed, Dec 9, 2015 at 8:48 PM, Jone Zhang  wrote:

> *It seems that the number of submissions depends on the stages of the query.*
> *This query includes three stages.*
>
> If queue resources are still *not enough after submitting three applications*, the Hive
> client will close.
> *"**Failed to execute spark task, with exception
> 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark
> client.)'*
> *FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.spark.SparkTask**"*
> *And at this time, the port (e.g. **34682**) opened in the Hive client (e.g.
> **10.179.12.140**) to **communicate with the RSC** will be lost.*
>
> *The queue resources become free **after a while**, and the AMs of the three
> applications then fail fast because of "**15/12/10 12:28:43 INFO
> client.RemoteDriver: Connecting to:
> 10.179.12.140:34682...java.net.ConnectException: Connection refused:
> /10.179.12.140:34682 **"*
>
> *So the application will fail if the queue resources are not enough at the
> point the query is submitted, even if the resources become free after a
> while.*
> *Do you have more idea about this question?*
>
> *Attch the query*
> set hive.execution.engine=spark;
> set spark.yarn.queue=tms;
> set spark.app.name=t_ad_tms_heartbeat_ok_3;
> insert overwrite table t_ad_tms_heartbeat_ok partition(ds=20151208)
> SELECT
> NVL(a.qimei, b.qimei) AS qimei,
> NVL(b.first_ip,a.user_ip) AS first_ip,
> NVL(a.user_ip, b.last_ip) AS last_ip,
> NVL(b.first_date, a.ds) AS first_date,
> NVL(a.ds, b.last_date) AS last_date,
> NVL(b.first_chid, a.chid) AS first_chid,
> NVL(a.chid, b.last_chid) AS last_chid,
> NVL(b.first_lc, a.lc) AS first_lc,
> NVL(a.lc, b.last_lc) AS last_lc,
> NVL(a.guid, b.guid) AS guid,
> NVL(a.sn, b.sn) AS sn,
> NVL(a.vn, b.vn) AS vn,
> NVL(a.vc, b.vc) AS vc,
> NVL(a.mo, b.mo) AS mo,
> NVL(a.rl, b.rl) AS rl,
> NVL(a.os, b.os) AS os,
> NVL(a.rv, b.rv) AS rv,
> NVL(a.qv, b.qv) AS qv,
> NVL(a.imei, b.imei) AS imei,
> NVL(a.romid, b.romid) AS romid,
> NVL(a.bn, b.bn) AS bn,
> NVL(a.account_type, b.account_type) AS
> account_type,
> NVL(a.account, b.account) AS account
> FROM
> (SELECT
> ds,user_ip,guid,sn,vn,vc,mo,rl,chid,lcid,os,rv,qv,imei,qimei,lc,romid,bn,account_type,account
> FROM    t_od_tms_heartbeat_ok
> WHERE   ds = 20151208) a
> FULL OUTER JOIN
> (SELECT
> qimei,first_ip,last_ip,first_date,last_date,first_chid,last_chid,first_lc,last_lc,guid,sn,vn,vc,mo,rl,os,rv,qv,imei,romid,bn,account_type,account
> FROM    t_ad_tms_heartbeat_ok
> WHERE   last_date > 20150611
> AND ds = 20151207) b
> ON   a.qimei=b.qimei;
>
> *Thanks.*
> *Best wishes.*
>
> 2015-12-09 19:51 GMT+08:00 Jone Zhang :
>
>> But in some cases all of the applications will fail, which is caused
>>> by SparkContext not initializing after waiting for 15 ms.
>>> See attachment (hive.spark.client.server.connect.timeout is set to 5min).
>>
>>
>> *The error log is different from the original mail*
>>
>> Container: container_1448873753366_113453_01_01 on 10.247.169.134_8041
>>
>> 
>> LogType: stderr
>> LogLength: 3302
>> Log Contents:
>> Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled
>> in the future
>> Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled
>> in the future
>> 15/12/09 02:11:48 INFO yarn.ApplicationMaster: Registered signal handlers
>> for [TERM, HUP, INT]
>> 15/12/09 02:11:48 INFO yarn.ApplicationMaster: ApplicationAttemptId:
>> appattempt_1448873753366_113453_01
>> 15/12/09 02:11:49 INFO spark.SecurityManager: Changing view acls to: mqq
>> 15/12/09 02:11:49 INFO spark.SecurityManager: Changing modify acls to: mqq
>> 15/12/09 02:11:49 INFO spark.SecurityManager: SecurityManager:
>> authentication disabled; ui acls disabled; users with view permissions:
>> Set(mqq); users with modify permissions: Set(mqq)
>> 15/12/09 02:11:49 INFO yarn.ApplicationMaster: Starting the user
>> application in a separate Thread
>> 15/12/09 02:11:49 INFO 

Re: Getting error when trying to start master node after building spark 1.3

2015-12-06 Thread Xuefu Zhang
That basically says that snappy isn't working properly on your box. You can
forget about that for now by running:

set spark.io.compression.codec=org.apache.spark.io.LZFCompressionCodec;


On Sat, Dec 5, 2015 at 1:45 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

> Great stuff.
>
>
>
> Built the source code for 1.3.1 and generated
> spark-assembly-1.3.1-hadoop2.4.0.jar
>
>
>
> jar tvf spark-assembly-1.3.1-hadoop2.4.0.jar|grep hive | grep -i -v Archive
>
>
>
> so no hive there
>
>
>
> Downloaded prebuilt spark 1.3.1 and started master and slave OK
>
>
>
> Started hive as usual in debug mode  and did a simple select count(1) from
> t;
>
>
>
> Spark app started OK
>
>
>
> hduser@rhes564::/usr/lib/spark-1.3.1-bin-hadoop2.6/logs>
>
>
>
> -rw-r--r-- 1 hduser hadoop  31562 Dec  5 21:18
> spark-hduser-org.apache.spark.deploy.master.Master-1-rhes564.out
>
> -rw-r--r-- 1 hduser hadoop  19684 Dec  5 21:18
> spark-hduser-org.apache.spark.deploy.worker.Worker-1-rhes564.out
>
> -rwxrwx--- 1 hduser hadoop  60491 Dec  5 21:18
> app-20151205211814-0005.inprogress
>
>
>
> Now I get some library error
>
>
>
> 5/12/05 21:18:16 [stderr-redir-1]: INFO client.SparkClientImpl: Caused by:
> java.lang.UnsatisfiedLinkError: /tmp/snappy-1.0.5-libsnappyjava.so:
> /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found (required by
> /tmp/snappy-1.0.5-libsnappyjava.so)
>
>
>
>
>
> strings /usr/lib/libstdc++.so.6 | grep GLIBC
>
> GLIBCXX_3.4
>
> GLIBCXX_3.4.1
>
> GLIBCXX_3.4.2
>
> GLIBCXX_3.4.3
>
> GLIBCXX_3.4.4
>
> GLIBCXX_3.4.5
>
> GLIBCXX_3.4.6
>
> GLIBCXX_3.4.7
>
> GLIBCXX_3.4.8
>
> GLIBC_2.3
>
> GLIBC_2.0
>
> GLIBC_2.3.2
>
> GLIBC_2.4
>
> GLIBC_2.1
>
> GLIBC_2.1.3
>
> GLIBC_2.2
>
> GLIBCXX_FORCE_NEW
>
>
>
> Looking into sorting this out.
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com]
> *Sent:* 04 December 2015 17:47
> *To:* user@hive.apache.org
> *Subject:* Re: FW: Getting error when trying to start master node after
> building spark 1.3
>
>
>
> 1.3.1 is what officially supported by Hive 1.2.1. 1.3.0 might be okay too.
>
>
>
> On Fri, Dec 4, 2015 at 9:34 AM, Mich Talebzadeh <m...@peridale.co.uk>
> wrote:
>
> Appreciated the response. Just to clarify the build will be spark 1.3 and
> the pre-build download will be 1.3. this is the version I am attempting to
> make it work.
>
>
>
> Thanks
>
>
>
> Mich
>
>
>
> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com]
> *Sent:* 04 December 2015 17:03
> *To:* user@hive.apache.org
> *Subject:* Re: FW: Getting error when trying to start master node after
> building spark 1.3
>
>
>
> My last attempt:
>
> 1. Make sure the spark-assembly.jar from your own build doesn't contain
> hive classes, using "jar -tf spark-assembly.jar | grep hive" command. Copy
> it to Hive's /lib directory. After this, you can forget everything about
> this build.
>
> 2. Download prebuilt tarball from Spark download site and deploy it.
> Forget about Hive for a moment. Make sure the cluster comes up and
> functions.
>
> 3. Unset environment variable SPARK_HOME before you start Hive. Start
> Hive, and run "set spark.home=/path/to/spark/dir" command. Then run other
> commands as you did previously when trying hive on spark.
>
>
>
>
>
>
>


Re: FW: Getting error when trying to start master node after building spark 1.3

2015-12-04 Thread Xuefu Zhang
1.3.1 is what is officially supported by Hive 1.2.1. 1.3.0 might be okay too.

On Fri, Dec 4, 2015 at 9:34 AM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

> Appreciated the response. Just to clarify the build will be spark 1.3 and
> the pre-build download will be 1.3. this is the version I am attempting to
> make it work.
>
>
>
> Thanks
>
>
>
> Mich
>
>
>
> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com]
> *Sent:* 04 December 2015 17:03
> *To:* user@hive.apache.org
> *Subject:* Re: FW: Getting error when trying to start master node after
> building spark 1.3
>
>
>
> My last attempt:
>
> 1. Make sure the spark-assembly.jar from your own build doesn't contain
> hive classes, using "jar -tf spark-assembly.jar | grep hive" command. Copy
> it to Hive's /lib directory. After this, you can forget everything about
> this build.
>
> 2. Download prebuilt tarball from Spark download site and deploy it.
> Forget about Hive for a moment. Make sure the cluster comes up and
> functions.
>
> 3. Unset environment variable SPARK_HOME before you start Hive. Start
> Hive, and run "set spark.home=/path/to/spark/dir" command. Then run other
> commands as you did previously when trying hive on spark.
>
>
>
>
>
> On Fri, Dec 4, 2015 at 3:05 AM, Mich Talebzadeh <m...@peridale.co.uk>
> wrote:
>
> I sent this one to Spark user group but no response
>
>
>
>
>
> Hi,
>
>
>
>
>
> I am trying to make Hive work with Spark.
>
>
>
> I have been told that I need to use Spark 1.3 and build it from source
> code WITHOUT HIVE libraries.
>
>
>
> I have built it as follows:
>
>
>
> ./make-distribution.sh --name "hadoop2-without-hive" --tgz
> "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"
>
>
>
>
>
> Now the issue I have that I cannot start master node which I think I need
> it to make it work with Hive on Spark!
>
>
>
> When I try
>
>
>
> hduser@rhes564::/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin>
> ./start-master.sh
>
> starting org.apache.spark.deploy.master.Master, logging to
> /usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../logs/spark-hduser-org.apache.spark.deploy.master.Master-1-rhes564.out
>
> failed to launch org.apache.spark.deploy.master.Master:
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> ... 6 more
>
> full log in
> /usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../logs/spark-hduser-org.apache.spark.deploy.master.Master-1-rhes564.out
>
>
>
> I get
>
>
>
> Spark Command: /usr/java/latest/bin/java -cp
> :/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../conf:/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/lib/spark-assembly-1.3.0-hadoop2.4.0.jar:/home/hduser/hadoop-2.6.0/etc/hadoop
> -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
> org.apache.spark.deploy.master.Master --ip 50.140.197.217 --port 7077
> --webui-port 8080
>
> 
>
>
>
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
>
> at java.lang.Class.getDeclaredMethods0(Native Method)
>
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2521)
>
> at java.lang.Class.getMethod0(Class.java:2764)
>
> at java.lang.Class.getMethod(Class.java:1653)
>
> at
> sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
>
> at
> sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
>
> Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> ... 6 more
>
>
>
> Any advice will be appreciated.
>
>
>
> Thanks,
>
>
>
> Mich
>
>
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
>
>


Re: FW: Getting error when trying to start master node after building spark 1.3

2015-12-04 Thread Xuefu Zhang
My last attempt:

1. Make sure the spark-assembly.jar from your own build doesn't contain
hive classes, using "jar -tf spark-assembly.jar | grep hive" command. Copy
it to Hive's /lib directory. After this, you can forget everything about
this build.

2. Download prebuilt tarball from Spark download site and deploy it. Forget
about Hive for a moment. Make sure the cluster comes up and functions.

3. Unset environment variable SPARK_HOME before you start Hive. Start Hive,
and run "set spark.home=/path/to/spark/dir" command. Then run other
commands as you did previously when trying hive on spark.
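
For steps 1 and 3, a minimal sketch as shell commands (the jar name is the one
from your "without hive" build, and the /usr/lib/spark-1.3.0-bin-hadoop2.4 and
/usr/lib/hive paths are placeholders for your own layout; table t is just your
earlier test table):

# Step 1: confirm the custom-built assembly has no Hive classes, then hand it to Hive.
jar -tf spark-assembly-1.3.0-hadoop2.4.0.jar | grep hive      # expect no output
cp spark-assembly-1.3.0-hadoop2.4.0.jar /usr/lib/hive/lib/

# Step 3: keep SPARK_HOME out of the environment and point Hive at the prebuilt install.
unset SPARK_HOME
hive -e "set spark.home=/usr/lib/spark-1.3.0-bin-hadoop2.4;
         set hive.execution.engine=spark;
         select count(1) from t;"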


On Fri, Dec 4, 2015 at 3:05 AM, Mich Talebzadeh  wrote:

> I sent this one to Spark user group but no response
>
>
>
>
>
> Hi,
>
>
>
>
>
> I am trying to make Hive work with Spark.
>
>
>
> I have been told that I need to use Spark 1.3 and build it from source
> code WITHOUT HIVE libraries.
>
>
>
> I have built it as follows:
>
>
>
> ./make-distribution.sh --name "hadoop2-without-hive" --tgz
> "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"
>
>
>
>
>
> Now the issue I have that I cannot start master node which I think I need
> it to make it work with Hive on Spark!
>
>
>
> When I try
>
>
>
> hduser@rhes564::/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin>
> ./start-master.sh
>
> starting org.apache.spark.deploy.master.Master, logging to
> /usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../logs/spark-hduser-org.apache.spark.deploy.master.Master-1-rhes564.out
>
> failed to launch org.apache.spark.deploy.master.Master:
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> ... 6 more
>
> full log in
> /usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../logs/spark-hduser-org.apache.spark.deploy.master.Master-1-rhes564.out
>
>
>
> I get
>
>
>
> Spark Command: /usr/java/latest/bin/java -cp
> :/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../conf:/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/lib/spark-assembly-1.3.0-hadoop2.4.0.jar:/home/hduser/hadoop-2.6.0/etc/hadoop
> -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
> org.apache.spark.deploy.master.Master --ip 50.140.197.217 --port 7077
> --webui-port 8080
>
> 
>
>
>
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
>
> at java.lang.Class.getDeclaredMethods0(Native Method)
>
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2521)
>
> at java.lang.Class.getMethod0(Class.java:2764)
>
> at java.lang.Class.getMethod(Class.java:1653)
>
> at
> sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
>
> at
> sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
>
> Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> ... 6 more
>
>
>
> Any advice will be appreciated.
>
>
>
> Thanks,
>
>
>
> Mich
>
>
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>


Re: Quick Question

2015-12-04 Thread Xuefu Zhang
Create a table with the file and query the table. Parquet is fully
supported in Hive.
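
For example, a minimal sketch of that (the table name, columns, and HDFS path
below are made up for illustration; match them to your file's actual schema):

-- Point an external table at the directory holding the parquet file(s),
-- then query it like any other Hive table.
CREATE EXTERNAL TABLE parquet_demo (
  id   BIGINT,
  name STRING
)
STORED AS PARQUET
LOCATION '/user/hive/demo/parquet_files';

SELECT count(*) FROM parquet_demo;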

--Xuefu

On Fri, Dec 4, 2015 at 10:58 AM, Siva Kanth Sattiraju (ssattira) <
ssatt...@cisco.com> wrote:

> Hi All,
>
> Is there a way to read “parquet” file through Hive?
>
> Regards,
> Siva
>
>


Re: Why there are two different stages on the same query when i use hive on spark.

2015-12-04 Thread Xuefu Zhang
op.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   name: u_wsd.t_sd_ucm_cominfo_incremental
>
>   Stage: Stage-0
> Move Operator
>   tables:
>   partition:
> ds 20151201
>   replace: true
>   table:
>   input format: org.apache.hadoop.mapred.TextInputFormat
>   output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   name: u_wsd.t_sd_ucm_cominfo_incremental
>
>   Stage: Stage-2
> Stats-Aggr Operator
>
> *Thanks.*
>
> 2015-12-03 22:17 GMT+08:00 Xuefu Zhang <xzh...@cloudera.com>:
>
>> Can you also attach explain query result? What's your data format?
>>
>> --Xuefu
>>
>> On Thu, Dec 3, 2015 at 12:09 AM, Jone Zhang <joyoungzh...@gmail.com>
>> wrote:
>>
>>> Hive1.2.1 on Spark1.4.1
>>>
>>> *The first query is:*
>>> set mapred.reduce.tasks=100;
>>> use u_wsd;
>>> insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=
>>> 20151202)
>>> select t1.uin,t1.clientip from
>>> (select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151202)
>>> t1
>>> left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult
>>> where ds=20151201) t2
>>> on t1.uin=t2.uin
>>> where t2.clientip is NULL;
>>>
>>> *The second query is:*
>>> set mapred.reduce.tasks=100;
>>> use u_wsd;
>>> insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=
>>> 20151201)
>>> select t1.uin,t1.clientip from
>>> (select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151201)
>>> t1
>>> left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult
>>> where ds=20151130) t2
>>> on t1.uin=t2.uin
>>> where t2.clientip is NULL;
>>>
>>> *The attachment show the two query's stages.*
>>> *Here is the partition info*
>>> 104.3 M
>>>  /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151202
>>> 110.0 M
>>>  /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151201
>>> 112.6 M
>>>  /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151130
>>>
>>>
>>>
>>> *Why there are two different stages?*
>>> *The stage1 in first query is very slowly.*
>>>
>>> *Thanks.*
>>> *Best wishes.*
>>>
>>
>>
>


Re: Building spark 1.3 from source code to work with Hive 1.2.1

2015-12-03 Thread Xuefu Zhang
Mich,

To start your Spark standalone cluster, you can just download the tarball
from the Spark repo site. In other words, you don't need to start your cluster
using your build.

You only need to copy spark-assembly.jar to Hive's /lib directory and that's it.

I guess you have been confused by this, which I tried to explain previously.
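
Concretely, something like this (paths are placeholders; the prebuilt directory
matches the Spark 1.3.0 download you already have, and the jar comes from your
"hadoop2-without-hive" build):

# Run the standalone cluster from the prebuilt download, not from your build.
cd /usr/lib/spark-1.3.0-bin-hadoop2.4
./sbin/start-master.sh        # master web UI on port 8080
./sbin/start-slaves.sh        # reads conf/slaves; defaults to localhost

# The only artifact Hive needs from your custom build is the assembly jar.
cp /path/to/your/build/lib/spark-assembly-1.3.0-hadoop2.4.0.jar /usr/lib/hive/lib/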

Thanks,
Xuefu



On Thu, Dec 3, 2015 at 2:28 AM, Mich Talebzadeh  wrote:

> Hi,
>
>
>
> I have seen mails that state that the user has managed to build spark 1.3
> to work with Hive. I tried Spark 1.5.2 but no luck
>
>
>
> I downloaded spark source 1.3 source code spark-1.3.0.tar and built it as
> follows
>
>
>
> ./make-distribution.sh --name "hadoop2-without-hive" --tgz
> "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"
>
>
>
> This successfully completed and created the tarred zip file. I then
> created spark 1.3 tree from this zipped file. $SPARK_HOME is /
> usr/lib/spark
>
>
>
> Other steps that I performed:
>
>
>
> 1.In $HIVE_HOME/lib , I copied  spark-assembly-1.3.0-hadoop2.4.0.jar  to
> this directory
>
> 2.  In $SPARK_HOME/conf I created a syblink to
> /usr/lib/hive/conf/hive-site.xml
>
>
>
> Then I tried to start spark master node
>
>
>
> /usr/lib/spark/sbin/start-master.sh
>
>
>
> I get the following error:
>
>
>
>
>
> cat
> /usr/lib/spark/sbin/../logs/spark-hduser-org.apache.spark.deploy.master.Master-1-rhes564.out
>
> Spark Command: /usr/java/latest/bin/java -cp
> :/usr/lib/spark/sbin/../conf:/usr/lib/spark/lib/spark-assembly-1.3.0-hadoop2.4.0.jar:/home/hduser/hadoop-2.6.0/etc/hadoop
> -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
> org.apache.spark.deploy.master.Master --ip rhes564 --port 7077 --webui-port
> 8080
>
> 
>
>
>
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
>
> at java.lang.Class.getDeclaredMethods0(Native Method)
>
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2521)
>
> at java.lang.Class.getMethod0(Class.java:2764)
>
> at java.lang.Class.getMethod(Class.java:1653)
>
> at
> sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
>
> at
> sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
>
> Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
>
>
> I also notice that in /usr/lib/spark/lib, I only have the following jar
> files
>
>
>
> -rw-r--r-- 1 hduser hadoop 98795479 Dec  3 09:03
> spark-examples-1.3.0-hadoop2.4.0.jar
>
> -rw-r--r-- 1 hduser hadoop 98187168 Dec  3 09:03
> spark-assembly-1.3.0-hadoop2.4.0.jar
>
> -rw-r--r-- 1 hduser hadoop  4136760 Dec  3 09:03
> spark-1.3.0-yarn-shuffle.jar
>
>
>
> Whereas in the pre-built downloaded one -> /usr/lib/spark-1.3.0-bin-hadoop2.4, there
> are additional JAR files
>
>
>
> -rw-rw-r-- 1 hduser hadoop   1890075 Mar  6  2015
> datanucleus-core-3.2.10.jar
>
> -rw-rw-r-- 1 hduser hadoop 112446389 Mar  6  2015
> spark-examples-1.3.0-hadoop2.4.0.jar
>
> -rw-rw-r-- 1 hduser hadoop 159319006 Mar  6  2015
> spark-assembly-1.3.0-hadoop2.4.0.jar
>
> -rw-rw-r-- 1 hduser hadoop   4136744 Mar  6  2015
> spark-1.3.0-yarn-shuffle.jar
>
> -rw-rw-r-- 1 hduser hadoop   1809447 Mar  6  2015
> datanucleus-rdbms-3.2.9.jar
>
> -rw-rw-r-- 1 hduser hadoop339666 Mar  6  2015
> datanucleus-api-jdo-3.2.6.jar
>
>
>
> Any ideas what is is missing? I am sure someone has sorted this one out
> before.
>
>
>
>
>
> Thanks,
>
>
>
> Mich
>
>
>
>
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>


Re: Why there are two different stages on the same query when i use hive on spark.

2015-12-03 Thread Xuefu Zhang
Can you also attach the explain query result? What's your data format?
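
For example, prefixing the first reported query with EXPLAIN and attaching its
output would be enough:

explain
insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=20151202)
select t1.uin, t1.clientip from
(select uin, clientip from t_sd_ucm_cominfo_FinalResult where ds=20151202) t1
left outer join (select uin, clientip from t_sd_ucm_cominfo_FinalResult where ds=20151201) t2
on t1.uin = t2.uin
where t2.clientip is NULL;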

--Xuefu

On Thu, Dec 3, 2015 at 12:09 AM, Jone Zhang  wrote:

> Hive1.2.1 on Spark1.4.1
>
> *The first query is:*
> set mapred.reduce.tasks=100;
> use u_wsd;
> insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=20151202
> )
> select t1.uin,t1.clientip from
> (select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151202)
> t1
> left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult
> where ds=20151201) t2
> on t1.uin=t2.uin
> where t2.clientip is NULL;
>
> *The second query is:*
> set mapred.reduce.tasks=100;
> use u_wsd;
> insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=20151201
> )
> select t1.uin,t1.clientip from
> (select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151201)
> t1
> left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult
> where ds=20151130) t2
> on t1.uin=t2.uin
> where t2.clientip is NULL;
>
> *The attachment show the two query's stages.*
> *Here is the partition info*
> 104.3 M
>  /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151202
> 110.0 M
>  /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151201
> 112.6 M
>  /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151130
>
>
>
> *Why there are two different stages?*
> *The stage1 in first query is very slowly.*
>
> *Thanks.*
> *Best wishes.*
>


Re: Hive on spark table caching

2015-12-02 Thread Xuefu Zhang
Depending on the query, Hive on Spark does implicitly cache datasets (not
necessarily the input tables) for performance benefits. Such queries
include multi-insert, self-join, self-union, etc. However, no caching
happens across queries at this time, which may be improved in the future.
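
For instance, a multi-insert of the following shape (table and column names are
invented for illustration) scans the shared source once, and that shared
intermediate result is a candidate for implicit caching within the single query:

-- One scan of web_logs feeds both inserts.
FROM web_logs
INSERT OVERWRITE TABLE errors_by_day
  SELECT ds, count(*) WHERE status >= 500 GROUP BY ds
INSERT OVERWRITE TABLE hits_by_day
  SELECT ds, count(*) GROUP BY ds;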

Thanks,
Xuefu

On Wed, Dec 2, 2015 at 3:00 PM, Udit Mehta  wrote:

> Hi,
>
> I have started using Hive on Spark recently and am exploring the benefits
> it offers. I was wondering if Hive on Spark has capabilities to cache table
> like Spark SQL. Or does it do any form of implicit caching in the long
> running job which it starts after running the first query?
>
> Thanks,
> Udit
>


Re: Problem with getting start of Hive on Spark

2015-12-01 Thread Xuefu Zhang
Mich,

As I understand, you have a problem with Hive on Spark due to dual network
interfaces. I agree that this is something that should be fixed in Hive.
However, saying Hive on Spark doesn't work seems unfair. At Cloudera, we
have many customers that successfully deployed Hive on Spark on their
clusters.

As discussed in another thread, we don't have all the bandwidth we would like
to answer every user problem. Thus, it's crucial to provide as much
information as possible when reporting a problem. This includes reproduction
steps as well as Hive, Spark, and/or Yarn logs.

Thanks,
Xuefu

On Tue, Dec 1, 2015 at 1:32 AM, Mich Talebzadeh  wrote:

> Hi Link,
>
>
>
> I am afraid it seems that using Spark as the execution engine for Hive
> does not seem to work. I am still trying to make it work.
>
>
>
> An alternative is to use Spark with Hive data set. To be precise
> spark-sql. You set spark to use Hive metastore and then use Hive as the
> heavy DML engine. That will give you the ability to use spark for queries
> that can be done in-memory.
>
>
>
> HTH
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* Link Qian [mailto:fastupl...@outlook.com]
> *Sent:* 01 December 2015 00:57
>
> *To:* user@hive.apache.org
> *Subject:* RE: Problem with getting start of Hive on Spark
>
>
>
> Hi Mich,
> I set hive execution engine as Spark.
>
> Link Qian
> --
>
> From: m...@peridale.co.uk
> To: user@hive.apache.org
> Subject: RE: Problem with getting start of Hive on Spark
> Date: Mon, 30 Nov 2015 16:15:31 +
>
> To clarify are you running Hive and using Spark as the execution engine
> (as opposed to default Hive execution engine MapReduce)?
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* Link Qian [mailto:fastupl...@outlook.com ]
>
> *Sent:* 30 November 2015 13:21
> *To:* user@hive.apache.org
> *Subject:* Problem with getting start of Hive on Spark
>
>
>
> Hello,
>
> Following the Hive wiki page,
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
> ,
> I got a several fails that execute HQL based on Spark engine with yarn. I
> have hadoop-2.6.2, yarn-2.6.2 and Spark-1.5.2.
> The fails got either spark-1.5.2-hadoop2.6 distribution version or
> spark-1.5.2-without-hive customer compiler version with instruction on that
> wiki page.
>
> Hive cli submits spark job but the job runs a short time and RM web app
> shows the job is successfully.  but hive 

Re: Problem with getting start of Hive on Spark

2015-12-01 Thread Xuefu Zhang
Link,

It seems that you're using Hive 1.2.1, which doesn't support Spark 1.5.2, or
at least is not tested with it. Please try the Hive master branch if you want to use
Spark 1.5.2. If the problem remains, please provide all the commands you
run in your Hive session that leads to the failure.

Thanks,
Xuefu

On Mon, Nov 30, 2015 at 4:57 PM, Link Qian  wrote:

> Hi Mich,
> I set hive execution engine as Spark.
>
> Link Qian
>
> --
> From: m...@peridale.co.uk
> To: user@hive.apache.org
> Subject: RE: Problem with getting start of Hive on Spark
> Date: Mon, 30 Nov 2015 16:15:31 +
>
>
> To clarify are you running Hive and using Spark as the execution engine
> (as opposed to default Hive execution engine MapReduce)?
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* Link Qian [mailto:fastupl...@outlook.com]
> *Sent:* 30 November 2015 13:21
> *To:* user@hive.apache.org
> *Subject:* Problem with getting start of Hive on Spark
>
>
>
> Hello,
>
> Following the Hive wiki page,
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
> ,
> I got a several fails that execute HQL based on Spark engine with yarn. I
> have hadoop-2.6.2, yarn-2.6.2 and Spark-1.5.2.
> The fails got either spark-1.5.2-hadoop2.6 distribution version or
> spark-1.5.2-without-hive customer compiler version with instruction on that
> wiki page.
>
> Hive cli submits spark job but the job runs a short time and RM web app
> shows the job is successfully.  but hive cli show the job fails.
>
> Here is a snippet of hive cli debug log. any suggestion?
>
>
> 15/11/30 07:31:36 [main]: INFO status.SparkJobMonitor: state = SENT
> 15/11/30 07:31:37 [stderr-redir-1]: INFO client.SparkClientImpl: 15/11/30
> 07:31:37 INFO yarn.Client: Application report for
> application_1448886638370_0001 (state: RUNNING)
> 15/11/30 07:31:37 [stderr-redir-1]: INFO client.SparkClientImpl: 15/11/30
> 07:31:37 INFO yarn.Client:
> 15/11/30 07:31:37 [stderr-redir-1]: INFO client.SparkClientImpl:
> client token: N/A
> 15/11/30 07:31:37 [stderr-redir-1]: INFO client.SparkClientImpl:
> diagnostics: N/A
> 15/11/30 07:31:37 [stderr-redir-1]: INFO client.SparkClientImpl:
> ApplicationMaster host: 192.168.1.12
> 15/11/30 07:31:37 [stderr-redir-1]: INFO client.SparkClientImpl:
> ApplicationMaster RPC port: 0
> 15/11/30 07:31:37 [stderr-redir-1]: INFO client.SparkClientImpl:
> queue: default
> 15/11/30 07:31:37 [stderr-redir-1]: INFO client.SparkClientImpl:
> start time: 1448886649489
> 15/11/30 07:31:37 [stderr-redir-1]: INFO client.SparkClientImpl:
> final status: UNDEFINED
> 15/11/30 07:31:37 [stderr-redir-1]: INFO client.SparkClientImpl:
> tracking URL:
> http://namenode.localdomain:8088/proxy/application_1448886638370_0001/
> 15/11/30 07:31:37 [stderr-redir-1]: INFO client.SparkClientImpl:
> user: hadoop
> 15/11/30 07:31:37 [stderr-redir-1]: INFO client.SparkClientImpl: 15/11/30
> 07:31:37 INFO cluster.YarnClientSchedulerBackend: Application
> application_1448886638370_0001 has started running.
> 15/11/30 07:31:37 [stderr-redir-1]: INFO client.SparkClientImpl: 15/11/30
> 07:31:37 INFO util.Utils: Successfully started service
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 51326.
> 15/11/30 07:31:37 [stderr-redir-1]: INFO client.SparkClientImpl: 15/11/30
> 07:31:37 INFO netty.NettyBlockTransferService: Server created on 51326
> 15/11/30 07:31:37 [stderr-redir-1]: INFO client.SparkClientImpl: 15/11/30
> 07:31:37 INFO storage.BlockManagerMaster: Trying to register BlockManager
> 15/11/30 07:31:37 [stderr-redir-1]: INFO client.SparkClientImpl: 15/11/30
> 07:31:37 INFO storage.BlockManagerMasterEndpoint: Registering block 

Re: Hive version with Spark

2015-11-29 Thread Xuefu Zhang
Sofia,

What specific problem did you encounter when trying a spark.master other than
local?

Thanks,
Xuefu

On Sat, Nov 28, 2015 at 1:14 AM, Sofia Panagiotidi <
sofia.panagiot...@taiger.com> wrote:

> Hi Mich,
>
>
> I never managed to run Hive on Spark with a spark master other than local
> so I am afraid I don’t have a reply here.
> But do try some things. Firstly, run hive as
>
> hive --hiveconf hive.root.logger=DEBUG,console
>
>
> so that you are able to see what the exact error is.
>
> I am afraid I cannot be much of a help as I think I reached the same point
> (where it would work only when setting spark.master=local) before
> abandoning.
>
> Cheers
>
>
>
> On 27 Nov 2015, at 01:59, Mich Talebzadeh  wrote:
>
> Hi Sophia,
>
>
> There is no Hadoop-2.6. I believe you should use Hadoop-2.4 as shown below
>
>
> mvn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests clean package
>
> Also if you are building it for Hive on Spark engine, you should not
> include Hadoop.jar files in your build.
>
> For example I tried to build spark 1.3 from source code (I read that this
> version works OK with Hive, having tried unsuccessfully spark 1.5.2).
>
> The following command created the tar file
>
> ./make-distribution.sh --name "hadoop2-without-hive" --tgz
> "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"
>
> spark-1.3.0-bin-hadoop2-without-hive.tar.gz
>
>
> Now I have other issues making Hive to use Spark execution engine
> (requires Hive 1.1 or above )
>
> In hive I do
>
> set spark.home=/usr/lib/spark;
> set hive.execution.engine=spark;
> set spark.master=spark://127.0.0.1:7077;
> set spark.eventLog.enabled=true;
> set spark.eventLog.dir=/usr/lib/spark/logs;
> set spark.executor.memory=512m;
> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
> use asehadoop;
> select count(1) from t;
>
> I get the following
>
> OK
> Time taken: 0.753 seconds
> Query ID = hduser_20151127003523_e9863e84-9a81-4351-939c-36b3bef36478
> Total jobs = 1
> Launching Job 1 out of 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=
> Failed to execute spark task, with exception
> 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark
> client.)'
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.spark.SparkTask
>
> HTH,
>
> Mich
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
> *From:* Sofia [mailto:sofia.panagiot...@taiger.com
> ]
> *Sent:* 18 November 2015 16:50
> *To:* user@hive.apache.org
> *Subject:* Hive version with Spark
>
> Hello
>
> After various failed tries to use my Hive (1.2.1) with my Spark (Spark
> 1.4.1 built for Hadoop 2.2.0) I decided to try to build again Spark with
> Hive.
> I would like to know what is the latest Hive version that can be used to
> build Spark at this point.
>
> When downloading Spark 1.5 source and trying:
>
> *mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-1.2.1
> -Phive-thriftserver  -DskipTests clean package*
>
> I get :
>
> *The requested profile "hive-1.2.1" could not be activated because it does
> not exist.*
>
> Thank you
> Sofia
>
>
>


Re: Java heap space occured when the amount of data is very large with the same key on join sql

2015-11-28 Thread Xuefu Zhang
How much data are you dealing with, and how skewed is it? The code comes from
Spark as far as I can see. To overcome the problem, you have a few things
to try:

1. Increase executor memory.
2. Try Hive's skew join.
3. Rewrite your query.
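
As a rough sketch of options 1 and 2, the relevant settings can be tried per
session (the values below are illustrative, not recommendations):

-- Option 1: give each Spark executor more heap.
set spark.executor.memory=4g;

-- Option 2: let Hive handle heavily skewed join keys separately.
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;   -- rows per join key before it is treated as skewed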

Thanks,
Xuefu

On Sat, Nov 28, 2015 at 12:37 AM, Jone Zhang  wrote:

> Add a little:
> The Hive version is 1.2.1
> The Spark version is 1.4.1
> The Hadoop version is 2.5.1
>
> 2015-11-26 20:36 GMT+08:00 Jone Zhang :
>
>> Here is an error message:
>>
>> java.lang.OutOfMemoryError: Java heap space
>> at java.util.Arrays.copyOf(Arrays.java:2245)
>> at java.util.Arrays.copyOf(Arrays.java:2219)
>> at java.util.ArrayList.grow(ArrayList.java:242)
>> at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:216)
>> at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:208)
>> at java.util.ArrayList.add(ArrayList.java:440)
>> at
>> org.apache.hadoop.hive.ql.exec.spark.SortByShuffler$ShuffleFunction$1.next(SortByShuffler.java:95)
>> at
>> org.apache.hadoop.hive.ql.exec.spark.SortByShuffler$ShuffleFunction$1.next(SortByShuffler.java:70)
>> at
>> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:95)
>> at
>> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
>> at
>> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:216)
>> at
>> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>>
>> And the note from the SortByShuffler.java
>>   // TODO: implement this by accumulating rows with the same
>> key into a list.
>>   // Note that this list needs to improved to prevent
>> excessive memory usage, but this
>>   // can be done in later phase.
>>
>>
>> The join sql run success when i use hive on mapreduce.
>> So how do mapreduce deal with it?
>> And Is there plan to improved to prevent excessive memory usage?
>>
>> Best wishes!
>> Thanks!
>>
>
>


Re: Answers to recent questions on Hive on Spark

2015-11-28 Thread Xuefu Zhang
You should be able to set that property as any other Hive property: just do
"set hive.spark.client.server.address=xxx;" before you start a query. Make
sure that you can reach this server address from your nodemanager nodes
because they are where the remote driver runs. The driver needs to connect
back to HS2. Sometimes a firewall may block the access, causing the error
you have seen.
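
A quick way to verify that reachability is to probe from one of the
NodeManager machines (the address and port below are placeholders; use the
HS2 address you expect and the port shown in the RemoteDriver "Connecting
to:" log line):

HS2_HOST=50.140.197.217   # the interface HS2 should be reachable on
RPC_PORT=34682            # port the remote driver tries to connect back to
nc -vz "$HS2_HOST" "$RPC_PORT"

# Also confirm which addresses/interfaces the HS2 box actually exposes.
ip addr show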

Thanks,
Xuefu

On Sat, Nov 28, 2015 at 9:33 AM, Mich Talebzadeh <m...@peridale.co.uk>
wrote:

> Hi Xuefu,
>
>
>
> Thanks for the response. I did the changes as requested (coping the
> assembly jar file from build to $HIVE_HOME/lib). I will give full response
> when I get the debug outpout
>
>
>
> In summary when I ran the sql query from Hive and expected Spark to act as
> execution engine, it came back with client connection error.
>
>
>
> Cruically I noticed that it was trying to connect to eth1 (the internet
> connection) as opposed to eth0 (the local network. This host has two
> Ethernet cards one for local area network and the other for linternet
> (directly no proxy)
>
>
>
> It suggested that I can change the address using the configuration
> parameter hive.spark.client.server.address
>
>
>
> Now I don’t seem to be able to set it up in hive-site.xml or as a set
> parameter in hive prompt itself!
>
>
>
> Any hint would be appreciated or any work around?
>
>
>
> Regards,
>
>
>
> Mich
>
>
>
> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com]
> *Sent:* 28 November 2015 04:35
> *To:* user@hive.apache.org
> *Cc:* d...@hive.apache.org
> *Subject:* Re: Answers to recent questions on Hive on Spark
>
>
>
> Okay. I think I know what problem you have now. To run Hive on Spark,
> spark-assembly.jar is needed and it's also recommended that you have a
> spark installation (identified by spark.home) on the same host where HS2 is
> running. You only need spark-assembly.jar in HS2's /lib directory. Other
> than those, Hive on Spark doesn't have any other dependency at service
> level. On the job level, Hive on Spark jobs of course run on a spark
> cluster, which could be standalone, yarn-cluster, etc. However, how you get
> the binaries for your spark cluster and how you start them is completely
> independent of Hive.
>
> Thus, you only need to build the spark-assembly.jar w/o HIve and put it in
> Hive's /lib directory. The one in the existing spark build may contain Hive
> classes and that's why you need to build your own. Your spark installation
> can still have a jar that's different from what you build for Hive on
> Spark. Your spark.home can still point to your existing spark installation.
> In fact, Hive on Spark only needs spark-submit from your Spark
> installation. Therefore, you should be okay even if your spark installation
> contains Hive classes.
>
> By following this, I'm sure you will get your Hive on Spark to work.
> Depending on the Hive version that your spark installation contains, you
> may have problem with spark applications such as SparkSQL, but it shouldn't
> be a concern if you decide that you use Hive in Hive.
>
> Let me know if you are still confused.
>
> Thanks,
>
> Xuefu
>
>
>
> On Fri, Nov 27, 2015 at 4:34 PM, Mich Talebzadeh <m...@peridale.co.uk>
> wrote:
>
> Hi,
>
>
>
> Thanks for heads up and comments.
>
>
>
> Sounds like when it comes to using spark as the execution engine for Hive,
> we are in no man’s land so to speak. I have opened questions in both Hive
> and Spark user forums. Not much of luck for reasons that you alluded to.
>
>
>
> Ok just to clarify the prebuild version of spark (as opposed get the
> source code and build your spec) works fine for me.
>
>
>
> Components are
>
>
>
> hadoop version
>
> Hadoop 2.6.0
>
>
>
> hive --version
>
> Hive 1.2.1
>
>
>
> Spark
>
> version 1.5.2
>
>
>
> It does what it says on the tin. For example I can start the master node
> OK start-master.sh.
>
>
>
>
>
> Spark Command: */usr/java/latest/bin/java -cp
> /usr/lib/spark_1.5.2_bin/sbin/../conf/:/usr/lib/spark_1.5.2_bin/lib/spark-assembly-1.5.2-hadoop2.6.0.jar:/usr/lib/spark_1.5.2_bin/lib/datanucleus-core-3.2.10.jar:/usr/lib/spark_1.5.2_bin/lib/datanucleus-api-jdo-3.2.6.jar:/usr/lib/spark_1.5.2_bin/lib/datanucleus-rdbms-3.2.9.jar:/home/hduser/hadoop-2.6.0/etc/hadoop/
> -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master
> --ip 127.0.0.1 --port 7077 --webui-port 8080*
>
> 
>
> 15/11/28 00:05:23 INFO master.Master: Registered signal handlers for
> [TERM, HUP, INT]
>
> 15/11/28 00:05:23 WARN util.Utils: Your ho

Re: Answers to recent questions on Hive on Spark

2015-11-28 Thread Xuefu Zhang
This appears to be a problem with Hive: hive.spark.client.server.address is
not exposed in HiveConf. Could you please create a JIRA and we can get it
fixed.

In the meantime, could you try to disable the internet interface card to
see if that helps?

Thanks,
Xuefu

On Sat, Nov 28, 2015 at 1:30 PM, Mich Talebzadeh <m...@peridale.co.uk>
wrote:

> Hi,
>
>
>
> As I mentioned that parameter does not seem to work I am afraid!
>
>
>
> hive> set hive.spark.client.server.address=50.140.197.217;
>
> Query returned non-zero code: 1, cause: hive configuration
> hive.spark.client.server.address does not exists.
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com]
> *Sent:* 28 November 2015 20:53
>
> *To:* user@hive.apache.org
> *Cc:* d...@hive.apache.org
> *Subject:* Re: Answers to recent questions on Hive on Spark
>
>
>
> You should be able to set that property as any other Hive property: just
> do "set hive.spark.client.server.address=xxx;" before you start a query.
> Make sure that you can reach this server address from your nodemanager
> nodes because they are where the remote driver runs. The driver needs to
> connect back to HS2. Sometimes firewall may blocks the access, causing the
> error you seen.
>
> Thanks,
>
> Xuefu
>
>
>
> On Sat, Nov 28, 2015 at 9:33 AM, Mich Talebzadeh <m...@peridale.co.uk>
> wrote:
>
> Hi Xuefu,
>
>
>
> Thanks for the response. I did the changes as requested (coping the
> assembly jar file from build to $HIVE_HOME/lib). I will give full response
> when I get the debug outpout
>
>
>
> In summary when I ran the sql query from Hive and expected Spark to act as
> execution engine, it came back with client connection error.
>
>
>
> Cruically I noticed that it was trying to connect to eth1 (the internet
> connection) as opposed to eth0 (the local network. This host has two
> Ethernet cards one for local area network and the other for linternet
> (directly no proxy)
>
>
>
> It suggested that I can change the address using the configuration
> parameter hive.spark.client.server.address
>
>
>
> Now I don’t seem to be able to set it up in hive-site.xml or as a set
> parameter in hive prompt itself!
>
>
>
> Any hint would be appreciated or any work around?
>
>
>
> Regards,
>
>
>
> Mich
>
>
>
> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com]
> *Sent:* 28 November 2015 04:35
> *To:* user@hive.apache.org
> *Cc:* d...@hive.apache.org
> *Subject:* Re: Answers to recent questions on Hive on Spark
>
>
>
> Okay. I think I know what problem you have now. To run Hive on Spark,
> spark-assembly.jar is needed and it's also recommended that you have a
> spark installation (identified by spark.home) on the same host where HS2 is
> running. You only need spark-assembly.jar in HS2's /lib directory. Other
> than those, Hive on Spark doesn't have any other dependency at service
> level. On the job level, Hive on Spark jobs of course run on a spark
> cluster, which could be standalone, yarn-cluster, etc. However, how you get
> the binaries for your spark cluster and how you start them is completely
> independent of Hive.
>
> Thus, you only need to build the spark-assembly.jar w/o HIve and put it in
> Hive's /lib directory. The one in the existing spark build may contain Hive
> classes and that's wh

Answers to recent questions on Hive on Spark

2015-11-27 Thread Xuefu Zhang
Hi there,

There seems to be increasing interest in Hive on Spark from Hive users. I
understand that there have been a few questions or problems reported and I
can see some frustration sometimes. It's impossible for the Hive on Spark team
to respond to every inquiry even though we wish we could. However, there are
a few items to be noted:

1. Hive on Spark is being tested as part of Precommit test.
2. Hive on Spark is supported in some distributions such as CDH.
3. I tried a couple of days ago with latest master and branch-1, and they
all worked with my Spark 1.5 build.

Therefore, if you are facing some problem, it's likely due to your setup.
Please refer to the wiki on how to do it right. Nevertheless, I have a few
suggestions here:

1. Start simple. Try out a CDH sandbox or distribution first to see it work
in action before building your own. Comparing it with your setup may give you
some clues.
2. Try with spark.master=local first, making sure that you have all the
necessary dependent jars, and then move to your production setup. Please
note that yarn-cluster is recommended and mesos is not supported. I tried
both yarn-cluster and local-cluster and both worked for me.
3. Check logs beyond hive.log, such as the Spark log and the Yarn log, to get
more error messages.
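
As a concrete starting point for suggestion 2, a local-mode smoke test could
look like the following in a Hive session (spark.home and the test table are
placeholders; switch spark.master to yarn-cluster only after this works):

set hive.execution.engine=spark;
set spark.master=local;
set spark.home=/path/to/spark/dir;
select count(1) from some_small_table;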

When you report your problem, please provide as much info as possible, such
as your platform, your builds, your configurations, and relevant logs so
that others can reproduce.

Please note that we are not in a good position to answer questions with
respect to Spark itself, such as spark-shell. Not only is that beyond the
scope of Hive on Spark, but also the team may not have the expertise to
give you meaningful answers. One thing to emphasize: when you build your
Spark jar, don't include Hive, as it's very likely there is a version
mismatch. Again, a distribution may have solved the problem for you if you'd
like to give it a try.

Hope this helps.

Thanks,
Xuefu


Re: 答复: Answers to recent questions on Hive on Spark

2015-11-27 Thread Xuefu Zhang
Hi Wenli,

The Hive on Spark team believes that Hive on Spark is production ready. In
fact, CDH already provides support for selected customers in 5.4, which is
based on Hive 1.1.0. CDH will release Hive on Spark as GA in 5.7, which is
coming soon.

Thanks,
Xuefu

On Fri, Nov 27, 2015 at 7:28 PM, Wangwenli <wangwe...@huawei.com> wrote:

> Hi xuefu ,
>
>
>
> thanks for the information.
>
> One simple question, *any plan when the hive on spark can be used in
> production environment?*
>
>
>
> Regards
>
> wenli
>
>
>
> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com]
> *Sent:* 28 November 2015 2:12
> *To:* user@hive.apache.org; d...@hive.apache.org
> *Subject:* Answers to recent questions on Hive on Spark
>
>
>
> Hi there,
>
> There seemed an increasing interest in Hive On Spark From the Hive users.
> I understand that there have been a few questions or problems reported and
> I can see some frustration sometimes. It's impossible for Hive on Spark
> team to respond every inquiry even thought we wish we could. However, there
> are a few items to be noted:
>
> 1. Hive on Spark is being tested as part of Precommit test.
>
> 2. Hive on Spark is supported in some distributions such as CDH.
>
> 3. I tried a couple of days ago with latest master and branch-1, and they
> all worked with my Spark 1.5 build.
>
> Therefore, if you are facing some problem, it's likely due to your setup.
> Please refer to Wiki on how to do it right. Nevertheless, I have a few
> suggestions here:
>
> 1. Start with simple. Try out a CDH sandbox or distribution first and to
> see it works in action before building your own. Comparing with your setup
> may give you some clues.
>
> 2. Try with spark.master=local first, making sure that you have all the
> necessary dependent jars, and then move to your production setup. Please
> note that yarn-cluster is recommended and mesos is not supported. I tried
> both yarn-cluster and local-cluster and both worked for me.
>
> 3. Check logs beyond hive.log such as spark log, and yarn-log to get more
> error messages.
>
> When you report your problem, please provide as much info as possible,
> such as your platform, your builds, your configurations, and relevant logs
> so that others can reproduce.
>
> Please note that we are not in a good position to answer questions with
> respect to Spark itself, such as spark-shell. Not only is that beyond the
> scope of Hive on Spark, but also the team may not have the expertise to
> give you meaningful answers. One thing to emphasize: when you build your
> Spark jar, don't include Hive, as it's very likely there is a version
> mismatch. Again, a distribution may have solved the problem for you if
> you'd like to give it a try.
>
> Hope this helps.
>
> Thanks,
>
> Xuefu
>


Re: hive1.2.1 on spark connection time out

2015-11-25 Thread Xuefu Zhang
There are usually a few more messages before this but after "spark-submit" in
hive.log. Do you have spark.home set?

On Sun, Nov 22, 2015 at 10:17 PM, zhangjp  wrote:

>
> I'm using hive1.2.1 . I want to run hive on spark model,but there is some
> issues.
> have been set spark.master=yarn-client;
> spark version  1.4.1 which run spark-shell --master yarn-client there is
> no problem.
>
> *log*
> 2015-11-23 13:54:56,068 ERROR [main]: spark.SparkTask
> (SessionState.java:printError(960)) - Failed to execute spark task, with
> exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to
> create spark client.)'
> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
> client.
> at
> org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:57)
> at
> org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:116)
> at
> org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:112)
> at
> org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:101)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
> at
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1653)
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1412)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1195)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
> at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:213)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:165)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
> at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:736)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> Caused by: java.lang.RuntimeException:
> java.util.concurrent.ExecutionException:
> java.util.concurrent.TimeoutException: Timed out waiting for client
> connection.
> at com.google.common.base.Throwables.propagate(Throwables.java:156)
> at
> org.apache.hive.spark.client.SparkClientImpl.(SparkClientImpl.java:109)
> at
> org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
> at
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.(RemoteHiveSparkClient.java:90)
> at
> org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:65)
> at
> org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:55)
> ... 21 more
> Caused by: java.util.concurrent.ExecutionException:
> java.util.concurrent.TimeoutException: Timed out waiting for client
> connection.
> at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
> at
> org.apache.hive.spark.client.SparkClientImpl.(SparkClientImpl.java:99)
> ... 25 more
> Caused by: java.util.concurrent.TimeoutException: Timed out waiting for
> client connection.
> at org.apache.hive.spark.client.rpc.RpcServer$2.run(RpcServer.java:141)
> at
> io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
> at
> io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:123)
> at
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
> at java.lang.Thread.run(Thread.java:745)
>


Re: Write access request to the Hive wiki

2015-11-25 Thread Xuefu Zhang
Hi Aihua,

I just granted you write access to the Hive wiki. Let me know if the problem
remains.

Thanks,
Xuefu

On Wed, Nov 25, 2015 at 10:50 AM, Aihua Xu  wrote:

> I'd like to request write access to the Hive wiki to update some of the
> docs.
>
> My Confluence user name is aihuaxu.
>
> Thanks!
> Aihua
>


Re: [ANNOUNCE] New PMC Member : John Pullokkaran

2015-11-24 Thread Xuefu Zhang
Congratulations, John!

--Xuefu

On Tue, Nov 24, 2015 at 3:01 PM, Prasanth J  wrote:

> Congratulations and Welcome John!
>
> Thanks
> Prasanth
>
> On Nov 24, 2015, at 4:59 PM, Ashutosh Chauhan 
> wrote:
>
> On behalf of the Hive PMC I am delighted to announce John Pullokkaran is
> joining Hive PMC.
> John is a long time contributor in Hive and is focusing on compiler and
> optimizer areas these days.
> Please give John a warm welcome to the project!
>
> Ashutosh
>
>
>


Re: Upgrading from Hive 0.14.0 to Hive 1.2.1

2015-11-24 Thread Xuefu Zhang
This upgrade should be no different from any other upgrade. You can use Hive's
schema tool to upgrade your existing metadata.
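
For example, assuming a MySQL-backed metastore (adjust -dbType to your
database) and an existing 0.14.0 schema, the schema tool invocation would
look roughly like this; back up the metastore database before running it:

  $HIVE_HOME/bin/schematool -dbType mysql -info
  $HIVE_HOME/bin/schematool -dbType mysql -upgradeSchemaFrom 0.14.0

The -info step shows the current schema version, and -upgradeSchemaFrom
applies the upgrade scripts up to the version shipped with the new Hive.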

Thanks,
Xuefu

On Tue, Nov 24, 2015 at 10:05 AM, Mich Talebzadeh 
wrote:

> Hi,
>
>
>
> I would like to upgrade to Hive 1.2.1 as I understand one cannot deploy
> Spark execution engine on 0.14
>
>
>
> *Chooses execution engine. Options are: **mr** (Map reduce, default), *
> *tez** (Tez
>  execution,
> for Hadoop 2 only), or **spark** (Spark
>  execution,
> for Hive 1.1.0 onward).*
>
>
>
> Is there any upgrade path (I don’t want to lose my existing databases in
> Hive) or I have to start from new including generating new metatsore etc?
>
>
>
> Thanks,
>
>
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>


Re: Building Spark to use for Hive on Spark

2015-11-22 Thread Xuefu Zhang
Hive on Spark is supposed to work with any version of Hive (1.1+) and a
version of Spark built w/o Hive. Thus, to make HoS work reliably and also
simplify matters, I think it still makes sense to require that the
spark-assembly jar shouldn't contain Hive jars. Otherwise, you have to make
sure that your Hive version matches the "other" Hive version that's included
in Spark.
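
As a rough sketch only (run from the Spark 1.x source tree; the exact Maven
profiles depend on your Spark and Hadoop versions), a Hive-free assembly is
built by simply leaving out the -Phive/-Phive-thriftserver profiles:

  ./make-distribution.sh --name hadoop2-without-hive --tgz \
      -Pyarn -Phadoop-2.6 -DskipTests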

In CDH 5.x, Spark version is 1.5, and we still build Spark jar w/o Hive.

Therefore, I don't see a need to update the doc.

--Xuefu

On Sun, Nov 22, 2015 at 9:23 PM, Lefty Leverenz 
wrote:

> Gopal, can you confirm the doc change that Jone Zhang suggests?  The
> second sentence confuses me:  "You can choose Spark1.5.0+ which  build
> include the Hive jars."
>
> Thanks.
>
> -- Lefty
>
>
> On Thu, Nov 19, 2015 at 8:33 PM, Jone Zhang 
> wrote:
>
>> I should add that Spark1.5.0+ is used hive1.2.1 default when you use
>> -Phive
>>
>> So this page
>> 
>>  shoule
>> write like below
>> “Note that you must have a version of Spark which does *not* include the
>> Hive jars if you use Spark1.4.1 and before, You can choose Spark1.5.0+
>> which  build include the Hive jars ”
>>
>>
>> 2015-11-19 5:12 GMT+08:00 Gopal Vijayaraghavan :
>>
>>>
>>>
>>> > I wanted to know  why is it necessary to remove the Hive jars from the
>>> >Spark build as mentioned on this
>>>
>>> Because SparkSQL was originally based on Hive & still uses Hive AST to
>>> parse SQL.
>>>
>>> The org.apache.spark.sql.hive package contains the parser which has
>>> hard-references to the hive's internal AST, which is unfortunately
>>> auto-generated code (HiveParser.TOK_TABNAME etc).
>>>
>>> Everytime Hive makes a release, those constants change in value and that
>>> is private API because of the lack of backwards-compat, which is violated
>>> by SparkSQL.
>>>
>>> So Hive-on-Spark forces mismatched versions of Hive classes, because it's
>>> a circular dependency of Hive(v1) -> Spark -> Hive(v2) due to the basic
>>> laws of causality.
>>>
>>> Spark cannot depend on a version of Hive that is unreleased and
>>> Hive-on-Spark release cannot depend on a version of Spark that is
>>> unreleased.
>>>
>>> Cheers,
>>> Gopal
>>>
>>>
>>>
>>
>


Re: starting spark-shell throws /tmp/hive on HDFS should be writable error

2015-11-20 Thread Xuefu Zhang
This seems belonging to Spark user list. I don't see any relevance to Hive
except the directory containing "hive" word.
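
That said, the quoted error message itself states what is needed: the root
scratch dir /tmp/hive on HDFS must be writable. Assuming the default
location, widening its permissions usually clears it:

  hdfs dfs -mkdir -p /tmp/hive
  hdfs dfs -chmod -R 777 /tmp/hive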

--Xuefu

On Fri, Nov 20, 2015 at 1:13 PM, Mich Talebzadeh 
wrote:

> Hi,
>
>
>
> Has this been resolved. I don’t think this has anything to do with
> /tmp/hive directory permission
>
>
>
> spark-shell
>
> log4j:WARN No appenders could be found for logger
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
>
> log4j:WARN Please initialize the log4j system properly.
>
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
>
> Using Spark's repl log4j profile:
> org/apache/spark/log4j-defaults-repl.properties
>
> To adjust logging level use sc.setLogLevel("INFO")
>
> Welcome to
>
>     __
>
>  / __/__  ___ _/ /__
>
> _\ \/ _ \/ _ `/ __/  '_/
>
>/___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>
>   /_/
>
>
>
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
> 1.7.0_25)
>
> Type in expressions to have them evaluated.
>
> Type :help for more information.
>
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch
> dir: /tmp/hive on HDFS should be writable. Current permissions are:
> rwx--
>
>
>
>
>
> :10: error: not found: value sqlContext
>
>import sqlContext.implicits._
>
>   ^
>
> :10: error: not found: value sqlContext
>
>import sqlContext.sql
>
>   ^
>
>
>
> scala>
>
>
>
> Thanks,
>
>
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>


Re: troubleshooting: "unread block data' error

2015-11-19 Thread Xuefu Zhang
Are you able to run queries that are not touching HBase? This problem were
seen before but fixed.

On Tue, Nov 17, 2015 at 3:37 AM, Sofia  wrote:

> Hello,
>
> I have configured Hive to work Spark.
>
> I have been trying to run a query on a Hive table managing an HBase table
> (created via HBaseStorageHandler) at the Hive CLI.
>
> When spark.master is “local" it works just fine, but when I set it to my
> spark master spark://spark-master:7077 I get the following error:
>
> 15/11/17 10:49:30 [stderr-redir-1]: INFO client.SparkClientImpl: 15/11/17
> 10:49:30 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from
> ShuffleMapStage 0 (MapPartitionsRDD[1] at mapPartitionsToPair at
> MapTran.java:31)
> 15/11/17 10:49:30 [stderr-redir-1]: INFO client.SparkClientImpl: 15/11/17
> 10:49:30 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
> 15/11/17 10:49:30 [stderr-redir-1]: INFO client.SparkClientImpl: 15/11/17
> 10:49:30 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID
> 0, 192.168.1.64, ANY, 1688 bytes)
> 15/11/17 10:49:30 [stderr-redir-1]: INFO client.SparkClientImpl: 15/11/17
> 10:49:30 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> 192.168.1.64): java.lang.IllegalStateException: unread block data
> 15/11/17 10:49:30 [stderr-redir-1]: INFO client.SparkClientImpl: at
> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2428)
> 15/11/17 10:49:30 [stderr-redir-1]: INFO client.SparkClientImpl: at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
> 15/11/17 10:49:30 [stderr-redir-1]: INFO client.SparkClientImpl: at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
> 15/11/17 10:49:30 [stderr-redir-1]: INFO client.SparkClientImpl: at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
> 15/11/17 10:49:30 [stderr-redir-1]: INFO client.SparkClientImpl: at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> 15/11/17 10:49:30 [stderr-redir-1]: INFO client.SparkClientImpl: at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> 15/11/17 10:49:30 [stderr-redir-1]: INFO client.SparkClientImpl: at
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> 15/11/17 10:49:30 [stderr-redir-1]: INFO client.SparkClientImpl: at
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
> 15/11/17 10:49:30 [stderr-redir-1]: INFO client.SparkClientImpl: at
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
> 15/11/17 10:49:30 [stderr-redir-1]: INFO client.SparkClientImpl: at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
> 15/11/17 10:49:30 [stderr-redir-1]: INFO client.SparkClientImpl: at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 15/11/17 10:49:30 [stderr-redir-1]: INFO client.SparkClientImpl: at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 15/11/17 10:49:30 [stderr-redir-1]: INFO client.SparkClientImpl: at
> java.lang.Thread.run(Thread.java:745)
>
>
> I read something about the guava.jar missing but I am not sure how to fix
> it.
> I am using Spark 1.4.1, HBase 1.1.2 and Hive 1.2.1.
> Any help more than appreciated.
>
> Sofia
>


Re: Do you have more suggestions on when to use Hive on MapReduce or Hive on Spark?

2015-11-04 Thread Xuefu Zhang
Hi Jone,

Thanks for trying Hive on Spark. I don't know about your cluster, so I
cannot comment too much on your configurations. We do have a "Getting
Started" guide [1] which you may refer to. (We are currently updating the
document.) Your executor size (cores/memory) seems rather small and does not
align well with our guide.

To me, there is no reason to use MR unless you have encountered a potential
bug or problem, like #4 in your list. However, it would be great if you can
share more details on the problem. It could be just a matter of heap size,
for which you can increase your executor memory. (I do understand you have
some constraints on that.)

For #1, you probably need to increase one configuration,
hive.auto.convert.join.noconditionaltask.size, which is the threshold for
converting common join to map join based on statistics. Even though this
configuration is used for both Hive on MapReduce and Hive on Spark, it is
interpreted differently. There are two types of statistics about data size:
totalSize and rawDataSize. totalSize is approximately the data size on
disk, while rawDataSize is approximately the data size in memory. Hive on
MapReduce uses totalSize. When both are available, Hive on Spark will
choose rawDataSize. Because of possible compression and serialization,
there could be huge difference between totalSize and rawDataSize for the
same dataset. Thus, For Hive on Spark, you might need to specify a higher
value for the configuration in order to convert the same join to a map
join. Once a join is converted to map join for Spark, then better or
similar performance should be expected.
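
As a session-level illustration (the value is in bytes and purely
illustrative; size it against your executors' memory), raising the threshold
looks like this:

  set hive.auto.convert.join=true;
  set hive.auto.convert.join.noconditionaltask=true;
  set hive.auto.convert.join.noconditionaltask.size=100000000;  -- ~100MB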

Hope this helps.

Thanks,
Xuefu

[1]
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/admin_hos_config.html

On Wed, Nov 4, 2015 at 12:23 AM, Jone Zhang  wrote:

> Hi, Xuefu
>  we plan to move the Hive on MapReduce to Hive on Spark selectively.
> Because the configuration of the cluster's compute nodes is
> uneven, we chose the following configuration in the end.
>
> spark.dynamicAllocation.enabled true
> spark.shuffle.service.enabled   true
> spark.dynamicAllocation.minExecutors10
> spark.rdd.compress  true
>
> spark.executor.cores2
> spark.executor.memory   7000m
> spark.yarn.executor.memoryOverhead  1024
>
>  We sample-tested dozens of SQL queries running in production, expecting
> to find out which can run on MapReduce and which can run on Spark under
> the limited resources.
>  The following are our conclusions.
>  1. If the SQL does not contain a shuffle stage, use Hive on MapReduce,
> such as mapjoin and select * from table where...
>   2. For SQL that joins many times, such as
> select ... from table1 join table2 join table3, it is highly suitable for
> using Hive on Spark.
>   3. As to multi-insert, using Hive on Spark is much faster than using
> Hive on MapReduce.
>   4. It's possible to hit "Container killed by YARN for exceeding
> memory limits" when using large data which shuffles over 10T, so we don't
> advise using Hive on Spark there.
>
>  Do you have more suggestions on when to use Hive on MapReduce or Hive
> on Spark? Anyway , you are the writer. ☺
>
>   Best wishes!
>   Thank you!
>


Re: Hive on Spark NPE at org.apache.hadoop.hive.ql.io.HiveInputFormat

2015-11-03 Thread Xuefu Zhang
Yeah. it seems that the NPE is a result of the warning msg, missing map.xml
file. Not sure why, but I believe that Hortonworks doesn't support Hive on
Spark. You can try get a build from the master branch or try other
distributions.

Thanks,
Xuefu

On Mon, Nov 2, 2015 at 10:18 PM, Jagat Singh <jagatsi...@gmail.com> wrote:

> This is the virtual machine from Hortonworks.
>
> The query is this
>
> select count(*) from sample_07;
>
> It should run fine with MR.
>
> I am trying to run on Spark.
>
>
>
>
>
>
> On Tue, Nov 3, 2015 at 4:39 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>
>> That msg could be just noise. On the other hand, there is NPE, which
>> might be the problem you're having. Have you tried your query with
>> MapReduce?
>>
>> On Sun, Nov 1, 2015 at 5:32 PM, Jagat Singh <jagatsi...@gmail.com> wrote:
>>
>>> One interesting message here , *No plan file found: *
>>>
>>> 15/11/01 23:55:36 INFO exec.Utilities: No plan file found: hdfs://
>>> sandbox.hortonworks.com:8020/tmp/hive/root/119652ff-3158-4cce-b32d-b300bfead1bc/hive_2015-11-01_23-54-47_767_5715642849033319370-1/-mr-10003/40878ced-7985-40d9-9b1d-27f06acb1bef/map.xml
>>>
>>> Similar error message was here
>>> https://issues.apache.org/jira/browse/HIVE-7210
>>>
>>>
>>
>


Re: Hive on Spark NPE at org.apache.hadoop.hive.ql.io.HiveInputFormat

2015-11-02 Thread Xuefu Zhang
That msg could be just noise. On the other hand, there is NPE, which might
be the problem you're having. Have you tried your query with MapReduce?

On Sun, Nov 1, 2015 at 5:32 PM, Jagat Singh  wrote:

> One interesting message here , *No plan file found: *
>
> 15/11/01 23:55:36 INFO exec.Utilities: No plan file found: hdfs://
> sandbox.hortonworks.com:8020/tmp/hive/root/119652ff-3158-4cce-b32d-b300bfead1bc/hive_2015-11-01_23-54-47_767_5715642849033319370-1/-mr-10003/40878ced-7985-40d9-9b1d-27f06acb1bef/map.xml
> 
>
> Similar error message was here
> https://issues.apache.org/jira/browse/HIVE-7210
> 
>
>


Re: Hive on Spark

2015-10-23 Thread Xuefu Zhang
quick answers:
1. you can pretty much set any spark configuration at hive using set
command.
2. no. you have to make the call.
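
To illustrate #1, the usual Spark tuning knobs can be set from the Hive
session (or in hive-site.xml) like any other Hive property, for example:

  set spark.executor.memory=4g;
  set spark.executor.cores=2;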



On Thu, Oct 22, 2015 at 10:32 PM, Jone Zhang  wrote:

> 1.How can i set Storage Level when i use Hive on Spark?
> 2.Do Spark have any intention of  dynamically determined Hive on MapReduce
> or Hive on Spark, base on SQL features.
>
> Thanks in advance
> Best regards
>


Re: Hive on Spark

2015-10-23 Thread Xuefu Zhang
Yeah. for that, you cannot really cache anything through Hive on Spark.
Could you detail more what you want to achieve?

When needed, Hive on Spark uses memory+disk for storage level.

On Fri, Oct 23, 2015 at 4:29 AM, Jone Zhang <joyoungzh...@gmail.com> wrote:

> 1.But It's no way to set Storage Level through properties file in spark,
> Spark provided "def persist(newLevel: StorageLevel)"
> api only...
>
> 2015-10-23 19:03 GMT+08:00 Xuefu Zhang <xzh...@cloudera.com>:
>
>> quick answers:
>> 1. you can pretty much set any spark configuration at hive using set
>> command.
>> 2. no. you have to make the call.
>>
>>
>>
>> On Thu, Oct 22, 2015 at 10:32 PM, Jone Zhang <joyoungzh...@gmail.com>
>> wrote:
>>
>>> 1.How can i set Storage Level when i use Hive on Spark?
>>> 2.Do Spark have any intention of  dynamically determined Hive on
>>> MapReduce or Hive on Spark, base on SQL features.
>>>
>>> Thanks in advance
>>> Best regards
>>>
>>
>>
>


Re: Hive on Spark

2015-10-23 Thread Xuefu Zhang
you need to increase spark.yarn.executor.memoryOverhead. it has nothing to
do with storage layer.
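
For example, from the Hive session (the value is in MB and illustrative; a
common rule of thumb is 10-15% of the executor memory):

  set spark.yarn.executor.memoryOverhead=1024;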

--Xuefu

On Fri, Oct 23, 2015 at 4:49 AM, Jone Zhang <joyoungzh...@gmail.com> wrote:

> I get an the error every time while I run a query on a large data set. I
> think use MEMORY_AND_DISK can avoid this problem under the limited
> resources.
> "15/10/23 17:37:13 Reporter WARN
> org.apache.spark.deploy.yarn.YarnAllocator>> Container killed by YARN for
> exceeding memory limits. 7.6 GB of 7.5 GB physical memory used. Consider
> boosting spark.yarn.executor.memoryOverhead."
>
> 2015-10-23 19:40 GMT+08:00 Xuefu Zhang <xzh...@cloudera.com>:
>
>> Yeah. for that, you cannot really cache anything through Hive on Spark.
>> Could you detail more what you want to achieve?
>>
>> When needed, Hive on Spark uses memory+disk for storage level.
>>
>> On Fri, Oct 23, 2015 at 4:29 AM, Jone Zhang <joyoungzh...@gmail.com>
>> wrote:
>>
>>> 1.But It's no way to set Storage Level through properties file in
>>> spark, Spark provided "def persist(newLevel: StorageLevel)"
>>> api only...
>>>
>>> 2015-10-23 19:03 GMT+08:00 Xuefu Zhang <xzh...@cloudera.com>:
>>>
>>>> quick answers:
>>>> 1. you can pretty much set any spark configuration at hive using set
>>>> command.
>>>> 2. no. you have to make the call.
>>>>
>>>>
>>>>
>>>> On Thu, Oct 22, 2015 at 10:32 PM, Jone Zhang <joyoungzh...@gmail.com>
>>>> wrote:
>>>>
>>>>> 1.How can i set Storage Level when i use Hive on Spark?
>>>>> 2.Do Spark have any intention of  dynamically determined Hive on
>>>>> MapReduce or Hive on Spark, base on SQL features.
>>>>>
>>>>> Thanks in advance
>>>>> Best regards
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Hive and Spark on Windows

2015-10-20 Thread Xuefu Zhang
Yes. You need HADOOP_HOME, which tells Hive how to connect to HDFS and get
its dependent libraries there.
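
A minimal sketch for a cygwin shell, with purely illustrative install paths:

  export HADOOP_HOME=/cygdrive/c/hadoop-2.7.1
  export SPARK_HOME=/cygdrive/c/spark-1.5.2-bin-hadoop2.6
  export PATH=$PATH:$HADOOP_HOME/bin:$SPARK_HOME/bin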

On Tue, Oct 20, 2015 at 7:36 AM, Andrés Ivaldi <iaiva...@gmail.com> wrote:

> I've already installed cygwin, and configure spark_home, but when I tried
> to run ./hive , hive expected HADOOP_HOME.
> Does Hive needs hadoop always? or there are some configuration missing?
>
> Thanks
>
> On Mon, Oct 19, 2015 at 11:31 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>
>> Hi Andres,
>>
>> We haven't tested Hive on Spark on Windows. However, if you can get Hive
>> and Spark to work on Windows, I'd assume that the configuration is no
>> different from on Linux. Let's know if you encounter any specific problems.
>>
>> Thanks,
>> Xuefu
>>
>> On Mon, Oct 19, 2015 at 5:13 PM, Andrés Ivaldi <iaiva...@gmail.com>
>> wrote:
>>
>>> Hello, I would like to install Hive with Spark on Windows, I've already
>>> installed Spark, but I cant find a clear documentation on how to configure
>>> hive on windows with spark.
>>>
>>> Regards
>>>
>>>
>>>
>>>
>>> --
>>> Ing. Ivaldi Andres
>>>
>>
>>
>
>
> --
> Ing. Ivaldi Andres
>


Re: Hive and Spark on Windows

2015-10-20 Thread Xuefu Zhang
I have zero experience on Hadoop on windows. However, I assume you need
HDFS running somewhere at least.

On Tue, Oct 20, 2015 at 8:49 AM, Andrés Ivaldi <iaiva...@gmail.com> wrote:

> Thanks for the prompt response, so is it only needed for is libs?, I dont
> need to run hadoop?
>
> On Tue, Oct 20, 2015 at 11:46 AM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>
>> Yes. You need HADOOP_HOME, which tells Hive how to connect to HDFS and
>> get its dependent libraries there.
>>
>> On Tue, Oct 20, 2015 at 7:36 AM, Andrés Ivaldi <iaiva...@gmail.com>
>> wrote:
>>
>>> I've already installed cygwin, and configure spark_home, but when I
>>> tried to run ./hive , hive expected HADOOP_HOME.
>>> Does Hive needs hadoop always? or there are some configuration missing?
>>>
>>> Thanks
>>>
>>> On Mon, Oct 19, 2015 at 11:31 PM, Xuefu Zhang <xzh...@cloudera.com>
>>> wrote:
>>>
>>>> Hi Andres,
>>>>
>>>> We haven't tested Hive on Spark on Windows. However, if you can get
>>>> Hive and Spark to work on Windows, I'd assume that the configuration is no
>>>> different from on Linux. Let's know if you encounter any specific problems.
>>>>
>>>> Thanks,
>>>> Xuefu
>>>>
>>>> On Mon, Oct 19, 2015 at 5:13 PM, Andrés Ivaldi <iaiva...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello, I would like to install Hive with Spark on Windows, I've
>>>>> already installed Spark, but I cant find a clear documentation on how to
>>>>> configure hive on windows with spark.
>>>>>
>>>>> Regards
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ing. Ivaldi Andres
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Ing. Ivaldi Andres
>>>
>>
>>
>
>
> --
> Ing. Ivaldi Andres
>


Re: Alias vs Assignment

2015-10-08 Thread Xuefu Zhang
It looks to me that this adds only syntactic sugar which doesn't provide
much additional value. On the contrary, it might even bring confusion to
non-SQL-Server users. As you have already noted, it's not ISO standard.
Writing queries this way actually makes them less portable. Personally I'd
discourage such an addition.

Thanks,
Xuefu

On Thu, Oct 8, 2015 at 5:48 AM, Furcy Pin  wrote:

> Hi folks,
>
>
> I would like to start a discussion with the Hive user and developper
> community about an element of syntax present in SQL Server that could be
> nice to have in Hive.
>
>
> Back in 2012, before I started Hive, and was using SQL Server, I came
> accross this post :
>
>
> http://sqlblog.com/blogs/aaron_bertrand/archive/2012/01/23/bad-habits-to-kick-using-as-instead-of-for-column-aliases.aspx
>
> that convinced me to write my queries like
>
> #1
> SELECT
> myColumn = someFunction(someColumn),
> myOtherColumn = someOtherFunction(someOtherColumn)
> FROM ...
>
> rather than
>
> #2
> SELECT
> someFunction(someColumn) as myColumn
> someOtherFunction(someOtherColumn) as myOtherColumn
> FROM ...
>
> The two syntaxes are equivalent in SQL Server, but only the second is
> allowed in Hive.
>
> In my opinion, there are two advantages of using #1 over #2 (and it seems
> the blog post I mention above only mentions the first) :
>
>1. Readability: usually the name of the columns you are computing
>matters more than how you compute them.
>2. Updates: #1 can easily be transformed into an update query, #2
>requires some rewriting (thank god I discovered Sublime Text and its
>multi-line editing)
>
>
> On the other side, #1 is unfortunately not ISO compliant, even though IMHO
> ISO did not pick the best choice this time... Besides, it would not be
> Hive's first deviation from ISO.
>
> I would like to hear what do you people think, would it be a good idea to
> implement this in Hive?
>
> Cheers,
>
> Furcy
>
>


Re: hive on spark query error

2015-09-25 Thread Xuefu Zhang
What's the value of spark.master in your case? The error specifically says
something wrong with it.
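
For reference, spark.master can be set either per session with "set
spark.master=...;" or in hive-site.xml, for example (yarn-cluster shown only
as an illustration):

  <property>
    <name>spark.master</name>
    <value>yarn-cluster</value>
  </property>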

--Xuefu

On Fri, Sep 25, 2015 at 9:18 AM, Garry Chen  wrote:

> Hi All,
>
> I am following
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started?
> To setup hive on spark.  After setup/configuration everything startup I am
> able to show tables but when executing sql statement within beeline I got
> error.  Please help and thank you very much.
>
>
>
> Cluster Environment (3 nodes) as following
>
> hadoop-2.7.1
>
> spark-1.4.1-bin-hadoop2.6
>
> zookeeper-3.4.6
>
> apache-hive-1.2.1-bin
>
>
>
> Error from hive log:
>
> 2015-09-25 11:51:03,123 INFO  [HiveServer2-Handler-Pool: Thread-50]:
> client.SparkClientImpl (SparkClientImpl.java:startDriver(375)) - Attempting
> impersonation of oracle
>
> 2015-09-25 11:51:03,133 INFO  [HiveServer2-Handler-Pool: Thread-50]:
> client.SparkClientImpl (SparkClientImpl.java:startDriver(409)) - Running
> client driver with argv:
> /u01/app/spark-1.4.1-bin-hadoop2.6/bin/spark-submit --proxy-user oracle
> --properties-file /tmp/spark-submit.840692098393819749.properties --class
> org.apache.hive.spark.client.RemoteDriver
> /u01/app/apache-hive-1.2.1-bin/lib/hive-exec-1.2.1.jar --remote-host
> ip-10-92-82-229.ec2.internal --remote-port 40476 --conf
> hive.spark.client.connect.timeout=1000 --conf
> hive.spark.client.server.connect.timeout=9 --conf
> hive.spark.client.channel.log.level=null --conf
> hive.spark.client.rpc.max.size=52428800 --conf
> hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256
>
> 2015-09-25 11:51:03,867 INFO  [stderr-redir-1]: client.SparkClientImpl
> (SparkClientImpl.java:run(569)) - Warning: Ignoring non-spark config
> property: hive.spark.client.server.connect.timeout=9
>
> 2015-09-25 11:51:03,868 INFO  [stderr-redir-1]: client.SparkClientImpl
> (SparkClientImpl.java:run(569)) - Warning: Ignoring non-spark config
> property: hive.spark.client.rpc.threads=8
>
> 2015-09-25 11:51:03,868 INFO  [stderr-redir-1]: client.SparkClientImpl
> (SparkClientImpl.java:run(569)) - Warning: Ignoring non-spark config
> property: hive.spark.client.connect.timeout=1000
>
> 2015-09-25 11:51:03,868 INFO  [stderr-redir-1]: client.SparkClientImpl
> (SparkClientImpl.java:run(569)) - Warning: Ignoring non-spark config
> property: hive.spark.client.secret.bits=256
>
> 2015-09-25 11:51:03,868 INFO  [stderr-redir-1]: client.SparkClientImpl
> (SparkClientImpl.java:run(569)) - Warning: Ignoring non-spark config
> property: hive.spark.client.rpc.max.size=52428800
>
> 2015-09-25 11:51:03,876 INFO  [stderr-redir-1]: client.SparkClientImpl
> (SparkClientImpl.java:run(569)) - Error: Master must start with yarn,
> spark, mesos, or local
>
> 2015-09-25 11:51:03,876 INFO  [stderr-redir-1]: client.SparkClientImpl
> (SparkClientImpl.java:run(569)) - Run with --help for usage help or
> --verbose for debug output
>
> 2015-09-25 11:51:03,885 INFO  [stderr-redir-1]: client.SparkClientImpl
> (SparkClientImpl.java:run(569)) - 15/09/25 11:51:03 INFO util.Utils:
> Shutdown hook called
>
> 2015-09-25 11:51:03,889 WARN  [Driver]: client.SparkClientImpl
> (SparkClientImpl.java:run(427)) - Child process exited with code 1.
>
>
>


Re: [ANNOUNCE] New Hive PMC Chair - Ashutosh Chauhan

2015-09-16 Thread Xuefu Zhang
Congratulations, Ashutosh!. Well-deserved.

Thanks to Carl also for the hard work in the past few years!

--Xuefu

On Wed, Sep 16, 2015 at 12:39 PM, Carl Steinbach  wrote:

> I am very happy to announce that Ashutosh Chauhan is taking over as the
> new VP of the Apache Hive project. Ashutosh has been a longtime contributor
> to Hive and has played a pivotal role in many of the major advances that
> have been made over the past couple of years. Please join me in
> congratulating Ashutosh on his new role!
>


Re: Hive on Spark on Mesos

2015-09-09 Thread Xuefu Zhang
Mesos isn't supported for Hive on Spark. We have never attempted to run
against it.

--Xuefu

On Wed, Sep 9, 2015 at 6:12 AM, John Omernik  wrote:

> In the docs for Hive on Spark, it appears to have instructions only for
> Yarn.  Will there be instructions or the ability to run hive on spark with
> Mesos implementations of spark?  Is it possible now and just not
> documented? What are the issues in running it this way?
>
> John
>


Re: Hive on Spark

2015-08-31 Thread Xuefu Zhang
What you described isn't part of the functionality of Hive on Spark.
Rather, Spark is used here as a general-purpose engine similar to MR but
without intermediate stages. It's batch-oriented.

Keeping 100T data in memory is hardly beneficial unless you know that that
dataset is going to be used in subsequent queries.

For loading data in memory and providing near real-time response, you might
want to look at some memory-based DBs.

Thanks,
Xuefu

On Thu, Aug 27, 2015 at 9:11 AM, Patrick McAnneny <
patrick.mcann...@leadkarma.com> wrote:

> Once I get "hive.execution.engine=spark" working, how would I go about
> loading portions of my data into memory? Lets say I have a 100TB database
> and want to load all of last weeks data in spark memory, is this possible
> or even beneficial? Or am I thinking about hive on spark in the wrong way.
>
> I also assume hive on spark could get me to near-real-time capabilities
> for large queries. Is this true?
>


Hive User Group Meeting Singapore

2015-08-31 Thread Xuefu Zhang
Dear Hive users,

Hive community is considering a user group meeting during Hadoop World that
will be held in Singapore [1] Dec 1-3, 2015. As I understand, this will be
the first time that this meeting ever happens in Asia Pacific even though
there is a large user base in that region. As another good news, the
conference organiser is able to provide the venue for this meeting.

However, before I set up a meetup event here [2] to formally announce this,
I'd like to check if there is enough interest from the user community. At
the same time, I will also need to solicit talks from users as well as
developers. Thus, please also let me know if you like to give a short talk.

Your comments and suggestions are greatly appreciated.

Sincerely,
Xuefu

[1] http://strataconf.com/big-data-conference-sg-2015
[2] http://www.meetup.com/Hive-User-Group-Meeting/


Re: HIVE:1.2, Query taking huge time

2015-08-20 Thread Xuefu Zhang
Please check out HIVE-11502. For your POC, you can simply get around it by
using other data types instead of double.
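
A purely illustrative way to apply that suggestion is to keep the same DDL
but declare the columns as decimal instead of double, e.g.:

  create table huge_numeric_table_orc2_dec (
    col1 decimal(20,10), col2 decimal(20,10) -- ... remaining columns likewise
  )
  clustered by (col1) sorted by (col1) into 240 buckets
  stored as orc tblproperties ('orc.compress'='SNAPPY');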

On Thu, Aug 20, 2015 at 2:08 AM, Nishant Aggarwal nishant@gmail.com
wrote:

 Thanks for the reply Noam. I have already tried the later point of
 dividing the query. But the challenge comes during the joining of the table.


 Thanks and Regards
 Nishant Aggarwal, PMP
 Cell No:- +91 99588 94305


 On Thu, Aug 20, 2015 at 2:19 PM, Noam Hasson noam.has...@kenshoo.com
 wrote:

 Hi,

 Have you look at counters in Hadoop side? It's possible you are dealing
 with a bad join which causes multiplication of items, if you see huge
 number of record input/output in map/reduce phase and keeps increasing
 that's probably the case.

 Another thing I would try is to divide the job into several different
 smaller queries, for example start with filter only, after than join and so
 on.

 Noam.

 On Thu, Aug 20, 2015 at 10:55 AM, Nishant Aggarwal nishant@gmail.com
  wrote:

 Dear Hive Users,

 I am in process of running over a poc to one of my customer
 demonstrating the huge performance benefits of Hadoop BigData using Hive.

 Following is the problem statement i am stuck with.

 I have generate a large table with 28 columns( all are double). Table
 size on disk is 70GB (i ultimately created compressed table using ORC
 format to save disk space bringing down the table size to  1GB) with more
 than 450Million records.

 In order to demonstrate a complex use case i joined this table with
 itself. Following are the queries i have used to create table and  join
 query i am using.

 *Create Table and Loading Data, Hive parameters settigs:*
 set hive.vectorized.execution.enabled = true;
 set hive.vectorized.execution.reduce.enabled = true;
 set mapred.max.split.size=1;
 set mapred.min.split.size=100;
 set hive.auto.convert.join=false;
 set hive.enforce.sorting=true;
 set hive.enforce.bucketing=true;
 set hive.exec.dynamic.partition=true;
 set hive.exec.dynamic.partition.mode=nonstrict;
 set mapreduce.reduce.input.limit=-1;
 set hive.exec.parallel = true;

 CREATE TABLE huge_numeric_table_orc2(col1 double,col2 double,col3
 double,col4 double,col5 double,col6 double,col7 double,col8 double,col9
 double,col10 double,col11 double,col12 double,col13 double,col14
 double,col15 double,col16 double,col17 double,col18 double,col19
 double,col20 double,col21 double,col22 double,col23 double,col24
 double,col25 double,col26 double,col27 double,col28 double)
 clustered by (col1) sorted by (col1) into 240 buckets
 STORED AS ORC tblproperties (orc.compress=SNAPPY);

 from huge_numeric_table insert overwrite table huge_numeric_table_orc2
 select * sort by col1;


 *JOIN QUERY:*

 select (avg(t1.col1)*avg(t1.col6))/(avg(t1.col11)*avg(t1.col16)) as AVG5
 from huge_numeric_table_orc2 t1 left outer join huge_numeric_table_orc2 t2
 on t1.col1=t2.col1 where (t1.col1)  34.11 and (t2.col1) 10.12


 *The problem is that this query gets stuck at reducers :80-85%. and goes
 in a loop and never finishes. *

 Version of Hive is 1.2.

 Please help.


 Thanks and Regards
 Nishant Aggarwal, PMP
 Cell No:- +91 99588 94305



 This e-mail, as well as any attached document, may contain material which
 is confidential and privileged and may include trademark, copyright and
 other intellectual property rights that are proprietary to Kenshoo Ltd,
  its subsidiaries or affiliates (Kenshoo). This e-mail and its
 attachments may be read, copied and used only by the addressee for the
 purpose(s) for which it was disclosed herein. If you have received it in
 error, please destroy the message and any attachment, and contact us
 immediately. If you are not the intended recipient, be aware that any
 review, reliance, disclosure, copying, distribution or use of the contents
 of this message without Kenshoo's express permission is strictly prohibited.





Re: Request write access to the Hive wiki

2015-08-10 Thread Xuefu Zhang
Done!

On Mon, Aug 10, 2015 at 1:05 AM, Xu, Cheng A cheng.a...@intel.com wrote:

 Hi,

 I’d like to have write access to the Hive wiki. My Confluence username is
 cheng.a...@intel.com with Full Name “Ferdinand Xu”. Please help me deal
 with it. Thank you!



 Regards,

 Ferdinand Xu





Re: Request write access to the Hive wiki

2015-08-10 Thread Xuefu Zhang
I couldn't find your user id based on either your name or email address.
You probably need to register there first.

On Mon, Aug 10, 2015 at 12:41 PM, kulkarni.swar...@gmail.com 
kulkarni.swar...@gmail.com wrote:

 @Xuefu While you are already at it, would you mind giving me this access
 too? :)

 Thanks,

 On Mon, Aug 10, 2015 at 2:37 PM, Xuefu Zhang xzh...@cloudera.com wrote:

 Done!

 On Mon, Aug 10, 2015 at 1:05 AM, Xu, Cheng A cheng.a...@intel.com
 wrote:

 Hi,

 I’d like to have write access to the Hive wiki. My Confluence username
 is cheng.a...@intel.com with Full Name “Ferdinand Xu”. Please help me
 deal with it. Thank you!



 Regards,

 Ferdinand Xu







 --
 Swarnim



Re: Computation timeout

2015-07-29 Thread Xuefu Zhang
this works for me:
In hive-site.xml:
  1. hive.server2.session.check.interval=3000;
  2. hive.server2.idle.operation.timeout=-3;
restart HiveServer2.
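
Spelled out in hive-site.xml (values are in milliseconds; the negative
operation timeout is what makes the check apply to still-running operations,
and -30000 is shown only as an illustrative 30-second value):

  <property>
    <name>hive.server2.session.check.interval</name>
    <value>3000</value>
  </property>
  <property>
    <name>hive.server2.idle.operation.timeout</name>
    <value>-30000</value>
  </property>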

at beeline, I do analyze table X compute statistics for columns, which
takes longer than 30s. it was aborted by HS2 because of above settings. I
guess it didn't work for you because you didn't have #1.

--Xuefu

On Wed, Jul 29, 2015 at 9:23 AM, Loïc Chanel loic.cha...@telecomnancy.net
wrote:

 I don't think your solution works, as after more than 4 minutes I could
 still see logs of my job showing that it was running.
 Do you have a way to check that even if the job was running, it was not
 being killed by Hive ?
 Or another solution ?

 Thanks for your help,


 Loïc

 Loïc CHANEL
 Engineering student at TELECOM Nancy
 Trainee at Worldline - Villeurbanne

 2015-07-29 16:26 GMT+02:00 Loïc Chanel loic.cha...@telecomnancy.net:

 Yes, I set it to negative 60.

 It's not a problem if the session is killed. That's actually what I try
 to do, because I can't allow to a user to try to end an infinite request.
 Therefore I'll try your solution :)

 Thanks,


 Loïc

 Loïc CHANEL
 Engineering student at TELECOM Nancy
 Trainee at Worldline - Villeurbanne

 2015-07-29 16:14 GMT+02:00 Xuefu Zhang xzh...@cloudera.com:

 Okay. To confirm, you set it to negative 60s?

 The next thing you can try is to set
  hive.server2.idle.session.timeout=60000 (60 sec) and
 hive.server2.idle.session.check.operation=false. I'm pretty sure this
 works, but the user's session will be killed though.

 --Xuefu

 On Wed, Jul 29, 2015 at 7:02 AM, Loïc Chanel 
 loic.cha...@telecomnancy.net wrote:

 I confirm : I just tried hive.server2.idle.operation.timeout setting it
 to -60 (seconds), but my veery slow job have not been killed. The issue
 here is what if another user come and try to submit a MapReduce job but
 the cluster is stuck in an infinite loop ?.

 Do you or anyone else have another idea ?
 Thanks,


 Loïc

 Loïc CHANEL
 Engineering student at TELECOM Nancy
 Trainee at Worldline - Villeurbanne

 2015-07-29 15:34 GMT+02:00 Loïc Chanel loic.cha...@telecomnancy.net:

 No, because I thought the idea of infinite operation was not very
 compatible with the idle word (as the operation will not stop running),
 but I'll try :-)
 Thanks for the idea,


 Loïc

 Loïc CHANEL
 Engineering student at TELECOM Nancy
 Trainee at Worldline - Villeurbanne

 2015-07-29 15:27 GMT+02:00 Xuefu Zhang xzh...@cloudera.com:

 Have you tried hive.server2.idle.operation.timeout?

 --Xuefu

 On Wed, Jul 29, 2015 at 5:52 AM, Loïc Chanel 
 loic.cha...@telecomnancy.net wrote:

 Hi all,

 As I'm trying to build a secured and multi-tenant Hadoop cluster
 with Hive, I am desperately trying to set a timeout to Hive requests.
 My idea is that some users can make mistakes such as a join with
 wrong keys, and therefore start an infinite loop believing that they are
 just launching a very heavy job. Therefore, I'd like to set a limit to 
 the
 time a request should take, in order to kill the job automatically if it
 exceeds it.

 As such a notion cannot be set directly in YARN, I saw that
 MapReduce2 provides with its own native timeout property, and I would 
 like
 to know if Hive provides with the same property someway.

 Did anyone heard about such a thing ?

 Thanks in advance for your help,


 Loïc

 Loïc CHANEL
 Engineering student at TELECOM Nancy
 Trainee at Worldline - Villeurbanne










Re: Computation timeout

2015-07-29 Thread Xuefu Zhang
Have you tried hive.server2.idle.operation.timeout?

--Xuefu

On Wed, Jul 29, 2015 at 5:52 AM, Loïc Chanel loic.cha...@telecomnancy.net
wrote:

 Hi all,

 As I'm trying to build a secured and multi-tenant Hadoop cluster with
 Hive, I am desperately trying to set a timeout to Hive requests.
 My idea is that some users can make mistakes such as a join with wrong
 keys, and therefore start an infinite loop believing that they are just
 launching a very heavy job. Therefore, I'd like to set a limit to the time
 a request should take, in order to kill the job automatically if it exceeds
 it.

 As such a notion cannot be set directly in YARN, I saw that MapReduce2
 provides with its own native timeout property, and I would like to know if
 Hive provides with the same property someway.

 Did anyone heard about such a thing ?

 Thanks in advance for your help,


 Loïc

 Loïc CHANEL
 Engineering student at TELECOM Nancy
 Trainee at Worldline - Villeurbanne



Re: Obtain user identity in UDF

2015-07-27 Thread Xuefu Zhang
There is a UDF, current_user, which returns a value that can be passed to
your UDF as an input, right?
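
A hypothetical sketch (my_udf is a made-up name) of wiring the caller's
identity into a UDF as an ordinary argument:

  SELECT my_udf(current_user(), some_col) FROM some_table;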

On Mon, Jul 27, 2015 at 1:13 PM, Adeel Qureshi adeelmahm...@gmail.com
wrote:

 Is there a way to obtain user authentication information in a UDF like
 kerberos username that they have logged in with to execute a hive query.

 I would appreciate any help.

 Thanks
 Adeel



Re: Error: java.lang.RuntimeException: org.apache.hive.com/esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 380

2015-07-16 Thread Xuefu Zhang
Same as https://issues.apache.org/jira/browse/HIVE-11269?

On Thu, Jul 16, 2015 at 7:25 AM, Anupam sinha ak3...@gmail.com wrote:

 Hi Guys,

 I am writing the simple hive query,Receiving the following error
 intermittently. This error
 presents itself for 30min-2hr then goes away.

 Appreciate your help to resolve this issue.

 Error: java.lang.RuntimeException:
 org.apache.hive.com/esotericsoftware.kryo.KryoException:
 Encountered unregistered class ID: 380

 on hive server the following Hive jar is installed:

 i have using kryo version 2.22
 hive-exec-0.13.1-cdh5.2.1.jar


 Task with the most failures(4):
 -
 Task ID:
   task_1436470122113_0171_m_14

 URL:

 http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1436470122113_0171tipid=task_1436470122113_0171_m_14
 -
 Diagnostic Messages for this Task:
 Error: java.lang.RuntimeException:
 org.apache.hive.com.esotericsoftware.kryo.KryoException: Encountered
 unregistered class ID: -245153628
 Serialization trace:
 startTimes (org.apache.hadoop.hive.ql.log.PerfLogger)
 perfLogger (org.apache.hadoop.hive.ql.exec.MapJoinOperator)
 childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator)
 aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)
 at org.apache.hadoop.hive.ql.exec.Utilities.getBaseWork(Utilities.java:364)
 at org.apache.hadoop.hive.ql.exec.Utilities.getMapWork(Utilities.java:275)
 at
 org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:254)
 at
 org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:440)
 at
 org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:433)
 at
 org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:587)
 at
 org.apache.hadoop.mapred.MapTask$TrackedRecordReader.init(MapTask.java:169)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
 Caused by: org.apache.hive.com.esotericsoftware.kryo.KryoException:
 Encountered unregistered class ID: -245153628
 Serialization trace:
 startTimes (org.apache.hadoop.hive.ql.log.PerfLogger)
 perfLogger (org.apache.hadoop.hive.ql.exec.MapJoinOperator)
 childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator)
 aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)
 at
 org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:656)
 at
 org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:99)
 at
 org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at
 org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at
 org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at
 org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
 at
 org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
 at
 org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at
 org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at
 org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at
 org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
 at
 org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:139)
 at
 org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at
 org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at
 org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:672)
 at
 org.apache.hadoop.hive.ql.exec.Utilities.deserializeObjectByKryo(Utilities.java:918)
 at
 org.apache.hadoop.hive.ql.exec.Utilities.deserializePlan(Utilities.java:826)
 at
 org.apache.hadoop.hive.ql.exec.Utilities.deserializePlan(Utilities.java:840)
 at org.apache.hadoop.hive.ql.exec.Utilities.getBaseWork(Utilities.java:333)
 ... 13 more

 Container killed by the ApplicationMaster.
 

Re: EXPORTing multiple partitions

2015-06-25 Thread Xuefu Zhang
Hi Brian,

If you think that is useful, please feel free to create a JIRA requesting
for it.

Thanks,
Xuefu

On Thu, Jun 25, 2015 at 10:36 AM, Brian Jeltema 
brian.jelt...@digitalenvoy.net wrote:

 Answering my own question:

   create table foo_copy like foo;
   insert into foo_copy partition (id) select * from foo where id in
 (1,2,3);
   export table foo_copy to ‘path’;
   drop table foo_copy;

 It would be nice if export could do this automatically, though.

 Brian

 On Jun 25, 2015, at 11:34 AM, Brian Jeltema 
 brian.jelt...@digitalenvoy.net wrote:

  Using Hive .13, I would like to export multiple partitions of a table,
 something conceptually like:
 
EXPORT TABLE foo PARTITION (id=1,2,3) to ‘path’
 
  Is there any way to accomplish this?
 
  Brian




Re: Error using UNION ALL operator on tables of different storage format !!!

2015-06-18 Thread Xuefu Zhang
Sounds like a bug. However, could you reproduce with the latest Hive code?

--Xuefu

On Thu, Jun 18, 2015 at 8:56 PM, @Sanjiv Singh sanjiv.is...@gmail.com
wrote:

 Hi All

 I was trying to combine records of two tables using UNION ALL.
 One table testTableText is on TEXT format and another table testTableORC
 is on ORC format. It is failing with given error.
 It seems error related to input format.

 Is it bug ? or ..


 See the given scenario :



 *Hive Version  : 1.0.0-- Create TEXT Table*
 create table testTableText(id int,name string)row format delimited fields
 terminated by ',';

 *-- Create ORC Table*
 create table testTableORC(id int ,name string ) clustered by (id) into 2
 buckets stored as orc TBLPROPERTIES('transactional'='true');

 *-- query with UNION *
 SELECT * FROM testTableORC
 UNION ALL
 SELECT * FROM testTableText ;

 *-- Error : *
 Query ID = cloud_20150618225656_fbad7df0-9063-478e-8b6e-f0631d9978e6
 Total jobs = 1
 Launching Job 1 out of 1
 Number of reduce tasks is set to 0 since there's no reduce operator
 java.lang.NullPointerException
 at
 org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:265)
 at
 org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getCombineSplits(CombineHiveInputFormat.java:272)
 at
 org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:509)
 at
 org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:624)
 at
 org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:616)
 at
 org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
 at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
 at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
 at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
 at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
 at
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
 at
 org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:429)
 at
 org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:137)
 at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
 at
 org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
 at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1604)
 at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1364)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1177)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1004)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:994)
 at
 org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:201)
 at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:153)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:364)
 at
 org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:712)
 at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:631)
 at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:570)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
 Job Submission failed with exception 'java.lang.NullPointerException(null)'
 FAILED: Execution Error, return code 1 from
 org.apache.hadoop.hive.ql.exec.mr.MapRedTask



 Regards
 Sanjiv Singh
 Mob :  +091 9990-447-339



Hosting Hive User Group Meeting During Hadoop World NY

2015-06-10 Thread Xuefu Zhang
Dear Hive users,

Hive community is considering a user group meeting during Hadoop World that
will be held in New York at the end of September. To make this happen, your
support is essential. First, I'm wondering if any user in New York area
would be willing to host the meetup. Secondly, I'm soliciting talks from
users as well as developers, and so please propose or share your thoughts
on the contents of the meetup.

I will soon set up a meetup event to  formally announce this. In the
meantime, your suggestions, comments, and kind assistance are greatly
appreciated.

Sincerely,
 Xuefu


Re: Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-27 Thread Xuefu Zhang
I'm afraid you're at the wrong community. You might have a better chance to
get an answer in Spark community.

Thanks,
Xuefu

On Wed, May 27, 2015 at 5:44 PM, Sanjay Subramanian 
sanjaysubraman...@yahoo.com wrote:

 hey guys

 On the Hive/Hadoop ecosystem, where we are using Cloudera distribution CDH 5.2.x,
 there are about 300+ hive tables.
 The data is stored as text (moving slowly to Parquet) on HDFS.
 I want to use SparkSQL and point to the Hive metadata and be able to
 define JOINS etc using a programming structure like this

 import org.apache.spark.sql.hive.HiveContext
 val sqlContext = new HiveContext(sc)
 val schemaRdd = sqlContext.sql(some complex SQL)


 Is that the way to go ? Some guidance will be great.

 thanks

 sanjay






Re: Hive on Spark VS Spark SQL

2015-05-22 Thread Xuefu Zhang
Hi Cheolsoo,

Thanks for the correction. I took that for granted and didn't actually
check the code to verify. Yes, from the Spark version (1.2), I did see
their parser etc. Below is a portion of the README from Spark's sql package
for reference.

Thanks,
Xuefu

Spark SQL is broken up into four subprojects:
 - Catalyst (sql/catalyst) - An implementation-agnostic framework for
manipulating trees of relational operators and expressions.
 - Execution (sql/core) - A query planner / execution engine for
translating Catalyst’s logical query plans into Spark RDDs.  This component
also includes a new public interface, SQLContext, that allows users to
execute SQL or LINQ statements against existing RDDs and Parquet files.
 - Hive Support (sql/hive) - Includes an extension of SQLContext called
HiveContext that allows users to write queries using *a subset of HiveQL*
and access data from a Hive Metastore using Hive SerDes.  There are also
wrappers that allows users to run queries that include Hive UDFs, UDAFs,
and UDTFs.
 - HiveServer and CLI support (sql/hive-thriftserver) - Includes support
for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC)
compatible server.


On Thu, May 21, 2015 at 10:31 PM, Cheolsoo Park piaozhe...@gmail.com
wrote:

 Hi Xuefu,

 Thanks for the good comparison. I agree with most points, but #1 isn't
 true.

 SparkSQL has its own parser (implemented with Scala parser combinator
 library), analyzer, and optimizer, although they're not as mature as Hive's.
 What it depends on Hive for is Metastore, CliDriver, DDL parser, etc.

 Cheolsoo

 On Wed, May 20, 2015 at 10:45 AM, Xuefu Zhang xzh...@cloudera.com wrote:

 I have been working on Hive on Spark, and know a little about SparkSQL.
 Here are a few factors to be considered:

 1. SparkSQL is similar to Shark (discontinued) in that it clones Hive's
 front end (parser and semantic analyzer) and metastore, and injects a layer
 in between where Hive's operator tree is reinterpreted in Spark's
 constructs (transformations and actions). Thus, it's tied to a specific
 version of Hive, which is always behind official Hive releases.
 2. Because of the reinterpretation, many features (window functions,
 lateral views, etc) from Hive need to be reimplemented in Spark world. If
 an implementation hasn't been done, you see a gap. That's why you would
 expect functional disparity, not to mention future Hive features.
 3. SparkSQL is far from production ready.
 4. On the other hand, Hive on Spark is native in Hive, embracing all Hive
 features and growing with Hive. Hive's operators are honored without
 re-interpretation. The integration is done at the execution layer, where
 Spark is nothing but an advanced MapReduce engine.
 5. Hive is aiming at enterprise use cases, where there are more important
 concerns such as security than purely if it works or if it runs fast. Hive
 on Spark certainly makes the query run faster, but still keeps the same
 enterprise-readiness.
 6. SparkSQL is a good fit if you're a heavy Spark user who occasionally
 needs to run some SQL. Or you're a casual SQL user and like to try
 something new.
 7. If you haven't touched either Spark or Hive, I'd suggest you start with
 Hive, especially for an enterprise.
 8. If you're an existing Hive user and consider taking advantage of
 Spark, consider Hive on Spark.
 9. It's strongly discouraged to mix Hive and SparkSQL in your deployment.
 SparkSQL includes a version of Hive, which is very likely at a different
 version of the Hive that you have (even if you don't use Hive on Spark).
 Library conflicts can put you in a nightmare.
 10. I haven't benchmarked SparkSQL myself, but I heard several reports
 that SparkSQL, when tried at scale, either runs fast or fails your
 queries.

 Hope this helps.

 Thanks,


 On Tue, May 19, 2015 at 10:38 PM, guoqing0...@yahoo.com.hk 
 guoqing0...@yahoo.com.hk wrote:

 Hive on Spark and SparkSQL which should be better , and what are the key
 characteristics and the advantages and the disadvantages between ?

 --
 guoqing0...@yahoo.com.hk






Re: Hive on Spark VS Spark SQL

2015-05-20 Thread Xuefu Zhang
I have been working on Hive on Spark, and know a little about SparkSQL.
Here are a few factors to be considered:

1. SparkSQL is similar to Shark (discontinued) in that it clones Hive's
front end (parser and semantic analyzer) and metastore, and injects a layer
in between where Hive's operator tree is reinterpreted in Spark's
constructs (transformations and actions). Thus, it's tied to a specific
version of Hive, which is always behind official Hive releases.
2. Because of the reinterpretation, many features (window functions,
lateral views, etc) from Hive need to be reimplemented in Spark world. If
an implementation hasn't been done, you see a gap. That's why you would
expect functional disparity, not to mention future Hive features.
3. SparkSQL is far from production ready.
4. On the other hand, Hive on Spark is native in Hive, embracing all Hive
features and growing with Hive. Hive's operators are honored without
re-interpretation. The integration is done at the execution layer, where
Spark is nothing but an advanced MapReduce engine.
5. Hive is aiming at enterprise use cases, where there are more important
concerns such as security than purely if it works or if it runs fast. Hive
on Spark certainly makes the query run faster, but still keeps the same
enterprise-readiness.
6. SparkSQL is a good fit if you're a heavy Spark user who occasionally
needs to run some SQL. Or you're a casual SQL user and like to try
something new.
7. If you haven't touched either Spark or Hive, I'd suggest you start with
Hive, especially for an enterprise.
8. If you're an existing Hive user and consider taking advantage of Spark,
consider Hive on Spark.
9. It's strongly discouraged to mix Hive and SparkSQL in your deployment.
SparkSQL includes a version of Hive, which is very likely at a different
version of the Hive that you have (even if you don't use Hive on Spark).
Library conflicts can put you in a nightmare.
10. I haven't benchmarked SparkSQL myself, but I heard several reports that
SparkSQL, when tried at scale, either runs fast or fails your queries.

Hope this helps.

Thanks,


On Tue, May 19, 2015 at 10:38 PM, guoqing0...@yahoo.com.hk 
guoqing0...@yahoo.com.hk wrote:

 Hive on Spark and SparkSQL which should be better , and what are the key
 characteristics and the advantages and the disadvantages between ?

 --
 guoqing0...@yahoo.com.hk



Re: Repeated Hive start-up issues

2015-05-15 Thread Xuefu Zhang
Your namenode is in safe mode, as the exception shows. You need to
verify/fix that before trying Hive.

Secondly, != may not work as expected. Try <> or another, simpler query
first.
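
For example, a minimal sanity check along these lines (table and column names
are hypothetical placeholders):

  SELECT * FROM sometable LIMIT 10;
  SELECT * FROM sometable WHERE somecol <> 'somevalue';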

--Xuefu

On Fri, May 15, 2015 at 6:17 AM, Anand Murali anand_vi...@yahoo.com wrote:

 Hi All:

 I have installed Hadoop-2.6, Hive 1.1 and try to start hive and get the
 following, first time when I start the cluster

 $hive

 Logging initialized using configuration in
 jar:file:/home/anand_vihar/hive-1.1.0/lib/hive-common-1.1.0.jar!/hive-log4j.properties
 SLF4J: Class path contains multiple SLF4J bindings.
 SLF4J: Found binding in
 [jar:file:/home/anand_vihar/hive-1.1.0/lib/hive-jdbc-1.1.0-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in
 [jar:file:/home/anand_vihar/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
 explanation.
 SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
 Exception in thread main java.lang.RuntimeException:
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException):
 Cannot create directory
 /tmp/hive/anand_vihar/a9d68b70-01b4-4d4d-9d06-1f86efc3b2bc. Name node is in
 safe mode.
 The reported blocks 2 has reached the threshold 0.9990 of total blocks 2.
 The number of live datanodes 1 has reached the minimum number 0. In safe
 mode extension. Safe mode will be turned off automatically in 13 seconds.
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1364)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4216)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4191)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
 at
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
 at
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 at
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

 at
 org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:472)
 at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
 at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
 Caused by:
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException):
 Cannot create directory
 /tmp/hive/anand_vihar/a9d68b70-01b4-4d4d-9d06-1f86efc3b2bc. Name node is in
 safe mode.
 The reported blocks 2 has reached the threshold 0.9990 of total blocks 2.
 The number of live datanodes 1 has reached the minimum number 0. In safe
 mode extension. Safe mode will be turned off automatically in 13 seconds.
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1364)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4216)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4191)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
 at
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
 at
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 at
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
 at java.security.AccessController.doPrivileged(Native Method)
   

Re: Table Lock Manager: ZooKeeper cluster

2015-04-20 Thread Xuefu Zhang
I'm not a zookeeper expert, but zookeeper is supposed to be characterized
by light weight, high performance, and fast response. Unless your zookeeper
is already overloaded, I don't see why you would need a separate zookeeper
cluster just for Hive.

There are a few zookeeper usages in Hive, the additional stress on
zookeeper is determined by the load on your HS2. As most of the time user
sessions are waiting on query execution, I don't expect the additional
stress on your zookeeper will be significant.

You do need to test it out before putting it in production as a general
practice.
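
For reference, a rough sketch of the Hive-side settings involved (the quorum
hosts below are placeholders for your existing ensemble; shown in CLI set
syntax for brevity, though they normally belong in hive-site.xml):

  set hive.support.concurrency=true;
  set hive.zookeeper.quorum=zk1.example.com,zk2.example.com,zk3.example.com;
  set hive.zookeeper.client.port=2181;
  set hive.lock.manager=org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager;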

On Fri, Apr 17, 2015 at 1:56 PM, Eduardo Ferreira eafon...@gmail.com
wrote:

 Hi there,

 I read on the Hive installation documentation that we need to have a
 ZooKeeper cluster setup to support Table Lock Manager (Cloudera docs link
 below).

 As we have HBase with a ZooKeeper cluster already, my question is if we
 can use the same ZK cluster for Hive.
 Is that recommended?
 What kind of load and constrains would this put on the HBase ZK cluster?


 http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_hiveserver2_configure.html

 Thanks in advance.
 Eduardo.



Re: merge small orc files

2015-04-20 Thread Xuefu Zhang
Also check hive.merge.size.per.task and hive.merge.smallfiles.avgsize.
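
For example (the threshold values below are illustrative only, tune them to
your file sizes):

  set hive.merge.mapfiles=true;
  set hive.merge.mapredfiles=true;
  set hive.merge.size.per.task=256000000;
  set hive.merge.smallfiles.avgsize=128000000;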

On Mon, Apr 20, 2015 at 8:29 AM, patcharee patcharee.thong...@uni.no
wrote:

 Hi,

 How to set the configuration hive-site.xml to automatically merge small
 orc file (output from mapreduce job) in hive 0.14 ?

 This is my current configuration

 property
   namehive.merge.mapfiles/name
   valuetrue/value
 /property

 property
   namehive.merge.mapredfiles/name
   valuetrue/value
 /property

 property
   namehive.merge.orcfile.stripe.level/name
   valuetrue/value
 /property

 However the output from a mapreduce job, which is stored into an orc file,
 was not merged. This is the output

 -rwxr-xr-x   1 root hdfs  0 2015-04-20 15:23
 /apps/hive/warehouse/coordinate/zone=2/_SUCCESS
 -rwxr-xr-x   1 root hdfs  29072 2015-04-20 15:23
 /apps/hive/warehouse/coordinate/zone=2/part-r-0
 -rwxr-xr-x   1 root hdfs  29049 2015-04-20 15:23
 /apps/hive/warehouse/coordinate/zone=2/part-r-1
 -rwxr-xr-x   1 root hdfs  29075 2015-04-20 15:23
 /apps/hive/warehouse/coordinate/zone=2/part-r-2

 Any ideas?

 BR,
 Patcharee



Re: [ANNOUNCE] New Hive Committers - Jimmy Xiang, Matt McCline, and Sergio Pena

2015-03-23 Thread Xuefu Zhang
Congratulations to all!

--Xuefu

On Mon, Mar 23, 2015 at 11:08 AM, Carl Steinbach c...@apache.org wrote:

 The Apache Hive PMC has voted to make Jimmy Xiang, Matt McCline, and
 Sergio Pena committers on the Apache Hive Project.

 Please join me in congratulating Jimmy, Matt, and Sergio.

 Thanks.

 - Carl




Re: Hive on Spark

2015-03-16 Thread Xuefu Zhang
(TaskRunner.java:75)
 Caused by: java.lang.RuntimeException:
 java.util.concurrent.ExecutionException:
 java.util.concurrent.TimeoutException: Timed out waiting for client
 connection.
 at com.google.common.base.Throwables.propagate(Throwables.java:156)
 at
 org.apache.hive.spark.client.SparkClientImpl.init(SparkClientImpl.java:104)
 at
 org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
 at
 org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.init(RemoteHiveSparkClient.java:88)
 at
 org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:58)
 at
 org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:55)
 ... 6 more
 Caused by: java.util.concurrent.ExecutionException:
 java.util.concurrent.TimeoutException: Timed out waiting for client
 connection.
 at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
 at
 org.apache.hive.spark.client.SparkClientImpl.init(SparkClientImpl.java:94)
 ... 10 more
 Caused by: java.util.concurrent.TimeoutException: Timed out waiting
 for client connection.
 at org.apache.hive.spark.client.rpc.RpcServer$2.run(RpcServer.java:134)
 at
 io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
 at
 io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:123)
 at
 io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
 at
 io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
 at java.lang.Thread.run(Thread.java:744)
 2015-03-16 10:42:12,204 ERROR [main]: ql.Driver
 (SessionState.java:printError(861)) - FAILED: Execution Error, return
 code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
 2015-03-16 10:42:12,205 INFO  [main]: log.PerfLogger
 (PerfLogger.java:PerfLogEnd(148)) - /PERFLOG method=Driver.execute
 start=1426482638193 end=1426482732205 duration=94012
 from=org.apache.hadoop.hive.ql.Driver
 2015-03-16 10:42:12,205 INFO  [main]: log.PerfLogger
 (PerfLogger.java:PerfLogBegin(121)) - PERFLOG method=releaseLocks
 from=org.apache.hadoop.hive.ql.Driver
 2015-03-16 10:42:12,544 INFO  [main]: log.PerfLogger
 (PerfLogger.java:PerfLogEnd(148)) - /PERFLOG method=releaseLocks
 start=1426482732205 end=1426482732544 duration=339
 from=org.apache.hadoop.hive.ql.Driver
 2015-03-16 10:42:12,583 INFO  [main]: log.PerfLogger
 (PerfLogger.java:PerfLogBegin(121)) - PERFLOG method=releaseLocks
 from=org.apache.hadoop.hive.ql.Driver
 2015-03-16 10:42:12,583 INFO  [main]: log.PerfLogger
 (PerfLogger.java:PerfLogEnd(148)) - /PERFLOG method=releaseLocks
 start=1426482732583 end=1426482732583 duration=0
 from=org.apache.hadoop.hive.ql.Driver
 2015-03-16 10:44:30,939 INFO  [main]: log.PerfLogger
 (PerfLogger.java:PerfLogBegin(121)) - PERFLOG method=Driver.run
 from=org.apache.hadoop.hive.ql.Driver
 2015-03-16 10:44:30,939 INFO  [main]: log.PerfLogger
 (PerfLogger.java:PerfLogBegin(121)) - PERFLOG method=TimeToSubmit
 from=org.apache.hadoop.hive.ql.Driver
 2015-03-16 10:44:30,939 INFO  [main]: log.PerfLogger
 (PerfLogger.java:PerfLogBegin(121)) - PERFLOG method=compile
 from=org.apache.hadoop.hive.ql.Driver
 2015-03-16 10:44:30,940 INFO  [main]: log.PerfLogger
 (PerfLogger.java:PerfLogBegin(121)) - PERFLOG method=parse
 from=org.apache.hadoop.hive.ql.Driver
 2015-03-16 10:44:30,941 INFO  [main]: parse.ParseDriver
 (ParseDriver.java:parse(185)) - Parsing command: insert into table
 test values(5,8900)
 2015-03-16 10:44:30,942 INFO  [main]: parse.ParseDriver
 (ParseDriver.java:parse(206)) - Parse Completed
 2015-03-16 10:44:30,942 INFO  [main]: log.PerfLogger
 (PerfLogger.java:PerfLogEnd(148)) - /PERFLOG method=parse
 start=1426482870940 end=1426482870942 duration=2
 from=org.apache.hadoop.hive.ql.Driver









 Thanks & Regards
 Amithsha


 On Fri, Mar 13, 2015 at 7:36 PM, Xuefu Zhang xzh...@cloudera.com wrote:
  You need to copy the spark-assembly.jar to your hive/lib.
 
  Also, you can check hive.log to get more messages.
 
  On Fri, Mar 13, 2015 at 4:51 AM, Amith sha amithsh...@gmail.com wrote:
 
  Hi all,
 
 
  Recently I have configured Spark 1.2.0 and my environment is Hadoop
  2.6.0 and Hive 1.1.0. Here I have tried Hive on Spark; while executing an
  insert into, I am getting the following error.
 
  Query ID = hadoop2_20150313162828_8764adad-a8e4-49da-9ef5-35e4ebd6bc63
  Total jobs = 1
  Launching Job 1 out of 1
  In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=number
  In order to limit the maximum number of reducers:
set hive.exec.reducers.max=number
  In order to set a constant number of reducers:
set mapreduce.job.reduces=number
  Failed to execute spark task, with exception
  'org.apache.hadoop.hive.ql.metadata.HiveException(Failed

Re: Hive on Spark

2015-03-13 Thread Xuefu Zhang
You need to copy the spark-assembly.jar to your hive/lib.

Also, you can check hive.log to get more messages.

On Fri, Mar 13, 2015 at 4:51 AM, Amith sha amithsh...@gmail.com wrote:

 Hi all,


 Recently I have configured Spark 1.2.0 and my environment is Hadoop
 2.6.0 and Hive 1.1.0. Here I have tried Hive on Spark; while executing an
 insert into, I am getting the following error.

 Query ID = hadoop2_20150313162828_8764adad-a8e4-49da-9ef5-35e4ebd6bc63
 Total jobs = 1
 Launching Job 1 out of 1
 In order to change the average load for a reducer (in bytes):
   set hive.exec.reducers.bytes.per.reducer=number
 In order to limit the maximum number of reducers:
   set hive.exec.reducers.max=number
 In order to set a constant number of reducers:
   set mapreduce.job.reduces=number
 Failed to execute spark task, with exception
 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create
 spark client.)'
 FAILED: Execution Error, return code 1 from
 org.apache.hadoop.hive.ql.exec.spark.SparkTask



 Have added the spark-assembly jar in hive lib
 And also in hive console using the command add jar followed by the  steps

 set spark.home=/opt/spark-1.2.1/;


 add jar
 /opt/spark-1.2.1/assembly/target/scala-2.10/spark-assembly-1.2.1-hadoop2.4.0.jar;



 set hive.execution.engine=spark;


 set spark.master=spark://xxx:7077;


 set spark.eventLog.enabled=true;


 set spark.executor.memory=512m;


 set spark.serializer=org.apache.spark.serializer.KryoSerializer;

 Can anyone suggest



 Thanks & Regards
 Amithsha



Re: Does any one know how to deploy a custom UDAF jar file in SparkSQL

2015-03-10 Thread Xuefu Zhang
This question seems more suitable for the Spark community. FYI, this is the
Hive user list.
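
For what it's worth, on the Hive side a custom UDAF jar is typically registered
like this (the jar path, function name, and class name below are placeholders):

  ADD JAR /path/to/my-udaf.jar;
  CREATE TEMPORARY FUNCTION my_udaf AS 'com.example.MyUDAF';
  SELECT my_udaf(col) FROM sometable;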

On Tue, Mar 10, 2015 at 5:46 AM, shahab shahab.mok...@gmail.com wrote:

 Hi,

 Does any one know how to deploy a custom UDAF jar file in SparkSQL? Where
 should i put the jar file so SparkSQL can pick it up and make it accessible
 for SparkSQL applications?
 I do not use spark-shell instead I want to use it in an spark application.


  I posted same question to Spark mailing list, but no answer so far !

 best,
 /Shahab



Re: [ANNOUNCE] Apache Hive 1.1.0 Released

2015-03-09 Thread Xuefu Zhang
Great job, guys! This is a major release with significant new features
and improvements. Thanks to everyone who contributed to make this happen.

Thanks,
Xuefu


On Sun, Mar 8, 2015 at 10:40 PM, Brock Noland br...@apache.org wrote:

 The Apache Hive team is proud to announce the release of Apache
 Hive version 1.1.0.

 The Apache Hive (TM) data warehouse software facilitates querying and
 managing large datasets residing in distributed storage. Built on top
 of Apache Hadoop (TM), it provides:

 * Tools to enable easy data extract/transform/load (ETL)

 * A mechanism to impose structure on a variety of data formats

 * Access to files stored either directly in Apache HDFS (TM) or in other
   data storage systems such as Apache HBase (TM)

 * Query execution via Apache Hadoop MapReduce and Apache Tez frameworks.

 For Hive release details and downloads, please visit:
 https://hive.apache.org/downloads.html

 Hive 1.1.0 Release Notes are available here:

 https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310843styleName=Textversion=12329363

 We would like to thank the many contributors who made this release
 possible.

 Regards,

 The Apache Hive Team



Re: error: Failed to create spark client. for hive on spark

2015-03-02 Thread Xuefu Zhang
)
 at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
 Caused by: java.lang.RuntimeException: 
 java.util.concurrent.ExecutionException:
 java.util.concurrent.TimeoutException: Timed out waiting for client
 connection.
 at com.google.common.base.Throwables.propagate(
 Throwables.java:156)
 at org.apache.hive.spark.client.SparkClientImpl.init(
 SparkClientImpl.java:106)
 at org.apache.hive.spark.client.SparkClientFactory.createClient(
 SparkClientFactory.java:80)
 at org.apache.hadoop.hive.ql.exec.spark.
 RemoteHiveSparkClient.init(RemoteHiveSparkClient.java:88)
 at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.
 createHiveSparkClient(HiveSparkClientFactory.java:65)
 at org.apache.hadoop.hive.ql.exec.spark.session.
 SparkSessionImpl.open(SparkSessionImpl.java:55)
 ... 22 more
 Caused by: java.util.concurrent.ExecutionException: 
 java.util.concurrent.TimeoutException:
 Timed out waiting for client connection.
 at io.netty.util.concurrent.AbstractFuture.get(
 AbstractFuture.java:37)
 at org.apache.hive.spark.client.SparkClientImpl.init(
 SparkClientImpl.java:96)
 ... 26 more
 Caused by: java.util.concurrent.TimeoutException: Timed out waiting for
 client connection.
 at org.apache.hive.spark.client.rpc.RpcServer$2.run(RpcServer.
 java:134)
 at io.netty.util.concurrent.PromiseTask$RunnableAdapter.
 call(PromiseTask.java:38)
 at io.netty.util.concurrent.ScheduledFutureTask.run(
 ScheduledFutureTask.java:123)
 at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(
 SingleThreadEventExecutor.java:380)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
 at io.netty.util.concurrent.SingleThreadEventExecutor$2.
 run(SingleThreadEventExecutor.java:116)
 at java.lang.Thread.run(Thread.java:722)


 and i do not find spark.log

 thanks.


 On 2015/3/2 22:39, Xuefu Zhang wrote:

 Could you check your hive.log and spark.log for more detailed error
 message? Quick check though, do you have spark-assembly.jar in your hive
 lib folder?

 Thanks,
 Xuefu

 On Mon, Mar 2, 2015 at 5:14 AM, scwf wangf...@huawei.com wrote:

 Hi all,
anyone met this error: HiveException(Failed to create spark
 client.)

 M151:/opt/cluster/apache-hive-1.2.0-SNAPSHOT-bin # bin/hive

 Logging initialized using configuration in
 jar:file:/opt/cluster/apache-hive-1.2.0-SNAPSHOT-bin/lib/hive-common-1.2.0-SNAPSHOT.jar!/hive-log4j.properties
 [INFO] Unable to bind key for unsupported operation:
 backward-delete-word
 [INFO] Unable to bind key for unsupported operation:
 backward-delete-word
 [INFO] Unable to bind key for unsupported operation: down-history
 [INFO] Unable to bind key for unsupported operation: up-history
 [INFO] Unable to bind key for unsupported operation: up-history
 [INFO] Unable to bind key for unsupported operation: down-history
 [INFO] Unable to bind key for unsupported operation: up-history
 [INFO] Unable to bind key for unsupported operation: down-history
 [INFO] Unable to bind key for unsupported operation: up-history
 [INFO] Unable to bind key for unsupported operation: down-history
 [INFO] Unable to bind key for unsupported operation: up-history
 [INFO] Unable to bind key for unsupported operation: down-history
 hive set spark.home=/opt/cluster/spark-1.3.0-bin-hadoop2-without-hive;
 hive set hive.execution.engine=spark;
 hive set spark.master=spark://9.91.8.151:7070;
 hive select count(1) from src;
 Query ID = root_2015030220_4bed4c2a-b9a5-4d99-a485-67570e2712b7
 Total jobs = 1
 Launching Job 1 out of 1
 In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=number
 In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=number
 In order to set a constant number of reducers:
set mapreduce.job.reduces=number
 Failed to execute spark task, with exception
 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create
 spark client.)'
 FAILED: Execution Error, return code 1 from
 org.apache.hadoop.hive.ql.exec.spark.SparkTask

 thanks







Re: error: Failed to create spark client. for hive on spark

2015-03-02 Thread Xuefu Zhang
Could you check your hive.log and spark.log for more detailed error
message? Quick check though, do you have spark-assembly.jar in your hive
lib folder?

Thanks,
Xuefu

On Mon, Mar 2, 2015 at 5:14 AM, scwf wangf...@huawei.com wrote:

 Hi all,
   anyone met this error: HiveException(Failed to create spark client.)

 M151:/opt/cluster/apache-hive-1.2.0-SNAPSHOT-bin # bin/hive

 Logging initialized using configuration in jar:file:/opt/cluster/apache-
 hive-1.2.0-SNAPSHOT-bin/lib/hive-common-1.2.0-SNAPSHOT.
 jar!/hive-log4j.properties
 [INFO] Unable to bind key for unsupported operation: backward-delete-word
 [INFO] Unable to bind key for unsupported operation: backward-delete-word
 [INFO] Unable to bind key for unsupported operation: down-history
 [INFO] Unable to bind key for unsupported operation: up-history
 [INFO] Unable to bind key for unsupported operation: up-history
 [INFO] Unable to bind key for unsupported operation: down-history
 [INFO] Unable to bind key for unsupported operation: up-history
 [INFO] Unable to bind key for unsupported operation: down-history
 [INFO] Unable to bind key for unsupported operation: up-history
 [INFO] Unable to bind key for unsupported operation: down-history
 [INFO] Unable to bind key for unsupported operation: up-history
 [INFO] Unable to bind key for unsupported operation: down-history
 hive set spark.home=/opt/cluster/spark-1.3.0-bin-hadoop2-without-hive;
 hive set hive.execution.engine=spark;
 hive set spark.master=spark://9.91.8.151:7070;
 hive select count(1) from src;
 Query ID = root_2015030220_4bed4c2a-b9a5-4d99-a485-67570e2712b7
 Total jobs = 1
 Launching Job 1 out of 1
 In order to change the average load for a reducer (in bytes):
   set hive.exec.reducers.bytes.per.reducer=number
 In order to limit the maximum number of reducers:
   set hive.exec.reducers.max=number
 In order to set a constant number of reducers:
   set mapreduce.job.reduces=number
 Failed to execute spark task, with exception 
 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed
 to create spark client.)'
 FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.
 exec.spark.SparkTask

 thanks




Re: Where does hive do sampling in order by ?

2015-03-02 Thread Xuefu Zhang
There is no sampling for order by in Hive. Hive uses a single reducer for
order by (if you're talking about the MR execution engine).

Hive on Spark is different in this regard, though.
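
A quick sketch of the practical difference, using the usual src sample table:

  -- total order, funneled through a single reducer:
  SELECT key, value FROM src ORDER BY key;

  -- per-reducer sort, runs in parallel but gives no total order:
  SELECT key, value FROM src DISTRIBUTE BY key SORT BY key;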

Thanks,
Xuefu

On Mon, Mar 2, 2015 at 2:17 AM, Jeff Zhang zjf...@gmail.com wrote:

 Order by usually involves 2 steps (a sampling job and a repartition job), but
 hive only runs one MR job for order by, so I'm wondering when and where does
 hive do the sampling? Client side?


 --
 Best Regards

 Jeff Zhang



Re: Bucket map join - reducers role

2015-02-27 Thread Xuefu Zhang
Could you post your query and the output of explain <your_query>?
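
In the meantime, a sketch of the switches usually involved in getting a
sort-merge bucket map join (exact behavior depends on your Hive version, so
treat this as a starting point rather than a recipe):

  set hive.optimize.bucketmapjoin=true;
  set hive.optimize.bucketmapjoin.sortedmerge=true;
  set hive.auto.convert.sortmerge.join=true;
  set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;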

On Fri, Feb 27, 2015 at 5:32 AM, murali parimi 
muralikrishna.par...@icloud.com wrote:

 Hello team,

 I have two tables A and B. A has 360Million rows with one column K. B has
 around two billion rows with multiple columns including K.

 Both tables are clustered and sorted by K and Bucketed into same number of
 buckets.

 When I perform a join, I assumed there won't be any reducers spawned as
 the join would happen on the map side. Still I see a few reducers getting spawned.

 Any role for reducers here? Am I missing something?

 Sent from my iPhone


Re: Union all with a field 'hard coded'

2015-02-21 Thread Xuefu Zhang
I haven't tried union distinct, but I assume the same rule applies.

Thanks for putting it together. It looks good to me.

--Xuefu

On Fri, Feb 20, 2015 at 11:44 PM, Lefty Leverenz leftylever...@gmail.com
wrote:

 Great, thanks Xuefu.  So this only applies to UNION ALL, not UNION
 DISTINCT?  I had wondered about that.

 I made the changes and added some subheadings:  Union wikidoc
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union
  -- Column Aliases for UNION ALL
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union#LanguageManualUnion-ColumnAliasesforUNIONALL
 .

 Please review it one more time.

 -- Lefty

 On Fri, Feb 20, 2015 at 7:06 AM, Xuefu Zhang xzh...@cloudera.com wrote:

 Hi Lefty,

 The description seems good to me. I just slightly modified it so that it
 sounds more technical, for your consideration.

 Thanks,
 Xuefu

 UNION ALL expects the same schema on both sides of the expression list.
 As a result, the following query may fail with an error message such as
 FAILED: SemanticException 4:47 Schema of both sides of union should
 match.
 [query]
 In such cases, column aliases can be used to force equal schema:
 [corrected query]



 On Thu, Feb 19, 2015 at 1:04 AM, Lefty Leverenz leftylever...@gmail.com
 wrote:

 Xuefu, I've taken a stab at documenting this in the Union wikidoc
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union 
 (near
 the end).  Would you please review it and make any necessary corrections or
 additions?

 Thanks.

 -- Lefty

 On Mon, Feb 2, 2015 at 2:02 PM, DU DU will...@gmail.com wrote:

 This is a part of standard SQL syntax, isn't it?

 On Mon, Feb 2, 2015 at 2:22 PM, Xuefu Zhang xzh...@cloudera.com
 wrote:

 Yes, I think it would be great if this can be documented.

 --Xuefu

 On Sun, Feb 1, 2015 at 6:34 PM, Lefty Leverenz 
 leftylever...@gmail.com wrote:

 Xuefu, should this be documented in the Union wikidoc
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union
 ?

 Is it relevant for other query clauses?

 -- Lefty

 On Sun, Feb 1, 2015 at 11:27 AM, Philippe Kernévez 
 pkerne...@octo.com wrote:

 Perfect.

 Thank you Xuefu.

 Philippe

 On Fri, Jan 30, 2015 at 11:32 PM, Xuefu Zhang xzh...@cloudera.com
 wrote:

 Use column alias:

 INSERT OVERWRITE TABLE all_dictionaries_ext
  SELECT name, id, category FROM dictionary
  UNION ALL SELECT NAME, ID, CAMPAIGN as category FROM
 md_campaigns


 On Fri, Jan 30, 2015 at 1:41 PM, Philippe Kernévez 
 pkerne...@octo.com wrote:

 Hi all,

 I would like to do union all with a field that is hardcoded in the
 request.

INSERT OVERWRITE TABLE all_dictionaries_ext
  SELECT name, id, category FROM dictionary
  UNION ALL SELECT NAME, ID, CAMPAIGN FROM md_campaigns

 Name type is String
 Id type is int
 Category type is string

 When I run this command I had an error :
 FAILED: SemanticException 4:47 Schema of both sides of union
 should match. _u1-subquery2 does not have the field category. Error
 encountered near token 'md_campaigns'

 I supposed that the error is cause by the String CAMPAIGN which
 should not have a type.

 How can do this kind of union ?

 The union all with 2 hard coded fields is ok.
   INSERT OVERWRITE TABLE all_dictionaries_ext
 SELECT NAME, ID, CAMPAIGN FROM md_campaigns
  UNION ALL SELECT NAME, ID, AD_SERVER FROM md_ad_servers
  UNION ALL SELECT NAME, ID, AVERTISER FROM md_advertisers
  UNION ALL SELECT NAME, ID, AGENCIES FROM md_agencies


 More debug info :

 15/01/30 22:34:23 [main]: INFO parse.ParseDriver: Parsing command:
   INSERT OVERWRITE TABLE all_dictionaries_ext
 SELECT name, id, category FROM byoa_dictionary
 UNION ALL SELECT NAME, ID, CAMPAIGN FROM md_campaigns
 15/01/30 22:34:23 [main]: INFO parse.ParseDriver: Parse Completed
 15/01/30 22:34:23 [main]: INFO log.PerfLogger: /PERFLOG
 method=parse start=1422653663887 end=1422653663900 duration=13
 from=org.apache.hadoop.hive.ql.Driver
 15/01/30 22:34:23 [main]: INFO log.PerfLogger: PERFLOG
 method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Starting
 Semantic Analysis
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Completed
 phase 1 of Semantic Analysis
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Get
 metadata for source tables
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Get
 metadata for subqueries
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Get
 metadata for source tables
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Get
 metadata for subqueries
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Get
 metadata for destination tables
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Get
 metadata for source tables
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Get
 metadata for subqueries
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Get
 metadata for destination tables
 15/01/30 22

Re: Union all with a field 'hard coded'

2015-02-20 Thread Xuefu Zhang
Hi Lefty,

The description seems good to me. I just slightly modified it so that it
sounds more technical, for your consideration.

Thanks,
Xuefu

UNION ALL expects the same schema on both sides of the expression list. As
a result, the following query may fail with an error message such as
FAILED: SemanticException 4:47 Schema of both sides of union should
match.
[query]
In such cases, column aliases can be used to force equal schema:
[corrected query]



On Thu, Feb 19, 2015 at 1:04 AM, Lefty Leverenz leftylever...@gmail.com
wrote:

 Xuefu, I've taken a stab at documenting this in the Union wikidoc
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union (near
 the end).  Would you please review it and make any necessary corrections or
 additions?

 Thanks.

 -- Lefty

 On Mon, Feb 2, 2015 at 2:02 PM, DU DU will...@gmail.com wrote:

 This is a part of standard SQL syntax, isn't it?

 On Mon, Feb 2, 2015 at 2:22 PM, Xuefu Zhang xzh...@cloudera.com wrote:

 Yes, I think it would be great if this can be documented.

 --Xuefu

 On Sun, Feb 1, 2015 at 6:34 PM, Lefty Leverenz leftylever...@gmail.com
 wrote:

 Xuefu, should this be documented in the Union wikidoc
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union
 ?

 Is it relevant for other query clauses?

 -- Lefty

 On Sun, Feb 1, 2015 at 11:27 AM, Philippe Kernévez pkerne...@octo.com
 wrote:

 Perfect.

 Thank you Xuefu.

 Philippe

 On Fri, Jan 30, 2015 at 11:32 PM, Xuefu Zhang xzh...@cloudera.com
 wrote:

 Use column alias:

 INSERT OVERWRITE TABLE all_dictionaries_ext
  SELECT name, id, category FROM dictionary
  UNION ALL SELECT NAME, ID, CAMPAIGN as category FROM
 md_campaigns


 On Fri, Jan 30, 2015 at 1:41 PM, Philippe Kernévez 
 pkerne...@octo.com wrote:

 Hi all,

 I would like to do union all with a field that is hardcoded in the
 request.

INSERT OVERWRITE TABLE all_dictionaries_ext
  SELECT name, id, category FROM dictionary
  UNION ALL SELECT NAME, ID, CAMPAIGN FROM md_campaigns

 Name type is String
 Id type is int
 Category type is string

 When I run this command I had an error :
 FAILED: SemanticException 4:47 Schema of both sides of union should
 match. _u1-subquery2 does not have the field category. Error encountered
 near token 'md_campaigns'

 I supposed that the error is cause by the String CAMPAIGN which
 should not have a type.

 How can do this kind of union ?

 The union all with 2 hard coded fields is ok.
   INSERT OVERWRITE TABLE all_dictionaries_ext
 SELECT NAME, ID, CAMPAIGN FROM md_campaigns
  UNION ALL SELECT NAME, ID, AD_SERVER FROM md_ad_servers
  UNION ALL SELECT NAME, ID, AVERTISER FROM md_advertisers
  UNION ALL SELECT NAME, ID, AGENCIES FROM md_agencies


 More debug info :

 15/01/30 22:34:23 [main]: INFO parse.ParseDriver: Parsing command:
   INSERT OVERWRITE TABLE all_dictionaries_ext
 SELECT name, id, category FROM byoa_dictionary
 UNION ALL SELECT NAME, ID, CAMPAIGN FROM md_campaigns
 15/01/30 22:34:23 [main]: INFO parse.ParseDriver: Parse Completed
 15/01/30 22:34:23 [main]: INFO log.PerfLogger: /PERFLOG
 method=parse start=1422653663887 end=1422653663900 duration=13
 from=org.apache.hadoop.hive.ql.Driver
 15/01/30 22:34:23 [main]: INFO log.PerfLogger: PERFLOG
 method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Starting
 Semantic Analysis
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Completed
 phase 1 of Semantic Analysis
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Get metadata
 for source tables
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Get metadata
 for subqueries
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Get metadata
 for source tables
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Get metadata
 for subqueries
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Get metadata
 for destination tables
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Get metadata
 for source tables
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Get metadata
 for subqueries
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Get metadata
 for destination tables
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Get metadata
 for destination tables
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Completed
 getting MetaData in Semantic Analysis
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Not invoking
 CBO because the statement has too few joins
 FAILED: SemanticException 4:47 Schema of both sides of union should
 match. _u1-subquery2 does not have the field category. Error encountered
 near token 'md_campaigns'
 15/01/30 22:34:24 [main]: ERROR ql.Driver: FAILED: SemanticException
 4:47 Schema of both sides of union should match. _u1-subquery2 does not
 have the field category. Error encountered near token 'md_campaigns'
 org.apache.hadoop.hive.ql.parse.SemanticException: 4:47 Schema

Re: Does Hive 1.0.0 still support commandline

2015-02-09 Thread Xuefu Zhang
There should be no confusion. While in 1.0 you can still use HiveCLI, you
don't have the HiveCLI + HiveServer1 option. You will not be able to connect to
HiveServer2 with HiveCLI.

Thus, the clarification is: You can only use HiveCLI as a standalone
application in 1.0.

--Xuefu

On Mon, Feb 9, 2015 at 9:17 AM, DU DU will...@gmail.com wrote:

 The blog says For CLI users, migrating to HiveServer2 will require migrating
 to Beeline
 http://blog.cloudera.com/blog/2014/02/migrating-from-hive-cli-to-beeline-a-primer/.
 Is this misleading since Hive CLI is still in the v1.0.0

 On Mon, Feb 9, 2015 at 12:07 PM, Alan Gates alanfga...@gmail.com wrote:

 Hive CLI and HiveServer2/beeline are both in Hive 1.0.

 Alan.

   DU DU will...@gmail.com
  February 9, 2015 at 8:54
 According to the release note of Hive 1.0.0, the HiveServer1 is removed.
 Can we still use command line in 1.0.0?
 --
 Thanks,
 Dayong




 --
 Thanks,
 Dayong



Re: Union all with a field 'hard coded'

2015-02-02 Thread Xuefu Zhang
Yes, I think it would be great if this can be documented.

--Xuefu

On Sun, Feb 1, 2015 at 6:34 PM, Lefty Leverenz leftylever...@gmail.com
wrote:

 Xuefu, should this be documented in the Union wikidoc
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union?

 Is it relevant for other query clauses?

 -- Lefty

 On Sun, Feb 1, 2015 at 11:27 AM, Philippe Kernévez pkerne...@octo.com
 wrote:

 Perfect.

 Thank you Xuefu.

 Philippe

 On Fri, Jan 30, 2015 at 11:32 PM, Xuefu Zhang xzh...@cloudera.com
 wrote:

 Use column alias:

 INSERT OVERWRITE TABLE all_dictionaries_ext
  SELECT name, id, category FROM dictionary
  UNION ALL SELECT NAME, ID, CAMPAIGN as category FROM md_campaigns


 On Fri, Jan 30, 2015 at 1:41 PM, Philippe Kernévez pkerne...@octo.com
 wrote:

 Hi all,

 I would like to do union all with a field that is hardcoded in the
 request.

INSERT OVERWRITE TABLE all_dictionaries_ext
  SELECT name, id, category FROM dictionary
  UNION ALL SELECT NAME, ID, CAMPAIGN FROM md_campaigns

 Name type is String
 Id type is int
 Category type is string

 When I run this command I had an error :
 FAILED: SemanticException 4:47 Schema of both sides of union should
 match. _u1-subquery2 does not have the field category. Error encountered
 near token 'md_campaigns'

 I supposed that the error is cause by the String CAMPAIGN which
 should not have a type.

 How can do this kind of union ?

 The union all with 2 hard coded fields is ok.
   INSERT OVERWRITE TABLE all_dictionaries_ext
 SELECT NAME, ID, CAMPAIGN FROM md_campaigns
  UNION ALL SELECT NAME, ID, AD_SERVER FROM md_ad_servers
  UNION ALL SELECT NAME, ID, AVERTISER FROM md_advertisers
  UNION ALL SELECT NAME, ID, AGENCIES FROM md_agencies


 More debug info :

 15/01/30 22:34:23 [main]: INFO parse.ParseDriver: Parsing command:
   INSERT OVERWRITE TABLE all_dictionaries_ext
 SELECT name, id, category FROM byoa_dictionary
 UNION ALL SELECT NAME, ID, CAMPAIGN FROM md_campaigns
 15/01/30 22:34:23 [main]: INFO parse.ParseDriver: Parse Completed
 15/01/30 22:34:23 [main]: INFO log.PerfLogger: /PERFLOG method=parse
 start=1422653663887 end=1422653663900 duration=13
 from=org.apache.hadoop.hive.ql.Driver
 15/01/30 22:34:23 [main]: INFO log.PerfLogger: PERFLOG
 method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Starting
 Semantic Analysis
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Completed phase
 1 of Semantic Analysis
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Get metadata for
 source tables
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Get metadata for
 subqueries
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Get metadata for
 source tables
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Get metadata for
 subqueries
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Get metadata for
 destination tables
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Get metadata for
 source tables
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Get metadata for
 subqueries
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Get metadata for
 destination tables
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Get metadata for
 destination tables
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Completed
 getting MetaData in Semantic Analysis
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Not invoking CBO
 because the statement has too few joins
 FAILED: SemanticException 4:47 Schema of both sides of union should
 match. _u1-subquery2 does not have the field category. Error encountered
 near token 'md_campaigns'
 15/01/30 22:34:24 [main]: ERROR ql.Driver: FAILED: SemanticException
 4:47 Schema of both sides of union should match. _u1-subquery2 does not
 have the field category. Error encountered near token 'md_campaigns'
 org.apache.hadoop.hive.ql.parse.SemanticException: 4:47 Schema of both
 sides of union should match. _u1-subquery2 does not have the field
 category. Error encountered near token 'md_campaigns'
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genUnionPlan(SemanticAnalyzer.java:9007)
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9600)
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9620)
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9607)
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10093)
 at
 org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:221)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:415)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:303)
 at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1067)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1129)
 at org.apache.hadoop.hive.ql.Driver.run

Re: Union all with a field 'hard coded'

2015-01-30 Thread Xuefu Zhang
Use column alias:

INSERT OVERWRITE TABLE all_dictionaries_ext
 SELECT name, id, category FROM dictionary
 UNION ALL SELECT NAME, ID, CAMPAIGN as category FROM md_campaigns


On Fri, Jan 30, 2015 at 1:41 PM, Philippe Kernévez pkerne...@octo.com
wrote:

 Hi all,

 I would like to do union all with a field that is hardcoded in the request.

INSERT OVERWRITE TABLE all_dictionaries_ext
  SELECT name, id, category FROM dictionary
  UNION ALL SELECT NAME, ID, CAMPAIGN FROM md_campaigns

 Name type is String
 Id type is int
 Category type is string

 When I run this command I had an error :
 FAILED: SemanticException 4:47 Schema of both sides of union should match.
 _u1-subquery2 does not have the field category. Error encountered near
 token 'md_campaigns'

 I supposed that the error is cause by the String CAMPAIGN which should
 not have a type.

 How can do this kind of union ?

 The union all with 2 hard coded fields is ok.
   INSERT OVERWRITE TABLE all_dictionaries_ext
 SELECT NAME, ID, CAMPAIGN FROM md_campaigns
  UNION ALL SELECT NAME, ID, AD_SERVER FROM md_ad_servers
  UNION ALL SELECT NAME, ID, AVERTISER FROM md_advertisers
  UNION ALL SELECT NAME, ID, AGENCIES FROM md_agencies


 More debug info :

 15/01/30 22:34:23 [main]: INFO parse.ParseDriver: Parsing command:
   INSERT OVERWRITE TABLE all_dictionaries_ext
 SELECT name, id, category FROM byoa_dictionary
 UNION ALL SELECT NAME, ID, CAMPAIGN FROM md_campaigns
 15/01/30 22:34:23 [main]: INFO parse.ParseDriver: Parse Completed
 15/01/30 22:34:23 [main]: INFO log.PerfLogger: /PERFLOG method=parse
 start=1422653663887 end=1422653663900 duration=13
 from=org.apache.hadoop.hive.ql.Driver
 15/01/30 22:34:23 [main]: INFO log.PerfLogger: PERFLOG
 method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Starting Semantic
 Analysis
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Completed phase 1
 of Semantic Analysis
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Get metadata for
 source tables
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Get metadata for
 subqueries
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Get metadata for
 source tables
 15/01/30 22:34:23 [main]: INFO parse.SemanticAnalyzer: Get metadata for
 subqueries
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Get metadata for
 destination tables
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Get metadata for
 source tables
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Get metadata for
 subqueries
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Get metadata for
 destination tables
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Get metadata for
 destination tables
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Completed getting
 MetaData in Semantic Analysis
 15/01/30 22:34:24 [main]: INFO parse.SemanticAnalyzer: Not invoking CBO
 because the statement has too few joins
 FAILED: SemanticException 4:47 Schema of both sides of union should match.
 _u1-subquery2 does not have the field category. Error encountered near
 token 'md_campaigns'
 15/01/30 22:34:24 [main]: ERROR ql.Driver: FAILED: SemanticException 4:47
 Schema of both sides of union should match. _u1-subquery2 does not have the
 field category. Error encountered near token 'md_campaigns'
 org.apache.hadoop.hive.ql.parse.SemanticException: 4:47 Schema of both
 sides of union should match. _u1-subquery2 does not have the field
 category. Error encountered near token 'md_campaigns'
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genUnionPlan(SemanticAnalyzer.java:9007)
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9600)
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9620)
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9607)
 at
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10093)
 at
 org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:221)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:415)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:303)
 at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1067)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1129)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1004)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:994)
 at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:247)
 at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:199)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:410)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:345)
 at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:733)
 at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:677)
 at 

Re: [ANNOUNCE] New Hive PMC Members - Szehon Ho, Vikram Dixit, Jason Dere, Owen O'Malley and Prasanth Jayachandran

2015-01-28 Thread Xuefu Zhang
Congratulations to all!

--Xuefu

On Wed, Jan 28, 2015 at 1:15 PM, Carl Steinbach c...@apache.org wrote:

 I am pleased to announce that Szehon Ho, Vikram Dixit, Jason Dere, Owen
 O'Malley and Prasanth Jayachandran have been elected to the Hive Project
 Management Committee. Please join me in congratulating the these new PMC
 members!

 Thanks.

 - Carl



Re: how to determine the memory usage of select,join, in hive on spark?

2015-01-24 Thread Xuefu Zhang
Hi,

Since you have only one worker, you should be able to use jmap to get a
dump of the worker process. In Hive, you can configure the memory usage for
join.
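
For the join side, the knobs usually involved look like this (values are
illustrative; a map-side join only kicks in when the small side fits under
these thresholds):

  set hive.auto.convert.join=true;
  set hive.mapjoin.smalltable.filesize=25000000;
  set hive.auto.convert.join.noconditionaltask=true;
  set hive.auto.convert.join.noconditionaltask.size=100000000;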

As to the slowness and heavy GC you observed, I'm thinking this might have
to do with your query. Could you share it?

Thanks,
Xuefu

On Thu, Jan 22, 2015 at 11:29 PM, 诺铁 noty...@gmail.com wrote:

 hi,

 when I am trying to join several tables and then write the result to another
 table, it runs very slowly. By observing the worker log and the Spark UI, I
 found a lot of GC time.

 the input tables are not very big, their size are:
 84M
 705M
 2.7G
 2.4M
 573M

 the resulting output is about 1.5GB.
 the worker is given 70G memory(only 1 worker), and I set spark to use
 Kryo.
 I don't understand the reason why there are so many gc, that makes job
 very slow.

 when using the Spark core API, I can call RDD.cache(), then watch how much
 memory the RDD used. In hive on spark, is there any way to profile memory
 usage?




Re: Hive create table line terminated by '\n'

2015-01-13 Thread Xuefu Zhang
Consider using a data format other than TEXT, such as sequence file.
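
A minimal sketch of that approach (table and column names are made up); because
the rows are no longer newline-delimited text on disk, an embedded '\n' in a
column survives:

  CREATE TABLE events_seq (id INT, body STRING)
  STORED AS SEQUENCEFILE;

  INSERT OVERWRITE TABLE events_seq
  SELECT id, body FROM events_text;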

On Mon, Jan 12, 2015 at 10:54 PM, 王鹏飞 wpf5...@gmail.com wrote:

 Thank you, maybe I didn't express my question explicitly. I know the Hive
 create table clause, and there is FIELDS TERMINATED BY etc.
 For example, if I use FIELDS TERMINATED BY ' ,', what if the ' ,' is
 contained in one field? Hive will use that rule to split the field.
 You might suggest that I change the fields terminator, but how about the
 lines terminator? The lines terminator only supports '\n', so the problem is:
 what could I do if one column or field contains a '\n'?

 On Tue, Jan 13, 2015 at 2:25 PM, Xiaoyong Zhu xiaoy...@microsoft.com
 wrote:

  I guess you could use fields terminated by clause..

 CREATE TABLE IF NOT EXISTS default.table_name

 ROW FORMAT DELIMITED

 FIELDS TERMINATED BY '\001'

 COLLECTION ITEMS TERMINATED BY '\002'

 MAP KEYS TERMINATED BY '\003'

 STORED AS TEXTFILE





 Xiaoyong



 *From:* 王鹏飞 [mailto:wpf5...@gmail.com]
 *Sent:* Tuesday, January 13, 2015 2:15 PM
 *To:* user@hive.apache.org
 *Subject:* Hive create table line terminated by '\n'



 By default, a Hive table's LINES TERMINATED BY only supports '\n' right
 now; but if there is a column that contains a '\n', what could I do? Hive
 splits the column and gets it wrong. Are there any solutions? Thank you.





Re: Set variable via query

2015-01-13 Thread Xuefu Zhang
select * from someothertable where dt IN (select max(dt) from sometable);

On Tue, Jan 13, 2015 at 4:39 PM, Martin, Nick nimar...@pssd.com wrote:

  Hi all,

  I'm looking to set a variable in Hive and use the resulting value in a
 subsequent query. Something like:

  set startdt='select max(dt) from sometable';
 select * from someothertable where dt=${hiveconf:startdt};

  I found this is still open HIVE-2165
 https://issues.apache.org/jira/browse/HIVE-2165

  Any options? Tried a flavor of above via CLI and it didn't work.

  On Hive 13

  Thanks!
 Nick



Re: spark worker nodes getting disassociated while running hive on spark

2015-01-05 Thread Xuefu Zhang
Hi Somnath,

The error seems to have nothing to do with Hive. I haven't seen this problem, but
I'm wondering whether your cluster has a configuration issue, especially with the
timeout values for network communication. The default values have worked
fine for us.
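
As a starting point, these are the kinds of settings to look at (a rough sketch;
the property names come from the Spark 1.x standalone docs, so verify them
against your version and its defaults):

  # conf/spark-defaults.conf -- Akka remoting timeouts
  spark.akka.timeout             100
  spark.akka.heartbeat.pauses    6000
  spark.akka.heartbeat.interval  1000

  # conf/spark-env.sh on the master -- how long before an unresponsive worker is dropped
  export SPARK_MASTER_OPTS="-Dspark.worker.timeout=60"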

If the problem persists, please provide detailed information about your
spark build and hive build.

Thanks,
Xuefu




On Sun, Jan 4, 2015 at 11:07 PM, Somnath Pandeya 
somnath_pand...@infosys.com wrote:

  Hi,



 I have set up a Spark 1.2 standalone cluster and am trying to run Hive on
 Spark by following the link below.




 https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started



 I got the latest Hive on Spark build from git and was trying to run a
 few queries. The queries run fine for some time, and after that I get the
 following errors.



 Error on master node

 15/01/05 12:16:59 INFO actor.LocalActorRef: Message
 [akka.remote.transport.AssociationHandle$Disassociated] from
 Actor[akka://sparkMaster/deadLetters] to
 Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40xx.xx.xx.xx%3A34823-1#1101564287]
 was not delivered. [1] dead letters encountered. This logging can be turned
 off or adjusted with configuration settings 'akka.log-dead-letters' and
 'akka.log-dead-letters-during-shutdown'.

 15/01/05 12:16:59 INFO master.Master: akka.tcp://sparkWorker@machinename:58392
 got disassociated, removing it.

 15/01/05 12:16:59 INFO master.Master: Removing worker
 worker-20150105120340-machine-58392 on
 indhyhdppocap03.infosys-platforms.com:58392



 Error on slave node



 15/01/05 12:20:21 INFO transport.ProtocolStateActor: No response from
 remote. Handshake timed out or transport failure detector triggered.

 15/01/05 12:20:21 INFO actor.LocalActorRef: Message
 [akka.remote.transport.AssociationHandle$Disassociated] from
 Actor[akka://sparkWorker/deadLetters] to
 Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40machineName%3A7077-1#-1301148631]
 was not delivered. [1] dead letters encountered. This logging can be turned
 off or adjusted with configuration settings 'akka.log-dead-letters' and
 'akka.log-dead-letters-during-shutdown'.

 15/01/05 12:20:21 INFO worker.Worker: Disassociated
 [akka.tcp://sparkWorker@machineName:58392] -
 [akka.tcp://sparkMaster@machineName:7077] Disassociated !

 15/01/05 12:20:21 ERROR worker.Worker: Connection to master failed!
 Waiting for master to reconnect...

 15/01/05 12:20:21 WARN remote.ReliableDeliverySupervisor: Association with
 remote system [akka.tcp://sparkMaster@machineName:7077] has failed,
 address is now gated for [5000] ms. Reason is: [Disassociated].





 Please Help



 -Somnath





Re: Job aborted due to stage failure

2014-12-03 Thread Xuefu Zhang
Hi Yuemeng,

I'm glad that Hive on Spark finally works for you. As you know, this
project is still in development and yet to be released. Thus, please
forgive the lack of proper documentation. We have a Get Started
page that's linked in HIVE-7292. If you can improve the document there, it
would be very helpful for other Hive users.

Thanks,
Xuefu

On Wed, Dec 3, 2014 at 5:42 PM, yuemeng1 yueme...@huawei.com wrote:

  Hi, thanks a lot for your help; with it, my Hive on Spark now works
 well.
 It took me a long time to install and deploy. Here is some advice: I think
 we need to improve the installation documentation so that users spend
 the least amount of time compiling and installing.
 1) State which Spark version should be picked from the Spark GitHub repo if we
 build Spark instead of downloading a pre-built one, and give them the right build
 command (not including -Pyarn or -Phive).
 2) If they hit a build error such as [ERRO
 /hive/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/SparkJobStatus.java:
 [22,24] cannot find symbol
 [ERROR] symbol: class JobExecutionStatus, tell them what they can do.
 Users want to try it first, and then judge whether it is good or bad.
 If needed, I can add something to the Getting Started document.


 thanks
 yuemeng






 On 2014/12/3 11:03, Xuefu Zhang wrote:

  When you build Spark, remove -Phive as well as -Pyarn. When you run hive
 queries, you may need to run set spark.home=/path/to/spark/dir;

  Thanks,
  Xuefu

 On Tue, Dec 2, 2014 at 6:29 PM, yuemeng1 yueme...@huawei.com wrote:

  Hi Xuefu, thanks a lot for your help. Here is more detail to
 reproduce this issue:
 1) I checked out the spark branch from the Hive GitHub repo (
 https://github.com/apache/hive/tree/spark) on Nov 29, because the current
 version gives an error (Caused by:
 java.lang.RuntimeException: Unable to instantiate
 org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient),
 and the build command was: mvn clean package -DskipTests -Phadoop-2 -Pdist.
 After the build I took the package from
 /home/ym/hive-on-spark/hive1129/hive/packaging/target (apache-hive-0.15.0-SNAPSHOT-bin.tar.gz).
 2) I checked out Spark from
 https://github.com/apache/spark/tree/v1.2.0-snapshot0, because the Spark
 branch-1.2 carries the spark-parent version 1.2.1-SNAPSHOT, so I chose
 v1.2.0-snapshot0. I compared this Spark's pom.xml with
 spark-parent-1.2.0-SNAPSHOT.pom (from
 http://ec2-50-18-79-139.us-west-1.compute.amazonaws.com/data/spark_2.10-1.2-SNAPSHOT/org/apache/spark/spark-parent/1.2.0-SNAPSHOT/), and
 the only difference is the spark-parent version. The build command was:

 mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package

 3) Commands I executed in the hive shell:
 ./hive --auxpath
 /opt/hispark/spark/assembly/target/scala-2.10/spark-assembly-1.2.0-hadoop2.4.0.jar
 (this jar has already been copied to the Hive lib directory)
 create table student(sno int,sname string,sage int,ssex string) row
 format delimited FIELDS TERMINATED BY ',';
 create table score(sno int,cno int,sage int) row format delimited FIELDS
 TERMINATED BY ',';
 load data local inpath
 '/home/hive-on-spark/temp/spark-1.2.0/examples/src/main/resources/student.txt'
 into table student;
 load data local inpath
 '/home/hive-on-spark/temp/spark-1.2.0/examples/src/main/resources/score.txt'
 into table score;
 set hive.execution.engine=spark;
 set spark.master=spark://10.175.xxx.xxx:7077;
 set spark.eventLog.enabled=true;
 set spark.executor.memory=9086m;
 set spark.serializer=org.apache.spark.serializer.KryoSerializer;
 select distinct st.sno,sname from student st join score sc
 on(st.sno=sc.sno) where sc.cno IN(11,12,13) and st.sage  28;(work in mr)
 4)
 student.txt file
 1,rsh,27,female
 2,kupo,28,male
 3,astin,29,female
 4,beike,30,male
 5,aili,31,famle

 score.txt file
 1,10,80
 2,11,85
 3,12,90
 4,13,95
 5,14,100

 On 2014/12/2 23:28, Xuefu Zhang wrote:

  Could you provide details on how to reproduce the issue? such as the
 exact spark branch, the command to build Spark, how you build Hive, and
 what queries/commands you run.

  We are running Hive on Spark all the time. Our pre-commit test runs
 without any issue.

  Thanks,
  Xuefu

 On Tue, Dec 2, 2014 at 4:13 AM, yuemeng1 yueme...@huawei.com wrote:

  Hi Xuefu,
 I checked out a Spark branch from the Spark GitHub repo (tag v1.2.0-snapshot0) and
 compared this Spark's pom.xml with spark-parent-1.2.0-SNAPSHOT.pom (from
 http://ec2-50-18-79-139.us-west-1.compute.amazonaws.com/data/spark_2.10-1.2-SNAPSHOT/org/apache/spark/spark-parent/1.2.0-SNAPSHOT/), and
 the only difference is the following:
 in spark-parent-1.2.0-SNAPSHOT.pom
   <artifactId>spark-parent</artifactId>
   <version>1.2.0-SNAPSHOT</version>
 and in v1.2.0-snapshot0
   <artifactId>spark-parent</artifactId>
   <version>1.2.0</version>
 I think there is no essential difference, so I built v1.2.0-snapshot0 and deployed
 it as my Spark cluster.
 When I run a query joining two tables, it still gives the same error I
 showed you earlier.

 Job aborted due to stage failure: Task 0

Re: Job aborted due to stage failure

2014-12-02 Thread Xuefu Zhang
Could you provide details on how to reproduce the issue, such as the exact
Spark branch, the command used to build Spark, how you built Hive, and what
queries/commands you ran?

We are running Hive on Spark all the time. Our pre-commit test runs without
any issue.

Thanks,
Xuefu

On Tue, Dec 2, 2014 at 4:13 AM, yuemeng1 yueme...@huawei.com wrote:

  Hi Xuefu,
 I checked out a Spark branch from the Spark GitHub repo (tag v1.2.0-snapshot0) and
 compared this Spark's pom.xml with spark-parent-1.2.0-SNAPSHOT.pom (from
 http://ec2-50-18-79-139.us-west-1.compute.amazonaws.com/data/spark_2.10-1.2-SNAPSHOT/org/apache/spark/spark-parent/1.2.0-SNAPSHOT/), and
 the only difference is the following:
 in spark-parent-1.2.0-SNAPSHOT.pom
   <artifactId>spark-parent</artifactId>
   <version>1.2.0-SNAPSHOT</version>
 and in v1.2.0-snapshot0
   <artifactId>spark-parent</artifactId>
   <version>1.2.0</version>
 I think there is no essential difference, so I built v1.2.0-snapshot0 and deployed
 it as my Spark cluster.
 When I run a query joining two tables, it still gives the same error I
 showed you earlier.

 Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most
 recent failure: Lost task 0.3 in stage 1.0 (TID 7, datasight18):
 java.lang.NullPointerException+details

 Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most 
 recent failure: Lost task 0.3 in stage 1.0 (TID 7, datasight18): 
 java.lang.NullPointerException
   at 
 org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:255)
   at 
 org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:437)
   at 
 org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:430)
   at 
 org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:587)
   at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:233)
   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
   at java.lang.Thread.run(Thread.java:722)

 Driver stacktrace:



 I think my Spark cluster doesn't have any problem, so why does it always
 give me this error?

 On 2014/12/2 13:39, Xuefu Zhang wrote:

  You need to build your Spark assembly from the Spark 1.2 branch. This should
 give you both a Spark build as well as the spark-assembly jar, which you need
 to copy to Hive's lib directory. A snapshot is fine, since Spark 1.2 hasn't been
 released yet.
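
 In practice, that copy step might look something like this (a sketch assuming
 $HIVE_HOME points at the Hive installation and a branch-1.2 assembly build):

   cp assembly/target/scala-2.10/spark-assembly-1.2.0-SNAPSHOT-hadoop2.4.0.jar \
      $HIVE_HOME/lib/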

  --Xuefu

 On Mon, Dec 1, 2014 at 7:41 PM, yuemeng1 yueme...@huawei.com wrote:



 Hi Xuefu,
 thanks a lot for your information. As far as I know, the latest Spark
 version on GitHub is spark-snapshot-1.3; there is no spark-1.2, only
 a branch-1.2 with spark-snapshot-1.2. Can you tell me which Spark version I
 should build? For now, this
 spark-assembly-1.2.0-SNAPSHOT-hadoop2.4.0.jar still produces the error above.


 On 2014/12/2 11:03, Xuefu Zhang wrote:

  It seems that the wrong class, HiveInputFormat, is loaded. The stacktrace
 is way off the current Hive code. You need to build Spark 1.2 and copy the
 spark-assembly jar to Hive's lib directory, and that's it.

  --Xuefu

 On Mon, Dec 1, 2014 at 6:22 PM, yuemeng1 yueme...@huawei.com wrote:

  Hi, I built a Hive on Spark package, and my Spark assembly jar is
 spark-assembly-1.2.0-SNAPSHOT-hadoop2.4.0.jar. When I run a query in the hive
 shell, before executing the query
 I set everything that Hive requires to work with Spark, and then I execute a join
 query:
 select distinct st.sno,sname from student st join score sc
 on(st.sno=sc.sno) where sc.cno IN(11,12,13) and st.sage  28;
 but it failed,
 with the following error in the Spark web UI:
 Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times,
 most recent failure: Lost task 0.3 in stage 1.0 (TID 7, datasight18):
 java.lang.NullPointerException+details

 Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most 
 recent failure: Lost task 0.3 in stage 1.0 (TID 7, datasight18): 
 java.lang.NullPointerException
 at 
 org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:255

Re: Job aborted due to stage failure

2014-12-02 Thread Xuefu Zhang
When you build Spark, remove -Phive as well as -Pyarn. When you run hive
queries, you may need to run set spark.home=/path/to/spark/dir;
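
Putting that together, a rough sketch of the rebuild and session setup (the paths
and master URL are placeholders; keep whatever other profiles you already use):

  # rebuild the Spark assembly without the -Phive and -Pyarn profiles
  mvn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package

  # after copying the new spark-assembly jar into Hive's lib directory,
  # in the Hive shell:
  set hive.execution.engine=spark;
  set spark.home=/path/to/spark/dir;
  set spark.master=spark://<master-host>:7077;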

Thanks,
Xuefu

On Tue, Dec 2, 2014 at 6:29 PM, yuemeng1 yueme...@huawei.com wrote:

  Hi Xuefu, thanks a lot for your help. Here is more detail to
 reproduce this issue:
 1) I checked out the spark branch from the Hive GitHub repo (
 https://github.com/apache/hive/tree/spark) on Nov 29, because the current
 version gives an error (Caused by:
 java.lang.RuntimeException: Unable to instantiate
 org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient),
 and the build command was: mvn clean package -DskipTests -Phadoop-2 -Pdist.
 After the build I took the package from
 /home/ym/hive-on-spark/hive1129/hive/packaging/target (apache-hive-0.15.0-SNAPSHOT-bin.tar.gz).
 2) I checked out Spark from
 https://github.com/apache/spark/tree/v1.2.0-snapshot0, because the Spark
 branch-1.2 carries the spark-parent version 1.2.1-SNAPSHOT, so I chose
 v1.2.0-snapshot0. I compared this Spark's pom.xml with
 spark-parent-1.2.0-SNAPSHOT.pom (from
 http://ec2-50-18-79-139.us-west-1.compute.amazonaws.com/data/spark_2.10-1.2-SNAPSHOT/org/apache/spark/spark-parent/1.2.0-SNAPSHOT/), and
 the only difference is the spark-parent version. The build command was:

 mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package

 3) Commands I executed in the hive shell:
 ./hive --auxpath
 /opt/hispark/spark/assembly/target/scala-2.10/spark-assembly-1.2.0-hadoop2.4.0.jar
 (this jar has already been copied to the Hive lib directory)
 create table student(sno int,sname string,sage int,ssex string) row format
 delimited FIELDS TERMINATED BY ',';
 create table score(sno int,cno int,sage int) row format delimited FIELDS
 TERMINATED BY ',';
 load data local inpath
 '/home/hive-on-spark/temp/spark-1.2.0/examples/src/main/resources/student.txt'
 into table student;
 load data local inpath
 '/home/hive-on-spark/temp/spark-1.2.0/examples/src/main/resources/score.txt'
 into table score;
 set hive.execution.engine=spark;
 set spark.master=spark://10.175.xxx.xxx:7077;
 set spark.eventLog.enabled=true;
 set spark.executor.memory=9086m;
 set spark.serializer=org.apache.spark.serializer.KryoSerializer;
 select distinct st.sno,sname from student st join score sc
 on(st.sno=sc.sno) where sc.cno IN(11,12,13) and st.sage  28;(work in mr)
 4)
 student.txt file
 1,rsh,27,female
 2,kupo,28,male
 3,astin,29,female
 4,beike,30,male
 5,aili,31,famle

 score.txt file
 1,10,80
 2,11,85
 3,12,90
 4,13,95
 5,14,100


 On 2014/12/2 23:28, Xuefu Zhang wrote:

  Could you provide details on how to reproduce the issue? such as the
 exact spark branch, the command to build Spark, how you build Hive, and
 what queries/commands you run.

  We are running Hive on Spark all the time. Our pre-commit test runs
 without any issue.

  Thanks,
  Xuefu

 On Tue, Dec 2, 2014 at 4:13 AM, yuemeng1 yueme...@huawei.com wrote:

  Hi Xuefu,
 I checked out a Spark branch from the Spark GitHub repo (tag v1.2.0-snapshot0) and
 compared this Spark's pom.xml with spark-parent-1.2.0-SNAPSHOT.pom (from
 http://ec2-50-18-79-139.us-west-1.compute.amazonaws.com/data/spark_2.10-1.2-SNAPSHOT/org/apache/spark/spark-parent/1.2.0-SNAPSHOT/), and
 the only difference is the following:
 in spark-parent-1.2.0-SNAPSHOT.pom
   <artifactId>spark-parent</artifactId>
   <version>1.2.0-SNAPSHOT</version>
 and in v1.2.0-snapshot0
   <artifactId>spark-parent</artifactId>
   <version>1.2.0</version>
 I think there is no essential difference, so I built v1.2.0-snapshot0 and deployed
 it as my Spark cluster.
 When I run a query joining two tables, it still gives the same error I
 showed you earlier.

 Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times,
 most recent failure: Lost task 0.3 in stage 1.0 (TID 7, datasight18):
 java.lang.NullPointerException+details

 Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most 
 recent failure: Lost task 0.3 in stage 1.0 (TID 7, datasight18): 
 java.lang.NullPointerException
  at 
 org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:255)
  at 
 org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:437)
  at 
 org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:430)
  at 
 org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:587)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:233)
  at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
  at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
  at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:230
