Re: ORC Transaction Table - Spark

2017-08-23 Thread Aviral Agarwal
So, there is no way right now for Spark to read Hive 2.x data?

On Thu, Aug 24, 2017 at 12:17 AM, Eugene Koifman 
wrote:

> This looks like you have some data written by Hive 2.x, and Hive 1.x code
> trying to read it.
>
> That is not supported.
>
>
>
> *From: *Aviral Agarwal 
> *Reply-To: *"user@hive.apache.org" 
> *Date: *Wednesday, August 23, 2017 at 12:24 AM
> *To: *"user@hive.apache.org" 
> *Subject: *Re: ORC Transaction Table - Spark
>
>
>
> Hi,
>
> Yes, it is caused by the naming convention of the delta directory:
>
> /apps/hive/warehouse/foo.db/bar/year=2017/month=5/delta_0645253_0645253_0001
>
> How do I solve this ?
>
> Thanks !
> Aviral Agarwal
>
>
>
> On Tue, Aug 22, 2017 at 11:50 PM, Eugene Koifman 
> wrote:
>
> Could you do a recursive “ls” on the table or partition that you are trying
> to read?
>
> Most likely you have files that don’t follow the expected naming convention.
>
>
>
> Eugene
>
>
>
>
>
> *From: *Aviral Agarwal 
> *Reply-To: *"user@hive.apache.org" 
> *Date: *Tuesday, August 22, 2017 at 5:39 AM
> *To: *"user@hive.apache.org" 
> *Subject: *ORC Transaction Table - Spark
>
>
>
> Hi,
>
>
>
> I am trying to read a Hive ORC transactional table through Spark, but I am
> getting the following error
>
>
> Caused by: java.lang.RuntimeException: serious problem
> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> .
> Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "0645253_0001"
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
> ... 118 more
>
>
> Any help would be appreciated.
>
> Thanks and Regards,
> Aviral Agarwal
>
>
>


Re: How to optimize multiple count( distinct col) in Hive SQL

2017-08-23 Thread panfei
By decreasing mapreduce.reduce.shuffle.parallelcopies from 20 to 5, it
seems that everything goes well, with no more OOM ~~
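
For reference, a minimal sketch of applying that setting per query from the Hive
session (it can also be set cluster-wide in mapred-site.xml); the value 5 is the
one described above:

-- limit the number of parallel map-output fetch threads per reducer
SET mapreduce.reduce.shuffle.parallelcopies=5;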

2017-08-23 17:19 GMT+08:00 panfei :

> The full error stack is below (also described here:
> https://issues.apache.org/jira/browse/MAPREDUCE-6108):
>
> This error does not reproduce every time; after retrying several times, the
> job finished successfully.
>
> 2017-08-23 17:16:03,574 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2
>   at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.OutOfMemoryError: Java heap space
>   at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
>   at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
>   at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
>   at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:305)
>   at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:295)
>   at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:514)
>   at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:336)
>   at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
>
> 2017-08-23 17:16:03,577 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task
>
>
> 2017-08-23 13:10 GMT+08:00 panfei :
>
>> Hi Gopal, Thanks for all the information and suggestion.
>>
>> The Hive version is 2.0.1, and we use Hive-on-MR as the execution engine.
>>
>> I think I should create an intermediate table which includes all the
>> dimensions (including the several kinds of ids), and then use Spark SQL to
>> calculate the distinct values separately (Spark SQL is really fast ~~).
>>
>> thanks again.
>>
>> 2017-08-23 12:56 GMT+08:00 Gopal Vijayaraghavan :
>>
>>> > COUNT(DISTINCT monthly_user_id) AS monthly_active_users,
>>> > COUNT(DISTINCT weekly_user_id) AS weekly_active_users,
>>> …
>>> > GROUPING_ID() AS gid,
>>> > COUNT(1) AS dummy
>>>
>>> There are two things which prevent Hive from optimizing multiple count
>>> distincts:
>>>
>>> another aggregate like a count(1), or a grouping set like a ROLLUP/CUBE.
>>>
>>> The multiple count distincts are rewritten into a ROLLUP internally by
>>> the CBO.
>>>
>>> https://issues.apache.org/jira/browse/HIVE-10901
>>>
>>> A single count distinct + other aggregates (like
>>> min,max,count,count_distinct in 1 pass) is fixed via
>>>
>>> https://issues.apache.org/jira/browse/HIVE-16654
>>>
>>> There's no optimizer rule to combine both those scenarios.
>>>
>>> https://issues.apache.org/jira/browse/HIVE-15045
>>>
>>> If you're using a Hive-1.x release branch, there's a possibility that the CBO
>>> doesn't kick in unless column stats are present; in the Hive-2.x series
>>> you'll notice that some of these optimizations are not driven by a cost
>>> function and are always applied if CBO is enabled.
>>>
>>> > is there any way to rewrite it to optimize the memory usage.
>>>
>>> If you want it to run through very slowly without errors, you can try
>>> disabling all in-memory aggregations.
>>>
>>> set hive.map.aggr=false;
>>>
>>> Cheers,
>>> Gopal
>>>
>>>
>>>
>>
>>
>> --
>> If you don't learn, you don't know.
>>
>
>
>
> --
> If you don't learn, you don't know.
>



-- 
If you don't learn, you don't know.


Re: LEFT JOIN and WHERE CLAUSE - How to handle

2017-08-23 Thread peter zhang
How about splitting your txn data into two parts: one for the transactions that
have currency info (just use an inner join), and the other for the transactions
that can't find currency info; then use a UNION ALL to combine the two parts,
as below:

SELECT TXN.ROW_NUM, TXN.CCY_CD, TXN.TXN_DT, CURRENCY.CNTRY_DESC
  FROM CURRENCY JOIN TXN ON (CURRENCY.CCY_CD = TXN.CCY_CD)
 WHERE TXN.TXN_DT BETWEEN CURRENCY.EFF_ST_DT AND CURRENCY.EFF_END_DT
 UNION ALL
SELECT ROW_NUM, CCY_CD, TXN_DT, null AS CNTRY_DESC
  FROM TXN
 WHERE NOT EXISTS (SELECT 1 FROM CURRENCY WHERE CURRENCY.CCY_CD = TXN.CCY_CD);
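
For comparison, a minimal sketch of the windowed-ranking approach mentioned in
the quoted reply below; it uses row_number() rather than rank() so exactly one
row per transaction is kept, and it assumes the date columns compare correctly
as dates:

SELECT ROW_NUM, CCY_CD, TXN_DT,
       CASE WHEN in_range = 0 THEN CNTRY_DESC END AS CNTRY_DESC
  FROM (SELECT t.ROW_NUM, t.CCY_CD, t.TXN_DT, c.CNTRY_DESC,
               CASE WHEN t.TXN_DT BETWEEN c.EFF_ST_DT AND c.EFF_END_DT
                    THEN 0 ELSE 1 END AS in_range,
               ROW_NUMBER() OVER (
                 PARTITION BY t.ROW_NUM
                 ORDER BY CASE WHEN t.TXN_DT BETWEEN c.EFF_ST_DT AND c.EFF_END_DT
                               THEN 0 ELSE 1 END,
                          c.EFF_END_DT DESC) AS rn
          FROM TXN t
          LEFT JOIN CURRENCY c ON (c.CCY_CD = t.CCY_CD)) ranked
 WHERE rn = 1;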

2017-08-23 23:47 GMT+08:00 Furcy Pin :

> Oh, in that case...
>
> (First I notice that you say you want all records in TXN but in the query
> you give, you perform your join the other way round.)
>
> This is a typical use case that SQL is not very good at handling...
>
> The solutions I see are:
>
> - use RANK as you suggested.
>   Note that Hive is smart enough to optimize it correctly:
>   if you use a filter WHERE Rank() < K, it will take the first K values in
> RAM and ignore the rest, rather than ranking everything and filtering
> afterwards.
>
> - perform a GROUP BY TXN.ROW_NUM so you have only one line per
> transaction. Aggregate the rest the way you like (max, max_by or arbitrary)
>   To handle the missing EUR line, just add a (OR CURRENCY.CCY_CD IS NULL)
> to your where clause (and reverse the left join).
>
> - Most complicated way to implement, but simplest to use afterwards:
>   replace your CURRENCY table with a table where you have only one row per
> currency, and store the multiple start/end dates in an array of structs,
>   then use a UDF that, given such an array and a date, returns the correct
> description. It only works if you don't have too many descriptions per
> currency, though.
>   This is where Spark comes in handy: it is simpler to define UDFs with it.
>
>
>
> On Wed, Aug 23, 2017 at 5:21 PM, Ramasubramanian Narayanan <
> ramasubramanian.naraya...@gmail.com> wrote:
>
>> Hi,
>>
>> TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT.
>> It needs to be equated.
>>
>>
>> regards,
>> Rams
>>
>> On Wed, Aug 23, 2017 at 7:55 PM, Furcy Pin 
>> wrote:
>>
>>> I would suggest to use a subquery
>>>
>>> WITH unique_currency AS (
>>>   SELECT
>>> CCY_CD,
>>> MAX(CNTRY_DESC) as CNTRY_DESC
>>>   FROM CURRENCY
>>>   GROUP BY CCY_CD
>>> )
>>>
>>> and then perform your left join on it.
>>>
>>> Some SQL engines (e.g. Presto) have aggregation functions like
>>> arbitrary(col) that take any value and are a little less costly than a
>>> max.
>>> Sometimes, they also have functions like max_by(x, y) that
>>> would allow you to get the most recent description.
>>>
>>> It is a shame that this function is not included in Hive yet, but you
>>> can still find some UDAF implementations on GitHub.
>>>
>>>
>>>
>>> On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <
>>> ramasubramanian.naraya...@gmail.com> wrote:
>>>
 Hi,

 Need your suggestion on the below.

 Have two tables TXN and CURRENCY.

 Need all records in TXN and hence doing Left Join with CURRENCY.

 *Two problems :*
 1. The CURRENCY table may contain duplicate records, hence it needs to be
 handled through RANK or some other function.
 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
 clause' then we will lose the EUR records, which should not happen.

 Please suggest a solution to overcome both problems. For
 duplicated records it is fine if we select any of the CNTRY_DESC.

 *Table : CURRENCY*

 CCY_CD | CNTRY_DESC              | EFF_ST_DT | EFF_END_DT
 INR    | Indian Rupee            | 1-Jan-15  | 20-Feb-16
 INR    | Indian Rupee New        | 21-Feb-16 | 20-Feb-99
 USD    | US Dollar               | 1-Jan-15  | 20-Feb-16
 SGD    | Singapore Dollar        | 1-Jan-15  | 20-Feb-17
 SGD    | Singapore Dollar New    | 15-Feb-17 | 20-Feb-99
 SGD    | Singapore Dollar Latest | 16-Feb-17 | 16-Feb-17

 *Table : TXN*

 ROW_NUM | CCY_CD | TXN_DT
 1       | INR    | 16-Feb-17
 2       | USD    | 16-Feb-17
 3       | SGD    | 16-Feb-17
 4       | EUR    | 16-Feb-17

 *Expected Output*

 ROW_NUM | CCY_CD | TXN_DT    | CNTRY_DESC
 1       | INR    | 16-Feb-17 | Indian Rupee
 2       | USD    | 16-Feb-17 | US Dollar
 3       | SGD    | 16-Feb-17 | Singapore Dollar Latest (any of the three valid values is fine)
 4       | EUR    | 16-Feb-17 | NULL

One column into multiple columns.

2017-08-23 Thread Deepak Khandelwal
Can someone tell me the best way to implement the below in Hive?

How can we take input from column c1 of table t1, where c1 has multiple
values delimited by a pipe? Each of the delimited values from column c1 of table
t1 needs to be inserted into a separate column in table t2.

I can write a UDF for this, but I would have to call the UDF 10 times to retrieve
the 10 delimited values from c1 of table t1 and then insert them into 10
separate columns of table t2. I am just trying to find out if there is a
better way to do this so that I don't have to call the UDF 10 times.

Thanks
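
One way to avoid the repeated UDF calls is to split c1 once with Hive's built-in
split() function in a subquery; a minimal sketch, assuming t2 has ten target
columns (their real names are not given above):

INSERT INTO TABLE t2
SELECT parts[0], parts[1], parts[2], parts[3], parts[4],
       parts[5], parts[6], parts[7], parts[8], parts[9]
  FROM (SELECT split(c1, '\\|') AS parts   -- split() takes a regex, so the pipe is escaped
          FROM t1) s;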


Re: ORC Transaction Table - Spark

2017-08-23 Thread Eugene Koifman
This looks like you have some data written by Hive 2.x, and Hive 1.x code trying
to read it.
That is not supported.

From: Aviral Agarwal 
Reply-To: "user@hive.apache.org" 
Date: Wednesday, August 23, 2017 at 12:24 AM
To: "user@hive.apache.org" 
Subject: Re: ORC Transaction Table - Spark

Hi,

Yes, it is caused by the naming convention of the delta directory:

/apps/hive/warehouse/foo.db/bar/year=2017/month=5/delta_0645253_0645253_0001

How do I solve this ?

Thanks !
Aviral Agarwal

On Tue, Aug 22, 2017 at 11:50 PM, Eugene Koifman wrote:
Could you do a recursive “ls” on the table or partition that you are trying to
read?
Most likely you have files that don’t follow the expected naming convention.

Eugene


From: Aviral Agarwal
Reply-To: "user@hive.apache.org"
Date: Tuesday, August 22, 2017 at 5:39 AM
To: "user@hive.apache.org"
Subject: ORC Transaction Table - Spark

Hi,

I am trying to read a Hive ORC transactional table through Spark, but I am
getting the following error

Caused by: java.lang.RuntimeException: serious problem
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
.
Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "0645253_0001"
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
... 118 more

Any help would be appreciated.

Thanks and Regards,
Aviral Agarwal



Re: LEFT JOIN and WHERE CLAUSE - How to handle

2017-08-23 Thread Ramasubramanian Narayanan
Hi,

TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT. It
needs to be equated.


regards,
Rams

On Wed, Aug 23, 2017 at 7:55 PM, Furcy Pin  wrote:

> I would suggest to use a subquery
>
> WITH unique_currency AS (
>   SELECT
> CCY_CD,
> MAX(CNTRY_DESC) as CNTRY_DESC
>   FROM CURRENCY
>   GROUP BY CCY_CD
> )
>
> and then perform your left join on it.
>
> Some SQL engines (e.g. Presto) have aggregation functions like
> arbitrary(col) that take any value and are a little less costly than a
> max.
> Sometimes, they also have functions like max_by(x, y) that
> would allow you to get the most recent description.
>
> It is a shame that this function is not included in Hive yet, but you
> can still find some UDAF implementations on GitHub.
>
>
>
> On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <
> ramasubramanian.naraya...@gmail.com> wrote:
>
>> Hi,
>>
>> Need your suggestion on the below.
>>
>> Have two tables TXN and CURRENCY.
>>
>> Need all records in TXN and hence doing Left Join with CURRENCY.
>>
>> *Two problems :*
>> 1. The CURRENCY table may contain duplicate records, hence it needs to be
>> handled through RANK or some other function.
>> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
>> clause' then we will lose the EUR records, which should not happen.
>>
>> Please suggest a solution to overcome both problems. For duplicated
>> records it is fine if we select any of the CNTRY_DESC.
>>
>> *Table : CURRENCY*
>>
>> CCY_CD | CNTRY_DESC              | EFF_ST_DT | EFF_END_DT
>> INR    | Indian Rupee            | 1-Jan-15  | 20-Feb-16
>> INR    | Indian Rupee New        | 21-Feb-16 | 20-Feb-99
>> USD    | US Dollar               | 1-Jan-15  | 20-Feb-16
>> SGD    | Singapore Dollar        | 1-Jan-15  | 20-Feb-17
>> SGD    | Singapore Dollar New    | 15-Feb-17 | 20-Feb-99
>> SGD    | Singapore Dollar Latest | 16-Feb-17 | 16-Feb-17
>>
>> *Table : TXN*
>>
>> ROW_NUM | CCY_CD | TXN_DT
>> 1       | INR    | 16-Feb-17
>> 2       | USD    | 16-Feb-17
>> 3       | SGD    | 16-Feb-17
>> 4       | EUR    | 16-Feb-17
>>
>> *Expected Output*
>>
>> ROW_NUM | CCY_CD | TXN_DT    | CNTRY_DESC
>> 1       | INR    | 16-Feb-17 | Indian Rupee
>> 2       | USD    | 16-Feb-17 | US Dollar
>> 3       | SGD    | 16-Feb-17 | Singapore Dollar Latest (any of the three valid values is fine)
>> 4       | EUR    | 16-Feb-17 | NULL
>>
>>
>>
>> *Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
>> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
>> where
>> TXN_DT between EFF_ST_DT and EFF_END_DT;
>>
>>
>>
>> This query will drop the "EUR" record because of the where clause used.
>> It cannot be handled with a CASE statement instead of the 'where clause', as we
>> have more than one record for 'SGD' when TXN_DT is 16-Feb.
>>
>> regards,
>> Rams
>>
>
>


Re: LEFT JOIN and WHERE CLAUSE - How to handle

2017-08-23 Thread Furcy Pin
I would suggest to use a subquery

WITH unique_currency AS (
  SELECT
    CCY_CD,
    MAX(CNTRY_DESC) as CNTRY_DESC
  FROM CURRENCY
  GROUP BY CCY_CD
)

and then perform your left join on it.
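
Spelled out, that could look like the following sketch (joining from TXN so
every transaction is kept, and leaving the date filter out so unmatched rows
such as EUR survive):

WITH unique_currency AS (
  SELECT
    CCY_CD,
    MAX(CNTRY_DESC) AS CNTRY_DESC
  FROM CURRENCY
  GROUP BY CCY_CD
)
SELECT t.ROW_NUM, t.CCY_CD, t.TXN_DT, c.CNTRY_DESC
  FROM TXN t
  LEFT JOIN unique_currency c ON (c.CCY_CD = t.CCY_CD);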

Some SQL engines (e.g. Presto) have aggregation functions like arbitrary(col)
that take any value and are a little less costly than a max.
Sometimes, they also have functions like max_by(x, y) that
would allow you to get the most recent description.

It is a shame that this function is not included in Hive yet, but you can
still find some UDAF implementations on GitHub.



On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <
ramasubramanian.naraya...@gmail.com> wrote:

> Hi,
>
> Need your suggestion on the below.
>
> Have two tables TXN and CURRENCY.
>
> Need all records in TXN and hence doing Left Join with CURRENCY.
>
> *Two problems :*
> 1. The CURRENCY table may contain duplicate records, hence it needs to be
> handled through RANK or some other function.
> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
> clause' then we will lose the EUR records, which should not happen.
>
> Please suggest a solution to overcome both problems. For duplicated
> records it is fine if we select any of the CNTRY_DESC.
>
> *Table : CURRENCY*
>
> CCY_CD | CNTRY_DESC              | EFF_ST_DT | EFF_END_DT
> INR    | Indian Rupee            | 1-Jan-15  | 20-Feb-16
> INR    | Indian Rupee New        | 21-Feb-16 | 20-Feb-99
> USD    | US Dollar               | 1-Jan-15  | 20-Feb-16
> SGD    | Singapore Dollar        | 1-Jan-15  | 20-Feb-17
> SGD    | Singapore Dollar New    | 15-Feb-17 | 20-Feb-99
> SGD    | Singapore Dollar Latest | 16-Feb-17 | 16-Feb-17
>
> *Table : TXN*
>
> ROW_NUM | CCY_CD | TXN_DT
> 1       | INR    | 16-Feb-17
> 2       | USD    | 16-Feb-17
> 3       | SGD    | 16-Feb-17
> 4       | EUR    | 16-Feb-17
>
> *Expected Output*
>
> ROW_NUM | CCY_CD | TXN_DT    | CNTRY_DESC
> 1       | INR    | 16-Feb-17 | Indian Rupee
> 2       | USD    | 16-Feb-17 | US Dollar
> 3       | SGD    | 16-Feb-17 | Singapore Dollar Latest (any of the three valid values is fine)
> 4       | EUR    | 16-Feb-17 | NULL
>
>
>
> *Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
> where
> TXN_DT between EFF_ST_DT and EFF_END_DT;
>
>
>
> This query will drop the "EUR" record because of the where clause used.
> It cannot be handled with a CASE statement instead of the 'where clause', as we
> have more than one record for 'SGD' when TXN_DT is 16-Feb.
>
> regards,
> Rams
>


LEFT JOIN and WHERE CLAUSE - How to handle

2017-08-23 Thread Ramasubramanian Narayanan
Hi,

Need your suggestion on the below.

Have two tables TXN and CURRENCY.

Need all records in TXN and hence doing Left Join with CURRENCY.

*Two problems :*
1. The CURRENCY table may contain duplicate records, hence it needs to be
handled through RANK or some other function.
2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
clause' then we will lose the EUR records, which should not happen.

Please suggest a solution to overcome both problems. For duplicated
records it is fine if we select any of the CNTRY_DESC.

*Table : CURRENCY*

CCY_CD | CNTRY_DESC              | EFF_ST_DT | EFF_END_DT
INR    | Indian Rupee            | 1-Jan-15  | 20-Feb-16
INR    | Indian Rupee New        | 21-Feb-16 | 20-Feb-99
USD    | US Dollar               | 1-Jan-15  | 20-Feb-16
SGD    | Singapore Dollar        | 1-Jan-15  | 20-Feb-17
SGD    | Singapore Dollar New    | 15-Feb-17 | 20-Feb-99
SGD    | Singapore Dollar Latest | 16-Feb-17 | 16-Feb-17

*Table : TXN*

ROW_NUM | CCY_CD | TXN_DT
1       | INR    | 16-Feb-17
2       | USD    | 16-Feb-17
3       | SGD    | 16-Feb-17
4       | EUR    | 16-Feb-17

*Expected Output*

ROW_NUM | CCY_CD | TXN_DT    | CNTRY_DESC
1       | INR    | 16-Feb-17 | Indian Rupee
2       | USD    | 16-Feb-17 | US Dollar
3       | SGD    | 16-Feb-17 | Singapore Dollar Latest (any of the three valid values is fine)
4       | EUR    | 16-Feb-17 | NULL



*Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
where
TXN_DT between EFF_ST_DT and EFF_END_DT;



This query will drop the "EUR" record because of the where clause used.
It cannot be handled with a CASE statement instead of the 'where clause', as we
have more than one record for 'SGD' when TXN_DT is 16-Feb.

regards,
Rams


Re: How to optimize multiple count( distinct col) in Hive SQL

2017-08-23 Thread panfei
The full error stack is below (also described here:
https://issues.apache.org/jira/browse/MAPREDUCE-6108):

This error does not reproduce every time; after retrying several times, the
job finished successfully.

2017-08-23 17:16:03,574 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:305)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:295)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:514)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:336)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)

2017-08-23 17:16:03,577 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task


2017-08-23 13:10 GMT+08:00 panfei :

> Hi Gopal, Thanks for all the information and suggestion.
>
> The Hive version is 2.0.1, and we use Hive-on-MR as the execution engine.
>
> I think I should create an intermediate table which includes all the
> dimensions (including the several kinds of ids), and then use Spark SQL to
> calculate the distinct values separately (Spark SQL is really fast ~~).
>
> thanks again.
>
> 2017-08-23 12:56 GMT+08:00 Gopal Vijayaraghavan :
>
>> > COUNT(DISTINCT monthly_user_id) AS monthly_active_users,
>> > COUNT(DISTINCT weekly_user_id) AS weekly_active_users,
>> …
>> > GROUPING_ID() AS gid,
>> > COUNT(1) AS dummy
>>
>> There are two things which prevent Hive from optimizing multiple count
>> distincts:
>>
>> another aggregate like a count(1), or a grouping set like a ROLLUP/CUBE.
>>
>> The multiple count distincts are rewritten into a ROLLUP internally by
>> the CBO.
>>
>> https://issues.apache.org/jira/browse/HIVE-10901
>>
>> A single count distinct + other aggregates (like
>> min,max,count,count_distinct in 1 pass) is fixed via
>>
>> https://issues.apache.org/jira/browse/HIVE-16654
>>
>> There's no optimizer rule to combine both those scenarios.
>>
>> https://issues.apache.org/jira/browse/HIVE-15045
>>
>> If you're using a Hive-1.x release branch, there's a possibility that the CBO
>> doesn't kick in unless column stats are present; in the Hive-2.x series
>> you'll notice that some of these optimizations are not driven by a cost
>> function and are always applied if CBO is enabled.
>>
>> > is there any way to rewrite it to optimize the memory usage.
>>
>> If you want it to run through very slowly without errors, you can try
>> disabling all in-memory aggregations.
>>
>> set hive.map.aggr=false;
>>
>> Cheers,
>> Gopal
>>
>>
>>
>
>
> --
> If you don't learn, you don't know.
>



-- 
If you don't learn, you don't know.