Re: How to optimize multiple count( distinct col) in Hive SQL

2017-08-22 Thread panfei
Hi Gopal, Thanks for all the information and suggestion.

The Hive version is 2.0.1, and we use Hive-on-MR as the execution engine.

I think I should create an intermediate table which includes all the
dimensions (including the several kinds of IDs), and then use spark-sql to
calculate the distinct counts separately (Spark SQL is really fast, so ~~).
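
A rough sketch of that approach (table and column names below are made up for
illustration):

CREATE TABLE tmp_activity AS
SELECT product_id, channel_id, server_id, platform,
       daily_user_id, daily_device_id, daily_ip, daily_role_id
FROM source_table;

-- then one separate pass per id column, e.g. in spark-sql:
SELECT product_id, channel_id, server_id, platform,
       COUNT(DISTINCT daily_user_id) AS daily_active_users
FROM tmp_activity
GROUP BY product_id, channel_id, server_id, platform;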

thanks again.

2017-08-23 12:56 GMT+08:00 Gopal Vijayaraghavan :

> > COUNT(DISTINCT monthly_user_id) AS monthly_active_users,
> > COUNT(DISTINCT weekly_user_id) AS weekly_active_users,
> …
> > GROUPING_ID() AS gid,
> > COUNT(1) AS dummy
>
> There are two things which prevent Hive from optimizing multiple count
> distincts.
>
> Another aggregate, like a count(1), or grouping sets, like a ROLLUP/CUBE.
>
> The multiple count distincts are rewritten into a ROLLUP internally by the
> CBO.
>
> https://issues.apache.org/jira/browse/HIVE-10901
>
> A single count distinct + other aggregates (like
> min,max,count,count_distinct in 1 pass) is fixed via
>
> https://issues.apache.org/jira/browse/HIVE-16654
>
> There's no optimizer rule to combine both those scenarios.
>
> https://issues.apache.org/jira/browse/HIVE-15045
>
> There's a possibility that you're using the Hive-1.x release branch, where the
> CBO doesn't kick in unless column stats are present; but in the Hive-2.x series
> you'll notice that some of these optimizations are not driven by a cost
> function and are always applied if CBO is enabled.
>
> > is there any way to rewrite it to optimize the memory usage.
>
> If you want it to run through very slowly without errors, you can try
> disabling all in-memory aggregations.
>
> set hive.map.aggr=false;
>
> Cheers,
> Gopal
>
>
>


-- 
不学习,不知道 (If you don't learn, you don't know.)


Re: How to optimize multiple count( distinct col) in Hive SQL

2017-08-22 Thread Gopal Vijayaraghavan
> COUNT(DISTINCT monthly_user_id) AS monthly_active_users,
> COUNT(DISTINCT weekly_user_id) AS weekly_active_users,
…
> GROUPING_ID() AS gid,
> COUNT(1) AS dummy

There are two things which prevent Hive from optimizing multiple count distincts.

Another aggregate, like a count(1), or grouping sets, like a ROLLUP/CUBE.

The multiple count distincts are rewritten into a ROLLUP internally by the CBO.

https://issues.apache.org/jira/browse/HIVE-10901
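
Roughly, for two distinct counts, the grouping-sets form of that rewrite looks
like this (a hand-written sketch with illustrative names, not the actual
operator-level rewrite the CBO produces):

-- original shape:
SELECT k, COUNT(DISTINCT a) AS dist_a, COUNT(DISTINCT b) AS dist_b
FROM t
GROUP BY k;

-- grouping-sets form: expand (k,a) and (k,b) into separate grouping sets,
-- then count the non-NULL rows each set contributes.
SELECT k, COUNT(a) AS dist_a, COUNT(b) AS dist_b
FROM (
  SELECT k, a, b
  FROM t
  GROUP BY k, a, b
  GROUPING SETS ((k, a), (k, b))
) g
GROUP BY k;

This also hints at why an extra COUNT(1) or an explicit ROLLUP/CUBE in the same
query blocks the rewrite: a plain row count can't be recovered from the
expanded grouping-set rows.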

A single count distinct + other aggregates (like min,max,count,count_distinct 
in 1 pass) is fixed via 

https://issues.apache.org/jira/browse/HIVE-16654
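
That is, queries of this shape (names are illustrative):

SELECT product_id,
       MIN(event_time)               AS first_event,
       MAX(event_time)               AS last_event,
       COUNT(1)                      AS events,
       COUNT(DISTINCT daily_user_id) AS daily_active_users
FROM events_table
GROUP BY product_id;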

There's no optimizer rule to combine both those scenarios.

https://issues.apache.org/jira/browse/HIVE-15045

There's a possibility that you're using the Hive-1.x release branch, where the CBO 
doesn't kick in unless column stats are present; but in the Hive-2.x series you'll 
notice that some of these optimizations are not driven by a cost function and 
are always applied if CBO is enabled.

> is there any way to rewrite it to optimize the memory usage.

If you want it to run through very slowly without errors, you can try disabling 
all in-memory aggregations.

set hive.map.aggr=false;

Cheers,
Gopal




Fwd: How to optimize multiple count( distinct col) in Hive SQL

2017-08-22 Thread panfei
-- Forwarded message --
From: panfei 
Date: 2017-08-23 12:26 GMT+08:00
Subject: Fwd: How to optimize multiple count( distinct col) in Hive SQL
To: hive-...@hadoop.apache.org



-- Forwarded message --
From: panfei 
Date: 2017-08-23 12:26 GMT+08:00
Subject: How to optimize multiple count( distinct col) in Hive SQL
To: CDH Users 


SELECT product_id AS product_id,
channel_id AS channel_id,
server_id AS server_id,
platform AS platform,
COUNT(DISTINCT monthly_user_id) AS monthly_active_users,
COUNT(DISTINCT weekly_user_id) AS weekly_active_users,
COUNT(DISTINCT daily_user_id) AS daily_active_users,
COUNT(DISTINCT monthly_device_id) AS monthly_active_devices,
COUNT(DISTINCT weekly_device_id) AS weekly_active_devices,
COUNT(DISTINCT daily_device_id) AS daily_active_devices,
COUNT(DISTINCT monthly_ip) AS monthly_active_ips,
COUNT(DISTINCT weekly_ip) AS weekly_active_ips,
COUNT(DISTINCT daily_ip) AS daily_active_ips,
COUNT(DISTINCT monthly_role_id) AS monthly_active_roles,
COUNT(DISTINCT weekly_role_id) AS weekly_active_roles,
COUNT(DISTINCT daily_role_id) AS daily_active_roles,
GROUPING_ID() AS gid,
COUNT(1) AS dummy
FROM (.)
GROUP BY .



There are so many COUNT(DISTINCT xxx) expressions in the select clause. When
executing this SQL on a very large dataset, it reports an OOM (Java heap space) error.

I think it's the count(distinct xxx)s which occupy lots of memory in the heap
for deduplication. Am I right? Or is there any way to rewrite it to optimize
the memory usage?

Every worker node has 32 GB physical memory in total. Some related settings:

hive.optimize.distinct.rewrite=true;
mapreduce.map.memory.mb=1638;
mapreduce.reduce.memory.mb=3276;
yarn.nodemanager.resource.memory-mb=26208;
mapreduce.reduce.shuffle.memory.limit.percent=0.25;


thanks



-- 
不学习,不知道 (If you don't learn, you don't know.)





Re: ORC Transaction Table - Spark

2017-08-22 Thread Eugene Koifman
Could you do a recursive “ls” in the table or partition that you are trying to 
read? Most likely you have files that don’t follow the expected naming convention.

Eugene


From: Aviral Agarwal 
Reply-To: "user@hive.apache.org" 
Date: Tuesday, August 22, 2017 at 5:39 AM
To: "user@hive.apache.org" 
Subject: ORC Transaction Table - Spark

Hi,

I am trying to read a Hive ORC transactional table through Spark, but I am getting 
the following error:

Caused by: java.lang.RuntimeException: serious problem
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
.
Caused by: java.util.concurrent.ExecutionException: 
java.lang.NumberFormatException: For input string: "0645253_0001"
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
... 118 more

Any help would be appreciated.

Thanks and Regards,
Aviral Agarwal


Re: Aug. 2017 Hive User Group Meeting

2017-08-22 Thread dan young
Dooh..thanx!

On Tue, Aug 22, 2017, 11:11 AM Alan Gates  wrote:

> The address is at the top of the text description, even though it isn’t in
> the location field:
>
> 5470 Great America Parkway, Santa Clara, CA
>
> Alan.
>
> On Mon, Aug 21, 2017 at 5:50 PM, dan young  wrote:
>
>> For us out of town folks, where is the location of this meetup? Says
>> Hortonworks but do you have an address?
>>
>> Regards
>>
>> Dano
>>
>> On Mon, Aug 21, 2017, 1:33 PM Xuefu Zhang  wrote:
>>
>>> Dear Hive users and developers,
>>>
>>> As reminder, the next Hive User Group Meeting will occur this Thursday,
>>> Aug. 24. The agenda is available on the event page (
>>> https://www.meetup.com/Hive-User-Group-Meeting/events/242210487/).
>>>
>>> See you all there!
>>>
>>> Thanks,
>>> Xuefu
>>>
>>> On Tue, Aug 1, 2017 at 7:18 PM, Xuefu Zhang  wrote:
>>>
 Hi all,

 It's an honor to announce that Hive community is launching a Hive user
 group meeting in the bay area this month. The details can be found at
 https://www.meetup.com/Hive-User-Group-Meeting/events/242210487/.

 We are inviting talk proposals from Hive users as well as developers at
 this time. We currently have 5 openings.

 Please let me know if you have any questions or suggestions.

 Thanks,
 Xuefu


>>>
>


Re: Hive on Spark

2017-08-22 Thread Vihang Karajgaonkar
Xuefu is planning to give a talk on Hive-on-Spark @ Uber at the user meetup
this week. We can check if he can share the presentation on this list for
folks who can't attend the meetup.

https://www.meetup.com/Hive-User-Group-Meeting/events/242210487/


On Mon, Aug 21, 2017 at 11:44 PM, peter zhang 
wrote:

> Hi All,
> Has anybody used Hive on Spark in your production environment? How is
> its stability and performance compared with Spark SQL?
> Hope somebody can share their experience.
>
> Thanks in advance!
>


ORC Transaction Table - Spark

2017-08-22 Thread Aviral Agarwal
Hi,

I am trying to read a Hive ORC transactional table through Spark, but I am
getting the following error:

Caused by: java.lang.RuntimeException: serious problem
at
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
at
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
.
Caused by: java.util.concurrent.ExecutionException:
java.lang.NumberFormatException: For input string: "0645253_0001"
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
... 118 more

Any help would be appreciated.

Thanks and Regards,
Aviral Agarwal


Re: Hive index + Tez engine = no performance gain?!

2017-08-22 Thread Gopal Vijayaraghavan

TL;DR - A materialized view is a much more useful construct than trying to get 
limited indexes to work.

That is a pretty lively project which has been going on for a while with 
Druid+LLAP:

https://issues.apache.org/jira/browse/HIVE-14486

> This seems out of the blue but my initial benchmarks have shown that there's 
> no performance gain when Hive index is used with Tez engine.

Indexes in Hive broke once columnar formats came in, because there is no 
concept of directly locating a row in another table from an existing table - 
file+offsets doesn't work for a columnar format in any practical sense.

Once you have a fast format, these indexes are only good enough to answer 
queries directly from an index by maintaining a more compact copy of data, 
which is really not what an index is supposed to do.

> I think that index is a perfect solution for non-ORC file format since you 
> can selectively build an index table and leverage Tez to only look at those 
> blocks and/or files that we need to scan

Since LLAP is a distributed system, the concept of "looking up indexes and then 
looking at blocks" does not work the same way it would work for a single 
machine DB with common memory across all threads - the index file scans cannot 
go and specify which other blocks to read, because there is no single frame of 
reference for this.

We can only get meaningful speedups if the indexes are co-located with the blocks 
on the same machine and can reference column groups (10k rows in each chunk), 
which is exactly what ORC+LLAP manages to do with ORC's bloom filter indexes.
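
For reference, those bloom filter indexes are declared as ORC table properties
when the table is written - a sketch (table and column names are illustrative):

CREATE TABLE customer_orc (
  c_customer_id   STRING,
  c_birth_country STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.bloom.filter.columns'='c_birth_country',
  'orc.bloom.filter.fpp'='0.05'
);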

> we still want to have fast ad-hoc query via Hive LLAP / Tez

LLAP can do fast ad-hoc queries at around ~100 million rows/sec on a single 
node, which is pretty fast without needing a separate index + a lookup loop.

In a production cluster, with constant ETL, the global indexes were actually 
lowering throughput since they got stale every few minutes and rebuilding fell 
behind, because it is a lower-priority operation throughout.

Here's an experiment for you to try:

CREATE INDEX idx_COUNTRY 
ON TABLE customer(c_birth_country)
AS 'org.apache.hadoop.hive.ql.index.AggregateIndexHandler'
WITH DEFERRED REBUILD
IDXPROPERTIES('AGGREGATES'='count(c_birth_country)')
STORED AS ORC;

This does build an index stored in ORC (why not?), but it can only be built with 
MR because the index handlers weren't implemented in Tez.

Tez can still use those indexes, though.
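
The plan below is for a group-by aggregate on the indexed column - the exact
query wasn't included here, but it would be something of this shape:

SELECT c_birth_country, COUNT(c_birth_country)
FROM customer
GROUP BY c_birth_country;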

0: jdbc:hive2://localhost:10007/tpcds_bin_par> set hive.optimize.index.groupby=true;

0: jdbc:hive2://localhost:10007/tpcds_bin_par>
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)

Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Reducer 2 vectorized
      File Output Operator [FS_8]
        Group By Operator [GBY_7] (rows=1060 width=9536)
          Output:["_col0","_col1"],aggregations:["sum(VALUE._col0)"],keys:KEY._col0
        <-Map 1 [SIMPLE_EDGE]
          SHUFFLE [RS_3]
            PartitionCols:_col0
            Group By Operator [GBY_2] (rows=2120 width=9536)
              Output:["_col0","_col1"],aggregations:["sum(_count_of_c_birth_country)"],keys:c_birth_country
              Select Operator [SEL_1] (rows=2120 width=9536)
                Output:["c_birth_country","_count_of_c_birth_country"]
                TableScan [TS_0] (rows=2120 width=9536)

Hive on Spark

2017-08-22 Thread peter zhang
Hi All,
Has anybody used Hive on Spark in your production environment? How is
its stability and performance compared with Spark SQL?
Hope somebody can share their experience.

Thanks in advance!