RE: Partition performance

Peter Marron Thu, 04 Jul 2013 02:26:51 -0700

Sorry, just caught up with the last couple of day’s email and I feel that this 
question
has already been answered fairly comprehensively. Apologies.

Z

From: Peter Marron [mailto:peter.mar...@trilliumsoftware.com]
Sent: 04 July 2013 08:37
To: user@hive.apache.org
Subject: RE: Partition performance

Hi,

Just to check that I understand this problem, my reading suggests that the 
overhead of
many partitions is currently unavoidable. Specifically this means that any 
query on a table that has, let’s say, 10,000 partitions
will be significantly slower (than on un-partitioned table with the “same” 
data) even if
the query explicitly specifies a single partition.
(I mean I _could_ actually do the experiments myself…)

Regards,

Z

From: Owen O'Malley [mailto:omal...@apache.org]
Sent: 02 July 2013 15:52
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Partition performance

On Tue, Jul 2, 2013 at 2:34 AM, Peter Marron 
<peter.mar...@trilliumsoftware.com<mailto:peter.mar...@trilliumsoftware.com>> 
wrote:
Hi Owen,

I’m curious about this advice about partitioning. Is there some fundamental 
reason why Hive
is slow when the number of partitions is 10,000 rather than 1,000?

The precise numbers don't matter. I wanted to give people a ballpark range that 
they should be looking at. Most tables at 1000 partitions won't cause big slow 
downs, but the cost scales with the number of partitions. By the time you are 
at 10,000 the cost is noticeable. I have one customer who has a table with 1.2 
million partitions. That causes a lot of slow downs.

And the improvements
that you mention are they going to be in version 12? Is there a JIRA raised so 
that I can track them?
(It’s not currently a problem for me but I can see that I am going to need to 
be able to explain the situation.)

I think this is the one they will use: 
https://issues.apache.org/jira/browse/HIVE-4051

-- Owen

RE: Partition performance

Reply via email to