Hi,

 

I believe partitioning followed by hash clustering allows only up to 32 buckets
within a single partition?

 

HTH,

 

Mich

 

 


 

From: Daniel Harper [mailto:daniel.har...@bbc.co.uk] 
Sent: 15 April 2015 16:42
To: user@hive.apache.org
Subject: [Hive 0.13.1] - Explanation/confusion over "Fatal error occurred when node tried to create too many dynamic partitions" on small dataset with dynamic partitions

 

Hi there,

 

We've been encountering the exception:

 

Error: java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]: Fatal
error occurred when node tried to create too many dynamic partitions. The
maximum number of dynamic partitions is controlled by
hive.exec.max.dynamic.partitions and
hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 100

 

This happens on a very small dataset (180 lines), using the following setup:

 

CREATE TABLE enriched_data (
  enriched_json_data string
)
PARTITIONED BY (yyyy string, mm string, dd string, identifier string,
  sub_identifier string, unique_run_id string)
CLUSTERED BY (enriched_json_data) INTO 128 BUCKETS
LOCATION "${OUTDIR}";

 

INSERT OVERWRITE TABLE enriched_data PARTITION (yyyy, mm, dd, identifier,
  sub_identifier, unique_run_id)
SELECT ...
 

We've not seen this issue before (our datasets are normally billions of
lines), but here a very small amount of data is triggering it.
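
We realise the limits named in the error can presumably be raised per session, with something like:

-- possible workaround (illustrative values, not something we've tested):
-- raise the limits the error message refers to
SET hive.exec.max.dynamic.partitions=1000;
SET hive.exec.max.dynamic.partitions.pernode=200;

but we'd rather understand why 180 lines hits the limit at all than just bump the numbers.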

 

After looking at the code, it appears as if this condition is the one failing:
https://github.com/apache/hive/blob/branch-0.13/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L745

I downloaded and rebuilt the branch with a bit of debugging/stdout printing
of the contents of the valToPaths map, and it fails because there are 101
entries in it.

 

All the entries look like this:

yyyy=2015/mm=04/dd=09/identifier=1/sub-identifier=3/unique_run_id=df-345345/000047_0
yyyy=2015/mm=04/dd=09/identifier=1/sub-identifier=3/unique_run_id=df-345345/000048_0
yyyy=2015/mm=04/dd=09/identifier=1/sub-identifier=3/unique_run_id=df-345345/000049_0
yyyy=2015/mm=04/dd=09/identifier=1/sub-identifier=3/unique_run_id=df-345345/000051_0
...

 

We're just confused as to why Hive considers the final part of the output
path (e.g. 000047_0) to be a "dynamic partition", as this is not in our
PARTITIONED BY clause.

 

The only thing I can think of is that the CLUSTERED BY ... INTO 128 BUCKETS
clause, combined with the dataset being so small (180 lines), means everything
is loaded by a single reducer task - and because the hash of each line
distributes the rows fairly uniformly, that one reducer ends up with more than
100 bucket files to write.
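
If that is what's happening, one mitigation we could imagine (a sketch only, not something we've tested) is to declare far fewer buckets for these small runs, so a single reducer never writes more bucket files than the per-node partition limit (100 in our case):

-- hypothetical alternative DDL for small datasets: bucket count kept
-- well below hive.exec.max.dynamic.partitions.pernode
CREATE TABLE enriched_data (
  enriched_json_data string
)
PARTITIONED BY (yyyy string, mm string, dd string, identifier string,
  sub_identifier string, unique_run_id string)
CLUSTERED BY (enriched_json_data) INTO 16 BUCKETS
LOCATION "${OUTDIR}";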

 

Any help would be greatly appreciated.

 

With thanks,

 

Daniel Harper

Software Engineer, OTG ANT

BC5 A5
