Hi German & Thomas,
It seems I found the data that causes the error, but I still don't know the
exact reason.
I just did a GROUP in Pig Latin:

domain_device_group = GROUP data_filter BY (custid, domain, level, device);
domain_device = FOREACH domain_device_group {
    distinct_ip = DISTINCT data_filter.ip;
    distinct_userid = DISTINCT data_filter.userid;
    GENERATE group.custid, group.domain, group.level, group.device,
             COUNT_STAR(data_filter), COUNT_STAR(distinct_ip),
             COUNT_STAR(distinct_userid);
};
STORE domain_device INTO '$outputdir/$batchdate/data/domain_device' USING PigStorage('\t');
The group key (custid, domain, level, device) is significantly skewed: about
42% (58,621,533 / 138,455,355) of the records share the same key, and only
the reducer that handles this key failed.
But from
https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-6/shuffle-and-sort
I still have no idea why this causes an OOM. The chapter doesn't say how a
skewed key is handled, nor how different keys in the same reducer are merged.
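
One rewrite I'm considering (a sketch only, untested; it reuses the relation
and field names from the script above): replace the nested DISTINCT with
plain DISTINCT + GROUP statements. COUNT_STAR is algebraic, so the combiner
can pre-aggregate map-side and the reducer that owns the hot key receives
partial counts instead of one 58M-tuple bag.

-- total records per key
all_grp   = GROUP data_filter BY (custid, domain, level, device);
total_cnt = FOREACH all_grp GENERATE
            FLATTEN(group) AS (custid, domain, level, device),
            COUNT_STAR(data_filter) AS records;

-- distinct ips per key: DISTINCT partitions on the whole (key, ip) tuple,
-- so the skewed group key alone no longer lands on a single reducer
ip_pairs  = FOREACH data_filter GENERATE custid, domain, level, device, ip;
ip_dist   = DISTINCT ip_pairs;
ip_grp    = GROUP ip_dist BY (custid, domain, level, device);
ip_cnt    = FOREACH ip_grp GENERATE
            FLATTEN(group) AS (custid, domain, level, device),
            COUNT_STAR(ip_dist) AS distinct_ips;

-- same pattern for userid
uid_pairs = FOREACH data_filter GENERATE custid, domain, level, device, userid;
uid_dist  = DISTINCT uid_pairs;
uid_grp   = GROUP uid_dist BY (custid, domain, level, device);
uid_cnt   = FOREACH uid_grp GENERATE
            FLATTEN(group) AS (custid, domain, level, device),
            COUNT_STAR(uid_dist) AS distinct_userids;

-- stitch the three counts back together on the group key
joined        = JOIN total_cnt BY (custid, domain, level, device),
                     ip_cnt    BY (custid, domain, level, device),
                     uid_cnt   BY (custid, domain, level, device);
domain_device = FOREACH joined GENERATE
                total_cnt::custid, total_cnt::domain,
                total_cnt::level, total_cnt::device,
                records, distinct_ips, distinct_userids;

If anyone sees a problem with this approach, please let me know.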
[email protected]
From: [email protected]
Date: 2014-04-15 23:35
To: user; th; german.fl
Subject: Re: RE: memoryjava.lang.OutOfMemoryError related with number of
reducer?
Thanks, let me take a careful look at it.
[email protected]
From: German Florez-Larrahondo
Date: 2014-04-15 23:27
To: user; 'th'
Subject: RE: Re: memoryjava.lang.OutOfMemoryError related with number of
reducer?
Lei
A good explanation of this can be found in Hadoop: The Definitive Guide by
Tom White.
Here is an excerpt that explains the reduce-side behavior and some possible
tweaks to control it.
https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-6/shuffle-and-sort
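
For quick reference, these are the reduce-side shuffle knobs that section
covers (MR1 property names; the values shown are the defaults, so verify
them against your Hadoop version):

mapred.job.shuffle.input.buffer.percent=0.70  # fraction of reducer heap for map outputs during the copy phase
mapred.job.shuffle.merge.percent=0.66         # buffer usage threshold that triggers an in-memory merge
mapred.inmem.merge.threshold=1000             # number of in-memory map outputs that triggers a merge
mapred.job.reduce.input.buffer.percent=0.0    # heap fraction for retaining map outputs during the reduce itself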
From: [email protected] [mailto:[email protected]]
Sent: Tuesday, April 15, 2014 9:29 AM
To: user; th
Subject: Re: Re: memoryjava.lang.OutOfMemoryError related with number of
reducer?
Thanks Thomas.
Another question: I have no idea what "Failed to merge in memory" means. Is
the 'merge' part of the shuffle phase on the reducer side? Why does it happen
in memory?
Apart from the two methods (increasing the reducer number and increasing the
heap size), are there any other alternatives to fix this issue?
Thanks a lot.
[email protected]
From: Thomas Bentsen
Date: 2014-04-15 21:53
To: user
Subject: Re: memoryjava.lang.OutOfMemoryError related with number of reducer?
When you increase the number of reducers, they each have less to work
with, provided the data is distributed evenly between them - in this case
roughly 24/84, about one third of the original work.
It is essentially the same thing as increasing the heap size - the memory is
just distributed between more reducers.
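
If it helps, in Pig the reducer count can be set either script-wide or per
statement (a sketch; pick a value that fits your cluster):

SET default_parallel 84;  -- applies to every MR job the script spawns
-- or only on the expensive statement:
domain_device_group = GROUP data_filter BY (custid, domain, level, device) PARALLEL 84;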
/th
On Tue, 2014-04-15 at 20:41 +0800, [email protected] wrote:
> I can fix this by changing the heap size.
> But what confuses me is that when I change the reducer number from 24
> to 84, the error does not occur.
>
>
> Any insight on this?
>
>
> Thanks
> Lei
> Failed to merge in memory
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:2786)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
> at java.io.DataOutputStream.write(DataOutputStream.java:90)
> at java.io.DataOutputStream.writeUTF(DataOutputStream.java:384)
> at java.io.DataOutputStream.writeUTF(DataOutputStream.java:306)
> at org.apache.pig.data.utils.SedesHelper.writeChararray(SedesHelper.java:66)
> at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:543)
> at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:435)
> at org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135)
> at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:613)
> at org.apache.pig.data.BinInterSedes.writeBag(BinInterSedes.java:604)
> at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:447)
> at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:435)
> at org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135)
> at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:613)
> at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:443)
> at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:435)
> at org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135)
> at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:613)
> at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:443)
> at org.apache.pig.data.BinSedesTuple.write(BinSedesTuple.java:41)
> at org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:123)
> at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:100)
> at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:84)
> at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:188)
> at org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:1145)
> at org.apache.hadoop.mapred.Task$NewCombinerRunner$OutputConverter.write(Task.java:1456)
> at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
> at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:99)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:201)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:163)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
>
> ______________________________________________________________________
> [email protected]