[ https://issues.apache.org/jira/browse/PIG-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair resolved PIG-1442.
--------------------------------

    Resolution: Duplicate

> java.lang.OutOfMemoryError: Java heap space (Reopen of PIG-766)
> ---------------------------------------------------------------
>
>                 Key: PIG-1442
>                 URL: https://issues.apache.org/jira/browse/PIG-1442
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.2.0, 0.7.0
>         Environment: Apache Hadoop 0.20.2 + Pig 0.7.0 (and also Pig 0.8.0-dev, 18 May);
>                      Hadoop 0.18.3 (Cloudera RPMs) + Pig 0.2.0
>            Reporter: Dirk Schmid
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>
> As mentioned by Ashutosh, this is a reopen of https://issues.apache.org/jira/browse/PIG-766, because there is still a problem that causes Pig to scale only with the available memory.
> For convenience, here is the last entry from the PIG-766 JIRA ticket:
> {quote}1. Are you getting the exact same stack trace as mentioned in the jira?{quote}
> Yes, the same and some similar traces:
> {noformat}
> java.lang.OutOfMemoryError: Java heap space
>       at java.util.Arrays.copyOf(Arrays.java:2786)
>       at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
>       at java.io.DataOutputStream.write(DataOutputStream.java:90)
>       at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
>       at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:279)
>       at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
>       at org.apache.pig.data.DefaultAbstractBag.write(DefaultAbstractBag.java:249)
>       at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:214)
>       at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
>       at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:209)
>       at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
>       at org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:123)
>       at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
>       at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
>       at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:179)
>       at org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:880)
>       at org.apache.hadoop.mapred.Task$NewCombinerRunner$OutputConverter.write(Task.java:1201)
>       at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>       at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:199)
>       at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161)
>       at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
>       at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>       at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
>       at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2563)
>       at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501)
> java.lang.OutOfMemoryError: Java heap space
>       at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:58)
>       at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
>       at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:61)
>       at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142)
>       at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>       at org.apache.pig.data.DefaultAbstractBag.readFields(DefaultAbstractBag.java:263)
>       at org.apache.pig.data.DataReaderWriter.bytesToBag(DataReaderWriter.java:71)
>       at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:145)
>       at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>       at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:63)
>       at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142)
>       at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>       at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:284)
>       at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
>       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>       at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
>       at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:163)
>       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:155)
>       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMultiQueryPackage.getNext(POMultiQueryPackage.java:242)
>       at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:170)
>       at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161)
>       at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
>       at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>       at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
>       at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2563)
>       at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501)
> java.lang.OutOfMemoryError: Java heap space
>       at java.util.ArrayList.<init>(ArrayList.java:112)
>       at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:58)
>       at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
>       at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:61)
>       at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142)
>       at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>       at org.apache.pig.data.DefaultAbstractBag.readFields(DefaultAbstractBag.java:263)
>       at org.apache.pig.data.DataReaderWriter.bytesToBag(DataReaderWriter.java:71)
>       at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:145)
>       at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>       at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:284)
>       at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:221)
>       at org.apache.pig.builtin.Distinct.getDistinctFromNestedBags(Distinct.java:138)
>       at org.apache.pig.builtin.Distinct.access$200(Distinct.java:40)
>       at org.apache.pig.builtin.Distinct$Intermediate.exec(Distinct.java:103)
>       at org.apache.pig.builtin.Distinct$Intermediate.exec(Distinct.java:96)
>       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:209)
>       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:250)
>       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:341)
>       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:289)
>       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
>       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:259)
>       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:217)
>       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:207)
>       at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:183)
>       at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161)
>       at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
>       at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>       at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
>       at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2563)
>       at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501)
> {noformat}
> {quote}
> 2. Which operations are you doing in your query - join, group-by, any others?
> 3. What load/store func are you using to read and write data? PigStorage or your own?
> 4. What is your data size and how much memory is available to your tasks?
> 5. Do you have very large records in your dataset, like hundreds of MB for one record?
> It would be great if you could paste here the script from which you get this exception.
> {quote}
> As we started to test the transformation (see below), the OutOfMemoryError first occurred with input datasets of about 150 MB.
> Increasing the memory for the child VMs by setting {{mapred.child.java.opts}} to {{-Xmx600m}} fixed it for a while.
> With larger input datasets the problem reappears.
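> For reference, a minimal sketch of that setting (assuming a stock Hadoop 0.20-style configuration; the heap value has to fit the number of task slots per node):
> {code}
> <!-- mapred-site.xml; the same property can also be set per job in the job configuration -->
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-Xmx600m</value>
> </property>
> {code}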
> *Input-Data:*
> A CSV file; one dataset is ~14 GB with ~100,000,000 records at ~145 bytes per record.
> *Example:*
> {noformat}
> USER_ID  REQUEST_DATE  SESSION  COMPANY  SERVICENAME  SECTION_1  SECTION_2  SECTION_3  SECTION_4  SECTION_5  SECTION_6  SECTION  SECTION_NEW
> ac14263e-22082-2263455080-9  2010-03-02  ac14263e-22082-2263455080-9-1273015305  ABC  (NULL)  main  (NULL)  (NULL)  (NULL)  (NULL)  abc/main/mail  /main/mail
> ...
> ...
> {noformat}
> *The Pig-Script:*
> {code}
> A = LOAD 'full_load' USING PigStorage('\t');
> B = FOREACH A GENERATE $4 AS servicename, $3 AS company, $2 AS session, $0 AS user_id,
>                        $5 AS section_1, $6 AS section_2, $7 AS section_3, $8 AS section_4,
>                        $9 AS section_5, $10 AS section_6, $11 AS section;
>
> /* 1st aggregation */
> S0 = GROUP B BY (servicename, company);
> S0_A = FOREACH S0 {
>                     unique_clients = DISTINCT B.user_id;
>                     visits = DISTINCT B.session;
>                     GENERATE FLATTEN(group), COUNT(B) AS pi_count,
>                              COUNT(unique_clients) AS unique_clients_count,
>                              COUNT(visits) AS visit_count;
>                   }
> S0_B = FOREACH S0_A GENERATE servicename, company, '' AS section_1, '' AS section_2,
>                              '' AS section_3, '' AS section_4, '' AS section_5,
>                              '' AS section_6, '' AS section, pi_count,
>                              unique_clients_count, visit_count, 0 AS level;
>
> /* 2nd aggregation */
> S1 = GROUP B BY (servicename, company, section_1);
> S1_A = FOREACH S1 {
>                     unique_clients = DISTINCT B.user_id;
>                     visits = DISTINCT B.session;
>                     GENERATE FLATTEN(group), COUNT(B) AS pi_count,
>                              COUNT(unique_clients) AS unique_clients_count,
>                              COUNT(visits) AS visit_count;
>                   }
> S1_B = FOREACH S1_A GENERATE servicename, company, section_1, '' AS section_2,
>                              '' AS section_3, '' AS section_4, '' AS section_5,
>                              '' AS section_6, '' AS section, pi_count,
>                              unique_clients_count, visit_count, 1 AS level;
>
> /* 3rd - 7th aggregations may follow here */
>
> /* build result */
> X = UNION S0_B, S1_B;
> STORE X INTO 'result' USING PigStorage('\t');
> {code}
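> A possible workaround sketch (untested; relation names are illustrative): the Distinct$Intermediate frames in the traces above come from the nested DISTINCTs, which make the combiner carry whole bags of user_ids/sessions per group. Computing each distinct count in its own top-level DISTINCT/GROUP pipeline may avoid materializing those bags in memory:
> {code}
> -- Sketch for the 1st aggregation's unique_clients_count only;
> -- visit_count would follow the same pattern on B.session.
> U      = FOREACH B GENERATE servicename, company, user_id;
> U_DIST = DISTINCT U;  -- global de-duplication, runs as its own MR stage
> U_GRP  = GROUP U_DIST BY (servicename, company);
> UC     = FOREACH U_GRP GENERATE FLATTEN(group), COUNT(U_DIST) AS unique_clients_count;
> {code}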

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
