For distincts, it can be hit or miss whether the combiner makes a job faster or not, since it causes more serialization to disk and if the number of duplicates is low, then that overhead was not worth that serialization.
The double group is the more scalable solution, but of course, all that matters is what works with your data :) 2012/1/18 Mario Lassnig <[email protected]> > Boom! > > SET pig.exec.nocombiner true; > > Now everything takes 5 minutes and finishes successfully. > > Thanks a lot, William! > > I'll also try the double group by approach tomorrow, > perhaps I can squeeze out a few more minutes. > > Thanks again and good night, > Mario > > > > On 12-1-18 23:14 , > william.dowling@**thomsonreuters.com<[email protected]>wrote: > >> Nested DISTINCT is a killer. See >> >> https://mail-archives.apache.**org/mod_mbox/pig-user/201201.**mbox/%** >> 3CCAKne9Z5unW03bk2qbyXNxBsevmL**cBVY1Se8_7TsDmDMEdHKh8A@mail.** >> gmail.com%3E<https://mail-archives.apache.org/mod_mbox/pig-user/201201.mbox/%3ccakne9z5unw03bk2qbyxnxbsevmlcbvy1se8_7tsdmdmedhk...@mail.gmail.com%3E> >> >> for a discussion of a simple workaround that worked for me. >> >> William F Dowling >> Senior Technologist >> Thomson Reuters >> >> >> -----Original Message----- >> From: Mario Lassnig [mailto:[email protected]] >> Sent: Wednesday, January 18, 2012 5:07 PM >> To: [email protected] >> Subject: 0.9.1 out of memory problem >> >> Hello, >> >> I'm having an out of memory problem that seems rather weird to me. >> Perhaps you can help me. >> >> Here's what I do: >> >> dump = LOAD '/user/accounting/dump_2012-**01-05.lst' AS ( >> ts:chararray, >> duid:chararray, >> owner:chararray, >> hidden:chararray, >> lgroup:chararray, >> nbfiles:long, >> length:long, >> replicas:long, >> provenance:chararray, >> state:chararray, >> campaign:chararray, >> rlength:long, >> rnbfiles:long, >> rowner:chararray, >> rgroup:chararray, >> rarchived:chararray, >> rsuspicious:chararray, >> name:chararray, >> ami:chararray, >> site:chararray >> ); >> >> wset = FOREACH dump GENERATE site, duid, replicas, nbfiles, rnbfiles, >> length, rlength; >> >> bySite = GROUP wset BY site; >> >> report = FOREACH bySite { >> duids = DISTINCT wset.duid; >> GENERATE group, COUNT(duids), SUM(wset.replicas), SUM(wset.nbfiles), >> SUM(wset.rnbfiles), SUM(wset.length), SUM(wset.rlength); >> }; >> >> STORE report INTO 'testfile.out'; >> >> >> >> >> So far, nothing special. The dump file has about 5GB with ~500 million >> lines. >> The whole STORE process takes about 2 minutes until it ends up at the >> last reducer, >> which dies like this: >> >> 2012-01-18 22:45:42,461 INFO org.apache.hadoop.util.**NativeCodeLoader: >> Loaded the native-hadoop library >> 2012-01-18 22:45:42,706 INFO org.apache.hadoop.metrics.jvm.**JvmMetrics: >> Initializing JVM Metrics with processName=SHUFFLE, sessionId= >> 2012-01-18 22:45:42,976 INFO org.apache.hadoop.mapred.**ReduceTask: >> ShuffleRamManager: MemoryLimit=668126400, MaxSingleShuffleLimit=** >> 167031600 >> 2012-01-18 22:45:42,982 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Thread started: Thread for >> merging on-disk files >> 2012-01-18 22:45:42,982 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Thread waiting: Thread for >> merging on-disk files >> 2012-01-18 22:45:42,983 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Thread started: Thread for >> merging in memory files >> 2012-01-18 22:45:42,983 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Need another 89 map output(s) >> where 0 is already in progress >> 2012-01-18 22:45:42,984 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Thread started: Thread for >> polling Map Completion Events >> 2012-01-18 22:45:42,984 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 0 outputs (0 slow hosts >> and0 dup hosts) >> 2012-01-18 22:45:47,986 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 5 outputs (0 slow hosts >> and0 dup hosts) >> ..... >> 2012-01-18 22:45:42,461 INFO org.apache.hadoop.util.**NativeCodeLoader: >> Loaded the native-hadoop library >> 2012-01-18 22:45:42,706 INFO org.apache.hadoop.metrics.jvm.**JvmMetrics: >> Initializing JVM Metrics with processName=SHUFFLE, sessionId= >> 2012-01-18 22:45:42,976 INFO org.apache.hadoop.mapred.**ReduceTask: >> ShuffleRamManager: MemoryLimit=668126400, MaxSingleShuffleLimit=** >> 167031600 >> 2012-01-18 22:45:42,982 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Thread started: Thread for >> merging on-disk files >> 2012-01-18 22:45:42,982 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Thread waiting: Thread for >> merging on-disk files >> 2012-01-18 22:45:42,983 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Thread started: Thread for >> merging in memory files >> 2012-01-18 22:45:42,983 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Need another 89 map output(s) >> where 0 is already in progress >> 2012-01-18 22:45:42,984 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Thread started: Thread for >> polling Map Completion Events >> 2012-01-18 22:45:42,984 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 0 outputs (0 slow hosts >> and0 dup hosts) >> 2012-01-18 22:45:47,986 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 5 outputs (0 slow hosts >> and0 dup hosts) >> 2012-01-18 22:45:48,091 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and18 dup hosts) >> 2012-01-18 22:45:48,294 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and14 dup hosts) >> 2012-01-18 22:45:48,336 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and15 dup hosts) >> 2012-01-18 22:45:48,368 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and13 dup hosts) >> 2012-01-18 22:45:48,592 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and15 dup hosts) >> 2012-01-18 22:45:48,636 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and12 dup hosts) >> 2012-01-18 22:45:48,774 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and13 dup hosts) >> 2012-01-18 22:45:48,796 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and11 dup hosts) >> 2012-01-18 22:45:48,827 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and8 dup hosts) >> 2012-01-18 22:45:48,848 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and9 dup hosts) >> 2012-01-18 22:45:48,874 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and9 dup hosts) >> 2012-01-18 22:45:49,041 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and8 dup hosts) >> 2012-01-18 22:45:49,129 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and3 dup hosts) >> 2012-01-18 22:45:49,250 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and3 dup hosts) >> 2012-01-18 22:45:49,461 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and3 dup hosts) >> 2012-01-18 22:45:49,466 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and2 dup hosts) >> 2012-01-18 22:45:49,668 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and2 dup hosts) >> 2012-01-18 22:45:49,801 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and1 dup hosts) >> 2012-01-18 22:45:49,940 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and1 dup hosts) >> 2012-01-18 22:45:50,100 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and1 dup hosts) >> 2012-01-18 22:45:50,101 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and3 dup hosts) >> 2012-01-18 22:45:50,101 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and4 dup hosts) >> 2012-01-18 22:45:50,125 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and4 dup hosts) >> 2012-01-18 22:45:50,345 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and2 dup hosts) >> 2012-01-18 22:45:50,388 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and2 dup hosts) >> 2012-01-18 22:45:50,649 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and1 dup hosts) >> 2012-01-18 22:45:50,671 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and0 dup hosts) >> 2012-01-18 22:45:55,890 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 10 outputs (0 slow >> hosts and0 dup hosts) >> 2012-01-18 22:45:56,119 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and29 dup hosts) >> 2012-01-18 22:45:56,262 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 2 outputs (0 slow hosts >> and28 dup hosts) >> 2012-01-18 22:45:56,266 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and33 dup hosts) >> 2012-01-18 22:45:56,296 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and31 dup hosts) >> 2012-01-18 22:45:56,335 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and29 dup hosts) >> 2012-01-18 22:45:56,363 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and26 dup hosts) >> 2012-01-18 22:45:56,461 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and25 dup hosts) >> 2012-01-18 22:45:56,465 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and27 dup hosts) >> 2012-01-18 22:45:56,643 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and23 dup hosts) >> 2012-01-18 22:45:56,662 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and23 dup hosts) >> 2012-01-18 22:45:56,671 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and25 dup hosts) >> 2012-01-18 22:45:56,696 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and23 dup hosts) >> 2012-01-18 22:45:56,874 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and19 dup hosts) >> 2012-01-18 22:45:57,016 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and21 dup hosts) >> 2012-01-18 22:45:57,043 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and18 dup hosts) >> 2012-01-18 22:45:57,122 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and21 dup hosts) >> 2012-01-18 22:45:57,122 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and21 dup hosts) >> 2012-01-18 22:45:57,129 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and19 dup hosts) >> 2012-01-18 22:45:57,207 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and17 dup hosts) >> 2012-01-18 22:45:57,321 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and16 dup hosts) >> 2012-01-18 22:45:57,460 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and18 dup hosts) >> 2012-01-18 22:45:57,460 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and17 dup hosts) >> 2012-01-18 22:45:57,561 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and13 dup hosts) >> 2012-01-18 22:45:57,580 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and15 dup hosts) >> 2012-01-18 22:45:57,584 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and13 dup hosts) >> 2012-01-18 22:45:57,588 INFO org.apache.hadoop.mapred.**ReduceTask: >> Initiating in-memory merge with 58 segments... >> 2012-01-18 22:45:57,591 INFO org.apache.hadoop.mapred.**Merger: Merging >> 58 sorted segments >> 2012-01-18 22:45:57,591 INFO org.apache.hadoop.mapred.**Merger: Down to >> the last merge-pass, with 58 segments left of total size: 443794431 bytes >> 2012-01-18 22:45:57,594 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and13 dup hosts) >> 2012-01-18 22:45:57,636 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and10 dup hosts) >> 2012-01-18 22:45:57,906 INFO >> org.apache.pig.impl.util.**SpillableMemoryManager: >> first memory handler call- Usage threshold init = 263258112(257088K) used = >> 513227712(501198K) committed = 626393088(611712K) max = 715849728(699072K) >> 2012-01-18 22:45:57,956 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and9 dup hosts) >> 2012-01-18 22:45:58,034 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and10 dup hosts) >> 2012-01-18 22:45:58,036 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and7 dup hosts) >> 2012-01-18 22:45:58,125 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and9 dup hosts) >> 2012-01-18 22:45:58,179 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and6 dup hosts) >> 2012-01-18 22:45:58,505 INFO >> org.apache.pig.impl.util.**SpillableMemoryManager: >> first memory handler call - Collection threshold init = 263258112(257088K) >> used = 599026296(584986K) committed = 715849728(699072K) max = >> 715849728(699072K) >> 2012-01-18 22:45:58,628 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and6 dup hosts) >> 2012-01-18 22:45:58,640 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and5 dup hosts) >> 2012-01-18 22:45:58,715 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and4 dup hosts) >> 2012-01-18 22:45:58,780 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and3 dup hosts) >> 2012-01-18 22:45:58,893 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and3 dup hosts) >> 2012-01-18 22:45:58,945 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and2 dup hosts) >> 2012-01-18 22:45:59,022 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and1 dup hosts) >> 2012-01-18 22:45:59,295 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and0 dup hosts) >> 2012-01-18 22:46:03,457 INFO >> org.apache.pig.impl.util.**SpillableMemoryManager: >> Spilled an estimate of 89947448 bytes from 2 objects. init = >> 263258112(257088K) used = 497067424(485417K) committed = 715849728(699072K) >> max = 715849728(699072K) >> 2012-01-18 22:46:12,065 INFO >> org.apache.pig.impl.util.**SpillableMemoryManager: >> Spilled an estimate of 93882208 bytes from 2 objects. init = >> 263258112(257088K) used = 524713768(512415K) committed = 715849728(699072K) >> max = 715849728(699072K) >> 2012-01-18 22:46:20,631 INFO >> org.apache.pig.impl.util.**SpillableMemoryManager: >> Spilled an estimate of 307912016 bytes from 6 objects. init = >> 263258112(257088K) used = 537442552(524846K) committed = 715849728(699072K) >> max = 715849728(699072K) >> 2012-01-18 22:46:24,488 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and0 dup hosts) >> 2012-01-18 22:46:26,405 INFO >> org.apache.pig.impl.util.**SpillableMemoryManager: >> Spilled an estimate of 308408644 bytes from 7 objects. init = >> 263258112(257088K) used = 548074616(535229K) committed = 715849728(699072K) >> max = 715849728(699072K) >> 2012-01-18 22:46:29,331 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 1 outputs (0 slow hosts >> and0 dup hosts) >> 2012-01-18 22:46:30,569 INFO >> org.apache.pig.impl.util.**SpillableMemoryManager: >> Spilled an estimate of 226846412 bytes from 5 objects. init = >> 263258112(257088K) used = 507240048(495351K) committed = 715849728(699072K) >> max = 715849728(699072K) >> 2012-01-18 22:46:35,576 INFO org.apache.hadoop.mapred.**ReduceTask: >> attempt_201201170946_0084_r_**000000_0 Scheduled 4 outputs (0 slow hosts >> and0 dup hosts) >> 2012-01-18 22:46:36,801 INFO org.apache.hadoop.mapred.**ReduceTask: >> GetMapEventsThread exiting >> 2012-01-18 22:46:36,801 INFO org.apache.hadoop.mapred.**ReduceTask: >> getMapsEventsThread joined. >> 2012-01-18 22:46:36,801 INFO org.apache.hadoop.mapred.**ReduceTask: >> Closed ram manager >> 2012-01-18 22:46:36,802 INFO org.apache.hadoop.mapred.**ReduceTask: >> Interleaved on-disk merge complete: 0 files left. >> 2012-01-18 22:46:38,463 INFO >> org.apache.pig.impl.util.**SpillableMemoryManager: >> Spilled an estimate of 278231340 bytes from 7 objects. init = >> 263258112(257088K) used = 542068736(529364K) committed = 715849728(699072K) >> max = 715849728(699072K) >> 2012-01-18 22:46:53,367 FATAL org.apache.hadoop.mapred.Task: >> attempt_201201170946_0084_r_**000000_0 : Failed to merge in >> memoryjava.lang.**OutOfMemoryError: Java heap space >> at java.util.Arrays.copyOf(**Arrays.java:2786) >> at java.io.ByteArrayOutputStream.**write(ByteArrayOutputStream.** >> java:94) >> at java.io.DataOutputStream.**write(DataOutputStream.java:**90) >> at java.io.DataOutputStream.**writeUTF(DataOutputStream.** >> java:384) >> at java.io.DataOutputStream.**writeUTF(DataOutputStream.** >> java:306) >> at org.apache.pig.data.**BinInterSedes.writeDatum(** >> BinInterSedes.java:454) >> at org.apache.pig.data.**BinInterSedes.writeTuple(** >> BinInterSedes.java:542) >> at org.apache.pig.data.**BinInterSedes.writeBag(** >> BinInterSedes.java:523) >> at org.apache.pig.data.**BinInterSedes.writeDatum(** >> BinInterSedes.java:361) >> at org.apache.pig.data.**BinInterSedes.writeTuple(** >> BinInterSedes.java:542) >> at org.apache.pig.data.**BinInterSedes.writeDatum(** >> BinInterSedes.java:357) >> at org.apache.pig.data.**BinInterSedes.writeTuple(** >> BinInterSedes.java:542) >> at org.apache.pig.data.**BinInterSedes.writeDatum(** >> BinInterSedes.java:357) >> at org.apache.pig.data.**BinSedesTuple.write(** >> BinSedesTuple.java:57) >> at org.apache.pig.impl.io.**PigNullableWritable.write(** >> PigNullableWritable.java:123) >> at org.apache.hadoop.io.**serializer.**WritableSerialization$** >> WritableSerializer.serialize(**WritableSerialization.java:90) >> at org.apache.hadoop.io.**serializer.**WritableSerialization$** >> WritableSerializer.serialize(**WritableSerialization.java:77) >> at org.apache.hadoop.mapred.**IFile$Writer.append(IFile.** >> java:187) >> at org.apache.hadoop.mapred.Task$**CombineOutputCollector.** >> collect(Task.java:1050) >> at org.apache.hadoop.mapred.Task$**NewCombinerRunner$** >> OutputConverter.write(Task.**java:1371) >> at org.apache.hadoop.mapreduce.**TaskInputOutputContext.write(** >> TaskInputOutputContext.java:**80) >> at org.apache.pig.backend.hadoop.**executionengine.** >> mapReduceLayer.PigCombiner$**Combine.**processOnePackageOutput(** >> PigCombiner.java:200) >> at org.apache.pig.backend.hadoop.**executionengine.** >> mapReduceLayer.PigCombiner$**Combine.reduce(PigCombiner.**java:162) >> at org.apache.pig.backend.hadoop.**executionengine.** >> mapReduceLayer.PigCombiner$**Combine.reduce(PigCombiner.**java:51) >> at org.apache.hadoop.mapreduce.**Reducer.run(Reducer.java:176) >> at org.apache.hadoop.mapred.Task$**NewCombinerRunner.combine(** >> Task.java:1392) >> at org.apache.hadoop.mapred.**ReduceTask$ReduceCopier$** >> InMemFSMergeThread.**doInMemMerge(ReduceTask.java:**2691) >> at org.apache.hadoop.mapred.**ReduceTask$ReduceCopier$** >> InMemFSMergeThread.run(**ReduceTask.java:2629) >> >> 2012-01-18 22:46:53,372 INFO org.apache.hadoop.mapred.**ReduceTask: >> In-memory merge complete: 31 files left. >> >> >> If I omit the COUNT(DISTINCT), it works brilliantly and fast. With the >> COUNT(DISTINCT) it dies like this. >> >> Now, I don't know where to go from here. I'm running Hadoop and Pig with >> default settings, >> except I've increased child.opts to -Xmx1024M (24GB machines) so it >> would be great if you could tell me what to do, >> because I'm stuck. >> >> Thanks, >> Mario >> >> > -- > Mario Lassnig > CERN PH-ADP-DDM 1-R-022 > CH-1211 Geneve 23 > [email protected] > +41 22 76 71268 > >
