Yeah, concatenating into a big file is a good idea (as long as your file
format is splittable, which newline-delimited text files trivially are,
though compression sometimes throws a kink in that).
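For example, the concatenation step might look like this (the 288-file count comes from Vincent's description below; the file names and paths are made up for illustration):

```shell
# Simulate 288 small per-interval logs for one input type
# (stand-ins for the real < 3MB log files).
mkdir -p demo_logs
for i in $(seq 1 288); do
  printf 'log line from file %s\n' "$i" > "demo_logs/access.$i.log"
done

# Concatenate them into one big newline-delimited file;
# plain text stays splittable on HDFS, so this is safe.
cat demo_logs/access.*.log > access_combined.log

# In production you would then upload the combined file, e.g.:
#   hadoop fs -put access_combined.log /logs/access/
wc -l access_combined.log
```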

Oh, also, use compression when you are in production. We like LZO at
Twitter; others use other codecs. :)

-D

On Fri, Oct 8, 2010 at 11:21 AM, Vincent <[email protected]> wrote:

>  Oops! I suppose you've pointed out my mistakes.
> I am doing a lot of the bad operations you've mentioned:
>
>
> - group foo by (a, b), and then flattening out the group manually: foreach
> grouped_data generate group.a as a, group.b as b;
> - group all_http_requests by status_code.
>
> One more question and then I will leave you for the weekend: :-)
>
>  Running the script locally with Pig, I am loading many small (< 3MB) log
> files as inputs (precisely 288 logs for each input).
> Running the script on Hadoop, before copying the logs onto HDFS, I
> concatenate them all by input type, since I've read that HDFS doesn't like
> small files.
> That makes one input more than 1GB and a couple of others several
> hundred MB.
>
> Is that OK? I've left the HDFS block size at its default (64MB).
>
> And many thanks to both of you, Jeff and Dmitriy, for your help! I've
> learned a lot today.
>
> -Vincent
>
>
> On 10/08/2010 09:47 PM, Dmitriy Ryaboy wrote:
>
>> Sorry, just saw that you said you explicitly did that.
>>
>> Ok, basically what's happening with the GC is that when you do something
>> like "group x by id" Pig will try to load *all the tuples of x with a
>> given
>> id* into memory in the reducer, unless algebraic or accumulative
>> optimizations were applied (this depends on what exactly you are doing
>> with
>> results of grouping).  Same for joins.  A common pitfall is grouping by a
>> tuple: group foo by (a, b), and then flattening out the group manually:
>> foreach grouped_data generate group.a as a, group.b as b;  Instead, you
>> should use FLATTEN to get optimizations to kick in: foreach grouped_data
>> generate FLATTEN(group) as (a, b);
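>>
>> For concreteness, a minimal sketch of the two variants (relation and field
>> names are illustrative, with COUNT standing in for any algebraic use of the
>> grouped bag):
>>
>> ```pig
>> grouped_data = group foo by (a, b);
>>
>> -- Anti-pattern: pulling the group key apart field by field.
>> bad = foreach grouped_data generate group.a as a, group.b as b, COUNT(foo) as cnt;
>>
>> -- Preferred: FLATTEN the key so the algebraic/accumulative optimizations can kick in.
>> good = foreach grouped_data generate FLATTEN(group) as (a, b), COUNT(foo) as cnt;
>> ```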
>>
>> Doing things like "group all_http_requests by status_code" is bad, because
>> that puts a very large number of records into very few buckets. It's ok if
>> all you are doing is counting the results, but bad if you want to do
>> something like sort them by timestamp.
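>>
>> A sketch of the difference (all_http_requests, status_code, and timestamp
>> are from the example above; the rest is illustrative):
>>
>> ```pig
>> by_status = group all_http_requests by status_code;
>>
>> -- OK: COUNT is algebraic, so most of the work happens map-side in combiners.
>> counts = foreach by_status generate group as status_code, COUNT(all_http_requests) as n;
>>
>> -- Bad: a nested ORDER forces each huge per-status bag through one reducer's memory.
>> sorted = foreach by_status {
>>     ordered = order all_http_requests by timestamp;
>>     generate group, ordered;
>> };
>> ```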
>>
>>
>>
>> On Fri, Oct 8, 2010 at 10:40 AM, Dmitriy Ryaboy<[email protected]>
>>  wrote:
>>
>>  When you changed to using replicated joins, did you put the small
>>> relation
>>> last in your list? The order is important...
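>>>
>>> For reference, the shape to aim for (relation names are illustrative; the
>>> small, in-memory relation is listed last):
>>>
>>> ```pig
>>> -- 'small' must fit in RAM on each mapper, and must be the last relation listed.
>>> joined = join big by id, small by id using 'replicated';
>>> ```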
>>>
>>> -D
>>>
>>>
>>> On Fri, Oct 8, 2010 at 8:47 AM, Vincent<[email protected]
>>> >wrote:
>>>
>>>
>>>>
>>>>  *I've tried mapred.child.java.opts value 2048m*. Now the error is a
>>>> timeout. Seems like the system is so loaded that it becomes unresponsive...
>>>>
>>>> Here are the outputs of the job tracker:
>>>>
>>>>
>>>>  Hadoop job_201010081840_0010
>>>>  <http://prog7.lan:50030/jobdetails.jsp?jobid=job_201010081840_0010>
>>>>  failures on prog7<http://prog7.lan:50030/jobtracker.jsp>
>>>>
>>>> Attempt         Task    Machine         State   Error   Logs
>>>> attempt_201010081840_0010_r_000001_0    task_201010081840_0010_r_000001<
>>>>
>>>> http://prog7.lan:50030/taskdetails.jsp?jobid=job_201010081840_0010&tipid=task_201010081840_0010_r_000001
>>>> >
>>>>      prog7<http://prog7:50060>       FAILED
>>>>
>>>> Task attempt_201010081840_0010_r_000001_0 failed to report status for
>>>> 601
>>>> seconds. Killing!
>>>>
>>>>        Last 4KB<
>>>>
>>>> http://prog7:50060/tasklog?taskid=attempt_201010081840_0010_r_000001_0&start=-4097
>>>> Last 8KB<
>>>>
>>>> http://prog7:50060/tasklog?taskid=attempt_201010081840_0010_r_000001_0&start=-8193
>>>> All<
>>>> http://prog7:50060/tasklog?taskid=attempt_201010081840_0010_r_000001_0>
>>>>
>>>>
>>>>
>>>>  Task Logs: 'attempt_201010081840_0010_r_000001_0'
>>>>
>>>>
>>>>
>>>> *_stdout logs_*
>>>> ------------------------------------------------------------------------
>>>>
>>>>
>>>> *_stderr logs_*
>>>> ------------------------------------------------------------------------
>>>>
>>>>
>>>> *_syslog logs_*
>>>>
>>>> 2010-10-08 19:11:49,732 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
>>>> Initializing JVM Metrics with processName=SHUFFLE, sessionId=
>>>> 2010-10-08 19:11:50,963 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> ShuffleRamManager: MemoryLimit=1336252800,
>>>> MaxSingleShuffleLimit=334063200
>>>> 2010-10-08 19:11:50,997 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Thread started: Thread for merging
>>>> on-disk files
>>>> 2010-10-08 19:11:50,997 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Thread waiting: Thread for merging
>>>> on-disk files
>>>> 2010-10-08 19:11:51,004 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Thread started: Thread for merging
>>>> in
>>>> memory files
>>>> 2010-10-08 19:11:51,004 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Need another 24 map output(s) where
>>>> 0
>>>> is already in progress
>>>> 2010-10-08 19:11:51,005 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 0 outputs (0 slow hosts
>>>> and 0
>>>> dup hosts)
>>>> 2010-10-08 19:11:51,005 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Thread started: Thread for polling
>>>> Map
>>>> Completion Events
>>>> 2010-10-08 19:11:51,020 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0: Got 10 new map-outputs
>>>> 2010-10-08 19:11:56,005 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 2 outputs (0 slow hosts
>>>> and 0
>>>> dup hosts)
>>>> 2010-10-08 19:11:56,091 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000002_0, compressed len: 18158866,
>>>> decompressed
>>>> len: 18158862
>>>> 2010-10-08 19:11:56,091 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 18158862 bytes (18158866 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000002_0
>>>> 2010-10-08 19:11:56,582 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000000_0, compressed len: 20624287,
>>>> decompressed
>>>> len: 20624283
>>>> 2010-10-08 19:11:56,582 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 20624283 bytes (20624287 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000000_0
>>>> 2010-10-08 19:11:57,035 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0: Got 2 new map-outputs
>>>> 2010-10-08 19:11:57,258 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 20624283 bytes from map-output for attempt_201010081840_0010_m_000000_0
>>>> 2010-10-08 19:11:57,271 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000000_0 ->   (105, 265) from prog7
>>>> 2010-10-08 19:11:57,274 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts
>>>> and 9
>>>> dup hosts)
>>>> 2010-10-08 19:11:57,313 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000001_0, compressed len: 18485340,
>>>> decompressed
>>>> len: 18485336
>>>> 2010-10-08 19:11:57,313 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 18485336 bytes (18485340 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000001_0
>>>> 2010-10-08 19:11:57,971 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 18158862 bytes from map-output for attempt_201010081840_0010_m_000002_0
>>>> 2010-10-08 19:11:57,971 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000002_0 ->   (177, 148) from hermitage
>>>> 2010-10-08 19:11:57,980 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts
>>>> and 0
>>>> dup hosts)
>>>> 2010-10-08 19:11:58,043 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000003_0, compressed len: 18075620,
>>>> decompressed
>>>> len: 18075616
>>>> 2010-10-08 19:11:58,044 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 18075616 bytes (18075620 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000003_0
>>>> 2010-10-08 19:11:58,277 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 18485336 bytes from map-output for attempt_201010081840_0010_m_000001_0
>>>> 2010-10-08 19:11:58,277 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000001_0 ->   (241, 162) from prog7
>>>> 2010-10-08 19:12:01,929 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 18075616 bytes from map-output for attempt_201010081840_0010_m_000003_0
>>>> 2010-10-08 19:12:01,930 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000003_0 ->   (189, 187) from hermitage
>>>> 2010-10-08 19:12:01,935 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts
>>>> and 0
>>>> dup hosts)
>>>> 2010-10-08 19:12:01,937 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000006_0, compressed len: 18255983,
>>>> decompressed
>>>> len: 18255979
>>>> 2010-10-08 19:12:01,937 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 18255979 bytes (18255983 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000006_0
>>>> 2010-10-08 19:12:03,044 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0: Got 1 new map-outputs
>>>> 2010-10-08 19:12:03,049 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000004_0, compressed len: 18874529,
>>>> decompressed
>>>> len: 18874525
>>>> 2010-10-08 19:12:03,049 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 18874525 bytes (18874529 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000004_0
>>>> 2010-10-08 19:12:03,067 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts
>>>> and 7
>>>> dup hosts)
>>>> 2010-10-08 19:12:03,608 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 18874525 bytes from map-output for attempt_201010081840_0010_m_000004_0
>>>> 2010-10-08 19:12:03,609 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000004_0 ->   (105, 133) from prog7
>>>> 2010-10-08 19:12:04,087 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 18255979 bytes from map-output for attempt_201010081840_0010_m_000006_0
>>>> 2010-10-08 19:12:04,088 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000006_0 ->   (105, 178) from hermitage
>>>> 2010-10-08 19:12:04,094 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts
>>>> and 0
>>>> dup hosts)
>>>> 2010-10-08 19:12:04,319 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000007_0, compressed len: 18358512,
>>>> decompressed
>>>> len: 18358508
>>>> 2010-10-08 19:12:04,319 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 18358508 bytes (18358512 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000007_0
>>>> 2010-10-08 19:12:06,254 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 18358508 bytes from map-output for attempt_201010081840_0010_m_000007_0
>>>> 2010-10-08 19:12:06,255 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000007_0 ->   (105, 166) from hermitage
>>>> 2010-10-08 19:12:06,258 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts
>>>> and 0
>>>> dup hosts)
>>>> 2010-10-08 19:12:06,270 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000008_0, compressed len: 18092007,
>>>> decompressed
>>>> len: 18092003
>>>> 2010-10-08 19:12:06,271 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 18092003 bytes (18092007 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000008_0
>>>> 2010-10-08 19:12:07,808 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 18092003 bytes from map-output for attempt_201010081840_0010_m_000008_0
>>>> 2010-10-08 19:12:07,809 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000008_0 ->   (293, 232) from hermitage
>>>> 2010-10-08 19:12:07,810 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts
>>>> and 0
>>>> dup hosts)
>>>> 2010-10-08 19:12:07,813 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000009_0, compressed len: 17941909,
>>>> decompressed
>>>> len: 17941905
>>>> 2010-10-08 19:12:07,813 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 17941905 bytes (17941909 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000009_0
>>>> 2010-10-08 19:12:09,059 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0: Got 3 new map-outputs
>>>> 2010-10-08 19:12:09,060 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts
>>>> and 6
>>>> dup hosts)
>>>> 2010-10-08 19:12:09,338 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 17941905 bytes from map-output for attempt_201010081840_0010_m_000009_0
>>>> 2010-10-08 19:12:09,338 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000009_0 ->   (105, 197) from hermitage
>>>> 2010-10-08 19:12:09,338 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts
>>>> and 0
>>>> dup hosts)
>>>> 2010-10-08 19:12:09,341 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000010_0, compressed len: 18405142,
>>>> decompressed
>>>> len: 18405138
>>>> 2010-10-08 19:12:09,341 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 18405138 bytes (18405142 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000010_0
>>>> 2010-10-08 19:12:09,369 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000005_0, compressed len: 18009096,
>>>> decompressed
>>>> len: 18009092
>>>> 2010-10-08 19:12:09,369 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 18009092 bytes (18009096 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000005_0
>>>> 2010-10-08 19:12:10,691 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 18009092 bytes from map-output for attempt_201010081840_0010_m_000005_0
>>>> 2010-10-08 19:12:10,691 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000005_0 ->   (105, 206) from prog7
>>>> 2010-10-08 19:12:11,101 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 18405138 bytes from map-output for attempt_201010081840_0010_m_000010_0
>>>> 2010-10-08 19:12:11,101 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000010_0 ->   (137, 175) from hermitage
>>>> 2010-10-08 19:12:11,102 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts
>>>> and 0
>>>> dup hosts)
>>>> 2010-10-08 19:12:11,104 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000011_0, compressed len: 20002825,
>>>> decompressed
>>>> len: 20002821
>>>> 2010-10-08 19:12:11,104 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 20002821 bytes (20002825 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000011_0
>>>> 2010-10-08 19:12:12,805 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 20002821 bytes from map-output for attempt_201010081840_0010_m_000011_0
>>>> 2010-10-08 19:12:12,805 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000011_0 ->   (105, 143) from hermitage
>>>> 2010-10-08 19:12:12,815 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts
>>>> and 0
>>>> dup hosts)
>>>> 2010-10-08 19:12:12,817 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000012_0, compressed len: 18135959,
>>>> decompressed
>>>> len: 18135955
>>>> 2010-10-08 19:12:12,817 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 18135955 bytes (18135959 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000012_0
>>>> 2010-10-08 19:12:14,361 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 18135955 bytes from map-output for attempt_201010081840_0010_m_000012_0
>>>> 2010-10-08 19:12:14,361 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000012_0 ->   (137, 149) from hermitage
>>>> 2010-10-08 19:12:14,362 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts
>>>> and 0
>>>> dup hosts)
>>>> 2010-10-08 19:12:14,364 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000013_0, compressed len: 18440786,
>>>> decompressed
>>>> len: 18440782
>>>> 2010-10-08 19:12:14,364 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 18440782 bytes (18440786 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000013_0
>>>> 2010-10-08 19:12:15,935 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 18440782 bytes from map-output for attempt_201010081840_0010_m_000013_0
>>>> 2010-10-08 19:12:15,935 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000013_0 ->   (137, 142) from hermitage
>>>> 2010-10-08 19:12:15,936 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts
>>>> and 0
>>>> dup hosts)
>>>> 2010-10-08 19:12:15,938 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000014_0, compressed len: 18205885,
>>>> decompressed
>>>> len: 18205881
>>>> 2010-10-08 19:12:15,938 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 18205881 bytes (18205885 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000014_0
>>>> 2010-10-08 19:12:17,489 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 18205881 bytes from map-output for attempt_201010081840_0010_m_000014_0
>>>> 2010-10-08 19:12:17,499 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000014_0 ->   (253, 159) from hermitage
>>>> 2010-10-08 19:12:17,506 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts
>>>> and 0
>>>> dup hosts)
>>>> 2010-10-08 19:12:17,510 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000015_0, compressed len: 17476262,
>>>> decompressed
>>>> len: 17476258
>>>> 2010-10-08 19:12:17,510 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 17476258 bytes (17476262 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000015_0
>>>> 2010-10-08 19:12:17,612 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0: Got 1 new map-outputs
>>>> 2010-10-08 19:12:19,030 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 17476258 bytes from map-output for attempt_201010081840_0010_m_000015_0
>>>> 2010-10-08 19:12:19,035 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000015_0 ->   (105, 158) from hermitage
>>>> 2010-10-08 19:12:19,035 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts
>>>> and 0
>>>> dup hosts)
>>>> 2010-10-08 19:12:19,061 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000017_0, compressed len: 18542230,
>>>> decompressed
>>>> len: 18542226
>>>> 2010-10-08 19:12:19,061 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 18542226 bytes (18542230 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000017_0
>>>> 2010-10-08 19:12:20,640 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 18542226 bytes from map-output for attempt_201010081840_0010_m_000017_0
>>>> 2010-10-08 19:12:20,640 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000017_0 ->   (257, 151) from hermitage
>>>> 2010-10-08 19:12:23,626 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0: Got 1 new map-outputs
>>>> 2010-10-08 19:12:25,643 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts
>>>> and 0
>>>> dup hosts)
>>>> 2010-10-08 19:12:25,670 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000018_0, compressed len: 18737340,
>>>> decompressed
>>>> len: 18737336
>>>> 2010-10-08 19:12:25,670 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 18737336 bytes (18737340 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000018_0
>>>> 2010-10-08 19:12:27,438 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 18737336 bytes from map-output for attempt_201010081840_0010_m_000018_0
>>>> 2010-10-08 19:12:27,439 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000018_0 ->   (253, 175) from hermitage
>>>> 2010-10-08 19:12:28,646 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0: Got 2 new map-outputs
>>>> 2010-10-08 19:12:31,652 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0: Got 2 new map-outputs
>>>> 2010-10-08 19:12:32,439 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 2 outputs (0 slow hosts
>>>> and 0
>>>> dup hosts)
>>>> 2010-10-08 19:12:32,473 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000020_0, compressed len: 17710258,
>>>> decompressed
>>>> len: 17710254
>>>> 2010-10-08 19:12:32,473 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 17710254 bytes (17710258 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000020_0
>>>> 2010-10-08 19:12:32,475 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000016_0, compressed len: 20708576,
>>>> decompressed
>>>> len: 20708572
>>>> 2010-10-08 19:12:32,475 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 20708572 bytes (20708576 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000016_0
>>>> 2010-10-08 19:12:33,138 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 20708572 bytes from map-output for attempt_201010081840_0010_m_000016_0
>>>> 2010-10-08 19:12:33,164 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000016_0 ->   (297, 318) from prog7
>>>> 2010-10-08 19:12:33,167 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts
>>>> and 1
>>>> dup hosts)
>>>> 2010-10-08 19:12:33,172 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000019_0, compressed len: 18984487,
>>>> decompressed
>>>> len: 18984483
>>>> 2010-10-08 19:12:33,172 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 18984483 bytes (18984487 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000019_0
>>>> 2010-10-08 19:12:33,774 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 18984483 bytes from map-output for attempt_201010081840_0010_m_000019_0
>>>> 2010-10-08 19:12:33,774 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000019_0 ->   (285, 160) from prog7
>>>> 2010-10-08 19:12:34,057 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 17710254 bytes from map-output for attempt_201010081840_0010_m_000020_0
>>>> 2010-10-08 19:12:34,057 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000020_0 ->   (105, 127) from hermitage
>>>> 2010-10-08 19:12:34,081 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts
>>>> and 0
>>>> dup hosts)
>>>> 2010-10-08 19:12:34,085 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000021_0, compressed len: 18803713,
>>>> decompressed
>>>> len: 18803709
>>>> 2010-10-08 19:12:34,085 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 18803709 bytes (18803713 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000021_0
>>>> 2010-10-08 19:12:36,579 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 18803709 bytes from map-output for attempt_201010081840_0010_m_000021_0
>>>> 2010-10-08 19:12:36,579 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000021_0 ->   (137, 164) from hermitage
>>>> 2010-10-08 19:12:43,867 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0: Got 2 new map-outputs
>>>> 2010-10-08 19:12:46,585 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts
>>>> and 0
>>>> dup hosts)
>>>> 2010-10-08 19:12:46,589 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000022_1, compressed len: 18143868,
>>>> decompressed
>>>> len: 18143864
>>>> 2010-10-08 19:12:46,589 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 18143864 bytes (18143868 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000022_1
>>>> 2010-10-08 19:12:48,167 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 18143864 bytes from map-output for attempt_201010081840_0010_m_000022_1
>>>> 2010-10-08 19:12:48,176 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000022_1 ->   (105, 133) from hermitage
>>>> 2010-10-08 19:12:48,182 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts
>>>> and 0
>>>> dup hosts)
>>>> 2010-10-08 19:12:48,428 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> header:
>>>> attempt_201010081840_0010_m_000023_1, compressed len: 9198819,
>>>> decompressed
>>>> len: 9198815
>>>> 2010-10-08 19:12:48,428 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Shuffling 9198815 bytes (9198819 raw bytes) into RAM from
>>>> attempt_201010081840_0010_m_000023_1
>>>> 2010-10-08 19:12:49,878 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Ignoring
>>>> obsolete output of KILLED map-task:
>>>> 'attempt_201010081840_0010_m_000022_0'
>>>> 2010-10-08 19:12:49,938 INFO org.apache.hadoop.mapred.ReduceTask: Read
>>>> 9198815 bytes from map-output for attempt_201010081840_0010_m_000023_1
>>>> 2010-10-08 19:12:49,938 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1
>>>> from attempt_201010081840_0010_m_000023_1 ->   (137, 166) from hermitage
>>>> 2010-10-08 19:12:50,878 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> GetMapEventsThread exiting
>>>> 2010-10-08 19:12:50,878 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> getMapsEventsThread joined.
>>>> 2010-10-08 19:12:50,878 INFO org.apache.hadoop.mapred.ReduceTask: Closed
>>>> ram manager
>>>> 2010-10-08 19:12:50,879 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Interleaved on-disk merge complete: 0 files left.
>>>> 2010-10-08 19:12:50,879 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> In-memory merge complete: 24 files left.
>>>> 2010-10-08 19:12:51,029 INFO org.apache.hadoop.mapred.Merger: Merging 24
>>>> sorted segments
>>>> 2010-10-08 19:12:51,030 INFO org.apache.hadoop.mapred.Merger: Down to
>>>> the
>>>> last merge-pass, with 24 segments left of total size: 436372203 bytes
>>>> 2010-10-08 19:13:04,406 INFO org.apache.hadoop.mapred.ReduceTask: Merged
>>>> 24 segments, 436372203 bytes to disk to satisfy reduce memory limit
>>>> 2010-10-08 19:13:04,407 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Merging
>>>> 1 files, 436372161 bytes from disk
>>>> 2010-10-08 19:13:04,426 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> Merging
>>>> 0 segments, 0 bytes from memory into reduce
>>>> 2010-10-08 19:13:04,426 INFO org.apache.hadoop.mapred.Merger: Merging 1
>>>> sorted segments
>>>> 2010-10-08 19:13:04,463 INFO org.apache.hadoop.mapred.Merger: Down to
>>>> the
>>>> last merge-pass, with 1 segments left of total size: 436372157 bytes
>>>> 2010-10-08 19:13:18,879 INFO
>>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat: Total input paths
>>>> to
>>>> process : 24
>>>> 2010-10-08 19:13:18,879 INFO
>>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil: Total
>>>> input
>>>> paths to process : 24
>>>> 2010-10-08 19:16:14,354 INFO
>>>> org.apache.pig.impl.util.SpillableMemoryManager: low memory handler
>>>> called
>>>> (Collection threshold exceeded) init = 32309248(31552K) used =
>>>> 803560952(784727K) committed = 1069678592(1044608K) max =
>>>> 1431699456(1398144K)
>>>>
>>>>
>>>>
>>>>
>>>> On 10/08/2010 06:44 PM, Vincent wrote:
>>>>
>>>>   Yep, I did restart the cluster (dfs and mapred stop/start).
>>>>>
>>>>> Increasing the amount of memory, I can see that the reduce task goes
>>>>> further (the percentage is greater), but then it starts to decrease again
>>>>> with memory failures.
>>>>>
>>>>> On 10/08/2010 06:41 PM, Jeff Zhang wrote:
>>>>>
>>>>>  Did you restart the cluster after reconfiguration?
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 8, 2010 at 9:59 PM, Vincent<[email protected]>
>>>>>>  wrote:
>>>>>>
>>>>>>   I've tried with mapred.child.java.opts value:
>>>>>>> -Xmx512m -->   still memory errors in reduce phase
>>>>>>> -Xmx1024m -->   still memory errors in reduce phase
>>>>>>> I am now trying with -Xmx1536m but I'm afraid that my nodes will
>>>>>>> start
>>>>>>> to
>>>>>>> swap memory...
>>>>>>>
>>>>>>> Should I continue in this direction? Or is it already too much, and I
>>>>>>> should look for the problem somewhere else?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> -Vincent
>>>>>>>
>>>>>>>
>>>>>>> On 10/08/2010 03:04 PM, Jeff Zhang wrote:
>>>>>>>
>>>>>>>>  Try to increase the heap size of tasks by setting
>>>>>>>> mapred.child.java.opts in mapred-site.xml. The default value is
>>>>>>>> -Xmx200m in mapred-default.xml, which may be too small for you.
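>>>>>>>>
>>>>>>>> A sketch of the corresponding mapred-site.xml entry (the 512m value is
>>>>>>>> just an example starting point):
>>>>>>>>
>>>>>>>> ```xml
>>>>>>>> <property>
>>>>>>>>   <name>mapred.child.java.opts</name>
>>>>>>>>   <value>-Xmx512m</value>
>>>>>>>> </property>
>>>>>>>> ```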
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Oct 8, 2010 at 6:55 PM, Vincent<[email protected]>
>>>>>>>>  wrote:
>>>>>>>>
>>>>>>>>>   Thanks to Dmitriy and Jeff, I've set:
>>>>>>>>>
>>>>>>>>> set default_parallel 20; at the beginning of my script.
>>>>>>>>>
>>>>>>>>> Updated 8 JOINs to behave like:
>>>>>>>>>
>>>>>>>>> JOIN big BY id, small BY id USING 'replicated';
>>>>>>>>>
>>>>>>>>> Unfortunately this didn't improve the script speed (at least it has
>>>>>>>>> been running for more than one hour now).
>>>>>>>>>
>>>>>>>>> But looking in the jobtracker at one of the jobs that is reducing, I
>>>>>>>>> can see this for the map:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  Hadoop map task list for job_201010081314_0010 on prog7
>>>>>>>>>  (http://prog7.lan:50030/jobdetails.jsp?jobid=job_201010081314_0010):
>>>>>>>>>
>>>>>>>>> Task: task_201010081314_0010_m_000000
>>>>>>>>>   Complete:    100.00%
>>>>>>>>>   Start Time:  8-Oct-2010 14:07:44
>>>>>>>>>   Finish Time: 8-Oct-2010 14:23:11 (15mins, 27sec)
>>>>>>>>>   Errors:      Too many fetch-failures (x2)
>>>>>>>>>   Counters:    8
>>>>>>>>>
>>>>>>>>> And I can see this for the reduce
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  Hadoop reduce task list for job_201010081314_0010 on prog7
>>>>>>>>>  (http://prog7.lan:50030/jobdetails.jsp?jobid=job_201010081314_0010):
>>>>>>>>>
>>>>>>>>> Task: task_201010081314_0010_r_000000
>>>>>>>>>   Complete:   9.72%
>>>>>>>>>   Status:     reduce > copy (7 of 24 at 0.01 MB/s)
>>>>>>>>>   Start Time: 8-Oct-2010 14:14:49
>>>>>>>>>   Errors:     Error: GC overhead limit exceeded
>>>>>>>>>   Counters:   7
>>>>>>>>>
>>>>>>>>> Task: task_201010081314_0010_r_000001
>>>>>>>>>   Complete:   0.00%
>>>>>>>>>   Start Time: 8-Oct-2010 14:14:52
>>>>>>>>>   Errors:     Error: Java heap space
>>>>>>>>>   Counters:   0
>>>>>>>>>
>>>>>>>>> Task: task_201010081314_0010_r_000002
>>>>>>>>>   Complete:   0.00%
>>>>>>>>>   Start Time: 8-Oct-2010 14:15:58
>>>>>>>>>   Errors:     java.io.IOException: Task process exit with nonzero status of 1.
>>>>>>>>>               at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
>>>>>>>>>   Counters:   0
>>>>>>>>>
>>>>>>>>> Task: task_201010081314_0010_r_000003
>>>>>>>>>   Complete:   9.72%
>>>>>>>>>   Status:     reduce > copy (7 of 24 at 0.01 MB/s)
>>>>>>>>>   Start Time: 8-Oct-2010 14:16:58
>>>>>>>>>   Counters:   7
>>>>>>>>>
>>>>>>>>> Task: task_201010081314_0010_r_000004
>>>>>>>>>   Complete:   0.00%
>>>>>>>>>   Start Time: 8-Oct-2010 14:18:11
>>>>>>>>>   Errors:     Error: GC overhead limit exceeded
>>>>>>>>>   Counters:   0
>>>>>>>>>
>>>>>>>>> Task: task_201010081314_0010_r_000005
>>>>>>>>>   Complete:   0.00%
>>>>>>>>>   Start Time: 8-Oct-2010 14:18:56
>>>>>>>>>   Errors:     Error: GC overhead limit exceeded
>>>>>>>>>
>>>>>>>>> Seems like it runs out of memory... Which parameter should be
>>>>>>>>> increased?
>>>>>>>>>
>>>>>>>>> -Vincent
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 10/08/2010 01:12 PM, Jeff Zhang wrote:
>>>>>>>>>
>>>>>>>>>  BTW, you can look at the job tracker web UI to see which part
>>>>>>>>>> of the job costs the most time.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 8, 2010 at 5:11 PM, Jeff Zhang<[email protected]>
>>>>>>>>>>  wrote:
>>>>>>>>>>
>>>>>>>>>>  No, I mean whether your mapreduce job's reduce task number is 1.
>>>>>>>>>>>
>>>>>>>>>>> And could you share your Pig script, so that others can really
>>>>>>>>>>> understand your problem?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 8, 2010 at 5:04 PM, Vincent<
>>>>>>>>>>> [email protected]
>>>>>>>>>>>  wrote:
>>>>>>>>>>>
>>>>>>>>>>>   You are right, I didn't change this parameter, therefore the
>>>>>>>>>>>> default
>>>>>>>>>>>> is
>>>>>>>>>>>> used from src/mapred/mapred-default.xml
>>>>>>>>>>>>
<property>
>>>>>>>>>>>>   <name>mapred.reduce.tasks</name>
>>>>>>>>>>>>   <value>1</value>
>>>>>>>>>>>>   <description>The default number of reduce tasks per job.
>>>>>>>>>>>>   Typically set to 99% of the cluster's reduce capacity, so that
>>>>>>>>>>>>   if a node fails the reduces can still be executed in a single
>>>>>>>>>>>>   wave. Ignored when mapred.job.tracker is "local".
>>>>>>>>>>>>   </description>
>>>>>>>>>>>> </property>
>>>>>>>>>>>>
>>>>>>>>>>>> It's not clear to me what the reduce capacity of my cluster is :)
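(For context, a hedged aside: in Hadoop 0.20 a cluster's reduce capacity is the number of tasktracker nodes multiplied by mapred.tasktracker.reduce.tasks.maximum, which defaults to 2 slots per node. A sketch of raising it in mapred-site.xml; the value 4 is only an illustration:)

```xml
<!-- mapred-site.xml on each tasktracker: number of reduce slots per node.
     4 is an illustrative value; size it to the node's cores and RAM. -->
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
```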
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/08/2010 01:00 PM, Jeff Zhang wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>  I guess maybe your reduce number is 1, which causes the reduce
>>>>>>>>>>>>> phase to run very slowly.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Oct 8, 2010 at 4:44 PM, Vincent<
>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>   Well, I can see from the job tracker that all the jobs are
>>>>>>>>>>>>>> done quite quickly except 2, for which the reduce phase goes
>>>>>>>>>>>>>> really, really slowly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But how can I match up a job in the Hadoop job tracker
>>>>>>>>>>>>>> (example: job_201010072150_0045) with the part of the Pig
>>>>>>>>>>>>>> script it executes?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And what is more efficient: several small Pig scripts, or one
>>>>>>>>>>>>>> big Pig script? I wrote one big script to avoid loading the
>>>>>>>>>>>>>> same logs several times in different scripts. Maybe it is not
>>>>>>>>>>>>>> such a good design...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for your help.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Vincent
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 10/08/2010 11:31 AM, Vincent wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   I'm using pig-0.7.0 on hadoop-0.20.2.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For the script, well, it's more than 500 lines; I'm not sure
>>>>>>>>>>>>>>> that if I post it here anybody will read it to the end :-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 10/08/2010 11:26 AM, Dmitriy Ryaboy wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  What version of Pig, and what does your script look like?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Oct 7, 2010 at 11:48 PM,
>>>>>>>>>>>>>>>> Vincent<[email protected]>
>>>>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>   Hi All,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm quite new to Pig/Hadoop. So maybe my cluster size will
>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>> laugh.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I wrote a script on Pig handling 1.5GB of logs in less than
>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>> hour
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> pig
>>>>>>>>>>>>>>>>> local mode on a Intel core 2 duo with 3GB of RAM.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Then I tried this script on a simple 2 nodes cluster. These
>>>>>>>>>>>>>>>>> 2
>>>>>>>>>>>>>>>>> nodes
>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>> servers but simple computers:
>>>>>>>>>>>>>>>>> - Intel core 2 duo with 3GB of RAM.
>>>>>>>>>>>>>>>>> - Intel Quad with 4GB of RAM.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Well, I was aware that Hadoop has overhead and that it
>>>>>>>>>>>>>>>>> wouldn't be done in half an hour (the local-mode time
>>>>>>>>>>>>>>>>> divided by the number of nodes). But I was surprised to see
>>>>>>>>>>>>>>>>> this morning that it took 7 hours to complete!!!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> My configuration was made according to this link:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> My question is simple: Is it normal?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Vincent
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  --
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best Regards
>>>>>>>>>>>
>>>>>>>>>>> Jeff Zhang
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>
>
