Yeah, concatenating into a big file is a good idea (as long as your file format is splittable, which newline-delimited text files trivially are, though compression sometimes throws a kink in that).
Oh, also, use compression when you are in production. We like LZO at Twitter; others sometimes use other formats. :)

-D

On Fri, Oct 8, 2010 at 11:21 AM, Vincent <[email protected]> wrote:
> Oops! I suppose you've pointed out my mistakes.
> I am doing a lot of bad operations like the ones you've mentioned:
>
> - group foo by (a, b), and then flattening out the group manually: foreach
>   grouped_data generate group.a as a, group.b as b;
> - group all_http_requests by status_code.
>
> One more question and I will leave you for the weekend. :-)
>
> Running the script locally with Pig, I am loading many small (< 3MB) log
> files as inputs (precisely 288 logs for each input).
> When running the script on Hadoop, I concatenate all the logs by input type
> before copying them to HDFS, since I've read that HDFS doesn't like small
> files.
> That makes one input of more than 1GB and a couple of others of several
> hundred MB.
>
> Is that OK? I've left the HDFS block size at the default (64MB).
>
> And many thanks to both of you, Jeff and Dmitriy, for your help! I've
> learned a lot today thanks to you.
>
> -Vincent
>
>
> On 10/08/2010 09:47 PM, Dmitriy Ryaboy wrote:
>
>> Sorry, just saw that you said you explicitly did that.
>>
>> OK, basically what's happening with the GC is that when you do something
>> like "group x by id", Pig will try to load *all the tuples of x with a
>> given id* into memory in the reducer, unless algebraic or accumulative
>> optimizations were applied (this depends on what exactly you are doing
>> with the results of the grouping). Same for joins.
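[Editor's note: the point about algebraic optimizations can be made concrete. When the only thing computed per group is an algebraic function such as COUNT, Pig can use combiners and never needs a whole group in reducer memory at once; a non-algebraic operation like sorting the bag forces the entire group into memory. A minimal sketch — relation and field names are illustrative, not from the original script:]

```pig
-- Hypothetical relation; the schema is illustrative only.
requests  = LOAD 'http_requests' AS (id:chararray, status_code:int, ts:long);

-- COUNT is algebraic, so Pig can apply combiners on the map side and
-- stream partial counts; reducer memory stays small even for huge groups.
by_status = GROUP requests BY status_code;
counts    = FOREACH by_status GENERATE group AS status_code, COUNT(requests);

-- By contrast, ordering the bag inside each group forces the entire
-- group into reducer memory at once -- this is where GC errors come from.
sorted    = FOREACH by_status {
                ordered = ORDER requests BY ts;
                GENERATE group, ordered;
            };
```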
>> A common pitfall is grouping by a tuple: group foo by (a, b), and then
>> flattening out the group manually: foreach grouped_data generate
>> group.a as a, group.b as b; Instead, you should use FLATTEN to get the
>> optimizations to kick in: foreach grouped_data generate FLATTEN(group)
>> as (a, b);
>>
>> Doing things like "group all_http_requests by status_code" is bad, because
>> that puts a very large number of records into very few buckets. It's OK if
>> all you are doing is counting the results, but bad if you want to do
>> something like sort them by timestamp.
>>
>>
>> On Fri, Oct 8, 2010 at 10:40 AM, Dmitriy Ryaboy <[email protected]> wrote:
>>
>>> When you changed to using replicated joins, did you put the small
>>> relation last in your list? The order is important...
>>>
>>> -D
>>>
>>>
>>> On Fri, Oct 8, 2010 at 8:47 AM, Vincent <[email protected]> wrote:
>>>
>>>> *I've tried mapred.child.java.opts value 2048m*. Now the error is a
>>>> timeout. Seems like the system is so loaded that it becomes
>>>> unresponsive...
>>>>
>>>> Here are the outputs of the job tracker:
>>>>
>>>> Hadoop job_201010081840_0010
>>>> <http://prog7.lan:50030/jobdetails.jsp?jobid=job_201010081840_0010>
>>>> failures on prog7 <http://prog7.lan:50030/jobtracker.jsp>
>>>>
>>>> Attempt  Task  Machine  State  Error  Logs
>>>> attempt_201010081840_0010_r_000001_0  task_201010081840_0010_r_000001
>>>> <http://prog7.lan:50030/taskdetails.jsp?jobid=job_201010081840_0010&tipid=task_201010081840_0010_r_000001>
>>>> prog7 <http://prog7:50060>  FAILED
>>>>
>>>> Task attempt_201010081840_0010_r_000001_0 failed to report status for
>>>> 601 seconds. Killing!
>>>> Last 4KB
>>>> <http://prog7:50060/tasklog?taskid=attempt_201010081840_0010_r_000001_0&start=-4097>
>>>> Last 8KB
>>>> <http://prog7:50060/tasklog?taskid=attempt_201010081840_0010_r_000001_0&start=-8193>
>>>> All
>>>> <http://prog7:50060/tasklog?taskid=attempt_201010081840_0010_r_000001_0>
>>>>
>>>> Task Logs: 'attempt_201010081840_0010_r_000001_0'
>>>>
>>>> *_stdout logs_*
>>>> ------------------------------------------------------------------------
>>>>
>>>> *_stderr logs_*
>>>> ------------------------------------------------------------------------
>>>>
>>>> *_syslog logs_*
>>>>
>>>> 2010-10-08 19:11:49,732 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=SHUFFLE, sessionId=
>>>> 2010-10-08 19:11:50,963 INFO org.apache.hadoop.mapred.ReduceTask: ShuffleRamManager: MemoryLimit=1336252800, MaxSingleShuffleLimit=334063200
>>>> 2010-10-08 19:11:50,997 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201010081840_0010_r_000001_0 Thread started: Thread for merging on-disk files
>>>> 2010-10-08 19:11:50,997 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201010081840_0010_r_000001_0 Thread waiting: Thread for merging on-disk files
>>>> 2010-10-08 19:11:51,004 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201010081840_0010_r_000001_0 Thread started: Thread for merging in memory files
>>>> 2010-10-08 19:11:51,004 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201010081840_0010_r_000001_0 Need another 24 map output(s) where 0 is already in progress
>>>> 2010-10-08 19:11:51,005 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201010081840_0010_r_000001_0 Scheduled 0 outputs (0 slow hosts and 0 dup hosts)
>>>> 2010-10-08 19:11:51,005 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201010081840_0010_r_000001_0 Thread started: Thread for polling Map Completion Events
>>>> 2010-10-08
19:11:51,020 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0: Got 10 new map-outputs >>>> 2010-10-08 19:11:56,005 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 2 outputs (0 slow hosts >>>> and0 >>>> dup hosts) >>>> 2010-10-08 19:11:56,091 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000002_0, compressed len: 18158866, >>>> decompressed >>>> len: 18158862 >>>> 2010-10-08 19:11:56,091 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 18158862 bytes (18158866 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000002_0 >>>> 2010-10-08 19:11:56,582 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000000_0, compressed len: 20624287, >>>> decompressed >>>> len: 20624283 >>>> 2010-10-08 19:11:56,582 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 20624283 bytes (20624287 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000000_0 >>>> 2010-10-08 19:11:57,035 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0: Got 2 new map-outputs >>>> 2010-10-08 19:11:57,258 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 20624283 bytes from map-output for attempt_201010081840_0010_m_000000_0 >>>> 2010-10-08 19:11:57,271 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000000_0 -> (105, 265) from prog7 >>>> 2010-10-08 19:11:57,274 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts >>>> and9 >>>> dup hosts) >>>> 2010-10-08 19:11:57,313 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000001_0, compressed len: 18485340, >>>> decompressed >>>> len: 18485336 >>>> 2010-10-08 19:11:57,313 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 18485336 bytes (18485340 raw bytes) into RAM from >>>> 
attempt_201010081840_0010_m_000001_0 >>>> 2010-10-08 19:11:57,971 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 18158862 bytes from map-output for attempt_201010081840_0010_m_000002_0 >>>> 2010-10-08 19:11:57,971 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000002_0 -> (177, 148) from hermitage >>>> 2010-10-08 19:11:57,980 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts >>>> and0 >>>> dup hosts) >>>> 2010-10-08 19:11:58,043 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000003_0, compressed len: 18075620, >>>> decompressed >>>> len: 18075616 >>>> 2010-10-08 19:11:58,044 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 18075616 bytes (18075620 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000003_0 >>>> 2010-10-08 19:11:58,277 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 18485336 bytes from map-output for attempt_201010081840_0010_m_000001_0 >>>> 2010-10-08 19:11:58,277 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000001_0 -> (241, 162) from prog7 >>>> 2010-10-08 19:12:01,929 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 18075616 bytes from map-output for attempt_201010081840_0010_m_000003_0 >>>> 2010-10-08 19:12:01,930 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000003_0 -> (189, 187) from hermitage >>>> 2010-10-08 19:12:01,935 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts >>>> and0 >>>> dup hosts) >>>> 2010-10-08 19:12:01,937 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000006_0, compressed len: 18255983, >>>> decompressed >>>> len: 18255979 >>>> 2010-10-08 19:12:01,937 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 18255979 bytes (18255983 raw bytes) into RAM from >>>> 
attempt_201010081840_0010_m_000006_0 >>>> 2010-10-08 19:12:03,044 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0: Got 1 new map-outputs >>>> 2010-10-08 19:12:03,049 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000004_0, compressed len: 18874529, >>>> decompressed >>>> len: 18874525 >>>> 2010-10-08 19:12:03,049 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 18874525 bytes (18874529 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000004_0 >>>> 2010-10-08 19:12:03,067 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts >>>> and7 >>>> dup hosts) >>>> 2010-10-08 19:12:03,608 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 18874525 bytes from map-output for attempt_201010081840_0010_m_000004_0 >>>> 2010-10-08 19:12:03,609 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000004_0 -> (105, 133) from prog7 >>>> 2010-10-08 19:12:04,087 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 18255979 bytes from map-output for attempt_201010081840_0010_m_000006_0 >>>> 2010-10-08 19:12:04,088 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000006_0 -> (105, 178) from hermitage >>>> 2010-10-08 19:12:04,094 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts >>>> and0 >>>> dup hosts) >>>> 2010-10-08 19:12:04,319 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000007_0, compressed len: 18358512, >>>> decompressed >>>> len: 18358508 >>>> 2010-10-08 19:12:04,319 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 18358508 bytes (18358512 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000007_0 >>>> 2010-10-08 19:12:06,254 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 18358508 bytes from map-output for 
attempt_201010081840_0010_m_000007_0 >>>> 2010-10-08 19:12:06,255 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000007_0 -> (105, 166) from hermitage >>>> 2010-10-08 19:12:06,258 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts >>>> and0 >>>> dup hosts) >>>> 2010-10-08 19:12:06,270 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000008_0, compressed len: 18092007, >>>> decompressed >>>> len: 18092003 >>>> 2010-10-08 19:12:06,271 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 18092003 bytes (18092007 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000008_0 >>>> 2010-10-08 19:12:07,808 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 18092003 bytes from map-output for attempt_201010081840_0010_m_000008_0 >>>> 2010-10-08 19:12:07,809 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000008_0 -> (293, 232) from hermitage >>>> 2010-10-08 19:12:07,810 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts >>>> and0 >>>> dup hosts) >>>> 2010-10-08 19:12:07,813 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000009_0, compressed len: 17941909, >>>> decompressed >>>> len: 17941905 >>>> 2010-10-08 19:12:07,813 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 17941905 bytes (17941909 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000009_0 >>>> 2010-10-08 19:12:09,059 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0: Got 3 new map-outputs >>>> 2010-10-08 19:12:09,060 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts >>>> and6 >>>> dup hosts) >>>> 2010-10-08 19:12:09,338 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 17941905 bytes from 
map-output for attempt_201010081840_0010_m_000009_0 >>>> 2010-10-08 19:12:09,338 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000009_0 -> (105, 197) from hermitage >>>> 2010-10-08 19:12:09,338 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts >>>> and0 >>>> dup hosts) >>>> 2010-10-08 19:12:09,341 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000010_0, compressed len: 18405142, >>>> decompressed >>>> len: 18405138 >>>> 2010-10-08 19:12:09,341 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 18405138 bytes (18405142 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000010_0 >>>> 2010-10-08 19:12:09,369 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000005_0, compressed len: 18009096, >>>> decompressed >>>> len: 18009092 >>>> 2010-10-08 19:12:09,369 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 18009092 bytes (18009096 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000005_0 >>>> 2010-10-08 19:12:10,691 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 18009092 bytes from map-output for attempt_201010081840_0010_m_000005_0 >>>> 2010-10-08 19:12:10,691 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000005_0 -> (105, 206) from prog7 >>>> 2010-10-08 19:12:11,101 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 18405138 bytes from map-output for attempt_201010081840_0010_m_000010_0 >>>> 2010-10-08 19:12:11,101 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000010_0 -> (137, 175) from hermitage >>>> 2010-10-08 19:12:11,102 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts >>>> and0 >>>> dup hosts) >>>> 2010-10-08 19:12:11,104 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> 
attempt_201010081840_0010_m_000011_0, compressed len: 20002825, >>>> decompressed >>>> len: 20002821 >>>> 2010-10-08 19:12:11,104 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 20002821 bytes (20002825 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000011_0 >>>> 2010-10-08 19:12:12,805 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 20002821 bytes from map-output for attempt_201010081840_0010_m_000011_0 >>>> 2010-10-08 19:12:12,805 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000011_0 -> (105, 143) from hermitage >>>> 2010-10-08 19:12:12,815 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts >>>> and0 >>>> dup hosts) >>>> 2010-10-08 19:12:12,817 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000012_0, compressed len: 18135959, >>>> decompressed >>>> len: 18135955 >>>> 2010-10-08 19:12:12,817 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 18135955 bytes (18135959 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000012_0 >>>> 2010-10-08 19:12:14,361 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 18135955 bytes from map-output for attempt_201010081840_0010_m_000012_0 >>>> 2010-10-08 19:12:14,361 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000012_0 -> (137, 149) from hermitage >>>> 2010-10-08 19:12:14,362 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts >>>> and0 >>>> dup hosts) >>>> 2010-10-08 19:12:14,364 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000013_0, compressed len: 18440786, >>>> decompressed >>>> len: 18440782 >>>> 2010-10-08 19:12:14,364 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 18440782 bytes (18440786 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000013_0 >>>> 2010-10-08 
19:12:15,935 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 18440782 bytes from map-output for attempt_201010081840_0010_m_000013_0 >>>> 2010-10-08 19:12:15,935 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000013_0 -> (137, 142) from hermitage >>>> 2010-10-08 19:12:15,936 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts >>>> and0 >>>> dup hosts) >>>> 2010-10-08 19:12:15,938 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000014_0, compressed len: 18205885, >>>> decompressed >>>> len: 18205881 >>>> 2010-10-08 19:12:15,938 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 18205881 bytes (18205885 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000014_0 >>>> 2010-10-08 19:12:17,489 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 18205881 bytes from map-output for attempt_201010081840_0010_m_000014_0 >>>> 2010-10-08 19:12:17,499 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000014_0 -> (253, 159) from hermitage >>>> 2010-10-08 19:12:17,506 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts >>>> and0 >>>> dup hosts) >>>> 2010-10-08 19:12:17,510 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000015_0, compressed len: 17476262, >>>> decompressed >>>> len: 17476258 >>>> 2010-10-08 19:12:17,510 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 17476258 bytes (17476262 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000015_0 >>>> 2010-10-08 19:12:17,612 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0: Got 1 new map-outputs >>>> 2010-10-08 19:12:19,030 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 17476258 bytes from map-output for attempt_201010081840_0010_m_000015_0 >>>> 2010-10-08 19:12:19,035 
INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000015_0 -> (105, 158) from hermitage >>>> 2010-10-08 19:12:19,035 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts >>>> and0 >>>> dup hosts) >>>> 2010-10-08 19:12:19,061 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000017_0, compressed len: 18542230, >>>> decompressed >>>> len: 18542226 >>>> 2010-10-08 19:12:19,061 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 18542226 bytes (18542230 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000017_0 >>>> 2010-10-08 19:12:20,640 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 18542226 bytes from map-output for attempt_201010081840_0010_m_000017_0 >>>> 2010-10-08 19:12:20,640 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000017_0 -> (257, 151) from hermitage >>>> 2010-10-08 19:12:23,626 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0: Got 1 new map-outputs >>>> 2010-10-08 19:12:25,643 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts >>>> and0 >>>> dup hosts) >>>> 2010-10-08 19:12:25,670 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000018_0, compressed len: 18737340, >>>> decompressed >>>> len: 18737336 >>>> 2010-10-08 19:12:25,670 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 18737336 bytes (18737340 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000018_0 >>>> 2010-10-08 19:12:27,438 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 18737336 bytes from map-output for attempt_201010081840_0010_m_000018_0 >>>> 2010-10-08 19:12:27,439 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000018_0 -> (253, 175) from hermitage >>>> 2010-10-08 19:12:28,646 INFO 
org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0: Got 2 new map-outputs >>>> 2010-10-08 19:12:31,652 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0: Got 2 new map-outputs >>>> 2010-10-08 19:12:32,439 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 2 outputs (0 slow hosts >>>> and0 >>>> dup hosts) >>>> 2010-10-08 19:12:32,473 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000020_0, compressed len: 17710258, >>>> decompressed >>>> len: 17710254 >>>> 2010-10-08 19:12:32,473 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 17710254 bytes (17710258 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000020_0 >>>> 2010-10-08 19:12:32,475 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000016_0, compressed len: 20708576, >>>> decompressed >>>> len: 20708572 >>>> 2010-10-08 19:12:32,475 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 20708572 bytes (20708576 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000016_0 >>>> 2010-10-08 19:12:33,138 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 20708572 bytes from map-output for attempt_201010081840_0010_m_000016_0 >>>> 2010-10-08 19:12:33,164 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000016_0 -> (297, 318) from prog7 >>>> 2010-10-08 19:12:33,167 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts >>>> and1 >>>> dup hosts) >>>> 2010-10-08 19:12:33,172 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000019_0, compressed len: 18984487, >>>> decompressed >>>> len: 18984483 >>>> 2010-10-08 19:12:33,172 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 18984483 bytes (18984487 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000019_0 >>>> 
2010-10-08 19:12:33,774 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 18984483 bytes from map-output for attempt_201010081840_0010_m_000019_0 >>>> 2010-10-08 19:12:33,774 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000019_0 -> (285, 160) from prog7 >>>> 2010-10-08 19:12:34,057 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 17710254 bytes from map-output for attempt_201010081840_0010_m_000020_0 >>>> 2010-10-08 19:12:34,057 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000020_0 -> (105, 127) from hermitage >>>> 2010-10-08 19:12:34,081 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts >>>> and0 >>>> dup hosts) >>>> 2010-10-08 19:12:34,085 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000021_0, compressed len: 18803713, >>>> decompressed >>>> len: 18803709 >>>> 2010-10-08 19:12:34,085 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 18803709 bytes (18803713 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000021_0 >>>> 2010-10-08 19:12:36,579 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 18803709 bytes from map-output for attempt_201010081840_0010_m_000021_0 >>>> 2010-10-08 19:12:36,579 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000021_0 -> (137, 164) from hermitage >>>> 2010-10-08 19:12:43,867 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0: Got 2 new map-outputs >>>> 2010-10-08 19:12:46,585 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts >>>> and0 >>>> dup hosts) >>>> 2010-10-08 19:12:46,589 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000022_1, compressed len: 18143868, >>>> decompressed >>>> len: 18143864 >>>> 2010-10-08 19:12:46,589 INFO 
org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 18143864 bytes (18143868 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000022_1 >>>> 2010-10-08 19:12:48,167 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 18143864 bytes from map-output for attempt_201010081840_0010_m_000022_1 >>>> 2010-10-08 19:12:48,176 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000022_1 -> (105, 133) from hermitage >>>> 2010-10-08 19:12:48,182 INFO org.apache.hadoop.mapred.ReduceTask: >>>> attempt_201010081840_0010_r_000001_0 Scheduled 1 outputs (0 slow hosts >>>> and0 >>>> dup hosts) >>>> 2010-10-08 19:12:48,428 INFO org.apache.hadoop.mapred.ReduceTask: >>>> header: >>>> attempt_201010081840_0010_m_000023_1, compressed len: 9198819, >>>> decompressed >>>> len: 9198815 >>>> 2010-10-08 19:12:48,428 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Shuffling 9198815 bytes (9198819 raw bytes) into RAM from >>>> attempt_201010081840_0010_m_000023_1 >>>> 2010-10-08 19:12:49,878 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Ignoring >>>> obsolete output of KILLED map-task: >>>> 'attempt_201010081840_0010_m_000022_0' >>>> 2010-10-08 19:12:49,938 INFO org.apache.hadoop.mapred.ReduceTask: Read >>>> 9198815 bytes from map-output for attempt_201010081840_0010_m_000023_1 >>>> 2010-10-08 19:12:49,938 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 >>>> from attempt_201010081840_0010_m_000023_1 -> (137, 166) from hermitage >>>> 2010-10-08 19:12:50,878 INFO org.apache.hadoop.mapred.ReduceTask: >>>> GetMapEventsThread exiting >>>> 2010-10-08 19:12:50,878 INFO org.apache.hadoop.mapred.ReduceTask: >>>> getMapsEventsThread joined. >>>> 2010-10-08 19:12:50,878 INFO org.apache.hadoop.mapred.ReduceTask: Closed >>>> ram manager >>>> 2010-10-08 19:12:50,879 INFO org.apache.hadoop.mapred.ReduceTask: >>>> Interleaved on-disk merge complete: 0 files left. 
>>>> 2010-10-08 19:12:50,879 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 24 files left.
>>>> 2010-10-08 19:12:51,029 INFO org.apache.hadoop.mapred.Merger: Merging 24 sorted segments
>>>> 2010-10-08 19:12:51,030 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 24 segments left of total size: 436372203 bytes
>>>> 2010-10-08 19:13:04,406 INFO org.apache.hadoop.mapred.ReduceTask: Merged 24 segments, 436372203 bytes to disk to satisfy reduce memory limit
>>>> 2010-10-08 19:13:04,407 INFO org.apache.hadoop.mapred.ReduceTask: Merging 1 files, 436372161 bytes from disk
>>>> 2010-10-08 19:13:04,426 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
>>>> 2010-10-08 19:13:04,426 INFO org.apache.hadoop.mapred.Merger: Merging 1 sorted segments
>>>> 2010-10-08 19:13:04,463 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 436372157 bytes
>>>> 2010-10-08 19:13:18,879 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat: Total input paths to process : 24
>>>> 2010-10-08 19:13:18,879 INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil: Total input paths to process : 24
>>>> 2010-10-08 19:16:14,354 INFO org.apache.pig.impl.util.SpillableMemoryManager: low memory handler called (Collection threshold exceeded) init = 32309248(31552K) used = 803560952(784727K) committed = 1069678592(1044608K) max = 1431699456(1398144K)
>>>>
>>>>
>>>> On 10/08/2010 06:44 PM, Vincent wrote:
>>>>
>>>>> Yep, I did restart the cluster (dfs and mapred stop/start).
>>>>>
>>>>> Increasing the amount of memory, I can see that the reduce task goes
>>>>> further (the percentage is greater), but then the percentage starts to
>>>>> decrease again with memory failures.
>>>>> On 10/08/2010 06:41 PM, Jeff Zhang wrote:
>>>>>
>>>>>> Did you restart the cluster after reconfiguration?
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 8, 2010 at 9:59 PM, Vincent <[email protected]> wrote:
>>>>>>
>>>>>>> I've tried with mapred.child.java.opts value:
>>>>>>> -Xmx512m --> still memory errors in the reduce phase
>>>>>>> -Xmx1024m --> still memory errors in the reduce phase
>>>>>>> I am now trying with -Xmx1536m, but I'm afraid that my nodes will
>>>>>>> start to swap memory...
>>>>>>>
>>>>>>> Should I continue in this direction? Or is it already too much and I
>>>>>>> should look for the problem somewhere else?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> -Vincent
>>>>>>>
>>>>>>>
>>>>>>> On 10/08/2010 03:04 PM, Jeff Zhang wrote:
>>>>>>>
>>>>>>>> Try to increase the heap size of each task by setting
>>>>>>>> mapred.child.java.opts in mapred-site.xml. The default value is
>>>>>>>> -Xmx200m in mapred-default.xml, which may be too small for you.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Oct 8, 2010 at 6:55 PM, Vincent <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks to Dmitriy and Jeff, I've set:
>>>>>>>>>
>>>>>>>>> set default_parallel 20; at the beginning of my script.
>>>>>>>>>
>>>>>>>>> Updated 8 JOINs to look like:
>>>>>>>>>
>>>>>>>>> JOIN big BY id, small BY id USING 'replicated';
>>>>>>>>>
>>>>>>>>> Unfortunately this didn't improve the script speed (at least it
>>>>>>>>> has been running for more than one hour now).
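[Editor's note: Dmitriy's point about ordering in replicated joins is worth spelling out. In Pig's fragment-replicate join, every relation except the first is loaded into memory on each map task, so the small relation must be listed last. A minimal sketch — relation names are illustrative, not from the original script:]

```pig
-- In a fragment-replicate ('replicated') join, all relations EXCEPT the
-- first are loaded into memory on each map task. The first relation is
-- streamed, so put the big one first and the small one last.
-- (Relation names are illustrative.)
big   = LOAD 'access_logs'   AS (id:chararray, url:chararray);
small = LOAD 'status_lookup' AS (id:chararray, label:chararray);

good = JOIN big BY id, small BY id USING 'replicated';    -- small held in memory: OK
-- bad = JOIN small BY id, big BY id USING 'replicated';  -- big held in memory: OOM
```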
>>>>>>>>>
>>>>>>>>> But looking in the jobtracker at one of the jobs that reduces, I
>>>>>>>>> can see this for the map:
>>>>>>>>>
>>>>>>>>> Hadoop map task list for job_201010081314_0010 on prog7
>>>>>>>>>
>>>>>>>>> Task                             Complete  Start/Finish (8-Oct-2010)            Errors
>>>>>>>>> task_201010081314_0010_m_000000  100.00%   14:07:44 - 14:23:11 (15mins, 27sec)  Too many fetch-failures (x2)
>>>>>>>>>
>>>>>>>>> And I can see this for the reduce:
>>>>>>>>>
>>>>>>>>> Hadoop reduce task list for job_201010081314_0010 on prog7
>>>>>>>>>
>>>>>>>>> Task                             Complete  Status                       Start     Error
>>>>>>>>> task_201010081314_0010_r_000000  9.72%     copy (7 of 24 at 0.01 MB/s)  14:14:49  GC overhead limit exceeded
>>>>>>>>> task_201010081314_0010_r_000001  0.00%                                  14:14:52  Java heap space
>>>>>>>>> task_201010081314_0010_r_000002  0.00%                                  14:15:58  java.io.IOException: Task process exit with nonzero status of 1.
>>>>>>>>>                                                                                       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
>>>>>>>>> task_201010081314_0010_r_000003  9.72%     copy (7 of 24 at 0.01 MB/s)  14:16:58
>>>>>>>>> task_201010081314_0010_r_000004  0.00%                                  14:18:11  GC overhead limit exceeded
>>>>>>>>> task_201010081314_0010_r_000005  0.00%                                  14:18:56  GC overhead limit exceeded
>>>>>>>>>
>>>>>>>>> Seems like it runs out of memory... Which parameter should be
>>>>>>>>> increased?
>>>>>>>>>
>>>>>>>>> -Vincent
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 10/08/2010 01:12 PM, Jeff Zhang wrote:
>>>>>>>>>
>>>>>>>>>> BTW, you can look at the job tracker web UI to see which part of
>>>>>>>>>> the job costs the most time.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 8, 2010 at 5:11 PM, Jeff Zhang<[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> No, I mean whether your MapReduce job's reduce task number is 1.
>>>>>>>>>>>
>>>>>>>>>>> And could you share your Pig script? Then others can really
>>>>>>>>>>> understand your problem.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 8, 2010 at 5:04 PM, Vincent<[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> You are right, I didn't change this parameter, therefore the
>>>>>>>>>>>> default is used from src/mapred/mapred-default.xml:
>>>>>>>>>>>>
>>>>>>>>>>>> <property>
>>>>>>>>>>>>   <name>mapred.reduce.tasks</name>
>>>>>>>>>>>>   <value>1</value>
>>>>>>>>>>>>   <description>The default number of reduce tasks per job.
>>>>>>>>>>>>   Typically set to 99% of the cluster's reduce capacity, so that
>>>>>>>>>>>>   if a node fails the reduces can still be executed in a single
>>>>>>>>>>>>   wave. Ignored when mapred.job.tracker is "local".
>>>>>>>>>>>>   </description>
>>>>>>>>>>>> </property>
>>>>>>>>>>>>
>>>>>>>>>>>> Not clear to me what the reduce capacity of my cluster is :)
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/08/2010 01:00 PM, Jeff Zhang wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I guess maybe your reduce number is 1, which makes the reduce
>>>>>>>>>>>>> phase very slow.
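[For reference, the two job settings discussed in this thread both go into mapred-site.xml. A minimal sketch, using the values Vincent tried above rather than tuned recommendations:]

```xml
<!-- conf/mapred-site.xml: values taken from this thread, not recommendations -->
<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value> <!-- per-task JVM heap; default is -Xmx200m -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>20</value> <!-- default is 1; ignored when mapred.job.tracker is "local" -->
  </property>
</configuration>
```

[As Jeff asks above, the cluster (or at least the TaskTrackers) needs a restart after changing this file for the new values to take effect.]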
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Oct 8, 2010 at 4:44 PM, Vincent<[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Well, I can see from the job tracker that all the jobs are done
>>>>>>>>>>>>>> quite quickly except 2, for which the reduce phase goes really,
>>>>>>>>>>>>>> really slowly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But how can I map a job in the Hadoop job tracker (example:
>>>>>>>>>>>>>> job_201010072150_0045) back to the Pig script execution?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And what is most efficient: several small Pig scripts, or one
>>>>>>>>>>>>>> big Pig script? I wrote one big script to avoid loading the
>>>>>>>>>>>>>> same logs several times in different scripts. Maybe it is not
>>>>>>>>>>>>>> such a good design...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for your help.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Vincent
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 10/08/2010 11:31 AM, Vincent wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm using pig-0.7.0 on hadoop-0.20.2.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As for the script, well, it's more than 500 lines; I'm not
>>>>>>>>>>>>>>> sure that if I post it here anybody will read it to the end :-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 10/08/2010 11:26 AM, Dmitriy Ryaboy wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What version of Pig, and what does your script look like?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Oct 7, 2010 at 11:48 PM, Vincent<[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm quite new to Pig/Hadoop. So maybe my cluster size will
>>>>>>>>>>>>>>>>> make you laugh.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I wrote a Pig script that handles 1.5GB of logs in less
>>>>>>>>>>>>>>>>> than one hour in Pig local mode on an Intel Core 2 Duo
>>>>>>>>>>>>>>>>> with 3GB of RAM.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Then I tried this script on a simple 2-node cluster. These
>>>>>>>>>>>>>>>>> 2 nodes are not servers but simple computers:
>>>>>>>>>>>>>>>>> - Intel Core 2 Duo with 3GB of RAM.
>>>>>>>>>>>>>>>>> - Intel Quad with 4GB of RAM.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Well, I was aware that Hadoop has overhead and that the job
>>>>>>>>>>>>>>>>> wouldn't finish in half an hour (local time divided by the
>>>>>>>>>>>>>>>>> number of nodes). But I was surprised to see this morning
>>>>>>>>>>>>>>>>> that it took 7 hours to complete!!!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> My configuration was made according to this link:
>>>>>>>>>>>>>>>>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> My question is simple: is it normal?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Vincent
>>>>>>>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best Regards
>>>>>>>>>>>
>>>>>>>>>>> Jeff Zhang
