> We went through some grief with small files and inefficiencies there.
[...]
>> Hadoop was engineered to efficiently process a small number of large files
>> and not the other way around. Since PIG utilizes Hadoop it will have a
>> similar limitation. Some improvements have been made on that front
>> (CombinedInputFormat), but the performance is still lacking.

Combining all of the files into just three large files reduces the run-time to 
20 minutes on 2 nodes (compared to 5h40m on 10 nodes)! Going to one file per 
data-day (instead of one per data-hour, which is what it was before) keeps it 
at a still-comfortable 33 minutes on 2 nodes. I didn't think that 15k files was 
in the "lots" range, but there you go. Thank you guys so much :)
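For anyone else hitting the same wall, the hour-to-day consolidation step can 
be sketched in plain shell. The logs/YYYY-MM-DD/hour-HH.log layout below is an 
assumption for illustration, not the actual paths from this job (on HDFS the 
same idea works with hadoop fs -cat / -getmerge):

```shell
# Set up a toy layout: one directory per data-day, one file per hour.
# (Assumed layout, not the poster's real paths.)
mkdir -p logs/2011-01-01 merged
for h in 00 01 02; do
  echo "record from hour $h" > "logs/2011-01-01/hour-$h.log"
done

# Concatenate each day's hourly files into a single per-day file, so the
# job sees one large input per day instead of 24 small ones.
for day in logs/*/; do
  out="merged/$(basename "$day").log"
  cat "$day"/hour-*.log > "$out"
done

# merged/2011-01-01.log now holds all of that day's records in order.
```

The win comes purely from reducing the number of input files (and hence map 
tasks), not from anything clever in the concatenation itself.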

This just leaves me with the question of how to get the job to actually 
complete on my laptop in local mode. It doesn't have to be fast, but having it 
not die with an out-of-memory error would be a good start for testing 
purposes. I'm giving it a 2 GB heap, which seems like it should be fine, since 
the larger intermediate chunks should be able to spill over to disk, right? 
Giving it a 3 GB heap doesn't seem to change the behaviour; it just takes a 
few more minutes to die.
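For reference, this is roughly how I'm sizing the heap. Pig's bin/pig launcher 
script reads the PIG_HEAPSIZE environment variable (a value in MB) when it 
builds the JVM options; the script name below is a placeholder:

```shell
# PIG_HEAPSIZE is read by Pig's launcher script and becomes the JVM -Xmx
# (the value is in megabytes).
export PIG_HEAPSIZE=2048

# -x local runs the whole job in a single local JVM, so this one heap
# setting covers everything. myscript.pig is a placeholder name.
pig -x local myscript.pig
```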
