Do you know which stage is giving you the heap error? In my
experience, a couple of common things can lead to heap errors...
1) not giving it enough heap for bags to spill properly (though >1GB has
been fine in my experience; see the spill-tuning snippet after this list)
2) a gigantic tuple. Bags are the only objects in Pig that spill, so an
individual tuple has to fit in memory. That doesn't seem to be the error here
3) DISTINCTs. They are succinct, but they can be dangerous. Usually
DISTINCTs in nested FOREACHes are the real culprit, but is yours dying on
this stage?

userwordvals = DISTINCT userwordvals;

If so, I'd try rewriting it as:

userwordvals = FOREACH (GROUP userwordvals BY (all,that,stuff)) GENERATE FLATTEN(group);
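
To make that concrete, here's a sketch assuming a made-up schema of
(user, word, val) for userwordvals -- substitute your actual fields:

-- hypothetical schema: userwordvals: (user:chararray, word:chararray, val:long)
-- group on the full tuple (one group per distinct row) and keep only the
-- group key, which deduplicates the relation the same way DISTINCT does
userwordvals = FOREACH (GROUP userwordvals BY (user, word, val))
               GENERATE FLATTEN(group) AS (user, word, val);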

If that works, that's your culprit. Either way, it'd be helpful to know
what portion it is dying on.
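
On cause 1: if spilling is the issue, the knob I'd poke (a guess on my
part, not something from your log) is the fraction of the heap Pig lets
cached bags use before they spill -- the property is pig.cachedbag.memusage,
with a default of 0.2 if I remember right:

-- make bags spill to disk earlier, at ~10% of the heap instead of ~20%
set pig.cachedbag.memusage 0.1;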

And yeah, it's amazing how poorly Pig/Hadoop handles lots of small files.
The recommendation for filecrush is welcome -- I was looking for exactly
that the other day.
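
Related: if you're on Pig 0.8 or later, Pig can also combine small input
splits itself. A sketch -- the 256MB target is just a guess, tune it for
your cluster:

-- combine many small input files into fewer map tasks
set pig.splitCombination true;
-- target size in bytes for each combined split (~256MB here)
set pig.maxCombinedSplitSize 268435456;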

2011/11/30 David King <[email protected]>

> > We went through some grief with small files and inefficiencies there.
> [...]
> >> Hadoop was engineered to efficiently process small number of large files
> >> and not the other way around. Since PIG utilizes Hadoop it will have a
> >> similar limitation. Some improvements have been made on that front
> >> (CombinedInputFormat) but the performance is still lacking.
>
> Combining all of the files into just three large files reduces the
> run-time to 20 minutes on 2 nodes! (compared to 5h40m on 10 nodes). Going
> to one file per data-day (instead of one per data-hour which is what it was
> before) keeps it at a still-comfortable 33mins on 2 nodes. I didn't think
> that 15k files was in the "lots" range, but there you go. Thank you guys so
> much :)
>
> This just leaves me with the question of how to get the job to actually
> complete on my laptop in local mode. It doesn't have to be fast, but having
> it not die with out-of-memory would be a good start for testing purposes.
> I'm giving it a 2gb heap which seems like it should be fine since the
> larger intermediate chunks should be able to spill over to disk, right?
> Giving it a 3gb heap doesn't seem to change the behaviour, it just takes a
> few more minutes to die.
>
>
