If you want to know more about the internals, I'd check out the paper Yahoo
put out on the topic (or, of course, buy the book Programming Pig).
The answer to this is pretty simple: if you load a file multiple times into
different relations, then it will be scanned multiple times. So...
a = load 'thing';
b = load 'thing';
{..stuff using a..}
{..stuff using b..}
would load 'thing' twice. That's intentional -- there are cases (a
self-join, for example) where you really do need to load the same file into
two separate relations, and in those cases the data genuinely gets loaded
and scanned twice.
However, as in your case, if you instead combine the load, then you'd have
a = load 'thing';
{..stuff using a..}
{..stuff using a (which previously used b)..}
Now Pig will scan 'thing' just once and feed each record into both of the
pipelines you defined -- this is Pig's multi-query execution. LOAD doesn't
stage anything extra in HDFS; it just becomes the input of the map phase.
Obviously it's more complex than that, but that's the general gist.
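To make the combined-load pattern concrete, here's a rough sketch in Pig
Latin (the file name, fields, and output paths are made up for
illustration):

```
-- one LOAD, shared by two pipelines via multi-query execution
logs = LOAD 'access_log' USING PigStorage('\t')
       AS (user:chararray, status:int);

-- pipeline 1: count server errors per user
errors  = FILTER logs BY status >= 500;
err_grp = GROUP errors BY user;
err_cnt = FOREACH err_grp GENERATE group AS user, COUNT(errors) AS n;
STORE err_cnt INTO 'error_counts';

-- pipeline 2: count successful requests per user
oks    = FILTER logs BY status == 200;
ok_grp = GROUP oks BY user;
ok_cnt = FOREACH ok_grp GENERATE group AS user, COUNT(oks) AS n;
STORE ok_cnt INTO 'ok_counts';
```

Because both STOREs sit in one script and hang off the same relation, Pig
can satisfy them with a single scan of 'access_log', whereas two separate
scripts (or two LOADs) would read it twice.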
2011/10/3 Something Something <[email protected]>
> I have 3 Pig scripts that load data from the same log file, but filter &
> group this data differently. If I combine these 3 into one & LOAD only
> once, performance seems to have improved, but now I am curious exactly what
> does LOAD do?
>
> How does LOAD work internally? Does Pig save the results of the LOAD into
> some separate location in HDFS? Could someone please explain how LOAD
> relates to MapReduce? Thanks.
>