Re: Pig storage and load functions and Cache

Dmitriy Ryaboy Sun, 07 Oct 2012 21:11:23 -0700

Pig has a multi-query execution optimization built in. If your script
computes multiple relations that share parent relations, those parent
relations are computed only once. You don't have to do anything to
make that happen.
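
As a rough sketch of what that looks like (the paths, field names, and
aliases below are made up for illustration), a script like this loads
and scans the shared parent relation once, even though two outputs are
derived from it:

```pig
-- Hypothetical input path and schema, for illustration only.
logs = LOAD 'input/logs' USING PigStorage('\t')
       AS (user:chararray, url:chararray, bytes:long);

-- First child relation: total bytes per user.
by_user     = GROUP logs BY user;
user_totals = FOREACH by_user GENERATE group AS user, SUM(logs.bytes) AS total_bytes;

-- Second child relation: hit count per URL, built from the same parent.
by_url   = GROUP logs BY url;
url_hits = FOREACH by_url GENERATE group AS url, COUNT(logs) AS hits;

-- Both STOREs are in the same script, so Pig can plan them together
-- and share the work of loading 'logs'.
STORE user_totals INTO 'output/user_totals';
STORE url_hits    INTO 'output/url_hits';
```

This applies when the script is run as a whole (batch mode). As far as
I know it is on by default and can be switched off with -M or
-no_multiquery if you need the old store-by-store behavior.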

If you want caching beyond that, you have to manage it yourself, of
course.
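
One common way to do that by hand, sketched here under my own
assumptions (the paths, aliases, and the filter are hypothetical), is
to STORE an expensive intermediate relation and LOAD it back in later
scripts instead of recomputing it:

```pig
-- Script 1 (hypothetical): compute the expensive relation once and save it.
raw     = LOAD 'input/logs' USING PigStorage('\t')
          AS (user:chararray, url:chararray, bytes:long);
cleaned = FILTER raw BY bytes > 0;
STORE cleaned INTO 'cache/cleaned' USING BinStorage();

-- Script 2 (hypothetical, a separate file run later): start from the
-- saved copy. Field names are given without types here.
cleaned = LOAD 'cache/cleaned' USING BinStorage() AS (user, url, bytes);
by_user = GROUP cleaned BY user;
counts  = FOREACH by_user GENERATE group AS user, COUNT(cleaned) AS n;
STORE counts INTO 'output/user_counts';
```

Pig does not track whether the saved copy is stale; deciding when to
refresh or delete 'cache/cleaned' is up to you.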

There is some academic work on reusing parts of previous runs of the
same script (potentially on overlapping, but not identical datasets);
the papers to read are:
Nectar: http://research.microsoft.com/apps/pubs/default.aspx?id=131525
ReStore: http://vldb.org/pvldb/vol5/p586_imanelghandour_vldb2012.pdf

There are a lot of papers on iterative MapReduce; if you start with the
ReStore citations and/or Google Scholar, I am sure you'll find some.

None of that has made it into Pig yet; I believe a general compute
caching framework would be very useful, and I look forward to someone
taking up that challenge.

D

On Fri, Oct 5, 2012 at 2:51 PM, Abhishek <[email protected]> wrote:
> BinStorage()
> PigDump()
> PigStorage()
> TextLoader()
>
> Which of the above formats, for loading or storing, will optimize the queries?
>
> Can a cache be used anywhere in Pig? How can a cache be useful in Pig?
>
> Regards
> Abhi
