As long as the loader can provide statistics (be it the Howl loader or something else), we can use those stats to make optimizations. There is actually a (very simplistic) example of this already in 0.8: it estimates the number of reducers from the input file size if parallelism is not specified for join/group/etc. operations. It's not using LoadStatistics afaik; that should be fixed.
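For anyone who hasn't looked at it, the reducer estimate boils down to something like the sketch below. This is a rough illustration of the idea, not the actual Pig code: the function name is made up, and the two defaults are my assumption of what the `pig.exec.reducers.bytes.per.reducer` and `pig.exec.reducers.max` settings default to, so check them against your release.

```python
# Hypothetical sketch of a size-based reducer estimate; not Pig's real code.
# Both defaults are assumptions modelled on the settings
# pig.exec.reducers.bytes.per.reducer and pig.exec.reducers.max.
ASSUMED_BYTES_PER_REDUCER = 1_000_000_000
ASSUMED_MAX_REDUCERS = 999


def estimate_reducers(input_sizes,
                      bytes_per_reducer=ASSUMED_BYTES_PER_REDUCER,
                      max_reducers=ASSUMED_MAX_REDUCERS):
    """Estimate a reducer count from the sizes (in bytes) of the input files."""
    total_bytes = sum(input_sizes)
    # Integer ceiling division: one reducer per started chunk of input.
    wanted = (total_bytes + bytes_per_reducer - 1) // bytes_per_reducer
    # Always use at least one reducer, and never exceed the configured cap.
    return max(1, min(max_reducers, wanted))
```

With these assumed defaults, `estimate_reducers([2_500_000_000])` gives 3. The same total could just as well come from loader-provided statistics instead of raw file sizes, which is the fix suggested above.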
Ashutosh and I did some work around this about a year ago (on top of the visitor madness) that was never really presentable... would be happy to dig up the code and see if any of it is still salvageable.

-D

On Thu, Oct 14, 2010 at 11:01 AM, Alan Gates <[email protected]> wrote:

> AFAIK no one is working on that currently. Our next thoughts on optimizer
> improvement were to start using the new optimizer framework in the MR
> optimizer so we can bring some order to the madness of visitors that is the
> MR optimizer. I think Thejas plans on starting work on that in 0.9.
>
> In the long run many people have discussed having a cost based optimizer,
> but I have not seen any proposals of how it should work. With the advent of
> Howl it would seem that at least basic statistic based decisions could be
> made in the optimizer: is one of the files small enough to use a replicated
> join in this case? is the file already sorted so we can use a merge join?
>
> It would be great if you're interested in working in this area. The best
> way to start is file a JIRA or start a wiki page with general approach and
> design information.
>
> Alan.
>
> On Oct 13, 2010, at 7:46 PM, Renato Marroquín Mogrovejo wrote:
>
>> Hey everyone!
>> In the Pig Journal page (http://wiki.apache.org/pig/PigJournal) says
>> something about getting statistics for Pig's optimizer. Is there any work
>> being done on that?
>> Or are there any other plans to improve the optimizer? I mean now is a
>> rule based one, are there expectations to change it to a cost based one?
>> Any opinions or comments are highly appreciated. Thanks!
>>
>> Renato M.
