[
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Santhosh Srinivasan updated PIG-697:
Status: In Progress (was: Patch Available)
Proposed improvements to pig's optimizer
Key: PIG-697
URL: https://issues.apache.org/jira/browse/PIG-697
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch,
OptimizerPhase2.patch
I propose the following changes to pig optimizer, plan, and operator
functionality to support more robust optimization:
1) Remove the required array from Rule. This will change rules so that they
only match exact patterns instead of allowing missing elements in the pattern.
This has the downside that if a given rule applies to two patterns (say
Load-Filter-Group, Load-Group) you have to write two rules. But it has
the upside that
the resulting rules know exactly what they are getting. The original intent
of this was to reduce the number of rules that needed to be written. But the
resulting rules have do a lot of work to understand the operators they are
working with. With exact matches only, each rule will know exactly the
operators it
is working on and can apply the logic of shifting the operators around. All
four of the existing rules set all entries of required to true, so removing
this
will have no effect on them.
2) Change PlanOptimizer.optimize to iterate over the rules until there are no
conversions or a certain number of iterations has been reached. Currently the
function is:
{code}
public final void optimize() throws OptimizerException {
RuleMatcher matcher = new RuleMatcher();
for (Rule rule : mRules) {
if (matcher.match(rule)) {
// It matches the pattern. Now check if the transformer
// approves as well.
ListListO matches = matcher.getAllMatches();
for (ListO match:matches)
{
if (rule.transformer.check(match)) {
// The transformer approves.
rule.transformer.transform(match);
}
}
}
}
}
{code}
It would change to be:
{code}
public final void optimize() throws OptimizerException {
RuleMatcher matcher = new RuleMatcher();
boolean sawMatch;
int iterators = 0;
do {
sawMatch = false;
for (Rule rule : mRules) {
ListListO matches = matcher.getAllMatches();
for (ListO match:matches) {
// It matches the pattern. Now check if the transformer
// approves as well.
if (rule.transformer.check(match)) {
// The transformer approves.
sawMatch = true;
rule.transformer.transform(match);
}
}
}
// Not sure if 1000 is the right number of iterations, maybe it
// should be configurable so that large scripts don't stop too
// early.
} while (sawMatch numIterations++ 1000);
}
{code}
The reason for limiting the number of iterations is to avoid infinite loops.
The reason for iterating over the rules is so that each rule can be applied
multiple
times as necessary. This allows us to write simple rules, mostly swaps
between neighboring operators, without worrying that we get the plan right in
one pass.
For example, we might have a plan that looks like:
Load-Join-Filter-Foreach, and we want to optimize it to
Load-Foreach-Filter-Join. With two simple
rules (swap filter and join and swap foreach and filter), applied
iteratively, we can get from the initial to final plan, without needing to
understanding the
big picture of the entire plan.
3) Add three calls to OperatorPlan:
{code}
/**
* Swap two operators in a plan. Both of the operators must have single
* inputs and single outputs.
* @param first operator
* @param second operator
* @throws PlanException if either operator is not single input and output.
*/
public void swap(E first, E second) throws PlanException {
...
}
/**
* Push one operator in front of another. This function is for use when
* the first operator has multiple inputs. The caller can specify
* which input of the first operator the second operator should be pushed to.
* @param first operator, assumed to have multiple inputs.
* @param second operator, will be pushed in front of first
* @param inputNum, indicates which input of the first