Hi,
I am continuing to have performance problems with the fetch job after fetching
is complete. When I check jstack, I see only one thread in
org.apache.hadoop.mapred.ReduceTask.run. Is there only one reduce thread when
Nutch runs on a single machine? Is there a way to have more than one thread
to improve performance on a single machine?
This leads me to a few other questions:
1) Why is the URLFilters.filter process run as part of
mapred.ReduceTask.run?
2) When I repeatedly check jstack during mapred.ReduceTask.run, it
appears that either URLFilters.filter or BasicURLNormalizer is running. Is
there a way I can change my configuration to improve the performance of
these functions?
3) Could these functions be run before fetching the URL, to eliminate them
completely from the mapred.ReduceTask.run process and gain the advantage
of the multiple threads used in the fetch process?
4) Lastly, in trying to find the bottlenecks I'm experiencing, I looked
at RegexURLFilter.java. I was curious why a new Matcher is created on
every call to match instead of using matcher.reset. In terms of
performance, my understanding is that using reset is preferable to creating a
new matcher. Below is an example of what I mean. Just curious.
private class Rule extends RegexRule {
  private Pattern pattern;
  private Matcher myMatcher; // add a matcher

  Rule(boolean sign, String regex) {
    super(sign, regex);
    pattern = Pattern.compile(regex);
    myMatcher = pattern.matcher(""); // initialize it to blank
  }

  protected boolean match(String url) {
    // return pattern.matcher(url).find();
    return myMatcher.reset(url).find(); // use reset instead of matcher
  }
}
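To make the idea easier to try outside of Nutch, here is a minimal standalone sketch of the same reset approach. The class names MatcherReuseSketch and ReusingRule and the sample regex are made up for illustration and are not part of Nutch, and the sketch assumes Java 8 or later. One caveat: Matcher instances are not safe for concurrent use, so the sketch keeps one Matcher per thread in case match is ever called from several threads. I have not measured whether this is actually faster in the reduce step.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherReuseSketch {

    /** Hypothetical stand-in for a filter rule; not the actual Nutch class. */
    static final class ReusingRule {
        // One Matcher per thread, because a Matcher must not be shared
        // between threads that use it concurrently.
        private final ThreadLocal<Matcher> matcher;

        ReusingRule(String regex) {
            // Compile once; Pattern objects are immutable and shareable.
            final Pattern compiled = Pattern.compile(regex);
            matcher = ThreadLocal.withInitial(() -> compiled.matcher(""));
        }

        boolean match(String url) {
            // reset(CharSequence) points the existing Matcher at the new
            // input and returns the same Matcher, so no new object is made.
            return matcher.get().reset(url).find();
        }
    }

    public static void main(String[] args) {
        // Sample rule (an assumption): match URLs ending in an image suffix.
        ReusingRule rule = new ReusingRule("\\.(gif|jpg|png)$");
        System.out.println(rule.match("http://example.com/logo.png"));
        System.out.println(rule.match("http://example.com/index.html"));
    }
}
```

If match is only ever called from a single thread, the ThreadLocal could be replaced with a plain Matcher field as in my snippet above.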
Sorry about all the questions. I find that, at least on a single-machine Nutch
configuration, the fetch part of the fetcher is much faster than the
mapred.ReduceTask.run process, and the mapred.ReduceTask process is really
bogging down.
Thank you for your time!
Brad