I got the final numbers on fetching 1 million records:

Total Time:          29:01:39
Fetch & Parse Time:   6:45:32
MapReduce Time:      22:16:07
So, about 77% of a Nutch fetch is spent in the MapReduce portion and only about 23% in the Fetch and Parse portion. Is this typical? Would the result be similar on a cluster of machines versus a single machine? What can I do to reduce the MapReduce time?

Thanks
Brad

-----Original Message-----
From: brad [mailto:[email protected]]
Sent: Monday, August 02, 2010 5:03 PM
To: [email protected]
Subject: Does org.apache.hadoop.mapred.ReduceTask.run have more than one thread?

Hi,

I'm continuing to have performance problems with the fetch after fetching is complete. When I check jstack, I see only 1 thread for org.apache.hadoop.mapred.ReduceTask.run. Does it have only 1 thread when Nutch runs on a single machine? Is there a way to use more than one thread to improve performance on a single machine?

This leads me to a few other questions:

1) Why is the URLFilters.filter process run as part of mapred.ReduceTask.run?

2) When I repeatedly check jstack during mapred.ReduceTask.run, it appears that either URLFilters.filter or BasicURLNormalizer is being run. Is there a way I can change my configuration to improve the performance of these functions?

3) Could these functions be run prior to fetching the URL, to eliminate them completely from the mapred.ReduceTask.run process and gain the advantage of the multiple threads used in the fetch process?

4) Lastly, in trying to look at the bottlenecks I'm experiencing, I looked at RegexURLFilter.java. I was curious why a new Matcher is created on every call to match instead of using matcher.reset? In terms of performance, my understanding is that using reset is preferable to creating a new Matcher. Below is an example of what I mean. Just curious.
private class Rule extends RegexRule {
  private Pattern pattern;
  private Matcher myMatcher; // add a matcher

  Rule(boolean sign, String regex) {
    super(sign, regex);
    pattern = Pattern.compile(regex);
    myMatcher = pattern.matcher(""); // initialize it to blank
  }

  protected boolean match(String url) {
    // return pattern.matcher(url).find();
    return myMatcher.reset(url).find(); // use reset instead of matcher
  }
}

Sorry about all the questions. I find, at least on a 1 machine Nutch configuration, that the fetch part of the fetcher is much faster than the mapred.ReduceTask.run process, and the mapred.ReduceTask process is really bogging down.

Thank you for your time!
Brad
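
One caveat on the Rule sketch above: java.util.regex.Matcher is documented as not safe for use by multiple concurrent threads, so a single shared Matcher is only safe if match is never called from more than one thread at a time. If that cannot be guaranteed, a per-thread Matcher is one possible way to keep the reset approach. The following is a minimal stand-alone sketch of that idea, not the Nutch source: the class name ReusableMatcherRule is hypothetical, it does not extend RegexRule, and ThreadLocal.withInitial assumes Java 8 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the Rule class above; not part of Nutch. */
final class ReusableMatcherRule {
    private final boolean sign;
    // One Matcher per thread, because a Matcher must not be shared
    // between threads that use it concurrently.
    private final ThreadLocal<Matcher> matcher;

    ReusableMatcherRule(boolean sign, String regex) {
        this.sign = sign;
        // A compiled Pattern is immutable and can be shared by all threads.
        final Pattern compiled = Pattern.compile(regex);
        this.matcher = ThreadLocal.withInitial(() -> compiled.matcher(""));
    }

    /** Whether a matching URL should be accepted (true) or rejected (false). */
    boolean accept() {
        return sign;
    }

    /** Reuses this thread's Matcher via reset instead of allocating a new one. */
    boolean match(String url) {
        return matcher.get().reset(url).find();
    }
}
```

Whether this is measurably faster than pattern.matcher(url) would need to be profiled on the actual workload; the sketch only shows how reset can be combined with thread safety.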

