On Mon, Sep 27, 2010 at 1:26 PM, Andrzej Bialecki <[email protected]> wrote:
> On 2010-09-27 17:24, Steve Cohen wrote: > >> Hello, >> >> I've been given the task of figuring out why nutch is running slower on >> Solaris then on Linux with the same configuration. I am looking at the log >> file and I see this big gap between the time fetcher stops fetching and it >> says it is done and I would love to know what is going on. Here is the log >> snippet. >> >> 2010-09-24 11:04:28,413 INFO fetcher.Fetcher - -finishing thread >> FetcherThread, activeThreads=0 >> 2010-09-24 11:04:29,200 INFO fetcher.Fetcher - -activeThreads=0, >> spinWaiting=0, fetchQueues.totalSize=0 >> 2010-09-24 11:04:29,200 INFO fetcher.Fetcher - -activeThreads=0 >> 2010-09-24 11:05:32,782 WARN util.NativeCodeLoader - Unable to load >> native-hadoop library for your platform... using builtin-java classes >> where >> applicable >> 2010-09-24 11:05:33,469 INFO plugin.PluginRepository - Plugins: looking >> in: >> /opt/nutch/build/plugins >> 2010-09-24 11:05:34,052 INFO plugin.PluginRepository - Plugin >> Auto-activation mode: [true] >> 2010-09-24 11:05:34,053 INFO plugin.PluginRepository - Registered >> Plugins: >> 2010-09-24 11:05:34,053 INFO plugin.PluginRepository - Jakarta >> POI >> - Java API To Access Microsoft Format Files (lib-jakarta-poi) >> 2010-09-24 11:05:34,053 INFO plugin.PluginRepository - More >> Indexing Filter (index-more) >> 2010-09-24 11:05:34,053 INFO plugin.PluginRepository - HTTP >> Framework (lib-http) >> 2010-09-24 11:05:34,053 INFO plugin.PluginRepository - MSWord >> Parse >> Plug-in (parse-msword) >> 2010-09-24 11:05:34,053 INFO plugin.PluginRepository - More Query >> Filter (query-more) >> 2010-09-24 11:05:34,053 INFO plugin.PluginRepository - Regex URL >> Filter (urlfilter-regex) >> 2010-09-24 11:05:34,053 INFO plugin.PluginRepository - XML >> Libraries (lib-xml) >> 2010-09-24 11:05:34,054 INFO plugin.PluginRepository - Http >> Protocol Plug-in (protocol-http) >> 2010-09-24 11:05:34,054 INFO plugin.PluginRepository - MSExcel >> Parse Plug-in (parse-msexcel) >> 2010-09-24 11:05:34,054 INFO plugin.PluginRepository - XML >> Response >> Writer Plug-in (response-xml) >> 2010-09-24 11:05:34,054 INFO plugin.PluginRepository - OPIC >> Scoring >> Plug-in (scoring-opic) >> 2010-09-24 11:05:34,054 INFO plugin.PluginRepository - Zip Parse >> Plug-in (parse-zip) >> 2010-09-24 11:05:34,054 INFO plugin.PluginRepository - Anchor >> Indexing Filter (index-anchor) >> 2010-09-24 11:05:34,054 INFO plugin.PluginRepository - URL Query >> Filter (query-url) >> 2010-09-24 11:05:34,055 INFO plugin.PluginRepository - Parse MS >> Documents Framework (lib-parsems) >> 2010-09-24 11:05:34,055 INFO plugin.PluginRepository - Regex URL >> Filter Framework (lib-regex-filter) >> 2010-09-24 11:05:34,055 INFO plugin.PluginRepository - JSON >> Response Writer Plug-in (response-json) >> 2010-09-24 11:05:34,055 INFO plugin.PluginRepository - the nutch >> core extension points (nutch-extensionpoints) >> 2010-09-24 11:05:34,055 INFO plugin.PluginRepository - >> MSPowerPoint >> Parse Plug-in (parse-mspowerpoint) >> 2010-09-24 11:05:34,055 INFO plugin.PluginRepository - Basic >> Query >> Filter (query-basic) >> 2010-09-24 11:05:34,055 INFO plugin.PluginRepository - RSS Parse >> Plug-in (parse-rss) >> 2010-09-24 11:05:34,056 INFO plugin.PluginRepository - Html Parse >> Plug-in (parse-html) >> 2010-09-24 11:05:34,056 INFO plugin.PluginRepository - Basic >> Indexing Filter (index-basic) >> 2010-09-24 11:05:34,056 INFO plugin.PluginRepository - Site Query >> Filter (query-site) >> 2010-09-24 11:05:34,056 INFO plugin.PluginRepository - Basic >> Summarizer Plug-in (summary-basic) >> 2010-09-24 11:05:34,056 INFO plugin.PluginRepository - Text Parse >> Plug-in (parse-text) >> 2010-09-24 11:05:34,056 INFO plugin.PluginRepository - CyberNeko >> HTML Parser (lib-nekohtml) >> 2010-09-24 11:05:34,056 INFO plugin.PluginRepository - File >> Protocol Plug-in (protocol-file) >> 2010-09-24 11:05:34,056 INFO plugin.PluginRepository - Registered >> Extension-Points: >> 2010-09-24 11:05:34,057 INFO plugin.PluginRepository - Nutch >> Summarizer (org.apache.nutch.searcher.Summarizer) >> 2010-09-24 11:05:34,057 INFO plugin.PluginRepository - Nutch >> Protocol (org.apache.nutch.protocol.Protocol) >> 2010-09-24 11:05:34,057 INFO plugin.PluginRepository - Nutch >> Analysis (org.apache.nutch.analysis.NutchAnalyzer) >> 2010-09-24 11:05:34,057 INFO plugin.PluginRepository - Nutch >> Field >> Filter (org.apache.nutch.indexer.field.FieldFilter) >> 2010-09-24 11:05:34,057 INFO plugin.PluginRepository - HTML Parse >> Filter (org.apache.nutch.parse.HtmlParseFilter) >> 2010-09-24 11:05:34,057 INFO plugin.PluginRepository - Nutch >> Query >> Filter (org.apache.nutch.searcher.QueryFilter) >> 2010-09-24 11:05:34,057 INFO plugin.PluginRepository - Nutch >> Search >> Results Response Writer >> (org.apache.nutch.searcher.response.ResponseWriter) >> 2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Nutch URL >> Normalizer (org.apache.nutch.net.URLNormalizer) >> 2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Nutch URL >> Filter (org.apache.nutch.net.URLFilter) >> 2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Nutch >> Online >> Search Results Clustering Plugin >> (org.apache.nutch.clustering.OnlineClusterer) >> 2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Nutch >> Indexing Filter (org.apache.nutch.indexer.IndexingFilter) >> 2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Nutch >> Content Parser (org.apache.nutch.parse.Parser) >> 2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Nutch >> Scoring (org.apache.nutch.scoring.ScoringFilter) >> 2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Ontology >> Model Loader (org.apache.nutch.ontology.Ontology) >> 2010-09-24 11:47:04,995 INFO fetcher.Fetcher - Fetcher: done >> 2010-09-24 11:47:10,151 INFO crawl.CrawlDb - CrawlDb update: starting >> >> So at 11:04, fetcher winds down and has no more threads to run. At 11:05 >> it >> gives an error about not having native hadoop libraries (I am going to >> build >> them today) and loads plugins. Then Fetcher gives a message that is done - >> 32 minutes later and Crawldb starts. What did Fetcher do for 32 minutes? >> > > It was diligently running the "reduce" phase, which consists of sorting and > the reduce() proper. If you run Fetcher in the parsing mode then another > possibility is that some of the parsers run slower on Solaris. Yet another > possibility, that you mentioned, is that HAdoop can use the native > compression libs on Linux, but there are no such libs pre-compiled for > Solaris. > > Also, while reduce() speed is mostly determined by the Reducer > implementation (and very little by IO), the sorting speed is very much > dependent on disk IO and the size of the dataset that was partitioned to a > given reduce task. All other config factors being equal, I suspect that your > Solaris box could have a slower disk. > > You can verify these hypotheses with top/iostat/vmstat and see whether the > tasks are bound by CPU or by diskwait. > > -- > Best regards, > Andrzej Bialecki <>< > ___. ___ ___ ___ _ _ __________________________________ > [__ || __|__/|__||\/| Information Retrieval, Semantic Web > ___|||__|| \| || | Embedded Unix, System Integration > http://www.sigram.com Contact: info at sigram dot com > > I tried running nutch again after setting job tracker to localhost:<port> from local and set mapred.tasktracker.reduce.tasks.maximum and mapred.tasktracker.map.tasks.maximum, hoping it would use multiple threads and cores to speed up the portion after the fetch.fetcher threads drops to 0 but before it tells me that fetcher is done. It seems I need to do more configuration. It is only using one thread during this down time. but I do know what it is doing. I ran the solaris command pstack on the pid to see what is going on and you were right. It is using running map reduce. ----------------- lwp# 2 / thread# 2 -------------------- fc0e3324 * *java/util/regex/Pattern$CharProperty.match(Ljava/util/regex/Matcher;ILjava/lang/CharSequence;)Z [compiled] fc0f9b9c * *java/util/regex/Pattern$Slice.match(Ljava/util/regex/Matcher;ILjava/lang/CharSequence;)Z [compiled] +152 (line 6979) fc0f9b9c * *java/util/regex/Pattern$Begin.match(Ljava/util/regex/Matcher;ILjava/lang/CharSequence;)Z+62 (line 6244) fc08b7f0 * *java/util/regex/Matcher.search(I)Z [compiled] +174 (line 2208) fc12c078 * *java/util/regex/Matcher.find()Z [compiled] +132 (line 1058) fc12c078 * *org/apache/nutch/urlfilter/regex/RegexURLFilter$Rule.match(Ljava/lang/String;)Z+18 (line 180) fc12c078 * *org/apache/nutch/urlfilter/api/RegexURLFilterBase.filter(Ljava/lang/String;)Ljava/lang/String;+38 (line 234) fc12c078 * *org/apache/nutch/net/URLFilters.filter(Ljava/lang/String;)Ljava/lang/String;+50 (line 184) fc12c078 * *org/apache/nutch/parse/ParseOutputFormat$1.write(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/parse/Parse;)V+992 (line 555) fc16a680 * *org/apache/nutch/parse/ParseOutputFormat$1.write(Ljava/lang/Object;Ljava/lang/Object;)V [compiled] +20 (line 226) fc16a680 * *org/apache/nutch/fetcher/FetcherOutputFormat$1.write(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/crawl/NutchWritable;)V+120 (line 186) fc16a680 * *org/apache/nutch/fetcher/FetcherOutputFormat$1.write(Ljava/lang/Object;Ljava/lang/Object;)V+20 (line 140) fc16a680 * *org/apache/hadoop/mapred/ReduceTask$3.collect(Ljava/lang/Object;Ljava/lang/Object;)V+14 (line 821) fc16a680 * *org/apache/hadoop/mapred/lib/IdentityReducer.reduce(Ljava/lang/Object;Ljava/util/Iterator;Lorg/apache/hadoop/mapred/OutputCollector;Lorg/apache/hadoop/mapred/Reporter;)V+36 (line 79) fc005fd0 * org/apache/hadoop/mapred/ReduceTask.run(Lorg/apache/hadoop/mapred/JobConf;Lorg/apache/hadoop/mapred/TaskUmbilicalProtocol;)V+610 (line 751) fc005ab0 * org/apache/hadoop/mapred/Child.main([Ljava/lang/String;)V+440 (line 227) fc00021c * StubRoutines (1) fe5594fc __1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_ (fc0001c0, 35400, 1, 0, 779d6e98, fe37ff08) + 208 fe5fd1d4 jni_CallStaticVoidMethod (35510, 35c4c, 35848, 35400, 35840, 2c648) + 4b8 00013ab0 JavaMain (368e8, 2b8f4, 2af28, 35510, 4, fee32d24) + 15f0 ff2c8950 _lwp_start (0, 0, 0, 0, 0, 0) I have a feeling I know why It is only using one core. I set mapred.tasktracker.reduce.tasks.maximum to 4 but I see that there is a setting for mapred.reduce.tasks which is set to 1. Do I need to up it to 4 as well? Thanks, Steve Cohen

