On 2010-09-27 17:24, Steve Cohen wrote:
Hello,

I've been given the task of figuring out why Nutch is running slower on
Solaris than on Linux with the same configuration. Looking at the log
file, I see a big gap between the time the fetcher stops fetching and the
time it says it is done, and I would love to know what is going on. Here is
the log snippet.

2010-09-24 11:04:28,413 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=0
2010-09-24 11:04:29,200 INFO  fetcher.Fetcher - -activeThreads=0,
spinWaiting=0, fetchQueues.totalSize=0
2010-09-24 11:04:29,200 INFO  fetcher.Fetcher - -activeThreads=0
2010-09-24 11:05:32,782 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2010-09-24 11:05:33,469 INFO  plugin.PluginRepository - Plugins: looking in:
/opt/nutch/build/plugins
2010-09-24 11:05:34,052 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository - Registered Plugins:
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         Jakarta POI
- Java API To Access Microsoft Format Files (lib-jakarta-poi)
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         More
Indexing Filter (index-more)
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         HTTP
Framework (lib-http)
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         MSWord Parse
Plug-in (parse-msword)
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         More Query
Filter (query-more)
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         Regex URL
Filter (urlfilter-regex)
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         XML
Libraries (lib-xml)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         Http
Protocol Plug-in (protocol-http)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         MSExcel
Parse Plug-in (parse-msexcel)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         XML Response
Writer Plug-in (response-xml)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         OPIC Scoring
Plug-in (scoring-opic)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         Zip Parse
Plug-in (parse-zip)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         Anchor
Indexing Filter (index-anchor)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         URL Query
Filter (query-url)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         Parse MS
Documents Framework (lib-parsems)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         Regex URL
Filter Framework (lib-regex-filter)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         JSON
Response Writer Plug-in (response-json)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         the nutch
core extension points (nutch-extensionpoints)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         MSPowerPoint
Parse Plug-in (parse-mspowerpoint)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         Basic Query
Filter (query-basic)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         RSS Parse
Plug-in (parse-rss)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         Html Parse
Plug-in (parse-html)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         Basic
Indexing Filter (index-basic)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         Site Query
Filter (query-site)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         Basic
Summarizer Plug-in (summary-basic)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         Text Parse
Plug-in (parse-text)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         CyberNeko
HTML Parser (lib-nekohtml)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         File
Protocol Plug-in (protocol-file)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository - Registered
Extension-Points:
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch Field
Filter (org.apache.nutch.indexer.field.FieldFilter)
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch Query
Filter (org.apache.nutch.searcher.QueryFilter)
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch Search
Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch Online
Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch
Content Parser (org.apache.nutch.parse.Parser)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2010-09-24 11:47:04,995 INFO  fetcher.Fetcher - Fetcher: done
2010-09-24 11:47:10,151 INFO  crawl.CrawlDb - CrawlDb update: starting

So at 11:04, the fetcher winds down and has no more threads to run. At 11:05 it
gives a warning about not having native Hadoop libraries (I am going to build
them today) and loads plugins. Then Fetcher reports that it is done about
42 minutes later (11:47), and CrawlDb starts. What did Fetcher do for those
42 minutes?

It was diligently running the "reduce" phase, which consists of sorting and the reduce() proper. If you run Fetcher in parsing mode, another possibility is that some of the parsers run slower on Solaris. Yet another possibility, which you mentioned, is that Hadoop can use the native compression libs on Linux, but there are no such libs pre-compiled for Solaris.

Also, while reduce() speed is mostly determined by the Reducer implementation (and very little by IO), the sorting speed is very much dependent on disk IO and the size of the dataset that was partitioned to a given reduce task. All other config factors being equal, I suspect that your Solaris box could have a slower disk.
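
To compare the disks on the two boxes directly, a rough sequential-write test could look like the sketch below. This is only an illustration, not anything Nutch or Hadoop ships: the function name and sizes are made up, and it measures plain sequential writes, which is only a crude stand-in for the mixed read/write pattern of a sort. Random data is used so that a compressing filesystem does not flatter the result. Point it at whatever directory your job uses for its temporary sort data.

```python
import os
import tempfile
import time


def sequential_write_mb_per_s(directory, total_mb=64, chunk_mb=1):
    """Write roughly total_mb of data into `directory`, fsync it, return MB/s.

    Hypothetical helper for illustration; the sizes are arbitrary defaults.
    """
    # One random chunk, reused for every write, so generating the data
    # does not dominate the measurement.
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    n_chunks = max(1, total_mb // chunk_mb)

    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(n_chunks):
                f.write(chunk)
            f.flush()
            # Force the data to the device so we time the disk, not the cache.
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)

    return (n_chunks * chunk_mb) / elapsed


if __name__ == "__main__":
    rate = sequential_write_mb_per_s(tempfile.gettempdir())
    print("sequential write: %.1f MB/s" % rate)
```

Run the same script with the same sizes on both machines; a large difference in the reported rate would support the slow-disk hypothesis, while similar rates would point back to CPU-bound causes such as the parsers or the missing native libs.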

You can verify these hypotheses with top/iostat/vmstat and see whether the tasks are bound by CPU or by disk I/O wait.
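
To know which time window to watch with those tools, it can help to pull the largest silent gaps out of the log first. Here is a minimal sketch, assuming the log lines start with a timestamp in the same "2010-09-24 11:04:29,200" format as in your snippet; the function names are made up for illustration and are not part of Nutch or Hadoop. Wrapped continuation lines without a timestamp are simply skipped.

```python
import re
from datetime import datetime

# Leading timestamp as it appears in the snippet: "2010-09-24 11:04:29,200"
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def parse_timestamp(line):
    """Return the datetime at the start of `line`, or None if there is none."""
    m = TS_RE.match(line)
    if m is None:
        return None
    base = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return base.replace(microsecond=int(m.group(2)) * 1000)


def largest_gaps(lines, top=3):
    """Return up to `top` tuples (seconds, line_before, line_after),
    sorted by the gap between consecutive timestamped lines, largest first."""
    stamped = []
    for line in lines:
        ts = parse_timestamp(line)
        if ts is not None:
            stamped.append((ts, line.rstrip("\n")))

    gaps = []
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gaps.append(((t1 - t0).total_seconds(), l0, l1))

    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top]


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as f:
        for seconds, before, after in largest_gaps(f):
            print("%.1f s between:\n  %s\n  %s" % (seconds, before, after))
```

Running this against the log from each machine shows whether the extra time on Solaris sits in the same place (between the last plugin line and "Fetcher: done") or somewhere else.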

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
