On 2010-09-27 17:24, Steve Cohen wrote:
Hello,
I've been given the task of figuring out why Nutch is running slower on
Solaris than on Linux with the same configuration. I am looking at the log
file and I see a big gap between the time the fetcher stops fetching and the
time it says it is done, and I would love to know what is going on. Here is
the log snippet.
2010-09-24 11:04:28,413 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=0
2010-09-24 11:04:29,200 INFO fetcher.Fetcher - -activeThreads=0,
spinWaiting=0, fetchQueues.totalSize=0
2010-09-24 11:04:29,200 INFO fetcher.Fetcher - -activeThreads=0
2010-09-24 11:05:32,782 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2010-09-24 11:05:33,469 INFO plugin.PluginRepository - Plugins: looking in:
/opt/nutch/build/plugins
2010-09-24 11:05:34,052 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2010-09-24 11:05:34,053 INFO plugin.PluginRepository - Registered Plugins:
2010-09-24 11:05:34,053 INFO plugin.PluginRepository - Jakarta POI
- Java API To Access Microsoft Format Files (lib-jakarta-poi)
2010-09-24 11:05:34,053 INFO plugin.PluginRepository - More
Indexing Filter (index-more)
2010-09-24 11:05:34,053 INFO plugin.PluginRepository - HTTP
Framework (lib-http)
2010-09-24 11:05:34,053 INFO plugin.PluginRepository - MSWord Parse
Plug-in (parse-msword)
2010-09-24 11:05:34,053 INFO plugin.PluginRepository - More Query
Filter (query-more)
2010-09-24 11:05:34,053 INFO plugin.PluginRepository - Regex URL
Filter (urlfilter-regex)
2010-09-24 11:05:34,053 INFO plugin.PluginRepository - XML
Libraries (lib-xml)
2010-09-24 11:05:34,054 INFO plugin.PluginRepository - Http
Protocol Plug-in (protocol-http)
2010-09-24 11:05:34,054 INFO plugin.PluginRepository - MSExcel
Parse Plug-in (parse-msexcel)
2010-09-24 11:05:34,054 INFO plugin.PluginRepository - XML Response
Writer Plug-in (response-xml)
2010-09-24 11:05:34,054 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2010-09-24 11:05:34,054 INFO plugin.PluginRepository - Zip Parse
Plug-in (parse-zip)
2010-09-24 11:05:34,054 INFO plugin.PluginRepository - Anchor
Indexing Filter (index-anchor)
2010-09-24 11:05:34,054 INFO plugin.PluginRepository - URL Query
Filter (query-url)
2010-09-24 11:05:34,055 INFO plugin.PluginRepository - Parse MS
Documents Framework (lib-parsems)
2010-09-24 11:05:34,055 INFO plugin.PluginRepository - Regex URL
Filter Framework (lib-regex-filter)
2010-09-24 11:05:34,055 INFO plugin.PluginRepository - JSON
Response Writer Plug-in (response-json)
2010-09-24 11:05:34,055 INFO plugin.PluginRepository - the nutch
core extension points (nutch-extensionpoints)
2010-09-24 11:05:34,055 INFO plugin.PluginRepository - MSPowerPoint
Parse Plug-in (parse-mspowerpoint)
2010-09-24 11:05:34,055 INFO plugin.PluginRepository - Basic Query
Filter (query-basic)
2010-09-24 11:05:34,055 INFO plugin.PluginRepository - RSS Parse
Plug-in (parse-rss)
2010-09-24 11:05:34,056 INFO plugin.PluginRepository - Html Parse
Plug-in (parse-html)
2010-09-24 11:05:34,056 INFO plugin.PluginRepository - Basic
Indexing Filter (index-basic)
2010-09-24 11:05:34,056 INFO plugin.PluginRepository - Site Query
Filter (query-site)
2010-09-24 11:05:34,056 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2010-09-24 11:05:34,056 INFO plugin.PluginRepository - Text Parse
Plug-in (parse-text)
2010-09-24 11:05:34,056 INFO plugin.PluginRepository - CyberNeko
HTML Parser (lib-nekohtml)
2010-09-24 11:05:34,056 INFO plugin.PluginRepository - File
Protocol Plug-in (protocol-file)
2010-09-24 11:05:34,056 INFO plugin.PluginRepository - Registered
Extension-Points:
2010-09-24 11:05:34,057 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2010-09-24 11:05:34,057 INFO plugin.PluginRepository - Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2010-09-24 11:05:34,057 INFO plugin.PluginRepository - Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2010-09-24 11:05:34,057 INFO plugin.PluginRepository - Nutch Field
Filter (org.apache.nutch.indexer.field.FieldFilter)
2010-09-24 11:05:34,057 INFO plugin.PluginRepository - HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2010-09-24 11:05:34,057 INFO plugin.PluginRepository - Nutch Query
Filter (org.apache.nutch.searcher.QueryFilter)
2010-09-24 11:05:34,057 INFO plugin.PluginRepository - Nutch Search
Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Nutch Online
Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Nutch
Content Parser (org.apache.nutch.parse.Parser)
2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2010-09-24 11:05:34,058 INFO plugin.PluginRepository - Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2010-09-24 11:47:04,995 INFO fetcher.Fetcher - Fetcher: done
2010-09-24 11:47:10,151 INFO crawl.CrawlDb - CrawlDb update: starting
So at 11:04, the fetcher winds down and has no more threads to run. At 11:05
it logs a warning about not having the native Hadoop libraries (I am going to
build them today) and loads plugins. Then Fetcher reports that it is done
about 42 minutes later, at 11:47, and CrawlDb starts. What did Fetcher do for
those 42 minutes?
It was diligently running the "reduce" phase, which consists of sorting
and the reduce() proper. If you run Fetcher in the parsing mode then
another possibility is that some of the parsers run slower on Solaris.
Yet another possibility, which you mentioned, is that Hadoop can use the
native compression libs on Linux, but there are no such libs
pre-compiled for Solaris.
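To pin down exactly where the silent stretch sits in a log like the one
quoted above, a small script can list the longest pauses between consecutive
timestamped entries. This is only a rough sketch: it assumes every entry
starts with a "YYYY-MM-DD HH:MM:SS,mmm" timestamp as in the snippet, and the
function names are made up for illustration. Wrapped continuation lines carry
no timestamp and are simply skipped.

```python
import re
from datetime import datetime

# Timestamp at the start of each entry, e.g. "2010-09-24 11:04:29,200".
STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def parse_stamps(lines):
    """Yield (timestamp, line) for every line that starts with a timestamp."""
    for line in lines:
        m = STAMP.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            # The three digits after the comma are milliseconds.
            yield ts.replace(microsecond=int(m.group(2)) * 1000), line.rstrip()


def largest_gaps(lines, top=3):
    """Return the `top` longest pauses as (seconds, line_before, line_after)."""
    entries = list(parse_stamps(lines))
    gaps = [
        ((b[0] - a[0]).total_seconds(), a[1], b[1])
        for a, b in zip(entries, entries[1:])
    ]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python log_gaps.py hadoop.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        for seconds, before, after in largest_gaps(f):
            print(f"{seconds / 60:7.1f} min")
            print(f"    after : {before}")
            print(f"    before: {after}")
```

Running this on the logs from both machines should show whether the same
interval (after the plugin listing, before "Fetcher: done") dominates on
Linux too, or only on Solaris.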
Also, while reduce() speed is mostly determined by the Reducer
implementation (and very little by IO), the sorting speed is very much
dependent on disk IO and the size of the dataset that was partitioned to
a given reduce task. All other config factors being equal, I suspect
that your Solaris box could have a slower disk.
You can verify these hypotheses with top/iostat/vmstat and see whether
the tasks are bound by CPU or by disk wait.
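As a complement to those tools, a crude sequential write/read timing run on
both machines can give a first hint about whether one disk is noticeably
slower. The following is only a minimal sketch, not a proper benchmark: the
function name and default sizes are my own choices, and the read figure may
well be inflated by the OS page cache, since the file has just been written.

```python
import os
import tempfile
import time


def disk_throughput(directory, size_mb=256, block_kb=1024):
    """Write then read back a scratch file; return (write_MBps, read_MBps)."""
    block = os.urandom(block_kb * 1024)
    blocks = size_mb * 1024 // block_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # push data to the device before stopping the clock
        write_s = time.perf_counter() - start

        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(block_kb * 1024):
                pass  # discard the data; only the elapsed time matters
        read_s = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the scratch file
    return size_mb / write_s, size_mb / read_s


if __name__ == "__main__":
    import sys

    # Pass the directory to test; defaults to the system temp directory.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    w, r = disk_throughput(target)
    print(f"{target}: write {w:.1f} MB/s, read {r:.1f} MB/s")
```

For the comparison to mean anything, point it at the directory Hadoop uses
for its intermediate (sort/spill) data on each box, which is usually the one
configured as mapred.local.dir, and use the same sizes on both machines.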
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com