Hello,

I've been given the task of figuring out why Nutch runs slower on Solaris
than on Linux with the same configuration. Looking at the log file, I see a
big gap between the time the fetcher stops fetching and the time it says it
is done, and I would love to know what is going on in between. Here is the
log snippet.

2010-09-24 11:04:28,413 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=0
2010-09-24 11:04:29,200 INFO  fetcher.Fetcher - -activeThreads=0,
spinWaiting=0, fetchQueues.totalSize=0
2010-09-24 11:04:29,200 INFO  fetcher.Fetcher - -activeThreads=0
2010-09-24 11:05:32,782 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2010-09-24 11:05:33,469 INFO  plugin.PluginRepository - Plugins: looking in:
/opt/nutch/build/plugins
2010-09-24 11:05:34,052 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository - Registered Plugins:
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         Jakarta POI
- Java API To Access Microsoft Format Files (lib-jakarta-poi)
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         More
Indexing Filter (index-more)
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         HTTP
Framework (lib-http)
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         MSWord Parse
Plug-in (parse-msword)
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         More Query
Filter (query-more)
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         Regex URL
Filter (urlfilter-regex)
2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         XML
Libraries (lib-xml)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         Http
Protocol Plug-in (protocol-http)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         MSExcel
Parse Plug-in (parse-msexcel)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         XML Response
Writer Plug-in (response-xml)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         OPIC Scoring
Plug-in (scoring-opic)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         Zip Parse
Plug-in (parse-zip)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         Anchor
Indexing Filter (index-anchor)
2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         URL Query
Filter (query-url)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         Parse MS
Documents Framework (lib-parsems)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         Regex URL
Filter Framework (lib-regex-filter)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         JSON
Response Writer Plug-in (response-json)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         the nutch
core extension points (nutch-extensionpoints)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         MSPowerPoint
Parse Plug-in (parse-mspowerpoint)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         Basic Query
Filter (query-basic)
2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         RSS Parse
Plug-in (parse-rss)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         Html Parse
Plug-in (parse-html)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         Basic
Indexing Filter (index-basic)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         Site Query
Filter (query-site)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         Basic
Summarizer Plug-in (summary-basic)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         Text Parse
Plug-in (parse-text)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         CyberNeko
HTML Parser (lib-nekohtml)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         File
Protocol Plug-in (protocol-file)
2010-09-24 11:05:34,056 INFO  plugin.PluginRepository - Registered
Extension-Points:
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch Field
Filter (org.apache.nutch.indexer.field.FieldFilter)
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch Query
Filter (org.apache.nutch.searcher.QueryFilter)
2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch Search
Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch Online
Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch
Content Parser (org.apache.nutch.parse.Parser)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2010-09-24 11:47:04,995 INFO  fetcher.Fetcher - Fetcher: done
2010-09-24 11:47:10,151 INFO  crawl.CrawlDb - CrawlDb update: starting

So at 11:04 the fetcher winds down and has no more threads to run. At 11:05 it
logs a warning about not having the native Hadoop libraries (I am going to
build them today) and loads plugins. Then Fetcher reports that it is done at
11:47, about 41 minutes later, and CrawlDb starts. What did Fetcher do for
those 41 minutes?
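
To measure gaps like this without reading timestamps by eye, here is a minimal
sketch in Python. It is not part of Nutch; the names parse_ts and largest_gaps
are made up for illustration, and it assumes every real log entry starts with
a "YYYY-MM-DD HH:MM:SS,mmm" timestamp. Wrapped continuation lines, like the
ones in the snippet above, are skipped.

```python
from datetime import datetime

def parse_ts(line):
    """Return the leading timestamp of a log line, or None if there is none
    (for example, a wrapped continuation line)."""
    try:
        return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
    except ValueError:
        return None

def largest_gaps(lines, top=3):
    """Return the `top` largest gaps between consecutive timestamped lines
    as (gap, line_before, line_after) tuples, largest first."""
    stamped = []
    for line in lines:
        ts = parse_ts(line)
        if ts is not None:
            stamped.append((ts, line))
    gaps = [
        (later[0] - earlier[0], earlier[1], later[1])
        for earlier, later in zip(stamped, stamped[1:])
    ]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]

# A few lines taken from the log snippet above.
sample = [
    "2010-09-24 11:04:29,200 INFO  fetcher.Fetcher - -activeThreads=0",
    "2010-09-24 11:05:32,782 WARN  util.NativeCodeLoader - Unable to load",
    "native-hadoop library for your platform... using builtin-java classes where",
    "2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Ontology",
    "2010-09-24 11:47:04,995 INFO  fetcher.Fetcher - Fetcher: done",
]

for gap, before, after in largest_gaps(sample):
    print(gap, "|", before[:23], "->", after[:23])
```

On a real log file, passing open(path) in place of the sample list should work
the same way, since only the first 23 characters of each line are parsed.
Running it on the Solaris log and the Linux log side by side would show
whether the same step is slow on both.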

Thanks,
Steve
