Re: What is nutch doing?

Steve Cohen Tue, 28 Sep 2010 16:13:54 -0700

On Mon, Sep 27, 2010 at 1:26 PM, Andrzej Bialecki <[email protected]> wrote:


> On 2010-09-27 17:24, Steve Cohen wrote:
>
>> Hello,
>>
>> I've been given the task of figuring out why nutch is running slower on
>> Solaris then on Linux with the same configuration. I am looking at the log
>> file and I see this big gap between the time fetcher stops fetching and it
>> says it is done and I would love to know what is going on. Here is the log
>> snippet.
>>
>> 2010-09-24 11:04:28,413 INFO  fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=0
>> 2010-09-24 11:04:29,200 INFO  fetcher.Fetcher - -activeThreads=0,
>> spinWaiting=0, fetchQueues.totalSize=0
>> 2010-09-24 11:04:29,200 INFO  fetcher.Fetcher - -activeThreads=0
>> 2010-09-24 11:05:32,782 WARN  util.NativeCodeLoader - Unable to load
>> native-hadoop library for your platform... using builtin-java classes
>> where
>> applicable
>> 2010-09-24 11:05:33,469 INFO  plugin.PluginRepository - Plugins: looking
>> in:
>> /opt/nutch/build/plugins
>> 2010-09-24 11:05:34,052 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2010-09-24 11:05:34,053 INFO  plugin.PluginRepository - Registered
>> Plugins:
>> 2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         Jakarta
>> POI
>> - Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         More
>> Indexing Filter (index-more)
>> 2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         HTTP
>> Framework (lib-http)
>> 2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         MSWord
>> Parse
>> Plug-in (parse-msword)
>> 2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         More Query
>> Filter (query-more)
>> 2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         Regex URL
>> Filter (urlfilter-regex)
>> 2010-09-24 11:05:34,053 INFO  plugin.PluginRepository -         XML
>> Libraries (lib-xml)
>> 2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         Http
>> Protocol Plug-in (protocol-http)
>> 2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         MSExcel
>> Parse Plug-in (parse-msexcel)
>> 2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         XML
>> Response
>> Writer Plug-in (response-xml)
>> 2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         OPIC
>> Scoring
>> Plug-in (scoring-opic)
>> 2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         Zip Parse
>> Plug-in (parse-zip)
>> 2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         Anchor
>> Indexing Filter (index-anchor)
>> 2010-09-24 11:05:34,054 INFO  plugin.PluginRepository -         URL Query
>> Filter (query-url)
>> 2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         Parse MS
>> Documents Framework (lib-parsems)
>> 2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         Regex URL
>> Filter Framework (lib-regex-filter)
>> 2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         JSON
>> Response Writer Plug-in (response-json)
>> 2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         the nutch
>> core extension points (nutch-extensionpoints)
>> 2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -
>> MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         Basic
>> Query
>> Filter (query-basic)
>> 2010-09-24 11:05:34,055 INFO  plugin.PluginRepository -         RSS Parse
>> Plug-in (parse-rss)
>> 2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         Html Parse
>> Plug-in (parse-html)
>> 2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         Basic
>> Indexing Filter (index-basic)
>> 2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         Site Query
>> Filter (query-site)
>> 2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         Basic
>> Summarizer Plug-in (summary-basic)
>> 2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         Text Parse
>> Plug-in (parse-text)
>> 2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         CyberNeko
>> HTML Parser (lib-nekohtml)
>> 2010-09-24 11:05:34,056 INFO  plugin.PluginRepository -         File
>> Protocol Plug-in (protocol-file)
>> 2010-09-24 11:05:34,056 INFO  plugin.PluginRepository - Registered
>> Extension-Points:
>> 2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch
>> Summarizer (org.apache.nutch.searcher.Summarizer)
>> 2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch
>> Protocol (org.apache.nutch.protocol.Protocol)
>> 2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch
>> Analysis (org.apache.nutch.analysis.NutchAnalyzer)
>> 2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch
>> Field
>> Filter (org.apache.nutch.indexer.field.FieldFilter)
>> 2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch
>> Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2010-09-24 11:05:34,057 INFO  plugin.PluginRepository -         Nutch
>> Search
>> Results Response Writer
>> (org.apache.nutch.searcher.response.ResponseWriter)
>> 2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch URL
>> Filter (org.apache.nutch.net.URLFilter)
>> 2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch
>> Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch
>> Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch
>> Content Parser (org.apache.nutch.parse.Parser)
>> 2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Nutch
>> Scoring (org.apache.nutch.scoring.ScoringFilter)
>> 2010-09-24 11:05:34,058 INFO  plugin.PluginRepository -         Ontology
>> Model Loader (org.apache.nutch.ontology.Ontology)
>> 2010-09-24 11:47:04,995 INFO  fetcher.Fetcher - Fetcher: done
>> 2010-09-24 11:47:10,151 INFO  crawl.CrawlDb - CrawlDb update: starting
>>
>> So at 11:04, fetcher winds down and has no more threads to run. At 11:05
>> it
>> gives an error about not having native hadoop libraries (I am going to
>> build
>> them today) and loads plugins. Then Fetcher gives a message that is done -
>> 32 minutes later and Crawldb starts. What did Fetcher do for 32 minutes?
>>
>
> It was diligently running the "reduce" phase, which consists of sorting and
> the reduce() proper.  If you run Fetcher in the parsing mode then another
> possibility is that some of the parsers run slower on Solaris. Yet another
> possibility, that you mentioned, is that HAdoop can use the native
> compression libs on Linux, but there are no such libs pre-compiled for
> Solaris.
>
> Also, while reduce() speed is mostly determined by the Reducer
> implementation (and very little by IO), the sorting speed is very much
> dependent on disk IO and the size of the dataset that was partitioned to a
> given reduce task. All other config factors being equal, I suspect that your
> Solaris box could have a slower disk.
>
> You can verify these hypotheses with top/iostat/vmstat and see whether the
> tasks are bound by CPU or by diskwait.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

I tried running nutch again after setting job tracker to localhost:<port>
from local and set mapred.tasktracker.reduce.tasks.maximum and
mapred.tasktracker.map.tasks.maximum, hoping it would use multiple threads
and cores to speed up the portion after the fetch.fetcher threads drops to 0
but before it tells me that fetcher is done.

It seems I need to do more configuration. It is only using one thread during
this down time.

 but I do know what it is doing. I ran the solaris command pstack on the pid
to see what is going on and you were right. It is using running map reduce.

-----------------  lwp# 2 / thread# 2  --------------------
 fc0e3324 *
*java/util/regex/Pattern$CharProperty.match(Ljava/util/regex/Matcher;ILjava/lang/CharSequence;)Z
[compiled]
 fc0f9b9c *
*java/util/regex/Pattern$Slice.match(Ljava/util/regex/Matcher;ILjava/lang/CharSequence;)Z
[compiled] +152 (line 6979)
 fc0f9b9c *
*java/util/regex/Pattern$Begin.match(Ljava/util/regex/Matcher;ILjava/lang/CharSequence;)Z+62
(line 6244)
 fc08b7f0 * *java/util/regex/Matcher.search(I)Z [compiled] +174 (line 2208)
 fc12c078 * *java/util/regex/Matcher.find()Z [compiled] +132 (line 1058)
 fc12c078 *
*org/apache/nutch/urlfilter/regex/RegexURLFilter$Rule.match(Ljava/lang/String;)Z+18
(line 180)
 fc12c078 *
*org/apache/nutch/urlfilter/api/RegexURLFilterBase.filter(Ljava/lang/String;)Ljava/lang/String;+38
(line 234)
 fc12c078 *
*org/apache/nutch/net/URLFilters.filter(Ljava/lang/String;)Ljava/lang/String;+50
(line 184)
 fc12c078 *
*org/apache/nutch/parse/ParseOutputFormat$1.write(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/parse/Parse;)V+992
(line 555)
 fc16a680 *
*org/apache/nutch/parse/ParseOutputFormat$1.write(Ljava/lang/Object;Ljava/lang/Object;)V
[compiled] +20 (line 226)
 fc16a680 *
*org/apache/nutch/fetcher/FetcherOutputFormat$1.write(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/crawl/NutchWritable;)V+120
(line 186)
 fc16a680 *
*org/apache/nutch/fetcher/FetcherOutputFormat$1.write(Ljava/lang/Object;Ljava/lang/Object;)V+20
(line 140)
 fc16a680 *
*org/apache/hadoop/mapred/ReduceTask$3.collect(Ljava/lang/Object;Ljava/lang/Object;)V+14
(line 821)
 fc16a680 *
*org/apache/hadoop/mapred/lib/IdentityReducer.reduce(Ljava/lang/Object;Ljava/util/Iterator;Lorg/apache/hadoop/mapred/OutputCollector;Lorg/apache/hadoop/mapred/Reporter;)V+36
(line 79)
 fc005fd0 *
org/apache/hadoop/mapred/ReduceTask.run(Lorg/apache/hadoop/mapred/JobConf;Lorg/apache/hadoop/mapred/TaskUmbilicalProtocol;)V+610
(line 751)
 fc005ab0 * org/apache/hadoop/mapred/Child.main([Ljava/lang/String;)V+440
(line 227)
 fc00021c * StubRoutines (1)
 fe5594fc
__1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_
(fc0001c0, 35400, 1, 0, 779d6e98, fe37ff08) + 208
 fe5fd1d4 jni_CallStaticVoidMethod (35510, 35c4c, 35848, 35400, 35840,
2c648) + 4b8
 00013ab0 JavaMain (368e8, 2b8f4, 2af28, 35510, 4, fee32d24) + 15f0
 ff2c8950 _lwp_start (0, 0, 0, 0, 0, 0)


I have a feeling I know why It is only using one core. I set
mapred.tasktracker.reduce.tasks.maximum to 4 but I see that there is a
setting for mapred.reduce.tasks which is set to 1. Do I need to up it to 4
as well?

Thanks,
Steve Cohen

Re: What is nutch doing?

Reply via email to