Hi,

Try calling jstack on the PID of the task to get a better idea of what it is doing. My bet is that the normalisation of some long URLs is taking ages, but it could be a lot of other things.
J.

On 2 August 2010 17:26, brad <[email protected]> wrote:

> Hi Julien,
> I'll see if I can give it a try later this week.
>
> I'm having a problem in the mapred.LocalJobRunner - reduce > reduce portion
> right after the actual URL fetch/parse portion is complete. I don't know
> how long it is supposed to take for this portion to complete, but I have had
> fetches run for 12 hours and the map-reduce portion run for 36 hours and still
> not be complete. I ended up killing the job.
>
> Right now, I'm running a fetch on 1 million URLs. The parse and fetch
> portion took less than 7 hours, but the map-reduce has been running for 11
> hours now and I'm going to wait and see if it completes.
>
> It started at the completion of fetcher.Fetcher:
> 2010-08-01 22:06:43,479 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
> 2010-08-01 22:06:44,368 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> 2010-08-01 22:06:44,369 INFO fetcher.Fetcher - -activeThreads=0
> 2010-08-01 22:06:44,369 INFO mapred.MapTask - Starting flush of map output
> 2010-08-01 22:06:45,129 INFO mapred.LocalJobRunner - 0 threads, 853809 pages, 18772 errors, 35.4 pages/s, 16989 kb/s,
>
> The issue appears to start with
> 2010-08-01 23:22:22,174 INFO mapred.Merger - Down to the last merge-pass, with 1 segments left of total size: 31012166567 bytes
>
> Now the process has been cycling on for 10 hours:
> INFO mapred.LocalJobRunner - reduce > reduce
>
> I'm running Nutch on a single server.
>
> Thanks
> Brad
>
>
> -----Original Message-----
> From: Julien Nioche [mailto:[email protected]]
> Sent: Monday, August 02, 2010 5:11 AM
> To: [email protected]
> Subject: Re: For HTML - is parse-html twice as fast as parse-tika
>
> Hi Brad,
>
> Could you run and measure the parser independently of the fetching? That
> would remove any possible side effect due to caching, network issues etc...
>
> All you need to do is remove the subdirectories parse_text, parse_data and
> crawl_parse then run : nutch parse
>
> Thanks
>
> Julien
>
> PS: regarding parse-html being phased out : see Andrzej's JIRA from this
> morning
>
>
> On 31 July 2010 22:43, brad <[email protected]> wrote:
>
> > I have been experiencing some performance issues with Tika and
> > general parsing (see Parsing Performance - related to Java
> > concurrency issue)
> >
> > Ken pointed out that both the Tika and Nutch HtmlParser show up in my
> > jstack list using the delivered configuration.
> >
> > Julien suggested checking parsing with only parse-tika (html) and
> > then with parse-html.
> >
> > So here is what I did.
> >
> > Option 1) parse-tika
> > parse-(rss|text|js|tika)
> > parse-plugin.xml as delivered
> > tika-mimetypes.xml as delivered
> >
> > Option 2) parse-html
> > parse-(rss|text|html|js|tika)
> > parse-plugin.xml turned ON <plugin id="parse-html" />
> > tika-mimetypes.xml commented out <mime-type type="text/html">
> >
> > Using the same generated crawl, ran fetch with parse for each of the
> > options for 2 hours.
> > All other configurations and settings are identical
> >
> > Results:
> > Parse-tika
> > INFO mapred.LocalJobRunner - 200 threads, 200370 pages, 6756 errors, 27.8 pages/s, 12916 kb/s
> >
> > Parse-html
> > INFO mapred.LocalJobRunner - 200 threads, 433738 pages, 13360 errors, 60.1 pages/s, 27980 kb/s,
> >
> > The results:
> > Parse-html is 116% faster than parse-tika for html for the same
> > period of time and same URLs
> >
> > The error rate was about the same parse-html 3%, parse-tika 3.3%
> > Most of the errors are read timeouts
> >
> > So is parse-html better? It appears to be faster. But, is the data
> > as good?
> > Other considerations? Is parse-html really going to be phased out?
> >
> > Brad
>
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
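
The jstack suggestion at the top of this thread can be turned into a quick triage step: take a thread dump and tally the innermost frame of each thread, so that a hot spot (for example URL normalisation) stands out when the reduce phase appears stuck. Below is a minimal sketch in Python, assuming jstack is on the PATH, is run as the same user as the JVM, and prints the usual HotSpot layout (a line starting with the quoted thread name, followed by indented "at ..." frames). The helper names top_frames and dump_threads are hypothetical and are not part of Nutch or the JDK.

```python
import subprocess
from collections import Counter


def top_frames(dump: str) -> Counter:
    """Count the innermost stack frame of each thread in a jstack dump.

    Assumes each thread starts with a line beginning with a double quote
    (the thread name) and that frames are lines beginning with "at ".
    """
    counts = Counter()
    awaiting_frame = False
    for raw in dump.splitlines():
        line = raw.strip()
        if line.startswith('"'):
            # New thread header: the next "at ..." line is its innermost frame.
            awaiting_frame = True
        elif awaiting_frame and line.startswith("at "):
            counts[line[3:]] += 1
            awaiting_frame = False
        elif not line:
            # Blank line ends the thread; some JVM threads have no frames.
            awaiting_frame = False
    return counts


def dump_threads(pid: int) -> str:
    """Run `jstack <pid>` and return its standard output as text."""
    result = subprocess.run(
        ["jstack", str(pid)], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling top_frames(dump_threads(pid)).most_common(10) a few times, some seconds apart, with pid set to the process ID of the Nutch JVM, gives a rough picture of where the threads spend their time. If the same frame dominates in every sample, that is the place to look first.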

