Thanks. I have tried that multiple times, and most of the time it is stuck
at:
"Thread-11" prio=10 tid=0x00002aabd8023000 nid=0x62ef runnable [0x00000000420d8000..0x00000000420d8c10]
   java.lang.Thread.State: RUNNABLE
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3787)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3761)
    at java.util.regex.Pattern$Start.match(Pattern.java:3072)
    at java.util.regex.Matcher.search(Matcher.java:1116)
    at java.util.regex.Matcher.find(Matcher.java:552)
    at org.apache.nutch.urlfilter.regex.RegexURLFilter$Rule.match(RegexURLFilter.java:90)
    at org.apache.nutch.urlfilter.api.RegexURLFilterBase.filter(RegexURLFilterBase.java:117)
    - locked <0x00002aaaf32f93d8> (a org.apache.nutch.urlfilter.regex.RegexURLFilter)
    at org.apache.nutch.net.URLFilters.filter(URLFilters.java:88)
    at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:220)
    at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:115)
    at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:96)
    at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:70)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:440)
    at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:42)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
I'm not sure how to improve this. I have about 10 entries in the
regex-urlfilter.txt file, but they are mainly there to exclude sites. For
example:
-^http://([a-z0-9\-A-Z]*\.)*twitter.com
-^http://([a-z0-9\-A-Z]*\.)*facebook.com
Or to exclude extensions:
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-\.(js|JS|mp3|MP3|mp4|MP4|wav|WAV|mov|MOV|z|Z|tar|TAR|avi|AVI|rar|RAR|jar|JAR|ps|PS|eps|EPS|css|CSS|wmv|WMV|flv|FLV|dmg|DMG|img|IMG|swf|SWF|msi|MSI|wvx|WVX)$
I would have used prefix-urlfilter.txt and suffix-urlfilter.txt, but I
haven't found any documentation on how they work...
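
As a rough illustration of the idea only (this is not how the Nutch prefix
or suffix filter plugins are implemented or configured, which I have not
verified), host and extension exclusions like the ones above can be checked
with plain string operations instead of regular expressions. A minimal Java
sketch, assuming a hypothetical class SimpleUrlExclusions, listing only a
subset of the extensions, and applying to any URL scheme rather than only
http://:

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of Nutch: rejects URLs by host or extension without regexes. */
public class SimpleUrlExclusions {

    // Hosts to exclude, including any subdomain (taken from the rules above).
    private static final Set<String> BLOCKED_HOSTS =
            new HashSet<>(Arrays.asList("twitter.com", "facebook.com"));

    // A subset of the extensions from the rules above, stored in lowercase.
    private static final Set<String> BLOCKED_EXTENSIONS =
            new HashSet<>(Arrays.asList("gif", "jpg", "jpeg", "png", "ico", "css",
                    "js", "zip", "gz", "exe", "mp3", "swf"));

    /** Returns true if the URL should be dropped. */
    public static boolean isExcluded(String url) {
        return hasBlockedHost(url) || hasBlockedExtension(url);
    }

    static boolean hasBlockedHost(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // unparsable URL: leave the decision to other filters
        }
        if (host == null) {
            return false;
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String blocked : BLOCKED_HOSTS) {
            // Match the host itself or any subdomain of it.
            if (host.equals(blocked) || host.endsWith("." + blocked)) {
                return true;
            }
        }
        return false;
    }

    static boolean hasBlockedExtension(String url) {
        // Like the "\.(...)$" rules, only the text after the last dot is considered.
        int dot = url.lastIndexOf('.');
        if (dot < 0 || dot == url.length() - 1) {
            return false;
        }
        String ext = url.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BLOCKED_EXTENSIONS.contains(ext);
    }
}
```

With this sketch, isExcluded("http://www.twitter.com/someone") and
isExcluded("http://example.com/logo.GIF") return true, while
isExcluded("http://example.com/page.html") returns false. Each check is a
set lookup or a suffix comparison, so its cost grows linearly with the
length of the URL.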
Brad
-----Original Message-----
From: Julien Nioche [mailto:[email protected]]
Sent: Monday, August 02, 2010 10:39 AM
To: [email protected]
Subject: Re: For HTML - is parse-html twice as fast as parse-tika
Hi,
Try calling jstack on the pid of the task to get a better idea of what it
is doing. My bet is on the normalisation of some long URLs taking ages, but
it could be a lot of other things.
J.
On 2 August 2010 17:26, brad <[email protected]> wrote:
> Hi Julien,
> I'll see if I can give a try later this week.
>
> I'm having a problem in the mapred.LocalJobRunner - reduce > reduce
> portion right after the actual URL fetch/parse portion is complete. I
> don't know how long this portion is supposed to take, but I have had
> fetches run for 12 hours and the map-reduce portion run for 36 hours
> without completing. I ended up killing the job.
>
> Right now, I'm running a fetch on 1 million URLs. The parse and fetch
> portion took less than 7 hours, but the map-reduce has been running
> for 11 hours now and I'm going to wait and see if it completes.
>
> It started after fetcher.Fetcher completed:
> 2010-08-01 22:06:43,479 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
> 2010-08-01 22:06:44,368 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> 2010-08-01 22:06:44,369 INFO fetcher.Fetcher - -activeThreads=0
> 2010-08-01 22:06:44,369 INFO mapred.MapTask - Starting flush of map output
> 2010-08-01 22:06:45,129 INFO mapred.LocalJobRunner - 0 threads, 853809 pages, 18772 errors, 35.4 pages/s, 16989 kb/s,
>
> The issue appears to start with
> 2010-08-01 23:22:22,174 INFO mapred.Merger - Down to the last merge-pass, with 1 segments left of total size: 31012166567 bytes
>
> Now the process has been cycling on for 10 hours:
> INFO mapred.LocalJobRunner - reduce > reduce
>
> I'm running Nutch on a single server.
>
> Thanks
> Brad
>
>
> -----Original Message-----
> From: Julien Nioche [mailto:[email protected]]
> Sent: Monday, August 02, 2010 5:11 AM
> To: [email protected]
> Subject: Re: For HTML - is parse-html twice as fast as parse-tika
>
> Hi Brad,
>
> Could you run and measure the parser independently of the fetching?
> That would remove any possible side effect due to caching, network
> issues, etc.
>
> All you need to do is remove the subdirectories parse_text, parse_data
> and crawl_parse, then run: nutch parse
>
> Thanks
>
> Julien
>
> PS: regarding parse-html being phased out : see Andrzej's JIRA from
> this morning
>
>
> On 31 July 2010 22:43, brad <[email protected]> wrote:
>
> > > I have been experiencing some performance issues with Tika and
> > > general parsing (see Parsing Performance - related to Java
> > > concurrency issue)
> > >
> > > > Ken pointed out that both the Tika and Nutch HtmlParser show up
> > > > in my jstack list using the delivered configuration.
> > >
> > > Julien suggested checking parsing with only parse-tika (html) and
> > > then with parse-html.
> > >
> > > So here is what I did.
> > >
> > > Option 1) parse-tika
> > > parse-(rss|text|js|tika)
> > > parse-plugin.xml as delivered
> > > > tika-mimetypes.xml as delivered
> > > >
> > > Option 2) parse-html
> > > parse-(rss|text|html|js|tika)
> > > parse-plugin.xml turned ON <plugin id="parse-html" />
> > > tika-mimetypes.xml commented out <mime-type
> > > type="text/html">
> > >
> > > Using the same generated crawl, ran fetch with parse for each of
> > > the options for 2 hours.
> > > All other configurations and settings are identical
> > >
> > > Results:
> > > Parse-tika
> > > > INFO mapred.LocalJobRunner - 200 threads, 200370 pages, 6756 errors, 27.8 pages/s, 12916 kb/s
> > >
> > > Parse-html
> > > > INFO mapred.LocalJobRunner - 200 threads, 433738 pages, 13360 errors, 60.1 pages/s, 27980 kb/s,
> > >
> > >
> > > > The results:
> > > > Parse-html processed 116% more HTML pages than parse-tika (433738
> > > > vs. 200370) over the same period of time and the same URLs.
> > >
> > > > The error rate was about the same: parse-html 3%, parse-tika 3.3%.
> > > > Most of the errors are read timeouts.
> > >
> > >
> > > So is parse-html better? It appears to be faster. But, is the
> > > data as good?
> > > Other considerations? Is parse-html really going to be phased out?
> > >
> > > Brad
> > >
> > >
> > >
> >
>
>
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com
>
>
--
DigitalPebble Ltd
Open Source Solutions for Text Engineering http://www.digitalpebble.com