RE: For HTML - is parse-html twice as fast as parse-tika

brad Mon, 02 Aug 2010 11:15:12 -0700

Thanks, I tried that multiple times, and most of the time it is stuck at:


"Thread-11" prio=10 tid=0x00002aabd8023000 nid=0x62ef runnable
[0x00000000420d8000..0x00000000420d8c10]
   java.lang.Thread.State: RUNNABLE
        at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
        at java.util.regex.Pattern$Curly.match0(Pattern.java:3787)
        at java.util.regex.Pattern$Curly.match(Pattern.java:3761)
        at java.util.regex.Pattern$Start.match(Pattern.java:3072)
        at java.util.regex.Matcher.search(Matcher.java:1116)
        at java.util.regex.Matcher.find(Matcher.java:552)
        at org.apache.nutch.urlfilter.regex.RegexURLFilter$Rule.match(RegexURLFilter.java:90)
        at org.apache.nutch.urlfilter.api.RegexURLFilterBase.filter(RegexURLFilterBase.java:117)
        - locked <0x00002aaaf32f93d8> (a org.apache.nutch.urlfilter.regex.RegexURLFilter)
        at org.apache.nutch.net.URLFilters.filter(URLFilters.java:88)
        at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:220)
        at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:115)
        at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:96)
        at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:70)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:440)
        at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:42)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)


I'm not sure how to improve this.  I have about 10 entries in the
regex-urlfilter.txt file, but they are mainly there to exclude sites.
For example:
-^http://([a-z0-9\-A-Z]*\.)*twitter.com
-^http://([a-z0-9\-A-Z]*\.)*facebook.com
Or exclude extensions
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-\.(js|JS|mp3|MP3|mp4|MP4|wav|WAV|mov|MOV|z|Z|tar|TAR|avi|AVI|rar|RAR|jar|JAR|ps|PS|eps|EPS|css|CSS|wmv|WMV|flv|FLV|dmg|DMG|img|IMG|swf|SWF|msi|MSI|wvx|WVX)$
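
One way to see whether any of these rules is expensive is to time each
pattern against a long URL outside of Nutch. The following is a minimal
sketch, not Nutch code: the class name RuleTimer, its helper methods, and the
synthetic test URL are assumptions made for illustration. The patterns are the
rules quoted above with the leading "-" removed (only one of the two extension
rules is included to keep it short). It uses Matcher.find(), the same call
that appears in the stack trace.

```java
import java.util.regex.Pattern;

class RuleTimer {
    // Rules copied from this message, minus the leading '-' action marker.
    static final String[] RULES = {
        "^http://([a-z0-9\\-A-Z]*\\.)*twitter.com",
        "^http://([a-z0-9\\-A-Z]*\\.)*facebook.com",
        "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz"
            + "|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"
    };

    // True if any rule is found somewhere in the URL (unanchored search).
    static boolean matchesAny(String url) {
        for (String rule : RULES) {
            if (Pattern.compile(rule).matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    // Wall-clock time in nanoseconds for one find() call; compilation is excluded.
    static long timeNanos(String rule, String url) {
        Pattern pattern = Pattern.compile(rule);
        long start = System.nanoTime();
        pattern.matcher(url).find();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Synthetic long URL (an assumption): a short host and a very long query string.
        StringBuilder sb = new StringBuilder("http://example.com/page?q=");
        for (int i = 0; i < 20000; i++) {
            sb.append("a.b-");
        }
        String longUrl = sb.toString();

        for (String rule : RULES) {
            long micros = timeNanos(rule, longUrl) / 1000;
            System.out.printf("%8d us  %s%n", micros, rule);
        }
    }
}
```

Replacing the synthetic URL with real URLs taken from the segment would give a
more realistic picture. Note also that the unescaped "." in twitter.com and
facebook.com matches any character; writing it as "\." would be stricter.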

I would have used prefix-urlfilter.txt and suffix-urlfilter.txt, but I
haven't found any documentation on how they work...
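
Even without that documentation, the general idea of suffix and host matching
can be illustrated in plain Java. This is only a sketch under stated
assumptions: it does not show the format of suffix-urlfilter.txt or the
behaviour of any Nutch plugin. The class SimpleUrlCheck, its lists, and its
method names are hypothetical, and the entries are a subset of the extensions
and sites from the regex rules above.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class SimpleUrlCheck {
    // Hypothetical lists, taken from the regex rules in this message.
    static final List<String> EXCLUDED_SUFFIXES =
        Arrays.asList(".gif", ".jpg", ".png", ".ico", ".css", ".zip");
    static final List<String> EXCLUDED_DOMAINS =
        Arrays.asList("twitter.com", "facebook.com");

    // Case-insensitive suffix test. Unlike the regex rules, this also
    // matches mixed-case endings such as ".Gif".
    static boolean hasExcludedSuffix(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        for (String suffix : EXCLUDED_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    // True if the URL's host is an excluded domain or one of its subdomains.
    // Strings that cannot be parsed as a URI are treated as not excluded.
    static boolean hasExcludedHost(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false;
        }
        if (host == null) {
            return false;
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String domain : EXCLUDED_DOMAINS) {
            if (host.equals(domain) || host.endsWith("." + domain)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.twitter.com/brad",
            "http://example.com/logo.GIF",
            "http://example.com/index.html"
        };
        for (String url : samples) {
            boolean excluded = hasExcludedHost(url) || hasExcludedSuffix(url);
            System.out.println((excluded ? "exclude " : "keep    ") + url);
        }
    }
}
```

Plain string comparisons like these do no backtracking, so their cost grows
only with the length of the URL and the number of list entries.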

Brad

 

-----Original Message-----
From: Julien Nioche [mailto:[email protected]] 
Sent: Monday, August 02, 2010 10:39 AM
To: [email protected]
Subject: Re: For HTML - is parse-html twice as fast as parse-tika

Hi,

Try calling jstack on the pid of the task to have a better idea of what it
is doing. My bet is on the normalisation of some long URLs taking ages but
it could be a lot of other things

J.

On 2 August 2010 17:26, brad <[email protected]> wrote:

> Hi Julien,
> I'll see if I can give a try later this week.
>
> I'm having a problem in the mapred.LocalJobRunner - reduce > reduce 
> portion right after the actual URL fetch/parse portion is complete.  I 
> don't know how long it is supposed to take for this portion to 
> complete, but I have had fetches run for 12 hours and map-reduce 
> portion run for 36 hours and still not be complete.  I ended up 
> killing the job.
>
> Right now, I'm running a fetch on 1 million URLs.  The parse and fetch 
> portion took less than 7 hours, but the map-reduce has been running 
> for 11 hours now and I'm going to wait and see if it completes.
>
> It started when fetcher.Fetcher completed:
> 2010-08-01 22:06:43,479 INFO  fetcher.Fetcher - -finishing thread 
> FetcherThread, activeThreads=0
> 2010-08-01 22:06:44,368 INFO  fetcher.Fetcher - -activeThreads=0, 
> spinWaiting=0, fetchQueues.totalSize=0
> 2010-08-01 22:06:44,369 INFO  fetcher.Fetcher - -activeThreads=0
> 2010-08-01 22:06:44,369 INFO  mapred.MapTask - Starting flush of map 
> output
> 2010-08-01 22:06:45,129 INFO  mapred.LocalJobRunner - 0 threads, 
> 853809 pages, 18772 errors, 35.4 pages/s, 16989 kb/s,
>
> The issue appears to start with
> 2010-08-01 23:22:22,174 INFO  mapred.Merger - Down to the last 
> merge-pass, with 1 segments left of total size: 31012166567 bytes
>
> Now the process has been cycling on for 10 hours:
> INFO  mapred.LocalJobRunner - reduce > reduce
>
> I'm running Nutch on a single server.
>
> Thanks
> Brad
>
>
> -----Original Message-----
> From: Julien Nioche [mailto:[email protected]]
> Sent: Monday, August 02, 2010 5:11 AM
> To: [email protected]
> Subject: Re: For HTML - is parse-html twice as fast as parse-tika
>
> Hi Brad,
>
> Could you run and measure the parser independently of the fetching? 
> That would remove any possible side effects due to caching, network
> issues, etc.
>
> All you need to do is remove the subdirectories parse_text, parse_data 
> and crawl_parse then run : nutch parse
>
> Thanks
>
> Julien
>
> PS: regarding parse-html being phased out : see Andrzej's JIRA from 
> this morning
>
>
> On 31 July 2010 22:43, brad <[email protected]> wrote:
>
> > > I have been experiencing some performance issues with Tika and 
> > > general parsing (see Parsing Performance - related to Java 
> > > concurrency issue)
> > >
> > > Ken pointed out that both the Tika and Nutch HtmlParser show up in my
> > > jstack list using the delivered configuration.
> > >
> > > Julien suggested checking parsing with only parse-tika (html) and 
> > > then with parse-html.
> > >
> > > So here is what I did.
> > >
> > > Option 1) parse-tika
> > >           parse-(rss|text|js|tika)
> > >           parse-plugin.xml as delivered
> > >           tika-mimetypes.xml as delivered
> > >
> > > Option 2) parse-html
> > >           parse-(rss|text|html|js|tika)
> > >           parse-plugin.xml turned ON <plugin id="parse-html" />
> > >           tika-mimetypes.xml commented out <mime-type 
> > > type="text/html">
> > >
> > > Using the same generated crawl, ran fetch with parse for each of 
> > > the options for 2 hours.
> > > All other configurations and settings are identical
> > >
> > > Results:
> > > Parse-tika
> > > INFO  mapred.LocalJobRunner - 200 threads, 200370 pages, 6756 errors,
> > > 27.8 pages/s, 12916 kb/s
> > >
> > > Parse-html
> > > INFO  mapred.LocalJobRunner - 200 threads, 433738 pages, 13360 
> > > errors,
> > > 60.1 pages/s, 27980 kb/s,
> > >
> > >
> > > The results:
> > > Parse-html is 116% faster than parse-tika for html for the same 
> > > period of time and same URLs
> > >
> > > The error rate was about the same: parse-html 3%, parse-tika 3.3%.
> > > Most of the errors are read timeouts.
> > >
> > >
> > > So is parse-html better?  It appears to be faster.  But, is the 
> > > data as good?
> > > Other considerations?  Is parse-html really going to be phased out?
> > >
> > > Brad
> > >
> > >
> > >
> >
>
>
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering 
> http://www.digitalpebble.com
>
>


--
DigitalPebble Ltd

Open Source Solutions for Text Engineering http://www.digitalpebble.com
