Ahh... this indicates that the plugin.xml files for parse-tika and parse-html
both claim to support text/html, but there is no mapping in parse-plugins.xml
that takes care of it. You need to update parse-plugins.xml under the
text/html mime-type...
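For reference, a minimal sketch of what such a mapping could look like in conf/parse-plugins.xml (element names assumed from the default file shipped with Nutch; adjust the plugin ids and order of preference to your install):

```xml
<!-- Sketch only: map text/html to both parsers, Tika first. -->
<mimeType name="text/html">
    <plugin id="parse-tika" />
    <plugin id="parse-html" />
</mimeType>
```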


On 7/24/10 12:15 PM, "brad" <[email protected]> wrote:

Start up of the fetcher shows:
INFO  parse.ParserFactory - The parsing plugins:
[org.apache.nutch.parse.tika.Parser -
org.apache.nutch.parse.html.HtmlParser] are enabled via the plugin.includes
system property, and all claim to support the content type text/html, but
they are not mapped to it  in the parse-plugins.xml file


In the hadoop.log file there are a few tika errors like:

 ERROR tika.TikaParser - Error parsing http://www.xyx.com/download.html




-----Original Message-----
From: Julien Nioche [mailto:[email protected]]
Sent: Saturday, July 24, 2010 12:01 PM
To: [email protected]
Subject: Re: Parsing Performance - related to Java concurrency issue

Since tika covers the same mime-types as parse-text and parse-html, you
probably don't need to include them. Can't remember why we kept them in the
default (anyone?).
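Dropping them, the trimmed plugin.includes value might look something like this (a sketch derived from the default value, untested):

```
protocol-http|urlfilter-regex|parse-(js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
```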

Re: finding both the html and tika parsers in the stacks: the only case where
that should happen is when both are loaded (as in your conf and the default).
The default parser (i.e. Tika) is tried first; if it fails to return a
result, the remaining parsers for that mime type are used.

Did you notice anything in the log about errors during the parsing of HTML
docs with Tika? That could explain why the html parser was tried.
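The fallback behaviour described above can be illustrated with a small, self-contained sketch (hypothetical names, not Nutch's actual ParseUtil code): parsers registered for a mime type are tried in order, and later ones only run when earlier ones fail.

```java
import java.util.List;

public class ParserFallbackSketch {

    // Stand-in for a Nutch Parser: returns parsed text, or null on failure.
    interface Parser {
        String parse(String content);
    }

    // Try each parser registered for the mime type in order; the first
    // non-null result wins, so later parsers only run if earlier ones fail.
    static String parseWithFallback(List<Parser> parsers, String content) {
        for (Parser p : parsers) {
            String result = p.parse(content);
            if (result != null) {
                return result;
            }
        }
        return null; // every parser failed
    }

    public static void main(String[] args) {
        Parser tikaLike = c -> null;            // simulate Tika failing
        Parser htmlLike = c -> "parsed:" + c;   // parse-html succeeds
        System.out.println(parseWithFallback(List.of(tikaLike, htmlLike), "doc"));
    }
}
```

With both plugins enabled, a Tika failure on a document silently falls through to parse-html, which would match seeing both parsers in the stacks.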
J.


On 24 July 2010 19:45, brad <[email protected]> wrote:

> Here is my plugin.includes from nutch-site.xml
>
> protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
>
> Here is the one from nutch-default.xml
>
>
> protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
>
> Do I need to change something on this?
>
> Thanks
> Brad
>
>
> -----Original Message-----
> From: Julien Nioche [mailto:[email protected]]
> Sent: Saturday, July 24, 2010 11:09 AM
> To: [email protected]
> Subject: Re: Parsing Performance - related to Java concurrency issue
>
> Hi guys,
>
> Brad, thanks for sharing your observations with us, that's great.
> looks like we could definitely do without the lock on the charset.
>
> The stack trace definitely shows that BOTH parse-html and parse-tika
> are used, which should not happen. I wonder whether they are both
> called on each document or alternatively. I will have a look at it and
> see if I can find an explanation for this.
>
> It would be interesting to see, for a fetched segment:
> - how long it takes to parse it when both parse-(html|tika) are in
> plugin.includes
> - same with only parse-tika
> - same with only parse-html
>
> Thanks
>
> Jul
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com
>
>
> On 24 July 2010 18:33, Ken Krugler <[email protected]> wrote:
>
> > Hi Brad,
> >
> >
> > On Jul 23, 2010, at 8:50pm, brad wrote:
> >
> >> The items listed in the original email were just the location of the
> >> original call.  Here is the actual jstack dump for parseNeko and
> >> the other.  parseNeko is apparently being called directly by the
> >> Nutch parser, not Tika...
> >>
> >
> > Thanks for the additional details.
> >
> > Maybe Julien can provide an explanation for why there are both Tika
> > and Nutch HtmlParser references showing up - Julien?
> >
> > -- Ken
> >
> >
> >  "Thread-177094" daemon prio=10 tid=0x00002aab281fb000 nid=0x5638
> > waiting
> >> for
> >> monitor entry [0x00002aab82a18000..0x00002aab82a18b90]
> >>  java.lang.Thread.State: BLOCKED (on object monitor)
> >>        at
> >>
> >>
> sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java
> :135)
> >>        - waiting to lock <0x00002aaace621488> (a
> >> sun.nio.cs.StandardCharsets)
> >>        at java.nio.charset.Charset.lookup2(Charset.java:468)
> >>        at java.nio.charset.Charset.lookup(Charset.java:456)
> >>        at java.nio.charset.Charset.isSupported(Charset.java:498)
> >>        at
> >> sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:67)
> >>        at java.io.InputStreamReader.<init>(InputStreamReader.java:100)
> >>        at
> >> org.cyberneko.html.HTMLScanner.setInputSource(HTMLScanner.java:774)
> >>        at
> >>
> >> org.cyberneko.html.HTMLConfiguration.setInputSource(HTMLConfigurati
> >> on
> >> .java:4
> >> 57)
> >>        at
> >> org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:430)
> >>        at
> >>
> >> org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.
> >> java:16
> >> 4)
> >>        at
> >> org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253)
> >>        at
> >> org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:210)
> >>        at
> >> org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
> >>        at
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
> >>        at
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
> >>        at
> >> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> >>        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> >>        at java.lang.Thread.run(Thread.java:636)
> >>
> >>
> >> Here are the main ones listed in the jira case:
> >>
> >> "Thread-177096" daemon prio=10 tid=0x00002aab283e5400 nid=0x563a
> >> waiting for monitor entry [0x000000007318b000..0x000000007318bc90]
> >>  java.lang.Thread.State: BLOCKED (on object monitor)
> >>        at
> >>
> >>
> sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java
> :135)
> >>        - waiting to lock <0x00002aaace621488> (a
> >> sun.nio.cs.StandardCharsets)
> >>        at java.nio.charset.Charset.lookup2(Charset.java:468)
> >>        at java.nio.charset.Charset.lookup(Charset.java:456)
> >>        at java.nio.charset.Charset.isSupported(Charset.java:498)
> >>        at
> >> sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:67)
> >>        at java.io.InputStreamReader.<init>(InputStreamReader.java:100)
> >>        at
> >> org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84)
> >>        at
> >> org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
> >>        at
> >> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> >>        at
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
> >>        at
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
> >>        at
> >> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> >>        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> >>        at java.lang.Thread.run(Thread.java:636)
> >>
> >>
> >> "Thread-177079" daemon prio=10 tid=0x00002aab1c149000 nid=0x5629
> >> waiting for monitor entry [0x00002aab8200e000..0x00002aab8200ec10]
> >>  java.lang.Thread.State: BLOCKED (on object monitor)
> >>        at
> >>
> >>
> sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java
> :135)
> >>        - waiting to lock <0x00002aaace621488> (a
> >> sun.nio.cs.StandardCharsets)
> >>        at java.nio.charset.Charset.lookup2(Charset.java:468)
> >>        at java.nio.charset.Charset.lookup(Charset.java:456)
> >>        at java.nio.charset.Charset.isSupported(Charset.java:498)
> >>        at
> >> org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99)
> >>        at
> >> org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
> >>        at
> >> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> >>        at
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
> >>        at
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
> >>        at
> >> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> >>        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> >>        at java.lang.Thread.run(Thread.java:636)
> >>
> >>
> >> "Thread-177029" daemon prio=10 tid=0x0000000017a67400 nid=0x55f7
> >> waiting for monitor entry [0x00002aab8806f000..0x00002aab8806fb90]
> >>  java.lang.Thread.State: BLOCKED (on object monitor)
> >>        at
> >>
> >>
> sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java
> :135)
> >>        - waiting to lock <0x00002aaace621488> (a
> >> sun.nio.cs.StandardCharsets)
> >>        at java.nio.charset.Charset.lookup2(Charset.java:468)
> >>        at java.nio.charset.Charset.lookup(Charset.java:456)
> >>        at java.nio.charset.Charset.isSupported(Charset.java:498)
> >>        at
> >> org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133)
> >>        at
> >> org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
> >>        at
> >> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> >>        at
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
> >>        at
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
> >>        at
> >> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> >>        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> >>        at java.lang.Thread.run(Thread.java:636)
> >>
> >>
> >> "Thread-177090" daemon prio=10 tid=0x00002aab48060c00 nid=0x5634
> >> waiting for monitor entry [0x00002aab82f1d000..0x00002aab82f1dd90]
> >>  java.lang.Thread.State: BLOCKED (on object monitor)
> >>        at
> >>
> >>
> sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java
> :135)
> >>        - waiting to lock <0x00002aaace621488> (a
> >> sun.nio.cs.StandardCharsets)
> >>        at java.nio.charset.Charset.lookup2(Charset.java:468)
> >>        at java.nio.charset.Charset.lookup(Charset.java:456)
> >>        at java.nio.charset.Charset.forName(Charset.java:521)
> >>        at
> >>
> >> org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlP
> >> ar
> >> ser.jav
> >> a:86)
> >>        at
> >> org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:137)
> >>        at
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
> >>        at
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
> >>        at
> >> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> >>        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> >>        at java.lang.Thread.run(Thread.java:636)
> >>
> >> The last one isn't a Tika problem; it's Nutch's issue...
> >>
> >>
> >> If there is anything else I can provide, please let me know.
> >>
> >> Thanks
> >> Brad
> >>
> >> -----Original Message-----
> >> From: brad [mailto:[email protected]]
> >> Sent: Friday, July 23, 2010 8:37 PM
> >> To: [email protected]
> >> Subject: RE: Parsing Performance - related to Java concurrency
> >> issue
> >>
> >> I'm just running Nutch as delivered.  The information about
> >> org.apache.tika.parser.html.HtmlParser.getEncoding, etc. is from
> >> running jstack on the Nutch process when it slowed to a crawl...
> >>
> >> -----Original Message-----
> >> From: Ken Krugler [mailto:[email protected]]
> >> Sent: Friday, July 23, 2010 8:25 PM
> >> To: [email protected]
> >> Subject: Re: Parsing Performance - related to Java concurrency
> >> issue
> >>
> >> Hi Brad,
> >>
> >> On Jul 23, 2010, at 7:21pm, brad wrote:
> >>
> >>> Hi Ken,
> >>> Thanks for the info.  I'm using Nutch 1.1, so I believe it is Tika
> >>> 0.7?  The jar files in my Nutch path are tika-core-0.7.jar and
> >>> tika-parsers-0.7.jar.
> >>> Is there a way to find out if it is actually pulling in something
> >>> different when executing?
> >>
> >> Tika switched from Neko to TagSoup on 14/Oct/2009.
> >>
> >> Tika 0.7 was released on April 3rd, 2010 so I would expect that if
> >> you are using Tika 0.7, you'd be using TagSoup.
> >>
> >> However, I see that the line referencing Neko is this one:
> >>
> >>>  at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:
> >>
> >> So it's the Nutch HtmlParser that's using Neko.
> >>
> >> Curious why you have both Nutch and Tika HtmlParser refs in your
> >> file, e.g. I also see:
> >>
> >>>  org.apache.tika.parser.html.HtmlParser.getEncoding
> >> -- Ken
> >>
> >>> The ps -ef | grep nutch output includes
> >>> -classpath ...:/usr/local/nutch/lib/tika-core-0.7.jar:...
> >>> in the Nutch execution command line.
> >>>
> >>> My server does have
> >>> /usr/local/solr/contrib/extraction/lib/tika-core-0.4.jar
> >>> /usr/local/solr/contrib/extraction/lib/tika-parsers-0.4.jar
> >>>
> >>> But they are not on the classpath nor on $PATH, so I doubt they
> >>> are being picked up.  Is there someplace else I should be looking?
> >>>
> >>> Thanks
> >>> Brad
> >>>
> >>> -----Original Message-----
> >>> From: Ken Krugler [mailto:[email protected]]
> >>> Sent: Friday, July 23, 2010 6:38 PM
> >>> To: [email protected]
> >>> Subject: Re: Parsing Performance - related to Java concurrency
> >>> issue
> >>>
> >>> Hi Brad,
> >>>
> >>> Thanks for the nice write-up, and the refs.
> >>>
> >>> I'll look into using a simple cache in Tika to avoid this type of
> >>> blocking.
> >>> Feel free to comment on
> >>> https://issues.apache.org/jira/browse/TIKA-471
> >>>
> >>> Note that the Tika code base has changed from what you appear to
> >>> be using (e.g. the switch from Neko to TagSoup happened quite a
> >>> while ago).
> >>>
> >>> -- Ken
> >>>
> >>> On Jul 23, 2010, at 3:51pm, brad wrote:
> >>>
> >>>> I'm continuing to have performance problems with parsing.  I ran the
> >>>> fetch process with -noParsing and got great performance.  If I do
> >>>> the same process with parsing left in, the fetching seems to be
> >>>> going great, but as the process continues to run, everything
> >>>> slows down to almost a dead stop.
> >>>>
> >>>> When I check the thread stack, I find that 1062 threads are blocked:
> >>>> java.lang.Thread.State: BLOCKED (on object monitor)
> >>>>        at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
> >>>>
> >>>> Apparently this is a known issue with Java, and a couple of
> >>>> articles have been written about it:
> >>>> http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
> >>>> http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html
> >>>>
> >>>> There is also a note in the Java bug database about scaling
> >>>> issues with the class:
> >>>> Please also note that the current implementation of
> >>>> sun.nio.cs.FastCharsetProvider.charsetForName() uses a JVM-wide
> >>>> lock and is called very often (e.g. by new String(byte[] data,
> >>>> String encoding)).  This JVM-wide lock means that Java
> >>>> applications do not scale beyond 4 CPU cores.
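One way around that lock, and roughly what a cache inside Tika could do, is to memoize Charset lookups so the locked provider is only hit once per charset name. A minimal sketch (hypothetical helper, not the actual Tika fix):

```java
import java.nio.charset.Charset;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CharsetCacheSketch {

    // Cache of resolved charsets; ConcurrentHashMap avoids a global lock.
    private static final Map<String, Charset> CACHE = new ConcurrentHashMap<>();

    // Only the first lookup for a given name reaches Charset.forName() and
    // its provider lock; subsequent calls are cheap concurrent map reads.
    static Charset cachedForName(String name) {
        return CACHE.computeIfAbsent(name, Charset::forName);
    }

    public static void main(String[] args) {
        // Second call returns the cached instance without touching the provider.
        System.out.println(cachedForName("UTF-8") == cachedForName("UTF-8"));
    }
}
```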
> >>>>
> >>>> In my stack dump at this particular point in time, the BLOCKED
> >>>> calls to charsetForName were generated by (count after each frame):
> >>>>
> >>>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84)  378
> >>>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99)  61
> >>>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133)  19
> >>>> at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86)  238
> >>>> at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)  133
> >>>> at org.apache.pdfbox.pdfparser.PDFParser.skipToNextObj(PDFParser.java:270)  8
> >>>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253)  47
> >>>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:247)  19
> >>>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:227)  2
> >>>> at org.apache.pdfbox.cos.COSDocument.<init>(COSDocument.java:104)  7
> >>>> at org.apache.hadoop.io.Text$1.initialValue(Text.java:54)  88
> >>>> at org.apache.hadoop.io.Text.decode(Text.java:344)  2
> >>>> at org.apache.tika.parser.xml.XMLParser.getDefaultSAXParserFactory(XMLParser.java:161)  12
> >>>> at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:192)  13
> >>>> at org.apache.pdfbox.cos.COSString.getString(COSString.java:245)  3
> >>>>
> >>>> Is this an issue that only I'm facing?  Is it worth looking at
> >>>> alternatives as discussed in the articles?  Or should I just limit
> >>>> the number of threads that are run?  Right now it seems like the
> >>>> blocking is causing problems unrelated to the general design and
> >>>> behavior of Nutch.
> >>>>
> >>>> Thoughts??
> >>>>
> >>>> Thanks
> >>>> Brad
> >>>>
> >>>>
> >>>>
> >>> --------------------------------------------
> >>> Ken Krugler
> >>> +1 530-210-6378
> >>> http://bixolabs.com
> >>> e l a s t i c   w e b   m i n i n g
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >
> >
> >
> >
> >
>
>






++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
