On Mon, 2010-07-19 at 09:17 +0100, Julien Nioche wrote:
> Jeff,
> 
> > Hi, in Nutch 1.0 I was able to replace the parse-html plugin with my own
> > html parser to parse html files, through modifying the mime types in
> > parse-plugins.xml.
> >
> > I have been trying to do the same things in Nutch 1.1, but my own html
> > parser is not picked up when crawling, leading to no parser exceptions.
> >
> 
> You should be able to override Tika for a given mime-type provided that you
> declare the association between your plugin and the mime-type in
> parse-plugins.xml. Have you checked that your plugin is listed in
> plugin.includes? Can you see it listed in the log?
> 
> J.

I did put my plugin (containing a parse filter and html parser) in
plugin.includes. The following output demonstrates that the parse filter
was called but the html parser wasn't. 

"P2RHtmlParseFilter successfully executed." ==> this line shows that my
HTML parse filter is working.

"Original parse-html plugin executed." ==> this line shows that
HtmlParser.java was called and the Neko parser was used. (I added a
print statement for this line in HtmlParser.java.)

Output:
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=13
P2RHtmlParseFilter successfully executed.
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
fetching http://www.apache.org/foundation/getinvolved.html
Original parse-html plugin executed.
Original parse-html plugin executed.
Original parse-html plugin executed.
Original parse-html plugin executed.
Original parse-html plugin executed.
P2RHtmlParseFilter successfully executed.
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12

Interestingly, the page at
http://www.apache.org/foundation/getinvolved.html has the mime type
"text/html", whose entry in parse-plugins.xml is commented out by
default, meaning Tika should parse the page rather than Neko. Yet the
output above shows that the original parse-html plugin handled it.

So my guess is that commenting out or uncommenting a mime type in
parse-plugins.xml is not enough, on its own, to replace the Tika parsers.

In addition, when I made a single change, uncommenting and modifying
the "text/html" mime type this way:

<mimeType name="text/html">
        <!-- <plugin id="parse-html" /> -->
        <plugin id="p2r-plugins" />
</mimeType>

the page above could no longer be parsed, since my parser was not picked up.
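
For reference, here is a rough sketch of the two parts of
parse-plugins.xml that, as far as I understand, have to agree for an
override like this to take effect: the mimeType entry and an alias that
maps the plugin id to the parser's extension id. The extension id shown
(org.example.p2r.P2RHtmlParser) is a hypothetical placeholder, not the
real class name, and the need for the alias is an assumption based on
how the stock file declares parse-html and parse-tika.

```xml
<parse-plugins>
  <!-- Route text/html to the custom plugin instead of parse-html or Tika. -->
  <mimeType name="text/html">
    <plugin id="p2r-plugins" />
  </mimeType>

  <aliases>
    <!-- Hypothetical extension id (assumption): it should match the
         implementation id declared for the parser in the plugin's
         own plugin.xml. -->
    <alias name="p2r-plugins" extension-id="org.example.p2r.P2RHtmlParser" />
  </aliases>
</parse-plugins>
```

If the alias is missing or points at the wrong id, that might explain
why the filter in the same plugin runs while the parser itself is never
selected.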

Thanks
