On Mon, 2010-07-19 at 09:17 +0100, Julien Nioche wrote:
> Jeff,
>
> > Hi, in Nutch 1.0 I was able to replace the parse-html plugin with my own
> > html parser to parse html files, through modifying the mime types in
> > parse-plugins.xml.
> >
> > I have been trying to do the same things in Nutch 1.1, but my own html
> > parser is not picked up when crawling, leading to no parser exceptions.
> >
>
> You should be able to override Tika for a given mime-type provided that you
> declare the association between your plugin and the mime-type in
> parse-plugins.xml. Have you checked that your plugin is listed in
> plugin.includes? Can you see it listed in the log?
>
> J.

I did put my plugin (containing a parse filter and an html parser) in
plugin.includes. The output below shows that the parse filter was called
but the html parser wasn't.

"P2RHtmlParseFilter successfully executed." ==> this line shows my html
parse filter is working.

"Original parse-html plugin executed." ==> this line shows HtmlParser.java
was called and the Neko parser was used. (I print this line in
HtmlParser.java.)

Output:

    -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=13
    P2RHtmlParseFilter successfully executed.
    -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
    -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
    -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
    -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=13
    fetching http://www.apache.org/foundation/getinvolved.html
    Original parse-html plugin executed.
    Original parse-html plugin executed.
    Original parse-html plugin executed.
    Original parse-html plugin executed.
    Original parse-html plugin executed.
    P2RHtmlParseFilter successfully executed.
    -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12
    -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12
    -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=12

Very interestingly, this page
(http://www.apache.org/foundation/getinvolved.html) has the "text/html"
mime type, whose entry in parse-plugins.xml is commented out by default,
meaning Tika should parse the page rather than Neko. Yet the log shows the
original parse-html plugin being executed. So my guess is that commenting
out or uncommenting a mime type in parse-plugins.xml is not enough, on its
own, to change which parser handles that type.

In addition, when I made only one change, uncommenting and modifying the
"text/html" mime type this way:

    <mimeType name="text/html">
      <!-- <plugin id="parse-html" /> -->
      <plugin id="p2r-plugins" />
    </mimeType>

the above page could no longer be parsed, since my parser was not picked up.

Thanks
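
For reference, here is a minimal sketch (in Python, outside Nutch itself) of one way to double-check which plugin ids are actually active for a mime type once commented-out entries are ignored. The fragment is modelled on the "text/html" entry quoted above; the enclosing <parse-plugins> root element and the helper name active_plugins are assumptions made for this illustration, not part of Nutch.

```python
import xml.etree.ElementTree as ET

# Fragment modelled on the text/html entry from the message; the
# <parse-plugins> root element is an assumption for this sketch.
PARSE_PLUGINS_XML = """
<parse-plugins>
  <mimeType name="text/html">
    <!-- <plugin id="parse-html" /> -->
    <plugin id="p2r-plugins" />
  </mimeType>
</parse-plugins>
"""


def active_plugins(xml_text, mime_type):
    """Return the plugin ids mapped to mime_type.

    ElementTree's default parser drops XML comments, so commented-out
    <plugin> entries are not reported. (Hypothetical helper.)
    """
    root = ET.fromstring(xml_text)
    for mime in root.iter("mimeType"):
        if mime.get("name") == mime_type:
            return [plugin.get("id") for plugin in mime.findall("plugin")]
    return []  # no entry for this mime type


print(active_plugins(PARSE_PLUGINS_XML, "text/html"))
```

This only confirms what the file declares for a mime type; it says nothing about whether Nutch then resolves that plugin id to a parser at runtime.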

