Hi,

To be absolutely sure that only Tika is used, you should also remove the parse-html plugin from plugin.includes. Also make sure that all references to the parse-html plugin are removed from parse-plugins.xml. (Judging by your snippet, this already seems to be the case.)
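As a rough sketch of the plugin.includes change in nutch-site.xml: the plugin list below is only an assumption modelled on a typical default, so keep whatever plugins you actually use and just make sure parse-html no longer matches.

```xml
<!-- Assumed example value: adapt the list to your own setup.
     The only point is that the parse entry names parse-tika alone,
     rather than something like parse-(html|tika). -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-tika|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
```

Because the value is a regular expression, a pattern such as parse-(html|tika) would still load parse-html, so it is worth checking the whole expression and not only looking for the literal plugin name.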
I'm not really familiar with Tika itself or Boilerpipe. (I mostly use parse-html.)

Ferdy.

On Mon, Sep 10, 2012 at 3:29 AM, Matt MacDonald <[email protected]> wrote:
> Hi,
>
> I've been looking at 2.x source code, JIRA and the mailing list for
> information about Boilerpipe and Nutch 2.x. I can see that the
> boilerpipe.jar file is included in the Tika plugin.xml file: <library
> name="boilerpipe-1.1.0.jar"/>. I also see two jira tickets talking
> about boilerpipe in Nutch 1.6:
>
> * https://issues.apache.org/jira/browse/NUTCH-961
> * https://issues.apache.org/jira/browse/NUTCH-1233
>
> I also see that Tika 1.1 is using Boilerpipe:
>
> http://tika.apache.org/1.1/api/org/apache/tika/parser/html/BoilerpipeContentHandler.html
>
> I've searched the mailing lists and code looking for what
> configuration options I need to set up so that when HTML/XHTML
> documents are parsed, Tika with Boilerpipe and a specific
> Extractor is being used. I have added the following to nutch-site.xml:
>
> <property>
>   <name>tika.use_boilerpipe</name>
>   <value>true</value>
> </property>
> <property>
>   <name>tika.boilerpipe.extractor</name>
>   <value>ArticleExtractor</value>
> </property>
>
> And in parse-plugins.xml I have the following:
>
> <mimeType name="*">
>   <plugin id="parse-tika" />
> </mimeType>
> <mimeType name="text/html">
>   <plugin id="parse-tika" />
> </mimeType>
> <mimeType name="application/xhtml+xml">
>   <plugin id="parse-tika" />
> </mimeType>
>
> When I run my crawl it isn't clear that the Tika parser is being used
> for text/html and application/xhtml+xml, and when looking at the extracted
> content from the pages that I am crawling I'm seeing lots of
> shell/template/wrapper HTML. Questions:
>
> 1. Ideas about what I can do to confirm that the Tika parser is being used?
> 2. Is there a logging setting so that I know that Boilerpipe is being
>    used to parse the HTML/XHTML?
> 3. Can I change the Extractor Boilerpipe uses and if so how?
> 4. Any ideas about what I am missing in my configuration so that
>    Tika/Boilerpipe is being used to parse those documents?
>
> Thanks,
> Matt

