Re: Boilerpipe and Nutch 2.x ?

Ferdy Galema Mon, 10 Sep 2012 04:49:26 -0700

Hi,

To be absolutely sure that only Tika is used, you should also remove the
parse-html plugin from plugin.includes. Make sure all references to the
parse-html plugin are removed from parse-plugins.xml as well. (Looking at
your snippet, it seems this is already the case.)
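
As a rough sketch, the plugin.includes property in nutch-site.xml might
look something like the following once parse-html is dropped. The other
plugin names in the value are only illustrative assumptions; keep whatever
your current value lists and just make sure parse-tika is there and
parse-html is not:

```xml
<property>
  <name>plugin.includes</name>
  <!-- Illustrative value only: parse-tika is listed, parse-html is not.
       The remaining plugin names are assumed; use the ones from your setup. -->
  <value>protocol-http|urlfilter-regex|parse-tika|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
```

Note that the value is a regular expression, so a pattern such as
parse-(html|tika) would still pull in parse-html.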


I'm not really familiar with Tika itself or Boilerpipe. (Mostly I use
parse-html.)

Ferdy.

On Mon, Sep 10, 2012 at 3:29 AM, Matt MacDonald <[email protected]> wrote:

> Hi,
>
> I've been looking at 2.x source code, JIRA and the mailing list for
> information about Boilerpipe and Nutch 2.x. I can see that the
> boilerpipe.jar file is included in the Tika plugin.xml file: <library
> name="boilerpipe-1.1.0.jar"/>. I also see two jira tickets talking
> about boilerpipe in Nutch 1.6:
>
> * https://issues.apache.org/jira/browse/NUTCH-961
> * https://issues.apache.org/jira/browse/NUTCH-1233
>
> I also see that Tika 1.1 is using Boilerpipe:
>
> http://tika.apache.org/1.1/api/org/apache/tika/parser/html/BoilerpipeContentHandler.html
>
> I've searched the mailing lists and code for the configuration
> options I need to set up so that HTML/XHTML documents are parsed
> by Tika with Boilerpipe and a specific Extractor. I have added
> the following to nutch-site.xml:
>
> <property>
>   <name>tika.use_boilerpipe</name>
>   <value>true</value>
> </property>
> <property>
>   <name>tika.boilerpipe.extractor</name>
>   <value>ArticleExtractor</value>
> </property>
>
> And in parse-plugins.xml I have the following:
>
> <mimeType name="*">
>   <plugin id="parse-tika" />
> </mimeType>
> <mimeType name="text/html">
>   <plugin id="parse-tika" />
> </mimeType>
> <mimeType name="application/xhtml+xml">
>   <plugin id="parse-tika" />
> </mimeType>
>
> When I run my crawl, it isn't clear that the Tika parser is being used
> for text/html and application/xhtml+xml, and when I look at the
> extracted content from the pages that I am crawling I see lots of
> shell/template/wrapper HTML. Questions:
>
> 1. Ideas about what I can do to confirm that the Tika parser is being used?
> 2. Is there a logging setting so that I know that Boilerpipe is being
> used to parse the HTML/XHTML?
> 3. Can I change the Extractor Boilerpipe uses and, if so, how?
> 4. Any ideas about what I am missing in my configuration so that
> Tika/Boilerpipe is being used to parse those documents?
>
> Thanks,
> Matt
>
