Hi,

How can I use boilerpipe with Nutch 2.1?

This is what I have so far (these instructions are for 1.6; I cannot find
anything on 2.1):


4. Delete the following lines from runtime/local/conf/parse-plugins.xml:
        <mimeType name="text/html">
                <plugin id="parse-tika" />
        </mimeType>

        <mimeType name="application/xhtml+xml">
                <plugin id="parse-tika" />
        </mimeType>

5. Add the following lines to runtime/local/conf/nutch-site.xml:
        <property>
                <name>tika.boilerpipe</name>
                <value>true</value>
        </property>
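
To double-check that steps 4 and 5 actually ended up in the files Nutch reads, something like the following could be used. It is only a rough sketch: the helper names (plugins_for_mime, property_value) and the sample strings are made up for illustration, and it assumes the usual layout of the two files (mimeType/plugin elements in parse-plugins.xml, property/name/value elements in nutch-site.xml). It only reports what is configured; it does not know which setting is right for 2.1.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


def plugins_for_mime(root: ET.Element, mime: str) -> List[str]:
    """Return the plugin ids mapped to a MIME type in a parse-plugins.xml tree."""
    return [
        plugin.get("id")
        for mime_type in root.iter("mimeType")
        if mime_type.get("name") == mime
        for plugin in mime_type.findall("plugin")
    ]


def property_value(root: ET.Element, name: str) -> Optional[str]:
    """Return the value of a named property in a nutch-site.xml tree, or None."""
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            return prop.findtext("value", default="").strip()
    return None


# Hypothetical sample content, standing in for the real files.
SAMPLE_PARSE_PLUGINS = """
<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-tika" />
  </mimeType>
</parse-plugins>
"""

SAMPLE_SITE = """
<configuration>
  <property>
    <name>tika.boilerpipe</name>
    <value>true</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    plugins_root = ET.fromstring(SAMPLE_PARSE_PLUGINS)
    site_root = ET.fromstring(SAMPLE_SITE)
    print("text/html plugins:", plugins_for_mime(plugins_root, "text/html"))
    print("tika.boilerpipe:", property_value(site_root, "tika.boilerpipe"))
```

For the real files, the roots would come from ET.parse("runtime/local/conf/parse-plugins.xml").getroot() and ET.parse("runtime/local/conf/nutch-site.xml").getroot() instead of the sample strings.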

I test with: bin/nutch parsechecker -dumpText
http://www.nu.nl/buitenland/2845586/turkije-zal-syrie-niet-aanvallen.html

But that doesn't give me the desired result.

Thanks in advance,

Jaap

On Wed, Jan 16, 2013 at 3:54 PM, kemical <[email protected]> wrote:

> Outlink extraction is not mandatory, since the most important thing for me
> is the main content.
>
> Also, are there options for the plugin to extract HTML tags rather than raw
> plain text without line breaks (sometimes I get tags, but most of the
> time I don't), or at least some conversion to "\n" so that the displayed
> main content is more useful?
>
> And when I use the URL here: http://boilerpipe-web.appspot.com/ I do get them.
>
> (but I guess it could be because the Tika boilerpipe version is an older one)
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Using-Nutch-with-Boilerpipe-tp3991587p4033868.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
