Hi,
How can I use Boilerpipe with Nutch 2.1?
This is what I have so far (these instructions are for 1.6; I cannot find
anything for 2.1):
4. Delete the following lines from runtime/local/conf/parse-plugins.xml:
    <mimeType name="text/html">
        <plugin id="parse-tika" />
    </mimeType>
    <mimeType name="application/xhtml+xml">
        <plugin id="parse-tika" />
    </mimeType>
5. Add the following lines to runtime/local/conf/nutch-site.xml:
    <property>
        <name>tika.boilerpipe</name>
        <value>true</value>
    </property>
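For reference, this is a minimal sketch of how I understand the property from step 5 should sit in the file, assuming nutch-site.xml uses the usual <configuration> root element. The comment line is only a placeholder, and whether Nutch 2.1 actually reads tika.boilerpipe is exactly what I am unsure about:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- placeholder: any other site-specific properties stay here -->
  <property>
    <!-- property name taken from the 1.6 instructions; not confirmed for 2.1 -->
    <name>tika.boilerpipe</name>
    <value>true</value>
  </property>
</configuration>
```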
I test with:
bin/nutch parsechecker -dumpText http://www.nu.nl/buitenland/2845586/turkije-zal-syrie-niet-aanvallen.html
But that doesn't give me the desired result.
Thanks in advance,
Jaap
On Wed, Jan 16, 2013 at 3:54 PM, kemical <[email protected]> wrote:
> Outlink extraction is not mandatory, since the most important thing for me
> is the main content.
>
> Also, are there options for the plugin to extract HTML tags rather than raw
> plain text without line breaks (sometimes I get tags, but most of the time
> I don't), or at least some conversion to "\n", so that the displayed main
> content is more useful?
>
> And when I test the URL here: http://boilerpipe-web.appspot.com/ I do get them.
>
> (But I guess that could be because the Tika Boilerpipe version is an older one.)