Hi Markus,
The patch does apply if you pass the -F3 (fuzz factor) option to patch, like:
patch -p0 -ui NUTCH-961-1.5-1.patch -F3
I checked parse-plugins.xml and changed the HTML and XHTML MIME types like
this:
<mimeType name="text/html">
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="application/xhtml+xml">
  <plugin id="parse-tika" />
</mimeType>
Unfortunately, it doesn't work.
I also checked TikaParser.java and it refers to useBoilerpipe and
boilerpipeExtractorName:
boolean useBoilerpipe = getConf().getBoolean("tika.boilerpipe", true);
boolean useBoilerpipeEstimator =
    getConf().getBoolean("tika.boilerpipe.estimator", false);
String boilerpipeExtractorName =
    getConf().get("tika.boilerpipe.extractor", "ArticleExtractor");
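For reference, this is roughly how I would expect these settings to look as
overrides in conf/nutch-site.xml, which takes precedence over
conf/nutch-default.xml. This is only a sketch: it assumes the property names
are exactly the ones the patched TikaParser.java reads above, and the
descriptions are my own wording.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides of nutch-default.xml.
     Property names are assumed from the getConf() calls in TikaParser.java. -->
<configuration>
  <property>
    <name>tika.boilerpipe</name>
    <value>true</value>
    <description>Assumed switch that enables Boilerpipe in parse-tika.</description>
  </property>
  <property>
    <name>tika.boilerpipe.extractor</name>
    <value>ArticleExtractor</value>
    <description>Assumed name of the Boilerpipe extractor to use.</description>
  </property>
</configuration>
```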
Still, I am unsure where to specify these variables. Instead, I added the
following lines to the Java code (and commented out the previous lines):
boolean useBoilerpipe = true;
String boilerpipeExtractorName = "ArticleExtractor";
Still, it is not working.
Any ideas?
Cheers,
René
On Jun 27, 2012, at 12:32 PM, Markus Jelsma wrote:
> Hi René,
>
> It seems NUTCH-961-1.5-1.patch doesn't apply cleanly to the final 1.5
> release at all; TikaParser.java changed a bit between the patch and the
> release of 1.5. Did you resolve the failed hunks? If so, are you sure Tika is
> being used for (X)HTML pages? By default, Nutch uses the old parse-html plugin
> to parse those content types. Check your parse-plugins.xml configuration.
>
> Cheers,
> Markus
>
>
> -----Original message-----
>> From:Rene Nederhand <[email protected]>
>> Sent: Wed 27-Jun-2012 11:59
>> To: [email protected]
>> Subject: Using Nutch with Boilerpipe
>>
>> Hi,
>>
>> I'm trying to index only the main content (main article) of various
>> websites. For this, I'd like to use Boilerpipe with Nutch.
>>
>> Markus has been developing a patch (NUTCH-961) that does exactly that.
>> Although the patch installs without problems, I am not sure how to configure
>> the necessary settings. Is there anyone who can shed some light on this?
>>
>> As I understand it, two variables have to be set:
>>
>> tika.boilerpipe = true
>> tika.boilerpipe.extractor = "ArticleExtractor"
>>
>> I have tried to do this in a file conf/tika.config.file (is this still being
>> used?) and in conf/nutch-default.xml, as valid XML within a property
>> element. Neither activated Boilerpipe. FYI: I am using Nutch 1.5.
>>
>> What should I do to get this thing going?
>>
>> Kind regards,
>>
>> René
>>