Could we use both the normal parse-html and parse-tika, and write their
output to different fields in a Solr instance?

Is the problem finding outlinks? I did not understand this one. Do you
mean we won't be able to find outlinks from a page parsed with Tika and
boilerpipe because the boilerpipe detection changes links to text?

Best Regards,
C.B.

On Sat, Jul 9, 2011 at 9:23 PM, Markus Jelsma
<[email protected]> wrote:
> I'll patch against 1.4 and 1.3. Not sure on 1.2. It works but hasn't been
> tested well. It also suffers from the problem of finding outlinks as it won't
> detect outlinks that are marked as boilerplate text.
>
>> I have been familiar with boilerpipe, and I was glad to see:
>> https://issues.apache.org/jira/browse/NUTCH-961
>>
>> Is this a working patch? And if so, can I patch against the latest?
>>
>> Best Regards,
>> C.B.
>
