Re: tika and boilerpipe

Markus Jelsma Sat, 09 Jul 2011 12:56:12 -0700

> Could we use both normal parse-html and parse-tika, and write them on
> different fields in a solr instance?


No. Each parser is mapped to a content type, so only one of them will be
executed for a given document.
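
As a toy illustration of that mapping (plain Python, not Nutch code; the
function names and the dictionary are made up for this sketch, and in Nutch
the mapping lives in configuration rather than in code):

```python
# Toy model, not Nutch code: each content type resolves to exactly one parser.

def parse_html(raw):
    # Stand-in for the regular HTML parser (hypothetical helper).
    return {"parser": "parse-html", "text": raw}


def parse_tika(raw):
    # Stand-in for the Tika-based parser (hypothetical helper).
    return {"parser": "parse-tika", "text": raw}


# Hypothetical mapping for this sketch: one parser per content type.
PARSER_BY_CONTENT_TYPE = {
    "text/html": parse_tika,
    "application/pdf": parse_tika,
}


def parse(content_type, raw):
    # Only the parser mapped to this content type runs, so there is no
    # second parse result that could be written to a different field.
    parser = PARSER_BY_CONTENT_TYPE[content_type]
    return parser(raw)


result = parse("text/html", "<p>hello</p>")
print(result["parser"])
```

With text/html mapped to the Tika stand-in, parse_html is never called for
that document.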

> 
> The problem is finding outlinks? I did not understand this one. Do you
> mean we wont be able to find outlinks from a tika-boilerplate parsed
> page because the boilerpipe detection changes links to text?

When boilerpipe strips away boilerplate text, it also removes the links inside
that text, so there are fewer links left to detect. Many websites are not
crawlable in this manner, since their navigation links tend to sit in exactly
the blocks that get removed.
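
A rough sketch of the effect, in plain Python with the standard library only.
This is not boilerpipe or Nutch code. It assumes, purely for illustration,
that everything inside <nav> and <footer> counts as boilerplate; boilerpipe
itself decides this with text heuristics, not tag names. The class name,
function name, and sample page are all made up.

```python
from html.parser import HTMLParser

# Assumption for this sketch only: these tags mark boilerplate blocks.
BOILERPLATE_TAGS = {"nav", "footer"}


class LinkCollector(HTMLParser):
    """Collects href values, optionally ignoring links inside boilerplate."""

    def __init__(self, skip_boilerplate):
        super().__init__()
        self.skip_boilerplate = skip_boilerplate
        self.depth = 0  # nesting depth inside boilerplate tags
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.depth += 1
        elif tag == "a":
            if self.skip_boilerplate and self.depth > 0:
                return  # link sits in boilerplate, so it is dropped
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.depth > 0:
            self.depth -= 1


def outlinks(html, skip_boilerplate):
    collector = LinkCollector(skip_boilerplate)
    collector.feed(html)
    collector.close()
    return collector.links


PAGE = """
<html><body>
<nav><a href="/news">News</a> <a href="/archive">Archive</a></nav>
<p>Article text with one <a href="/related">related link</a>.</p>
<footer><a href="/contact">Contact</a></footer>
</body></html>
"""

print(outlinks(PAGE, skip_boilerplate=False))
print(outlinks(PAGE, skip_boilerplate=True))
```

On this sample page the full parse yields four links and the stripped parse
only /related, so /news, /archive and /contact would never be queued.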

> 
> Best Regards,
> C.B.
> 
> On Sat, Jul 9, 2011 at 9:23 PM, Markus Jelsma <[email protected]> wrote:
> > I'll patch against 1.4 and 1.3. Not sure on 1.2. It works but hasn't been
> > tested well. It also suffers from the problem of finding outlinks as it
> > won't detect outlinks that are marked as boilerplate text.
> > 
> >> I have been familiar with boilerpipe, and I was glad to see:
> >> https://issues.apache.org/jira/browse/NUTCH-961
> >> 
> >> Is this a working patch? And if so, can I patch against the latest?
> >> 
> >> Best Regards,
> >> C.B.
