Could we use both the normal parse-html and parse-tika, and write their output to different fields in a Solr instance?
The problem is finding outlinks? I did not understand this one. Do you mean we won't be able to find outlinks from a tika-boilerplate parsed page because the boilerpipe detection changes links to text?

Best Regards,
C.B.

On Sat, Jul 9, 2011 at 9:23 PM, Markus Jelsma <[email protected]> wrote:
> I'll patch against 1.4 and 1.3. Not sure on 1.2. It works but hasn't been
> tested well. It also suffers from the problem of finding outlinks as it won't
> detect outlinks that are marked as boilerplate text.
>
>> I have been familiar with boilerpipe, and I was glad to see:
>> https://issues.apache.org/jira/browse/NUTCH-961
>>
>> Is this a working patch? And if so, can I patch against the latest?
>>
>> Best Regards,
>> C.B.
>

