Hi,

In Nutch, the BoilerpipeContentHandler returns only a partial DOM. Nutch uses 
the DOM parsed by Tika to find outlinks, so whenever I use Boilerpipe (BP) I can 
retrieve only a few links, or none at all.

I can work around this problem by parsing normally and then sending the same 
(already parsed) input stream to Tika a second time, this time through BP.

This double parsing seems silly, so I'm looking for advice on how to do this 
better.

Thanks,
