Hi, In Nutch, the BoilerpipeContentHandler returns only a partial DOM. Nutch uses the DOM parsed by Tika to find outlinks, so whenever I use Boilerpipe (BP) I can retrieve only a small number of links, or even none at all.
I can work around this problem by parsing normally and later sending the (already parsed) input stream to Tika again, this time through BP. This double parsing seems silly, so I'm looking for advice on how to do this better. Thanks,
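
For illustration, here is a minimal sketch of the double-parsing workaround described above. It does not use Nutch, Tika or Boilerpipe: plain JDK SAX handlers stand in for them, and the names `DoubleParseSketch`, `LinkCollector` and `MainTextCollector` are hypothetical. `MainTextCollector` only imitates the "partial view" that BP produces by keeping the text inside `<p>` elements; it is not the Boilerpipe algorithm. The point is only that the page bytes are buffered once and then parsed twice, once for the links and once for the main text.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DoubleParseSketch {

    // Pass 1: sees the whole document and collects every href (the outlinks).
    static class LinkCollector extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("a".equals(qName) && atts.getValue("href") != null) {
                links.add(atts.getValue("href"));
            }
        }
    }

    // Pass 2: hypothetical stand-in for BP; keeps only the text inside <p> elements.
    static class MainTextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        private int paragraphDepth = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("p".equals(qName)) {
                paragraphDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("p".equals(qName)) {
                paragraphDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (paragraphDepth > 0) {
                text.append(ch, start, length);
            }
        }
    }

    // Each pass gets a fresh stream over the same buffered bytes.
    static void parse(byte[] page, DefaultHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(page), handler);
    }

    static List<String> extractLinks(byte[] page) throws Exception {
        LinkCollector collector = new LinkCollector();
        parse(page, collector);
        return collector.links;
    }

    static String extractMainText(byte[] page) throws Exception {
        MainTextCollector collector = new MainTextCollector();
        parse(page, collector);
        return collector.text.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] page = ("<html><body><div><a href=\"/nav\">Nav</a></div>"
                + "<p>Main text with <a href=\"/story\">a link</a>.</p></body></html>")
                .getBytes(StandardCharsets.UTF_8);

        System.out.println("Outlinks (first parse): " + extractLinks(page));
        System.out.println("Main text (second parse): " + extractMainText(page));
    }
}
```

The cost of this approach is what the post complains about: every page is tokenized twice, and the raw bytes have to stay buffered between the two passes.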
