Retrieve full DOM with BoilerpipeCH

Markus Jelsma Thu, 30 Jun 2011 04:46:05 -0700

Hi,

In Nutch, the BoilerpipeContentHandler returns only a partial DOM. Nutch uses 
the DOM parsed by Tika to find outlinks, so whenever I use BP I can retrieve 
only a small number of links, or even none at all.


I can work around this problem by parsing normally and then sending the 
same (already parsed) input stream to Tika again, this time through BP.
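
To illustrate the workaround, here is a minimal sketch of the double-parse 
pattern. It uses only the JDK's SAX parser so that it runs on its own; it is 
not the actual Nutch or Tika code. LinkHandler and TextHandler are 
hypothetical stand-ins for the outlink pass and the BP pass, and the input is 
assumed to be well-formed XHTML, buffered fully in memory.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DoubleParse {

    // Hypothetical stand-in for the normal parse: collects href values of <a> elements.
    static class LinkHandler extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            String href = atts.getValue("href");
            if ("a".equals(qName) && href != null) {
                links.add(href);
            }
        }
    }

    // Hypothetical stand-in for the BP pass: keeps character data only.
    static class TextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // First pass over the buffered bytes: full document, used for outlinks.
    static List<String> extractLinks(byte[] page) throws Exception {
        LinkHandler handler = new LinkHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(page), handler);
        return handler.links;
    }

    // Second pass over the same bytes: this is the repeated parse.
    static String extractText(byte[] page) throws Exception {
        TextHandler handler = new TextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(page), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] page = ("<html><body><p>Hello <a href=\"a.html\">one</a> and "
                + "<a href=\"b.html\">two</a></p></body></html>").getBytes("UTF-8");
        System.out.println(extractLinks(page));
        System.out.println(extractText(page));
    }
}
```

The content has to be buffered (or re-fetched) because a stream can only be 
consumed once, and every document is tokenized twice.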

This double parsing seems silly, so I'm looking for advice on how to do this 
better.

Thanks,
