Hi,

I did a small crawl of some pages on the web and now want to get the raw HTML content of these pages. From reading the documentation in the wiki, I guess this content is somewhere under crawl/segments/20101122071139/content/part-00000.

I also guess I can access this content using the Hadoop API, as described here: http://wiki.apache.org/nutch/Getting_Started

However, I have absolutely no idea how to configure the arguments of this call:

MapFile.Reader reader = new MapFile.Reader(fs, seqFile, conf);


The Hadoop documentation is not very helpful either. Could someone please point me in the right direction to get the page content?

Thank you and regards
    Klemens Muthmann
