How to get page content of crawled pages

Klemens Muthmann Mon, 22 Nov 2010 00:24:21 -0800

Hi,

I did a small crawl of some pages on the web and now want to get the raw HTML content of these pages. From the documentation in the wiki, I guess this content is somewhere under crawl/segments/20101122071139/content/part-00000.

I also guess I can access this content using the Hadoop API, as described here: http://wiki.apache.org/nutch/Getting_Started


However, I have absolutely no idea how to configure:

MapFile.Reader reader = new MapFile.Reader(fs, seqFile, conf);

The Hadoop documentation is not very helpful either. Could someone please point me in the right direction to get the page content?


Thank you and regards
    Klemens Muthmann
