Hi,
I did a small crawl of some pages on the web and want to get the raw
HTML content of these pages now. Reading the documentation in the wiki I
guess this content might be somewhere under
crawl/segments/20101122071139/content/part-00000.
I also guess I can access this content using the Hadoop API, as
described here: http://wiki.apache.org/nutch/Getting_Started
However, I have absolutely no idea how to set up the arguments (fs,
seqFile and conf) for this call:
MapFile.Reader reader = new MapFile.Reader(fs, seqFile, conf);
The Hadoop documentation is not very helpful either. Could someone please
point me in the right direction to get the page content?
Thank you and regards,
Klemens Muthmann