Hi Klemens,

you should run ./bin/nutch readseg! For example:

./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder -nofetch -nogenerate -noparse -noparsedata -noparsetext

(If you would rather read the segment programmatically with the Hadoop API, see the sketch below the quoted message.)

Kind Regards from Hannover
Hannes

On Mon, Nov 22, 2010 at 9:23 AM, Klemens Muthmann <[email protected]> wrote:
> Hi,
>
> I did a small crawl of some pages on the web and want to get the raw HTML
> content of these pages now. Reading the documentation in the wiki I guess
> this content might be somewhere under
> crawl/segments/20101122071139/content/part-00000.
>
> I also guess I can access this content using the Hadoop API like described
> here: http://wiki.apache.org/nutch/Getting_Started
>
> However I have absolutely no idea how to configure:
>
> MapFile.Reader reader = new MapFile.Reader(fs, seqFile, conf);
>
> The Hadoop documentation is not very helpful either. May someone please
> point me in the right direction to get the page content?
>
> Thank you and regards
> Klemens Muthmann
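For the archives, here is a minimal, untested sketch of the MapFile.Reader approach, assuming a Nutch 1.x segment layout (keys are the page URLs as Text, values are org.apache.nutch.protocol.Content) and the Hadoop 0.20-era constructor that takes the part directory as a String. The segment path is the one from Klemens' mail; the class name DumpSegmentContent is just an example:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class DumpSegmentContent {
  public static void main(String[] args) throws IOException {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // A MapFile is a directory containing "data" and "index" files;
    // pass the directory itself, not the data file inside it.
    String part = "crawl/segments/20101122071139/content/part-00000";
    MapFile.Reader reader = new MapFile.Reader(fs, part, conf);

    Text url = new Text();
    Content content = new Content();
    while (reader.next(url, content)) {
      System.out.println("URL: " + url);
      // getContent() returns the raw fetched bytes, i.e. the HTML for
      // web pages. The real charset depends on the page, so decoding
      // everything as UTF-8 here may mangle some documents.
      System.out.println(new String(content.getContent(), "UTF-8"));
    }
    reader.close();
  }
}

The "seqFile" argument from the question is just that part-00000 directory, and "fs" and "conf" come from FileSystem.get() and NutchConfiguration.create() as above. If the crawl has several reducers there will be several part-NNNNN directories, one reader each.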

