Yes — use the SegmentReader tool to fetch the raw data from a segment's content directory.

On Friday 23 December 2011 08:28:26 Mathijs Homminga wrote:
> I believe they are stored in the /content subdir of a segment.
> If you need a lot of pages, you could also take a look at:
> http://www.commoncrawl.org/
>
> On Dec 23, 2011, at 3:06, 邓尧 wrote:
> > Hi,
> >
> > I need tons of HTML pages for a research project. I followed the
> > tutorial on the wiki page and set up a Nutch 1.4 crawler (without
> > Solr). I can now dump the extracted text from the segments, but
> > unfortunately the HTML tags are stripped. How can I retrieve the
> > original HTML pages from the crawled data? Or does Nutch actually
> > store the original HTML pages at all?
> >
> > Thanks
> >
> > -Yao

-- 
Markus Jelsma - CTO - Openindex
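As a concrete sketch of the SegmentReader approach: it is invoked through `bin/nutch readseg`, and the `-no*` flags suppress everything except the content directory so the dump contains the raw fetched bytes (including the HTML) rather than the stripped parse text. The segment path and output directory below are examples — substitute the ones from your own crawl.

```shell
# Dump only the raw fetched content from one segment.
# "crawl/segments/20111223082826" and "dump_raw" are example names.
bin/nutch readseg -dump crawl/segments/20111223082826 dump_raw \
  -nofetch -nogenerate -noparse -noparsedata -noparsetext

# The raw content (protocol headers plus HTML bodies) lands in a
# plain-text file under the output directory:
less dump_raw/dump
```

Note that the dump interleaves record metadata with the page bodies, so pulling out clean per-URL HTML files typically takes a small post-processing script over `dump_raw/dump`.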

