I believe the raw pages are stored in the /content subdir of a segment. If you need a lot of pages, you could also take a look at: http://www.commoncrawl.org/
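
Concretely, the content/ part files in a segment are Hadoop MapFiles keyed by URL, with org.apache.nutch.protocol.Content values holding the raw fetched bytes, so you can read the original HTML back out yourself. Below is a minimal, untested sketch (the class name, the path argument, and the UTF-8 assumption are mine, not from Nutch); point it at a part file's data component and adjust for your exact Nutch/Hadoop versions:

    // Minimal sketch: read raw fetched pages out of a segment's content directory.
    // Assumes Nutch 1.4 / Hadoop 0.20-era APIs. The path argument is illustrative,
    // and the bytes are printed as UTF-8 even though the real charset is whatever
    // the server sent (check the Content-Type metadata if that matters to you).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;

    public class DumpRawPages {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // e.g. crawl/segments/20111223030600/content/part-00000/data
        // (the MapFile's data component is itself a SequenceFile)
        Path data = new Path(args[0]);

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();            // key: page URL
        Content content = new Content();  // value: raw fetched content
        try {
          while (reader.next(url, content)) {
            System.out.println("== " + url + " ==");
            System.out.println(new String(content.getContent(), "UTF-8"));
          }
        } finally {
          reader.close();
        }
      }
    }

If I remember correctly, "bin/nutch readseg -dump <segment> <outdir>" will also include the raw content in its text dump, which may be enough if you just want to eyeball a few pages rather than process them programmatically.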
On Dec 23, 2011, at 3:06 , 邓尧 wrote:

> Hi,
>
> I need tons of HTML pages to do research. I followed the tutorial in the
> wiki page and set up a nutch-1.4 crawler (without solr). I can now dump the
> extracted text from the segments, unfortunately the HTML tags are stripped.
> How can I retrieve the original HTML pages from the crawled database? Or
> are the original HTML pages actually stored by nutch?
>
> Thanks
>
> -Yao

