Re: Retrieve the original HTML from nutch-1.4 crawldb

Mathijs Homminga Thu, 22 Dec 2011 23:28:59 -0800

I believe the raw pages are stored in the content/ subdirectory of each segment.
If you need a lot of pages, you could also take a look at:
http://www.commoncrawl.org/
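
To get the raw content out of a segment, something along these lines might work. This is only a sketch: it assumes the readseg command of the 1.x line and its -no* switches for skipping the other segment parts. The function name and the NUTCH_BIN variable are made up for illustration.

```shell
#!/bin/sh
# Sketch: dump only the raw fetched content of a single Nutch segment.
# dump_raw_content and NUTCH_BIN are hypothetical names, not part of Nutch.
# NUTCH_BIN defaults to bin/nutch, i.e. run this from the Nutch install dir.
dump_raw_content() {
    segment_dir=$1   # one segment directory under your crawl's segments/
    output_dir=$2    # directory that will receive the text dump

    # Skip every segment part except content, so only the fetched
    # pages (with their HTML) end up in the dump.
    "${NUTCH_BIN:-bin/nutch}" readseg -dump "$segment_dir" "$output_dir" \
        -nofetch -nogenerate -noparse -noparsedata -noparsetext
}
```

You would call it as dump_raw_content <segment_dir> <output_dir>, once per segment. As far as I know the result is a single plain-text file in the output directory containing one record per URL, so you would still have to split it into separate pages yourself.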


On Dec 23, 2011, at 3:06, 邓尧 wrote:

> Hi,
> 
> I need tons of HTML pages for a research project. I followed the tutorial on
> the wiki page and set up a nutch-1.4 crawler (without solr). I can now dump
> the extracted text from the segments, but unfortunately the HTML tags are
> stripped. How can I retrieve the original HTML pages from the crawled
> database? Or are the original HTML pages not actually stored by nutch?
> 
> Thanks
> 
> -Yao
