Retrieve the original HTML from nutch-1.4 crawldb

邓尧 Thu, 22 Dec 2011 18:06:56 -0800

Hi,

I need a large number of HTML pages for a research project. I followed the
tutorial on the wiki page and set up a nutch-1.4 crawler (without solr). I can
now dump the extracted text from the segments, but the HTML tags are stripped.
How can I retrieve the original HTML pages from the crawled database? Are the
original HTML pages actually stored by nutch at all?


Thanks

-Yao
