Hi,

I need tons of HTML pages for a research project. I followed the tutorial on
the wiki page and set up a nutch-1.4 crawler (without Solr). I can now dump the
extracted text from the segments, but unfortunately the HTML tags are stripped.
How can I retrieve the original HTML pages from the crawled database? Or
are the original HTML pages actually stored by Nutch at all?

Thanks

-Yao
