Yes — use the SegmentReader tool to fetch the raw data from a segment's content directory.

On Friday 23 December 2011 08:28:26 Mathijs Homminga wrote:
> I believe they are stored in the /content subdir of a segment.
> If you need a lot of pages, you could also take a look at:
> http://www.commoncrawl.org/
>
> On Dec 23, 2011, at 3:06, 邓尧 wrote:
> > Hi,
> >
> > I need tons of HTML pages for a research project. I followed the
> > tutorial on the wiki page and set up a Nutch 1.4 crawler (without
> > Solr). I can now dump the extracted text from the segments, but
> > unfortunately the HTML tags are stripped. How can I retrieve the
> > original HTML pages from the crawled data? Or does Nutch actually
> > store the original HTML pages at all?
> >
> > Thanks
> >
> > -Yao

-- 
Markus Jelsma - CTO - Openindex
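As a concrete sketch of the SegmentReader approach: it is invoked through `bin/nutch readseg`, and the `-no*` flags suppress everything except the content directory so the dump contains the raw fetched bytes (including the HTML) rather than the stripped parse text. The segment path and output directory below are examples — substitute the ones from your own crawl.

```shell
# Dump only the raw fetched content from one segment.
# "crawl/segments/20111223082826" and "dump_raw" are example names.
bin/nutch readseg -dump crawl/segments/20111223082826 dump_raw \
  -nofetch -nogenerate -noparse -noparsedata -noparsetext

# The raw content (protocol headers plus HTML bodies) lands in a
# plain-text file under the output directory:
less dump_raw/dump
```

Note that the dump interleaves record metadata with the page bodies, so pulling out clean per-URL HTML files typically takes a small post-processing script over `dump_raw/dump`.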

