Hi,

When you are done with crawling, you can try the dump command. Its usage is as follows:
$ bin/nutch dump [-h] [-mimetype <mimetype>] [-outputDir <outputDir>]
       [-segment <segment>]
 -h,--help                show this help message
 -mimetype <mimetype>     an optional list of mimetypes to dump, excluding
                          all others. Defaults to all.
 -outputDir <outputDir>   output directory (which will be created) to host
                          the raw data
 -segment <segment>       the segment(s) to use

So, you can apply that:

$ bin/nutch dump -segment crawl/segments -outputDir crawl/dump/

which will create a new directory at -outputDir and dump all the crawled
pages in HTML format.

On the other hand, this may also be useful for your case:
https://wiki.apache.org/nutch/CommonCrawlDataDumper

Kind Regards,
Furkan KAMACI

On Tue, Apr 5, 2016 at 6:29 PM, Markus Jelsma <[email protected]> wrote:

> Hello - you should try the newer dump tool, it dumps HTML files as is to
> some directory.
> Markus
>
> -----Original message-----
> > From: Vijay Veluchamy <[email protected]>
> > Sent: Tuesday 5th April 2016 17:24
> > To: [email protected]
> > Subject: RE: How to read segment dump?
> >
> > Hi,
> >
> > I am looking for crawling a website as HTML files. After that, I need to
> > parse them and get the elements in it.
> >
> > Thanks,
> > Vijay
> >
> > On Apr 5, 2016 8:37 PM, "Markus Jelsma" <[email protected]> wrote:
> >
> > > Hello, segment dumps are notoriously hard to comprehend. What
> > > information are you looking for? What do you mean by reading the
> > > contents of a website?
> > > Markus
> > >
> > > -----Original message-----
> > > > From: Vijay Veluchamy <[email protected]>
> > > > Sent: Tuesday 5th April 2016 16:22
> > > > To: [email protected]
> > > > Subject: How to read segment dump?
> > > >
> > > > Hi Team,
> > > >
> > > > I need to crawl a website using Apache Nutch. Currently, I am using
> > > > Nutch 1.x.
> > > >
> > > > I have followed the steps provided in the following URL up to the
> > > > 'invertlinks' step:
> > > >
> > > > https://wiki.apache.org/nutch/NutchTutorial
> > > >
> > > > Then, I used the 'readseg' command to dump the segments. The dump file
> > > > is created successfully.
> > > >
> > > > Now, I have the following questions.
> > > >
> > > > 1. Is this the right file (segment dump file) to read the contents of
> > > > a website? If yes, how do I read the contents from the dump file? I am
> > > > unable to read it, as it looks encrypted.
> > > > 2. Otherwise, how can I read the contents of a website?
> > > >
> > > > Thanks,
> > > > Vijay
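A note on the "looks encrypted" readseg output from the original question: the full segment dump mixes crawl metadata, raw content, and parse data for every URL, which is what makes it hard to read. If you want to stay with readseg rather than the dump tool, restricting the dump to the parsed text keeps the output human-readable. A sketch, with a placeholder segment timestamp (run bin/nutch readseg with no arguments to confirm the exact options on your version):

$ bin/nutch readseg -dump crawl/segments/20160405102233 crawl/segdump \
      -nocontent -nofetch -nogenerate -noparse -noparsedata

This writes only the ParseText records into crawl/segdump, i.e. the plain text extracted from each page rather than the raw fetched content.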

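Once the pages are on disk (for example via the bin/nutch dump command in the reply above), the "parse them and get the elements" step can be done with any HTML parser. Below is a minimal Java sketch using jsoup; the crawl/dump path, the .html extension filter, and jsoup itself are assumptions for illustration, not something the thread prescribes:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class DumpParser {
    public static void main(String[] args) throws IOException {
        // Directory created by: bin/nutch dump -segment crawl/segments -outputDir crawl/dump/
        Path dumpDir = Paths.get("crawl/dump");
        try (Stream<Path> files = Files.walk(dumpDir)) {
            files.filter(p -> p.toString().endsWith(".html"))
                 .forEach(DumpParser::extract);
        }
    }

    private static void extract(Path file) {
        try {
            Document doc = Jsoup.parse(file.toFile(), "UTF-8");
            // Pull whatever elements you need; title and links shown as an example.
            System.out.println(file + " -> " + doc.title());
            doc.select("a[href]").forEach(a -> System.out.println("    " + a.attr("href")));
        } catch (IOException e) {
            System.err.println("Skipping " + file + ": " + e.getMessage());
        }
    }
}

Swap the doc.select(...) call for whatever CSS selector matches the elements you actually need, e.g. doc.select("div.content p").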

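If the CommonCrawlDataDumper linked above is a better fit for your case (for example to get the crawl out in the CommonCrawl/WARC-style layout), it is driven through the same launcher script. A hedged example only; the exact flag names vary by Nutch release, so confirm them with the tool's own help output rather than taking this invocation as authoritative:

$ bin/nutch commoncrawldump -outputDir crawl/commoncrawl -segment crawl/segments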