Hi,

When you are done with crawling, you can try the dump command. Its usage is as
follows:

*$ bin/nutch dump [-h] [-mimetype <mimetype>] [-outputDir <outputDir>]*
*                 [-segment <segment>]*
* -h,--help                show this help message*
* -mimetype <mimetype>     an optional list of mimetypes to dump, excluding*
*                          all others. Defaults to all.*
* -outputDir <outputDir>   output directory (which will be created) to host*
*                          the raw data*
* -segment <segment>       the segment(s) to use*

So, for your case, you can run:

*$ bin/nutch dump -segment crawl/segments -outputDir crawl/dump/*

which will create a new directory at the path given to -outputDir and dump all
the crawled pages as HTML files.
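
If you only want HTML pages (and not images, stylesheets, etc.), you can also
add the -mimetype option from the usage above to restrict what gets dumped.
The exact value to filter on is my assumption here, so adjust it to the
content types you actually need:

*$ bin/nutch dump -segment crawl/segments -outputDir crawl/dump/ -mimetype text/html*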

Alternatively, the CommonCrawlDataDumper tool may also be useful for your case:
https://wiki.apache.org/nutch/CommonCrawlDataDumper
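
If you go that route, it is invoked through the same launcher. From memory the
invocation looks roughly like the lines below, but please verify the command
name and options against that wiki page and the built-in help, since they may
differ between Nutch versions:

*$ bin/nutch commoncrawldump -h*
*$ bin/nutch commoncrawldump -segment crawl/segments -outputDir crawl/ccdump/*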


Kind Regards,
Furkan KAMACI

On Tue, Apr 5, 2016 at 6:29 PM, Markus Jelsma <[email protected]>
wrote:

> Hello - you should try the newer dump tool; it dumps HTML files as-is to
> some directory.
> Markus
>
>
>
> -----Original message-----
> > From:Vijay Veluchamy <[email protected]>
> > Sent: Tuesday 5th April 2016 17:24
> > To: [email protected]
> > Subject: RE: How to read segment dump?
> >
> > Hi,
> >
> > I am looking to crawl a website and save the pages as HTML files. After
> > that, I need to parse them and extract the elements in them.
> >
> > Thanks,
> > Vijay
> > On Apr 5, 2016 8:37 PM, "Markus Jelsma" <[email protected]> wrote:
> >
> > > Hello, segment dumps are notoriously hard to comprehend. What information
> > > are you looking for? What do you mean by reading contents of a website?
> > > Markus
> > >
> > >
> > >
> > > -----Original message-----
> > > > From:Vijay Veluchamy <[email protected]>
> > > > Sent: Tuesday 5th April 2016 16:22
> > > > To: [email protected]
> > > > Subject: How to read segment dump?
> > > >
> > > > Hi Team,
> > > >
> > > > I need to crawl a website using Apache Nutch. Currently, I am using
> > > > Nutch 1.x.
> > > >
> > > > I have followed the steps provided in the following URL up to the
> > > > 'invertlinks' step.
> > > >
> > > > https://wiki.apache.org/nutch/NutchTutorial
> > > >
> > > > Then, I used the 'readseg' command to dump the segments. The dump file
> > > > is created successfully.
> > > >
> > > > Now, I have the following questions.
> > > >
> > > > 1. Is this the right file (the segment dump file) to read the contents
> > > > of a website? If yes, how do I read the contents from the dump file? I
> > > > am unable to read it as it looks encrypted.
> > > > 2. Otherwise, how can I read the contents of a website?
> > > >
> > > > Thanks,
> > > > Vijay
> > > >
> > >
> >
>
