Re: Getting html pages through a Nutch crawl (for a dataset)

Sameendra Samarawickrama Mon, 23 Jan 2012 07:07:03 -0800

Yes, it has a dump file which contains 'CrawlDatums', and I found some html
content in it. But to get the html pages out of it, I think you would have
to process it further, right? And if my crawl contains several thousand web
pages, will that one file contain the contents of all the pages? Is this
how it is supposed to work?
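
As a first check on whether the dump covers every page, something like the
sketch below could count the records and list their URLs. It assumes each
record in the text dump starts with a line of the form
"<url><TAB>Version: <n>", which should be verified against the actual
part-00000 file. The function names list_urls and count_records are only
illustrative and are not part of Nutch.

```shell
#!/bin/sh
# Sketch only. ASSUMPTION: every record in the readdb text dump begins with
# a line of the form "<url><TAB>Version: <n>". Check your own part-00000.

# Print the URL of every record found in the dump file given as $1.
list_urls() {
    awk -F '\t' '$2 ~ /^Version: / { print $1 }' "$1"
}

# Print the number of records found in the dump file given as $1.
count_records() {
    list_urls "$1" | wc -l | tr -d ' '
}

# Example usage (output_dir is the directory passed to "readdb -dump"):
#   count_records output_dir/part-00000
#   list_urls output_dir/part-00000 > urls.txt
```

If the count matches the number of pages in the crawl, the file at least
lists every URL; whether it also holds the full page contents is a separate
question.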


Thanks,
Sameendra

On Mon, Jan 23, 2012 at 8:02 PM, remi tassing <[email protected]> wrote:

> Hi,
>
> in your output directory, you should see two files:
> 1. .part-00000.crc
> 2. part-00000
>
> Open the second one with a text editor and you should be able to see the
> crawled urls. If there is no html in there, you probably didn't
> crawl any.
>
> Remi
>
> On Mon, Jan 23, 2012 at 4:08 PM, Sameendra Samarawickrama <
> [email protected]> wrote:
>
> > Hi,
> > I tried the readdb command, but I can't get the html pages with it.
> > Thanks,
> > Sameendra
> >
> > On Mon, Jan 23, 2012 at 12:14 PM, remi tassing <[email protected]
> > >wrote:
> >
> > > Hi Sameendra,
> > >
> > > read this page:  http://wiki.apache.org/nutch/bin/nutch_readdb
> > >
> > > For instance, the following command will read your database and output
> > > the crawled URLs to the directory output_dir:
> > >
> > > bin/nutch readdb crawl/crawldb -dump output_dir
> > >
> > > Remi
> > >
> > > On Sun, Jan 22, 2012 at 1:02 PM, Lewis John Mcgibbney <
> > > [email protected]> wrote:
> > >
> > > > The best method is to read or dump the contents of your crawldb and
> > > > work based on this.
> > > >
> > > > Please have a look at the wiki for how to use the readdb tool.
> > > >
> > > > On Sun, Jan 22, 2012 at 10:51 AM, Sameendra Samarawickrama <
> > > > [email protected]> wrote:
> > > >
> > > > > Hi,
> > > > > I am using Nutch to generate a small dataset of web pages, on
> > > > > which I am planning to run a focused crawler later.
> > > > >
> > > > > I did a test crawl and I have the 'segments' folder built up.
> > > > > Now I need to get the exact html pages it fetched from the seed
> > > > > url/s.
> > > > >
> > > > > Is it possible to create a dataset this way? If so, how do I get
> > > > > those html pages?
> > > > >
> > > > > Thanks a lot!
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > *Lewis*
> > > >
> > >
> >
>
