I need the content. :(

On Mon, Jan 23, 2012 at 9:47 PM, remi tassing <[email protected]> wrote:
> If you need the URLs, then yes, you just need to further process that file.
>
> If you need the content of those HTML files, then I'm not sure how to do
> that.
>
> On Monday, January 23, 2012, Sameendra Samarawickrama <[email protected]> wrote:
> > Yes, it has a dump file which contains 'CrawlDatums', and I found some
> > HTML content in it, but to get HTML pages out of it I think you will
> > have to process it further, right? What if my crawl contains several
> > thousand web pages: will that file contain the contents of all the
> > pages? Is this the way it happens?
> >
> > Thanks,
> > Sameendra
> >
> > On Mon, Jan 23, 2012 at 8:02 PM, remi tassing <[email protected]> wrote:
> >> Hi,
> >>
> >> In your output directory, you should see two files:
> >> 1. .part-00000.crc
> >> 2. part-00000
> >>
> >> Open the second one with a text editor and you should be able to see
> >> the crawled URLs. If there is no HTML in there, you probably didn't
> >> crawl any.
> >>
> >> Remi
> >>
> >> On Mon, Jan 23, 2012 at 4:08 PM, Sameendra Samarawickrama <[email protected]> wrote:
> >> > Hi,
> >> > I tried the readdb command, but I can't get the HTML pages with it.
> >> > Thanks,
> >> > Sameendra
> >> >
> >> > On Mon, Jan 23, 2012 at 12:14 PM, remi tassing <[email protected]> wrote:
> >> > > Hi Sameendra,
> >> > >
> >> > > Read this page: http://wiki.apache.org/nutch/bin/nutch_readdb
> >> > >
> >> > > For instance, the following command will read your database and
> >> > > output the crawled URLs to the directory output_dir:
> >> > >
> >> > > bin/nutch readdb crawl/crawldb -dump output_dir
> >> > >
> >> > > Remi
> >> > >
> >> > > On Sun, Jan 22, 2012 at 1:02 PM, Lewis John Mcgibbney <[email protected]> wrote:
> >> > > > The best method is to read or dump the contents of your crawldb
> >> > > > and work based on this.
> >> > > >
> >> > > > Please have a look on the wiki for using the readdb tool.
> >> > > >
> >> > > > On Sun, Jan 22, 2012 at 10:51 AM, Sameendra Samarawickrama <[email protected]> wrote:
> >> > > > > Hi,
> >> > > > > I am using Nutch to generate a small dataset of the web, a
> >> > > > > dataset on which I am planning to run a focused crawler later.
> >> > > > >
> >> > > > > I did a test crawl and I have the 'segments' folder built up.
> >> > > > > Now I need to get the exact HTML pages it fetched out of the
> >> > > > > seed URL/s.
> >> > > > >
> >> > > > > Is it possible to create a dataset this way? If so, how do I
> >> > > > > get those HTML pages?
> >> > > > >
> >> > > > > Thanks a lot!
> >> > > >
> >> > > > --
> >> > > > *Lewis*
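For the URL side of the question, Remi's suggestion to "further process that file" can be sketched as below. This is a minimal, hedged example, not a Nutch tool: it assumes the plain-text `readdb -dump` format in which each CrawlDatum record begins with the URL as the first tab-separated field, followed by metadata lines such as `Status:` and `Fetch time:`. (As the thread notes, this recovers only the URLs from the CrawlDb dump, not the fetched page content.)

```python
def extract_urls(dump_text):
    """Return the URLs that start records in a `bin/nutch readdb ... -dump`
    text output file.

    Assumes each record's first line looks like:
        http://example.com/<TAB>Version: 7
    and that metadata lines (Status:, Fetch time:, ...) do not start
    with a URL scheme.
    """
    urls = []
    for line in dump_text.splitlines():
        first_field = line.split("\t", 1)[0]
        if first_field.startswith(("http://", "https://")):
            urls.append(first_field)
    return urls
```

Usage would be something like `extract_urls(open("output_dir/part-00000").read())`, with `output_dir` being the directory produced by the `readdb ... -dump` command above.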

