It is in the big dump file output by the readseg command.
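To get individual pages back out of that dump, you have to split it yourself. A minimal sketch, assuming the dump uses `Recno::` record headers and a bare `Content:` line before the raw page bytes (check these markers against your own `bin/nutch readseg -dump` output; the sample file below is made up):

```shell
# Hypothetical sample of a readseg dump -- the marker names are assumptions
cat > dump <<'EOF'
Recno:: 0
URL:: http://example.com/

Content::
contentType: text/html
Content:
<html><body>hello</body></html>
EOF

# Print everything after the bare "Content:" line, stopping at the next record
awk '/^Content:$/{keep=1; next} /^Recno::/{keep=0} keep' dump > page.html
cat page.html
```

Note that `^Content:$` deliberately does not match the `Content::` section header, only the bare `Content:` line that precedes the page body.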
> I need the content. :(
>
> On Mon, Jan 23, 2012 at 9:47 PM, remi tassing <[email protected]> wrote:
> > If you need the urls, then yes, you just need to further process that
> > file.
> >
> > If you need the content of those html files, then I'm not sure how
> > to do that.
> >
> > On Monday, January 23, 2012, Sameendra Samarawickrama <
> > [email protected]> wrote:
> > > yes, it has a dump file which contains 'CrawlDatums'. And I found some
> > > html content in it, but to get html pages out of it I think you will
> > > have to further process it, right? What if my crawl contains several
> > > thousand web pages? Will that file contain the contents of all the
> > > pages? Is this the way it happens?
> > >
> > > Thanks,
> > > Sameendra
> > >
> > > On Mon, Jan 23, 2012 at 8:02 PM, remi tassing <[email protected]> wrote:
> > >> Hi,
> > >>
> > >> in your output directory, you should see two files:
> > >> 1. .part-00000.crc
> > >> 2. part-00000
> > >>
> > >> Open the second one with a text editor and you should be able to see
> > >> the crawled urls. If there is no html in there, you probably
> > >> didn't crawl any.
> > >>
> > >> Remi
> > >>
> > >> On Mon, Jan 23, 2012 at 4:08 PM, Sameendra Samarawickrama <
> > >> [email protected]> wrote:
> > >> > Hi,
> > >> > I tried the readdb command, but I can't get the html pages with it.
> > >> > Thanks,
> > >> > Sameendra
> > >> >
> > >> > On Mon, Jan 23, 2012 at 12:14 PM, remi tassing <[email protected]> wrote:
> > >> > > Hi Sameendra,
> > >> > >
> > >> > > read this page: http://wiki.apache.org/nutch/bin/nutch_readdb
> > >> > >
> > >> > > For instance, the following command will read your database and
> > >> > > output the crawled URLs to the directory output_dir:
> > >> > >
> > >> > > bin/nutch readdb crawl/crawldb -dump output_dir
> > >> > >
> > >> > > Remi
> > >> > >
> > >> > > On Sun, Jan 22, 2012 at 1:02 PM, Lewis John Mcgibbney <
> > >> > > [email protected]> wrote:
> > >> > > > The best method is to read or dump the contents of your crawldb
> > >> > > > and work based on this.
> > >> > > >
> > >> > > > Please have a look on the wiki for using the readdb tool.
> > >> > > >
> > >> > > > On Sun, Jan 22, 2012 at 10:51 AM, Sameendra Samarawickrama <
> > >> > > > [email protected]> wrote:
> > >> > > > > Hi,
> > >> > > > > I am using Nutch to generate a small dataset of the web, a
> > >> > > > > dataset on which I am planning to run a focused crawler later.
> > >> > > > >
> > >> > > > > I did a test crawl and I have the 'segments' folder built up.
> > >> > > > > Now I need to get the exact html pages it fetched out of the
> > >> > > > > seed url/s.
> > >> > > > >
> > >> > > > > Is it possible to create a dataset this way? If so, how do I
> > >> > > > > get those html pages?
> > >> > > > >
> > >> > > > > Thanks a lot!
> > >> > > >
> > >> > > > --
> > >> > > > *Lewis*
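For the readdb route mentioned in the thread, the `part-00000` file under `output_dir` is plain text, so pulling out just the URLs is a one-liner. A hedged sketch, assuming each record starts with the URL followed by whitespace and metadata (the sample file below is made up; check the layout of your own dump first):

```shell
# Hypothetical sample of a readdb dump record layout -- an assumption
cat > part-00000 <<'EOF'
http://example.com/  Version: 7
Status: 2 (db_fetched)
http://example.com/about  Version: 7
Status: 1 (db_unfetched)
EOF

# Keep only the lines that begin with a URL, then drop the metadata columns
grep -E '^https?://' part-00000 | awk '{print $1}' > urls.txt
cat urls.txt
```

This only recovers URLs, not page content; for the fetched html itself, the readseg dump discussed at the top of the thread is the place to look.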

