It is in the big dump file output by the readseg command.
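To get individual pages back out of that dump, you have to split it yourself. A minimal sketch, assuming the dump uses `Recno::` record headers and a bare `Content:` line before the raw page bytes (check these markers against your own `bin/nutch readseg -dump` output; the sample file below is made up):

```shell
# Hypothetical sample of a readseg dump -- the marker names are assumptions
cat > dump <<'EOF'
Recno:: 0
URL:: http://example.com/

Content::
contentType: text/html
Content:
<html><body>hello</body></html>
EOF

# Print everything after the bare "Content:" line, stopping at the next record
awk '/^Content:$/{keep=1; next} /^Recno::/{keep=0} keep' dump > page.html
cat page.html
```

Note that `^Content:$` deliberately does not match the `Content::` section header, only the bare `Content:` line that precedes the page body.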
> I need the content. :(
>
> On Mon, Jan 23, 2012 at 9:47 PM, remi tassing <[email protected]> wrote:
> > If you need the urls, then yes, you just need to further process that
> > file.
> >
> > If you need the content of those html files, then I'm not sure how
> > to do that.
> >
> > On Monday, January 23, 2012, Sameendra Samarawickrama <
> > [email protected]> wrote:
> > > yes, it has a dump file which contains 'CrawlDatums'. And I found some
> > > html content in it, but to get html pages out of it I think you will
> > > have to further process it, right? What if my crawl contains several
> > > thousand web pages? Will that file contain the contents of all the
> > > pages? Is this the way it happens?
> > >
> > > Thanks,
> > > Sameendra
> > >
> > > On Mon, Jan 23, 2012 at 8:02 PM, remi tassing <[email protected]> wrote:
> > >> Hi,
> > >>
> > >> in your output directory, you should see two files:
> > >> 1. .part-00000.crc
> > >> 2. part-00000
> > >>
> > >> Open the second one with a text editor and you should be able to see
> > >> the crawled urls. If there is no html in there, you probably
> > >> didn't crawl any.
> > >>
> > >> Remi
> > >>
> > >> On Mon, Jan 23, 2012 at 4:08 PM, Sameendra Samarawickrama <
> > >> [email protected]> wrote:
> > >> > Hi,
> > >> > I tried the readdb command, but I can't get the html pages with it.
> > >> > Thanks,
> > >> > Sameendra
> > >> >
> > >> > On Mon, Jan 23, 2012 at 12:14 PM, remi tassing <[email protected]> wrote:
> > >> > > Hi Sameendra,
> > >> > >
> > >> > > read this page: http://wiki.apache.org/nutch/bin/nutch_readdb
> > >> > >
> > >> > > For instance, the following command will read your database and
> > >> > > output the crawled URLs to the directory output_dir:
> > >> > >
> > >> > > bin/nutch readdb crawl/crawldb -dump output_dir
> > >> > >
> > >> > > Remi
> > >> > >
> > >> > > On Sun, Jan 22, 2012 at 1:02 PM, Lewis John Mcgibbney <
> > >> > > [email protected]> wrote:
> > >> > > > The best method is to read or dump the contents of your crawldb
> > >> > > > and work based on this.
> > >> > > >
> > >> > > > Please have a look on the wiki for using the readdb tool.
> > >> > > >
> > >> > > > On Sun, Jan 22, 2012 at 10:51 AM, Sameendra Samarawickrama <
> > >> > > > [email protected]> wrote:
> > >> > > > > Hi,
> > >> > > > > I am using Nutch to generate a small dataset of the web, a
> > >> > > > > dataset on which I am planning to run a focused crawler later.
> > >> > > > >
> > >> > > > > I did a test crawl and I have the 'segments' folder built up.
> > >> > > > > Now I need to get the exact html pages it fetched out of the
> > >> > > > > seed url/s.
> > >> > > > >
> > >> > > > > Is it possible to create a dataset this way? If so, how do I
> > >> > > > > get those html pages?
> > >> > > > >
> > >> > > > > Thanks a lot!
> > >> > > >
> > >> > > > --
> > >> > > > *Lewis*
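For the readdb route mentioned in the thread, the `part-00000` file under `output_dir` is plain text, so pulling out just the URLs is a one-liner. A hedged sketch, assuming each record starts with the URL followed by whitespace and metadata (the sample file below is made up; check the layout of your own dump first):

```shell
# Hypothetical sample of a readdb dump record layout -- an assumption
cat > part-00000 <<'EOF'
http://example.com/  Version: 7
Status: 2 (db_fetched)
http://example.com/about  Version: 7
Status: 1 (db_unfetched)
EOF

# Keep only the lines that begin with a URL, then drop the metadata columns
grep -E '^https?://' part-00000 | awk '{print $1}' > urls.txt
cat urls.txt
```

This only recovers URLs, not page content; for the fetched html itself, the readseg dump discussed at the top of the thread is the place to look.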

