I need the content. :(

On Mon, Jan 23, 2012 at 9:47 PM, remi tassing <[email protected]> wrote:
> If you need the URLs, then yes, you just need to further process that file.
>
> If you need the content of those HTML files, then I'm not sure how to do
> that.
>
> On Monday, January 23, 2012, Sameendra Samarawickrama <[email protected]> wrote:
> > Yes, it has a dump file which contains 'CrawlDatums', and I found some
> > HTML content in it, but to get HTML pages out of it I think you will
> > have to process it further, right? What if my crawl contains several
> > thousand web pages: will that file contain the contents of all the
> > pages? Is this the way it happens?
> >
> > Thanks,
> > Sameendra
> >
> > On Mon, Jan 23, 2012 at 8:02 PM, remi tassing <[email protected]> wrote:
> >> Hi,
> >>
> >> In your output directory, you should see two files:
> >> 1. .part-00000.crc
> >> 2. part-00000
> >>
> >> Open the second one with a text editor and you should be able to see
> >> the crawled URLs. If there is no HTML in there, you probably didn't
> >> crawl any.
> >>
> >> Remi
> >>
> >> On Mon, Jan 23, 2012 at 4:08 PM, Sameendra Samarawickrama <[email protected]> wrote:
> >> > Hi,
> >> > I tried the readdb command, but I can't get the HTML pages with it.
> >> > Thanks,
> >> > Sameendra
> >> >
> >> > On Mon, Jan 23, 2012 at 12:14 PM, remi tassing <[email protected]> wrote:
> >> > > Hi Sameendra,
> >> > >
> >> > > Read this page: http://wiki.apache.org/nutch/bin/nutch_readdb
> >> > >
> >> > > For instance, the following command will read your database and
> >> > > output the crawled URLs to the directory output_dir:
> >> > >
> >> > > bin/nutch readdb crawl/crawldb -dump output_dir
> >> > >
> >> > > Remi
> >> > >
> >> > > On Sun, Jan 22, 2012 at 1:02 PM, Lewis John Mcgibbney <[email protected]> wrote:
> >> > > > The best method is to read or dump the contents of your crawldb
> >> > > > and work based on this.
> >> > > >
> >> > > > Please have a look on the wiki for using the readdb tool.
> >> > > >
> >> > > > On Sun, Jan 22, 2012 at 10:51 AM, Sameendra Samarawickrama <[email protected]> wrote:
> >> > > > > Hi,
> >> > > > > I am using Nutch to generate a small dataset of the web, a
> >> > > > > dataset on which I am planning to run a focused crawler later.
> >> > > > >
> >> > > > > I did a test crawl and I have the 'segments' folder built up.
> >> > > > > Now I need to get the exact HTML pages it fetched out of the
> >> > > > > seed URL/s.
> >> > > > >
> >> > > > > Is it possible to create a dataset this way? If so, how do I
> >> > > > > get those HTML pages?
> >> > > > >
> >> > > > > Thanks a lot!
> >> > > >
> >> > > > --
> >> > > > *Lewis*
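For the URL side of the question, Remi's suggestion to "further process that file" can be sketched as below. This is a minimal, hedged example, not a Nutch tool: it assumes the plain-text `readdb -dump` format in which each CrawlDatum record begins with the URL as the first tab-separated field, followed by metadata lines such as `Status:` and `Fetch time:`. (As the thread notes, this recovers only the URLs from the CrawlDb dump, not the fetched page content.)

```python
def extract_urls(dump_text):
    """Return the URLs that start records in a `bin/nutch readdb ... -dump`
    text output file.

    Assumes each record's first line looks like:
        http://example.com/<TAB>Version: 7
    and that metadata lines (Status:, Fetch time:, ...) do not start
    with a URL scheme.
    """
    urls = []
    for line in dump_text.splitlines():
        first_field = line.split("\t", 1)[0]
        if first_field.startswith(("http://", "https://")):
            urls.append(first_field)
    return urls
```

Usage would be something like `extract_urls(open("output_dir/part-00000").read())`, with `output_dir` being the directory produced by the `readdb ... -dump` command above.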

