Hi,

You can pass the -format option along with -dump to get the crawldb dump in CSV format:

    bin/nutch readdb mycrawl/crawldb -dump crawldbout -format csv

(Note this is readdb on the crawldb, not the readlinkdb command you were running; the fields below come from the crawldb.) The CSV will have the following fields:

    Url;Status code;Status name;Fetch Time;Modified Time;Retries since fetch;Retry interval;Score;Signature;Metadata

The crawldb keeps a single entry per URL, so you should get only one row per URL; if you do see duplicates anyway, you can extract the unique URLs from the CSV file yourself.
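If you need that de-duplication step, something along these lines should work. It's an untested sketch: it assumes the dump wrote a plain-text part file (e.g. crawldbout/part-00000) and that the URL is the first semicolon-separated field (UniqueUrls is just a throwaway name):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.TreeSet;

    public class UniqueUrls {
        public static void main(String[] args) throws Exception {
            // Collect URLs in a TreeSet so they come out unique and sorted.
            TreeSet<String> urls = new TreeSet<String>();
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            String line;
            while ((line = in.readLine()) != null) {
                int sep = line.indexOf(';');
                String field = sep > 0 ? line.substring(0, sep) : line;
                // Strip any quoting and keep only URL rows; this also
                // skips a header row if the dump includes one.
                field = field.replace("\"", "");
                if (field.startsWith("http")) {
                    urls.add(field);
                }
            }
            in.close();
            for (String url : urls) {
                System.out.println(url);
            }
        }
    }

It needs nothing but the JDK: java UniqueUrls crawldbout/part-00000 > urls.txt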
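On the Hadoop part of your question: you don't need to write any Hadoop code yourself. CrawlDbReader does create a NutchJob, but it also submits and runs it for you (with Hadoop's local runner, unless your configuration says otherwise), so invoking it is no different from running the command line. Since you said you're comfortable in Java, here is a minimal sketch that triggers the same dump programmatically by calling the entry point bin/nutch uses; it assumes the Nutch jars and its conf directory are on your classpath:

    import org.apache.nutch.crawl.CrawlDbReader;

    public class DumpCrawlDb {
        public static void main(String[] args) throws Exception {
            // Same arguments as on the command line; the reader builds
            // the NutchJob and runs it to completion.
            CrawlDbReader.main(new String[] {
                "mycrawl/crawldb",
                "-dump", "crawldbout",
                "-format", "csv"
            });
        }
    }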
"bin/nutch readlinkdb mycrawl/linkdb -dump linkdbout -format csv" The csv will have the following fields: Url;Status code;Status name;Fetch Time;Modified Time;Retries since fetch;Retry interval;Score;Signature;Metadata I am not sure if there will be only one row per url. If not, you can then extract the unique urls from the csv file. - Uday. On Sat, Jul 17, 2010 at 12:35 AM, Branden Makana <[email protected]> wrote: > Hi All, > > > I never heard a reply back from Alex, so I'm wondering if anyone else > has insight. I know that I can run > > "bin/nutch readlinkdb mycrawl/linkdb -dump linkdbout" > > to have Nutch dump the list of pages crawled to a text file, so I'm > sure it's possible to get a list of all pages crawled on a site. However, the > format from the command is not very easy to parse - I just want a list of > each unique page URL from the crawl, which I guess I could get by parsing > that output and removing duplicate URLs. > > I'm also looking at the source of CrawlDBReader - it looks like > however you call the command line, it'll end up creating a NutchJob and > sending it off to Hadoop. I'm not familiar with Hadoop so that's another > stumbling block. > > > Could someone familiar with the crawldb tell me if I'm on the right > track? Again, I just want a list of all pages on a site from a crawl (without > duplicates but I can remove them myself if I have to). Should I be trying to > parse the output of "readlinkdb -dump", or should I be trying to run some job > through Hadoop? > > > Many Thanks, > Branden Makana > > > > On Jul 14, 2010, at 9:57 PM, Branden Root wrote: > >> Hi Alex, >> >> Thanks for your reply. I want to have Nutch crawl a site, then get a list of >> all pages/images on the site from the crawl. I am fluent in Java, but I'm >> looking for pointers to where to begin. >> >> From running the tutorial, I did see a file created by the crawl, >> "links/part-00000" with plaintext info on all the site's pages - is that the >> lnkdb you refer to? >> >> >> >> Thanks, >> Branden Makana >> >> >> >> On Wednesday, July 14, 2010, Alex McLintock <[email protected]> wrote: >>> I'm a bit confused as to what you want to do, your skills available, >>> and how much you can code yourself. Presumably you have seen the >>> linksdb? and you see that there is code to read from linksdb? >>> >>> Have you looked at the ReadDB facility? You probably want to look at >>> the class org.apache.nutch.crawl.CrawlDbReader >>> >>> >>> Alex > >

