Is there a way to extract it and then inject it again? I would like to extract the content and metadata from each page, change them, and then persist the result. Is there a simple way to do that with the Nutch classes, or what else could I do?
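
To make it concrete, below is roughly what I had in mind - an untested sketch that reads the <url, Content> records straight out of a segment's content directory, tweaks the per-page metadata, and writes the modified records to a new file. The paths and the "my-custom-field" key are just made-up examples, and I realize merging the output back into the segment would be a separate step:

// Untested sketch: read <url, Content> pairs from a Nutch 1.x segment,
// tweak the per-page metadata, and write the modified records elsewhere.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class SegmentContentRewriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // One part file of a segment's content directory (path is a guess).
    Path in = new Path("mycrawl/segments/20100717000000/content/part-00000/data");
    // Modified copy; getting it back into the segment is a separate step.
    Path out = new Path("content-modified/part-00000/data");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Content.class);

    Text url = new Text();
    Content content = new Content();
    try {
      while (reader.next(url, content)) {
        // content.getContent() is the raw page as a byte[],
        // content.getMetadata() holds the per-page metadata.
        content.getMetadata().set("my-custom-field", "my-value");
        writer.append(url, content);
      }
    } finally {
      reader.close();
      writer.close();
    }
  }
}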
On Sat, Jul 17, 2010 at 11:16 AM, Rayala Udayakumar <[email protected]> wrote:
> Hi,
>
> You can use the -format option along with the dump, which will allow
> you to get the crawl db dump in CSV format.
>
> "bin/nutch readlinkdb mycrawl/linkdb -dump linkdbout -format csv"
>
> The CSV will have the following fields:
>
> Url;Status code;Status name;Fetch Time;Modified Time;Retries since
> fetch;Retry interval;Score;Signature;Metadata
>
> I am not sure if there will be only one row per URL. If not, you can
> then extract the unique URLs from the CSV file.
>
> - Uday.
>
> On Sat, Jul 17, 2010 at 12:35 AM, Branden Makana
> <[email protected]> wrote:
> > Hi All,
> >
> > I never heard a reply back from Alex, so I'm wondering if anyone
> > else has insight. I know that I can run
> >
> > "bin/nutch readlinkdb mycrawl/linkdb -dump linkdbout"
> >
> > to have Nutch dump the list of pages crawled to a text file, so
> > I'm sure it's possible to get a list of all pages crawled on a site.
> > However, the format from that command is not very easy to parse - I just
> > want a list of each unique page URL from the crawl, which I guess I could
> > get by parsing that output and removing duplicate URLs.
> >
> > I'm also looking at the source of CrawlDbReader - it looks like
> > however you call the command line, it'll end up creating a NutchJob and
> > sending it off to Hadoop. I'm not familiar with Hadoop, so that's another
> > stumbling block.
> >
> > Could someone familiar with the crawldb tell me if I'm on the
> > right track? Again, I just want a list of all pages on a site from a crawl
> > (without duplicates, but I can remove them myself if I have to). Should I
> > be trying to parse the output of "readlinkdb -dump", or should I be trying
> > to run some job through Hadoop?
> >
> > Many Thanks,
> > Branden Makana
> >
> > On Jul 14, 2010, at 9:57 PM, Branden Root wrote:
> >
> >> Hi Alex,
> >>
> >> Thanks for your reply. I want to have Nutch crawl a site, then get a
> >> list of all pages/images on the site from the crawl. I am fluent in Java,
> >> but I'm looking for pointers to where to begin.
> >>
> >> From running the tutorial, I did see a file created by the crawl,
> >> "links/part-00000", with plaintext info on all the site's pages - is that
> >> the linkdb you refer to?
> >>
> >> Thanks,
> >> Branden Makana
> >>
> >> On Wednesday, July 14, 2010, Alex McLintock <[email protected]> wrote:
> >>> I'm a bit confused as to what you want to do, what skills you have
> >>> available, and how much you can code yourself. Presumably you have seen
> >>> the linkdb, and you see that there is code to read from the linkdb?
> >>>
> >>> Have you looked at the ReadDB facility? You probably want to look at
> >>> the class org.apache.nutch.crawl.CrawlDbReader
> >>>
> >>> Alex
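
P.S. On the unique-URL question further down the thread: as far as I understand it, the crawldb keeps exactly one entry per URL it knows about, so reading the keys directly already gives a duplicate-free list without going through a dump and removing duplicates afterwards. A rough sketch along the same lines (again untested, and the part-file path is a guess):

// Untested sketch: print each URL in the crawldb once.
// The crawldb maps URL -> CrawlDatum, so the keys should already be unique.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class ListCrawledUrls {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // One part file of the crawldb; loop over all part-* dirs in a real run.
    Path db = new Path("mycrawl/crawldb/current/part-00000/data");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, db, conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    try {
      while (reader.next(url, datum)) {
        // One line per URL known to the crawldb; to list only fetched pages,
        // check datum.getStatus() against CrawlDatum.STATUS_DB_FETCHED first.
        System.out.println(url);
      }
    } finally {
      reader.close();
    }
  }
}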

