Hi Alex,

Thanks for your reply. I want to have Nutch crawl a site and then get a list of all pages/images from the crawl. I am fluent in Java; I was just looking for pointers on where to begin.
From the tutorial, I did see a "crawl/

On Wednesday, July 14, 2010, Alex McLintock <[email protected]> wrote:
> I'm a bit confused as to what you want to do, what skills you have available,
> and how much you can code yourself. Presumably you have seen the
> linkdb, and you see that there is code to read from the linkdb?
>
> Have you looked at the readdb facility? You probably want to look at
> the class org.apache.nutch.crawl.CrawlDbReader
>
> Alex
>
> On 14 July 2010 21:34, Branden Root <[email protected]> wrote:
>> Hello,
>>
>> I'm new to Nutch, so pardon me if this question has been asked before
>> (an archive search didn't turn up anything). I'm trying to use Nutch to crawl
>> a website and then get a list of all URLs on the site, including image
>> URLs. I just need the URLs themselves, not the page/image content or
>> anything like that.
>>
>> Right now I know how to run Nutch on the command line; after
>> crawling/indexing, I can view the link file to see all the links,
>> so that's a starting point. But I really want to be able to
>> programmatically run a Nutch crawl (examples exist) and then programmatically
>> retrieve those links (I can't find any examples of that). I also want to include all
>> image hrefs from the crawled site in the final link printout (if that is even
>> possible).
>>
>> Any help is greatly appreciated!
>>
>> Thanks,
>> Branden Makana

--
Branden Root
Chief Technology Officer | 206.575.3740
Portent Interactive
An Internet Marketing Agency
http://www.portentinteractive.com
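[Following up on Alex's pointer to CrawlDbReader, for anyone finding this in the archives: a minimal sketch of reading the URLs back out of the crawldb in-process. It assumes the Nutch 1.x on-disk layout, where the crawldb is a Hadoop MapFile of Text keys (URLs) to CrawlDatum values; the class name and the part-file path below are illustrative, not Nutch API.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

/** Prints every URL stored in one crawldb part file. */
public class DumpCrawlDbUrls {
  public static void main(String[] args) throws Exception {
    // args[0]: path to a crawldb data file, e.g. crawl/crawldb/current/part-00000/data
    // (hypothetical path -- use whatever your own crawl actually produced)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Text url = new Text();               // key: the URL
    CrawlDatum datum = new CrawlDatum(); // value: fetch status, score, etc.
    while (reader.next(url, datum)) {
      System.out.println(url);           // just the URL, not the page content
    }
    reader.close();
  }
}

[The command-line equivalent is "bin/nutch readdb crawl/crawldb -dump <outdir>", which is what CrawlDbReader implements; the snippet just does the same read in-process. Note that image URLs will only show up if the parser emitted them as outlinks -- if memory serves, parse-html's outlink extraction does include img src by default.]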

