Sorry, had a crash. I have seen the plain-text file of pages crawled - is that really all I need, assuming I can get it to list the images on each page too?
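
To make it concrete, here is roughly what I want to end up with once the crawl finishes: walk the crawldb and print every URL it knows about. A rough sketch against Nutch 1.x / Hadoop 0.20 - the part-00000 path is a guess on my part, and with more than one reducer there would be several part-* directories to loop over:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.util.NutchConfiguration;

    public class ListCrawledUrls {
      public static void main(String[] args) throws IOException {
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // The crawldb is a map of URL -> CrawlDatum stored as Hadoop
        // SequenceFiles; this path assumes a local crawl dir named "crawl".
        Path data = new Path("crawl/crawldb/current/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();
        CrawlDatum datum = new CrawlDatum();
        while (reader.next(url, datum)) {
          System.out.println(url);  // just the URL, not the content
        }
        reader.close();
      }
    }

From Alex's pointer below, org.apache.nutch.crawl.CrawlDbReader looks like it does this and more (stats, dumps, single-URL lookups), and "bin/nutch readdb crawl/crawldb -dump <some_dir>" seems to be the command-line equivalent - presumably the plain-text file I was looking at.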
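
On the images: as far as I can tell, image URLs are filtered out before they ever reach the crawldb, because the stock URL filters skip image suffixes. In my 1.x checkout both conf/crawl-urlfilter.txt (which the one-shot crawl command appears to use) and conf/regex-urlfilter.txt contain a line like:

    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

(the exact suffix list may differ by version). If I drop the image extensions from that pattern, the img src outlinks that parse-html collects should survive updatedb and show up in the crawldb dump - does that sound right?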
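
And for kicking off the crawl from code: the examples I have found just invoke the crawl tool's entry point directly, with the same arguments as the command line. A minimal sketch, assuming the Nutch 1.x org.apache.nutch.crawl.Crawl class (the urls dir, depth, and topN values are only placeholders):

    import org.apache.nutch.crawl.Crawl;

    public class RunCrawl {
      public static void main(String[] args) throws Exception {
        // Equivalent to: bin/nutch crawl urls -dir crawl -depth 3 -topN 50
        Crawl.main(new String[] {
            "urls", "-dir", "crawl", "-depth", "3", "-topN", "50"
        });
      }
    }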
On Wednesday, July 14, 2010, Branden Makana <[email protected]> wrote:
> Hi Alex,
>
> Thanks for your reply. I want to have Nutch crawl a site, then get a
> list of all pages/images from the crawl. I am fluent in Java; I was
> just looking for pointers on where to begin.
>
> From the tutorial, I did see a "crawl/
>
> On Wednesday, July 14, 2010, Alex McLintock <[email protected]> wrote:
>> I'm a bit confused as to what you want to do, what skills you have
>> available, and how much you can code yourself. Presumably you have
>> seen the linkdb, and you see that there is code to read from the
>> linkdb?
>>
>> Have you looked at the readdb facility? You probably want to look at
>> the class org.apache.nutch.crawl.CrawlDbReader.
>>
>> Alex
>>
>> On 14 July 2010 21:34, Branden Root <[email protected]> wrote:
>>> Hello,
>>>
>>> I'm new to Nutch, so pardon me if this question has been asked
>>> before (an archives search didn't show anything). I'm trying to use
>>> Nutch to crawl a website and then get a list of all URLs on the
>>> site, including image URLs. I just need the URLs themselves, not
>>> the page/image content or anything like that.
>>>
>>> Right now I know how to run Nutch on the command line, and after
>>> crawling/indexing I can view the links/whatever file to see all the
>>> links, so that's a starting point. But I really want to be able to
>>> programmatically run a Nutch crawl (examples exist), then
>>> programmatically retrieve those links (no examples I can find). I
>>> also want to include all image hrefs on the crawled site in the
>>> final link printout (if that is even possible).
>>>
>>> Any help is greatly appreciated!
>>>
>>> Thanks,
>>> Branden Makana
>>>

--
Branden Root
Chief Technology Officer | 206.575.3740
Portent Interactive
An Internet Marketing Agency
http://www.portentinteractive.com

