Hello,
I'm new to Nutch, so pardon me if this question has been asked before
(a search of the archives didn't turn anything up). I'm trying to use Nutch to
crawl a website and then get a list of all URLs on the site, including image
URLs. I just need the URLs themselves, not the page/image content or anything
like that.
Right now I know how to run Nutch on the command line, and after
crawling/indexing I can dump the links file to see all the links, so that's a
starting point. But what I really want is to run a Nutch crawl
programmatically (examples of that exist) and then retrieve those links
programmatically (I can't find any examples of that). I'd also like the final
link printout to include all image URLs found on the crawled site, if that's
even possible.
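To make the image part concrete, here is a rough standalone sketch (plain Java
with a regex, just to illustrate the kind of output I'm after; I'm assuming
that inside Nutch this sort of logic would live in a parse filter plugin, but
I don't know the API well enough to say):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone sketch: pull <img src="..."> URLs out of raw HTML.
// This is only an illustration of the result I want from the crawl;
// the regex is not meant as a robust HTML parser.
public class ImageLinkSketch {
    private static final Pattern IMG_SRC =
        Pattern.compile("<img[^>]*\\ssrc\\s*=\\s*[\"']([^\"']+)[\"']",
                        Pattern.CASE_INSENSITIVE);

    public static List<String> extractImageUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));  // the src attribute value
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<p>hi</p><img src=\"/logo.png\"><IMG SRC='pics/cat.jpg'>";
        System.out.println(extractImageUrls(html));
    }
}
```

In other words, for each crawled page I'd like those image URLs collected
alongside the ordinary outlinks.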
Any help is greatly appreciated!
Thanks,
Branden Makana