Hello,
I'm new to Nutch, so pardon me if this question has been asked before
(a search of the archives didn't turn anything up). I'm trying to use Nutch to
crawl a website and then get a list of all URLs on the site, including image
URLs. I just need the URLs themselves, not the page/image content or anything
like that.
Right now I know how to run Nutch on the command line, and after
crawling/indexing I can dump the links file to see all the links, so that's a
starting point. But what I really want is to run a Nutch crawl
programmatically (examples of that exist) and then retrieve those links
programmatically (I can't find any examples of that). I'd also like the final
link printout to include all image URLs found on the crawled site, if that's
even possible.
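To make the image part concrete, here is a rough standalone sketch (plain Java
with a regex, just to illustrate the kind of output I'm after; I'm assuming
that inside Nutch this sort of logic would live in a parse filter plugin, but
I don't know the API well enough to say):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone sketch: pull <img src="..."> URLs out of raw HTML.
// This is only an illustration of the result I want from the crawl;
// the regex is not meant as a robust HTML parser.
public class ImageLinkSketch {
    private static final Pattern IMG_SRC =
        Pattern.compile("<img[^>]*\\ssrc\\s*=\\s*[\"']([^\"']+)[\"']",
                        Pattern.CASE_INSENSITIVE);

    public static List<String> extractImageUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));  // the src attribute value
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<p>hi</p><img src=\"/logo.png\"><IMG SRC='pics/cat.jpg'>";
        System.out.println(extractImageUrls(html));
    }
}
```

In other words, for each crawled page I'd like those image URLs collected
alongside the ordinary outlinks.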
Any help is greatly appreciated!
Thanks,
Branden Makana