Hi,
I would like to crawl a set of URLs looking only for a specific type of
file, for example all images or all RSS feeds.
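
One approach I've considered (just a sketch, untested) is to whitelist
the target suffixes in conf/regex-urlfilter.txt, e.g. for images:

    # accept image URLs only
    +\.(gif|GIF|jpg|JPG|jpeg|JPEG|png|PNG)$
    # reject everything else
    -.

but as far as I can tell that would also filter out the HTML pages I
need to fetch in order to discover those links, so it doesn't seem like
the whole answer.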

Right now I can successfully run generate/fetch/parse/index, but I don't
want to do a lot of parse work if I don't need to.
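
For reference, the cycle I'm running now looks roughly like this, plus
an indexing step at the end (the paths and <segment> are just
placeholders):

    bin/nutch inject mycrawl/crawldb urls
    bin/nutch generate mycrawl/crawldb mycrawl/segments
    bin/nutch fetch mycrawl/segments/<segment>
    bin/nutch parse mycrawl/segments/<segment>
    bin/nutch updatedb mycrawl/crawldb mycrawl/segments/<segment>
    bin/nutch invertlinks mycrawl/linkdb -dir mycrawl/segments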

I also don't want to parse any of the files I find; I just want to grab
links to them.

I understand I can dump the linkdb using "bin/nutch readlinkdb
mycrawl/linkdb -dump linkdbout -format csv", but what would be the most
efficient Nutch cycle for getting links to these files without doing a
lot of extraneous parsing work?
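
Once I have the dump, I could grep it for the suffixes I care about,
something like this (a rough sketch; I'm assuming the dump lands as
plain-text part-* files, so the exact file names may differ):

    bin/nutch readlinkdb mycrawl/linkdb -dump linkdbout -format csv
    # keep only image URLs from the dumped link records
    grep -iE '\.(gif|jpe?g|png)' linkdbout/part-*

But that still leaves all the parsing work upstream, which is what I'm
hoping to avoid.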

Thanks.
