I discovered that the Protocol extension point is a good place to do this,
since it is responsible for actually fetching content.
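Roughly, I'm picturing something like the sketch below: a Protocol
implementation that delegates the actual fetch to protocol-http and tees
every payload to disk on the way through. This assumes the Nutch 1.x
Protocol interface; the class name, the "archive.dir" property, and the
file layout are my own inventions, not anything Nutch provides.

```java
package org.example.nutch;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.ProtocolOutput;
import org.apache.nutch.protocol.http.Http;

import crawlercommons.robots.BaseRobotRules;

public class ArchivingHttpProtocol implements Protocol {

  private Configuration conf;
  private Http http;        // delegate that does the real fetching
  private Path archiveDir;  // hypothetical local sink for raw payloads

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    this.http = new Http();
    this.http.setConf(conf);
    this.archiveDir = Paths.get(conf.get("archive.dir", "/tmp/archive"));
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
    ProtocolOutput output = http.getProtocolOutput(url, datum);
    Content content = output.getContent();
    if (content != null && content.getContent() != null) {
      try {
        // Flat filename from a URL hash, just for the sketch.
        Files.createDirectories(archiveDir);
        Path out = archiveDir.resolve(Integer.toHexString(url.hashCode()));
        Files.write(out, content.getContent());
      } catch (IOException e) {
        // Saving is best-effort; don't fail the fetch over it.
      }
    }
    return output;
  }

  @Override
  public BaseRobotRules getRobotRules(Text url, CrawlDatum datum) {
    return http.getRobotRules(url, datum);
  }
}
```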

Is it possible with Nutch to fetch content that I may not want to
parse/index?

Example: I want to fetch images in addition to HTML, but I only want the
HTML to be parsed and indexed.

Thanks,
Joe

-----Original Message-----
From: Joseph Naegele [mailto:[email protected]] 
Sent: Monday, February 08, 2016 7:29 PM
To: '[email protected]' <[email protected]>
Subject: Crawling while collecting resources

My goal is to use Nutch "normally" to crawl, parse, extract links, and
index the textual content, but with the added goal of fetching and saving
*all* resources found at outlinks. It is my understanding that there is no
straightforward method for collecting resources this way, i.e. no dedicated
extension point. I found a few posts where users asked how to save the
original content of crawled resources. I'll address those options here:

1. Modify the fetcher code to "save" fetched resources
(http://stackoverflow.com/a/10060160/1689220). This is not a modular
approach.
2. Write an HtmlParseFilter that adds the original byte content to the parse
data, and an IndexingFilter that just adds the same content to the document
(http://www.mail-archive.com/user%40nutch.apache.org/msg03659.html). I don't
think this makes sense for non-HTML resources.
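To make option 2 concrete, the HtmlParseFilter half would look roughly
like the sketch below (Nutch 1.x interfaces; the "rawcontent" key is made
up, and since Nutch's Metadata values are Strings, the raw bytes take a
Base64 detour through commons-codec):

```java
package org.example.nutch;

import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class RawContentParseFilter implements HtmlParseFilter {

  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    // Stash the original bytes in the parse metadata so a matching
    // IndexingFilter can copy them into the document later.
    Parse parse = parseResult.get(content.getUrl());
    if (parse != null) {
      String encoded = Base64.encodeBase64String(content.getContent());
      parse.getData().getContentMeta().set("rawcontent", encoded);
    }
    return parseResult;
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}
```

As noted, though, this hinges on the resource making it through an
HTML-oriented parse in the first place, which is why it seems wrong for
images and other non-HTML resources.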

Another approach would be to implement a "Parser" that isn't a parser at
all, but one that stores the original resource content however I see fit
and then returns `null`, which causes Nutch to try the next configured
Parser (e.g. parse-tika). This might work, and my pseudo-parser could even
stop things like images from being passed on to the real parser.
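Sketched out, that pseudo-parser might look like the following, assuming
the Nutch 1.x Parser interface and assuming my reading is right that a
null return makes parsing fall through to the next configured parser. The
class name and the store() helper are placeholders:

```java
package org.example.nutch;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.protocol.Content;

public class StoreOnlyParser implements Parser {

  private Configuration conf;

  @Override
  public ParseResult getParse(Content content) {
    store(content);  // persist the raw bytes however we see fit
    // Returning null should hand the content to the next parser in
    // line (e.g. parse-tika), or, for images, end parsing here.
    return null;
  }

  private void store(Content content) {
    // Placeholder: e.g. write content.getContent() keyed by
    // content.getUrl() to whatever storage makes sense.
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}
```

It would presumably need to be registered ahead of parse-tika in
parse-plugins.xml for each content type I want to intercept.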

If we just assume that textual resources contain outlinks and non-textual
resources do not, then ideally Nutch would fetch *all* links, pass
everything it fetches to my code for storage, and pass only the textual
resources on to be parsed and indexed. What would be the best way to do
this?

Thanks,
Joe
