I discovered that the Protocol extension point is a good place to do this,
since it is responsible for actually fetching content.
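Roughly, I'm picturing something like the sketch below: a Protocol
implementation that delegates the actual fetch to protocol-http and tees
every payload to disk on the way through. This assumes the Nutch 1.x
Protocol interface; the class name, the "archive.dir" property, and the
file layout are my own inventions, not anything Nutch provides.

```java
package org.example.nutch;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.ProtocolOutput;
import org.apache.nutch.protocol.http.Http;

import crawlercommons.robots.BaseRobotRules;

public class ArchivingHttpProtocol implements Protocol {

  private Configuration conf;
  private Http http;        // delegate that does the real fetching
  private Path archiveDir;  // hypothetical local sink for raw payloads

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    this.http = new Http();
    this.http.setConf(conf);
    this.archiveDir = Paths.get(conf.get("archive.dir", "/tmp/archive"));
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
    ProtocolOutput output = http.getProtocolOutput(url, datum);
    Content content = output.getContent();
    if (content != null && content.getContent() != null) {
      try {
        // Flat filename from a URL hash, just for the sketch.
        Files.createDirectories(archiveDir);
        Path out = archiveDir.resolve(Integer.toHexString(url.hashCode()));
        Files.write(out, content.getContent());
      } catch (IOException e) {
        // Saving is best-effort; don't fail the fetch over it.
      }
    }
    return output;
  }

  @Override
  public BaseRobotRules getRobotRules(Text url, CrawlDatum datum) {
    return http.getRobotRules(url, datum);
  }
}
```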

Is it possible with Nutch to fetch content that I may not want to
parse/index?

Example: I want to fetch images in addition to HTML, but I only want the
HTML to be parsed and indexed.

Thanks,
Joe

-----Original Message-----
From: Joseph Naegele [mailto:[email protected]] 
Sent: Monday, February 08, 2016 7:29 PM
To: '[email protected]' <[email protected]>
Subject: Crawling while collecting resources

My goal is to use Nutch "normally" to crawl, parse, extract links, and
index the textual content, but with the added goal of fetching and saving
*all* resources found at outlinks. It is my understanding that there is no
straightforward method for collecting resources this way, i.e. no dedicated
extension point. I found a few posts where users asked how to save the
original content of crawled resources. I'll address those options here:

1. Modify the fetcher code to "save" fetched resources
(http://stackoverflow.com/a/10060160/1689220). This is not a modular
approach.
2. Write an HtmlParseFilter that adds the original byte content to the parse
data, and an IndexingFilter that just adds the same content to the document
(http://www.mail-archive.com/user%40nutch.apache.org/msg03659.html). I don't
think this makes sense for non-HTML resources.
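To make option 2 concrete, the HtmlParseFilter half would look roughly
like the sketch below (Nutch 1.x interfaces; the "rawcontent" key is made
up, and since Nutch's Metadata values are Strings, the raw bytes take a
Base64 detour through commons-codec):

```java
package org.example.nutch;

import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class RawContentParseFilter implements HtmlParseFilter {

  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    // Stash the original bytes in the parse metadata so a matching
    // IndexingFilter can copy them into the document later.
    Parse parse = parseResult.get(content.getUrl());
    if (parse != null) {
      String encoded = Base64.encodeBase64String(content.getContent());
      parse.getData().getContentMeta().set("rawcontent", encoded);
    }
    return parseResult;
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}
```

As noted, though, this hinges on the resource making it through an
HTML-oriented parse in the first place, which is why it seems wrong for
images and other non-HTML resources.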

Another approach would be to implement a "Parser" that isn't a parser at
all, but one that stores the original resource content however I see fit
and then returns `null`, which causes Nutch to try the next configured
Parser (e.g. parse-tika). This might work, and my pseudo-parser could even
stop things like images from being passed on to the real parser.
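Sketched out, that pseudo-parser might look like the following, assuming
the Nutch 1.x Parser interface and assuming my reading is right that a
null return makes parsing fall through to the next configured parser. The
class name and the store() helper are placeholders:

```java
package org.example.nutch;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.protocol.Content;

public class StoreOnlyParser implements Parser {

  private Configuration conf;

  @Override
  public ParseResult getParse(Content content) {
    store(content);  // persist the raw bytes however we see fit
    // Returning null should hand the content to the next parser in
    // line (e.g. parse-tika), or, for images, end parsing here.
    return null;
  }

  private void store(Content content) {
    // Placeholder: e.g. write content.getContent() keyed by
    // content.getUrl() to whatever storage makes sense.
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}
```

It would presumably need to be registered ahead of parse-tika in
parse-plugins.xml for each content type I want to intercept.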

If we just assume that textual resources contain outlinks and non-textual
resources do not, then ideally Nutch would fetch *all* links, pass
everything it fetches to my code for storage, and pass only the textual
resources on to be parsed and indexed. What would be the best way to do
this?

Thanks,
Joe
