Hi Nutch users,

I'm trying to implement a simple "focused crawler" using Nutch. I'm new to
the software and I'm seeking some advice on how best to proceed.

The crawling algorithm I'd like to use is very basic. Each time I visit a
page, I first determine whether or not it's relevant to my topic. If the
page IS relevant, I update its score to 0, and I assign a score of -1 to
all of its outlinks. If the page is NOT relevant, I subtract 1 from the
page's score, and assign this decremented value to its outlinks.  In this
way, a page's score should always be the (negated) distance between it and
the nearest relevant page. Using this score when I create my fetch lists
should focus my crawl on the neighbourhood of the web that contains pages
relevant to my topic.
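For concreteness, here is the scoring rule above as a small standalone Java
sketch. The class and method names are mine for illustration only; they are
not part of the Nutch API:

```java
public class FocusedScorer {

    /**
     * New score for the page itself: a relevant page resets to 0,
     * an irrelevant page has 1 subtracted from its current score.
     */
    static int updatedPageScore(int oldScore, boolean relevant) {
        return relevant ? 0 : oldScore - 1;
    }

    /**
     * Score assigned to each of the page's outlinks: -1 if the page
     * is relevant, otherwise the page's own (decremented) score.
     */
    static int outlinkScore(int newPageScore, boolean relevant) {
        return relevant ? -1 : newPageScore;
    }

    public static void main(String[] args) {
        // Relevant page: score resets to 0, outlinks get -1.
        int s = updatedPageScore(-3, true);
        System.out.println(s + " " + outlinkScore(s, true));   // 0 -1

        // Irrelevant page one hop from a relevant one: score drops
        // to -2, and its outlinks inherit that decremented value.
        s = updatedPageScore(-1, false);
        System.out.println(s + " " + outlinkScore(s, false));  // -2 -2
    }
}
```

With this rule, a score of -k means the page is k hops from the nearest
known-relevant page, which is what the fetch-list generator would sort on.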

Implementing this algorithm in Nutch seems straightforward.  If I
understand the framework correctly, I should be able to put all the
necessary logic into a single scoring plugin.


The challenge I'm facing has to do with how my page relevance calculations
are made.

Unlike many crawling applications, I can't determine relevance simply by
examining a page's text and links.  Rather, to determine relevance, I need
to examine the content of the page's outlinks.  To be precise, I'm trying
to crawl the web for a particular kind of audio file. To determine the type
of the audio file, it needs to be downloaded and analyzed by an audio
classifier.  If an audio file is classified as one type, the page that
points to it is relevant; if it's categorized as another type, the page is
not relevant. *My central problem, then, is that I need to follow all of a
page's outlinks before I can assign the page a relevance score.* I'm
unsure how to do this in the context of the Nutch crawl cycle.

I can think of two potential solutions.

In the first solution, I could download some of the page's outlinks in the
updateScore method of my plugin. I'd first identify those outlinks that
looked like audio files by their extensions (.wav, .aiff, etc.).  I'd
then download and analyze the files. Then I'd update the page's score
depending on the results of the analysis. This is a straightforward
solution, but it seems kludgy and wasteful since it results in downloading
each audio file at least twice (once in updateScore, and then later by the
Nutch fetcher).
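The extension check in that first solution might look something like the
following sketch. The helper name and the extension list are my own
assumptions, not anything from Nutch:

```java
import java.util.List;
import java.util.Locale;

public class AudioLinkFilter {

    // Extensions that suggest an audio file; this list is illustrative
    // and would need to match whatever the audio classifier accepts.
    private static final List<String> AUDIO_EXTS =
            List.of(".wav", ".aiff", ".aif", ".mp3");

    /** True if the URL's path ends with a known audio extension. */
    static boolean looksLikeAudio(String url) {
        String path = url.toLowerCase(Locale.ROOT);
        int q = path.indexOf('?');          // ignore any query string
        if (q >= 0) {
            path = path.substring(0, q);
        }
        return AUDIO_EXTS.stream().anyMatch(path::endsWith);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeAudio("http://example.com/track.WAV"));
        System.out.println(looksLikeAudio("http://example.com/index.html"));
    }
}
```

Only outlinks passing this filter would then be downloaded and handed to
the classifier inside updateScore.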

A second solution would be to download, parse, and analyze the audio in
the standard Nutch crawl loop. Parsing could be done in a parse
plugin. If a file were found to be relevant, the scoring plugin could try
to update the scores of the pages that linked to it. This strategy would
avoid downloading the same audio multiple times, but I'm not sure it would
work well with Nutch's scoring framework.

Has anyone on the list ever dealt with a situation like this?  Which of the
two strategies above is more sensible?  Or is there a third, better
solution that I'm missing?

Your advice would be much appreciated.

Many thanks,


Dave
