Hi Nutch users,

I'm trying to implement a simple "focused crawler" using Nutch. I'm new to the software and I'm seeking some advice on how best to proceed.
The crawling algorithm I'd like to use is very basic. Each time I visit a page, I first determine whether or not it's relevant to my topic. If the page IS relevant, I update its score to 0 and assign a score of -1 to all of its outlinks. If the page is NOT relevant, I subtract 1 from the page's score and assign this decremented value to its outlinks. In this way, a page's score should always represent the distance between it and the nearest relevant page. Using this score when I create my fetch lists should focus my crawl on the neighbourhood of the web that contains pages relevant to my topic.

Implementing this algorithm in Nutch seems straightforward. If I understand the framework correctly, I should be able to put all the necessary logic into a single scoring plugin.

The challenge I'm facing has to do with how my page relevance calculations are made. Unlike many crawling applications, I can't determine relevance simply by examining a page's text and links. Rather, to determine relevance, I need to examine the content of the page's outlinks. To be precise, I'm trying to crawl the web for a particular kind of audio file. To determine the type of an audio file, it needs to be downloaded and analyzed by an audio classifier. If an audio file is classified as one type, the page that points to it is relevant; if it's classified as another type, the page is not relevant.

*My central problem, then, is that I need to follow all of a page's outlinks before I can assign the page a relevance score.* I'm unsure how to do this in the context of the Nutch crawl cycle. I can think of two potential solutions.

In the first solution, I could download some of the page's outlinks in the updateScore method of my plugin. I'd first identify the outlinks that looked like audio files by their extensions (.wav, .aiff, etc.), then download and analyze the files, and finally update the page's score based on the results of the analysis.
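To make the rule concrete, here's a rough plain-Java sketch of the scoring arithmetic and the naive extension filter I have in mind. The class and method names are illustrative only; this isn't tied to Nutch's actual scoring API:

```java
/**
 * Rough sketch of the distance-based scoring rule; class and method
 * names are illustrative and not part of the Nutch API.
 */
public class FocusedScoring {

    /** New score for a page once its relevance is known. */
    public static float pageScore(boolean relevant, float currentScore) {
        // A relevant page sits at distance 0 from the nearest relevant page.
        return relevant ? 0.0f : currentScore - 1.0f;
    }

    /** Score to assign to each of the page's outlinks. */
    public static float outlinkScore(boolean relevant, float currentScore) {
        // Outlinks are one hop further from the nearest relevant page.
        return relevant ? -1.0f : currentScore - 1.0f;
    }

    /** Naive extension test for outlinks that look like audio files. */
    public static boolean looksLikeAudio(String url) {
        String u = url.toLowerCase();
        return u.endsWith(".wav") || u.endsWith(".aiff") || u.endsWith(".aif");
    }

    public static void main(String[] args) {
        // A relevant page resets to 0 and marks its outlinks as distance 1.
        System.out.println(pageScore(true, -3.0f));     // 0.0
        System.out.println(outlinkScore(true, -3.0f));  // -1.0
        // An irrelevant page at -1 passes the decremented value along.
        System.out.println(outlinkScore(false, -1.0f)); // -2.0
        System.out.println(looksLikeAudio("http://example.com/song.WAV")); // true
    }
}
```

The actual plugin would of course read and write these scores through whatever hooks the scoring framework exposes; the sketch above is just the arithmetic.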
This is a straightforward solution, but it seems kludgy and wasteful, since it results in downloading each audio file at least twice (once in updateScore, and then again later by the Nutch fetcher).

A second possible solution would be to download, parse and analyze the audio in the standard Nutch crawl loop. Parsing could be done in a parse plugin. If a file were found to be relevant, the scoring plugin could try to update the scores of the pages that linked to it. This strategy would avoid downloading the same audio multiple times, but I'm not sure it would work well with Nutch's scoring framework.

Has anyone on the list ever dealt with a situation like this? Which of the two strategies above is more sensible? Or is there a third, better solution that I'm missing? Your advice would be much appreciated.

Many thanks,
Dave

