Hi everybody! I am developing an image parsing plugin for Nutch 1.1, to be used on intranet pages. The plugin traverses each document, collecting all images and parsing them synchronously. Parsing itself is not expensive, as long as we do not re-parse the same images every time they appear throughout the site. About 70% of the parsed pages contain only images we have already seen, such as logos, footer images, etc. To avoid this unnecessary load, we are now trying to build a sort of cache holding the addresses of all images that have already been processed. The problem is that we are running in a distributed environment.
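To give an idea, this is roughly the kind of cache we can build today within a single JVM (a minimal sketch; the class and method names are illustrative, not actual plugin code):

    import java.util.Collections;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative only: a thread-safe set of image URLs already parsed
    // inside one JVM. Concurrent parser threads on the same node can
    // share it, but each node gets its own independent copy.
    public class ParsedImageCache {

        private static final Set<String> SEEN =
                Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

        // Returns true if the URL was not seen before (and records it),
        // meaning the caller should go ahead and parse the image.
        public static boolean markIfUnseen(String imageUrl) {
            return SEEN.add(imageUrl);
        }
    }

This works for the threads on one node, but in a distributed crawl each node keeps its own independent set, so the same logo still gets parsed once per node.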
With that in mind, is there a kind of Hadoop-level distributed cache our plugin can access, so that every node knows what has already been parsed by the others? A singleton object visible to all nodes would suffice. I have read about Hadoop's DistributedCache, but it does not seem to be what I need:

http://hadoop.apache.org/common/docs/r0.18.3/api/org/apache/hadoop/filecache/DistributedCache.html
http://hadoop.apache.org/common/docs/r0.18.3/mapred_tutorial.html#DistributedCache
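If it helps, here is a rough sketch of how I understand the 0.18.x DistributedCache API (the file name is just a placeholder):

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class DistributedCacheSketch {
        public static void main(String[] args) throws IOException {
            JobConf conf = new JobConf();

            // Registered once, at job submission time; the file is then
            // copied read-only to the local disk of every task node.
            DistributedCache.addCacheFile(
                    URI.create("hdfs:///data/seen-images.txt"), conf);

            // Inside a running task, the local copy can be looked up and read:
            Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);

            // Nothing a task writes to its local copy is propagated back
            // or shared with other tasks, so this cannot serve as a live,
            // mutable set of already-parsed image URLs.
        }
    }

As far as I can tell, the files are distributed read-only when the job starts, so writes made by one task are never visible to the others, which is exactly what I would need.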
Thanks in advance, Emmanuel de Castro Santana
