Hi everybody!

I am developing an image parsing plugin for Nutch 1.1 for use on
intranet pages.
The plugin traverses each document, collecting all images and parsing
them synchronously.
Parsing is not expensive as long as we avoid re-parsing the same images
every time they reappear across the site. About 70% of the parsed pages
contain only recurring images: logos, footer images, and so on.
To avoid this unnecessary load, we are now trying to build a sort of
cache holding the addresses of all images that have already been
processed. The problem is that we run in a distributed environment.
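
To make the idea concrete, here is roughly the kind of cache we have in
mind (class and method names are just illustrative, not our real code).
Being a static set, it is of course only visible within a single JVM:

import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ParsedImageCache {

    // Thread-safe set of image URLs that have already been parsed.
    private static final Set<String> SEEN = Collections.newSetFromMap(
            new ConcurrentHashMap<String, Boolean>());

    // Returns true if the URL was not seen before, i.e. the image
    // still needs parsing; marks it as seen either way.
    public static boolean markIfUnseen(String imageUrl) {
        return SEEN.add(imageUrl);
    }
}

The parse loop then runs the expensive image parsing only when
markIfUnseen() returns true.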

With that in mind, is there some kind of Hadoop-level distributed cache
our plugin could access, so that every node knows what has already been
parsed by the other nodes? A singleton object that all nodes could see
would suffice.

I have read about Hadoop's DistributedCache, but it doesn't seem to be
what I need: as far as I understand, it distributes read-only files to
the tasks when a job starts, rather than providing a shared store the
tasks can update.

http://hadoop.apache.org/common/docs/r0.18.3/api/org/apache/hadoop/filecache/DistributedCache.html
http://hadoop.apache.org/common/docs/r0.18.3/mapred_tutorial.html#DistributedCache
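
From the docs, typical usage looks something like the sketch below (the
file path is made up), which is why it does not fit our case: the file
is pushed out once before the job runs and read locally, with no way
for tasks to see each other's updates.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class DistributedCacheSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Registers a file to be copied to every task node before
        // the job starts; "/data/seen-images.txt" is just an example.
        DistributedCache.addCacheFile(new URI("/data/seen-images.txt"),
                conf);
        // Later, inside a running task, the local (read-only) copies
        // can be looked up like this:
        Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    }
}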


Thanks in advance,

Emmanuel de Castro Santana
