Off the top of my head, there are a couple of different ways to do this:
1. Create a list of the images parsed and add them to some file(s) on
   HDFS. Then use that file to skip already-parsed images going
   forward. You would probably need to do some type of file merge at
   the end, and your complexity would grow with the number of images
   parsed. The file type could also be something like SQLite or BDB
   for fast access. The key here is that either you have multiple
   files you are searching, or you are merging intermediate and
   existing files at some point (see the first sketch below).
2. Use Memcached or a similar system to hold your real-time
   information and access it from your Hadoop tasks. This should
   give you the singleton type of structure you are looking for. It
   is still possible that some images are parsed more than once due
   to distributed race conditions, but it would significantly reduce
   that number. You could also use HBase for something like this,
   although I think key-value (memcached) beats record store (HBase)
   in this instance (see the second sketch below).
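To make option 1 concrete, here is a rough sketch against the Hadoop
FileSystem API. The SeenImages class, the directory layout, and the
one-file-per-task naming are my own invention, so treat it as a
starting point rather than anything canonical:

  import java.io.*;
  import java.util.HashSet;
  import java.util.Set;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.*;

  public class SeenImages {
    private final Set<String> seen = new HashSet<String>();
    private final FSDataOutputStream out;

    // Load every existing seen-list file, then open a new file
    // unique to this task for recording newly parsed URLs.
    public SeenImages(Configuration conf, Path dir, String taskId)
        throws IOException {
      FileSystem fs = FileSystem.get(conf);
      if (fs.exists(dir)) {
        for (FileStatus st : fs.listStatus(dir)) {
          BufferedReader r = new BufferedReader(
              new InputStreamReader(fs.open(st.getPath())));
          String line;
          while ((line = r.readLine()) != null) seen.add(line);
          r.close();
        }
      }
      out = fs.create(new Path(dir, "seen-" + taskId));
    }

    // Returns true the first time a URL is offered, false after.
    public boolean markIfNew(String imageUrl) throws IOException {
      if (!seen.add(imageUrl)) return false;
      out.writeBytes(imageUrl + "\n");
      return true;
    }

    public void close() throws IOException { out.close(); }
  }

Writing one file per task is what forces the merge step I mentioned;
a follow-up job (or a simple hadoop fs -cat over the directory) can
collapse the per-task files into one between crawls.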
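And a rough sketch of option 2 using the spymemcached client.
memcached's add() only stores a key when it is absent, which gives
you an atomic first-writer-wins check; the host, port, and key
prefix here are placeholders:

  import java.net.InetSocketAddress;
  import net.spy.memcached.MemcachedClient;

  public class ImageSeenCache {
    private final MemcachedClient client;

    public ImageSeenCache(String host, int port)
        throws java.io.IOException {
      client = new MemcachedClient(new InetSocketAddress(host, port));
    }

    // add() succeeds only when the key is absent, so the first task
    // to claim a URL gets true and everyone else gets false.
    public boolean claim(String imageUrl) throws Exception {
      return client.add("img:" + imageUrl, 0, "1").get();
    }

    public void shutdown() { client.shutdown(); }
  }

A task would call claim() before parsing and skip the image when it
returns false; since memcached can evict or lose entries, a few
duplicate parses remain possible, which matches the caveat above.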
The distributed cache isn't really what you are looking for unless
the batch-ignore method described above is what you want. In that
case the cache gives you an optimization of having the file on every
local machine between jobs.
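For completeness, the batch-ignore variant would look roughly like
the following; the HDFS path and the helper class are assumptions on
my part, and the API is the 0.18-era one from the links below:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.Path;

  public class SeenListSetup {
    // Before submitting the job, register the merged seen-list so
    // every node gets a local copy of it.
    public static void register(Configuration conf) throws Exception {
      DistributedCache.addCacheFile(
          new URI("/nutch/seen-images/merged"), conf);
    }

    // Inside a task, locate the local copy for reading.
    public static Path localSeenList(Configuration conf)
        throws Exception {
      Path[] files = DistributedCache.getLocalCacheFiles(conf);
      return (files == null || files.length == 0) ? null : files[0];
    }
  }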
Dennis
On 06/20/2010 09:35 AM, Emmanuel de Castro Santana wrote:
Hi everybody!
I am developing an image parsing plugin for Nutch 1.1 to use on
intranet pages.
This plugin traverses the document, getting all images and parsing
them synchronously.
The parsing process is not expensive as long as we do not parse the
same images again and again whenever they are found throughout the
site.
About 70% of the parsed pages contain only the same images: things
like logos, footer images, etc.
To avoid the unnecessary load, we are now trying to build a sort of cache in
which we put all image addresses that have already been processed.
The problem is that we are running in a distributed environment.
With that in mind, is there a kind of Hadoop-level distributed cache
our plugin can access, so that all cores know what has already been
parsed by the other cores? A singleton object all cores could see
would suffice.
I have read about Hadoop's DistributedCache, but it doesn't seem to be what
I need.
http://hadoop.apache.org/common/docs/r0.18.3/api/org/apache/hadoop/filecache/DistributedCache.html
http://hadoop.apache.org/common/docs/r0.18.3/mapred_tutorial.html#DistributedCache
Thanks in advance,
Emmanuel de Castro Santana