Off the top of my head, there are a couple of different ways to do this:

  1. Create a list of the images parsed and add them to some file(s) on
     HDFS.  Then use that file to ignore them going forward.  You would
     probably need to do some type of file merge at the end, and your
     complexity would grow with the number of images parsed.  The file
     type could also be something like SQLite or BDB for fast access.
     The key here is that either you are searching multiple files or
     you are merging intermediate and existing files at some point.
  2. Use Memcached or a similar system to hold your real-time
     information and access it from your Hadoop tasks.  This should
     give you the singleton-type structure you are looking for.  It
     is still possible that some images are parsed more than once due
     to distributed race conditions, but it would significantly reduce
     that number.  You could also use HBase for something like this,
     although I think a key-value store (Memcached) is a better fit
     than a record store (HBase) in this instance; see the sketch
     after this list.
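For option 2, here is a rough, untested sketch of what I mean, using the
spymemcached Java client (the client choice, class name, and cache host
are just assumptions for illustration; any memcached client would do).
The trick is that memcached's add() is atomic and only succeeds for the
first caller, so each task can use it to claim an image URL:

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class ImageSeenCache {

    private final MemcachedClient client;

    public ImageSeenCache(String host, int port) throws java.io.IOException {
        // One client per task; point it at your memcached server.
        client = new MemcachedClient(new InetSocketAddress(host, port));
    }

    // add() only succeeds if the key does not already exist, so the
    // first task to call this for a given URL gets true and parses
    // the image; everyone else gets false and skips it.
    public boolean shouldParse(String imageUrl) throws Exception {
        return client.add(imageUrl, 0, "1").get();
    }

    public void close() {
        client.shutdown();
    }
}

Each task would call shouldParse(url) before doing any work on an image.
One caveat: memcached keys are limited to 250 bytes, so if your image
URLs can run longer than that you would want to hash them first.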

The distributed cache isn't really what you are looking for unless the batch ignore method described above is what you want.  In that case it gives you the optimization of having the ignore file on every local machine between jobs.
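To make that concrete, here is a minimal, untested sketch of the batch
ignore method against the old mapred API (the /seen/images.txt path and
the class names are made up for illustration):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SeenImagesMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private final Set<String> seen = new HashSet<String>();

    // Job setup would have shipped the ignore file to each node with
    // DistributedCache.addCacheFile(new URI("/seen/images.txt"), conf);
    // here we load the local copy once per task.
    public void configure(JobConf conf) {
        try {
            Path[] cached = DistributedCache.getLocalCacheFiles(conf);
            if (cached != null && cached.length > 0) {
                BufferedReader in =
                    new BufferedReader(new FileReader(cached[0].toString()));
                String url;
                while ((url = in.readLine()) != null) {
                    seen.add(url.trim());
                }
                in.close();
            }
        } catch (IOException e) {
            throw new RuntimeException("could not read ignore file", e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String imageUrl = value.toString().trim();
        if (seen.contains(imageUrl)) {
            return; // already parsed in an earlier job
        }
        // ... parse the image here ...
        // Emit the URL so the job's output can be merged back into
        // the ignore file before the next run.
        output.collect(new Text(imageUrl), new Text("parsed"));
    }
}

The merge step between jobs is the part whose cost grows with the number
of images parsed, as described in option 1.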

Dennis

On 06/20/2010 09:35 AM, Emmanuel de Castro Santana wrote:
Hi everybody!

I am developing an image parsing plugin for Nutch 1.1 for use on intranet
pages.
The plugin traverses each document, extracting all images and parsing them
synchronously.
The parsing process is not expensive if we do not parse the same images
again and again whenever they are found throughout the site.
About 70% of the parsed pages contain only the same images, things like
logos, footer images, etc.
To avoid the unnecessary load, we are now trying to build a sort of cache in
which we put all image addresses that have already been processed.
The problem is that we are running in a distributed environment.

With that in mind, is there a kind of Hadoop-level distributed cache our
plugin can access, so that all cores know what has already been parsed by
the other cores? A singleton object all cores could see would suffice.

I have read about Hadoop's DistributedCache, but it doesn't seem to be what
I need.

http://hadoop.apache.org/common/docs/r0.18.3/api/org/apache/hadoop/filecache/DistributedCache.html
http://hadoop.apache.org/common/docs/r0.18.3/mapred_tutorial.html#DistributedCache


Thanks in advance,

Emmanuel de Castro Santana
