Hi Dennis,

I was rather thinking of a simpler approach, something like a Hadoop class I could extend, letting Hadoop take care of keeping it a single instance across all nodes for me. If there is no such mechanism, I will try the second approach you described (I have put a rough sketch of what that could look like in a P.S. at the end of this message).

When running Nutch's Crawl class in Eclipse, a static reference to a Map-based cache in my plugin solves the problem, but I guess that would not work in a distributed environment.
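For concreteness, the single-JVM workaround I have now looks roughly like the sketch below (the names are illustrative, not the actual plugin code). It only helps because everything runs in one process; each Hadoop task JVM would get its own empty copy of the map.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Per-JVM, Map-based cache of image URLs that have already been parsed.
// Fine while the whole crawl runs inside one JVM (e.g. the Crawl class
// in Eclipse), but not shared between Hadoop task JVMs on different nodes.
public class ParsedImageCache {

    private static final ConcurrentMap<String, Boolean> PARSED =
            new ConcurrentHashMap<String, Boolean>();

    // Returns true only the first time a URL is seen in this JVM.
    public static boolean markParsed(String imageUrl) {
        return PARSED.putIfAbsent(imageUrl, Boolean.TRUE) == null;
    }
}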
I am studying some more of the mechanics of plugin loading in Nutch; much of it is still a mystery to me. Does it try to maintain the state of the objects across the nodes, or is each process supposed to run independently? Thanks for the help.

2010/6/20 Dennis Kubes <[email protected]>

> Off the top of my head there are a couple of different ways to do this:
>
> 1. Create a list of the images parsed and add them to some file(s) on
>    the HDFS. Then use that file to ignore going forward. You would
>    probably need to do some type of file merge at the end and your
>    complexity would grow with the number of images parsed. The file
>    type could also be something like SQLite or BDB for fast access.
>    The key here is either you have multiple files you are searching
>    or you are merging intermediate and existing files at some point.
> 2. Use Memcached or a similar system to hold your real-time
>    information and access it from your Hadoop tasks. This should
>    give you the singleton type of structure you are looking for. It
>    is still possible that some images are parsed more than once due
>    to distributed race conditions, but it would significantly reduce
>    that number. You could also use HBase for something like this,
>    although I think key-value (Memcached) versus record store (HBase)
>    is better in this instance.
>
> The distributed cache isn't really what you are looking for unless the
> batch ignore method described above is what you want. In that case the
> cache gives you the optimization of having the file on every local
> machine between jobs.
>
> Dennis
>
>
> On 06/20/2010 09:35 AM, Emmanuel de Castro Santana wrote:
>
>> Hi everybody!
>>
>> I am developing an image parsing plugin for Nutch 1.1 to use it for
>> intranet pages.
>> This plugin traverses the document, getting all images and parsing them
>> synchronously.
>> The parsing process is not expensive if we do not parse the same images
>> again and again whenever they are found throughout the site.
>> About 70% of the parsed pages contain only the same images, things like
>> logos, footer images, etc.
>> To avoid the unnecessary load, we are now trying to build a sort of
>> cache in which we put all image addresses that have already been
>> processed. The problem is that we are running in a distributed
>> environment.
>>
>> With that in mind, is there a kind of Hadoop-level distributed cache our
>> plugin can access, so that all cores know what has already been parsed
>> by the other cores? A singleton object all cores could see would suffice.
>>
>> I have read about Hadoop's DistributedCache, but it doesn't seem to be
>> what I need.
>>
>> http://hadoop.apache.org/common/docs/r0.18.3/api/org/apache/hadoop/filecache/DistributedCache.html
>> http://hadoop.apache.org/common/docs/r0.18.3/mapred_tutorial.html#DistributedCache
>>
>> Thanks in advance,
>>
>> Emmanuel de Castro Santana
>>
>>

--
Emmanuel de Castro Santana
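P.S. A rough sketch of what I have in mind for the Memcached idea (your option 2), assuming the spymemcached client; the host, port, expiry and key prefix below are made up for the example:

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

// Cluster-wide "already parsed?" check backed by Memcached.
// add() stores a key only if it does not exist yet, so the first task
// (on any node) to claim an image URL wins and later tasks skip it.
public class DistributedImageCache {

    private final MemcachedClient client;

    public DistributedImageCache(String host, int port) throws Exception {
        this.client = new MemcachedClient(new InetSocketAddress(host, port));
    }

    // Returns true only for the first task, cluster-wide, to see this URL.
    public boolean markParsed(String imageUrl) throws Exception {
        int oneDay = 60 * 60 * 24; // expiry in seconds, arbitrary for the sketch
        return client.add("parsed-image:" + imageUrl, oneDay, "1").get();
    }

    public void shutdown() {
        client.shutdown();
    }
}

Since add() is atomic on the server, duplicates should only slip through on evictions or restarts, which matches what you said about race conditions being reduced rather than eliminated. Long image URLs would need to be hashed to stay under Memcached's 250-byte key limit.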

