Thank you for all the help, Dennis. All of this is valuable information to me!
I am trying out a solution using Memcached and will post results as I go.
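
For the record, here is roughly the check I have in mind. It is only a minimal
sketch, assuming the spymemcached client (any memcached client with an atomic
add would do); the host, port, expiry and class names below are placeholders I
made up, not anything from this thread. The idea is to use memcached's atomic
add() as a "first one wins" test, so only the task that wins the add actually
parses the image.

import java.net.InetSocketAddress;
import java.security.MessageDigest;
import net.spy.memcached.MemcachedClient;

/** Rough sketch: skip images that another task has already claimed. */
public class ParsedImageCache {

  private final MemcachedClient client;
  private final int expirySeconds = 24 * 60 * 60; // keep entries for a day

  public ParsedImageCache(String host, int port) throws Exception {
    client = new MemcachedClient(new InetSocketAddress(host, port));
  }

  /**
   * Returns true if this task is the first to claim the image URL.
   * memcached's add() only succeeds when the key is absent and is atomic
   * on the server, so tasks on different nodes agree on a single winner
   * (modulo eviction/expiry, i.e. the race conditions Dennis mentioned).
   */
  public boolean claim(String imageUrl) throws Exception {
    return client.add(key(imageUrl), expirySeconds, "1").get();
  }

  /** Hash the URL so it fits memcached's 250-byte, no-whitespace key limit. */
  private static String key(String imageUrl) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    StringBuilder hex = new StringBuilder("img:");
    for (byte b : md5.digest(imageUrl.getBytes("UTF-8"))) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  public void close() {
    client.shutdown();
  }
}

In the parse filter the call would then be something like
"if (cache.claim(imageUrl)) { parseImage(imageUrl); }" and skip otherwise.
As Dennis noted, entries can still be lost to eviction or expiry, so a
duplicate parse is still possible, just much rarer.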

2010/6/22 Dennis Kubes <[email protected]>:

> On 06/22/2010 05:37 AM, Emmanuel de Castro Santana wrote:
>
>> Hi Dennis
>>
>> I was rather thinking of a simpler approach, something like a Hadoop class
>> I could extend, letting Hadoop take care of making it a single instance
>> throughout all nodes for me.
>
> AFAIK there isn't any type of singleton cache shared among all Hadoop nodes
> that could be updated in real time. The Hadoop cache places a given file or
> files on all machines, but it isn't real time.
>
>> If there is no such mechanism, I will try the second approach you
>> described. Running the Nutch Crawl class in Eclipse, a static reference to
>> a Map-based cache in my plugin solves my problem, but I guess that would
>> not work in a distributed environment.
>
> Correct. On a single machine, using something like a MapRunner (old API) or
> a static ref keeping everything in memory, you could solve the problem. No,
> it wouldn't work in anything other than local mode, because in distributed
> mode (besides being on separate machines) each child task has its own JVM
> and is executed as a separate process on the OS.
>
> Any distributed singleton would need to be either external (memcached) or
> batch based (update to HDFS and pull down at the beginning of the job). In
> both instances you are relying on a data store external to the jobs to act
> as a shared memory area for processing.
>
>> I am studying some more of the mechanics of plugin loading in Nutch; much
>> of it is still a mystery to me.
>
> It is based on the Eclipse 2.0 plugin architecture, if that helps. The
> short of it is that there are extension points with interfaces. Each
> extension point has plugins, classes that implement the interface. The
> extension points and plugins are loaded dynamically at the beginning of
> each job from the classpath.
>
>> Does it try to maintain the state of objects throughout the nodes, or is
>> each process supposed to run independently?
>
> Plugin classloaders are independent and have a dependency mechanism set up
> in the plugin.xml file inside the plugin's folder. Plugins do have access
> to the Nutch classes and libs. They don't necessarily have access to other
> plugins' classes and their libs.
>
> Dennis
>
>> Thanks for the help.
>>
>> 2010/6/20 Dennis Kubes <[email protected]>
>>
>>> Off the top of my head there are a couple of different ways to do this:
>>>
>>> 1. Create a list of the images parsed and add them to some file(s) on
>>>    the HDFS. Then use that file to ignore going forward. You would
>>>    probably need to do some type of file merge at the end, and your
>>>    complexity would grow with the number of images parsed. The file
>>>    type could also be something like SQLite or BDB for fast access.
>>>    The key here is that either you have multiple files you are searching
>>>    or you are merging intermediate and existing files at some point.
>>> 2. Use Memcached or a similar system to hold your real-time
>>>    information and access that from your Hadoop tasks. This should
>>>    give you the singleton type of structure you are looking for. It
>>>    is still possible that some images are parsed more than once due
>>>    to distributed race conditions, but it would significantly reduce
>>>    that number. You could also use HBase for something like this,
>>>    although I think a key-value store (memcached) versus a record
>>>    store (HBase) is better in this instance.
>>>
>>> The distributed cache isn't really what you are looking for, unless the
>>> batch ignore method described above is what you want. In that case the
>>> cache gives you the optimization of having the file on every local
>>> machine between jobs.
>>>
>>> Dennis
>>>
>>> On 06/20/2010 09:35 AM, Emmanuel de Castro Santana wrote:
>>>
>>>> Hi everybody!
>>>>
>>>> I am developing an image parsing plugin for Nutch 1.1 to use on intranet
>>>> pages. This plugin traverses the document, getting all images and
>>>> parsing them synchronously. The parsing process is not expensive if we
>>>> do not parse the same images again and again whenever they are found
>>>> throughout the site. About 70% of the parsed pages contain only the same
>>>> images, things like logos, footer images, etc. To avoid the unnecessary
>>>> load, we are now trying to build a sort of cache in which we put all
>>>> image addresses that have already been processed. The problem is that we
>>>> are running in a distributed environment.
>>>>
>>>> With that in mind, is there a kind of Hadoop-level distributed cache our
>>>> plugin can access, so all cores know what has already been parsed by the
>>>> other cores? A singleton object all cores could see would suffice.
>>>>
>>>> I have read about Hadoop's DistributedCache, but it doesn't seem to be
>>>> what I need.
>>>>
>>>> http://hadoop.apache.org/common/docs/r0.18.3/api/org/apache/hadoop/filecache/DistributedCache.html
>>>>
>>>> http://hadoop.apache.org/common/docs/r0.18.3/mapred_tutorial.html#DistributedCache
>>>>
>>>> Thanks in advance,
>>>>
>>>> Emmanuel de Castro Santana

--
Emmanuel de Castro Santana

