Hi Dennis

I was rather thinking of a simpler approach, something like a Hadoop class I
could extend, letting Hadoop take care of making it a single instance
across all nodes for me.
If there is no such mechanism, I will try the second approach you
described.
When I run the Nutch Crawl class in Eclipse, a static reference to a
Map-based cache in my plugin solves the problem, but I guess that would not
work in a distributed environment.

I am still studying the mechanics of plugin loading in Nutch; much of it is
still a mystery to me.
Does Nutch try to maintain the state of objects across the nodes, or is each
process supposed to run independently?

Thanks for the help.


2010/6/20 Dennis Kubes <[email protected]>

> Off the top of my head there are a couple of different ways to do this:
>
>  1. Create a list of the images parsed and add them to some file(s) on
>     the HDFS.  Then use that file to ignore going forward.  You would
>     probably need to do some type of file merge at the end and your
>     complexity would grow with the number of images parsed.  The file
>     type could also be something like SQLite or BDB for fast access.
>  The key here is either you have multiple files you are searching
>     or you are merging intermediate and existing files at some point.
>  2. Use Memcached or a similar system to hold your real time
>     information and access that from your Hadoop tasks.  This should
>     give you the singleton type of structure you are looking for.  It
>     is still possible that some images are parsed more than once due
>     to distributed race conditions but it would significantly reduce
>     that number.  You could also use HBase for something like this
>     although I think key value (memcached) versus record store (HBase)
>     is better in this instance.
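A minimal sketch of what option (2) could look like from a Hadoop task,
assuming the spymemcached client; the host, port and key prefix here are
placeholders of mine, not anything defined by Nutch or Hadoop:

import java.io.IOException;
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class SharedImageRegistry {

    private final MemcachedClient client;

    public SharedImageRegistry(String host, int port) throws IOException {
        this.client = new MemcachedClient(new InetSocketAddress(host, port));
    }

    // add() stores the key only if it is absent, so the first task to
    // register an image URL "wins"; a failure can still let a duplicate
    // through, which matches the reduced-but-not-eliminated race behaviour
    // described above. Very long URLs should be hashed first, since
    // memcached keys are limited to 250 characters.
    public boolean markIfUnseen(String imageUrl) throws Exception {
        return client.add("img:" + imageUrl, 0, "1").get();
    }

    public void shutdown() {
        client.shutdown();
    }
}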
>
> The distributed cache isn't really what you are looking for unless the
> batch ignore method described above is what you want.  In that case it
> gives you the optimization of having the file on every local machine between
> jobs.
>
> Dennis
>
>
> On 06/20/2010 09:35 AM, Emmanuel de Castro Santana wrote:
>
>> Hi everybody !
>>
>> I am developing an image parsing plugin for Nutch 1.1 to use on intranet
>> pages.
>> This plugin traverses the document getting all images and parsing them
>> synchronously.
>> The parsing process is not expensive if we do not parse the same images
>> again and again whenever they are found throughout the site.
>> About 70% of the parsed pages contain only the same images, things like
>> logos, footer images, etc.
>> To avoid the unnecessary load, we are now trying to build a sort of cache
>> in which we put all image addresses that have already been processed.
>> The problem is that we are running in a distributed environment.
>>
>> With that in mind, is there a kind of Hadoop level Distributed Cache our
>> plugin can access so all cores know what has already been parsed by the
>> other cores? A singleton object that all cores could see would suffice.
>>
>> I have read about Hadoop's DistributedCache, but it doesn't seem to be
>> what I need.
>>
>>
>> http://hadoop.apache.org/common/docs/r0.18.3/api/org/apache/hadoop/filecache/DistributedCache.html
>>
>> http://hadoop.apache.org/common/docs/r0.18.3/mapred_tutorial.html#DistributedCache
>>
>>
>> Thanks in advance,
>>
>> Emmanuel de Castro Santana
>>
>>
>>
>


-- 
Emmanuel de Castro Santana
