Thank you for all the help, Dennis. All of this is valuable information to me!

I am trying a solution using Memcached and will post results as I go.
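
In case it is useful to anyone, here is a rough sketch of the direction I am
taking. It assumes the spymemcached client and a single memcached node on
localhost; the class name and key handling are placeholders and untested:

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

// Marks image URLs as "already parsed" in a cache shared by all tasks.
// Assumes a memcached instance on localhost:11211 and the spymemcached client.
public class ParsedImageCache {

  private final MemcachedClient client;

  public ParsedImageCache() throws java.io.IOException {
    client = new MemcachedClient(new InetSocketAddress("localhost", 11211));
  }

  // Returns true only the first time a URL is seen.  add() succeeds only when
  // the key does not exist yet, so tasks racing on the same URL get at most
  // one "true".  Long URLs should be hashed first, since memcached keys are
  // limited to 250 bytes and may not contain whitespace.
  public boolean markIfNew(String imageUrl) throws Exception {
    return client.add(imageUrl, 86400, "1").get();  // keep the marker for one day
  }

  public void close() {
    client.shutdown();
  }
}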


2010/6/22 Dennis Kubes <[email protected]>

>
>
> On 06/22/2010 05:37 AM, Emmanuel de Castro Santana wrote:
>
>> Hi Dennis
>>
>> I was rather thinking of a simpler approach, something like a Hadoop class
>> I could extend and let Hadoop take care of making it a single instance
>> throughout all nodes for me.
>>
>>
> AFAIK there isn't any type of singleton cache shared among all Hadoop nodes
> that can be updated in real time.  The Hadoop DistributedCache places a given
> file or files on all machines, but it isn't real time.
>
>> If there is no such mechanism, I will try the second approach you described.
>> Running the Nutch Crawl class in Eclipse, a static reference to a Map-based
>> cache in my plugin solves my problem, but I guess that would not work in a
>> distributed environment.
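>>
>> For reference, the local-mode version is just a static set guarded by
>> putIfAbsent, roughly like this (minimal sketch, class name made up):
>>
>> import java.util.concurrent.ConcurrentHashMap;
>>
>> // Only works while all tasks run in one JVM (local mode): the static map
>> // is not visible to child task JVMs running on other nodes.
>> public class LocalParsedImageCache {
>>
>>   private static final ConcurrentHashMap<String, Boolean> SEEN =
>>       new ConcurrentHashMap<String, Boolean>();
>>
>>   // Returns true only the first time an image URL is seen in this JVM.
>>   public static boolean markIfNew(String imageUrl) {
>>     return SEEN.putIfAbsent(imageUrl, Boolean.TRUE) == null;
>>   }
>> }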
>>
>>
> Correct.  On a single machine, using something like a MapRunner (old API)
> or a static ref keeping everything in memory, you could solve the problem.
> It wouldn't work in anything other than local mode, because in distributed
> mode (besides being on separate machines) each child task has its own JVM
> and is executed as a separate process on the OS.
>
> Any distributed singleton would need to be either external (memcached) or
> batch based (update to HDFS and pull down at beginning of job).  In both
> instances you are relying on a data store external to the jobs, acting as
> a shared memory area for processing.
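>
> For the batch-based route, the start of a task would look roughly like the
> following (untested sketch, file layout and names made up; the newly seen
> URLs would be written out and merged back into the list after the job):
>
> import java.io.BufferedReader;
> import java.io.InputStreamReader;
> import java.util.HashSet;
> import java.util.Set;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> // Loads the list of already-parsed image URLs (one per line) from HDFS so
> // a task can skip them.  The list is only as fresh as the last job that
> // updated it; this is the batch approach, not real time.
> public class ParsedImageList {
>
>   public static Set<String> load(Configuration conf, Path listFile) throws Exception {
>     Set<String> seen = new HashSet<String>();
>     FileSystem fs = FileSystem.get(conf);
>     if (!fs.exists(listFile)) {
>       return seen;  // first run, nothing parsed yet
>     }
>     BufferedReader reader =
>         new BufferedReader(new InputStreamReader(fs.open(listFile)));
>     try {
>       String line;
>       while ((line = reader.readLine()) != null) {
>         seen.add(line.trim());
>       }
>     } finally {
>       reader.close();
>     }
>     return seen;
>   }
> }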
>
>> I am studying some more of the mechanics of plugin loading in Nutch; much
>> of it is still a mystery to me.
>>
>>
> It is based on the Eclipse 2.0 plugin architecture, if that helps.  The short
> of it is that there are extension points with interfaces.  Each extension
> point has plugins, i.e. classes that implement the interface.  The extension
> points and plugins are loaded dynamically from the classpath at the beginning
> of each job.
>
>> Does it try to maintain the state of the objects throughout the nodes, or is
>> each process supposed to run independently?
>>
>>
> Plugin classloaders are independent and have a dependency mechanism set up
> in the plugin.xml file inside the plugin's folder.  Plugins do have access
> to the Nutch classes and libs.  They don't necessarily have access to other
> plugins' classes and their libs.
>
> Dennis
>
>> Thanks for the help.
>>
>>
>> 2010/6/20 Dennis Kubes <[email protected]>
>>
>>
>>
>>> Off the top of my head there are a couple of different ways to do this:
>>>
>>>  1. Create a list of the images parsed and add them to some file(s) on
>>>     HDFS.  Then use that file to decide what to ignore going forward.  You
>>>     would probably need to do some type of file merge at the end, and your
>>>     complexity would grow with the number of images parsed.  The file
>>>     type could also be something like SQLite or BDB for fast access.
>>>     The key here is that either you have multiple files you are searching,
>>>     or you are merging intermediate and existing files at some point.
>>>  2. Use Memcached or a similar system to hold your real-time
>>>     information and access that from your Hadoop tasks.  This should
>>>     give you the singleton type of structure you are looking for.  It
>>>     is still possible that some images are parsed more than once due
>>>     to distributed race conditions, but it would significantly reduce
>>>     that number.  You could also use HBase for something like this,
>>>     although I think a key-value store (memcached) is better than a
>>>     record store (HBase) in this instance.
>>>
>>> The distributed cache isn't really what you are looking for unless the
>>> batch ignore method described above is what you want.  In that case it
>>> gives you the optimization of having the file on every local machine
>>> between jobs.
>>>
>>> Dennis
>>>
>>>
>>> On 06/20/2010 09:35 AM, Emmanuel de Castro Santana wrote:
>>>
>>>
>>>
>>>> Hi everybody !
>>>>
>>>> I am developing an image parsing plugin for Nutch 1.1, to be used for
>>>> intranet pages.
>>>> The plugin traverses the document, getting all images and parsing them
>>>> synchronously.
>>>> The parsing process is not expensive as long as we do not parse the same
>>>> images again and again whenever they are found throughout the site.
>>>> About 70% of the parsed pages contain only the same images, things like
>>>> logos, footer images, etc.
>>>> To avoid the unnecessary load, we are now trying to build a sort of cache
>>>> in which we put all image addresses that have already been processed.
>>>> The problem is that we are running in a distributed environment.
>>>>
>>>> With that in mind, is there a kind of Hadoop-level distributed cache our
>>>> plugin can access so that all cores know what has already been parsed by
>>>> the other cores? A singleton object all cores could see would suffice.
>>>>
>>>> I have read about Hadoop's DistributedCache, but it doesn't seem to be
>>>> what I need.
>>>>
>>>>
>>>>
>>>> http://hadoop.apache.org/common/docs/r0.18.3/api/org/apache/hadoop/filecache/DistributedCache.html
>>>>
>>>>
>>>> http://hadoop.apache.org/common/docs/r0.18.3/mapred_tutorial.html#DistributedCache
>>>>
>>>>
>>>> Thanks in advance,
>>>>
>>>> Emmanuel de Castro Santana
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>


-- 
Emmanuel de Castro Santana
