Please do. I am very interested in how a solution like that works for you in terms of performance.

Dennis

On 06/24/2010 12:10 PM, Emmanuel de Castro Santana wrote:
Thank you for all the help, Dennis. All of this is valuable information to me!

I am trying a solution using Memcached and will post results as I go.


2010/6/22 Dennis Kubes<[email protected]>


On 06/22/2010 05:37 AM, Emmanuel de Castro Santana wrote:

Hi Dennis

I was rather thinking of a simpler approach, something like a Hadoop class
I could extend and let Hadoop take care of making it a single instance
across all nodes for me.


AFAIK there isn't any type of singleton cache shared among all Hadoop nodes
that could be updated in real time.  Hadoop's DistributedCache places a given
file or files on all machines, but it isn't real time.

If there is no such mechanism, I will try the second approach you
described.  Running the Nutch Crawl class in Eclipse, a static reference to
a Map-based cache in my plugin solves my problem, but I guess that would
not work in a distributed environment.


Correct.  On a single machine, using something like a MapRunner (old API)
or, yes, a static reference keeping everything in memory, you could solve
the problem.  No, it wouldn't work in anything other than local mode,
because in distributed mode (besides tasks being on separate machines) each
child task has its own JVM and is executed as a separate process on the OS.
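
For reference, the kind of static, in-JVM cache that works in local mode
(and only in local mode) is roughly the sketch below; the class and method
names are just illustrative.

import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Works only while all tasks run in one JVM (local mode / Eclipse).  In
// distributed mode each child task gets its own JVM, so each task would
// see its own, separate copy of this set.
public class LocalImageCache {

  // Thread-safe set of image URLs that have already been parsed.
  private static final Set<String> SEEN =
      Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

  // Returns true the first time a URL is seen, false afterwards, so the
  // caller can skip re-parsing when it gets false.
  public static boolean markSeen(String imageUrl) {
    return SEEN.add(imageUrl);
  }
}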

Any distributed singleton would need to be either external (memcached) or
batch based (update to HDFS and pull down at the beginning of a job).  In
both cases you are relying on a data store external to the jobs to act as a
shared memory area for processing.
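
With memcached as the external store, the check could look something like
the sketch below.  I'm assuming the spymemcached client and a single
memcached host purely for illustration; adjust to whatever client and setup
you end up with.

import java.io.IOException;
import java.net.InetSocketAddress;

import net.spy.memcached.MemcachedClient;

// Sketch of using memcached as the shared "already parsed?" store across
// all Hadoop tasks.  Host, port, expiry and key prefix are illustrative.
public class MemcachedImageCache {

  private static final int ONE_DAY_SECONDS = 60 * 60 * 24;

  private final MemcachedClient client;

  public MemcachedImageCache(String host, int port) throws IOException {
    this.client = new MemcachedClient(new InetSocketAddress(host, port));
  }

  // add() stores the key only if it is not already present, so the first
  // task to register a given image URL "wins" and should parse it; every
  // other task gets false back and can skip the image.
  public boolean markSeen(String imageUrl) throws Exception {
    return client.add("img:" + imageUrl, ONE_DAY_SECONDS, "1").get();
  }

  public void close() {
    client.shutdown();
  }
}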

I am studying some more of the mechanics of plugin loading in Nutch; much
of it is still a mystery to me.


It is based on the Eclipse 2.0 plugin architecture, if that helps.  The
short of it is that there are extension points with interfaces.  Each
extension point has plugins, i.e. classes that implement the interface.
The extension points and plugins are loaded dynamically from the classpath
at the beginning of each job.
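
For example, a parse filter plugin boils down to a class implementing the
HtmlParseFilter extension point, something like the skeleton below (class
and package names are illustrative; the signature is the Nutch 1.1 one as
far as I recall).

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

// Skeleton of a plugin class for the HtmlParseFilter extension point.  The
// real work (walking the DOM, collecting image URLs, skipping ones already
// in the cache) would go inside filter().
public class ImageCacheParseFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
    // TODO: inspect 'doc' and parse only images not seen before.
    return parseResult;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}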

Does it try to maintain the state of the objects across the nodes, or is
each process supposed to run independently?


Plugin classloaders are independent and have a dependency mechanism set up
in the plugin.xml file inside the plugin's folder.  Plugins do have access
to the Nutch classes and libs.  They don't necessarily have access to other
plugins' classes and their libs.
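
The plugin.xml for such a plugin declares its jar, the extension point it
implements, and the plugins it depends on, roughly like this (ids, names
and file names are illustrative):

<plugin id="parse-imagecache" name="Image Cache Parse Filter"
        version="1.0.0" provider-name="example.org">
  <runtime>
    <!-- The plugin's own jar; its classes are visible to this plugin only. -->
    <library name="parse-imagecache.jar">
      <export name="*"/>
    </library>
  </runtime>
  <!-- Dependency mechanism: other plugins this one needs on its classpath. -->
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <!-- Hook the implementation class into the HtmlParseFilter extension point. -->
  <extension id="org.example.parse.imagecache"
             name="Image Cache Parse Filter"
             point="org.apache.nutch.parse.HtmlParseFilter">
    <implementation id="ImageCacheParseFilter"
                    class="org.example.parse.ImageCacheParseFilter"/>
  </extension>
</plugin>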

Dennis

  Thanks for the help.

2010/6/20 Dennis Kubes<[email protected]>



Off the top of my head there are a couple of different ways to do this:

  1. Create a list of the images parsed and add them to some file(s) on
     HDFS.  Then use that file to ignore images going forward.  You would
     probably need to do some type of file merge at the end, and your
     complexity would grow with the number of images parsed.  The file
     format could also be something like SQLite or BDB for fast access.
     The key here is that either you have multiple files you are searching
     or you are merging intermediate and existing files at some point.
     (A rough sketch of this approach follows the list.)
  2. Use Memcached or a similar system to hold your real-time information
     and access that from your Hadoop tasks.  This should give you the
     singleton type of structure you are looking for.  It is still possible
     that some images are parsed more than once due to distributed race
     conditions, but it would significantly reduce that number.  You could
     also use HBase for something like this, although I think key-value
     (memcached) is a better fit than a record store (HBase) in this
     instance.
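
A rough sketch of option 1, assuming a plain text file of already-parsed
image URLs on HDFS (the class name and path handling are illustrative):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of option 1: keep the URLs of already-parsed images in a file on
// HDFS, load it into memory at the start of a job and use it as an ignore
// list.  New URLs found during the job would be written to a new file and
// merged back into the master list afterwards.
public class HdfsSeenImages {

  // Loads the existing ignore list (one URL per line) from HDFS.
  public static Set<String> load(Configuration conf, Path seenFile)
      throws IOException {
    Set<String> seen = new HashSet<String>();
    FileSystem fs = FileSystem.get(conf);
    if (!fs.exists(seenFile)) {
      return seen;
    }
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(seenFile)));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        seen.add(line.trim());
      }
    } finally {
      reader.close();
    }
    return seen;
  }
}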

The DistributedCache isn't really what you are looking for unless the
batch ignore method described above is what you want.  In that case the
cache gives you an optimization of having the file on every local machine
between jobs.
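
If you do go the batch ignore route, pushing the merged file out through
the DistributedCache would look roughly like this (old mapred API; paths
and class names are illustrative):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Sketch: ship the merged "seen images" file to every task's local disk.
public class SeenImagesCacheSetup {

  // At job submission time: register the HDFS file with the cache.
  public static void addSeenFile(JobConf job, String hdfsPath)
      throws Exception {
    DistributedCache.addCacheFile(new URI(hdfsPath), job);
  }

  // Inside a task (e.g. in configure()): locate the local copies.
  public static Path[] localSeenFiles(Configuration conf) throws Exception {
    return DistributedCache.getLocalCacheFiles(conf);
  }
}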

Dennis


On 06/20/2010 09:35 AM, Emmanuel de Castro Santana wrote:



Hi everybody!

I am developing an image parsing plugin for Nutch 1.1 to use on intranet
pages.  This plugin traverses the document, getting all images and parsing
them synchronously.  The parsing process is not expensive as long as we do
not parse the same images again and again whenever they are found
throughout the site.  About 70% of the parsed pages contain only the same
images, things like logos, footer images, etc.  To avoid the unnecessary
load, we are now trying to build a sort of cache in which we put all image
addresses that have already been processed.  The problem is that we are
running in a distributed environment.

With that in mind, is there a kind of Hadoop-level distributed cache our
plugin can access, so all cores know what has already been parsed by the
other cores?  A singleton object all cores could see would suffice.

I have read about Hadoop's DistributedCache, but it doesn't seem to be
what I need.



http://hadoop.apache.org/common/docs/r0.18.3/api/org/apache/hadoop/filecache/DistributedCache.html


http://hadoop.apache.org/common/docs/r0.18.3/mapred_tutorial.html#DistributedCache


Thanks in advance,

Emmanuel de Castro Santana








