By the way,

Given the same problem, there is also another approach.
Today I identify images by reading the HTML and gathering all the
information I need for each image.
Instead, I could make the Nutch crawl process accept only image URLs
(i.e. URLs ending with .jpg, .gif, etc.).

Nevertheless, relevant information about the image lies on the page that
links to it (e.g. the alt attribute, the title of the page, etc.).
Such information is critical.
The plugin would need to have access to information inside the page that
links to that image.
Do you have any idea how that could be done?
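
For concreteness, the URL-restricting filter I mentioned above would look
roughly like this (untested sketch; the class name and regex are only
placeholders):

  import java.util.regex.Pattern;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  // Illustrative filter that lets only image URLs into the crawl.
  public class ImageOnlyURLFilter implements URLFilter {

    private static final Pattern IMAGE_URL =
        Pattern.compile(".*\\.(?:jpe?g|gif|png)$", Pattern.CASE_INSENSITIVE);

    private Configuration conf;

    // URLFilter contract: return the URL to keep it, null to drop it.
    public String filter(String urlString) {
      return IMAGE_URL.matcher(urlString).matches() ? urlString : null;
    }

    public void setConf(Configuration conf) {
      this.conf = conf;
    }

    public Configuration getConf() {
      return conf;
    }
  }

But a filter like this only ever sees the URL itself, never the page that
links to it, which is exactly where the critical information lives.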

Thanks

Emmanuel


2010/6/24 Dennis Kubes <[email protected]>

> Please do.  I am very interested in how a solution like that works for
> you in terms of performance.
>
> Dennis
>
>
> On 06/24/2010 12:10 PM, Emmanuel de Castro Santana wrote:
>
>> Thank you for all the help, Dennis; all of this is valuable information
>> to me!
>>
>> I am trying a solution using Memcached and will post results as I go.
>>
>>
>> 2010/6/22 Dennis Kubes <[email protected]>
>>
>>>
>>> On 06/22/2010 05:37 AM, Emmanuel de Castro Santana wrote:
>>>
>>>> Hi Dennis
>>>>
>>>> I was rather thinking of a simpler approach, something like a Hadoop
>>>> class I could extend, letting Hadoop take care of making it a single
>>>> instance throughout all nodes for me.
>>>>
>>> AFAIK there isn't any type of singleton cache shared among all Hadoop
>>> nodes that could be updated in real time.  The Hadoop cache places a
>>> given file or files on all machines, but it isn't real time.
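>>>
>>> For reference, the wiring with the old API looks roughly like this (the
>>> path is illustrative):
>>>
>>>   import java.net.URI;
>>>
>>>   import org.apache.hadoop.filecache.DistributedCache;
>>>   import org.apache.hadoop.fs.Path;
>>>   import org.apache.hadoop.mapred.JobConf;
>>>
>>>   public class CacheWiring {
>>>
>>>     // At job submission: ask Hadoop to copy the file to every node.
>>>     public static void shipSeenImages(JobConf job) throws Exception {
>>>       DistributedCache.addCacheFile(new URI("/cache/seen-images.txt"),
>>>           job);
>>>     }
>>>
>>>     // Inside a task (e.g. in Mapper.configure()): locate the local copy.
>>>     public static Path[] localCopies(JobConf job) throws Exception {
>>>       return DistributedCache.getLocalCacheFiles(job);
>>>     }
>>>   }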
>>>
>>>> If there is no such mechanism, I will try the second approach you
>>>> described.
>>>> Running the Nutch Crawl class in Eclipse, a static reference to a
>>>> Map-based cache in my plugin solves my problem, but I guess that would
>>>> not work in a distributed environment.
>>>>
>>> Correct.  On a single machine, using something like a MapRunner (old
>>> API) or, yes, a static ref keeping everything in memory, you could
>>> solve the problem.  No, it wouldn't work in anything other than local
>>> mode, because in distributed mode (besides being on separate machines)
>>> each child task has its own JVM and is executed as a separate process
>>> on the OS.
>>>
>>> Any distributed singleton would need to be either external (memcached)
>>> or batch based (update to HDFS and pull down at the beginning of the
>>> job).  In both instances you are relying on some data store external to
>>> the jobs to act as a shared memory area for processing.
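>>>
>>> For the batch variant, each task could pull the list down when it
>>> starts, along these lines (untested sketch; the path and
>>> one-URL-per-line format are illustrative):
>>>
>>>   import java.io.BufferedReader;
>>>   import java.io.IOException;
>>>   import java.io.InputStreamReader;
>>>   import java.util.HashSet;
>>>   import java.util.Set;
>>>
>>>   import org.apache.hadoop.conf.Configuration;
>>>   import org.apache.hadoop.fs.FileSystem;
>>>   import org.apache.hadoop.fs.Path;
>>>
>>>   // Reads the set of already-parsed image URLs (one per line) from HDFS.
>>>   public class SeenImagesLoader {
>>>
>>>     public static Set<String> load(Configuration conf) throws IOException {
>>>       Path file = new Path("/cache/seen-images.txt");  // illustrative
>>>       Set<String> seen = new HashSet<String>();
>>>       FileSystem fs = FileSystem.get(conf);
>>>       if (!fs.exists(file)) {
>>>         return seen;  // nothing parsed yet
>>>       }
>>>       BufferedReader in =
>>>           new BufferedReader(new InputStreamReader(fs.open(file)));
>>>       try {
>>>         String line;
>>>         while ((line = in.readLine()) != null) {
>>>           seen.add(line.trim());
>>>         }
>>>       } finally {
>>>         in.close();
>>>       }
>>>       return seen;
>>>     }
>>>   }
>>>
>>> New URLs found during the job would then be written out and merged into
>>> that file before the next run.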
>>>
>>>> I am studying some more of the mechanics of plugin loading in Nutch;
>>>> much of it is still a mystery to me.
>>>>
>>> It is based on the Eclipse 2.0 plugin architecture, if that helps.  The
>>> short of it is that there are extension points with interfaces.  Each
>>> extension point has plugins: classes that implement the interface.  The
>>> extension points and plugins are loaded dynamically at the beginning of
>>> each job from the classpath.
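>>>
>>> From memory, the wiring in a plugin.xml looks roughly like this (ids,
>>> names, and classes are only placeholders):
>>>
>>>   <plugin id="parse-image" name="Image Parser" version="1.0.0"
>>>           provider-name="example.org">
>>>     <runtime>
>>>       <library name="parse-image.jar">
>>>         <export name="*"/>
>>>       </library>
>>>     </runtime>
>>>     <requires>
>>>       <import plugin="nutch-extensionpoints"/>
>>>     </requires>
>>>     <extension id="org.example.parse.image"
>>>                name="ImageParser"
>>>                point="org.apache.nutch.parse.Parser">
>>>       <implementation id="ImageParser"
>>>                       class="org.example.parse.ImageParser"/>
>>>     </extension>
>>>   </plugin>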
>>>
>>>> Does it try to maintain the state of objects across the nodes, or is
>>>> each process supposed to run independently?
>>>>
>>> Plugin classloaders are independent and have a dependency mechanism set
>>> up in the plugin.xml file inside the plugin's folder.  Plugins do have
>>> access to the Nutch classes and libs.  They don't necessarily have
>>> access to other plugin classes and their libs.
>>>
>>> Dennis
>>>
>>>> Thanks for the help.
>>>>
>>>> 2010/6/20 Dennis Kubes <[email protected]>
>>>>
>>>>> Off the top of my head there are a couple of different ways to do this:
>>>>>
>>>>>  1. Create a list of the images parsed and add them to some file(s) on
>>>>>     HDFS.  Then use that file to ignore them going forward.  You would
>>>>>     probably need to do some type of file merge at the end, and your
>>>>>     complexity would grow with the number of images parsed.  The file
>>>>>     type could also be something like SQLite or BDB for fast access.
>>>>>     The key here is that either you have multiple files you are
>>>>>     searching, or you are merging intermediate and existing files at
>>>>>     some point.
>>>>>  2. Use Memcached or a similar system to hold your real-time
>>>>>     information and access that from your Hadoop tasks.  This should
>>>>>     give you the singleton type of structure you are looking for.  It
>>>>>     is still possible that some images are parsed more than once due
>>>>>     to distributed race conditions, but it would significantly reduce
>>>>>     that number.  You could also use HBase for something like this,
>>>>>     although I think a key-value store (memcached) beats a record
>>>>>     store (HBase) in this instance.  A rough sketch of this approach
>>>>>     follows below.
>>>>>
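>>>>> A minimal sketch of option 2, assuming the spymemcached client and a
>>>>> memcached server on localhost:11211 (class name, expiry, and key
>>>>> handling are illustrative, untested):
>>>>>
>>>>>   import java.io.IOException;
>>>>>   import java.net.InetSocketAddress;
>>>>>
>>>>>   import net.spy.memcached.MemcachedClient;
>>>>>
>>>>>   // Shared "have we parsed this image yet?" check across all tasks.
>>>>>   public class SeenImageCache {
>>>>>
>>>>>     private final MemcachedClient client;
>>>>>
>>>>>     public SeenImageCache(String host, int port) throws IOException {
>>>>>       client = new MemcachedClient(new InetSocketAddress(host, port));
>>>>>     }
>>>>>
>>>>>     // add() succeeds only if the key is absent, so exactly one task
>>>>>     // wins the race for a given URL; everyone else skips the parse.
>>>>>     // Long URLs may need hashing (memcached keys max out at 250 bytes).
>>>>>     public boolean firstTimeSeen(String imageUrl) throws Exception {
>>>>>       int oneDayInSeconds = 60 * 60 * 24;  // illustrative expiry
>>>>>       return client.add(imageUrl, oneDayInSeconds, "1").get();
>>>>>     }
>>>>>   }
>>>>>
>>>>> The parse plugin would then skip any image for which firstTimeSeen()
>>>>> returns false.
>>>>>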
>>>>> The distributed cache isn't really what you are looking for unless the
>>>>> batch ignore method described above is what you want.  In that case
>>>>> the cache gives you an optimization of having the file on every local
>>>>> machine between jobs.
>>>>>
>>>>> Dennis
>>>>>
>>>>>
>>>>> On 06/20/2010 09:35 AM, Emmanuel de Castro Santana wrote:
>>>>>
>>>>>> Hi everybody!
>>>>>>
>>>>>> I am developing an image parsing plugin for Nutch 1.1 to use on
>>>>>> intranet pages.
>>>>>> This plugin traverses the document, getting all images and parsing
>>>>>> them synchronously.
>>>>>> The parsing process is not expensive if we do not parse the same
>>>>>> images again and again whenever they are found throughout the site.
>>>>>> About 70% of the parsed pages contain only the same images: things
>>>>>> like logos, footer images, etc.
>>>>>> To avoid the unnecessary load, we are now trying to build a sort of
>>>>>> cache in which we put all image addresses that have already been
>>>>>> processed.
>>>>>> The problem is that we are running in a distributed environment.
>>>>>>
>>>>>> With that in mind, is there a kind of Hadoop-level distributed cache
>>>>>> our plugin can access, so all cores know what has already been parsed
>>>>>> by the other cores?  A singleton object all cores could see would
>>>>>> suffice.
>>>>>>
>>>>>> I have read about Hadoop's DistributedCache, but it doesn't seem to
>>>>>> be what I need.
>>>>>>
>>>>>> http://hadoop.apache.org/common/docs/r0.18.3/api/org/apache/hadoop/filecache/DistributedCache.html
>>>>>>
>>>>>> http://hadoop.apache.org/common/docs/r0.18.3/mapred_tutorial.html#DistributedCache
>>>>>>
>>>>>>
>>>>>> Thanks in advance,
>>>>>>
>>>>>> Emmanuel de Castro Santana


-- 
Emmanuel de Castro Santana
