Re: Plugins initialized all the time!

2007-05-31 Thread Doğacan Güney

On 5/30/07, Doğacan Güney [EMAIL PROTECTED] wrote:

On 5/30/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
 Doğacan Güney wrote:

  My patch is just a draft to see if we can create a better caching
  mechanism. There are definitely some rough edges there:)

 One important note: in future versions of Hadoop the method
 Configuration.setObject() is deprecated and will then be removed, so we
 have to grow our own caching mechanism anyway - either use a singleton
 cache, or change nearly all APIs to pass around a user/job/task context.

 So, we will face this problem pretty soon, with the next upgrade of Hadoop.

Hmm, well, that sucks, but this is not really a problem for
PluginRepository: PluginRepository already has its own cache
mechanism.




  You are right about per-plugin parameters but I think it will be very
  difficult to keep PluginProperty class in sync with plugin parameters.
  I mean, if a plugin defines a new parameter, we have to remember to
  update PluginProperty. Perhaps, we can force plugins to define the
  configuration options they will use in, say, their plugin.xml files, but
  that will be very error-prone too. I don't want to compare entire
  configuration objects, because changing irrelevant options, like
  fetcher.store.content, shouldn't force loading plugins again, though it
  seems that may be inevitable.

 Let me see if I understand this ... In my opinion this is a non-issue.

 Child tasks are started in separate JVMs, so the only context
 information that they have is what they can read from job.xml (which is
 a superset of all properties from config files + job-specific data +
 task-specific data). This context is currently instantiated as a
 Configuration object, and we (ab)use it also as a local per-JVM cache
 for plugin instances and other objects.

 Once we instantiate the plugins, they exist unchanged throughout the
 lifecycle of JVM (== lifecycle of a single task), so we don't have to
 worry about having different sets of plugins with different parameters
 for different jobs (or even tasks).

 In other words, it seems to me that there is no such situation in which
 we have to reload plugins within the same JVM, but with different
 parameters.

The problem is that someone might get a little too smart. For instance, one
may write a new job that has two IndexingFilters but creates each from
completely different configuration objects, then filters some
documents with the first filter and others with the second. I agree
that this is a bit of a stretch, but it is possible.


Actually thinking a bit further into this, I kind of agree with you. I
initially thought that the best approach would be to change
PluginRepository.get(Configuration) to PluginRepository.get() where
get() just creates a configuration internally and initializes itself
with it. But then we wouldn't be passing JobConf to PluginRepository
but PluginRepository would do something like a
NutchConfiguration.create(), which is probably wrong.

So, all in all, I've come to believe that my (and Nicolas') patch is a
not-so-bad way of fixing this. It allows us to pass JobConf to
PluginRepository and stops creating new PluginRepository-s again and
again...

What do you think?





 --
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




--
Doğacan Güney




--
Doğacan Güney


Re: Plugins initialized all the time!

2007-05-31 Thread Nicolás Lichtmaier



Actually thinking a bit further into this, I kind of agree with you. I
initially thought that the best approach would be to change
PluginRepository.get(Configuration) to PluginRepository.get() where
get() just creates a configuration internally and initializes itself
with it. But then we wouldn't be passing JobConf to PluginRepository
but PluginRepository would do something like a
NutchConfiguration.create(), which is probably wrong.

So, all in all, I've come to believe that my (and Nicolas') patch is a
not-so-bad way of fixing this. It allows us to pass JobConf to
PluginRepository and stops creating new PluginRepository-s again and
again...

What do you think?


IMO a better way would be to add a proper equals() method (and hashCode()) 
to Hadoop's Configuration object that would call 
getProps().equals(o.getProps()), so that you could use them as keys... 
Every class which is a map from keys to values has equals and hashCode 
(Properties, HashMap, etc.).
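A minimal sketch of that delegation, using an illustrative ConfSketch class rather than Hadoop's real Configuration (the class name and set() helper are assumptions for the example):

```java
import java.util.Properties;

// Sketch of the proposed change: a Configuration-like class whose
// equality is delegated to its underlying Properties, so instances
// can serve as map keys. Illustrative only, not Hadoop's actual class.
public class ConfSketch {
    private final Properties props = new Properties();

    public Properties getProps() { return props; }
    public void set(String key, String value) { props.setProperty(key, value); }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ConfSketch)) return false;
        return getProps().equals(((ConfSketch) o).getProps());
    }

    @Override
    public int hashCode() {
        return getProps().hashCode();
    }

    public static void main(String[] args) {
        ConfSketch a = new ConfSketch();
        ConfSketch b = new ConfSketch();
        a.set("fetcher.store.content", "true");
        b.set("fetcher.store.content", "true");
        System.out.println(a.equals(b));               // true: equal properties
        System.out.println(a.hashCode() == b.hashCode()); // true: consistent with equals
    }
}
```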


Another nice thing would be to be able to freeze a configuration 
object, preventing anyone from modifying it.
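A rough illustration of the freezing idea; Hadoop's Configuration has no such feature, so the class and method names below are hypothetical:

```java
import java.util.Properties;

// Hypothetical sketch of a "freezable" configuration: a one-way flag
// that makes any further set() call fail, preventing later mutation.
public class FreezableConf {
    private final Properties props = new Properties();
    private boolean frozen = false;

    public void freeze() { frozen = true; }

    public void set(String key, String value) {
        if (frozen) {
            throw new IllegalStateException("Configuration is frozen");
        }
        props.setProperty(key, value);
    }

    public String get(String key) { return props.getProperty(key); }

    public static void main(String[] args) {
        FreezableConf conf = new FreezableConf();
        conf.set("plugin.includes", "protocol-http");
        conf.freeze();
        try {
            conf.set("plugin.includes", "something-else");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```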




Re: Plugins initialized all the time!

2007-05-30 Thread Doğacan Güney

Hi,

On 5/29/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote:


 Which job causes the problem? Perhaps, we can find out what keeps
 creating a conf object over and over.

 Also, I have tried what you have suggested (better caching for plugin
 repository) and it really seems to make a difference. Can you try with
 this patch(*) to see if it solves your problem?

 (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch

Some comments about your patch. The approach seems nice: you only check
the parameters that affect plugin loading. But bear in mind that the
plugins themselves will configure themselves with many other parameters,
so to keep things safe there should be a PluginRepository for each set
of parameters (including all of them). Besides, remember that CACHE is a
WeakHashMap and you are creating ad-hoc PluginProperty objects as keys;
something doesn't look right... the lifespan of those objects will be
much shorter than you require. Perhaps you should be using
SoftReferences instead, or a simple LRU cache (LinkedHashMap provides
that easily).
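The SoftReference alternative could be sketched roughly as below. Values are held softly, so the JVM reclaims them only under memory pressure, not merely because nobody else holds the key (the risk with ad-hoc keys in a WeakHashMap). Generic type parameters stand in for the real PluginProperty/PluginRepository classes; this is illustrative only:

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Sketch of a cache whose values may be reclaimed under memory
// pressure. A real Nutch cache would use PluginProperty keys and
// PluginRepository values; here they are generic for clarity.
public class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new HashMap<>();

    public synchronized void put(K key, V value) {
        map.put(key, new SoftReference<>(value));
    }

    public synchronized V get(K key) {
        SoftReference<V> ref = map.get(key);
        return ref == null ? null : ref.get(); // null if absent or reclaimed
    }

    public static void main(String[] args) {
        SoftCache<String, String> cache = new SoftCache<>();
        cache.put("key", "value");
        System.out.println(cache.get("key")); // "value" while memory is plentiful
    }
}
```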


My patch is just a draft to see if we can create a better caching
mechanism. There are definitely some rough edges there:)

I don't really worry about WeakHashMap-LinkedHashMap stuff. But your
approach is simple and should be faster so I guess it's OK.

You are right about per-plugin parameters but I think it will be very
difficult to keep PluginProperty class in sync with plugin parameters.
I mean, if a plugin defines a new parameter, we have to remember to
update PluginProperty. Perhaps, we can force plugins to define the
configuration options they will use in, say, their plugin.xml files, but
that will be very error-prone too. I don't want to compare entire
configuration objects, because changing irrelevant options, like
fetcher.store.content, shouldn't force loading plugins again, though it
seems that may be inevitable.



Anyway, I'll try to build my own Nutch to test your patch.

Thanks!





--
Doğacan Güney


Re: Plugins initialized all the time!

2007-05-30 Thread Andrzej Bialecki

Doğacan Güney wrote:


My patch is just a draft to see if we can create a better caching
mechanism. There are definitely some rough edges there:)


One important note: in future versions of Hadoop the method 
Configuration.setObject() is deprecated and will then be removed, so we 
have to grow our own caching mechanism anyway - either use a singleton 
cache, or change nearly all APIs to pass around a user/job/task context.


So, we will face this problem pretty soon, with the next upgrade of Hadoop.




You are right about per-plugin parameters but I think it will be very
difficult to keep PluginProperty class in sync with plugin parameters.
I mean, if a plugin defines a new parameter, we have to remember to
update PluginProperty. Perhaps, we can force plugins to define the
configuration options they will use in, say, their plugin.xml files, but
that will be very error-prone too. I don't want to compare entire
configuration objects, because changing irrelevant options, like
fetcher.store.content, shouldn't force loading plugins again, though it
seems that may be inevitable.


Let me see if I understand this ... In my opinion this is a non-issue.

Child tasks are started in separate JVMs, so the only context 
information that they have is what they can read from job.xml (which is 
a superset of all properties from config files + job-specific data + 
task-specific data). This context is currently instantiated as a 
Configuration object, and we (ab)use it also as a local per-JVM cache 
for plugin instances and other objects.


Once we instantiate the plugins, they exist unchanged throughout the 
lifecycle of JVM (== lifecycle of a single task), so we don't have to 
worry about having different sets of plugins with different parameters 
for different jobs (or even tasks).


In other words, it seems to me that there is no such situation in which 
we have to reload plugins within the same JVM, but with different 
parameters.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Plugins initialized all the time!

2007-05-30 Thread Doğacan Güney

On 5/30/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:

Doğacan Güney wrote:

 My patch is just a draft to see if we can create a better caching
 mechanism. There are definitely some rough edges there:)

One important note: in future versions of Hadoop the method
Configuration.setObject() is deprecated and will then be removed, so we
have to grow our own caching mechanism anyway - either use a singleton
cache, or change nearly all APIs to pass around a user/job/task context.

So, we will face this problem pretty soon, with the next upgrade of Hadoop.


Hmm, well, that sucks, but this is not really a problem for
PluginRepository: PluginRepository already has its own cache
mechanism.





 You are right about per-plugin parameters but I think it will be very
 difficult to keep PluginProperty class in sync with plugin parameters.
 I mean, if a plugin defines a new parameter, we have to remember to
 update PluginProperty. Perhaps, we can force plugins to define the
 configuration options they will use in, say, their plugin.xml files, but
 that will be very error-prone too. I don't want to compare entire
 configuration objects, because changing irrelevant options, like
 fetcher.store.content, shouldn't force loading plugins again, though it
 seems that may be inevitable.

Let me see if I understand this ... In my opinion this is a non-issue.

Child tasks are started in separate JVMs, so the only context
information that they have is what they can read from job.xml (which is
a superset of all properties from config files + job-specific data +
task-specific data). This context is currently instantiated as a
Configuration object, and we (ab)use it also as a local per-JVM cache
for plugin instances and other objects.

Once we instantiate the plugins, they exist unchanged throughout the
lifecycle of JVM (== lifecycle of a single task), so we don't have to
worry about having different sets of plugins with different parameters
for different jobs (or even tasks).

In other words, it seems to me that there is no such situation in which
we have to reload plugins within the same JVM, but with different
parameters.


The problem is that someone might get a little too smart. For instance, one
may write a new job that has two IndexingFilters but creates each from
completely different configuration objects, then filters some
documents with the first filter and others with the second. I agree
that this is a bit of a stretch, but it is possible.




--
Best regards,
Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





--
Doğacan Güney


Re: Plugins initialized all the time!

2007-05-29 Thread Doğacan Güney

Hi,

On 5/28/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote:

I'm having big troubles with nutch 0.9 that I didn't have with 0.8. It seems
that the plugin repository initializes itself all the time until I get
an out of memory exception. I've been looking at the code... the plugin
repository maintains a map from Configuration to plugin repositories, but
the Configuration object does not have an equals or hashCode method...
wouldn't it be nice to add such a method (comparing property values)?
Wouldn't that help prevent initializing many plugin repositories? What
could be the cause of my problem? (Aaah.. so many questions... =) )


Which job causes the problem? Perhaps, we can find out what keeps
creating a conf object over and over.

Also, I have tried what you have suggested (better caching for plugin
repository) and it really seems to make a difference. Can you try with
this patch(*) to see if it solves your problem?

(*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch



Bye!




--
Doğacan Güney


Re: Plugins initialized all the time!

2007-05-29 Thread Briggs

I have also noticed this. The code explicitly loads an instance of the
plugins for every fetch (well, or parse etc., depending on what you
are doing). This causes OutOfMemoryErrors. So, if you dump the heap,
you can see the filter classes get loaded and they never get unloaded
(they are loaded within their own classloaders). So, you'll see the
same class loaded thousands of times, which is bad.

So, in my case, I had to change the way the plugins are loaded.
Basically, I changed all the main plugin loaders (like
URLFilters.java, IndexFilters.java) to be singletons with a single
'getInstance()' method on each. I don't need special configs for
filters so I can deal with singletons.
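The singleton rework described above might look roughly like the sketch below; URLFiltersSketch is a stand-in name, not the real Nutch URLFilters class, and the loading step is elided:

```java
// Sketch of Briggs' workaround: one lazily created loader per JVM
// instead of one per Configuration, so filter classes (and their
// classloaders) are instantiated exactly once. Illustrative only.
public class URLFiltersSketch {
    private static URLFiltersSketch instance;

    private URLFiltersSketch() {
        // load filter extension instances once here
    }

    public static synchronized URLFiltersSketch getInstance() {
        if (instance == null) {
            instance = new URLFiltersSketch();
        }
        return instance;
    }

    public static void main(String[] args) {
        // Every caller sees the same instance.
        System.out.println(getInstance() == getInstance()); // true
    }
}
```

The trade-off is exactly the one stated above: a singleton works only if no job needs differently configured filters.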

You'll find the heart of the problem somewhere in the extension point
class(es). It calls newInstance() an awful lot. But the classloader
(one per plugin) never gets destroyed, or something, so this can be
nasty.

I'm still dealing with my OutOfMemory errors on parsing, yuck.





On 5/29/07, Doğacan Güney [EMAIL PROTECTED] wrote:

Hi,

On 5/28/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote:
 I'm having big troubles with nutch 0.9 that I didn't have with 0.8. It seems
 that the plugin repository initializes itself all the time until I get
 an out of memory exception. I've been looking at the code... the plugin
 repository maintains a map from Configuration to plugin repositories, but
 the Configuration object does not have an equals or hashCode method...
 wouldn't it be nice to add such a method (comparing property values)?
 Wouldn't that help prevent initializing many plugin repositories? What
 could be the cause of my problem? (Aaah.. so many questions... =) )

Which job causes the problem? Perhaps, we can find out what keeps
creating a conf object over and over.

Also, I have tried what you have suggested (better caching for plugin
repository) and it really seems to make a difference. Can you try with
this patch(*) to see if it solves your problem?

(*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch


 Bye!



--
Doğacan Güney




--
Conscious decisions by conscious minds are what make reality real


Re: Plugins initialized all the time!

2007-05-29 Thread Doğacan Güney

On 5/29/07, Briggs [EMAIL PROTECTED] wrote:

I have also noticed this. The code explicitly loads an instance of the
plugins for every fetch (well, or parse etc., depending on what you
are doing). This causes OutOfMemoryErrors. So, if you dump the heap,
you can see the filter classes get loaded and they never get unloaded
(they are loaded within their own classloaders). So, you'll see the
same class loaded thousands of times, which is bad.

So, in my case, I had to change the way the plugins are loaded.
Basically, I changed all the main plugin loaders (like
URLFilters.java, IndexFilters.java) to be singletons with a single
'getInstance()' method on each. I don't need special configs for
filters so I can deal with singletons.

You'll find the heart of the problem somewhere in the extension point
class(es). It calls newInstance() an awful lot. But the classloader
(one per plugin) never gets destroyed, or something, so this can be
nasty.

I'm still dealing with my OutOfMemory errors on parsing, yuck.


Well then can you test the patch too? Nicolas's idea seems to be the
right one. After this patch, I think plugin loaders will see the same
PluginRepository instance.







On 5/29/07, Doğacan Güney [EMAIL PROTECTED] wrote:
 Hi,

 On 5/28/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote:
  I'm having big troubles with nutch 0.9 that I didn't have with 0.8. It seems
  that the plugin repository initializes itself all the time until I get
  an out of memory exception. I've been looking at the code... the plugin
  repository maintains a map from Configuration to plugin repositories, but
  the Configuration object does not have an equals or hashCode method...
  wouldn't it be nice to add such a method (comparing property values)?
  Wouldn't that help prevent initializing many plugin repositories? What
  could be the cause of my problem? (Aaah.. so many questions... =) )

 Which job causes the problem? Perhaps, we can find out what keeps
 creating a conf object over and over.

 Also, I have tried what you have suggested (better caching for plugin
 repository) and it really seems to make a difference. Can you try with
 this patch(*) to see if it solves your problem?

 (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch

 
  Bye!
 


 --
 Doğacan Güney



--
Conscious decisions by conscious minds are what make reality real




--
Doğacan Güney


Re: Plugins initialized all the time!

2007-05-29 Thread Nicolás Lichtmaier



I'm having big troubles with nutch 0.9 that I didn't have with 0.8. It seems
that the plugin repository initializes itself all the time until I get
an out of memory exception. I've been looking at the code... the plugin
repository maintains a map from Configuration to plugin repositories, but
the Configuration object does not have an equals or hashCode method...
wouldn't it be nice to add such a method (comparing property values)?
Wouldn't that help prevent initializing many plugin repositories? What
could be the cause of my problem? (Aaah.. so many questions... =) )


Which job causes the problem? Perhaps, we can find out what keeps
creating a conf object over and over.

Also, I have tried what you have suggested (better caching for plugin
repository) and it really seems to make a difference. Can you try with
this patch(*) to see if it solves your problem?

(*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch


I'm running it. So far it's working ok, and I haven't seen all those 
plugin loadings...


I've modified your patch though to define CACHE like this:

 private static final Map<PluginProperty, PluginRepository> CACHE =
     new LinkedHashMap<PluginProperty, PluginRepository>() {
       @Override
       protected boolean removeEldestEntry(
           Map.Entry<PluginProperty, PluginRepository> eldest) {
         return size() > 10;
       }
     };

...which means an LRU cache with a fixed size of 10.
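The eviction behavior of such a removeEldestEntry override can be demonstrated in isolation; this is a generic sketch with capacity 3 for brevity, not tied to the Nutch classes:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruDemo {
    // Same pattern as the patch above: a LinkedHashMap that evicts its
    // eldest (insertion-order) entry once it grows past the capacity.
    static Map<String, String> newCache(final int capacity) {
        return new LinkedHashMap<String, String>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity;
            }
        };
    }

    public static void main(String[] args) {
        Map<String, String> cache = newCache(3);
        for (int i = 1; i <= 4; i++) {
            cache.put("k" + i, "v" + i);
        }
        System.out.println(cache.containsKey("k1")); // false: eldest entry evicted
        System.out.println(cache.size());            // 3
    }
}
```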



Plugins initialized all the time!

2007-05-28 Thread Nicolás Lichtmaier
I'm having big troubles with nutch 0.9 that I didn't have with 0.8. It seems 
that the plugin repository initializes itself all the time until I get 
an out of memory exception. I've been looking at the code... the plugin 
repository maintains a map from Configuration to plugin repositories, but 
the Configuration object does not have an equals or hashCode method... 
wouldn't it be nice to add such a method (comparing property values)? 
Wouldn't that help prevent initializing many plugin repositories? What 
could be the cause of my problem? (Aaah.. so many questions... =) )


Bye!


Re: Plugins initialized all the time!

2007-05-28 Thread Nicolás Lichtmaier


More info...

I see the map progressing from 0% to 100%. It seems to reload plugins when 
reaching 100%. Besides, I've realized that each NutchJob is a 
Configuration, so (as there's no equals) a plugin repo would be 
created for each NutchJob...