[jira] Commented: (NUTCH-489) URLFilter-suffix management of the url path when the url contains some query parameters

2007-05-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499777
 ] 

Doğacan Güney commented on NUTCH-489:
-

Please ignore my last comment. I don't know what I was on when I wrote that. 
Your patch seems to be correct.

Andrzej, (hope you are reading this:) isn't javadoc in SuffixURLFilter wrong? 
It says , you should use +.jpg instead., but AFAICS, that line would 
have no effect except possibly switching the mode to accept. Since the first 
character after '+' is not 'I', the filter will just skip the rest of the line. 
Am I missing something here?

 URLFilter-suffix management of the url path when the url contains some query 
 parameters
 ---

 Key: NUTCH-489
 URL: https://issues.apache.org/jira/browse/NUTCH-489
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: Windows, Java 1.5
Reporter: Emmanuel Joke
 Attachments: suffix-urlfilter.txt.patch, SuffixURLFilter.java.patch, 
 SuffixURLFilter_v2.java.patch


 The current filter compares only at the string level: it tries to apply the 
 filter to the full URL (path + query parameters).
 So even if we have defined our filter to exclude all .js extensions, it 
 won't exclude this URL: http://www.toto.com/from.js?id=5.
 I've added a new parameter to the filter which can be used to configure the 
 filter to exclude URLs based on the URL path.
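
 A minimal sketch of the difference described above, assuming the suffix is
 compared against the path component returned by java.net.URI rather than
 against the whole URL string. The class and method names (PathSuffixSketch,
 matchesFullUrl, matchesPath) are hypothetical and are not taken from the
 attached patches.

```java
import java.net.URI;

// Hypothetical illustration; not the code from the attached patches.
class PathSuffixSketch {

  // String-level check: the query string hides the ".js" suffix.
  static boolean matchesFullUrl(String url, String suffix) {
    return url.endsWith(suffix);
  }

  // Path-level check: only the path component ("/from.js") is compared.
  static boolean matchesPath(String url, String suffix) {
    String path = URI.create(url).getPath();
    return path != null && path.endsWith(suffix);
  }

  public static void main(String[] args) {
    String url = "http://www.toto.com/from.js?id=5";
    System.out.println(matchesFullUrl(url, ".js")); // false
    System.out.println(matchesPath(url, ".js"));    // true
  }
}
```

 With the string-level check the example URL slips past a ".js" rule; with
 the path-level check it is caught.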

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: proposal for committer

2007-05-29 Thread Enis Soztutar

Gal Nitzan wrote:

Hi,

Since I'm no committer I can't really propose :-) but I just thought to draw 
some attention to the great work done on the dev/users lists and also the many 
patches created by Doğacan Güney...


Just my 2 cents...

Gal.





  

+1


Re: Plugins initialized all the time!

2007-05-29 Thread Doğacan Güney

Hi,

On 5/28/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote:

I'm having big trouble with nutch 0.9 that I didn't have with 0.8. It seems
that the plugin repository initializes itself all the time until I get
an out of memory exception. I've been looking at the code... the plugin
repository maintains a map from Configuration to plugin repositories, but
the Configuration object does not have an equals or hashCode method...
wouldn't it be nice to add such a method (comparing property values)?
Wouldn't that help prevent initializing many plugin repositories? What
could be the cause of my problem? (Aaah.. so many questions... =) )


Which job causes the problem? Perhaps, we can find out what keeps
creating a conf object over and over.

Also, I have tried what you have suggested (better caching for plugin
repository) and it really seems to make a difference. Can you try with
this patch(*) to see if it solves your problem?

(*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch
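
To illustrate why a map keyed by an object without equals/hashCode keeps
missing, here is a small sketch. IdentityConf and ValueConf are hypothetical
stand-ins for a configuration object; they are not the real Configuration
class and this is not the code in the patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-ins; neither class is the real Configuration.
class CacheKeySketch {

  // No equals/hashCode: two instances with the same contents are different keys.
  static final class IdentityConf {
    final String plugins;
    IdentityConf(String plugins) { this.plugins = plugins; }
  }

  // equals/hashCode compare the property value, so equal contents share a key.
  static final class ValueConf {
    final String plugins;
    ValueConf(String plugins) { this.plugins = plugins; }
    @Override public boolean equals(Object o) {
      return o instanceof ValueConf && Objects.equals(plugins, ((ValueConf) o).plugins);
    }
    @Override public int hashCode() { return Objects.hashCode(plugins); }
  }

  // Number of cache entries after n lookups with freshly built, equal-valued keys.
  static int entriesWithIdentityKeys(int n) {
    Map<IdentityConf, Object> cache = new HashMap<>();
    for (int i = 0; i < n; i++) {
      cache.computeIfAbsent(new IdentityConf("parse-html"), k -> new Object());
    }
    return cache.size();
  }

  static int entriesWithValueKeys(int n) {
    Map<ValueConf, Object> cache = new HashMap<>();
    for (int i = 0; i < n; i++) {
      cache.computeIfAbsent(new ValueConf("parse-html"), k -> new Object());
    }
    return cache.size();
  }

  public static void main(String[] args) {
    System.out.println(entriesWithIdentityKeys(5)); // 5: every lookup misses
    System.out.println(entriesWithValueKeys(5));    // 1: later lookups hit
  }
}
```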



Bye!




--
Doğacan Güney


Re: Plugins initialized all the time!

2007-05-29 Thread Briggs

I have also noticed this. The code explicitly loads an instance of the
plugins for every fetch (well, or parse etc., depending on what you
are doing). This causes OutOfMemoryErrors. So, if you dump the heap,
you can see the filter classes get loaded and then never get unloaded
(they are loaded within their own classloader). So, you'll see the
same class loaded thousands of times, which is bad.

So, in my case, I had to change the way the plugins are loaded.
Basically, I changed all the main plugin loaders (like
URLFilters.java, IndexFilters.java) to be singletons with a single
'getInstance()' method on each. I don't need special configs for
filters so I can deal with singletons.
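
As a rough sketch of that singleton approach (FilterLoaderSketch is a
hypothetical stand-in, not the actual URLFilters or IndexFilters class, and
the LOADS counter exists only to make the effect visible):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a plugin loader class.
class FilterLoaderSketch {

  // Counts how many times the (expensive) loading step has run.
  static final AtomicInteger LOADS = new AtomicInteger();

  private FilterLoaderSketch() {
    // The real loader would instantiate its plugins here, once.
    LOADS.incrementAndGet();
  }

  // Holder idiom: the instance is created lazily and exactly once by the JVM.
  private static final class Holder {
    static final FilterLoaderSketch INSTANCE = new FilterLoaderSketch();
  }

  static FilterLoaderSketch getInstance() {
    return Holder.INSTANCE;
  }

  public static void main(String[] args) {
    for (int i = 0; i < 1000; i++) {
      getInstance();
    }
    System.out.println(LOADS.get()); // 1
  }
}
```

The trade-off is the one mentioned above: a singleton cannot vary per
configuration.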

You'll find the heart of the problem somewhere in the extension point
class(es).  It calls newInstance() an awful lot. But the classloader
(one per plugin) never gets destroyed, or something, so this can be
nasty.

I'm still dealing with my OutOfMemory errors on parsing, yuck.








--
Conscious decisions by conscious minds are what make reality real


Re: Plugins initialized all the time!

2007-05-29 Thread Doğacan Güney

On 5/29/07, Briggs [EMAIL PROTECTED] wrote:

I have also noticed this. The code explicitly loads an instance of the
plugins for every fetch (well, or parse etc., depending on what you
are doing). This causes OutOfMemoryErrors. So, if you dump the heap,
you can see the filter classes get loaded and then never get unloaded
(they are loaded within their own classloader). So, you'll see the
same class loaded thousands of times, which is bad.

So, in my case, I had to change the way the plugins are loaded.
Basically, I changed all the main plugin loaders (like
URLFilters.java, IndexFilters.java) to be singletons with a single
'getInstance()' method on each. I don't need special configs for
filters so I can deal with singletons.

You'll find the heart of the problem somewhere in the extension point
class(es).  It calls newInstance() an awful lot. But the classloader
(one per plugin) never gets destroyed, or something, so this can be
nasty.

I'm still dealing with my OutOfMemory errors on parsing, yuck.


Well then can you test the patch too? Nicolas's idea seems to be the
right one. After this patch, I think plugin loaders will see the same
PluginRepository instance.










--
Doğacan Güney


Re: proposal for committer

2007-05-29 Thread Doug Cutting
Personnel discussions are conducted on the PMC's private mailing list. 
I have forwarded your message there.


Thanks for the suggestion!

Doug

Gal Nitzan wrote:

Hi,

Since I'm no committer I can't really propose :-) but I just thought to draw 
some attention to the great work done on the dev/users lists and also the many 
patches created by Doğacan Güney...


Just my 2 cents...

Gal.






Re: Plugins initialized all the time!

2007-05-29 Thread Nicolás Lichtmaier



I'm having big trouble with nutch 0.9 that I didn't have with 0.8. It seems
that the plugin repository initializes itself all the time until I get
an out of memory exception. I've been looking at the code... the plugin
repository maintains a map from Configuration to plugin repositories, but
the Configuration object does not have an equals or hashCode method...
wouldn't it be nice to add such a method (comparing property values)?
Wouldn't that help prevent initializing many plugin repositories? What
could be the cause of my problem? (Aaah.. so many questions... =) )


Which job causes the problem? Perhaps, we can find out what keeps
creating a conf object over and over.

Also, I have tried what you have suggested (better caching for plugin
repository) and it really seems to make a difference. Can you try with
this patch(*) to see if it solves your problem?

(*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch


I'm running it. So far it's working ok, and I haven't seen all those 
plugin loadings...


I've modified your patch though to define CACHE like this:

 private static final Map<PluginProperty, PluginRepository> CACHE =
     new LinkedHashMap<PluginProperty, PluginRepository>() {
       @Override
       protected boolean removeEldestEntry(
           Entry<PluginProperty, PluginRepository> eldest) {
         return size() > 10;
       }
     };

...which means an LRU cache with a fixed size of 10.
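
One detail that may be worth checking, sketched here under assumptions: a
LinkedHashMap created with its no-argument constructor iterates in insertion
order, so removeEldestEntry evicts the oldest inserted entry regardless of
how recently it was read. Passing accessOrder = true to the three-argument
constructor gives least-recently-used eviction instead. The class name
LruSketch, the helper names and the String keys are placeholders, not the
PluginProperty/PluginRepository types from the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of bounded LinkedHashMap caches.
class LruSketch {

  // Bounded map; accessOrder = true makes get() move an entry to the "young" end.
  static <K, V> Map<K, V> boundedMap(final int maxEntries, boolean accessOrder) {
    return new LinkedHashMap<K, V>(16, 0.75f, accessOrder) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
      }
    };
  }

  // Insert a and b, read a, then insert c into a map bounded at two entries.
  static void exercise(Map<String, String> m) {
    m.put("a", "1");
    m.put("b", "2");
    m.get("a");
    m.put("c", "3");
  }

  public static void main(String[] args) {
    Map<String, String> insertionOrdered = boundedMap(2, false);
    Map<String, String> accessOrdered = boundedMap(2, true);
    exercise(insertionOrdered);
    exercise(accessOrdered);
    System.out.println(insertionOrdered.keySet()); // [b, c]: "a" evicted despite the read
    System.out.println(accessOrdered.keySet());    // [a, c]: "b" was least recently used
  }
}
```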



[jira] Created: (NUTCH-493) contentType parse not correctly,,,,got empty content using readseg -get

2007-05-29 Thread wangxu (JIRA)
contentType parse not correctly,,,,got empty content using readseg -get
---

 Key: NUTCH-493
 URL: https://issues.apache.org/jira/browse/NUTCH-493
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
 Environment: java version 1.5.0_04

Linux localhost 2.6.8-2-386 #1 Tue Aug 16 12:46:35 UTC 2005 i686 GNU/Linux
Reporter: wangxu


I am using Nutch 0.9.
I found that many of my crawled pages have empty content.
I then checked the log and found the corresponding warning: the ContentType is 
said to be url=http://.., and no suitable parser can be 
found for the page:


parser not found for contentType=
url=http://product.dangdang.com/product.aspx?product_id=490321


Most pages of this kind have empty content,
but I did not find any warning or error other than timeouts in the fetcher log.

Can somebody explain why?
Many thanks!


