[jira] Commented: (NUTCH-489) URLFilter-suffix management of the url path when the url contains some query parameters
[ https://issues.apache.org/jira/browse/NUTCH-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499777 ]

Doğacan Güney commented on NUTCH-489:

Please ignore my last comment. I don't know what I was on when I wrote that. Your patch seems to be correct.

Andrzej (hope you are reading this :), isn't the javadoc in SuffixURLFilter wrong? It says , you should use +.jpg instead., but AFAICS, that line would have no effect except possibly switching the mode to accept. Since the first character after '+' is not 'I', the filter will just skip the rest of the line. Am I missing something here?

URLFilter-suffix management of the url path when the url contains some query parameters

Key: NUTCH-489
URL: https://issues.apache.org/jira/browse/NUTCH-489
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Environment: Windows, Java 1.5
Reporter: Emmanuel Joke
Attachments: suffix-urlfilter.txt.patch, SuffixURLFilter.java.patch, SuffixURLFilter_v2.java.patch

The current filter compares only at the string level. It tries to apply the filter to the full URL (path + query parameters). So, even if we have defined our filter to exclude all js extensions, it won't exclude this URL: http://www.toto.com/from.js?id=5. I've added a new parameter to the filter which can be used to configure it to exclude URLs based on the URL path.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
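A minimal sketch of the difference the issue describes, assuming a plain suffix comparison: matching against the whole URL string misses http://www.toto.com/from.js?id=5, while matching against the path component catches it. The class and method names are hypothetical and are not taken from SuffixURLFilter or from the attached patches.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not the SuffixURLFilter code.
class PathSuffixSketch {

    // Compares the suffix against the full URL string (path + query).
    static boolean fullUrlEndsWith(String url, String suffix) {
        return url.endsWith(suffix);
    }

    // Compares the suffix against the path component only, ignoring the query.
    static boolean pathEndsWith(String url, String suffix) {
        String path = URI.create(url).getPath();
        return path != null && path.endsWith(suffix);
    }

    public static void main(String[] args) {
        String url = "http://www.toto.com/from.js?id=5";
        System.out.println(fullUrlEndsWith(url, ".js")); // false: the string ends with "?id=5"
        System.out.println(pathEndsWith(url, ".js"));    // true: the path is "/from.js"
    }
}
```

URI.create throws IllegalArgumentException on a malformed URL, so a real filter would have to decide how to treat such input.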
Re: proposal for committer
Gal Nitzan wrote:
> Hi,
> Since I'm no committer I can't really propose :-) but I just thought to draw some attention to the great work done on the dev/users lists and also the many patches created by Doğacan Güney...
> Just my 2 cents...
> Gal.

+1
Re: Plugins initialized all the time!
Hi,

On 5/28/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote:
> I'm having big trouble with nutch 0.9 that I didn't have with 0.8. It seems that the plugin repository initializes itself all the time until I get an out of memory exception. I've been looking at the code... the plugin repository maintains a map from Configuration to plugin repositories, but the Configuration object does not have an equals or hashCode method... wouldn't it be nice to add such a method (comparing property values)? Wouldn't that help prevent initializing many plugin repositories?
> What could be the cause of my problem? (Aaah.. so many questions... =) )

Which job causes the problem? Perhaps we can find out what keeps creating a conf object over and over.

Also, I have tried what you have suggested (better caching for plugin repository) and it really seems to make a difference. Can you try with this patch(*) to see if it solves your problem?

(*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch

Bye!

--
Doğacan Güney
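To illustrate why a map keyed by an object without equals/hashCode behaves as described in this thread, here is a small sketch. All types are hypothetical stand-ins (they are not Nutch or Hadoop classes), and a plain HashMap is assumed for the cache: every fresh key object misses, even when its property values are identical, whereas a value-comparing key hits.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-ins; none of these types are Nutch or Hadoop classes.
class ConfKeySketch {

    // A key type that inherits identity-based equals/hashCode from Object.
    static final class IdentityConf {
        final String pluginFolders;
        IdentityConf(String pluginFolders) { this.pluginFolders = pluginFolders; }
    }

    // A key type that compares by property value instead.
    static final class ValueKey {
        final String pluginFolders;
        ValueKey(String pluginFolders) { this.pluginFolders = pluginFolders; }

        @Override
        public boolean equals(Object other) {
            return other instanceof ValueKey
                && Objects.equals(pluginFolders, ((ValueKey) other).pluginFolders);
        }

        @Override
        public int hashCode() { return Objects.hashCode(pluginFolders); }
    }

    // Looks up the cache n times, each time with a fresh but equal-valued identity key.
    static int entriesWithIdentityKeys(int n) {
        Map<IdentityConf, Object> cache = new HashMap<>();
        for (int i = 0; i < n; i++) {
            cache.computeIfAbsent(new IdentityConf("plugins"), k -> new Object());
        }
        return cache.size();
    }

    // Same lookups, but with keys that compare by value.
    static int entriesWithValueKeys(int n) {
        Map<ValueKey, Object> cache = new HashMap<>();
        for (int i = 0; i < n; i++) {
            cache.computeIfAbsent(new ValueKey("plugins"), k -> new Object());
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println(entriesWithIdentityKeys(5)); // 5: every lookup created a new entry
        System.out.println(entriesWithValueKeys(5));    // 1: later lookups reused the first entry
    }
}
```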
Re: Plugins initialized all the time!
I have also noticed this. The code explicitly loads an instance of the plugins for every fetch (or parse etc., depending on what you are doing). This causes OutOfMemoryErrors. If you dump the heap, you can see that the filter classes get loaded and they never get unloaded (they are loaded within their own classloader). So you'll see the same class loaded thousands of times, which is bad.

In my case, I had to change the way the plugins are loaded. Basically, I changed all the main plugin loaders (like URLFilters.java, IndexFilters.java) to be singletons with a single 'getInstance()' method on each. I don't need special configs for filters, so I can deal with singletons.

You'll find the heart of the problem somewhere in the extension point class(es). It calls newInstance() an awful lot. But the classloader (one per plugin) never gets destroyed, or something, so this can be nasty. I'm still dealing with my OutOfMemory errors on parsing, yuck.

--
Conscious decisions by conscious minds are what make reality real
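The singleton workaround described above could look roughly like the following. This is only a sketch of the general pattern (a lazily initialized holder); the class name is hypothetical and this is not the modified URLFilters or IndexFilters code. As noted in the message, it only suits setups that do not need a separate configuration per filter set.

```java
// Hypothetical loader class; illustrates the getInstance() pattern only.
class FilterLoaderSketch {

    // Counts constructor calls so the single initialization can be observed.
    private static int constructed;

    // The holder class is initialized on the first getInstance() call,
    // and the JVM guarantees that this happens exactly once.
    private static final class Holder {
        static final FilterLoaderSketch INSTANCE = new FilterLoaderSketch();
    }

    private FilterLoaderSketch() {
        constructed++; // expensive plugin loading would happen here, once
    }

    static FilterLoaderSketch getInstance() {
        return Holder.INSTANCE;
    }

    static int constructedCount() {
        return constructed;
    }

    public static void main(String[] args) {
        FilterLoaderSketch first = getInstance();
        FilterLoaderSketch second = getInstance();
        System.out.println(first == second);    // true
        System.out.println(constructedCount()); // 1
    }
}
```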
Re: Plugins initialized all the time!
On 5/29/07, Briggs [EMAIL PROTECTED] wrote:
> I have also noticed this. The code explicitly loads an instance of the plugins for every fetch (or parse etc., depending on what you are doing). This causes OutOfMemoryErrors. If you dump the heap, you can see that the filter classes get loaded and they never get unloaded (they are loaded within their own classloader). So you'll see the same class loaded thousands of times, which is bad.
> In my case, I had to change the way the plugins are loaded. Basically, I changed all the main plugin loaders (like URLFilters.java, IndexFilters.java) to be singletons with a single 'getInstance()' method on each. I don't need special configs for filters, so I can deal with singletons.
> You'll find the heart of the problem somewhere in the extension point class(es). It calls newInstance() an awful lot. But the classloader (one per plugin) never gets destroyed, or something, so this can be nasty. I'm still dealing with my OutOfMemory errors on parsing, yuck.

Well then, can you test the patch too? Nicolas's idea seems to be the right one. After this patch, I think plugin loaders will see the same PluginRepository instance.

--
Doğacan Güney
Re: proposal for committer
Personnel discussions are conducted on the PMC's private mailing list. I have forwarded your message there. Thanks for the suggestion!

Doug

Gal Nitzan wrote:
> Hi,
> Since I'm no committer I can't really propose :-) but I just thought to draw some attention to the great work done on the dev/users lists and also the many patches created by Doğacan Güney...
> Just my 2 cents...
> Gal.
Re: Plugins initialized all the time!
> Also, I have tried what you have suggested (better caching for plugin repository) and it really seems to make a difference. Can you try with this patch(*) to see if it solves your problem?
> (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch

I'm running it. So far it's working OK, and I haven't seen all those plugin loadings...

I've modified your patch, though, to define CACHE like this:

    private static final Map<PluginProperty, PluginRepository> CACHE =
        new LinkedHashMap<PluginProperty, PluginRepository>() {
          @Override
          protected boolean removeEldestEntry(
              Entry<PluginProperty, PluginRepository> eldest) {
            return size() > 10;
          }
        };

...which means an LRU cache with a fixed size of 10.
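One detail worth keeping in mind about the snippet above: a LinkedHashMap created with the no-argument constructor iterates in insertion order, so removeEldestEntry then evicts the oldest inserted entry rather than the least recently used one. Access-order (LRU) eviction needs the three-argument constructor with accessOrder set to true. The sketch below shows that variant; the class name and the size limit are illustrative assumptions and are not part of the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded cache; not the CACHE field from the patch.
class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    LruCacheSketch(int maxEntries) {
        // accessOrder = true: get() and put() move an entry to the "newest" end.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true drops the eldest entry.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCacheSketch<String, Integer> cache = new LruCacheSketch<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");    // "a" becomes the most recently used entry
        cache.put("c", 3); // evicts "b", the least recently used entry
        System.out.println(cache.keySet()); // [a, c]
    }
}
```

Like LinkedHashMap itself, this sketch is not thread-safe; a shared cache would need external synchronization, for example via Collections.synchronizedMap.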
[jira] Created: (NUTCH-493) contentType parse not correctly,,,,got empty content using readseg -get
contentType parse not correctly,,,,got empty content using readseg -get

Key: NUTCH-493
URL: https://issues.apache.org/jira/browse/NUTCH-493
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.9.0
Environment: java version 1.5.0_04; Linux localhost 2.6.8-2-386 #1 Tue Aug 16 12:46:35 UTC 2005 i686 GNU/Linux
Reporter: wangxu

I am using Nutch 0.9. I found that the content of many of my crawled pages is empty. I then checked the log and found the corresponding warning: the ContentType is said to be url=http://.., and no suitable parser can be found for the page:

parser not found for contentType= url=http://product.dangdang.com/product.aspx?product_id=490321

Most pages of this kind have empty content, but I did not find any warning or error other than timeouts in the fetcher log. Can somebody explain why? Many thanks!