[jira] Created: (NUTCH-356) Plugin repository cache can lead to memory leak
Plugin repository cache can lead to memory leak --- Key: NUTCH-356 URL: http://issues.apache.org/jira/browse/NUTCH-356 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Reporter: Enrico Triolo Attachments: NutchTest.java, patch.txt While I was trying to solve a problem I reported a while ago (see NUTCH-314), I found out that the problem was actually related to the plugin cache used in the class PluginRepository.java. As I said in NUTCH-314, I think I somehow 'force' the way nutch is meant to work, since I need to frequently submit new urls and append their contents to the index; I don't (and can't) have an urls.txt file with all the urls I'm going to fetch, but recreate it each time a new url is submitted. Thus, I think in the majority of cases you won't have problems using nutch as-is, since the problem I found occurs only if nutch is used in a way similar to mine. To simplify your test I'm attaching a class that performs something similar to what I need. It fetches and indexes some sample urls; to avoid webmasters' complaints I left the sample urls list empty, so you should modify the source code and add some urls. Then you only have to run it and watch your memory consumption with top. In my experience I get an OutOfMemoryException after a couple of minutes, but it clearly depends on your heap settings and on the plugins you are using (I'm using 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic'). The problem is bound to the PluginRepository 'singleton' instance, since it never gets released. It seems that some class maintains a reference to it, and that class is never released since it is cached somewhere in the configuration. So I modified the PluginRepository's 'get' method so that it never uses the cache and always returns a new instance (you can find the patch in the attachment). 
This way the memory consumption is always stable and I get no OOM anymore. Clearly this is not the solution, since I guess there are many performance issues involved, but for the moment it works. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
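The cycle described above (a static cache keyed by Configuration that is never evicted) can be sketched in isolation. The classes below are hypothetical stand-ins, not Nutch's real Configuration or PluginRepository; a WeakHashMap is one conventional way to let cache entries be collected once a Configuration is otherwise unreachable, so the cache cannot grow without bound.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical stand-ins for Nutch's Configuration and PluginRepository;
// only the caching pattern is illustrated, not the real 0.8 classes.
public class CacheSketch {

    public static class Config {
    }

    public static class Repository {
        // Weak keys: once a Config is unreachable elsewhere, its entry
        // becomes collectable, unlike in a plain HashMap-backed cache.
        private static final Map<Config, Repository> CACHE =
                Collections.synchronizedMap(new WeakHashMap<Config, Repository>());

        private Repository() {
        }

        public static Repository get(Config conf) {
            synchronized (CACHE) {
                Repository repo = CACHE.get(conf);
                if (repo == null) {
                    repo = new Repository();
                    CACHE.put(conf, repo);
                }
                return repo;
            }
        }
    }

    public static void main(String[] args) {
        Config c = new Config();
        // The same Configuration yields the same cached repository.
        System.out.println(Repository.get(c) == Repository.get(c)); // true
    }
}
```

With a single long-lived Configuration this behaves exactly like the existing singleton cache; the difference only shows up when many short-lived Configurations are created, as in Enrico's test class.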
[jira] Updated: (NUTCH-346) Improve readability of logs/hadoop.log
[ http://issues.apache.org/jira/browse/NUTCH-346?page=all ] Renaud Richardet updated NUTCH-346: --- Attachment: log4j_plugins.diff OK, here we go. This patch should be good for 0.8 and trunk. Improve readability of logs/hadoop.log -- Key: NUTCH-346 URL: http://issues.apache.org/jira/browse/NUTCH-346 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Environment: ubuntu dapper Reporter: Renaud Richardet Priority: Minor Attachments: log4j_plugins.diff Adding log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN to conf/log4j.properties dramatically improves the readability of the logs in logs/hadoop.log (it removes all the INFO entries).
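In context, the change amounts to one extra line in conf/log4j.properties. The surrounding lines here are only illustrative of a stock configuration, not the actual contents of log4j_plugins.diff:

```properties
# conf/log4j.properties (surrounding defaults illustrative)
log4j.rootLogger=INFO,DRFA

# The addition from NUTCH-346: silence per-plugin INFO chatter
# emitted by the plugin repository at startup.
log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN
```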
Re: 0.8 not loading plugins
I'm at a loss on this. I'm going to revert to using 0.7.2. If anyone has any insight on my problem, I would appreciate hearing from you. Chris Stephens wrote: By manually copying the custom-meta directory in build/plugin to plugin/ I was able to get at least some debug output in my log. It doesn't really tell me much; any idea why it wouldn't be loading the plugin when it has the correct entry in my nutch-site.xml? 2006-08-18 13:34:35,007 DEBUG plugin.PluginRepository - parsing: /usr/local/nutch-0.8/plugins/custom-meta/plugin.xml 2006-08-18 13:34:35,010 DEBUG plugin.PluginRepository - plugin: id=custommeta name=Custom Meta Parser/Filter version=0.0.1 provider=liveoakinteractive.com class=null 2006-08-18 13:34:35,010 DEBUG plugin.PluginRepository - impl: point=org.apache.nutch.parse.HtmlParseFilter class=org.liveoak.nutch.parse.custommeta.CustomMetaParser 2006-08-18 13:34:35,011 DEBUG plugin.PluginRepository - impl: point=org.apache.nutch.indexer.IndexingFilter class=org.liveoak.nutch.parse.custommeta.CustomMetaIndexer 2006-08-18 13:34:35,011 DEBUG plugin.PluginRepository - impl: point=org.apache.nutch.searcher.QueryFilter class=org.liveoak.nutch.parse.custommeta.CustomMetaQueryFilter 2006-08-18 13:34:35,244 DEBUG plugin.PluginRepository - not including: custommeta
[jira] Commented: (NUTCH-355) The title of query result could like the summary have the highlight??
[ http://issues.apache.org/jira/browse/NUTCH-355?page=comments#action_12429450 ] King Kong commented on NUTCH-355: - I add a class named Titler:

package org.apache.nutch.searcher;

// imports elided in the original comment
import ...

public class Titler implements Configurable {
  private int maxLength = 20;
  private Analyzer analyzer = null;
  private Configuration conf = null;

  public Titler() {
  }

  public Titler(Configuration conf) {
    setConf(conf);
  }

  /* implementation: Configurable */
  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    this.analyzer = new NutchDocumentAnalyzer(conf);
    this.maxLength = conf.getInt("searcher.title.maxlength", 40);
  }

  public Summary getSummary(String text, Query query) {
    Token[] tokens = getTokens(text); // parse text to token array
    if (tokens.length == 0)
      return new Summary();

    String[] terms = query.getTerms();
    HashSet highlight = new HashSet(); // put query terms in table
    for (int i = 0; i < terms.length; i++)
      highlight.add(terms[i]);

    Summary s = new Summary();
    int offset = 0;
    for (int i = 0; i < tokens.length && tokens[i].startOffset() < this.maxLength; i++) {
      Token token = tokens[i];
      // If we find a term that's in the query...
      if (highlight.contains(token.termText())) {
        s.add(new Fragment(text.substring(offset, token.startOffset())));
        s.add(new Highlight(text.substring(token.startOffset(), token.endOffset())));
        offset = token.endOffset();
      }
    }
    s.add(new Fragment(text.substring(offset, Math.min(text.length(), this.maxLength))));
    if (text.length() > this.maxLength) {
      s.add(new Ellipsis());
    }
    return s;
  }

  /** Maximum number of tokens to inspect in a summary. */
  private static final int token_deep = 1000;

  private Token[] getTokens(String text) {
    ArrayList result = new ArrayList();
    TokenStream ts = analyzer.tokenStream("title", new StringReader(text));
    Token token = null;
    while (result.size() < token_deep) {
      try {
        token = ts.next();
      } catch (IOException e) {
        token = null;
      }
      if (token == null) {
        break;
      }
      result.add(token);
    }
    try {
      ts.close();
    } catch (IOException e) {
      // ignore
    }
    return (Token[]) result.toArray(new Token[result.size()]);
  }
}

then, I add a titler property to NutchBean:

public class NutchBean ... {
  ...
  private Titler titler;
  ...
  public NutchBean(Configuration conf, Path dir) throws IOException {
    ...
    this.titler = new Titler(conf);
  }
  ...
  // add getTitle() with highlight
  public Summary getTitle(HitDetails hit, Query query) throws IOException {
    return titler.getSummary(hit.getValue("title"), query);
  }
}

finally, in search.jsp,

String title = detail.getValue("title");

change to:

String title = bean.getTitle(detail, query).toHtml(true);

and

<a target="_blank" href="<%=url%>"><%=Entities.encode(title)%></a>

change to:

<a target="_blank" href="<%=url%>"><%=title%></a>

I recompiled, and it works well, but I don't know if I can do it like this. Could you give me any suggestions? The title of query result could like the summary have the highlight?? -- Key: NUTCH-355 URL: http://issues.apache.org/jira/browse/NUTCH-355 Project: Nutch Issue Type: Wish Components: searcher Affects Versions: 0.8 Environment: all Reporter: King Kong I'd like to make the title highlighted, but I can't find out how to do it. When I query Nutch, the result looks like this: <a href="http://lucene.apache.org/nutch/">Welcome to <b>Nutch</b>!</a> This is the first <b>Nutch</b> release as an Apache Lucene sub-project. See CHANGES.txt for details. The release is available here. ... <b>Nutch</b> has now graduated from the Apache incubator, and is now a subproject of Lucene. ...
[jira] Commented: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled
[ http://issues.apache.org/jira/browse/NUTCH-354?page=comments#action_12429496 ] Stefan Groschupf commented on NUTCH-354: Since this issue is already closed I can not attach the patch file, so I attach it as text within this comment. If you need the file, let me know and I will send you an off-list mail.

Index: src/test/org/apache/nutch/crawl/TestMapWritable.java
===
--- src/test/org/apache/nutch/crawl/TestMapWritable.java (revision 432325)
+++ src/test/org/apache/nutch/crawl/TestMapWritable.java (working copy)
@@ -180,6 +180,31 @@
     assertEquals(before, after);
   }

+  public void testRecycling() throws Exception {
+    UTF8 value = new UTF8("value");
+    UTF8 key1 = new UTF8("a");
+    UTF8 key2 = new UTF8("b");
+
+    MapWritable writable = new MapWritable();
+    writable.put(key1, value);
+    assertEquals(writable.get(key1), value);
+    assertNull(writable.get(key2));
+
+    DataOutputBuffer dob = new DataOutputBuffer();
+    writable.write(dob);
+    writable.clear();
+    writable.put(key1, value);
+    writable.put(key2, value);
+    assertEquals(writable.get(key1), value);
+    assertEquals(writable.get(key2), value);
+
+    DataInputBuffer dib = new DataInputBuffer();
+    dib.reset(dob.getData(), dob.getLength());
+    writable.readFields(dib);
+    assertEquals(writable.get(key1), value);
+    assertNull(writable.get(key2));
+  }
+
   public static void main(String[] args) throws Exception {
     TestMapWritable writable = new TestMapWritable();
     writable.testPerformance();

MapWritable, nextEntry is not reset when Entries are recycled -- Key: NUTCH-354 URL: http://issues.apache.org/jira/browse/NUTCH-354 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.9.0, 0.8.1 Attachments: resetNextEntryInMapWritableV1.patch MapWritable recycles entries from its internal linked list for performance reasons. The nextEntry of an entry is not reset when a recyclable entry is found. This can cause wrong data in a MapWritable.
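The bug the test above targets (a recycled linked-list entry keeping a stale next pointer) can be reproduced with a much smaller, self-contained sketch. The class below is a hypothetical simplification, not Nutch's real MapWritable; the `resetNext` flag toggles the one-line fix the patch applies.

```java
// Minimal sketch of the NUTCH-354 recycling bug: entries of a singly
// linked list are pooled and reused; forgetting to reset 'next' on a
// recycled entry lets stale successors reappear during traversal.
public class RecycleMap {

    static class Entry {
        String key, value;
        Entry next;
    }

    private Entry first;            // head of the live entry list
    private Entry freeList;         // pool of recycled entries
    private final boolean resetNext; // whether to apply the fix

    public RecycleMap(boolean resetNext) {
        this.resetNext = resetNext;
    }

    /** Recycle all live entries into the free pool. */
    public void clear() {
        Entry e = first;
        while (e != null) {
            Entry n = e.next;
            e.next = freeList;
            freeList = e;
            e = n;
        }
        first = null;
    }

    /** Rebuild the map from key/value pairs, reusing pooled entries. */
    public void rebuild(String[][] pairs) {
        clear();
        Entry head = null, tail = null;
        for (String[] kv : pairs) {
            Entry e;
            if (freeList != null) {
                e = freeList;
                freeList = e.next;
                if (resetNext) e.next = null; // the fix: drop the stale link
            } else {
                e = new Entry();
                e.next = null;
            }
            e.key = kv[0];
            e.value = kv[1];
            if (tail == null) head = e; else tail.next = e;
            tail = e;
        }
        first = head;
    }

    /** Entries reachable from the head; phantom entries inflate this. */
    public int size() {
        int n = 0;
        for (Entry e = first; e != null; e = e.next) n++;
        return n;
    }

    public static void main(String[] args) {
        RecycleMap buggy = new RecycleMap(false);
        buggy.rebuild(new String[][] { { "a", "1" }, { "b", "2" } });
        buggy.rebuild(new String[][] { { "c", "3" } });
        System.out.println("buggy: " + buggy.size()); // 2: phantom entry

        RecycleMap fixed = new RecycleMap(true);
        fixed.rebuild(new String[][] { { "a", "1" }, { "b", "2" } });
        fixed.rebuild(new String[][] { { "c", "3" } });
        System.out.println("fixed: " + fixed.size()); // 1
    }
}
```

After recycling two entries and rebuilding with one, the buggy variant reports two entries because the reused entry still points into the free pool, exactly the "wrong data" symptom the issue describes.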
Fwd: [webspam-announces] Web Spam Collection Announced
Hi, Maybe some people will find this posting interesting. Webspam is one of the biggest issues for nutch for whole-web crawls, from my POV. Greetings, Stefan During AIRWeb'06 we announced the availability of the collection. We are currently planning a Web Spam challenge based on the dataset we have built. I assume most of you will be interested in this, so I have moved the webspam-volunteers list to webspam-announces. If you do not want to be in this new webspam-announces list, please send me an e-mail. This was shown during AIRWeb in Seattle. Web Spam Collection Available August 10th, 2006 We are pleased to announce the availability of a public collection for research on Web spam. This collection is the result of efforts by a team of volunteers: Thiago Alves, Antonio Gulli, Tamas Sarlos, Luca Becchetti, Zoltan Gyongyi, Mike Thelwall, Paolo Boldi, Thomas Lavergn, Belle Tseng, Paul Chirita, Alex Ntoulas, Tanguy Urvoy, Mirel Cosulschi, Josiane-Xavier Parreira, Wenzhong Zhao, Brian Davison, Xiaoguang Qi, Pascal Filoche, Massimo Santini. The corpus is a large set of Web pages on 11,000 .uk hosts downloaded in May 2006 by the Laboratory of Web Algorithmics, Università degli Studi di Milano. The labelling process was coordinated by Carlos Castillo, working at the Algorithmic Engineering group at Università di Roma "La Sapienza". The project was funded by the DELIS project (Dynamically Evolving, Large Scale Information Systems). Volunteers were provided with a set of guidelines and were asked to mark a set of hosts as either normal, spam, or borderline. The collection includes about 6,700 judgments done by the volunteers and can be used for testing link-based and content-based Web spam detection and demotion techniques. More information is available on our Web page, including the guidelines given to the human judges, the instructions for obtaining the links and contents of the pages in this collection, and the contact information for questions and comments. 
http://aeserver.dis.uniroma1.it/webspam/ If you use this data set please subscribe to our mailing list by sending an e-mail to [EMAIL PROTECTED] -- Carlos Castillo Universita di Roma La Sapienza Rome, ITALY
RE: 0.8 not loading plugins
The "not including: custommeta" sounds like a config file problem. Check your plugin.includes config in nutch-site.xml and the id of your plugin, and that everything matches properly everywhere (your id is apparently "custommeta" and should be the value in plugin.includes, if I am not wrong). The fact that the jar file is not copied into the build/plugins/custom-meta folder is most likely a build.xml problem during the deploy task. You can modify src/plugin/build-plugin.xml to be more verbose if you want (verbose=true in the deploy target, for instance) or enable failonerror. Aren't you missing a "-" somewhere in the plugin id, the project name, or elsewhere? -Original Message- From: Chris Stephens [mailto:[EMAIL PROTECTED] Sent: Monday, August 21, 2006 7:47 AM To: nutch-dev@lucene.apache.org Subject: Re: 0.8 not loading plugins I'm at a loss on this. I'm going to revert to using 0.7.2. If anyone has any insight on my problem, I would appreciate hearing from you. Chris Stephens wrote: By manually copying the custom-meta directory in build/plugin to plugin/ I was able to get at least some debug output in my log. It doesn't really tell me much; any idea why it wouldn't be loading the plugin when it has the correct entry in my nutch-site.xml? 
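The advice above comes down to one property in nutch-site.xml. A hedged illustration: the id "custommeta" is taken from the debug log, but the rest of the pattern is invented for the example and is not Chris's actual configuration.

```xml
<!-- nutch-site.xml: the last alternative in plugin.includes must match
     the id declared in the plugin's plugin.xml ("custommeta" per the log).
     The other entries here are illustrative placeholders. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|custommeta</value>
</property>
```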
[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak
[ http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12429534 ] Stefan Groschupf commented on NUTCH-356: Hi Enrico, there will be as many PluginRepositories as Configuration objects, so if you create many Configuration objects you will have a problem with memory. There is no way around having a singleton PluginRepository. However, you can reset the PluginRepository by removing the cached object from the Configuration object. In any case, not caching the PluginRepository is a bad idea; think about writing your own plugin that solves your problem, which should be a cleaner solution. Would you agree to close this issue, since we will not be able to commit your changes? Stefan
[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak
[ http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12429546 ] Enrico Triolo commented on NUTCH-356: - Thanks Stefan for your reply. The patch I submitted wasn't meant to be committed to the trunk; it was only a proof of concept to demonstrate that a potential leak really exists. I am aware that the cache shouldn't be removed, but since I'm not an expert at all, I was only reporting a possible problem, not a solution. I can see that there are as many PluginRepositories as Configurations; in fact, if you look at the source code of the test class I attached, you'll see there is only one Configuration instance involved. Nevertheless I keep getting OOM... Furthermore, I can't understand your suggestion of writing a plugin to solve my problem. Maybe I wasn't able to explain it clearly: while at first I thought it was the LanguageIdentifier, I found out that the cause is not the plugin itself, but rather the plugin management system. I couldn't inspect the code in depth, but using a profiler I saw that many objects don't get released. Don't you think this alone would be an issue? Anyway, if you think this is not an issue I can close it. Enrico
[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak
[ http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12429548 ] Chris A. Mattmann commented on NUTCH-356: - -1 for closing this issue. If there is a demonstrable memory leak in the plugin system, then I think it should be remedied. I haven't run your test code, Enrico, nor experienced your problem before, but it would seem that this issue is worth investigating.
[jira] Created: (NUTCH-357) crawling simulation
crawling simulation --- Key: NUTCH-357 URL: http://issues.apache.org/jira/browse/NUTCH-357 Project: Nutch Issue Type: Improvement Affects Versions: 0.8.1, 0.9.0 Reporter: Stefan Groschupf Fix For: 0.9.0 We recently discovered some serious issues related to crawling and scoring. Reproducing these problems is difficult, since first of all it is not polite to re-crawl a set of pages again and again, and secondly it is difficult to catch the page that causes a problem. Therefore it would be very useful to have a testbed to simulate crawls where we can control the responses of web servers. For a start, simulating very basic situations, like a page pointing to itself, link chains, or internal links, would already be very useful. Later on, simulating crawls against existing data collections like TREC or a webgraph would be much more interesting, for instance to calculate the quality of the nutch OPIC implementation against PageRank scores of the webgraph, or to evaluate crawling strategies.
[jira] Updated: (NUTCH-357) crawling simulation
[ http://issues.apache.org/jira/browse/NUTCH-357?page=all ] Stefan Groschupf updated NUTCH-357: --- Attachment: protocol-simulation-pluginV1.patch A very first preview of a plugin that helps to simulate crawls. This protocol plugin can be used to replace the http protocol plugin and return defined content during a fetch. To simulate custom scenarios, an interface named Simulator can be implemented with just one method. The plugin comes with a very simple basic Simulator implementation; however, this already allows simulating the nutch scoring problems known today, like pages pointing to themselves or link chains. For more details see the javadoc; I plan to improve the javadoc with a native speaker. Feedback is welcome.
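Since the patch itself is not reproduced in this thread, the following is only a guess at the shape of the single-method Simulator interface the update describes; the real signatures in protocol-simulation-pluginV1.patch may differ. The sample implementation reproduces one scenario named in the issue: every fetched page links back to itself.

```java
// Hypothetical sketch of the Simulator interface from NUTCH-357;
// names and signatures are assumptions, not the committed plugin API.
public class SimulatorSketch {

    public interface Simulator {
        /** Body the simulated web server returns for the given url. */
        String getContent(String url);
    }

    /** Every page links to itself, a known scoring trouble case. */
    public static class SelfLinkSimulator implements Simulator {
        public String getContent(String url) {
            return "<html><body><a href=\"" + url + "\">self</a></body></html>";
        }
    }

    public static void main(String[] args) {
        Simulator sim = new SelfLinkSimulator();
        System.out.println(sim.getContent("http://example.org/a"));
    }
}
```

A protocol plugin backed by such an interface can answer fetches deterministically, which is what makes crawler and scoring behavior reproducible without hitting real web servers.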