[jira] Created: (NUTCH-356) Plugin repository cache can lead to memory leak

2006-08-21 Thread Enrico Triolo (JIRA)
Plugin repository cache can lead to memory leak
---

 Key: NUTCH-356
 URL: http://issues.apache.org/jira/browse/NUTCH-356
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Enrico Triolo
 Attachments: NutchTest.java, patch.txt

While I was trying to solve a problem I reported a while ago (see NUTCH-314), I 
found out that the problem was actually related to the plugin cache used in 
class PluginRepository.java.
As I said in NUTCH-314, I think I somehow 'force' the way nutch is meant to 
work, since I need to frequently submit new urls and append their contents to 
the index; I don't (and can't) have an urls.txt file with all the urls I'm 
going to fetch, but recreate it each time a new url is submitted.
Thus, in most cases you won't have problems using nutch as-is, since the 
problem I found occurs only if nutch is used in a way similar to mine.
To simplify your test I'm attaching a class that performs something similar to 
what I need. It fetches and indexes some sample urls; to avoid webmasters' 
complaints I left the sample urls list empty, so you should modify the source 
code and add some urls.
Then you only have to run it and watch your memory consumption with top. In my 
experience I get an OutOfMemoryException after a couple of minutes, but it 
clearly depends on your heap settings and on the plugins you are using (I'm 
using 
'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic').

The problem is bound to the PluginRepository 'singleton' instance, since it 
never gets released. It seems that some class maintains a reference to it, and 
that class is never released since it is cached somewhere in the configuration.

So I modified the PluginRepository's 'get' method so that it never uses the 
cache and always returns a new instance (you can find the patch in the 
attachment). This way memory consumption stays stable and I get no OOM anymore.
Clearly this is not the solution, since I guess there are performance 
issues involved, but for the moment it works.
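The cached-versus-uncached behaviour described above can be sketched as a toy model (hypothetical code, not the actual patch.txt; the class, field, and method names are illustrative, and the real Nutch classes differ):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the caching pattern under discussion -- NOT the
// real Nutch PluginRepository. A static per-Configuration cache holds a
// strong reference to every repository, so none is ever collected;
// bypassing the cache returns a fresh, collectable instance each call.
public class PluginRepositorySketch {

  // stand-in for org.apache.hadoop.conf.Configuration
  static class Configuration { }

  static final Map<Configuration, PluginRepositorySketch> CACHE =
      new HashMap<Configuration, PluginRepositorySketch>();

  // cached variant: one repository per Configuration, kept alive forever
  public static PluginRepositorySketch getCached(Configuration conf) {
    PluginRepositorySketch repo = CACHE.get(conf);
    if (repo == null) {
      repo = new PluginRepositorySketch();
      CACHE.put(conf, repo); // strong reference -> lives as long as CACHE
    }
    return repo;
  }

  // the workaround described in the report: always build a new instance
  public static PluginRepositorySketch getUncached(Configuration conf) {
    return new PluginRepositorySketch();
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // cached: the same instance every time, pinned by the static map
    if (getCached(conf) != getCached(conf)) throw new AssertionError();
    // uncached: a fresh instance per call, collectable after use
    if (getUncached(conf) == getUncached(conf)) throw new AssertionError();
    System.out.println("ok");
  }
}
```

The trade-off is exactly the one Enrico names: the uncached variant re-parses every plugin descriptor on each call, which is why the patch is a diagnostic rather than a fix.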


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-346) Improve readability of logs/hadoop.log

2006-08-21 Thread Renaud Richardet (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-346?page=all ]

Renaud Richardet updated NUTCH-346:
---

Attachment: log4j_plugins.diff

OK, here we go. This patch should be good for 0.8 and trunk.

 Improve readability of logs/hadoop.log
 --

 Key: NUTCH-346
 URL: http://issues.apache.org/jira/browse/NUTCH-346
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: ubuntu dapper
Reporter: Renaud Richardet
Priority: Minor
 Attachments: log4j_plugins.diff


 adding
 log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN
 to conf/log4j.properties
 dramatically improves the readability of the logs in logs/hadoop.log (removes 
 all INFO)





Re: 0.8 not loading plugins

2006-08-21 Thread Chris Stephens
I'm at a loss on this.  I'm going to revert to using 0.7.2.  If anyone 
has any insight on my problem, I would appreciate hearing from you.


Chris Stephens wrote:
By manually copying the custom-meta directory in build/plugin to 
plugin/ I was able to get at least some debug output in my log. It 
doesn't really tell me much; any idea why it wouldn't be loading the 
plugin when it has the correct entry in my nutch-site.xml?


2006-08-18 13:34:35,007 DEBUG plugin.PluginRepository - parsing: 
/usr/local/nutch-0.8/plugins/custom-meta/plugin.xml
2006-08-18 13:34:35,010 DEBUG plugin.PluginRepository - plugin: 
id=custommeta name=Custom Meta Parser/Filter version=0.0.1 
provider=liveoakinteractive.comclass=null
2006-08-18 13:34:35,010 DEBUG plugin.PluginRepository - impl: 
point=org.apache.nutch.parse.HtmlParseFilter 
class=org.liveoak.nutch.parse.custommeta.CustomMetaParser
2006-08-18 13:34:35,011 DEBUG plugin.PluginRepository - impl: 
point=org.apache.nutch.indexer.IndexingFilter 
class=org.liveoak.nutch.parse.custommeta.CustomMetaIndexer
2006-08-18 13:34:35,011 DEBUG plugin.PluginRepository - impl: 
point=org.apache.nutch.searcher.QueryFilter 
class=org.liveoak.nutch.parse.custommeta.CustomMetaQueryFilter


2006-08-18 13:34:35,244 DEBUG plugin.PluginRepository - not including: 
custommeta






[jira] Commented: (NUTCH-355) The title of query result could like the summary have the highlight??

2006-08-21 Thread King Kong (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-355?page=comments#action_12429450 ] 

King Kong commented on NUTCH-355:
-



I added a class named Titler:

package org.apache.nutch.searcher;

// imports omitted ...

public class Titler implements Configurable {

  private int maxLength = 20;
  private Analyzer analyzer = null;
  private Configuration conf = null;

  public Titler() { }

  public Titler(Configuration conf) {
    setConf(conf);
  }

  /* ----------------------------- *
   * implementation: Configurable  *
   * ----------------------------- */

  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    this.analyzer = new NutchDocumentAnalyzer(conf);
    this.maxLength = conf.getInt("searcher.title.maxlength", 40);
  }

  public Summary getSummary(String text, Query query) {
    Token[] tokens = getTokens(text); // parse text to token array

    if (tokens.length == 0)
      return new Summary();

    String[] terms = query.getTerms();
    HashSet highlight = new HashSet(); // put query terms in table
    for (int i = 0; i < terms.length; i++)
      highlight.add(terms[i]);

    Summary s = new Summary();

    int offset = 0;
    for (int i = 0; i < tokens.length && tokens[i].startOffset() < this.maxLength; i++) {
      Token token = tokens[i];
      //
      // If we find a term that's in the query...
      //
      if (highlight.contains(token.termText())) {
        s.add(new Fragment(text.substring(offset, token.startOffset())));
        s.add(new Highlight(text.substring(token.startOffset(), token.endOffset())));
        offset = token.endOffset();
      }
    }

    s.add(new Fragment(text.substring(offset, Math.min(text.length(), this.maxLength))));

    if (text.length() > this.maxLength) {
      s.add(new Ellipsis());
    }

    return s;
  }

  /** Maximum number of tokens to inspect in a summary. */
  private static final int token_deep = 1000;

  private Token[] getTokens(String text) {
    ArrayList result = new ArrayList();
    TokenStream ts = analyzer.tokenStream("title", new StringReader(text));
    Token token = null;
    while (result.size() < token_deep) {
      try {
        token = ts.next();
      } catch (IOException e) {
        token = null;
      }
      if (token == null) { break; }
      result.add(token);
    }
    try {
      ts.close();
    } catch (IOException e) {
      // ignore
    }
    return (Token[]) result.toArray(new Token[result.size()]);
  }
}


Then I added a titler property to NutchBean:

public class NutchBean ... {
   ...
  private Titler titler;
   ...
  public NutchBean(Configuration conf, Path dir) throws IOException {
    ...
    this.titler = new Titler(conf);
  }

  ...
  // add getTitle() with highlight
  public Summary getTitle(HitDetails hit, Query query) throws IOException {
    return titler.getSummary(hit.getValue("title"), query);
  }
}

finally, in search.jsp,

String title = detail.getValue("title");
change to:
String title = bean.getTitle(detail, query).toHtml(true);

<a target="_blank" href="<%=url%>"><%=Entities.encode(title)%></a>
change to:
<a target="_blank" href="<%=url%>"><%=title%></a>


I recompiled, and it works well,

but I don't know if I should do it like this.
Could you give me any suggestions?



  


 The title of query result  could like the summary have the highlight??
 --

 Key: NUTCH-355
 URL: http://issues.apache.org/jira/browse/NUTCH-355
 Project: Nutch
  Issue Type: Wish
  Components: searcher
Affects Versions: 0.8
 Environment: all
Reporter: King Kong

 I'd like to make the title highlighted, but I can't find how to do it.
 When I query Nutch, the result looks like this:
 <a href="http://lucene.apache.org/nutch/">Welcome to <b>Nutch</b>!</a>
 This is the first <b>Nutch</b> release as an Apache Lucene sub-project. See 
 CHANGES.txt for details. The release is available here. ... <b>Nutch</b> has 
 now graduated from the Apache incubator, and is now a subproject of Lucene. 
 ...
  
 





[jira] Commented: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-21 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-354?page=comments#action_12429496 ] 

Stefan Groschupf commented on NUTCH-354:


Since this issue is already closed I cannot attach the patch file, so I attach 
it as text within this comment.
If you need the file, let me know and I'll send you an offlist mail. 


Index: src/test/org/apache/nutch/crawl/TestMapWritable.java
===
--- src/test/org/apache/nutch/crawl/TestMapWritable.java(revision 
432325)
+++ src/test/org/apache/nutch/crawl/TestMapWritable.java(working copy)
@@ -180,6 +180,31 @@
 assertEquals(before, after);
   }
 
+  public void testRecycling() throws Exception {
+UTF8 value = new UTF8("value");
+UTF8 key1 = new UTF8("a");
+UTF8 key2 = new UTF8("b");
+
+MapWritable writable = new MapWritable();
+writable.put(key1, value);
+assertEquals(writable.get(key1), value);
+assertNull(writable.get(key2));
+
+DataOutputBuffer dob = new DataOutputBuffer();
+writable.write(dob);
+writable.clear();
+writable.put(key1, value);
+writable.put(key2, value);
+assertEquals(writable.get(key1), value);
+assertEquals(writable.get(key2), value);
+
+DataInputBuffer dib = new DataInputBuffer();
+dib.reset(dob.getData(), dob.getLength());
+writable.readFields(dib);
+assertEquals(writable.get(key1), value);
+assertNull(writable.get(key2));
+  }
+  
   public static void main(String[] args) throws Exception {
 TestMapWritable writable = new TestMapWritable();
 writable.testPerformance();


 MapWritable,  nextEntry is not reset when Entries are recycled
 --

 Key: NUTCH-354
 URL: http://issues.apache.org/jira/browse/NUTCH-354
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.9.0, 0.8.1

 Attachments: resetNextEntryInMapWritableV1.patch


 MapWritable recycles entries from its internal linked list for performance 
 reasons. The nextEntry of an entry is not reset when a recyclable entry 
 is found. This can cause wrong data in a MapWritable. 
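The quoted description can be illustrated with a toy model of an append-based entry list with a free list (hypothetical code, not the real MapWritable; names are illustrative). A recycled entry keeps the stale `next` pointer from its previous life, so entries that were logically cleared reappear:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the recycling bug -- NOT the real MapWritable code.
// clear() moves entries to a free list without touching their "next"
// fields; a recycled entry therefore drags its old chain back into the map
// unless "next" is reset, which is exactly what the patch above does.
public class RecycleSketch {

  static class Entry {
    String key, value;
    Entry next; // stale after recycling unless explicitly reset
  }

  private Entry head, last;
  private final List<Entry> freeList = new ArrayList<Entry>();
  private final boolean resetNextOnRecycle; // true = apply the fix

  RecycleSketch(boolean resetNextOnRecycle) {
    this.resetNextOnRecycle = resetNextOnRecycle;
  }

  void put(String key, String value) {
    Entry e;
    if (!freeList.isEmpty()) {
      e = freeList.remove(0);                // recycle an old entry
      if (resetNextOnRecycle) e.next = null; // the missing reset
    } else {
      e = new Entry();
    }
    e.key = key;
    e.value = value;
    if (head == null) head = e; else last.next = e; // append to the list
    last = e;
  }

  String get(String key) { // walk the chain until next == null
    for (Entry e = head; e != null; e = e.next)
      if (key.equals(e.key)) return e.value;
    return null;
  }

  void clear() { // entries go to the free list, next pointers left intact
    for (Entry e = head; e != null; e = e.next) freeList.add(e);
    head = last = null;
  }

  public static void main(String[] args) {
    RecycleSketch buggy = new RecycleSketch(false);
    buggy.put("a", "1");
    buggy.put("b", "2");
    buggy.clear();
    buggy.put("a", "3"); // recycled "a" entry still points at old "b"
    if (!"2".equals(buggy.get("b"))) // the cleared entry is resurrected
      throw new AssertionError("expected the bug to show");

    RecycleSketch fixed = new RecycleSketch(true);
    fixed.put("a", "1");
    fixed.put("b", "2");
    fixed.clear();
    fixed.put("a", "3");
    if (fixed.get("b") != null) // with the reset, "b" is really gone
      throw new AssertionError("fix failed");
    System.out.println("ok");
  }
}
```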





Fwd: [webspam-announces] Web Spam Collection Announced

2006-08-21 Thread Stefan Groschupf

Hi,
Maybe some people will find this posting interesting.
Webspam is one of the biggest issues for nutch whole-web crawls from my POV.


Greetings,
Stefan




During AIRWeb'06 we announced the availability of the collection.

We are currently planning a Web Spam challenge based on the dataset we
have built. I assume most of you will be interested on this, so I have
moved the webspam-volunteers list to webspam-announces. If you do
not want to be in this new webspam-announces list, please send me an
e-mail.

This was shown during AIRWeb in Seattle:

.

Web Spam Collection Available
August 10th, 2006

We are pleased to announce the availability of a public collection for
research on Web spam. This collection is the result of efforts by a
team of volunteers:

Thiago AlvesAntonio GulliTamas Sarlos
Luca Becchetti  Zoltan Gyongyi   Mike Thelwall
Paolo Boldi Thomas Lavergn   Belle Tseng
Paul ChiritaAlex Ntoulas Tanguy Urvoy
Mirel Cosulschi Josiane-Xavier Parreira  Wenzhong Zhao
Brian Davison   Xiaoguang Qi
Pascal Filoche  Massimo Santini

The corpus is a large set of Web pages on 11,000 .uk hosts
downloaded in May 2006 by the Laboratory of Web Algorithmics,
Università degli Studi di Milano. The labelling process was
coordinated by Carlos Castillo, working at the Algorithmic Engineering
group at Università di Roma "La Sapienza". The project was funded
by the DELIS project (Dynamically Evolving, Large Scale Information
Systems).

Volunteers were provided with a set of guidelines and were asked to
mark a set of hosts as either normal, spam, or borderline. The
collection includes about 6,700 judgments done by the volunteers and
can be used for testing link-based and content-based Web spam
detection and demotion techniques.

More information is available in our Web page, including the
guidelines given to the human judges, the instructions for obtaining
the links and contents of the pages in this collection, and the
contact information for questions and comments.

http://aeserver.dis.uniroma1.it/webspam/

If you use this data set please subscribe to our mailing list by
sending an e-mail to [EMAIL PROTECTED]

--
Carlos Castillo
Universita di Roma La Sapienza
Rome, ITALY





Yahoo! Groups Links

* To visit your group on the web, go to:
http://groups.yahoo.com/group/webspam-announces/

* To unsubscribe from this group, send an email to:
[EMAIL PROTECTED]

* Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/








RE: 0.8 not loading plugins

2006-08-21 Thread HUYLEBROECK Jeremy RD-ILAB-SSF

The "not including: custommeta" sounds like a config file problem.
Check the plugin.includes setting in your nutch-site.xml and the id of
your plugin, and make sure everything matches properly everywhere (your id is
apparently "custommeta", and it should be covered by the value of
plugin.includes if I am not wrong).

The fact that the jar file is not copied into the
build/plugins/custom-meta folder is most likely a build.xml problem
during the deploy task.
You can modify src/plugin/build-plugin.xml to make it more verbose
if you want (verbose="true" in the deploy target, for instance) or enable
failonerror.

Aren't you missing a "-" somewhere in the plugin id, the project name,
or elsewhere?
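For anyone hitting the same message: plugin.includes is an ordinary property in nutch-site.xml whose value is a regular expression matched against plugin ids. The fragment below is an illustrative example only; the value shown is hypothetical and must be adapted to the plugins you actually use, with your own plugin id (here "custommeta") added:

```xml
<!-- Hypothetical nutch-site.xml fragment: the id declared in your
     plugin.xml must be matched by this regular expression, or the
     PluginRepository logs "not including: <id>" and skips it. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-html|index-basic|query-(basic|site|url)|custommeta</value>
</property>
```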


-Original Message-
From: Chris Stephens [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 21, 2006 7:47 AM
To: nutch-dev@lucene.apache.org
Subject: Re: 0.8 not loading plugins

I'm at a loss on this.  I'm going to revert to using 0.7.2.  If anyone
has any insight on my problem, I would appreciate hearing from you.

Chris Stephens wrote:
 By manually copying the custom-meta directory in build/plugin to 
 plugin/ I was able to get at least some debug output in my log.  It 
 doesn't really tell me much, any idea why it wouldn't be loading the 
 plugin when it has the correct entry in my nutch-site.xml?

 2006-08-18 13:34:35,007 DEBUG plugin.PluginRepository - parsing: 
 /usr/local/nutch-0.8/plugins/custom-meta/plugin.xml
 2006-08-18 13:34:35,010 DEBUG plugin.PluginRepository - plugin: 
 id=custommeta name=Custom Meta Parser/Filter version=0.0.1 
 provider=liveoakinteractive.comclass=null
 2006-08-18 13:34:35,010 DEBUG plugin.PluginRepository - impl: 
 point=org.apache.nutch.parse.HtmlParseFilter
 class=org.liveoak.nutch.parse.custommeta.CustomMetaParser
 2006-08-18 13:34:35,011 DEBUG plugin.PluginRepository - impl: 
 point=org.apache.nutch.indexer.IndexingFilter
 class=org.liveoak.nutch.parse.custommeta.CustomMetaIndexer
 2006-08-18 13:34:35,011 DEBUG plugin.PluginRepository - impl: 
 point=org.apache.nutch.searcher.QueryFilter
 class=org.liveoak.nutch.parse.custommeta.CustomMetaQueryFilter

 2006-08-18 13:34:35,244 DEBUG plugin.PluginRepository - not including:

 custommeta




[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak

2006-08-21 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12429534 ] 

Stefan Groschupf commented on NUTCH-356:


Hi Enrico, 
there will be as many PluginRepositories as Configuration objects. 
So in case you create many Configuration objects you will have a problem with 
the memory. 
There is no way around having a singleton PluginRepository. However, you can 
reset the PluginRepository by removing the cached object from the 
Configuration object. 
In any case, not caching the PluginRepository is a bad idea; think about 
writing your own plugin that solves your problem. That should be a cleaner 
solution. 

Would you agree to close this issue, since we will not be able to commit your 
changes? 
Stefan  

 Plugin repository cache can lead to memory leak
 ---

 Key: NUTCH-356
 URL: http://issues.apache.org/jira/browse/NUTCH-356
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Enrico Triolo
 Attachments: NutchTest.java, patch.txt







[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak

2006-08-21 Thread Enrico Triolo (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12429546 ] 

Enrico Triolo commented on NUTCH-356:
-

Thanks Stefan for your reply. The patch I submitted wasn't meant to be 
committed to the trunk; it was only a proof of concept to demonstrate that a 
potential leak really exists. I am aware that the cache shouldn't be removed, 
but since I'm not an expert at all, I was only reporting a possible problem, 
not a solution. 

I can see that there are as many PluginRepositories as Configurations; in fact, 
if you look at the source code of the test class I attached, you'll see there 
is only one Configuration instance involved. Nevertheless I keep getting OOM...

Furthermore, I can't understand your suggestion of writing a plugin to solve my 
problem. Maybe I wasn't able to explain it clearly: while at first I thought it 
was the LanguageIdentifier, I found out that the cause is not the plugin 
itself, but rather the plugin management system. I couldn't inspect the code in 
depth, but using a profiler I saw that many objects don't get released. Don't 
you think this alone would be an issue?

Anyway, if you think this is not an issue I can close it.
Enrico

 Plugin repository cache can lead to memory leak
 ---

 Key: NUTCH-356
 URL: http://issues.apache.org/jira/browse/NUTCH-356
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Enrico Triolo
 Attachments: NutchTest.java, patch.txt







[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak

2006-08-21 Thread Chris A. Mattmann (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12429548 ] 

Chris A. Mattmann commented on NUTCH-356:
-

-1 for closing this issue.

If there is a demonstrable memory leak in the plugin system, then I think it 
should be remedied. I haven't run your test code, Enrico, nor experienced your 
problem before, but it would seem that this issue is worth investigating. 

 Plugin repository cache can lead to memory leak
 ---

 Key: NUTCH-356
 URL: http://issues.apache.org/jira/browse/NUTCH-356
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Enrico Triolo
 Attachments: NutchTest.java, patch.txt







[jira] Created: (NUTCH-357) crawling simulation

2006-08-21 Thread Stefan Groschupf (JIRA)
crawling simulation
---

 Key: NUTCH-357
 URL: http://issues.apache.org/jira/browse/NUTCH-357
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8.1, 0.9.0
Reporter: Stefan Groschupf
 Fix For: 0.9.0


We recently discovered some serious issues related to crawling and scoring. 
Reproducing these problems is kind of difficult, since first of all it is not 
polite to re-crawl a set of pages again and again, and secondly it is difficult 
to catch the page that causes a problem. 
Therefore it would be very useful to have a testbed to simulate crawls where 
we can control the responses of web servers. 
For the very beginning, simulating very basic situations like a page pointing 
to itself, link chains, or internal links would already be very useful. 

However, later on, simulating crawls against existing data collections like 
TREC or a webgraph would be much more interesting, for instance to calculate 
the quality of the nutch OPIC implementation against PageRank scores of the 
webgraph, or to evaluate crawling strategies.





[jira] Updated: (NUTCH-357) crawling simulation

2006-08-21 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-357?page=all ]

Stefan Groschupf updated NUTCH-357:
---

Attachment: protocol-simulation-pluginV1.patch

A very first preview of a plugin that helps to simulate crawls. This protocol 
plugin can be used to replace the http protocol plugin and return defined 
content during a fetch. To simulate custom scenarios, an interface named 
Simulator can be implemented with just one method. 
The plugin comes with a very simple basic Simulator implementation; however, 
this already allows simulating the currently known nutch scoring problems, 
like pages pointing to themselves or link chains. 
For more details see the javadoc; I plan to improve the javadoc with a 
native speaker. 

Feedback is welcome. 
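The single-method Simulator interface described above might look roughly like this (a hypothetical sketch; the real signature lives in the attached patch and may well differ). The trivial implementation reproduces one of the degenerate link structures mentioned, a page linking only to itself:

```java
// Hypothetical sketch of a one-method crawl-simulation interface -- the
// actual interface is defined in protocol-simulation-pluginV1.patch.
public class SimulatorSketch {

  interface Simulator {
    /** Return the simulated page content served for the given URL. */
    String getContent(String url);
  }

  // Every fetched page links back to itself, one of the scoring-problem
  // scenarios the plugin is meant to reproduce without touching the web.
  static class SelfLinkSimulator implements Simulator {
    public String getContent(String url) {
      return "<html><body><a href=\"" + url + "\">self</a></body></html>";
    }
  }

  public static void main(String[] args) {
    Simulator sim = new SelfLinkSimulator();
    String page = sim.getContent("http://example.com/");
    // the generated page contains a link pointing back at itself
    if (!page.contains("href=\"http://example.com/\""))
      throw new AssertionError();
    System.out.println("ok");
  }
}
```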

 crawling simulation
 ---

 Key: NUTCH-357
 URL: http://issues.apache.org/jira/browse/NUTCH-357
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8.1, 0.9.0
Reporter: Stefan Groschupf
 Fix For: 0.9.0

 Attachments: protocol-simulation-pluginV1.patch


