Performance issues with ConjunctionScorer

2005-11-22 Thread Andrzej Bialecki
Hi, I've been profiling a Nutch installation, and to my surprise the largest amount of throwaway allocations and the most time spent was not in Nutch specific code, or IPC, but in Lucene ConjunctionScorer.doNext() method. This method operates on a LinkedList, which seems to be a huge
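For illustration, the kind of conjunction (AND) matching that a method like doNext() performs can be sketched over a plain array of cursors instead of a LinkedList, so that no list nodes are allocated while matching. This is only a minimal sketch under stated assumptions, not Lucene's code and not the patch discussed in this thread: ConjunctionSketch, DocCursor, intersect and the sample doc ids are hypothetical names introduced here, and doc ids are assumed to be sorted and non-negative.

```java
import java.util.ArrayList;
import java.util.List;

public class ConjunctionSketch {

    /** Hypothetical cursor over a sorted array of doc ids (a stand-in for a scorer). */
    public static final class DocCursor {
        private final int[] docs;
        private int pos = 0;

        public DocCursor(int... sortedDocs) {
            this.docs = sortedDocs;
        }

        /** Moves to the first doc id >= target; returns false once the cursor is exhausted. */
        public boolean skipTo(int target) {
            while (pos < docs.length && docs[pos] < target) {
                pos++;
            }
            return pos < docs.length;
        }

        /** Current doc id; only valid after skipTo() returned true. */
        public int doc() {
            return docs[pos];
        }
    }

    /** Returns the doc ids present in every cursor, walking a fixed array of cursors. */
    public static List<Integer> intersect(DocCursor[] cursors) {
        List<Integer> hits = new ArrayList<>();
        if (cursors.length == 0) {
            return hits;
        }
        int target = 0; // assumption: doc ids are non-negative
        while (true) {
            boolean allMatch = true;
            for (DocCursor c : cursors) {
                if (!c.skipTo(target)) {
                    return hits; // one cursor ran out, so no further common docs exist
                }
                if (c.doc() > target) {
                    target = c.doc(); // raise the candidate and re-check on the next pass
                    allMatch = false;
                }
            }
            if (allMatch) {
                hits.add(target);
                target++; // look for the next common doc
            }
        }
    }

    public static void main(String[] args) {
        DocCursor[] cursors = {
            new DocCursor(1, 3, 5, 7),
            new DocCursor(3, 4, 7),
            new DocCursor(2, 3, 7, 9)
        };
        System.out.println(intersect(cursors)); // prints [3, 7]
    }
}
```

The cursor array is allocated once and never reordered; the only per-query allocation in the matching loop is the result list itself.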

Re: Performance issues with ConjunctionScorer

2005-11-22 Thread Stefan Groschupf
Andrzej, very interesting!!! Nutch Summarizer also needlessly re-tokenizes the text over and over again - perhaps it would be better to save already tokenized text in parse_text, instead of the raw plain text? After all, the only use for that text is to index it and then build the

Re: Performance issues with ConjunctionScorer

2005-11-22 Thread Piotr Kosiorowski
On 11/22/05, Andrzej Bialecki [EMAIL PROTECTED] wrote: Hi, I've been profiling a Nutch installation, and to my surprise the largest amount of throwaway allocations and the most time spent was not in Nutch specific code, or IPC, but in Lucene ConjunctionScorer.doNext() method. This method

Re: Performance issues with ConjunctionScorer

2005-11-22 Thread Andrzej Bialecki
Piotr Kosiorowski wrote: On 11/22/05, Andrzej Bialecki [EMAIL PROTECTED] wrote: Hi, I've been profiling a Nutch installation, and to my surprise the largest amount of throwaway allocations and the most time spent was not in Nutch specific code, or IPC, but in Lucene

Re: Performance issues with ConjunctionScorer

2005-11-22 Thread Piotr Kosiorowski
You are right - it is still not committed, but the patch is here: http://issues.apache.org/jira/browse/LUCENE-443. During tests of my patch (it was very, very similar to this one) I had up to a 5% performance increase. But probably it will mainly result in nicer GC behaviour. Piotr On 11/22/05,

ndfs / Lost connection to namenode

2005-11-22 Thread Mr. Udatny
I use the Nutch 0.7.1 distribution. Starting the namenode seems to be OK (as I can see from the logs): 051122 161211 parsing file:/home/rude/workspace/nutch-0.7.1/conf/nutch-default.xml 051122 161212 parsing file:/home/rude/workspace/nutch-0.7.1/conf/nutch-site.xml 051122 161212 Server listener on

Re: ndfs / Lost connection to namenode

2005-11-22 Thread Mr. Udatny
I checked out the latest mapred branch version from svn and it seems to work!
java -cp ./:./classes/:./conf org.apache.nutch.ndfs.NameNode
java -cp ./:./classes/:./conf org.apache.nutch.ndfs.DataNode
java -cp ./:./classes/:./conf org.apache.nutch.fs.TestClient -put somebigfile /testfile

Re: Urlfilter bug (doesn't return on long URLs)

2005-11-22 Thread Rod Taylor
On Mon, 2005-11-21 at 15:11 -0800, Doug Cutting wrote: This sounds like a bug in the URLFilter implementation. Is this RegexURLFilter? Can you figure out what regex is causing this? Probably the patch should be there, no? I am using the URL Filtering and normalization plugins. As to where

Re: Performance issues with ConjunctionScorer

2005-11-22 Thread Doug Cutting
Andrzej Bialecki wrote: Further input into this: after replacing the ConjunctionScorer with the fixed version from JIRA, now the bottleneck seems to be ... in Summarizer, of all things. :-) While making the summarizer faster would of course be good, keep in mind that the cost of summarizing

[Fwd: Spider Causing Contact Form Submissions]

2005-11-22 Thread Doug Cutting
It looks as though Nutch is inadvertently submitting forms. At DOMContentUtils.java:58 we specify that the action parameter of an HTML form should be extracted as a link. Yet we ignore the method parameter of the form. I think we should only follow these when the method is get, not when it
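The rule proposed above (extract a form's action as a link only when the form's method is get) could be sketched roughly as follows. This is a hedged illustration, not the code at DOMContentUtils.java:58: FormLinkSketch and its methods are hypothetical names, the JDK's XML parser on well-formed lowercase XHTML stands in for Nutch's HTML parser, and a missing method attribute is treated as get because that is the HTML default.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class FormLinkSketch {

    /** True when the form submits with GET; an absent method attribute counts as GET. */
    public static boolean isGetForm(Element form) {
        // getAttribute() returns "" when the attribute is not present.
        String method = form.getAttribute("method").trim();
        return method.isEmpty() || method.equalsIgnoreCase("get");
    }

    /** Collects the action targets of GET forms only, in document order. */
    public static List<String> formLinks(Document doc) {
        List<String> links = new ArrayList<>();
        // Assumption: element names are lowercase, as in the XHTML sample below.
        NodeList forms = doc.getElementsByTagName("form");
        for (int i = 0; i < forms.getLength(); i++) {
            Element form = (Element) forms.item(i);
            String action = form.getAttribute("action");
            if (!action.isEmpty() && isGetForm(form)) {
                links.add(action);
            }
        }
        return links;
    }

    /** Parses a well-formed XHTML string; a stand-in for a real HTML parser. */
    public static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new RuntimeException("could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body>"
                + "<form action='/search' method='GET'/>"
                + "<form action='/contact' method='post'/>"
                + "<form action='/find'/>"
                + "</body></html>");
        System.out.println(formLinks(doc)); // prints [/search, /find]
    }
}
```

With a filter like this, a contact form that uses post is never treated as a link, while search-style get forms are still discovered.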

Re: [Fwd: Spider Causing Contact Form Submissions]

2005-11-22 Thread Ben Halsted
Hi Doug, I'm not a 'nutch dev' but I agree with you. I'm not 100% sure, but I think even the Google accelerator does it this way. Cheers, Ben On 11/22/05, Doug Cutting [EMAIL PROTECTED] wrote: It looks as though Nutch is inadvertently submitting forms. At DOMContentUtils.java:58 we specify