Hi,
I've been profiling a Nutch installation, and to my surprise the largest
amount of throwaway allocations and the most time spent was not in Nutch
specific code, or IPC, but in Lucene ConjunctionScorer.doNext() method.
This method operates on a LinkedList, which seems to be a huge
Andrzej,
very interesting!!!
Nutch Summarizer also needlessly re-tokenizes the text over and
over again - perhaps it would be better to save already tokenized
text in parse_text, instead of the raw plain text? After all, the
only use for that text is to index it and then build the
On 11/22/05, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Hi,
I've been profiling a Nutch installation, and to my surprise the largest
amount of throwaway allocations and the most time spent was not in Nutch
specific code, or IPC, but in Lucene ConjunctionScorer.doNext() method.
This method
Piotr Kosiorowski wrote:
On 11/22/05, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Hi,
I've been profiling a Nutch installation, and to my surprise the largest
amount of throwaway allocations and the most time spent was not in Nutch
specific code, or IPC, but in Lucene
You are right - it is still not committed but the patch is here:
http://issues.apache.org/jira/browse/LUCENE-443.
While testing my patch (it was very, very similar to this one) I saw up to a
5% performance increase. But it will probably mainly result in nicer GC
behaviour.
Piotr
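To illustrate the allocation point discussed in this thread, here is a minimal sketch of a conjunction (AND) over sorted doc-ID lists that keeps one cursor per list in a fixed array and allocates nothing inside the matching loop. It is not the code from LUCENE-443 and not Lucene's ConjunctionScorer; the class name ArrayConjunction, the method intersect, and the use of plain int arrays as posting lists are assumptions made only for this example.

```java
import java.util.Arrays;

/** Hypothetical illustration; not Lucene code and not the LUCENE-443 patch. */
final class ArrayConjunction {
    private ArrayConjunction() {}

    /** Returns the doc IDs that occur in every sorted posting list. */
    static int[] intersect(int[][] postings) {
        int n = postings.length;
        if (n == 0) {
            return new int[0];
        }
        int[] pos = new int[n];                  // one cursor per list, allocated once
        int[] out = new int[postings[0].length]; // a match must appear in list 0
        int count = 0;
        outer:
        while (pos[0] < postings[0].length) {
            int candidate = postings[0][pos[0]];
            for (int i = 1; i < n; i++) {
                int[] list = postings[i];
                // Advance list i to its first doc ID >= candidate.
                while (pos[i] < list.length && list[pos[i]] < candidate) {
                    pos[i]++;
                }
                if (pos[i] == list.length) {
                    break outer;                 // list i is exhausted, no more matches
                }
                if (list[pos[i]] > candidate) {
                    // Candidate is missing from list i: skip list 0 ahead to that doc ID.
                    int target = list[pos[i]];
                    while (pos[0] < postings[0].length && postings[0][pos[0]] < target) {
                        pos[0]++;
                    }
                    continue outer;
                }
            }
            out[count++] = candidate;            // every list is positioned on candidate
            pos[0]++;
        }
        return Arrays.copyOf(out, count);
    }

    public static void main(String[] args) {
        int[][] postings = { {1, 3, 5, 7}, {3, 4, 7, 9}, {2, 3, 7} };
        System.out.println(Arrays.toString(intersect(postings))); // prints [3, 7]
    }
}
```

The cursors live in a single int array for the whole run, so the loop creates no garbage; whether that matches the behaviour of the actual patch should be checked against the JIRA issue.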
On 11/22/05,
I use the Nutch 0.7.1 distribution.
Starting the namenode seems to be OK (as I can see from the logs):
051122 161211 parsing file:/home/rude/workspace/nutch-0.7.1/conf/nutch-default.xml
051122 161212 parsing file:/home/rude/workspace/nutch-0.7.1/conf/nutch-site.xml
051122 161212 Server listener on
I checked out the latest mapred branch version from svn
and it seems to work!
java -cp ./:./classes/:./conf org.apache.nutch.ndfs.NameNode
java -cp ./:./classes/:./conf org.apache.nutch.ndfs.DataNode
java -cp ./:./classes/:./conf org.apache.nutch.fs.TestClient -put somebigfile /testfile
On Mon, 2005-11-21 at 15:11 -0800, Doug Cutting wrote:
This sounds like a bug in the URLFilter implementation. Is this
RegexURLFilter? Can you figure out what regex is causing this?
Probably the patch should be there, no?
I am using the URL Filtering and normalization plugins. As to where
Andrzej Bialecki wrote:
Further input into this: after replacing the ConjunctionScorer with the
fixed version from JIRA, now the bottleneck seems to be ... in
Summarizer, of all things. :-)
While making the summarizer faster would of course be good, keep in mind
that the cost of summarizing
It looks as though Nutch is inadvertently submitting forms.
At DOMContentUtils.java:58 we specify that the action parameter of an
HTML form should be extracted as a link. Yet we ignore the method
parameter of the form. I think we should only follow these when the
method is get, not when it
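The rule proposed in this message (treat a form's action as a link only when the form's method is get) could be sketched roughly as follows. This is not the code in DOMContentUtils.java; the class FormLinkPolicy and the method shouldFollowAction are hypothetical names used only for illustration, and the sketch assumes the caller has already read the method attribute from the form element.

```java
import java.util.Locale;

/** Hypothetical helper; not part of Nutch. */
final class FormLinkPolicy {
    private FormLinkPolicy() {}

    /**
     * Returns true if a form's action URL may be extracted as a link.
     * A missing or empty method attribute is treated as GET, which is the
     * HTML default; any other value (for example "post") is not followed.
     */
    static boolean shouldFollowAction(String methodAttr) {
        if (methodAttr == null || methodAttr.trim().isEmpty()) {
            return true;
        }
        return "get".equals(methodAttr.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(shouldFollowAction("GET"));  // prints true
        System.out.println(shouldFollowAction(null));   // prints true
        System.out.println(shouldFollowAction("post")); // prints false
    }
}
```

The comparison is case-insensitive because HTML attribute values such as GET and get are equivalent here.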
Hi Doug,
I'm not a 'nutch dev' but I agree with you. I'm not 100% sure, but I think
even the Google Accelerator does it this way.
Cheers,
Ben
On 11/22/05, Doug Cutting [EMAIL PROTECTED] wrote:
It looks as though Nutch is inadvertently submitting forms.
At DOMContentUtils.java:58 we specify