Hi,
Thanks for the pointers. We had indeed missed adding the
language-identifier plugin to the war file. The problem was resolved by
adding it.
Best Regards,
Anup
Jérôme Charron [EMAIL PROTECTED]
10/27/2005 08:40 PM
Reply-To: nutch-user@lucene.apache.org
Ken Krugler wrote:
I'm using the mapred branch on a FreeBSD 7.0 box to do fetches of a
300k URL list.
Initially, it's able to reach ~25 pages/s with 150 threads. The
fetcher gets progressively slower though, dropping down to about ~15
pages/s after about 2-3 hours or so and continues to slow down.
For what it's worth, I fetch my segments of 1 million
URLs with 80 threads at a time and see no slowdowns.
I'll grab some of my stats and publish them, but I
haven't had problems with the fetcher slowing down like
this in a long time.
(Linux/CentOS 4.2 platform)
-byron
--- Andrzej Bialecki [EMAIL PROTECTED] wrote:
051028 083415 DONE indexing segment 20051019000305:
total 10 records in 520.156 s (192.3077 rec/s).
051028 083415 done indexing
Been doing some testing and I've pretty much peaked
out at 192-200 rec/s on a 2.8GHz machine with lang
ident enabled on 512 bytes of data @ 3-grams, which after
tweaking
There is an issue with the PDFBox library shipped with Nutch 0.7. It will
hang parsing certain PDF files. PDFBox 0.7.2 fixes this issue. If you are
parsing PDF files, then this could also be a problem.
Thanks,
Steve Betts
[EMAIL PROTECTED]
937-477-1797
-----Original Message-----
From: Byron
Ken van Mulder wrote:
Initially, it's able to reach ~25 pages/s with 150 threads. The fetcher
gets progressively slower though, dropping down to about ~15 pages/s
after about 2-3 hours or so and continues to slow down. I've seen a few
references on these lists to the issue, but I'm not clear on
Byron Miller wrote:
For example, I've been tweaking max merge/min merge and
such, and I've been able to double my performance
without increasing anything but CPU load.
A smaller maxMergeDocs will cost you in the end, since those segments
will eventually be merged during the index optimization anyway.
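Doug's point can be illustrated with a toy model of Lucene-style log merging (a sketch, not Lucene's actual merge code; the function name, doc counts, and thresholds below are all hypothetical): capping maxMergeDocs halts merging early, leaving many medium-sized segments that the final optimize still has to combine.

```python
from collections import Counter

def count_segments(num_docs, min_merge_docs, merge_factor, max_merge_docs):
    """Toy model of log merging (not real Lucene code): docs are buffered
    into segments of min_merge_docs; whenever merge_factor equal-sized
    segments exist they merge into one, unless the result would exceed
    max_merge_docs."""
    segments = []
    for _ in range(num_docs // min_merge_docs):
        segments.append(min_merge_docs)
        merged = True
        while merged:  # cascade merges: 10 x 50 -> 500, 10 x 500 -> 5000, ...
            merged = False
            for size, n in sorted(Counter(segments).items()):
                if n >= merge_factor and size * merge_factor <= max_merge_docs:
                    for _ in range(merge_factor):
                        segments.remove(size)
                    segments.append(size * merge_factor)
                    merged = True
                    break
    return len(segments)

# A small maxMergeDocs leaves far more segments for optimize() to merge:
print(count_segments(100_000, 50, 10, 1_000))    # 200 segments left
print(count_segments(100_000, 50, 10, 100_000))  # 2 segments left
```

In this model the capped run ends with 200 leftover segments versus 2 for the uncapped run, so the deferred merge work simply moves into the optimize step.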
I'm using the mapred branch on a FreeBSD 7.0 box to do fetches of a
300k URL list.
Initially, it's able to reach ~25 pages/s with 150 threads. The
fetcher gets progressively slower though, dropping down to about
~15 pages/s after about 2-3 hours or so and continues to slow
down. I've seen a few
Ken Krugler wrote:
We're only using the html text parsers, so I don't think that's the
problem. Plus we're dumping the thread stack when it hangs, and it's always
in the ChunkedInputStream.exhaustInputStream() process (see trace below).
The trace did not make it.
Have you tried protocol-http
Ken van Mulder wrote:
As a side note, does anyone have any recommendations for profiling
software? I've used the standard hprof, which slows down the process too
much for my needs, and JMP, which seems pretty unstable.
I recommend 'kill -QUIT' as a poor-man's profiler. With a few stack
dumps
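For reference, a minimal sketch of the poor-man's-profiler idea, driven from Python (the target process here is a throwaway `sleep` stand-in, since no JVM pid is available in this sketch; a real HotSpot JVM responds to SIGQUIT by printing a full thread dump to its stdout and keeps running, whereas plain `sleep` just terminates):

```python
import os
import signal
import subprocess
import time

# Stand-in for the JVM pid you would normally look up with ps or jps.
proc = subprocess.Popen(["sleep", "60"])
time.sleep(0.2)

# Equivalent of 'kill -QUIT <pid>': a HotSpot JVM dumps all thread stacks
# and continues; repeat a few times and the hot spots show up in the stacks.
os.kill(proc.pid, signal.SIGQUIT)
proc.wait()
print("exit status:", proc.returncode)  # plain 'sleep' dies with -SIGQUIT
```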
I've been working with the following to consistently
get 200 rec/s indexed (index-more and language-ident
enabled).
Mind you, I have oversized these and I'm working
backwards to shrink them down (all this machine does
is index). Odd thing is, the JVM really didn't change
much with these adjusted.
Byron Miller wrote:
<property>
  <name>indexer.mergeFactor</name>
  <value>350</value>
  <description>
  </description>
</property>
Initially the high index mergeFactor caused out-of-file-handle
errors, but increasing the others along with it
seemed to help get around that.
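On the file-handle point: each Lucene segment is several files, and a large mergeFactor keeps many segments (and so many descriptors) open at once, so the process's open-file limit is the ceiling to watch. A quick way to inspect that limit, shown from Python purely for illustration (in practice it is the JVM process's limit, raised with `ulimit -n`, that matters):

```python
import resource

# The soft limit is what the process actually hits when it runs out of
# descriptors; the hard limit is the ceiling the soft limit can be raised
# to without root.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft} hard={hard}")
```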
That is a very large mergeFactor,
My testing is on 100k documents, but most of the time
I work with 1 million, so I don't have a gazillion
segments across my servers.
I'll try adjusting that number down and see what
happens.
-byron
--- Doug Cutting [EMAIL PROTECTED] wrote:
Byron Miller wrote:
If you use site:mydomain.com instead of
site:www.mydomain.com, shouldn't the query search
home.mydomain.com, news.mydomain.com or any prefixed
URL of that domain?
Ken Krugler wrote:
We're only using the html text parsers, so I don't think that's
the problem. Plus we're dumping the thread stack when it hangs, and
it's always in the ChunkedInputStream.exhaustInputStream() process
(see trace below).
The trace did not make it.
Oops - see at the end of
On Oct 28, 2005, at 4:44 PM, Byron Miller wrote:
If you use site:mydomain.com instead of
site:www.mydomain.com, shouldn't the query search
home.mydomain.com, news.mydomain.com or any prefixed
URL of that domain?
site: only matches on the full hostname that was given in the URL.
One
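That exact-hostname behavior can be sketched like this (illustrative only; `matches_site` and the use of `urlparse` are my stand-ins, not Nutch code — the point is just that the indexed site field holds the URL's full hostname, with no subdomain expansion):

```python
from urllib.parse import urlparse

def matches_site(url: str, site: str) -> bool:
    # The site field holds the full hostname, so there is no implicit
    # subdomain expansion: 'mydomain.com' != 'www.mydomain.com'.
    return urlparse(url).hostname == site.lower()

print(matches_site("http://www.mydomain.com/news", "mydomain.com"))      # False
print(matches_site("http://www.mydomain.com/news", "www.mydomain.com"))  # True
print(matches_site("http://home.mydomain.com/", "mydomain.com"))         # False
```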
Trunk? Map reduce? Could you describe your box
setup, job division, and maybe post your
conf/nutch-site.xml file?
Just trying to get things going and not having much luck
with the mapreduce branch. I also tried trunk; the
crawl stops around 3 pages (out of maybe a
million), and once it's
Hi, everybody:
I ran Nutch 0.7.1 in Cygwin installed on my Windows XP OS, but
got the following exception. I'm sure there is a file named urls that I had
created in the current directory!
$ bin/nutch crawl urls -dir crawl.test -depth 3 -thread 1
run java in C:\java\j2sdk1.4.2_04
050928 110225 parsing