Re: [Nutch-general] Issue in executing multi-field search in Nutch

2005-10-28 Thread Anupkumar Putane
Hi, Thanks for the pointers. We had indeed missed adding the language-identifier plugin to the WAR file. Adding it resolved the problem. Best Regards, Anup Jérôme Charron [EMAIL PROTECTED] 10/27/2005 08:40 PM Please respond to nutch-user@lucene.apache.org To

Re: fetch questions - freezing

2005-10-28 Thread Andrzej Bialecki
Ken van Mulder wrote: I'm using the mapred branch on a FreeBSD 7.0 box to do fetches of a 300k URL list. Initially, it's able to reach ~25 pages/s with 150 threads. The fetcher gets progressively slower though, dropping down to ~15 pages/s after about 2-3 hours or so and continues to slow

Re: fetch questions - freezing

2005-10-28 Thread Byron Miller
For what it's worth, I fetch my segments of 1 million URLs with 80 threads at a time and see no slowdowns. I'll grab some of my stats and publish them, but I haven't had problems with the fetcher slowing down like this in a long time. (Linux/CentOS 4.2 platform) -byron --- Andrzej Bialecki [EMAIL

Indexer Performance - up to 200+ rec/s with Lang identification enabled

2005-10-28 Thread Byron Miller
051028 083415 DONE indexing segment 20051019000305: total 100000 records in 520.156 s (192.3077 rec/s). 051028 083415 done indexing Been doing some testing and I've pretty much peaked out at 192-200 rec/s on a 2.8 GHz machine with lang ident enabled on 512 bytes of data @ 3-grams, which after tweaking
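As a quick sanity check on the log line above, the reported rate is just records divided by elapsed seconds. The following is a minimal sketch, not anything from Nutch itself: the function name is hypothetical, and the 100,000-record count is an assumption based on the "100k documents" figure mentioned elsewhere in these threads, since that is the count that reproduces the logged rate.

```python
def records_per_second(records: int, seconds: float) -> float:
    """Average indexing throughput over a whole run."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return records / seconds


# Assumed record count (100k documents); elapsed time is taken from the log line.
rate = records_per_second(100_000, 520.156)
print(f"{rate:.4f} rec/s")

# Dividing by 520 whole seconds instead reproduces the logged 192.3077 figure.
rate_whole_seconds = records_per_second(100_000, 520)
print(f"{rate_whole_seconds:.4f} rec/s")
```

The small gap between the two results suggests the logged rate was computed from whole seconds, though that is only a guess from the arithmetic.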

RE: fetch questions - freezing

2005-10-28 Thread Steve Betts
There is an issue with the PDFBox library shipped with Nutch 0.7. It will hang parsing certain PDF files. PDFBox 0.7.2 fixes this issue. If you are parsing PDF files, then this could also be a problem. Thanks, Steve Betts [EMAIL PROTECTED] 937-477-1797 -Original Message- From: Byron

Re: fetch questions - freezing

2005-10-28 Thread Doug Cutting
Ken van Mulder wrote: Initially, it's able to reach ~25 pages/s with 150 threads. The fetcher gets progressively slower though, dropping down to ~15 pages/s after about 2-3 hours or so and continues to slow down. I've seen a few references on these lists to the issue, but I'm not clear on

Re: Peak index performance

2005-10-28 Thread Doug Cutting
Byron Miller wrote: For example, I've been tweaking max merge/min merge and such, and I've been able to double my performance without increasing anything but CPU load. Smaller maxMergeDocs will cost you in the end, since these will eventually be merged during the index optimization at the end.

Re: Indexer Performance - up to 200+ rec/s with Lang identification enabled

2005-10-28 Thread Ken Krugler
051028 083415 DONE indexing segment 20051019000305: total 100000 records in 520.156 s (192.3077 rec/s). 051028 083415 done indexing Been doing some testing and I've pretty much peaked out at 192-200 rec/s on a 2.8 GHz machine with lang ident enabled on 512 bytes of data @ 3-grams, which after tweaking

Re: fetch questions - freezing

2005-10-28 Thread Ken Krugler
I'm using the mapred branch on a FreeBSD 7.0 box to do fetches of a 300k URL list. Initially, it's able to reach ~25 pages/s with 150 threads. The fetcher gets progressively slower though, dropping down to ~15 pages/s after about 2-3 hours or so and continues to slow down. I've seen a few

Re: fetch questions - freezing

2005-10-28 Thread Doug Cutting
Ken Krugler wrote: We're only using the HTML text parsers, so I don't think that's the problem. Plus we're dumping the thread stack when it hangs, and it's always in the ChunkedInputStream.exhaustInputStream() process (see trace below). The trace did not make it. Have you tried protocol-http

Re: fetch questions - freezing

2005-10-28 Thread Doug Cutting
Ken van Mulder wrote: As a side note, does anyone have any recommendations for profiling software? I've used the standard hprof, which slows down the process too much for my needs, and jmp, which seems pretty unstable. I recommend 'kill -QUIT' as a poor-man's profiler. With a few stack dumps
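The idea behind using stack dumps as a profiler is sampling: sending SIGQUIT to a JVM makes it print a thread dump, and if several dumps taken a few seconds apart keep showing the same frame at the top of the threads' stacks, that frame is where the time is going. Below is a minimal sketch in Python of the tallying step only. It assumes the dumps were saved as text in the usual HotSpot layout (a quoted thread-name line followed by indented `at ...` frames); the function names, package names, and sample dumps are hypothetical.

```python
import re
from collections import Counter

# Matches an indented stack frame such as "\tat pkg.Class.method(File.java:12)".
FRAME = re.compile(r"^\s+at\s+([\w.$<>]+)\(")


def top_frames(dump):
    """Return the innermost (first listed) frame of each thread in one dump."""
    frames = []
    expecting_first = False
    for line in dump.splitlines():
        if line.startswith('"'):  # thread header, e.g. "fetcher-1" prio=5
            expecting_first = True
            continue
        match = FRAME.match(line)
        if match and expecting_first:
            frames.append(match.group(1))
            expecting_first = False
    return frames


def tally(dumps):
    """Count how often each frame is on top across all dumps."""
    counts = Counter()
    for dump in dumps:
        counts.update(top_frames(dump))
    return counts


# Hypothetical sample dumps, for illustration only.
DUMP_1 = '''"fetcher-1" prio=5
\tat example.ChunkedInputStream.exhaustInputStream(ChunkedInputStream.java)
\tat example.Http.getResponse(Http.java)

"fetcher-2" prio=5
\tat example.Parser.parse(Parser.java)
'''

DUMP_2 = '''"fetcher-1" prio=5
\tat example.ChunkedInputStream.exhaustInputStream(ChunkedInputStream.java)

"fetcher-2" prio=5
\tat example.ChunkedInputStream.exhaustInputStream(ChunkedInputStream.java)
'''

if __name__ == "__main__":
    for frame, count in tally([DUMP_1, DUMP_2]).most_common():
        print(count, frame)
```

With only a handful of samples the counts are coarse, but a frame that dominates every dump is usually a strong enough hint about where threads are stuck.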

Re: Peak index performance

2005-10-28 Thread Byron Miller
I've been working with the following to consistently get 200 rec/s indexed (index-more and language-identifier enabled). Mind you, I have oversized these and I'm working backwards to shrink them down (all this machine does is index). The odd thing is that the JVM really didn't change much with these adjusted.

Re: Peak index performance

2005-10-28 Thread Doug Cutting
Byron Miller wrote: <property> <name>indexer.mergeFactor</name> <value>350</value> <description> </description> </property> Initially high index merge factor caused out of file handle errors but increasing the others along with it seemed to help get around that. That is a very large mergeFactor,

Re: Peak index performance

2005-10-28 Thread Byron Miller
My testing is on 100k documents, but most of the time I work with 1 million so I don't have a gazillion segments across my servers. I'll try to adjust that number down and see what happens. -byron --- Doug Cutting [EMAIL PROTECTED] wrote: Byron Miller wrote: property

using site:mydomain.com searches question

2005-10-28 Thread Byron Miller
If you use site:mydomain.com instead of site:www.mydomain.com, shouldn't the query search home.mydomain.com, news.mydomain.com or any prefixed url of that domain?

Re: fetch questions - freezing

2005-10-28 Thread Ken Krugler
Ken Krugler wrote: We're only using the HTML text parsers, so I don't think that's the problem. Plus we're dumping the thread stack when it hangs, and it's always in the ChunkedInputStream.exhaustInputStream() process (see trace below). The trace did not make it. Oops - see at the end of

Re: using site:mydomain.com searches question

2005-10-28 Thread Andy Lee
On Oct 28, 2005, at 4:44 PM, Byron Miller wrote: If you use site:mydomain.com instead of site:www.mydomain.com, shouldn't the query search home.mydomain.com, news.mydomain.com or any prefixed url of that domain? site: only matches on the full hostname that was given in the url. One
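To illustrate the difference between matching the full hostname and matching a domain suffix, here is a minimal sketch in Python. It is not Nutch's implementation: the function names are hypothetical, and the suffix matcher is shown only as the behaviour the question was hoping for.

```python
from urllib.parse import urlparse


def matches_site_exact(url, site):
    """True only when the URL's hostname equals the given site exactly."""
    host = urlparse(url).hostname or ""
    return host == site.lower()


def matches_site_suffix(url, site):
    """True when the hostname is the site itself or any host beneath it."""
    host = urlparse(url).hostname or ""
    site = site.lower()
    return host == site or host.endswith("." + site)


if __name__ == "__main__":
    for url in ("http://www.mydomain.com/", "http://news.mydomain.com/a"):
        print(url,
              matches_site_exact(url, "mydomain.com"),
              matches_site_suffix(url, "mydomain.com"))
```

Requiring a leading dot in the suffix check keeps an unrelated host such as notmydomain.com from matching.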

Re: fetch questions - freezing

2005-10-28 Thread Earl Cahill
Trunk? Map reduce? Could you describe your box setup, job division, and maybe post your conf/nutch-site.xml file? Just trying to get things going and not having much luck with the mapreduce branch. I also tried trunk, the crawl stops around 3 pages (out of maybe a million), and once it's

I ran Nutch 0.7.1's crawl but got a FileNotFoundException. Why? Please help me.

2005-10-28 Thread RZG
Hi, everybody: I ran Nutch 0.7.1 in Cygwin installed on my Windows XP OS, but got the following exception. I'm sure there is a file named urls that I created in the current directory! $ bin/nutch crawl urls -dir crawl.test -depth 3 -thread 1 run java in C:\java\j2sdk1.4.2_04 050928 110225 parsing