Re: 100 fetches per second?

2009-11-24 Thread Dennis Kubes
Hi Mark, I just put this up on the wiki. Hope it helps: http://wiki.apache.org/nutch/OptimizingCrawls Dennis Mark Kerzner wrote: Hi, guys, my goal is to do my crawls at 100 fetches per second, observing, of course, polite crawling. But, when URLs are all different domains, what
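The OptimizingCrawls wiki page goes into the details; as a rough illustration of the kind of settings it talks about, here is a minimal nutch-site.xml sketch. The property names are the standard Nutch 1.x fetcher properties, but the values are assumptions to be checked against conf/nutch-default.xml for your version.

  <!-- Sketch only: goes inside <configuration> in conf/nutch-site.xml. -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>100</value>        <!-- total fetcher threads per fetch task -->
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>          <!-- stay polite: one thread per host -->
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>        <!-- seconds between requests to the same host -->
  </property>
  <property>
    <name>http.timeout</name>
    <value>10000</value>      <!-- ms before giving up on a slow server -->
  </property>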

Re: 100 fetches per second?

2009-11-24 Thread Mark Kerzner
Dennis, that's awesomely interesting. Thank you, Mark On Tue, Nov 24, 2009 at 10:01 AM, Dennis Kubes ku...@apache.org wrote: Hi Mark, I just put this up on the wiki. Hope it helps: http://wiki.apache.org/nutch/OptimizingCrawls Dennis Mark Kerzner wrote: Hi, guys, my goal is to

Re: 100 fetches per second?

2009-11-24 Thread Julien Nioche
Hi Mark, I've recently contributed two patches on JIRA (NUTCH-769 / NUTCH-770) which will have an impact on crawling speed. This should help with the fetch rate slowing down. There is also https://issues.apache.org/jira/browse/NUTCH-753 which should help to a lesser extent. Julien --

Re: 100 fetches per second?

2009-11-24 Thread MilleBii
Why would local DNS caching work... It only works if you are going to crawl the same site often, in which case you are hit by the politeness delay. If you have segments with only/mainly different sites it is not really going to help. So far I have not seen my quad core + 100 Mb/s + pseudo
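For reference, a local caching resolver is typically what the optimization advice means here; a minimal dnsmasq sketch is below (assuming dnsmasq is installed and the crawl box is then pointed at 127.0.0.1; option names may vary slightly between versions).

  # /etc/dnsmasq.conf -- minimal caching-only resolver (sketch)
  listen-address=127.0.0.1   # answer queries from this machine only
  cache-size=10000           # keep up to 10,000 resolved names in memory
  # and in /etc/resolv.conf on the crawl box:
  #   nameserver 127.0.0.1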

Re: 100 fetches per second?

2009-11-24 Thread Mark Kerzner
I may be awfully wrong on that, but below is my plan for super-fast crawling. I have prepared it for a venture that does not need it anymore, but it looks like fun to do anyway. What would you all say: is there a need, and what's wrong with the plan? Thank you, Mark Fast Crawl Plan =

Map and Reduce not overlapping in a pseudo-distributed

2009-11-24 Thread MilleBii
I just observed that on my setup (Hadoop pseudo-distributed) I hardly get any overlap between the Map and Reduce phases... sounds strange to me, especially when I have plenty of spare CPU. Would there be a setting to set this properly? -- -MilleBii-
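Two things usually explain this (stated as general Hadoop behaviour, not something from this thread): only the reducers' copy/shuffle phase can overlap with the maps, since the sort and the reduce function itself have to wait for the last map to finish; and when reducers are scheduled, plus how many tasks run concurrently on the single node, is controlled by a few mapred-site.xml properties. A sketch with the Hadoop 0.19/0.20-era property names (the values are illustrative assumptions):

  <!-- Schedule reducers (their shuffle phase) once 5% of the maps are done. -->
  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0.05</value>
  </property>
  <!-- Raise the per-node task slots so maps and reduces can actually run together. -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>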

dedup doesn't delete duplicates!

2009-11-24 Thread BELLINI ADAM
Hi, dedup doesn't work for me. I have read that duplicates have either the same content (via MD5 hash) or the same URL. In my case I don't have the same URLs but still have the same content for those URLs. I'll give you an example: I have three URLs that have the same content 1-

Re: 100 fetches per second?

2009-11-24 Thread MilleBii
So I do indeed have a local DNS server running; it does not seem to help so much. Just finished a run of 80K URLs: at the beginning the speed can be around 15 fetches/s, and at the end, due to long-tail effects, I get below 1 fetch/s... Average on a 12h34 run: 1.7 fetches/s, pretty slow really. I limit URL
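One common way to blunt that long tail (an assumption about the setup, not something stated in this thread) is to cap how many URLs from any single host go into a generated segment, so the end of the fetch is not dominated by a few large or slow hosts that the politeness delay then serializes. A nutch-site.xml sketch using the generate.max.per.host property from Nutch of that era:

  <!-- Cap URLs per host in each generated segment (sketch; value is illustrative). -->
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>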

Re: dedup doesn't delete duplicates!

2009-11-24 Thread Andrzej Bialecki
BELLINI ADAM wrote: Hi, dedup doesn't work for me. I have read that duplicates have either the same content (via MD5 hash) or the same URL. In my case I don't have the same URLs but still have the same content for those URLs. I'll give you an example: I have three URLs that have the same

RE: dedup doesn't delete duplicates!

2009-11-24 Thread BELLINI ADAM
I also don't understand why they have 3 different signatures, since it's really the same page! From: mbel...@msn.com To: nutch-user@lucene.apache.org Subject: dedup doesn't delete duplicates! Date: Tue, 24 Nov 2009 20:56:39 + Hi, dedup doesn't work for me. I have read that

RE: dedup doesn't delete duplicates!

2009-11-24 Thread BELLINI ADAM
Yes, I checked the signatures and they are not the same!! It's really weird: the URL www.domaine/folder/index.html?lang=fr is just this one www.domaine/folder/index.html Date: Tue, 24 Nov 2009 22:21:19 +0100 From: a...@getopt.org To: nutch-user@lucene.apache.org Subject: Re: dedup doesn't delete

Re: dedup doesn't delete duplicates!

2009-11-24 Thread Andrzej Bialecki
BELLINI ADAM wrote: Yes, I checked the signatures and they are not the same!! It's really weird: the URL www.domaine/folder/index.html?lang=fr is just this one www.domaine/folder/index.html Apparently it isn't a bit-exact replica of the page, so its MD5 hash is different. You need to use a more
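The reply is cut off here, but the usual fix along these lines (an assumption, consistent with the TextProfileSignature mentioned later in this thread) is to switch the signature implementation from the exact MD5 hash to the fuzzier text-profile one. A nutch-site.xml sketch:

  <!-- Use a profile of the parsed text instead of an exact MD5 of the content,
       so small markup differences map to the same signature. Signatures are
       computed at fetch/parse time, so pages must be re-fetched before dedup
       sees the new values. -->
  <property>
    <name>db.signature.class</name>
    <value>org.apache.nutch.crawl.TextProfileSignature</value>
  </property>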

RE: dedup doesn't delete duplicates!

2009-11-24 Thread BELLINI ADAM
Hi, my two URLs point to the same page! The page is stored in a database, and the parameter lang=fr or lang=en is just to extract the English page or the French one from the database... and by default the page should be in French!! So the URL www.domaine/folder/index.html?lang=fr and
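Another option in a case like this, where different query strings lead to the same stored page, is to normalize the URLs so the variants collapse into one CrawlDb entry before dedup even matters. A hedged sketch for conf/regex-normalize.xml (assuming the urlnormalizer-regex plugin is enabled in plugin.includes; the pattern is only an illustration for the lang parameter described here):

  <!-- Drop a trailing ?lang=fr / ?lang=en query string so both variants
       normalize to the same URL (sketch only). -->
  <regex>
    <pattern>\?lang=(fr|en)$</pattern>
    <substitution></substitution>
  </regex>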

Re: How do I block/ban a specific domain name or a TLD?

2009-11-24 Thread Subhojit Roy
Try: 1. In order to prevent crawling of URLs with the pattern *who.int*, you can add the following to the following files: a) if you are using the bin/nutch crawl command, add the following line inside conf/crawl-urlfilter.txt: -^http://([a-z0-9]*\.)*who.int/ b) if you are

Re: dedup doesn't delete duplicates!

2009-11-24 Thread Subhojit Roy
Hi, Does TextProfileSignature exclude the HTML header (meta tags etc.) while creating the signature for a page? I have noticed that minor differences like timestamps cause the same page to look different to Nutch, causing multiple copies of the same page to be added to the index. Is it also

Re: 100 fetches per second?

2009-11-24 Thread MilleBii
I looked at the bandwidth profile of my last two runs and they have the same shape: it starts at 5 MB/s and decreases to below 500 kB/s with a 1/x kind of curve shape. Fairly even distribution of URLs, local DNS is running... I can't find a good explanation for this behavior... It looks to

Re: How do I block/ban a specific domain name or a TLD?

2009-11-24 Thread Subhojit Roy
Sorry... The regular expressions should be: -^http://([a-z0-9]*\.)*who.int/ I had made an error in the previous email. Wonder whether gmail is playing with the characters in the sent emails... -sroy On Wed, Nov 25, 2009 at 12:00 PM, Subhojit Roy mails...@gmail.com wrote: Try:
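For completeness, a sketch of how that exclusion sits inside conf/crawl-urlfilter.txt (or conf/regex-urlfilter.txt): rules are applied top to bottom and the first match wins, so the minus rule must come before whatever accept rule the file ends with. The who.int pattern is the one from this thread (with the dot escaped); the surrounding lines are illustrative.

  # skip who.int and any of its subdomains
  -^http://([a-z0-9]*\.)*who\.int/
  # ... other filter rules ...
  # accept everything else (or the file's existing +^http://... accept rule)
  +.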