Re: dedup dont delete duplicates !

2009-11-25 Thread Andrzej Bialecki
BELLINI ADAM wrote: hi, my two urls points to the same page ! Please, no need to shout ... If the MD5 signatures are different, then the binary content of these pages is different, period. Use readseg -dump utility to retrieve the page content from the segment, extract just the two pages
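To illustrate the point about signatures: an MD5 digest is computed over the raw bytes, so two pages that look identical in a browser still get different signatures if even one byte differs. The sketch below is plain Python, not Nutch code; the page contents and the helper names `md5_hex` and `first_difference` are made up for illustration. It assumes the two page bodies have already been extracted (for example from a `readseg -dump` output) and loaded as bytes.

```python
import hashlib


def md5_hex(content: bytes) -> str:
    """Return the MD5 digest of the raw bytes as a hex string."""
    return hashlib.md5(content).hexdigest()


def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One input is a strict prefix of the other.
        return min(len(a), len(b))
    return None


# Hypothetical page bodies: visually the same, but page_b has a trailing newline.
page_a = b"<html><body>Hello</body></html>"
page_b = b"<html><body>Hello</body></html>\n"

print(md5_hex(page_a) == md5_hex(page_b))   # False: the bytes differ
print(first_difference(page_a, page_b))     # offset where they diverge
```

Locating the first differing offset is often quicker than reading a full diff when the difference is a timestamp, session id, or whitespace.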

Re: dedup dont delete duplicates !

2009-11-25 Thread reinhard schwab
Andrzej Bialecki schrieb: BELLINI ADAM wrote: hi, my two urls points to the same page ! Please, no need to shout ... If the MD5 signatures are different, then the binary content of these pages is different, period. Use readseg -dump utility to retrieve the page content from the

Re: dedup dont delete duplicates !

2009-11-25 Thread Mischa Tuffield
Hello All, I am getting the following error in my hadoop.log (see below). It seems to happen every time I run any of the nutch command line tools :( !-- 2009-11-25 11:42:49,299 INFO crawl.Injector - Injector: done 2009-11-25 11:42:49,302 DEBUG hdfs.DFSClient -

Re: 100 fetches per second?

2009-11-25 Thread Dennis Kubes
It is not about the local DNS caching as much as having local DNS servers. Too many fetchers hitting a centralized DNS server can act as a DOS attack and slow down the entire fetching system. For example say I have a single centralized DNS server for my network. And say I have 2 map task per
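A rough way to see why a single central DNS server becomes a bottleneck is to multiply out the number of fetcher threads that can ask for a lookup at the same moment. The sketch below is only back-of-the-envelope arithmetic; the function name `peak_dns_queries` and all the numbers passed to it are hypothetical and are not taken from the message above.

```python
def peak_dns_queries(nodes: int, tasks_per_node: int, threads_per_task: int) -> int:
    """Upper bound on simultaneous DNS lookups if every fetcher thread resolves at once."""
    return nodes * tasks_per_node * threads_per_task


# Hypothetical cluster: 5 nodes, 2 map tasks per node, 50 fetcher threads per task.
print(peak_dns_queries(5, 2, 50))  # 500
```

With a DNS server on each node, the same estimate applies per node (here 2 * 50 = 100 threads) rather than to the whole cluster.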

Re: Nutch config IOException

2009-11-25 Thread Andrzej Bialecki
Mischa Tuffield wrote: Hello Again, Following my previous post below, I have noticed that I get the following IOException every time I attempt to use nutch. !-- 2009-11-25 12:19:18,760 DEBUG conf.Configuration - java.io.IOException: config() at

Re: Nutch config IOException

2009-11-25 Thread Mischa Tuffield
Hi Andrzej, Yeah, I just noticed that this stack trace is for DEBUG purposes only; I found it in the hadoop src. Thanks for the info. Regards, Mischa On 25 Nov 2009, at 13:11, Andrzej Bialecki wrote: Mischa Tuffield wrote: Hello Again, Following my previous post below, I have noticed that

RE: dedup dont delete duplicates !

2009-11-25 Thread BELLINI ADAM
plz mischa, if your problem is not about delete duplicate just open another thread ! thx Andrzej, thx for all, i will try to run a diff command on the content of the 2 pages. i will give you news when done. From: mischa.tuffi...@garlik.com Subject: Re: dedup dont delete duplicates !

Re: dedup dont delete duplicates !

2009-11-25 Thread Mischa Tuffield
Ok, my bad. M On 25 Nov 2009, at 15:35, BELLINI ADAM wrote: plz mischa, if your problem is not about delete duplicate just open another thread ! thx Andrzej, thx for all, i will try to run a diff command on the content of the 2 pages. i will give you news when done. From:

recrawl.sh stopped at depth 7/10 without error

2009-11-25 Thread BELLINI ADAM
hi, i'm running recrawl.sh and it stops every time at depth 7/10 without any error ! but when i run bin/crawl with the same crawl-urlfilter and the same seeds file it finishes smoothly in 1h50. i checked the hadoop.log and don't find any error there... i just find the last url it was parsing

Re: 100 fetches per second?

2009-11-25 Thread MilleBii
Get your point... Although I thought a high number of threads would do exactly the same. Maybe I am missing something. During my fetcher runs, used bandwidth gets low pretty quickly, disk I/O is low, the CPU is low... So it must be waiting for something, but what? Could be the DNS cache which is full and

Re: 100 fetches per second?

2009-11-25 Thread Dennis Kubes
If it is waiting and the box is idle, my first thought is not DNS. I just put that up as one of the things people will run into. Most likely it is uneven distribution of urls or something like that. Dennis MilleBii wrote: Get your point... Although I thought high number of threads would do
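One quick way to check for an uneven distribution of URLs is to count how many URLs in a fetch list belong to each host. If a single host dominates, the fetcher threads end up waiting on that host's queue while the box sits idle. The sketch below is plain Python, not part of Nutch; the function names and the sample URLs are hypothetical, and it assumes the URLs have already been dumped to a list of strings.

```python
from collections import Counter
from urllib.parse import urlsplit


def urls_per_host(urls):
    """Count URLs per hostname, skipping entries without a parseable host."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if host:
            counts[host] += 1
    return counts


def top_host_share(urls):
    """Return (host, fraction) for the host holding the largest share of URLs."""
    counts = urls_per_host(urls)
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    host, n = counts.most_common(1)[0]
    return host, n / total


# Hypothetical fetch list, heavily skewed towards one host.
sample = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
    "http://example.org/x",
]
print(top_host_share(sample))  # ('example.com', 0.75)
```

A large share for one host suggests the tail of the run is limited by per-host politeness rather than by DNS, bandwidth, or CPU.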

Re: 100 fetches per second?

2009-11-25 Thread Julien Nioche
or it is stuck on a couple of hosts which time out? The logs should have a trace with the number of active threads, which should give some indication of what's happening. Julien 2009/11/25 Dennis Kubes ku...@apache.org If it is waiting and the box is idle, my first thought is not DNS. I just

Re: 100 fetches per second?

2009-11-25 Thread MilleBii
The logs show that my fetch queue is full and my 100 threads are mostly spin waiting towards the end. Now the very last run (150kURLs) I can clearly see 4 phases: + very high speed : 3MB/s for a few minutes + sudden speed drop around 1MB/s and flat for several hours + another speed drop to

Re: 100 fetches per second?

2009-11-25 Thread Mark Kerzner
Judging by how this discussion goes, there may be a need for a URL mix optimizer and for a fast crawler based on that. Is this something worth pursuing? MilleBii, what do you think? Mark On Wed, Nov 25, 2009 at 3:44 PM, MilleBii mille...@gmail.com wrote: The logs show that my fetch queue is full

Re: 100 fetches per second?

2009-11-25 Thread MilleBii
I have to say that I'm still puzzled. Here is the latest. I just restarted a run and then guess what: got ultra-high speed: 8Mbits/s sustained for 1 hour where I could only get 3Mbit/s max before (note: bits, not bytes as I said before). A few samples show that I was running at 50 Fetches/sec
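Since the figures were corrected from bytes to bits, it may help to spell out the conversion. The sketch below converts Mbit/s to MB/s and, assuming the 8 Mbit/s and 50 fetches/sec figures were measured at the same time (the message does not say so explicitly), estimates the average number of bytes transferred per fetch. The function names are made up for illustration, and 1 Mbit is taken as 1,000,000 bits.

```python
def mbit_to_mbyte(mbit_per_s: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbit_per_s / 8


def avg_bytes_per_fetch(mbit_per_s: float, fetches_per_s: float) -> float:
    """Average bytes transferred per fetch at a given bandwidth and fetch rate."""
    bytes_per_s = mbit_per_s * 1_000_000 / 8
    return bytes_per_s / fetches_per_s


print(mbit_to_mbyte(8))            # 1.0 MB/s
print(mbit_to_mbyte(3))            # 0.375 MB/s
print(avg_bytes_per_fetch(8, 50))  # 20000.0 bytes, about 20 KB per fetch
```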

Re: 100 fetches per second?

2009-11-25 Thread Andrzej Bialecki
MilleBii wrote: I have to say that I'm still puzzled. Here is the latest. I just restarted a run and then guess what: got ultra-high speed: 8Mbits/s sustained for 1 hour where I could only get 3Mbit/s max before (note: bits, not bytes as I said before). A few samples show that I was running

Re: 100 fetches per second?

2009-11-25 Thread Dennis Kubes
One interesting thing we were seeing a while back on large crawls where we were fetching the best scoring pages first, then next best, and so on, is that lower scoring pages typically had worse response time rates and worse timeout rates. So while the best scoring pages would respond very

Re: Exception while slicing and parsing old segments without fetching

2009-11-25 Thread srinivasarao v
Hi Vishal, I got the same problem while running updatedb and invertlinks. Have you got the solution to the problem? Please let me know if you get the solution. Thank You, Srinivas On Mon, Aug 24, 2009 at 2:00 PM, vishal vachhani vishal...@gmail.com wrote: Hi All, I had a big segment(size=

Re: 100 fetches per second?

2009-11-25 Thread MilleBii
Dennis, Interesting info, I don't use the standard OPIC scorer but a slightly modified version which boosts pages with content that I'm looking for... so it could be that my pages are generally on slow servers. Now heads-up, just started a new run with 450k URLs and it looks like I'm back to the