BELLINI ADAM wrote:
hi,
my two URLs point to the same page!
Please, no need to shout ...
If the MD5 signatures are different, then the binary content of these
pages is different, period.
Use the readseg -dump utility to retrieve the page content from the segment,
extract just the two pages
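For reference, a rough sketch of that check using the Nutch command-line tools; the segment path, output directory, and file names below are placeholders, and the -no* flags may vary slightly between versions:

  # dump the stored content of the segment to plain text
  bin/nutch readseg -dump crawl/segments/20091125120000 dump_dir \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext

  # the dump is a plain-text file (typically dump_dir/dump); locate the two URLs,
  # save each record to its own file, then compare them
  diff page1.txt page2.txt   # any difference explains the differing MD5 signatures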
Andrzej Bialecki wrote:
BELLINI ADAM wrote:
hi,
my two URLs point to the same page!
Please, no need to shout ...
If the MD5 signatures are different, then the binary content of these
pages is different, period.
Use the readseg -dump utility to retrieve the page content from the
Hello All,
I am getting the following error in my hadoop.log (see below). It seems to
happen every time I run any of the Nutch command-line tools :(
2009-11-25 11:42:49,299 INFO crawl.Injector - Injector: done
2009-11-25 11:42:49,302 DEBUG hdfs.DFSClient -
It is not about the local DNS caching as much as having local DNS
servers. Too many fetchers hitting a centralized DNS server can act as
a DOS attack and slow down the entire fetching system.
For example, say I have a single centralized DNS server for my network.
And say I have 2 map tasks per
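A quick sketch of how one might check whether lookups are the bottleneck from a fetcher node; the hostnames are placeholders and the JVM property applies to Sun/Oracle JVMs:

  # measure raw lookup latency against the resolver the fetcher tasks actually use
  for h in example.com example.org example.net; do
    dig +noall +stats "$h" | grep "Query time"
  done

  # optionally let the fetcher JVMs cache successful lookups for 5 minutes
  # (set via HADOOP_OPTS or NUTCH_OPTS, depending on how the jobs are launched)
  export HADOOP_OPTS="$HADOOP_OPTS -Dsun.net.inetaddr.ttl=300"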
Mischa Tuffield wrote:
Hello Again,
Following my previous post below, I have noticed that I get the following IOException every time I attempt to use Nutch.
2009-11-25 12:19:18,760 DEBUG conf.Configuration - java.io.IOException: config()
at
Hi Andrzej,
Yeah, I just noticed that this stack trace is for DEBUG purposes only; I found
it in the Hadoop src, thanks for the info.
Regards,
Mischa
On 25 Nov 2009, at 13:11, Andrzej Bialecki wrote:
Mischa Tuffield wrote:
Hello Again, Following my previous post below, I have noticed that
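If those DEBUG traces are just noise, one possible way to hide them is to raise the log level for Hadoop's Configuration class; this assumes the stock Nutch conf/log4j.properties layout:

  # stop logging the config() IOException stack traces emitted at DEBUG level
  echo "log4j.logger.org.apache.hadoop.conf.Configuration=INFO" >> conf/log4j.properties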
plz Mischa, if your problem is not about deleting duplicates, just open another
thread! thx
Andrzej, thx for all, I will try to run a diff command on the content of the 2
pages.
I will let you know when it's done.
From: mischa.tuffi...@garlik.com
Subject: Re: dedup dont delete duplicates !
Ok, my bad.
M
On 25 Nov 2009, at 15:35, BELLINI ADAM wrote:
plz Mischa, if your problem is not about deleting duplicates, just open another
thread! thx
Andrzej, thx for all, I will try to run a diff command on the content of the
2 pages.
I will let you know when it's done.
From:
hi,
I'm running recrawl.sh and it stops every time at depth 7/10 without any error!
But when I run bin/crawl with the same crawl-urlfilter and the same seeds file,
it finishes smoothly in 1h50.
I checked the hadoop.log and don't find any error there... I just find the last
URL it was parsing.
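For comparison, a minimal sketch of the generate/fetch/parse/update loop a recrawl script typically runs; paths, depth, and -topN are placeholders, and it assumes segments on the local filesystem. One common, error-free reason such a loop stops early is that generate finds no URLs due for fetching at that depth:

  CRAWLDB=crawl/crawldb
  SEGMENTS=crawl/segments
  for depth in $(seq 1 10); do
    # generate usually exits non-zero when nothing is due for fetching
    bin/nutch generate $CRAWLDB $SEGMENTS -topN 10000 || break
    SEGMENT=$SEGMENTS/$(ls -t $SEGMENTS | head -1)   # newest segment
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    bin/nutch updatedb $CRAWLDB $SEGMENT
  done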
I get your point... Although I thought a high number of threads would do
exactly the same. Maybe I'm missing something.
During my fetcher runs the used bandwidth gets low pretty quickly, disk
I/O is low, the CPU is low... So it must be waiting for something, but
what?
Could be the DNS cache which is full and
If it is waiting and the box is idle, my first thought is not DNS. I
just put that up as one of the things people will run into. Most likely
it is uneven distribution of URLs or something like that.
Dennis
MilleBii wrote:
I get your point... Although I thought a high number of threads would do
Or is it stuck on a couple of hosts which time out? The logs should have a
trace with the number of active threads, which should give some indication
of what's happening.
Julien
2009/11/25 Dennis Kubes ku...@apache.org
If it is waiting and the box is idle, my first thought is not DNS. I just
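A quick way to get that trace from an existing run; the log path is the default local one, and the exact status line format can differ between Nutch versions:

  # the fetcher periodically reports its thread/queue status; pull the recent samples
  grep "activeThreads" logs/hadoop.log | tail -20
  # lines like "activeThreads=100, spinWaiting=98, fetchQueues.totalSize=..." suggest
  # the threads are stuck waiting on a few slow queues rather than on bandwidth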
The logs show that my fetch queue is full and my 100 threads are mostly
spin-waiting towards the end.
Now in the very last run (150k URLs) I can clearly see 4 phases:
+ very high speed: 3MB/s for a few minutes
+ sudden speed drop to around 1MB/s, flat for several hours
+ another speed drop to
Judging by how this discussion goes, there may be a need for a URL mix
optimizer and for a fast crawler based on that. Is this something worth
pursuing? MilleBii, what do you think?
Mark
On Wed, Nov 25, 2009 at 3:44 PM, MilleBii mille...@gmail.com wrote:
The logs show that my fetch queue is full
I have to say that I'm still puzzled. Here is the latest. I just restarted a
run and then guess what:
I got ultra-high speed: 8 Mbit/s sustained for 1 hour, where I could only get
3 Mbit/s max before (note: bits and not bytes, as I said before).
A few samples show that I was running at 50 fetches/sec
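As a rough sanity check of those numbers (assuming both are sustained averages over the same period):

  # 8 Mbit/s is about 1 MB/s; at 50 fetches/s that is roughly 20 KB per fetched page
  echo "8 * 1000 / 8 / 50" | bc    # -> 20 (KB per page, on average)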
MilleBii wrote:
I have to say that I'm still puzzled. Here is the latest. I just restarted a
run and then guess what:
I got ultra-high speed: 8 Mbit/s sustained for 1 hour, where I could only get
3 Mbit/s max before (note: bits and not bytes, as I said before).
A few samples show that I was running
One interesting thing we were seeing a while back on large crawls, where
we were fetching the best-scoring pages first, then the next best, and so
on, is that lower-scoring pages typically had worse response times
and worse timeout rates.
So while the best scoring pages would respond very
Hi Vishal,
I got the same problem while running updatedb and invertlinks.
Have you found a solution to the problem?
Please let me know if you find the solution.
Thank You,
Srinivas
On Mon, Aug 24, 2009 at 2:00 PM, vishal vachhani vishal...@gmail.com wrote:
Hi All,
I had a big segment (size=
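For reference, the two commands under discussion usually look like this in Nutch 1.x; the crawldb, linkdb, and segment paths are placeholders:

  # merge a fetched segment's status back into the crawldb
  bin/nutch updatedb crawl/crawldb crawl/segments/20090824120000

  # build or update the linkdb from all segments in a directory
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments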
Dennis,
Interesting info, I don't use the standard OPIC scorer but a slightly
modified version which boosts pages with content that I'm looking for... so
it could be that my pages are generally on slow servers.
Now a heads-up: I just started a new run with 450k URLs and it looks like I'm
back to the