Hi Mark,
I just put this up on the wiki. Hope it helps:
http://wiki.apache.org/nutch/OptimizingCrawls
Dennis
Mark Kerzner wrote:
Hi, guys,
my goal is to do my crawls at 100 fetches per second while, of course, observing
polite crawling. But when the URLs are all on different domains, what
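For a target like 100 fetches/s across many different hosts, the usual knobs are the fetcher thread settings; a minimal sketch for conf/nutch-site.xml, with property names as in Nutch 1.x's nutch-default.xml and purely illustrative values:

    <!-- many fetcher threads overall, but at most one concurrent
         request per host, to stay polite -->
    <property>
      <name>fetcher.threads.fetch</name>
      <value>100</value>
    </property>
    <property>
      <name>fetcher.threads.per.host</name>
      <value>1</value>
    </property>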
Dennis, that's awesomely interesting. Thank you,
Mark
Hi Mark,
I've recently contributed 2 patches on JIRA (NUTCH-769 / NUTCH-770) which
will have an impact on crawling speed. They should help with the fetch rate
slowing down over time.
There is also https://issues.apache.org/jira/browse/NUTCH-753 which should
help to a lesser extent.
Julien
--
Why would local DNS caching work... it only helps if you are crawling the
same site often, in which case you are hit by the politeness delay.
If your segments contain only/mainly different sites, it is not really
going to help.
So far I have not seen my quad core + 100mb/s + pseudo
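One thing worth checking either way is the JVM's own resolver cache, since the fetcher resolves hosts through it; a minimal sketch, assuming you edit $JAVA_HOME/jre/lib/security/java.security (these are standard JVM security properties, values illustrative):

    # cache successful lookups for an hour, failed lookups for a minute
    networkaddress.cache.ttl=3600
    networkaddress.cache.negative.ttl=60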
I may be awfully wrong on that, but below is my plan for super-fast
crawling. I have prepared it for a venture that does not need it anymore,
but it looks like fun to do anyway. What would you all say: is there a need,
and what's wrong with the plan?
Thank you,
Mark
Fast Crawl Plan
===============
I just observed that on my setup (Hadoop pseudo-distributed) I hardly get any
overlap between the Map and Reduce phases... that sounds strange to me,
especially when I have plenty of spare CPU.
Would there be a setting to configure this properly?
--
-MilleBii-
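If the reducers are only launching after all maps finish, the setting that controls the overlap is the reduce slow-start threshold; a minimal sketch for conf/mapred-site.xml, assuming the Hadoop 0.19/0.20-era property name:

    <!-- start the shuffle once 5% of the maps have completed -->
    <property>
      <name>mapred.reduce.slowstart.completed.maps</name>
      <value>0.05</value>
    </property>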
hi,
dedup doesn't work for me.
I have read that duplicates have either the same content (via MD5 hash) or
the same URL.
In my case I don't have the same URLs, but I still have the same content for
those URLs.
I'll give you an example: I have three URLs that have the same content
1-
So I do indeed have a local DNS server running; it does not seem to help much.
Just finished a run of 80K URLs: at the beginning the speed can be around 15
fetches/s, and at the end, due to long-tail effects, I get below 1 fetch/s...
Average over a 12h34 run: 1.7 fetches/s, pretty slow really.
I limit URL
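A common mitigation for that long tail is to cap how many URLs a single host can contribute to a segment, so that a few slow hosts cannot dominate the end of the fetch; a minimal sketch for conf/nutch-site.xml, assuming the generate.max.per.host property from nutch-default.xml (value illustrative):

    <!-- at most 100 URLs per host in each generated segment -->
    <property>
      <name>generate.max.per.host</name>
      <value>100</value>
    </property>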
I also don't understand why they have 3 different signatures, since it's
really the same page!
Yes, I checked the signatures and they're not the same!! It's really weird.
The URL www.domaine/folder/index.html?lang=fr is just this one:
www.domaine/folder/index.html
Date: Tue, 24 Nov 2009 22:21:19 +0100
From: a...@getopt.org
To: nutch-user@lucene.apache.org
Subject: Re: dedup dont delete
BELLINI ADAM wrote:
Yes, I checked the signatures and they're not the same!! It's really weird.
The URL www.domaine/folder/index.html?lang=fr is just this one:
www.domaine/folder/index.html
Apparently it isn't a bit-exact replica of the page, so its MD5 hash is
different. You need to use a more lenient signature implementation, such as
TextProfileSignature.
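For reference, switching the signature implementation is a one-property change; a minimal sketch for conf/nutch-site.xml, assuming the db.signature.class property from nutch-default.xml:

    <!-- replace the exact MD5Signature with the fuzzier
         TextProfileSignature so near-identical pages dedup together -->
    <property>
      <name>db.signature.class</name>
      <value>org.apache.nutch.crawl.TextProfileSignature</value>
    </property>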
hi,
my two URLs point to the same page! The page is stored in a database, and the
parameter lang=fr or lang=en just selects the English or the French version
from the database... and by default the page is in French!! So the URL
www.domaine/folder/index.html?lang=fr and
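Another angle for this particular case: if ?lang=fr and the bare URL always serve the same document, the parameter can be stripped before Nutch ever compares the URLs; a minimal sketch of an entry for conf/regex-normalize.xml (read by RegexURLNormalizer), where the pattern itself is an assumption covering just this one parameter:

    <regex>
      <!-- drop a trailing ?lang=fr / ?lang=en so both variants
           normalize to the same URL -->
      <pattern>\?lang=(fr|en)$</pattern>
      <substitution></substitution>
    </regex>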
Try:
1. In order to prevent the crawling of URLs with the pattern *who.int*, you
can add the following in the following files:
a) if you are using the bin/nutch crawl command, then add the following
line inside conf/crawl-urlfilter.txt:
-^http://([a-z0-9]*\.)*who.int/
b) if you are using the individual generate/fetch commands, then add the same
line inside conf/regex-urlfilter.txt
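In context, the rule sits alongside the other filter lines; a minimal sketch of conf/crawl-urlfilter.txt, where the final catch-all is an assumption to be adapted to your existing rules:

    # skip anything under who.int
    -^http://([a-z0-9]*\.)*who.int/
    # accept everything else (assumed catch-all)
    +.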
Hi,
Does TextProfileSignature exclude the HTML header (meta tags etc.) while
creating the signature for a page? I have noticed that minor differences,
like timestamps, cause the same page to look different to Nutch, so multiple
copies of the same page end up added to the index.
Is it also
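The profile can also be made coarser through its tuning knobs, so that small differences such as timestamps are less likely to change the signature; a minimal sketch for conf/nutch-site.xml, assuming the db.signature.text_profile.* properties from nutch-default.xml (values illustrative):

    <!-- ignore very short tokens and quantize term frequencies
         more aggressively -->
    <property>
      <name>db.signature.text_profile.min_token_len</name>
      <value>3</value>
    </property>
    <property>
      <name>db.signature.text_profile.quant_rate</name>
      <value>0.02</value>
    </property>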
I looked at the bandwidth profile of my last two runs and they have
the same shape:
it starts at 5 MBytes/s and decreases to below 500 kBytes/s, with a 1/x
kind of curve shape.
Fairly even distribution of URLs, local DNS is running...
I can't find a good explanation for this behavior... It looks to
Sorry...
The regular expression should be:
-^http://([a-z0-9]*\.)*who.int/
I had made an error in the previous email. Wonder whether gmail is playing
with the characters in the sent emails...
-sroy