Hi - when we did very large scale web crawling (over 1000 pages per second across
many millions of domains) we did not have issues with DNS. We did try local DNS
caching tools, but in our case they did not improve anything and actually made
things worse. We tried unscd; it may help you, or not.
Hi Martin,
We used local DNS caches on the slave nodes when we were running the crawl
for SimilarPages (10+ billion pages in the CrawlDb) and IIRC we were using some
external DNS servers, as the ones on EC2 at the time were not very robust and
they were getting quite angry with us. Can't quite remember
Hi Talat,
On Sat, Dec 7, 2013 at 5:44 PM, user-digest-h...@nutch.apache.org wrote:
Hi Vangelis,
I drew a Nutch software architecture diagram. Maybe it can help you.
https://drive.google.com/file/d/0B2kKrOleEOkRQllaTGdRZGFMY2M/edit?usp=sharing
Talat
Would you be interested in
Hi Lewis,
I agree with you. After last modifications I will add it there.
Talat
On 10-12-2013 14:51, Lewis John Mcgibbney wrote:
Hi Talat,
On Sat, Dec 7, 2013 at 5:44 PM, user-digest-h...@nutch.apache.org wrote:
Hi Vangelis,
I drew a Nutch software architecture diagram. Maybe it
Hi,
While working for a client we came across a use case that seems like it
might not be uncommon. We may have some code to contribute.
The use case is that we have a few seed URLs that we need to fetch at a
relatively high frequency (e.g. every N minutes). These URLs have pointers
to news-type
Already in 1.x:
https://issues.apache.org/jira/browse/NUTCH-1388
Also see:
https://issues.apache.org/jira/browse/NUTCH-1405
You can already inject with fetchInterval but you need a fixedFetchInterval to
be added to the metadata and a FetchScheduler that supports it.
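For reference, the 1.x injector accepts tab-separated key=value metadata after each URL in the seed list, so the injection could look roughly like this (the URL is a placeholder, and the exact metadata key names should be verified against your Nutch version's Injector source):

```text
# seeds.txt - URL, then tab-separated key=value metadata
http://news.example.com/feed	nutch.fetchInterval=600	nutch.fetchInterval.fixed=600
```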
-Original message-
By the way, if you don't use an adaptive scheduler but one that maintains the
configured or injected interval, you can already do it by simply injecting
URLs with low intervals.
-Original message-
From:Markus Jelsma markus.jel...@openindex.io
Sent: Tuesday 10th December 2013 16:04
Thanks Julien, thanks Markus,
it seems my provider is particularly picky; I was querying only ~100/sec.
However, just for the record, I found a solution that works well for my
problem.
pdns-recursor lets you set the TTL of cached records explicitly (I set
it to one week) and I am
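For anyone searching the archives later, the relevant setting looks roughly like this (assuming a pdns-recursor version that supports minimum-ttl-override; check yours):

```ini
; recursor.conf - override short upstream TTLs with a floor of one week
minimum-ttl-override=604800
; keep the cache large enough for a crawl workload
max-cache-entries=1000000
```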
Hi d_k,
Can you please check out this issue
https://issues.apache.org/jira/browse/NUTCH-1253
I uploaded a patch on Feb 7th 2013 which has not been tested but which I
hope will fix this issue. Can you please read up on the Jira issue and test
the patch?
Please also see my comments below
On Tue,
Solved.
So I started to prepare a stripped down routine outside Nutch to file a bug
report, but in the process have solved the problem.
The issue was with the User-Agent string that I had configured. Apparently
the domain in question runs dotDefender, a software firewall that checks,
among
Hi,
On Tue, Dec 10, 2013 at 8:46 PM, user-digest-h...@nutch.apache.org wrote:
So this leaves me with a question. Are there recommendations for a
properly
configured User-Agent string that identifies an instance of a Nutch Crawler
and does not run afoul of a firewall like this? Using the
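For context, the agent identity is configured in nutch-site.xml via the http.agent.* properties; something along these lines is common practice (the values below are placeholders, not recommendations):

```xml
<!-- nutch-site.xml: identify the crawler and give operators a contact -->
<property>
  <name>http.agent.name</name>
  <value>MyNutchCrawler</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://example.com/crawler-info.html</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>crawler-admin@example.com</value>
</property>
```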
Lewis,
That seems like a reasonable compromise. I will run with it.
Thanks
-Original Message-
From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Tuesday, December 10, 2013 2:55 PM
Cc: user@nutch.apache.org
Subject: Re: Unsuccessful fetch/parse of large page with
Tejas,
I have looked at the Hadoop UI; under Tools there is a 'Local logs' link,
under it a 'userlogs' link, and under that a
container_1386136488805_0971_01_03 directory (these look like job-specific
logs). Under each of these directories there are three log files.
Finally I was able to locate the logs in syslog; I see the error is
because of the following exception.
java.lang.NoSuchMethodError:
org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/conn/scheme/SchemeRegistry;
Apparently this happens when an older
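One quick way to spot this kind of mismatch is to scan the classpath directories for the same artifact packaged at two different versions. This is not Nutch or Hadoop tooling, just a rough sketch:

```python
import os
import re
from collections import defaultdict

def find_duplicate_artifacts(classpath_dir):
    """Walk a directory tree and group jar file names by artifact name,
    reporting any artifact that appears in more than one version."""
    # artifact name, then "-", then a version that starts with a digit
    pattern = re.compile(r"^(?P<artifact>[A-Za-z][\w.-]*?)-(?P<version>\d[\w.-]*)\.jar$")
    versions = defaultdict(set)
    for _root, _dirs, files in os.walk(classpath_dir):
        for name in files:
            m = pattern.match(name)
            if m:
                versions[m.group("artifact")].add(m.group("version"))
    return {a: sorted(v) for a, v in versions.items() if len(v) > 1}
```

Running it over the Hadoop lib directory and the unpacked job jar should show whether httpcore/httpclient (or anything else) is present twice at different versions.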
In the logs for one of the jobs I see the following log message; what does
this mean?
nutch hadoop can't find rules for scope 'generate_host_count', using default
On Tue, Dec 10, 2013 at 8:26 PM, S.L simpleliving...@gmail.com wrote:
Finally I was able to locate the logs in syslog; I see
I am consistently running into this exception; googling tells me that
this is because of a jar mismatch between Hadoop and the job, with
Hadoop using an older version.
I am not able to figure out how I can make Hadoop 2.2 pick up the jars
from the submitted job; can someone who has gone through this
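If it helps anyone later: on Hadoop 2.x there is reportedly a property to make the job's jars take precedence over Hadoop's bundled ones (property name from memory; verify against your distribution's mapred-default.xml):

```xml
<!-- mapred-site.xml or per-job configuration -->
<property>
  <name>mapreduce.job.user.classpath.first</name>
  <value>true</value>
</property>
```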