RE: DNS setup and issues

2013-12-10 Thread Markus Jelsma
Hi - when we did very large scale web crawling (over 1000 pages per second across many millions of domains) we did not have issues with DNS. We did try local DNS caching tools, but they did not improve anything and in our case actually made things worse. We tried unscd; it may help you, or not.

Re: DNS setup and issues

2013-12-10 Thread Julien Nioche
Hi Martin, We used local DNS caches on the slave nodes when we were running the crawl for SimilarPages (10+ billion pages in the CrawlDb) and IIRC we were using some external DNS servers, as the ones on EC2 at the time were not very robust + they were getting quite angry with us. Can't quite remember

Re: Manipulating Nutch 2.2.1 scoring system

2013-12-10 Thread Lewis John Mcgibbney
Hi Talat, On Sat, Dec 7, 2013 at 5:44 PM, user-digest-h...@nutch.apache.org wrote: Hi Vangelis, I drew a Nutch Software Architecture diagram. Maybe it can help you. https://drive.google.com/file/d/0B2kKrOleEOkRQllaTGdRZGFMY2M/edit?usp=sharing Talat Would you be interested in

Re: Manipulating Nutch 2.2.1 scoring system

2013-12-10 Thread Talat UYARER
Hi Lewis, I agree with you. After the last modifications I will add it there. Talat On 10-12-2013 14:51, Lewis John Mcgibbney wrote: Hi Talat, On Sat, Dec 7, 2013 at 5:44 PM, user-digest-h...@nutch.apache.org wrote: Hi Vangelis, I drew a Nutch Software Architecture diagram. Maybe it

New feature: Seed URL high fetch frequency

2013-12-10 Thread Otis Gospodnetic
Hi, While working for a client we came across a use case that seems like it might not be uncommon. We may have some code to contribute. The use case is that we have a few seed URLs that we need to fetch at relatively high frequency (e.g. every N minutes). These URLs have pointers to news-type

RE: New feature: Seed URL high fetch frequency

2013-12-10 Thread Markus Jelsma
Already in 1.x: https://issues.apache.org/jira/browse/NUTCH-1388 Also see: https://issues.apache.org/jira/browse/NUTCH-1405 You can already inject with fetchInterval but you need a fixedFetchInterval to be added to the metadata and a FetchScheduler that supports it. -Original message-

RE: New feature: Seed URL high fetch frequency

2013-12-10 Thread Markus Jelsma
By the way, if you don't use an adaptive scheduler but one that maintains the configured or injected interval, you can already do it by simply injecting URLs with low intervals. -Original message- From: Markus Jelsma markus.jel...@openindex.io Sent: Tuesday 10th December 2013 16:04
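For reference, injecting per-URL intervals as Markus describes is done via tab-separated metadata in the seed list. A minimal sketch, assuming the metadata key names documented for the 1.x Injector (nutch.fetchInterval, and nutch.fetchInterval.fixed once the NUTCH-1388 patch is in place; fields must be separated by real tab characters):

```
# seed list: URL <TAB> key=value pairs (lines starting with # are skipped)
http://news.example.com/	nutch.fetchInterval=600	nutch.fetchInterval.fixed=600
```

With a fixed-interval-aware FetchSchedule, the URL above would be re-fetched roughly every 10 minutes regardless of how its content changes.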

Re: DNS setup and issues

2013-12-10 Thread Martin Aesch
Thanks Julien, thanks Markus. It seems my provider is particularly picky; I was querying just ~100/sec. However, just for the record, I found a solution that works well for my problem: pdns-recursor offers to set the TTL of cached records explicitly (I set it to one week) and I am
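For the record, the knob in question is presumably pdns-recursor's minimum-ttl-override setting (name taken from the PowerDNS Recursor documentation; verify against your installed version). A one-week floor would look like:

```
# /etc/powerdns/recursor.conf
# treat every cached record as having a TTL of at least one week (604800 s)
minimum-ttl-override=604800
```

This trades DNS freshness for far fewer upstream queries, which is usually an acceptable trade-off for a crawler.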

Re: NoClassDefFoundError: org/cyberneko/html/parsers/DOMFragmentParser when using HtmlParser

2013-12-10 Thread Lewis John Mcgibbney
Hi d_k, Can you please check out this issue https://issues.apache.org/jira/browse/NUTCH-1253 I uploaded a patch on Feb 7th 2013 which has not been tested but which I hope will fix this issue. Can you please read up on the Jira issue and test the patch? Please also see my comments below. On Tue,
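As background, a NoClassDefFoundError for org/cyberneko/html/parsers/DOMFragmentParser usually means the NekoHTML jar never made it onto the runtime classpath. A hypothetical check for a 1.x source build (coordinates and revision are my assumptions, not from the thread or the NUTCH-1253 patch):

```xml
<!-- ivy/ivy.xml: make sure NekoHTML is resolved into the runtime lib dir -->
<dependency org="net.sourceforge.nekohtml" name="nekohtml" rev="1.9.19" conf="*->default"/>
```

After adding or correcting the dependency, re-run `ant runtime` and confirm a nekohtml jar appears under runtime/local/lib.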

RE: Unsuccessful fetch/parse of large page with many outlinks

2013-12-10 Thread Iain Lopata
Solved. I started to prepare a stripped-down routine outside Nutch to file a bug report, but in the process solved the problem. The issue was with the User-Agent string that I had configured. Apparently the domain in question runs dotDefender, a software firewall that checks, among

Re: Unsuccessful fetch/parse of large page with many outlinks

2013-12-10 Thread Lewis John Mcgibbney
Hi, On Tue, Dec 10, 2013 at 8:46 PM, user-digest-h...@nutch.apache.org wrote: So this leaves me with a question. Are there recommendations for a properly configured User-Agent string that identifies an instance of a Nutch Crawler and does not run afoul of a firewall like this? Using the
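For context, Nutch assembles its User-Agent header from properties in conf/nutch-site.xml. A minimal sketch with placeholder values (a plain product token plus a reachable contact URL/email generally avoids tripping firewalls that reject empty or malformed agent strings):

```xml
<property>
  <name>http.agent.name</name>
  <value>MyNutchCrawler</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://example.com/crawler-info</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>crawler-admin@example.com</value>
</property>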

RE: Unsuccessful fetch/parse of large page with many outlinks

2013-12-10 Thread Iain Lopata
Lewis, That seems like a reasonable compromise. I will run with it. Thanks -Original Message- From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent: Tuesday, December 10, 2013 2:55 PM Cc: user@nutch.apache.org Subject: Re: Unsuccessful fetch/parse of large page with

Re: Nutch with YARN (aka Hadoop 2.0)

2013-12-10 Thread S.L
Tejas, I have looked at the Hadoop UI; under Tools there is a 'Local logs' link, under it a 'user logs' link, and under that a container_1386136488805_0971_01_03 directory (these look like job-specific logs). Under each of these directories there are three log files.

Re: Nutch with YARN (aka Hadoop 2.0)

2013-12-10 Thread S.L
Finally I was able to locate the logs in syslog; I see the error is because of the following exception: java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/conn/scheme/SchemeRegistry; Apparently this happens when an older

Re: Nutch with YARN (aka Hadoop 2.0)

2013-12-10 Thread S.L
In the logs for one of the jobs I see the following log message; what does it mean? nutch hadoop can't find rules for scope 'generate_host_count', using default On Tue, Dec 10, 2013 at 8:26 PM, S.L simpleliving...@gmail.com wrote: Finally I was able to locate the logs in syslog; I see

Re: Nutch with YARN (aka Hadoop 2.0)

2013-12-10 Thread S.L
I am consistently running into this exception. Googling tells me it is because of a jar mismatch between Hadoop and the job, with Hadoop using the older version. I am not able to work out how to make Hadoop 2.2 pick up the jars from the submitted job; can someone who has gone through this
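When a NoSuchMethodError points at a jar version clash like the one above, a small diagnostic class can show which jar actually supplies a class at runtime. This is a generic sketch (class and file names are mine, not from the thread); run it on the cluster with the failing class name, e.g. org.apache.http.impl.conn.SchemeRegistryFactory:

```java
public class WhichJar {
    public static void main(String[] args) throws Exception {
        // Pass a fully-qualified class name on the command line
        Class<?> c = Class.forName(args[0]);
        Object src = c.getProtectionDomain().getCodeSource();
        // Classes from the bootstrap classpath report a null code source;
        // everything else prints the jar or directory it was loaded from
        System.out.println(c.getName() + " loaded from "
                + (src == null ? "bootstrap classpath" : src));
    }
}
```

If the printed location is a Hadoop lib directory rather than your job jar, that confirms Hadoop's older httpclient is shadowing the one bundled with Nutch.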