RE: Nutch not fetching all urls from urlsdir

McGibbney, Lewis John Mon, 20 Dec 2010 06:54:38 -0800

My situation is exactly the same. To add to the confusion, I have the following URLs:


http://www.south-ayrshire.gov.uk <always fetched
http://www.sabsm.co.uk/ <always fetched
http://www.scotland.gov.uk/Home <never fetched

So putting this down to regex-urlfilter or crawl-urlfilter not allowing the 
.gov suffix or something of this nature is not the solution. I checked the 
robots.txt file for the URL above which is never fetched, and it reads as follows:

# /robots.txt file

User-agent: *
Disallow: /_private
Disallow: /_test
Disallow: /_dsforums
Disallow: /_gsi
Disallow: /_includes
Disallow: /_temp
Disallow: /what
Disallow: /_interviews
Disallow: /deleted

This should clearly permit Nutch to crawl, so I am stuck with this one. I have 
contacted the website admin at scotland.gov and they mention that the website is 
regularly crawled... Although this doesn't change the situation much, it does 
help to localise the problem to somewhere within the Nutch config, maybe?
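
One way to double-check the reading of the robots.txt above is to feed the same rules to a robots parser outside Nutch. The following is a minimal sketch using Python's standard urllib.robotparser, not Nutch's own robots handling, so it only confirms what the file says in principle. The agent name is taken from the http.agent line in the log further down; the /_private/page path is a made-up example of a disallowed URL.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules exactly as quoted in the message above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /_private
Disallow: /_test
Disallow: /_dsforums
Disallow: /_gsi
Disallow: /_includes
Disallow: /_temp
Disallow: /what
Disallow: /_interviews
Disallow: /deleted
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    agent = "WOMBRA/Nutch-1.2"  # from the http.agent log line
    for url in (
        "http://www.scotland.gov.uk",
        "http://www.scotland.gov.uk/Home",
        "http://www.scotland.gov.uk/_private/page",  # hypothetical path
    ):
        print(url, "allowed" if is_allowed(ROBOTS_TXT, agent, url) else "blocked")
```

If the first two URLs come back as allowed, robots.txt can reasonably be ruled out and attention can move to the URL filters and the rest of the Nutch config.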


-----Original Message-----
From: Chris Woolum [mailto:[email protected]]
Sent: 20 December 2010 00:22
To: [email protected]
Subject: RE: Nutch not fetching all urls from urlsdir

Looking in hadoop.log, I only see mention of the 3 urls which is what confuses 
me the most. My url regex should allow them through.
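
To test the assumption that the URL regex lets the seeds through, the filter rules can be replayed against the seed list outside Nutch. This is a rough sketch only: it assumes the usual regex-urlfilter.txt layout (one rule per line, a leading + or - followed by a regex, the first matching rule decides, and a URL matching no rule is rejected). The rules below are illustrative and are not anyone's actual config, the example.com URLs are made up, and Python's re module is not identical to Java's regex engine, so unusual patterns may behave differently.

```python
import re

# Illustrative rules only; substitute the contents of your own filter file.
RULES = """\
# skip file:, ftp: and mailto: urls
-^(file|ftp|mailto):
# skip URLs containing characters that usually indicate queries
-[?*!@=]
# accept anything else
+.
"""


def load_rules(text):
    """Parse '+regex' / '-regex' lines into (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


if __name__ == "__main__":
    rules = load_rules(RULES)
    for url in (
        "http://www.scotland.gov.uk/Home",
        "http://example.com/page?id=1",  # hypothetical URL with a query string
    ):
        print(url, "accepted" if accepts(rules, url) else "rejected")
```

Running each seed through accepts() shows quickly whether a rule higher up in the file rejects a URL before the final catch-all is reached.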

-----Original Message-----
From: McGibbney, Lewis John [mailto:[email protected]]
Sent: Sat 12/18/2010 3:42 AM
To: [email protected]
Subject: RE: Nutch not fetching all urls from urlsdir

Hi Chris,

I have experienced similar problems with this in the past. For example, I was 
trying to crawl the following URL (amongst others):
http://www.scotland.gov.uk
For a reason still unknown to me, I was unable to do so. Having experimented a 
bit, I found that appending /Home to the URL at least gave me some results. 
Having looked at my hadoop.log, I saw the following:
2010-12-06 12:33:33,031 INFO  fetcher.Fetcher - fetching http://www.sabsm.co.uk/
2010-12-06 12:33:33,031 INFO  fetcher.Fetcher - fetching 
http://www.south-ayrshire.gov.uk/
2010-12-06 12:33:33,031 INFO  fetcher.Fetcher - fetching 
http://www.scotland.gov.uk/Home/
2010-12-06 12:33:33,047 INFO  fetcher.Fetcher - -finishing thread 
FetcherThread, activeThreads=9
2010-12-06 12:33:33,047 INFO  fetcher.Fetcher - -finishing thread 
FetcherThread, activeThreads=6
2010-12-06 12:33:33,047 INFO  fetcher.Fetcher - -finishing thread 
FetcherThread, activeThreads=4
2010-12-06 12:33:33,047 INFO  fetcher.Fetcher - -finishing thread 
FetcherThread, activeThreads=7
2010-12-06 12:33:33,047 INFO  fetcher.Fetcher - -finishing thread 
FetcherThread, activeThreads=8
2010-12-06 12:33:33,047 INFO  fetcher.Fetcher - -finishing thread 
FetcherThread, activeThreads=3
2010-12-06 12:33:33,047 INFO  fetcher.Fetcher - -finishing thread 
FetcherThread, activeThreads=5
2010-12-06 12:33:33,125 INFO  http.Http - http.proxy.host = null
2010-12-06 12:33:33,125 INFO  http.Http - http.proxy.port = 8080
2010-12-06 12:33:33,125 INFO  http.Http - http.timeout = 10000
2010-12-06 12:33:33,125 INFO  http.Http - http.content.limit = -1
2010-12-06 12:33:33,125 INFO  http.Http - http.agent = WOMBRA/Nutch-1.2
2010-12-06 12:33:33,125 INFO  http.Http - http.accept.language = 
en-us,en-gb,en;q=0.7,*;q=0.3
2010-12-06 12:33:33,125 INFO  http.Http - protocol.plugin.check.blocking = false
2010-12-06 12:33:33,125 INFO  http.Http - protocol.plugin.check.robots = false
.....
2010-12-06 12:33:34,248 INFO  fetcher.Fetcher - -activeThreads=3, 
spinWaiting=0, fetchQueues.totalSize=0
2010-12-06 12:33:34,263 INFO  fetcher.Fetcher - -finishing thread 
FetcherThread, activeThreads=2
2010-12-06 12:33:34,279 INFO  fetcher.Fetcher - -finishing thread 
FetcherThread, activeThreads=1
2010-12-06 12:33:34,295 WARN  regex.RegexURLNormalizer - can't find rules for 
scope 'fetcher', using default
2010-12-06 12:33:34,310 INFO  fetcher.Fetcher - -finishing thread 
FetcherThread, activeThreads=0
2010-12-06 12:33:35,262 INFO  fetcher.Fetcher - -activeThreads=0, 
spinWaiting=0, fetchQueues.totalSize=0
2010-12-06 12:33:35,262 INFO  fetcher.Fetcher - -activeThreads=0
2010-12-06 12:33:35,933 WARN  util.NativeCodeLoader - Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
2010-12-06 12:33:36,338 INFO  fetcher.Fetcher - Fetcher: finished at 2010-12-06 
12:33:36, elapsed: 00:00:12
2010-12-06 12:34:48,081 INFO  parse.ParseSegment - ParseSegment: starting at 
2010-12-06 12:34:48
2010-12-06 12:34:48,081 INFO  parse.ParseSegment - ParseSegment: segment: 
crawl/segments/20101206123213
2010-12-06 12:34:51,404 WARN  util.NativeCodeLoader - Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable

From this I was puzzled as to what was stopping the thread from fetching the URL.
Does your hadoop.log look anything similar, or does it simply fail to mention 
the injected URLs?
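
To answer that kind of question systematically, the seed list can be compared against the "fetching" entries in hadoop.log. Below is a minimal sketch, assuming each log entry sits on a single line in the real file (the excerpt above is wrapped by the mail client) and that a trailing slash is the only difference between a seed and its logged form. The function names are made up for illustration. Note that a "fetching" line only shows that a fetch was attempted, not that it succeeded.

```python
import re

# Matches fetcher entries of the form shown in the log excerpt above.
FETCH_LINE = re.compile(r"fetcher\.Fetcher - fetching (\S+)")


def fetched_urls(log_lines):
    """Collect every URL that appears in a 'fetching' log entry."""
    urls = set()
    for line in log_lines:
        match = FETCH_LINE.search(line)
        if match:
            urls.add(match.group(1))
    return urls


def normalise(url):
    """Ignore trailing slashes when comparing seeds with logged URLs."""
    return url.rstrip("/")


def missing_seeds(seeds, log_lines):
    """Return the seeds, in order, that never show up as 'fetching'."""
    fetched = {normalise(u) for u in fetched_urls(log_lines)}
    return [s for s in seeds if normalise(s) not in fetched]


if __name__ == "__main__":
    # Log entries from the excerpt above, joined back onto single lines.
    log = [
        "2010-12-06 12:33:33,031 INFO  fetcher.Fetcher - fetching http://www.sabsm.co.uk/",
        "2010-12-06 12:33:33,031 INFO  fetcher.Fetcher - fetching http://www.south-ayrshire.gov.uk/",
        "2010-12-06 12:33:33,031 INFO  fetcher.Fetcher - fetching http://www.scotland.gov.uk/Home/",
    ]
    seeds = [
        "http://www.south-ayrshire.gov.uk",
        "http://www.sabsm.co.uk/",
        "http://www.scotland.gov.uk",
        "http://www.scotland.gov.uk/Home",
    ]
    print(missing_seeds(seeds, log))
```

For a real run, the log list would come from reading hadoop.log line by line and the seeds from the files in the urls directory. Any seed reported as missing was dropped before the fetcher ever saw it, which points at injection, filtering or generation rather than at the fetch itself.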

Lewis


________________________________________
From: Chris Woolum [[email protected]]
Sent: 18 December 2010 04:57
To: [email protected]
Subject: Nutch not fetching all urls from urlsdir

Hello everyone,

I have a list of urls that I am testing with.  There are currently 10
urls that I am injecting. The problem though is that when I look through
the log, I only see 3 or 4 of them being fetched. What would cause this?
There are no errors that I can find. My understanding is that on the
first pass of the fetch, all urls that were injected should be fetched.
Am I correct in thinking this?

Thanks,
Chris


Email has been scanned for viruses by Altman Technologies' email management 
service - www.altman.co.uk/emailsystems

Glasgow Caledonian University is a registered Scottish charity, number SC021474

Winner: Times Higher Education's Widening Participation Initiative of the Year 
2009 and Herald Society's Education Initiative of the Year 2009
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html

