My situation is exactly the same. To add to the confusion, I have the following URLs:

http://www.south-ayrshire.gov.uk   <always fetched
http://www.sabsm.co.uk/            <always fetched
http://www.scotland.gov.uk/Home    <never fetched

So putting this down to regex-urlfilter or crawl-urlfilter not allowing the .gov suffix, or something of that nature, is not the solution, since another .gov.uk URL is always fetched.

I checked the robots.txt file for the URL above that is never fetched, and it reads as follows:

# /robots.txt file
User-agent: *
Disallow: /_private
Disallow: /_test
Disallow: /_dsforums
Disallow: /_gsi
Disallow: /_includes
Disallow: /_temp
Disallow: /what
Disallow: /_interviews
Disallow: /deleted

This should clearly permit Nutch to crawl, so I am stuck with this one. I have contacted the website admin at scotland.gov and they mention that the website is regularly crawled... although this doesn't help the situation much, it does help to localise the problem to somewhere within the Nutch config, maybe?

-----Original Message-----
From: Chris Woolum [mailto:[email protected]]
Sent: 20 December 2010 00:22
To: [email protected]
Subject: RE: Nutch not fetching all urls from urlsdir

Looking in hadoop.log, I only see mention of the 3 urls, which is what confuses me the most. My url regex should allow them through.
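As a sanity check on the robots.txt quoted earlier in this message, the rules can be evaluated offline. The following is only a minimal sketch in Python using the standard library's urllib.robotparser, not Nutch's own robots handling; the helper name is_allowed is hypothetical, and the agent string is taken from the http.agent value in the log further down the thread.

```python
from urllib.robotparser import RobotFileParser

# robots.txt content as quoted in the thread for www.scotland.gov.uk
ROBOTS_TXT = """\
# /robots.txt file
User-agent: *
Disallow: /_private
Disallow: /_test
Disallow: /_dsforums
Disallow: /_gsi
Disallow: /_includes
Disallow: /_temp
Disallow: /what
Disallow: /_interviews
Disallow: /deleted
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: evaluate a URL against robots.txt text.

    This uses Python's parser, which may differ in detail from the
    parser Nutch applies during a crawl.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    agent = "WOMBRA/Nutch-1.2"  # http.agent value reported in hadoop.log
    for url in (
        "http://www.scotland.gov.uk",
        "http://www.scotland.gov.uk/Home",
        "http://www.scotland.gov.uk/_private/page",
    ):
        print(url, is_allowed(ROBOTS_TXT, agent, url))
```

If this check agrees that /Home is permitted, robots.txt is an unlikely cause and the URL filters or fetcher settings are the next place to look.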
-----Original Message-----
From: McGibbney, Lewis John [mailto:[email protected]]
Sent: Sat 12/18/2010 3:42 AM
To: [email protected]
Subject: RE: Nutch not fetching all urls from urlsdir

Hi Chris,

I have experienced similar problems with this in the past. For example, I was trying to crawl the following URL (amongst others):

http://www.scotland.gov.uk

For a reason still unknown to me, I was unable to do so. Having experimented a bit, I found that appending /Home to the URL at least gave me some results. Looking at my hadoop.log, I saw the following:

2010-12-06 12:33:33,031 INFO fetcher.Fetcher - fetching http://www.sabsm.co.uk/
2010-12-06 12:33:33,031 INFO fetcher.Fetcher - fetching http://www.south-ayrshire.gov.uk/
2010-12-06 12:33:33,031 INFO fetcher.Fetcher - fetching http://www.scotland.gov.uk/Home/
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=9
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=6
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=4
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=7
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=8
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=3
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=5
2010-12-06 12:33:33,125 INFO http.Http - http.proxy.host = null
2010-12-06 12:33:33,125 INFO http.Http - http.proxy.port = 8080
2010-12-06 12:33:33,125 INFO http.Http - http.timeout = 10000
2010-12-06 12:33:33,125 INFO http.Http - http.content.limit = -1
2010-12-06 12:33:33,125 INFO http.Http - http.agent = WOMBRA/Nutch-1.2
2010-12-06 12:33:33,125 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2010-12-06 12:33:33,125 INFO http.Http - protocol.plugin.check.blocking = false
2010-12-06 12:33:33,125 INFO http.Http - protocol.plugin.check.robots = false
.....
2010-12-06 12:33:34,248 INFO fetcher.Fetcher - -activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
2010-12-06 12:33:34,263 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
2010-12-06 12:33:34,279 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2010-12-06 12:33:34,295 WARN regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
2010-12-06 12:33:34,310 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2010-12-06 12:33:35,262 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2010-12-06 12:33:35,262 INFO fetcher.Fetcher - -activeThreads=0
2010-12-06 12:33:35,933 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2010-12-06 12:33:36,338 INFO fetcher.Fetcher - Fetcher: finished at 2010-12-06 12:33:36, elapsed: 00:00:12
2010-12-06 12:34:48,081 INFO parse.ParseSegment - ParseSegment: starting at 2010-12-06 12:34:48
2010-12-06 12:34:48,081 INFO parse.ParseSegment - ParseSegment: segment: crawl/segments/20101206123213
2010-12-06 12:34:51,404 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

From this I was puzzled as to what was stopping the thread from fetching the URL. Does your hadoop.log look anything similar, or does it simply fail to mention the injected URLs?

Lewis

________________________________________
From: Chris Woolum [[email protected]]
Sent: 18 December 2010 04:57
To: [email protected]
Subject: Nutch not fetching all urls from urlsdir

Hello everyone,

I have a list of urls that I am testing with. There are currently 10 urls that I am injecting. The problem, though, is that when I look through the log, I only see 3 or 4 of them being fetched. What would cause this? There are no errors that I can find.
My understanding is that on the first pass of the fetch, all urls that were injected should be fetched. Am I correct in thinking this?

Thanks,
Chris

Email has been scanned for viruses by Altman Technologies' email management service - www.altman.co.uk/emailsystems

Glasgow Caledonian University is a registered Scottish charity, number SC021474

Winner: Times Higher Education's Widening Participation Initiative of the Year 2009 and Herald Society's Education Initiative of the Year 2009
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
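To narrow down which injected URLs never reach the fetcher, the seed list can be compared against the "fetching" lines in hadoop.log. The following is a rough sketch in Python, assuming the log format shown in this thread; the helper names (fetched_urls, missing_seeds) are hypothetical, and the trailing-slash normalisation is a simplification rather than what Nutch's URL normalizers do.

```python
import re

# Matches lines such as:
# 2010-12-06 12:33:33,031 INFO fetcher.Fetcher - fetching http://www.sabsm.co.uk/
FETCH_LINE = re.compile(r"fetcher\.Fetcher - fetching (\S+)")


def normalize(url: str) -> str:
    """Simplified comparison key: ignore a trailing slash only."""
    return url.rstrip("/")


def fetched_urls(log_text: str) -> set:
    """Collect every URL the fetcher reported it was fetching."""
    return {normalize(m.group(1)) for m in FETCH_LINE.finditer(log_text)}


def missing_seeds(seeds: list, log_text: str) -> list:
    """Return the seed URLs that have no 'fetching' line in the log."""
    fetched = fetched_urls(log_text)
    return [s for s in seeds if normalize(s) not in fetched]


# Excerpt of the log lines quoted in the thread
SAMPLE_LOG = """\
2010-12-06 12:33:33,031 INFO fetcher.Fetcher - fetching http://www.sabsm.co.uk/
2010-12-06 12:33:33,031 INFO fetcher.Fetcher - fetching http://www.south-ayrshire.gov.uk/
2010-12-06 12:33:33,031 INFO fetcher.Fetcher - fetching http://www.scotland.gov.uk/Home/
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=9
"""

if __name__ == "__main__":
    seeds = [
        "http://www.south-ayrshire.gov.uk",
        "http://www.sabsm.co.uk/",
        "http://www.scotland.gov.uk/Home",
        "http://www.scotland.gov.uk",
    ]
    for url in missing_seeds(seeds, SAMPLE_LOG):
        print("never attempted:", url)
```

A seed that shows up as missing here was dropped before fetching (for instance at inject or generate time), which points towards the URL filters or normalizers rather than the fetcher itself.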

