Looking in hadoop.log, I only see mention of 3 of the urls, which is what confuses me the most. My url regex should allow all of them through.
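For reference, Nutch's regex-urlfilter.txt is evaluated top to bottom and the first matching rule wins (a url that matches no rule at all is rejected), so a '-' rule above the final accept pattern can silently drop a url. A minimal filter that lets ordinary http urls through might look like the sketch below; the exact rules in your conf/regex-urlfilter.txt may of course differ:

    # skip file:, ftp: and mailto: urls
    -^(file|ftp|mailto):
    # skip urls containing characters commonly used in queries/session ids
    -[?*!@=]
    # accept anything else
    +.

Note too that in Nutch 1.2 the one-shot bin/nutch crawl command reads conf/crawl-urlfilter.txt rather than conf/regex-urlfilter.txt, so it is worth checking that the rules were added to the file the command you actually run uses.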
-----Original Message-----
From: McGibbney, Lewis John [mailto:[email protected]]
Sent: Sat 12/18/2010 3:42 AM
To: [email protected]
Subject: RE: Nutch not fetching all urls from urlsdir

Hi Chris,

I have experienced similar problems with this in the past. For example, I was trying to crawl the following URL (amongst others): http://www.scotland.gov.uk. For a reason still unknown to me I was unable to do so. Having experimented a bit, I found that appending /Home to the URL at least gave me some results. Looking at my hadoop.log I saw the following:

2010-12-06 12:33:33,031 INFO fetcher.Fetcher - fetching http://www.sabsm.co.uk/
2010-12-06 12:33:33,031 INFO fetcher.Fetcher - fetching http://www.south-ayrshire.gov.uk/
2010-12-06 12:33:33,031 INFO fetcher.Fetcher - fetching http://www.scotland.gov.uk/Home/
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=9
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=6
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=4
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=7
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=8
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=3
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=5
2010-12-06 12:33:33,125 INFO http.Http - http.proxy.host = null
2010-12-06 12:33:33,125 INFO http.Http - http.proxy.port = 8080
2010-12-06 12:33:33,125 INFO http.Http - http.timeout = 10000
2010-12-06 12:33:33,125 INFO http.Http - http.content.limit = -1
2010-12-06 12:33:33,125 INFO http.Http - http.agent = WOMBRA/Nutch-1.2
2010-12-06 12:33:33,125 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2010-12-06 12:33:33,125 INFO http.Http - protocol.plugin.check.blocking = false
2010-12-06 12:33:33,125 INFO http.Http - protocol.plugin.check.robots = false
.....
2010-12-06 12:33:34,248 INFO fetcher.Fetcher - -activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
2010-12-06 12:33:34,263 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
2010-12-06 12:33:34,279 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2010-12-06 12:33:34,295 WARN regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
2010-12-06 12:33:34,310 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2010-12-06 12:33:35,262 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2010-12-06 12:33:35,262 INFO fetcher.Fetcher - -activeThreads=0
2010-12-06 12:33:35,933 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2010-12-06 12:33:36,338 INFO fetcher.Fetcher - Fetcher: finished at 2010-12-06 12:33:36, elapsed: 00:00:12
2010-12-06 12:34:48,081 INFO parse.ParseSegment - ParseSegment: starting at 2010-12-06 12:34:48
2010-12-06 12:34:48,081 INFO parse.ParseSegment - ParseSegment: segment: crawl/segments/20101206123213
2010-12-06 12:34:51,404 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

From this I was puzzled as to what was stopping the thread from fetching the URL. Does your hadoop.log look anything similar, or does it simply fail to mention the injected URLs?

Lewis
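One way to check whether the injected urls made it into the database at all, independent of what the fetcher logs, is to read the crawldb directly. A sketch, using the crawl/crawldb path implied by the segment path in the log above:

    # status breakdown (db_unfetched, db_fetched, etc.) and total url count
    bin/nutch readdb crawl/crawldb -stats

    # dump every url with its status and metadata to a plain-text directory
    bin/nutch readdb crawl/crawldb -dump crawldb-dump

If a url is missing from the dump, it was filtered or normalized away at inject time; if it is present but stays db_unfetched, the generate/fetch side is where to look.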
________________________________________
From: Chris Woolum [[email protected]]
Sent: 18 December 2010 04:57
To: [email protected]
Subject: Nutch not fetching all urls from urlsdir

Hello everyone,

I have a list of urls that I am testing with. There are currently 10 urls that I am injecting. The problem, though, is that when I look through the log, I only see 3 or 4 of them being fetched. What would cause this? There are no errors that I can find. My understanding is that on the first pass of the fetch, all urls that were injected should be fetched. Am I correct in thinking this?

Thanks,
Chris
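On the question of whether every injected url is fetched on the first pass: in the step-by-step workflow it is the generate step that decides which crawldb entries go into a segment, so a -topN value (or a per-host generate limit, if one is configured) smaller than the injected list will leave some urls unfetched even on the first cycle. A rough sketch of one cycle, with the crawl/ paths taken from the log above and the urls directory name illustrative:

    bin/nutch inject crawl/crawldb urls
    bin/nutch generate crawl/crawldb crawl/segments -topN 10
    SEGMENT=$(ls -d crawl/segments/* | tail -1)
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    bin/nutch updatedb crawl/crawldb $SEGMENT

If no -topN or per-host limit is in effect, all eligible crawldb entries should end up in the generated segment.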

