Looking in hadoop.log, I only see mention of 3 of the urls, which is what confuses me the most. My url regex should allow all of them through.
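For reference, Nutch's regex-urlfilter.txt is evaluated top to bottom and the first matching rule wins (a url that matches no rule at all is rejected), so a '-' rule above the final accept pattern can silently drop a url. A minimal filter that lets ordinary http urls through might look like the sketch below; the exact rules in your conf/regex-urlfilter.txt may of course differ:

    # skip file:, ftp: and mailto: urls
    -^(file|ftp|mailto):
    # skip urls containing characters commonly used in queries/session ids
    -[?*!@=]
    # accept anything else
    +.

Note too that in Nutch 1.2 the one-shot bin/nutch crawl command reads conf/crawl-urlfilter.txt rather than conf/regex-urlfilter.txt, so it is worth checking that the rules were added to the file the command you actually run uses.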
-----Original Message-----
From: McGibbney, Lewis John [mailto:[email protected]]
Sent: Sat 12/18/2010 3:42 AM
To: [email protected]
Subject: RE: Nutch not fetching all urls from urlsdir

Hi Chris,

I have experienced similar problems with this in the past. For example, I was trying to crawl the following URL (amongst others): http://www.scotland.gov.uk. For a reason still unknown to me I was unable to do so. Having experimented a bit, I found that appending /Home to the URL at least gave me some results. Looking at my hadoop.log I saw the following:

2010-12-06 12:33:33,031 INFO fetcher.Fetcher - fetching http://www.sabsm.co.uk/
2010-12-06 12:33:33,031 INFO fetcher.Fetcher - fetching http://www.south-ayrshire.gov.uk/
2010-12-06 12:33:33,031 INFO fetcher.Fetcher - fetching http://www.scotland.gov.uk/Home/
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=9
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=6
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=4
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=7
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=8
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=3
2010-12-06 12:33:33,047 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=5
2010-12-06 12:33:33,125 INFO http.Http - http.proxy.host = null
2010-12-06 12:33:33,125 INFO http.Http - http.proxy.port = 8080
2010-12-06 12:33:33,125 INFO http.Http - http.timeout = 10000
2010-12-06 12:33:33,125 INFO http.Http - http.content.limit = -1
2010-12-06 12:33:33,125 INFO http.Http - http.agent = WOMBRA/Nutch-1.2
2010-12-06 12:33:33,125 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2010-12-06 12:33:33,125 INFO http.Http - protocol.plugin.check.blocking = false
2010-12-06 12:33:33,125 INFO http.Http - protocol.plugin.check.robots = false
.....
2010-12-06 12:33:34,248 INFO fetcher.Fetcher - -activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
2010-12-06 12:33:34,263 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
2010-12-06 12:33:34,279 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2010-12-06 12:33:34,295 WARN regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
2010-12-06 12:33:34,310 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2010-12-06 12:33:35,262 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2010-12-06 12:33:35,262 INFO fetcher.Fetcher - -activeThreads=0
2010-12-06 12:33:35,933 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2010-12-06 12:33:36,338 INFO fetcher.Fetcher - Fetcher: finished at 2010-12-06 12:33:36, elapsed: 00:00:12
2010-12-06 12:34:48,081 INFO parse.ParseSegment - ParseSegment: starting at 2010-12-06 12:34:48
2010-12-06 12:34:48,081 INFO parse.ParseSegment - ParseSegment: segment: crawl/segments/20101206123213
2010-12-06 12:34:51,404 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

From this I was puzzled as to what was stopping the thread from fetching the URL. Does your hadoop.log look anything similar, or does it simply fail to mention the injected URLs?

Lewis
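One way to check whether the injected urls made it into the database at all, independent of what the fetcher logs, is to read the crawldb directly. A sketch, using the crawl/crawldb path implied by the segment path in the log above:

    # status breakdown (db_unfetched, db_fetched, etc.) and total url count
    bin/nutch readdb crawl/crawldb -stats

    # dump every url with its status and metadata to a plain-text directory
    bin/nutch readdb crawl/crawldb -dump crawldb-dump

If a url is missing from the dump, it was filtered or normalized away at inject time; if it is present but stays db_unfetched, the generate/fetch side is where to look.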
________________________________________
From: Chris Woolum [[email protected]]
Sent: 18 December 2010 04:57
To: [email protected]
Subject: Nutch not fetching all urls from urlsdir

Hello everyone,

I have a list of urls that I am testing with. There are currently 10 urls that I am injecting. The problem, though, is that when I look through the log, I only see 3 or 4 of them being fetched. What would cause this? There are no errors that I can find. My understanding is that on the first pass of the fetch, all urls that were injected should be fetched. Am I correct in thinking this?

Thanks,
Chris
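On the question of whether every injected url is fetched on the first pass: in the step-by-step workflow it is the generate step that decides which crawldb entries go into a segment, so a -topN value (or a per-host generate limit, if one is configured) smaller than the injected list will leave some urls unfetched even on the first cycle. A rough sketch of one cycle, with the crawl/ paths taken from the log above and the urls directory name illustrative:

    bin/nutch inject crawl/crawldb urls
    bin/nutch generate crawl/crawldb crawl/segments -topN 10
    SEGMENT=$(ls -d crawl/segments/* | tail -1)
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    bin/nutch updatedb crawl/crawldb $SEGMENT

If no -topN or per-host limit is in effect, all eligible crawldb entries should end up in the generated segment.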

