Hi list,

I have trawled the mail archives for something that could help me with this one, and although there are some interesting past use cases, I have not seen any queries or answers that help me.

I am using Nutch 1.2 to crawl the following website: http://www.scotland.gov.uk

It has an automatic redirect to www.scotland.gov.uk/Home, so I thought that experimenting with http.redirect.max and http.verbose in nutch-site.xml would shed some light. However, this then flagged up the following in yesterday's hadoop.log:

2011-03-02 15:58:19,165 INFO fetcher.Fetcher - fetching http://www.scotland.gov.uk/Home
2011-03-02 15:58:19,166 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=8
2011-03-02 15:58:19,166 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=8
2011-03-02 15:58:19,166 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=6
2011-03-02 15:58:19,166 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=5
2011-03-02 15:58:19,166 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=4
2011-03-02 15:58:19,166 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
2011-03-02 15:58:19,166 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=3
2011-03-02 15:58:19,170 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
2011-03-02 15:58:19,170 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-03-02 15:58:19,211 INFO http.Http - http.proxy.host = null
2011-03-02 15:58:19,211 INFO http.Http - http.proxy.port = 8080
2011...
blahblahblah
2011-03-02 15:58:26,220 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2011-03-02 15:58:26,241 INFO fetcher.Fetcher - fetching https://citfil1.enterprise.gcal.ac.uk:8081/AuthenticationServer/AuthenticationForm.jsp?URL=http:/www.scotland.gov.uk/Home&IP=10.15.5.246
2011-03-02 15:58:26,241 INFO fetcher.Fetcher - fetch of https://citfil1.enterprise.gcal.ac.uk:8081/AuthenticationServer/AuthenticationForm.jsp?URL=http:/www.scotland.gov.uk/Home&IP=10.15.5.246 failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
2011-03-02 15:58:26,241 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=9
2011-03-02 15:58:26,242 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=8
2011-03-02 15:58:26,242 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=7
2011-03-02 15:58:26,242 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=6
2011-03-02 15:58:26,242 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=5
2011-03-02 15:58:26,242 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=4
2011-03-02 15:58:26,246 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2011-03-02 15:58:26,246 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-03-02 15:58:26,246 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
2011-03-02 15:58:26,246 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=3
2011-03-02 15:58:27,241 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2011-03-02 15:58:27,242 INFO fetcher.Fetcher - -activeThreads=0

It was at this stage that I realised some sort of authentication scheme was in place. However, I am still puzzled as to what type it is and how I can work around it.
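
In case it helps anyone reproduce this, the failed fetches can be picked out of a long hadoop.log with a few lines of Python. This is only a minimal sketch, assuming the "fetch of <url> failed with: <reason>" line format shown in the log above; the function name failed_fetches and the regular expression are my own illustration and are not part of Nutch.

```python
import re
from typing import Iterable, List, Tuple

# Pattern is an assumption based on the "fetch of <url> failed with: <reason>"
# lines in the log excerpt; it is not taken from the Nutch source.
FAILED_FETCH = re.compile(r"fetch of (?P<url>\S+) failed with: (?P<reason>.*)$")


def failed_fetches(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (url, reason) pairs for every failed-fetch line in the input."""
    results = []
    for line in lines:
        # Strip the trailing newline so the reason does not carry it along.
        match = FAILED_FETCH.search(line.rstrip("\n"))
        if match:
            results.append((match.group("url"), match.group("reason").strip()))
    return results
```

Passing an open file handle for hadoop.log (wherever it lives in your install) to failed_fetches should return one (url, reason) pair per failed fetch, which makes it easier to see that the only failure here is the https redirect to the authentication form.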

Today I reconfigured Nutch 1.2 to crawl using the httpclient protocol as opposed to the http protocol. However, I am now no longer able to replicate the org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https error. In implementing the httpclient protocol I have followed all the steps advised in the wiki entry HttpAuthenticationSchemes, apart from setting credentials in httpclient-auth.xml (as I don't know what they are).

I hope I have explained this thoroughly enough to justify the post.

Thank you
Lewis

Glasgow Caledonian University is a registered Scottish charity, number SC021474
Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009.
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
Winner: Times Higher Education’s Outstanding Support for Early Career Researchers of the Year 2010, GCU as a lead with Universities Scotland partners.
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html

