Thanks for the tips, Susam! Unfortunately, I don't have much support on the server side...
A friend tipped me off that the server might be purposely blocking crawlers. So how can I make Nutch impersonate a browser?

I tried the tip in the following link but it didn't work:
http://osdir.com/ml/nutch-user.lucene.apache.org/2009-06/msg00022.html

Remi

On Sun, Nov 27, 2011 at 9:17 PM, Susam Pal <[email protected]> wrote:

> On Sun, Nov 27, 2011 at 4:41 PM, remi tassing <[email protected]> wrote:
> > Hello guys,
> >
> > With your advice, I tried tweaking config files during the weekend and
> > ran into some problems I couldn't solve (I'm running nutch-1.2. Cygwin
> > couldn't get nutch-1.3 to run).
> >
> > A sample of my log file can be found below. I have two concerns:
> > -How do I know if NTLM login worked?
> > -How do I debug the http 500 error code? I suspect it might be due to
> > cookies...
> >
> > Thanks in advance for your help
> >
> > ...
> > 2011-11-27 18:54:02,298 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
> > 2011-11-27 18:54:02,300 INFO auth.AuthChallengeProcessor - ntlm authentication scheme selected
> > DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
> > DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
> > INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > INFO fetcher.Fetcher - fetch of https://URL failed with: Http code=500, url=https://URL
> > INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
> > INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > INFO fetcher.Fetcher - -activeThreads=0
> > ...
>
> From the logs, Nutch did attempt an NTLM authentication but the server
> returned HTTP 500. The log says nothing about whether the NTLM
> authentication succeeded or failed. It only indicates that the request
> failed, which suggests that an internal error happened in SharePoint.
>
> Now, this can happen for a variety of reasons. I don't know much
> about how to troubleshoot this on the SharePoint side. Perhaps you
> should be looking into IIS logs, event viewer, etc. to figure out why
> SharePoint didn't accept your credentials.
>
> Most likely it is some kind of configuration problem in either
> SharePoint or IIS due to which the NTLM authentication is causing
> some trouble. Even though it is outside the scope of Nutch, from my
> very limited experience working with SharePoint, I can say that it
> might be a good idea to get Microsoft technical support involved
> while trying to troubleshoot this.
>
> Regards,
> Susam Pal
> http://susam.in/
>

--
Remi Tassing
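
One way to check the "crawlers are blocked" theory outside of Nutch is to send the same request twice, once with a crawler-style User-Agent and once with a browser-style one, and compare the status codes. The following is only a minimal sketch under stated assumptions: both User-Agent strings and the function names are hypothetical and do not come from this thread, and the requests carry no NTLM credentials, so on a protected site both calls may simply return 401. If the two codes differ, that would hint at User-Agent filtering. In Nutch itself the User-Agent is built from the http.agent.* properties (for example http.agent.name) in conf/nutch-site.xml.

```python
import urllib.error
import urllib.request

# Hypothetical values for illustration only; neither string is from the thread.
CRAWLER_UA = "Nutch-1.2"
BROWSER_UA = "Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0"


def build_request(url, user_agent):
    """Return a GET request for url that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def status_for(url, user_agent, timeout=10):
    """Send the request and return the HTTP status code, error codes included."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as HTTPError; the code is what we want.
        return err.code


def compare_user_agents(url):
    """Return the status code seen for each User-Agent so they can be compared."""
    return {
        "crawler": status_for(url, CRAWLER_UA),
        "browser": status_for(url, BROWSER_UA),
    }
```

Calling compare_user_agents() with the real site URL in place of the https://URL placeholder from the log returns a small dictionary with one status code per User-Agent.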

