Hi,

According to the link below, IIS gives an HTTP 500 response when the server expects NTLM V2 but receives an older version of the protocol, which is probably what is happening here. I would guess that the HttpClient in Nutch doesn't support NTLM V2.
I would also guess that it worked for Arkadi because the server in that case doesn't use NTLM V2. Again according to the reference, Sun JRE 5 or higher fully supports NTLM V2. I wonder why it wasn't used for Nutch.

Reference: http://oaklandsoftware.com/papers/ntlm.html

On Wednesday, November 30, 2011, remi tassing <[email protected]> wrote:
> Thanks for the tips, Susam!
> Unfortunately I don't have much support on the server side...
> I have been tipped off by a friend mentioning the possibility of crawlers being purposely blocked by the server.
> So how can I make Nutch impersonate a browser?
> I tried the tip in the following link but it didn't work: http://osdir.com/ml/nutch-user.lucene.apache.org/2009-06/msg00022.html
> Remi
> On Sun, Nov 27, 2011 at 9:17 PM, Susam Pal <[email protected]> wrote:
>>
>> On Sun, Nov 27, 2011 at 4:41 PM, remi tassing <[email protected]> wrote:
>> > Hello guys,
>> > With your advice, I tried tweaking config files during the weekend and got
>> > some problems I couldn't solve (I'm running nutch-1.2. Cygwin couldn't get
>> > nutch-1.3 to run).
>> > A sample of my log file can be found below. I have two concerns:
>> > -How do I know if NTLM login worked?
>> > -How do I debug the http 500 error code? I suspect it might be due to
>> > cookies...
>> > Thanks in advance for your help
>> > ...
>> > 2011-11-27 18:54:02,298 DEBUG auth.AuthChallengeProcessor - Supported
>> > authentication schemes in the order of preference: [ntlm, digest, basic]
>> > 2011-11-27 18:54:02,300 INFO auth.AuthChallengeProcessor - ntlm
>> > authentication scheme selected
>> > DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
>> > DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
>> > INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0,
>> > fetchQueues.totalSize=0
>> > INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0,
>> > fetchQueues.totalSize=0
>> > INFO fetcher.Fetcher - fetch of https://URL failed with: Http code=500,
>> > url=https://URL
>> > INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
>> > INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0,
>> > fetchQueues.totalSize=0
>> > INFO fetcher.Fetcher - -activeThreads=0
>> > ...
>>
>> From the logs, Nutch did attempt an NTLM authentication but the server
>> returned HTTP 500. It says nothing about whether the NTLM
>> authentication succeeded or failed. It only indicates that the
>> request failed. It suggests that an internal error happened in
>> SharePoint.
>>
>> Now, this can happen due to a variety of reasons. I don't know much
>> about how to troubleshoot this on the SharePoint side. Perhaps you
>> should be looking into IIS logs, event viewer, etc. to figure out why
>> SharePoint didn't accept your credentials.
>>
>> Most likely it is some kind of configuration problem in either
>> SharePoint or IIS due to which the NTLM authentication is causing
>> some trouble. Even though it is outside the scope of Nutch, from my
>> very limited experience working with SharePoint, I can say that it
>> might be a good idea to get the Microsoft technical support involved
>> while trying to troubleshoot this.
>>
>> Regards,
>> Susam Pal
>> http://susam.in/
>
> --
> Remi Tassing

