I logged a JIRA issue for this [1]. I wasn't sure whether to file it as a bug or an improvement. HttpURLConnection does work with NTLMv2, though, so the remaining problem is integrating it into Nutch.
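For anyone picking this up, here is a minimal sketch of the HttpURLConnection approach outside of Nutch. It is not Nutch code: the class name NtlmFetchSketch and the methods installCredentials and fetch are hypothetical names of mine. It assumes the JRE's built-in handler answers the server's NTLM challenge using whatever the default java.net.Authenticator returns, and that the user name is passed in DOMAIN\user form. Whether NTLMv2 is actually negotiated will depend on the JRE version and platform.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.Authenticator;
import java.net.HttpURLConnection;
import java.net.PasswordAuthentication;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of Nutch. */
final class NtlmFetchSketch {

    /**
     * Registers JVM-wide credentials. HttpURLConnection asks the default
     * Authenticator for them when the server sends an authentication challenge.
     * Assumption: the NTLM handler accepts the user name as DOMAIN\\user.
     */
    static void installCredentials(String domain, String user, char[] password) {
        final String qualifiedUser = domain + "\\" + user;
        final char[] pw = password.clone();
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                return new PasswordAuthentication(qualifiedUser, pw);
            }
        });
    }

    /** Fetches the address and returns the status line plus the response body. */
    static String fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        try {
            int status = conn.getResponseCode();
            // On 4xx/5xx the body, if any, is only available via the error stream.
            InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
            StringBuilder body = new StringBuilder("HTTP " + status + "\n");
            if (in != null) {
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(in, StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        body.append(line).append('\n');
                    }
                }
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }
}
```

Usage would be a call to NtlmFetchSketch.installCredentials(...) once, followed by NtlmFetchSketch.fetch(...) for each URL. Note that Authenticator.setDefault is global to the JVM, which is probably the awkward part for Nutch, since it configures credentials per host or realm and fetches from several threads.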
[1] https://issues.apache.org/jira/browse/NUTCH-1254

On Tue, Dec 20, 2011 at 10:49 AM, remi tassing <[email protected]> wrote:

> Hi,
>
> I tried the code snippet from the link below and it worked! Just need to
> figure out how to integrate that into Nutch, any help?
>
> [1]
> http://developer-resource.blogspot.com/2008/06/ntlm-authentication-from-java.html
>
>
> On Sat, Dec 17, 2011 at 3:07 PM, remi tassing <[email protected]> wrote:
>
>> How can I make Nutch use HttpURLConnection instead of HttpClient in a
>> painless way? It's been 8 years since I wrote any Java code :-/
>>
>>
>> On Saturday, December 17, 2011, remi tassing <[email protected]>
>> wrote:
>> > Hi,
>> >
>> > According to the link below, IIS gives an HTTP 500 response when the
>> > server expects NTLM V2 but is probably receiving an older version. I
>> > would guess that the HttpClient in Nutch doesn't support NTLM V2.
>> >
>> > I would also guess that it worked for Arkadi because its server doesn't
>> > use NTLM V2.
>> >
>> > Again according to the reference, Sun JRE 5 or higher fully supports
>> > NTLM V2. I wonder why it wasn't used for Nutch.
>> >
>> > reference: http://oaklandsoftware.com/papers/ntlm.html
>> >
>> > On Wednesday, November 30, 2011, remi tassing <[email protected]>
>> > wrote:
>> >> Thanks for the tips Susam!
>> >> Unfortunately I don't have much support on the server side...
>> >> I have been tipped off by a friend mentioning the possibility of
>> >> crawlers being purposely blocked by the server.
>> >> So how can I make Nutch impersonate a browser?
>> >> I tried the tip in the following link but it didn't work:
>> >> http://osdir.com/ml/nutch-user.lucene.apache.org/2009-06/msg00022.html
>> >> Remi
>> >>
>> >> On Sun, Nov 27, 2011 at 9:17 PM, Susam Pal <[email protected]> wrote:
>> >>>
>> >>> On Sun, Nov 27, 2011 at 4:41 PM, remi tassing <[email protected]>
>> >>> wrote:
>> >>> > Hello guys,
>> >>> > With your advice, I tried tweaking config files during the week-end
>> >>> > and got some problem I couldn't solve (I'm running nutch-1.2. Cygwin
>> >>> > couldn't get nutch-1.3 to run).
>> >>> > A sample of my log file can be found below. I have two concerns:
>> >>> > -How do I know if NTLM login worked?
>> >>> > -How do I debug the http 500 error code? I suspect it might be due
>> >>> > to cookies...
>> >>> > Thanks in advance for your help
>> >>> > ...
>> >>> > 2011-11-27 18:54:02,298 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
>> >>> > 2011-11-27 18:54:02,300 INFO auth.AuthChallengeProcessor - ntlm authentication scheme selected
>> >>> > DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
>> >>> > DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
>> >>> > INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>> >>> > INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>> >>> > INFO fetcher.Fetcher - fetch of https://URL failed with: Http code=500, url=https://URL
>> >>> > INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
>> >>> > INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>> >>> > INFO fetcher.Fetcher - -activeThreads=0
>> >>> > ...
>> >>>
>> >>> From the logs, Nutch did attempt an NTLM authentication but the server
>> >>> returned HTTP 500. It says nothing about whether the NTLM
>> >>> authentication succeeded or failed. It only suggests that an internal
>> >>> error happened in SharePoint.
>> >>>
>> >>> Now, this can happen due to a variety of reasons. I don't know much
>> >>> about how to troubleshoot this on the SharePoint side. Perhaps you
>> >>> should be looking into IIS logs, event viewer, etc. to figure out why
>> >>> SharePoint didn't accept your credentials.
>> >>>
>> >>> Most likely it is some kind of configuration problem in either
>> >>> SharePoint or IIS due to which the NTLM authentication is causing
>> >>> some trouble. Even though it is outside the scope of Nutch, from my
>> >>> very limited experience working with SharePoint, I can say that it
>> >>> might be a good idea to get the Microsoft technical support involved
>> >>> while trying to troubleshoot this.
>> >>>
>> >>> Regards,
>> >>> Susam Pal
>> >>> http://susam.in/
>> >>
>> >> --
>> >> Remi Tassing
>
> --
> Remi Tassing

