Hi, I tried the code snippet from the link below and it worked! Just need to figure out how to integrate that into Nutch, any help?
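For anyone following along, here is a minimal sketch of the JDK-only pattern I mean: register a java.net.Authenticator and let HttpURLConnection answer the server's authentication challenge. I am assuming the linked snippet follows this usual pattern. The class name NtlmFetchSketch, the method names, and the command-line arguments are placeholders of mine, and nothing here is Nutch code.

```java
import java.io.IOException;
import java.net.Authenticator;
import java.net.HttpURLConnection;
import java.net.PasswordAuthentication;
import java.net.URI;

// Hypothetical stand-alone sketch; not part of Nutch.
public class NtlmFetchSketch {

    // Registers JVM-wide credentials. HttpURLConnection asks the default
    // Authenticator for them when a server sends an authentication challenge.
    static void installCredentials(final String user, final char[] password) {
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                return new PasswordAuthentication(user, password);
            }
        });
    }

    // Issues a GET request and returns only the HTTP status code.
    static int fetchStatus(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: NtlmFetchSketch <url> <DOMAIN\\user> <password>");
            return;
        }
        installCredentials(args[1], args[2].toCharArray());
        System.out.println("HTTP " + fetchStatus(args[0]));
    }
}
```

The credentials are installed for the whole JVM, which is fine for a quick test but is probably the first thing that would need rethinking inside a multi-threaded fetcher.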
[1] http://developer-resource.blogspot.com/2008/06/ntlm-authentication-from-java.html

On Sat, Dec 17, 2011 at 3:07 PM, remi tassing <[email protected]> wrote:
> How can I make Nutch use HttpURLConnection instead of HttpClient in a
> painless way? It's been 8 years since I wrote any Java code :-/
>
>
> On Saturday, December 17, 2011, remi tassing <[email protected]> wrote:
> > Hi,
> >
> > According to the link below, IIS gives an HTTP 500 response when the
> > server expects NTLM V2 but is probably receiving an older version. I
> > would guess that the HttpClient in Nutch doesn't support NTLM V2.
> >
> > I would also guess that it worked for Arkadi because that server doesn't
> > use NTLM V2.
> >
> > Again according to the reference, Sun JRE 5 or higher fully supports
> > NTLM V2. I wonder why it wasn't used for Nutch.
> >
> > reference: http://oaklandsoftware.com/papers/ntlm.html
> >
> > On Wednesday, November 30, 2011, remi tassing <[email protected]> wrote:
> >> Thanks for the tips, Susam!
> >> Unfortunately I don't have much support on the server side...
> >> A friend tipped me off to the possibility that crawlers are being
> >> purposely blocked by the server.
> >> So how can I make Nutch impersonate a browser?
> >> I tried the tip in the following link but it didn't work:
> >> http://osdir.com/ml/nutch-user.lucene.apache.org/2009-06/msg00022.html
> >> Remi
> >> On Sun, Nov 27, 2011 at 9:17 PM, Susam Pal <[email protected]> wrote:
> >>>
> >>> On Sun, Nov 27, 2011 at 4:41 PM, remi tassing <[email protected]> wrote:
> >>> > Hello guys,
> >>> > Following your advice, I tried tweaking config files during the weekend
> >>> > and ran into a problem I couldn't solve (I'm running nutch-1.2; I
> >>> > couldn't get nutch-1.3 to run under Cygwin).
> >>> > A sample of my log file can be found below. I have two concerns:
> >>> > - How do I know if NTLM login worked?
> >>> > - How do I debug the HTTP 500 error code? I suspect it might be due to
> >>> > cookies...
> >>> > Thanks in advance for your help
> >>> > ...
> >>> > 2011-11-27 18:54:02,298 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
> >>> > 2011-11-27 18:54:02,300 INFO auth.AuthChallengeProcessor - ntlm authentication scheme selected
> >>> > DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
> >>> > DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
> >>> > INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> >>> > INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> >>> > INFO fetcher.Fetcher - fetch of https://URL failed with: Http code=500, url=https://URL
> >>> > INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
> >>> > INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> >>> > INFO fetcher.Fetcher - -activeThreads=0
> >>> > ...
> >>>
> >>> From the logs, Nutch did attempt an NTLM authentication but the server
> >>> returned HTTP 500. It says nothing about whether the NTLM
> >>> authentication succeeded or failed. It only indicates that the
> >>> request failed. It suggests that an internal error happened in
> >>> SharePoint.
> >>>
> >>> Now, this can happen for a variety of reasons. I don't know much
> >>> about how to troubleshoot this on the SharePoint side. Perhaps you
> >>> should be looking into IIS logs, event viewer, etc. to figure out why
> >>> SharePoint didn't accept your credentials.
> >>>
> >>> Most likely it is some kind of configuration problem in either
> >>> SharePoint or IIS due to which the NTLM authentication is causing
> >>> some trouble.
> >>> Even though it is outside the scope of Nutch, from my
> >>> very limited experience working with SharePoint, I can say that it
> >>> might be a good idea to get the Microsoft technical support involved
> >>> while trying to troubleshoot this.
> >>>
> >>> Regards,
> >>> Susam Pal
> >>> http://susam.in/
> >>
> >>
> >>
> >> --
> >> Remi Tassing
> >>
> >>
>
--
Remi Tassing

