Hello guys,

Following your advice, I tried tweaking the config files over the weekend and ran into a problem I couldn't solve. (I'm running nutch-1.2; I couldn't get nutch-1.3 to run under Cygwin.) A sample of my log file can be found below. I have two concerns:

 - How do I know if the NTLM login worked?
 - How do I debug the HTTP 500 error code? I suspect it might be due to cookies...

Thanks in advance for your help

...
2011-11-27 18:54:02,298 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
2011-11-27 18:54:02,300 INFO auth.AuthChallengeProcessor - ntlm authentication scheme selected
DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
INFO fetcher.Fetcher - fetch of https://URL failed with: Http code=500, url=https://URL
INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
INFO fetcher.Fetcher - -activeThreads=0
...

On Fri, Nov 25, 2011 at 9:34 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Yes thanks for the feedback Arkadi.
>
> I know this is possibly outside the scope of your work, but it would be
> really great if you could add some of your experience to
> http://wiki.apache.org/nutch/HttpAuthenticationSchemes
>
> This is an area which has been unclear for some users for sometime, if you
> are happy with your working implementation, your thoughts would be
> extremely appreciated from the rest of the community.
>
> Thank you, and glad to hear that things are working.
>
> On Fri, Nov 25, 2011 at 7:16 AM, <[email protected]> wrote:
>
> > Hi Lewis,
> >
> > I am saying that my configuration works with our SharePoint server. The
> > authentication scheme is NTLM. Two versions of Nutch are working: a
> > snapshot of Nutch 1.4 in my development and Nutch 1.2 that is being used
> > in production.
> >
> > I have to admit that it took some tweaking to get authentication working.
> >
> > Regards,
> >
> > Arkadi
> >
> > > -----Original Message-----
> > > From: Lewis John Mcgibbney [mailto:[email protected]]
> > > Sent: Thursday, 24 November 2011 10:29 PM
> > > To: [email protected]
> > > Subject: Re: Nutch and Sharepoint authentication
> > >
> > > Hi Arkadi,
> > >
> > > Are you saying that this has been solved and that you are successfully
> > > able to crawl the server?
> > >
> > > Thanks
> > >
> > > On Thu, Nov 24, 2011 at 12:48 AM, <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am crawling a SharePoint server, no major problems. I do have to use
> > > > protocol-httpclient for this. Here is an extract from my
> > > > httpclient-auth.xml file, if it helps:
> > > >
> > > > <auth-configuration>
> > > >   <credentials username="myusername" password="mypassword">
> > > >     <default realm="myrealm" />
> > > >   </credentials>
> > > > </auth-configuration>
> > > >
> > > > Regards,
> > > >
> > > > Arkadi
> > > >
> > > > > -----Original Message-----
> > > > > From: Lewis John Mcgibbney [mailto:[email protected]]
> > > > > Sent: Tuesday, 22 November 2011 9:43 PM
> > > > > To: [email protected]
> > > > > Subject: Re: Nutch and Sharepoint authentication
> > > > >
> > > > > Hi,
> > > > >
> > > > > From what I have read on the Nutch user@ archives [1] it is possible
> > > > > to crawl a MS Sharepoint server, which includes setting up NTLM
> > > > > authentication for your crawler. It is becoming a pretty major
> > > > > problem now that the protocol-httpclient plugin is unstable; there
> > > > > are Jira issues open for this.
> > > > >
> > > > > Unfortunately, as Manifold CF is in incubation status, it can only be
> > > > > expected that they might not have completed all documentation yet.
> > > > > However, I advise you to try there as well: ask them about the
> > > > > Sharepoint configuration/documentation if it is not possible for you
> > > > > to work with Nutch protocol-httpclient.
> > > > >
> > > > > hth
> > > > >
> > > > > [1]
> > > > > http://www.mail-archive.com/search?q=sharepoint&l=user%40nutch.apache.org
> > > > >
> > > > > On Tue, Nov 22, 2011 at 5:27 AM, remi tassing <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hello guys,
> > > > > >
> > > > > > I read the wiki on
> > > > > > "HttpAuthenticationSchemes<http://wiki.apache.org/nutch/HttpAuthenticationSchemes>".
> > > > > > I previously managed to make Nutch crawl local folders and websites
> > > > > > (with SSL authentication). However, I'm trying to crawl some sites
> > > > > > in a corporate intranet environment running under MS Sharepoint. I
> > > > > > have been unsuccessful so far and I believe it's because of
> > > > > > authentication.
> > > > > >
> > > > > >  - Is Nutch able to crawl Sharepoint? If yes, do you have a
> > > > > >    link/mail tutorial on this?
> > > > > >
> > > > > > I recently became aware of the ManifoldCF initiative and it seems
> > > > > > to be a possible solution to my problem. But it's currently poorly
> > > > > > documented (as far as the Sharepoint connector is concerned).
> > > > > >
> > > > > >  - Do you have any recommendation in this regard?
> > > > > >
> > > > > > Thanks in advance for your help, I'll really appreciate it!
> > > > > >
> > > > > > --
> > > > > > Remi Tassing
> > > > > >
> > > > >
> > > > > --
> > > > > *Lewis*
> > > > >
> > >
> > > --
> > > *Lewis*
> > >
>
> --
> *Lewis*
>

--
Remi Tassing
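
A minimal sketch, in Python, of one way to pull the relevant facts out of a log like the excerpt in this message: which authentication scheme was selected, and which fetches failed with which HTTP code. The two regular expressions are modelled only on the log lines quoted here, so they are an assumption and may need adjusting for other Nutch versions or log layouts; the function name `summarize` is hypothetical. Note that a "scheme selected" line only shows that the client picked a scheme for the challenge; on its own it probably does not prove that the server accepted the credentials.

```python
import re

# Patterns modelled on the log lines quoted in this message (assumption:
# other Nutch versions or log4j layouts may format these lines differently).
SCHEME_RE = re.compile(r"AuthChallengeProcessor - (\w+) authentication scheme selected")
FAILED_RE = re.compile(r"fetch of (\S+) failed with: Http code=(\d+)")


def summarize(log_text):
    """Return (selected_schemes, failures) found in log_text.

    selected_schemes is a list of scheme names such as "ntlm";
    failures is a list of (url, http_code) tuples.
    """
    schemes = SCHEME_RE.findall(log_text)
    failures = [(url, int(code)) for url, code in FAILED_RE.findall(log_text)]
    return schemes, failures


if __name__ == "__main__":
    sample = (
        "2011-11-27 18:54:02,300 INFO auth.AuthChallengeProcessor - "
        "ntlm authentication scheme selected\n"
        "INFO fetcher.Fetcher - fetch of https://URL failed with: "
        "Http code=500, url=https://URL\n"
    )
    schemes, failures = summarize(sample)
    print("schemes:", schemes)
    print("failures:", failures)
```

A failure code of 401 after the challenge would point at rejected credentials, whereas a 500 (as in the excerpt) is a server-side error and so is worth checking against the server's own logs if they are accessible.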

