On Tue, Nov 22, 2011 at 10:57 AM, remi tassing <[email protected]> wrote:
> Hello guys,
>
> I read the wiki on
> "HttpAuthenticationSchemes<http://wiki.apache.org/nutch/HttpAuthenticationSchemes>".
> I previously managed to make Nutch crawl local folders and websites (with
> SSL authentication). However, I'm trying to crawl some sites in a corporate
> intranet environment running under MS Sharepoint. I was unsuccessful so far
> and I believe it's because of authentication.
>
>    - Is Nutch able to crawl Sharepoint? If yes, do you have a link/mail
>    tutorial on this?
>
> I recently became aware of the ManifoldCF initiative and it seems to be a
> possible solution to my problem. But it's currently poorly documented (as
> far as the Sharepoint connector is concerned).
>
>    - Do you have any recommendation in this regard?
>
> Thanks in advance for your help, I'll really appreciate it!
>
> --
> Remi Tassing
>
Hi Remi,

I am sorry I was not able to reply to you earlier; I have been pretty
busy this week.

I have never tried crawling SharePoint with Nutch, so I am not sure
whether it works. My work on authentication assumes that a website is
properly configured to challenge the client or crawler with NTLM
authentication.

If it doesn't work, I would suggest that you follow the "Need Help?"
section at
http://wiki.apache.org/nutch/HttpAuthenticationSchemes#Need_Help.3F
carefully and post the relevant information to [email protected] (with
me in CC if possible, since I am not actively monitoring the mailing
list), and we as a community might be able to help you out.

Once again, I am sorry I couldn't help you sooner, and good luck with
this experiment.

Regards,
Susam Pal
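
One way to check the assumption mentioned above (that the server challenges the client with NTLM) is to send a request without credentials and look at the WWW-Authenticate headers of the response. The following is only a rough Python sketch and not part of Nutch; the function names are hypothetical, and it assumes each WWW-Authenticate header carries a single scheme, so a header that lists several challenges separated by commas would need a fuller parser.

```python
import urllib.error
import urllib.request


def offered_schemes(header_values):
    """Return the lowercase scheme name at the start of each header value."""
    schemes = []
    for value in header_values:
        # The scheme is the first whitespace-delimited token,
        # e.g. 'Basic realm="x"' -> 'basic', 'NTLM' -> 'ntlm'.
        parts = value.strip().split(None, 1)
        if parts:
            schemes.append(parts[0].lower())
    return schemes


def challenges_with_ntlm(header_values):
    """True if any of the given WWW-Authenticate values offers NTLM."""
    return "ntlm" in offered_schemes(header_values)


def fetch_challenges(url):
    """Request url without credentials and return its WWW-Authenticate values."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.headers.get_all("WWW-Authenticate") or []
    except urllib.error.HTTPError as err:
        # A 401 response arrives here; its headers hold the challenge.
        return err.headers.get_all("WWW-Authenticate") or []
```

Calling challenges_with_ntlm(fetch_challenges(url)) with the URL of the intranet site would then indicate whether an NTLM challenge is offered at all. If the result is False, the site is probably not presenting the kind of challenge described above.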

