Digging further: that text is on every SharePoint page, inside a div with class="NOINDEX". I guess the MS FAST indexer skips over it. Is there a way for nutch to do the same?
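To make concrete what "skipping it" would mean, here is a minimal stand-alone sketch in Python. It is not a Nutch API or plugin, just an illustration of the filtering I'm after. The class name NoIndexStripper, the helper visible_text, and the sample markup are all made up for the example, and I'm assuming the marker is a plain class="NOINDEX" attribute on a div, possibly alongside other classes.

```python
from html.parser import HTMLParser


class NoIndexStripper(HTMLParser):
    """Collects page text, skipping anything inside <div class="... NOINDEX ...">."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while we are inside a NOINDEX div
        self.chunks = []     # text fragments that should be indexed

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth > 0:
            # Nested div inside a skipped region: track depth so we
            # know which closing tag ends the NOINDEX div itself.
            self.skip_depth += 1
        elif any(c.lower() == "noindex" for c in classes):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.chunks.append(text)


def visible_text(html):
    """Return the text of `html` with NOINDEX divs removed."""
    parser = NoIndexStripper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    # Hypothetical page fragment, not copied from a real SharePoint page.
    sample = (
        '<body><div class="s4 NOINDEX"><div>Skip Ribbon Commands</div>'
        " Turn on more accessible mode</div><p>Real content</p></body>"
    )
    print(visible_text(sample))
```

In nutch itself I'd expect this to belong somewhere in the HTML parsing step, but I don't know yet which setting or plugin would do it.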
Now I'm trying to determine why I'm not getting some of the files. On the main page, I have a link to:

"http://url/Shared%20Documents/vi.pdf"

I have successfully run:

  nutch org.apache.nutch.indexer.IndexingFiltersChecker <url>
  nutch parseChecker -dumpText <url>

Both return successfully, which makes it seem like the file can be indexed. Any idea where to get started with the config files?

-- Chris

On Thu, Dec 15, 2011 at 3:13 PM, Christopher Gross <[email protected]> wrote:
> I'm able to start crawling a SharePoint site, but then I get this for
> the body of ALL the pages it finds:
>
> " You may be trying to access this site from a secured browser on the
> server. Please enable scripts and reload this page. Turn on more
> accessible mode Turn off more accessible mode Skip Ribbon Commands
> Skip to main content To navigate through the Ribbon, use standard
> browser navigation ...."
>
> Is this something that I have to fix on the SharePoint side of things,
> or is it on nutch? I'm thinking that if I put the right stuff in the
> authentication for nutch it may work -- but I'm not sure what needs to
> go in there either.
>
> Is anyone willing to share experience/configuration files for crawling
> SharePoint content with nutch?
>
> Nutch 1.4, SharePoint 2010, Java 6
>
> -- Chris

