Re: Crawling Sharepoint

Christopher Gross Thu, 15 Dec 2011 13:07:09 -0800

Digging further: that text appears on every SharePoint page, inside a
class="NOINDEX" div. I guess the MS FAST indexer skips over it. Is
there a way for Nutch to do the same?
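
To make the behaviour I'm after concrete, here is a rough sketch of
"skip everything inside a NOINDEX element when extracting text". It is
plain Python using only the standard library, not a Nutch plugin and not
any Nutch API. The names NoIndexTextExtractor and visible_text are made
up for illustration, and it assumes reasonably well-formed markup where
non-void elements have matching close tags.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth count.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class NoIndexTextExtractor(HTMLParser):
    """Collect page text, ignoring anything nested inside a NOINDEX element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while we are inside a NOINDEX subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").lower().split()
        # Enter a NOINDEX subtree, or go one level deeper inside one.
        if self.skip_depth or "noindex" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the text of `html` with NOINDEX subtrees removed (hypothetical helper)."""
    parser = NoIndexTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

In Nutch itself the same idea would presumably have to live in a parse
plugin (for example an HtmlParseFilter) that prunes those nodes before
the text is extracted, which I have not tried.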


Now I'm trying to determine why I'm not getting some of the files.  On
the main page, I have a link to:

"http://url/Shared%20Documents/vi.pdf"

I have successfully run:
nutch org.apache.nutch.indexer.IndexingFiltersChecker <url>
nutch parseChecker -dumpText <url>

Both return successfully, which suggests the file can be parsed and
indexed. Any idea where to get started with the config files?

-- Chris



On Thu, Dec 15, 2011 at 3:13 PM, Christopher Gross <[email protected]> wrote:
> I'm able to start crawling a SharePoint site, but then I get this for
> the body of ALL the pages it finds:
>
> " You may be trying to access this site from a secured browser on the
> server. Please enable scripts and reload this page. Turn on more
> accessible mode Turn off more accessible mode Skip Ribbon Commands
> Skip to main content To navigate through the Ribbon, use standard
> browser navigation ...."
>
> Is this something that I have to fix on the SharePoint side of things,
> or is it on nutch?  I'm thinking that if I put the right stuff in the
> authentication for nutch it may work -- but I'm not sure what needs to
> go in there either.
>
> Is anyone willing to share experience/configuration files for crawling
> SharePoint content with nutch?
>
> Nutch 1.4, SharePoint 2010, Java 6
>
> -- Chris
