Re: Crawling all of a SharePoint site

Mark Libucha Mon, 18 Nov 2013 17:38:13 -0800

Hi Karl,

It's not the first problem you mentioned. I don't have a site specified in
my SP connection. But it could well be the misconfigured IIS issue...


Here's what I get with your modified log message:

ERROR 2013-11-18 20:35:47,440 (Worker thread '7') - Exception tossed:
Expected path to start with /Lists/, saw: '/Cache Profiles/1_.000'
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Expected path to
start with /Lists/, saw: '/Cache Profiles/1_.000'
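
For reference, here is a minimal sketch of the kind of check that yields this message. The names ListPathCheck, PathException, and stripListsPrefix are hypothetical stand-ins and not the connector's actual code; only the message text and the relPath variable name come from this thread.

```java
public class ListPathCheck {
    // Hypothetical stand-in for ManifoldCFException, so the sketch compiles on its own.
    static class PathException extends Exception {
        PathException(String message) {
            super(message);
        }
    }

    // Returns the part of relPath after "/Lists/"; otherwise throws, quoting the
    // offending path in the message so it shows up in the log.
    static String stripListsPrefix(String relPath) throws PathException {
        final String prefix = "/Lists/";
        if (!relPath.startsWith(prefix)) {
            throw new PathException(
                "Expected path to start with /Lists/, saw: '" + relPath + "'");
        }
        return relPath.substring(prefix.length());
    }

    public static void main(String[] args) {
        // The first path is an invented example; the second is the one from the log above.
        String[] paths = {"/Lists/Tasks/1_.000", "/Cache Profiles/1_.000"};
        for (String path : paths) {
            try {
                System.out.println("ok: " + stripListsPrefix(path));
            } catch (PathException e) {
                System.out.println("error: " + e.getMessage());
            }
        }
    }
}
```

Including the rejected value in the message is what makes it possible to see that the path here is rooted at '/Cache Profiles/' rather than '/Lists/'.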

Thanks,

Mark



On Mon, Nov 18, 2013 at 5:29 PM, Karl Wright <[email protected]> wrote:

> Hi Mark,
>
> The exception is very helpful.
>
> I've seen this before.  I know of two ways it can happen.
>
> First way: your Repository Connection is not actually pointing at the
> SharePoint root, but rather a subsite of the root.  That usually messes
> things up pretty well, and it's not easy to detect properly in the
> connector either.  You must point at the actual root, not a subsite, and use
> the criteria to limit what you include.
>
> Second way: your SharePoint instance has a misconfigured IIS, which is
> mapping paths in ways that are unexpected.
>
> There may be other ways that this can happen; SharePoint has a myriad
> different configuration options and it is possible your instance has one
> that is not something we've ever seen before.  If you think that is what is
> happening, change this line:
>
>             throw new ManifoldCFException("Expected path to start with /Lists/");
>
> to:
>
>             throw new ManifoldCFException("Expected path to start with /Lists/, saw: '"+relPath+"'");
>
> Karl
>
>
>
>
> On Mon, Nov 18, 2013 at 8:20 PM, Mark Libucha <[email protected]> wrote:
>
>> Screen shot attached. Using 4.1, SharePoint 2010.
>>
>> Throws this exception:
>>
>> ERROR 2013-11-18 20:12:58,058 (Worker thread '13') - Exception tossed:
>> Expected path to start with /Lists/
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Expected path
>> to start with /Lists/
>>     at
>> org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository$ListItemStream.addFile(SharePointRepository.java:2255)
>>
>> I added a debug log message to the SharePoint crawler so the line number
>> may be off by 1 or 2...
>>
>> Thanks,
>>
>> Mark
>>
>>
>>
>> On Mon, Nov 18, 2013 at 4:59 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Mark,
>>>
>>> First, what version of ManifoldCF are you using?  1.3 has some bugs
>>> where lists are concerned.
>>>
>>> Second, I've recently and repeatedly run exactly this crawl against a
>>> site that one of our ManifoldCF users set up in Amazon, so I know it works
>>> properly.  So now the question is to determine exactly what you are doing
>>> that is not correct.
>>>
>>> If you want to crawl just lists, you will nevertheless need to enter
>>> both a Site match and a List match.  Otherwise you will get nothing,
>>> because no sites can be crawled.
>>>
>>> To enter ANY of the rules I specified above, type a "*" in the type-in
>>> box, then select "Add Text".  Then, select one of "File", "Site", "List",
>>> or "Library" from the pulldown, and then click the "Add new Rule" button.
>>> The Metadata tab works similarly.
>>>
>>> If you want me to verify you have done this correctly, please include a
>>> screen shot of the job's View page.
>>>
>>> If this still isn't helping you, please include a screen shot of the
>>> Simple History report after you have run a crawl.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>>
>>> On Mon, Nov 18, 2013 at 7:49 PM, Mark Libucha <[email protected]> wrote:
>>>
>>>> I've seen this issue come up before, but I'd like to hear more about it
>>>> (Karl), if there is more to say about it...
>>>>
>>>> Why isn't there an option to crawl an entire SharePoint site? I mean
>>>> it's awesome that the UI gives us the option of drilling down dynamically
>>>> and specifying exactly which parts we want crawled, but isn't the default
>>>> case for most users to just crawl the whole thing?
>>>>
>>>> So, why exactly is this not an option, and would adding that
>>>> functionality (I would be volunteering to try this) be feasible?
>>>>
>>>> On a more specific level, Karl wrote this in an earlier thread:
>>>>
>>>> <quote>
>>>> For SharePoint, if you want to crawl everything beneath your root site,
>>>> the simplest way is to define 4 rules:
>>>> (1) SITE rule "/*"
>>>> (2) LIST rule "/*"
>>>> (3) LIBRARY rule "/*"
>>>> (4) FILE rule "/*"
>>>> </quote>
>>>>
>>>> I haven't been able to get this to work. It only seems to get files.
>>>>
>>>> Limiting the scope to just Lists, when I use "/*" and specify List, I
>>>> get nothing crawled. Also tried "/Lists/*". Still nothing.
>>>>
>>>> Maybe I'm not specifying the Metadata correctly? Could you expand on
>>>> this, Karl? What exactly needs to be specified to crawl all Lists? If I
>>>> can get that to work I can probably figure out the rest of it.
>>>>
>>>> Thanks,
>>>>
>>>> Mark
>>>>
>>>>
>>>
>>
>
