Re: Crawling all of a SharePoint site

Karl Wright Mon, 18 Nov 2013 17:47:13 -0800

Hah.  Exactly the kind of configuration difference I was expecting.
Whatever it is, it's showing up as a list.


I'll open a ticket, and propose a patch; let's see if that gets us past
this.

The ticket is CONNECTORS-812.  I should have a patch in a few minutes,
attached to the ticket.

Karl




On Mon, Nov 18, 2013 at 8:41 PM, Mark Libucha <[email protected]> wrote:

> Seems to be a SP-internal thing.
>
> http://msdn.microsoft.com/en-us/library/aa661294.ASPX
>
> Mark
>
>
> On Mon, Nov 18, 2013 at 5:39 PM, Karl Wright <[email protected]> wrote:
>
>> Hi Mark,
>>
>> Is "Cache Profiles" a list in your SharePoint?  If not, what is it?
>>
>> Karl
>>
>>
>>
>> On Mon, Nov 18, 2013 at 8:37 PM, Mark Libucha <[email protected]> wrote:
>>
>>> Hi Karl,
>>>
>>> It's not the first problem you mentioned. I don't have a site specified
>>> in my SP connection. But it could well be the misconfigured IIS issue...
>>>
>>> Here's what I get with your modified log message:
>>>
>>> ERROR 2013-11-18 20:35:47,440 (Worker thread '7') - Exception tossed:
>>> Expected path to start with /Lists/, saw: '/Cache Profiles/1_.000'
>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Expected path
>>> to start with /Lists/, saw: '/Cache Profiles/1_.000'
>>>
>>> Thanks,
>>>
>>> Mark
>>>
>>>
>>>
>>> On Mon, Nov 18, 2013 at 5:29 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Mark,
>>>>
>>>> The exception is very helpful.
>>>>
>>>> I've seen this before.  I know of two ways it can happen.
>>>>
>>>> First way: your Repository Connection is not actually pointing at the
>>>> SharePoint root, but rather a subsite of the root.  That usually messes
>>>> things up pretty well - and it's not easy to detect in the connector
>>>> properly either.  You must point at the actual root, not a subsite, and use
>>>> the criteria to limit what you include.
>>>>
>>>> Second way: your SharePoint instance has a misconfigured IIS, which is
>>>> mapping paths in ways that are unexpected.
>>>>
>>>> There may be other ways that this can happen; SharePoint has myriad
>>>> configuration options, and it is possible your instance uses one that
>>>> we have never seen before.  If you think that is what is happening,
>>>> change this line:
>>>>
>>>>             throw new ManifoldCFException("Expected path to start with /Lists/");
>>>>
>>>> to:
>>>>
>>>>             throw new ManifoldCFException("Expected path to start with /Lists/, saw: '"+relPath+"'");
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Nov 18, 2013 at 8:20 PM, Mark Libucha <[email protected]> wrote:
>>>>
>>>>> Screen shot attached. Using 4.1, SharePoint 2010.
>>>>>
>>>>> Throws this exception:
>>>>>
>>>>> ERROR 2013-11-18 20:12:58,058 (Worker thread '13') - Exception tossed:
>>>>> Expected path to start with /Lists/
>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Expected
>>>>> path to start with /Lists/
>>>>>     at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository$ListItemStream.addFile(SharePointRepository.java:2255)
>>>>>
>>>>> I added a debug log message to the SharePoint crawler so the line
>>>>> number may be off by 1 or 2...
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Nov 18, 2013 at 4:59 PM, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> First, what version of ManifoldCF are you using?  1.3 has some bugs
>>>>>> where lists are concerned.
>>>>>>
>>>>>> Second, I've recently and repeatedly run exactly this crawl against a
>>>>>> site that one of our ManifoldCF users set up in Amazon, so I know it
>>>>>> works properly.  So now the question is to determine exactly what you
>>>>>> are doing that is not correct.
>>>>>>
>>>>>> If you want to crawl just lists, you will nevertheless need to enter
>>>>>> both a Site match and a List match.  Otherwise you will get nothing,
>>>>>> because no sites can be crawled.
>>>>>>
>>>>>> To enter ANY of the rules I specified above, type a "*" in the
>>>>>> type-in box, then select "Add Text".  Then, select one of
>>>>>> "File", "Site", "List", or "Library" from the pulldown, and then click the
>>>>>> "Add new Rule" button.  The Metadata tab works similarly.
>>>>>>
>>>>>> If you want me to verify you have done this correctly, please include
>>>>>> a screen shot of the job's View page.
>>>>>>
>>>>>> If this still isn't helping you, please include a screen shot of the
>>>>>> Simple History report after you have run a crawl.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 18, 2013 at 7:49 PM, Mark Libucha <[email protected]> wrote:
>>>>>>
>>>>>>> I've seen this issue come up before, but I'd like to hear more about
>>>>>>> it (Karl), if there is more to say about it...
>>>>>>>
>>>>>>> Why isn't there an option to crawl an entire SharePoint site? I mean
>>>>>>> it's awesome that the UI gives us the option of drilling down
>>>>>>> dynamically and specifying exactly which parts we want crawled, but
>>>>>>> isn't the default case for most users to just crawl the whole thing?
>>>>>>>
>>>>>>> So, why exactly is this not an option, and would adding that
>>>>>>> functionality (I would be volunteering to try this) be feasible?
>>>>>>>
>>>>>>> On a more specific level, Karl wrote this in an earlier thread:
>>>>>>>
>>>>>>> <quote>
>>>>>>> For SharePoint, if you want to crawl everything beneath your root
>>>>>>> site, the simplest way is to define 4 rules:
>>>>>>> (1) SITE rule "/*"
>>>>>>> (2) LIST rule "/*"
>>>>>>> (3) LIBRARY rule "/*"
>>>>>>> (4) FILE rule "/*"
>>>>>>> </quote>
>>>>>>>
>>>>>>> I haven't been able to get this to work. It only seems to get files.
>>>>>>>
>>>>>>> Limiting the scope to just Lists, when I use "/*" and specify List,
>>>>>>> I get nothing crawled. Also tried "/Lists/*". Still nothing.
>>>>>>>
>>>>>>> Maybe I'm not specifying the Metadata correctly? Could you expand on
>>>>>>> this Karl? What exactly needs to be specified to crawl all Lists? If I
>>>>>>> can get that to work I can probably figure out the rest of it.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
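
The diagnostic change suggested earlier in the thread (including the offending path in the exception message) can be sketched in isolation as below. This is only an illustration, not the ManifoldCF source: the class name `ListPathCheck`, the method `stripListsPrefix`, the sample list name `Announcements`, and the use of `IllegalArgumentException` in place of `ManifoldCFException` are all assumptions made so the example compiles on its own.

```java
public class ListPathCheck {
    // Prefix the check expects every list item path to carry.
    private static final String LISTS_PREFIX = "/Lists/";

    // Hypothetical helper (not a ManifoldCF method): returns the part of
    // relPath after "/Lists/", or throws with the offending path included
    // in the message so the log shows what was actually seen.
    public static String stripListsPrefix(String relPath) {
        if (!relPath.startsWith(LISTS_PREFIX)) {
            // IllegalArgumentException stands in for ManifoldCFException here.
            throw new IllegalArgumentException(
                "Expected path to start with /Lists/, saw: '" + relPath + "'");
        }
        return relPath.substring(LISTS_PREFIX.length());
    }

    public static void main(String[] args) {
        // A path under /Lists/ passes ("Announcements" is a made-up list name).
        System.out.println(stripListsPrefix("/Lists/Announcements/1_.000"));

        // The path reported in the thread does not start with /Lists/,
        // so the check throws and the message names the path.
        try {
            stripListsPrefix("/Cache Profiles/1_.000");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Putting the path in the message is what made the "Cache Profiles" case identifiable from the log alone, without attaching a debugger to the crawler.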
