Thanks, Karl. Do you think this issue is generic to any AWS instance? I'm
just wondering how easily reproducible it may be...

For now, I assume our only workaround is to list the paths of interest
manually, i.e., add an explicit rule for each library and list.
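
To make sure I understand the workaround, I'd sketch the explicit rules
along these lines, extrapolating from the wildcard rule syntax further down
in this thread (the subsite and library names here are just placeholders,
not our real ones):

/Abcd site include
/Abcd/Shared Documents library include
/Abcd/Shared Documents/* file include
/Abcd/Tasks list include

with a similar block repeated for each subsite.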

A related question: is identification and extraction of attachments
supported in the SP connector?  E.g., if I have a Word doc attached to a
Task list item, would it be extracted?  So far I see that library content
gets crawled and I'm getting the list item data, but I'm not sure what
happens to the attachments.
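
For context on the kind of data I'd expect per list item, here is a minimal
sketch assuming the standard SharePoint Lists web service
(Lists.asmx, GetAttachmentCollection); the response XML, host, and file
names below are fabricated for illustration:

```python
import xml.etree.ElementTree as ET

# Illustrative only: a response shaped like what SharePoint's Lists web
# service (Lists.asmx, GetAttachmentCollection) returns for a list item.
# The host, site, and file names are made up.
SAMPLE_RESPONSE = """<Attachments xmlns="http://schemas.microsoft.com/sharepoint/soap/">
  <Attachment>http://host/sites/Main/Lists/Tasks/Attachments/3/Plan.docx</Attachment>
  <Attachment>http://host/sites/Main/Lists/Tasks/Attachments/3/Budget.xlsx</Attachment>
</Attachments>"""

NS = "{http://schemas.microsoft.com/sharepoint/soap/}"

def attachment_urls(xml_text):
    """Collect the attachment URLs from a GetAttachmentCollection-style response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(NS + "Attachment")]

print(attachment_urls(SAMPLE_RESPONSE))
```

If the connector sees these URLs at all, I'd hope it fetches and indexes
them the same way it does library documents.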


On Mon, Sep 16, 2013 at 10:48 AM, Karl Wright <[email protected]> wrote:

> Hi Dmitry,
>
> Thanks for the additional information.  It does appear that the method
> that lists subsites is not working as expected under AWS, nor are some
> of the other methods that supposedly just list the children of a
> subsite.
>
> I've reopened CONNECTORS-772 to work on addressing this issue.  Please
> stay tuned.
>
> Karl
>
>
>
> On Mon, Sep 16, 2013 at 10:08 AM, Dmitry Goldenberg <
> [email protected]> wrote:
>
>> Hi Karl,
>>
>> Most of the paths that get generated are listed in the attached log; they
>> match what shows up in the diag report, so I'm not sure where they diverge.
>> Most of them just don't seem right.  There are 3 subsites rooted in the
>> main site: Abcd, Defghij, Klmnopqr.  It's strange that the connector would
>> try paths such as:
>>
>> /*Klmnopqr*/*Defghij*/*Defghij*/Announcements/// -- there are multiple
>> repetitions of the same subsite in the path, and to begin with, Defghij is
>> not a subsite of Klmnopqr, so why would the connector try this? The /// at
>> the end doesn't seem correct either, unless I'm missing something in how
>> this pathing works.
>>
>> /Test Library 1/Financia/lProjectionsTemplate.xl/Abcd/Announcements --
>> looks wrong: a docname is mixed into the path, and a subsite ends up after
>> the docname?
>>
>> /Shared Documents/Personal_Fina/ncial_Statement_1_1.xl/Defghij/ -- same
>> types of issues, plus now the docname somehow got split by a forward
>> slash?
>>
>> There are also a bunch of StringIndexOutOfBoundsExceptions.  Perhaps
>> this logic doesn't fit the pathing we're seeing on this AWS-based
>> installation?
>>
>> I'd expect the logic to simply know that the root contains 3 subsites and
>> work from that. Each subsite has a specific list of libraries and lists,
>> etc. It seems odd that the connector gets into this matching pattern and
>> tries what looks like thousands of variations (I aborted the execution).
>>
>> - Dmitry
>>
>>
>> On Mon, Sep 16, 2013 at 7:56 AM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Dmitry,
>>>
>>> To clarify, the way you would need to analyze this is to run a crawl
>>> with the wildcards as you have selected, abort if necessary after a while,
>>> and then use the Document Status report to list the document identifiers
>>> that had been generated.  Find a document identifier that you believe
>>> represents a path that is illegal, and figure out what SOAP getChild call
>>> caused the problem by returning incorrect data.  In other words, find the
>>> point in the path where the path diverges from what exists into what
>>> doesn't exist, and go back in the ManifoldCF logs to find the particular
>>> SOAP request that led to the issue.
>>>
>>> I'd expect from your description that the problem lies with getting
>>> child sites given a site path, but that's just a guess at this point.
>>>
>>> Karl
>>>
>>>
>>>
>>> On Sun, Sep 15, 2013 at 6:40 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Dmitry,
>>>>
>>>> I don't understand what you mean by "I've tried the set of wildcards as
>>>> below and I seem to be running into a lot of cycles, where various subsite
>>>> folders are appended to each other and an extraction of data at all of
>>>> those locations is attempted".   If you are seeing cycles it means that
>>>> document discovery is still failing in some way.  For each
>>>> folder/library/site/subsite, only the children of that
>>>> folder/library/site/subsite should be appended to the path - ever.
>>>>
>>>> If you can give a specific example, preferably including the soap
>>>> back-and-forth, that would be very helpful.
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Sun, Sep 15, 2013 at 1:40 PM, Dmitry Goldenberg <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> Quick question. Is there an easy way to configure an SP repo
>>>>> connection for crawling all content, from the root site all the way
>>>>> down?
>>>>>
>>>>> I've tried the set of wildcards as below and I seem to be running into
>>>>> a lot of cycles, where various subsite folders are appended to each other
>>>>> and an extraction of data at all of those locations is attempted. Ideally
>>>>> I'd like to avoid having to construct an exact set of paths because the 
>>>>> set
>>>>> may change, especially with new content being added.
>>>>>
>>>>> Path rules:
>>>>> /* file include
>>>>> /* library include
>>>>> /* list include
>>>>> /* site include
>>>>>
>>>>> Metadata:
>>>>> /* include true
>>>>>
>>>>> I'd also like to pull down any files attached to list items. I'm
>>>>> hoping that some type of "/* file include" should do it, once I figure out
>>>>> how to safely include all content.
>>>>>
>>>>> Thanks,
>>>>> - Dmitry
>>>>>
>>>>
>>>>
>>>
>>
>
