Thanks, Karl. Do you think this issue is generic to any Amazon (AWS) instance? I'm just wondering how easily reproducible this may be.
For now, I assume our only workaround is to list the paths of interest manually, i.e. add explicit rules for each library and list.

A related subject - is identification and extraction of attachments supported in the SP connector? E.g. if I have a Word doc attached to a Task list item, would that be extracted? So far, I see that library content gets crawled and I'm getting the list item data but am not sure what happens to the attachments.

On Mon, Sep 16, 2013 at 10:48 AM, Karl Wright <[email protected]> wrote:

> Hi Dmitry,
>
> Thanks for the additional information. It does appear like the method
> that lists subsites is not working as expected under AWS. Nor are some
> number of other methods which supposedly just list the children of a
> subsite.
>
> I've reopened CONNECTORS-772 to work on addressing this issue. Please
> stay tuned.
>
> Karl
>
>
>
> On Mon, Sep 16, 2013 at 10:08 AM, Dmitry Goldenberg <
> [email protected]> wrote:
>
>> Hi Karl,
>>
>> Most of the paths that get generated are listed in the attached log, they
>> match what shows up in the diag report. So I'm not sure where they diverge,
>> most of them just don't seem right. There are 3 subsites rooted in the
>> main site: Abcd, Defghij, Klmnopqr. It's strange that the connector would
>> try such paths as:
>>
>> /*Klmnopqr*/*Defghij*/*Defghij*/Announcements/// -- there are multiple
>> repetitions of the same subsite on the path and to begin with, Defghij is
>> not a subsite of Klmnopqr, so why would it try this? the /// at the end
>> doesn't seem correct either, unless I'm missing something in how this
>> pathing works.
>>
>> /Test Library 1/Financia/lProjectionsTemplate.xl/Abcd/Announcements --
>> looks wrong. A docname is mixed into the path, a subsite ends up after a
>> docname?...
>>
>> /Shared Documents/Personal_Fina/ncial_Statement_1_1.xl/Defghij/ -- same
>> types of issues plus now somehow the docname got split with a forward
>> slash?..
>>
>> There are also a bunch of StringIndexOutOfBoundsException's. Perhaps
>> this logic doesn't fit with the pathing we're seeing on this amz-based
>> installation?
>>
>> I'd expect the logic to just know that root contains 3 subsites, and work
>> off that. Each subsite has a specific list of libraries and lists, etc. It
>> seems odd that the connector gets into this matching pattern, and tries
>> what looks like thousands of variations (I aborted the execution).
>>
>> - Dmitry
>>
>>
>> On Mon, Sep 16, 2013 at 7:56 AM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Dmitry,
>>>
>>> To clarify, the way you would need to analyze this is to run a crawl
>>> with the wildcards as you have selected, abort if necessary after a while,
>>> and then use the Document Status report to list the document identifiers
>>> that had been generated. Find a document identifier that you believe
>>> represents a path that is illegal, and figure out what SOAP getChild call
>>> caused the problem by returning incorrect data. In other words, find the
>>> point in the path where the path diverges from what exists into what
>>> doesn't exist, and go back in the ManifoldCF logs to find the particular
>>> SOAP request that led to the issue.
>>>
>>> I'd expect from your description that the problem lies with getting
>>> child sites given a site path, but that's just a guess at this point.
>>>
>>> Karl
>>>
>>>
>>>
>>> On Sun, Sep 15, 2013 at 6:40 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Dmitry,
>>>>
>>>> I don't understand what you mean by "I've tried the set of wildcards as
>>>> below and I seem to be running into a lot of cycles, where various subsite
>>>> folders are appended to each other and an extraction of data at all of
>>>> those locations is attempted". If you are seeing cycles it means that
>>>> document discovery is still failing in some way. For each
>>>> folder/library/site/subsite, only the children of that
>>>> folder/library/site/subsite should be appended to the path - ever.
>>>>
>>>> If you can give a specific example, preferably including the soap
>>>> back-and-forth, that would be very helpful.
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Sun, Sep 15, 2013 at 1:40 PM, Dmitry Goldenberg <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> Quick question. Is there an easy way to configure an SP repo
>>>>> connection for crawling of all content, from the root site all the way
>>>>> down?
>>>>>
>>>>> I've tried the set of wildcards as below and I seem to be running into
>>>>> a lot of cycles, where various subsite folders are appended to each other
>>>>> and an extraction of data at all of those locations is attempted. Ideally
>>>>> I'd like to avoid having to construct an exact set of paths because the
>>>>> set may change, especially with new content being added.
>>>>>
>>>>> Path rules:
>>>>> /* file include
>>>>> /* library include
>>>>> /* list include
>>>>> /* site include
>>>>>
>>>>> Metadata:
>>>>> /* include true
>>>>>
>>>>> I'd also like to pull down any files attached to list items. I'm
>>>>> hoping that some type of "/* file include" should do it, once I figure out
>>>>> how to safely include all content.
>>>>>
>>>>> Thanks,
>>>>> - Dmitry
>>>>>
>>>>
>>>>
>>>
>>
>
