Hi Dmitry,

To clarify, the way you would need to analyze this is to run a crawl with
the wildcards as you have selected (aborting it after a while if
necessary), and then use the Document Status report to list the document
identifiers that have been generated.  Find a document identifier that you
believe represents an illegal path, and figure out which SOAP getChild
call caused the problem by returning incorrect data.  In other words, find
the point where the path diverges from what exists into what doesn't, and
go back in the ManifoldCF logs to find the particular SOAP request that
led to the issue.
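As a rough sketch (this is not ManifoldCF code; the log format, the
"getChild" operation name, and the example paths are all assumptions),
the divergence check I have in mind looks something like this:

```python
# Hedged sketch: locate the first path prefix of a generated document
# identifier that does not actually exist on the SharePoint side, then
# pull the log lines for the SOAP call that produced it. Adjust the
# operation name and matching to your actual manifoldcf.log contents.

def first_bad_segment(doc_path, known_good):
    """Walk the identifier's path left to right and return the first
    prefix that is not in the set of paths known to exist."""
    segments = [s for s in doc_path.split("/") if s]
    prefix = ""
    for seg in segments:
        prefix += "/" + seg
        if prefix not in known_good:
            return prefix
    return None

def soap_lines_for(log_lines, path_prefix, op="getChild"):
    """Filter log lines mentioning both the SOAP operation and the
    suspect path prefix."""
    return [ln for ln in log_lines if op in ln and path_prefix in ln]

# Example with made-up data:
known = {"/sites", "/sites/teamA"}
print(first_bad_segment("/sites/teamA/teamA/docs", known))
# -> /sites/teamA/teamA
```

Once you have that prefix, the log lines it matches should point at the
specific SOAP round trip that returned the bogus child.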

I'd expect from your description that the problem lies with getting child
sites given a site path, but that's just a guess at this point.
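To illustrate why bad child data produces the "cycles" you described
(again, a hypothetical illustration, not ManifoldCF's actual code):
discovery should append only a child's own name to the parent path, so
if the service hands back something like an absolute path and it gets
appended verbatim, segments duplicate at every level of recursion.

```python
# Hedged illustration: recursive discovery with a depth limit so a
# misbehaving child lookup is visible rather than infinite.

def discover(path, get_children, depth=0, limit=5):
    """Enumerate paths by appending each child name to its parent."""
    if depth >= limit:
        return [path + "  <-- runaway"]
    found = [path]
    for child in get_children(path):
        found += discover(path + "/" + child, get_children, depth + 1, limit)
    return found

# Correct behavior: children are relative names, recursion bottoms out.
tree = {"": ["sites"], "/sites": ["teamA"], "/sites/teamA": []}
good = lambda p: tree.get(p, [])
print(discover("", good))  # -> ['', '/sites', '/sites/teamA']

# Buggy behavior: the lookup echoes the parent path back, so each level
# re-appends it and paths grow without bound until the limit trips.
bad = lambda p: [p.lstrip("/")] if p else ["sites"]
print(discover("", bad)[-1])  # e.g. /sites/sites/sites/...  <-- runaway
```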

Karl



On Sun, Sep 15, 2013 at 6:40 PM, Karl Wright <[email protected]> wrote:

> Hi Dmitry,
>
> I don't understand what you mean by "I've tried the set of wildcards as
> below and I seem to be running into a lot of cycles, where various subsite
> folders are appended to each other and an extraction of data at all of
> those locations is attempted".   If you are seeing cycles it means that
> document discovery is still failing in some way.  For each
> folder/library/site/subsite, only the children of that
> folder/library/site/subsite should be appended to the path - ever.
>
> If you can give a specific example, preferably including the soap
> back-and-forth, that would be very helpful.
>
> Karl
>
>
>
> On Sun, Sep 15, 2013 at 1:40 PM, Dmitry Goldenberg <[email protected]> wrote:
>
>> Hi Karl,
>>
>> Quick question. Is there an easy way to configure an SP repo connection
>> for crawling of all content, from the root site all the way down?
>>
>> I've tried the set of wildcards as below and I seem to be running into a
>> lot of cycles, where various subsite folders are appended to each other and
>> an extraction of data at all of those locations is attempted. Ideally I'd
>> like to avoid having to construct an exact set of paths because the set may
>> change, especially with new content being added.
>>
>> Path rules:
>> /* file include
>> /* library include
>> /* list include
>> /* site include
>>
>> Metadata:
>> /* include true
>>
>> I'd also like to pull down any files attached to list items. I'm hoping
>> that some type of "/* file include" should do it, once I figure out how to
>> safely include all content.
>>
>> Thanks,
>> - Dmitry
>>
>
>