"Is the connector capable of determining that a given list item has
attachment files?  If so, will it be able to extract those?"

The Lists part of the connector currently only knows how to deal with
fields as metadata.  In order to deal with list item attachment files,
first we'd have to understand how they work (I doubt, for instance, that
they are returned from GetListItems).  How difficult this would be to
implement would depend on those details.
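
For what it's worth, the Lists web service does appear to expose a
GetAttachmentCollection operation that returns the attachment URLs for a
single list item, so an extension would probably involve a call along these
lines.  This is only a rough sketch of the SOAP request body such a call
would send -- the list name ("Tasks") and item ID are made-up placeholders,
and a real call would POST this to /_vti_bin/Lists.asmx with the proper
SOAPAction header and authentication:

```python
# Sketch: the SOAP request body for SharePoint's Lists.asmx
# GetAttachmentCollection operation, which returns the attachment URLs
# for one list item.  The list name and item ID are placeholders.

def attachment_collection_envelope(list_name, item_id):
    """Return the SOAP envelope for a GetAttachmentCollection call."""
    return (
        '<?xml version="1.0" encoding="utf-8"?>'
        '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
        '<soap:Body>'
        '<GetAttachmentCollection'
        ' xmlns="http://schemas.microsoft.com/sharepoint/soap/">'
        f'<listName>{list_name}</listName>'
        f'<listItemID>{item_id}</listItemID>'
        '</GetAttachmentCollection>'
        '</soap:Body>'
        '</soap:Envelope>'
    )

envelope = attachment_collection_envelope("Tasks", "3")
print(envelope)
```

The response would then have to be parsed for attachment URLs and each URL
fetched as its own document, which is the part whose difficulty depends on
the details above.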

Karl


On Mon, Sep 16, 2013 at 11:37 AM, Dmitry Goldenberg
<[email protected]> wrote:

> Sorry, Karl, with regard to attachments I wasn't clear.  Is the connector
> capable of determining that a given list item has attachment files?  If so,
> will it be able to extract those?  Does it have a way of associating
> extracted documents with their attachments?
>
> If this is not something easily available, how difficult would it be to
> add this in my custom output connector?  Do the attachments 'hang' off the
> repository document, or...?
>
>
> On Mon, Sep 16, 2013 at 11:20 AM, Karl Wright <[email protected]> wrote:
>
>> "Do you think that this issue is generic with regard to any Amz instance?"
>>
>> I presume so, since you didn't apparently do anything special to set one
>> of these up.  Unfortunately, such instances are not part of the free tier,
>> so I am still constrained from setting one up for myself because of
>> household rules here.
>>
>> "For now, I assume our only workaround is to list the paths of interest
>> manually"
>>
>> Depending on what is going wrong, that may not even work.  For this to
>> happen, it looks like several SharePoint web service calls would have to be
>> affected, and not in a cleanly predictable way.
>>
>> "is identification and extraction of attachments supported in the SP
>> connector?"
>>
>> ManifoldCF in general leaves identification and extraction to the search
>> engine.  Solr, for instance, uses Tika for this, if so configured.  You can
>> configure your Solr output connection to include or exclude specific mime
>> types or extensions if you want to limit what is attempted.
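
As an illustration of that kind of filtering -- this shows the general idea
only, not ManifoldCF's actual configuration code, and the include/exclude
sets are invented:

```python
# Sketch of the mime-type / extension filtering an output connection
# might apply before handing documents to Solr (and thus Tika).
# The include/exclude sets below are invented for illustration.

INCLUDED_MIME_TYPES = {"application/pdf", "application/msword", "text/plain"}
EXCLUDED_EXTENSIONS = {".exe", ".zip"}

def should_index(filename, mime_type):
    """Return True if the document should be sent to the search engine."""
    if any(filename.lower().endswith(ext) for ext in EXCLUDED_EXTENSIONS):
        return False
    return mime_type in INCLUDED_MIME_TYPES

print(should_index("report.pdf", "application/pdf"))          # True
print(should_index("setup.exe", "application/octet-stream"))  # False
```

In the real connector this is driven by the output connection's
configuration UI rather than code, but the effect is the same: documents
failing the filter are never attempted.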
>>
>> Karl
>>
>>
>>
>>
>>
>> On Mon, Sep 16, 2013 at 11:09 AM, Dmitry Goldenberg <
>> [email protected]> wrote:
>>
>>> Thanks, Karl. Do you think that this issue is generic with regard to any
>>> Amz instance? I'm just wondering how easily reproducible this may be..
>>>
>>> For now, I assume our only workaround is to list the paths of interest
>>> manually, i.e. add explicit rules for each library and list.
>>>
>>> A related subject - is identification and extraction of attachments
>>> supported in the SP connector?  E.g. if I have a Word doc attached to a
>>> Task list item, would that be extracted?  So far, I see that library
>>> content gets crawled and I'm getting the list item data but am not sure
>>> what happens to the attachments.
>>>
>>>
>>> On Mon, Sep 16, 2013 at 10:48 AM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Dmitry,
>>>>
>>>> Thanks for the additional information.  It does appear like the method
>>>> that lists subsites is not working as expected under AWS.  Nor are some
>>>> number of other methods which supposedly just list the children of a
>>>> subsite.
>>>>
>>>> I've reopened CONNECTORS-772 to work on addressing this issue.  Please
>>>> stay tuned.
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Mon, Sep 16, 2013 at 10:08 AM, Dmitry Goldenberg <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> Most of the paths that get generated are listed in the attached log,
>>>>> they match what shows up in the diag report, so I'm not sure where they
>>>>> diverge; most of them just don't seem right.  There are 3 subsites rooted
>>>>> in the main site: Abcd, Defghij, Klmnopqr.  It's strange that the 
>>>>> connector
>>>>> would try such paths as:
>>>>>
>>>>> /*Klmnopqr*/*Defghij*/*Defghij*/Announcements/// -- there are
>>>>> multiple repetitions of the same subsite on the path, and to begin with,
>>>>> Defghij is not a subsite of Klmnopqr, so why would it try this?  The ///
>>>>> at the end doesn't seem correct either, unless I'm missing something in
>>>>> how this pathing works.
>>>>>
>>>>> /Test Library 1/Financia/lProjectionsTemplate.xl/Abcd/Announcements --
>>>>> looks wrong.  A docname is mixed into the path, and a subsite ends up
>>>>> after a docname?...
>>>>>
>>>>> /Shared Documents/Personal_Fina/ncial_Statement_1_1.xl/Defghij/ --
>>>>> same types of issues plus now somehow the docname got split with a forward
>>>>> slash?..
>>>>>
>>>>> There are also a bunch of StringIndexOutOfBoundsExceptions.  Perhaps
>>>>> this logic doesn't fit with the pathing we're seeing on this amz-based
>>>>> installation?
>>>>>
>>>>> I'd expect the logic to just know that root contains 3 subsites, and
>>>>> work off that. Each subsite has a specific list of libraries and lists,
>>>>> etc. It seems odd that the connector gets into this matching pattern, and
>>>>> tries what looks like thousands of variations (I aborted the execution).
>>>>>
>>>>> - Dmitry
>>>>>
>>>>>
>>>>> On Mon, Sep 16, 2013 at 7:56 AM, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Hi Dmitry,
>>>>>>
>>>>>> To clarify, the way you would need to analyze this is to run a crawl
>>>>>> with the wildcards as you have selected, abort if necessary after a 
>>>>>> while,
>>>>>> and then use the Document Status report to list the document identifiers
>>>>>> that had been generated.  Find a document identifier that you believe
>>>>>> represents a path that is illegal, and figure out what SOAP getChild call
>>>>>> caused the problem by returning incorrect data.  In other words, find the
>>>>>> point in the path where the path diverges from what exists into what
>>>>>> doesn't exist, and go back in the ManifoldCF logs to find the particular
>>>>>> SOAP request that led to the issue.
>>>>>>
>>>>>> I'd expect from your description that the problem lies with getting
>>>>>> child sites given a site path, but that's just a guess at this point.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Sep 15, 2013 at 6:40 PM, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Dmitry,
>>>>>>>
>>>>>>> I don't understand what you mean by "I've tried the set of wildcards
>>>>>>> as below and I seem to be running into a lot of cycles, where various
>>>>>>> subsite folders are appended to each other and an extraction of data at 
>>>>>>> all
>>>>>>> of those locations is attempted".   If you are seeing cycles it means 
>>>>>>> that
>>>>>>> document discovery is still failing in some way.  For each
>>>>>>> folder/library/site/subsite, only the children of that
>>>>>>> folder/library/site/subsite should be appended to the path - ever.
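
That child-only invariant can be sketched like this -- the site tree below
is a made-up stand-in for the SOAP getChild responses, not real data, but it
shows why a well-formed crawl can never produce a path that repeats a
subsite the way /Klmnopqr/Defghij/Defghij/... does:

```python
# Illustrative sketch (not ManifoldCF code): correct document discovery
# appends only the direct children of each node to the path, so no path
# segment can ever repeat.  The tree is invented for illustration.

def discover(tree, path=""):
    """Yield every path reachable by appending only direct children."""
    for name, children in tree.items():
        child_path = f"{path}/{name}"
        yield child_path
        yield from discover(children, child_path)

site_tree = {
    "Abcd": {"Announcements": {}},
    "Defghij": {},
    "Klmnopqr": {},
}

paths = list(discover(site_tree))
print(paths)  # -> ['/Abcd', '/Abcd/Announcements', '/Defghij', '/Klmnopqr']
```

If the getChild responses themselves are wrong (as suspected on the AWS
instance), even this traversal would emit bogus paths, which is why the SOAP
back-and-forth is the thing to capture.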
>>>>>>>
>>>>>>> If you can give a specific example, preferably including the soap
>>>>>>> back-and-forth, that would be very helpful.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Sep 15, 2013 at 1:40 PM, Dmitry Goldenberg <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> Quick question. Is there an easy way to configure an SP repo
>>>>>>>> connection for crawling of all content, from the root site all the way 
>>>>>>>> down?
>>>>>>>>
>>>>>>>> I've tried the set of wildcards as below and I seem to be running
>>>>>>>> into a lot of cycles, where various subsite folders are appended to 
>>>>>>>> each
>>>>>>>> other and an extraction of data at all of those locations is attempted.
>>>>>>>> Ideally I'd like to avoid having to construct an exact set of paths 
>>>>>>>> because
>>>>>>>> the set may change, especially with new content being added.
>>>>>>>>
>>>>>>>> Path rules:
>>>>>>>> /* file include
>>>>>>>> /* library include
>>>>>>>> /* list include
>>>>>>>> /* site include
>>>>>>>>
>>>>>>>> Metadata:
>>>>>>>> /* include true
>>>>>>>>
>>>>>>>> I'd also like to pull down any files attached to list items. I'm
>>>>>>>> hoping that some type of "/* file include" should do it, once I figure 
>>>>>>>> out
>>>>>>>> how to safely include all content.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> - Dmitry
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>