Re: HTTP 302 error causing job to abort

Karl Wright Mon, 22 Feb 2016 05:00:47 -0800

Any news on this research?
Karl
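
For anyone picking this up, the check requested in the quoted message below (is there anything next to a given "Url" besides "Title"?) could be sketched roughly as follows. This is only an illustration under assumptions: the class name, the sample XML, and the "External" attribute are all hypothetical, and the real getSites response may be shaped differently. The sketch accepts "Url" and "Title" either as attributes or as child nodes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of ManifoldCF.
public class WebsResponseInspector {

    // "Url" and "Title" are the names we already expect; namespace declarations are ignored.
    private static boolean isExpected(String name) {
        return name.equals("Url") || name.equals("Title") || name.startsWith("xmlns");
    }

    /** Maps each site URL in the response to the extra attributes/child nodes found beside it. */
    public static Map<String, List<String>> extrasByUrl(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        Map<String, List<String>> result = new LinkedHashMap<>();
        NodeList elements = doc.getElementsByTagName("*");
        for (int i = 0; i < elements.getLength(); i++) {
            Element e = (Element) elements.item(i);
            String url = e.hasAttribute("Url") ? e.getAttribute("Url") : null;
            List<String> extras = new ArrayList<>();
            // Attributes other than Url/Title.
            NamedNodeMap attrs = e.getAttributes();
            for (int j = 0; j < attrs.getLength(); j++) {
                Node a = attrs.item(j);
                if (!isExpected(a.getNodeName())) {
                    extras.add("@" + a.getNodeName() + "=" + a.getNodeValue());
                }
            }
            // Direct child elements: pick up a Url child, record anything unexpected.
            for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeType() != Node.ELEMENT_NODE) continue;
                if (c.getNodeName().equals("Url")) {
                    url = c.getTextContent().trim();
                } else if (!isExpected(c.getNodeName())) {
                    extras.add(c.getNodeName() + "=" + c.getTextContent().trim());
                }
            }
            // Only elements that actually describe a site URL are reported.
            if (url != null) result.put(url, extras);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample; paste the logged getSites xml response here instead.
        String xml = "<Webs>"
            + "<Web Title=\"Team\" Url=\"http://sp.example.com/team\"/>"
            + "<Web Title=\"Partner\" Url=\"http://sp.example.com/partner\" External=\"true\"/>"
            + "</Webs>";
        extrasByUrl(xml).forEach((url, extras) -> System.out.println(url + " -> " + extras));
    }
}
```

Running it against the logged response and looking at the entry for the redirecting URL should show whether anything sets that site apart from the others.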


On Fri, Feb 19, 2016 at 12:46 AM, Karl Wright <[email protected]> wrote:

> Hi Phil,
>
> Thanks -- this information is more helpful.
>
> So my understanding is that there is an external site reference in your
> site/subsite hierarchy?  And the *root* site (the one that you point at
> when you configure the connection itself) is *not* external after all?
>
> If that is the case, then the external site must be being "discovered"
> through the Webs service API call.  There are two ways forward:
>
> (1)  We can change the Webs response parsing to detect external sites and
> not include those in the crawl, or
> (2) We can try to make decisions based on whether a 302 comes back as a
> response code.
>
> (1) is by far the best approach, but it will require some cooperation and
> execution of sample code on your part.  Essentially I'll need to see what
> the xml is that is coming back that first describes the external site and
> see if there is an attribute that lets us know it is external.  That way I
> can properly just skip it entirely.
>
> We can have a look at what comes back from SharePoint for this API
> response if you enable connector debugging in properties.xml:
>
> <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
>
> ... and restart.  You will then need to do a crawl.  The following line
> will be what you look for:
>
> Logging.connectors.debug("SharePoint: getSites xml response: "+xmlResponse);
>
> This xml response will contain "Url" and "Title" nodes; what I need to
> know is whether there's any attribute of the "Url" node, or parallel node
> other than "Url" or "Title", that contains an indication of whether the Url
> that describes the external site is indeed external.  So you look for the
> Url that describes the SharePoint URL that has the redirection, and tell me
> if there's anything special about it in the associated getSites response.
> Does that make sense?
>
> If this is too hard, alternative (2) is possible, but it will require tons
> of individual changes.  So let's look into (1) first.
>
> Thanks
> Karl
>
>
> On Thu, Feb 18, 2016 at 11:49 PM, Phil Riethmuller <
> [email protected]> wrote:
>
>> Hi Karl,
>>
>> Some further info:
>>
>>    - The problem document that Manifold reported is redirecting to an
>>    external site.
>>    - We tried crawling a smaller subset of content on the same
>>    Sharepoint site that definitely doesn’t contain any external links in the
>>    content, and this works OK.
>>    - The job that errors with the 302 says it has found 529 docs so far
>>    and processed 127 of them. This seems to indicate that it has in fact
>>    found some documents.
>>
>> I’m not sure what you mean that the error is being generated from the API
>> call, and not an individual document? The info appears to indicate it is
>> not all documents, but just selected documents.
>>
>> There really isn’t much we can do about this from the Sharepoint
>> configuration side. Is there any way we can test whether it is as simple
>> as the 302 coming from the documents themselves?
>>
>> Thanks for your help to date.
>>
>> Phil
>>
>>
>> From: Karl Wright <[email protected]>
>> Reply-To: <[email protected]>
>> Date: Thursday, 18 February 2016 10:31 am
>>
>> To: "[email protected]" <[email protected]>
>> Subject: Re: HTTP 302 error causing job to abort
>>
>> Hi Phil,
>>
>> The 302 error is not coming from a single document.  If it *was* coming
>> from the fetch of an individual document, it would be easy to work around.
>> But, from your stack trace, it is clear that this error is coming from an
>> API call, specifically a call to enumerate subsites of a given site.  That
>> means that some or all of the SharePoint hierarchy is not accessible
>> through POST requests.  I have never seen this kind of behavior from
>> SharePoint before.
>>
>> This is not something that I can work around without more information.
>> In order to get that information, you will at the very minimum need to turn
>> on connector debugging, and probably turning on http wire debugging would
>> be helpful too.  And, if what you said about the View page for this
>> connection is true and it also shows a 302 error, I very much suspect that
>> something changed on the server end and you are currently unable to crawl
>> *any* documents at all.
>>
>> I am sorry I cannot make this any clearer.
>>
>> Thanks,
>> Karl
>>
>>
>>
>>
>> On Wed, Feb 17, 2016 at 6:20 PM, Phil Riethmuller <
>> [email protected]> wrote:
>>
>>> Hi Karl,
>>>
>>> Thanks for the update.
>>>
>>> I’m not 100% sure how many documents have this redirect in them, but
>>> I’ll see if I can get a better estimate. The content we are crawling is
>>> substantially large, and comes from many different authors so it’s
>>> difficult to manage how these Sharepoint documents are created. It makes it
>>> extremely difficult to pinpoint all the documents that contain redirects.
>>>
>>> Am I correct in assuming a single 302 error causes the job to fail, or
>>> is there some other logic that determines this?
>>>
>>> How plausible would it be to include in the product an option for
>>> treating 302s as a warning, rather than a fatal error? Possibly just an
>>> option in the Job setup?
>>>
>>> Regards,
>>> Phil
>>>
>>>
>>> From: Karl Wright <[email protected]>
>>> Reply-To: <[email protected]>
>>> Date: Thursday, 18 February 2016 1:39 am
>>>
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: HTTP 302 error causing job to abort
>>>
>>> Hi again Phil,
>>>
>>> The HttpClient team points out that POST requests (as we do for the
>>> SharePoint repository requests) are not allowed to follow 302 redirections
>>> according to RFC2616.  We use POST requests because, for SOAP, there is
>>> often quite a bit of XML data that goes along with the request, and we
>>> would otherwise have size issues.  So we cannot use GET instead of POST.
>>> See CONNECTORS-1279 for details.
>>>
>>> If you still believe that it is only a couple of URLs that are returning
>>> 302 for you, I'd like some analysis of why you believe that to be true.  I
>>> would be happy to consider recognition of an occasional 302 response as
>>> meaning "skip this document".  On the other hand, based on your stack
>>> trace, it really appears that you have a far more systemic problem; it is
>>> failing while obtaining information for an entire site, so not much would
>>> get crawled in that case.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Tue, Feb 16, 2016 at 5:47 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Phil,
>>>>
>>>> It is not surprising that the connector doesn't like 302 responses and
>>>> doesn't know what to do with them, because it isn't supposed to ever be
>>>> getting any of these.
>>>>
>>>> I am puzzled by your statement that "only a couple of documents have
>>>> redirections in them", because the connector crawls Lists and Library
>>>> documents within SharePoint *only*, and these are very specifically
>>>> accessible through a SharePoint URL hierarchy structure.  There's no room
>>>> in any of that for a 302 redirection.  Since you see a 302 in the UI, I
>>>> feel pretty certain you have a problem with your configuration and it is
>>>> not just "a couple of documents".
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Feb 16, 2016 at 5:22 PM, Phil Riethmuller <
>>>> [email protected]> wrote:
>>>>
>>>>> Thanks Karl,
>>>>>
>>>>> The majority of content is not going to the redirect, it’s probably
>>>>> just a handful of documents that are behaving this way.
>>>>>
>>>>> I’d agree that it’s of lesser concern whether or not the document
>>>>> itself is indexing, however I wouldn’t expect the 302 to be treated as a
>>>>> fatal error that causes the job to come to a halt. I’d expect the document
>>>>> to be passed over, and the crawl to continue.
>>>>>
>>>>> Is the only solution at this point to remove the documents that
>>>>> return a 302, so that the crawl can run in full?
>>>>>
>>>>> Regards,
>>>>>
>>>>> *Phil Riethmuller*
>>>>> Technical Consultant
>>>>>
>>>>> *Funnelback |* 437 Kent Street, Sydney, NSW 2000
>>>>> *T* +61 2 9045 2882 | funnelback.com <http://www.funnelback.com/>
>>>>>
>>>>> *AUSTRALIA* | UNITED KINGDOM | NEW ZEALAND | POLAND | UNITED STATES
>>>>>
>>>>> Connect with us: LinkedIn <http://www.linkedin.com/company/funnelback>
>>>>>  - *Twitter*
>>>>>
>>>>>
>>>>> From: Karl Wright <[email protected]>
>>>>> Reply-To: <[email protected]>
>>>>> Date: Wednesday, 17 February 2016 8:58 am
>>>>>
>>>>> To: "[email protected]" <[email protected]>
>>>>> Subject: Re: HTTP 302 error causing job to abort
>>>>>
>>>>> Hi Phil,
>>>>>
>>>>> You probably want to point your SharePoint repository connection to
>>>>> the proper server and site, and not rely on redirections.  It's also
>>>>> possible that you are missing the site entirely and the redirection you
>>>>> are seeing is taking you to some error page somewhere.
>>>>>
>>>>> I will be raising the question of redirections with the
>>>>> HttpComponents/HttpClient team, since I see no obvious problems with the
>>>>> SharePoint connector code.  However, if your connection is properly set
>>>>> up, redirections should be unneeded.
>>>>>
>>>>> I would read the documentation on the Wiki page for debugging
>>>>> SharePoint connections at the bottom of this page:
>>>>> https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Feb 16, 2016 at 4:55 PM, Phil Riethmuller <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Do you mean in the job status in the ManifoldCF interface?
>>>>>>
>>>>>> The job status also shows the same:
>>>>>> Error: Unexpected http error code 302 accessing SharePoint at <url>:
>>>>>> (302)HTTP/1.0 302 Found
>>>>>>
>>>>>> I agree, I wouldn’t have thought that the crawler would follow any
>>>>>> links or redirections.
>>>>>>
>>>>>> What sort of settings could be misconfigured that I could look at
>>>>>> revising?
>>>>>>
>>>>>> Phil
>>>>>>
>>>>>>
>>>>>> From: Karl Wright <[email protected]>
>>>>>> Reply-To: <[email protected]>
>>>>>> Date: Wednesday, 17 February 2016 8:45 am
>>>>>>
>>>>>> To: "[email protected]" <[email protected]>
>>>>>> Subject: Re: HTTP 302 error causing job to abort
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> When you view the repository connection in the UI, do you get a 302
>>>>>> error also?
>>>>>>
>>>>>> I have looked at the code; HttpClient is supposedly configured to
>>>>>> honor redirections.  Obviously it is not doing that, so I'll have to dig
>>>>>> deeper into why that is.  On the other hand, I would not expect you to be
>>>>>> getting any redirections, unless you have configured your connection
>>>>>> incorrectly.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 16, 2016 at 4:31 PM, Phil Riethmuller <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Thanks Karl -
>>>>>>>
>>>>>>> I’ve replaced the actual URL with <URL> below, but here is the stack
>>>>>>> trace:
>>>>>>>
>>>>>>> ERROR 2016-02-16 12:10:55,251 (Worker thread '16') - Exception tossed: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>>>>>>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2246)
>>>>>>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1549)
>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>>> Caused by: (302)HTTP/1.0 302 Found
>>>>>>>         at org.apache.manifoldcf.connectorcommon.common.CommonsHTTPSender.invoke(CommonsHTTPSender.java:201)
>>>>>>>         at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
>>>>>>>         at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
>>>>>>>         at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
>>>>>>>         at org.apache.axis.client.AxisClient.invoke(AxisClient.java:165)
>>>>>>>         at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
>>>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2767)
>>>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2443)
>>>>>>>         at org.apache.axis.client.Call.invoke(Call.java:2366)
>>>>>>>         at org.apache.axis.client.Call.invoke(Call.java:1812)
>>>>>>>         at com.microsoft.schemas.sharepoint.soap.WebsSoapStub.getWebCollection(WebsSoapStub.java:854)
>>>>>>>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2161)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> *Phil Riethmuller*
>>>>>>> Technical Consultant
>>>>>>>
>>>>>>>
>>>>>>> From: Karl Wright <[email protected]>
>>>>>>> Reply-To: <[email protected]>
>>>>>>> Date: Tuesday, 16 February 2016 6:54 pm
>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>> Subject: Re: HTTP 302 error causing job to abort
>>>>>>>
>>>>>>> Hi Phil,
>>>>>>>
>>>>>>> An HTTP 302 response is simply a redirection.  It should not, by
>>>>>>> itself, cause a job to abort.  I would expect that to go by in wire/http
>>>>>>> logging, but you should not see it anywhere else.  So it is not clear
>>>>>>> to me what you are really seeing here.
>>>>>>>
>>>>>>> Can you include an example stack trace from the manifoldcf log?
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 16, 2016 at 12:22 AM, Phil Riethmuller <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi -
>>>>>>>>
>>>>>>>> When crawling a SharePoint repository, I’m receiving an HTTP 302
>>>>>>>> error which is causing the Manifold job to abort. How do I prevent the
>>>>>>>> crawler from aborting the job?
>>>>>>>>
>>>>>>>> I’m using v2.3 of Manifold with a postgres database.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Phil
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
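
Alternative (2) from the thread above, deciding what to do based on whether a 302 comes back, could look roughly like the sketch below. It is an assumption-laden illustration only: the class, the enum, and the three-way outcome are hypothetical and are not how the SharePoint connector is written. The point is just that a 302 on a site-enumeration call is mapped to "skip this site" instead of a fatal error; the POST itself is still never re-sent to the redirect target.

```java
import java.net.HttpURLConnection;

// Hypothetical policy class, not part of ManifoldCF.
public class RedirectPolicy {

    public enum Action { PROCESS, SKIP, ABORT }

    /** Decides how a crawl step might react to the HTTP status of an API call. */
    public static Action decide(int status) {
        if (status >= 200 && status < 300) {
            return Action.PROCESS;            // normal response: parse it
        }
        if (status == HttpURLConnection.HTTP_MOVED_TEMP) {
            return Action.SKIP;               // 302: treat the site as external and pass over it
        }
        return Action.ABORT;                  // anything else keeps the existing fatal behavior
    }

    public static void main(String[] args) {
        for (int status : new int[] {200, 302, 500}) {
            System.out.println(status + " -> " + decide(status));
        }
    }
}
```

As noted in the thread, this route would need the same decision applied at every call site, which is why detecting external sites in the Webs response was preferred.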
