Thanks Karl, I'll take a look at this today.
Regards,

Phil Riethmuller
Technical Consultant

Funnelback | 437 Kent Street, Sydney, NSW 2000
T +61 2 9045 2882 | funnelback.com <http://www.funnelback.com/>

AUSTRALIA | UNITED KINGDOM | NEW ZEALAND | POLAND | UNITED STATES

Connect with us: LinkedIn <http://www.linkedin.com/company/funnelback> - Twitter

From: Karl Wright <[email protected]>
Reply-To: <[email protected]>
Date: Monday, 22 February 2016 11:32 pm
To: "[email protected]" <[email protected]>
Subject: Re: HTTP 302 error causing job to abort

Any news on this research?

Karl

On Fri, Feb 19, 2016 at 12:46 AM, Karl Wright <[email protected]> wrote:
> Hi Phil,
>
> Thanks -- this information is more helpful.
>
> So my understanding is that there is an external site reference in your
> site/subsite hierarchy? And the *root* site (the one that you point at when
> you configure the connection itself) is *not* external after all?
>
> If that is the case, then the external site must be being "discovered"
> through the Webs service API call. There are two ways forward:
>
> (1) We can change the Webs response parsing to detect external sites and not
> include those in the crawl, or
> (2) We can try to make decisions based on whether a 302 comes back as a
> response code.
>
> (1) is by far the best approach but it will require some cooperation and
> execution of sample code on your part. Essentially I'll need to see what the
> xml is that is coming back that first describes the external site and see if
> there is an attribute that lets us know it is external. That way I can just
> skip it entirely.
>
> We can have a look at what comes back from SharePoint for this API response
> if you enable connector debugging in properties.xml:
>
> <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
>
> ... and restart. You will then need to do a crawl.
> The following line will be what you look for:
>
> Logging.connectors.debug("SharePoint: getSites xml response: "+xmlResponse);
>
> This xml response will contain "Url" and "Title" nodes; what I need to know
> is whether there's any attribute of the "Url" node, or parallel node other
> than "Url" or "Title", that contains an indication of whether the Url that
> describes the external site is indeed external. So you look for the Url that
> describes the SharePoint URL that has the redirection, and tell me if there's
> anything special about it in the associated getSites response. Does that make
> sense?
>
> If this is too hard, alternative (2) is possible, but it will require tons of
> individual changes. So let's look into (1) first.
>
> Thanks
> Karl
>
>
> On Thu, Feb 18, 2016 at 11:49 PM, Phil Riethmuller
> <[email protected]> wrote:
>> Hi Karl,
>>
>> Some further info:
>> * The problem document that Manifold reported is redirecting to an external
>> site.
>> * We tried crawling a smaller subset of content on the same Sharepoint site
>> that definitely doesn't contain any external links in the content, and this
>> works OK.
>> * The job that errors with the 302 says it has found 529 docs so far and
>> processed 127 of them. This seems to indicate that it has in fact found some
>> documents.
>> I'm not sure what you mean that the error is being generated from the API
>> call, and not an individual document? The info appears to indicate it is not
>> all documents, but just selected documents.
>>
>> There really isn't much we can do about this from the Sharepoint
>> configuration side. Is there any way we can test if it is as simple as the
>> 302 coming from the documents themselves?
>>
>> Thanks for your help to date.
>>
>> Phil
>>
>>
>> From: Karl Wright <[email protected]>
>> Reply-To: <[email protected]>
>> Date: Thursday, 18 February 2016 10:31 am
>> To: "[email protected]" <[email protected]>
>> Subject: Re: HTTP 302 error causing job to abort
>>
>> Hi Phil,
>>
>> The 302 error is not coming from a single document. If it *was* coming from
>> the fetch of an individual document, it would be easy to work around. But,
>> from your stack trace, it is clear that this error is coming from an API
>> call, specifically a call to enumerate subsites of a given site. That means
>> that some or all of the SharePoint hierarchy is not accessible through POST
>> requests. I have never seen this kind of behavior from SharePoint before.
>>
>> This is not something that I can work around without more information. In
>> order to get that information, you will at the very minimum need to turn on
>> connector debugging, and probably turning on http wire debugging would be
>> helpful too. And, if what you said about the View page for this connection
>> is true and it also shows a 302 error, I very much suspect that something
>> changed on the server end and you are currently unable to crawl *any*
>> documents at all.
>>
>> I am sorry I cannot make this any clearer.
>>
>> Thanks,
>> Karl
>>
>>
>> On Wed, Feb 17, 2016 at 6:20 PM, Phil Riethmuller
>> <[email protected]> wrote:
>>> Hi Karl,
>>>
>>> Thanks for the update.
>>>
>>> I'm not 100% sure how many documents have this redirect in them, but I'll
>>> see if I can get a better estimate. The content we are crawling is
>>> substantially large, and comes from many different authors so it's
>>> difficult to manage how these Sharepoint documents are created. It makes it
>>> extremely difficult to pinpoint all the documents that contain redirects.
>>>
>>> Am I correct in assuming a single 302 error causes the job to fail, or is
>>> there some other logic that determines this?
>>>
>>> How plausible would it be to include in the product an option for treating
>>> 302s as a warning, rather than a fatal error? Possibly just an option in
>>> the Job setup?
>>>
>>> Regards,
>>> Phil
>>>
>>>
>>> From: Karl Wright <[email protected]>
>>> Reply-To: <[email protected]>
>>> Date: Thursday, 18 February 2016 1:39 am
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: HTTP 302 error causing job to abort
>>>
>>> Hi again Phil,
>>>
>>> The HttpClient team points out that POST requests (as we do for the
>>> SharePoint repository requests) are not allowed to follow 302 redirections
>>> according to RFC2616. We use POST requests because, for SOAP, there is
>>> often quite a bit of XML data that goes along with the request, and we
>>> would otherwise have size issues. So we cannot use GET instead of POST. See
>>> CONNECTORS-1279 for details.
>>>
>>> If you still believe that it is only a couple of URLs that are returning
>>> 302 for you, I'd like some analysis of why you believe that to be true. I
>>> would be happy to consider recognition of an occasional 302 response as
>>> meaning "skip this document". On the other hand, based on your stack trace,
>>> it really appears that you have a far more systemic problem; it is failing
>>> while obtaining information for an entire site, so not much would get
>>> crawled in that case.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Tue, Feb 16, 2016 at 5:47 PM, Karl Wright <[email protected]> wrote:
>>>> Hi Phil,
>>>>
>>>> It is not surprising that the connector doesn't like 302 responses and
>>>> doesn't know what to do with them, because it isn't supposed to ever be
>>>> getting any of these.
>>>>
>>>> I am puzzled by your statement that "only a couple of documents have
>>>> redirections in them", because the connector crawls Lists and Library
>>>> documents within SharePoint *only*, and these are very specifically
>>>> accessible through a SharePoint URL hierarchy structure.
>>>> There's no room in any of that for a 302 redirection. Since you see a 302
>>>> in the UI, I feel pretty certain you have a problem with your
>>>> configuration and it is not just "a couple of documents".
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Feb 16, 2016 at 5:22 PM, Phil Riethmuller
>>>> <[email protected]> wrote:
>>>>> Thanks Karl,
>>>>>
>>>>> The majority of content is not going to the redirect, it's probably just
>>>>> a handful of documents that are behaving this way.
>>>>>
>>>>> I'd agree that it's of lesser concern whether or not the document itself
>>>>> is indexing, however I wouldn't expect the 302 to be treated as a fatal
>>>>> error that causes the job to come to a halt. I'd expect the document to
>>>>> be passed over, and the crawl to continue.
>>>>>
>>>>> Is the only solution at this point to remove the documents which redirect
>>>>> to a 302 to get the crawl to run in full?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Phil Riethmuller
>>>>> Technical Consultant
>>>>>
>>>>>
>>>>> From: Karl Wright <[email protected]>
>>>>> Reply-To: <[email protected]>
>>>>> Date: Wednesday, 17 February 2016 8:58 am
>>>>> To: "[email protected]" <[email protected]>
>>>>> Subject: Re: HTTP 302 error causing job to abort
>>>>>
>>>>> Hi Phil,
>>>>>
>>>>> You probably want to point your SharePoint repository connection to the
>>>>> proper server and site, and not rely on redirections. It's also possible
>>>>> that you are missing the site entirely and the redirection you are seeing
>>>>> is taking you to some error page somewhere.
>>>>>
>>>>> I will be raising the question of redirections with the
>>>>> HttpComponents/HttpClient team, since I see no obvious problems with the
>>>>> SharePoint connector code. However, if your connection is properly set
>>>>> up, redirections should be unneeded.
>>>>>
>>>>> I would read the documentation on the Wiki page for debugging SharePoint
>>>>> connections at the bottom of this page:
>>>>> https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Feb 16, 2016 at 4:55 PM, Phil Riethmuller
>>>>> <[email protected]> wrote:
>>>>>> Do you mean in the job status in the Manifold CF interface?
>>>>>>
>>>>>> The job status also shows the same:
>>>>>> Error: Unexpected http error code 302 accessing SharePoint at <url>:
>>>>>> (302)HTTP/1.0 302 Found
>>>>>>
>>>>>> I agree, I wouldn't have thought that the crawler would follow any links
>>>>>> or redirections.
>>>>>>
>>>>>> What sort of configurations could be incorrectly configured that I could
>>>>>> look at revising?
>>>>>>
>>>>>> Phil
>>>>>>
>>>>>>
>>>>>> From: Karl Wright <[email protected]>
>>>>>> Reply-To: <[email protected]>
>>>>>> Date: Wednesday, 17 February 2016 8:45 am
>>>>>> To: "[email protected]" <[email protected]>
>>>>>> Subject: Re: HTTP 302 error causing job to abort
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> When you view the repository connection in the UI, do you get a 302
>>>>>> error also?
>>>>>>
>>>>>> I have looked at the code; Httpclient is supposedly configured to honor
>>>>>> redirections. Obviously it is not doing that, so I'll have to dig deeper
>>>>>> into why that is. On the other hand, I would not expect you to be
>>>>>> getting any redirections, unless you have configured your connection
>>>>>> incorrectly.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 16, 2016 at 4:31 PM, Phil Riethmuller
>>>>>> <[email protected]> wrote:
>>>>>>> Thanks Karl -
>>>>>>>
>>>>>>> I've replaced the actual URL with <URL> below, but here is the stack
>>>>>>> trace:
>>>>>>>
>>>>>>> ERROR 2016-02-16 12:10:55,251 (Worker thread '16') - Exception tossed: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>>>>>>>     at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2246)
>>>>>>>     at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1549)
>>>>>>>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>>> Caused by: (302)HTTP/1.0 302 Found
>>>>>>>     at org.apache.manifoldcf.connectorcommon.common.CommonsHTTPSender.invoke(CommonsHTTPSender.java:201)
>>>>>>>     at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
>>>>>>>     at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
>>>>>>>     at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
>>>>>>>     at org.apache.axis.client.AxisClient.invoke(AxisClient.java:165)
>>>>>>>     at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
>>>>>>>     at org.apache.axis.client.Call.invoke(Call.java:2767)
>>>>>>>     at org.apache.axis.client.Call.invoke(Call.java:2443)
>>>>>>>     at org.apache.axis.client.Call.invoke(Call.java:2366)
>>>>>>>     at org.apache.axis.client.Call.invoke(Call.java:1812)
>>>>>>>     at com.microsoft.schemas.sharepoint.soap.WebsSoapStub.getWebCollection(WebsSoapStub.java:854)
>>>>>>>     at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2161)
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Phil Riethmuller
>>>>>>> Technical Consultant
>>>>>>>
>>>>>>>
>>>>>>> From: Karl Wright <[email protected]>
>>>>>>> Reply-To: <[email protected]>
>>>>>>> Date: Tuesday, 16 February 2016 6:54 pm
>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>> Subject: Re: HTTP 302 error causing job to abort
>>>>>>>
>>>>>>> Hi Phil,
>>>>>>>
>>>>>>> A HTTP 302 response is simply a redirection. It should not, by itself,
>>>>>>> cause a job to abort. I would expect that to go by in wire/http
>>>>>>> logging, but you should not see it anywhere else. So it is not clear to
>>>>>>> me what you are really seeing here.
>>>>>>>
>>>>>>> Can you include an example stack trace from the manifoldcf log?
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 16, 2016 at 12:22 AM, Phil Riethmuller
>>>>>>> <[email protected]> wrote:
>>>>>>> Hi -
>>>>>>>
>>>>>>> When crawling a Sharepoint repository, I'm receiving a HTTP 302 error
>>>>>>> which is causing the manifold job to abort. How do I prevent the
>>>>>>> crawler from aborting the job?
>>>>>>>
>>>>>>> I'm using v2.3 of Manifold with a postgres database.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Phil
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
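For reference, approach (1) from the thread (skipping external subsites before they are crawled) could be roughed out as below. This is only a sketch under stated assumptions: it compares host names, which is a simpler heuristic than the "external" attribute Karl hopes to find in the getSites response, and it may misjudge sites that are legitimately served under several host names. The class name `ExternalSiteFilter`, its methods, and the example URLs are hypothetical and are not part of ManifoldCF.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not ManifoldCF code: filters subsite URLs by host.
class ExternalSiteFilter {

    /**
     * Returns true when candidateUrl names a different host than rootUrl.
     * Assumes both strings are syntactically valid URIs; URI.create throws
     * IllegalArgumentException otherwise.
     */
    public static boolean isExternal(String rootUrl, String candidateUrl) {
        String rootHost = URI.create(rootUrl).getHost();
        String candidateHost = URI.create(candidateUrl).getHost();
        // A relative URL carries no host, so it stays on the root server.
        if (candidateHost == null) {
            return false;
        }
        return !candidateHost.equalsIgnoreCase(rootHost);
    }

    /** Keeps only the subsite URLs that share the root site's host. */
    public static List<String> internalOnly(String rootUrl, List<String> subsiteUrls) {
        List<String> kept = new ArrayList<>();
        for (String url : subsiteUrls) {
            if (!isExternal(rootUrl, url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up URLs standing in for values read from a getSites response.
        List<String> subsites = List.of(
            "http://sharepoint.example.com/sites/hr",
            "http://partner.example.org/landing",
            "/sites/finance");
        System.out.println(internalOnly("http://sharepoint.example.com/sites/root", subsites));
    }
}
```

If the getSites XML does turn out to carry an explicit marker for external sites, checking that marker would be more reliable than the host comparison used here.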
