Any news on this research?

Karl
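In case it helps with the check described in the quoted message below, here is a minimal sketch of how the logged getSites xml response could be scanned for anything beyond "Url" and "Title". It is only an illustration under stated assumptions: the class name GetSitesInspector, the method describe, the sample XML, and the attribute name "Flag" are all hypothetical, and the real response layout from SharePoint may differ. The sketch accepts "Url" either as an attribute or as a child node, since I am not assuming which form your response uses.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of ManifoldCF.
class GetSitesInspector {

  /** Returns one line per entry that carries a Url, listing everything except Url and Title. */
  static List<String> describe(String xml) {
    Document doc;
    try {
      DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      doc = builder.parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse xml response", e);
    }
    List<String> report = new ArrayList<>();
    NodeList all = doc.getElementsByTagName("*"); // every element, in document order
    for (int i = 0; i < all.getLength(); i++) {
      Element entry = (Element) all.item(i);
      String url = valueOf(entry, "Url");
      if (url == null) {
        continue; // not a site entry
      }
      List<String> extras = new ArrayList<>();
      NamedNodeMap attrs = entry.getAttributes();
      for (int a = 0; a < attrs.getLength(); a++) {
        Node attr = attrs.item(a);
        if (!isKnown(attr.getNodeName())) {
          extras.add(attr.getNodeName() + "=" + attr.getNodeValue());
        }
      }
      NodeList children = entry.getChildNodes();
      for (int c = 0; c < children.getLength(); c++) {
        Node child = children.item(c);
        if (child.getNodeType() == Node.ELEMENT_NODE && !isKnown(child.getNodeName())) {
          extras.add(child.getNodeName() + "=" + child.getTextContent().trim());
        }
      }
      report.add(url + " -> " + extras);
    }
    return report;
  }

  /** Reads a value that may be stored as an attribute or as a direct child element. */
  static String valueOf(Element entry, String name) {
    if (entry.hasAttribute(name)) {
      return entry.getAttribute(name);
    }
    NodeList children = entry.getChildNodes();
    for (int c = 0; c < children.getLength(); c++) {
      Node child = children.item(c);
      if (child.getNodeType() == Node.ELEMENT_NODE && child.getNodeName().equals(name)) {
        return child.getTextContent().trim();
      }
    }
    return null;
  }

  /** Names we already expect, plus namespace declarations, are not reported. */
  static boolean isKnown(String name) {
    return name.equals("Url") || name.equals("Title") || name.startsWith("xmlns");
  }

  public static void main(String[] args) {
    // Made-up sample; "Flag" is a placeholder, not a real SharePoint attribute.
    String sample = "<Webs>"
        + "<Web Title=\"Team\" Url=\"http://sp/team\"/>"
        + "<Web Title=\"Partner\" Url=\"http://sp/partner\" Flag=\"x\"/>"
        + "</Webs>";
    for (String line : describe(sample)) {
      System.out.println(line);
    }
  }
}
```

To use it, paste the xml that follows "SharePoint: getSites xml response:" in the log into describe() and compare the line for the redirecting Url against the lines for the other sites. If the redirecting entry shows nothing extra, that in itself is useful to know.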
On Fri, Feb 19, 2016 at 12:46 AM, Karl Wright <[email protected]> wrote:

> Hi Phil,
>
> Thanks -- this information is more helpful.
>
> So my understanding is that there is an external site reference in your
> site/subsite hierarchy? And the *root* site (the one that you point at
> when you configure the connection itself) is *not* external after all?
>
> If that is the case, then the external site must be being "discovered"
> through the Webs service API call. There are two ways forward:
>
> (1) We can change the Webs response parsing to detect external sites and
> not include those in the crawl, or
> (2) We can try to make decisions based on whether a 302 comes back as a
> response code.
>
> (1) is by far the best approach but it will require some cooperation and
> execution of sample code on your part. Essentially I'll need to see what
> the xml is that is coming back that first describes the external site and
> see if there is an attribute that lets us know it is external. That way I
> can properly just skip it entirely.
>
> We can have a look at what comes back from SharePoint for this API
> response if you enable connector debugging in properties.xml:
>
> <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
>
> ... and restart. You will then need to do a crawl. The following line
> will be what you look for:
>
> Logging.connectors.debug("SharePoint: getSites xml response: "+xmlResponse);
>
> This xml response will contain "Url" and "Title" nodes; what I need to
> know is whether there's any attribute of the "Url" node, or parallel node
> other than "Url" or "Title", that contains an indication of whether the Url
> that describes the external site is indeed external. So you look for the
> Url that describes the SharePoint URL that has the redirection, and tell me
> if there's anything special about it in the associated getSites response.
> Does that make sense?
>
> If this is too hard, alternative (2) is possible, but it will require tons
> of individual changes. So let's look into (1) first.
>
> Thanks
> Karl
>
>
> On Thu, Feb 18, 2016 at 11:49 PM, Phil Riethmuller <[email protected]> wrote:
>
>> Hi Karl,
>>
>> Some further info:
>>
>> - The problem document that Manifold reported is redirecting to an
>>   external site.
>> - We tried crawling a smaller subset of content on the same
>>   Sharepoint site that definitely doesn’t contain any external links in the
>>   content, and this works OK.
>> - The job that errors with the 302 says it has found 529 docs so far
>>   and processed 127 of them. This seems to indicate that it has in fact
>>   found some documents.
>>
>> I’m not sure what you mean that the error is being generated from the API
>> call, and not an individual document? The info appears to indicate it is
>> not all documents, but just selected documents.
>>
>> There really isn’t much we can do about this from the Sharepoint
>> configuration side. Is there any way we can test if it is as simple as the
>> 302 coming from the documents themselves?
>>
>> Thanks for your help to date.
>>
>> Phil
>>
>>
>> From: Karl Wright <[email protected]>
>> Reply-To: <[email protected]>
>> Date: Thursday, 18 February 2016 10:31 am
>> To: "[email protected]" <[email protected]>
>> Subject: Re: HTTP 302 error causing job to abort
>>
>> Hi Phil,
>>
>> The 302 error is not coming from a single document. If it *was* coming
>> from the fetch of an individual document, it would be easy to work around.
>> But, from your stack trace, it is clear that this error is coming from an
>> API call, specifically a call to enumerate subsites of a given site. That
>> means that some or all of the SharePoint hierarchy is not accessible
>> through POST requests. I have never seen this kind of behavior from
>> SharePoint before.
>>
>> This is not something that I can work around without more information.
>> In order to get that information, you will at the very minimum need to turn
>> on connector debugging, and probably turning on http wire debugging would
>> be helpful too. And, if what you said about the View page for this
>> connection is true and it also shows a 302 error, I very much suspect that
>> something changed on the server end and you are currently unable to crawl
>> *any* documents at all.
>>
>> I am sorry I cannot make this any clearer.
>>
>> Thanks,
>> Karl
>>
>>
>> On Wed, Feb 17, 2016 at 6:20 PM, Phil Riethmuller <[email protected]> wrote:
>>
>>> Hi Karl,
>>>
>>> Thanks for the update.
>>>
>>> I’m not 100% sure how many documents have this redirect in them, but
>>> I’ll see if I can get a better estimate. The content we are crawling is
>>> substantially large, and comes from many different authors so it’s
>>> difficult to manage how these Sharepoint documents are created. It makes it
>>> extremely difficult to pinpoint all the documents that contain redirects.
>>>
>>> Am I correct in assuming a single 302 error causes the job to fail, or
>>> is there some other logic that determines this?
>>>
>>> How plausible would it be to include in the product an option for
>>> treating 302’s as a warning, rather than a fatal error? Possibly just an
>>> option in the Job setup?
>>>
>>> Regards,
>>> Phil
>>>
>>>
>>> From: Karl Wright <[email protected]>
>>> Reply-To: <[email protected]>
>>> Date: Thursday, 18 February 2016 1:39 am
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: HTTP 302 error causing job to abort
>>>
>>> Hi again Phil,
>>>
>>> The HttpClient team points out that POST requests (as we do for the
>>> SharePoint repository requests) are not allowed to follow 302 redirections
>>> according to RFC2616. We use POST requests because, for SOAP, there is
>>> often quite a bit of XML data that goes along with the request, and we
>>> would otherwise have size issues. So we cannot use GET instead of POST.
>>> See CONNECTORS-1279 for details.
>>>
>>> If you still believe that it is only a couple of URLs that are returning
>>> 302 for you, I'd like some analysis of why you believe that to be true. I
>>> would be happy to consider recognition of an occasional 302 response as
>>> meaning "skip this document". On the other hand, based on your stack
>>> trace, it really appears that you have a far more systemic problem; it is
>>> failing while obtaining information for an entire site, so not much would
>>> get crawled in that case.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Tue, Feb 16, 2016 at 5:47 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Phil,
>>>>
>>>> It is not surprising that the connector doesn't like 302 responses and
>>>> doesn't know what to do with them, because it isn't supposed to ever be
>>>> getting any of these.
>>>>
>>>> I am puzzled by your statement that "only a couple of documents have
>>>> redirections in them", because the connector crawls Lists and Library
>>>> documents within SharePoint *only*, and these are very specifically
>>>> accessible through a SharePoint URL hierarchy structure. There's no room
>>>> in any of that for a 302 redirection. Since you see a 302 in the UI, I
>>>> feel pretty certain you have a problem with your configuration and it is
>>>> not just "a couple of documents".
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Feb 16, 2016 at 5:22 PM, Phil Riethmuller <[email protected]> wrote:
>>>>
>>>>> Thanks Karl,
>>>>>
>>>>> The majority of content is not going to the redirect, it’s probably
>>>>> just a handful of documents that are behaving this way.
>>>>>
>>>>> I’d agree that it’s of lesser concern whether or not the document
>>>>> itself is indexing, however I wouldn’t expect the 302 to be treated as a
>>>>> fatal error that causes the job to come to a halt. I’d expect the document
>>>>> to be passed over, and the crawl to continue.
>>>>>
>>>>> Is the only solution at this point to remove the documents which
>>>>> redirect to a 302 to get the crawl to run in full?
>>>>>
>>>>> Regards,
>>>>>
>>>>> *Phil Riethmuller*
>>>>> Technical Consultant
>>>>>
>>>>> *Funnelback |* 437 Kent Street, Sydney, NSW 2000
>>>>> *T* +61 2 9045 2882 | funnelback.com <http://www.funnelback.com/>
>>>>>
>>>>> *AUSTRALIA* | UNITED KINGDOM | NEW ZEALAND | POLAND | UNITED STATES
>>>>>
>>>>> Connect with us: LinkedIn <http://www.linkedin.com/company/funnelback>
>>>>> - *Twitter*
>>>>>
>>>>>
>>>>> From: Karl Wright <[email protected]>
>>>>> Reply-To: <[email protected]>
>>>>> Date: Wednesday, 17 February 2016 8:58 am
>>>>> To: "[email protected]" <[email protected]>
>>>>> Subject: Re: HTTP 302 error causing job to abort
>>>>>
>>>>> Hi Phil,
>>>>>
>>>>> You probably want to point your SharePoint repository connection to
>>>>> the proper server and site, and not rely on redirections. It's also
>>>>> possible that you are missing the site entirely and the redirection you
>>>>> are seeing is taking you to some error page somewhere.
>>>>>
>>>>> I will be raising the question of redirections with the
>>>>> HttpComponents/HttpClient team, since I see no obvious problems with the
>>>>> SharePoint connector code. However, if your connection is properly set
>>>>> up, redirections should be unneeded.
>>>>>
>>>>> I would read the documentation on the Wiki page for debugging
>>>>> SharePoint connections at the bottom of this page:
>>>>> https://cwiki.apache.org/confluence/display/CONNECTORS/Debugging+Connections
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Feb 16, 2016 at 4:55 PM, Phil Riethmuller <[email protected]> wrote:
>>>>>
>>>>>> Do you mean in the job status in the Manifold CF interface?
>>>>>>
>>>>>> The job status also shows the same:
>>>>>> Error: Unexpected http error code 302 accessing SharePoint at <url>:
>>>>>> (302)HTTP/1.0 302 Found
>>>>>>
>>>>>> I agree, I wouldn’t have thought that the crawler would follow any
>>>>>> links or redirections.
>>>>>>
>>>>>> What sort of configurations could be incorrectly configured, that I
>>>>>> could look at revising?
>>>>>>
>>>>>> Phil
>>>>>>
>>>>>>
>>>>>> From: Karl Wright <[email protected]>
>>>>>> Reply-To: <[email protected]>
>>>>>> Date: Wednesday, 17 February 2016 8:45 am
>>>>>> To: "[email protected]" <[email protected]>
>>>>>> Subject: Re: HTTP 302 error causing job to abort
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> When you view the repository connection in the UI, do you get a 302
>>>>>> error also?
>>>>>>
>>>>>> I have looked at the code; Httpclient is supposedly configured to
>>>>>> honor redirections. Obviously it is not doing that, so I'll have to dig
>>>>>> deeper into why that is. On the other hand, I would not expect you to be
>>>>>> getting any redirections, unless you have configured your connection
>>>>>> incorrectly.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 16, 2016 at 4:31 PM, Phil Riethmuller <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks Karl -
>>>>>>>
>>>>>>> I’ve replaced the actual URL with <URL> below, but here is the stack
>>>>>>> trace:
>>>>>>>
>>>>>>> ERROR 2016-02-16 12:10:55,251 (Worker thread '16') - Exception tossed: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected http error code 302 accessing SharePoint at <URL>: (302)HTTP/1.0 302 Found
>>>>>>>     at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2246)
>>>>>>>     at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1549)
>>>>>>>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>>> Caused by: (302)HTTP/1.0 302 Found
>>>>>>>     at org.apache.manifoldcf.connectorcommon.common.CommonsHTTPSender.invoke(CommonsHTTPSender.java:201)
>>>>>>>     at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
>>>>>>>     at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
>>>>>>>     at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
>>>>>>>     at org.apache.axis.client.AxisClient.invoke(AxisClient.java:165)
>>>>>>>     at org.apache.axis.client.Call.invokeEngine(Call.java:2784)
>>>>>>>     at org.apache.axis.client.Call.invoke(Call.java:2767)
>>>>>>>     at org.apache.axis.client.Call.invoke(Call.java:2443)
>>>>>>>     at org.apache.axis.client.Call.invoke(Call.java:2366)
>>>>>>>     at org.apache.axis.client.Call.invoke(Call.java:1812)
>>>>>>>     at com.microsoft.schemas.sharepoint.soap.WebsSoapStub.getWebCollection(WebsSoapStub.java:854)
>>>>>>>     at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getSites(SPSProxyHelper.java:2161)
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> *Phil Riethmuller*
>>>>>>> Technical Consultant
>>>>>>>
>>>>>>> *Funnelback |* 437 Kent Street, Sydney, NSW 2000
>>>>>>> *T* +61 2 9045 2882 | funnelback.com <http://www.funnelback.com/>
>>>>>>>
>>>>>>> *AUSTRALIA* | UNITED KINGDOM | NEW ZEALAND | POLAND | UNITED STATES
>>>>>>>
>>>>>>> Connect with us: LinkedIn
>>>>>>> <http://www.linkedin.com/company/funnelback> - *Twitter*
>>>>>>>
>>>>>>>
>>>>>>> From: Karl Wright <[email protected]>
>>>>>>> Reply-To: <[email protected]>
>>>>>>> Date: Tuesday, 16 February 2016 6:54 pm
>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>> Subject: Re: HTTP 302 error causing job to abort
>>>>>>>
>>>>>>> Hi Phil,
>>>>>>>
>>>>>>> A HTTP 302 response is simply a redirection. It should not, by
>>>>>>> itself, cause a job to abort. I would expect that to go by in wire/http
>>>>>>> logging, but you should not see it anywhere else. So it is not clear
>>>>>>> to me what you are really seeing here.
>>>>>>>
>>>>>>> Can you include an example stack trace from the manifoldcf log?
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 16, 2016 at 12:22 AM, Phil Riethmuller <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi -
>>>>>>>>
>>>>>>>> When crawling a Sharepoint repository, I’m receiving a HTTP 302
>>>>>>>> error which is causing the manifold job to abort. How do I prevent the
>>>>>>>> crawler from aborting the job?
>>>>>>>>
>>>>>>>> I’m using v2.3 of Manifold with a postgres database.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Phil
