[jira] [Commented] (CONNECTORS-1325) Invalid XML character causing job to abort
[ https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569365#comment-15569365 ]

Karl Wright commented on CONNECTORS-1325:
-----------------------------------------

[~kavdeev] Here's the actual snippet of XML that is problematic:

{code}
ows_Title="Task emoji "
{code}

These emojis are not valid Unicode characters. See: https://en.wikibooks.org/wiki/Unicode/Character_reference/D000-DFFF

That's why the Java XML parsers can't deal with them. There may be a switch of some kind that permits them to be ignored; I'll have to look into that.

> Invalid XML character causing job to abort
> ------------------------------------------
>
>                 Key: CONNECTORS-1325
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1325
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: SharePoint connector
>    Affects Versions: ManifoldCF 2.3
>            Reporter: Phil
>            Assignee: Karl Wright
>            Priority: Blocker
>             Fix For: ManifoldCF 2.5
>
>         Attachments: CONNECTORS-1325-2.patch, CONNECTORS-1325-3.patch, CONNECTORS-1325.patch, mcf-bad-ms-char.xml
>
>
> The following error is causing the Manifold job to abort, and subsequently the job is not able to finish.
> It would be good to have the crawler log this error, but not throw an exception which causes the entire job to stop.
> {code}
> ERROR 2016-06-21 19:01:54,562 (Worker thread '6') system.WorkerThread - Exception tossed: XML parsing error: Character reference "" is an invalid XML character.
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing error: Character reference "" is an invalid XML character.
>         at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:390)
>         at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:286)
>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getFieldValues(SPSProxyHelper.java:2039)
>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:974)
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber: 64; Character reference "" is an invalid XML character.
>         at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>         at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>         at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:359)
>         ... 4 more
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
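
To illustrate why a parser would reject the reference in the comment above, here is a minimal sketch. It assumes (the archived snippet no longer shows this) that the emoji was serialized as one numeric character reference per UTF-16 code unit. The two halves of a surrogate pair then each fall in the D800-DFFF range, which the XML 1.0 `Char` production excludes, while a single reference to the full code point would be accepted. The class name, method names, and the example emoji U+1F60A are illustrative only and do not come from ManifoldCF or SharePoint.

```java
final class SurrogateReferenceDemo {

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** One numeric reference per UTF-16 code unit (the assumed faulty encoding). */
    static String referencePerCodeUnit(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            sb.append("&#").append((int) text.charAt(i)).append(';');
        }
        return sb.toString();
    }

    /** One numeric reference per Unicode code point (the well-formed encoding). */
    static String referencePerCodePoint(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> sb.append("&#").append(cp).append(';'));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical example emoji, built from its code point to avoid source-encoding issues.
        String emoji = new String(Character.toChars(0x1F60A));
        System.out.println(referencePerCodeUnit(emoji));  // &#55357;&#56842;
        System.out.println(referencePerCodePoint(emoji)); // &#128522;
        System.out.println(isValidXmlChar(0xD83D));       // false: high surrogate
        System.out.println(isValidXmlChar(0x1F60A));      // true: full code point
    }
}
```

Under that assumption the emoji itself is a legal character; only the per-code-unit references are not.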
[jira] [Commented] (CONNECTORS-1325) Invalid XML character causing job to abort
[ https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569228#comment-15569228 ]

Konstantin Avdeev commented on CONNECTORS-1325:
-----------------------------------------------

ok, the XML response has been attached
[jira] [Updated] (CONNECTORS-1325) Invalid XML character causing job to abort
[ https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Avdeev updated CONNECTORS-1325:
------------------------------------------

    Attachment: mcf-bad-ms-char.xml

Bad char in the Title field
[jira] [Commented] (CONNECTORS-1325) Invalid XML character causing job to abort
[ https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569216#comment-15569216 ]

Konstantin Avdeev commented on CONNECTORS-1325:
-----------------------------------------------

oops, the confluence parser turned the Title text into a readable form :)
Trying again:

ows_Title="Task emoji >>><<<"
[jira] [Commented] (CONNECTORS-1325) Invalid XML character causing job to abort
[ https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569211#comment-15569211 ]

Konstantin Avdeev commented on CONNECTORS-1325:
-----------------------------------------------

Hi Karl,

I think the issue can be reproduced easily by putting an emoji (e.g. ) into a field of a task list:

{code}
DEBUG 2016-10-12 18:32:47,521 (Worker thread '72') - SharePoint: getListItems FileRef value 'sites/test-team/Lists/Main Task List/5_.000', xml response: 'http://schemas.microsoft.com/sharepoint/soap/;> '
DEBUG 2016-10-12 18:32:47,522 (Worker thread '72') - SharePoint: Can't get version of '/Main Task List///5_.000' because of bad XML characters(?)
{code}

Thanks!
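
The reproduction above, together with the reporter's request to log the error rather than abort the job, can be sketched with the JDK's own parser. This is not the attached patch, only an illustration of the log-and-skip idea. The class name, method name, document identifier, and the surrogate references in the sample input are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SkipOnBadXml {
    private static final Logger LOG = Logger.getLogger(SkipOnBadXml.class.getName());

    /** Returns the parsed document, or null (after logging) if the XML is not well-formed. */
    static Document parseOrNull(String documentId, String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Log and skip this one document; the caller carries on with the rest of the job.
            LOG.log(Level.WARNING, "Skipping ''{0}'': {1}", new Object[] {documentId, e.getMessage()});
            return null;
        } catch (IOException | ParserConfigurationException e) {
            // Not expected for an in-memory string with a default factory.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed shape of the bad response: an emoji written as two surrogate references.
        String bad = "<row ows_Title=\"Task emoji &#55357;&#56842;\"/>";
        String good = "<row ows_Title=\"Task emoji &#128522;\"/>";
        System.out.println(parseOrNull("bad-item", bad) == null);   // parser rejects it
        System.out.println(parseOrNull("good-item", good) != null); // parser accepts it
    }
}
```

Skipping keeps the job alive, but the affected document contributes nothing to the index.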
[jira] [Commented] (CONNECTORS-1325) Invalid XML character causing job to abort
[ https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15568408#comment-15568408 ]

Karl Wright commented on CONNECTORS-1325:
-----------------------------------------

Hi Konstantin,

Good release practices, and Apache policy, say we cannot and will not re-issue releases to include patches. In order to do that there would need to be a point release instead. This change will go out as part of the 2.6 release in December.

I would like to further understand how exactly this entity ends up in the XML. If you can obtain the actual XML document (redact sensitive content, of course, but preserve formatting etc.), I would greatly appreciate it. If it turns out that the problem is with the xerces parser, I can create a ticket against that. I suspect, however, that a ticket really should be created against SharePoint, although I also suspect they will be completely unwilling to fix a deprecated feature like this.
[jira] [Commented] (CONNECTORS-1325) Invalid XML character causing job to abort
[ https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567863#comment-15567863 ]

Konstantin Avdeev commented on CONNECTORS-1325:
-----------------------------------------------

Thank you, Karl!

The patch seems to be working: we were able to complete the crawl. Unfortunately, all documents from that particular library contain this record separator char, so there is no content in the index.
We'd need a pre-parsing stage here ;)

P.S. just a note: the complete patch is not yet integrated into v.2.5.
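
One possible shape for the pre-parsing stage mentioned above is a filter that drops numeric character references the XML 1.0 `Char` production does not allow, such as the record separator (U+001E), before the response reaches the parser. The following is only a sketch under that assumption. It is not part of ManifoldCF, the class and method names are hypothetical, and it does not distinguish references inside CDATA sections or comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlReferenceSanitizer {
    // Matches decimal (&#30;) and hexadecimal (&#x1E;) numeric character references.
    private static final Pattern NUMERIC_REF =
        Pattern.compile("&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Removes numeric references to invalid characters and leaves everything else untouched. */
    static String stripInvalidReferences(String xml) {
        Matcher m = NUMERIC_REF.matcher(xml);
        StringBuilder out = new StringBuilder(xml.length());
        int last = 0;
        while (m.find()) {
            out.append(xml, last, m.start()); // copy the text before this reference
            int cp;
            try {
                cp = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            } catch (NumberFormatException e) {
                cp = -1; // value too large to be a code point: treat as invalid
            }
            if (isValidXmlChar(cp)) {
                out.append(m.group()); // keep valid references verbatim
            }
            last = m.end();
        }
        out.append(xml, last, xml.length()); // copy the remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripInvalidReferences("<row ows_Title=\"a&#x1E;b\"/>"));
    }
}
```

Dropping the reference loses the character but keeps the rest of the document indexable; replacing it with a space or U+FFFD would be an equally plausible choice.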