[jira] [Commented] (CONNECTORS-1325) Invalid XML character causing job to abort

2016-10-12 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569365#comment-15569365
 ] 

Karl Wright commented on CONNECTORS-1325:
-

[~kavdeev] Here's the actual snippet of XML that is problematic:

{code}
ows_Title="Task emoji "
{code}

These emoji appear to be encoded as surrogate code points (D800-DFFF), which are not valid Unicode characters on their own.  See:

https://en.wikibooks.org/wiki/Unicode/Character_reference/D000-DFFF

That's why the Java XML parsers can't deal with them.
There may be a switch of some kind that permits them to be ignored; I'll have 
to look into that.
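
For illustration, XML 1.0 only permits characters matching its "Char" production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF), so a numeric character reference to a lone surrogate is rejected, while a reference to the full emoji code point is accepted. The following is a minimal sketch of that behavior, not ManifoldCF code: the class name, the helper methods, and the sample attribute values are assumptions made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

public class SurrogateRefDemo {
  /** True if the code point matches the XML 1.0 "Char" production. */
  public static boolean isValidXmlChar(int cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
  }

  /** True if the default JAXP parser accepts the document, false if it reports a SAX error. */
  public static boolean parses(String xml) {
    try {
      DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      return true;
    } catch (SAXException e) {
      return false; // not well-formed, e.g. an invalid character reference
    } catch (IOException | ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(isValidXmlChar(0xD83D));  // false: high surrogate
    System.out.println(isValidXmlChar(0x1F600)); // true: emoji code point
    // Hypothetical sample: the emoji written as two surrogate references.
    System.out.println(parses("<row ows_Title=\"Task emoji &#55357;&#56832;\"/>")); // false
    // The same emoji written as a single code point reference.
    System.out.println(parses("<row ows_Title=\"Task emoji &#128512;\"/>"));        // true
  }
}
```

The parser's default error handler may also print a "[Fatal Error]" line to stderr for the rejected document.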




> Invalid XML character causing job to abort
> --
>
> Key: CONNECTORS-1325
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1325
> Project: ManifoldCF
> Issue Type: Bug
> Components: SharePoint connector
> Affects Versions: ManifoldCF 2.3
> Reporter: Phil
> Assignee: Karl Wright
> Priority: Blocker
> Fix For: ManifoldCF 2.5
>
> Attachments: CONNECTORS-1325-2.patch, CONNECTORS-1325-3.patch, CONNECTORS-1325.patch, mcf-bad-ms-char.xml
>
>
> The following error causes the ManifoldCF job to abort, so the job is 
> unable to finish.
> It would be good to have the crawler log this error rather than throw an 
> exception that stops the entire job.
> {code}
> ERROR 2016-06-21 19:01:54,562 (Worker thread '6') system.WorkerThread - Exception tossed: XML parsing error: Character reference "" is an invalid XML character.
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing error: Character reference "" is an invalid XML character.
>     at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:390)
>     at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:286)
>     at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getFieldValues(SPSProxyHelper.java:2039)
>     at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:974)
>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber: 64; Character reference "" is an invalid XML character.
>     at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>     at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>     at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>     at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:359)
>     ... 4 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CONNECTORS-1325) Invalid XML character causing job to abort

2016-10-12 Thread Konstantin Avdeev (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569228#comment-15569228
 ] 

Konstantin Avdeev commented on CONNECTORS-1325:
---

ok, the XML response has been attached



[jira] [Updated] (CONNECTORS-1325) Invalid XML character causing job to abort

2016-10-12 Thread Konstantin Avdeev (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Avdeev updated CONNECTORS-1325:
--
Attachment: mcf-bad-ms-char.xml

Bad char in the Title field



[jira] [Commented] (CONNECTORS-1325) Invalid XML character causing job to abort

2016-10-12 Thread Konstantin Avdeev (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569216#comment-15569216
 ] 

Konstantin Avdeev commented on CONNECTORS-1325:
---

oops, the confluence parser turned the Title text into a readable form :)
Trying again:

ows_Title="Task emoji >>><<<"



[jira] [Commented] (CONNECTORS-1325) Invalid XML character causing job to abort

2016-10-12 Thread Konstantin Avdeev (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569211#comment-15569211
 ] 

Konstantin Avdeev commented on CONNECTORS-1325:
---

hi Karl,

I think the issue can be reproduced easily by putting an emoji into 
a field of a task list:

{code}
DEBUG 2016-10-12 18:32:47,521 (Worker thread '72') - SharePoint: getListItems 
FileRef value 'sites/test-team/Lists/Main Task List/5_.000', xml response: 
'http://schemas.microsoft.com/sharepoint/soap/;>

   

'
DEBUG 2016-10-12 18:32:47,522 (Worker thread '72') - SharePoint: Can't get 
version of '/Main Task List///5_.000' because of bad XML characters(?)
{code}

Thanks!



[jira] [Commented] (CONNECTORS-1325) Invalid XML character causing job to abort

2016-10-12 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15568408#comment-15568408
 ] 

Karl Wright commented on CONNECTORS-1325:
-

Hi Konstantin,

Good release practice and Apache policy say we cannot and will not re-issue 
releases to include patches.  To do that, there would need to be a 
point release instead.  This change will go out as part of the 2.6 release in 
December.

I would like to understand exactly how this entity appears in 
the XML.  If you can obtain the actual XML document (redact sensitive content, 
of course, but preserve formatting etc.), I would greatly appreciate it.  If it 
turns out that the problem is with the Xerces parser, I can create a ticket 
against that.  I suspect, however, that a ticket really should be created 
against SharePoint, although I also suspect they will be completely unwilling 
to fix a deprecated feature like this.








[jira] [Commented] (CONNECTORS-1325) Invalid XML character causing job to abort

2016-10-12 Thread Konstantin Avdeev (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567863#comment-15567863
 ] 

Konstantin Avdeev commented on CONNECTORS-1325:
---

Thank you, Karl! The patch seems to be working: we were able to complete the 
crawl. Unfortunately, all documents from that particular library contain this 
record separator char, so there is no content in the index.
We'd need a pre-parsing stage here ;)
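
One way such a pre-parsing stage might look is sketched below: before the response is handed to the XML parser, numeric character references that point at code points outside the XML 1.0 "Char" production (lone surrogates, or control characters such as the record separator U+001E) are dropped. This is only a sketch under assumptions, not the attached patch or the ManifoldCF implementation; the class name `CharRefSanitizer` and its methods are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharRefSanitizer {
  // Matches hexadecimal (&#x1E;) and decimal (&#30;) character references.
  private static final Pattern CHAR_REF =
      Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

  /** True if the code point matches the XML 1.0 "Char" production. */
  public static boolean isValidXmlChar(int cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
  }

  /** Removes character references the parser would reject; keeps all other text as-is. */
  public static String stripInvalidCharRefs(String xml) {
    Matcher m = CHAR_REF.matcher(xml);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      int cp;
      try {
        cp = m.group(1) != null
            ? Integer.parseInt(m.group(1), 16)
            : Integer.parseInt(m.group(2), 10);
      } catch (NumberFormatException e) {
        cp = -1; // value too large to be a code point; treat as invalid
      }
      String replacement = isValidXmlChar(cp) ? m.group() : "";
      m.appendReplacement(out, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(out);
    return out.toString();
  }

  public static void main(String[] args) {
    // Hypothetical inputs; the real SharePoint response is not reproduced here.
    System.out.println(stripInvalidCharRefs("ows_Title=\"Task emoji &#55357;&#56832;\""));
    System.out.println(stripInvalidCharRefs("ows_Title=\"a&#x1E;b\""));
  }
}
```

Dropping the references loses the emoji itself; a variant could instead combine a high/low surrogate pair of references into a single code point reference. The regex also does not distinguish CDATA sections, where such text is literal, so this would need care on real input.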

P.S. just a note: the complete patch is not yet integrated into v.2.5.
