I agree. DAFFODIL-2128 will fix Daffodil so that you can also use -I xml and it will inspect the preamble. It's a bug that it's not doing that right now.
- Steve

On 5/15/19 8:31 AM, Costello, Roger L. wrote:
> A follow-up question please ...
>
> Steve wrote:
>
>> Or you could use the "scala-xml" infoset type
>> (e.g. daffodil unparse -I scala-xml ...) which
>> does correctly look at the XML preamble to
>> determine encoding.
>
> I would think that that is the behavior that I always want. That is, I always
> want Daffodil to look at the XML declaration to determine encoding. Yes? If
> so, then it seems to me that I should always use the -I scala-xml flag for
> unparsing. Yes?
>
> /Roger
>
> -----Original Message-----
> From: Steve Lawrence <[email protected]>
> Sent: Wednesday, May 15, 2019 8:03 AM
> To: [email protected]
> Subject: [EXT] Re: Strange behavior with dfdl:encoding
>
> I believe I've found the issue, and it is related to encoding and Windows.
>
> With the CLI, when "xml" is used as the infoset type (which is the default),
> Daffodil does not specify an encoding to use to decode the XML, so Java
> defaults to the "file.encoding" system property. On Brandon's and my
> machines, this property is probably UTF-8, and so the right thing happens.
> But since you're on Windows, the default is probably "Windows-1252". I can
> reproduce the behavior with the following:
>
> $ export DAFFODIL_JAVA_OPTS="-Dfile.encoding=Windows-1252"
> $ daffodil unparse -s test.dfdl.xsd test-UTF-8.xml | xxd
> 00000000: 46c3 83c2 b8c3 83c2 b6
>
> So we can see that changing the encoding to Windows-1252 does result in the
> extra bytes.
>
> A workaround would be to modify DAFFODIL_JAVA_OPTS to set the java
> file.encoding to "UTF-8", similar to above. Or you could use the "scala-xml"
> infoset type (e.g. daffodil unparse -I scala-xml ...) which does correctly
> look at the XML preamble to determine encoding. You might also be able to
> change your default terminal encoding to UTF-8 with "chcp 65001", but I'm not
> sure if Java uses that or not.
>
> I've also created DAFFODIL-2128 to track this issue.
> When using the "xml" infoset type, we should be inspecting the XML preamble
> to determine the encoding.
>
> - Steve
>
> On 5/14/19 8:05 AM, Steve Lawrence wrote:
>> I've seen encoding issues similar to this when running on Windows. One
>> potential cause is that, however you're getting the XML into a file (e.g.
>> copy paste, redirection in a shell), Windows might be messing with the
>> encoding and creating XML that isn't encoded as UTF-8, but is
>> something else. If the XML is wrong, the unparsed output will be wrong too.
>>
>> So in addition to the full schema, it might also be helpful to attach
>> the actual XML file that you are unparsing and we can see what the
>> encoding of that file is.
>>
>> - Steve
>>
>> On 5/13/19 5:15 PM, Sloane, Brandon wrote:
>>> Roger,
>>>
>>> I am unable to reproduce this. Can you post a complete schema?
>>>
>>> Looking at your output, the only thing that jumps out to me is that
>>> the problem is 83 C2 being inserted between each character. My guess
>>> is you are setting some property that changes how strings are
>>> encoded, but nothing jumps out at me as being able to cause this type of
>>> encoding behavior.
>>>
>>> Below is the schema I tried which does not reproduce this problem.
>>>
>>> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
>>>            xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
>>>            xmlns:tns="urn:a"
>>>            xmlns:ex="http://example.com"
>>>            xmlns:fn="http://www.w3.org/2005/xpath-functions"
>>>            targetNamespace="urn:a">
>>>
>>>   <xs:include
>>>     schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />
>>>
>>>   <xs:annotation>
>>>     <xs:appinfo source="http://www.ogf.org/dfdl/">
>>>       <dfdl:format ref="tns:GeneralFormat"/>
>>>     </xs:appinfo>
>>>   </xs:annotation>
>>>
>>>   <xs:element name="UTF-8">
>>>     <xs:complexType>
>>>       <xs:sequence>
>>>         <xs:element name="string" type="xs:string"
>>>                     dfdl:encoding="utf-8"
>>>                     dfdl:lengthKind="pattern" dfdl:lengthPattern=".*" />
>>>         <xs:element name="length" type="xs:integer"
>>>                     dfdl:inputValueCalc="{ fn:string-length(../string) }" />
>>>       </xs:sequence>
>>>     </xs:complexType>
>>>   </xs:element>
>>>
>>> </xs:schema>
>>>
>>> --------------------------------------------------------------------------------
>>> *From:* Costello, Roger L. <[email protected]>
>>> *Sent:* Monday, May 13, 2019 2:38:03 PM
>>> *To:* [email protected]
>>> *Subject:* Strange behavior with dfdl:encoding
>>>
>>> Hello DFDL community,
>>>
>>> My input is a single UTF-8 string. Parsing the input generates the
>>> expected XML document, but unparsing the XML results in a totally
>>> different string. Below is a graphic showing the input, parsing
>>> results, and unparsing results. Under it are the actual hex bytes.
>>> Note how the bytes for the input are very different than the bytes
>>> for the unparse results. Why such differences between the input and
>>> the unparse output? At the bottom is my DFDL schema.
>>>
>>> /Roger
>>>
>>> <xs:element name="UTF-8">
>>>   <xs:complexType>
>>>     <xs:sequence>
>>>       <xs:element name="string" type="xs:string" dfdl:encoding="utf-8"/>
>>>       <xs:element name="length" type="xs:integer"
>>>                   dfdl:inputValueCalc="{ fn:string-length(../string) }"/>
>>>     </xs:sequence>
>>>   </xs:complexType>
>>> </xs:element>
>>>
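
For anyone reading this thread later, the byte pattern in the xxd output above can be reproduced outside Daffodil. The following is a minimal Python sketch, not Daffodil's code. It assumes the original string was "Føö", which is inferred from the hex bytes and is never shown in the thread. It decodes the UTF-8 bytes as Windows-1252 and re-encodes them as UTF-8, which yields the same nine bytes. The `declared_encoding` helper is hypothetical; it only illustrates the idea of reading the encoding from the XML declaration instead of relying on a platform default, and is not how DAFFODIL-2128 is or will be implemented.

```python
import re

# Assumption: the input string is inferred from the xxd bytes in the thread.
original = "Føö"

# Bytes as stored in a UTF-8 encoded XML file.
utf8_bytes = original.encode("utf-8")

# What a reader sees if it wrongly decodes those bytes as Windows-1252.
misread = utf8_bytes.decode("cp1252")

# What is written when the misread text is then encoded as UTF-8.
double_encoded = misread.encode("utf-8")

print(utf8_bytes.hex())      # 46c3b8c3b6
print(double_encoded.hex())  # 46c383c2b8c383c2b6


def declared_encoding(xml_bytes, default="utf-8"):
    """Hypothetical helper: return the encoding named in the XML declaration.

    Falls back to `default` when there is no declaration or no encoding
    attribute. This is a simplification; it does not handle BOMs or
    UTF-16/UTF-32 documents.
    """
    match = re.match(
        rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']',
        xml_bytes,
    )
    return match.group(1).decode("ascii") if match else default


# Decoding with the declared encoding recovers the intended text.
xml_doc = (
    b'<?xml version="1.0" encoding="UTF-8"?><string>'
    + utf8_bytes
    + b"</string>"
)
decoded = xml_doc.decode(declared_encoding(xml_doc))
print(original in decoded)   # True
```

The "83 C2" bytes Brandon noticed fall out of this directly: each two-byte UTF-8 character becomes two Windows-1252 characters, and each of those is then encoded as two bytes of its own.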
