Wow! An outstanding piece of detective work. Thank you Steve!
/Roger

-----Original Message-----
From: Steve Lawrence <[email protected]>
Sent: Wednesday, May 15, 2019 8:03 AM
To: [email protected]
Subject: [EXT] Re: Strange behavior with dfdl:encoding

I believe I've found the issue, and it is related to encoding and Windows.

With the CLI, when "xml" is used as the infoset type (which is the default), Daffodil does not specify an encoding to use to decode the XML, so Java defaults to the "file.encoding" system property. On Brandon's and my machines, this property is probably UTF-8, and so the right thing happens. But since you're on Windows, the default is probably "Windows-1252".

I can reproduce the behavior with the following:

  $ export DAFFODIL_JAVA_OPTS="-Dfile.encoding=Windows-1252"
  $ daffodil unparse -s test.dfdl.xsd test-UTF-8.xml | xxd
  00000000: 46c3 83c2 b8c3 83c2 b6

So we can see that changing the encoding to Windows-1252 does result in the extra bytes.

A workaround would be to modify DAFFODIL_JAVA_OPTS to set the Java file.encoding to "UTF-8", similar to above. Or you could use the "scala-xml" infoset type (e.g. daffodil unparse -I scala-xml ...), which does correctly look at the XML preamble to determine encoding. You might also be able to change your default terminal encoding to UTF-8 with "chcp 65001", but I'm not sure if Java uses that or not.

I've also created DAFFODIL-2128 to track this issue. When using the "xml" infoset type, we should be inspecting the XML preamble to determine the encoding.

- Steve

On 5/14/19 8:05 AM, Steve Lawrence wrote:
> I've seen encoding issues similar to this when running on Windows. One
> potential cause is however you're getting the XML into a file (e.g.
> copy paste, redirection in a shell), Windows might be messing with the
> encoding and creating XML that isn't encoded as UTF-8, but is
> something else. If the XML is wrong, the unparsed output will be wrong too.
>
> So in addition to the full schema, it might also be helpful to attach
> the actual XML file that you are unparsing and we can see what the
> encoding of that file is.
>
> - Steve
>
> On 5/13/19 5:15 PM, Sloane, Brandon wrote:
>> Roger,
>>
>> I am unable to reproduce this. Can you post a complete schema?
>>
>> Looking at your output, the only thing that jumps out to me is that
>> the problem is 83 C2 being inserted between each character. My guess
>> is you are setting some property that changes how strings are
>> encoded, but nothing jumps out at me as being able to cause this type of
>> encoding behavior.
>>
>> Below is the schema I tried which does not reproduce this problem.
>>
>> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
>>     xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
>>     xmlns:tns="urn:a"
>>     xmlns:ex="http://example.com"
>>     xmlns:fn="http://www.w3.org/2005/xpath-functions"
>>     targetNamespace="urn:a" >
>>
>>   <xs:include
>>       schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />
>>
>>   <xs:annotation>
>>     <xs:appinfo source="http://www.ogf.org/dfdl/">
>>       <dfdl:format ref="tns:GeneralFormat"/>
>>     </xs:appinfo>
>>   </xs:annotation>
>>
>>   <xs:element name="UTF-8">
>>     <xs:complexType>
>>       <xs:sequence>
>>         <xs:element name="string" type="xs:string"
>>             dfdl:encoding="utf-8"
>>             dfdl:lengthKind="pattern" dfdl:lengthPattern=".*" />
>>         <xs:element name="length" type="xs:integer"
>>             dfdl:inputValueCalc="{ fn:string-length(../string) }" />
>>       </xs:sequence>
>>     </xs:complexType>
>>   </xs:element>
>>
>> </xs:schema>
>>
>> --------------------------------------------------------------------------------
>> *From:* Costello, Roger L. <[email protected]>
>> *Sent:* Monday, May 13, 2019 2:38:03 PM
>> *To:* [email protected]
>> *Subject:* Strange behavior with dfdl:encoding
>>
>> Hello DFDL community,
>>
>> My input is a single UTF-8 string.
>> Parsing the input generates the
>> expected XML document, but unparsing the XML results in a totally
>> different string. Below is a graphic showing the input, parsing
>> results, and unparsing results. Under it are the actual hex bytes.
>> Note how the bytes for the input are very different than the bytes
>> for the unparse results. Why such differences between the input and
>> the parse output? At the bottom is my DFDL schema. /Roger
>>
>> <xs:element name="UTF-8">
>>   <xs:complexType>
>>     <xs:sequence>
>>       <xs:element name="string" type="xs:string" dfdl:encoding="utf-8"/>
>>       <xs:element name="length" type="xs:integer"
>>           dfdl:inputValueCalc="{ fn:string-length(../string) }"/>
>>     </xs:sequence>
>>   </xs:complexType>
>> </xs:element>
>>
>
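
Editor's note: the byte pattern in Steve's xxd output (and the "83 C2" Brandon noticed between characters) is what you would expect if UTF-8 bytes are decoded as Windows-1252 and the resulting characters are then written back out as UTF-8. The sketch below reproduces that effect outside Daffodil. It makes two assumptions that are not stated in the thread: the original string is taken to be "Føö", which is inferred by working backwards from the hex dump, and Python's "cp1252" codec stands in for Java's Windows-1252 decoding. It is an illustration of the mechanism, not Daffodil's actual code path.

```python
# Assumed original input, inferred from the xxd dump (not given in the emails).
original = "F\u00f8\u00f6"             # "Føö"
utf8_bytes = original.encode("utf-8")  # the bytes stored in the XML file

# A reader that wrongly assumes Windows-1252 turns every byte into one character.
misread = utf8_bytes.decode("cp1252")

# Writing those characters back out as UTF-8 doubles up the non-ASCII bytes.
garbled = misread.encode("utf-8")

print(utf8_bytes.hex())  # 46c3b8c3b6
print(garbled.hex())     # 46c383c2b8c383c2b6

# Reversing the two steps recovers the assumed original string.
repaired = garbled.decode("utf-8").encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```

The second printed value matches the bytes in the xxd output above (46c3 83c2 b8c3 83c2 b6), which is consistent with the explanation that the XML infoset was read using the platform default encoding rather than UTF-8.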
