Re: Strange behavior with dfdl:encoding

Steve Lawrence Wed, 15 May 2019 05:33:31 -0700

I agree. DAFFODIL-2128 will fix Daffodil so that you can also use -I xml
and it will inspect the preamble. It's a bug that it's not doing that
right now.
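
As a rough illustration of what "inspecting the preamble" means, here is a minimal Python sketch. It is not Daffodil's implementation; `declared_encoding` is a hypothetical helper, and it assumes an ASCII-compatible encoding (it would not handle UTF-16 input). It reads the encoding named in the XML declaration and falls back to UTF-8, rather than to the platform default, when none is declared.

```python
import re

# Hypothetical helper, not Daffodil code: pull the encoding pseudo-attribute
# out of an XML declaration such as <?xml version="1.0" encoding="UTF-8"?>.
_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')

def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    # Skip a UTF-8 byte order mark if one is present.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    m = _DECL.match(data)
    # Fall back to the given default (UTF-8), not the platform encoding.
    return m.group(1).decode("ascii").lower() if m else default

# Assumed sample infoset; the real test-UTF-8.xml is not shown in the thread.
doc = '<?xml version="1.0" encoding="UTF-8"?><string>F\u00f8\u00f6</string>'.encode("utf-8")
enc = declared_encoding(doc)
text = doc.decode(enc)   # decoded with the declared encoding
print(enc)
```

Decoding with the declared encoding makes the result independent of the "file.encoding" system property.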


- Steve

On 5/15/19 8:31 AM, Costello, Roger L. wrote:
> A follow-up question please ...
> 
> Steve wrote:
> 
>> Or you could use the "scala-xml" infoset type
>> (e.g. daffodil unparse -I scala-xml ...) which 
>> does correctly look at the XML preamble to 
>> determine encoding.
> 
> I would think that that is the behavior that I always want. That is, I always 
> want Daffodil to look at the XML declaration to determine encoding. Yes? If 
> so, then it seems to me that I should always use the -I scala-xml flag for 
> unparsing. Yes?
> 
> /Roger
> 
> -----Original Message-----
> From: Steve Lawrence <[email protected]> 
> Sent: Wednesday, May 15, 2019 8:03 AM
> To: [email protected]
> Subject: [EXT] Re: Strange behavior with dfdl:encoding
> 
> I believe I've found the issue, and it is related to encoding and Windows.
> 
> With the CLI, when "xml" is used as the infoset type (which is the default), 
> Daffodil does not specify an encoding to use to decode the XML, so Java 
> defaults to the "file.encoding" system property. On Brandon's and my 
> machines, this property is probably UTF-8, and so the right thing happens. 
> But since you're on Windows, the default is probably "Windows-1252". I can 
> reproduce the behavior with the following:
> 
> $ export DAFFODIL_JAVA_OPTS="-Dfile.encoding=Windows-1252"
> $ daffodil unparse -s test.dfdl.xsd test-UTF-8.xml | xxd
> 00000000: 46c3 83c2 b8c3 83c2 b6
> 
> So we can see that changing the encoding to Windows-1252 does result in the 
> extra bytes.
> 
> A workaround would be to modify DAFFODIL_JAVA_OPTS to set the Java 
> file.encoding to "UTF-8", similar to above. Or you could use the "scala-xml" 
> infoset type (e.g. daffodil unparse -I scala-xml ...) which does correctly 
> look at the XML preamble to determine encoding. You might also be able to 
> change your default terminal encoding to UTF-8 with "chcp 65001", but I'm not 
> sure if Java uses that or not.
> 
> I've also created DAFFODIL-2128 to track this issue. When using the "xml" 
> infoset type, we should be inspecting the XML preamble to determine the 
> encoding.
> 
> - Steve
> 
> On 5/14/19 8:05 AM, Steve Lawrence wrote:
>> I've seen encoding issues similar to this when running on Windows. One 
>> potential cause is however you're getting the XML into a file (e.g. 
>> copy paste, redirection in a shell), Windows might be messing with the 
>> encoding and creating XML that isn't encoded as UTF-8, but is 
>> something else. If the XML is wrong, the unparsed output will be wrong too.
>>
>> So in addition to the full schema, it might also be helpful to attach 
>> the actual XML file that you are unparsing and we can see what the 
>> encoding of that file is.
>>
>> - Steve
>>
>> On 5/13/19 5:15 PM, Sloane, Brandon wrote:
>>> Roger,
>>>
>>>
>>> I am unable to reproduce this. Can you post a complete schema?
>>>
>>>
>>> Looking at your output, the only thing that jumps out to me is that 
>>> the problem is 83 C2 being inserted between each character. My guess 
>>> is you are setting some property that changes how strings are 
>>> encoded, but nothing jumps out at me as being able to cause this type of 
>>> encoding behavior.
>>>
>>>
>>> Below is the schema I tried which does not reproduce this problem.
>>>
>>>
>>> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
>>>             xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
>>>             xmlns:tns="urn:a"
>>>             xmlns:ex="http://example.com"
>>>             xmlns:fn="http://www.w3.org/2005/xpath-functions"
>>>             targetNamespace="urn:a" >
>>>    <xs:include
>>>        schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />
>>>
>>>    <xs:annotation>
>>>      <xs:appinfo source="http://www.ogf.org/dfdl/">
>>>        <dfdl:format ref="tns:GeneralFormat"/>
>>>      </xs:appinfo>
>>>    </xs:annotation>
>>>
>>>
>>> <xs:element name="UTF-8">
>>>      <xs:complexType>
>>>          <xs:sequence>
>>>              <xs:element name="string" type="xs:string"
>>>                          dfdl:encoding="utf-8"
>>>                          dfdl:lengthKind="pattern" dfdl:lengthPattern=".*" />
>>>              <xs:element name="length" type="xs:integer"
>>>                          dfdl:inputValueCalc="{ fn:string-length(../string) }" />
>>>          </xs:sequence>
>>>      </xs:complexType>
>>> </xs:element>
>>>
>>> </xs:schema>
>>>
>>>
>>> --------------------------------------------------------------------------------
>>> *From:* Costello, Roger L. <[email protected]>
>>> *Sent:* Monday, May 13, 2019 2:38:03 PM
>>> *To:* [email protected]
>>> *Subject:* Strange behavior with dfdl:encoding
>>>
>>> Hello DFDL community,
>>>
>>> My input is a single UTF-8 string. Parsing the input generates the 
>>> expected XML document, but unparsing the XML results in a totally 
>>> different string. Below is a graphic showing the input, parsing 
>>> results, and unparsing results. Under it are the actual hex bytes. 
>>> Note how the bytes for the input are very different than the bytes 
>>> for the unparse results. Why such differences between the input and 
>>> the unparse output? At the bottom is my DFDL schema. /Roger
>>>
>>> <xs:element name="UTF-8">
>>>     <xs:complexType>
>>>         <xs:sequence>
>>>             <xs:element name="string" type="xs:string" dfdl:encoding="utf-8"/>
>>>             <xs:element name="length" type="xs:integer"
>>>                         dfdl:inputValueCalc="{ fn:string-length(../string) }"/>
>>>         </xs:sequence>
>>>     </xs:complexType>
>>> </xs:element>
>>>
>>
>
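
The extra bytes in the xxd output quoted above (46 c3 83 c2 b8 c3 83 c2 b6) are consistent with UTF-8 text being decoded as Windows-1252 and then written back out as UTF-8. The Python sketch below reproduces that byte sequence. The input string "Føö" is an assumption inferred from those bytes; the original input is not shown in the thread.

```python
# Assumption: the infoset string was "Føö" (inferred from the xxd bytes).
original = "F\u00f8\u00f6"

utf8_bytes = original.encode("utf-8")         # bytes as stored in a UTF-8 XML file
misread = utf8_bytes.decode("windows-1252")   # what a Windows-1252 reader sees
unparsed = misread.encode("utf-8")            # written back out as UTF-8

def hexdump(data: bytes) -> str:
    # Space-separated lowercase hex, one pair per byte.
    return " ".join(f"{b:02x}" for b in data)

print(hexdump(utf8_bytes))   # 46 c3 b8 c3 b6
print(hexdump(unparsed))     # 46 c3 83 c2 b8 c3 83 c2 b6
```

Each two-byte UTF-8 character is misread as two separate Windows-1252 characters, and each of those becomes two bytes when re-encoded, which is where the repeated 83 C2 comes from.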
