I suspect the issue is related to the default encoding on your system.
When the Daffodil CLI writes the infoset, it does not specify an
encoding, so the system default is used. On some systems (like mine)
that's UTF-8, which explains why I don't see the issue. On others it
could be ISO-8859-1, CP-1250, or something else entirely, and the output
bytes will differ from what you expect. Meanwhile the XML still claims
to be UTF-8 via its <?xml version="1.0" encoding="UTF-8" ?> header,
which could easily confuse some editors and lead to mangling.
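
To illustrate the suspicion above, here is a rough Python sketch (not
Daffodil code) that replays the bytes from this thread through a writer
that uses a non-UTF-8 default encoding. The write_infoset helper is
hypothetical, and the windows-1252 default in case 1 is purely an
assumption, since the thread does not say what your system uses.

```python
# Bytes taken from the thread: the UTF-8 input for "Écosse" and "Nuevo León".
ECOSSE_UTF8 = bytes.fromhex("C3 89 63 6F 73 73 65")
LEON_UTF8 = bytes.fromhex("4E 75 65 76 6F 20 4C 65 C3 B3 6E")


def write_infoset(text, default_encoding):
    """Hypothetical stand-in for a writer that uses the platform default.

    Unmappable characters are replaced with '?' (0x3F). That replacement is
    an assumption chosen to match the 3F reported in the thread.
    """
    return text.encode(default_encoding, errors="replace")


# Case 1: the schema says ISO-8859-1, so every input byte becomes one character.
# ASSUMPTION: the platform default is windows-1252 (cp1252).
misparsed = ECOSSE_UTF8.decode("iso-8859-1")
case1 = write_infoset(misparsed, "cp1252")

# Case 2: the schema says UTF-8, so the parse is correct, but the writer
# still uses a single-byte default encoding.
parsed = ECOSSE_UTF8.decode("utf-8")
case2 = write_infoset(parsed, "iso-8859-1")

# Case 3: same as case 2, this time including the end tag after the data.
element = "<NAME>" + LEON_UTF8.decode("utf-8") + "</NAME>"
case3 = write_infoset(element, "iso-8859-1")
try:
    case3.decode("utf-8")
    readable_as_utf8 = True
except UnicodeDecodeError:
    readable_as_utf8 = False

# What the proposed fix amounts to: write the infoset as UTF-8.
fixed = write_infoset(element, "utf-8")

print("case 1:", case1.hex())
print("case 2:", case2.hex())
print("case 3 readable as UTF-8:", readable_as_utf8)
print("fixed round-trips:", fixed.decode("utf-8") == element)
```

Under those assumptions, case 1 begins with C3 3F and case 2 begins
with C9, which lines up with the two sets of bytes you reported. In
case 3 the lone F3 byte is not valid UTF-8 on its own, so an editor
that trusts the UTF-8 header may misread the bytes that follow it,
including the start of the end tag.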

The permanent fix is probably for Daffodil to explicitly set the infoset
output encoding to UTF-8. But until that's fixed, a temporary solution
is to change the default encoding that Daffodil uses by adding the
following to the DAFFODIL_JAVA_OPTS environment variable:

  -Dfile.encoding=UTF-8

With that change, I *think* you should get the results you're expecting.



On 10/10/18 12:30 PM, Costello, Roger L. wrote:
> Hi Mike,
> 
> Okay, per your suggestion I set encoding="utf-8" and in the element 
> declaration 
> for NAME, I changed dfdl:lengthUnits="characters" to 
> dfdl:lengthUnits="bytes". 
> Here’s the element declaration:
> 
> <xs:element    name="NAME"
>                type="xs:string"
>                dfdl:length="93"
>                dfdl:lengthKind="explicit"
>                dfdl:lengthUnits="bytes"
>                dfdl:textTrimKind="padChar"
>                dfdl:textStringPadCharacter="%SP;"
>                dfdl:textStringJustification="center"/>
> 
> Here are the set of bytes before parsing:
> 
> 4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 89 63 
> 6F 
> 73 73 65 20 20 …
> 
> Here are the set of bytes after parsing:
> 
> 4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C9 63 6F 
> 73 
> 73 65
> 
> The changes are shown in yellow.
> 
> That change to the element declaration has triggered other problems.
> 
> Here is text in the original binary file:
> 
> Nuevo León
> 
> Here is its binary:
> 
> 4E 75 65 76 6F 20 4C 65 C3 B3 6E
> 
> Here is the XML that parsing generates:
> 
> <NAME>Nuevo Le󮼯NAME>
> 
> Here is the binary:
> 
> 3C 4E 41 4D 45 3E 4E 75 65 76 6F 20 4C 65 F3 AE BC AF 4E 41 4D 45 3E
> 
> The part in grey corresponds to the data. The output data is the same
> 
> as the input data up to hex 65 and then something strange happens.
> 
> You can see that the end tag </NAME> got mangled.
> 
> Thoughts?
> 
> /Roger
> 
> *From:* Mike Beckerle <[email protected]>
> *Sent:* Wednesday, October 10, 2018 11:24 AM
> *To:* [email protected]
> *Subject:* Re: Why does Daffodil change the binary of non-ASCII characters?
> 
> Interesting,
> 
> So that error says it is looking for 80 utf-8 characters, not 80 bytes.
> 
> This is a supported behavior, but not typically what people want. Usually in 
> legacy formats (like dbase) lengths are in bytes.
> 
> If you have lengthUnits='characters' in iso-8859-1 that's identical to bytes, 
> but in utf8 it is clearly not the same as bytes.
> 
> Try lengthUnits="bytes".
> 
> --------------------------------------------------------------------------------
> 
> *From:*Costello, Roger L. <[email protected] <mailto:[email protected]>>
> *Sent:* Wednesday, October 10, 2018 11:21:17 AM
> *To:* [email protected] <mailto:[email protected]>
> *Subject:* RE: Why does Daffodil change the binary of non-ASCII characters?
> 
> Hi Mike,
> 
> Below is the error message that I get when I change encoding to utf-8 (i.e., 
> encoding="utf-8"). Does that help narrow down the possible problem?  /Roger
> 
> [error] Parse Error: Failed to populate record[1832]. Cause: Parse Error: 
> <SpecifiedLengthExplicitCharactersParser><STATEABB 
> parser='StringOfSpecifiedLengthParser' 
> /></SpecifiedLengthExplicitCharactersParser> - STATEABB - Parse failed. 
> Failed 
> to find exactly 80 characters.
> 
> Schema context: STATEABB Location line 115 column 42 in dBase.dfdl.xsd
> 
> Data location was preceding byte 652456.
> 
> Schema context: sequence Location line 81 column 26 in dBase.dfdl.xsd
> 
> Data location was preceding byte 652456
> 
> *From:* Mike Beckerle <[email protected] <mailto:[email protected]>>
> *Sent:* Wednesday, October 10, 2018 11:03 AM
> *To:* [email protected] <mailto:[email protected]>
> *Subject:* Re: Why does Daffodil change the binary of non-ASCII characters?
> 
> Your data is definitely UTF-8, or C3 89 would not be the LATIN CAPITAL
> LETTER E WITH ACUTE.
> 
> So using iso-8859-1 is going to do the wrong thing for sure.
> 
> So let's figure out why your data fails to parse when specifying the correct 
> character set encoding, utf-8.
> 
> Your hex bytes as presented are all valid Utf-8 according to this site:
> 
> http://www.endmemo.com/unicode/unicodeconverter.php
> 
> So, maybe there's a utf-8 bug in daffodil?
> 
> --------------------------------------------------------------------------------
> 
> *From:*Costello, Roger L. <[email protected] <mailto:[email protected]>>
> *Sent:* Wednesday, October 10, 2018 9:59:16 AM
> *To:* [email protected] <mailto:[email protected]>
> *Subject:* Why does Daffodil change the binary of non-ASCII characters?
> 
> Hello DFDL community,
> 
> I have a binary file that contains, among other things, this text:
> 
> Nova Scotia / Nouvelle-Écosse
> 
> Its corresponding hex binary is this:
> 
> 4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 89 63 
> 6F 
> 73 73 65 20 …
> 
> I used this element declaration in my DFDL schema to parse that binary:
> 
> <xs:element    name="NAME"
>                         type="xs:string"
>                         dfdl:length="93"
>                          dfdl:lengthKind="explicit"
>                         dfdl:lengthUnits="characters"
>                          dfdl:textTrimKind="padChar"
>                          dfdl:textStringPadCharacter="%SP;"
>                          dfdl:textStringJustification="center"/>
> 
> Surprisingly, during parsing Daffodil modified the text to this:
> 
> Nova Scotia / Nouvelle-Ã?cosse
> 
> With this corresponding hex binary:
> 
> 4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 3F 63 
> 6F 
> 73 73 65 20 …
> 
> The part in yellow changed -- from C3 89 (original) to C3 3F (after parsing).
> 
> Hex C3 89 corresponds to the É symbol whereas C3 3F is not a valid unicode 
> codepoint.
> 
> Why did Daffodil change the binary?
> 
> One other piece of the puzzle: in my DFDL schema I specify 
> encoding="ISO-8859-1". For a reason I do not understand, when I 
> specify encoding="utf-8" I get an error message on parse.
> 
> Please help!
> 
> /Roger
> 
