I'm unable to reproduce this issue. Would it be possible to provide your
schema, test data, and the command you're using that returns the
incorrect output?


On 10/10/18 12:30 PM, Costello, Roger L. wrote:
> Hi Mike,
> 
> Okay, per your suggestion I set encoding="utf-8" and, in the element
> declaration for NAME, changed dfdl:lengthUnits="characters" to
> dfdl:lengthUnits="bytes". Here’s the element declaration:
> 
> <xs:element name="NAME"
>             type="xs:string"
>             dfdl:length="93"
>             dfdl:lengthKind="explicit"
>             dfdl:lengthUnits="bytes"
>             dfdl:textTrimKind="padChar"
>             dfdl:textStringPadCharacter="%SP;"
>             dfdl:textStringJustification="center"/>
> 
> Here is the set of bytes before parsing:
> 
> 4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 89 63
> 6F 73 73 65 20 20 …
> 
> Here is the set of bytes after parsing:
> 
> 4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C9 63 6F
> 73 73 65
> 
> The changed bytes (highlighted in yellow in the original message) are C3 89
> before parsing and C9 after.
> 
> That change to the element declaration has triggered other problems.
> 
> Here is text in the original binary file:
> 
> Nuevo León
> 
> Here is its binary:
> 
> 4E 75 65 76 6F 20 4C 65 C3 B3 6E
> 
> Here is the XML that parsing generates:
> 
> <NAME>Nuevo Le󮼯NAME>
> 
> Here is the binary:
> 
> 3C 4E 41 4D 45 3E 4E 75 65 76 6F 20 4C 65 F3 AE BC AF 4E 41 4D 45 3E
> 
> The part in grey corresponds to the data. The output data is the same as the
> input data up to hex 65 and then something strange happens.
> 
> You can see that the end tag </NAME> got mangled.
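One possible reading of the mangled end tag, offered as a hypothesis rather than a confirmed diagnosis: if the XML was written in a single-byte encoding such as ISO-8859-1, ó would be the single byte F3, and a lenient UTF-8 reader that treats F3 as the start of a four-byte sequence would absorb the next three bytes (n, <, /). The Python sketch below only reproduces that arithmetic. The function name and the assumed output encoding are illustrative and are not taken from Daffodil.

```python
def lenient_decode_4byte(seq: bytes) -> int:
    """Combine four bytes as a UTF-8 4-byte sequence WITHOUT checking that
    bytes 2-4 are real continuation bytes (a hypothetical lenient reader)."""
    return (
        ((seq[0] & 0x07) << 18)
        | ((seq[1] & 0x3F) << 12)
        | ((seq[2] & 0x3F) << 6)
        | (seq[3] & 0x3F)
    )

# Assumption: the infoset was written as ISO-8859-1, so "ón</" is F3 6E 3C 2F.
tail = "\u00f3n</".encode("iso-8859-1")

# Misread those four bytes as one code point, then write it back as UTF-8.
code_point = lenient_decode_4byte(tail)
reencoded = chr(code_point).encode("utf-8")

print(tail.hex(" ").upper())       # F3 6E 3C 2F
print(reencoded.hex(" ").upper())  # F3 AE BC AF
```

If that is what happened, the F3 AE BC AF sequence would come from how the output file was viewed or copied, not from the parse itself.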
> 
> Thoughts?
> 
> /Roger
> 
> *From:* Mike Beckerle <[email protected]>
> *Sent:* Wednesday, October 10, 2018 11:24 AM
> *To:* [email protected]
> *Subject:* Re: Why does Daffodil change the binary of non-ASCII characters?
> 
> Interesting,
> 
> So that error says it is looking for 80 utf-8 characters, not 80 bytes.
> 
> This is a supported behavior, but not typically what people want. Usually in 
> legacy formats (like dbase) lengths are in bytes.
> 
> If you have lengthUnits='characters' in iso-8859-1 that's identical to bytes, 
> but in utf8 it is clearly not the same as bytes.
> 
> Try lengthUnits="bytes".
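To make the characters-versus-bytes point concrete, here is a small Python sketch using the NAME text from this thread. The 93-byte field width comes from the element declaration. The padding direction is an arbitrary choice for illustration.

```python
# Sample value from the thread; \u00c9 is LATIN CAPITAL LETTER E WITH ACUTE.
text = "Nova Scotia / Nouvelle-\u00c9cosse"

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")

char_count = len(text)          # characters
utf8_len = len(utf8_bytes)      # bytes when encoded as UTF-8
latin1_len = len(latin1_bytes)  # bytes when encoded as ISO-8859-1

# A fixed-width field of 93 bytes, padded with spaces (illustrative padding).
field = utf8_bytes.ljust(93, b" ")
chars_in_field = len(field.decode("utf-8"))

print(char_count, latin1_len, utf8_len)  # 29 29 30
print(len(field), chars_in_field)        # 93 92
```

Because the 93-byte field holds only 92 characters, a parser asked for 93 characters would have to read past the end of the field.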
> 
> --------------------------------------------------------------------------------
> 
> *From:* Costello, Roger L. <[email protected]>
> *Sent:* Wednesday, October 10, 2018 11:21:17 AM
> *To:* [email protected]
> *Subject:* RE: Why does Daffodil change the binary of non-ASCII characters?
> 
> Hi Mike,
> 
> Below is the error message that I get when I change encoding to utf-8 (i.e., 
> encoding="utf-8"). Does that help narrow down the possible problem?  /Roger
> 
> [error] Parse Error: Failed to populate record[1832]. Cause: Parse Error:
> <SpecifiedLengthExplicitCharactersParser><STATEABB
> parser='StringOfSpecifiedLengthParser'
> /></SpecifiedLengthExplicitCharactersParser> - STATEABB - Parse failed.
> Failed to find exactly 80 characters.
> 
> Schema context: STATEABB Location line 115 column 42 in dBase.dfdl.xsd
> 
> Data location was preceding byte 652456.
> 
> Schema context: sequence Location line 81 column 26 in dBase.dfdl.xsd
> 
> Data location was preceding byte 652456
> 
> *From:* Mike Beckerle <[email protected]>
> *Sent:* Wednesday, October 10, 2018 11:03 AM
> *To:* [email protected]
> *Subject:* Re: Why does Daffodil change the binary of non-ASCII characters?
> 
> Your data is definitely UTF-8, or C3 89 would not be the LATIN CAPITAL LETTER
> E WITH ACUTE.
> 
> So using iso-8859-1 is going to do the wrong thing for sure.
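As a quick check of the point above, this Python sketch decodes the same two bytes under both encodings. It uses only the standard library and says nothing about Daffodil's internals.

```python
import unicodedata

raw = bytes.fromhex("C3 89")

as_utf8 = raw.decode("utf-8")         # one character
as_latin1 = raw.decode("iso-8859-1")  # two characters, one per byte

print(len(as_utf8), unicodedata.name(as_utf8))
print(len(as_latin1), [f"U+{ord(c):04X}" for c in as_latin1])
```

Under ISO-8859-1 the second byte becomes U+0089, a C1 control character, which is why the text no longer shows É.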
> 
> So let's figure out why your data fails to parse when specifying the correct 
> character set encoding, utf-8.
> 
> Your hex bytes as presented are all valid Utf-8 according to this site:
> 
> http://www.endmemo.com/unicode/unicodeconverter.php
> 
> So, maybe there's a utf-8 bug in daffodil?
> 
> --------------------------------------------------------------------------------
> 
> *From:* Costello, Roger L. <[email protected]>
> *Sent:* Wednesday, October 10, 2018 9:59:16 AM
> *To:* [email protected]
> *Subject:* Why does Daffodil change the binary of non-ASCII characters?
> 
> Hello DFDL community,
> 
> I have a binary file that contains, among other things, this text:
> 
> Nova Scotia / Nouvelle-Écosse
> 
> Its corresponding hex binary is this:
> 
> 4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 89 63
> 6F 73 73 65 20 …
> 
> I used this element declaration in my DFDL schema to parse that binary:
> 
> <xs:element    name="NAME"
>                         type="xs:string"
>                         dfdl:length="93"
>                          dfdl:lengthKind="explicit"
>                         dfdl:lengthUnits="characters"
>                          dfdl:textTrimKind="padChar"
>                          dfdl:textStringPadCharacter="%SP;"
>                          dfdl:textStringJustification="center"/>
> 
> Surprisingly, during parsing Daffodil modified the text to this:
> 
> Nova Scotia / Nouvelle-Ã?cosse
> 
> With this corresponding hex binary:
> 
> 4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 3F 63
> 6F 73 73 65 20 …
> 
> The part in yellow changed -- from C3 89 (original) to C3 3F (after parsing).
> 
> Hex C3 89 is the UTF-8 encoding of the É symbol, whereas C3 3F is not a valid
> UTF-8 byte sequence.
> 
> Why did Daffodil change the binary?
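One way C3 89 could turn into C3 3F, sketched in Python under an assumption the thread does not confirm: the bytes are decoded as ISO-8859-1 and the resulting text is then written out in windows-1252 (cp1252) with unmappable characters replaced by "?". The choice of cp1252 for the output side is a guess that happens to match the reported bytes.

```python
original = bytes.fromhex("C3 89")

# Step 1: decode as ISO-8859-1, giving U+00C3 ('Ã') followed by U+0089.
decoded = original.decode("iso-8859-1")

# Step 2 (assumed): write the text as cp1252. U+0089 has no cp1252 byte,
# so errors="replace" substitutes '?' (0x3F).
reencoded = decoded.encode("cp1252", errors="replace")

print(reencoded.hex(" ").upper())  # C3 3F
```

This would also match the displayed text Ã? in the message above.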
> 
> One other piece of the puzzle: in my DFDL schema I specify
> encoding="ISO-8859-1". For a reason I do not understand, when I
> specify encoding="utf-8" I get an error message on parse.
> 
> Please help!
> 
> /Roger
> 
