Hi Mike,

Below is the error message that I get when I change the encoding to utf-8 (i.e., encoding="utf-8"). Does that help narrow down the possible problem?

/Roger

[error] Parse Error: Failed to populate record[1832]. Cause: Parse Error: <SpecifiedLengthExplicitCharactersParser><STATEABB parser='StringOfSpecifiedLengthParser' /></SpecifiedLengthExplicitCharactersParser> - STATEABB - Parse failed. Failed to find exactly 80 characters.
Schema context: STATEABB Location line 115 column 42 in dBase.dfdl.xsd
Data location was preceding byte 652456.
Schema context: sequence Location line 81 column 26 in dBase.dfdl.xsd
Data location was preceding byte 652456

From: Mike Beckerle <[email protected]>
Sent: Wednesday, October 10, 2018 11:03 AM
To: [email protected]
Subject: Re: Why does Daffodil change the binary of non-ASCII characters?

Your data is definitely UTF-8, or C3 89 would not be the LATIN CAPITAL LETTER E WITH ACUTE. So using iso-8859-1 is going to do the wrong thing for sure.

So let's figure out why your data fails to parse when specifying the correct character set encoding, utf-8.

Your hex bytes as presented are all valid UTF-8 according to this site: http://www.endmemo.com/unicode/unicodeconverter.php

So, maybe there's a UTF-8 bug in Daffodil?

________________________________
From: Costello, Roger L. <[email protected]>
Sent: Wednesday, October 10, 2018 9:59:16 AM
To: [email protected]
Subject: Why does Daffodil change the binary of non-ASCII characters?

Hello DFDL community,

I have a binary file that contains, among other things, this text:

Nova Scotia / Nouvelle-Écosse

Its corresponding hex binary is this:

4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 89 63 6F 73 73 65 20 ...

I used this element declaration in my DFDL schema to parse that binary:

<xs:element name="NAME" type="xs:string"
            dfdl:length="93"
            dfdl:lengthKind="explicit"
            dfdl:lengthUnits="characters"
            dfdl:textTrimKind="padChar"
            dfdl:textStringPadCharacter="%SP;"
            dfdl:textStringJustification="center"/>

Surprisingly, during parsing Daffodil modified the text to this:

Nova Scotia / Nouvelle-Ã?cosse

With this corresponding hex binary:

4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 2D C3 3F 63 6F 73 73 65 20 ...

The part highlighted in yellow changed, from C3 89 (original) to C3 3F (after parsing). Hex C3 89 is the UTF-8 encoding of the É symbol, whereas C3 3F is not a valid UTF-8 byte sequence. Why did Daffodil change the binary?

One other piece of the puzzle: in my DFDL schema I specify encoding="ISO-8859-1". For a reason I do not understand, when I specify encoding="utf-8" I get an error message on parse.

Please help!

/Roger
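
For readers who want to check the byte-level claims in this thread, here is a minimal Python sketch. It decodes the quoted hex bytes (without the elided "..." tail) as both UTF-8 and ISO-8859-1 and tests whether C3 3F is valid UTF-8. It uses only Python's built-in codecs and does not involve Daffodil; the names `HEX`, `raw`, and `is_valid_utf8` are illustrative and do not come from the original messages.

```python
# Hex bytes quoted in the message; the trailing "..." is not included.
HEX = ("4E 6F 76 61 20 53 63 6F 74 69 61 20 2F 20 4E 6F 75 76 65 6C 6C 65 "
       "2D C3 89 63 6F 73 73 65 20")
raw = bytes.fromhex(HEX)

# Decode the same bytes under the two encodings discussed in the thread.
as_utf8 = raw.decode("utf-8")          # C3 89 becomes one character, U+00C9
as_latin1 = raw.decode("iso-8859-1")   # C3 and 89 become two characters


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# C3 is a UTF-8 lead byte and must be followed by a byte in 80..BF.
c3_3f_is_valid_utf8 = is_valid_utf8(bytes.fromhex("C3 3F"))

print(repr(as_utf8), len(as_utf8))
print(repr(as_latin1), len(as_latin1))
print(len(raw), c3_3f_is_valid_utf8)
```

The 31 bytes decode to 30 characters as UTF-8 but 31 characters as ISO-8859-1, so with dfdl:lengthUnits="characters" the number of bytes consumed by a field can depend on the encoding. Whether that difference accounts for the "Failed to find exactly 80 characters" error is not established in this thread.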
