For reference, here is the ticket to add a way to disable this behavior:

https://issues.apache.org/jira/browse/DAFFODIL-1559


On 2025-06-02 03:03 PM, Mike Beckerle wrote:
Parsing with iso-8859-1 preserves all bytes from native form into the DFDL 
Infoset.

But ... then Daffodil is projecting the DFDL infoset into XML.

It is this XML conversion step that is causing the problem.

XML reading does not preserve CRLFs. On input, XML readers convert CRLF to LF, and a standalone CR to LF as well.
Your data has CRCRLF, so that becomes two LFs.
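
To make the normalization rule concrete, here is a minimal Python sketch (not Daffodil code). The sample content "AB"/"CD" and the helper name are made up for illustration; only the CR CR LF sequence comes from the data in question. It uses Python's standard-library ElementTree parser as a stand-in for an XML reader.

```python
import xml.etree.ElementTree as ET


def normalize_line_endings(text: str) -> str:
    """Apply the XML end-of-line rule: CRLF -> LF, then any lone CR -> LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Hypothetical sample: "AB", then CR CR LF, then "CD".
original = "AB\r\r\nCD"

# Element content as returned by the XML parser after reading the raw text.
parsed = ET.fromstring("<file_string>" + original + "</file_string>").text

print(repr(normalize_line_endings(original)))
print(repr(parsed))
```

Both results contain two LFs and no CR, so the original bytes cannot be recovered from the XML text alone.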

(This is one of several reasons why, in hindsight, XML isn't a very good data language. I.e., it's not just that it is verbose!)

Unlike the illegal XML characters, which we have no choice but to remap into the Unicode private use area (aka PUA) (as detailed here: https://daffodil.apache.org/infoset/ under the heading "XML Illegal Characters"), Daffodil really does need a "preserveCR" flag of some kind, as CR isn't technically an "illegal character" in XML data.

The workaround I have used and suggested in the past is to model a string which can contain CR as an array of strings separated by CR.
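
The idea behind that workaround can be sketched in Python. This is a conceptual illustration only, not DFDL or Daffodil code, and the function names and sample data are hypothetical. Because CR acts as the separator between array elements, it never appears inside any element's text, so XML line-ending normalization has nothing to change, and the CRs are written back as separators on unparse.

```python
from typing import List

CR = "\r"


def parse_as_array(data: str) -> List[str]:
    """Split on CR so that no element value contains a CR."""
    return data.split(CR)


def unparse_from_array(items: List[str]) -> str:
    """Re-insert CR between elements, the way a separator is written out."""
    return CR.join(items)


original = "AB\r\r\nCD"  # hypothetical sample containing CR CR LF
items = parse_as_array(original)

# None of the values carries a CR, so XML normalization leaves them intact.
assert all(CR not in item for item in items)
assert unparse_from_array(items) == original
```

Note that consecutive CRs produce an empty element, so a real schema would have to allow empty strings in the array.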




On Mon, Jun 2, 2025 at 2:29 PM Mark Kozak <mark.ko...@adeptus-cs.com> wrote:

    Hello folks,

    Section 11.2.3 of the documentation says that if I use the ISO-8859-1
    encoding, all bytes will be preserved.

    So I have a simple text file that has the following text, represented as
    hex:

    Using the following schema, I get the expected xml on parse

       <element name="file">
         <complexType>
           <sequence>
             <element name="file_string" type="xs:string"
                      dfdl:lengthKind="delimited" dfdl:encoding="ISO-8859-1"/>
           </sequence>
         </complexType>
       </element>

    But when unparsing, one 0D is dropped, and one is converted to 0A as shown
    below:

    What am I missing to actually preserve all bytes?

    Thanks,
    Mark

    Mark Kozak
    Director of Engineering
    Adeptus Cyber Solutions
    Adeptus-CS.com

