Re: Bug in Daffodil?

Steve Lawrence Mon, 08 Apr 2019 05:34:58 -0700

One potential workaround is to change your output to JSON, which has no
problem with CRs in the data. You'd end up with something like this:


  {
    "input": {
      "prolog": "PROLOG",
      "payload": {
        "field": [
          "A",
          "B",
          "C\r\n",
          "D",
          "E\r",
          "F"
        ]
      }
    }
  }

If you really need XML, you might be able to find a JSON-to-XML converter
that preserves CRs.
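
To make the difference concrete, here is a minimal Python sketch. It is not
Daffodil itself: it only uses the standard library's json and
xml.etree.ElementTree modules, and the helper names are made up for
illustration. It suggests that a JSON round trip keeps CR and CRLF, whereas an
XML parser normalizes both to LF when reading element text:

```python
import json
import xml.etree.ElementTree as ET

# Field values taken from the JSON example above.
FIELDS = ["A", "B", "C\r\n", "D", "E\r", "F"]


def json_round_trip(values):
    """Serialize the values to JSON text and parse them back."""
    return json.loads(json.dumps(values))


def xml_round_trip(values):
    """Write each value as raw element text, then parse the document back."""
    # These sample values contain no '<' or '&', so no escaping is needed.
    body = "".join(f"<field>{v}</field>" for v in values)
    root = ET.fromstring(f"<payload>{body}</payload>")
    return [el.text for el in root.findall("field")]


# True if every CR survived the JSON round trip.
print(json_round_trip(FIELDS) == FIELDS)
# repr-style list output makes any remaining \r or \n visible.
print(xml_round_trip(FIELDS))
```

The XML half writes the CRs literally into the document, which is the
situation the infoset output is in today.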

Another option would be to do something similar to what Julian hinted
at, which is to model the CR in the data as syntax. You could make your
field element a sequence of fields that are CR separated. Then, if a field
contained a CR, you'd end up with extra fields, something like this:

  <xs:element name="field" maxOccurs="unbounded">
    <xs:complexType>
      <xs:sequence dfdl:separator="%CR;">
        <xs:element name="fields" type="xs:string"
          maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

Your infoset gets a little messier, looking something like this:

  <input>
    <prolog>PROLOG</prolog>
    <payload>
      <field>
        <fields>A</fields>
      </field>
      <field>
        <fields>B</fields>
      </field>
      <field>
        <fields>C</fields>
        <fields>
</fields>
      </field>
      <field>
        <fields>D</fields>
      </field>
      <field>
        <fields>E</fields>
        <fields></fields>
      </field>
      <field>
        <fields>F</fields>
      </field>
    </payload>
  </input>

Note that the C field is split into two, with the second fields element
only containing a LF. And the E element is split in two, with the second
fields element being the empty string. It's not ideal, but is a
potential workaround.
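
The splitting described above can be mimicked outside Daffodil with plain
string operations. This is only a rough Python model of an infix CR separator
(the function names are hypothetical, not Daffodil API), but it shows why the
extra fields elements appear and why joining them with CR gives back the
original field:

```python
CR = "\r"


def split_on_cr(field):
    """Model an infix CR separator: each CR starts a new 'fields' item."""
    return field.split(CR)


def join_with_cr(parts):
    """Model unparsing: put the CR separator back between the items."""
    return CR.join(parts)


for field in ["A", "C\r\n", "E\r"]:
    parts = split_on_cr(field)
    # repr() makes the LF and the empty string visible.
    print(repr(field), "->", parts, "->", repr(join_with_cr(parts)))
```

Because the CR never appears as element content in this model, there is
nothing for the XML line-ending normalization to change except the LF, which
is already legal.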


Though, we really just need to fix DAFFODIL-1559. Julian or Roger, if
either of you have any interest and time in Daffodil/Scala development,
I think this would probably be a good beginner bug, and we'd be happy to
provide guidance.

- Steve


On 4/8/19 7:46 AM, Lander, Julian C. wrote:
> I was the person who passed the original problem on to Roger. I was able to
> get a simple case to work by capturing the CRLF or LF combinations in hidden
> groups, not treating them as separators or terminators.
>
> Trying this in a more sophisticated DFDL schema--the actual problem has other
> fields before and after the payload with the mixed linefeed combinations--
> caused Daffodil to infinite loop. That's where I'm stuck.
> 
> Julian
> 
> ---------------------------------------------------------------
> Dr. Julian C. Lander
> Lead Software Engineer
> MITRE
> 
> Mail Stop M360
> The MITRE Corporation
> 202 Burlington Road
> Bedford, MA   01730-1420
> 781-271-4516 
> 
> 
> -----Original Message-----
> From: Costello, Roger L. <coste...@mitre.org> 
> Sent: Monday, April 08, 2019 7:42 AM
> To: users@daffodil.apache.org
> Subject: Re: Bug in Daffodil?
> 
> Thanks Steve.
> 
> Is there a workaround? I need the output of unparsing to exactly match the
> original input.
> 
> /Roger
> 
> 
> -----Original Message-----
> From: Steve Lawrence <slawre...@apache.org>
> Sent: Friday, April 5, 2019 10:45 AM
> To: users@daffodil.apache.org; Costello, Roger L. <coste...@mitre.org>
> Subject: [EXT] Re: Bug in Daffodil?
> 
> This is actually the expected behavior, though it's maybe not always
> desired.
> 
> The issue here is that XML is not allowed to contain CR's, only LF's are
> allowed. So when we output infoset data, all CRLF's are converted to LF, and
> any lone CR's are also converted to LF. Unfortunately, if your data fields
> contain a CR, it's going to get lost. In a lot of cases this is fine, since
> lots of formats don't care about CRLF vs LF. But there are definitely some
> places where it matters.
> 
> DAFFODIL-1559 [1] is the issue for allowing this behavior to be changed. One
> option would be to convert CR characters in the data to the private use area
> like we do with other illegal XML characters, but that makes the infoset
> less useful. Another option might be to say that whenever an LF appears in
> the data, we just always unparse it as a CRLF. This means if your data mixes
> CRLF and LF, we'd always output CRLF, but that's probably not a big deal if
> mixing is allowed in the format.
> 
> - Steve
> 
> [1] https://issues.apache.org/jira/browse/DAFFODIL-1559
> 
> On 4/5/19 9:25 AM, Costello, Roger L. wrote:
>> Hello DFDL community,
>>
>> My input file consists of a prolog of known format and a payload 
>> surrounded by parentheses. The payload consists of a series of text 
>> fields separated by hyphens. In some cases, the hyphen can be preceded 
>> by a new line, which can be a carriage return or CRLF combination.
>>
>> Here is a sample input file; I show it in a hex editor so you can see 
>> that some hyphens are preceded by CRLF and others by just a CR.
>>
>> Here is my DFDL schema:
>>
>> <xs:element name="input">
>>   <xs:complexType>
>>     <xs:sequence>
>>       <xs:element name="prolog" type="xs:string" dfdl:terminator="%NL;"/>
>>       <xs:element name="payload" dfdl:initiator="(" dfdl:terminator=")">
>>         <xs:complexType>
>>           <xs:sequence dfdl:separator="-" dfdl:separatorPosition="infix">
>>             <xs:element name="field" type="xs:string" maxOccurs="unbounded"/>
>>           </xs:sequence>
>>         </xs:complexType>
>>       </xs:element>
>>     </xs:sequence>
>>   </xs:complexType>
>> </xs:element>
>>
>> When I parse the input file using the DFDL schema, I get this XML:
>>
>> <input>
>> <prolog>PROLOG</prolog>
>> <payload>
>> <field>A</field>
>> <field>B</field>
>> <field>C
>> </field>
>> <field>D</field>
>> <field>E
>> </field>
>> <field>F</field>
>> </payload>
>> </input>
>>
>> That’s perfect.
>>
>> When I unparse the XML I get this (please note the bug (?) described in
>> yellow):
>>
>
