Re: Bug in Daffodil?

Steve Lawrence Mon, 08 Apr 2019 11:05:05 -0700

The issue here is not that there is no LF after the CR, but that there is
nothing after the CR. In this case, the last field-part is a string of
zero length, so dfdl:separatorSuppressionPolicy (SSP) plays a role in how
that empty element and its separator are unparsed.
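
To make the zero-length field-part concrete, here is a rough sketch using
plain Python string splitting. It is only an analogy for the nested "-" and
CR separators, not Daffodil and not a statement of how SSP actually behaves.
The payload is the sample data from the JSON example in this thread, and the
helper names split_fields and join_fields are made up for illustration.

```python
# Plain string splitting, not Daffodil: a rough analogy for nested separators.
payload = "A-B-C\r\n-D-E\r-F"  # sample data from the JSON example in this thread


def split_fields(text):
    """Split on '-' (outer separator), then on CR (inner separator)."""
    return [field.split("\r") for field in text.split("-")]


def join_fields(fields, drop_trailing_empty=False):
    """Rejoin the parts; optionally drop trailing empty parts first."""
    out = []
    for parts in fields:
        parts = list(parts)
        if drop_trailing_empty:
            # Analogy only: discard empty trailing parts and their separator.
            while len(parts) > 1 and parts[-1] == "":
                parts.pop()
        out.append("\r".join(parts))
    return "-".join(out)


fields = split_fields(payload)
print(fields)
print(join_fields(fields) == payload)
print(join_fields(fields, drop_trailing_empty=True) == payload)
```

In this sketch the "E\r" field splits into "E" plus an empty string. Keeping
the empty part reproduces the input exactly, while dropping it loses the
trailing CR, which is why the handling of that empty element matters for a
byte-exact round trip.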


Unfortunately, I can't seem to get things to work with different
values of SSP. There are known issues with that property, but I don't
know the details, and it's not clear whether this fails to unparse as
you expect because of those issues or because that's just how SSP works.

Mike is working on fixing the SSP bugs--he'd have a better idea if this
is expected behavior or not.

But again, the better overall solution would be to just fix things so
that we can output CR to the XML. From what I can tell, it's legal to
have "&#xD;" in XML, so we should just have a tunable to output that to
the infoset instead of converting CRLF/CR to LF.
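
A quick way to see the difference between a literal CR and the "&#xD;"
character reference is to run both through an XML parser. The sketch below
uses Python's standard xml.etree.ElementTree purely as a stand-in XML
parser; it says nothing about how Daffodil itself reads or writes infosets.

```python
import xml.etree.ElementTree as ET

# A literal CR (or CRLF) in element content is normalized to LF by the parser.
literal_cr = ET.fromstring("<field>E\r</field>").text
literal_crlf = ET.fromstring("<field>C\r\n</field>").text

# A CR written as a character reference survives parsing unchanged.
escaped_cr = ET.fromstring("<field>E&#xD;</field>").text
escaped_crlf = ET.fromstring("<field>C&#xD;\n</field>").text

print(repr(literal_cr))
print(repr(literal_crlf))
print(repr(escaped_cr))
print(repr(escaped_crlf))
```

So an infoset writer that emitted "&#xD;" for each CR would let an XML
consumer recover the original CR and CRLF sequences, whereas literal CRs are
lost to end-of-line normalization.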

- Steve


On 4/8/19 1:44 PM, Costello, Roger L. wrote:
> Here’s a graphic which shows the CR that is now getting dropped:
> 
> *From:* Costello, Roger L. <coste...@mitre.org>
> *Sent:* Monday, April 8, 2019 1:16 PM
> *To:* Steve Lawrence <slawre...@apache.org>; users@daffodil.apache.org; 
> Lander, 
> Julian C. <jclan...@mitre.org>
> *Subject:* Re: Bug in Daffodil?
> 
> Hi Steve,
> 
> I implemented your idea of modeling CR as syntax. (dfdl:separator="%CR;" and 
> dfdl:separatorPosition="infix"). That works great when CR is sandwiched 
> between 
> text and LF but it fails when CR has no following LF. See below. Suggestions? 
> /Roger
> 
> Here is my DFDL schema:
> 
> <xs:element name="input">
>   <xs:complexType>
>     <xs:sequence>
>       <xs:element name="prolog" type="xs:string" dfdl:terminator="%NL;"/>
>       <xs:element name="payload" dfdl:initiator="(" dfdl:terminator=")">
>         <xs:complexType>
>           <xs:sequence dfdl:separator="-" dfdl:separatorPosition="infix">
>             <xs:element name="field" maxOccurs="unbounded">
>               <xs:complexType>
>                 <xs:sequence dfdl:separator="%CR;" dfdl:separatorPosition="infix">
>                   <xs:element name="field-part" type="xs:string" maxOccurs="unbounded"/>
>                 </xs:sequence>
>               </xs:complexType>
>             </xs:element>
>           </xs:sequence>
>         </xs:complexType>
>       </xs:element>
>     </xs:sequence>
>   </xs:complexType>
> </xs:element>
> 
> -----Original Message-----
> From: Steve Lawrence <slawre...@apache.org <mailto:slawre...@apache.org>>
> Sent: Monday, April 8, 2019 8:34 AM
> To: users@daffodil.apache.org <mailto:users@daffodil.apache.org>; Lander, 
> Julian 
> C. <jclan...@mitre.org <mailto:jclan...@mitre.org>>; Costello, Roger L. 
> <coste...@mitre.org <mailto:coste...@mitre.org>>
> Subject: [EXT] Re: Bug in Daffodil?
> 
> One potential workaround is to change your output to JSON, which has no 
> problem 
> with CR's in the data. You'd end up with something like this:
> 
>    {
> 
>      "input": {
> 
>        "prolog": "PROLOG",
> 
>        "payload": {
> 
>          "field": [
> 
>            "A",
> 
>            "B",
> 
>            "C\r\n",
> 
>            "D",
> 
>            "E\r",
> 
>            "F"
> 
>          ]
> 
>        }
> 
>      }
> 
>    }
> 
> If you really need XML, you might be able to find a JSON-to-XML converter
> that maintains CRs.
> 
> Another option would be to do something similar to what Julian hinted at,
> which is to model the CR in the data as syntax. You could make your field
> element a sequence of fields that are CR separated. If a field contained a
> CR, you'd end up with extra fields, something like this:
> 
>    <xs:element name="field" maxOccurs="unbounded">
> 
>      <xs:complexType>
> 
>        <xs:sequence dfdl:separator="%CR;">
> 
>          <xs:element name="fields" type="xs:string"
> 
>            maxOccurs="unbounded" />
> 
>        </xs:sequence>
> 
>      </xs:complexType>
> 
>    </xs:element>
> 
> Your infoset gets a little messier, looking something like this:
> 
>    <input>
> 
>      <prolog>PROLOG</prolog>
> 
>      <payload>
> 
>        <field>
> 
>          <fields>A</fields>
> 
>        </field>
> 
>        <field>
> 
>          <fields>B</fields>
> 
>        </field>
> 
>        <field>
> 
>          <fields>C</fields>
> 
>          <fields>
> 
> </fields>
> 
>        </field>
> 
>        <field>
> 
>          <fields>D</fields>
> 
>        </field>
> 
>        <field>
> 
>          <fields>E</fields>
> 
>          <fields></fields>
> 
>        </field>
> 
>        <field>
> 
>          <fields>F</fields>
> 
>        </field>
> 
>      </payload>
> 
>    </input>
> 
> Note that the C field is split into two, with the second fields element only 
> containing a LF. And the E element is split in two, with the second fields 
> element being the empty string. It's not ideal, but is a potential workaround.
> 
> Though, we really just need to fix DAFFODIL-1559. Julian or Roger, if either 
> of 
> you have any interest and time in Daffodil/Scala development, I think this 
> would 
> probably be a good beginner bug, and we'd be happy to provide guidance.
> 
> - Steve
> 
> On 4/8/19 7:46 AM, Lander, Julian C. wrote:
> 
>  > I was the person who passed the original problem on to Roger. I was
> 
>  > able to get a simple case to work by capturing the CRLF or LF
> 
>  > combinations in hidden groups, not treating them as separators or
> 
>  > terminators.
> 
>  >
> 
>  > Trying this in a more sophisticated DFDL schema--the actual problem
> 
>  > has other fields before and after the payload with the mixed linefeed
> 
>  > combinations--caused Daffodil to infinite loop.  That's where I'm
> 
>  > stuck.
> 
>  >
> 
>  > Julian
> 
>  >
> 
>  > ---------------------------------------------------------------
> 
>  > Dr. Julian C. Lander
> 
>  > Lead Software Engineer
> 
>  > MITRE
> 
>  >
> 
>  > Mail Stop M360
> 
>  > The MITRE Corporation
> 
>  > 202 Burlington Road
> 
>  > Bedford, MA   01730-1420
> 
>  > 781-271-4516
> 
>  >
> 
>  >
> 
>  > -----Original Message-----
> 
>  > From: Costello, Roger L. <coste...@mitre.org <mailto:coste...@mitre.org>>
> 
>  > Sent: Monday, April 08, 2019 7:42 AM
> 
>  > To: users@daffodil.apache.org <mailto:users@daffodil.apache.org>
> 
>  > Subject: Re: Bug in Daffodil?
> 
>  >
> 
>  > Thanks Steve.
> 
>  >
> 
>  > Is there a workaround? I need the output of unparsing to exactly match
> 
>  > the original input.
> 
>  >
> 
>  > /Roger
> 
>  >
> 
>  >
> 
>  > -----Original Message-----
> 
>  > From: Steve Lawrence <slawre...@apache.org <mailto:slawre...@apache.org>>
> 
>  > Sent: Friday, April 5, 2019 10:45 AM
> 
>  > To: users@daffodil.apache.org <mailto:users@daffodil.apache.org>; 
> Costello, 
> Roger L. <coste...@mitre.org <mailto:coste...@mitre.org>>
> 
>  > Subject: [EXT] Re: Bug in Daffodil?
> 
>  >
> 
>  > This is actually the expected behavior, though it's maybe not always
> 
>  > desired.
> 
>  >
> 
>  > The issue here is that XML is not allowed to contain CR's, only LF's
> 
>  > are allowed. So when we output infoset data, all CRLF's are converted
> 
>  > to LF, and any lone CR's are also converted to LF. Unfortunately, if
> 
 > your data fields contain a CR, it's going to get lost. In a lot of
> 
>  > cases this is fine, since lots of formats don't care about CRLF vs LF.
> 
>  > But there are definitely some places where it matters.
> 
>  >
> 
 > DAFFODIL-1559 [1] is the issue for making this behavior changeable. One
> 
 > option would be to convert CR characters in the data to a private use
> 
>  > area like we do with other illegal XML characters, but that makes the
> 
>  > infoset less useful. Another option might be to say that whenever an
> 
>  > LF appears in the data, we just always unparse it as a CRLF. This
> 
>  > means if your data mixes CRLF and LF, we'd always output CRLF, but
> 
>  > that's probably not a big deal if mixing is allowed in the format.
> 
>  >
> 
>  > - Steve
> 
>  >
> 
>  > [1] https://issues.apache.org/jira/browse/DAFFODIL-1559
> 
>  >
> 
>  > On 4/5/19 9:25 AM, Costello, Roger L. wrote:
> 
>  >> Hello DFDL community,
> 
>  >>
> 
>  >> My input file consists of a prolog of known format and a payload
> 
>  >> surrounded by parentheses. The payload consists of a series of text
> 
>  >> fields separated by hyphens. In some cases, the hyphen can be
> 
>  >> preceded by a new line, which can be a carriage return or CRLF 
> combination.
> 
>  >>
> 
>  >> Here is a sample input file; I show it in a hex editor so you can see
> 
>  >> that some hyphens are preceded by CRLF and others by just a CR.
> 
>  >>
> 
>  >> Here is my DFDL schema:
> 
>  >>
> 
 >> <xs:element name="input">
> 
 >>   <xs:complexType>
> 
 >>     <xs:sequence>
> 
 >>       <xs:element name="prolog" type="xs:string" dfdl:terminator="%NL;"/>
> 
 >>       <xs:element name="payload" dfdl:initiator="(" dfdl:terminator=")">
> 
 >>         <xs:complexType>
> 
 >>           <xs:sequence dfdl:separator="-" dfdl:separatorPosition="infix">
> 
 >>             <xs:element name="field" type="xs:string" maxOccurs="unbounded"/>
> 
 >>           </xs:sequence>
> 
 >>         </xs:complexType>
> 
 >>       </xs:element>
> 
 >>     </xs:sequence>
> 
 >>   </xs:complexType>
> 
 >> </xs:element>
> 
>  >>
> 
>  >> When I parse the input file using the DFDL schema, I get this XML:
> 
>  >>
> 
>  >> <input>
> 
>  >> <prolog>PROLOG</prolog>
> 
>  >> <payload>
> 
>  >> <field>A</field>
> 
>  >> <field>B</field>
> 
>  >> <field>C
> 
>  >> </field>
> 
>  >> <field>D</field>
> 
>  >> <field>E
> 
>  >> </field>
> 
>  >> <field>F</field>
> 
>  >> </payload>
> 
>  >> </input>
> 
>  >>
> 
>  >> That’s perfect.
> 
>  >>
> 
>  >> When I unparse the XML I get this (please note the bug (?) described
> 
>  >> in
> 
>  > yellow):
> 
>  >>
> 
>  >
>
