Re: Bug in Daffodil?

Beckerle, Mike Wed, 10 Apr 2019 10:42:49 -0700

Here's what I would try:

Add the close paren to the regex:

dfdl:lengthPattern=".*?(?=(\x0D-|\x0D\x0A-|-|\)$))"

                                             ^^ right there.

This will cause it to stop when it encounters the close paren without including 
it in the match.
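
The lookahead can be sanity-checked outside Daffodil by running the same pattern through an ordinary regex engine. The sketch below uses Python's re module, which is an assumption: it is not the engine Daffodil uses, and details such as how "." and "$" treat line endings may differ. CRLF is written as \x0D\x0A here, and the helper name value_of is hypothetical.

```python
import re

# Same shape as the lengthPattern above; CRLF is spelled \x0D\x0A (assumption).
PATTERN = re.compile(r".*?(?=(\x0D-|\x0D\x0A-|-|\)$))")

def value_of(data: str) -> str:
    """Return the text the lazy match consumes before a delimiter or the final ')'."""
    m = PATTERN.match(data)
    if m is None:
        raise ValueError("no delimiter or closing paren found")
    return m.group(0)

print(repr(value_of("A-B")))      # stops before '-'
print(repr(value_of("C\r\n-D")))  # stops before CRLF '-'
print(repr(value_of("E\r-F)")))   # stops before CR '-'
print(repr(value_of("F)")))       # stops before the closing ')'
```

In every case the delimiter (or the close paren) stays in the input, so whatever follows the value can still consume it.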

Then you have to add a fourth case to the choice, like so:

<xs:choice>
    <xs:element name="crDash" dfdl:initiator="%CR;-" type="xs:string"/>
    <xs:element name="crlfDash" dfdl:initiator="%CR;%LF;-" type="xs:string"/>
    <xs:element name="dash" dfdl:initiator="-" type="xs:string"/>
    <xs:element name="none" type="xs:string" dfdl:lengthKind="pattern"
                dfdl:lengthPattern=".*(?=\))"/>
</xs:choice>

That will allow for a <none/> zero length 'delimiter' when the close paren is 
encountered.

Then the enclosing context will parse the ")" as terminator.

I can't ensure this is correct, but this is the concept.
________________________________
From: Costello, Roger L. <coste...@mitre.org>
Sent: Wednesday, April 10, 2019 11:28 AM
To: users@daffodil.apache.org; Beckerle, Mike
Subject: Re: Bug in Daffodil?

Thank you Mike – that is a fantastic solution.

But, but, but ….

I’ve got one tiny problem: what about the right parenthesis at the end? The 
input starts with a left parenthesis and ends with a right parenthesis, e.g.,

PROLOG
(A-B-C
-D-E
-F)

Notice that the last field is F and it is not followed by dash, CR/dash, nor 
CRLF/dash.

What to do?

Below is my schema. I am getting this error:

[error] Parse Error: Failed to populate field[1]. Cause: Parse Error: All 
choice alternatives failed. Reason(s): List(Parse Error: Alternative failed. 
Reason(s): List(Parse Error: Found out of scope delimiter: ')' ')'

<xs:element name="input">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="prolog" type="xs:string" dfdl:terminator="%NL;" />
            <xs:element name="payload" dfdl:initiator="(" dfdl:terminator=")">
                <xs:complexType>
                    <xs:sequence>
                        <xs:element name="field" maxOccurs="unbounded">
                            <xs:complexType>
                                <xs:sequence>
                                    <xs:element name="value" type="xs:string"
                                        dfdl:lengthPattern=".*?(?=(\x0D-|\x0D\x0A-|-|$))" />
                                    <xs:choice>
                                        <xs:element name="crDash"
                                            dfdl:initiator="%CR;-" type="xs:string" />
                                        <xs:element name="crlfDash"
                                            dfdl:initiator="%CR;%LF;-" type="xs:string" />
                                        <xs:element name="dash"
                                            dfdl:initiator="-" type="xs:string" />
                                    </xs:choice>
                                </xs:sequence>
                            </xs:complexType>
                        </xs:element>
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
        </xs:sequence>
    </xs:complexType>
</xs:element>

From: Beckerle, Mike <mbecke...@tresys.com>
Sent: Wednesday, April 10, 2019 8:49 AM
To: users@daffodil.apache.org
Subject: [EXT] Re: Bug in Daffodil?

I'm late in trying to interact with this thread, and I fear I may have missed 
some messages.

But here goes....

DFDL does not capture and preserve which of multiple possible delimiters was 
used. When there is more than one possible delimiter, the first one is 
considered canonical and is used for unparsing, even though when parsing, the 
longest match is used.

So if you have dfdl:terminator="- %CR;- %CR;%LF;-" then on unparsing you'll 
always unparse these as just "-". No CR or CRLF will *ever* be output, 
regardless of what was found when parsing.

That means if you round-trip data (parse then unparse) it will canonicalize the 
delimiters. What you get out is considered a canonical form of the data.  This 
will usually NOT get back the exact same output as input, but you will get what 
the DFDL schema specifies is an equivalent canonical form. If you parse this 
again, the infoset you get should be the same as the infoset from the first 
parse. This is what we call a "twoPass" round trip.
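
The canonicalizing behavior can be imitated with a toy model. This is only a sketch in Python, not Daffodil code: the functions parse and unparse are hypothetical stand-ins, the delimiters are treated as separators between fields for simplicity, and "longest match" is approximated by listing the longer alternatives first.

```python
import re

# Longer alternatives first, so the longest delimiter wins on parse.
DELIMITER = re.compile(r"\r\n-|\r-|-")
CANONICAL = "-"  # the first delimiter listed in the dfdl:terminator example

def parse(data: str) -> list:
    """Split on any accepted delimiter; which one was used is not recorded."""
    return DELIMITER.split(data)

def unparse(fields: list) -> str:
    """Always write the canonical delimiter."""
    return CANONICAL.join(fields)

original = "A-B-C\r\n-D-E\r-F"
infoset1 = parse(original)
canonical = unparse(infoset1)
infoset2 = parse(canonical)

print(infoset1)               # ['A', 'B', 'C', 'D', 'E', 'F']
print(canonical == original)  # False: the CR and CRLF are gone
print(infoset1 == infoset2)   # True: the second parse gives the same infoset
```

The output text differs from the input, but parsing it again gives the same fields, which is the "twoPass" property described above.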

If that isn't the behavior you want, because it is significant and important in 
the format exactly which delimiters were used, then the delimiters are not just 
delimiters. They are carrying some additional information/significance that 
must be captured by the Infoset in order for the DFDL schema to accurately 
represent the information content.

To do that you must do what I call modeling syntax as data. That is, you must 
capture the specific delimiters in elements so that the significance of which 
specific delimiter was used is captured in the infoset.

Suppose your delimiter is either "CR-", "CRLF-", or just "-".

To parse an element delimited by this and capture which delimiter specifically 
was found, you must use dfdl:lengthKind='pattern' and regular expressions with 
lookahead:

<element name="foo" dfdl:lengthPattern=".*?(?=(\x0D-|\x0D\x0A-|-))" ....>

This matches text up to, but not including, one of the CR-, CRLF-, or just - 
patterns, using the regex lookahead feature.

This element is then followed by

<choice>

   <element name="crDash" dfdl:initiator="%CR;-" ..../>

   <element name="crlfDash" dfdl:initiator="%CR;%LF;-" .../>

   <element name="dash" dfdl:initiator="-" .../>

</choice>

In each choice branch above, the element is a string of explicit length 0.

This will parse, and unparse just fine. You'll get infosets like

<foo>contents</foo><crlfDash/>

That <crlfDash/> element indicates which delimiter specifically was found and 
should be laid down after the <foo>contents</foo> when unparsing.
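
As a rough illustration of why recording the delimiter makes the round trip exact, here is a small Python model. It is a sketch under assumptions, not how Daffodil is implemented: the (value, delimiter name) pairs stand in for the infoset elements, and parse, unparse, and DELIMITER_TEXT are hypothetical names.

```python
import re

# A lazy value followed by exactly one of the three delimiters, each in a named group.
FIELD = re.compile(r"(?P<value>.*?)(?:(?P<crlfDash>\r\n-)|(?P<crDash>\r-)|(?P<dash>-))")
DELIMITER_TEXT = {"crDash": "\r-", "crlfDash": "\r\n-", "dash": "-"}

def parse(data):
    """Return (value, delimiter_name) pairs; fail if a value has no delimiter after it."""
    pairs, pos = [], 0
    while pos < len(data):
        m = FIELD.match(data, pos)
        if m is None:
            raise ValueError(f"no delimiter found after offset {pos}")
        # Exactly one delimiter group took part in the match; find its name.
        name = next(n for n in DELIMITER_TEXT if m.group(n) is not None)
        pairs.append((m.group("value"), name))
        pos = m.end()
    return pairs

def unparse(pairs):
    """Write each value followed by the delimiter that was recorded for it."""
    return "".join(value + DELIMITER_TEXT[name] for value, name in pairs)

original = "A-B-C\r\n-D-E\r-"
pairs = parse(original)
print(pairs)
print(unparse(pairs) == original)  # True: the exact delimiters are restored
```

In this model a trailing value with no delimiter after it (such as a final "F") makes parse raise an error, since none of the three alternatives applies.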

The above technique will not run into the DAFFODIL-1559 bug, because the CR 
characters are never brought into the XML Infoset, so are never converted into 
LF.

Note that you cannot put the choice above into a hidden group so as to hide 
this delimiter cruft, because then that information would be lost and 
unavailable for unparsing.

I hope that helps.

________________________________

From: Costello, Roger L. <coste...@mitre.org>
Sent: Monday, April 8, 2019 7:41 AM
To: users@daffodil.apache.org
Subject: Re: Bug in Daffodil?

Thanks Steve.

Is there a workaround? I need the output of unparsing to exactly match the 
original input.

/Roger

-----Original Message-----
From: Steve Lawrence <slawre...@apache.org>
Sent: Friday, April 5, 2019 10:45 AM
To: users@daffodil.apache.org; Costello, Roger L. <coste...@mitre.org>
Subject: [EXT] Re: Bug in Daffodil?

This is actually the expected behavior, though it's maybe not always desired.

The issue here is that XML is not allowed to contain CRs; only LFs are 
allowed. So when we output infoset data, all CRLFs are converted to LF, and 
any lone CRs are also converted to LF. Unfortunately, if your data fields 
contain a CR, it's going to get lost. In a lot of cases this is fine, since 
lots of formats don't care about CRLF vs LF. But there are definitely some 
places where it matters.
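
The same loss of CRs can be seen with any conforming XML reader, since XML 1.0 line-end handling turns CRLF and lone CR into LF when a document is read. The Python sketch below only shows that reader-side normalization using the standard xml.etree.ElementTree module. It says nothing about Daffodil's own code, and the helper name text_after_xml_round_trip is hypothetical.

```python
import xml.etree.ElementTree as ET

def text_after_xml_round_trip(value: str) -> str:
    """Place value in an element as raw text, then read it back with an XML parser."""
    # The raw CR/LF characters are written unescaped; value must not contain '<' or '&'.
    document = "<field>" + value + "</field>"
    return ET.fromstring(document).text

print(repr(text_after_xml_round_trip("C\r\n")))  # CRLF comes back as a single LF
print(repr(text_after_xml_round_trip("E\r")))    # a lone CR comes back as LF
```

Once the text has been through the reader, there is no way to tell which line ending the original data used.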

DAFFODIL-1559 [1] is the issue for allowing this behavior to be changed. One 
option would be to convert CR characters in the data to a private use area 
character, like we do with other illegal XML characters, but that makes the 
infoset less useful. 
Another option might be to say that whenever an LF appears in the data, we just 
always unparse it as a CRLF. This means if your data mixes CRLF and LF, we'd 
always output CRLF, but that's probably not a big deal if mixing is allowed in 
the format.

- Steve

[1] https://issues.apache.org/jira/browse/DAFFODIL-1559

On 4/5/19 9:25 AM, Costello, Roger L. wrote:
> Hello DFDL community,
>
> My input file consists of a prolog of known format and a payload
> surrounded by parentheses. The payload consists of a series of text
> fields separated by hyphens. In some cases, the hyphen can be preceded
> by a new line, which can be a carriage return or CRLF combination.
>
> Here is a sample input file; I show it in a hex editor so you can see
> that some hyphens are preceded by CRLF and others by just a CR.
>
> Here is my DFDL schema:
>
> <xs:element name="input">
>     <xs:complexType>
>         <xs:sequence>
>             <xs:element name="prolog" type="xs:string" dfdl:terminator="%NL;"/>
>             <xs:element name="payload" dfdl:initiator="(" dfdl:terminator=")">
>                 <xs:complexType>
>                     <xs:sequence dfdl:separator="-" dfdl:separatorPosition="infix">
>                         <xs:element name="field" type="xs:string" maxOccurs="unbounded"/>
>                     </xs:sequence>
>                 </xs:complexType>
>             </xs:element>
>         </xs:sequence>
>     </xs:complexType>
> </xs:element>
>
> When I parse the input file using the DFDL schema, I get this XML:
>
> <input>
> <prolog>PROLOG</prolog>
> <payload>
> <field>A</field>
> <field>B</field>
> <field>C
> </field>
> <field>D</field>
> <field>E
> </field>
> <field>F</field>
> </payload>
> </input>
>
> That's perfect.
>
> When I unparse the XML I get this (please note the bug (?) described in 
> yellow):
>
