Is it possible that the pattern .*(?=\)) isn't what you want? Could it be
.*?(?=\)) or something similar so that it doesn't greedily snatch up the
right parenthesis?
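
To make the greedy versus lazy question concrete, here is a small sketch. It uses Python's re module as a stand-in for the regex engine behind dfdl:lengthPattern (an assumption; the engines may differ in other details), and the helper name and the two-parenthesis input are hypothetical. Neither pattern consumes the ")" itself, because the lookahead is zero-width; they differ only when more than one ")" lies ahead.

```python
import re

GREEDY = r".*(?=\))"   # longest run that is still followed by ")"
LAZY = r".*?(?=\))"    # shortest run that is followed by ")"


def leading_match(pattern, data):
    """Return the text matched at the start of data, the way a length pattern is applied."""
    m = re.match(pattern, data)
    if m is None:
        raise ValueError(f"{pattern!r} does not match {data!r}")
    return m.group(0)


single = "F)"    # last field followed by the closing parenthesis
double = "F)G)"  # hypothetical data with two closing parentheses

for name, pattern in (("greedy", GREEDY), ("lazy", LAZY)):
    print(name, repr(leading_match(pattern, single)), repr(leading_match(pattern, double)))
```

With a single ")" both forms match the same text, so switching to the lazy form changes the result only if a ")" can occur earlier in the data.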

 

I'm working through it myself, but I'm about a step behind Roger. I'm
working with a file in isolation, without the wrapping parentheses, and
adding another element to the choice with the pattern "$" so that it picks
up the end of file after the last repeated field.

 

This stuff is hard! Thank you both for the help.

 

Julian

 

---------------------------------------------------------------

Dr. Julian C. Lander

Lead Software Engineer

MITRE

 

Mail Stop M360

The MITRE Corporation

202 Burlington Road

Bedford, MA   01730-1420

781-271-4516 

 

 

From: Costello, Roger L. <coste...@mitre.org> 
Sent: Wednesday, April 10, 2019 2:10 PM
To: Beckerle, Mike <mbecke...@tresys.com>; users@daffodil.apache.org
Subject: Re: Bug in Daffodil?

 

Hi Mike,

 

That is wicked cool.

 

I implemented your suggestion. We are close, but not quite there. Below is
my schema. I am now getting an "infinite loop" error. I think it is the same
error that Julian was getting.

 

[error] Parse Error: Repeating or Optional Element - No forward progress at
byte 24. Attempt to parse field succeeded but consumed no data.

Please re-examine your schema to correct this infinite loop.

 

What do you suggest?  /Roger

 

<xs:element name="input">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="prolog" type="xs:string"
dfdl:terminator="%NL;" />
            <xs:element name="payload" dfdl:initiator="(">
                <xs:complexType>
                    <xs:sequence>
                        <xs:element name="field" maxOccurs="unbounded">
                            <xs:complexType>
                                <xs:sequence>
                                    <xs:element name="value"
type="xs:string" dfdl:lengthPattern=".*?(?=(\x0D-|\x0D\x0A-|-|\)$))" />
                                    <xs:choice> 
                                        <xs:element name="crDash"
dfdl:initiator="%CR;-" type="xs:string" />
                                        <xs:element name="crlfDash"
dfdl:initiator="%CR;%LF;-" type="xs:string" />
                                        <xs:element name="dash"
dfdl:initiator="-" type="xs:string" />
                                        <xs:element name="none"
type="xs:string" dfdl:lengthKind="pattern" dfdl:lengthPattern=".*(?=\))" />
                                    </xs:choice>
                                </xs:sequence>
                            </xs:complexType>
                        </xs:element>
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
        </xs:sequence>
    </xs:complexType>
</xs:element>
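
One possible reading of the "no forward progress" error, offered as a guess rather than a confirmed diagnosis: once the last field has been consumed and only ")" remains, both the value pattern and the "none" pattern can succeed with a zero-length match, so another occurrence of "field" parses without consuming anything. The sketch below checks only the regex part of that reasoning, using Python's re module as a stand-in for Daffodil's regex engine (an assumption). CRLF is written as \x0D\x0A (CR then LF), and the helper name is hypothetical.

```python
import re

# Value pattern with the close-paren alternative, and the "none" branch pattern.
VALUE = r".*?(?=(\x0D-|\x0D\x0A-|-|\)$))"
NONE = r".*(?=\))"


def consumed(pattern, data):
    """Number of characters the pattern consumes at the start of data, or None on failure."""
    m = re.match(pattern, data)
    return None if m is None else len(m.group(0))


# The last real field: "F" followed by the closing parenthesis.
last_field_len = consumed(VALUE, "F)")

# What is left after "F" has been consumed: just the closing parenthesis.
remaining = ")"
value_len = consumed(VALUE, remaining)
none_len = consumed(NONE, remaining)

print("last field consumes", last_field_len)
print("next occurrence consumes", value_len + none_len)
```

If this reading is right, the extra occurrence of "field" succeeds while consuming zero characters, which matches the wording of the error message.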

 

 

From: Beckerle, Mike <mbecke...@tresys.com>
Sent: Wednesday, April 10, 2019 1:43 PM
To: Costello, Roger L. <coste...@mitre.org>; users@daffodil.apache.org
Subject: [EXT] Re: Bug in Daffodil?

 

Here's what I would try:

 

Add the close paren to the regex:

 

dfdl:lengthPattern=".*?(?=(\x0D-|\x0D\x0A-|-|\)$))"

 
^^ right there.

 

This will cause it to stop when it encounters the close paren without
including it in the match.

 

Then you have to add a fourth case to the choice so:

                                    <xs:choice>
                                        <xs:element name="crDash"
dfdl:initiator="%CR;-" type="xs:string"/>
                                        <xs:element name="crlfDash"
dfdl:initiator="%CR;%LF;-" type="xs:string"/>
                                        <xs:element name="dash"
dfdl:initiator="-" type="xs:string"/>

                                        <xs:element name="none"
type="xs:string" dfdl:lengthKind="pattern" dfdl:lengthPattern=".*(?=\))" />

                                    </xs:choice>

 

That will allow for a <none/> zero length 'delimiter' when the close paren
is encountered. 

 

Then the enclosing context will parse the ")" as terminator. 

 

I can't ensure this is correct, but this is the concept. 

  _____  

From: Costello, Roger L. <coste...@mitre.org>
Sent: Wednesday, April 10, 2019 11:28 AM
To: users@daffodil.apache.org; Beckerle, Mike
Subject: Re: Bug in Daffodil? 

 

Thank you Mike - that is a fantastic solution.

 

But, but, but ..

 

I've got one tiny problem: what about the right parenthesis at the end? The
input starts with a left parenthesis and ends with a right parenthesis,
e.g.,

 

PROLOG
(A-B-C
-D-E
-F)

 

Notice that the last field is F and it is not followed by dash, CR/dash, nor
CRLF/dash.

 

What to do?

 

Below is my schema. I am getting this error:

 

[error] Parse Error: Failed to populate field[1]. Cause: Parse Error: All
choice alternatives failed. Reason(s): List(Parse Error: Alternative failed.
Reason(s): List(Parse Error: Found out of scope delimiter: ')' ')'

 

<xs:element name="input">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="prolog" type="xs:string"
dfdl:terminator="%NL;" />
            <xs:element name="payload" dfdl:initiator="("
dfdl:terminator=")">
                <xs:complexType>
                    <xs:sequence>
                        <xs:element name="field" maxOccurs="unbounded">
                            <xs:complexType>
                                <xs:sequence>
                                    <xs:element name="value"
type="xs:string" dfdl:lengthPattern=".*?(?=(\x0D-|\x0D\x0A-|-|$))" />
                                    <xs:choice> 
                                        <xs:element name="crDash"
dfdl:initiator="%CR;-" type="xs:string" />
                                        <xs:element name="crlfDash"
dfdl:initiator="%CR;%LF;-" type="xs:string" />
                                        <xs:element name="dash"
dfdl:initiator="-" type="xs:string" />
                                    </xs:choice>
                                </xs:sequence>
                            </xs:complexType>
                        </xs:element>
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
        </xs:sequence>
    </xs:complexType>
</xs:element>

 

 

From: Beckerle, Mike <mbecke...@tresys.com>
Sent: Wednesday, April 10, 2019 8:49 AM
To: users@daffodil.apache.org
Subject: [EXT] Re: Bug in Daffodil?

 

I'm late in trying to interact with this thread, I fear I may have missed
some messages.

 

But here goes....

 

DFDL does not capture and preserve which of multiple possible delimiters was
used. When there is more than one possible delimiter, the first one is
considered canonical and is used for unparsing, even though when parsing,
the longest match is used.

 

So if you have dfdl:terminator="- %CR;- %CR;%LF;-" then on unparsing you'll
always unparse these as just "-"; no CR or CRLF will *ever* be output,
regardless of what was found when parsing.

 

That means if you round-trip data (parse then unparse) it will canonicalize
the delimiters. What you get out is considered a canonical form of the data.
This will usually NOT get back the exact same output as input, but you will
get what the DFDL schema specifies is an equivalent canonical form. If you
parse this again, the infoset you get should be the same as the infoset from
the first parse. This is what we call a "twoPass" round trip. 
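
The canonicalization described above can be imitated with a toy model. This is not Daffodil code: it is a Python sketch with hypothetical parse and unparse helpers, assuming only what is stated here (longest match when parsing, first listed terminator when unparsing) and a made-up input string.

```python
# Terminators in schema order; the first one is the canonical form used when unparsing.
TERMINATORS = ["-", "\r-", "\r\n-"]


def parse(data):
    """Split data into field values, consuming the longest terminator that matches."""
    by_length = sorted(TERMINATORS, key=len, reverse=True)
    fields, current, i = [], "", 0
    while i < len(data):
        hit = next((t for t in by_length if data.startswith(t, i)), None)
        if hit is None:
            current += data[i]      # ordinary character: part of the field value
            i += 1
        else:
            fields.append(current)  # terminator found: the field is complete
            current = ""
            i += len(hit)
    if current:
        raise ValueError("last field has no terminator")
    return fields


def unparse(fields):
    """Write each field followed by the canonical (first) terminator."""
    return "".join(value + TERMINATORS[0] for value in fields)


original = "A-B\r-C\r\n-"
infoset = parse(original)
canonical = unparse(infoset)

print(infoset)
print(repr(canonical))
```

The output differs from the input, but parsing the output again yields the same infoset, which is the "twoPass" property.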

 

If that isn't the behavior you want, because it is significant and important
in the format exactly which delimiters were used, then the delimiters are
not just delimiters. They are carrying some additional
information/significance that must be captured by the Infoset in order for
the DFDL schema to accurately represent the information content. 

 

To do that you must do what I call modeling syntax as data. That is, you
must capture the specific delimiters in elements so that the significance of
which specific delimiter was used is captured in the infoset. 

 

Suppose your delimiter is either "CR-", "CRLF-", or just "-". 

 

To parse an element delimited by this and capture which delimiter
specifically was found, you must use dfdl:lengthKind='pattern' and regular
expressions with lookahead:

 

<element name="foo" dfdl:lengthPattern=".*?(?=(\x0D-|\x0D\x0A-|-))" ....>

 

This matches text up to, but not including one of the CR-, CRLF-, or just -
patterns using the regex forward-lookahead feature.

 

This element is then followed by 

 

<choice> 

   <element name="crDash" dfdl:initiator="%CR;-" ..../>

   <element name="crlfDash" dfdl:initiator="%CR;%LF;-" .../>

   <element name="dash" dfdl:initiator="-" .../>

</choice>

 

In each choice branch above, the element is a string of explicit length 0. 

 

This will parse, and unparse just fine. You'll get infosets like

 

<foo>contents</foo><crlfDash/>

 

That <crlfDash/> element indicates which delimiter specifically was found
and should be laid down after the <foo>contents</foo> when unparsing. 
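
As a rough illustration of modeling syntax as data, the Python sketch below records which delimiter followed each value and writes exactly that delimiter back out. It is only an analogy for the schema technique: the helper names and the sample payload are hypothetical, Python's re module stands in for Daffodil's regex engine (an assumption), and CRLF is spelled \x0D\x0A (CR then LF).

```python
import re

# Value: everything up to, but not including, one of the delimiters.
VALUE = re.compile(r".*?(?=(\x0D-|\x0D\x0A-|-))")

# Delimiter "elements", in the same order as the choice branches.
DELIMITERS = [("crDash", "\r-"), ("crlfDash", "\r\n-"), ("dash", "-")]


def parse(data):
    """Return a list of (value, delimiter element name) pairs."""
    infoset, pos = [], 0
    while pos < len(data):
        m = VALUE.match(data, pos)
        if m is None:
            raise ValueError(f"no delimiter found after position {pos}")
        pos = m.end()
        # The lookahead succeeded, so one of the delimiters starts at pos.
        name, text = next((n, t) for n, t in DELIMITERS if data.startswith(t, pos))
        infoset.append((m.group(0), name))
        pos += len(text)
    return infoset


def unparse(infoset):
    """Write each value followed by the delimiter that was recorded for it."""
    text_for = dict(DELIMITERS)
    return "".join(value + text_for[name] for value, name in infoset)


original = "A-B-C\r-D-E\r\n-"
infoset = parse(original)
print(infoset)
print(unparse(infoset) == original)
```

Because the delimiter is carried in the infoset rather than discarded, the unparsed text is identical to the input, not merely a canonical equivalent.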

 

The above technique will not run into the DAFFODIL-1559 bug, because the CR
characters are never brought into the XML Infoset, so are never converted
into LF. 

 

Note that you cannot put the choice above into a hidden group so as to hide
this delimiter cruft, because then that information would be lost and
unavailable for unparsing. 

 

I hope that helps.

  _____  

From: Costello, Roger L. <coste...@mitre.org>
Sent: Monday, April 8, 2019 7:41 AM
To: users@daffodil.apache.org
Subject: Re: Bug in Daffodil? 

 

Thanks Steve.

Is there a workaround? I need the output of unparsing to exactly match the
original input.

/Roger


-----Original Message-----
From: Steve Lawrence <slawre...@apache.org>
Sent: Friday, April 5, 2019 10:45 AM
To: users@daffodil.apache.org; Costello, Roger L. <coste...@mitre.org>
Subject: [EXT] Re: Bug in Daffodil?

This is actually the expected behavior, though it's maybe not always
desired.

The issue here is that XML is not allowed to contain CR's; only LF's are
allowed. So when we output infoset data, all CRLF's are converted to LF, and
any lone CR's are also converted to LF. Unfortunately, if your data fields
contain a CR, it's going to get lost. In a lot of cases this is fine, since
lots of formats don't care about CRLF vs LF. But there are definitely some
places where it matters.
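
The same loss shows up whenever text passes through an XML parser. The sketch below is not Daffodil code; it uses Python's standard xml.etree.ElementTree to show the line-ending normalization that XML parsers apply to literal CR and CRLF, with a hypothetical helper name and made-up field content.

```python
import xml.etree.ElementTree as ET


def text_after_xml_parse(raw_text):
    """Wrap raw_text in an element, parse it, and return the element's text."""
    return ET.fromstring(f"<field>{raw_text}</field>").text


crlf = text_after_xml_parse("C\r\n")  # field content ending in CRLF
cr = text_after_xml_parse("C\r")      # field content ending in a lone CR

print(repr(crlf), repr(cr))
```

Both inputs come back ending in a single LF, so the infoset alone can no longer tell which line ending the original data used.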

DAFFODIL-1559 [1] is the issue for allowing this behavior to be changed. One
option would be to convert CR characters in the data to the private use
area, like we do with other illegal XML characters, but that makes the
infoset less useful. Another option might be to say that whenever an LF
appears in the data, we just always unparse it as a CRLF. This means if your
data mixes CRLF and LF, we'd always output CRLF, but that's probably not a
big deal if mixing is allowed in the format.

- Steve

[1] https://issues.apache.org/jira/browse/DAFFODIL-1559

On 4/5/19 9:25 AM, Costello, Roger L. wrote:
> Hello DFDL community,
> 
> My input file consists of a prolog of known format and a payload 
> surrounded by parentheses. The payload consists of a series of text 
> fields separated by hyphens. In some cases, the hyphen can be preceded 
> by a new line, which can be a carriage return or CRLF combination.
> 
> Here is a sample input file; I show it in a hex editor so you can see 
> that some hyphens are preceded by CRLF and others by just a CR.
> 
> Here is my DFDL schema:
> 
> <xs:element name="input">
>     <xs:complexType>
>         <xs:sequence>
>             <xs:element name="prolog" type="xs:string" dfdl:terminator="%NL;"/>
>             <xs:element name="payload" dfdl:initiator="(" dfdl:terminator=")">
>                 <xs:complexType>
>                     <xs:sequence dfdl:separator="-" dfdl:separatorPosition="infix">
>                         <xs:element name="field" type="xs:string" maxOccurs="unbounded"/>
>                     </xs:sequence>
>                 </xs:complexType>
>             </xs:element>
>         </xs:sequence>
>     </xs:complexType>
> </xs:element>
> 
> When I parse the input file using the DFDL schema, I get this XML:
> 
> <input>
> <prolog>PROLOG</prolog>
> <payload>
> <field>A</field>
> <field>B</field>
> <field>C
> </field>
> <field>D</field>
> <field>E
> </field>
> <field>F</field>
> </payload>
> </input>
> 
> That's perfect.
> 
> When I unparse the XML I get this (please note the bug (?) described in
> yellow):
> 
