We are really close, I think. But still not quite right. I added dfdl:terminator=")" and the non-greedy ? operator. See schema below. I am getting the same error message:
[error] Parse Error: Repeating or Optional Element - No forward progress at byte 23. Attempt to parse field succeeded but consumed no data. Please re-examine your schema to correct this infinite loop.

What next Mike?

/Roger

<xs:element name="input">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="prolog" type="xs:string" dfdl:terminator="%NL;" />
      <xs:element name="payload" dfdl:initiator="(" dfdl:terminator=")">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="field" maxOccurs="unbounded">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="value" type="xs:string" dfdl:lengthPattern=".*?(?=(\x0D-|\x0A\x0D-|-|\)$))" />
                  <xs:choice>
                    <xs:element name="crDash" dfdl:initiator="%CR;-" type="xs:string" />
                    <xs:element name="crlfDash" dfdl:initiator="%CR;%LF;-" type="xs:string" />
                    <xs:element name="dash" dfdl:initiator="-" type="xs:string" />
                    <xs:element name="none" type="xs:string" dfdl:lengthKind="pattern" dfdl:lengthPattern=".*?(?=\))" />
                  </xs:choice>
                </xs:sequence>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>

From: Beckerle, Mike <mbecke...@tresys.com>
Sent: Wednesday, April 10, 2019 2:16 PM
To: Costello, Roger L. <coste...@mitre.org>; users@daffodil.apache.org
Subject: [EXT] Re: Bug in Daffodil?

Where did the dfdl:terminator=")" go? That is still needed.

Without that, I think it will parse zero-length value elements followed by <none/> elements forever, which it will detect (no forward progress) and fail.

________________________________
From: Costello, Roger L. <coste...@mitre.org>
Sent: Wednesday, April 10, 2019 2:10 PM
To: Beckerle, Mike; users@daffodil.apache.org
Subject: Re: Bug in Daffodil?

Hi Mike,

That is wicked cool. I implemented your suggestion. We are close, but not quite there. Below is my schema. I am now getting an "infinite loop" error. I think it is the same error that Julian was getting.
[error] Parse Error: Repeating or Optional Element - No forward progress at byte 24. Attempt to parse field succeeded but consumed no data. Please re-examine your schema to correct this infinite loop.

What do you suggest?

/Roger

<xs:element name="input">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="prolog" type="xs:string" dfdl:terminator="%NL;" />
      <xs:element name="payload" dfdl:initiator="(">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="field" maxOccurs="unbounded">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="value" type="xs:string" dfdl:lengthPattern=".*?(?=(\x0D-|\x0A\x0D-|-|\)$))" />
                  <xs:choice>
                    <xs:element name="crDash" dfdl:initiator="%CR;-" type="xs:string" />
                    <xs:element name="crlfDash" dfdl:initiator="%CR;%LF;-" type="xs:string" />
                    <xs:element name="dash" dfdl:initiator="-" type="xs:string" />
                    <xs:element name="none" type="xs:string" dfdl:lengthKind="pattern" dfdl:lengthPattern=".*(?=\))" />
                  </xs:choice>
                </xs:sequence>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>

From: Beckerle, Mike <mbecke...@tresys.com>
Sent: Wednesday, April 10, 2019 1:43 PM
To: Costello, Roger L. <coste...@mitre.org>; users@daffodil.apache.org
Subject: [EXT] Re: Bug in Daffodil?

Here's what I would try:

Add the close paren to the regex:

dfdl:lengthPattern=".*?(?=(\x0D-|\x0A\x0D-|-|\)$))"
                                             ^^ right there.

This will cause it to stop when it encounters the close paren without including it in the match.
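
A minimal sketch of the lookahead idea just described, using Python's re module as a stand-in for Daffodil's regex engine. That is an assumption: the two engines differ in details, such as which characters "." matches. The names VALUE, DELIMITERS and split_payload are hypothetical and only for illustration. Note also that the CRLF alternative is written here as \x0D\x0A (CR then LF, matching the %CR;%LF;- initiator), whereas the pattern quoted in this thread has the two bytes in the order \x0A\x0D.

```python
import re

# Lazy match that stops in front of CR-dash, CRLF-dash, a bare dash,
# or a closing paren at the end of the data. The lookahead (?=...) checks
# for the delimiter without consuming it.
# Assumption: CRLF is spelled \x0D\x0A here, unlike the pattern in the thread.
VALUE = re.compile(r".*?(?=(\x0D-|\x0D\x0A-|-|\)$))")

# Hypothetical names mirroring the choice branches in the schema.
DELIMITERS = (("crlfDash", "\r\n-"), ("crDash", "\r-"), ("dash", "-"))


def split_payload(payload):
    """Return ([(value, delimiter_name), ...], index of the closing paren)."""
    pos = 0
    fields = []
    while True:
        m = VALUE.match(payload, pos)
        if m is None:
            raise ValueError(f"no value found at offset {pos}")
        value = m.group(0)
        pos = m.end()
        for name, text in DELIMITERS:
            if payload.startswith(text, pos):
                fields.append((value, name))
                pos += len(text)  # consuming the delimiter guarantees progress
                break
        else:
            # No dash-style delimiter follows: this is the last field, and the
            # closing paren is left for the enclosing context to consume.
            fields.append((value, "none"))
            return fields, pos


fields, end = split_payload("A-B-C\r-D-E\r\n-F)")
print(fields)
print(end)
```

On this sample the last field comes back as ("F", "none") and end is the offset of the ")", which is left unconsumed. Matching VALUE again at that offset succeeds with an empty string. In this sketch that is harmless because the loop returns, but it is the same "succeeded but consumed no data" situation the error message describes when nothing stops the field loop.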

Then you have to add a fourth case to the choice, so:

<xs:choice>
  <xs:element name="crDash" dfdl:initiator="%CR;-" type="xs:string"/>
  <xs:element name="crlfDash" dfdl:initiator="%CR;%LF;-" type="xs:string"/>
  <xs:element name="dash" dfdl:initiator="-" type="xs:string"/>
  <xs:element name="none" type="xs:string" dfdl:lengthKind="pattern" dfdl:lengthPattern=".*(?=\))" />
</xs:choice>

That will allow for a <none/> zero-length 'delimiter' when the close paren is encountered. Then the enclosing context will parse the ")" as terminator.

I can't ensure this is correct, but this is the concept.

________________________________
From: Costello, Roger L. <coste...@mitre.org>
Sent: Wednesday, April 10, 2019 11:28 AM
To: users@daffodil.apache.org; Beckerle, Mike
Subject: Re: Bug in Daffodil?

Thank you Mike - that is a fantastic solution.

But, but, but .... I've got one tiny problem: what about the right parenthesis at the end?

The input starts with a left parenthesis and ends with a right parenthesis, e.g.,

PROLOG
(A-B-C
-D-E
-F)

Notice that the last field is F and it is not followed by dash, CR/dash, nor CRLF/dash. What to do?

Below is my schema. I am getting this error:

[error] Parse Error: Failed to populate field[1]. Cause: Parse Error: All choice alternatives failed. Reason(s): List(Parse Error: Alternative failed.
Reason(s): List(Parse Error: Found out of scope delimiter: ')' ')'

<xs:element name="input">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="prolog" type="xs:string" dfdl:terminator="%NL;" />
      <xs:element name="payload" dfdl:initiator="(" dfdl:terminator=")">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="field" maxOccurs="unbounded">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="value" type="xs:string" dfdl:lengthPattern=".*?(?=(\x0D-|\x0A\x0D-|-|$))" />
                  <xs:choice>
                    <xs:element name="crDash" dfdl:initiator="%CR;-" type="xs:string" />
                    <xs:element name="crlfDash" dfdl:initiator="%CR;%LF;-" type="xs:string" />
                    <xs:element name="dash" dfdl:initiator="-" type="xs:string" />
                  </xs:choice>
                </xs:sequence>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>

From: Beckerle, Mike <mbecke...@tresys.com>
Sent: Wednesday, April 10, 2019 8:49 AM
To: users@daffodil.apache.org
Subject: [EXT] Re: Bug in Daffodil?

I'm late in trying to interact with this thread, and I fear I may have missed some messages. But here goes....

DFDL does not capture and preserve which of multiple possible delimiters was used.

When there is more than one possible delimiter, the first one is considered canonical and is used for unparsing, even though when parsing, the longest match is used.

So if you have

dfdl:terminator="- %CR;- %CR;%LF;-"

then on unparsing you'll always unparse these as just "-"; no CR or CRLF will *ever* be output, regardless of what was found when parsing.

That means if you round-trip data (parse then unparse) it will canonicalize the delimiters. What you get out is considered a canonical form of the data.

This will usually NOT get back the exact same output as input, but you will get what the DFDL schema specifies is an equivalent canonical form. If you parse this again, the infoset you get should be the same as the infoset from the first parse.
This is what we call a "twoPass" round trip.

If that isn't the behavior you want, because it is significant and important in the format exactly which delimiters were used, then the delimiters are not just delimiters. They are carrying some additional information/significance that must be captured by the Infoset in order for the DFDL schema to accurately represent the information content.

To do that you must do what I call modeling syntax as data. That is, you must capture the specific delimiters in elements so that the significance of which specific delimiter was used is captured in the infoset.

Suppose your delimiter is either "CR-", "CRLF-", or just "-".

To parse an element delimited by this and capture which delimiter specifically was found, you must use dfdl:lengthKind='pattern' and regular expressions with lookahead:

<element name="foo" dfdl:lengthPattern=".*?(?=(\x0D-|\x0A\x0D-|-))" ....>

This matches text up to, but not including one of the CR-, CRLF-, or just - patterns using the regex forward-lookahead feature.

This element is then followed by

<choice>
  <element name="crDash" dfdl:initiator="%CR;-" ..../>
  <element name="crlfDash" dfdl:initiator="%CR;%LF;-" .../>
  <element name="dash" dfdl:initiator="-" .../>
</choice>

In each choice branch above, the element is a string of explicit length 0.

This will parse, and unparse just fine. You'll get infosets like

<foo>contents</foo><crlfDash/>

That <crlfDash/> element indicates which delimiter specifically was found and should be laid down after the <foo>contents</foo> when unparsing.

The above technique will not run into the DAFFODIL-1559 bug, because the CR characters are never brought into the XML Infoset, so are never converted into LF.

Note that you cannot put the choice above into a hidden group so as to hide this delimiter cruft. Because then that information would be lost and unavailable for unparsing.

I hope that helps.

________________________________
From: Costello, Roger L.
<coste...@mitre.org>
Sent: Monday, April 8, 2019 7:41 AM
To: users@daffodil.apache.org
Subject: Re: Bug in Daffodil?

Thanks Steve. Is there a workaround? I need the output of unparsing to exactly match the original input.

/Roger

-----Original Message-----
From: Steve Lawrence <slawre...@apache.org>
Sent: Friday, April 5, 2019 10:45 AM
To: users@daffodil.apache.org; Costello, Roger L. <coste...@mitre.org>
Subject: [EXT] Re: Bug in Daffodil?

This is actually the expected behavior, though it's maybe not always desired. The issue here is that XML is not allowed to contain CR's, only LF's are allowed. So when we output infoset data, all CRLF's are converted to LF, and any lone CR's are also converted to LF. Unfortunately, if your data fields contain a CR, it's going to get lost.

In a lot of cases this is fine, since lots of formats don't care about CRLF vs LF. But there are definitely some places where it matters. DAFFODIL-1559 [1] is the issue for allowing this behavior to be changed. One option would be to convert CR characters in the data to a private use area like we do with other illegal XML characters, but that makes the infoset less useful. Another option might be to say that whenever an LF appears in the data, we just always unparse it as a CRLF. This means if your data mixes CRLF and LF, we'd always output CRLF, but that's probably not a big deal if mixing is allowed in the format.

- Steve

[1] https://issues.apache.org/jira/browse/DAFFODIL-1559

On 4/5/19 9:25 AM, Costello, Roger L. wrote:
> Hello DFDL community,
>
> My input file consists of a prolog of known format and a payload
> surrounded by parentheses. The payload consists of a series of text
> fields separated by hyphens. In some cases, the hyphen can be preceded
> by a new line, which can be a carriage return or CRLF combination.
>
> Here is a sample input file; I show it in a hex editor so you can see
> that some hyphens are preceded by CRLF and others by just a CR.
>
> Here is my DFDL schema:
>
> <xs:element name="input">
>   <xs:complexType>
>     <xs:sequence>
>       <xs:element name="prolog" type="xs:string" dfdl:terminator="%NL;"/>
>       <xs:element name="payload" dfdl:initiator="(" dfdl:terminator=")">
>         <xs:complexType>
>           <xs:sequence dfdl:separator="-" dfdl:separatorPosition="infix">
>             <xs:element name="field" type="xs:string" maxOccurs="unbounded"/>
>           </xs:sequence>
>         </xs:complexType>
>       </xs:element>
>     </xs:sequence>
>   </xs:complexType>
> </xs:element>
>
> When I parse the input file using the DFDL schema, I get this XML:
>
> <input>
>   <prolog>PROLOG</prolog>
>   <payload>
>     <field>A</field>
>     <field>B</field>
>     <field>C
> </field>
>     <field>D</field>
>     <field>E
> </field>
>     <field>F</field>
>   </payload>
> </input>
>
> That's perfect.
>
> When I unparse the XML I get this (please note the bug (?) described in
> yellow):
>