Just wanted to comment that this is a super-insightful dialog and pass along
thanks.

 

These kinds of gap issues are great to identify. "Be careful when you try to
fix that ambiguity in your parser, looks like you can't get there from here"
(pronounced with down-east Maine accent).  Getting clear about problems is
worthwhile and helps drive good practices.

 

Then, when you "hit the wall" hard on some of those serious parsing
problems, the real fix becomes evident: go back to the format syntax, or
quite possibly the underlying data model, and remedy the ambiguity there.
If designed and implemented satisfactorily, then the pathology doesn't occur
any more.

 

Gosh, not unlike writing "good" prose for people, determining such goodness
is mostly measured on the people side.  Interesting parallels.

 

Say what you mean, mean what you say, and "we don't talk about Bruno."  8)

 

all the best, Don

-- 

Don Brutzman  Naval Postgraduate School, Code USW/Br        brutz...@nps.edu

Watkins 270,  MOVES Institute, Monterey CA 93943-5000 USA    +1.831.656.2149

X3D graphics, virtual worlds, Navy robotics https://faculty.nps.edu/brutzman

 

From: Mike Beckerle <mbecke...@apache.org> 
Sent: Wednesday, May 11, 2022 10:01 AM
To: users@daffodil.apache.org
Subject: Re: Catalog the causes of the dreaded "left over data" error
message

 

Roger wrote:
Is this the lesson: given two element declarations inside a choice each with
dfdl:lengthPattern="regex", order the elements so that the element with the
regex that describes the larger set of characters is listed first. Is that
the lesson? If so, we are in deep trouble - how in the world is one to write
a program which determines that one regex describes a larger set of
characters than the other? That is probably possible (in fact, I'm certain
that it is theoretically possible) but practically it is a very hard
problem (or at least an enormous amount of work). Uh oh.

Not the lesson. 

 

There is no algorithm possible for what you are describing. Consider a regex
that matches up to 100 digits, and another regex that matches up to 100
digits or alpha characters. Both will match a string of digits. It's
fundamentally ambiguous which should be chosen without more information
about which is considered the correct resolution of this ambiguity.   
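
A minimal sketch of that ambiguity, using Python's re module as a stand-in for the DFDL regex engine. The exact patterns and the sample data are assumptions made up for illustration; they are not taken from any real schema.

```python
import re

# Hypothetical patterns for the two alternatives described above:
# up to 100 digits, and up to 100 digits or letters.
DIGITS = re.compile(r"[0-9]{1,100}")
ALNUM = re.compile(r"[0-9A-Za-z]{1,100}")

data = "20220511"  # made-up sample: a string of digits

# Both patterns accept the entire string, so the patterns alone cannot say
# which alternative the data belongs to.
digits_match = DIGITS.fullmatch(data)
alnum_match = ALNUM.fullmatch(data)

print(digits_match is not None, alnum_match is not None)
```

Both checks succeed, so something outside the two regexes has to decide which alternative wins.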

 

So, assuming you are describing an actual data format there must be some
robust criterion for deciding which alternative is to be selected.  

 

I know you might want to express the choice alternatives by just describing
each one's format in isolation using regex length patterns. That might be
your preference, but it's quite hard to make the regular expressions 100%
disjoint, so that only one alternative can succeed.

 

In your example, the choice alternatives, on their own, are ambiguous. Both
will match the data, consuming different amounts of it. 

 

So any DFDL implementation will need more information from the DFDL schema
to be able to synthesize a parse algorithm from these two choice
alternatives. 

 

The main way DFDL eliminates this ambiguity is by specifying that parsing
must be consistent with trying the choice alternatives in sequential order.
The schema author must therefore write the branches in some order, and that
order resolves the ambiguity.

 

Note that a DFDL implementation is free to pre-analyze the alternatives and
come up with faster clever ways to resolve the choice, but the DFDL
specification requires this to be consistent with just attempting the
alternatives in sequence. This is simple and predictable and allows for
unsophisticated DFDL implementations to work the same as more advanced ones.
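
The "try the alternatives in sequence" rule can be sketched roughly as follows. This is a toy model in Python, not Daffodil's implementation: the function name parse_choice, the branch names, and the sample data are all hypothetical, and each branch is reduced to a single regex.

```python
import re
from typing import List, Optional, Tuple

def parse_choice(branches: List[Tuple[str, str]],
                 data: str) -> Optional[Tuple[str, str, str]]:
    """Try each (name, pattern) branch in order; the first that matches wins.

    Returns (branch name, consumed text, left-over text), or None if no
    branch matches at the start of the data.
    """
    for name, pattern in branches:
        m = re.match(pattern, data)
        if m:
            return name, m.group(), data[m.end():]
    return None

# Hypothetical branches modeled on the digits / digits-or-letters example.
digits = ("digits", r"[0-9]{1,100}")
alnum = ("alnum", r"[0-9A-Za-z]{1,100}")

print(parse_choice([digits, alnum], "123abc"))  # digits wins, "abc" is left over
print(parse_choice([alnum, digits], "123abc"))  # alnum wins, nothing is left over
```

The same data and the same two branches give different results depending only on the order in which the branches are written, which is the point of the rule.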


 

-mikeb

On Wed, May 11, 2022 at 8:34 AM Roger L Costello <coste...@mitre.org> wrote:

Ah, yep that makes perfect sense. Thank you Steve.

What is the lesson learned? 

Earlier I learned this lesson with regexes: in a list of alternatives, list
the longest alternative first, e.g.,  foobar|foo  not  foo|foobar

Is that lesson rearing its ugly head again, albeit in a slightly different
form?
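
For reference, the regex lesson can be reproduced in a few lines. This sketch uses Python's re module, on the assumption that it behaves like the regex engine behind dfdl:lengthPattern for a simple alternation like this one.

```python
import re

data = "foobar"

# Alternatives are tried left to right, and the first one that matches wins,
# even when a later alternative would have matched more of the data.
short_first = re.match(r"foo|foobar", data).group()
long_first = re.match(r"foobar|foo", data).group()

print(short_first)  # foo
print(long_first)   # foobar
```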

Is this the lesson: given two element declarations inside a choice each with
dfdl:lengthPattern="regex", order the elements so that the element with the
regex that describes the larger set of characters is listed first. Is that
the lesson? If so, we are in deep trouble - how in the world is one to write
a program which determines that one regex describes a larger set of
characters than the other? That is probably possible (in fact, I'm certain
that it is theoretically possible) but practically it is a very hard
problem (or at least an enormous amount of work). Uh oh.

/Roger

-----Original Message-----
From: Steve Lawrence <slawre...@apache.org>
Sent: Wednesday, May 11, 2022 8:10 AM
To: users@daffodil.apache.org
Subject: [EXT] Re: Catalog the causes of the dreaded "left over data" error
message

The regex "[A-Z]{2,20}" says to match between 2 and 20 characters where 
only A-Z characters are allowed. Using this regex, Daffodil will scan 
the data and stop at the colon character since it does not match A-Z. So 
the length of the Identifier element according to the regex is 4 (the 
length of "TYPE").

Since the value of Identifier is "TYPE", it does not fail the nilled or 
empty-string assertion, and so there is no parse error and the first 
choice branch is successful. Because there are no more elements to 
parse, the remaining data (i.e. the colon and TEL) are not parsed and 
are considered left over data.

When Description is moved to the first branch of the choice, it 
successfully parses the "TYPE:" initiator, and then the regex matches 
everything after that (i.e. TEL) and it works as expected.
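
The two outcomes can be checked with a rough sketch. It uses Python's re module in place of Daffodil and models the initiator as a plain string prefix, which is a simplification. The patterns and the input are the ones from Roger's message.

```python
import re

data = "TYPE:TEL"

# Identifier branch: no initiator, length pattern [A-Z]{2,20}.
ident = re.match(r"[A-Z]{2,20}", data)
ident_left_over = data[ident.end():]

# Description branch: initiator "TYPE:" followed by length pattern [A-Z]{1,56}.
initiator = "TYPE:"
rest = data[len(initiator):] if data.startswith(initiator) else ""
desc = re.match(r"[A-Z]{1,56}", rest)
desc_left_over = rest[desc.end():]

print(ident.group(), repr(ident_left_over))  # TYPE ':TEL'
print(desc.group(), repr(desc_left_over))    # TEL ''
```

The Identifier branch succeeds but stops at the colon, so ":TEL" is left over. The Description branch consumes the whole field.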

On 5/11/22 7:59 AM, Roger L Costello wrote:
> Another thing that causes the dreaded left over data error message.
> 
> I have input containing this field:
> 
> TYPE:TEL
> 
> That is, the field is initiated by TYPE:
> 
> The field has a choice of values: either a string of 2-20 uppercase letters, or
> a string of 1-56 uppercase letters initiated by TYPE:
> 
> Here's the DFDL schema I used
> 
> <xs:choice dfdl:choiceLengthKind="implicit">
>       <xs:element name="Identifier" type="non-zero-length-string"
> dfdl:lengthPattern="[A-Z]{2,20}"/>
>       <xs:element name="Description" type="non-zero-length-string"
> dfdl:lengthPattern="[A-Z]{1,56}" dfdl:initiator="TYPE:"/>
> </xs:choice>
> 
> With that choice and the above input, Daffodil doesn't process the field and
> reports left over data. As best I can tell, Daffodil uses the first branch of
> the choice, notices that the regex doesn't contain a colon, and then gives up. I
> think.
> 
> If I reverse the element declarations, then Daffodil successfully processes the
> input.
> 
> I guess that I really don't understand why one works while the other doesn't.
> Would you explain why Daffodil reports left over data with the first but not the
> second, please?
> 
> For completeness, here is the simpleType:
> 
> <xs:simpleType name="non-zero-length-string" dfdl:lengthKind="pattern">
>      <xs:annotation>
>           <xs:appinfo source="http://www.ogf.org/dfdl/">
>               <dfdl:assert test="{ fn:nilled(.) or . ne '' }"/>
>           </xs:appinfo>
>       </xs:annotation>
>       <xs:restriction base="xs:string"/>
> </xs:simpleType>
> 
> /Roger
> 
> *From:* Mike Beckerle <mbecke...@apache.org>
> *Sent:* Tuesday, May 3, 2022 6:32 PM
> *To:* users@daffodil.apache.org
> *Subject:* [EXT] Re: Catalog the causes of the dreaded "left over data" error
> message
> 
> Here is a trick used in one schema I've seen:
> 
> <xs:group name="requireNoDataLeft">
>     <xs:sequence>
>       <xs:element name="data" type="tns:tIntField" dfdl:length="1" minOccurs="0"/>
>       <xs:sequence>
>         <xs:annotation>
>           <xs:appinfo source="http://www.ogf.org/dfdl/">
>             <dfdl:assert test="{ fn:not(fn:exists(data)) }" message="Data found where none was expected."/>
>           </xs:appinfo>
>         </xs:annotation>
>       </xs:sequence>
>     </xs:sequence>
> </xs:group>
> 
> So a group reference to "requireNoDataLeft" states "There cannot be any more
> data available."
> 
> This mostly is for the case where there is a surrounding "box" of data such as
> an element with lengthKind 'explicit' and you expect the described contents to
> use up everything in that box.
> 
> So if your first choice branch ends with a group ref to "requireNoDataLeft" then
> it must consume all available data, and will fail (and backtrack the choice to
> the next one) if there is data available after it.
> 
> On Tue, May 3, 2022 at 1:52 PM Roger L Costello <coste...@mitre.org> wrote:
> 
>      The "left over data" error occurs when there is a choice where the first
>      branch matches the same data as the second branch and the second branch
>      matches a bit more. Input data that matches the second branch fails because
>      the first branch parses the input and then stops and reports left over data.
>      See example below.
> 
>      Is there a workaround? (without manually shuffling the order of the branches
>      in the choice)
> 
>      <xs:choice>
>           <xs:element name="MilitaryDayTime">
>               <xs:complexType>
>                   <xs:sequence dfdl:separator="">
>                       <xs:element name="Day" type="non-zero-length-string" dfdl:lengthPattern="[0-9]{2}"/>
>                       <xs:element name="HourTime" type="non-zero-length-string" dfdl:lengthPattern="[0-9]{2}"/>
>                       <xs:element name="MinuteTime" type="non-zero-length-string" dfdl:lengthPattern="[0-9]{2}"/>
>                       <xs:element name="TimeZone" type="non-zero-length-string" dfdl:lengthPattern="..."/>
>                   </xs:sequence>
>               </xs:complexType>
>           </xs:element>
>           <xs:element name="DateTimeGroup">
>               <xs:complexType>
>                   <xs:sequence dfdl:separator="">
>                       <xs:element name="Day" type="non-zero-length-string" dfdl:lengthPattern="[0-9]{2}"/>
>                       <xs:element name="HourTime" type="non-zero-length-string" dfdl:lengthPattern="[0-9]{2}"/>
>                       <xs:element name="MinuteTime" type="non-zero-length-string" dfdl:lengthPattern="[0-9]{2}"/>
>                       <xs:element name="TimeZone" type="non-zero-length-string" dfdl:lengthPattern="..."/>
>                       <xs:element name="MonthName" type="non-zero-length-string" dfdl:lengthPattern="."/>
>                       <xs:element name="Year" type="non-zero-length-string" dfdl:lengthPattern="[0-9]{4}"/>
>                   </xs:sequence>
>               </xs:complexType>
>           </xs:element>
>      </xs:choice>
> 
>      *From:* Mike Beckerle <mbecke...@apache.org>
>      *Sent:* Monday, May 2, 2022 10:02 AM
>      *To:* users@daffodil.apache.org
>      *Subject:* [EXT] Re: Catalog the causes of the dreaded "left over data"
>      error message
> 
>      I first encountered left-over-data with a dead-simple file format. Just a
>      top level element named "records" with a minOccurs="0" maxOccurs="unbounded"
>      array of elements named "record".
> 
>      Due to minOccurs="0" such a schema is very happy to "successfully" parse
>      zero records, and tell you the entire file contents are "left over data".
> 
>      I learned one often wants to have minOccurs="1" to force it to at least be
>      successful on one record.
> 
>      On Fri, Apr 15, 2022 at 9:48 AM Roger L Costello <coste...@mitre.org> wrote:
> 
>          Hi Folks,
> 
>          Have you encountered the "left over data" error message? If you've
>          worked with Daffodil for more than 5 minutes, you undoubtedly have.
> 
>          The problem with that error message is it gives you absolutely no clue
>          what's causing the problem.
> 
>          Perhaps if we start cataloging the things that triggered the error
>          message, then the Daffodil team will be able to provide better
>          diagnostics. Here's my contribution to said catalog.
> 
>          -----------------------
> 
>          In recent weeks I have encountered the dreaded "left over data" error
>          message twice. After enormous effort I was able to figure out what the
>          problems were in my DFDL schema. First I need to describe my DFDL schema.
> 
>          My DFDL schema consists of a series of element declarations and within
>          each element are declarations of subelements:
> 
>          A
>               A.1
>               A.2
>               .
>          B
>               B.1
>               B.2
>               .
>          .
> 
>          Each subelement is of type string and uses a regex to describe the
>          subelement's data (i.e., the subelements use dfdl:lengthKind="pattern"
>          and dfdl:lengthPattern="regex")
> 
>          The first time that I got the "left over data" error message I found the
>          cause was due to this bug in my DFDL schema: a dfdl:lengthPattern listed
>          the regex alternatives in the wrong order (shortest to longest instead
>          of longest to shortest). The error message said that Daffodil stopped
>          consuming input at element G. The actual element containing the regex in
>          wrong order was element G.2 (Daffodil stopped consuming input pretty
>          near the problem)
> 
>          After I fixed that bug I immediately got another "left over data" error
>          at element J. After much more effort I found the bug: a regex
>          erroneously had spaces in it. In this case, the error message said that
>          Daffodil stopped consuming input at element J. The actual element
>          containing the regex with spaces was element K.5 (Daffodil stopped
>          consuming input pretty far from the problem)
> 
>          /Roger
> 
