I forgot to mention one other very important way to get better diagnostic messages in DFDL in general.
Add a dfdl:discriminator.

In your format you have an element with an initiator. Commonly in such formats, once the initiator is found, it is certain that the rest of that data item must appear. But your schema, or at least the fragments of it you have shown here, doesn't say that.

The problem you are having is that backtracking in DFDL means that if any sort of error occurs during the parse of your GeneralTextInfo item, which is optional/array, then the parser just backtracks, ends the optional/array at the prior element (or at no elements in this case), and then, since that's the end of the schema, it is done and has parsed successfully. Except... some data is left over.

This is 100% legal, since your format did not say that there is anything wrong with a failure during the parse of a GeneralTextInfo element, only that there is an optional/array of them. The dfdl:occursCountKind values 'implicit' and 'parsed' both have this meaning when minOccurs='0': it is correct for a parse to fail, and for that to mean the array ends. In this case it ends with zero occurrences.

But what you intend to express is "once you find the text 'GENTEXT' initiator, then the rest of that element must parse successfully" (and become the array/optional element). The way you do that is to add a discriminator. There are two ways to do this.

One is to go to the enclosing sequence where the GeneralTextInfo element appears, and add dfdl:initiatedContent='yes'. This requires that ALL children of that sequence have initiators, which is fairly common, and that is why there is a specific DFDL property for this case. That property means "once you find the initiator, discriminate true".
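As a rough sketch of that first way: the enclosing sequence isn't shown in the schema fragments in this thread, so its shape here is an assumption, as is the idea that every child in it has an initiator. Under those assumptions it might look something like this:

```xml
<!-- Hypothetical enclosing sequence; the real one is not shown in this thread. -->
<!-- Assumes every child of this sequence has a dfdl:initiator. -->
<xs:sequence dfdl:initiatedContent="yes">
  <xs:element name="GeneralTextInfo" minOccurs="0"
              dfdl:initiator="GENTEXT" dfdl:terminator="//">
    <xs:complexType>
      <xs:sequence dfdl:separator="/" dfdl:separatorPosition="prefix">
        <!-- TextIndicator, FreeText, etc. as in the existing schema -->
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:sequence>
```

With that in place, finding 'GENTEXT' should commit the parser to GeneralTextInfo, so a later failure inside it is reported as an error rather than quietly ending the array.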
The other way is to put this group in your schema, and add a group ref to it at the start of GeneralTextInfo's sequence:

  <xs:group name="discriminateTrue">
    <!-- this global group def is just to declutter points of use
         where we need to discriminate true -->
    <xs:sequence>
      <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:discriminator>{ fn:true() }</dfdl:discriminator>
      </xs:appinfo></xs:annotation>
    </xs:sequence>
  </xs:group>

  <xs:element name="GeneralTextInfo" minOccurs="0" dfdl:initiator="GENTEXT" dfdl:terminator="//">
    <xs:complexType>
      <xs:sequence dfdl:separator="/" dfdl:separatorPosition="prefix">
        <xs:group ref="ex:discriminateTrue"/>
        <!-- if we get this far, there IS a GeneralTextInfo element. -->
        ....
      </xs:sequence>
    </xs:complexType>
  </xs:element>

If there are other elements in the sequence without initiators, then you have to use this sort of explicit discriminator instead of the dfdl:initiatedContent property.

Now, once the GENTEXT is found in the data stream, any parse error that happens will be fatal, and you'll get a much better diagnostic about that error.

Experience with DFDL has shown that adding discriminators to schemas is VERY important to getting good diagnostics. Without them, many schemas are much too accepting of malformed data, or, even if they don't ultimately accept the data, they can backtrack in numerous ways, wasting time and ultimately failing very far away from where the actual problem was. Discriminators are key to a high-quality DFDL schema, because the job of a DFDL schema is not only to parse correct data, but also to robustly reject malformed data with good diagnostics.

This issue has come up so often that I think the "left over data" message should directly suggest "Consider adding discriminators to improve diagnostic behavior."

On Wed, Apr 6, 2022 at 11:52 AM Roger L Costello <coste...@mitre.org> wrote:

> After many hours of effort, I figured out what is causing the error.
> For some reason, Daffodil does not like a forward slash in a regex:
>
>   dfdl:lengthPattern="[/A-Z]+"
>
> The intent of that regex is to say that the input may contain a forward
> slash or any uppercase letter.
>
> That regex is contained in here:
>
>   <xs:element name="GeneralTextInfo" minOccurs="0" dfdl:initiator="GENTEXT"
>               dfdl:terminator="//">
>     <xs:complexType>
>       <xs:sequence dfdl:separator="/" dfdl:separatorPosition="prefix">
>         <xs:element name="TextIndicator" minOccurs="0" nillable="true"
>                     type="non-zero-length-string" dfdl:lengthPattern="[A-Z ]+"/>
>         <xs:element name="FreeText" minOccurs="0" nillable="true"
>                     type="non-zero-length-string" dfdl:lengthPattern="[/A-Z]+"/>
>       </xs:sequence>
>     </xs:complexType>
>   </xs:element>
>
> [The actual regex for FreeText is much more complicated than what I’ve
> shown. It took me a long time to distill out the part that was causing the
> error.]
>
> Here is the input:
>
>   GENTEXT/FOO/A//
>
> The error I get is:
>
>   [error] Left over data. Consumed 1504 bit(s) with at least 3040 bit(s) remaining.
>   Left over data (Hex) starting at byte 189 is: (0x0d0a47454e544558...)
>   Left over data (UTF-8) starting at byte 189 is: (??GENTEX...)
>
> Questions:
>
> 1. Why can’t a forward slash be used in a regex? What’s the workaround?
> 2. Why can’t the error message be more helpful? Why can’t Daffodil
>    generate this error message:
>
>    Error on line 34, column 59 of the DFDL schema. The regex in
>    dfdl:lengthPattern contains a forward slash, which cannot be used because
>    [insert reason here]. Daffodil is hereby abandoning the parse of this
>    element (FreeText) and its parent element (GeneralTextInfo).
>
> If I had had that error message, I could have fixed the problem in 30
> seconds.
>
> /Roger
>
> *From:* Mike Beckerle <mbecke...@apache.org>
> *Sent:* Wednesday, April 6, 2022 11:04 AM
> *To:* users@daffodil.apache.org
> *Subject:* [EXT] Re: Daffodil error messages are awful
>
> I agree that every bad error message is a bug, and any error message
> that is not-helpful should be reported as one.
>
> The left-over data error you are seeing is a bit tricky. When Daffodil is
> invoked to consume data from a stream, then this situation is not even an
> error at all, as it is perfectly normal for a parser to parse one message
> from a stream, and stop, leaving the stream positioned for the next parse
> call.
>
> Only when daffodil is invoked in a context where it is clear it is
> intended to consume the entire input, is this error detected at all.
>
> What this means is that the parse ended normally, produced an infoset, but
> then it was discovered that there was data left over.
>
> To me what can be improved here is the error message text, which should
> say that "parse ended normally", should indicate that an infoset was
> created (and display all/part of it), and indicate that it ended without
> consuming all the data, giving all positions in both bytes+optional 0..7
> bits if not on a byte boundary.
>
> On Wed, Apr 6, 2022 at 7:53 AM Roger L Costello <coste...@mitre.org>
> wrote:
>
> > Hi Folks,
> >
> > I ran Daffodil on my DFDL schema and got this error message:
> >
> >   [error] Left over data. Consumed 1504 bit(s) with at least 3040 bit(s) remaining.
> >   Left over data (Hex) starting at byte 189 is: (0x0d0a47454e544558...)
> >   Left over data (UTF-8) starting at byte 189 is: (??GENTEX...)
> >
> > That is a really bad error message. Why did Daffodil stop consuming the
> > input? No idea. What is in my DFDL schema that caused the generation of the
> > error? No idea.
> >
> > No disrespect intended, but Daffodil has the worst error messages of any
> > tool that I have ever encountered.
> >
> > Good error messages are important. In a recent podcast Michael Kay
> > (creator of Saxon) talks about his emphasis on good error messages:
> >
> >   What makes a good product? Users must be able to understand the error
> >   messages. People will tell you, one thing I like about Saxon is the error
> >   messages. To me, a bad error message is something that really needs to be
> >   fixed. Error messages are what users are dealing with every day. They are
> >   reading my error messages. If those glare out as being unhelpful, as being
> >   badly spelled, then that's their experience with the product, so it's
> >   important to get it right. I put a lot of effort into those sorts of little
> >   details. Getting good error messages is really quite an art. Do you phrase
> >   the error message in terms of the proper terminology of the spec, or do you
> >   use the terminology that the users are using (which might be quite wrong)?
> >   For example, what many users call a "tag" isn't what the spec calls a tag.
> >   They'll use "tag" to mean "element." So which word am I going to use in an
> >   error message? It's quite hard to get that sort of thing right. Getting a
> >   balance between a message that is technically correct and a message that
> >   users understand, sometimes requires a fair bit of thought. And then you've
> >   got to phrase the error message in terms of what the user was trying to do,
> >   not what was going on internally. That again gives you a significant
> >   challenge. So you have to think about those sorts of things.
> >
> > /Roger