Adding to Steve's response: I created ticket https://issues.apache.org/jira/browse/DAFFODIL-2686 about the poor diagnostic for left-over data.

As for Daffodil "does not like slash in a regex": I think you have lots of slashes going around here: "//" as terminator, "/" as prefix separator, and then "/" in a lengthPattern regex.

If I assume the surrounding global format has lengthKind 'delimited', then the question arises of who "wins" when separators, terminators, and lengthKind 'pattern' are duking it out over who gets to consume the "/".

The answer in DFDL is always "innermost wins". In this case the element with lengthKind 'pattern' will consume any "/" in the data stream, as it ignores the definitions of any surrounding delimiters. It is only when that FreeText element's parsing ends that we "pop the stack" and go back to lengthKind 'delimited' as defined in the surrounding scope.

Other lengthKinds also work this way. E.g., if an element had explicit or prefixed length 10 when parsing, then 10 characters would be taken, regardless of whether those 10 characters contain things defined as delimiters in surrounding scopes. Innermost wins is the rule.

In your case, when Daffodil starts trying to parse the element FreeText, it has lengthKind 'pattern', so scanning for delimiters is turned off and the lengthPattern regex match starts consuming the data stream. This will gobble up "A//", making that the data of element FreeText.

All of this would be obvious in a trace:

    daffodil -t parse -s mySchema.dfdl.xsd myData.dat

You would see it build an infoset where <FreeText>A//</FreeText> would be created.

I believe that to handle formats akin to what you are trying to express, you need to use lengthKind 'pattern' at the outer scopes, not the inner scopes. I.e., the element with terminator="//" should instead be expressed using lengthKind 'pattern' with a lengthPattern that matches as little as possible, up to but not including the "//", by looking ahead for it.

On Wed, Apr 6, 2022 at 12:27 PM Steve Lawrence <slawre...@apache.org> wrote:

> Forward slashes work in regular expressions how you expect.
> The issue is that your regular expression is consuming too much data.
> When Daffodil evaluates the regular expression, the data Daffodil is
> looking at looks like this:
>
>   A//
>
> Your regular expression greedily matches one or more capital letters or
> forward slashes. Note that delimiters are ignored when using
> lengthKind="pattern". This means your pattern matches the data "A//". So
> Daffodil thinks the FreeText element has the value "A//".
>
> This is where things start going off the rails.
>
> Because your FreeText element consumed your // characters, it means the
> terminator that the GeneralTextInfo element requires is missing.
> Because it is optional (minOccurs="0"), this is not considered an error;
> it just means the element does not exist, and Daffodil will backtrack.
>
> Presumably the rest of the schema does not expect any more data, and so
> Daffodil finishes the parse successfully. But there was left-over data,
> so we output an error letting you know (admittedly, pretty poorly).
>
> The fix here is to make your pattern not consume the terminator. Your
> FreeText probably wants to consume A-Z and /, but stop when it hits the
> double slash. This can be done with a forward lookahead, and you
> probably also want the + to be non-greedy, for example:
>
>   dfdl:lengthPattern="[/A-Z]+?(?=//)"
>
> - Steve
>
>
> On 4/6/22 11:52 AM, Roger L Costello wrote:
> > After many hours of effort, I figured out what is causing the error. For some
> > reason, Daffodil does not like a forward slash in a regex:
> >
> > dfdl:lengthPattern="[/A-Z]+"
> >
> > The intent of that regex is to say that the input may contain a forward slash or
> > any uppercase letter.
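
To make the greedy-versus-lookahead difference concrete, here is a minimal sketch in Python. It is only an illustration: Python's re module is a stand-in for the regex engine Daffodil actually uses, and the helper name consumed is made up for this example. For these simple constructs (a character class, a lazy quantifier, and a lookahead) the behavior should be the same.

```python
import re

# Data still unparsed when FreeText is reached, per Steve's walkthrough.
remaining = "A//"

greedy = re.compile(r"[/A-Z]+")          # the original lengthPattern
bounded = re.compile(r"[/A-Z]+?(?=//)")  # Steve's suggested lengthPattern


def consumed(pattern, data):
    """Hypothetical helper: text the pattern matches at offset 0, or None."""
    m = pattern.match(data)
    return m.group(0) if m else None


print(consumed(greedy, remaining))   # A//  (the terminator is swallowed)
print(consumed(bounded, remaining))  # A    (the "//" is left for the terminator)
print(consumed(bounded, "A/B//"))    # A/B  (a single "/" is still allowed inside)
```

With the bounded pattern, the "//" stays in the data stream, so the enclosing GeneralTextInfo element can still find its terminator.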
> >
> > That regex is contained in here:
> >
> > <xs:element name="GeneralTextInfo" minOccurs="0" dfdl:initiator="GENTEXT"
> >             dfdl:terminator="//">
> >   <xs:complexType>
> >     <xs:sequence dfdl:separator="/" dfdl:separatorPosition="prefix">
> >       <xs:element name="TextIndicator" minOccurs="0" nillable="true"
> >                   type="non-zero-length-string" dfdl:lengthPattern="[A-Z ]+"/>
> >       <xs:element name="FreeText" minOccurs="0" nillable="true"
> >                   type="non-zero-length-string" dfdl:lengthPattern="[/A-Z]+"/>
> >     </xs:sequence>
> >   </xs:complexType>
> > </xs:element>
> >
> > [The actual regex for FreeText is much more complicated than what I’ve shown. It
> > took me a long time to distill out the part that was causing the error.]
> >
> > Here is the input:
> >
> > GENTEXT/FOO/A//
> >
> > The error I get is:
> >
> > [error] Left over data. Consumed 1504 bit(s) with at least 3040 bit(s) remaining.
> > Left over data (Hex) starting at byte 189 is: (0x0d0a47454e544558...)
> > Left over data (UTF-8) starting at byte 189 is: (??GENTEX...)
> >
> > Questions:
> >
> > 1. Why can’t a forward slash be used in a regex? What’s the workaround?
> > 2. Why can’t the error message be more helpful? Why can’t Daffodil generate
> >    this error message:
> >
> > Error on line 34, column 59 of the DFDL schema. The regex in dfdl:lengthPattern
> > contains a forward slash, which cannot be used because [insert reason here].
> > Daffodil is hereby abandoning the parse of this element (FreeText) and its
> > parent element (GeneralTextInfo).
> >
> > If I had had that error message, I could have fixed the problem in 30 seconds.
> >
> > /Roger
> >
> > *From:* Mike Beckerle <mbecke...@apache.org>
> > *Sent:* Wednesday, April 6, 2022 11:04 AM
> > *To:* users@daffodil.apache.org
> > *Subject:* [EXT] Re: Daffodil error messages are awful
> >
> > I agree that every bad error message is a bug, and any error message that is
> > not-helpful should be reported as one.
> >
> > The left-over data error you are seeing is a bit tricky. When Daffodil is
> > invoked to consume data from a stream, then this situation is not even an error
> > at all, as it is perfectly normal for a parser to parse one message from a
> > stream, and stop, leaving the stream positioned for the next parse call.
> >
> > Only when daffodil is invoked in a context where it is clear it is intended to
> > consume the entire input, is this error detected at all.
> >
> > What this means is that the parse ended normally, produced an infoset, but then
> > it was discovered that there was data left over.
> >
> > To me what can be improved here is the error message text, which should say that
> > "parse ended normally", should indicate that an infoset was created (and display
> > all/part of it), and indicate that it ended without consuming all the data,
> > giving all positions in both bytes+optional 0..7 bits if not on a byte boundary.
> >
> > On Wed, Apr 6, 2022 at 7:53 AM Roger L Costello <coste...@mitre.org
> > <mailto:coste...@mitre.org>> wrote:
> >
> >     Hi Folks,
> >
> >     I ran Daffodil on my DFDL schema and got this error message:
> >
> >     [error] Left over data. Consumed 1504 bit(s) with at least 3040 bit(s)
> >     remaining.
> >     Left over data (Hex) starting at byte 189 is: (0x0d0a47454e544558...)
> >     Left over data (UTF-8) starting at byte 189 is: (??GENTEX...)
> >
> >     That is a really bad error message. Why did Daffodil stop consuming the
> >     input? No idea. What is in my DFDL schema that caused the generation of the
> >     error? No idea.
> >
> >     No disrespect intended, but Daffodil has the worst error messages of any
> >     tool that I have ever encountered.
> >
> >     Good error messages are important. In a recent podcast Michael Kay (creator
> >     of Saxon) talks about his emphasis on good error messages:
> >
> >     What makes a good product? Users must be able to understand the error
> >     messages.
> >     People will tell you, one thing I like about Saxon is the error
> >     messages. To me, a bad error message is something that really needs to be
> >     fixed. Error messages are what users are dealing with every day. They are
> >     reading my error messages. If those glare out as being unhelpful, as being
> >     badly spelled, then that's their experience with the product, so it's
> >     important to get it right. I put a lot of effort into those sorts of little
> >     details. Getting good error messages is really quite an art. Do you phrase
> >     the error message in terms of the proper terminology of the spec, or do you
> >     use the terminology that the users are using (which might be quite wrong)?
> >     For example, what many users call a "tag" isn't what the spec calls a tag.
> >     They'll use "tag" to mean "element." So which word am I going to use in an
> >     error message? It's quite hard to get that sort of thing right. Getting a
> >     balance between a message that is technically correct and a message that
> >     users understand sometimes requires a fair bit of thought. And then you've
> >     got to phrase the error message in terms of what the user was trying to do,
> >     not what was going on internally. That again gives you a significant
> >     challenge. So you have to think about those sorts of things.
> >
> >     /Roger
> >