A further comment on this thread:
Text formats are often full of redundancy, but binary formats are usually far less redundant. So we have to be cautious about relying on contrived text examples that contain lots of redundancy, since those will not be illustrative of binary data situations.

One way to help with this is to make our expository examples textual (so as to avoid having to use hex dumps), but binary-like in behavior.

E.g., this data is all text characters, but behaves like many binary data formats which store length and count values.

"003008abcd efg0061234ef002xx001000"

This is a 3-digit occurs count, followed by 3 variable-length strings, each of which is a 3-digit length prefix followed by the characters of the string. The first string is length 8, contents "abcd efg". The second is length 6, contents "1234ef". The third is length 2, contents "xx". Then there is a second 3-digit occurs count of 1, followed by one variable-length string of length 0.

The only way to know that 001 is an occurs count, and not the length of another string, is the 003 occurs count which appears at the beginning. You need the count just to parse this data properly.

There is no redundancy here in the lengths or counts, and there are no delimiters or escape schemes in this format, which is much more typical of binary data.

The logical equivalent XML to this I would say is something like:

<stringArray>
  <count>3</count>
  <str>abcd efg</str>
  <str>1234ef</str>
  <str>xx</str>
</stringArray>
<stringArray>
  <count>1</count>
  <str/>
</stringArray>

-mikeb
________________________________
From: Steve Lawrence <slawre...@apache.org>
Sent: Thursday, August 12, 2021 2:05 PM
To: users@daffodil.apache.org <users@daffodil.apache.org>
Subject: Re: Minimalist DFDL, part II

Yep, this is related to the discussion about well-formed vs valid. It's not uncommon, and often preferred, to model only the syntax of the data so that you can parse data that is syntactically correct (i.e.
well-formed) but isn't semantically correct (i.e. not valid), and then do the validation later. That would, for example, let you parse data with the number "4" but only 3 students listed, and then use XSLT/Schematron to detect that the counts don't match up.

That said, I think you'll still often need occursCountKind="expression". Once you start modeling more complicated data formats, you almost always start seeing repetitions of types, and you often can't use speculative parsing to differentiate between the types. The only solution is to use expressions to figure out the occurrences. For example, say we had this data:

  3
  2
  John Doe
  Sally Smith
  Judy Jones
  Richard Roe
  Bob Barker

We really don't want to think of this as two numbers followed by 5 strings. That just isn't going to be useful. We instead want to think of this as two numbers that specify the number of students and the number of teachers, followed by a list of the student names and a list of the teacher names. And so we really want an infoset that looks like this:

  <People>
    <NumStudents>3</NumStudents>
    <NumTeachers>2</NumTeachers>
    <Students>
      <name>John Doe</name>
      <name>Sally Smith</name>
      <name>Judy Jones</name>
    </Students>
    <Teachers>
      <name>Richard Roe</name>
      <name>Bob Barker</name>
    </Teachers>
  </People>

Notice this data doesn't allow speculative parsing to differentiate student names from teacher names: the names have the exact same form. So the only way to know when one ends and the other begins is by using occursCountKind="expression" and an expression to reach back into the parsed numbers to figure out the number of occurrences.

- Steve

On 8/12/21 1:01 PM, Roger L Costello wrote:
> Hi Folks,
>
> A couple of weeks ago Mike Beckerle pointed out that many data formats
> contain things like this:
>
>     A number, N
>     N occurrences of something
>
> For example, 3 followed by the names of three students:
>
>     3
>     John Doe
>     Sally Smith
>     Judy Jones
>
> How should that be parsed?
> Using the DFDL occursCount and
> occursCountKind="expression" and hiddenGroup you can parse the input to
> ensure that exactly three student names are consumed. The output is this XML:
>
>     <Students>
>         <name>John Doe</name>
>         <name>Sally Smith</name>
>         <name>Judy Jones</name>
>     </Students>
>
> But is it really the job of the parser to "ensure that exactly three student
> names are consumed"?
>
> I raised this question to the compiler experts on the compilers Usenet list.
> Here's what one person wrote:
>
>> I would contend that in your example the /syntax/ of lists is really a number
>> followed by zero or more strings (number string*), and that verifying the
>> string count is semantics, not syntax. I believe that, whenever possible,
>> semantics are best left until after parsing is finished.
>
> In other words, keep your DFDL schema simple: forget
> occursCountKind="expression" and hiddenGroup; just parse the number and the
> following strings. The output should be this:
>
>     <number>3</number>
>     <Students>
>         <name>John Doe</name>
>         <name>Sally Smith</name>
>         <name>Judy Jones</name>
>     </Students>
>
> If you need to "ensure that there are 3 student names" you can do that check
> *after* parsing.
>
> This is the Minimalist DFDL philosophy.
>
> /Roger
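
To make the two examples discussed in this thread concrete, here is a minimal Python sketch (not DFDL, and not how Daffodil works internally) of why the counts are needed during parsing. The function names parse_string_arrays and parse_people are hypothetical, chosen only for illustration. The second function also assumes the students/teachers data arrives as one value per line, which the thread does not state.

```python
def parse_string_arrays(data: str, width: int = 3) -> list[list[str]]:
    """Parse back-to-back arrays of length-prefixed strings.

    Each array is a `width`-digit occurs count followed by that many strings.
    Each string is a `width`-digit length followed by exactly that many chars.
    There are no delimiters, so the counts and lengths drive the parse.
    """
    arrays: list[list[str]] = []
    pos = 0

    def read_number() -> int:
        nonlocal pos
        field = data[pos:pos + width]
        if len(field) != width or not (field.isascii() and field.isdigit()):
            raise ValueError(f"expected {width}-digit number at offset {pos}")
        pos += width
        return int(field)

    while pos < len(data):
        count = read_number()          # occurs count for this array
        strings = []
        for _ in range(count):
            length = read_number()     # length prefix of one string
            if pos + length > len(data):
                raise ValueError(f"string at offset {pos} runs past end of data")
            strings.append(data[pos:pos + length])
            pos += length
        arrays.append(strings)
    return arrays


def parse_people(lines: list[str]) -> dict[str, list[str]]:
    """Split a flat list of names into students and teachers.

    Assumes (hypothetically) one value per line: two counts, then the names.
    The names all look alike, so only the first count tells us where the
    students end and the teachers begin. Checking that the second count
    matches the remaining names is left for a later validation step.
    """
    num_students = int(lines[0])
    names = lines[2:]
    return {
        "Students": names[:num_students],
        "Teachers": names[num_students:],
    }


print(parse_string_arrays("003008abcd efg0061234ef002xx001000"))
print(parse_people(["3", "2", "John Doe", "Sally Smith", "Judy Jones",
                    "Richard Roe", "Bob Barker"]))
```

In both cases the parser has to reach back to a previously parsed number to decide how many items follow. That is the role occursCountKind="expression" plays in a DFDL schema.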