Hi Folks,

What is "the DFDL way" to describe (specify) data?
I seek your insights! Below I attempt to distill some principles of DFDL design by taking a specific use case. I show 3 approaches to describing (specifying) the data, along with their tradeoffs. Are there approaches that I have missed? Are there advantages and disadvantages that I have missed? Please share your thoughts!

/Roger

---------------------------------------

Scenario: The data format has a field which contains a datetime value, e.g.,

    .../20230926T124800Z/...

The data between the two slashes denotes this datetime: September 26, 2023, at 12:48 PM GMT.

Here are 3 approaches to describing the field:

1. The field is a string delimited by two slashes. Based on that description, parsing produces this XML:

    <DateTime>20230926T124800Z</DateTime>

2. The field is composed of subfields. Parsing produces this XML:

    <DateTime>
        <Year>2023</Year>
        <Month>09</Month>
        <Day>26</Day>
        <Separator>T</Separator>
        <Hour>12</Hour>
        <Minute>48</Minute>
        <Second>00</Second>
        <TimeZone>Z</TimeZone>
    </DateTime>

3. The field contains an ISO 8601 datetime value. Parsing produces this XML:

    <DateTime>2023-09-26T12:48:00+00:00</DateTime>

Here are the advantages and disadvantages of each approach:

1. With this approach it is easy to specify the field (the data is simply a string) and it results in a small XML representation. However, it requires the consumer of the XML to understand how the element's value is encoded.

2. This approach takes more work to express the field and yields a much larger XML, but it makes explicit the meaning of each subfield in the data. It has been said that the purpose of XML is to make explicit what is implicit. The input data has implicit structuring - the first four characters represent the year, the next two characters represent the month, and so forth - which this approach makes explicit. In (1) the implicit structure remains implicit.

3. This approach yields a compact XML representation, the output data is in a standard form, and it leverages the DFDL processor to the greatest extent. That last point is key. In (2), we (the humans) do a lot of work to specify the subfields. The DFDL processor (the machine) doesn't do much; the humans have already done most of the heavy lifting. With this third approach, humans do little and the machine does a lot. However, one of the thorny things with datetime data is time zones. Military data usually uses Zulu time (UTC), but many other data formats omit time zone information. Getting each datetime to carry the right time zone, and to show no time zone when none is known, is tricky.
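
To make the contrast between approaches 2 and 3 concrete, here is a minimal sketch in Python (not DFDL, and not what any DFDL processor actually runs) of the two mappings applied to the sample value. The function names `split_subfields` and `to_iso8601` are hypothetical, and the sketch assumes the only time zone marker that can appear is a trailing `Z`. It also shows one possible answer to the time zone question: emit an offset only when the input supplies one.

```python
from datetime import datetime, timezone

# Compact layout of the sample field, without the trailing zone marker.
COMPACT = "%Y%m%dT%H%M%S"


def split_subfields(raw: str) -> dict:
    """Approach 2: expose the fixed-width subfields by position."""
    return {
        "Year": raw[0:4],
        "Month": raw[4:6],
        "Day": raw[6:8],
        "Separator": raw[8:9],
        "Hour": raw[9:11],
        "Minute": raw[11:13],
        "Second": raw[13:15],
        "TimeZone": raw[15:],  # empty string when the input has no zone
    }


def to_iso8601(raw: str) -> str:
    """Approach 3: let the machine interpret the value as a datetime."""
    if raw.endswith("Z"):
        # Zulu time: attach an explicit UTC offset.
        parsed = datetime.strptime(raw[:-1], COMPACT).replace(tzinfo=timezone.utc)
    else:
        # No zone in the input: keep the value naive so no offset is shown.
        parsed = datetime.strptime(raw, COMPACT)
    return parsed.isoformat()


print(split_subfields("20230926T124800Z")["Month"])  # 09
print(to_iso8601("20230926T124800Z"))  # 2023-09-26T12:48:00+00:00
print(to_iso8601("20230926T124800"))   # 2023-09-26T12:48:00
```

In approach 2 the positional knowledge lives in the hand-written slices. In approach 3 it lives in a single pattern that the library (or, by analogy, the DFDL processor) interprets.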