What is "the DFDL way" to describing data? Here are 3 approaches to describing data

Roger L Costello Sun, 29 Oct 2023 09:58:38 -0700

Hi Folks,

What is "the DFDL way" of describing (specifying) data?


I seek your insights!

Below I attempt to distill some principles of DFDL design by taking a specific 
use case. I show 3 approaches to describe (specify) the data and their 
tradeoffs. Are there approaches that I have missed? Are there advantages and 
disadvantages that I have missed? Please share your thoughts!  /Roger
---------------------------------------
Scenario: The data format has a field which contains a datetime value, e.g.,

.../20230926T124800Z/...

The data between the two slashes denotes this datetime: September 26, 2023, at 
12:48 PM GMT.

Here are 3 approaches to describing the field:

1. The field is a string delimited by two slashes. Based on that description, 
parsing produces this XML:

<DateTime>20230926T124800Z</DateTime>

2. The field is composed of subfields. Parsing produces this XML:

<DateTime>
    <Year>2023</Year>
    <Month>09</Month>
    <Day>26</Day>
    <Separator>T</Separator>
    <Hour>12</Hour>
    <Minute>48</Minute>
    <Second>00</Second>
    <TimeZone>Z</TimeZone>
</DateTime>

3. The field contains an ISO 8601 date time value. Parsing produces this XML:

<DateTime>2023-09-26T12:48:00+00:00</DateTime>
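
To make the three outputs concrete, here is a rough sketch in Python (not 
DFDL) of what each approach extracts from the raw field. It only mimics the 
results shown above. The function names are hypothetical, and it assumes the 
field always has the fixed layout of the example, with a trailing Z.

```python
from datetime import datetime, timezone

RAW = "20230926T124800Z"  # the text found between the two slashes


def as_string(raw: str) -> str:
    # Approach 1: the field is an opaque string, passed through unchanged.
    return raw


def as_subfields(raw: str) -> dict:
    # Approach 2: fixed-width subfields, sliced out by position.
    return {
        "Year": raw[0:4],
        "Month": raw[4:6],
        "Day": raw[6:8],
        "Separator": raw[8],
        "Hour": raw[9:11],
        "Minute": raw[11:13],
        "Second": raw[13:15],
        "TimeZone": raw[15:],
    }


def as_iso8601(raw: str) -> str:
    # Approach 3: interpret the text as a UTC datetime and re-emit it
    # in ISO 8601 extended form.
    parsed = datetime.strptime(raw, "%Y%m%dT%H%M%SZ")
    return parsed.replace(tzinfo=timezone.utc).isoformat()


print(as_string(RAW))
print(as_subfields(RAW))
print(as_iso8601(RAW))
```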

Here are the advantages and disadvantages of each approach:

1. With this approach it is easy to specify the field (the data is simply a 
string) and it results in a small XML representation. However, it requires the 
consumer of the XML to understand how the element's value is encoded.

2. This approach takes more work to express the field and yields a much larger 
XML, but it makes explicit the meaning of each subfield in the data. It has 
been said that the purpose of XML is to make explicit what is implicit. The 
input data has implicit structuring - the first four characters represent the 
year, the next two characters represent the month, and so forth - which this 
approach makes explicit. In (1) the implicit structure remains implicit.

3. This approach yields a compact XML representation, the output data is in a 
standard form, and it leverages the DFDL processor to the greatest extent. The 
last point is key. In (2), we (the humans) do a lot of work to specify the 
subfields. The DFDL processor (the machine) doesn't do much; the humans have 
already done most of the heavy lifting. With this third approach, humans do 
little, the machine does a lot. However, time zones are one of the thorny 
aspects of datetime data. Military data usually uses Zulu time (UTC), but many 
other data formats omit time zone information, so it is tricky to get the 
datetimes to carry the right time zone and to avoid displaying a time zone when 
none is known.
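
The time zone difficulty in (3) can be illustrated with a small Python sketch 
(again, not DFDL). It emits a UTC offset only when the input actually carries 
the Z marker, and otherwise leaves the datetime without a zone. The name 
to_iso8601 is hypothetical, and the sketch assumes a trailing Z is the only 
zone indicator that ever appears in the data.

```python
from datetime import datetime, timezone


def to_iso8601(raw: str) -> str:
    if raw.endswith("Z"):
        # Zone is known: Zulu time, so attach UTC and emit an offset.
        parsed = datetime.strptime(raw, "%Y%m%dT%H%M%SZ")
        return parsed.replace(tzinfo=timezone.utc).isoformat()
    # Zone is unknown: keep the value naive so no offset is displayed.
    return datetime.strptime(raw, "%Y%m%dT%H%M%S").isoformat()


print(to_iso8601("20230926T124800Z"))
print(to_iso8601("20230926T124800"))
```

A real format may signal its zone in other ways (or not at all), which is 
exactly where the choice of default matters.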
