I’m surprised to hear about the (7) parse-only use case.  Why is a DFDL schema 
with no unparse capability such a common use case?

To date, I’ve applied DFDL schemas only to binary fixed format data, not text 
data, but it seems to me like every DFDL schema I’ve written or encountered for 
binary data works in both directions (parse and unparse) without any special 
effort.  You would have to make a mistake or use some advanced or unusual 
DFDL feature (input/output value calculation, variables, hidden elements, 
etc.) to break a binary fixed format schema’s symmetric parse and unparse 
functionality.  Using a DFDL schema to move data between a binary data file and 
an XML infoset file feels so natural that I’m puzzled DFDL schemas with no 
unparse capability are as common as you say.  How come?
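
The symmetry can be sketched in plain Python rather than DFDL. This is only an 
illustration under stated assumptions: the record layout and the parse/unparse 
helper names are made up, and nothing here is Daffodil API. Each field has a 
fixed width and exactly one binary representation, so a single layout 
description drives both directions.

```python
import struct

# Hypothetical fixed-format record (an assumption for illustration):
# 2-byte unsigned id, 4-byte signed value, 1-byte flag, all big-endian.
RECORD = struct.Struct(">HiB")


def parse(data: bytes) -> dict:
    """Binary record -> infoset-like dict."""
    rec_id, value, flag = RECORD.unpack(data)
    return {"id": rec_id, "value": value, "flag": flag}


def unparse(infoset: dict) -> bytes:
    """Infoset-like dict -> binary record, using the same layout."""
    return RECORD.pack(infoset["id"], infoset["value"], infoset["flag"])


original = bytes([0x00, 0x2A, 0xFF, 0xFF, 0xFF, 0xFE, 0x01])
infoset = parse(original)
roundtrip = unparse(infoset)
```

Here `roundtrip` equals `original` byte for byte, which is the "works in both 
directions without any special effort" behavior described above.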

John

From: Mike Beckerle <mbecke...@apache.org>
Sent: Tuesday, October 17, 2023 3:35 PM
To: users@daffodil.apache.org
Subject: EXT: Re: 6 ways to use DFDL

Agree that the right thing to do depends on your needs.

(3) and (5) seem the same to me.

This notion of "erroneous data is detected" has some degrees.

Data can be so broken that you can't even tell how big a field of data is. In 
that case, you can't capture data into some special element because you don't 
even know how much data to capture.
Perhaps a good term for this situation is "non-isolatable data". You can't even 
isolate the start and end of the data in question. The only thing you can do in 
this case is a fatal parse failure.

If you can isolate the erroneous data, then the data can be malformed (parsing 
fails) but you have the option to capture the isolated but non-well-formed data 
in some sort of exception element like <unrecognized>....</unrecognized> 
(probably a better name than "invalid" for such an element, since the data is 
worse than just invalid in the XSD sense. It's not even well formed.)

I like to combine your different options by having a control variable that 
determines whether an assertion fails on invalid data or not, whether 
erroneous data gets isolated in special elements or not, etc. It is possible 
for a schema to serve multiple such purposes, and such a schema is more general 
purpose than if the schema is purpose-built for just one of your cases.  This 
is a pretty advanced technique in DFDL though.
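
The idea of one control setting switching between "fail" and "capture" can be 
sketched in plain Python. This is not DFDL and not how Daffodil implements 
anything; the 4-digit record format, the capture_unrecognized flag, and the 
FatalParseError name are all assumptions made for illustration. It also shows 
the isolatable/non-isolatable distinction: a bad record of known length can be 
captured, but a trailing partial record can only be a fatal failure.

```python
RECORD_LEN = 4  # hypothetical format: every record is exactly 4 ASCII digits


class FatalParseError(Exception):
    """Raised when parsing cannot continue (hypothetical name)."""


def parse_records(data: bytes, capture_unrecognized: bool) -> list:
    """Parse fixed-length records; the flag plays the role of the control variable."""
    if len(data) % RECORD_LEN != 0:
        # Non-isolatable: we cannot tell where the damaged record starts or ends,
        # so failing is the only option regardless of the flag.
        raise FatalParseError("trailing partial record")

    out = []
    for offset in range(0, len(data), RECORD_LEN):
        chunk = data[offset:offset + RECORD_LEN]
        if chunk.isdigit():
            out.append(("number", int(chunk.decode("ascii"))))
        elif capture_unrecognized:
            # Isolatable but not well formed: keep the raw bytes as hex.
            out.append(("unrecognized", chunk.hex()))
        else:
            raise FatalParseError(f"malformed record at offset {offset}")
    return out


lenient = parse_records(b"0042ABCD0007", capture_unrecognized=True)
```

With the flag on, the middle record comes back as an "unrecognized" entry and 
parsing continues; with the flag off, the same input raises FatalParseError.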

There are a few more things one can do with DFDL.

(7) parse-only schema (no unparse capability). This is really common.
(8) selective parsing - parse only parts of the data. Skip over other parts of 
the data. (this is a special case of parse-only)
(9) unparse to canonical form
(10) unparse preserving data bit-for-bit from parse.

W.r.t. (9) and (10) here, Cyberians tend to get pretty hung up on this. If 
a schema must unparse, the question arises of the "fidelity" with which it 
reproduces data coming from a parse. An important decision is whether the 
unparsed output has to be bit-for-bit identical to the input, or whether the 
data is converted to a canonical form. A canonical-form schema is usually 
easier and more natural to write. Getting bit-for-bit identical data out is 
harder, and often quite awkward.

The classic case is text numbers and leading zeros. Parsing "00001" and 
unparsing to the canonical form "1" is natural, but it breaks bit-for-bit 
fidelity of parse/unparse. Besides being easier and more natural, canonical 
form blocks a covert channel that hides information in the choice of 
leading zeros.

Trying to get bit-for-bit identical output requires doing unnatural things like 
treating numbers as strings, capturing what characters are escaped 
unnecessarily in strings, whether alternative/redundant delimiters are present 
or not, etc. It's quite challenging actually.
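
The leading-zeros trade-off can be shown with a small Python sketch. This is 
plain Python, not DFDL; the function names are made up for illustration. The 
canonical pair models the field as a number, the preserving pair models it as 
a string so that the original characters survive the round trip.

```python
def parse_canonical(text: str) -> int:
    """Treat the field as a number; leading zeros are not remembered."""
    return int(text)


def unparse_canonical(value: int) -> str:
    """Write the number in its one canonical text form."""
    return str(value)


def parse_preserving(text: str) -> str:
    """Treat the field as a string of digits so every character is kept."""
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a digit string: {text!r}")
    return text


def unparse_preserving(value: str) -> str:
    """Write back exactly the characters that were parsed."""
    return value


raw = "00001"
canonical_out = unparse_canonical(parse_canonical(raw))     # "1"
preserved_out = unparse_preserving(parse_preserving(raw))   # "00001"
```

The canonical round trip turns "00001" into "1", so whatever was encoded in 
the zero count is gone. The preserving round trip is identical to the input, 
at the cost of the infoset holding a string where a number would be more 
natural.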

-mikeb

On Tue, Oct 17, 2023 at 6:59 AM Roger L Costello 
<coste...@mitre.org> wrote:
Hi Folks,

Here are 6 ways to use DFDL:

1.      Use DFDL to process any well-formed input. All data (even invalid data) 
is okay as long as it is well-formed.
2.      Same as #1 except after generating the XML, validate the XML, i.e., run 
Daffodil with the validate flag.
3.      If erroneous data is detected, put the data into an <invalid> element. 
Continue parsing. No errors raised during parsing. Need to examine the XML that 
is output for the presence of <invalid> elements.
4.      If erroneous data is detected, throw an error, and stop parsing 
immediately. No output is generated.
5.      Same as #4 except don't stop parsing. The output contains the erroneous 
data.
6.      If erroneous data is detected, throw an error, drop the data, continue 
parsing. The output contains only valid data.

None of these ways are "the right way to use DFDL". Which is the "right way" 
depends on your situation and requirements.

Do you agree with the above? In my list of "ways to use DFDL" have I missed any 
ways?

/Roger
