I’m surprised to hear about the (7) parse-only use case. Why is a DFDL schema with no unparse capability such a common use case?
To date, I’ve applied DFDL schemas only to binary fixed format data, not text data, but it seems to me like every DFDL schema I’ve written or encountered for binary data works in both directions (parse and unparse) without any special effort. You would have to make a mistake or use some advanced or unusual DFDL feature (input / output value calculation, variable, hidden element, etc.) to break a binary fixed format schema’s symmetric parse and unparse functionality. Using a DFDL schema to move data between a binary data file and an XML infoset file feels so natural that I’m puzzled DFDL schemas with no unparse capability are as common as you say. How come?

John

From: Mike Beckerle <mbecke...@apache.org>
Sent: Tuesday, October 17, 2023 3:35 PM
To: users@daffodil.apache.org
Subject: EXT: Re: 6 ways to use DFDL

Agree that the right thing to do depends on your needs.

(3) and (5) seem the same to me.

This notion of "erroneous data is detected" has some degrees.

Data can be so broken that you can't even tell how big a field of data is. In that case, you can't capture data into some special element because you don't even know how much data to capture. Perhaps a good term for this situation is "non-isolatable data". You can't even isolate the start and end of the data in question. The only thing you can do in this case is a fatal parse failure.

If you can isolate the erroneous data, then the data can be malformed (parsing fails) but you have the option to capture the isolated but non-well-formed data in some sort of exception element like <unrecognized>....</unrecognized> (probably a better name than "invalid" for such an element, since the data is worse than just invalid in the XSD sense. It's not even well formed.)

I like to combine your different options by having a control variable that determines whether an assertion fails on invalid data or not, whether erroneous data gets isolated in special elements or not, etc.
It is possible for a schema to serve multiple such purposes, and such a schema is more general purpose than if the schema is purpose-built for just one of your cases. This is a pretty advanced technique in DFDL though.

There are a few more things one can do with DFDL.

(7) parse-only schema (no unparse capability). This is really common.
(8) selective parsing - parse only parts of the data. Skip over other parts of the data. (this is a special case of parse-only)
(9) unparse to canonical form
(10) unparse preserving data bit-for-bit from parse.

W.r.t. (9) and (10) here, Cyberians tend to get pretty hung up on this one. If a schema must unparse, the question arises of what is the "fidelity" with which it reproduces data coming from a parse.

An important decision is does the unparsed output have to be bit-for-bit identical to the input, or is the data being converted to a canonical form?

Canonical form is usually an easier and more natural schema to write. Getting bit-for-bit identical data out is harder, and often quite awkward.

The classic case is text numbers and leading zeros. That is, parsing "00001" but unparsing to a canonical form as "1" is natural, but breaks bit-for-bit fidelity of parse/unparse. The reason to use canonical form besides being easier and more natural is that it blocks a covert channel that tries to hide information using leading zeros to encode that information.

Trying to get bit-for-bit identical output requires doing unnatural things like treating numbers as strings, capturing what characters are escaped unnecessarily in strings, whether alternative/redundant delimiters are present or not, etc. It's quite challenging actually.

-mikeb

On Tue, Oct 17, 2023 at 6:59 AM Roger L Costello <coste...@mitre.org> wrote:

Hi Folks,

Here are 6 ways to use DFDL:

1. Use DFDL to process any well-formed input. All data (even invalid data) is okay as long as it is well-formed.
2. Same as #1 except after generating the XML, validate the XML, i.e., run Daffodil with the validate flag.
3. If erroneous data is detected, put the data into an <invalid> element. Continue parsing. No errors raised during parsing. Need to examine the XML that is output for the presence of <invalid> elements.
4. If erroneous data is detected, throw an error, and stop parsing immediately. No output is generated.
5. Same as #4 except don't stop parsing. The output contains the erroneous data.
6. If erroneous data is detected, throw an error, drop the data, continue parsing. The output contains only valid data.

None of these ways are "the right way to use DFDL". Which is the "right way" depends on your situation and requirements.

Do you agree with the above?

In my list of "ways to use DFDL" have I missed any ways?

/Roger
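
The leading-zeros case from Mike's reply (parsing "00001" and unparsing "1") can be sketched in a few lines of Python. This is only an illustration of canonical versus bit-for-bit round-tripping, assuming an unsigned decimal text field. It is not DFDL and does not use Daffodil, and every function name below is hypothetical.

```python
# Illustrative only: plain Python, not DFDL and not the Daffodil API.
# All function names are hypothetical and exist just for this sketch.

def parse_canonical(text: str) -> int:
    """Parse a text number into its logical value; leading zeros are lost."""
    return int(text)

def unparse_canonical(value: int) -> str:
    """Write the logical value in canonical form (no leading zeros)."""
    return str(value)

def parse_preserving(text: str) -> str:
    """Treat the number as a string so the original characters survive."""
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not an unsigned decimal text number: {text!r}")
    return text

def unparse_preserving(captured: str) -> str:
    """Write back exactly the characters that were captured at parse time."""
    return captured

original = "00001"

# Canonical round trip: natural to write, but the output differs from the input.
canonical_out = unparse_canonical(parse_canonical(original))

# Preserving round trip: identical output, at the cost of keeping a string
# in the infoset instead of a number.
preserved_out = unparse_preserving(parse_preserving(original))
```

Here `canonical_out` no longer matches `original`, so any information carried in the leading zeros is dropped, while `preserved_out` matches exactly but forces the number to be handled as a string.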