Re: 6 ways to use DFDL

Mike Beckerle Tue, 17 Oct 2023 16:00:02 -0700

The parse-only use case is for importing data into enterprise software
systems of various kinds: databases, data analysis tools, data
integration, master-data-management, AI, etc. Data goes in; integrated
data, decisions, reports, and graphics come out. Most of those systems
have no need to export data in the form it arrived. Not exclusively, but
mostly they don't export much at all. Often CSV export is sufficient,
since the data ends up in more tabular forms after ingest.


Using DFDL to give data the ultimate in scrutiny for cybersecurity
reasons, via the parse-validate-unparse cycle, is comparatively recent.
Most people I explain the cyber use case to find it surprising, but once
I point out that it's just one pass over the data, so it isn't
necessarily slow, they see the value of it.
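
To make the parse-validate-unparse cycle concrete, here is a toy sketch in
Python. It is not the Daffodil API and involves no real DFDL schema. The
comma-delimited record layout, the range limit, and the function names
parse, validate, unparse, and scrub are all assumptions made up for
illustration. It only shows the shape of the cycle: reject data that is
not well-formed, reject data that is well-formed but invalid, and write
what survives back out in canonical form.

```python
# Toy illustration only: this is NOT the Daffodil API. The record layout
# ("<decimal count>,<3-letter code>") and the limits are made-up assumptions.

def parse(data: str) -> dict:
    """Turn one text record into an infoset-like dict, or fail if malformed."""
    fields = data.split(",")
    if len(fields) != 2:
        raise ValueError("not well-formed: expected exactly two fields")
    count_text, code = fields
    if not (count_text.isascii() and count_text.isdigit()):
        raise ValueError("not well-formed: count must be ASCII digits")
    return {"count": int(count_text), "code": code}


def validate(infoset: dict) -> None:
    """Check constraints that go beyond well-formedness."""
    if not 1 <= infoset["count"] <= 500:
        raise ValueError("invalid: count out of range 1..500")
    code = infoset["code"]
    if not (len(code) == 3 and code.isascii() and code.isalpha() and code.isupper()):
        raise ValueError("invalid: code must be three uppercase ASCII letters")


def unparse(infoset: dict) -> str:
    """Write the infoset back out in canonical form (no leading zeros)."""
    return f"{infoset['count']},{infoset['code']}"


def scrub(data: str) -> str:
    """Parse, validate, and unparse one record."""
    infoset = parse(data)
    validate(infoset)
    return unparse(infoset)


if __name__ == "__main__":
    print(scrub("00001,ABC"))
```

Because this toy unparser writes canonical form, the leading zeros in the
input do not survive the round trip, which is the canonical-form versus
bit-for-bit trade-off discussed in the quoted message below.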




On Tue, Oct 17, 2023 at 4:09 PM Interrante, John A (GE Aerospace, US) <
john.interra...@ge.com> wrote:

> I’m surprised to hear about the (7) parse-only use case.  Why is a DFDL
> schema with no unparse capability such a common use case?
>
>
>
> To date, I’ve applied DFDL schemas only to binary fixed format data, not
> text data, but it seems to me like every DFDL schema I’ve written or
> encountered for binary data works in both directions (parse and unparse)
> without any special effort.  You would have to make a mistake or use some
> advanced or out of the usual DFDL feature (input / output value
> calculation, variable, hidden element, etc.) to break a binary fixed format
> schema’s symmetric parse and unparse functionality.  Using a DFDL schema to
> move data between a binary data file and an XML infoset file feels so
> natural that I’m puzzled DFDL schemas with no unparse capability are as
> common as you say.  How come?
>
>
>
> John
>
>
>
> *From:* Mike Beckerle <mbecke...@apache.org>
> *Sent:* Tuesday, October 17, 2023 3:35 PM
> *To:* users@daffodil.apache.org
> *Subject:* EXT: Re: 6 ways to use DFDL
>
>
>
> Agree that the right thing to do depends on your needs.
>
>
>
> (3) and (5) seem the same to me.
>
>
>
> This notion of "erroneous data is detected" has some degrees.
>
>
>
> Data can be so broken that you can't even tell how big a field of data is.
> In that case, you can't capture data into some special element because you
> don't even know how much data to capture.
>
> Perhaps a good term for this situation is "non-isolatable data". You can't
> even isolate the start and end of the data in question. The only thing you
> can do in this case is a fatal parse failure.
>
>
>
> If you can isolate the erroneous data, then the data can be malformed
> (parsing fails) but you have the option to capture the isolated but
> non-well-formed data in some sort of exception element like
> <unrecognized>....</unrecognized> (probably a better name than "invalid"
> for such an element, since the data is worse than just invalid in the XSD
> sense. It's not even well formed.)
>
>
>
> I like to combine your different options by having a control variable that
> determines whether an assertion fails on invalid data or not, whether
> erroneous data gets isolated in special elements or not, etc. It is
> possible for a schema to serve multiple such purposes, and such a schema is
> more general purpose than if the schema is purpose-built for just one of
> your cases.  This is a pretty advanced technique in DFDL though.
>
>
>
> There are a few more things one can do with DFDL.
>
>
>
> (7) parse-only schema (no unparse capability). This is really common.
>
> (8) selective parsing - parse only parts of the data. Skip over other
> parts of the data. (this is a special case of parse-only)
>
> (9) unparse to canonical form
>
> (10) unparse preserving data bit-for-bit from parse.
>
>
>
> W.r.t. (9) and (10) here, Cyberians tend to get pretty hung up on this
> one. If a schema must unparse, the question arises of what is the
> "fidelity" with which it reproduces data coming from a parse. An important
> decision is does the unparsed output have to be bit-for-bit identical to
> the input, or is the data being converted to a canonical form? Canonical
> form is usually an easier and more natural schema to write. Getting
> bit-for-bit identical data out is harder, and often quite awkward.
>
>
>
> The classic case is text numbers and leading zeros. That is, parsing
> "00001" but unparsing to a canonical form as "1" is natural, but breaks
> bit-for-bit fidelity of parse/unparse. Besides being easier and more
> natural, canonical form blocks a covert channel that hides information
> in the leading zeros.
>
>
>
> Trying to get bit-for-bit identical output requires doing unnatural things
> like treating numbers as strings, capturing what characters are escaped
> unnecessarily in strings, whether alternative/redundant delimiters are
> present or not, etc. It's quite challenging actually.
>
>
>
> -mikeb
>
>
>
> On Tue, Oct 17, 2023 at 6:59 AM Roger L Costello <coste...@mitre.org>
> wrote:
>
> Hi Folks,
>
> Here are 6 ways to use DFDL:
>
> 1.      Use DFDL to process any well-formed input. All data (even invalid
> data) is okay as long as it is well-formed.
> 2.      Same as #1 except after generating the XML, validate the XML,
> i.e., run Daffodil with the validate flag.
> 3.      If erroneous data is detected, put the data into an <invalid>
> element. Continue parsing. No errors raised during parsing. Need to examine
> the XML that is output for the presence of <invalid> elements.
> 4.      If erroneous data is detected, throw an error, and stop parsing
> immediately. No output is generated.
> 5.      Same as #4 except don't stop parsing. The output contains the
> erroneous data.
> 6.      If erroneous data is detected, throw an error, drop the data,
> continue parsing. The output contains only valid data.
>
> None of these ways are "the right way to use DFDL". Which is the "right
> way" depends on your situation and requirements.
>
> Do you agree with the above? In my list of "ways to use DFDL" have I
> missed any ways?
>
> /Roger
>
>
