RE: DFDL and XProc

John Dziurlaj Thu, 22 Aug 2024 05:08:13 -0700

I am working through writing the formal 
specification<https://github.com/theturnout/3.0-steps/blob/feature/dfdl-support/step-dfdl/src/main/xml/specification.xml>
 for the XProc DFDL steps and have a few more questions.



  *   It appears the DFDL specification does not specify any required infoset 
formats (outside of its internal representation).  I would imagine the existing 
DFDL implementations universally support XML. My preference is to make XML the 
default, but if it is possible for a conforming DFDL processor to not support 
XML, the best approach may be to add an “infoset” option that has no explicit 
default. It will be implementation-defined what the default output is.
  *   “parameters” are given by QName. I believe this is correct even though 
DFDL Sec 7.7 uses NCName; it appears you need the namespace to locate the 
correct variable.
  *   It is unclear if the parameters option needs to be marked as 
“implementation-defined”. DFDL spec supports variables but appears to be silent 
on whether those parameters can be provided outside the processor’s context.
  *   “stream” is also implementation-defined. It seems like a useful concept, 
and I imagine some DFDL schemas expect this feature to be available.

Once I get these questions answered, I will revise the specification and offer 
it to this group before issuing a PR with the XProc team.

Regards,

John Dziurlaj

From: Steve Lawrence <slawre...@apache.org>
Sent: Monday, August 19, 2024 8:54 AM
To: users@daffodil.apache.org
Subject: Re: DFDL and XProc


Comments inline blue with SL...



> I expect parameters will map to Daffodil variables. How this mapping occurs 
> will be implementation-defined.

Daffodil also has another kind of parameter called "tunables". I imagine there 
could be an implementation defined way to define a variable vs a tunable via 
the parameters option. For example, maybe parameters in the "daf" namespace are 
used as Daffodil tunables, and parameters in the dfdl or other namespaces are 
used as variables.
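
A minimal sketch of that namespace-based split, assuming parameters arrive as a 
map keyed by Clark-notation QNames ("{namespace}local"). The DAF_NS value and the 
split_parameters helper are hypothetical placeholders for illustration, not part 
of Daffodil or XProc.

```python
# Placeholder namespace URI standing in for the "daf" prefix (an assumption;
# substitute whatever namespace the implementation reserves for tunables).
DAF_NS = "urn:example:daf"


def split_parameters(parameters):
    """Split a {Clark-notation QName: value} map into (tunables, variables).

    Names in DAF_NS are treated as tunables and keyed by local name.
    Everything else is kept as a variable under its full QName, since the
    namespace is needed to locate the right variable.
    """
    tunables, variables = {}, {}
    for qname, value in parameters.items():
        if qname.startswith("{"):
            namespace, _, local = qname[1:].partition("}")
        else:
            namespace, local = "", qname
        if namespace == DAF_NS:
            tunables[local] = value
        else:
            variables[qname] = value
    return tunables, variables
```

Keeping variables under their full QName is a deliberate choice here: two schemas 
could declare variables with the same local name in different namespaces.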

JND: On the question of tunables, processors like MorganaXProc allow 
configuration 
files<https://www.xml-project.com/manual/ch02.html#configuration_s1_1_s2_4> to 
be passed at runtime. A parameter such as –dfdl-config could be provided to 
supply these values.

SL: Makes sense and seems perfectly reasonable. The DFDL config files allow 
passing in both variables and tunables.

> fail-on-error indicates whether or not processing should continue if a 
> recoverable error is encountered. I’m not sure what this would be.

Daffodil has a few different kinds of potentially recoverable errors:

1) Daffodil can perform optional validation, either "limited", which is 
implemented by Daffodil as it parses, or "full", via Xerces at the end of a 
parse. While these show up as errors in the Daffodil API, they can be 
differentiated as "Parse Errors" vs "Validation Errors". Parse errors are 
always fatal since you do not get an infoset, but you could consider validation 
errors as either fatal or non-fatal. The Daffodil CLI currently considers both 
kinds of errors as fatal, though it's not unreasonable for an XProc 
implementation to consider validation errors as recoverable. That said, I 
imagine most XProc pipelines wouldn't even enable Daffodil validation, instead 
doing it as a separate step in the pipeline.

JND: Are these “validation” errors against the DFDL/XSD Schema (i.e. confirming 
the validity of the schema) or the input? XProc already has an optional 
validation step for XSD validation, 
p:validate-with-xml-schema<https://spec.xproc.org/master/head/validation/#c.validate-with-xml-schema>,
 which can use a variety of validators including Xerces.

We could add an output report port, which would contain the list of errors (if 
any) in the XVRL (another new XProc 3.0 feature) 
vocabulary<https://github.com/xproc/xvrl>. This is what other validation steps 
do.

But I want to understand your larger point about handling validation as a 
separate step. What advantages does that entail? It appears that validation is 
an optional DFDL feature. I worry about creating a separate step that only some 
DFDL implementations could meet.

SL: Regarding validation as a separate step, I was suggesting something like 
p:validate-with-xml-schema be used. I didn't mean to imply that we need 
something like a p:dfdl-validate step.

The validation errors that Daffodil can create are normal XSD validation checks 
based on the DFDL/XSD schema. They can be enabled or disabled, but usually 
systems that already have XSD validation capabilities disable Daffodil 
validation and use the built-in ones instead. I would maybe suggest that the 
dfdl-parse step says something like "the p:dfdl-parse step does not perform XSD 
validation; if it is needed then p:validate-with-xml-schema should be used."

I think an output report could be useful, not only for the dfdl:assert, but 
also for diagnostics when parse failures happen. When a parse fails, it's very 
difficult to know why, and Daffodil tries to create helpful diagnostics that 
make it more clear. Making these available in a consistent report seems like a 
good idea. Though, can XVRL reports be used for diagnostics about input 
failures (e.g. "failed to find delimiter in the data"), or are they fairly 
specific to validation diagnostics?

2) DFDL allows for assertions while parsing (dfdl:assert). Normally a failed 
assertion causes backtracking while speculatively parsing. You can specify that 
an assertion is "recoverable", which just means it does not backtrack and 
parsing will continue as if the error didn't occur, and the error is 
treated like a validation error (see above). A non-recoverable assertion causes 
backtracking, which could lead to any outcome (e.g. fatal parse error, 
successful infoset, etc).

3) It is possible for Daffodil to successfully parse but not consume all the 
input data. In this case there are no errors reported by the Daffodil API and 
you do get an infoset. Most people expect this to be an error, so the CLI and 
Daffodil NiFi implementations both treat it as fatal, though in theory one 
could consider it a recoverable error, or not an error at all.

I guess the "fail-on-error" property lets implementations define which of these 
failures (and maybe others I've forgotten) can still allow pipeline processing 
as long as an infoset is created?
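
One way to picture that reading of "fail-on-error" is the sketch below. It 
assumes a step always fails when no infoset was produced, and that validation 
errors, recoverable assertions and unconsumed input are the only recoverable 
conditions. The ParseOutcome record and step_should_fail function are 
hypothetical names for illustration, not Daffodil or XProc APIs, and a real 
implementation would be free to classify these differently.

```python
from dataclasses import dataclass


@dataclass
class ParseOutcome:
    """Hypothetical summary of one parse, for illustration only."""
    has_infoset: bool                # False means a fatal parse error
    validation_errors: int = 0       # "limited" or "full" validation failures
    recoverable_assertions: int = 0  # failed dfdl:assert marked recoverable
    leftover_bits: int = 0           # input left unconsumed after the parse


def step_should_fail(outcome, fail_on_error):
    """Decide whether the step raises an error (assumed policy, see lead-in)."""
    if not outcome.has_infoset:
        # Nothing to hand to the rest of the pipeline, so always fail.
        return True
    recoverable = (
        outcome.validation_errors > 0
        or outcome.recoverable_assertions > 0
        or outcome.leftover_bits > 0
    )
    # Recoverable problems only stop the pipeline when fail-on-error is set.
    return recoverable and fail_on_error
```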

JND: XProc has a try/catch<https://spec.xproc.org/3.0/xproc/#p.try> structure. 
However, if fail-on-error was set then the output would be lost. Should we 
leave it as implementation-defined what is a recoverable vs unrecoverable error?

SL: Yeah, I think it makes sense to allow implementations to define what is 
recoverable or not and how to handle fail-on-error. Though, it's still not 
entirely clear to me how fail-on-error works if set to true vs false. Most 
Daffodil errors are not recoverable (e.g. you don't get an infoset), so if 
fail-on-error is false but there is no infoset what should the dfdl-parse step 
do?

> There is no explicit support for parser files. I assume these are proprietary 
> representations to Daffodil and cannot interoperate with other DFDL 
> implementations.

We are finding more and more that parser files (which as you assume are 
specific to Daffodil) are very useful. Some schemas take a relatively long time 
to compile so this can drastically reduce startup time if they are 
pre-compiled. It also makes distributing large complex schemas much easier.

Would it be possible to change the content-type of the "schema" property to 
"any" with a restriction that implementations must support "xml" content-type, 
and support for other content types is implementation defined? Or are 
implementations free to ignore the content type and support non-standard 
content types if they want?

JND: This would require the XProc processor to be able to “detect” whether the 
input is an XSD or Parser file (whose structure I do know). If you think this 
is practical let me know and I’ll relax the content types.

SL: The first handful of bytes in a Daffodil pre-compiled parser is the string 
"DAFFODIL", so an implementation could check the first handful of bytes to 
differentiate XSD vs a parser. I think it also feels fine to require the 
content-type to be XML (I imagine most DFDL implementations don't have a 
concept of a pre-compiled parser), as long as ignoring the content type is 
something implementations sometimes do. If that's not normal, I think it would 
be helpful to relax the restriction.
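
A minimal sketch of that byte check, assuming only what is stated above: a 
saved parser begins with the ASCII string "DAFFODIL". The function name is 
hypothetical, and anything in the file header beyond that prefix is not 
examined.

```python
# Leading bytes of a Daffodil pre-compiled parser, as described in the thread.
DAFFODIL_MAGIC = b"DAFFODIL"


def looks_like_saved_parser(data: bytes) -> bool:
    """Return True if the input starts with the "DAFFODIL" marker.

    Anything else (for example an XSD starting with "<?xml" or "<xs:schema")
    is treated as a schema document and left for the XML parser to judge.
    """
    return data.startswith(DAFFODIL_MAGIC)
```

An implementation would only need to peek at the first eight bytes of the 
"schema" input before deciding whether to compile it or reload it.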

One other thought is Daffodil has a concept of plugins, which are just 
specially crafted jars put on the classpath. I imagine providing these jars is 
outside the scope of the step and is implementation defined how they end up on 
the classpath.

JND: Generally processors such as MorganaXProc will make available any jars on 
their classpath.

SL: Makes sense.
