Re: DFDL and XProc

Steve Lawrence Mon, 19 Aug 2024 05:53:59 -0700

Comments inline blue with SL...



> I expect parameters will map to Daffodil variables. How this mapping
occurs will be implementation-defined.

Daffodil also has another kind of parameter called "tunables". I imagine
there could be an implementation defined way to define a variable vs a
tunable via the parameters option. For example, maybe parameters in the
"daf" namespace are used as Daffodil tunables, and parameters in the dfdl
or other namespaces are used as variables.
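
As a rough illustration of that idea (a sketch only, not how any XProc processor or Daffodil actually does it), a step implementation could split the incoming parameters by namespace. The namespace URI, the function name, and the parameter names below are placeholders invented for the example:

```python
# Placeholder namespace URI standing in for the "daf" prefix; this value is
# an assumption for the sketch, not taken from any specification.
DAF_NS = "urn:example:daf"


def split_parameters(params):
    """Split {(namespace, local_name): value} into (tunables, variables).

    Parameters in the assumed "daf" namespace become tunables keyed by local
    name; everything else is kept as a variable keyed by its full QName.
    """
    tunables = {}
    variables = {}
    for (namespace, local_name), value in params.items():
        if namespace == DAF_NS:
            tunables[local_name] = value
        else:
            variables[(namespace, local_name)] = value
    return tunables, variables


# Hypothetical parameter names, for illustration only.
example = {
    (DAF_NS, "someTunable"): "1024",
    ("urn:example:fmt", "byteOrder"): "bigEndian",
}
tunables, variables = split_parameters(example)
```

Variables keep their namespace in the key because DFDL variables are QNames, whereas tunables are looked up by simple name in this sketch.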

JND: On the question of tunables, processors like MorganaXProc allow for
configuration files
<https://www.xml-project.com/manual/ch02.html#configuration_s1_1_s2_4> to be
passed at runtime. A parameter such as –dfdl-config could be provided to
fill these values.

SL: Makes sense and seems perfectly reasonable. The DFDL config files allow
passing in both variables and tunables.

> fail-on-error indicates whether or not processing should continue if a
recoverable error is encountered. I’m not sure what this would be.

Daffodil has a few different kinds of potentially recoverable errors:

1) Daffodil can perform optional validation, either "limited" which is
implemented by Daffodil as it parses, or "full" via Xerces at the end of a
parse. While both show up as errors in the Daffodil API, they can be
differentiated as "Parse Errors" vs "Validation Errors". Parse errors are
always fatal since you do not get an infoset, but you could consider
validation errors as either fatal or non-fatal. The Daffodil CLI currently
considers both kinds of errors as fatal, though it's not unreasonable for an
XProc implementation to consider validation errors as recoverable. That
said, I imagine most XProc pipelines wouldn't even enable Daffodil
validation, instead doing it as a separate step in the pipeline.

JND: Are these “validation” errors against the DFDL/XSD Schema (i.e.
confirming the validity of the schema) or the input? XProc already has an
optional validation step for XSD validation, p:validate-with-xml-schema
<https://spec.xproc.org/master/head/validation/#c.validate-with-xml-schema>,
which can use a variety of validators including Xerces.

We could add an output report port, which would contain the list of errors (if
any) in the XVRL (another new XProc 3.0 feature) vocabulary
<https://github.com/xproc/xvrl>. This is what other validation steps do.

But I want to understand your larger point about handling validation as a
separate step. What advantages does that entail? It appears that validation
is an optional DFDL feature. I worry about creating a separate step that
only some DFDL implementations could meet.

SL: Regarding validation as a separate step, I was suggesting something
like p:validate-with-xml-schema be used. I didn't mean to imply that we
need something like a p:dfdl-validate step.

The validation errors that Daffodil can create are normal XSD validation
checks based on the DFDL/XSD schema. They can be enabled or disabled, but
usually systems that already have XSD validation capabilities disable
Daffodil validation and use the built-in ones instead. I would maybe
suggest that the dfdl-parse step says something like "the p:dfdl-parse step
does not perform XSD validation; if it is needed then
p:validate-with-xml-schema should be used."

I think an output report could be useful, not only for the dfdl:assert, but
also for diagnostics when parse failures happen. When a parse fails, it's
very difficult to know why, and Daffodil tries to create helpful
diagnostics that make it more clear. Making these available in a consistent
report seems like a good idea. Though, can XVRL reports be used for
diagnostics about input failures (e.g. "failed to find delimiter in the
data"), or are they fairly specific to validation diagnostics?

2) DFDL allows for assertions while parsing (dfdl:assert). Normally a
failed assertion causes backtracking while speculatively parsing. You can
specify that an assertion is "recoverable", which just means it does not
backtrack and parsing will continue as if the error didn't occur, and
the error is treated like a validation error (see above). A non-recoverable
assertion causes backtracking, which could lead to any outcome (e.g. fatal
parse error, successful infoset, etc).

3) It is possible for Daffodil to successfully parse but not consume all
the input data. In this case there are no errors reported by the Daffodil
API and you do get an infoset. But most people expect this to be an error,
so the CLI and Daffodil NiFi implementations both consider it fatal. In
theory one could instead consider this a recoverable error, or not an
error at all.

I guess the "fail-on-error" property lets implementations define which of
these failures (and maybe others I've forgotten) can still allow pipeline
processing as long as an infoset is created?
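
To make that question concrete, here is one possible policy written out as a sketch. It assumes that a missing infoset is always fatal and that validation errors and leftover data are the only "recoverable" conditions; the type, field names, and return values are all invented for the example, and an implementation could reasonably choose differently:

```python
from dataclasses import dataclass, field


@dataclass
class ParseOutcome:
    """Hypothetical summary of a parse result (not a Daffodil API type)."""
    infoset: object = None            # None when the parse failed
    validation_errors: list = field(default_factory=list)
    leftover_bits: int = 0            # unconsumed input after a successful parse


def decide(outcome, fail_on_error):
    """Return 'fail', 'continue', or 'continue-with-report'."""
    # Without an infoset there is nothing to pass downstream, so this
    # sketch always fails regardless of fail-on-error.
    if outcome.infoset is None:
        return "fail"
    # Validation errors (including recoverable assertions) and leftover
    # data are treated as recoverable in this sketch.
    recoverable = bool(outcome.validation_errors) or outcome.leftover_bits > 0
    if recoverable and fail_on_error:
        return "fail"
    return "continue-with-report" if recoverable else "continue"
```

Under this reading, fail-on-error only changes behavior when an infoset exists but something recoverable was reported alongside it.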

JND: XProc has a try/catch <https://spec.xproc.org/3.0/xproc/#p.try>
structure. However, if fail-on-error was set then the output would be lost.
Should we leave it as implementation-defined what is a recoverable vs
unrecoverable error?

SL: Yeah, I think it makes sense to allow implementations to define what is
recoverable or not and how to handle fail-on-error. Though, it's still not
entirely clear to me how fail-on-error works if set to true vs false. Most
Daffodil errors are not recoverable (e.g. you don't get an infoset), so if
fail-on-error is false but there is no infoset what should the dfdl-parse
step do?

> There is no explicit support for parser files. I assume these are
proprietary representations to Daffodil and cannot interoperate with other
DFDL implementations.

We are finding more and more that parser files (which as you assume are
specific to Daffodil) are very useful. Some schemas take a relatively long
time to compile so this can drastically reduce startup time if they are
pre-compiled. It also makes distributing large complex schemas much easier.

Would it be possible to change the content-type of the "schema" property to
"any" with a restriction that implementations must support the "xml"
content-type, and that support for other content types is
implementation-defined? Or are implementations free to ignore the content
type and support non-standard content types if they wanted?

JND: This would require the XProc to be able to “detect” whether the input is
an XSD or a parser file (whose structure I do know). If you think this is
practical let me know and I’ll relax the content types.

SL: The first handful of bytes in a Daffodil pre-compiled parser is the
string "DAFFODIL", so an implementation could check the first handful of
bytes to differentiate an XSD from a parser. I think it also feels fine to
require the content-type to be XML (I imagine most DFDL implementations
don't have a concept of a pre-compiled parser), as long as ignoring the
content type is something implementations sometimes do. If that's not
normal, I think it would be helpful to relax the restriction.
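
A minimal sketch of that check, assuming the marker is the ASCII bytes of "DAFFODIL" at offset zero (the function name is made up for the example):

```python
# Leading marker described above; ASCII encoding is an assumption of this sketch.
SAVED_PARSER_MAGIC = b"DAFFODIL"


def looks_like_saved_parser(data: bytes) -> bool:
    """Return True if the input starts with the saved-parser marker."""
    return data[:len(SAVED_PARSER_MAGIC)] == SAVED_PARSER_MAGIC
```

Anything that does not start with the marker would be handed to the normal XSD path, which then reports its own error if the input is not a schema either.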

One other thought is that Daffodil has a concept of plugins, which are just
specially crafted jars put on the classpath. I imagine providing these jars
is outside the scope of the step, and how they end up on the classpath is
implementation-defined.

JND: Generally processors such as MorganaXProc will make available any jars
on their classpath.

SL: Makes sense.
