RE: DFDL and XProc

John Dziurlaj Sat, 17 Aug 2024 04:52:59 -0700

Hi Steve,

See my comments inline prefixed with JND.

> I expect parameters will map to Daffodil variables. How this mapping occurs
> will be implementation-defined.

Daffodil also has another kind of parameter called "tunables". I imagine there
could be an implementation defined way to define a variable vs a tunable via
the parameters option. For example, maybe parameters in the "daf" namespace are
used as Daffodil tunables, and parameters in the dfdl or other namespaces are
used as variables.
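
To illustrate the namespace idea above, here is a minimal sketch of how an
implementation might split the parameters option into tunables and variables.
Everything in it is an assumption for illustration: the namespace URI is a
placeholder standing in for "daf", and split_parameters and the parameter names
are hypothetical, not part of Daffodil or XProc.

```python
# Placeholder URI standing in for the "daf" prefix (an assumption, not the real one).
DAF_NS = "urn:example:daf"


def split_parameters(params):
    """Split {(namespace, local_name): value} into (tunables, variables).

    Parameters in the placeholder "daf" namespace are treated as tunables;
    everything else is treated as a DFDL variable binding.
    """
    tunables, variables = {}, {}
    for (ns, name), value in params.items():
        if ns == DAF_NS:
            tunables[name] = value          # tunables keyed by local name only
        else:
            variables[(ns, name)] = value   # variables keep their full QName
    return tunables, variables


# Hypothetical parameter names, for illustration only.
params = {
    (DAF_NS, "someTunable"): "10",
    ("urn:example:fmt", "version"): "2",
}
tunables, variables = split_parameters(params)
```

How the two resulting maps are handed to the DFDL processor would remain
implementation-defined.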

JND: On the question of tunables, processors like MorganaXProc allow
configuration files<https://www.xml-project.com/manual/ch02.html#configuration_s1_1_s2_4>
to be passed at runtime. A parameter such as –dfdl-config could be provided to
fill these values.

> fail-on-error indicates whether or not processing should continue if a
> recoverable error is encountered. I’m not sure what this would be.

Daffodil has a few different kinds of potentially recoverable errors:

1) Daffodil can perform optional validation, either "limited" which is
implemented by Daffodil as it parses, or "full" via Xerces at the end of a
parse. While these show up as errors in the Daffodil API, they can be
differentiated as "Parse Errors" vs "Validation Errors". Parse errors are
always fatal since you do not get an infoset, but you could consider validation
errors as either fatal or non-fatal. The Daffodil CLI currently considers both
kinds of errors as fatal, though it's not unreasonable for an XProc
implementation to consider validation errors as recoverable. That said, I
imagine most XProc pipelines wouldn't even enable Daffodil validation, instead
doing it as a separate step in the pipeline.

JND: Are these “validation” errors against the DFDL/XSD Schema (i.e. confirming
the validity of the schema) or the input? XProc already has an optional
validation step for XSD validation,
p:validate-with-xml-schema<https://spec.xproc.org/master/head/validation/#c.validate-with-xml-schema>,
which can use a variety of validators including Xerces.

We could add an output report port which would contain the list of errors (if
any) in the XVRL (another new XProc 3.0 feature)
vocabulary<https://github.com/xproc/xvrl>. This is what other validation steps
do.

But I want to understand your larger point about handling validation as a
separate step. What advantages does that entail? It appears that validation is
an optional DFDL feature. I worry about creating a separate step that only some
DFDL implementations could meet.

2) DFDL allows for assertions while parsing (dfdl:assert). Normally a failed
assertion causes backtracking while speculatively parsing. You can specify that
an assertion is "recoverable", which just means it does not backtrack and
parsing will continue as if the error didn't occur, and the error is
treated like a validation error (see above). A non-recoverable assertion causes
backtracking, which could lead to any outcome (e.g. fatal parse error,
successful infoset, etc).

3) It is possible for Daffodil to successfully parse but not consume all the
input data. In this case there are no errors reported by the Daffodil API and
you do get an infoset. Most people expect this to be an error, so the CLI and
Daffodil NiFi implementations both consider it fatal, but in theory one could
consider it a recoverable error, or not an error at all.

I guess the "fail-on-error" property lets implementations define which of these
failures (and maybe others I've forgotten) can still allow pipeline processing
as long as an infoset is created?
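
One way to picture the three cases above together with fail-on-error is the
rough sketch below. It is not Daffodil's API: the Diagnostic categories and
step_fails are hypothetical names, and the choice to treat every non-parse
diagnostic as recoverable is an assumption an implementation could make
differently.

```python
from enum import Enum, auto


class Diagnostic(Enum):
    """Hypothetical categories an XProc step might assign to DFDL outcomes."""
    PARSE_ERROR = auto()         # no infoset is produced
    VALIDATION_ERROR = auto()    # "limited" or "full" validation failed
    RECOVERABLE_ASSERT = auto()  # failed dfdl:assert marked recoverable
    LEFT_OVER_DATA = auto()      # parse succeeded but input was not fully consumed


def step_fails(diagnostics, fail_on_error):
    """Return True if the step should fail instead of emitting the infoset."""
    if Diagnostic.PARSE_ERROR in diagnostics:
        return True  # always fatal: there is no infoset to emit
    # Assumption: everything else is recoverable unless fail-on-error is set.
    return bool(fail_on_error and diagnostics)


# A validation error alone only stops the pipeline when fail-on-error is true.
lenient = step_fails({Diagnostic.VALIDATION_ERROR}, fail_on_error=False)
strict = step_fails({Diagnostic.VALIDATION_ERROR}, fail_on_error=True)
```

Note that left-over data is not reported as an error by the Daffodil API, so in
this sketch the step itself would have to detect it and add the diagnostic.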

JND: XProc has a try/catch<https://spec.xproc.org/3.0/xproc/#p.try> structure.
However, if fail-on-error was set then the output would be lost. Should we
leave it as implementation-defined what is a recoverable vs unrecoverable error?

> There is no explicit support for parser files. I assume these are proprietary
> representations to Daffodil and cannot interoperate with other DFDL
> implementations.

We are finding more and more that parser files (which as you assume are
specific to Daffodil) are very useful. Some schemas take a relatively long time
to compile so this can drastically reduce startup time if they are
pre-compiled. It also makes distributing large complex schemas much easier.

Would it be possible to change the content-type of the "schema" property to
"any" with a restriction that implementations must support "xml" content-type,
and support for other content types is implementation-defined? Or are
implementations free to ignore the content and support non-standard content
types if they wanted?

JND: This would require the XProc processor to be able to “detect” whether the
input is an XSD or a parser file (whose structure I do know). If you think this
is practical let me know and I’ll relax the content types.
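
A minimal sketch of what such detection could look like, assuming the step only
needs to tell "this is an XSD" apart from "anything else". It makes no
assumption about the layout of a parser file: whatever is not a well-formed XML
document with an xs:schema root is simply treated as opaque,
implementation-specific content. The function name looks_like_xsd is
hypothetical.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"


def looks_like_xsd(data: bytes) -> bool:
    """Return True if data is well-formed XML whose root element is xs:schema."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return False  # not XML at all, e.g. a binary, implementation-specific file
    return root.tag == f"{{{XSD_NS}}}schema"


schema_doc = b'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"/>'
is_schema = looks_like_xsd(schema_doc)
```

In practice the document's declared content-type would probably be checked
first, with sniffing like this only as a fallback.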

One other thought is Daffodil has a concept of plugins, which are just
specially crafted jars put on the classpath. I imagine providing these jars is
outside the scope of the step and is implementation defined how they end up on
the classpath.

JND: Generally, processors such as MorganaXProc will make available any jars on
their classpath.
