Re: DFDL and XProc

Steve Lawrence Fri, 16 Aug 2024 10:10:18 -0700

Looks great to me! I like the improvements over the XProc 1.0 step.


Some minor comments/clarifications:

> I expect parameters will map to Daffodil variables. How this mapping occurs will be implementation-defined.

Daffodil also has another kind of parameter called "tunables". I imagine there could be an implementation-defined way to define a variable vs a tunable via the parameters option. For example, maybe parameters in the "daf" namespace are used as Daffodil tunables, and parameters in the dfdl or other namespaces are used as variables.
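
As a rough illustration of that namespace-based split, here is a minimal sketch. It is not Daffodil or XProc API code: the namespace URI, the QName representation as a (namespace, local name) tuple, and the function name are all assumptions made up for the example.

```python
from typing import Any, Dict, Tuple

# Hypothetical namespace URI for illustration only; the thread mentions a
# "daf" namespace but does not give its URI.
DAF_NS = "urn:example:daf"

# A QName modeled as (namespace URI, local name). This is an assumption,
# not how any particular XProc processor represents QNames.
QName = Tuple[str, str]


def split_parameters(
    params: Dict[QName, Any],
) -> Tuple[Dict[str, Any], Dict[QName, Any]]:
    """Split the step's parameters map into (tunables, variables)."""
    tunables: Dict[str, Any] = {}
    variables: Dict[QName, Any] = {}
    for (ns, local), value in params.items():
        if ns == DAF_NS:
            # Parameters in the "daf" namespace are treated as tunables.
            tunables[local] = value
        else:
            # Everything else is passed through as a DFDL variable.
            variables[(ns, local)] = value
    return tunables, variables
```

An implementation could just as well use a separate option for tunables; the point is only that the split can be decided from the parameter name alone.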

> fail-on-error indicates whether or not processing should continue if a recoverable error is encountered. I’m not sure what this would be.


Daffodil has a few different kinds of potentially recoverable errors:

1) Daffodil can perform optional validation, either "limited", which is implemented by Daffodil as it parses, or "full", via Xerces at the end of a parse. While these show up as errors in the Daffodil API, "Validation Errors" can be differentiated from "Parse Errors". Parse errors are always fatal since you do not get an infoset, but you could consider validation errors as either fatal or non-fatal. The Daffodil CLI currently considers both kinds of errors as fatal, though it's not unreasonable for an XProc implementation to consider validation errors as recoverable. That said, I imagine most XProc pipelines wouldn't even enable Daffodil validation, instead doing it as a separate step in the pipeline.

2) DFDL allows for assertions while parsing (dfdl:assert). Normally a failed assertion causes backtracking while speculatively parsing. You can specify that an assertion is "recoverable", which just means it does not backtrack and parsing will continue as if the error didn't occur, and the error is treated like a validation error (see above). A non-recoverable assertion causes backtracking, which could lead to any outcome (e.g. fatal parse error, successful infoset, etc).

3) It is possible for Daffodil to successfully parse but not consume all the input data. In this case there are no errors reported by the Daffodil API and you do get an infoset. Most people expect this to be an error, so the CLI and Daffodil NiFi implementations both consider it fatal, but in theory one could consider it a recoverable error, or not an error at all.

I guess the "fail-on-error" property lets implementations define which of these failures (and maybe others I've forgotten) can still allow pipeline processing as long as an infoset is created?
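
To make the three cases concrete, here is a minimal sketch of one policy an implementation might choose. `ParseOutcome` and `should_fail` are hypothetical names invented for this example, not Daffodil API types, and treating validation errors and leftover data as "recoverable" is just one possible choice.

```python
from dataclasses import dataclass


@dataclass
class ParseOutcome:
    """Hypothetical summary of one parse; not a Daffodil API type."""
    has_infoset: bool
    parse_error: bool = False
    # Covers validation errors and failed "recoverable" dfdl:assert checks,
    # which are treated like validation errors.
    validation_error: bool = False
    # True when the parse succeeded but did not consume all input data.
    leftover_data: bool = False


def should_fail(outcome: ParseOutcome, fail_on_error: bool) -> bool:
    """Decide whether the step should raise an error for this outcome."""
    # Parse errors are always fatal: there is no infoset to pass on.
    if outcome.parse_error or not outcome.has_infoset:
        return True
    # Assumed policy: these two conditions count as recoverable.
    recoverable = outcome.validation_error or outcome.leftover_data
    return fail_on_error and recoverable
```

With fail-on-error set to false, this policy lets the pipeline continue whenever an infoset exists; with it set to true, it behaves roughly like the CLI and fails on any of the three conditions.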

> There is no explicit support for parser files. I assume these are proprietary representations to Daffodil and cannot interoperate with other DFDL implementations.

We are finding more and more that parser files (which as you assume are specific to Daffodil) are very useful. Some schemas take a relatively long time to compile so this can drastically reduce startup time if they are pre-compiled. It also makes distributing large complex schemas much easier.

Would it be possible to change the content-type of the "schema" property to "any" with a restriction that implementations must support the "xml" content-type, and that support for other content types is implementation-defined? Or are implementations free to ignore the content type and support non-standard content types if they want?
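
A minimal sketch of what that rule could look like on the implementation side: XML schemas are always accepted, anything else only if the implementation opts in. The function names, the return labels, and the way XML media types are recognized here are assumptions for illustration; nothing in the thread defines a media type for saved parser files.

```python
from typing import FrozenSet


def base_media_type(media_type: str) -> str:
    """Strip parameters such as '; charset=utf-8' and normalize case."""
    return media_type.split(";", 1)[0].strip().lower()


def is_xml_media_type(media_type: str) -> bool:
    """Simplified check for XML media types (an assumption of this sketch)."""
    base = base_media_type(media_type)
    return base in ("application/xml", "text/xml") or base.endswith("+xml")


def classify_schema(
    media_type: str,
    extra_supported: FrozenSet[str] = frozenset(),
) -> str:
    """Return how a document on the schema port would be handled."""
    if is_xml_media_type(media_type):
        # Required: every implementation accepts an XML DFDL schema.
        return "dfdl-schema"
    if base_media_type(media_type) in extra_supported:
        # Optional: e.g. a pre-compiled parser file, if the implementation
        # chooses to support it.
        return "implementation-defined"
    raise ValueError(f"unsupported schema content type: {media_type}")
```

An implementation that wants to accept saved parsers would pass whatever media type it decides to use for them in `extra_supported`; other implementations would reject that input with an error.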

One other thought is that Daffodil has a concept of plugins, which are just specially crafted jars put on the classpath. I imagine providing these jars is outside the scope of the step, and how they end up on the classpath is implementation-defined.



Thanks!
- Steve


On 2024-08-16 11:33 AM, John Dziurlaj wrote:

Hello,

I am working on the XProc 3.x specification for two steps, p:dfdl-parse and 
p:dfdl-unparse. The specification needs to be DFDL processor neutral as much as 
possible. I fully expect implementers of XProc 3.x will use Apache Daffodil 
since it is free and open source.

Here are some of the advantages of XProc 3.0 over XProc 1.0:
•       Greatly simplified syntax, addition of AVTs
•       Multiple documents may be output from a port (e.g. from Daffodil using 
-stream)
•       Supports inputs and outputs other than XML (e.g. JSON, binary)

The structure follows the earlier XProc spec, but with some modifications. The 
original XProc 1.0 step looked like this:

   <declare-step type="dfdl:parse">
       <input port="source" />
       <output port="result" />
       <option name="schema" required="true" />
       <option name="root" />          <!-- (QName) -->
    </declare-step>

The 3.x one currently looks like this:

<p:declare-step type="p:dfdl-parse">
   <p:input port="schema" content-types="xml"/>
   <p:input port="source" primary="true" content-types="any"/>
   <p:output port="result" sequence="true" content-types="any"/>
   <p:option name="parameters" as="map(xs:QName, item()*)?"/>
   <p:option name="fail-on-error" as="xs:boolean" select="true()"/>
   <p:option name="stream" as="xs:boolean" select="false()"/>
   <p:option name="root" as="xs:QName" />
</p:declare-step>

Some notes:

•       The result document may be of any content-type; users can pick which they 
want. DFDL does not specify required serialization outputs, but practically 
speaking most XProc users will want an XML infoset.
•       I expect parameters will map to Daffodil variables. How this mapping 
occurs will be implementation-defined.
•       stream is to control the -stream or -nostream parameter, by default 
-nostream
o       If stream is specified, multiple documents may be represented on the 
result port (see sequence="true")
•       root maps to the parameter of the same name; it must be formatted as 
an xs:QName
•       fail-on-error indicates whether or not processing should continue if a 
recoverable error is encountered. I’m not sure what this would be.
•       There is no explicit support for parser files. I assume these are 
proprietary representations to Daffodil and cannot interoperate with other DFDL 
implementations.
•       There is no separate p:parse-file step. XProc 3.0 supports conveying 
non-XML data over its ports.
•       It is possible, although implementation-defined, that an XProc 3.0 
processor will accept a Daffodil configuration file (i.e. an instance of 
dafext.xsd). For example, MorganaXProc currently accepts external configuration 
files for Saxon.
•       A PSVI should become available after a successful parse

A p:dfdl-unparse has not been sketched out but will likely look mostly the same.

On timing, the XProc group wants to get a new version of XProc out relatively 
soon, so I will need to put together a formal proposal fairly quickly. Any 
feedback is greatly appreciated!

Regards,

John Dziurlaj

-----Original Message-----
From: Steve Lawrence <slawre...@apache.org>
Sent: Wednesday, August 14, 2024 7:26 AM
To: users@daffodil.apache.org
Subject: Re: DFDL and XProc

That sounds great! If you need any help creating or reviewing the proposal let 
us know. We'd be happy to lend a hand.

On 2024-08-13 12:22 PM, John Dziurlaj wrote:

I am a heavy user of XProc 3.0. DFDL has an XProc step implementation,
but it’s for the XProc 1.0 version of Calabash. The XProc people have
a GitHub <https://github.com/xproc/3.0-steps/issues> repository where
interested parties can create proposals for implementation (via
issues). I am happy to create a proposal, likely based off the
existing Calabash one, but with some modifications that make it more idiomatic 
for XProc 3.0.

Because Apache Daffodil comes with EXI, I may write an EXI parsing step as well.

Regards,

John Dziurłaj /d͡ʑurwaj/

Sr. Solutions Architect, The Turnout

e: john@turnout.rocks <mailto:john@turnout.rocks>

s: +1 (330) 714-8935
x: @dziurlaj
work hours: 7am-3pm ET

http://turnout.rocks <http://turnout.rocks/>

@turnoutrocks
