Tom,

Any data format that is popular gets encapsulated and carried around in other 
data formats. The data game has a long history of this. E.g., 
you just want to aggregate multiple different pieces of data about a particular 
event together in a common data structure, but you have a constraint on the 
format you must use for this aggregate.



So generically,

  *   You are using Daffodil to convert data from a native format (e.g., binary) into format X.
  *   Typically format X is textual, but not necessarily.
  *   The native data also contains data that is already in format X.
  *   Many use cases will want the result to be format X, not format X with 
embedded, escaped format X pieces.

Hence merging the translated with the encapsulated pieces is a natural need.
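
To make the difference concrete, here is a minimal sketch that takes JSON as format X. It is not Daffodil code; the record, the field names, and the merge_embedded helper are all made up for illustration. It only shows what "format X with embedded escaped format X" looks like next to the merged result.

```python
import json

# Hypothetical record as it might come out of a native-to-JSON conversion.
# The "payload" field holds text that is itself already JSON.
translated = {
    "msgId": 7,
    "payload": '{"sensor": "a1", "reading": 3.5}',
}


def merge_embedded(record, field):
    """Return a copy of record in which record[field], a JSON string,
    is replaced by the JSON value it encodes."""
    merged = dict(record)
    merged[field] = json.loads(record[field])
    return merged


# Without merging, the payload is carried as an escaped string.
escaped = json.dumps(translated)

# With merging, the payload becomes ordinary nested JSON.
merged = json.dumps(merge_embedded(translated, "payload"))

print(escaped)
print(merged)
```

In the first printed line the inner quotes are backslash-escaped, so a consumer has to parse twice. In the second the payload is ordinary nested JSON.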



Format X could be XML, JSON, EXI (binary XML), S-expressions, SISL, or other 
things.



Daffodil has a built-in validation module. When Format X is XML, that module 
would not be able to use the DFDL schema to validate the merged "Format X" 
result, but that is a corner case specific to XML. If this really became 
important, we could add a feature that lets validation use a different XML 
schema than the DFDL schema. This is already needed if you want the validation 
to include things like key/unique constraints, which are not allowed to appear 
in a DFDL schema. The feature is also almost already there: schematron 
validation can use a separate schematron .sch file for its validation rules. 
So making the regular Xerces XML validator able to take a different XML schema 
for the validation seems like a small thing.


So I think this is a good generic capability to add to Daffodil.


We just need a motivated contributor to create it 🙂 (always recruiting new 
developers!)


-mike beckerle

________________________________
From: Steve Lawrence <slawre...@apache.org>
Sent: Wednesday, September 22, 2021 12:21 PM
To: users@daffodil.apache.org <users@daffodil.apache.org>
Subject: Re: Parsing formats with embedded XML -- recursion and/or layering required?

We actually recently added a feature that was intended to solve just
this problem of including XML payloads in the resulting infoset as XML
rather than as a string, though it requires a custom InfosetInputter and
InfosetOutputter that have not been written yet.

The proposal is here:

https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+Runtime+Properties

The idea is that your payload element is just a normal xs:string, and
you annotate it with a custom runtime property like
treatStringAsXML=true. Then you can write a custom InfosetOutputter that
uses this annotation and outputs the string as XML during parse, and a
custom InfosetInputter that converts that XML back to a string during
unparse.
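
The two directions can be sketched outside of Daffodil as a pair of tree transformations. This is only an illustration of the idea, written in Python with the standard library. It does not use the Daffodil InfosetInputter/InfosetOutputter API, and the TREAT_AS_XML set, the function names, and the sample message are hypothetical stand-ins for the elements annotated with treatStringAsXML=true.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for "elements whose declaration carries the
# treatStringAsXML=true runtime property".
TREAT_AS_XML = {"payload"}

# Infoset as it would look with the payload kept as an escaped string.
FLAT = (
    "<msg><id>7</id><payload>"
    "&lt;order id=\"42\"&gt;&lt;item&gt;widget&lt;/item&gt;&lt;/order&gt;"
    "</payload></msg>"
)


def on_parse(infoset):
    """Outputter direction: replace the string content of each marked
    element with the XML tree that the string encodes."""
    for elem in list(infoset.iter()):  # snapshot, since the tree is mutated
        if elem.tag in TREAT_AS_XML and elem.text:
            child = ET.fromstring(elem.text)
            elem.text = None
            elem.append(child)
    return infoset


def on_unparse(infoset):
    """Inputter direction: serialize the XML child of each marked element
    back into a plain string."""
    for elem in list(infoset.iter()):
        if elem.tag in TREAT_AS_XML and len(elem) == 1:
            child = elem[0]
            elem.remove(child)
            elem.text = ET.tostring(child, encoding="unicode")
    return infoset


merged = on_parse(ET.fromstring(FLAT))
print(ET.tostring(merged, encoding="unicode"))

restored = on_unparse(merged)
print(restored.find("payload").text)
```

Note that re-serializing the payload on unparse is not guaranteed to reproduce the original string byte for byte (whitespace, quoting, and attribute formatting may differ), which a real implementation would have to consider.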

The Example Implementation discusses this exact use case and gives an
idea of how one might implement the custom InfosetInputter/Outputter.
This example uses Scala XML Nodes for simplicity, but could be done with
the standard text inputter/outputters as well.

One thing to point out though is that to Daffodil and its internals,
this payload element is still a string. Daffodil has no knowledge about
what the InfosetInputter/Outputters are doing, so Daffodil cannot
reference the XML payload in DFDL expressions, or validate the XML
against a schema. For validation, you would need to pipe the resulting
infoset to some other tool with a modified schema that does not treat
this payload as a string.

Since this is the second time I've come across this requirement, it
might be worth considering if this will be a more common technique, and
if maybe we should add some built-in mechanism to DFDL, one that would
work with both DFDL expressions and validation...

- Steve

On 9/22/21 11:58 AM, Ballard, Tom - US wrote:
> All,
>
> I have a complex data format I am trying to implement a DFDL schema for, but
> don’t believe it’s possible without support for either recursion decomposition
> and/or layering.  The format in question has a subset of messages which
> consist of a binary “header” followed by an XML payload.  The messages begin
> with a handful of binary metadata fields, followed by a binary length field,
> and then an XML payload (which is the length indicated in length field).  In
> some cases there may be binary data subsequent to the XML payload as well.  I
> assume I can pull the XML payload in as an opaque string blob, but the problem
> is I also need to validate that XML against a schema.
>
> I know recursion and layering are on the project wish list, but is there a way
> to accomplish full parsing and validation of “hybrid” messages like I
> described without them?
>
> V/R,
>
> Tom Ballard
>
>
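
For reference, the hybrid layout described in the quoted message (binary metadata, a binary length field, an XML payload of that length, optional trailing binary) can be sketched in plain Python. The field widths, byte order, field names, and UTF-8 encoding below are assumptions made for illustration; the original message does not specify them. The XML check is for well-formedness only, and validating against a schema would need a separate tool.

```python
import struct
import xml.etree.ElementTree as ET

# Assumed layout (not from the original message): two big-endian uint16
# metadata fields followed by a big-endian uint32 payload length.
HEADER = struct.Struct(">HHI")


def split_message(data):
    """Split one message into (metadata, xml_text, trailing_bytes)."""
    if len(data) < HEADER.size:
        raise ValueError("message shorter than header")
    msg_type, version, length = HEADER.unpack_from(data, 0)
    start = HEADER.size
    end = start + length
    if end > len(data):
        raise ValueError("length field exceeds available data")
    xml_text = data[start:end].decode("utf-8")  # encoding is an assumption
    return (msg_type, version), xml_text, data[end:]


# Build a sample message: header, XML payload, two trailing binary bytes.
payload = b"<status><code>0</code></status>"
message = HEADER.pack(1, 2, len(payload)) + payload + b"\x00\xff"

meta, xml_text, trailing = split_message(message)

# Well-formedness check only; raises xml.etree.ElementTree.ParseError on
# malformed XML. It does not validate against any schema.
root = ET.fromstring(xml_text)
print(meta, root.tag, trailing)
```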
