Re: Proposal: Extensible DFDL; XSLT transformations for XML attribute support

Mike Beckerle Wed, 18 Dec 2024 08:06:18 -0800

Ok, so I understand your goal is to take a pre-existing XSD, designed with
no knowledge of the physical representation, and then annotate it so as to
populate it from a physical format. Let's ignore the unparsing problem for
now, as the parsing problem here is hard enough.


I want to constructively challenge this goal, and posit that the goal needs
refining.

Here's a problem that matches this goal.  My point is that I believe you
will look at this example problem and say "that's not what I meant",
leading to a refinement of the goal.

I have a pre-existing XSD that consists of a top-level element with three
child elements. Each child element contains a repeating array of a specific
kind of element having multiple simple-type children. Think of it as an XML
root element surrounding 3 tables of data. This is an XML representation of
a simple 3-table RDBMS.

Now, my native data is a hierarchical nest. There is a repeating top level
record, inside each of those records is a repeating sub-record, and inside
each of those is a repeating sub-sub-record. The top level, sub-record, and
sub-sub-record each contain some simple-type children. Each of these simple
type children corresponds to a column of one of the tables in the tabular
data schema.

The challenge: using only annotations on the tabular schema, efficiently
populate instances of that schema from the nested hierarchical data. The
simple children of the top-level native-format records become column values
in table 1. Sub-records within those populate table 2, each carrying a
generated key identifying the table 1 record it was contained within. The
sub-sub-records go into table 3, each carrying a generated key identifying
the table 2 record it was contained within.
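
For concreteness, the instance-level effect of this unnesting can be
sketched in Python. This is only an illustration of the shape mismatch, not
a DFDL mechanism; the dict layout and the names fields, subs, subsubs, id,
and parent_id are made up for the example.

```python
from itertools import count


def unnest(records):
    """Flatten a 3-level nest into three tables linked by generated keys."""
    table1, table2, table3 = [], [], []
    key1, key2 = count(1), count(1)  # generated keys for tables 1 and 2
    for rec in records:
        k1 = next(key1)
        table1.append({"id": k1, **rec["fields"]})
        for sub in rec["subs"]:
            k2 = next(key2)
            # Each table 2 row records which table 1 row contained it.
            table2.append({"id": k2, "parent_id": k1, **sub["fields"]})
            for subsub in sub["subsubs"]:
                # Each table 3 row records which table 2 row contained it.
                table3.append({"parent_id": k2, **subsub["fields"]})
    return table1, table2, table3


nested = [
    {"fields": {"a": 1}, "subs": [
        {"fields": {"b": 10},
         "subsubs": [{"fields": {"c": 100}}, {"fields": {"c": 101}}]},
        {"fields": {"b": 11}, "subsubs": []},
    ]},
    {"fields": {"a": 2}, "subs": [
        {"fields": {"b": 20}, "subsubs": [{"fields": {"c": 200}}]},
    ]},
]

t1, t2, t3 = unnest(nested)
```

Note that the loop nesting follows the native data, while the three output
lists follow the tabular schema. Nothing in the tabular schema alone says
where those loops, or the key generation, would go.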

This matches the goal of taking an existing XSD and annotating it to
populate it from the native format. But the schema we're populating, and
the shape of the data representation have nothing to do with each other.

Do you claim this should be possible and straightforward? Or is this level
of "unnesting" complexity not what was intended?

I claim this tabular schema does not provide the proper scaffolding onto
which to place the annotations that allow it to describe the nested
repeating sub-structures.

The point here: it's very hard to map data to some pre-defined XSD unless
the cardinalities and repeating sub-structures are 'similar' in some way
that we should be able to formally define.

That's the refinement I think is necessary for this idea to be feasible.
The "shapes" must match to some degree. I don't know exactly how to express
this concept of the shapes being compatible, but I think that needs to be
part of the requirements.

Moreover, I claim that to express such a transformation you need not only a
schema for the tables, but also a separate schema for the hierarchical nest,
and then a way to express the transformation between the two at the
schema-to-schema level (not at the instance level, which is what XSLT and
XQuery do).

Note that DFDL's goal has heretofore always been to solve only ONE part of
this problem, the description of the physical, in this case hierarchical
nested data. It has never been a goal for DFDL (that is, the DFDL workgroup
I mean) to create a transformation language as part of DFDL. Commercial
systems that embed DFDL include their own transformation language which is
used on the output from DFDL parsing to manipulate the data. Displacing
that transformation infrastructure was never a goal.

That said, DFDL can *almost* do this sort of transformation by way of its
hidden groups feature. As an example: this DFDL schema does a matrix
transpose from a pair of lists to a list of pairs:

    https://github.com/OpenDFDL/examples/tree/master/pairsTransform

It is not pretty, but it is a *schema-driven approach to XML transformation*
which is not XSLT, nor XQuery. See the README.md file there for context.
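
To be clear about what "transpose from a pair of lists to a list of pairs"
means at the data level, here is a tiny Python sketch. It shows only the
intended input and output shapes; it says nothing about how the DFDL schema
achieves this with hidden groups. The function name and the equal-length
check are my own assumptions for the illustration.

```python
def pairs_from_lists(xs, ys):
    """Turn ([x1, x2, ...], [y1, y2, ...]) into [(x1, y1), (x2, y2), ...]."""
    if len(xs) != len(ys):
        # Assumption for this sketch: both lists must have the same length.
        raise ValueError("lists must be the same length")
    return list(zip(xs, ys))


pairs = pairs_from_lists([1, 2, 3], ["a", "b", "c"])
```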










On Thu, Nov 7, 2024 at 9:38 PM Brutzman, Donald (Don) (CIV) <
brutz...@nps.edu> wrote:

> Hi Mike.  Thanks for reminder that DFDL supports EXI.  Here are more
> reactions, perhaps duplicative, searching for common understanding.
>
> The motivating reason we did these transformations is to take advantage of
> already-existing XML Schemas.  We want to use DFDL to map alternative
> regular/parsable datasets into XML conforming to existing schemas.  Use
> case: if the DFDL engineer can add DFDL markup to an already-existing XSD
> schema (which typically includes attributes) then they have general-purpose
> DFDL parse/unparse for that XML document type.
>
> The ordering of attributes in the target XML schema is not significant,
> since attribute ordering is not considered an information item in Post
> Schema Validation Infoset (PSVI).  Similarly, interior whitespace between
> attributes or elements (outside of CDATA sections) is also insignificant.
> Knowing precisely what information is NOT contained or implied in an XML
> document is helpful.
>
> Data and whitespace remain significant in any source dataset, according
> to the rules of that dataset.
>
> Ordering of attribute declarations within an XSD schema has no implication
> regarding target XML, but probably is a point of sensitivity when defining
> the DFDL markup to add so that the source dataset is parsed properly.
>
> We had DFDL markup in the attribute-aware XSD declarations for those
> attributes.  That might be simplistic and require more work, but it worked
> for us and seems general.
>
> For your <attributes d="d" e="e"/> example, if that is an intermediate
> form then it won't much matter, but it likely makes more sense to put them
> immediately following the parent <record> since that is what they belong
> to.  Following child elements might put them hundreds of child elements
> away, complicating context and coherence.
>
> So, a list of things we don't want to have to do:
>
>    - Require any XML design whatsoever, rather just add DFDL to
>    previously defined schema,
>    - Require any engineering at all except addition of DFDL to a copy of
>    an existing XSD schema.
>
> Hope it is clear and correct... Happy to have a meeting sometime to
> discuss if that helps.
>
> Thanks for considering the possibilities.
>
> all the best, Don
>
> --
>
> Don Brutzman  Naval Postgraduate School, Code USW/Br
> brutz...@nps.edu
>
> Watkins 270,  MOVES Institute, Monterey CA 93943-5000 USA
> +1.831.656.2149
>
> X3D graphics, virtual worlds, navy robotics
> https://faculty.nps.edu/brutzman
>
>
>
> ------------------------------
> *From:* Mike Beckerle <mbecke...@apache.org>
> *Sent:* Thursday, November 7, 2024 9:44 AM
> *To:* users@daffodil.apache.org <users@daffodil.apache.org>
> *Cc:* Claude Mamo <claude.m...@gmail.com>; Roger L Costello <
> coste...@mitre.org>; Norbraten, Terry (CIV) <tdnor...@nps.edu>; Blais,
> Curtis (Curt) (CIV) <clbl...@nps.edu>
> *Subject:* Re: Proposal: Extensible DFDL; XSLT transformations for XML
> attribute support
>
>
> If I understand this, I think you are solving the primary issue of
> attribute declarations not being in the right lexical position in the XSD
> by essentially requiring the schema author to add element tiers so that the
> declaration order matches the physical order.
>
> So for example: I have a record of data that looks like: (a, b, c), d, e,
> (f, g, h)
> I want a vector v1 of elements containing a, b, and c,
> then I want two attributes to hold d and e.
> Then I want a vector v2 of elements containing f, g, and h.
>
> Here's what I can't achieve because declarations for d and e attributes
> would be after the element declarations for v2.
>
> <record d="d" e="e">
>   <v1>a</v1><v1>b</v1><v1>c</v1>
>   <v2>f</v2><v2>g</v2><v2>h</v2>
> </record>
>
> But I can achieve this by introducing an element to hold the attributes d
> and e in the physical location between the two data vectors.
>
> <record>
>   <v1>a</v1><v1>b</v1><v1>c</v1>
>   <attributes d="d" e="e"/>
>   <v2>f</v2><v2>g</v2><v2>h</v2>
> </record>
>
> If this is acceptable then this is well worth considering further.
>
> Using XML attributes would place some constraints on the physical data
> that you can choose to put into attributes. E.g., data can't contain
> leading or trailing whitespace that needs to be preserved, or adjacent
> whitespace inside it which needs to be preserved, because of XML attribute
> whitespace collapsing. That or some sort of non-ordinary escaping may be
> required, such as using DFDL character entities.
>
> There is also the issue of the implied sequence surrounding attribute
> declarations. In my little example above, the d and e fields are comma
> separated. There would be no sequence group around their declarations. This
> can be finessed by letting dfdl:separator and related properties that we
> normally place on xs:sequence to also be placed on xs:complexType, and such
> properties would apply to both a sequence that is its model group and any
> declared attributes.
>
> I expect these sorts of issues could be worked out.
>
> re: EXI
>
> Daffodil supports EXI today. There is no need to further transform
> DFDL's output XML Infosets to take advantage of EXI's density.
>
>
>
> On Thu, Nov 7, 2024 at 9:00 AM Brutzman, Donald (Don) (CIV) <
> brutz...@nps.edu> wrote:
>
> [Apologies for delayed response, hiccup with our gitlab version control
> now fixed.  All related work should now be publicly visible and usable.]
>
> Mike, your capability request below sounds like an excellent match for the
> capabilities of our "DFDL Attribution" project, which took a pipeline
> approach to this long-standing challenge.
>
>    - Data Format Description Language (DFDL) "Attribution" Project
>    - https://gitlab.nps.edu/Savage/robodata/-/blob/master/DFDL/attribution/README.md
>    - This project is working to show that additional DFDL support for XML
>    attributes is feasible by using a "pipeline" approach to processing.
>    - Good initial progress has been made that allows use of an
>    attribute-aware XML schema. Pre- and post-processing XSLT stylesheets can
>    convert XML documents and schemas into equivalent element-only form that
>    DFDL can use to parse/unparse data documents.
>    - https://gitlab.nps.edu/Savage/robodata/-/raw/master/DFDL/attribution/images/DfdlXmlElementAttributeTransformations.png
>
> XSLT preprocessing and postprocessing of both intermediate and destination
> XML documents and XML XSD schema means that this approach can be used with
> any DFDL processor, either by tool builders or by data engineers.
>
> The essence of this approach is that XML attributes are converted into
> unique XML child elements.  This enables a DFDL parser to remain
> element-aware and attribute-unaware.   Here are some illustrations for the
> correspondences we chose... Pretty unambiguous, child-element names simply
> prefix an underscore to attribute name.
>
>    - https://gitlab.nps.edu/Savage/robodata/-/blob/master/DFDL/attribution/images/DocumentTransformationsSchemaView.png
>    - https://gitlab.nps.edu/Savage/robodata/-/blob/master/DFDL/attribution/images/DocumentTransformationsTreeView.png
>    - https://gitlab.nps.edu/Savage/robodata/-/blob/master/DFDL/attribution/images/DocumentTransformationsXmlView.png
>
> We have only tested it for one or two cases, but have designed it to be
> general.  The primary test case shown in the preceding diagrams is quite
> general.
>
> Further testing welcome.  There are plenty of achievable TODO items in the
> README page.
>
> (deep breath) Here is why such a capability is really important.  DFDL
> offers magnificent capabilities.  However you cannot go from (some
> arbitrary dataset) to (some existing XML format).  This is a major
> limitation on DFDL utility for mapping arbitrary regular data into widely
> used XML data forms.
>
> (second deep breath)  If DFDL pipelines can achieve full support for
> conversions to and from XML, they can also take full advantage of Efficient
> XML Interchange (EXI) compression.  The EXI Recommendation algorithms have
> been experimentally shown to meet or beat any other general compression
> scheme (for example ZIP and GZIP), and further offer significantly faster
> (and computationally efficient) data decompression.
>
> I personally believe that adding full support for DFDL for mapping to/from
> XML can eliminate a major barrier inhibiting more widespread DFDL
> employment.
>
> All questions and collaborative efforts welcome.   Very respectfully yours.
>
>
> all the best, Don
>
> --
>
> Don Brutzman  Naval Postgraduate School, Code USW/Br
> brutz...@nps.edu
>
> Watkins 270,  MOVES Institute, Monterey CA 93943-5000 USA
> +1.831.656.2149
>
> X3D graphics, virtual worlds, navy robotics
> https://faculty.nps.edu/brutzman
>
>
>
> ------------------------------
> *From:* Mike Beckerle <mbecke...@apache.org>
> *Sent:* Thursday, October 31, 2024 6:30 AM
> *To:* Brutzman, Donald (Don) (CIV) <brutz...@nps.edu>
> *Cc:* Claude Mamo <claude.m...@gmail.com>; Roger L Costello <
> coste...@mitre.org>; Norbraten, Terry (CIV) <tdnor...@nps.edu>; Blais,
> Curtis (Curt) (CIV) <clbl...@nps.edu>; users@daffodil.apache.org <
> users@daffodil.apache.org>
> *Subject:* Re: Proposal: Extensible DFDL
>
> Besides converting the XML instances to/from an attribute-centric form,
> there is the need for an XML Schema that describes that form.
>
> Converting a DFDL schema, or element-oriented XSD, into one which
> describes the attribute-oriented variant is non-trivial in the general
> case.
>
> Has anyone worked on tooling for that?
>
>
>
> On Wed, Oct 30, 2024 at 6:22 PM Brutzman, Donald (Don) (CIV) <
> brutz...@nps.edu> wrote:
>
> [cc: Curt]
>
> Thanks for updated status.
>
> Of relevant note is that our NPS team came up with a round-trip approach
> that converts arbitrary element-attribute XML to corresponding
> element-element XML, then back again.  XSLT is used in each direction.
> Online at
>
>    - https://gitlab.nps.edu/Savage/robodata/-/tree/master/DFDL/attribution
>
>    - https://gitlab.nps.edu/Savage/robodata/-/blob/master/DFDL/attribution/README.md
>
> However... am surprised to see that this is not public access.  My mistake
> - apologies for the inconvenience, we will carefully work towards releasing
> it.
>
>    - Terry, can we please review together (there are some other things
>    available in parent robodata project) to ensure that we can indeed go fully
>    public with the project.
>    - Attached please find advance copies of some of the screenshots.
>
> Summary description appears in pages 3..6 of the following white paper.
>
>    - Data Strategy for Unmanned Systems: Field Experimentation (FX),
>    Simulation and Analysis
>    - https://nps.edu/web/now/data-strategy-for-autonomous-systems
>    - https://nps.edu/documents/151816058/0/DataStrategyUnmannedSystemsTechnicalMemorandum2023January25.pdf
>
> As before: regardless of how complex the implementation of a DFDL
> processor might be, if this stylesheet is indeed general, then it might
> serve as a DFDL preprocessor/postprocessor for handling attribute-aware
> DFDL schema.
>
> Some additional thoughts:
>
>    - Perhaps DFDL parsing/unparsing of XML of a source document that
>    includes attributes might provide another angle on this problem.
>    - You won't catch me using ChatGPT but adding descriptions within DFDL
>    schema might further encourage automated translation.
>
> Mike and Roger, if a meeting discussing this topic might help, I can be
> available during second half of November.
>
> Very respectfully yours.
>
>
> all the best, Don
>
> --
>
> Don Brutzman  Naval Postgraduate School, Code USW/Br
> brutz...@nps.edu
>
> Watkins 270,  MOVES Institute, Monterey CA 93943-5000 USA
> +1.831.656.2149
>
> X3D graphics, virtual worlds, navy robotics
> https://faculty.nps.edu/brutzman
>
>
>
> ------------------------------
> *From:* Mike Beckerle <mbecke...@apache.org>
> *Sent:* Wednesday, October 30, 2024 7:08 AM
> *To:* users@daffodil.apache.org <users@daffodil.apache.org>
> *Cc:* Roger L Costello <coste...@mitre.org>; Norbraten, Terry (CIV) <
> tdnor...@nps.edu>; Brutzman, Donald (Don) (CIV) <brutz...@nps.edu>
> *Subject:* Re: Proposal: Extensible DFDL
>
> That proposal for XML attributes in DFDL has not been prototyped.
>
> I believe it is not ready - still only half baked. E.g., the implications
> of XML attributes' whitespace collapsing behavior are very problematic when
> using an XML attribute to logically represent data that physically does not
> conform. XML attributes are entirely unable to represent data that
> contains, for example, multiple adjacent space characters, or line-endings.
> If whitespace is significant, attributes won't work.
>
> Today there is XSLT and AI. E.g., ChatGPT seems to be able to write XSLT
> very well from XML snippets and a description or example of what you want
> out of the transformation. The whole burden of having to write symmetric
> transforms - one for parsing, the inverse for unparsing - is eliminated when
> ChatGPT writes them both for you.
>
>
>
>
>
>
>
>
>
> On Wed, Oct 30, 2024 at 4:58 AM Claude Mamo <claude.m...@gmail.com> wrote:
>
> Was there movement on creating attributes from DFDL? I found this
> https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+Extend+DFDL+with+XML+Attribute+Support
> but does someone know whether this will be available anytime soon?
>
> A bit of context. I have a scenario in EDI X12 where (1) the DFDL schema
> is very generic and (2) the segment ID needs to be an attribute in the
> parent element so that the XPath selectors in Smooks don't easily break
> when routing segments. Unfortunately, due to the streaming nature of
> Smooks, I can't use something like this for the selector:
> */interchange/segment[segmentId/text() = "GS"]*.
> The workaround so far is to use indexes (e.g., */interchange/segment[2]*)
> but this is bad for various reasons.
>
> Thanks,
>
> Claude
>
>
> On Thu, Nov 2, 2023 at 5:10 PM Brutzman, Donald (Don) (CIV) <
> brutz...@nps.edu> wrote:
>
> I think that the single most significant and powerful extension capability
> for DFDL would be to support attributes.
>
>
>
> XML Schema is highly extensible already and widely deployed.  JSON schema
> is pretty consistent and has similar expressive power - if ever finally
> standardized and consistently supported in tools, it might further broaden
> the available information-architecture infrastructure for many applications
> and much of the Web.
>
>
>
> The ability to align DFDL directly with any XML Schema, to support
> consistent mappings of diverse datasets with coherent data models, would be
> major increase in DFDL capability.
>
>
>
> p.s. long-held opinion:  skateboards and attributes are not a crime…  8)
>
>
>
> all the best, Don
>
> --
>
> Don Brutzman  Naval Postgraduate School, Code USW/Br
> brutz...@nps.edu
>
> Watkins 270,  MOVES Institute, Monterey CA 93943-5000 USA
> +1.831.656.2149
>
> X3D graphics, virtual worlds, navy robotics
> https://faculty.nps.edu/brutzman
>
>
>
> *From:* Mike Beckerle <mbecke...@apache.org>
> *Sent:* Thursday, November 2, 2023 8:27 AM
> *To:* users@daffodil.apache.org
> *Subject:* Re: Proposal: Extensible DFDL
>
>
>
> I think extensibility would be great for DFDL.
>
>
>
> The DFDL workgroup punted on this as there was no such thing as an
> extensible format description language to generalize into a standard.
>
> We realized that unparsing was already breaking a lot of new ground, but
> it was a must-have feature.
>
>
>
> So we had to draw a line somewhere on the number of untested new concepts
> in DFDL or it would never get done. It took 20 years as is to become
> standardized.
>
>
>
> Some format description languages may have been implemented this
> extensible way, but that was not a visible user feature in any one that I
> ever saw.
>
>
>
> As a research effort this is a good idea. Daffodil is available for use in
> prototyping if that's useful, and if it turns out to be valuable it could
> be proposed for inclusion in DFDL in the future.
>
>
>
> Some years ago I suggested this to someone as a thesis topic for a CS PhD
> project, but to my knowledge it didn't go anywhere.
>
>
>
>
>
> On Thu, Nov 2, 2023 at 10:20 AM Roger L Costello <coste...@mitre.org>
> wrote:
>
> Hi Folks,
>
> Consider this input containing a date time value:
>
> 20230926T124800Z
>
> We can design the DFDL, using the xs:dateTime datatype and associated DFDL
> calendar properties, so that parsing produces this XML:
>
> <DateTimeIso>2023-09-26T12:48:00+00:00</DateTimeIso>
>
> That is beautiful XML - concise and precise.
>
> Next, consider input containing a lat/long value:
>
> 2006N-05912E
>
> It would be excellent if we could design the DFDL so that parsing produces
> this:
>
> <OriginOfBearing>20°06′N 059°12′E</OriginOfBearing>
>
> That is also beautiful XML.
>
> In fact, it is possible to achieve this! By hiding the input and then
> performing a bunch of transformations using dfdl:inputValueCalc.
>
> However, that's a terrible approach because, as Mike Beckerle often says,
> "DFDL is not a transformation language!"
>
> If only we had a latlong datatype and associated DFDL latlong properties
> .....
>
> If only we could extend DFDL .......
>
> How about making DFDL extensible? How about allowing users of DFDL to
> create their own datatypes (actually, XSD already allows this) and allow
> users to create their own DFDL properties for the user-defined datatype?
>
> That is, how about turning DFDL into extensible DFDL?
>
> Thoughts?
>
> /Roger
>
>
