Re: How to generate an error for invalid data, discard the invalid data, and continue parsing?

Mike Beckerle Fri, 20 Oct 2023 08:49:26 -0700

Re: getting rid of the empty "line" element that appears because the data
for the line is invalid and so hidden.


Alas, I think you are stuck here.

There is dfdl:emptyElementParsePolicy="treatAsAbsent". But despite sounding
like the right thing for this case, this won't work for you.

The concept of "empty" in DFDL is defined relative to the data stream, not
the infoset. So hiding something from the infoset does not change the data
stream representation (which is not at all empty in this case).

In other words you can't skip over actual data by arranging for it to be
considered "empty" using hidden groups.

So, I think you are stuck.

Element "line" is an array. It can't be a choice of not-hidden/hidden and
also be an array, as arrays must be elements. Either the whole array is
inside a hidden group, or none of it is.

So being unable to drop the empty "line" element is another case of "*DFDL
is not a transformation language*". The data has an invalid thing in it.
You have to model that thing to parse it. You can hide the content of it or
carry it as an xs:hexBinary "blob", but you can't hide that it occupies an
array element.

While DFDL can do some transformations, those are only inadvertent side
effects of the things needed to express parse/unparse sensibly. There is no
goal for DFDL to be able to express general transformations, and I think
you are up against that limitation.
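
For what it's worth, if the empty element really has to go, that step can
live downstream of the parse rather than in the schema. A minimal sketch,
assuming the infoset is written out as the no-namespace XML shown in the
quoted message and that an XSLT 1.0 processor is on hand (the stylesheet is
illustrative and not part of Daffodil):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity rule: copy every node and attribute unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Drop any "line" element that has no child elements, i.e. the
       placeholder left behind when the hidden branch was taken.
       Assumes "line" is in no namespace, as in the quoted output. -->
  <xsl:template match="line[not(*)]"/>

</xsl:stylesheet>
```

The filtered document has nothing standing in for the invalid line, so
unparsing it should yield only the three valid lines.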

Frankly, I think the DFDL schema is better if it contains an explicit
element corresponding to a dropped invalid data part. It's there in the
data stream as an array element representation after all, so it's not
really DFDL's job to let you excise it from the infoset.

I prefer a schema that captures such invalid data into an "unrecognized"
element. That element can carry a dfdl:assert with
failureType="recoverableError" so that it always issues a parse-time
warning. I believe it can, however, unparse to nothing at all, thereby
sneakily achieving your transformation where a parse/unparse round trip
removes the invalid data.
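
A rough sketch of the "unrecognized" idea, adapted from the "line" element
in the quoted schema. Assumptions: the xs, dfdl, and fn prefixes are bound
as in that schema, the default format makes strings delimited so the
enclosing %NL; separator ends each line, and the element name
"unrecognized" and its message text are placeholders. The always-false
assert is just one way to force the warning, and the unparse-to-nothing
part is not shown:

```xml
<xs:element name="line" maxOccurs="unbounded">
  <xs:complexType>
    <xs:choice>
      <!-- Branch 1: a well-formed label-colon-message line. A bad label
           is a processing error, so the parser backtracks to branch 2. -->
      <xs:sequence dfdl:separator=":" dfdl:separatorPosition="infix">
        <xs:element name="label">
          <xs:annotation>
            <xs:appinfo source="http://www.ogf.org/dfdl/">
              <dfdl:assert testKind="expression"
                  test="{ dfdl:checkConstraints(.) }"
                  message="Invalid label"
                  failureType="processingError"/>
            </xs:appinfo>
          </xs:annotation>
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:enumeration value="Dear Sir"/>
              <xs:enumeration value="Dear Madam"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element name="message" type="xs:string"/>
      </xs:sequence>

      <!-- Branch 2: anything else is kept, visibly, as one string.
           The assert can never pass, so a recoverable error (warning)
           is issued every time this branch is taken, and parsing
           continues with the next line. -->
      <xs:element name="unrecognized" type="xs:string">
        <xs:annotation>
          <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:assert testKind="expression"
                test="{ fn:false() }"
                message="Unrecognized line"
                failureType="recoverableError"/>
          </xs:appinfo>
        </xs:annotation>
      </xs:element>
    </xs:choice>
  </xs:complexType>
</xs:element>
```

With this shape the third "line" would hold an "unrecognized" child
carrying the raw text, instead of being empty.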


On Fri, Oct 20, 2023 at 7:43 AM Roger L Costello <coste...@mitre.org> wrote:

> ** Update **
>
>
>
> I discovered a better strategy. One that doesn’t involve manually
> reconstructing each line. And it unparses.
>
>
>
> My new strategy is this:
>
>
>
> for each line:
>
>     there is a choice with two branches:
>
>    1. a sequence that describes a valid label-colon-message. The label
>    element has xs:enumeration facets to specify Dear Sir and Dear Madam. The
>    label element has an assert with checkConstraints, and
>    failureType="processingError" so that an erroneous label will force
>    backing up and using the other branch of the choice.
>    2. a call to the hidden group. Within that group, the label element has
>    the same xs:enumeration facets and an assert with checkConstraints, but
>    with failureType="recoverableError" so that an erroneous label will
>    generate an error but continue processing (cool!).
>
>
>
> Below I show the updated DFDL.
>
>
>
> With this input:
>
>
>
> Dear Sir: Thank you for your response
> Dear Madam: How are you
> Dear Foo: Have a good day
> Dear Sir: Nice work
>
>
>
> Parsing produces this XML:
>
>
>
> <test>
>   <line><label><value>Dear Sir</value></label><message><value> Thank you
> for your response</value></message></line>
>   <line><label><value>Dear Madam</value></label><message><value> How are
> you</value></message></line>
>   <line></line>
>   <line><label><value>Dear Sir</value></label><message><value> Nice work
> </value></message></line>
> </test>
>
>
>
> And generates this error for the invalid third line:
>
>
>
>       Validation Error: Invalid Label
>
>
>
> And unparsing produces this:
>
>
>
> Dear Sir: Thank you for your response
> Dear Madam: How are you
> Invalid Label: Elided Message
> Dear Sir: Nice work
>
>
>
> There is just one final touch needed to make this updated strategy
> perfect: Get rid of the empty line: <line></line>
>
>
>
> I seek your recommendation on how to do that.
>
>
>
> Here is my updated DFDL schema:
>
>
>
> <xs:element name="line" maxOccurs="unbounded">
>     <xs:complexType>
>         <xs:choice dfdl:choiceLengthKind="implicit">
>             <xs:sequence dfdl:separator=":" dfdl:separatorPosition="infix">
>                 <xs:element name="label">
>                     <xs:annotation>
>                         <xs:appinfo source="http://www.ogf.org/dfdl/">
>                             <dfdl:assert testKind="expression"
>                                 test="{dfdl:checkConstraints(.)}"
>                                 message="Invalid label"
>                                 failureType="processingError"/>
>                         </xs:appinfo>
>                     </xs:annotation>
>                     <xs:simpleType>
>                         <xs:restriction base="xs:string">
>                             <xs:enumeration value="Dear Sir"/>
>                             <xs:enumeration value="Dear Madam"/>
>                         </xs:restriction>
>                     </xs:simpleType>
>                 </xs:element>
>                 <xs:element name="message" type="xs:string" />
>             </xs:sequence>
>             <xs:sequence dfdl:hiddenGroupRef="hidden_label_message"/>
>         </xs:choice>
>     </xs:complexType>
> </xs:element>
>
>
>
>
>
> *From:* Roger L Costello <coste...@mitre.org>
> *Sent:* Thursday, October 19, 2023 3:56 AM
> *To:* users@daffodil.apache.org
> *Subject:* Re: How to generate an error for invalid data, discard the
> invalid data, and continue parsing?
>
>
>
> Hi Mike,
>
>
>
> I have been playing around with the ideas that you shared last week (esp.,
> failureType="recoverableError").
>
>
>
> I am now stuck and hope you will get me unstuck.
>
>
>
> I am trying to get my example working. The input consists of a series of
> label-colon-message lines. The label must be either Dear Sir or Dear Madam.
> Any other label is an error. Any line with an invalid label should result
> in generating an error message and the line must not appear in the output
> XML.  In the following input the third line has an invalid label:
>
>
>
> Dear Sir: Thank you for your response
> Dear Madam: How are you
> Dear Foo: Have a good day
> Dear Sir: Nice work
>
>
>
> My DFDL schema generates this XML:
>
>
>
> <test>
>   <line><label><value>Dear Sir</value></label><message><value> Thank you
> for your response</value></message></line>
>   <line><label><value>Dear Madam</value></label><message><value> How are
> you</value></message></line>
>   <line></line>
>   <line><label><value>Dear Sir</value></label><message><value> Nice work
> </value></message></line>
> </test>
>
>
>
> and this error message:
>
>
>
>       Validation Error: Invalid Label
>
>
>
> That is close to what I want. The third line contains an invalid label, so
> the XML does not contain it. However, notice that the XML contains an empty
> <line> element. How do I get rid of that empty <line> element? An error
> message is generated for the invalid data, which is what I want.
>
>
>
> Below is my DFDL schema. Here is the strategy it uses:
>
>
>
>    1. Iterate over each of the input lines.
>    2. Hide the input line, i.e., hide the label-colon-message using a
>    hidden group. In that hidden group, the label element has xs:enumeration
>    facets to specify Dear Sir and Dear Madam. The label element has an assert
>    with checkConstraints, and failureType="recoverableError" so that an
>    erroneous label will generate an error but continue processing (cool!).
>    3. Following the call to the hiddenGroup, I manually reconstruct the
>    <label> and <message> elements. For each of those elements, I check the
>    value of hidden_label and if that value is neither Dear Sir nor Dear Madam, I
>    set occursCount="0".
>    4. Going back to the hidden group, notice that outputValueCalc is set
>    to the empty string. That’s bad. I’ve lost the ability to unparse. How can
>    I set outputValueCalc so that I can unparse?
>
>
>
> A broader question: is this the right strategy? Hiding the entire input
> line and then manually reconstructing it, doesn’t feel right.
>
>
>
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
>     xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
>     xmlns:fn="http://www.w3.org/2005/xpath-functions"
>     xmlns:math="http://www.w3.org/2005/xpath-functions/math"
>     elementFormDefault="qualified">
>
>   <xs:include schemaLocation="default-dfdl-properties/defaults.dfdl.xsd" />
>
>   <xs:annotation>
>     <xs:appinfo source="http://www.ogf.org/dfdl/">
>       <dfdl:format ref="default-dfdl-properties" />
>     </xs:appinfo>
>   </xs:annotation>
>
>   <xs:element name="test">
>     <xs:complexType>
>       <xs:sequence dfdl:separator="%NL;" dfdl:separatorPosition="infix">
>         <xs:element name="line" maxOccurs="unbounded">
>           <xs:complexType>
>             <xs:sequence dfdl:separator="">
>               <xs:sequence dfdl:hiddenGroupRef="hidden_label_message"/>
>               <xs:element name="label" minOccurs="0"
>                   dfdl:occursCountKind="expression"
>                   dfdl:occursCount="{
>                     if ((../hidden_label eq 'Dear Sir') or (../hidden_label eq 'Dear Madam'))
>                     then 1 else 0}">
>                 <xs:complexType>
>                   <xs:sequence>
>                     <xs:element name="value" type="xs:string"
>                         dfdl:inputValueCalc="{../../hidden_label}" />
>                   </xs:sequence>
>                 </xs:complexType>
>               </xs:element>
>               <xs:element name="message" minOccurs="0"
>                   dfdl:occursCountKind="expression"
>                   dfdl:occursCount="{
>                     if ((../hidden_label eq 'Dear Sir') or (../hidden_label eq 'Dear Madam'))
>                     then 1 else 0}">
>                 <xs:complexType>
>                   <xs:sequence>
>                     <xs:element name="value" type="xs:string"
>                         dfdl:inputValueCalc="{../../hidden_message}" />
>                   </xs:sequence>
>                 </xs:complexType>
>               </xs:element>
>             </xs:sequence>
>           </xs:complexType>
>         </xs:element>
>       </xs:sequence>
>     </xs:complexType>
>   </xs:element>
>
>   <xs:group name="hidden_label_message">
>     <xs:sequence dfdl:separator=":" dfdl:separatorPosition="infix">
>       <xs:element name="hidden_label" dfdl:outputValueCalc="{''}">
>         <xs:annotation>
>           <xs:appinfo source="http://www.ogf.org/dfdl/">
>             <dfdl:assert testKind="expression"
>                 test="{dfdl:checkConstraints(.)}"
>                 message="Invalid label"
>                 failureType="recoverableError"/>
>           </xs:appinfo>
>         </xs:annotation>
>         <xs:simpleType>
>           <xs:restriction base="xs:string">
>             <xs:enumeration value="Dear Sir"/>
>             <xs:enumeration value="Dear Madam"/>
>           </xs:restriction>
>         </xs:simpleType>
>       </xs:element>
>       <xs:element name="hidden_message" type="xs:string"
>           dfdl:outputValueCalc="{''}" />
>     </xs:sequence>
>   </xs:group>
>
> </xs:schema>
>
>
>
>
>
>
>
>
>
>
>
> *From:* Mike Beckerle <mbecke...@apache.org>
> *Sent:* Friday, October 13, 2023 9:26 AM
> *To:* users@daffodil.apache.org
> *Subject:* [EXT] Re: How to generate an error for invalid data, discard
> the invalid data, and continue parsing?
>
>
>
> The mil-std-2045 schema on github uses techniques to achieve this sort of
> thing.
>
>
>
> There are a few such techniques.
>
>
>
> One I like is to capture the invalid data in an element named invalid,
> which has facets such that any content is deemed invalid. (E.g., a pattern
> that can't be matched.) This detail, that the <invalid>message</invalid>
> element is in fact invalid, is important, because an element named
> "invalid" can of course be entirely valid, which is a mistake we want to
> avoid.
>
>
>
> If you download that schema from github, look for the
>
>
>
> <group name="urn_unit_name_group">
>
> ...
>
>
>
> definition. It detects when a URN field and a UNIT_NAME field are both
> present, and creates an invalid element containing an error message to
> that effect.
>
>
>
> One could just issue a warning diagnostic using dfdl:assert with
> "recoverableError" as the failure type. Then you will get out a warning,
> and you can decide whether or not you want the failure represented in the
> infoset with an invalid element, or a valid element, or not at all.
>
>
>
> For CDS use, I really like putting in the "guaranteed to be invalid"
> element, because it makes it clear and testable that the schema is
> detecting what is wrong. It allows use of the schema in situations where
> you want to see what the invalidity was, but a CDS will still block it.
>
>
>
> Adding a DFDL variable that controls which thing the schema does can be
> helpful for various testing scenarios. You can have a "fail fast" setting
> which causes the parse to fail, a "message only" for getting the warning
> only, or a "capture invalid" option which creates the
> <invalid>...</invalid> element.
>
>
>
> On Wed, Oct 11, 2023 at 7:44 AM Roger L Costello <coste...@mitre.org>
> wrote:
>
> My input consists of a series of label-colon-message lines:
>
>
>
> Dear Sir: Thank you for your response
>
> Dear Madam: How are you
>
> Dear Foo: Have a good day
>
> Dear Sir: Nice work
>
>
>
> The stuff before the colon is the “label” and the stuff after the colon is
> the “message”.
>
>
>
> There are two legal labels, Dear Sir and Dear Madam.
>
>
>
> I want this output:
>
>
>
> <Tests>
>
>     <line>
>
>         <label>Dear Sir</label>
>
>         <message>Thank you for your response</message>
>
>    </line>
>
>     <line>
>
>         <label>Dear Madam</label>
>
>         <message>How are you</message>
>
>    </line>
>
>     <line>
>
>         <label>Dear Sir</label>
>
>         <message>Nice work</message>
>
>    </line>
>
> </Tests>
>
>
>
> The third line (Dear Foo: Have a good day) contains invalid data (Dear Foo
> is not a valid label) so I want that line discarded and an error reported
> for that line and parsing to continue to the next line.
>
>
>
> In other words, I want an error generated for erroneous data, the
> erroneous data discarded, and parsing to continue.
>
>
>
> How to do that?
>
>
>
> /Roger
>
>
