Thank you, Mike. This is great.

Here is the URL to the schema that Mike references:

https://github.com/DFDLSchemas/mil-std-2045/blob/master/src/main/resources/com/owlcyberdefense/mil-std-2045/xsd/milstd2045.common.dfdl.xsd

From: Mike Beckerle <mbecke...@apache.org>
Sent: Friday, October 13, 2023 9:26 AM
To: users@daffodil.apache.org
Subject: [EXT] Re: How to generate an error for invalid data, discard the 
invalid data, and continue parsing?

The mil-std-2045 schema on github uses techniques to achieve this sort of thing.

There are a few such techniques.

One I like is to capture the invalid data in an element named invalid, which 
has facets such that any content is deemed invalid. (E.g., a pattern that can't 
be matched.) This detail, that the <invalid>message</invalid> element is in 
fact invalid, is important, because an element named "invalid" can of course be 
entirely valid, which is a mistake we want to avoid.
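
As a rough illustration (not copied from the mil-std-2045 schema), such an element might be declared as below. The type name neverValidString, the ex prefix, the pattern, and the message text are all assumptions made for this sketch; xs and dfdl are assumed to be bound to the usual XML Schema and DFDL namespaces. Facets are only checked when validation is turned on, or when a dfdl:assert calls dfdl:checkConstraints.

```xml
<!-- Hypothetical type: the pattern demands one character that is neither
     whitespace nor non-whitespace, so no string can ever satisfy it. -->
<xs:simpleType name="neverValidString">
  <xs:restriction base="xs:string">
    <xs:pattern value="[^\s\S]"/>
  </xs:restriction>
</xs:simpleType>

<!-- Local element, placed in whichever sequence or choice branch detects
     the problem. It consumes no input; its value is a computed message. -->
<xs:element name="invalid" type="ex:neverValidString"
            dfdl:inputValueCalc="{ 'URN and UNIT_NAME are both present' }"/>
```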

If you download that schema from github, look for the


<group name="urn_unit_name_group">

...


definition. It detects when a URN and a UNIT_NAME field are both present, 
and creates an invalid element containing an error message to that effect.

One could just issue a warning diagnostic using dfdl:assert with 
"recoverableError" as the failure type. Then you will get out a warning, and 
you can decide whether or not you want the failure represented in the infoset 
with an invalid element, or a valid element, or not at all.
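
A minimal sketch of such an assert, using the two legal labels from the original question quoted further down. The element name and message text are illustrative assumptions, and the element's other DFDL properties are omitted.

```xml
<xs:element name="label" type="xs:string">
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <!-- recoverableError: a diagnostic is recorded when the test is
           false, but the parse continues instead of failing. -->
      <dfdl:assert failureType="recoverableError"
                   message="Unrecognized label"
                   test="{ . eq 'Dear Sir' or . eq 'Dear Madam' }"/>
    </xs:appinfo>
  </xs:annotation>
</xs:element>
```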

For CDS use, I really like putting in the "guaranteed to be invalid" element, 
because it makes it clear and testable that the schema is detecting what is 
wrong. It also allows the schema to be used in situations where you want to 
see what the invalidity was, while a CDS will still block the data.

Adding a DFDL variable that controls which thing the schema does can be helpful 
for various testing scenarios. You can have a "fail fast" setting which causes 
the parse to fail, a "message only" for getting the warning only, or a "capture 
invalid" option which creates the <invalid>...</invalid> element.
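
One way such a switch might look, as a sketch only. The variable name errorMode, its values, and the ex prefix (assumed to be bound to the schema's target namespace) are invented for illustration; only the "fail fast" branch is shown.

```xml
<!-- Top-level annotation of the schema: declare the variable.
     external="true" lets it be set from outside the schema. -->
<xs:annotation>
  <xs:appinfo source="http://www.ogf.org/dfdl/">
    <dfdl:defineVariable name="errorMode" type="xs:string"
                         external="true" defaultValue="captureInvalid"/>
  </xs:appinfo>
</xs:annotation>

<!-- On the element being checked: with the default failureType
     (processingError) the parse fails on a bad label, but only when
     errorMode is 'failFast'; otherwise the test is always true. -->
<xs:element name="label" type="xs:string">
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:assert message="Unrecognized label"
                   test="{ $ex:errorMode ne 'failFast'
                           or . eq 'Dear Sir' or . eq 'Dear Madam' }"/>
    </xs:appinfo>
  </xs:annotation>
</xs:element>
```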

On Wed, Oct 11, 2023 at 7:44 AM Roger L Costello 
<coste...@mitre.org<mailto:coste...@mitre.org>> wrote:
My input consists of a series of label-colon-message lines:

Dear Sir: Thank you for your response
Dear Madam: How are you
Dear Foo: Have a good day
Dear Sir: Nice work

The stuff before the colon is the “label” and the stuff after the colon is the 
“message”.

There are two legal labels, Dear Sir and Dear Madam.

I want this output:

<Tests>
    <line>
        <label>Dear Sir</label>
        <message>Thank you for your response</message>
    </line>
    <line>
        <label>Dear Madam</label>
        <message>How are you</message>
    </line>
    <line>
        <label>Dear Sir</label>
        <message>Nice work</message>
    </line>
</Tests>

The third line (Dear Foo: Have a good day) contains invalid data (Dear Foo is 
not a valid label), so I want that line discarded, an error reported for that 
line, and parsing to continue to the next line.

In other words, I want an error generated for erroneous data, the erroneous 
data discarded, and parsing to continue.

How to do that?

/Roger
