Re: Design DFDL Schemas to Reject Malformed Data, Not Just Accept Correct Data [Was: DFDL can increase your IQ by 10 points!]

Roger L Costello Wed, 27 Sep 2023 06:56:42 -0700

Mike wrote:


  *   you don't care what's after OPER or EXER. Just those characters are 
enough for you to decide the optional element DOES exist.

Ahhhhhh!

Yes, I grok!

That is 10 more points to my IQ.

Thank you Mike!

/Roger

From: Mike Beckerle <mbecke...@apache.org>
Sent: Wednesday, September 27, 2023 9:45 AM
To: users@daffodil.apache.org
Subject: [EXT] Re: Design DFDL Schemas to Reject Malformed Data, Not Just 
Accept Correct Data [Was: DFDL can increase your IQ by 10 points!]

The only correction is to use a simpler test pattern:

testPattern="(OPER)|(EXER)"

because you don't care what's after OPER or EXER. Just those characters are 
enough for you to decide the optional element DOES exist. I claim that's what 
the format usually means/intends when it describes data as having unique 
initiator strings like this. You look for those characters and only those to 
decide.
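
Applied to the wrapper element from the question quoted below, the simpler pattern might look like the following sketch. It reuses the element name and the placeholder comment from that message; everything else about the surrounding schema is assumed.

```xml
<!-- Sketch only: the discriminator looks at the leading OPER or EXER
     characters and nothing after them. -->
<xs:element name="OPER-EXER-wrapper" minOccurs="0">
    <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:discriminator testKind="pattern" testPattern="(OPER)|(EXER)"/>
        </xs:appinfo>
    </xs:annotation>
    <xs:complexType>
        <!-- OPER and EXER declarations -->
    </xs:complexType>
</xs:element>
```

With this in place, a line such as "OPER/something not allowed//" should commit the parser to the wrapper element, so a defect later in the line is reported as an error instead of being treated as "the optional element is absent".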


On Wed, Sep 27, 2023 at 8:53 AM Roger L Costello 
<coste...@mitre.org> wrote:
Mike Beckerle wrote:


  *   Design the DFDL schema to reject malformed data, not just accept correct 
data.

Oh, oh, yea!

I like it!

Not sure how to do that, however. Would you help me work through this, please?

Mike points out, with this input:


  *   Foobar
  *   OPER/something not allowed//
  *   Barfoo
  *   Parsing the OPER line will fail, but then it will try parsing it as an 
EXER line, which will also fail, so it will leave the whole wrapper element 
out, and it will continue to try to parse the OPER line instead of failing.

Is this the behavior we desire:

If an input line starts (is initiated by) OPER, then process the rest of the 
input line using the DFDL description of OPER. If, during the processing of the 
OPER field, an error arises, then the parser should display an error message, 
abandon the input line, proceed to the next input line and the element 
following the wrapper element.

Is that the behavior we desire?

Mike said that the solution is to:


  *   Use dfdl:discriminator with testKind='pattern'

I don’t think that I’ve ever used that combination, so I did some experimenting.

Suppose the legal value for the field following EXER is TANGO (all uppercase) 
and the legal value for the field following OPER is XRAY (all uppercase).

Is this how to declare the wrapper element:

<xs:element name="OPER-EXER-wrapper" minOccurs="0">
    <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:discriminator testKind="pattern" 
testPattern="(OPER/XRAY)|(EXER/TANGO)|"/>
        </xs:appinfo>
    </xs:annotation>
    <xs:complexType>
        <!-- OPER and EXER declarations -->
    </xs:complexType>
</xs:element>

Is that correct?

This is great stuff. Once I grok this, my IQ will have increased another 10 
points.

/Roger

From: Mike Beckerle <mbecke...@apache.org>
Sent: Tuesday, September 26, 2023 4:49 PM
To: users@daffodil.apache.org
Subject: [EXT] Re: DFDL can increase your IQ by 10 points!

There is another detail which will further improve your schema.

What if the data contains an OPER line, but after the OPER characters there is 
some defect in the data of the OPER line?

foobar
OPER/something not allowed
barfoo

Parsing the OPER line will fail, but then it will try parsing it as an EXER 
line, which will also fail, so it will leave the whole wrapper element out, and 
it will continue to try to parse the OPER line instead of failing. Your 
optional element gave it a way to suppress the error and parse differently.

If the schema after this OPER/EXER element is, say, just a string, then 
"OPER/something not allowed" will be taken as the value of that string, and ... 
it's possible the parse will succeed and just produce an infoset that is 
perfectly valid according to the schema, but clearly the schema is allowing a 
solution we want to disallow.

The fix here is that your optional element needs a discriminator. The 
discriminator you need checks only that the data starts with OPER or EXER 
(use dfdl:discriminator with testKind='pattern').

This issue is a matter of precision. It's the difference between:

  1.  It's either a fully correct OPER line, or a fully correct EXER line, or 
it isn't present.
  2.  It's either a line that starts with OPER or a line that starts with EXER 
or it isn't present.
That distinction is designing the schema to properly reject malformed data, not 
just accept correct data.

Notice that (1) above allows faulty OPER or EXER lines to be parsed, without 
error, as "it isn't present". The decision really should NOT depend on any more 
than the OPER or EXER characters being there.

I find it hard to remember to do this. But most decisions in the schema need 
discriminators. I have to revisit every decision point in the schema one by one 
to make sure there are discriminators everywhere there can be.
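
As a rough illustration of one such decision point, the sketch below puts a pattern discriminator on each branch of the OPER/EXER choice, so that a line starting with OPER commits the parser to the OPER branch rather than letting a failure fall through to the EXER branch. The element declarations (type xs:string, the "OPER/" and "EXER/" initiators, the "//" terminator) are assumptions for illustration, not taken from the actual schema, and a dfdl:format with delimited lengths is assumed to be in scope.

```xml
<!-- Sketch only: one discriminator per choice branch. -->
<xs:choice>
    <xs:sequence>
        <xs:annotation>
            <xs:appinfo source="http://www.ogf.org/dfdl/">
                <dfdl:discriminator testKind="pattern" testPattern="OPER"/>
            </xs:appinfo>
        </xs:annotation>
        <!-- Hypothetical declaration of the OPER line -->
        <xs:element name="OPER" type="xs:string"
                    dfdl:initiator="OPER/" dfdl:terminator="//"/>
    </xs:sequence>
    <xs:sequence>
        <xs:annotation>
            <xs:appinfo source="http://www.ogf.org/dfdl/">
                <dfdl:discriminator testKind="pattern" testPattern="EXER"/>
            </xs:appinfo>
        </xs:annotation>
        <!-- Hypothetical declaration of the EXER line -->
        <xs:element name="EXER" type="xs:string"
                    dfdl:initiator="EXER/" dfdl:terminator="//"/>
    </xs:sequence>
</xs:choice>
```

These branch discriminators would only settle the choice itself; the optional element around the choice is a separate decision point and still needs its own discriminator.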







On Tue, Sep 26, 2023 at 10:13 AM Roger L Costello 
<coste...@mitre.org> wrote:
Hi Folks,

I think DFDL is awesome. Think about it: DFDL is a standard language for 
describing (describe, not parse) just about any data format. Again, I emphasize 
that it's not about how to parse the data format, it's about describing the 
data format. Given a description, a DFDL processor can figure out how to parse 
instances of the data format. Wow!

But there's another reason that DFDL is awesome: it forces you to be very 
precise in your description. It forces you to think very logically. It forces 
you to understand the implications of your description decisions. Let me give 
you an example of the latter.

I am dealing with a data format that consists of a sequence of lines. Here's a 
sample instance:

John Doe
OPER/XRAY//
Sally Smith

The first and last lines are just strings. Not interesting. The second line is 
the interesting one. Here's another instance:

John Doe
EXER/TANGO//
Sally Smith

As you can see, the second line starts with either OPER or EXER and terminates 
with //. The second line is also optional. That is, the second line is either 
OPER, EXER, or neither. That leads one to this description:

choice
      OPER (optional)
      EXER (optional)

However, DFDL doesn't allow branches of a choice to be optional. So, the 
correct description is:

choice
      sequence
            OPER (optional)
      sequence
            EXER (optional)

Slick, aye?

But not correct.

Let's think about this. Suppose the input is this:

John Doe
EXER/TANGO//
Sally Smith

While processing the second line, you would think that the DFDL processor would 
find that the first branch of the choice (the OPER branch) doesn't match and 
therefore the processor would process the line using the second branch. Ha! Not 
correct!

The first branch is optional. That is key! Since the second line doesn't start 
with OPER, the DFDL processor thinks, "Oh, there must be no occurrences of the 
OPER line." So, the processor moves on to the description following the choice. 
Do you see it? Do you see the problem? I hope so. This is wicked cool. As I 
worked through this example, it forced me to think very, very clearly about the 
implication of an optional OPER line. So, what's the solution? Make OPER and 
EXER mandatory:

 choice
      sequence
            OPER (mandatory)
      sequence
            EXER (mandatory)

And, place the choice inside an optional wrapper element:

OPER-EXER-wrapper (optional)
      choice
            sequence
                  OPER (mandatory)
            sequence
                  EXER (mandatory)
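
In DFDL schema terms, that outline might be written roughly as follows. This is only a sketch: the element types, the "OPER/" and "EXER/" initiators, and the "//" terminator are assumptions based on the sample data, and it relies on a dfdl:format defined elsewhere (delimited lengths, occursCountKind, and so on).

```xml
<!-- Sketch only: optional wrapper around a choice of mandatory branches. -->
<xs:element name="OPER-EXER-wrapper" minOccurs="0">
    <xs:complexType>
        <xs:choice>
            <xs:sequence>
                <!-- Mandatory within its branch (minOccurs defaults to 1) -->
                <xs:element name="OPER" type="xs:string"
                            dfdl:initiator="OPER/" dfdl:terminator="//"/>
            </xs:sequence>
            <xs:sequence>
                <!-- Mandatory within its branch (minOccurs defaults to 1) -->
                <xs:element name="EXER" type="xs:string"
                            dfdl:initiator="EXER/" dfdl:terminator="//"/>
            </xs:sequence>
        </xs:choice>
    </xs:complexType>
</xs:element>
```

The only optional component here is the wrapper itself (minOccurs="0"), which is what lets the choice try both branches before the wrapper is declared absent.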

Now, with this input:

John Doe
EXER/TANGO//
Sally Smith

The processor will try the first branch of the choice, it fails, so it tries 
the second branch and succeeds.

With this input:

John Doe
Sally Smith

The processor will try the first branch of the choice, it fails, it tries the 
second branch, it fails, so there is no value for the wrapper element.

This blows my mind. I feel like this example alone boosted my IQ by 10 points.

/Roger
