Mike wrote:
* you don't care what's after OPER or EXER. Just those characters are enough for you to decide the optional element DOES exist.

Ahhhhhh! Yes, I grok! That is 10 more points to my IQ. Thank you Mike!

/Roger

From: Mike Beckerle <mbecke...@apache.org>
Sent: Wednesday, September 27, 2023 9:45 AM
To: users@daffodil.apache.org
Subject: [EXT] Re: Design DFDL Schemas to Reject Malformed Data, Not Just Accept Correct Data [Was: DFDL can increase your IQ by 10 points!]

The only correction is to use a simpler test pattern:

testPattern="(OPER)|(EXER)"

because you don't care what's after OPER or EXER. Just those characters are enough for you to decide the optional element DOES exist. I claim that's what the format usually means/intends when it describes data as having unique initiator strings like this. You look for those characters and only those to decide.

On Wed, Sep 27, 2023 at 8:53 AM Roger L Costello <coste...@mitre.org> wrote:

Mike Beckerle wrote:

* Design the DFDL schema to reject malformed data, not just accept correct data.

Oh, oh, yea! I like it! Not sure how to do that, however. Would you help me work through this, please?

Mike points out, with this input:

* Foobar
* OPER/something not allowed//
* Barfoo

* Parsing the OPER line will fail, but then it will try parsing it as an EXER line, which will also fail, so it will leave the whole wrapper element out, and it will continue to try to parse the OPER line instead of failing.

Is this the behavior we desire: If an input line starts (is initiated by) OPER, then process the rest of the input line using the DFDL description of OPER.
If, during the processing of the OPER field, an error arises, then the parser should display an error message, abandon the input line, and proceed to the next input line and the element following the wrapper element. Is that the behavior we desire?

Mike said that the solution is to:

* Use dfdl:discriminator with testKind='pattern'

I don’t think that I’ve ever used that combination, so I did some experimenting. Suppose the legal value for the field following EXER is TANGO (all uppercase) and the legal value for the field following OPER is XRAY (all uppercase). Is this how to declare the wrapper element:

<xs:element name="OPER-EXER-wrapper" minOccurs="0">
    <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:discriminator testKind="pattern" testPattern="(OPER/XRAY)|(EXER/TANGO)|"/>
        </xs:appinfo>
    </xs:annotation>
    <xs:complexType>
        <!-- OPER and EXER declarations -->
    </xs:complexType>
</xs:element>

Is that correct?

This is great stuff. Once I grok this, my IQ will have increased another 10 points.

/Roger

From: Mike Beckerle <mbecke...@apache.org>
Sent: Tuesday, September 26, 2023 4:49 PM
To: users@daffodil.apache.org
Subject: [EXT] Re: DFDL can increase your IQ by 10 points!

There is another detail which will further improve your schema. What if the data contains an OPER line, but after the OPER characters there is some defect in the data of the OPER line?

foobar
OPER/something not allowed
barfoo

Parsing the OPER line will fail, but then it will try parsing it as an EXER line, which will also fail, so it will leave the whole wrapper element out, and it will continue to try to parse the OPER line instead of failing. Your optional element gave it a way to suppress the error and parse differently. If the schema after this OPER/EXER element is, say, just a string, then "OPER/something not allowed" will be taken as the value of that string, and ...
it's possible the parse will succeed and just produce an infoset that is perfectly valid according to the schema, but clearly the schema is allowing a solution we want to disallow.

The fix here is that your optionality needs a discriminator. The discriminator you need on the optional element checks only that the data starts with OPER or EXER. (Use dfdl:discriminator with testKind='pattern'.)

This issue is a matter of precision. It's the difference between:

1. It's either a fully correct OPER line, or a fully correct EXER line, or it isn't present.
2. It's either a line that starts with OPER or a line that starts with EXER or it isn't present.

That distinction is designing the schema to properly reject malformed data, not just accept correct data. See in (1) above, it allows for faulty OPER or EXER lines to be correctly parsed as "it isn't present". The decision really should NOT depend on any more than the OPER or EXER characters being there.

I find it hard to remember to do this. But most decisions in the schema need discriminators. I have to revisit every decision point in the schema one by one to make sure there are discriminators everywhere there can be.

On Tue, Sep 26, 2023 at 10:13 AM Roger L Costello <coste...@mitre.org> wrote:

Hi Folks,

I think DFDL is awesome. Think about it: DFDL is a standard language for describing (describe, not parse) just about any data format. Again, I emphasize that it's not about how to parse the data format, it's about describing the data format. Given a description, a DFDL processor can figure out how to parse instances of the data format. Wow!

But there's another reason that DFDL is awesome: it forces you to be very precise in your description. It forces you to think very logically. It forces you to understand the implications of your description decisions. Let me give you an example of the latter.

I am dealing with a data format that consists of a sequence of lines.
Here's a sample instance:

John Doe
OPER/XRAY//
Sally Smith

The first and last lines are just strings. Not interesting. The second line is the interesting one. Here's another instance:

John Doe
EXER/TANGO//
Sally Smith

As you can see, the second line starts with either OPER or EXER and terminates with //. The second line is also optional. That is, the second line is either OPER, EXER, or neither. That leads one to this description:

choice
    OPER (optional)
    EXER (optional)

However, DFDL doesn't allow branches of a choice to be optional. So, the correct description is:

choice
    sequence
        OPER (optional)
    sequence
        EXER (optional)

Slick, aye? But not correct. Let's think about this. Suppose the input is this:

John Doe
EXER/TANGO//
Sally Smith

While processing the second line, you would think that the DFDL processor would find that the first branch of the choice (the OPER branch) doesn't match and therefore the processor would process the line using the second branch. Ha! Not correct! The first branch is optional. That is key! Since the second line doesn't start with OPER, the DFDL processor thinks, "Oh, there must be no occurrences of the OPER line." So, the processor moves on to the description following the choice.

Do you see it? Do you see the problem? I hope so. This is wicked cool. As I worked through this example, it forced me to think very, very clearly about the implication of an optional OPER line.

So, what's the solution? Make OPER and EXER mandatory:

choice
    sequence
        OPER (mandatory)
    sequence
        EXER (mandatory)

And, place the choice inside an optional wrapper element:

OPER-EXER-wrapper (optional)
    choice
        sequence
            OPER (mandatory)
        sequence
            EXER (mandatory)

Now, with this input:

John Doe
EXER/TANGO//
Sally Smith

The processor will try the first branch of the choice, it fails, so it tries the second branch and succeeds.
With this input:

John Doe
Sally Smith

The processor will try the first branch of the choice, it fails, it tries the second branch, it fails, so there is no value for the wrapper element.

This blows my mind. I feel like this example alone boosted my IQ by 10 points.

/Roger
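
The difference the discriminator makes can be sketched with a small toy model in Python. This is only an illustration of the reasoning in the thread, not how Daffodil or any DFDL processor is implemented. The names parse, ParseError, FULL_LINE, and DISCRIMINATOR are hypothetical, and the model assumes (as in Mike's scenario) that whatever follows the optional wrapper is parsed as plain string lines.

```python
import re

# Hypothetical stand-in for the full OPER/EXER line descriptions: the thread's
# example allows only XRAY after OPER and only TANGO after EXER.
FULL_LINE = re.compile(r"(OPER/XRAY//)|(EXER/TANGO//)")

# The simpler test pattern from the thread: only the initiator characters.
DISCRIMINATOR = re.compile(r"(OPER)|(EXER)")


class ParseError(Exception):
    """Raised when the toy parser rejects the input."""


def parse(lines, use_discriminator):
    """Toy model of: one string line, an optional wrapper line, string lines.

    Returns (first, wrapper_or_None, trailing_lines) or raises ParseError.
    """
    if not lines:
        raise ParseError("no data")
    first, rest = lines[0], list(lines[1:])
    wrapper = None
    if rest:
        candidate = rest[0]
        well_formed = FULL_LINE.fullmatch(candidate) is not None
        # With a discriminator, seeing OPER or EXER at the start commits the
        # parser to "the wrapper exists"; it may no longer fall back to absent.
        committed = use_discriminator and DISCRIMINATOR.match(candidate) is not None
        if well_formed:
            wrapper, rest = candidate, rest[1:]
        elif committed:
            raise ParseError(f"malformed OPER/EXER line: {candidate!r}")
        # Otherwise the optional wrapper is treated as absent, and the
        # candidate line is left to be consumed as an ordinary string.
    return first, wrapper, rest
```

Under these assumptions, calling parse(["Foobar", "OPER/something not allowed//", "Barfoo"], use_discriminator=False) quietly treats the malformed OPER line as an ordinary string, while the same call with use_discriminator=True raises ParseError. Well-formed input, and input with no OPER or EXER line at all, gives the same result either way.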