Re: optional int and unparse formatting

Beckerle, Mike Mon, 30 Aug 2021 08:49:27 -0700

Good question.

I think what is happening is this. elem5 fails to parse because its content is 
an empty string, so the parser backtracks. Here's the trick: backtracking also 
puts back the separator that precedes this array/optional element. Then your 
schema has nothing to absorb the final separator.


Your schema has expressed an optional element, but what you want is a required 
separator, then an optional element after it.

I think wrapping an xs:sequence around elem5 will fix this.
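
A rough sketch of that idea, untested against your data. It would take the 
place of the current elem5 declaration inside the separated sequence. It 
assumes the inner sequence picks up no separator of its own from your default 
format, which I can't confirm without seeing your defaults file.

```xml
<!-- Sketch only: goes where elem5 currently sits in the "/"-separated sequence. -->
<!-- The wrapper is a required child, which is intended to make the separator
     before it required as well. -->
<xs:sequence>
  <!-- Assumption: this inner sequence has no separator of its own. -->
  <xs:element name="elem5" minOccurs="0" maxOccurs="1">
    <xs:simpleType>
      <xs:restriction base="xs:int">
        <xs:minInclusive value="1"/>
        <xs:maxInclusive value="999"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:sequence>
```

The optional element is now the only thing that can be backtracked, so a 
failed parse of elem5 should leave the preceding separator consumed.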

To be sure, I need to see the occursCountKind property, lengthKind property, 
etc. Basically I need to be able to reproduce your run.
I would need your default-dfdl-properties/defaults.dfdl.xsd file.

With respect to your 0001 issue:

Text number formatting, such as leading zeros, is controlled by the 
dfdl:textNumberPattern property. I think you want different values for this 
property for your two integer-type elements if they are supposed to have 
different numbers of digits, as evidenced by their max values of 999 and 99999.

However, your request that 0001 be preserved is not consistent with either 999 
or 99999 as a max value: 0001 has four digits, while those limits imply three 
and five. So I'm not sure what you are trying to achieve in this format.

DFDL does not "remember how the integer was presented". It parses it according 
to rules, creates an xs:int in the infoset, and at that point the leading zero 
information is gone. It then unparses according to rules. If you want 0001 to 
parse and unparse as 0001, you want dfdl:textNumberPattern="#0000". That will 
always produce at least 4 digits, adding a fifth only when the value needs it.

Be aware, though, that if you parse and then unparse, an incoming "01" will 
also unparse as "0001". Using dfdl:textNumberPattern="#0000" means "canonical 
form for this data is at least 4 digits". If you parse the data using 
dfdl:lengthKind='delimited', then your schema has expressed "tolerate any 
number of digits, but always canonicalize to at least 4 digits".
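
As a sketch, the pattern could be attached to elem3 like this. The 
dfdl:textNumberRep="standard" attribute is an assumption on my part. Your 
defaults file may already set it, in which case it is redundant here.

```xml
<!-- Sketch only: elem3 from the schema above with a number pattern added. -->
<xs:element name="elem3"
            dfdl:textNumberRep="standard"
            dfdl:textNumberPattern="#0000">
  <!-- "#0000" asks for at least 4 integer digits when unparsing. -->
  <xs:simpleType>
    <xs:restriction base="xs:int">
      <xs:minInclusive value="1"/>
      <xs:maxInclusive value="99999"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>
```

With this in place, an infoset value of 1 would be expected to unparse as 
0001, whatever the original text looked like.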

If you want the text of these numbers preserved, not canonicalized, and your 
application does both parse and unparse, like data security apps often do, then 
you need to use strings, not numbers.
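
A sketch of the string alternative for elem3. The pattern facet is purely 
illustrative and is only checked when validation is turned on. It also does not 
enforce the 1 to 99999 range that the xs:int version expressed.

```xml
<!-- Sketch only: elem3 kept as text so the digits round-trip unchanged. -->
<xs:element name="elem3">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <!-- Hypothetical facet: one to five decimal digits. -->
      <xs:pattern value="[0-9]{1,5}"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>
```

The infoset would then carry the string "0001" rather than the integer 1, so 
any numeric interpretation is left to the application.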

Note, however, that preserving leading/trailing non-numerically significant 
zeros is a security hole - they can be used to carry covert channel data.
Canonicalization of data is fundamentally more secure.

The usual reason people want preservation of data exactly, character for 
character, is to make test/QA easier. That's ok as long as you understand that 
there is a loss of some data security when non-information-carrying things like 
leading/trailing zeros are preserved.



________________________________
From: Theodore Toth <ted.toth....@sage.northcom.mil>
Sent: Sunday, August 29, 2021 2:45 AM
To: users@daffodil.apache.org <users@daffodil.apache.org>
Subject: optional int and unparse formatting

I just started looking at daffodil and have a few questions about my
first experiment:
Here's my dfdl:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

  <xs:include schemaLocation="default-dfdl-properties/defaults.dfdl.xsd" />
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:format ref="default-dfdl-properties" />
    </xs:appinfo>
  </xs:annotation>

  <xs:element name="FOO"
              dfdl:initiator="FOO/"
              dfdl:lengthKind="implicit">
<!--
              dfdl:terminator="//%NL;%WSP*;">
-->
    <xs:complexType>
      <xs:sequence dfdl:sequenceKind="ordered"
                   dfdl:separator="/"
                   dfdl:separatorPosition="infix">

        <xs:element name="elem1">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:minLength value="1"/>
              <xs:maxLength value="14"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>

        <xs:element name="elem2">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:pattern value="CAT|DOG|HORSE"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>

        <xs:element name="elem3">
          <xs:simpleType>
            <xs:restriction base="xs:int">
              <xs:minInclusive value="1"/>
              <xs:maxInclusive value="99999"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>

        <xs:element name="elem4" minOccurs="0" maxOccurs="1">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:minLength value="1"/>
              <xs:maxLength value="20"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>

        <xs:element name="elem5" minOccurs="0" maxOccurs="1">
          <xs:simpleType>
            <xs:restriction base="xs:int">
              <xs:minInclusive value="1"/>
              <xs:maxInclusive value="999"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>

Here's some test data:
FOO/GONE FISHIN/DOG/0001///

The parse fails with:
[error] Parse Error: Unable to parse xs:int from empty string
Schema context: elem5 Location line 59 column 10 in
file:/home/tedx/dfdl-test/test.dfdl.xsd
Data location was preceding byte 26

Why does it fail when elem5 has minOccurs="0"? elem5 is optional.

Then if I put a 0 before the last slash it generates:
<?xml version="1.0" encoding="UTF-8"?>
<FOO>
  <elem1>GONE FISHIN</elem1>
  <elem2>DOG</elem2>
  <elem3>1</elem3>
  <elem4></elem4>
  <elem5>0</elem5>
</FOO>

and when I unparse it generates:
FOO/GONE FISHIN/DOG/1//0

but I'd like it to output 0001 for elem3, how do I do that?

Ted
