Thanks for the excellent explanation, Mike!

I did a writeup of the problem and solution. If my writeup has any errors, 
please let me know. See below.  /Roger

Problem Statement: A field in a data format has a fixed width. Let’s say the 
width is 10 characters and we name the field “Foo.” If no data is available for 
the field, it must be populated with a single hyphen, surrounded by spaces. The 
hyphen may be in any position within the field. If data is available, it must 
conform to this regular expression: [A-Z]{10}; that is, the data must consist 
of exactly 10 uppercase letters.

First, I show the wrong approach to this problem. Then I show the right 
approach.

We need to specify that the field is populated with a hyphen when no data is 
available; nillable and nilValue are used for this:

nillable=true
nilValue='-'

The field's content is specified using a regex:

lengthKind=pattern
lengthPattern=[A-Z]{10}

The regex in that lengthPattern doesn't consider the hyphen. So we update the 
regex to allow for the hyphen:

lengthPattern=[A-Z]{10}|[ ]*-[ ]*

However, the right-hand side of that regex (which deals with the hyphen) 
doesn't constrain the length of the field. Recall the hyphen may be positioned 
anywhere within the 10-character field. Writing a regex that specifies all 
possible positions of the hyphen, while ensuring the field is 10 characters, is 
not reasonable.
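
To make the length problem concrete, here is a small sketch. It is only an illustration: Python's re module stands in for the DFDL regex engine (an assumption; the two agree on a pattern this simple), and the sample strings are made up.

```python
import re

# The lengthPattern from the text: ten uppercase letters, or a hyphen
# with any number of spaces on either side.
LENGTH_PATTERN = re.compile(r"[A-Z]{10}|[ ]*-[ ]*")


def matches(text: str) -> bool:
    """Return True if the whole string matches the pattern."""
    return LENGTH_PATTERN.fullmatch(text) is not None


# Hypothetical samples: only the first two are valid 10-character fields,
# yet the pattern accepts every one of them.
samples = ["ABCDEFGHIJ", "    -     ", "-", "  -  ", " " * 20 + "-"]
for s in samples:
    print(f"{len(s):2d} chars  matches={matches(s)}  {s!r}")
```

Every sample matches, including the 1-, 5-, and 21-character ones, so the hyphen branch says nothing about the field being 10 characters wide.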

So we explicitly specify the length:

lengthKind=explicit
length=10

But now we have conflicting requirements:

1. lengthKind=pattern for the regex
2. lengthKind=explicit for the field length

That's a problem: it's not legal. An element cannot have two different 
lengthKind values.

Now for the correct solution.

First, discard nillable and nilValue; the allowed field values, including the 
hyphen, are specified by a regex.

Explicitly specify the length of the field:

lengthKind=explicit
length=10

When the field contains a hyphen, it is surrounded by spaces. Direct the parser 
to trim the surrounding spaces:

textTrimKind=padChar
textStringPadCharacter='%SP;'
textStringJustification=center

Note that the parser ensures the field is 10 characters prior to performing the 
trim operation.

We no longer need to be concerned with the surrounding spaces, so the regex is 
simplified:

[A-Z]{10}|-

As seen earlier, we cannot specify the regex using lengthKind=pattern, because 
lengthKind is already set to explicit. Instead, use the XSD pattern facet:

<simpleType>
    <restriction base="xs:string">
       <pattern value="[A-Z]{10}|-"/>
    </restriction>
</simpleType>

Unless the parser is run in validation mode, XSD facets are not enforced. So 
force the parser to check the XSD pattern facet by preceding the simpleType 
with an annotation that contains checkConstraints:

<xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:assert test="{ dfdl:checkConstraints(.) }"
            message="Validation of Foo failed" />
    </xs:appinfo>
</xs:annotation>
<xs:simpleType>
    <xs:restriction base="xs:string">
        <xs:pattern value="[A-Z]{10}|-"/>
    </xs:restriction>
</xs:simpleType>

Putting it all together, here is how to declare the Foo element:

<xs:element name="Foo"
            dfdl:lengthKind="explicit"
            dfdl:length="10"
            dfdl:textTrimKind="padChar"
            dfdl:textPadKind="padChar"
            dfdl:textStringPadCharacter="%SP;"
            dfdl:textStringJustification="center">
    <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:assert test="{ dfdl:checkConstraints(.) }"
                         message="Validation of Foo failed" />
        </xs:appinfo>
    </xs:annotation>
    <xs:simpleType>
        <xs:restriction base="xs:string">
            <xs:pattern value="[A-Z]{10}|-"/>
        </xs:restriction>
    </xs:simpleType>
</xs:element>
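
As a sanity check on my understanding, here is a rough Python sketch of what I expect the parser to do with that declaration: take exactly 10 characters, trim the pad spaces from both ends, then check the pattern facet. This is not Daffodil code; the function name parse_foo and the use of ValueError to stand in for a parse error are my own inventions for illustration.

```python
import re

FIELD_LENGTH = 10                        # dfdl:length="10"
PATTERN = re.compile(r"[A-Z]{10}|-")     # the xs:pattern facet


def parse_foo(data: str) -> str:
    """Simulate parsing Foo from the start of data (hypothetical helper)."""
    # lengthKind=explicit: the field is exactly FIELD_LENGTH characters.
    if len(data) < FIELD_LENGTH:
        raise ValueError("not enough data for Foo")
    raw = data[:FIELD_LENGTH]

    # textTrimKind=padChar with center justification: trim spaces
    # from both ends, after the length has been fixed.
    value = raw.strip(" ")

    # dfdl:checkConstraints(.): the trimmed value must match the facet.
    if PATTERN.fullmatch(value) is None:
        raise ValueError(f"Validation of Foo failed: {raw!r}")
    return value


print(parse_foo("ABCDEFGHIJ"))   # letters are left untouched
print(parse_foo("    -     "))   # hyphen in the middle
print(parse_foo("-         "))   # hyphen at the far left
```

Inputs such as "ABC       " (too few letters) or "  -  -    " (two hyphens) are 10 characters wide but fail the pattern check after trimming, which is the behavior I want from the assert.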


From: Mike Beckerle <mbecke...@apache.org>
Sent: Thursday, July 28, 2022 4:36 PM
To: users@daffodil.apache.org
Subject: [EXT] Re: Conflicting requirements: a data format field needs both 
lengthKind="explicit" and lengthKind="pattern"

You said the length is 100, so that is the length to use with lengthKind 
'explicit'.

What about using your regex but via a pattern facet?

<element name="Foo" dfdl:lengthKind='explicit' dfdl:length='100'>
  <simpleType>
    <restriction base="xs:string">
       <pattern value="[A-Z]{100}|[ ]*-[ ]*"/>
    </restriction>
  </simpleType>
</element>

You should be able to trim spaces as well from this so that you will get either 
100 characters of A-Z or a single "-" character as the string's actual length.

Note that in this case your regex is simpler. The two "[ ]*" are gone because 
the spaces will be trimmed from both ends of the string.

<element name="Foo" dfdl:lengthKind='explicit' dfdl:length='100'
   dfdl:textTrimKind='padChar'
   dfdl:textStringPadCharacter='%SP;'
   dfdl:textPadKind='padChar'
   dfdl:textStringJustification="center">
  <simpleType>
    <restriction base="xs:string">
       <pattern value="[A-Z]{100}|-"/>
    </restriction>
  </simpleType>
</element>

I did not run this DFDL, but this sort of thing is typical of fixed length data.

On Thu, Jul 28, 2022 at 8:52 AM Roger L Costello 
<coste...@mitre.org> wrote:
Hi Folks,

The text data format that I am writing a DFDL schema for has a field (let's 
name it "Foo") with a fixed width. Let's say the width is 100 characters. The 
content of the field is uppercase letters. If there is no data available to 
populate the field, it must be populated with a single hyphen (surrounded by 
spaces to ensure the field has a width of 100). The hyphen may be in any 
position within the field. For reasons I will not share, I must specify the 
field's content using a regex:

lengthKind=pattern
lengthPattern=[A-Z]{100}

However, that lengthPattern doesn't take into account the hyphen that is needed 
when there is no data. So I updated the regex like this:

lengthPattern=[A-Z]{100}|[ ]*-[ ]*

However, the right-hand side of that regex (which deals with the hyphen) 
doesn't constrain the length of the field. Recall the hyphen may be positioned 
anywhere within the 100 character field. Writing a regex that specifies all 
possible positions of the hyphen, while ensuring the field is 100 characters, 
is not reasonable.

So it would seem that I need to specify length=100 on the element declaration:

lengthKind=explicit
length=100

But now I have conflicting requirements:

1. The element declaration needs to specify lengthKind=pattern for the regex

2. The element declaration needs to specify lengthKind=explicit for the field 
length

That's a problem. That's not legal.

In other words, I need this illegal DFDL:

<xs:element name="Foo"
        nillable="true"
        dfdl:nilValue="-"
        dfdl:lengthKind="explicit"
        dfdl:length="100"
        dfdl:lengthUnits="characters"
        dfdl:lengthKind="pattern"
        dfdl:lengthPattern="[A-Z]{100}|[ ]*-[ ]*">
   <xs:simpleType>
        <xs:annotation>
            <xs:appinfo source="http://www.ogf.org/dfdl/">
                <dfdl:assert test="{ (fn:nilled(.)) or (. ne '') }"/>
            </xs:appinfo>
        </xs:annotation>
        <xs:restriction base="xs:string"/>
    </xs:simpleType>
</xs:element>

Is there a solution to this problem? If not, is there a workaround?

/Roger
