Thank you Steve. Terrific explanation. 

I tried the approach you described - dfdl:lengthKind="pattern" 
dfdl:lengthPattern="ABC|AB|AC|A" - and it worked great.

I also tried using enumeration facets coupled with dfdl:checkConstraints within 
a dfdl:assert:

<xs:element name="item1">
    <xs:annotation>
        <xs:appinfo 
            source="http://www.ogf.org/dfdl/">
            <dfdl:assert 
                test="{ dfdl:checkConstraints(.) }"
                message="The value of item1 is not one of the allowable values" 
            />
        </xs:appinfo>
    </xs:annotation>
    <xs:simpleType>
        <xs:restriction base="xs:string">
            <xs:enumeration value="A" />
            <xs:enumeration value="ABC" />
            <xs:enumeration value="AB" />
            <xs:enumeration value="AC" />
        </xs:restriction>
    </xs:simpleType>
</xs:element>

But that did not work. Why does that not work?

/Roger

-----Original Message-----
From: Steve Lawrence <slawre...@apache.org> 
Sent: Monday, July 12, 2021 2:39 PM
To: users@daffodil.apache.org
Subject: [EXT] Re: How to specify data with two fields, no delimiter, variable 
length?

In cases like these, you need to use dfdl:lengthKind="pattern" and a regular 
expression to define the length of the first item.

There are many possible regexes, depending on what kinds of infosets you want 
to allow.

For example, one approach for the first item is a very strict regex that 
matches exactly one of the four values, e.g.

  <xs:element name="item" type="xs:string"
    dfdl:lengthKind="pattern" dfdl:lengthPattern="ABC|AB|AC|A" />

With this approach, the item gets a non-zero length if the data starts with one 
of those values. Otherwise the item is the empty string. If you don't want the 
empty string to be allowed, you need to add an assert that the length is 
greater than zero. Also note that order in the regex matters: alternatives are 
tried left to right, so a longer value must come before any shorter value that 
is a prefix of it (ABC and AB before A).
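
A minimal sketch of such an assert, assuming the fn prefix is bound to the 
XPath functions namespace in the schema and that the message wording is free to 
choose (both are assumptions, not something stated above):

```xml
<xs:element name="item" type="xs:string"
  dfdl:lengthKind="pattern" dfdl:lengthPattern="ABC|AB|AC|A">
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <!-- Reject the case where the pattern matched nothing,
           i.e. the parsed value is the empty string -->
      <dfdl:assert test="{ fn:string-length(.) gt 0 }"
        message="item must be one of ABC, AB, AC, or A" />
    </xs:appinfo>
  </xs:annotation>
</xs:element>
```

With this in place, data that does not start with one of the four values 
should cause a parse error instead of producing an empty item.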

On the other end of the spectrum, you could instead model the first item to 
match as many non-digits as possible:

  <xs:element name="item" type="xs:string"
    dfdl:lengthKind="pattern" dfdl:lengthPattern="[^0-9]*" />

This will match any of the four allowed values, but will also match anything 
else up to the first digit. So this could potentially produce infosets with an 
item value of XYZ, for example. In some cases, you might actually want this: we 
might consider the data to be "well-formed" but not "valid". So you still get 
an infoset, it's just not "valid". Whereas in the first case, you could only 
get a valid infoset.
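
One way to express that split, sketched here on the assumption that the 
processor is run with validation enabled, is to let the permissive pattern 
determine the length and leave the allowed values to enumeration facets:

```xml
<xs:element name="item"
  dfdl:lengthKind="pattern" dfdl:lengthPattern="[^0-9]*">
  <xs:simpleType>
    <!-- The facets do not affect how much data is consumed;
         they are only checked when the value is validated -->
    <xs:restriction base="xs:string">
      <xs:enumeration value="A" />
      <xs:enumeration value="ABC" />
      <xs:enumeration value="AB" />
      <xs:enumeration value="AC" />
    </xs:restriction>
  </xs:simpleType>
</xs:element>
```

Under that assumption, a value such as XYZ would still appear in the infoset 
but would be reported as a validation error.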

You'll probably also need to use regex length for matching the numeric item if 
there's no delimiter after the number.

So putting it together, and using the second approach for both items, you might 
do something like this:

  <xs:sequence>
    <xs:element name="item1" type="xs:string"
      dfdl:lengthKind="pattern" dfdl:lengthPattern="[^0-9]*" />
    <xs:element name="item2" type="xs:int"
      dfdl:lengthKind="pattern" dfdl:lengthPattern="[0-9]*" />
  </xs:sequence>

So the first item is a string that consumes as many non-digits as possible, and 
the second is an int that consumes as many digits as possible. Note that this 
approach should probably limit the regex length in case the data is bad or 
malformed. For example, if the data didn't contain any digits, item1 would 
consume the entire data. So instead of *, you might want to use something like 
"{0,10}" for both regexes.
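
As a rough sketch, the same sequence with bounded patterns might look like 
this. The limit of 10 is just the illustrative value from above, not something 
taken from the data format:

```xml
<xs:sequence>
  <!-- At most 10 non-digit characters -->
  <xs:element name="item1" type="xs:string"
    dfdl:lengthKind="pattern" dfdl:lengthPattern="[^0-9]{0,10}" />
  <!-- At most 10 digit characters -->
  <xs:element name="item2" type="xs:int"
    dfdl:lengthKind="pattern" dfdl:lengthPattern="[0-9]{0,10}" />
</xs:sequence>
```

Since the stated range for the second item is 0-999, a tighter bound such as 
"[0-9]{1,3}" would probably also be reasonable there.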

- Steve

On 7/12/21 2:05 PM, Roger L Costello wrote:
> Hi Folks,
> 
> I have a data field composed of two items.
> 
> The values for the first item can be enumerated:
> 
>       A
>       ABC
>       AB
>       AC
> 
> The value of the second item is any integer 0-999.
> 
> So, here is a sample data field:
> 
>       A250
> 
> How do I parse that using DFDL? I reckon I'm stuck.
> 
> /Roger
> 
