Re: How to specify data with two fields, no delimiter, variable length?

Steve Lawrence Tue, 20 Jul 2021 06:49:22 -0700

The enumeration + checkConstraints approach doesn't give Daffodil any
information about the length of the field. Those are only used to
validate the field *after* it has been parsed.


So how is Daffodil determining the length of the field if you haven't
specified a length? My guess is since the schema compiles, that probably
means that your global dfdl:format has set lengthKind="delimited"--other
values would probably fail to compile since additional properties are
required.

And with lengthKind="delimited" and no delimiters in scope, the length
is just all the data up until the end-of-file is reached. So your item1
is going to be parsed as the entire contents of the file (including any
newlines), which will fail the enumeration constraint.

So even if you add the enumeration + checkConstraints, you still need
dfdl:lengthKind="pattern" with a dfdl:lengthPattern to tell Daffodil the
length of the field (either of the patterns I mentioned should work).
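
A sketch of how the two pieces might fit together, combining the quoted
item1 element with the strict pattern from the earlier reply (untested
against real data, so treat it as a starting point):

  <xs:element name="item1"
    dfdl:lengthKind="pattern" dfdl:lengthPattern="ABC|AB|AC|A">
    <xs:annotation>
      <xs:appinfo source="http://www.ogf.org/dfdl/">
        <!-- runs after the pattern has determined the field length -->
        <dfdl:assert test="{ dfdl:checkConstraints(.) }"
          message="The value of item1 is not one of the allowable values" />
      </xs:appinfo>
    </xs:annotation>
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value="A" />
        <xs:enumeration value="ABC" />
        <xs:enumeration value="AB" />
        <xs:enumeration value="AC" />
      </xs:restriction>
    </xs:simpleType>
  </xs:element>

With the strict pattern, the enumeration mostly guards against the
empty-string match; with the looser "[^0-9]*" pattern, it is what rejects
values like XYZ.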

On 7/20/21 9:34 AM, Roger L Costello wrote:
> Thank you Steve. Terrific explanation. 
> 
> I tried the approach you described - dfdl:lengthKind="pattern" 
> dfdl:lengthPattern="ABC|AB|AC|A" - and it worked great.
> 
> I also tried using enumeration facets coupled with dfdl:checkConstraints 
> within dfdl:assert
> 
> <xs:element name="item1">
>     <xs:annotation>
>         <xs:appinfo 
>             source="http://www.ogf.org/dfdl/">
>             <dfdl:assert 
>                 test="{ dfdl:checkConstraints(.) }"
>                 message="The value of item1 is not one of the allowable values" 
>             />
>         </xs:appinfo>
>     </xs:annotation>
>     <xs:simpleType>
>         <xs:restriction base="xs:string">
>             <xs:enumeration value="A" />
>             <xs:enumeration value="ABC" />
>             <xs:enumeration value="AB" />
>             <xs:enumeration value="AC" />
>         </xs:restriction>
>     </xs:simpleType>
> </xs:element>
> 
> But that did not work. Why does that not work?
> 
> /Roger
> 
> -----Original Message-----
> From: Steve Lawrence <slawre...@apache.org> 
> Sent: Monday, July 12, 2021 2:39 PM
> To: users@daffodil.apache.org
> Subject: [EXT] Re: How to specify data with two fields, no delimiter, 
> variable length?
> 
> In cases like these, you need to use dfdl:lengthKind="pattern" and a regular 
> expression to define the length of the first item.
> 
> There's lots of different regexs depending on what kinds of infosets you want 
> to allow.
> 
> For example, one approach for the first item is a very strict regex that 
> matches exactly one of the four values, e.g.
> 
>   <xs:element name="item" type="xs:string"
>     dfdl:lengthKind="pattern" dfdl:lengthPattern="ABC|AB|AC|A" />
> 
> With this approach, the item will get a non-zero length if it is one of those 
> items. Otherwise the item will be the empty string. And if you don't want 
> empty string to be allowed, you need to add an assert that the length is 
> greater than zero. Also, note that order in the regex matters: alternatives 
> are tried left to right, so list longer values before their prefixes (ABC 
> before AB before A).
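> 
> A minimal sketch of such an assert, assuming the fn prefix is bound to the 
> XPath functions namespace in your schema (the message text is made up for 
> illustration):
> 
>   <xs:element name="item" type="xs:string"
>     dfdl:lengthKind="pattern" dfdl:lengthPattern="ABC|AB|AC|A">
>     <xs:annotation>
>       <xs:appinfo source="http://www.ogf.org/dfdl/">
>         <!-- fails the parse when the pattern matched nothing -->
>         <dfdl:assert test="{ fn:string-length(.) gt 0 }"
>           message="item must not be empty" />
>       </xs:appinfo>
>     </xs:annotation>
>   </xs:element>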
> 
> On the other end of the spectrum, you could instead model the first item to 
> match as many non-digits as possible:
> 
>   <xs:element name="item" type="xs:string"
>     dfdl:lengthKind="pattern" dfdl:lengthPattern="[^0-9]*" />
> 
> This will match any of the four allowed values, but will also match anything 
> else up to the first digit. So this could potentially produce infosets with 
> an item value of XYZ, for example. In some cases, you might actually want 
> this--we might consider the data to be "well-formed" but not "valid". So you 
> still get an infoset, it's just not "valid". Whereas in the first case, you 
> could only get a valid infoset.
> 
> You'll probably also need to use regex length for matching the numeric item 
> if there's no delimiter after the number.
> 
> So putting it together, and using the second approach for both items, you 
> might do something like this:
> 
>   <xs:sequence>
>     <xs:element name="item1" type="xs:string"
>       dfdl:lengthKind="pattern" dfdl:lengthPattern="[^0-9]*" />
>     <xs:element name="item2" type="xs:int"
>       dfdl:lengthKind="pattern" dfdl:lengthPattern="[0-9]*" />
>   </xs:sequence>
> 
> So the first item is a string parsing as many non-digits as possible, and the 
> second is an int parsing as many digits as possible. Note that this approach 
> probably should have limits on the regex length in case the data is 
> bad/malformed. For example, if the data didn't contain numbers then item1 
> would just consume the entire data. So instead of *, you might want to use 
> something like "{0,10}" for both regexes.
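> 
> As a rough sketch of that bounded variant (the limit of 10 is only an 
> illustration, not something derived from the data format):
> 
>   <xs:sequence>
>     <xs:element name="item1" type="xs:string"
>       dfdl:lengthKind="pattern" dfdl:lengthPattern="[^0-9]{0,10}" />
>     <xs:element name="item2" type="xs:int"
>       dfdl:lengthKind="pattern" dfdl:lengthPattern="[0-9]{0,10}" />
>   </xs:sequence>
> 
> Since item2 is described as an integer 0-999, a tighter bound such as 
> "[0-9]{1,3}" might fit better.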
> 
> - Steve
> 
> On 7/12/21 2:05 PM, Roger L Costello wrote:
>> Hi Folks,
>>
>> I have a data field composed of two items.
>>
>> The values for the first item can be enumerated:
>>
>>      A
>>      ABC
>>      AB
>>      AC
>>
>> The value of the second item is any integer 0-999.
>>
>> So, here is a sample data field:
>>
>>      A250
>>
>> How do I parse that using DFDL? I reckon I'm stuck.
>>
>> /Roger
>>
>
