Re: regex |AND| left over data

Beckerle, Mike Fri, 19 Mar 2021 09:37:34 -0700

Yeah, I worked through this in detail a little while ago. I learned the hard
way that XML attributes can't carry tabs or line endings. So inside xs:pattern,
the value attribute can't express a tab or line ending with a character
reference like &#x9; (a tab), because XML converts it to a space when it
appears in an attribute.
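
A small sketch of the normalization being described, using Python's standard-library parser rather than Xerces or Daffodil (an assumption made for illustration). It only exercises literal tab and newline characters inside an attribute value; it does not reproduce the schema-loading path discussed here, and the element and attribute names are made up.

```python
import xml.etree.ElementTree as ET

def attr_value_after_parsing(raw_value):
    """Return attribute v as the parser reports it after normalization."""
    doc = '<pattern v="' + raw_value + '"/>'
    return ET.fromstring(doc).get("v")

# A literal tab and a literal newline placed inside the attribute value.
tab_result = attr_value_after_parsing("a\tb")
newline_result = attr_value_after_parsing("a\nb")

# Both come back with the whitespace character replaced by a space.
print(repr(tab_result), repr(newline_result))
```

Writing \t, \n, and \r in the regex avoids the question entirely, because the attribute then contains only a backslash and a letter.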

Here's what works:

   <xs:simpleType name="singleByteCharsWithCodepointLessThan1F"
                  dfdl:encoding="iso-8859-1">
      <xs:restriction base="xs:string">
        <!--
          We cannot depend on Daffodil's mapping from E009 to 9, E00A to A,
          and E00D to D, because that won't work in Xerces with "full"
          validation.

          We can't use numeric entities for the tab, LF, or CR, because those
          aren't allowed in attribute values (XML attribute normalization
          converts tabs and line endings to spaces).
          So we use \t, \n, and \r for those control characters.

          The result is that this somewhat clumsy pattern is what is needed
          in DFDL to say you want only code points in the original data of
          0x00 to 0x1F to be valid.
          -->
        <xs:pattern
          value="[&#xE000;-&#xE008;\t\n&#xE00B;&#xE00C;\r&#xE00E;-&#xE01F;&#x20;-&#x7F;]*"/>
      </xs:restriction>
    </xs:simpleType>

I added a regression test to Daffodil to make sure this works, and nothing 
breaks it in the future.
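
To see how such a pattern lines up with raw data, here is a rough Python sketch. It is not Daffodil code: remap_to_pua is a hypothetical helper that assumes the remapping simply adds 0xE000 to the control characters XML 1.0 cannot carry, and Python's re module stands in for the XSD regex dialect, which happens to agree for this character class. Daffodil's real remapping may differ in details.

```python
import re

# Control characters that XML 1.0 cannot represent (tab, LF, CR are allowed).
XML_ILLEGAL = set(range(0x00, 0x09)) | {0x0B, 0x0C} | set(range(0x0E, 0x20))

def remap_to_pua(text):
    """Hypothetical helper: shift XML-illegal characters into U+E000..U+E01F."""
    return "".join(
        chr(ord(c) + 0xE000) if ord(c) in XML_ILLEGAL else c for c in text
    )

# Same character class as the xs:pattern above, written with Python escapes.
PATTERN = re.compile(
    r"[\uE000-\uE008\t\n\uE00B\uE00C\r\uE00E-\uE01F\x20-\x7F]*"
)

def is_valid(data):
    """True if every character of the remapped data is in the class."""
    return PATTERN.fullmatch(remap_to_pua(data)) is not None

print(is_valid("\x00\x01\t\r\nABC~\x7f"))  # NUL, controls, and ASCII
print(is_valid("caf\xe9"))                 # contains a byte above 0x7F
```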

________________________________
From: Attila Horvath <attila.j.horv...@gmail.com>
Sent: Friday, March 19, 2021 9:03 AM
To: users@daffodil.apache.org <users@daffodil.apache.org>; Beckerle, Mike 
<mbecke...@owlcyberdefense.com>
Subject: Re: regex |AND| left over data

Re: "...you need your regex to say [&#xE000;&#x01;-&#x7F;]" as it relates to
all printable ASCII characters: the suggested syntax above throws a Daffodil
error:...
[image.png]
[image.png]

I also tried the following pattern, which does not throw a syntax error but
does throw a validation error, not recognizing a valid data pattern...
[image.png]
[image.png]

The only thing that seems to work w/o throwing error is:...
[image.png]

For posterity, I'd like to be able to specify a specific subset of printable
characters as hexadecimal values.

Suggestions? Recommendations?

Thx in advance - Attila

On Mon, Mar 1, 2021 at 2:32 PM Beckerle, Mike
<mbecke...@owlcyberdefense.com> wrote:
Ah, so you have some simple problems here, and this thorny little issue about 
the NUL character.

In your regex, the character entities are written like &#x7f with no
terminator. Each one must end with a trailing ";" to terminate the character
entity.

However, &#x00; is just plain disallowed by XML, period. You can't put a NUL
into XML even by using a character entity to do so. This is one of the things
I distinctly dislike about XML.

To cope with this, given that in DFDL people have to talk about real data with 
NUL in it, DFDL does a bi-directional remapping from 0 to &#xE000;

But you are trying to express a numeric range from char code 0 to char code
7F. So you can't just change your regex to use &#xE000;, because that's not
the bottom of the range.

To do what you want you need your regex to say [&#xE000;&#x01;-&#x7F;]
Notice the semicolons in there.
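
A quick way to check which numeric character references an XML 1.0 parser accepts is to try them. This sketch uses Python's standard-library parser (expat) as a stand-in, which is an assumption: Xerces and Daffodil may word their errors differently. The element and attribute names are made up. Note that it rejects &#x01; as well as &#x00;, which may be why the pattern at the top of this thread spells the low code points as &#xE000;-&#xE008; and so on.

```python
import xml.etree.ElementTree as ET

def parses_ok(attr_text):
    """True if a tiny document with this attribute text is well-formed."""
    try:
        ET.fromstring('<p value="' + attr_text + '"/>')
        return True
    except ET.ParseError:
        return False

# The last entry is missing its terminating semicolon.
candidates = ["&#x00;", "&#x01;", "&#x09;", "&#x7F;", "&#xE000;", "&#x7f"]
for text in candidates:
    print(text, parses_ok(text))
```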

With respect to the final CRLF at end of file, there are techniques to cope
with this.
We need to clarify what the canonical/preferred representation is, and whether
you want your schema to accept data that is missing this final CRLF.

Assuming the final CRLF is required, non-optional, you can change the newline 
separator to add the DFDL property

dfdl:separatorPosition="postfix"

Just on the sequence that contains the rows of data.

This means you get all the infix separator line-endings, plus one more at the 
end.

However, that one at the end is NOT optional. If not present, you'll get parse 
errors.
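
The postfix behavior can be pictured with a few lines of Python. This is only an illustration of the rule "every row, including the last, is followed by the separator"; split_rows_postfix is a hypothetical function, not anything Daffodil exposes.

```python
SEP = "\r\n"

def split_rows_postfix(data):
    """Split rows where each row, including the last, must end with SEP."""
    if data and not data.endswith(SEP):
        # Mirrors the parse error you get when the final separator is absent.
        raise ValueError("missing final separator")
    return data.split(SEP)[:-1] if data else []

print(split_rows_postfix("a,b\r\nc,d\r\n"))
```

Calling it on "a,b\r\nc,d" (no trailing CRLF) raises ValueError, which corresponds to the non-optional final separator.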

If you want a missing final CRLF to be tolerated on parsing, and its presence
or absence preserved when unparsing, then you actually have to model it as
a data element:

<element name="finalLineEnding" type="xs:string" minOccurs="0"
      dfdl:lengthKind="explicit" dfdl:length="0" dfdl:initiator="%CR;%LF;"/>

That final element will absorb, and represent, a final CRLF, and on unparsing, 
lay it down so it matches the input data.
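
The idea of modeling the optional final CRLF as data, so that it survives a round trip, can be sketched in Python. The parse and unparse functions and the dictionary layout are assumptions made for illustration; they are not how Daffodil represents an infoset.

```python
SEP = "\r\n"

def parse(data):
    """Record whether a final CRLF was present instead of discarding it."""
    had_final = data.endswith(SEP)
    body = data[:-len(SEP)] if had_final else data
    rows = body.split(SEP) if body else []
    return {"rows": rows, "finalLineEnding": had_final}

def unparse(infoset):
    """Write rows with infix separators, then the final CRLF only if recorded."""
    out = SEP.join(infoset["rows"])
    return out + SEP if infoset["finalLineEnding"] else out

for original in ["a,b\r\nc,d\r\n", "a,b\r\nc,d"]:
    print(unparse(parse(original)) == original)
```

Because the presence of the line ending is part of the parsed result, both inputs are reproduced as they were read.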

________________________________
From: Attila Horvath <attila.j.horv...@gmail.com>
Sent: Monday, March 1, 2021 2:03 PM
To: users@daffodil.apache.org <users@daffodil.apache.org>
Subject: Re: regex |AND| left over data

1) b) should read ...value="&#x00-&#x7f"

On 2021/03/01 18:58:08, Attila Horvath <attila.j.horv...@gmail.com> wrote:
> All - two quick questions...
>
> 1) regex
>
> I am trying to use a character range in a regular expression like:
>  a)...
>    <xs:restriction base="xs:string">
>      <xs:pattern value="[\x00-\x7F]{0,10}"/>
>    </xs:restriction>
>  |OR|
>  b)...
>    <xs:restriction base="xs:string">
>      <xs:pattern value="[&#x00-&#x7f]{0,10}"/>
>    </xs:restriction>
>  - either way both throw error(s) re: invalid regex expression syntax.
>  - what is correct syntax for range of hex values?
>
> 2) my CSV file has CR/LF at end of last line in file
>  - when parsing, I get numerous warnings ultimately "left over data"
> ...starting at byte xyz (0x0d0a...)
>  a) how to consume (parse) last two bytes and avoid warnings
>  b) how to reconstitute (unparse) so last two bytes are included
>
> Thx in advance
>
> Attila (newbie)
>
