Re: regex |AND| left over data

Stephen Sullivan Fri, 19 Mar 2021 12:03:26 -0700

This is really helpful information. I've been running into some issues
trying to define restrictions for things in 0x00 to 0x1f.
I'm going to give this a shot.


Thanks,
Steve

On Fri, Mar 19, 2021 at 12:37 PM Beckerle, Mike <
mbecke...@owlcyberdefense.com> wrote:

> Yeah, I worked through this in detail a little while ago. I learned the
> hard way that XML attributes can't contain tabs or line endings. So the
> value attribute of xs:pattern can't express tabs or line endings using XML
> entities like &#x9; because that's a tab, and XML will convert it to a
> space when it is found in an attribute.
>
> Here's what works:
>
>    <xs:simpleType name="singleByteCharsWithCodepointLessThan1F"
>                     dfdl:encoding="iso-8859-1">
>       <xs:restriction base="xs:string">
>         <!--
>           We cannot depend on Daffodil's mapping from E009 to 9, E00A to A,
>           and E00D to D.
>           Because that won't work in Xerces with "full" validation.
>
>           We can't use numeric entities for the tab, LF, or CR, because
>           those aren't allowed in attribute values (XML attribute
>           normalization converts tabs and line endings to spaces).
>           So we use \t, \n, and \r for those control characters.
>
>           The result is this somewhat clumsy pattern is what is needed in
>           DFDL to say you want only code points in the original data of
>           0x00 to 0x1F to be valid.
>           -->
>         <xs:pattern
>           value="[&#xE000;-&#xE008;\t\n&#xE00B;&#xE00C;\r&#xE00E;-&#xE01F;&#x20;-&#x7F;]*"/>
>       </xs:restriction>
>     </xs:simpleType>
>
> I added a regression test to Daffodil to make sure this works, and nothing
> breaks it in the future.
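>
> For context, a minimal sketch of how a type like this might be attached to
> an element. The element name "controlField", the "ex:" prefix, and the
> choice of dfdl:lengthKind are assumptions for illustration, not taken from
> the schema discussed here. Note that a pattern facet is checked by
> validation (or by an explicit dfdl:checkConstraints assertion), not by
> ordinary parsing.
>
> ```xml
> <!-- Hypothetical element; "ex" is assumed to be bound to the schema's
>      target namespace, and other DFDL properties come from a default format -->
> <xs:element name="controlField"
>             type="ex:singleByteCharsWithCodepointLessThan1F"
>             dfdl:lengthKind="delimited"/>
> ```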
>
> ------------------------------
> *From:* Attila Horvath <attila.j.horv...@gmail.com>
> *Sent:* Friday, March 19, 2021 9:03 AM
> *To:* users@daffodil.apache.org <users@daffodil.apache.org>; Beckerle,
> Mike <mbecke...@owlcyberdefense.com>
> *Subject:* Re: regex |AND| left over data
>
> re: "...you need your regex to say [&#xE000;&#x01;-&#x7F;]" as it relates
> to all printable ASCII characters,
> -suggested syntax above throws Daffodil error:...
> [image: image.png]
> [image: image.png]
>
> Also tried following pattern which does not throw syntax error but throws
> validation error - not recognizing valid data pattern...
> [image: image.png]
> [image: image.png]
>
> The only thing that seems to work w/o throwing error is:...
> [image: image.png]
>
> For posterity, I'd like to be able to specify a specific subset of
> printable characters as hexadecimal values.
>
> Suggestions? Recommendations?
>
> Thx in advance - Attila
>
> On Mon, Mar 1, 2021 at 2:32 PM Beckerle, Mike <
> mbecke...@owlcyberdefense.com> wrote:
>
> Ah, so you have some simple problems here, and this thorny little issue
> about the NUL character.
>
> In your regex, the character entity is written as &#x7f with no terminator.
> It must have a trailing ";" to terminate the character entity.
>
> However, &#x00; is just plain disallowed by XML, period. You can't put a
> NUL into XML, even using a character entity to do so. This is one of the
> things I distinctly dislike about XML.
>
> To cope with this, given that in DFDL people have to talk about real data
> with NUL in it, DFDL does a bi-directional remapping from 0 to &#xE000;
>
> But you are trying to express a numeric range that runs from char code 0
> to char code 7F. So you can't just change your regex to use &#xE000; in
> place of 0, because E000 is above 7F and so is not the bottom of the range.
>
> To do what you want you need your regex to say [&#xE000;&#x01;-&#x7F;]
> Notice the semicolons in there.
>
> With respect to the final CRLF at end of file, there are techniques to
> cope with this.
> We need to clarify, what is the canonical/preferred representation, and
> whether you want your schema to accept data that is missing this final CRLF.
>
> Assuming the final CRLF is required, non-optional, you can change the
> newline separator to add the DFDL property
>
> dfdl:separatorPosition="postfix"
>
> Just on the sequence that contains the rows of data.
>
> This means you get all the infix separator line-endings, plus one more at
> the end.
>
> However, that one at the end is NOT optional. If not present, you'll get
> parse errors.
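>
> A minimal sketch of the postfix arrangement, assuming the rows are modeled
> as a repeating element directly inside the sequence. The names "row" and
> "ex:rowType" are placeholders, not from the original schema, and the
> remaining DFDL properties are assumed to come from a default format.
>
> ```xml
> <!-- Each row is followed by CRLF, including the last one -->
> <xs:sequence dfdl:separator="%CR;%LF;"
>              dfdl:separatorPosition="postfix">
>   <xs:element name="row" type="ex:rowType" maxOccurs="unbounded"/>
> </xs:sequence>
> ```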
>
> If you want a missing final CRLF to be tolerated on parsing, and its
> presence or absence to be preserved when unparsing, then you actually have
> to model it as a data element:
>
> <element name="finalLineEnding" type="xs:string" minOccurs="0"
>       dfdl:lengthKind="explicit" dfdl:length="0"
> dfdl:initiator="%CR;%LF;"/>
>
> That final element will absorb, and represent, a final CRLF, and on
> unparsing, lay it down so it matches the input data.
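>
> As a sketch of where that element could sit, assuming an enclosing sequence
> with no separator of its own that holds the rows first and the optional
> line ending last. The names "file", "row", and "ex:rowType" and the "xs:"
> prefix are assumptions, and this layout is untested; whether it parses
> cleanly depends on the rest of the format.
>
> ```xml
> <xs:element name="file">
>   <xs:complexType>
>     <xs:sequence>
>       <!-- Rows separated by infix CRLF, with no separator after the last row -->
>       <xs:sequence dfdl:separator="%CR;%LF;"
>                    dfdl:separatorPosition="infix">
>         <xs:element name="row" type="ex:rowType" maxOccurs="unbounded"/>
>       </xs:sequence>
>       <!-- Optional zero-length element; its initiator absorbs a final CRLF -->
>       <xs:element name="finalLineEnding" type="xs:string" minOccurs="0"
>                   dfdl:lengthKind="explicit" dfdl:length="0"
>                   dfdl:initiator="%CR;%LF;"/>
>     </xs:sequence>
>   </xs:complexType>
> </xs:element>
> ```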
>
> ------------------------------
> *From:* Attila Horvath <attila.j.horv...@gmail.com>
> *Sent:* Monday, March 1, 2021 2:03 PM
> *To:* users@daffodil.apache.org <users@daffodil.apache.org>
> *Subject:* Re: regex |AND| left over data
>
> 1) b) should read ...value="&#x00-&#x7f"
>
> On 2021/03/01 18:58:08, Attila Horvath <attila.j.horv...@gmail.com>
> wrote:
> > All - two quick questions...
> >
> > 1) regex
> >
> > I am trying to use character range query in regex-pression like:
> >  a)...
> >    <xs:restriction base="xs:string">
> >      <xs:pattern value="[\x00-\x7F]{0,10}"/>
> >    </xs:restriction>
> >  |OR|
> >  b)...
> >    <xs:restriction base="xs:string">
> >      <xs:pattern value="[&#x00-&#x7f]{0,10}"/>
> >    </xs:restriction>
> >  - either way both throw error(s) re: invalid regex expression syntax.
> >  - what is correct syntax for range of hex values?
> >
> > 2) my CSV file has CR/LF at end of last line in file
> >  - when parsing, I get numerous warnings ultimately "left over data"
> > ...starting at byte xyz (0x0d0a...)
> >  a) how to consume (parse) last two bytes and avoid warnings
> >  b) how to reconstitute (unparse) so last two bytes are included
> >
> > Thx in advance
> >
> > Attila (newbie)
> >
>
>

-- 

-

To err is human; to forgive, beyond the scope of the Operating System.
