Re: regex |AND| left over data

Attila Horvath Thu, 04 Mar 2021 02:34:38 -0800

Thx Mike - good information - I'll give it a shot.

re: Daffodil version?


I'm using latest/greatest version 3.0

On Wed, Mar 3, 2021 at 9:14 PM Beckerle, Mike <mbecke...@owlcyberdefense.com>
wrote:

> Ah, and that pattern works for Daffodil limited validation, but not full
> "on" validation, which uses Xerces.
>
> This pattern works in all scenarios:
>
> <xs:pattern value="[&#xE000;-&#xE008;\t\n&#xE00B;&#xE00C;\r&#xE00E;-&#xE01F;&#x20;-&#x7F;]*"/>
>
> (Has to be all on one line. Email may break it since it's a long line.)
>
> The difference is the \t\n and \r in there. Those characters have to be
> expressed that way.
>
> Ticket DAFFODIL-2474 is about this.
> https://issues.apache.org/jira/browse/DAFFODIL-2474
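>
> A minimal sketch of how this pattern might sit inside the simple type from
> the earlier message. The type name is hypothetical, the xs and dfdl prefixes
> are assumed to be bound as usual, and the pattern value has to stay on a
> single line:
>
> ```xml
> <xs:simpleType name="asciiCharsExample" dfdl:encoding="iso-8859-1">
>   <xs:restriction base="xs:string">
>     <!-- E000-E01F stand for the remapped C0 controls; tab, LF, and CR
>          are written as \t, \n, and \r instead -->
>     <xs:pattern value="[&#xE000;-&#xE008;\t\n&#xE00B;&#xE00C;\r&#xE00E;-&#xE01F;&#x20;-&#x7F;]*"/>
>   </xs:restriction>
> </xs:simpleType>
> ```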
>
>
> ------------------------------
> *From:* Beckerle, Mike <mbecke...@owlcyberdefense.com>
> *Sent:* Wednesday, March 3, 2021 7:02 PM
> *To:* users@daffodil.apache.org <users@daffodil.apache.org>
> *Subject:* Re: regex |AND| left over data
>
> Attila,
>
> Are you using Daffodil 3.0.0 or earlier?
>
> This bug ticket https://issues.apache.org/jira/browse/DAFFODIL-2363
>
> was not fixed until that release, and it relates directly to this issue.
>
> Please read the section titled *XML Illegal Characters* on this page:
> https://daffodil.apache.org/infoset/
>
> The pattern facet you need to enforce char-codes below 7F only, is this:
>
> <xs:pattern value="[&#xE000;-&#xE01F;&#x20;-&#x7F;]*"/>
>
> I tested this on Daffodil 3.0.0 and it works there. Given this I think you
> can see how to disallow/allow various of the C0 control characters which
> appear in the above as E000-E01F.
>
> <xs:simpleType name="singleByteCharsWithCodepointLessThan1F"
>     dfdl:encoding="iso-8859-1">
>     <!--
>        Encoding iso-8859-1 means any single byte character at all will be
>        considered well-formed.
>        However, only the characters with codes less than 7F will be
>        considered valid.
>        -->
>      <xs:restriction base="xs:string">
>          <xs:pattern value="[&#xE000;-&#xE01F;&#x20;-&#x7F;]*"/>
>      </xs:restriction>
> </xs:simpleType>
>
> -mikeb
>
> ------------------------------
> *From:* Attila Horvath <attila.j.horv...@gmail.com>
> *Sent:* Wednesday, March 3, 2021 2:07 PM
> *To:* users@daffodil.apache.org <users@daffodil.apache.org>
> *Subject:* Re: regex |AND| left over data
>
> Mike
>
> Appreciate the response.
>
> I'm trying to follow customer's data spec to only allow 'printable
> characters' in certain fields though spec doesn't define what is/isn't
> printable. Wikipedia has its own definition of printable characters
> <https://en.wikipedia.org/wiki/ASCII#Printable_characters>. Technically,
> for example, bell ^G [0x07] may/could be considered a printable character.
> (I know, I'm showing my age but such is life. ;)
>
> re: [&#xE000;&#x01;-&#x7F;]
> This doesn't work. - daffodil throws errors as does Notepad++.
>
> Daffodil throws an error re: "[&#x01;-&#x7F;]" as well.
>
> The best I can do is "[&#x20;-&#x7F;]" - anything under 0x20 throws an
> error in Daffodil which may be a problem as anything is allowed in certain
> fields.
>
> re: dfdl:separatorPosition="postfix"
> I made your recommended change. It successfully suppresses "left over
> data" warnings.
>
> Thx - Attila
>
>
> On Mon, Mar 1, 2021 at 2:32 PM Beckerle, Mike <
> mbecke...@owlcyberdefense.com> wrote:
>
> Ah, so you have some simple problems here, and this thorny little issue
> about the NUL character.
>
> In your regex, the character entities are written as &#x7f with no
> terminator. Each one needs a trailing ";" to end the character entity.
>
> However, &#x00; is just plain disallowed by XML, period. Can't put a NUL
> into XML even using a character entity to do so. This is one of the things
> I distinctly dislike about XML.
>
> To cope with this, given that in DFDL people have to talk about real data
> with NUL in it, DFDL does a bi-directional remapping from 0 to &#xE000;
>
> But, you are trying to express a numeric range that is from char code 0 to
> char  code 7F.  So you can't just change your regex to use &#xE000; because
> that's not the bottom of the range.
>
> To do what you want you need your regex to say [&#xE000;&#x01;-&#x7F;]
> Notice the semicolons in there.
>
> With respect to the final CRLF at end of file, there are techniques to
> cope with this.
> We need to clarify, what is the canonical/preferred representation, and
> whether you want your schema to accept data that is missing this final CRLF.
>
> Assuming the final CRLF is required, non-optional, you can change the
> newline separator to add the DFDL property
>
> dfdl:separatorPosition="postfix"
>
> Just on the sequence that contains the rows of data.
>
> This means you get all the infix separator line-endings, plus one more at
> the end.
>
> However, that one at the end is NOT optional. If not present, you'll get
> parse errors.
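>
> As a rough sketch of that arrangement (the element name "row" is
> hypothetical, and any other required DFDL properties are assumed to come
> from a dfdl:format defined elsewhere in the schema):
>
> ```xml
> <xs:sequence dfdl:separator="%CR;%LF;" dfdl:separatorPosition="postfix">
>   <!-- one row per line; every row, including the last, must be followed by CRLF -->
>   <xs:element name="row" type="xs:string" maxOccurs="unbounded"
>       dfdl:lengthKind="delimited"/>
> </xs:sequence>
> ```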
>
> If you want a missing final CRLF to be tolerated when parsing, and its
> presence or absence to be preserved when unparsing, then you actually have
> to model it as a data element:
>
> <element name="finalLineEnding" type="xs:string" minOccurs="0"
>       dfdl:lengthKind="explicit" dfdl:length="0"
>       dfdl:initiator="%CR;%LF;"/>
>
> That final element will absorb, and represent, a final CRLF, and on
> unparsing, lay it down so it matches the input data.
>
> ------------------------------
> *From:* Attila Horvath <attila.j.horv...@gmail.com>
> *Sent:* Monday, March 1, 2021 2:03 PM
> *To:* users@daffodil.apache.org <users@daffodil.apache.org>
> *Subject:* Re: regex |AND| left over data
>
> 1) b) should read ...value="&#x00-&#x7f"
>
> On 2021/03/01 18:58:08, Attila Horvath <attila.j.horv...@gmail.com>
> wrote:
> > All - two quick questions...
> >
> > 1) regex
> >
> > I am trying to use a character range in a regular expression like:
> >  a)...
> >    <xs:restriction base="xs:string">
> >      <xs:pattern value="[\x00-\x7F]{0,10}"/>
> >    </xs:restriction>
> >  |OR|
> >  b)...
> >    <xs:restriction base="xs:string">
> >      <xs:pattern value="[�- ]{0,10}"/>
> >    </xs:restriction>
> >  - either way both throw error(s) re: invalid regex expression syntax.
> >  - what is correct syntax for range of hex values?
> >
> > 2) my CSV file has CR/LF at the end of the last line in the file
> >  - when parsing, I get numerous warnings ultimately "left over data"
> > ...starting at byte xyz (0x0d0a...)
> >  a) how to consume (parse) last two bytes and avoid warnings
> >  b) how to reconstitute (unparse) so last two bytes are included
> >
> > Thx in advance
> >
> > Attila (newbie)
> >
>
>
