Mike

Appreciate the response.

I'm trying to follow the customer's data spec, which allows only 'printable
characters' in certain fields, though the spec doesn't define what is or
isn't printable. Wikipedia has its own definition of printable characters
<https://en.wikipedia.org/wiki/ASCII#Printable_characters>. Technically,
for example, bell ^G [0x07] may/could be considered a printable character.
(I know, I'm showing my age, but such is life. ;)
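
To make the Wikipedia reading concrete, here is a minimal Python sketch for plain 7-bit ASCII. It assumes "printable" means whatever Python's str.isprintable() reports, which is only a stand-in; the customer's spec may well mean something else. The function name is made up for illustration.

```python
# Assumption: "printable" is taken to mean what str.isprintable() reports
# for 7-bit ASCII. The customer's spec may define it differently.
def ascii_printable_code_points():
    """Return the 7-bit ASCII code points Python treats as printable."""
    return [cp for cp in range(0x80) if chr(cp).isprintable()]

printable = ascii_printable_code_points()
print(f"range: 0x{printable[0]:02X}-0x{printable[-1]:02X}, count: {len(printable)}")
print("BEL (0x07) printable?", 0x07 in printable)
print("DEL (0x7F) printable?", 0x7F in printable)
```

Under that stand-in definition, both BEL (0x07) and DEL (0x7F) count as control characters, so a "printable only" range would stop at 0x7E rather than 0x7F.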

re: [&#xE000;&#x01;-&#x7F;]
This doesn't work: Daffodil throws errors, as does Notepad++.

Daffodil throws an error re: "[&#x01;-&#x7F;]" as well.

The best I can do is "[&#x20;-&#x7F;]". Anything under 0x20 throws an
error in Daffodil, which may be a problem because anything is allowed in
certain fields.
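
A likely explanation for these errors: XML 1.0 only permits character references to tab (0x09), LF (0x0A), CR (0x0D) and code points from 0x20 upward, so a schema containing &#x01; is rejected by the XML parser before the regex is ever evaluated. The sketch below checks this with Python's standard-library XML parser. It says nothing about how Daffodil itself treats the pattern, and char_ref_is_wellformed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def char_ref_is_wellformed(code_point):
    """Return True if an XML 1.0 parser accepts &#xNN; for this code point.

    Hypothetical helper: it wraps the reference in a tiny throwaway
    document and reports whether the standard-library parser accepts it.
    """
    doc = f'<p v="&#x{code_point:X};"/>'
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

# Code points taken from this thread, plus the whitespace exceptions.
for cp in (0x00, 0x01, 0x07, 0x09, 0x0A, 0x0D, 0x1F, 0x20, 0x7F, 0xE000):
    print(f"&#x{cp:X}; accepted: {char_ref_is_wellformed(cp)}")
```

If that is the cause, any range starting at &#x01; is not well-formed XML, while &#x09;, &#x0A;, &#x0D; and anything from &#x20; up should pass.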

re: dfdl:separatorPosition="postfix"
I made your recommended change. It successfully suppresses "left over data"
warnings.
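
As a toy illustration of the difference (plain Python string handling, not Daffodil; the rows and the unparse_rows name are made up): with an infix separator the CRLF appears only between rows, while with postfix it also follows the last row. That trailing pair is the 0x0D 0x0A that was being reported as left over.

```python
CRLF = "\r\n"

def unparse_rows(rows, position):
    """Join rows with CRLF as an infix or postfix separator (toy model)."""
    if position == "infix":
        return CRLF.join(rows)
    if position == "postfix":
        return "".join(row + CRLF for row in rows)
    raise ValueError(f"unsupported position: {position}")

rows = ["a,b,c", "d,e,f"]  # made-up sample data
infix = unparse_rows(rows, "infix")
postfix = unparse_rows(rows, "postfix")

print(repr(infix))
print(repr(postfix))
# The postfix form is the infix form plus one trailing CRLF.
print(postfix == infix + CRLF)
```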

Thx - Attila


On Mon, Mar 1, 2021 at 2:32 PM Beckerle, Mike <mbecke...@owlcyberdefense.com>
wrote:

> Ah, so you have some simple problems here, and this thorny little issue
> about the NUL character.
>
> In your regex, the character entities are written as &#x7f with no
> trailing ";". Each must end with ";" to terminate the character entity.
>
> However, &#x00; is just plain disallowed by XML, period. Can't put a NUL
> into XML even using a character entity to do so. This is one of the things
> I distinctly dislike about XML.
>
> To cope with this, given that in DFDL people have to talk about real data
> with NUL in it, DFDL does a bi-directional remapping from 0 to &#xE000;
>
> But, you are trying to express a numeric range that is from char code 0 to
> char code 7F. So you can't just change your regex to use &#xE000; because
> that's not the bottom of the range.
>
> To do what you want you need your regex to say [&#xE000;&#x01;-&#x7F;]
> Notice the semicolons in there.
>
> With respect to the final CRLF at end of file, there are techniques to
> cope with this.
> We need to clarify, what is the canonical/preferred representation, and
> whether you want your schema to accept data that is missing this final CRLF.
>
> Assuming the final CRLF is required, non-optional, you can change the
> newline separator to add the DFDL property
>
> dfdl:separatorPosition="postfix"
>
> Just on the sequence that contains the rows of data.
>
> This means you get all the infix separator line-endings, plus one more at
> the end.
>
> However, that one at the end is NOT optional. If not present, you'll get
> parse errors.
>
> If you want the final CRLF missing to be tolerated on parsing, and whether
> it is there or not preserved when unparsing, then you actually have to
> model it as a data element:
>
> <element name="finalLineEnding" type="xs:string" minOccurs="0"
>          dfdl:lengthKind="explicit" dfdl:length="0"
>          dfdl:initiator="%CR;%LF;"/>
>
> That final element will absorb, and represent, a final CRLF, and on
> unparsing, lay it down so it matches the input data.
>
> ------------------------------
> *From:* Attila Horvath <attila.j.horv...@gmail.com>
> *Sent:* Monday, March 1, 2021 2:03 PM
> *To:* users@daffodil.apache.org <users@daffodil.apache.org>
> *Subject:* Re: regex |AND| left over data
>
> 1) b) should read ...value="&#x00-&#x7f"
>
> On 2021/03/01 18:58:08, Attila Horvath <attila.j.horv...@gmail.com>
> wrote:
> > All - two quick questions...
> >
> > 1) regex
> >
> > I am trying to use a character range in a regex, like:
> >  a)...
> >    <xs:restriction base="xs:string">
> >      <xs:pattern value="[\x00-\x7F]{0,10}"/>
> >    </xs:restriction>
> >  |OR|
> >  b)...
> >    <xs:restriction base="xs:string">
> >      <xs:pattern value="[�- ]{0,10}"/>
> >    </xs:restriction>
> >  - either way both throw error(s) re: invalid regex expression syntax.
> >  - what is correct syntax for range of hex values?
> >
> > 2) my CSV file has CR/LF at end of last line in file
> >  - when parsing, I get numerous warnings ultimately "left over data"
> > ...starting at byte xyz (0x0d0a...)
> >  a) how to consume (parse) last two bytes and avoid warnings
> >  b) how to reconstitute (unparse) so last two bytes are included
> >
> > Thx in advance
> >
> > Attila (newbie)
> >
>
