Sweet, golden...

re:...
> If you want the final CRLF missing to be tolerated on parsing, and
> whether it is there or not preserved when unparsing,
> then you actually have to model it as a data element:
> <element name="finalLineEnding" type="xs:string" minOccurs="0"
>       dfdl:lengthKind="explicit" dfdl:length="0" dfdl:initiator="%CR;%LF;"/>

|OR|
>       dfdl:lengthKind="explicit" dfdl:length="0" dfdl:initiator="%NL;"/>

In my particular case, I have no way of knowing what form the record
delimiters take or how the file is terminated: whether a final "%NL;" will
be present and, if so, what form it will take.
I struggled with this a bit because [1] I'm a schema newbie and [2] I didn't
know where to apply the schema snippet.

Having incorporated Mike B's suggested snippet (above), I've reverted to
dfdl:separatorPosition="infix" instead of "postfix".
After some trial and error and lots of rereading, I narrowed down where the
snippet belongs and verified that it works...
[image: image.png]
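
For anyone who can't see the screenshot, here is a minimal sketch of one
placement that fits the description above. It is an illustration under
assumptions, not the exact schema: the element names "file" and "record" are
hypothetical, the xs: and dfdl: prefixes are assumed to be bound to the XML
Schema and DFDL namespaces, all other format properties are assumed to come
from a default format defined elsewhere in the schema, and the per-field
structure of each record is omitted.

```xml
<xs:element name="file">  <!-- hypothetical name -->
  <xs:complexType>
    <xs:sequence>
      <!-- Rows separated by infix newlines: no separator is expected
           after the last row. -->
      <xs:sequence dfdl:separator="%NL;" dfdl:separatorPosition="infix">
        <!-- hypothetical row element; field structure omitted -->
        <xs:element name="record" type="xs:string" maxOccurs="unbounded"
              dfdl:lengthKind="delimited"/>
      </xs:sequence>
      <!-- Optional zero-length element whose initiator absorbs a final
           line ending when one is present. -->
      <xs:element name="finalLineEnding" type="xs:string" minOccurs="0"
            dfdl:lengthKind="explicit" dfdl:length="0" dfdl:initiator="%NL;"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
```

The idea is that the optional element sits after the sequence of rows rather
than inside it. Using "%NL;" rather than "%CR;%LF;" as the initiator is the
more permissive choice, since "%NL;" matches several line-ending forms,
CRLF included.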

As usual, thx for the assist.

Attila

On 2021/03/01 19:32:33, "Beckerle, Mike" <mbecke...@owlcyberdefense.com>
wrote:
> Ah, so you have some simple problems here, and this thorny little issue
> about the NUL character.
>
> In your regex, the character entities are written like &#x7f with no
> terminator. Each one must have a trailing ";" to terminate the character
> entity.
>
> However, &#x00; is just plain disallowed by XML, period. You can't put a
> NUL into XML even using a character entity to do so. This is one of the
> things I distinctly dislike about XML.
>
> To cope with this, given that in DFDL people have to talk about real data
> with NUL in it, DFDL does a bi-directional remapping from 0 to &#xE000;
>
> But you are trying to express a numeric range that is from char code 0
> to char code 7F. So you can't just change your regex to use &#xE000;
> because that's not the bottom of the range.
>
> To do what you want you need your regex to say [&#xE000;&#x01;-&#x7F;]
> Notice the semicolons in there.
>
> With respect to the final CRLF at end of file, there are techniques to
> cope with this.
> We need to clarify what the canonical/preferred representation is, and
> whether you want your schema to accept data that is missing this final
> CRLF.
>
> Assuming the final CRLF is required, non-optional, you can change the
> newline separator to add the DFDL property
>
> dfdl:separatorPosition="postfix"
>
> Just on the sequence that contains the rows of data.
>
> This means you get all the infix separator line-endings, plus one more at
> the end.
>
> However, that one at the end is NOT optional. If not present, you'll get
> parse errors.
>
> If you want the final CRLF missing to be tolerated on parsing, and
> whether it is there or not preserved when unparsing, then you actually
> have to model it as a data element:
>
> <element name="finalLineEnding" type="xs:string" minOccurs="0"
>       dfdl:lengthKind="explicit" dfdl:length="0"
>       dfdl:initiator="%CR;%LF;"/>
>
> That final element will absorb, and represent, a final CRLF, and on
> unparsing, lay it down so it matches the input data.
>
> ________________________________
> From: Attila Horvath <attila.j.horv...@gmail.com>
> Sent: Monday, March 1, 2021 2:03 PM
> To: users@daffodil.apache.org <users@daffodil.apache.org>
> Subject: Re: regex |AND| left over data
>
> 1) b) should read ...value="&#x00-&#x7f"
>
> On 2021/03/01 18:58:08, Attila Horvath <attila.j.horv...@gmail.com> wrote:
> > All - two quick questions...
> >
> > 1) regex
> >
> > I am trying to use a character range in a regular expression, like:
> >  a)...
> >    <xs:restriction base="xs:string">
> >      <xs:pattern value="[\x00-\x7F]{0,10}"/>
> >    </xs:restriction>
> >  |OR|
> >  b)...
> >    <xs:restriction base="xs:string">
> >      <xs:pattern value="[&#x00-&#x7f]{0,10}"/>
> >    </xs:restriction>
> >  - either way both throw error(s) re: invalid regex expression syntax.
> >  - what is correct syntax for range of hex values?
> >
> > 2) my CSV file has CR/LF at the end of the last line in the file
> >  - when parsing, I get numerous warnings, ultimately "left over data"
> >    ...starting at byte xyz (0x0d0a...)
> >  a) how to consume (parse) the last two bytes and avoid the warnings
> >  b) how to reconstitute (unparse) so the last two bytes are included
> >
> > Thx in advance
> >
> > Attila (newbie)
> >
>
