This is really helpful information. I've been running into some issues trying to define restrictions for things in 0x00 to 0x1f. I'm going to give this a shot.
Thanks,
Steve

On Fri, Mar 19, 2021 at 12:37 PM Beckerle, Mike <mbecke...@owlcyberdefense.com> wrote:

> Yeah, I worked through this in detail a little while ago. I learned the hard
> way that XML attributes cannot contain tabs or line endings. Hence, inside
> xs:pattern, the value attribute can't express tabs or line endings using
> XML entities like &#x9;, because that's a tab, and XML will convert it to a
> space when it is found in an attribute.
>
> Here's what works:
>
> <xs:simpleType name="singleByteCharsWithCodepointLessThan1F"
>     dfdl:encoding="iso-8859-1">
>   <xs:restriction base="xs:string">
>     <!--
>       We cannot depend on Daffodil's mapping from E009 to 9, E00A to A,
>       and E00D to D, because that won't work in Xerces with "full"
>       validation.
>
>       We can't use numeric entities for the tab, LF, or CR, because those
>       aren't allowed in attribute values (XML attribute normalization
>       converts tabs and line endings to spaces).
>       So we use \t, \n, and \r for those control characters.
>
>       The result is that this somewhat clumsy pattern is what is needed
>       in DFDL to say you want only code points 0x00 to 0x1F in the
>       original data to be valid.
>     -->
>     <xs:pattern
>       value="[&#xE000;-&#xE008;\t\n\r&#xE00B;-&#xE00C;&#xE00E;-&#xE01F;]*"/>
>   </xs:restriction>
> </xs:simpleType>
>
> I added a regression test to Daffodil to make sure this works and that
> nothing breaks it in the future.
>
> ------------------------------
> *From:* Attila Horvath <attila.j.horv...@gmail.com>
> *Sent:* Friday, March 19, 2021 9:03 AM
> *To:* users@daffodil.apache.org <users@daffodil.apache.org>; Beckerle, Mike <mbecke...@owlcyberdefense.com>
> *Subject:* Re: regex |AND| left over data
>
> re: "...you need your regex to say [-]" as it relates to all printable
> ASCII characters,
> - suggested syntax above throws Daffodil error:...
> [image: image.png]
> [image: image.png]
>
> Also tried the following pattern, which does not throw a syntax error but
> throws a validation error, not recognizing a valid data pattern...
> [image: image.png]
> [image: image.png]
>
> The only thing that seems to work w/o throwing an error is:...
> [image: image.png]
>
> For posterity, I'd like to be able to specify a specific subset of
> printable characters as hexadecimal values.
>
> Suggestions? Recommendations?
>
> Thx in advance - Attila
>
> On Mon, Mar 1, 2021 at 2:32 PM Beckerle, Mike <mbecke...@owlcyberdefense.com> wrote:
>
> Ah, so you have some simple problems here, and this thorny little issue
> about the NUL character.
>
> In your regex, the character entities say &#x00 with no terminator; each
> must have a trailing ";" to terminate the character entity.
>
> However, &#x0; is just plain disallowed by XML, period. You can't put a NUL
> into XML even using a character entity to do so. This is one of the things
> I distinctly dislike about XML.
>
> To cope with this, given that in DFDL people have to talk about real data
> with NUL in it, DFDL does a bi-directional remapping from 0 to &#xE000;.
>
> But you are trying to express a numeric range that is from char code 0 to
> char code 7F. So you can't just change your regex to use &#xE000;, because
> that's not the bottom of the range.
>
> To do what you want you need your regex to say [-]
> Notice the semicolons in there.
>
> With respect to the final CRLF at end of file, there are techniques to
> cope with this.
> We need to clarify what the canonical/preferred representation is, and
> whether you want your schema to accept data that is missing this final CRLF.
>
> Assuming the final CRLF is required (non-optional), you can change the
> newline separator by adding the DFDL property
>
> dfdl:separatorPosition="postfix"
>
> just on the sequence that contains the rows of data.
>
> This means you get all the infix separator line endings, plus one more at
> the end.
>
> However, that one at the end is NOT optional. If it is not present, you'll
> get parse errors.
>
> If you want a missing final CRLF to be tolerated on parsing, and whether
> it is there or not to be preserved when unparsing, then you actually have
> to model it as a data element:
>
> <element name="finalLineEnding" type="xs:string" minOccurs="0"
>     dfdl:lengthKind="explicit" dfdl:length="0"
>     dfdl:initiator="%CR;%LF;"/>
>
> That final element will absorb, and represent, a final CRLF, and on
> unparsing, lay it down so it matches the input data.
>
> ------------------------------
> *From:* Attila Horvath <attila.j.horv...@gmail.com>
> *Sent:* Monday, March 1, 2021 2:03 PM
> *To:* users@daffodil.apache.org <users@daffodil.apache.org>
> *Subject:* Re: regex |AND| left over data
>
> 1) b) should read ...value="&#x00-&#x7F"
>
> On 2021/03/01 18:58:08, Attila Horvath <attila.j.horv...@gmail.com> wrote:
> > All - two quick questions...
> >
> > 1) regex
> >
> > I am trying to use a character range in a regular expression like:
> > a)...
> > <xs:restriction base="xs:string">
> >   <xs:pattern value="[\x00-\x7F]{0,10}"/>
> > </xs:restriction>
> > |OR|
> > b)...
> > <xs:restriction base="xs:string">
> >   <xs:pattern value="[&#x00- ]{0,10}"/>
> > </xs:restriction>
> > - either way, both throw error(s) re: invalid regex expression syntax.
> > - what is the correct syntax for a range of hex values?
> >
> > 2) my CSV file has CR/LF at the end of the last line in the file
> > - when parsing, I get numerous warnings, ultimately "left over data"
> >   ...starting at byte xyz (0x0d0a...)
> > a) how to consume (parse) the last two bytes and avoid warnings
> > b) how to reconstitute (unparse) so the last two bytes are included
> >
> > Thx in advance
> >
> > Attila (newbie)
> >
> > --
> > - To err is human; to forgive, beyond the scope of the Operating System.
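
For the dfdl:separatorPosition="postfix" suggestion in the March 1 reply, here is a minimal sketch of where the property would sit, assuming each row is a plain string terminated by CRLF. The element names "file" and "row" are hypothetical and do not come from the thread, and the remaining required DFDL format properties are assumed to be supplied by a dfdl:format annotation that is omitted here.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

  <!-- Assumption: a dfdl:format annotation supplying the other required
       properties (encoding, representation, etc.) is omitted for brevity. -->

  <!-- "file" and "row" are hypothetical names, not from the thread. -->
  <xs:element name="file">
    <xs:complexType>
      <!-- postfix: a CRLF follows every row, including the last one -->
      <xs:sequence dfdl:separator="%CR;%LF;"
                   dfdl:separatorPosition="postfix">
        <xs:element name="row" type="xs:string" maxOccurs="unbounded"
                    dfdl:lengthKind="delimited"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
```

With this arrangement the trailing CRLF is mandatory, so a file that lacks it should fail to parse. If the final CRLF has to be optional, the finalLineEnding element quoted above is the alternative to consider.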