Ah, and that pattern works for Daffodil limited validation, but not full "on" validation, which uses Xerces.
This pattern works in all scenarios:

    <xs:pattern value="[-\t\n\r- -]*"/>

(It has to be all on one line. Email may break it since it's a long line.)

The difference is the \t, \n, and \r in there. Those characters have to be expressed that way.

Ticket DAFFODIL-2474 is about this.
https://issues.apache.org/jira/browse/DAFFODIL-2474

________________________________
From: Beckerle, Mike <mbecke...@owlcyberdefense.com>
Sent: Wednesday, March 3, 2021 7:02 PM
To: users@daffodil.apache.org <users@daffodil.apache.org>
Subject: Re: regex |AND| left over data

Attila,

Are you using Daffodil 3.0.0 or earlier?

This bug ticket https://issues.apache.org/jira/browse/DAFFODIL-2363 was not fixed until that release, and it relates directly to this issue.

Please read the section titled XML Illegal Characters on this page: https://daffodil.apache.org/infoset/

The pattern facet you need, to enforce char-codes below 7F only, is this:

    <xs:pattern value="[- -]*"/>

I tested this on Daffodil 3.0.0 and it works there.

Given this, I think you can see how to disallow/allow various of the C0 control characters, which appear in the above as E000-E01F.

    <xs:simpleType name="singleByteCharsWithCodepointLessThan1F" dfdl:encoding="iso-8859-1">
      <!-- Encoding iso-8859-1 means any single byte character at all will be
           considered well-formed. However, only the characters with codes
           less than 7F will be considered valid. -->
      <xs:restriction base="xs:string">
        <xs:pattern value="[- -]"/>
      </xs:restriction>
    </xs:simpleType>

-mikeb

Mike Beckerle | Principal Engineer
mbecke...@owlcyberdefense.com
________________________________
From: Attila Horvath <attila.j.horv...@gmail.com>
Sent: Wednesday, March 3, 2021 2:07 PM
To: users@daffodil.apache.org <users@daffodil.apache.org>
Subject: Re: regex |AND| left over data

Mike

Appreciate the response.

I'm trying to follow customer's data spec to only allow 'printable characters' in certain fields though spec doesn't define what is/isn't printable. Wikipedia has its own definition of printable characters <https://en.wikipedia.org/wiki/ASCII#Printable_characters>. Technically, for example, bell ^G [0x07] may/could be considered a printable character. ( I know, I'm showing my age but such is life. ;)

re: [-]
This doesn't work.
- daffodil throws errors as does Notepad++.
Daffodil throws an error re: "[-]" as well.
The best I can do is "[ -]"
- anything under 0x20 throws an error in Daffodil which may be a problem as anything is allowed in certain fields.

re: dfdl:separatorPosition="postfix"
I made your recommended change. It successfully suppresses "left over data" warnings.

Thx - Attila

On Mon, Mar 1, 2021 at 2:32 PM Beckerle, Mike <mbecke...@owlcyberdefense.com> wrote:

Ah, so you have some simple problems here, and this thorny little issue about the NUL character.
Your regex, the character entities say  this must have a trailing ";" to terminate the character entity.

However, � is just plain disallowed by XML period. Can't put a NUL into XML even using a character entity to do so. This is one of the things I distinctly dislike about XML.

To cope with this, given that in DFDL people have to talk about real data with NUL in it, DFDL does a bi-directional remapping from 0 to 

But, you are trying to express a numeric range that is from char code 0 to char code 7F. So you can't just change your regex to use  because that's not the bottom of the range.

To do what you want you need your regex to say

    [-]

Notice the semicolons in there.

With respect to the final CRLF at end of file, there are techniques to cope with this. We need to clarify what the canonical/preferred representation is, and whether you want your schema to accept data that is missing this final CRLF.

Assuming the final CRLF is required, non-optional, you can change the newline separator to add the DFDL property

    dfdl:separatorPosition="postfix"

Just on the sequence that contains the rows of data. This means you get all the infix separator line-endings, plus one more at the end. However, that one at the end is NOT optional. If not present, you'll get parse errors.

If you want the final CRLF missing to be tolerated on parsing, and whether it is there or not preserved when unparsing, then you actually have to model it as a data element:

    <element name="finalLineEnding" type="xs:string" minOccurs="0"
             dfdl:lengthKind="explicit" dfdl:length="0"
             dfdl:initiator="%CR;%LF;"/>

That final element will absorb, and represent, a final CRLF, and on unparsing, lay it down so it matches the input data.
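
For illustration only, here is a minimal sketch of where the postfix separator property described above might sit in a CSV-style schema. The element names "file", "row", and "field" are hypothetical and do not come from this thread. The fragment also assumes that the xs and dfdl prefixes are bound and that a default dfdl:format (encoding, escape scheme, and so on) is already in scope elsewhere in the schema, so it is not a complete schema on its own.

```xml
<!-- Hypothetical names: "file", "row", "field". Assumes a default
     dfdl:format is in scope and the xs/dfdl prefixes are declared. -->
<xs:element name="file">
  <xs:complexType>
    <!-- Rows are separated by CRLF. "postfix" means a CRLF is also
         required after the last row. -->
    <xs:sequence dfdl:separator="%CR;%LF;"
                 dfdl:separatorPosition="postfix">
      <xs:element name="row" maxOccurs="unbounded">
        <xs:complexType>
          <!-- Fields within a row stay infix: commas only between fields. -->
          <xs:sequence dfdl:separator=","
                       dfdl:separatorPosition="infix">
            <xs:element name="field" type="xs:string"
                        maxOccurs="unbounded"
                        dfdl:lengthKind="delimited"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>
```

With this layout the property is applied only to the outer sequence that holds the rows, which is the placement the message above recommends. Data without the final CRLF would be expected to fail to parse.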
________________________________
From: Attila Horvath <attila.j.horv...@gmail.com>
Sent: Monday, March 1, 2021 2:03 PM
To: users@daffodil.apache.org <users@daffodil.apache.org>
Subject: Re: regex |AND| left over data

1) b) should read ...value="�-"

On 2021/03/01 18:58:08, Attila Horvath <attila.j.horv...@gmail.com> wrote:
> All - two quick questions...
>
> 1) regex
>
> I am trying to use character range query in regex-pression like:
> a)...
> <xs: restriction base="xs:string">
> <xs:pattern value="[\x00-\x7F]{0,10}"/>
> </cs:restriction>
> |OR|
> b)...
> <xs: restriction base="xs:string">
> <xs:pattern value="[�- ]{0,10}"/>
> </cs:restriction>
> - either way both throw error(s) re: invalid regex expression syntax.
> - what is correct syntax for range of hex values?
>
> 2) my CSV files has CR/LF at end of last line in file
> - when parsing, I get numerous warnings ultimately "left over data"
> ...starting at byte xyz (0x0d0a...)
> a) how to consume (parse) last two bytes and avoid warnings
> b) how to reconstitute (unparse) so last two bytes are included
>
> Thx in advance
>
> Attila (newbie)
>