Thx Mike - good information - I'll give it a shot. re: Daffodil version?
I'm using latest/greatest version 3.0

On Wed, Mar 3, 2021 at 9:14 PM Beckerle, Mike <mbecke...@owlcyberdefense.com> wrote:

> Ah, and that pattern works for Daffodil limited validation, but not full
> "on" validation, which uses Xerces.
>
> This pattern works in all scenarios:
>
> <xs:pattern value="[-\t\n\r- -]*"/>
>
> (Has to be all on one line. Email may break it since it's a long line.)
>
> The difference is the \t\n and \r in there. Those characters have to be
> expressed that way.
>
> Ticket DAFFODIL-2474 is about this.
> https://issues.apache.org/jira/browse/DAFFODIL-2474
>
>
> ------------------------------
> *From:* Beckerle, Mike <mbecke...@owlcyberdefense.com>
> *Sent:* Wednesday, March 3, 2021 7:02 PM
> *To:* users@daffodil.apache.org <users@daffodil.apache.org>
> *Subject:* Re: regex |AND| left over data
>
> Attila,
>
> Are you using Daffodil 3.0.0 or earlier?
>
> This bug ticket https://issues.apache.org/jira/browse/DAFFODIL-2363
>
> was not fixed until that release, and it relates directly to this issue.
>
> Please read the section titled *XML Illegal Characters* on this page:
> https://daffodil.apache.org/infoset/
>
> The pattern facet you need to enforce char-codes below 7F only, is this:
>
> <xs:pattern value="[- -]*"/>
>
> I tested this on Daffodil 3.0.0 and it works there. Given this I think you
> can see how to disallow/allow various of the C0 control characters which
> appear in the above as E000-E01F.
>
> <xs:simpleType name="singleByteCharsWithCodepointLessThan1F"
> dfdl:encoding="iso-8859-1">
> <!--
> Encoding iso-8859-1 means any single byte character at all will be
> considered
> well-formed.
> However, only the characters with codes less than 7F will be
> considered valid.
> -->
> <xs:restriction base="xs:string">
> <xs:pattern value="[- -]"/>
> </xs:restriction>
> </xs:simpleType>
>
> -mikeb
>
> ------------------------------
> *From:* Attila Horvath <attila.j.horv...@gmail.com>
> *Sent:* Wednesday, March 3, 2021 2:07 PM
> *To:* users@daffodil.apache.org <users@daffodil.apache.org>
> *Subject:* Re: regex |AND| left over data
>
> Mike
>
> Appreciate the response.
>
> I'm trying to follow customer's data spec to only allow 'printable
> characters' in certain fields though spec doesn't define what is/isn't
> printable. Wikipedia has its own definition of printable characters
> <https://en.wikipedia.org/wiki/ASCII#Printable_characters>. Technically,
> for example, bell ^G [0x07] may/could be considered a printable character.
> ( I know, I'm showing my age but such is life. ;)
>
> re: [-]
> This doesn't work. - daffodil throws errors as does Notepad++.
>
> Daffodil throws an error re: "[-]" as well.
>
> The best I can do is "[ -]" - anything under 0x20 throws an
> error in Daffodil which may be a problem as anything is allowed in certain
> fields.
>
> re: dfdl:separatorPosition="postfix"
> I made your recommended change. It successfully suppresses "left over
> data" warnings.
>
> Thx - Attila
>
>
> On Mon, Mar 1, 2021 at 2:32 PM Beckerle, Mike <mbecke...@owlcyberdefense.com> wrote:
>
> Ah, so you have some simple problems here, and this thorny little issue
> about the NUL character.
>
> Your regex, the character entities say  this must have a trailing ";"
> to terminate the character entity
>
> However, � is just plain disallowed by XML period. Can't put a NUL
> into XML even using a character entity to do so. This is one of the things
> I distinctly dislike about XML.
>
> To cope with this, given that in DFDL people have to talk about real data
> with NUL in it, DFDL does a bi-directional remapping from 0 to 
>
> But, you are trying to express a numeric range that is from char code 0 to
> char code 7F. So you can't just change your regex to use  because
> that's not the bottom of the range.
>
> To do what you want you need your regex to say [-]
> Notice the semicolons in there.
>
> With respect to the final CRLF at end of file, there are techniques to
> cope with this.
> We need to clarify, what is the canonical/preferred representation, and
> whether you want your schema to accept data that is missing this final CRLF.
>
> Assuming the final CRLF is required, non-optional, you can change the
> newline separator to add the DFDL property
>
> dfdl:separatorPosition="postfix"
>
> Just on the sequence that contains the rows of data.
>
> This means you get all the infix separator line-endings, plus one more at
> the end.
>
> However, that one at the end is NOT optional. If not present, you'll get
> parse errors.
>
> If you want the final CRLF missing to be tolerated on parsing, and whether
> it is there or not preserved when unparsing, then you actually have to
> model it as a data element:
>
> <element name="finalLineEnding" type="xs:string" minOccurs="0"
> dfdl:lengthKind="explicit" dfdl:length="0"
> dfdl:initiator="%CR;%LF;"/>
>
> That final element will absorb, and represent, a final CRLF, and on
> unparsing, lay it down so it matches the input data.
>
> ------------------------------
> *From:* Attila Horvath <attila.j.horv...@gmail.com>
> *Sent:* Monday, March 1, 2021 2:03 PM
> *To:* users@daffodil.apache.org <users@daffodil.apache.org>
> *Subject:* Re: regex |AND| left over data
>
> 1) b) should read ...value="�-"
>
> On 2021/03/01 18:58:08, Attila Horvath <attila.j.horv...@gmail.com>
> wrote:
> > All - two quick questions...
> >
> > 1) regex
> >
> > I am trying to use character range query in regex-pression like:
> > a)...
> > <xs: restriction base="xs:string">
> > <xs:pattern value="[\x00-\x7F]{0,10}"/>
> > </cs:restriction>
> > |OR|
> > b)...
> > <xs: restriction base="xs:string">
> > <xs:pattern value="[�- ]{0,10}"/>
> > </cs:restriction>
> > - either way both throw error(s) re: invalid regex expression syntax.
> > - what is correct syntax for range of hex values?
> >
> > 2) my CSV files has CR/LF at end of last line in file
> > - when parsing, I get numerous warnings ultimately "left over data"
> > ...starting at byte xyz (0x0d0a...)
> > a) how to consume (parse) last two bytes and avoid warnings
> > b) how to reconstitute (unparse) so last two bytes are included
> >
> > Thx in advance
> >
> > Attila (newbie)
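
For reference, a minimal sketch of the dfdl:separatorPosition="postfix" suggestion quoted above, assuming a CSV-like layout. The element names (file, row, field) and the comma field separator are hypothetical; only the CRLF separator and the postfix position come from the thread, and all other DFDL format properties are assumed to be defaulted elsewhere in the schema.

```xml
<!-- Hypothetical element names; only the CRLF separator and
     dfdl:separatorPosition="postfix" are taken from the thread. -->
<xs:element name="file">
  <xs:complexType>
    <!-- Rows are separated by CRLF. With "postfix", a CRLF is also
         required after the last row, so it is consumed when parsing
         and written back out when unparsing. -->
    <xs:sequence dfdl:separator="%CR;%LF;" dfdl:separatorPosition="postfix">
      <xs:element name="row" maxOccurs="unbounded">
        <xs:complexType>
          <!-- Fields within a row are separated by commas (infix). -->
          <xs:sequence dfdl:separator="," dfdl:separatorPosition="infix">
            <xs:element name="field" type="xs:string" maxOccurs="unbounded"
                        dfdl:lengthKind="delimited"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>
```

As noted in the quoted reply, this form makes the final CRLF mandatory; input that lacks it would be expected to fail to parse rather than produce a "left over data" warning.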