You are correct that you can't have CRLF as an extra escaped character, just CR and LF.
The IBM example has both of those characters as individual extra escaped characters.

Block escapes are quite different from what you are suggesting here. They are not placed next to the characters that require escaping. They are placed at the beginning and at the end of the entire element string.

So in the example string:

emp(name=Joe Smith|addr="862 West North PlaceCRLF Dept 23 M/S 77CRLF Unit 3",Madison,WI, 99999)

the street element starts after addr=, and the value begins with the escapeBlockStart character, meaning delimiters are allowed to appear in the data until we encounter an unescaped escapeBlockEnd, which is the " after Unit 3.

So the street element value looks like this as XML:

<street>862 West North PlaceLF Dept 23 M/S 77LF Unit 3</street>

(In the above XML, I used the '' character to indicate "what a CR becomes". Because XML normalizes CRLF to LF, and isolated CR to LF, we have to do something to preserve the real Infoset data content. Hence, Daffodil's XML converter converts CR to U+E00D, which has some depiction, depending on font, that sometimes comes through as a little box containing tiny E00D characters.)

Now that data doesn't actually contain any of the delimiters that are defined in the schema, so it doesn't need escaping on that account. It needs escaping because it contains the extra escaped characters.

On unparsing from XML, the '' characters are turned back into CR characters of the DFDL Infoset. Then, because of generateEscapeBlock='whenNeeded', the string is scanned for delimiters as well as extra escaped characters. The CRs and LFs are detected, and the whole string is surrounded with the escapeBlockStart and escapeBlockEnd, which go at the very start and end of the element, resulting in:

"862 West North PlaceCRLF Dept 23 M/S 77CRLF Unit 3"

which is what we started from.

Now, why would someone defining a data format put extraEscapedCharacters='%#x0D; %#x0A;' anyway?
I mean, this data example would have worked fine without that. It just wouldn't put the escape block start/end around strings containing those characters, but since our string didn't contain any delimiters, it would have parsed/unparsed just fine without the escapeBlockStart/End.

I think the reason is to enable the data to compose properly inside a format that *does* use CRLF or CR or LF as delimiters. By using extraEscapedCharacters with CR and LF, we ensure that data containing those is always escaped, as it would have to be if there were surrounding constructs with delimiters of CRLF, or CR or LF.

To me this is a sensible practice for any text-based format. Either CRLF or CR or LF are delimiters, or if they are not, one should probably reserve them for the future by making them extraEscapedCharacters, so that they get treated like delimiters anyway.

________________________________
From: Roger L Costello <coste...@mitre.org>
Sent: Tuesday, June 1, 2021 8:41 AM
To: users@daffodil.apache.org <users@daffodil.apache.org>
Subject: RE: Can you give me an example that motivates the need for extraEscapedCharacters please?

Also, I just discovered:

For property dfdl:extraEscapedCharacters the length of string must be exactly 1 character.

So CRLF cannot be the value of extraEscapedCharacters, right?

/Roger

From: Roger L Costello <coste...@mitre.org>
Sent: Tuesday, June 1, 2021 8:36 AM
To: users@daffodil.apache.org
Subject: Re: Can you give me an example that motivates the need for extraEscapedCharacters please?

Hi Mike,

Sorry for the delay in responding. I am finally getting around to digging into the example you provided on extraEscapedCharacters.

I thought that extraEscapedCharacters was just for unparsing? “Hey DFDL processor, escape these additional (extra) characters when you unparse the XML.” Yes?
In your example:

emp(name=Joe Smith|addr="862 West North PlaceCRLF Dept 23 M/S 77CRLF Unit 3",Madison,WI, 99999)
emp(name=Joe Jones|addr=8 Lost Lake Rd, Glendon,MT,88888)

you say that the CRLF in the address field is an extraEscapedCharacter. So, on unparsing, CRLF will be escaped, right? An escape block is being used, so I don’t know what the DFDL processor would use to escape CRLF: the escapeBlockStart character or the escapeBlockEnd character? In this case, they are the same, so it doesn’t matter. Wouldn’t the output of unparsing be this (I highlighted in yellow the escape character prior to CRLF):

emp(name=Joe Smith|addr="862 West North Place"CRLF Dept 23 M/S 77"CRLF Unit 3",Madison,WI, 99999)
emp(name=Joe Jones|addr=8 Lost Lake Rd, Glendon,MT,88888)

Now the block quotes are all messed up. That can’t be right. I’m confused.

/Roger

From: Beckerle, Mike <mbecke...@owlcyberdefense.com>
Sent: Thursday, May 20, 2021 2:14 PM
To: users@daffodil.apache.org
Subject: [EXT] Re: Can you give me an example that motivates the need for extraEscapedCharacters please?

The case you highlight where extraEscapedCharacters="a e" looks like a correct interpretation of the DFDL spec, but this example doesn't motivate the feature, so we need to get you a better example.

I did find that IBM DFDL comes with an example that has extraEscapedCharacters="%#x0D; %#x0A;", with escapeKind="escapeBlock".

So here's example data illustrating why you want those extraEscapedCharacters:

emp(name=Joe Smith|addr="862 West North Place
Dept 23 M/S 77
Unit 3",Madison,WI, 99999)
emp(name=Joe Jones|addr=8 Lost Lake Rd, Glendon,MT,88888)

Note that the street element contains line endings in the first instance. The format requires that such multi-line entries have quotation marks around them, even if the data doesn't contain any of the delimiters (like | or ,).

The second emp instance does not have any quotation marks around the street part.
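
A rough Python sketch of that quoting rule, purely as an illustration and not IBM DFDL or Daffodil code. The names needs_block and unparse_value are made up for this example, the delimiter list only covers the two separators visible in the sample data, and a value that itself contains a quotation mark is not handled.

```python
# Hypothetical sketch of generateEscapeBlock="whenNeeded"; not Daffodil or IBM DFDL code.
DELIMITERS = ("|", ",")          # separators visible in the sample data (assumption)
EXTRA_ESCAPED = ("\r", "\n")     # extraEscapedCharacters: CR and LF
BLOCK_START = BLOCK_END = '"'    # escapeBlockStart / escapeBlockEnd


def needs_block(value: str) -> bool:
    """True if the value contains a delimiter or an extra escaped character."""
    return any(token in value for token in DELIMITERS + EXTRA_ESCAPED)


def unparse_value(value: str) -> str:
    """Wrap the whole value in the escape block, but only when it is needed."""
    if needs_block(value):
        return BLOCK_START + value + BLOCK_END
    return value


multi_line = "862 West North Place\r\nDept 23 M/S 77\r\nUnit 3"
print(repr(unparse_value(multi_line)))        # quoted: contains CR and LF
print(repr(unparse_value("8 Lost Lake Rd")))  # unchanged: nothing to escape
```

The point of the sketch is that the quotes go around the whole value, never next to the individual CR or LF characters.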
To model this, you have a lengthKind 'delimited' format with this escape scheme:

<dfdl:escapeScheme escapeKind="escapeBlock"
  escapeBlockStart='"' escapeBlockEnd='"'
  escapeCharacter='"' escapeEscapeCharacter='"'
  extraEscapedCharacters='%#x0D; %#x0A;'
  generateEscapeBlock="whenNeeded"/>

The extraEscapedCharacters ensure that the multi-line data is inside the quotes, even though it does NOT contain a comma, a "|", or a ")%CR;%LF;" that would conflict with the separators/terminators of the format.

The DFDL schema is roughly like this:

<element name="employee" dfdl:initiator="emp(" dfdl:terminator=")%CR;%LF;">
  <complexType>
    <sequence dfdl:separator="|">
      <element name="name" dfdl:initiator="name=" type="xs:string"/>
      <element name="address" dfdl:initiator="addr=">
        <complexType>
          <sequence dfdl:separator=",">
            <element name="street" type="xs:string"/>
            <element name="city" type="xs:string"/>
            <element name="state" type="xs:string"/>
            <element name="postalCode" type="xs:string"/>
          </sequence>
        </complexType>
      </element>
    </sequence>
  </complexType>
</element>

A similar example using escapeKind 'escapeCharacter' with escapeCharacter="\" would be this data:

emp(name=Joe Smith|addr=862 West North Place\
Dept 23 M/S 77\
Unit 3,Madison,WI, 99999)

That's an example format where the line endings must be escaped, even though line endings are not delimiters in this format.

If that data contained CRLF line endings, it would actually appear with two escape characters, one for the CR and one for the LF:

emp(name=Joe Smith|addr=862 West North Place\←\
Dept 23 M/S 77\←\
Unit 3,Madison,WI, 99999)

(where ← represents the CR character.
If you have some box character between the slashes when reading this, it's because the Unicode left arrow U+2190 isn't rendering in your font.)

________________________________
From: Roger L Costello <coste...@mitre.org>
Sent: Wednesday, May 19, 2021 1:22 PM
To: users@daffodil.apache.org <users@daffodil.apache.org>
Subject: Can you give me an example that motivates the need for extraEscapedCharacters please?

Hi Folks,

As I understand it, extraEscapedCharacters identifies characters that are to be escaped during unparsing. They are characters to be escaped above and beyond the characters identified by the escapeCharacter property.

Below I created an example to illustrate the use of extraEscapedCharacters. Is it correct? The example is hokey, do you have a more compelling example?

Example: Suppose a data format contains a sequence of data items separated by forward slash. If a data item contains a separator, the separator is escaped by a backslash. An instance contains these three data items: "Yellow", "Lemon and/or Banana", and "6". The forward slash in the second data item needs escaping. Here is the instance:

Yellow/Lemon and\/or Banana/6

Parsing the instance produces this XML:

<FruitBasket>
  <Color>Yellow</Color>
  <Fruits>Lemon and/or Banana</Fruits>
  <Quantity>6</Quantity>
</FruitBasket>

If extraEscapedCharacters="" (no additional characters to be escaped during unparsing), then unparsing produces:

Yellow/Lemon and\/or Banana/6

If extraEscapedCharacters="a e", then the unparser will also escape all a's and e's, to produce:

Y\ellow/L\emon \and\/or B\an\an\a/6

Is that correct?

/Roger
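
The fruit basket example above can be reproduced with a small Python sketch. This is only an illustration of the escapeKind 'escapeCharacter' behavior as described in this thread, not Daffodil code; the function names escape_item and unparse are hypothetical, the separator is assumed to be a single character, and a literal backslash in the data (the job of escapeEscapeCharacter) is deliberately left out.

```python
# Hypothetical sketch of escapeKind="escapeCharacter" unparsing; not Daffodil code.
def escape_item(item: str, separator: str = "/", escape_char: str = "\\",
                extra_escaped: str = "") -> str:
    """Prefix the separator and every extra escaped character with escape_char."""
    # Assumption: the separator is one character; extra_escaped is a string of characters.
    to_escape = {separator, *extra_escaped}
    return "".join(escape_char + ch if ch in to_escape else ch for ch in item)


def unparse(items, separator: str = "/", extra_escaped: str = "") -> str:
    """Escape each data item, then join the items with the separator."""
    return separator.join(
        escape_item(item, separator, extra_escaped=extra_escaped) for item in items
    )


items = ["Yellow", "Lemon and/or Banana", "6"]
print(unparse(items))                      # Yellow/Lemon and\/or Banana/6
print(unparse(items, extra_escaped="ae"))  # Y\ellow/L\emon \and\/or B\an\an\a/6
```

Here each escape sits directly in front of the character it protects, which is the key difference from the escape block approach, where the quotes surround the whole element value.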