You are correct that you can't have CRLF as an extra escaped character, just CR 
and LF.

The IBM example has both those characters as individual extra escaped characters.

Block escapes are quite different to what you are suggesting here. They are not 
proximate to the characters requiring the escape mechanism. They are placed at 
the beginning of the entire element string, and end of the entire element 
string.

So in the example string:


emp(name=Joe Smith|addr="862 West North PlaceCRLF

Dept 23 M/S 77CRLF

Unit 3",Madison,WI, 99999)


The street element starts after addr=, and the value begins with the 
escapeBlockStart character, meaning delimiters are allowed to appear in the 
data until we encounter an unescaped escapeBlockEnd, which is the " after Unit 
3. So the street element value looks like this as XML:


<street>862 West North PlaceLF

Dept 23 M/S 77LF

Unit 3</street>


(In the above XML, I used the U+E00D character to indicate "what a CR becomes"; 
it may be invisible or show as a small box here. Because XML normalizes CRLF to 
LF, and isolated CR to LF, we have to do something to preserve the real Infoset 
data content. Hence, Daffodil's XML converter converts CR to U+E00D, which has 
some depiction, depending on font, that sometimes comes through as a little box 
containing tiny E00D characters.)
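
A minimal sketch of that CR remapping, in Python rather than anything 
Daffodil-specific. The function names are made up for illustration and are not 
Daffodil APIs; only the CR <-> U+E00D mapping comes from the description above.

```python
# Hypothetical helpers (not Daffodil APIs). Only the CR <-> U+E00D mapping
# is taken from the explanation above.
CR = "\r"
CR_PUA = "\ue00d"  # private-use character standing in for CR inside XML text


def infoset_to_xml_text(value: str) -> str:
    """Replace each CR so XML line-end normalization cannot lose it."""
    return value.replace(CR, CR_PUA)


def xml_text_to_infoset(text: str) -> str:
    """Turn each U+E00D back into a real CR before unparsing."""
    return text.replace(CR_PUA, CR)


street = "862 West North Place\r\nDept 23 M/S 77\r\nUnit 3"
xml_text = infoset_to_xml_text(street)
round_tripped = xml_text_to_infoset(xml_text)
```

The LF is left alone, since XML already preserves it; only the CR needs a 
stand-in.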

Now that data doesn't actually contain any of the delimiters that are defined 
in the schema, so it doesn't need escaping due to that. It needs escaping 
because it contains the extra-escaped characters.

On unparsing from XML, the U+E00D characters are turned back into CR 
characters of the DFDL Infoset. Then, due to generateEscapeBlock='whenNeeded', 
the string is scanned for delimiters as well as extra escaped characters. 
The CRs and LFs are detected, and the whole string is surrounded with the 
escapeBlockStart and escapeBlockEnd, which go at the very start and end of 
the element, resulting in:


"862 West North PlaceCRLF

Dept 23 M/S 77CRLF

Unit 3"

Which is what we started from.
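
To make the whenNeeded behavior concrete, here is a rough Python sketch of the 
decision described above. It is a simplification under my own assumptions, not 
Daffodil's implementation: the function name and parameters are hypothetical, 
the delimiter list is passed in by hand, and an embedded escapeBlockEnd is 
handled by simply prefixing it with the escapeEscapeCharacter.

```python
# Illustrative only: a simplified model of generateEscapeBlock="whenNeeded".
# Function and parameter names are hypothetical, not a real DFDL processor API.
def unparse_with_escape_block(value, delimiters, extra_escaped,
                              block_start='"', block_end='"',
                              escape_escape='"'):
    needs_block = (
        value.startswith(block_start)                # would look like a block
        or any(d in value for d in delimiters)       # contains a delimiter
        or any(c in value for c in extra_escaped)    # contains an extra escaped char
    )
    if not needs_block:
        return value
    # Protect any embedded block-end character, then wrap the WHOLE value.
    body = value.replace(block_end, escape_escape + block_end)
    return block_start + body + block_end


DELIMITERS = ["|", ",", ")\r\n"]   # separators/terminator of the emp format
EXTRA = ["\r", "\n"]               # extraEscapedCharacters='%#x0D; %#x0A;'

multi_line = "862 West North Place\r\nDept 23 M/S 77\r\nUnit 3"
single_line = "8 Lost Lake Rd"

quoted = unparse_with_escape_block(multi_line, DELIMITERS, EXTRA)
unquoted = unparse_with_escape_block(single_line, DELIMITERS, EXTRA)
no_extra = unparse_with_escape_block(multi_line, DELIMITERS, [])
```

Note that nothing is inserted next to the CR or LF themselves; the only 
additions are at the two ends of the value. With an empty extra-escaped list, 
the multi-line street comes back with no quotes at all.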

Now, why would someone defining a data format put 
extraEscapedCharacters='%#x0D; %#x0A;' anyway? I mean this data example would 
have worked fine without that. It just wouldn't put the escape block start/end 
around strings containing those, but since our string didn't contain any 
delimiters, it would have parsed/unparsed just fine without those 
escapeBlockStart/End.

I think the reason is to enable the data to compose properly inside a format 
that *does* use CRLF or CR or LF as delimiters.
By using extraEscapedCharacters with CR and LF, we ensure that data containing 
those is always escaped, as it would have to be if there were surrounding 
constructs with delimiters of CRLF, or CR or LF.

To me this is a sensible practice for any text-based format. Either CRLF or CR 
or LF are delimiters, or if they are not, one should probably reserve them for 
the future by making them extraEscapedCharacters, so that they get treated like 
delimiters anyway.




________________________________
From: Roger L Costello <coste...@mitre.org>
Sent: Tuesday, June 1, 2021 8:41 AM
To: users@daffodil.apache.org <users@daffodil.apache.org>
Subject: RE: Can you give me an example that motivates the need for 
extraEscapedCharacters please?


Also, I just discovered:



For property dfdl:extraEscapedCharacters the length of string must be exactly 1 
character.



So CRLF cannot be the value of extraEscapedCharacters, right?



/Roger



From: Roger L Costello <coste...@mitre.org>
Sent: Tuesday, June 1, 2021 8:36 AM
To: users@daffodil.apache.org
Subject: Re: Can you give me an example that motivates the need for 
extraEscapedCharacters please?



Hi Mike,



Sorry for the delay in responding. I am finally getting around to digging into 
the example you provided on extraEscapedCharacters.



I thought that extraEscapedCharacters was just for unparsing? “Hey DFDL 
processor, escape these additional (extra) characters when you unparse the 
XML.” Yes?



In your example:



emp(name=Joe Smith|addr="862 West North PlaceCRLF

Dept 23 M/S 77CRLF

Unit 3",Madison,WI, 99999)

emp(name=Joe Jones|addr=8 Lost Lake Rd, Glendon,MT,88888)



you say that the CRLF in the address field is an extraEscapedCharacter. So, on 
unparsing, CRLF will be escaped, right? An escape block is being used, so I 
don’t know what the DFDL processor would use to escape CRLF – the 
escapeBlockStart character or the escapeBlockEnd character? In this case, they 
are the same, so it doesn’t matter. Wouldn’t the output of unparsing be this (I 
highlighted in yellow the escape character prior to CRLF):



emp(name=Joe Smith|addr="862 West North Place"CRLF

Dept 23 M/S 77"CRLF

Unit 3",Madison,WI, 99999)

emp(name=Joe Jones|addr=8 Lost Lake Rd, Glendon,MT,88888)



Now the block quotes are all messed up. That can’t be right.



I’m confused.



/Roger





From: Beckerle, Mike <mbecke...@owlcyberdefense.com>
Sent: Thursday, May 20, 2021 2:14 PM
To: users@daffodil.apache.org
Subject: [EXT] Re: Can you give me an example that motivates the need for 
extraEscapedCharacters please?



The case you highlight where extraEscapedCharacters="a e" looks like a correct 
interpretation of the DFDL spec, but this example doesn't motivate the 
feature, so we need to get you a better example.



I did find that IBM DFDL comes with an example that has 
extraEscapedCharacters="%#x0D; %#x0A;", with escapeKind="escapeBlock".



So here's example data illustrating why you want those extraEscapedCharacters:



emp(name=Joe Smith|addr="862 West North Place

Dept 23 M/S 77

Unit 3",Madison,WI, 99999)

emp(name=Joe Jones|addr=8 Lost Lake Rd, Glendon,MT,88888)



Note that the street element contains line endings in the first instance. The 
format requires that such multi-line entries have quotation marks around them, 
even if the data doesn't contain any of the delimiters (like | or , ).



The second emp instance does not have any quotations around the street part.



To model this, you have a lengthKind 'delimited' format with this escape scheme:



<dfdl:escapeScheme escapeKind="escapeBlock" escapeBlockStart='"' 
escapeBlockEnd='"' escapeCharacter='"'
escapeEscapeCharacter='"' extraEscapedCharacters='%#x0D; %#x0A;' 
generateEscapeBlock="whenNeeded"/>



The extraEscapedCharacters ensure that the multi-line data is inside the 
quotes, even though it does NOT contain a comma, a "|", or a ")%CR;%LF;" 
that would conflict with the separators/terminators of the format. The DFDL 
schema is roughly like this:



<element name="employee" dfdl:initiator="emp(" dfdl:terminator=")%CR;%LF;">
  <complexType>
    <sequence dfdl:separator="|">
      <element name="name" dfdl:initiator="name=" type="xs:string"/>
      <element name="address" dfdl:initiator="addr=">
        <complexType>
          <sequence dfdl:separator=",">
            <element name="street" type="xs:string"/>
            <element name="city" type="xs:string"/>
            <element name="state" type="xs:string"/>
            <element name="postalCode" type="xs:string"/>
          </sequence>
        </complexType>
      </element>
    </sequence>
  </complexType>
</element>



A similar example using escapeKind 'escapeCharacter' with escapeCharacter="\" 
would be this data:



emp(name=Joe Smith|addr=862 West North Place\

Dept 23 M/S 77\

Unit 3,Madison,WI, 99999)



That's an example format where the line-endings must be escaped, even though 
line endings are not delimiters in this format.



If that data contained CRLF line endings, it would actually appear with two 
escape characters, one for the CR, one for the LF:



emp(name=Joe Smith|addr=862 West North Place\←\

Dept 23 M/S 77\←\

Unit 3,Madison,WI, 99999)



(where ← represents the CR character. If you see some box character between 
the backslashes when reading this, it's because the Unicode left arrow U+2190 
isn't rendering in your font.)
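
For comparison with the escape block case, here is a rough Python sketch of 
unparsing with escapeKind 'escapeCharacter'. It is a simplified model under 
stated assumptions, not a DFDL processor: the function name is hypothetical, 
the delimiters are supplied by hand, and the escape character is assumed to 
escape itself (that is, escapeEscapeCharacter is taken to be the same 
backslash).

```python
# Illustrative only: a simplified model of escapeKind="escapeCharacter".
# The function name and parameters are hypothetical.
def escape_with_character(value, delimiters, extra_escaped, escape_char="\\"):
    out = []
    i = 0
    while i < len(value):
        # Does a delimiter start at this position?
        match = next((d for d in delimiters if value.startswith(d, i)), None)
        if match is not None:
            out.append(escape_char + match)      # escape in front of the delimiter
            i += len(match)
        elif value[i] in extra_escaped or value[i] == escape_char:
            out.append(escape_char + value[i])   # one escape per single character
            i += 1
        else:
            out.append(value[i])
            i += 1
    return "".join(out)


street = "862 West North Place\r\nDept 23 M/S 77\r\nUnit 3"
escaped_street = escape_with_character(
    street, delimiters=["|", ",", ")\r\n"], extra_escaped=["\r", "\n"])

# The fruit example from the original question: "/" separator, "a e" extras.
escaped_fruits = escape_with_character(
    "Lemon and/or Banana", delimiters=["/"], extra_escaped=["a", "e"])
```

Because each extra escaped character is exactly one character long, a CRLF 
line ending gets two escape characters, one before the CR and one before the 
LF.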



________________________________

From: Roger L Costello <coste...@mitre.org>
Sent: Wednesday, May 19, 2021 1:22 PM
To: users@daffodil.apache.org <users@daffodil.apache.org>
Subject: Can you give me an example that motivates the need for 
extraEscapedCharacters please?



Hi Folks,

As I understand it, extraEscapedCharacters identifies characters that are to be 
escaped during unparsing. They are characters to be escaped above and beyond 
the characters identified by the escapeCharacter property.

Below I created an example to illustrate the use of extraEscapedCharacters. Is 
it correct? The example is hokey, do you have a more compelling example?

Example: Suppose a data format contains a sequence of data items separated by 
forward slash. If a data item contains a separator, the separator is escaped by 
a backslash. An instance contains these three data items: "Yellow", "Lemon 
and/or Banana", and "6". The forward slash in the second data item needs 
escaping. Here is the instance:

Yellow/Lemon and\/or Banana/6

Parsing the instance produces this XML:

<FruitBasket>
    <Color>Yellow</Color>
    <Fruits>Lemon and/or Banana</Fruits>
    <Quantity>6</Quantity>
</FruitBasket>

If extraEscapedCharacters="" (no additional characters to be escaped during 
unparsing), then unparsing produces:

Yellow/Lemon and\/or Banana/6

If extraEscapedCharacters="a e", then the unparser will also escape all a's and 
e's, to produce:

Y\ellow/L\emon \and\/or B\an\an\a/6

Is that correct?

/Roger
