The case you highlight, where extraEscapedCharacters="a e", looks like a 
correct interpretation of the DFDL spec, but that example doesn't motivate 
the feature, so we need to get you a better one.

I did find that IBM DFDL comes with an example that has 
extraEscapedCharacters="%#x0D; %#x0A;", with escapeKind="escapeBlock".

So here's example data illustrating why you want those extraEscapedCharacters:

emp(name=Joe Smith|addr="862 West North Place
Dept 23 M/S 77
Unit 3",Madison,WI, 99999)
emp(name=Joe Jones|addr=8 Lost Lake Rd, Glendon,MT,88888)

Note that the street element contains line endings in the first instance. The 
format requires that such multi-line entries have quotation marks around them, 
even if the data doesn't contain any of the delimiters (like | or , )

The second emp instance does not have any quotation marks around the street part.

To model this, you have a lengthKind 'delimited' format with this escape scheme:

<dfdl:escapeScheme escapeKind="escapeBlock" escapeBlockStart='"' 
escapeBlockEnd='"' escapeCharacter='"'
escapeEscapeCharacter='"' extraEscapedCharacters='%#x0D; %#x0A;' 
generateEscapeBlock="whenNeeded"/>

The extraEscapedCharacters ensure that the multi-line data is written inside 
the quotes, even though it contains neither a comma, nor a "|", nor a 
")%CR;%LF;" that would conflict with the separators/terminators of the format. 
The DFDL schema is roughly like this:

<element name="employee" dfdl:initiator="emp(" dfdl:terminator=")%CR;%LF;">
   <complexType>
     <sequence dfdl:separator="|">
       <element name="name" dfdl:initiator="name=" type="xs:string"/>
       <element name="address" dfdl:initiator="addr=">
         <complexType>
           <sequence dfdl:separator=",">
             <element name="street" type="xs:string"/>
             <element name="city" type="xs:string"/>
             <element name="state" type="xs:string"/>
             <element name="postalCode" type="xs:string"/>
           </sequence>
         </complexType>
       </element>
     </sequence>
   </complexType>
</element>
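
To make the unparse-side behavior concrete, here is a minimal Python sketch of 
how I'd expect generateEscapeBlock="whenNeeded" to decide whether to add the 
quotes. It is not Daffodil or IBM DFDL code. The function name, its parameters, 
and the simplified "contains a delimiter" test are assumptions made purely for 
illustration.

```python
def unparse_with_escape_block(value, delimiters, extra_escaped,
                              block_start='"', block_end='"',
                              escape_escape='"'):
    """Hypothetical helper: wrap value in an escape block only when needed."""
    needs_block = (
        value.startswith(block_start)
        or any(d in value for d in delimiters)      # in-scope delimiters
        or any(c in value for c in extra_escaped)   # extraEscapedCharacters
    )
    if not needs_block:
        return value
    # Occurrences of the block end inside the data are escaped
    # with the escapeEscapeCharacter.
    body = value.replace(block_end, escape_escape + block_end)
    return block_start + body + block_end


# Assumed delimiters in scope for the street element of the schema above.
DELIMS = ["|", ",", ")\r\n"]
EXTRA = ["\r", "\n"]  # extraEscapedCharacters='%#x0D; %#x0A;'

street1 = "862 West North Place\nDept 23 M/S 77\nUnit 3"
street2 = "8 Lost Lake Rd"

print(unparse_with_escape_block(street1, DELIMS, EXTRA))  # quoted
print(unparse_with_escape_block(street2, DELIMS, EXTRA))  # left as-is
print(unparse_with_escape_block(street1, DELIMS, []))     # no extras: not quoted
```

The last call is the point of the example: without the extra escaped 
characters, nothing in the multi-line street value would trigger the quotes.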

A similar example using escapeKind 'escapeCharacter' with escapeCharacter="\" 
would be this data:

emp(name=Joe Smith|addr=862 West North Place\
Dept 23 M/S 77\
Unit 3,Madison,WI, 99999)

That's an example format where the line-endings must be escaped, even though 
line endings are not delimiters in this format.

If that data contained CRLF line endings, it would actually appear with two 
escape characters, one for the CR, one for the LF:

emp(name=Joe Smith|addr=862 West North Place\←\
Dept 23 M/S 77\←\
Unit 3,Madison,WI, 99999)

(where ← represents the CR character. If you see a box character between 
the backslashes when reading this, it's because the Unicode left arrow U+2190 
isn't rendering in your font.)
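
The escapeCharacter variant can be sketched the same way. Again, this is only 
an illustration in Python, not real Daffodil code. The function name is made 
up, it handles single-character delimiters only, and it assumes the 
escapeEscapeCharacter is also a backslash.

```python
def escape_field(value, delimiters, extra_escaped, escape_char="\\"):
    """Hypothetical helper: put escape_char before every character that is a
    delimiter, an extra escaped character, or the escape character itself."""
    must_escape = set(delimiters) | set(extra_escaped) | {escape_char}
    return "".join(
        escape_char + ch if ch in must_escape else ch
        for ch in value
    )


street = "862 West North Place\r\nDept 23 M/S 77\r\nUnit 3"

# Line endings are not delimiters here, so only extraEscapedCharacters
# causes them to be escaped: one backslash for the CR, one for the LF.
print(repr(escape_field(street, ["|", ","], ["\r", "\n"])))

# The "a e" case from the original question, with "/" as the separator.
print(escape_field("Lemon and/or Banana", ["/"], ["a", "e"]))
```

The second call gives the same escaped text for the Fruits item as the 
hand-worked example in the question quoted below.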

________________________________
From: Roger L Costello <coste...@mitre.org>
Sent: Wednesday, May 19, 2021 1:22 PM
To: users@daffodil.apache.org <users@daffodil.apache.org>
Subject: Can you give me an example that motivates the need for 
extraEscapedCharacters please?

Hi Folks,

As I understand it, extraEscapedCharacters identifies characters that are to be 
escaped during unparsing. They are characters to be escaped above and beyond 
the characters identified by the escapeCharacter property.

Below I created an example to illustrate the use of extraEscapedCharacters. Is 
it correct? The example is hokey, do you have a more compelling example?

Example: Suppose a data format contains a sequence of data items separated by 
forward slash. If a data item contains a separator, the separator is escaped by 
a backslash. An instance contains these three data items: "Yellow", "Lemon 
and/or Banana", and "6". The forward slash in the second data item needs 
escaping. Here is the instance:

Yellow/Lemon and\/or Banana/6

Parsing the instance produces this XML:

<FruitBasket>
    <Color>Yellow</Color>
    <Fruits>Lemon and/or Banana</Fruits>
    <Quantity>6</Quantity>
</FruitBasket>

If extraEscapedCharacters="" (no additional characters to be escaped during 
unparsing), then unparsing produces:

Yellow/Lemon and\/or Banana/6

If extraEscapedCharacters="a e", then the unparser will also escape all a's and 
e's, to produce:

Y\ellow/L\emon \and\/or B\an\an\a/6

Is that correct?

/Roger
