Re: Can you give me an example that motivates the need for extraEscapedCharacters please?

Beckerle, Mike Tue, 01 Jun 2021 07:36:49 -0700

Re: your hokey example, it's not so hokey.

It is perfectly OK, and really common, for a schema to accept data that it will 
deliberately not reproduce on unparsing. For example, a schema may accept 
multiple older versions of a data format but, on unparsing, produce only the 
new version. I'd say this is far more common than not. We can call this the 
"accept legacy, create modern" principle for data formats.


In my experience, the requirement that output match input is unique to 
cybersecurity. No other system I've seen has this requirement.

Consider a database. It takes in data from transactions and/or bulk loading. 
Output of data is by exporting the results of queries. These are uncorrelated 
sets of data, and the whole notion that the output and input data formats 
somehow "match" is simply nonexistent.





________________________________
From: Roger L Costello <coste...@mitre.org>
Sent: Tuesday, June 1, 2021 9:36 AM
To: users@daffodil.apache.org <users@daffodil.apache.org>
Subject: RE: Can you give me an example that motivates the need for 
extraEscapedCharacters please?


Hey Mike,



Is this a good or hokey use of the extraEscapedCharacters property?



----------------------------------------------------------

For example, a data format specifies that instances contain a label followed by 
a value, with a colon separator. Here is an instance of the data format:



     Quote: Four score and seven years ago our fathers brought
     forth on this continent a new nation, conceived in Liberty,
     and dedicated to the proposition that all men are created equal.



To show that a value extends onto another line we want the unparser to precede 
the linefeeds in the value with backslashes. We accomplish this using 
extraEscapedCharacters = %LF;. Here is the instance after unparsing:



     Quote: Four score and seven years ago our fathers brought \
     forth on this continent a new nation, conceived in Liberty, \
     and dedicated to the proposition that all men are created equal.
----------------------------------------------------------

The question that enters my mind with that example is: Why would you want the 
output (from unparsing) to be different than the input? I am thinking that that 
is typically undesirable. For that reason, I am thinking the example is hokey. 
Do you agree?



/Roger



From: Roger L Costello <coste...@mitre.org>
Sent: Tuesday, June 1, 2021 8:41 AM
To: users@daffodil.apache.org
Subject: RE: Can you give me an example that motivates the need for 
extraEscapedCharacters please?



Also, I just discovered:



For property dfdl:extraEscapedCharacters the length of string must be exactly 1 
character.



So CRLF cannot be the value of extraEscapedCharacters, right?
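
A minimal Python sketch of that rule, assuming the property value is a whitespace-separated list in which every entry must resolve to exactly one character. The helper names (resolve_entity, parse_extra_escaped) are hypothetical, only the entity forms that appear in this thread are handled, and this is not how any DFDL processor is actually implemented:

```python
import re

# Only the two named entities used in this thread (assumption: a real
# processor supports many more).
NAMED = {"CR": "\r", "LF": "\n"}

def resolve_entity(token):
    """Turn %#xHH; or %CR;/%LF; into the character it names; else return token."""
    m = re.fullmatch(r"%#x([0-9A-Fa-f]+);", token)
    if m:
        return chr(int(m.group(1), 16))
    m = re.fullmatch(r"%([A-Z]+);", token)
    if m and m.group(1) in NAMED:
        return NAMED[m.group(1)]
    return token

def parse_extra_escaped(prop):
    """Split the property on whitespace and require one character per entry."""
    chars = [resolve_entity(t) for t in prop.split()]
    for c in chars:
        if len(c) != 1:
            raise ValueError("entry must be exactly 1 character: %r" % c)
    return chars

print(parse_extra_escaped("%CR; %LF;"))      # CR and LF as two separate entries
print(parse_extra_escaped("%#x0D; %#x0A;"))  # same two characters, hex form
```

Under these assumptions "%CR;%LF;" written as a single entry is rejected, while listing CR and LF as two entries is accepted.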



/Roger



From: Roger L Costello <coste...@mitre.org>
Sent: Tuesday, June 1, 2021 8:36 AM
To: users@daffodil.apache.org
Subject: Re: Can you give me an example that motivates the need for 
extraEscapedCharacters please?



Hi Mike,



Sorry for the delay in responding. I am finally getting around to digging into 
the example you provided on extraEscapedCharacters.



I thought that extraEscapedCharacters was just for unparsing? “Hey DFDL 
processor, escape these additional (extra) characters when you unparse the 
XML.” Yes?



In your example:



emp(name=Joe Smith|addr="862 West North PlaceCRLF

Dept 23 M/S 77CRLF

Unit 3",Madison,WI, 99999)

emp(name=Joe Jones|addr=8 Lost Lake Rd, Glendon,MT,88888)



you say that the CRLF in the address field is an extraEscapedCharacter. So, on 
unparsing, CRLF will be escaped, right? An escape block is being used, so I 
don’t know what the DFDL processor would use to escape CRLF: the 
escapeBlockStart character or the escapeBlockEnd character? In this case, they 
are the same, so it doesn’t matter. Wouldn’t the output of unparsing be this 
(note the escape character, a double quote, immediately before each CRLF):



emp(name=Joe Smith|addr="862 West North Place"CRLF

Dept 23 M/S 77"CRLF

Unit 3",Madison,WI, 99999)

emp(name=Joe Jones|addr=8 Lost Lake Rd, Glendon,MT,88888)



Now the block quotes are all messed up. That can’t be right.



I’m confused.



/Roger





From: Beckerle, Mike <mbecke...@owlcyberdefense.com>
Sent: Thursday, May 20, 2021 2:14 PM
To: users@daffodil.apache.org
Subject: [EXT] Re: Can you give me an example that motivates the need for 
extraEscapedCharacters please?



The case you highlight where extraEscapedCharacters="a e" looks like a correct 
interpretation of the DFDL spec, but this example doesn't motivate the 
feature, so we need to get you a better example.



I did find that IBM DFDL comes with an example that has 
extraEscapedCharacters="%#x0D; %#x0A;", with escapeKind="escapeBlock".



So here's example data illustrating why you want those extraEscapedCharacters:



emp(name=Joe Smith|addr="862 West North Place

Dept 23 M/S 77

Unit 3",Madison,WI, 99999)

emp(name=Joe Jones|addr=8 Lost Lake Rd, Glendon,MT,88888)



Note that the street element contains line endings in the first instance. The 
format requires that such multi-line entries have quotation marks around them, 
even if the data doesn't contain any of the delimiters (like | or ,).



The second emp instance does not have any quotations around the street part.



To model this, you have a lengthKind 'delimited' format with this escape scheme



<dfdl:escapeScheme escapeKind="escapeBlock"
    escapeBlockStart='"' escapeBlockEnd='"' escapeCharacter='"'
    escapeEscapeCharacter='"' extraEscapedCharacters='%#x0D; %#x0A;'
    generateEscapeBlock="whenNeeded"/>



The extraEscapedCharacters ensure that the multi-line data is put inside the 
quotes, even though it does NOT contain a comma, a "|", or a ")%CR;%LF;" that 
would conflict with the separators/terminators of the format. The DFDL schema 
is roughly like this:



<element name="employee" dfdl:initiator="emp(" dfdl:terminator=")%CR;%LF;">
   <complexType>
     <sequence dfdl:separator="|">
       <element name="name" dfdl:initiator="name=" type="xs:string"/>
       <element name="address" dfdl:initiator="addr=">
         <complexType>
           <sequence dfdl:separator=",">
             <element name="street" type="xs:string"/>
             <element name="city" type="xs:string"/>
             <element name="state" type="xs:string"/>
             <element name="postalCode" type="xs:string"/>
           </sequence>
         </complexType>
       </element>
     </sequence>
   </complexType>
</element>
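
To make the role of extraEscapedCharacters with generateEscapeBlock="whenNeeded" concrete, here is a rough Python simulation of the unparser's decision for the street element. It is only a sketch under stated assumptions: the function name wrap_when_needed is hypothetical, the delimiter list is copied by hand from the schema above, and doubling an embedded quote stands in for escapeEscapeCharacter. It is not Daffodil's or IBM DFDL's actual logic.

```python
def wrap_when_needed(value, delimiters, extra_escaped,
                     block_start='"', block_end='"', escape_escape='"'):
    """Wrap value in an escape block only if it contains a delimiter
    or one of the extra escaped characters (simplified 'whenNeeded')."""
    triggers = list(delimiters) + list(extra_escaped)
    if not any(t in value for t in triggers):
        return value  # nothing conflicts, so emit the value bare
    # Assumption: an embedded block-end character is protected by
    # prefixing it with the escapeEscapeCharacter.
    body = value.replace(block_end, escape_escape + block_end)
    return block_start + body + block_end

delims = [",", "|", ")\r\n"]   # separators and terminator from the schema
extra = ["\r", "\n"]           # extraEscapedCharacters='%#x0D; %#x0A;'

multi = "862 West North Place\r\nDept 23 M/S 77\r\nUnit 3"
print(repr(wrap_when_needed(multi, delims, extra)))             # quoted
print(repr(wrap_when_needed(multi, delims, [])))                # bare
print(repr(wrap_when_needed("8 Lost Lake Rd", delims, extra)))  # bare
```

Without the extra characters, the multi-line street contains no delimiter, so nothing would trigger the quotes; listing CR and LF is what forces the block.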



A similar example using escapeKind 'escapeCharacter' with escapeCharacter="\" 
would be this data:



emp(name=Joe Smith|addr=862 West North Place\
Dept 23 M/S 77\
Unit 3,Madison,WI, 99999)



That's an example format where the line-endings must be escaped, even though 
line endings are not delimiters in this format.



If that data contained CRLF line endings, it would actually appear with two 
escape characters, one for the CR, one for the LF:



emp(name=Joe Smith|addr=862 West North Place\←\
Dept 23 M/S 77\←\
Unit 3,Madison,WI, 99999)



(Here ← represents the CR character. If you see a box character between the 
backslashes when reading this, it's because the Unicode left arrow U+2190 isn't 
rendering in your font.)
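
The two-escapes-per-CRLF behavior can be sketched in Python as follows. This is an illustrative simulation only: escape_each and unescape are hypothetical helpers, and the parse side assumes that an escape character simply makes the next character literal.

```python
def escape_each(value, escape_char="\\", extra_escaped=("\r", "\n")):
    """Prefix every listed character with the escape character, one at a time."""
    return "".join(escape_char + ch if ch in extra_escaped else ch
                   for ch in value)

def unescape(text, escape_char="\\"):
    """Drop each escape character and keep the character that follows it."""
    out, i = [], 0
    while i < len(text):
        if text[i] == escape_char and i + 1 < len(text):
            i += 1  # skip the escape, keep what it protects
        out.append(text[i])
        i += 1
    return "".join(out)

street = "862 West North Place\r\nDept 23 M/S 77\r\nUnit 3"
escaped = escape_each(street)
print(repr(escaped))                # each CR and each LF has its own backslash
print(unescape(escaped) == street)  # the round trip restores the original
```

Because CR and LF are separate single-character entries, each one is escaped independently, which is why a CRLF shows up with two backslashes.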



________________________________

From: Roger L Costello <coste...@mitre.org>
Sent: Wednesday, May 19, 2021 1:22 PM
To: users@daffodil.apache.org <users@daffodil.apache.org>
Subject: Can you give me an example that motivates the need for 
extraEscapedCharacters please?



Hi Folks,

As I understand it, extraEscapedCharacters identifies characters that are to be 
escaped during unparsing. They are characters to be escaped above and beyond 
the characters identified by the escapeCharacter property.

Below I created an example to illustrate the use of extraEscapedCharacters. Is 
it correct? The example is hokey, do you have a more compelling example?

Example: Suppose a data format contains a sequence of data items separated by 
forward slash. If a data item contains a separator, the separator is escaped by 
a backslash. An instance contains these three data items: "Yellow", "Lemon 
and/or Banana", and "6". The forward slash in the second data item needs 
escaping. Here is the instance:

Yellow/Lemon and\/or Banana/6

Parsing the instance produces this XML:

<FruitBasket>
    <Color>Yellow</Color>
    <Fruits>Lemon and/or Banana</Fruits>
    <Quantity>6</Quantity>
</FruitBasket>

If extraEscapedCharacters="" (no additional characters to be escaped during 
unparsing), then unparsing produces:

Yellow/Lemon and\/or Banana/6

If extraEscapedCharacters="a e", then the unparser will also escape all a's and 
e's, to produce:

Y\ellow/L\emon \and\/or B\an\an\a/6

Is that correct?
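
For readers who want to check the two expected outputs above, here is a small Python simulation. It is a sketch, not a DFDL processor: escape_value and unparse are hypothetical names, the separator and escape character are hard-coded from the example, and escapeEscapeCharacter is ignored.

```python
def escape_value(value, escape_char="\\", separator="/", extra_escaped=""):
    """Prefix the separator and every extra escaped character with escape_char.

    extra_escaped is a whitespace-separated list, e.g. "a e".
    """
    must_escape = {separator} | set(extra_escaped.split())
    return "".join(escape_char + ch if ch in must_escape else ch
                   for ch in value)

def unparse(items, extra_escaped=""):
    """Join the escaped items with the forward-slash separator."""
    return "/".join(escape_value(i, extra_escaped=extra_escaped) for i in items)

items = ["Yellow", "Lemon and/or Banana", "6"]
print(unparse(items))          # Yellow/Lemon and\/or Banana/6
print(unparse(items, "a e"))   # Y\ellow/L\emon \and\/or B\an\an\a/6
```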

/Roger
