Re: Question about gobbling up hex digits until arriving at a string

Steve Lawrence Tue, 20 Nov 2018 07:10:42 -0800


The value of dfdl:byteOrder actually does affect the order of hexBinary
output. Looking through the git log and my email history, I've found
where this decision was made.


In March 2017 we added support for non-byte size lengths for hexBinary
data. This resulted in some discussions about how to handle
canonicalization of hexBinary data where there aren't full bytes of data
(which non-byte size lengths would allow). The XSD specification is
mostly silent on this since it states that hexBinary data must always
represent full bytes. So some interpretation was needed. I've copied and
pasted the result of those discussions from Mike that I think explains
the reasoning why byteOrder (and bitOrder) affect the hexBinary output.

The gist is that you could think of the process as:

1. Convert the specified length number of bits to a nonNegativeInteger
   using byteOrder and bitOrder
2. Convert that logical value to a big-endian two's complement bit
   string
3. Convert those bits to hexBinary

The actual process is a bit more efficient than that, but that's the
general idea.
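
The three steps above can be sketched in Python for the simple case of
whole bytes with most-significant-bit-first bit order. The function name
hex_binary_of and its parameters are hypothetical, chosen only for
illustration; this is not Daffodil's implementation, just a model of the
idea.

```python
def hex_binary_of(data: bytes, byte_order: str) -> str:
    """Illustrative only: whole bytes, bitOrder assumed mostSignificantBitFirst."""
    # Step 1: interpret the bytes as a non-negative integer using byteOrder.
    endian = "little" if byte_order == "littleEndian" else "big"
    value = int.from_bytes(data, endian)
    # Step 2: re-encode that logical value as a big-endian bit string
    # of the same length.
    canonical = value.to_bytes(len(data), "big")
    # Step 3: write those bits as uppercase hex digits.
    return canonical.hex().upper()


pe = bytes([0x50, 0x45, 0x00, 0x00])      # "PE\0\0" as it appears in the file
print(hex_binary_of(pe, "bigEndian"))     # 50450000
print(hex_binary_of(pe, "littleEndian"))  # 00004550
```

Under this model, the same four bytes produce byte-reversed hex digits
when the element is littleEndian, which is the flipping Roger observed.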

The result is that if you don't want your bytes flipped in hexBinary
data, model it as bigEndian instead of littleEndian.

- Steve

Original discussion below:

> I looked up the XPath xs:hexBinary constructor function, and ended up at
> this statement found in the XSD description of hexBinary:
> 
>   hexBinary has a lexical representation where each binary octet
>   is encoded as a character tuple, consisting of two hexadecimal
>   digits ([0-9a-fA-F]) representing the octet code. For example,
>   "0FB7" is a hex encoding for the 16-bit integer 4023 (whose
>   binary representation is 111110110111).
> 
> (They then say the "canonical" representation doesn't use lowercase
> a-f.)
> 
> Note how the bit string they give has 12 bits in it, so they are
> padding on the left with zeros to get full bytes.
> 
> So what they're saying here is that hexBinary's lexical
> representation is given in terms of numeric value equivalents.
> 
> This achieves the canonicalization you were suggesting. It loses the
> behavior where if everything is byte aligned and byte sized that bit
> order and byte order don't matter, because they do matter to
> numbers.
> 
> For DFDL this would mean we define the xs:hexBinary value you get,
> in terms of xs:nonNegativeInteger of the same bits.
> 
> But notice how the interpretation of the binary representation is
> bigEndian MSBF relative to the binary data they give there, which is
> a "logical" binary number. Still. They're stating that if the data
> is the number 4023, then the hexBinary of it *is* "0FB7".
> 
> So, if I store logical integer 4023 in 16 bits, I can do it
> bigEndian MSBF, or littleEndian MSBF, or littleEndian LSBF. In all
> cases, if the value stored would be parsed/unparsed as 4023, the
> hexBinary would be "0FB7".
> 
> Now, if I store it as 12 bits, which is capable of holding enough
> bits to store that value as an unsignedInt, then bigEndian MSBF,
> starting at bit 2 of a byte I get
> 
>   XX111110 110111XX
> 
> Stored LSBF littleEndian, numbering bits and bytes RTL I get that
> exact same picture, just numbered all backwards. But if I number
> everything the normal LTR way, I get
> 
>   110111XX XX111110
> 
> The bytes in the file, to represent the value 4023 in these two
> representations are extraordinarily different. But the hexBinary
> representation of these 12 bit elements would be exactly the same.
> I.e.,
> 
>   <element name="lsbf" type="xs:hexBinary"
>     dfdl:byteOrder="littleEndian"
>     dfdl:bitOrder="leastSignificantBitFirst"
>     dfdl:alignmentUnits="bits"
>     dfdl:leadingSkip="2"
>     dfdl:lengthUnits="bits"
>     dfdl:length="12"/>
> 
> vs. same thing but bigEndian, MSBF
> 
> Now, let's look at an interesting case:
> 
>   <element name="foo" dfdl:length="5"
>     .... everything else as above lsbf />
> 
> Suppose the byte is 01011010 (5A)
> 
> The foo element is X10110XX. The value is 0x16 or 22 decimal.
> 
> Now let's describe those exact same bits msbf.
> 
>   <element name="bar" dfdl:leadingSkip="1" dfdl:length="5".... />
> 
> These are the exact same 5 bits. We are just "coming at them" from
> the other side.
> 
> And the hexBinary for them would be... I think 0x16 also, or 22
> decimal.
> 
> Since this is less than 1 byte of data, byteOrder doesn't come into
> play. BitOrder plays the role of isolating which bits we're talking
> about, but once the same 5 bits are isolated, the bit positions don't
> matter. MSBF and LSBF are about the assignment of bit positions to
> place values, but only the place value of each bit matters for
> purposes of the numeric value.
> 
> The implications of the above: hexBinary parser/unparser should
> share lots of code with xs:nonNegativeInteger parser/unparser.
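
The 5-bit example above can be checked with a small Python sketch. The
helper take_bits is hypothetical and handles only a field that lies
within a single byte; it models how bitOrder selects which bits are
skipped, not how Daffodil actually reads data.

```python
def take_bits(byte: int, skip: int, length: int, bit_order: str) -> int:
    """Return `length` bits of one byte as an integer, after skipping `skip` bits."""
    if bit_order == "leastSignificantBitFirst":
        # Bit 1 is the least significant bit, so skipping drops low-order bits.
        shift = skip
    else:
        # Bit 1 is the most significant bit, so skipping drops high-order bits.
        shift = 8 - skip - length
    return (byte >> shift) & ((1 << length) - 1)


foo = take_bits(0x5A, skip=2, length=5, bit_order="leastSignificantBitFirst")
bar = take_bits(0x5A, skip=1, length=5, bit_order="mostSignificantBitFirst")
print(foo, bar)  # 22 22
```

Both descriptions isolate the same five bits (X10110XX), so both yield
the value 22, or 0x16.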





On 11/20/18 9:00 AM, Costello, Roger L. wrote:
> Thank you Steve!
> 
> A very odd thing happened after I made those changes.
> 
> Recall that the second group of binary is this: all the binary until "PE" is 
> encountered. 
> 
> The PE data is actually 4 bytes (PE\0\0). Looking at the PE data in a hex 
> editor I see this:
> 
>       50 45 00 00
> 
> So, after outputting the second group of binary, I output the PE data:
> 
> <xs:element name="PE_Header">
>     <xs:complexType>
>         <xs:sequence>
>             <xs:sequence dfdl:hiddenGroupRef="hidden_signature_Group" />
>             <xs:element name='Signature' type='xs:string'
>                 dfdl:inputValueCalc='{
>                 if (xs:string(../Hidden_signature) eq "50450000") then "PE\0\0"
>                 else fn:error("signature PE\0\0 not present")
>                 }'>
>             </xs:element>
>         </xs:sequence>
>     </xs:complexType>
> </xs:element>
> 
> Here's the odd part: Somehow, the PE data got reversed! For my inputValueCalc 
> to work, I needed to change the if-statement to this:
> 
> if (xs:string(../Hidden_signature) eq "00004550") then "PE\0\0"
> 
> Notice that I flipped 50450000 to 00004550.
> 
> It seems that picking up the second group of binary has had the side effect 
> of flipping the PE data.
> 
> Note: in my DFDL schema I have this setting: byteOrder="littleEndian" (I 
> think that is somehow related to what's happening).
> 
> Can you explain what's happening Steve, please? Why is the PE data flipping?
> 
> /Roger
> 
> -----Original Message-----
> From: Steve Lawrence <[email protected]> 
> Sent: Tuesday, November 20, 2018 8:23 AM
> To: [email protected]; Costello, Roger L. <[email protected]>
> Subject: Re: Question about gobbling up hex digits until arriving at a string
> 
> Looks like the issue in this case was the use of the dot character--my 
> suggested regex wasn't completely correct. By default the dot character in 
> regular expressions matches all characters EXCEPT for new line characters. 
> And some of the bytes in the second Instruction area happen to be 0x0D, which 
> is a carriage return and which dot does not match. So the regular expression 
> I provided didn't actually match all bytes like I suggested it did.
> 
> So you can replace the dot with a character class that matches all bytes 
> (i.e. [\x00-\xFF]) to ensure those newline characters are matched. Also, the 
> + needs to be made non-greedy by appending a question mark. Try changing the 
> regular expressions to the following:
> 
>   [\x00-\xFF]+?(?=This program cannot be run in DOS mode\.)
> 
>   [\x00-\xFF]+?(?=PE)
> 
> - Steve
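
The difference between the dot and the all-bytes character class can be
seen in a short Python sketch. Python's re module stands in here for the
Java regular expressions Daffodil uses, and the two differ in one detail:
Python's dot excludes only \n, while Java's dot also excludes \r. The
sample bytes are made up for illustration.

```python
import re

# Made-up stand-in for the second Instructions area: binary bytes that
# include 0x0D and 0x0A, then the "PE\0\0" signature, then a later "PE".
data = b"\xba\x0d\x0a\x01PE\x00\x00 more PE"

# The dot stops at the newline byte, so no "PE" is reachable from the start.
dot = re.match(rb".+(?=PE)", data)

# The character class matches every byte value; +? stops at the first "PE".
lazy = re.match(rb"[\x00-\xFF]+?(?=PE)", data)

# Without the question mark, the match runs on to the last "PE".
greedy = re.match(rb"[\x00-\xFF]+(?=PE)", data)

print(dot)             # None
print(lazy.group())    # the four bytes before the first "PE"
print(greedy.group())  # everything before the last "PE"
```

The lookahead (?=PE) is not consumed, so the signature bytes remain
available for the element that follows.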
> 
> On 11/19/18 5:17 PM, Costello, Roger L. wrote:
>> Thank you Steve and Mike!
>>
>> I have made progress, using your suggestions. I have almost got it.
>>
>> My input contains:
>>
>>   * A bunch of binary
>>   * Then the string: This program cannot be run in DOS mode.
>>   * Then a bunch more binary
>>   * And then the string: PE
>>
>> Here's the DFDL code:
>>
>> <xs:element name="DOS_Stub">
>>   <xs:complexType>
>>     <xs:sequence>
>>       <xs:element name="Instructions"
>>                   type="xs:hexBinary"
>>                   dfdl:lengthKind="pattern"
>>                   dfdl:lengthPattern=".+(?=This program cannot be run in DOS mode\.)"/>
>>       <xs:element name="Message"
>>                   type="xs:string"
>>                   dfdl:lengthUnits="characters"
>>                   dfdl:lengthKind="explicit"
>>                   dfdl:length="39"
>>                   dfdl:representation="text"
>>                   dfdl:encoding="ISO-8859-1"/>
>>       <xs:element name="Instructions"
>>                   type="xs:hexBinary"
>>                   dfdl:lengthKind="pattern"
>>                   dfdl:lengthUnits="bytes"
>>                   dfdl:representation="binary"
>>                   dfdl:lengthPattern=".+(?=PE)"/>
>>     </xs:sequence>
>>   </xs:complexType>
>> </xs:element>
>>
>> Parsing successfully gobbles up the first group of binary, then the 
>> first string, but fails to gobble up the second group of binary:
>>
>> <DOS_Stub>
>>   <Instructions>0E1FBA0E00B409CD21B8014CCD21</Instructions>
>>   <Message>This program cannot be run in DOS mode.</Message>
>>   <Instructions></Instructions>
>> </DOS_Stub>
>>
>> Why is the second group of binary not being picked up?
>>
>> /Roger
>>
>> *From:* Mike Beckerle <[email protected]>
>> *Sent:* Monday, November 19, 2018 2:00 PM
>> *To:* [email protected]; Costello, Roger L. 
>> <[email protected]>
>> *Subject:* Re: Question about gobbling up hex digits until arriving at 
>> a string
>>
>> Also,
>>
>> Set dfdl:encoding to 'iso-8859-1'.
>>
>> If you are using ASCII, then as soon as a byte with the 8th bit set is 
>> encountered, you won't get what you think.
>>
>> Encoding 'iso-8859-1' is the magic "bytes" encoding where every byte 
>> is one character no matter the byte value.
>>
>> ASCII, surprising to some people, is not at all like this.
>>
>> ASCII is 7-bit, and if a byte has the 8th bit set, it will cause a 
>> decode error, and you will instead get a Unicode replacement character 
>> created for that byte.
>>
>> This replacement character  usually looks like a stylized question 
>> mark (if you have a unicode font). But that won't match your regex 
>> because the code-point for the Unicode replacement character is 
>> U+FFFD.  The ranges in your regex won't accept these.
>>
>> ...mike beckerle
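
The point about encodings can be illustrated with a short Python sketch.
Python's errors="replace" decoding stands in here for the replacement
behavior described above; the sample bytes are the first six bytes of the
Instructions value shown earlier in this thread.

```python
raw = bytes([0x0E, 0x1F, 0xBA, 0x0E, 0x00, 0xB4])

# ISO-8859-1 maps every byte value to the code point with the same number.
as_latin1 = raw.decode("iso-8859-1")

# ASCII cannot decode bytes with the 8th bit set; each becomes U+FFFD.
as_ascii = raw.decode("ascii", errors="replace")

print([hex(ord(c)) for c in as_latin1])
print([hex(ord(c)) for c in as_ascii])
```

With ISO-8859-1 the decoded characters still carry the original byte
values, so a byte-oriented character class can match them. With ASCII,
0xBA and 0xB4 are replaced by U+FFFD and the original values are lost.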
>>
>> ------------------------------------------------------------------------
>>
>> *From:* Steve Lawrence <[email protected]>
>> *Sent:* Monday, November 19, 2018 1:47:57 PM
>> *To:* [email protected]; Roger Costello
>> *Subject:* Re: Question about gobbling up hex digits until arriving at 
>> a string
>>
>> On second look, I think the issue is more clear. The regex you have is:
>>
>>    [\x30-\x39\x41-\x46\x61-\x66]+?(?=\x54)
>>
>> Those hex values are all ASCII characters, and could be rewritten like so:
>>
>>    [0-9A-Fa-f]+?(?=T)
>>
>> So your regex actually will only match data that contains those ASCII 
>> characters followed by the letter T. But I suspect your data isn't 
>> ASCII, it's actual binary data that could be anything. Since your data 
>> doesn't contain those ASCII characters, your pattern will fail to 
>> match and the matched length is considered zero. It then decodes 39 
>> bytes of data, with the initial bytes being binary data followed by 
>> the beginning of the ASCII string.
>>
>> So the schema needs to be modified to either use a different regex or 
>> use some other method to determine where the data ends and the message 
>> begins. To me, it seems odd to have a binary format where the length 
>> of binary data is just some amount until it finds the letter 'T', so I 
>> would think a better description would exist. That said, such a regex 
>> would look like this:
>>
>>    [^T]+
>>
>> - Steve
>>
>>
>> On 11/19/18 12:50 PM, Steve Lawrence wrote:
>>  > Roger,
>>  >
>>  > I am unable to reproduce this issue. I've created a TDML file at the
>>  > below link, which defines a schema and a test case with sample input
>>  > data and expected infoset, based on your description.
>>  >
>>  > https://gist.github.com/stevedlawrence/c4051386c4ed58279dbcae1e75d08218
>>  >
>>  > This can be tested with:
>>  >
>>  >   daffodil test -i hexPattern.tml
>>  >
>>  > And I get the output:
>>  >
>>  >   [Fail] hexPattern
>>  >     Failure Information:
>>  >       Left over data. Consumed 408 bit(s) with 16 bit(s) remaining.
>>  >
>>  >   Total: 1, Pass: 0, Fail: 1, Not Found: 0
>>  >
>>  > So it fails, but it fails because the schema does not consume the
>>  > trailing PE, so that's expected. The actual infoset does match the
>>  > expected infoset.
>>  >
>>  > Maybe your input data is different or there is some other property you
>>  > have defined in dfdl:format that is changing the behavior?
>>  >
>>  > Thanks,
>>  > - Steve
>>  >
>>  > On 11/17/18 10:54 AM, Costello, Roger L. wrote:
>>  >> Hello DFDL Community,
>>  >>
>>  >> Within my input is this:
>>  >>
>>  >> - a series of bytes
>>  >> - then the string: "This program cannot be run in DOS mode."
>>  >> - then another series of bytes until arriving at this string: "PE"
>>  >>
>>  >> I figured that for the first series of bytes I would use xs:hexBinary
>>  >> whose length ends when getting to "T" (hex 54)
>>  >>
>>  >> <xs:element   name="Instructions_in_hex"
>>  >>               type="xs:hexBinary"
>>  >>               dfdl:lengthKind="pattern"
>>  >>               dfdl:lengthPattern="[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x54)" />
>>  >>
>>  >> The next item is a string of length 39
>>  >>
>>  >> <xs:element   name="Message"
>>  >>               type="xs:string"
>>  >>               dfdl:lengthUnits="characters"
>>  >>               dfdl:lengthKind="explicit"
>>  >>               dfdl:length="39" />
>>  >>
>>  >> The last item is a series of hex digits whose length ends when getting
>>  >> to "P" (hex 50)
>>  >>
>>  >> <xs:element   name="Instructions_in_hex"
>>  >>               type="xs:hexBinary"
>>  >>               dfdl:lengthKind="pattern"
>>  >>               dfdl:lengthPattern="[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x50)" />
>>  >>
>>  >> At the bottom of this message is the complete set of declarations.
>>  >>
>>  >> Unfortunately, it doesn't work. The first <Instructions_in_hex> picks up
>>  >> nothing. Then the <Message> element erroneously picks up a bunch of hex
>>  >> digits and the first part of the string "This program cannot be run in
>>  >> DOS mode.". Then it crashes.
>>  >>
>>  >> What am I doing wrong, please?
>>  >>
>>  >> /Roger
>>  >>
>>  >> <xs:element name="DOS_Stub">
>>  >>     <xs:complexType>
>>  >>         <xs:sequence>
>>  >>             <xs:element       name="Instructions_in_hex"
>>  >>                       type="xs:hexBinary"
>>  >>                       dfdl:lengthKind="pattern"
>>  >>                       dfdl:lengthPattern="[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x54)" />
>>  >>             <xs:element       name="Message"
>>  >>                       type="xs:string"
>>  >>                       dfdl:lengthUnits="characters"
>>  >>                       dfdl:lengthKind="explicit"
>>  >>                       dfdl:length="39" />
>>  >>             <xs:element       name="Instructions_in_hex"
>>  >>                       type="xs:hexBinary"
>>  >>                       dfdl:lengthKind="pattern"
>>  >>                       dfdl:lengthPattern="[\x30-\x39\x41-\x46\x61-\x66]+?(?=\x50)" />
>>  >>         </xs:sequence>
>>  >>     </xs:complexType>
>>  >> </xs:element>
>>  >>
>>  >
>>
>
