It is possible, and you are not too far off, but it requires changing how
you think about the data. Instead of treating the data as strings
separated by one or more NUL characters, treat it as strings that contain
zero or more NUL padding characters followed by a single NUL terminator.
For example, say we have this data (using X instead of NUL for
visibility):
fooXXXXXXbarXXXbazXXXXXXX
+++-----^+++--^+++------^
Here the X's with a ^ under them are delimiters and not part of the
string. The characters making up the actual strings are those with either
a + or - under them. The X's with a - under them are considered pad
characters and are removed from the string before it is put in the
infoset. So each string is any number of characters (including NUL
characters) up to, but not including, the last NUL character, where the
last NUL character is a NUL character followed by a non-NUL character or
the end of data.
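To make the +, -, ^ labelling concrete, here is a small Python sketch (not DFDL) that reproduces the marker line above from the sample data. The function name annotate and the use of X as a stand-in for NUL are only for illustration:

```python
def annotate(data, nul="X"):
    """Label each character: '+' text, '-' pad NUL, '^' delimiter NUL."""
    marks = []
    for i, ch in enumerate(data):
        if ch != nul:
            marks.append("+")   # ordinary string character
        elif i + 1 == len(data) or data[i + 1] != nul:
            marks.append("^")   # last NUL of a run: the delimiter
        else:
            marks.append("-")   # NUL followed by another NUL: padding
    return "".join(marks)


print("fooXXXXXXbarXXXbazXXXXXXX")
print(annotate("fooXXXXXXbarXXXbazXXXXXXX"))
```

The rule in the elif branch is the same "followed by a non-NUL character or the end of data" test described above.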
Fortunately, lengthKind="pattern" can handle this with this expression:
dfdl:lengthPattern="[\x00-\xFF]+?(?=\x00([^\x00]|$))"
This pattern matches one or more of any character (non-greedy), where
those characters are followed by a NUL character and then either a
non-NUL character or the end of data. Note that the last NUL character is
not consumed, since it only appears inside a lookahead. With that regular
expression, we can use the following schema to parse your data:
<xs:element name="Celsius_Executable_Imports">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="Name" type="xs:string"
        maxOccurs="10"
        dfdl:lengthKind="pattern"
        dfdl:lengthPattern="[\x00-\xFF]+?(?=\x00([^\x00]|$))"
        dfdl:representation="text"
        dfdl:encoding="ISO-8859-1"
        dfdl:textTrimKind="padChar"
        dfdl:textStringPadCharacter="%NUL;"
        dfdl:textStringJustification="left"
        dfdl:terminator="%NUL;"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
So this is just a sequence of Name elements. For each element, we use
the length pattern described above to find the length of the string
excluding the last NUL character. We also define %NUL; as the pad
character so that any NULs on the right side of the string are trimmed.
Finally, we set a NUL terminator to consume the last NUL character that
ends the string.
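If it helps to see those three steps outside a DFDL processor, here is a rough Python sketch that mimics them. This is only an approximation: Python's re module stands in for the DFDL regex engine, parse_names is a made-up helper, and the number of NULs between the names in the sample data is invented for illustration.

```python
import re

# Same expression as dfdl:lengthPattern, written for Python's re module.
LENGTH_PATTERN = re.compile(r"[\x00-\xFF]+?(?=\x00([^\x00]|$))")


def parse_names(data):
    """Mimic the three steps: length pattern, pad trimming, terminator."""
    names = []
    pos = 0
    while pos < len(data):
        m = LENGTH_PATTERN.match(data, pos)
        if m is None:
            break                               # no further string found
        names.append(m.group(0).rstrip("\x00")) # trim right-hand NUL padding
        pos = m.end() + 1                       # consume the NUL terminator
    return names


sample = ("libgcc_s_dw2-1.dll\x00\x00"
          "__register_frame_info\x00"
          "libgcj-13.dll\x00\x00\x00"
          "_Jv_RegisterClasses\x00")
print(parse_names(sample))
```

Each loop iteration corresponds to one Name element: the match gives the length, rstrip plays the role of textTrimKind="padChar", and skipping one character plays the role of the terminator.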
I think that answers your first question.
As for your second question, adding |$ inside the lookahead is a good
general technique for matching either some pattern or the end of data.
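As a quick illustration of the |$ idea on its own, here is a minimal Python sketch. The helper take_until and the stop string "xyz" are hypothetical, and Python's re module is again only a stand-in for the DFDL regex engine:

```python
import re


def take_until(data, stop):
    """Return the leading characters up to `stop` or the end of data."""
    pattern = re.compile(r".+?(?=" + re.escape(stop) + r"|$)", re.DOTALL)
    m = pattern.match(data)
    return m.group(0) if m else ""


print(take_until("abcxyzdef", "xyz"))  # stops before the xyz pattern
print(take_until("abcdef", "xyz"))     # no xyz, so runs to the end of data

# Without the |$ alternative the second case does not match at all.
print(re.match(r".+?(?=xyz)", "abcdef", re.DOTALL))
```

One caveat for the Python version: $ also matches just before a trailing newline, so \Z is the stricter end anchor there if your data can end in a newline.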
- Steve
On 12/21/18 7:38 AM, Costello, Roger L. wrote:
> Hello DFDL community,
>
> I have a binary file consisting of strings of arbitrary length. Each
> string is followed by (terminated by) one or more null characters
> (hex 0). Here is a hex editor snapshot of a portion of the file:
>
> I would like the XML output to look like this:
>
> <Celsius_Executable_Imports>
> <Name>libgcc_s_dw2-1.dll</Name>
> <Name>__register_frame_info</Name>
> <Name>libgcj-13.dll</Name>
> <Name>_Jv_RegisterClasses</Name>
> ...
> </Celsius_Executable_Imports>
>
> Alas, I do not know how to implement a DFDL schema which outputs that XML.
>
> I /can/ create a DFDL schema that outputs this XML:
>
> <Celsius_Executable_Imports>
> <wrapper>
> <Name>libgcc_s_dw2-1.dll</Name>
> </wrapper>
> <wrapper>
> <Name>__register_frame_info</Name>
> </wrapper>
> <wrapper>
> <Name>libgcj-13.dll</Name>
> </wrapper>
> <wrapper>
> <Name>_Jv_RegisterClasses</Name>
> </wrapper>
> ...
> </Celsius_Executable_Imports>
>
> Here is the DFDL schema which achieves that output:
>
> <xs:element name="Celsius_Executable_Imports">
>   <xs:complexType>
>     <xs:sequence>
>       <xs:element name="wrapper" maxOccurs="10">
>         <xs:complexType>
>           <xs:sequence>
>             <xs:element name="Name" type="xs:string"
>               dfdl:lengthUnits="characters"
>               dfdl:lengthKind="pattern"
>               dfdl:lengthPattern="[\x01-\xFF]+?(?=\x00)"
>               dfdl:representation="text"
>               dfdl:encoding="ISO-8859-1"/>
>             <xs:sequence dfdl:hiddenGroupRef="hidden_null_Group"/>
>           </xs:sequence>
>         </xs:complexType>
>       </xs:element>
>     </xs:sequence>
>   </xs:complexType>
> </xs:element>
>
> <xs:group name="hidden_null_Group">
>   <xs:sequence>
>     <xs:element name="Hidden_Null" type="xs:hexBinary"
>       dfdl:lengthKind="pattern"
>       dfdl:lengthPattern="[\x00]+?(?=[\x01-\xFF])"
>       dfdl:outputValueCalc="{ . }"/>
>   </xs:sequence>
> </xs:group>
>
> Notice that for the element declaration for <wrapper> I specified
> maxOccurs="10". That, of course, is terrible. I did that because I
> don’t know how to specify in the element declaration for <Hidden_Null>
> that it should gobble up null characters until it gets to a non-null
> character /or it gets to the end-of-file/. I don’t know how to express
> that latter part (/or it gets to the end-of-file/).
>
> To recap, I have two questions:
>
> 1. How to declare an element that holds a string of arbitrary length
>    and is terminated by one or more null characters. There are an
>    arbitrary number of these string/null character pairs.
> 2. How to declare an element that specifies a pattern and the pattern
>    specifies “Gobble up characters until you get to this /xyz/ pattern
>    or you get to the end-of-file.”
>
> Below is my complete DFDL schema.
>
> /Roger
>
> <?xml version="1.0" encoding="UTF-8"?>
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
> xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
> xmlns:fn="http://www.w3.org/2005/xpath-functions"
> xmlns:math="http://www.w3.org/2005/xpath-functions/math"
> elementFormDefault="qualified">
>
> <xs:annotation>
> <xs:appinfo source="http://www.ogf.org/dfdl/">
> <dfdl:defineVariable name="CsizValue" type="xs:int"/>
> <dfdl:format
> alignment="1"
> alignmentUnits="bytes"
> binaryFloatRep="ieee"
> binaryNumberRep="binary"
> bitOrder="mostSignificantBitFirst"
> byteOrder="littleEndian"
> calendarPatternKind="implicit"
> documentFinalTerminatorCanBeMissing="yes"
> emptyValueDelimiterPolicy="none"
> encoding="ISO-8859-1"
> encodingErrorPolicy="replace"
> escapeSchemeRef=""
> fillByte="f"
> floating="no"
> ignoreCase="no"
> initiatedContent="no"
> initiator=""
> leadingSkip="0"
> lengthKind="implicit"
> lengthUnits="bits"
> nilKind="literalValue"
> nilValueDelimiterPolicy="none"
> occursCountKind="implicit"
> outputNewLine="%CR;%LF;"
> representation="binary"
> separator=""
> separatorPosition="infix"
> separatorPolicy="suppressed"
> sequenceKind="ordered"
> textStandardZeroRep="0"
> textStandardInfinityRep="Inf"
> textStandardExponentRep="E"
> textStandardNaNRep="NaN"
> textNumberPattern="#,##0.###;-#,##0.###"
> textNumberRounding="explicit"
> textNumberRoundingMode="roundUnnecessary"
> textNumberRoundingIncrement="0"
> textStandardGroupingSeparator=","
> terminator=""
> textBidi="no"
> textNumberCheckPolicy="strict"
> textNumberRep="standard"
> textOutputMinLength="0"
> textPadKind="none"
> textStandardBase="10"
> textTrimKind="none"
> trailingSkip="0"
> truncateSpecifiedLengthString="no"
> utf16Width="fixed"
> />
> </xs:appinfo>
> </xs:annotation>
>
> <xs:element name="Celsius_Executable_Imports">
>   <xs:complexType>
>     <xs:sequence>
>       <xs:element name="wrapper" maxOccurs="10">
>         <xs:complexType>
>           <xs:sequence>
>             <xs:element name="Name" type="xs:string"
>               dfdl:lengthUnits="characters"
>               dfdl:lengthKind="pattern"
>               dfdl:lengthPattern="[\x01-\xFF]+?(?=\x00)"
>               dfdl:representation="text"
>               dfdl:encoding="ISO-8859-1"/>
>             <xs:sequence dfdl:hiddenGroupRef="hidden_null_Group"/>
>           </xs:sequence>
>         </xs:complexType>
>       </xs:element>
>     </xs:sequence>
>   </xs:complexType>
> </xs:element>
>
> <xs:group name="hidden_null_Group">
>   <xs:sequence>
>     <xs:element name="Hidden_Null" type="xs:hexBinary"
>       dfdl:lengthKind="pattern"
>       dfdl:lengthPattern="[\x00]+?(?=[\x01-\xFF])"
>       dfdl:outputValueCalc="{ . }"/>
>   </xs:sequence>
> </xs:group>
>
> </xs:schema>
>