I have a slightly different suggestion.

If you have very big hexBinary data items, consider modeling them as arrays
of smaller hexBinary elements instead of one huge hexBinary.
This avoids the giant memory hiccup when parsing, where the entire thing
would otherwise have to be held in memory as one structure.

This technique can, in principle, parse arbitrarily large blobs of data
using a relatively small, finite amount of RAM.

If you don't need to recompute the length because it won't be changing,
then leaving off the dfdl:outputValueCalc should enable Daffodil to stream
the unparsing, lowering the memory footprint again to a small, finite size
like the parser's.

But that won't work if your use case involves using the schema to create
original data. Of course if the application already knows the exact data
size, then there is also no need for the dfdl:outputValueCalc.
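
For example, a minimal sketch (assuming the application always supplies a
correct 'len' value in the infoset, and that dfdl:representation="binary"
and other representation properties come from the enclosing dfdl:format):

<element name="len" type="xs:unsignedInt"
    dfdl:lengthKind="explicit" dfdl:length="4" dfdl:lengthUnits="bytes"/>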

Daffodil also supports a "BLOB feature", but it is an extension of DFDL.
The approach below is ordinary DFDL.

E.g.,

<!--
   Caution: the outputValueCalc below will require buffering up the entire
   contents element, which could be very large.

   Per other suggestions, this length element should probably use a simple
   type with a maxInclusive facet to constrain the maximum length value,
   and a dfdl:assert with a dfdl:checkConstraints(.) call, so as to avoid
   absurdly large nonsense values for this length.
-->
<element name="len" type="xs:unsignedInt" ...
    dfdl:outputValueCalc="{ dfdl:valueLength(../bigBlob/contents) }" />
....
....
<element name="bigBlob" dfdl:lengthKind="explicit" dfdl:length="{ ../len }"
    type="tns:blobInChunks65536"/>

<complexType name="blobInChunks65536">
  <sequence>
    <!-- The 'contents' element is needed for the outputValueCalc of the
         length field. -->
    <element name="contents">
      <complexType>
        <sequence dfdl:separator="">
          <element name="blobChunk" dfdl:occursCountKind="implicit"
              maxOccurs="unbounded">
            <complexType>
              <choice>
                <!-- Each full chunk has some modest fixed size, e.g., 65536. -->
                <element name="fullChunk"
                    dfdl:lengthKind="explicit" dfdl:length="65536">
                  <simpleType>
                    <restriction base="xs:hexBinary">
                      <!-- Useful, e.g., to validate data before unparsing. -->
                      <length value="65536"/>
                    </restriction>
                  </simpleType>
                </element>
                <!--
                  For the last chunk, DFDL lengthKind 'endOfParent' is not
                  yet implemented in Daffodil, so we use 'delimited' knowing
                  there is no delimiter specified; the element ends as far as
                  the scan can reach, which is limited by the extent of the
                  enclosing 'bigBlob' element.
                -->
                <element name="lastChunk"
                    dfdl:lengthKind="delimited" dfdl:terminator="">
                  <simpleType>
                    <restriction base="xs:hexBinary">
                      <!-- Useful, e.g., to validate data before unparsing. -->
                      <maxLength value="65535"/>
                    </restriction>
                  </simpleType>
                </element>
              </choice>
            </complexType>
          </element><!-- end blobChunk -->
        </sequence>
      </complexType>
    </element><!-- end contents -->
  </sequence>
</complexType>
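
For example, a 200,000-byte bigBlob parses as three 65,536-byte fullChunk
elements (196,608 bytes) followed by one 3,392-byte lastChunk, so no single
hexBinary value in the infoset is ever larger than 64 KiB.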

On Thu, Aug 4, 2022 at 10:10 AM Adams, Joshua <jad...@owlcyberdefense.com>
wrote:

> In addition to what Steve suggested, if you want to increase the memory
> available for the Daffodil CLI tool, you can change/set the following
> environment variables:
>
> DAFFODIL_JAVA_OPTS
> JAVA_OPTS
>
> The Daffodil CLI tool will check the DAFFODIL_JAVA_OPTS variable first but
> if it is not set will use the JAVA_OPTS environment variable.  If that
> isn't set either, it will default to the following options:
>
>   "-Xms1024m"
>   "-Xmx1024m"
>   "-XX:ReservedCodeCacheSize=128m"
>   "-Dlog4j.configurationFile=$CONFDIR/log4j2.xml"
>
> The "-Xmx1024m" is the most important here as this sets the upper limit
> for memory usage.  So a quick way to double the amount of memory for the
> CLI tool would be to do the following:
>
> export DAFFODIL_JAVA_OPTS="-Xms1024m -Xmx2048m
> -XX:ReservedCodeCacheSize=128m"
>
> Josh
> ------------------------------
> *From:* Steve Lawrence <slawre...@apache.org>
> *Sent:* Thursday, August 4, 2022 8:15 AM
> *To:* users@daffodil.apache.org <users@daffodil.apache.org>
> *Subject:* Re: Daffodil 3.2.1 - An Unexpected exception occurred...
>
> Thanks for reporting. How much memory have you given to your JVM? And
> what version of Daffodil are you using?
>
> It looks like the deliberate junk value you reference is about 1.5 GB.
> So Daffodil will try to create a 1.5 GB array to store the hex binary,
> and if you don't have enough memory it will result in the OOM exception.
>
> There are a couple of solutions here:
>
> 1) Put an assert on the length field to ensure it is a reasonable size.
> For example:
>
>    <xs:element name="MDO_MovieDataSize" type="xs:int" ...>
>      <xs:annotation>
>        <xs:appinfo source="http://www.ogf.org/dfdl/">
>          <dfdl:assert>{ . le 1000000 }</dfdl:assert>
>        </xs:appinfo>
>      </xs:annotation>
>    </xs:element>
>
> 2) Similar to above, put an xs:restriction on the length field to ensure
> it's a reasonable size, and add an assert to check that restriction:
>
>    <xs:element name="MDO_MovieDataSize" ...>
>      <xs:annotation>
>        <xs:appinfo source="http://www.ogf.org/dfdl/">
>          <dfdl:assert>{ dfdl:checkConstraints(.) }</dfdl:assert>
>        </xs:appinfo>
>      </xs:annotation>
>      <xs:simpleType>
>        <xs:restriction base="xs:int">
>          <xs:maxInclusive value="1000000" />
>        </xs:restriction>
>      </xs:simpleType>
>    </xs:element>
>
> 3) Set the "maxHexBinaryLengthInBytes" tunable. This will create a
> processing error if the length of a hexBinary field is larger than that
> tunable. This value defaults to 2GB, but could be set to a lower value
> if you know your hex binary will never be that large.
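>
> For example, tunables can be set from the CLI using the -T option (a
> sketch; "my.dfdl.xsd" and "data.bin" are placeholder names):
>
>   daffodil parse -TmaxHexBinaryLengthInBytes=1000000 -s my.dfdl.xsd data.bin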
>
> - Steve
>
> On 8/3/22 6:40 PM, Thompson, Mark M [US] (DS) wrote:
> > All,
> >
> > I am reporting the occurrence of an exception as requested. The attached
> > file contains a trace of the Exception with what I believe is the
> > relevant info.
> >
> >    * Command executed:
> >
> >      daffodil -t -vv parse -s ..\..\..\MPRemote.dfdl.xsd -V limited -o
> >      image_file_name-8.invalid.dfdl.xml -r MP_REMOTE_Fields image_file_name-8.invalid
> >
> >    * It appears that Daffodil does not like large values (deliberate junk
> >      in this case) when used as a size for an xs:hexBinary element. See
> >      <MDO_MovieDataSize> and <MDO_MovieData> respectively in the attached
> >      trace. In this case, there is nowhere near that much data available
> >      in the test binary input. Normally, in the case of insufficient data,
> >      Daffodil errors out gracefully and indicates that there was
> >      insufficient data.
> >
> > Overview:
> >
> >    * I am not at liberty to provide the actual schema files.
> >    * I may be able to provide test messages if necessary. I’m hoping that
> >      the attached trace provides more than enough info.
> >    * The input test files to Daffodil are binary.
> >    * The command used in this case is listed above.
> >    * Command line options: -t -vv parse -V limited
> >    * Daffodil version: 3.2.1
> >    * Offending element: <MDO_MovieData>
> >
> > Thank you for your time,
> >
> >     Mark M. Thompson
> >
> >     Northrop Grumman Defense Systems
> >
> >     Software Engineer
> >
> >     (818) 712-7439
> >
>
>
