The "left over data" message means that Daffodil parsed some amount of data that matched what the DFDL schema described, but there was still data left over that could not be parsed.
To explain why this can happen, imagine we have a schema like the one below that just parses an unbounded array of single-digit integers until we run out of data:

  <xs:element name="numbers">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="number" type="xs:int"
          maxOccurs="unbounded"
          dfdl:occursCountKind="implicit"
          dfdl:lengthKind="explicit" dfdl:length="1" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

So if we have data that is "2357", that would parse to the infoset:

  <numbers>
    <number>2</number>
    <number>3</number>
    <number>5</number>
    <number>7</number>
  </numbers>

Now, one might ask what happens if there is something that isn't a number in our data, for example "23a57". To answer this, we need to understand how Daffodil parses with things like dfdl:occursCountKind="implicit" and maxOccurs="unbounded". In this case, Daffodil does not know how many elements are in the array, so it repeatedly speculatively parses new array elements until one fails (or until it runs out of data). When one fails, it assumes it speculatively parsed too far, and wherever it ended before that failure was the last element of the array.

So in the case of "23a57", we successfully speculatively parse "2" and "3". Then we try to speculatively parse the "a" data as a number, but that fails and creates a parse error. This parse error tells Daffodil that it speculated too far, and so the only elements in the array are "2" and "3". At this point, Daffodil continues parsing based on whatever schema elements follow the array. But in our schema, no elements follow, and so Daffodil considers the parse to be complete and successful. Daffodil doesn't necessarily need to consume all the data for a parse to be a success, it just needs to parse enough data to match the schema. So in this case, you would get the infoset:

  <numbers>
    <number>2</number>
    <number>3</number>
  </numbers>

And a warning about left over data, because it didn't parse all the data.
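The speculative-parse loop described above can be sketched in a few lines of Python. This is a hypothetical simplification for illustration only, not Daffodil's actual implementation; the parse_numbers function is an invented name:

```python
def parse_numbers(data: str):
    """Speculatively parse single-digit integers until one fails."""
    numbers = []
    pos = 0
    while pos < len(data):
        ch = data[pos]
        if ch.isdigit():
            # Speculative parse of the next array element succeeded.
            numbers.append(int(ch))
            pos += 1
        else:
            # Parse error: we speculated one element too far, so the
            # array ends at the previous element.
            break
    # Anything after `pos` is the "left over data" from the warning.
    return numbers, data[pos:]

print(parse_numbers("2357"))   # ([2, 3, 5, 7], '')
print(parse_numbers("23a57"))  # ([2, 3], 'a57')
```

Both calls "succeed" in the sense that the returned list matches the schema; only the second leaves data unconsumed.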
The "left over data" warning exists for cases just like this. It often means we ran into bad data while speculatively parsing (e.g. "a" wasn't a number) and the schema allowed the parse to end there. Because this sometimes indicates an error, we warn the user that Daffodil was able to successfully parse data according to the schema, but that there was left over data, which sometimes implies there was an error. Note that it doesn't always mean there is an error, which is why it's just a warning.

In your specific case, it probably does mean there was an error. It is likely that Daffodil was speculatively parsing rows in your CSV file but came across a row that wasn't a valid row (maybe it's missing a field, maybe it ran into a decode error, maybe a field had an incorrect type, etc.). This told Daffodil that the speculative parsing of the rows was finished, and with no elements following the rows, it finished parsing. The left over data message lets you know approximately where Daffodil was when it ran into an invalid row and finished the parse. This explains why you get different consumed/left over bit positions for different files: the invalid rows happen in different locations in the files.

This also explains why the intermediate XML validates but the unparsed data differs. Daffodil stopped parsing part way through, so your XML infoset only represents the subset of data before it ran into the invalid row. So when you unparse that infoset, you only get back a subset of the original data, and it appears to have been truncated.

As to the maximum size question, 3.0.0 had a bug where there was a limit of something like 256MB during parse and some potential memory leaks during unparse. But your errors appear to be happening well under the limit that causes those errors (and you'd get a different error message), so I doubt that's the issue here.
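The truncated round-trip can be illustrated with a toy Python sketch. This is not how Daffodil or your schema actually works; the pipe-delimited format, the three-field rows, and the parse_rows function are all assumptions made up for the example:

```python
def parse_rows(data: str, num_fields: int = 3):
    """Speculatively parse pipe-delimited rows; stop at the first bad row."""
    rows, consumed = [], 0
    for line in data.splitlines(keepends=True):
        fields = line.rstrip("\n").split("|")
        if len(fields) != num_fields:
            # Invalid row: the speculative parse of the row array ends here.
            break
        rows.append(fields)
        consumed += len(line)  # bytes successfully consumed so far
    return rows, consumed

data = "a|b|c\nd|e|f\nbad-row\ng|h|i\n"
rows, consumed = parse_rows(data)

# "Unparsing" the infoset reproduces only the consumed prefix of the data.
unparsed = "".join("|".join(r) + "\n" for r in rows)
print(unparsed == data[:consumed])  # True: a truncated copy of the input
print(len(data) - consumed)         # size of the "left over data"
```

The valid rows round-trip exactly, but everything from the bad row onward is left over, which is why the diff shows the unparsed file as a truncated version of the original.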
Those issues are fixed in the current development branch and will be part of Daffodil 3.1.0 when it is released.

On 3/10/21 12:51 PM, Attila Horvath wrote:
> All,
>
> I am passing various pipe delimited CSV test (case) files thru Daffodil [3.0]
> on Debian.
>
> Daffodil w/ validate set to "on" throws warnings about 'left over data' for
> following test cases (see snippet image below).
> Questions:
>
> 1. Why do number of "consumed" bits vary from case to case? Follow up question,
>    why do "left over ... starting at byte" locations vary? I assume that may be
>    because record lengths from file to file may vary.
> 2. Does Daffodil default to maximum input/output file size limitation?
> 3. If so, can that they be overridden w/ larger size(s)?
> 4. What is Daffodil's "absolute" maximum file size limitations for input and
>    output?
> 5. At end of each test case, parsed source and unparsed target files are diff'd
>    showing they differ but xmllint shows in each case the intermediate XML
>    files validate successfully against the DFDL schema. I assume that is
>    because only whole records are written to intermediate (parsed) XML file -
>    not partial records in which case the XML file will contain truncated data
>    from the original source hence the warning. Is this correct?
>
> image.png
>
> Thx in advance
>
> Attila