I think --leftover-data could apply to --stream as well. The --stream option essentially says to keep parsing the data until either 1) a parse finishes with no remaining data or 2) a parse successfully consumes zero bits. In the latter case, we must stop or we would just infinitely parse nothing. So --leftover-data could apply in the zero-data-parsed case, i.e. we stream a bunch, at some point we "successfully" parse zero data, and more data remains. This could be treated as an error, a warning, or be ignored depending on this flag.
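
To make that concrete, here is a rough sketch of the streaming loop described above, written against the Java API rather than the CLI internals. This is illustrative only: the japi class and method names below are from memory and may need adjusting, the schema/data file names are placeholders, and the --leftover-data policy itself is still just a proposal. The point is only to show where the zero-bits-consumed stop condition and a leftover-data check would sit:

    // Rough sketch only; japi names from memory, not the actual CLI code.
    import java.io.FileInputStream;
    import org.apache.daffodil.japi.Daffodil;
    import org.apache.daffodil.japi.DataProcessor;
    import org.apache.daffodil.japi.ParseResult;
    import org.apache.daffodil.japi.infoset.JDOMInfosetOutputter;
    import org.apache.daffodil.japi.io.InputSourceDataInputStream;

    public class StreamSketch {
      public static void main(String[] args) throws Exception {
        // schema.dfdl.xsd and data.bin are placeholder file names
        DataProcessor dp = Daffodil.compiler()
            .compileFile(new java.io.File("schema.dfdl.xsd"))
            .onPath("/");
        InputSourceDataInputStream input =
            new InputSourceDataInputStream(new FileInputStream("data.bin"));

        long prevBitPos = 0;
        while (input.hasData()) {
          ParseResult res = dp.parse(input, new JDOMInfosetOutputter());
          if (res.isError()) break;            // a parse failed outright
          long bitPos = res.location().bitPos1b() - 1;
          if (bitPos == prevBitPos) break;     // "success" that consumed zero bits: must stop
          prevBitPos = bitPos;
        }
        // If we stopped with data still unconsumed, this is where a
        // --leftover-data=error/warning/ignore policy would apply.
        if (input.hasData()) {
          System.err.println("left over data after streaming parse");
        }
      }
    }

Whatever the flag ends up being named, the decision point is that final hasData() check after the loop exits.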
Yeah, I don't know of anyone that depends on this behavior. And it's not very well documented, so I would be surprised if people even know about it. It's also a more restrictive change, so if people did depend on this behavior, they would very quickly see the error, and hopefully before the new release gets to production. As long as we're clear in the release notes, I think the benefit of defaulting to the arguably more intuitive behavior outweighs the potential (and I think uncommon) chances of breaking things. But perhaps such a change needs to wait until a 4.0 release when breaking backwards compatibility is more acceptable. There are a handful of deprecated things we need to remove at some point; maybe this is one of them.

On 3/15/21 10:05 AM, Sloane, Brandon wrote:
> I like --leftover-data=error/warning/ignore.
>
> I'm less clear about what the default should be. If we were starting from scratch, I would say error is the clear answer. However, I too am concerned about the backwards compatibility issue. Most Daffodil deployments I have seen use it through the CLI instead of the library, so I am not particularly convinced by our not promising backwards compatibility there. On the other hand, I don't remember ever seeing a case where we wouldn't want to error on leftover data.
>
> One potential complication is --stream mode. I think it would be intuitive enough to just ignore the value of --leftover-data when --stream is active; and I don't really see an alternative.
> --------------------------------------------------------------------------------
> *From:* Steve Lawrence <slawre...@apache.org>
> *Sent:* Monday, March 15, 2021 8:55 AM
> *To:* users@daffodil.apache.org
> *Subject:* Re: XML file size limitation?
>
> I'm definitely open to having some way to make this behavior configurable.
>
> Perhaps we just add a new flag that says whether left over data is an error or not? Not sure of a good name, but maybe --error-on-leftover is clear enough? Or perhaps it should be something like --leftover-data=error/warning/ignore? Probably lots of different ways we could handle this.
>
> I'm also not against making it so we error by default. From a newcomer's perspective, that does seem like maybe the expected behavior. Unfortunately, that isn't a backwards compatible change, but it is a very obvious change (you would get an error instead of an easily overlooked warning). And I don't think we make any guarantees about CLI backwards compatibility.
>
> I'm also open to other ideas, but whatever the solution it does need to be easily configurable. The "do not error on left over data" thing is very useful when developing schemas, where it's very common to only describe and test the first part of a file until the schema is complete.
>
> Any other thoughts on an intuitive configuration or option for this?
>
> - Steve
>
> On 3/11/21 11:15 AM, Attila Horvath wrote:
>> Thx Steve - that was very helpful.
>>
>> An excellent explanation deserves an attempt to respond in kind.
>>
>> re: "When one fails...it speculatively parsed too far..."
>>
>> I think I understand. What threw me was that during 'parse' processing Daffodil threw warning(s) only, no error(s), to indicate invalid elements were encountered. Admittedly my script was NOT checking Daffodil's exit code for a non-zero value - as it should have - now it does. But this brings up a point.
>>
>> The sample scenario you'd described perhaps assumes something - how end users determine parsing successes/failures. 1+ years ago when I attended Roger's training class we were using the Windows-based CLI to run Daffodil. I don't recall having looked at Daffodil's exit codes back then - we looked for error messages and/or XML outputs.
>>
>> Point being, assuming users don't look at exit codes (even though they should), no 'errors' and an output XML that validates regardless of 'warnings' may be confusing. I'm sure there is some rationale for Daffodil's behaviour of instantiating an XML that validates even though errors were encountered. The usefulness of an XML file, representing a subset of the original input data, that validates is unclear to me.
>>
>> Just an opinion, without understanding all the background, I'd have thought/expected one of the following behaviors:
>>
>> 1. Taking current behavior into consideration, instead of warnings, in addition to instantiating output XML Daffodil should generate error messages to identify offending fields/elements that are encountered, along with warnings re: "left over data".
>> 2. Suppress the output XML file, with warnings and/or errors to identify offending fields/elements that are encountered.
>>
>> I recommend the above in light of explanations you and Mike B. provided regarding Daffodil's behavior relating to the 'validate' option:
>>
>> * *Subject:* regex |AND| left over data
>> * *Subject:* regex not catching error(s)
>>   <https://lists.apache.org/thread.html/rfb1866e14e7d6d59d0eb339979afd2e187a00d317c6dca3d815b4511%40%3Cusers.daffodil.apache.org%3E>
>>
>> Thx
>>
>> Attila
>>
>> On Wed, Mar 10, 2021 at 3:53 PM Steve Lawrence <slawre...@apache.org> wrote:
>>
>> The "left over data" message means that Daffodil parsed some amount of data that met what the DFDL schema described, but there was still data left over that couldn't be parsed.
>>
>> To explain why this can happen, imagine we have a schema like below that just parses an unbounded array of single digit integers until we run out of data:
>>
>>     <xs:element name="numbers">
>>       <xs:complexType>
>>         <xs:sequence>
>>           <xs:element name="number" type="xs:int"
>>             maxOccurs="unbounded" dfdl:occursCountKind="implicit"
>>             dfdl:lengthKind="explicit" dfdl:length="1" />
>>         </xs:sequence>
>>       </xs:complexType>
>>     </xs:element>
>>
>> So if we have data that is "2357", that would parse to the infoset:
>>
>>     <numbers>
>>       <number>2</number>
>>       <number>3</number>
>>       <number>5</number>
>>       <number>7</number>
>>     </numbers>
>>
>> Now, one might ask what happens if there is something that isn't a number in our data, for example "23a57".
>>
>> To answer this, we need to understand how Daffodil parses with things like dfdl:occursCountKind="implicit" and maxOccurs="unbounded". In this case, Daffodil does not know how many elements are in the array, so it repeatedly speculatively parses new array elements until one fails (or until it runs out of data). When one fails, it assumes it speculatively parsed too far, and wherever it ended before that failure was the last element of the array.
>>
>> So in the case of "23a57", we successfully speculatively parse "2" and "3".
>> Then we try to speculatively parse the "a" data as a number, but that will fail and create a parse error. This parse error tells Daffodil that it speculated too far, and so the only elements in the array are "2" and "3".
>>
>> At this point, Daffodil continues parsing based on whatever schema elements follow the array. But in our schema, no elements follow, and so Daffodil considers the parse to be complete and successful. Daffodil doesn't necessarily need to parse all the data to be a success, it just needs to parse enough data to match the schema. So in this case, you would get the infoset:
>>
>>     <numbers>
>>       <number>2</number>
>>       <number>3</number>
>>     </numbers>
>>
>> And a warning about left over data, because it didn't parse all the data.
>>
>> The reason for this "left over data" warning is for cases just like this. It often means we ran into bad data while speculatively parsing (e.g. "a" wasn't a number), and the schema allowed it. Because this is sometimes an error, we warn the user that Daffodil was able to successfully parse data according to the schema, but that there was left over data, which sometimes implies there was an error. Note that it doesn't always mean there is an error, which is why it's just a warning.
>>
>> In your specific case, it probably does mean there was an error. It is likely that Daffodil was speculatively parsing rows in your CSV file but came across a row that wasn't a valid row (maybe it's missing a field, maybe it ran into a decode error, maybe a field had an incorrect type, etc.). This told Daffodil that the speculative parsing of the rows was finished, and with no elements following the rows, it finished parsing. The left over data message lets you know approximately where Daffodil was when it ran into an invalid row and finished the parse.
>>
>> This explains why you get different consumed/left over bit positions for different files: the invalid rows happen in different locations in the files.
>>
>> This also explains why the intermediate XML validates but the unparsed data differs. Daffodil stopped parsing part way through, so your XML infoset only represents the subset of data before it ran into the invalid row. So when you unparse that infoset, you only get back a subset of the original data and it appears to have been truncated.
>>
>> As to the maximum question, 3.0.0 had a bug where there was a limit of something like 256MB during parse and some potential memory leaks during unparse. But your errors appear to be happening well under the limit that causes those errors (and you'd get a different error message), so I doubt that's the issue here. Those issues are fixed in the current development branch and will be part of Daffodil 3.1.0 when it is released.
>>
>> On 3/10/21 12:51 PM, Attila Horvath wrote:
>> > All,
>> >
>> > I am passing various pipe delimited CSV test (case) files thru Daffodil [3.0] on Debian.
>> >
>> > Daffodil w/ validate set to "on" throws warnings about 'left over data' for the following test cases (see snippet image below).
>> >
>> > Questions:
>> >
>> > 1. Why does the number of "consumed" bits vary from case to case? Follow up question: why do the "left over ... starting at byte" locations vary? I assume that may be because record lengths from file to file may vary.
>> > 2. Does Daffodil default to a maximum input/output file size limitation?
>> > 3. If so, can they be overridden w/ larger size(s)?
>> > 4. What are Daffodil's "absolute" maximum file size limitations for input and output?
>> > 5. At the end of each test case, the parsed source and unparsed target files are diff'd, showing they differ, but xmllint shows in each case the intermediate XML files validate successfully against the DFDL schema. I assume that is because only whole records are written to the intermediate (parsed) XML file - not partial records, in which case the XML file will contain truncated data from the original source, hence the warning. Is this correct?
>> >
>> > image.png
>> >
>> > Thx in advance
>> >
>> > Attila
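
As an aside, for anyone scripting around the advice in this thread: rather than scraping the CLI's "left over data" warning text, a script can make the same check through the Java API and turn leftover data into a non-zero exit status itself. Here is a minimal sketch along those lines, with the same caveat as above that the japi names are from memory and may need adjusting; numbers.dfdl.xsd is a placeholder for the single-digit-numbers schema quoted earlier, and "23a57" is the bad-data example from that explanation:

    import java.io.ByteArrayInputStream;
    import org.apache.daffodil.japi.Daffodil;
    import org.apache.daffodil.japi.DataProcessor;
    import org.apache.daffodil.japi.ParseResult;
    import org.apache.daffodil.japi.infoset.JDOMInfosetOutputter;
    import org.apache.daffodil.japi.io.InputSourceDataInputStream;

    public class LeftoverCheck {
      public static void main(String[] args) throws Exception {
        DataProcessor dp = Daffodil.compiler()
            .compileFile(new java.io.File("numbers.dfdl.xsd"))  // placeholder schema file
            .onPath("/");

        byte[] data = "23a57".getBytes("US-ASCII");
        InputSourceDataInputStream input =
            new InputSourceDataInputStream(new ByteArrayInputStream(data));

        ParseResult res = dp.parse(input, new JDOMInfosetOutputter());
        if (res.isError()) {
          System.exit(1);                    // hard parse error
        } else if (input.hasData()) {
          // Successful parse, but not all data consumed: the "left over data" case.
          long consumedBits = res.location().bitPos1b() - 1;
          System.err.println("left over data; consumed " + consumedBits + " bit(s)");
          System.exit(1);                    // treat leftover data as an error
        }
      }
    }

The consumed count reported here corresponds to what the CLI warning prints, which is also why it varies from file to file: it depends on where the first record that fails speculative parsing happens to sit.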