The "left over data" message means that Daffodil parsed some amount of data that matched what the DFDL schema described, but there was still data left over that could not be parsed.
To explain why this can happen, imagine we have a schema like the one below that just parses an unbounded array of single-digit integers until we run out of data:

  <xs:element name="numbers">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="number" type="xs:int"
          maxOccurs="unbounded"
          dfdl:occursCountKind="implicit"
          dfdl:lengthKind="explicit" dfdl:length="1" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

So if we have data that is "2357", that would parse to the infoset:

  <numbers>
    <number>2</number>
    <number>3</number>
    <number>5</number>
    <number>7</number>
  </numbers>

Now, one might ask what happens if there is something that isn't a number in our data, for example "23a57". To answer this, we need to understand how Daffodil parses with things like dfdl:occursCountKind="implicit" and maxOccurs="unbounded". In this case, Daffodil does not know how many elements are in the array, so it repeatedly speculatively parses new array elements until one fails (or until it runs out of data). When one fails, it assumes it speculatively parsed too far, and wherever it ended before that failure was the last element of the array.

So in the case of "23a57", we successfully speculatively parse "2" and "3". Then we try to speculatively parse the "a" data as a number, but that fails and creates a parse error. This parse error tells Daffodil that it speculated too far, and so the only elements in the array are "2" and "3". At this point, Daffodil continues parsing based on whatever schema elements follow the array. But in our schema, no elements follow, and so Daffodil considers the parse to be complete and successful. Daffodil doesn't necessarily need to consume all the data for a parse to be a success, it just needs to parse enough data to match the schema. So in this case, you would get the infoset:

  <numbers>
    <number>2</number>
    <number>3</number>
  </numbers>

And a warning about left over data, because it didn't parse all the data.
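The speculative-parse loop described above can be sketched in a few lines of Python. This is a hypothetical simplification for illustration only, not Daffodil's actual implementation; the parse_numbers function is an invented name:

```python
def parse_numbers(data: str):
    """Speculatively parse single-digit integers until one fails."""
    numbers = []
    pos = 0
    while pos < len(data):
        ch = data[pos]
        if ch.isdigit():
            # Speculative parse of the next array element succeeded.
            numbers.append(int(ch))
            pos += 1
        else:
            # Parse error: we speculated one element too far, so the
            # array ends at the previous element.
            break
    # Anything after `pos` is the "left over data" from the warning.
    return numbers, data[pos:]

print(parse_numbers("2357"))   # ([2, 3, 5, 7], '')
print(parse_numbers("23a57"))  # ([2, 3], 'a57')
```

Both calls "succeed" in the sense that the returned list matches the schema; only the second leaves data unconsumed.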
The "left over data" warning exists for cases just like this. It often means we ran into bad data while speculatively parsing (e.g. "a" wasn't a number) and the schema allowed the parse to end there. Because this sometimes indicates an error, we warn the user that Daffodil was able to successfully parse data according to the schema, but that there was left over data, which sometimes implies there was an error. Note that it doesn't always mean there is an error, which is why it's just a warning.

In your specific case, it probably does mean there was an error. It is likely that Daffodil was speculatively parsing rows in your CSV file but came across a row that wasn't a valid row (maybe it's missing a field, maybe it ran into a decode error, maybe a field had an incorrect type, etc.). This told Daffodil that the speculative parsing of the rows was finished, and with no elements following the rows, it finished parsing. The left over data message lets you know approximately where Daffodil was when it ran into an invalid row and finished the parse. This explains why you get different consumed/left over bit positions for different files: the invalid rows happen in different locations in the files.

This also explains why the intermediate XML validates but the unparsed data differs. Daffodil stopped parsing part way through, so your XML infoset only represents the subset of data before it ran into the invalid row. So when you unparse that infoset, you only get back a subset of the original data, and it appears to have been truncated.

As to the maximum size question, 3.0.0 had a bug where there was a limit of something like 256MB during parse and some potential memory leaks during unparse. But your errors appear to be happening well under the limit that causes those errors (and you'd get a different error message), so I doubt that's the issue here.
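The truncated round-trip can be illustrated with a toy Python sketch. This is not how Daffodil or your schema actually works; the pipe-delimited format, the three-field rows, and the parse_rows function are all assumptions made up for the example:

```python
def parse_rows(data: str, num_fields: int = 3):
    """Speculatively parse pipe-delimited rows; stop at the first bad row."""
    rows, consumed = [], 0
    for line in data.splitlines(keepends=True):
        fields = line.rstrip("\n").split("|")
        if len(fields) != num_fields:
            # Invalid row: the speculative parse of the row array ends here.
            break
        rows.append(fields)
        consumed += len(line)  # bytes successfully consumed so far
    return rows, consumed

data = "a|b|c\nd|e|f\nbad-row\ng|h|i\n"
rows, consumed = parse_rows(data)

# "Unparsing" the infoset reproduces only the consumed prefix of the data.
unparsed = "".join("|".join(r) + "\n" for r in rows)
print(unparsed == data[:consumed])  # True: a truncated copy of the input
print(len(data) - consumed)         # size of the "left over data"
```

The valid rows round-trip exactly, but everything from the bad row onward is left over, which is why the diff shows the unparsed file as a truncated version of the original.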
Those issues are fixed in the current development branch and will be part of Daffodil 3.1.0 when it is released.

On 3/10/21 12:51 PM, Attila Horvath wrote:
> All,
>
> I am passing various pipe delimited CSV test (case) files thru Daffodil [3.0]
> on Debian.
>
> Daffodil w/ validate set to "on" throws warnings about 'left over data' for
> following test cases (see snippet image below).
> Questions:
>
> 1. Why do number of "consumed" bits vary from case to case? Follow up question,
>    why do "left over ... starting at byte" locations vary? I assume that may be
>    because record lengths from file to file may vary.
> 2. Does Daffodil default to maximum input/output file size limitation?
> 3. If so, can that they be overridden w/ larger size(s)?
> 4. What is Daffodil's "absolute" maximum file size limitations for input and
>    output?
> 5. At end of each test case, parsed source and unparsed target files are diff'd
>    showing they differ but xmllint shows in each case the intermediate XML
>    files validate successfully against the DFDL schema. I assume that is
>    because only whole records are written to intermediate (parsed) XML file -
>    not partial records in which case the XML file will contain truncated data
>    from the original source hence the warning. Is this correct?
>
> image.png
>
> Thx in advance
>
> Attila