The difference here is that this warning only appears when *compiling*
the schema, just to alert you that the schema might not give you the
expected behavior. In this case, it's relatively easy to know that
the schema is doing what we expect and that the warning can be ignored.

However, the warning about left over data appears when *parsing*, and
only in some cases. So you'd probably need to verify on each parse
whether that warning is safe to ignore. In most cases with this
schema, it probably does mean that there's just a left over NL and it
can be ignored. But it's also possible that some other parse error
occurred and Daffodil stopped parsing halfway through the data, leaving
more than just a newline of left over data. In that case, this warning
might actually be a problem.

And in fact, some use cases might even consider left over data an error
because the schema doesn't describe the complete data. This is actually
what the Daffodil NiFi processor does. If the entire data isn't parsed,
the processor considers it an error.

So while these are both technically warnings, they have different
severities and potential implications.

Also, you can actually get rid of the compile-time warning without
documentFinalTerminatorCanBeMissing by removing the ambiguity of the
sequences in the choice. For example, we could replace the choice with
the following:

  <xs:element name="EOL" type="xs:string" minOccurs="0"
    dfdl:initiator="%NL;" dfdl:lengthKind="explicit" dfdl:length="0" />

So we have an optional zero-length string element with the NL initiator.
If the NL exists, then the EOL element will be in the infoset and the NL
will be written back out on unparse. If the data does not end with the
NL, the initiator will not be found and the EOL element will not be in
the infoset (which is valid since it's optional).

This is modeling the NL as data as Mike mentioned in a previous email.
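
As a rough, untested sketch of where that element might sit: I'm assuming
here that the root element is named csv (as the warning text suggests),
that the repeating element is named row, and that the rows are wrapped in
their own inner sequence. The actual schema isn't shown in this thread, so
those names and the layout are only illustrative, and the "..." stands for
whatever properties the row element already has.

  <xs:element name="csv">
    <xs:complexType>
      <xs:sequence>
        <!-- assumed: the existing repeating rows, in their own sequence -->
        <xs:sequence>
          <xs:element name="row" maxOccurs="unbounded" ... />
        </xs:sequence>
        <!-- optional zero-length element that models the final NL as data -->
        <xs:element name="EOL" type="xs:string" minOccurs="0"
          dfdl:initiator="%NL;" dfdl:lengthKind="explicit" dfdl:length="0" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

Since EOL is an element, it shows up in the infoset, so on unparse there
is no ambiguity about whether the final NL should be written.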


On 11/10/19 9:42 AM, Costello, Roger L. wrote:
> Steve wrote:
> 
>> I think it would be reasonable to 
>> ignore this warning.
> 
> But, but, but, ...
> 
> Mike said (paraphrasing) that it is unwise to officially publish a DFDL 
> schema that produces warnings on valid data.
> 
> It appears that it is impossible to avoid getting a warning message (for the 
> CSV data format where the last record of a CSV file may or may not have a 
> newline) until dfdl:documentFinalTerminatorCanBeMissing="yes" is implemented. 
> Do you agree?
> 
> /Roger
> 
> -----Original Message-----
> From: Steve Lawrence <[email protected]> 
> Sent: Sunday, November 10, 2019 9:32 AM
> To: [email protected]
> Subject: [EXT] Re: Is it okay to officially publish a DFDL schema that 
> produces warnings on valid input data?
> 
> When unparsing a choice, we use the infoset to determine which branch of the 
> choice to unparse. For example, say we had this choice:
> 
>   <xs:choice>
>     <xs:element name="A" type="xs:string" ... />
>     <xs:element name="B" type="xs:int" ... />
>   </xs:choice>
> 
> If the infoset contained the "A" element, then we would unparse the first 
> branch of the choice. If the infoset contained the "B" element, then we would 
> unparse the second.
> 
> However, in this new choice you have, both branches only contain a sequence, 
> which do not have a representation in the infoset. So when unparsing we don't 
> know which branch to take.
> 
> That warning is trying to alert you that Daffodil will just have to pick one, 
> and that it might not be the one you expected. Daffodil will currently always 
> unparse the first of the ambiguous branches.
> 
> So this warning is actually normal and expected in this case. I think it 
> would be reasonable to ignore this warning.
> 
> 
> On 11/10/19 8:54 AM, Costello, Roger L. wrote:
>> Mike wrote:
>>
>> I suggest adding this
>>
>> <choice>
>>
>>    <sequence dfdl:initiator="%NL;" />
>>
>>    <sequence />
>>
>> </choice>
>>
>> At the end of the schema after the repeating row element.
>>
>> This will absorb and discard any final newline.
>>
>> Oh! That is a wicked cool idea! I gave it a try. Daffodil doesn't seem to 
>> like it:
>>
>> [warning] Schema Definition Warning: Multiple choice branches are 
>> associated with the end of element {}csv.
>>
>> Note that elements with dfdl:outputValueCalc cannot be used to 
>> distinguish choice branches.
>>
>> Note that choice branches with entirely optional content are not allowed.
>>
>> What does that message mean? How to fix it?
>>
>> /Roger
>>
>> *From:* Beckerle, Mike <[email protected]>
>> *Sent:* Sunday, November 10, 2019 7:56 AM
>> *To:* [email protected]
>> *Subject:* [EXT] Re: Is it okay to officially publish a DFDL schema 
>> that produces warnings on valid input data?
>>
>> I would avoid this.
>>
>> One thing you need to take a position on is whether on unparsing you 
>> generate this final new line, or not, or try to preserve whatever the file 
>> had originally.
>>
>> Choosing to always generate this, or always omit it is canonicalization.
>>
>> I suggest adding this
>>
>> <choice>
>>
>>    <sequence dfdl:initiator="%NL;" />
>>
>>    <sequence />
>>
>> </choice>
>>
>> At the end of the schema after the repeating row element.
>>
>> This will absorb and discard any final newline.
>>
>> If you want to preserve the final newline then you have to model it as 
>> data so change the first branch of the choice above and make it an 
>> element named 'finalNewLine' with initiator and type string with explicit 
>> length 0.
>>
>> ----------------------------------------------------------------------
>> ----------
>>
>> *From:*Costello, Roger L. <[email protected] 
>> <mailto:[email protected]>>
>> *Sent:* Saturday, November 9, 2019 8:05:19 AM
>> *To:* [email protected] <mailto:[email protected]>
>> <[email protected] <mailto:[email protected]>>
>> *Subject:* Is it okay to officially publish a DFDL schema that 
>> produces warnings on valid input data?
>>
>> Hi Folks,
>>
>> Suppose you are creating the official, standard DFDL schema for a data 
>> format. 
>> Would you be okay with officially releasing a schema that generates 
>> warnings on data that is valid?
>>
>> Here's an example. The RFC for CSV (RFC 4180) says that CSV files 
>> consist of records separated by newlines. Each record consists of 
>> fields separated by commas. The last record may or may not have a new line.
>>
>> Suppose the last record of a CSV file has a newline. My DFDL schema 
>> generates this
>> warning:
>>
>> *[warning] Left over data. Consumed 1680 bit(s) with at least 16 
>> bit(s) remaining.*
>>
>> I am thinking that that warning is okay. Why? Because when the last 
>> record has a newline, then the file /really does/ have left over data 
>> - the newline on the last record. So, a warning is not unreasonable.
>>
>> Well, that's what I think. I might be thinking wrongly. What do you 
>> think? Would you ever officially release a DFDL schema that generates 
>> warnings on valid input data?
>>
>> /Roger
>>
> 
