DO NOT REPLY [Bug 20841] - linefeed character not handle properly on Windows.

bugzilla Thu, 26 Jun 2003 10:46:30 -0700

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=20841>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.


http://nagoya.apache.org/bugzilla/show_bug.cgi?id=20841

linefeed character not handle properly on Windows.





------- Additional Comments From [EMAIL PROTECTED]  2003-06-26 16:49 -------
As described, the attribute's value is first normalized by a text-serializer.
For characters in the range 0-127 this leaves them alone, except for a newline
(NL, decimal 10).  On the Windows platform the NL is turned into two
characters, CR,NL.  CR is the carriage return, decimal 13.


This normalized value is later passed to the html-serializer, which takes the
attribute value with the CR,NL combination, leaves the CR alone, but
normalizes the NL yet again, producing CR,CR,NL.
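
The double normalization can be reproduced with a small stand-alone sketch.
This is not the actual serializer code: the class DoubleNormalizationDemo, the
method normalize and the constant LINE_SEP are hypothetical names, and the
Windows separator is hard-coded so the effect shows up on any platform.

```java
// Hypothetical sketch, not the real serializer implementation.
final class DoubleNormalizationDemo {
    // Assumed Windows line separator (CR,NL), hard-coded for illustration.
    static final char[] LINE_SEP = {'\r', '\n'};

    // Naive normalization: every NL becomes LINE_SEP; a CR is copied as-is.
    static String normalize(String in) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c == '\n') {
                out.append(LINE_SEP);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String afterText = normalize("a\nb");     // stands in for the text-serializer pass
        String afterHtml = normalize(afterText);  // stands in for the html-serializer pass
        // The second pass re-expands the NL that follows the CR.
        System.out.println(afterHtml.equals("a\r\r\nb")); // prints true
    }
}
```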

Neither the text-serializer nor the html-serializer expects a CR,NL
Windows-style end-of-line combination.  They expect that the XML parser
has cleaned that up.  Both think that they are writing the final output, and
both turn a NL into CR,NL.

Possibility 1:
If this NL to CR,NL normalization never happened in either text or html
serialization, then the NL would stay a NL all the way through.

Possibility 2:
Both the text and html serializers could be more suspicious of their input and
look for a sequence of characters matching the internal character array
m_lineSep.  When running on Windows this is a two-character array holding
the CR,NL combination.  This wouldn't be a performance hit because they already
pause to do special processing when they hit a NL.  A sequence that matches
the one in m_lineSep could be left alone, without normalization on output.  On
other platforms this array is just a NL, so the input NL is left alone by the
serializers and it acts like "Possibility 1".  I'm just worried that some
legitimate form of input might not get normalized properly, either for
attributes or text, but I haven't thought of something that would break.
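
A minimal sketch of the check described in possibility 2, under stated
assumptions: only m_lineSep comes from the serializer code; the class
LineSepAwareNormalizer, its methods, and the lineSep parameter (standing in
for m_lineSep) are hypothetical.  The check runs only when a NL is hit, which
is where the serializers already pause.

```java
// Hypothetical sketch of possibility 2, not the real serializer code.
final class LineSepAwareNormalizer {

    // Expands a NL to lineSep unless that NL already completes a lineSep sequence.
    static String normalize(String in, char[] lineSep) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c != '\n') {
                out.append(c);            // everything but NL is copied through
            } else if (endsWithLineSep(in, i, lineSep)) {
                out.append(c);            // earlier chars of the separator were already copied
            } else {
                out.append(lineSep);      // bare NL: expand to the platform separator
            }
        }
        return out.toString();
    }

    // True if the characters of 'in' ending at index 'end' (inclusive) equal lineSep.
    static boolean endsWithLineSep(String in, int end, char[] lineSep) {
        int start = end - lineSep.length + 1;
        if (start < 0) {
            return false;
        }
        for (int k = 0; k < lineSep.length; k++) {
            if (in.charAt(start + k) != lineSep[k]) {
                return false;
            }
        }
        return true;
    }
}
```

With lineSep set to CR,NL, a second pass over already-normalized text leaves
it unchanged; with lineSep set to a single NL, the input NL passes straight
through, which matches the behaviour described for other platforms.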

Possibility 3:
Temporarily turn normalization off in the text-serializer.  This is tricky
because the code sees this serializer as a ContentHandler, which doesn't have a
way to do this.  Also, we might accidentally turn off this normalization when
the output really goes just to a text-serializer and no further.  One might
argue that a text-serializer should not be used to do normalization of
attribute values, but I think that changes in this area are harder to make
than the ones listed in possibility 2 (which I favour).

Regards,
Brian Minchau
