New submission from James Hennessy:
The tempfile.SpooledTemporaryFile class doesn't correctly preserve data for
text (non-binary) SpooledTemporaryFile objects when Unicode characters are
written. The attached program demonstrates the failure. It creates a
SpooledTemporaryFile object, writes 20 string characters to it, and then tries
to read them back. If the SpooledTemporaryFile has rolled over to disk, as it
does in the demonstration program, then the data is not read back correctly.
Instead, an exception is recognized due to the data in the SpooledTemporaryFile
being corrupted.
The problem is this statement in tempfile.py, in the rollover() method:
newfile.seek(file.tell(), 0)
The "file" variable references a StringIO object, whose tell() and seek()
methods count in characters, not bytes, yet this value is applied to a
TemporaryFile object, whose tell() and seek() methods deal in bytes, not
characters. The demonstration program writes 10 characters to the
SpooledTemporaryFile. Since 10 exceeds the rollover size of 5, the
implementation writes the 10 characters to the TemporaryFile and then seeks to
position 10 in the TemporaryFile, which it thinks is the end of the stream.
But those 10 characters got encoded to 30 bytes, and seek position 10 is in the
middle of the UTF-8 sequence for the fourth character. The next write to the
SpooledTemporaryFile starts overlaying bytes from there. The attempt to read
back the data fails because the byte stream no longer represents a valid UTF-8
stream of data.
The related problem is the inconsistency of the behavior of tell() and seek()
for text (non-binary) SpooledTemporaryFile objects. If the data hasn't yet
rolled over to a TemporaryFile, they count in string characters. If the data
has rolled over, they count in bytes.
A quick fix for this is to remove the seek() in the rollover() method. I
presume it's there to preserve the stream position if an explicit call to
rollover() is made, since for an implicit call, the position would be at the
end of the stream already. This quick fix, therefore, would introduce an
external incompatibility in the behavior of rollover().
Another possibility is to never use a StringIO object, but to always buffer
data in a BytesIO object, as is done for binary SpooledTemporaryFile objects.
This has the advantage of "fixing" the tell() and seek() inconsistency, making
them count bytes all the time. The downside, of course, is that data that
doesn't end up being rolled over to a TemporaryFile gets encoded and decoded, a
round trip that could otherwise be avoided.
This problem can be circumvented by a user of SpooledTemporaryFile by
explicitly seeking to the end of the stream after every write to the
SpooledTemporaryFile object: spool.seek(0, io.SEEK_END)
--
components: Library (Lib)
files: showbug.py
messages: 263147
nosy: James Hennessy
priority: normal
severity: normal
status: open
title: SpooledTemporaryFile doesn't correctly preserve data for text
(non-binary) SpooledTemporaryFile objects when Unicode characters are written
type: behavior
versions: Python 3.4
Added file: http://bugs.python.org/file42423/showbug.py
___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26730>
___
___
Python-bugs-list mailing list
Unsubscribe:
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com