[issue8390] tarfile: use surrogates for undecode fields

2010-05-06 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@haypocalc.com:


--
status: open -> closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8390
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue8390] tarfile: use surrogates for undecode fields

2010-05-05 Thread Lars Gustäbel

Lars Gustäbel l...@gustaebel.de added the comment:

I think it is a good suggestion to use surrogateescape as the default, 
because (I hope) it produces the fewest errors and is the best choice if 
tarfile is used in connection with Python's filesystem calls.

- When reading tar headers, undecodable chars in filenames end up as 
surrogates. This way no information is lost. In principle tarfile is merely a 
gateway to a filesystem inside an archive, so it feels natural if it treats 
filenames the same as Python's filesystem calls.

- When writing tar headers, filenames with surrogate chars (e.g. from 
os.listdir()) will be converted back to bytes in the header (in the case of the 
gnu and ustar formats). Filenames will remain unchanged, which is exactly what 
one would expect.

- When writing pax headers, filenames with surrogates will raise a UnicodeError 
because we may only use strict utf-8 inside a pax header. This is actually no 
different from the status quo.
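
The write-side behaviour in the three points above can be sketched at the codec level, without tarfile itself. This is only an illustration: the byte string is a made-up filename, and utf-8 is assumed as the archive encoding.

```python
# Illustrative only: a made-up Latin-1 filename, assumed to be read as utf-8.
raw = b"caf\xe9.txt"  # 0xE9 on its own is not valid utf-8

# Reading: the undecodable byte becomes the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")

# Writing gnu/ustar headers: the surrogate is converted back to the same byte.
restored = raw if name.encode("utf-8", "surrogateescape") == raw else None

# Writing strict utf-8 (as required inside a pax header): lone surrogates
# are rejected with a UnicodeEncodeError, a subclass of UnicodeError.
try:
    name.encode("utf-8", "strict")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc
```

The strict encode is the step that fails for pax; the surrogateescape encode is the one that keeps gnu and ustar headers byte-for-byte unchanged.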

@Martin: As I understand it, the pax invalid-option is supposed to deal with 
the case when strings from a pax header are not representable in the user's 
encoding. In tarfile's case we don't have this problem when reading the archive 
until we try to extract it.

Unfortunately, POSIX says nothing about how to store bad filenames in a pax 
archive. tarfile raises an error. GNU tar fails silently: it just puts the 
unchanged original filename into the pax header without converting it to utf-8, 
thus violating the standard.

--
Added file: http://bugs.python.org/file17227/tarfile_surrogates.2.diff




[issue8390] tarfile: use surrogates for undecode fields

2010-05-05 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

Thank you for your review. I committed the patch as r80824 (I fixed the 
documentation, :versionadded -> :versionchanged), blocked as r80825 (3.2).

--

> Unfortunately, POSIX says nothing about how to store bad filenames in
> a pax archive. tarfile raises an error. GNU tar fails silently,
> it just puts the unchanged original filename into the pax header
> without converting it to utf-8, thus violating the standard.

Right. I opened a new issue about that: #8333. I consider that it's a different 
problem.

--
resolution:  -> fixed




[issue8390] tarfile: use surrogates for undecode fields

2010-05-03 Thread Lars Gustäbel

Lars Gustäbel l...@gustaebel.de added the comment:

Yes, I will soon have ;-) Please give me a few days...

--




[issue8390] tarfile: use surrogates for undecode fields

2010-05-03 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

A better fix might be to store the fields as bytes, but that would break 
compatibility, and unicode is practical in Python3.

--




[issue8390] tarfile: use surrogates for undecode fields

2010-05-03 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

My patch changes test_uname_unicode() of test_tarfile for the GNU and ustar 
formats (but not PAX). In GNU and ustar formats, the fields can be encoded in 
any encoding, and may contain invalid byte sequences.
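
A minimal sketch of that round trip for the GNU and ustar formats, assuming a Python 3 tarfile that accepts errors="surrogateescape". The member name is hypothetical, and it exercises the name field rather than the uname field covered by test_uname_unicode().

```python
import io
import tarfile

# Hypothetical member name containing a byte that is not valid utf-8.
raw_name = b"caf\xe9.txt"
name = raw_name.decode("utf-8", "surrogateescape")


def roundtrip(fmt):
    """Write one empty member in the given format, then read it back."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt,
                      encoding="utf-8", errors="surrogateescape") as tar:
        tar.addfile(tarfile.TarInfo(name))  # size 0, so no file object needed
    # The name field sits at the very start of the 512-byte header block.
    header_name = buf.getvalue()[:len(raw_name)]
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r",
                      encoding="utf-8", errors="surrogateescape") as tar:
        return header_name, tar.getnames()


results = {fmt: roundtrip(fmt)
           for fmt in (tarfile.GNU_FORMAT, tarfile.USTAR_FORMAT)}
```

In both formats the header should carry the original bytes, and reading the archive back should yield the same surrogate-escaped string.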

--




[issue8390] tarfile: use surrogates for undecode fields

2010-04-29 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

lars: Do you have an opinion about this suggestion?

--




[issue8390] tarfile: use surrogates for undecode fields

2010-04-23 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@haypocalc.com:


--
nosy: +lars.gustaebel




[issue8390] tarfile: use surrogates for undecode fields

2010-04-13 Thread STINNER Victor

New submission from STINNER Victor victor.stin...@haypocalc.com:

When reading a tar archive, tarfile decodes fields using the replace error 
handler by default. The result is that we lose information if there is an 
undecodable character.

Since PEP 383, undecodable filenames are stored using surrogates in 
Python3. I think that it's a good idea to use surrogates for tar, because it's 
a common problem to have undecodable data in a tar archive (see the unicode 
section of the tarfile documentation).
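
To illustrate the information loss, here is a small comparison of the two error handlers on a made-up byte string (an assumption for illustration, not data from a real archive):

```python
# A made-up field value with one byte that is not valid utf-8.
raw = b"caf\xe9.txt"

# "replace" substitutes U+FFFD, so the original byte cannot be recovered.
replaced = raw.decode("utf-8", "replace")
lossy = replaced.encode("utf-8")

# "surrogateescape" (PEP 383) keeps the byte as a lone surrogate, so
# encoding with the same handler gives back the original bytes.
escaped = raw.decode("utf-8", "surrogateescape")
lossless = escaped.encode("utf-8", "surrogateescape")
```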

--
components: Library (Lib), Unicode
files: tarfile_surrogates.patch
keywords: patch
messages: 103099
nosy: haypo, loewis
severity: normal
status: open
title: tarfile: use surrogates for undecode fields
versions: Python 3.1, Python 3.2
Added file: http://bugs.python.org/file16917/tarfile_surrogates.patch
