[issue16310] zipfile: allow surrogates in filenames

2015-07-21 Thread Ethan Furman

Changes by Ethan Furman et...@stoneleaf.us:


--
nosy:  -ethan.furman

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue16310
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue16310] zipfile: allow surrogates in filenames

2013-10-14 Thread Ethan Furman

Changes by Ethan Furman et...@stoneleaf.us:


--
nosy: +ethan.furman




[issue16310] zipfile: allow surrogates in filenames

2013-03-21 Thread Toshio Kuratomi

Toshio Kuratomi added the comment:

Version 2 of the patch

* fixes for the style problems noted by ezio.melotti

--
Added file: http://bugs.python.org/file29531/python3-zipfile-surrogate.patch




[issue16310] zipfile: allow surrogates in filenames

2013-03-20 Thread Toshio Kuratomi

Toshio Kuratomi added the comment:

Okay, here's the first version of a patch to add surrogate support to a 
zipfile.  I think it's the minimum required to fix this bug.

When archiving, if a filename contains surrogateescape'd bytes, it switches to 
cp437 when it saves the filename into the zipfile.  This seems to be the 
strategy of other zip tools.  Nothing changes when unarchiving (probably to 
deal with what comes out of other tools).
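
A minimal sketch of the archiving strategy described above, not the code from the attached patch. The function name `encode_filename_flags` is hypothetical; the sketch assumes the filename was decoded from the filesystem with the `surrogateescape` error handler and that general purpose bit 11 (0x800) marks a UTF-8 name:

```python
UTF8_FLAG = 0x800  # general purpose bit 11: filename is UTF-8


def encode_filename_flags(filename, flag_bits=0):
    """Return (encoded_name, flag_bits) for storing a name in a zip header.

    Hypothetical helper for illustration only.
    """
    try:
        # Pure ASCII names need no flag at all.
        return filename.encode('ascii'), flag_bits
    except UnicodeEncodeError:
        pass
    try:
        # Regular non-ASCII names are stored as UTF-8 with bit 11 set.
        return filename.encode('utf-8'), flag_bits | UTF8_FLAG
    except UnicodeEncodeError:
        # Lone surrogates: recover the original bytes and leave bit 11
        # clear, so readers treat the name as cp437 (the legacy encoding).
        raw = filename.encode('utf-8', 'surrogateescape')
        return raw, flag_bits & ~UTF8_FLAG
```

Under these assumptions a name such as 'f\udcfc' is stored as the raw bytes b'f\xfc' with bit 11 clear, instead of raising UnicodeEncodeError.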

The documentation is also updated to mention that unknown encodings are a 
problem that the zipfile module doesn't handle automatically for you.

I think we could do better, but this is a major improvement over the status quo 
(no tracebacks).  Would someone care to review this for merge?  Then we could 
work on adding some notion of a user-specified encoding to override the cp437 
encoding when dearchiving (which I think would satisfy issue10614 and 
issue10972).

The use case in issue10757 might be fixed by this patch (or this patch plus the 
user specified encoding).  Have to look a little harder at it.

--
keywords: +patch
Added file: http://bugs.python.org/file29517/python3-zipfile-surrogate.patch




[issue16310] zipfile: allow surrogates in filenames

2013-03-15 Thread Toshio Kuratomi

Toshio Kuratomi added the comment:

I found some standards docs that could bear on this:

http://www.pkware.com/documents/casestudies/APPNOTE.TXT

Appendix D:
D.1 The ZIP format has historically supported only the original IBM PC 
character encoding set, commonly referred to as IBM Code Page 437.
[..]
D.2 If general purpose bit 11 is unset, the file name and comment should 
conform to the original ZIP character encoding.  If general purpose bit 11 is 
set, the filename and comment must support The Unicode Standard, Version 4.1.0 
or greater using the character encoding form defined by the UTF-8 storage 
specification.
[..]

So there are two choices for a filename in a zipfile:

* bytes that make valid UTF-8 strings
* bytes that make valid strings in code page 437
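
As a rough illustration of the rule in D.2 (a sketch, not the zipfile module's actual implementation; `decode_filename` is a hypothetical helper), a reader can pick the codec from general purpose bit 11:

```python
UTF8_FLAG = 0x800  # general purpose bit 11


def decode_filename(raw, flag_bits):
    """Decode a stored filename according to general purpose bit 11."""
    # Bit 11 set: the name is UTF-8. Bit 11 clear: legacy code page 437.
    encoding = 'utf-8' if flag_bits & UTF8_FLAG else 'cp437'
    return raw.decode(encoding)
```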

http://en.wikipedia.org/wiki/Code_page_437#Standard_code_page

Code Page 437 takes up all 256 possible bit patterns available in a byte.

These two factors mean that if a filename in a zipfile is considered from the 
POV of a sequence of bytes, it can (according to the zipfile standard) contain 
any possible sequence of bytes.  If a filename is considered from the POV of a 
sequence of human characters, it can contain any possible sequence of unicode 
code points encoded as utf-8.  

The tricky bit: if the bytes are not valid utf-8 then officially the characters 
should be limited to the 256 characters of Code Page 437.   However, the client 
tools I've looked at exploit the fact that all bytes are possible to simply 
save the bytes that make up the filename into the zip file.
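
The point about code page 437 covering every byte value can be checked with Python's own `cp437` codec. This is only a small demonstration, assuming the standard library codec; the variable names are illustrative:

```python
# Every possible byte value decodes under cp437 and round-trips unchanged.
all_bytes = bytes(range(256))
text = all_bytes.decode('cp437')
round_trip = text.encode('cp437')

# The same is not true of UTF-8: a lone 0xFC byte is rejected.
try:
    b'f\xfc'.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

So any byte sequence is a "valid" cp437 filename, even if the resulting characters are not what the author of the file intended.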

--
nosy: +a.badger




[issue16310] zipfile: allow surrogates in filenames

2012-10-30 Thread Stefan Holek

Stefan Holek added the comment:

> It's possible to distribute Python packages with non-ASCII filenames.

Well, it wasn't until very recently (distribute 0.6.29):
https://bitbucket.org/tarek/distribute/issue/303/no-support-for-unicode-manifest-files
Unless we are not talking about the same thing, which is possible. ;-)

>> So yes, I have Latin-1 bytes on the filesystem,
>> even though my locale is UTF-8.
>
> Your system is not configured correctly. If you would like to distribute such
> an invalid filename,
> how do you plan to access it on other platforms where the filename is decoded
> differently?
> It would be safer to build your project on a well configured system.

This was done on purpose, to test how Python fares. Such files can easily come 
into existence, e.g. when cloning a Git repo created on a different system. I 
am not after correct ZIP files in this case, I am after Python not raising 
UnicodeErrors when it is supposed to a) support non-ASCII module names and b) 
support surrogates.

python setup.py sdist --formats=gztar - works

python setup.py sdist --formats=zip - UnicodeError

If I am the only one to think this is wrong, then so be it. Our current 
workaround is to disallow surrogates in the manifest. /me shrugs.

--




[issue16310] zipfile: allow surrogates in filenames

2012-10-30 Thread STINNER Victor

STINNER Victor added the comment:

> If I am the only one to think this is wrong, then so be it.
> Our current workaround is to disallow surrogates in the manifest. /me shrugs.

You are not alone; that's why there are 3 open issues. But someone
should finish the different proposals and write a new, fully
functional patch to support bytes filenames.

--




[issue16310] zipfile: allow surrogates in filenames

2012-10-29 Thread STINNER Victor

STINNER Victor added the comment:

> The use-case is building Python distributions containing
> non-ASCII filenames.

It's possible to distribute Python packages with non-ASCII filenames.

> So yes, I have Latin-1 bytes on the filesystem,
> even though my locale is UTF-8.

Your system is not configured correctly. If you would like to distribute such 
an invalid filename, how do you plan to access it on other platforms where the 
filename is decoded differently? It would be safer to build your project on a 
well configured system.

See the issues mentioned in msg173766 for supporting two things: creating a ZIP 
archive with invalid filenames, and specifying the encoding of filenames when 
decoding a ZIP archive.

--




[issue16310] zipfile: allow surrogates in filenames

2012-10-28 Thread Andrew Svetlov

Changes by Andrew Svetlov andrew.svet...@gmail.com:


--
nosy: +asvetlov




[issue16310] zipfile: allow surrogates in filenames

2012-10-25 Thread Stefan Holek

Stefan Holek added the comment:

What we are trying to do is make distribute work with non-ASCII filenames, and 
this is one of the things we ran into.

Fact 1: Filenames are bytes, whether you like it or not. Treating them as 
strings is going to give you more trouble than dragging the bytes along.

Fact 2: Surrogates are Python 3's way of dealing with bytes.

Fact 3: What follows is that surrogates must be supported wherever Python 3 
deals with filenames.

Fact 4: This is a *bug* since Python breaks its own rules here (I have removed 
the enhancement marker). The issue is not what ZIP can do, but what Python 3 
*must* do. Creating a potentially non-standard ZIP file is fine, exploding in 
the user's face is not.

--
type: enhancement -> 
versions: +Python 3.3




[issue16310] zipfile: allow surrogates in filenames

2012-10-25 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

If we allow surrogates in the names, the result will not be correct UTF-8.  This 
can break other software.

We should clear flag bit 11 in this case.

--




[issue16310] zipfile: allow surrogates in filenames

2012-10-25 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Related issues: issue10614, issue10757, issue10972.

--




[issue16310] zipfile: allow surrogates in filenames

2012-10-24 Thread Stefan Holek

New submission from Stefan Holek:

Please allow for surrogates in the zipfile module like it was done for tarfile 
in #8390.

Currently zipfile breaks when encountering surrogates:

Traceback (most recent call last):
  File "/usr/local/python3.3/lib/python3.3/zipfile.py", line 392, in _encodeFilenameFlags
    return self.filename.encode('ascii'), self.flag_bits
UnicodeEncodeError: 'ascii' codec can't encode character '\udcfc' in position 21: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "setup.py", line 20, in <module>
    'setuptools',
  File "/usr/local/python3.3/lib/python3.3/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/local/python3.3/lib/python3.3/distutils/dist.py", line 917, in run_commands
    self.run_command(cmd)
  File "/usr/local/python3.3/lib/python3.3/distutils/dist.py", line 936, in run_command
    cmd_obj.run()
  File "/home/stefan/sandbox/setuptools-git/lib/python3.3/site-packages/distribute-0.6.30-py3.3.egg/setuptools/command/sdist.py", line 161, in run
    self.make_distribution()
  File "/usr/local/python3.3/lib/python3.3/distutils/command/sdist.py", line 447, in make_distribution
    file = self.make_archive(base_name, fmt, base_dir=base_dir)
  File "/usr/local/python3.3/lib/python3.3/distutils/cmd.py", line 370, in make_archive
    dry_run=self.dry_run)
  File "/usr/local/python3.3/lib/python3.3/distutils/archive_util.py", line 178, in make_archive
    filename = func(base_name, base_dir, **kwargs)
  File "/usr/local/python3.3/lib/python3.3/distutils/archive_util.py", line 118, in make_zipfile
    zip.write(path, path)
  File "/usr/local/python3.3/lib/python3.3/zipfile.py", line 1328, in write
    self.fp.write(zinfo.FileHeader())
  File "/usr/local/python3.3/lib/python3.3/zipfile.py", line 382, in FileHeader
    filename, flag_bits = self._encodeFilenameFlags()
  File "/usr/local/python3.3/lib/python3.3/zipfile.py", line 394, in _encodeFilenameFlags
    return self.filename.encode('utf-8'), self.flag_bits | 0x800
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcfc' in position 21: surrogates not allowed

--
components: Library (Lib), Unicode
messages: 173676
nosy: ezio.melotti, stefanholek
priority: normal
severity: normal
status: open
title: zipfile: allow surrogates in filenames
versions: Python 3.3




[issue16310] zipfile: allow surrogates in filenames

2012-10-24 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
nosy: +serhiy.storchaka
type:  -> enhancement
versions: +Python 3.4 -Python 3.3




[issue16310] zipfile: allow surrogates in filenames

2012-10-24 Thread R. David Murray

R. David Murray added the comment:

The problem you are reporting looks different from the problem addressed in 
issue 8390.  There, the surrogates are being introduced when reading filenames 
from the archive file.  Here, the surrogates presumably arose because the 
filename on your file system was not utf-8 encoded and so Python introduced the 
surrogates to preserve the filename.  The bug is that zipfile is not handling 
surrogates when *building* the archive...which may in fact be correct.  If I 
understand correctly there are two encodings supported by zipfile, a Microsoft 
code page and utf-8.  Anything else should probably be rejected as invalid, but 
with a better error message.  If you really need to include invalid filenames 
in an archive, we could introduce an explicit flag for allowing that.

But, that's just my opinion.  (Be generous in what you accept, and strict in 
what you send)

--
nosy: +r.david.murray




[issue16310] zipfile: allow surrogates in filenames

2012-10-24 Thread Stefan Holek

Stefan Holek added the comment:

A little more context perhaps:

The use-case is building Python distributions containing non-ASCII filenames. 
These seemingly invalid filenames can occur in real-life when the files have 
been created by, say, a 'git clone' operation.

So yes, I have Latin-1 bytes on the filesystem, even though my locale is UTF-8. 
And yes, Python 3 decodes that filename using surrogates. Creating .tar.gz 
distributions in this situation appears to work (even re-creating the foreign 
bytes when the archive is later extracted), whereas .zip archives fail in the 
way described above.
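
The behaviour described here can be reproduced without touching the filesystem. The following is a small sketch that assumes a UTF-8 filesystem encoding and uses an illustrative filename; it decodes Latin-1 bytes the way Python 3 decodes filenames (with the `surrogateescape` error handler) and shows why a strict UTF-8 encode then fails:

```python
# Latin-1 bytes for a name containing u-umlaut (0xFC), as found on disk.
raw = b'f\xfc.txt'

# Python 3 decodes undecodable bytes to lone surrogates (U+DC80..U+DCFF).
name = raw.decode('utf-8', 'surrogateescape')

# The original bytes can be recovered with the same error handler ...
restored = name.encode('utf-8', 'surrogateescape')

# ... but a strict encode, as zipfile attempts, raises UnicodeEncodeError.
try:
    name.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The byte 0xFC becomes the surrogate '\udcfc', which is the character named in the traceback from the original report.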

I was hoping zipfile could be made to work the same as tarfile in this regard. 
Concerns for standards certainly didn't keep tarfile from supporting 
surrogates. ;-)

--




[issue16310] zipfile: allow surrogates in filenames

2012-10-24 Thread R. David Murray

R. David Murray added the comment:

I'm guessing that is because (if you read the issue) there are no specified 
standards for the filenames in tar (other than PAX format).  Although I would 
personally have preferred to need to specify a "yes, really use these binary 
filenames" flag to tar, as well.

I'm not sure there are real standards for zip, either.  I'll have to leave 
that answer to someone more knowledgeable.

As for your immediate issue, can't you just set your locale to latin-1 while 
building the archive?  The filenames should then get encoded to utf-8 in the 
zip archive, which should do the right thing with respect to the user's locale 
when extracted.  I would think that that would be more portable.
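
The effect of this suggestion can be approximated with explicit codecs. This is only a sketch with an illustrative filename; it assumes that under a Latin-1 locale the filename bytes would be decoded as Latin-1 rather than with surrogates:

```python
# Latin-1 bytes as found on disk.
raw = b'f\xfc.txt'

# Decoded as Latin-1, the name contains a real character, not a surrogate.
name = raw.decode('latin-1')

# A strict UTF-8 encode now succeeds, so the name can be stored with
# general purpose bit 11 set.
stored = name.encode('utf-8')
```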

--




[issue16310] zipfile: allow surrogates in filenames

2012-10-24 Thread Arfrever Frehtes Taifersar Arahesis

Changes by Arfrever Frehtes Taifersar Arahesis arfrever@gmail.com:


--
nosy: +Arfrever




[issue16310] zipfile: allow surrogates in filenames

2012-10-24 Thread Antoine Pitrou

Changes by Antoine Pitrou pit...@free.fr:


--
nosy: +haypo
