[issue13899] re pattern r"[\A]" should work like "A" but matches nothing. Ditto B and Z.

2012-01-29 Thread John Machin

John Machin sjmac...@lexicon.net added the comment:

@Ezio: Comparison of the behaviour of \letter inside/outside character classes 
is irrelevant. The rules for inside can be expressed simply as:

1. Letters dDsSwW are special; they represent categories as documented, and do 
in fact have a similar meaning outside character classes.

2. Otherwise the normal Python rules for backslash escapes in string literals 
should be followed. This automatically means that \a -> \x07, \A -> A, \b -> 
backspace, \B -> B, \z -> z and \Z -> Z.

@Georg: No need to read the source, just read my initial posting: it's compiled 
as a zero-length matcher ("at") inside a character class ("in"), i.e. a 
nonsense; then at runtime the illegality is deliberately ignored.
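
The two rules listed above can be modelled in a few lines of Python. This is
only a sketch of the proposed behaviour, not of what the re parser actually
does; the name proposed_class_escape and the tuples it returns are made up
for illustration.

```python
# Hypothetical model of the proposed rules for \letter inside [...].
CATEGORY_LETTERS = set("dDsSwW")

# C-like control escapes shared with Python string literals.
CONTROL_ESCAPES = {
    "a": "\x07", "b": "\x08", "f": "\x0c", "n": "\n",
    "r": "\r", "t": "\t", "v": "\x0b",
}

def proposed_class_escape(letter):
    """Return how backslash+letter would be read inside a character class."""
    if letter in CATEGORY_LETTERS:
        return ("category", letter)                  # rule 1
    if letter in CONTROL_ESCAPES:
        return ("literal", CONTROL_ESCAPES[letter])  # rule 2, known escape
    return ("literal", letter)                       # rule 2: \A -> A, \z -> z

for letter in "dbAZ":
    print(letter, proposed_class_escape(letter))
```

Under this model \A, \B and \Z inside a class are plain letters, which is the
behaviour the report asks for.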

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue13899
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13899] re pattern r"[\A]" should work like "A" but matches nothing. Ditto B and Z.

2012-01-29 Thread John Machin

John Machin sjmac...@lexicon.net added the comment:

Whoops: "normal Python rules for backslash escapes" should have had a note "but 
revert to the C behaviour of stripping the \ from unrecognised escapes", which 
is what re appears to do in its own \ handling.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue13899
___



[issue13899] re pattern r"[\A]" should work like "A" but matches nothing. Ditto B and Z.

2012-01-28 Thread John Machin

New submission from John Machin sjmac...@lexicon.net:

Expected behaviour illustrated using "C":

>>> import re
>>> re.findall(r'[\C]', 'CCC')
['C', 'C', 'C']
>>> re.compile(r'[\C]', 128)
literal 67
<_sre.SRE_Pattern object at 0x01FC6E78>
>>> re.compile(r'C', 128)
literal 67
<_sre.SRE_Pattern object at 0x01FC6F08>

Incorrect behaviour exhibited by "A" (and by "B" and "Z"):

>>> re.findall(r'[\A]', 'AAA')
[]
>>> re.compile(r'A', 128)
literal 65
<_sre.SRE_Pattern object at 0x01FC6F98>
>>> re.compile(r'[\A]', 128)
in
  at at_beginning_string    # FAIL
<_sre.SRE_Pattern object at 0x01FDF0B0>


Also there is no self-checking at runtime; the switch default has a comment to 
the effect that nothing can be done, so pretend that the unknown opcode matched 
nothing. Zen?
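
The flag value 128 used in the session above is the numeric value of
re.DEBUG. A minimal sketch for capturing the parse dump as text follows; the
helper name debug_dump is made up, and the exact dump format differs between
Python versions (newer releases print opcode names in upper case and may add
a compiled-code listing).

```python
import contextlib
import io
import re

def debug_dump(pattern):
    """Compile `pattern` with re.DEBUG and return whatever it prints."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        re.compile(pattern, re.DEBUG)
    return buf.getvalue()

print(int(re.DEBUG))
print(debug_dump(r'C'))
```

Comparing debug_dump(r'C') with the dump for a bracketed pattern is a quick
way to see whether the parser produced a literal or something else.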

--
messages: 152194
nosy: sjmachin
priority: normal
severity: normal
status: open
title: re pattern r"[\A]" should work like "A" but matches nothing. Ditto B and 
Z.
type: behavior
versions: Python 2.7, Python 3.2

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue13899
___



[issue13899] re pattern r"[\A]" should work like "A" but matches nothing. Ditto B and Z.

2012-01-28 Thread John Machin

John Machin sjmac...@lexicon.net added the comment:

@ezio: Of course the context is inside a character class.

I expect r'[\b]' to act like '\b' aka '\x08' aka backspace because (1) that 
is the treatment applied to all other C-like control char escapes (2) the docs 
say so explicitly: "Inside a character range, \b represents the backspace 
character, for compatibility with Python’s string literals."
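
A short check of the documented behaviour, using only the standard re module.
The sample text is made up for the demonstration.

```python
import re

text = "ab\x08cd"                     # contains one backspace character

inside = re.findall(r'[\b]', text)    # \b inside a class: backspace
outside = re.findall(r'\b', text)     # \b outside a class: word boundary

print(inside)
print(len(outside))
```

The first call finds the single backspace; the second returns only empty
strings, one per word boundary.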

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue13899
___



[issue13782] xml.etree.ElementTree: Element.append doesn't type-check its argument

2012-01-13 Thread John Machin

New submission from John Machin sjmac...@lexicon.net:

import xml.etree.ElementTree as et
node = et.Element('x')
node.append(not_an_Element_instance)

2.7 and 3.2 produce no complaint at all.
2.6 and 3.1 produce an AssertionError.

However cElementTree in all 4 versions produces a TypeError.

Please fix 2.7 and 3.2 ElementTree to produce a TypeError.
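
A small probe for the behaviour requested above. It is a sketch: the helper
name try_append is made up, and what it reports depends on the interpreter.
Releases that type-check the argument raise TypeError; the 2.7 and 3.2
pure-Python ElementTree described in this report would return None here.

```python
import xml.etree.ElementTree as et

def try_append(child):
    """Return the name of the exception raised by append(), or None."""
    node = et.Element('x')
    try:
        node.append(child)
    except (TypeError, AssertionError) as exc:
        return type(exc).__name__
    return None

print(try_append(et.Element('y')))     # a real Element is accepted
print(try_append('not an Element'))    # outcome depends on the version
```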

--
messages: 151210
nosy: sjmachin
priority: normal
severity: normal
status: open
title: xml.etree.ElementTree: Element.append doesn't type-check its argument
type: behavior
versions: Python 2.6, Python 2.7, Python 3.1, Python 3.2

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue13782
___



[issue7198] Extraneous newlines with csv.writer on Windows

2011-03-19 Thread John Machin

John Machin sjmac...@lexicon.net added the comment:

Can somebody please review my doc patch submitted 2 months ago?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue7198
___



[issue7198] Extraneous newlines with csv.writer on Windows

2011-03-19 Thread John Machin

John Machin sjmac...@lexicon.net added the comment:

Skip, The changes that I suggested have NOT been made. Please re-read the doc 
page you pointed to. The writer paragraph does NOT mention that newline='' is 
required when writing. The writer examples do NOT include newline=''. The 
examples have NOT been enhanced by using a with statement and not using space 
as an example delimiter.

PLEASE RE-OPEN THIS ISSUE.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue7198
___



[issue10954] No warning for csv.writer API change

2011-03-19 Thread John Machin

John Machin sjmac...@lexicon.net added the comment:

The doc patch proposed by Skip on 2011-01-24 for this bug has NOT been 
reviewed, let alone applied. Sibling bug #7198 has been closed in error. 
Somebody please help.

--
nosy: +skip.montanaro

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue10954
___



[issue10954] No warning for csv.writer API change

2011-03-19 Thread John Machin

John Machin sjmac...@lexicon.net added the comment:

Terry, I have already made the point that the docs bug is #7198. This is the 
meaningful-exception bug.

My review is: changing 'should' to 'must' is not very useful without a 
consistent interpretation of what those two words mean and without any 
enforcement of use of newline=''.

I was patient enough to wait 2 months for a review of my doc patch on #7198. 

My issues are that the 3.2 docs have NOT been changed (have a look at the 
csv.writer paragraph: do you see the word newline anywhere??), #7198 has been 
closed without any action, and BOTH of these two issues (which have in effect 
been lurking about since Python 3.0.0alpha) appear to have been abandoned.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue10954
___



[issue11204] re module: strange behaviour of space inside {m, n}

2011-02-12 Thread John Machin

New submission from John Machin sjmac...@lexicon.net:

A pattern like r"b{1,3}\Z" matches "b", "bb", and "bbb", as expected. There is 
no documentation of the behaviour of r"b{1, 3}\Z" -- it matches the LITERAL 
TEXT "b{1, 3}" in normal mode and "b{1,3}" in verbose mode.

# paste the following at the interactive prompt:
import re
pat = r"b{1, 3}\Z"
bool(re.match(pat, "bb")) # False
bool(re.match(pat, "b{1, 3}")) # True
bool(re.match(pat, "bb", re.VERBOSE)) # False
bool(re.match(pat, "b{1, 3}", re.VERBOSE)) # False
bool(re.match(pat, "b{1,3}", re.VERBOSE)) # True

Suggested change, in decreasing order of preference:
(1) Ignore leading/trailing spaces when parsing the m and n components of {m,n}
(2) Raise an exception if the exact syntax is not followed
(3) Document the existing behaviour

Note: deliberately matching the literal text would be expected to be done by 
escaping the left brace:

pat2 = r"b\{1, 3}\Z"
bool(re.match(pat2, "b{1, 3}")) # True

and this is not prevented by the suggested changes.
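
Suggested change (1) can be approximated from user code with a pre-processing
step. The sketch below is deliberately naive and is not how the re module
would implement it: normalise_repeats is a made-up helper, it leaves escaped
braces alone, and it does not understand character classes or verbose-mode
comments.

```python
import re

# Blanks around m and n inside {m,n}, unless the brace is escaped.
_SPACED_REPEAT = re.compile(r'(?<!\\)\{\s*(\d+)\s*,\s*(\d*)\s*\}')

def normalise_repeats(pattern):
    """Rewrite '{1, 3}' as '{1,3}' so that it is parsed as a repeat."""
    return _SPACED_REPEAT.sub(r'{\1,\2}', pattern)

pat = normalise_repeats(r'b{1, 3}\Z')
print(pat)
print(bool(re.match(pat, 'bb')))
print(normalise_repeats(r'b\{1, 3}\Z'))   # escaped brace is left untouched
```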

--
messages: 128472
nosy: sjmachin
priority: normal
severity: normal
status: open
title: re module: strange behaviour of space inside {m, n}
versions: Python 2.7, Python 3.1

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue11204
___



[issue10954] No warning for csv.writer API change

2011-01-23 Thread John Machin

John Machin sjmac...@lexicon.net added the comment:

Skip, the docs bug is #7198. This is the meaningful-exception bug.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue10954
___



[issue10954] No warning for csv.writer API change

2011-01-22 Thread John Machin

John Machin sjmac...@lexicon.net added the comment:

I don't understand "Changing csv api is a feature request that could only 
happen in 3.3". This is NOT a request for an API change. Lennert's point is 
that an API change was made in 3.0 as compared with 2.6 but there is no fixer 
in 2to3. What is requested is for csv.reader/writer to give more meaningful 
error messages for valid 2.x code that has been put through fixer-less 2to3.

The name of the arg is "newline". "newlines" is an attribute that stores what 
was actually found in universal newlines mode.

newline='' is needed on input for the same reason that binary mode is required 
in 2.x: \r and \n may quite validly appear in data, inside a quoted field, and 
must not be treated as part of a row separator.

newline='' is needed on output for the same reason that binary mode is required 
in 2.x: any \n in the data and any \n in the caller's chosen line terminator 
must be preserved from being changed to os.linesep (e.g. \r\n).
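
A round-trip sketch of the point made in the two paragraphs above: with
newline='' on both sides, an embedded \n inside a quoted field survives and
the row terminator is written exactly as csv produced it. The file name and
the sample rows are made up.

```python
import csv
import os
import tempfile

rows = [['id', 'note'], ['1', 'line one\nline two'], ['2', 'plain']]
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')

# Write with newline='' so the text layer does not translate \n.
with open(path, 'w', newline='') as f:
    csv.writer(f).writerows(rows)

# Look at the bytes that actually reached the file.
with open(path, 'rb') as f:
    raw = f.read()

# Read with newline='' so the embedded \n is seen by the csv parser.
with open(path, 'r', newline='') as f:
    read_back = list(csv.reader(f))

print(raw)
print(read_back == rows)
```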

newline is not available as an attribute of the _io.TextIOWrapper object 
created by open('xxx.csv', 'w', newline=''); is exposing this possible?

--
versions: +Python 3.2 -Python 3.3

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue10954
___



[issue10954] No warning for csv.writer API change

2011-01-20 Thread John Machin

John Machin sjmac...@lexicon.net added the comment:

I believe that both csv.reader and csv.writer should fail with a meaningful 
message if mode is binary or newline is not ''
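
One way such a check could look from the outside, as a sketch only:
checked_writer is a hypothetical wrapper, not part of the csv module, and the
wording of the message is an assumption. It detects binary file objects; it
does not check the newline setting.

```python
import csv
import io

def checked_writer(csvfile, **kwargs):
    """Hypothetical wrapper: refuse binary files with an explicit message."""
    if isinstance(csvfile, (io.RawIOBase, io.BufferedIOBase)):
        raise TypeError(
            "csv.writer needs a text-mode file opened with newline='' "
            "(got a binary file object)")
    return csv.writer(csvfile, **kwargs)

try:
    checked_writer(io.BytesIO())
except TypeError as exc:
    print(exc)

out = io.StringIO(newline='')
checked_writer(out).writerow(['a', 'b'])
print(repr(out.getvalue()))
```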

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue10954
___



[issue7198] Extraneous newlines with csv.writer on Windows

2011-01-19 Thread John Machin

John Machin sjmac...@lexicon.net added the comment:

docpatch for 3.x csv docs:

In the csv.writer docs, insert the sentence "If csvfile is a file object, it 
should be opened with newline=''." immediately after the sentence "csvfile can 
be any object with a write() method."

In the closely-following example, change the open call from open('eggs.csv', 
'w') to open('eggs.csv', 'w', newline='').

In section 13.1.5 Examples, there are 2 reader cases and 1 writer case that 
likewise need ", newline=''" inserted in the open call.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue7198
___



[issue7198] Extraneous newlines with csv.writer on Windows

2010-12-26 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

Skip, I'm WRITING, not reading. Please read the 3.1 documentation for 
csv.writer. It does NOT mention newline='', and neither does the example. 
Please fix.

Other problems with the examples: (1) They encourage a bad habit (open inside 
the call to reader/writer); good practice is to retain the reference to the 
file handle (preferably with a with statement) so that it can be closed 
properly. (2) delimiter=' ' is very unrealistic.

The documentation for both 2.x and 3.x should be much more explicit about what 
is needed in open() for csv to work properly and portably:

2.x read: use mode='rb' -- otherwise fail on Windows
2.x write: use mode='wb' -- otherwise fail on Windows
3.x read: use newline='' -- otherwise fail unconditionally(?)
3.x write: use newline='' -- otherwise fail on Windows

The 2.7 documentation says "If csvfile is a file object, it must be opened 
with the 'b' flag on platforms where that makes a difference" ... in my 
experience, people are left asking "what platforms?" and "what difference?"; 
Windows should be mentioned explicitly.

--
versions: +Python 2.7, Python 3.2, Python 3.3

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue7198
___



[issue7198] Extraneous newlines with csv.writer on Windows

2010-12-23 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

Please re-open this. The binary/text mode problem still exists with Python 3.X 
on Windows. Quite simply, there is no option available to the caller to open 
the output file in binary mode, because the module is throwing str objects at 
the file. The module's idea of taking control in the default case appears to 
be to write \r\n which is then processed by the Windows runtime and becomes 
\r\r\n.

Python 3.1.3 (r313:86834, Nov 27 2010, 18:30:53) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> f = open('terminator31.csv', 'w')
>>> row = ['foo', None, 3.14159]
>>> writer = csv.writer(f)
>>> writer.writerow(row)
14
>>> writer.writerow(row)
14
>>> f.close()
>>> open('terminator31.csv', 'rb').read()
b'foo,,3.14159\r\r\nfoo,,3.14159\r\r\n'


And it's not just a row terminator problem; newlines embedded in fields are 
likewise expanded to \r\n by the Windows runtime.
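
The doubling can be reproduced on any platform by asking the text layer to
translate \n to \r\n explicitly, which is what Windows text mode does by
default. This is a sketch using an in-memory buffer; the helper name
written_bytes is made up.

```python
import csv
import io

def written_bytes(newline):
    """Write one row through a text layer using the given newline setting."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding='ascii', newline=newline)
    csv.writer(text).writerow(['foo', None, 3.14159])
    text.flush()
    return raw.getvalue()

print(written_bytes('\r\n'))   # translation on: csv's \r\n becomes \r\r\n
print(written_bytes(''))       # translation off: bytes are as csv wrote them
```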

--
nosy: +sjmachin
versions: +Python 3.1 -Python 2.6

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue7198
___



[issue9980] str(float) failure

2010-09-29 Thread John Machin

Changes by John Machin sjmac...@users.sourceforge.net:


--
nosy: +sjmachin

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9980
___



[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-07-03 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

About the E0 80 81 61 problem: my interpretation is that you are correct, the 
80 is not valid in the current state (start byte == E0), so no look-ahead, 
three FFFDs must be issued followed by 0061. I don't really care about issuing 
too many FFFDs so long as it doesn't munch valid sequences. However it would be 
very nice to get an explicit message about surrogates.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8271
___



[issue8308] raw_bytes.decode('cp932') -- spurious mappings

2010-04-04 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

Thanks, Martin. Issue closed as far as I'm concerned.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8308
___



[issue8308] raw_bytes.decode('cp932') -- spurious mappings

2010-04-03 Thread John Machin

New submission from John Machin sjmac...@users.sourceforge.net:

According to the following references, the bytes 80, A0, FD, FE, and FF are not 
defined in cp932:

http://msdn.microsoft.com/en-au/goglobal/cc305152.aspx
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943_P15A-2003&s=ALL

However CPython 3.1.2 does this:

>>> print(ascii(b'\x80\xa0\xfd\xfe\xff'.decode('cp932')))
'\x80\uf8f0\uf8f1\uf8f2\uf8f3'

(as do 2.5, 2.6, and 2.7 with the appropriate syntax)

This maps 80 to U+0080 (not very useful) and maps the other 4 bytes into the 
Private Use Area (PUA)!! Each case should be treated as 
undefined/unexpected/error/...
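
A probe for checking how a given interpreter treats these bytes. It is a
sketch: the helper names are made up, and no particular output is claimed
here because the result depends on the Python version and its codec tables.

```python
def is_private_use(ch):
    """True if ch lies in the BMP Private Use Area (U+E000..U+F8FF)."""
    return 0xE000 <= ord(ch) <= 0xF8FF

def classify(byte_value, codec='cp932'):
    """Describe how a single byte decodes under the given codec."""
    try:
        ch = bytes([byte_value]).decode(codec)
    except UnicodeDecodeError:
        return 'error'
    return 'PUA' if is_private_use(ch) else 'U+%04X' % ord(ch)

for b in (0x80, 0xA0, 0xFD, 0xFE, 0xFF, 0x41):
    print('%02X -> %s' % (b, classify(b)))
```

A codec that treated the five bytes as undefined would report 'error' for
each of them.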

--
components: Unicode
messages: 102308
nosy: sjmachin
severity: normal
status: open
title: raw_bytes.decode('cp932') -- spurious mappings
type: behavior
versions: Python 2.7, Python 3.1

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8308
___



[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

@ezio.melotti: Your second sentence is true, but it is not the whole truth. 
Bytes in the range C0-FF (whose high bit *is* set) ALSO shouldn't be considered 
part of the sequence because they (like 00-7F) are invalid as continuation 
bytes; they are either starter bytes (C2-F4) or invalid for any purpose (C0-C1 
and F5-FF). Further, some bytes in the range 80-BF are NOT always valid as the 
first continuation byte; it depends on what starter byte they follow.

The simple way of summarising the above is to say that a byte that is not a 
valid continuation byte in the current state (the "failing byte") is not a part 
of the current (now known to be invalid) sequence, and the decoder must try 
again ("resync") with the failing byte.
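
The resync rule can be written down as a small pure-Python decoder. This is
an illustrative sketch, not the CPython implementation: decode_replace and
second_byte_range are made-up names, and the sketch follows Table 3-7 of the
Unicode standard in full, including the restriction after starter byte ED.

```python
REPLACEMENT = '\ufffd'

def second_byte_range(lead):
    """Allowed range for the first continuation byte (Unicode Table 3-7)."""
    special = {0xE0: (0xA0, 0xBF), 0xED: (0x80, 0x9F),
               0xF0: (0x90, 0xBF), 0xF4: (0x80, 0x8F)}
    return special.get(lead, (0x80, 0xBF))

def decode_replace(data):
    """Decode UTF-8 bytes, emitting U+FFFD and resyncing on the failing byte."""
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                  # ASCII
            out.append(chr(lead))
            i += 1
            continue
        if 0xC2 <= lead <= 0xDF:
            need = 1
        elif 0xE0 <= lead <= 0xEF:
            need = 2
        elif 0xF0 <= lead <= 0xF4:
            need = 3
        else:                            # 80-C1 and F5-FF never start a sequence
            out.append(REPLACEMENT)
            i += 1
            continue
        cp = lead & (0x3F >> need)       # payload bits of the starter byte
        j = i + 1
        complete = True
        for k in range(need):
            lo, hi = second_byte_range(lead) if k == 0 else (0x80, 0xBF)
            if j < n and lo <= data[j] <= hi:
                cp = (cp << 6) | (data[j] & 0x3F)
                j += 1
            else:                        # data[j] is the failing byte
                complete = False
                break
        out.append(chr(cp) if complete else REPLACEMENT)
        i = j                            # resync ON the failing byte
    return ''.join(out)

print(ascii(decode_replace(b'\xf1\x80\xc2\x81C')))
```

The key line is the last assignment in the loop: the next attempt starts at
the failing byte, not at a position derived from the starter byte's length.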

Do you agree with my example 3?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8271
___



[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

@ezio.melotti: "I'm considering valid all the bytes that start with '10...'"

Sorry, WRONG. Read what I wrote: "Further, some bytes in the range 80-BF are 
NOT always valid as the first continuation byte, it depends on what starter 
byte they follow."

Consider these sequences: (1) E0 80 80 (2) E0 9F 80. Both are invalid sequences 
(over-long). Specifically the first continuation byte may not be in 80-9F. 
Those bytes start with '10...' but they are invalid after an E0 starter byte.

Please read "Table 3-7. Well-Formed UTF-8 Byte Sequences" and surrounding text 
in Unicode 5.2.0 chapter 3 (bearing in mind that CPython (for good reasons) 
doesn't implement the surrogates restriction, so that the special case for 
starter byte ED is not used in CPython). Note the other 3 special cases for the 
first continuation byte.
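
The two over-long sequences mentioned above can be checked against the
built-in strict decoder. A minimal sketch; decodes_strictly is a made-up
helper name.

```python
def decodes_strictly(data):
    """True if `data` is accepted by the strict UTF-8 decoder."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

print(decodes_strictly(b'\xe0\x80\x80'))   # over-long
print(decodes_strictly(b'\xe0\x9f\x80'))   # over-long
print(decodes_strictly(b'\xe0\xa0\x80'))   # shortest form of U+0800
```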

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8271
___



[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

Unicode has been frozen at 0x10FFFF. That's it. There is no such thing as a 
valid 5-byte or 6-byte UTF-8 string.
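
A quick check of the limit from Python 3, as a sketch: the highest code point
needs exactly four bytes, and nothing above it can be created.

```python
import sys

top = chr(0x10FFFF)
encoded = top.encode('utf-8')

print(hex(sys.maxunicode))
print(encoded, len(encoded))

try:
    chr(0x110000)
except ValueError as exc:
    print('rejected:', exc)
```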

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8271
___



[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

@lemburg: RFC 2279 was obsoleted by RFC 3629 over 6 years ago. The standard now 
says 21 bits is it. F5-FF are declared to be invalid. I don't understand what 
you mean by "supporting those possibilities". The code is correctly issuing an 
error message. The goal of supporting the new resyncing and FFFD-emitting rules 
might be better met however by throwing away the code in the default clause and 
instead merely setting the entries for F5-FF in the utf8_code_length array to 
zero.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8271
___



[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

Patch review:

Preamble: pardon my ignorance of how the codebase works, but trunk 
unicodeobject.c is r79494 (and allows encoding of surrogate codepoints), py3k 
unicodeobject.c is r79506 (and bans the surrogate caper) and I can't find the 
r79542 that the patch mentions ... help, please!

length 2 case: 
1. the loop can be hand-unrolled into oblivion. It can be entered only when 
(s[1] & 0xC0) != 0x80 (previous if test).
2. the over-long check (if (ch < 0x80)) hasn't been touched. It could be 
removed and the entries for C0 and C1 in the utf8_code_length array set to 0.

length 3 case:
1. the tests involving s[0] being 0xE0 or 0xED are misplaced.
2. the test s[0] == 0xE0 && s[1] < 0xA0, if not misplaced, would be shadowing 
the over-long test (ch < 0x800). It seems better to use the over-long test 
(with endinpos set to 1).
3. The test s[0] == 0xED relates to the surrogates caper which in the py3k 
version is handled in the same place as the over-long test.
4. unrolling loop: needs no loop, only 1 test ... if s[1] is good, then we know 
s[2] must be bad without testing it, because we start the for loop only when 
s[1] is bad || s[2] is bad.

length 4 case: as for the len 3 case generally ... misplaced tests, F1 test 
shadows over-long test, F4 test shadows max value test, too many loop 
iterations.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8271
___



[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

Chapter 3, page 94: "As a consequence of the well-formedness conditions 
specified in Table 3-7, the following byte values are disallowed in UTF-8: 
C0–C1, F5–FF"

Of course they should be handled by the simple expedient of setting their 
length entry to zero. Why write code when there is an existing mechanism??
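
A Python model of such a length table, to show what "setting the entry to
zero" buys. This is a sketch: the real utf8_code_length array lives in C in
unicodeobject.c, and the values below are derived from the Unicode table
rather than copied from the CPython source.

```python
def build_length_table():
    """Sequence length implied by each possible first byte; 0 = invalid."""
    table = [0] * 256                 # 80-C1 and F5-FF stay at zero
    for b in range(0x00, 0x80):
        table[b] = 1                  # ASCII
    for b in range(0xC2, 0xE0):
        table[b] = 2
    for b in range(0xE0, 0xF0):
        table[b] = 3
    for b in range(0xF0, 0xF5):
        table[b] = 4
    return table

utf8_length = build_length_table()
for b in (0x41, 0xC0, 0xC1, 0xC2, 0xF4, 0xF5, 0xFF):
    print('%02X -> %d' % (b, utf8_length[b]))
```

With zero entries in place, the disallowed bytes take the same path as any
other non-starter byte and need no dedicated code.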

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8271
___



[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

@lemburg: "perhaps applying the same logic as for the other sequences is a 
better strategy"

What "other sequences"??? F5-FF are invalid bytes; they don't start valid 
sequences. What "same logic"?? At the start of a character, they should get the 
same short sharp treatment as any other non-starter byte e.g. 80 or C0.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8271
___



[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-31 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

@lemburg: "failing byte" seems rather obvious: first byte that you meet that is 
not valid in the current state. I don't understand your explanation, especially 
"does not have the high bit set". I think you mean "is a valid starter byte". 
See example 3 below.

Example 1: F1 80 41 42 43. F1 implies a 4-byte character. 80 is OK. 41 is not 
in 80-BF. It is the failing byte; high bit not set. Required action is to 
emit FFFD then resync on the 41, causing 0041 0042 0043 to be emitted. Total 
output: FFFD 0041 0042 0043. Current code emits FFFD 0043.

Example 2: F1 80 FF 42 43. F1 implies a 4-byte character. 80 is OK. FF is not 
in 80-BF. It is the failing byte. Required action is to emit FFFD then resync 
on the FF. FF is not a valid starter byte, so emit FFFD, and resync on the 42, 
causing 0042 0043 to be emitted. Total output: FFFD FFFD 0042 0043. Current 
code emits FFFD 0043.

Example 3: F1 80 C2 81 43. F1 implies a 4-byte character. 80 is OK. C2 is not 
in 80-BF. It is the failing byte. Required action is to emit FFFD then resync 
on the C2. C2 and 81 have the high bit set, but C2 is a valid starter byte, and 
remaining bytes are OK, causing 0081 0043 to be emitted. Total output: FFFD 
0081 0043. Current code emits FFFD 0043.
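
The three examples can be re-run on any interpreter with the helper below,
which prints the result in the same hex notation used above. It is a sketch
with a made-up function name; interpreters that follow the newer rule give
the "Total output" stated in each example, while the releases discussed in
this thread gave the shorter output.

```python
def replaced_codepoints(hex_bytes):
    """Decode hex-notated bytes with 'replace'; return code points as hex."""
    data = bytes.fromhex(hex_bytes)
    text = data.decode('utf-8', 'replace')
    return ' '.join('%04X' % ord(ch) for ch in text)

for example in ('F1 80 41 42 43', 'F1 80 FF 42 43', 'F1 80 C2 81 43'):
    print(example, '->', replaced_codepoints(example))
```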

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8271
___



[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-30 Thread John Machin

New submission from John Machin sjmac...@users.sourceforge.net:

Unicode 5.2.0 chapter 3 (Conformance) has a new section (headed "Constraints on 
Conversion Processes") after requirement D93. Recent Pythons e.g. 3.1.2 don't 
comply. Using the Unicode example:

>>> print(ascii(b"\xc2\x41\x42".decode('utf8', 'replace')))
'\ufffdB'
# should produce u'\ufffdAB'

Resynchronisation currently starts at a position derived by considering the 
length implied by the start byte:

>>> print(ascii(b"\xf1ABCD".decode('utf8', 'replace')))
'\ufffdD'
# should produce u'\ufffdABCD'; resync should start from the *failing* byte.

Notes: This applies to the 'ignore' option as well as the 'replace' option. The 
Unicode discussion mentions security exploits.

--
messages: 101972
nosy: sjmachin
severity: normal
status: open
title: str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0
type: behavior
versions: Python 2.7, Python 3.1

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8271
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2009-08-15 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

Simplification of mark's first two problems:

Problem 1: looks like regex's negative look-ahead assertion is broken
>>> re.findall(r'(?!a)\w', 'abracadabra')
['b', 'r', 'c', 'd', 'b', 'r']
>>> regex.findall(r'(?!a)\w', 'abracadabra')
[]


Problem 2: in VERBOSE mode, regex appears to be ignoring spaces inside
character classes

>>> import re, regex
>>> pat = r'(\w)([- ]?)(\w{4})'
>>> for data in ['a', 'a-', 'a ']:
...     print re.compile(pat).findall(data), regex.compile(pat).findall(data)
...     print re.compile(pat, re.VERBOSE).findall(data), regex.compile(pat, regex.VERBOSE).findall(data)
...
[('a', '', '')] [('a', '', '')]
[('a', '', '')] [('a', '', '')]
[('a', '-', '')] [('a', '-', '')]
[('a', '-', '')] [('a', '-', '')]
[('a', ' ', '')] [('a', ' ', '')]
[('a', ' ', '')] []
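
For reference, the behaviour of the standard re module on both points can be
checked without the regex package installed. This is a sketch; the input
'a bbbb' is a made-up sample, not the data from the session above.

```python
import re

# Negative look-ahead: every word character that is not 'a'.
print(re.findall(r'(?!a)\w', 'abracadabra'))

# A space inside a character class is kept even in VERBOSE mode.
pat = r'(\w)([- ]?)(\w{4})'
sample = 'a bbbb'
print(re.findall(pat, sample))
print(re.findall(pat, sample, re.VERBOSE))
```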

HTH,
John

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2009-08-11 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

What is the expected timing comparison with re? Running the Aug10#3
version on Win XP SP3 with Python 2.6.3, I see regex typically running
at only 20% to 50% of the speed of re in ASCII mode, with
not-very-atypical tests (find all Python identifiers in a line, failing
search for a Python identifier in an 80-byte text). Is the supplied
_regex.pyd from some sort of debug or unoptimised build? Here are some
results:

dos-prompt>\python26\python -mtimeit -s"import re as x;r=x.compile(r'[A-Za-z_][A-Za-z0-9_]+');t='def __init__(self, arg1, arg2):\n'" "r.findall(t)"
10 loops, best of 3: 5.32 usec per loop

dos-prompt>\python26\python -mtimeit -s"import regex as x;r=x.compile(r'[A-Za-z_][A-Za-z0-9_]+');t='def __init__(self, arg1, arg2):\n'" "r.findall(t)"
10 loops, best of 3: 12.2 usec per loop

dos-prompt>\python26\python -mtimeit -s"import re as x;r=x.compile(r'[A-Za-z_][A-Za-z0-9_]+');t='1234567890'*8" "r.search(t)"
100 loops, best of 3: 1.61 usec per loop

dos-prompt>\python26\python -mtimeit -s"import regex as x;r=x.compile(r'[A-Za-z_][A-Za-z0-9_]+');t='1234567890'*8" "r.search(t)"
10 loops, best of 3: 7.62 usec per loop

Here's the worst case that I've found so far:

dos-prompt>\python26\python -mtimeit -s"import re as x;r=x.compile(r'z{80}');t='z'*79" "r.search(t)"
100 loops, best of 3: 1.19 usec per loop

dos-prompt>\python26\python -mtimeit -s"import regex as x;r=x.compile(r'z{80}');t='z'*79" "r.search(t)"
1000 loops, best of 3: 334 usec per loop

See Friedl: length cognizance. Corresponding figures for match() are
1.11 and 8.5.
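
The same kind of measurement can be scripted with the timeit module instead
of the command line. The sketch below times only the standard re module on
the worst-case pattern; best_usec is a made-up helper, the loop counts are
arbitrary, and the numbers will differ from machine to machine.

```python
import timeit

def best_usec(stmt, setup, number=10000, repeat=3):
    """Best per-call time in microseconds over `repeat` runs."""
    times = timeit.repeat(stmt, setup=setup, number=number, repeat=repeat)
    return min(times) / number * 1e6

# 79 characters can never satisfy z{80}; an engine that knows the minimum
# match length can give up without scanning.
setup = "import re; r = re.compile(r'z{80}'); t = 'z' * 79"

print('search: %.2f usec' % best_usec('r.search(t)', setup))
print('match:  %.2f usec' % best_usec('r.match(t)', setup))
```

To compare another engine, change the import in the setup string, for
example to a module with a compatible compile() function.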

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2009-08-10 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

Adding to vbr's report: [2.6.2, Win XP SP3] (1) bug mallocs memory
inside loop (2) also happens to regex.findall with patterns 'a{0,0}' and
'\B' (3) regex.sub('', 'x', 'abcde') has similar problem BUT 'a{0,0}'
and '\B' appear to work OK.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

2009-08-03 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

Problem is memory leak from repeated calls of e.g.
compiled_pattern.search(some_text). Task Manager performance panel shows
increasing memory usage with regex but not with re. It appears to be
cumulative i.e. changing to another pattern or text doesn't release memory.

Environment: Python 2.6.2, Windows XP SP3, latest (29 July) regex zip file.

Example:

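8<--- regex_timer.py
# Python 2 script (xrange, print statement), as run on Python 2.6.2.
# Usage: regex_timer.py MODULE COUNT PATTERN EXPECTED
import sys
import time
if sys.platform == 'win32':
    timer = time.clock
else:
    timer = time.time
module = __import__(sys.argv[1])    # 're' or 'regex'
count = int(sys.argv[2])
pattern = sys.argv[3]
expected = sys.argv[4]
text = 80 * '~' + 'qwerty'
rx = module.compile(pattern)
t0 = timer()
# Repeated searches with one compiled pattern; memory use is watched externally.
for i in xrange(count):
    assert rx.search(text).group(0) == expected
t1 = timer()
print "%d iterations in %.6f seconds" % (count, t1 - t0)
8<---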

Here are the results of running this (plus observed difference between
peak memory usage and base memory usage):

dos-prompt>\python26\python regex_timer.py regex 100 ~ ~
100 iterations in 3.811500 seconds [60 Mb]

dos-prompt>\python26\python regex_timer.py regex 200 ~ ~
200 iterations in 7.581335 seconds [128 Mb]

dos-prompt>\python26\python regex_timer.py re 200 ~ ~
200 iterations in 2.549738 seconds [3 Mb]

This happens on a variety of patterns: w, wert, [a-z]+, [a-z]+t,
...
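
A rough Python 3 equivalent of the timing loop, as a sketch only: time.clock was removed in Python 3.8, so time.perf_counter is used instead, and the function name time_search is made up for illustration. It measures elapsed time only; memory growth still has to be observed externally (e.g. in Task Manager).

```python
import importlib
import time


def time_search(module_name, count, pattern, expected):
    """Time `count` repeated searches using the named regex module."""
    module = importlib.import_module(module_name)
    text = 80 * '~' + 'qwerty'
    rx = module.compile(pattern)
    t0 = time.perf_counter()
    for _ in range(count):
        # Same check as the original script: every search must succeed.
        assert rx.search(text).group(0) == expected
    return time.perf_counter() - t0


elapsed = time_search('re', 1000, '[a-z]+t', 'qwert')
print('%d iterations in %.6f seconds' % (1000, elapsed))
```

Passing 'regex' instead of 're' would time the third-party module, assuming it is installed.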

--
nosy: +sjmachin

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2636
___



[issue5095] msi missing from bdist --help-formats

2009-03-25 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

The 2.6.1 documentation consists of a *single* line:
"distutils.command.bdist_msi — Build a Microsoft Installer binary
package". AFAICT this is the *only* mention of "msi" in the docs
(outside the msilib module). I heard about it only by word-of-mouth.
Docs should explain why a packager might want to use it instead of
wininst, and why the output msi is specific to the creating version of
python.

--
nosy: +sjmachin

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue5095
___



[issue4847] csv fails when file is opened in binary mode

2009-03-09 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

Before patching, could we discuss the requirements?

There are two different concepts:
(1) text file (assume that CR and/or LF are line terminators, and
provide methods for accessing a line at a time) versus binary file (no
such assumptions, no such access)
(2) reading the file as a raw undecoded bytes file or as a decoded
str file.

Options for 3.X:
(1) caller uses mode 'rb', is given bytes objects back.
(2) caller uses mode 'rt' and provides an encoding, is given str objects
back.
IMPORTANT: Option 2 must NOT read the file as a collection of
lines; it must process it (conceptually at least) a character at a
time so that embedded CR and/or LF are not taken to be row terminators.

Following the line that 3.X should do what's best, not what we used
to do, the implication is that we choose option 2.

--
message_count: 10.0 - 11.0
nosy: +skip.montanaro
nosy_count: 6.0 - 7.0

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue4847
___



[issue4847] csv fails when file is opened in binary mode

2009-03-09 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

... and it looks like Option 2 might already *almost* be in place.
Continuing with the previous example (book1.csv has embedded lone LFs):

C:\devel\csv>\python30\python -c "import csv;
print(repr(list(csv.reader(open('book1.csv','rt', encoding='ascii')))))"
[['Field1', 'Field 2 has a\nvery long\nheading', 'Field3'], ['1.11',
'2.22', '3.33']]

Looks good. However consider book2.csv which has embedded CRLFs:
C:\devel\csv>\python30\python -c "print(repr(open('book2.csv',
'rb').read()))"
b'Field1,"Field 2 has a\r\nvery
long\r\nheading",Field3\r\n1.11,2.22,3.33\r\n'

This gives:
C:\devel\csv>\python30\python -c "import csv;
print(repr(list(csv.reader(open('book2.csv','rt', encoding='ascii')))))"
[['Field1', 'Field 2 has a\nvery long\nheading', 'Field3'], ['1.11',
'2.22', '3.33']]

Not good. It should preserve ALL characters in the field.
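
A sketch of the difference on a current Python 3, assuming the multi-line field in book2.csv is double-quoted as a spreadsheet export would write it. The temporary file stands in for book2.csv. With the default newline=None the embedded CRLFs are translated to LF before csv sees them; with newline='' they come through intact.

```python
import csv
import os
import tempfile

# Same layout as book2.csv: a quoted field with embedded CRLFs.
data = (b'Field1,"Field 2 has a\r\nvery long\r\nheading",Field3\r\n'
        b'1.11,2.22,3.33\r\n')

with tempfile.NamedTemporaryFile(delete=False, suffix='.csv') as tmp:
    tmp.write(data)
    path = tmp.name

try:
    # Default newline=None: universal newlines WITH translation.
    with open(path, 'rt', encoding='ascii') as f:
        translated = list(csv.reader(f))

    # newline='': line endings are passed through untranslated.
    with open(path, 'rt', encoding='ascii', newline='') as f:
        preserved = list(csv.reader(f))
finally:
    os.remove(path)

print(repr(translated[0][1]))
print(repr(preserved[0][1]))
```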

--
message_count: 11.0 - 12.0
versions: +Python 3.1

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue4847
___



[issue4847] csv fails when file is opened in binary mode

2009-03-09 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

pitrou Please look at the doc for open() and io.TextIOWrapper. The
`newline` parameter defaults to None, which means universal newlines
with newline translation. Setting to '' (yes, the empty string) enables
universal newlines but disables newline translation ...

I had already read it. I gave it a prize for least intuitive arg in the
language. So you plan to use that, reading lines instead of blocks?
You'll still have to examine which CRs and LFs are embedded and which
are line terminators. You might just as well use f.read(BLOCKSZ) and
avoid having to insist that the user explicitly write ", newline=''".

pitrou However, I think csv should accept files opened in binary mode
and be able to deal with line endings itself. How am I supposed to know
the encoding of a CSV file? Surely Excel uses a defined, default
encoding when exporting to CSV... that knowledge should be embedded in
the csv module.

Excel has no default, because the user has no option -- the defined
encoding is "cp" + str(codepage_number_derived_from_locale), e.g.
cp1252. Likewise other software writing delimited data to text files
will use (one of) the local legacy encoding(s).

So: (i) mode='rb' and no encoding => caller gets bytes back and needs to
do own decoding or (ii) mode='rb' and an encoding [which looks rather
daft and is currently not possible] and the caller gets str objects.
Both of these are ugly -- hence my preference for the mode='rt' variety
of solution. Do we really want the double hassle of both a str csv
implementation and a bytes csv implementation?
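
For a caller who only has a binary stream, one possible approach on a current Python 3 is to wrap it in io.TextIOWrapper and supply the encoding explicitly. This is a sketch, not a proposal for the csv module itself; the sample bytes are made up for illustration.

```python
import csv
import io

# Bytes as a legacy-encoded exporter might write them; 0x93 and 0x94 are
# the cp1252 curly double quotes. The content is invented for this example.
raw = b'name,note\r\nwidget,"\x93quoted\x94 text\r\nsecond line"\r\n'

# The caller supplies the encoding; newline='' leaves CR/LF untranslated
# so that csv can tell embedded line breaks from row terminators.
stream = io.TextIOWrapper(io.BytesIO(raw), encoding='cp1252', newline='')
rows = list(csv.reader(stream))
print(rows)
```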

--
message_count: 13.0 - 14.0

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue4847
___



[issue5455] csv module no longer works as expected when file opened in binary mode

2009-03-08 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

This is in effect a duplicate of issue 4847.

Summary:
The docs are CORRECT.
The 3.X implementation is WRONG.
The 2.X implementation is CORRECT.

See examples in my comment on issue 4847.

--
message_count: 3.0 - 4.0
nosy: +sjmachin
nosy_count: 2.0 - 3.0

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue5455
___



[issue4847] csv fails when file is opened in binary mode

2009-02-23 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

Sorry, folks, we've got an understanding problem here. CSV files are
typically NOT created by text editors. They are created e.g. by save as
csv from a spreadsheet program, or as an output option by some database
query program. They can have just about any character in a field,
including \r and \n. Fields containing those characters should be quoted
(just like a comma) by the csv file producer. A csv reader should be
capable of reproducing the original field division. Here for example is
a dump of a little file I just created using Excel 2003:

C:\devel\csv>\python26\python -c "print repr(open('book1.csv','rb').read())"
'Field1,"Field 2 has a\nvery long\nheading",Field3\r\n1.11,2.22,3.33\r\n'

Inserting \n into a text field in Excel (using Alt-Enter) is a
well-known user trick.

Here's what we get from Python 2.6.1:
C:\devel\csv>\python26\python -c "import csv; print
repr(list(csv.reader(open('book1.csv','rb'))))"
[['Field1', 'Field 2 has a\nvery long\nheading', 'Field3'], ['1.11',
'2.22', '3.33']]
and the same by design all the way back to Python 2.3's csv module and
its ancestor, the ObjectCraft csv module.

However with Python 3.0.1 we get:
C:\devel\csv>\python30\python -c "import csv;
print(repr(list(csv.reader(open('book1.csv','rb')))))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
_csv.Error: iterator should return strings, not bytes (did you open the
file in text mode?)

This sentence in the documentation is NOT an error: "If csvfile is a
file object, it must be opened with the ‘b’ flag on platforms where that
makes a difference."

The problem *IS* a biggie.

This paragraph in the documentation (evidently introduced in 2.5) is
rather confusing: "The parser is quite strict with respect to
multi-line quoted fields. Previously, if a line ended within a quoted
field without a terminating newline character, a newline would be
inserted into the returned field. This behavior caused problems when
reading files which contained carriage return characters within fields.
The behavior was changed to return the field without inserting newlines.
As a consequence, if newlines embedded within fields are important, the
input should be split into lines in a manner which preserves the newline
characters." Some examples of what it is talking about would be a very
good idea.
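
One possible example of what that paragraph describes, as a sketch on a current Python 3 with an in-memory string standing in for the file: if the input is split into lines without their terminators, the reader joins the pieces of a quoted field with nothing in between; if the terminators are kept, the embedded newlines survive.

```python
import csv

text = ('Field1,"Field 2 has a\nvery long\nheading",Field3\r\n'
        '1.11,2.22,3.33\r\n')

# Lines WITHOUT their terminators: no newline is inserted on rejoin.
stripped = list(csv.reader(text.splitlines()))

# Lines WITH their terminators: embedded newlines are preserved.
kept = list(csv.reader(text.splitlines(keepends=True)))

print(repr(stripped[0][1]))
print(repr(kept[0][1]))
```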

--
nosy: +sjmachin

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue4847
___



[issue5107] built-in open(..., encoding=vague_default)

2009-01-29 Thread John Machin

New submission from John Machin sjmac...@users.sourceforge.net:

Docs say "The default encoding is platform dependent" but don't say
how to find out what that is, or how it is determined. On my Windows XP
SP3 setup, the default is cp1252, but the best/only guess at finding out
without actually opening a file involved sys.getdefaultencoding() which
produces 'utf-8'. I was pointed at locale.getpreferredencoding(), which
returns 'cp1252' on my machine.

Please add a sentence along these lines: "The default encoding is
(obtained by calling|the same as) locale.getpreferredencoding(), not
sys.getdefaultencoding()" -- corrected/amplified as necessary.
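
The two calls can be compared directly. A small sketch for a current Python 3; the value of the locale-derived encoding depends on the platform and settings, and recent versions with UTF-8 mode enabled may report UTF-8 regardless of locale, so no particular value is claimed for it here.

```python
import locale
import sys

# Encoding used for implicit str <-> bytes conversions inside the
# interpreter; this is 'utf-8' on Python 3 and says nothing about files.
interpreter_default = sys.getdefaultencoding()

# Locale-derived encoding, e.g. 'cp1252' on a Western European Windows box.
# Passing False avoids temporarily changing the locale on some platforms.
locale_default = locale.getpreferredencoding(False)

print(interpreter_default, locale_default)
```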

--
assignee: georg.brandl
components: Documentation
messages: 80811
nosy: georg.brandl, sjmachin
severity: normal
status: open
title: built-in open(..., encoding=vague_default)
versions: Python 3.0, Python 3.1

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue5107
___



[issue4971] Incorrect title case

2009-01-17 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

Martin: "Considering this note, the simple titlecase of U+01C5 *is*
U+01C4: the titlecase value is omitted, hence it is the same as
uppercase, hence it is U+01C4."

Perhaps we are looking at different files; in the Unicode 5.1
UnicodeData.txt that I downloaded
(http://www.unicode.org/Public/UNIDATA/UnicodeData.txt), the title field
for U+01C5 is *NOT* omitted, it is set to 01C5. AFAICT the intention is
that the four characters in question are their own titlecase, which is
not altogether unexpected given their visual representation.

Here's the record for U+01C5:
01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH
CARON;Lt;0;L;<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z
HACEK;;01C4;01C6;01C5

The note (which I hadn't noticed and explains the mention of
ctype->upper in the _PyUnicode_ToTitlecase function) says that the
titlecase value may be omitted if it is the same as the uppercase. FWIW
there are *no* examples in the current (5.1) file where the title field
is empty and the upper field is not empty.

ISTM the problem is that implementing the default-to-uppercase was not
done in Tools/unicode/makeunicodedata.py where full information is
available. This left no way in _PyUnicode_ToTitlecase of resolving the
ambiguity of a zero value for ctype->title -- is it "no titlecase
supplied so use uppercase" or is it "titlecase supplied, delta == 0,
means ch.title() -> ch"?
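
The intended mappings for this character family can be checked on a current Python 3, where the behaviour matches the UnicodeData.txt record quoted above. This is a sketch of the expected results, not of the 2009 implementation.

```python
import unicodedata

# The three case forms of the DZ-with-caron digraph.
dz_upper, dz_title, dz_lower = '\u01c4', '\u01c5', '\u01c6'

# U+01C5 is a titlecase letter (general category Lt).
category = unicodedata.category(dz_title)

# Mappings taken from the last three fields of the record: 01C4;01C6;01C5
mappings = (dz_title.upper(), dz_title.lower(), dz_title.title())

print(category, [hex(ord(c)) for c in mappings])
```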

--
nosy: +sjmachin

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue4971
___



[issue4742] 3.0 distutils byte-compiling - Syntax error: unknown encoding: cp1252

2008-12-30 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

TWO POINTS:
(1) I am not very concerned about chars like \x9d which are not valid in
the declared encoding; I am more concerned with chars like \x93 and \x94
which *ARE* valid in the declared encoding. Please ensure that these
cases are included in tests.
(2) Please check your test data and test results. I get different
results. I have created a file x9d.py by making the minimal changes to
x94.py. For me, this blows up on bytecompiling with *both* 3.0
(UnicodeDecodeError, as expected) and 2.x (Syntax Error unknown encoding
cp1252, wrong message) -- see below.

byte-compiling C:\python30\Lib\site-packages\x9d.py to x9d.pyc
Traceback (most recent call last):
  File "setup.py", line 5, in <module>
    py_modules = ["foo3", "bar3", "x93", "x94", "x9d", "xa0b7"]
  File "C:\python30\lib\distutils\core.py", line 149, in setup
    dist.run_commands()
  File "C:\python30\lib\distutils\dist.py", line 942, in run_commands
    self.run_command(cmd)
  File "C:\python30\lib\distutils\dist.py", line 962, in run_command
    cmd_obj.run()
  File "C:\python30\lib\distutils\command\install.py", line 571, in run
    self.run_command(cmd_name)
  File "C:\python30\lib\distutils\cmd.py", line 317, in run_command
    self.distribution.run_command(command)
  File "C:\python30\lib\distutils\dist.py", line 962, in run_command
    cmd_obj.run()
  File "C:\python30\lib\distutils\command\install_lib.py", line 91, in run
    self.byte_compile(outfiles)
  File "C:\python30\lib\distutils\command\install_lib.py", line 125, in
byte_compile
    dry_run=self.dry_run)
  File "C:\python30\lib\distutils\util.py", line 520, in byte_compile
    compile(file, cfile, dfile)
  File "C:\python30\lib\py_compile.py", line 137, in compile
    codestring = f.read()
  File "C:\python30\lib\io.py", line 1724, in read
    decoder.decode(self.buffer.read(), final=True))
  File "C:\python30\lib\io.py", line 1295, in decode
    output = self.decoder.decode(input, final=final)
  File "C:\python30\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position
64: character maps to <undefined>

byte-compiling C:\python26\Lib\site-packages\x9d.py to x9d.pyc
SyntaxError: ('unknown encoding: cp1252',
('C:\\python26\\Lib\\site-packages\\x9d.py', 0, 0, None))

byte-compiling c:\python25\Lib\site-packages\x9d.py to x9d.pyc
  File "c:\python25\Lib\site-packages\x9d.py", line 0
SyntaxError: ('unknown encoding: cp1252',
('c:\\python25\\Lib\\site-packages\\x9d.py', 0, 0, None))
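
The distinction between the two kinds of byte can be seen without distutils. A minimal sketch on a current Python 3, where the cp1252 codec leaves 0x9d unmapped:

```python
# 0x93 and 0x94 ARE valid in cp1252: they are the curly double quotes.
valid = b'\x93\x94'.decode('cp1252')

# 0x9d is NOT mapped in Python's cp1252 codec, so decoding fails.
try:
    b'\x9d'.decode('cp1252')
except UnicodeDecodeError as exc:
    error = exc
else:
    error = None

print(ascii(valid), error)
```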

Added file: http://bugs.python.org/file12492/x9d.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue4742
___



[issue4742] 3.0 distutils byte-compiling - Syntax error: unknown encoding: cp1252

2008-12-30 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

(1) what am I supposed to infer from "Yup"?? That all of that \x9d stuff
was a mistake?

(2)
+    def tearDown(self):
+        pyc_file = os.path.join(os.path.dirname(__file__), 'cp1252.pyc')
+        if os.path.exists(pyc_file):
+            os.patth.remove(pyc_file)

os.patth is novel :-)

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue4742
___



[issue4626] compile() doesn't ignore the source encoding when a string is passed in

2008-12-30 Thread John Machin

Changes by John Machin sjmac...@users.sourceforge.net:


--
nosy: +sjmachin

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue4626
___



[issue4742] 3.0 distutils byte-compiling - Syntax error: unknown encoding: cp1252

2008-12-24 Thread John Machin

New submission from John Machin sjmac...@users.sourceforge.net:

File foo3.py is [cut down (orig 87Kb)] output of 2to3 conversion tool
and (coincidentally) is still valid 2.x syntax. There are no syntax
errors reported by any of the following:
   \python26\python -c "import foo3"
   \python26\python foo3.py
   \python26\python setup.py install
   \python30\python -c "import foo3"
   \python30\python foo3.py
However 3.0 install
   \python30\python setup.py install
produces:

[snip]
running install_lib
copying build\lib\foo3.py -> C:\python30\Lib\site-packages
byte-compiling C:\python30\Lib\site-packages\foo3.py to foo3.pyc
  File "C:\python30\Lib\site-packages\foo3.py", line 0
### Note also "line 0" above ###
SyntaxError: unknown encoding: cp1252

Same happens if alternative name windows-1252 is used instead of cp1252.

NOTE: file foo3.py actually does have some non-ASCII characters (\xa0,
\x93, \x94), in comments. Another file (bar3.py) from the same package
contains \xb7 twice, but doesn't have the "unknown encoding" problem.
There are several other files in the same package that start with "# -*-
coding: windows-1252 -*-" (or cp1252, or even cp1251(!)) but have no
non-ASCII characters in them. They don't get this incorrect error
message either.

--
components: Distutils
files: py3encbug.zip
messages: 78273
nosy: sjmachin
severity: normal
status: open
title: 3.0 distutils byte-compiling - Syntax error: unknown encoding: cp1252
versions: Python 3.0
Added file: http://bugs.python.org/file12445/py3encbug.zip

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue4742
___



[issue4742] 3.0 distutils byte-compiling - Syntax error: unknown encoding: cp1252

2008-12-24 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

A clue:

>>> print(ascii(b'\xa0\x93\x94\xb7'.decode('cp1252')))
'\xa0\u201c\u201d\xb7'

Could be that it only happens where there's a cp1252 character that's
not in latin1; see files x93.py and x94.py (have problem) and xa0b7.py
(doesn't have problem).
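
The clue can be made explicit by decoding the same bytes both ways and listing where the results differ. A sketch on a current Python 3:

```python
raw = b'\xa0\x93\x94\xb7'

as_cp1252 = raw.decode('cp1252')
as_latin1 = raw.decode('latin-1')

# Bytes whose cp1252 character is not the same as the latin1 character.
differing = [hex(byte)
             for byte, c1252, clatin in zip(raw, as_cp1252, as_latin1)
             if c1252 != clatin]

print(ascii(as_cp1252), differing)
```

Only 0x93 and 0x94 differ, which lines up with x93.py and x94.py being the files that show the problem.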

Added file: http://bugs.python.org/file12446/py3encbug2.zip

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue4742
___



[issue4742] 3.0 distutils byte-compiling - Syntax error: unknown encoding: cp1252

2008-12-24 Thread John Machin

Changes by John Machin sjmac...@users.sourceforge.net:


Removed file: http://bugs.python.org/file12445/py3encbug.zip

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue4742
___



[issue4743] intra-pkg multiple import (import local1, local2) not fixed

2008-12-24 Thread John Machin

New submission from John Machin sjmac...@users.sourceforge.net:

In a package, "import local1, local2" is not fixed. Here's some real
live 2to3 output showing the problem and the workaround:
  
 import ExcelFormulaParser, ExcelFormulaLexer
-import ExcelFormulaParser
-import ExcelFormulaLexer
+from . import ExcelFormulaParser
+from . import ExcelFormulaLexer
 import sys, struct
-from antlr import ANTLRException
+from .antlr import ANTLRException

As a solution that covers cases like "import sys, local1, local2" is
possibly difficult, I suggest putting out a warning that a manual fix
(one import per line) may be required. I've put this kludge in my copy
of fix_import.py:

 def probably_a_local_import(imp_name, file_path):
+    if "," in imp_name:
+        print("*** Can't handle import %r in %s" % (imp_name, file_path))
     # Must be stripped because the right space is included by the parser
     imp_name = imp_name.split('.', 1)[0].strip()
     base_path = dirname(file_path)
     base_path = join(base_path, imp_name)

[Aside: right space? Possibly should be left space]

and it produces:

*** Can't handle import ' ExcelFormulaParser, ExcelFormulaLexer' in
\2to3\xlwt\py3\xlwt\ExcelFormula.py
*** Can't handle import ' sys, struct' in
\2to3\xlwt\py3\xlwt\ExcelFormula.py
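
The manual fix (one import per line) could be automated along these lines. This is a standalone sketch using the ast module, not part of 2to3; the helper name split_import_line is hypothetical.

```python
import ast


def split_import_line(line):
    """Rewrite 'import a, b' as one 'import' statement per module."""
    node = ast.parse(line).body[0]
    if not isinstance(node, ast.Import):
        # 'from x import y' and non-import statements are left alone.
        return [line]
    statements = []
    for alias in node.names:
        stmt = 'import ' + alias.name
        if alias.asname:
            stmt += ' as ' + alias.asname
        statements.append(stmt)
    return statements


print(split_import_line('import ExcelFormulaParser, ExcelFormulaLexer'))
```

Once each module is on its own line, the existing fixer can decide per line whether the import is local.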

--
components: 2to3 (2.x to 3.0 conversion tool)
messages: 78276
nosy: sjmachin
severity: normal
status: open
title: intra-pkg multiple import (import local1, local2) not fixed
versions: Python 3.0

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue4743
___



[issue4669] bytes,join and bytearray.join not in manual; help for bytes.join is wrong.

2008-12-19 Thread John Machin

John Machin sjmac...@users.sourceforge.net added the comment:

Terry, you are right. I missed that. My report was based on looking via
the index and finding only (str method), no (byte[sarray] method).

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue4669
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue4669] bytes,join and bytearray.join not in manual; help for bytes.join is wrong.

2008-12-15 Thread John Machin

New submission from John Machin sjmac...@users.sourceforge.net:

These methods are parallel to str.join, seem to work as expected, and
have help entries. However there is nothing in the Library Reference
Manual about them.

>>> help(bytearray.join)
Help on method_descriptor:

join(...)
    B.join(iterable_of_bytes) -> bytearray

    Concatenate any number of bytes/bytearray objects, with B
    in between each pair, and return the result as a new bytearray.
### OK but could use an example.

>>> help(bytes.join)
Help on method_descriptor:

join(...)
    B.join(iterable_of_bytes) -> bytes

    Concatenate any number of bytes objects, with B in between each pair.
### Above sentence should read "Concatenate any number of
bytes/bytearray objects, with B in between each pair, and return the
result as a new bytes object."
Example: b'.'.join([b'ab', b'pq', b'rs']) -> b'ab.pq.rs'.
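
A slightly fuller sketch of the kind of example the docs could carry, checked against a current Python 3: both methods accept a mix of bytes and bytearray items, and the result type follows the separator.

```python
parts = [b'ab', bytearray(b'pq'), b'rs']

# bytes separator -> bytes result
joined_bytes = b'.'.join(parts)

# bytearray separator -> bytearray result
joined_bytearray = bytearray(b'.').join(parts)

print(joined_bytes, joined_bytearray)
```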

--
assignee: georg.brandl
components: Documentation
messages: 77849
nosy: georg.brandl, sjmachin
severity: normal
status: open
title: bytes,join and bytearray.join not in manual; help for bytes.join is 
wrong.
versions: Python 3.0, Python 3.1

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue4669
___



[issue4574] reading UTF16-encoded text file crashes if \r on 64-char boundary

2008-12-07 Thread John Machin

New submission from John Machin [EMAIL PROTECTED]:

Problem in the newline handling in io.py, class
IncrementalNewlineDecoder, method decode. It reads text files in
128-byte chunks. Converting CR LF to \n requires special case handling
when '\r' is detected at the end of the decoded chunk in case
there's an LF at the start of the next chunk. It prepends b'\r' (only 1
byte) to the next chunk's raw bytes and decodes that. But \r in UTF-16
takes 2 bytes; we are now 1 byte out of kilter and various failures are
possible (including silently producing garbage output from a truncated
file with an odd number of bytes).

The attached script illustrates the problems.
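
Independently of the attached script, the misalignment can be shown in a few lines. This sketch runs on a current Python 3, where io.IncrementalNewlineDecoder holds back the decoded '\r' character rather than a raw byte; the tiny 4-byte chunk stands in for the real chunk boundary.

```python
import codecs
import io

raw = 'x\r\ny'.encode('utf-16-le')    # 2 bytes per character
first, second = raw[:4], raw[4:]       # chunk boundary right after '\r'

# Misaligned: a 1-byte b'\r' glued onto 2-byte-per-character data, as
# described in the report. Every following code unit is shifted by a byte.
utf16 = codecs.getincrementaldecoder('utf-16-le')
misaligned = utf16().decode(b'\r' + second)

# Aligned: the pending '\r' is kept as a decoded character.
newline_decoder = io.IncrementalNewlineDecoder(utf16(), True)
aligned = (newline_decoder.decode(first)
           + newline_decoder.decode(second, final=True))

print(ascii(misaligned), ascii(aligned))
```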

--
components: Interpreter Core
files: py30cr64bug.py
messages: 77219
nosy: sjmachin
severity: normal
status: open
title: reading UTF16-encoded text file crashes if \r on 64-char boundary
type: crash
versions: Python 3.0
Added file: http://bugs.python.org/file12260/py30cr64bug.py

___
Python tracker [EMAIL PROTECTED]
http://bugs.python.org/issue4574
___