[Python-Dev] Re: The current state of typing PEPs

2021-11-30 Thread Walter Dörwald

On 29 Nov 2021, at 23:56, Barry Warsaw wrote:


[...]
(not that you're not allowed to use them for anything else, of course you 
are, but that other uses won't be taken into account when designing 
the new interface)


But I have never seen that clearly stated anywhere. The closest is 
from PEP 563, where it says:


"""
With this in mind, uses for annotations incompatible with the 
aforementioned PEPs should be considered deprecated.

"""

Which pretty much makes the point, but it's a bit subtle -- what does 
"incompatible" mean?


You make a good point.  I agree that while all the signs are there for 
“annotations are for typing”, this has never been explicitly or 
sufficiently codified, and I’ve been proposing that Someone write a 
PEP that makes this an official pronouncement.  That someone may very 
well be a person on the SC, but given the timing of elections, it’s 
likely to fall to the next SC, if they still agree with that position! 
:-D


My recollection of the history of annotations falls somewhere between 
Greg’s and Guido’s.  Annotations as a feature were inspired by the 
typing use case (with no decision at the time whether those were to be 
static or runtime checks), but at the same time allowing for 
experimentation for other use cases.  Over time, 
annotations-for-typing clearly won the mindset and became the 
predominant use case.  Personally, I was strongly against type 
annotations because of my experience in other languages, but over time 
I was also won over.  From library documentation, to complex code 
bases, to the transient nature of contributors, type annotations are 
pretty compelling, and most (but not all) of my worries really 
didn’t come to fruition.


We can lament the non-typing use of annotations, but I think that 
horse is out of the barn and I don’t know how you would resolve 
conflicts of use for typing and non-typing annotations.  It’s been a 
slow boil, with no definitive pronouncement, and that needs to be 
fixed, but I think it's just acknowledging reality.  That’s my 
personal opinion.


But isn't that the reason why we have `typing.Annotated`, so that 
annotations used for typing and annotations used for other purposes can 
coexist?


An annotation used for typing only is:

```
>>> def f(x: int):
...     return x
...
>>> f.__annotations__['x']
<class 'int'>
```

An annotation used for something else is:

```
>>> def f(x: 'something not typing related'):
...     return x
...
>>> f.__annotations__['x']
'something not typing related'
```

`typing.Annotated` gives us both:

```
>>> from typing import *
>>> def f(x: Annotated[int, 'something not typing related']):
...     return x
...
>>> f.__annotations__['x'].__args__[0]
<class 'int'>
>>> f.__annotations__['x'].__metadata__[0]
'something not typing related'
```

Granted, for someone who only wants to use annotations for their own 
purpose, digging the original non-typing payload out of 
`typing.Annotated` is more work, but doable.
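For what it's worth, `typing.get_type_hints()` grew an `include_extras` parameter in Python 3.9 that keeps the `Annotated` metadata available without touching the dunders directly; a quick sketch of digging the payload out both ways:

```python
from typing import Annotated, get_type_hints

def f(x: Annotated[int, 'something not typing related']):
    return x

# By default get_type_hints() strips the extra metadata...
assert get_type_hints(f)['x'] is int

# ...but include_extras=True (Python 3.9+) keeps the Annotated wrapper,
# so non-typing consumers can still reach their payload:
hints = get_type_hints(f, include_extras=True)
assert hints['x'].__metadata__ == ('something not typing related',)
```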


Or am I missing something here?


[...]


Servus,
   Walter
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/4WM43CS45R7H7X2WO4F3UVDCH7HFHDJJ/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Relaxing the annotation syntax

2021-04-18 Thread Walter Dörwald

On 16 Apr 2021, at 19:38, Jelle Zijlstra wrote:

On Fri, 16 Apr 2021 at 10:01, Walter Dörwald wrote:


On 16 Apr 2021, at 16:59, Guido van Rossum wrote:

If you look deeper, the real complaints are all about the backwards
incompatibility when it comes to locally-scoped types in annotations. 
I.e.


def test():
class C: ...
def func(arg: C): ...
return func

typing.get_type_hints(test()) # raises NameError: name 'C' is not 
defined


Can't this be solved by wrapping the annotation in a lambda, i.e.


```
>>> def test():
...     class C: ...
...     def func(arg: lambda: C): ...
...     return func
...
>>> test().__annotations__['arg']()
<class '__main__.test.<locals>.C'>
```

So typing.get_type_hints() would simply call an annotation if the
annotation was callable and replace it with the result of the call.

That sort of thing can work, but just like string annotations it's not 
good

for usability.


Yes, but it's close to what PEP 649 does. The PEP even calls it 
"implicit lambda expressions".



Users using annotations will have to remember that in some
contexts they need to wrap their annotation in a lambda, and unless 
they
have a good understanding of how type annotations work under the hood, 
it
will feel like a set of arbitrary rules. That's what I like about PEP 
649:

code like this would (hopefully!) just work without needing users to
remember to use any special syntax.


Yes, that's what I like about PEP 649 too. It just works (in most 
cases), and for scoping it works like an explicit lambda expression, 
which is nothing new to learn.


If Python had taken the decision to evaluate default values for 
arguments not once at definition time, but on every call, I don't think 
that that would have been implemented via restringifying the AST for the 
default value.


But then again, the difference between default values and type 
annotations is that Python *does* use the default values. In most cases 
however Python does not use the type annotations, only the type checker 
does. The problem is where Python code *does* want to use the type 
annotation. For this case PEP 649 is the more transparent approach.
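The asymmetry is easy to demonstrate under PEP 563 semantics (a small sketch; the function `g` is just an illustration): the default value is evaluated exactly once, at definition time, while the annotation is never evaluated at all and is stored as text.

```python
from __future__ import annotations  # PEP 563: stringified annotations

def g(x: int = len("abc")):
    return x

# The default value was computed at definition time:
assert g() == 3
# The annotation, by contrast, was stored as an unevaluated string:
assert g.__annotations__['x'] == 'int'
```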


Servus,
   Walter
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/IMDY2NLPM36CHO6JFKDBE54NDWF3OPZO/


[Python-Dev] Re: Relaxing the annotation syntax

2021-04-16 Thread Walter Dörwald

On 16 Apr 2021, at 16:59, Guido van Rossum wrote:


If you look deeper, the real complaints are all about the backwards
incompatibility when it comes to locally-scoped types in annotations. 
I.e.


def test():
  class C: ...
  def func(arg: C): ...
  return func

typing.get_type_hints(test())  # raises NameError: name 'C' is not 
defined


Can't this be solved by wrapping the annotation in a lambda, i.e.

```

>>> def test():
...     class C: ...
...     def func(arg: lambda: C): ...
...     return func
...
>>> test().__annotations__['arg']()
<class '__main__.test.<locals>.C'>
```

So `typing.get_type_hints()` would simply call an annotation if the 
annotation was callable and replace it with the result of the call.



And that is a considerable concern (we've always let backwards
compatibility count more strongly than convenience of new features). 
While
it was known this would change, there was no real deprecation of the 
old

way. Alas.

On Fri, Apr 16, 2021 at 1:51 AM Sebastian Rittau  
wrote:


On Sun, Apr 11, 2021 at 1:31 PM Barry Warsaw  
wrote:


[snip]

This is something the SC has been musing about, but as it’s not a 
fully
formed idea, I’m a little hesitant to bring it up.  That said, 
it’s
somewhat relevant: We wonder if it may be time to in a sense 
separate the
typing syntax from Python’s regular syntax.  TypeGuards are a case 
where if
typing had more flexibility to adopt syntax that wasn’t strictly 
legal
“normal” Python, maybe something more intuitive could have been 
proposed.
I wonder if the typing-sig has discussed this possibility (in the 
future,

of course)?

I am strongly in favor of diverging type annotation syntax from 
Python

syntax. Currently, type annotations are a very useful tool, but often
clunky to use. Enhancements have been made, but design space is 
limited

when working within existing Python syntax. Type annotations have a
different set of rules, needs, and constraints than general-purpose 
Python

code. This is similar to other domain specific languages like regular
expressions. Ideally, Python itself would not check the syntax of
annotations, except as needed for determining the end of an 
annotation. PEP

563 is a step in that direction.

As far as I understand the arguments against PEP 563 and in favor of 
PEP
649 mostly boil down to "annotations are used outside of typing, 
these uses
would need to use eval() in the future and eval() is slow". (At least 
from
a user's perspective, there are more arguments from a Python 
maintainer's
perspective that I can't comment on.) Are there benchmarks to verify 
that
using eval() has a non-negligible effect for this use case? Overall, 
I
don't find this to be a compelling argument when compared to the 
problem

that PEP 649 would close all design space for type annotation syntax
enhancements.

 - Sebastian


--
--Guido van Rossum (python.org/~guido)


Servus,
   Walter
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/HXJWNS4IIAHKTOCYR7AS5DOI5JAGLRPP/


[Python-Dev] Re: PEP 637 - Support for indexing with keyword arguments: request for feedback for SC submission

2021-02-04 Thread Walter Dörwald

On 2 Feb 2021, at 12:36, Stefano Borini wrote:


Hi all,

I would like to request feedback by python-dev on the current
implementation of PEP 637 - Support for indexing with keyword
arguments.

https://www.python.org/dev/peps/pep-0637/

The PEP is ready for SC submission and it has a prototype
implementation ready, available here (note, not reviewed, but
apparently fully functional)

https://github.com/python/cpython/compare/master...stefanoborini:PEP-637-implementation-attempt-2

(note: not sure if there's a preference for the link to be to the diff
or to the branch, let me know if you prefer I change the PEP link)


It seems to me that what complicates the specification is the need for 
backwards compatibility. If that weren't an issue, we could make indexing 
operations behave exactly like function calls. Handling the additional 
argument for `__setitem__` could be done the same way that passing 
`self` in a method call is done: by passing an additional positional 
argument (in this case as the second argument after `self`). So:


*   `foo[1]` is `type(foo).__getitem__(foo, 1)`
*   `foo[1, 2]` is `type(foo).__getitem__(foo, 1, 2)`
*   `foo[(1, 2)]` is `type(foo).__getitem__(foo, (1, 2))`
*   `foo[1] = 3` is `type(foo).__setitem__(foo, 3, 1)`
*   `foo[1, 2] = 3` is `type(foo).__setitem__(foo, 3, 1, 2)`
*   `foo[(1, 2)] = 3` is `type(foo).__setitem__(foo, 3, (1, 2))`

But of course this isn't backwards compatible with respect to the 
treatment of tuple arguments and the argument order in `__setitem__`. 
However it is **much** easier to remember and to teach.
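For contrast, the current behavior that the PEP has to stay compatible with can be checked with a small probe class:

```python
class Probe:
    def __getitem__(self, key):
        return key
    def __setitem__(self, key, value):
        self.last = (key, value)

p = Probe()
assert p[1] == 1
assert p[1, 2] == (1, 2)       # multiple indices arrive packed as a tuple
assert p[(1, 2)] == (1, 2)     # indistinguishable from an explicit tuple
p[1, 2] = 3
assert p.last == ((1, 2), 3)   # __setitem__ gets the key first, then the value
```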


The PEP rejects the idea to implement this approach via a new set of 
dunder methods (e.g. `__getitem_ex__`, `__setitem_ex__` and 
`__delitem_ex__`) for performance reasons, but would it make sense, to 
mark existing `__getitem__`, `__setitem__` and `__delitem__` methods as 
supporting the new calling convention via a decorator? I.e. something 
like:



```python
class X:
    @newstyle_item
    def __getitem__(self, x, y, z=42):
        ...

    @newstyle_item
    def __setitem__(self, value, x, y, z=42):
        ...

    @newstyle_item
    def __delitem__(self, x, y, z=42):
        ...
```

This wouldn't require an additional dictionary lookup, but just a check 
of a bit in the function object.
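One way such a decorator could be realized as a sketch, using an ordinary attribute as a stand-in for the flag bit (the actual index unpacking would of course need interpreter support; `newstyle_item` and `__newstyle_item__` are hypothetical names):

```python
def newstyle_item(func):
    # Hypothetical marker: a real implementation would set a flag bit
    # on the function object rather than an attribute.
    func.__newstyle_item__ = True
    return func

class X:
    @newstyle_item
    def __getitem__(self, x, y, z=42):
        return (x, y, z)

x = X()
# The interpreter would check the marker and pass indices as plain
# positional/keyword arguments; calling the method directly shows the intent:
assert getattr(X.__getitem__, '__newstyle_item__', False)
assert x.__getitem__(1, 2) == (1, 2, 42)
```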



Thank you for your help.


Servus,
   Walter
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/DTOO36EXJRBGA7OJVTRAE7I43D2FR7BS/


[Python-Dev] Re: [Python-checkins] bpo-41944: No longer call eval() on content received via HTTP in the UnicodeNames tests (GH-22575)

2020-10-07 Thread Walter Dörwald

On 7 Oct 2020, at 1:27, Victor Stinner wrote:


Hi Walter,


https://github.com/python/cpython/commit/a8bf44d04915f7366d9f8dfbf84822ac37a4bab3


On Tue, 6 Oct 2020 at 17:02, Walter Dörwald wrote:
It would be even simpler to use unicodedata.lookup() which returns 
the unicode character when passed the name of the character


That was my first idea as well when I reviewed the change, but the
function contains this comment:

def checkletter(self, name, code):
    # Helper that put all \N escapes inside eval'd raw strings,
    # to make sure this script runs even if the compiler
    # chokes on \N escapes

test_named_sequences_full() checks that unicodedata.lookup() works,


OK, that change would then have checked unicodedata.lookup() twice.

However I'm puzzled by the fact that the "\N{}" escape sequence is 
supposed to raise a SyntaxError. And indeed it does in some cases:


```
Python 3.8.5 (default, Jul 21 2020, 10:48:26)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.lookup("DIGIT ZERO")
'0'
>>> "\N{DIGIT ZERO}"
'0'
>>> "\N{EURO SIGN}"
'€'
>>> unicodedata.lookup("EURO SIGN")
'€'
>>> unicodedata.lookup("KEYCAP NUMBER SIGN")
'#️⃣'
>>> "\N{KEYCAP NUMBER SIGN}"
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-21: unknown Unicode character name
>>> unicodedata.lookup("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")
'Ā̀'
>>> "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-47: unknown Unicode character name
```


It seems that unicodedata.lookup() honors "Code point sequences", but 
\N{} does not.


Indeed 
https://docs.python.org/3/library/unicodedata.html#unicodedata.lookup

mentions that fact:

   Changed in version 3.3: Support for name aliases and named sequences 
has been added.


https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

doesn't mention anything. It simply states

Escape Sequence: \N{name}
Meaning: Character named name in the Unicode database

with the footnote "Changed in version 3.3: Support for name aliases has 
been added.".


Which leads to the question:

Should \N{} be updated to support "Code point sequences"?

Furthermore it states: "Unlike Standard C, all unrecognized escape 
sequences are left in the string unchanged", which could be interpreted 
as meaning that "\N{BAD}" results in "\\N{BAD}".
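The inconsistency between the two kinds of escape sequences can be verified directly (using `eval` on raw string literals only so the escapes survive being written down here):

```python
import warnings

# An unrecognized single-character escape is left in the string unchanged
# (with a DeprecationWarning/SyntaxWarning in recent Pythons):
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    s = eval(r'"\q"')
assert s == "\\q" and len(s) == 2

# An unknown name in \N{...}, however, is a hard SyntaxError rather than
# being left in the string unchanged:
try:
    eval(r'"\N{THIS CHARACTER DOES NOT EXIST}"')
except SyntaxError:
    pass
else:
    raise AssertionError("expected SyntaxError")
```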



but that checkletter() raises a SyntaxError. Look at the code ;-)


That would have helped. ;)


Victor


Servus,
   Walter
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/RNZUBXZ3WGIQ57CONGFEVEPM4NFS5CWW/


[Python-Dev] Re: [Python-checkins] bpo-41944: No longer call eval() on content received via HTTP in the UnicodeNames tests (GH-22575)

2020-10-06 Thread Walter Dörwald

On 6 Oct 2020, at 16:22, Florian Bruhin wrote:


https://github.com/python/cpython/commit/a8bf44d04915f7366d9f8dfbf84822ac37a4bab3
commit: a8bf44d04915f7366d9f8dfbf84822ac37a4bab3
branch: master
author: Florian Bruhin 
committer: GitHub 
date: 2020-10-06T16:21:56+02:00
summary:

bpo-41944: No longer call eval() on content received via HTTP in the 
UnicodeNames tests (GH-22575)


Similarly to GH-22566, those tests called eval() on content received 
via
HTTP in test_named_sequences_full. This likely isn't exploitable 
because

unicodedata.lookup(seqname) is called before self.checkletter(seqname,
None) - thus any string which isn't a valid unicode character name
wouldn't ever reach the checkletter method.

Still, it's probably better to be safe than sorry.

files:
M Lib/test/test_ucn.py
[...]
 # Helper that put all \N escapes inside eval'd raw strings,
 # to make sure this script runs even if the compiler
 # chokes on \N escapes
-res = eval(r'"\N{%s}"' % name)
+res = ast.literal_eval(r'"\N{%s}"' % name)
 self.assertEqual(res, code)
 return res


It would be even simpler to use unicodedata.lookup() which returns the 
unicode character when passed the name of the character, e.g.



>>> unicodedata.lookup("NO-BREAK SPACE")
'\xa0'

Servus,
   Walter
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/3S7BHOFRG3KYYXQUBGZBFTDIDN2IHG3M/


[Python-Dev] Re: PEP 622 (match statement) playground

2020-07-02 Thread Walter Dörwald

On 1 Jul 2020, at 18:54, Brandt Bucher wrote:


Walter Dörwald wrote:
This looks strange to me. In all other cases of variable lookup the 
global variable z would be found.


The next case assigns to z, making z local to whereis. This is 
consistent with python's existing scoping rules (for example, try 
rewriting this as the equivalent if-elif chain and you'll get the same 
error). It sounds like you want to add "global z" to the top of the 
function definition.



whereis(23) however works.


This branch is hit before the unbound local lookup is attempted.


OK, understood.

However I still find the rule "dotted names are looked up" and "undotted 
names are matched" surprising and "case .constant" ugly.


A way to solve this would be to use "names at the top level are always 
looked up".


With this the constant value pattern:

case .name:

could be written as:

case name:

The capture pattern (of which there can only be one anyway) could then 
be written as:


case object(name):

instead of

case name:

Or we could use "matches are always done against a match object", i.e. 
the code from the example would look like this:


from dataclasses import dataclass

@dataclass
class Point:
    x: int
    y: int

z = 42

def whereis(point):
    w = 23
    match point as m:
        case Point(0, 0):
            print("Origin")
        case Point(0, m.y):
            print(f"Y={m.y}")
        case Point(m.x, 0):
            print(f"X={m.x}")
        case Point():
            print("Somewhere else")
        case w:
            print("Not the answer")
        case z:
            print("The answer")
        case object(z):
            print(f"{z!r} is not a point")

Servus,
   Walter
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/SPMTCDIRJJGCKSIPWE4EX6DSIBFXLAL4/


[Python-Dev] Re: PEP 622 (match statement) playground

2020-07-01 Thread Walter Dörwald

On 1 Jul 2020, at 17:58, Guido van Rossum wrote:


If you are interested in learning more about how PEP 622 would work in
practice, but don't feel like compiling a Python 3.10 fork from 
source,

here's good news for you.

In a hurry?
https://mybinder.org/v2/gh/gvanrossum/patma/master?urlpath=lab/tree/playground-622.ipynb


If I change the example code to:

---
from dataclasses import dataclass

@dataclass
class Point:
    x: int
    y: int

z = 42

def whereis(point):
    w = 23
    match point:
        case Point(0, 0):
            print("Origin")
        case Point(0, y):
            print(f"Y={y}")
        case Point(x, 0):
            print(f"X={x}")
        case Point():
            print("Somewhere else")
        case .w:
            print("Not the answer")
        case .z:
            print("The answer")
        case z:
            print("Not a point")
---

whereis(42)


gives me:

---
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-…> in <module>
----> 1 whereis(42)

<ipython-input-…> in whereis(point)
     10 def whereis(point):
     11     w = 23
     12     match point:
     13         case Point(0, 0):
     14             print("Origin")
     15         case Point(0, y):
     16             print(f"Y={y}")
     17         case Point(x, 0):
     18             print(f"X={x}")
     19         case Point():
     20             print("Somewhere else")
     21         case .w:
     22             print("Not the answer")
---> 23         case .z:
     24             print("The answer")
     25         case z:
     26             print("Not a point")

UnboundLocalError: local variable 'z' referenced before assignment
---

This looks strange to me. In all other cases of variable lookup the 
global variable z would be found.


whereis(23) however works.


[...]


Servus,
   Walter
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/6OLWD67D7EFBM6UHGQHZHS7MM22QE4N2/


[Python-Dev] Re: PEP 616 -- String methods to remove prefixes and suffixes

2020-03-25 Thread Walter Dörwald

On 25 Mar 2020, at 9:48, Stephen J. Turnbull wrote:


Walter Dörwald writes:


A `find()` that supports multiple search strings (and returns the
leftmost position where a search string can be found) is a great help 
in

implementing some kind of tokenizer:


In other words, you want the equivalent of Emacs's "(search-forward
(regexp-opt list-of-strings))", which also meets the requirement of
returning which string was found (as "(match-string 0)").


Sounds like it. I'm not familiar with Emacs.


Since Python already has a functionally similar API for regexps, we
can add a regexp-opt (with appropriate name) method to re, perhaps as
.compile_string_list(), and provide a convenience function
re.search_string_list() for your application.


If you're using regexps anyway, building the appropriate or-expression 
shouldn't be a problem. I guess that's what most lexers/tokenizers do 
anyway.



I'm applying practicality before purity, of course.  To some extent
we want to encourage simple string approaches, and putting this in
regex is not optimal for that.


Exactly. I'm always a bit hesitant when using regexps, if there's a 
simpler string approach.



Steve


Servus,
   Walter
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/46KMMKYHW7DIDNZFO27GNQCJVILNSQ6Q/


[Python-Dev] Re: PEP 616 -- String methods to remove prefixes and suffixes

2020-03-24 Thread Walter Dörwald

On 24 Mar 2020, at 2:42, Steven D'Aprano wrote:


On Sun, Mar 22, 2020 at 10:25:28PM -, Dennis Sweeney wrote:


Changes:
- More complete Python implementation to match what the type 
checking in the C implementation would be

- Clarified that returning ``self`` is an optimization
- Added links to past discussions on Python-Ideas and Python-Dev
- Specified ability to accept a tuple of strings


I am concerned about that tuple of strings feature.
[...]
Aside from those questions about the reference implementation, I am
concerned about the feature itself. No other string method that 
returns

a modified copy of the string takes a tuple of alternatives.

* startswith and endswith do take a tuple of (pre/suff)ixes, but they
  don't return a modified copy; they just return a True or False flag;

* replace does return a modified copy, and only takes a single
  substring at a time;

* find/index/partition/split etc don't accept multiple substrings
  to search for.

That makes startswith/endswith the unusual ones, and we should be
conservative before emulating them.


Actually I would like for other string methods to gain the ability to 
search for/chop off multiple substrings too.


A `find()` that supports multiple search strings (and returns the 
leftmost position where a search string can be found) is a great help in 
implementing some kind of tokenizer:


```python
def tokenize(source, delimiter):
    lastpos = 0
    while True:
        pos = source.find(delimiter, lastpos)
        if pos == -1:
            token = source[lastpos:].strip()
            if token:
                yield token
            break
        else:
            token = source[lastpos:pos].strip()
            if token:
                yield token
            yield source[pos]
            lastpos = pos + 1

print(list(tokenize(" [ 1, 2, 3] ", ("[", ",", "]"))))
```

This would output `['[', '1', ',', '2', ',', '3', ']']` if `str.find()` 
supported multiple substrings.


Of course to be really usable `find()` would have to return **which** 
substring was found, which would make the API more complicated (and 
somewhat incompatible with the existing `find()`).
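A sketch of what such an extended `find()` could report (`multifind` is a hypothetical helper, not an existing str method): the leftmost match position together with the needle that matched there.

```python
def multifind(s, needles, start=0):
    """Hypothetical multi-substring find(): return (pos, needle) for the
    leftmost occurrence of any needle, or (-1, None) if none matches."""
    best = (-1, None)
    for needle in needles:
        pos = s.find(needle, start)
        if pos != -1 and (best[0] == -1 or pos < best[0]):
            best = (pos, needle)
    return best

assert multifind("a, b; c", (";", ",")) == (1, ",")   # "," wins: leftmost
assert multifind("abc", (";", ",")) == (-1, None)
```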


But for `cutprefix()` (or whatever it's going to be called), I'm +1 on 
supporting multiple prefixes. For ambiguous cases, IMHO the most 
straightforward option would be to chop off the first prefix found.
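A sketch of those semantics (a hypothetical helper; the method that eventually landed in Python 3.9 as `str.removeprefix()` accepts only a single string):

```python
def cutprefix(s, prefixes):
    """Hypothetical multi-prefix variant: chop off the first prefix
    (in tuple order) that matches; return s unchanged otherwise."""
    for prefix in prefixes:
        if s.startswith(prefix):
            return s[len(prefix):]
    return s

assert cutprefix("foobar", ("foo", "foob")) == "bar"   # first match wins
assert cutprefix("spam", ("foo",)) == "spam"
```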



[...]


Servus,
   Walter
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/3MYYK6AINVTVCNVYC53FEB4T3LQGPWSC/


[Python-Dev] Re: PEP 616 -- String methods to remove prefixes and suffixes

2020-03-21 Thread Walter Dörwald

On 21 Mar 2020, at 19:09, Steven D'Aprano wrote:


On Sat, Mar 21, 2020 at 12:15:21PM -0400, Eric V. Smith wrote:

On 3/21/2020 11:20 AM, Ned Batchelder wrote:



Why be so prescriptive? The semantics of these functions should be
about what the resulting string contains.  Leave it to implementors 
to

decide when it is OK to return self or not.


I agree with Ned -- whether the string object is returned unchanged or 
a

copy is an implementation decision, not a language decision.


[Eric]

The only reason I can think of is to enable the test above: did a
suffix/prefix removal take place? That seems like a useful thing.


We don't make this guarantee about string identity for any other 
string

method, and CPython's behaviour varies from method to method:

py> s = 'a b c'
py> s is s.strip()
True
py> s is s.lower()
False

and version to version:

py> s is s.replace('a', 'a')  # 2.7
False
py> s is s.replace('a', 'a')  # 3.5
True


And it is different for string subclasses, because the method always 
returns an instance of the baseclass:



>>> class str2(str):
...     pass
...
>>> isinstance(str2('a b c').strip(), str2)
False
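A subclass that wants to keep its own type through such methods has to re-wrap the results itself, e.g.:

```python
class str2(str):
    # str methods return plain str instances even for subclasses,
    # so preserve the subclass explicitly by re-wrapping the result.
    def strip(self, chars=None):
        return type(self)(super().strip(chars))

assert isinstance(str2(" a b c ").strip(), str2)
assert str2(" a b c ").strip() == "a b c"
```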

Servus,
   Walter
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/JNIVR6IZAG7GEDREHCEHD25KANJDTR3C/


[Python-Dev] Re: More feedback on PEP 611, please

2019-12-11 Thread Walter Dörwald

On 11 Dec 2019, at 12:12, Mark Shannon wrote:


Hi everyone,

Thanks for all your feedback so far.

Previously I asked for "more precise" feedback, which has been 
interpreted as evidence backed feedback. That's not what I meant.

My fault for not being clearer.

Your opinions without any justifications are welcome, but I need 
precision.


The PEP states:

"""
Reference Implementation

None, as yet. This will be implemented in CPython, once the PEP has been 
accepted.

"""

A Python that implements PEP 611 might have advantages and 
disadvantages. The disadvantages we can clearly see: Things that were 
possible before will no longer be possible. But whether the PEP has any 
advantages is unknown. So an implementation might have to come first.



[...]

Almost any performance gain, even 0.1% is worth the, IMO, slight 
inconvenience of 1 million limits.
The reason I believe this is that a 0.1% speedup benefits all Python 
applications and libraries everywhere and forever, whereas the 
inconvenience will be felt by a handful of developers, very rarely.


A 0.1% performance gain means that a script that runs for an hour will 
finish 4 seconds earlier.



[...]


Servus,
   Walter
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/3I6MAAGNIWPH7PHKVBKBTJUTQOGND7IX/


Re: [Python-Dev] PEP 594: Removing dead batteries from the standard library

2019-05-21 Thread Walter Dörwald

On 20 May 2019, at 22:15, Christian Heimes wrote:


Hi,

here is the first version of my PEP 594 to deprecate and eventually 
remove modules from the standard library. The PEP started last year 
with talk during the Python Language Summit 2018, 
https://lwn.net/Articles/755229/.


[...]

colorsys


The `colorsys `_ 
module
defines color conversion functions between RGB, YIQ, HSL, and HSV 
coordinate

systems. The Pillow library provides much faster conversion between 
color systems.

Module type
  pure Python
Deprecated in
  3.8
To be removed in
  3.10
Substitute
  `Pillow `_,
  `colorspacious `_


I'm using colorsys constantly as the basis for a tool that converts CSS 
colors between different coordinate systems. I don't see how that could 
be done via Pillow (which AFAICT only converts complete images). 
RGB<->HSV<->HLS conversion seems to be not available (or not obvious) in 
colorspacious.
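The kind of per-color conversion meant here is a couple of lines with colorsys, which works on float components in the 0.0-1.0 range, no image required:

```python
import colorsys

# Round-trip a single color between RGB and HSV coordinates.
h, s, v = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)   # pure red
assert (h, s, v) == (0.0, 1.0, 1.0)
assert colorsys.hsv_to_rgb(h, s, v) == (1.0, 0.0, 0.0)  # round-trips exactly
```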


colorsys is a module where we can be pretty sure that it has zero bugs, 
and doesn't require any maintenance or security updates, so I don't see 
any reason to deprecate it.



[...]


Servus,
   Walter


Re: [Python-Dev] ctypes: is it intentional that id() is the only way to get the address of an object?

2019-01-18 Thread Walter Dörwald

On 18 Jan 2019, at 11:57, Antoine Pitrou wrote:


On Fri, 18 Jan 2019 03:00:54 +
MRAB  wrote:

On 2019-01-18 00:48, Gregory P. Smith wrote:
I've heard that libraries using ctypes, cffi, or cython code of 
various

sorts in the real world wild today does abuse the unfortunate side
effect of CPython's implementation of id(). I don't have specific
instances of this in mind but trust what I've heard: that it is 
happening.


id() should never be considered to be the PyObject*.  In as much as 
code
shouldn't assume it is running on top of a specific CPython 
implementation.
If there is a _need_ to get a pointer to a C struct handle 
referencing a
CPython C API PyObject, we should make an explicit API for that 
rather
than the id() hack.  That way code can be explicit about its need, 
and
code that is just doing a funky form of identity tracking without 
using

is and is not can continue using id() without triggering regressive
behavior on VMs that don't have a CPython compatible PyObject under 
the

hood by default.

[who uses id() anyways?]


I use it in some of my code.

If I want to cache some objects, I put them in a dict, using the id 
as

the key. If I wanted to locate an object in a cache and didn't have
id(), I'd have to do a linear search for it.


Indeed.  I've used it for the same purpose in the past 
(identity-dict).


Its useful in all situations where you do topology preserving 
transformations, for example pickling (i.e. object serialization) or a 
deep copy of some object structures.


In these cases you need a way to record and quickly detect whether 
you've handled a specific object before. In Python we can do that with a 
dictionary that has object ids as keys. Java provides IdentityHashMap 
for that. Javascript provides neither, so deep-copying objects in 
Javascript seems to be impossible.
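A minimal identity-keyed mapping along those lines (a sketch; note that it must keep a reference to the key object alongside the value, since an id() may be reused once the original object is garbage-collected):

```python
class IdentityDict:
    """Minimal identity-keyed mapping: keys are compared by object
    identity via id(), not by equality."""
    def __init__(self):
        self._data = {}
    def __setitem__(self, key, value):
        # Store the key itself too, keeping it alive so its id() stays valid.
        self._data[id(key)] = (key, value)
    def __getitem__(self, key):
        return self._data[id(key)][1]
    def __contains__(self, key):
        return id(key) in self._data

seen = IdentityDict()
a, b = [1, 2], [1, 2]      # equal, but distinct objects
seen[a] = "first"
assert a in seen and b not in seen
assert seen[a] == "first"
```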



Regards

Antoine.


Servus,
   Walter


Re: [Python-Dev] bpo-34595: How to format a type name?

2018-09-13 Thread Walter Dörwald

On 13 Sep 2018, at 2:33, Victor Stinner wrote:


Hi,

For the type name, sometimes, we only get a type (not an instance),
and we want to format its FQN. IMHO we need to provide ways to format
the FQN of a type for *types* and for *instances*. Here is my
proposal:

* Add !t conversion to format string
* Add ":T" format to type.__format__()
* Add "%t" and "%T" formatters to PyUnicode_FromUnicodeV()


As far as I can remember, the distinction between lowercase and 
uppercase format letters for PyUnicode_FromFormatV() and friends was: 
lowercase letters are for formatting C types (like `char *` etc.) and 
uppercase format letters are for Python types (i.e. the C type is 
`PyObject *`). IMHO we should keep that distinction.



* Add a read-only type.__fqn__ property


I like that.


# Python: "!t" for instance
raise TypeError(f"must be str, not {obj!t}")

/* C: "%t" for instance */
PyErr_Format(PyExc_TypeError, "must be str, not %t", obj);


/* C: "%T" for type */
PyErr_Format(PyExc_TypeError, "must be str, not %T", mytype);

# Python: ":T" for type
raise TypeError(f"must be str, not {mytype!T}")


We could solve the problem with instances and classes by adding two new 
! operators to str.format/f-strings and making them chainable. The !t 
operator would get the class of the argument and the !c operator would 
require a class argument and would convert it to its name (which is 
obj.__module__ + "." + obj.__qualname__ (or only obj.__qualname__ for 
builtin types)). So:


   >>> import pathlib
   >>> p = pathlib.Path("spam.py")
   >>> print(f"{pathlib.Path}")

   >>> print(f"{pathlib.Path!c}")
pathlib.Path
   >>> print(f"{pathlib.Path!c!r}")
'pathlib.Path'
   >>> print(f"{p!t}")

   >>> print(f"{p!t!c}")
pathlib.Path
   >>> print(f"{p!c}")
Traceback (most recent call last):
  File "", line 1, in 
TypeError: object is not a class

This would also give us:

   >>> print(f"{p!s!r}")
'spam.py'

Which is different from:

   >>> print(f"{p}")
spam.py
   >>> print(f"{p!r}")
PosixPath('spam.py')
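Until such a conversion exists, the naming rule described above
(obj.__module__ + "." + obj.__qualname__, with the module omitted for
builtins) can be written as a plain helper; the function name `fqn` is
made up:

```python
def fqn(cls):
    """Return the fully qualified name of a class, following the
    __module__ + "." + __qualname__ rule, with the module dropped
    for builtin types.  (The name `fqn` is illustrative only.)"""
    if not isinstance(cls, type):
        raise TypeError("object is not a class")
    if cls.__module__ in (None, "builtins"):
        return cls.__qualname__
    return f"{cls.__module__}.{cls.__qualname__}"
```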


Open question: Should we also add "%t" and "%T" formatters to the str
% args operator at the Python level?

I have a proof-of-concept implementation:
https://github.com/python/cpython/pull/9251

Victor


Servus,
   Walter


Re: [Python-Dev] PEP 554 v3 (new interpreters module)

2017-09-26 Thread Walter Dörwald

On 23 Sep 2017, at 3:09, Eric Snow wrote:


[...]

``list_all()``::

   Return a list of all existing interpreters.


See my naming proposal in the previous thread.


Sorry, your previous comment slipped through the cracks.  You 
suggested:


As for the naming, let's make it both unconfusing and explicit?
How about three functions: `all_interpreters()`, `running_interpreters()`,
and `idle_interpreters()`, for example?

As to "all_interpreters()", I suppose it's the difference between
"interpreters.all_interpreters()" and "interpreters.list_all()".  To
me the latter looks better.


But in most cases when Python returns a container (list/dict/iterator) 
of things, the name of the function/method is the name of the things, 
not the name of the container, i.e. we have sys.modules, dict.keys, 
dict.values etc.. Or if the collection of things itself has a name, it 
is that name, i.e. os.environ, sys.path etc.


It's a little unfortunate that the name of the module would be the 
same as the name of the function, but IMHO interpreters() would be 
better than list().



As to "running_interpreters()" and "idle_interpreters()", I'm not sure
what the benefit would be.  You can compose either list manually with
a simple comprehension:

    [interp for interp in interpreters.list_all() if interp.is_running()]
    [interp for interp in interpreters.list_all() if not interp.is_running()]


Servus,
   Walter


Re: [Python-Dev] Make stacklevel=2 by default in warnings.warn()

2015-09-21 Thread Walter Dörwald

On 21 Sep 2015, at 9:18, Victor Stinner wrote:


2015-09-20 8:44 GMT+02:00 Serhiy Storchaka <storch...@gmail.com>:

I propose to make the default value of stacklevel to be 2.
I think that unlikely this will break existing code.


Consider this simple script:
---
import warnings
warnings.warn("here")
---

Current output:
---
x.py:3: UserWarning: here
warnings.warn("here")
---

=> it shows the script name (x.py), the line number and the line, as 
expected.


Now try stacklevel=2:
---
import warnings
warnings.warn("here", stacklevel=2)
---

New output:
---
sys:1: UserWarning: here
---

"sys:1" is not really useful :-/

I would describe this as a regression, not an enhancement.

It's hard to find a "good" default value. It's better to always
specify stacklevel :-)


A "dynamic" stacklevel might help. Normally when you implement a call to 
warning.warn() inside a module, you want to report the innermost 
stacklevel that is outside of your module, because that's the spot where 
the error likely is. This could be done automatically (report the first 
module that is different from the one where the call to warning.warn() 
is), or by specifying a base package name or regular expression, i.e. 
report the innermost stackframe that is not from 
"mypackage.mysubpackage").


See http://bugs.python.org/issue850482

Bye,
   Walter Dörwald


Re: [Python-Dev] PEP 492: async/await in Python; v3

2015-04-28 Thread Walter Dörwald

On 28 Apr 2015, at 5:07, Yury Selivanov wrote:


Hi python-dev,

Another round of updates.  Reference implementation
has been updated: https://github.com/1st1/cpython/tree/await
(includes all things from the below summary of updates +
tests).

[...]
New Coroutine Declaration Syntax


The following new syntax is used to declare a coroutine::

 async def read_data(db):
 pass

Key properties of coroutines:

* ``async def`` functions are always coroutines, even if they do not
contain ``await`` expressions.

* It is a ``SyntaxError`` to have ``yield`` or ``yield from``
expressions in an ``async`` function.


Does this mean it's not possible to implement an async version of 
os.walk() if we had an async version of os.listdir()?


I.e. for async code we're back to implementing iterators by hand 
instead of using generators for it.
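For the record, this limitation was later lifted: PEP 525 (Python 3.6)
made ``yield`` legal inside ``async def``, creating asynchronous
generators. An async walker can then be sketched like this, where the
awaitable ``listdir`` argument is an assumption standing in for the
hypothetical async version of os.listdir():

```python
import os

async def async_walk(top, listdir):
    """Yield (dirpath, names) pairs depth-first, like a minimal os.walk().

    `listdir` is an assumed awaitable: it takes a directory path and
    returns a list of entry names.
    """
    names = await listdir(top)
    yield top, names
    for name in names:
        path = os.path.join(top, name)
        if os.path.isdir(path):
            # Recurse into subdirectories via the async generator itself.
            async for item in async_walk(path, listdir):
                yield item
```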



[...]


Servus,
   Walter


Re: [Python-Dev] PEP 448 (almost finished!) — Question regarding test_ast

2015-01-23 Thread Walter Dörwald

On 22 Jan 2015, at 23:03, Neil Girdhar wrote:

Thanks for taking a look.  I looked at inspect and I can't see anything
that needs to change since it's the caller rather than the receiver who
has more options after this PEP.


You are probably right. And for calling via Signature.bind() your patch 
takes care of business.



Did you see anything in particular?


No, I was just using inspect.signature lately and reading the PEP 
reminded me of it.



Best,

Neil


Servus,
   Walter

On Thu, Jan 22, 2015 at 12:23 PM, Walter Dörwald wal...@livinglogic.de wrote:


On 20 Jan 2015, at 17:34, Neil Girdhar wrote:


My question first:
test_ast is mostly generated code, but I can't find where it is being
generated.  I am pretty sure I know how to fix most of the introduced
problems.  Who is generating test_ast??

Update:

So far, I've done the following:

Updated the patch to 3.5
Fixed the grammar to accept final commas in argument lists always, and to
work with the already implemented code.
Fixed the ast to accept what it needs to accept and reject according to
the limitation laid down by Guido.
Fixed the parsing library.

Fixed these tests:
test_ast.py
test_extcall.py
test_grammar.py
test_syntax.py
test_unpack_ex.py

As far as I can tell, all I have left is to fix test_ast and possibly
write some more tests (there are already some new tests and some of the
old negative tests expecting SyntaxError are now positive tests).


inspect.signature might need an update.

Servus,
 Walter




Re: [Python-Dev] PEP 448 (almost finished!) — Question regarding test_ast

2015-01-22 Thread Walter Dörwald
On 20 Jan 2015, at 17:34, Neil Girdhar wrote:

 My question first:
 test_ast is mostly generated code, but I can't find where it is being
 generated.  I am pretty sure I know how to fix most of the introduced
 problems.  Who is generating test_ast??

 Update:

 So far, I've done the following:

 Updated the patch to 3.5
 Fixed the grammar to accept final commas in argument lists always, and to
 work with the already implemented code.
 Fixed the ast to accept what it needs to accept and reject according to the
 limitation laid down by Guido.
 Fixed the parsing library.

 Fixed these tests:
 test_ast.py
 test_extcall.py
 test_grammar.py
 test_syntax.py
 test_unpack_ex.py

 As far as I can tell, all I have left is to fix test_ast and possibly write
 some more tests (there are already some new tests and some of the old
 negative tests expecting SyntaxError are now positive tests).

inspect.signature might need an update.

Servus,
   Walter


Re: [Python-Dev] Bytes path related questions for Guido

2014-08-29 Thread Walter Dörwald

On 28 Aug 2014, at 19:54, Glenn Linderman wrote:


On 8/28/2014 10:41 AM, R. David Murray wrote:
On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman 
v+pyt...@g.nevcal.com wrote:

[...]
Also for
cases where the data stream is *supposed* to be in a given encoding, 
but
contains undecodable bytes.  Showing the stuff that incorrectly 
decodes

as whatever it decodes to is generally what you want in that case.
Sure, people can learn to recognize mojibake for what it is, and maybe 
even learn to recognize it for what it was intended to be, in limited 
domains. But suppressing/replacing the surrogates doesn't help with 
that... would it not be better to replace the surrogates with an 
escape sequence that shows the original, undecodable, byte value?  
Like  \xNN ?


For that we could extend the backslashreplace codec error callback, so 
that it can be used for decoding too, not just for encoding. I.e.


   >>> b"a\xffb".decode("utf-8", "backslashreplace")

would return

   'a\\xffb'
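This suggestion was in fact implemented: since Python 3.5,
backslashreplace works as a decoding error handler as well as an
encoding one:

```python
raw = b"a\xffb"

# Decoding: the undecodable byte 0xff is rendered as the escape \xff.
assert raw.decode("utf-8", "backslashreplace") == "a\\xffb"

# Encoding direction, which backslashreplace supported all along.
assert "a\xffb".encode("ascii", "backslashreplace") == b"a\\xffb"
```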

Servus,
   Walter


Re: [Python-Dev] Updates to PEP 471, the os.scandir() proposal

2014-07-09 Thread Walter Dörwald

On 8 Jul 2014, at 15:52, Ben Hoyt wrote:


Hi folks,

After some very good python-dev feedback on my first version of PEP
471, I've updated the PEP to clarify a few things and added various
Rejected ideas subsections. Here's a link to the new version (I've
also copied the full text below):

http://legacy.python.org/dev/peps/pep-0471/ -- new PEP as HTML
http://hg.python.org/peps/rev/0da4736c27e8 -- changes

[...]
Rejected ideas
==

[...]
Return values being pathlib.Path objects


With Antoine Pitrou's new standard library ``pathlib`` module, it
at first seems like a great idea for ``scandir()`` to return instances
of ``pathlib.Path``. However, ``pathlib.Path``'s ``is_X()`` and
``lstat()`` functions are explicitly not cached, whereas ``scandir``
has to cache them by design, because it's (often) returning values
from the original directory iteration system call.

And if the ``pathlib.Path`` instances returned by ``scandir`` cached
lstat values, but the ordinary ``pathlib.Path`` objects explicitly
don't, that would be more than a little confusing.

Guido van Rossum explicitly rejected ``pathlib.Path`` caching lstat in
the context of scandir `here
<https://mail.python.org/pipermail/python-dev/2013-November/130583.html>`_,
making ``pathlib.Path`` objects a bad choice for scandir return
values.


Can we at least make sure that attributes of DirEntry that have the same 
meaning as attributes of pathlib.Path have the same name?
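As it turned out, the os.scandir() API that landed in Python 3.5 did
align the shared names: both os.DirEntry and pathlib.Path offer .name,
.is_dir(), .is_file(), .is_symlink() and .stat(). A quick consistency
check (the helper name is made up):

```python
import os
import pathlib

def attrs_agree(directory):
    """Return True if DirEntry and pathlib.Path agree on their
    identically named attributes for every entry in `directory`."""
    with os.scandir(directory) as entries:
        return all(
            entry.name == pathlib.Path(entry.path).name
            and entry.is_dir() == pathlib.Path(entry.path).is_dir()
            for entry in entries
        )
```

Note that the semantics still differ as the PEP describes: DirEntry
caches its answers from the directory scan, while Path queries the
filesystem each time.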



[...]


Servus,
   Walter


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Walter Dörwald

On 28 Jun 2014, at 21:48, Ben Hoyt wrote:


[...]

Crazy idea: would it be possible to convert a DirEntry object to a
pathlib.Path object without losing the cache? I guess that
pathlib.Path expects a full  stat_result object.


The main problem is that pathlib.Path objects explicitly don't cache
stat info (and Guido doesn't want them to, for good reason I think).
There's a thread on python-dev about this earlier. I'll add it to a
Rejected ideas section.


However, it would be bad to have two implementations of the concept of 
filename with different attribute and method names.


The best way to ensure compatible APIs would be if one class was derived 
from the other.



[...]


Servus,
   Walter


Re: [Python-Dev] Subclasses vs. special methods

2014-01-06 Thread Walter Dörwald

On 04.01.14 13:58, Serhiy Storchaka wrote:


Should implicit converting an instance of int, float, complex, str,
bytes, etc subclasses to call appropriate special method __int__ (or
__index__), __float__, __complex__, __str__, __bytes__, etc? Currently
explicit converting calls these methods, but implicit converting doesn't.


>>> class I(int):
...     def __int__(self): return 42
...     def __index__(self): return 43
...
>>> class F(float):
...     def __float__(self): return 42.0
...
>>> class S(str):
...     def __str__(self): return '*'
...
>>> int(I(65))
42
>>> float(F(65))
42.0
>>> str(S('A'))
'*'
>>> chr(I(65))
'A'
>>> import cmath; cmath.rect(F(65), 0)
(65+0j)
>>> ord(S('A'))
65

Issue17576 [1] proposes to call special methods for implicit converting.
I have doubts about this.


Note that for explicit conversion this was implemented a long time ago. 
See this ancient thread about str/unicode subclasses and 
__str__/__unicode__:


   https://mail.python.org/pipermail/python-dev/2005-January/051175.html

And this bug report:

   http://bugs.python.org/issue1109424


[...]


Servus,
   Walter



Re: [Python-Dev] How long the wrong type of argument should we limit (or not) in the error message (C-api)?

2013-12-16 Thread Walter Dörwald

On 15.12.13 17:33, Ethan Furman wrote:

On 12/14/2013 07:51 PM, Steven D'Aprano wrote:

On Sun, Dec 15, 2013 at 11:25:10AM +1000, Nick Coghlan wrote:


Oh, yes, a %T shortcut for length limited type name of the supplied
object would be brilliant. We need this frequently for C level error
messages, and I almost always have to look at an existing example to
remember the exact incantation :)


What are the chances that could be made available from pure Python too?
Having to extract the name of the type is a very common need for error
messages, and I never know whether I ought to write type(obj).__name__
or obj.__class__.__name__. A %T and/or {:T} format code could be the One
Obvious Way to include the type name in strings


+1


I'd vote for including the module name in the string and using 
__qualname__ instead of __name__, i.e. make "{:T}".format(obj) 
equivalent to 
"{0.__class__.__module__}.{0.__class__.__qualname__}".format(obj).
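A sketch of what such a ":T" format spec could look like, via a
string.Formatter subclass. The class name and the exact spec semantics
(":T" consuming the whole spec, builtins printed without a module
prefix) are assumptions:

```python
import string

class TypeNameFormatter(string.Formatter):
    """Illustrative ':T' spec: format the qualified name of the
    argument's type; all other specs fall through unchanged."""

    def format_field(self, value, spec):
        if spec == "T":
            cls = type(value)
            if cls.__module__ in (None, "builtins"):
                return cls.__qualname__
            return f"{cls.__module__}.{cls.__qualname__}"
        return super().format_field(value, spec)
```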


Servus,
   Walter



Re: [Python-Dev] Which direction is UnTransform? / Unicode is different

2013-11-20 Thread Walter Dörwald

On 20.11.13 02:28, Jim J. Jewett wrote:


[...]
Instead of relying on introspection of .decodes_to and .encodes_to, it
would be useful to have charsetcodecs and tranformcodecs as entirely
different modules, with their own separate registries.  I will even
note that the existing help(codecs) seems more appropriate for
charsetcodecs than it does for the current conjoined module.


I don't understand how a registry of transformation functions would 
simplify code. Without the transform() method I would write:


>>> import binascii
>>> binascii.hexlify(b'foo')
b'666f6f'

With the transform() method I should be able to write:

b'foo'.transform("hex")

However how does the hex transformer get registered in the registry? If 
the hex transformer is not part of the stdlib, there must be some code 
that does the registration, but to get that code to execute, I'd have to 
import a module, so we're back to square one, as I'd have to write:


import hex_transformer
b'foo'.transform("hex")

A way around this would be some kind of import magic, but is this really 
neccessary to be able to avoid one import statement?


Furthermore different transformation functions might have different 
additional options. Supporting those is simple when we have simple 
transformation functions: The functions has arguments, and those are 
documented where the function is documented. If we want to support 
custom options for the .transform() method, transform() would have to 
pass along *args, **kwargs to the underlying transformer. However this 
is difficult to document in a way that makes it easy to find which 
options exist for a particular transformer.
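For comparison, the spelling that eventually prevailed avoids both a new
method and a new registry: Python 3.4 re-exposed the binary transforms
through codecs.encode()/codecs.decode() and restored the "hex" alias:

```python
import binascii
import codecs

# The plain-function spelling: the module owning the transformation
# also documents its own options.
assert binascii.hexlify(b"foo") == b"666f6f"

# The type-neutral codecs spelling, available since Python 3.4.
assert codecs.encode(b"foo", "hex") == b"666f6f"
assert codecs.decode(b"666f6f", "hex") == b"foo"
```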


Servus,
   Walter



Re: [Python-Dev] [Python-checkins] cpython: Close #17828: better handling of codec errors

2013-11-19 Thread Walter Dörwald

On 15.11.13 00:02, Greg Ewing wrote:


Walter Dörwald wrote:

Unfortunaty the frame from the decorator shows up in the traceback.


Maybe the decorator could remove its own frame from
the traceback?


True, this could be done via either an additional attribute on the 
frame, or a special value for frame.f_annotation.


Would we want to add frame annotations to every function call in the 
Python stdlib? Certainly not. So which functions would get annotations 
and which ones won't?


When we have many annotations, doing it with a decorator might be a 
performance problem, as each function call goes through another stack level.


Is there any other way to implement it?

Servus,
   Walter




Re: [Python-Dev] Add transform() and untranform() methods

2013-11-15 Thread Walter Dörwald
Am 15.11.2013 um 16:57 schrieb Stephen J. Turnbull step...@xemacs.org:
 
 Walter Dörwald writes:
 Am 15.11.2013 um 00:42 schrieb Serhiy Storchaka storch...@gmail.com:
 
 15.11.13 00:32, Victor Stinner написав(ла):
 And add transform() and untransform() methods to bytes and str types.
 In practice, it might be same codecs registry for all codecs just with
 a new attribute.
 
 If the transform() method will be added, I prefer to have only
 one transformation method and specify a direction by the
 transformation name (bzip2/unbzip2).
 
 +1
 
 -1
 
 I can't support adding such methods (and that's why I ended up giving
 Nick's proposal for exposing codecs.encode and codecs.decode a +1).

My +1 was only for having the transformation be one-way under the condition 
that it is added at all.

 People think about these transformations as en- or de-coding, not
 transforming, most of the time.  Even for a transformation that is
 an involution (eg, rot13), people have an very clear idea of what's
 encoded and what's not, and they are going to prefer the names
 encode and decode for these (generic) operations in many cases.
 
 Eg, I don't think s.transform(decoder) is an improvement over
 decode(s, codec) (but tastes vary).[1]  It does mean that we need
 to add a redundant method, and I don't really see an advantage to it.

Actually my preferred method would be codec.decode(s). codec being the module 
that implements the functionality.

I don't think we need to invent another function registry.

 The semantics seem slightly off to me, since the purpose of the
 operation is to create a new object, not transform the original
 in-place.

This would mean the method would have to be called transformed()?

  (But of course str.encode and bytes.decode are precedents
 for those semantics.)
 
 
 Footnotes: 
 [1]  Arguments decoder and codec are identifiers, not metavariables.

Servus,
   Walter



Re: [Python-Dev] [Python-checkins] cpython: Close #17828: better handling of codec errors

2013-11-14 Thread Walter Dörwald

On 13.11.13 17:25, Nick Coghlan wrote:


On 14 November 2013 02:12, Nick Coghlan ncogh...@gmail.com wrote:

On 14 November 2013 00:30, Walter Dörwald wal...@livinglogic.de wrote:

On 13.11.13 14:51, nick.coghlan wrote:


http://hg.python.org/cpython/rev/854a2cea31b9
changeset:   87084:854a2cea31b9
user:Nick Coghlan ncogh...@gmail.com
date:Wed Nov 13 23:49:21 2013 +1000
summary:
Close #17828: better handling of codec errors

- output type errors now redirect users to the type-neutral
convenience functions in the codecs module
- stateless errors that occur during encoding and decoding
will now be automatically wrapped in exceptions that give
the name of the codec involved



Wouldn't it be better to add an annotation API to the exceptions classes?
This would allow to annotate all exceptions without having to replace the
exception object.


Hmm, it might be better to have the traceback machinery print the 
annotation information instead of BaseException.__str__, so we don't get 
any compatibility issues with custom __str__ implementations.



There's a reason the C API for this is private - it's a band aid fix,
because solving it properly is hard :)


Note that the specific problem with just annotating the exception
rather than a specific frame is that you lose the stack context for
where the annotation occurred. The current chaining workaround doesn't
just change the exception message, it also breaks the stack into two
pieces (inside and outside the codec) that get displayed separately.

Mostly though, it boils down to the fact that I'm far more comfortable
changing codec exception stack trace details in some cases than I am
proposing a new API for all exceptions this close to the Python 3.4
feature freeze.


Sure, this is something that might go into 3.5, but not 3.4.


A more elegant (and comprehensive) solution as a PEP for 3.5 would
certainly be a nice thing to have, but I think this is still much
better than the 3.3 status quo.


Thinking further about this, I like your frame annotation suggestion

Tracebacks could then look like this:

>>> b"hello".decode("uu_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>: decoding with 'uu_codec' codec failed
ValueError: Missing begin line in input data

In fact the traceback already lays out the chain of events. What is 
missing is simply a little additional information.


Could frame annotation be added via decorators, i.e. something like this:

@annotate("while doing something with {param}")
def func(param):
    do something

annotate() would catch the exception, call .format() on the annotation 
string with the local variables of the frame as keyword arguments, 
attach the result to a special attribute of the frame and reraise the 
exception.


The traceback machinery would simply have to print this additional 
attribute.
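Without interpreter support for frame annotations, the decorator idea
can at least be approximated today by folding the note into the
exception message. This is a rough sketch only: re-constructing the
exception can fail for types with unusual constructors, and Python 3.11
later added BaseException.add_note() for exactly this purpose:

```python
import functools

def annotate(template):
    """Decorator: on failure, format `template` with the function's
    arguments and fold the result into the exception message."""
    def decorator(func):
        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Map positional args onto the function's parameter names.
                names = func.__code__.co_varnames[:len(args)]
                context = dict(zip(names, args), **kwargs)
                note = template.format(**context)
                raise type(exc)(f"{exc}: {note}") from exc
        return wrapped
    return decorator
```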


Servus,
   Walter





Re: [Python-Dev] [Python-checkins] cpython: Close #17828: better handling of codec errors

2013-11-14 Thread Walter Dörwald

On 14.11.13 14:22, Walter Dörwald wrote:


On 13.11.13 17:25, Nick Coghlan wrote:


 [...]

A more elegant (and comprehensive) solution as a PEP for 3.5 would
certainly be a nice thing to have, but I think this is still much
better than the 3.3 status quo.


Thinking further about this, I like your frame annotation suggestion

Tracebacks could then look like this:

>>> b"hello".decode("uu_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>: decoding with 'uu_codec' codec failed
ValueError: Missing begin line in input data

In fact the traceback already lays out the chain of events. What is
missing is simply a little additional information.

Could frame annotation be added via decorators, i.e. something like this:

@annotate("while doing something with {param}")
def func(param):
    do something

annotate() would catch the exception, call .format() on the annotation
string with the local variables of the frame as keyword arguments,
attach the result to a special attribute of the frame and reraise the
exception.

The traceback machinery would simply have to print this additional
attribute.


http://bugs.python.org/19585 is a patch that implements that. With the 
patch the following code:


   import traceback

   @traceback.annotate("while handling x={x!r}")
   def handle(x):
       raise ValueError(42)

   handle("spam")

will give the traceback:

   Traceback (most recent call last):
     File "spam.py", line 8, in <module>
       handle("spam")
     File "frame-annotation/Lib/traceback.py", line 322, in wrapped
       f(*args, **kwargs)
     File "spam.py", line 5, in handle: while handling x='spam'
       raise ValueError(42)
   ValueError: 42

Unfortunaty the frame from the decorator shows up in the traceback.

Servus,
   Walter



Re: [Python-Dev] Add transform() and untranform() methods

2013-11-14 Thread Walter Dörwald
Am 15.11.2013 um 00:42 schrieb Serhiy Storchaka storch...@gmail.com:
 
 15.11.13 00:32, Victor Stinner написав(ла):
 And add transform() and untransform() methods to bytes and str types.
 In practice, it might be same codecs registry for all codecs just with
 a new attribute.
 
 If the transform() method will be added, I prefer to have only one 
 transformation method and specify a direction by the transformation name 
 (bzip2/unbzip2).

+1

Some of the transformations might not be revertible (s.transform("lower")? ;))

And the transform function probably doesn't need any error handling machinery.

What about the stream/iterator/incremental parts of the codec API?

Servus,
   Walter



Re: [Python-Dev] [Python-checkins] cpython: Close #17828: better handling of codec errors

2013-11-13 Thread Walter Dörwald

On 13.11.13 14:51, nick.coghlan wrote:


http://hg.python.org/cpython/rev/854a2cea31b9
changeset:   87084:854a2cea31b9
user:Nick Coghlan ncogh...@gmail.com
date:Wed Nov 13 23:49:21 2013 +1000
summary:
   Close #17828: better handling of codec errors

- output type errors now redirect users to the type-neutral
   convenience functions in the codecs module
- stateless errors that occur during encoding and decoding
   will now be automatically wrapped in exceptions that give
   the name of the codec involved


Wouldn't it be better to add an annotation API to the exceptions 
classes? This would allow to annotate all exceptions without having to 
replace the exception object.


I.e. BaseException would have an additional method annotate():

   try:
       dostuff(param)
   except Exception as exc:
       exc.annotate("while doing stuff with {}".format(param))

annotate() would simply append the message to an internal list attribute.

BaseException.__str__() could then use this to format an appropriate 
exception message.
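A sketch of that annotate() behavior as a free function; the attribute
name `_annotations` is made up. (Python 3.11 eventually adopted this
idea as BaseException.add_note(), which stores the messages in
`__notes__` and prints them with the traceback.)

```python
def annotate(exc, message):
    """Append `message` to a list of annotations on the exception,
    creating the list on first use."""
    notes = getattr(exc, "_annotations", None)
    if notes is None:
        notes = exc._annotations = []
    notes.append(message)
    return exc
```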


Servus,
   Walter



Re: [Python-Dev] Best practice for documentation for std lib

2013-09-24 Thread Walter Dörwald

On 23.09.13 17:18, Skip Montanaro wrote:


It would be great if the docstring contained a link to the online
documentation.


That would have to be a feature of help(), not hardcoded in each docstring.


That *is* a feature of the help function:

>>> help(sys)
Help on built-in module sys:

NAME
    sys

FILE
    (built-in)

MODULE DOCS
    http://docs.python.org/library/sys
...

(pydoc too, though I'm 99.9% sure they use the same underlying
facility Ping originally implemented.)


Hmm, but it doesn't work for functions:

>>> import sys
>>> help(sys.settrace)

Help on built-in function settrace in module sys:

settrace(...)
settrace(function)

Set the global debug tracing function.  It will be called on each
function call.  See the debugger chapter in the library manual.

Servus,
   Walter



Re: [Python-Dev] Best practice for documentation for std lib

2013-09-23 Thread Walter Dörwald

On 22.09.13 16:34, Brett Cannon wrote:


The rule of thumb I go by is the docstring should be enough to answer
the question "what args does this thing take and what does it do in
general to know it's the function I want and not another one in the same
module?" quickly and succinctly; i.e. just enough so that help() reminds
you about details for a module you are already familiar with that might
come up while at the interpreter prompt. Everything else -- in-depth
discussion of the algorithms, extra examples, why you want to use this
function, etc. -- all go in the .rst docs.


It would be great if the docstring contained a link to the online 
documentation.


Servus,
   Walter



Re: [Python-Dev] Best practice for documentation for std lib

2013-09-23 Thread Walter Dörwald

On 23.09.13 15:38, Fred Drake wrote:


On Mon, Sep 23, 2013 at 7:27 AM, Walter Dörwald wal...@livinglogic.de wrote:

It would be great if the docstring contained a link to the online
documentation.


The docstring itself, or the presentation generated by help() ?


The presentation generated by help(), or the output of IPython's foo? or 
foo?? syntax.


Servus,
   Walter



Re: [Python-Dev] eval and triple quoted strings

2013-06-17 Thread Walter Dörwald

On 14.06.13 23:03, PJ Eby wrote:

On Fri, Jun 14, 2013 at 2:11 PM, Ron Adam ron3...@gmail.com wrote:



On 06/14/2013 10:36 AM, Guido van Rossum wrote:


Not a bug. The same is done for file input -- CRLF is changed to LF before
tokenizing.




Should this be the same?


python3 -c 'print(bytes("\r\n", "utf8"))'
b'\r\n'

>>> eval('print(bytes("\r\n", "utf8"))')
b'\n'


No, but:

eval(r'print(bytes("\r\n", "utf8"))')

should be.  (And is.)

What I believe you and Walter are missing is that the \r\n in the eval
strings are converted early if you don't make the enclosing string
raw.  So what you're eval-ing is not what you think you are eval-ing,
hence the confusion.


I expected that eval()ing a string that contains the characters

   U+0027: APOSTROPHE
   U+0027: APOSTROPHE
   U+0027: APOSTROPHE
   U+000D: CR
   U+000A: LF
   U+0027: APOSTROPHE
   U+0027: APOSTROPHE
   U+0027: APOSTROPHE

to return a string containing the characters:

   U+000D: CR
   U+000A: LF

Making the string raw, of course turns it into:

   U+0027: APOSTROPHE
   U+0027: APOSTROPHE
   U+0027: APOSTROPHE
   U+005C: REVERSE SOLIDUS
   U+0072: LATIN SMALL LETTER R
   U+005C: REVERSE SOLIDUS
   U+006E: LATIN SMALL LETTER N
   U+0027: APOSTROPHE
   U+0027: APOSTROPHE
   U+0027: APOSTROPHE

and eval()ing that does indeed give \r\n as expected.
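The two cases described above can be checked directly (a minimal sketch of the raw vs. non-raw difference):

```python
# The \r in the outer (non-raw) literal becomes a real CR before eval
# ever sees the source; the tokenizer then normalizes CRLF to LF.
cooked = eval("'''\r\n'''")      # source: ''' CR LF '''

# A raw outer literal hands eval backslash-r / backslash-n escapes, which
# the inner string literal turns into CR and LF; nothing is normalized.
raw = eval(r"'''\r\n'''")        # source: ''' \r \n '''
```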

Hmm, it seems that codecs.unicode_escape_decode() does what I want:

>>> codecs.unicode_escape_decode("\r\n\\r\\n\\x0d\\x0a\\u000d\\u000a")
('\r\n\r\n\r\n\r\n', 26)

Servus,
   Walter



Re: [Python-Dev] eval and triple quoted strings

2013-06-17 Thread Walter Dörwald

On 17.06.13 19:04, Walter Dörwald wrote:


Hmm, it seems that codecs.unicode_escape_decode() does what I want:

  >>> codecs.unicode_escape_decode("\r\n\\r\\n\\x0d\\x0a\\u000d\\u000a")
  ('\r\n\r\n\r\n\r\n', 26)


Hmm, no it doesn't:

>>> codecs.unicode_escape_decode("\u1234")
('á\x88´', 3)
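The reason for the mangled result is that unicode_escape treats literal non-ASCII input as Latin-1, so only escape sequences themselves round-trip; a sketch on a modern Python 3:

```python
import codecs

# Escape sequences are decoded as documented:
decoded, consumed = codecs.unicode_escape_decode(b"\\u1234")

# But literal non-ASCII bytes are interpreted as Latin-1, so the UTF-8
# encoding of U+1234 comes back as three mojibake characters, not one:
mangled, _ = codecs.unicode_escape_decode("\u1234".encode("utf-8"))
```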

Servus,
   Walter



[Python-Dev] eval and triple quoted strings

2013-06-14 Thread Walter Dörwald

Hello all!

This surprised me:

>>> eval("'''\r\n'''")
'\n'

Where did the \r go? ast.literal_eval() has the same problem:

>>> ast.literal_eval("'''\r\n'''")
'\n'

Is this a bug/worth fixing?

Servus,
   Walter


Re: [Python-Dev] PEP 443 - Single-dispatch generic functions

2013-05-23 Thread Walter Dörwald

On 23.05.13 00:33, Łukasz Langa wrote:


Hello,
I would like to submit the following PEP for discussion and evaluation.


PEP: 443
Title: Single-dispatch generic functions
[...]
>>> @fun.register(int)
... def _(arg, verbose=False):
...     if verbose:
...         print("Strength in numbers, eh?", end=" ")
...     print(arg)
...


Should it be possible to register multiple types for the generic 
function with one register() call, i.e. should:


   @fun.register(int, float)
   def _(arg, verbose=False):
  ...

be allowed as a synonym for

   @fun.register(int)
   @fun.register(float)
   def _(arg, verbose=False):
  ...
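As the feature eventually landed in functools.singledispatch, register() accepts a single class per call, so the stacked form from the second snippet is the supported spelling. A sketch against the final API (not the PEP draft; the return strings are mine):

```python
from functools import singledispatch

@singledispatch
def fun(arg, verbose=False):
    return "default: {!r}".format(arg)

# register() takes one class per call; stacking the decorator is how one
# implementation gets attached to several types:
@fun.register(int)
@fun.register(float)
def _(arg, verbose=False):
    return "number: {!r}".format(arg)
```

With this, `fun(3)` and `fun(3.5)` both dispatch to the number implementation, while other types fall back to the default.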

Servus,
   Walter



Re: [Python-Dev] noob contributions to unit tests

2013-03-28 Thread Walter Dörwald

On 27.03.2013 at 03:24, R. David Murray rdmur...@bitdance.com wrote:

 On Tue, 26 Mar 2013 16:59:06 -0700, Maciej Fijalkowski fij...@gmail.com 
 wrote:
 On Tue, Mar 26, 2013 at 4:49 PM, Sean Felipe Wolfe ether@gmail.com 
 wrote:
 Hey everybody how are you all :)
 
 I am an intermediate-level python coder looking to get help out. I've
 been reading over the dev guide about helping increase test coverage
 --
 http://docs.python.org/devguide/coverage.html
 
 And also the third-party code coverage referenced in the devguide page:
 http://coverage.livinglogic.de/
 
 I'm seeing that according to the coverage tool, two of my favorite
 libraries, urllib/urllib2, have no unit tests? Is that correct or am I
 reading it wrong?
 
 If that's correct it seems like a great place perhaps for me to cut my
 teeth and I would be excited to learn and help out here.
 
 And of course any thoughts or advice for an aspiring Python
 contributor would be appreciated. Of course the dev guide gives me
 plenty of good info.
 
 Thanks!
 
 That looks like an error in the coverage report, there are certainly
 urllib and urllib2 tests in test/test_urllib*
 
 The devguide contains instructions for running coverage yourself,
 and if I recall correctly the 'fullcoverage' recipe does a better
 job than what runs at coverage.livinglogic.de.

The job that produces that output has been broken for some time now, and I 
haven't found the time to look into it. If someone wants to try, here's the 
code:

   https://pypi.python.org/pypi/pycoco/0.7.2

 […]

Servus,
   Walter



Re: [Python-Dev] [Python-checkins] cpython: Add a few entries to whatsnew/3.3.rst.

2012-09-26 Thread Walter Dörwald

On 26.09.12 16:43, ezio.melotti wrote:


http://hg.python.org/cpython/rev/36f61661f71e
changeset:   79194:36f61661f71e
user:Ezio Melotti ezio.melo...@gmail.com
date:Wed Sep 26 17:43:23 2012 +0300
summary:
   Add a few entries to whatsnew/3.3.rst.
[...]
+
+A new :data:`~html.entities.html5` dictionary that maps HTML5 named character
+references to the equivalent Unicode character(s) (e.g. ``html5['gt;'] == '>'``)
+has been added to the :mod:`html.entities` module.  The dictionary is now also
+used by :class:`~html.parser.HTMLParser`.


Is there a reason why the trailing ';' is included in the entity names?
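(For reference, checked against a current Python: the dictionary stores both spellings wherever HTML5 allows the bare name, which appears to be why the semicolon is part of the key.)

```python
from html.entities import html5

with_semi = html5["gt;"]       # the semicolon form is always present
bare = html5["gt"]             # legacy HTML4-era names also have a bare key
only_semi = "notin" in html5   # newer HTML5-only names require the ';'
```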

Servus,
   Walter




Re: [Python-Dev] Issue 2736: datetimes and Unix timestamps

2012-06-05 Thread Walter Dörwald

On 04.06.12 13:19, Dirkjan Ochtman wrote:


I recently opened issue14908. At work, I have to do a bunch of things
with dates, times and timezones, and sometimes Unix timestamps are
also involved (largely for easy compatibility with legacy APIs). I
find the relative obscurity when converting datetimes to timestamps
rather painful; IMO it should be possible to do everything I need
straight from the datetime module objects, instead of having to
involve the time or calendar modules.

Anyway, I was pointed to issue 2736, which seems to have got a lot of
discouraged core contributors (Victor, Antoine, David and Ka-Ping, to
name just a few) up against Alexander (the datetime maintainer,
AFAIK).


Also see: http://bugs.python.org/issue665194 (datetime-RFC2822 
roundtripping)



It seems like a fairly straightforward case of practicality
over purity: Alexander argues that there are easy one-liners to do
things like datetime.totimestamp(),


I don't want one-liners, I want one-callers! ;)


but most other people seem to not
find them so easy. They've since been added to the documentation at
least, but I would like to see if there is consensus on python-dev
that adding a little more timestamp support to datetime objects would
make sense.
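For context, the "easy one-liner" being debated is the epoch subtraction below; datetime.timestamp() itself was eventually added in Python 3.3 (the example instant is mine, not from the thread):

```python
from datetime import datetime, timezone

dt = datetime(2012, 6, 5, 12, 0, tzinfo=timezone.utc)

# The one-liner: subtract the epoch explicitly and take total seconds.
ts = (dt - datetime(1970, 1, 1, tzinfo=timezone.utc)).total_seconds()

# The method form that later shipped in 3.3:
ts2 = dt.timestamp()
```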


Servus,
   Walter


Re: [Python-Dev] PEP: New timestamp formats

2012-02-03 Thread Walter Dörwald
On 03.02.2012 at 01:59, Nick Coghlan ncogh...@gmail.com wrote:

 On Fri, Feb 3, 2012 at 10:21 AM, Victor Stinner
 victor.stin...@haypocalc.com wrote:
 I updated and completed my PEP and published the last draft. It will
 be available at:
 http://www.python.org/dev/peps/pep-0410/
 ( or read the source: http://hg.python.org/peps/file/tip/pep-0410.txt )
 
 I tried to list all alternatives.
 
 [...]
 
 datetime.datetime
 
 - as noted earlier in the thread, total_seconds() actually gives you a
 decent timestamp value and always returning UTC avoids timezone issues
 - real problem with the idea is that not all timestamps can be easily
 made absolute (e.g. some APIs may return time since system started
 or time since process started)
 - the complexity argument used against timedelta also applies

Wasn't datetime supposed to be the canonical date/time infrastructure that 
everybody uses? Why do we need yet another way to express a point in time? And 
even if we're going with Decimal, at least datetime.datetime should be extended 
to support the higher resolution (in fact it's the one where this can be done 
with no or minimal backward compatibility problems).

 [other alternatives]

Servus,
   Walter



Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader

2011-05-24 Thread Walter Dörwald
On 24.05.11 02:08, Victor Stinner wrote:

 [...]
 codecs.open() and StreamReader, StreamWriter and StreamReaderWriter
 classes of the codecs module don't support universal newlines, still
 have some issues with stateful codecs (like UTF-16/32 BOMs), and each
 codec has to implement a StreamReader and a StreamWriter class.
 
 StreamReader and StreamWriter are stateless codecs (no reset() or
 setstate() method),

They *are* stateful, they just don't expose their state to the public.

 and so it's not possible to write a generic fix for
 all child classes in the codecs module. Each stateful codec has to
 handle special cases like seek() problems.

Yes, which in theory makes it possible to implement shortcuts for
certain codecs (e.g. the UTF-32-BE/LE codecs could simply multiply the
character position by 4 to get the byte position). However AFAICR none
of the readers/writers does that.

 For example, UTF-16 codec
 duplicates some IncrementalEncoder/IncrementalDecoder code into its
 StreamWriter/StreamReader class.

Actually it's the other way round: When I implemented the incremental
codecs, I copied code from the StreamReader/StreamWriter classes.

 The io module is well tested, supports non-seekable streams, handles
 correctly corner-cases (like UTF-16/32 BOMs) and supports any kind of
 newlines including an universal newline mode. TextIOWrapper reuses
 incremental encoders and decoders, so BOM issues were fixed only once,
 in TextIOWrapper.
 
 It's trivial to replace a call to codecs.open() by a call to open(),
 because the two APIs are very close. The main difference is that
 codecs.open() doesn't support universal newlines, so you have to use
 open(..., newline='') to keep the same behaviour (keep newlines
 unchanged). This task can be done by 2to3. But I suppose that most
 people will be happy with the universal newline mode.
 
 I don't see which usecase is not covered by TextIOWrapper. But I know
 some cases which are not supported by StreamReader/StreamWriter.

This could be partially fixed by implementing generic
StreamReader/StreamWriter classes that reuse the incremental codecs, but
I don't think that's worth it.

 [...] 

Servus,
   Walter


Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader

2011-05-24 Thread Walter Dörwald
On 24.05.11 12:58, Victor Stinner wrote:
 On Tuesday, 24 May 2011 at 12:42 +0200, Łukasz Langa wrote:
 On 2011-05-24 at 12:16, Walter Dörwald wrote:

 I don't see which usecase is not covered by TextIOWrapper. But I know
 some cases which are not supported by StreamReader/StreamWriter.

 This could be partially fixed by implementing generic
 StreamReader/StreamWriter classes that reuse the incremental codecs, but
 I don't think that's worth it.

 Why not?
 
 We have already an implementation of this idea, it is called
 io.TextIOWrapper.

Exactly.

From another post by Victor:

 As I wrote, codecs.open() is useful in Python 2. But I don't know any
 program or library using directly StreamReader or StreamWriter.

So: implementing this is a lot of work, duplicates existing
functionality and is mostly unused.

Servus,
   Walter






Re: [Python-Dev] Could these restrictions be removed?

2011-05-12 Thread Walter Dörwald
On 12.05.11 18:53, Walter Dörwald wrote:

 On 12.05.11 18:33, s...@pobox.com wrote:
 
 A friend at work who is new to Python wondered why this didn't work with
 pickle:

 class Outer:

 class Inner:

 ...

 def __init__(self):
 self.i = Outer.Inner()

 I explained:

 http://docs.python.org/library/pickle.html#what-can-be-pickled-and-unpickled
  


  From that:

 # functions defined at the top level of a module
 # built-in functions defined at the top level of a module
 # classes that are defined at the top level of a module

 I've never questioned this, but I wonder, is this a fundamental restriction
 or could it be overcome with a modest amount of work?
 
 This is related to http://bugs.python.org/issue633930

See also the thread started at:

   http://mail.python.org/pipermail/python-dev/2005-March/052454.html
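As it happens, this particular restriction was later lifted: since Python 3.4, pickle looks classes up by __qualname__, so nested classes round-trip. A quick illustration, run on a modern Python:

```python
import pickle

class Outer:
    class Inner:
        pass

# In the Python of this thread this raised PicklingError, because classes
# were looked up by module-level name only.  With __qualname__-based
# lookup (Python 3.4+), the nested class pickles and unpickles fine:
obj = pickle.loads(pickle.dumps(Outer.Inner()))
```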

Servus,
   Walter


Re: [Python-Dev] Could these restrictions be removed?

2011-05-12 Thread Walter Dörwald
On 12.05.11 18:33, s...@pobox.com wrote:

 A friend at work who is new to Python wondered why this didn't work with
 pickle:
 
 class Outer:
 
 class Inner:
 
 ...
 
 def __init__(self):
 self.i = Outer.Inner()
 
 I explained:
 
 http://docs.python.org/library/pickle.html#what-can-be-pickled-and-unpickled 


  From that:

 # functions defined at the top level of a module
 # built-in functions defined at the top level of a module
 # classes that are defined at the top level of a module
 
 I've never questioned this, but I wonder, is this a fundamental restriction
 or could it be overcome with a modest amount of work?

This is related to http://bugs.python.org/issue633930

Servus,
   Walter


Re: [Python-Dev] Code coverage doesn't show .py stats

2010-11-04 Thread Walter Dörwald
On 03.11.10 19:21, anatoly techtonik wrote:

 Hi,
 
 Python code coverage doesn't include any .py files. What happened?
 http://coverage.livinglogic.de/
 
 Did it work before?

It did, however currently the logfile

   http://coverage.livinglogic.de/testlog.txt

shows the following exception:

Traceback (most recent call last):
  File "Lib/test/regrtest.py", line 1500, in <module>
    main()
  File "Lib/test/regrtest.py", line 696, in main
    r.write_results(show_missing=True, summary=True, coverdir=coverdir)
  File "/home/coverage/python/Lib/trace.py", line 319, in write_results
    lnotab, count)
  File "/home/coverage/python/Lib/trace.py", line 369, in write_results_file
    outfile.write(line.expandtabs(8))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in
position 30: ordinal not in range(128)

BTW, this is the py3k branch (i.e.
http://svn.python.org/snapshots/python3k.tar.bz2)

It seems the trace module has a problem with unicode.

Servus,
   Walter


Re: [Python-Dev] blocking 2.7

2010-07-06 Thread Walter Dörwald
On 05.07.10 16:19, Nick Coghlan wrote:
 On Mon, Jul 5, 2010 at 5:20 AM, Terry Reedy tjre...@udel.edu wrote:
 On 7/4/2010 2:31 AM, Éric Araujo wrote:

 But Python tests lack coverage stats, so it is hard to say anything.

 FYI: http://coverage.livinglogic.de/

 Turns out the audioop is one of the best covered modules, at 98%
 
 Alas, those are only the stats for the audioop test suite. audioop
 itself is written in C, so the automatic coverage stats generated by
 livinglogic don't provide any details.

http://coverage.livinglogic.de/ *does* include coverage info for stuff
written in C, see for example:

   http://coverage.livinglogic.de/Objects/unicodeobject.c.html

However it *is* strange that test_audioop.py gets executed, but
audioop.c doesn't seem to be.

Servus,
   Walter


[Python-Dev] Coverage, was: Re: blocking 2.7

2010-07-06 Thread Walter Dörwald
On 06.07.10 15:07, Mark Dickinson wrote:

 On Tue, Jul 6, 2010 at 1:10 PM, Walter Dörwald wal...@livinglogic.de wrote:
 http://coverage.livinglogic.de/ *does* include coverage info for stuff
 written in C, see for example:

   http://coverage.livinglogic.de/Objects/unicodeobject.c.html

 However it *is* strange that test_audioop.py gets executed, but
 audioop.c doesn't seem to be.
 
 It looks as though none of the extension modules (besides those that
 are compiled statically into the interpreter) are reporting coverage.
 I wonder whether the correct flags are being passed to the module
 build stage?  Incidentally, there doesn't seem to be any of the usual
 'make' output I'd associate with the module-building stage in the
 build log at:
 
 http://coverage.livinglogic.de/buildlog.txt
 
 For example, I'd expect to see the string 'mathmodule' somewhere in that 
 output.

True, there seems to be a problem. I'm running

   ./configure --enable-unicode=ucs4 --with-pydebug

and then

   make coverage

This doesn't seem to build extension modules. However as far as I
understand the Makefile, make coverage should build extension modules:

# Default target
all:build_all
build_all:  $(BUILDPYTHON) oldsharedmods sharedmods gdbhooks

coverage:
	@echo "Building with support for coverage checking:"
	$(MAKE) clean
	$(MAKE) all CFLAGS="$(CFLAGS) -O0 -pg -fprofile-arcs -ftest-coverage" \
		LIBS="$(LIBS) -lgcov"

# Build the shared modules
sharedmods: $(BUILDPYTHON)
@case $$MAKEFLAGS in \
*s*) $(RUNSHARED) CC='$(CC)' LDSHARED='$(BLDSHARED)'
LDFLAGS='$(LDFLAGS)' OPT='$(OPT)' ./$(BUILDPYTHON) -E $(srcdir)/setup.py
-q build;; \
*) $(RUNSHARED) CC='$(CC)' LDSHARED='$(BLDSHARED)' LDFLAGS='$(LDFLAGS)'
OPT='$(OPT)' ./$(BUILDPYTHON) -E $(srcdir)/setup.py build;; \
esac

I'm rerunning now with make && make coverage to see if this fixes
anything.

Servus,
   Walter


Re: [Python-Dev] Python Library Support in 3.x (Was: email package status in 3.X)

2010-06-19 Thread Walter Dörwald

On 18.06.2010 at 22:53, Terry Reedy tjre...@udel.edu wrote:


On 6/18/2010 12:32 PM, Walter Dörwald wrote:


   http://coverage.livinglogic.de/


I am a bit puzzled as to the meaning of the gray/red/green bars  
since the correlation between coverage % and bars is not very high.


The gray bar is the uncoverable part of the source (empty lines,  
comments etc.), the green bar is the covered part (i.e. those lines  
that really got executed) and the red bar is the uncovered part (i.e.  
Those lines that could have been executed but weren't). So coverage is


   green / (green + red)

Just click on the coverage header to sort by coverage and you *will*  
see a correlation.


Servus,
   Walter



Re: [Python-Dev] Python Library Support in 3.x (Was: email package status in 3.X)

2010-06-18 Thread Walter Dörwald
On 18.06.10 17:04, Brian Curtin wrote:

 [...]
 2. no code coverage (test/user story/rfc/pep)
 
 
 If you know of a way to incorporate code coverage tools and metrics into
 the current process, I believe a number of people would be interested.
 There currently exists some coverage tool that runs on the current
 repository, but I'm not sure of its location or status.

   http://coverage.livinglogic.de/

I haven't touched the code in a year, but the job's still running.

 [...]

Servus,
   Walter




Re: [Python-Dev] Reintroduce or drop completly hex, bz2, rot13, ... codecs

2010-06-11 Thread Walter Dörwald
On 10.06.10 21:31, Terry Reedy wrote:

 On 6/10/2010 7:08 AM, M.-A. Lemburg wrote:
 Walter Dörwald wrote:
 
 The PEP would also serve as a reference back to both this discussion and
 the previous one (which was long enough ago that I've forgotten most of 
 it).

 I too think that a PEP is required here.

 Fair enough. I'll write a PEP.
 
 Thank you from me.

 Codecs support several types of error handling that don't make sense for
 transform()/untransform(). What should 'abc'.decode('hex', 'replace')
 do? (In 2.6 it raises an assertion error, because errors *must* be strict).
 
 I would expect either ValueError: errors arg must be 'strict' for
 transform

What use is an argument that must always have the same value?

'abc'.transform('hex', errors='strict', obey_the_flufl=True)

 or else TypeError: transform takes 1 arg, 2 given.

IMHO that's the better option.

 That's not really an issue since codecs don't have to implement
 all error handling schemes.

 For starters, they will all only implement 'strict' mode.

I would prefer it if transformers were separate from codecs and had
their own registry.

Servus,
   Walter


Re: [Python-Dev] Reintroduce or drop completly hex, bz2, rot13, ... codecs

2010-06-10 Thread Walter Dörwald
On 09.06.10 14:47, Nick Coghlan wrote:

 On 09/06/10 22:18, Victor Stinner wrote:
 On Wednesday, 9 June 2010 at 10:41:29, M.-A. Lemburg wrote:
 No, .transform() and .untransform() will be interface to same-type
 codecs, i.e. ones that convert bytes to bytes or str to str. As with
 .encode()/.decode() these helper methods also implement type safety
 of the return type.

 What about buffer compatible objects like array.array(), memoryview(), etc.?
 Should we use codecs.encode() / codecs.decode() for these types?
 
 There are probably enough subtleties that this is all worth specifying 
 in a PEP:
 
 - which codecs from 2.x are to be restored
 - the domain each codec operates in (binary data or text)*
 - review behaviour of codecs.encode and codecs.decode
 - behaviour of the new str, bytes and bytearray (un)transform methods
 - whether to add helper methods for reverse codecs (like base64)
 
 The PEP would also serve as a reference back to both this discussion and 
 the previous one (which was long enough ago that I've forgotten most of it).

I too think that a PEP is required here.

Codecs support several types of error handling that don't make sense for
transform()/untransform(). What should 'abc'.decode('hex', 'replace')
do? (In 2.6 it raises an assertion error, because errors *must* be strict).

I think we should take this opportunity to implement
transform/untransform without being burdened with features we inherited
from codecs which don't make sense for transform/untransform.

 [...]

Servus,
   Walter


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Walter Dörwald
On 09.01.10 14:38, Victor Stinner wrote:

 On Saturday, 9 January 2010 at 12:18:33, Walter Dörwald wrote:
 Good idea, I chose open(filename, encoding="BOM").

 On the surface this looks like there's an encoding named "BOM", but
 looking at your patch I found that the check is still done in
 TextIOWrapper. IMHO the best approach would be to implement a *real*
 codec named "BOM" (or "sniff"). This doesn't require *any* changes to
 the IO library. It could even be developed as a standalone project and
 published in the Cheeseshop.
 
 Why not, this is another solution to point (2) ("Check for a BOM while
 reading or detect it before?"). Which encoding would be used if there is no
 BOM? UTF-8 sounds like a good choice.

UTF-8 might be a good choice, or the fallback could be specified in the
encoding name, i.e.

   open("file.txt", encoding="BOM-UTF-8")

falls back to UTF-8, if there's no BOM at the start.

This could be implemented via a custom codec search function (see
http://docs.python.org/library/codecs.html#codecs.register for more info).
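The hypothetical "BOM-&lt;fallback&gt;" scheme sketched above can indeed be prototyped with a search function and no changes to the IO library. This is only a one-shot-decoding sketch under my own naming assumptions (no such codec ships with Python):

```python
import codecs

_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),  # check 32-bit BOMs before UTF-16:
    (codecs.BOM_UTF32_BE, "utf-32"),  # FF FE 00 00 starts with FF FE
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def _bom_search(name):
    # codecs.lookup() lowercases names (and, in newer Pythons, also maps
    # '-' to '_'), so normalize before matching the "BOM-<fallback>" form.
    name = name.lower().replace("-", "_")
    if not name.startswith("bom_"):
        return None
    fallback = name[4:]

    def decode(data, errors="strict"):
        data = bytes(data)
        for bom, enc in _BOMS:
            if data.startswith(bom):
                return codecs.decode(data, enc, errors), len(data)
        return codecs.decode(data, fallback, errors), len(data)

    def encode(s, errors="strict"):
        # Writing simply uses the fallback encoding (no BOM is emitted).
        return codecs.encode(s, fallback, errors), len(s)

    return codecs.CodecInfo(encode, decode, name=name)

codecs.register(_bom_search)

no_bom = codecs.decode(b"hi", "BOM-UTF-8")
with_bom = codecs.decode(codecs.BOM_UTF16_LE + b"h\x00i\x00", "BOM-UTF-8")
```

A real encoding usable with open() would additionally need incremental codecs and stream classes; this sketch only covers decoding a complete byte string at once.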

Servus,
   Walter


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Walter Dörwald
On 10.01.10 00:40, Martin v. Löwis wrote:
 How does the requirement that it be implemented as a codec miss the
 point?

 If we want it to be the default, it must be able to fallback on the current
 locale-based algorithm if no BOM is found. I don't think it would be easy 
 for a
 codec to do that.
 
 Yes - however, Victor currently apparently *doesn't* want it to be the
 default, but wants the user to specify encoding="BOM". If so, it isn't
 the default, and it is easy to implement as a codec.
 
 FWIW, I agree with Walter that if it is provided through the encoding=
 argument, it should be a codec. If it is built into the open function
 (for whatever reason), it must be provided by some other parameter.

 Why not simply encoding=None?
 
 I don't mind. Please re-read Walter's message - it only said that
 *if* this is activated through encoding=BOM, *then* it must be
 a codec, and could be on PyPI. I don't think Walter was talking about
 the case it is not activated through encoding='BOM' *at all*.

However if this autodetection feature is useful in other cases (no
matter how it's activated), it should be a codec, because as part of the
open() function it isn't reusable.

 The default value should provide the most useful
 behaviour possible. Forcing users to choose between two different 
 autodetection
 strategies (encoding=None and another one) is a little insane IMO.

And encoding="mbcs" is a third option on Windows.

 That wouldn't disturb me much. There are a lot of things in that area
 that are a little insane, starting with Microsoft Windows :-)

;)

Servus,
   Walter


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-11 Thread Walter Dörwald
On 11.01.10 13:45, Lennart Regebro wrote:

 On Mon, Jan 11, 2010 at 13:29, Walter Dörwald wal...@livinglogic.de wrote:
 However if this autodetection feature is useful in other cases (no
 matter how it's activated), it should be a codec, because as part of the
 open() function it isn't reusable.
 
 But an autodetect feature is not a codec. Sure it should be reusable,
 but making it a codec seems to be  a weird hack to me.

I think we already had this discussion two years ago in the context of
XML decoding ;):

http://mail.python.org/pipermail/python-dev/2007-November/075138.html

 And how would
 you reuse it if it was a codec? A reusable autodetect feature would be
 usable to detect what codec it is. An autodetect codec would not be
 useful for that, as it would simply just decode.

I have implemented an XML codec (as part of XIST:
http://pypi.python.org/pypi/ll-xist), that can do that:

>>> from ll import xml_codec
>>> import codecs
>>> c = codecs.getincrementaldecoder("xml")()
>>> c.encoding
>>> c.decode("<?xml")
u''
>>> c.encoding
>>> c.decode(" version='1.0'")
u''
>>> c.encoding
>>> c.decode(" encoding='iso-8859-1'?>")
u"<?xml version='1.0' encoding='iso-8859-1'?>"
>>> c.encoding
'iso-8859-1'

Servus,
   Walter


Re: [Python-Dev] Quick sum up about open() + BOM

2010-01-09 Thread Walter Dörwald
On 09.01.10 01:47, Glenn Linderman wrote:

 On approximately 1/8/2010 3:59 PM, came the following characters from
 the keyboard of Victor Stinner:
 Hi,

 Thanks for all the answers! I will try to sum up all ideas here.
 
 One concern I have with this implementation (encoding="BOM") is that if
 there is no BOM it assumes UTF-8.  That is probably a good assumption in
 some circumstances, but not in others.
 
 * It is not required that UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE
 encoded files include a BOM.  It is only required that UTF-16 and UTF-32
 (cases where the endianness is unspecified) contain a BOM.  Hence, it
 might be that someone would expect a UTF-16LE (or any of the formats
 that don't require a BOM, rather than UTF-8), but be willing to accept
 any BOM-discriminated format.
 
 * Potentially, this could be expanded beyond the various Unicode
 encodings... one could envision that a program whose data files
 historically were in any particular national language locale, could want
 to be enhance to accept Unicode, and could declare that they will accept
 any BOM-discriminated format, but want to default, in the absence of a
 BOM, to the original national language locale that they historically
 accepted.  That would provide a migration path for their old data files.
 
 So the point is, that it might be nice to have
 BOM-otherEncodingForDefault for each other encoding that Python
 supports.  Not sure that is the right API, but I think it is expressive
 enough to handle the cases above.  Whether the cases solve actual
 problems or not, I couldn't say, but they seem like reasonable cases.

This is doable with the currect API. Simply define a codec search
function that handles all encoding names that start with BOM- and pass
the otherEncodingForDefault part along to the codec.

 It would, of course, be nicest if OS metadata had been invented way back
 when, for all OSes, such that all text files were flagged with their
 encoding... then languages could just read the encoding and do the right
 thing! But we live in the real world, instead.

Servus,
   Walter
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Improve open() to support reading file starting with an unicode BOM

2010-01-09 Thread Walter Dörwald

Victor Stinner wrote:

Le vendredi 08 janvier 2010 10:10:23, Martin v. Löwis a écrit :

Builtin open() function is unable to open an UTF-16/32 file starting with
a BOM if the encoding is not specified (raise an unicode error). For an
UTF-8 file starting with a BOM, read()/readline() returns also the BOM
whereas the BOM should be ignored.

It depends. If you use the utf-8-sig encoding, it *will* ignore the
UTF-8 signature.


Sure, but it means that you only use UTF-8+BOM files. If you get UTF-8 and 
UTF-8+BOM files, you have to detect the encoding (not an easy job) or to 
remove the BOM after the first read (much harder if you use a module like 
ConfigParser or csv).



Since my proposition changes the result TextIOWrapper.read()/readline()
for files starting with a BOM, we might introduce an option to open() to
enable the new behaviour. But is it really needed to keep the backward
compatibility?

Absolutely. And there is no need to produce a new option, but instead
use the existing options: define an encoding that auto-detects the
encoding from the family of BOMs. Maybe you call it encoding="sniff".


Good idea, I chose open(filename, encoding="BOM").


On the surface this looks like there's an encoding named "BOM", but 
looking at your patch I found that the check is still done in 
TextIOWrapper. IMHO the best approach would be to implement a *real* 
codec named "BOM" (or "sniff"). This doesn't require *any* changes to 
the IO library. It could even be developed as a standalone project and 
published in the Cheeseshop.


To see how something like this can be done, take a look at the UTF-16 
codec, that switches to bigendian or littleendian mode depending on the 
first read/decode call.
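The deferred-decision pattern used there looks roughly like this as an incremental decoder (a hypothetical sketch, not the stdlib implementation; UTF-32 BOMs are omitted for brevity):

```python
import codecs

class BOMSniffDecoder(codecs.IncrementalDecoder):
    """Buffer input until the BOM question is settled, then delegate.

    Mirrors the trick the UTF-16 codec uses: the real decoder is only
    chosen after the first bytes arrive. Hypothetical class.
    """

    def __init__(self, errors="strict", default="utf-8"):
        super().__init__(errors)
        self.default = default
        self.buffer = b""
        self.decoder = None  # chosen after enough bytes are seen

    def decode(self, input, final=False):
        if self.decoder is None:
            self.buffer += input
            # Longest BOM we check (UTF-8) is 3 bytes.
            if len(self.buffer) < 3 and not final:
                return ""
            if self.buffer.startswith((codecs.BOM_UTF16_LE,
                                       codecs.BOM_UTF16_BE)):
                enc = "utf-16"  # the utf-16 codec consumes its own BOM
            elif self.buffer.startswith(codecs.BOM_UTF8):
                enc = "utf-8-sig"
            else:
                enc = self.default
            self.decoder = codecs.getincrementaldecoder(enc)(self.errors)
            input, self.buffer = self.buffer, b""
        return self.decoder.decode(input, final)
```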


Servus,
   Walter







Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-08 Thread Walter Dörwald
Stephen J. Turnbull wrote:
 Walter Dörwald writes:
 
   "surrogatepass" (for the "don't complain about lone half surrogates"
   handler) and "surrogatereplace" sound OK to me. However the other
   ...replace handlers are destructive (i.e. when such a ...replace
   handler is used for encoding, decoding will not produce the original
   unicode string).
 
 That doesn't bother me in the slightest.  Replace does not connote
 destructive or non-destructive to me; it connotes substitution.
 The fact that other error handlers happen to be destructive doesn't
 affect that at all for me.  YMMV.
 
   The purpose of the PEP 383 error handler however is to be roundtrip
   safe, so maybe we should choose a slightly different name?  How
   about surrogateescape?
 
 To me, escape has a strong connotation of a multicharacter
 representation of a single character, and that's not true here.
 
 How about surrogatetranslate?  I still prefer surrogatereplace, as
 it's slightly easier for me to type.

I like surrogatetranslate better than surrogateescape better than
surrogatereplace.

But I'll stop bikesheding now and let Martin decide.

Servus,
   Walter




Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Walter Dörwald
M.-A. Lemburg wrote:
 Antoine Pitrou wrote:
 Martin v. Löwis martin at v.loewis.de writes:
 py> b'\xed\xa0\x80'.decode("utf-8", "surrogates")
 '\ud800'
 The point is, surrogates does not mean anything intuitive for an /error
 handler/. You seem to be the only one who finds this name explicit enough,
 perhaps because you chose it.
 Most other handlers' names have verbs in them (ignore, replace,
 xmlcharrefreplace, etc.).
 
 Correct.
 
 The purpose of an error handler name is to indicate to the user
 what it does, hence the use of verbs.
 
 Walter started with xmlcharrefreplace, ie. no space names, so
 surrogatereplace would be the logically correct name for the
 replace with lone surrogates scheme invented by Markus Kuhn.

"surrogatepass" (for the "don't complain about lone half surrogates"
handler) and "surrogatereplace" sound OK to me. However the other
...replace handlers are destructive (i.e. when such a ...replace
handler is used for encoding, decoding will not produce the original
unicode string). The purpose of the PEP 383 error handler however is to
be roundtrip safe, so maybe we should choose a slightly different name?
How about surrogateescape?

 The error handler for undoing this operation (ie. when converting
 a Unicode string to some other encoding) should probably use the
 same name based on symmetry and the fact that the escaping
 scheme is meant to be used for enabling round-trip safety.

We have only one error handler registry, but we *can* have one error
handler for both directions (encoding and decoding) as the error handler
can simply check whether it got passed a UnicodeEncodeError or
UnicodeDecodeError object.
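A minimal sketch of such a dual-direction handler (the handler name and the replacement choices are illustrative only):

```python
import codecs

def bidirectional(exc):
    """One registered handler serving both directions.

    The exception type tells encode errors from decode errors, as
    described above. Replacements here are arbitrary demo choices:
    one U+FFFD per undecodable byte, one '?' per unencodable char.
    """
    if isinstance(exc, UnicodeDecodeError):
        return ("\ufffd" * (exc.end - exc.start), exc.end)
    if isinstance(exc, UnicodeEncodeError):
        return ("?" * (exc.end - exc.start), exc.end)
    raise exc

codecs.register_error("bidirectional-demo", bidirectional)
```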

 BTW: It would also be appropriate to reference Markus Kuhn in the PEP
 as the inventor of the escaping scheme.
 
 Even if only to give the reader an idea of how that scheme works and
 why (the PEP on python.org currently doesn't explain this).
 
 It should also explain that the scheme is meant to assure round-trip
 safety and doesn't necessarily work when using transcoding, ie.
 reading using one encoding, writing using another.

Servus,
   Walter


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-07 Thread Walter Dörwald
Michael Urman wrote:

 [...]
 Well, there is a way to stack error handlers, although it's not pretty:
 [...]
 codecs.register_error(surrogates_then_replace,
  surrogates_then_replace)
 
 That mitigates my arguments significantly, although I'd rather see
 something like errors=('surrogates', 'replace') chain the handlers
 without additional registrations. But that's a different PEP or
 arbitrary change. :)

The first version of PEP 293 changed the errors argument to be a string
or callable. This would have simplified handler stacking somewhat
(because you don't have to register or lookup handlers) but it had the
disadvantage that many char * arguments in the C API would have had to
changed to PyObject *. Changing the errors argument to a list of
strings would have the same problem.
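For illustration, the registration-based stacking mentioned above can be written with today's codecs API like this (the chained handler name is made up; the first handler in the chain is shown as the eventual "surrogateescape"):

```python
import codecs

def surrogate_then_replace(exc):
    """Hypothetical chained handler: "surrogateescape", else "replace".

    Looks up the existing handlers and tries them in order, falling
    back when the first one re-raises.
    """
    try:
        return codecs.lookup_error("surrogateescape")(exc)
    except UnicodeError:
        return codecs.lookup_error("replace")(exc)

codecs.register_error("surrogate_then_replace", surrogate_then_replace)
```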

Servus,
   Walter


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Walter Dörwald
Martin v. Löwis wrote:

 I'm proposing the following PEP for inclusion into Python 3.1.
 Please comment.
 
 Regards,
 Martin
 
 PEP: 383
 Title: Non-decodable Bytes in System Character Interfaces
 Version: $Revision: 71793 $
 Last-Modified: $Date: 2009-04-22 08:42:06 +0200 (Mi, 22. Apr 2009) $
 Author: Martin v. Löwis mar...@v.loewis.de
 Status: Draft
 Type: Standards Track
 Content-Type: text/x-rst
 Created: 22-Apr-2009
 Python-Version: 3.1
 Post-History:
 
 Abstract
 
 
 File names, environment variables, and command line arguments are
 defined as being character data in POSIX; the C APIs however allow
 passing arbitrary bytes - whether these conform to a certain encoding
 or not. This PEP proposes a means of dealing with such irregularities
 by embedding the bytes in character strings in such a way that allows
 recreation of the original byte string.
 
 Rationale
 =
 
 The C char type is a data type that is commonly used to represent both
 character data and bytes. Certain POSIX interfaces are specified and
 widely understood as operating on character data, however, the system
 call interfaces make no assumption on the encoding of these data, and
 pass them on as-is. With Python 3, character strings use a
 Unicode-based internal representation, making it difficult to ignore
 the encoding of byte strings in the same way that the C interfaces can
 ignore the encoding.
 
 On the other hand, Microsoft Windows NT has correct the original

"correct" -> "corrected"

 design limitation of Unix, and made it explicit in its system
 interfaces that these data (file names, environment variables, command
 line arguments) are indeed character data, by providing a
 Unicode-based API (keeping a C-char-based one for backwards
 compatibility).
 
 [...]
 
 Specification
 =
 
 On Windows, Python uses the wide character APIs to access
 character-oriented APIs, allowing direct conversion of the
 environmental data to Python str objects.
 
 On POSIX systems, Python currently applies the locale's encoding to
 convert the byte data to Unicode. If the locale's encoding is UTF-8,
 it can represent the full set of Unicode characters, otherwise, only a
 subset is representable. In the latter case, using private-use
 characters to represent these bytes would be an option. For UTF-8,
 doing so would create an ambiguity, as the private-use characters may
 regularly occur in the input also.
 
 To convert non-decodable bytes, a new error handler python-escape is
 introduced, which decodes non-decodable bytes using into a private-use
 character U+F01xx, which is believed to not conflict with private-use
 characters that currently exist in Python codecs.

Would this mean that real private use characters in the file name would
raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
any error handler.

 The error handler interface is extended to allow the encode error
 handler to return byte strings immediately, in addition to returning
 Unicode strings which then get encoded again.

Then the error callback for encoding would become specific to the target
encoding. Would this mean that the handler checks which encoding is used
and behaves like strict if it doesn't recognize the encoding?

 If the locale's encoding is UTF-8, the file system encoding is set to
 a new encoding utf-8b. The UTF-8b codec decodes non-decodable bytes
 (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.

Is this done by the codec, or the error handler? If it's done by the
codec I don't see a reason for the python-escape error handler.

 Discussion
 ==
 
 While providing a uniform API to non-decodable bytes, this interface
 has the limitation that chosen representation only works if the data
 get converted back to bytes with the python-escape error handler
 also.

I thought the error handler would be used for decoding.

 Encoding the data with the locale's encoding and the (default)
 strict error handler will raise an exception, encoding them with UTF-8
 will produce non-sensical data.
 
 For most applications, we assume that they eventually pass data
 received from a system interface back into the same system
 interfaces. For example, and application invoking os.listdir() will

"and" -> "an"

 likely pass the result strings back into APIs like os.stat() or
 open(), which then encodes them back into their original byte
 representation. Applications that need to process the original byte
 strings can obtain them by encoding the character strings with the
 file system encoding, passing python-escape as the error handler
 name.

Servus,
   Walter


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Walter Dörwald
Martin v. Löwis wrote:
 "correct" -> "corrected"
 
 Thanks, fixed.
 
 To convert non-decodable bytes, a new error handler python-escape is
 introduced, which decodes non-decodable bytes using into a private-use
 character U+F01xx, which is believed to not conflict with private-use
 characters that currently exist in Python codecs.
 Would this mean that real private use characters in the file name would
 raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
 any error handler.
 
 The python-escape codec is only used/meaningful if the env encoding
 is not UTF-8. For any other encoding, it is assumed that no character
 actually maps to the private-use characters.

Which should be true for any encoding from the pre-unicode era, but not
for UTF-16/32 and variants.

 The error handler interface is extended to allow the encode error
 handler to return byte strings immediately, in addition to returning
 Unicode strings which then get encoded again.
 Then the error callback for encoding would become specific to the target
 encoding.
 
 Why would it become specific? It can work the same way for any encoding:
 take U+F01xx, and generate the byte xx.

If any error callback emits bytes these byte sequences must be legal in
the target encoding, which depends on the target encoding itself.

However for the normal use of this error handler this might be
irrelevant, because those filenames that get encoded were constructed in
such a way that reencoding them regenerates the original byte sequence.
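That round-trip property is exactly what shipped (as the "surrogateescape" handler, since Python 3.1); a short demonstration:

```python
# Undecodable bytes map to lone surrogates U+DC80..U+DCFF on decode,
# and only those surrogates map back to bytes on encode -- so
# re-encoding with the same handler regenerates the original bytes.
raw = b"caf\xe9"  # not valid UTF-8
name = raw.decode("utf-8", "surrogateescape")
assert name == "caf\udce9"
assert name.encode("utf-8", "surrogateescape") == raw
```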

 If the locale's encoding is UTF-8, the file system encoding is set to
 a new encoding utf-8b. The UTF-8b codec decodes non-decodable bytes
 (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
 Is this done by the codec, or the error handler? If it's done by the
 codec I don't see a reason for the python-escape error handler.
 
 utf-8b is a new codec. However, the utf-8b codec is only used if the
 env encoding would otherwise be utf-8. For utf-8b, the error handler
 is indeed unnecessary.

Wouldn't it make more sense to be consistent how non-decodable bytes get
decoded? I.e. should the utf-8b codec decode those bytes to PUA
characters too (and refuse to encode them, so the error handler outputs
them)?

 While providing a uniform API to non-decodable bytes, this interface
 has the limitation that chosen representation only works if the data
 get converted back to bytes with the python-escape error handler
 also.
 I thought the error handler would be used for decoding.
 
 It's used in both directions: for decoding, it converts \xXX to
 U+F01XX. For encoding, U+F01XX will trigger an error, which is then
 handled by the handler to produce \xXX.

But only for non-UTF8 encodings?

Servus,
   Walter


Re: [Python-Dev] Google Summer of Code/core Python projects - RFC

2009-04-13 Thread Walter Dörwald
C. Titus Brown wrote:

 [...]
 I have had a hard time getting a good sense of what core code is well
 tested and what is not well tested, across various platforms.  While
 Walter's C/Python integrated code coverage site is nice, it would be
 even nicer to have a way to generate all that information within any
 particular checkout on a real-time basis.

This might have to be done incrementally. Creating the output for
http://coverage.livinglogic.de/ takes about 90 minutes. This breaks down
like this:

Downloading: 2sec
Unpacking: 3sec
Configuring: 30sec
Compiling: 1min
Running the test suite: 1hour
Reading coverage files: 8sec
Generating HTML files: 30min

 Doing so in the context of
 Snakebite would be icing... and I think it's worth supporting in core,
 especially if it can be done without any changes *to* core.

The only thing we'd probably need in core is a way to configure Python
to run with code coverage. The coverage script does this by patching the
makefile.

Running the code coverage script on Snakebite would be awesome. The
script is available from here:

http://pypi.python.org/pypi/pycoco

 - Another small nit is that they should address Python 2.x, too.
 
 I asked that they focus on EITHER 2.x or 3.x, since too broad is an
 equally valid criticism.  Certainly 3.x is the future so I though
 focusing on increasing code coverage, and especially C code coverage,
 could best be applied to 3.x.

Servus,
   Walter


Re: [Python-Dev] pprint(iterator)

2009-02-02 Thread Walter Dörwald

Paul Moore wrote:

2009/1/30 Walter Dörwald wal...@livinglogic.de:

Paul Moore wrote:


[...]
In all honesty, I think pkgutil.simplegeneric should be documented,
exposed, and moved to a library of its own[1].

http://pypi.python.org/pypi/simplegeneric


Thanks, I was aware of that.


I wasn't aware of the fact that simplegeneric is part of the stdlib, 
albeit in a strange spot.



I assume that the barrier to getting this
into the stdlib will be higher than to simply exposing an
implementation already available in the stdlib.


At least we'd need documentation and tests. And of course the code must 
be stable and there must be someone willing to maintain it (then again 
it's less than 40 lines of code).


There should be enough third-party modules that use it to justify making 
simplegeneric an official part of the stdlib.


The best spot for generic() is probably in the functools module.
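A sketch of what such a generic function looks like with functools.singledispatch, the mechanism that eventually did land in functools (Python 3.4, PEP 443); the pformat_any function and its registrations are hypothetical examples:

```python
from functools import singledispatch

@singledispatch
def pformat_any(obj):
    """Fallback implementation: plain repr()."""
    return repr(obj)

@pformat_any.register(set)
def _(obj):
    # Deterministic set formatting for the demo.
    return "{%s}" % ", ".join(sorted(map(repr, obj)))

@pformat_any.register(dict)
def _(obj):
    items = ", ".join(f"{k!r}: {v!r}" for k, v in sorted(obj.items()))
    return "{%s}" % items
```

Third parties can then register their own types without any "let's have another special method" discussion.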


To be honest, all I
would like is for these regular let's have another special method
discussions to become unnecessary...


Me too.

Servus,
   Walter





Re: [Python-Dev] pprint(iterator)

2009-01-30 Thread Walter Dörwald
Paul Moore wrote:

 [...]
 In all honesty, I think pkgutil.simplegeneric should be documented,
 exposed, and moved to a library of its own[1].

http://pypi.python.org/pypi/simplegeneric

 [...]


Servus,
   Walter


[Python-Dev] Code coverage

2008-09-19 Thread Walter Dörwald
Hello all!

The code coverage site at http://coverage.livinglogic.de/ was broken for
the last few months. It's fixed again now and runs the test suite once
per day with

   regrtest.py -T -N -uurlfetch,largefile,network,decimal

Servus,
   Walter


Re: [Python-Dev] [Python-3000] Betas today - I hope

2008-06-13 Thread Walter Dörwald

M.-A. Lemburg wrote:

On 2008-06-12 16:59, Walter Dörwald wrote:

M.-A. Lemburg wrote:

.transform() and .untransform() use the codecs to apply same-type
conversions. They do apply type checks to make sure that the
codec does indeed return the same type.

E.g. text.transform('xml-escape') or data.transform('base64').


So what would a base64 codec do with the errors argument?


It could use it to e.g. try to recover as much data as possible
from broken input data.

Currently (in Py2.x), it raises an exception if you pass in anything
but strict.


I think for transformations we don't need the full codec machinery:

  ...

No need to invent another wheel :-) The codecs already exist for
Py2.x and can be used by the .encode()/.decode() methods in Py2.x
(where no type checks occur).


By using a new API we could get rid of old warts. For example: Why 
does the stateless encoder/decoder return how many input 
characters/bytes it has consumed? It must consume *all* bytes anyway!


No, it doesn't and that's the point in having those return values :-)

Even though the encoder/decoders are stateless, that doesn't mean
they have to consume all input data. The caller is responsible to
make sure that all input data was in fact consumed.

You could for example have a decoder that stops decoding after
having seen a block end indicator, e.g. a base64 line end or
XML closing element.


So how should the UTF-8 decoder know that it has to stop at a closing 
XML element?



Just because all codecs that ship with Python always try to decode
the complete input doesn't mean that the feature isn't being used.


I know of no other code that does. Do you have an example for this use?


The interface was designed to allow for the above situations.


Then could we at least have a new codec method that does:

def statelessencode(self, input):
    (output, consumed) = self.encode(input)
    assert len(input) == consumed
    return output

Servus,
   Walter



Re: [Python-Dev] [Python-3000] Betas today - I hope

2008-06-13 Thread Walter Dörwald

M.-A. Lemburg wrote:

On 2008-06-13 11:32, Walter Dörwald wrote:

M.-A. Lemburg wrote:

On 2008-06-12 16:59, Walter Dörwald wrote:

M.-A. Lemburg wrote:

.transform() and .untransform() use the codecs to apply same-type
conversions. They do apply type checks to make sure that the
codec does indeed return the same type.

E.g. text.transform('xml-escape') or data.transform('base64').


So what would a base64 codec do with the errors argument?


It could use it to e.g. try to recover as much data as possible
from broken input data.

Currently (in Py2.x), it raises an exception if you pass in anything
but strict.


I think for transformations we don't need the full codec machinery:

  ...

No need to invent another wheel :-) The codecs already exist for
Py2.x and can be used by the .encode()/.decode() methods in Py2.x
(where no type checks occur).


By using a new API we could get rid of old warts. For example: Why 
does the stateless encoder/decoder return how many input 
characters/bytes it has consumed? It must consume *all* bytes anyway!


No, it doesn't and that's the point in having those return values :-)

Even though the encoder/decoders are stateless, that doesn't mean
they have to consume all input data. The caller is responsible to
make sure that all input data was in fact consumed.

You could for example have a decoder that stops decoding after
having seen a block end indicator, e.g. a base64 line end or
XML closing element.


So how should the UTF-8 decoder know that it has to stop at a closing 
XML element?


The UTF-8 decoder doesn't support this, but you could write a codec
that applies this kind of detection, e.g. to not try to decode
partial UTF-8 byte sequences at the end of input, which would then
result in error.


Just because all codecs that ship with Python always try to decode
the complete input doesn't mean that the feature isn't being used.


I know of no other code that does. Do you have an example for this use?


I already gave you a few examples.


Maybe I was unclear, I meant real world examples, not hypothetical ones.


The interface was designed to allow for the above situations.


Then could we at least have a new codec method that does:

def statelessencode(self, input):
    (output, consumed) = self.encode(input)
    assert len(input) == consumed
    return output


You mean as method to the Codec class ?


No, I meant as a method for the CodecInfo class.


Sure, we could do that, but please use a different name,
e.g. .encodeall() and .decodeall() - .encode() and .decode()
are already stateless (and so would the new methods be), so
stateless isn't all that meaningful in this context.


I like the names encodeall/decodeall!


We could also add such a check to the PyCodec_Encode() and _Decode()
functions. They currently do not apply the above check.

In Python, those two functions are exposed as codecs.encode()
and codecs.decode().


This change will probably have to wait for the 2.7 cycle.

Servus,
   Walter


Re: [Python-Dev] [Python-3000] Betas today - I hope

2008-06-13 Thread Walter Dörwald

Walter Dörwald wrote:

[...] 

Sure, we could do that, but please use a different name,
e.g. .encodeall() and .decodeall() - .encode() and .decode()
are already stateles (and so would the new methods be), so
stateless isn't all that meaningful in this context.


I like the names encodeall/decodeall!


We could also add such a check to the PyCodec_Encode() and _Decode()
functions. They currently do not apply the above check.

In Python, those two functions are exposed as codecs.encode()
and codecs.decode().


This change will probably have to wait for the 2.7 cycle.


BTW, what I noticed is that the unicode-internal codec seems to be broken:

>>> import codecs
>>> codecs.getencoder("unicode-internal")(u"abc")
('a\x00b\x00c\x00', 6)

I would have expected it to return:

>>> import codecs
>>> codecs.getencoder("unicode-internal")(u"abc")
('a\x00b\x00c\x00', 3)

Servus,
   Walter



Re: [Python-Dev] [Python-3000] Betas today - I hope

2008-06-12 Thread Walter Dörwald

M.-A. Lemburg wrote:

On 2008-06-11 17:15, Walter Dörwald wrote:

M.-A. Lemburg wrote:

On 2008-06-11 13:35, Barry Warsaw wrote:
So I had planned to do a bunch of work last night looking at the 
release blocker issues, but nature intervened.  A bunch of severe 
thunderstorms knocked out my 'net access until this morning.


I'll try to find some time during the day to look at the RB issues.  
Hopefully we can get Guido to look at them too and Pronounce on some 
of them.  Guido please start with:


http://bugs.python.org/issue643841

My plan is to begin building the betas tonight, at around 9 or 10pm 
EDT (0100 to 0200 UTC Thursday).  If a showstopper comes up before 
then, I'll email the list.  If you think we really aren't ready for 
beta, then I would still like to get a release out today.  In that 
case, we'll call it alpha and delay the betas.


There are two things I'd like to get in to 3.0:

 * .transform()/.untransform() methods (this is mostly done, just need
   to add the methods to PyUnicode, PyBytes and PyByteArray)


What would these methods do? Use the codec machinery without any type 
checks?


As discussed in another thread some weeks ago:

.transform() and .untransform() use the codecs to apply same-type
conversions. They do apply type checks to make sure that the
codec does indeed return the same type.

E.g. text.transform('xml-escape') or data.transform('base64').


So what would a base64 codec do with the errors argument?


I think for transformations we don't need the full codec machinery:

  ...

No need to invent another wheel :-) The codecs already exist for
Py2.x and can be used by the .encode()/.decode() methods in Py2.x
(where no type checks occur).


By using a new API we could get rid of old warts. For example: Why does 
the stateless encoder/decoder return how many input characters/bytes it 
has consumed? It must consume *all* bytes anyway!



In Py3.x, .encode()/.decode() only allow conversions of the type
unicode - bytes. .transform()/.untransform() add conversions
of the type unicode - unicode or bytes - bytes.

All other conversions in Py3.x have to go through codecs.encode() and
codecs.decode() which are the generic codec access functions from
the codec registry.


Servus,
   Walter



Re: [Python-Dev] [Python-3000] Betas today - I hope

2008-06-11 Thread Walter Dörwald

M.-A. Lemburg wrote:

On 2008-06-11 13:35, Barry Warsaw wrote:
So I had planned to do a bunch of work last night looking at the 
release blocker issues, but nature intervened.  A bunch of severe 
thunderstorms knocked out my 'net access until this morning.


I'll try to find some time during the day to look at the RB issues.  
Hopefully we can get Guido to look at them too and Pronounce on some 
of them.  Guido please start with:


http://bugs.python.org/issue643841

My plan is to begin building the betas tonight, at around 9 or 10pm 
EDT (0100 to 0200 UTC Thursday).  If a showstopper comes up before 
then, I'll email the list.  If you think we really aren't ready for 
beta, then I would still like to get a release out today.  In that 
case, we'll call it alpha and delay the betas.


There are two things I'd like to get in to 3.0:

 * .transform()/.untransform() methods (this is mostly done, just need
   to add the methods to PyUnicode, PyBytes and PyByteArray)


What would these methods do? Use the codec machinery without any type 
checks?


I think for transformations we don't need the full codec machinery:

We probably don't need extensible error handling.

There are transformations that are not invertible, so it doesn't make 
sense to have both operations in one object. If the operation *is* 
invertible, two transformers can be used.


Do we really need a registry that maps function named to functions?

A simple API might look like this:

class TransformInfo:
    # stateless transformer
    def transform(self, input):
        ...

    # return stateful incremental transformer
    def incrementaltransformer(self):
        ...

    # wrap stream in a transforming stream
    def streamtransformer(self, stream):
        ...

incrementaltransformer() would return an object that has one method:

    def transform(self, input, final=False):


[...]
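A hypothetical base64 instance of this TransformInfo sketch, illustrative only (not a stdlib API):

```python
import base64

class Base64TransformInfo:
    """Base64 as a transformer in the sketched API.

    The incremental transformer buffers partial 3-byte groups so that
    the concatenated chunk output equals a one-shot encode.
    """

    def transform(self, input):
        # stateless transformer
        return base64.b64encode(input)

    def incrementaltransformer(self):
        class Incremental:
            def __init__(self):
                self.pending = b""

            def transform(self, chunk, final=False):
                data = self.pending + chunk
                if final:
                    self.pending = b""
                    return base64.b64encode(data)
                # Only whole 3-byte groups encode without padding.
                cut = len(data) - len(data) % 3
                self.pending = data[cut:]
                return base64.b64encode(data[:cut])

        return Incremental()
```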


Servus,
   Walter


Re: [Python-Dev] PEP: per user site-packages directory

2008-01-14 Thread Walter Dörwald
Christian Heimes wrote:

 [...] 
 PEP: XXX
 Title: Per user site-packages directory
 Version: $Revision$
 Last-Modified: $Date$
 Author: Christian Heimes christian(at)cheimes(dot)de
 Status: Draft
 Type: Standards Track
 Content-Type: text/x-rst
 Created: 11-Jan-2008
 Python-Version: 2.6, 3.0
 Post-History:
 [...] 
 user site directory
 
A site directory inside the users' home directory. An user site
directory is specific to a Python version. The path contains
the version number (major and minor only).
 
Windows: %APPDATA%/Python/Python26/site-packages
Mac: ~/Library/Python/2.6/site-packages
Unix: ~/.local/lib/python2.6/site-packages
 
 
 user configuration directory
 
Usually the parent directory of the user site directory. It's meant
for Python version specific data like config files.
 
Windows: %APPDATA%/Python/Python26
Mac: ~/Library/Python/2.6
Unix: ~/.local/lib/python2.6

So if I'm using the --user option, where would scripts be installed? 
Would this be:

Windows: %APPDATA%/Python/Python26/bin
Mac: ~/Library/Python/2.6/bin
Unix: ~/.local/lib/python2.6/bin

I'd like to be able to switch between several versions of my user 
installation simply by changing a link. (On the Mac I'm doing this by 
relinking ~/Library/Python to different directories.)

Servus,
Walter



Re: [Python-Dev] XML codec?

2007-11-13 Thread Walter Dörwald
Fred Drake wrote:
 On Nov 12, 2007, at 8:56 AM, Walter Dörwald wrote:
 It isn't embedded. codecs.detect_xml_encoding() is callable without
 any problems (though not documented).
 
 Not documented means not available, I think.

I just didn't think that someone wants the detection function, but not
the codec, so I left the function undocumented.

 Who would use such a function for what?
 
 Being able to detect the encoding can be useful anytime you want
 information about a file, actually.  In particular, presenting encoding
 information in a user interface (yes, you can call that contrived, but
 some people want to be able to see such things, and for them it's a
 requirement).

And if you want to display the XML you'd need to decode it. An example
might be a text viewer, e.g. Apple's QuickLook.

 If you want to parse the XML and re-encode, it's common
 to want to re-encode in the origin encoding; it's needed for that as
 well.  If you just want to toss the text into an editor, the encoding is
 also needed.  In that case, the codec approach *might* be acceptable
 (depending on the rest of the editor implementation), but the same
 re-encoding issue applies as well.
 
 Simply, it's sometimes desired to know the encoding for purposes that
 don't require immediate decoding.  A function would be quite handy in
 these cases.

So the consensus seems to be: Add an encoding detection function
(implemented in Python) to the xml module?

Servus,
   Walter



Re: [Python-Dev] XML codec?

2007-11-12 Thread Walter Dörwald
Martin v. Löwis wrote:
 I don't know. Is an XML document ill-formed if it doesn't contain an
 XML declaration, is not in UTF-8 or UTF-16, but there's external
 encoding info?
 
 If there is external encoding info, matching the actual encoding,
 it would be well-formed. Of course, preserving that information would
 be up to the application.

OK. When the application passes an encoding to the decoder this is
supposed to be the external encoding info, so for the decoder it makes
sense to assume that the encoding passed to the encoder is the external
encoding info and will be transmitted along with the encoded bytes.

 This looks good. Now we would have to extent the code to detect and
 replace the encoding in the XML declaration too.
 
 I'm still opposed to making this a codec. Right - for a pure Python
 solution, the processing of the XML declaration would still need to
 be implemented.
 
 I think there could be a much simpler routine to have the same 
 effect. - if it's less than 4 bytes, answer need more data.
 Can there be an XML document that is less then 4 bytes? I guess not.
 
 No, the smallest document has exactly 4 characters (e.g. <f/>).
 However, external entities may be smaller, such as x.
 
 But anyway: would a Python implementation of these two functions
 (detect_encoding()/fix_encoding()) be accepted?
 
 I could agree to a Python implementation of this algorithm as long
 as it's not packaged as a codec.

I still can't understand your objection to a codec. What's the
difference between UTF-16 decoding and XML decoding? In fact PEP 263
IMHO does specify how to decode Python source, so in theory it could be
a codec (in practice this probably wouldn't work because of
bootstrapping problems).

Servus,
   Walter



Re: [Python-Dev] XML codec?

2007-11-12 Thread Walter Dörwald
Martin v. Löwis wrote:
   In case it isn't clear - this is exactly my view also.

 But is there an API to do it?  As MAL points out that API would have
 to return not an encoding, but a pair of an encoding and the rewound
 stream.  
 
 The API wouldn't operate on streams. Instead, you pass a string, and
 it either returns the detected encoding, or an information telling that
 it needs more data. No streams.

But in many cases you read the data out of a stream and pass it to an
incremental XML parser. So if you're transcoding the input (either
because the XML parser can't handle the encoding in question or because
there's an external encoding specified, but it's not possible to pass
that to the parser), a codec makes the most sense.

 For non-seekable, non-peekable streams (if any), what you'd
 need would be a stream that consisted of a concatenation of the
 buffered data used for detection and the continuation of the stream.
 
 The application would read data out of the stream, and pass it to
 the detection. It then can process it in whatever manner it meant to
 process it in the first place.

Servus,
   Walter


Re: [Python-Dev] XML codec?

2007-11-12 Thread Walter Dörwald
Fred Drake wrote:

 On Nov 12, 2007, at 8:16 AM, M.-A. Lemburg wrote:
 We have a -1 from Martin and a +1 from Walter, Guido and myself.
 Pretty clear vote if you ask me. I'd say we end the discussion here
 and move on.
 
 If we're counting, you've got a -1 on the codec from me as well.
 Martin's right: there's no value to embedding the logic of
 auto-detection into the codec.

It isn't embedded. codecs.detect_xml_encoding() is callable without
any problems (though not documented).

 A function somewhere in the xml package is  
 all that's warranted.

Who would use such a function for what?

Servus,
   Walter



Re: [Python-Dev] XML codec?

2007-11-10 Thread Walter Dörwald
Martin v. Löwis sagte:

 So what if the unicode string doesn't start with an XML declaration?
 Will it add one?

 No.

 Ok. So the XML document would be ill-formed then unless the encoding is
 UTF-8, right?

I don't know. Is an XML document ill-formed if it doesn't contain an XML 
declaration, is not in UTF-8 or UTF-16, but there's
external encoding info? If it is, then yes, the document would be ill-formed.

 The point of this code is not just to return whether the string starts
 with "<?xml" or not. There are actually three cases:

 Still, it's overly complex for that matter:

   * The string does start with "<?xml"

if s.startswith("<?xml"):
  return Yes

   * The string starts with a prefix of "<?xml", i.e. we can only
 decide if it starts with "<?xml" if we have more input.

if "<?xml".startswith(s):
  return Maybe

   * The string definitely doesn't start with "<?xml".

return No

This looks good. Now we would have to extend the code to detect and replace the 
encoding in the XML declaration too.

 What bit fiddling are you referring to specifically that you think
 is better done in C than in Python?

 The code that checks the byte signature, i.e. the first part of
 detect_xml_encoding_str().

 I can't see any *bit* fiddling there, except for the bit mask of
 candidates. For the candidate list, I cannot quite understand why
 you need a bit mask at all, since the candidates are rarely
 overlapping.

I tried many variants and that seemed to be the most straightforward one.

 I think there could be a much simpler routine to have the same
 effect.
 - if it's less than 4 bytes, answer "need more data".

Can there be an XML document that is less than 4 bytes? I guess not.

 - otherwise, implement annex F literally. Make a dictionary
   of all prefixes that are exactly 4 bytes, i.e.

   prefixes4 = {"\x00\x00\xFE\xFF": "utf-32be", ...
   ..., "\0\x3c\0\x3f": "utf-16le"}

   try: return prefixes4[s[:4]]
   except KeyError: pass
   if s.startswith(codecs.BOM_UTF16_BE): return "utf-16be"
   ...
   if s.startswith("<?xml"):
       return get_encoding_from_declaration(s)
   return "utf-8"

get_encoding_from_declaration() would have to do the same yes/no/maybe decision.
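A minimal runnable sketch of the Annex F detection being discussed, modernized to Python 3. All names are illustrative assumptions, not code from the actual patch; the declaration parsing is a deliberately simple regex:

```python
import codecs
import re

# Sketch only: names are illustrative, not from the actual patch.
# 4-byte signatures from XML 1.0 Annex F (no BOM, "<?" in various encodings,
# plus the UTF-32 BOMs, which are themselves 4 bytes long).
PREFIXES4 = {
    b"\x00\x00\xFE\xFF": "utf-32-be",
    b"\xFF\xFE\x00\x00": "utf-32-le",
    b"\x00\x3C\x00\x3F": "utf-16-be",  # "<?" in UTF-16-BE, no BOM
    b"\x3C\x00\x3F\x00": "utf-16-le",  # "<?" in UTF-16-LE, no BOM
}

def get_encoding_from_declaration(data):
    # Pull the encoding pseudo-attribute out of the XML declaration.
    m = re.match(rb"<\?xml[^>]*encoding=['\"]([A-Za-z][-A-Za-z0-9._]*)['\"]", data)
    return m.group(1).decode("ascii") if m else "utf-8"

def detect_encoding(data):
    if len(data) < 4:
        return None  # the "need more data" answer
    if data[:4] in PREFIXES4:
        return PREFIXES4[data[:4]]
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(b"<?xml"):
        return get_encoding_from_declaration(data)
    return "utf-8"
```

The yes/no/maybe decision shows up here as the `None` ("need more data") return for inputs shorter than four bytes.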

But anyway: would a Python implementation of these two functions 
(detect_encoding()/fix_encoding()) be accepted?

Servus,
   Walter




Re: [Python-Dev] XML codec?

2007-11-10 Thread Walter Dörwald
Martin v. Löwis sagte:
 And what do you do once you've detected the encoding? You decode the
 input, so why not combine both into an XML decoder?

 Because it is the XML parser that does the decoding, not the
 application. Also, it is better to provide functionality in
 a modular manner (i.e. encoding detection separately from
 encodings),

It is separate. Detection is done by codecs.detect_xml_encoding(), decoding is 
done by the codec.

 and leaving integration of modules to the application,
 in particular if the integration is trivial.

Servus,
   Walter




Re: [Python-Dev] XML codec?

2007-11-09 Thread Walter Dörwald
Martin v. Löwis wrote:

 ci = codecs.lookup("xml-auto-detect")
 p = expat.ParserCreate()
 e = "utf-32"
 s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
 s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0]
 p.Parse(s, True)
 
 So how come the document being parsed is recognized as UTF-8?

Because you can force the encoder to use a specified encoding. If you do
this and the unicode string starts with an XML declaration, the encoder
will put the specified encoding into the declaration:

import codecs

e = codecs.getencoder("xml-auto-detect")
print e(u"<?xml version='1.0' encoding='iso-8859-1'?><foo/>",
encoding="utf-8")[0]

This prints:
<?xml version='1.0' encoding='utf-8'?><foo/>

 OK, so should I put the C code into a _xml module?
 
 I don't see the need for C code at all.

Doing the bit fiddling for
Modules/_codecsmodule.c::detect_xml_encoding_str() in C felt like the
right thing to do.

Servus,
   Walter



Re: [Python-Dev] XML codec?

2007-11-09 Thread Walter Dörwald
Adam Olsen wrote:

 On 11/8/07, Walter Dörwald [EMAIL PROTECTED] wrote:
 [...]
 Furthermore encoding-detection might be part of the responsibility of
 the XML parser, but this decoding phase is totally distinct from the
 parsing phase, so why not put the decoding into a common library?
 I would not object to that - just to expose it as a codec. Adding it
 to the XML library is fine, IMO.
 But it does make sense as a codec. The decoding phase of an XML parser
 has to turn a byte stream into a unicode stream. That's the job of a codec.
 
 Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc
 codecs to do the encoding.  There's no need to create a magical
 mystery codec to pick out which though.

So the code is good, if it is inside an XML parser, and it's bad if it
is inside a codec?

 It's not even sufficient for
 XML:
 
 1) round-tripping a file should be done in the original encoding.
 Containing the auto-detected encoding within a codec doesn't let you
 see what it picked.

The chosen encoding is available from the incremental encoder:

import codecs

e = codecs.getincrementalencoder("xml-auto-detect")()
e.encode(u"<?xml version='1.0' encoding='utf-32'?><foo/>", True)
print e.encoding

This prints "utf-32".

 2) the encoding may be specified externally from the file/stream[1].
 The xml parser needs to handle these out-of-band encodings anyway.

It does. You can pass an encoding to the stateless decoder, the
incremental decoder and the streamreader. It will then use this encoding
instead the one detected from the byte stream. It even will put the
correct encoding into the XML declaration (if there is one):

import codecs

d = codecs.getdecoder("xml-auto-detect")
print d("<?xml version='1.0' encoding='iso-8859-1'?><foo/>",
encoding="utf-8")[0]

This prints:
<?xml version='1.0' encoding='utf-8'?><foo/>

Servus,
   Walter


Re: [Python-Dev] XML codec?

2007-11-09 Thread Walter Dörwald
Martin v. Löwis wrote:

 Because you can force the encoder to use a specified encoding. If you do
 this and the unicode string starts with an XML declaration
 
 So what if the unicode string doesn't start with an XML declaration?
 Will it add one?

No.

 If so, what version number will it use?

If we added this we could add an extra argument version to the encoder
constructor defaulting to '1.0'.

 OK, so should I put the C code into a _xml module?
 I don't see the need for C code at all.
 Doing the bit fiddling for
 Modules/_codecsmodule.c::detect_xml_encoding_str() in C felt like the
 right thing to do.
 
 Hmm. I don't think a sequence like
 
 +if (strlen>0)
 +{
 +if (*str++ != '<')
 +return 1;
 +if (strlen>1)
 +{
 +if (*str++ != '?')
 +return 1;
 +if (strlen>2)
 +{
 +if (*str++ != 'x')
 +return 1;
 +if (strlen>3)
 +{
 +if (*str++ != 'm')
 +return 1;
 +if (strlen>4)
 +{
 +if (*str++ != 'l')
 +return 1;
 +if (strlen>5)
 +{
 +if (*str != ' ' && *str != '\t' && *str !=
 '\r' && *str != '\n')
 +return 1;
 
 is well-maintainable C. I feel it is much better writing
 
   if not s.startswith("<?xml"):
       return 1

The point of this code is not just to return whether the string starts
with "<?xml" or not. There are actually three cases:
  * The string does start with "<?xml".
  * The string starts with a prefix of "<?xml", i.e. we can only
decide if it starts with "<?xml" if we have more input.
  * The string definitely doesn't start with "<?xml".
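The three-way decision can be written compactly in Python (a sketch of the idea, not the patch's code):

```python
def match_prefix(s, prefix="<?xml"):
    # Three-way decision: definite match, possible match pending more
    # input, or definite non-match.
    if s.startswith(prefix):
        return "yes"
    if prefix.startswith(s):
        return "maybe"  # s is a proper prefix of "<?xml"; need more data
    return "no"
```

An empty input yields "maybe", which matches the incremental-decoding semantics: with no data at all, nothing can be ruled out yet.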

 What bit fiddling are you referring to specifically that you think
 is better done in C than in Python?

The code that checks the byte signature, i.e. the first part of
detect_xml_encoding_str().

Servus,
   Walter






Re: [Python-Dev] XML codec?

2007-11-09 Thread Walter Dörwald
Martin v. Löwis wrote:
 Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc
 codecs to do the encoding.  There's no need to create a magical
 mystery codec to pick out which though.
 So the code is good, if it is inside an XML parser, and it's bad if it
 is inside a codec?
 
 Exactly so. This functionality just *isn't* a codec - there is no
 encoding. Instead, it is an algorithm for *detecting* an encoding.

And what do you do once you've detected the encoding? You decode the
input, so why not combine both into an XML decoder?

Servus,
   Walter



Re: [Python-Dev] XML codec?

2007-11-09 Thread Walter Dörwald
Walter Dörwald wrote:
 Martin v. Löwis wrote:
 Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc
 codecs to do the encoding.  There's no need to create a magical
 mystery codec to pick out which though.
 So the code is good, if it is inside an XML parser, and it's bad if it
 is inside a codec?
 Exactly so. This functionality just *isn't* a codec - there is no
 encoding. Instead, it is an algorithm for *detecting* an encoding.
 
 And what do you do once you've detected the encoding? You decode the
 input, so why not combine both into an XML decoder?

In fact, we already have such a codec. The utf-16 decoder looks at the
first two bytes and then decides to forward the rest to either a
utf-16-be or a utf-16-le decoder.
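That dispatch can be observed directly (shown here in modern Python syntax; the behaviour Walter describes is the same in Python 2):

```python
import codecs

# The stdlib "utf-16" decoder inspects the BOM and forwards the rest of
# the data to the LE or BE variant: the same detect-then-decode idea.
text = "foo"
le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
assert le.decode("utf-16") == text  # byte streams differ ...
assert be.decode("utf-16") == text  # ... but both round-trip via "utf-16"
```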

Servus,
   Walter


Re: [Python-Dev] XML codec?

2007-11-09 Thread Walter Dörwald
M.-A. Lemburg wrote:

 On 2007-11-09 14:10, Walter Dörwald wrote:
 Martin v. Löwis wrote:
 Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc
 codecs to do the encoding.  There's no need to create a magical
 mystery codec to pick out which though.
 So the code is good, if it is inside an XML parser, and it's bad if it
 is inside a codec?
 Exactly so. This functionality just *isn't* a codec - there is no
 encoding. Instead, it is an algorithm for *detecting* an encoding.
 And what do you do once you've detected the encoding? You decode the
 input, so why not combine both into an XML decoder?
 
 FWIW: I'm +1 on adding such a codec.
 
 It makes working with XML data a lot easier: you simply don't have to
 bother with the encoding of the XML data anymore and can just let the
 codec figure out the details. The XML parser can then work directly
 on the Unicode data.

Exactly. I have a version of sgmlop lying around that does that.

 Whether it needs to be in C or not is another question (I would have
 done this in Python since performance is not really an issue), but since
 the code is already written, why not use it ?

Servus,
   Walter


Re: [Python-Dev] XML codec?

2007-11-08 Thread Walter Dörwald
Martin v. Löwis wrote:
 Any comments?
 
 -1. First, (as already discussed on the tracker,) "xml" is a bad name
 for an encoding. How would you encode "Hello" in "xml"?

Then how about the suggested "xml-auto-detect"?

 Then, I'd claim that the problem that the codec solves doesn't really
 exist. IOW, most XML parsers implement the auto-detection of encodings,
 anyway, and this is where architecturally this functionality belongs.

But not all XML parsers support all encodings. The XML codec makes it
trivial to add this support to an existing parser.

Furthermore encoding-detection might be part of the responsibility of
the XML parser, but this decoding phase is totally distinct from the
parsing phase, so why not put the decoding into a common library?

 For a text editor, much more useful than a codec would be a routine
 (say, xml.detect_encoding) which performs the auto-detection.

There's a (currently undocumented) codecs.detect_xml_encoding() in the
patch. We could document this function and make it public. But if
there's no codec that uses it, this function IMHO doesn't belong in the
codecs module. Should this function be available from xml/__init__.py or
should be put it into something like xml/utils.py?

 Finally, I think the codec is incorrect. When saving XML to a file
 (e.g. in a text editor), there should rarely be encoding errors, since
 one could use character references in many cases.

This requires some intelligent fiddling with the errors attribute of the
encoder.

 Also, the XML
 spec talks about detecting EBCDIC, which I believe your implementation
 doesn't.

Correct, but as long as Python doesn't have an EBCDIC codec, that won't
help much. Adding *detection* of EBCDIC to detect_xml_encoding() is
rather simple though.

Servus,
   Walter



Re: [Python-Dev] XML codec?

2007-11-08 Thread Walter Dörwald
Martin v. Löwis wrote:

 Then how about the suggested xml-auto-detect?
 
 That is better.

OK.

 Then, I'd claim that the problem that the codec solves doesn't really
 exist. IOW, most XML parsers implement the auto-detection of encodings,
 anyway, and this is where architecturally this functionality belongs.
 But not all XML parsers support all encodings. The XML codec makes it
 trivial to add this support to an existing parser.
 
 I would like to question this claim. Can you give an example of a parser
 that doesn't support a specific encoding

It seems that e.g. expat doesn't support UTF-32:

from xml.parsers import expat

p = expat.ParserCreate()
e = "utf-32"
s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
p.Parse(s, True)

This fails with:

Traceback (most recent call last):
   File "gurk.py", line 6, in <module>
     p.Parse(s, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, 
column 1

Replace "utf-32" with "utf-16" and the problem goes away.

 and where adding such a codec
 solves that problem?
 
 In particular, why would that parser know how to process Python Unicode
 strings?

It doesn't have to. You can use an XML encoder to reencode the unicode 
string into bytes (forcing an encoding that the parser knows):

import codecs
from xml.parsers import expat

ci = codecs.lookup("xml-auto-detect")
p = expat.ParserCreate()
e = "utf-32"
s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0]
p.Parse(s, True)

 Furthermore encoding-detection might be part of the responsibility of
 the XML parser, but this decoding phase is totally distinct from the
 parsing phase, so why not put the decoding into a common library?
 
 I would not object to that - just to expose it as a codec. Adding it
 to the XML library is fine, IMO.

But it does make sense as a codec. The decoding phase of an XML parser 
has to turn a byte stream into a unicode stream. That's the job of a codec.

 There's a (currently undocumented) codecs.detect_xml_encoding() in the
 patch. We could document this function and make it public. But if
 there's no codec that uses it, this function IMHO doesn't belong in the
 codecs module. Should this function be available from xml/__init__.py or
 should be put it into something like xml/utils.py?
 
 Either - or.

OK, so should I put the C code into a _xml module?

 Finally, I think the codec is incorrect. When saving XML to a file
 (e.g. in a text editor), there should rarely be encoding errors, since
 one could use character references in many cases.
 This requires some intelligent fiddling with the errors attribute of the
 encoder.
 
 Much more than that, I think - you cannot use a character reference
 in an XML Name. So the codec would have to parse the output stream
 to know whether or not a character reference could be used.

That's what I meant by intelligent fiddling. But I agree this is way 
beyond what a text editor should do. AFAIK it is way beyond what 
existing text editors do. However using the XML codec would at least 
guarantee that the encoding specified in the XML declaration and the 
encoding used for encoding the file stay consistent.

 Correct, but as long as Python doesn't have an EBCDIC codec, that won't
 help much. Adding *detection* of EBCDIC to detect_xml_encoding() is
 rather simple though.
 
 But it does! cp037 is EBCDIC, and supported by Python.

I didn't know that. I'm going to update the patch.

Servus,
Walter


[Python-Dev] XML codec?

2007-11-07 Thread Walter Dörwald
I have a patch ready (http://bugs.python.org/issue1399) that adds an XML
codec. This codec implements encoding detection as specified in
http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing and could be
used for the decoding phase of an XML parser. Other use cases are:

The codec could be used for transcoding an XML input before passing it
to the real parser, if the parser itself doesn't support the encoding in
question.

A text editor could use the codec to decode an XML file. When the user
changes the XML declaration and resaves the file, it would be saved in
the correct encoding.

I'd like to have this codec in 2.6 and 3.0.

Any comments?

Servus,
   Walter


Re: [Python-Dev] test_calendar broken on trunk

2007-08-28 Thread Walter Dörwald
[EMAIL PROTECTED] wrote:

 On the community trunk buildbots this checkin:
 
1.
 
   Changed by: walter.doerwald
   Changed at: Tue 28 Aug 2007 16:38:27
   Branch: trunk
   Revision: 57620
 
   Changed files:
   * trunk/Doc/library/calendar.rst
   * trunk/Lib/calendar.py
   Comments:
 
   Fix title endtag in HTMLCalender.formatyearpage(). Fix documentation
   for
   HTMLCalender.formatyearpage() (there's no themonth parameter).
 
   This fixes issue1046.
 
 broke test_calendar.  Details here:
 
 http://www.python.org/dev/buildbot/community/all/

Should be fixed in r57628.

Servus,
Walter


Re: [Python-Dev] T_PYSSIZET in Include/structmember.h can be hidden

2007-08-03 Thread Walter Dörwald
Neal Norwitz wrote:
 Martin,
 
 Do you know why T_PYSSIZET is inside a #ifdef HAVE_LONG_LONG?  That
 seems like a mistake.  Here's the code:
 
 #ifdef HAVE_LONG_LONG
 #define T_LONGLONG  17
 #define T_ULONGLONG 18
 #define T_PYSSIZET   19 /* Py_ssize_t */
 #endif /* HAVE_LONG_LONG */
 
 ISTM, that T_PYSSIZET should be after the #endif.  Was this a mistake
 or intentional?

That was my mistake. It should be outside of the #ifdef.

Servus,
Walter


Re: [Python-Dev] itertools addition: getitem()

2007-07-11 Thread Walter Dörwald
Giovanni Bajo wrote:

 On 09/07/2007 21.23, Walter Dörwald wrote:
 
   from ll.xist import parsers, xfind
   from ll.xist.ns import html
   e = parsers.parseURL("http://www.python.org", tidy=True)
   print e.walknode(html.h2 & xfind.hasclass("news"))[-1]
 Google Adds Python Support to Google Calendar Developer's Guide


 Get the first comment line from a python file:

   getitem((line for line in open("Lib/codecs.py") if 
 line.startswith("#")), 0)
 '### Registry and builtin stateless codec functions\n'


 Create a new unused identifier:

   def candidates(base):
 ... yield base
 ... for suffix in count(2):
 ...     yield "%s%d" % (base, suffix)
 ...
   usedids = set(("foo", "bar"))
   getitem((i for i in candidates("foo") if i not in usedids), 0)
 'foo2'
 
 You keep posting examples where you call your getitem() function with 0 as 
 index, or -1.
 
 getitem(it, 0) already exists and it's spelled it.next(). getitem(it, -1) 
 might be useful in fact, and it might be spelled last(it) (or it.last()). 
 Then 
one may want to add first() for symmetry, but that's it:
 
 first(i for i in candidates("foo") if i not in usedids)
 last(line for line in open("Lib/codecs.py") if line[0] == '#')
 
 Are there real-world use cases for getitem(it, n) with n not in (0, -1)? I 
 share Raymond's feelings on this. And by the way, if you wonder, I have these 
 exact feelings as well for islice... :)

It's useful for screen scraping HTML. Suppose you have the following HTML 
table:

<table>
<tr><td>01.01.2007</td><td>12.34</td><td>Foo</td></tr>
<tr><td>13.01.2007</td><td>23.45</td><td>Bar</td></tr>
<tr><td>04.02.2007</td><td>45.56</td><td>Baz</td></tr>
<tr><td>27.02.2007</td><td>56.78</td><td>Spam</td></tr>
<tr><td>17.03.2007</td><td>67.89</td><td>Eggs</td></tr>
<tr><td>  </td><td>164.51</td><td>Total</td></tr>
<tr><td>  </td><td>(incl. VAT)</td><td></td></tr>
</table>

To extract the total sum, you want the second column from the second to 
last row, i.e. something like:
row = getitem((r for r in table if r.name == "tr"), -2)
col = getitem((c for c in row if c.name == "td"), 1)

Servus,
Walter


Re: [Python-Dev] itertools addition: getitem()

2007-07-09 Thread Walter Dörwald
Raymond Hettinger wrote:
 [Walter Dörwald]
 I'd like to propose the following addition to itertools: A function
 itertools.getitem() which is basically equivalent to the following
 python code:

 _default = object()

  def getitem(iterable, index, default=_default):
      try:
          return list(iterable)[index]
      except IndexError:
          if default is _default:
              raise
          return default

 but without materializing the complete list. Negative indexes are
 supported too (this requires additional temporary storage for abs(index)
 objects).
 
 Why not use the existing islice() function?
 
   x = list(islice(iterable, i, i+1)) or default

This doesn't work, because it produces a list

 >>> list(islice(xrange(10), 2, 3)) or 42
 [2]

The following would work:
   x = (list(islice(iterable, i, i+1)) or [default])[0]

However islice() doesn't support negative indexes, getitem() does.
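For what it's worth, the negative-index case can be sketched without materializing the whole input, using a bounded deque for the abs(index) items of temporary storage the proposal mentions. This is a sketch in modern Python, not the proposed itertools implementation:

```python
from collections import deque
from itertools import islice

_default = object()

def getitem(iterable, index, default=_default):
    # Sketch of the proposed semantics without building list(iterable).
    if index >= 0:
        found = list(islice(iterable, index, index + 1))
    else:
        # Keep only the last abs(index) items seen; the target item is
        # the oldest one, and only if the deque actually filled up.
        tail = deque(iterable, maxlen=-index)
        found = [tail[0]] if len(tail) == -index else []
    if found:
        return found[0]
    if default is _default:
        raise IndexError("index out of range")
    return default
```

So `getitem(it, -2)` consumes the iterator but never holds more than two items at a time.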

 Also, as a practical matter, I think it is a bad idea to introduce
 __getitem__ style access to itertools because the starting point
 moves with each consecutive access:
 
# access items 0, 2, 5, 9, 14, 20, ...
for i in range(10):
    print getitem(iterable, i)
 
 Worse, this behavior changes depending on whether the iterable
 is re-iterable (a string would yield consecutive items while a
 generator would skip around as shown above).

islice() has the same problem:

 >>> from itertools import *
 >>> iterable = iter(xrange(100))
 >>> for i in range(10):
 ...     print list(islice(iterable, i, i+1))
[0]
[2]
[5]
[9]
[14]
[20]
[27]
[35]
[44]
[54]

 >>> iterable = xrange(100)
 >>> for i in range(10):
 ...     print list(islice(iterable, i, i+1))
[0]
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]

 Besides being a bug factory, I think the getitem proposal would
 tend to steer people down the wrong road, away from more
 natural solutions to problems involving iterators.

I don't think that
   (list(islice(iterable, i, i+1)) or [default])[0]
is more natural than
   getitem(iterable, i, default)

 A basic step
 in learning the language is to differentiate between sequences
 and general iterators -- we should not conflate the two.

Servus,
   Walter


Re: [Python-Dev] itertools addition: getitem()

2007-07-09 Thread Walter Dörwald
Guido van Rossum wrote:
 On 7/9/07, Raymond Hettinger [EMAIL PROTECTED] wrote:
 Also, as a practical matter, I think it is a bad idea to introduce
 __getitem__ style access to itertools because the starting point
 moves with each consecutive access:

 # access items 0, 2, 5, 9, 14, 20, ...
 for i in range(10):
 print getitem(iterable, i)

 Worse, this behavior changes depending on whether the iterable
 is re-iterable (a string would yield consecutive items while a
 generator would skip around as shown above).

 Besides being a bug factory, I think the getitem proposal would
 tend to steer people down the wrong road, away from more
 natural solutions to problems involving iterators.  A basic step
 in learning the language is to differentiate between sequences
 and general iterators -- we should not conflate the two.
 
 But doesn't the very same argument also apply against islice(), which
 you just offered as an alternative?

Exactly.

 PS. If Walter is also at EuroPython, maybe you two could discuss this in
 person?

Sorry, I won't be at EuroPython.

Servus,
   Walter


Re: [Python-Dev] itertools addition: getitem()

2007-07-09 Thread Walter Dörwald
Raymond Hettinger wrote:

 From: Guido van Rossum [EMAIL PROTECTED]
 But doesn't the very same argument also apply against islice(), which
 you just offered as an alternative?
 
 Not really.  The use cases for islice() typically do not involve
 repeated slices of an iterator unless it is slicing off the front
 few elements on each pass.  In contrast, getitem() is all about
 grabbing something other than the frontmost element and seems
 to be intended for repeated calls on the same iterator. 

That wouldn't make sense as getitem() consumes the iterator! ;)

But seriously: perhaps the name getitem() is misleading? What about 
item() or pickitem()?

 And its
 support for negative indices seems somewhat weird in the
 context of general purpose iterators:  getitem(genprimes(), -1).

This does indeed make as much sense as sum(itertools.count()).

 I'll study Walter's use case but my instincts say that adding
 getitem() will do more harm than good.

Here's the function in use (somewhat invisibly, as it's used by the 
walknode() method). This gets the oldest news from Python's homepage:

>>> from ll.xist import parsers, xfind
>>> from ll.xist.ns import html
>>> e = parsers.parseURL("http://www.python.org", tidy=True)
>>> print e.walknode(html.h2 & xfind.hasclass("news"))[-1]
Google Adds Python Support to Google Calendar Developer's Guide


Get the first comment line from a python file:

>>> getitem((line for line in open("Lib/codecs.py") if line.startswith("#")), 0)
'### Registry and builtin stateless codec functions\n'


Create a new unused identifier:

>>> def candidates(base):
...     yield base
...     for suffix in count(2):
...         yield "%s%d" % (base, suffix)
...
>>> usedids = set(("foo", "bar"))
>>> getitem((i for i in candidates("foo") if i not in usedids), 0)
'foo2'

Servus,
Walter



[Python-Dev] itertools addition: getitem()

2007-07-08 Thread Walter Dörwald
I'd like to propose the following addition to itertools: A function 
itertools.getitem() which is basically equivalent to the following 
python code:

_default = object()

def getitem(iterable, index, default=_default):
    try:
        return list(iterable)[index]
    except IndexError:
        if default is _default:
            raise
        return default

but without materializing the complete list. Negative indexes are 
supported too (this requires additional temporary storage for abs(index) 
objects).

The patch is available at http://bugs.python.org/1749857
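For reference, here is a lazy sketch with the same semantics that never materializes the full list (my reading of the proposal, not the code from the patch):

```python
from collections import deque
from itertools import islice

_default = object()

def getitem(iterable, index, default=_default):
    # Lazy equivalent of list(iterable)[index] with an optional default.
    if index >= 0:
        # Skip ahead and take the index-th item, if there is one.
        for item in islice(iterable, index, index + 1):
            return item
    else:
        # Keep only the last abs(index) items in a bounded deque.
        tail = deque(iterable, maxlen=-index)
        if len(tail) == -index:
            return tail[0]
    if default is _default:
        raise IndexError("index out of range")
    return default
```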

Servus,
Walter


Re: [Python-Dev] itertools addition: getitem()

2007-07-08 Thread Walter Dörwald
Guido van Rossum wrote:
 On 7/8/07, Georg Brandl [EMAIL PROTECTED] wrote:
 Guido van Rossum schrieb:
 How important is it to have the default in this API? __getitem__()
 doesn't have a default; instead, there's a separate API get() that
 provides a default (and I find defaulting to None more manageable than
 the _default = object() pattern).

Of course it isn't implemented this way in the C version.

 getattr() has a default too, while __getattr__ hasn't...
 
 Fair enough.
 
 But I still want to hear of a practical use case for the default here.

In most cases

foo = getitem(iterable, 0, None)
if foo is not None:
   ...

is simpler than:

try:
   foo = getitem(iterable, 0)
except IndexError:
   pass
else:
   ...

Here is a use case from one of my import XML into the database scripts:

compid = getitem(root[ns.Company_company_id], 0, None)
if compid:
   compid = int(compid)

The expression root[ns.company_id] returns an iterator that produces all 
children of the root node that are of the element type company_id. If 
there is a company_id its content will be turned into an int, if not 
None will be used.
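For the index-0-with-default case specifically, the same effect is available in later Python versions via the two-argument form of the next() builtin (shown only for comparison; it does not cover index != 0):

```python
# like getitem(iterable, 0, None): first item, or None if empty
compid = next(iter([]), None)           # no children -> None
first = next(iter(["42", "43"]), None)  # -> "42"
```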

Servus,
Walter



Re: [Python-Dev] itertools addition: getitem()

2007-07-08 Thread Walter Dörwald
Guido van Rossum wrote:

 On 7/8/07, Walter Dörwald [EMAIL PROTECTED] wrote:
 [quoting Guido]
  But I still want to hear of a practical use case for the default here.

 In most cases

 foo = getitem(iterable, 0, None)
 if foo is not None:
...

 is simpler than:

 try:
foo = getitem(iterable, 0)
 except IndexError:
pass
 else:
...

 Here is a use case from one of my import XML into the database scripts:

 compid = getitem(root[ns.Company_company_id], 0, None)
 if compid:
compid = int(compid)

 The expression root[ns.company_id] returns an iterator that produces all
 children of the root node that are of the element type company_id. If
 there is a company_id its content will be turned into an int, if not
 None will be used.
 
 Ahem. I hope you have a better use case for getitem() than that
 (regardless of the default issue). I find it clearer to write that as
 
 try:
  compid = root[ns.company_id].next()
 except StopIteration:
  compid = None
 else:
  compid = int(compid)
 
 While this is more lines, it doesn't require one to know about
 getitem() on an iterator. This is the same reason why setdefault() was
 a mistake -- it's too obscure to invent a compact spelling for it
 since the compact spelling has to be learned or looked up.

Well I have used (a Python version of) this getitem() function to 
implement a library that can match a CSS3 expression against an XML 
tree. For implementing the nth-child(), nth-last-child(), nth-of-type() 
and nth-last-of-type() pseudo classes (see 
http://www.w3.org/TR/css3-selectors/#structural-pseudos) getitem() was 
very useful.

Servus,
Walter


Re: [Python-Dev] [Python-3000] Python 3000 Status Update (Long!)

2007-06-19 Thread Walter Dörwald
Georg Brandl wrote:
 Nick Coghlan schrieb:
 Georg Brandl wrote:
 Guido van Rossum schrieb:
 I've written up a comprehensive status report on Python 3000. Please read:

 http://www.artima.com/weblogs/viewpost.jsp?thread=208549
 Thank you! Now I have something to show to interested people except read
 the PEPs.

 A minuscule nit: the rot13 codec has no library equivalent, so it won't be
 supported anymore :)
 Given that there are valid use cases for bytes-to-bytes translations, 
 and a common API for them would be nice, does it make sense to have an 
 additional category of codec that is invoked via specific recoding 
 methods on bytes objects? For example:

encoded = data.encode_bytes('bz2')
decoded = encoded.decode_bytes('bz2')
assert data == decoded
 
 This is exactly what I proposed a while before under the name
 bytes.transform().
 
 IMO it would make a common use pattern much more convenient and
 should be given thought.
 
 If a PEP is called for, I'd be happy to at least co-author it.

Codecs are a major exception to Guido's law: Never have a parameter
whose value switches between completely unrelated algorithms.

Why don't we put all string transformation functions into a common
module (the string module might be a good place):

>>> import string
>>> string.rot13('abc')

Servus,
   Walter


Re: [Python-Dev] [Python-3000] Python 3000 Status Update (Long!)

2007-06-19 Thread Walter Dörwald
Georg Brandl wrote:
 Walter Dörwald schrieb:
 Georg Brandl wrote:
 Nick Coghlan schrieb:
 Georg Brandl wrote:
 Guido van Rossum schrieb:
 I've written up a comprehensive status report on Python 3000. Please 
 read:

 http://www.artima.com/weblogs/viewpost.jsp?thread=208549
 Thank you! Now I have something to show to interested people except read
 the PEPs.

 A minuscule nit: the rot13 codec has no library equivalent, so it won't be
 supported anymore :)
 Given that there are valid use cases for bytes-to-bytes translations, 
 and a common API for them would be nice, does it make sense to have an 
 additional category of codec that is invoked via specific recoding 
 methods on bytes objects? For example:

encoded = data.encode_bytes('bz2')
decoded = encoded.decode_bytes('bz2')
assert data == decoded
 This is exactly what I proposed a while before under the name
 bytes.transform().

 IMO it would make a common use pattern much more convenient and
 should be given thought.

 If a PEP is called for, I'd be happy to at least co-author it.
 Codecs are a major exception to Guido's law: Never have a parameter
 whose value switches between completely unrelated algorithms.
 
 I don't think that applies here. This is more like __import__():
 depending on the first parameter, completely different things can happen.
 Yes, the same import algorithm is used, but in the case of
 bytes.encode_bytes, the same algorithm is used to find and execute the
 codec.

What would a registry of transformation algorithms buy us compared to a
module with transformation functions?

The function version is shorter:

   transform.rot13('foo')

compared to:

   'foo'.transform('rot13')

If each transformation has its own function, these functions can have
their own arguments, e.g.
   transform.bz2encode(data: bytes, level: int=6) -> bytes

Of course str.transform() could pass along all arguments to the
registered function, but that's worse from a documentation viewpoint,
because the real signature is hidden deep in the registry.
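A minimal sketch of what such a module of plain functions could look like (the names and signatures are illustrative only, wrapping the existing bz2 and codecs machinery):

```python
import bz2
import codecs

def rot13(s):
    # rot13 as an ordinary function with a visible signature
    return codecs.encode(s, "rot_13")

def bz2encode(data, level=6):
    # the level argument is documented right here, not hidden in a registry
    return bz2.compress(data, level)

def bz2decode(data):
    return bz2.decompress(data)
```

Each function's arguments show up in its own signature and docstring, which is exactly the documentation advantage described above.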

Servus,
   Walter

