[Python-Dev] crash in dict on gc collect
I wonder if this is similar to Kevin's problem? I couldn't reproduce
his problem though. This happens with both debug and release builds.
Not sure how to reduce the test case. pychecker was just iterating
through the byte codes. It wasn't doing anything particularly
interesting.
./python pychecker/pychecker/checker.py Lib/encodings/cp1140.py
0x004cfa18 in visit_decref (op=0x661180, data=0x0) at gcmodule.c:270
270 if (PyObject_IS_GC(op)) {
(gdb) bt
#0 0x004cfa18 in visit_decref (op=0x661180, data=0x0) at gcmodule.c:270
#1 0x004474ab in dict_traverse (op=0x7cdd90, visit=0x4cf9e0 <visit_decref>,
arg=0x0) at dictobject.c:1819
#2 0x004cfaf0 in subtract_refs (containers=0x670240) at gcmodule.c:295
#3 0x004d07fd in collect (generation=0) at gcmodule.c:790
#4 0x004d0ad1 in collect_generations () at gcmodule.c:897
#5 0x004d1505 in _PyObject_GC_Malloc (basicsize=56) at gcmodule.c:1332
#6 0x004d1542 in _PyObject_GC_New (tp=0x64f4a0) at gcmodule.c:1342
#7 0x0041d992 in PyInstance_NewRaw (klass=0x2a95dffcc0,
dict=0x800e80) at classobject.c:505
#8 0x0041dab8 in PyInstance_New (klass=0x2a95dffcc0,
arg=0x2a95f5f9e0, kw=0x0) at classobject.c:525
#9 0x0041aa4e in PyObject_Call (func=0x2a95dffcc0,
arg=0x2a95f5f9e0, kw=0x0) at abstract.c:1802
#10 0x0049ecd2 in do_call (func=0x2a95dffcc0,
pp_stack=0x7fbfffb5b0, na=3, nk=0) at ceval.c:3785
#11 0x0049e46f in call_function (pp_stack=0x7fbfffb5b0,
oparg=3) at ceval.c:3597
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Add pure python PNG writer module to stdlib?
On Jun 10, 2006, at 10:52 PM, Johann C. Rocholl wrote:
>>> Does anybody think it could go into stdlib before the feature
>>> freeze for 2.5?
>>
>> Nope. To get added to the stdlib there needs to be support from the
>> community that the module is useful and best-of-breed. Try posting on
>> c.l.py and see if people pick it up and like it. No way that is going to
>> happen before b1. But there is always 2.6.
>
> That's what I thought. My remote hope was that there would be
> immediate consensus on python-dev about both the 'useful' and
> 'best-of-breed' parts. Anybody else with a +1? ;-)
>
> Seriously, it's totally fine with me if the module doesn't make it
> into 2.5, or even if it never makes it into stdlib. I'm just offering
> it with some enthusiasm.

The best way to do this would be to make it available as its own package. Give it a setup.py, stick it on CheeseShop, etc.

For performance and memory usage reasons it would probably make sense to take an iterator that returns a scanline at a time. The current implementation does a lot more allocations than it needs to (full image, then one str per scanline). It also asserts type is str, when a buffer or mmap object would work perfectly well otherwise. If reading from a file or something you could skip the full allocation and a lot of memcpy by reading a scanline at a time.

I'd also like to see RGBA support. Often the reason for choosing png over other lossless formats is its support for alpha. For your use case it's irrelevant, but there are many use cases that need the alpha channel.

But to reiterate, further discussion of this really belongs on c.l.py for now...

-bob
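The scanline-at-a-time suggestion can be sketched roughly as follows. This is not the module under discussion: `write_png` and `_chunk` are hypothetical names, the code is written for a current Python 3, and it assumes 8-bit RGBA input with no per-row filtering. Each row may be any bytes-like object, and compressed data is emitted as it becomes available rather than after the whole image has been assembled.

```python
import io
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(tag, data):
    """Frame one PNG chunk: length, tag, data, then a CRC over tag + data."""
    body = tag + data
    return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body))


def write_png(out, width, height, rows):
    """Write an 8-bit RGBA PNG, consuming one scanline at a time from `rows`."""
    out.write(PNG_SIGNATURE)
    # bit depth 8, color type 6 (RGBA), default compression/filter, no interlace
    out.write(_chunk(b"IHDR", struct.pack(">IIBBBBB", width, height, 8, 6, 0, 0, 0)))

    compressor = zlib.compressobj()
    seen = 0
    for row in rows:
        view = memoryview(row)  # accepts bytes, bytearray, mmap slices, ...
        if view.nbytes != width * 4:
            raise ValueError("each scanline must hold width * 4 bytes")
        # Every scanline is prefixed with filter type 0 (no filtering).
        data = compressor.compress(b"\x00") + compressor.compress(view)
        if data:
            out.write(_chunk(b"IDAT", data))
        seen += 1
    if seen != height:
        raise ValueError("expected %d scanlines, got %d" % (height, seen))

    out.write(_chunk(b"IDAT", compressor.flush()))
    out.write(_chunk(b"IEND", b""))


# Usage: a 2x2 half-transparent red image fed from a generator.
buffer = io.BytesIO()
red = bytes([255, 0, 0, 128])
write_png(buffer, 2, 2, (red * 2 for _ in range(2)))
png_bytes = buffer.getvalue()
print(len(png_bytes), "bytes written")
```

Splitting the compressed stream across several IDAT chunks is allowed by the PNG format, which is what lets the writer avoid buffering the full image.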
Re: [Python-Dev] 2.5 issues need resolving in a few days
Fred L. Drake, Jr. wrote:
> With the introduction of the xmlcore package in Python 2.5, should we
> document xml.etree or xmlcore.etree? If someone installs PyXML with
> Python 2.5, I don't think they're going to get xml.etree, which will be
> really confusing. We can be sure that xmlcore.etree will be there.

I think it would be unfortunate if an external, mostly unmaintained package could claim absolute ownership of the xml package root.

how about tweaking the xml loader to map "xml.foo" to "_xmlplus.foo" only if that subpackage really exists ?
Re: [Python-Dev] 2.5 issues need resolving in a few days
On 11 jun 2006, at 12.09, Fredrik Lundh wrote:
> Fred L. Drake, Jr. wrote:
>
>> With the introduction of the xmlcore package in Python 2.5, should
>> we document xml.etree or xmlcore.etree? If someone installs PyXML
>> with Python 2.5, I don't think they're going to get xml.etree, which
>> will be really confusing. We can be sure that xmlcore.etree will be
>> there.
>
> I think it would be unfortunate if an external, mostly unmaintained
> package could claim absolute ownership of the xml package root.
>
> how about tweaking the xml loader to map "xml.foo" to "_xmlplus.foo"
> only if that subpackage really exists ?

I'm a bit confused by what the problem is. I thought this was all handled like it should be now.

>>> import xml.etree
>>> xml.etree
>>> import xml.sax
>>> xml.sax

It picks up modules from both places.

//Simon
Re: [Python-Dev] UUID module
Ka-Ping Yee <[EMAIL PROTECTED]> wrote:
> Quite a few people have expressed interest in having UUID
> functionality in the standard library, and previously on this
> list some suggested possibly using the uuid.py module i wrote:
>
> http://zesty.ca/python/uuid.py
Some comments on the code:
> for dir in ['', r'c:\windows\system32', r'c:\winnt\system32']:
Can we get rid of these absolute paths? Something like this should suffice:
>>> from ctypes import *
>>> buf = create_string_buffer(4096)
>>> windll.kernel32.GetSystemDirectoryA(buf, 4096)
17
>>> buf.value.decode("mbcs")
u'C:\\WINNT\\system32'
> for function in functions:
>     try:
>         _node = function()
>     except:
>         continue
This also hides typos and whatnot. I guess it's better if each function catches
its own exceptions, and either returns None or raises a common exception (like a
class _GetNodeError(RuntimeError)) which is then caught.
Giovanni Bajo
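A minimal sketch of the narrower exception handling suggested above, written for a current Python 3. All names here (`GetNodeError`, `first_node`, and the three stand-in lookup functions) are hypothetical and are not taken from uuid.py. Only the dedicated exception is swallowed, so a genuine bug such as a misspelled name still surfaces.

```python
class GetNodeError(RuntimeError):
    """Raised by a lookup function that could not determine a node address."""


def node_from_missing_tool():
    # Stand-in for a lookup that fails in an expected, recoverable way.
    raise GetNodeError("tool not available on this platform")


def node_from_constant():
    # Stand-in for a lookup that succeeds; the value is arbitrary.
    return 0x0123456789AB


def node_with_typo():
    # Deliberate bug: the name below is undefined, so this raises NameError.
    return undefined_name


def first_node(functions):
    """Return the first node any lookup function yields."""
    for function in functions:
        try:
            return function()
        except GetNodeError:
            continue  # expected failure: try the next source
    raise GetNodeError("no lookup function produced a node")


print(hex(first_node([node_from_missing_tool, node_from_constant])))

try:
    first_node([node_with_typo, node_from_constant])
except NameError as exc:
    print("bug surfaced instead of being hidden:", exc)
```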
Re: [Python-Dev] 2.5 issues need resolving in a few days
Simon Percivall wrote:
>> how about tweaking the xml loader to map "xml.foo" to "_xmlplus.foo"
>> only if that subpackage really exists ?
>
> I'm a bit confused by what the problem is. I thought this was all
> handled like it should be now.

that's how I thought things were done, but then I read Fred's post, and looked at the source code, and didn't see this line:

    _xmlplus.__path__.extend(xmlcore.__path__)

or-maybe-someone's-been-using-the-time-machine-ly yrs /F
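The effect of extending one package's `__path__` with another's can be sketched with two throwaway packages. The names `demo_plus` and `demo_core` are hypothetical stand-ins for _xmlplus and xmlcore, and the code targets a current Python 3; it only illustrates the general mechanism, not what the real xml package does.

```python
import importlib
import os
import sys
import tempfile


def make_package(root, name, modules):
    """Create a regular package `name` under `root` with the given modules."""
    pkg_dir = os.path.join(root, name)
    os.makedirs(pkg_dir)
    open(os.path.join(pkg_dir, "__init__.py"), "w").close()
    for mod_name, source in modules.items():
        with open(os.path.join(pkg_dir, mod_name + ".py"), "w") as handle:
            handle.write(source)


root = tempfile.mkdtemp()
make_package(root, "demo_core", {"etree": "WHERE = 'core'\n"})
make_package(root, "demo_plus", {"sax": "WHERE = 'plus'\n"})
sys.path.insert(0, root)
importlib.invalidate_caches()

import demo_core
import demo_plus

# After this, submodule lookups in demo_plus fall back to demo_core's directory.
demo_plus.__path__.extend(demo_core.__path__)

sax = importlib.import_module("demo_plus.sax")
etree = importlib.import_module("demo_plus.etree")
print(sax.WHERE, etree.WHERE)  # plus core
```

Because the package's own directory comes first in `__path__`, its submodules win whenever both locations provide the same name.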
[Python-Dev] sgmllib Comments
Planet is a feed aggregator written in Python. It depends heavily on SGMLLib. A recent bug report turned out to be a deficiency in sgmllib, and I've submitted a test case and a patch[1] (use or discard the patch, it is the test that I care about).

While looking around, a few things surfaced. For starters, it would seem that the version of sgmllib in SVN HEAD will selectively unescape certain character references that might appear in an attribute. I say selectively, as:

* it will unescape &amp;
* it won't unescape &copy;
* it will unescape &#38;
* it won't unescape &#x26;
* it will unescape &#146;
* it won't unescape &#8217;

There are a number of issues here. While not unescaping anything is suboptimal, at least the recipient is aware of exactly which characters have been unescaped (i.e., none of them). The proposed solution makes it impossible for the recipient to know which characters are unescaped, and which are original. (Note: feeds often contain such abominations as &amp;copy; which the new code will treat indistinguishably from &copy;)

Additionally, there is a unicode issue here - one that is shared by handle_charref, but at least that method is overrideable. If unescaping remains, do it for hex character references and for values greater than 8-bits, i.e., use unichr instead of chr if the value is greater than 127.

- Sam Ruby

[1] http://tinyurl.com/j4a6n
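For comparison, here is how a uniform unescaping pass behaves. This is only an illustration using `html.unescape` from the Python 3 standard library, which is not the sgmllib code being discussed; the sample strings are chosen for this sketch. It handles named, decimal, and hexadecimal references alike, and it shows why a single pass over double-escaped input leaves text that looks like an untouched entity.

```python
import html

# Named, decimal, and hexadecimal references are all handled the same way.
for sample in ["&amp;", "&copy;", "&#38;", "&#x26;", "&#8217;"]:
    print(repr(sample), "->", repr(html.unescape(sample)))

# One pass over double-escaped input yields text that is indistinguishable
# from an entity that was never escaped in the first place.
once = html.unescape("&amp;copy;")
print(repr(once))                 # '&copy;'
print(repr(html.unescape(once)))  # '©'
```

A consumer that receives `'&copy;'` cannot tell whether it came from `&amp;copy;` after unescaping or from a literal `&copy;` that was left alone, which is the ambiguity described above.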
Re: [Python-Dev] Switch statement
Greg Ewing wrote:
> [EMAIL PROTECTED] wrote:
>
>
>>switch raw_input("enter a, b or c: "):
>>    case 'a':
>>        print 'yay! an a!'
>>    case 'b':
>>        print 'yay! a b!'
>>    case 'c':
>>        print 'yay! a c!'
>>    else:
>>        print 'hey dummy! I said a, b or c!'
>
>
> Before accepting this, we could do with some debate about the
> syntax. It's not a priori clear that C-style switch/case is
> the best thing to adopt.
Since you don't have the 'fall-through' behavior of C, I would also
assume that you could associate more than one value with a case, i.e.:
    case 'a', 'b', 'c':
        ...
It seems to me that the value of a 'switch' statement is that it is a
computed jump - that is, instead of having to iteratively test a bunch
of alternatives, you can directly jump to the code for a specific value.
I can see this being very useful for parser generators and state machine
code. At the moment, similar things can be done with hash tables of
functions, but those have a number of limitations, such as the fact that
they can't access local variables.
I don't have any specific syntax proposals, but I notice that the suite
that follows the switch statement is not a normal suite, but a
restricted one, and I am wondering if we could come up with a syntax
that avoids having a special suite.
Here's an (ugly) example, not meant as a serious proposal:
    select (x) when 'a':
        ...
    when 'b', 'c':
        ...
    else:
        ...
The only real difference between this and an if-else chain is that the
compiler knows that all of the test expressions are constants and can be
hashed at compile time.
-- Talin
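The "hash table of functions" alternative mentioned above can be sketched like this in a current Python 3. The handler names and the `respond` helper are hypothetical, chosen to mirror the a/b/c example quoted earlier in the thread.

```python
def on_a():
    return "yay! an a!"


def on_b():
    return "yay! a b!"


def on_c():
    return "yay! a c!"


def on_other():
    return "hey dummy! I said a, b or c!"


# One hash lookup selects the handler; no chain of comparisons is needed.
HANDLERS = {"a": on_a, "b": on_b, "c": on_c}


def respond(choice):
    """Dispatch on `choice`, falling back to on_other for unknown values."""
    return HANDLERS.get(choice, on_other)()


for choice in ("a", "c", "z"):
    print(choice, "->", respond(choice))
```

The limitation described above shows up as soon as a handler needs the caller's state: each handler is a separate function, so anything it should read or update has to be passed in explicitly or kept in a shared object.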
Re: [Python-Dev] sgmllib Comments
On Sun, Jun 11, 2006, Sam Ruby wrote:
>
> Planet is a feed aggregator written in Python. It depends heavily on
> SGMLLib. A recent bug report turned out to be a deficiency in sgmllib,
> and I've submitted a test case and a patch[1] (use or discard the patch,
> it is the test that I care about).
>
> [1] http://tinyurl.com/j4a6n

When providing links to SF, please use the python.org tinyurl equivalent to ensure that people can easily see the bug/patch number:

http://www.python.org/sf?id=1504333

--
Aahz ([EMAIL PROTECTED]) <*> http://www.pythoncraft.com/

"I saw `cout' being shifted "Hello world" times to the left and stopped right there." --Steve Gonedes
[Python-Dev] Import semantics
Python and Jython import semantics differ on how sub-packages should be accessed after importing some module:

Jython 2.1 on java1.5.0 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> import xml
>>> xml.dom

Python 2.4.2 (#67, Sep 28 2005, 12:41:11) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml
>>> xml.dom
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'module' object has no attribute 'dom'
>>> from xml.dom import pulldom
>>> xml.dom

Note that in Jython importing a module makes all subpackages beneath it available, whereas in Python only the tokens available in __init__.py are accessible. But if you do load the submodule later, even without binding it directly in your namespace, it becomes accessible too. This seems unexpected to me: I would expect it to be available only if I did some "import xml.dom" at some point.

My problem is that in Pydev, in static analysis, I would only get the tokens available for actually imported modules, but that's not true for Jython, and I'm not sure if the current behaviour in Python was expected. So... which would be the right semantics for this?

Thanks,
Fabio
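The CPython side of this behaviour can be probed with a small script. This sketch assumes a current CPython 3 (the session quoted above is Python 2.4) and runs the probe in a fresh interpreter so that modules already imported by the calling process do not affect the result; the `probe` string and variable names are made up for the example.

```python
import subprocess
import sys

# Run in a fresh, isolated interpreter so the parent's sys.modules is irrelevant.
probe = (
    "import xml\n"
    "before = hasattr(xml, 'dom')\n"
    "from xml.dom import pulldom\n"
    "after = hasattr(xml, 'dom')\n"
    "print(before, after)\n"
)
result = subprocess.run(
    [sys.executable, "-I", "-c", probe],
    capture_output=True,
    text=True,
    check=True,
)
before, after = result.stdout.split()
print("xml.dom visible after 'import xml' alone:", before)
print("xml.dom visible after 'from xml.dom import pulldom':", after)
```

Importing a submodule binds it as an attribute on its parent package as a side effect, which is why `xml.dom` becomes reachable even though the name `xml.dom` was never imported directly.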
Re: [Python-Dev] Switch statement
talin> Since you don't have the 'fall-through' behavior of C, I would
talin> also assume that you could associate more than one value with a
talin> case, i.e.:
talin>     case 'a', 'b', 'c':
talin>         ...
As Andrew Koenig pointed out, that's not discussed in the PEP. Given the
various examples though, I would have to assume the above is equivalent to
    case ('a', 'b', 'c'):
        ...
since in all cases the PEP implies a single expression.
talin> It seems to me that the value of a 'switch' statement is that it
talin> is a computed jump - that is, instead of having to iteratively
talin> test a bunch of alternatives, you can directly jump to the code
talin> for a specific value.
I agree, but that of course limits the expressions to constants which can be
evaluated at compile-time as I indicated in my previous mail. Also, as
someone else pointed out, that probably prevents something like
    START_TOKEN = '<'
    END_TOKEN = '>'
    ...
    switch expr:
        case START_TOKEN:
            ...
        case END_TOKEN:
            ...
The PEP states that the case clauses must accept constants, but the sample
implementation allows arbitrary expressions. If we assume that the case
expressions need not be constants, does that force the compiler to evaluate
the case expressions in the order given in the file? To make my dumb
example from yesterday even dumber:
    def f():
        switch raw_input("enter b, d or f:"):
            case incr('a'):
                print 'yay! a b!'
            case incr('b'):
                print 'yay! a d!'
            case incr('c'):
                print 'yay! an f!'
            else:
                print 'hey dummy! I said b, d or f!'

    _n = 0
    def incr(c):
        global _n
        try:
            return chr(ord(c)+1+_n)
        finally:
            _n += 1
            print _n
The cases must be evaluated in the order they are written for the example to
work properly.
The tension between efficient run-time and Python's highly dynamic nature
would seem to prevent the creation of a switch statement that will satisfy
all demands.
Skip
Re: [Python-Dev] Switch statement
Talin wrote:
> I don't have any specific syntax proposals, but I notice that the suite
> that follows the switch statement is not a normal suite, but a
> restricted one, and I am wondering if we could come up with a syntax
> that avoids having a special suite.

don't have K&R handy, but I'm pretty sure they put switch and case at the same level (just like if/else), thus eliminating the need for silly special suites.

> The only real difference between this and an if-else chain is that the
> compiler knows that all of the test expressions are constants and can be
> hashed at compile time.

the compiler can of course figure that out also for if/elif/else statements, by inspecting the AST. the only advantage for switch/case is user syntax...
Re: [Python-Dev] Switch statement
[EMAIL PROTECTED] wrote:
> talin> Since you don't have the 'fall-through' behavior of C, I would
> talin> also assume that you could associate more than one value with a
> talin> case, i.e.:
>
> talin>     case 'a', 'b', 'c':
> talin>         ...
>
> As Andrew Koenig pointed out, that's not discussed in the PEP. Given the
> various examples though, I would have to assume the above is equivalent to
>
>     case ('a', 'b', 'c'):
>         ...
I had recognized that ambiguity as well, but chose not to mention it :)
> since in all cases the PEP implies a single expression.
>
> talin> It seems to me that the value of a 'switch' statement is that it
> talin> is a computed jump - that is, instead of having to iteratively
> talin> test a bunch of alternatives, you can directly jump to the code
> talin> for a specific value.
>
> I agree, but that of course limits the expressions to constants which can be
> evaluated at compile-time as I indicated in my previous mail. Also, as
> someone else pointed out, that probably prevents something like
>
>     START_TOKEN = '<'
>     END_TOKEN = '>'
>
>     ...
>
>     switch expr:
>         case START_TOKEN:
>             ...
>         case END_TOKEN:
>             ...
Here's another ugly thought experiment, not meant as a serious proposal;
its intent is to stimulate ideas by breaking preconceptions. Suppose we
take the notion of a computed jump literally:
    def myfunc( x ):
        goto dispatcher[ x ]
        section s1:
            ...
        section s2:
            ...
    dispatcher=dict('a'=myfunc.s1, 'b'=myfunc.s2)
No, I am *not* proposing that Python add a goto statement. What I am
really talking about is the idea that you could (somehow) use a
dictionary as the input to a control construct.
In the above example, rather than allowing arbitrary constant
expressions as cases, we would require the compiler to generate a set of
opaque tokens representing various code fragments. These fragments would
be exactly like inner functions, except that they don't have their own
scope (and therefore have no parameters either).
Since the jump labels are symbols generated by the compiler, there's no
ambiguity about when they get evaluated.
The above example also allows these labels to be accessed externally
from the function by defining attributes on the function object itself
which correspond to the code fragments.
So in the example, the dictionary which associates specific values with
executable sections is created once, at runtime, but before the first
time that myfunc is called.
Of course, this is quite a bit clumsier than a switch statement, which
is why I say it's not a serious proposal.
-- Talin
[Python-Dev] subprocess.Popen(.... stdout=IGNORE, ...)
In the subprocess module, by default the file handles in the child
are inherited from the parent. To ignore a child's output, I can use
the stdout or stderr options to send the output to a pipe::

    p = Popen(command, stdout=PIPE, stderr=PIPE)

However, this is sensitive to the buffer deadlock problem, where for
example the buffer for stderr might become full and a deadlock occurs
because the child is blocked on writing to stderr and the parent is
blocked on reading from stdout or waiting for the child to finish.
For example, using this command will cause deadlock::

    call('cat /boot/vmlinuz'.split(), stdout=PIPE, stderr=PIPE)

Popen.communicate() implements a solution using either select() or
multiple threads (under Windows) to read from the pipes, and returns
the strings as a result. It works out like this::

    p = Popen(command, stdout=PIPE, stderr=PIPE)
    output, errors = p.communicate()
    if p.returncode != 0:
        …

Now, as a user of the subprocess module, sometimes I just want to
call some child process and simply ignore its output, and to do so I
am forced to use communicate() as above and wastefully capture and
ignore the strings. This is actually quite a common use case. "Just
run something, and check the return code". Right now, in order to do
this without polluting the parent's output, you cannot use the call()
convenience (or is there another way?).
A workaround that works under UNIX is to do this::

    FNULL = open('/dev/null', 'w')
    returncode = call(command, stdout=FNULL, stderr=FNULL)

Some feedback requested, I'd like to know what you think:
1. Would it not be nice to add an IGNORE constant to subprocess.py
that would do this automatically? I.e.::

    returncode = call(command, stdout=IGNORE, stderr=IGNORE)

Rather than capture and accumulate the output, it would find an
appropriate OS-specific way to ignore the output (the /dev/null file
above works well under UNIX; how would you do this under Windows?
I'm sure we can find something.)
2. call() should be modified to not be sensitive to the deadlock
problem, since its interface provides no way to return the
contents of the output. The IGNORE value provides a possible
solution for this.
3. With the /dev/null file solution, the following code actually
works without deadlock, because stderr is never blocked on writing
to /dev/null::

    p = Popen(command, stdout=PIPE, stderr=IGNORE)
    text = p.stdout.read()
    retcode = p.wait()

Any idea how this idiom could be supported using a more portable
solution (i.e. how would I write this idiom under Windows, is there
some equivalent to /dev/null)?
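For readers on later Python versions, the request above maps onto two portable spellings: `os.devnull`, which names the platform's null device, and `subprocess.DEVNULL`, which was added to the module in Python 3.3 and plays the role of the proposed IGNORE constant. The sketch below is written for Python 3; the noisy child command is made up for the example.

```python
import os
import subprocess
import sys

# A child that writes more to both streams than a typical pipe buffer holds.
noisy_child = [
    sys.executable,
    "-c",
    "import sys;"
    "sys.stdout.write('x' * 200000);"
    "sys.stderr.write('y' * 200000);"
    "sys.exit(3)",
]

# "Just run something, and check the return code", discarding all output.
returncode = subprocess.call(
    noisy_child, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
)
print("return code:", returncode)

# The same thing spelled with os.devnull, which also works on Windows.
with open(os.devnull, "w") as fnull:
    returncode_devnull = subprocess.call(noisy_child, stdout=fnull, stderr=fnull)

# Capture stdout only; stderr is discarded, so it can never block the child.
with subprocess.Popen(
    noisy_child, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL
) as proc:
    text = proc.stdout.read()
    retcode = proc.wait()
print(len(text), "bytes captured, return code", retcode)
```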
Re: [Python-Dev] UUID module
Thomas Heller wrote:
> I don't know if this is the uuidgen you're talking about, but
> on linux there is libuuid:

Thanks! Okay, that's in there now. Have a look at http://zesty.ca/python/uuid.py .

Phillip J. Eby wrote:
> By the way, I'd love to see a uuid.uuid() constructor that simply calls the
> platform-specific default UUID constructor (CoCreateGuid or uuidgen(2)),

I've added code to make uuid1() use uuid_generate_time() if available and uuid4() use uuid_generate_random() if available. These functions are provided on Mac OS X (in libc) and on Linux (in libuuid). Does that work for you?

I'm using the Windows UUID generation calls (UuidCreate and UuidCreateSequential in rpcrt4) only to get the hardware address, not to make UUIDs, because they yield results that aren't compliant with RFC 4122. Even worse, they actually have the variant bits set to say that they are RFC 4122, but they can have an illegal version number. If there are better alternatives on Windows, i'm happy to use them.

-- ?!ng
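The compliance problem described above can be checked mechanically. The sketch below uses the `uuid` module as it exists in the standard library today; `looks_like_rfc4122` is a hypothetical helper, the hand-built UUID is an arbitrary value chosen for illustration, and only versions 1 through 5 (the ones RFC 4122 defines) are accepted.

```python
import uuid


def looks_like_rfc4122(value):
    """True if the variant bits say RFC 4122 and the version is 1 through 5."""
    return value.variant == uuid.RFC_4122 and value.version in (1, 2, 3, 4, 5)


time_based = uuid.uuid1()
random_based = uuid.uuid4()
print(time_based, "version", time_based.version)
print(random_based, "version", random_based.version)

# Variant bits claim RFC 4122, but the version nibble is 0, which is not a
# version that RFC 4122 defines.
suspicious = uuid.UUID("00000000-0000-0000-8000-000000000000")
print(suspicious, "compliant:", looks_like_rfc4122(suspicious))
```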
[Python-Dev] Should hex() yield 'L' suffix for long numbers?
I did this earlier:

>>> hex(9999999999999)
'0x9184e729fffL'

and found it a little jarring, because i feel there's been a general trend toward getting rid of the 'L' suffix in Python. Literal long integers don't need an L anymore; they're automatically made into longs if the number is too big. And while the repr() of a long retains the L on the end, the str() of a long does not, and i rather like that.

So i kind of expected that hex() would not include the L either. I see its main job as just giving me the hex digits (in fact, for Python 3000 i'd prefer even to drop the '0x' as well), and the L seems superfluous and distracting.

What do you think? Is Python 2.5 a reasonable time to drop this L?

-- ?!ng
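For reference, this is how the same conversion looks in Python 3, where int and long are a single type and no suffix is ever produced. The 13-digit value is chosen here for illustration; `format` with the `"x"` code gives the bare digits without the `0x` prefix.

```python
n = 9999999999999

print(hex(n))          # 0x9184e729fff
print(format(n, "x"))  # 9184e729fff

# The string round-trips back to the same integer.
print(int(hex(n), 16) == n)
```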
Re: [Python-Dev] a note in random.shuffle.__doc__ ...
Terry Jones wrote:
> Suppose you have a RNG with a cycle length of 5. There's nothing to stop an
> algorithm from taking multiple already returned values and combining them
> in some (deterministic) way to generate > 5 outcomes.

No, it's not. As long as the RNG output is the only input to the algorithm, and the algorithm is deterministic, it is not possible to get more than N different outcomes. It doesn't matter what the algorithm does with the input.

> If you
> expanded what you meant by "internal states" to include the state of the
> algorithm (as well as the state of the RNG), then I'd be more inclined to
> agree.

If the algorithm can start out with more than one initial state, then the RNG is not the only input.

> Worse, if you have multiple threads / processes using the same RNG, the
> individual threads could exhibit _much_ more random behavior

Then you haven't got a deterministic algorithm.

-- Greg
Re: [Python-Dev] Pre-PEP: Allow Empty Subscript List Without Parentheses
BJörn Lindqvist wrote:
> I don't know how difficult it is to get rid of the
> implicit "return None" or even if it is doable, but if it is, it
> should, IMHO, be done.

It's been proposed before, and the conclusion was that it would cause more problems than it would solve. (Essentially it would require returning some object that raised an exception when anything at all was done to it, but such an object would cause debuggers and other introspective code to choke.)

-- Greg
Re: [Python-Dev] sgmllib Comments
On Sunday 11 June 2006 16:26, Sam Ruby wrote:
> Planet is a feed aggregator written in Python. It depends heavily on
> SGMLLib. A recent bug report turned out to be a deficiency in sgmllib,
> and I've submitted a test case and a patch[1] (use or discard the patch,
> it is the test that I care about).
And it's a nice aggregator to use, indeed!
> While looking around, a few things surfaced. For starters, it would
> seem that the version of sgmllib in SVN HEAD will selectively unescape
> certain character references that might appear in an attribute. I say
> selectively, as:
>
> * it will unescape &amp;
> * it won't unescape &copy;
> * it will unescape &#38;
> * it won't unescape &#x26;
> * it will unescape &#146;
> * it won't unescape &#8217;
And just why would you use sgmllib to handle RSS or ATOM feeds? Neither is
defined in terms of SGML. The sgmllib documentation also notes that it isn't
really a fully general SGML parser (it isn't), but that it exists primarily
as a foundation for htmllib.
> There are a number of issues here. While not unescaping anything is
> suboptimal, at least the recipient is aware of exactly which characters
> have been unescaped (i.e., none of them). The proposed solution makes
> it impossible for the recipient to know which characters are unescaped,
> and which are original. (Note: feeds often contain such abominations as
> &amp;copy; which the new code will treat indistinguishably from &copy;)
My suspicion is that the "right" thing to do at the sgmllib level is to
categorize the markup and call a method depending on what the entity
reference is, and let that handle whatever it is. For SGML, that means we
have things like &name; (entity references), &#123; (character references),
and that's it. &#x123; isn't legal SGML under any circumstance;
the "&#x" syntax was introduced with XML.
> Additionally, there is a unicode issue here - one that is shared by
> handle_charref, but at least that method is overrideable. If unescaping
> remains, do it for hex character references and for values greater than
> 8-bits, i.e., use unichr instead of chr if the value is greater than 127.
For SGML, it's worse than that, since the document character set is defined in
the SGML declaration, which is a far hairier beast than an XML
declaration. :-)
It really sounds like sgmllib is the wrong foundation for this. While the
module has some questionable behaviors, none of them are significant in
its intended context (support for htmllib). Now, I understand that
RSS has historical issues, with HTML-as-practiced getting embedded as payload
data with various flavors of escaping applied, and I'm not an expert in the
details of that. Have you looked at HTMLParser as an alternate to sgmllib?
It has better support for XHTML constructs.
-Fred
--
Fred L. Drake, Jr.
Re: [Python-Dev] Switch statement
Talin wrote:
> Since you don't have the 'fall-through' behavior of C, I would also
> assume that you could associate more than one value with a case, i.e.:
>
>     case 'a', 'b', 'c':
>         ...

Multiple values could be written

    case 'a':
    case 'b':
    case 'c':
        ...

without conflicting with the no-fallthrough semantics, since a do-nothing case can be written as

    case 'd':
        pass

> I don't have any specific syntax proposals, but I notice that the suite
> that follows the switch statement is not a normal suite, but a
> restricted one,

I don't see that as a problem. And all the proposed syntaxes I've ever seen for putting the cases at the same level as the switch look ugly to me.

-- Greg
Re: [Python-Dev] Switch statement
[EMAIL PROTECTED] wrote:
> I agree, but that of course limits the expressions to constants which can be
> evaluated at compile-time as I indicated in my previous mail.

A way out of this would be to define the semantics so that the expression values are allowed to be cached, and the order of evaluation and testing is undefined. So the first time through, the values could all be put in a dict, to be looked up thereafter.

-- Greg
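The caching semantics suggested above can be approximated in ordinary Python 3. This is only a sketch of the idea, not a proposed implementation: `CachedSwitch`, `case_value`, and the `evaluations` log are hypothetical names. The case expressions are wrapped in zero-argument callables, evaluated once on first use, and stored in a dict that serves every later lookup.

```python
START_TOKEN = "<"
END_TOKEN = ">"

evaluations = []  # records each time a case expression is evaluated


def case_value(label, value):
    """Return `value`, logging that the case expression `label` was evaluated."""
    evaluations.append(label)
    return value


class CachedSwitch:
    def __init__(self, cases, default):
        self._cases = cases      # list of (case expression thunk, handler)
        self._default = default
        self._table = None       # built lazily on the first call

    def __call__(self, value):
        if self._table is None:
            # First time through: evaluate every case expression once.
            # Callers should not rely on the order in which this happens.
            self._table = {expr(): handler for expr, handler in self._cases}
        return self._table.get(value, self._default)()


classify = CachedSwitch(
    cases=[
        (lambda: case_value("START_TOKEN", START_TOKEN), lambda: "start"),
        (lambda: case_value("END_TOKEN", END_TOKEN), lambda: "end"),
    ],
    default=lambda: "other",
)

results = [classify(ch) for ch in "<a>"]
print(results)           # ['start', 'other', 'end']
print(len(evaluations))  # 2
```

Under these semantics an example whose case expressions have side effects that depend on evaluation order, like the incr() one earlier in the thread, would no longer be guaranteed to work.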
Re: [Python-Dev] a note in random.shuffle.__doc__ ...
> "Greg" == Greg Ewing <[EMAIL PROTECTED]> writes:
Greg> Terry Jones wrote:
>> Suppose you have a RNG with a cycle length of 5. There's nothing to stop an
>> algorithm from taking multiple already returned values and combining them
>> in some (deterministic) way to generate > 5 outcomes.
Greg> No, it's not. As long as the RNG output is the only input to
Greg> the algorithm, and the algorithm is deterministic, it is
Greg> not possible to get more than N different outcomes. It doesn't
Greg> matter what the algorithm does with the input.
Greg> If the algorithm can start out with more than one initial
Greg> state, then the RNG is not the only input.
The code below uses a RNG with period 5, is deterministic, and has one
initial state. It produces 20 different outcomes.
It's just doing a simplistic version of what a lagged RNG generator does,
but the lagged part is in the "algorithm" not in the rng. That's why I said
if you included the state of the algorithm in what you meant by "state" I'd
be more inclined to agree.
Terry
n = map(float, range(1, 17, 3))
i = 0

def rng():
    global i
    i += 1
    if i == 5: i = 0
    return n[i]

if __name__ == '__main__':
    seen = {}
    history = [rng()]
    o = 0
    for lag in range(1, 5):
        for x in range(5):
            o += 1
            new = rng()
            outcome = new / history[-lag]
            if outcome in seen: print "DUP!"
            seen[outcome] = True
            print "outcome %d = %f" % (o, outcome)
            history.append(new)

# Outputs
outcome 1 = 1.750000
outcome 2 = 1.428571
outcome 3 = 1.300000
outcome 4 = 0.076923
outcome 5 = 4.000000
outcome 6 = 7.000000
outcome 7 = 2.500000
outcome 8 = 1.857143
outcome 9 = 0.100000
outcome 10 = 0.307692
outcome 11 = 0.538462
outcome 12 = 10.000000
outcome 13 = 3.250000
outcome 14 = 0.142857
outcome 15 = 0.400000
outcome 16 = 0.700000
outcome 17 = 0.769231
outcome 18 = 13.000000
outcome 19 = 0.250000
outcome 20 = 0.571429
Re: [Python-Dev] sgmllib Comments
"Fred L. Drake, Jr." <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > On Sunday 11 June 2006 16:26, Sam Ruby wrote: > > Planet is a feed aggregator written in Python. It depends heavily on > > SGMLLib. A recent bug report turned out to be a deficiency in sgmllib, > > and I've submitted a test case and a patch[1] (use or discard the > > patch, > > it is the test that I care about). ... > > and which are original. (Note: feeds often contain such abominations > > as > > © which the new code will treat indistinguishably from ©) > It really sounds like sgmllib is the wrong foundation for this. ... > Have you looked at HTMLParser as an alternate to sgmllib? > It has better support for XHTML constructs. Have you (the OP), checked how related Python projects, such as Mark Pilgrim's feed parser, http://www.feedparser.org/ handle the same sort of input (I have only looked at docs and tests, not code). tjr ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Import semantics
"Fabio Zadrozny" <[EMAIL PROTECTED]> wrote in message >Jython 2.1 on java1.5.0 (JIT: null) >Python 2.4.2 (#67, Sep 28 2005, 12:41:11) [MSC v.1310 32 bit (Intel)] on >win32 Jython 2.1 intends to match Python 2.1, I believe. Python 2.2, which I still have loaded, matches Python 2.4 in the behavior reported. ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] subprocess.Popen(.... stdout=IGNORE, ...)
"Martin Blais" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > Any idea how this idiom could be supported using a more portable > solution (i.e. how would I make this idiom under Windows, is there > some equivalent to /dev/null)? On a DOS/Windows command line, '>NUL:' or '>nul:' ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Should hex() yield 'L' suffix for long numbers?
[Ka-Ping Yee]
> I did this earlier:
>
>     >>> hex(9999999999999)
>     '0x9184e729fffL'
>
> and found it a little jarring, because i feel there's been a general
> trend toward getting rid of the 'L' suffix in Python.
>
> Literal long integers don't need an L anymore; they're automatically
> made into longs if the number is too big. And while the repr() of
> a long retains the L on the end, the str() of a long does not, and
> i rather like that.
>
> So i kind of expected that hex() would not include the L either.
> I see its main job as just giving me the hex digits (in fact, for
> Python 3000 i'd prefer even to drop the '0x' as well), and the L
> seems superfluous and distracting.
>
> What do you think? Is Python 2.5 a reasonable time to drop this L?

As I read PEP 237, that should have happened in Python 2.3 or 2.4. This
specific case is kinda muddy there. Regardless, the only part that
was left for Python 3 was "phase C", and this is phase C in its entirety:

    C. The trailing 'L' is dropped from repr(), and made illegal on
       input. (If possible, the 'long' type completely disappears.)

It's possible, though, that hex() and oct() were implicitly considered to be
variants of repr() for purposes of phase C. How much are we willing to
pay Guido to Pronounce?
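For comparison, here is how the same value behaves on Python 3, where the separate long type is gone. This is a small sketch rather than anything from the thread: the integer is the one whose hex digits appear in the quoted output ('9184e729fff'), and `format(n, "x")` is shown as one way to get the bare digits without the '0x' prefix.

```python
n = 9999999999999

with_prefix = hex(n)          # '0x' prefix, no trailing 'L' on Python 3
digits_only = format(n, "x")  # bare hex digits, no prefix

print(with_prefix)
print(digits_only)
```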
Re: [Python-Dev] a note in random.shuffle.__doc__ ...
[Terry Jones]
> The code below uses a RNG with period 5, is deterministic, and has one
> initial state. It produces 20 different outcomes.

Well, I'd call the sequence of 20 numbers it produces one outcome.
From that view, there are at most 5 outcomes it can produce (at most 5
distinct 20-number sequences). In much the same way, there are at most P
distinct infinite sequences this can produce, if the PRNG used by
random.random() has period P:

def belch():
    import random, math
    start = random.random()
    i = 0
    while True:
        i += 1
        yield math.fmod(i * start, 1.0)

The trick is to define "outcome" in such a way that the original claim is
true :-)
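Both ways of counting can be checked with a short sketch. It assumes the same five-value cycle as the posted script (4, 7, 10, 13, 1) and the same lagged-ratio loop; `make_rng`, `run` and the `phase` parameter are hypothetical names introduced here so that the generator can be started at different points of its cycle. Written for Python 3.

```python
VALUES = [4.0, 7.0, 10.0, 13.0, 1.0]  # the cycle produced by the posted rng()


def make_rng(phase):
    """Return a period-5 generator that starts at the given point of its cycle."""
    position = phase

    def rng():
        nonlocal position
        value = VALUES[position % len(VALUES)]
        position += 1
        return value

    return rng


def run(rng):
    """Same lagged-ratio loop as the posted script, returning all 20 results."""
    history = [rng()]
    results = []
    for lag in range(1, 5):
        for _ in range(5):
            new = rng()
            results.append(new / history[-lag])
            history.append(new)
    return tuple(results)


# Counting individual numbers within one run.
distinct_values = len(set(run(make_rng(0))))

# Counting whole 20-number sequences over many starting points.
distinct_sequences = len({run(make_rng(phase)) for phase in range(50)})

print(distinct_values, distinct_sequences)
```

Counted as individual numbers, one run yields 20 distinct values; counted as whole sequences, the 50 starting points collapse to 5, one per position in the generator's cycle.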
Re: [Python-Dev] sgmllib Comments
Fred L. Drake, Jr. wrote:
> On Sunday 11 June 2006 16:26, Sam Ruby wrote:
> > Planet is a feed aggregator written in Python. It depends heavily on
> > SGMLLib. A recent bug report turned out to be a deficiency in sgmllib,
> > and I've submitted a test case and a patch[1] (use or discard the patch,
> > it is the test that I care about).
>
> And it's a nice aggregator to use, indeed!
>
> > While looking around, a few things surfaced. For starters, it would
> > seem that the version of sgmllib in SVN HEAD will selectively unescape
> > certain character references that might appear in an attribute. I say
> > selectively, as:
> >
> > * it will unescape &amp;
> > * it won't unescape &copy;
> > * it will unescape &#38;
> > * it won't unescape &#x26;
> > * it will unescape &#146;
> > * it won't unescape &#8217;
>
> And just why would you use sgmllib to handle RSS or ATOM feeds? Neither is
> defined in terms of SGML. The sgmllib documentation also notes that it isn't
> really a fully general SGML parser (it isn't), but that it exists primarily
> as a foundation for htmllib.
The feed itself is read first with SAX (then with a fallback using
sgmllib if the feed is not well formed, but that's beside the point).
Then the embedded HTML portions are processed with subclasses of
sgmllib.
> > There are a number of issues here. While not unescaping anything is
> > suboptimal, at least the recipient is aware of exactly which characters
> > have been unescaped (i.e., none of them). The proposed solution makes
> > it impossible for the recipient to know which characters are unescaped,
> > and which are original. (Note: feeds often contain such abominations as
> > &amp;copy; which the new code will treat indistinguishably from &copy;)
>
> My suspicion is that the "right" thing to do at the sgmllib level is to
> categorize the markup and call a method depending on what the entity
> reference is, and let that handle whatever it is. For SGML, that means we
> have things like &name; (entity references), &#123; (character references),
> and that's it. &#x123; isn't legal SGML under any circumstance;
> the "&#x" syntax was introduced with XML.
... but it effectively is valid HTML. And as you point out below
sgmllib's raison d’être is to support htmllib.
> > Additionally, there is a unicode issue here - one that is shared by
> > handle_charref, but at least that method is overrideable. If unescaping
> > remains, do it for hex character references and for values greater than
> > 8-bits, i.e., use unichr instead of chr if the value is greater than 127.
>
> For SGML, it's worse than that, since the document character set is defined
> in the SGML declaration, which is a far hairier beast than an XML
> declaration. :-)
understood
> It really sounds like sgmllib is the wrong foundation for this. While the
> module has some questionable behaviors, none of them are significant in
> its intended context (support for htmllib). Now, I understand that
> RSS has historical issues, with HTML-as-practiced getting embedded as payload
> data with various flavors of escaping applied, and I'm not an expert in the
> details of that. Have you looked at HTMLParser as an alternate to sgmllib?
> It has better support for XHTML constructs.
HTMLParser is less forgiving, and generally less suitable for consuming
HTML as practiced.
- Sam Ruby
Re: [Python-Dev] sgmllib Comments
Terry Reedy wrote:
> "Fred L. Drake, Jr." <[EMAIL PROTECTED]> wrote in message
> news:[EMAIL PROTECTED]
>> On Sunday 11 June 2006 16:26, Sam Ruby wrote:
>>> Planet is a feed aggregator written in Python. It depends heavily on
>>> SGMLLib. A recent bug report turned out to be a deficiency in sgmllib,
>>> and I've submitted a test case and a patch[1] (use or discard the
>>> patch, it is the test that I care about).
> ...
>>> and which are original. (Note: feeds often contain such abominations
>>> as &amp;copy; which the new code will treat indistinguishably from &copy;)
>
>> It really sounds like sgmllib is the wrong foundation for this.
> ...
>> Have you looked at HTMLParser as an alternate to sgmllib?
>> It has better support for XHTML constructs.
>
> Have you (the OP), checked how related Python projects, such as Mark
> Pilgrim's feed parser,
> http://www.feedparser.org/
> handle the same sort of input (I have only looked at docs and tests, not
> code).

Just to be clear: Planet uses Mark's feed parser, which uses SGMLlib.

I'm a committer on that project:

http://sourceforge.net/project/memberlist.php?group_id=112328

I was investigating a bug in sgmllib which affected the feed parser (and
therefore Planet), and noticed that there were changes in the SVN head
of Python which broke three feed parser unit tests.

It is my belief that these changes will break other existing users of
sgmllib.

- Sam Ruby
Re: [Python-Dev] sgmllib Comments
Aahz wrote:
> When providing links to SF, please use the python.org tinyurl equivalent
> to ensure that people can easily see the bug/patch number:
>
> http://www.python.org/sf?id=1504333

Although I usually use the path-style form:

http://www.python.org/sf/1504333

Regards,
Martin
Re: [Python-Dev] sgmllib Comments
On Monday 12 June 2006 00:05, Sam Ruby wrote:
> Just to be clear: Planet uses Mark's feed parser, which uses SGMLlib.

Cool.

> I was investigating a bug in sgmllib which affected the feed parser (and
> therefore Planet), and noticed that there were changes in the SVN head
> of Python which broke three feed parser unit tests.
>
> It is my belief that these changes will break other existing users of
> sgmllib.

This is good to know; thanks for pointing it out.

If you can summarize the specific changes to sgmllib that cause problems for
the feed parser, and identify the tests there that rely on the old behavior,
I'll be glad to look at the problems. I expect to have some time in the next
few evenings, so I should be able to look at these soon.

Is the SourceForge CVS the definitive development source for the feed parser?

-Fred

--
Fred L. Drake, Jr.
Re: [Python-Dev] sgmllib Comments
Sam Ruby wrote:
> Planet is a feed aggregator written in Python. It depends heavily on
> SGMLLib. A recent bug report turned out to be a deficiency in sgmllib,
> and I've submitted a test case and a patch[1] (use or discard the patch,
> it is the test that I care about).

I think (but am not sure) you are referring to patch #1462498 here,
which fixes bugs 1452246 and 1087808.

> * it will unescape &amp;
> * it won't unescape &copy;

That must be because you have amp in your entitydefs, but not copy.

> * it will unescape &#38;
> * it won't unescape &#x26;

That's because it doesn't recognize hex character references.
That's systematic, though: it doesn't just ignore them in
attribute values, but also in content.

> * it will unescape &#146;
> * it won't unescape &#8217;

That's because the value is larger than 256, so chr() fails.

> There are a number of issues here. While not unescaping anything is
> suboptimal, at least the recipient is aware of exactly which characters
> have been unescaped (i.e., none of them). The proposed solution makes
> it impossible for the recipient to know which characters are unescaped,
> and which are original. (Note: feeds often contain such abominations as
> &amp;copy; which the new code will treat indistinguishably from &copy;)

The recipient should then add &copy; to entitydefs; sgmllib will
unescape copy, so the recipient can know not to unescape that.
Alternatively, the recipient could provide an empty entitydefs.

> Additionally, there is a unicode issue here - one that is shared by
> handle_charref, but at least that method is overrideable. If unescaping
> remains, do it for hex character references and for values greater than
> 8-bits, i.e., use unichr instead of chr if the value is greater than 127.

Alternatively, a callback function could be provided for character
references. Unfortunately, the existing callback is unsuitable,
as it is supposed to do the full processing; this callback should
return the replacement text. Generally assuming Unicode would be
wrong, though.

Would you like to contribute a patch?

Regards,
Martin
Re: [Python-Dev] 2.5 issues need resolving in a few days
Neal Norwitz wrote:
> The most important outstanding issue is the xmlplus/xmlcore issue.
> It's not going to get fixed unless someone works on it. There's only
> a few days left before beta 1. Can someone please address this?

From my point of view, I shall consider them resolved/irrelevant:
I'm going to step down as a PyXML maintainer, so I don't have to
worry anymore about how to maintain PyXML. If PyXML then gets
unmaintained, the problem goes away, otherwise, the new maintainer
will have to find a solution.

Regards,
Martin
Re: [Python-Dev] sgmllib Comments
Fred L. Drake, Jr. wrote:
> On Monday 12 June 2006 00:05, Sam Ruby wrote:
> > Just to be clear: Planet uses Mark's feed parser, which uses SGMLlib.
>
> Cool.
>
> > I was investigating a bug in sgmllib which affected the feed parser (and
> > therefore Planet), and noticed that there were changes in the SVN head
> > of Python which broke three feed parser unit tests.
> >
> > It is my belief that these changes will break other existing users of
> > sgmllib.
>
> This is good to know; thanks for pointing it out.
>
> If you can summarize the specific changes to sgmllib that cause problems for
> the feed parser, and identify the tests there that rely on the old behavior,
> I'll be glad to look at the problems. I expect to have some time in the next
> few evenings, so I should be able to look at these soon.
>
> Is the SourceForge CVS the definitive development source for the feed parser?

Yes: but if you check out the CVS HEAD, you won't see any failures as I've
committed changes that mitigate the problems I've found.

However, if you get the latest release instead, you will see that feeds
that contain &lt; &amp; or &gt; in attribute values will get these
converted to <, &, and > characters instead. In some cases, this can
cause problems, particularly if the output is reparsed by sgmllib.

Additionally, entity references in the range of &#128; to &#255; will
cause the released Feed Parser to die with a UnicodeDecodeError.

My workarounds are to re-escape < and > characters, and to escape bare
ampersands - beyond that I can't really tell for sure which ampersands
need to be re-escaped, and which ones I should leave as is. And I first
try decoding attributes in the original declared encoding and then fall
back to iso-8859-1. If a single attribute value contains both non-ASCII
utf-8 characters and a numeric character reference above &#128; then
this will produce incorrect results.

I also have committed a workaround to the incorrect parsing of
attributes with quoted markup that I originally reported.

- Sam Ruby
Re: [Python-Dev] sgmllib Comments
Martin v. Löwis wrote:
>
> Alternatively, a callback function could be provided for character
> references. Unfortunately, the existing callback is unsuitable,
> as it is supposed to do the full processing; this callback should
> return the replacement text. Generally assuming Unicode would be
> wrong, though.
>
> Would you like to contribute a patch?

If we can agree on the behavior, I would be glad to write up a patch.

It seems to me that the simplest way to proceed would be for the code
that attempts to resolve character references (both named and numeric)
in attributes to be isolated in a single method. Subclasses that desire
different behavior (including the existing Python 2.4 and prior
behaviour) could simply override this method.

- Sam Ruby
Re: [Python-Dev] sgmllib Comments
Sam Ruby wrote:
> If we can agree on the behavior, I would be glad to write up a patch.
>
> It seems to me that the simplest way to proceed would be for the code
> that attempts to resolve character references (both named and numeric)
> in attributes to be isolated in a single method. Subclasses that desire
> different behavior (including the existing Python 2.4 and prior
> behaviour) could simply override this method.

In SGML, this is problematic: The named things are not character
references, they are entity references, and it isn't necessarily
the case that they expand to a character. For example, &author;
might expand to "Martin v. Löwis", and &logo; might refer to a
bitmap image which is unparsed.

That said, providing an overridable replacement function sounds like
the right approach. To keep with tradition, I would still distinguish
between character references and entity references, i.e. providing
two overridable functions instead. Returning None could mean that
no replacement is available.

As for default implementations, I think they should do what
currently happens: entity references are replaced according to
entitydefs, character references are replaced to bytes if
they are smaller than 256.

Contrary to what others said, it appears that SGML *does* support
hexadecimal character references, provided that the SGML declaration
contains the HCRO definition (which, for HTML and XML, is defined
as HCRO "&#x"). So it seems safe to process hex character references
by default (although it isn't safe to assume Unicode, IMO).

Regards,
Martin
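The two-hook design discussed above might look roughly like the sketch below. It is an illustration only, written for Python 3 without importing sgmllib; the class and method names (`ReferenceResolver`, `convert_charref`, `convert_entityref`, `resolve`) are hypothetical and are not taken from any patch in this thread. The defaults follow the description: entity references are looked up in entitydefs, character references (decimal or hex) are replaced only below 256, and a hook returning None means the reference is left as written.

```python
import re


class ReferenceResolver:
    """Hypothetical helper, not sgmllib: resolves references in one attribute value."""

    entitydefs = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

    # Matches &#123; (decimal), &#x7b; (hex) and &name; (entity reference).
    _reference = re.compile(
        r"&(?:#(?P<dec>[0-9]+)"
        r"|#[xX](?P<hex>[0-9a-fA-F]+)"
        r"|(?P<name>[A-Za-z][-.A-Za-z0-9]*));"
    )

    def convert_charref(self, codepoint):
        """Return replacement text for a character reference, or None."""
        if codepoint < 256:
            return chr(codepoint)
        return None

    def convert_entityref(self, name):
        """Return replacement text for an entity reference, or None."""
        return self.entitydefs.get(name)

    def resolve(self, value):
        def replace(match):
            if match.group("name") is not None:
                text = self.convert_entityref(match.group("name"))
            elif match.group("hex") is not None:
                text = self.convert_charref(int(match.group("hex"), 16))
            else:
                text = self.convert_charref(int(match.group("dec")))
            # None means "no replacement available": keep the original text.
            return match.group(0) if text is None else text

        return self._reference.sub(replace, value)


class UnicodeResolver(ReferenceResolver):
    """Subclass that accepts any valid code point."""

    def convert_charref(self, codepoint):
        return chr(codepoint) if codepoint <= 0x10FFFF else None


class NoResolver(ReferenceResolver):
    """Subclass that leaves every reference untouched."""

    def convert_charref(self, codepoint):
        return None

    def convert_entityref(self, name):
        return None


sample = "a &amp; b &copy; &#38; &#x26; &#8217; &amp;copy;"
default_result = ReferenceResolver().resolve(sample)
unicode_result = UnicodeResolver().resolve(sample)
untouched_result = NoResolver().resolve(sample)
print(default_result)
```

In the default result the trailing `&amp;copy;` comes out as `&copy;`, the same text as the reference that was left alone earlier in the string, which is the ambiguity Sam raised; a subclass such as `NoResolver` lets a caller opt out and do its own unescaping.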
