Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-07 Thread John Nagle
Steven D'Aprano wrote: On Sun, 05 Jul 2009 01:58:13 -0700, Paul Rubin wrote: Steven D'Aprano st...@remove-this-cybersource.com.au writes: Okay, we get it. Parsing HTML 5 is a bitch. What's your point? I don't see how a case statement would help you here: you're not dispatching on a value, but

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-07 Thread John Nagle
Steven D'Aprano wrote: On Sun, 05 Jul 2009 10:12:54 +0200, Hendrik van Rooyen wrote: Python is not C. John Nagle is an old hand at Python. He's perfectly aware of this, and I'm sure he's not trying to program C in Python. I'm not entirely sure *what* he is doing, and hopefully he'll speak

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-07 Thread Stefan Behnel
John Nagle wrote: I have a small web crawler robust enough to parse real-world HTML, which can be appallingly bad. I currently use an extra-robust version of BeautifulSoup, and even that sometimes blows up. So I'm very interested in a new Python parser which supposedly handles bad HTML

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-07 Thread John Nagle
Stefan Behnel wrote: John Nagle wrote: I have a small web crawler robust enough to parse real-world HTML, which can be appallingly bad. I currently use an extra-robust version of BeautifulSoup, and even that sometimes blows up. So I'm very interested in a new Python parser which

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-06 Thread David M . Cooke
Martin v. Löwis martin at v.loewis.de writes: This is a good test for Python implementation bottlenecks. Run that tokenizer on HTML, and see where the time goes. I looked at it with cProfile, and the top function that comes up for a larger document (52k) is

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-06 Thread Lawrence D'Oliveiro
In message 4a4f91f9$0$1587$742ec...@news.sonic.net, John Nagle wrote: (It should be written in C is not an acceptable answer.) I don't see why not. State machines that have to process input byte by byte are well known to be impossible to implement efficiently in high-level languages. That's

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-06 Thread Jean-Michel Pichavant
protocol = {start:initialiser,hunt:hunter,classify:classifier,other states} def state_machine(): next_step = protocol[start]() while True: next_step = protocol[next_step]() Woot ! I'll keep this one in my mind, while I may not be that concerned by speed unlike the OP,

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-06 Thread Hendrik van Rooyen
Jean-Michel Pichavant jeanmic...@sns.com wrote: Woot ! I'll keep this one in my mind, while I may not be that concerned by speed unlike the OP, I still find this way of doing very simple and so intuitive (one will successfully argue how I was not figuring this out by myself if it was

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-06 Thread J Kenneth King
a...@pythoncraft.com (Aahz) writes: In article mailman.2639.1246802753.8015.python-l...@python.org, Hendrik van Rooyen m...@microcorp.co.za wrote: But wait - maybe if he passes an iterator around - the equivalent of for char in input_stream... Still no good though, unless the next call to the

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-06 Thread Jean-Michel Pichavant
Hendrik van Rooyen wrote: Jean-Michel Pichavant jeanmic...@sns.com wrote: Woot ! I'll keep this one in my mind, while I may not be that concerned by speed unlike the OP, I still find this way of doing very simple and so intuitive (one will successfully argue how I was not figuring this

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Nick Craig-Wood
John Nagle na...@animats.com wrote: As an example of code that really needs to run fast, but is speed-limited by Python's limitations, see tokenizer.py in http://code.google.com/p/html5lib/ This is a parser for HTML 5, a piece of code that will be needed in many places and

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Hendrik van Rooyen
John Nagle na...@...ats.com wrote: Python doesn't have a switch or case statement, and when you need a state machine with many states, that makes for painful, slow code. There's a comment in the code that it would be useful to run a few billion lines of HTML through an instrumented version

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Steven D'Aprano
On Sat, 04 Jul 2009 20:15:11 -0700, John Nagle wrote: Paul Rubin wrote: John Nagle na...@animats.com writes: A dictionary lookup (actually, several of them) for every input character is rather expensive. Tokenizers usually index into a table of character classes, then use the character

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Stefan Behnel
John Nagle wrote: Python doesn't have a switch or case statement, and when you need a state machine with many states, that makes for painful, slow code. Cython has a built-in optimisation that maps if-elif-else chains to C's switch statement if they only test a single int/char variable, even

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Paul Rubin
Steven D'Aprano st...@remove-this-cybersource.com.au writes: Okay, we get it. Parsing HTML 5 is a bitch. What's your point? I don't see how a case statement would help you here: you're not dispatching on a value, but running through a series of tests until one passes. A case statement

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Stefan Behnel
John Nagle wrote: Here's some actual code, from tokenizer.py. This is called once for each character in an HTML document, when in data state (outside a tag). It's straightforward code, but look at all those dictionary lookups. def dataState(self): data = self.stream.char()

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Stefan Behnel
John Nagle wrote: Paul Rubin wrote: John Nagle na...@animats.com writes: Python doesn't have a switch or case statement, and when you need a state machine with many states, that makes for painful, slow code. ... There's a comment in the code that it would be useful to run a few billion

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Steven D'Aprano
On Sun, 05 Jul 2009 10:12:54 +0200, Hendrik van Rooyen wrote: Python is not C. John Nagle is an old hand at Python. He's perfectly aware of this, and I'm sure he's not trying to program C in Python. I'm not entirely sure *what* he is doing, and hopefully he'll speak up and say, but whatever

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Paul Rubin
Steven D'Aprano st...@remove-this-cybersource.com.au writes: Yes, I'm aware of that, but that's not what John's code is doing -- he's doing a series of if expr ... elif expr tests. I don't think a case statement can do much to optimize that. The series of tests is written that way because

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Stefan Behnel
Stefan Behnel wrote: John Nagle wrote: Here's some actual code, from tokenizer.py. This is called once for each character in an HTML document, when in data state (outside a tag). It's straightforward code, but look at all those dictionary lookups. def dataState(self): data

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Stefan Behnel
Paul Rubin wrote: Steven D'Aprano writes: Yes, I'm aware of that, but that's not what John's code is doing -- he's doing a series of if expr ... elif expr tests. I don't think a case statement can do much to optimize that. The series of tests is written that way because there is no case

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Paul Rubin
Stefan Behnel stefan...@behnel.de writes: # Keep a charbuffer to handle the escapeFlag if self.contentModelFlag in\ (contentModelFlags[CDATA], contentModelFlags[RCDATA]): Is the tuple (contentModelFlags[CDATA], contentModelFlags[RCDATA]) constant? If that

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Paul Rubin
Stefan Behnel stefan...@behnel.de writes: The series of tests is written that way because there is no case statement available. It is essentially switching on a bunch of character constants and then doing some additional tests in each branch. Although doing some of the tests first and

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Stefan Behnel
Paul Rubin wrote: Stefan Behnel writes: The series of tests is written that way because there is no case statement available. It is essentially switching on a bunch of character constants and then doing some additional tests in each branch. Although doing some of the tests first and then

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Stefan Behnel
Paul Rubin wrote: Stefan Behnel writes: # Keep a charbuffer to handle the escapeFlag if self.contentModelFlag in\ (contentModelFlags[CDATA], contentModelFlags[RCDATA]): Is the tuple (contentModelFlags[CDATA], contentModelFlags[RCDATA]) constant? If that is the

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Paul Rubin
Stefan Behnel stefan...@behnel.de writes: You may notice that the creation of this exact tuple appears in almost all if the conditionals of this method. So it is part of the bottleneck. I don't think so. The tuple is only created when the character has already matched, and for the vast

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Stefan Behnel
Paul Rubin wrote: Stefan Behnel writes: You may notice that the creation of this exact tuple appears in almost all if the conditionals of this method. So it is part of the bottleneck. I don't think so. The tuple is only created when the character has already matched, and for the vast

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Paul McGuire
On Jul 5, 3:12 am, Hendrik van Rooyen m...@microcorp.co.za wrote: Use a dispatch dict, and have each state return the next state. Then you can use strings representing state names, and everybody will be able to understand the code. toy example, not tested, nor completed: protocol =

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Stefan Behnel
John Nagle wrote: Here's some actual code, from tokenizer.py. This is called once for each character in an HTML document, when in data state (outside a tag). It's straightforward code, but look at all those dictionary lookups. def dataState(self): data = self.stream.char()

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Hendrik van Rooyen
Steven D'Aprano st...@remove-this-cy..e.com.au wrote: On Sun, 05 Jul 2009 10:12:54 +0200, Hendrik van Rooyen wrote: Python is not C. John Nagle is an old hand at Python. He's perfectly aware of this, and I'm sure he's not trying to program C in Python. I'm not entirely sure *what* he is

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Hendrik van Rooyen
Paul Rubin http://phr...@nospam.invalid wrote: The series of tests is written that way because there is no case statement available. It is essentially switching on a bunch of character constants and then doing some additional tests in each branch. It could be that using ord(c) as an index

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Lino Mastrodomenico
2009/7/5 Hendrik van Rooyen m...@microcorp.co.za: I cannot see how you could avoid a python function call - even if he bites the bullet and implements my laborious scheme, he would still have to fetch the next character to test against, inside the current state. So if it is the function calls

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Aahz
In article mailman.2639.1246802753.8015.python-l...@python.org, Hendrik van Rooyen m...@microcorp.co.za wrote: But wait - maybe if he passes an iterator around - the equivalent of for char in input_stream... Still no good though, unless the next call to the iterator is faster than an ordinary

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-05 Thread Martin v. Löwis
This is a good test for Python implementation bottlenecks. Run that tokenizer on HTML, and see where the time goes. I looked at it with cProfile, and the top function that comes up for a larger document (52k) is ...validator.HTMLConformanceChecker.__iter__. This method dispatches various

Code that ought to run fast, but can't due to Python limitations.

2009-07-04 Thread John Nagle
As an example of code that really needs to run fast, but is speed-limited by Python's limitations, see tokenizer.py in http://code.google.com/p/html5lib/ This is a parser for HTML 5, a piece of code that will be needed in many places and will process large amounts of data. It's

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-04 Thread Benjamin Kaplan
On Sat, Jul 4, 2009 at 1:33 PM, John Naglena...@animats.com wrote:   As an example of code that really needs to run fast, but is speed-limited by Python's limitations, see tokenizer.py in        http://code.google.com/p/html5lib/ This is a parser for HTML 5, a piece of code that will be

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-04 Thread Mel
John Nagle wrote: [ ... ] Parsers have many named compile-time constants. Python doesn't support named compile-time constants, and this is one of the places where we have to pay the bill for that limitation. Something to think about when you need three more racks of servers because the

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-04 Thread Paul Rubin
John Nagle na...@animats.com writes: Python doesn't have a switch or case statement, and when you need a state machine with many states, that makes for painful, slow code. ... There's a comment in the code that it would be useful to run a few billion lines of HTML through an instrumented

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-04 Thread John Nagle
Paul Rubin wrote: John Nagle na...@animats.com writes: Python doesn't have a switch or case statement, and when you need a state machine with many states, that makes for painful, slow code. ... There's a comment in the code that it would be useful to run a few billion lines of HTML through an

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-04 Thread Paul Rubin
John Nagle na...@animats.com writes: A dictionary lookup (actually, several of them) for every input character is rather expensive. Tokenizers usually index into a table of character classes, then use the character class index in a switch statement. Maybe you could use a regexp (and then

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-04 Thread Nobody
On Sat, 04 Jul 2009 16:35:08 -0700, John Nagle wrote: The temptation is to write tokenizers in C, but that's an admission of language design failure. The only part that really needs to be written in C is the DFA loop. The code to construct the state table from regexps could be written

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-04 Thread Ben Finney
John Nagle na...@animats.com writes: A dictionary lookup (actually, several of them) for every input character is rather expensive. Tokenizers usually index into a table of character classes, then use the character class index in a switch statement. This is an issue that comes up

Re: Code that ought to run fast, but can't due to Python limitations.

2009-07-04 Thread John Nagle
Paul Rubin wrote: John Nagle na...@animats.com writes: A dictionary lookup (actually, several of them) for every input character is rather expensive. Tokenizers usually index into a table of character classes, then use the character class index in a switch statement. Maybe you could use a