Steven D'Aprano wrote:
On Sun, 05 Jul 2009 01:58:13 -0700, Paul Rubin wrote:
Steven D'Aprano st...@remove-this-cybersource.com.au writes:
Okay, we get it. Parsing HTML 5 is a bitch. What's your point? I don't
see how a case statement would help you here: you're not dispatching on
a value, but
Steven D'Aprano wrote:
On Sun, 05 Jul 2009 10:12:54 +0200, Hendrik van Rooyen wrote:
Python is not C.
John Nagle is an old hand at Python. He's perfectly aware of this, and
I'm sure he's not trying to program C in Python.
I'm not entirely sure *what* he is doing, and hopefully he'll speak
John Nagle wrote:
I have a small web crawler robust enough to parse
real-world HTML, which can be appallingly bad. I currently use
an extra-robust version of BeautifulSoup, and even that sometimes
blows up. So I'm very interested in a new Python parser which supposedly
handles bad HTML
Stefan Behnel wrote:
John Nagle wrote:
I have a small web crawler robust enough to parse
real-world HTML, which can be appallingly bad. I currently use
an extra-robust version of BeautifulSoup, and even that sometimes
blows up. So I'm very interested in a new Python parser which
Martin v. Löwis martin at v.loewis.de writes:
This is a good test for Python implementation bottlenecks. Run
that tokenizer on HTML, and see where the time goes.
I looked at it with cProfile, and the top function that comes up
for a larger document (52k) is
In message 4a4f91f9$0$1587$742ec...@news.sonic.net, John Nagle wrote:
(It should be written in C is not an acceptable answer.)
I don't see why not. State machines that have to process input byte by byte
are well known to be impossible to implement efficiently in high-level
languages. That's
protocol = {start:initialiser,hunt:hunter,classify:classifier,other
states}
def state_machine():
next_step = protocol[start]()
while True:
next_step = protocol[next_step]()
Woot ! I'll keep this one in my mind, while I may not be that concerned
by speed unlike the OP,
Jean-Michel Pichavant jeanmic...@sns.com wrote:
Woot ! I'll keep this one in my mind, while I may not be that concerned
by speed unlike the OP, I still find this way of doing very simple and
so intuitive (one will successfully argue how I was not figuring this
out by myself if it was
a...@pythoncraft.com (Aahz) writes:
In article mailman.2639.1246802753.8015.python-l...@python.org,
Hendrik van Rooyen m...@microcorp.co.za wrote:
But wait - maybe if he passes an iterator around - the equivalent of
for char in input_stream... Still no good though, unless the next call
to the
Hendrik van Rooyen wrote:
Jean-Michel Pichavant jeanmic...@sns.com wrote:
Woot ! I'll keep this one in my mind, while I may not be that concerned
by speed unlike the OP, I still find this way of doing very simple and
so intuitive (one will successfully argue how I was not figuring this
John Nagle na...@animats.com wrote:
As an example of code that really needs to run fast, but is
speed-limited by Python's limitations, see tokenizer.py in
http://code.google.com/p/html5lib/
This is a parser for HTML 5, a piece of code that will be needed
in many places and
John Nagle na...@...ats.com wrote:
Python doesn't have a switch or case statement, and when
you need a state machine with many states, that makes for painful,
slow code. There's a comment in the code that it would be useful
to run a few billion lines of HTML through an instrumented version
On Sat, 04 Jul 2009 20:15:11 -0700, John Nagle wrote:
Paul Rubin wrote:
John Nagle na...@animats.com writes:
A dictionary lookup (actually, several of them) for every
input character is rather expensive. Tokenizers usually index into a
table of character classes, then use the character
John Nagle wrote:
Python doesn't have a switch or case statement, and when
you need a state machine with many states, that makes for painful,
slow code.
Cython has a built-in optimisation that maps if-elif-else chains to C's
switch statement if they only test a single int/char variable, even
Steven D'Aprano st...@remove-this-cybersource.com.au writes:
Okay, we get it. Parsing HTML 5 is a bitch. What's your point? I don't
see how a case statement would help you here: you're not dispatching on a
value, but running through a series of tests until one passes.
A case statement
John Nagle wrote:
Here's some actual code, from tokenizer.py. This is called once
for each character in an HTML document, when in data state (outside
a tag). It's straightforward code, but look at all those
dictionary lookups.
def dataState(self):
data = self.stream.char()
John Nagle wrote:
Paul Rubin wrote:
John Nagle na...@animats.com writes:
Python doesn't have a switch or case statement, and when
you need a state machine with many states, that makes for painful,
slow code. ...
There's a comment in the code that it would be useful
to run a few billion
On Sun, 05 Jul 2009 10:12:54 +0200, Hendrik van Rooyen wrote:
Python is not C.
John Nagle is an old hand at Python. He's perfectly aware of this, and
I'm sure he's not trying to program C in Python.
I'm not entirely sure *what* he is doing, and hopefully he'll speak up
and say, but whatever
Steven D'Aprano st...@remove-this-cybersource.com.au writes:
Yes, I'm aware of that, but that's not what John's code is doing -- he's
doing a series of if expr ... elif expr tests. I don't think a case
statement can do much to optimize that.
The series of tests is written that way because
Stefan Behnel wrote:
John Nagle wrote:
Here's some actual code, from tokenizer.py. This is called once
for each character in an HTML document, when in data state (outside
a tag). It's straightforward code, but look at all those
dictionary lookups.
def dataState(self):
data
Paul Rubin wrote:
Steven D'Aprano writes:
Yes, I'm aware of that, but that's not what John's code is doing -- he's
doing a series of if expr ... elif expr tests. I don't think a case
statement can do much to optimize that.
The series of tests is written that way because there is no case
Stefan Behnel stefan...@behnel.de writes:
# Keep a charbuffer to handle the escapeFlag
if self.contentModelFlag in\
(contentModelFlags[CDATA], contentModelFlags[RCDATA]):
Is the tuple
(contentModelFlags[CDATA], contentModelFlags[RCDATA])
constant? If that
Stefan Behnel stefan...@behnel.de writes:
The series of tests is written that way because there is no case
statement available. It is essentially switching on a bunch of
character constants and then doing some additional tests in each
branch.
Although doing some of the tests first and
Paul Rubin wrote:
Stefan Behnel writes:
The series of tests is written that way because there is no case
statement available. It is essentially switching on a bunch of
character constants and then doing some additional tests in each
branch.
Although doing some of the tests first and then
Paul Rubin wrote:
Stefan Behnel writes:
# Keep a charbuffer to handle the escapeFlag
if self.contentModelFlag in\
(contentModelFlags[CDATA], contentModelFlags[RCDATA]):
Is the tuple
(contentModelFlags[CDATA], contentModelFlags[RCDATA])
constant? If that is the
Stefan Behnel stefan...@behnel.de writes:
You may notice that the creation of this exact tuple appears in almost all
if the conditionals of this method. So it is part of the bottleneck.
I don't think so. The tuple is only created when the character has
already matched, and for the vast
Paul Rubin wrote:
Stefan Behnel writes:
You may notice that the creation of this exact tuple appears in almost all
if the conditionals of this method. So it is part of the bottleneck.
I don't think so. The tuple is only created when the character has
already matched, and for the vast
On Jul 5, 3:12 am, Hendrik van Rooyen m...@microcorp.co.za wrote:
Use a dispatch dict, and have each state return the next state.
Then you can use strings representing state names, and
everybody will be able to understand the code.
toy example, not tested, nor completed:
protocol =
John Nagle wrote:
Here's some actual code, from tokenizer.py. This is called once
for each character in an HTML document, when in data state (outside
a tag). It's straightforward code, but look at all those
dictionary lookups.
def dataState(self):
data = self.stream.char()
Steven D'Aprano st...@remove-this-cy..e.com.au wrote:
On Sun, 05 Jul 2009 10:12:54 +0200, Hendrik van Rooyen wrote:
Python is not C.
John Nagle is an old hand at Python. He's perfectly aware of this, and
I'm sure he's not trying to program C in Python.
I'm not entirely sure *what* he is
Paul Rubin http://phr...@nospam.invalid wrote:
The series of tests is written that way because there is no case
statement available. It is essentially switching on a bunch of
character constants and then doing some additional tests in each
branch.
It could be that using ord(c) as an index
2009/7/5 Hendrik van Rooyen m...@microcorp.co.za:
I cannot see how you could avoid a python function call - even if he
bites the bullet and implements my laborious scheme, he would still
have to fetch the next character to test against, inside the current state.
So if it is the function calls
In article mailman.2639.1246802753.8015.python-l...@python.org,
Hendrik van Rooyen m...@microcorp.co.za wrote:
But wait - maybe if he passes an iterator around - the equivalent of
for char in input_stream... Still no good though, unless the next call
to the iterator is faster than an ordinary
This is a good test for Python implementation bottlenecks. Run
that tokenizer on HTML, and see where the time goes.
I looked at it with cProfile, and the top function that comes up
for a larger document (52k) is
...validator.HTMLConformanceChecker.__iter__.
This method dispatches various
As an example of code that really needs to run fast, but is
speed-limited by Python's limitations, see tokenizer.py in
http://code.google.com/p/html5lib/
This is a parser for HTML 5, a piece of code that will be needed
in many places and will process large amounts of data. It's
On Sat, Jul 4, 2009 at 1:33 PM, John Naglena...@animats.com wrote:
As an example of code that really needs to run fast, but is
speed-limited by Python's limitations, see tokenizer.py in
http://code.google.com/p/html5lib/
This is a parser for HTML 5, a piece of code that will be
John Nagle wrote:
[ ... ]
Parsers have many named compile-time constants. Python doesn't support
named compile-time constants, and this is one of the places where we
have to pay the bill for that limitation.
Something to think about when you need three more racks of servers
because the
John Nagle na...@animats.com writes:
Python doesn't have a switch or case statement, and when
you need a state machine with many states, that makes for painful,
slow code. ...
There's a comment in the code that it would be useful
to run a few billion lines of HTML through an instrumented
Paul Rubin wrote:
John Nagle na...@animats.com writes:
Python doesn't have a switch or case statement, and when
you need a state machine with many states, that makes for painful,
slow code. ...
There's a comment in the code that it would be useful
to run a few billion lines of HTML through an
John Nagle na...@animats.com writes:
A dictionary lookup (actually, several of them) for every
input character is rather expensive. Tokenizers usually index into
a table of character classes, then use the character class index in
a switch statement.
Maybe you could use a regexp (and then
On Sat, 04 Jul 2009 16:35:08 -0700, John Nagle wrote:
The temptation is to write tokenizers in C, but that's an admission
of language design failure.
The only part that really needs to be written in C is the DFA loop. The
code to construct the state table from regexps could be written
John Nagle na...@animats.com writes:
A dictionary lookup (actually, several of them) for every input
character is rather expensive. Tokenizers usually index into a table
of character classes, then use the character class index in a switch
statement.
This is an issue that comes up
Paul Rubin wrote:
John Nagle na...@animats.com writes:
A dictionary lookup (actually, several of them) for every
input character is rather expensive. Tokenizers usually index into
a table of character classes, then use the character class index in
a switch statement.
Maybe you could use a
43 matches
Mail list logo