[issue7311] Bug on regexp of HTMLParser

2011-04-07 Thread Roundup Robot

Roundup Robot devnull@devnull added the comment:

New changeset 225400cb6e84 by Ezio Melotti in branch '3.2':
#7311: fix html.parser to accept non-ASCII attribute values.
http://hg.python.org/cpython/rev/225400cb6e84

New changeset a1dea7cde58f by Ezio Melotti in branch 'default':
#7311: merge with 3.2.
http://hg.python.org/cpython/rev/a1dea7cde58f

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue7311
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue7311] Bug on regexp of HTMLParser

2011-04-07 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
resolution:  -> fixed
stage: commit review -> committed/rejected
status: open -> closed




[issue7311] Bug on regexp of HTMLParser

2011-04-06 Thread Éric Araujo

Éric Araujo mer...@netwok.org added the comment:

I think the stdlib should comply with HTML 4.01, and in the future HTML 5.

(FTR, I don’t think XHTML is useful, and deny that XHTML-compatible HTML 
exists.  See http://bugs.python.org/issue11567#msg131509 :)

--




[issue7311] Bug on regexp of HTMLParser

2011-04-06 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

I would agree if the HTMLParser was compliant with the HTML 4.01 specs, but 
since it's more permissive and uses its own heuristic to determine what should 
be parsed and what shouldn't, I think it's better to use already existing 
heuristics (either the HTML5 ones or the ones used by the browsers).
I.e., I'm not trying to make it HTML5 compliant, just to make it work with what 
works on the browsers.

--




[issue7311] Bug on regexp of HTMLParser

2011-04-06 Thread Éric Araujo

Éric Araujo mer...@netwok.org added the comment:

Okay, sounds good.

--




[issue7311] Bug on regexp of HTMLParser

2011-04-06 Thread Senthil Kumaran

Senthil Kumaran sent...@uthcode.com added the comment:

We need not base changes to html/parser.py on the HTML5 spec, but rather on 
the requirements of parsers which may rely on this library. For example, the 
tolerant mode was brought in by issue1486713 for practical reasons, and it 
was seen as useful for parsers.

I don't know how common leaving out quotes for attributes is, but I think 
it can become really confusing to parsers (custom parsers). If we had not 
supported unquoted attributes, I think it would still be okay not to support 
them unless presented with a very concrete bug (for example, if the HTML 4.01 
spec allowed them, which as far as I can see it does not).

The patch which added support for non-ascii characters is fine.

--




[issue7311] Bug on regexp of HTMLParser

2011-04-06 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

So is the issue7311-3.diff patch fine?  It changes the strict regex to match 
the 2.7 one, and leaves the tolerant one unchanged (even if the two regexes 
are now really close).

--




[issue7311] Bug on regexp of HTMLParser

2011-04-06 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

Sounds fine to me.

--




[issue7311] Bug on regexp of HTMLParser

2011-04-06 Thread Senthil Kumaran

Senthil Kumaran sent...@uthcode.com added the comment:

> So is the issue7311-3.diff patch fine?

Just that it allows unquoted attrs for unicode too.

My previous suggestion was not to allow unquoted attribute values, but as the 
change is already made in 2.7 and the discussion pointed out a portion of the 
HTML 4.01 spec which allows unquoted attrs for ASCII, it seems fine. 
html/parser.py will be a bit more permissive than what the spec says.

> It changes the strict regex to match the 2.7 one, and leave the tolerant one 
> unchanged.

That is fine.

--




[issue7311] Bug on regexp of HTMLParser

2011-04-06 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

On 3.2 the patch changes only the range of chars matched by the regex when the 
attribute value doesn't have quotes and strict=True.

The parser already allowed unquoted attribute values even before the patch (in 
both strict and tolerant mode), but used an explicit list of allowed chars that 
was limited to the ASCII range.

--




[issue7311] Bug on regexp of HTMLParser

2011-04-05 Thread Roundup Robot

Roundup Robot devnull@devnull added the comment:

New changeset 7d4dea76c476 by Ezio Melotti in branch '2.7':
#7311: fix HTMLParser to accept non-ASCII attribute values.
http://hg.python.org/cpython/rev/7d4dea76c476

--
nosy: +python-dev




[issue7311] Bug on regexp of HTMLParser

2011-04-05 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

With 3.2 the situation is more complicated because there is a strict and a 
non-strict mode.
The strict mode uses:
    attrfind = re.compile(
        r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
        r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!$\(\)_#=~@]*))?')

and the tolerant mode uses:
    attrfind_tolerant = re.compile(
        r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
        r'(\'[^\']*\'|"[^"]*"|[^\s]*))?')

This means that the strict mode doesn't allow valid non-ASCII chars, and that 
tolerant mode is a little too permissive.

The attached patch changes the strict regex to be more permissive and leaves 
the tolerant regex unchanged. The difference between the two are now so small 
that the tolerant version could be removed, except that re.search is used 
instead of re.match when the tolerant regex is used.
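
A minimal sketch of the difference between the two modes, assuming the double quotes that the archive stripped from the second alternative of each pattern. The names `strict` and `tolerant` are illustrative only, and the patterns are copied from this comment rather than from any particular release of the module:

```python
import re

# Patterns as quoted in this comment. The double-quote alternative is
# restored as an assumption (the archived text lost those characters).
strict = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!$\(\)_#=~@]*))?')
tolerant = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^\s]*))?')

attr = 'alt=中文'

# The strict character class is ASCII-only, so the value matches as an
# empty string and the non-ASCII text is left unconsumed.
strict_value = strict.match(attr).group(3)

# The tolerant class takes everything up to the next whitespace.
tolerant_value = tolerant.match(attr).group(3)

# match() must succeed at the starting position; search() may skip
# leading junk before it finds an attribute.
junk = ',alt=x'
strict_on_junk = strict.match(junk)
tolerant_on_junk = tolerant.search(junk).group(1)
```

Here the strict pattern yields an empty value for the non-ASCII attribute while the tolerant one captures it, and only `search()` recovers the attribute name after the stray comma.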

--
nosy: +r.david.murray
Added file: http://bugs.python.org/file21545/issue7311-3.diff




[issue7311] Bug on regexp of HTMLParser

2011-04-05 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

The goal of tolerant mode is to accept anything a typical browser would accept. 
 I suspect that means the tolerant regex should stay, but I don't remember the 
details.

As for the strict mode, as far as I know the current module follows 4.01, 
not 5.  I'm not sure what should be done about that.

--
nosy: +orsenthil




[issue7311] Bug on regexp of HTMLParser

2011-04-05 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

I don't see many use cases for the strict mode.  It is not strict enough to be 
used for validation, and while parsing HTML I can't think of any other case 
where I would want an exception raised (as long as what is parsed by the 
tolerant mode is a superset of what is parsed by the strict mode).

If the parser is still able to parse what it was parsing before, I wouldn't 
worry too much about backward compatibility, because I can't imagine a valid 
use case where people would want the parser to fail (maybe someone else can?).

--
stage: test needed -> commit review




[issue7311] Bug on regexp of HTMLParser

2011-04-03 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
assignee:  -> ezio.melotti




[issue7311] Bug on regexp of HTMLParser

2011-04-03 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

Here's a patch that matches unquoted attribute values according to the HTML5 
specifications.

The regex uses \s even if this includes the \v char that, according to the 
HTML5 specs, shouldn't be included.  I left it there for simplicity and 
backward-compatibility, and also because it's a rather obscure corner case.
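
A small sketch of the corner case mentioned here, assuming the HTML5 "space characters" are space, tab, LF, FF and CR. The name `html5_space` is made up for the illustration:

```python
import re

# For str patterns, Python's \s also matches vertical tab (U+000B).
vt_is_space = re.fullmatch(r'\s', '\v') is not None

# An explicit class for the HTML5 space characters: space, tab, LF, FF, CR.
html5_space = re.compile(r'[ \t\n\f\r]')
vt_is_html5_space = html5_space.fullmatch('\v') is not None
```

So a class built on `\s` ends an unquoted value at a vertical tab, while a class built from the explicit list would not.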

--
versions:  -Python 3.1
Added file: http://bugs.python.org/file21517/issue7311-2.diff




[issue7311] Bug on regexp of HTMLParser

2011-03-27 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

The HTML 4.01 specification says [0]:

In certain cases, authors may specify the value of an attribute without any 
quotation marks. The attribute value may only contain letters (a-z and A-Z), 
digits (0-9), hyphens (ASCII decimal 45), periods (ASCII decimal 46), 
underscores (ASCII decimal 95), and colons (ASCII decimal 58). We recommend 
using quotation marks even when it is possible to eliminate them.


The HTML 5 draft says[1]:

The attribute name, followed by zero or more space characters, followed by a 
single U+003D EQUALS SIGN character, followed by zero or more space characters, 
followed by the attribute value, which, in addition to the requirements given 
above for attribute values, must not contain any literal space characters, any 
U+0022 QUOTATION MARK characters ("), U+0027 APOSTROPHE characters ('), U+003D 
EQUALS SIGN characters (=), U+003C LESS-THAN SIGN characters (<), U+003E 
GREATER-THAN SIGN characters (>), or U+0060 GRAVE ACCENT characters (`), and 
must not be the empty string.


So maybe [^\s] is a little too permissive here.

[0]: http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.2
[1]: http://dev.w3.org/html5/spec/Overview.html#attributes-0
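
The two rules quoted above can be sketched as a pair of regexes. This is only an illustration of the quoted text, not code from the module: the names `HTML401_UNQUOTED`, `HTML5_UNQUOTED` and `allowed_unquoted` are hypothetical, and `\s` stands in for the HTML5 "space characters" as a simplification.

```python
import re

# HTML 4.01: letters, digits, hyphens, periods, underscores, colons.
HTML401_UNQUOTED = re.compile(r'[A-Za-z0-9._:-]+')

# HTML5 draft: non-empty, no space characters, ", ', =, <, > or `.
# \s is used for "space characters" here as a simplification.
HTML5_UNQUOTED = re.compile(r'''[^\s"'=<>`]+''')

def allowed_unquoted(value):
    """Return (html401_ok, html5_ok) for an unquoted attribute value."""
    return (HTML401_UNQUOTED.fullmatch(value) is not None,
            HTML5_UNQUOTED.fullmatch(value) is not None)
```

Under this sketch a value such as /foo/bar.png or 中文 fails the HTML 4.01 rule but passes the HTML5 one, while a value containing = fails both.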

--




[issue7311] Bug on regexp of HTMLParser

2011-03-26 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

The attached patch changes the regex to allow non-ascii letters in attribute 
values (using \w with the re.UNICODE flag instead of [a-zA-Z0-9_]).

Using [^\s] (or even [^ ]) might be OK too, since that's what browsers seem 
to use (e.g. Firefox and Chrome show テ<ス＀☃ト   -d-fg as title of 'a href= 
title=テ<ス＀☃ト   -d-fg href=foo/a', including the non-ascii spaces in the 
middle).
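
A rough illustration of the `\w` approach, written for Python 3, where str patterns are Unicode-aware by default and `re.UNICODE` is redundant (on 2.7 the flag has to be passed explicitly). The character class below is an approximation of the strict one with `\w` substituted, not the exact text of the patch:

```python
import re

# The strict value class with \w in place of a-zA-Z0-9_ (approximation).
value_class = r'[-\w./,:;+*%?!$()#=~@]*'

unicode_aware = re.compile(value_class)           # str patterns: Unicode \w
ascii_only = re.compile(value_class, re.ASCII)    # mimics the old ASCII range

matches_unicode = unicode_aware.fullmatch('中文') is not None
matches_ascii = ascii_only.fullmatch('中文') is not None
```

With the Unicode-aware class the Chinese value matches in full. With `re.ASCII` it does not, which mirrors the behaviour reported in this issue.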

--
keywords: +patch
nosy: +belopolsky
Added file: http://bugs.python.org/file21406/issue7311.diff




[issue7311] Bug on regexp of HTMLParser

2011-03-20 Thread Éric Araujo

Changes by Éric Araujo mer...@netwok.org:


--
nosy: +eric.araujo
versions: +Python 3.1, Python 3.2, Python 3.3 -Python 2.6




[issue7311] Bug on regexp of HTMLParser

2009-11-19 Thread Glenn Linderman

Glenn Linderman v+pyt...@g.nevcal.com added the comment:

Re: the BTW --  and  should be entity-escaped when used in attribute
values inside tag attributes... (but are probably seldom found as part
of tag attribute values)

But the example you showed is not an attribute in a tag, but rather text
within a paired tag.

But your suggestion for the regexp seems correct to me, if the non-ASCII
characters are permitted for non-quoted attribute values.

--
nosy: +v+python




[issue7311] Bug on regexp of HTMLParser

2009-11-19 Thread Chiyuan Zhang

Chiyuan Zhang plus...@gmail.com added the comment:

re: Yes. In fact, the BTW is a different problem with respect to this
bug. And that seems to be more complicated to fix.

--




[issue7311] Bug on regexp of HTMLParser

2009-11-13 Thread Fred L. Drake, Jr.

Changes by Fred L. Drake, Jr. fdr...@acm.org:


--
nosy: +fdrake




[issue7311] Bug on regexp of HTMLParser

2009-11-13 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
nosy: +ezio.melotti
priority:  -> normal
stage:  -> test needed
versions: +Python 2.7




[issue7311] Bug on regexp of HTMLParser

2009-11-12 Thread Chiyuan Zhang

New submission from Chiyuan Zhang plus...@gmail.com:

Hi all,

I'm using BeautifulSoup to parse an HTML page and found that it refused to
parse the page. Looking at the backtrace, I found that the problem is in
Python's built-in HTMLParser.py. The web page I'm parsing contains some
Chinese characters, and there is a tag like
<img src=/foo/bar.png alt=中文>. Note that this is a legacy HTML page where
the attributes are not quoted. However, the regexp defined in
HTMLParser.py is:

    attrfind = re.compile(
        r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
        r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!$\(\)_#=~@]*))?')

Note that Chinese characters (and any other non-ASCII characters) are not
matched by the last alternative, so the parser raises an error on this
tag. I'm not sure whether the HTML standard allows unquoted non-ASCII
characters in attributes. If it does, this seems to be a bug, and the
regexp had better be [^\s] IMHO.
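
For anyone wanting to check the behaviour on a current Python 3, here is a small sketch that feeds the tag from this report to `html.parser.HTMLParser`. `AttrCollector` is a hypothetical helper written for this example, and the result assumes an interpreter that already contains the fix for this issue:

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Hypothetical helper that records the attributes of each start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs as parsed from the tag.
        self.seen.append((tag, attrs))

parser = AttrCollector()
parser.feed('<img src=/foo/bar.png alt=中文>')
parser.close()
```

On such an interpreter `parser.seen` should hold one `img` entry whose attributes include the unquoted non-ASCII `alt` value.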

BTW: It seems something like :

<script>
var st = "<a/>";
</script>

can not be parsed. :-/

--
components: Library (Lib)
messages: 95162
nosy: pluskid
severity: normal
status: open
title: Bug on regexp of HTMLParser
type: behavior
versions: Python 2.6
