Roundup Robot devnull@devnull added the comment:
New changeset 225400cb6e84 by Ezio Melotti in branch '3.2':
#7311: fix html.parser to accept non-ASCII attribute values.
http://hg.python.org/cpython/rev/225400cb6e84
New changeset a1dea7cde58f by Ezio Melotti in branch 'default':
#7311: merge
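
For readers who want to see the behaviour the changeset refers to, here is a minimal sketch, assuming a current Python 3 where `html.parser.HTMLParser` accepts unquoted non-ASCII attribute values. The class `AttrCollector` and the function `collect_start_tags` are hypothetical helpers written for this illustration; they are not part of the patch.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records (tag, attrs) for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs supplied by HTMLParser.
        self.seen.append((tag, attrs))


def collect_start_tags(markup):
    """Feed markup to the parser and return the recorded start tags."""
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.seen


# An unquoted attribute value containing Chinese characters.
tags = collect_start_tags('<a title=中文 href="page.html">link</a>')
print(tags)
```

On the versions discussed in this thread the same input could fail before the fix; the sketch only illustrates the post-fix behaviour.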
Changes by Ezio Melotti ezio.melo...@gmail.com:
--
resolution: -> fixed
stage: commit review -> committed/rejected
status: open -> closed
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue7311
___
Éric Araujo mer...@netwok.org added the comment:
I think the stdlib should comply with HTML 4.01, and in the future HTML 5.
(FTR, I don’t think XHTML is useful, and deny that XHTML-compatible HTML
exists. See http://bugs.python.org/issue11567#msg131509 :)
--
Ezio Melotti ezio.melo...@gmail.com added the comment:
I would agree if the HTMLParser was compliant with the HTML 4.01 specs, but
since it's more permissive and uses its own heuristic to determine what should
be parsed and what shouldn't, I think it's better to use already existing
Éric Araujo mer...@netwok.org added the comment:
Okay, sounds good.
--
Senthil Kumaran sent...@uthcode.com added the comment:
We need not base changes to html/parser.py on the HTML5 spec, but rather on
the requirements of the parsers that rely on this library.
For example, the tolerant mode was introduced in issue1486713 for practical
reasons and it
Ezio Melotti ezio.melo...@gmail.com added the comment:
So is the issue7311-3.diff patch fine? It changes the strict regex to match
the 2.7 one, and leaves the tolerant one unchanged (even if the two regexes
are now really close).
--
R. David Murray rdmur...@bitdance.com added the comment:
Sounds fine to me.
--
Senthil Kumaran sent...@uthcode.com added the comment:
> So is the issue7311-3.diff patch fine?
Just that it allows unquoted attrs for unicode too.
My previous suggestion was not to allow unquoted attribute values, but as the
change is already made in 2.7 and discussion pointed out a portion
Ezio Melotti ezio.melo...@gmail.com added the comment:
On 3.2 the patch changes only the range of chars matched by the regex when the
attribute value doesn't have quotes and strict=True.
The parser already allowed unquoted attribute values even before the patch (in
both strict and tolerant
Roundup Robot devnull@devnull added the comment:
New changeset 7d4dea76c476 by Ezio Melotti in branch '2.7':
#7311: fix HTMLParser to accept non-ASCII attribute values.
http://hg.python.org/cpython/rev/7d4dea76c476
--
nosy: +python-dev
Ezio Melotti ezio.melo...@gmail.com added the comment:
With 3.2 the situation is more complicated because there is a strict and a
non-strict mode.
The strict mode uses:
attrfind = re.compile(
r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
R. David Murray rdmur...@bitdance.com added the comment:
The goal of tolerant mode is to accept anything a typical browser would accept.
I suspect that means the tolerant regex should stay, but I don't remember the
details.
As for the strict one, as far as I know the current module follows
Ezio Melotti ezio.melo...@gmail.com added the comment:
I don't see many use cases for the strict mode. It is not strict enough to be
used for validation, and while parsing HTML I can't think of any other case
where I would want an exception raised (always as long as what is parsed by the
Changes by Ezio Melotti ezio.melo...@gmail.com:
--
assignee: -> ezio.melotti
Ezio Melotti ezio.melo...@gmail.com added the comment:
Here's a patch that matches unquoted attribute values according to the HTML5
specifications.
The regex uses \s even if this includes the \v char that, according to the
HTML5 specs, shouldn't be included. I left it there for simplicity
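
The difference between `\s` and the HTML5 set can be checked directly. The sketch below is illustrative only and is not the regex from the patch; the names `HTML5_SPACE`, `python_space` and `matches` are assumptions introduced here. It relies on the HTML5 definition of space characters (space, tab, LF, FF, CR), which leaves out the vertical tab that Python's `\s` accepts.

```python
import re

# Python's \s matches the vertical tab (\v, U+000B) among other whitespace.
python_space = re.compile(r'\s')

# HTML5 "space characters" are only space, tab, LF, FF and CR.
# HTML5_SPACE is a hypothetical name used for this sketch.
HTML5_SPACE = re.compile(r'[ \t\n\f\r]')


def matches(pattern, char):
    """Return True if the single character is matched by the pattern."""
    return pattern.fullmatch(char) is not None


for char in ' \t\n\f\r\v':
    print(repr(char), matches(python_space, char), matches(HTML5_SPACE, char))
```

Only the `\v` row differs between the two columns, which is the simplification the comment describes.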
Ezio Melotti ezio.melo...@gmail.com added the comment:
The HTML 4.01 specification says [0]:
In certain cases, authors may specify the value of an attribute without any
quotation marks. The attribute value may only contain letters (a-z and A-Z),
digits (0-9), hyphens (ASCII decimal 45),
Ezio Melotti ezio.melo...@gmail.com added the comment:
The attached patch changes the regex to allow non-ascii letters in attribute
values (using \w with the re.UNICODE flag instead of [a-zA-Z0-9_]).
Using [^\s] (or even [^ ]) might be OK too, since that's what browsers seem
to use (e.g.
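
A rough sketch of the change described above, comparing the ASCII-only character class with `\w` under `re.UNICODE`. These two patterns are simplified stand-ins, not the actual attribute regex from HTMLParser, and the names `ascii_value`, `unicode_value` and `fully_matches` are hypothetical. On Python 3, `\w` is already Unicode-aware for `str` patterns, so the flag is redundant there but harmless.

```python
import re

# ASCII-only class, as in the original regex fragment quoted in the comment.
ascii_value = re.compile(r'[a-zA-Z0-9_]+')

# \w with re.UNICODE also matches non-ASCII letters such as CJK characters.
unicode_value = re.compile(r'\w+', re.UNICODE)


def fully_matches(pattern, value):
    """Return True if the whole value is matched by the pattern."""
    return pattern.fullmatch(value) is not None


for value in ('abc_123', '中文'):
    print(value, fully_matches(ascii_value, value), fully_matches(unicode_value, value))
```

The ASCII class rejects the Chinese value while `\w` accepts it, which is the gap the patch was meant to close for unquoted attribute values.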
Changes by Éric Araujo mer...@netwok.org:
--
nosy: +eric.araujo
versions: +Python 3.1, Python 3.2, Python 3.3 -Python 2.6
Glenn Linderman v+pyt...@g.nevcal.com added the comment:
Re: the BTW -- and should be entity-escaped when used in attribute
values inside tag attributes... (but are probably seldom found as part
of tag attribute values)
But the example you showed is not an attribute in a tag, but rather text
Chiyuan Zhang plus...@gmail.com added the comment:
re: Yes. In fact, the BTW is a different problem with respect to this
bug. And that seems to be more complicated to fix.
--
Changes by Fred L. Drake, Jr. fdr...@acm.org:
--
nosy: +fdrake
Changes by Ezio Melotti ezio.melo...@gmail.com:
--
nosy: +ezio.melotti
priority: -> normal
stage: -> test needed
versions: +Python 2.7
New submission from Chiyuan Zhang plus...@gmail.com:
Hi all,
I'm using BeautifulSoup to parse an HTML page and found that it refused to
parse the page. Looking at the backtrace, I found that the problem lies
in Python's built-in HTMLParser.py. In fact, the web page I'm
parsing contains some Chinese