Ezio Melotti ezio.melo...@gmail.com added the comment:
Attached another file with a dict that contains the 2231 HTML5 entities listed
at http://www.w3.org/TR/html5/named-character-references.html
The dict is like:
html5namedcharref = {
'Aacute;': '\xc1',
'Aacute': '\xc1',
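For illustration only: a minimal lookup sketch, assuming Python 3.3 or later, where this dict is available as `html.entities.html5` (the name agreed on further down the thread). It only shows that a legacy name such as Aacute is stored both with and without the trailing ';'.

```python
from html.entities import html5

# Legacy names such as Aacute appear twice: with and without the trailing ';'.
with_semicolon = html5['Aacute;']
without_semicolon = html5['Aacute']

print(repr(with_semicolon), repr(without_semicolon))
```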
Ezio Melotti ezio.melo...@gmail.com added the comment:
Here is a proper patch, still using the html5namedcharref name.
HTMLParser should also be updated to use this dict.
--
keywords: +patch
stage: patch review -> commit review
Added file: http://bugs.python.org/file26109/issue3.diff
Changes by Ezio Melotti ezio.melo...@gmail.com:
Removed file: http://bugs.python.org/file26109/issue3.diff
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue3
___
Changes by Ezio Melotti ezio.melo...@gmail.com:
Added file: http://bugs.python.org/file26110/issue3.diff
Martin v. Löwis mar...@v.loewis.de added the comment:
How about calling it just html5, or HTML5? That it is about entities
already follows from the module name.
--
Ezio Melotti ezio.melo...@gmail.com added the comment:
Here's a new patch that uses the html5 name for the dict; if there are no
other comments I'll commit it.
--
Added file: http://bugs.python.org/file26113/issue3-2.diff
Roundup Robot devn...@psf.upfronthosting.co.za added the comment:
New changeset 2b54e25d6ecb by Ezio Melotti in branch 'default':
#3: add a new html5 dictionary containing the named character references
defined by the HTML5 standard and the equivalent Unicode character(s) to the
Changes by Ezio Melotti ezio.melo...@gmail.com:
--
resolution: -> fixed
stage: commit review -> committed/rejected
status: open -> closed
Éric Araujo mer...@netwok.org added the comment:
The ';' is not part of the entity name but an SGML delimiter, like '&'; the
strings in the dict should not include it (just as the keys in the other dicts don’t).
--
Éric Araujo mer...@netwok.org added the comment:
BTW in the doc you may point to collections.ChainMap to explain to people how
to make one dict with HTML 4 and HTML 5 entities. (Note that I assume there
are two dicts, but I only skimmed the diff.)
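A minimal sketch of the ChainMap idea, assuming the dict lands as `html.entities.html5` with ';'-terminated keys. The `html4` and `all_entities` names are made up for this example, and adding ';' to the HTML 4 names is just one possible way to make the two key styles agree.

```python
from collections import ChainMap
from html.entities import html5, name2codepoint

# Assumption: normalize the HTML 4 names to the ';'-terminated form used by html5.
html4 = {name + ';': chr(codepoint) for name, codepoint in name2codepoint.items()}

# Lookups try the HTML 5 dict first, then fall back to the HTML 4 mapping.
all_entities = ChainMap(html5, html4)

print(repr(all_entities['nbsp;']))
```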
--
Ezio Melotti ezio.melo...@gmail.com added the comment:
The problem is that the standard allows some charrefs to end without a ';', but
not all of them.
So both &Eacuteric and &Eacute;ric will be parsed as Éric, but only
&alpha;centauri will result in αcentauri -- &alphacentauri will be
returned
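The distinction can be checked against the dict itself. This is a rough sketch, assuming Python 3.4 or later: `html.unescape` is a later stdlib function and not part of the patch discussed here, and the inputs are written with their leading '&'.

```python
import html
from html.entities import html5

# Eacute is a legacy name, so it is listed both with and without ';'.
eacute_optional = 'Eacute' in html5 and 'Eacute;' in html5
# alpha is only listed with the trailing ';'.
alpha_optional = 'alpha' in html5

samples = ['&Eacuteric', '&Eacute;ric', '&alpha;centauri', '&alphacentauri']
decoded = [html.unescape(s) for s in samples]
print(decoded)
```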
Éric Araujo mer...@netwok.org added the comment:
The explanations make sense, don’t change anything.
--
Ezio Melotti ezio.melo...@gmail.com added the comment:
http://www.w3.org/TR/html5/named-character-references.html lists 2152 HTML 5
entities (see also attached file for a dict generated from that table).
Currently html.entities only has 252 entities, organized in 3 dicts:
1) name -> intvalue
Martin v. Löwis mar...@v.loewis.de added the comment:
1) the current approach of having a dict with name -> intvalue doesn't work
anymore, and a name -> valuelist should be used instead;
2) the reverse dict for this would have to use tuples as keys, but I'm not
sure how useful would that
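To make the two points concrete, here is a small sketch, assuming the `html.entities.html5` dict that was eventually committed (Python 3.3+). The `multi` and `reverse` names are illustrative only; no reverse dict of this kind is claimed to exist in the stdlib.

```python
from html.entities import html5, name2codepoint

# name2codepoint can only hold a single int per name.
single = name2codepoint['eacute']

# Some HTML5 references expand to more than one code point.
multi = {name: value for name, value in html5.items() if len(value) > 1}
code_points = [ord(ch) for ch in html5['NotEqualTilde;']]

# A reverse mapping for these entries would need tuple keys.
reverse = {tuple(ord(ch) for ch in value): name for name, value in multi.items()}

print(single, [hex(cp) for cp in code_points])
```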
Changes by Ezio Melotti ezio.melo...@gmail.com:
--
assignee: -> ezio.melotti
Ezio Melotti ezio.melo...@gmail.com added the comment:
Having them in different mappings would be good, but I expect that for most
real-world applications a single mapping that includes them all is the way to
go. If I'm parsing a supposedly HTML page that contains an &apos; I'd rather
have it
Éric Araujo mer...@netwok.org added the comment:
Ah, this changes the situation. I suppose it’s too late to stop pretending
that HTML and XHTML are nearly the same thing (IOW change the doc), so apos
needs to be defined for XHTML.
IMO, we need a way to have the right entity references for
Éric Araujo mer...@netwok.org added the comment:
I just closed #12329 as a duplicate of this bug. It requested the addition of
the apos named entity reference.
TTBOMK, the html module (or htmlentitydefs in 2.x) doesn’t claim to support
XHTML; an XML parser should be used for XHTML.
Hans Peter de Koning h...@xs4all.nl added the comment:
The reason I raised #12329 was that the v2.7.1 documentation in
http://docs.python.org/library/htmllib.html#module-htmlentitydefs
says:
... The definition provided here contains all the entities defined by XHTML
1.0 ...
The only diff
Hans Peter de Koning h...@xs4all.nl added the comment:
BTW, the HTMLParser module (as well as html.parser in 3.x) does claim to parse
both HTML and XHTML, see
http://docs.python.org/library/htmlparser.html#module-HTMLParser .
--
Changes by Ezio Melotti ezio.melo...@gmail.com:
--
nosy: +ezio.melotti
Eric Smith e...@trueblade.com added the comment:
I don't see the need for a parameter to support different sets of entities.
Just supporting the ones from HTML 5 seems like the right thing.
--
nosy: +eric.smith
Éric Araujo mer...@netwok.org added the comment:
To make my intent explicit: an updated mapping could generate references
invalid for 4.01.
--
Eric Smith e...@trueblade.com added the comment:
Ah. I hadn't thought of generating them, only parsing them. In that case, then
yes, it's an issue for generation.
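A rough sketch of the generation side, to show why the mapping matters there. `encode_entities` is a hypothetical helper, not an existing function; it uses the current `codepoint2name`, so every reference it emits is an HTML 4 name. A version driven by an HTML 5 mapping could emit names that an HTML 4.01 consumer does not know.

```python
from html.entities import codepoint2name

def encode_entities(text):
    """Hypothetical helper: replace characters that have an HTML 4 name."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        # Characters without a named reference are passed through unchanged.
        out.append('&%s;' % name if name else ch)
    return ''.join(out)

encoded = encode_entities('caf\xe9 & cr\xe8me')
print(encoded)
```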
--
Martin v. Löwis mar...@v.loewis.de added the comment:
Supporting the ones in HTML 5 would be fine with me. Supporting those of
xml-entity-names would be inappropriate - it's not clear (to me, at least) that
all of them are really meant for use in HTML.
--
nosy: +loewis
Éric Araujo mer...@netwok.org added the comment:
Agreed with Martin. I wonder if we should provide a means to use only HTML
4.01 entity references (say with a function parameter html5 defaulting to True)
or we should just update the mapping.
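One way the parameter idea could look, purely as a sketch: `resolve_entity` and its `html5` flag are hypothetical, and the example assumes an HTML 5 dict with ';'-terminated keys exposed as `html.entities.html5` (Python 3.3+).

```python
from html.entities import html5 as html5_entities, name2codepoint

def resolve_entity(name, html5=True):
    """Hypothetical helper: return the text for a named reference, or None."""
    if html5:
        # HTML 5 names are looked up in their ';'-terminated form.
        return html5_entities.get(name + ';')
    codepoint = name2codepoint.get(name)
    return chr(codepoint) if codepoint is not None else None

print(repr(resolve_entity('apos')), repr(resolve_entity('apos', html5=False)))
```

With the flag off, apos is not resolved, since the HTML 4.01 mapping does not define it.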
--
nosy: +eric.araujo
stage: -> needs
New submission from Brian Jones bkjo...@gmail.com:
In Python 3.2b2, html.entities.codepoint2name and name2codepoint only support
the 252 HTML entity names defined in the HTML 4 spec from 1997. I'm wondering
if there's a reason not to support W3C Recommendation 'XML Entity Definitions
for