[issue11113] html.entities mapping dicts need updating?

2012-06-23 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

Attached another file with a dict that contains the 2231 HTML5 entities listed 
at http://www.w3.org/TR/html5/named-character-references.html
The dict is like:

html5namedcharref = {
    'Aacute;': '\xc1',
    'Aacute': '\xc1',
    'aacute;': '\xe1',
    'aacute': '\xe1',
    'Abreve;': '\u0102',
    'abreve;': '\u0103',
    # ...
}

Suggestions for a better name for the dict are welcome (maybe just 
html.entities.html5?).  The dict will be added to html.entities.

--
stage: needs patch -> patch review
Added file: http://bugs.python.org/file26107/entities.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue3
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11113] html.entities mapping dicts need updating?

2012-06-23 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

Here is a proper patch, still using the html5namedcharref name.
HTMLParser should also be updated to use this dict.

--
keywords: +patch
stage: patch review -> commit review
Added file: http://bugs.python.org/file26109/issue3.diff




[issue11113] html.entities mapping dicts need updating?

2012-06-23 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


Removed file: http://bugs.python.org/file26109/issue3.diff




[issue11113] html.entities mapping dicts need updating?

2012-06-23 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


Added file: http://bugs.python.org/file26110/issue3.diff




[issue11113] html.entities mapping dicts need updating?

2012-06-23 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

How about calling it just html5, or HTML5? That it is about entities 
already follows from the module name.

--




[issue11113] html.entities mapping dicts need updating?

2012-06-23 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

Here's a new patch that uses the html5 name for the dict.  If there are no 
other comments, I'll commit it.

--
Added file: http://bugs.python.org/file26113/issue3-2.diff




[issue11113] html.entities mapping dicts need updating?

2012-06-23 Thread Roundup Robot

Roundup Robot devn...@psf.upfronthosting.co.za added the comment:

New changeset 2b54e25d6ecb by Ezio Melotti in branch 'default':
#3: add a new html5 dictionary containing the named character references 
defined by the HTML5 standard and the equivalent Unicode character(s) to the 
html.entities module.
http://hg.python.org/cpython/rev/2b54e25d6ecb

--
nosy: +python-dev




[issue11113] html.entities mapping dicts need updating?

2012-06-23 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
resolution:  -> fixed
stage: commit review -> committed/rejected
status: open -> closed




[issue11113] html.entities mapping dicts need updating?

2012-06-23 Thread Éric Araujo

Éric Araujo mer...@netwok.org added the comment:

The ';' is not part of the entity name but an SGML delimiter, like '&'; the 
strings in the dict should not include it (the strings in the other dicts don't).

--




[issue11113] html.entities mapping dicts need updating?

2012-06-23 Thread Éric Araujo

Éric Araujo mer...@netwok.org added the comment:

BTW in the doc you may point to collections.ChainMap to explain to people how 
to make one dict with HTML 4 and HTML 5 entities.  (Note that I assume there 
are two dicts, but I only skimmed the diff.)
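
A minimal sketch of the ChainMap idea, assuming a Python version where html.entities.html5 exists (3.3 or later). The name html4_with_semicolon and the choice to append ';' to the HTML 4 names are illustrative only; they are needed here because the HTML 4 dict stores names without the trailing ';' while html5 mostly stores them with it.

```python
from collections import ChainMap
from html.entities import entitydefs, html5

# entitydefs (HTML 4) keys have no trailing ';', html5 keys mostly do,
# so normalise the HTML 4 names before chaining (an illustrative choice).
html4_with_semicolon = {name + ';': char for name, char in entitydefs.items()}

# Lookups try html5 first and fall back to the HTML 4 mapping.
all_entities = ChainMap(html5, html4_with_semicolon)

print(all_entities['amp;'])                  # present in both mappings
print(repr(all_entities['NotEqualTilde;']))  # HTML5 only, two code points
```

Putting html5 first means its values win wherever the two mappings disagree.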

--




[issue11113] html.entities mapping dicts need updating?

2012-06-23 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

The problem is that the standard allows some charrefs to end without a ';', but 
not all of them.

So both &Eacuteric and &Eacute;ric will be parsed as Éric, but only 
&alpha;centauri will result in αcentauri -- &alphacentauri will be 
returned unchanged.

I'm now working on #15156 to use this dict in HTMLParser, and detecting the 
';'-less entities is not easy.  A possible solution is to keep the names that 
are accepted without ';' in a separate (private) dict and expose a function 
like HTMLParser.unescape that implements all the necessary logic.
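
The behaviour described above can be tried out with a short sketch. It assumes a later Python (3.4 or newer) where html.unescape is available and implements this logic; the sample strings are the ones from this comment.

```python
from html import unescape
from html.entities import html5

# Names that may be written without ';' appear in the dict in both forms;
# names that require ';' appear only with it.
print('Eacute' in html5, 'Eacute;' in html5)
print('alpha' in html5, 'alpha;' in html5)

for text in ('&Eacuteric', '&Eacute;ric', '&alpha;centauri', '&alphacentauri'):
    print(text, '->', unescape(text))
```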

Regarding ChainMap, the html5 dict should be a superset of the html4 one.

--




[issue11113] html.entities mapping dicts need updating?

2012-06-23 Thread Éric Araujo

Éric Araujo mer...@netwok.org added the comment:

The explanations make sense, don’t change anything.

--




[issue11113] html.entities mapping dicts need updating?

2011-11-29 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

http://www.w3.org/TR/html5/named-character-references.html lists 2152 HTML 5 
entities (see also attached file for a dict generated from that table).
Currently html.entities only has 252 entities, organized in 3 dicts:
  1) name -> intvalue (e.g. 'amp': 0x0026);
  2) intvalue -> name (e.g. 0x0026: 'amp');
  3) name -> char (e.g. 'amp': '&');
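
For reference, a small sketch of how the three existing dicts are used, assuming the Python 3 module layout (html.entities):

```python
from html.entities import name2codepoint, codepoint2name, entitydefs

print(hex(name2codepoint['amp']))   # name -> code point
print(codepoint2name[0x26])         # code point -> name
print(entitydefs['amp'])            # name -> character
print(len(name2codepoint))          # number of HTML 4 names
```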

In HTML 5, some of the entities map to a sequence of 2 characters, for example 
&NotEqualTilde; corresponds to [U+2242, U+0338] (i.e. MINUS TILDE + COMBINING 
LONG SOLIDUS OVERLAY).
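
A short sketch of the multi-character case, assuming a Python version where the html5 dict discussed in this issue is available as html.entities.html5 (3.3 or later):

```python
import unicodedata
from html.entities import html5

# One name, two code points.
value = html5['NotEqualTilde;']
for ch in value:
    print('U+%04X' % ord(ch), unicodedata.name(ch))

# Collect every name whose value is longer than one character.
multi = {name: chars for name, chars in html5.items() if len(chars) > 1}
print(len(multi), 'names map to more than one character')
```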

This means that:
  1) the current approach of having a dict with name -> intvalue doesn't work 
anymore, and a name -> valuelist should be used instead;
  2) the reverse dict for this would have to use tuples as keys, but I'm not 
sure how useful would that be (producing entities is not a common case, 
especially unusual ones like these);
  3) the name -> char dict might still be useful, and can easily become a name 
-> str dict in order to deal with the multichar entities.

Since 1) is not backward-compatible the HTML5 entities should probably go in a 
separate dict.

Also note that the entities are case-sensitive and some of them include 
different spellings (e.g. both 'amp' and 'AMP' map to '&'), so the reverse dict 
won't work too well.  Having '&' -> 'amp' seems better than '&' -> 'AMP', but 
this might not be obvious for all the entities and requires some extra logic in 
the code to get it right.
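
One possible shape for that extra logic, as a sketch only. The preference function and its ranking (names ending in ';' first, then all-lowercase names, then shorter names) are hypothetical choices made for illustration, not something taken from the attached file or the standard library. It assumes html.entities.html5 is available (Python 3.3 or later).

```python
from html.entities import html5

def preference(name):
    # Hypothetical ranking, smaller tuples are preferred:
    # names with ';' first, then all-lowercase, then shorter, then alphabetical.
    return (not name.endswith(';'), name != name.lower(), len(name), name)

reverse = {}
for name in sorted(html5, key=preference):
    # The first (most preferred) name seen for a value is kept.
    reverse.setdefault(html5[name], name)

print(reverse['&'])
print(reverse['<'])
```

Multi-character values simply become multi-character str keys, so no tuples are needed.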

--
Added file: http://bugs.python.org/file23803/entities_dict.py




[issue11113] html.entities mapping dicts need updating?

2011-11-29 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

>   1) the current approach of having a dict with name -> intvalue doesn't work 
> anymore, and a name -> valuelist should be used instead;
>   2) the reverse dict for this would have to use tuples as keys, but I'm not 
> sure how useful would that be (producing entities is not a common case, 
> especially unusual ones like these).
>   3) The name -> char dict might still be useful, and can easily become a 
> name -> str dict in order to deal with the multichar entities;
> 
> Since 1) is not backward-compatible the HTML5 entities should probably go in 
> a separate dict.

+1 for a separate dict; -1 for a value list. The right value type is
'str'; name2codepoint ought to be deprecated (it's a left-over from
when the str type wasn't unicode in 2.x).

As for the reverse mapping: I'd add a dictionary that is reverse to
entitydefs (i.e. with str keys). That some keys then have two characters
is no real issue: applications that want to use this dictionary can
either ignore them, or follow the approach of always checking
Unicode combining characters - I'd expect that all second characters
are indeed combining.

OTOH, it's easy enough to create an inverted dictionary yourself
when you need it, and not every three-line function needs to be
in the standard library. It might actually be more useful to compile
the values into a regular expression which you can then use to
find out whether characters can be escaped using entity references.
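
Both suggestions fit in a few lines. This is a sketch under the assumption that entitydefs maps names to str values, as it does in Python 3; the helper name escape_entities is made up for the example.

```python
import re
from html.entities import entitydefs

# Inverted dictionary: character -> entity name.
inverted = {char: name for name, char in entitydefs.items()}

# One regular expression matching any character that has a named entity.
pattern = re.compile('|'.join(re.escape(char) for char in inverted))

def escape_entities(text):
    """Replace every character that has an HTML 4 name with &name;."""
    return pattern.sub(lambda m: '&%s;' % inverted[m.group()], text)

print(bool(pattern.search('plain ascii text')))
print(escape_entities('a < b & \xe9'))
```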

--




[issue11113] html.entities mapping dicts need updating?

2011-11-28 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
assignee:  -> ezio.melotti




[issue11113] html.entities mapping dicts need updating?

2011-07-20 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

Having them in different mappings would be good, but I expect that for most 
real-world applications a single mapping that includes them all is the way to 
go.  If I'm parsing a supposedly HTML page that contains an &apos; I'd rather 
have it converted even if it's not an HTML entity.
If the set of entities supported by HTML5 is a superset of the HTML4 and XHTML 
ones, then we might just use that (I haven't checked though).

--




[issue11113] html.entities mapping dicts need updating?

2011-06-15 Thread Éric Araujo

Éric Araujo mer...@netwok.org added the comment:

Ah, this changes the situation.  I suppose it’s too late to stop pretending 
that HTML and XHTML are nearly the same thing (IOW change the doc), so apos 
needs to be defined for XHTML.

IMO, we need a way to have the right entity references for HTML 4.01, XHTML 1.0 
and HTML5, not put them all in one mapping.

--




[issue11113] html.entities mapping dicts need updating?

2011-06-14 Thread Éric Araujo

Éric Araujo mer...@netwok.org added the comment:

I just closed #12329 as a duplicate of this bug.  It requested the addition of 
the apos named entity reference.

TTBOMK, the html module (or htmlentitydefs in 2.x) doesn’t claim to support 
XHTML; an XML parser should be used for XHTML.  In HTML 4.01, apos is 
not defined, but it is in HTML5.

--




[issue11113] html.entities mapping dicts need updating?

2011-06-14 Thread Hans Peter de Koning

Hans Peter de Koning h...@xs4all.nl added the comment:

The reason I raised #12329 was that the v2.7.1 documentation in
http://docs.python.org/library/htmllib.html#module-htmlentitydefs
says:
... The definition provided here contains all the entities defined by XHTML 
1.0 ...
The only diff between the 252 HTML 4.01 and 253 XHTML 1.0 entities is apos. 
See http://www.w3.org/TR/html401/sgml/entities.html and 
http://www.w3.org/TR/xhtml1/dtds.html .
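
The apos difference can be checked directly. This sketch assumes a Python 3 version that also ships the html5 dict (3.3 or later); name2codepoint is the HTML 4 name table.

```python
from html.entities import name2codepoint, html5

# apos is not among the HTML 4 names.
print('apos' in name2codepoint)

# The HTML5 table does define it.
print('apos;' in html5, repr(html5.get('apos;')))
```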

--
nosy: +hp.dekoning




[issue11113] html.entities mapping dicts need updating?

2011-06-14 Thread Hans Peter de Koning

Hans Peter de Koning h...@xs4all.nl added the comment:

BTW, the HTMLParser module (as well as html.parser in 3.x) does claim to parse 
both HTML and XHTML, see 
http://docs.python.org/library/htmlparser.html#module-HTMLParser .

--




[issue11113] html.entities mapping dicts need updating?

2011-06-14 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
nosy: +ezio.melotti

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue3
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue11113] html.entities mapping dicts need updating?

2011-02-06 Thread Eric Smith

Eric Smith e...@trueblade.com added the comment:

I don't see the need for a parameter to support different sets of entities. 
Just supporting the ones from HTML 5 seems like the right thing.

--
nosy: +eric.smith




[issue11113] html.entities mapping dicts need updating?

2011-02-06 Thread Éric Araujo

Éric Araujo mer...@netwok.org added the comment:

To make my intent explicit: an updated mapping could generate references 
invalid for 4.01.

--




[issue11113] html.entities mapping dicts need updating?

2011-02-06 Thread Eric Smith

Eric Smith e...@trueblade.com added the comment:

Ah. I hadn't thought of generating them, only parsing them. In that case, then 
yes, it's an issue for generation.

--





[issue11113] html.entities mapping dicts need updating?

2011-02-04 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

Supporting the ones in HTML 5 would be fine with me. Supporting those of 
xml-entity-names would be inappropriate - it's not clear (to me, at least) that 
all of them are really meant for use in HTML.

--
nosy: +loewis




[issue11113] html.entities mapping dicts need updating?

2011-02-04 Thread Éric Araujo

Éric Araujo mer...@netwok.org added the comment:

Agreed with Martin.  I wonder if we should provide a means to use only HTML 
4.01 entity references (say with a function parameter html5 defaulting to True) 
or we should just update the mapping.

--
nosy: +eric.araujo
stage:  -> needs patch
versions: +Python 3.3 -Python 3.2




[issue11113] html.entities mapping dicts need updating?

2011-02-03 Thread Brian Jones

New submission from Brian Jones bkjo...@gmail.com:

In Python 3.2b2, html.entities.codepoint2name and name2codepoint only support 
the 252 HTML entity names defined in the HTML 4 spec from 1997. I'm wondering 
if there's a reason not to support W3C Recommendation 'XML Entity Definitions 
for Characters' 

http://www.w3.org/TR/xml-entity-names/

This standard contains significantly more characters, and it is noted in that 
spec that the HTML 5 drafts use that spec's entities. You can see the current 
HTML 5 'Named character references' here: 

http://www.w3.org/TR/html5/named-character-references.html#named-character-references

If this is just a matter of somebody going in to do the grunt work, let me 
know. 

If startup costs associated with importing a huge dictionary are a concern, 
perhaps a more efficient type that enables the same lookup interface can be 
defined. 

If other reasons exist to not move in this direction, please do let me know!

--
components: Library (Lib), Unicode, XML
messages: 127865
nosy: Brian.Jones
priority: normal
severity: normal
status: open
title: html.entities mapping dicts need updating?
type: feature request
versions: Python 3.2
