Re: [webkit-dev] HTML5 & MathML3 entities
> From: aba...@webkit.org
> Date: Sun, 11 Jul 2010 16:33:57 -0700
> To: m...@apple.com
> CC: webkit-dev@lists.webkit.org
> Subject: Re: [webkit-dev] HTML5 & MathML3 entities
>
> On Sat, Jul 10, 2010 at 6:28 PM, Maciej Stachowiak wrote:
>> On Jul 10, 2010, at 11:10 AM, Sausset François wrote:
>>> I just saw that when looking at the code by myself.
>>> What do you exactly mean by a prefix tree?
>>
>> The data structure commonly called a "Trie" is a prefix tree:
>> http://en.wikipedia.org/wiki/Trie

Never missing a chance to reveal my ignorance, I didn't know these had a
name. However, I am using a prefix tree to store URLs in our browser cache.
I kept vacillating over redundant hashes and linked lists as well as
explicit copies of complete keys (uh, I don't want to explain now). As
pointed out, this helps eliminate redundant prefixes ("http://" may come up
a lot). Also, of course, memory coherence options proliferate if you start
thinking about things like this.

>>
>> This data structure not only lets you tell if a particular key is present,
>> but it also lets you check if a string you have could possibly be the prefix
>> of any valid key.
>>
>> I think it is challenging, though, to make a trie structure that can be a
>> compile-time constant, and building one dynamically will cost runtime memory
>> per-process (whereas constant data would be in the data segment and shared).
>>
>> Another possibility is to make an array of all the entity names in sorted
>> order. Then lookup can use a binary search, and on a failed lookup, looking
>> to either side of the last key checked should determine whether it is a
>> valid prefix.
>>
>> I expect binary search would be slower than Trie lookup, though I don't know
>> by how much.
>
> Binary search will certainly be easier to implement. Let's start with
> that and experiment with prefix trees as a possible performance
> optimization. I'll give it a try now.

When I did this, I wrote the code in Java but made heavy use of conditional
compilation to get it to work on J2ME or J2SE. This proved invaluable, since
lots of subtle low-probability errors can occur and debugging in the target
setting (a phone) would have taken forever. Certainly with Java simple
things can be surprisingly slow (like recreating a key from pieces if you
need to do it a lot), but with things that translate to native code this may
be easier to optimize. Also, I'm not sure about the wiki speed analysis. If
you do simple string compares on highly redundant keys, you spend a lot of
time comparing "http://www." to "http:///www." etc. Fail-fast equality
compares as well as memory compaction could offer many benefits. It's too
early for me to think about orders and logs LOL.

>
> Adam

___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
Re: [webkit-dev] HTML5 & MathML3 entities
On Sat, Jul 10, 2010 at 6:28 PM, Maciej Stachowiak wrote:
> On Jul 10, 2010, at 11:10 AM, Sausset François wrote:
>> I just saw that when looking at the code by myself.
>> What do you exactly mean by a prefix tree?
>
> The data structure commonly called a "Trie" is a prefix tree:
> http://en.wikipedia.org/wiki/Trie
>
> This data structure not only lets you tell if a particular key is present,
> but it also lets you check if a string you have could possibly be the prefix
> of any valid key.
>
> I think it is challenging, though, to make a trie structure that can be a
> compile-time constant, and building one dynamically will cost runtime memory
> per-process (whereas constant data would be in the data segment and shared).
>
> Another possibility is to make an array of all the entity names in sorted
> order. Then lookup can use a binary search, and on a failed lookup, looking
> to either side of the last key checked should determine whether it is a valid
> prefix.
>
> I expect binary search would be slower than Trie lookup, though I don't know
> by how much.

Binary search will certainly be easier to implement. Let's start with
that and experiment with prefix trees as a possible performance
optimization. I'll give it a try now.

Adam
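The sorted-array approach described above can be sketched roughly as follows. This is only an illustration in Python, not WebKit code: the name list is a tiny hypothetical subset, and `classify` is a made-up helper name. It relies on the fact that all keys sharing a prefix sit next to each other in sorted order, so the key at the insertion point answers both questions.

```python
from bisect import bisect_left

# Hypothetical, tiny subset of entity names; the real table is far larger.
ENTITY_NAMES = sorted(["amp", "lang", "langle", "lt", "rang", "rangle"])


def classify(candidate):
    """Return (is_entity, is_prefix) for candidate using one binary search."""
    # Index of the first name that sorts at or after the candidate.
    i = bisect_left(ENTITY_NAMES, candidate)
    if i == len(ENTITY_NAMES):
        # Every name sorts before the candidate, so nothing can extend it.
        return (False, False)
    neighbor = ENTITY_NAMES[i]
    # If any name starts with the candidate, the first name that is
    # greater than or equal to it must start with it too.
    return (neighbor == candidate, neighbor.startswith(candidate))
```

With `bisect_left` only the key at the insertion point has to be inspected; a hand-written binary search that stops at an arbitrary "last key checked" would need to look at both neighbours, as the message suggests.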
Re: [webkit-dev] HTML5 & MathML3 entities
My aim was not to rush. I'm currently looking at what needs to be
implemented in WebKit to support MathML 3. I noticed that a lot of entities
are not implemented, and I first thought they would be easy to add. After
this discussion on the mailing list, it appears not to be so simple.

I filed a bug to continue the discussion and track progress on refactoring
the entity parser:
https://bugs.webkit.org/show_bug.cgi?id=42041

François Sausset

On 11 Jul 2010, at 04:21, Maciej Stachowiak wrote:

>
> On Jul 10, 2010, at 9:36 AM, Alexey Proskuryakov wrote:
>
>>
>> On 10.07.2010, at 04:49, Maciej Stachowiak wrote:
>>
>>> Go with the HTML5 / MathML 3 definitions for everything. Our XHTML
>>> implementation targets XHTML5, not XHTML 1.0.
>>
>>
>> I think that xml-entity-names and HTML5 made a poor choice changing the
>> semantics of &rang; and &lang; (they used to be CJK punctuation, and now
>> they are suddenly math). These are rendered differently. We should probably
>> take a pragmatic approach, and avoid rushing to be the first to implement
>> this aspect of the specs.
>
> I agree we shouldn't rush on potential compatibility-breaking changes, if we
> can get someone else to do some testing for us first. However I believe
> Firefox dev builds have the new meanings of &rang; and &lang;. They haven't
> discovered a problem yet, as far as I know.
>
> Regards,
> Maciej
Re: [webkit-dev] HTML5 & MathML3 entities
On Jul 10, 2010, at 11:10 AM, Sausset François wrote:
> I just saw that when looking at the code by myself.
> What do you exactly mean by a prefix tree?

The data structure commonly called a "Trie" is a prefix tree:
http://en.wikipedia.org/wiki/Trie

This data structure not only lets you tell if a particular key is present,
but it also lets you check if a string you have could possibly be the prefix
of any valid key.

I think it is challenging, though, to make a trie structure that can be a
compile-time constant, and building one dynamically will cost runtime memory
per-process (whereas constant data would be in the data segment and shared).

Another possibility is to make an array of all the entity names in sorted
order. Then lookup can use a binary search, and on a failed lookup, looking
to either side of the last key checked should determine whether it is a
valid prefix.

I expect binary search would be slower than Trie lookup, though I don't know
by how much.

Regards,
Maciej
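For readers who have not met the structure, here is a minimal sketch of a prefix tree, written in Python for brevity. It is built dynamically, which is exactly the per-process cost mentioned above, so it illustrates the idea and is not a proposal for WebKit. `TrieNode`, `build_trie` and `walk` are hypothetical names, and the entity table is a small made-up subset.

```python
class TrieNode:
    """One node per character; value is set only where a full name ends."""

    def __init__(self):
        self.children = {}
        self.value = None


def build_trie(entities):
    """Build a trie from a {name: replacement} mapping."""
    root = TrieNode()
    for name, value in entities.items():
        node = root
        for ch in name:
            node = node.children.setdefault(ch, TrieNode())
        node.value = value
    return root


def walk(root, candidate):
    """Return the node reached by candidate, or None if no name starts with it."""
    node = root
    for ch in candidate:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


# Hypothetical subset of names, for illustration only.
ROOT = build_trie({"lt": "<", "lang": "\u27e8", "langle": "\u27e8"})
```

A non-None result from `walk` means the candidate is a prefix of at least one name, and a non-None `value` on that node means it is also a complete name. Both answers cost one step per character, regardless of how many names are stored.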
Re: [webkit-dev] HTML5 & MathML3 entities
On Jul 10, 2010, at 9:36 AM, Alexey Proskuryakov wrote:
>
> On 10.07.2010, at 04:49, Maciej Stachowiak wrote:
>
>> Go with the HTML5 / MathML 3 definitions for everything. Our XHTML
>> implementation targets XHTML5, not XHTML 1.0.
>
>
> I think that xml-entity-names and HTML5 made a poor choice changing the
> semantics of &rang; and &lang; (they used to be CJK punctuation, and now they
> are suddenly math). These are rendered differently. We should probably take a
> pragmatic approach, and avoid rushing to be the first to implement this
> aspect of the specs.

I agree we shouldn't rush on potential compatibility-breaking changes, if we
can get someone else to do some testing for us first. However I believe
Firefox dev builds have the new meanings of &rang; and &lang;. They haven't
discovered a problem yet, as far as I know.

Regards,
Maciej
Re: [webkit-dev] HTML5 & MathML3 entities
I'm not sure I understand everything, but the given link doesn't deal with
the case where an entity should be translated to two Unicode characters
instead of only one, as is the case with the current hash table system.
Such two-character entities don't exist in the HTML 5 entity list, but some
are present in the one used by MathML 3 (link in my previous message).

François Sausset

On 10 Jul 2010, at 21:17, Adam Barth wrote:

> On Sat, Jul 10, 2010 at 11:10 AM, Sausset François wrote:
>> I just saw that when looking at the code by myself.
>> What do you exactly mean by a prefix tree?
>
> http://en.wikipedia.org/wiki/Trie
>
>> I also noticed that the entity parser does not take into account combined
>> Unicode characters (see §A.3 in: http://www.w3.org/TR/xml-entity-names/).
>> In addition, even without entities, combined characters are displayed as
>> separate ones.
>
> My understanding is that is the correct behavior w.r.t. the HTML5
> specification of entity parsing. Our entity processing aims for
> perfect compliance with this algorithm:
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references
>
> My belief is the only things we're missing for perfect compliance are
> the expanded list of entity names:
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-references.html#named-character-references
>
> and the prefix tree.
>
> Adam
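One way to picture the two-character case: let every table entry hold a sequence of code points rather than a single one. The sketch below is illustrative Python, not the WebKit table format. `ENTITY_CODE_POINTS` and `expand` are hypothetical names, and the `NotEqualTilde` mapping (U+2242 followed by the combining U+0338) is quoted from memory of xml-entity-names, so it should be checked against the spec before being relied on.

```python
# Each value is a tuple of code points, so one name can expand to several.
ENTITY_CODE_POINTS = {
    "lt": (0x003C,),
    # Assumed from xml-entity-names: base character plus combining overlay.
    "NotEqualTilde": (0x2242, 0x0338),
}


def expand(name):
    """Return the replacement string for name, or None if it is unknown."""
    code_points = ENTITY_CODE_POINTS.get(name)
    if code_points is None:
        return None
    return "".join(chr(cp) for cp in code_points)
```

Single-character entities become one-element tuples, so callers never have to special-case the longer expansions.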
Re: [webkit-dev] HTML5 & MathML3 entities
On Sat, Jul 10, 2010 at 11:10 AM, Sausset François wrote:
> I just saw that when looking at the code by myself.
> What do you exactly mean by a prefix tree?

http://en.wikipedia.org/wiki/Trie

> I also noticed that the entity parser does not take into account combined
> Unicode characters (see §A.3 in: http://www.w3.org/TR/xml-entity-names/).
> In addition, even without entities, combined characters are displayed as
> separate ones.

My understanding is that is the correct behavior w.r.t. the HTML5
specification of entity parsing. Our entity processing aims for
perfect compliance with this algorithm:

http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references

My belief is the only things we're missing for perfect compliance are
the expanded list of entity names:

http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-references.html#named-character-references

and the prefix tree.

Adam

> On 10 Jul 2010, at 21:00, Adam Barth wrote:
> Implementing MathML entities is not as easy as adding them to
> HTMLEntityNames.gperf. The problem is our entity parsing code (both
> the legacy entity parser and the new HTML5 one we're using) assumes
> that all named entities are <= 8 characters:
>
> http://trac.webkit.org/browser/trunk/WebCore/html/HTMLEntityParser.cpp#L194
>
> Rather than just bumping up that number, we need to change the data
> structure we use to store entities. Instead of a perfect hash, we
> should use a prefix tree. In order to parse entities correctly
> according to the spec, we need to know whether a given string is a
> prefix of a named entity, which is what the prefix tree would tell us.
>
> Adam
Re: [webkit-dev] HTML5 & MathML3 entities
I just saw that when looking at the code by myself.
What do you exactly mean by a prefix tree?

I also noticed that the entity parser does not take into account combined
Unicode characters (see §A.3 in: http://www.w3.org/TR/xml-entity-names/).
In addition, even without entities, combined characters are displayed as
separate ones.

François Sausset

On 10 Jul 2010, at 21:00, Adam Barth wrote:

> Implementing MathML entities is not as easy as adding them to
> HTMLEntityNames.gperf. The problem is our entity parsing code (both
> the legacy entity parser and the new HTML5 one we're using) assumes
> that all named entities are <= 8 characters:
>
> http://trac.webkit.org/browser/trunk/WebCore/html/HTMLEntityParser.cpp#L194
>
> Rather than just bumping up that number, we need to change the data
> structure we use to store entities. Instead of a perfect hash, we
> should use a prefix tree. In order to parse entities correctly
> according to the spec, we need to know whether a given string is a
> prefix of a named entity, which is what the prefix tree would tell us.
>
> Adam
Re: [webkit-dev] HTML5 & MathML3 entities
On Sat, Jul 10, 2010 at 4:49 AM, Maciej Stachowiak wrote:
> On Jul 10, 2010, at 3:47 AM, Sausset François wrote:
>> I'm currently working on the MathML3 implementation and I noticed that new
>> XML entities have been defined by the W3C:
>> http://www.w3.org/TR/xml-entity-names/
>>
>> They are supposed to be used by both HTML 5 & MathML 3.
>>
>> I would like to include them in WebCore/html/HTMLEntityNames.gperf.
>> However there is one conflict with the existing XHTML 1.0 entities: \rangle
>> (and \langle) doesn't point to the same Unicode character in XHTML 1.0 and
>> HTML 5 entity definitions.
>> For instance, U+27E9 ("⟩") instead of U+3009 ("〉").
>>
>> There are two possibilities:
>> - either update WebCore/html/HTMLEntityNames.gperf and overwrite the two
>> conflicting cases with the new standard, but it won't respect the XHTML 1.0
>> specification anymore.
>> - or use two sets of HTML entities depending on the DTD of the document. It
>> would be the cleanest way, but I don't know how to make WebCore handle two
>> such sets.
>>
>> I think the best solution is the second one, but I'll need help to make
>> WebCore handle two entity sets and switch depending on the DTD. It is
>> outside of my present skills.
>
> Go with the HTML5 / MathML 3 definitions for everything. Our XHTML
> implementation targets XHTML5, not XHTML 1.0.

Implementing MathML entities is not as easy as adding them to
HTMLEntityNames.gperf. The problem is our entity parsing code (both
the legacy entity parser and the new HTML5 one we're using) assumes
that all named entities are <= 8 characters:

http://trac.webkit.org/browser/trunk/WebCore/html/HTMLEntityParser.cpp#L194

Rather than just bumping up that number, we need to change the data
structure we use to store entities. Instead of a perfect hash, we
should use a prefix tree. In order to parse entities correctly
according to the spec, we need to know whether a given string is a
prefix of a named entity, which is what the prefix tree would tell us.

Adam
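To show why the prefix question matters to the parser, here is a deliberately simplified sketch of longest-match consumption in Python. It is not the HTML5 tokenizer algorithm: it ignores the semicolon and attribute-value rules. The name table is a hypothetical subset, and `is_prefix` does a linear scan purely for clarity; a prefix tree or sorted array would replace it in practice.

```python
# Hypothetical subset of entity names and their replacements.
NAMES = {"lt": "<", "not": "\u00ac", "notin": "\u2209"}


def is_prefix(s):
    """True if s is a prefix of at least one known name (linear scan)."""
    return any(name.startswith(s) for name in NAMES)


def consume_entity(text):
    """text starts just after '&'.

    Returns (replacement, characters_consumed), or (None, 0) on no match.
    """
    best = (None, 0)
    buf = ""
    for ch in text:
        buf += ch
        if not is_prefix(buf):
            # No name can start this way, so stop reading input.
            break
        if buf in NAMES:
            # Remember the longest complete name seen so far.
            best = (NAMES[buf], len(buf))
    return best
```

A perfect hash can only say whether `buf` is a complete name. It cannot say whether reading one more character might still lead to a match, so the parser has to fall back on a fixed length cap such as the 8-character limit mentioned above.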
Re: [webkit-dev] HTML5 & MathML3 entities
On 10.07.2010, at 04:49, Maciej Stachowiak wrote:

> Go with the HTML5 / MathML 3 definitions for everything. Our XHTML
> implementation targets XHTML5, not XHTML 1.0.

I think that xml-entity-names and HTML5 made a poor choice changing the
semantics of &rang; and &lang; (they used to be CJK punctuation, and now
they are suddenly math). These are rendered differently. We should probably
take a pragmatic approach, and avoid rushing to be the first to implement
this aspect of the specs.

The new entities defined for math specifically should of course use math
characters.

- WBR, Alexey Proskuryakov
Re: [webkit-dev] HTML5 & MathML3 entities
On Jul 10, 2010, at 3:47 AM, Sausset François wrote:
> I'm currently working on the MathML3 implementation and I noticed that new
> XML entities have been defined by the W3C:
> http://www.w3.org/TR/xml-entity-names/
>
> They are supposed to be used by both HTML 5 & MathML 3.
>
> I would like to include them in WebCore/html/HTMLEntityNames.gperf.
> However there is one conflict with the existing XHTML 1.0 entities: \rangle
> (and \langle) doesn't point to the same Unicode character in XHTML 1.0 and
> HTML 5 entity definitions.
> For instance, U+27E9 ("⟩") instead of U+3009 ("〉").
>
> There are two possibilities:
> - either update WebCore/html/HTMLEntityNames.gperf and overwrite the two
> conflicting cases with the new standard, but it won't respect the XHTML 1.0
> specification anymore.
> - or use two sets of HTML entities depending on the DTD of the document. It
> would be the cleanest way, but I don't know how to make WebCore handle two
> such sets.
>
> I think the best solution is the second one, but I'll need help to make
> WebCore handle two entity sets and switch depending on the DTD. It is outside
> of my present skills.

Go with the HTML5 / MathML 3 definitions for everything. Our XHTML
implementation targets XHTML5, not XHTML 1.0.

Regards,
Maciej
[webkit-dev] HTML5 & MathML3 entities
I'm currently working on the MathML3 implementation and I noticed that new
XML entities have been defined by the W3C:
http://www.w3.org/TR/xml-entity-names/

They are supposed to be used by both HTML 5 & MathML 3.

I would like to include them in WebCore/html/HTMLEntityNames.gperf.
However there is one conflict with the existing XHTML 1.0 entities: \rangle
(and \langle) doesn't point to the same Unicode character in XHTML 1.0 and
HTML 5 entity definitions.
For instance, U+27E9 ("⟩") instead of U+3009 ("〉").

There are two possibilities:
- either update WebCore/html/HTMLEntityNames.gperf and overwrite the two
conflicting cases with the new standard, but it won't respect the XHTML 1.0
specification anymore.
- or use two sets of HTML entities depending on the DTD of the document. It
would be the cleanest way, but I don't know how to make WebCore handle two
such sets.

I think the best solution is the second one, but I'll need help to make
WebCore handle two entity sets and switch depending on the DTD. It is
outside of my present skills.

François Sausset
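The second possibility, two entity sets selected by DTD, could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the names `HTML5_ENTITIES`, `XHTML1_OVERRIDES` and `resolve`, the use of the short name `rang` for the conflicting entity, and the flag standing in for DTD detection. The two code points are the ones quoted in the message and have not been checked against the DTDs.

```python
# Default set, following the HTML5 / MathML 3 definitions.
HTML5_ENTITIES = {"lt": 0x003C, "rang": 0x27E9}

# Only the conflicting names need an entry in the legacy set.
XHTML1_OVERRIDES = {"rang": 0x3009}


def resolve(name, use_xhtml1_set=False):
    """Return the character for name, or None if the name is unknown.

    use_xhtml1_set stands in for "the document declared an XHTML 1.0 DTD".
    """
    if use_xhtml1_set and name in XHTML1_OVERRIDES:
        return chr(XHTML1_OVERRIDES[name])
    code_point = HTML5_ENTITIES.get(name)
    return None if code_point is None else chr(code_point)
```

Storing only the differences keeps the legacy table tiny, but every lookup then carries a per-document mode flag, which is part of what makes this option more invasive than overwriting the two entries.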