Re: [webkit-dev] HTML5 & MathML3 entities

2010-07-12 Thread Mike Marchywka

> From: aba...@webkit.org
> Date: Sun, 11 Jul 2010 16:33:57 -0700
> To: m...@apple.com
> CC: webkit-dev@lists.webkit.org
> Subject: Re: [webkit-dev] HTML5 & MathML3 entities
>
> On Sat, Jul 10, 2010 at 6:28 PM, Maciej Stachowiak wrote:
>> On Jul 10, 2010, at 11:10 AM, Sausset François wrote:
>>> I just saw that when looking at the code myself.
>>> What exactly do you mean by a prefix tree?
>>
>> The data structure commonly called a "Trie" is a prefix tree:
>> http://en.wikipedia.org/wiki/Trie

Never missing a chance to reveal my ignorance, I didn't know these had a name.
However, I am using a prefix tree to store URLs in our browser cache.
I did keep vacillating over redundant hashes and linked lists, as well as
explicit copies of complete keys (uh, I don't want to explain now). As pointed
out, this helps eliminate redundant prefixes (http:// may come up a lot).
Also, of course, memory coherence options proliferate if you start thinking
about things like this.
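
To illustrate the redundant-prefix point, here is a minimal Python sketch. It is
not the cache code described above: it assumes a prefix tree built from nested
dicts and three made-up URLs, and it only counts how many characters are stored
with and without prefix sharing. The names build_prefix_tree and count_nodes
are hypothetical.

```python
def build_prefix_tree(keys):
    """Nested dicts: each level maps one character to the subtree after it.

    Illustration only: there is no end-of-key marker, so this shows storage
    sharing, not lookup.
    """
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
    return root


def count_nodes(node):
    """Count the character nodes stored below `node`."""
    return sum(1 + count_nodes(child) for child in node.values())


# Made-up sample URLs that share the "http://www." prefix.
urls = ["http://www.a.org/", "http://www.b.org/", "http://www.c.org/"]

flat_chars = sum(len(u) for u in urls)            # every key stored in full
tree_chars = count_nodes(build_prefix_tree(urls))  # shared prefix stored once
```

With these three URLs the flat list holds 51 characters, while the tree holds 29
nodes because the 11-character "http://www." prefix is stored once. A real node
costs more than one character, so the actual saving depends on the node layout.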

>>
>> This data structure not only lets you tell if a particular key is present, 
>> but it also lets you check if a string you have could possibly be the prefix 
>> of any valid key.
>>
>> I think it is challenging, though, to make a trie structure that can be a 
>> compile-time constant, and building one dynamically will cost runtime memory 
>> per-process (whereas constant data would be in the data segment and shared).
>>
>> Another possibility is to make an array of all the entity names in sorted 
>> order. Then lookup can use a binary search, and on a failed lookup, looking 
>> to either side of the last key checked should determine whether it is a 
>> valid prefix.
>>
>> I expect binary search would be slower than Trie lookup, though I don't know 
>> by how much.
>
> Binary search will certainly be easier to implement. Let's start with
> that and experiment with prefix trees as a possible performance
> optimization. I'll give it a try now.

When I did this, I wrote the code in Java but made heavy use of conditional
compilation to get it to work on J2ME or J2SE. This proved to be invaluable,
since lots of subtle, low-probability errors can occur and debugging in the
target setting (a phone) would have taken forever.
Certainly with Java, simple things can be surprisingly slow (like recreating a
key from pieces if you need to do it a lot), but with things that translate to
native code this may be easier to optimize.

Also, I'm not sure about the wiki speed analysis. If you do simple string
compares on highly redundant keys, you spend a lot of time comparing
"http://www." to "http:///www." etc.
Fail-fast equality compares as well as memory compaction could offer many
benefits.
It's too early for me to think about orders and logs LOL.
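
As a rough illustration of the cost of comparing keys with a long shared
prefix, here is a small Python sketch. The helper names chars_compared and
fail_fast_equal and the sample URLs are made up for this example, and the
"check the last character first" heuristic is only one possible reading of
"fail fast".

```python
def chars_compared(a, b, skip=0):
    """Count the character comparisons a left-to-right scan makes
    before it hits a mismatch (or runs out of characters)."""
    count = 0
    for x, y in zip(a[skip:], b[skip:]):
        count += 1
        if x != y:
            break
    return count


def fail_fast_equal(a, b):
    """Equality test that tries cheap rejections before the full scan."""
    if len(a) != len(b):
        return False
    # Assumption: keys sharing a prefix are more likely to differ near the end.
    if a[-1:] != b[-1:]:
        return False
    return a == b


a = "http://www.example.com/a"
b = "http://www.example.org/b"

naive = chars_compared(a, b)                              # scans the shared prefix too
skipping = chars_compared(a, b, skip=len("http://www."))  # prefix known to match
```

For this pair the plain scan makes 20 comparisons before finding the mismatch,
and 9 when the known "http://www." prefix is skipped.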

>
> Adam
___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev


Re: [webkit-dev] HTML5 & MathML3 entities

2010-07-11 Thread Adam Barth
On Sat, Jul 10, 2010 at 6:28 PM, Maciej Stachowiak wrote:
> On Jul 10, 2010, at 11:10 AM, Sausset François wrote:
>> I just saw that when looking at the code myself.
>> What exactly do you mean by a prefix tree?
>
> The data structure commonly called a "Trie" is a prefix tree:
> http://en.wikipedia.org/wiki/Trie
>
> This data structure not only lets you tell if a particular key is present, 
> but it also lets you check if a string you have could possibly be the prefix 
> of any valid key.
>
> I think it is challenging, though, to make a trie structure that can be a 
> compile-time constant, and building one dynamically will cost runtime memory 
> per-process (whereas constant data would be in the data segment and shared).
>
> Another possibility is to make an array of all the entity names in sorted 
> order. Then lookup can use a binary search, and on a failed lookup, looking 
> to either side of the last key checked should determine whether it is a valid 
> prefix.
>
> I expect binary search would be slower than Trie lookup, though I don't know 
> by how much.

Binary search will certainly be easier to implement.  Let's start with
that and experiment with prefix trees as a possible performance
optimization.  I'll give it a try now.

Adam


Re: [webkit-dev] HTML5 & MathML3 entities

2010-07-11 Thread Sausset François
My aim was not to rush.
I'm currently looking at what needs to be implemented in WebKit to support 
MathML 3.
I noticed that a lot of entities are not implemented, and I first thought they
would be easy to add.

After this discussion on the mailing list, it appears not to be so simple.

I filed a bug to continue the discussion and track progress on refactoring the
entity parser:
https://bugs.webkit.org/show_bug.cgi?id=42041

François Sausset


On 11 Jul 2010, at 04:21, Maciej Stachowiak wrote:

> 
> On Jul 10, 2010, at 9:36 AM, Alexey Proskuryakov wrote:
> 
>> 
>> On 10.07.2010, at 04:49, Maciej Stachowiak wrote:
>> 
>>> Go with the HTML5 / MathML 3 definitions for everything. Our XHTML 
>>> implementation targets XHTML5, not XHTML 1.0.
>> 
>> 
>> I think that xml-entity-names and HTML5 made a poor choice changing the 
>> semantics of &rang; and &lang; (they used to be CJK punctuation, and now
>> they are suddenly math). These are rendered differently. We should probably 
>> take a pragmatic approach, and avoid rushing to be the first to implement 
>> this aspect of the specs.
> 
> I agree we shouldn't rush on potential compatibility-breaking changes, if we 
> can get someone else to do some testing for us first. However I believe 
> Firefox dev builds have the new meanings of &rang; and &lang;. They haven't
> discovered a problem yet, as far as I know.
> 
> Regards,
> Maciej
> 
> 



Re: [webkit-dev] HTML5 & MathML3 entities

2010-07-10 Thread Maciej Stachowiak

On Jul 10, 2010, at 11:10 AM, Sausset François wrote:

> I just saw that when looking at the code myself.
> What exactly do you mean by a prefix tree?

The data structure commonly called a "Trie" is a prefix tree:
http://en.wikipedia.org/wiki/Trie

This data structure not only lets you tell if a particular key is present, but 
it also lets you check if a string you have could possibly be the prefix of any 
valid key.
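
A minimal Python sketch of these two queries follows. It is an illustration
under stated assumptions, not WebKit code: the Trie class, its method names,
and the handful of sample entity names are all made up for the example.

```python
_END = ""  # marker key; never collides with a one-character child key


class Trie:
    """Prefix tree over strings, stored as nested dicts."""

    def __init__(self, keys=()):
        self._root = {}
        for key in keys:
            self.insert(key)

    def insert(self, key):
        node = self._root
        for ch in key:
            node = node.setdefault(ch, {})
        node[_END] = True  # a complete key ends at this node

    def _walk(self, s):
        """Follow s character by character; None if the path does not exist."""
        node = self._root
        for ch in s:
            node = node.get(ch)
            if node is None:
                return None
        return node

    def contains(self, key):
        """True if key was inserted as a complete key."""
        node = self._walk(key)
        return node is not None and _END in node

    def is_prefix(self, s):
        """True if s is a prefix of (or equal to) at least one key."""
        return self._walk(s) is not None


# Sample names only; the real entity list is far longer.
entities = Trie(["amp", "lt", "gt", "lang", "rang", "NotEqualTilde"])
```

With this sample set, "lan" is not a key but is still a valid prefix, whereas
"lax" is neither, so a parser could stop consuming input as soon as it sees it.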

I think it is challenging, though, to make a trie structure that can be a 
compile-time constant, and building one dynamically will cost runtime memory 
per-process (whereas constant data would be in the data segment and shared).

Another possibility is to make an array of all the entity names in sorted 
order. Then lookup can use a binary search, and on a failed lookup, looking to 
either side of the last key checked should determine whether it is a valid 
prefix.

I expect binary search would be slower than Trie lookup, though I don't know by 
how much.
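
The sorted-array alternative can be sketched in Python as follows. This is only
an illustration with a few sample names; lookup and is_prefix are hypothetical
helpers. It relies on the fact that all keys sharing a prefix sort contiguously,
starting at the position where the prefix itself would be inserted, so checking
that one neighbor is enough.

```python
from bisect import bisect_left

# Sample names only, kept in sorted order (a constant table in a real build).
ENTITY_NAMES = sorted(["amp", "gt", "lang", "lt", "nbsp", "rang", "NotEqualTilde"])


def lookup(name):
    """Binary search for an exact entity name."""
    i = bisect_left(ENTITY_NAMES, name)
    return i < len(ENTITY_NAMES) and ENTITY_NAMES[i] == name


def is_prefix(s):
    """True if s is a prefix of (or equal to) some entity name.

    The first name that sorts at or after s is the only candidate that
    needs checking.
    """
    i = bisect_left(ENTITY_NAMES, s)
    return i < len(ENTITY_NAMES) and ENTITY_NAMES[i].startswith(s)
```

Each query costs one binary search, so it makes O(log n) string comparisons,
compared with one step per character for a trie walk.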

Regards,
Maciej



Re: [webkit-dev] HTML5 & MathML3 entities

2010-07-10 Thread Maciej Stachowiak

On Jul 10, 2010, at 9:36 AM, Alexey Proskuryakov wrote:

> 
> On 10.07.2010, at 04:49, Maciej Stachowiak wrote:
> 
>> Go with the HTML5 / MathML 3 definitions for everything. Our XHTML 
>> implementation targets XHTML5, not XHTML 1.0.
> 
> 
> I think that xml-entity-names and HTML5 made a poor choice changing the 
> semantics of &rang; and &lang; (they used to be CJK punctuation, and now they
> are suddenly math). These are rendered differently. We should probably take a 
> pragmatic approach, and avoid rushing to be the first to implement this 
> aspect of the specs.

I agree we shouldn't rush on potential compatibility-breaking changes, if we 
can get someone else to do some testing for us first. However I believe Firefox 
dev builds have the new meanings of &rang; and &lang;. They haven't discovered
a problem yet, as far as I know.

Regards,
Maciej




Re: [webkit-dev] HTML5 & MathML3 entities

2010-07-10 Thread Sausset François
I'm not sure I understand everything, but the given link doesn't deal with the
case where an entity should be translated to two Unicode characters instead of
only one, as is the case with the current hash table system.

Such two-character entities don't exist in the HTML 5 entity list, but some are
present in the one used by MathML 3 (link in my previous message).
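
As a hedged illustration of what a two-character entity means for the table
layout, here is a small Python sketch. It is not the WebKit hash table: the
ENTITY_TABLE dict and the expand helper are made up, and the NotEqualTilde
entry (U+2242 followed by U+0338) is quoted from memory of the
xml-entity-names list and should be checked against the specification.

```python
# Each name maps to a tuple of code points, so one entity can expand to
# more than one character.
ENTITY_TABLE = {
    "amp": (0x0026,),
    "rang": (0x27E9,),
    # Assumed two-code-point entry; verify against xml-entity-names.
    "NotEqualTilde": (0x2242, 0x0338),
}


def expand(name):
    """Return the replacement string for an entity name, or None if unknown."""
    code_points = ENTITY_TABLE.get(name)
    if code_points is None:
        return None
    return "".join(chr(cp) for cp in code_points)
```

A table that stores a single code point per name cannot represent the last
entry, which is the limitation described above.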

François Sausset


On 10 Jul 2010, at 21:17, Adam Barth wrote:

> On Sat, Jul 10, 2010 at 11:10 AM, Sausset François wrote:
>> I just saw that when looking at the code myself.
>> What exactly do you mean by a prefix tree?
> 
> http://en.wikipedia.org/wiki/Trie
> 
>> I also noticed that the entity parser does not take into account combined
>> Unicode characters (see §A.3 in: http://www.w3.org/TR/xml-entity-names/).
>> In addition, even without entities, combined characters are displayed as
>> separate ones.
> 
> My understanding is that is the correct behavior w.r.t. the HTML5
> specification of entity parsing.  Our entity processing aims for
> perfect compliance with this algorithm:
> 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references
> 
> My belief is that the only things we're missing for perfect compliance are
> the expanded list of entity names:
> 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-references.html#named-character-references
> 
> and the prefix tree.
> 
> Adam



Re: [webkit-dev] HTML5 & MathML3 entities

2010-07-10 Thread Adam Barth
On Sat, Jul 10, 2010 at 11:10 AM, Sausset François wrote:
> I just saw that when looking at the code myself.
> What exactly do you mean by a prefix tree?

http://en.wikipedia.org/wiki/Trie

> I also noticed that the entity parser does not take into account combined
> Unicode characters (see §A.3 in: http://www.w3.org/TR/xml-entity-names/).
> In addition, even without entities, combined characters are displayed as
> separate ones.

My understanding is that is the correct behavior w.r.t. the HTML5
specification of entity parsing.  Our entity processing aims for
perfect compliance with this algorithm:

http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references

My belief is that the only things we're missing for perfect compliance are
the expanded list of entity names:

http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-references.html#named-character-references

and the prefix tree.

Adam


> On 10 Jul 2010, at 21:00, Adam Barth wrote:
> Implementing MathML entities is not as easy as adding them to
> HTMLEntityNames.gperf.  The problem is our entity parsing code (both
> the legacy entity parser and the new HTML5 one we're using) assumes
> that all named entities are <= 8 characters:
>
> http://trac.webkit.org/browser/trunk/WebCore/html/HTMLEntityParser.cpp#L194
>
> Rather than just bumping up that number, we need to change the data
> structure we use to store entities.  Instead of a perfect hash, we
> should use a prefix tree.  In order to parse entities correctly
> according to the spec, we need to know whether a given string is a
> prefix of a named entity, which is what the prefix tree would tell us.
>
> Adam
>


Re: [webkit-dev] HTML5 & MathML3 entities

2010-07-10 Thread Sausset François
I just saw that when looking at the code myself.
What exactly do you mean by a prefix tree?

I also noticed that the entity parser does not take into account combined 
Unicode characters (see §A.3 in: http://www.w3.org/TR/xml-entity-names/).
In addition, even without entities, combined characters are displayed as 
separate ones.

François Sausset
 

On 10 Jul 2010, at 21:00, Adam Barth wrote:

> Implementing MathML entities is not as easy as adding them to
> HTMLEntityNames.gperf.  The problem is our entity parsing code (both
> the legacy entity parser and the new HTML5 one we're using) assumes
> that all named entities are <= 8 characters:
> 
> http://trac.webkit.org/browser/trunk/WebCore/html/HTMLEntityParser.cpp#L194
> 
> Rather than just bumping up that number, we need to change the data
> structure we use to store entities.  Instead of a perfect hash, we
> should use a prefix tree.  In order to parse entities correctly
> according to the spec, we need to know whether a given string is a
> prefix of a named entity, which is what the prefix tree would tell us.
> 
> Adam



Re: [webkit-dev] HTML5 & MathML3 entities

2010-07-10 Thread Adam Barth
On Sat, Jul 10, 2010 at 4:49 AM, Maciej Stachowiak wrote:
> On Jul 10, 2010, at 3:47 AM, Sausset François wrote:
>> I'm currently working on the MathML3 implementation and I noticed that new 
>> XML entities have been defined by the W3C:
>> http://www.w3.org/TR/xml-entity-names/
>>
>> They are supposed to be used by both HTML 5 & MathML 3.
>>
>> I would like to include them in WebCore/html/HTMLEntityNames.gperf.
>> However there is one conflict with the existing XHTML 1.0 entities: \rangle 
>> (and \langle) doesn't point to the same Unicode character in XHTML 1.0 and 
>> HTML 5 entity definitions.
>> For instance, U+27E9 ("⟩") instead of U+3009 ("〉").
>>
>> There are two possibilities:
>> - either update WebCore/html/HTMLEntityNames.gperf and overwrite the two 
>> conflicting cases with the new standard, but it won't respect the XHTML 1.0 
>> specification anymore.
>> - or use two sets of HTML entities depending on the DTD of the document. It 
>> would be the cleanest way, but I don't know how to make WebCore handle two 
>> such sets.
>>
>> I think the best solution is the second one, but I'll need help to make 
>> WebCore handle two entity sets and switch depending on the DTD. It is 
>> outside of my present skills.
>
> Go with the HTML5 / MathML 3 definitions for everything. Our XHTML 
> implementation targets XHTML5, not XHTML 1.0.

Implementing MathML entities is not as easy as adding them to
HTMLEntityNames.gperf.  The problem is our entity parsing code (both
the legacy entity parser and the new HTML5 one we're using) assumes
that all named entities are <= 8 characters:

http://trac.webkit.org/browser/trunk/WebCore/html/HTMLEntityParser.cpp#L194

Rather than just bumping up that number, we need to change the data
structure we use to store entities.  Instead of a perfect hash, we
should use a prefix tree.  In order to parse entities correctly
according to the spec, we need to know whether a given string is a
prefix of a named entity, which is what the prefix tree would tell us.
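
To show why the prefix question matters, here is a simplified Python sketch of
consuming an entity name one character at a time without a fixed length limit.
It is an illustration only, not the HTML5 tokenizer algorithm: it ignores
semicolons and attribute-value rules, it uses a sorted list rather than a
prefix tree to answer the prefix question, and NAMES and longest_entity_name
are made-up names with a few sample entries.

```python
from bisect import bisect_left

# Sample names only; "NotEqualTilde" is longer than 8 characters.
NAMES = sorted(["amp", "lt", "not", "notin", "NotEqualTilde"])


def _first_at_or_after(s):
    """Return the first name that sorts at or after s, or None."""
    i = bisect_left(NAMES, s)
    return NAMES[i] if i < len(NAMES) else None


def longest_entity_name(text):
    """Return the longest entity name that `text` starts with, or None.

    Keeps consuming characters while the consumed part is still a prefix of
    some name, and remembers the last complete match.
    """
    best = None
    for end in range(1, len(text) + 1):
        candidate = text[:end]
        nearest = _first_at_or_after(candidate)
        if nearest is None or not nearest.startswith(candidate):
            break  # no name can start this way, so stop consuming
        if nearest == candidate:
            best = candidate
    return best
```

For the input "notin;" the loop first records "not", keeps going because
"noti" is still a valid prefix, and ends with "notin". For "notable" it stops
at "nota" and falls back to "not".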

Adam


Re: [webkit-dev] HTML5 & MathML3 entities

2010-07-10 Thread Alexey Proskuryakov

On 10.07.2010, at 04:49, Maciej Stachowiak wrote:

> Go with the HTML5 / MathML 3 definitions for everything. Our XHTML 
> implementation targets XHTML5, not XHTML 1.0.


I think that xml-entity-names and HTML5 made a poor choice changing the 
semantics of &rang; and &lang; (they used to be CJK punctuation, and now they
are suddenly math). These are rendered differently. We should probably take a 
pragmatic approach, and avoid rushing to be the first to implement this aspect 
of the specs.

The new entities defined for math specifically should of course use math 
characters.

- WBR, Alexey Proskuryakov



Re: [webkit-dev] HTML5 & MathML3 entities

2010-07-10 Thread Maciej Stachowiak

On Jul 10, 2010, at 3:47 AM, Sausset François wrote:

> I'm currently working on the MathML3 implementation and I noticed that new 
> XML entities have been defined by the W3C:
> http://www.w3.org/TR/xml-entity-names/
> 
> They are supposed to be used by both HTML 5 & MathML 3.
> 
> I would like to include them in WebCore/html/HTMLEntityNames.gperf.
> However there is one conflict with the existing XHTML 1.0 entities: \rangle 
> (and \langle) doesn't point to the same Unicode character in XHTML 1.0 and 
> HTML 5 entity definitions.
> For instance, U+27E9 ("⟩") instead of U+3009 ("〉").
> 
> There are two possibilities:
> - either update WebCore/html/HTMLEntityNames.gperf and overwrite the two 
> conflicting cases with the new standard, but it won't respect the XHTML 1.0 
> specification anymore.
> - or use two sets of HTML entities depending on the DTD of the document. It 
> would be the cleanest way, but I don't know how to make WebCore handle two 
> such sets.
> 
> I think the best solution is the second one, but I'll need help to make 
> WebCore handle two entity sets and switch depending on the DTD. It is outside 
> of my present skills.

Go with the HTML5 / MathML 3 definitions for everything. Our XHTML 
implementation targets XHTML5, not XHTML 1.0.

Regards,
Maciej



[webkit-dev] HTML5 & MathML3 entities

2010-07-10 Thread Sausset François
I'm currently working on the MathML3 implementation and I noticed that new XML 
entities have been defined by the W3C:
http://www.w3.org/TR/xml-entity-names/

They are supposed to be used by both HTML 5 & MathML 3.

I would like to include them in WebCore/html/HTMLEntityNames.gperf.
However there is one conflict with the existing XHTML 1.0 entities: \rangle 
(and \langle) doesn't point to the same Unicode character in XHTML 1.0 and HTML 
5 entity definitions.
For instance, U+27E9 ("⟩") instead of U+3009 ("〉").

There are two possibilities:
- either update WebCore/html/HTMLEntityNames.gperf and overwrite the two 
conflicting cases with the new standard, but it won't respect the XHTML 1.0 
specification anymore.
- or use two sets of HTML entities depending on the DTD of the document. It 
would be the cleanest way, but I don't know how to make WebCore handle two such 
sets.

I think the best solution is the second one, but I'll need help to make WebCore 
handle two entity sets and switch depending on the DTD. It is outside of my 
present skills.
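
For illustration, the second option might look roughly like the following
Python sketch. Everything here is hypothetical and is not WebCore code: the
table names, the entity_table_for helper, and the idea of switching on the
doctype public identifier are assumptions. The rang code points are the two
quoted above; the matching lang values (U+3008 and U+27E8) are assumed by
symmetry.

```python
# Two entity sets that disagree on the angle brackets.
XHTML1_ENTITIES = {"lang": 0x3008, "rang": 0x3009}  # lang assumed by symmetry
HTML5_ENTITIES = {"lang": 0x27E8, "rang": 0x27E9}   # lang assumed by symmetry


def entity_table_for(doctype_public_id):
    """Pick an entity set from the document's doctype public identifier.

    Assumption: XHTML 1.0 documents are recognized by the prefix of their
    public identifier; everything else gets the newer definitions.
    """
    if doctype_public_id and doctype_public_id.startswith("-//W3C//DTD XHTML 1.0"):
        return XHTML1_ENTITIES
    return HTML5_ENTITIES


strict = entity_table_for("-//W3C//DTD XHTML 1.0 Strict//EN")
default = entity_table_for(None)
```

The cost of this design is that every entity lookup has to know which document
it belongs to, which a single shared table avoids.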


François Sausset