Re: [whatwg] Byte-wise tokenization algorithm

2008-12-21 Thread Geoffrey Sneddon


On 21 Dec 2008, at 05:41, Ian Hickson wrote:

>> 1. Given an input stream that is known to be valid UTF-8, is it possible
>> to implement the tokenization algorithm with byte-wise operations only?
>> I think it's possible, since all of the character matching parts of the
>> algorithm map to characters in ASCII space.
>
> Yes. (At least, that's the intent; if you find anything that contradicts
> that, please let me know.)


Indeed it is possible (or at least it certainly was a year and a half  
ago, but I have seen nothing change that would stop it).
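
To make the reasoning concrete: in UTF-8, every byte of a multi-byte sequence has its high bit set, so a byte below 0x80 can only ever be an ASCII character. The following is a minimal Python sketch of that property, not code from any implementation mentioned in this thread; the function name and sample string are made up for illustration. It scans for U+003C by byte value and compares the result with a character-wise scan.

```python
def less_than_offsets(data):
    """Return the byte offsets of every 0x3C ('<') byte in UTF-8 data."""
    return [i for i, byte in enumerate(data) if byte == 0x3C]


# Sample input (an assumption for illustration): contains a 2-byte
# character (U+00E9) and a 3-byte character (U+263A).
text = "<p>caf\u00e9 \u263a</p>"
data = text.encode("utf-8")

# Character-wise scan, for comparison.
char_offsets = [i for i, ch in enumerate(text) if ch == "<"]

# Byte-wise scan; each byte offset is converted back to a character
# offset by decoding the prefix that precedes it.
byte_offsets = less_than_offsets(data)
as_char_offsets = [len(data[:i].decode("utf-8")) for i in byte_offsets]

# Collect every byte that belongs to a non-ASCII character.
multibyte_bytes = [b for ch in text if ord(ch) > 0x7F for b in ch.encode("utf-8")]

print(char_offsets, as_char_offsets)
```

Both scans agree on where the tags start, and none of the bytes belonging to the non-ASCII characters falls in the ASCII range, which is why matching on ASCII byte values cannot produce a false hit inside a multi-byte character.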



>> 2. Would such an implementation be conforming?
>
> Looking just at parsing, yes, probably... But an HTML5 implementation,
> according to the spec, must at a minimum support the UTF-8 and
> Windows-1252 encodings, so the overall implementation might not depending
> on exactly how this is done.


That should be no problem: just convert Windows-1252 to UTF-8 using
strtr() (as it is an SBCS this is simple enough; doing the inverse is
not); see the attached file. Then all you need to do is normalize the
character set name to match all aliases of Windows-1252 and UTF-8, as
well as mapping ISO-8859-1 and US-ASCII (and all their aliases) to
Windows-1252.
http://bugs.simplepie.org/repositories/entry/sp1/trunk/create.php
does that (the only dependency is for getting the file via HTTP;
that can just be replaced with cURL if you wish to just require that).
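
The label normalization described above can be sketched roughly as follows. This is an illustration in Python, not the create.php script linked above; the alias table is a small hypothetical subset rather than the full list of aliases, and the function names are invented here. It relies on Python's built-in cp1252 codec, which leaves bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined, so `errors="replace"` turns them into U+FFFD.

```python
# Hypothetical alias table: a handful of common labels only,
# NOT the complete set of registered aliases.
_ALIASES = {
    "utf-8": "utf-8",
    "utf8": "utf-8",
    "windows-1252": "windows-1252",
    "cp1252": "windows-1252",
    "iso-8859-1": "windows-1252",
    "iso8859-1": "windows-1252",
    "latin1": "windows-1252",
    "us-ascii": "windows-1252",
    "ascii": "windows-1252",
}


def normalize_label(label):
    """Map a charset label to 'utf-8' or 'windows-1252' (None if unknown)."""
    return _ALIASES.get(label.strip().lower())


def to_utf8(data, label):
    """Return `data` as UTF-8 bytes, given the charset label it arrived with."""
    encoding = normalize_label(label)
    if encoding is None:
        raise ValueError("unsupported charset label: %r" % label)
    if encoding == "utf-8":
        return data  # already UTF-8; validation is a separate step
    # Undefined Windows-1252 bytes become U+FFFD via errors="replace".
    return data.decode("cp1252", errors="replace").encode("utf-8")


print(to_utf8(b"\x80 100", "ISO-8859-1"))
```

Treating ISO-8859-1 and US-ASCII as Windows-1252 means a single conversion table covers all three, since Windows-1252 agrees with both wherever they define printable characters.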



--
Geoffrey Sneddon
http://gsnedders.com/
<?php

/**
 * Converts a Windows-1252 encoded string to a UTF-8 encoded string
 *
 * @copyright 2008 Geoffrey Sneddon
 * @license http://www.opensource.org/licenses/bsd-license.php BSD License
 * @param string $string Windows-1252 encoded string
 * @return string UTF-8 encoded string
 */
function windows_1252_to_utf8($string)
{
    // Maps each Windows-1252 byte >= 0x80 to its UTF-8 byte sequence.
    // Bytes that are unassigned in Windows-1252 map to U+FFFD.
    static $convert_table = array(
        "\x80" => "\xE2\x82\xAC",
        "\x81" => "\xEF\xBF\xBD",
        "\x82" => "\xE2\x80\x9A",
        "\x83" => "\xC6\x92",
        "\x84" => "\xE2\x80\x9E",
        "\x85" => "\xE2\x80\xA6",
        "\x86" => "\xE2\x80\xA0",
        "\x87" => "\xE2\x80\xA1",
        "\x88" => "\xCB\x86",
        "\x89" => "\xE2\x80\xB0",
        "\x8A" => "\xC5\xA0",
        "\x8B" => "\xE2\x80\xB9",
        "\x8C" => "\xC5\x92",
        "\x8D" => "\xEF\xBF\xBD",
        "\x8E" => "\xC5\xBD",
        "\x8F" => "\xEF\xBF\xBD",
        "\x90" => "\xEF\xBF\xBD",
        "\x91" => "\xE2\x80\x98",
        "\x92" => "\xE2\x80\x99",
        "\x93" => "\xE2\x80\x9C",
        "\x94" => "\xE2\x80\x9D",
        "\x95" => "\xE2\x80\xA2",
        "\x96" => "\xE2\x80\x93",
        "\x97" => "\xE2\x80\x94",
        "\x98" => "\xCB\x9C",
        "\x99" => "\xE2\x84\xA2",
        "\x9A" => "\xC5\xA1",
        "\x9B" => "\xE2\x80\xBA",
        "\x9C" => "\xC5\x93",
        "\x9D" => "\xEF\xBF\xBD",
        "\x9E" => "\xC5\xBE",
        "\x9F" => "\xC5\xB8",
        "\xA0" => "\xC2\xA0",
        "\xA1" => "\xC2\xA1",
        "\xA2" => "\xC2\xA2",
        "\xA3" => "\xC2\xA3",
        "\xA4" => "\xC2\xA4",
        "\xA5" => "\xC2\xA5",
        "\xA6" => "\xC2\xA6",
        "\xA7" => "\xC2\xA7",
        "\xA8" => "\xC2\xA8",
        "\xA9" => "\xC2\xA9",
        "\xAA" => "\xC2\xAA",
        "\xAB" => "\xC2\xAB",
        "\xAC" => "\xC2\xAC",
        "\xAD" => "\xC2\xAD",
        "\xAE" => "\xC2\xAE",
        "\xAF" => "\xC2\xAF",
        "\xB0" => "\xC2\xB0",
        "\xB1" => "\xC2\xB1",
        "\xB2" => "\xC2\xB2",
        "\xB3" => "\xC2\xB3",
        "\xB4" => "\xC2\xB4",
        "\xB5" => "\xC2\xB5",
        "\xB6" => "\xC2\xB6",
        "\xB7" => "\xC2\xB7",
        "\xB8" => "\xC2\xB8",
        "\xB9" => "\xC2\xB9",
        "\xBA" => "\xC2\xBA",
        "\xBB" => "\xC2\xBB",
        "\xBC" => "\xC2\xBC",
        "\xBD" => "\xC2\xBD",
        "\xBE" => "\xC2\xBE",
        "\xBF" => "\xC2\xBF",
        "\xC0" => "\xC3\x80",
        "\xC1" => "\xC3\x81",
        "\xC2" => "\xC3\x82",
        "\xC3" => "\xC3\x83",
        "\xC4" => "\xC3\x84",
        "\xC5" => "\xC3\x85",
        "\xC6" => "\xC3\x86",
        "\xC7" => "\xC3\x87",
        "\xC8" => "\xC3\x88",
        "\xC9" => "\xC3\x89",
        "\xCA" => "\xC3\x8A",
        "\xCB" => "\xC3\x8B",
        "\xCC" => "\xC3\x8C",
        "\xCD" => "\xC3\x8D",
        "\xCE" => "\xC3\x8E",
        "\xCF" => "\xC3\x8F",
        "\xD0" => "\xC3\x90",
        "\xD1" => "\xC3\x91",
        "\xD2" => "\xC3\x92",
        "\xD3" => "\xC3\x93",
        "\xD4" => "\xC3\x94",
        "\xD5" => "\xC3\x95",
        "\xD6" => "\xC3\x96",
        "\xD7" => "\xC3\x97",
        "\xD8" => "\xC3\x98",
        "\xD9" => "\xC3\x99",
        "\xDA" => "\xC3\x9A",
        "\xDB" => "\xC3\x9B",
        "\xDC" => "\xC3\x9C",
        "\xDD" => "\xC3\x9D",
        "\xDE" => "\xC3\x9E",
        "\xDF" => "\xC3\x9F",
        "\xE0" => "\xC3\xA0",
        "\xE1" => "\xC3\xA1",
        "\xE2" => "\xC3\xA2",
        "\xE3" => "\xC3\xA3",
        "\xE4" => "\xC3\xA4",
        "\xE5" => "\xC3\xA5",
        "\xE6" => "\xC3\xA6",
        "\xE7" => "\xC3\xA7",
        "\xE8" => "\xC3\xA8",
        "\xE9" => "\xC3\xA9",
        "\xEA" => "\xC3\xAA",
        "\xEB" => "\xC3\xAB",
        "\xEC" => "\xC3\xAC",
        "\xED" => "\xC3\xAD",
        "\xEE" => "\xC3\xAE",
        "\xEF" => "\xC3\xAF",
        "\xF0" => "\xC3\xB0",
        "\xF1" => "\xC3\xB1",
        "\xF2" => "\xC3\xB2",
        "\xF3" => "\xC3\xB3",
        "\xF4" => "\xC3\xB4",
        "\xF5" => "\xC3\xB5",
        "\xF6" => "\xC3\xB6",
        "\xF7" => "\xC3\xB7",
        "\xF8" => "\xC3\xB8",
        "\xF9" => "\xC3\xB9",
        "\xFA" => "\xC3\xBA",
        // The archived copy of the attachment is cut off after \xFA.
        // Everything from here on is reconstructed (an assumption): the
        // last five entries follow the pattern above, and the strtr()
        // call is the one named in the message.
        "\xFB" => "\xC3\xBB",
        "\xFC" => "\xC3\xBC",
        "\xFD" => "\xC3\xBD",
        "\xFE" => "\xC3\xBE",
        "\xFF" => "\xC3\xBF",
    );

    return strtr($string, $convert_table);
}

Re: [whatwg] Thoughts on HTML 5

2008-12-21 Thread Philipp Kempgen
Ian Hickson wrote:

> Deprecating HTML thus seems like vain effort. (We already tried over the
> past few years with XHTML 1.x, and it didn't work.)

I'd say it _did_ work.  :-)

   Philipp Kempgen


Re: [whatwg] Byte-wise tokenization algorithm

2008-12-21 Thread Philip Taylor
On Sun, Dec 21, 2008 at 5:41 AM, Ian Hickson <i...@hixie.ch> wrote:
> On Sat, 20 Dec 2008, Edward Z. Yang wrote:
>>
>> 1. Given an input stream that is known to be valid UTF-8, is it possible
>> to implement the tokenization algorithm with byte-wise operations only?
>> I think it's possible, since all of the character matching parts of the
>> algorithm map to characters in ASCII space.
>
> Yes. (At least, that's the intent; if you find anything that contradicts
> that, please let me know.)

I think there are some cases where it still should work but you might
have to be a little careful - e.g. <table>foo notionally results in
three parse errors according to the spec (one for each character token
which gets foster-parented), so <table>☹ results in one if you work
with Unicode characters but three if you treat each UTF-8 byte as a
separate character token.

But in practice, tokenisers emit sequence-of-many-characters tokens
instead of single-character tokens, so they only emit one parse error
for <table>foo, and the html5lib test cases assume that behaviour,
and it should work identically if you have sequence-of-many-bytes
tokens instead.

(Apparently only the distinction between 0 and more-than-0 parse
errors is important as far as the spec is concerned, since that has an
effect on whether the document is conforming; but it seems useful for
implementors to share test cases that are precise about exactly where
all the parse errors are emitted, since that helps find bugs, and so
the parse error count is relevant.)
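
The counting argument above can be illustrated with a toy Python sketch. It does not model the HTML5 tree builder or html5lib; the function name is invented for illustration, and it assumes one parse error per foster-parented text token. It only counts how many text tokens each approach would produce for the text after <table>.

```python
def count_text_tokens(units, coalesce):
    """Count text tokens for a run of units (characters or bytes).

    With coalesce=False every unit is its own token; with coalesce=True
    a non-empty run is emitted as a single token.
    """
    if coalesce:
        return 1 if len(units) > 0 else 0
    return len(units)


ascii_text = "foo"
frown = "\u2639"  # WHITE FROWNING FACE, three bytes in UTF-8

results = {
    "foo, per character": count_text_tokens(ascii_text, coalesce=False),
    "foo, per byte": count_text_tokens(ascii_text.encode("utf-8"), coalesce=False),
    "frown, per character": count_text_tokens(frown, coalesce=False),
    "frown, per byte": count_text_tokens(frown.encode("utf-8"), coalesce=False),
    "frown, coalesced characters": count_text_tokens(frown, coalesce=True),
    "frown, coalesced bytes": count_text_tokens(frown.encode("utf-8"), coalesce=True),
}

for name, count in results.items():
    print(name, count)
```

Only the uncoalesced, per-unit case differs between characters and bytes; once runs are coalesced, both views give the same count.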

-- 
Philip Taylor
exc...@gmail.com


Re: [whatwg] Byte-wise tokenization algorithm

2008-12-21 Thread Geoffrey Sneddon


On 21 Dec 2008, at 16:35, Edward Z. Yang wrote:


> I suppose the big pivot point is "as if". A byte-wise implementation
> would replace "character" globally with "byte", and any U+ designation
> with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not
> the actual algorithm implementation, no?

It states that what is done must be wholly equivalent to the given
algorithm.

>> But an HTML5 implementation,
>> according to the spec, must at a minimum support the UTF-8 and
>> Windows-1252 encodings, so the overall implementation might not depending
>> on exactly how this is done.
>
> The plan is to convert Windows-1252 into UTF-8 before processing; with a
> reasonably good iconv implementation, support for lots of encodings is
> possible. The implementation might not be fully conforming if iconv
> doesn't perform the proper (possibly context-sensitive; I haven't
> checked) substitution when it doesn't recognize a character, but it
> should be close.


I've never seen any way of getting iconv (at least via PHP) to do what
HTML 5 requires (i.e., replacing invalid bytes with U+FFFD). It is,
however, possible using mbstring (which also has the advantage of not
being system dependent), as well as with PHP6's Unicode support.
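
For illustration, this is the kind of substitution meant here, sketched in Python rather than with mbstring. The function name is made up, and Python's built-in `errors="replace"` handler stands in for the decoder; whether it emits exactly the number of U+FFFD characters HTML 5 expects for every malformed sequence is not checked here.

```python
def to_valid_utf8(data):
    """Return valid UTF-8 bytes, with undecodable input replaced by U+FFFD."""
    text = data.decode("utf-8", errors="replace")
    return text.encode("utf-8")


# 0xFF can never appear in well-formed UTF-8.
cleaned = to_valid_utf8(b"abc\xffdef")
print(cleaned)
```

After this step the stream is known to be valid UTF-8, which is the precondition the byte-wise tokenizer relies on.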



--
Geoffrey Sneddon
http://gsnedders.com/



Re: [whatwg] Thoughts on HTML 5

2008-12-21 Thread Nils Dagsson Moskopp
On Sunday, 21 Dec 2008, at 17:54 +0100, Philipp Kempgen wrote:
> Ian Hickson wrote:
>
>> Deprecating HTML thus seems like vain effort. (We already tried over the
>> past few years with XHTML 1.x, and it didn't work.)
>
> I'd say it _did_ work.  :-)
I'd say so too: the worst abominations have disappeared (for new sites,
that is), the <font> element, for example, or frames, through deprecating
them.

Fact: deprecating stuff takes it out of (X)HTML books, and how-tos like
Selfhtml warn against it, thus ensuring less use by novices. Does
anyone remember <marquee>?


Cheers
-- 
Nils Dagsson Moskopp
http://dieweltistgarnichtso.net



Re: [whatwg] Thoughts on HTML 5

2008-12-21 Thread Giovanni Campagna
Please Note: all the following is my personal humble opinion.

As I discovered lately, the main problem of HTML5 is that its design is
oriented toward keeping features that are already distributed across browsers,
that work, or that are a simple way to solve a big problem. In practice, they
are a bunch of different features that are not really integrated with one
another.
Programmers (please note, I use the word programmer, not author or
web designer) developing *new* applications may instead prefer a more
structured and logical organization, like XHTML modularization.
HTML5 features, summed in big groups, are (in spec order):
1) common syntax for the most used datatypes.
2) additional DOM interfaces, which include HTMLElement - HTMLCollection -
HTMLFormsControlCollection - HTMLOptionsCollection - DOMTokenList -
DOMStringMap
3) Elements and Content Models
4) Element types: metadata - structure - sectioning - grouping - text -
editing - embedding - table - forms - interactive - scripting elements
5) User agent requirements
6) User Interaction
7) Communication
8) HTML Syntax

Some of these features can be achieved without any of HTML5, for example:
1) use XMLSchema datatypes
2) you don't need HTMLElement: markup insertion and attribute querying can be
done using DOM3Core (which in the latest browsers is even more performant, as
no parser is involved), events are far better handled by DOM3Events, styling is
included by CSSOM;
you don't need the collections either: just use the appropriate DOMNodeLists,
while for DOMStringMap you may use binding-specific features (all Objects are
hash maps in ECMAScript3): it works this way even in HTML5
3) use XHTML2, which is extensible because it is modularized
4) metadata is better handled by the XHTML2 Meta Attributes module, which fully
integrates the RDF module into any element;
structure, sectioning, and grouping are the same;
text is very similar: you don't have <time>, but you can have <span
datatype="xsd:date" content="2008-12-21">Today</span> where in HTML5 you have
<time value="2008-12-21">Today</time>; for <progress> and <meter> semantics you
can use the role attribute (for styling you always use CSS); editing is the
same, but you have an attribute instead of an element, so you don't have the
issue that <ins> and <del> can contain everything, even a whole document (not
including <html>);
embedding is much more powerful, as any element can be replaced by embedded
content;
tables are the same (you don't have the tables API, but you can still use
DOM3Core);
XForms is actually more powerful than WebForms2, since it separates
presentation from data from action (which is implemented declaratively);
interactive elements are not needed at all: <details> is better implemented as
it is now (ECMAScript3 + CSS3); <datagrid> is just a way to put data in a tree
model: use plain XML for that; <command> and <a> are implemented in XHTML2 on
any element using the href attribute; <menu> is mostly a <ul> with some style;
scripting uses XMLEvents and handler: it looks the same, but it is different
as it is more event-oriented (scripts are not executed by default, they're
executed when some event fires)
8) HTML syntax: as I said before, use XML for that

There are instead features that are indeed very useful for developing a web
application, but are not achievable by means other than HTML5:
1) some way to interact with <object> (please note: <object>, not <embed>:
<object> is for plugins, <embed> for content): actually this can be done using
something like cross-document messaging, assuming that <object> creates a new
browsing context (it already does if the target is text/html or
application/xhtml+xml), but we need a specification for the message syntax
2) the binding-specific global scope (that is, which objects are available in
all scopes, if the binding supports this); this is normally the window object,
but scripts use certain features only on their own browsing context, so those
may be moved from it to the global scope, removing the whole window object
from scope (in current JavaScript you can write
window.window.window.window.window... and get the same as nothing)
3) the Window object (which includes window name, window location, cross
document messaging, dialog windows)
4) Protocol and Content Handlers
5) Session and Local storage
6) Database storage
7) Drag and Drop
8) WebSockets

What I am asking now is to modularize HTML: copy those features into
separate, interoperable modules, removing legacy features (like the
window.on-whatever event listeners).
A copy of those will remain in HTML5, because browsers implement them at the
moment, and the HTML5 goal is that all browsers implement the same things in
the same ways.

Instead, some web developers in the future will think that a modularized and
less redundant API is more usable, as I personally do, and switch to that,
without mixing it with HTML5: actually, I wonder what a Database API is doing
inside HTML.

Best regards,
Giovanni Campagna


Re: [whatwg] Thoughts on HTML 5

2008-12-21 Thread Jorgen Horstink

Hi Giovanni,

I haven't read your entire comment, but your point eight will
break backwards compatibility. As far as I know, HTML5 is supposed to
combine old and new. The problem with interfaces is that you cannot
simply change them. That's just a fact we have to deal with.


jorgen


Re: [whatwg] Thoughts on HTML 5

2008-12-21 Thread Benjamin Hawkes-Lewis

On 21/12/08 17:22, Nils Dagsson Moskopp wrote:

> On Sunday, 21 Dec 2008, at 17:54 +0100, Philipp Kempgen wrote:
>> Ian Hickson wrote:
>>
>>> Deprecating HTML thus seems like vain effort. (We already tried over the
>>> past few years with XHTML 1.x, and it didn't work.)
>>
>> I'd say it _did_ work.  :-)
>
> I'd say so too: the worst abominations have disappeared (for new sites,
> that is), the <font> element, for example, or frames, through deprecating
> them.


You're assuming that's an indication of the power of specifications 
rather than of actual advantages to using CSS or avoiding frames.


What mostly failed, and which Hixie is referring to, was the attempt to 
move the web from a tag soup (text/html) basis to an XML 
(application/xhtml+xml) basis. Perhaps that's because the advantages of 
the latter were not persuasive. As I've argued elsewhere in the thread, 
there's money in staying with text/html.



> Does anyone remember <marquee>?


That's a bad example. MARQUEE was never standardized in a specification, 
so it was never possible to deprecate it.


--
Benjamin Hawkes-Lewis


Re: [whatwg] Thoughts on HTML 5

2008-12-21 Thread Garrett Smith
On Sun, Dec 21, 2008 at 10:12 AM, Giovanni Campagna
scampa.giova...@gmail.com wrote:
> Please Note: all the following is my personal humble opinion.
>
> parser is involved), events are far better handled by DOM3Events, styling is
> included by CSSOM

Styling is done in CSS.

I don't have time to go into all the problems with CSSOM here.
Shortcomings of the CSSOM 'Views' module were discussed on www-style.
'Views' is not the only CSSOM module that has problems.

> you don't need collection either: just use appropriate DOMNodeLists, while
> for DOMStringMap you may use binding specific features (all Object are hash
> maps in ECMAScript3): it works this way even in HTML5

Where are you getting this information?


> but scripts use certain features only on their own browsing context, so that
> may be moved from that to global scope, removing the whole window object
> from scope (for current javascript you can write
> window.window.window.window.window... and get the same as nothing)

The closest definition to 'nothing' would be the value undefined. I do
not know of a browser where window.window.window === undefined is
true by default. I get window.

A relevant example would be useful.

The closest thing we got to an example of invalid HTML is TJ's post about
the jQuery validation plugin. If you click through, there is a demo using
a custom minlength attribute. The attribute may have the effect the
author wanted it to have in the set of browsers he is concerned about.
That effect and the set of browsers could be more clearly
demonstrated in an example that shows only that, as well as edge cases
where results may vary.

If you can't define clearly what can be reasonably expected of a piece
of (invalid) code, then nothing can be reasonably expected of it. It's
not good to write code that can't have an expected outcome.

> Best regards,
> Giovanni Campagna



Re: [whatwg] Byte-wise tokenization algorithm

2008-12-21 Thread Ian Hickson
On Sun, 21 Dec 2008, Edward Z. Yang wrote:
 
> I suppose the big pivot point is "as if". A byte-wise implementation
> would replace "character" globally with "byte", and any U+ designation
> with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not
> the actual algorithm implementation, no?

Right; conformance requirements phrased as algorithms or specific steps 
may be implemented in any manner, so long as the end result is equivalent.
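
A small Python sketch of what "equivalent end result" can look like in this case follows. It is a deliberately naive splitter, not the HTML5 tokenization algorithm, and the function names are invented for illustration: one version scans characters, the other scans UTF-8 bytes, and both yield the same pieces for valid UTF-8 input.

```python
def split_chars(text):
    """Split text into ('tag', s) and ('text', s) pieces by scanning characters."""
    pieces, i = [], 0
    while i < len(text):
        if text[i] == "<":
            j = text.find(">", i)
            j = len(text) if j == -1 else j + 1
            kind = "tag"
        else:
            j = text.find("<", i)
            j = len(text) if j == -1 else j
            kind = "text"
        pieces.append((kind, text[i:j]))
        i = j
    return pieces


def split_bytes(data):
    """Same split, but scanning the bytes of valid UTF-8 input."""
    pieces, i = [], 0
    while i < len(data):
        if data[i] == 0x3C:  # '<'
            j = data.find(b">", i)
            j = len(data) if j == -1 else j + 1
            kind = "tag"
        else:
            j = data.find(b"<", i)
            j = len(data) if j == -1 else j
            kind = "text"
        # Slices start and end at ASCII bytes, so they decode cleanly.
        pieces.append((kind, data[i:j].decode("utf-8")))
        i = j
    return pieces


sample = "<p>na\u00efve \u2639</p>"
by_chars = split_chars(sample)
by_bytes = split_bytes(sample.encode("utf-8"))
print(by_chars == by_bytes)
```

The two functions differ only in the unit they step over; because every delimiter they look for is ASCII, the pieces they report are identical.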

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'