Re: [webkit-dev] Why so many text nodes in the DOM? (especially ones with just whitespace)

2010-06-18 Thread Andreas Delmelle

On 17 Jun 2010, at 20:37, Alexey Proskuryakov wrote:

 
 On 17.06.2010, at 9:53, Andreas Delmelle wrote:
 
 If WebKit chooses, for example, to ignore character events from the parser 
 in nodes where logically it doesn't make sense to have stray characters
 
 
 That would break e.g. Web sites where JS accesses DOM in ways such as 
 node.firstChild.nextSibling, or node.childNodes[3]. We've previously seen 
 similar breakage happen after changing WebCore parsing code.

Wow, good point! Suddenly I feel foolish, not having thought of that 
hyper-trivial scenario. Obviously a very good reason to keep those nodes in. 

Still, one wonders from time to time how much bandwidth is actually wasted by 
sending over all these extraneous bytes that ultimately compel JS developers to 
write code like the above. I don't think I have ever seen a website that does 
/not/ serve its HTML pretty-printed... That seems like an awful lot of spaces, 
tabs and linefeeds!

On the other hand, node.firstChild.nextSibling just seems like asking for 
trouble. One could argue that people who do use that to get to the first 
element node do not need to be accommodated. It would suffice for one of the 
page's authors to insert a small comment node to break that code.
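
To make the fragility concrete, here is a minimal sketch in Python, using
xml.dom.minidom as a stand-in for the browser DOM (an assumption; the thread
is about JS in WebKit). The helper name first_via_siblings and the sample
markup are invented for illustration.

```python
from xml.dom import Node, minidom


def first_via_siblings(markup):
    """Return whatever root.firstChild.nextSibling happens to be."""
    root = minidom.parseString(markup).documentElement
    return root.firstChild.nextSibling


PRETTY = "<ul>\n  <li>one</li>\n  <li>two</li>\n</ul>"
COMMENTED = "<ul>\n  <!-- menu -->\n  <li>one</li>\n</ul>"
COMPACT = "<ul><li>one</li><li>two</li></ul>"

# Pretty-printed: firstChild is the whitespace text node, so nextSibling
# is the first <li>. The code "works", but only by accident.
pretty = first_via_siblings(PRETTY)
print(pretty.nodeType == Node.ELEMENT_NODE, pretty.firstChild.data)

# A comment in front of the first element: nextSibling is now the comment.
commented = first_via_siblings(COMMENTED)
print(commented.nodeType == Node.COMMENT_NODE)

# No whitespace at all: the same expression lands on the second <li>.
compact = first_via_siblings(COMPACT)
print(compact.firstChild.data)
```

The same expression yields three different nodes for three documents that
render identically, which is the breakage described above.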

One could just as easily extend Node with a firstElement() method that would 
work under all circumstances --but, oh yes, IE didn't support that back then... 
;-)
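
A rough idea of what such a helper could look like, again sketched in Python
on xml.dom.minidom rather than in JS. The name first_element is hypothetical
and only mirrors the firstElement() suggested above.

```python
from xml.dom import Node, minidom


def first_element(node):
    """Return the first child of node that is an element, or None."""
    for child in node.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:
            return child
    return None


root = minidom.parseString(
    "<ul>\n  <!-- menu -->\n  <li>one</li>\n</ul>"
).documentElement

# Whitespace text and the comment are skipped; the <li> is found.
found = first_element(root)
print(found.nodeName, found.firstChild.data)
```

Because it tests the node type instead of counting siblings, the result does
not depend on how the source happens to be indented or commented.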


Regards,

Andreas Delmelle
---

___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev


Re: [webkit-dev] Why so many text nodes in the DOM? (especially ones with just whitespace)

2010-06-17 Thread Andreas Delmelle

On 16 Jun 2010, at 23:12, David Hyatt wrote:

 On Jun 14, 2010, at 7:00 PM, Matt 'Murph' Finnicum wrote:
 
 Why are there so many Text nodes in the DOM? I had a look at the initial DOM 
 tree from rendering slashdot, and there are 1959 Text nodes. Of those 1959, 
 1246 were whitespace-only nodes.
 
 Does there need to be this many nodes? Why can't whitespace be combined with 
 the nodes next to it?

 Whitespace nodes most commonly occur between elements, so they can't be 
 coalesced.

Hmm, this touches on a very interesting topic...

Strictly speaking, a basic parser, be it XML or HTML, should never, ever report
anything to downstream consumers that was not in the original source document.
The software is doing its job pretty accurately in that respect. All it needs
is a little help from the consumer/user/developer. Think of JS minification: it
saves bandwidth and even, IIRC, makes the compiler's job easier.
Basically, almost all of that whitespace serves only one purpose: to make the
source human-readable. That is all well and good while you're developing a
website or webapp, but come deployment, you will always fare better if the
input stream is guaranteed to be processor-friendly to begin with. Fewer ghosts
to chase.

If the input is at least XHTML, and one is a tiny bit versed in XSLT, adding a 
preprocessing whitespace stripper stylesheet could be a quick-fix solution to 
reduce the waste of resources. That does consume resources elsewhere, 
obviously, so you may want to check if it's really worth the effort.
You would get xml:space handling for free if the processor is compliant in that
respect. The remainder is basically a plain copy template for all nodes. The
exception is text nodes, for which you can use normalize-space() to see if
they contain anything other than XML whitespace, and thus need copying.
The limitation is that you do not have access to the resolved CSS, if I'm
correct. In other words, if you have elements that can have #PCDATA content and
that get a class assigned that sets properties related to whitespace
preservation, the XSLT stylesheet will not see it (although there may exist
extensions for CSS parsing, not sure).
Then again, whitespace within, before or after text nodes is no problem, since 
that is presumed significant by default (but that gets coalesced with the other 
text later on, so no issue at all).
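
The copy-everything-except-blank-text rule can also be sketched outside XSLT.
What follows is a minimal Python sketch on xml.dom.minidom, not the stylesheet
itself. The function name strip_ignorable and the sample markup are invented
for illustration, and xml:space is honoured by hand here rather than by a
compliant processor.

```python
from xml.dom import Node, minidom

XML_WHITESPACE = " \t\r\n"  # the characters normalize-space() treats as blank


def strip_ignorable(node, preserve=False):
    """Remove whitespace-only text nodes below node, in place."""
    if node.nodeType == Node.ELEMENT_NODE:
        space = node.getAttribute("xml:space")
        if space == "preserve":
            preserve = True
        elif space == "default":
            preserve = False
    # Iterate over a copy, since children are removed while walking.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE:
            if not preserve and not child.data.strip(XML_WHITESPACE):
                node.removeChild(child)
        else:
            strip_ignorable(child, preserve)


SOURCE = (
    "<ul>\n"
    "  <li>one</li>\n"
    '  <li xml:space="preserve"> <b>two</b> </li>\n'
    "</ul>"
)

root = minidom.parseString(SOURCE).documentElement
before = len(root.childNodes)
strip_ignorable(root)
print(before, len(root.childNodes))  # 5 2
```

The blank nodes between the list items disappear, while the two blank nodes
inside the xml:space="preserve" item survive. Like the stylesheet, this sees
only the markup: a CSS rule that makes whitespace significant would go
unnoticed.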

Part of it could, maybe, remotely, be implemented in WebKit itself.
If WebKit chooses, for example, to ignore character events from the parser in 
nodes where logically it doesn't make sense to have stray characters (which, 
incidentally, is the strategy Apache FOP uses, but that may be a slightly 
different story since that is pure XML), it could mean a significant reduction 
of the above 1246 nodes... perhaps even to 0? 
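
As a rough illustration of filtering at the parser-event level, here is a
small Python sketch using xml.sax. It shows neither WebKit's nor FOP's actual
code; the ELEMENT_ONLY set is a hypothetical, hand-picked list of elements
where stray characters make no sense, and TextCollector is an invented name.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical list: elements whose content is expected to be elements only.
ELEMENT_ONLY = {"html", "head", "ul", "ol", "table", "tbody", "tr"}


class TextCollector(ContentHandler):
    """Record character events, ignoring blank ones in element-only parents."""

    def __init__(self):
        super().__init__()
        self.stack = []    # names of the currently open elements
        self.kept = []     # character data that would end up in text nodes
        self.dropped = 0   # blank character events that were ignored

    def startElement(self, name, attrs):
        self.stack.append(name)

    def endElement(self, name):
        self.stack.pop()

    def characters(self, content):
        parent = self.stack[-1] if self.stack else None
        if parent in ELEMENT_ONLY and not content.strip(" \t\r\n"):
            self.dropped += 1
            return
        self.kept.append(content)


handler = TextCollector()
xml.sax.parseString(b"<ul>\n  <li>one</li>\n  <li>two</li>\n</ul>", handler)
print("".join(handler.kept), handler.dropped)
```

The parser may deliver one run of whitespace as several character events, so
the dropped count is a number of events, not of would-be text nodes. Text
inside the list items is never touched, because li is not in the set.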

Downsides? The live DOM no longer *exactly* reflects the input, so it would 
definitely need to be configurable, just in case one does need that 
functionality. OTOH, let's say that 95% of a site's visitors are not interested
at all in what the HTML source looks like. If you really want to share your 
code with the other 5%, there are far better ways to do that than relying on 
'View Source', no? For the remainder, I must admit I am having a hard time 
imagining scenarios where ignorable whitespace would be desirable to keep 
around. In the worst case, it could even needlessly complicate certain layout- 
and rendering-related tasks...



Regards,

Andreas Delmelle