2015-07-13 11:15 GMT+02:00 Marcel Schneider <[email protected]>:
> It's roughly the same problem with the CSS and UTF-8 malfunctioning that
> is laughed at with the other merchandising items brought in by Umesh:
> http://www.zazzle.com/cheap_css_is_awesome_mug-168565401817501350
> http://www.zazzle.com/css_is_awesome_with_java-script_mug-168685521846695550
> and Karl Williamson <[email protected]> (On Sat, Jul 11, 2015, 19:42):
> http://i1.cpcache.com/product/27297813/utf8_value_tshirt.jpg
>
> Personally the only time CSS was awesome to me is when I'd written bad
> code. In truth, CSS is very smart and allows browsers to adapt the box
> width to the content, if not hindered in doing so by some fixed width. We
> can write bad code in any language, but then we should rather laugh at our
> own incapacity.
>
> Idem with charsets. The only time I saw UTF-8 like on the T-shirt was
> when opening UTF-8 files that didn't specify charset=UTF-8. The thing to do
> was to add the charset in the file header.

Or simply add a leading BOM: all browsers will autodetect it. This only concerns HTML files (on a local filesystem).
BOMs are not recommended for UTF-8-encoded JavaScript files: if your local HTML file references a local script file, it can specify the expected charset in addition to the local URL of the script itself, via an attribute on the HTML "script" element.

If your page needs to perform JSON requests, the JSON is normally served by a webserver that delivers the MIME type and charset in metadata. Some JSON parsers can also be set to autodetect the BOM and discard it from the visible content: that's just the first three bytes to check in the input stream before handing the stream data to the parser, which can then be instantiated and initialized directly with the correct charset.

For pages served by webservers, you add the charset in the metadata of your shared folder to associate files with MIME types. This can even be a global setting of the server if all your pages and scripts are UTF-8 encoded, or it can be set on the main folder and overridden for specific folders whose files should not be sent with the UTF-8 metadata but with another charset. Or you can enable the autodetection feature in Apache, which will detect the BOM in the file and then serve the UTF-8 file without this leading BOM, with the corrected file size and the correct MIME type carrying its charset parameter.

It is more complicated for files hosted on FTP, as there is no MIME metadata: for those the BOM is still the easiest option (but it will be up to the FTP client to perform the autodetection). Autodetecting a BOM is much more efficient than autodetecting an HTML meta tag in the header: the latter requires aborting the current parse in the middle and restarting it, which uses more memory that will later need to be garbage-collected, and costs some milliseconds and more CPU resources, as HTML parsers are very expensive in terms of CPU processing.
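The three-byte check described above can be sketched as follows; this is a minimal illustration (the helper name `parse_json_bytes` is mine, not a standard API), showing a UTF-8 BOM being tested and stripped before the data ever reaches the JSON parser:

```python
import codecs
import json

def parse_json_bytes(raw: bytes):
    """Parse JSON from raw bytes, discarding a leading UTF-8 BOM if present.

    The BOM is just the first three bytes (EF BB BF); checking them up
    front lets the parser be fed clean UTF-8 data directly.
    """
    if raw.startswith(codecs.BOM_UTF8):  # codecs.BOM_UTF8 == b'\xef\xbb\xbf'
        raw = raw[len(codecs.BOM_UTF8):]
    return json.loads(raw.decode("utf-8"))
```

With this in place, the same input parses identically whether or not the file was saved with a BOM.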
If you place the charset in a meta tag of the HTML page, make sure that this tag is near the beginning of the HTML header (it should be fully within the first 4 KB, and even before the mandatory <title> element). In my opinion this meta tag should be the first child of the <head> element, which is itself the first element of the <html> element that immediately follows the optional HTML doctype declaration. If your page is XHTML, you should use the leading XML declaration line to carry that charset indication.

Putting the indication in the first 4 KB allows some charset guessers to identify the charset faster without actually having to instantiate a parser and abort it in the middle. 4 KB is typically the size of a single memory page, so that page will remain in CPU/bus caches without triggering paging I/O. The CPU cost will be minimal if the charset can be autodetected very early, in a few nanoseconds, by just scanning the content of a single memory page; 4 KB is large enough that any placement of the autodetected signatures will be found without a long wait.

Actually I even think that the tag should be in the first 1400 bytes, to fit within a single TCP packet at the smallest common MTU: this minimizes networking I/O delays. Aborting a parser and restarting it has a significant processing cost that could further delay the processing of the next TCP packet, which could then be paged out by the OS if there are concurrent networking streams used by concurrent processes, such as large file downloads or an active streamed video.

I just wonder why HTML5 did not deprecate the old meta tag of HTML4 in favor of an attribute directly on the <html> root element, or even in its recommended DOCTYPE declaration.
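The placement described above looks like this as a minimal HTML5 skeleton (the element names are standard; the page content is of course illustrative only):

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <!-- charset declaration first, well inside the first 1400 bytes -->
    <meta charset="utf-8">
    <title>Example page</title>
  </head>
  <body>
    <p>Content…</p>
  </body>
</html>
```

For an XHTML page, the equivalent early indication is the XML declaration on the very first line: <?xml version="1.0" encoding="UTF-8"?>.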
But if you use the abbreviated HTML5 doctype line, the default should be UTF-8 and no indication is necessary (charset guessers should not be used with HTML5, except after a parsing failure, as a possible recovery solution, in which case the meta tag may be processed; if there is no parsing error for the main document, excluding all other referenced documents such as scripts or inner frames, the meta tag should better be ignored even if it is present and specifies something else).

Maybe in some future there will be an HTML6 that enforces the use of a single charset and possibly a more compact encoding. We've seen similar radical changes even for core protocols such as HTTP(S) itself. This could become a single unified protocol mixing this new-generation HTTP and HTML capabilities, but with more features: dynamic parallel streams, encryption, authentication, simplified and more efficient data signatures, real-time constraints and QoS management of streams for web applications, and more efficient support for encapsulated binary data (notably audio/video/images, or even nearly native executable scripts, precompiled by the server for the target client when its processing capabilities are constrained, notably smartphones saving battery energy).
That future of HTML will focus much more on its API; the effective encoding may be auto-adapted, or negotiated and cached. Given that we now need security everywhere on the web, negotiation protocols are already in use: for now just for authenticating and exchanging encryption key pairs, but the same round trip could negotiate presentation formats such as the MIME type and charset, compression levels, and binary compatibility of the clients for receiving precompiled executable content, or for sharing tasks and CPU/GPU resources or local/remote storage, or synchronizing cached data.

We'll rapidly need a true "network-centered OS" where applications can run on one or more devices in parallel, owned by the client or by the service provider, allowing on-demand allocation and sharing of processing resources available locally or remotely. On that OS there will no longer be the concept of a host (or it will just be a virtual, delocalized host); the concept of "local" may be replaced by a personal user environment which auto-adapts to the capabilities of the devices around the user and the available networking bandwidth. By then, this virtualized OS will certainly be 128-bit (and not 64-bit as of today), and it will manage many terabytes of virtual memory, including the environments of other users located anywhere.

Clients and servers will dynamically share or request resources from that network, and the core job of this OS will be to manage caches, automatic synchronization, and bandwidth allocation; nobody will know "where" the code is actually running physically. All devices will then exchange code or data indifferently, or perform computing tasks delegated to them by other members of the network (including transformation codecs). The network OS will provide the necessary isolation for security, and the architecture will be more peer-to-peer, working as a collaborative grid-computing architecture.
It will also be failure-resistant, with implicit backup/replication.

