2014-06-05 0:48 GMT+02:00 Doug Ewell <[email protected]>:
> If you are processing arbitrary fragments of a stream, without knowledge
> of preceding fragments, as in this example, then you have no business
> making *any* changes to that fragment based on interpretation of that
> fragment as Unicode text. Your sole responsibilities at that point are
> to pass the fragments, intact, from one process to the next, or to
> disassemble and reassemble them.
Not necessarily true. Think of a debugging log coming from an OS or a device, accumulating text data from various sources inside that device. You can connect to such a live stream at any time, without having followed everything that happened before. You'll probably want to sync on the first newline control and proceed from that point (see the sketch below). But if those devices are configured heterogeneously and each generates its own output encoding, you won't necessarily know how the text is encoded, even if all of them use some UTF of Unicode. So the stream will regularly repost an encoding mark, for example at the beginning of each dated log entry, and this could be just an encoded BOM (even with UTF-8, or some other UTF like UTF-16, which would be more likely if the content were essentially in an East Asian (CJK) language).

These devices would emit their messages or logs with a very basic protocol, or no protocol at all (Telnet, serial link, ...), without any prior negotiation: these data feeds are unidirectional, meant to be used by any number of consumers that can connect or disconnect at any time, and the log producer will never know how many clients there are, notably for passive debugging logs. You could then expect BOMs to occur many times in the stream. This is what I called a "live" stream: it has no start, no end, no defined total size; you don't know when new texts will be emitted, and you don't even know at what rate, which could be very high. If the rate is too high, one can use a fast local proxy to filter the feed with patterns (e.g. a debug level, reported at the start of each log entry, or some identifier of the real source, not controlled directly at the point where you connect to listen to the stream) and receive only the result that can be carried over a slower link to the client. But here too, the proxy will not necessarily run continuously, only while there is some interested client providing a pattern to match. The resulting texts will then be highly fragmented.

So your assumption is only true when you think about processes that have a prior agreement to use some specific convention. But in a heterogeneous world, where participants (producers and consumers) are maintained separately and can appear or disappear at any time, you cannot expect that they will all use the same encoding, or that disassembling/reassembling is as safe as you think. That is only true if they work in close cooperation under strict common standards.

Take the example of a service that archives all received emails in a feed, or a list of SMS messages from a group of participants: do you need to archive not only the texts themselves but also all the protocol metadata from which they originated, when the application is creating a basic log that will not be reused for SMS or emails due to the generated volume? Encoded texts in a heterogeneous environment, and over the web where people may use various OSes and languages, are well-known cases where plain text alone is not always sufficient to determine how to decode it; you cannot just "guess" from the content when that content can change at any time. And these texts are not always safely convertible to the same encoding without data losses or alterations. If you don't insert enough BOMs in the live stream after resynchronization points, the result that consumers get will be full of mojibake.
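To make the resynchronization concrete, here is a minimal sketch of such a consumer in Python; it is my own illustration, not anything proposed upthread, and all names in it are hypothetical. It attaches mid-stream by dropping bytes up to the first newline, then checks each entry for a re-posted BOM to select a decoder. (Syncing on the raw byte 0x0A is itself a simplification that only holds for ASCII-compatible encodings like UTF-8; a UTF-16 stream would need a two-byte-aware scan.)

    import codecs

    # Longer BOMs must be tested first: the UTF-32 LE BOM (FF FE 00 00)
    # begins with the UTF-16 LE BOM (FF FE).
    BOMS = [
        (codecs.BOM_UTF32_LE, 'utf-32-le'),  # FF FE 00 00
        (codecs.BOM_UTF32_BE, 'utf-32-be'),  # 00 00 FE FF
        (codecs.BOM_UTF8,     'utf-8'),      # EF BB BF
        (codecs.BOM_UTF16_LE, 'utf-16-le'),  # FF FE
        (codecs.BOM_UTF16_BE, 'utf-16-be'),  # FE FF
    ]

    def sync_to_newline(data: bytes) -> bytes:
        """Discard the partial entry we joined mid-way; start after the
        next newline (assumes an ASCII-compatible encoding)."""
        i = data.find(b'\n')
        return data[i + 1:] if i >= 0 else b''

    def decode_entry(entry: bytes, fallback: str = 'utf-8') -> str:
        """If the entry re-posts a BOM, use it to select the decoder;
        otherwise fall back to the last known (or default) encoding."""
        for bom, enc in BOMS:
            if entry.startswith(bom):
                return entry[len(bom):].decode(enc, errors='replace')
        return entry.decode(fallback, errors='replace')

    # Example: joining a (made-up) stream in the middle of an entry.
    frag = (b'...tail of an entry we never saw the start of\n'
            b'\xef\xbb\xbf2014-06-05 00:48 dev42: caf\xc3\xa9 ready\n')
    for entry in sync_to_newline(frag).split(b'\n'):
        if entry:
            print(decode_entry(entry))

The point of the sketch is only that per-entry BOMs let a late-joining consumer recover the encoding locally, without any prior negotiation with the producer.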

