Arcane Jill wrote:
> realistically, Lars, I think you should just take the
> performance hit. The
It is not just about performance and the CPU cycles. Suppose I have a million lines of code and want to replace a UTF-8 conversion with my conversion. If my conversion has different size requirements than the previous one, I have to carefully analyze what the programmers did throughout the code, or risk a buffer overrun in some odd corner of the application.
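To make the size argument concrete: assuming the escape codepoints sit in the Basic Multilingual Plane's Private Use Area (the exact 128-codepoint range is an implementation detail here), each escaped byte costs exactly three UTF-8 bytes in the output, which is the same worst case as the usual U+FFFD replacement character, so buffer-size calculations written for the old conversion still hold. A quick Python sketch:

```python
# Every codepoint in the BMP Private Use Area (U+E000..U+F8FF) encodes
# to exactly 3 bytes in UTF-8, the same as the U+FFFD replacement
# character. An escaping conversion that maps each bad input byte to
# one PUA codepoint therefore needs no more output space than the
# replacement-character conversion it substitutes for.
assert all(len(chr(cp).encode('utf-8')) == 3 for cp in range(0xE000, 0xF900))
assert len('\ufffd'.encode('utf-8')) == 3
```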
And even if it were only about performance: suppose I am processing thousands of filenames per second, gathered from multiple systems onto one. This one system will have little disk activity and no fstats, just a bunch of conversions. Or suppose I have put the filenames into an XML file along with their properties; now I have to convert entire XML documents.
While the odd characters are present, the network load will also increase. Sure, that will become irrelevant after some time. But I will still need to keep oversized buffers, just in case, indefinitely, which is only slightly better than calling 'strconvlen' on each incoming buffer and burdening the system with a bunch of malloc and free calls.
> In that case, escape sequences will work just as well as reserved
> characters. They will fulfil exactly the same function ... EXCEPT
> that you no longer have to worry that Unicode text might contain
> single codepoints by accident.
I am not worried about it. My solution with the PUA is solid enough for me; the range was carefully chosen. The performance (and convenience) requirements were stronger and have prevailed. In my case it was a trade-off, and that in itself makes the solution unclean. But if other people wanted to use the same solution and we agreed to have it standardized, then officially assigning the 128 codepoints would solve that problem too. That would remove the unclean part of my solution and make it suitable for standardization.
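As a rough illustration only (the base codepoint U+EE80 below is a hypothetical stand-in, not the range I actually chose), the decoding side of such a scheme can be sketched in Python, whose 'surrogateescape' error handler implements the same idea with lone surrogates:

```python
ESCAPE_BASE = 0xEE80  # hypothetical stand-in for the chosen 128-codepoint PUA range

def decode_escaping(data: bytes) -> str:
    """Decode UTF-8, mapping each undecodable byte b (always in the
    range 0x80..0xFF, hence 128 codepoints suffice) to one escape
    codepoint instead of rejecting or replacing it."""
    # 'surrogateescape' maps each bad byte b to the lone surrogate
    # U+DC00+b; remap those into the escape range.
    s = data.decode('utf-8', errors='surrogateescape')
    return ''.join(
        chr(ESCAPE_BASE + ord(c) - 0xDC80) if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in s)
```

Valid UTF-8 passes through untouched; a stray byte like 0x80 comes out as the single escape codepoint for 0x80, never as an error or a lossy U+FFFD.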
> There is also one other thing which you seem not to have considered.
> It is possible (and /much/ more likely than that a suitably chosen
> escape sequence might turn up by accident) that, in some non-Unicode
> encoding ... let's say the fictitious encoding Krakozhian ... the
> byte sequence emitted by UTF-8(c) might be extremely common (where c
> is one of your 128 reserved codepoints).
No problem. Such sequences are escaped themselves, so they round-trip. My size requirements are also met.
You could also be worried not just about those 128 sequences, but about all UTF-8 sequences; those will be far more frequent. One could argue that the presence of the escape codepoints in Unicode text should indicate a legacy encoding, and that this is not guaranteed. Well, this possibility of late detection is only a side-effect of what I am doing; it is not guaranteed and not a requirement. Eventually the problem will be detected, even if not a single invalid sequence was encountered, and the important thing is that the original byte sequence can be recreated entirely.
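To sketch how the round trip works, including the self-escaping that answers the Krakozhian objection (again with the hypothetical base U+EE80 rather than my actual range): whenever the input bytes happen to contain the valid UTF-8 encoding of an escape codepoint, those bytes are themselves escaped one by one, so decoding followed by encoding always reproduces the original byte sequence exactly.

```python
ESCAPE_BASE = 0xEE80  # hypothetical stand-in for the chosen 128-codepoint PUA range

def bytes_to_text(data: bytes) -> str:
    """UTF-8 decode with escaping; always succeeds, never loses a byte."""
    # Python's 'surrogateescape' handler salvages undecodable bytes as
    # lone surrogates U+DC00+b; remap them into the escape range.
    s = data.decode('utf-8', errors='surrogateescape')
    out = []
    for c in s:
        cp = ord(c)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(chr(ESCAPE_BASE + cp - 0xDC80))  # invalid byte -> escape
        elif ESCAPE_BASE <= cp < ESCAPE_BASE + 128:
            # The input contained the UTF-8 encoding of an escape
            # codepoint ("UTF-8(c)"): escape its own three bytes, so
            # the round trip stays exact.
            out.extend(chr(ESCAPE_BASE + b - 0x80) for b in c.encode('utf-8'))
        else:
            out.append(c)
    return ''.join(out)

def text_to_bytes(s: str) -> bytes:
    """Inverse conversion: escape codepoints turn back into raw bytes."""
    out = bytearray()
    for c in s:
        cp = ord(c)
        if ESCAPE_BASE <= cp < ESCAPE_BASE + 128:
            out.append(0x80 + cp - ESCAPE_BASE)
        else:
            out.extend(c.encode('utf-8'))
    return bytes(out)

# Any byte sequence survives, including a raw UTF-8-encoded escape codepoint:
for data in (b'plain ascii', b'bad \x80\xfe bytes', chr(ESCAPE_BASE).encode('utf-8')):
    assert text_to_bytes(bytes_to_text(data)) == data
```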
> In other
> words, you have to forbid the byte-sequences UTF-8(c), for all 128 c's, not
> just in Unicode
The codepoints in Unicode are not to be forbidden (on the contrary), nor reserved. They are merely assigned for a specific purpose. Using codepoints that are already assigned for some other purpose is bad: good enough for my private solution, but I am looking for a solution that can be used by everyone. You are frustrated because you cannot find it. Well, there isn't one, at least not one that would meet all the requirements. I still claim that my solution works and that there is just one step missing.
> One last question - why /can't/ locale conversion be
> automated?
It *sorta* works in *some* cases. Not all users will do it, and the odd filenames will keep reappearing for a long time, perhaps even for malicious reasons.
Lars
P.S.
> PS. I'm on holiday from tomorrow, so if I fail to respond to any
> comments, it'll be because I'm not here. :-)
You have taken my "take a break" seriously :) Merry Christmas ;)
L.

