Speed is not much linked to in-memory buffer sizes (memory is cheap and plentiful now) and parsing in-memory encodings is extremely fast. The actual limitation is I/O (network or storage on disk), and at that level you work with network datagrams/packets, disk buffers, or memory pages for paging, all of which use fixed-size buffers (so the memory allocation cost can be avoided, as the buffer is reusable).
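As a rough illustration of working at that boundary, here is a minimal sketch in C (the buffer sizes, the fread()-based I/O and the scalar counting are arbitrary choices for the example, not anything prescribed): it decodes UTF-8 through one small, reusable buffer and carries a sequence that got split across two reads over to the next one.

/* Minimal sketch (sizes are arbitrary): decode UTF-8 arriving through a
   small, reusable I/O buffer, carrying an incomplete sequence at the end
   of one read over to the next one.  It only counts scalar values; a real
   decoder would also validate continuation bytes and overlong forms. */
#include <stdio.h>
#include <string.h>

#define IOBUF 4096   /* fixed read buffer, reused for every I/O        */
#define CARRY 4      /* a UTF-8 sequence is at most 4 bytes long       */

/* Expected length of a UTF-8 sequence from its lead byte (0 = invalid). */
static int seq_len(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}

int main(void) {
    unsigned char buf[CARRY + IOBUF]; /* spare room in front for the carry */
    size_t carry = 0;                 /* bytes kept from the previous read */
    unsigned long scalars = 0;

    for (;;) {
        size_t got = fread(buf + carry, 1, IOBUF, stdin);
        if (got == 0) break;          /* EOF (any leftover carry is a
                                         truncated sequence at end of input) */
        size_t len = carry + got, i = 0;
        while (i < len) {
            int need = seq_len(buf[i]);
            if (need == 0) { i++; continue; }   /* skip an invalid byte   */
            if (i + need > len) break;          /* incomplete: keep bytes */
            scalars++;
            i += need;
        }
        carry = len - i;                        /* at most 3 bytes        */
        memmove(buf, buf + i, carry);           /* move them to the front */
    }
    printf("%lu scalar values\n", scalars);
    return 0;
}

The only state carried between two I/Os is at most three bytes, so the same small buffer serves for the whole stream and is never reallocated.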
Given that, you can easily create default buffers as small as about 4 KB and convert them from any encoding to another with a small static auxiliary buffer (16 KB for the worst cases), handling at little cost a transition that falls in the middle of an encoding sequence. Working with buffers considerably reduces the number of I/O operations performed, and you can still compress the data chunk by chunk (just make sure your auxiliary buffer has enough spare bytes at the end for the worst case, to avoid performing two I/O operations or compressing two chunks, one of them degenerate).

Even data compression is fast now and helps reduce the I/O: the cost of compression in memory is small compared to the cost of I/O, so much so that the Windows kernel can now also use generic data compression when paging memory pages, to improve the overall performance of the system when the memory page pool is full, or for disk virtualization purposes.

The UTF-8 encoding is extremely simple and very fast to implement, and in most cases it saves a lot compared to storing UTF-32 (including for large collections of text elements in memory). So using iterators is the way to go: it is simple to program, easy to optimize, and you completely forget that UTF-8 is used in the backing store.

2015-10-14 0:37 GMT+02:00 Richard Wordingham <[email protected]>:

> On Tue, 13 Oct 2015 16:09:16 +0100
> Daniel Bünzli <[email protected]> wrote (under topic heading
> 'Counting Codepoints')
>
> > I don't understand why people still insist on programming with
> > Unicode at the encoding level rather than at the scalar value level.
> > Deal with encoding errors and sanitize your inputs at the IO boundary
> > of your program and then simply work with scalar values internally.
>
> If you are referring to indexing, I suspect the issue is performance.
> UTF-32 feels wasteful, and if the underlying character text is UTF-8 or
> UTF-16 we need an auxiliary array to convert character number to byte
> offset if we are to have O(1) time for access.
>
> This auxiliary array can be compressed chunk by chunk, but the larger
> the chunk, the greater the maximum access time. The way it could work
> is a bit strange, because this auxiliary array is redundant. For
> example, you could use it to record the location of every 4th or every
> 5th codepoint so as to store UTF-8 offset variation in 4 bits, or every
> 15th codepoint for UTF-16. Access could proceed by looking up the
> index for the relevant chunk, then adding up nibbles to find the
> relevant recorded location within the chunk, and then use the basic
> character storage itself to finally reach the intermediate points.
>
> (I doubt this is an original idea, but I couldn't find it expressed
> anywhere. It probably performs horribly for short strings.)
>
> Perhaps you are merely suggesting that people work with a character
> iterator, or in C refrain from doing integer arithmetic on pointers
> into strings.
>
> Richard.
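Concerning the auxiliary array quoted above: it can indeed be made quite compact. Here is a rough sketch in C of the kind of scheme Richard describes, recording one byte position every 4 code points and storing only the variation over the 4-byte minimum in a nibble; the chunk size, the fixed array sizes and the names are my own arbitrary choices for the example.

/* Sketch of a sampled offset index for UTF-8: one absolute byte offset
   per chunk, plus a 4-bit delta per sample inside the chunk.  Lookup is
   one chunk read, a few nibble additions, then at most SAMPLE-1 steps
   through the UTF-8 data itself. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define SAMPLE    4   /* record one position every 4 code points          */
#define PER_CHUNK 16  /* samples per chunk => 64 code points per chunk    */

struct u8_index {
    uint32_t chunk_off[64];  /* absolute byte offset at each chunk start  */
    uint8_t  nibble[512];    /* packed (delta - SAMPLE) per sample, 0..12 */
};

static int seq_len(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

static unsigned get_nibble(const struct u8_index *ix, size_t s) {
    uint8_t b = ix->nibble[s / 2];
    return (s & 1) ? (b >> 4) : (b & 0x0F);
}

/* Walk the string once; every SAMPLE code points record either a full
   chunk offset or the byte variation since the previous sample. */
static void build(struct u8_index *ix, const unsigned char *s, size_t len) {
    size_t byte = 0, cp = 0, prev = 0;
    memset(ix, 0, sizeof *ix);
    while (byte < len) {
        if (cp % SAMPLE == 0) {
            size_t smp = cp / SAMPLE;
            if (smp % PER_CHUNK == 0)
                ix->chunk_off[smp / PER_CHUNK] = (uint32_t)byte;
            else
                ix->nibble[smp / 2] |=
                    (uint8_t)((byte - prev - SAMPLE) << ((smp & 1) ? 4 : 0));
            prev = byte;
        }
        byte += seq_len(s[byte]);
        cp++;
    }
}

/* Byte offset of code point n. */
static size_t cp_to_byte(const struct u8_index *ix, const unsigned char *s,
                         size_t n) {
    size_t smp = n / SAMPLE, chunk = smp / PER_CHUNK;
    size_t byte = ix->chunk_off[chunk];
    for (size_t j = chunk * PER_CHUNK + 1; j <= smp; j++)
        byte += SAMPLE + get_nibble(ix, j);          /* sum the nibbles    */
    for (size_t k = 0; k < n % SAMPLE; k++)
        byte += seq_len(s[byte]);                    /* finish in the data */
    return byte;
}

int main(void) {
    /* "a é b € c d é e € f" without the spaces: 10 code points, 16 bytes */
    const char *text = "a\xC3\xA9" "b\xE2\x82\xAC" "c"
                       "d\xC3\xA9" "e\xE2\x82\xAC" "f";
    struct u8_index ix;
    build(&ix, (const unsigned char *)text, strlen(text));
    for (size_t n = 0; n < 10; n++)
        printf("code point %zu starts at byte %zu\n",
               n, cp_to_byte(&ix, (const unsigned char *)text, n));
    return 0;
}

With these parameters that is about 12 bytes of index per 64 code points (roughly 1.5 bits per code point), and, as Richard notes, the chunk size is the knob that trades index size against the number of nibble additions per access.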

