Note also that UTF-8 encoded sequences can be up to 5 bytes long...
How is that possible? I was under the impression that a UTF-8 sequence could never be more than 4 bytes (i.e. U+10FFFF becomes F4 8F BF BF).
Unicode and ISO/IEC 10646 define UTF-8 differently: Unicode stops at 4 bytes, while ISO/IEC 10646 allows longer sequences (up to 6 bytes in its original definition). However, every sequence longer than 4 bytes is either an illegal sequence or decodes to a value above U+10FFFF, which is not a legal code point.
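
To see why nothing longer than 4 bytes can be legal, here is a rough Python sketch of the arithmetic. The helper names (payload_bits, smallest_value_needing) are made up for illustration, and the bit layout assumed is the original 1- to 6-byte scheme, not anything taken from a particular library.

```python
MAX_CODE_POINT = 0x10FFFF  # highest legal Unicode code point


def payload_bits(n):
    """Payload bits carried by an n-byte sequence in the original 1..6 byte scheme."""
    if n == 1:
        return 7  # 0xxxxxxx
    # Lead byte: n one-bits, a zero bit, then 7 - n payload bits.
    # Each of the n - 1 continuation bytes (10xxxxxx) carries 6 bits.
    return (7 - n) + 6 * (n - 1)


def smallest_value_needing(n):
    """Smallest value that needs n bytes; anything lower would be an overlong form."""
    return 0 if n == 1 else 1 << payload_bits(n - 1)


for n in range(1, 7):
    lo = smallest_value_needing(n)
    hi = (1 << payload_bits(n)) - 1
    status = "beyond U+10FFFF" if lo > MAX_CODE_POINT else "usable"
    print(f"{n} bytes: {lo:#x}..{hi:#x} ({status})")

# The highest code point fits in 4 bytes, as noted in the question.
print(chr(MAX_CODE_POINT).encode("utf-8").hex(" "))

# A 5-byte form (lead byte F8) is refused by Python's strict decoder.
try:
    bytes([0xF8, 0x88, 0x80, 0x80, 0x80]).decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

The smallest value that genuinely needs 5 bytes is 0x200000, which is already past U+10FFFF; a 5-byte sequence holding anything smaller is an overlong form, which is illegal as well.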
Stefan

