Francois PIETTE wrote:
In which context do you need such on-the-fly conversion?
Are you trying to display email content while it is being
transferred?
It's required in TMimeDec, for example. The parser reads from a stream into
a fixed-length buffer, so the buffer may end with _any_ number of
non-translatable bytes, i.e. the start of a sequence that is completed only
by the next read. We need to be able to detect such bytes, otherwise garbage
is decoded.
We also need a reliable CharNext function so that we do not unintentionally
break a byte sequence. This is required, for instance, in TSmtpCli when
the component has to fold header lines or wrap message text.
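To illustrate the kind of CharNext-style check meant here, below is a minimal sketch for ISO-2022-JP only. SafeLen and its DoubleByte parameter are hypothetical names, not part of ICS or Windows. It assumes that only the escape forms ESC ( x, ESC $ x and ESC $ ( x occur, which does not cover every charset discussed in this thread.

```pascal
program SafeSplitDemo;
{$APPTYPE CONSOLE}

const
  Sample: array [0..9] of Byte =
    ($1B, $24, $42, $21, $41, $21, $41, $1B, $28, $42);

// Hypothetical helper: returns how many leading bytes of Buf can be
// converted without cutting an ESC sequence or a double-byte character
// in half. DoubleByte carries the shift state from one chunk to the next
// and must be False before the first chunk.
function SafeLen(const Buf: array of Byte; var DoubleByte: Boolean): Integer;
var
  I, N: Integer;
begin
  Result := 0;
  I := 0;
  N := Length(Buf);
  while I < N do
  begin
    if Buf[I] = $1B then
    begin
      if I + 2 >= N then Exit;               // ESC sequence is incomplete
      if Buf[I + 1] = $28 then               // ESC ( x : single-byte set
      begin
        DoubleByte := False;
        Inc(I, 3);
      end
      else if (Buf[I + 1] = $24) and (Buf[I + 2] = $28) then
      begin                                  // ESC $ ( x : four-byte form
        if I + 3 >= N then Exit;
        DoubleByte := True;
        Inc(I, 4);
      end
      else if Buf[I + 1] = $24 then          // ESC $ @ or ESC $ B
      begin
        DoubleByte := True;
        Inc(I, 3);
      end
      else
        Inc(I);                              // unknown: treat as plain byte
    end
    else if DoubleByte and (Buf[I] >= $21) and (Buf[I] <= $7E) then
    begin
      if I + 1 >= N then Exit;               // lone lead byte at buffer end
      Inc(I, 2);
    end
    else
      Inc(I);
    Result := I;                             // everything up to I is complete
  end;
end;

var
  Mode: Boolean;
begin
  Mode := False;
  // Pretend the read buffer ended after 6 bytes: ESC $ B, $21 $41, $21.
  // The lone $21 is held back, so only 5 bytes are safe to convert.
  WriteLn(SafeLen(Slice(Sample, 6), Mode));
end.
```

The bytes that SafeLen holds back would be moved to the front of the buffer before the next read, and the same check could tell TSmtpCli where a line may be folded.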
array [0..9] of Byte = ($1B, $24, $42, $21, $41, $21, $41, $1B, $28, $42);
The leading ESC-sequence $1B, $24, $42 tells the decoder to treat the
following bytes as double-byte characters ($21, $41, $21, $41) and the
trailing ESC-sequence shifts back to ASCII mode. Decoded correctly, this
sample should translate to two Unicode code points, ～～ (U+FF5E U+FF5E).
It's easy to imagine what will happen if we do not pass the entire sequence
to MultiByteToWideChar() but, for instance, split it into two chunks.
The first chunk $1B, $24, $42, $21, $41 should translate just fine, but
the second, $21, $41, $1B, $28, $42, translates to garbage since it no
longer has a leading ESC-sequence.
Is the leading ESC-sequence $1B, $24, $42 always the same?
No, there are multiple different ESC-sequences per charset, of variable
length, and some even shift into three-byte character mode;
this was just one example.
Furthermore, MultiByteToWideChar() and WideCharToMultiByte() do
not work correctly with those charsets; they are buggy!
The ConvertINet API works around this by first converting these strings
internally to one of their corresponding native Windows charsets, in my
sample to DBCS Windows-932.
Have a look here: http://source.winehq.org/source/dlls/mlang/mlang.c
However, the implementation in WINE is _wrong_ and incomplete;
it handles Japanese only.
My sample above as two Unicode code points:
UStr := #$FF5E#$FF5E;
Try to convert this string with WideCharToMultiByte() to ANSI code page
50220; the result is two question marks (??). Both MLang's
ConvertINetUnicodeToMultiByte() and iconv give the correct result.
At first glance, it is not difficult to implement a conversion
routine based on MultiByteToWideChar by prefixing the next chunk with
the leading ESC-sequence that we detected in the previous
chunk. The implementation could mimic ConvertINetMultiByteToUnicode or be
encapsulated in a class, for example a stream-like class.
Yep, that was my first idea as well and the reason why I looked at the
WINE source code. I already translated parts of their mlang.c to Delphi,
but as said above their implementation is buggy and incomplete.
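The prefixing idea discussed above could be sketched roughly as follows. ActiveEsc is a hypothetical helper, not an ICS or Windows function. It assumes that each chunk was cut at a safe boundary and that only three-byte ESC sequences and the four-byte ESC $ ( x form occur. It does nothing about the MultiByteToWideChar() bugs mentioned in this thread.

```pascal
program EscPrefixDemo;
{$APPTYPE CONSOLE}

const
  Chunk1: AnsiString = #$1B'$B!A';     // ESC $ B, $21 $41
  Chunk2: AnsiString = '!A'#$1B'(B';   // $21 $41, ESC ( B

// Hypothetical helper: returns the ESC sequence still in effect at the
// end of Chunk. Prev is the sequence that was in effect before the chunk
// ('' means the initial ASCII mode). A plain scan for ESC is enough here
// because $1B cannot occur inside a double-byte character ($21..$7E).
function ActiveEsc(const Chunk, Prev: AnsiString): AnsiString;
var
  I, L: Integer;
begin
  Result := Prev;
  I := 1;
  while I <= Length(Chunk) do
  begin
    if (Chunk[I] = #$1B) and (I + 2 <= Length(Chunk)) then
    begin
      L := 3;
      if (Chunk[I + 1] = '$') and (Chunk[I + 2] = '(') then
        L := 4;                              // ESC $ ( x
      if I + L - 1 > Length(Chunk) then
        Break;                               // incomplete sequence at the end
      Result := Copy(Chunk, I, L);
      Inc(I, L);
    end
    else
      Inc(I);
  end;
end;

var
  Esc, Fixed: AnsiString;
begin
  Esc := ActiveEsc(Chunk1, '');
  // Pass Fixed to the converter instead of the bare second chunk.
  Fixed := Esc + Chunk2;
  WriteLn(Length(Fixed));                    // 3 prefix bytes + 5 chunk bytes
end.
```

A stream-like class would keep the value returned by ActiveEsc as a field and feed it back in as Prev for the next chunk.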
--
Arno Garrels
--
francois.pie...@overbyte.be
The author of the freeware multi-tier middleware MidWare
The author of the freeware Internet Component Suite (ICS)
http://www.overbyte.be
- Original Message -
From: Arno Garrels arno.garr...@gmx.de
To: ICS support mailing twsocket@elists.org
Sent: Saturday, March 27, 2010 10:52 PM
Subject: [twsocket] Charset conversion On-The-Fly
Hi,
You won't believe it: Windows provides neither an API to convert true
multi-byte character streams on the fly to Unicode nor an API to
determine the number of bytes in such multi-byte character
sequences. I am not talking about the double-byte charsets.
Let's say we receive an ANSI stream encoded with code page 50220
(iso-2022-jp). Those 7-bit encodings are still frequently used in
emails and HTML in the Far East. They use ESC-sequences to shift in or
out of another encoding mode.
Example:
array [0..9] of Byte = ($1B, $24, $42, $21, $41, $21, $41, $1B, $28, $42);
The leading ESC-sequence $1B, $24, $42 tells the decoder to treat the
following bytes as double-byte characters ($21, $41, $21, $41) and the
trailing ESC-sequence shifts back to ASCII mode. Decoded correctly, this
sample should translate to two Unicode code points, ～～ (U+FF5E U+FF5E).
It's easy to imagine what will happen if we do not pass the entire sequence
to MultiByteToWideChar() but, for instance, split it into two chunks.
The first chunk $1B, $24, $42, $21, $41 should translate just fine, but
the second, $21, $41, $1B, $28, $42, translates to garbage since it no
longer has a leading ESC-sequence.
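For anyone who wants to reproduce the split described above, a small Windows-only console sketch follows. The constant name CP_ISO2022JP and the Dump procedure are made up for this example; MultiByteToWideChar() is the real Win32 call and requires dwFlags = 0 for code page 50220. No output is shown here because it depends on the Windows version.

```pascal
program SplitDemo;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

const
  CP_ISO2022JP = 50220;  // our own name for the code page number
  Sample: array [0..9] of Byte =
    ($1B, $24, $42, $21, $41, $21, $41, $1B, $28, $42);

// Converts Len bytes starting at P and prints the UTF-16 code units
// that MultiByteToWideChar() produced for them.
procedure Dump(const Title: string; P: PAnsiChar; Len: Integer);
var
  W: array [0..31] of WideChar;
  N, I: Integer;
begin
  N := MultiByteToWideChar(CP_ISO2022JP, 0, P, Len, @W[0], Length(W));
  Write(Title, ': ', N, ' code unit(s)');
  for I := 0 to N - 1 do
    Write(' U+', IntToHex(Ord(W[I]), 4));
  WriteLn;
end;

begin
  Dump('whole  ', PAnsiChar(@Sample[0]), 10);  // entire sequence at once
  Dump('chunk 1', PAnsiChar(@Sample[0]), 5);   // ESC $ B, $21 $41
  Dump('chunk 2', PAnsiChar(@Sample[5]), 5);   // $21 $41, ESC ( B
end.
```

Comparing the "whole" line with the two "chunk" lines shows whether the second chunk was decoded in the wrong mode.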
AFAIK there are two possible solutions.
1.) Internet Explorer's (v5+) MLang.dll provides a much better API.
ConvertINetMultiByteToUnicode() takes and returns a Mode value
that must be initialized to zero on the first call. After converting
the first chunk, Mode returns a value <> 0. Passing this value in when
converting the second chunk results in a correctly translated second chunk.
ConvertINetMultiByteToUnicode() also returns the number of translated
source bytes, which is rather useful too.
2.) GNU library iconv.dll which is under LGPL.
It's around 800 KB and