Re: [twsocket] Charset conversion On-The-Fly

2010-03-28 Thread Francois PIETTE

Hello Arno,

In what context do you need on-the-fly conversion? Are you trying to
display email content while it is being transferred?



> array [0..9] of Byte = ($1B, $24, $42, $21, $41, $21, $41, $1B, $28, $42);
> The leading ESC-sequence $1B, $24, $42 tells the decoder to treat the
> following bytes as double-byte characters ($21, $41, $21, $41) and the
> trailing ESC-sequence shifts back to ASCII mode. This sample should
> translate to two Unicode code points ～～ correctly.
>
> It's easy to imagine what will happen if we do not pass the entire sequence
> to MultiByteToWideChar() but, for instance, split up in two chunks. Where
> the first chunk $1B, $24, $42, $21, $41 should translate just fine,
> however the second $21, $41, $1B, $28, $42 translates to garbage since
> there is no longer a leading ESC-sequence.


Is the leading ESC-sequence $1B, $24, $42 always the same? Are there
different leading ESC-sequences?

At first glance, it is not difficult to implement a conversion routine based
on MultiByteToWideChar by prefixing the next chunk with the leading
ESC-sequence detected in the previous chunk. The implementation could mimic
ConvertInetMultibyteToUnicode or be encapsulated in a class, for example a
stream-like class.
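
A minimal sketch of that prefixing idea, written in Python rather than Delphi so it can be run anywhere. It assumes only the four three-byte escape sequences that RFC 1468 defines for ISO-2022-JP, and it uses Python's iso2022_jp codec as a stand-in for MultiByteToWideChar(). The class name PrefixingDecoder is made up for illustration. Besides remembering the last escape sequence, the sketch also has to hold back an escape sequence or double-byte character that is cut by the chunk boundary.

```python
ESC = 0x1B
ASCII = b"\x1b(B"
SINGLE_BYTE = {ASCII, b"\x1b(J"}        # ASCII, JIS X 0201 Roman
DOUBLE_BYTE = {b"\x1b$@", b"\x1b$B"}    # JIS X 0208-1978 / JIS X 0208-1983


class PrefixingDecoder:
    """Decode ISO-2022-JP chunk by chunk by re-sending the active escape sequence."""

    def __init__(self):
        self.mode = ASCII     # escape sequence in effect at the end of the last chunk
        self.pending = b""    # bytes of a unit that was cut by the chunk boundary

    def feed(self, chunk: bytes) -> str:
        data = self.pending + chunk
        prefix = self.mode
        mode, i = self.mode, 0
        while i < len(data):
            if data[i] == ESC:
                seq = data[i:i + 3]
                if len(seq) < 3:
                    break                     # incomplete escape sequence
                if seq not in SINGLE_BYTE | DOUBLE_BYTE:
                    raise ValueError("unsupported escape sequence")
                mode, i = seq, i + 3
            elif mode in DOUBLE_BYTE:
                if i + 2 > len(data):
                    break                     # first half of a double-byte character
                i += 2
            else:
                i += 1
        self.pending, self.mode = data[i:], mode
        # Stateless decode of a chunk that now starts in the right mode.
        return (prefix + data[:i]).decode("iso2022_jp")


sample = bytes([0x1B, 0x24, 0x42, 0x21, 0x41, 0x21, 0x41, 0x1B, 0x28, 0x42])
dec = PrefixingDecoder()
text = dec.feed(sample[:5]) + dec.feed(sample[5:])
print(len(text))
```

The table of escape sequences is the weak point of this approach: every other charset would need its own table and its own rules for character width.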



--
francois.pie...@overbyte.be
The author of the freeware multi-tier middleware MidWare
The author of the freeware Internet Component Suite (ICS)
http://www.overbyte.be


- Original Message - 
From: Arno Garrels arno.garr...@gmx.de

To: ICS support mailing twsocket@elists.org
Sent: Saturday, March 27, 2010 10:52 PM
Subject: [twsocket] Charset conversion On-The-Fly



Hi,

You won't believe it: Windows provides neither an API to convert true
multi-byte character streams to Unicode on the fly nor an API to determine
the number of bytes in such multi-byte character sequences. I am not talking
about the double-byte charsets.

Let's say we receive an ANSI stream encoded with code page 50220
(iso-2022-jp). Those 7-bit encodings are still frequently used in emails or
HTML in the Far East. They use ESC-sequences to shift in or out of another
encoding mode.

Example:
array [0..9] of Byte = ($1B, $24, $42, $21, $41, $21, $41, $1B, $28, $42);
The leading ESC-sequence $1B, $24, $42 tells the decoder to treat the
following bytes as double-byte characters ($21, $41, $21, $41) and the
trailing ESC-sequence $1B, $28, $42 shifts back to ASCII mode. This sample
should translate to the two Unicode code points ～～.

It's easy to imagine what happens if we do not pass the entire sequence to
MultiByteToWideChar() but split it into two chunks instead. The first chunk
$1B, $24, $42, $21, $41 should translate just fine, but the second chunk
$21, $41, $1B, $28, $42 translates to garbage since it no longer has a
leading ESC-sequence.
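
The effect can be reproduced outside Windows. The sketch below is Python, not Delphi, and uses Python's stateless iso2022_jp codec as a stand-in for MultiByteToWideChar(). Python's mapping tables may not match code page 50220 exactly, so only the shape of the result is compared, not the exact code points.

```python
sample = bytes([0x1B, 0x24, 0x42, 0x21, 0x41, 0x21, 0x41, 0x1B, 0x28, 0x42])

whole = sample.decode("iso2022_jp")        # two double-byte characters
first = sample[:5].decode("iso2022_jp")    # ESC $ B plus one character: fine
second = sample[5:].decode("iso2022_jp")   # no ESC $ B, so the bytes are read as ASCII

print(len(whole), len(first), repr(second))   # 2 1 '!A'
```

The second chunk decodes without any error, which is what makes the problem hard to detect: $21, $41 is simply taken as the ASCII text "!A".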

AFAIK there are two possible solutions.

1.) Internet Explorer's (v5+) MLang.dll provides a much better API.
ConvertInetMultibyteToUnicode() takes and returns a Mode value that must
be initialized to zero on the first call. After converting the first chunk,
Mode returns a non-zero value. Passing this value when converting the second
chunk results in a correctly translated second chunk.
ConvertInetMultibyteToUnicode() also returns the number of translated source
bytes, which is rather useful too.

2.) GNU library iconv.dll, which is under LGPL.
It's around 800 KB and natively available on Linux and Mac OS. It's similar
here: iconv uses a context pointer to achieve the same.

Both require passing around either the Mode or the context. So, if we want
to fix the current charset bugs in ICS, some changes are required. It finally
turned out that a CodePage parameter alone is not enough to always handle
charsets properly.
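
To illustrate what "passing around the context" means, here is a minimal sketch in Python, not Delphi. Python's incremental decoder object plays the role of the Mode value or the iconv context; it is only an analogy, neither MLang nor iconv is called here.

```python
import codecs

sample = bytes([0x1B, 0x24, 0x42, 0x21, 0x41, 0x21, 0x41, 0x1B, 0x28, 0x42])
chunks = [sample[:5], sample[5:]]

# The decoder object carries the shift state from one call to the next.
decoder = codecs.getincrementaldecoder("iso2022_jp")()
parts = []
for n, chunk in enumerate(chunks):
    last = n == len(chunks) - 1
    parts.append(decoder.decode(chunk, final=last))
text = "".join(parts)

print(text == sample.decode("iso2022_jp"))
```

The point of the design is that the state lives in an object the caller keeps between calls, which a plain CodePage parameter cannot express.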

IMO it's time to move on to another design, most likely some custom
TEncoding class.

Or maybe you have another idea?

--
Arno Garrels

--
To unsubscribe or change your settings for TWSocket mailing list
please goto http://lists.elists.org/cgi-bin/mailman/listinfo/twsocket
Visit our website at http://www.overbyte.be 



Re: [twsocket] Charset conversion On-The-Fly

2010-03-28 Thread Arno Garrels
Francois PIETTE wrote:

> In which context do you have such a need of on-the-fly conversion ?
> Are you trying to display an email content while it is being
> transferred ?

It's required in TMimeDec, for example. The parser reads from a stream into
a buffer of fixed length, so it is possible that the buffer ends with
_any_ number of non-translatable bytes. We need to be able to detect such
invalid bytes, otherwise garbage is decoded.

We also need a reliable CharNext function in order not to break a byte
sequence unintentionally. This is required, for instance, in TSmtpCli when
the component has to fold header lines or wrap message text.
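
A rough idea of what such a CharNext function has to track, sketched in Python rather than Delphi. It assumes ISO-2022-JP with its three-byte escape sequences from RFC 1468 only; the name char_next is hypothetical and not an existing ICS function.

```python
ESC = 0x1B
ASCII = b"\x1b(B"
DOUBLE_BYTE = {b"\x1b$@", b"\x1b$B"}    # JIS X 0208-1978 / JIS X 0208-1983


def char_next(data: bytes, pos: int, mode: bytes):
    """Return (next_pos, mode) after skipping one escape sequence or one character."""
    if data[pos] == ESC:
        return pos + 3, data[pos:pos + 3]   # the escape sequence becomes the new mode
    if mode in DOUBLE_BYTE:
        return pos + 2, mode
    return pos + 1, mode


sample = bytes([0x1B, 0x24, 0x42, 0x21, 0x41, 0x21, 0x41, 0x1B, 0x28, 0x42])
pos, mode, boundaries = 0, ASCII, []
while pos < len(sample):
    pos, mode = char_next(sample, pos, mode)
    boundaries.append(pos)

print(boundaries)   # [3, 5, 7, 10]
```

Only these offsets are safe places to break the byte sequence. A break at offset 5 additionally needs ESC ( B before the line end and ESC $ B again at the start of the next line, which is why the function has to return the mode as well.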
 
>> array [0..9] of Byte = ($1B, $24, $42, $21, $41, $21, $41, $1B, $28,
>> $42); The leading ESC-sequence $1B, $24, $42 tells the decoder to
>> treat the following bytes as double-byte characters ($21, $41, $21,
>> $41) and the trailing ESC-sequence shifts back to ASCII mode. This
>> sample should translate to two Unicode code points ～～ correctly.
>>
>> It's easy to imagine what will happen if we do not pass the entire
>> sequence
>> to MultiByteToWideChar() but, for instance, split up in two chunks.
>> Where the first chunk $1B, $24, $42, $21, $41 should translate
>> just fine, however the second $21, $41, $1B, $28, $42 translates
>> to garbage since there is no longer a leading ESC-sequence.
>
> The leading ESC-sequence $1B, $24, $42 is always the same ?

No, there are multiple different ESC-sequences per charset, with
variable length, and some even shift in to three-byte character mode.
This was just one example.

Furthermore, MultiByteToWideChar() and WideCharToMultiByte() do
not work correctly with those charsets; they are buggy!

The ConvertINet API works around this by first converting these strings
internally to one of their corresponding native Windows charsets, in my
sample to DBCS Windows-932.
Have a look here: http://source.winehq.org/source/dlls/mlang/mlang.c
However, the implementation in WINE is _wrong_ and incomplete;
it handles Japanese only.

My sample above as two Unicode code points:

UStr := #$FF5E#$FF5E;

Try to convert this string with WideCharToMultiByte() to ANSI code page
50220: the result is two question marks (??). Both MLang's
ConvertINetUnicodeToMultiByte() and iconv give the correct result.

> At first glance, it is not difficult to implement a conversion
> routine based on MultiByteToWideChar by prefixing the next chunk with
> the same leading ESC-sequence we would have detected in the previous
> chunk. Implementation could mimic ConvertInetMultibyteToUnicode or be
> encapsulated in a class, for example a stream like class.

Yep, that was my first idea as well and the reason why I looked at the
WINE source code. I already translated parts of their mlang.c to Delphi
but, as said above, their implementation is buggy and incomplete.

--
Arno Garrels


 
 