> > If we design a general enough solution, yes.
> > However, I think that such a solution is too
> > inefficient. I think the only way to get it
> > fast enough is to limit it so it knows how
> > the different formats relate.
> 
> I beg to differ. 

OK, slight misunderstanding.

> What I mean is that if we mark the encoding
> with an int, 32 bits is big enough to hold any foreseeable
> number of encodings.

Yes, that is not a problem.

> Now, this should not affect (significantly)
> our performance.  What will happen is that we will carry around
> the encoding until we have to transform it to a specific one
> (say, if it is a filename, or we need to print it, it will be UTF8),
> so we will have functions of the form:
> 
> 
> HEAP_strdupXtoUTF8(int enc, LPSTR str);
> HEAP_strdupXtoUTF16(int enc, LPSTR str);
> 
> the speed of such functions is (almost) unaffected by the number of
> encodings that we support, as internally all they have is probably a
> switch statement.
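
To make that concrete, such a converter would look roughly like
this (a sketch only -- the encoding tags, the function name and the
naive BMP-only transcoder are all invented for illustration):

#include <stdlib.h>
#include <string.h>

/* Hypothetical encoding tags -- a 32-bit int is plenty for these. */
enum str_enc { ENC_ASCII, ENC_UTF8, ENC_UTF16 };

/* Naive UTF-16 (host order, BMP only) to UTF-8 transcoder, just for
   illustration; real code would handle surrogate pairs too. */
static char *utf16_to_utf8( const unsigned short *src )
{
    size_t len = 0, i;
    char *dst, *p;

    for (i = 0; src[i]; i++)
        len += src[i] < 0x80 ? 1 : src[i] < 0x800 ? 2 : 3;
    if (!(dst = p = malloc( len + 1 ))) return NULL;
    for (i = 0; src[i]; i++)
    {
        unsigned short c = src[i];
        if (c < 0x80) *p++ = (char)c;
        else if (c < 0x800)
        {
            *p++ = (char)(0xc0 | (c >> 6));
            *p++ = (char)(0x80 | (c & 0x3f));
        }
        else
        {
            *p++ = (char)(0xe0 | (c >> 12));
            *p++ = (char)(0x80 | ((c >> 6) & 0x3f));
            *p++ = (char)(0x80 | (c & 0x3f));
        }
    }
    *p = 0;
    return dst;
}

/* The proposed converter: one switch over every supported encoding. */
char *strdupXtoUTF8( int enc, const void *str )
{
    switch (enc)
    {
    case ENC_ASCII:
    case ENC_UTF8:
    {
        /* ASCII is a subset of UTF-8, so a plain copy suffices */
        size_t len = strlen( str ) + 1;
        char *ret = malloc( len );
        if (ret) memcpy( ret, str, len );
        return ret;
    }
    case ENC_UTF16:
        return utf16_to_utf8( str );
    default:
        return NULL;  /* unknown encoding tag */
    }
}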

What I meant is that _every_ such function needs to know about
_every_ supported encoding, so you can't just add new encodings
just like that, unless you have a very advanced and likely
inefficient scheme.

Note that you need more functions like 
        HEAP_strcatXandYtoZ(int enc1, LPSTR str1, 
                int enc2, LPSTR str2, int *unified_enc);
if you want to be able to concatenate strings without
having to convert to a common format, and you will need
        HEAP_strcmpXandY(int enc1, LPSTR str1,
                int enc2, LPSTR str2);
as well, and probably others besides.
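
For instance, a mixed compare ends up looking something like this
(sketch; it reuses the hypothetical tags and strdupXtoUTF8 from the
sketch above):

/* Cross-encoding compare: every function like this has to know about
   every *combination* of encodings. Adding one encoding means
   revisiting all of them. */
int strcmpXandY( int enc1, const void *str1, int enc2, const void *str2 )
{
    char *a, *b;
    int ret;

    /* Fast path: ASCII and UTF-8 strings are byte-comparable */
    if (enc1 != ENC_UTF16 && enc2 != ENC_UTF16)
        return strcmp( str1, str2 );

    /* General fallback: convert both sides to a common format first --
       which is exactly the conversion this scheme was meant to avoid. */
    a = strdupXtoUTF8( enc1, str1 );
    b = strdupXtoUTF8( enc2, str2 );
    ret = (a && b) ? strcmp( a, b ) : -1;  /* crude error handling */
    free( a );
    free( b );
    return ret;
}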

> > > And, on top  of it all, it should be more efficient.
> >
> > More efficient for the _theoretical_ average case perhaps,
> > but definitely not for the common case, which is ASCII and
> > will likely remain so for the foreseeable future.
> >
> > Being lazy penalizes all cases equally, but all cases are
> > not equally likely in the real world.
> 
> No, it doesn't. In fact, if the input is straight ASCII, we need not
> worry, because in most cases we deal with UTF8, which is
> compatible with ASCII. So for the common case, we are as
> fast as we can be (ignoring the very small overhead of carrying
> the encoding around).

Even if HEAP_strdupXtoUTF8(ASCII, str) did nothing,
you would still have the overhead caused by
1. An extra function call to HEAP_strdupXtoUTF8.
2. An extra memory allocation. And no, you can't just pass the
   same string back, since somebody will do HeapFree on it later;
   see the sketch below.
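
In other words, even the identity case looks something like this
(sketch; HeapAlloc/HeapFree/GetProcessHeap are the real Win32 calls,
the rest is hypothetical):

#include <windows.h>
#include <string.h>

/* Even when no conversion is needed, the result must be a fresh heap
   block, because the caller owns it and will HeapFree() it later. */
LPSTR HEAP_strdupXtoUTF8( int enc, LPCSTR str )
{
    SIZE_T len = strlen( str ) + 1;
    LPSTR ret;

    (void)enc;  /* assume an ASCII/UTF-8 tag: nothing to transcode */

    /* ...yet we still pay for a call, an allocation and a copy: */
    ret = HeapAlloc( GetProcessHeap(), 0, len );
    if (ret) memcpy( ret, str, len );
    return ret;  /* the caller will HeapFree() this later */
}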

Furthermore, your solution will require close to every
function dealing with strings to be rewritten.

> > My preferred solution is to have a common
> > C file and compile it several times, with different
> > defines for each format that needs to be supported.
> 
> Hmm, I don't like this either -- I agree with Alexandre on this one.

I see; mind if I ask why?
I really can't see what is so bad about my solution,
in comparison with yours or with the current solution.

It can support
1. ASCII only (no runtime overhead compared to theoretical best)
2. UNICODE (UTF16) only (no runtime overhead compared to theoretical best)
3. ASCII and UNICODE (UTF16) with ASCII as internal format
   (ASCII has no runtime overhead compared to theoretical best)
4. ASCII and UNICODE (UTF16) with UTF8 as internal format
   (ASCII has no runtime overhead compared to theoretical best)
5. ASCII and UNICODE (UTF16) with UTF16 as internal format
   (UNICODE (UTF16) has no runtime overhead compared to theoretical best)
6. ASCII and UNICODE (UTF16) with <whatever> as internal format

OK, it takes a lot of disk space if you want to be able
to run in every mode, but that is of course not necessary.
I should guess that most distributions will choose (4),
but that is up to them. Embedded systems can choose (1),
or in rare cases (2) if they really need UNICODE support.
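
Roughly, this is what I have in mind (a sketch only; every macro,
type and symbol name here is invented):

/* string_generic.c -- one common source file, compiled once per format:
 *
 *   cc -DINTERNAL_UTF16 -c string_generic.c -o string_utf16.o
 *   cc                  -c string_generic.c -o string_ascii.o
 */
#include <stddef.h>

#ifdef INTERNAL_UTF16
typedef unsigned short XCHAR;      /* 16-bit code units */
#define XFN(name) name##_utf16     /* distinct symbol per build */
#else
typedef char XCHAR;                /* ASCII or UTF-8 bytes */
#define XFN(name) name##_ascii
#endif

/* The same code serves every format, with no runtime encoding tag
   and no per-call switch -- the compiler specializes it for us. */
size_t XFN(x_strlen)( const XCHAR *s )
{
    const XCHAR *p = s;
    while (*p) p++;
    return p - s;
}

int XFN(x_strcmp)( const XCHAR *a, const XCHAR *b )
{
    while (*a && *a == *b) { a++; b++; }
    return (int)*a - (int)*b;
}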
