Re: [sqlite] Encodings question

2004-04-19 Thread D. Richard Hipp
Bertrand Mansion wrote:

> I am not sure what this means ?
>
> It is not recommended that you use PHP in a web-server configuration with a
> version of the SQLite library compiled with UTF-8 support, since libsqlite
> will abort the process if it detects a problem with the UTF-8 encoding.

There is no code to do this in *my* version of SQLite.  I don't
know what changes the PHP people may have made to the version
they bundle, however.

The only difference in the core SQLite between ISO8859 and UTF8
is in the operation of the length(), substr(), and like() functions.
(The like() function is used to implement the LIKE operator.)
Since all of those routines can be overridden at run-time, you
can make an SQLite that is compiled for ISO8859 work with
UTF8 and vice versa, simply by substituting different implementations
for the affected functions.  If you don't use any of those functions,
the encoding does not matter.
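
As a rough illustration of the length() difference described above, here is a minimal Python sketch. It is not SQLite's C code, and the names `length_8bit` and `length_utf8` are made up for this example. The UTF-8 variant assumes that counting every byte that is not a continuation byte (10xxxxxx) gives the number of characters.

```python
def length_8bit(data: bytes) -> int:
    """Length as an 8-bit (ISO8859-style) build would see it: one byte, one character."""
    return len(data)


def length_utf8(data: bytes) -> int:
    """Length as a UTF-8 aware build would see it.

    Counts every byte that is NOT a continuation byte (10xxxxxx),
    i.e. the first byte of each encoded character.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


text = "caf\u00e9"                        # "cafe" with an acute accent on the e
utf8_bytes = text.encode("utf-8")         # the accented letter becomes two bytes
latin1_bytes = text.encode("iso-8859-1")  # the accented letter is a single byte

print(length_8bit(latin1_bytes))  # 4
print(length_utf8(utf8_bytes))    # 4
print(length_8bit(utf8_bytes))    # 5: the 8-bit view counts bytes, not characters
```

For pure ASCII data both functions return the same number, which is why the compiled-in encoding only matters once non-ASCII text meets one of these functions.
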
--
D. Richard Hipp -- [EMAIL PROTECTED] -- 704.948.4565
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: [sqlite] Encodings question

2004-04-19 Thread Bertrand Mansion
<[EMAIL PROTECTED]> wrote :

> Bertrand Mansion wrote:
> 
>> As far as I understand, UTF-8 will read 8859-1 without problem but
>> ISO-8859-1 will not be able to read UTF-8, unless everything in the UTF8
>> string uses only 8859-1 codes.
> 
> You're wrong, I think.
> 
> UTF-8 is a variable-length encoding of the character codes of the Unicode
> code page. ISO-8859-1 is a definition of a code page in which each
> character is encoded in exactly one byte.
>
> Unicode itself is a code page with many more characters than ISO-8859-1.
>
> The Unicode, ISO-8859-1 and ASCII code pages share the following properties:
>
> a.) Character codes 0 up to 127 in Unicode are equal to the ASCII codes.
> b.) Character codes 128 up to 255 in Unicode are equal to the ISO-8859-1
> codes.
>
> Please note: A 'character code' is _not_ a byte! It's the number of the
> position of that character in a code page. The code page in ISO-8859-1 is
> only 8 bits wide and has 256 entries. The Unicode code page is 21 bits
> wide, and not all positions are assigned to characters.
>
> In ISO-8859-1 all 256 character codes are encoded using simply one byte.
> The value of the byte is the character position in the code page.
>
> In UTF-8 character codes 0 up to 127 are encoded in one byte and
> character codes above 127 are encoded in two or more bytes: _two_ bytes
> for codes 128 up to 2047 (which covers all of ISO-8859-1), and three or
> four bytes for higher codes.
>
> That means the byte values of encoded character codes 0 up to 127 are
> equal in UTF-8 and ISO-8859-1, but character codes 128 up to 255 take two
> bytes in UTF-8 and one byte in ISO-8859-1.
>
> In ISO-8859-1 the byte value is always the character code. In UTF-8 this
> is only true for character codes 0 up to 127.
>
> However, UTF-8 as originally designed (the Unicode code page encoding) can
> encode character codes up to 31 bits wide, using 6 bytes; the current
> definition (RFC 3629) stops at 4 bytes.

Thanks for the clear explanations :)

Does this mean that as long as I only use ASCII in a UTF8-compiled sqlite
library, the db will also be usable with an ISO-8859-1-compiled version of
the library, but if I use, for instance, accented characters, it won't be
compatible anymore?

I am asking because I once created an 8859-1 db and it could be read and
modified in the UTF8 version of the library. I haven't tested the other way
though. What will happen if I update fields with accented characters in
my application compiled with UTF8 and then try to open the db with, let's
say, the PHP sqlite extension? I'll try to see what happens.
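
One way to picture the two directions, sketched in Python rather than with the SQLite library itself: this assumes the stored bytes are passed through unchanged and only the reader's interpretation differs. The helper name `read_as` is made up for this example.

```python
def read_as(stored: bytes, encoding: str) -> str:
    """Interpret the stored bytes the way a reader assuming `encoding` would."""
    return stored.decode(encoding, errors="replace")


ascii_only = "hello".encode("utf-8")
accented = "\u00e9t\u00e9".encode("utf-8")  # French "ete" with accents, written as UTF-8

# Pure ASCII reads the same either way.
print(read_as(ascii_only, "iso-8859-1"))

# Each accented letter shows up as two 8859-1 characters.
print(read_as(accented, "iso-8859-1"))

# 8859-1 bytes read as UTF-8 are invalid sequences, shown here as U+FFFD.
print(read_as("\u00e9t\u00e9".encode("iso-8859-1"), "utf-8"))
```

So under that assumption the ASCII-only case stays compatible, while accented text is stored fine but displayed wrongly by whichever side assumes the other encoding.
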

On the php site, they warn users:


The default PHP distribution builds libsqlite in ISO-8859-1 encoding mode.
However, this is a misnomer; rather than handling ISO-8859-1, it operates
according to your current locale settings for string comparisons and sort
ordering. So, rather than ISO-8859-1, you should think of it as being
'8-bit' instead.


I am not sure what this means ?


It is not recommended that you use PHP in a web-server configuration with a
version of the SQLite library compiled with UTF-8 support, since libsqlite
will abort the process if it detects a problem with the UTF-8 encoding.


So, it looks like it is recommended not to use UTF8. But how then can I deal
with characters like the euro symbol ? I guess that I am stuck ?

Bertrand Mansion
Mamasam






Re: [sqlite] Encodings question

2004-04-19 Thread Michael Roth
Bertrand Mansion wrote:

> As far as I understand, UTF-8 will read 8859-1 without problem but
> ISO-8859-1 will not be able to read UTF-8, unless everything in the UTF8
> string uses only 8859-1 codes.

You're wrong, I think.

UTF-8 is a variable-length encoding of the character codes of the Unicode
code page. ISO-8859-1 is a definition of a code page in which each
character is encoded in exactly one byte.

Unicode itself is a code page with many more characters than ISO-8859-1.

The Unicode, ISO-8859-1 and ASCII code pages share the following properties:

a.) Character codes 0 up to 127 in Unicode are equal to the ASCII codes.
b.) Character codes 128 up to 255 in Unicode are equal to the ISO-8859-1
codes.

Please note: A 'character code' is _not_ a byte! It's the number of the
position of that character in a code page. The code page in ISO-8859-1 is
only 8 bits wide and has 256 entries. The Unicode code page is 21 bits
wide, and not all positions are assigned to characters.

In ISO-8859-1 all 256 character codes are encoded using simply one byte.
The value of the byte is the character position in the code page.

In UTF-8 character codes 0 up to 127 are encoded in one byte and
character codes above 127 are encoded in two or more bytes: _two_ bytes
for codes 128 up to 2047 (which covers all of ISO-8859-1), and three or
four bytes for higher codes.

That means the byte values of encoded character codes 0 up to 127 are
equal in UTF-8 and ISO-8859-1, but character codes 128 up to 255 take two
bytes in UTF-8 and one byte in ISO-8859-1.

In ISO-8859-1 the byte value is always the character code. In UTF-8 this
is only true for character codes 0 up to 127.

However, UTF-8 as originally designed (the Unicode code page encoding) can
encode character codes up to 31 bits wide, using 6 bytes; the current
definition (RFC 3629) stops at 4 bytes.
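
To make the byte counts concrete, here is a small Python sketch using the standard codecs. The helper `encoded_sizes` and the three sample characters are chosen only for illustration.

```python
samples = {
    "A": "\u0041",        # ASCII letter, code 65
    "e-acute": "\u00e9",  # code 233, inside ISO-8859-1
    "euro": "\u20ac",     # code 8364, outside ISO-8859-1
}


def encoded_sizes(ch: str):
    """Return (bytes in ISO-8859-1 or None, bytes in UTF-8) for one character."""
    try:
        latin1 = ch.encode("iso-8859-1")
    except UnicodeEncodeError:
        latin1 = None  # the character has no position in the 8859-1 code page
    return latin1, ch.encode("utf-8")


for name, ch in samples.items():
    latin1, utf8 = encoded_sizes(ch)
    print(name, ord(ch), latin1, utf8)
```

The ASCII letter is the same single byte in both encodings, the accented letter is one byte (0xE9) in ISO-8859-1 and two bytes (0xC3 0xA9) in UTF-8, and the euro sign takes three bytes in UTF-8 and cannot be encoded in ISO-8859-1 at all.
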



[sqlite] Encodings question

2004-04-18 Thread Bertrand Mansion
Hi,

I am a bit of a newbie with encodings...
I know that sqlite supports 2 kinds of encodings natively at the moment.
These encodings are ISO-8859-1 and UTF8. The choice of encoding is set at
compile time.

As far as I understand, UTF-8 will read 8859-1 without problem but
ISO-8859-1 will not be able to read UTF-8, unless everything in the UTF8
string uses only 8859-1 codes.

So, the best choice for compatibility and portability seems to be UTF8.

Unfortunately, PHP for example, ships a version of sqlite that is 8859-1
compiled, this means that a lot of people are going to use sqlite with this
charset, without knowing they could benefit from UTF8. So at the moment, I
prefer to stick with ISO-8859-1 in my desktop application.

I have tried to insert the euro symbol in a column and it came out as '?'

Do you have any idea about what is causing this ?

I have read that the euro symbol was supported in an extension of the 8859-1
charset (), is it also supported by sqlite or do I have to switch to UTF8,
something I would like to avoid at the moment (waiting for Sqlite 3 ;) ).

Can you confirm that you don't have any problems with the euro symbol in
your own applications, using 8859-1?

BTW, my desktop app runs on Mac OS X, so I first convert the user input from
MacRoman to ISO-8859-1, which works fine with accents, then run it through
sqlite_mprintf("%q", myCString) to escape the string. Before displaying
the string again, I convert it back from 8859-1 to MacRoman.
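
One possible explanation for the '?', sketched in Python: ISO-8859-1 has no euro sign, so the character may already be lost in the MacRoman to ISO-8859-1 step, before sqlite ever sees it. This assumes the conversion routine substitutes '?' for characters it cannot map. The actual Mac OS X conversion call is not shown here, so `macroman_to_latin1` and `latin1_to_macroman` are hypothetical stand-ins built on Python's `mac_roman` codec.

```python
def macroman_to_latin1(raw: bytes) -> bytes:
    """Convert MacRoman bytes to ISO-8859-1, replacing unmappable characters with '?'."""
    return raw.decode("mac_roman").encode("iso-8859-1", errors="replace")


def latin1_to_macroman(raw: bytes) -> bytes:
    """Convert ISO-8859-1 bytes back to MacRoman for display."""
    return raw.decode("iso-8859-1").encode("mac_roman", errors="replace")


e_acute = "\u00e9".encode("mac_roman")  # accented letter, exists in both charsets
euro = "\u20ac".encode("mac_roman")     # euro sign, exists in MacRoman only

print(macroman_to_latin1(e_acute))  # b'\xe9'
print(macroman_to_latin1(euro))     # b'?'
```

If that is what happens, the accented letters survive the round trip while the euro sign does not, which would match the behaviour described above.
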

Thanks for any advice,

Bertrand Mansion
Mamasam

