Re: [sqlite] hex and char functions

2017-08-09 Thread x
Thanks Rowan. I’ve picked up bits and pieces from the various replies to get a 
basic idea of what’s going on. My question should’ve been posted on the c++ 
builder forum in the first place. Sorry for wasting everyone’s time.

Tom



From: Rowan Worth<mailto:row...@dug.com>
Sent: 09 August 2017 03:29
To: SQLite mailing list<mailto:sqlite-users@mailinglists.sqlite.org>
Subject: Re: [sqlite] hex and char functions

On 8 August 2017 at 18:32, x <tam118...@hotmail.com> wrote:

> Why does it always turn out to be my own fault ☹
>

Welcome to programming ;)

It's very rare in practice to find an actual bug in a library or language,
especially one as widely deployed and tested as sqlite. Glad you're sorted,
I just wanted to have a quick go at clearing this up:

but finding out that the UTF8 code for the UTF16 code \u0085 is in fact
> \uc285 has tipped me over the edge
>

1. \u0085 is a unicode code-point, 'NEXT LINE (NEL)'
2. \uc285 is a unicode code-point, 'HANGUL SYLLABLE SWIT (솅)'
3. Neither of these is a UTF8 or UTF8 "code" -- UTF8 and UTF16 are
different ways of enconding/decoding unicode code-points
4. In UTF8, \u0085 is encoded as two bytes -- 0xc2 0x85
5. In UTF8, \uc285 is encoded as three bytes -- 0xec 0x8a 0x85

HTH,
-Rowan
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] hex and char functions

2017-08-08 Thread Rowan Worth
On 8 August 2017 at 18:32, x  wrote:

> Why does it always turn out to be my own fault ☹
>

Welcome to programming ;)

It's very rare in practice to find an actual bug in a library or language,
especially one as widely deployed and tested as sqlite. Glad you're sorted,
I just wanted to have a quick go at clearing this up:

but finding out that the UTF8 code for the UTF16 code \u0085 is in fact
> \uc285 has tipped me over the edge
>

1. \u0085 is a unicode code-point, 'NEXT LINE (NEL)'
2. \uc285 is a unicode code-point, 'HANGUL SYLLABLE SWIT (솅)'
3. Neither of these is a UTF8 or UTF8 "code" -- UTF8 and UTF16 are
different ways of enconding/decoding unicode code-points
4. In UTF8, \u0085 is encoded as two bytes -- 0xc2 0x85
5. In UTF8, \uc285 is encoded as three bytes -- 0xec 0x8a 0x85

HTH,
-Rowan
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] hex and char functions

2017-08-08 Thread x
Just received this on Embarcadero forum.

You are assigning an Ansi literal string to a Unicode string so a conversion 
from the current Ansi locale is occurring.
Try assigning a unicode literal:

UnicodeString S= L"\u0085";

That L in front of the string makes Code==133 and it’s also 133 inside the udf 
after retrieval using sqlite3_value_text16.

Why does it always turn out to be my own fault ☹

Many thanks to all who contributed.




From: x<mailto:tam118...@hotmail.com>
Sent: 08 August 2017 09:24
To: SQLite mailing list<mailto:sqlite-users@mailinglists.sqlite.org>
Subject: Re: [sqlite] hex and char functions

I accidentally deleted the latest 3 replies to this thread in my mail so 
apologies for not including any content from those mails.

Ryan, the error happens immediately in c++ builder. E.g.

UnicodeString S=”\u0085”;
Int Code=S[1];

Code returns 63 (x3F) which is code for ‘?’ and the question mark is also what 
shows when I hover the cursor over S during debugging.
As u+0085 is a control character, and so has no associated graphic, I can 
forgive the ‘?’ being printed in its place but why does the underlying Code 
change? I have posted this question on the c++ builder website and will post 
any answer here.


Simon’s timing with his comment about debugging is spooky. Last night I found 
that when I try to ‘Evaluate/Modify’ S[1] during debugging I get the following 
error message

E2015 Ambiguity between 'System::UnicodeString::System::UnicodeString::operator 
[](const int)' and 'System::UnicodeString::System::UnicodeString::operator 
[](const int)'

Ambiguity between X and X?? I had no such problems earlier yesterday but, as it 
was working fine when I first encountered the ‘\u0085’ problem, I doubt it’s 
the debugger.


I conducted the following experiment in SQLite Expert Pro by creating a couple 
of new empty databases.
select hex(char(65,133,66))
returned
'41C28542' when the current db was utf-8
'410085004200' when the current db was utf-16le
'004100850042' when the current db was utf-16be

The bottom one of those is what I expected when I first ran the select yet I 
think I'm right in saying my system (windows 10) is little endian. This is a 
further confusion thrown into the mix for me. As is usually the case. I've 
little doubt sqlite is coming up with the right answer in each case.

Tom
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] hex and char functions

2017-08-08 Thread x
I accidentally deleted the latest 3 replies to this thread in my mail so 
apologies for not including any content from those mails.

Ryan, the error happens immediately in c++ builder. E.g.

UnicodeString S=”\u0085”;
Int Code=S[1];

Code returns 63 (x3F) which is code for ‘?’ and the question mark is also what 
shows when I hover the cursor over S during debugging.
As u+0085 is a control character, and so has no associated graphic, I can 
forgive the ‘?’ being printed in its place but why does the underlying Code 
change? I have posted this question on the c++ builder website and will post 
any answer here.


Simon’s timing with his comment about debugging is spooky. Last night I found 
that when I try to ‘Evaluate/Modify’ S[1] during debugging I get the following 
error message

E2015 Ambiguity between 'System::UnicodeString::System::UnicodeString::operator 
[](const int)' and 'System::UnicodeString::System::UnicodeString::operator 
[](const int)'

Ambiguity between X and X?? I had no such problems earlier yesterday but, as it 
was working fine when I first encountered the ‘\u0085’ problem, I doubt it’s 
the debugger.


I conducted the following experiment in SQLite Expert Pro by creating a couple 
of new empty databases.
select hex(char(65,133,66))
returned
'41C28542' when the current db was utf-8
'410085004200' when the current db was utf-16le
'004100850042' when the current db was utf-16be

The bottom one of those is what I expected when I first ran the select yet I 
think I'm right in saying my system (windows 10) is little endian. This is a 
further confusion thrown into the mix for me. As is usually the case. I've 
little doubt sqlite is coming up with the right answer in each case.

Tom
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] hex and char functions

2017-08-07 Thread Darko Volaric
Actually a maximum of 4 bytes are required to encode a single valid code-point 
in UTF-8.


> On Aug 8, 2017, at 2:44 AM, Jens Alfke  wrote:
> 
> 
>> On Aug 7, 2017, at 8:29 AM, x  wrote:
>> 
>> I thought I had learned enough about this string lunacy to get by but 
>> finding out that the UTF8 code for the UTF16 code \u0085 is in fact \uc285 
>> has tipped me over the edge. I assumed they both used the same codes but 
>> UTF16 allowed some characters UTF8 didn’t have.
> 
> UTF-8 is backwards-compatible with ASCII. All 7-bit bytes (00-7f) represent 
> the same characters as their ASCII equivalents. Beyond that, UTF-8 uses a 
> sequence of two to five bytes in the range 80-ff to encode a single Unicode 
> character/code-point. (You can sort of think of this as every byte holding 7 
> bits of the actual character number, with its MSB set to 1. It’s not exactly 
> like that, but close.)
> 
> IMHO UTF-8 is the best general purpose text encoding. Code that works with 
> ASCII (real 7-bit ASCII, not the nonstandard “extended” stuff) will generally 
> work with UTF-8; the main thing to watch out for tends to be breaking or 
> trimming strings, because you don’t want to cut part of a multibyte sequence. 
> UTF-8 is also quite compact for Roman languages (although not non-Roman ones.)
> 
> 16-bit encodings used to seem like a good idea back when Unicode has fewer 
> than 65,536 characters, so you could assume that one unichar = one character. 
> Those days are long gone. Now dealing with UTF-16 has all the same problems 
> of dealing with UTF-8 (i.e. multi-word sequences) without the benefits of 
> compactness or ASCII compatibility.
> 
> 32-bit encodings are just silly, unless for some reason you really really 
> have to optimize for speed over size (and even then the added size may well 
> blow out your CPU caches and negate the speed boost.)
> 
> —Jens
> 
> PS: Apparently C++11 allows Unicode string literals by putting a letter U in 
> front of the initial quote. The result will be a string of wchar_t.
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] hex and char functions

2017-08-07 Thread Jens Alfke

> On Aug 7, 2017, at 8:29 AM, x  wrote:
> 
> I thought I had learned enough about this string lunacy to get by but finding 
> out that the UTF8 code for the UTF16 code \u0085 is in fact \uc285 has tipped 
> me over the edge. I assumed they both used the same codes but UTF16 allowed 
> some characters UTF8 didn’t have.

UTF-8 is backwards-compatible with ASCII. All 7-bit bytes (00-7f) represent the 
same characters as their ASCII equivalents. Beyond that, UTF-8 uses a sequence 
of two to five bytes in the range 80-ff to encode a single Unicode 
character/code-point. (You can sort of think of this as every byte holding 7 
bits of the actual character number, with its MSB set to 1. It’s not exactly 
like that, but close.)

IMHO UTF-8 is the best general purpose text encoding. Code that works with 
ASCII (real 7-bit ASCII, not the nonstandard “extended” stuff) will generally 
work with UTF-8; the main thing to watch out for tends to be breaking or 
trimming strings, because you don’t want to cut part of a multibyte sequence. 
UTF-8 is also quite compact for Roman languages (although not non-Roman ones.)

16-bit encodings used to seem like a good idea back when Unicode has fewer than 
65,536 characters, so you could assume that one unichar = one character. Those 
days are long gone. Now dealing with UTF-16 has all the same problems of 
dealing with UTF-8 (i.e. multi-word sequences) without the benefits of 
compactness or ASCII compatibility.

32-bit encodings are just silly, unless for some reason you really really have 
to optimize for speed over size (and even then the added size may well blow out 
your CPU caches and negate the speed boost.)

—Jens

PS: Apparently C++11 allows Unicode string literals by putting a letter U in 
front of the initial quote. The result will be a string of wchar_t.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] hex and char functions

2017-08-07 Thread Simon Slavin


On 7 Aug 2017, at 8:45pm, R Smith  wrote:

> If it ends up as something else in that database, follow the call stack 
> chain, see how those look in the debugger initially (which may be different 
> to what you expect, depending on the debugger display encoding)

Note this point very hard.  It is always possible that your program is working 
perfectly, and your debugger is faulty.  I’ve wasted a lot of time over the 
years through believing the debugger after multiple versions of my own code all 
produced the same, apparently wrong, result.

Simon.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] hex and char functions

2017-08-07 Thread R Smith


On 2017/08/07 9:01 PM, x wrote:

Thanks Ryan for going to the trouble of typing that out. Hope you’re not a one 
fingered typewriter like myself. The borland related stuff is welcome but I 
still can’t say I’m any less confused by it all.

I’m having a bad day today. I’ve spent most of it trying to fathom this stuff out. 
Igor & Gunter were correct earlier. The ‘\u0085’ is changed to ‘?’ before the 
string is anywhere near sqlite. Why I don’t know. It doesn’t seem unreasonable to 
want to put a Unicode code into a UnicodeString. As regards the hex(char(133)) 
returning C285, following the posts by Nico and Richard I’m wondering if it’s 
because I’m using SQLite Expert pro on a database that’s encoded in utf-8. I tried 
to change the coding to utf-16 to see if I would get a different result but, while 
the software seemed to accept the request, the request was never completed and no 
feedback was given aside from both the ‘Apply’ and ‘Cancel’ buttons both being 
greyed out for hours (it’s only a small database). I’ve had enough for today though.


Don't worry, I can type faster than I can think (which isn't very fast, 
and quite evident from the typos!).


I feel your pain though, and hope we can help get you un-discouraged 
soon. (Imagine that is a word.)


One thing to Note: You can only change the Database encoding at the 
START of making the Database. Once you put stuff in the database, the 
encoding is set for life. That is an SQLite quirk - but that said, it 
wouldn't solve your problem, since the encoding problem (by your 
account) happens before the database is reached.  It's real easy to test.


Use the lowest ranked interface to simply add a string into the DB. By 
lowest ranked, I mean before it goes through any of your own code, or 
other wrapper code.


Use direct SQL, such as DB.Execute('INSERT INTO t(v1) 
VALUES(''Geronimo'');');
(This assumes a table called t with a column v1 of type TEXT. Those are 
2 single-quotes btw, not double-quotes).


Use some SQLite tool or the command line to check what ended up in that DB.

If that works, add some Unicode value, like this:
DB.Execute('INSERT INTO t VALUES(''©Geronimo®'');');

If it ends up as something else in that database, follow the call stack 
chain, see how those look in the debugger initially (which may be 
different to what you expect, depending on the debugger display 
encoding), what you are looking for is Change - the moment it changes - 
you will see which thing is responsible for misinterpreting it, if any.


Tell us more when you did the tests. Don't give up!

Also - make sure that whatever you read it with reads (and can 
faithfully display) any unicode characters. Often the fault is not 
yours. The SQLite CLI for instance can read it correctly, but sometimes 
your computer stdout uses a codepage that doesn't display it correctly. 
Mail me off-list if you would like an SQLite DB and table with lots of 
Unicode test data in it and a correct interpretation of it to test with.


Best of luck!
Ryan


___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] hex and char functions

2017-08-07 Thread x
Thanks Ryan for going to the trouble of typing that out. Hope you’re not a one 
fingered typewriter like myself. The borland related stuff is welcome but I 
still can’t say I’m any less confused by it all.

I’m having a bad day today. I’ve spent most of it trying to fathom this stuff 
out. Igor & Gunter were correct earlier. The ‘\u0085’ is changed to ‘?’ before 
the string is anywhere near sqlite. Why I don’t know. It doesn’t seem 
unreasonable to want to put a Unicode code into a UnicodeString. As regards the 
hex(char(133)) returning C285, following the posts by Nico and Richard I’m 
wondering if it’s because I’m using SQLite Expert pro on a database that’s 
encoded in utf-8. I tried to change the coding to utf-16 to see if I would get 
a different result but, while the software seemed to accept the request, the 
request was never completed and no feedback was given aside from both the 
‘Apply’ and ‘Cancel’ buttons both being greyed out for hours (it’s only a small 
database). I’ve had enough for today though.

Thanks to all who have contributed.

From: R Smith<mailto:rsm...@rsweb.co.za>
Sent: 07 August 2017 19:33
To: 
sqlite-users@mailinglists.sqlite.org<mailto:sqlite-users@mailinglists.sqlite.org>
Subject: Re: [sqlite] hex and char functions


On 2017/08/07 5:29 PM, x wrote:
> Apologies, I should have said I was using c++ builder Berlin on windows 10 
> and that UnicodeString was UTF16.
>
> I thought I had learned enough about this string lunacy to get by but finding 
> out that the UTF8 code for the UTF16 code \u0085 is in fact \uc285 has tipped 
> me over the edge. I assumed they both used the same codes but UTF16 allowed 
> some characters UTF8 didn’t have.
>
> I’m now wondering if I should go to the trouble of changing my sqlite wrapper 
> over to communicate with the sqlite utf8 functions rather than the utf16 
> ones. Trouble is many of c++ builder’s built in types such as TStringList etc 
> are utf16.

No you shouldn't. UTF16 doesn't have "more" characters than UTF8, and
TStringlist is not UTF16 - let me see if I can clear up some bit of the
confusion. This next bit is very short and really requires much more
study, but I hope I say enough (and correct enough) for you to get the
picture a little better.

First some (very short) history on the "string lunacy" you refer to.
Note that when storing text in any system there are two confusing
concepts that are hard to get your head around: Firstly there is the
actual Characters, or Character-sets - these are the things referred to
as ANSI, Latin, CN-Big5 etc., then there are Character Code Index
mappings, these are things such as Code-Pages and the like that
basically says stuff like the Uppercase Latin character A has a code of
65 (Hex 41) in the ASCII code-page etc. These may all differ for
different code-pages, though there were good overlap.  Eventually
Unicode intended to save the World by indeed unifying all the
code-paging (hence "Unicode") and they did a marvelous job of it - but
there were very many real-World characters to cater for, so they have
code-point indices much larger than any single or even double-byte
character arrray or string can ever contain.

Here we Enter the character-encodings. These are things like UTF8,
UTF16LE and they specify an encoding, a way to make a sequence of bytes
refer to a specific codepoint in a code-space (in typically the Unicode
code-point space) that can be much larger than 8 or 16 bits may
accommodate.  UTF-8 for instance specifies that any byte value less than
128 refers to the first 127 code points, as soon as that final bit (MSB)
goes high, it means another byte is needed (or byteS, depending on how
many high bits follow the initial) to complete the encoding, and further
bytes must specify a 1.0.x.x.x.x.x.x format in turn to ensure
consistency and safely lets any reader know as soon as they encounter a
high MSB that it is definitely part of a multi-byte UTF8 sequence -
which is a brilliant encoding.  Although slightly technical, it is very
lean, we only escalate bytes when needed, and only as much as is needed.
The UTF16 encoding is a bit less technical, we can represent far more
code points with a consistent 2 byte setup, but even that is much
smaller than the full Unicode world, so UTF16 has specific character
ranges (0xd800 to 0xdbff) that requires follow-on double-bytes, also
known as "Surrogate pairs" (this is the thing that you said pushed you
over the edge, finding that some Unicode characters are represented by 2
double-byte characters, so 4-byte total width). There is much more to be
said about all this, but I don't want to take everyone's time and the
above is enough to understand the next bit regarding C++ history:

One of the great features of the bcc32 compilers of yonder was that they
(Borland) embraced strong typing, probably to this day the
strongest-typed language around is Pascal, later Turb

Re: [sqlite] hex and char functions

2017-08-07 Thread R Smith


On 2017/08/07 5:29 PM, x wrote:

Apologies, I should have said I was using c++ builder Berlin on windows 10 and 
that UnicodeString was UTF16.

I thought I had learned enough about this string lunacy to get by but finding 
out that the UTF8 code for the UTF16 code \u0085 is in fact \uc285 has tipped 
me over the edge. I assumed they both used the same codes but UTF16 allowed 
some characters UTF8 didn’t have.

I’m now wondering if I should go to the trouble of changing my sqlite wrapper 
over to communicate with the sqlite utf8 functions rather than the utf16 ones. 
Trouble is many of c++ builder’s built in types such as TStringList etc are 
utf16.


No you shouldn't. UTF16 doesn't have "more" characters than UTF8, and 
TStringlist is not UTF16 - let me see if I can clear up some bit of the 
confusion. This next bit is very short and really requires much more 
study, but I hope I say enough (and correct enough) for you to get the 
picture a little better.


First some (very short) history on the "string lunacy" you refer to. 
Note that when storing text in any system there are two confusing 
concepts that are hard to get your head around: Firstly there is the 
actual Characters, or Character-sets - these are the things referred to 
as ANSI, Latin, CN-Big5 etc., then there are Character Code Index 
mappings, these are things such as Code-Pages and the like that 
basically says stuff like the Uppercase Latin character A has a code of 
65 (Hex 41) in the ASCII code-page etc. These may all differ for 
different code-pages, though there were good overlap.  Eventually 
Unicode intended to save the World by indeed unifying all the 
code-paging (hence "Unicode") and they did a marvelous job of it - but 
there were very many real-World characters to cater for, so they have 
code-point indices much larger than any single or even double-byte 
character arrray or string can ever contain.


Here we Enter the character-encodings. These are things like UTF8, 
UTF16LE and they specify an encoding, a way to make a sequence of bytes 
refer to a specific codepoint in a code-space (in typically the Unicode 
code-point space) that can be much larger than 8 or 16 bits may 
accommodate.  UTF-8 for instance specifies that any byte value less than 
128 refers to the first 127 code points, as soon as that final bit (MSB) 
goes high, it means another byte is needed (or byteS, depending on how 
many high bits follow the initial) to complete the encoding, and further 
bytes must specify a 1.0.x.x.x.x.x.x format in turn to ensure 
consistency and safely lets any reader know as soon as they encounter a 
high MSB that it is definitely part of a multi-byte UTF8 sequence - 
which is a brilliant encoding.  Although slightly technical, it is very 
lean, we only escalate bytes when needed, and only as much as is needed. 
The UTF16 encoding is a bit less technical, we can represent far more 
code points with a consistent 2 byte setup, but even that is much 
smaller than the full Unicode world, so UTF16 has specific character 
ranges (0xd800 to 0xdbff) that requires follow-on double-bytes, also 
known as "Surrogate pairs" (this is the thing that you said pushed you 
over the edge, finding that some Unicode characters are represented by 2 
double-byte characters, so 4-byte total width). There is much more to be 
said about all this, but I don't want to take everyone's time and the 
above is enough to understand the next bit regarding C++ history:


One of the great features of the bcc32 compilers of yonder was that they 
(Borland) embraced strong typing, probably to this day the 
strongest-typed language around is Pascal, later Turbo Pascal, Delphi 
etc. I mention this because you use (apparently) the Borland origin 
version of C++ and they always had a stronger typed vision than the C++ 
standard. There is a precise type for everything - but that also came 
with problems when adapting as times changed. C++ specifically avoided a 
standard string type for quite a while, which was one of the design 
mistakes often noted by people reflecting on the early C++ development. 
Anyway...


The first iterations of C++ started out long ago using the convention 
borrowed from C for pointer strings with null terminators. Indeed, a 
constant string such as 'Hello World!' today still  is a pointer and 
null terminator setup and the "std::string" type has ways of converting 
itself to that still.  After some time came a standardization in C++ 
with the "std::string" type. This standard string usually represents an 
array of 8-bit characters in ASCII encoding.


ASCII of course did not last long as the defacto character set 
definition, and eventually everything went Unicode (as mentioned above). 
The first adaptations seen was "Wide" types (std::wstring) which allowed 
a nice range of 65536 character codes per string index "character", same 
as what Windows went for, but in no way guaranteed compatibility with 
specific encodings. Later still came things 

Re: [sqlite] hex and char functions

2017-08-07 Thread Nico Williams
On Mon, Aug 07, 2017 at 11:45:50AM -0400, Richard Hipp wrote:
> On 8/7/17, Nico Williams  wrote:
> > Internally SQLite3 uses UTF-8.  The SQLite3 API lets you deal with
> > UTF-16, but this just transcodes to/from UTF-8 internally.
> 
> That is not quite correct.
> 
> SQL statements are always converted into UTF8 for parsing and code
> generation.  But data can be stored in the database file and processed
> as UTF8, UTF16be, or UTF16le.  All text content for a single database
> must use the same encoding.  When creating the database file, set the
> encoding before adding any content by running one of:
> 
> PRAGMA encoding('utf-8');
> PRAGMA encoding('utf-16be');
> PRAGMA encoding('utf-16le');
> 
> See https://www.sqlite.org/pragma.html#pragma_encoding for additional
> information.

Ah, OK, thanks for the correction!
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] hex and char functions

2017-08-07 Thread Richard Hipp
On 8/7/17, Nico Williams  wrote:
>
> Internally SQLite3 uses UTF-8.  The SQLite3 API lets you deal with
> UTF-16, but this just transcodes to/from UTF-8 internally.
>

That is not quite correct.

SQL statements are always converted into UTF8 for parsing and code
generation.  But data can be stored in the database file and processed
as UTF8, UTF16be, or UTF16le.  All text content for a single database
must use the same encoding.  When creating the database file, set the
encoding before adding any content by running one of:

PRAGMA encoding('utf-8');
PRAGMA encoding('utf-16be');
PRAGMA encoding('utf-16le');

See https://www.sqlite.org/pragma.html#pragma_encoding for additional
information.

-- 
D. Richard Hipp
d...@sqlite.org
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] hex and char functions

2017-08-07 Thread Nico Williams
On Mon, Aug 07, 2017 at 03:29:41PM +, x wrote:
> Apologies, I should have said I was using c++ builder Berlin on
> windows 10 and that UnicodeString was UTF16.
> 
> I thought I had learned enough about this string lunacy to get by but
> finding out that the UTF8 code for the UTF16 code \u0085 is in fact
> \uc285 has tipped me over the edge. I assumed they both used the same
> codes but UTF16 allowed some characters UTF8 didn’t have.

Internally SQLite3 uses UTF-8.  The SQLite3 API lets you deal with
UTF-16, but this just transcodes to/from UTF-8 internally.

So when you SELECT hex(); the hex() function
sees UTF-8, not UTF-16.

Nico
-- 
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] hex and char functions

2017-08-07 Thread x
Apologies, I should have said I was using c++ builder Berlin on windows 10 and 
that UnicodeString was UTF16.

I thought I had learned enough about this string lunacy to get by but finding 
out that the UTF8 code for the UTF16 code \u0085 is in fact \uc285 has tipped 
me over the edge. I assumed they both used the same codes but UTF16 allowed 
some characters UTF8 didn’t have.

I’m now wondering if I should go to the trouble of changing my sqlite wrapper 
over to communicate with the sqlite utf8 functions rather than the utf16 ones. 
Trouble is many of c++ builder’s built in types such as TStringList etc are 
utf16.

From: Igor Tandetnik<mailto:i...@tandetnik.org>
Sent: 07 August 2017 15:49
To: 
sqlite-users@mailinglists.sqlite.org<mailto:sqlite-users@mailinglists.sqlite.org>
Subject: Re: [sqlite] hex and char functions

On 8/7/2017 9:38 AM, x wrote:
> Related
>
> Select hex(char(65,133,66)); returns ‘41C28542’ whereas I expected ‘418542’.
>
> What is the ‘C2’ about?

Two-byte sequence C2 85 is the UTF-8 encoding of the Unicode codepoint U+0085.
--
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] hex and char functions

2017-08-07 Thread Igor Tandetnik

On 8/7/2017 9:38 AM, x wrote:

Related

Select hex(char(65,133,66)); returns ‘41C28542’ whereas I expected ‘418542’.

What is the ‘C2’ about?


Two-byte sequence C2 85 is the UTF-8 encoding of the Unicode codepoint U+0085.
--
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] hex and char functions

2017-08-07 Thread Hick Gunter
The sqlite char() function returns unicode. Apparently, the encoding for code 
point 133 is two characters, namely c2 85. You seem to be expecting char() to 
return ISO characters, which it does not do.

Calling sqlite3_value_text16 instructs SQLite to convert the contents of the 
field into utf16 with native byte order from (assumed) utf8, which may have 
funny results if the source is not indeed UTF8 but rather ISO or some strange 
(windows) codepage.

-Ursprüngliche Nachricht-
Von: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] Im 
Auftrag von x
Gesendet: Montag, 07. August 2017 15:39
An: sqlite-users@mailinglists.sqlite.org
Betreff: [sqlite] hex and char functions

In c++ I tried to call a sqlite udf using the following sql

UnicodeString SQL=“select udf(‘5\u00856’);”

Which was prepared using sqlite3_prepare16_v2.

(I was experimenting with sending a udf a group of numbers without getting 
involved with blobs).

Debugging the udf I recovered the string (call it Str) using 
sqlite3_value_text16.

Str[1]==’5’ & Str[3]==’6’ but Str[2]==’?’ (or Unicode 3F)

I also tried retrieving the string using sqlite3_value_text to return a const 
unsigned char * but that also seemed to have a problem with values beyond 127.

What am I missing?

Related

Select hex(char(65,133,66)); returns ‘41C28542’ whereas I expected ‘418542’.

What is the ‘C2’ about?
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


___
 Gunter Hick
Software Engineer
Scientific Games International GmbH
FN 157284 a, HG Wien
Klitschgasse 2-4, A-1130 Vienna, Austria
Tel: +43 1 80100 0
E-Mail: h...@scigames.at

This communication (including any attachments) is intended for the use of the 
intended recipient(s) only and may contain information that is confidential, 
privileged or legally protected. Any unauthorized use or dissemination of this 
communication is strictly prohibited. If you have received this communication 
in error, please immediately notify the sender by return e-mail message and 
delete all copies of the original communication. Thank you for your cooperation.


___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] hex and char functions

2017-08-07 Thread Igor Tandetnik

On 8/7/2017 9:38 AM, x wrote:

In c++ I tried to call a sqlite udf using the following sql

UnicodeString SQL=“select udf(‘5\u00856’);”


You are using a narrow string literal to try and hold a Unicode character. You 
are at the whim of your compiler as to how it represents the latter in the 
former. My guess is, it tries to convert according to the system default code 
page (I assume Windows here), and since the character is not in fact 
representable therein, it's converted to '?'.

That is, the character was lost before the program even ran, let alone before 
SQLite got involved.

Also, what's UnicodeString?
--
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] hex and char functions

2017-08-07 Thread x
In c++ I tried to call a sqlite udf using the following sql

UnicodeString SQL=“select udf(‘5\u00856’);”

Which was prepared using sqlite3_prepare16_v2.

(I was experimenting with sending a udf a group of numbers without getting 
involved with blobs).

Debugging the udf I recovered the string (call it Str) using 
sqlite3_value_text16.

Str[1]==’5’ & Str[3]==’6’ but Str[2]==’?’ (or Unicode 3F)

I also tried retrieving the string using sqlite3_value_text to return a const 
unsigned char * but that also seemed to have a problem with values beyond 127.

What am I missing?

Related

Select hex(char(65,133,66)); returns ‘41C28542’ whereas I expected ‘418542’.

What is the ‘C2’ about?
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users