Re: [sqlite] How to search for fields with accents in UTF-8 data?

Scott Robison Tue, 20 Jun 2017 11:05:40 -0700

On Tue, Jun 20, 2017 at 8:17 AM, Olivier Mascia wrote:
>> On 20 June 2017 at 15:24, R Smith wrote:
>>
>> As an aside - I never understood the reasons for that. I get that Windows 
>> has a less "techy" clientèle than Linux for instance, and that the backwards 
>> compatibility is paramount, and that no console command ever need fall 
>> outside the 7-bit ASCII range of characters... but geez, how much effort can 
>> it be to make it Unicode-friendly? It's not like the Windows API lacks any 
>> Unicode functionality - even Notepad can handle it masterfully.
>
> I don't want to look like I'm trolling this subject, but this is only a 
> matter of which I/O functions programs use to interact with the display 
> and keyboard when run in a console. Windows needs those programs to use 
> ReadConsoleW/WriteConsoleW to do the proper thing.  Programs that use the C 
> library to read or write byte streams can't do anything equivalent, no matter 
> what 'codepage' is set or what DBCS the program attempts to convert 
> to or from.
>
> I learned this principle here last year and have had excellent success with 
> console I/O in my programs ever since.


About a year ago I had to write an emergency fixup tool for my
employer because of a backward-incompatible change at Microsoft that was
almost certainly due to a breakdown in revision control. The tool
needed to be localized, but it was simple enough that a console
mode executable sufficed. I had to jump through hoops to make it
work, but (by way of confirmation) the problems were in the CRT, not
the Win32 API. It was able to write and read Unicode ...

> To be complete, regarding proper display of the output, there is a secondary 
> consideration. The fonts available in Windows are far from covering a large 
> subset of the glyphs.  For eastern languages on a western Windows edition, 
> you generally need to change your console settings to make it use another 
> font than the default one, just so that it can draw the needed glyphs.  But 
> the basic thing to do is get the program running in the console (here we are 
> talking shell.c - sqlite3.exe) to output Windows wide-chars using the 
> function WriteConsoleW(). And use ReadConsoleW() to read wide-char chunks 
> from the console input, before converting internally to UTF-8 or whatever 
> is wanted.

... assuming of course that the locale was using a font that supported
the character set for that area. This was true for our purposes by
default, as we weren't expecting English speaking customers to need to
see Asian languages.

> Sqlite3 shell.c when patched that way is as pleasant to use on Windows 
> console as it can be on a modern Linux or macOS.
>
> Input files fed to sqlite3.exe need to be in UTF-8, and the output sent 
> by sqlite3.exe will be too: that part is perfectly OK today in sqlite3.exe. Only 
> the keyboard reading and console output writing are lacking a little.

Agreed.
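
To make the conversion step concrete, here is a minimal sketch in Python
(not taken from shell.c) of the round trip described above: UTF-8 bytes
are turned into the 16-bit code units a wide-char console call expects,
and wide-char input is turned back into UTF-8. The helper names are
hypothetical, and the actual console calls are left out so the sketch
runs on any platform.

```python
# Hypothetical helpers, named for this sketch only; not part of sqlite3 or shell.c.

def utf8_to_utf16_units(data: bytes) -> list:
    """Decode UTF-8 bytes and return the UTF-16 code units a wide-char API expects."""
    raw = data.decode("utf-8").encode("utf-16-le")
    # Split the little-endian byte string into 16-bit code units.
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def utf16_units_to_utf8(units) -> bytes:
    """Convert UTF-16 code units (e.g. read from a console) back into UTF-8 bytes."""
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    return raw.decode("utf-16-le").encode("utf-8")


if __name__ == "__main__":
    line = "café".encode("utf-8")          # 5 bytes: the accented letter takes 2
    units = utf8_to_utf16_units(line)      # 4 code units, one per character
    print([hex(u) for u in units])
    print(utf16_units_to_utf8(units) == line)
```

The point is only that the byte-stream side and the wide-char side are
different representations of the same text, so the conversion has to
happen somewhere between the program's UTF-8 data and the console.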

>> but geez, how much effort can it be to make it Unicode-friendly?
>
> To comment further at a more general level than sqlite3.exe, the issue is 
> deeper in Windows than in its console. Once upon a time (!), they made the 
> choice of a 16-bit-per-character encoding as the *right* way (their right 
> way!) to do Unicode. It took time for this to evolve, recognizing the need 
> for a multi-word 16-bit encoding (UTF-16), so they could have chosen UTF-8 
> from day one, but that was not what history recorded. Later UTF-8 got *some* 
> support in the OS (through conversion functions). But UTF-8 was never raised 
> to full citizenship.  There is even a CHCP 65001 to set the 'codepage' to 
> UTF-8. It works partly in some circumstances, but is far from being 'right'. 
> No matter what you do, there is no way for any file I/O primitive of 
> the OS to take a UTF-8 string as a filename. And this extends to the 
> C library on the Windows platform. The only Unicode support is to pass a UTF-16 
> filename through functions ending with a W in the name. The 'ANSI' 
> functions, ending with an A in the name, are merely wrappers converting to 
> the wide-char versions.  There have been numerous requests to Microsoft to 
> let people and developers set the ANSI codepage to UTF-8 so that the file I/O 
> functions taking a narrow-char filename string can interpret it as UTF-8. 
> Some are still waiting for that day to come, others use the W side of things, 
> complicating portability of their codebase. :)

Windows NT was released in 1993. It had been in development for years.
Its designers decided Unicode for I18N/L10N/W6R (WhateveR) purposes was
better than a ton of different code pages. At the point Microsoft committed
to Unicode, it was a two-byte / sixteen-bit encoding. There was no
UTF-8. There was no UTF-16. Other than endian issues, there was
nothing to worry about. Win32 was an "all new" API.

POSIX people didn't want to re-write the entire API to support 16 bit
characters, so they came up with the FSS-UTF (File System Safe UCS
Transformation Format) alternative that eventually led to what we know
as UTF-8 today. Had Microsoft made the decision to implement a
variable width encoding of Unicode on their own, I dare say they'd be
excoriated for embrace/extend/extinguish practices! UTF-8 wasn't a
thing until late 1992 and wasn't presented until 1993, long after
Microsoft had committed to UCS. Unicode and ISO-10646 had not yet
unified when Microsoft made their call, and Unicode 1 was only 16 bits
(but ISO-10646 was 31 bits). By the time of Unicode 2, UTF-16 was
developed to give early Unicode adopters a means of accessing code
points beyond the basic multilingual plane & the code space was
limited to 17 planes. Not until 2003 was UTF-8 finally limited to the
current 4-byte form from the original six-byte form.
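
For anyone who wants to see the arithmetic behind those limits, here is
a small Python sketch (function names are mine, purely illustrative). It
shows how a code point above the basic multilingual plane is split into
a UTF-16 surrogate pair, and how many bytes the current form of UTF-8
needs for a given code point.

```python
# Illustrative helpers; the names are hypothetical, the arithmetic is standard.

MAX_CODE_POINT = 0x10FFFF  # last code point of the 17th plane


def utf16_surrogates(cp: int) -> tuple:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= MAX_CODE_POINT:
        raise ValueError("code point is not in a supplementary plane")
    v = cp - 0x10000                      # 20 bits remain after the offset
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def utf8_length(cp: int) -> int:
    """Number of bytes UTF-8 uses for a code point (1 to 4 in the current form)."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    if cp <= MAX_CODE_POINT:
        return 4
    raise ValueError("beyond the 17 planes reachable through UTF-16")


if __name__ == "__main__":
    high, low = utf16_surrogates(0x1F600)
    print(hex(high), hex(low))
    print(utf8_length(0xE9), utf8_length(0x1F600))
```

A surrogate pair carries 20 bits, which gives 16 supplementary planes on
top of the basic one, hence 17 planes, and that same ceiling is why four
bytes are enough for UTF-8 today.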

Microsoft wasn't the only organization to commit to UCS-2.
https://en.wikipedia.org/wiki/UTF-16#Usage provides a list of others.

Regardless, I prefer UTF-8 to UCS-2 / UTF-16. Microsoft has certainly
had time to make their interfaces more UTF-8 friendly. I just don't
think they get enough credit for committing to Unicode in an era when
few were.
_______________________________________________
sqlite-users mailing list
[email protected]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
