Hello, I have a question regarding text encoding of filenames on Unix platforms. I’ve read the two related mailing list threads I could find in the archive, <https://www.mail-archive.com/sqlite-users@mailinglists.sqlite.org/msg35875.html> and <https://www.mail-archive.com/sqlite-users@mailinglists.sqlite.org/msg94184.html>. Both of those explain that, on Unix platforms, the filename string is passed unmodified by SQLite directly to the open() syscall.
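To make the distinction concrete: since open() receives whatever bytes it is handed, the same visible name in two encodings is two different byte strings, and so potentially two different paths. The following is only a minimal sketch, assuming POSIX iconv() is available. `transcode_utf8` is a hypothetical helper name, not part of SQLite or any library.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of SQLite): transcode a NUL-terminated
 * UTF-8 string into the given codeset. Returns a malloc()ed string that
 * the caller must free(), or NULL on failure. Assumes the target codeset
 * never produces embedded NUL bytes (true for e.g. ISO-8859-1). */
char *transcode_utf8(const char *utf8, const char *codeset)
{
    iconv_t cd = iconv_open(codeset, "UTF-8");
    if (cd == (iconv_t)-1)
        return NULL;

    size_t in_left = strlen(utf8);
    size_t cap = in_left * 4 + 1;   /* assumption: generous for common codesets */
    char *out = malloc(cap);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in_ptr = (char *)utf8;    /* iconv() takes char **, input is not modified */
    char *out_ptr = out;
    size_t out_left = cap - 1;      /* reserve one byte for the terminator */

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &out_ptr, &out_left);  /* flush any shift state */
    iconv_close(cd);

    if (rc == (size_t)-1) {         /* invalid input or unrepresentable character */
        free(out);
        return NULL;
    }
    *out_ptr = '\0';
    return out;
}
```

In a real client the target codeset would presumably come from nl_langinfo(CODESET) after setlocale(LC_ALL, ""), and the resulting string would be what gets handed to sqlite3_open(). Whether that is the right thing to do is exactly the question below.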
From what I understand from reading a lot of information on the Internet, this may or may not be correct, and nobody can agree. Surveying other software: GLib expects filenames to always be UTF-8 but allows that to be changed via an environment variable; Qt expects filenames to always be in the locale encoding; and Coreutils (“ls”) also expects filenames to be in the locale encoding (at least, it sometimes decides to show filenames in '$'\XYZ escaped form, and whether it does so depends on your $LANG and related variables, in a way which is consistent with it considering filenames to be locale-encoded). It seems, though I could be wrong, that more people fall on the “locale encoded” side than on the “always UTF-8” side (though thank goodness it’s becoming less and less relevant as more and more systems are running with UTF-8 locales anyway).

My question is this: In those two mailing list posts, it was explained that SQLite’s current behaviour is to pass the string unmodified to the open() syscall. Is this just an explanation of current behaviour, or is it an official policy? That is to say, which of the following statements is correct?

(1) SQLite developers believe that Unix filenames should be UTF-8 at the syscall layer regardless of your locale, and therefore if your particular box has a non-UTF-8 file on its disk, you shouldn’t be able to access it.

(2) SQLite developers believe that Unix filenames should be locale-encoded at the syscall layer, and therefore the missing transcode is a bug.

(3) SQLite developers refuse to get into this argument and think it’s up to the developer of the client application, who should pass a string in whatever encoding they think right to sqlite3_open(), which in turn passes it on to open().

I can’t really tell which of these is the official policy. If it’s #1, the documentation and code are both fine, though it makes some files inaccessible for some users. If it’s #2, the documentation and the code are both wrong. If it’s #3, I think it would make sense if the documentation were updated to explain this.

The reason I ask is that, in addition to the current behaviour (easy to find out just by testing or reading the source code), I want some idea of whether this might change in future. That is, if I just write “known bug” and insert a workaround in my client code to pass a locale-encoded string to sqlite3_open() instead of a UTF-8 string, is that workaround going to break sometime in future when the bug (if you consider it a bug) gets fixed?

Thanks for the clarification, and please note that I am not subscribed to this list, so I would appreciate being included explicitly in replies.

--
Christopher Head