Re: Fwd: LMDB and text encoding
Timur Kristóf wrote:
> Hi Everyone,
>
> I've just come across this old thread and am wondering, is this still an issue?

No, it was resolved long ago.

> Does LMDB have a way to use non-ASCII path names with mdb_env_open in a
> cross-platform way? If not, would you guys accept patches to LMDB in this regard?

There's no issue on POSIX filesystems, and on Windows we already convert pathnames from UTF-8 to UTF-16.

--
  -- Howard Chu
  CTO, Symas Corp. http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP http://www.openldap.org/project/
Re: Fwd: LMDB and text encoding
Hi Everyone,

I've just come across this old thread and am wondering, is this still an issue?

Does LMDB have a way to use non-ASCII path names with mdb_env_open in a cross-platform way? If not, would you guys accept patches to LMDB in this regard?

Thanks,
Timur
Re: Fwd: LMDB and text encoding
* Timur Kristóf:

>> > A path is always a Unicode string, while a DB name can be an arbitrary
>> > binary blob.
>>
>> On many POSIX platforms, a path is a blob which does not contain
>> '\000'. These systems do not enforce Unicode encoding at all.
>
> My mistake. I was unaware.
> On those platforms, how do you type a path name into a terminal?

There are some files which are not directly nameable. Many programs support special sequences such as “Ctrl+V 3 7 7” to enter arbitrary bytes, but that's not universal. Depending on the actual implementation of the terminal, cut-and-paste of funny file names can work, too.

Older programs have trouble accessing such files even if the user chooses them in a file selection dialog, but current versions are supposed to have been fixed (including OpenJDK, which took a ridiculously long time). Beyond that, it's not much different from dealing with file names in an unfamiliar script.
Re: Fwd: LMDB and text encoding
>>> > A path is always a Unicode string, while a DB name can be an arbitrary
>>> > binary blob.
>>>
>>> On many POSIX platforms, a path is a blob which does not contain
>>> '\000'. These systems do not enforce Unicode encoding at all.
>>
>> My mistake. I was unaware.
>> On those platforms, how do you type a path name into a terminal?
>
> There are some files which are not directly nameable. Many programs
> support special sequences such as “Ctrl+V 3 7 7” to enter arbitrary
> bytes, but that's not universal. Depending on the actual
> implementation of the terminal, cut-and-paste of funny file names can
> work, too.
>
> Older programs have trouble accessing such files even if the user
> chooses them in a file selection dialog, but current versions are
> supposed to have been fixed (including OpenJDK, which took a
> ridiculously long time). Beyond that, it's not much different from
> dealing with file names in an unfamiliar script.

Interesting. So ultimately, there are always going to be things that you cannot type into your terminal directly.
Re: Fwd: LMDB and text encoding
> > A path is always a Unicode string, while a DB name can be an arbitrary
> > binary blob.
>
> On many POSIX platforms, a path is a blob which does not contain
> '\000'. These systems do not enforce Unicode encoding at all.

My mistake. I was unaware.
On those platforms, how do you type a path name into a terminal?
Re: Fwd: LMDB and text encoding
* Timur Kristóf:

> A path is always a Unicode string, while a DB name can be an arbitrary
> binary blob.

On many POSIX platforms, a path is a blob which does not contain '\000'. These systems do not enforce Unicode encoding at all.
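To make the "blob without '\000'" point concrete, here is a minimal sketch in C. The helper name usable_as_posix_path is hypothetical and not part of LMDB or POSIX. It only checks the one property mentioned above (non-empty, no NUL byte) and deliberately ignores length limits such as PATH_MAX and any filesystem-specific rules.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (illustration only): nonzero if the `len` bytes at
 * `bytes` are non-empty and contain no NUL byte, which is the only
 * property of a path discussed above. No text encoding is checked. */
static int usable_as_posix_path(const char *bytes, size_t len)
{
    assert(bytes != NULL);
    return len > 0 && memchr(bytes, '\0', len) == NULL;
}
```

Under this check, a name such as the two bytes 0xFF 0xFE followed by ".mdb" passes even though it is not valid UTF-8, while any byte string with an embedded NUL is rejected.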
Re: Fwd: LMDB and text encoding
On 02. feb. 2015 17:11, Timur Kristóf wrote:
>> I suggest we wait to deal with DB names until we also have a way to
>> deal with filenames. And this time test that it works in practice.
>>
>> And then I also suggest to try to make this mess simple to deal
>> with for programmers and/or users. I guess I should have separated
>> that from the rest more clearly.
>
> I can write a patch which does the UTF-8 to UTF-16 conversion on
> Windows for file paths, but I would hate to restrict db names to UTF-8
> text only (or for that matter, any text only).
> However, not supporting non-UTF-8 db names in mdb_dump and mdb_load
> sounds like a reasonable compromise to me.

I suggest we wait to deal with DB names until we also have a way to deal with filenames.

--
Hallvard
Re: Fwd: LMDB and text encoding
> I suggest we wait to deal with DB names until we also have a way to
> deal with filenames. And this time test that it works in practice.
>
> And then I also suggest to try to make this mess simple to deal
> with for programmers and/or users. I guess I should have separated
> that from the rest more clearly.

I can write a patch which does the UTF-8 to UTF-16 conversion on Windows for file paths, but I would hate to restrict db names to UTF-8 text only (or for that matter, any text only).
However, not supporting non-UTF-8 db names in mdb_dump and mdb_load sounds like a reasonable compromise to me.
Re: Fwd: LMDB and text encoding
On 02. feb. 2015 16:25, Timur Kristóf wrote:
> Okay. What do you suggest?

I suggest we wait to deal with DB names until we also have a way to deal with filenames. And this time test that it works in practice.

And then I also suggest to try to make this mess simple to deal with for programmers and/or users. I guess I should have separated that from the rest more clearly.

--
Hallvard
Re: Fwd: LMDB and text encoding
>> A path is always a Unicode string, while a DB name can be an arbitrary
>> binary blob. So I don't think that we can treat them the same way.
>
> Not the point. A program which uses LMDB can choose to treat its
> own DB names in its own LMDB environments as the same kind of
> strings as filenames (WCHAR, UTF-8 char, or whatever). Unless we
> make that impossible.
>
> As for what LMDB will accept and what it must handle, that's up to
> us. DB names are not binary blobs yet, after all.

Okay. What do you suggest?
Re: Fwd: LMDB and text encoding
On 02. feb. 2015 16:03, Timur Kristóf wrote:
> A path is always a Unicode string, while a DB name can be an arbitrary
> binary blob. So I don't think that we can treat them the same way.

Not the point. A program which uses LMDB can choose to treat its own DB names in its own LMDB environments as the same kind of strings as filenames (WCHAR, UTF-8 char, or whatever). Unless we make that impossible.

As for what LMDB will accept and what it must handle, that's up to us. DB names are not binary blobs yet, after all.

--
Hallvard
Re: Fwd: LMDB and text encoding
>> DB names are purely internal to LMDB, so they bear no relation to OS
>> filenames and none of this discussion matters to them.
>
> They're exposed to the programmer and the program's users. Either may
> want them on command-line arguments, in config files, etc. It will be
> inconvenient if LMDB requires different string handling for non-ASCII
> filenames and non-ASCII DB names in such cases. The programmer may
> choose to use different string handling but let's try to avoid forcing
> him to do so.

A path is always a Unicode string, while a DB name can be an arbitrary binary blob. So I don't think that we can treat them the same way.
Re: Fwd: LMDB and text encoding
On 02. feb. 2015 14:24, Howard Chu wrote:
> Hallvard Breien Furuseth wrote:
>> I suggest we wait to deal with DB names until we also have a way to
>> deal with filenames. And this time test that it works in practice :-)
>>
>> Hopefully users and programmers will only need one method of handling
>> non-ASCII LMDB names on Windows, not two. It'd be nice if
>> 'mdb_stat filename -s dbname' would Just Work, as would reading DB
>> names and filenames from a config file. Yet OS-aware and OS-specific
>> config files can look rather different.
>>
>> Maybe LMDB must handle DB names more flexibly than filenames, or maybe
>> we'll end up recommending that "portable" DB names must be UTF-8. And
>> add a "flag convert UTF8<->WCHAR if this is Windows".
>
> DB names are purely internal to LMDB, so they bear no relation to OS
> filenames and none of this discussion matters to them.

They're exposed to the programmer and the program's users. Either may want them on command-line arguments, in config files, etc. It will be inconvenient if LMDB requires different string handling for non-ASCII filenames and non-ASCII DB names in such cases. The programmer may choose to use different string handling but let's try to avoid forcing him to do so.

--
Hallvard
Re: Fwd: LMDB and text encoding
> DB names are purely internal to LMDB, so they bear no relation to OS
> filenames and none of this discussion matters to them.

If we let the users treat db names as an MDB_val (essentially, an arbitrary byte array), then all bets are off: we can't even make the assumption that a db name is meaningful text in any encoding.

We can make it possible to type such a thing in the console if we represent it as a string of hexadecimal numbers. For example, mdb_dump could do something like to_hex_string in this code snippet:
http://pastebin.com/jqnGSS6C
(note: you need -std=c11 to compile the snippet).
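Since the pastebin snippet is not reproduced in this thread, here is an independent minimal sketch of the same idea in C: turning an arbitrary byte array into a printable hex string. The function name name_to_hex is hypothetical; it is not the to_hex_string from the linked snippet and not part of LMDB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not part of LMDB: encode `len` bytes of `data` as
 * lowercase hexadecimal, two digits per byte. Returns a malloc'd,
 * NUL-terminated string, or NULL if allocation fails. The caller frees
 * the result. */
static char *name_to_hex(const void *data, size_t len)
{
    static const char digits[] = "0123456789abcdef";
    const unsigned char *bytes = data;
    char *out;
    size_t i;

    assert(data != NULL || len == 0);
    out = malloc(len * 2 + 1);
    if (out == NULL)
        return NULL;
    for (i = 0; i < len; i++) {
        out[2 * i]     = digits[bytes[i] >> 4];    /* high nibble */
        out[2 * i + 1] = digits[bytes[i] & 0x0f];  /* low nibble */
    }
    out[len * 2] = '\0';
    return out;
}
```

With this encoding, a three-byte name consisting of 'a', a NUL byte, and 'b' would be printed as 610062, which can be typed in any terminal.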
Re: Fwd: LMDB and text encoding
Hallvard Breien Furuseth wrote:
> I suggest we wait to deal with DB names until we also have a way to
> deal with filenames. And this time test that it works in practice :-)
>
> Hopefully users and programmers will only need one method of handling
> non-ASCII LMDB names on Windows, not two. It'd be nice if
> 'mdb_stat filename -s dbname' would Just Work, as would reading DB
> names and filenames from a config file. Yet OS-aware and OS-specific
> config files can look rather different.
>
> Maybe LMDB must handle DB names more flexibly than filenames, or maybe
> we'll end up recommending that "portable" DB names must be UTF-8. And
> add a "flag convert UTF8<->WCHAR if this is Windows".

DB names are purely internal to LMDB, so they bear no relation to OS filenames and none of this discussion matters to them.

--
  -- Howard Chu
Re: Fwd: LMDB and text encoding
I suggest we wait to deal with DB names until we also have a way to deal with filenames. And this time test that it works in practice :-)

Hopefully users and programmers will only need one method of handling non-ASCII LMDB names on Windows, not two. It'd be nice if 'mdb_stat filename -s dbname' would Just Work, as would reading DB names and filenames from a config file. Yet OS-aware and OS-specific config files can look rather different.

Maybe LMDB must handle DB names more flexibly than filenames, or maybe we'll end up recommending that "portable" DB names must be UTF-8. And add a "flag convert UTF8<->WCHAR if this is Windows".

--
Hallvard
Re: Fwd: LMDB and text encoding
Timur Kristóf wrote:
>> I just had a look at how BDB handled this. As you can see they used a
>> TO_TSTRING macro to convert incoming pathnames from UTF8 to UTF16.
>>
>> https://gitorious.org/berkeleydb/berkeleydb/source/347d239a1e44ed4f773ae9274c2a32cf2b8999c0:src/os_windows/os_open.c
>>
>> https://gitorious.org/berkeleydb/berkeleydb/source/347d239a1e44ed4f773ae9274c2a32cf2b8999c0:src/dbinc/win_db.h#L136
>>
>> (And a FROM_TSTRING for the reverse, as well.)
>
> (Mea culpa, I accidentally hit "reply" instead of "reply all". Sorry.
> Now reposting to the mailing list.)
>
> Since we only need to do this on Windows, we could use
> MultiByteToWideChar with CP_UTF8. (That's what TO_TSTRING does, too.)
> I do not think we would ever need to do any such conversion on UNIX.

Correct, these macros only exist in the Windows-specific source files of BDB. None of this is needed for POSIX.

> https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx
>
> I'm not sure if we can just copy-paste BDB's code. Probably not, that
> would lead to licensing issues, wouldn't it?

I wasn't suggesting a copy/paste, just using it as an example of how the problem could be approached.

--
  -- Howard Chu
Fwd: LMDB and text encoding
> I just had a look at how BDB handled this. As you can see they used a
> TO_TSTRING macro to convert incoming pathnames from UTF8 to UTF16.
>
> https://gitorious.org/berkeleydb/berkeleydb/source/347d239a1e44ed4f773ae9274c2a32cf2b8999c0:src/os_windows/os_open.c
>
> https://gitorious.org/berkeleydb/berkeleydb/source/347d239a1e44ed4f773ae9274c2a32cf2b8999c0:src/dbinc/win_db.h#L136
>
> (And a FROM_TSTRING for the reverse, as well.)

(Mea culpa, I accidentally hit "reply" instead of "reply all". Sorry. Now reposting to the mailing list.)

Since we only need to do this on Windows, we could use MultiByteToWideChar with CP_UTF8. (That's what TO_TSTRING does, too.) I do not think we would ever need to do any such conversion on UNIX.

https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx

I'm not sure if we can just copy-paste BDB's code. Probably not, that would lead to licensing issues, wouldn't it?
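For readers who want to see what the MultiByteToWideChar approach might look like, here is a minimal sketch in C. It is not LMDB's or BDB's code; the names os_char and path_to_os are hypothetical. On Windows it converts a UTF-8 path to UTF-16; elsewhere it just copies the bytes, reflecting the point that no conversion is needed on UNIX.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
typedef wchar_t os_char;   /* UTF-16 code units for the wide Win32 API */
#else
typedef char os_char;      /* POSIX: the path bytes are used unchanged */
#endif

/* Hypothetical helper, not part of LMDB: return a malloc'd copy of the
 * UTF-8 path in the form the OS file API expects, or NULL on invalid
 * input or allocation failure. The caller frees the result. */
static os_char *path_to_os(const char *utf8_path)
{
    assert(utf8_path != NULL);
#ifdef _WIN32
    /* First call: ask how many wide characters are needed. Passing -1 as
     * the input length makes the count include the terminating NUL. */
    int need = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                   utf8_path, -1, NULL, 0);
    if (need <= 0)
        return NULL;
    wchar_t *wide = malloc((size_t)need * sizeof *wide);
    if (wide == NULL)
        return NULL;
    /* Second call: perform the conversion into the buffer. */
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8_path, -1, wide, need) != need) {
        free(wide);
        return NULL;
    }
    return wide;
#else
    size_t size = strlen(utf8_path) + 1;
    char *copy = malloc(size);
    if (copy != NULL)
        memcpy(copy, utf8_path, size);
    return copy;
#endif
}
```

On Windows the result would then be passed to a wide-character call such as CreateFileW. The MB_ERR_INVALID_CHARS flag is a choice made for this sketch so that malformed UTF-8 is rejected rather than silently replaced.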
Fwd: LMDB and text encoding
On Mon, Feb 2, 2015 at 3:37 AM, Howard Chu wrote:
> Hallvard Breien Furuseth wrote:
>> On 02/02/15 00:40, Howard Chu wrote:
>>> It looks OK to me. If no one raises any concerns I'll commit it in a
>>> few hours.
>>
>> Some sudden last thoughts:
>>
>> mdb_dump.c also has a check (memchr(key.mv_data, '\0', key.mv_size))
>> to exclude non-databases, which is no longer valid.
>
> Good point. As Timur's patch comment notes, we probably need an API call "is
> valid DB" now.
>
>> Database names with \0 in them can no longer be spelled as strings,
>> everything which gets DB names from the database must use binary blobs.
>> Including mdb_load and mdb_dump; I notice mdb_load uses
>> strdup() for the "database=" name. Come to think of it, I have no
>> idea if the dump format supports DB names with \0 in them.
>
> No, it doesn't. It's the BDB format, and BDB only accepted C strings.

(Just noticed that I hit "reply" instead of "reply all". Sorry. Now reposting to the mailing list.)

I think it is an acceptable limitation of mdb_dump and mdb_load. This is not the only thing they don't support: they also don't work with user-defined comparison functions.

Although I could think about ways to solve it. For example, we could add a command line option that would make mdb_dump output db names as a string of hexadecimal numbers, and mdb_load interpret them as such.
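As a rough illustration of the mdb_load side of that idea, the sketch below decodes a hex-encoded name back into bytes and then applies a memchr test like the one quoted above to show why such a name cannot go through strdup(). No such command line option exists in this thread; hex_to_name and name_has_nul are hypothetical names, not LMDB functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Value of one hexadecimal digit, or -1 if `c` is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper, not part of LMDB: decode a hex string into a
 * malloc'd byte buffer and store the byte count in *out_len. Returns NULL
 * on odd length, a non-hex character, or allocation failure. */
static unsigned char *hex_to_name(const char *hex, size_t *out_len)
{
    size_t hex_len, i;
    unsigned char *out;

    assert(hex != NULL && out_len != NULL);
    hex_len = strlen(hex);
    if (hex_len % 2 != 0)
        return NULL;
    out = malloc(hex_len / 2 + 1);  /* +1 avoids a zero-size allocation */
    if (out == NULL)
        return NULL;
    for (i = 0; i < hex_len / 2; i++) {
        int hi = hex_value(hex[2 * i]);
        int lo = hex_value(hex[2 * i + 1]);
        if (hi < 0 || lo < 0) {
            free(out);
            return NULL;
        }
        out[i] = (unsigned char)(hi * 16 + lo);
    }
    *out_len = hex_len / 2;
    return out;
}

/* Nonzero if the name contains a NUL byte, in which case it cannot be
 * handled as a C string (for example, strdup() would truncate it). */
static int name_has_nul(const unsigned char *name, size_t len)
{
    return len > 0 && memchr(name, '\0', len) != NULL;
}
```

Decoding 610062 yields the three bytes 'a', NUL, 'b', for which name_has_nul reports true, so the name has to be carried as a pointer plus length rather than as a string.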