Re: Fwd: LMDB and text encoding

2017-06-07 Thread Howard Chu

Timur Kristóf wrote:
> Hi Everyone,
>
> I've just come across this old thread and am wondering, is this still an issue?

No, it was resolved long ago.

> Does LMDB have a way to use non-ASCII path names with mdb_env_open in a
> cross-platform way?
>
> If not, would you guys accept patches to LMDB in this regard?

There's no issue on POSIX filesystems, and on Windows we already convert
pathnames from UTF-8 to UTF-16.


--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/



Re: Fwd: LMDB and text encoding

2017-06-07 Thread Timur Kristóf
Hi Everyone,

I've just come across this old thread and am wondering, is this still an
issue?
Does LMDB have a way to use non-ASCII path names with mdb_env_open in a
cross-platform way?

If not, would you guys accept patches to LMDB in this regard?

Thanks,
Timur


Re: Fwd: LMDB and text encoding

2015-02-15 Thread Florian Weimer
* Timur Kristóf:

>> > A path is always a Unicode string, while a DB name can be an arbitrary
>> > binary blob.
>>
>> On many POSIX platforms, a path is a blob which does not contain
>> '\000'.  These systems do not enforce Unicode encoding at all.
>
> My mistake. I was unaware.
> On those platforms, how do you type a path name into a terminal?

There are some files which are not directly nameable.  Many programs
support special sequences such as “Ctrl+V 3 7 7” to enter arbitrary
bytes, but that's not universal.  Depending on the actual
implementation of the terminal, cut-and-paste of funny file names can
work, too.

Older programs have trouble accessing such files even if the user
chooses them in a file selection dialog, but current versions are
supposed to have been fixed (including OpenJDK, which took a
ridiculously long time).  Beyond that, it's not much different from
dealing with file names in an unfamiliar script.



Re: Fwd: LMDB and text encoding

2015-02-15 Thread Timur Kristóf
>>> > A path is always a Unicode string, while a DB name can be an arbitrary
>>> > binary blob.
>>>
>>> On many POSIX platforms, a path is a blob which does not contain
>>> '\000'.  These systems do not enforce Unicode encoding at all.
>>
>> My mistake. I was unaware.
>> On those platforms, how do you type a path name into a terminal?
>
> There are some files which are not directly nameable.  Many programs
> support special sequences such as “Ctrl+V 3 7 7” to enter arbitrary
> bytes, but that's not universal.  Depending on the actual
> implementation of the terminal, cut-and-paste of funny file names can
> work, too.
>
> Older programs have trouble accessing such files even if the user
> chooses them in a file selection dialog, but current versions are
> supposed to have been fixed (including OpenJDK, which took a
> ridiculously long time).  Beyond that, it's not much different from
> dealing with file names in an unfamiliar script.

Interesting.
So ultimately, there are always going to be things that you cannot
type into your terminal directly.



Re: Fwd: LMDB and text encoding

2015-02-15 Thread Timur Kristóf
> > A path is always a Unicode string, while a DB name can be an arbitrary
> > binary blob.
>
> On many POSIX platforms, a path is a blob which does not contain
> '\000'.  These systems do not enforce Unicode encoding at all.

My mistake. I was unaware.
On those platforms, how do you type a path name into a terminal?



Re: Fwd: LMDB and text encoding

2015-02-15 Thread Florian Weimer
* Timur Kristóf:

> A path is always a Unicode string, while a DB name can be an arbitrary
> binary blob.

On many POSIX platforms, a path is a blob which does not contain
'\000'.  These systems do not enforce Unicode encoding at all.
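
To make the distinction above concrete, here is a minimal C sketch. It
does not use LMDB at all; the function names has_no_nul and is_valid_utf8
are hypothetical and exist only for this illustration. The first check is
the only rule the text above places on a POSIX path, the second is a
strict UTF-8 well-formedness test that such a path may well fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* The POSIX-style rule from the message above: a blob without '\0'. */
bool has_no_nul(const unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);
    return len == 0 || memchr(s, '\0', len) == NULL;
}

/* Strict UTF-8 check: rejects overlong forms, surrogates,
 * truncated sequences and code points above U+10FFFF. */
bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char c = s[i];
        size_t need;        /* continuation bytes expected */
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest value allowed for this length */

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return false;  /* stray continuation byte or 0xF8..0xFF */

        if (len - i - 1 < need)
            return false;   /* sequence is cut off */
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += need + 1;
    }
    return true;
}
```

For instance, the seven Latin-1 bytes "Krist\363f" pass has_no_nul but
fail is_valid_utf8, while the eight UTF-8 bytes "Krist\303\263f" pass
both.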



Re: Fwd: LMDB and text encoding

2015-02-02 Thread Hallvard Breien Furuseth

On 2 Feb 2015 17:11, Timur Kristóf wrote:
>> I suggest we wait to deal with DB names until we also have a way to
>> deal with filenames.  And this time test that it works in practice.
>>
>> And then I also suggest to try to make this mess simple to deal
>> with for programmers and/or users.  I guess I should have separated
>> that from the rest more clearly.
>
> I can write a patch which does the UTF-8 to UTF-16 conversion on
> Windows for file paths, but I would hate to restrict db names to UTF-8
> text only (or for that matter, any text only). However, not supporting
> non-UTF-8 db names in mdb_dump and mdb_load sounds like a reasonable
> compromise to me.

I suggest we wait to deal with DB names until we also have a way to
deal with filenames.

--
Hallvard



Re: Fwd: LMDB and text encoding

2015-02-02 Thread Timur Kristóf
> I suggest we wait to deal with DB names until we also have a way to
> deal with filenames.  And this time test that it works in practice.
>
> And then I also suggest to try to make this mess simple to deal
> with for programmers and/or users.  I guess I should have separated
> that from the rest more clearly.

I can write a patch which does the UTF-8 to UTF-16 conversion on
Windows for file paths, but I would hate to restrict db names to UTF-8
text only (or for that matter, any text only). However, not supporting
non-UTF-8 db names in mdb_dump and mdb_load sounds like a reasonable
compromise to me.



Re: Fwd: LMDB and text encoding

2015-02-02 Thread Hallvard Breien Furuseth

On 2 Feb 2015 16:25, Timur Kristóf wrote:
> Okay. What do you suggest?

I suggest we wait to deal with DB names until we also have a way to
deal with filenames.  And this time test that it works in practice.

And then I also suggest to try to make this mess simple to deal
with for programmers and/or users.  I guess I should have separated
that from the rest more clearly.

--
Hallvard



Re: Fwd: LMDB and text encoding

2015-02-02 Thread Timur Kristóf
>> A path is always a Unicode string, while a DB name can be an arbitrary
>> binary blob. So I don't think that we can treat them the same way.
>
>
> Not the point.  A program which uses LMDB can choose to treat its
> own DB names in its own LMDB environments as the same kind of
> strings as filenames (WCHAR, UTF-8 char, or whatever).  Unless we
> make that impossible.
>
> As for what LMDB will accept and what it must handle, that's up to
> us.  DB names are not binary blobs yet, after all.

Okay. What do you suggest?



Re: Fwd: LMDB and text encoding

2015-02-02 Thread Hallvard Breien Furuseth

On 2 Feb 2015 16:03, Timur Kristóf wrote:
> A path is always a Unicode string, while a DB name can be an arbitrary
> binary blob. So I don't think that we can treat them the same way.

Not the point.  A program which uses LMDB can choose to treat its
own DB names in its own LMDB environments as the same kind of
strings as filenames (WCHAR, UTF-8 char, or whatever).  Unless we
make that impossible.

As for what LMDB will accept and what it must handle, that's up to
us.  DB names are not binary blobs yet, after all.

--
Hallvard



Re: Fwd: LMDB and text encoding

2015-02-02 Thread Timur Kristóf
>> DB names are purely internal to LMDB, so they bear no relation to OS
>> filenames and none of this discussion matters to them.
>
> They're exposed to the programmer and the program's users.  Either may
> want them on command-line arguments, in config files, etc.  It will be
> inconvenient if LMDB requires different string handling for non-ASCII
> filenames and non-ASCII DB names in such cases.  The programmer may
> choose to use different string handling but let's try to avoid forcing
> him to do so.

A path is always a Unicode string, while a DB name can be an arbitrary
binary blob. So I don't think that we can treat them the same way.



Re: Fwd: LMDB and text encoding

2015-02-02 Thread Hallvard Breien Furuseth

On 2 Feb 2015 14:24, Howard Chu wrote:
> Hallvard Breien Furuseth wrote:
>> I suggest we wait to deal with DB names until we also have a way to
>> deal with filenames.  And this time test that it works in practice:-)
>> Hopefully users and programmers will only need one method of handling
>> non-ASCII LMDB names on Windows, not two.
>>
>> It'd be nice if 'mdb_stat filename -s dbname' would Just Work, as would
>> reading DB names and filenames from a config file.  Yet OS-aware and
>> OS-specific config files can look rather different.  Maybe LMDB must
>> handle DB names more flexibly than filenames, or maybe we'll end up
>> recommending that "portable" DB names must be UTF-8.  And add a flag
>> "convert UTF8<->WCHAR if this is Windows".
>
> DB names are purely internal to LMDB, so they bear no relation to OS
> filenames and none of this discussion matters to them.

They're exposed to the programmer and the program's users.  Either may
want them on command-line arguments, in config files, etc.  It will be
inconvenient if LMDB requires different string handling for non-ASCII
filenames and non-ASCII DB names in such cases.  The programmer may
choose to use different string handling but let's try to avoid forcing
him to do so.

--
Hallvard



Re: Fwd: LMDB and text encoding

2015-02-02 Thread Timur Kristóf
> DB names are purely internal to LMDB, so they bear no relation to OS
> filenames and none of this discussion matters to them.

If we let the users treat db names as an MDB_val (essentially, an
arbitrary byte array), then all bets are off: we can't even make the
assumption that a db name is meaningful text in any encoding. We can
make it possible to type such a thing in the console if we represent
it as a string of hexadecimal numbers. For example, mdb_dump could do
something like to_hex_string in this code snippet:
http://pastebin.com/jqnGSS6C (note: you need -std=c11 to compile the
snippet).
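
As an illustration of the hexadecimal idea above, here is a minimal C
sketch. It is not the pastebin code and not part of LMDB; name_to_hex is
a hypothetical helper that takes the same pointer-plus-length pair an
MDB_val carries (mv_data, mv_size) and returns a printable string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: encodes len arbitrary bytes as lowercase hex.
 * Returns a malloc'd, NUL-terminated string that the caller must free,
 * or NULL if the allocation fails or the size would overflow. */
char *name_to_hex(const void *data, size_t len)
{
    static const char digits[] = "0123456789abcdef";
    const unsigned char *bytes = data;
    char *out;

    assert(data != NULL || len == 0);
    if (len > (SIZE_MAX - 1) / 2)
        return NULL;                /* 2 * len + 1 would overflow */
    out = malloc(2 * len + 1);
    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++) {
        out[2 * i]     = digits[bytes[i] >> 4];    /* high nibble */
        out[2 * i + 1] = digits[bytes[i] & 0x0F];  /* low nibble */
    }
    out[2 * len] = '\0';
    return out;
}
```

Because every byte becomes exactly two ASCII characters, a name that
contains '\0' or bytes that are invalid in every text encoding can still
be typed on a command line or written to a dump file.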



Re: Fwd: LMDB and text encoding

2015-02-02 Thread Howard Chu

Hallvard Breien Furuseth wrote:
> I suggest we wait to deal with DB names until we also have a way to
> deal with filenames.  And this time test that it works in practice:-)
> Hopefully users and programmers will only need one method of handling
> non-ASCII LMDB names on Windows, not two.
>
> It'd be nice if 'mdb_stat filename -s dbname' would Just Work, as would
> reading DB names and filenames from a config file.  Yet OS-aware and
> OS-specific config files can look rather different.  Maybe LMDB must
> handle DB names more flexibly than filenames, or maybe we'll end up
> recommending that "portable" DB names must be UTF-8.  And add a flag
> "convert UTF8<->WCHAR if this is Windows".

DB names are purely internal to LMDB, so they bear no relation to OS
filenames and none of this discussion matters to them.





Re: Fwd: LMDB and text encoding

2015-02-02 Thread Hallvard Breien Furuseth

I suggest we wait to deal with DB names until we also have a way to
deal with filenames.  And this time test that it works in practice:-)
Hopefully users and programmers will only need one method of handling
non-ASCII LMDB names on Windows, not two.

It'd be nice if 'mdb_stat filename -s dbname' would Just Work, as would
reading DB names and filenames from a config file.  Yet OS-aware and
OS-specific config files can look rather different.  Maybe LMDB must
handle DB names more flexibly than filenames, or maybe we'll end up
recommending that "portable" DB names must be UTF-8.  And add a flag
"convert UTF8<->WCHAR if this is Windows".

--
Hallvard



Re: Fwd: LMDB and text encoding

2015-02-02 Thread Howard Chu

Timur Kristóf wrote:
>> I just had a look at how BDB handled this. As you can see they used a
>> TO_TSTRING macro to convert incoming pathnames from UTF8 to UTF16.
>>
>> https://gitorious.org/berkeleydb/berkeleydb/source/347d239a1e44ed4f773ae9274c2a32cf2b8999c0:src/os_windows/os_open.c
>>
>> https://gitorious.org/berkeleydb/berkeleydb/source/347d239a1e44ed4f773ae9274c2a32cf2b8999c0:src/dbinc/win_db.h#L136
>>
>> (And a FROM_TSTRING for the reverse, as well.)
>
> (Mea culpa, I accidentally hit "reply" instead of "reply all". Sorry.
> Now reposting to the mailing list.)
>
> Since we only need to do this on Windows, we could use
> MultiByteToWideChar with CP_UTF8. (That's what TO_TSTRING does, too.)
> I do not think we would ever need to do any such conversion on UNIX.

Correct, these macros only exist in the Windows-specific source files of
BDB. None of this is needed for POSIX.

> https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx
>
> I'm not sure if we can just copy-paste BDB's code. Probably not, that
> would lead to licensing issues, wouldn't it?

I wasn't suggesting a copy/paste, just using it as an example of how the
problem could be approached.
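
The MultiByteToWideChar approach mentioned above can be sketched roughly
as follows. This is neither BDB's macro nor LMDB's actual code; the
helper name utf8_to_wide is hypothetical, and the function is only
compiled on Windows because, as noted above, POSIX needs no conversion.

```c
#include <assert.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>

/* Hypothetical helper: converts a NUL-terminated UTF-8 string to a
 * malloc'd, NUL-terminated UTF-16 string. Returns NULL on invalid
 * input or allocation failure; the caller must free the result. */
wchar_t *utf8_to_wide(const char *utf8)
{
    wchar_t *wide;
    int n;

    assert(utf8 != NULL);
    /* First call: query the required size. With a source length of -1
     * the returned count includes the terminating NUL. */
    n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8, -1, NULL, 0);
    if (n <= 0)
        return NULL;
    wide = malloc((size_t)n * sizeof *wide);
    if (wide == NULL)
        return NULL;
    /* Second call: perform the conversion into the buffer. */
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8, -1, wide, n) == 0) {
        free(wide);
        return NULL;
    }
    return wide;
}
#endif /* _WIN32 */
```

The resulting string would then be handed to the wide-character variants
of the Win32 file functions (CreateFileW and friends) instead of the
ANSI ones.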





Fwd: LMDB and text encoding

2015-02-02 Thread Timur Kristóf
> I just had a look at how BDB handled this. As you can see they used a
> TO_TSTRING macro to convert incoming pathnames from UTF8 to UTF16.
>
> https://gitorious.org/berkeleydb/berkeleydb/source/347d239a1e44ed4f773ae9274c2a32cf2b8999c0:src/os_windows/os_open.c
>
> https://gitorious.org/berkeleydb/berkeleydb/source/347d239a1e44ed4f773ae9274c2a32cf2b8999c0:src/dbinc/win_db.h#L136
>
> (And a FROM_TSTRING for the reverse, as well.)

(Mea culpa, I accidentally hit "reply" instead of "reply all". Sorry.
Now reposting to the mailing list.)

Since we only need to do this on Windows, we could use
MultiByteToWideChar with CP_UTF8. (That's what TO_TSTRING does, too.)
I do not think we would ever need to do any such conversion on UNIX.
https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx

I'm not sure if we can just copy-paste BDB's code. Probably not, that
would lead to licensing issues, wouldn't it?



Fwd: LMDB and text encoding

2015-02-02 Thread Timur Kristóf
On Mon, Feb 2, 2015 at 3:37 AM, Howard Chu wrote:
> Hallvard Breien Furuseth wrote:
>>
>> On 02/02/15 00:40, Howard Chu wrote:
>>>
>>> It looks OK to me. If no one raises any concerns I'll commit it in a few
>>> hours.
>>
>>
>> Some sudden last thoughts:
>>
>> mdb_dump.c also has a check (memchr(key.mv_data, '\0', key.mv_size))
>> to exclude non-databases, which is no longer valid.
>
>
> Good point. As Timur's patch comment notes, we probably need an API call "is
> valid DB" now.
>
>> Database names with \0 in them can no longer be spelled as strings,
>> everything which gets DB names from the database must use binary blobs.
>> Including mdb_load and mdb_dump; I notice mdb_load uses
>> strdup() for the "database=" name.  Come to think of it, I have no
>> idea if the dump format supports DB names with \0 in them.
>
>
> No, it doesn't. It's the BDB format, and BDB only accepted C strings.

(Just noticed that I hit "reply" instead of "reply all". Sorry. Now
reposting to the mailing list.)

I think it is an acceptable limitation of mdb_dump and mdb_load. This
is not the only thing they don't support: they also don't work with
user-defined comparison functions. Although I could think about ways
to solve it.

For example, we could add a command line option that would make
mdb_dump output db names as a string of hexadecimal numbers, and
mdb_load interpret them as such.
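
The mdb_load half of that idea, turning a hexadecimal string back into
the raw name bytes, might look like the sketch below. The command line
option itself is only a proposal in this thread, and hex_val and
hex_to_name are hypothetical helpers, not existing LMDB functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: decodes a NUL-terminated hex string into a
 * malloc'd byte buffer and stores the byte count in *out_len.
 * Returns NULL on odd length, a non-hex character, or allocation
 * failure; the caller must free the result. */
unsigned char *hex_to_name(const char *hex, size_t *out_len)
{
    unsigned char *out;
    size_t n;

    assert(hex != NULL && out_len != NULL);
    n = strlen(hex);
    if (n % 2 != 0)
        return NULL;
    out = malloc(n / 2 + 1);   /* +1 keeps the request non-zero for "" */
    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < n / 2; i++) {
        int hi = hex_val(hex[2 * i]);
        int lo = hex_val(hex[2 * i + 1]);
        if (hi < 0 || lo < 0) {
            free(out);
            return NULL;
        }
        out[i] = (unsigned char)((hi << 4) | lo);
    }
    *out_len = n / 2;
    return out;
}
```

The decoded buffer and its length could then be used wherever a binary
DB name is expected, so names containing '\0' survive a dump and reload.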