Re: Will mmap and the read buffer cache be unified, anyone working with it?

2016-05-04 Thread Stuart Henderson
On 2016/05/04 01:20, Tinker wrote:
> A separate question about combining mmap access and file
> IO in the current absence of a unified buffer cache:
> 
> If I have a readonly mmap and do fwrite to it, could I use
> fsync (or msync or any other call) right after the fwrite, as a tool to
> guarantee that the memory mapping interface is up to date?

It needs more than fsync.

Playing around with the simple test program from
https://lists.samba.org/archive/samba-technical/2001-May/013552.html
(at least on amd64) it appears that msync (with either MS_SYNC,
MS_ASYNC or MS_INVALIDATE) is needed after changes are made on the
mmap side of things, and msync with MS_INVALIDATE is needed after
changes done on the file io side of things.
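
For the archives, a minimal sketch of that hand-over pattern in C. It is not the samba-technical test program linked above; the function name and the scratch file under /tmp are made up for illustration, and it only shows where the msync() calls described above would go. On a system with a unified cache the calls are not needed for coherency and the check passes either way.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Sketch: keep a MAP_SHARED mapping and pread()/pwrite() on the same file
 * consistent by calling msync() at each hand-over.
 * Returns 0 when both directions observed the expected bytes, -1 otherwise.
 */
int sync_both_directions(void)
{
    char path[] = "/tmp/msync_demo_XXXXXX";  /* hypothetical scratch file */
    char buf[8];
    char *map;
    const size_t len = sizeof(buf);
    int rc = -1;
    int fd = mkstemp(path);

    if (fd == -1)
        return -1;
    unlink(path);                            /* file goes away when fd is closed */

    if (ftruncate(fd, (off_t)len) == -1)
        goto out;
    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out;

    /* 1. Change made on the mmap side, then read back with file I/O. */
    memcpy(map, "via-mmap", len);
    if (msync(map, len, MS_SYNC) == -1)
        goto unmap;
    if (pread(fd, buf, len, 0) != (ssize_t)len ||
        memcmp(buf, "via-mmap", len) != 0)
        goto unmap;

    /* 2. Change made on the file I/O side, then read back through the map. */
    if (pwrite(fd, "via-file", len, 0) != (ssize_t)len)
        goto unmap;
    if (msync(map, len, MS_INVALIDATE) == -1)
        goto unmap;
    if (memcmp(map, "via-file", len) != 0)
        goto unmap;

    rc = 0;
unmap:
    munmap(map, len);
out:
    close(fd);
    return rc;
}
```

Calling sync_both_directions() from a small main() and checking for 0 is enough to try it on a given machine.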

> Also between the fwrite and that call, the possible data inconsistency would
> be limited to the particular bytes written right?

I don't know.



Re: Will mmap and the read buffer cache be unified, anyone working with it?

2016-05-03 Thread Tinker

A separate question about combining mmap access and file
IO in the current absence of a unified buffer cache:

If I have a readonly mmap and do fwrite to it, could I use
fsync (or msync or any other call) right after the fwrite, as a tool to 
guarantee that the memory mapping interface is up to date?



Also between the fwrite and that call, the possible data inconsistency 
would be limited to the particular bytes written right?


On 2015-11-12 15:45, Stuart Henderson wrote:



Re: Will mmap and the read buffer cache be unified, anyone working with it?

2015-11-12 Thread Stuart Henderson
On 2015-05-23, Tinker  wrote:
> On 2015-05-23 16:42, Stuart Henderson wrote:
>> On 2015-05-22, Tinker  wrote:
>>> Okay there's msync(). So this really doesn't look like a problem but
>>> let me think.
>> 
>> It is a problem because it's not needed on the majority of other OS
>> where most upstream testing is done, so even programs which try to do
>> the right thing will often silently corrupt data on OpenBSD.
>> 
>> We definitely had problems with Cyrus imapd in the past. Dovecot
>> disables some things on OpenBSD as a result. OpenLDAP MDB has had issues. And
>> I suspect this may be involved in problems we saw in NSD prior to
>> disabling the database in the default config.
>
> Agreed.
>
> (Also SQLite has mmap disabled on OpenBSD for this reason, and is maybe
> ~30% slower for this reason. I guess in a way this is just a silly
> architectural detail, but it "can" require unified caching because all
> other OS:es implement it.)

Since the thread came up again, I'll add a few more for the archives:
tokyocabinet and kyotocabinet have problems which look very like they're
related to this too, they do have code meant to work in this case but
there are problems that show up in regression tests.



Re: Will mmap and the read buffer cache be unified, anyone working with it?

2015-11-12 Thread Stuart Henderson
On 2015-11-11, Tinker  wrote:
> But LMDB doesn't even compile on OpenBSD in mmap mode, does it, or did
> you fix it in the last few months?

I haven't tried standalone LMDB but last time I tried OpenLDAP with MDB
enabled (using MDB_WRITEMAP) it did build, but had failures at runtime.



Re: Will mmap and the read buffer cache be unified, anyone working with it?

2015-11-11 Thread Tinker

On 2015-11-12 01:30, Theo de Raadt wrote:
>> I find this conversation puzzling, since even back in BSD 4.3, read()
>> was actually implemented by memory mapping the underlying file.
>
> That is a wildly incorrect description of the 4.3 BSD buffer cache,
> furthermore 4.3 lacked a working mmap() to hit the coherency issue.


Wait, so if you make a read-only mmap(), will the buffer in OpenBSD be 
"unified" in the sense that fwrite():s always will be immediately 
visible in the memory mapped version?


(This is what I understood that Howard suggested would apply from BSD
4.3 and on and Theo now said was not correct, so, my question is if that
behavior was introduced later in OpenBSD.)




Re: Will mmap and the read buffer cache be unified, anyone working with it?

2015-11-11 Thread Howard Chu

On 2015-11-10 14:10, Tinker wrote:
> ...
>
> A "safe" approach to file access would be to read data using mmap()
> but write data using fwrite() only. Mmap does have a read-only mode.
> This does NOT work in OpenBSD currently though because of the absence
> of unified caching.


I find this conversation puzzling, since even back in BSD 4.3, read() was 
actually implemented by memory mapping the underlying file.



> The nice thing about reading files from memory using mmap instead of
> using fread(), is that you offload Lots of work to the OS kernel.
>
> Suddenly file reading is free of mallocs for instance.
>
> And the program doesn't need internal file caching, so the extent of the
> OS' disk caching is increased. And I guess maybe the OS disk cache can
> prioritize better what to keep in RAM.
>
> In a way, mmap() is a way to "zero-copy file access", which is just
> awesome.
>
> A database that uses this technique is LMDB (OpenLDAP's default DB
> backend).
>
> A key feature of LMDB is that it's only 9600 locs,
> https://github.com/LMDB/lmdb/blob/mdb.master/libraries/liblmdb/mdb.c :
>
> LMDB is interesting in how lowlevel it is, as it's written in C and the
> "loaded" database entries are simply pointers into the mmap():ed space.
>
> I saw some fantastic-looking benchmarks, I think at
> http://symas.com/mdb/#bench , where LMDB goes light-speed where others
> remain on the ground.
>
> (LMDB has a limit in usability in that it never shrinks a DB file,
> however, that is in no way because of its use of mmap() and could really
> be overcome by working more on it.


That is not a limit in usability. Any active database undergoes insert and 
delete operations; returning space to the OS on a delete would be foolish 
since the DB will just request the space back again on the next insert operation.


High performance malloc libraries generally operate the same way - once they 
have acquired memory from the OS they seldom/never return it. There's no good 
reason to incur the cost of allocation more than once.



> Also, LMDB serializes its DB writes, which also is an architecture
> decision specific to it and which has severe performance implications -
> and that is unspecific to mmap() also, and could be overcome.)


LMDB's write performance is pretty mediocre, by design - we emphasized 
durability/reliability over performance here. But in practice, it is always 
faster than e.g. BerkeleyDB, which supports multiple concurrent writers. With 
multiple writer concurrency, we found that BDB spends much of its time in 
contended locks and deadlock resolution. In most applications, lock 
acquisition/release, deadlock detection, and resolution will consume a huge 
amount of CPU time, completely erasing any potential throughput gains from 
allowing multiple concurrent writers.


If you want to do writes thru mmap() then you need to be extremely careful, so 
yes, how LMDB does writes is actually highly specific to its use of mmap. 
Transactional integrity requires that certain writes are persisted to disk in 
a particular order, otherwise you get corrupted data structures. You can use 
mlock() to prevent pages from being flushed before you intend, but then you're 
invoking a number of system calls per write, and so you haven't gained 
anything in the performance department. Or you can do what LMDB does, and 
write arbitrarily to the map until the end of the transaction (using no 
syscalls), and then do carefully sequenced final updates and msyncs.


Note that LMDB works perfectly well on OpenBSD even without a unified buffer 
cache; it just requires you to perform writes thru the mmap as well as reads 
to sidestep the cache coherency issue. (Of course, using a writable mmap means 
you lose LMDB's default immunity to stray writes thru wild pointers.)


--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/



Re: Will mmap and the read buffer cache be unified, anyone working with it?

2015-11-11 Thread Theo de Raadt
> I find this conversation puzzling, since even back in BSD 4.3, read() was 
> actually implemented by memory mapping the underlying file.

That is a wildly incorrect description of the 4.3 BSD buffer cache,
furthermore 4.3 lacked a working mmap() to hit the coherency issue.



Re: Will mmap and the read buffer cache be unified, anyone working with it?

2015-11-11 Thread Tinker

...

> LMDB's write performance is pretty mediocre, by design - we emphasized
> durability/reliability over performance here. But in practice, it is
> always faster than e.g. BerkeleyDB, which supports multiple concurrent
> writers. With multiple writer concurrency, we found that BDB spends
> much of its time in contended locks and deadlock resolution. In most
> applications, lock acquisition/release, deadlock detection, and
> resolution will consume a huge amount of CPU time, completely erasing
> any potential throughput gains from allowing multiple concurrent
> writers.
>
> If you want to do writes thru mmap() then you need to be extremely
> careful, so yes, how LMDB does writes is actually highly specific to
> its use of mmap. Transactional integrity requires that certain writes
> are persisted to disk in a particular order, otherwise you get
> corrupted data structures. You can use mlock() to prevent pages from
> being flushed before you intend, but then you're invoking a number of
> system calls per write, and so you haven't gained anything in the
> performance department. Or you can do what LMDB does, and write
> arbitrarily to the map until the end of the transaction (using no
> syscalls), and then do carefully sequenced final updates and msyncs.
>
> Note that LMDB works perfectly well on OpenBSD even without a unified
> buffer cache; it just requires you to perform writes thru the mmap as
> well as reads to sidestep the cache coherency issue. (Of course, using
> a writable mmap means you lose LMDB's default immunity to stray writes
> thru wild pointers.)



But LMDB doesn't even compile on OpenBSD in mmap mode, does it, or did
you fix it in the last few months?



Having some kind of vacuum/GC feature on the database file would be 
nice, even if it's only rare and sporadic. I am principally against 
database files that are hardcoded to always be ever-growing.


I am aware that some kind of "hotswap" logic can be implemented, maybe
quite easily, atop LMDB. Perhaps such a "hotswap-vacuum" wrapper
could be provided as a standard wrapper atop LMDB, just because it would
be so useful to everyone. *A lot* would be needed for me to accept an
ever-growing database file.




Re: Will mmap and the read buffer cache be unified, anyone working with it?

2015-11-11 Thread Theo de Raadt
> On 2015-11-12 01:30, Theo de Raadt wrote:
> >> I find this conversation puzzling, since even back in BSD 4.3, read()
> >> was actually implemented by memory mapping the underlying file.
> > 
> > That is a wildly incorrect description of the 4.3 BSD buffer cache,
> > furthermore 4.3 lacked a working mmap() to hit the coherency issue.
> 
> Wait, so if you make a read-only mmap(), will the buffer in OpenBSD be 
> "unified" in the sense that fwrite():s always will be immediately 
> visible in the memory mapped version?

My reply is specifically about what Howard said.

> (This is what I understood that Howard suggested would apply from BSD
> 4.3 and on and Theo now said was not correct, so, my question is if that
> behavior was introduced later in OpenBSD.)

I made none of the claims you are composing.



Re: Will mmap and the read buffer cache be unified, anyone working with it?

2015-11-11 Thread Tinker

On 2015-11-12 01:42, Theo de Raadt wrote:
> > On 2015-11-12 01:30, Theo de Raadt wrote:
> > >> I find this conversation puzzling, since even back in BSD 4.3, read()
> > >> was actually implemented by memory mapping the underlying file.
> > >
> > > That is a wildly incorrect description of the 4.3 BSD buffer cache,
> > > furthermore 4.3 lacked a working mmap() to hit the coherency issue.
> >
> > Wait, so if you make a read-only mmap(), will the buffer in OpenBSD be
> > "unified" in the sense that fwrite():s always will be immediately
> > visible in the memory mapped version?
>
> My reply is specifically about what Howard said.
>
> > (This is what I understood that Howard suggested would apply from BSD
> > 4.3 and on and Theo now said was not correct, so, my question is if that
> > behavior was introduced later in OpenBSD.)
>
> I made none of the claims you are composing.


Aha, noted, so unifying the buffer cache is still altogether in the
future. Thanks for clarifying.




Re: Will mmap and the read buffer cache be unified, anyone working with it?

2015-11-09 Thread Tinker

Really, unifying the buffer cache would make all sense in the world.

Also, *all* other major OS:es have done this already, so even just doing 
it for the sake of symmetry would make sense.



I think I talked to someone who suggested that an attempt to unify had
been made before, and that the NFS filesystem drivers had been what
stopped it then; for some reason they were difficult. Perhaps that could
be worked around after all?




On 2015-05-22 15:27, Tinker wrote:

Hi,

Will mmap and the read buffer cache be unified, anyone working with it?

Some programs disable features on OBSD for this reason so would be
nice! (I admit though that a program combining mmap() and read() on
the same file sounds like a slightly quirky design choice to me.)

Thanks!
Tinker




Re: Will mmap and the read buffer cache be unified, anyone working with it?

2015-11-09 Thread Tinker

On 2015-11-10 14:10, Tinker wrote:
...

A "safe" approach to file access would be to read data using mmap()
but write data using fwrite() only. Mmap does have a read-only mode.
This does NOT work in OpenBSD currently though because of the absence
of unified caching.


The nice thing about reading files from memory using mmap instead of 
using fread(), is that you offload Lots of work to the OS kernel.


Suddenly file reading is free of mallocs for instance.

And the program doesn't need internal file caching, so the extent of the 
OS' disk caching is increased. And I guess maybe the OS disk cache can 
prioritize better what to keep in RAM.


In a way, mmap() is a way to "zero-copy file access", which is just 
awesome.
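
As a rough illustration of what "zero-copy" means here, a small C sketch that scans a file through a read-only mapping. The function name is made up for this example; the point is only that the scan touches the kernel's pages directly, with no malloc() and no read buffer in the program.

```c
#define _XOPEN_SOURCE 700
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>

/*
 * Count occurrences of needle in the file behind fd by reading the bytes
 * straight out of a read-only mapping.
 * Returns the count, or -1 on error.
 */
long count_byte_mmap(int fd, char needle)
{
    struct stat st;
    const char *map, *p, *end;
    long count = 0;

    if (fstat(fd, &st) == -1)
        return -1;
    if (st.st_size == 0)
        return 0;                 /* mmap() rejects zero-length mappings */

    map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -1;

    end = map + st.st_size;
    /* memchr() walks the mapped pages; nothing is copied into the program. */
    for (p = map; (p = memchr(p, needle, (size_t)(end - p))) != NULL; p++)
        count++;

    munmap((void *)map, (size_t)st.st_size);
    return count;
}
```

The fread() equivalent would need a buffer and a loop to refill it; here the kernel's page cache is the buffer.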




A database that uses this technique is LMDB (OpenLDAP's default DB 
backend).


A key feature of LMDB is that it's only 9600 locs, 
https://github.com/LMDB/lmdb/blob/mdb.master/libraries/liblmdb/mdb.c :


LMDB is interesting in how lowlevel it is, as it's written in C and the 
"loaded" database entries are simply pointers into the mmap():ed space.


I saw some fantastic-looking benchmarks, I think at 
http://symas.com/mdb/#bench , where LMDB goes light-speed where others 
remain on the ground.


(LMDB has a limit in usability in that it never shrinks a DB file, 
however, that is in no way because of its use of mmap() and could really 
be overcome by working more on it.


Also, LMDB serializes its DB writes, which also is an architecture 
decision specific to it and which has severe performance implications - 
and that is unspecific to mmap() also, and could be overcome.)




Re: Will mmap and the read buffer cache be unified, anyone working with it?

2015-11-09 Thread Tinker

Just to clarify:

Security: Neutral (it's all in-kernel, on the userland side it just 
simplifies code and is inherently security promoting as it just is a 
symmetry aspect)


Will unifying it improve performance?: Yes, because programs with 
dynamic file sizes will be able to MMAP up to some point and then 
fread/fwrite beyond that one. (In contrast, having anticipatory 1TB 
mmap:s for all mmap:ed files doesn't make sense.)
 I think msync() is expensive, and unifying the buffer would remove 
the need for msync():s all over the place!


Will it make the code simpler?: Userland code, yes, by indirectly 
encouraging mmap use.


Details: SQLite has its mmap acceleration disabled on OpenBSD because
the mmap-based access is not symmetric with the fread/fwrite which it
uses above the mmapped area and by default.




A "safe" approach to file access would be to read data using mmap() but 
write data using fwrite() only. Mmap does have a read-only mode. This 
does NOT work in OpenBSD currently though because of the absence of 
unified caching.
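
For illustration, a sketch in C of the extra steps this pattern would need where the cache is not unified. The helper name is made up, and the msync() call reflects my understanding of what is required rather than anything tested on OpenBSD. Note that fwrite() by itself only fills stdio's userland buffer, so an fflush() is needed before the kernel sees the bytes at all.

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <sys/mman.h>

/*
 * Write len bytes at offset 0 with stdio, then ask the kernel to refresh a
 * read-only MAP_SHARED mapping of the same file. map must be the page-aligned
 * address returned by mmap(). Returns 0 on success, -1 on error.
 */
int fwrite_then_refresh(FILE *fp, const char *map, const void *data, size_t len)
{
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return -1;
    if (fwrite(data, 1, len, fp) != len)
        return -1;
    /* fwrite() only fills stdio's buffer; push it down to the kernel. */
    if (fflush(fp) != 0)
        return -1;
    /* Discard cached pages so the next access through map refetches them. */
    return msync((void *)map, len, MS_INVALIDATE);
}
```

The mapping itself would be created once with PROT_READ and MAP_SHARED on fileno(fp); with a unified cache the msync() step would be unnecessary.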



(Sorry for not providing a diff for this.)



On 2015-11-10 13:36, Tinker wrote:

Really, unifying the buffer cache would make all sense in the world.

Also, *all* other major OS:es have done this already, so even just
doing it for the sake of symmetry would make sense.


I think I talked to someone who suggested that an attempt to unify had
been made before, and that the NFS filesystem drivers had been what
stopped it then; for some reason they were difficult. Perhaps that
could be worked around after all?



On 2015-05-22 15:27, Tinker wrote:

Hi,

Will mmap and the read buffer cache be unified, anyone working with it?


Some programs disable features on OBSD for this reason so would be
nice! (I admit though that a program combining mmap() and read() on
the same file sounds like a slightly quirky design choice to me.)

Thanks!
Tinker




Re: Will mmap and the read buffer cache be unified, anyone working with it?

2015-11-09 Thread Theo de Raadt
>(Sorry for not providing a diff for this.)

Aww come on, you had us going until then.



Re: Will mmap and the read buffer cache be unified, anyone working with it?

2015-05-23 Thread Stuart Henderson
On 2015-05-22, Tinker ti...@openmailbox.org wrote:
> Okay there's msync(). So this really doesn't look like a problem but let
> me think.

It is a problem because it's not needed on the majority of other OS
where most upstream testing is done, so even programs which try to do
the right thing will often silently corrupt data on OpenBSD.

We definitely had problems with Cyrus imapd in the past. Dovecot disables
some things on OpenBSD as a result. OpenLDAP MDB has had issues. And
I suspect this may be involved in problems we saw in NSD prior to
disabling the database in the default config.



Re: Will mmap and the read buffer cache be unified, anyone working with it?

2015-05-23 Thread Tinker

On 2015-05-23 16:42, Stuart Henderson wrote:
> On 2015-05-22, Tinker ti...@openmailbox.org wrote:
> > Okay there's msync(). So this really doesn't look like a problem but
> > let me think.
>
> It is a problem because it's not needed on the majority of other OS
> where most upstream testing is done, so even programs which try to do
> the right thing will often silently corrupt data on OpenBSD.
>
> We definitely had problems with Cyrus imapd in the past. Dovecot
> disables some things on OpenBSD as a result. OpenLDAP MDB has had issues.
> And I suspect this may be involved in problems we saw in NSD prior to
> disabling the database in the default config.


Agreed.

(Also SQLite has mmap disabled on OpenBSD for this reason, and is maybe
~30% slower for this reason. I guess in a way this is just a silly
architectural detail, but it can require unified caching because all
other OS:es implement it.)


(I heard someone had worked on unifying the cache before, but it didn't 
go all the way because of issues with making it work with NFS - 
something like this.)


Any plans to unify the cache?



Re: Will mmap and the read buffer cache be unified, anyone working with it?

2015-05-22 Thread Tinker
Okay there's msync(). So this really doesn't look like a problem but let 
me think.


Any thoughts on this one welcome.

On 2015-05-22 12:57, Tinker wrote:

Hi,

Will mmap and the read buffer cache be unified, anyone working with it?

Some programs disable features on OBSD for this reason so would be
nice! (I admit though that a program combining mmap() and read() on
the same file sounds like a slightly quirky design choice to me.)

Thanks!
Tinker




Will mmap and the read buffer cache be unified, anyone working with it?

2015-05-22 Thread Tinker

Hi,

Will mmap and the read buffer cache be unified, anyone working with it?

Some programs disable features on OBSD for this reason so would be nice! 
(I admit though that a program combining mmap() and read() on the same 
file sounds like a slightly quirky design choice to me.)


Thanks!
Tinker