Re: write-scaling problems in LMDB

2014-10-20 Thread Luke Kenneth Casson Leighton
On Mon, Oct 20, 2014 at 3:47 PM, Howard Chu  wrote:


>>   ok... what's the number of writes being carried out in each
>> transaction?  the original scenario in which i saw the issue was when
>> there was one (!!).  the design at the time was not optimised for
>> batches of reads followed by batches of writes.
>
>
> This is 1 write per txn.

 already? oh ok so that's hammering the machine rather a lot, then :)

 ok let me do some exploring this end, see if i can come up with
something that replicates the dreadful performance.

l.



Re: write-scaling problems in LMDB

2014-10-20 Thread Howard Chu

Luke Kenneth Casson Leighton wrote:

On Mon, Oct 20, 2014 at 3:25 PM, Howard Chu  wrote:

The timings on the left are for the writer process; the reader process
is on the right. You were right, the number of writers is largely
irrelevant because they're all waiting most of the time. There's a
huge jump in System CPU time from 1 writer to 2 writers, because with
only 1 writer the mutex is uncontended and essentially zero-cost. As
more writer threads are added, the System CPU time will rise.


  by some 15 to 20% or so, from 2 writers going up to 32 writers.
which is not unreasonable - nothing like what i saw (CPU usage
dropping to 40%, loadavg jumping to over 8x the number of cores).

  ok... what's the number of writes being carried out in each
transaction?  the original scenario in which i saw the issue was when
there was one (!!).  the design at the time was not optimised for
batches of reads followed by batches of writes.


This is 1 write per txn.


   a single write per transaction would mean quite a lot more mutex usage.
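
For illustration, a rough sketch of what one write per transaction looks
like at the LMDB API level (not the actual benchmark code; the key/value
handling here is made up): mdb_txn_begin() for a write transaction takes
the env's single writer mutex, and mdb_txn_commit() releases it, so the
lock is acquired and dropped once per stored item.

/* Illustrative sketch only - not the benchmark source.  Shows why one
 * put per transaction exercises the env writer mutex so heavily: every
 * write txn acquires and releases it around a single mdb_put(). */
#include "lmdb.h"

static int write_one_per_txn(MDB_env *env, MDB_dbi dbi, int key)
{
    MDB_txn *txn;
    MDB_val k, v;
    int rc;

    /* begin a write txn: this blocks on the env's single writer mutex */
    if ((rc = mdb_txn_begin(env, NULL, 0, &txn)) != 0)
        return rc;

    k.mv_size = sizeof(key);  k.mv_data = &key;
    v.mv_size = sizeof(key);  v.mv_data = &key;   /* dummy value */
    rc = mdb_put(txn, dbi, &k, &v, 0);

    /* commit (or abort) releases the writer mutex again */
    if (rc == 0)
        rc = mdb_txn_commit(txn);
    else
        mdb_txn_abort(txn);
    return rc;
}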


--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/



Re: write-scaling problems in LMDB

2014-10-20 Thread Luke Kenneth Casson Leighton
On Mon, Oct 20, 2014 at 3:25 PM, Howard Chu  wrote:
> Luke Kenneth Casson Leighton wrote:
>>
>> On Mon, Oct 20, 2014 at 2:34 PM, Howard Chu  wrote:
>>>
>>> I can probably use taskset or something similar to restrict a process to
>>> a
>>> particular set of cores. What exactly do you have in mind?
>>
>>
>>   keeping the proposed writers-and-readers test, but restricted to the
>> same 8 cores only, running:
>
>
> Well in the meantime, the 32/32 test finished:
>
> Writers  Writer Run Time (Wall / User / Sys)     CPU %  Context Switches (Vol / Invol)  Write  Read     Reader Run Time (Wall / User / Sys)     CPU %
> 1        10:01.38 / 00:09:34.90 / 00:00:25.80      99          9 /  2040               27160  5877894  10:00.68 / 05:19:03.44 / 00:00:35.34     3192
> 2        10:01.38 / 00:09:15.15 / 00:02:41.94     119   14397526 /  2955               24012  6150801  10:00.67 / 05:19:00.96 / 00:00:38.23     3192
> 4        10:01.34 / 00:09:06.93 / 00:02:46.94     118   14246971 /  8545               23795  6067928  10:00.68 / 05:18:59.65 / 00:00:39.52     3192
> 8        10:01.49 / 00:09:04.22 / 00:02:53.09     119   14011236 / 10850               23589  5961381  10:00.69 / 05:18:56.82 / 00:00:42.54     3192
> 16       10:01.64 / 00:09:07.32 / 00:02:53.09     119   13787738 / 20049               23670  5966794  10:00.68 / 05:18:59.13 / 00:00:40.35     3192
> 32       10:01.78 / 00:09:04.29 / 00:02:58.40     120   13205576 / 27847               23337  5959624  10:00.70 / 05:19:00.54 / 00:00:39.46     3192
>
> The timings on the left are for the writer process; the reader process
> is on the right. You were right, the number of writers is largely
> irrelevant because they're all waiting most of the time. There's a
> huge jump in System CPU time from 1 writer to 2 writers, because with
> only 1 writer the mutex is uncontended and essentially zero-cost. As
> more writer threads are added, the System CPU time will rise.

 by some 15 to 20% or so, from 2 writers going up to 32 writers.
which is not unreasonable - nothing like what i saw (CPU usage
dropping to 40%, loadavg jumping to over 8x the number of cores).

 ok... what's the number of writes being carried out in each
transaction?  the original scenario in which i saw the issue was when
there was one (!!).  the design at the time was not optimised for
batches of reads followed by batches of writes.

  a single write per transaction would mean quite a lot more mutex usage.

l.



Re: write-scaling problems in LMDB

2014-10-20 Thread Howard Chu

Luke Kenneth Casson Leighton wrote:

On Mon, Oct 20, 2014 at 1:53 PM, Howard Chu  wrote:

   then it would be possible to make a direct comparison against the
figures you just sent, for e.g. the 32-threads case: 32 readers,
2 writers; 32 readers, 4 writers; 32 readers, 8 writers; and so on.
keeping the number of threads (write plus read) at or below the
total number of cores avoids any unnecessary context-switching.


We can do that by running two instances of the benchmark program
concurrently; one doing a read-only job with a fixed number of threads (32)
and one doing a write-only job with the increasing number of threads.


  ohh, ok - great.  saves a job doing some programming at least.


This is why it's important to support both multi-process and multi-threaded
concurrency ;)

--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/



Re: write-scaling problems in LMDB

2014-10-20 Thread Howard Chu

Luke Kenneth Casson Leighton wrote:

On Mon, Oct 20, 2014 at 2:34 PM, Howard Chu  wrote:

I can probably use taskset or something similar to restrict a process to a
particular set of cores. What exactly do you have in mind?


  keeping the proposed writers-and-readers test, but restricted to the
same 8 cores only, running:


Well in the meantime, the 32/32 test finished:

Writers  Writer Run Time (Wall / User / Sys)     CPU %  Context Switches (Vol / Invol)  Write  Read     Reader Run Time (Wall / User / Sys)     CPU %
1        10:01.38 / 00:09:34.90 / 00:00:25.80      99          9 /  2040               27160  5877894  10:00.68 / 05:19:03.44 / 00:00:35.34     3192
2        10:01.38 / 00:09:15.15 / 00:02:41.94     119   14397526 /  2955               24012  6150801  10:00.67 / 05:19:00.96 / 00:00:38.23     3192
4        10:01.34 / 00:09:06.93 / 00:02:46.94     118   14246971 /  8545               23795  6067928  10:00.68 / 05:18:59.65 / 00:00:39.52     3192
8        10:01.49 / 00:09:04.22 / 00:02:53.09     119   14011236 / 10850               23589  5961381  10:00.69 / 05:18:56.82 / 00:00:42.54     3192
16       10:01.64 / 00:09:07.32 / 00:02:53.09     119   13787738 / 20049               23670  5966794  10:00.68 / 05:18:59.13 / 00:00:40.35     3192
32       10:01.78 / 00:09:04.29 / 00:02:58.40     120   13205576 / 27847               23337  5959624  10:00.70 / 05:19:00.54 / 00:00:39.46     3192

The timings on the left are for the writer process; the reader process
is on the right. You were right, the number of writers is largely
irrelevant because they're all waiting most of the time. There's a
huge jump in System CPU time from 1 writer to 2 writers, because with
only 1 writer the mutex is uncontended and essentially zero-cost. As
more writer threads are added, the System CPU time will rise.

The readers are basically unaffected by the number of writers.
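
As an aside, a tiny stand-alone illustration (purely hypothetical driver,
not the LMDB benchmark) of why the uncontended case is nearly free: on
Linux an uncontended pthread_mutex_lock() is just an atomic operation in
user space, while a contended one falls back to futex() system calls,
which is where the extra System CPU time appears as soon as a second
writer shows up.

/* Hypothetical micro-driver, not the LMDB benchmark: N threads hammer one
 * mutex.  With 1 thread the lock stays on the user-space fast path; with
 * 2+ threads contention forces futex() calls and system CPU time rises. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&m);   /* uncontended: atomic; contended: futex */
        counter++;
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int n = argc > 1 ? atoi(argv[1]) : 1;
    pthread_t t[64];

    if (n > 64) n = 64;
    if (n < 1)  n = 1;
    for (int i = 0; i < n; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < n; i++)
        pthread_join(t[i], NULL);

    printf("threads=%d counter=%ld\n", n, counter);
    return 0;
}

Comparing "time ./a.out 1" against "time ./a.out 2" (built with -pthread)
should show the same jump in sys time, independent of LMDB.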

--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/



Re: write-scaling problems in LMDB

2014-10-20 Thread Luke Kenneth Casson Leighton
On Mon, Oct 20, 2014 at 1:00 PM, Howard Chu  wrote:
> Howard Chu wrote:
>>
>> Luke Kenneth Casson Leighton wrote:
>>>
>>> http://symas.com/mdb/inmem/scaling.html
>>>
>>> can i make the suggestion that, whilst i am aware that it is generally
>>> not recommended for production environments to run more processes than
>>> there are cores, you try running 128, 256 and even 512 processes all
>>> hitting that 64-core system, and monitor its I/O usage (iostats) and
>>> loadavg whilst doing so?

>>> the hypothesis to test is that the performance, which should degrade
>>> roughly linearly with the ratio of processes to cores, instead drops
>>> like a lead balloon.
>

> Looks to me like the system was reasonably well behaved.

 and it looks like the writer rate is approximately halving with each
doubling of the thread count from 64 onwards.

 ok, so that didn't show anything up... but wait... there's only one
writer, right?  the scenarios where i am seeing difficulties is when
there are multiple writers and readers (actually, multiple writers and
readers to multiple envs simultaneously).

 so to duplicate that scenario, it would either be necessary to modify
the benchmark to do multiple writer threads (knowing that they are
going to have contention, but that's ok) or, to be closer to the
scenario where i have observed difficulties, to run the test several
times *simultaneously* on the same database.

 *thinks* actually in order to ensure that the reads and writes
are approximately balanced, it would likely be necessary to modify the
benchmark code to allow multiple writer threads and distribute the
workload amongst them whilst at the same time keeping the number of
reader threads the same as it was previously.

 then it would be possible to make a direct comparison against the
figures you just sent, for e.g. the 32-threads case: 32 readers,
2 writers; 32 readers, 4 writers; 32 readers, 8 writers; and so on.
keeping the number of threads (write plus read) at or below the
total number of cores avoids any unnecessary context-switching.

 the hypothesis being tested is that the writers' overall performance
remains the same, as only one may perform writes at a time.

 i know it sounds silly to do that: it sounds so obvious that it
really should not make any difference, given that no matter how many
writers there are they will always do absolutely nothing (except one
of them), and the context switching when one finishes should also be
negligible - but i know there's something wrong and i'd like to help
find out what it is.

l.



Re: write-scaling problems in LMDB

2014-10-20 Thread Luke Kenneth Casson Leighton
On Mon, Oct 20, 2014 at 2:34 PM, Howard Chu  wrote:
> Luke Kenneth Casson Leighton wrote:
>>
>> On Mon, Oct 20, 2014 at 1:53 PM, Howard Chu  wrote:
>>>
>>> My experience from benchmarking OpenLDAP over the years is that mutexes
>>> scale only up to a point. When you have threads grabbing the same mutex
>>> from
>>> across socket boundaries, things go into the toilet. There's no fix for
>>> this; that's the nature of inter-socket communication.
>>
>>
>>   argh.  ok.  so... actually accidentally, the design where i used
>> a single LMDB (one env) shared amongst (20 to 30) processes using
>> db_open to create (10 or so) databases would mitigate that...
>> taking a quick look at mdb.c the mutex lock is done on the env not on
>> the database...
>>
>>   sooo compared to the previous design there would only be a 20/30-to-1
>> mutex contention whereas previously there were  *10 sets* of 20 or 30
>> to 1 mutexes all competing... and if mutexes use sockets underneath
>> that would explain why the inter-process communication (which also
>> used sockets) was so dreadful.
>
>
> Note - I was talking about physical CPU sockets, not network sockets.

 oh right, haha :)  ok scratch that theory then.

>> or is it
>> relatively easy to turn off the NUMA architecture?
>
>
> I can probably use taskset or something similar to restrict a process to a
> particular set of cores. What exactly do you have in mind?

 keeping the proposed writers-and-readers test, but restricted to the
same 8 cores only, running:

 * 1 writer and 16 readers (single program, 16 threads) as a baseline,
giving roughly 2-to-1 contention between threads and cores
 * further tests created by adding extra programs (single writers only):
first 1 extra writer, then 2, then 4, then 8, then maybe even 16.

the idea is to see how the mutexes affect performance as a function
of the number of writers, but without the effects of NUMA to contend with.
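
 (roughly what i mean by pinning - a sketch only, assuming linux; the
core numbers 0-7 are just an example.  either launching the benchmark
under "taskset -c 0-7" or calling sched_setaffinity() inside it would
do the same thing:)

/* illustrative only: pin the calling process - and any threads it
 * creates afterwards - to cores 0-7, so the writer/reader test stays
 * on one NUMA node.  equivalent to launching under taskset -c 0-7. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_to_first_eight_cores(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 8; cpu++)
        CPU_SET(cpu, &set);

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}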

l.



Re: write-scaling problems in LMDB

2014-10-20 Thread Luke Kenneth Casson Leighton
On Mon, Oct 20, 2014 at 1:53 PM, Howard Chu  wrote:

>>   then it would be possible to make a direct comparison against the
>> figures you just sent, for e.g. the 32-threads case: 32 readers,
>> 2 writers; 32 readers, 4 writers; 32 readers, 8 writers; and so on.
>> keeping the number of threads (write plus read) at or below the
>> total number of cores avoids any unnecessary context-switching.
>
>
> We can do that by running two instances of the benchmark program
> concurrently; one doing a read-only job with a fixed number of threads (32)
> and one doing a write-only job with the increasing number of threads.

 ohh, ok - great.  saves a job doing some programming at least.

>>   the hypothesis being tested is that the writers' overall performance
>> remains the same, as only one may perform writes at a time.
>
>
>>   i know it sounds silly to do that: it sounds so obvious that it
>> really should not make any difference, given that no matter how many
>> writers there are they will always do absolutely nothing (except one
>> of them), and the context switching when one finishes should also be
>> negligible - but i know there's something wrong and i'd like to help
>> find out what it is.
>
>
> My experience from benchmarking OpenLDAP over the years is that mutexes
> scale only up to a point. When you have threads grabbing the same mutex from
> across socket boundaries, things go into the toilet. There's no fix for
> this; that's the nature of inter-socket communication.

 argh.  ok.  so... actually accidentally, the design where i used
a single LMDB (one env) shared amongst (20 to 30) processes using
db_open to create (10 or so) databases would mitigate that...
taking a quick look at mdb.c the mutex lock is done on the env not on
the database...
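
 (a sketch of the layout i mean - made-up names, not my actual code:
one env, several named databases opened with mdb_dbi_open(), and any
write transaction on any of them goes through that same per-env writer
mutex:)

/* sketch only (made-up names): one shared env, several named sub-DBs.
 * mdb_txn_begin() for a write txn takes the *env's* writer mutex, so
 * writes to "users", "orders", ... all serialise on the same lock. */
#include "lmdb.h"

static int open_shared_env(MDB_env **envp, MDB_dbi *users, MDB_dbi *orders)
{
    MDB_txn *txn;
    int rc;

    if ((rc = mdb_env_create(envp)) != 0) return rc;
    mdb_env_set_maxdbs(*envp, 10);              /* room for ~10 named DBs */
    if ((rc = mdb_env_open(*envp, "./shared.lmdb", 0, 0664)) != 0) return rc;

    /* open the named databases inside one write transaction */
    if ((rc = mdb_txn_begin(*envp, NULL, 0, &txn)) != 0) return rc;
    if ((rc = mdb_dbi_open(txn, "users",  MDB_CREATE, users))  != 0) goto fail;
    if ((rc = mdb_dbi_open(txn, "orders", MDB_CREATE, orders)) != 0) goto fail;
    return mdb_txn_commit(txn);
fail:
    mdb_txn_abort(txn);
    return rc;
}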

 sooo compared to the previous design there would only be a 20/30-to-1
mutex contention whereas previously there were  *10 sets* of 20 or 30
to 1 mutexes all competing... and if mutexes use sockets underneath
that would explain why the inter-process communication (which also
used sockets) was so dreadful.

 huh, how about that.

do you happen to have access to a straight 8-core SMP system, or is it
relatively easy to turn off the NUMA architecture?

l.



Re: write-scaling problems in LMDB

2014-10-20 Thread Luke Kenneth Casson Leighton
On Mon, Oct 20, 2014 at 5:51 AM, Howard Chu  wrote:

>> basically i suspect a severe bug in the linux kernel, where these
>> extreme circumstances (32 or 16 processes accessing the same mmap'd
>> file, for example) have never been encountered before, so the bug is
>> simply... unknown.
>
>
> I've occasionally seen the system lockup as well; I'm pretty sure it's due
> to dirty page writeback. There are plenty of other references on the web
> about Linux dealing poorly with heavy writeback I/O.

 ugh.  that doesn't bode well for the future reliability of the system
i've just been working on.  i modified parabench.py yesterday (sent
you and david a copy) and managed to completely hang my lovely
precious macbook running debian amd64.

 also i managed to get processes to deadlock when being asked to die.
killing them off with -9 one by one, often at some point they would all
properly exit, but until the deadlocked one was found the parent would
simply sit there.

 is anyone actually dealing with this actively [poor page writeback]
in the linux kernel?

l.



Re: write-scaling problems in LMDB

2014-10-20 Thread Howard Chu

Luke Kenneth Casson Leighton wrote:

On Mon, Oct 20, 2014 at 1:53 PM, Howard Chu  wrote:

My experience from benchmarking OpenLDAP over the years is that mutexes
scale only up to a point. When you have threads grabbing the same mutex from
across socket boundaries, things go into the toilet. There's no fix for
this; that's the nature of inter-socket communication.


  argh.  ok.  so... actually accidentally, the design where i used
a single LMDB (one env) shared amongst (20 to 30) processes using
db_open to create (10 or so) databases would mitigate that...
taking a quick look at mdb.c the mutex lock is done on the env not on
the database...

  sooo compared to the previous design there would only be a 20/30-to-1
mutex contention whereas previously there were  *10 sets* of 20 or 30
to 1 mutexes all competing... and if mutexes use sockets underneath
that would explain why the inter-process communication (which also
used sockets) was so dreadful.


Note - I was talking about physical CPU sockets, not network sockets.


  huh, how about that.

do you happen to have access to a straight 8-core SMP system,


No.


or is it
relatively easy to turn off the NUMA architecture?


I can probably use taskset or something similar to restrict a process to a 
particular set of cores. What exactly do you have in mind?


--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/



Re: write-scaling problems in LMDB

2014-10-20 Thread Howard Chu

Luke Kenneth Casson Leighton wrote:

On Mon, Oct 20, 2014 at 1:00 PM, Howard Chu  wrote:

Howard Chu wrote:


Luke Kenneth Casson Leighton wrote:


http://symas.com/mdb/inmem/scaling.html

can i make the suggestion that, whilst i am aware that it is generally
not recommended for production environments to run more processes than
there are cores, you try running 128, 256 and even 512 processes all
hitting that 64-core system, and monitor its I/O usage (iostats) and
loadavg whilst doing so?



the hypothesis to test is that the performance, which should degrade
roughly linearly with the ratio of processes to cores, instead drops
like a lead balloon.





Looks to me like the system was reasonably well behaved.


  and it looks like the writer rate is approximately halving with each
doubling of the thread count from 64 onwards.

  ok, so that didn't show anything up... but wait... there's only one
writer, right?  the scenarios where i am seeing difficulties is when
there are multiple writers and readers (actually, multiple writers and
readers to multiple envs simultaneously).

  so to duplicate that scenario, it would either be necessary to modify
the benchmark to do multiple writer threads (knowing that they are
going to have contention, but that's ok) or, to be closer to the
scenario where i have observed difficulties, to run the test several
times *simultaneously* on the same database.

  *thinks* actually in order to ensure that the reads and writes
are approximately balanced, it would likely be necessary to modify the
benchmark code to allow multiple writer threads and distribute the
workload amongst them whilst at the same time keeping the number of
reader threads the same as it was previously.

  then it would be possible to make a direct comparison against the
figures you just sent, for e.g. the 32-threads case: 32 readers,
2 writers; 32 readers, 4 writers; 32 readers, 8 writers; and so on.
keeping the number of threads (write plus read) at or below the
total number of cores avoids any unnecessary context-switching.


We can do that by running two instances of the benchmark program concurrently; 
one doing a read-only job with a fixed number of threads (32) and one doing a 
write-only job with the increasing number of threads.



  the hypothesis being tested is that the writers' overall performance
remains the same, as only one may perform writes at a time.



  i know it sounds silly to do that: it sounds so obvious that it
really should not make any difference, given that no matter how many
writers there are they will always do absolutely nothing (except one
of them), and the context switching when one finishes should also be
negligible - but i know there's something wrong and i'd like to help
find out what it is.


My experience from benchmarking OpenLDAP over the years is that mutexes scale 
only up to a point. When you have threads grabbing the same mutex from across 
socket boundaries, things go into the toilet. There's no fix for this; that's 
the nature of inter-socket communication.


This test machine has 4 physical sockets but 8 NUMA nodes; internally each
"processor" in a socket is really a pair of 8-core CPUs on an MCM, which is
why there are two NUMA nodes per physical socket.


Write throughput should degrade pretty noticeably as the number of writer 
threads goes up. When we get past 8 writer threads there's no way to keep them 
all in a single NUMA domain, so at that point we should see a sharp drop in 
throughput.


--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/



Re: write-scaling problems in LMDB

2014-10-20 Thread Howard Chu

Howard Chu wrote:

Luke Kenneth Casson Leighton wrote:

http://symas.com/mdb/inmem/scaling.html

can i make the suggestion that, whilst i am aware that it is generally
not recommended for production environments to run more processes than
there are cores, you try running 128, 256 and even 512 processes all
hitting that 64-core system, and monitor its I/O usage (iostats) and
loadavg whilst doing so?


Sure, I can conduct that test and collect system stats using atop. Will let ya
know. By the way, we're using threads here, not processes. But the overall
loading behavior should be the same either way.


the hypothesis to test is that the performance, which should degrade
roughly linearly with the ratio of processes to cores, instead drops
like a lead balloon.


Threads  Run Time (Wall / User / Sys)              CPU %  DB Size   Process Size  Context Switches (Vol / Invol)  Write  Read
1        10:01.39 / 00:19:39.18 / 00:00:21.00        199  12647888  12650460        21 /    1513                 45605    275331
2        10:01.38 / 00:29:35.21 / 00:00:24.33        299  12647888  12650472        48 /    2661                 42726    528514
4        10:01.37 / 00:49:32.93 / 00:00:25.42        498  12647888  12650496        84 /    4106                 40961   1068050
8        10:01.36 / 01:29:32.68 / 00:00:23.25        897  12647888  12650756       157 /    7738                 38812   2058741
16       10:01.36 / 02:49:22.44 / 00:00:28.51       1694  12647888  12650852       345 /   16941                 33357   3857045
32       10:01.36 / 05:28:35.39 / 00:01:02.69       3288  12647888  12651308       923 /  258250                 23922   6091558
64       10:01.38 / 10:35:44.42 / 00:01:51.69       6361  12648060  12652132      1766 /  145585                 16571   8724687
128      10:01.38 / 10:36:43.09 / 00:01:45.52       6368  12649296  12654928      3276 / 2906109                  8594   9846720
256      10:01.48 / 10:36:53.05 / 00:01:36.83       6369  12649304  12658056      5365 / 3557137                  4178  10453540
512      10:02.11 / 10:36:09.58 / 00:03:00.83       6369  12649320  12664304      8303 / 3511456                  1947  10728221

Looks to me like the system was reasonably well behaved.

This is reusing a DB that had already had multiple iterations of this
benchmark run on it, so the size is larger than for a fresh DB, and it
would have significant internal fragmentation - i.e., a lot of sequential
data will be in non-adjacent pages.

The only really obvious impact is that the number of involuntary context
switches jumps up at 128 threads, which is what you'd expect since there
are fewer cores than threads. The writer gets progressively starved, and
read rates increase slightly.

--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/



Re: write-scaling problems in LMDB

2014-10-19 Thread Howard Chu

Luke Kenneth Casson Leighton wrote:

http://symas.com/mdb/inmem/scaling.html

i just looked at this and i have a sneaking suspicion that you may be
running into the same problem that i encountered when accidentally
opening 10 LMDBs 20 times by forking 30 processes *after* opening the
10 LMDBs... (and also forgetting to close them in the parent).
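
(for clarity, the shape of the mistake - a simplified sketch with made-up
names, not the real code: the envs were opened in the parent and then
inherited across fork(), which the LMDB caveats warn against; the env
should be opened *after* the fork, in each child.)

/* simplified sketch of the accidental pattern (made-up names).  WRONG:
 * the env is opened in the parent and then inherited across fork();
 * LMDB's caveats say an opened env must not be used after fork().  the
 * fix is to fork first and call mdb_env_open() in each child. */
#include <unistd.h>
#include "lmdb.h"

void worker(MDB_env *env);            /* hypothetical per-child work */

static void broken_setup(void)
{
    MDB_env *env;
    mdb_env_create(&env);
    mdb_env_open(env, "./db0.lmdb", 0, 0664);   /* opened in the parent... */

    for (int i = 0; i < 30; i++) {
        if (fork() == 0) {
            worker(env);              /* ...(mis)used in each child */
            _exit(0);
        }
    }
    /* and the parent never called mdb_env_close(env) either */
}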

what i found was that when i reduced the number of LMDBs to 3 or
below, the loadavg on the multi-core system was absolutely fine [ok,
it was around 3 to 4 but that's ok]

adding one more LMDB (remember that's 31 extra file handles opened to
a shm-mmap'd file) increased the loadavg to 6 or 7.  by the time that
was up to 10 LMDBs the loadavg had, completely unreasonably, jumped to
over 30.  i could log in over ssh - that was not a problem.  opening an
existing file for editing was ok, but trying to create new files left
applications (such as vim) blocked so badly that i often could not
even press ctrl-z in order to background the task, and had to kill the
entire ssh session.

in each test run the amount of work being done was actually relatively small.

basically i suspect a severe bug in the linux kernel, where these
extreme circumstances (32 or 16 processes accessing the same mmap'd
file, for example) have never been encountered before, so the bug is
simply... unknown.


I've occasionally seen the system lockup as well; I'm pretty sure it's due to 
dirty page writeback. There are plenty of other references on the web about 
Linux dealing poorly with heavy writeback I/O.



can i make the suggestion that, whilst i am aware that it is generally
not recommended for production environments to run more processes than
there are cores, you try running 128, 256 and even 512 processes all
hitting that 64-core system, and monitor its I/O usage (iostats) and
loadavg whilst doing so?


Sure, I can conduct that test and collect system stats using atop. Will let ya 
know. By the way, we're using threads here, not processes. But the overall 
loading behavior should be the same either way.



the hypothesis to test is that the performance, which should degrade
roughly linearly with the ratio of processes to cores, instead drops
like a lead balloon.

l.




--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/