Re: [systemd-devel] [RFC] [PATCH 0/3] journal: Add deferred log processing to reduce synchronous IO overhead

2013-12-16 Thread Karol Lewandowski
On 12/14/2013 05:12 PM, Zbigniew Jędrzejewski-Szmek wrote:
 On Fri, Dec 13, 2013 at 10:16:16PM +0100, Karol Lewandowski wrote:
 One of the problems I see, though, is that no matter how deep I make
 the queue (`max_dgram_qlen') I still see the process sleeping on send()
 way earlier than the configured queue depth would suggest.

 Are you sure that the sysctl is set early enough, before the listening
 socket is created?

I'm doing most of my testing in an nspawn-able container, so it's OK here.

I will take special precautions to make sure it's set early enough on
our handheld, though.

Thanks for the hint!

Karol


Re: [systemd-devel] [RFC] [PATCH 0/3] journal: Add deferred log processing to reduce synchronous IO overhead

2013-12-16 Thread Karol Lewandowski
On 12/14/2013 04:47 AM, Lennart Poettering wrote:
 On Fri, 13.12.13 22:16, Karol Lewandowski (lmc...@gmail.com) wrote:
 On Fri, Dec 13, 2013 at 03:45:36PM +0100, Lennart Poettering wrote:
 On Fri, 13.12.13 12:46, Karol Lewandowski (k.lewando...@samsung.com) wrote:

 One of the problems I see, though, is that no matter how deep I make
 the queue (`max_dgram_qlen') I still see the process sleeping on send()
 way earlier than the configured queue depth would suggest.

 It would be interesting to find out why this happens. I mean, there are
 three parameters here I could think of that matter: the qlen, SO_SNDBUF
 on the sender, and SO_RCVBUF on the receiver (though the latter two might
 actually change the same value on AF_UNIX? or maybe one of the latter
 two is a NOP on AF_UNIX?). If any of them reaches the limit then the
 sender will necessarily have to block.
 
 (SO_SNDBUF and SO_RCVBUF can also be controlled via
 /proc/sys/net/core/rmem* and ../wmem*... For testing purposes it might
 be easier to play around with these and set them to ludicrously high
 values...)

That's it.

While the journal code tries to set the buffer size to 8MB via the
SO_SNDBUF/SO_RCVBUF options, the kernel limits these to wmem_max/rmem_max.
On the machines I've tested, the respective values are quite small -
around 150-200kB each.
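
For reference, a minimal standalone sketch (not part of the journal code)
that makes the clamping visible: request an 8MB SO_SNDBUF and read back what
the kernel actually granted.  (The kernel stores roughly twice the clamped
value for bookkeeping, so expect about 2*wmem_max back.)

    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
            int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
            int requested = 8 * 1024 * 1024, actual;
            socklen_t len = sizeof(actual);

            if (fd < 0)
                    return 1;

            /* Succeeds even when silently capped at net.core.wmem_max */
            setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &requested, sizeof(requested));
            getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &actual, &len);

            printf("requested %d, got %d (compare /proc/sys/net/core/wmem_max)\n",
                   requested, actual);
            close(fd);
            return 0;
    }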

Increasing these did reduce context switches considerably - preliminary
tests show that I can now queue thousands of messages (~5k) without
problems.  I will test this thoroughly in the next few days.

I do wonder what is the rationale behind such low limits...

Thanks a lot!

Karol




Re: [systemd-devel] [RFC] [PATCH 0/3] journal: Add deferred log processing to reduce synchronous IO overhead

2013-12-16 Thread Lennart Poettering
On Mon, 16.12.13 15:36, Karol Lewandowski (k.lewando...@samsung.com) wrote:

 On 12/14/2013 04:47 AM, Lennart Poettering wrote:
  On Fri, 13.12.13 22:16, Karol Lewandowski (lmc...@gmail.com) wrote:
  On Fri, Dec 13, 2013 at 03:45:36PM +0100, Lennart Poettering wrote:
  On Fri, 13.12.13 12:46, Karol Lewandowski (k.lewando...@samsung.com) wrote:
 
  One of the problems I see, though, is that no matter how deep I make
  the queue (`max_dgram_qlen') I still see the process sleeping on send()
  way earlier than the configured queue depth would suggest.
 
  It would be interesting to find out why this happens. I mean, there are
  three parameters here I could think of that matter: the qlen, SO_SNDBUF
  on the sender, and SO_RCVBUF on the receiver (though the latter two might
  actually change the same value on AF_UNIX? or maybe one of the latter
  two is a NOP on AF_UNIX?). If any of them reaches the limit then the
  sender will necessarily have to block.
  
  (SO_SNDBUF and SO_RCVBUF can also be controlled via
  /proc/sys/net/core/rmem* and ../wmem*... For testing purposes it might
  be easier to play around with these and set them to ludicrously high
  values...)
 
 That's it.
 
 While the journal code tries to set the buffer size to 8MB via the
 SO_SNDBUF/SO_RCVBUF options, the kernel limits these to wmem_max/rmem_max.
 On the machines I've tested, the respective values are quite small -
 around 150-200kB each.

Hmm, so on journald's side we actually use SO_RCVBUFFORCE to
override that kernel limit. If I understood you correctly, though, then
SO_SNDBUF on the sending side is the issue here, not SO_RCVBUF on the
receiving side.

We could certainly update src/journal/journal-send.c to also use
SO_SNDBUFFORCE on the client side, but that would leave unprivileged
clients and traditional /dev/log clients out in the cold, since
SO_SNDBUFFORCE requires privileges, and the client side for /dev/log lives
in glibc, not in systemd.
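
Roughly, the client-side idea looks like the following sketch (a
hypothetical helper, not the actual journal-send.c change): try the
privileged option first, and fall back to the wmem_max-capped one when the
caller lacks CAP_NET_ADMIN.

    #include <errno.h>
    #include <sys/socket.h>

    static int bump_sndbuf(int fd, int bytes) {
            /* Privileged path: ignores the wmem_max limit */
            if (setsockopt(fd, SOL_SOCKET, SO_SNDBUFFORCE, &bytes, sizeof(bytes)) >= 0)
                    return 0;
            if (errno != EPERM)
                    return -errno;

            /* Unprivileged fallback: silently capped at wmem_max */
            if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
                    return -errno;
            return 0;
    }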

 Increasing these did reduce context switches considerably - preliminary
 tests show that I can now queue thousands of messages (~5k) without
 problems.  I will test this thoroughly in the next few days.
 
 I do wonder what is the rationale behind such low limits...

Well, usually the logic is to keep things conservative until you notice
that this creates issues.

To fix this properly and comprehensively, I'd really like to see three
changes in the kernel:

- Introduce a pair of SO_QLEN and SO_QLENFORCE sockopts to the kernel so
  that we can set the qlen per-socket

- Make the rmem/wmem defaults configurable at kernel compile time

- Increase the kernel's defaults

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] [RFC] [PATCH 0/3] journal: Add deferred log processing to reduce synchronous IO overhead

2013-12-16 Thread Lennart Poettering
On Mon, 16.12.13 16:54, Lennart Poettering (lenn...@poettering.net) wrote:

  That's it.
  
  While the journal code tries to set the buffer size to 8MB via the
  SO_SNDBUF/SO_RCVBUF options, the kernel limits these to wmem_max/rmem_max.
  On the machines I've tested, the respective values are quite small -
  around 150-200kB each.
 
 Hmm, so on journald's side we actually use SO_RCVBUFFORCE to
 override that kernel limit. If I understood you correctly, though, then
 SO_SNDBUF on the sending side is the issue here, not SO_RCVBUF on the
 receiving side.
 
 We could certainly update src/journal/journal-send.c to also use
 SO_SNDBUFFORCE on the client side, but that would leave unprivileged
 clients and traditional /dev/log clients out in the cold, since
 SO_SNDBUFFORCE requires privileges, and the client side for /dev/log lives
 in glibc, not in systemd.

I made such a change in git now. But again, this is only a very partial
solution, as it only covers native clients with privileges. It does not
cover unprivileged clients or traditional syslog() clients... Also, it
cannot influence the qlen.

But I fear the rest of this really needs to be fixed in the kernel (and
glibc); we cannot do much about this from the systemd side...

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] [RFC] [PATCH 0/3] journal: Add deferred log processing to reduce synchronous IO overhead

2013-12-15 Thread David Timothy Strauss
On Sat, Dec 14, 2013 at 8:12 AM, Zbigniew Jędrzejewski-Szmek zbys...@in.waw.pl wrote:
 Are you sure that the sysctl is set early enough, before the listening
 socket is created?

Perhaps this is why your suggestion [1] for the journal bootup issue
didn't help.

[1] https://bugs.freedesktop.org/show_bug.cgi?id=71483#c10


Re: [systemd-devel] [RFC] [PATCH 0/3] journal: Add deferred log processing to reduce synchronous IO overhead

2013-12-14 Thread Zbigniew Jędrzejewski-Szmek
On Fri, Dec 13, 2013 at 10:16:16PM +0100, Karol Lewandowski wrote:
 One of the problems I see, though, is that no matter how deep I make
 the queue (`max_dgram_qlen') I still see the process sleeping on send()
 way earlier than the configured queue depth would suggest.
Are you sure that the sysctl is set early enough, before the listening
socket is created?
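
(A sketch of why the ordering matters - hypothetical test code, the value
512 is arbitrary and the procfs path is the stock one: the qlen appears to
be sampled when the AF_UNIX socket is created, so it has to be raised
before the listening socket comes into existence.)

    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
            FILE *f = fopen("/proc/sys/net/unix/max_dgram_qlen", "w");
            int fd;

            /* 1. Raise the queue depth first (needs root)... */
            if (f) {
                    fputs("512\n", f);
                    fclose(f);
            }

            /* 2. ...then create the receiving socket, which picks the value up */
            fd = socket(AF_UNIX, SOCK_DGRAM, 0);
            if (fd < 0)
                    return 1;
            /* bind() and the rest of the listener setup would follow here */
            close(fd);
            return 0;
    }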

Zbyszek


Re: [systemd-devel] [RFC] [PATCH 0/3] journal: Add deferred log processing to reduce synchronous IO overhead

2013-12-13 Thread Karol Lewandowski
On 12/12/2013 07:05 PM, Lennart Poettering wrote:
 On Thu, 12.12.13 17:42, Karol Lewandowski (lmc...@gmail.com) wrote:
 
 Hmm, but this wouldn't be any different from old syslog, right? I mean,
 old syslog also uses sendv() and recv() on AF_UNIX SOCK_DGRAM sockets...

 That's true; however, we aren't migrating from syslog, but from the
 Android Logger.  Android provides /dev/log_{main,radio,...} device
 nodes which are simple in-kernel circular buffers apps can write to at
 any time.  This was invented when /dev/kmsg wasn't writable and serves
 more or less the same purpose.

 Since 2.6.29 it's in the mainline kernel under
 drivers/staging/android/logger.c
 
 But if the socket layer is really that much slower than Android
 logging, then that's kinda sad, and the kernel socket stuff should
 probably be fixed?

It's not that much slower when we take throughput as the only parameter:

packet   | Android | DGRAM
size [B] |  [kB/s] | [kB/s]
---------+---------+---------
      64 |  110640 |   84827
     128 |  212602 |  166866
     256 |  423880 |  328440
     512 |  815094 |  644499
    1024 | 1543326 | 1246914
    2048 | 2751674 | 2359654
    4096 | 2821260 | 4145395

However, if we take context switches into account, the picture starts
to look different:

                 | Android | DGRAM
-----------------+---------+----------------------------------
context switches |    1818 | 580443  (involuntary + voluntary)
system time [s]  |  149.97 | 118.62
user time [s]    |    4.03 |   7.45

(Columns contain cumulative data: client + server.)

Attached raw results and test program.
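
For readers without the attachment, here is a hypothetical, much reduced
stand-in for the DGRAM client side of the benchmark (socket path, message
size and iteration count are made up; the attached logger-client is the
authoritative version):

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    int main(void) {
            struct sockaddr_un sa = { .sun_family = AF_UNIX,
                                      .sun_path = "/run/test-log.sock" };
            char msg[256];
            int fd = socket(AF_UNIX, SOCK_DGRAM, 0);

            if (fd < 0 || connect(fd, (struct sockaddr *) &sa, sizeof(sa)) < 0) {
                    perror("connect");
                    return 1;
            }

            memset(msg, 'x', sizeof(msg));
            for (unsigned i = 0; i < 1000000; i++)
                    /* Blocks once max_dgram_qlen or the send buffer fills up */
                    if (send(fd, msg, sizeof(msg), 0) < 0) {
                            perror("send");
                            return 1;
                    }

            close(fd);
            return 0;
    }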

 This is why I decided to try to make logging completely async and see
 if/what changes.

 Well, if AF_UNIX is slow, then we really should try to fix that in the
 kernel instead of bypassing it in userspace... Whether we rely on the
 socket buffer of the AF_UNIX socket or put together our own buffer in
 /dev/shm shouldn't be much of a difference if they both have the same
 size and can take the same number of entries...

 I do agree.  The truth is I have looked at the linux/net/ code once but
 I didn't grok it well.  I guess it's time to take a second look. ;)
 
 Here's another option: extend journald to use kdbus as an additional
 transport. This is something we want to do anyway, since the kdbus
 transport will attach the metadata we need, without races, to each
 packet. Given that kdbus ultimately is just a way to write into an
 mmapped tmpfs that some other process owns, this should not be much worse
 than the Android logger in performance.

I doubt it would help, as the other side would still be woken up on
_every_ message, right?

Cheers,
Karol
  Client:  /usr/bin/time --verbose ./logger-client
  

Average stats:
packet size  = Bytes/s
  64 =113295768 Bytes/s = 110640 kB/s
 128 =217704976 Bytes/s = 212602 kB/s
 256 =434053344 Bytes/s = 423880 kB/s
 512 =834656704 Bytes/s = 815094 kB/s
1024 =   1580365824 Bytes/s = 1543326 kB/s
2048 =   2817714432 Bytes/s = 2751674 kB/s
4096 =   2888970749 Bytes/s = 2821260 kB/s
Command being timed: ./logger-client
User time (seconds): 2.20
System time (seconds): 74.80
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:17.00
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 624
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 204
Voluntary context switches: 1064
Involuntary context switches: 368
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0


  Server:  /usr/bin/time --verbose ./logger-server
  

 Command being timed: ./logger-server
User time (seconds): 1.83
System time (seconds): 75.17
Percent of CPU this job got: 30%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:08.72
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 480
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 167
Voluntary context switches: 242
Involuntary context switches: 144
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
 

Re: [systemd-devel] [RFC] [PATCH 0/3] journal: Add deferred log processing to reduce synchronous IO overhead

2013-12-13 Thread Lennart Poettering
On Fri, 13.12.13 12:46, Karol Lewandowski (k.lewando...@samsung.com) wrote:

  Since 2.6.29 it's in the mainline kernel under
  drivers/staging/android/logger.c
  
  But if the socket layer is really that much slower than Android
  logging, then that's kinda sad, and the kernel socket stuff should
  probably be fixed?
 
 It's not that much slower when we take throughput as the only parameter:
 
 packet   | Android | DGRAM
 size [B] |  [kB/s] | [kB/s]
 ---------+---------+---------
       64 |  110640 |   84827
      128 |  212602 |  166866
      256 |  423880 |  328440
      512 |  815094 |  644499
     1024 | 1543326 | 1246914
     2048 | 2751674 | 2359654
     4096 | 2821260 | 4145395
 
 However, if we take context switches into account, the picture starts
 to look different:
 
                  | Android | DGRAM
 -----------------+---------+----------------------------------
 context switches |    1818 | 580443  (involuntary + voluntary)
 system time [s]  |  149.97 | 118.62
 user time [s]    |    4.03 |   7.45

 (Columns contain cumulative data: client + server.)
 
 Attached raw results and test program.

Well, are you suggesting that the AF_UNIX/SOCK_DGRAM code actually hands
off the timeslice to the other side as soon as it has queued something?
That would be news to me. The kernel does context switches when the
timeslice is over, not earlier, AFAIK, and you'll get that anyway...

  This is why I decided to try to make logging completely async and see
  if/what changes.
 
  Well, if AF_UNIX is slow, then we really should try to fix that in the
  kernel instead of bypassing it in userspace... Whether we rely on the
  socket buffer of the AF_UNIX socket or put together our own buffer in
  /dev/shm shouldn't be much of a difference if they both have the same
  size and can take the same number of entries...
 
  I do agree.  The truth is I have looked at the linux/net/ code once but
  I didn't grok it well.  I guess it's time to take a second look. ;)
  
  Here's another option: extend journald to use kdbus as an additional
  transport. This is something we want to do anyway, since the kdbus
  transport will attach the metadata we need, without races, to each
  packet. Given that kdbus ultimately is just a way to write into an
  mmapped tmpfs that some other process owns, this should not be much worse
  than the Android logger in performance.
 
 I doubt it would help, as the other side would still be woken up on
 _every_ message, right?

Well, it gets woken up if it's waiting for that. But the kernel will only
give CPU time to it when the sender's timeslice is over... There's very
little difference to mmap... I mean, you need to tell the other side
that it should look in the buffer; how do you want to do that otherwise?

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] [RFC] [PATCH 0/3] journal: Add deferred log processing to reduce synchronous IO overhead

2013-12-13 Thread Karol Lewandowski
On Fri, Dec 13, 2013 at 03:45:36PM +0100, Lennart Poettering wrote:
 On Fri, 13.12.13 12:46, Karol Lewandowski (k.lewando...@samsung.com) wrote:

 Well, are you suggesting that the AF_UNIX/SOCK_DGRAM code actually hands
 off the timeslice to the other side as soon as it has queued something?

If by "other side" you mean the receiving one, then no - QNX seems to do
that, but it isn't the case here. What I'm trying to say is that the
kernel puts the process doing send(2) to sleep when (a) the queue fills up
and (b) the fd is a blocking one (otherwise we just get EAGAIN).  That's
expected, I presume.
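
A small sketch of that behaviour (not taken from the attached test program;
'fd' is assumed to be a connected AF_UNIX/SOCK_DGRAM socket): with
MSG_DONTWAIT the very same condition shows up as EAGAIN instead of the
sender being put to sleep.

    #include <errno.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    static int try_send(int fd, const void *buf, size_t len) {
            ssize_t n = send(fd, buf, len, MSG_DONTWAIT);

            if (n >= 0)
                    return 0;
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                    /* Receiver queue full: a blocking send() would sleep here
                     * (a voluntary context switch) until the receiver drains it. */
                    return -EAGAIN;
            return -errno;
    }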

One of the problems I see, though, is that no matter how deep I make
the queue (`max_dgram_qlen') I still see the process sleeping on send()
way earlier than the configured queue depth would suggest.


 That would be news to me. The kernel does context switches when the
 timeslice is over, not earlier, AFAIK, and you'll get that anyway...

If you take a look at Results.sock_dgram from my previous mail you
will find this for sock-client:

Voluntary context switches: 577696
Involuntary context switches: 633

Please correct me if I'm wrong, but I think that an involuntary context
switch happens when the process' timeslice is over, whereas a voluntary
one is caused by a syscall, probably doing some form of IO
(read/send/select/etc.).
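
(For reference, the same counters can also be read from inside a test
program with getrusage(); a minimal sketch, not code from the attachment:)

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void) {
            struct rusage ru;

            if (getrusage(RUSAGE_SELF, &ru) < 0)
                    return 1;

            /* ru_nvcsw: gave up the CPU voluntarily (blocking syscalls),
             * ru_nivcsw: was preempted when the timeslice ran out */
            printf("voluntary: %ld  involuntary: %ld\n",
                   ru.ru_nvcsw, ru.ru_nivcsw);
            return 0;
    }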

If I make sock-server sleep(5) just before poll and strace the client,
I can clearly see that it blocks on send(). After the receiving side picks
the packet up, the client is able to send another one.

This is what I think is happening; please let me know if this sounds like
utter nonsense to you. :)


   Here's another option: extend journald to use kdbus as an additional
   transport. This is something we want to do anyway, since the kdbus
   transport will attach the metadata we need, without races, to each
   packet. Given that kdbus ultimately is just a way to write into an
   mmapped tmpfs that some other process owns, this should not be much worse
   than the Android logger in performance.
  
  I doubt it would help, as the other side would still be woken up on
  _every_ message, right?
 
 Well, it gets woken up if it's waiting for that. But the kernel will only
 give CPU time to it when the sender's timeslice is over...

I'm either doing something terribly wrong, or the timeslice can end due
to a call to send(2) and friends.


 There's very
 little difference to mmap... I mean, you need to tell the other side
 that it should look in the buffer; how do you want to do that otherwise?

I don't.  I just want the buffer to be huge enough to not cause the
client to block, effectively waiting for the receiving side to pick
stuff up.

Cheers,
Karol


[systemd-devel] [RFC] [PATCH 0/3] journal: Add deferred log processing to reduce synchronous IO overhead

2013-12-12 Thread Karol Lewandowski
Folks,

We are trying to adopt journald as the default (and only) logging system
for our ARM-based handhelds.  There is one major problem, though -
performance.

We have found that journald can consume a considerable amount of CPU
time under moderate load (10-20% of CPU for, say, 100-300 msg/s).

We have profiled journald and there are a few places that could benefit
from _some_ optimizations, but I have come to the conclusion that it isn't
the journald code that is the real bottleneck - it is, or seems to be, the
synchronous nature of send()/recv().  In the default configuration each
send() can cause a context switch, which is especially costly on ARM.

This is why I decided to try to make logging completely async and see
if/what changes.

The following patchset (based on top of f20c84c15f3) implements the
pickup logic in journald [patch 1] and unconditionally enables it in the
client API [patch 2].

In short - instead of sending DGRAMs, I drop files into /dev/shm
which are then picked up by journald at specified intervals.

(More precisely, I have (ab)used the mechanism used to pass an fd via a
 DGRAM when the message is too big.)
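
A very rough sketch of that mechanism (illustrative only - the real code
lives in src/journal/journal-send.c; the /dev/shm template and the message
are made up): write the record into an unlinked file and hand its fd to
journald via SCM_RIGHTS.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int send_fd(int sock, int fd) {
            char dummy = 0;
            struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
            union {
                    struct cmsghdr hdr;
                    char buf[CMSG_SPACE(sizeof(int))];
            } control;
            struct msghdr msg = { 0 };
            struct cmsghdr *cmsg;

            memset(&control, 0, sizeof(control));
            msg.msg_iov = &iov;
            msg.msg_iovlen = 1;
            msg.msg_control = control.buf;
            msg.msg_controllen = sizeof(control.buf);

            cmsg = CMSG_FIRSTHDR(&msg);
            cmsg->cmsg_level = SOL_SOCKET;
            cmsg->cmsg_type = SCM_RIGHTS;
            cmsg->cmsg_len = CMSG_LEN(sizeof(int));
            memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

            return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
    }

    int main(void) {
            int journal_fd = -1;  /* would be the connected socket to journald */
            char path[] = "/dev/shm/journal-XXXXXX";
            int fd = mkstemp(path);

            if (fd < 0)
                    return 1;
            unlink(path);         /* keep only the fd alive */
            dprintf(fd, "MESSAGE=hello from the deferred path\n");

            if (journal_fd >= 0)
                    send_fd(journal_fd, fd);

            close(fd);
            return 0;
    }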

Patch 3 contains a standalone test program, which floods journald with a
specified number of messages per second.

On my development (x86) machine I ran the following test with good old
time(1):


  # send 2000 messages (each of size 100 bytes) to journal per second
  # for ~5mins
  ./journal-feeder 2000 100 & sleep $((60*5 - 10)) && kill $!


Unpatched systemd:
==================

  /usr/bin/time --verbose ./systemd-journald & sleep $((60*5)) && kill $(pidof systemd-journald)
User time (seconds): 16.87
System time (seconds): 39.38
Percent of CPU this job got: 18%
Elapsed (wall clock) time (h:mm:ss or m:ss): 5:00.06
Major (requiring I/O) page faults: 30
Minor (reclaiming a frame) page faults: 1061740
Voluntary context switches: 496757
Involuntary context switches: 497
File system inputs: 0
File system outputs: 50592


Deferred pickup every 5 secs (logs passed via /dev/shm/FILE rather than DGRAMs):
================================================================================

  /usr/bin/time --verbose ./systemd-journald & sleep $((60*5)) && kill $(pidof systemd-journald)
User time (seconds): 8.27
System time (seconds): 8.40
Percent of CPU this job got: 5%
Elapsed (wall clock) time (h:mm:ss or m:ss): 5:00.36
Major (requiring I/O) page faults: 180
Minor (reclaiming a frame) page faults: 38976
Voluntary context switches: 157
Involuntary context switches: 1933
File system inputs: 8
File system outputs: 307312


Deferred pickup every 1 sec (logs passed via /dev/shm/FILE rather than DGRAMs):
===============================================================================

  /usr/bin/time --verbose ./systemd-journald & sleep $((60*5)) && kill $(pidof systemd-journald)
User time (seconds): 9.54
System time (seconds): 9.56
Percent of CPU this job got: 6%
Elapsed (wall clock) time (h:mm:ss or m:ss): 5:00.15
Major (requiring I/O) page faults: 180
Minor (reclaiming a frame) page faults: 39034
Voluntary context switches: 382
Involuntary context switches: 2359
File system inputs: 0
File system outputs: 307768


This is quite a naive test, as far more changes when we pass an fd
instead of a DGRAM (ucred is NULL, which means we won't gather information
from /proc at all).  However, we verified that the cost of this is rather
minor and not that important in this test.

Of course, I'm not proposing to include it as is (unconditionally
enabled for all messages), but I would just like to hear your opinion
about this idea.

If I read these numbers correctly, there seems to be a considerable
benefit in terms of CPU time.  Something like that wouldn't be useful
for all messages, but I think that the gazillions of debug messages that
our apps produce could be sent in a non-synchronous and low-overhead
manner.

RFC.

Thanks in advance!


Karol Lewandowski (3):
  journald: Add deferred log processing logic
  journal: Try to pass log via /dev/shm to avoid waking up journal
  journal-feeder: trivial test program to flood journald

 journal-feeder.c |  64 +++
 src/journal/journal-send.c   |  39 ++
 src/journal/journald-gperf.gperf |   2 +
 src/journal/journald-server.c| 107 +++
 src/journal/journald-server.h|   5 ++
 src/journal/journald.conf|   2 +
 6 files changed, 219 insertions(+)
 create mode 100644 journal-feeder.c

-- 
1.8.4.rc3



Re: [systemd-devel] [RFC] [PATCH 0/3] journal: Add deferred log processing to reduce synchronous IO overhead

2013-12-12 Thread Lennart Poettering
On Thu, 12.12.13 14:12, Karol Lewandowski (k.lewando...@samsung.com) wrote:

 Folks,
 
 We are trying to adopt journald as the default (and only) logging system
 for our ARM-based handhelds.  There is one major problem, though -
 performance.

 We have found that journald can consume a considerable amount of CPU
 time under moderate load (10-20% of CPU for, say, 100-300 msg/s).

 We have profiled journald and there are a few places that could benefit
 from _some_ optimizations, but I have come to the conclusion that it isn't
 the journald code that is the real bottleneck - it is, or seems to be, the
 synchronous nature of send()/recv().  In the default configuration each
 send() can cause a context switch, which is especially costly on ARM.

Hmm, but this wouldn't be any different from old syslog, right? I mean,
old syslog also uses sendv() and recv() on AF_UNIX SOCK_DGRAM sockets...

Note that on Linux the number of datagrams you can queue in an
AF_UNIX/SOCK_DGRAM socket is very low (15). This is controlled via
/proc/sys/net/unix/max_dgram_qlen. When this limit is hit, the client
needs to wait for the server to finish processing. Unfortunately there's
only that system-global way to increase the qlen. For a long time it has
been on our wishlist to make this a per-socket tunable with a new
sockopt() that behaves similarly to SO_SNDBUF and friends...

 This is why I decided to try to make logging completely async and see
 if/what changes.

Well, if AF_UNIX is slow, then we really should try to fix that in the
kernel instead of bypassing it in userspace... Whether we rely on the
socket buffer of the AF_UNIX socket or put together our own buffer in
/dev/shm shouldn't be much of a difference if they both have the same
size and can take the same number of entries...

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] [RFC] [PATCH 0/3] journal: Add deferred log processing to reduce synchronous IO overhead

2013-12-12 Thread Karol Lewandowski
On Thu, Dec 12, 2013 at 03:51:05PM +0100, Lennart Poettering wrote:
 On Thu, 12.12.13 14:12, Karol Lewandowski (k.lewando...@samsung.com) wrote:

(OP here, replying from my private email address.)

  Folks,
  
  We are trying to adopt journald as the default (and only) logging system
  for our ARM-based handhelds.  There is one major problem, though -
  performance.

  We have found that journald can consume a considerable amount of CPU
  time under moderate load (10-20% of CPU for, say, 100-300 msg/s).

  We have profiled journald and there are a few places that could benefit
  from _some_ optimizations, but I have come to the conclusion that it isn't
  the journald code that is the real bottleneck - it is, or seems to be, the
  synchronous nature of send()/recv().  In the default configuration each
  send() can cause a context switch, which is especially costly on ARM.
 
 Hmm, but this wouldn't be any different from old syslog, right? I mean,
 old syslog also uses sendv() and recv() on AF_UNIX SOCK_DGRAM sockets...

That's true; however, we aren't migrating from syslog, but from the
Android Logger.  Android provides /dev/log_{main,radio,...} device
nodes which are simple in-kernel circular buffers apps can write to at
any time.  This was invented when /dev/kmsg wasn't writable and serves
more or less the same purpose.

Since 2.6.29 it's in the mainline kernel under drivers/staging/android/logger.c
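
For comparison, writing a record to that driver boils down to a single
writev() - a hedged sketch based on the staging driver's expected format
(priority byte, NUL-terminated tag, message), not on code from this thread:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void) {
            int fd = open("/dev/log_main", O_WRONLY);
            unsigned char prio = 4;                    /* ANDROID_LOG_INFO */
            const char *tag = "test";
            const char *msg = "hello from the benchmark";
            struct iovec iov[3] = {
                    { .iov_base = &prio,        .iov_len = 1 },
                    { .iov_base = (void *) tag, .iov_len = strlen(tag) + 1 },
                    { .iov_base = (void *) msg, .iov_len = strlen(msg) + 1 },
            };

            if (fd < 0)
                    return 1;
            writev(fd, iov, 3);                        /* one atomic log record */
            close(fd);
            return 0;
    }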


 Note that on Linux the number of datagrams you can queue in an
 AF_UNIX/SOCK_DGRAM socket is very low (15). This is controlled via
 /proc/sys/net/unix/max_dgram_qlen. When this limit is hit, the client
 needs to wait for the server to finish processing. Unfortunately there's
 only that system-global way to increase the qlen. For a long time it has
 been on our wishlist to make this a per-socket tunable with a new
 sockopt() that behaves similarly to SO_SNDBUF and friends...

We have played with this tunable and it didn't bring
considerable/measurable improvements, unfortunately.

 
  This is why I decided to try to make logging completely async and see
  if/what changes.
 
 Well, if AF_UNIX is slow, then we really should try to fix that in the
 kernel instead of bypassing it in userspace... Whether we rely on the
 socket buffer of the AF_UNIX socket or put together our own buffer in
 /dev/shm shouldn't be much of a difference if they both have the same
 size and can take the same number of entries...

I do agree.  The truth is I have looked at the linux/net/ code once but
I didn't grok it well.  I guess it's time to take a second look. ;)

Cheers,
Karol


Re: [systemd-devel] [RFC] [PATCH 0/3] journal: Add deferred log processing to reduce synchronous IO overhead

2013-12-12 Thread Lennart Poettering
On Thu, 12.12.13 17:42, Karol Lewandowski (lmc...@gmail.com) wrote:

  Hmm, but this wouldn't be any different from old syslog, right? I mean,
  old syslog also uses sendv() and recv() on AF_UNIX SOCK_DGRAM sockets...
 
 That's true; however, we aren't migrating from syslog, but from the
 Android Logger.  Android provides /dev/log_{main,radio,...} device
 nodes which are simple in-kernel circular buffers apps can write to at
 any time.  This was invented when /dev/kmsg wasn't writable and serves
 more or less the same purpose.
 
 Since 2.6.29 it's in the mainline kernel under
 drivers/staging/android/logger.c

But if the socket layer is really that much slower than Android
logging, then that's kinda sad, and the kernel socket stuff should
probably be fixed?

   This is why I decided to try to make logging completely async and see
   if/what changes.
  
  Well, if AF_UNIX is slow, then we really should try to fix that in the
  kernel instead of bypassing it in userspace... Whether we rely on the
  socket buffer of the AF_UNIX socket or put together our own buffer in
  /dev/shm shouldn't be much of a difference if they both have the same
  size and can take the same number of entries...
 
 I do agree.  The truth is I have looked at the linux/net/ code once but
 I didn't grok it well.  I guess it's time to take a second look. ;)

Here's another option: extend journald to use kdbus as an additional
transport. This is something we want to do anyway, since the kdbus
transport will attach the metadata we need, without races, to each
packet. Given that kdbus ultimately is just a way to write into an
mmapped tmpfs that some other process owns, this should not be much worse
than the Android logger in performance.

Special care needs to be taken to make this work though, as we should do
this only on kdbus systems. On non-kdbus systems journald can never talk
to dbus, because dbus-daemon is a client of journald and things would
deadlock...

Lennart

-- 
Lennart Poettering, Red Hat