Re: [systemd-devel] How to debug occasional hashmap corruption?

2018-11-14 Thread juice

juice kirjoitti 2018-11-06 14:30:

Lennart Poettering kirjoitti 2018-11-06 12:27:

On Di, 06.11.18 11:57, juice (ju...@swagman.org) wrote:



Hi,

During the past half year I have seen systemd dump core three times 
due

to what I suspect a hashmap corruption or race.
Each time it looks a bit different and is triggered by different 
things

but it somehow centers on hashmap operations.

What would be the prefered way to debug this? I cannot add huge 
logging

as this is something that happens once in a blue moon and always in
different compute nodes.
Is there some way I could easily test it by increasing the chance of 
such

corruption/race happening?


This looks very much like a memory corruption of some sorts and
valgrind should be the tool of choice to track that down.

Lennart


Thanks tor the prompt reply, Lennart.

I agree; using valgrind indeed was something already considered, 
however I

suspect it might add some overhead in systemd operation?


I have been trying to start systemd under valgrind but seems it is not a 
trivial
task to do. Moreover, no searching has revealed a general receipe for 
doing that
other than the advice in systemd README's to compile with 
-Dvalgrind=true option.


So, where could I find information on how to set up memory corruption 
debug on

a live system for testing?

 - juice -

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] How to debug occasional hashmap corruption?

2018-11-06 Thread Vito Caputo
On Tue, Nov 06, 2018 at 02:30:19PM +0200, juice wrote:
> Lennart Poettering kirjoitti 2018-11-06 12:27:
> > On Di, 06.11.18 11:57, juice (ju...@swagman.org) wrote:
> > 
> > > 
> > > Hi,
> > > 
> > > During the past half year I have seen systemd dump core three times
> > > due
> > > to what I suspect a hashmap corruption or race.
> > > Each time it looks a bit different and is triggered by different
> > > things
> > > but it somehow centers on hashmap operations.
> > > 
> > > What would be the prefered way to debug this? I cannot add huge
> > > logging
> > > as this is something that happens once in a blue moon and always in
> > > different compute nodes.
> > > Is there some way I could easily test it by increasing the chance of
> > > such
> > > corruption/race happening?
> > 
> > This looks very much like a memory corruption of some sorts and
> > valgrind should be the tool of choice to track that down.
> > 
> > Lennart
> 
> Thanks tor the prompt reply, Lennart.
> 
> I agree; using valgrind indeed was something already considered, however I
> suspect it might add some overhead in systemd operation?
> 
> The question here was more on the lines how to trigger the problem?
> It is quite rare as it seems the occurrance is about once per two months on
> our QL3 test pool which contains hunderds of VM guests...
> It would be impractical to build and deploy a release which contains systemd
> running under valgrind on every node! :)
> 

In such scenarios where valgrind's overhead is impractical, I'd give
address sanitizer a try.

https://clang.llvm.org/docs/AddressSanitizer.html

Regards,
Vito Caputo
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] How to debug occasional hashmap corruption?

2018-11-06 Thread John Reiser

On 11/6/18 9:57 UTC, juice wrote:


During the past half year I have seen systemd dump core three times due
to what I suspect a hashmap corruption or race.
Each time it looks a bit different and is triggered by different things
but it somehow centers on hashmap operations.


Three intermittent hardware failures in one year on 10,000 boxes is normal.
Keep good records.  If the same box appears twice, then physically destroy it.

Meanwhile, log all events to a circular buffer that just keeps rotating:
date+time (32 bits, 1 microsecond precision), caller (return address),
argument summary (fixed format: string prefixes or hash).  Analyze the dump.

Lock each hashmap operation to insure single-threaded operation,t;
prevent even multiple [supposedly] read-only access.
Lock each signal handler: only one instance of a given signal at a time.

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] How to debug occasional hashmap corruption?

2018-11-06 Thread Dorian ROSSE
For repair a hash corruption which a signature for each file are the BASS 
framework from Thalos cybersecurity worker that add this program on Clam AV for 
check all files signatures ;)

Regards.


Dorian Rosse.


De : systemd-devel  de la part de 
juice 
Envoyé : mardi, novembre 6, 2018 11:03
À : systemd-devel@lists.freedesktop.org
Objet : [systemd-devel] How to debug occasional hashmap corruption?


Hi,

During the past half year I have seen systemd dump core three times due
to what I suspect a hashmap corruption or race.
Each time it looks a bit different and is triggered by different things
but it somehow centers on hashmap operations.

What would be the prefered way to debug this? I cannot add huge logging
as this is something that happens once in a blue moon and always in
different compute nodes.
Is there some way I could easily test it by increasing the chance of
such
corruption/race happening?

The systemd version we currently have is 238 and the linux kernel is
4.14.69

Here is the latest coredump trace of the problem.

[root@mn-0 robot]#
[root@mn-0 robot]# gdb /lib/systemd/systemd
./core.systemd.0.6db5fc1f8d114369ae625ab03660c46e.1437.153926661200
GNU gdb (GDB) Fedora 8.1-14.wf29
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later

This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show
copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
.

|
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /lib/systemd/systemd...Reading symbols from
/usr/lib/debug/usr/lib/systemd/systemd-238-10.wf29.x86_64.debug...done.
done.
[New LWP 1437]
[New LWP 1]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/lib/systemd/systemd --switched-root --system
--deserialize 16'.
Program terminated with signal SIGABRT, Aborted.
#0 0x7f13d7e2e1c7 in kill () from /lib64/libc.so.6
[Current thread is 1 (LWP 1437)]
Missing separate debuginfos, use: dnf debuginfo-install
audit-libs-2.8.3-4.wf29.x86_64 glibc-2.27-15.wf29.x86_64
kmod-libs-25-2.wf29.x86_64 libacl-2.2.52-20.wf29.x86_64 libattr-2.4.47-2
3.wf29.x86_64 libblkid-2.32-5.wf29.x86_64 libcap-2.25-10.wf29.x86_64
libcap-ng-0.7.8-9.wf29.x86_64 libgcc-7.3.1-5.wf29.x86_64
libmount-2.32-5.wf29.x86_64 libuuid-2.32-5.wf29.x86_64 lz4-
libs-1.8.1.2-4.wf29.x86_64 pam-1.3.0-13.wf29.x86_64
xz-libs-5.2.3-6.wf29.x86_64 zlib-1.2.11-5.wf29.x86_64
(gdb)
(gdb) bt
#0 0x7f13d7e2e1c7 in kill () from /lib64/libc.so.6
#1 0x5607e9a93887 in crash (sig=6) at ../src/core/main.c:196
#2 
#3 0x7f13d7e2dedb in raise () from /lib64/libc.so.6
#4 0x7f13d7e2f291 in abort () from /lib64/libc.so.6
#5 0x7f13d937739a in log_assert_failed_realm
(realm=realm@entry=LOG_REALM_SYSTEMD, text=text@entry=0x7f13d9423042
"x",
file=file@entry=0x7f13d9435e08
"../src/libsystemd/sd-bus/bus-objects.c", line=line@entry=1618,
func=func@entry=0x7f13d94366d0 <__PRETTY_FUNCTION__.13175>
"vtable_member_compare_func") at ../src/basic/log.c:826
#6 0x7f13d93bdca1 in vtable_member_compare_func (a=,
b=) at ../src/libsystemd/sd-bus/bus-objects.c:1618
#7 0x7f13d936d4bb in base_bucket_scan (h=h@entry=0x7f13d98f9740,
idx=412, idx@entry=411, key=key@entry=0x5607eb85b700) at
../src/basic/hashmap.c:1251
#8 0x7f13d936ea11 in hashmap_put (h=0x7f13d98f9740,
key=key@entry=0x5607eb85b700, value=value@entry=0x5607eb85b700) at
../src/basic/hashmap.c:1268
#9 0x7f13d93c0066 in add_object_vtable_internal
(bus=bus@entry=0x5607eb7b2e00, slot=slot@entry=0x0,
path=path@entry=0x5607e9b51fc8 "/org/freedesktop/systemd1/unit",
interface=, interface@entry=0x7f13d9418b30
"org.freedesktop.systemd1.Service", vtable=,
fallback=fallback@entry=true,
find=0x5607e9aa1080 ,
userdata=0x5607eb7273c0) at ../src/libsystemd/sd-bus/bus-objects.c:1739
#10 0x7f13d93c0515 in sd_bus_add_fallback_vtable
(bus=bus@entry=0x5607eb7b2e00, slot=slot@entry=0x0,
prefix=prefix@entry=0x5607e9b51fc8 "/org/freedesktop/systemd1/unit",
interface=interface@entry=0x7f13d9418b30
"org.freedesktop.systemd1.Service", vtable=,
find=find@entry=0x5607e9aa1080 ,
userdata=0x5607eb7273c0) at
../src/libsystemd/sd-bus/bus-objects.c:1849
#11 0x5607e9aa08be in bus_setup_api_vtables
(m=m@entry=0x5607eb7273c0, bus=0x5607eb7b2e00) at ../src/core/dbus.c:573
#12 0x5607e9aa1fc2 in bus_on_connection (s=,
fd=, revents=, userdata=0x5607eb7273c0) at
../src/core/dbus.c:711
#13 0x7f13d93f35f8 in source_dispatch (s=s@entry=0x5607eb8162c0) at

Re: [systemd-devel] How to debug occasional hashmap corruption?

2018-11-06 Thread juice

Lennart Poettering kirjoitti 2018-11-06 12:27:

On Di, 06.11.18 11:57, juice (ju...@swagman.org) wrote:



Hi,

During the past half year I have seen systemd dump core three times 
due

to what I suspect a hashmap corruption or race.
Each time it looks a bit different and is triggered by different 
things

but it somehow centers on hashmap operations.

What would be the prefered way to debug this? I cannot add huge 
logging

as this is something that happens once in a blue moon and always in
different compute nodes.
Is there some way I could easily test it by increasing the chance of 
such

corruption/race happening?


This looks very much like a memory corruption of some sorts and
valgrind should be the tool of choice to track that down.

Lennart


Thanks tor the prompt reply, Lennart.

I agree; using valgrind indeed was something already considered, however 
I

suspect it might add some overhead in systemd operation?

The question here was more on the lines how to trigger the problem?
It is quite rare as it seems the occurrance is about once per two months 
on

our QL3 test pool which contains hunderds of VM guests...
It would be impractical to build and deploy a release which contains 
systemd

running under valgrind on every node! :)


--
   - Juice -
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] How to debug occasional hashmap corruption?

2018-11-06 Thread Lennart Poettering
On Di, 06.11.18 11:57, juice (ju...@swagman.org) wrote:

> 
> Hi,
> 
> During the past half year I have seen systemd dump core three times due
> to what I suspect a hashmap corruption or race.
> Each time it looks a bit different and is triggered by different things
> but it somehow centers on hashmap operations.
> 
> What would be the prefered way to debug this? I cannot add huge logging
> as this is something that happens once in a blue moon and always in
> different compute nodes.
> Is there some way I could easily test it by increasing the chance of such
> corruption/race happening?

This looks very much like a memory corruption of some sorts and
valgrind should be the tool of choice to track that down.

Lennart

-- 
Lennart Poettering, Red Hat
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel