Re: [Freeipa-devel] [PATCH 0291-0294] Fix locking to prevent crashes and deadlocks
On 11/09/14 21:58, Petr Spacek wrote: On 11.9.2014 18:34, Martin Basti wrote: On 11/09/14 15:57, Martin Basti wrote: On 11/09/14 11:59, Petr Spacek wrote: Hello, I was fighting with random crashes for couple of days ... and discovered that run_exclusive_enter()/isc_task_beginexclusive() usage was completely incorrect and didn't actually lock anything. This series of patches reworks internal locking (and related event system) to work around limitations of isc_task_beginexclusive() mechanism. It would be better to get rid of isc_task_beginexclusive() completely but IMHO it is not possible because of BIND's dns_view*() functions have to be guarded with it. Testing is going to be interesting because we are speaking about race conditions. I used ~ 100 DNS zones, each zone had ~ 100 random domain names inside with random A//TXT RRs. My LDIF is here: http://people.redhat.com/~pspacek/a/2014/09/11/dns-test.ldif.xz I was able to randomly reproduce various crashes when BIND was running with more threads than usually. You can try to run BIND with this command (as root) and play games with -n parameter: $ export KRB5_KTNAME=/etc/named.keytab $ named -4 -g -u named -m record -n 10 Please test also the case where BIND receives SIGINT during start-up. It is possible to run BIND with commands above and wait for message: 11-Sep-2014 11:54:58.092 running At this point send SIGINT (CTRL+C) to BIND and see what happens. It could crash or deadlock. It is necessary to send the signal before BIND prints this message: 11-Sep-2014 11:55:11.707 zone z1.test/IN: loaded serial 1410429304 Let me know if you need any assistance. I need your assistance, I haven't been able to reproduce it. Martin I applied the patchset, and NACK I don't understand how I could possibly miss this. I was convinced that the patch set was thoroughly tested ... Anyway, attached patch should fix the problem you were facing. Please re-test it. Thank you! Petr^2 Spacek #1 If named is running and I randomly choose few zones and delete them, it causes named failure dig @localhost A r1.z12.test ; DiG 9.9.4-P2-RedHat-9.9.4-12.P2.fc20 @localhost A r1.z12.test ; (2 servers found) ;; global options: +cmd ;; connection timed out; no servers could be reached * SIGINT doesn't work * rndc doesn't work * DS worksSIGINT signal stops working. Output: snip 11-Sep-2014 11:26:37.495 client 127.0.0.1#62615: received notify for zone 'z99.test' ^C^C^C^C^C^C^C^C Process: named29125 1.1 2.9 789972 45976 pts/0Sl+ 11:26 0:02 named -4 -g -u named -m record -n 10 I have to kill it with kill -9 #2 same as #1 If new zone is added, #3 same as #1 If new record is added #4 same as #1 If record is deleted Functional ACK -- Martin Basti ___ Freeipa-devel mailing list Freeipa-devel@redhat.com https://www.redhat.com/mailman/listinfo/freeipa-devel
Re: [Freeipa-devel] [PATCH 0291-0294] Fix locking to prevent crashes and deadlocks
On 12.9.2014 14:45, Martin Basti wrote: On 11/09/14 21:58, Petr Spacek wrote: On 11.9.2014 18:34, Martin Basti wrote: On 11/09/14 15:57, Martin Basti wrote: On 11/09/14 11:59, Petr Spacek wrote: Hello, I was fighting with random crashes for couple of days ... and discovered that run_exclusive_enter()/isc_task_beginexclusive() usage was completely incorrect and didn't actually lock anything. This series of patches reworks internal locking (and related event system) to work around limitations of isc_task_beginexclusive() mechanism. It would be better to get rid of isc_task_beginexclusive() completely but IMHO it is not possible because of BIND's dns_view*() functions have to be guarded with it. Testing is going to be interesting because we are speaking about race conditions. I used ~ 100 DNS zones, each zone had ~ 100 random domain names inside with random A//TXT RRs. My LDIF is here: http://people.redhat.com/~pspacek/a/2014/09/11/dns-test.ldif.xz I was able to randomly reproduce various crashes when BIND was running with more threads than usually. You can try to run BIND with this command (as root) and play games with -n parameter: $ export KRB5_KTNAME=/etc/named.keytab $ named -4 -g -u named -m record -n 10 Please test also the case where BIND receives SIGINT during start-up. It is possible to run BIND with commands above and wait for message: 11-Sep-2014 11:54:58.092 running At this point send SIGINT (CTRL+C) to BIND and see what happens. It could crash or deadlock. It is necessary to send the signal before BIND prints this message: 11-Sep-2014 11:55:11.707 zone z1.test/IN: loaded serial 1410429304 Let me know if you need any assistance. I need your assistance, I haven't been able to reproduce it. Martin ... Functional ACK Pushed to master: 1d23e27e0198cacdbbb14ac13ff039c33546d9af 09386eb446ac8d9c12110d21f2d4da36c1b792bf e7d8f6f175eae795c603eee00d06d5799b2d6c39 17a29e20406ad9db0b28fcb6ec4b02394385fe53 -- Petr^2 Spacek ___ Freeipa-devel mailing list Freeipa-devel@redhat.com https://www.redhat.com/mailman/listinfo/freeipa-devel
Re: [Freeipa-devel] [PATCH 0291-0294] Fix locking to prevent crashes and deadlocks
On 11/09/14 11:59, Petr Spacek wrote: Hello, I was fighting with random crashes for couple of days ... and discovered that run_exclusive_enter()/isc_task_beginexclusive() usage was completely incorrect and didn't actually lock anything. This series of patches reworks internal locking (and related event system) to work around limitations of isc_task_beginexclusive() mechanism. It would be better to get rid of isc_task_beginexclusive() completely but IMHO it is not possible because of BIND's dns_view*() functions have to be guarded with it. Testing is going to be interesting because we are speaking about race conditions. I used ~ 100 DNS zones, each zone had ~ 100 random domain names inside with random A//TXT RRs. My LDIF is here: http://people.redhat.com/~pspacek/a/2014/09/11/dns-test.ldif.xz I was able to randomly reproduce various crashes when BIND was running with more threads than usually. You can try to run BIND with this command (as root) and play games with -n parameter: $ export KRB5_KTNAME=/etc/named.keytab $ named -4 -g -u named -m record -n 10 Please test also the case where BIND receives SIGINT during start-up. It is possible to run BIND with commands above and wait for message: 11-Sep-2014 11:54:58.092 running At this point send SIGINT (CTRL+C) to BIND and see what happens. It could crash or deadlock. It is necessary to send the signal before BIND prints this message: 11-Sep-2014 11:55:11.707 zone z1.test/IN: loaded serial 1410429304 Let me know if you need any assistance. I need your assistance, I haven't been able to reproduce it. Martin -- Martin Basti ___ Freeipa-devel mailing list Freeipa-devel@redhat.com https://www.redhat.com/mailman/listinfo/freeipa-devel
Re: [Freeipa-devel] [PATCH 0291-0294] Fix locking to prevent crashes and deadlocks
On 11/09/14 15:57, Martin Basti wrote: On 11/09/14 11:59, Petr Spacek wrote: Hello, I was fighting with random crashes for couple of days ... and discovered that run_exclusive_enter()/isc_task_beginexclusive() usage was completely incorrect and didn't actually lock anything. This series of patches reworks internal locking (and related event system) to work around limitations of isc_task_beginexclusive() mechanism. It would be better to get rid of isc_task_beginexclusive() completely but IMHO it is not possible because of BIND's dns_view*() functions have to be guarded with it. Testing is going to be interesting because we are speaking about race conditions. I used ~ 100 DNS zones, each zone had ~ 100 random domain names inside with random A//TXT RRs. My LDIF is here: http://people.redhat.com/~pspacek/a/2014/09/11/dns-test.ldif.xz I was able to randomly reproduce various crashes when BIND was running with more threads than usually. You can try to run BIND with this command (as root) and play games with -n parameter: $ export KRB5_KTNAME=/etc/named.keytab $ named -4 -g -u named -m record -n 10 Please test also the case where BIND receives SIGINT during start-up. It is possible to run BIND with commands above and wait for message: 11-Sep-2014 11:54:58.092 running At this point send SIGINT (CTRL+C) to BIND and see what happens. It could crash or deadlock. It is necessary to send the signal before BIND prints this message: 11-Sep-2014 11:55:11.707 zone z1.test/IN: loaded serial 1410429304 Let me know if you need any assistance. I need your assistance, I haven't been able to reproduce it. Martin I applied the patchset, and NACK #1 If named is running and I randomly choose few zones and delete them, it causes named failure dig @localhost A r1.z12.test ; DiG 9.9.4-P2-RedHat-9.9.4-12.P2.fc20 @localhost A r1.z12.test ; (2 servers found) ;; global options: +cmd ;; connection timed out; no servers could be reached * SIGINT doesn't work * rndc doesn't work * DS worksSIGINT signal stops working. Output: snip 11-Sep-2014 11:26:37.495 client 127.0.0.1#62615: received notify for zone 'z99.test' ^C^C^C^C^C^C^C^C Process: named29125 1.1 2.9 789972 45976 pts/0Sl+ 11:26 0:02 named -4 -g -u named -m record -n 10 I have to kill it with kill -9 #2 same as #1 If new zone is added, #3 same as #1 If new record is added #4 same as #1 If record is deleted -- Martin Basti ___ Freeipa-devel mailing list Freeipa-devel@redhat.com https://www.redhat.com/mailman/listinfo/freeipa-devel
Re: [Freeipa-devel] [PATCH 0291-0294] Fix locking to prevent crashes and deadlocks
On 11.9.2014 18:34, Martin Basti wrote: On 11/09/14 15:57, Martin Basti wrote: On 11/09/14 11:59, Petr Spacek wrote: Hello, I was fighting with random crashes for couple of days ... and discovered that run_exclusive_enter()/isc_task_beginexclusive() usage was completely incorrect and didn't actually lock anything. This series of patches reworks internal locking (and related event system) to work around limitations of isc_task_beginexclusive() mechanism. It would be better to get rid of isc_task_beginexclusive() completely but IMHO it is not possible because of BIND's dns_view*() functions have to be guarded with it. Testing is going to be interesting because we are speaking about race conditions. I used ~ 100 DNS zones, each zone had ~ 100 random domain names inside with random A//TXT RRs. My LDIF is here: http://people.redhat.com/~pspacek/a/2014/09/11/dns-test.ldif.xz I was able to randomly reproduce various crashes when BIND was running with more threads than usually. You can try to run BIND with this command (as root) and play games with -n parameter: $ export KRB5_KTNAME=/etc/named.keytab $ named -4 -g -u named -m record -n 10 Please test also the case where BIND receives SIGINT during start-up. It is possible to run BIND with commands above and wait for message: 11-Sep-2014 11:54:58.092 running At this point send SIGINT (CTRL+C) to BIND and see what happens. It could crash or deadlock. It is necessary to send the signal before BIND prints this message: 11-Sep-2014 11:55:11.707 zone z1.test/IN: loaded serial 1410429304 Let me know if you need any assistance. I need your assistance, I haven't been able to reproduce it. Martin I applied the patchset, and NACK I don't understand how I could possibly miss this. I was convinced that the patch set was thoroughly tested ... Anyway, attached patch should fix the problem you were facing. Please re-test it. Thank you! Petr^2 Spacek #1 If named is running and I randomly choose few zones and delete them, it causes named failure dig @localhost A r1.z12.test ; DiG 9.9.4-P2-RedHat-9.9.4-12.P2.fc20 @localhost A r1.z12.test ; (2 servers found) ;; global options: +cmd ;; connection timed out; no servers could be reached * SIGINT doesn't work * rndc doesn't work * DS worksSIGINT signal stops working. Output: snip 11-Sep-2014 11:26:37.495 client 127.0.0.1#62615: received notify for zone 'z99.test' ^C^C^C^C^C^C^C^C Process: named29125 1.1 2.9 789972 45976 pts/0Sl+ 11:26 0:02 named -4 -g -u named -m record -n 10 I have to kill it with kill -9 #2 same as #1 If new zone is added, #3 same as #1 If new record is added #4 same as #1 If record is deleted From 1d23e27e0198cacdbbb14ac13ff039c33546d9af Mon Sep 17 00:00:00 2001 From: Petr Spacek pspa...@redhat.com Date: Wed, 10 Sep 2014 16:48:52 +0200 Subject: [PATCH] Rework locking in settings.c module. Locking mechanism run_exclusive_enter()/exit() works correctly only if all calls operate on the same task. Locking was reworked to use ordinary mutex stored in settings_set_t structure. Signed-off-by: Petr Spacek pspa...@redhat.com --- src/ldap_helper.c | 32 +--- src/settings.c| 63 ++- src/settings.h| 11 +- 3 files changed, 55 insertions(+), 51 deletions(-) diff --git a/src/ldap_helper.c b/src/ldap_helper.c index b3cc7f8389e52decd2f90a18eae761fbc37433a0..89d585abe1cda14fb23807e612970e9bd1e84459 100644 --- a/src/ldap_helper.c +++ b/src/ldap_helper.c @@ -394,7 +394,7 @@ validate_local_instance_settings(ldap_instance_t *inst, settings_set_t *set) { if (strcmp(dir_name, str_buf(buff)) != 0) CHECK(setting_set(directory, inst-local_settings, - str_buf(buff), inst-task)); + str_buf(buff))); str_destroy(buff); dir_name = NULL; CHECK(setting_get_str(directory, inst-local_settings, dir_name)); @@ -429,8 +429,7 @@ validate_local_instance_settings(ldap_instance_t *inst, settings_set_t *set) { CLEANUP_WITH(ISC_R_FAILURE); } CHECK(isc_string_printf(print_buff, PRINT_BUFF_SIZE, %u, auth_method_enum)); - CHECK(setting_set(auth_method_enum, inst-local_settings, print_buff, - inst-task)); + CHECK(setting_set(auth_method_enum, inst-local_settings, print_buff)); /* check we have the right data when SASL/GSSAPI is selected */ CHECK(setting_get_str(sasl_mech, set, sasl_mech)); @@ -485,13 +484,11 @@ validate_local_instance_settings(ldap_instance_t *inst, settings_set_t *set) { default '%s', str_buf(buff)); CHECK(setting_set(krb5_principal, set, - str_buf(buff), - inst-task)); + str_buf(buff))); } } else { CHECK(setting_set(krb5_principal, set, - sasl_user, - inst-task)); + sasl_user)); } } } else if (auth_method_enum == AUTH_SASL) { @@ -559,7 +556,7 @@ new_ldap_instance(isc_mem_t *mctx, const char *db_name, sizeof(settings_global_default), settings_name,