On (16/09/13 18:16), Michal Židek wrote:
On 09/14/2013 12:35 AM, Lukas Slebodnik wrote:
On (13/09/13 19:17), Michal Židek wrote:
On 09/13/2013 05:58 PM, Michal Židek wrote:
Hello,
This patch should add another line of defence against memory cache
problems caused by accessing slot outside of bounds.
Thanks
Michal
After discussion with Lukas I am attaching alternative version
without the call to save the corrupted cache.
Thanks
Michal
I tested patch with latest sssd-1-9 on RHEL6.
Message was written to the sssd_nss.log:
(Sat Sep 14 00:19:24 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
(0x0010): Corrupted fastcache. Slot number too big.
But it seems that sssd_nss crashed.
Program received signal SIGILL, Illegal instruction.
0x00007fa63f8b5b03 in mabort () from /lib64/libc.so.6
#0 0x00007fa63f8b5b03 in mabort () from /lib64/libc.so.6
No symbol table info available.
#1 0x0000000000000008 in ?? ()
No symbol table info available.
#2 0x0000000001086cf0 in ?? ()
No symbol table info available.
#3 0x00000000010842a0 in ?? ()
No symbol table info available.
#4 0x00000000004288de in sss_mc_find_free_slots (_mcc=0x1086cf0, rec_len=252,
key=<value optimized out>, _rec=0x7fff1b6bb008) at
src/responder/nss/nsssrv_mmap_cache.c:495
tot_slots = <value optimized out>
i = <value optimized out>
rec = <value optimized out>
cur = 0
t = <value optimized out>
used = true
#5 sss_mc_get_record (_mcc=0x1086cf0, rec_len=252, key=<value optimized out>,
_rec=0x7fff1b6bb008) at src/responder/nss/nsssrv_mmap_cache.c:637
mcc = 0x7fa632dfa038
old_rec = <value optimized out>
rec = <value optimized out>
old_slots = <value optimized out>
num_slots = 8
base_slot = <value optimized out>
ret = 853516376
i = <value optimized out>
__FUNCTION__ = "sss_mc_get_record"
#6 0x0000000000428eee in sss_mmap_cache_pw_store (_mcc=0x10842a0,
name=0x7fff1b6bb140, pw=0x7fff1b6bb150, uid=126516, gid=16000,
gecos=0x7fff1b6bb180, homedir=0x7fff1b6bb170, shell=0x7fff1b6bb160) at
src/responder/nss/nsssrv_mmap_cache.c:731
mcc = 0x1086cf0
rec = <value optimized out>
data = <value optimized out>
uidkey = {str = 0x7fff1b6bb010 "126516", len = 7}
uidstr = "126516\000\000\021)\213"
data_len = 204
rec_len = 252
pos = <value optimized out>
ret = <value optimized out>
LS
The crash was caused, because mmap_cache was invalidated
and memset was called for invalidated record.
memset(rec->data, MC_INVALID_VAL8, <snip> * MC_SIZE_TO_SLOTS(rec->len))
^^^^^^^^^^
rec->len is 0xffffffff
Good catch.
In function sss_mc_get_next_slot_with_hash I returned MC_INVALID_VAL
in case of failure. This function is called in function
sss_mc_rm_rec_from_chain, which returns no error code. So the upper
layers worked with the wrong MC_INVALID_VAL, as if it was valid value
and the MC_PTR_TO_SLOT resulted in some big slot number (which
probably caused the crash). In this patch I populate the information
about failure to to sss_mc_invalidate_rec and terminate it if
necessary.
New patch is attached. Please run the testing script with your
configuration that failed to see if it is fixed.
Thanks
Michal
I ran a stress test for 20+ hours and sssd_nss didn't crash,
only mmap_cache was reseted or reinvalidated.
Here is sssd_nss.log:
=============================================================
(Mon Sep 16 18:49:56 2013) [sssd[nss]] [monitor_common_rotate_logs] (0x0010):
Debug level changed to 0x0070
(Mon Sep 16 18:49:57 2013) [sssd[nss]] [monitor_common_rotate_logs] (0x0010):
Debug level changed to 0x0030
(Mon Sep 16 18:53:31 2013) [sssd[nss]] [setpwent_result_timeout] (0x0020):
setpwent result object has expired. Cleaning up.
(Mon Sep 16 19:37:17 2013) [sssd[nss]] [sss_mc_find_record] (0x0010): Corrupted
fastcache. name_ptr value is 16.
--- 1st reset.
(Mon Sep 16 21:32:40 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
(0x0010): Corrupted fastcache. Slot number too big.
--- 2nd reset.
(Mon Sep 16 22:04:04 2013) [sssd[nss]] [sss_mc_find_record] (0x0010): Corrupted
fastcache. name_ptr value is 909128497.
--- 3rd reset.
(Mon Sep 16 23:15:23 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
(0x0010): Corrupted fastcache. Slot number too big.
(Mon Sep 16 23:40:19 2013) [sssd[nss]] [sss_mc_is_valid_rec] (0x0010):
Corrupted fastcache. Slot number too big.
(Mon Sep 16 23:40:19 2013) [sssd[nss]] [sss_mc_get_record] (0x0020): Fatal
internal mmap cache error, invalidating cache!
(Mon Sep 16 23:40:19 2013) [sssd[nss]] [fill_pwent] (0x0020): Failed to store
user testbigldap129485(default) in mmap cache!
--- 1st invalidation.
(Tue Sep 17 06:06:51 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
(0x0010): Corrupted fastcache. Slot number too big.
--- 4th reset.
(Tue Sep 17 06:25:04 2013) [sssd[nss]] [sss_mc_add_rec_to_chain] (0x0010):
Corrupted fastcache. Slot number too big.
--- 5th reset.
(Tue Sep 17 07:48:23 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
(0x0010): Corrupted fastcache. Slot number too big.
--- 6th reset.
(Tue Sep 17 08:08:36 2013) [sssd[nss]] [sss_mc_is_valid_rec] (0x0010):
Corrupted fastcache. Slot number too big.
(Tue Sep 17 08:08:36 2013) [sssd[nss]] [sss_mc_get_record] (0x0020): Fatal
internal mmap cache error, invalidating cache!
(Tue Sep 17 08:08:36 2013) [sssd[nss]] [fill_pwent] (0x0020): Failed to store
user testbigldap119808(default) in mmap cache!
--- 2nd invalidation.
(Tue Sep 17 12:56:02 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
(0x0010): Corrupted fastcache. Slot number too big.
--- 7th reset.
(Tue Sep 17 13:22:27 2013) [sssd[nss]] [sss_mc_is_valid_rec] (0x0010):
Corrupted fastcache. Slot number too big.
(Tue Sep 17 13:22:27 2013) [sssd[nss]] [sss_mc_get_record] (0x0020): Fatal
internal mmap cache error, invalidating cache!
(Tue Sep 17 13:22:27 2013) [sssd[nss]] [fill_pwent] (0x0020): Failed to store
user testbigldap126987(default) in mmap cache!
--- 3rd invalidation.
(Tue Sep 17 14:30:56 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
(0x0010): Corrupted fastcache. Slot number too big.
--- 8th reset.
(Tue Sep 17 14:54:21 2013) [sssd[nss]] [sss_mc_add_rec_to_chain] (0x0010):
Corrupted fastcache. Slot number too big.
--- 9th reset.
=============================================================
In older log sile, I also saw reseting caches in another functions.
BTW: I tested my patch with "two next pointers" and it did not crash and
log file is empty :-)
just few small nitpicks.
>From ba054788692f713eed0738f5b3f5e6bb48fe4c4b Mon Sep 17 00:00:00 2001
From: Michal Zidek <[email protected]>
Date: Fri, 13 Sep 2013 17:41:28 +0200
Subject: [PATCH] Check slot validity before MC_SLOT_TO_PTR.
resolves:
https://fedorahosted.org/sssd/ticket/2049
---
src/responder/nss/nsssrv_mmap_cache.c | 80 +++++++++++++++++++++++++++++++----
src/sss_client/nss_mc_common.c | 4 ++
2 files changed, 75 insertions(+), 9 deletions(-)
diff --git a/src/responder/nss/nsssrv_mmap_cache.c
b/src/responder/nss/nsssrv_mmap_cache.c
index 4e45405..09a4c2b 100644
--- a/src/responder/nss/nsssrv_mmap_cache.c
+++ b/src/responder/nss/nsssrv_mmap_cache.c
@@ -191,6 +191,13 @@ static void sss_mc_add_rec_to_chain(struct sss_mc_ctx *mcc,
}
do {
+ if (!MC_SLOT_WITHIN_BOUNDS(slot, mcc->dt_size)) {
+ DEBUG(SSSDBG_FATAL_FAILURE,
+ ("Corrupted fastcache. Slot number too big.\n"));
+ sss_mmap_cache_reset(mcc);
+ return;
+ }
+
cur = MC_SLOT_TO_PTR(mcc->data_table, slot, struct sss_mc_rec);
if (cur == rec) {
/* rec already stored in hash chain */
@@ -205,16 +212,24 @@ static void sss_mc_add_rec_to_chain(struct sss_mc_ctx
*mcc,
cur->next = MC_PTR_TO_SLOT(mcc->data_table, rec);
}
-static inline uint32_t
+static inline errno_t
sss_mc_get_next_slot_with_hash(struct sss_mc_ctx *mcc,
struct sss_mc_rec *start_rec,
- uint32_t hash)
+ uint32_t hash,
+ uint32_t *_slot)
{
struct sss_mc_rec *rec;
uint32_t slot;
slot = start_rec->next;
while (slot != MC_INVALID_VAL) {
+ if (!MC_SLOT_WITHIN_BOUNDS(slot, mcc->dt_size)) {
+ DEBUG(SSSDBG_FATAL_FAILURE,
+ ("Corrupted fastcache. Slot number too big.\n"));
+ sss_mmap_cache_reset(mcc);
+ return EINVAL;
+ }
+
rec = MC_SLOT_TO_PTR(mcc->data_table, slot, struct sss_mc_rec);
if (rec->hash1 == hash || rec->hash2 == hash) {
break;
@@ -223,23 +238,26 @@ sss_mc_get_next_slot_with_hash(struct sss_mc_ctx *mcc,
slot = rec->next;
}
- return slot;
+ *_slot = slot;
+
+ return EOK;
}
-static void sss_mc_rm_rec_from_chain(struct sss_mc_ctx *mcc,
+static errno_t sss_mc_rm_rec_from_chain(struct sss_mc_ctx *mcc,
struct sss_mc_rec *rec,
uint32_t hash)
{
struct sss_mc_rec *prev = NULL;
struct sss_mc_rec *cur = NULL;
uint32_t slot;
+ uint32_t ret;
if (hash > MC_HT_ELEMS(mcc->ht_size)) {
/* It can happen if rec->hash1 and rec->hash2 was the same.
* or it is invalid hash. It is better to return
* than trying to access out of bounds memory
*/
- return;
+ return EOK;
^^^
EOK should be returned on if hash is equal to MC_INVALID_VAL
}
slot = mcc->hash_table[hash];
@@ -247,18 +265,37 @@ static void sss_mc_rm_rec_from_chain(struct sss_mc_ctx
*mcc,
/* record has already been removed. It may happen if rec->hash1 and
* rec->has2 are the same. (It is not very likely).
*/
- return;
+ return EOK;
+ }
+
+ if (!MC_SLOT_WITHIN_BOUNDS(slot, mcc->dt_size)) {
+ DEBUG(SSSDBG_FATAL_FAILURE,
+ ("Corrupted fastcache. Slot number too big.\n"));
+ sss_mmap_cache_reset(mcc);
+ return EINVAL;
}
+
cur = MC_SLOT_TO_PTR(mcc->data_table, slot, struct sss_mc_rec);
if (cur == rec) {
/* rec->next can refer to record without matching hashes.
* We need to skip this(those) records, because
* mcc->hash_table[hash] have to refer to valid start of the chain.
*/
- mcc->hash_table[hash] = sss_mc_get_next_slot_with_hash(mcc, rec, hash);
+ ret = sss_mc_get_next_slot_with_hash(mcc, rec, hash,
+ &mcc->hash_table[hash]);
+ if (ret != EOK) {
+ return ret;
+ }
This condition does not make sense.
You can use:
either return ret;
or return sss_mc_get_next_slot_with_hash