On (17/09/13 17:15), Michal Židek wrote:
>On 09/17/2013 04:41 PM, Lukas Slebodnik wrote:
>>On (16/09/13 18:16), Michal Židek wrote:
>>>On 09/14/2013 12:35 AM, Lukas Slebodnik wrote:
>>>>On (13/09/13 19:17), Michal Židek wrote:
>>>>>On 09/13/2013 05:58 PM, Michal Židek wrote:
>>>>>>Hello,
>>>>>>
>>>>>>This patch should add another line of defence against memory cache
>>>>>>problems caused by accessing slot outside of bounds.
>>>>>>
>>>>>>Thanks
>>>>>>Michal
>>>>>>
>>>>>
>>>>>After discussion with Lukas I am attaching alternative version
>>>>>without the call to save the corrupted cache.
>>>>>
>>>>>Thanks
>>>>>Michal
>>>>>
>>>>
>>>>I tested patch with latest sssd-1-9 on RHEL6.
>>>>
>>>>Message was written to the sssd_nss.log:
>>>>(Sat Sep 14 00:19:24 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
>>>>(0x0010): Corrupted fastcache. Slot number too big.
>>>>
>>>>But it seems that sssd_nss crashed.
>>>>
>>>>Program received signal SIGILL, Illegal instruction.
>>>>0x00007fa63f8b5b03 in mabort () from /lib64/libc.so.6
>>>>#0 0x00007fa63f8b5b03 in mabort () from /lib64/libc.so.6
>>>>No symbol table info available.
>>>>#1 0x0000000000000008 in ?? ()
>>>>No symbol table info available.
>>>>#2 0x0000000001086cf0 in ?? ()
>>>>No symbol table info available.
>>>>#3 0x00000000010842a0 in ?? ()
>>>>No symbol table info available.
>>>>#4 0x00000000004288de in sss_mc_find_free_slots (_mcc=0x1086cf0,
>>>>rec_len=252, key=<value optimized out>, _rec=0x7fff1b6bb008) at
>>>>src/responder/nss/nsssrv_mmap_cache.c:495
>>>> tot_slots = <value optimized out>
>>>> i = <value optimized out>
>>>> rec = <value optimized out>
>>>> cur = 0
>>>> t = <value optimized out>
>>>> used = true
>>>>#5 sss_mc_get_record (_mcc=0x1086cf0, rec_len=252, key=<value optimized
>>>>out>, _rec=0x7fff1b6bb008) at src/responder/nss/nsssrv_mmap_cache.c:637
>>>> mcc = 0x7fa632dfa038
>>>> old_rec = <value optimized out>
>>>> rec = <value optimized out>
>>>> old_slots = <value optimized out>
>>>> num_slots = 8
>>>> base_slot = <value optimized out>
>>>> ret = 853516376
>>>> i = <value optimized out>
>>>> __FUNCTION__ = "sss_mc_get_record"
>>>>#6 0x0000000000428eee in sss_mmap_cache_pw_store (_mcc=0x10842a0,
>>>>name=0x7fff1b6bb140, pw=0x7fff1b6bb150, uid=126516, gid=16000,
>>>>gecos=0x7fff1b6bb180, homedir=0x7fff1b6bb170, shell=0x7fff1b6bb160) at
>>>>src/responder/nss/nsssrv_mmap_cache.c:731
>>>> mcc = 0x1086cf0
>>>> rec = <value optimized out>
>>>> data = <value optimized out>
>>>> uidkey = {str = 0x7fff1b6bb010 "126516", len = 7}
>>>> uidstr = "126516\000\000\021)\213"
>>>> data_len = 204
>>>> rec_len = 252
>>>> pos = <value optimized out>
>>>> ret = <value optimized out>
>>>>
>>>>LS
>>>>
>>The crash was caused, because mmap_cache was invalidated
>>and memset was called for invalidated record.
>>
>>memset(rec->data, MC_INVALID_VAL8, <snip> * MC_SIZE_TO_SLOTS(rec->len))
>> ^^^^^^^^^^
>> rec->len is 0xffffffff
>>>
>>>Good catch.
>>>
>>>In function sss_mc_get_next_slot_with_hash I returned MC_INVALID_VAL
>>>in case of failure. This function is called in function
>>>sss_mc_rm_rec_from_chain, which returns no error code. So the upper
>>>layers worked with the wrong MC_INVALID_VAL, as if it was valid value
>>>and the MC_PTR_TO_SLOT resulted in some big slot number (which
>>>probably caused the crash). In this patch I populate the information
>>>about failure to to sss_mc_invalidate_rec and terminate it if
>>>necessary.
>>>
>>>New patch is attached. Please run the testing script with your
>>>configuration that failed to see if it is fixed.
>>>
>>>Thanks
>>>Michal
>>>
>>I ran a stress test for 20+ hours and sssd_nss didn't crash,
>>only mmap_cache was reseted or reinvalidated.
>>
>>Here is sssd_nss.log:
>>=============================================================
>>(Mon Sep 16 18:49:56 2013) [sssd[nss]] [monitor_common_rotate_logs] (0x0010):
>>Debug level changed to 0x0070
>>(Mon Sep 16 18:49:57 2013) [sssd[nss]] [monitor_common_rotate_logs] (0x0010):
>>Debug level changed to 0x0030
>>(Mon Sep 16 18:53:31 2013) [sssd[nss]] [setpwent_result_timeout] (0x0020):
>>setpwent result object has expired. Cleaning up.
>>(Mon Sep 16 19:37:17 2013) [sssd[nss]] [sss_mc_find_record] (0x0010):
>>Corrupted fastcache. name_ptr value is 16.
>>--- 1st reset.
>>(Mon Sep 16 21:32:40 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
>>(0x0010): Corrupted fastcache. Slot number too big.
>>--- 2nd reset.
>>(Mon Sep 16 22:04:04 2013) [sssd[nss]] [sss_mc_find_record] (0x0010):
>>Corrupted fastcache. name_ptr value is 909128497.
>>--- 3rd reset.
>>(Mon Sep 16 23:15:23 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
>>(0x0010): Corrupted fastcache. Slot number too big.
>>(Mon Sep 16 23:40:19 2013) [sssd[nss]] [sss_mc_is_valid_rec] (0x0010):
>>Corrupted fastcache. Slot number too big.
>>(Mon Sep 16 23:40:19 2013) [sssd[nss]] [sss_mc_get_record] (0x0020): Fatal
>>internal mmap cache error, invalidating cache!
>>(Mon Sep 16 23:40:19 2013) [sssd[nss]] [fill_pwent] (0x0020): Failed to store
>>user testbigldap129485(default) in mmap cache!
>>--- 1st invalidation.
>>(Tue Sep 17 06:06:51 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
>>(0x0010): Corrupted fastcache. Slot number too big.
>>--- 4th reset.
>>(Tue Sep 17 06:25:04 2013) [sssd[nss]] [sss_mc_add_rec_to_chain] (0x0010):
>>Corrupted fastcache. Slot number too big.
>>--- 5th reset.
>>(Tue Sep 17 07:48:23 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
>>(0x0010): Corrupted fastcache. Slot number too big.
>>--- 6th reset.
>>(Tue Sep 17 08:08:36 2013) [sssd[nss]] [sss_mc_is_valid_rec] (0x0010):
>>Corrupted fastcache. Slot number too big.
>>(Tue Sep 17 08:08:36 2013) [sssd[nss]] [sss_mc_get_record] (0x0020): Fatal
>>internal mmap cache error, invalidating cache!
>>(Tue Sep 17 08:08:36 2013) [sssd[nss]] [fill_pwent] (0x0020): Failed to store
>>user testbigldap119808(default) in mmap cache!
>>--- 2nd invalidation.
>>(Tue Sep 17 12:56:02 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
>>(0x0010): Corrupted fastcache. Slot number too big.
>>--- 7th reset.
>>(Tue Sep 17 13:22:27 2013) [sssd[nss]] [sss_mc_is_valid_rec] (0x0010):
>>Corrupted fastcache. Slot number too big.
>>(Tue Sep 17 13:22:27 2013) [sssd[nss]] [sss_mc_get_record] (0x0020): Fatal
>>internal mmap cache error, invalidating cache!
>>(Tue Sep 17 13:22:27 2013) [sssd[nss]] [fill_pwent] (0x0020): Failed to store
>>user testbigldap126987(default) in mmap cache!
>>--- 3rd invalidation.
>>(Tue Sep 17 14:30:56 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
>>(0x0010): Corrupted fastcache. Slot number too big.
>>--- 8th reset.
>>(Tue Sep 17 14:54:21 2013) [sssd[nss]] [sss_mc_add_rec_to_chain] (0x0010):
>>Corrupted fastcache. Slot number too big.
>>--- 9th reset.
>>=============================================================
>>
>>In older log sile, I also saw reseting caches in another functions.
>>BTW: I tested my patch with "two next pointers" and it did not crash and
>>log file is empty :-)
>>
>>just few small nitpicks.
>>
>>>From ba054788692f713eed0738f5b3f5e6bb48fe4c4b Mon Sep 17 00:00:00 2001
>>>From: Michal Zidek <[email protected]>
>>>Date: Fri, 13 Sep 2013 17:41:28 +0200
>>>Subject: [PATCH] Check slot validity before MC_SLOT_TO_PTR.
>>>
>>>resolves:
>>>https://fedorahosted.org/sssd/ticket/2049
>>>---
>>>src/responder/nss/nsssrv_mmap_cache.c | 80
>>>+++++++++++++++++++++++++++++++----
>>>src/sss_client/nss_mc_common.c | 4 ++
>>>2 files changed, 75 insertions(+), 9 deletions(-)
>>>
>>>diff --git a/src/responder/nss/nsssrv_mmap_cache.c
>>>b/src/responder/nss/nsssrv_mmap_cache.c
>>>index 4e45405..09a4c2b 100644
>>>--- a/src/responder/nss/nsssrv_mmap_cache.c
>>>+++ b/src/responder/nss/nsssrv_mmap_cache.c
>>>@@ -191,6 +191,13 @@ static void sss_mc_add_rec_to_chain(struct sss_mc_ctx
>>>*mcc,
>>> }
>>>
>>> do {
>>>+ if (!MC_SLOT_WITHIN_BOUNDS(slot, mcc->dt_size)) {
>>>+ DEBUG(SSSDBG_FATAL_FAILURE,
>>>+ ("Corrupted fastcache. Slot number too big.\n"));
>>>+ sss_mmap_cache_reset(mcc);
>>>+ return;
>>>+ }
>>>+
>>> cur = MC_SLOT_TO_PTR(mcc->data_table, slot, struct sss_mc_rec);
>>> if (cur == rec) {
>>> /* rec already stored in hash chain */
>>>@@ -205,16 +212,24 @@ static void sss_mc_add_rec_to_chain(struct sss_mc_ctx
>>>*mcc,
>>> cur->next = MC_PTR_TO_SLOT(mcc->data_table, rec);
>>>}
>>>
>>>-static inline uint32_t
>>>+static inline errno_t
>>>sss_mc_get_next_slot_with_hash(struct sss_mc_ctx *mcc,
>>> struct sss_mc_rec *start_rec,
>>>- uint32_t hash)
>>>+ uint32_t hash,
>>>+ uint32_t *_slot)
>>>{
>>> struct sss_mc_rec *rec;
>>> uint32_t slot;
>>>
>>> slot = start_rec->next;
>>> while (slot != MC_INVALID_VAL) {
>>>+ if (!MC_SLOT_WITHIN_BOUNDS(slot, mcc->dt_size)) {
>>>+ DEBUG(SSSDBG_FATAL_FAILURE,
>>>+ ("Corrupted fastcache. Slot number too big.\n"));
>>>+ sss_mmap_cache_reset(mcc);
>>>+ return EINVAL;
>>>+ }
>>>+
>>> rec = MC_SLOT_TO_PTR(mcc->data_table, slot, struct sss_mc_rec);
>>> if (rec->hash1 == hash || rec->hash2 == hash) {
>>> break;
>>>@@ -223,23 +238,26 @@ sss_mc_get_next_slot_with_hash(struct sss_mc_ctx *mcc,
>>> slot = rec->next;
>>> }
>>>
>>>- return slot;
>>>+ *_slot = slot;
>>>+
>>>+ return EOK;
>>>}
>>>
>>>-static void sss_mc_rm_rec_from_chain(struct sss_mc_ctx *mcc,
>>>+static errno_t sss_mc_rm_rec_from_chain(struct sss_mc_ctx *mcc,
>>> struct sss_mc_rec *rec,
>>> uint32_t hash)
>>>{
>>> struct sss_mc_rec *prev = NULL;
>>> struct sss_mc_rec *cur = NULL;
>>> uint32_t slot;
>>>+ uint32_t ret;
>>>
>>> if (hash > MC_HT_ELEMS(mcc->ht_size)) {
>>> /* It can happen if rec->hash1 and rec->hash2 was the same.
>>> * or it is invalid hash. It is better to return
>>> * than trying to access out of bounds memory
>>> */
>>>- return;
>>>+ return EOK;
>> ^^^
>> EOK should be returned on if hash is equal to MC_INVALID_VAL
>
>Ok. I also changed the comments (split the error and normal case).
>
>>
>>> }
>>>
>>> slot = mcc->hash_table[hash];
>>>@@ -247,18 +265,37 @@ static void sss_mc_rm_rec_from_chain(struct sss_mc_ctx
>>>*mcc,
>>> /* record has already been removed. It may happen if rec->hash1 and
>>> * rec->has2 are the same. (It is not very likely).
>>> */
>>>- return;
>>>+ return EOK;
>>>+ }
>>>+
>>>+ if (!MC_SLOT_WITHIN_BOUNDS(slot, mcc->dt_size)) {
>>>+ DEBUG(SSSDBG_FATAL_FAILURE,
>>>+ ("Corrupted fastcache. Slot number too big.\n"));
>>>+ sss_mmap_cache_reset(mcc);
>>>+ return EINVAL;
>>> }
>>>+
>>> cur = MC_SLOT_TO_PTR(mcc->data_table, slot, struct sss_mc_rec);
>>> if (cur == rec) {
>>> /* rec->next can refer to record without matching hashes.
>>> * We need to skip this(those) records, because
>>> * mcc->hash_table[hash] have to refer to valid start of the chain.
>>> */
>>>- mcc->hash_table[hash] = sss_mc_get_next_slot_with_hash(mcc, rec,
>>>hash);
>>>+ ret = sss_mc_get_next_slot_with_hash(mcc, rec, hash,
>>>+ &mcc->hash_table[hash]);
>>>+ if (ret != EOK) {
>>>+ return ret;
>>>+ }
>>This condition does not make sense.
>>You can use:
>> either return ret;
>> or return sss_mc_get_next_slot_with_hash
>
>Well, I wanted the function to run to the end if nothing went wrong,
>but yes, in this case we return at several places in the function so
>it does not make much sense.
>
>>
>>LS
>
>Thanks for the review Lukas. New patch is attached.
>
>Michal
I tested patch for sssd-1-9 branch on RHEL6
and it works fine. I think it is good alternative for sssd-1-9 branch.
ACK
LS
_______________________________________________
sssd-devel mailing list
[email protected]
https://lists.fedorahosted.org/mailman/listinfo/sssd-devel