On Tue, Sep 17, 2013 at 07:27:17PM +0200, Michal Židek wrote:
> On 09/17/2013 07:09 PM, Lukas Slebodnik wrote:
> >On (17/09/13 18:33), Michal Židek wrote:
> >>On 09/17/2013 06:15 PM, Lukas Slebodnik wrote:
> >>>On (17/09/13 17:15), Michal Židek wrote:
> >>>>On 09/17/2013 04:41 PM, Lukas Slebodnik wrote:
> >>>>>On (16/09/13 18:16), Michal Židek wrote:
> >>>>>>On 09/14/2013 12:35 AM, Lukas Slebodnik wrote:
> >>>>>>>On (13/09/13 19:17), Michal Židek wrote:
> >>>>>>>>On 09/13/2013 05:58 PM, Michal Židek wrote:
> >>>>>>>>>Hello,
> >>>>>>>>>
> >>>>>>>>>This patch should add another line of defence against memory cache
> >>>>>>>>>problems caused by accessing slot outside of bounds.
> >>>>>>>>>
> >>>>>>>>>Thanks
> >>>>>>>>>Michal
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>After discussion with Lukas I am attaching alternative version
> >>>>>>>>without the call to save the corrupted cache.
> >>>>>>>>
> >>>>>>>>Thanks
> >>>>>>>>Michal
> >>>>>>>>
> >>>>>>>
> >>>>>>>I tested patch with latest sssd-1-9 on RHEL6.
> >>>>>>>
> >>>>>>>Message was written to the sssd_nss.log:
> >>>>>>>(Sat Sep 14 00:19:24 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
> >>>>>>>(0x0010): Corrupted fastcache. Slot number too big.
> >>>>>>>
> >>>>>>>But it seems that sssd_nss crashed.
> >>>>>>>
> >>>>>>>Program received signal SIGILL, Illegal instruction.
> >>>>>>>0x00007fa63f8b5b03 in mabort () from /lib64/libc.so.6
> >>>>>>>#0  0x00007fa63f8b5b03 in mabort () from /lib64/libc.so.6
> >>>>>>>No symbol table info available.
> >>>>>>>#1  0x0000000000000008 in ?? ()
> >>>>>>>No symbol table info available.
> >>>>>>>#2  0x0000000001086cf0 in ?? ()
> >>>>>>>No symbol table info available.
> >>>>>>>#3  0x00000000010842a0 in ?? ()
> >>>>>>>No symbol table info available.
> >>>>>>>#4  0x00000000004288de in sss_mc_find_free_slots (_mcc=0x1086cf0, 
> >>>>>>>rec_len=252, key=<value optimized out>, _rec=0x7fff1b6bb008) at 
> >>>>>>>src/responder/nss/nsssrv_mmap_cache.c:495
> >>>>>>>         tot_slots = <value optimized out>
> >>>>>>>         i = <value optimized out>
> >>>>>>>         rec = <value optimized out>
> >>>>>>>         cur = 0
> >>>>>>>         t = <value optimized out>
> >>>>>>>         used = true
> >>>>>>>#5  sss_mc_get_record (_mcc=0x1086cf0, rec_len=252, key=<value 
> >>>>>>>optimized out>, _rec=0x7fff1b6bb008) at 
> >>>>>>>src/responder/nss/nsssrv_mmap_cache.c:637
> >>>>>>>         mcc = 0x7fa632dfa038
> >>>>>>>         old_rec = <value optimized out>
> >>>>>>>         rec = <value optimized out>
> >>>>>>>         old_slots = <value optimized out>
> >>>>>>>         num_slots = 8
> >>>>>>>         base_slot = <value optimized out>
> >>>>>>>         ret = 853516376
> >>>>>>>         i = <value optimized out>
> >>>>>>>         __FUNCTION__ = "sss_mc_get_record"
> >>>>>>>#6  0x0000000000428eee in sss_mmap_cache_pw_store (_mcc=0x10842a0, 
> >>>>>>>name=0x7fff1b6bb140, pw=0x7fff1b6bb150, uid=126516, gid=16000, 
> >>>>>>>gecos=0x7fff1b6bb180, homedir=0x7fff1b6bb170, shell=0x7fff1b6bb160) at 
> >>>>>>>src/responder/nss/nsssrv_mmap_cache.c:731
> >>>>>>>         mcc = 0x1086cf0
> >>>>>>>         rec = <value optimized out>
> >>>>>>>         data = <value optimized out>
> >>>>>>>         uidkey = {str = 0x7fff1b6bb010 "126516", len = 7}
> >>>>>>>         uidstr = "126516\000\000\021)\213"
> >>>>>>>         data_len = 204
> >>>>>>>         rec_len = 252
> >>>>>>>         pos = <value optimized out>
> >>>>>>>         ret = <value optimized out>
> >>>>>>>
> >>>>>>>LS
> >>>>>>>
> >>>>>The crash was caused, because mmap_cache was invalidated
> >>>>>and memset was called for invalidated record.
> >>>>>
> >>>>>memset(rec->data, MC_INVALID_VAL8, <snip> * MC_SIZE_TO_SLOTS(rec->len))
> >>>>>                                                             ^^^^^^^^^^
> >>>>>                                                     rec->len is 
> >>>>> 0xffffffff
> >>>>>>
> >>>>>>Good catch.
> >>>>>>
> >>>>>>In function sss_mc_get_next_slot_with_hash I returned MC_INVALID_VAL
> >>>>>>in case of failure. This function is called in function
> >>>>>>sss_mc_rm_rec_from_chain, which returns no error code. So the upper
> >>>>>>layers worked with the wrong MC_INVALID_VAL, as if it was valid value
> >>>>>>and the MC_PTR_TO_SLOT resulted in some big slot number (which
> >>>>>>probably caused the crash). In this patch I populate the information
> >>>>>>about failure to to sss_mc_invalidate_rec and terminate it if
> >>>>>>necessary.
> >>>>>>
> >>>>>>New patch is attached. Please run the testing script with your
> >>>>>>configuration that failed to see if it is fixed.
> >>>>>>
> >>>>>>Thanks
> >>>>>>Michal
> >>>>>>
> >>>>>I ran a stress test for 20+ hours and sssd_nss didn't crash,
> >>>>>only mmap_cache was reseted or reinvalidated.
> >>>>>
> >>>>>Here is sssd_nss.log:
> >>>>>=============================================================
> >>>>>(Mon Sep 16 18:49:56 2013) [sssd[nss]] [monitor_common_rotate_logs] 
> >>>>>(0x0010): Debug level changed to 0x0070
> >>>>>(Mon Sep 16 18:49:57 2013) [sssd[nss]] [monitor_common_rotate_logs] 
> >>>>>(0x0010): Debug level changed to 0x0030
> >>>>>(Mon Sep 16 18:53:31 2013) [sssd[nss]] [setpwent_result_timeout] 
> >>>>>(0x0020): setpwent result object has expired. Cleaning up.
> >>>>>(Mon Sep 16 19:37:17 2013) [sssd[nss]] [sss_mc_find_record] (0x0010): 
> >>>>>Corrupted fastcache. name_ptr value is 16.
> >>>>>--- 1st reset.
> >>>>>(Mon Sep 16 21:32:40 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash] 
> >>>>>(0x0010): Corrupted fastcache. Slot number too big.
> >>>>>--- 2nd reset.
> >>>>>(Mon Sep 16 22:04:04 2013) [sssd[nss]] [sss_mc_find_record] (0x0010): 
> >>>>>Corrupted fastcache. name_ptr value is 909128497.
> >>>>>--- 3rd reset.
> >>>>>(Mon Sep 16 23:15:23 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash] 
> >>>>>(0x0010): Corrupted fastcache. Slot number too big.
> >>>>>(Mon Sep 16 23:40:19 2013) [sssd[nss]] [sss_mc_is_valid_rec] (0x0010): 
> >>>>>Corrupted fastcache. Slot number too big.
> >>>>>(Mon Sep 16 23:40:19 2013) [sssd[nss]] [sss_mc_get_record] (0x0020): 
> >>>>>Fatal internal mmap cache error, invalidating cache!
> >>>>>(Mon Sep 16 23:40:19 2013) [sssd[nss]] [fill_pwent] (0x0020): Failed to 
> >>>>>store user testbigldap129485(default) in mmap cache!
> >>>>>--- 1st invalidation.
> >>>>>(Tue Sep 17 06:06:51 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash] 
> >>>>>(0x0010): Corrupted fastcache. Slot number too big.
> >>>>>--- 4th reset.
> >>>>>(Tue Sep 17 06:25:04 2013) [sssd[nss]] [sss_mc_add_rec_to_chain] 
> >>>>>(0x0010): Corrupted fastcache. Slot number too big.
> >>>>>--- 5th reset.
> >>>>>(Tue Sep 17 07:48:23 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash] 
> >>>>>(0x0010): Corrupted fastcache. Slot number too big.
> >>>>>--- 6th reset.
> >>>>>(Tue Sep 17 08:08:36 2013) [sssd[nss]] [sss_mc_is_valid_rec] (0x0010): 
> >>>>>Corrupted fastcache. Slot number too big.
> >>>>>(Tue Sep 17 08:08:36 2013) [sssd[nss]] [sss_mc_get_record] (0x0020): 
> >>>>>Fatal internal mmap cache error, invalidating cache!
> >>>>>(Tue Sep 17 08:08:36 2013) [sssd[nss]] [fill_pwent] (0x0020): Failed to 
> >>>>>store user testbigldap119808(default) in mmap cache!
> >>>>>--- 2nd invalidation.
> >>>>>(Tue Sep 17 12:56:02 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash] 
> >>>>>(0x0010): Corrupted fastcache. Slot number too big.
> >>>>>--- 7th reset.
> >>>>>(Tue Sep 17 13:22:27 2013) [sssd[nss]] [sss_mc_is_valid_rec] (0x0010): 
> >>>>>Corrupted fastcache. Slot number too big.
> >>>>>(Tue Sep 17 13:22:27 2013) [sssd[nss]] [sss_mc_get_record] (0x0020): 
> >>>>>Fatal internal mmap cache error, invalidating cache!
> >>>>>(Tue Sep 17 13:22:27 2013) [sssd[nss]] [fill_pwent] (0x0020): Failed to 
> >>>>>store user testbigldap126987(default) in mmap cache!
> >>>>>--- 3rd invalidation.
> >>>>>(Tue Sep 17 14:30:56 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash] 
> >>>>>(0x0010): Corrupted fastcache. Slot number too big.
> >>>>>--- 8th reset.
> >>>>>(Tue Sep 17 14:54:21 2013) [sssd[nss]] [sss_mc_add_rec_to_chain] 
> >>>>>(0x0010): Corrupted fastcache. Slot number too big.
> >>>>>--- 9th reset.
> >>>>>=============================================================
> >>>>>
> >>>>>In older log sile, I also saw reseting caches in another functions.
> >>>>>BTW: I tested my patch with "two next pointers" and it did not crash and
> >>>>>log file is empty :-)
> >>>>>
> >>>>>just few small nitpicks.
> >>>>>
> >>>>>>From ba054788692f713eed0738f5b3f5e6bb48fe4c4b Mon Sep 17 00:00:00 2001
> >>>>>>From: Michal Zidek <[email protected]>
> >>>>>>Date: Fri, 13 Sep 2013 17:41:28 +0200
> >>>>>>Subject: [PATCH] Check slot validity before MC_SLOT_TO_PTR.
> >>>>>>
> >>>>>>resolves:
> >>>>>>https://fedorahosted.org/sssd/ticket/2049
> >>>>>>---
> >>>>>>src/responder/nss/nsssrv_mmap_cache.c | 80 
> >>>>>>+++++++++++++++++++++++++++++++----
> >>>>>>src/sss_client/nss_mc_common.c        |  4 ++
> >>>>>>2 files changed, 75 insertions(+), 9 deletions(-)
> >>>>>>
> >>>>>>diff --git a/src/responder/nss/nsssrv_mmap_cache.c 
> >>>>>>b/src/responder/nss/nsssrv_mmap_cache.c
> >>>>>>index 4e45405..09a4c2b 100644
> >>>>>>--- a/src/responder/nss/nsssrv_mmap_cache.c
> >>>>>>+++ b/src/responder/nss/nsssrv_mmap_cache.c
> >>>>>>@@ -191,6 +191,13 @@ static void sss_mc_add_rec_to_chain(struct 
> >>>>>>sss_mc_ctx *mcc,
> >>>>>>     }
> >>>>>>
> >>>>>>     do {
> >>>>>>+        if (!MC_SLOT_WITHIN_BOUNDS(slot, mcc->dt_size)) {
> >>>>>>+            DEBUG(SSSDBG_FATAL_FAILURE,
> >>>>>>+                  ("Corrupted fastcache. Slot number too big.\n"));
> >>>>>>+            sss_mmap_cache_reset(mcc);
> >>>>>>+            return;
> >>>>>>+        }
> >>>>>>+
> >>>>>>         cur = MC_SLOT_TO_PTR(mcc->data_table, slot, struct sss_mc_rec);
> >>>>>>         if (cur == rec) {
> >>>>>>             /* rec already stored in hash chain */
> >>>>>>@@ -205,16 +212,24 @@ static void sss_mc_add_rec_to_chain(struct 
> >>>>>>sss_mc_ctx *mcc,
> >>>>>>     cur->next = MC_PTR_TO_SLOT(mcc->data_table, rec);
> >>>>>>}
> >>>>>>
> >>>>>>-static inline uint32_t
> >>>>>>+static inline errno_t
> >>>>>>sss_mc_get_next_slot_with_hash(struct sss_mc_ctx *mcc,
> >>>>>>                                struct sss_mc_rec *start_rec,
> >>>>>>-                               uint32_t hash)
> >>>>>>+                               uint32_t hash,
> >>>>>>+                               uint32_t *_slot)
> >>>>>>{
> >>>>>>     struct sss_mc_rec *rec;
> >>>>>>     uint32_t slot;
> >>>>>>
> >>>>>>     slot = start_rec->next;
> >>>>>>     while (slot != MC_INVALID_VAL) {
> >>>>>>+        if (!MC_SLOT_WITHIN_BOUNDS(slot, mcc->dt_size)) {
> >>>>>>+            DEBUG(SSSDBG_FATAL_FAILURE,
> >>>>>>+                  ("Corrupted fastcache. Slot number too big.\n"));
> >>>>>>+            sss_mmap_cache_reset(mcc);
> >>>>>>+            return EINVAL;
> >>>>>>+        }
> >>>>>>+
> >>>>>>         rec = MC_SLOT_TO_PTR(mcc->data_table, slot, struct sss_mc_rec);
> >>>>>>         if (rec->hash1 == hash || rec->hash2 == hash) {
> >>>>>>             break;
> >>>>>>@@ -223,23 +238,26 @@ sss_mc_get_next_slot_with_hash(struct sss_mc_ctx 
> >>>>>>*mcc,
> >>>>>>         slot = rec->next;
> >>>>>>     }
> >>>>>>
> >>>>>>-    return slot;
> >>>>>>+    *_slot = slot;
> >>>>>>+
> >>>>>>+    return EOK;
> >>>>>>}
> >>>>>>
> >>>>>>-static void sss_mc_rm_rec_from_chain(struct sss_mc_ctx *mcc,
> >>>>>>+static errno_t sss_mc_rm_rec_from_chain(struct sss_mc_ctx *mcc,
> >>>>>>                                      struct sss_mc_rec *rec,
> >>>>>>                                      uint32_t hash)
> >>>>>>{
> >>>>>>     struct sss_mc_rec *prev = NULL;
> >>>>>>     struct sss_mc_rec *cur = NULL;
> >>>>>>     uint32_t slot;
> >>>>>>+    uint32_t ret;
> >>>>>>
> >>>>>>     if (hash > MC_HT_ELEMS(mcc->ht_size)) {
> >>>>>>         /* It can happen if rec->hash1 and rec->hash2 was the same.
> >>>>>>          * or it is invalid hash. It is better to return
> >>>>>>          * than trying to access out of bounds memory
> >>>>>>          */
> >>>>>>-        return;
> >>>>>>+        return EOK;
> >>>>>                  ^^^
> >>>>>           EOK should be returned on if hash is equal to MC_INVALID_VAL
> >>>>
> >>>>Ok. I also changed the comments (split the error and normal case).
> >>>>
> >>>>>
> >>>>>>     }
> >>>>>>
> >>>>>>     slot = mcc->hash_table[hash];
> >>>>>>@@ -247,18 +265,37 @@ static void sss_mc_rm_rec_from_chain(struct 
> >>>>>>sss_mc_ctx *mcc,
> >>>>>>         /* record has already been removed. It may happen if 
> >>>>>> rec->hash1 and
> >>>>>>          * rec->has2 are the same. (It is not very likely).
> >>>>>>          */
> >>>>>>-        return;
> >>>>>>+        return EOK;
> >>>>>>+    }
> >>>>>>+
> >>>>>>+    if (!MC_SLOT_WITHIN_BOUNDS(slot, mcc->dt_size)) {
> >>>>>>+        DEBUG(SSSDBG_FATAL_FAILURE,
> >>>>>>+              ("Corrupted fastcache. Slot number too big.\n"));
> >>>>>>+        sss_mmap_cache_reset(mcc);
> >>>>>>+        return EINVAL;
> >>>>>>     }
> >>>>>>+
> >>>>>>     cur = MC_SLOT_TO_PTR(mcc->data_table, slot, struct sss_mc_rec);
> >>>>>>     if (cur == rec) {
> >>>>>>         /* rec->next can refer to record without matching hashes.
> >>>>>>          * We need to skip this(those) records, because
> >>>>>>          * mcc->hash_table[hash] have to refer to valid start of the 
> >>>>>> chain.
> >>>>>>          */
> >>>>>>-        mcc->hash_table[hash] = sss_mc_get_next_slot_with_hash(mcc, 
> >>>>>>rec, hash);
> >>>>>>+        ret = sss_mc_get_next_slot_with_hash(mcc, rec, hash,
> >>>>>>+                                             &mcc->hash_table[hash]);
> >>>>>>+        if (ret != EOK) {
> >>>>>>+            return ret;
> >>>>>>+        }
> >>>>>This condition does not make sense.
> >>>>>You can use:
> >>>>>       either return ret;
> >>>>>       or return sss_mc_get_next_slot_with_hash
> >>>>
> >>>>Well, I wanted the function to run to the end if nothing went wrong,
> >>>>but yes, in this case we return at several places in the function so
> >>>>it does not make much sense.
> >>>>
> >>>>>
> >>>>>LS
> >>>>
> >>>>Thanks for the review Lukas. New patch is attached.
> >>>>
> >>>>Michal
> >>>
> >>>I tested patch for sssd-1-9 branch on RHEL6
> >>>and it works fine. I think it is good alternative for sssd-1-9 branch.
> >>>
> >>>ACK
> >>>
> >>>LS
> >>
> >>I would personally prefer to have this in master as well, even if the
> >>currently known issues will be fixed with your patch ([PATCH]
> >>mmap_cache: Use two chains for hash collision). I think it hurts
> >>nothing to be more defensive.
> >Your patch is not about defensive coding. The patch is about fixing corrupted
> >cache in back-compatible way. In my opinion, mmap_cache source code looks 
> >much
> >more complicated and readability is decreased.
> >
> >LS
> 
> I do not have strong opinion on this, so I am fine either way.
> 
> Michal

Before pushing the patches, I have two questions:

1) Do these patches cover the same error as the ones Lukas has on the
list now?

2) About this change:
diff --git a/src/sss_client/nss_mc_common.c b/src/sss_client/nss_mc_common.c
index a0a70ab..6764d80 100644
--- a/src/sss_client/nss_mc_common.c
+++ b/src/sss_client/nss_mc_common.c
@@ -190,6 +190,10 @@ errno_t sss_nss_mc_get_record(struct sss_cli_mc_ctx *ctx,
     int count;
     int ret;

+    if (!MC_SLOT_WITHIN_BOUNDS(slot, ctx->dt_size)) {
+        return EINVAL;
+    }
+     
     /* try max 5 times */
     for (count = 5; count > 0; count--) {
         rec = MC_SLOT_TO_PTR(ctx->data_table, slot, struct sss_mc_rec);

All paths that call sss_nss_mc_get_record() also check for
MC_SLOT_WITHIN_BOUNDS. Does it make sense to check both?
_______________________________________________
sssd-devel mailing list
[email protected]
https://lists.fedorahosted.org/mailman/listinfo/sssd-devel

Reply via email to