On Tue, Sep 17, 2013 at 07:27:17PM +0200, Michal Židek wrote:
> On 09/17/2013 07:09 PM, Lukas Slebodnik wrote:
> >On (17/09/13 18:33), Michal Židek wrote:
> >>On 09/17/2013 06:15 PM, Lukas Slebodnik wrote:
> >>>On (17/09/13 17:15), Michal Židek wrote:
> >>>>On 09/17/2013 04:41 PM, Lukas Slebodnik wrote:
> >>>>>On (16/09/13 18:16), Michal Židek wrote:
> >>>>>>On 09/14/2013 12:35 AM, Lukas Slebodnik wrote:
> >>>>>>>On (13/09/13 19:17), Michal Židek wrote:
> >>>>>>>>On 09/13/2013 05:58 PM, Michal Židek wrote:
> >>>>>>>>>Hello,
> >>>>>>>>>
> >>>>>>>>>This patch should add another line of defence against memory cache
> >>>>>>>>>problems caused by accessing slot outside of bounds.
> >>>>>>>>>
> >>>>>>>>>Thanks
> >>>>>>>>>Michal
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>After discussion with Lukas I am attaching alternative version
> >>>>>>>>without the call to save the corrupted cache.
> >>>>>>>>
> >>>>>>>>Thanks
> >>>>>>>>Michal
> >>>>>>>>
> >>>>>>>
> >>>>>>>I tested patch with latest sssd-1-9 on RHEL6.
> >>>>>>>
> >>>>>>>Message was written to the sssd_nss.log:
> >>>>>>>(Sat Sep 14 00:19:24 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
> >>>>>>>(0x0010): Corrupted fastcache. Slot number too big.
> >>>>>>>
> >>>>>>>But it seems that sssd_nss crashed.
> >>>>>>>
> >>>>>>>Program received signal SIGILL, Illegal instruction.
> >>>>>>>0x00007fa63f8b5b03 in mabort () from /lib64/libc.so.6
> >>>>>>>#0 0x00007fa63f8b5b03 in mabort () from /lib64/libc.so.6
> >>>>>>>No symbol table info available.
> >>>>>>>#1 0x0000000000000008 in ?? ()
> >>>>>>>No symbol table info available.
> >>>>>>>#2 0x0000000001086cf0 in ?? ()
> >>>>>>>No symbol table info available.
> >>>>>>>#3 0x00000000010842a0 in ?? ()
> >>>>>>>No symbol table info available.
> >>>>>>>#4 0x00000000004288de in sss_mc_find_free_slots (_mcc=0x1086cf0,
> >>>>>>>rec_len=252, key=<value optimized out>, _rec=0x7fff1b6bb008) at
> >>>>>>>src/responder/nss/nsssrv_mmap_cache.c:495
> >>>>>>> tot_slots = <value optimized out>
> >>>>>>> i = <value optimized out>
> >>>>>>> rec = <value optimized out>
> >>>>>>> cur = 0
> >>>>>>> t = <value optimized out>
> >>>>>>> used = true
> >>>>>>>#5 sss_mc_get_record (_mcc=0x1086cf0, rec_len=252, key=<value
> >>>>>>>optimized out>, _rec=0x7fff1b6bb008) at
> >>>>>>>src/responder/nss/nsssrv_mmap_cache.c:637
> >>>>>>> mcc = 0x7fa632dfa038
> >>>>>>> old_rec = <value optimized out>
> >>>>>>> rec = <value optimized out>
> >>>>>>> old_slots = <value optimized out>
> >>>>>>> num_slots = 8
> >>>>>>> base_slot = <value optimized out>
> >>>>>>> ret = 853516376
> >>>>>>> i = <value optimized out>
> >>>>>>> __FUNCTION__ = "sss_mc_get_record"
> >>>>>>>#6 0x0000000000428eee in sss_mmap_cache_pw_store (_mcc=0x10842a0,
> >>>>>>>name=0x7fff1b6bb140, pw=0x7fff1b6bb150, uid=126516, gid=16000,
> >>>>>>>gecos=0x7fff1b6bb180, homedir=0x7fff1b6bb170, shell=0x7fff1b6bb160) at
> >>>>>>>src/responder/nss/nsssrv_mmap_cache.c:731
> >>>>>>> mcc = 0x1086cf0
> >>>>>>> rec = <value optimized out>
> >>>>>>> data = <value optimized out>
> >>>>>>> uidkey = {str = 0x7fff1b6bb010 "126516", len = 7}
> >>>>>>> uidstr = "126516\000\000\021)\213"
> >>>>>>> data_len = 204
> >>>>>>> rec_len = 252
> >>>>>>> pos = <value optimized out>
> >>>>>>> ret = <value optimized out>
> >>>>>>>
> >>>>>>>LS
> >>>>>>>
> >>>>>The crash was caused, because mmap_cache was invalidated
> >>>>>and memset was called for invalidated record.
> >>>>>
> >>>>>memset(rec->data, MC_INVALID_VAL8, <snip> * MC_SIZE_TO_SLOTS(rec->len))
> >>>>> ^^^^^^^^^^
> >>>>> rec->len is
> >>>>> 0xffffffff
> >>>>>>
> >>>>>>Good catch.
> >>>>>>
> >>>>>>In function sss_mc_get_next_slot_with_hash I returned MC_INVALID_VAL
> >>>>>>in case of failure. This function is called in function
> >>>>>>sss_mc_rm_rec_from_chain, which returns no error code. So the upper
> >>>>>>layers worked with the wrong MC_INVALID_VAL, as if it was valid value
> >>>>>>and the MC_PTR_TO_SLOT resulted in some big slot number (which
> >>>>>>probably caused the crash). In this patch I populate the information
> >>>>>>about failure to to sss_mc_invalidate_rec and terminate it if
> >>>>>>necessary.
> >>>>>>
> >>>>>>New patch is attached. Please run the testing script with your
> >>>>>>configuration that failed to see if it is fixed.
> >>>>>>
> >>>>>>Thanks
> >>>>>>Michal
> >>>>>>
> >>>>>I ran a stress test for 20+ hours and sssd_nss didn't crash,
> >>>>>only mmap_cache was reseted or reinvalidated.
> >>>>>
> >>>>>Here is sssd_nss.log:
> >>>>>=============================================================
> >>>>>(Mon Sep 16 18:49:56 2013) [sssd[nss]] [monitor_common_rotate_logs]
> >>>>>(0x0010): Debug level changed to 0x0070
> >>>>>(Mon Sep 16 18:49:57 2013) [sssd[nss]] [monitor_common_rotate_logs]
> >>>>>(0x0010): Debug level changed to 0x0030
> >>>>>(Mon Sep 16 18:53:31 2013) [sssd[nss]] [setpwent_result_timeout]
> >>>>>(0x0020): setpwent result object has expired. Cleaning up.
> >>>>>(Mon Sep 16 19:37:17 2013) [sssd[nss]] [sss_mc_find_record] (0x0010):
> >>>>>Corrupted fastcache. name_ptr value is 16.
> >>>>>--- 1st reset.
> >>>>>(Mon Sep 16 21:32:40 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
> >>>>>(0x0010): Corrupted fastcache. Slot number too big.
> >>>>>--- 2nd reset.
> >>>>>(Mon Sep 16 22:04:04 2013) [sssd[nss]] [sss_mc_find_record] (0x0010):
> >>>>>Corrupted fastcache. name_ptr value is 909128497.
> >>>>>--- 3rd reset.
> >>>>>(Mon Sep 16 23:15:23 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
> >>>>>(0x0010): Corrupted fastcache. Slot number too big.
> >>>>>(Mon Sep 16 23:40:19 2013) [sssd[nss]] [sss_mc_is_valid_rec] (0x0010):
> >>>>>Corrupted fastcache. Slot number too big.
> >>>>>(Mon Sep 16 23:40:19 2013) [sssd[nss]] [sss_mc_get_record] (0x0020):
> >>>>>Fatal internal mmap cache error, invalidating cache!
> >>>>>(Mon Sep 16 23:40:19 2013) [sssd[nss]] [fill_pwent] (0x0020): Failed to
> >>>>>store user testbigldap129485(default) in mmap cache!
> >>>>>--- 1st invalidation.
> >>>>>(Tue Sep 17 06:06:51 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
> >>>>>(0x0010): Corrupted fastcache. Slot number too big.
> >>>>>--- 4th reset.
> >>>>>(Tue Sep 17 06:25:04 2013) [sssd[nss]] [sss_mc_add_rec_to_chain]
> >>>>>(0x0010): Corrupted fastcache. Slot number too big.
> >>>>>--- 5th reset.
> >>>>>(Tue Sep 17 07:48:23 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
> >>>>>(0x0010): Corrupted fastcache. Slot number too big.
> >>>>>--- 6th reset.
> >>>>>(Tue Sep 17 08:08:36 2013) [sssd[nss]] [sss_mc_is_valid_rec] (0x0010):
> >>>>>Corrupted fastcache. Slot number too big.
> >>>>>(Tue Sep 17 08:08:36 2013) [sssd[nss]] [sss_mc_get_record] (0x0020):
> >>>>>Fatal internal mmap cache error, invalidating cache!
> >>>>>(Tue Sep 17 08:08:36 2013) [sssd[nss]] [fill_pwent] (0x0020): Failed to
> >>>>>store user testbigldap119808(default) in mmap cache!
> >>>>>--- 2nd invalidation.
> >>>>>(Tue Sep 17 12:56:02 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
> >>>>>(0x0010): Corrupted fastcache. Slot number too big.
> >>>>>--- 7th reset.
> >>>>>(Tue Sep 17 13:22:27 2013) [sssd[nss]] [sss_mc_is_valid_rec] (0x0010):
> >>>>>Corrupted fastcache. Slot number too big.
> >>>>>(Tue Sep 17 13:22:27 2013) [sssd[nss]] [sss_mc_get_record] (0x0020):
> >>>>>Fatal internal mmap cache error, invalidating cache!
> >>>>>(Tue Sep 17 13:22:27 2013) [sssd[nss]] [fill_pwent] (0x0020): Failed to
> >>>>>store user testbigldap126987(default) in mmap cache!
> >>>>>--- 3rd invalidation.
> >>>>>(Tue Sep 17 14:30:56 2013) [sssd[nss]] [sss_mc_get_next_slot_with_hash]
> >>>>>(0x0010): Corrupted fastcache. Slot number too big.
> >>>>>--- 8th reset.
> >>>>>(Tue Sep 17 14:54:21 2013) [sssd[nss]] [sss_mc_add_rec_to_chain]
> >>>>>(0x0010): Corrupted fastcache. Slot number too big.
> >>>>>--- 9th reset.
> >>>>>=============================================================
> >>>>>
> >>>>>In older log sile, I also saw reseting caches in another functions.
> >>>>>BTW: I tested my patch with "two next pointers" and it did not crash and
> >>>>>log file is empty :-)
> >>>>>
> >>>>>just few small nitpicks.
> >>>>>
> >>>>>>From ba054788692f713eed0738f5b3f5e6bb48fe4c4b Mon Sep 17 00:00:00 2001
> >>>>>>From: Michal Zidek <[email protected]>
> >>>>>>Date: Fri, 13 Sep 2013 17:41:28 +0200
> >>>>>>Subject: [PATCH] Check slot validity before MC_SLOT_TO_PTR.
> >>>>>>
> >>>>>>resolves:
> >>>>>>https://fedorahosted.org/sssd/ticket/2049
> >>>>>>---
> >>>>>>src/responder/nss/nsssrv_mmap_cache.c | 80
> >>>>>>+++++++++++++++++++++++++++++++----
> >>>>>>src/sss_client/nss_mc_common.c | 4 ++
> >>>>>>2 files changed, 75 insertions(+), 9 deletions(-)
> >>>>>>
> >>>>>>diff --git a/src/responder/nss/nsssrv_mmap_cache.c
> >>>>>>b/src/responder/nss/nsssrv_mmap_cache.c
> >>>>>>index 4e45405..09a4c2b 100644
> >>>>>>--- a/src/responder/nss/nsssrv_mmap_cache.c
> >>>>>>+++ b/src/responder/nss/nsssrv_mmap_cache.c
> >>>>>>@@ -191,6 +191,13 @@ static void sss_mc_add_rec_to_chain(struct
> >>>>>>sss_mc_ctx *mcc,
> >>>>>> }
> >>>>>>
> >>>>>> do {
> >>>>>>+ if (!MC_SLOT_WITHIN_BOUNDS(slot, mcc->dt_size)) {
> >>>>>>+ DEBUG(SSSDBG_FATAL_FAILURE,
> >>>>>>+ ("Corrupted fastcache. Slot number too big.\n"));
> >>>>>>+ sss_mmap_cache_reset(mcc);
> >>>>>>+ return;
> >>>>>>+ }
> >>>>>>+
> >>>>>> cur = MC_SLOT_TO_PTR(mcc->data_table, slot, struct sss_mc_rec);
> >>>>>> if (cur == rec) {
> >>>>>> /* rec already stored in hash chain */
> >>>>>>@@ -205,16 +212,24 @@ static void sss_mc_add_rec_to_chain(struct
> >>>>>>sss_mc_ctx *mcc,
> >>>>>> cur->next = MC_PTR_TO_SLOT(mcc->data_table, rec);
> >>>>>>}
> >>>>>>
> >>>>>>-static inline uint32_t
> >>>>>>+static inline errno_t
> >>>>>>sss_mc_get_next_slot_with_hash(struct sss_mc_ctx *mcc,
> >>>>>> struct sss_mc_rec *start_rec,
> >>>>>>- uint32_t hash)
> >>>>>>+ uint32_t hash,
> >>>>>>+ uint32_t *_slot)
> >>>>>>{
> >>>>>> struct sss_mc_rec *rec;
> >>>>>> uint32_t slot;
> >>>>>>
> >>>>>> slot = start_rec->next;
> >>>>>> while (slot != MC_INVALID_VAL) {
> >>>>>>+ if (!MC_SLOT_WITHIN_BOUNDS(slot, mcc->dt_size)) {
> >>>>>>+ DEBUG(SSSDBG_FATAL_FAILURE,
> >>>>>>+ ("Corrupted fastcache. Slot number too big.\n"));
> >>>>>>+ sss_mmap_cache_reset(mcc);
> >>>>>>+ return EINVAL;
> >>>>>>+ }
> >>>>>>+
> >>>>>> rec = MC_SLOT_TO_PTR(mcc->data_table, slot, struct sss_mc_rec);
> >>>>>> if (rec->hash1 == hash || rec->hash2 == hash) {
> >>>>>> break;
> >>>>>>@@ -223,23 +238,26 @@ sss_mc_get_next_slot_with_hash(struct sss_mc_ctx
> >>>>>>*mcc,
> >>>>>> slot = rec->next;
> >>>>>> }
> >>>>>>
> >>>>>>- return slot;
> >>>>>>+ *_slot = slot;
> >>>>>>+
> >>>>>>+ return EOK;
> >>>>>>}
> >>>>>>
> >>>>>>-static void sss_mc_rm_rec_from_chain(struct sss_mc_ctx *mcc,
> >>>>>>+static errno_t sss_mc_rm_rec_from_chain(struct sss_mc_ctx *mcc,
> >>>>>> struct sss_mc_rec *rec,
> >>>>>> uint32_t hash)
> >>>>>>{
> >>>>>> struct sss_mc_rec *prev = NULL;
> >>>>>> struct sss_mc_rec *cur = NULL;
> >>>>>> uint32_t slot;
> >>>>>>+ uint32_t ret;
> >>>>>>
> >>>>>> if (hash > MC_HT_ELEMS(mcc->ht_size)) {
> >>>>>> /* It can happen if rec->hash1 and rec->hash2 was the same.
> >>>>>> * or it is invalid hash. It is better to return
> >>>>>> * than trying to access out of bounds memory
> >>>>>> */
> >>>>>>- return;
> >>>>>>+ return EOK;
> >>>>> ^^^
> >>>>> EOK should be returned on if hash is equal to MC_INVALID_VAL
> >>>>
> >>>>Ok. I also changed the comments (split the error and normal case).
> >>>>
> >>>>>
> >>>>>> }
> >>>>>>
> >>>>>> slot = mcc->hash_table[hash];
> >>>>>>@@ -247,18 +265,37 @@ static void sss_mc_rm_rec_from_chain(struct
> >>>>>>sss_mc_ctx *mcc,
> >>>>>> /* record has already been removed. It may happen if
> >>>>>> rec->hash1 and
> >>>>>> * rec->has2 are the same. (It is not very likely).
> >>>>>> */
> >>>>>>- return;
> >>>>>>+ return EOK;
> >>>>>>+ }
> >>>>>>+
> >>>>>>+ if (!MC_SLOT_WITHIN_BOUNDS(slot, mcc->dt_size)) {
> >>>>>>+ DEBUG(SSSDBG_FATAL_FAILURE,
> >>>>>>+ ("Corrupted fastcache. Slot number too big.\n"));
> >>>>>>+ sss_mmap_cache_reset(mcc);
> >>>>>>+ return EINVAL;
> >>>>>> }
> >>>>>>+
> >>>>>> cur = MC_SLOT_TO_PTR(mcc->data_table, slot, struct sss_mc_rec);
> >>>>>> if (cur == rec) {
> >>>>>> /* rec->next can refer to record without matching hashes.
> >>>>>> * We need to skip this(those) records, because
> >>>>>> * mcc->hash_table[hash] have to refer to valid start of the
> >>>>>> chain.
> >>>>>> */
> >>>>>>- mcc->hash_table[hash] = sss_mc_get_next_slot_with_hash(mcc,
> >>>>>>rec, hash);
> >>>>>>+ ret = sss_mc_get_next_slot_with_hash(mcc, rec, hash,
> >>>>>>+ &mcc->hash_table[hash]);
> >>>>>>+ if (ret != EOK) {
> >>>>>>+ return ret;
> >>>>>>+ }
> >>>>>This condition does not make sense.
> >>>>>You can use:
> >>>>> either return ret;
> >>>>> or return sss_mc_get_next_slot_with_hash
> >>>>
> >>>>Well, I wanted the function to run to the end if nothing went wrong,
> >>>>but yes, in this case we return at several places in the function so
> >>>>it does not make much sense.
> >>>>
> >>>>>
> >>>>>LS
> >>>>
> >>>>Thanks for the review Lukas. New patch is attached.
> >>>>
> >>>>Michal
> >>>
> >>>I tested patch for sssd-1-9 branch on RHEL6
> >>>and it works fine. I think it is good alternative for sssd-1-9 branch.
> >>>
> >>>ACK
> >>>
> >>>LS
> >>
> >>I would personally prefer to have this in master as well, even if the
> >>currently known issues will be fixed with your patch ([PATCH]
> >>mmap_cache: Use two chains for hash collision). I think it hurts
> >>nothing to be more defensive.
> >Your patch is not about defensive coding. The patch is about fixing corrupted
> >cache in back-compatible way. In my opinion, mmap_cache source code looks
> >much
> >more complicated and readability is decreased.
> >
> >LS
>
> I do not have strong opinion on this, so I am fine either way.
>
> Michal
Before pushing the patches, I have two questions:
1) Do these patches cover the same error as the ones Lukas has on the
list now?
2) About this change:
diff --git a/src/sss_client/nss_mc_common.c b/src/sss_client/nss_mc_common.c
index a0a70ab..6764d80 100644
--- a/src/sss_client/nss_mc_common.c
+++ b/src/sss_client/nss_mc_common.c
@@ -190,6 +190,10 @@ errno_t sss_nss_mc_get_record(struct sss_cli_mc_ctx *ctx,
int count;
int ret;
+ if (!MC_SLOT_WITHIN_BOUNDS(slot, ctx->dt_size)) {
+ return EINVAL;
+ }
+
/* try max 5 times */
for (count = 5; count > 0; count--) {
rec = MC_SLOT_TO_PTR(ctx->data_table, slot, struct sss_mc_rec);
All paths that call sss_nss_mc_get_record() also check for
MC_SLOT_WITHIN_BOUNDS. Does it make sense to check both?
_______________________________________________
sssd-devel mailing list
[email protected]
https://lists.fedorahosted.org/mailman/listinfo/sssd-devel