Thanks Jakub. I deleted /var/lib/sss/db and restarted sssd, but that still did not fix it: "id user" returned "Not found", and the user IDs still showed as nobody. I then set debug_level = 9 and found the following errors in sssd_nss.log:
...
(Wed Apr 13 03:09:46 2016) [sssd[nss]] [sss_dp_get_reply] (0x1000): Got reply from Data Provider - DP error code: 1 errno: 11 error message: Fast reply - offline
(Wed Apr 13 03:09:46 2016) [sssd[nss]] [nss_cmd_getby_dp_callback] (0x0040): Unable to get information from Data Provider Error: 1, 11, Fast reply - offline
...

What does that mean? "service sssd status" shows sssd running fine, and I have just run ldapsearch, which returned all the correct information.

Thank you very much.

Kind regards,

- h

On Wed, Apr 13, 2016 at 5:30 PM, Jakub Hrozek <jhro...@redhat.com> wrote:

> On Wed, Apr 13, 2016 at 10:52:15AM +1000, jupiter wrote:
> > Hi Jakub,
> >
> > Thanks for your response, please see following embedded comments.
> >
> > On Tue, Apr 12, 2016 at 6:24 PM, Jakub Hrozek <jhro...@redhat.com> wrote:
> >
> > > On Tue, Apr 12, 2016 at 11:03:47AM +1000, jupiter wrote:
> > > > Hi,
> > > >
> > > > We are running sssd version 1.12.4-47 on CentOS 6. It works fine in
> > > > general, but from time to time, some nodes listed all user ids with
> > > > "nobody",
> > >
> > > Was this problem happening only on an NFS share..?
> >
> > I don't think it is an NFS issue, it is an SSS issue.
> >
> > > > calling id username immediately returned "No such user",
> > >
> > > Hmm, I guess not, this sounds like a generic issue, if id
> > > couldn't find the user either.
> >
> > The user is fine, I can run "id username" in another healthy node without
> > any problems.
> >
> > > > it looks
> > > > the id went to cache and did not contact the LDAP.
> > >
> > > Please note that if the user was looked up at least once before, then
> > > even if SSSD couldn't contact the server for one reason or another, it
> > > should have returned entries from the cache.
> >
> > Once again, the user id is fine, we can verify from other healthy nodes.
> > Besides, when the node is fixed by adding debug_level = 6, everything is
> > back to normal.
> >
> > > > On one occasion, I added debug_level = 6 to the sssd.conf, restarted sssd,
> > > > the "nobody" was gone and id username returned the correct LDAP user id. It
> > > > did not make any sense to me how adding a debug_level could fix the
> > > > problem.
> > >
> > > I suspect it was actually the restart, because the restart might cause
> > > sssd to reconnect to servers and operate online.
> >
> > But prior to that change, I restarted sssd a dozen times, and nothing could fix it
> > until I changed debug_level = 6, which fixed the issue, but it did not make
> > any sense to me.
> >
> > > What you can do, if for some reason running with debugging enabled all
> > > the time is not practical, is use the sss_debuglevel tool to bump
> > > debugging on the fly.
> > >
> > > But at any rate, we need to see the sssd logs to proceed.
> >
> > The error in log file was nss_getpwnam: name 'dhpec' not found in domain
> > 'hpc.org'. It seems to me sssd simply got information from the invalid
> > cache, not from the ldap.
> >
> > > > I could smell the issue from sssd cache, but I have no idea since
> > > > the all default cache setting only for some seconds, but when the node
> > > > caught in that problem, it can sit for many days with uids in nobody, id
> > > > returns no such user.
> > > >
> > > > After searching from Internet, someone suggested to run sss_cache -E to
> > > > invalidate all cached entries would solve the problem, I tried, it did not
> > > > work.
> > >
> > > Well, if sssd was offline at that time, then invalidating the cache
> > > wouldn't help, because sssd wouldn't have a way to fetch the data from..
> >
> > I checked sssd process before running sss_cache -E, the sssd was always on
> > line. My question is, how do you verify if the cache has been cleaned? Or
> > you simply delete /var/lib/sss/db?
>
> sss_cache just expires user entries.
> If you want to really remove the
> cache (careful, though..) then yes, at the moment you need to remove the
> .ldb files under /var/lib/sss/db.
>
> The next version of sss_cache should also have an option to really
> delete the cache.
> _______________________________________________
> sssd-users mailing list
> sssd-users@lists.fedorahosted.org
> https://lists.fedorahosted.org/admin/lists/sssd-users@lists.fedorahosted.org
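
A note for anyone reading this thread in the archive: the two sssd_nss.log lines quoted at the top suggest that the NSS responder was told by the Data Provider (the sssd backend) that it was offline. That would be consistent with lookups failing even though the sssd service itself is running. As a rough way to spot this condition in a debug log, here is a minimal Python sketch. The regular expression is inferred only from the two log lines quoted in this thread, and the function name find_offline_replies is made up for illustration; neither comes from sssd itself.

```python
import re

# Matches the debug-log layout seen in the two lines quoted in this thread:
#   (timestamp) [sssd[responder]] [function] (0xLEVEL): message
# This pattern is an assumption based on those two lines only.
LOG_LINE = re.compile(
    r"\((?P<time>[^)]*)\)\s+"
    r"\[(?P<process>sssd\[[^\]]*\])\]\s+"
    r"\[(?P<function>[^\]]+)\]\s+"
    r"\((?P<level>0x[0-9a-fA-F]+)\):\s+"
    r"(?P<message>.*)"
)


def find_offline_replies(lines):
    """Return parsed entries whose message mentions an offline reply.

    Hypothetical helper: it only looks for the word "offline" in the
    message part of lines that match LOG_LINE.
    """
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and "offline" in match.group("message").lower():
            hits.append(match.groupdict())
    return hits


if __name__ == "__main__":
    # The two log lines reported in this thread.
    sample = [
        "(Wed Apr 13 03:09:46 2016) [sssd[nss]] [sss_dp_get_reply] (0x1000): "
        "Got reply from Data Provider - DP error code: 1 errno: 11 "
        "error message: Fast reply - offline",
        "(Wed Apr 13 03:09:46 2016) [sssd[nss]] [nss_cmd_getby_dp_callback] "
        "(0x0040): Unable to get information from Data Provider "
        "Error: 1, 11, Fast reply - offline",
    ]
    for entry in find_offline_replies(sample):
        print(entry["time"], entry["function"], entry["message"])
```

To run it against a real log, pass an open file object (for example the sssd_nss.log mentioned above) to find_offline_replies instead of the sample list. If it reports hits, the domain log for the backend is probably the next place to look for why sssd considers itself offline.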