Re: [Freeipa-users] Replication woes
Okay, now I'm thinking I need to dump all my replicas and start them fresh. My /var/log/slapd-FOO-COM/errors is filled with messages like this:

    NSMMReplicationPlugin - changelog program - agmt=cn=meTogood1.foo.com (good1:389): CSN 520a4964001d not found, we aren't as up to date, or we purged
    agmt=cn=meTogood1.foo.com (good1:389) - Can't locate CSN 520a4964001d in the changelog (DB rc=-30988). The consumer may need to be reinitialized.

I assume the consumer is the replica, right?

At present, I have two replicas known to my master that are simply gone. Another is there but they can't talk. Three more have good communication but I'm getting errors like these.

Is there a good, clean way to just clobber all the replicas and start over without trashing the DNS and other identity data that is inside my master and which *is* working? Deleting them from the master hasn't been working; it tends to hang the master's DNS and other services until I Ctrl-C out and ipactl restart it.

I'm afraid to venture out without a net here and make things worse.

Bret Wortman
http://damascusgrp.com/
http://about.me/wortmanbret

On Mon, Aug 19, 2013 at 2:21 PM, Bret Wortman bret.wort...@damascusgrp.com wrote:

On my master (where this error is occurring), I've got, in /etc/hosts:

    127.0.0.1 localhost localhost.localdomain
    ::1 localhost localhost.localdomain
    1.2.3.4 ipamaster.foo.net ipamaster

So that should be okay, right?

    # host ipamaster.foo.net
    ipamaster.foo.net has address 1.2.3.4
    # host ipamaster
    ipamaster.foo.net has address 1.2.3.4
    # host localhost
    localhost has address 127.0.0.1
    localhost has IPv6 address ::1
    #

I checked the other system (the one I can't connect to) to be safe, and its /etc/hosts is similarly configured. It even has the master listed with its correct IP address.
On Mon, Aug 19, 2013 at 2:02 PM, Simo Sorce s...@redhat.com wrote:

On Mon, 2013-08-19 at 13:51 -0400, Bret Wortman wrote:

So, any idea how to fix the Kerberos problem?

If your server is trying to get a TGT for ldap/localhost, it probably means your /etc/hosts file is broken and has a line like this:

    1.2.3.4 localhost my.real.name

When GSSAPI tries to resolve my.real.name, it gets back 'localhost' as the canonical name, so it tries to get a TGT with that name and fails.

If /etc/hosts is fine, then the DNS server may be returning an IP address that later resolves to localhost again.

To unbreak it, make sure that if your fully qualified name is in /etc/hosts, it is on its own line, pointing at the right IP address, with the FQDN first on the line. E.g.:

this is ok:
    1.2.3.4 server.full.name server

this is not:
    1.2.3.4 server server.full.name

Simo.

On Mon, Aug 19, 2013 at 12:19 PM, Bret Wortman bret.wort...@damascusgrp.com wrote:

...and I got the web UI, authentication and sudo back via:

    # ipactl stop
    # ipactl start

Not sure why that worked, but it did. I was grasping at straws, honestly.

On Mon, Aug 19, 2013 at 12:18 PM, Bret Wortman bret.wort...@damascusgrp.com wrote:

Digging further, I think this log entry might be the problem between the two servers that aren't talking:

    slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id[] mech [GSSAPI]: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Server ldap/localh...@spx.net not found in Kerberos database)) errno 2 (No such file or directory)

Did I build something incorrectly when that server was set up originally?
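
For anyone who wants to check Simo's /etc/hosts rule mechanically rather than by eye, here is a minimal sketch. The function name check_hosts and its output format are made up for illustration; it only does exact, case-sensitive name matching, and it treats any 127.x or ::1 address as "not the right IP", which is an assumption on my part rather than something stated in the thread.

```shell
#!/bin/sh
# check_hosts FQDN [HOSTS_FILE]   (hypothetical helper, not part of IPA)
# Prints "ok:" or "bad:" for every hosts entry that mentions FQDN.
# Returns 0 only if FQDN appears, is the first name on each of its lines,
# and none of those lines use a loopback address.
check_hosts() (
    fqdn=$1
    file=${2:-/etc/hosts}
    awk -v fqdn="$fqdn" '
        { sub(/#.*/, "") }                  # strip comments
        NF < 2 { next }                     # blank or address-only line
        {
            hit = 0
            for (i = 2; i <= NF; i++)
                if ($i == fqdn && !hit) hit = i
            if (!hit) next                  # line does not mention the FQDN
            seen = 1
            if (hit == 2 && $1 !~ /^127\./ && $1 != "::1") {
                print "ok: " $0
            } else {
                print "bad: " $0
                bad = 1
            }
        }
        END { if (seen && !bad) exit 0; exit 1 }
    ' "$file"
)
```

Something like `check_hosts ipamaster.foo.net` on each server should then flag both of the broken layouts Simo describes (FQDN after the short name, or FQDN sharing a line with localhost).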
On Mon, Aug 19, 2013 at 12:02 PM, Bret Wortman bret.wort...@damascusgrp.com wrote:

I ran it on a good master, against a bad one. As in, I ran this command on my master IPA node:

    # ipa-replica-manage del --force bad1.foo.net --cleanup

Was that wrong? I was trying to delete the bad replica from the master, so I figured the command needed to be run on the master. But again, my master is now in a
Re: [Freeipa-users] Replication woes
If I were going to attempt to restore to an old backup, what directories/files should I make sure to restore? I've got a backup script that tars up:

    /usr/share/ipa
    /usr/lib64/ipa
    /var/lib/ipa
    /var/lib/ipa-client
    /var/lib/dirsrv
    /etc

Is that enough to roll back to a few days ago, before I started down this path? I'm now seeing messages about having the max number of CleanAllRUV tasks (4) and not being able to enqueue any more. So I'm really stuck now and don't know how soon I can get the requested files over to Rich for analysis.

Bret Wortman
http://damascusgrp.com/
http://about.me/wortmanbret

On Tue, Aug 20, 2013 at 9:46 AM, Rich Megginson rmegg...@redhat.com wrote:

On 08/20/2013 05:55 AM, Bret Wortman wrote:

Okay, now I'm thinking I need to dump all my replicas and start them fresh. My /var/log/slapd-FOO-COM/errors is filled with messages like this:

    NSMMReplicationPlugin - changelog program - agmt=cn=meTogood1.foo.com (good1:389): CSN 520a4964001d not found, we aren't as up to date, or we purged
    agmt=cn=meTogood1.foo.com (good1:389) - Can't locate CSN 520a4964001d in the changelog (DB rc=-30988). The consumer may need to be reinitialized.

I assume the consumer is the replica, right?

At present, I have two replicas known to my master that are simply gone. Another is there but they can't talk. Three more have good communication but I'm getting errors like these.

Is there a good, clean way to just clobber all the replicas and start over without trashing the DNS and other identity data that is inside my master and which *is* working? Deleting them from the master hasn't been working; it tends to hang the master's DNS and other services until I Ctrl-C out and ipactl restart it.

I'm afraid to venture out without a net here and make things worse.

This looks like https://fedorahosted.org/389/ticket/47386

We've never been able to reproduce this in a controlled environment.
The original reporter has been able to get this to work in some cases by restarting ipa (ipactl restart). Before you do that, would you be able to provide some information for me? On the supplier and consumer:

    ldapsearch -xLLL -D "cn=directory manager" -W -b dc=FOO,dc=COM '(&(objectclass=nstombstone)(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff))' > ruv.ldif
    ldapsearch -xLLL -D "cn=directory manager" -W -b cn=config '(objectclass=nsds5replicationagreement)' > agmt.ldif
    dbscan -f /var/lib/dirsrv/slapd-FOO-COM/cldb/*.db4 | head -200 > cldb.txt

Be sure to obscure any sensitive data in ruv.ldif, agmt.ldif, and cldb.txt. You can either attach them to https://fedorahosted.org/389/ticket/47386 or email them to me directly.
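
Since Rich's three data-collection commands lost their quoting, the "&" in the filter, and the output redirects on the way through the list archive, here is a small sketch that prints them back out for a given realm so they can be reviewed before running. Everything it fills in is an assumption: the function name collect_cmds is made up, the nsuniqueid value is the usual all-f's RUV tombstone ID rather than something visible in the archived text, and the suffix and instance name are simply derived from the realm the way FOO.COM maps to dc=FOO,dc=COM and slapd-FOO-COM in this thread.

```shell
#!/bin/sh
# collect_cmds REALM   (hypothetical helper; prints commands, runs nothing)
collect_cmds() (
    realm=$1
    # FOO.COM -> FOO-COM (dirsrv instance) and dc=FOO,dc=COM (suffix).
    instance=$(printf '%s' "$realm" | tr '.' '-')
    suffix="dc=$(printf '%s' "$realm" | sed 's/\./,dc=/g')"
    # Unquoted heredoc: only $suffix and $instance are expanded below.
    cat <<EOF
ldapsearch -xLLL -D "cn=directory manager" -W -b "$suffix" '(&(objectclass=nstombstone)(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff))' > ruv.ldif
ldapsearch -xLLL -D "cn=directory manager" -W -b cn=config '(objectclass=nsds5replicationagreement)' > agmt.ldif
dbscan -f /var/lib/dirsrv/slapd-$instance/cldb/*.db4 | head -200 > cldb.txt
EOF
)
```

Running `collect_cmds FOO.COM` on the supplier and on each consumer gives the same three lines to paste. Note that the dbscan line keeps the *.db4 glob from the original request, so it presumably expects a single changelog file in that directory.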
Re: [Freeipa-users] Replication woes
On Aug 20, 2013, at 6:46 AM, Rich Megginson rmegg...@redhat.com wrote:

On 08/20/2013 05:55 AM, Bret Wortman wrote:

Okay, now I'm thinking I need to dump all my replicas and start them fresh. My /var/log/slapd-FOO-COM/errors is filled with messages like this:

    NSMMReplicationPlugin - changelog program - agmt=cn=meTogood1.foo.com (good1:389): CSN 520a4964001d not found, we aren't as up to date, or we purged
    agmt=cn=meTogood1.foo.com (good1:389) - Can't locate CSN 520a4964001d in the changelog (DB rc=-30988). The consumer may need to be reinitialized.

I assume the consumer is the replica, right?

At present, I have two replicas known to my master that are simply gone. Another is there but they can't talk. Three more have good communication but I'm getting errors like these.

Is there a good, clean way to just clobber all the replicas and start over without trashing the DNS and other identity data that is inside my master and which is working? Deleting them from the master hasn't been working; it tends to hang the master's DNS and other services until I Ctrl-C out and ipactl restart it.

I'm afraid to venture out without a net here and make things worse.

This looks like https://fedorahosted.org/389/ticket/47386

We've never been able to reproduce this in a controlled environment. The original reporter has been able to get this to work in some cases by restarting ipa (ipactl restart). Before you do that, would you be able to provide some information for me?
On the supplier and consumer:

    ldapsearch -xLLL -D "cn=directory manager" -W -b dc=FOO,dc=COM '(&(objectclass=nstombstone)(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff))' > ruv.ldif
    ldapsearch -xLLL -D "cn=directory manager" -W -b cn=config '(objectclass=nsds5replicationagreement)' > agmt.ldif
    dbscan -f /var/lib/dirsrv/slapd-FOO-COM/cldb/*.db4 | head -200 > cldb.txt

Be sure to obscure any sensitive data in ruv.ldif, agmt.ldif, and cldb.txt. You can either attach them to https://fedorahosted.org/389/ticket/47386 or email them to me directly.

Any help you could provide in capturing the fail-state would be hugely appreciated. I've found that if you work through the issue and fix the problem, it doesn't appear to be deliberately reproducible. If you can get the debugging data that Rich needs, I can work on drafting you a basic howto on how to diagnose and fix your replication issue.
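
On the "obscure any sensitive data" step: one way to do it without hand-editing is a small filter that blanks chosen attributes in the LDIF before it is attached to the ticket. This is only a sketch under my own assumptions. The name redact_ldif is made up, and which attributes count as sensitive is your call; nsDS5ReplicaCredentials is used below purely as an example of something you would not want to post.

```shell
#!/bin/sh
# redact_ldif ATTR[,ATTR...] < in.ldif > out.ldif   (hypothetical helper)
# Replaces the value of each listed attribute with REDACTED. Attribute
# names are matched case-insensitively; base64 values ("attr:: ...") and
# folded continuation lines (starting with a space) are dropped as well.
redact_ldif() (
    awk -v attrs="$1" '
        BEGIN {
            n = split(tolower(attrs), a, ",")
            for (i = 1; i <= n; i++) hide[a[i]] = 1
        }
        /^ / { if (!skip) print; next }     # continuation of previous line
        {
            skip = 0
            name = tolower($0)
            sub(/:.*/, "", name)            # text before the first colon
            if (name in hide) {
                print substr($0, 1, length(name)) ": REDACTED"
                skip = 1                    # also drop its continuation lines
                next
            }
            print
        }
    '
)
```

For example, `redact_ldif nsDS5ReplicaCredentials < agmt.ldif > agmt.clean.ldif`. It does nothing for cldb.txt, which is not LDIF and would still need to be checked by hand.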