Re: [Freeipa-users] Replication woes

2013-08-20 Thread Bret Wortman
Okay, now I'm thinking I need to dump all my replicas and start them fresh.
My /var/log/slapd-FOO-COM/errors is filled with messages like this:

NSMMReplicationPlugin - changelog program - agmt=cn=meTogood1.foo.com
(good1:389): CSN 520a4964001d not found, we aren't as up to date,
or we purged
agmt=cn=meTogood1.foo.com (good1:389) - Can't locate CSN
520a4964001d in the changelog (DB rc=-30988). The consumer may need
to be reinitialized.

I assume the consumer is the replica, right? At present, I have two
replicas known to my master that are simply gone. Another is there but they
can't talk. Three more have good communication but I'm getting errors like
these. Is there a good, clean way to just clobber all the replicas and
start over without trashing the DNS and other identity data that is inside
my master and which *is* working? Deleting them from the master hasn't been
working; it tends to hang the master's DNS and other services until I
Ctrl-C out and ipactl restart it.

I'm afraid to venture out without a net here and make things worse.
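
As a rough aid for sizing up the problem, the "Can't locate CSN" messages quoted above can be tallied per replication agreement. The following is only a sketch: the regular expression is modelled on the two messages as they appear in this mail, and the optional quotes around the agreement name are an assumption about how the real errors log may format it. The helper name affected_agreements is made up for illustration.

```python
import re
from collections import Counter

# Modelled on the message quoted in this thread. The optional double quotes
# around the agreement name are an assumption about the on-disk log format.
PATTERN = re.compile(
    r'agmt="?cn=(?P<agmt>[^"\s]+)"?\s+\((?P<host>[^:()]+):(?P<port>\d+)\)'
    r".*?Can't locate CSN (?P<csn>[0-9a-fA-F]+)"
)


def affected_agreements(lines):
    """Count "Can't locate CSN" errors per replication agreement name."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group("agmt")] += 1
    return counts


# Sample input: the message from this thread joined onto one line,
# plus an unrelated line that should be ignored.
sample = [
    "agmt=cn=meTogood1.foo.com (good1:389) - Can't locate CSN 520a4964001d "
    "in the changelog (DB rc=-30988). The consumer may need to be reinitialized.",
    "some unrelated log line",
]

for name, count in affected_agreements(sample).items():
    print(name, count)
```

To run it against a real log, pass an open file object (for example the errors file named above) instead of the sample list.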



*Bret Wortman*

http://damascusgrp.com/
http://about.me/wortmanbret


On Mon, Aug 19, 2013 at 2:21 PM, Bret Wortman
bret.wort...@damascusgrp.com wrote:

 On my master (where this error is occurring), I've got, in /etc/hosts:

 127.0.0.1 localhost localhost.localdomain
 ::1  localhost localhost.localdomain
 1.2.3.4 ipamaster.foo.net ipamaster

 So that should be okay, right?

 # host ipamaster.foo.net
 ipamaster.foo.net has address 1.2.3.4
 # host ipamaster
 ipamaster.foo.net has address 1.2.3.4
 # host localhost
 localhost has address 127.0.0.1
 localhost has IPv6 address ::1
  #

 I checked the other system (the one I can't connect to) to be safe, and
 its /etc/hosts is similarly configured. It even has the master listed with
 its correct IP address.





 On Mon, Aug 19, 2013 at 2:02 PM, Simo Sorce s...@redhat.com wrote:

 On Mon, 2013-08-19 at 13:51 -0400, Bret Wortman wrote:
  So, any idea how to fix the Kerberos problem?
 

 If your server is trying to get a TGT for ldap/localhost, it probably
 means your /etc/hosts file is broken and has a line like this:

 1.2.3.4 localhost my.real.name

 When GSSAPI tries to resolve my.real.name, it gets back 'localhost' as
 the canonical name, so it tries to get a TGT with that name and
 fails.

 If /etc/hosts is fine, then the DNS server may be returning an IP address
 that later resolves to localhost again.

 To fix this, make sure that if your fully qualified name is
 in /etc/hosts, it is on its own line, pointing at the right IP
 address, with the FQDN first on the line,
 e.g.:

 this is ok:
 1.2.3.4 server.full.name server

 this is not:
 1.2.3.4 server server.full.name

 Simo.
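
The two /etc/hosts rules described above (the FQDN comes first on its line, and it does not share a line with localhost) can be checked mechanically. The sketch below is only an illustration of those two rules, not an official FreeIPA tool. The function name check_hosts and the sample host names are made up, and it does not look at DNS at all.

```python
def check_hosts(text, fqdn):
    """Return a list of problems found for `fqdn` in /etc/hosts-style text."""
    problems = []
    for raw in text.splitlines():
        # Drop comments and surrounding whitespace, skip empty lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if fqdn not in names:
            continue
        # Rule 1: the FQDN must be the first name after the address.
        if names[0] != fqdn:
            problems.append(f"{address}: '{names[0]}' is listed before {fqdn}")
        # Rule 2: the FQDN must not share a line with a localhost name.
        if any(name.startswith("localhost") for name in names):
            problems.append(f"{address}: {fqdn} shares a line with localhost")
    return problems


good = "127.0.0.1 localhost localhost.localdomain\n1.2.3.4 server.full.name server\n"
bad = "1.2.3.4 server server.full.name\n"

print(check_hosts(good, "server.full.name"))
print(check_hosts(bad, "server.full.name"))
```

On a real system the text would come from reading /etc/hosts, and the FQDN from whatever name the server was installed with.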
 
 
 
 
  On Mon, Aug 19, 2013 at 12:19 PM, Bret Wortman
  bret.wort...@damascusgrp.com wrote:
  ...and I got the web UI, authentication and sudo back via:
 
 
  # ipactl stop
  # ipactl start
 
 
  Not sure why that worked, but it did. I was grasping at
  straws, honestly.
 
 
 
 
 
 
 
 
 
  On Mon, Aug 19, 2013 at 12:18 PM, Bret Wortman
  bret.wort...@damascusgrp.com wrote:
  Digging further, I think this log entry might be the
  problem between the two servers that aren't talking:
 
 
  slapd_ldap_sasl_interactive_bind - Error: could not
  perform interactive bind for id[] mech [GSSAPI]: LDAP
  error -2 (Local error) (SASL(-1): generic failure:
  GSSAPI Error: Unspecified GSS failure. Minor code may
  provide more information (Server
  ldap/localh...@spx.net not found in Kerberos
  database)) errno 2 (No such file or directory)
 
 
  Did I build something incorrectly when that server was
  set up originally?
 
 
 
 
 
 
 
 
 
 
 
  On Mon, Aug 19, 2013 at 12:02 PM, Bret Wortman
  bret.wort...@damascusgrp.com wrote:
  I ran it on a good master, against a bad one.
  As in, I ran this command on my master IPA
  node:
 
 
  # ipa-replica-manage del --force bad1.foo.net --cleanup
 
 
  Was that wrong? I was trying to delete the bad
  replica from the master, so I figured the
  command needed to be run on the master. But
  again, my master is now in a 

Re: [Freeipa-users] Replication woes

2013-08-20 Thread Bret Wortman
If I were going to attempt to restore to an old backup, what
directories/files should I make sure to restore? I've got a backup script
that tars up:

/usr/share/ipa
/usr/lib64/ipa
/var/lib/ipa
/var/lib/ipa-client
/var/lib/dirsrv
/etc

Is that enough to roll back to a few days ago before I started down this
path? I'm now seeing messages about having the max number of CleanAllRUV
tasks (4) and not being able to enqueue any more. So I'm really stuck now
and don't know how soon I can get the files requested over to Rich for
analysis.




On Tue, Aug 20, 2013 at 9:46 AM, Rich Megginson rmegg...@redhat.com wrote:


 This looks like https://fedorahosted.org/389/ticket/47386

 We've never been able to reproduce this in a controlled environment.

 The original reporter has been able to get this to work in some cases by
 restarting ipa (ipactl restart).

 Before you do that, would you be able to provide some information for me?

 On the supplier and consumer:
 ldapsearch -xLLL -D "cn=directory manager" -W -b "dc=FOO,dc=COM" '(&(objectclass=nstombstone)(nsuniqueid=---))' > ruv.ldif

 ldapsearch -xLLL -D "cn=directory manager" -W -b cn=config '(objectclass=nsds5replicationagreement)' > agmt.ldif

 dbscan -f /var/lib/dirsrv/slapd-FOO-COM/cldb/*.db4 | head -200 > cldb.txt

 Be sure to obscure any sensitive data in ruv.ldif, agmt.ldif, and cldb.txt.
 You can either attach them to https://fedorahosted.org/389/ticket/47386 or
 email them to me directly.
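
One possible way to obscure values in the LDIF output before sharing it is sketched below. This is only an assumption about what might count as sensitive: the attribute list (userPassword, nsDS5ReplicaCredentials) is a guess and should be extended to fit the deployment, and the function name redact_ldif is made up. The sketch also drops folded continuation lines of a redacted value so that no part of it leaks. It does not replace reading the files by eye before sending them.

```python
# Attribute names treated as sensitive, compared case-insensitively.
# This list is an assumption; extend it for your own deployment.
SENSITIVE = ("userpassword", "nsds5replicacredentials")


def redact_ldif(text, attrs=SENSITIVE):
    """Replace the values of sensitive attributes in LDIF text."""
    out = []
    skipping = False
    for line in text.splitlines():
        # LDIF folds long values onto following lines that start with a space.
        if line.startswith(" "):
            if not skipping:
                out.append(line)
            continue
        name, sep, _value = line.partition(":")
        skipping = bool(sep) and name.lower() in attrs
        out.append(f"{name}: REDACTED" if skipping else line)
    return "\n".join(out)


# Made-up sample entry for illustration only.
sample = (
    "dn: cn=meTogood1.foo.com,cn=replica\n"
    "nsDS5ReplicaCredentials: {DES}abc\n"
    " xyz\n"
    "nsDS5ReplicaHost: good1.foo.com\n"
)

print(redact_ldif(sample))
```

Host names and DNs are left untouched here; whether those also need masking is a judgment call for whoever owns the data.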






Re: [Freeipa-users] Replication woes

2013-08-20 Thread JR Aquino
Any help you could provide in capturing the fail-state would be hugely 
appreciated.

I've found that once you work through the issue and fix the problem, it doesn't
appear to be deliberately reproducible.

If you can get the debugging data that Rich needs, I can work on drafting you
a basic howto on diagnosing and fixing your replication issue.

