Re: Persistent failures of test050

2019-07-01 Thread Ondřej Kuzník
On Mon, Jul 01, 2019 at 03:07:15PM +0200, Ondřej Kuzník wrote:
> On Tue, Jun 25, 2019 at 04:45:30PM -0700, Quanah Gibson-Mount wrote:
> > --On Saturday, June 22, 2019 2:06 PM -0700 Quanah Gibson-Mount
> >  wrote:
> > 
> >> [build@freebsd12 ~/git/openldap-2-4/tests/testrun]$ diff -u server1.out
> >> server3.out
> >> --- server1.out 2019-06-22 18:23:54.93360 +
> >> +++ server3.out 2019-06-22 18:23:55.049209000 +
> >> @@ -1,3 +1,8 @@
> >> +dn: cn=Add-Mod-Del,dc=example,dc=com
> >> +cn: Add-Mod-Del
> >> +objectClass: organizationalRole
> >> +description: guinea pig
> >> +
> > 
> > There appear to be two separate problems happening in test050.
> 
> I've just seen another issue when the wait times are reduced further so
> that syncrepl session establishment overlaps with write traffic.
> 
> 1. Servers start up and traffic starts coming in towards MMR node 1
> 2. syncrepl session from node 2 with node 1 as the producer is
>    established
> 3. Add/mod/del cycles run on node 1 and are replicated towards node 2
> 4. Node 1 starts to run a syncrepl session towards node 2 (somehow the
>    sid=001 cookie sent is older than the newest modification at the
>    time, but that wouldn't really change things)
> 5. That triggers a present phase and the add is propagated; this then
>    bypasses the sid source checks at the provider and the CSN checks on
>    the consumer, and the entry is actually added

Forgot to mention that the consumer part of the above happens here:
https://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=blob;f=servers/slapd/syncrepl.c;h=28a5e724531ab4e302b4600f0bb8fb883f0de19a;hb=refs/heads/OPENLDAP_REL_ENG_2_4#l3056

> 6. The next add/mod/del cycle starts before the deletion is processed, so
>    the add fails with LDAP_ALREADY_EXISTS and aborts the test.
> 
> It's probably the consumer CSN checks that need to be run again when we
> don't receive a CSN with the PDU (which is what happens in the present
> phase), but that might have to be a '>=' against the contextCSN set
> rather than a strict '>'? Something tells me we also need to deal with
> the present phase delivering several entries with the same CSN.

-- 
Ondřej Kuzník
Senior Software Engineer
Symas Corporation   http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP



Re: Persistent failures of test050

2019-07-01 Thread Ondřej Kuzník
On Tue, Jun 25, 2019 at 04:45:30PM -0700, Quanah Gibson-Mount wrote:
> --On Saturday, June 22, 2019 2:06 PM -0700 Quanah Gibson-Mount
>  wrote:
> 
>> [build@freebsd12 ~/git/openldap-2-4/tests/testrun]$ diff -u server1.out
>> server3.out
>> --- server1.out 2019-06-22 18:23:54.93360 +
>> +++ server3.out 2019-06-22 18:23:55.049209000 +
>> @@ -1,3 +1,8 @@
>> +dn: cn=Add-Mod-Del,dc=example,dc=com
>> +cn: Add-Mod-Del
>> +objectClass: organizationalRole
>> +description: guinea pig
>> +
> 
> There appear to be two separate problems happening in test050.

I've just seen another issue when the wait times are reduced further so
that syncrepl session establishment overlaps with write traffic.

1. Servers start up and traffic starts coming in towards MMR node 1
2. syncrepl session from node 2 with node 1 as the producer is
   established
3. Add/mod/del cycles run on node 1 and are replicated towards node 2
4. Node 1 starts to run a syncrepl session towards node 2 (somehow the
   sid=001 cookie sent is older than the newest modification at the
   time, but that wouldn't really change things)
5. That triggers a present phase and the add is propagated; this then
   bypasses the sid source checks at the provider and the CSN checks on
   the consumer, and the entry is actually added (a sketch of that
   provider-side sid check follows this list)
6. The next add/mod/del cycle starts before the deletion is processed, so
   the add fails with LDAP_ALREADY_EXISTS and aborts the test.
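
Since the sid source check in step 5 may not be familiar, here is a
minimal, self-contained C sketch of the idea only; it is not syncprov
code, and every name, DN and CSN value in it is made up.  In the ordinary
change-forwarding path the provider can skip changes whose entryCSN
carries the consumer's own serverID (sid), so a node never gets its own
writes echoed back, while a present phase ships whole entries and no
equivalent filter runs; that is the bypass described above.

/* Hedged illustration, not syncprov code: the kind of per-change origin
 * check that normally stops a provider from echoing a consumer's own
 * changes back to it.  In the present phase whole entries are shipped,
 * so nothing equivalent applies.  The sid is assumed to be the third
 * '#'-separated field of a CSN string. */
#include <stdio.h>
#include <string.h>

struct change { const char *dn; const char *entry_csn; };

/* Parse the 3-hex-digit sid out of a CSN string; returns -1 on failure. */
static int csn_sid(const char *csn) {
    unsigned int sid;
    const char *p = strchr(csn, '#');
    if (!p || !(p = strchr(p + 1, '#'))) return -1;
    return sscanf(p + 1, "%3x", &sid) == 1 ? (int)sid : -1;
}

/* Persist/delta path: skip changes the consumer originated itself. */
static void send_changes(const struct change *c, size_t n, int consumer_sid) {
    for (size_t i = 0; i < n; i++) {
        if (csn_sid(c[i].entry_csn) == consumer_sid) {
            printf("skip %s (originated at sid=%03d)\n", c[i].dn, consumer_sid);
            continue;
        }
        printf("send %s\n", c[i].dn);
    }
}

int main(void) {
    const struct change changelog[] = {
        { "cn=Add-Mod-Del,dc=example,dc=com",
          "20190622182354.000001Z#000000#001#000000" },
        { "cn=Other,dc=example,dc=com",
          "20190622182354.000002Z#000000#002#000000" },
    };
    /* Consumer is sid=001: its own add is filtered out here, whereas a
     * present phase would hand it the full entry regardless. */
    send_changes(changelog, 2, 0x001);
    return 0;
}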

It's probably the consumer CSN checks that need to be run again when we
don't receive a CSN with the PDU (which is what happens in the present
phase), but that might have to be a '>=' against the contextCSN set
rather than a strict '>'? Something tells me we also need to deal with
the present phase delivering several entries with the same CSN.
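
To make the '>' vs '>=' question concrete, here is a hedged,
self-contained C sketch (not the slapd syncrepl code) of re-running the
consumer-side check against the entry's own entryCSN when no cookie CSN
accompanies the PDU.  It assumes CSN strings order correctly under plain
string comparison; the skip_equal flag switches between the strict '>'
behaviour and the proposed '>=' one, and all names and CSN values below
are illustrative only.

/* Hedged sketch, not OpenLDAP code: decide whether to apply an entry
 * received during the present phase by comparing its entryCSN against
 * the consumer's stored contextCSN value for the same sid. */
#include <stdio.h>
#include <string.h>

struct csn_entry { const char *sid; const char *csn; };

/* Consumer's current contextCSN set, one value per serverID (sid). */
static const struct csn_entry context_csn[] = {
    { "001", "20190622182354.000001Z#000000#001#000000" },
    { "002", "20190622182350.000000Z#000000#002#000000" },
};

/* Extract the sid field (third '#'-separated token) from a CSN string. */
static int csn_sid(const char *csn, char *sid, size_t len) {
    const char *p = strchr(csn, '#');
    if (!p || !(p = strchr(p + 1, '#'))) return -1;
    p++;
    const char *q = strchr(p, '#');
    size_t n = q ? (size_t)(q - p) : strlen(p);
    if (n + 1 > len) return -1;
    memcpy(sid, p, n);
    sid[n] = '\0';
    return 0;
}

/* Return 1 if an entry carrying entry_csn should be applied, 0 if it
 * should be skipped.  skip_equal = 0 mimics a strict '>' check,
 * skip_equal = 1 the proposed '>=' variant. */
static int should_apply(const char *entry_csn, int skip_equal) {
    char sid[8];
    if (csn_sid(entry_csn, sid, sizeof sid) != 0)
        return 1;                       /* no sid: let it through */
    for (size_t i = 0; i < sizeof context_csn / sizeof *context_csn; i++) {
        if (strcmp(context_csn[i].sid, sid) != 0)
            continue;
        int cmp = strcmp(entry_csn, context_csn[i].csn);
        return skip_equal ? cmp > 0 : cmp >= 0;
    }
    return 1;                           /* unknown sid: apply */
}

int main(void) {
    /* The stale add from the failure: its entryCSN equals the stored
     * contextCSN for sid 001, so only the '>=' variant drops it. */
    const char *stale = "20190622182354.000001Z#000000#001#000000";
    const char *fresh = "20190622182355.000002Z#000000#001#000000";

    printf("stale, strict '>' : %d\n", should_apply(stale, 0)); /* 1 = applied */
    printf("stale, '>=' check : %d\n", should_apply(stale, 1)); /* 0 = skipped */
    printf("fresh, '>=' check : %d\n", should_apply(fresh, 1)); /* 1 = applied */
    return 0;
}

With the '>=' variant, any further legitimate entry that happens to share
that same CSN would also be skipped, which is exactly the concern raised
in the paragraph above.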

Regards,

-- 
Ondřej Kuzník
Senior Software Engineer
Symas Corporation   http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP



Re: Persistent failures of test050

2019-06-28 Thread Quanah Gibson-Mount
--On Thursday, June 27, 2019 1:09 PM +0200 Ondřej Kuzník 
 wrote:



> Not sure the above is the same failure I'm seeing, so will outline mine
> (reproduced on master+ITS#9043 logging):


I was just badly summarizing our earlier discussions; it was the same
thing. ;)


--Quanah


--

Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:





Re: Persistent failures of test050

2019-06-27 Thread Ondřej Kuzník
On Tue, Jun 25, 2019 at 04:45:30PM -0700, Quanah Gibson-Mount wrote:
> --On Saturday, June 22, 2019 2:06 PM -0700 Quanah Gibson-Mount
>  wrote:
> There appear to be two separate problems happening in test050.
> 
> Problem #1) A null cookie is generated, causing catastrophic database
> loss across the entire MMR cluster (all nodes lose all their data).
> This is new with 2.4.48, perhaps related to the revert of part of
> ITS#8281 when ITS#9015 was fixed (purely speculation on my part at the
> moment).  This appears to be a major regression.

Not sure the above is the same failure I'm seeing, so will outline mine
(reproduced on master+ITS#9043 logging):
- all servers start with nothing but replicated cn=config
- the database is configured on server1, including syncprov and
  syncrepl, and it replicates to the others
- server2 contacts server1 to start replicating and starts a present
  phase
- server1 contacts server2 to do the same while server2 is still in its
  present phase; somehow server2 has decided to attach its own CSNs to
  entries, so it sees a 002 contextCSN and the present phase finishes
  prematurely (server2 doesn't have all the data yet)
- the result is that server1 loses a large part of its database while
  server2 is fine, and both think they're in sync

No idea yet why or when server2 generates its own CSN for (some of?) the
entries. Sounds a bit like ITS#8125 to me.

If it thought there was no CSN, things might be OK; we might have to
reject new consumers while we know we're in the middle of processing an
inbound refresh (i.e. we have modified the DB but not yet updated the
contextCSN). If we haven't, we could send the entries as we go. That way
multiple servers could reasonably be in present phase with each other at
the same time, safely?
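
As a rough illustration of the "reject new consumers during an inbound
refresh" idea, here is a small C sketch with entirely hypothetical names;
it is not a patch against slapd.  A flag marks the window in which the DB
has been modified but the contextCSN has not yet been updated, and new
sync consumers are answered with LDAP_BUSY until it clears, so they retry
once the contextCSN is trustworthy again.

/* Hedged sketch, hypothetical names throughout: refuse to start a sync
 * session for a new consumer while this node is still applying an
 * inbound refresh from a peer. */
#include <stdio.h>
#include <stdatomic.h>

#define LDAP_BUSY 0x33              /* standard LDAP resultCode 51 */

static atomic_int inbound_refresh_in_progress;

/* Called when this node starts/finishes consuming a refresh from a peer. */
static void refresh_begin(void) {
    atomic_store(&inbound_refresh_in_progress, 1);
}
static void refresh_end(void) {
    atomic_store(&inbound_refresh_in_progress, 0);
}

/* Called when a peer opens a syncrepl search against us: defer it while
 * our contextCSN cannot be trusted, serve it otherwise. */
static int accept_sync_consumer(int consumer_sid) {
    if (atomic_load(&inbound_refresh_in_progress)) {
        printf("sid=%03d: inbound refresh in progress, deferring (LDAP_BUSY)\n",
               consumer_sid);
        return LDAP_BUSY;           /* consumer retries later */
    }
    printf("sid=%03d: contextCSN consistent, starting sync session\n",
           consumer_sid);
    return 0;                       /* LDAP_SUCCESS */
}

int main(void) {
    refresh_begin();
    accept_sync_consumer(1);        /* refused while mid-refresh */
    refresh_end();
    accept_sync_consumer(1);        /* accepted once contextCSN is updated */
    return 0;
}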

In the meantime I'll look into why the CSN was generated on server2. It
might take a while to reproduce this again, though.

Regards,

-- 
Ondřej Kuzník
Senior Software Engineer
Symas Corporation   http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP



Re: Persistent failures of test050

2019-06-26 Thread Quanah Gibson-Mount
--On Tuesday, June 25, 2019 5:45 PM -0700 Quanah Gibson-Mount 
 wrote:



> Problem #2) If an MMR node is processing a change when a slapd
> shutdown is initiated, it will update the database's contextCSN but
> LOSE the related change (at least with a delete op), resulting in a
> database difference.  This is reproducible in 2.4.47 as well (so this
> is not a regression).


OK, this is just a timing issue: the ldapsearch output used to compare
the DB contents was generated before all the nodes had finished
processing the changes involved.


It might be better to fetch the contextCSN from the initial master after
the changes have completed and ensure that the remaining nodes report the
same contextCSN value before generating the LDIFs for comparison.
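
A hedged sketch of that test change follows, written against plain
libldap in C rather than the actual test050 shell script.  It fetches the
contextCSN of the suffix entry from the node that took the writes, then
polls the other nodes until they report the same value before the
databases are dumped for comparison.  The URIs and suffix are made up,
the search is anonymous, and for brevity only the first contextCSN value
is compared; a full check would compare the whole multi-valued set.

/* Hedged sketch: wait for contextCSN convergence before comparing DBs.
 * Build with something like: cc csnwait.c -lldap -llber */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <ldap.h>

/* Fetch the first contextCSN value of the suffix entry; 0 on success. */
static int fetch_context_csn(const char *uri, char *out, size_t outlen) {
    LDAP *ld = NULL;
    LDAPMessage *res = NULL, *e;
    char *attrs[] = { "contextCSN", NULL };
    struct berval **vals;
    int rc, version = LDAP_VERSION3;

    if (ldap_initialize(&ld, uri) != LDAP_SUCCESS)
        return -1;
    ldap_set_option(ld, LDAP_OPT_PROTOCOL_VERSION, &version);
    rc = ldap_search_ext_s(ld, "dc=example,dc=com", LDAP_SCOPE_BASE,
                           "(objectClass=*)", attrs, 0,
                           NULL, NULL, NULL, 0, &res);
    if (rc == LDAP_SUCCESS && (e = ldap_first_entry(ld, res)) != NULL &&
        (vals = ldap_get_values_len(ld, e, "contextCSN")) != NULL) {
        snprintf(out, outlen, "%.*s", (int)vals[0]->bv_len, vals[0]->bv_val);
        ldap_value_free_len(vals);
    } else {
        rc = -1;                        /* no entry or no contextCSN yet */
    }
    if (res)
        ldap_msgfree(res);
    ldap_unbind_ext_s(ld, NULL, NULL);
    return rc == LDAP_SUCCESS ? 0 : -1;
}

int main(void) {
    /* Hypothetical URIs standing in for the four test050 servers. */
    const char *master = "ldap://localhost:9011";
    const char *replicas[] = { "ldap://localhost:9012",
                               "ldap://localhost:9013",
                               "ldap://localhost:9014" };
    char want[256], got[256];

    if (fetch_context_csn(master, want, sizeof want) != 0)
        return 1;

    for (size_t i = 0; i < sizeof replicas / sizeof *replicas; i++) {
        /* Poll until this replica reports the master's contextCSN. */
        while (fetch_context_csn(replicas[i], got, sizeof got) != 0 ||
               strcmp(got, want) != 0)
            sleep(1);
        printf("%s in sync: %s\n", replicas[i], got);
    }
    /* Only now dump the databases with ldapsearch and diff them. */
    return 0;
}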


--Quanah



--

Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:





Re: Persistent failures of test050

2019-06-25 Thread Quanah Gibson-Mount
--On Saturday, June 22, 2019 2:06 PM -0700 Quanah Gibson-Mount 
 wrote:



> [build@freebsd12 ~/git/openldap-2-4/tests/testrun]$ diff -u server1.out
> server3.out
> --- server1.out 2019-06-22 18:23:54.93360 +
> +++ server3.out 2019-06-22 18:23:55.049209000 +
> @@ -1,3 +1,8 @@
> +dn: cn=Add-Mod-Del,dc=example,dc=com
> +cn: Add-Mod-Del
> +objectClass: organizationalRole
> +description: guinea pig
> +



There appear to be two separate problems happening in test050.

Problem #1) A null cookie is generated, causing catastrophic database
loss across the entire MMR cluster (all nodes lose all their data).  This
is new with 2.4.48, perhaps related to the revert of part of ITS#8281
when ITS#9015 was fixed (purely speculation on my part at the moment).
This appears to be a major regression.


Problem #2) If an MMR node is processing a change when a slapd shutdown
is initiated, it will update the database's contextCSN but LOSE the
related change (at least with a delete op), resulting in a database
difference.  This is reproducible in 2.4.47 as well (so this is not a
regression).


--Quanah

--

Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:





Re: Persistent failures of test050

2019-06-25 Thread Quanah Gibson-Mount
--On Sunday, June 23, 2019 12:59 AM +0200 Michael Ströder 
 wrote:



> On 6/22/19 10:06 PM, Quanah Gibson-Mount wrote:
>> I've noticed that when running test050 in a loop, it often fails, even
>> after increasing the sleep timeout defaults.  Where it fails in the test
>> is inconsistent and which servers differ is inconsistent as well.
>
> I can confirm that it fails on my system too.


Thanks Michael.  There's definitely a bug here; Ondřej will be filing an
ITS on it this week or next as he gets more details on what's occurring.


--Quanah

--

Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:





Re: Persistent failures of test050

2019-06-22 Thread Michael Ströder
On 6/22/19 10:06 PM, Quanah Gibson-Mount wrote:
> I've noticed that when running test050 in a loop, it often fails, even
> after increasing the sleep timeout defaults.  Where it fails in the test
> is inconsistent and which servers differ is inconsistent as well.

I can confirm that it fails on my system too.

Ciao, Michael.



Persistent failures of test050

2019-06-22 Thread Quanah Gibson-Mount
I've noticed that when running test050 in a loop, it often fails, even
after increasing the sleep timeout defaults.  Where it fails in the test
is inconsistent, and which servers differ is inconsistent as well.  I'm
concerned we may have a regression, or perhaps a long-standing issue,
that needs addressing.  I'll continue to investigate and see if I can get
more details on what the issue(s) are.  In my latest run I see:


.
Using ldapmodify to add/modify/delete entries from server 1...
 iteration 1
 iteration 2
 iteration 3
 iteration 4
 iteration 5
 iteration 6
 iteration 7
 iteration 8
 iteration 9
 iteration 10
Waiting 10 seconds for servers to resync...
Using ldapsearch to read all the entries from server 1...
Using ldapsearch to read all the entries from server 2...
Using ldapsearch to read all the entries from server 3...
Using ldapsearch to read all the entries from server 4...
Comparing retrieved entries from server 1 and server 2...
Comparing retrieved entries from server 1 and server 3...
test failed - server 1 and server 3 databases differ
Failed after 3 of 500 iterations


[build@freebsd12 ~/git/openldap-2-4/tests/testrun]$ diff -u server1.out 
server3.out

--- server1.out 2019-06-22 18:23:54.93360 +
+++ server3.out 2019-06-22 18:23:55.049209000 +
@@ -1,3 +1,8 @@
+dn: cn=Add-Mod-Del,dc=example,dc=com
+cn: Add-Mod-Del
+objectClass: organizationalRole
+description: guinea pig
+
dn: cn=All Staff,ou=Groups,dc=example,dc=com
member: cn=Manager,dc=example,dc=com
member: cn=Barbara Jensen,ou=Information Technology 
Division,ou=People,dc=exam



--Quanah


--

Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP: