Re: prblems rebuilding conversations db

2019-01-24 Thread Michael Menge

Hi

Quoting ellie timoney :
...

On Tue, Jan 22, 2019, at 12:32 AM, Michael Menge wrote:

...

Quoting ellie timoney :

...

> On Wed, Jan 16, 2019, at 10:12 PM, Michael Menge wrote:
>> Hi,
>>
>> because conversations db seems to be required for search indexes, I
>> enabled this option
>> on our production servers today and tried to rebuild the conversations
>> db for all users with
>>
>> ctl_conversationsdb -v -b UID
>>
>> For most users this did take less than a second. But for some users
>> this process would
>> not finish. I did kill one process after about 30 Minutes (most others
>> after 3 Minutes).
>> The UserID.conversations has grown over 2 GB (the mailbox itself has
>> only ~700 MB of mails,
>> and the conversations files from finnished rebuilds are less then 20 MB)

...

I did run ctl_conversationsd -b with strace and discovered that the inbox
(cyrus.index O_RDWR, cyrus.header O_RDONLY, cache  cyrus.O_RDWR )  
of that user

was opened over and over again, no other folder for that user was accessed.
For other users the inbox was only opened once and then the other
folders followed.
So not stopping "ctl_conversationsd -b", it would have run till my
filesystem was
full.



I guess something about that inbox is confusing it, and making it  
redo it over and over?  Very curious



...

> Though, that raises the question: was the user accessing the mailbox
> while the rebuild was in progress?  I wonder if their client is
> doing something funny and tripping things up?

We did this while the server was accessible (didn't want to have
several hour downtime for ~44000 users)


Fair enough!


But as I did my recent testing on a test-server without user access,
only ctl_conversationsdb and cyr_dbtool did accessed the mailbox during
the conversation rebuild mentioned in this mail. So I don't think that
user access is part of the problem.


Thanks for testing that, good to rule it out as a culprit


So for me it looks like we have two problems:

1. multiple entries for the same key in skiplist db files.
As skiplist is a Key/Value store this should not be possible
to have duplicate keys?


I don't know the internals of our database formats well, but I  
_think_ maybe it writes new key/value pairs by appending to the end  
of the database, and then updating the skiplist structure to point  
to the newer version instead of the older version.  If this is  
correct, then if it's rewriting the keys over and over for some  
reason, I would expect the raw file to contain a lot of old,  
unlinked records for the same key, but which aren't visible over the  
API.  I believe there's a process that runs occasionally that  
rewrites these databases with only the "live" keys and gets back the  
wasted space?




I did discover, because of the loop (see below) the point of
database commit was not reached, so I guess that could explain
that there was no duplicated key cleanup done.


Is twoskip a better alternative?


This is before my time, but twoskip was written to solve problems  
FastMail experienced with using skiplist with conversations  
("post-crash recovery was too slow").  So if there's a pathological  
case in the more recent conversations code that trips a bug in  
skiplist, we probably wouldn't see it.  So, I'd recommend using  
twoskip for conversationsdb just based on that :)


Same problem with twoskip, but as I said above, the commit point
was not reached.




2. Why didn't ctl_conversationsdb continue to process
the next mailbox but re-did the INBOX / the last mailbox


It kinda sounds to me like it considered the operation to be  
unsuccessful, and tried again?  Not sure what would cause this.  Is  
there anything interesting in syslog?



I will try to run a debugger on ctl_conversationsdb and
will test with twoskip as conversations db format




The program loops in build_cid_cb (imap/ctl_conversationsdb.c:189)

For the problematic mailbox that I tested, for every message
record->cid was NULLCONVERSATION, so mailbox_cacherecord,
message_update_conversations and mailbox_rewrite_index_record
where called, each returned 0.

After iterating trough all messages in the mailbox count was > 0, and r==0,
so the while condition (!r && count) was true for the next run.
At this point record->cid was still NULLCONVERSATION for every message,
which I guess should not be the case.

Thanks, it'd be very interesting to see if this issue reproduces  
with the twoskip format!




Yes, it did.


I did notice one other problem, recreating the conversations db
on the replica confused the sync protocol as the rebuild did increase
the modseq.


Do you mean to say that this user's conversations db could be  
rebuilt successfully on the replica?  Or was this with one of the  
"good" users?




I did activated the conversation db first on the replica (to watch the  
performance

impact and time needed) and did the rebuild with the -r flag.

I didn't notice an endless loop on the replic, but remember the 

Re: prblems rebuilding conversations db

2019-01-21 Thread ellie timoney
Hi Michael,

On Tue, Jan 22, 2019, at 12:32 AM, Michael Menge wrote:
> Hi Ellie
> 
> Quoting ellie timoney :
> 
> > Hi Michael,
> >
> > On Wed, Jan 16, 2019, at 10:12 PM, Michael Menge wrote:
> >> Hi,
> >>
> >> because conversations db seems to be required for search indexes, I
> >> enabled this option
> >> on our production servers today and tried to rebuild the conversations
> >> db for all users with
> >>
> >> ctl_conversationsdb -v -b UID
> >>
> >> For most users this did take less than a second. But for some users
> >> this process would
> >> not finish. I did kill one process after about 30 Minutes (most others
> >> after 3 Minutes).
> >> The UserID.conversations has grown over 2 GB (the mailbox itself has
> >> only ~700 MB of mails,
> >> and the conversations files from finnished rebuilds are less then 20 MB)
> >
> > So, a 2GB-and-growing conversations db for a 700MB mailbox is a  
> > ratio of ~3:1, which seems obscene...
> >
> 
> thanks for confirming that this is not expected.
> 
> > For comparison, the Conversations.append_reply_200 cassandane test  
> > (two messages with 100 replies each) produces an 87Kb conversations  
> > db for a 1016Kb mailbox (ratio ~1:12, n.b inverted!), which makes  
> > your ~3:1 and growing ratio seem even more obscene...
> >
> > Obviously you can't share the bad conversations db because  
> > conversations db's are full of personally-identifying information.   
> > But I wonder if looking through it with `cyr_dbtool ... show` turns  
> > up any unusual patterns, especially in comparison to one of the good  
> > users?  I wonder if it's somehow repeating itself?
> 
> I was unable to run "cyr_dbtool ... show" or "ctl_conversationsdb -d"
> while ctl_conversationsdb -v -b was still running
> (waiting for a lock on conversations db file)
> 
> After aborting "ctl_conversationsdb -b" "cyr_dbtool show" didn't show any
> duplicates. But the string command showed more occurrences than cyr_dbtool
> did (i used an hash value (d8087b76083ec5e7) I found in the output of  
> cyr_dbtool
> 
> 
> # cyr_dbtool /path-to/userid.conversations skiplist show | grep  
> d8087b76083ec5e7 | wc -l
> 3
> # cyr_dbtool /path-to/userid.conversations skiplist show | grep  
> Bd8087b76083ec5e7 | wc -l
> 1
> # strings /path-to/userid.conversations | grep d8087b76083ec5e7 | wc -l
> 183
> # strings /path-to/userid.conversations | grep Bd8087b76083ec5e7 | wc -l
> 91
> 
> ctl_conversationsdb -d  did a cleanup because I did abort the  
> "ctl_conversationsdb -b" process
> 
> skiplist recovery /path-to/userid.conversations: found partial txn,  
> not replaying
> skiplist: recovered /path-to/userid.conversations (0 records, 144  
> bytes) in 0 seconds
> skiplist: checkpointed /path-to/userid.conversations (0 records, 144  
> bytes) in 0.010 sec
> 
> 
> I did run ctl_conversationsd -b with strace and discovered that the inbox
> (cyrus.index O_RDWR, cyrus.header O_RDONLY, cache  cyrus.O_RDWR ) of that user
> was opened over and over again, no other folder for that user was accessed.
> For other users the inbox was only opened once and then the other  
> folders followed.
> So not stopping "ctl_conversationsd -b", it would have run till my  
> filesystem was
> full.
> 

I guess something about that inbox is confusing it, and making it redo it over 
and over?  Very curious

> > Bron, does this sound like anything you've seen before, that maybe  
> > got fixed on master but not backported to 3.0?
> >
> >> Cyrus Logs show many "ctl_conversationsdb[933]: mailbox: longlock
> >> user.UserID for 1.5 seconds"
> >> every few seconds.
> >
> > This message is logged when the lock is released, if that lock had  
> > been held for >1s.  It's reporting that it held the lock for longer  
> > than it would like to have, but at least you know the lock has been  
> > released.  I think this is telling you that user-visible performance  
> > on the mailbox would have been impacted while this was occurring  
> > (because imapd/lmtpd wouldn't be able to access their mailbox while  
> > ctl_conversationsdb was holding those locks), but also tells you  
> > that ctl_conversationsdb was doing the right thing, and not just  
> > holding a single lock for the entire job.
> >
> 
> I am not sure if any other processes had a chance to acquire a lock.

If they'd tried, they would have succeeded (by blocking until the lock was 
theirs, during the gaps where ctl_conversationsdb dropped and reacquired the 
lock), but maybe nothing else wanted it.

> > Though, that raises the question: was the user accessing the mailbox  
> > while the rebuild was in progress?  I wonder if their client is  
> > doing something funny and tripping things up?
> 
> We did this while the server was accessible (didn't want to have
> several hour downtime for ~44000 users)

Fair enough!

> But as I did my recent testing on a test-server without user access,
> only ctl_conversationsdb and cyr_dbtool did accessed the mailbox during
> the conversation rebuild mentioned 

Re: prblems rebuilding conversations db

2019-01-20 Thread ellie timoney
Hi Michael,

On Wed, Jan 16, 2019, at 10:12 PM, Michael Menge wrote:
> Hi,
> 
> because conversations db seems to be required for search indexes, I  
> enabled this option
> on our production servers today and tried to rebuild the conversations  
> db for all users with
> 
> ctl_conversationsdb -v -b UID
> 
> For most users this did take less than a second. But for some users  
> this process would
> not finish. I did kill one process after about 30 Minutes (most others  
> after 3 Minutes).
> The UserID.conversations has grown over 2 GB (the mailbox itself has  
> only ~700 MB of mails,
> and the conversations files from finnished rebuilds are less then 20 MB)

So, a 2GB-and-growing conversations db for a 700MB mailbox is a ratio of ~3:1, 
which seems obscene...

For comparison, the Conversations.append_reply_200 cassandane test (two 
messages with 100 replies each) produces an 87Kb conversations db for a 1016Kb 
mailbox (ratio ~1:12, n.b inverted!), which makes your ~3:1 and growing ratio 
seem even more obscene...

Obviously you can't share the bad conversations db because conversations db's 
are full of personally-identifying information.  But I wonder if looking 
through it with `cyr_dbtool ... show` turns up any unusual patterns, especially 
in comparison to one of the good users?  I wonder if it's somehow repeating 
itself?

Bron, does this sound like anything you've seen before, that maybe got fixed on 
master but not backported to 3.0?
 
> Cyrus Logs show many "ctl_conversationsdb[933]: mailbox: longlock  
> user.UserID for 1.5 seconds"
> every few seconds.

This message is logged when the lock is released, if that lock had been held 
for >1s.  It's reporting that it held the lock for longer than it would like to 
have, but at least you know the lock has been released.  I think this is 
telling you that user-visible performance on the mailbox would have been 
impacted while this was occurring (because imapd/lmtpd wouldn't be able to 
access their mailbox while ctl_conversationsdb was holding those locks), but 
also tells you that ctl_conversationsdb was doing the right thing, and not just 
holding a single lock for the entire job.

Though, that raises the question: was the user accessing the mailbox while the 
rebuild was in progress?  I wonder if their client is doing something funny and 
tripping things up?

Cheers,

ellie

> I did try reconstruct -r -G user/UserID, followed by   
> ctl_conversationsdb -v -z UID
> and tried to rebuild the db again. Reconstruct reported no errors, and  
> the new rebuild
> hat the same problem.
> 
> Has anybody seen a similar problem?
> 
> Any ideas how to fix this?
> 
> 
> We are running cyrus-imapd 3.0.8
> 
> 
> Kind regards
> 
> Michael Menge
> 
> 
> M.MengeTel.: (49) 7071/29-70316
> Universität Tübingen   Fax.: (49) 7071/29-5912
> Zentrum für Datenverarbeitung  mail:  
> michael.me...@zdv.uni-tuebingen.de
> Wächterstraße 76
> 72074 Tübingen

Cyrus Home Page: http://www.cyrusimap.org/
List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/
To Unsubscribe:
https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus