Re: 2.3.1 Replication is throwing scary errors

2018-06-13 Thread Thore Bödecker
Err, attached the wrong patches.

the correct ones are attached to this mail (0004, 0005, 0006).

On 13.06.18 - 13:29, Thore Bödecker wrote:
> 
> For reference: I'm using the official 2.3.1 tarball together with the
> 3 attached patches, that have been taken from GitHub diffs/commits
> linked to me by Aki in the #dovecot channel.
> 


Cheers,
Thore

-- 
Thore Bödecker

GPG ID: 0xD622431AF8DB80F3
GPG FP: 0F96 559D 3556 24FC 2226  A864 D622 431A F8DB 80F3
From a952e178943a5944255cb7c053d970f8e6d49336 Mon Sep 17 00:00:00 2001
From: Timo Sirainen 
Date: Tue, 5 Jun 2018 20:23:52 +0300
Subject: [PATCH] doveadm-server: Fix hang when sending a lot of output to
 clients

Nowadays ostream adds its io to the stream's specified ioloop, not to
current ioloop.
---
 src/doveadm/client-connection-tcp.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/doveadm/client-connection-tcp.c b/src/doveadm/client-connection-tcp.c
index a2e1358d7f..672017495d 100644
--- a/src/doveadm/client-connection-tcp.c
+++ b/src/doveadm/client-connection-tcp.c
@@ -336,6 +336,9 @@ static int doveadm_cmd_handle(struct client_connection_tcp *conn,
   running one and we can't call the original one recursively, so
   create a new ioloop. */
conn->ioloop = io_loop_create();
+   o_stream_switch_ioloop(conn->output);
+   if (conn->log_out != NULL)
+   o_stream_switch_ioloop(conn->log_out);
 
if (cmd_ver2 != NULL)
doveadm_cmd_server_run_ver2(conn, argc, argv, cctx);
From 59cd19919bf444e5c3fa429314408aacc8dd4eb8 Mon Sep 17 00:00:00 2001
From: Timo Sirainen 
Date: Tue, 24 Apr 2018 18:47:28 +0300
Subject: [PATCH 1/2] lib-storage: Add mail_user_home_mkdir()

---
 src/lib-storage/mail-user.c | 61 +
 src/lib-storage/mail-user.h |  5 
 2 files changed, 66 insertions(+)

diff --git a/src/lib-storage/mail-user.c b/src/lib-storage/mail-user.c
index 947e26cee4..a15ed353ff 100644
--- a/src/lib-storage/mail-user.c
+++ b/src/lib-storage/mail-user.c
@@ -8,6 +8,7 @@
 #include "module-dir.h"
 #include "home-expand.h"
 #include "file-create-locked.h"
+#include "mkdir-parents.h"
 #include "safe-mkstemp.h"
 #include "str.h"
 #include "strescape.h"
@@ -716,6 +717,66 @@ void mail_user_stats_fill(struct mail_user *user, struct stats *stats)
user->v.stats_fill(user, stats);
 }
 
+static int
+mail_user_home_mkdir_try_ns(struct mail_namespace *ns, const char *home)
+{
+   const enum mailbox_list_path_type types[] = {
+   MAILBOX_LIST_PATH_TYPE_DIR,
+   MAILBOX_LIST_PATH_TYPE_ALT_DIR,
+   MAILBOX_LIST_PATH_TYPE_CONTROL,
+   MAILBOX_LIST_PATH_TYPE_INDEX,
+   MAILBOX_LIST_PATH_TYPE_INDEX_PRIVATE,
+   MAILBOX_LIST_PATH_TYPE_INDEX_CACHE,
+   MAILBOX_LIST_PATH_TYPE_LIST_INDEX,
+   };
+   size_t home_len = strlen(home);
+   const char *path;
+
+   for (unsigned int i = 0; i < N_ELEMENTS(types); i++) {
+   if (!mailbox_list_get_root_path(ns->list, types[i], &path))
+   continue;
+   if (strncmp(path, home, home_len) == 0 &&
+   (path[home_len] == '\0' || path[home_len] == '/')) {
+   return mailbox_list_mkdir_root(ns->list, path,
+  types[i]) < 0 ? -1 : 1;
+   }
+   }
+   return 0;
+}
+
+int mail_user_home_mkdir(struct mail_user *user)
+{
+   struct mail_namespace *ns;
+   const char *home;
+   int ret;
+
+   if (mail_user_get_home(user, &home) < 0)
+   return -1;
+
+   /* Try to create the home directory by creating the root directory for
+  a namespace that exists under the home. This way we end up in the
+  special mkdir() code in mailbox_list_try_mkdir_root_parent().
+  Start from INBOX, since that's usually the correct place. */
+   ns = mail_namespace_find_inbox(user->namespaces);
+   if ((ret = mail_user_home_mkdir_try_ns(ns, home)) != 0)
+   return ret < 0 ? -1 : 0;
+   /* try other namespaces */
+   for (ns = user->namespaces; ns != NULL; ns = ns->next) {
+   if ((ns->flags & NAMESPACE_FLAG_INBOX_USER) != 0) {
+   /* already tried the INBOX namespace */
+   continue;
+   }
+   if ((ret = mail_user_home_mkdir_try_ns(ns, home)) != 0)
+   return ret < 0 ? -1 : 0;
+   }
+   /* fallback to a safe mkdir() with 0700 mode */
+   if (mkdir_parents(home, 0700) < 0 && errno != EEXIST) {
+   i_error("mkdir_parents(%s) failed: %m", home);
+   return -1;
+   }
+   return 0;
+}
+
 static const struct var_expand_func_table mail_user_var_expand_func_table_arr[] = {
{ "userdb", mail_user_var_expand_func_userdb },
{ NULL, NULL }
diff --git a/src/lib-storage/mail-user.h 

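The final fallback branch in mail_user_home_mkdir() above ultimately just creates the home directory with mode 0700. A minimal shell sketch of that behavior (the path is a made-up example, not taken from the thread):

```shell
set -eu
# hypothetical home path; real deployments use their own layout
home="$(mktemp -d)/vhosts/example.net/user1"
# rough equivalent of the patch's mkdir_parents(home, 0700) fallback
# (note: -m applies to the final component only; intermediate parents get
# umask defaults, while dovecot's mkdir_parents applies the mode to
# created parents as well)
mkdir -p -m 700 "$home"
stat -c '%a' "$home"   # prints 700
```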
Re: 2.3.1 Replication is throwing scary errors

2018-06-13 Thread Thore Bödecker
Hey all,

almost 48h ago I upgraded both my instances to 2.3.1 again to see if
the new patches would fix the replication issues for me.

So far, the result is: great.

I haven't been able to provoke any kind of I/O stall or persisting
queued/failed resync requests in my replication setup.

Newly added users are replicated instantly upon the first received
mails and the home directory gets created without issues now too.

For reference: I'm using the official 2.3.1 tarball together with the
3 attached patches, that have been taken from GitHub diffs/commits
linked to me by Aki in the #dovecot channel.

I can only encourage everyone to try out 2.3.1 again with these 3
patches to make sure it is rock-solid so that we might get a proper
and stable 2.3.2 release soon-ish :)


PS: For the Arch Linux users among you the dovecot-2.3.1-5 package in
the official repo contains said three patches :)
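For anyone rebuilding from the tarball as described above, the recipe is plain patch(1) run from the top of the source tree. A self-contained sketch on a throwaway tree (the dovecot-2.3.1 directory names are stand-ins, and 0004-example.patch is a hypothetical file name, not one of the actual attachments):

```shell
set -eu
work=$(mktemp -d); cd "$work"
# throwaway tree standing in for an unpacked dovecot-2.3.1/ source tree
mkdir -p dovecot-2.3.1/src
printf 'old\n' > dovecot-2.3.1/src/file.c
cp -r dovecot-2.3.1 dovecot-2.3.1.orig
printf 'new\n' > dovecot-2.3.1/src/file.c
# produce a unified diff (diff exits 1 when the trees differ)
diff -ru dovecot-2.3.1.orig dovecot-2.3.1 > 0004-example.patch || true
# reset the tree, then apply the patch the same way as the real ones:
rm -rf dovecot-2.3.1 && cp -r dovecot-2.3.1.orig dovecot-2.3.1
cd dovecot-2.3.1
patch -p1 < ../0004-example.patch
grep -q 'new' src/file.c && echo "patch applied"
```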


Cheers,
Thore

-- 
Thore Bödecker

GPG ID: 0xD622431AF8DB80F3
GPG FP: 0F96 559D 3556 24FC 2226  A864 D622 431A F8DB 80F3
commit 890883f12e8d8dd3309743eb95cf0b04f6e39ea0
Author: Aki Tuomi 
Date:   Mon Mar 19 18:39:27 2018 +0200

dsync: Revert to /tmp if home does not exist

Fixes doveadm: Error: Couldn't lock .dovecot-sync.lock: 
safe_mkstemp(.dovecot-sync.lock) failed: No such file or directory

diff --git a/src/doveadm/dsync/dsync-brain.c b/src/doveadm/dsync/dsync-brain.c
index c2b8169..1e84182 100644
--- a/src/doveadm/dsync/dsync-brain.c
+++ b/src/doveadm/dsync/dsync-brain.c
@@ -401,6 +401,7 @@ dsync_brain_lock(struct dsync_brain *brain, const char *remote_hostname)
.lock_method = FILE_LOCK_METHOD_FCNTL,
};
const char *home, *error, *local_hostname = my_hostdomain();
+   struct stat st;
bool created;
int ret;
 
@@ -437,8 +438,21 @@ dsync_brain_lock(struct dsync_brain *brain, const char *remote_hostname)
 
if (brain->verbose_proctitle)
process_title_set(dsync_brain_get_proctitle_full(brain, DSYNC_BRAIN_TITLE_LOCKING));
-   brain->lock_path = p_strconcat(brain->pool, home,
-  "/"DSYNC_LOCK_FILENAME, NULL);
+
+   /* if homedir does not yet exist, create lock under tmpdir */
+   if (stat(home, &st) < 0) {
+   if (errno != ENOENT) {
+   i_error("stat(%s) failed: %m", home);
+   return -1;
+   }
+   brain->lock_path = p_strdup_printf(brain->pool, "%s/%s-%s",
+  brain->user->set->mail_temp_dir,
+  brain->user->username,
+  DSYNC_LOCK_FILENAME);
+   } else {
+   brain->lock_path = p_strconcat(brain->pool, home,
+  "/"DSYNC_LOCK_FILENAME, NULL);
+   }
brain->lock_fd = file_create_locked(brain->lock_path, &lock_set,
&brain->lock, &created, &error);
if (brain->lock_fd == -1)
Re: 2.3.1 Replication is throwing scary errors

2018-06-08 Thread Michael Grimm
Michael Grimm  wrote:

> First of all: Major improvement by this patch applied to 2.3.1, there are no 
> more hanging processes.

From my point of view, Timo's recent commit did not just fix those 
hanging processes ...

> But: I do find quite a number of error messages like:
> 
>   Jun  7 06:34:20 mail dovecot: doveadm: Error: Failed to lock mailbox 
> NAME for dsyncing: \
>   
> file_create_locked(/.../USER/mailboxes/NAME/dbox-Mails/.dovecot-box-sync.lock)
>  \
>   failed: 
> fcntl(/.../USER/mailboxes/NAME/dbox-Mails/.dovecot-box-sync.lock, write-lock, 
> F_SETLKW) \
>   locking failed: Timed out after 30 seconds (WRITE lock held by 
> pid 79452)

… it finally fixed those locking errors as well!

> These messages are only found at that server which is normally receiving 
> synced messages (because almost all mail is received via the other master due 
> to MX priorities).

No wonder: it was completely my fault. I had "mail_replica =" pointing 
to itself :-( Copying configs from one server to the other without thinking is 
bad …

Now, after having fixed this stupid configuration mistake, I can report, after 
some hours, that from my point of view replication is back to its 2.2.x 
performance!

I do have to apologise for the noise, sorry.

Regards,
Michael



Re: 2.3.1 Replication is throwing scary errors

2018-06-07 Thread Reuben Farrelly
Regarding my comment below - it looks like a false alarm on my part. 
The commit referenced below hasn't gone into master-2.3 yet, which meant 
it wasn't included when I rebuilt earlier today; that was an incorrect 
assumption I made.


I have since manually patched it into master-2.3 and it looks to be OK 
so far - touch wood - with 4 hours testing so far.


Reuben


On 7/06/2018 3:21 pm, Reuben Farrelly wrote:

Still not quite right for me.

Jun  7 15:11:33 thunderstorm.reub.net dovecot: doveadm: Error: 
dsync(lightning.reub.net): I/O has stalled, no activity for 600 seconds 
(last sent=mail, last recv=mail (EOL))
Jun  7 15:11:33 thunderstorm.reub.net dovecot: doveadm: Error: Timeout 
during state=sync_mails (send=mails recv=recv_last_common)


I'm not sure if there is an underlying replication error or if the 
message is just cosmetic, though.


Reuben


On 7/06/2018 4:55 AM, Remko Lodder wrote:

Hi Timo,

Yes this seems to work fine so far. I’ll ask the people to add it to 
the current FreeBSD version..


Cheers
Remko

On 6 Jun 2018, at 19:34, Timo Sirainen wrote:


Should be fixed by 
https://github.com/dovecot/core/commit/a952e178943a5944255cb7c053d970f8e6d49336 









Re: 2.3.1 Replication is throwing scary errors

2018-06-07 Thread Timo Sirainen
On 7 Jun 2018, at 11.43, Michael Grimm  wrote:
> 
> Timo Sirainen:
> 
>> Should be fixed by
>> https://github.com/dovecot/core/commit/a952e178943a5944255cb7c053d970f8e6d49336
> 
> please ignore my ignorance but shouldn't one add this commit regarding 
> src/doveadm/client-connection-tcp.c ...
> 
> https://github.com/dovecot/core/commit/2a3b7083ce4e62a8bd836f9983b223e98e9bc157
> 
> ... to a vanilla 2.3.1 source tree as well?

That's a code simplification / cleanup commit. It doesn't fix anything.



Re: 2.3.1 Replication is throwing scary errors

2018-06-07 Thread Larry Rosenman

On 6/7/18, 3:43 AM, "dovecot on behalf of Michael Grimm" 
 wrote:

Timo Sirainen:

> Should be fixed by
> 
https://github.com/dovecot/core/commit/a952e178943a5944255cb7c053d970f8e6d49336

please ignore my ignorance but shouldn't one add this commit regarding 
src/doveadm/client-connection-tcp.c ...


https://github.com/dovecot/core/commit/2a3b7083ce4e62a8bd836f9983b223e98e9bc157

... to a vanilla 2.3.1 source tree as well?

I do have to admit that I am absolutely clueless in understanding 
dovecot's source code, but I came across this commit because it was 
committed the very same day as the one solving those hanging 
processes.
    
I could test it myself, but I am not sure whether that commit would break 
my production dovecot instances.

Regards,
Michael


I'm happy to add 
https://github.com/dovecot/core/commit/2a3b7083ce4e62a8bd836f9983b223e98e9bc157 
to the FreeBSD port if folks think it would help.

-- 
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: larry...@gmail.com
US Mail: 5708 Sabbia Drive, Round Rock, TX 78665-2106


Re: 2.3.1 Replication is throwing scary errors

2018-06-07 Thread Michael Grimm

Timo Sirainen:


Should be fixed by
https://github.com/dovecot/core/commit/a952e178943a5944255cb7c053d970f8e6d49336


please ignore my ignorance but shouldn't one add this commit regarding 
src/doveadm/client-connection-tcp.c ...


https://github.com/dovecot/core/commit/2a3b7083ce4e62a8bd836f9983b223e98e9bc157

... to a vanilla 2.3.1 source tree as well?

I do have to admit that I am absolutely clueless in understanding 
dovecot's source code, but I came across this commit because it was 
committed the very same day as the one solving those hanging 
processes.


I could test it myself, but I am not sure whether that commit would break 
my production dovecot instances.


Regards,
Michael


Re: 2.3.1 Replication is throwing scary errors

2018-06-07 Thread Michael Grimm

On 2018-06-07 08:48, Remko Lodder wrote:

On Thu, Jun 07, 2018 at 08:04:49AM +0200, Michael Grimm wrote:


Conclusion: After 12 hours of running a patched FBSD port I do get those
error messages, but replication seems to work now. But I still have the
feeling that something else might be going wrong.


I agree with that. Are you using the new pkg that ler@ prepared? That
includes the patch and is a 'native' package..


Yes, I am running this new port from ler@.
And: Thanks for his very fast modification!

Regards,
Michael


Re: 2.3.1 Replication is throwing scary errors

2018-06-07 Thread Michael Grimm

On 2018-06-07 07:34, Remko Lodder wrote:
On 7 Jun 2018, at 07:21, Reuben Farrelly  
wrote:



Still not quite right for me.

Jun  7 15:11:33 thunderstorm.reub.net dovecot: doveadm: Error: 
dsync(lightning.reub.net): I/O has stalled, no activity for 600 
seconds (last sent=mail, last recv=mail (EOL))
Jun  7 15:11:33 thunderstorm.reub.net dovecot: doveadm: Error: Timeout 
during state=sync_mails (send=mails recv=recv_last_common)


I'm not sure if there is an underlying replication error or if the 
message is just cosmetic, though.


Admittedly I have had a few occurrences of this behaviour as well last 
night. It happens more sporadically now and seems to be a conflict with my 
user settings. (My users get added twice to the system, user-domain.tld and 
u...@domain.tld, both are being replicated; the noreplicate flag is not yet 
honored in the version I am using, so I cannot bypass that yet.)

I do see messages that arrived on the other machine showing up on the 
machine that I am using to read these emails. So replication seems to work 
in that regard (where it obviously did not do that well before).


First of all: Major improvement by this patch applied to 2.3.1, there 
are no more hanging processes.


But: I do find quite a number of error messages like:

	Jun  7 06:34:20 mail dovecot: doveadm: Error: Failed to lock mailbox 
NAME for dsyncing: \
		file_create_locked(/.../USER/mailboxes/NAME/dbox-Mails/.dovecot-box-sync.lock) 
\
		failed: 
fcntl(/.../USER/mailboxes/NAME/dbox-Mails/.dovecot-box-sync.lock, 
write-lock, F_SETLKW) \
		locking failed: Timed out after 30 seconds (WRITE lock held by pid 
79452)
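The quoted failure is ordinary write-lock contention hitting a timeout. A small self-contained sketch of the same mechanism using flock(1), where a background holder keeps the lock and a 1-second timeout stands in for dovecot's 30 seconds (file names here are illustrative):

```shell
set -u
lockfile=$(mktemp)
# background holder: takes the exclusive lock and keeps it for 3 seconds,
# like the dsync process holding .dovecot-box-sync.lock
( exec 9>"$lockfile"; flock -x 9; sleep 3 ) &
sleep 0.5
# waiter: gives up after 1 second, analogous to the 30s F_SETLKW timeout
if ( exec 9>"$lockfile"; flock -w 1 -x 9 ); then
    msg="acquired"
else
    msg="lock timed out"
fi
echo "$msg"   # prints "lock timed out"
```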


These messages are only found at that server which is normally receiving 
synced messages (because almost all mail is received via the other 
master due to MX priorities).


Conclusion: After 12 hours of running a patched FBSD port I do get those 
error messages, but replication seems to work now. But I still have the 
feeling that something else might be going wrong.


@Timo: Wouldn't it be worth adding replicator/aggregator error messages 
to head, like the ones Aki sent to Remko? That might shed some light on 
replication issues today and in the future.


Regards,
Michael


Re: 2.3.1 Replication is throwing scary errors

2018-06-06 Thread Thore Bödecker
And I forgot to CC the list, sorry for that, it's way too early in
the morning :P

On 07.06.18 - 07:39, Thore Bödecker wrote:
> What does the output of these two commands show after that error has
> been logged?
> 
> doveadm replicator status
> 
> doveadm replicator dsync-status
> 
> If there are *waiting failed* requests, that never make it "through"
> (after being temporarily in state *queued failed* and then returning
> to *waiting failed*) this means there is something wrong with the
> replication.
> 
> You can try forcing replication of all known users using
> 
> doveadm replicator replicate '*'
> 
> And see if that resolves the failed requests, but I doubt it.
> 
> Please let us know what your status outputs look like.
> 
> 
> Cheers,
> Thore
> 
> -- 
> Thore Bödecker
> 
> GPG ID: 0xD622431AF8DB80F3
> GPG FP: 0F96 559D 3556 24FC 2226  A864 D622 431A F8DB 80F3
> 
> 
> On 07.06.18 - 15:21, Reuben Farrelly wrote:
> > Still not quite right for me.
> > 
> > Jun  7 15:11:33 thunderstorm.reub.net dovecot: doveadm: Error:
> > dsync(lightning.reub.net): I/O has stalled, no activity for 600 seconds
> > (last sent=mail, last recv=mail (EOL))
> > Jun  7 15:11:33 thunderstorm.reub.net dovecot: doveadm: Error: Timeout
> > during state=sync_mails (send=mails recv=recv_last_common)
> > 
> > I'm not sure if there is an underlying replication error or if the message
> > is just cosmetic, though.
> > 
> > Reuben
> > 
> > 
> > On 7/06/2018 4:55 AM, Remko Lodder wrote:
> > > Hi Timo,
> > > 
> > > Yes this seems to work fine so far. I’ll ask the people to add it to the
> > > current FreeBSD version..
> > > 
> > > Cheers
> > > Remko
> > > 
> > > > On 6 Jun 2018, at 19:34, Timo Sirainen wrote:
> > > > 
> > > > Should be fixed by 
> > > > https://github.com/dovecot/core/commit/a952e178943a5944255cb7c053d970f8e6d49336
> > > > 
> > > 


signature.asc
Description: PGP signature


Re: 2.3.1 Replication is throwing scary errors

2018-06-06 Thread Remko Lodder


> On 7 Jun 2018, at 07:21, Reuben Farrelly  wrote:
> 
> Still not quite right for me.
> 
> Jun  7 15:11:33 thunderstorm.reub.net dovecot: doveadm: Error: 
> dsync(lightning.reub.net): I/O has stalled, no activity for 600 seconds (last 
> sent=mail, last recv=mail (EOL))
> Jun  7 15:11:33 thunderstorm.reub.net dovecot: doveadm: Error: Timeout during 
> state=sync_mails (send=mails recv=recv_last_common)
> 
> I'm not sure if there is an underlying replication error or if the message is 
> just cosmetic, though.
> 
> Reuben

Hi,

Admittedly I have had a few occurrences of this behaviour as well last night. It 
happens more sporadically now and seems to be a conflict with my user settings. 
(My users get added twice to the system, user-domain.tld and u...@domain.tld, 
both are being replicated; the noreplicate flag is not yet honored in the 
version I am using, so I cannot bypass that yet.)

I do see messages that arrived on the other machine showing up on the machine 
that I am using to read these emails. So replication seems to work in that 
regard (where it obviously did not do that well before).

Cheers
Remko

> 
> 
> On 7/06/2018 4:55 AM, Remko Lodder wrote:
>> Hi Timo,
>> Yes this seems to work fine so far. I’ll ask the people to add it to the 
>> current FreeBSD version..
>> Cheers
>> Remko
>>> On 6 Jun 2018, at 19:34, Timo Sirainen  wrote:
>>> 
>>> Should be fixed by 
>>> https://github.com/dovecot/core/commit/a952e178943a5944255cb7c053d970f8e6d49336
>>> 



signature.asc
Description: Message signed with OpenPGP


Re: 2.3.1 Replication is throwing scary errors

2018-06-06 Thread Reuben Farrelly

Still not quite right for me.

Jun  7 15:11:33 thunderstorm.reub.net dovecot: doveadm: Error: 
dsync(lightning.reub.net): I/O has stalled, no activity for 600 seconds 
(last sent=mail, last recv=mail (EOL))
Jun  7 15:11:33 thunderstorm.reub.net dovecot: doveadm: Error: Timeout 
during state=sync_mails (send=mails recv=recv_last_common)


I'm not sure if there is an underlying replication error or if the 
message is just cosmetic, though.


Reuben


On 7/06/2018 4:55 AM, Remko Lodder wrote:

Hi Timo,

Yes this seems to work fine so far. I’ll ask the people to add it to the 
current FreeBSD version..


Cheers
Remko

On 6 Jun 2018, at 19:34, Timo Sirainen wrote:


Should be fixed by 
https://github.com/dovecot/core/commit/a952e178943a5944255cb7c053d970f8e6d49336






Re: 2.3.1 Replication is throwing scary errors

2018-06-06 Thread Remko Lodder
Hi Timo,

Yes this seems to work fine so far. I’ll ask the people to add it to the 
current FreeBSD version..

Cheers
Remko

> On 6 Jun 2018, at 19:34, Timo Sirainen  wrote:
> 
> Should be fixed by 
> https://github.com/dovecot/core/commit/a952e178943a5944255cb7c053d970f8e6d49336
>  
> 
> 



signature.asc
Description: Message signed with OpenPGP


Re: 2.3.1 Replication is throwing scary errors

2018-06-06 Thread Timo Sirainen
Should be fixed by 
https://github.com/dovecot/core/commit/a952e178943a5944255cb7c053d970f8e6d49336 




Re: 2.3.1 Replication is throwing scary errors

2018-06-01 Thread Andy Weal




On 1/06/2018 2:47 AM, Michael Grimm wrote:

On 31. May 2018, at 18:09, Remko Lodder  wrote:

On 31 May 2018, at 17:52, Michael Grimm  wrote:
I would love to get some feedback from the developers regarding:

#) are commercial customers of yours running 2.3 master-master replication 
without those issues reported in this thread?
#) do you get reports about these issues outside this ML as well?
#) and ...


What sort of debugging, short of bisecting 100+ patches between the commits 
above, can we do to progress this?

… what kind of debugging do you suggest?

Aki sent me over some patches recently and I have built a custom package for it 
for FreeBSD. It’s in my pkg repo, which I can forward you if you want it.

Great news. I'd love to test it, thus, could you forward it to me? Thanks.
Very good news. I too would love to test and am more than happy to share 
config, system setup and logs.

Please forward the pkg repo if you wouldn't mind.



You need to add some lines to the logging thing and then trace those and 
collaborate with the dovecot community/developers.

And, please let me know, which config is needed for those logging lines as well.

As above please



I have not yet found the time to actively pursue this due to other things 
on my plate. Sorry for that. I hope to do this “soon” but I don't want to pin 
myself to a commitment that I might not be able to make :)

Well, I will give it a try.

But: more testers might see more in those logging lines ;-)

Regards,
Michael


More than willing to test.

Regards,
Andy



Re: 2.3.1 Replication is throwing scary errors

2018-05-31 Thread Michael Grimm
On 31. May 2018, at 18:09, Remko Lodder  wrote:
>> On 31 May 2018, at 17:52, Michael Grimm  wrote:

>> I would love to get some feedback from the developers regarding:
>> 
>> #) are commercial customers of yours running 2.3 master-master replication 
>> without those issues reported in this thread?
>> #) do you get reports about these issues outside this ML as well?
>> #) and ...
>> 
>>> What sort of debugging, short of bisecting 100+ patches between the commits 
>>> above, can we do to progress this?
>> 
>> … what kind of debugging do you suggest?
> 
> Aki sent me over some patches recently and I have built a custom package for 
> it for FreeBSD. It’s in my pkg repo, which I can forward you if you want it.

Great news. I'd love to test it, thus, could you forward it to me? Thanks.

> You need to add some lines to the logging thing and then trace those and 
> collaborate with the dovecot community/developers.

And, please let me know, which config is needed for those logging lines as well.

> I have not yet found the time to actively pursue this due to other things 
> on my plate. Sorry for that. I hope to do this “soon” but I don't want to pin 
> myself to a commitment that I might not be able to make :)

Well, I will give it a try. 

But: more testers might see more in those logging lines ;-)

Regards,
Michael



Re: 2.3.1 Replication is throwing scary errors

2018-05-31 Thread Remko Lodder


> On 31 May 2018, at 17:52, Michael Grimm  wrote:
> 
> Reuben Farrelly  wrote:
> 
>> Checking in - this is still an issue with 2.3-master as of today (2.3.devel 
>> (3a6537d59)).
> 
> That doesn't sound good, because I did hope that someone has been working on 
> this issue ...
> 
>> I haven't been able to narrow the problem down to a specific commit. The 
>> best I have been able to get to is that this commit is relatively good (not 
>> perfect but good enough):
>> 
>> d9a1a7cbec19f4c6a47add47688351f8c3a0e372 (from Feb 19, 2018)
>> 
>> whereas this commit:
>> 
>> 6418419ec282c887b67469dbe3f541fc4873f7f0 (From Mar 12, 2018)
>> 
>> is pretty bad.  Somewhere in between some commit has caused the problem 
>> (which may have been introduced earlier) to get much worse.
> 
> Thanks for the info.
> 
>> There seem to be a handful of us with broken systems who are prepared to 
>> assist in debugging this and put in our own time to patch, test and get to 
>> the bottom of it, but it is starting to look like we're basically on our own.
> 
> I wonder if there is anyone running a 2.3 master-master replication scheme 
> *without* running into this issue? Please let us know: yes, 2.3 master-master 
> replication does run as rock-stable as in 2.2.
> 
> Anyone?
> 
> I would love to get some feedback from the developers regarding:
> 
> #) are commercial customers of yours running 2.3 master-master replication 
> without those issues reported in this thread?
> #) do you get reports about these issues outside this ML as well?
> #) and ...
> 
>> What sort of debugging, short of bisecting 100+ patches between the commits 
>> above, can we do to progress this?
> 
> … what kind of debugging do you suggest?

Aki sent me over some patches recently and I have built a custom package for it 
for FreeBSD. It’s in my pkg repo, which I can forward you if you want it.
You need to add some lines to the logging thing and then trace those and 
collaborate with the dovecot community/developers. I have not yet found 
the time to actively pursue this due to other things on my plate. Sorry for 
that. I hope to do this “soon” but I don't want to pin myself to a commitment 
that I might not be able to make :)

Cheers
Remko

> 
> Regards,
> Michael


signature.asc
Description: Message signed with OpenPGP


Re: 2.3.1 Replication is throwing scary errors

2018-05-31 Thread Michael Grimm
Reuben Farrelly  wrote:

> Checking in - this is still an issue with 2.3-master as of today (2.3.devel 
> (3a6537d59)).

That doesn't sound good, because I did hope that someone has been working on 
this issue ...

> I haven't been able to narrow the problem down to a specific commit. The best 
> I have been able to get to is that this commit is relatively good (not 
> perfect but good enough):
> 
> d9a1a7cbec19f4c6a47add47688351f8c3a0e372 (from Feb 19, 2018)
> 
> whereas this commit:
> 
> 6418419ec282c887b67469dbe3f541fc4873f7f0 (From Mar 12, 2018)
> 
> is pretty bad.  Somewhere in between some commit has caused the problem 
> (which may have been introduced earlier) to get much worse.

Thanks for the info.

> There seem to be a handful of us with broken systems who are prepared to 
> assist in debugging this and put in our own time to patch, test and get to 
> the bottom of it, but it is starting to look like we're basically on our own.

I wonder if there is anyone running a 2.3 master-master replication scheme 
*without* running into this issue? Please let us know: yes, 2.3 master-master 
replication does run as rock-stable as in 2.2. 

Anyone?

I would love to get some feedback from the developers regarding: 

#) are commercial customers of yours running 2.3 master-master replication 
without those issues reported in this thread?
#) do you get reports about these issues outside this ML as well?
#) and ...

> What sort of debugging, short of bisecting 100+ patches between the commits 
> above, can we do to progress this?

… what kind of debugging do you suggest?

Regards,
Michael



Re: 2.3.1 Replication is throwing scary errors

2018-05-30 Thread Reuben Farrelly

Hi,

Checking in - this is still an issue with 2.3-master as of today 
(2.3.devel (3a6537d59)).


I haven't been able to narrow the problem down to a specific commit. 
The best I have been able to get to is that this commit is relatively 
good (not perfect but good enough):


d9a1a7cbec19f4c6a47add47688351f8c3a0e372 (from Feb 19, 2018)

whereas this commit:

6418419ec282c887b67469dbe3f541fc4873f7f0 (From Mar 12, 2018)

is pretty bad.  Somewhere in between some commit has caused the problem 
(which may have been introduced earlier) to get much worse.


There seem to be a handful of us with broken systems who are prepared to 
assist in debugging this and put in our own time to patch, test and get 
to the bottom of it, but it is starting to look like we're basically on 
our own.


What sort of debugging, short of bisecting 100+ patches between the 
commits above, can we do to progress this?


Reuben



On 7/05/2018 5:54 am, Thore Bödecker wrote:

Hey all,

I've been affected by these replication issues too and finally downgraded
back to 2.2.35 since some newly created virtual domains/mailboxes
weren't replicated *at all* due to the bug(s).

My setup is more like a master-slave, where I only have a rather small
virtual machine as the slave host, which is also only MX 20.
The idea was to replicate all mails through dovecot and perform
individual (independent) backups on each host.

The clients use a CNAME with a low TTL of 60s so in case my "master"
(physical dedicated machine) goes down for a longer period I can simply
switch to the slave.

In order for this concept to work, the replication has to work without
any issue. Otherwise clients might notice missing mails or it might
even result in conflicts when the master comes back online if the
slave was out of sync beforehand.


On 06.05.18 - 21:34, Michael Grimm wrote:

And please have a look for processes like:
doveadm-server: [IP4  INBOX import:1/3] (doveadm-server)

These processes will "survive" a dovecot reboot ...


This is indeed the case. Once the replication processes
(doveadm-server) get stuck I had to resort to `kill -9` to get rid of
them. Something is really wrong there.
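The last-resort kill described above is just SIGKILL by pid. A self-contained sketch, with a background sleep standing in for a stuck doveadm-server process (names and pids are illustrative):

```shell
set -u
# a long-running background job stands in for a stuck doveadm-server
sleep 300 &
pid=$!
kill -9 "$pid"              # SIGKILL: the last resort when TERM is ignored
wait "$pid" 2>/dev/null || echo "killed $pid"
```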

As stated multiple times in the #dovecot irc channel I'm happy to test
any patches for the 2.3 series in my setup and provide further details
if required.

Thanks to all who are participating in this thread and finally these
issues get some attention :)


Cheers,
Thore





Re: 2.3.1 Replication is throwing scary errors

2018-05-06 Thread Andy Weal
 for 10 days.
   3.  ??? Last night I shutdown mx2 and restarted it a few hours later
   4.  ??? within minutes i was getting the following types of errors on mx2

    May 06 12:56:29 doveadm: Error: Couldn't lock
/var/mail/vhosts/example.net/user1/.dovecot-sync.lock:
fcntl(/var/mail/vhosts/example.net/user1/.dovecot-sync.lock, write-lock,
F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by
pid 1960)

   Before I venture down the rabbit hole of fault finding and excess
coffee consumption I was wondering if any of you had any updates on the
problems discussed below.


Cheers for now,
Andy



Hi,

[Formatting is a bit rough, replying from a trimmed digest email]


[garbled quoted digest trimmed; the same text appears intact in Reuben
Farrelly's 2018-04-08 message later in this thread]

Re: 2.3.1 Replication is throwing scary errors

2018-05-06 Thread Thore Bödecker
Hey all,

I've been affected by these replication issues too and finally downgraded
back to 2.2.35 since some newly created virtual domains/mailboxes
weren't replicated *at all* due to the bug(s).

My setup is more like a master-slave, where I only have a rather small
virtual machine as the slave host, which is also only MX 20.
The idea was to replicate all mails through dovecot and perform
individual (independent) backups on each host.

The clients use a CNAME with a low TTL of 60s so in case my "master"
(physical dedicated machine) goes down for a longer period I can simply
switch to the slave.

In order for this concept to work, the replication has to work without
any issue. Otherwise clients might notice missing mails or it might
even result in conflicts when the master comes back online if the
slave was out of sync beforehand.


On 06.05.18 - 21:34, Michael Grimm wrote:
> And please have a look for processes like:
>   doveadm-server: [IP4  INBOX import:1/3] (doveadm-server)
> 
> These processes will "survive" a dovecot reboot ...

This is indeed the case. Once the replication processes
(doveadm-server) get stuck I had to resort to `kill -9` to get rid of
them. Something is really wrong there.
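
For anyone else stuck in the same state, a minimal cleanup sketch (the
process-title pattern is an assumption based on the titles quoted above;
inspect before killing anything):

```shell
#!/bin/sh
# Find doveadm-server import workers that survived "service dovecot stop",
# then force-kill them -- they ignore the normal TERM signal.
# The [o] bracket keeps pgrep -f from matching this script's own
# command line.
PATTERN='doveadm-server.*imp[o]rt'
if pgrep -f "$PATTERN" >/dev/null 2>&1; then
    pgrep -fl "$PATTERN"      # list the candidates first
    pkill -9 -f "$PATTERN"    # SIGKILL, since SIGTERM is ignored
    service dovecot restart   # FreeBSD rc syntax; systemctl on Linux
else
    echo "no stuck doveadm-server processes found"
fi
```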

As stated multiple times in the #dovecot irc channel I'm happy to test
any patches for the 2.3 series in my setup and provide further details
if required.

Thanks to all who are participating in this thread and finally these
issues get some attention :)


Cheers,
Thore

-- 
Thore Bödecker

GPG ID: 0xD622431AF8DB80F3
GPG FP: 0F96 559D 3556 24FC 2226  A864 D622 431A F8DB 80F3


signature.asc
Description: PGP signature


Re: 2.3.1 Replication is throwing scary errors

2018-05-06 Thread Michael Grimm
>>>>      Dovecot = 2.3.1
>>>>      File system = ufs
>>>>   MX2 - Backup
>>>>      Freebsd 11.1-Release-p9
>>>>      Running on bare metal - no VM or jails
>>>>      MTA = Postfix 3.4-20180401
>>>>      Dovecot = 2.3.1
>>>>      File system = ufs (on SSD)
>>>> 
>>>> Brief sequence of events
>>>> 
>>>>   1.  apx 10 days back upgraded both mx1 and mx2 to dovecot 2.3.1_2
>>>>       from 2.3.0 (service dovecot stop, portmaster upgrade, service
>>>>       dovecot start)
>>>>   2.  both systems ran ok with no errors for 10 days.
>>>>   3.  Last night I shutdown mx2 and restarted it a few hours later
>>>>   4.  within minutes I was getting the following types of errors on mx2
>>>> 
>>>>    May 06 12:56:29 doveadm: Error: Couldn't lock
>>>> /var/mail/vhosts/example.net/user1/.dovecot-sync.lock:
>>>> fcntl(/var/mail/vhosts/example.net/user1/.dovecot-sync.lock, write-lock,
>>>> F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by
>>>> pid 1960)
>>>> 
>>>>   Before I venture down the rabbit hole of fault finding and excess
>>>> coffee consumption I was wondering if any of you had any updates on the
>>>> problems discussed below.
>>>> 
>>>> 
>>>> Cheers for now,
>>>> Andy
>>>> 
>>>> 
>>>> 
>>>> Hi,
>>>> 
>>>> [Formatting is a bit rough, replying from a trimmed digest email]
>>>> 
>>>> [garbled quoted digest trimmed; the same text appears intact in Reuben
>>>> Farrelly's 2018-04-08 message later in this thread]

Re: 2.3.1 Replication is throwing scary errors

2018-05-06 Thread Michael Grimm
Hi Andy

Andy Weal  wrote

> Hi all,
> 
> New to the mailing lists but have joined up because of above 2.3.1 
> Replication is throwing scary errors 
> 
> 
> Brief system configuration
> MX1 - Main 
> Freebsd 11.1-Release-p9
> Hosted on a Vultr VM in Sydney AU 
> MTA = Postfix 3.4-20180401
> Dovecot = 2.3.1
> File system = ufs
> MX2 - Backup 
> Freebsd 11.1-Release-p9
>  Running on bare metal - no VM or jails
> MTA = Postfix 3.4-20180401
> Dovecot = 2.3.1
> File system = ufs ( on SSD)
> 
>
> Brief sequence of events
>   • apx 10 days back upgraded both mx1 and mx2 to dovecot 2.3.1_2 
> from 2.3.0   (service dovecot stop, portmaster upgrade, service dovecot start)
>   • both systems ran ok with no errors for 10 days.
>   • Last night I shutdown mx2 and restarted it a few hours later
>   • within minutes I was getting the following types of errors on mx2
>  May 06 12:56:29 doveadm: Error: Couldn't lock 
> /var/mail/vhosts/example.net/user1/.dovecot-sync.lock: 
> fcntl(/var/mail/vhosts/example.net/user1/.dovecot-sync.lock, write-lock, 
> F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by pid 
> 1960)  
> 
>   Before I venture down the rabbit hole of fault finding and excess coffee 
> consumption I was wondering if any of you had any updates on the problems 
> discussed below.

As Reuben already stated: nothing has been "solved" regarding this issue with 
replication and 2.3.1 dovecot, yet. 

There are about 10 reports of this kind, here, and in the German dovecot list, 
I am aware of. All dovecot setups differ in every aspect like OS or virtual 
versus bare metal servers, thus I am convinced that it solely has to do with 
some dovecot code that differs between either 2.2.35 or 2.3.0 and 2.3.1.

Hoping this issue becomes recognised by the developers as a showstopper for 
upgrading from 2.2 to 2.3, soon.

As you are using FreeBSD, you will have a dovecot22 and dovecot-pigeonhole04 
port at hand to avoid upgrading to the erroneous 2.3 version for the time being. 
(Thanks to the port maintainer who is following this ML!)

With kind regards,
Michael





Re: 2.3.1 Replication is throwing scary errors

2018-05-06 Thread Reuben Farrelly
Ah yes.  The hash I put below is wrong but the commit message I quoted 
was right. This is the last known working good commit, which was on Mar 6:


https://github.com/dovecot/core/commit/ff5d02182573b1d4306c2a8c36605c98f217ef3b
"doveadm: Unref header search context after use
Fixes memory leak, found by valgrind"

I've subsequently been able to determine that the bad commit was one of 
the dozens which were committed in a big batch on Mar 12, 2018.  I 
definitely see the problem happening when running with commit 
6418419ec282c887b67469dbe3f541fc4873f7f0 (last one in that batch).


In other words it's one of the commits between those two hashes that is 
the problem.   Just need to work out which one...
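
A bisect between those two hashes could be sketched like this (assumes a
local clone of dovecot/core in ./core; the good/bad hashes are the ones
quoted above, and each step still needs a build plus hours of replication
testing, so this is the outline, not a quick run):

```shell
#!/bin/sh
# Last known-good commit (Mar 6) and the last commit of the suspect
# Mar 12 batch, as identified above.
GOOD=ff5d02182573b1d4306c2a8c36605c98f217ef3b
BAD=6418419ec282c887b67469dbe3f541fc4873f7f0
if [ -d core/.git ]; then
    cd core
    git bisect start "$BAD" "$GOOD"
    # At each step: build, install, run replication for a few hours, then
    #   git bisect good    # no stuck doveadm-server / sync.lock timeouts
    #   git bisect bad     # the hang reappeared
    # until git names the first bad commit; finish with:
    #   git bisect reset
else
    echo "no ./core checkout; nothing to do"
fi
```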


The problem is also seen with -git as of a day or so ago.  I'm tracking 
master-2.3 so I imagine I'm picking up all the 2.3 fixes as they go 
straight in (including the rc fixes).  So this doesn't appear to be 
fixed yet.


Reuben



On 6/05/2018 11:41 pm, Aki Tuomi wrote:

Seems you got the hash wrong...

You probably mean 4a1e2e8e85ad5d4ab43496a5f3cd6b7ff6ac5955, which fixes a 
memory leak, but does not release sync lock.

Interesting if this commit fixes the issue.

Looking at the sync code, where the lock is made / released, there is only one 
commit recently, which is 8077d714e11388a294f1583e706152396972acce and that one 
creates home directory if missing.

Would be interesting if you could test with 2.3.2rc1, which was released few 
days ago, whether this issue goes away?

Aki


On 06 May 2018 at 15:21 Reuben Farrelly <reuben-dove...@reub.net> wrote:


Hi Andy,

Funny you say that - I've been testing this very problem for some time
today to narrow down the commits where I think the problem started
occurring.

So far I have been able to roll forward to commit from git:
74ef8506dcb946acc42722683e49fdcb964eed19

  >> "doveadm: Unref header search context after use" .  So far I have
been running on that code for a few hours and I'm not yet seeing the
problem, which indicates the issue is probably subsequent to this.

I'm of the view that the problem started occurring somewhere between
that commit above on 6 March and the final 2.3.1 release on 27 March.
There are a lot of changes between those two dates, and testing to
narrow down to one specific commit is extremely time consuming.

It would be really helpful if even one of the Dovecot developers could
comment on this and give us some direction (or candidate commits to
investigate) or just let us know if the root cause has been found.  This
bug is a showstopper and has stopped me tracking master-2.3 for over 3
months now, as I can't test later builds or even upgrade to the 2.3.1
release while replication is so broken.

Reuben




Message: 1
Date: Sun, 6 May 2018 13:21:57 +1000
From: Andy Weal <a...@bizemail.com.au>
To: dovecot@dovecot.org
Subject: 2.3.1 Replication is throwing scary errors
Message-ID: <f4274128-350b-9a74-a344-8e6caeffa...@bizemail.com.au>
Content-Type: text/plain; charset="utf-8"; Format="flowed"

Hi all,

New to the mailing lists but have joined up because of above "2.3.1
Replication is throwing scary errors"


Brief system configuration
   MX1 - Main
      Freebsd 11.1-Release-p9
      Hosted on a Vultr VM in Sydney AU
      MTA = Postfix 3.4-20180401
      Dovecot = 2.3.1
      File system = ufs
   MX2 - Backup
      Freebsd 11.1-Release-p9
      Running on bare metal - no VM or jails
      MTA = Postfix 3.4-20180401
      Dovecot = 2.3.1
      File system = ufs (on SSD)

Brief sequence of events

   1.  apx 10 days back upgraded both mx1 and mx2 to dovecot 2.3.1_2
       from 2.3.0 (service dovecot stop, portmaster upgrade, service
       dovecot start)
   2.  both systems ran ok with no errors for 10 days.
   3.  Last night I shutdown mx2 and restarted it a few hours later
   4.  within minutes I was getting the following types of errors on mx2

    May 06 12:56:29 doveadm: Error: Couldn't lock
/var/mail/vhosts/example.net/user1/.dovecot-sync.lock:
fcntl(/var/mail/vhosts/example.net/user1/.dovecot-sync.lock, write-lock,
F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by
pid 1960)

   Before I venture down the rabbit hole of fault finding and excess
coffee consumption I was wondering if any of you had any updates on the
problems discussed below.


Cheers for now,
Andy



Hi,

[Formatting is a bit rough, replying from a trimmed digest email]


[garbled quoted digest trimmed; the same text appears intact in Reuben
Farrelly's 2018-04-08 message later in this thread]

Re: 2.3.1 Replication is throwing scary errors

2018-05-06 Thread Aki Tuomi
Seems you got the hash wrong...

You probably mean 4a1e2e8e85ad5d4ab43496a5f3cd6b7ff6ac5955, which fixes a 
memory leak, but does not release sync lock.

Interesting if this commit fixes the issue.

Looking at the sync code, where the lock is made / released, there is only one 
commit recently, which is 8077d714e11388a294f1583e706152396972acce and that one 
creates home directory if missing.

Would be interesting if you could test with 2.3.2rc1, which was released few 
days ago, whether this issue goes away?

Aki

> On 06 May 2018 at 15:21 Reuben Farrelly <reuben-dove...@reub.net> wrote:
> 
> 
> Hi Andy,
> 
> Funny you say that - I've been testing this very problem for some time 
> today to narrow down the commits where I think the problem started 
> occurring.
> 
> So far I have been able to roll forward to commit from git: 
> 74ef8506dcb946acc42722683e49fdcb964eed19
> 
>  >> "doveadm: Unref header search context after use" .  So far I have 
> been running on that code for a few hours and I'm not yet seeing the 
> problem, which indicates the issue is probably subsequent to this.
> 
> I'm of the view that the problem started occurring somewhere between 
> that commit above on 6 March and the final 2.3.1 release on 27 March. 
> There are a lot of changes between those two dates, and testing to 
> narrow down to one specific commit is extremely time consuming.
> 
> It would be really helpful if even one of the Dovecot developers could 
> comment on this and give us some direction (or candidate commits to 
> investigate) or just let us know if the root cause has been found.  This 
> bug is a showstopper and has stopped me tracking master-2.3 for over 3 
> months now, as I can't test later builds or even upgrade to the 2.3.1 
> release while replication is so broken.
> 
> Reuben
> 
> 
> 
> > Message: 1
> > Date: Sun, 6 May 2018 13:21:57 +1000
> > From: Andy Weal <a...@bizemail.com.au>
> > To: dovecot@dovecot.org
> > Subject: 2.3.1 Replication is throwing scary errors
> > Message-ID: <f4274128-350b-9a74-a344-8e6caeffa...@bizemail.com.au>
> > Content-Type: text/plain; charset="utf-8"; Format="flowed"
> > 
> > Hi all,
> > 
> > New to the mailing lists but have joined up because of above "2.3.1
> > Replication is throwing scary errors"
> > 
> > 
> > Brief system configuration
> >   MX1 - Main
> >     Freebsd 11.1-Release-p9
> >     Hosted on a Vultr VM in Sydney AU
> >     MTA = Postfix 3.4-20180401
> >     Dovecot = 2.3.1
> >     File system = ufs
> >   MX2 - Backup
> >     Freebsd 11.1-Release-p9
> >     Running on bare metal - no VM or jails
> >     MTA = Postfix 3.4-20180401
> >     Dovecot = 2.3.1
> >     File system = ufs (on SSD)
> > 
> > Brief sequence of events
> > 
> >   1.  apx 10 days back upgraded both mx1 and mx2 to dovecot 2.3.1_2
> >       from 2.3.0 (service dovecot stop, portmaster upgrade, service
> >       dovecot start)
> >   2.  both systems ran ok with no errors for 10 days.
> >   3.  Last night I shutdown mx2 and restarted it a few hours later
> >   4.  within minutes I was getting the following types of errors on mx2
> > 
> >    May 06 12:56:29 doveadm: Error: Couldn't lock
> > /var/mail/vhosts/example.net/user1/.dovecot-sync.lock:
> > fcntl(/var/mail/vhosts/example.net/user1/.dovecot-sync.lock, write-lock,
> > F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by
> > pid 1960)
> > 
> >   Before I venture down the rabbit hole of fault finding and excess
> > coffee consumption I was wondering if any of you had any updates on the
> > problems discussed below.
> > 
> > 
> > Cheers for now,
> > Andy
> > 
> > 
> > 
> > Hi,
> > 
> > [Formatting is a bit rough, replying from a trimmed digest email]
> > 
> > [garbled quoted digest trimmed; the same text appears intact in Reuben
> > Farrelly's 2018-04-08 message later in this thread]

Re: 2.3.1 Replication is throwing scary errors

2018-05-06 Thread Reuben Farrelly

Hi Andy,

Funny you say that - I've been testing this very problem for some time 
today to narrow down the commits where I think the problem started 
occurring.


So far I have been able to roll forward to commit from git: 
74ef8506dcb946acc42722683e49fdcb964eed19


>> "doveadm: Unref header search context after use" .  So far I have 
been running on that code for a few hours and I'm not yet seeing the 
problem, which indicates the issue is probably subsequent to this.


I'm of the view that the problem started occurring somewhere between 
that commit above on 6 March and the final 2.3.1 release on 27 March. 
There are a lot of changes between those two dates, and testing to 
narrow down to one specific commit is extremely time consuming.


It would be really helpful if even one of the Dovecot developers could 
comment on this and give us some direction (or candidate commits to 
investigate) or just let us know if the root cause has been found.  This 
bug is a showstopper and has stopped me tracking master-2.3 for over 3 
months now, as I can't test later builds or even upgrade to the 2.3.1 
release while replication is so broken.


Reuben




Message: 1
Date: Sun, 6 May 2018 13:21:57 +1000
From: Andy Weal <a...@bizemail.com.au>
To: dovecot@dovecot.org
Subject: 2.3.1 Replication is throwing scary errors
Message-ID: <f4274128-350b-9a74-a344-8e6caeffa...@bizemail.com.au>
Content-Type: text/plain; charset="utf-8"; Format="flowed"

Hi all,

New to the mailing lists but have joined up because of above "2.3.1
Replication is throwing scary errors"


Brief system configuration
  MX1 - Main
    Freebsd 11.1-Release-p9
    Hosted on a Vultr VM in Sydney AU
    MTA = Postfix 3.4-20180401
    Dovecot = 2.3.1
    File system = ufs
  MX2 - Backup
    Freebsd 11.1-Release-p9
    Running on bare metal - no VM or jails
    MTA = Postfix 3.4-20180401
    Dovecot = 2.3.1
    File system = ufs (on SSD)

Brief sequence of events

  1.  apx 10 days back upgraded both mx1 and mx2 to dovecot 2.3.1_2
      from 2.3.0 (service dovecot stop, portmaster upgrade, service
      dovecot start)
  2.  both systems ran ok with no errors for 10 days.
  3.  Last night I shutdown mx2 and restarted it a few hours later
  4.  within minutes I was getting the following types of errors on mx2

   May 06 12:56:29 doveadm: Error: Couldn't lock
/var/mail/vhosts/example.net/user1/.dovecot-sync.lock:
fcntl(/var/mail/vhosts/example.net/user1/.dovecot-sync.lock, write-lock,
F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by
pid 1960)

  Before I venture down the rabbit hole of fault finding and excess
coffee consumption I was wondering if any of you had any updates on the
problems discussed below.


Cheers for now,
Andy



Hi,

[Formatting is a bit rough, replying from a trimmed digest email]


[garbled quoted digest trimmed; the same text appears intact in Reuben
Farrelly's 2018-04-08 message immediately below]

Re: 2.3.1 Replication is throwing scary errors

2018-04-08 Thread Reuben Farrelly

Hi,

[Formatting is a bit rough, replying from a trimmed digest email]


Message: 1
Date: Fri, 6 Apr 2018 15:04:35 +0200
From: Michael Grimm <trash...@ellael.org>
To: Dovecot Mailing List <dovecot@dovecot.org>
Subject: Re: 2.3.1 Replication is throwing scary errors
Message-ID: <e7e7a927-68f8-443f-ba59-e66ced8fe...@ellael.org>
Content-Type: text/plain;   charset=utf-8

Reuben Farrelly wrote:

From: Michael Grimm <trash...@ellael.org>



[This is Dovecot 2.3.1 at FreeBSD STABLE-11.1 running in two jails at distinct 
servers.]
I did upgrade from 2.2.35 to 2.3.1 today, and I do become pounded by error 
messages at server1 (and vice versa at server2) as follows:
| Apr  2 17:12:18  server1.lan dovecot: doveadm: Error: 
dsync(server2.lan): I/O has stalled, \
no activity for 600 seconds (last sent=mail_change, last 
recv=mail_change (EOL))
| Apr  2 17:12:18  server1.lan dovecot: doveadm: Error: 
Timeout during state=sync_mails \
(send=changes recv=mail_requests)
[…]
| Apr  2 18:59:03  server1.lan dovecot: doveadm: Error: 
dsync(server2.lan): I/O has stalled, \
no activity for 600 seconds (last sent=mail, last recv=mail 
(EOL))
| Apr  2 18:59:03  server1.lan dovecot: doveadm: Error: 
Timeout during state=sync_mails \
(send=mails recv=recv_last_common)
I cannot see in my personal account any missing replications, *but* I haven't 
tested this thoroughly enough. I do have customers being serviced at these 
productive servers, *thus* I'm back to 2.2.35 until I do understand or have 
learned what is going on.


In my reply to this statement of mine I mentioned that I have seen those 
timeouts quite a few times during the past year. Thus, I upgraded to 2.3.1 
again, and boom: after some hours I ended up in hanging processes [1] like (see 
Remko's mail in addition) ...

doveadm-server: [IP4/6  SOME/MAILBOX import:0/0] (doveadm-server)

… at server2 paired with a file like …

-rw--- 1 vmail dovecot uarch 0 Apr 3 16:52 
/home/to/USER1/.dovecot-sync.lock

Corresponding logfile entries at server2 are like …

   Apr  3 17:10:49  server2.lan dovecot: doveadm: Error: Couldn't 
lock /home/to/USER1/.dovecot-sync.lock: \
   fcntl(/home/to/USER1/.dovecot-sync.lock, write-lock, F_SETLKW) locking 
failed: Timed out after 30 seconds \
   (WRITE lock held by pid 51110)
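
When that error shows up, the holding pid can be cross-checked before
killing anything. A sketch (the lock path is the anonymized one from the
log above, so adjust per user; fstat is the FreeBSD tool, with a
/proc/locks fallback for Linux):

```shell
#!/bin/sh
# Show which process has the dsync lock file open / locked.
LOCK=/home/to/USER1/.dovecot-sync.lock
if [ ! -e "$LOCK" ]; then
    echo "lock file $LOCK not present"
elif command -v fstat >/dev/null 2>&1; then
    fstat "$LOCK"                  # FreeBSD: pids with the file open
else
    # Linux fallback: match the file's inode against /proc/locks
    INODE=$(stat -c %i "$LOCK")
    grep ":$INODE " /proc/locks    # the 5th column is the holder's pid
fi
```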

[1] Even stopping dovecot will not end those processes. One has to manually 
kill those before restarting dovecot.

After one day of testing 2.3.1 with a couple of those episodes of 
locking/timeout, and now missing mails depending on which server your MUA 
connects to, I went back to 2.2.35. After two days at that version I never had 
such an episode again.


It's not just you.  This issue hit me recently, and it was impacting
replication noticeably.  I am following git master-2.3 .

[...]

There is also a second issue of a long standing race with replication
occurring somewhere whereby if a mail comes in, is written to disk, is
replicated and then deleted in short succession, it will reappear
again to the MUA.  I suspect the mail is being replicated back from
the remote.  A few people have reported it over the years but it's not
reliable or consistent, so it has never been fixed.
And lastly there has been an ongoing but seemingly minor issue
relating to locking timing out after 30s particularly on the remote
host that is being replicated to.  I rarely see the problem on my
local disk where almost all of the mail comes in, it's almost always
occurring on the replicate/remote system.


It might be time to describe our setups in order to possibly find common 
grounds that might trigger this issue you describe and Remko and I ran 
into as well.

Servers:     Cloud Instances (both identical), around 25ms latency apart.
             Intel Core Processor (Haswell, no TSX) (3092.91-MHz K8-class CPU)
             Both servers are connected via IPsec/racoon tunnels
OS:          FreeBSD 11.1-STABLE (both servers)
Filesystem:  ZFS
MTA:         postfix 3.4-20180401 (postfix delivers via dovecot's LMTP)
IMAP:        dovecot running in FreeBSD jails (issues with 2.3.1, fine with
             2.2.35)
Replication: unsecured tcp / master-master
MUA:         mainly iOS or macOS mail.app, rarely roundcube


For me:

Servers:     Main: VM on a VMWare ESXi local system (light
             load), local SSD disks (no NFS)
             Redundant: Linode VM in Singapore, around 250ms away
             Also no NFS.  Linode use SSDs for IO.
             There is native IPv6 connectivity between both VMs.
             As I am using TCPs I don't have a VPN between them -
             just raw IPv6 end to end.
OS:          Gentoo Linux x86_64 kept well up to date
Filesystem:  EXT4 for both
MTA:         Postfix 3.4-x (same as you)
IMAP:        Dovecot running natively on the machine (no 

Re: 2.3.1 Replication is throwing scary errors

2018-04-06 Thread Michael Grimm
Reuben Farrelly wrote:
> From: Michael Grimm 

>> [This is Dovecot 2.3.1 at FreeBSD STABLE-11.1 running in two jails at 
>> distinct servers.]
>> I did upgrade from 2.2.35 to 2.3.1 today, and I do become pounded by error 
>> messages at server1 (and vice versa at server2) as follows:
>>  | Apr  2 17:12:18  server1.lan dovecot: doveadm: Error: 
>> dsync(server2.lan): I/O has stalled, \
>>  no activity for 600 seconds (last sent=mail_change, last 
>> recv=mail_change (EOL))
>>  | Apr  2 17:12:18  server1.lan dovecot: doveadm: Error: 
>> Timeout during state=sync_mails \
>>  (send=changes recv=mail_requests)
>>  [?]
>>  | Apr  2 18:59:03  server1.lan dovecot: doveadm: Error: 
>> dsync(server2.lan): I/O has stalled, \
>>  no activity for 600 seconds (last sent=mail, last recv=mail 
>> (EOL))
>>  | Apr  2 18:59:03  server1.lan dovecot: doveadm: Error: 
>> Timeout during state=sync_mails \
>>  (send=mails recv=recv_last_common)
>> I cannot see in my personal account any missing replications, *but* I 
>> haven't tested this thoroughly enough. I do have customers being serviced at 
>> these productive servers, *thus* I'm back to 2.2.35 until I do understand or 
>> have learned what is going on.

In my reply to this statement of mine I mentioned that I have seen those 
timeouts quite a few times during the past year. Thus, I upgraded to 2.3.1 
again, and boom: after some hours I ended up with hanging processes [1] like (see 
Remko's mail in addition) ...

doveadm-server: [IP4/6  SOME/MAILBOX import:0/0] (doveadm-server)

… at server2 paired with a file like …

-rw------- 1 vmail dovecot uarch 0 Apr 3 16:52 
/home/to/USER1/.dovecot-sync.lock 

Corresponding logfile entries at server2 are like …

  Apr  3 17:10:49  server2.lan dovecot: doveadm: Error: Couldn't lock 
/home/to/USER1/.dovecot-sync.lock: \
  fcntl(/home/to/USER1/.dovecot-sync.lock, write-lock, F_SETLKW) locking 
failed: Timed out after 30 seconds \
  (WRITE lock held by pid 51110)

[1] Even stopping dovecot will not end those processes. One has to manually 
kill those before restarting dovecot.
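That lock-timeout behaviour can be reproduced in miniature. A minimal sketch, using flock(1) rather than the fcntl(F_SETLKW) locking dovecot actually uses (the path and the timings here are made up):

```shell
#!/bin/sh
# Illustration only: dovecot takes .dovecot-sync.lock with fcntl(F_SETLKW),
# while flock(1) uses flock() locks -- but the pattern shown here (one stuck
# holder, one waiter giving up after a timeout) is the same.
LOCK=/tmp/demo-sync.lock

flock -x "$LOCK" sleep 5 &   # holder: a "stuck" dsync keeping the lock
sleep 1                      # let the holder actually acquire it

# waiter: give up after 2 seconds, like dsync's 30-second timeout
if flock -x -w 2 "$LOCK" true; then
    result="lock acquired"
else
    result="lock timed out"
fi
echo "$result"
```

The waiter prints "lock timed out", just as dsync gives up on .dovecot-sync.lock after 30 seconds while the holder pid sits on it.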

After one day of testing 2.3.1, with a couple of those episodes of 
locking/timeouts and mails now missing depending on which server your MUA 
connects to, I went back to 2.2.35. After two days on that version I never had 
such an episode again.

> It's not just you.  This issue hit me recently, and it was impacting
> replication noticeably.  I am following git master-2.3 .
[...]
> There is also a second issue of a long standing race with replication
> occurring somewhere whereby if a mail comes in, is written to disk, is
> replicated and then deleted in short succession, it will reappear
> again to the MUA.  I suspect the mail is being replicated back from
> the remote.  A few people have reported it over the years but it's not
> reliable or consistent, so it has never been fixed.
> And lastly there has been an ongoing but seemingly minor issue
> relating to locking timing out after 30s particularly on the remote
> host that is being replicated to.  I rarely see the problem on my
> local disk where almost all of the mail comes in, it's almost always
> occurring on the replicate/remote system.

It might be time to describe our setups in order to possibly find common 
grounds that might trigger this issue you describe, and which Remko and I ran 
into as well.

Servers:     Cloud Instances (both identical), around 25ms latency apart.
             Intel Core Processor (Haswell, no TSX) (3092.91-MHz K8-class CPU)
             Both servers are connected via IPsec/racoon tunnels
OS:          FreeBSD 11.1-STABLE (both servers)
Filesystem:  ZFS
MTA:         postfix 3.4-20180401 (postfix delivers via dovecot's LMTP)
IMAP:        dovecot running in FreeBSD jails (issues with 2.3.1, fine with 2.2.35)
Replication: unsecured tcp / master-master
MUA:         mainly iOS or macOS mail.app, rarely roundcube

I believe it is worthwhile to mention here that I run a poor man's fail-over 
approach (round-robin DNS) as follows:

DNS:         mail.server.tld resolves to one IP4 and one IP6 address of each 
             server, thus 4 IP addresses in total

According to its MX priority one server (server1) will receive most mail, thus 
dovecot will mostly replicate mail from server1 to server2. Server2 is the one 
showing those deadlocking issues you see in your setup as well.

But connecting MUAs will hop quite frequently between server1 and server2, and 
sometimes will connect to both servers simultaneously, even mixing IP4 and IP6, 
because MUAs do not follow MX priorities (IIRC). Normally I would believe that 
this shouldn't be an issue for dovecot, but let me ask dovecot's developers: 

Can those simultaneous connects and modifications of \SEEN et al. be a reason 
for my issues regarding deadlocking?

> For me it seems very unlikely that on an unloaded/idle VPS there 

Re: 2.3.1 Replication is throwing scary errors

2018-04-05 Thread Remko Lodder


> On 4 Apr 2018, at 01:34, Reuben Farrelly  wrote:
> 
> Hi,
> 
>> --
>> Message: 2
>> Date: Mon, 2 Apr 2018 22:06:07 +0200
>> From: Michael Grimm 
>> To: Dovecot Mailing List 
>> Subject: 2.3.1 Replication is throwing scary errors
>> Message-ID: <29998016-d62f-4348-93d1-613b13da9...@ellael.org>
>> Content-Type: text/plain;charset=utf-8
>> Hi
>> [This is Dovecot 2.3.1 at FreeBSD STABLE-11.1 running in two jails at 
>> distinct servers.]
>> I did upgrade from 2.2.35 to 2.3.1 today, and I do become pounded by error 
>> messages at server1 (and vice versa at server2) as follows:
>>  | Apr  2 17:12:18  server1.lan dovecot: doveadm: Error: 
>> dsync(server2.lan): I/O has stalled, \
>>  no activity for 600 seconds (last sent=mail_change, last 
>> recv=mail_change (EOL))
>>  | Apr  2 17:12:18  server1.lan dovecot: doveadm: Error: 
>> Timeout during state=sync_mails \
>>  (send=changes recv=mail_requests)
>>  [?]
>>  | Apr  2 18:59:03  server1.lan dovecot: doveadm: Error: 
>> dsync(server2.lan): I/O has stalled, \
>>  no activity for 600 seconds (last sent=mail, last recv=mail 
>> (EOL))
>>  | Apr  2 18:59:03  server1.lan dovecot: doveadm: Error: 
>> Timeout during state=sync_mails \
>>  (send=mails recv=recv_last_common)
>> I cannot see in my personal account any missing replications, *but* I 
>> haven't tested this thoroughly enough. I do have customers being serviced at 
>> these productive servers, *thus* I'm back to 2.2.35 until I do understand or 
>> have learned what is going on.
>> Any ideas/feedback?
>> FYI: I haven't seen such errors before. Replication has been working for 
>> years now, without any glitches at all.
>> Regards,
>> Michael
> 
> It's not just you.  This issue hit me recently, and it was impacting 
> replication noticeably.  I am following git master-2.3 .
> 
> 

I am seeing the same as Michael Grimm also on FreeBSD-11.
You’ll also notice in doveadm replicator status ‘*’ that the failed flag is 
raised for those users and that
there are processes just hanging forever when those logs start to appear:

 45949  0.0  0.0  47888  13276  -  I  20:20  0:00.10 doveadm-server: [ Verwijderde items send:mail_requests recv:changes] (doveadm-server)
 45964  0.0  0.0  49860  11608  -  I  20:20  0:00.05 doveadm-server: [IP6  INBOX import:1/3] (doveadm-server)
 45965  0.0  0.1  58256  19820  -  I  20:20  0:00.11 doveadm-server: [IP6  INBOX import:16/18] (doveadm-server)
 46480  0.0  0.0  53536  16288  -  I  20:22  0:00.08 doveadm-server: [IP6  INBOX import:4/6] (doveadm-server)
 46745  0.0  0.0  51496  14184  -  I  20:22  0:00.07 doveadm-server: [IP6  INBOX import:5/6] (doveadm-server)

I also reverted to 2.2.35 because I started to get complaints from my users 
that mail was missing.

Cheers
Remko





Re: 2.3.1 Replication is throwing scary errors

2018-04-04 Thread Gerald Galster
Hi,

> There is also a second issue of a long standing race with replication 
> occurring somewhere whereby if a mail comes in, is written to disk, is 
> replicated and then deleted in short succession, it will reappear again to 
> the MUA.  I suspect the mail is being replicated back from the remote.  A few 
> people have reported it over the years but it's not reliable or consistent, 
> so it has never been fixed.

Sounds like my replication issue, which is reproducible on 2.2.35 and does not 
occur on 2.2.33.2, so I assume something in the replication code changed 
between these two versions.

dsync is copying the mail before expunge in this situation (no sieve filters 
involved):

(mail received on mx2a.example.com and delivered via dsync/ssh to 
mx2b.example.com, then expunged via pop3 on mx2b.example.com -> copy/duplicate)
Mar 26 15:35:58 mx2b.example.com dovecot[3825]: pop3(popt...@example.com): 
expunge: box=INBOX, uid=23, 
msgid=, size=1210, 
subject=test 1535
Mar 26 15:35:58 mx2b.example.com dovecot[3825]: doveadm: Error: 
dsync-remote(popt...@example.com): Info: copy from INBOX: box=INBOX, uid=24, 
msgid=, size=1210, 
subject=test 1535
Mar 26 15:35:58 mx2b.example.com dovecot[3825]: doveadm: Error: 
dsync-remote(popt...@example.com): Info: expunge: box=INBOX, uid=23, 
msgid=, size=1210, 
subject=test 1535

For more details see mail from 2018-03-27 / Duplicate mails on pop3 expunge 
with dsync replication on 2.2.35 (2.2.33.2 works)
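For what it's worth, the copy-after-expunge pattern in the log excerpt above can be spotted mechanically. A rough sketch, assuming this log layout and using made-up msgid values (the excerpt elides the real ones):

```shell
#!/bin/sh
# Rough sketch: scan a dovecot log for message-ids that dsync copies back
# *after* a local expunge -- the duplicate pattern described above. The
# sample lines mirror the excerpt; msgid/host values are made up.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Mar 26 15:35:58 mx2b dovecot[3825]: pop3(user@example.com): expunge: box=INBOX, uid=23, msgid=<t1535@example.com>, size=1210
Mar 26 15:35:58 mx2b dovecot[3825]: doveadm: Error: dsync-remote(user@example.com): Info: copy from INBOX: box=INBOX, uid=24, msgid=<t1535@example.com>, size=1210
Mar 26 15:35:58 mx2b dovecot[3825]: doveadm: Error: dsync-remote(user@example.com): Info: expunge: box=INBOX, uid=23, msgid=<t1535@example.com>, size=1210
EOF

hits=$(awk '
# remember msgids expunged locally (pop3/imap, not by dsync itself)
/ expunge: box=/ && !/dsync-remote/ {
    if (match($0, /msgid=[^,]*/)) seen[substr($0, RSTART, RLENGTH)] = 1
}
# flag any of them that dsync later copies back in
/dsync-remote/ && /copy from/ {
    if (match($0, /msgid=[^,]*/)) {
        id = substr($0, RSTART, RLENGTH)
        if (id in seen) print "copied back after expunge: " id
    }
}' "$LOG")
echo "$hits"
rm -f "$LOG"
```

Pointing it at the real maillog (instead of the here-doc sample) would list every message that reappeared this way.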

Best regards,
Gerald

Re: 2.3.1 Replication is throwing scary errors

2018-04-03 Thread Reuben Farrelly

Hi,


--

Message: 2
Date: Mon, 2 Apr 2018 22:06:07 +0200
From: Michael Grimm 
To: Dovecot Mailing List 
Subject: 2.3.1 Replication is throwing scary errors
Message-ID: <29998016-d62f-4348-93d1-613b13da9...@ellael.org>
Content-Type: text/plain;   charset=utf-8

Hi

[This is Dovecot 2.3.1 at FreeBSD STABLE-11.1 running in two jails at distinct 
servers.]

I upgraded from 2.2.35 to 2.3.1 today, and I am now being pounded by error 
messages at server1 (and vice versa at server2) as follows:

| Apr  2 17:12:18  server1.lan dovecot: doveadm: Error: 
dsync(server2.lan): I/O has stalled, \
no activity for 600 seconds (last sent=mail_change, last 
recv=mail_change (EOL))
| Apr  2 17:12:18  server1.lan dovecot: doveadm: Error: 
Timeout during state=sync_mails \
(send=changes recv=mail_requests)
[?]
| Apr  2 18:59:03  server1.lan dovecot: doveadm: Error: 
dsync(server2.lan): I/O has stalled, \
no activity for 600 seconds (last sent=mail, last recv=mail 
(EOL))
| Apr  2 18:59:03  server1.lan dovecot: doveadm: Error: 
Timeout during state=sync_mails \
(send=mails recv=recv_last_common)

I cannot see in my personal account any missing replications, *but* I haven't 
tested this thoroughly enough. I do have customers being serviced at these 
productive servers, *thus* I'm back to 2.2.35 until I do understand or have 
learned what is going on.

Any ideas/feedback?

FYI: I haven't seen such errors before. Replication has been working for years 
now, without any glitches at all.

Regards,
Michael


It's not just you.  This issue hit me recently, and it was impacting 
replication noticeably.  I am following git master-2.3 .


Here's a last known reasonably good point in the tree where things 
worked quite well:


EGIT_REPO_URI="https://github.com/dovecot/core.git"
EGIT_BRANCH="master-2.3"
EGIT_COMMIT="d9a1a7cbec19f4c6a47add47688351f8c3a0e372"

So something after that (which could have gone into 2.3.1) has caused this.
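Given a known-good commit and a known-bad release, git bisect can narrow "something after that" down to one commit. A sketch of the mechanics on a throwaway repo so it is self-contained; against dovecot core you would start from the EGIT_COMMIT above as "good" and the 2.3.1 tag as "bad", with a real replication test in place of the grep:

```shell
#!/bin/sh
# Demonstrates git bisect run on a synthetic 5-commit history where the
# 4th commit introduces a detectable "regression".
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }
g commit -q --allow-empty -m "good 1"
g commit -q --allow-empty -m "good 2"
g commit -q --allow-empty -m "good 3"
echo "regression" > replicator.c            # the commit that breaks things
g add replicator.c
g commit -q -m "introduce regression"
g commit -q --allow-empty -m "unrelated follow-up"

git bisect start HEAD HEAD~4 >/dev/null     # bad=HEAD, good=4 commits back
# test command: exit 0 = commit is good, non-zero = commit is bad
git bisect run sh -c '! grep -q regression replicator.c 2>/dev/null' >/dev/null
first_bad=$(git bisect log | sed -n 's/^# first bad commit: \[[0-9a-f]*\] //p')
git bisect reset >/dev/null
echo "$first_bad"
```

It prints the subject of the first bad commit ("introduce regression"), which is exactly what you'd want bisect to hand you for the 2.3.1 hang.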

There is also a second issue of a long standing race with replication 
occurring somewhere whereby if a mail comes in, is written to disk, is 
replicated and then deleted in short succession, it will reappear again 
to the MUA.  I suspect the mail is being replicated back from the 
remote.  A few people have reported it over the years but it's not 
reliable or consistent, so it has never been fixed.


And lastly there has been an ongoing but seemingly minor issue relating 
to locking timing out after 30s particularly on the remote host that is 
being replicated to.  I rarely see the problem on my local disk where 
almost all of the mail comes in, it's almost always occurring on the 
replicate/remote system.
For me it seems very unlikely that on an unloaded/idle VPS there are 
legitimate problems obtaining a lock in under 30s.  This is with the 
default locking configuration.  This problem started happening a lot 
more after the breakage in (1) above.
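If it helps while debugging: the 30 seconds in the lock errors matches dsync's default lock wait (the -l option of doveadm sync). Assuming the replication plugin's replication_dsync_parameters setting is available in your version, the wait can be raised as a stopgap; this is a sketch with a hypothetical replica address, not a fix for the underlying hang:

```
# Sketch only -- check the replication docs for your dovecot version.
plugin {
  mail_replica = tcp:server2.example.com     # hypothetical replica address
  # default is "-d -N -l 30 -U"; bump the lock wait from 30s to 120s
  replication_dsync_parameters = -d -N -l 120 -U
}
```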


These replication issues are similar, and could possibly be related.

My system is Gentoo Linux keeping up with the latest kernels, and on an 
EXT4 FS.  I am using TCPS based replication.  My remote replicate is 
also on Gentoo Linux with EXT4 but on a Linode VPS (around 250ms latency 
away).


I know in a later post you've said that you don't think it has anything 
to do with dovecot-2.3.1, so I'd be interested to know what really is 
the cause in that case.


Reuben


Re: 2.3.1 Replication is throwing scary errors

2018-04-03 Thread Michael Grimm
Michael Grimm  wrote:

> [This is Dovecot 2.3.1 at FreeBSD STABLE-11.1 running in two jails at 
> distinct servers.]
> 
> I did upgrade from 2.2.35 to 2.3.1 today, and I do become pounded by error 
> messages at server1 (and vice versa at server2) as follows:
> 
>   | Apr  2 17:12:18  server1.lan dovecot: doveadm: Error: 
> dsync(server2.lan): I/O has stalled, \
>   no activity for 600 seconds (last sent=mail_change, last 
> recv=mail_change (EOL))
>   | Apr  2 17:12:18  server1.lan dovecot: doveadm: Error: 
> Timeout during state=sync_mails \
>   (send=changes recv=mail_requests)
[snip]
> FYI: I haven't seen such errors before. Replication has been working for 
> years now, without any glitches at all.

That statement of mine was incorrect:

#) I did investigate a bit further, and I do see those errors on about 20 days 
spread over the last year. 
#) And what puzzles me even more is the fact that only server2 reports those 
errors, not a single line in server1's log files.
#) All those error messages above are paralleled by messages like:

   Apr  2 17:10:49  server2.lan dovecot: doveadm: Error: Couldn't 
lock /home/to/USER1/.dovecot-sync.lock: \
   fcntl(/home/to/USER1/.dovecot-sync.lock, write-lock, F_SETLKW) locking 
failed: Timed out after 30 seconds \
   (WRITE lock held by pid 51110)

#) I did upgrade both servers to 2.3.1 a couple of hours ago, and haven't seen 
a single error, yet.

I do have to admit that I do not understand what is going on at server2, but I 
am now quite sure it has nothing to do with dovecot 2.3.1.
Sorry for the noise. 

Regards,
Michael