Re: [qmailadmin] Major bug in qmailadmin when using MySQL replication.
Apologies, I meant to click SAVE instead of send as I left work yesterday. Anyway, appended to this email is The Rest of The Story. On Monday, October 21, 2002, at 06:24 PM, Matt Simerson wrote: OK, this will be fun to explain. I have a large vpopmail install that I'm running this on so the system is quite busy. I recently converted this system from .cdb to MySQL authentication. In so doing, we've happened upon a few anomalies, the strangest one being that qmailadmin functions only worked intermittently. We use MySQL replication and have a MySQL master running on a quad Xeon that is more than adequate hardware wise. We replicate to each mail server via MySQL replication. Vpopmail is then configured to write updates to the master and read from the slave on localhost. This works quite well for vpopmail. So, upon noticing that qmailadmin wasn't working reliably, we started poking around and saw that every time we made a change in qmailadmin, we'd get two updates written to the MySQL master (as observed in binlog). The first would have all the values we wanted, the second would end up resetting those values to exactly what they were before the attempted change was made. So, a little visit to the qmailadmin source code revealed the culprit in user.c. If you read through user.c, you'll find a subroutine modusergo. In there, is the guts of what happens when you click on modify user and then submit. Reading through that, it appears that if we get a new password, we make a call to vpasswd. This function, I would guess is against libvpopmail.a and connects to the MySQL update server and updates the password, in addition to running crypt() against the supplied password. At this point, all is well. However, in the very next function, the problem becomes manifest. We check for the value of gecos and if it exists (which it always does because it's value is in the HTML form) then it first makes a call to vauth_getpw. Testing has revealed that as expected, it does the database read from the MySQL read server on localhost. Unfortunately, it's making the query a tiny fraction of a second after writing the new password into MySQL update server. I'm sure Bill and Ken both know the problem now. If you weren't using replication, this wouldn't be a problem (and isn't on my test servers) but what ends up happening is one SQL query arrives at the update server and sets the values that you changed appropriately. The next query arrives right behind it and resets the values to exactly what they were before the first query updated them. So, the bandaid I currently applied is putting a 5 second sleep in there so that the first change gets a chance to replicate before we run vauth_getpw. However, this is a distinctly BAD way to fix it. There's no way to guarantee that a change made to the MySQL master has propagated to the slave. In 5 seconds it's a 99% probability but I'd argue that isn't good enough. I can only think of one guaranteed way to fix it. That is to make sure that when qmailadmin does it's vauth_getpw call, it reads from the MySQL master. This is the only guaranteed way to assure that a successful database update written a fraction of a second before gets read back. This could be a even bigger problem for people that don't manage their MySQL replication well. Imagine having a farm of replicated servers where one is out of sync and the admins don't notice. A (l)user hits the server that's database isn't replicated and see's their old gecos field that they updated last week. The update it again and it resets their password back to their old one which they had updated the week before. That's not good manners. So, anyone else have suggestions on the best way to fix this? Matt
Re: [qmailadmin] Major bug in qmailadmin when using MySQL replication.
On Tuesday, October 22, 2002, at 07:49 AM, Matt Simerson wrote: Apologies, I meant to click SAVE instead of send as I left work yesterday. Anyway, appended to this email is The Rest of The Story. On Monday, October 21, 2002, at 06:24 PM, Matt Simerson wrote: OK, this will be fun to explain. I have a large vpopmail install that I'm running this on so the system is quite busy. I recently converted this system from .cdb to MySQL authentication. In so doing, we've happened upon a few anomalies, the strangest one being that qmailadmin functions only worked intermittently. We use MySQL replication and have a MySQL master running on a quad Xeon that is more than adequate hardware wise. We replicate to each mail server via MySQL replication. Vpopmail is then configured to write updates to the master and read from the slave on localhost. This works quite well for vpopmail. So, upon noticing that qmailadmin wasn't working reliably, we started poking around and saw that every time we made a change in qmailadmin, we'd get two updates written to the MySQL master (as observed in binlog). The first would have all the values we wanted, the second would end up resetting those values to exactly what they were before the attempted change was made. So, a little visit to the qmailadmin source code revealed the culprit in user.c. If you read through user.c, you'll find a subroutine modusergo. In there, is the guts of what happens when you click on modify user and then submit. Reading through that, it appears that if we get a new password, we make a call to vpasswd. This function, I would guess is against libvpopmail.a and connects to the MySQL update server and updates the password, in addition to running crypt() against the supplied password. At this point, all is well. However, in the very next function, the problem becomes manifest. We check for the value of gecos and if it exists (which it always does because it's value is in the HTML form) then it first makes a call to vauth_getpw. Testing has revealed that as expected, it does the database read from the MySQL read server on localhost. Unfortunately, it's making the query a tiny fraction of a second after writing the new password into MySQL update server. I'm sure Bill and Ken both know the problem now. If you weren't using replication, this wouldn't be a problem (and isn't on my test servers) but what ends up happening is one SQL query arrives at the update server and sets the values that you changed appropriately. The next query arrives right behind it and resets the values to exactly what they were before the first query updated them. So, the bandaid I currently applied is putting a 5 second sleep in there so that the first change gets a chance to replicate before we run vauth_getpw. However, this is a distinctly BAD way to fix it. There's no way to guarantee that a change made to the MySQL master has propagated to the slave. In 5 seconds it's a 99% probability but I'd argue that isn't good enough. I can only think of one guaranteed way to fix it. That is to make sure that when qmailadmin does it's vauth_getpw call, it reads from the MySQL master. This is the only guaranteed way to assure that a successful database update written a fraction of a second before gets read back. This could be a even bigger problem for people that don't manage their MySQL replication well. Imagine having a farm of replicated servers where one is out of sync and the admins don't notice. A (l)user hits the server that's database isn't replicated and see's their old gecos field that they updated last week. The update it again and it resets their password back to their old one which they had updated the week before. That's not good manners. So, anyone else have suggestions on the best way to fix this? After only getting the first half of the story yesterday, I duplicated your process to see if I could figure it out. The strange thing to me was that there are I think 3 select queries and 2 update queries that occur before modusergo is finished. This seems excessive. There really should just be one select, and one update. All the error checking and modifications to the vpw structure should happen in qmailadmin, with a single update at the end. I think that refining this will be the best solution to the problem. However, the following select during moduser (called by the result page after modifying the user) might still get old data. I'm not sure how long replication takes. Did you experiment with sleeps less than 5 seconds? I vaguely recall someone else having to put a sleep in either qmailadmin or vqadmin for this same reason. Maybe another solution would be to refresh back to the main menu, or user list after modifying. That would force a delay before moduser() does the next select for the result page. I'll look at it today and see if I can improve it. Regards, Bill Shupp
Re: [qmailadmin] Major bug in qmailadmin when using MySQL replication.
On Tuesday, October 22, 2002, at 07:49 AM, Matt Simerson wrote: If you read through user.c, you'll find a subroutine modusergo. In there, is the guts of what happens when you click on modify user and then submit. Reading through that, it appears that if we get a new password, we make a call to vpasswd. This function, I would guess is against libvpopmail.a and connects to the MySQL update server and updates the password, in addition to running crypt() against the supplied password. At this point, all is well. However, in the very next function, the problem becomes manifest. We check for the value of gecos and if it exists (which it always does because it's value is in the HTML form) then it first makes a call to vauth_getpw. Testing has revealed that as expected, it does the database read from the MySQL read server on localhost. Unfortunately, it's making the query a tiny fraction of a second after writing the new password into MySQL update server. I'm sure Bill and Ken both know the problem now. What if you just moved the vpasswd call (to change the password) AFTER the other changes you make via vauth_getpw? You still have the problem of a user not seeing their changes propagate immediately (if they go to edit their page again, it might still show the old values). I can only think of one guaranteed way to fix it. That is to make sure that when qmailadmin does it's vauth_getpw call, it reads from the MySQL master. This is the only guaranteed way to assure that a successful database update written a fraction of a second before gets read back. I don't know enough about the vpopmail structures, but you would probably need to add a flag to vauth_getpw indicating whether it was allowed to read from the local (replicated) server, or if it should read from the master. Qmailadmin should make all of its calls from the master, so it's always reporting the actual/current settings. -- Tom Collins [EMAIL PROTECTED]
Re: [qmailadmin] Major bug in qmailadmin when using MySQL replication.
Sorry, didn't mean to take this off-list... On Tuesday, October 22, 2002, at 09:08 AM, Matt Simerson wrote: On Tuesday, October 22, 2002, at 11:27 AM, Bill Shupp wrote: After only getting the first half of the story yesterday, I duplicated your process to see if I could figure it out. The strange thing to me was that there are I think 3 select queries and 2 update queries that occur before modusergo is finished. This seems excessive. There really should just be one select, and one update. I tend to agree. All the error checking and modifications to the vpw structure should happen in qmailadmin, with a single update at the end. The only problem with that is the vpopmail call vauth_setpw function doesn't handle crypting of the password. That needs to be done via a vpopmail call in case the sysadmin has set up crypt to use something other than the systems default crypt(). Of course, that could be as easy as adding a vcrypt routine to vpopmail if it doesn't already exist. Then you'd have the crypted password. I think that refining this will be the best solution to the problem. However, the following select during moduser (called by the result page after modifying the user) might still get old data. I'm not sure how long replication takes. In theory, it's instant but on really busy systems, it can take up to a second or two. If replication breaks for any reason, it could be minutes or hours. You can't count on Did you experiment with sleeps less than 5 seconds? No, specifically because I wrote a script that allows me to watch the replication among all four of my slaves at the same time. It often takes a few seconds for the slave to catch up. I picked 5 because it *should* succeed nearly every time. I vaguely recall someone else having to put a sleep in either qmailadmin or vqadmin for this same reason. But a sleep is a bandaid, it's not a solution. Maybe another solution would be to refresh back to the main menu, or user list after modifying. That would force a delay before moduser() does the next select for the result page. I think the ideal solution would be, as you stated, a one select, and one update function. At the very beginning of the modusergo subroutine, do your vauth_getpw so that you have all the values. Then, if there is a password update, then run the vcrypt function which will return the crypted password. Populate your vpw hash with the crypted password, any other changes as already configured and then at the end of the routine, update the master server. My only beef with that is I that I think it should do the vauth_getpw from the MySQL master. That way you'll never end up overwriting newer data on the master from an out of date slave. That functionality will only affect people using MySQL replication and I think it's a safeguard that should be in place. I think that having qmailadmin always read from the master is the best solution. Unfortunately, it would require updating the vpopmail API to add an argument specifying whether to use a read server or not. Unless someone knows a way around it. Short of that, one solution would be to merge the call to vpasswd into vauth_setpw, but as you pointed out above, the mkpasswd call from vpasswd, along with some other checking from that function, would need to added to modusergo. And to make sure that the next read doesn't get old information, you could go back to the main menu, or show_users instead of moduser. I could easily implement this solution this morning, if that's what people want to do. Thoughts? What do you think Ken? Cheers, Bill Shupp
Re: [qmailadmin] Major bug in qmailadmin when using MySQL replication.
On Tuesday, October 22, 2002, at 09:30 AM, Bill Shupp wrote: I think that having qmailadmin always read from the master is the best solution. Unfortunately, it would require updating the vpopmail API to add an argument specifying whether to use a read server or not. Unless someone knows a way around it. Short of that, one solution would be to merge the call to vpasswd into vauth_setpw, but as you pointed out above, the mkpasswd call from vpasswd, along with some other checking from that function, would need to added to modusergo. And to make sure that the next read doesn't get old information, you could go back to the main menu, or show_users instead of moduser. I could easily implement this solution this morning, if that's what people want to do. Ok, for fun I went ahead and implemented this. The patch below is against vanilla 1.0.6, but it works for me. It consolidates the updates into one call by integrating much of vpasswd right into modusergo. There are still 3 select statements though, 2 before modusergo is called, and one during modusergo. Also, at the end of the function, it calls show_users() instead of moduser(), which should help allow replication to catch up before the next read for that user. Cheers, Bill *** user.c.orig Tue Oct 22 10:01:27 2002 --- user.c Tue Oct 22 10:28:05 2002 *** *** 753,758 --- 753,764 static char NewBuf[156]; int count; FILE *fs; + #ifdef SQWEBMAIL_PASS + uid_t uid; + gid_t gid; + #endif + + vpw = vauth_getpw(ActionUser, Domain); if (!( AdminType==DOMAIN_ADMIN || (AdminType==USER_ADMIN strcmp(ActionUser,Username)==0))){ *** *** 768,788 vclose(); exit(0); } ! ret_code = vpasswd( ActionUser, Domain, Password1, USE_POP); ! if ( ret_code != VA_SUCCESS ) { sprintf(StatusMessage, %s, get_html_text(140)); } else { ! sprintf(StatusMessage,%s %s%s %s, ! get_html_text(139), ActionUser, Domain, verror(ret_code) ); } } GetValue(TmpCGI,Gecos, gecos=, MAX_BUFF); if ( strlen( Gecos ) != 0 ) { - vpw = vauth_getpw(ActionUser, Domain); vpw-pw_gecos = Gecos; - vauth_setpw(vpw, Domain); } /* get the value of the cforward radio button */ GetValue(TmpCGI,box, cforward=, MAX_BUFF); --- 774,813 vclose(); exit(0); } ! if (strlen(Password1) MAX_PW_CLEAR_PASSWD) { ! sprintf(StatusMessage,%s %s%s %s, ! get_html_text(139), ActionUser, Domain, VA_PASSWD_TOO_LONG ); ! moduser(); ! vclose(); ! exit(0); ! } else if (vpw-pw_gid NO_PASSWD_CHNG) { sprintf(StatusMessage, %s, get_html_text(140)); + moduser(); + vclose(); + exit(0); } else { ! sprintf(StatusMessage,%s %s%s, ! get_html_text(139), ActionUser, Domain); } + + mkpasswd3(Password1,Crypted, MAX_BUFF); + vpw-pw_passwd = Crypted; + + #ifdef CLEAR_PASS + vpw-pw_clear_passwd = Password1; + #endif + #ifdef SQWEBMAIL_PASS + vget_assign(Domain, NULL, 0, uid, gid ); + vsqwebmail_pass( vpw-pw_dir, Crypted, uid, gid); + #endif + } GetValue(TmpCGI,Gecos, gecos=, MAX_BUFF); if ( strlen( Gecos ) != 0 ) { vpw-pw_gecos = Gecos; } + vauth_setpw(vpw, Domain); /* get the value of the cforward radio button */ GetValue(TmpCGI,box, cforward=, MAX_BUFF); *** *** 899,903 } call_hooks(HOOK_MODUSER); ! moduser(); } --- 924,928 } call_hooks(HOOK_MODUSER); ! show_users(Username, Domain, Mytime); }