Re: [Bug 688541] Re: race condition on shutdown (leads to corrupted fs)

2010-12-23 Thread Clint Byrum
On Thu, 2010-12-23 at 22:07 +, ingo wrote:
 I took you literally and changed all [!2345], not the others:
 
 The remaining now are:
 
 fgrep 'stop on runlevel' /etc/init/*.conf
 /etc/init/rc.conf:stop on runlevel [!$RUNLEVEL]
 /etc/init/rcS.conf:stop on runlevel [!S]
 /etc/init/rc-sysinit.conf:stop on runlevel
 /etc/init/tty2.conf:stop on runlevel [!23]
 /etc/init/tty3.conf:stop on runlevel [!23]
 /etc/init/tty4.conf:stop on runlevel [!23]
 /etc/init/tty5.conf:stop on runlevel [!23]
 /etc/init/tty6.conf:stop on runlevel [!23]
 /etc/init/ufw.conf:stop on runlevel [!023456]
 
 I still get the orphaned inodes. Shall I also convert the ttys?
 

You can, but I doubt they're the problem.

Can you paste the output of

lsof -n |grep deleted

After the reinstall?
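
If the raw output is long, a small filter can show which commands still hold deleted files open. This is only a sketch: deleted_by_command is a hypothetical helper name, and it assumes lsof marks such entries with a trailing "(deleted)" in the NAME column. Entries that lsof reports in some other way are not counted.

```shell
# Hypothetical helper: reads `lsof -n` output on stdin and prints, for each
# command name, how many entries have a NAME column ending in "(deleted)".
deleted_by_command() {
    awk '$NF == "(deleted)" { count[$1]++ }
         END { for (cmd in count) print count[cmd], cmd }' | sort -rn
}

# Usage on a live system (lsof usually needs root to see every process):
#   lsof -n | deleted_by_command
```

The command with the highest count is listed first.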

Thanks.

-- 
You received this bug notification because you are a member of Ubuntu
Server Team, which is subscribed to mysql-5.1 in ubuntu.
https://bugs.launchpad.net/bugs/688541

Title:
  race condition on shutdown (leads to corrupted fs)

-- 
Ubuntu-server-bugs mailing list
Ubuntu-server-bugs@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/ubuntu-server-bugs


Re: [Bug 688541] Re: race condition on shutdown (leads to corrupted fs)

2010-12-21 Thread Clint Byrum
On Tue, 2010-12-21 at 12:41 +, Scott James Remnant wrote:
 On 20/12/10 18:22, Clint Byrum wrote:
  In a message to ubuntu-devel I suggested that we have an abstract job,
  'network-services', which most normal (non boot-critical) services
  should follow.
 
  https://lists.ubuntu.com/archives/ubuntu-devel/2010-December/032254.html
 
 General note:  ubuntu-devel is *NOT* the correct list to discuss Upstart 
 changes unless they're unique to Ubuntu.
 

Thanks, Scott

In this case, I don't know if this would be unique to Ubuntu or not. I
am not suggesting a code change in upstart with that message, but rather
a change in the way upstart is used and packaged in Ubuntu. Though, it
would be rather nice if everybody used upstart the same way.



Re: [Bug 688541] Re: race condition on shutdown (leads to corrupted fs)

2010-12-20 Thread Michael Biebl
2010/12/20 James Hunt 688...@bugs.launchpad.net:

 3) Modify all upstart configs for services which are slow to stop such that
 they stop on unmount-filesystem, rather than stop on runlevel [016].

- What about single-user mode? I guess when switching to runlevel 1 we
want to stop services like mysql?
- How do you decide whether a service is 'slow to stop'? IMHO that
depends heavily on the given hardware, local configuration and the
amount of data you are dealing with. A general approach would be
preferable.

-- 
Why is it that all of the instruments seeking intelligent life in the
universe are pointed away from Earth?



Re: [Bug 688541] Re: race condition on shutdown (leads to corrupted fs)

2010-12-20 Thread Clint Byrum
On Mon, 2010-12-20 at 12:50 +, James Hunt wrote:
 After discussion with Scott, the best short-term solution would seem to
 be:
 
 1) Modify /etc/init.d/umountfs to call the following in do_stop before
 calling umount/swapoff:
 
  initctl emit unmount-filesystem
 
 2) Modify /etc/init.d/umountroot to call the following in do_stop before
 calling umount:
 
  initctl emit unmount-root-filesystem
 
 
 3) Modify all upstart configs for services which are slow to stop such that
 they stop on unmount-filesystem, rather than stop on runlevel [016].
 
 4) Test!
 
 The overall effect of this being that when /etc/init.d/umountfs emits
 the unmount-filesystem event, it will block until any Upstart jobs which
 stop on those events have completed. Thus, /etc/init.d/umountfs will
 wait for the mysql Upstart job to finish before unmounting its
 filesystems.
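
Steps 1 and 2 of the proposal above could look roughly like the sketch below. It is not the real do_stop from /etc/init.d/umountfs: emit_and_wait and unmount_local_filesystems are hypothetical names, and the unmount step is reduced to a placeholder. The one thing it relies on is the behaviour described above, namely that initctl emit blocks until the jobs reacting to the event have finished.

```shell
# Placeholder for the existing umount/swapoff logic of the init script.
# The real script does considerably more than this.
unmount_local_filesystems() { :; }

# Emit an Upstart event and wait for the jobs that react to it.
# If initctl is not available, skip the step instead of failing.
emit_and_wait() {
    if command -v initctl >/dev/null 2>&1; then
        initctl emit "$1" || true
    fi
}

do_stop() {
    # Give jobs with "stop on unmount-filesystem" the chance to finish first.
    emit_and_wait unmount-filesystem
    unmount_local_filesystems
}
```

For umountroot the same helper would be called with unmount-root-filesystem before the final umount.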


Not much happens between rc-sysinit starting and sendsigs/umountfs. Is
even 1 second between SIGTERM and exiting 'slow'? Shouldn't we just make
sure everything that is 'stop on runlevel [!2345]' or 'stop on runlevel
[016]' stops before we umount? Bug #672177 may very well be caused
simply by killing the last service that had the deleted libc.so.6 open,
forcing the fs to finish the deletion right then, which could mean
waiting on a sync and on many other files being flushed on a busy
rotational disk. That can make something very tiny take a second to
die.

I think we must transition *everything* that stops on runlevel [016] to
'stop on unmounting-filesystems', or get clever and find a way to wait
until upstart is done stopping everything it already wants to stop. I do
think that initctl list is flawed for this task, but it might be the
best chance at catching stragglers that we have.

In a message to ubuntu-devel I suggested that we have an abstract job,
'network-services', which most normal (non boot-critical) services
should follow.

https://lists.ubuntu.com/archives/ubuntu-devel/2010-December/032254.html

By taking this approach, we can at least amend this fix if it has
unintended consequences.

There's also still the issue (which probably should be its own bug
report) that sendsigs will kill the children of already stopping jobs,
which it shouldn't do, and which it would still do in the suggested fix
since sendsigs runs before umountfs.



Re: [Bug 688541] Re: race condition on shutdown (leads to corrupted fs)

2010-12-16 Thread Michael Biebl
2010/12/16 Clint Byrum cl...@fewbar.com:

 /etc/init.d/sendsigs has this code:


        # Upstart jobs have their own stop on clauses that sends
        # SIGTERM/SIGKILL just like this, so if they're still running,
        # they're supposed to be
        for pid in $(initctl list | sed -n -e "/process [0-9]/s/.*process //p"); do
                OMITPIDS="${OMITPIDS:+$OMITPIDS }-o $pid"
        done


 It uses this to determine which pids not to kill because, presumably, upstart 
 should be managing them.

 However, this code is flawed. killall5 will kill the children of all of
 these if they are multi process daemons or scripts running things.

This observation is correct. On the other hand, isn't this exactly
what the sendsigs script is for: cleaning up any remaining stray
processes which have not been stopped by their corresponding sysv init
script or upstart job (or which have been started by the user, for
example)?

But I guess you are right: we should first stop all upstart jobs, give
them time to finish stopping, and then let sendsigs clean up anything
remaining afterwards.

 However, this technique can actually be used to determine if there are
 still jobs that are supposed to be stopped, but haven't finished
 stopping yet. Since they should be listed as
 stop/(pre-stop|post-stop|killed), we can determine exactly which pids we
 expect to go away.
 Since upstart has its own idea of how long to wait before it kills
 these, we should actually wait indefinitely.
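
A minimal sketch of that idea follows. stopping_jobs is a hypothetical helper, and it assumes the usual "job goal/state, process PID" lines printed by initctl list. It only filters text, so it is subject to whatever races the listing itself has.

```shell
# Hypothetical helper: reads `initctl list` output on stdin and prints
# "job pid" for every job whose goal is stop but which still owns a
# process, i.e. whose state is pre-stop, post-stop or killed.
stopping_jobs() {
    awk '/ stop\/(pre-stop|post-stop|killed), process [0-9]+/ {
             sub(/ stop\/.*process /, " ")
             print
         }'
}

# Usage on a live system:
#   initctl list | stopping_jobs
```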

 I'm attaching a debdiff that solves the race as far as I can tell,
 though I think it needs a good long look, since it could mean shutdowns
 hang for a long time waiting (I'm especially curious if the
 pre-stop/post-stop's are subject to kill timeout)

This code is still racy, afaics. What about upstart jobs which are
not stopped by 'stop on runlevel [016]'? They could receive their stop
signal at a point when your loop has already run.

If you don't want to change existing jobs, we probably have to pick up
Ante's suggestion, and do the following in sendsigs:

1) Run a for loop to wait for *all* running upstart jobs to stop.
Upstart jobs which need to keep running past sendsigs (e.g. plymouth)
need to signal that using a mechanism similar to the killall5
sendsigs.d omit interface. I'd give upstart jobs at least 60 seconds
to stop, so big databases etc. have enough time to shut down cleanly.
2) Run a for loop and send SIGTERM to all remaining processes, but do
*not* add upstart pids to $OMITPIDS.
3) Send a final SIGKILL if any processes are left.


Regarding 1.), it would be nice to have a native C implementation in
upstart, instead of running initctl, grep and sleep manually.
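
For illustration, the manual "initctl, grep and sleep" variant of step 1 might look roughly like the sketch below. It is not taken from sendsigs: OMIT_JOBS is a hypothetical stand-in for the omit mechanism mentioned above, the 60-second default comes from the suggestion above, and the parsing assumes the usual "job goal/state, process PID" lines from initctl list. Polling the list this way is not atomic, so the sketch shows the shape of the loop rather than a race-free fix.

```shell
# Hypothetical omit list: jobs that may keep running past sendsigs.
OMIT_JOBS="${OMIT_JOBS:-plymouth}"

# Print the names of all jobs that still own a process, except omitted ones.
jobs_with_processes() {
    initctl list | awk -v omit="$OMIT_JOBS" '
        BEGIN { n = split(omit, names, " ")
                for (i = 1; i <= n; i++) skip[names[i]] = 1 }
        /, process [0-9]+/ && !($1 in skip) { print $1 }'
}

# wait_for_jobs [SECONDS]: poll once per second until no such job is left.
# Returns 0 when everything has stopped, 1 when the timeout expires.
wait_for_jobs() {
    remaining="${1:-60}"
    while [ -n "$(jobs_with_processes)" ]; do
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}
```

A caller would run wait_for_jobs 60 before step 2 and fall through to the SIGTERM/SIGKILL steps whether or not it timed out.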





Re: [Bug 688541] Re: race condition on shutdown (leads to corrupted fs)

2010-12-16 Thread Clint Byrum
On Thu, 2010-12-16 at 15:45 +, Michael Biebl wrote:
 2010/12/16 Clint Byrum cl...@fewbar.com:
 
  
  I'm attaching a debdiff that solves the race as far as I can tell,
  though I think it needs a good long look, since it could mean shutdowns
  hang for a long time waiting (I'm especially curious if the
  pre-stop/post-stop's are subject to kill timeout)
 
 This code is still racy, afaics. What about upstart jobs which are
 not stopped by 'stop on runlevel [016]'? They could receive their stop
 signal at a point when your loop has already run.
 

Indeed, now that I dig through upstart's code a bit, I think there is
still a race. If any of the jobs in the stop/!waiting state have 'stop on
stopped' jobs that will be stopped after they stop, the event isn't
emitted until *after* the transition to stop/waiting.

thread A (upstart job foo):

start/running -> stop/pre-stop
sends TERM to owned process
stop/pre-stop -> stop/killed
process dies
stop/killed -> stop/waiting
emit stopped JOB=foo

thread B (upstart job baz):

start/running -> stop/pre-stop
sends kill to owned process
stop/pre-stop -> stop/killed
process dies
stop/killed -> stop/waiting

thread C (sleep loop):

runs initctl list
greps
sleeps
runs initctl list
greps
sleeps

list is handled by doing a get-all-jobs command first, and then
individual status commands for each job, so it's entirely possible that
we will ask for the status of baz and it will say start/running, then
foo finishes its transition, then we ask for foo's status and it is
stop/waiting, and we think we're done.

This race would probably be solved by having a 'list all jobs with
status' command, as long as the stopped event is guaranteed to be
consumed before any commands, which I believe it will be.

One delicate issue is that if an upstart managed process dies for any
other reason than being stopped, upstart will try to respawn it, so we
can't just go sending SIGTERM/SIGKILL to all pids, as upstart will fight
us on those. We actually have to stop everything.
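
Stopping things through upstart rather than signalling pids could be sketched as follows. The names are hypothetical: KEEP_JOBS is an assumed list of jobs that must survive (rc is included because it is the job running the shutdown scripts), and jobs with an instance, such as "network-interface (lo)", are skipped because they would need extra arguments. --no-wait makes each stop request return immediately, so waiting for the jobs to finish is a separate step.

```shell
# Hypothetical list of jobs that must not be stopped from here.
KEEP_JOBS="${KEEP_JOBS:-rc plymouth}"

# Ask upstart to stop every job whose goal is still "start", so that
# upstart does not respawn processes behind our back.
stop_remaining_jobs() {
    initctl list | awk '$2 ~ /^start\// { print $1 }' | sort -u |
    while read -r job; do
        case " $KEEP_JOBS " in
            *" $job "*) continue ;;
        esac
        initctl stop --no-wait "$job" || true
    done
}
```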

 If you don't want to change existing jobs, we probably have to pick up
 Ante's suggestion, and do the following in sendsigs:
 
 1) Run a for loop to wait for *all* running upstart jobs to stop.
 Upstart jobs which need to keep running past sendsigs (e.g. plymouth)
 need to signal that using a mechanism similar to the killall5
 sendsigs.d omit interface. I'd give upstart jobs at least 60 seconds
 to stop, so big databases etc. have enough time to shut down cleanly.

IMO, leaving out a valid stop on that gets it stopped at or before
runlevel [016] is the equivalent of the omit interface. You've started
it, saying exactly when upstart should or should not stop it. However,
if you've wandered into the scenario mentioned above with 'stop on
stopped foo', then we need to handle that.

 2) Run a for loop and send SIGTERM to all remaining processes, but do
 *not* add upstart pids to $OMITPIDS.

See above, you'd have to send 'stop' commands to upstart for them,
instead of omitting them.

 3) Send a final SIGKILL if any processes are left.
 

I'd say let upstart do that, but how do we know when we can continue
on to unmounting? I suppose after a lengthy timeout (60s does seem long
enough, though mysql can take longer) this makes sense.

 
 Regarding 1.), it would be nice to have a native C implementation in
 upstart, instead of running initctl, grep and sleep manually.
 

I agree, but I'm having trouble envisioning exactly what one would ask
for. 'Block until all current goals are reached' would work, maybe.



Re: [Bug 688541] Re: race condition on shutdown (leads to corrupted fs)

2010-12-13 Thread Michael Biebl
2010/12/14 Clint Byrum cl...@fewbar.com:

 I do think the appropriate fix is to have umountfs emit an
 'unmounting-filesystems' event and anything that does a 'start on
 local-filesystems' or 'start on filesystem' should also 'stop on
 unmounting-filesystems',

What do you do about services which have
start on runlevel [2345] and the binary is in /usr?

There are quite a few examples here: acpid, atd, cron, irqbalance, etc
which all have:

start on runlevel [2345]
stop on runlevel [!2345]

Either those jobs are buggy for not specifying the start on
(local-)filesystems dependency, or your criterion is not sufficient.

IMHO the major problem here is that there is a mix-up between the
dependencies that need to be satisfied to be able to run a job and
when (in which runlevels) to start a job.

Michael





Re: [Bug 688541] Re: race condition on shutdown (leads to corrupted fs)

2010-12-10 Thread Michael Biebl
2010/12/10 Ante Karamatić iv...@grad.hr:
 Suggestion: make umountfs wait for all upstart jobs to finish.

Doesn't that conflict though with what is written in
/etc/init.d/sendsigs:

# Upstart jobs have their own stop on clauses that sends
# SIGTERM/SIGKILL just like this, so if they're still running,
# they're supposed to be
for pid in $(initctl list | sed -n -e "/process [0-9]/s/.*process //p"); do
        OMITPIDS="${OMITPIDS:+$OMITPIDS }-o $pid"
done

or

# did an upstart job start since we last polled initctl? check
# again on each loop and add any new jobs (e.g., plymouth) to
# the list.  If we did miss one starting up, this beats waiting
# 10 seconds before shutting down.
for pid in $(initctl list | sed -n -e "/process [0-9]/s/.*process //p"); do
        OMITPIDS="${OMITPIDS:+$OMITPIDS }-o $pid"
done


