[Bug 1447756] Re: segfault in log.c code causes phone reboot loops

James Hunt Fri, 22 May 2015 06:31:35 -0700

** Description changed:

+ = Summary =
+ 
+ The version of Upstart in vivid is affected by a coule of bugs relating
+ to the flushing data from early-boot jobs to disk which can both result
+ in a crash:
+ 
+ == Problem 1 ==
+ 
+ An internal list is mishandled meaning a crash could occur randomly.
+ 
+ == Problem 2 ==
+ 
+ Jobs which spawn processes in the background then themselves exit can
+ cause a crash due.
+ 
+ = Explanation of how Upstart flushes early job output =
+ 
+ If an Upstart job starts *and ends* early in the boot sequence (before
+ the log partition is mounted and writable) and produces output to its
+ stdout/stderr, Upstart will cache the output for later flushing by
+ adding the 'Log' object associated with the 'Job' to a list.
+ 
+ When the log partition is mounted writable, the
+ /etc/init/flush-early-job-log.conf job is run which calls "initctl
+ notify-disk-writeable". This is a signal to Upstart to flush its cache
+ of early-boot job output which takes the form of iterating the
+ 'log_unflushed_files' list and flushing all the 'Log' entries to disk.
+ 
+ = Code Specifics =
+ 
+ There are 2 issues (note that the numbers used below match those used in
+ the Summary).
+ 
+ == Problem 1 detail ==
+ 
+ Due to a bug in the way the 'log_unflushed_files' list is handled (the
+ 'Log' cannot be added to the list directly, so is added via an
+ intermediary ('NihListElem') node), a crash can result when iterating
+ the list since the 'Log' is freed, but NOT the intermediary node. The
+ implication is that it is possible for the intermediary node to be
+ attempt to dereference already-freed data, resulting in a crash.
+ 
+ == Problem 2 detail ==
+ 
+ If a job spawns a process in the background, then itself exits, that
+ jobs 'Log' entry will be added to the 'log_unflushed_files' list. But,
+ if the background process produces output and then exits before Upstart
+ attempts to flush the original jobs data to disk, the 'NihIo'
+ corresponding to the log will be serviced automatically and the data
+ flushed to disk. The problem comes when Upstart receives the
+ notification to flush the 'log_unflushed_files' list, since that list
+ now contains an entry which has already been freed (since all its data
+ has already been flushed). The result is an assertion failure.
+ 
+ = Fix =
+ 
+ == Problem 1 fix ==
+ 
+ Correct the 'log_unflushed_files' list handling by freeing the
+ 'NihListElem' (which will automatically free the 'Log' object), not by
+ simply freeing the 'Log' object itself.
+ 
+ * Branch: lp:~jamesodhunt/ubuntu/vivid/upstart/bug-1447756/
+ * New Upstart test added to avoid regression?: Yes.
+ 
+ == Problem 2 fix ==
+ 
+ Correct the assumption that the only entries in the
+ 'log_unflushed_files' list will always have data to flush by checking if
+ there is in fact any data to flush; if not, remove the entry from the
+ 'log_unflushed_files' list since it has already been handled
+ automatically by the 'NihIo'.
+ 
+ * Branch: lp:~jamesodhunt/upstart/bug-1447756-the-actual-fix
+ * New Upstart test added to avoid regression?: Yes.
+ 
+ = Workarounds =
+ 
+ If a system is affected by this bug, it will be manifested by a crash
+ early in the boot sequence.
+ 
+ To overcome the issue, either:
+ 
+ a) Boot by adding "--no-log" to the kernel command-line.
+ 
+ b) Disable the flush-early-job-log job (assuming the machine is
+ bootable) by running the following:
+ 
+    $ echo manual | sudo tee -a /etc/init/flush-early-job-log.override
+ 
+ = Impact =
+ 
+ The issue has been present in Upstart since logging was introduced but
+ no known instances of crashes relating to these problems have been
+ reported prior to this bug being reported (which relates the the issue
+ being seen on a very small subset of specific Ubuntu Touch phone
+ hardware where Upstart is used as the system init daemon).
+ 
+ Note that vivid still uses Upstart for managing the graphical session,
+ but now uses systemd by default for the system init daemon. Since the session 
(Upstart) init does not even require 
+ a flush-early-job-log, the exposure to both the bug and the updated fix 
codepath is extremely limited.
+ 
+ = Test Case =
+ 
+ This bug is extremely hard to surface so the approach is simply to check
+ that the internal list can be iterated correctly by:
+ 
+ 1) Booting the system with upstart
+    (select the Upstart option from the grub menu or add "init=/sbin/upstart" 
to the kernel command-line).
+ 
+ 2) Running the following on a system booted with Upstart:
+ 
+    $ for i in $(seq 17); do sudo start flush-early-job-log; done
+ 
+ = Regression Potential =
+ 
+ None expected:
+ 
+ - As noted in Impact, the problems fixed by this version of Upstart have not 
been observed on server/desktop systems before.
+ - The fix is already in wily and no problems have been reported.
+ - See Impact.
+ 
+ = Original Description =
+ 
  We recently started getting reprots from phone users that their devices
  go into a reboot loop after changing the language or getting an OTA
  upgrade (either of both end with a reboot of the phone)
  
  after a bit of research we collected the log at
  http://pastebin.ubuntu.com/10872934/
  
  this shows a segfault of upstarts init binary in the log.c code:
  
  [    6.999083]init: log.c:819: Assertion failed in log_clear_unflushed: 
log->unflushed->len
  [    7.000279]init: Caught abort, core dumped
  [    7.467176]Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x00000600


** Summary changed:

- segfault in log.c code causes phone reboot loops
+ [SRU] segfault in log.c code causes phone reboot loops

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1447756

Title:
  [SRU] segfault in log.c code causes phone reboot loops

To manage notifications about this bug go to:
https://bugs.launchpad.net/canonical-devices-system-image/+bug/1447756/+subscriptions

-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1447756] Re: segfault in log.c code causes phone reboot loops

Reply via email to