[linuxkernelnewbies] taskset - retrieve or set a process’s CPU affinity
TASKSET(1)                    Linux User’s Manual                    TASKSET(1)

NAME
       taskset - retrieve or set a process’s CPU affinity

SYNOPSIS
       taskset [options] mask command [arg]...
       taskset [options] -p [mask] pid

DESCRIPTION
       taskset is used to set or retrieve the CPU affinity of a running
       process given its PID, or to launch a new COMMAND with a given CPU
       affinity. CPU affinity is a scheduler property that "bonds" a process
       to a given set of CPUs on the system. The Linux scheduler will honor
       the given CPU affinity and the process will not run on any other CPUs.
       Note that the Linux scheduler also supports natural CPU affinity: the
       scheduler attempts to keep processes on the same CPU as long as
       practical for performance reasons. Therefore, forcing a specific CPU
       affinity is useful only in certain applications.

       The CPU affinity is represented as a bitmask, with the lowest order
       bit corresponding to the first logical CPU and the highest order bit
       corresponding to the last logical CPU. Not all CPUs may exist on a
       given system, but a mask may specify more CPUs than are present. A
       retrieved mask will reflect only the bits that correspond to CPUs
       physically on the system. If an invalid mask is given (i.e., one that
       corresponds to no valid CPUs on the current system), an error is
       returned. The masks are typically given in hexadecimal. For example:

       0x0001      is processor #0
       0x0003      is processors #0 and #1
       0xffffffff  is all processors (#0 through #31)

       When taskset returns, it is guaranteed that the given program has
       been scheduled to a legal CPU.

OPTIONS
       -p, --pid
              operate on an existing PID and do not launch a new task

       -c, --cpu-list
              specify a numerical list of processors instead of a bitmask.
              The list may contain multiple items, separated by commas, and
              ranges. For example, 0,5,7,9-11.
       -h, --help
              display usage information and exit

       -V, --version
              output version information and exit

USAGE
       The default behavior is to run a new command with a given affinity
       mask:

              taskset mask command [arguments]

       You can also retrieve the CPU affinity of an existing task:

              taskset -p pid

       Or set it:

              taskset -p mask pid

PERMISSIONS
       A user must possess CAP_SYS_NICE to change the CPU affinity of a
       process. Any user can retrieve the affinity mask.

AUTHOR
       Written by Robert M. Love.

===

Description of problem:
Install Fedora 9 (sorry, I could not find the entry for an F9 bug report).
Set one task's affinity to one CPU core, then set the CPU core offline. After
that, we cannot set the CPU core online again.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Run the test.sh script: test.sh 1  (1 is the logical CPU)

test.sh script:

#!/bin/bash
main(){
    typeset -i CPU=$1
    ./task.sh > /dev/null &
    PID=$!
    if [ `cat /sys/devices/system/cpu/cpu${CPU}/online` = "0" ]; then
        echo "1" > /sys/devices/system/cpu/cpu${CPU}/online
    fi
    MASK=$((1<<${CPU}))
    taskset -p ${MASK} ${PID} > /dev/null 2>&1
    echo "0" > /sys/devices/system/cpu/cpu${CPU}/online
    echo "1" > /sys/devices/system/cpu/cpu${CPU}/online
    disown $PID
    kill -9 $PID > /dev/null 2>&1
    echo "PASS\n"
}
typeset -i TEST_CPU=$1
main $TEST_CPU

2. task.sh script as follows:

#!/bin/bash
while :
do
    NOOP=1
done

3.

Actual results:
test.sh will block when setting the CPU online
(echo "1" > /sys/devices/system/cpu/cpu${CPU}/online).

Expected results:

Additional info:
Happened on an Intel Bensley platform (2x Xeon 2.83G Harpertown C0, chipset
Blackford G1, 160 SATA)

--- Comment #1 From Bill Nottingham 2008-02-18 13:07:03 EDT ---
Does this happen on the upstream kernel as well?

--- Comment #2 From Song, Youquan 2008-02-21 03:59:48 EDT ---
Yes, kernels 2.6.24 through 2.6.25-rc2 also exhibit the bug, but the bug does
not exist in kernel 2.6.18.
--- Comment #3 From Chuck Ebbert 2008-02-25 17:33:21 EDT ---
Does the CPU mask of the running process get changed when the processor is
offlined? And can you get a system state (alt-sysrq-t) when the script hangs?

--- Comment #4 From Song, Youquan 2008-02-27 04:15:50 EDT ---
Yes, after I set the CPU offline, I used the commands "taskset -p $PID" and
"ps --pid=$PID -o psr" to find that the process's CPU mask has changed and the
process migrated to another CPU correctly. Attached is Screenshot.png.

--- Comment #5 From Song, Youquan 2008-02-27 04:18:13 EDT ---
Created an attachment (id=296037) [details]
CPU can not do hotplug
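The MASK arithmetic used in test.sh and the diagnostic commands from comment #4 can be sketched together in shell. cpu_list_to_mask is a hypothetical helper (not part of taskset) that converts a simple comma-separated CPU list into the hexadecimal affinity mask the man page describes; ranges such as 9-11 are omitted for brevity, and the PID for the diagnostics defaults to the current shell so the sketch runs as-is:

```shell
#!/bin/sh
# Hypothetical helper: convert a comma-separated CPU list (as accepted by
# `taskset -c`) into the equivalent hexadecimal affinity mask.
cpu_list_to_mask() {
    mask=0
    # Split the list on commas; each field is a single CPU number.
    for cpu in $(echo "$1" | tr ',' ' '); do
        mask=$((mask | (1 << cpu)))
    done
    printf '0x%x\n' "$mask"
}

cpu_list_to_mask 0      # -> 0x1
cpu_list_to_mask 0,1    # -> 0x3
cpu_list_to_mask 0,5,7  # -> 0xa1

# Where did a pinned task end up after its CPU went offline? (comment #4)
PID=${1:-$$}
# Affinity mask as the kernel now sees it; the kernel rewrites the mask
# when the only allowed CPU goes offline and the task is migrated.
command -v taskset >/dev/null && taskset -p "$PID"
# PSR column: the CPU the task last ran on.
ps --pid "$PID" -o pid,psr,comm
```

Reading the mask back with taskset -p after the offline step is exactly how comment #4 confirmed the kernel had migrated the pinned task.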
[linuxkernelnewbies] Direct Memory Access (DMA) and Interrupt Handling
http://www.eventhelix.com/RealtimeMantra/FaultHandling/dma_interrupt_handling.htm

DMA and Interrupt Handling

In this series on hardware basics, we have already looked at read and write bus cycles. In this article we will cover Direct Memory Access (DMA) and interrupt handling. Knowledge of DMA and interrupt handling is useful when writing code that interfaces directly with IO devices (the DMA-based serial port design pattern is a good example of such a device). We will discuss the following topics:

Direct Memory Access (DMA): A typical DMA operation is described here. Interactions between the main CPU and the DMA device are covered, as is the impact of DMA on the processor's internal cache.

Interrupt Handling: Processor handling of hardware interrupts is described in this section.

Interrupt Acknowledge Cycle: Many processors allow the interrupting hardware device to identify itself. This speeds up interrupt handling, as the processor can directly invoke the interrupt service routine for the right device.

Synchronization Requirements for DMA and Interrupts: Software designers need to keep in mind that DMA operations can be triggered at a bus cycle boundary, while interrupts can only be triggered at an instruction boundary.

Direct Memory Access (DMA)

A device wishing to perform DMA asserts the processor's bus request signal. The processor completes the current bus cycle and then asserts the bus grant signal to the device. The device then asserts the bus grant ack signal. The processor senses the change in the state of the bus grant ack signal and starts listening to the data and address bus for DMA activity. The DMA device performs the transfer from the source to the destination address. During these transfers, the processor monitors the addresses on the bus and checks whether any location modified during the DMA operations is cached in the processor.
If the processor detects a cached address on the bus, it can take one of two actions:

- The processor invalidates the internal cache entry for the address involved in the DMA write operation.
- The processor updates the internal cache entry when a DMA write is detected.

Once the DMA operations have been completed, the device releases the bus by asserting the bus release signal. The processor acknowledges the bus release and resumes its bus cycles from the point it left off.

Interrupt Handling

Here we describe interrupt handling in a scenario where the hardware does not support identifying the device that initiated the interrupt. In such cases, the possible interrupting devices need to be polled in software. A device asserts the interrupt signal at a hardwired interrupt level. The processor registers the interrupt and waits to finish the current instruction execution. Once the current instruction execution is completed, the processor initiates interrupt handling by saving the current register contents on the stack. The processor then switches to supervisor mode and initiates an interrupt acknowledge cycle. No device responds to the interrupt acknowledge cycle, so the processor fetches the vector corresponding to the interrupt level. The address found at the vector is the address of the interrupt service routine (ISR). The ISR polls all the devices to find the device that caused the interrupt. This is accomplished by checking the interrupt status registers on the devices that could have triggered the interrupt. Once the device is located, control is transferred to the handler specific to the interrupting device. After the device-specific ISR routine has performed its job, the ISR executes the "return from interrupt" instruction. Execution of the "return from interrupt" instruction results in restoring the processor state. The processor is restored back to user mode.
Interrupt Acknowledge Cycle

Here we describe interrupt handling in a scenario where the hardware does support identifying the device that initiated the interrupt. In such cases, the exact source of the interrupt can be identified at the hardware level. A device asserts the interrupt signal at a hardwired interrupt level. The processor registers the interrupt and waits to finish the current instruction execution. Once the current instruction execution is completed, the processor initiates interrupt handling by saving the current register contents on the stack. The processor then switches to supervisor mode and initiates an interrupt acknowledge cycle. The interrupting device responds to the interrupt acknowledge cycle with the vector number for the interrupt. The processor uses the vector number obtained above to fetch the vector. The address found at the vector is the address of the interrupt service routine (ISR) for the interrupting device. After the ISR routine has performed its job, the ISR executes the "return from interrupt" instruction. Execution of the "return from interrupt" instruction results in restoring the processor state.
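The per-device interrupt dispatch described above is visible from userspace on Linux; a small sketch using Linux-specific /proc paths (not part of the article itself):

```shell
#!/bin/sh
# Sketch: /proc/interrupts has one row per IRQ line, one counter column
# per CPU, plus the interrupt controller type and the handler's (device)
# name -- the mapping from vector to ISR the article describes.
head -n 5 /proc/interrupts

# Count the IRQ lines that currently have a registered handler
# (each row's label ends with a colon; the header row has none).
grep -c ':' /proc/interrupts
```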
[linuxkernelnewbies] Kernel Log: What's new in 2.6.29 - Part 3: Kernel controlled graphics modes - The H: Security news and Open source developments
http://www.h-online.com/news/Kernel-Log-What-s-new-in-2-6-29-Part-3-Kernel-controlled-graphics-modes--/112431 Kernel Log: What's new in 2.6.29 - Part 3: Kernel controlled graphics modes With the release of 2.6.29-rc1 last weekend, Linus Torvalds concluded the first phase, called the merge window, of the development cycle. This phase allows for incorporating the substantial changes intended for the next kernel version into the source code management system of the Linux kernel. As a result, 2.6.29 is now in the second, stabilising phase, which usually takes eight to ten weeks and gives the kernel developers the opportunity to correct mistakes and make minor changes that are unlikely to cause further flaws. As major changes are only rarely discarded during the stabilising phase, the kernel log can already discuss the most important changes expected for 2.6.29 in the "What's new in 2.6.29" series. Kernel-based mode setting Almost 21 months after its first major announcement, the support for kernel-based mode setting (KMS) for recent Intel graphics hardware has been integrated into the main development branch of Linux (for example 1, 2, 3). This technology gives the kernel noticeably more control over the graphics hardware. When KMS is active, the kernel sets the graphics mode suitable for a monitor as soon as all the required hardware components (ACPI, PCI, graphics hardware etc.) have been initialised. From a user's perspective, this approach is initially no different from framebuffer graphics with suitable drivers. However, in contrast to framebuffer graphics, the kernel also sets the screen resolution during operation, taking over this, and other tasks, from the X server. 
If the X server and a text console, managed with KMS, use the same screen resolution, the kernel no longer needs to reset the graphics chip and screen resolution when switching between the graphics interface and the console; this was previously required every time the user switched to X and VGA text or framebuffer consoles, because the kernel didn't know the X Server's configuration of the graphics chip. As a result, switching with KMS – for example while booting, when the X server first starts up – is considerably faster and is no longer afflicted by screen flickering or short display disruptions. Because the kernel controls the graphics hardware in KMS, problems that arise when the VGA console and framebuffer driver, the Direct Rendering Manager (DRM) and various userspace programs, including the X server, compete for access to the graphics hardware, can be eliminated. With KMS, when waking up from suspend mode, the kernel also handles the entire graphics hardware re-initialisation, which is designed to solve some of the problems with using the suspend modes. With KMS, X servers will reportedly also operate without root privileges; this and several other improvements associated with KMS are to facilitate the parallel operation of several X servers, allowing users to switch backwards and forwards (fast user switching). KMS will also allow Linux to snatch control from the X server in case of a serious kernel problem (kernel panic) and display troubleshooting instructions similar to those displayed for the dreaded blue screen in Windows – some developers have talked about a "Blue Penguin Of Death", but this isn't possible with the code incorporated in 2.6.29. To avoid hardware access disagreements between the X server and the kernel, the X server and its graphics driver must also support KMS. 
However, X and kernel hacker Dave Airlie, who is responsible for the kernel's DRM code, explicitly says in his patch integration request that these parts are still being developed and currently are only intended for developers; therefore, KMS should not be enabled during kernel configuration without the required userspace support. It is likely to be some time until the kernel is ready for KMS with Radeon hardware: although the KMS code for Radeon GPUs is already available, it is based on the TTM memory manager rather than the more recent Graphics Execution Manager (GEM) incorporated in 2.6.28, which so far is geared to work with Intel hardware. However, according to Dave Airlie, the TTM code is not mature enough to be integrated into the official kernel yet. It will probably be even longer until KMS becomes available with a standard kernel and Nvidia hardware, unless the developers of the Nouveau driver, which was created using reverse engineering, can pull some mature KMS code out of their hats, or Nvidia decides to provide KMS support. The latter would be particularly likely to improve the reliability of the suspend modes, which often malfunction with the open source drivers for Nvidia hardware.

More graphics

The Graphics Execution Manager (GEM), which is still geared to work with Intel hardware and manages the main memory as well as access to the GPU's processing units, has been extended to include new features in 2.6.29 (1, 2). Several further
[linuxkernelnewbies] Kernel Log: main development phase for 2.6.29 ends, new X.org drivers - The H: Security news and Open source developments
http://www.h-online.com/news/Kernel-Log-main-development-phase-for-2-6-29-ends-new-X-org-drivers--/112399 Kernel Log: main development phase for 2.6.29 ends, new X.org drivers With the release of 2.6.29-rc1 on Saturday night, Linus Torvalds has closed the 2.6.29 merge window and brought to a close the development phase, during which the major new features for the next version of Linux are adopted. All significant changes in 2.6.29 should now be in the Linux source code management system, including new features previously discussed on heise open such as WiMAX, access point support and the Btrfs and Squashfs file systems. These changes are just some of the more conspicuous changes adopted by the kernel hackers for 2.6.29. Support has been added for kernel-based mode setting on Intel graphics hardware and improvements have been made to the Graphics Execution Manager (GEM), which was integrated with 2.6.28. The SCSI subsystem now supports Fibre Channel over Ethernet (FCoE) and there are fixes to, and new functions in, the eCryptfs, Ext4, OCFS2 and XFS file systems. There are also numerous new and revised drivers, including new or revised audio drivers from the Alsa project and over 600 changes to the V4L/DVB drivers. These are now joined by various, in some cases very large, staging drivers, such as the Comedi framework, or support for Google's Android. heise open's Kernel Log will carry detailed reports on these and other changes over the next few weeks as part of our "What's coming in 2.6.29" series. The realtime defragmenter (online ext4 defragmentation) has not made it into 2.6.29 – Theodore Tso explains why on LKML. Also left out, for the time being, are support for operation as a primary Xen domain (Dom0) and compression of the kernel image with bzip2/lzma. It looks like it could also be a while before support for kernel-based mode setting with AMD hardware meets the kernel development team's quality standards. 
All about X.org AMD developer Alex Deucher has released version 6.10 of the xf86-video-ati driver package, usually known simply as ati or radeon. It includes support for the RV710 (Radeon HD 4300/HD 4500) and RV730 (Radeon HD 4600) Radeon chips. The new version also reduces tearing during video playback and supports Bicubic Xv scaling on r3xx/r4xx/r5xx/rs690 Radeon chips. The developer discusses further changes on his blog. Matthias Hopf has now released the AtomBIOS disassembler previously used for programming the alternative Radeon graphics driver radeonhd. He describes some of the background to the tool on his blog. The X.org developers have also released version 1.4.0 of the xf86-input-mouse mouse driver. This driver deals with many of the tasks previously dealt with by X server, and the code responsible for this has been removed from X server – with the result that in X server 1.6, currently under development, users will, unless their systems use Evdev, need at least version 1.4.0 of xf86-input-mouse. In Brief: Following LWN.net's occasional publication of analysis of which kernel developers have, for instance, introduced the most or the largest changes into a kernel version (e.g. 1, 2, 3), Wang Chen has been trying his hand at a similar set of online statistics. SELinux hacker James Morris has announced the creation of the Kernel Security Wiki on his blog, where he has also recently summarised all the most significant security-related changes in Linux 2.6.28. As part of the discussion on the adoption of Squashfs, Greg Kroah-Hartman has declared that he will in future accept file systems into the staging directory, as long as they do not require changes in other parts of the kernel. Daniel Phillips is continuing to work on Tux3 and is keeping the developer community updated on new features or internal matters in his "Tux3 Report" – a recent e-mail to LKML, for example, elucidates the current structure of the file system. 
The kernel development team are planning to hold a "Linux Storage and Filesystem Summit 2009" in San Francisco in early April. A group of developers are working on open source firmware for some of the Broadcom WLAN chips supported under Linux by the b43 driver; this firmware does not, however, appear to work for all testers. Marvell has made WLAN firmware for the GSPI-88W8686 available to download, but has not released the source code. As reported elsewhere, Nvidia has released version 180.22 of its proprietary graphics driver for x86-32 and x86-64 Linux. Further background and information about developments in the Linux kernel and its environment can also be found in previous issues of the kernel log at heise open: Kernel Log: What's new in 2.6.29 - Part 2: WiMax Kernel Log: What's new in 2.6.29 - Part 1: Dodgy Wifi drivers and AP support Kernel Log: 2.6.29 development kicks off, improved 3D support Kernel Log: Higher and Further, The innovations of Linux 2.6.28 Kernel Log: What's coming in 2.6.28 - Part 9: Fastboot and other
[linuxkernelnewbies] Kernel Log: What's new in 2.6.29 - Part 2: WiMAX - The H: Security news and Open source developments
http://www.h-online.com/news/Kernel-Log-What-s-new-in-2-6-29-Part-2-WiMAX--/112393 Kernel Log: What's new in 2.6.29 - Part 2: WiMAX In Part 2 of the Kernel Log's coverage of the major changes happening in the main development branch for the Linux kernel 2.6.29 release, we look at a major new addition to Linux's networking capability, WiMAX support. USB sub-system maintainer Greg Kroah-Hartman has brought the WiMAX stack, developed primarily by Intel developers in the framework of the Linux WiMAX project, into the Linux main development branch. The stack gives Linux 2.6.29 a basic infrastructure for WiMAX wireless broadband networking technology based on the i2400m USB driver, which was also developed by the WiMAX project and concurrently integrated into the kernel. The WiMAX stack communicates with the WiMAX Connection 2400 chip in Intel Wireless WiMAX/WiFi Link 5150 and 5350 (codename: Echo Peak) WLAN/WiMAX modules, found mainly in newer Centrino notebooks. As the change log in the ultimately successful e-mail request for integration shows, Linux WiMAX developers made a number of attempts before the network and USB sub-system administrators were satisfied with the code and gave it the green light for integration into the kernel. Numerous details and background information on the Linux kernel's new WiMAX infrastructure can be found in the e-mail mentioned above, by following the links at the end of this article to commits in the source code administration system, and on the Linux WiMAX website. Also, on the website you can download the i2400m firmware and the corresponding userspace software. However, the Intel WiMAX binary supplicant needed for authentication with the remote host, as well as the Intel WiMAX binary OMADM client are only available online as a pre-compiled archive (license, FAQ). 
Therefore, distributions based solely on open source software, such as Debian, Fedora and OpenSuse, will not yet include these parts of the userspace stack in their core distributions. However, in the e-mail mentioned above, Intel developers do say "For networks that require authentication (most), the Intel device requires a supplicant in user space – because of a set of issues we are working to resolve, it cannot be made open source yet, but it will". See – Part 1 of Whats new in 2.6.29. The WiMAX Changes in detail i2400m: debugfs controls i2400m: documentation and instructions for usage i2400m: firmware loading and bootrom initialization i2400m: Generic probe/disconnect, reset and message passing i2400m: host/device procotol and core driver definitions i2400m: linkage to the networking stack i2400m: Makefile and Kconfig i2400m: RX and TX data/control paths i2400m/SDIO: firmware upload backend i2400m/SDIO: header for the SDIO subdriver i2400m/SDIO: probe/disconnect, dev init/shutdown and reset backends i2400m/SDIO: TX and RX path backends i2400m/USB: firmware upload backend i2400m/USB: header for the USB bus driver i2400m/USB: probe/disconnect, dev init/shutdown and reset backends i2400m/USB: TX and RX path backends i2400m/usb: wrap USB power saving in #ifdef CONFIG_PM i2400m: various functions for device management wimax: basic API: kernel/user messaging, rfkill and reset wimax: debugfs controls wimax: debug macros and debug settings for the WiMAX stack wimax: documentation for the stack wimax: export linux/wimax.h and linux/wimax/i2400m.h with headers_install wimax: fix kconfig interactions with rfkill and input layers wimax: generic device management (registration, deregistration, lookup) wimax: headers for kernel API and user space interaction wimax/i2400m: add CREDITS and MAINTAINERS entries wimax: internal API for the kernel space WiMAX stack wimax: Makefile, Kconfig and docbook linkage for the stack Further background and information about developments in 
the Linux kernel and its environment can also be found in previous issues of the kernel log at heise open: Kernel Log: What's new in 2.6.29 - Part 1: Dodgy Wifi drivers and AP support Kernel Log: 2.6.29 development kicks off, improved 3D support Kernel Log: Higher and Further, The innovations of Linux 2.6.28 Kernel Log: What's coming in 2.6.28 - Part 9: Fastboot and other remainders Kernel Log: What's coming in 2.6.28 - Part 7: architecture support, memory subsystem and virtualisation Kernel Log: What's coming in 2.6.28 - Part 6: Changes to the audio drivers Older Kernel logs can be found in the archives or by using the search function at heise open. (thl/c't)
[linuxkernelnewbies] Kernel Log: What's new in 2.6.29 - Part 1: Dodgy Wifi drivers and AP support - The H: Security news and Open source developments
http://www.h-online.com/news/Kernel-Log-What-s-new-in-2-6-29-Part-1-Dodgy-Wifi-drivers-and-AP-support--/112392

Kernel Log: What's new in 2.6.29 - Part 1: Dodgy Wifi drivers and AP support

See Part 2 of "What's new in 2.6.29".

Scarcely two weeks after the release of Linux 2.6.28, Linus Torvalds has integrated comprehensive changes for kernel version 2.6.29 into the main development branch. As of Friday morning, he had added a whopping 7550 patches that changed 8388 files and included more than 1,061,513 new, changed, or moved lines of code. Over the weekend, the merge window closed and the second phase of the development cycle, which usually lasts some eight to ten weeks, started with the release of 2.6.29-rc1. In the second phase, only corrections or small changes that do not threaten the code will find their way into the main development branch. A significant part of the changes integrated up to now includes a long list of improvements in the kernel's network support features. Since the network support features are the most significant additions to the kernel, a round-up of these changes kicks off this "What's new in 2.6.29" series. Should any other noteworthy patches for the network subsystem find their way into the main development branch in the coming weeks, we will sum them up in the final instalment of this series – shortly before Torvalds releases 2.6.29. On the occasion of the release, a comprehensive Kernel Log will again sum up the most important changes reported in the course of the "What's new in 2.6.29" series.

Dodgy network drivers

Following the integration in 2.6.28 of Greg Kroah-Hartman's staging kernel branch into the Linux main development branch, the self-styled "maintainers of crap" have added numerous additional drivers to the kernel's staging area that do not meet the kernel developers' quality standards.
Among the offenders were the rt2860 and rt2870 Wifi drivers for the Ralink Wifi chips found in some of the new netbooks and low-end notebooks. Other new entries in the staging area are the otus driver, released in October, for the Atheros UB81, UB82 and UB83 WLAN chips, as well as the agnx and rtl8187se drivers for the Airgo AGNX00 WLAN chip and the Realtek RTL8187SE WLAN chip. There was also a comprehensive scrub and restructuring done on the code for the wlan-ng framework included in 2.6.28. Developers swapped the at76_usb staging driver, also included in 2.6.28, for a variant based on the mac80211 Linux WLAN stack – actually, the author of the driver had another solution in mind, so it would be no surprise if additional changes are made to the patch, or if it is withdrawn entirely. Also, the benet network driver for ServerEngines' BladeEngine (EC 3210) 10Gb network adapter is a new addition to the staging area. Whether users of the mainstream distributions and their kernels will see tangible advantages from the inclusion of all these staging drivers depends on which distribution they are using. Administrators of the distributions' kernels activate only some of the staging drivers, or only partially activate them, since they do not meet the normal quality standards of the kernel developers. The drivers' failure to meet these standards is also the reason the kernel is marked with a "TAINT_CRAP" flag when it is loading them. This makes it clear in users' error reports that a "crappy" driver has "besmirched" the kernel and may have been responsible, or partly responsible, for problems. However, in the absence of other drivers, users who simply want to use their hardware may not give a hoot about the driver, as long as it does not cause any serious problems. Network manager/developer Dan Williams made it known in a recent Fedora list post that he does not think much of the staging drivers (1, 2). 
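The TAINT_CRAP flag mentioned above can be checked from userspace; in this sketch the bit position (bit 10, i.e. the value 1024, shown as the flag 'C' in oops output) is the one the kernel assigns to staging drivers:

```shell
#!/bin/sh
# Sketch: /proc/sys/kernel/tainted is a bitmask of taint reasons; bit 10
# (value 1024) is TAINT_CRAP, set when a staging driver is loaded.
TAINTED=$(cat /proc/sys/kernel/tainted)

if [ $(( TAINTED & 1024 )) -ne 0 ]; then
    echo "a staging driver has tainted this kernel"
else
    echo "no staging-driver taint (taint value: $TAINTED)"
fi
```

This is the value users would quote in the error reports the article mentions, so maintainers can tell at a glance whether a "crappy" driver was loaded.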
He said that he would ignore bugs involving staging drivers, "Basically, I'm going to ignore any issues that come in from these drivers because they aren't accepted upstream wireless drivers, despite what gregkh (who's not a wireless developer) tries to make them." More than a thousand other changes Network subsystem administrator David S. Miller did not leave it to Greg Kroah-Hartman alone to submit all of the network updates; he collected more than a thousand network-related patches himself and sent them to Torvalds. (1, 2, 3). New and removed WLAN drivers, AP mode Support for operation as an access point (AP), which has been in the kernel's Wifi stack for some time, albeit deactivated, has now been activated (documentation, support in nl80211). However, the kernel does not handle the actual AP administration functions itself, but rather leaves them to the current versions of hostapd. The WLAN drivers have to support AP mode as well, although this is not the case with the kernel's drivers for the Intel WLAN modules found in Centrino notebooks and others. Developers are expanding the ath5k and p54 WLAN drivers, to support AP mode (1, 2). The kernel hackers have extended th
[linuxkernelnewbies] Kernel Log: Morton questions acceptance of Xen Dom0 code; file systems for SSDs - News - The H Open Source: News and Features
http://www.h-online.com/open/Kernel-Log-Morton-questions-acceptance-of-Xen-Dom0-code-file-systems-for-SSDs--/news/112784 Kernel Log: Morton questions acceptance of Xen Dom0 code; file systems for SSDs In his response to the invitation on the Linux Kernel Mailing list (LKML) for comments on the recently submitted Xen Dom0 patches, Andrew Morton asks whether accepting these kernel extensions into the main Linux development tree to operate as the leading Xen domain (Dom0) still makes sense. He has suggested that Xen may be the "old" way to achieve virtualisation, whereas the world is moving in a "new" direction, towards KVM. He also suggests that Linux developers could regret accepting Xen Dom0 support in three years' time ("I hate to be the one to say it, but we should sit down and work out whether it is justifiable to merge any of this into Linux. I think it's still the case that the Xen technology is the "old" way and that the world is moving off in the "new" direction, KVM? In three years' time, will we regret having merged this? "). This has prompted a debate on the pros and cons, and the relative advantages and drawbacks of Xen and KVM. Jeremy Fitzhardinge, a long-standing Xen developer who sent Xen Dom0 patches developed by him and others to the LKML, campaigned strongly for Xen, but was to some extent rebuffed by other well known Linux developers, including Nick Piggin and Ingo Molnar. As one of the managers of the kernel code for supporting the x86 architecture, Molnar could have an important say in the decision whether to accept Xen support. A decision has probably not been made yet, but in spite of the discussion stimulated by Morton and the objections of other kernel hackers, it's perfectly possible that the next-but-one Linux version (2.6.30) will incorporate Xen Dom0 code, based on these patches. 
How the situation arose In any case, it's difficult to make any predictions for the development model of the Linux kernel, because many developers, not least Linus Torvalds, can considerably speed up or hold back the acceptance of patches. Originally, Morton prophesied four years ago that acceptance of Xen support into the Linux kernel was imminent. At that time, however, kernel developers were already dissatisfied with some aspects of integrating it into the kernel sources, and asked for changes before it could be accepted. While Xen developers were working on this, other Linux-specific virtualisation solutions appeared, such as KVM (Kernel-based Virtual Machine) and Lguest (originally called Lhype). The kernel developers are consequently pressing for an interface that lets the Linux kernel work as efficiently as possible as a paravirtualised guest under all of these and other virtualisation solutions, without large quantities of special code having to be included in the kernel for each hypervisor. The paravirt_ops abstraction layer then emerged, largely under the leadership of the Lguest developer, and found its way into the main Linux development tree with Linux 2.6.20. That same version also saw the developers accept the KVM virtualisation framework into Linux. Though only a few months old at the time, it fitted into the kernel much better than Xen support and, in the opinion of many kernel hackers, was clearly the technically more elegant solution, since it used the kernel itself as hypervisor and thus had recourse to the infrastructure of the kernel (scheduler, memory management, drivers), while the Xen hypervisor is positioned upstream of the Linux kernel. On the other hand, KVM requires CPUs with virtualisation functions, like the AMD-V and Intel VT. 
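The hardware requirement just described is easy to check from userspace; a minimal Linux-specific sketch (the flag names come from /proc/cpuinfo: vmx is Intel VT-x, svm is AMD-V):

```shell
#!/bin/sh
# Sketch: does this CPU expose the hardware virtualisation extensions
# that KVM requires? (vmx = Intel VT-x, svm = AMD-V)
if grep -Eq 'vmx|svm' /proc/cpuinfo; then
    echo "hardware virtualisation available: KVM can run guests here"
else
    echo "no VT-x/AMD-V: KVM unavailable; Xen could still paravirtualise"
fi
```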
Xen can also use these functions to virtualise unmodified guest systems, but if the CPU doesn't provide them, an operating system adapted to Xen can alternatively run as a guest under the Xen hypervisor using paravirtualisation. Fitzhardinge now cites this difference as one of the advantages of Xen, though all recent x86 server processors and many desktop and notebook CPUs provide virtualisation functions. Second attempt While KVM underwent constant, rapid development as part of normal work on the kernel, acquiring functions like migration and PCI device pass-through for guests, the Xen developers were slow to move ahead with integrating Xen into the Linux kernel. Instead, they paid a lot of attention to further development of the Xen code that is also used in commercial Xen products. That code sits on top of Linux kernel 2.6.18 and doesn't satisfy the quality requirements of the kernel developers. The 2.6.18 kernel, however, lacks many drivers for more recent PC components, so distribution developers have been porting this Xen code to later kernels. This was and still is extremely laborious and, in practice, the result only works after a fashion. This is probably one of the reasons that motivated Red Hat to buy Qumranet, a company specialising in KVM, and subsequently (according to recently divulged plans
[linuxkernelnewbies] pNFS and the Future of File Systems
http://www.enterprisestorageforum.com/sans/features/article.php/3793301 pNFS and the Future of File Systems December 24, 2008 By Drew Robb High-performance file systems such as Panasas PanFS, Sun QFS, Quantum StorNext, IBM GPFS and HP File Services can add plenty of value to storage implementations (see Choosing the Right High-Performance File System). Take the case of DigitalFilm Tree, a company based in Hollywood that provides post-production and visual effects (VFX) services for the entertainment industry. It recently had to ramp up its operations to deal with VFX for Showtime's "Weeds," CW's "Everybody Hates Chris," NBC's "Scrubs," a new TV pilot episode, and work on the Jet Li movie "The Forbidden Kingdom." The company harnesses a storage environment that includes Apple (NASDAQ: AAPL) Xsan, HP (NYSE: HPQ) StorageWorks arrays, QLogic (NASDAQ: QLGC) switches and gear from several other storage vendors. It is also a mixed OS environment, with the workflow having to deal with users on Macs and PCs. "The velocity of our work on the TV shows demands a non-linear workflow and the management of well over 100 TB of data," said Ramy Katrib, founder and CEO of DigitalFilm Tree. "StorNext enabled us to greatly expand our delivery without having to double our staff." But with the ongoing updates to file system protocols like NFS, including parallel NFS (pNFS), is there a possibility that NFS could eventually supplant the many proprietary file systems out there? Let's first take a look at another couple of high-performance offerings from Sun and NetApp (NASDAQ: NTAP) before taking out our crystal ball to see what the future holds. Sun Lustre Sun Microsystems (NASDAQ: JAVA) characterizes Lustre as "the most scalable parallel file system in the world." In evidence of this, it serves six of the top 10 supercomputers and 40 percent of the top 100. 
"We have Lustre file systems that scale to petabytes of data in one cohesive name space and deliver in excess of 100 GB/s aggregate performance to 25,000 clients or more," said Peter Bojanic, director of Sun's Lustre Group. "This includes HPC applications at Livermore, Oak Ridge and Sandia National Laboratories, where large-file I/O and sustained high bandwidth are essential." Adoption is also growing in oil and gas, rich media and content distribution networks, which all require mixed workloads with large and small files. One of Lustre's differentiators is that it is available as open source software based on Linux. That's why you find it integrated with storage products from other HPC vendors, including SGI (NASDAQ: SGIC), Dell (NASDAQ: DELL), HP, Cray (NASDAQ: CRAY) and Terascala. Lustre is an object-based cluster file system, but it is not T10 OSD-compliant, and the underlying storage allocation management is block-based. It requires the presence of a Lustre MetaData Server and Lustre Object Storage Servers. File operations bypass the MetaData Server, utilizing parallel data paths to Object Servers in the cluster. Servers are organized in failover pairs. It runs on a variety of networks, including IP and InfiniBand. NetApp WAFL NetApp has a file system called WAFL (Write Anywhere File Layout), which consolidates CIFS, NFS, HTTP, FTP, Fibre Channel and iSCSI and works in conjunction with NetApp's Data ONTAP operating system. WAFL is integrated with RAID-DP, NetApp's high-performance version of RAID-6, so it can survive the loss of one or two disk drives. Non-volatile memory (NVRAM) is added to improve speed by allowing a storage access protocol target to respond to modification requests before writing to disk. Through WAFL, requests are logged to NVRAM and file system modifications are saved in volatile memory. 
After several modifications have accumulated in volatile memory, WAFL gathers the results into what NetApp terms a "consistency point" (basically a snapshot) and writes the consistency point to the RAID group assigned to the file system. "If the consistency point is not written to disk before hardware or software failure, then once Data ONTAP reboots, the contents of the NVRAM log are replayed to the WAFL, and the consistency point is written to disk," said Michael Eisler, senior technical director of NFS at NetApp. "Most of NetApp's competitors have snapshots, but NetApp has used its underlying snapshot technology to build features like file system level mirroring, backup integration, cloning, de-duplication, data retention, striping across network storage devices, and flexible volumes." Flexible volumes (also called FlexVols) are volumes that can share a single pool (or aggregate) of storage with other flexible volumes. These volumes can be grown or contracted as needed — freed up space is returned to the storage pool to be used by other FlexVols. The Future of File Systems Not everyone needs high performance, of course. There are the more common file system protocols such as NFS and CIFS, as well as Sun's open-source ZFS file system that runs on
[linuxkernelnewbies] FAQ - VNUML-WIKI
http://www.dit.upm.es/vnumlwiki/index.php/FAQ VNUML Frequently Asked Questions Authors: David Fernández (david at dit.upm.es) Fermín Galán (galan at dit.upm.es) version 1.7, June 4th, 2004 Contents 1 Writing the VNUML specification 2 Limitations 3 Linux Kernels for VNUML 4 About root filesystems 5 Starting the simulation (-t option) 6 VNUML over different Linux distributions Writing the VNUML specification How can I check if my VNUML XML specification is correct? Whenever the vnuml tool is executed, the specification is checked, and you will get error messages if the specification is not correct. Alternatively, you can check your specification using the xmlwf command that comes with the expat distribution (needed to run VNUML). The xmllint command (which comes with the libxml package) can also be used for the same task. Limitations What is the maximum number of virtual networks? There are two hard limits on the number of simultaneous virtual networks (i.e., how many vnumlparser.pl can manage): 64 networks maximum, if using a host kernel version < 2.6.5; 32 networks maximum, if using a bridge-utils version < 0.9.7. So, if you want to use as many virtual networks as your physical host can cope with, use at least bridge-utils 0.9.7 (available as a tarball at http://sourceforge.net/projects/bridge/ at the time of this writing) and Linux kernel 2.6.5. Linux Kernels for VNUML How can I know which kernel options were used when compiling a UML Linux kernel? Just execute the kernel with the "--showconfig" option. For example, to find out whether a UML Linux kernel has IPv6 support, type: > linux --showconfig | grep IPV6 About root filesystems I have changed the filesystem used by a virtual machine, but when I start the simulation it seems to use the old one. If you are using "COW" filesystems as recommended, you have to delete the old cow file before starting the simulation with the new filesystem. The reason is that cow files save a reference to the root filesystem they are derived from. 
To delete a cow file you can use the "purge" option ("-P") of vnumlparser.pl. I am using the "root_fs_tutorial" root filesystem and I see that the Apache web server (or any other service) is not automatically started when the virtual machine boots. Why? How can I make it start from the beginning? Most services are not started at boot in order to speed up the virtual machines' boot process during scenario start-up (-t option). It is recommended to start the services you need using "" commands inside your VNUML specification. For example, to start apache2, you can include the following command: /etc/init.d/apache2 start Alternatively, you can use the "update-rc.d" command to restore the scripts that start apache2 during the boot process. Just start the root filesystem in direct mode as described in the update root filesystem example, log into the virtual machine through the console or using ssh, and type the following command: update-rc.d apache2 defaults Starting the simulation (-t option) When I build a scenario, I get the following message when booting each virtual machine: Checking for the skas3 patch in the host: - /proc/mm...not found - PTRACE_FAULTINFO...not found - PTRACE_LDT...not found UML running in SKAS0 mode Then the process stops, apparently hanging, but if I press CTRL+C it continues and, finally, the scenario is set up properly. Can this be avoided? This is a known problem that happens with some combinations of guest UML kernels and host kernels. If you are using a modern UML guest kernel (like 2.6.21.5) the problem does not usually occur; otherwise you can try some of the following: This problem usually happens during VNUMLization, so if you avoid it using the -Z switch, it won't happen. However, older VNUML versions do not implement -Z, and, in any case, avoiding VNUMLization could be problematic if you are not using an official root filesystem. See the user manual for more information on VNUMLization. 
This problem seems related to the configuration of the host kernel. In particular, if you are using a host kernel version prior to 2.6.20.2, the problem may happen if CONFIG_COMPAT_VDSO=y. If you are using CONFIG_COMPAT_VDSO=n, the problem won't occur. It seems that when using host kernel 2.6.20.2 or newer, the problem does not happen at all, even if you are using CONFIG_COMPAT_VDSO=y (from the 2.6.20.2 changelog: "Fix broken CONFIG_COMPAT_VDSO on i386"). The recommended solution is the third one. As evidence, I'm using a 2.6.21 kernel with CONFIG_COMPAT_VDSO=y and the hang does not occur. However, further confirmation by other users would be helpful :) I'm trying to build the simple.xml example that comes with the VNUML software, but I'm getting the following error: Checking for the skas3 patch in the host: - /proc/mm...not
[linuxkernelnewbies] The world won’t listen » User-mode Linux and skas0
http://blogs.igalia.com/berto/2006/09/13/user-mode-linux-and-skas0/ User-mode Linux and skas0 User-mode Linux (UML) is a port of Linux to its own system call interface. In short, it is a system that allows you to run Linux inside Linux. UML is integrated in the standard Linux tree, so it is possible to compile a UML kernel from any recent kernel sources (using ‘make ARCH=um’). Traditionally, UML had a working mode which was both slow and insecure, as each process inside the UML had write access to the kernel data. This mode is known as Tracing Thread (tt) mode. A new mode was added in order to solve those issues. It was called skas (for Separate Kernel Address Space). Now the UML kernel was totally inaccessible to UML processes, resulting in a far more secure environment. In skas mode the system ran noticeably faster too. To enable skas mode the host kernel had to be patched. As of September 2006, the latest version of the patch is called skas3. The patch is small but hasn't been merged into the standard Linux tree. The official UML site has a page about skas mode that explains all these issues more thoroughly. However, in July 2005 a new mode called skas0 was added to UML in Linux 2.6.13 (which, for some reason, isn't explained in the above page). This new mode is very close to skas3: it provides the same security model and most of its speed gains. The main difference is that you don't need to patch the host kernel, so you can use a skas-enabled UML on your Linux system without having to touch the host kernel. The patch is explained in the 2.6.13 changelog or in this article. 
A skas0-enabled kernel boots like this:

Checking that ptrace can change system call numbers...OK
Checking syscall emulation patch for ptrace...OK
Checking advanced syscall emulation patch for ptrace...OK
Checking for tmpfs mount on /dev/shm...OK
Checking PROT_EXEC mmap in /dev/shm/...OK
Checking for the skas3 patch in the host:
- /proc/mm...not found
- PTRACE_FAULTINFO...not found
- PTRACE_LDT...not found
UML running in SKAS0 mode
...

Posted: September 13th, 2006 under Planet Igalia, English, Software, Planet GPUL.
[linuxkernelnewbies] C/C++ Thread Safety Annotations
http://docs.google.com/Doc?id=ddqtfwhb_0c49t6zgr C/C++ Thread Safety Annotations Le-Chun Wu. Modified: June 9, 2008 Objective This project creates a set of C/C++ program annotations that (1) allow developers to document multi-threaded code so that maintainers can avoid introducing thread safety bugs, and (2) help program analysis tools identify potential thread safety issues. We add a new GCC analysis pass that uses the source annotations to identify thread safety issues and emit compiler warnings. Background Multi-threading is an increasingly important technique to boost performance on multi-core/multiprocessor systems. Unfortunately, multi-threaded programming is hard: timing-dependent bugs, such as data races and deadlocks, are very difficult to expose in testing and hard to reproduce and isolate once discovered. Proper documentation of synchronization policies and thread safety guarantees is probably one of the most useful techniques for managing multi-threaded code and avoiding concurrency bugs. In practice, programmers' intended synchronization policies, such as lock acquisition order and lock requirements for shared variables and functions, are often documented in comments. Comments help maintainers avoid introducing errors, but it is hard for program analysis tools to use this information to tell programmers when they have violated their synchronization policies or to identify potential thread safety issues. Therefore this project creates program annotations for C/C++ to help developers document locks and how they need to be used to safely read and write shared variables. We design and implement a new GCC pass that uses the annotations to identify and warn about the issues that could potentially result in race conditions and deadlocks. Overview There are many styles of synchronization in multi-threaded programming. The annotations used here focus only on mutex lock-based synchronization. The annotations are implemented in GCC's "attribute" language extension. 
The following is a list of C macro definitions using the proposed new attributes. We define these macros here to simplify the examples and discussions in this document. It is also common practice to use the macros instead of the raw GCC attributes for code portability and compatibility.

#define GUARDED_BY(x) __attribute__ ((guarded_by(x)))
#define GUARDED_VAR __attribute__ ((guarded))
#define PT_GUARDED_BY(x) __attribute__ ((point_to_guarded_by(x)))
#define PT_GUARDED_VAR __attribute__ ((point_to_guarded))
#define ACQUIRED_AFTER(...) __attribute__ ((acquired_after(__VA_ARGS__)))
#define ACQUIRED_BEFORE(...) __attribute__ ((acquired_before(__VA_ARGS__)))
#define LOCKABLE __attribute__ ((lockable))
#define SCOPED_LOCKABLE __attribute__ ((scoped_lockable))
#define EXCLUSIVE_LOCK_FUNCTION(...) __attribute__ ((exclusive_lock(__VA_ARGS__)))
#define SHARED_LOCK_FUNCTION(...) __attribute__ ((shared_lock(__VA_ARGS__)))
#define EXCLUSIVE_TRYLOCK_FUNCTION(...) __attribute__ ((exclusive_trylock(__VA_ARGS__)))
#define SHARED_TRYLOCK_FUNCTION(...) __attribute__ ((shared_trylock(__VA_ARGS__)))
#define UNLOCK_FUNCTION(...) __attribute__ ((unlock(__VA_ARGS__)))
#define LOCK_RETURNED(x) __attribute__ ((lock_returned(x)))
#define LOCKS_EXCLUDED(...) __attribute__ ((locks_excluded(__VA_ARGS__)))
#define EXCLUSIVE_LOCKS_REQUIRED(...) __attribute__ ((exclusive_locks_required(__VA_ARGS__)))
#define SHARED_LOCKS_REQUIRED(...) __attribute__ ((shared_locks_required(__VA_ARGS__)))
#define NO_THREAD_SAFETY_ANALYSIS __attribute__ ((no_thread_safety_analysis))

Note that the annotations proposed here are not expressive enough to handle fine-grained locking relationships between locks and the guarded variables, e.g. when each individual element (or a group of elements) of a linked list/hash table is guarded by a different lock. While most of the proposed annotations are designed for documenting synchronization policies, some are simply created to help program analysis tools. 
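As a rough illustration of how such macros might be applied, here is a small sketch. The Counter class and its members are hypothetical, not from this document, and the conditional fallback (expanding the macros to nothing on compilers that lack these attributes) is added so the snippet builds anywhere; producing actual warnings of course requires a compiler with an analysis pass like the one described here:

```cpp
#include <cassert>
#include <mutex>

// Hypothetical portability shim: Clang's thread-safety analysis happens to
// use the same attribute names, so use the real attributes there and expand
// to nothing elsewhere, keeping the code compilable on any compiler.
#if defined(__clang__)
#define GUARDED_BY(x) __attribute__ ((guarded_by(x)))
#define EXCLUSIVE_LOCKS_REQUIRED(...) \
  __attribute__ ((exclusive_locks_required(__VA_ARGS__)))
#else
#define GUARDED_BY(x)
#define EXCLUSIVE_LOCKS_REQUIRED(...)
#endif

// A shared counter whose value_ field is documented as guarded by mu_.
// An analysis pass can then warn whenever value_ is accessed, or
// IncrementLocked() is called, without mu_ held.
class Counter {
 public:
  void Increment() {
    std::lock_guard<std::mutex> lock(mu_);
    IncrementLocked();  // mu_ is held here, satisfying the annotation
  }
  int Value() {
    std::lock_guard<std::mutex> lock(mu_);
    return value_;
  }

 private:
  // Callers must hold mu_ before calling this helper.
  void IncrementLocked() EXCLUSIVE_LOCKS_REQUIRED(mu_) { ++value_; }

  std::mutex mu_;
  int value_ GUARDED_BY(mu_) = 0;
};
```

At runtime the annotations have no effect; they only document the policy for maintainers and feed the static analysis.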
A detailed explanation of the annotations and their usage is given in the next section. Detailed Design Variable Annotations The following annotations are used to specify synchronization policies, such as which variables are guarded by which locks and the acquisition order of locks. GUARDED_BY(lock) and GUARDED_VAR These two annotations document a shared variable/field that needs to be protected by a lock. GUARDED_BY specifies that a particular lock should be held when accessing the annotated variable. GUARDED_VAR only indicates that a shared variable should be guarded (by any lock). GUARDED_VAR is primarily used when the client cannot express the name of the lock. The lock argument in GUARDED_BY (or in any other annotation mentioned below that takes lock arguments) can be a variable, a class member, or even an expression specifying an