Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
Hi Eric, On Tue, 15 Oct 2013 07:06:25 -0700 Eric Dumazet wrote: > On Tue, 2013-10-15 at 15:56 +0200, Sébastien Dugué wrote: > > On Tue, 15 Oct 2013 15:33:36 +0200 > > Andi Kleen wrote: > > > > > > indeed, our typical workload is connected mode IPoIB on mlx4 QDR > > > > hardware > > > > where one cannot benefit from hardware offloads. > > > > > > Is this with sendfile? > > > > Tests were done with iperf at the time without any extra funky options, > > and > > looking at the code it looks like it does plain write() / recv() on the > > socket. > > > > But the csum cost is both for sender and receiver ? No, it was only on the receiver side that I noticed it. > > Please post the following : > > perf record -g "your iperf session" > > perf report | head -n 200 Sorry, but this is 3 years old stuff and I do not have the setup anymore to reproduce. Sébastien. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
On Tue, 15 Oct 2013 15:33:36 +0200 Andi Kleen wrote: > > indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware > > where one cannot benefit from hardware offloads. > > Is this with sendfile? Tests were done with iperf at the time without any extra funky options, and looking at the code it looks like it does plain write() / recv() on the socket. Sébastien. > > For normal send() the checksum is done in the user copy and for receiving it > can be also done during the copy in most cases > > -Andi -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
Hi Neil, Andi, On Mon, 14 Oct 2013 16:25:28 -0400 Neil Horman wrote: > On Sun, Oct 13, 2013 at 09:38:33PM -0700, Andi Kleen wrote: > > Neil Horman writes: > > > > > Sébastien Dugué reported to me that devices implementing ipoib (which > > > don't have > > > checksum offload hardware were spending a significant amount of time > > > computing > > > > Must be an odd workload, most TCP/UDP workloads do copy-checksum > > anyways. I would rather investigate why that doesn't work. > > > FWIW, the reporter was reporting this using an IP over Infiniband network. > Neil indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware where one cannot benefit from hardware offloads. For a bit of background on the issue: It all started nearly 3 years ago when trying to understand why IPoIB BW was so low in our setups and why ksoftirqd used 100% of one CPU. A kernel profile trace showed that the CPU spent most of it's time in checksum computation (from the only old trace I managed to unearth): Function HitTimeAvg ------ schedule 1730629976998 us 364148.5 us csum_partial 1081346520944414 us 1.936 us mwait_idle_with_hints 14519858861 us 6794.529 us get_page_from_freelist101104348120524 us 0.803 us alloc_pages_current 100936755180650 us 0.513 us __phys_addr 355547834471387 us 0.125 us zone_statistics 101104344360871 us 0.431 us ipoib_cm_alloc_rx_skb 6738994343949 us 6.445 us After having recoded the checksum to use 2 ALUs, csum_partial() disappeared from the tracer radar. IPoIB BW got from ~12Gb/s to ~ 20Gb/s and ksoftirqd load dropped down drastically. Sorry, I could not manage to locate my old traces and results, those seem to have been lost in the mist of time. I did some micro benchmark (dirty hack code below) of different solutions. It looks like processing 128-byte blocks in 4 chains allows the best performance, but there are plenty other possibilities. FWIW, this code has been running as is at our customers sites for 3 years now. Sébastien. > > > That said the change looks reasonable, but may not fix the root cause. > > > > -Andi > > > > -- > > a...@linux.intel.com -- Speaking for myself only > > 8<-- /* * gcc -Wall -O3 -o csum_test csum_test.c -lrt */ #include #include #include #include #include #include #define __force #define unlikely(x) (x) typedef uint32_t u32; typedef uint16_t u16; typedef u16 __sum16; typedef u32 __wsum; #define NUM_LOOPS 10 #define BUF_LEN 65536 unsigned char buf[BUF_LEN]; /* * csum_fold - Fold and invert a 32bit checksum. * sum: 32bit unfolded sum * * Fold a 32bit running checksum to 16bit and invert it. This is usually * the last step before putting a checksum into a packet. * Make sure not to mix with 64bit checksums. */ static inline __sum16 csum_fold(__wsum sum) { asm(" addl %1,%0\n" " adcl $0x,%0" : "=r" (sum) : "r" ((__force u32)sum << 16), "0" ((__force u32)sum & 0x)); return (__force __sum16)(~(__force u32)sum >> 16); } static inline unsigned short from32to16(unsigned a) { unsigned short b = a >> 16; asm("addw %w2,%w0\n\t" "adcw $0,%w0\n" : "=r" (b) : "0" (b), "r" (a)); return b; } static inline unsigned add32_with_carry(unsigned a, unsigned b) { asm("addl %2,%0\n\t" "adcl $0,%0" : "=r" (a) : "0" (a), "r" (b)); return a; } /* * Do a 64-bit checksum on an arbitrary memory area. * Returns a 32bit checksum. * * This isn't as time critical as it used to be because many NICs * do hardware checksumming these days. * * Things tried and found to not make it faster: * Manual Prefetching * Unrolling to an 128 bytes inner loop. * Using interleaving with more registers to break the carry chains. */ static unsigned do_csum(const unsigned char *buff, unsigned len) { unsigned odd, count; unsigned long result = 0; if (unlikely(len == 0)) return result; odd = 1 & (unsigned long) buff; if (unlikely(odd)) { result = *buff << 8; len--; buff++; } count = len >> 1;
Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
Hi Neil, Andi, On Mon, 14 Oct 2013 16:25:28 -0400 Neil Horman nhor...@tuxdriver.com wrote: On Sun, Oct 13, 2013 at 09:38:33PM -0700, Andi Kleen wrote: Neil Horman nhor...@tuxdriver.com writes: Sébastien Dugué reported to me that devices implementing ipoib (which don't have checksum offload hardware were spending a significant amount of time computing Must be an odd workload, most TCP/UDP workloads do copy-checksum anyways. I would rather investigate why that doesn't work. FWIW, the reporter was reporting this using an IP over Infiniband network. Neil indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware where one cannot benefit from hardware offloads. For a bit of background on the issue: It all started nearly 3 years ago when trying to understand why IPoIB BW was so low in our setups and why ksoftirqd used 100% of one CPU. A kernel profile trace showed that the CPU spent most of it's time in checksum computation (from the only old trace I managed to unearth): Function HitTimeAvg ------ schedule 1730629976998 us 364148.5 us csum_partial 1081346520944414 us 1.936 us mwait_idle_with_hints 14519858861 us 6794.529 us get_page_from_freelist101104348120524 us 0.803 us alloc_pages_current 100936755180650 us 0.513 us __phys_addr 355547834471387 us 0.125 us zone_statistics 101104344360871 us 0.431 us ipoib_cm_alloc_rx_skb 6738994343949 us 6.445 us After having recoded the checksum to use 2 ALUs, csum_partial() disappeared from the tracer radar. IPoIB BW got from ~12Gb/s to ~ 20Gb/s and ksoftirqd load dropped down drastically. Sorry, I could not manage to locate my old traces and results, those seem to have been lost in the mist of time. I did some micro benchmark (dirty hack code below) of different solutions. It looks like processing 128-byte blocks in 4 chains allows the best performance, but there are plenty other possibilities. FWIW, this code has been running as is at our customers sites for 3 years now. Sébastien. That said the change looks reasonable, but may not fix the root cause. -Andi -- a...@linux.intel.com -- Speaking for myself only 8-- /* * gcc -Wall -O3 -o csum_test csum_test.c -lrt */ #include stdio.h #include stdlib.h #include stdint.h #include time.h #include string.h #include errno.h #define __force #define unlikely(x) (x) typedef uint32_t u32; typedef uint16_t u16; typedef u16 __sum16; typedef u32 __wsum; #define NUM_LOOPS 10 #define BUF_LEN 65536 unsigned char buf[BUF_LEN]; /* * csum_fold - Fold and invert a 32bit checksum. * sum: 32bit unfolded sum * * Fold a 32bit running checksum to 16bit and invert it. This is usually * the last step before putting a checksum into a packet. * Make sure not to mix with 64bit checksums. */ static inline __sum16 csum_fold(__wsum sum) { asm( addl %1,%0\n adcl $0x,%0 : =r (sum) : r ((__force u32)sum 16), 0 ((__force u32)sum 0x)); return (__force __sum16)(~(__force u32)sum 16); } static inline unsigned short from32to16(unsigned a) { unsigned short b = a 16; asm(addw %w2,%w0\n\t adcw $0,%w0\n : =r (b) : 0 (b), r (a)); return b; } static inline unsigned add32_with_carry(unsigned a, unsigned b) { asm(addl %2,%0\n\t adcl $0,%0 : =r (a) : 0 (a), r (b)); return a; } /* * Do a 64-bit checksum on an arbitrary memory area. * Returns a 32bit checksum. * * This isn't as time critical as it used to be because many NICs * do hardware checksumming these days. * * Things tried and found to not make it faster: * Manual Prefetching * Unrolling to an 128 bytes inner loop. * Using interleaving with more registers to break the carry chains. */ static unsigned do_csum(const unsigned char *buff, unsigned len) { unsigned odd, count; unsigned long result = 0; if (unlikely(len == 0)) return result; odd = 1 (unsigned long) buff; if (unlikely(odd)) { result = *buff 8; len--; buff++; } count = len 1; /* nr of 16-bit words.. */ if (count) { if (2 (unsigned long) buff) { result += *(unsigned short *)buff; count--; len -= 2; buff += 2; } count = 1;/* nr
Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
On Tue, 15 Oct 2013 15:33:36 +0200 Andi Kleen a...@firstfloor.org wrote: indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware where one cannot benefit from hardware offloads. Is this with sendfile? Tests were done with iperf at the time without any extra funky options, and looking at the code it looks like it does plain write() / recv() on the socket. Sébastien. For normal send() the checksum is done in the user copy and for receiving it can be also done during the copy in most cases -Andi -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
Hi Eric, On Tue, 15 Oct 2013 07:06:25 -0700 Eric Dumazet eric.duma...@gmail.com wrote: On Tue, 2013-10-15 at 15:56 +0200, Sébastien Dugué wrote: On Tue, 15 Oct 2013 15:33:36 +0200 Andi Kleen a...@firstfloor.org wrote: indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware where one cannot benefit from hardware offloads. Is this with sendfile? Tests were done with iperf at the time without any extra funky options, and looking at the code it looks like it does plain write() / recv() on the socket. But the csum cost is both for sender and receiver ? No, it was only on the receiver side that I noticed it. Please post the following : perf record -g your iperf session perf report | head -n 200 Sorry, but this is 3 years old stuff and I do not have the setup anymore to reproduce. Sébastien. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Is the PCI clock within the spec?
Hi John, On Tue, 04 Dec 2007 11:57:43 +0100 John Sigler <[EMAIL PROTECTED]> wrote: > Hello everyone, > > I have an x86 system, running Linux 2.6.22.1-rt9, in which I plug one > or two PCI I/O boards. I had been experiencing complete system lock-ups > until I sent the system to the board manufacturer, and he fixed the > problem. However, he told me that the PCI clock seemed out of spec, > as far as voltage is concerned. > > (Disclaimer: my knowledge of PCI is 0.) > > The board manufacturer sent me the plot of (what appears to be) voltage > versus time for the PCI clock. > > http://linux.kernel.free.fr/plot1.jpg > > The system manufacturer sent me a similar plot. > > http://linux.kernel.free.fr/plot2.jpg Why did they send you those plots? What was their point? > > As far as my understanding goes, the signal should alternate between > 0 V and 3.3 V (??). Yep, that's the idealized 3.3V signaling case. However, it looks like the signal is overshooting a bit (-0.8V below 0 and +0.8V over 3.3V from looking at the 1st plot) which may be due to incorrect impedance matching on the bus, probes artifacts, ... > In the second plot, it looks like Vmax ~ 4.6V > and Vmin ~ -1.4V (Pk-Pk(C1)=6.08V might mean peak-to-peak voltage?) This one looks a bit high (if they measured the same voltages I wonder where they got their scopes calibrated ;-) ) > > 0) What is this C1 both plots mention? Scope Channel 1 > 1) Am I reading the plot correctly? Yep > 2) Is -1.4V in DC even possible? Why not! > 3) 4.6V is 1.3V above 3.3V and -1.4V is -1.4V below 0. (Assuming I read > the numbers correctly) Are these values within the PCI spec? Or are > these voltages dangerous and / or might cause some problems with some > PCI boards? Well it depends on which of the plot is lying. Looking at the PCI spec (4.2.2.1) the Vih max for a device is Vcc-max+0.5 = 3.6 + 0.5 = 4.1V the Vil min is -0.5V so in this case it looks a bit high. But I would not worry too much, those are only the overshoots, and the circuits have clamping diodes on their inputs. The test waveform voltages for the maximum ratings (4.2.2.3) against which every PCI device should be qualified are higher than what you have here: 7.1V peak-to-peak. Hope this helps. Sebastien. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Is the PCI clock within the spec?
Hi John, On Tue, 04 Dec 2007 11:57:43 +0100 John Sigler [EMAIL PROTECTED] wrote: Hello everyone, I have an x86 system, running Linux 2.6.22.1-rt9, in which I plug one or two PCI I/O boards. I had been experiencing complete system lock-ups until I sent the system to the board manufacturer, and he fixed the problem. However, he told me that the PCI clock seemed out of spec, as far as voltage is concerned. (Disclaimer: my knowledge of PCI is 0.) The board manufacturer sent me the plot of (what appears to be) voltage versus time for the PCI clock. http://linux.kernel.free.fr/plot1.jpg The system manufacturer sent me a similar plot. http://linux.kernel.free.fr/plot2.jpg Why did they send you those plots? What was their point? As far as my understanding goes, the signal should alternate between 0 V and 3.3 V (??). Yep, that's the idealized 3.3V signaling case. However, it looks like the signal is overshooting a bit (-0.8V below 0 and +0.8V over 3.3V from looking at the 1st plot) which may be due to incorrect impedance matching on the bus, probes artifacts, ... In the second plot, it looks like Vmax ~ 4.6V and Vmin ~ -1.4V (Pk-Pk(C1)=6.08V might mean peak-to-peak voltage?) This one looks a bit high (if they measured the same voltages I wonder where they got their scopes calibrated ;-) ) 0) What is this C1 both plots mention? Scope Channel 1 1) Am I reading the plot correctly? Yep 2) Is -1.4V in DC even possible? Why not! 3) 4.6V is 1.3V above 3.3V and -1.4V is -1.4V below 0. (Assuming I read the numbers correctly) Are these values within the PCI spec? Or are these voltages dangerous and / or might cause some problems with some PCI boards? Well it depends on which of the plot is lying. Looking at the PCI spec (4.2.2.1) the Vih max for a device is Vcc-max+0.5 = 3.6 + 0.5 = 4.1V the Vil min is -0.5V so in this case it looks a bit high. But I would not worry too much, those are only the overshoots, and the circuits have clamping diodes on their inputs. The test waveform voltages for the maximum ratings (4.2.2.3) against which every PCI device should be qualified are higher than what you have here: 7.1V peak-to-peak. Hope this helps. Sebastien. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm] remove HT1000 from the quirk_msi_ht_cap quirk
Andrew, due to mailer problems, I'm not sure you received my previous mail titled: "[PATCH] V2 - quirk: Enable MSI Mapping on HT1000" if you did, then forget it and use this one instead. Or if you already merged V2 then drop this one. Sorry for the inconvenience. Sebastien. From: Sebastien Dugue <[EMAIL PROTECTED]> Now that the HyperTransport MSI mapping capability is enabled for Broadcom's HT1000 bridge, the quirk_msi_ht_cap quirk is no longer necessary. Signed-off-by: Sebastien Dugue <[EMAIL PROTECTED]> Cc: Andrew Morton <[EMAIL PROTECTED]> Cc: "Eric W. Biederman" <[EMAIL PROTECTED]> Cc: Michael Chan <[EMAIL PROTECTED]> --- drivers/pci/quirks.c |4 1 file changed, 4 deletions(-) Index: linux-2.6.24-rc3-mm2/drivers/pci/quirks.c === --- linux-2.6.24-rc3-mm2.orig/drivers/pci/quirks.c +++ linux-2.6.24-rc3-mm2/drivers/pci/quirks.c @@ -1711,10 +1711,6 @@ static void __devinit quirk_msi_ht_cap(s } DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_SERVERWORKS, PCI_DEVICE_ID_SERVERWORKS_HT2000_PCIE, quirk_msi_ht_cap); -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_SERVERWORKS, - PCI_DEVICE_ID_SERVERWORKS_HT1000_PXB, - quirk_msi_ht_cap); - /* * Force enable MSI mapping capability on HT bridges - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm] remove HT1000 from the quirk_msi_ht_cap quirk
Andrew, due to mailer problems, I'm not sure you received my previous mail titled: [PATCH] V2 - quirk: Enable MSI Mapping on HT1000 if you did, then forget it and use this one instead. Or if you already merged V2 then drop this one. Sorry for the inconvenience. Sebastien. From: Sebastien Dugue [EMAIL PROTECTED] Now that the HyperTransport MSI mapping capability is enabled for Broadcom's HT1000 bridge, the quirk_msi_ht_cap quirk is no longer necessary. Signed-off-by: Sebastien Dugue [EMAIL PROTECTED] Cc: Andrew Morton [EMAIL PROTECTED] Cc: Eric W. Biederman [EMAIL PROTECTED] Cc: Michael Chan [EMAIL PROTECTED] --- drivers/pci/quirks.c |4 1 file changed, 4 deletions(-) Index: linux-2.6.24-rc3-mm2/drivers/pci/quirks.c === --- linux-2.6.24-rc3-mm2.orig/drivers/pci/quirks.c +++ linux-2.6.24-rc3-mm2/drivers/pci/quirks.c @@ -1711,10 +1711,6 @@ static void __devinit quirk_msi_ht_cap(s } DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_SERVERWORKS, PCI_DEVICE_ID_SERVERWORKS_HT2000_PCIE, quirk_msi_ht_cap); -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_SERVERWORKS, - PCI_DEVICE_ID_SERVERWORKS_HT1000_PXB, - quirk_msi_ht_cap); - /* * Force enable MSI mapping capability on HT bridges - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] msi: set 'En' bit of MSI Mapping Capability on HT platform
On Sun, 25 Nov 2007 11:21:48 +0800 "peerchen" <[EMAIL PROTECTED]> wrote: > According to the HyperTransport spec, 'En' indicate if the MSI Mapping is > active. So it should be set when enable the MSI. > Cool, I had a patch that added a quirk to enable MSI Mapping on Broadcom's HT1000 so that the builtin BCM5706S Gigabit Ethernet would use MSI. This one looks even cleaner. As a side note, the bnx2 driver tries to use MSI by default (in the kernel I'm using - 2.6.21.5-rt10) and falls back to INTx if no interrupt is received within a reasonable amount of time. If the HT MSI Mapping is not enabled beforehand, the first MSI message (before reverting to INTx) is not trapped by the bridge and lands in memory thus corrupting what is there at that time. In my case it's the mapcount of a struct page in the freelist which is overwritten with the MSI message. Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] msi: set 'En' bit of MSI Mapping Capability on HT platform
On Sun, 25 Nov 2007 11:21:48 +0800 peerchen [EMAIL PROTECTED] wrote: According to the HyperTransport spec, 'En' indicate if the MSI Mapping is active. So it should be set when enable the MSI. Cool, I had a patch that added a quirk to enable MSI Mapping on Broadcom's HT1000 so that the builtin BCM5706S Gigabit Ethernet would use MSI. This one looks even cleaner. As a side note, the bnx2 driver tries to use MSI by default (in the kernel I'm using - 2.6.21.5-rt10) and falls back to INTx if no interrupt is received within a reasonable amount of time. If the HT MSI Mapping is not enabled beforehand, the first MSI message (before reverting to INTx) is not trapped by the bridge and lands in memory thus corrupting what is there at that time. In my case it's the mapcount of a struct page in the freelist which is overwritten with the MSI message. Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] RT: fix spin_trylock_irq
This patch fixes a bug in spin_trylock_irq() where __spin_trylock_irq() is picked for regular (non-raw) spinlocks instead of _spin_trylock_irq(). This results in systematic boot hangs and may have been going unnoticed for quite some time as it only manifests (aside from a compile warning) when booting with a NUMA config or when using the Chelsio T3 (cxgb3) driver as these seems to be the sole users. Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]> --- include/linux/spinlock.h |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.23-rc9-rt2/include/linux/spinlock.h === --- linux-2.6.23-rc9-rt2.orig/include/linux/spinlock.h +++ linux-2.6.23-rc9-rt2/include/linux/spinlock.h @@ -501,7 +501,7 @@ do { \ #define spin_trylock_irq(lock) \ __cond_lock(lock, PICK_SPIN_OP_RET(__spin_trylock_irq, \ - __spin_trylock_irq, lock)) + _spin_trylock_irq, lock)) #define spin_trylock_irqsave(lock, flags) \ __cond_lock(lock, PICK_SPIN_OP_RET(__spin_trylock_irqsave, \ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] RT: fix spin_trylock_irq
This patch fixes a bug in spin_trylock_irq() where __spin_trylock_irq() is picked for regular (non-raw) spinlocks instead of _spin_trylock_irq(). This results in systematic boot hangs and may have been going unnoticed for quite some time as it only manifests (aside from a compile warning) when booting with a NUMA config or when using the Chelsio T3 (cxgb3) driver as these seems to be the sole users. Signed-off-by: Sébastien Dugué [EMAIL PROTECTED] --- include/linux/spinlock.h |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.23-rc9-rt2/include/linux/spinlock.h === --- linux-2.6.23-rc9-rt2.orig/include/linux/spinlock.h +++ linux-2.6.23-rc9-rt2/include/linux/spinlock.h @@ -501,7 +501,7 @@ do { \ #define spin_trylock_irq(lock) \ __cond_lock(lock, PICK_SPIN_OP_RET(__spin_trylock_irq, \ - __spin_trylock_irq, lock)) + _spin_trylock_irq, lock)) #define spin_trylock_irqsave(lock, flags) \ __cond_lock(lock, PICK_SPIN_OP_RET(__spin_trylock_irqsave, \ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Search for x86_64 documentation.
Hi Francis, On Wed, 1 Aug 2007 17:30:35 +0200 "Francis Moreau" <[EMAIL PROTECTED]> wrote: > Hello Rene, > > On 8/1/07, Rene Herman <[EMAIL PROTECTED]> wrote: > > On 08/01/2007 03:27 PM, Francis Moreau wrote: > > > > > I'm used to hack Linux on a ARM based board and would like to be > > > involved in x86_64 architecture but I don't know where I should > > > start... > > > > > > Could anyone point out some nice documentations/books describing this > > > architecture ? > > > > First and foremost the AMD64 architecture documentation from AMD itself: > > > > http://www.amd.com/gb-uk/Processors/TechnicalResources/0,,30_182_739_7044,00.html > > > > Thank you for the pointer but I alread knew about them. > > I was actually more interested in books which are more pleasant to > read than a raw datasheet. Something like "x86_64 arch for newbies" ;) > Then you may have a look at http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html It's a bit more pleasant to read than the AMD or Intel programming manuals but in the long run you will need those along with the accompanying erratas. Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ck] Re: Linux Kernel cfs scheduler and xorg kbd
Hi all, just a datapoint FWIW On Wed, 01 Aug 2007 17:53:32 +0200 "<::.. Teresa_II ..::>" <[EMAIL PROTECTED]> wrote: > У чт, 2007-08-02 у 01:00 +1000, Matthew Hawkins пише: > > > Are you sure its not just a setting in Gnome/KDE for accessibility? > > That tends to do crazy things like making control keys sticky... > > Yes, i use gnome, and all keyboard layouts are set in gnome, also group > switching for layouts. But accessibility is switched completely off. > This reminds me I had something similar happening about a year or so ago running with Debian's unstable xorg and gnome. The keys would stick for a short time (no, no coffee spilled on the keyboard, I can assure you ;-), the most annoying one being the delete key. Nothing deterministic, it could happen anytime (or not at all for several days). As a final resort, I upgraded the xorg and gnome packages - problem solved. Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ck] Re: Linux Kernel cfs scheduler and xorg kbd
Hi all, just a datapoint FWIW On Wed, 01 Aug 2007 17:53:32 +0200 ::.. Teresa_II ..:: [EMAIL PROTECTED] wrote: У чт, 2007-08-02 у 01:00 +1000, Matthew Hawkins пише: Are you sure its not just a setting in Gnome/KDE for accessibility? That tends to do crazy things like making control keys sticky... Yes, i use gnome, and all keyboard layouts are set in gnome, also group switching for layouts. But accessibility is switched completely off. This reminds me I had something similar happening about a year or so ago running with Debian's unstable xorg and gnome. The keys would stick for a short time (no, no coffee spilled on the keyboard, I can assure you ;-), the most annoying one being the delete key. Nothing deterministic, it could happen anytime (or not at all for several days). As a final resort, I upgraded the xorg and gnome packages - problem solved. Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Search for x86_64 documentation.
Hi Francis, On Wed, 1 Aug 2007 17:30:35 +0200 Francis Moreau [EMAIL PROTECTED] wrote: Hello Rene, On 8/1/07, Rene Herman [EMAIL PROTECTED] wrote: On 08/01/2007 03:27 PM, Francis Moreau wrote: I'm used to hack Linux on a ARM based board and would like to be involved in x86_64 architecture but I don't know where I should start... Could anyone point out some nice documentations/books describing this architecture ? First and foremost the AMD64 architecture documentation from AMD itself: http://www.amd.com/gb-uk/Processors/TechnicalResources/0,,30_182_739_7044,00.html Thank you for the pointer but I alread knew about them. I was actually more interested in books which are more pleasant to read than a raw datasheet. Something like x86_64 arch for newbies ;) Then you may have a look at http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html It's a bit more pleasant to read than the AMD or Intel programming manuals but in the long run you will need those along with the accompanying erratas. Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ATA over ethernet swapping and obfuscated code
Hi Pavel, On Tue, 31 Jul 2007 15:58:31 +0200 Pavel Machek <[EMAIL PROTECTED]> wrote: > Hi! > > I wanted to know if it is possible/okay to swap over AOE... > > According to > http://www.coraid.com/support/linux/EtherDrive-2.6-HOWTO-5.html#ss5.20 > .. it runs OOM even during normal use, so I guess swapping over it is > no-no? > > Can I build both client and server for these using free software? > > In the process, I looked at the aoe code, and parts of it look like > obfuscated C contest. The use of switch() as an if was particulary > creative; I'm not even sure if I translated it properly... can you > take a look? > > (Patch is > > Signed-off-by: Pavel Machek <[EMAIL PROTECTED]> > > but I did not even compile test it) > > diff --git a/drivers/block/aoe/aoedev.c b/drivers/block/aoe/aoedev.c > index 05a9719..38ba35d 100644 > --- a/drivers/block/aoe/aoedev.c > +++ b/drivers/block/aoe/aoedev.c > @@ -64,29 +64,26 @@ aoedev_newdev(ulong nframes) > > d = kzalloc(sizeof *d, GFP_ATOMIC); > f = kcalloc(nframes, sizeof *f, GFP_ATOMIC); > - switch (!d || !f) { > - case 0: > - d->nframes = nframes; > - d->frames = f; > - e = f + nframes; > - for (; f - f->tag = FREETAG; > - f->skb = new_skb(ETH_ZLEN); > - if (!f->skb) > - break; > - } > - if (f == e) > - break; > + if (!d || !f) { > + kfree(f); > + kfree(d); > + return NULL; > + } > + > + d->nframes = nframes; > + d->frames = f; > + e = f + nframes; > + for (; f + f->tag = FREETAG; > + f->skb = new_skb(ETH_ZLEN); > + if (!f->skb) > + break; > + } > + if (f != e) { > while (f > d->frames) { > f--; > dev_kfree_skb(f->skb); > } > - default: > - if (f) > - kfree(f); > - if (d) > - kfree(d); > - return NULL; > } > INIT_WORK(>work, aoecmd_sleepwork); > spin_lock_init(>lock); Creative it is. > > > aoedev_by_sysminor_m() returns with spinlock held in error case; I > guess that's bad. > > struct aoedev * > aoedev_by_sysminor_m(ulong sysminor, ulong bufcnt) > { > struct aoedev *d; > ulong flags; > > spin_lock_irqsave(_lock, flags); > > for (d=devlist; d; d=d->next) > if (d->sysminor == sysminor) > break; > > if (d == NULL) { > d = aoedev_newdev(bufcnt); > if (d == NULL) { > spin_unlock_irqrestore(_lock, flags); ~ what about here > printk(KERN_INFO "aoe: aoedev_newdev > failure.\n"); > return NULL; > ~ here > } > d->sysminor = sysminor; > d->aoemajor = AOEMAJOR(sysminor); > d->aoeminor = AOEMINOR(sysminor); > } > > spin_unlock_irqrestore(_lock, flags); > return d; > } > Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RT] Fix RT balancing tasks pulling
Hi Josh, On Mon, 30 Jul 2007 14:50:34 -0700 Josh Triplett <[EMAIL PROTECTED]> wrote: > On Tue, 2007-07-24 at 15:42 +0200, Sébastien Dugué wrote: > > this one-liner fixes a bug in balance_rt_tasks() which sometimes > > manifests by > > having a lower prio task being scheduled while a higher prio task is sitting > > waiting on another runqueue. > > > > This is pretty hard to reproduce on low cpu count machines, for example, I > > had to have sched_football run in a loop for ~38h before it failed on a > > dual HT > > Xeon box. > > Very impressive find, and this does seem to improve things in some > cases. > > However, with 2.6.22.1-rt8, which includes this patch, I still managed > to get some failures after a few hours on an 8-way (quad dual-core) box. > Preliminary results so far: > 420 Final ball position: 0 > 2 Final ball position: 1 > Darn, I thought it was it :( I had a ~60h run on my box with the very same kernel without a single failure. Do you have any other user RT load running when those failures pop up? Anyway, back to scheduler code dissection now. Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RT] Fix RT balancing tasks pulling
Hi Josh, On Mon, 30 Jul 2007 14:50:34 -0700 Josh Triplett [EMAIL PROTECTED] wrote: On Tue, 2007-07-24 at 15:42 +0200, Sébastien Dugué wrote: this one-liner fixes a bug in balance_rt_tasks() which sometimes manifests by having a lower prio task being scheduled while a higher prio task is sitting waiting on another runqueue. This is pretty hard to reproduce on low cpu count machines, for example, I had to have sched_football run in a loop for ~38h before it failed on a dual HT Xeon box. Very impressive find, and this does seem to improve things in some cases. However, with 2.6.22.1-rt8, which includes this patch, I still managed to get some failures after a few hours on an 8-way (quad dual-core) box. Preliminary results so far: 420 Final ball position: 0 2 Final ball position: 1 Darn, I thought it was it :( I had a ~60h run on my box with the very same kernel without a single failure. Do you have any other user RT load running when those failures pop up? Anyway, back to scheduler code dissection now. Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ATA over ethernet swapping and obfuscated code
Hi Pavel, On Tue, 31 Jul 2007 15:58:31 +0200 Pavel Machek [EMAIL PROTECTED] wrote: Hi! I wanted to know if it is possible/okay to swap over AOE... According to http://www.coraid.com/support/linux/EtherDrive-2.6-HOWTO-5.html#ss5.20 .. it runs OOM even during normal use, so I guess swapping over it is no-no? Can I build both client and server for these using free software? In the process, I looked at the aoe code, and parts of it look like obfuscated C contest. The use of switch() as an if was particulary creative; I'm not even sure if I translated it properly... can you take a look? (Patch is Signed-off-by: Pavel Machek [EMAIL PROTECTED] but I did not even compile test it) diff --git a/drivers/block/aoe/aoedev.c b/drivers/block/aoe/aoedev.c index 05a9719..38ba35d 100644 --- a/drivers/block/aoe/aoedev.c +++ b/drivers/block/aoe/aoedev.c @@ -64,29 +64,26 @@ aoedev_newdev(ulong nframes) d = kzalloc(sizeof *d, GFP_ATOMIC); f = kcalloc(nframes, sizeof *f, GFP_ATOMIC); - switch (!d || !f) { - case 0: - d-nframes = nframes; - d-frames = f; - e = f + nframes; - for (; fe; f++) { - f-tag = FREETAG; - f-skb = new_skb(ETH_ZLEN); - if (!f-skb) - break; - } - if (f == e) - break; + if (!d || !f) { + kfree(f); + kfree(d); + return NULL; + } + + d-nframes = nframes; + d-frames = f; + e = f + nframes; + for (; fe; f++) { + f-tag = FREETAG; + f-skb = new_skb(ETH_ZLEN); + if (!f-skb) + break; + } + if (f != e) { while (f d-frames) { f--; dev_kfree_skb(f-skb); } - default: - if (f) - kfree(f); - if (d) - kfree(d); - return NULL; } INIT_WORK(d-work, aoecmd_sleepwork); spin_lock_init(d-lock); Creative it is. aoedev_by_sysminor_m() returns with spinlock held in error case; I guess that's bad. struct aoedev * aoedev_by_sysminor_m(ulong sysminor, ulong bufcnt) { struct aoedev *d; ulong flags; spin_lock_irqsave(devlist_lock, flags); for (d=devlist; d; d=d-next) if (d-sysminor == sysminor) break; if (d == NULL) { d = aoedev_newdev(bufcnt); if (d == NULL) { spin_unlock_irqrestore(devlist_lock, flags); ~ what about here printk(KERN_INFO aoe: aoedev_newdev failure.\n); return NULL; ~ here } d-sysminor = sysminor; d-aoemajor = AOEMAJOR(sysminor); d-aoeminor = AOEMINOR(sysminor); } spin_unlock_irqrestore(devlist_lock, flags); return d; } Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: commit 7e92b4fc34 - x86, serial: convert legacy COM ports to platform devices - broke my serial console
On Wed, 25 Jul 2007 16:41:23 -0600 Bjorn Helgaas <[EMAIL PROTECTED]> wrote: > On Wednesday 25 July 2007 07:32:53 am Sébastien Dugué wrote: > > On Wed, 25 Jul 2007 07:16:44 -0600 Bjorn Helgaas <[EMAIL PROTECTED]> wrote: > > > > > The _DDN is a "DOS device name", and the _UID is a "logical device ID > > > that does not change across reboots." Both are optional, and PNPACPI > > > ignores them. But maybe we could change PNPACPI to sort by them if > > > they are present. I'll think about this a bit. > > > > That would be nice, but I wish you good luck with all those > > crappy BIOSes out there. > > Yeah, it's an ugly world we live in. Would you be able to try the > attached patch just for testing? It should sort devices with the > same _HID by their _UID. It doesn't have any effect on my systems, > because my devices are already ordered by _UID by default. But I > think it should switch your COM1/COM2 ports back to the order you > expect. > Works well, thanks Bjorn. Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: commit 7e92b4fc34 - x86, serial: convert legacy COM ports to platform devices - broke my serial console
On Wed, 25 Jul 2007 16:41:23 -0600 Bjorn Helgaas [EMAIL PROTECTED] wrote: On Wednesday 25 July 2007 07:32:53 am Sébastien Dugué wrote: On Wed, 25 Jul 2007 07:16:44 -0600 Bjorn Helgaas [EMAIL PROTECTED] wrote: The _DDN is a DOS device name, and the _UID is a logical device ID that does not change across reboots. Both are optional, and PNPACPI ignores them. But maybe we could change PNPACPI to sort by them if they are present. I'll think about this a bit. That would be nice, but I wish you good luck with all those crappy BIOSes out there. Yeah, it's an ugly world we live in. Would you be able to try the attached patch just for testing? It should sort devices with the same _HID by their _UID. It doesn't have any effect on my systems, because my devices are already ordered by _UID by default. But I think it should switch your COM1/COM2 ports back to the order you expect. Works well, thanks Bjorn. Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: commit 7e92b4fc34 - x86, serial: convert legacy COM ports to platform devices - broke my serial console
On Wed, 25 Jul 2007 15:32:53 +0200 Sébastien Dugué <[EMAIL PROTECTED]> wrote: > > Hi Bjorn, > > On Wed, 25 Jul 2007 07:16:44 -0600 Bjorn Helgaas <[EMAIL PROTECTED]> wrote: > > > On Wednesday 25 July 2007 01:45:54 am Sébastien Dugué wrote: > > > looks like the commit was dropped, nevertheless here is some more info > > > to try and understand what may be going on so it may benefit the > > > posterity ;-) > > > > Thank you very much. > > > > Your machine does indeed describe both UARTs in ACPI, and the PNP probe > > finds them correctly. The only wrinkle is that the PNP probe names the > > ports in the order they appear in the ACPI namespace, and your firmware > > has them "backwards": > > > > Device (COMB) > > { > > Name (_HID, EisaId ("PNP0501")) > > Name (_DDN, "COM2") > > Name (_UID, 0x02) > > > > Device (COMA) > > { > > Name (_HID, EisaId ("PNP0501")) > > Name (_DDN, "COM1") > > Name (_UID, 0x01) > > > Yes, that's what I thought was weird in the DSDT, but thought the PnP layer > would map those devices according to UID or at least DDN. It seems that's > not the case. > > > > > There's nothing illegal about this, but with the current PNPACPI, > > it causes the names to be reversed. The blind probe tries 0x3f8 > > first, and names that ttyS0. But the PNP probe, using the namespace > > order, finds the "COM2" port at 0x2f8 first: > > > > 00:0b: ttyS0 at I/O 0x2f8 (irq = 3) is a 16550A > > 00:0c: ttyS1 at I/O 0x3f8 (irq = 4) is a 16550A > > Oh I see, didn't notice this though. > > > > > So I think your serial console would work without "legacy_serial.force" > > if you used "console=ttyS1" instead of "console=ttyS0". But you > > shouldn't have to do that, of course. > > Will try ttyS1 just to confirm... > > Waiting for reboot... > > OK boots fine with the serial console fully operational again. But I also need to change my inittab. At least it confirms the root cause of the problem. Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: commit 7e92b4fc34 - x86, serial: convert legacy COM ports to platform devices - broke my serial console
Hi Bjorn, On Wed, 25 Jul 2007 07:16:44 -0600 Bjorn Helgaas <[EMAIL PROTECTED]> wrote: > On Wednesday 25 July 2007 01:45:54 am Sébastien Dugué wrote: > > looks like the commit was dropped, nevertheless here is some more info > > to try and understand what may be going on so it may benefit the posterity > > ;-) > > Thank you very much. > > Your machine does indeed describe both UARTs in ACPI, and the PNP probe > finds them correctly. The only wrinkle is that the PNP probe names the > ports in the order they appear in the ACPI namespace, and your firmware > has them "backwards": > > Device (COMB) > { > Name (_HID, EisaId ("PNP0501")) > Name (_DDN, "COM2") > Name (_UID, 0x02) > > Device (COMA) > { > Name (_HID, EisaId ("PNP0501")) > Name (_DDN, "COM1") > Name (_UID, 0x01) Yes, that's what I thought was weird in the DSDT, but thought the PnP layer would map those devices according to UID or at least DDN. It seems that's not the case. > > There's nothing illegal about this, but with the current PNPACPI, > it causes the names to be reversed. The blind probe tries 0x3f8 > first, and names that ttyS0. But the PNP probe, using the namespace > order, finds the "COM2" port at 0x2f8 first: > > 00:0b: ttyS0 at I/O 0x2f8 (irq = 3) is a 16550A > 00:0c: ttyS1 at I/O 0x3f8 (irq = 4) is a 16550A Oh I see, didn't notice this though. > > So I think your serial console would work without "legacy_serial.force" > if you used "console=ttyS1" instead of "console=ttyS0". But you > shouldn't have to do that, of course. Will try ttyS1 just to confirm... Waiting for reboot... OK boots fine with the serial console fully operational again. > > The _DDN is a "DOS device name", and the _UID is a "logical device ID > that does not change across reboots." Both are optional, and PNPACPI > ignores them. But maybe we could change PNPACPI to sort by them if > they are present. I'll think about this a bit. That would be nice, but I wish you good luck with all those crappy BIOSes out there. Thanks, Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: commit 7e92b4fc34 - x86, serial: convert legacy COM ports to platform devices - broke my serial console
Hi Bjorn, looks like the commit was dropped, nevertheless here is some more info to try and understand what may be going on so it may benefit the posterity ;-) On Tue, 24 Jul 2007 09:48:45 -0600 Bjorn Helgaas <[EMAIL PROTECTED]> wrote: > On Tuesday 24 July 2007 08:28:05 am Sébastien Dugué wrote: > > your commit 7e92b4fc345f5b6f57585fbe5ffdb0f24d7c9b26 broke the serial > > console > > on my box. Adding 'legacy_serial.force=1' to my boot param as a workaround > > solves the issue, but this may be hiding bugs in Linux PnP support or > > in my firmware. > > Thanks for your report. We need to figure out why the 8250_pnp driver > didn't find your serial console device. Can you confirm that you also > have CONFIG_ACPI and CONFIG_PNPACPI in your .config? Yep, both =y CONFIG_ACPI=y CONFIG_PNPACPI=y > > If you have those, and it still doesn't work, can you collect the DSDT > dump, the output of "grep . /sys/bus/pnp/devices/*/*", and the dmesg > from your "legacy_serial.force=1" boot? Please find those attached. I also include the output from dmidecode just in case (I'm not sure it contains anything interesting in this case though). dsdt.dsl -> DSDT dump dmi.txt -> output from dmidecode pnp-info.txt -> output of the grep dmesg-ok.txt -> dmesg with 'legacy_serial.force=1' (CONFIG_PNP_DEBUG=y) dmesg-bad.txt -> dmesg without 'legacy_serial.force=1' (CONFIG_PNP_DEBUG=y) > Then we can tell which port > the blind probe finds and whether it's described somewhere by ACPI. > > Thanks, > Bjorn Thanks, Sébastien. > > > The box is a dual HT Xeon running a vanilla 2.6.22 x86_64 kernel > > > > here is my .config: > > > > CONFIG_PNP=y > > CONFIG_SERIAL_8250=y > > CONFIG_SERIAL_8250_CONSOLE=y > > CONFIG_SERIAL_8250_PCI=y > > CONFIG_SERIAL_8250_PNP=y > > CONFIG_SERIAL_8250_NR_UARTS=4 > > CONFIG_SERIAL_8250_RUNTIME_UARTS=4 > > CONFIG_SERIAL_8250_EXTENDED=y > > CONFIG_SERIAL_8250_SHARE_IRQ=y > > CONFIG_SERIAL_8250_DETECT_IRQ=y > > > > lspci output attached. > > > > Any ideas to help me debug this? > > > > If you need more info (like DSDT dump), just ask. > > > > Thanks, > > > > Sébastien. > > > > > > > > logs.tar.bz2 Description: application/bzip
Re: commit 7e92b4fc34 - x86, serial: convert legacy COM ports to platform devices - broke my serial console
Hi Bjorn, looks like the commit was dropped, nevertheless here is some more info to try and understand what may be going on so it may benefit the posterity ;-) On Tue, 24 Jul 2007 09:48:45 -0600 Bjorn Helgaas [EMAIL PROTECTED] wrote: On Tuesday 24 July 2007 08:28:05 am Sébastien Dugué wrote: your commit 7e92b4fc345f5b6f57585fbe5ffdb0f24d7c9b26 broke the serial console on my box. Adding 'legacy_serial.force=1' to my boot param as a workaround solves the issue, but this may be hiding bugs in Linux PnP support or in my firmware. Thanks for your report. We need to figure out why the 8250_pnp driver didn't find your serial console device. Can you confirm that you also have CONFIG_ACPI and CONFIG_PNPACPI in your .config? Yep, both =y CONFIG_ACPI=y CONFIG_PNPACPI=y If you have those, and it still doesn't work, can you collect the DSDT dump, the output of grep . /sys/bus/pnp/devices/*/*, and the dmesg from your legacy_serial.force=1 boot? Please find those attached. I also include the output from dmidecode just in case (I'm not sure it contains anything interesting in this case though). dsdt.dsl - DSDT dump dmi.txt - output from dmidecode pnp-info.txt - output of the grep dmesg-ok.txt - dmesg with 'legacy_serial.force=1' (CONFIG_PNP_DEBUG=y) dmesg-bad.txt - dmesg without 'legacy_serial.force=1' (CONFIG_PNP_DEBUG=y) Then we can tell which port the blind probe finds and whether it's described somewhere by ACPI. Thanks, Bjorn Thanks, Sébastien. The box is a dual HT Xeon running a vanilla 2.6.22 x86_64 kernel here is my .config: CONFIG_PNP=y CONFIG_SERIAL_8250=y CONFIG_SERIAL_8250_CONSOLE=y CONFIG_SERIAL_8250_PCI=y CONFIG_SERIAL_8250_PNP=y CONFIG_SERIAL_8250_NR_UARTS=4 CONFIG_SERIAL_8250_RUNTIME_UARTS=4 CONFIG_SERIAL_8250_EXTENDED=y CONFIG_SERIAL_8250_SHARE_IRQ=y CONFIG_SERIAL_8250_DETECT_IRQ=y lspci output attached. Any ideas to help me debug this? If you need more info (like DSDT dump), just ask. Thanks, Sébastien. logs.tar.bz2 Description: application/bzip
Re: commit 7e92b4fc34 - x86, serial: convert legacy COM ports to platform devices - broke my serial console
On Wed, 25 Jul 2007 15:32:53 +0200 Sébastien Dugué [EMAIL PROTECTED] wrote: Hi Bjorn, On Wed, 25 Jul 2007 07:16:44 -0600 Bjorn Helgaas [EMAIL PROTECTED] wrote: On Wednesday 25 July 2007 01:45:54 am Sébastien Dugué wrote: looks like the commit was dropped, nevertheless here is some more info to try and understand what may be going on so it may benefit the posterity ;-) Thank you very much. Your machine does indeed describe both UARTs in ACPI, and the PNP probe finds them correctly. The only wrinkle is that the PNP probe names the ports in the order they appear in the ACPI namespace, and your firmware has them backwards: Device (COMB) { Name (_HID, EisaId (PNP0501)) Name (_DDN, COM2) Name (_UID, 0x02) Device (COMA) { Name (_HID, EisaId (PNP0501)) Name (_DDN, COM1) Name (_UID, 0x01) Yes, that's what I thought was weird in the DSDT, but thought the PnP layer would map those devices according to UID or at least DDN. It seems that's not the case. There's nothing illegal about this, but with the current PNPACPI, it causes the names to be reversed. The blind probe tries 0x3f8 first, and names that ttyS0. But the PNP probe, using the namespace order, finds the COM2 port at 0x2f8 first: 00:0b: ttyS0 at I/O 0x2f8 (irq = 3) is a 16550A 00:0c: ttyS1 at I/O 0x3f8 (irq = 4) is a 16550A Oh I see, didn't notice this though. So I think your serial console would work without legacy_serial.force if you used console=ttyS1 instead of console=ttyS0. But you shouldn't have to do that, of course. Will try ttyS1 just to confirm... Waiting for reboot... OK boots fine with the serial console fully operational again. But I also need to change my inittab. At least it confirms the root cause of the problem. Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: commit 7e92b4fc34 - x86, serial: convert legacy COM ports to platform devices - broke my serial console
Hi Bjorn, On Wed, 25 Jul 2007 07:16:44 -0600 Bjorn Helgaas [EMAIL PROTECTED] wrote: On Wednesday 25 July 2007 01:45:54 am Sébastien Dugué wrote: looks like the commit was dropped, nevertheless here is some more info to try and understand what may be going on so it may benefit the posterity ;-) Thank you very much. Your machine does indeed describe both UARTs in ACPI, and the PNP probe finds them correctly. The only wrinkle is that the PNP probe names the ports in the order they appear in the ACPI namespace, and your firmware has them backwards: Device (COMB) { Name (_HID, EisaId (PNP0501)) Name (_DDN, COM2) Name (_UID, 0x02) Device (COMA) { Name (_HID, EisaId (PNP0501)) Name (_DDN, COM1) Name (_UID, 0x01) Yes, that's what I thought was weird in the DSDT, but thought the PnP layer would map those devices according to UID or at least DDN. It seems that's not the case. There's nothing illegal about this, but with the current PNPACPI, it causes the names to be reversed. The blind probe tries 0x3f8 first, and names that ttyS0. But the PNP probe, using the namespace order, finds the COM2 port at 0x2f8 first: 00:0b: ttyS0 at I/O 0x2f8 (irq = 3) is a 16550A 00:0c: ttyS1 at I/O 0x3f8 (irq = 4) is a 16550A Oh I see, didn't notice this though. So I think your serial console would work without legacy_serial.force if you used console=ttyS1 instead of console=ttyS0. But you shouldn't have to do that, of course. Will try ttyS1 just to confirm... Waiting for reboot... OK boots fine with the serial console fully operational again. The _DDN is a DOS device name, and the _UID is a logical device ID that does not change across reboots. Both are optional, and PNPACPI ignores them. But maybe we could change PNPACPI to sort by them if they are present. I'll think about this a bit. That would be nice, but I wish you good luck with all those crappy BIOSes out there. Thanks, Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
commit 7e92b4fc34 - x86, serial: convert legacy COM ports to platform devices - broke my serial console
Hi Bjorn, your commit 7e92b4fc345f5b6f57585fbe5ffdb0f24d7c9b26 broke the serial console on my box. Adding 'legacy_serial.force=1' to my boot param as a workaround solves the issue, but this may be hiding bugs in Linux PnP support or in my firmware. The box is a dual HT Xeon running a vanilla 2.6.22 x86_64 kernel here is my .config: CONFIG_PNP=y CONFIG_SERIAL_8250=y CONFIG_SERIAL_8250_CONSOLE=y CONFIG_SERIAL_8250_PCI=y CONFIG_SERIAL_8250_PNP=y CONFIG_SERIAL_8250_NR_UARTS=4 CONFIG_SERIAL_8250_RUNTIME_UARTS=4 CONFIG_SERIAL_8250_EXTENDED=y CONFIG_SERIAL_8250_SHARE_IRQ=y CONFIG_SERIAL_8250_DETECT_IRQ=y lspci output attached. Any ideas to help me debug this? If you need more info (like DSDT dump), just ask. Thanks, Sébastien. 00:00.0 Host bridge: Intel Corporation E7520 Memory Controller Hub (rev 0c) 00:00.1 Class ff00: Intel Corporation E7525/E7520 Error Reporting Registers (rev 0c) 00:01.0 System peripheral: Intel Corporation E7520 DMA Controller (rev 0c) 00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A (rev 0c) 00:04.0 PCI bridge: Intel Corporation E7525/E7520 PCI Express Port B (rev 0c) 00:05.0 PCI bridge: Intel Corporation E7520 PCI Express Port B1 (rev 0c) 00:06.0 PCI bridge: Intel Corporation E7520 PCI Express Port C (rev 0c) 00:07.0 PCI bridge: Intel Corporation E7520 PCI Express Port C1 (rev 0c) 00:1c.0 PCI bridge: Intel Corporation 6300ESB 64-bit PCI-X Bridge (rev 02) 00:1d.0 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02) 00:1d.1 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02) 00:1d.4 System peripheral: Intel Corporation 6300ESB Watchdog Timer (rev 02) 00:1d.5 PIC: Intel Corporation 6300ESB I/O Advanced Programmable Interrupt Controller (rev 02) 00:1d.7 USB Controller: Intel Corporation 6300ESB USB2 Enhanced Host Controller (rev 02) 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 0a) 00:1f.0 ISA bridge: Intel Corporation 6300ESB LPC Interface Controller (rev 02) 00:1f.1 IDE interface: Intel Corporation 6300ESB PATA Storage Controller (rev 02) 00:1f.3 SMBus: Intel Corporation 6300ESB SMBus Controller (rev 02) 01:00.0 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A (rev 09) 01:00.2 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B (rev 09) 02:01.0 Ethernet controller: Intel Corporation 82544EI Gigabit Ethernet Controller (Copper) (rev 02) 02:03.0 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10) 02:03.1 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10) 03:01.0 Ethernet controller: Intel Corporation 82544EI Gigabit Ethernet Controller (Copper) (rev 02) 08:01.0 VGA compatible controller: ATI Technologies Inc Radeon RV100 QY [Radeon 7000/VE] 08:02.0 Ethernet controller: Intel Corporation 82541GI Gigabit Ethernet Controller 08:03.0 Ethernet controller: Intel Corporation 82541GI Gigabit Ethernet Controller
[PATCH RT] Fix RT balancing tasks pulling
Hi Ingo, Thomas, this one-liner fixes a bug in balance_rt_tasks() which sometimes manifests by having a lower prio task being scheduled while a higher prio task is sitting waiting on another runqueue. This is pretty hard to reproduce on low cpu count machines, for example, I had to have sched_football run in a loop for ~38h before it failed on a dual HT Xeon box. Sébastien. --- In the CPU loop in balance_rt_tasks(), the 'next' task pointer is only ever updated if this_rq->lock has been dropped in double_lock_balance(). This sometimes lead to 'this_cpu' pulling tasks that are only garanteed to have a higher priority than the 'next' picked before the loop. Fix this to update 'next' to the last picked RT task. Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]> --- kernel/sched.c |8 1 file changed, 8 insertions(+) Index: linux-2.6.22.1-rt6/kernel/sched.c === --- linux-2.6.22.1-rt6.orig/kernel/sched.c 2007-07-24 11:34:18.0 +0200 +++ linux-2.6.22.1-rt6/kernel/sched.c 2007-07-24 11:34:35.0 +0200 @@ -1499,6 +1499,14 @@ static void balance_rt_tasks(struct rq * * in another runqueue. (low likelyhood * but possible) */ + + /* +* Update next so that we won't pick a task +* on another cpu with a priority lower (or equal) +* than the one we just picked. +*/ + next = p; + } spin_unlock(_rq->lock); } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RT] Fix RT balancing tasks pulling
Hi Ingo, Thomas, this one-liner fixes a bug in balance_rt_tasks() which sometimes manifests by having a lower prio task being scheduled while a higher prio task is sitting waiting on another runqueue. This is pretty hard to reproduce on low cpu count machines, for example, I had to have sched_football run in a loop for ~38h before it failed on a dual HT Xeon box. Sébastien. --- In the CPU loop in balance_rt_tasks(), the 'next' task pointer is only ever updated if this_rq-lock has been dropped in double_lock_balance(). This sometimes lead to 'this_cpu' pulling tasks that are only garanteed to have a higher priority than the 'next' picked before the loop. Fix this to update 'next' to the last picked RT task. Signed-off-by: Sébastien Dugué [EMAIL PROTECTED] --- kernel/sched.c |8 1 file changed, 8 insertions(+) Index: linux-2.6.22.1-rt6/kernel/sched.c === --- linux-2.6.22.1-rt6.orig/kernel/sched.c 2007-07-24 11:34:18.0 +0200 +++ linux-2.6.22.1-rt6/kernel/sched.c 2007-07-24 11:34:35.0 +0200 @@ -1499,6 +1499,14 @@ static void balance_rt_tasks(struct rq * * in another runqueue. (low likelyhood * but possible) */ + + /* +* Update next so that we won't pick a task +* on another cpu with a priority lower (or equal) +* than the one we just picked. +*/ + next = p; + } spin_unlock(src_rq-lock); } - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
commit 7e92b4fc34 - x86, serial: convert legacy COM ports to platform devices - broke my serial console
Hi Bjorn, your commit 7e92b4fc345f5b6f57585fbe5ffdb0f24d7c9b26 broke the serial console on my box. Adding 'legacy_serial.force=1' to my boot param as a workaround solves the issue, but this may be hiding bugs in Linux PnP support or in my firmware. The box is a dual HT Xeon running a vanilla 2.6.22 x86_64 kernel here is my .config: CONFIG_PNP=y CONFIG_SERIAL_8250=y CONFIG_SERIAL_8250_CONSOLE=y CONFIG_SERIAL_8250_PCI=y CONFIG_SERIAL_8250_PNP=y CONFIG_SERIAL_8250_NR_UARTS=4 CONFIG_SERIAL_8250_RUNTIME_UARTS=4 CONFIG_SERIAL_8250_EXTENDED=y CONFIG_SERIAL_8250_SHARE_IRQ=y CONFIG_SERIAL_8250_DETECT_IRQ=y lspci output attached. Any ideas to help me debug this? If you need more info (like DSDT dump), just ask. Thanks, Sébastien. 00:00.0 Host bridge: Intel Corporation E7520 Memory Controller Hub (rev 0c) 00:00.1 Class ff00: Intel Corporation E7525/E7520 Error Reporting Registers (rev 0c) 00:01.0 System peripheral: Intel Corporation E7520 DMA Controller (rev 0c) 00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A (rev 0c) 00:04.0 PCI bridge: Intel Corporation E7525/E7520 PCI Express Port B (rev 0c) 00:05.0 PCI bridge: Intel Corporation E7520 PCI Express Port B1 (rev 0c) 00:06.0 PCI bridge: Intel Corporation E7520 PCI Express Port C (rev 0c) 00:07.0 PCI bridge: Intel Corporation E7520 PCI Express Port C1 (rev 0c) 00:1c.0 PCI bridge: Intel Corporation 6300ESB 64-bit PCI-X Bridge (rev 02) 00:1d.0 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02) 00:1d.1 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02) 00:1d.4 System peripheral: Intel Corporation 6300ESB Watchdog Timer (rev 02) 00:1d.5 PIC: Intel Corporation 6300ESB I/O Advanced Programmable Interrupt Controller (rev 02) 00:1d.7 USB Controller: Intel Corporation 6300ESB USB2 Enhanced Host Controller (rev 02) 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 0a) 00:1f.0 ISA bridge: Intel Corporation 6300ESB LPC Interface Controller (rev 02) 00:1f.1 IDE interface: Intel Corporation 6300ESB PATA Storage Controller (rev 02) 00:1f.3 SMBus: Intel Corporation 6300ESB SMBus Controller (rev 02) 01:00.0 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A (rev 09) 01:00.2 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B (rev 09) 02:01.0 Ethernet controller: Intel Corporation 82544EI Gigabit Ethernet Controller (Copper) (rev 02) 02:03.0 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10) 02:03.1 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10) 03:01.0 Ethernet controller: Intel Corporation 82544EI Gigabit Ethernet Controller (Copper) (rev 02) 08:01.0 VGA compatible controller: ATI Technologies Inc Radeon RV100 QY [Radeon 7000/VE] 08:02.0 Ethernet controller: Intel Corporation 82541GI Gigabit Ethernet Controller 08:03.0 Ethernet controller: Intel Corporation 82541GI Gigabit Ethernet Controller
Re: [Patch RT] Fix CFS load balancing for RT tasks
Hi Josh, On Thu, 12 Jul 2007 10:13:57 -0700 Josh Triplett <[EMAIL PROTECTED]> wrote: > On Thu, 2007-07-12 at 09:41 -0700, Josh Triplett wrote: > > On Wed, 2007-07-11 at 16:47 +0200, Sébastien Dugué wrote: > > > there seems to be something wrong with the way the CFS balances (or > > > does not > > > balance) RT tasks. This was evidenced using the sched_football testcase > > > available from the RT wiki > > > (http://rt.wiki.kernel.org/index.php/IBM_Test_Cases) > > > which I modified and attached to this mail. > > > > > > The testcase starts a number of threads which fall into 3 categories: > > > > > > 1 referee thread: SCHED_FIFO, RT prio 5 > > > ncpus defensive threads: SCHED_FIFO, RT prio 4 > > > ncpus offensive threads: SCHED_FIFO, RT prio 3 > > > > > > (ncpus being the number of CPUs) > > > > > > To make a long story short, the defensive threads should end up > > > distributed > > > among all CPUs, but that's not the case. For example, on a dual HT Xeon > > > box, > > > after task migration stabilizes we have the following running on the > > > different > > > CPUs: > > > > > > CPU 0: defense2 > > > CPU 1: referee offense2 offense3 offense4 defense3 > > > CPU 2: offense1 > > > CPU 3: defense1 defense4 > > > > > > which clearly show the imbalance between CPU 2 and CPU 3 where offense1 > > > should not be allowed to run while the higher prio defense1 and defense4 > > > are sharing the same CPU. > > > > > > The following patch fixes this by re-enabling the RT overload detection > > > for the CFS. It may not be the right solution, maybe it should be > > > incorporated > > > into the other load balancing mechanisms. I did not digg deep enough yet > > > to make that call ;-) > > > > 2.6.21.5-rt20 plus this patch passed 1000 runs of the standard > > sched_football on an 8 processor (quad dual-core) x86-64 box. Nice > > work. > > Hmmm, seems I spoke a bit too soon; due to a bug in the test log > checker, the test failed but the log checker said PASS. Actual results: > $ grep 'Final ball position' rt-tests.log | sort | uniq -c > 960 Final ball position: 0 > 39 Final ball position: 1 > 1 Final ball position: 2 > > So it failed 4% of the runs. However, it failed much less > spectacularly; rather than overrunning the integer maximum, it only > reached 1-2. Still a huge improvement despite not solving the problem > completely. Yeah, I had a 5000 iterations test run last night which failed with one of the runs having a final ball position of 1. So it's not yet perfect. I think we're hitting the same bug as with the O(1) scheduler in 2.6.21-rt8. It seems that the more CPUs there are, the easier it is to reproduce. Hopefully I should have access to a quad dual core box soon now to verify this. One more thing, the bug seems even harder to reproduce if I have a logdev LD_MARK (direct marker, no kprobe) in context_switch(). Ingo, Thomas, any ideas, opinions? It looks like there might be a race in the RT balancing code that prevents pulling the highest prio task and instead selects a lower one to run. In the mean time I'll continue poring over the 2.6.21-rt8 scheduler to try to find something. Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Patch RT] Fix CFS load balancing for RT tasks
Hi Josh, On Thu, 12 Jul 2007 09:41:55 -0700 Josh Triplett <[EMAIL PROTECTED]> wrote: > On Wed, 2007-07-11 at 16:47 +0200, Sébastien Dugué wrote: > > there seems to be something wrong with the way the CFS balances (or does > > not > > balance) RT tasks. This was evidenced using the sched_football testcase > > available from the RT wiki > > (http://rt.wiki.kernel.org/index.php/IBM_Test_Cases) > > which I modified and attached to this mail. > > > > The testcase starts a number of threads which fall into 3 categories: > > > > 1 referee thread: SCHED_FIFO, RT prio 5 > > ncpus defensive threads: SCHED_FIFO, RT prio 4 > > ncpus offensive threads: SCHED_FIFO, RT prio 3 > > > > (ncpus being the number of CPUs) > > > > To make a long story short, the defensive threads should end up > > distributed > > among all CPUs, but that's not the case. For example, on a dual HT Xeon box, > > after task migration stabilizes we have the following running on the > > different > > CPUs: > > > > CPU 0: defense2 > > CPU 1: referee offense2 offense3 offense4 defense3 > > CPU 2: offense1 > > CPU 3: defense1 defense4 > > > > which clearly show the imbalance between CPU 2 and CPU 3 where offense1 > > should not be allowed to run while the higher prio defense1 and defense4 > > are sharing the same CPU. > > > > The following patch fixes this by re-enabling the RT overload detection > > for the CFS. It may not be the right solution, maybe it should be > > incorporated > > into the other load balancing mechanisms. I did not digg deep enough yet > > to make that call ;-) > > 2.6.21.5-rt20 plus this patch passed 1000 runs of the standard > sched_football on an 8 processor (quad dual-core) x86-64 box. Nice > work. Thanks, but a nightly run revealed that there is still a bug. At least now CFS is on par with O(1). > > > The RT overload mechanism of the O(1) scheduler has not been activated > > in the new CFS. > > > > This patch fixes that by inserting calls to inc_rt_tasks() and > > dec_rt_tasks() > > in enqueue_task_rt() and dequeue_task_rt() respectively, which enables the > > balance_rt_tasks() to be run in the rt_overload case. > > > > > > Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]> > > Acked-by: Josh Triplett <[EMAIL PROTECTED]> > > - Josh Triplett Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Patch RT] Fix CFS load balancing for RT tasks
Hi Josh, On Thu, 12 Jul 2007 09:41:55 -0700 Josh Triplett [EMAIL PROTECTED] wrote: On Wed, 2007-07-11 at 16:47 +0200, Sébastien Dugué wrote: there seems to be something wrong with the way the CFS balances (or does not balance) RT tasks. This was evidenced using the sched_football testcase available from the RT wiki (http://rt.wiki.kernel.org/index.php/IBM_Test_Cases) which I modified and attached to this mail. The testcase starts a number of threads which fall into 3 categories: 1 referee thread: SCHED_FIFO, RT prio 5 ncpus defensive threads: SCHED_FIFO, RT prio 4 ncpus offensive threads: SCHED_FIFO, RT prio 3 (ncpus being the number of CPUs) To make a long story short, the defensive threads should end up distributed among all CPUs, but that's not the case. For example, on a dual HT Xeon box, after task migration stabilizes we have the following running on the different CPUs: CPU 0: defense2 CPU 1: referee offense2 offense3 offense4 defense3 CPU 2: offense1 CPU 3: defense1 defense4 which clearly show the imbalance between CPU 2 and CPU 3 where offense1 should not be allowed to run while the higher prio defense1 and defense4 are sharing the same CPU. The following patch fixes this by re-enabling the RT overload detection for the CFS. It may not be the right solution, maybe it should be incorporated into the other load balancing mechanisms. I did not digg deep enough yet to make that call ;-) 2.6.21.5-rt20 plus this patch passed 1000 runs of the standard sched_football on an 8 processor (quad dual-core) x86-64 box. Nice work. Thanks, but a nightly run revealed that there is still a bug. At least now CFS is on par with O(1). The RT overload mechanism of the O(1) scheduler has not been activated in the new CFS. This patch fixes that by inserting calls to inc_rt_tasks() and dec_rt_tasks() in enqueue_task_rt() and dequeue_task_rt() respectively, which enables the balance_rt_tasks() to be run in the rt_overload case. Signed-off-by: Sébastien Dugué [EMAIL PROTECTED] Acked-by: Josh Triplett [EMAIL PROTECTED] - Josh Triplett Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Patch RT] Fix CFS load balancing for RT tasks
Hi Josh, On Thu, 12 Jul 2007 10:13:57 -0700 Josh Triplett [EMAIL PROTECTED] wrote: On Thu, 2007-07-12 at 09:41 -0700, Josh Triplett wrote: On Wed, 2007-07-11 at 16:47 +0200, Sébastien Dugué wrote: there seems to be something wrong with the way the CFS balances (or does not balance) RT tasks. This was evidenced using the sched_football testcase available from the RT wiki (http://rt.wiki.kernel.org/index.php/IBM_Test_Cases) which I modified and attached to this mail. The testcase starts a number of threads which fall into 3 categories: 1 referee thread: SCHED_FIFO, RT prio 5 ncpus defensive threads: SCHED_FIFO, RT prio 4 ncpus offensive threads: SCHED_FIFO, RT prio 3 (ncpus being the number of CPUs) To make a long story short, the defensive threads should end up distributed among all CPUs, but that's not the case. For example, on a dual HT Xeon box, after task migration stabilizes we have the following running on the different CPUs: CPU 0: defense2 CPU 1: referee offense2 offense3 offense4 defense3 CPU 2: offense1 CPU 3: defense1 defense4 which clearly show the imbalance between CPU 2 and CPU 3 where offense1 should not be allowed to run while the higher prio defense1 and defense4 are sharing the same CPU. The following patch fixes this by re-enabling the RT overload detection for the CFS. It may not be the right solution, maybe it should be incorporated into the other load balancing mechanisms. I did not digg deep enough yet to make that call ;-) 2.6.21.5-rt20 plus this patch passed 1000 runs of the standard sched_football on an 8 processor (quad dual-core) x86-64 box. Nice work. Hmmm, seems I spoke a bit too soon; due to a bug in the test log checker, the test failed but the log checker said PASS. Actual results: $ grep 'Final ball position' rt-tests.log | sort | uniq -c 960 Final ball position: 0 39 Final ball position: 1 1 Final ball position: 2 So it failed 4% of the runs. However, it failed much less spectacularly; rather than overrunning the integer maximum, it only reached 1-2. Still a huge improvement despite not solving the problem completely. Yeah, I had a 5000 iterations test run last night which failed with one of the runs having a final ball position of 1. So it's not yet perfect. I think we're hitting the same bug as with the O(1) scheduler in 2.6.21-rt8. It seems that the more CPUs there are, the easier it is to reproduce. Hopefully I should have access to a quad dual core box soon now to verify this. One more thing, the bug seems even harder to reproduce if I have a logdev LD_MARK (direct marker, no kprobe) in context_switch(). Ingo, Thomas, any ideas, opinions? It looks like there might be a race in the RT balancing code that prevents pulling the highest prio task and instead selects a lower one to run. In the mean time I'll continue poring over the 2.6.21-rt8 scheduler to try to find something. Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Patch RT] Fix CFS load balancing for RT tasks
On Wed, 11 Jul 2007 13:41:55 -0400 (EDT) Steven Rostedt <[EMAIL PROTECTED]> wrote: > > On Wed, 2007-07-11 at 16:47 +0200, Sébastien Dugué wrote: > > > > > > P.S. Thanks to Steven Rostedt for logdev which is proving invaluable in > > >cases like this. > > > > > Great to hear!!! > > I'll be uploading a new version of Logdev in a week or so. > > Thanks for the feedback. > > -- Steve > Great! It's quite a nice and lightweight tool which I think will become one of my top tools for kernel debugging. Just one suggestion, maybe, would be to add to the readme file the order in which the patches should apply along with a short description of what functionality they provide (or maybe document the patches and add a series file). Anyway, thanks for that great tool. Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Patch RT] Fix CFS load balancing for RT tasks
On Wed, 11 Jul 2007 19:06:20 +0200 Thomas Gleixner <[EMAIL PROTECTED]> wrote: > On Wed, 2007-07-11 at 16:47 +0200, Sébastien Dugué wrote: > > The following patch fixes this by re-enabling the RT overload detection > > for the CFS. It may not be the right solution, maybe it should be > > incorporated > > into the other load balancing mechanisms. I did not digg deep enough yet > > to make that call ;-) > > > > P.S. Thanks to Steven Rostedt for logdev which is proving invaluable in > >cases like this. > > > > Sébastien. > > Nice catch. That was dropped during the CFS -> -rt merge. > > tglx Thanks, and sorry, I forgot to CC you :( Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Patch RT] Fix CFS load balancing for RT tasks
On Wed, 11 Jul 2007 19:06:20 +0200 Thomas Gleixner [EMAIL PROTECTED] wrote: On Wed, 2007-07-11 at 16:47 +0200, Sébastien Dugué wrote: The following patch fixes this by re-enabling the RT overload detection for the CFS. It may not be the right solution, maybe it should be incorporated into the other load balancing mechanisms. I did not digg deep enough yet to make that call ;-) P.S. Thanks to Steven Rostedt for logdev which is proving invaluable in cases like this. Sébastien. Nice catch. That was dropped during the CFS - -rt merge. tglx Thanks, and sorry, I forgot to CC you :( Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Patch RT] Fix CFS load balancing for RT tasks
On Wed, 11 Jul 2007 13:41:55 -0400 (EDT) Steven Rostedt [EMAIL PROTECTED] wrote: On Wed, 2007-07-11 at 16:47 +0200, Sébastien Dugué wrote: P.S. Thanks to Steven Rostedt for logdev which is proving invaluable in cases like this. Great to hear!!! I'll be uploading a new version of Logdev in a week or so. Thanks for the feedback. -- Steve Great! It's quite a nice and lightweight tool which I think will become one of my top tools for kernel debugging. Just one suggestion, maybe, would be to add to the readme file the order in which the patches should apply along with a short description of what functionality they provide (or maybe document the patches and add a series file). Anyway, thanks for that great tool. Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[Patch RT] Fix CFS load balancing for RT tasks
Hi Ingo, all there seems to be something wrong with the way the CFS balances (or does not balance) RT tasks. This was evidenced using the sched_football testcase available from the RT wiki (http://rt.wiki.kernel.org/index.php/IBM_Test_Cases) which I modified and attached to this mail. The testcase starts a number of threads which fall into 3 categories: 1 referee thread: SCHED_FIFO, RT prio 5 ncpus defensive threads: SCHED_FIFO, RT prio 4 ncpus offensive threads: SCHED_FIFO, RT prio 3 (ncpus being the number of CPUs) To make a long story short, the defensive threads should end up distributed among all CPUs, but that's not the case. For example, on a dual HT Xeon box, after task migration stabilizes we have the following running on the different CPUs: CPU 0: defense2 CPU 1: referee offense2 offense3 offense4 defense3 CPU 2: offense1 CPU 3: defense1 defense4 which clearly show the imbalance between CPU 2 and CPU 3 where offense1 should not be allowed to run while the higher prio defense1 and defense4 are sharing the same CPU. The following patch fixes this by re-enabling the RT overload detection for the CFS. It may not be the right solution, maybe it should be incorporated into the other load balancing mechanisms. I did not digg deep enough yet to make that call ;-) P.S. Thanks to Steven Rostedt for logdev which is proving invaluable in cases like this. Sébastien. -- The RT overload mechanism of the O(1) scheduler has not been activated in the new CFS. This patch fixes that by inserting calls to inc_rt_tasks() and dec_rt_tasks() in enqueue_task_rt() and dequeue_task_rt() respectively, which enables the balance_rt_tasks() to be run in the rt_overload case. Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]> --- kernel/sched_rt.c |4 1 file changed, 4 insertions(+) Index: linux-2.6.21.5-rt20/kernel/sched_rt.c === --- linux-2.6.21.5-rt20.orig/kernel/sched_rt.c 2007-07-11 10:46:26.0 +0200 +++ linux-2.6.21.5-rt20/kernel/sched_rt.c 2007-07-11 10:46:50.0 +0200 @@ -32,6 +32,8 @@ enqueue_task_rt(struct rq *rq, struct ta list_add_tail(>run_list, array->queue + p->prio); __set_bit(p->prio, array->bitmap); + + inc_rt_tasks(p, rq); } /* @@ -44,6 +46,8 @@ dequeue_task_rt(struct rq *rq, struct ta update_curr_rt(rq, now); + dec_rt_tasks(p, rq); + list_del(>run_list); if (list_empty(array->queue + p->prio)) __clear_bit(p->prio, array->bitmap); /* Threaded Football - by John Stultz <[EMAIL PROTECTED]> * * This is a scheduler test that uses a football analogy. * The premise is that we want to make sure that lower priority threads * (the offensive team) do not preempt higher priority threads (the * defensive team). The offense is trying to increment the balls position, * while the defense is trying to block that from happening. * And the ref (highest priority thread) will blow the wistle if the * ball moves. Finally, we have crazy fans (higer prority) that try to * distract the defense by occasionally running onto the field. * * Steps: * - Create a fixed number of offense threads (lower priority) * - Create a fixed number of defense threads (higher priority) * - Create a referee thread (highest priority) * - Once everyone is on the field, the offense thread increments the value of * 'the_ball' and yields. The defense thread tries to block the ball by never * letting the offense players get the CPU (it just does a sched_yield) * - The refree threads wakes up regularly to check if the game is over :) * - In the end, if the value of 'the_ball' is >0, the test is considered to * have failed. * * PS: I really don't like football that much, it just seemed to fit. * * 2006-03-16 Reduced verbosity, non binary failure reporting, removal of * crazy_fans thread, added game_length argument by Darren Hart. */ #include #include #include #include #include #include #include #include #include #include #include #define gettid() syscall(__NR_gettid) #define MAXTHREADS 256 #define DEF_GAME_LENGTH 5 /* Here's the position of the ball */ volatile int the_ball; /* Game status */ volatile int game_started; volatile int game_over; /* Keep track of who's on the field */ volatile int offense_count; volatile int defense_count; /* simple mutex for our atomic increments */ pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; #define min(x,y) (x < y ? x: y) /* This is the defensive team. They're trying to block the offense */ void *thread_defense(void* arg) { pthread_mutex_lock(); defense_count++; pthread_mutex_unlock(); printf("thread_defense (%ld) starting\n", gettid()); /*keep the ball from being moved */ while (!game_over) { sched_yield(); /* let other defenders
[Patch RT] Fix CFS load balancing for RT tasks
Hi Ingo, all there seems to be something wrong with the way the CFS balances (or does not balance) RT tasks. This was evidenced using the sched_football testcase available from the RT wiki (http://rt.wiki.kernel.org/index.php/IBM_Test_Cases) which I modified and attached to this mail. The testcase starts a number of threads which fall into 3 categories: 1 referee thread: SCHED_FIFO, RT prio 5 ncpus defensive threads: SCHED_FIFO, RT prio 4 ncpus offensive threads: SCHED_FIFO, RT prio 3 (ncpus being the number of CPUs) To make a long story short, the defensive threads should end up distributed among all CPUs, but that's not the case. For example, on a dual HT Xeon box, after task migration stabilizes we have the following running on the different CPUs: CPU 0: defense2 CPU 1: referee offense2 offense3 offense4 defense3 CPU 2: offense1 CPU 3: defense1 defense4 which clearly show the imbalance between CPU 2 and CPU 3 where offense1 should not be allowed to run while the higher prio defense1 and defense4 are sharing the same CPU. The following patch fixes this by re-enabling the RT overload detection for the CFS. It may not be the right solution, maybe it should be incorporated into the other load balancing mechanisms. I did not digg deep enough yet to make that call ;-) P.S. Thanks to Steven Rostedt for logdev which is proving invaluable in cases like this. Sébastien. -- The RT overload mechanism of the O(1) scheduler has not been activated in the new CFS. This patch fixes that by inserting calls to inc_rt_tasks() and dec_rt_tasks() in enqueue_task_rt() and dequeue_task_rt() respectively, which enables the balance_rt_tasks() to be run in the rt_overload case. Signed-off-by: Sébastien Dugué [EMAIL PROTECTED] --- kernel/sched_rt.c |4 1 file changed, 4 insertions(+) Index: linux-2.6.21.5-rt20/kernel/sched_rt.c === --- linux-2.6.21.5-rt20.orig/kernel/sched_rt.c 2007-07-11 10:46:26.0 +0200 +++ linux-2.6.21.5-rt20/kernel/sched_rt.c 2007-07-11 10:46:50.0 +0200 @@ -32,6 +32,8 @@ enqueue_task_rt(struct rq *rq, struct ta list_add_tail(p-run_list, array-queue + p-prio); __set_bit(p-prio, array-bitmap); + + inc_rt_tasks(p, rq); } /* @@ -44,6 +46,8 @@ dequeue_task_rt(struct rq *rq, struct ta update_curr_rt(rq, now); + dec_rt_tasks(p, rq); + list_del(p-run_list); if (list_empty(array-queue + p-prio)) __clear_bit(p-prio, array-bitmap); /* Threaded Football - by John Stultz [EMAIL PROTECTED] * * This is a scheduler test that uses a football analogy. * The premise is that we want to make sure that lower priority threads * (the offensive team) do not preempt higher priority threads (the * defensive team). The offense is trying to increment the balls position, * while the defense is trying to block that from happening. * And the ref (highest priority thread) will blow the wistle if the * ball moves. Finally, we have crazy fans (higer prority) that try to * distract the defense by occasionally running onto the field. * * Steps: * - Create a fixed number of offense threads (lower priority) * - Create a fixed number of defense threads (higher priority) * - Create a referee thread (highest priority) * - Once everyone is on the field, the offense thread increments the value of * 'the_ball' and yields. The defense thread tries to block the ball by never * letting the offense players get the CPU (it just does a sched_yield) * - The refree threads wakes up regularly to check if the game is over :) * - In the end, if the value of 'the_ball' is 0, the test is considered to * have failed. * * PS: I really don't like football that much, it just seemed to fit. * * 2006-03-16 Reduced verbosity, non binary failure reporting, removal of * crazy_fans thread, added game_length argument by Darren Hart. */ #include stdio.h #include stdlib.h #include signal.h #include time.h #include string.h #include pthread.h #include sched.h #include errno.h #include sys/syscall.h #include unistd.h #include sys/time.h #define gettid() syscall(__NR_gettid) #define MAXTHREADS 256 #define DEF_GAME_LENGTH 5 /* Here's the position of the ball */ volatile int the_ball; /* Game status */ volatile int game_started; volatile int game_over; /* Keep track of who's on the field */ volatile int offense_count; volatile int defense_count; /* simple mutex for our atomic increments */ pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; #define min(x,y) (x y ? x: y) /* This is the defensive team. They're trying to block the offense */ void *thread_defense(void* arg) { pthread_mutex_lock(mutex); defense_count++; pthread_mutex_unlock(mutex); printf(thread_defense (%ld) starting\n, gettid()); /*keep the ball from being moved */ while (!game_over
Re: [x86 setup 22/33] CPU features verification for the new x86 setup code
Hi Peter, Only two sub-minor nits: On Mon, 9 Jul 2007 19:52:01 -0700 "H. Peter Anvin" <[EMAIL PROTECTED]> wrote: > From: H. Peter Anvin <[EMAIL PROTECTED]> > > Verify that the CPU has enough features to run the kernel. This may > entail enabling features on some CPUs. > > By doing this in the setup code we can be guaranteed to still be able to > write to the console through the BIOS. > > Signed-off-by: H. Peter Anvin <[EMAIL PROTECTED]> > --- > arch/i386/boot/cpu.c | 69 > arch/i386/boot/cpucheck.c | 267 > + > 2 files changed, 336 insertions(+), 0 deletions(-) > create mode 100644 arch/i386/boot/cpu.c > create mode 100644 arch/i386/boot/cpucheck.c > > diff --git a/arch/i386/boot/cpu.c b/arch/i386/boot/cpu.c > new file mode 100644 > index 000..042d894 > --- /dev/null > +++ b/arch/i386/boot/cpu.c > @@ -0,0 +1,69 @@ > +/* -*- linux-c -*- --- * > + * > + * Copyright (C) 1991, 1992 Linus Torvalds > + * Copyright 2007 rPath, Inc. - All Rights Reserved > + * > + * This file is part of the Linux kernel, and is made available under > + * the terms of the GNU General Public License version 2. > + * > + * --- */ > + > +/* > + * arch/i386/boot/cpucheck.c ^ cpu.c ? > + * > + * Check for obligatory CPU features and abort if the features are not > + * present. > + */ > + > +#include "boot.h" > +#include "bitops.h" > +#include > + > +static char *cpu_name(int level) > +{ > + static char buf[6]; > + > + if (level == 64) { > + return "x86-64"; > + } else { > + sprintf(buf, "i%d86", level); > + return buf; > + } > +} > + > +int validate_cpu(void) > +{ > + u32 *err_flags; > + int cpu_level, req_level; > + > + check_cpu(_level, _level, _flags); > + > + if (cpu_level < req_level) { > + printf("This kernel requires an %s CPU, ", > +cpu_name(req_level)); > + printf("but only detected an %s CPU.\n", > +cpu_name(cpu_level)); > + return -1; > + } > + > + if (err_flags) { > + int i, j; > + puts("This kernel requires the following features " > + "not present on the CPU:\n"); > + > + for (i = 0; i < NCAPINTS; i++) { > + u32 e = err_flags[i]; > + > + for (j = 0; j < 32; j++) { > + if (e & 1) > + printf("%d:%d ", i, j); > + > + e >>= 1; > + } > + } > + putchar('\n'); > + return -1; > + } else { > + return 0; > + } > +} > diff --git a/arch/i386/boot/cpucheck.c b/arch/i386/boot/cpucheck.c > new file mode 100644 > index 000..431bef8 > --- /dev/null > +++ b/arch/i386/boot/cpucheck.c > @@ -0,0 +1,267 @@ > +/* -*- linux-c -*- --- * > + * > + * Copyright (C) 1991, 1992 Linus Torvalds > + * Copyright 2007 rPath, Inc. - All Rights Reserved > + * > + * This file is part of the Linux kernel, and is made available under > + * the terms of the GNU General Public License version 2. > + * > + * --- */ > + > +/* > + * arch/i386/boot/cpu.c ^ cpucheck.c ? > + * > + * Check for obligatory CPU features and abort if the features are not > + * present. This code should be compilable as 16-, 32- or 64-bit > + * code, so be very careful with types and inline assembly. > + * > + * This code should not contain any messages; that requires an > + * additional wrapper. > + * Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [x86 setup 22/33] CPU features verification for the new x86 setup code
Hi Peter, Only two sub-minor nits: On Mon, 9 Jul 2007 19:52:01 -0700 H. Peter Anvin [EMAIL PROTECTED] wrote: From: H. Peter Anvin [EMAIL PROTECTED] Verify that the CPU has enough features to run the kernel. This may entail enabling features on some CPUs. By doing this in the setup code we can be guaranteed to still be able to write to the console through the BIOS. Signed-off-by: H. Peter Anvin [EMAIL PROTECTED] --- arch/i386/boot/cpu.c | 69 arch/i386/boot/cpucheck.c | 267 + 2 files changed, 336 insertions(+), 0 deletions(-) create mode 100644 arch/i386/boot/cpu.c create mode 100644 arch/i386/boot/cpucheck.c diff --git a/arch/i386/boot/cpu.c b/arch/i386/boot/cpu.c new file mode 100644 index 000..042d894 --- /dev/null +++ b/arch/i386/boot/cpu.c @@ -0,0 +1,69 @@ +/* -*- linux-c -*- --- * + * + * Copyright (C) 1991, 1992 Linus Torvalds + * Copyright 2007 rPath, Inc. - All Rights Reserved + * + * This file is part of the Linux kernel, and is made available under + * the terms of the GNU General Public License version 2. + * + * --- */ + +/* + * arch/i386/boot/cpucheck.c ^ cpu.c ? + * + * Check for obligatory CPU features and abort if the features are not + * present. + */ + +#include boot.h +#include bitops.h +#include asm/cpufeature.h + +static char *cpu_name(int level) +{ + static char buf[6]; + + if (level == 64) { + return x86-64; + } else { + sprintf(buf, i%d86, level); + return buf; + } +} + +int validate_cpu(void) +{ + u32 *err_flags; + int cpu_level, req_level; + + check_cpu(cpu_level, req_level, err_flags); + + if (cpu_level req_level) { + printf(This kernel requires an %s CPU, , +cpu_name(req_level)); + printf(but only detected an %s CPU.\n, +cpu_name(cpu_level)); + return -1; + } + + if (err_flags) { + int i, j; + puts(This kernel requires the following features + not present on the CPU:\n); + + for (i = 0; i NCAPINTS; i++) { + u32 e = err_flags[i]; + + for (j = 0; j 32; j++) { + if (e 1) + printf(%d:%d , i, j); + + e = 1; + } + } + putchar('\n'); + return -1; + } else { + return 0; + } +} diff --git a/arch/i386/boot/cpucheck.c b/arch/i386/boot/cpucheck.c new file mode 100644 index 000..431bef8 --- /dev/null +++ b/arch/i386/boot/cpucheck.c @@ -0,0 +1,267 @@ +/* -*- linux-c -*- --- * + * + * Copyright (C) 1991, 1992 Linus Torvalds + * Copyright 2007 rPath, Inc. - All Rights Reserved + * + * This file is part of the Linux kernel, and is made available under + * the terms of the GNU General Public License version 2. + * + * --- */ + +/* + * arch/i386/boot/cpu.c ^ cpucheck.c ? + * + * Check for obligatory CPU features and abort if the features are not + * present. This code should be compilable as 16-, 32- or 64-bit + * code, so be very careful with types and inline assembly. + * + * This code should not contain any messages; that requires an + * additional wrapper. + * Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: New systems: eu.kernel.org
On Wed, 04 Jul 2007 12:51:16 -0700 "H. Peter Anvin" <[EMAIL PROTECTED]> wrote: > Sébastien Dugué wrote: > > > > One last thing, is there a git server as well for eu? > > > > FWIW, I've set up git.eu.kernel.org; right now it's an alias to > git.kernel.org, but that might change in the future. OK, thanks, will update my pointers. Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: New systems: eu.kernel.org
On Wed, 04 Jul 2007 12:51:16 -0700 H. Peter Anvin [EMAIL PROTECTED] wrote: Sébastien Dugué wrote: One last thing, is there a git server as well for eu? FWIW, I've set up git.eu.kernel.org; right now it's an alias to git.kernel.org, but that might change in the future. OK, thanks, will update my pointers. Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: New systems: eu.kernel.org
On Tue, 03 Jul 2007 16:16:01 -0700 "H. Peter Anvin" <[EMAIL PROTECTED]> wrote: > Hi all, > > Thanks to HP, ISC and UmeÃ¥ Universitet, kernel.org now has official > servers in Europe. Specifically, we have www1.eu.kernel.org hosted at > ISC Amsterdam, and www2.eu.kernel.org at UMU (UmeÃ¥, Sweden.) They are > collectively accessible as www.eu.kernel.org. > > This is a full service of kernel.org and not a mirror. In the future, > we hope to add some sort of smart DNS to automatically redirect traffic, > as well as additional services in Europe. > First, thanks a lot to all those involved in this. Just a few things at the top of the main page: - the http, ftp and rsync links still point at kernel.org, as the content is also at eu.kernel.org maybe change these. - idem for the 'prepatch', 'snapshot' and '-mm patch' links - idem for the 'current changesets' 'C' links One last thing, is there a git server as well for eu? Thanks again, Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: New systems: eu.kernel.org
On Tue, 03 Jul 2007 16:16:01 -0700 H. Peter Anvin [EMAIL PROTECTED] wrote: Hi all, Thanks to HP, ISC and UmeÃ¥ Universitet, kernel.org now has official servers in Europe. Specifically, we have www1.eu.kernel.org hosted at ISC Amsterdam, and www2.eu.kernel.org at UMU (UmeÃ¥, Sweden.) They are collectively accessible as www.eu.kernel.org. This is a full service of kernel.org and not a mirror. In the future, we hope to add some sort of smart DNS to automatically redirect traffic, as well as additional services in Europe. First, thanks a lot to all those involved in this. Just a few things at the top of the main page: - the http, ftp and rsync links still point at kernel.org, as the content is also at eu.kernel.org maybe change these. - idem for the 'prepatch', 'snapshot' and '-mm patch' links - idem for the 'current changesets' 'C' links One last thing, is there a git server as well for eu? Thanks again, Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch-mm 19/23] x86_64: Convert to cleckevents
Hi Thomas, On Sun, 10 Jun 2007 09:44:18 - Thomas Gleixner <[EMAIL PROTECTED]> wrote: > Convert x86_64 to the clockevents code. Share code with i386 for > hpet and PIT. > > Build and whitespace fixups from: > Venki Pallipadi <[EMAIL PROTECTED]> > and > Chris Wright <[EMAIL PROTECTED]> It seems that the hpet clocksource's ->vread has been lost in the conversion. Is this on purpose? The patch below fixes it for x86_64 on 2.6.21.3-rt9. If you want it rebased on -mm let me know. Thanks, Sébastien. --- It seems the hpet clocksource's vread method was lost in the x86_64 conversion to clockevent. So here it is. Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]> arch/i386/kernel/hpet.c | 10 ++ 1 file changed, 10 insertions(+) Index: linux-2.6.21.3-rt9/arch/i386/kernel/hpet.c === --- linux-2.6.21.3-rt9.orig/arch/i386/kernel/hpet.c 2007-06-07 15:09:41.0 +0200 +++ linux-2.6.21.3-rt9/arch/i386/kernel/hpet.c 2007-06-07 15:50:50.0 +0200 @@ -313,6 +313,13 @@ static cycle_t notrace read_hpet(void) return (cycle_t)hpet_readl(HPET_COUNTER); } +#ifdef CONFIG_X86_64 +static notrace cycle_t __vsyscall_fn vread_hpet(void) +{ + return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0); +} +#endif + static struct clocksource clocksource_hpet = { .name = "hpet", .rating = 250, @@ -321,6 +328,9 @@ static struct clocksource clocksource_hp .shift = HPET_SHIFT, .flags = CLOCK_SOURCE_IS_CONTINUOUS, .resume = hpet_restart_counter, +#ifdef CONFIG_X86_64 + .vread = vread_hpet, +#endif }; static int hpet_clocksource_register(void) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch-mm 19/23] x86_64: Convert to cleckevents
Hi Thomas, On Sun, 10 Jun 2007 09:44:18 - Thomas Gleixner [EMAIL PROTECTED] wrote: Convert x86_64 to the clockevents code. Share code with i386 for hpet and PIT. Build and whitespace fixups from: Venki Pallipadi [EMAIL PROTECTED] and Chris Wright [EMAIL PROTECTED] It seems that the hpet clocksource's -vread has been lost in the conversion. Is this on purpose? The patch below fixes it for x86_64 on 2.6.21.3-rt9. If you want it rebased on -mm let me know. Thanks, Sébastien. --- It seems the hpet clocksource's vread method was lost in the x86_64 conversion to clockevent. So here it is. Signed-off-by: Sébastien Dugué [EMAIL PROTECTED] arch/i386/kernel/hpet.c | 10 ++ 1 file changed, 10 insertions(+) Index: linux-2.6.21.3-rt9/arch/i386/kernel/hpet.c === --- linux-2.6.21.3-rt9.orig/arch/i386/kernel/hpet.c 2007-06-07 15:09:41.0 +0200 +++ linux-2.6.21.3-rt9/arch/i386/kernel/hpet.c 2007-06-07 15:50:50.0 +0200 @@ -313,6 +313,13 @@ static cycle_t notrace read_hpet(void) return (cycle_t)hpet_readl(HPET_COUNTER); } +#ifdef CONFIG_X86_64 +static notrace cycle_t __vsyscall_fn vread_hpet(void) +{ + return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0); +} +#endif + static struct clocksource clocksource_hpet = { .name = hpet, .rating = 250, @@ -321,6 +328,9 @@ static struct clocksource clocksource_hp .shift = HPET_SHIFT, .flags = CLOCK_SOURCE_IS_CONTINUOUS, .resume = hpet_restart_counter, +#ifdef CONFIG_X86_64 + .vread = vread_hpet, +#endif }; static int hpet_clocksource_register(void) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: checkpatch, a patch checking script.
On Wed, 2 May 2007 16:28:27 +0200 (CEST) Geert Uytterhoeven <[EMAIL PROTECTED]> wrote: > + $warnings += search(qr/__FUNCION__/, ^__FUNCTION__ maybe? > + "Should use C99 __func__ instead of GNU > __FUNCTION__\n"); Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: checkpatch, a patch checking script.
On Wed, 2 May 2007 16:28:27 +0200 (CEST) Geert Uytterhoeven [EMAIL PROTECTED] wrote: + $warnings += search(qr/__FUNCION__/, ^__FUNCTION__ maybe? + Should use C99 __func__ instead of GNU __FUNCTION__\n); Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] i386 GDT cleanups: Rename boot_gdt_table to boot_gdt
Rusty, On Thu, 22 Mar 2007 20:24:10 +1100 Rusty Russell <[EMAIL PROTECTED]> wrote: > On Thu, 2007-03-22 at 09:10 +0100, Sébastien Dugué wrote: > > Why not take on the opportunity to rename boot_gt_table to boot_gtd, > > to avoid the duplicate T(able)? > > That's not a bad idea, but IMHO deserves its own patch. > > I look forward to it! > Rusty. > > Here it is, boot tested on an SMP x86 box. Sébastien. Rename boot_gdt_table to boot_gdt to avoid the duplicate T(able). Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]> --- arch/i386/kernel/head.S |9 - arch/i386/kernel/trampoline.S | 12 ++-- 2 files changed, 10 insertions(+), 11 deletions(-) Index: linux-2.6/arch/i386/kernel/head.S === --- linux-2.6.orig/arch/i386/kernel/head.S 2007-03-22 15:09:02.0 +0100 +++ linux-2.6/arch/i386/kernel/head.S 2007-03-22 16:14:38.0 +0100 @@ -147,8 +147,7 @@ page_pde_offset = (__PAGE_OFFSET >> 20); /* * Non-boot CPU entry point; entered from trampoline.S * We can't lgdt here, because lgdt itself uses a data segment, but - * we know the trampoline has already loaded the boot_gdt_table GDT - * for us. + * we know the trampoline has already loaded the boot_gdt for us. * * If cpu hotplug is not supported then this code can go in init section * which will be freed later @@ -588,7 +587,7 @@ fault_msg: .word 0 # 32 bit align gdt_desc.address boot_gdt_descr: .word __BOOT_DS+7 - .long boot_gdt_table - __PAGE_OFFSET + .long boot_gdt - __PAGE_OFFSET .word 0 # 32-bit align idt_desc.address idt_descr: @@ -602,11 +601,11 @@ ENTRY(early_gdt_descr) .long cpu_gdt_table /* - * The boot_gdt_table must mirror the equivalent in setup.S and is + * The boot_gdt must mirror the equivalent in setup.S and is * used only for booting. */ .align L1_CACHE_BYTES -ENTRY(boot_gdt_table) +ENTRY(boot_gdt) .fill GDT_ENTRY_BOOT_CS,8,0 .quad 0x00cf9a00/* kernel 4GB code at 0x */ .quad 0x00cf9200/* kernel 4GB data at 0x */ Index: linux-2.6/arch/i386/kernel/trampoline.S === --- linux-2.6.orig/arch/i386/kernel/trampoline.S2007-03-22 15:09:02.0 +0100 +++ linux-2.6/arch/i386/kernel/trampoline.S 2007-03-22 16:13:23.0 +0100 @@ -29,7 +29,7 @@ * * TYPE VALUE * R_386_32 startup_32_smp - * R_386_32 boot_gdt_table + * R_386_32 boot_gdt */ #include @@ -62,8 +62,8 @@ r_base = . * to 32 bit. */ - lidtl boot_idt - r_base # load idt with 0, 0 - lgdtl boot_gdt - r_base # load gdt with whatever is appropriate + lidtl boot_idt_descr - r_base # load idt with 0, 0 + lgdtl boot_gdt_descr - r_base # load gdt with whatever is appropriate xor %ax, %ax inc %ax # protected mode (PE) bit @@ -73,11 +73,11 @@ r_base = . # These need to be in the same 64K segment as the above; # hence we don't use the boot_gdt_descr defined in head.S -boot_gdt: +boot_gdt_descr: .word __BOOT_DS + 7 # gdt limit - .long boot_gdt_table-__PAGE_OFFSET# gdt base + .long boot_gdt - __PAGE_OFFSET# gdt base -boot_idt: +boot_idt_descr: .word 0 # idt limit = 0 .long 0 # idt base = 0L - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/4] i386 GDT cleanups: Use per-cpu GDT immediately upon boot
Hi Rusty, On Thu, 22 Mar 2007 10:58:30 +1100 Rusty Russell <[EMAIL PROTECTED]> wrote: > On Wed, 2007-03-21 at 10:51 -0600, Eric W. Biederman wrote: > > Rusty Russell <[EMAIL PROTECTED]> writes: > > > > > On Wed, 2007-03-21 at 03:31 -0600, Eric W. Biederman wrote: > > >> Rusty Russell <[EMAIL PROTECTED]> writes: > > >> > -/* > > >> > - * The boot_gdt_table must mirror the equivalent in setup.S and is > > >> > - * used only for booting. > > >> > - */ > > >> > > >> It looks like you are killing a useful comment here for no good reason. > > > > > > Hi Eric, > > > > > > I think one has to look harder, then. There is no "equivalent in > > > setup.S": there is no setup.S, and it's certainly not clear what GDT > > > this "must mirror": it doesn't mirror any GDT at the moment. > > > > see the gdt in: > > arch/i386/boot/setup.S > > Erk, what a dumb mistake. Apologies for my snarky comment above 8( > > > If anything the comment should read these values are fixed by the boot > > protocol and we can't change them. > > Since lguest doesn't use setup.S, it's outside my experience. I'll just > leave the comment, and try to pretend this never happened 8) > > Thanks muchly, > Rusty. > == > Now we are no longer dynamically allocating the GDT, we don't need the > "cpu_gdt_table" at all: we can switch straight from "boot_gdt_table" > to the per-cpu GDT. This means initializing the cpu_gdt array in C. Why not take on the opportunity to rename boot_gt_table to boot_gtd, to avoid the duplicate T(able)? Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/4] i386 GDT cleanups: Use per-cpu GDT immediately upon boot
Hi Rusty, On Thu, 22 Mar 2007 10:58:30 +1100 Rusty Russell [EMAIL PROTECTED] wrote: On Wed, 2007-03-21 at 10:51 -0600, Eric W. Biederman wrote: Rusty Russell [EMAIL PROTECTED] writes: On Wed, 2007-03-21 at 03:31 -0600, Eric W. Biederman wrote: Rusty Russell [EMAIL PROTECTED] writes: -/* - * The boot_gdt_table must mirror the equivalent in setup.S and is - * used only for booting. - */ It looks like you are killing a useful comment here for no good reason. Hi Eric, I think one has to look harder, then. There is no equivalent in setup.S: there is no setup.S, and it's certainly not clear what GDT this must mirror: it doesn't mirror any GDT at the moment. see the gdt in: arch/i386/boot/setup.S Erk, what a dumb mistake. Apologies for my snarky comment above 8( If anything the comment should read these values are fixed by the boot protocol and we can't change them. Since lguest doesn't use setup.S, it's outside my experience. I'll just leave the comment, and try to pretend this never happened 8) Thanks muchly, Rusty. == Now we are no longer dynamically allocating the GDT, we don't need the cpu_gdt_table at all: we can switch straight from boot_gdt_table to the per-cpu GDT. This means initializing the cpu_gdt array in C. Why not take on the opportunity to rename boot_gt_table to boot_gtd, to avoid the duplicate T(able)? Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] i386 GDT cleanups: Rename boot_gdt_table to boot_gdt
Rusty, On Thu, 22 Mar 2007 20:24:10 +1100 Rusty Russell [EMAIL PROTECTED] wrote: On Thu, 2007-03-22 at 09:10 +0100, Sébastien Dugué wrote: Why not take on the opportunity to rename boot_gt_table to boot_gtd, to avoid the duplicate T(able)? That's not a bad idea, but IMHO deserves its own patch. I look forward to it! Rusty. Here it is, boot tested on an SMP x86 box. Sébastien. Rename boot_gdt_table to boot_gdt to avoid the duplicate T(able). Signed-off-by: Sébastien Dugué [EMAIL PROTECTED] --- arch/i386/kernel/head.S |9 - arch/i386/kernel/trampoline.S | 12 ++-- 2 files changed, 10 insertions(+), 11 deletions(-) Index: linux-2.6/arch/i386/kernel/head.S === --- linux-2.6.orig/arch/i386/kernel/head.S 2007-03-22 15:09:02.0 +0100 +++ linux-2.6/arch/i386/kernel/head.S 2007-03-22 16:14:38.0 +0100 @@ -147,8 +147,7 @@ page_pde_offset = (__PAGE_OFFSET 20); /* * Non-boot CPU entry point; entered from trampoline.S * We can't lgdt here, because lgdt itself uses a data segment, but - * we know the trampoline has already loaded the boot_gdt_table GDT - * for us. + * we know the trampoline has already loaded the boot_gdt for us. * * If cpu hotplug is not supported then this code can go in init section * which will be freed later @@ -588,7 +587,7 @@ fault_msg: .word 0 # 32 bit align gdt_desc.address boot_gdt_descr: .word __BOOT_DS+7 - .long boot_gdt_table - __PAGE_OFFSET + .long boot_gdt - __PAGE_OFFSET .word 0 # 32-bit align idt_desc.address idt_descr: @@ -602,11 +601,11 @@ ENTRY(early_gdt_descr) .long cpu_gdt_table /* - * The boot_gdt_table must mirror the equivalent in setup.S and is + * The boot_gdt must mirror the equivalent in setup.S and is * used only for booting. */ .align L1_CACHE_BYTES -ENTRY(boot_gdt_table) +ENTRY(boot_gdt) .fill GDT_ENTRY_BOOT_CS,8,0 .quad 0x00cf9a00/* kernel 4GB code at 0x */ .quad 0x00cf9200/* kernel 4GB data at 0x */ Index: linux-2.6/arch/i386/kernel/trampoline.S === --- linux-2.6.orig/arch/i386/kernel/trampoline.S2007-03-22 15:09:02.0 +0100 +++ linux-2.6/arch/i386/kernel/trampoline.S 2007-03-22 16:13:23.0 +0100 @@ -29,7 +29,7 @@ * * TYPE VALUE * R_386_32 startup_32_smp - * R_386_32 boot_gdt_table + * R_386_32 boot_gdt */ #include linux/linkage.h @@ -62,8 +62,8 @@ r_base = . * to 32 bit. */ - lidtl boot_idt - r_base # load idt with 0, 0 - lgdtl boot_gdt - r_base # load gdt with whatever is appropriate + lidtl boot_idt_descr - r_base # load idt with 0, 0 + lgdtl boot_gdt_descr - r_base # load gdt with whatever is appropriate xor %ax, %ax inc %ax # protected mode (PE) bit @@ -73,11 +73,11 @@ r_base = . # These need to be in the same 64K segment as the above; # hence we don't use the boot_gdt_descr defined in head.S -boot_gdt: +boot_gdt_descr: .word __BOOT_DS + 7 # gdt limit - .long boot_gdt_table-__PAGE_OFFSET# gdt base + .long boot_gdt - __PAGE_OFFSET# gdt base -boot_idt: +boot_idt_descr: .word 0 # idt limit = 0 .long 0 # idt base = 0L - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.21-rc2-mm2
On Wed, 7 Mar 2007 00:49:19 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote: > On Wed, 7 Mar 2007 09:39:48 +0100 Sébastien Dugué <[EMAIL PROTECTED]> wrote: > > > > > Hi Andrew, > > > > On Tue, 6 Mar 2007 00:44:08 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote: > > > > > > > > Temporarily at > > > > > > http://userweb.kernel.org/~akpm/2.6.21-rc2-mm2/ > > > > > > Will appear later at > > > > > > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc2/2.6.20-rc2-mm2/ > > > > > > > > > - git-block.patch is having problems which are getting in the way - it has > > > been dropped. > > > > > > - As a consequence all the AIO patches were dropped > > > > > > > Why? The aio notification and listio patches have nothing to do with > > git-block.patch. > > Yes, I could have retained those with a little bit of jiggling. But... > > > If you think those patches will be rendered obsolete by the new > > syslet/fibril/whatever approach, then fine. > > Until this is sorted out I don't think we can add new core AIO code. > Particularly not new syscalls. Makes sense. > > > Otherwise do you expect me to > > resubmit? > > Is OK for now, I think. Ok, thanks for the good job. Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.21-rc2-mm2
Hi Andrew, On Tue, 6 Mar 2007 00:44:08 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote: > > Temporarily at > > http://userweb.kernel.org/~akpm/2.6.21-rc2-mm2/ > > Will appear later at > > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc2/2.6.20-rc2-mm2/ > > > - git-block.patch is having problems which are getting in the way - it has > been dropped. > > - As a consequence all the AIO patches were dropped > Why? The aio notification and listio patches have nothing to do with git-block.patch. If you think those patches will be rendered obsolete by the new syslet/fibril/whatever approach, then fine. Otherwise do you expect me to resubmit? Thanks, Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.21-rc2-mm2
Hi Andrew, On Tue, 6 Mar 2007 00:44:08 -0800 Andrew Morton [EMAIL PROTECTED] wrote: Temporarily at http://userweb.kernel.org/~akpm/2.6.21-rc2-mm2/ Will appear later at ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc2/2.6.20-rc2-mm2/ - git-block.patch is having problems which are getting in the way - it has been dropped. - As a consequence all the AIO patches were dropped Why? The aio notification and listio patches have nothing to do with git-block.patch. If you think those patches will be rendered obsolete by the new syslet/fibril/whatever approach, then fine. Otherwise do you expect me to resubmit? Thanks, Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.21-rc2-mm2
On Wed, 7 Mar 2007 00:49:19 -0800 Andrew Morton [EMAIL PROTECTED] wrote: On Wed, 7 Mar 2007 09:39:48 +0100 Sébastien Dugué [EMAIL PROTECTED] wrote: Hi Andrew, On Tue, 6 Mar 2007 00:44:08 -0800 Andrew Morton [EMAIL PROTECTED] wrote: Temporarily at http://userweb.kernel.org/~akpm/2.6.21-rc2-mm2/ Will appear later at ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc2/2.6.20-rc2-mm2/ - git-block.patch is having problems which are getting in the way - it has been dropped. - As a consequence all the AIO patches were dropped Why? The aio notification and listio patches have nothing to do with git-block.patch. Yes, I could have retained those with a little bit of jiggling. But... If you think those patches will be rendered obsolete by the new syslet/fibril/whatever approach, then fine. Until this is sorted out I don't think we can add new core AIO code. Particularly not new syscalls. Makes sense. Otherwise do you expect me to resubmit? Is OK for now, I think. Ok, thanks for the good job. Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: git issues while fetching from http://www.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git/
Hi Laurent, On Tue, 13 Feb 2007 10:29:59 +0100 Laurent Pinchart <[EMAIL PROTECTED]> wrote: > Hi everybody, > > I'm trying to fetch from the Linux kernel repository at > http://www.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git/. After > downloading the objects, git failed while trying to fetch tags. Retrying the > fetch operation gives the following output: > > $ git-fetch origin > Fetching refs/heads/master from > http://www.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git/ using > http > Auto-following refs/tags/v2.6.20 > Auto-following refs/tags/v2.6.20 > Auto-following refs/tags/v2.6.20-rc1 > Auto-following refs/tags/v2.6.20-rc1 > Auto-following refs/tags/v2.6.20-rc2 > Auto-following refs/tags/v2.6.20-rc2 > Auto-following refs/tags/v2.6.20-rc3 > Auto-following refs/tags/v2.6.20-rc3 > Auto-following refs/tags/v2.6.20-rc4 > Auto-following refs/tags/v2.6.20-rc4 > Auto-following refs/tags/v2.6.20-rc5 > Auto-following refs/tags/v2.6.20-rc5 > Auto-following refs/tags/v2.6.20-rc6 > Auto-following refs/tags/v2.6.20-rc6 > Auto-following refs/tags/v2.6.20-rc7 > Auto-following refs/tags/v2.6.20-rc7 > Failed to fetch refs/tags/v2.6.20 from > http://www.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git/ > > While trying to solve the issue, I noticed that the refs/tags directory in > the > remote repository is empty. Is that a normal situation ? Should git be able > to regenerate the tags locally from the objects ? I'm using git 1.4.4.4. I had the same issue. Upgrade git. the refs are now packed in a single file packed-refs. Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: git issues while fetching from http://www.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git/
Hi Laurent, On Tue, 13 Feb 2007 10:29:59 +0100 Laurent Pinchart [EMAIL PROTECTED] wrote: Hi everybody, I'm trying to fetch from the Linux kernel repository at http://www.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git/. After downloading the objects, git failed while trying to fetch tags. Retrying the fetch operation gives the following output: $ git-fetch origin Fetching refs/heads/master from http://www.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git/ using http Auto-following refs/tags/v2.6.20 Auto-following refs/tags/v2.6.20 Auto-following refs/tags/v2.6.20-rc1 Auto-following refs/tags/v2.6.20-rc1 Auto-following refs/tags/v2.6.20-rc2 Auto-following refs/tags/v2.6.20-rc2 Auto-following refs/tags/v2.6.20-rc3 Auto-following refs/tags/v2.6.20-rc3 Auto-following refs/tags/v2.6.20-rc4 Auto-following refs/tags/v2.6.20-rc4 Auto-following refs/tags/v2.6.20-rc5 Auto-following refs/tags/v2.6.20-rc5 Auto-following refs/tags/v2.6.20-rc6 Auto-following refs/tags/v2.6.20-rc6 Auto-following refs/tags/v2.6.20-rc7 Auto-following refs/tags/v2.6.20-rc7 Failed to fetch refs/tags/v2.6.20 from http://www.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git/ While trying to solve the issue, I noticed that the refs/tags directory in the remote repository is empty. Is that a normal situation ? Should git be able to regenerate the tags locally from the objects ? I'm using git 1.4.4.4. I had the same issue. Upgrade git. the refs are now packed in a single file packed-refs. Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: -mm merge plans for 2.6.21
On Fri, 9 Feb 2007 01:05:57 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote: > On Fri, 9 Feb 2007 09:26:11 +0100 Sébastien Dugué <[EMAIL PROTECTED]> wrote: > > > On Fri, 9 Feb 2007 11:02:28 +0530 "Bharata B Rao" <[EMAIL PROTECTED]> wrote: > > > > > Andrew, > > > > > > On 2/9/07, Andrew Morton <[EMAIL PROTECTED]> wrote: > > > > fsaio-add-a-wait-queue-arg-to-the-wait_bit-action-routine.patch > > > > fsaio-add-a-wait-queue-arg-to-the-wait_bit-action-routine-gfs2-fix.patch > > > > fsaio-rename-__lock_page-to-lock_page_blocking.patch > > > > fsaio-interfaces-to-initialize-and-to-test-a-wait-bit-key.patch > > > > fsaio-add-a-default-io-wait-bit-field-in-task-struct.patch > > > > fsaio-enable-wait-bit-based-filtered-wakeups-to-work-for-aio.patch > > > > fsaio-enable-wait-bit-based-filtered-wakeups-to-work-for-aio-fix.patch > > > > fsaio-enable-wait-bit-based-filtered-wakeups-to-work-for-aio-fix-sparse-fix.patch > > > > fsaio-enable-asynchronous-wait-page-and-lock-page.patch > > > > fsaio-filesystem-aio-read.patch > > > > fsaio-aio-o_sync-filesystem-write.patch > > > > > > > > Will wait to see what happens with febrils. > > > > > > > > aio-is-unlikely.patch > > > > > > > > Will merge. > > > > > > > > rework-compat_sys_io_submit.patch > > > > fix-aioh-includes.patch > > > > fix-access_ok-checks.patch > > > > make-good_sigevent-non-static.patch > > > > make-good_sigevent-non-static-fix.patch > > > > make-__sigqueue_free-and.patch > > > > aio-completion-signal-notification.patch > > > > aio-completion-signal-notification-fix.patch > > > > aio-completion-signal-notification-fixes-and-cleanups.patch > > > > aio-completion-signal-notification-small-cleanup.patch > > > > add-listio-syscall-support.patch > > > > > > > > I guess these are dependent upon fsaio. > > > > > > No. AIO completion signal notification patches and listio patch work > > > independently of fsaio patches. Just that for buffered AIO case, the > > > fsaio patches will make the behaviour truly asynchronous. > > > > > > Regards, > > > Bharata. > > > > I concur. > > > > But is the proposed listio implementation still relevant if we go and add > the async syscalls? I think the listio syscall could be emulated in userspace using async syscalls. Even more, the whole POSIX AIO functions could benefit from this, rendering aio.c obsolete. Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: -mm merge plans for 2.6.21
On Fri, 9 Feb 2007 11:02:28 +0530 "Bharata B Rao" <[EMAIL PROTECTED]> wrote: > Andrew, > > On 2/9/07, Andrew Morton <[EMAIL PROTECTED]> wrote: > > fsaio-add-a-wait-queue-arg-to-the-wait_bit-action-routine.patch > > fsaio-add-a-wait-queue-arg-to-the-wait_bit-action-routine-gfs2-fix.patch > > fsaio-rename-__lock_page-to-lock_page_blocking.patch > > fsaio-interfaces-to-initialize-and-to-test-a-wait-bit-key.patch > > fsaio-add-a-default-io-wait-bit-field-in-task-struct.patch > > fsaio-enable-wait-bit-based-filtered-wakeups-to-work-for-aio.patch > > fsaio-enable-wait-bit-based-filtered-wakeups-to-work-for-aio-fix.patch > > fsaio-enable-wait-bit-based-filtered-wakeups-to-work-for-aio-fix-sparse-fix.patch > > fsaio-enable-asynchronous-wait-page-and-lock-page.patch > > fsaio-filesystem-aio-read.patch > > fsaio-aio-o_sync-filesystem-write.patch > > > > Will wait to see what happens with febrils. > > > > aio-is-unlikely.patch > > > > Will merge. > > > > rework-compat_sys_io_submit.patch > > fix-aioh-includes.patch > > fix-access_ok-checks.patch > > make-good_sigevent-non-static.patch > > make-good_sigevent-non-static-fix.patch > > make-__sigqueue_free-and.patch > > aio-completion-signal-notification.patch > > aio-completion-signal-notification-fix.patch > > aio-completion-signal-notification-fixes-and-cleanups.patch > > aio-completion-signal-notification-small-cleanup.patch > > add-listio-syscall-support.patch > > > > I guess these are dependent upon fsaio. > > No. AIO completion signal notification patches and listio patch work > independently of fsaio patches. Just that for buffered AIO case, the > fsaio patches will make the behaviour truly asynchronous. > > Regards, > Bharata. I concur. Thanks, Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: -mm merge plans for 2.6.21
On Fri, 9 Feb 2007 11:02:28 +0530 Bharata B Rao [EMAIL PROTECTED] wrote: Andrew, On 2/9/07, Andrew Morton [EMAIL PROTECTED] wrote: fsaio-add-a-wait-queue-arg-to-the-wait_bit-action-routine.patch fsaio-add-a-wait-queue-arg-to-the-wait_bit-action-routine-gfs2-fix.patch fsaio-rename-__lock_page-to-lock_page_blocking.patch fsaio-interfaces-to-initialize-and-to-test-a-wait-bit-key.patch fsaio-add-a-default-io-wait-bit-field-in-task-struct.patch fsaio-enable-wait-bit-based-filtered-wakeups-to-work-for-aio.patch fsaio-enable-wait-bit-based-filtered-wakeups-to-work-for-aio-fix.patch fsaio-enable-wait-bit-based-filtered-wakeups-to-work-for-aio-fix-sparse-fix.patch fsaio-enable-asynchronous-wait-page-and-lock-page.patch fsaio-filesystem-aio-read.patch fsaio-aio-o_sync-filesystem-write.patch Will wait to see what happens with febrils. aio-is-unlikely.patch Will merge. rework-compat_sys_io_submit.patch fix-aioh-includes.patch fix-access_ok-checks.patch make-good_sigevent-non-static.patch make-good_sigevent-non-static-fix.patch make-__sigqueue_free-and.patch aio-completion-signal-notification.patch aio-completion-signal-notification-fix.patch aio-completion-signal-notification-fixes-and-cleanups.patch aio-completion-signal-notification-small-cleanup.patch add-listio-syscall-support.patch I guess these are dependent upon fsaio. No. AIO completion signal notification patches and listio patch work independently of fsaio patches. Just that for buffered AIO case, the fsaio patches will make the behaviour truly asynchronous. Regards, Bharata. I concur. Thanks, Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: -mm merge plans for 2.6.21
On Fri, 9 Feb 2007 01:05:57 -0800 Andrew Morton [EMAIL PROTECTED] wrote: On Fri, 9 Feb 2007 09:26:11 +0100 Sébastien Dugué [EMAIL PROTECTED] wrote: On Fri, 9 Feb 2007 11:02:28 +0530 Bharata B Rao [EMAIL PROTECTED] wrote: Andrew, On 2/9/07, Andrew Morton [EMAIL PROTECTED] wrote: fsaio-add-a-wait-queue-arg-to-the-wait_bit-action-routine.patch fsaio-add-a-wait-queue-arg-to-the-wait_bit-action-routine-gfs2-fix.patch fsaio-rename-__lock_page-to-lock_page_blocking.patch fsaio-interfaces-to-initialize-and-to-test-a-wait-bit-key.patch fsaio-add-a-default-io-wait-bit-field-in-task-struct.patch fsaio-enable-wait-bit-based-filtered-wakeups-to-work-for-aio.patch fsaio-enable-wait-bit-based-filtered-wakeups-to-work-for-aio-fix.patch fsaio-enable-wait-bit-based-filtered-wakeups-to-work-for-aio-fix-sparse-fix.patch fsaio-enable-asynchronous-wait-page-and-lock-page.patch fsaio-filesystem-aio-read.patch fsaio-aio-o_sync-filesystem-write.patch Will wait to see what happens with febrils. aio-is-unlikely.patch Will merge. rework-compat_sys_io_submit.patch fix-aioh-includes.patch fix-access_ok-checks.patch make-good_sigevent-non-static.patch make-good_sigevent-non-static-fix.patch make-__sigqueue_free-and.patch aio-completion-signal-notification.patch aio-completion-signal-notification-fix.patch aio-completion-signal-notification-fixes-and-cleanups.patch aio-completion-signal-notification-small-cleanup.patch add-listio-syscall-support.patch I guess these are dependent upon fsaio. No. AIO completion signal notification patches and listio patch work independently of fsaio patches. Just that for buffered AIO case, the fsaio patches will make the behaviour truly asynchronous. Regards, Bharata. I concur. But is the proposed listio implementation still relevant if we go and add the async syscalls? I think the listio syscall could be emulated in userspace using async syscalls. Even more, the whole POSIX AIO functions could benefit from this, rendering aio.c obsolete. Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm][AIO] AIO completion signal notification small cleanup
Andrew, one more cleanup to aio_setup_sigevent() to make it more readable. Sébastien. From: Sébastien Dugué <[EMAIL PROTECTED]> AIO completion signal notification small cleanup This patch cleans up aio_setup_sigevent() to make it more readable. aio.c | 15 --- 1 file changed, 4 insertions(+), 11 deletions(-) Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]> Index: linux-2.6.20-rc6-mm3/fs/aio.c === --- linux-2.6.20-rc6-mm3.orig/fs/aio.c 2007-02-06 09:33:55.0 +0100 +++ linux-2.6.20-rc6-mm3/fs/aio.c 2007-02-06 12:43:52.0 +0100 @@ -962,21 +962,11 @@ static long aio_setup_sigevent(struct ai event.sigev_notify != SIGEV_THREAD_ID) return -EINVAL; - notify->notify = event.sigev_notify; - notify->signo = event.sigev_signo; - notify->value = event.sigev_value; - rcu_read_lock(); target = sigevent_find_task(); - if (unlikely(!target)) { - /* -* Revert notify to SIGEV_NONE so that really_put_req() -* knows that no ref has been taken on a task. -*/ - notify->notify = SIGEV_NONE; + if (unlikely(!target)) goto out_unlock; - } /* * At this point, we know that notify is either SIGEV_SIGNAL or @@ -988,6 +978,9 @@ static long aio_setup_sigevent(struct ai notify->target = target; rcu_read_unlock(); + notify->notify = event.sigev_notify; + notify->signo = event.sigev_signo; + notify->value = event.sigev_value; notify->sigq = __sigqueue_alloc(current, GFP_KERNEL, 0); /* - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm][AIO] AIO completion signal notification fixes and cleanups
On Tue, 6 Feb 2007 14:05:39 +0300 Oleg Nesterov <[EMAIL PROTECTED]> wrote: > On 02/06, S?bastien Dugu? wrote: > > > > @@ -970,8 +969,14 @@ static long aio_setup_sigevent(struct ai > > rcu_read_lock(); > > target = sigevent_find_task(); > > > > - if (unlikely(!target)) > > + if (unlikely(!target)) { > > + /* > > +* Revert notify to SIGEV_NONE so that really_put_req() > > +* knows that no ref has been taken on a task. > > +*/ > > + notify->notify = SIGEV_NONE; > > goto out_unlock; > > + } > > Very minor nit, feel free to ignore. > > Isn't it better to move "notify->* = event.*;" down, when we know that > target != NULL. Imho, a bit easier to follow. This way we don't need to > reset notify->notify = SIGEV_NONE. > > aio_setup_sigevent() relies on the fact that ->notify = SIGEV_NONE on > entry anyway. Yep, right, it will make things cleaner. Thanks, Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm][AIO] AIO completion signal notification fixes and cleanups
Andrew, here is another incremental patch which does a bit of cleanup as well as fixing a possible release on a task ref that was not taken. Thanks, Sébastien. From: Sébastien Dugué <[EMAIL PROTECTED]> AIO completion signal notification misc fixes and cleanups This patches cleans up the notification path and fixes a possible release on a task ref that was not taken in aio_setup_sigevent(). aio.c | 15 ++- 1 file changed, 10 insertions(+), 5 deletions(-) Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]> Index: linux-2.6.20-rc6-mm3/fs/aio.c === --- linux-2.6.20-rc6-mm3.orig/fs/aio.c 2007-02-05 16:53:43.0 +0100 +++ linux-2.6.20-rc6-mm3/fs/aio.c 2007-02-06 09:33:55.0 +0100 @@ -469,8 +469,7 @@ static inline void really_put_req(struct kfree(req->ki_iovec); /* Release task ref */ - if (req->ki_notify.notify == SIGEV_THREAD_ID || - req->ki_notify.notify == SIGEV_SIGNAL) + if (req->ki_notify.notify != SIGEV_NONE) put_task_struct(req->ki_notify.target); kmem_cache_free(kiocb_cachep, req); @@ -970,8 +969,14 @@ static long aio_setup_sigevent(struct ai rcu_read_lock(); target = sigevent_find_task(); - if (unlikely(!target)) + if (unlikely(!target)) { + /* +* Revert notify to SIGEV_NONE so that really_put_req() +* knows that no ref has been taken on a task. +*/ + notify->notify = SIGEV_NONE; goto out_unlock; + } /* * At this point, we know that notify is either SIGEV_SIGNAL or @@ -996,7 +1001,7 @@ static long aio_setup_sigevent(struct ai return 0; out_unlock: - read_unlock(_lock); + rcu_read_unlock(); return -EINVAL; } @@ -1763,7 +1768,7 @@ int fastcall io_submit_one(struct kioctx (struct sigevent __user *)(unsigned long) iocb->aio_sigeventp); if (ret) - goto out_put_req; + goto out_sigqfree; } /* Attach this iocb to its lio */ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm][AIO] Fix AIO completion signal notification possible ref leak
On Mon, 5 Feb 2007 20:13:35 +0300 Oleg Nesterov <[EMAIL PROTECTED]> wrote: > On 02/05, S?bastien Dugu? wrote: > > > > Make sure we only accept valid sigev_notify values in > > aio_setup_sigevent(), > > namely SIGEV_NONE, SIGEV_THREAD_ID or SIGEV_SIGNAL. > > I think this is correct, but I have another concern (most probably I just > confused looking at non-applied patch), could you re-check? > > > @@ -959,6 +959,10 @@ static long aio_setup_sigevent(struct ai > > if (event.sigev_notify == SIGEV_NONE) > > return 0; > > > > + if (event.sigev_notify != SIGEV_SIGNAL && > > + event.sigev_notify != SIGEV_THREAD_ID) > > + return -EINVAL; > > + > > notify->notify = event.sigev_notify; > > notify->signo = event.sigev_signo; > > notify->value = event.sigev_value; > > Ok. But what if sigevent_find_task() fails after that? Doesn't this mean > that really_put_req() will do put_task_struct(NULL) ? > Argh, right, a patch to fix that and a couple of other corner cases to follow soon. Thanks Oleg for looking through this. Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm][AIO] Fix AIO completion signal notification possible ref leak
On Mon, 5 Feb 2007 20:13:35 +0300 Oleg Nesterov [EMAIL PROTECTED] wrote: On 02/05, S?bastien Dugu? wrote: Make sure we only accept valid sigev_notify values in aio_setup_sigevent(), namely SIGEV_NONE, SIGEV_THREAD_ID or SIGEV_SIGNAL. I think this is correct, but I have another concern (most probably I just confused looking at non-applied patch), could you re-check? @@ -959,6 +959,10 @@ static long aio_setup_sigevent(struct ai if (event.sigev_notify == SIGEV_NONE) return 0; + if (event.sigev_notify != SIGEV_SIGNAL + event.sigev_notify != SIGEV_THREAD_ID) + return -EINVAL; + notify-notify = event.sigev_notify; notify-signo = event.sigev_signo; notify-value = event.sigev_value; Ok. But what if sigevent_find_task() fails after that? Doesn't this mean that really_put_req() will do put_task_struct(NULL) ? Argh, right, a patch to fix that and a couple of other corner cases to follow soon. Thanks Oleg for looking through this. Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm][AIO] AIO completion signal notification fixes and cleanups
Andrew, here is another incremental patch which does a bit of cleanup as well as fixing a possible release on a task ref that was not taken. Thanks, Sébastien. From: Sébastien Dugué [EMAIL PROTECTED] AIO completion signal notification misc fixes and cleanups This patches cleans up the notification path and fixes a possible release on a task ref that was not taken in aio_setup_sigevent(). aio.c | 15 ++- 1 file changed, 10 insertions(+), 5 deletions(-) Signed-off-by: Sébastien Dugué [EMAIL PROTECTED] Index: linux-2.6.20-rc6-mm3/fs/aio.c === --- linux-2.6.20-rc6-mm3.orig/fs/aio.c 2007-02-05 16:53:43.0 +0100 +++ linux-2.6.20-rc6-mm3/fs/aio.c 2007-02-06 09:33:55.0 +0100 @@ -469,8 +469,7 @@ static inline void really_put_req(struct kfree(req-ki_iovec); /* Release task ref */ - if (req-ki_notify.notify == SIGEV_THREAD_ID || - req-ki_notify.notify == SIGEV_SIGNAL) + if (req-ki_notify.notify != SIGEV_NONE) put_task_struct(req-ki_notify.target); kmem_cache_free(kiocb_cachep, req); @@ -970,8 +969,14 @@ static long aio_setup_sigevent(struct ai rcu_read_lock(); target = sigevent_find_task(event); - if (unlikely(!target)) + if (unlikely(!target)) { + /* +* Revert notify to SIGEV_NONE so that really_put_req() +* knows that no ref has been taken on a task. +*/ + notify-notify = SIGEV_NONE; goto out_unlock; + } /* * At this point, we know that notify is either SIGEV_SIGNAL or @@ -996,7 +1001,7 @@ static long aio_setup_sigevent(struct ai return 0; out_unlock: - read_unlock(tasklist_lock); + rcu_read_unlock(); return -EINVAL; } @@ -1763,7 +1768,7 @@ int fastcall io_submit_one(struct kioctx (struct sigevent __user *)(unsigned long) iocb-aio_sigeventp); if (ret) - goto out_put_req; + goto out_sigqfree; } /* Attach this iocb to its lio */ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm][AIO] AIO completion signal notification fixes and cleanups
On Tue, 6 Feb 2007 14:05:39 +0300 Oleg Nesterov [EMAIL PROTECTED] wrote: On 02/06, S?bastien Dugu? wrote: @@ -970,8 +969,14 @@ static long aio_setup_sigevent(struct ai rcu_read_lock(); target = sigevent_find_task(event); - if (unlikely(!target)) + if (unlikely(!target)) { + /* +* Revert notify to SIGEV_NONE so that really_put_req() +* knows that no ref has been taken on a task. +*/ + notify-notify = SIGEV_NONE; goto out_unlock; + } Very minor nit, feel free to ignore. Isn't it better to move notify-* = event.*; down, when we know that target != NULL. Imho, a bit easier to follow. This way we don't need to reset notify-notify = SIGEV_NONE. aio_setup_sigevent() relies on the fact that -notify = SIGEV_NONE on entry anyway. Yep, right, it will make things cleaner. Thanks, Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm][AIO] AIO completion signal notification small cleanup
Andrew, one more cleanup to aio_setup_sigevent() to make it more readable. Sébastien. From: Sébastien Dugué [EMAIL PROTECTED] AIO completion signal notification small cleanup This patch cleans up aio_setup_sigevent() to make it more readable. aio.c | 15 --- 1 file changed, 4 insertions(+), 11 deletions(-) Signed-off-by: Sébastien Dugué [EMAIL PROTECTED] Index: linux-2.6.20-rc6-mm3/fs/aio.c === --- linux-2.6.20-rc6-mm3.orig/fs/aio.c 2007-02-06 09:33:55.0 +0100 +++ linux-2.6.20-rc6-mm3/fs/aio.c 2007-02-06 12:43:52.0 +0100 @@ -962,21 +962,11 @@ static long aio_setup_sigevent(struct ai event.sigev_notify != SIGEV_THREAD_ID) return -EINVAL; - notify-notify = event.sigev_notify; - notify-signo = event.sigev_signo; - notify-value = event.sigev_value; - rcu_read_lock(); target = sigevent_find_task(event); - if (unlikely(!target)) { - /* -* Revert notify to SIGEV_NONE so that really_put_req() -* knows that no ref has been taken on a task. -*/ - notify-notify = SIGEV_NONE; + if (unlikely(!target)) goto out_unlock; - } /* * At this point, we know that notify is either SIGEV_SIGNAL or @@ -988,6 +978,9 @@ static long aio_setup_sigevent(struct ai notify-target = target; rcu_read_unlock(); + notify-notify = event.sigev_notify; + notify-signo = event.sigev_signo; + notify-value = event.sigev_value; notify-sigq = __sigqueue_alloc(current, GFP_KERNEL, 0); /* - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm][AIO] Fix AIO completion signal notification possible ref leak
Andrew, here is an incremental patch to fix a possible ref leak when users do weird things with the ->sigev_notify flags. Thanks Oleg for spotting this. Sébastien. From: Sébastien Dugué <[EMAIL PROTECTED]> Fix AIO completion signal notification possible ref leak Make sure we only accept valid sigev_notify values in aio_setup_sigevent(), namely SIGEV_NONE, SIGEV_THREAD_ID or SIGEV_SIGNAL. aio.c |6 +- 1 file changed, 5 insertions(+), 1 deletion(-) Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]> Index: linux-2.6.20-rc6-mm3/fs/aio.c === --- linux-2.6.20-rc6-mm3.orig/fs/aio.c 2007-02-05 16:50:27.0 +0100 +++ linux-2.6.20-rc6-mm3/fs/aio.c 2007-02-05 16:53:43.0 +0100 @@ -939,7 +939,7 @@ static int aio_send_signal(struct aio_no info->si_uid = 0; info->si_value = notify->value; - if (notify->notify & SIGEV_THREAD_ID) + if (notify->notify == SIGEV_THREAD_ID) ret = send_sigqueue(notify->signo, sigq, notify->target); else ret = send_group_sigqueue(notify->signo, sigq, notify->target); @@ -959,6 +959,10 @@ static long aio_setup_sigevent(struct ai if (event.sigev_notify == SIGEV_NONE) return 0; + if (event.sigev_notify != SIGEV_SIGNAL && + event.sigev_notify != SIGEV_THREAD_ID) + return -EINVAL; + notify->notify = event.sigev_notify; notify->signo = event.sigev_signo; notify->value = event.sigev_value; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm 4/7][AIO] Resend - Make good_sigevent() non-static
Andrew, Could you replace make-good_sigevent-non-static.patch with this patch, as it fixes some problems reported by Oleg Nesterov. Oleg, chime in if you still see something I missed. Hopefully I will have gotten it right this time. Sorry for the hassle. Sébastien. From: Sébastien Dugué <[EMAIL PROTECTED]> Make good_sigevent() non-static and rename it Move good_sigevent() from posix-timers.c to signal.c where it belongs, clean it up, rename it to sigevent_find_task() to better describe what the function does and make it non-static so that it can be used by other subsystems. include/linux/signal.h |1 + kernel/posix-timers.c | 19 +-- kernel/signal.c| 24 3 files changed, 26 insertions(+), 18 deletions(-) Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]> Index: linux-2.6.20-rc6-mm3/include/linux/signal.h === --- linux-2.6.20-rc6-mm3.orig/include/linux/signal.h2007-01-30 11:41:27.0 +0100 +++ linux-2.6.20-rc6-mm3/include/linux/signal.h 2007-02-05 12:25:12.0 +0100 @@ -240,6 +240,7 @@ extern int sigprocmask(int, sigset_t *, struct pt_regs; extern int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka, struct pt_regs *regs, void *cookie); +extern struct task_struct * sigevent_find_task(sigevent_t *); extern struct kmem_cache *sighand_cachep; Index: linux-2.6.20-rc6-mm3/kernel/posix-timers.c === --- linux-2.6.20-rc6-mm3.orig/kernel/posix-timers.c 2007-01-30 11:41:27.0 +0100 +++ linux-2.6.20-rc6-mm3/kernel/posix-timers.c 2007-01-30 11:41:36.0 +0100 @@ -367,23 +367,6 @@ static enum hrtimer_restart posix_timer_ return ret; } -static struct task_struct * good_sigevent(sigevent_t * event) -{ - struct task_struct *rtn = current->group_leader; - - if ((event->sigev_notify & SIGEV_THREAD_ID ) && - (!(rtn = find_task_by_pid(event->sigev_notify_thread_id)) || -rtn->tgid != current->tgid || -(event->sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_SIGNAL)) - return NULL; - - if (((event->sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_NONE) && - ((event->sigev_signo <= 0) || (event->sigev_signo > SIGRTMAX))) - return NULL; - - return rtn; -} - void register_posix_clock(const clockid_t clock_id, struct k_clock *new_clock) { if ((unsigned) clock_id >= MAX_CLOCKS) { @@ -496,7 +479,7 @@ sys_timer_create(const clockid_t which_c new_timer->it_sigev_value = event.sigev_value; read_lock(_lock); - if ((process = good_sigevent())) { + if ((process = sigevent_find_task())) { /* * We may be setting up this process for another * thread. It may be exiting. To catch this Index: linux-2.6.20-rc6-mm3/kernel/signal.c === --- linux-2.6.20-rc6-mm3.orig/kernel/signal.c 2007-01-30 11:41:27.0 +0100 +++ linux-2.6.20-rc6-mm3/kernel/signal.c2007-02-05 12:27:33.0 +0100 @@ -1213,6 +1213,34 @@ int group_send_sig_info(int sig, struct return ret; } +/*** + * sigevent_find_task - check and get target task from a sigevent. + * @event: the sigevent to be checked + * + * This function must be called with the tasklist_lock held for reading. + */ +struct task_struct * sigevent_find_task(sigevent_t * event) +{ + struct task_struct *task = current->group_leader; + + if ((event->sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_NONE) { + if (event->sigev_signo <= 0 || event->sigev_signo > SIGRTMAX) + return NULL; + } + + if (event->sigev_notify & SIGEV_THREAD_ID) { + if ((event->sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_SIGNAL) + return NULL; + + task = find_task_by_pid(event->sigev_notify_thread_id); + + if (!task || task->tgid != current->tgid) + task = NULL; + } + + return task; +} + /* * kill_pgrp_info() sends a signal to a process group: this is what the tty * control characters do (^C, ^Z etc) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm 4/7][AIO] - Make good_sigevent() non-static
Hi Oleg, thanks for your comments. On Fri, 2 Feb 2007 21:00:39 +0300 Oleg Nesterov <[EMAIL PROTECTED]> wrote: > On 02/01, S?bastien Dugu? wrote: > > > > +struct task_struct * sigevent_find_task(sigevent_t * event) > > +{ > > + struct task_struct *task = NULL; > > + > > + if (event->sigev_signo <= 0 || event->sigev_signo > SIGRTMAX) > > + return NULL; > > + > > + if ((event->sigev_notify & SIGEV_THREAD_ID ) == SIGEV_THREAD_ID) { > > + task = find_task_by_pid(event->sigev_notify_thread_id); > > + > > + if (!task || task->tgid != current->tgid) > > + task = NULL; > > + } else if (event->sigev_notify == SIGEV_SIGNAL) > > + task = current->group_leader; > > + > > + return task; > > +} > > I am afraid this is still not right. Consider > > ->sigev_notify == SIGEV_THREAD_ID | RANDOM_BIT > > Now, the second "if (SIGEV_THREAD_ID)" returns a valid task. However, > > really_put_req: > > if (notify == SIGEV_THREAD_ID || notify == SIGEV_SIGNAL) > put_task_struct(); > > doesn't work, so we have task_struct leak. Right, I'll revert back to the old code with cleanups. > > Worse, this breaks posix-timers. Note that posix-timers allow SIGEV_NONE, > the timer is not queued in that case, we shouldn't do ->sigev_signo check. > This means that aio should check SIGEV_NONE itself. > > Also, it is critical for posix-timers that SIGEV_THREAD_ID doesn't come > with another bit (like in the example below), note the code like > > if (sigev_notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID)) > ... > > IOW: good_sigevent() in its current form is very cryptic, and it _really_ > needs a cleanup, but we should not change its behaviour. Yep. I must admit that I didn't pay enough attention to this 10-liner, but in the end it rightfully backfired on me. > > Apart from this, I don't see other problems in the signal related code in > this series. > > Oleg. > Thanks again for your review. Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm 4/7][AIO] - Make good_sigevent() non-static
Hi Oleg, thanks for your comments. On Fri, 2 Feb 2007 21:00:39 +0300 Oleg Nesterov [EMAIL PROTECTED] wrote: On 02/01, S?bastien Dugu? wrote: +struct task_struct * sigevent_find_task(sigevent_t * event) +{ + struct task_struct *task = NULL; + + if (event-sigev_signo = 0 || event-sigev_signo SIGRTMAX) + return NULL; + + if ((event-sigev_notify SIGEV_THREAD_ID ) == SIGEV_THREAD_ID) { + task = find_task_by_pid(event-sigev_notify_thread_id); + + if (!task || task-tgid != current-tgid) + task = NULL; + } else if (event-sigev_notify == SIGEV_SIGNAL) + task = current-group_leader; + + return task; +} I am afraid this is still not right. Consider -sigev_notify == SIGEV_THREAD_ID | RANDOM_BIT Now, the second if (SIGEV_THREAD_ID) returns a valid task. However, really_put_req: if (notify == SIGEV_THREAD_ID || notify == SIGEV_SIGNAL) put_task_struct(); doesn't work, so we have task_struct leak. Right, I'll revert back to the old code with cleanups. Worse, this breaks posix-timers. Note that posix-timers allow SIGEV_NONE, the timer is not queued in that case, we shouldn't do -sigev_signo check. This means that aio should check SIGEV_NONE itself. Also, it is critical for posix-timers that SIGEV_THREAD_ID doesn't come with another bit (like in the example below), note the code like if (sigev_notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID)) ... IOW: good_sigevent() in its current form is very cryptic, and it _really_ needs a cleanup, but we should not change its behaviour. Yep. I must admit that I didn't pay enough attention to this 10-liner, but in the end it rightfully backfired on me. Apart from this, I don't see other problems in the signal related code in this series. Oleg. Thanks again for your review. Sébastien. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm 4/7][AIO] Resend - Make good_sigevent() non-static
Andrew, Could you replace make-good_sigevent-non-static.patch with this patch, as it fixes some problems reported by Oleg Nesterov. Oleg, chime in if you still see something I missed. Hopefully I will have gotten it right this time. Sorry for the hassle. Sébastien. From: Sébastien Dugué [EMAIL PROTECTED] Make good_sigevent() non-static and rename it Move good_sigevent() from posix-timers.c to signal.c where it belongs, clean it up, rename it to sigevent_find_task() to better describe what the function does and make it non-static so that it can be used by other subsystems. include/linux/signal.h |1 + kernel/posix-timers.c | 19 +-- kernel/signal.c| 24 3 files changed, 26 insertions(+), 18 deletions(-) Signed-off-by: Sébastien Dugué [EMAIL PROTECTED] Index: linux-2.6.20-rc6-mm3/include/linux/signal.h === --- linux-2.6.20-rc6-mm3.orig/include/linux/signal.h2007-01-30 11:41:27.0 +0100 +++ linux-2.6.20-rc6-mm3/include/linux/signal.h 2007-02-05 12:25:12.0 +0100 @@ -240,6 +240,7 @@ extern int sigprocmask(int, sigset_t *, struct pt_regs; extern int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka, struct pt_regs *regs, void *cookie); +extern struct task_struct * sigevent_find_task(sigevent_t *); extern struct kmem_cache *sighand_cachep; Index: linux-2.6.20-rc6-mm3/kernel/posix-timers.c === --- linux-2.6.20-rc6-mm3.orig/kernel/posix-timers.c 2007-01-30 11:41:27.0 +0100 +++ linux-2.6.20-rc6-mm3/kernel/posix-timers.c 2007-01-30 11:41:36.0 +0100 @@ -367,23 +367,6 @@ static enum hrtimer_restart posix_timer_ return ret; } -static struct task_struct * good_sigevent(sigevent_t * event) -{ - struct task_struct *rtn = current-group_leader; - - if ((event-sigev_notify SIGEV_THREAD_ID ) - (!(rtn = find_task_by_pid(event-sigev_notify_thread_id)) || -rtn-tgid != current-tgid || -(event-sigev_notify ~SIGEV_THREAD_ID) != SIGEV_SIGNAL)) - return NULL; - - if (((event-sigev_notify ~SIGEV_THREAD_ID) != SIGEV_NONE) - ((event-sigev_signo = 0) || (event-sigev_signo SIGRTMAX))) - return NULL; - - return rtn; -} - void register_posix_clock(const clockid_t clock_id, struct k_clock *new_clock) { if ((unsigned) clock_id = MAX_CLOCKS) { @@ -496,7 +479,7 @@ sys_timer_create(const clockid_t which_c new_timer-it_sigev_value = event.sigev_value; read_lock(tasklist_lock); - if ((process = good_sigevent(event))) { + if ((process = sigevent_find_task(event))) { /* * We may be setting up this process for another * thread. It may be exiting. To catch this Index: linux-2.6.20-rc6-mm3/kernel/signal.c === --- linux-2.6.20-rc6-mm3.orig/kernel/signal.c 2007-01-30 11:41:27.0 +0100 +++ linux-2.6.20-rc6-mm3/kernel/signal.c2007-02-05 12:27:33.0 +0100 @@ -1213,6 +1213,34 @@ int group_send_sig_info(int sig, struct return ret; } +/*** + * sigevent_find_task - check and get target task from a sigevent. + * @event: the sigevent to be checked + * + * This function must be called with the tasklist_lock held for reading. + */ +struct task_struct * sigevent_find_task(sigevent_t * event) +{ + struct task_struct *task = current-group_leader; + + if ((event-sigev_notify ~SIGEV_THREAD_ID) != SIGEV_NONE) { + if (event-sigev_signo = 0 || event-sigev_signo SIGRTMAX) + return NULL; + } + + if (event-sigev_notify SIGEV_THREAD_ID) { + if ((event-sigev_notify ~SIGEV_THREAD_ID) != SIGEV_SIGNAL) + return NULL; + + task = find_task_by_pid(event-sigev_notify_thread_id); + + if (!task || task-tgid != current-tgid) + task = NULL; + } + + return task; +} + /* * kill_pgrp_info() sends a signal to a process group: this is what the tty * control characters do (^C, ^Z etc) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm][AIO] Fix AIO completion signal notification possible ref leak
Andrew, here is an incremental patch to fix a possible ref leak when users do weird things with the -sigev_notify flags. Thanks Oleg for spotting this. Sébastien. From: Sébastien Dugué [EMAIL PROTECTED] Fix AIO completion signal notification possible ref leak Make sure we only accept valid sigev_notify values in aio_setup_sigevent(), namely SIGEV_NONE, SIGEV_THREAD_ID or SIGEV_SIGNAL. aio.c |6 +- 1 file changed, 5 insertions(+), 1 deletion(-) Signed-off-by: Sébastien Dugué [EMAIL PROTECTED] Index: linux-2.6.20-rc6-mm3/fs/aio.c === --- linux-2.6.20-rc6-mm3.orig/fs/aio.c 2007-02-05 16:50:27.0 +0100 +++ linux-2.6.20-rc6-mm3/fs/aio.c 2007-02-05 16:53:43.0 +0100 @@ -939,7 +939,7 @@ static int aio_send_signal(struct aio_no info-si_uid = 0; info-si_value = notify-value; - if (notify-notify SIGEV_THREAD_ID) + if (notify-notify == SIGEV_THREAD_ID) ret = send_sigqueue(notify-signo, sigq, notify-target); else ret = send_group_sigqueue(notify-signo, sigq, notify-target); @@ -959,6 +959,10 @@ static long aio_setup_sigevent(struct ai if (event.sigev_notify == SIGEV_NONE) return 0; + if (event.sigev_notify != SIGEV_SIGNAL + event.sigev_notify != SIGEV_THREAD_ID) + return -EINVAL; + notify-notify = event.sigev_notify; notify-signo = event.sigev_signo; notify-value = event.sigev_value; - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm 3/7][AIO] - Fix access_ok() checks
From: Sébastien Dugué <[EMAIL PROTECTED]> Fix access_ok() checks in sys_io_submit() and compat_sys_io_submit() sys_io_submit() and compat_sys_io_submit() check for the number of requests (nr) being positive, but the following access_ok() call uses nr * sizeof(ptr) which can overlow to a negative number. Just make sure that nr * sizeof(ptr) is positive in the first place. aio.c|2 +- compat.c |2 +- 2 files changed, 2 insertions(+), 2 deletions(-) Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]> Index: linux-2.6.20-rc6-mm3/fs/aio.c === --- linux-2.6.20-rc6-mm3.orig/fs/aio.c 2007-01-30 11:28:56.0 +0100 +++ linux-2.6.20-rc6-mm3/fs/aio.c 2007-01-30 11:32:42.0 +0100 @@ -1630,7 +1630,7 @@ asmlinkage long sys_io_submit(aio_contex long ret = 0; int i; - if (unlikely(nr < 0)) + if (unlikely((nr*sizeof(*iocbpp)) < 0)) return -EINVAL; if (unlikely(!access_ok(VERIFY_READ, iocbpp, (nr*sizeof(*iocbpp) Index: linux-2.6.20-rc6-mm3/fs/compat.c === --- linux-2.6.20-rc6-mm3.orig/fs/compat.c 2007-01-30 11:30:29.0 +0100 +++ linux-2.6.20-rc6-mm3/fs/compat.c2007-01-30 11:31:55.0 +0100 @@ -651,7 +651,7 @@ compat_sys_io_submit(aio_context_t ctx_i long ret = 0; int i; - if (unlikely(nr < 0)) + if (unlikely((nr * sizeof(u32)) < 0)) return -EINVAL; if (unlikely(!access_ok(VERIFY_READ, iocb, (nr * sizeof(u32) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm 5/7][AIO] - Make __sigqueue_free() and __sigqueue_alloc() non static
From: Sébastien Dugué <[EMAIL PROTECTED]> Make __sigqueue_alloc() and __sigqueue_free() non static Allow subsystems to directly call into __sigqueue_alloc() and __sigqueue_free. This is used by the AIO signal notification patch. include/linux/signal.h |3 +++ kernel/signal.c|6 +++--- 2 files changed, 6 insertions(+), 3 deletions(-) Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]> Index: linux-2.6.20-rc6-mm3/include/linux/signal.h === --- linux-2.6.20-rc6-mm3.orig/include/linux/signal.h2007-01-30 11:41:36.0 +0100 +++ linux-2.6.20-rc6-mm3/include/linux/signal.h 2007-01-30 11:41:39.0 +0100 @@ -241,6 +241,9 @@ extern int sigprocmask(int, sigset_t *, struct pt_regs; extern int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka, struct pt_regs *regs, void *cookie); extern struct task_struct * sigevent_find_task(sigevent_t *); +extern struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags, +int override_rlimit); +extern void __sigqueue_free(struct sigqueue *q); extern struct kmem_cache *sighand_cachep; Index: linux-2.6.20-rc6-mm3/kernel/signal.c === --- linux-2.6.20-rc6-mm3.orig/kernel/signal.c 2007-01-30 11:41:36.0 +0100 +++ linux-2.6.20-rc6-mm3/kernel/signal.c2007-01-30 11:41:39.0 +0100 @@ -268,8 +268,8 @@ next_signal(struct sigpending *pending, return sig; } -static struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags, -int override_rlimit) +struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags, + int override_rlimit) { struct sigqueue *q = NULL; struct user_struct *user; @@ -295,7 +295,7 @@ static struct sigqueue *__sigqueue_alloc return(q); } -static void __sigqueue_free(struct sigqueue *q) +void __sigqueue_free(struct sigqueue *q) { if (q->flags & SIGQUEUE_PREALLOC) return; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm 2/7][AIO] - fix aio.h includes
From: Sébastien Dugué <[EMAIL PROTECTED]> Fix the double inclusion of linux/uio.h in linux/aio.h aio.h |1 - 1 file changed, 1 deletion(-) Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]> Index: linux-2.6.20-rc6-mm3/include/linux/aio.h === --- linux-2.6.20-rc6-mm3.orig/include/linux/aio.h 2007-01-30 10:05:08.0 +0100 +++ linux-2.6.20-rc6-mm3/include/linux/aio.h2007-01-30 10:11:22.0 +0100 @@ -7,7 +7,6 @@ #include #include -#include #define AIO_MAXSEGS4 #define AIO_KIOGRP_NR_ATOMIC 8 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm 7/7][AIO] - Add listio syscall support
From: Bharata B Rao <[EMAIL PROTECTED]> This patch provides POSIX listio support by means of a new system call. long lio_submit(aio_context_t ctx_id, int mode, long nr, struct iocb __user * __user *iocbpp, struct sigevent __user *event) This system call is similar to the io_submit() system call, but takes two more arguments. 'mode' argument can be LIO_WAIT or LIO_NOWAIT. 'event' argument specifies the signal notification mechanism. This patch is built on support provided by the aio signal notification patch by Sebastien. The following two structures together provide the support for grouping iocbs belonging to a list (lio). struct aio_notify { struct task_struct *target; __u16 signo; __u16 notify; sigval_tvalue; struct sigqueue *sigq; }; struct lio_event { atomic_tlio_users; struct aio_notify lio_notify; }; A single lio_event struct is maintained for the list of iocbs. lio_users holds the number of requests attached to this lio and lio_notify has the necessary information for generating completion notification signal. If the mode is LIO_WAIT, the event argument is ignored and the system call waits until all the requests of the lio are completed. If the mode is LIO_NOWAIT, the system call returns immediately after submitting the io requests and may optionally notify the process on list io completion depending on the event argument. Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]> Signed-off-by: Laurent Vivier <[EMAIL PROTECTED]> Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]> --- arch/i386/kernel/syscall_table.S |1 arch/x86_64/ia32/ia32entry.S |5 - fs/aio.c | 193 ++- fs/compat.c | 125 + include/asm-i386/unistd.h|3 include/asm-x86_64/unistd.h |4 include/linux/aio.h | 14 ++ include/linux/aio_abi.h |5 + include/linux/syscalls.h |2 9 files changed, 307 insertions(+), 45 deletions(-) Index: linux-2.6.20-rc6-mm3/arch/i386/kernel/syscall_table.S === --- linux-2.6.20-rc6-mm3.orig/arch/i386/kernel/syscall_table.S 2007-01-31 15:31:59.0 +0100 +++ linux-2.6.20-rc6-mm3/arch/i386/kernel/syscall_table.S 2007-01-31 15:58:55.0 +0100 @@ -320,3 +320,4 @@ ENTRY(sys_call_table) .long sys_getcpu .long sys_epoll_pwait .long sys_lutimesat /* 320 */ + .long sys_lio_submit Index: linux-2.6.20-rc6-mm3/arch/x86_64/ia32/ia32entry.S === --- linux-2.6.20-rc6-mm3.orig/arch/x86_64/ia32/ia32entry.S 2007-01-31 15:31:59.0 +0100 +++ linux-2.6.20-rc6-mm3/arch/x86_64/ia32/ia32entry.S 2007-01-31 15:58:55.0 +0100 @@ -714,8 +714,11 @@ ia32_sys_call_table: .quad compat_sys_get_robust_list .quad sys_splice .quad sys_sync_file_range - .quad sys_tee + .quad sys_tee /* 315 */ .quad compat_sys_vmsplice .quad compat_sys_move_pages .quad sys_getcpu + .quad quiet_ni_syscall /* sys_epoll_wait */ + .quad quiet_ni_syscall /* 320: sys_lutimesat */ + .quad compat_sys_lio_submit ia32_syscall_end: Index: linux-2.6.20-rc6-mm3/fs/aio.c === --- linux-2.6.20-rc6-mm3.orig/fs/aio.c 2007-01-31 15:31:59.0 +0100 +++ linux-2.6.20-rc6-mm3/fs/aio.c 2007-01-31 16:00:57.0 +0100 @@ -416,6 +416,7 @@ static struct kiocb fastcall *__aio_get_ req->ki_ctx = ctx; req->ki_cancel = NULL; req->ki_retry = NULL; + req->ki_lio = NULL; req->ki_dtor = NULL; req->private = NULL; req->ki_iovec = NULL; @@ -995,6 +996,72 @@ out_unlock: return -EINVAL; } +/* + * lio_notify + * When all iocbs belonging to an lio are complete, this initiates + * the completion notification for the lio. + * + * NOTE: lio is freed here after the last iocb completes, but only + * for LIO_NOWAIT case. In case of LIO_WAIT, the submitting thread + * waits for iocbs to complete and later frees the lio. + */ +void lio_notify(struct lio_event *lio) +{ + int ret; + + ret = atomic_dec_and_test(>lio_users); + + if (unlikely(ret) && lio->lio_notify.notify != SIGEV_NONE) { + /* last one -> notify process */ + if (aio_send_signal(>lio_notify)) + __sigqueue_free(lio->lio_notify.sigq); + kfree(lio); + } +} + +/* + * lio_create + * Allocates a lio_event structure which groups the iocbs belonging + * to the lio and sets up the c
[PATCH -mm 4/7][AIO] - Make good_sigevent() non-static
From: Sébastien Dugué <[EMAIL PROTECTED]> Make good_sigevent() non-static and rename it Move good_sigevent() from posix-timers.c to signal.c where it belongs, clean it up, rename it to sigevent_find_task() to better describe what the function does and make it non-static so that it can be used by other subsystems. include/linux/signal.h |1 + kernel/posix-timers.c | 19 +-- kernel/signal.c| 24 3 files changed, 26 insertions(+), 18 deletions(-) Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]> Index: linux-2.6.20-rc6-mm3/include/linux/signal.h === --- linux-2.6.20-rc6-mm3.orig/include/linux/signal.h2007-01-30 11:41:27.0 +0100 +++ linux-2.6.20-rc6-mm3/include/linux/signal.h 2007-01-30 11:41:36.0 +0100 @@ -240,6 +240,7 @@ extern int sigprocmask(int, sigset_t *, struct pt_regs; extern int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka, struct pt_regs *regs, void *cookie); +extern struct task_struct * sigevent_find_task(sigevent_t *); extern struct kmem_cache *sighand_cachep; Index: linux-2.6.20-rc6-mm3/kernel/posix-timers.c === --- linux-2.6.20-rc6-mm3.orig/kernel/posix-timers.c 2007-01-30 11:41:27.0 +0100 +++ linux-2.6.20-rc6-mm3/kernel/posix-timers.c 2007-01-30 11:41:36.0 +0100 @@ -367,23 +367,6 @@ static enum hrtimer_restart posix_timer_ return ret; } -static struct task_struct * good_sigevent(sigevent_t * event) -{ - struct task_struct *rtn = current->group_leader; - - if ((event->sigev_notify & SIGEV_THREAD_ID ) && - (!(rtn = find_task_by_pid(event->sigev_notify_thread_id)) || -rtn->tgid != current->tgid || -(event->sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_SIGNAL)) - return NULL; - - if (((event->sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_NONE) && - ((event->sigev_signo <= 0) || (event->sigev_signo > SIGRTMAX))) - return NULL; - - return rtn; -} - void register_posix_clock(const clockid_t clock_id, struct k_clock *new_clock) { if ((unsigned) clock_id >= MAX_CLOCKS) { @@ -496,7 +479,7 @@ sys_timer_create(const clockid_t which_c new_timer->it_sigev_value = event.sigev_value; read_lock(_lock); - if ((process = good_sigevent())) { + if ((process = sigevent_find_task())) { /* * We may be setting up this process for another * thread. It may be exiting. To catch this Index: linux-2.6.20-rc6-mm3/kernel/signal.c === --- linux-2.6.20-rc6-mm3.orig/kernel/signal.c 2007-01-30 11:41:27.0 +0100 +++ linux-2.6.20-rc6-mm3/kernel/signal.c2007-01-30 11:41:36.0 +0100 @@ -1213,6 +1213,30 @@ int group_send_sig_info(int sig, struct return ret; } +/*** + * sigevent_find_task - check and get target task from a sigevent. + * @event: the sigevent to be checked + * + * This function must be called with the tasklist_lock held for reading. + */ +struct task_struct * sigevent_find_task(sigevent_t * event) +{ + struct task_struct *task = NULL; + + if (event->sigev_signo <= 0 || event->sigev_signo > SIGRTMAX) + return NULL; + + if ((event->sigev_notify & SIGEV_THREAD_ID ) == SIGEV_THREAD_ID) { + task = find_task_by_pid(event->sigev_notify_thread_id); + + if (!task || task->tgid != current->tgid) + task = NULL; + } else if (event->sigev_notify == SIGEV_SIGNAL) + task = current->group_leader; + + return task; +} + /* * kill_pgrp_info() sends a signal to a process group: this is what the tty * control characters do (^C, ^Z etc) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm 1/7][AIO] - Rework compat_sys_io_submit
From: Sébastien Dugué <[EMAIL PROTECTED]> compat_sys_io_submit() cleanup Cleanup compat_sys_io_submit by duplicating some of the native syscall logic in the compat layer and directly calling io_submit_one() instead of fooling the syscall into thinking it is called from a native 64-bit caller. This eliminates: - the overhead of copying the nr iocb pointers on the userspace stack - the PAGE_SIZE/(sizeof(void *)) limit on the number of iocbs that can be submitted. This is also needed for the completion notification patch to avoid having to rewrite each iocb on the caller stack for io_submit_one() to find the sigevents. compat.c | 61 ++--- 1 file changed, 34 insertions(+), 27 deletions(-) Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]> Index: linux-2.6.20-rc6-mm3/fs/compat.c === --- linux-2.6.20-rc6-mm3.orig/fs/compat.c 2007-01-30 10:05:08.0 +0100 +++ linux-2.6.20-rc6-mm3/fs/compat.c2007-01-30 11:30:29.0 +0100 @@ -644,40 +644,47 @@ out: return ret; } -static inline long -copy_iocb(long nr, u32 __user *ptr32, struct iocb __user * __user *ptr64) -{ - compat_uptr_t uptr; - int i; - - for (i = 0; i < nr; ++i) { - if (get_user(uptr, ptr32 + i)) - return -EFAULT; - if (put_user(compat_ptr(uptr), ptr64 + i)) - return -EFAULT; - } - return 0; -} - -#define MAX_AIO_SUBMITS(PAGE_SIZE/sizeof(struct iocb *)) - asmlinkage long compat_sys_io_submit(aio_context_t ctx_id, int nr, u32 __user *iocb) { - struct iocb __user * __user *iocb64; - long ret; + struct kioctx *ctx; + long ret = 0; + int i; if (unlikely(nr < 0)) return -EINVAL; - if (nr > MAX_AIO_SUBMITS) - nr = MAX_AIO_SUBMITS; - - iocb64 = compat_alloc_user_space(nr * sizeof(*iocb64)); - ret = copy_iocb(nr, iocb, iocb64); - if (!ret) - ret = sys_io_submit(ctx_id, nr, iocb64); - return ret; + if (unlikely(!access_ok(VERIFY_READ, iocb, (nr * sizeof(u32) + return -EFAULT; + + ctx = lookup_ioctx(ctx_id); + if (unlikely(!ctx)) + return -EINVAL; + + for (i=0; ihttp://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm 0/7][AIO] - AIO completion signal notification v6
Hi Andrew, here is the latest rework of the aio notification and listio support patches with comments from Oleg Nesterov and you addressed. Sébastien. This set now consists in 7 patches: 1. rework-compat-sys-io-submit: cleanup the sys_io_submit() compat layer, making it more efficient and laying out the base for the following patches 2. aio-header-fix-includes: fixes the double inclusion of uio.h in aio.h 3. aio-fix-access_ok-check: fixes the access_ok() checks 4. make-good_sigevent-non-static: move good_sigevent into signal.c, rename it to something more meaningfull and make it non-static 5. make-__sigqueue_alloc-free-non-static: make __sigqueue_alloc() and __sigqueue_free() non static 6. aio-notify-sig: the AIO completion signal notification 7. add-listio-syscall-support: adds listio support via a new syscall Description are in the individual patches. Changes from v5: based on comments from Oleg Nesterov and Andrew Morton - Fix for signo range not checked in the SIGEV_THREAD_ID case in good_sigevent() - renamed good_sigevent() to sigevent_find_task() - Fixed the access_ok checks in sys_io_submit() and compat_sys_io_submit() - Made __sigqueue_alloc() and __sigqueue_free() non static - Reworked the aio code to directly use __sigqueue_alloc() and __sigqueue_free() and avoid messing with SIGQUEUE_PREALLOC - Removed the unneeded PF_EXITING check in aio_setup_sigevent() - Added a comment in aio_setup_sigevent() for why there is no task ref leak - Changed read_lock(_lock) to rcu_read_lock() in aio_setup_sigevent() - Fixed an ioctx reference leak in compat_sys_lio_submit() - Renamed lio_check() as lio_notify() - Added comments to lio_notify() and lio_create() - Replaced sigqueue_free() by __sigqueue_free() in lio_notify() Changes from v4: No comments received for v4, so it must be perfect ;-) replaced the IOCB_CMD_GROUP listio implementation with Bharata's syscall approach which I find much cleaner. Changes from v3: - added justification for the compat_sys_io_submit() cleanup - Zach Brown - more cleanups in compat_sys_io_submit() to put it in line with sys_io_submit() - Zach Brown - Changed "Export good_sigevent()" patch name to "Make good_sigevent() non-static" to better describe what it does. - Christoph Hellwig - Reworked good_sigevent() to make it more readable. - Christoph Hellwig - Simplified the use of the SIGEV_* constants in signal notification - Christoph Hellwig - Take a reference on the target task both for the SIGEV_THREAD_ID and SIGEV_SIGNAL cases. Changes from v2: - rebased to 2.6.19-rc6-mm2 - reworked the sys_io_submit() compat layer as suggested by Zach Brown - fixed saving of a pointer to a task struct in aio-notify-sig as pointed out by Andrew Morton Changes from v1: - cleanups suggested by Christoph Hellwig, Badari Pulavarty and Zach Brown - added lisio patch - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm 6/7][AIO] - AIO completion signal notification
From: Sébastien Dugué <[EMAIL PROTECTED]> AIO completion signal notification The current 2.6 kernel does not support notification of user space via an RT signal upon an asynchronous IO completion. The POSIX specification states that when an AIO request completes, a signal can be delivered to the application as notification. This patch adds a struct sigevent *aio_sigeventp to the iocb. The relevant fields (pid, signal number and value) are stored in the kiocb for use when the request completes. That sigevent structure is filled by the application as part of the AIO request preparation. Upon request completion, the kernel notifies the application using those sigevent parameters. If SIGEV_NONE has been specified, then the old behaviour is retained and the application must rely on polling the completion queue using io_getevents(). A struct sigevent *aio_sigeventp field is added to struct iocb in include/linux/aio_abi.h A struct aio_notify containing the sigevent parameters is defined in aio.h: struct aio_notify { struct task_struct *target; __u16 signo; __u16 notify; sigval_tvalue; }; A struct aio_notify ki_notify is added to struct kiocb in include/linux/aio.h In io_submit_one(), if the application provided a sigevent then aio_setup_sigevent() is called which does the following: - check access to the user sigevent and make a local copy - if the requested notification is SIGEV_NONE, then nothing to do - fill in the kiocb->ki_notify fields (notify, signo, value) - check sigevent consistency, get the signal target task and save it in kiocb->ki_notify.target - preallocate a sigqueue for this event using __sigqueue_alloc() Upon request completion, in aio_complete(), if notification is needed for this request (iocb->ki_notify.notify != SIGEV_NONE), then aio_send_signal() is called to signal the target task as follows: - fill in the siginfo struct to be sent to the task - if notify is SIGEV_THREAD_ID then send signal to specific task using send_sigqueue() - else send signal to task group using send_5group_sigqueue() Notes concerning sigqueue preallocation: To ensure reliable delivery of completion notification, the sigqueue is preallocated in the submission path so that there is no chance it can fail in the completion path. Unlike the posix-timers case (currently the single other user of sigqueue preallocation), where the sigqueue is allocated for the lifetime of the timer and freed at timer destruction time, the aio case is a bit more tricky due to the async nature of the whole thing. In the aio case, the sigqueue exists for the lifetime of the request, therefore it must be freed only once the signal for the request completion has been delivered. The sigqueue is therefore freed by the signal delivery mechanism when the signal is collected. The aio submission and completion error paths do an explicit __sigqueue_free() to free the sigqueue. fs/aio.c| 114 ++-- fs/compat.c | 18 +++ include/linux/aio.h | 12 + include/linux/aio_abi.h |3 - kernel/signal.c |4 - 5 files changed, 144 insertions(+), 7 deletions(-) Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]> Signed-off-by: Laurent Vivier <[EMAIL PROTECTED]> Index: linux-2.6.20-rc6-mm3/fs/aio.c === --- linux-2.6.20-rc6-mm3.orig/fs/aio.c 2007-01-30 11:32:42.0 +0100 +++ linux-2.6.20-rc6-mm3/fs/aio.c 2007-01-30 11:41:41.0 +0100 @@ -419,6 +419,7 @@ static struct kiocb fastcall *__aio_get_ req->ki_dtor = NULL; req->private = NULL; req->ki_iovec = NULL; + req->ki_notify.sigq = NULL; INIT_LIST_HEAD(>ki_run_list); /* Check if the completion queue has enough free space to @@ -465,6 +466,12 @@ static inline void really_put_req(struct req->ki_dtor(req); if (req->ki_iovec != >ki_inline_vec) kfree(req->ki_iovec); + + /* Release task ref */ + if (req->ki_notify.notify == SIGEV_THREAD_ID || + req->ki_notify.notify == SIGEV_SIGNAL) + put_task_struct(req->ki_notify.target); + kmem_cache_free(kiocb_cachep, req); ctx->reqs_active--; @@ -916,6 +923,78 @@ void fastcall kick_iocb(struct kiocb *io } EXPORT_SYMBOL(kick_iocb); +static int aio_send_signal(struct aio_notify *notify) +{ + struct sigqueue *sigq = notify->sigq; + struct siginfo *info = >info; + int ret; + + memset(info, 0, sizeof(struct siginfo)); + + info->si_signo = notify->signo; + info->si_errno = 0; + info->si_code = SI
[PATCH -mm 0/7][AIO] - AIO completion signal notification v6
Hi Andrew, here is the latest rework of the aio notification and listio support patches with comments from Oleg Nesterov and you addressed. Sébastien. This set now consists in 7 patches: 1. rework-compat-sys-io-submit: cleanup the sys_io_submit() compat layer, making it more efficient and laying out the base for the following patches 2. aio-header-fix-includes: fixes the double inclusion of uio.h in aio.h 3. aio-fix-access_ok-check: fixes the access_ok() checks 4. make-good_sigevent-non-static: move good_sigevent into signal.c, rename it to something more meaningfull and make it non-static 5. make-__sigqueue_alloc-free-non-static: make __sigqueue_alloc() and __sigqueue_free() non static 6. aio-notify-sig: the AIO completion signal notification 7. add-listio-syscall-support: adds listio support via a new syscall Description are in the individual patches. Changes from v5: based on comments from Oleg Nesterov and Andrew Morton - Fix for signo range not checked in the SIGEV_THREAD_ID case in good_sigevent() - renamed good_sigevent() to sigevent_find_task() - Fixed the access_ok checks in sys_io_submit() and compat_sys_io_submit() - Made __sigqueue_alloc() and __sigqueue_free() non static - Reworked the aio code to directly use __sigqueue_alloc() and __sigqueue_free() and avoid messing with SIGQUEUE_PREALLOC - Removed the unneeded PF_EXITING check in aio_setup_sigevent() - Added a comment in aio_setup_sigevent() for why there is no task ref leak - Changed read_lock(tasklist_lock) to rcu_read_lock() in aio_setup_sigevent() - Fixed an ioctx reference leak in compat_sys_lio_submit() - Renamed lio_check() as lio_notify() - Added comments to lio_notify() and lio_create() - Replaced sigqueue_free() by __sigqueue_free() in lio_notify() Changes from v4: No comments received for v4, so it must be perfect ;-) replaced the IOCB_CMD_GROUP listio implementation with Bharata's syscall approach which I find much cleaner. Changes from v3: - added justification for the compat_sys_io_submit() cleanup - Zach Brown - more cleanups in compat_sys_io_submit() to put it in line with sys_io_submit() - Zach Brown - Changed Export good_sigevent() patch name to Make good_sigevent() non-static to better describe what it does. - Christoph Hellwig - Reworked good_sigevent() to make it more readable. - Christoph Hellwig - Simplified the use of the SIGEV_* constants in signal notification - Christoph Hellwig - Take a reference on the target task both for the SIGEV_THREAD_ID and SIGEV_SIGNAL cases. Changes from v2: - rebased to 2.6.19-rc6-mm2 - reworked the sys_io_submit() compat layer as suggested by Zach Brown - fixed saving of a pointer to a task struct in aio-notify-sig as pointed out by Andrew Morton Changes from v1: - cleanups suggested by Christoph Hellwig, Badari Pulavarty and Zach Brown - added lisio patch - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm 4/7][AIO] - Make good_sigevent() non-static
From: Sébastien Dugué [EMAIL PROTECTED] Make good_sigevent() non-static and rename it Move good_sigevent() from posix-timers.c to signal.c where it belongs, clean it up, rename it to sigevent_find_task() to better describe what the function does and make it non-static so that it can be used by other subsystems. include/linux/signal.h |1 + kernel/posix-timers.c | 19 +-- kernel/signal.c| 24 3 files changed, 26 insertions(+), 18 deletions(-) Signed-off-by: Sébastien Dugué [EMAIL PROTECTED] Index: linux-2.6.20-rc6-mm3/include/linux/signal.h === --- linux-2.6.20-rc6-mm3.orig/include/linux/signal.h2007-01-30 11:41:27.0 +0100 +++ linux-2.6.20-rc6-mm3/include/linux/signal.h 2007-01-30 11:41:36.0 +0100 @@ -240,6 +240,7 @@ extern int sigprocmask(int, sigset_t *, struct pt_regs; extern int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka, struct pt_regs *regs, void *cookie); +extern struct task_struct * sigevent_find_task(sigevent_t *); extern struct kmem_cache *sighand_cachep; Index: linux-2.6.20-rc6-mm3/kernel/posix-timers.c === --- linux-2.6.20-rc6-mm3.orig/kernel/posix-timers.c 2007-01-30 11:41:27.0 +0100 +++ linux-2.6.20-rc6-mm3/kernel/posix-timers.c 2007-01-30 11:41:36.0 +0100 @@ -367,23 +367,6 @@ static enum hrtimer_restart posix_timer_ return ret; } -static struct task_struct * good_sigevent(sigevent_t * event) -{ - struct task_struct *rtn = current-group_leader; - - if ((event-sigev_notify SIGEV_THREAD_ID ) - (!(rtn = find_task_by_pid(event-sigev_notify_thread_id)) || -rtn-tgid != current-tgid || -(event-sigev_notify ~SIGEV_THREAD_ID) != SIGEV_SIGNAL)) - return NULL; - - if (((event-sigev_notify ~SIGEV_THREAD_ID) != SIGEV_NONE) - ((event-sigev_signo = 0) || (event-sigev_signo SIGRTMAX))) - return NULL; - - return rtn; -} - void register_posix_clock(const clockid_t clock_id, struct k_clock *new_clock) { if ((unsigned) clock_id = MAX_CLOCKS) { @@ -496,7 +479,7 @@ sys_timer_create(const clockid_t which_c new_timer-it_sigev_value = event.sigev_value; read_lock(tasklist_lock); - if ((process = good_sigevent(event))) { + if ((process = sigevent_find_task(event))) { /* * We may be setting up this process for another * thread. It may be exiting. To catch this Index: linux-2.6.20-rc6-mm3/kernel/signal.c === --- linux-2.6.20-rc6-mm3.orig/kernel/signal.c 2007-01-30 11:41:27.0 +0100 +++ linux-2.6.20-rc6-mm3/kernel/signal.c2007-01-30 11:41:36.0 +0100 @@ -1213,6 +1213,30 @@ int group_send_sig_info(int sig, struct return ret; } +/*** + * sigevent_find_task - check and get target task from a sigevent. + * @event: the sigevent to be checked + * + * This function must be called with the tasklist_lock held for reading. + */ +struct task_struct * sigevent_find_task(sigevent_t * event) +{ + struct task_struct *task = NULL; + + if (event-sigev_signo = 0 || event-sigev_signo SIGRTMAX) + return NULL; + + if ((event-sigev_notify SIGEV_THREAD_ID ) == SIGEV_THREAD_ID) { + task = find_task_by_pid(event-sigev_notify_thread_id); + + if (!task || task-tgid != current-tgid) + task = NULL; + } else if (event-sigev_notify == SIGEV_SIGNAL) + task = current-group_leader; + + return task; +} + /* * kill_pgrp_info() sends a signal to a process group: this is what the tty * control characters do (^C, ^Z etc) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm 1/7][AIO] - Rework compat_sys_io_submit
From: Sébastien Dugué [EMAIL PROTECTED] compat_sys_io_submit() cleanup Cleanup compat_sys_io_submit by duplicating some of the native syscall logic in the compat layer and directly calling io_submit_one() instead of fooling the syscall into thinking it is called from a native 64-bit caller. This eliminates: - the overhead of copying the nr iocb pointers on the userspace stack - the PAGE_SIZE/(sizeof(void *)) limit on the number of iocbs that can be submitted. This is also needed for the completion notification patch to avoid having to rewrite each iocb on the caller stack for io_submit_one() to find the sigevents. compat.c | 61 ++--- 1 file changed, 34 insertions(+), 27 deletions(-) Signed-off-by: Sébastien Dugué [EMAIL PROTECTED] Index: linux-2.6.20-rc6-mm3/fs/compat.c === --- linux-2.6.20-rc6-mm3.orig/fs/compat.c 2007-01-30 10:05:08.0 +0100 +++ linux-2.6.20-rc6-mm3/fs/compat.c2007-01-30 11:30:29.0 +0100 @@ -644,40 +644,47 @@ out: return ret; } -static inline long -copy_iocb(long nr, u32 __user *ptr32, struct iocb __user * __user *ptr64) -{ - compat_uptr_t uptr; - int i; - - for (i = 0; i nr; ++i) { - if (get_user(uptr, ptr32 + i)) - return -EFAULT; - if (put_user(compat_ptr(uptr), ptr64 + i)) - return -EFAULT; - } - return 0; -} - -#define MAX_AIO_SUBMITS(PAGE_SIZE/sizeof(struct iocb *)) - asmlinkage long compat_sys_io_submit(aio_context_t ctx_id, int nr, u32 __user *iocb) { - struct iocb __user * __user *iocb64; - long ret; + struct kioctx *ctx; + long ret = 0; + int i; if (unlikely(nr 0)) return -EINVAL; - if (nr MAX_AIO_SUBMITS) - nr = MAX_AIO_SUBMITS; - - iocb64 = compat_alloc_user_space(nr * sizeof(*iocb64)); - ret = copy_iocb(nr, iocb, iocb64); - if (!ret) - ret = sys_io_submit(ctx_id, nr, iocb64); - return ret; + if (unlikely(!access_ok(VERIFY_READ, iocb, (nr * sizeof(u32) + return -EFAULT; + + ctx = lookup_ioctx(ctx_id); + if (unlikely(!ctx)) + return -EINVAL; + + for (i=0; inr; i++) { + compat_uptr_t uptr; + struct iocb __user *user_iocb; + struct iocb tmp; + + if (unlikely(get_user(uptr, iocb + i))) { + ret = -EFAULT; + break; + } + + user_iocb = compat_ptr(uptr); + + if (unlikely(copy_from_user(tmp, user_iocb, sizeof(tmp { + ret = -EFAULT; + break; + } + + ret = io_submit_one(ctx, user_iocb, tmp); + if (ret) + break; + } + + put_ioctx(ctx); + return i ? i: ret; } struct compat_ncp_mount_data { - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm 7/7][AIO] - Add listio syscall support
From: Bharata B Rao [EMAIL PROTECTED] This patch provides POSIX listio support by means of a new system call. long lio_submit(aio_context_t ctx_id, int mode, long nr, struct iocb __user * __user *iocbpp, struct sigevent __user *event) This system call is similar to the io_submit() system call, but takes two more arguments. 'mode' argument can be LIO_WAIT or LIO_NOWAIT. 'event' argument specifies the signal notification mechanism. This patch is built on support provided by the aio signal notification patch by Sebastien. The following two structures together provide the support for grouping iocbs belonging to a list (lio). struct aio_notify { struct task_struct *target; __u16 signo; __u16 notify; sigval_tvalue; struct sigqueue *sigq; }; struct lio_event { atomic_tlio_users; struct aio_notify lio_notify; }; A single lio_event struct is maintained for the list of iocbs. lio_users holds the number of requests attached to this lio and lio_notify has the necessary information for generating completion notification signal. If the mode is LIO_WAIT, the event argument is ignored and the system call waits until all the requests of the lio are completed. If the mode is LIO_NOWAIT, the system call returns immediately after submitting the io requests and may optionally notify the process on list io completion depending on the event argument. Signed-off-by: Sébastien Dugué [EMAIL PROTECTED] Signed-off-by: Laurent Vivier [EMAIL PROTECTED] Signed-off-by: Bharata B Rao [EMAIL PROTECTED] --- arch/i386/kernel/syscall_table.S |1 arch/x86_64/ia32/ia32entry.S |5 - fs/aio.c | 193 ++- fs/compat.c | 125 + include/asm-i386/unistd.h|3 include/asm-x86_64/unistd.h |4 include/linux/aio.h | 14 ++ include/linux/aio_abi.h |5 + include/linux/syscalls.h |2 9 files changed, 307 insertions(+), 45 deletions(-) Index: linux-2.6.20-rc6-mm3/arch/i386/kernel/syscall_table.S === --- linux-2.6.20-rc6-mm3.orig/arch/i386/kernel/syscall_table.S 2007-01-31 15:31:59.0 +0100 +++ linux-2.6.20-rc6-mm3/arch/i386/kernel/syscall_table.S 2007-01-31 15:58:55.0 +0100 @@ -320,3 +320,4 @@ ENTRY(sys_call_table) .long sys_getcpu .long sys_epoll_pwait .long sys_lutimesat /* 320 */ + .long sys_lio_submit Index: linux-2.6.20-rc6-mm3/arch/x86_64/ia32/ia32entry.S === --- linux-2.6.20-rc6-mm3.orig/arch/x86_64/ia32/ia32entry.S 2007-01-31 15:31:59.0 +0100 +++ linux-2.6.20-rc6-mm3/arch/x86_64/ia32/ia32entry.S 2007-01-31 15:58:55.0 +0100 @@ -714,8 +714,11 @@ ia32_sys_call_table: .quad compat_sys_get_robust_list .quad sys_splice .quad sys_sync_file_range - .quad sys_tee + .quad sys_tee /* 315 */ .quad compat_sys_vmsplice .quad compat_sys_move_pages .quad sys_getcpu + .quad quiet_ni_syscall /* sys_epoll_wait */ + .quad quiet_ni_syscall /* 320: sys_lutimesat */ + .quad compat_sys_lio_submit ia32_syscall_end: Index: linux-2.6.20-rc6-mm3/fs/aio.c === --- linux-2.6.20-rc6-mm3.orig/fs/aio.c 2007-01-31 15:31:59.0 +0100 +++ linux-2.6.20-rc6-mm3/fs/aio.c 2007-01-31 16:00:57.0 +0100 @@ -416,6 +416,7 @@ static struct kiocb fastcall *__aio_get_ req-ki_ctx = ctx; req-ki_cancel = NULL; req-ki_retry = NULL; + req-ki_lio = NULL; req-ki_dtor = NULL; req-private = NULL; req-ki_iovec = NULL; @@ -995,6 +996,72 @@ out_unlock: return -EINVAL; } +/* + * lio_notify + * When all iocbs belonging to an lio are complete, this initiates + * the completion notification for the lio. + * + * NOTE: lio is freed here after the last iocb completes, but only + * for LIO_NOWAIT case. In case of LIO_WAIT, the submitting thread + * waits for iocbs to complete and later frees the lio. + */ +void lio_notify(struct lio_event *lio) +{ + int ret; + + ret = atomic_dec_and_test(lio-lio_users); + + if (unlikely(ret) lio-lio_notify.notify != SIGEV_NONE) { + /* last one - notify process */ + if (aio_send_signal(lio-lio_notify)) + __sigqueue_free(lio-lio_notify.sigq); + kfree(lio); + } +} + +/* + * lio_create + * Allocates a lio_event structure which groups the iocbs belonging + * to the lio and sets up the completion notification. + * + * Case 1: LIO_NOWAIT and NULL sigevent - No need to group
[PATCH -mm 6/7][AIO] - AIO completion signal notification
From: Sébastien Dugué [EMAIL PROTECTED] AIO completion signal notification The current 2.6 kernel does not support notification of user space via an RT signal upon an asynchronous IO completion. The POSIX specification states that when an AIO request completes, a signal can be delivered to the application as notification. This patch adds a struct sigevent *aio_sigeventp to the iocb. The relevant fields (pid, signal number and value) are stored in the kiocb for use when the request completes. That sigevent structure is filled by the application as part of the AIO request preparation. Upon request completion, the kernel notifies the application using those sigevent parameters. If SIGEV_NONE has been specified, then the old behaviour is retained and the application must rely on polling the completion queue using io_getevents(). A struct sigevent *aio_sigeventp field is added to struct iocb in include/linux/aio_abi.h A struct aio_notify containing the sigevent parameters is defined in aio.h: struct aio_notify { struct task_struct *target; __u16 signo; __u16 notify; sigval_tvalue; }; A struct aio_notify ki_notify is added to struct kiocb in include/linux/aio.h In io_submit_one(), if the application provided a sigevent then aio_setup_sigevent() is called which does the following: - check access to the user sigevent and make a local copy - if the requested notification is SIGEV_NONE, then nothing to do - fill in the kiocb-ki_notify fields (notify, signo, value) - check sigevent consistency, get the signal target task and save it in kiocb-ki_notify.target - preallocate a sigqueue for this event using __sigqueue_alloc() Upon request completion, in aio_complete(), if notification is needed for this request (iocb-ki_notify.notify != SIGEV_NONE), then aio_send_signal() is called to signal the target task as follows: - fill in the siginfo struct to be sent to the task - if notify is SIGEV_THREAD_ID then send signal to specific task using send_sigqueue() - else send signal to task group using send_5group_sigqueue() Notes concerning sigqueue preallocation: To ensure reliable delivery of completion notification, the sigqueue is preallocated in the submission path so that there is no chance it can fail in the completion path. Unlike the posix-timers case (currently the single other user of sigqueue preallocation), where the sigqueue is allocated for the lifetime of the timer and freed at timer destruction time, the aio case is a bit more tricky due to the async nature of the whole thing. In the aio case, the sigqueue exists for the lifetime of the request, therefore it must be freed only once the signal for the request completion has been delivered. The sigqueue is therefore freed by the signal delivery mechanism when the signal is collected. The aio submission and completion error paths do an explicit __sigqueue_free() to free the sigqueue. fs/aio.c| 114 ++-- fs/compat.c | 18 +++ include/linux/aio.h | 12 + include/linux/aio_abi.h |3 - kernel/signal.c |4 - 5 files changed, 144 insertions(+), 7 deletions(-) Signed-off-by: Sébastien Dugué [EMAIL PROTECTED] Signed-off-by: Laurent Vivier [EMAIL PROTECTED] Index: linux-2.6.20-rc6-mm3/fs/aio.c === --- linux-2.6.20-rc6-mm3.orig/fs/aio.c 2007-01-30 11:32:42.0 +0100 +++ linux-2.6.20-rc6-mm3/fs/aio.c 2007-01-30 11:41:41.0 +0100 @@ -419,6 +419,7 @@ static struct kiocb fastcall *__aio_get_ req-ki_dtor = NULL; req-private = NULL; req-ki_iovec = NULL; + req-ki_notify.sigq = NULL; INIT_LIST_HEAD(req-ki_run_list); /* Check if the completion queue has enough free space to @@ -465,6 +466,12 @@ static inline void really_put_req(struct req-ki_dtor(req); if (req-ki_iovec != req-ki_inline_vec) kfree(req-ki_iovec); + + /* Release task ref */ + if (req-ki_notify.notify == SIGEV_THREAD_ID || + req-ki_notify.notify == SIGEV_SIGNAL) + put_task_struct(req-ki_notify.target); + kmem_cache_free(kiocb_cachep, req); ctx-reqs_active--; @@ -916,6 +923,78 @@ void fastcall kick_iocb(struct kiocb *io } EXPORT_SYMBOL(kick_iocb); +static int aio_send_signal(struct aio_notify *notify) +{ + struct sigqueue *sigq = notify-sigq; + struct siginfo *info = sigq-info; + int ret; + + memset(info, 0, sizeof(struct siginfo)); + + info-si_signo = notify-signo; + info-si_errno = 0; + info-si_code = SI_ASYNCIO; + info-si_pid = 0; + info-si_uid = 0; + info-si_value = notify-value
[PATCH -mm 2/7][AIO] - fix aio.h includes
From: Sébastien Dugué [EMAIL PROTECTED] Fix the double inclusion of linux/uio.h in linux/aio.h aio.h |1 - 1 file changed, 1 deletion(-) Signed-off-by: Sébastien Dugué [EMAIL PROTECTED] Index: linux-2.6.20-rc6-mm3/include/linux/aio.h === --- linux-2.6.20-rc6-mm3.orig/include/linux/aio.h 2007-01-30 10:05:08.0 +0100 +++ linux-2.6.20-rc6-mm3/include/linux/aio.h2007-01-30 10:11:22.0 +0100 @@ -7,7 +7,6 @@ #include linux/uio.h #include asm/atomic.h -#include linux/uio.h #define AIO_MAXSEGS4 #define AIO_KIOGRP_NR_ATOMIC 8 - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm 5/7][AIO] - Make __sigqueue_free() and __sigqueue_alloc() non static
From: Sébastien Dugué [EMAIL PROTECTED] Make __sigqueue_alloc() and __sigqueue_free() non static Allow subsystems to directly call into __sigqueue_alloc() and __sigqueue_free. This is used by the AIO signal notification patch. include/linux/signal.h |3 +++ kernel/signal.c|6 +++--- 2 files changed, 6 insertions(+), 3 deletions(-) Signed-off-by: Sébastien Dugué [EMAIL PROTECTED] Index: linux-2.6.20-rc6-mm3/include/linux/signal.h === --- linux-2.6.20-rc6-mm3.orig/include/linux/signal.h2007-01-30 11:41:36.0 +0100 +++ linux-2.6.20-rc6-mm3/include/linux/signal.h 2007-01-30 11:41:39.0 +0100 @@ -241,6 +241,9 @@ extern int sigprocmask(int, sigset_t *, struct pt_regs; extern int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka, struct pt_regs *regs, void *cookie); extern struct task_struct * sigevent_find_task(sigevent_t *); +extern struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags, +int override_rlimit); +extern void __sigqueue_free(struct sigqueue *q); extern struct kmem_cache *sighand_cachep; Index: linux-2.6.20-rc6-mm3/kernel/signal.c === --- linux-2.6.20-rc6-mm3.orig/kernel/signal.c 2007-01-30 11:41:36.0 +0100 +++ linux-2.6.20-rc6-mm3/kernel/signal.c2007-01-30 11:41:39.0 +0100 @@ -268,8 +268,8 @@ next_signal(struct sigpending *pending, return sig; } -static struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags, -int override_rlimit) +struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags, + int override_rlimit) { struct sigqueue *q = NULL; struct user_struct *user; @@ -295,7 +295,7 @@ static struct sigqueue *__sigqueue_alloc return(q); } -static void __sigqueue_free(struct sigqueue *q) +void __sigqueue_free(struct sigqueue *q) { if (q-flags SIGQUEUE_PREALLOC) return; - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm 3/7][AIO] - Fix access_ok() checks
From: Sébastien Dugué [EMAIL PROTECTED] Fix access_ok() checks in sys_io_submit() and compat_sys_io_submit() sys_io_submit() and compat_sys_io_submit() check for the number of requests (nr) being positive, but the following access_ok() call uses nr * sizeof(ptr) which can overlow to a negative number. Just make sure that nr * sizeof(ptr) is positive in the first place. aio.c|2 +- compat.c |2 +- 2 files changed, 2 insertions(+), 2 deletions(-) Signed-off-by: Sébastien Dugué [EMAIL PROTECTED] Index: linux-2.6.20-rc6-mm3/fs/aio.c === --- linux-2.6.20-rc6-mm3.orig/fs/aio.c 2007-01-30 11:28:56.0 +0100 +++ linux-2.6.20-rc6-mm3/fs/aio.c 2007-01-30 11:32:42.0 +0100 @@ -1630,7 +1630,7 @@ asmlinkage long sys_io_submit(aio_contex long ret = 0; int i; - if (unlikely(nr 0)) + if (unlikely((nr*sizeof(*iocbpp)) 0)) return -EINVAL; if (unlikely(!access_ok(VERIFY_READ, iocbpp, (nr*sizeof(*iocbpp) Index: linux-2.6.20-rc6-mm3/fs/compat.c === --- linux-2.6.20-rc6-mm3.orig/fs/compat.c 2007-01-30 11:30:29.0 +0100 +++ linux-2.6.20-rc6-mm3/fs/compat.c2007-01-30 11:31:55.0 +0100 @@ -651,7 +651,7 @@ compat_sys_io_submit(aio_context_t ctx_i long ret = 0; int i; - if (unlikely(nr 0)) + if (unlikely((nr * sizeof(u32)) 0)) return -EINVAL; if (unlikely(!access_ok(VERIFY_READ, iocb, (nr * sizeof(u32) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: + aio-completion-signal-notification.patch added to -mm tree
On Fri, 26 Jan 2007 14:52:33 +0300 Oleg Nesterov <[EMAIL PROTECTED]> wrote: > On 01/26, S?bastien Dugu? wrote: > > > > On Thu, 25 Jan 2007 19:21:41 +0300 Oleg Nesterov <[EMAIL PROTECTED]> wrote: > > > > > > + target = good_sigevent(); > > > > + > > > > + if (unlikely(!target || (target->flags & PF_EXITING))) > > > > + goto out_unlock; > > > > > > PF_EXITING check is racy and unneded. In fact, it is wrong. If the main > > > thread is already died, we can only use SIGEV_THREAD_ID signals, because > > > otherwise good_sigevent() returns ->group_leader. > > > > Care to explain here please, I'm not following you. > > My apologies, I was unclear. > > This check is racy, the condition could be changed right after the check. > > It is unneeded, it is ok to do send_sigqueue(tsk) if if that task is already > dead. (we hold the reference to task_struct). > > Now suppose that the main thread (->group_leader) already exited. This is > normal, the thread group is still alive, it should be ok to send a signal to > it via send_group_sigqueue(). But we can't: without SIGEV_THREAD_ID in > ->sigev_notify good_event() returns ->group_leader, and it has PF_EXITING. Thanks, I understand the problem now. I will fix this. > > Yes, kernel/posix-timers.c needs a cleanup too. But please note that it does > this check for another reason (according to the comment). This reason is not > valid now, the callsite for exit_itimers() was moved from __exit_signal() to > do_exit(). > > > > > + if (iocb->ki_notify.notify != SIGEV_NONE) { > > > > + ret = aio_send_signal(>ki_notify); > > > > + > > > > + /* If signal generation failed, release the sigqueue */ > > > > + if (ret) > > > > + sigqueue_free(iocb->ki_notify.sigq); > > > > > > We should not use sigqueue_free() here. It takes current->sighand->siglock > > > to remove sigqueue from "struct sigpending". But current is just a > > > "random" > > > process here. > > > > > > Yes, if I understand this patch correctly, it is not possible that this > > > sigqueue is pending, but still this is bad imho. > > > > Yes, in fact the sigqueue is used for a single signal delivery and then > > free. In fact I could have used directly __sigqueue_free() instead here > > except for the fact that it's private to signal.c and I'm reluctant > > to export it to other subsystems. > > I personally think it is better to export __sigqueue_free() even if > sigqueue_free() > happens to work. It is to fragile imho to reference current->sighand. At least > we need a fat comment. OK. > > > > > static void __sigqueue_free(struct sigqueue *q) > > > > { > > > > - if (q->flags & SIGQUEUE_PREALLOC) > > > > + if (q->flags & SIGQUEUE_PREALLOC && q->info.si_code != > > > > SI_ASYNCIO) > > > > return; > > > > > > Oh, this is not nice. Could we change send_sigqueue/send_group_sigqueue > > > instead ? > > > > Yep, that's the other solution. > > > > > > > > - BUG_ON(!(q->flags & SIGQUEUE_PREALLOC)); > > > + BUG_ON(!(q->flags & SIGQUEUE_PREALLOC) && q->info.si_code != > > > SI_ASYNCIO); > > > > > > This way aio can use __sigqueue_alloc/__sigqueue_free directly and forget > > > about SIGQUEUE_PREALLOC. > > > > Well, I don't think it's cleaner. The aio error path calls sigqueue_free() > > directly whereas in case of success sigqueue_free() is called from the > > signal > > delivery path. > > Hmm... now I don't understand you. Of course, the aio error path should use > __sigqueue_free() if we don't use SIGQUEUE_PREALLOC (and imho we should not). > > And the signal delivery path uses __sigqueue_free() too. > > ? > > > > I'd suggest to not use this interface. Just use group_send_sig_info() or > > > specific_send_sig_info(). Yes, this way we will do GFP_ATOMIC allocation > > > of sigqueue in interrupt context, but is this so bad in this case? > > > > Well, the thihere is that in the past we used group_send_sig_info() > > and specific_send_sig_info() for notification but Zach Brown raised > > the question about reliable signal delivery. IOW an aio submission > > should not succeed if signal delivery is going to fail. Hence the > > use of the preallocated sigqueue. > > Ok, I see, thanks. > > Oleg. > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: + aio-completion-signal-notification.patch added to -mm tree
On Thu, 25 Jan 2007 19:21:41 +0300 Oleg Nesterov <[EMAIL PROTECTED]> wrote: > Sebastien Dugue wrote: > > > > +static long aio_setup_sigevent(struct aio_notify *notify, > > + struct sigevent __user *user_event) > > +{ > > + sigevent_t event; > > + struct task_struct *target; > > + > > + if (copy_from_user(, user_event, sizeof(event))) > > + return -EFAULT; > > + > > + if (event.sigev_notify == SIGEV_NONE) > > + return 0; > > + > > + notify->notify = event.sigev_notify; > > + notify->signo = event.sigev_signo; > > + notify->value = event.sigev_value; > > + > > + read_lock(_lock); > > We don't need tasklist_lock, we can use rcu_read_lock() instead. Right that will be beneficial, will change. > > > + target = good_sigevent(); > > + > > + if (unlikely(!target || (target->flags & PF_EXITING))) > > + goto out_unlock; > > PF_EXITING check is racy and unneded. In fact, it is wrong. If the main > thread is already died, we can only use SIGEV_THREAD_ID signals, because > otherwise good_sigevent() returns ->group_leader. Care to explain here please, I'm not following you. > > > @@ -994,6 +1077,15 @@ int fastcall aio_complete(struct kiocb * > > kunmap_atomic(ring, KM_IRQ1); > > > > pr_debug("added to ring %p at [%lu]\n", iocb, tail); > > + > > + if (iocb->ki_notify.notify != SIGEV_NONE) { > > + ret = aio_send_signal(>ki_notify); > > + > > + /* If signal generation failed, release the sigqueue */ > > + if (ret) > > + sigqueue_free(iocb->ki_notify.sigq); > > We should not use sigqueue_free() here. It takes current->sighand->siglock > to remove sigqueue from "struct sigpending". But current is just a "random" > process here. > > Yes, if I understand this patch correctly, it is not possible that this > sigqueue is pending, but still this is bad imho. Yes, in fact the sigqueue is used for a single signal delivery and then free. In fact I could have used directly __sigqueue_free() instead here except for the fact that it's private to signal.c and I'm reluctant to export it to other subsystems. > > > static void __sigqueue_free(struct sigqueue *q) > > { > > - if (q->flags & SIGQUEUE_PREALLOC) > > + if (q->flags & SIGQUEUE_PREALLOC && q->info.si_code != SI_ASYNCIO) > > return; > > Oh, this is not nice. Could we change send_sigqueue/send_group_sigqueue > instead ? Yep, that's the other solution. > > - BUG_ON(!(q->flags & SIGQUEUE_PREALLOC)); > + BUG_ON(!(q->flags & SIGQUEUE_PREALLOC) && q->info.si_code != > SI_ASYNCIO); > > This way aio can use __sigqueue_alloc/__sigqueue_free directly and forget > about SIGQUEUE_PREALLOC. Well, I don't think it's cleaner. The aio error path calls sigqueue_free() directly whereas in case of success sigqueue_free() is called from the signal delivery path. > > On the other hand, imho this patch takes a wrong direction. > > The purpose of SIGQUEUE_PREALLOC + send_sigqueue() is to re-use the same > sigqueue while sending a stream of signals. But in aio case we allocate > sigqueue to send only 1 signal, then it freed after the delivery like > the regular sigqueue. So what is the point? > > I'd suggest to not use this interface. Just use group_send_sig_info() or > specific_send_sig_info(). Yes, this way we will do GFP_ATOMIC allocation > of sigqueue in interrupt context, but is this so bad in this case? Well, the thihere is that in the past we used group_send_sig_info() and specific_send_sig_info() for notification but Zach Brown raised the question about reliable signal delivery. IOW an aio submission should not succeed if signal delivery is going to fail. Hence the use of the preallocated sigqueue. Thanks, Sébastien. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/