Re: [3/5] 2.6.21-rc4: known regressions (v2)
Bjorn Helgaas <[EMAIL PROTECTED]> writes:

> The main reason we wait until pci_enable_device() to allocate an
> IRQ number is that ia64 currently only has about 180 device vectors,
> and there are machines with more PCI slots than that.

If we don't reserve irqs that the hardware doesn't support, we should be
able to simply move the allocation and have about the same cost as we
do today.

> I also think it's nice that we don't do anything with a device until
> we have a driver to claim it. But there certainly have been cases
> where delaying IRQ allocation has caused troubles.

Agreed. It is the second call to pci_enable_device() by a driver where
things really start to unravel in the wait-until-we-need-it plan.

> I really like the idea of moving to the IRQ == GSI model for ia64.
> But of course, we'll have to get rid of the 180-vector limit to
> make that work, too.

Mostly that is a matter of porting the code from x86_64, where that is
already the case. I'm pretty certain I have worked through all of the
big issues, but there might be a few small problems that crop up.

If need be I will do the patches as I find time. But if someone else
gets there before I do, that would be great :)

Eric

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Monday 02 April 2007 09:38, Bjorn Helgaas wrote:
> The main reason we wait until pci_enable_device() to allocate an
> IRQ number is that ia64 currently only has about 180 device vectors,
> and there are machines with more PCI slots than that.

Sigh, that didn't make much sense, did it? At the time, ia64 didn't
support sharing IRQ vectors, and we preallocated four vectors for every
slot, including empty ones. Allocate-on-demand dramatically increased
the number of devices we could support because most cards use only one
IRQ.
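To put rough numbers on the point above, here is a small sketch of the arithmetic. It assumes exactly 180 device vectors (the thread only says "about 180"), and the helper name max_devices is made up for this illustration; it is not a kernel function.

```c
#include <assert.h>

/* Assumption: the thread says "about 180" device vectors on ia64;
 * 180 is used here purely as a round illustrative figure. */
enum { IA64_DEVICE_VECTORS = 180 };

/* Hypothetical helper: how many slots or devices fit into the vector
 * budget if each one consumes vectors_per_device unshared vectors. */
unsigned max_devices(unsigned total_vectors, unsigned vectors_per_device)
{
    assert(vectors_per_device > 0);
    return total_vectors / vectors_per_device;
}
```

Under these assumptions, max_devices(IA64_DEVICE_VECTORS, 4) gives 45 slots when four vectors are preallocated per slot (empty or not), while max_devices(IA64_DEVICE_VECTORS, 1) gives 180 devices when each card takes a single vector on demand.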
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Monday 26 March 2007 21:29, Eric W. Biederman wrote:
> "Luck, Tony" <[EMAIL PROTECTED]> writes:
>
> >> What I'm proposing we do is move the irq allocation code out of
> >> pci_enable_device and the irq freeing code out of pci_disable_device
> >> in the future.
> >
> > Sounds rational ... in a world that wasn't dominated by PCI it would
> > seem to be the logical approach (since the irq code would have much
> > more utility independent of the PCI code).
>
> Right. We can even do this earlier in the pci code. Just doing this
> on demand when the device driver needs it is problematic. As devices
> drivers like to keep the requested over a pci_disable_device pci_enable_device
> pair.
>
> The big practical issue is that we will like wind up allocating an irq
> number to all usable irqs on ia64. Which means we will like need many
> more irq numbers... Although I guess if we keep it at the pci layer
> we should be fairly safe.

The main reason we wait until pci_enable_device() to allocate an
IRQ number is that ia64 currently only has about 180 device vectors,
and there are machines with more PCI slots than that.

I also think it's nice that we don't do anything with a device until
we have a driver to claim it. But there certainly have been cases
where delaying IRQ allocation has caused troubles.

I really like the idea of moving to the IRQ == GSI model for ia64.
But of course, we'll have to get rid of the 180-vector limit to
make that work, too.

Bjorn
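The trouble Eric describes, a driver keeping its irq requested across a pci_disable_device()/pci_enable_device() pair, can be pictured with a toy model. This is only a sketch and not kernel code: every name below (toy_dev, toy_allocator and the toy_* functions) is hypothetical, and the model assumes that each on-demand enable hands out a fresh number, which real hardware and the real allocator may or may not do.

```c
#include <assert.h>

#define TOY_NO_IRQ (-1)  /* hypothetical "nothing allocated" marker */

/* Toy allocator: hands out increasing numbers and never reuses one
 * (a simplifying assumption for this illustration). */
struct toy_allocator {
    int next_irq;
};

struct toy_dev {
    int irq;          /* number currently assigned to the device */
    int handler_irq;  /* number the driver's handler was registered on */
};

int toy_alloc_irq(struct toy_allocator *a)
{
    return a->next_irq++;
}

/* On-demand model: enabling the device allocates its irq number. */
void toy_enable_on_demand(struct toy_allocator *a, struct toy_dev *d)
{
    d->irq = toy_alloc_irq(a);
}

/* On-demand model: disabling the device gives the number back. */
void toy_disable_on_demand(struct toy_dev *d)
{
    d->irq = TOY_NO_IRQ;
}

/* The driver registers its handler on whatever number it sees now. */
void toy_request_irq(struct toy_dev *d)
{
    assert(d->irq != TOY_NO_IRQ);
    d->handler_irq = d->irq;
}

/* Nonzero when the handler sits on a number the device no longer has. */
int toy_handler_is_stale(const struct toy_dev *d)
{
    return d->handler_irq != d->irq;
}
```

In this model, a driver that calls enable, requests its irq, then goes through a disable/enable pair without re-requesting ends up with toy_handler_is_stale() returning nonzero. If the number were instead assigned once, before any driver touches the device, and left alone by enable and disable, the handler would stay valid; that is the direction proposed in the thread.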
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Adrian Bunk wrote:

> > > Does setting CONFIG_PCI_MSI=n make any difference?
> >
> > Yes, it does. The hanging resume problem went away.
>
> Thanks for testing.
>
> If you enable it again, does the patch from [1] also fix it?

Yes, it appears to fix it.

Marcus
Re: [3/5] 2.6.21-rc4: known regressions (v2)
"Luck, Tony" <[EMAIL PROTECTED]> writes:

>> What I'm proposing we do is move the irq allocation code out of
>> pci_enable_device and the irq freeing code out of pci_disable_device
>> in the future.
>
> Sounds rational ... in a world that wasn't dominated by PCI it would
> seem to be the logical approach (since the irq code would have much
> more utility independent of the PCI code).

Right. We can even do this earlier in the pci code. Just doing this
on demand when the device driver needs it is problematic, as device
drivers like to keep the irq requested across a
pci_disable_device/pci_enable_device pair.

The big practical issue is that we will likely wind up allocating an irq
number to all usable irqs on ia64. Which means we will likely need many
more irq numbers... Although I guess if we keep it at the pci layer
we should be fairly safe.

I was afraid there was some hotplug reason for waiting until
pci_enable_device to allocate the irq numbers.

>> Tony, Len before we merge any fixes for 2.6.21-rcX I'd like to at
>> least get an ack on the long term direction.
>
> Long-term-direction-acked-by: Tony Luck <[EMAIL PROTECTED]>

Thanks. Then small surgery will happen now, and I will start queuing up
the major surgery patches. Although I won't be able to do more than
compile test and code review the ia64 changes.

Eric
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Sunday, 25 March 2007 22:37, Eric W. Biederman wrote:
> "Rafael J. Wysocki" <[EMAIL PROTECTED]> writes:
>
> > On Sunday, 25 March 2007 14:56, Eric W. Biederman wrote:
> >> "Rafael J. Wysocki" <[EMAIL PROTECTED]> writes:
> >>
> >> > Yes, in kernel/power/disk.c:power_down() .
> >> >
> >> > Please comment out the disable_nonboot_cpus() in there and retest (but please
> >> > test the latest Linus' tree).
> >>
> >> Why do we even need a disable_nonboot_cpus in that path? machine_shutdown
> >> on i386 and x86_64 should take care of that. Further the code that computes
> >> the boot cpu is bogus (not all architectures require cpu == 0 to be
> >> the boot cpu), and disabling non boot cpus appears to be a strong
> >> x86ism, in the first place.
> >
> > Yes.
> >
> >> If the only reason for disable_nonboot_cpus there is to avoid the
> >> WARN_ON in init_low_mappings() we should seriously consider killing
> >> it.
> >
> > We have considered it, but no one was sure that it was a good idea.
>
> The problem with the current init_low_mappings is that it hacks the
> current page table. If we can instead use a different page table
> the code becomes SMP safe.

What exactly is the danger here?

> I have extracted the patch that addresses this from the relocatable
> patchset and appended it for sparking ideas. It goes a little
> farther than we need to solve this issue but the basics are there.
>
> >> If we can wait for 2.6.22 the relocatable x86_64 patchset that
> >> Andi has queued, has changes that kill the init_low_mapping() hack.
> >
> > I think we should kill the WARN_ON() right now, perhaps replacing it with
> > a FIXME comment.
>
> Reasonable.
>
> >> I'm not very comfortable with calling cpu_down in a common code path
> >> right now either. I'm fairly certain we still don't have that
> >> correct. So if we confine the mess that is cpu_down to #if
> >> defined(CPU_HOTPLUG) && defined(CONFIG_EXPERIMENTAL) I don't care.
> >> If we start using it everywhere I'm very nervous.
> >> migration when bringing a cpu down is strongly racy, and I don't think
> >> we actually put cpus to sleep properly either.
> >
> > I'm interested in all of the details, please. I seriously consider dropping
> > cpu_up()/cpu_down() from the suspend code paths.
>
> So I'm not certain if in a multiple cpu context we can avoid all of the
> issues with cpu hotplug but there is a reasonable chance so I will
> explain as best I can.
>
> Yanking the appropriate code out of linuxbios the way a processor should stop
> itself is to send an INIT IPI to itself. This puts a cpu into an optimized
> wait for startup IPI state where it is otherwise disabled. This is the state
> any sane BIOS will put the cpus into before control is handed off to the
> kernel.
>
> > static inline void stop_this_cpu(void)
> > {
> > 	unsigned apicid;
> > 	apicid = lapicid();
> >
> > 	/* Send an APIC INIT to myself */
> > 	lapic_write(LAPIC_ICR2, SET_LAPIC_DEST_FIELD(apicid));
> > 	lapic_write(LAPIC_ICR, LAPIC_INT_LEVELTRIG | LAPIC_INT_ASSERT |
> > 		LAPIC_DM_INIT);
> > 	/* Wait for the ipi send to finish */
> > 	lapic_wait_icr_idle();
> >
> > 	/* Deassert the APIC INIT */
> > 	lapic_write(LAPIC_ICR2, SET_LAPIC_DEST_FIELD(apicid));
> > 	lapic_write(LAPIC_ICR, LAPIC_INT_LEVELTRIG | LAPIC_DM_INIT);
> > 	/* Wait for the ipi send to finish */
> > 	lapic_wait_icr_idle();
> >
> > 	/* If I haven't halted spin forever */
> > 	for(;;) {
> > 		hlt();
> > 	}
> > }
>
> I'm not certain what to do with the interrupt races. But I will see
> if I can explain what I know.
>
> - Most ioapics are buggy.
> - Most ioapics do not follow pci-ordering rules with respect to
>   interrupt message deliver so ensuring all in-flight irqs have
>   arrived somewhere is very hard.
> - To avoid bugs we always limit ourselves to reprogramming the ioapics
>   in the interrupt handler, and not considering an interrupt
>   successfully reprogrammed until we have received an irq in the new
>   location.
> - On x86 we have two basic interrupt handling modes.
>   o logical addressing with lowest priority delivery.
>   o physical addressing with delivery to a single cpu.
> - With logical addressing as long as the cpu is not available for
>   having an interrupt delivered to it the interrupt will be
>   never be delivered to a particular cpu. Ideally we also update
>   the mask in the ioapic to not target that cpu.
> - With physical addressing targeting a single cpu we need to reprogram
>   the ioapics not to target that specific cpu. This needs to happen
>   in the interrupt handler and we need to wait for the next interrupt
>   before we tear down our data structures for handling the interrupt.
>
> The current cpu hotplug code attempts to reprogram the ioapics from
> process context which is just wrong.

I wasn't aware of that.

> Now as
RE: [3/5] 2.6.21-rc4: known regressions (v2)
> What I'm proposing we do is move the irq allocation code out of
> pci_enable_device and the irq freeing code out of pci_disable_device
> in the future.

Sounds rational ... in a world that wasn't dominated by PCI it would
seem to be the logical approach (since the irq code would have much
more utility independent of the PCI code).

> Tony, Len before we merge any fixes for 2.6.21-rcX I'd like to at
> least get an ack on the long term direction.

Long-term-direction-acked-by: Tony Luck <[EMAIL PROTECTED]>

-Tony
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Mon, Mar 26, 2007 at 09:39:29PM +0200, Frederic Riss wrote:
> 2007/3/26, Adrian Bunk <[EMAIL PROTECTED]>:
> >On Mon, Mar 26, 2007 at 08:53:20PM +0200, Frédéric Riss wrote:
> >
> >>... (In fact it hangs at the second suspend, but that's another ATA
> >> problem that I think has already been reported).
> >
> >This sounds like the MSI problem.
> >
> >Do you have CONFIG_PCI_MSI enabled?
> >If yes, does disabling it fix it?
>
> Yes
>
> >If yes, does CONFIG_PCI_MSI=y with the patch from [1] work?
>
> Yes !

Thanks for your testing.

> Just to be 100% clear, the hang I was seeing at the second suspend is this
> one:
>
> Subject: second suspend to disk in a row results in an oops (libata?)
> References : http://lkml.org/lkml/2007/3/17/43
> Submitter : Thomas Meyer <[EMAIL PROTECTED]>
> Status : unknown
>
> I'm not sure it was associated with the MSI issue yet.

That's what I was calling "the MSI problem".

Since my latest regression list, the actual cause and a patch have been
found.

> Thanks a lot,
> Fred
>
> >[1] http://lkml.org/lkml/2007/3/24/136

cu
Adrian

-- 
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
Re: [3/5] 2.6.21-rc4: known regressions (v2)
2007/3/26, Adrian Bunk <[EMAIL PROTECTED]>:
> On Mon, Mar 26, 2007 at 08:53:20PM +0200, Frédéric Riss wrote:
>
> >... (In fact it hangs at the second suspend, but that's another ATA
> > problem that I think has already been reported).
>
> This sounds like the MSI problem.
>
> Do you have CONFIG_PCI_MSI enabled?
> If yes, does disabling it fix it?

Yes

> If yes, does CONFIG_PCI_MSI=y with the patch from [1] work?

Yes !

Just to be 100% clear, the hang I was seeing at the second suspend is
this one:

Subject: second suspend to disk in a row results in an oops (libata?)
References : http://lkml.org/lkml/2007/3/17/43
Submitter : Thomas Meyer <[EMAIL PROTECTED]>
Status : unknown

I'm not sure it was associated with the MSI issue yet.

Thanks a lot,
Fred

> [1] http://lkml.org/lkml/2007/3/24/136
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Mon, Mar 26, 2007 at 08:53:20PM +0200, Frédéric Riss wrote:

>... (In fact it hangs at the second suspend, but that's another ATA
> problem that I think has already been reported).

This sounds like the MSI problem.

Do you have CONFIG_PCI_MSI enabled?
If yes, does disabling it fix it?
If yes, does CONFIG_PCI_MSI=y with the patch from [1] work?

> Fred.

cu
Adrian

[1] http://lkml.org/lkml/2007/3/24/136
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Monday 26 March 2007 at 11:14 +0200, Thomas Gleixner wrote:
> On Mon, 2007-03-26 at 08:45 +0200, Frédéric RISS wrote:
> > Additional data point: I just tried with -rc5 and the issue is still
> > present. The config I used for this test defines neither NO_HZ nor
> > HIGH_RES_TIMERS.
>
> Do you have CONFIG_HPET_TIMER enabled and does the box have one ?
> If yes, can you please turn it off and retry ?

Indeed, turning off CONFIG_HPET_TIMER does fix the coming-out-of-suspend
issue.

(In fact it hangs at the second suspend, but that's another ATA
problem that I think has already been reported).

Fred.
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Mon, Mar 26, 2007 at 07:42:51PM +0200, Marcus Better wrote:
> Adrian Bunk wrote:
>
> > > > Subject: ThinkPad R60: suspend to disk broken
>
> > Does setting CONFIG_PCI_MSI=n make any difference?
>
> Yes, it does. The hanging resume problem went away.

Thanks for testing.

If you enable it again, does the patch from [1] also fix it?

> (The display corruption and the instant resume were not affected.)

Not a surprise - it's currently quite common that people run into
several distinct suspend regressions...

> Marcus

cu
Adrian

[1] http://lkml.org/lkml/2007/3/24/136 [2]
[2] x86_64 uses arch/i386/pci/common.c
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Adrian Bunk wrote:
> > > Subject: ThinkPad R60: suspend to disk broken

> Does setting CONFIG_PCI_MSI=n make any difference?

Yes, it does. The hanging resume problem went away.

(The display corruption and the instant resume were not affected.)

Marcus
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Mon, Mar 26, 2007 at 12:00:22PM +0200, Marcus Better wrote:
> Adrian Bunk wrote:
> > Subject: ThinkPad R60: suspend to disk broken
> > References : http://lkml.org/lkml/2007/3/23/74
> > Submitter : Marcus Better <[EMAIL PROTECTED]>
> > Status : submitter tries to bisect
>
> I just tried -rc5. Now suspend to disk seems to work. I think the XFS
> workqueue patch fixed this.
>
> It can also suspend to RAM, but resume is worse. The first time around it
> resumed but corrupted the vesafb console (greenish blinking character cells),
> something that used to work before. But the system responded to input, so I
> suspended to RAM again. This time the resume failed, it hung after
> printing "Linux!" in yellow at the top of the screen. (Seems to be some
> artifact, I have seen it before even with working suspend.)
>...

Does setting CONFIG_PCI_MSI=n make any difference?

> Marcus

cu
Adrian
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Pavel Machek wrote:
>> > Subject: ThinkPad R60: suspend to disk broken
>> > References : http://lkml.org/lkml/2007/3/23/74

>> input, so I suspended to RAM again. This time the resume failed, it hung
>> after printing "Linux!" in yellow at the top of the screen.

> Yellow Linux! is my debugging trick.

Cute :-)

Here is my bisect log so far, with 98 revisions left. Note that all
kernels have the XFS workqueue patch applied.

~$ git bisect log
git-bisect start
# bad: [6fb04ccf5c5e054c4107090bed6e866489f1089f] Linux 2.6.21-rc5
git-bisect bad 6fb04ccf5c5e054c4107090bed6e866489f1089f
# good: [c8f71b01a50597e298dc3214a2f2be7b8d31170c] Linux 2.6.21-rc1
git-bisect good c8f71b01a50597e298dc3214a2f2be7b8d31170c
# good: [ad5f1196792653dadf09c07a5fa917092b469c1c] ecryptfs: check xattr operation support fix
git-bisect good ad5f1196792653dadf09c07a5fa917092b469c1c
# good: [271368b69b9e8042063d6c713423e84503bbdaa0] Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6
git-bisect good 271368b69b9e8042063d6c713423e84503bbdaa0
# bad: [f5b42c3324494ea3f9bf795e2a7e4d3cbb06c607] KVM: Fix guest sysenter on vmx
git-bisect bad f5b42c3324494ea3f9bf795e2a7e4d3cbb06c607

The bad kernels exhibit the hang on second resume from RAM. The "good"
ones all have the artifact with corrupted display. Moreover, they
resume immediately from every suspend to RAM _after_ a suspend-to-disk,
but not before it. This is when suspending with
"echo mem > /sys/power/state".

> Try vga=0 ... text console seems to work for you.

Ok, will try.

Marcus
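As a rough guide to how much work the "98 revisions left" in the log above represents, the following sketch estimates the remaining number of test builds. It assumes every bisect step cuts the candidate set roughly in half and that no revision has to be skipped; the function name bisect_steps_left is made up for this illustration.

```c
/* Hypothetical helper: worst-case number of further halving steps
 * needed to narrow `remaining` candidate revisions down to one.
 * Assumes each step keeps at most the larger half. */
unsigned bisect_steps_left(unsigned remaining)
{
    unsigned steps = 0;

    while (remaining > 1) {
        /* Keep the larger half: remaining - floor(remaining / 2). */
        remaining -= remaining / 2;
        steps++;
    }
    return steps;
}
```

Under those assumptions, bisect_steps_left(98) evaluates to 7, so on the order of seven more kernel builds and suspend/resume tests would be needed to isolate a single commit.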
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Hi!

> > Subject: ThinkPad R60: suspend to disk broken
> > References : http://lkml.org/lkml/2007/3/23/74
> > Submitter : Marcus Better <[EMAIL PROTECTED]>
> > Status : submitter tries to bisect
>
> I just tried -rc5. Now suspend to disk seems to work. I think the XFS
> workqueue patch fixed this.
>
> It can also suspend to RAM, but resume is worse. The first time around it
> resumed but corrupted the vesafb console (greenish blinking character cells),
> something that used to work before. But the system responded to input, so I
> suspended to RAM again. This time the resume failed, it hung after
> printing "Linux!" in yellow at the top of the screen. (Seems to be some
> artifact, I have seen it before even with working suspend.)

Yellow Linux! is my debugging trick. It should be there, but it should
also disappear quickly.

Try vga=0 ... text console seems to work for you.

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [3/5] 2.6.21-rc4: known regressions (v2)
2007/3/26, Thomas Gleixner <[EMAIL PROTECTED]>:
> On Mon, 2007-03-26 at 08:45 +0200, Frédéric RISS wrote:
> > Additional data point: I just tried with -rc5 and the issue is still
> > present. The config I used for this test defines neither NO_HZ nor
> > HIGH_RES_TIMERS.
>
> Do you have CONFIG_HPET_TIMER enabled and does the box have one ?
> If yes, can you please turn it off and retry ?

IIRC the box has a HPET and it gets used. I'll test and confirm when I
get home tonight.

Thanks,
Fred
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Adrian Bunk wrote:
> Subject: ThinkPad R60: suspend to disk broken
> References : http://lkml.org/lkml/2007/3/23/74
> Submitter : Marcus Better <[EMAIL PROTECTED]>
> Status : submitter tries to bisect

I just tried -rc5. Now suspend to disk seems to work. I think the XFS
workqueue patch fixed this.

It can also suspend to RAM, but resume is worse. The first time around it
resumed but corrupted the vesafb console (greenish blinking character cells),
something that used to work before. But the system responded to input, so I
suspended to RAM again. This time the resume failed, it hung after
printing "Linux!" in yellow at the top of the screen. (Seems to be some
artifact, I have seen it before even with working suspend.)

I'm attaching my config.

Not sure how to bisect this. I guess it would be necessary to keep the XFS
workqueue patch throughout, otherwise it is guaranteed to break.

Marcus

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.21-rc5-melech
# Mon Mar 26 10:56:13 2007
#
CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ZONE_DMA32=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_EARLY_PRINTK=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_DMI=y
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
# CONFIG_TASKSTATS is not set
# CONFIG_UTS_NS is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_CPUSETS is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_EMBEDDED=y
CONFIG_UID16=y
# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_SLAB=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
# CONFIG_SLOB is not set

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y

#
# Block layer
#
CONFIG_BLOCK=y
# CONFIG_BLK_DEV_IO_TRACE is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"

#
# Processor type and features
#
CONFIG_X86_PC=y
# CONFIG_X86_VSMP is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
CONFIG_MCORE2=y
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_MICROCODE=m
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=m
CONFIG_X86_CPUID=m
CONFIG_X86_HT=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_MTRR=y
CONFIG_SMP=y
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_BKL=y
# CONFIG_NUMA is not set
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
# CONFIG_SPARSEMEM_STATIC is not set
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_RESOURCES_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_NR_CPUS=2
CONFIG_HOTPLUG_CPU=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HPET_TIMER=y
CONFIG_IOMMU=y
# CONFIG_CALGARY_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
# CONFIG_X86_MCE_AMD is not set
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_PHYSICAL_START=0x10
CONFIG_SECCOMP=y
CONFIG_CC_STACKPROTECTOR=y
# CONFIG_CC_STACKPROTECTOR_ALL is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
CONFIG_HZ_300=y
# CONFIG_HZ_1000 is not set
CONFIG_HZ=300
CONFIG_REORDER=y
CONFIG_K8_NB=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_ISA_DMA_API=y
CONFIG_GENERIC_PENDING_IRQ=y

#
# Power management options
#
CONFIG_PM=y
#
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Mon, 2007-03-26 at 08:45 +0200, Frédéric RISS wrote:
> Additional data point: I just tried with -rc5 and the issue is still
> present. The config I used for this test defines neither NO_HZ nor
> HIGH_RES_TIMERS.

Do you have CONFIG_HPET_TIMER enabled and does the box have one?

If yes, can you please turn it off and retry?

Thanks,

	tglx

Re: [3/5] 2.6.21-rc4: known regressions (v2)
Adrian Bunk wrote: Subject: ThinkPad R60: suspend to disk broken References : http://lkml.org/lkml/2007/3/23/74 Submitter : Marcus Better [EMAIL PROTECTED] Status : submitter tries to bisect I just tried -rc5. Now suspend to disk seems to work. I think the XFS workqueue patch fixed this. It can also suspend to RAM, but resume is worse. The first time around it resumed but corrupted the vesafb console (greenish blinking character cells), something that used to work before. But the system responded to input, so I suspended to RAM again. This time the resume failed, it hung after printing Linux! in yellow at the top of the screen. (Seems to be some artifact, I have seen it before even with working suspend.) I'm attaching my config. Not sure how to bisect this. I guess it would be necessary to keep the XFS workqueue patch throughout, otherwise it is guaranteed to break. Marcus # # Automatically generated make config: don't edit # Linux kernel version: 2.6.21-rc5-melech # Mon Mar 26 10:56:13 2007 # CONFIG_X86_64=y CONFIG_64BIT=y CONFIG_X86=y CONFIG_GENERIC_TIME=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_ZONE_DMA32=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_MMU=y CONFIG_ZONE_DMA=y CONFIG_RWSEM_GENERIC_SPINLOCK=y CONFIG_GENERIC_HWEIGHT=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_X86_CMPXCHG=y CONFIG_EARLY_PRINTK=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_DMI=y CONFIG_AUDIT_ARCH=y CONFIG_GENERIC_BUG=y # CONFIG_ARCH_HAS_ILOG2_U32 is not set # CONFIG_ARCH_HAS_ILOG2_U64 is not set CONFIG_DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config # # Code maturity level options # CONFIG_EXPERIMENTAL=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 # # General setup # CONFIG_LOCALVERSION= # CONFIG_LOCALVERSION_AUTO is not set CONFIG_SWAP=y CONFIG_SYSVIPC=y # CONFIG_IPC_NS is not set CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_BSD_PROCESS_ACCT=y 
CONFIG_BSD_PROCESS_ACCT_V3=y # CONFIG_TASKSTATS is not set # CONFIG_UTS_NS is not set CONFIG_AUDIT=y CONFIG_AUDITSYSCALL=y CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y # CONFIG_CPUSETS is not set CONFIG_SYSFS_DEPRECATED=y CONFIG_RELAY=y CONFIG_BLK_DEV_INITRD=y CONFIG_INITRAMFS_SOURCE= # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set CONFIG_SYSCTL=y CONFIG_EMBEDDED=y CONFIG_UID16=y # CONFIG_SYSCTL_SYSCALL is not set CONFIG_KALLSYMS=y # CONFIG_KALLSYMS_EXTRA_PASS is not set CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_EPOLL=y CONFIG_SHMEM=y CONFIG_SLAB=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_RT_MUTEXES=y # CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 # CONFIG_SLOB is not set # # Loadable module support # CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y # CONFIG_MODULE_FORCE_UNLOAD is not set CONFIG_MODVERSIONS=y # CONFIG_MODULE_SRCVERSION_ALL is not set CONFIG_KMOD=y CONFIG_STOP_MACHINE=y # # Block layer # CONFIG_BLOCK=y # CONFIG_BLK_DEV_IO_TRACE is not set # # IO Schedulers # CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_AS=y CONFIG_IOSCHED_DEADLINE=y CONFIG_IOSCHED_CFQ=y # CONFIG_DEFAULT_AS is not set # CONFIG_DEFAULT_DEADLINE is not set CONFIG_DEFAULT_CFQ=y # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED=cfq # # Processor type and features # CONFIG_X86_PC=y # CONFIG_X86_VSMP is not set # CONFIG_MK8 is not set # CONFIG_MPSC is not set CONFIG_MCORE2=y # CONFIG_GENERIC_CPU is not set CONFIG_X86_L1_CACHE_BYTES=64 CONFIG_X86_L1_CACHE_SHIFT=6 CONFIG_X86_INTERNODE_CACHE_BYTES=64 CONFIG_X86_TSC=y CONFIG_X86_GOOD_APIC=y CONFIG_MICROCODE=m CONFIG_MICROCODE_OLD_INTERFACE=y CONFIG_X86_MSR=m CONFIG_X86_CPUID=m CONFIG_X86_HT=y CONFIG_X86_IO_APIC=y CONFIG_X86_LOCAL_APIC=y CONFIG_MTRR=y CONFIG_SMP=y # CONFIG_SCHED_SMT is not set CONFIG_SCHED_MC=y # CONFIG_PREEMPT_NONE is not set CONFIG_PREEMPT_VOLUNTARY=y # CONFIG_PREEMPT is not set CONFIG_PREEMPT_BKL=y # CONFIG_NUMA is not set CONFIG_ARCH_SPARSEMEM_ENABLE=y CONFIG_ARCH_FLATMEM_ENABLE=y 
CONFIG_SELECT_MEMORY_MODEL=y CONFIG_FLATMEM_MANUAL=y # CONFIG_DISCONTIGMEM_MANUAL is not set # CONFIG_SPARSEMEM_MANUAL is not set CONFIG_FLATMEM=y CONFIG_FLAT_NODE_MEM_MAP=y # CONFIG_SPARSEMEM_STATIC is not set CONFIG_SPLIT_PTLOCK_CPUS=4 CONFIG_RESOURCES_64BIT=y CONFIG_ZONE_DMA_FLAG=1 CONFIG_NR_CPUS=2 CONFIG_HOTPLUG_CPU=y CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y CONFIG_HPET_TIMER=y CONFIG_IOMMU=y # CONFIG_CALGARY_IOMMU is not set CONFIG_SWIOTLB=y CONFIG_X86_MCE=y CONFIG_X86_MCE_INTEL=y # CONFIG_X86_MCE_AMD is not set CONFIG_KEXEC=y CONFIG_CRASH_DUMP=y CONFIG_PHYSICAL_START=0x10 CONFIG_SECCOMP=y CONFIG_CC_STACKPROTECTOR=y # CONFIG_CC_STACKPROTECTOR_ALL is not set # CONFIG_HZ_100 is not set # CONFIG_HZ_250 is not set CONFIG_HZ_300=y # CONFIG_HZ_1000 is not set CONFIG_HZ=300 CONFIG_REORDER=y CONFIG_K8_NB=y CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_ISA_DMA_API=y CONFIG_GENERIC_PENDING_IRQ=y # # Power management options # CONFIG_PM=y # CONFIG_PM_LEGACY is not
Re: [3/5] 2.6.21-rc4: known regressions (v2)
2007/3/26, Thomas Gleixner <[EMAIL PROTECTED]>:
> On Mon, 2007-03-26 at 08:45 +0200, Frédéric RISS wrote:
> > Additional data point: I just tried with -rc5 and the issue is still
> > present. The config I used for this test defines neither NO_HZ nor
> > HIGH_RES_TIMERS.
>
> Do you have CONFIG_HPET_TIMER enabled and does the box have one?
>
> If yes, can you please turn it off and retry?

IIRC the box has a HPET and it gets used. I'll test and confirm when I
get home tonight.

Thanks,
Fred

Re: [3/5] 2.6.21-rc4: known regressions (v2)
Hi!

> > Subject    : ThinkPad R60: suspend to disk broken
> > References : http://lkml.org/lkml/2007/3/23/74
> > Submitter  : Marcus Better <[EMAIL PROTECTED]>
> > Status     : submitter tries to bisect
>
> I just tried -rc5. Now suspend to disk seems to work. I think the XFS
> workqueue patch fixed this.
>
> It can also suspend to RAM, but resume is worse. The first time around it
> resumed but corrupted the vesafb console (greenish blinking character
> cells), something that used to work before. But the system responded to
> input, so I suspended to RAM again. This time the resume failed, it hung
> after printing "Linux!" in yellow at the top of the screen. (Seems to be
> some artifact, I have seen it before even with working suspend.)

Yellow "Linux!" is my debugging trick. It should be there, but it should
also disappear quickly.

Try vga=0 ... text console seems to work for you.

							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Re: [3/5] 2.6.21-rc4: known regressions (v2)
Pavel Machek wrote:
> > > Subject    : ThinkPad R60: suspend to disk broken
> > > References : http://lkml.org/lkml/2007/3/23/74
> > input, so I suspended to RAM again. This time the resume failed, it hung
> > after printing "Linux!" in yellow at the top of the screen.
>
> Yellow "Linux!" is my debugging trick.

Cute :-)

Here is my bisect log so far, with 98 revisions left. Note that all kernels
have the XFS workqueue patch applied.

~$ git bisect log
git-bisect start
# bad: [6fb04ccf5c5e054c4107090bed6e866489f1089f] Linux 2.6.21-rc5
git-bisect bad 6fb04ccf5c5e054c4107090bed6e866489f1089f
# good: [c8f71b01a50597e298dc3214a2f2be7b8d31170c] Linux 2.6.21-rc1
git-bisect good c8f71b01a50597e298dc3214a2f2be7b8d31170c
# good: [ad5f1196792653dadf09c07a5fa917092b469c1c] ecryptfs: check xattr operation support fix
git-bisect good ad5f1196792653dadf09c07a5fa917092b469c1c
# good: [271368b69b9e8042063d6c713423e84503bbdaa0] Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6
git-bisect good 271368b69b9e8042063d6c713423e84503bbdaa0
# bad: [f5b42c3324494ea3f9bf795e2a7e4d3cbb06c607] KVM: Fix guest sysenter on vmx
git-bisect bad f5b42c3324494ea3f9bf795e2a7e4d3cbb06c607

The bad kernels exhibit the hang on second resume from RAM.

The good ones all have the artifact with corrupted display. Moreover, they
resume immediately from every suspend to RAM _after_ a suspend-to-disk, but
not before it. This is when suspending with "echo mem > /sys/power/state".

> Try vga=0 ... text console seems to work for you.

Ok, will try.

Marcus

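For a rough idea of how much testing "98 revisions left" still implies, the halving that bisection performs can be counted directly. This is only a back-of-the-envelope sketch under the assumption that every build-and-test round discards about half of the remaining candidates; `bisect_rounds_left()` is a hypothetical helper, not git's own step estimate, and merge-heavy history can behave differently.

```c
#include <assert.h>

/*
 * Worst-case number of further build-and-test rounds needed to narrow
 * `candidates` suspect revisions down to a single one, assuming each
 * round throws away roughly half of them (hypothetical helper, not
 * taken from git).
 */
unsigned bisect_rounds_left(unsigned candidates)
{
    unsigned rounds = 0;

    assert(candidates > 0);
    while (candidates > 1) {
        candidates = (candidates + 1) / 2;  /* pessimistic: keep the larger half */
        rounds++;
    }
    return rounds;
}
```

Under that assumption, 98 remaining revisions come out at about seven more kernels to build and boot.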
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Mon, Mar 26, 2007 at 12:00:22PM +0200, Marcus Better wrote:
> Adrian Bunk wrote:
> > Subject    : ThinkPad R60: suspend to disk broken
> > References : http://lkml.org/lkml/2007/3/23/74
> > Submitter  : Marcus Better <[EMAIL PROTECTED]>
> > Status     : submitter tries to bisect
>
> I just tried -rc5. Now suspend to disk seems to work. I think the XFS
> workqueue patch fixed this.
>
> It can also suspend to RAM, but resume is worse. The first time around it
> resumed but corrupted the vesafb console (greenish blinking character
> cells), something that used to work before. But the system responded to
> input, so I suspended to RAM again. This time the resume failed, it hung
> after printing "Linux!" in yellow at the top of the screen. (Seems to be
> some artifact, I have seen it before even with working suspend.)
>...

Does setting CONFIG_PCI_MSI=n make any difference?

> Marcus

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

Re: [3/5] 2.6.21-rc4: known regressions (v2)
Adrian Bunk wrote:
> > > Subject    : ThinkPad R60: suspend to disk broken
>
> Does setting CONFIG_PCI_MSI=n make any difference?

Yes, it does. The hanging resume problem went away.

(The display corruption and the instant resume were not affected.)

Marcus

Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Mon, Mar 26, 2007 at 07:42:51PM +0200, Marcus Better wrote:
> Adrian Bunk wrote:
> > > > Subject    : ThinkPad R60: suspend to disk broken
> >
> > Does setting CONFIG_PCI_MSI=n make any difference?
>
> Yes, it does. The hanging resume problem went away.

Thanks for testing.

If you enable it again, does the patch from [1] also fix it?

> (The display corruption and the instant resume were not affected.)

Not a surprise - it's currently quite common that people run into
several distinct suspend regressions...

> Marcus

cu
Adrian

[1] http://lkml.org/lkml/2007/3/24/136 [2]
[2] x86_64 uses arch/i386/pci/common.c

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Monday 26 March 2007 at 11:14 +0200, Thomas Gleixner wrote:
> On Mon, 2007-03-26 at 08:45 +0200, Frédéric RISS wrote:
> > Additional data point: I just tried with -rc5 and the issue is still
> > present. The config I used for this test defines neither NO_HZ nor
> > HIGH_RES_TIMERS.
>
> Do you have CONFIG_HPET_TIMER enabled and does the box have one?
>
> If yes, can you please turn it off and retry?

Indeed, turning off CONFIG_HPET_TIMER does fix the coming-out-of-suspend
issue. (In fact it hangs at the second suspend, but that's another ATA
problem that I think has already been reported).

Fred.

Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Mon, Mar 26, 2007 at 08:53:20PM +0200, Frédéric Riss wrote:
>...
> (In fact it hangs at the second suspend, but that's another ATA
> problem that I think has already been reported).

This sounds like the MSI problem.

Do you have CONFIG_PCI_MSI enabled?
If yes, does disabling it fix it?
If yes, does CONFIG_PCI_MSI=y with the patch from [1] work?

> Fred.

cu
Adrian

[1] http://lkml.org/lkml/2007/3/24/136

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

Re: [3/5] 2.6.21-rc4: known regressions (v2)
2007/3/26, Adrian Bunk <[EMAIL PROTECTED]>:
> On Mon, Mar 26, 2007 at 08:53:20PM +0200, Frédéric Riss wrote:
> >...
> > (In fact it hangs at the second suspend, but that's another ATA
> > problem that I think has already been reported).
>
> This sounds like the MSI problem.
>
> Do you have CONFIG_PCI_MSI enabled?
> If yes, does disabling it fix it?

Yes

> If yes, does CONFIG_PCI_MSI=y with the patch from [1] work?

Yes !

Just to be 100% clear, the hang I was seeing at the second suspend is
this one:

Subject    : second suspend to disk in a row results in an oops (libata?)
References : http://lkml.org/lkml/2007/3/17/43
Submitter  : Thomas Meyer <[EMAIL PROTECTED]>
Status     : unknown

I'm not sure it was associated with the MSI issue yet.

Thanks a lot,
Fred

> [1] http://lkml.org/lkml/2007/3/24/136

Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Mon, Mar 26, 2007 at 09:39:29PM +0200, Frederic Riss wrote:
> 2007/3/26, Adrian Bunk <[EMAIL PROTECTED]>:
> > On Mon, Mar 26, 2007 at 08:53:20PM +0200, Frédéric Riss wrote:
> > >...
> > > (In fact it hangs at the second suspend, but that's another ATA
> > > problem that I think has already been reported).
> >
> > This sounds like the MSI problem.
> >
> > Do you have CONFIG_PCI_MSI enabled?
> > If yes, does disabling it fix it?
>
> Yes
>
> > If yes, does CONFIG_PCI_MSI=y with the patch from [1] work?
>
> Yes !

Thanks for your testing.

> Just to be 100% clear, the hang I was seeing at the second suspend is
> this one:
>
> Subject    : second suspend to disk in a row results in an oops (libata?)
> References : http://lkml.org/lkml/2007/3/17/43
> Submitter  : Thomas Meyer <[EMAIL PROTECTED]>
> Status     : unknown
>
> I'm not sure it was associated with the MSI issue yet.

That's what I was calling the MSI problem.

Since my latest regression list, the actual cause and a patch have been
found.

> Thanks a lot,
> Fred
>
> > [1] http://lkml.org/lkml/2007/3/24/136

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

RE: [3/5] 2.6.21-rc4: known regressions (v2)
> What I'm proposing we do is move the irq allocation code out of
> pci_enable_device and the irq freeing code out of pci_disable_device
> in the future.

Sounds rational ... in a world that wasn't dominated by PCI it would seem
to be the logical approach (since the irq code would have much more utility
independent of the PCI code).

> Tony, Len before we merge any fixes for 2.6.21-rcX I'd like to at least
> get an ack on the long term direction.

Long-term-direction-acked-by: Tony Luck <[EMAIL PROTECTED]>

-Tony

Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Sunday, 25 March 2007 22:37, Eric W. Biederman wrote:
> "Rafael J. Wysocki" <[EMAIL PROTECTED]> writes:
>
> > On Sunday, 25 March 2007 14:56, Eric W. Biederman wrote:
> >> "Rafael J. Wysocki" <[EMAIL PROTECTED]> writes:
> >>
> >> > Yes, in kernel/power/disk.c:power_down() .
> >> >
> >> > Please comment out the disable_nonboot_cpus() in there and retest (but please
> >> > test the latest Linus' tree).
> >>
> >> <rant>
> >>
> >> Why do we even need a disable_nonboot_cpus in that path? machine_shutdown
> >> on i386 and x86_64 should take care of that. Further the code that computes
> >> the boot cpu is bogus (not all architectures require cpu == 0 to be
> >> the boot cpu), and disabling non boot cpus appears to be a strong
> >> x86ism, in the first place.
> >
> > Yes.
> >
> >> If the only reason for disable_nonboot_cpus there is to avoid the
> >> WARN_ON in init_low_mappings() we should seriously consider killing
> >> it.
> >
> > We have considered it, but no one was sure that it was a good idea.
>
> The problem with the current init_low_mappings is that it hacks the
> current page table. If we can instead use a different page table
> the code becomes SMP safe.

What exactly is the danger here?

> I have extracted the patch that addresses this from the relocatable
> patchset and appended it for sparking ideas. It goes a little
> farther than we need to solve this issue but the basics are there.
>
> >> If we can wait for 2.6.22 the relocatable x86_64 patchset that
> >> Andi has queued, has changes that kill the init_low_mapping() hack.
> >
> > I think we should kill the WARN_ON() right now, perhaps replacing it with
> > a FIXME comment.
>
> Reasonable.
>
> >> I'm not very comfortable with calling cpu_down in a common code path
> >> right now either. I'm fairly certain we still don't have that
> >> correct. So if we confine the mess that is cpu_down to #if
> >> defined(CPU_HOTPLUG) && defined(CONFIG_EXPERIMENTAL) I don't care.
> >> If we start using it everywhere I'm very nervous.
> >> migration when bringing a cpu down is strongly racy, and I don't think
> >> we actually put cpus to sleep properly either.
> >
> > I'm interested in all of the details, please. I seriously consider dropping
> > cpu_up()/cpu_down() from the suspend code paths.
>
> So I'm not certain if in a multiple cpu context we can avoid all of
> the issues with cpu hotplug but there is a reasonable chance so I
> will explain as best I can.
>
> Yanking the appropriate code out of linuxbios the way a processor
> should stop itself is to send an INIT IPI to itself. This puts a
> cpu into an optimized wait for startup IPI state where it is
> otherwise disabled. This is the state any sane BIOS will put the
> cpus into before control is handed off to the kernel.
>
> > static inline void stop_this_cpu(void)
> > {
> >         unsigned apicid;
> >         apicid = lapicid();
> >
> >         /* Send an APIC INIT to myself */
> >         lapic_write(LAPIC_ICR2, SET_LAPIC_DEST_FIELD(apicid));
> >         lapic_write(LAPIC_ICR, LAPIC_INT_LEVELTRIG | LAPIC_INT_ASSERT | LAPIC_DM_INIT);
> >         /* Wait for the ipi send to finish */
> >         lapic_wait_icr_idle();
> >
> >         /* Deassert the APIC INIT */
> >         lapic_write(LAPIC_ICR2, SET_LAPIC_DEST_FIELD(apicid));
> >         lapic_write(LAPIC_ICR, LAPIC_INT_LEVELTRIG | LAPIC_DM_INIT);
> >         /* Wait for the ipi send to finish */
> >         lapic_wait_icr_idle();
> >
> >         /* If I haven't halted spin forever */
> >         for(;;) {
> >                 hlt();
> >         }
> > }
>
> I'm not certain what to do with the interrupt races. But I will see
> if I can explain what I know.
>
> <braindump>
>
> - Most ioapics are buggy.
> - Most ioapics do not follow pci-ordering rules with respect to
>   interrupt message delivery so ensuring all in-flight irqs have
>   arrived somewhere is very hard.
> - To avoid bugs we always limit ourselves to reprogramming the ioapics
>   in the interrupt handler, and not considering an interrupt
>   successfully reprogrammed until we have received an irq in the new
>   location.
> - On x86 we have two basic interrupt handling modes.
>   o logical addressing with lowest priority delivery.
>   o physical addressing with delivery to a single cpu.
> - With logical addressing as long as the cpu is not available for
>   having an interrupt delivered to it the interrupt will never be
>   delivered to a particular cpu. Ideally we also update the mask in
>   the ioapic to not target that cpu.
> - With physical addressing targeting a single cpu we need to reprogram
>   the ioapics not to target that specific cpu. This needs to happen in
>   the interrupt handler and we need to wait for the next interrupt
>   before we tear down our data structures for handling the interrupt.
>
> The current cpu hotplug code attempts to reprogram the ioapics from
> process context which is just wrong.

I wasn't aware of that.

> Now as part of suspend/resume I think we should be programming the hardware
> not to generate interrupts in the first place at the actual hardware
> devices so we can likely avoid all of the code that reprograms
Re: [3/5] 2.6.21-rc4: known regressions (v2)
"Luck, Tony" <[EMAIL PROTECTED]> writes:

>> What I'm proposing we do is move the irq allocation code out of
>> pci_enable_device and the irq freeing code out of pci_disable_device
>> in the future.
>
> Sounds rational ... in a world that wasn't dominated by PCI it would seem
> to be the logical approach (since the irq code would have much more utility
> independent of the PCI code).

Right. We can even do this earlier in the pci code. Just doing this
on demand when the device driver needs it is problematic, as device
drivers like to keep the irq requested over a pci_disable_device
pci_enable_device pair.

The big practical issue is that we will likely wind up allocating an
irq number to all usable irqs on ia64. Which means we will likely need
many more irq numbers... Although I guess if we keep it at the pci
layer we should be fairly safe. I was afraid there was some hotplug
reason for waiting until pci_enable_device to allocate the irq numbers.

>> Tony, Len before we merge any fixes for 2.6.21-rcX I'd like to at least
>> get an ack on the long term direction.
>
> Long-term-direction-acked-by: Tony Luck <[EMAIL PROTECTED]>

Thanks. Then small surgery will happen now, and I will start queuing
up the major surgery patches. Although I won't be able to do more than
compile test and code review the ia64 changes.

Eric

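The hazard Eric points at, a driver holding on to its irq across a disable/enable pair while the number is only allocated between enable and disable, can be shown with a toy model. This is a sketch under stated assumptions, not kernel code: the `toy_*` names, the four-entry number pool and the lowest-free-number policy are all hypothetical, chosen only to make the effect visible.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_IRQS 4
#define TOY_NO_IRQ  (-1)

/* Hypothetical stand-in for the irq number space; not a kernel structure. */
static bool toy_irq_used[TOY_NR_IRQS];

struct toy_dev {
    int irq;    /* TOY_NO_IRQ while nothing is allocated */
};

/* Hand out the lowest free number, or TOY_NO_IRQ if the pool is empty. */
int toy_irq_alloc(void)
{
    for (int i = 0; i < TOY_NR_IRQS; i++) {
        if (!toy_irq_used[i]) {
            toy_irq_used[i] = true;
            return i;
        }
    }
    return TOY_NO_IRQ;
}

void toy_irq_free(int irq)
{
    assert(irq >= 0 && irq < TOY_NR_IRQS);
    toy_irq_used[irq] = false;
}

/* Allocate-on-enable model: the number lives only between enable and disable. */
void toy_enable(struct toy_dev *dev)
{
    if (dev->irq == TOY_NO_IRQ)
        dev->irq = toy_irq_alloc();
}

void toy_disable(struct toy_dev *dev)
{
    if (dev->irq != TOY_NO_IRQ) {
        toy_irq_free(dev->irq);
        dev->irq = TOY_NO_IRQ;
    }
}

/*
 * Returns true if a driver that remembered its irq number before a
 * disable/enable pair still holds the right number afterwards, given
 * that another device is enabled in between.
 */
bool toy_cached_irq_survives_reenable(void)
{
    struct toy_dev a = { TOY_NO_IRQ }, b = { TOY_NO_IRQ };
    int cached;

    toy_enable(&a);
    cached = a.irq;     /* driver remembers the number it requested */
    toy_disable(&a);    /* number goes back to the pool */
    toy_enable(&b);     /* another device picks it up */
    toy_enable(&a);     /* device a now gets a different number */

    bool ok = (a.irq == cached);

    toy_disable(&a);    /* leave the pool empty for the next caller */
    toy_disable(&b);
    return ok;
}
```

In this model the cached number goes stale as soon as a second device is enabled in the gap. Allocating once when the device is discovered, and leaving enable/disable out of it, would keep the number stable, at the price of needing a number for every usable irq.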
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Sunday 25 March 2007 at 23:34 +0200, Frédéric Riss wrote:
> However, as I pointed out in the initial report, the MacMini doesn't
> come out of suspend to ram because a commit in another merged patchset
> broke it. I tracked it down to:
>
> commit e9e2cdb412412326c4827fc78ba27f410d837e6e
> parent 79bf2bb335b85db25d27421c798595a2fa2a0e82
> Author: Thomas Gleixner <[EMAIL PROTECTED]>
> Date:   Fri Feb 16 01:28:04 2007 -0800
>
>     [PATCH] clockevents: i386 drivers
>
> This patch has already been mentioned in regression reports, but AFAICS
> not related to suspend issues.
>
> To be totally clear about what works and what doesn't:
>
> 79bf2bb335b85db25d27421c798595a2fa2a0e82
>    + cherry-pick f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 ==> works
>
> e9e2cdb412412326c4827fc78ba27f410d837e6e
>    + cherry-pick f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 ==> broken
>
> To try to get more information, I commented out the call to
> do_suspend_lowlevel in drivers/acpi/sleep/main.c and used
> CONFIG_DISABLE_CONSOLE_SUSPEND. Interestingly, the suspend/resume cycle
> completes correctly in this mode.

Additional data point: I just tried with -rc5 and the issue is still
present. The config I used for this test defines neither NO_HZ nor
HIGH_RES_TIMERS.

Fred.

Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Friday 23 March 2007 at 19:50 +0100, Adrian Bunk wrote:
> Subject    : MacMini: doesn't come out of suspend to ram
> References : http://lkml.org/lkml/2007/3/21/374
> Submitter  : Frédéric RISS <[EMAIL PROTECTED]>
>              Tino Keitel <[EMAIL PROTECTED]>
> Caused-By  : Bob Moore <[EMAIL PROTECTED]>
>              commit c5a7156959e89b32260ad6072bbf5077bcdfbeee
> Status     : unknown

I spent some time this weekend investigating this issue more thoroughly.
In fact the regression caused by this commit has been corrected by
f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 (ACPI: Disable wake GPEs only
once.)

However, as I pointed out in the initial report, the MacMini doesn't
come out of suspend to ram because a commit in another merged patchset
broke it. I tracked it down to:

commit e9e2cdb412412326c4827fc78ba27f410d837e6e
parent 79bf2bb335b85db25d27421c798595a2fa2a0e82
Author: Thomas Gleixner <[EMAIL PROTECTED]>
Date:   Fri Feb 16 01:28:04 2007 -0800

    [PATCH] clockevents: i386 drivers

This patch has already been mentioned in regression reports, but AFAICS
not related to suspend issues.

To be totally clear about what works and what doesn't:

79bf2bb335b85db25d27421c798595a2fa2a0e82
   + cherry-pick f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 ==> works

e9e2cdb412412326c4827fc78ba27f410d837e6e
   + cherry-pick f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 ==> broken

To try to get more information, I commented out the call to
do_suspend_lowlevel in drivers/acpi/sleep/main.c and used
CONFIG_DISABLE_CONSOLE_SUSPEND. Interestingly, the suspend/resume cycle
completes correctly in this mode.

Fred.

Re: [3/5] 2.6.21-rc4: known regressions (v2)
"Rafael J. Wysocki" <[EMAIL PROTECTED]> writes:

> On Sunday, 25 March 2007 14:56, Eric W. Biederman wrote:
>> "Rafael J. Wysocki" <[EMAIL PROTECTED]> writes:
>>
>> > Yes, in kernel/power/disk.c:power_down() .
>> >
>> > Please comment out the disable_nonboot_cpus() in there and retest (but please
>> > test the latest Linus' tree).
>>
>> Why do we even need a disable_nonboot_cpus in that path? machine_shutdown
>> on i386 and x86_64 should take care of that. Further the code that computes
>> the boot cpu is bogus (not all architectures require cpu == 0 to be
>> the boot cpu), and disabling non boot cpus appears to be a strong
>> x86ism, in the first place.
>
> Yes.
>
>> If the only reason for disable_nonboot_cpus there is to avoid the
>> WARN_ON in init_low_mappings() we should seriously consider killing
>> it.
>
> We have considered it, but no one was sure that it was a good idea.

The problem with the current init_low_mappings is that it hacks the
current page table. If we can instead use a different page table
the code becomes SMP safe.

I have extracted the patch that addresses this from the relocatable
patchset and appended it for sparking ideas. It goes a little
farther than we need to solve this issue but the basics are there.

>> If we can wait for 2.6.22 the relocatable x86_64 patchset that
>> Andi has queued, has changes that kill the init_low_mapping() hack.
>
> I think we should kill the WARN_ON() right now, perhaps replacing it with
> a FIXME comment.

Reasonable.

>> I'm not very comfortable with calling cpu_down in a common code path
>> right now either. I'm fairly certain we still don't have that
>> correct. So if we confine the mess that is cpu_down to #if
>> defined(CPU_HOTPLUG) && defined(CONFIG_EXPERIMENTAL) I don't care.
>> If we start using it everywhere I'm very nervous.
>> migration when bringing a cpu down is strongly racy, and I don't think
>> we actually put cpus to sleep properly either.
>
> I'm interested in all of the details, please. I seriously consider dropping
> cpu_up()/cpu_down() from the suspend code paths.

So I'm not certain if in a multiple cpu context we can avoid all of
the issues with cpu hotplug but there is a reasonable chance so I
will explain as best I can.

Yanking the appropriate code out of linuxbios the way a processor
should stop itself is to send an INIT IPI to itself. This puts a
cpu into an optimized wait for startup IPI state where it is
otherwise disabled. This is the state any sane BIOS will put the
cpus into before control is handed off to the kernel.

> static inline void stop_this_cpu(void)
> {
>         unsigned apicid;
>         apicid = lapicid();
>
>         /* Send an APIC INIT to myself */
>         lapic_write(LAPIC_ICR2, SET_LAPIC_DEST_FIELD(apicid));
>         lapic_write(LAPIC_ICR, LAPIC_INT_LEVELTRIG | LAPIC_INT_ASSERT | LAPIC_DM_INIT);
>         /* Wait for the ipi send to finish */
>         lapic_wait_icr_idle();
>
>         /* Deassert the APIC INIT */
>         lapic_write(LAPIC_ICR2, SET_LAPIC_DEST_FIELD(apicid));
>         lapic_write(LAPIC_ICR, LAPIC_INT_LEVELTRIG | LAPIC_DM_INIT);
>         /* Wait for the ipi send to finish */
>         lapic_wait_icr_idle();
>
>         /* If I haven't halted spin forever */
>         for(;;) {
>                 hlt();
>         }
> }

I'm not certain what to do with the interrupt races. But I will see
if I can explain what I know.

- Most ioapics are buggy.
- Most ioapics do not follow pci-ordering rules with respect to
  interrupt message delivery so ensuring all in-flight irqs have
  arrived somewhere is very hard.
- To avoid bugs we always limit ourselves to reprogramming the ioapics
  in the interrupt handler, and not considering an interrupt
  successfully reprogrammed until we have received an irq in the new
  location.
- On x86 we have two basic interrupt handling modes.
  o logical addressing with lowest priority delivery.
  o physical addressing with delivery to a single cpu.
- With logical addressing as long as the cpu is not available for
  having an interrupt delivered to it the interrupt will never be
  delivered to a particular cpu. Ideally we also update the mask in
  the ioapic to not target that cpu.
- With physical addressing targeting a single cpu we need to reprogram
  the ioapics not to target that specific cpu. This needs to happen in
  the interrupt handler and we need to wait for the next interrupt
  before we tear down our data structures for handling the interrupt.

The current cpu hotplug code attempts to reprogram the ioapics from
process context which is just wrong.

Now as part of suspend/resume I think we should be programming the hardware
not to generate interrupts in the first place at the actual hardware
devices so we can likely avoid all of the code that reprograms
interrupts while they are active. If we can use things like pci
ordering rules to ensure the device will never fire the interrupt
until resumed we
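
The rule from the list above for physical addressing (reprogram only inside the interrupt handler, and treat the move as finished only once an interrupt has arrived at the new destination) can be sketched as a small state machine. This is a toy model in plain C under stated assumptions, not kernel code: `struct toy_irq`, `toy_request_move()` and `toy_handle_irq()` are hypothetical names, and the "ioapic entry" is just an integer field.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_move_state { TOY_STABLE, TOY_REQUESTED, TOY_REPROGRAMMED };

/* Hypothetical per-irq bookkeeping; the routing entry is only simulated. */
struct toy_irq {
    int dest_cpu;     /* cpu the simulated ioapic entry points at */
    int old_cpu;      /* previous target, may still see in-flight messages */
    int wanted_cpu;   /* target asked for from process context */
    enum toy_move_state state;
};

/* Process context: only record the wish, never touch the routing entry. */
void toy_request_move(struct toy_irq *irq, int cpu)
{
    assert(cpu >= 0);
    irq->wanted_cpu = cpu;
    irq->state = TOY_REQUESTED;
}

/*
 * Interrupt context on `cpu`. Returns true once the data structures
 * kept for the old cpu may be torn down.
 */
bool toy_handle_irq(struct toy_irq *irq, int cpu)
{
    assert(cpu >= 0);
    switch (irq->state) {
    case TOY_REQUESTED:
        /* Reprogram only here, while the irq is being serviced. */
        irq->old_cpu = irq->dest_cpu;
        irq->dest_cpu = irq->wanted_cpu;
        irq->state = TOY_REPROGRAMMED;
        return false;
    case TOY_REPROGRAMMED:
        /* A message sent before the reprogramming may still land on the old cpu. */
        if (cpu != irq->dest_cpu)
            return false;
        /* First arrival at the new location: the move is confirmed. */
        irq->state = TOY_STABLE;
        return true;
    default:
        return false;
    }
}
```

In this model a request made from process context changes nothing until the next interrupt is serviced, and a straggler arriving on the old cpu after the reprogramming does not count as confirmation.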
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Sunday, 25 March 2007 21:06, Rafael J. Wysocki wrote:
> On Sunday, 25 March 2007 19:25, Thomas Meyer wrote:
> > Adrian Bunk wrote:
> > > On Sun, Mar 25, 2007 at 01:41:33PM +0200, Thomas Meyer wrote:
> > >
> > >> ...
> > >> The first suspend to disk is ok. The second suspend to disk has a
> > >> strange behaviour:
> > >> 1.) write pm image
> > >> 2.) the system disable the non-boot cpus again (i guess this happens in
> > >> power_down())
> > >> 3.) the system doesn't power down.
> > >> 4.) pressing any key and the system powers down.
> > >> ...
> > >>
> > >
> > > Is this also present with 2.6.20, or is it a regression?
> > >
> > No, this one is not present in 2.6.20 and this error doesn't (head=
> > 317ec6cd00f25d05d153a780bc178c5335f320ee) occur with NO_HZ=n and
> > HIGH_RES_TIMERS=n
> >
> > This error is maybe related with this commit:
>
> Yes, it is, but I'd rather remove the disable_nonboot_cpus() from
> power_down() (as Eric suggested) instead of trying to handle the RCU sync
> problem here.
>
> This has been caused by my commit 94985134b7b46848267ed6b734320db01c974e72
> (swsusp: disable nonboot CPUs before entering platform suspend) that in such a
> case should be reverted.

s/such a case/the present situation/

Rafael
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Sunday, 25 March 2007 19:25, Thomas Meyer wrote:
> Adrian Bunk wrote:
> > On Sun, Mar 25, 2007 at 01:41:33PM +0200, Thomas Meyer wrote:
> >
> >> ...
> >> The first suspend to disk is ok. The second suspend to disk has a
> >> strange behaviour:
> >> 1.) write pm image
> >> 2.) the system disable the non-boot cpus again (i guess this happens in
> >> power_down())
> >> 3.) the system doesn't power down.
> >> 4.) pressing any key and the system powers down.
> >> ...
> >>
> >
> > Is this also present with 2.6.20, or is it a regression?
> >
> No, this one is not present in 2.6.20 and this error doesn't (head=
> 317ec6cd00f25d05d153a780bc178c5335f320ee) occur with NO_HZ=n and
> HIGH_RES_TIMERS=n
>
> This error is maybe related with this commit:

Yes, it is, but I'd rather remove the disable_nonboot_cpus() from
power_down() (as Eric suggested) instead of trying to handle the RCU sync
problem here.

This has been caused by my commit 94985134b7b46848267ed6b734320db01c974e72
(swsusp: disable nonboot CPUs before entering platform suspend) that in such a
case should be reverted.

Greetings,
Rafael
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Sunday, 25 March 2007 14:56, Eric W. Biederman wrote:
> "Rafael J. Wysocki" <[EMAIL PROTECTED]> writes:
>
> > Yes, in kernel/power/disk.c:power_down() .
> >
> > Please comment out the disable_nonboot_cpus() in there and retest (but
> > please test the latest Linus' tree).
>
> Why do we even need a disable_nonboot_cpus in that path? machine_shutdown
> on i386 and x86_64 should take care of that. Further the code that computes
> the boot cpu is bogus (not all architectures require cpu == 0 to be
> the boot cpu), and disabling non boot cpus appears to be a strong
> x86ism, in the first place.

Yes.

> If the only reason for disable_nonboot_cpus there is to avoid the
> WARN_ON in init_low_mappings() we should seriously consider killing
> it.

We have considered it, but no one was sure that it was a good idea.

> If we can wait for 2.6.22 the relocatable x86_64 patchset that
> Andi has queued, has changes that kill the init_low_mapping() hack.

I think we should kill the WARN_ON() right now, perhaps replacing it with
a FIXME comment.

> I'm not very comfortable with calling cpu_down in a common code path
> right now either. I'm fairly certain we still don't have that
> correct. So if we confine the mess that is cpu_down to #if
> defined(CPU_HOTPLUG) && defined(CONFIG_EXPERIMENTAL) I don't care.
> If we start using it everywhere I'm very nervous. I know the irq
> migration when bringing a cpu down is strongly racy, and I don't think
> we actually put cpus to sleep properly either.

I'm interested in all of the details, please. I seriously consider dropping
cpu_up()/cpu_down() from the suspend code paths.

Greetings,
Rafael
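As an illustration of the boot cpu point quoted above (the boot cpu is not cpu 0 on every architecture), here is a minimal user-space sketch of "disable everything except the boot cpu" that relies on a recorded boot cpu id instead of assuming 0. It is a model only: struct cpu_state, NR_CPUS as used here and disable_nonboot() are hypothetical names for this example, not the kernel's disable_nonboot_cpus(), and clearing the online flag stands in for a real cpu_down().

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4   /* hypothetical size for this sketch */

/* Hypothetical model: which cpus are online, and which one booted. */
struct cpu_state {
    bool online[NR_CPUS];
    int  boot_cpu;            /* recorded at boot, not assumed to be 0 */
};

/* Take every online cpu except the recorded boot cpu offline.
 * Returns the number of cpus taken down. */
static int disable_nonboot(struct cpu_state *s)
{
    int cpu, taken_down = 0;

    assert(s->boot_cpu >= 0 && s->boot_cpu < NR_CPUS);
    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        if (cpu == s->boot_cpu || !s->online[cpu])
            continue;
        s->online[cpu] = false;   /* stand-in for a real cpu_down() */
        taken_down++;
    }
    return taken_down;
}
```

With cpus 0, 1 and 2 online and cpu 2 recorded as the boot cpu, this takes down cpus 0 and 1 and leaves cpu 2 running, which a loop hard-coded to skip cpu 0 would get wrong.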
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Sunday, 25 March 2007 16:17, Thomas Meyer wrote:
> Rafael J. Wysocki wrote:
> > On Sunday, 25 March 2007 14:03, Eric W. Biederman wrote:
> >
> >> Thomas Meyer <[EMAIL PROTECTED]> writes:
> >>
> >>
> >>> Eric W. Biederman wrote:
> >>>
> >>>> Thomas could you verify the patch below makes the problem go away
> >>>> for you.
> >>>>
> >>> The patch solves the problem. I'm writing this after the third suspend
> >>> and resume cycle.
> >>> msi irq stays enabled for libata device:
> >>> cat /sys/devices/pci\:00/\:00\:1f.2/irq
> >>> 218
> >>>
> >>> The first suspend to disk is ok. The second suspend to disk has a
> >>> strange behaviour:
> >>> 1.) write pm image
> >>> 2.) the system disable the non-boot cpus again (i guess this happens in
> >>> power_down())
> >>>
> >
> > Yes, in kernel/power/disk.c:power_down() .
> >
> > Please comment out the disable_nonboot_cpus() in there and retest (but
> > please test the latest Linus' tree).
> >
> >
> Without disable_nonboot_cpus in power_down the computer powers down
> without the mysterious "wait for the next interrupt" hang.

Do you have CONFIG_NO_HZ set?

Rafael
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Adrian Bunk wrote:
> On Sun, Mar 25, 2007 at 01:41:33PM +0200, Thomas Meyer wrote:
>
>> ...
>> The first suspend to disk is ok. The second suspend to disk has a
>> strange behaviour:
>> 1.) write pm image
>> 2.) the system disable the non-boot cpus again (i guess this happens in
>> power_down())
>> 3.) the system doesn't power down.
>> 4.) pressing any key and the system powers down.
>> ...
>>
>
> Is this also present with 2.6.20, or is it a regression?
>
No, this one is not present in 2.6.20 and this error doesn't (head=
317ec6cd00f25d05d153a780bc178c5335f320ee) occur with NO_HZ=n and
HIGH_RES_TIMERS=n

This error is maybe related with this commit:

commit cd05a1f818073a623455a58e756c5b419fc98db9
Author: Thomas Gleixner <[EMAIL PROTECTED]>
Date:   Sat Mar 17 00:25:52 2007 +0100

    [PATCH] clockevents: Fix suspend/resume to disk hangs

    I finally found a dual core box, which survives suspend/resume without
    crashing in the middle of nowhere. Sigh, I never figured out from the
    code and the bug reports what's going on.

    The observed hangs are caused by a stale state transition of the clock
    event devices, which keeps the RCU synchronization away from completion,
    when the non boot CPU is brought back up.

    The suspend/resume in oneshot mode needs the similar care as the
    periodic mode during suspend to RAM. My assumption that the state
    transitions during the different shutdown/bringups of s2disk would go
    through the periodic boot phase and then switch over to highres resp.
    nohz mode were simply wrong.

    Add the appropriate suspend / resume handling for the non periodic
    modes.

    Signed-off-by: Thomas Gleixner <[EMAIL PROTECTED]>
    Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Sun, Mar 25, 2007 at 01:41:33PM +0200, Thomas Meyer wrote:
>...
> The first suspend to disk is ok. The second suspend to disk has a
> strange behaviour:
> 1.) write pm image
> 2.) the system disable the non-boot cpus again (i guess this happens in
> power_down())
> 3.) the system doesn't power down.
> 4.) pressing any key and the system powers down.
>...

Is this also present with 2.6.20, or is it a regression?

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Rafael J. Wysocki wrote:
> On Sunday, 25 March 2007 14:03, Eric W. Biederman wrote:
>
>> Thomas Meyer <[EMAIL PROTECTED]> writes:
>>
>>
>>> Eric W. Biederman wrote:
>>>
>>>> Thomas could you verify the patch below makes the problem go away
>>>> for you.
>>>>
>>> The patch solves the problem. I'm writing this after the third suspend
>>> and resume cycle.
>>> msi irq stays enabled for libata device:
>>> cat /sys/devices/pci\:00/\:00\:1f.2/irq
>>> 218
>>>
>>> The first suspend to disk is ok. The second suspend to disk has a
>>> strange behaviour:
>>> 1.) write pm image
>>> 2.) the system disable the non-boot cpus again (i guess this happens in
>>> power_down())
>>>
>
> Yes, in kernel/power/disk.c:power_down() .
>
> Please comment out the disable_nonboot_cpus() in there and retest (but please
> test the latest Linus' tree).
>
>
Without disable_nonboot_cpus in power_down the computer powers down
without the mysterious "wait for the next interrupt" hang.

>>> 3.) the system doesn't power down.
>>> 4.) pressing any key and the system powers down.
>>>
>>> The same is true for the third suspend cycle. Maybe an acpi problem?
>>>
>> Sounds possible. You could probably verify it isn't my patch but running
>> an unpatched kernel without msi support. As I think the crash you saw should
>> only be reproducible when using devices that support msi.
>>
>> Unless I hear different I'm going to assume that this second case is a
>> completely different problem.
>>
>
> I think it is different too.
>
Yes, it's a different problem.
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Eric W. Biederman wrote:
> Sounds possible. You could probably verify it isn't my patch but running
> an unpatched kernel without msi support. As I think the crash you saw should
> only be reproducible when using devices that support msi.
>
Without your patch and with the pci=nomsi option the same error occurs. But I
think this is not an acpi error, because every interrupt seems to
trigger the shutdown, like moving the mouse, or pressing a key.

> Unless I hear different I'm going to assume that this second case is a
> completely different problem. You might check to see if the acpi
> interrupt is stuck after a suspend/resume cycle.
>
Agreed.
Re: [3/5] 2.6.21-rc4: known regressions (v2)
"Rafael J. Wysocki" <[EMAIL PROTECTED]> writes:

> Yes, in kernel/power/disk.c:power_down() .
>
> Please comment out the disable_nonboot_cpus() in there and retest (but please
> test the latest Linus' tree).

Why do we even need a disable_nonboot_cpus in that path? machine_shutdown
on i386 and x86_64 should take care of that. Further the code that computes
the boot cpu is bogus (not all architectures require cpu == 0 to be
the boot cpu), and disabling non boot cpus appears to be a strong
x86ism, in the first place.

If the only reason for disable_nonboot_cpus there is to avoid the
WARN_ON in init_low_mappings() we should seriously consider killing
it. If we can wait for 2.6.22 the relocatable x86_64 patchset that
Andi has queued, has changes that kill the init_low_mapping() hack.

I'm not very comfortable with calling cpu_down in a common code path
right now either. I'm fairly certain we still don't have that
correct. So if we confine the mess that is cpu_down to #if
defined(CPU_HOTPLUG) && defined(CONFIG_EXPERIMENTAL) I don't care.
If we start using it everywhere I'm very nervous. I know the irq
migration when bringing a cpu down is strongly racy, and I don't think
we actually put cpus to sleep properly either.

Eric
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Sunday, 25 March 2007 14:03, Eric W. Biederman wrote:
> Thomas Meyer <[EMAIL PROTECTED]> writes:
>
> > Eric W. Biederman wrote:
> >>
> >> Thomas could you verify the patch below makes the problem go away
> >> for you.
> >>
> >
> > The patch solves the problem. I'm writing this after the third suspend
> > and resume cycle.
> > msi irq stays enabled for libata device:
> > cat /sys/devices/pci\:00/\:00\:1f.2/irq
> > 218
>
> > The first suspend to disk is ok. The second suspend to disk has a
> > strange behaviour:
> > 1.) write pm image
> > 2.) the system disable the non-boot cpus again (i guess this happens in
> > power_down())

Yes, in kernel/power/disk.c:power_down() .

Please comment out the disable_nonboot_cpus() in there and retest (but please
test the latest Linus' tree).

> > 3.) the system doesn't power down.
> > 4.) pressing any key and the system powers down.
> >
> > The same is true for the third suspend cycle. Maybe an acpi problem?
>
> Sounds possible. You could probably verify it isn't my patch but running
> an unpatched kernel without msi support. As I think the crash you saw should
> only be reproducible when using devices that support msi.
>
> Unless I hear different I'm going to assume that this second case is a
> completely different problem.

I think it is different too.

Greetings,
Rafael
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Thomas Meyer <[EMAIL PROTECTED]> writes:

> Eric W. Biederman wrote:
>>
>> Thomas could you verify the patch below makes the problem go away
>> for you.
>>
>
> The patch solves the problem. I'm writing this after the third suspend
> and resume cycle.
> msi irq stays enabled for libata device:
> cat /sys/devices/pci\:00/\:00\:1f.2/irq
> 218

> The first suspend to disk is ok. The second suspend to disk has a
> strange behaviour:
> 1.) write pm image
> 2.) the system disable the non-boot cpus again (i guess this happens in
> power_down())
> 3.) the system doesn't power down.
> 4.) pressing any key and the system powers down.
>
> The same is true for the third suspend cycle. Maybe an acpi problem?

Sounds possible. You could probably verify it isn't my patch by running
an unpatched kernel without msi support, as I think the crash you saw should
only be reproducible when using devices that support msi.

Unless I hear differently I'm going to assume that this second case is a
completely different problem. You might check to see if the acpi
interrupt is stuck after a suspend/resume cycle.

At this point I'm going to wait a bit for Tony and Len to have a chance to
give their opinion, but unless I hear something I'm going to plan on sending
the patch out shortly...

Eric
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Eric W. Biederman wrote:
>
> Thomas could you verify the patch below makes the problem go away
> for you.
>
The patch solves the problem. I'm writing this after the third suspend
and resume cycle.
msi irq stays enabled for libata device:
cat /sys/devices/pci\:00/\:00\:1f.2/irq
218

cat /proc/interrupts
           CPU0       CPU1
  0:     274190          0   IO-APIC-edge      timer
  9:      13417          0   IO-APIC-fasteoi   acpi
 16:        166          0   IO-APIC-fasteoi   uhci_hcd:usb4
 17:      70908      88643   IO-APIC-fasteoi   wifi0
 18:       3060          0   IO-APIC-fasteoi   libata, uhci_hcd:usb3
 19:          8          0   IO-APIC-fasteoi   ohci1394, uhci_hcd:usb2
 20:      46252          0   IO-APIC-fasteoi   HDA Intel
 21:     168437          0   IO-APIC-fasteoi   uhci_hcd:usb1, ehci_hcd:usb5
218:      15896          0   PCI-MSI-edge      libata
219:          1          0   PCI-MSI-edge      eth0
NMI:          0          0
LOC:      87574     123338
ERR:          0
MIS:          0

BUT...

The first suspend to disk is ok. The second suspend to disk has a
strange behaviour:
1.) write pm image
2.) the system disable the non-boot cpus again (i guess this happens in
power_down())
3.) the system doesn't power down.
4.) pressing any key and the system powers down.

The same is true for the third suspend cycle. Maybe an acpi problem?
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Eric W. Biederman schrieb: Thomas could you verify the patch below makes the problem go away for you. The patch solves the problem. I'm writing this after the third suspend and resume cycle. msi irq stays enabled for libata device: cat /sys/devices/pci\:00/\:00\:1f.2/irq 218 cat /proc/interrupts CPU0 CPU1 0: 274190 0 IO-APIC-edge timer 9: 13417 0 IO-APIC-fasteoi acpi 16:166 0 IO-APIC-fasteoi uhci_hcd:usb4 17: 70908 88643 IO-APIC-fasteoi wifi0 18: 3060 0 IO-APIC-fasteoi libata, uhci_hcd:usb3 19: 8 0 IO-APIC-fasteoi ohci1394, uhci_hcd:usb2 20: 46252 0 IO-APIC-fasteoi HDA Intel 21: 168437 0 IO-APIC-fasteoi uhci_hcd:usb1, ehci_hcd:usb5 218: 15896 0 PCI-MSI-edge libata 219: 1 0 PCI-MSI-edge eth0 NMI: 0 0 LOC: 87574 123338 ERR: 0 MIS: 0 BUT... The first suspend to disk is ok. The second suspend to disk has a strange behaviour: 1.) write pm image 2.) the system disable the non-boot cpus again (i guess this happens in power_down()) 3.) the system doesn't power down. 4.) pressing any key and the system powers down. The same is true for the third suspend cycle. Maybe an acpi problem? - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Thomas Meyer [EMAIL PROTECTED] writes: Eric W. Biederman schrieb: Thomas could you verify the patch below makes the problem go away for you. The patch solves the problem. I'm writing this after the third suspend and resume cycle. msi irq stays enabled for libata device: cat /sys/devices/pci\:00/\:00\:1f.2/irq 218 The first suspend to disk is ok. The second suspend to disk has a strange behaviour: 1.) write pm image 2.) the system disable the non-boot cpus again (i guess this happens in power_down()) 3.) the system doesn't power down. 4.) pressing any key and the system powers down. The same is true for the third suspend cycle. Maybe an acpi problem? Sounds possible. You could probably verify it isn't my patch but running an unpatched kernel without msi support. As I think the crash you saw should only be reproducible when using devices that support msi. Unless I hear different I'm going to assume that this second case is a completely different problem. You might check to see if the acpi interrupt is stuck after a suspend/resume cycle. At this point I'm going to wait a bit for Tony and Len to have a chance to give their opinion but unless I hear something I'm going to plan on sending the patch out shortly... Eric - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Sunday, 25 March 2007 14:03, Eric W. Biederman wrote: Thomas Meyer [EMAIL PROTECTED] writes: Eric W. Biederman schrieb: Thomas could you verify the patch below makes the problem go away for you. The patch solves the problem. I'm writing this after the third suspend and resume cycle. msi irq stays enabled for libata device: cat /sys/devices/pci\:00/\:00\:1f.2/irq 218 The first suspend to disk is ok. The second suspend to disk has a strange behaviour: 1.) write pm image 2.) the system disable the non-boot cpus again (i guess this happens in power_down()) Yes, in kernel/power/disk.c:power_down() . Please comment out the disable_nonboot_cpus() in there and retest (but please test the latest Linus' tree). 3.) the system doesn't power down. 4.) pressing any key and the system powers down. The same is true for the third suspend cycle. Maybe an acpi problem? Sounds possible. You could probably verify it isn't my patch but running an unpatched kernel without msi support. As I think the crash you saw should only be reproducible when using devices that support msi. Unless I hear different I'm going to assume that this second case is a completely different problem. I think it is different too. Greetings, Rafael - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Rafael J. Wysocki [EMAIL PROTECTED] writes: Yes, in kernel/power/disk.c:power_down() . Please comment out the disable_nonboot_cpus() in there and retest (but please test the latest Linus' tree). rant Why do we even need a disable_nonboot_cpus in that path? machine_shutdown on i386 and x86_64 should take care of that. Further the code that computes the boot cpu is bogus (not all architectures require cpu == 0 to be the boot cpu), and disabling non boot cpus appears to be a strong x86ism, in the first place. If the only reason for disable_nonboot_cpus there is to avoid the WARN_ON in init_low_mappings() we should seriously consider killing it. If we can wait for 2.6.22 the relocatable x86_64 patchset that Andi has queued, has changes that kill the init_low_mapping() hack. I'm not very comfortable with calling cpu_down in a common code path right now either. I'm fairly certain we still don't have that correct. So if we confine the mess that is cpu_down to #if defined(CPU_HOTPLUG) defined(CONFIG_EXPERIMENTAL) I don't care. If we start using it everywhere I'm very nervous. I know the irq migration when bringing a cpu down is strongly racy, and I don't think we actually put cpus to sleep properly either. /rant Eric - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Eric W. Biederman schrieb: Sounds possible. You could probably verify it isn't my patch but running an unpatched kernel without msi support. As I think the crash you saw should only be reproducible when using devices that support msi. Without your patch and with pci=nomsi option the same error occur. But i think this is not an acpi error, because every interrupt seems to trigger the shutdown, like moving the mouse, or pressing a key. Unless I hear different I'm going to assume that this second case is a completely different problem. You might check to see if the acpi interrupt is stuck after a suspend/resume cycle. D'accord. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Rafael J. Wysocki schrieb: On Sunday, 25 March 2007 14:03, Eric W. Biederman wrote: Thomas Meyer [EMAIL PROTECTED] writes: Eric W. Biederman schrieb: Thomas could you verify the patch below makes the problem go away for you. The patch solves the problem. I'm writing this after the third suspend and resume cycle. msi irq stays enabled for libata device: cat /sys/devices/pci\:00/\:00\:1f.2/irq 218 The first suspend to disk is ok. The second suspend to disk has a strange behaviour: 1.) write pm image 2.) the system disable the non-boot cpus again (i guess this happens in power_down()) Yes, in kernel/power/disk.c:power_down() . Please comment out the disable_nonboot_cpus() in there and retest (but please test the latest Linus' tree). Without disable_nonboot_cpus in power_down the computer powers down without the mysterious wait for the next interrupt hang. 3.) the system doesn't power down. 4.) pressing any key and the system powers down. The same is true for the third suspend cycle. Maybe an acpi problem? Sounds possible. You could probably verify it isn't my patch but running an unpatched kernel without msi support. As I think the crash you saw should only be reproducible when using devices that support msi. Unless I hear different I'm going to assume that this second case is a completely different problem. I think it is different too. Yes, it's a different problem - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Sun, Mar 25, 2007 at 01:41:33PM +0200, Thomas Meyer wrote: ... The first suspend to disk is ok. The second suspend to disk has a strange behaviour: 1.) write pm image 2.) the system disable the non-boot cpus again (i guess this happens in power_down()) 3.) the system doesn't power down. 4.) pressing any key and the system powers down. ... Is this also present with 2.6.20, or is it a regression? cu Adrian -- Is there not promise of rain? Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. Only a promise, Lao Er said. Pearl S. Buck - Dragon Seed - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Adrian Bunk schrieb: On Sun, Mar 25, 2007 at 01:41:33PM +0200, Thomas Meyer wrote: ... The first suspend to disk is ok. The second suspend to disk has a strange behaviour: 1.) write pm image 2.) the system disable the non-boot cpus again (i guess this happens in power_down()) 3.) the system doesn't power down. 4.) pressing any key and the system powers down. ... Is this also present with 2.6.20, or is it a regression? No, this one is not present in 2.6.20 and this error doesn't (head= 317ec6cd00f25d05d153a780bc178c5335f320ee) occur with NO_HZ=n and HIGH_RES_TIMERS=n This error is maybe related with this commit: commit cd05a1f818073a623455a58e756c5b419fc98db9 Author: Thomas Gleixner [EMAIL PROTECTED] Date: Sat Mar 17 00:25:52 2007 +0100 [PATCH] clockevents: Fix suspend/resume to disk hangs I finally found a dual core box, which survives suspend/resume without crashing in the middle of nowhere. Sigh, I never figured out from the code and the bug reports what's going on. The observed hangs are caused by a stale state transition of the clock event devices, which keeps the RCU synchronization away from completion, when the non boot CPU is brought back up. The suspend/resume in oneshot mode needs the similar care as the periodic mode during suspend to RAM. My assumption that the state transitions during the different shutdown/bringups of s2disk would go through the periodic boot phase and then switch over to highres resp. nohz mode were simply wrong. Add the appropriate suspend / resume handling for the non periodic modes. Signed-off-by: Thomas Gleixner [EMAIL PROTECTED] Signed-off-by: Linus Torvalds [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Sunday, 25 March 2007 16:17, Thomas Meyer wrote: Rafael J. Wysocki schrieb: On Sunday, 25 March 2007 14:03, Eric W. Biederman wrote: Thomas Meyer [EMAIL PROTECTED] writes: Eric W. Biederman schrieb: Thomas could you verify the patch below makes the problem go away for you. The patch solves the problem. I'm writing this after the third suspend and resume cycle. msi irq stays enabled for libata device: cat /sys/devices/pci\:00/\:00\:1f.2/irq 218 The first suspend to disk is ok. The second suspend to disk has a strange behaviour: 1.) write pm image 2.) the system disable the non-boot cpus again (i guess this happens in power_down()) Yes, in kernel/power/disk.c:power_down() . Please comment out the disable_nonboot_cpus() in there and retest (but please test the latest Linus' tree). Without disable_nonboot_cpus in power_down the computer powers down without the mysterious wait for the next interrupt hang. Do you have CONFIG_NO_HZ set? Rafael - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Sunday, 25 March 2007 19:25, Thomas Meyer wrote: Adrian Bunk schrieb: On Sun, Mar 25, 2007 at 01:41:33PM +0200, Thomas Meyer wrote: ... The first suspend to disk is ok. The second suspend to disk has a strange behaviour: 1.) write pm image 2.) the system disable the non-boot cpus again (i guess this happens in power_down()) 3.) the system doesn't power down. 4.) pressing any key and the system powers down. ... Is this also present with 2.6.20, or is it a regression? No, this one is not present in 2.6.20 and this error doesn't (head= 317ec6cd00f25d05d153a780bc178c5335f320ee) occur with NO_HZ=n and HIGH_RES_TIMERS=n This error is maybe related with this commit: Yes, it is, but I'd rather remove the disable_nonboot_cpus() from power_down() (as Eric suggested) instead of trying to handle the RCU sync problem here. This has been caused by my commit 94985134b7b46848267ed6b734320db01c974e72 (swsusp: disable nonboot CPUs before entering platform suspend) that in such a case should be reverted. Greetings, Rafael - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Sunday, 25 March 2007 14:56, Eric W. Biederman wrote: Rafael J. Wysocki [EMAIL PROTECTED] writes: Yes, in kernel/power/disk.c:power_down() . Please comment out the disable_nonboot_cpus() in there and retest (but please test the latest Linus' tree). rant Why do we even need a disable_nonboot_cpus in that path? machine_shutdown on i386 and x86_64 should take care of that. Further the code that computes the boot cpu is bogus (not all architectures require cpu == 0 to be the boot cpu), and disabling non boot cpus appears to be a strong x86ism, in the first place. Yes. If the only reason for disable_nonboot_cpus there is to avoid the WARN_ON in init_low_mappings() we should seriously consider killing it. We have considered it, but no one was sure that it was a good idea. If we can wait for 2.6.22 the relocatable x86_64 patchset that Andi has queued, has changes that kill the init_low_mapping() hack. I think we should kill the WARN_ON() right now, perhaps replacing it with a FIXME comment. I'm not very comfortable with calling cpu_down in a common code path right now either. I'm fairly certain we still don't have that correct. So if we confine the mess that is cpu_down to #if defined(CPU_HOTPLUG) defined(CONFIG_EXPERIMENTAL) I don't care. If we start using it everywhere I'm very nervous. migration when bringing a cpu down is strongly racy, and I don't think we actually put cpus to sleep properly either. I'm interested in all of the details, please. I seriously consider dropping cpu_up()/cpu_down() from the suspend code paths. Greetings, Rafael - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Sunday, 25 March 2007 21:06, Rafael J. Wysocki wrote:
> On Sunday, 25 March 2007 19:25, Thomas Meyer wrote:
>> Adrian Bunk wrote:
>>> On Sun, Mar 25, 2007 at 01:41:33PM +0200, Thomas Meyer wrote:
>>>> ...
>>>> The first suspend to disk is ok. The second suspend to disk has a
>>>> strange behaviour:
>>>> 1.) write pm image
>>>> 2.) the system disables the non-boot cpus again (i guess this
>>>>     happens in power_down())
>>>> 3.) the system doesn't power down.
>>>> 4.) pressing any key and the system powers down.
>>>> ...
>>>
>>> Is this also present with 2.6.20, or is it a regression?
>>
>> No, this one is not present in 2.6.20 and this error doesn't
>> (head= 317ec6cd00f25d05d153a780bc178c5335f320ee) occur with NO_HZ=n
>> and HIGH_RES_TIMERS=n
>>
>> This error is maybe related with this commit:
>
> Yes, it is, but I'd rather remove the disable_nonboot_cpus() from
> power_down() (as Eric suggested) instead of trying to handle the RCU
> sync problem here.
>
> This has been caused by my commit
> 94985134b7b46848267ed6b734320db01c974e72 (swsusp: disable nonboot
> CPUs before entering platform suspend) that in such a case should be
> reverted.

s/such a case/the present situation/

Rafael
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Rafael J. Wysocki [EMAIL PROTECTED] writes:

> On Sunday, 25 March 2007 14:56, Eric W. Biederman wrote:
>> Rafael J. Wysocki [EMAIL PROTECTED] writes:
>>
>>> Yes, in kernel/power/disk.c:power_down() .
>>>
>>> Please comment out the disable_nonboot_cpus() in there and retest
>>> (but please test the latest Linus' tree).
>>
>> <rant>
>>
>> Why do we even need a disable_nonboot_cpus in that path?
>> machine_shutdown on i386 and x86_64 should take care of that.
>> Further the code that computes the boot cpu is bogus (not all
>> architectures require cpu == 0 to be the boot cpu), and disabling
>> non boot cpus appears to be a strong x86ism, in the first place.
>
> Yes.
>
>> If the only reason for disable_nonboot_cpus there is to avoid the
>> WARN_ON in init_low_mappings() we should seriously consider killing
>> it.
>
> We have considered it, but no one was sure that it was a good idea.

The problem with the current init_low_mappings is that it hacks the
current page table. If we can instead use a different page table the
code becomes SMP safe.

I have extracted the patch that addresses this from the relocatable
patchset and appended it for sparking ideas. It goes a little farther
than we need to solve this issue but the basics are there.

>> If we can wait for 2.6.22 the relocatable x86_64 patchset that Andi
>> has queued, has changes that kill the init_low_mapping() hack.
>
> I think we should kill the WARN_ON() right now, perhaps replacing it
> with a FIXME comment.

Reasonable.

>> I'm not very comfortable with calling cpu_down in a common code
>> path right now either. I'm fairly certain we still don't have that
>> correct. So if we confine the mess that is cpu_down to
>> #if defined(CPU_HOTPLUG) && defined(CONFIG_EXPERIMENTAL) I don't
>> care. If we start using it everywhere I'm very nervous. migration
>> when bringing a cpu down is strongly racy, and I don't think we
>> actually put cpus to sleep properly either.
>
> I'm interested in all of the details, please. I seriously consider
> dropping cpu_up()/cpu_down() from the suspend code paths.

So I'm not certain if in a multiple cpu context we can avoid all of
the issues with cpu hotplug but there is a reasonable chance so I will
explain as best I can.

Yanking the appropriate code out of linuxbios the way a processor
should stop itself is to send an INIT IPI to itself. This puts a cpu
into an optimized wait for startup IPI state where it is otherwise
disabled. This is the state any sane BIOS will put the cpus into
before control is handed off to the kernel.

static inline void stop_this_cpu(void)
{
	unsigned apicid;
	apicid = lapicid();

	/* Send an APIC INIT to myself */
	lapic_write(LAPIC_ICR2, SET_LAPIC_DEST_FIELD(apicid));
	lapic_write(LAPIC_ICR, LAPIC_INT_LEVELTRIG | LAPIC_INT_ASSERT | LAPIC_DM_INIT);
	/* Wait for the ipi send to finish */
	lapic_wait_icr_idle();

	/* Deassert the APIC INIT */
	lapic_write(LAPIC_ICR2, SET_LAPIC_DEST_FIELD(apicid));
	lapic_write(LAPIC_ICR, LAPIC_INT_LEVELTRIG | LAPIC_DM_INIT);
	/* Wait for the ipi send to finish */
	lapic_wait_icr_idle();

	/* If I haven't halted spin forever */
	for(;;) {
		hlt();
	}
}

I'm not certain what to do with the interrupt races. But I will see
if I can explain what I know.

<braindump>

- Most ioapics are buggy.

- Most ioapics do not follow pci-ordering rules with respect to
  interrupt message delivery so ensuring all in-flight irqs have
  arrived somewhere is very hard.

- To avoid bugs we always limit ourselves to reprogramming the ioapics
  in the interrupt handler, and not considering an interrupt
  successfully reprogrammed until we have received an irq in the new
  location.

- On x86 we have two basic interrupt handling modes.
  o logical addressing with lowest priority delivery.
  o physical addressing with delivery to a single cpu.

- With logical addressing as long as the cpu is not available for
  having an interrupt delivered to it the interrupt will never be
  delivered to a particular cpu. Ideally we also update the mask in
  the ioapic to not target that cpu.

- With physical addressing targeting a single cpu we need to reprogram
  the ioapics not to target that specific cpu. This needs to happen
  in the interrupt handler and we need to wait for the next interrupt
  before we tear down our data structures for handling the interrupt.

The current cpu hotplug code attempts to reprogram the ioapics from
process context which is just wrong.

Now as part of suspend/resume I think we should be programming the
hardware not to generate interrupts in the first place at the actual
hardware devices so we can likely avoid all of the code that
reprograms interrupts while they are active.

If we can use things like pci ordering rules to ensure the device will
never fire the interrupt until resumed we should be able to disable
interrupts synchronously. Something that we can not safely do in
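
The rule in the braindump above (reprogram only from the interrupt handler, and treat a move as finished only once an interrupt has arrived at the new destination) can be sketched as a small state machine. This is a toy model under stated assumptions, not kernel code: the `mig_` names and the three-field structure are hypothetical, and the "ioapic write" is reduced to an assignment.

```c
#include <assert.h>

/* Hypothetical per-irq bookkeeping; not a kernel structure. */
struct mig_irq {
	int cur_cpu;     /* destination currently programmed in the "ioapic" */
	int pending_cpu; /* requested new destination, -1 if none */
	int old_cpu;     /* old destination whose state is still kept, -1 if none */
};

/* Process context: only record the wish, never touch the routing here. */
void mig_request_move(struct mig_irq *irq, int cpu)
{
	irq->pending_cpu = cpu;
}

/* Interrupt context: called when the irq is handled on `cpu`. */
void mig_handle_irq(struct mig_irq *irq, int cpu)
{
	/* An irq seen at the new destination proves the move took effect,
	 * so only now is the old destination's state released. */
	if (irq->old_cpu >= 0 && cpu == irq->cur_cpu)
		irq->old_cpu = -1;

	/* Start a new move only when no earlier move is still in flight. */
	if (irq->pending_cpu >= 0 && irq->old_cpu < 0) {
		irq->old_cpu = irq->cur_cpu;
		irq->cur_cpu = irq->pending_cpu;
		irq->pending_cpu = -1;
	}
}

/* Walks one move from cpu 0 to cpu 1; returns the final destination. */
int mig_demo(void)
{
	struct mig_irq irq = { 0, -1, -1 };

	mig_request_move(&irq, 1);
	assert(irq.cur_cpu == 0);   /* nothing reprogrammed yet */
	mig_handle_irq(&irq, 0);    /* reprogrammed inside the handler */
	assert(irq.cur_cpu == 1 && irq.old_cpu == 0);
	mig_handle_irq(&irq, 0);    /* late in-flight irq at the old cpu */
	assert(irq.old_cpu == 0);   /* old state must still be there */
	mig_handle_irq(&irq, 1);    /* first irq at the new cpu */
	assert(irq.old_cpu == -1);  /* now teardown is safe */
	return irq.cur_cpu;
}
```

The point of keeping `old_cpu` is the late in-flight interrupt in the middle of `mig_demo()`: it still finds valid state at the old destination, which is exactly what reprogramming from process context cannot guarantee.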
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Friday, 23 March 2007 at 19:50 +0100, Adrian Bunk wrote:
> Subject    : MacMini: doesn't come out of suspend to ram
> References : http://lkml.org/lkml/2007/3/21/374
> Submitter  : Frédéric RISS [EMAIL PROTECTED]
>              Tino Keitel [EMAIL PROTECTED]
> Caused-By  : Bob Moore [EMAIL PROTECTED]
>              commit c5a7156959e89b32260ad6072bbf5077bcdfbeee
> Status     : unknown

I spent some time this weekend investigating this issue more
thoroughly. In fact the regression caused by this commit has been
corrected by f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 (ACPI: Disable
wake GPEs only once.)

However, as I pointed out in the initial report, the MacMini doesn't
come out of suspend to ram because a commit in another merged patchset
broke it. I tracked it down to:

    commit e9e2cdb412412326c4827fc78ba27f410d837e6e
    parent 79bf2bb335b85db25d27421c798595a2fa2a0e82
    Author: Thomas Gleixner [EMAIL PROTECTED]
    Date:   Fri Feb 16 01:28:04 2007 -0800

        [PATCH] clockevents: i386 drivers

This patch has already been mentioned in regression reports, but
AFAICS not related to suspend issues.

To be totally clear about what works and what doesn't:

    79bf2bb335b85db25d27421c798595a2fa2a0e82
      + cherry-pick f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 == works
    e9e2cdb412412326c4827fc78ba27f410d837e6e
      + cherry-pick f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 == broken

To try to get more information, I commented out the call to
do_suspend_lowlevel in drivers/acpi/sleep/main.c and used
CONFIG_DISABLE_CONSOLE_SUSPEND. Interestingly, the suspend/resume
cycle completes correctly in this mode.

Fred.
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Sunday, 25 March 2007 at 23:34 +0200, Frédéric Riss wrote:
> However, as I pointed out in the initial report, the MacMini doesn't
> come out of suspend to ram because a commit in another merged
> patchset broke it. I tracked it down to:
>
>     commit e9e2cdb412412326c4827fc78ba27f410d837e6e
>     parent 79bf2bb335b85db25d27421c798595a2fa2a0e82
>     Author: Thomas Gleixner [EMAIL PROTECTED]
>     Date:   Fri Feb 16 01:28:04 2007 -0800
>
>         [PATCH] clockevents: i386 drivers
>
> This patch has already been mentioned in regression reports, but
> AFAICS not related to suspend issues.
>
> To be totally clear about what works and what doesn't:
>
>     79bf2bb335b85db25d27421c798595a2fa2a0e82
>       + cherry-pick f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 == works
>     e9e2cdb412412326c4827fc78ba27f410d837e6e
>       + cherry-pick f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 == broken
>
> To try to get more information, I commented out the call to
> do_suspend_lowlevel in drivers/acpi/sleep/main.c and used
> CONFIG_DISABLE_CONSOLE_SUSPEND. Interestingly, the suspend/resume
> cycle completes correctly in this mode.

Additional data point: I just tried with -rc5 and the issue is still
present. The config I used for this test defines neither NO_HZ nor
HIGH_RES_TIMERS.

Fred.
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Thomas Meyer <[EMAIL PROTECTED]> writes:

> Eric W. Biederman wrote:
>>
>> Odd. I would have thought the oops happened in the first resume, not
>> the second.
>>
>> Hmm. It may have something to do with the ``managed'' driver
>> aspect of this as well..
>>
> No. I don't think so. The problem is caused by this sequence: (the info
> is always before entry of a function and before the exit of a function):

Ok. Thanks.

It is the ordering of events that keeps it from showing up. The
problem happens the first time but only after we have restored msi
state so we don't see the ill effects until the second time.

Ok staring at the code and thinking about the problem.

The only thing that pci_enable_device does (except messing with irqs)
is flip enable bits. Further pci_enable_device only messes with irqs
on 5 architectures.

Only ia64 really cares.
i386 and x86_64 it is simply delaying work until we need it.
frv doesn't really care it just pokes the irq value back into the
hardware for some reason.
cris just sets a hard coded value. Does cris only have one pci irq?

So I think the right solution is to simply make pci_enable_device
just flip enable bits and move the rest of the work someplace else.

However a thorough cleanup is a little extreme for this point in the
release cycle, so I think a quick hack that makes the code not stomp
the irq when msi irq's are enabled should be the first fix. Then we
can later make the code not change the irqs at all.

Thomas could you verify the patch below makes the problem go away for
you.

Tony, Len the way pci_disable_device is being used in a suspend/resume
path by a few drivers is completely incompatible with the way irqs are
allocated on ia64. In particular the following sequence occurs in
several drivers.

probe:
	pci_enable_device(pdev);
	request_irq(pdev->irq);
suspend:
	pci_disable_device(pdev);
resume:
	pci_enable_device(pdev);
remove:
	free_irq(pdev->irq);
	pci_disable_device(pdev);

What I'm proposing we do is move the irq allocation code out of
pci_enable_device and the irq freeing code out of pci_disable_device
in the future. If we move ia64 to a model where the irq number equals
the gsi like we have for x86_64 and are in the middle of for i386 that
should be pretty straightforward. It would even be relatively simple
to delay vector allocation in that context until request_irq, if we
needed the delayed allocation benefit. Do you two have any problems
with moving in that direction?

If fixing the arch code is unacceptable for some reason I'm not aware
of we need to audit the 10-20 drivers that call pci_disable_device in
their suspend/resume processing and ensure that they have freed all of
the irqs before that point. Given that I have bug reports on the msi
path I know that isn't true.

Tony, Len before we merge any fixes for 2.6.21-rcX I'd like to at
least get an ack on the long term direction.

Thanks,
Eric

diff --git a/arch/cris/arch-v32/drivers/pci/bios.c b/arch/cris/arch-v32/drivers/pci/bios.c
index a2b9c60..5b79a7a 100644
--- a/arch/cris/arch-v32/drivers/pci/bios.c
+++ b/arch/cris/arch-v32/drivers/pci/bios.c
@@ -100,7 +100,9 @@ int pcibios_enable_device(struct pci_dev *dev, int mask)
 	if ((err = pcibios_enable_resources(dev, mask)) < 0)
 		return err;
 
-	return pcibios_enable_irq(dev);
+	if (!dev->msi_enabled)
+		pcibios_enable_irq(dev);
+	return 0;
 }
 
 int pcibios_assign_resources(void)
diff --git a/arch/frv/mb93090-mb00/pci-vdk.c b/arch/frv/mb93090-mb00/pci-vdk.c
index f7279d7..0b581e3 100644
--- a/arch/frv/mb93090-mb00/pci-vdk.c
+++ b/arch/frv/mb93090-mb00/pci-vdk.c
@@ -466,6 +466,7 @@ int pcibios_enable_device(struct pci_dev *dev, int mask)
 	if ((err = pcibios_enable_resources(dev, mask)) < 0)
 		return err;
 
-	pcibios_enable_irq(dev);
+	if (!dev->msi_enabled)
+		pcibios_enable_irq(dev);
 	return 0;
 }
diff --git a/arch/i386/pci/common.c b/arch/i386/pci/common.c
index 1bb0693..a990a6c 100644
--- a/arch/i386/pci/common.c
+++ b/arch/i386/pci/common.c
@@ -426,11 +426,13 @@ int pcibios_enable_device(struct pci_dev *dev, int mask)
 	if ((err = pcibios_enable_resources(dev, mask)) < 0)
 		return err;
 
-	return pcibios_enable_irq(dev);
+	if (!dev->msi_enabled)
+		return pcibios_enable_irq(dev);
+	return 0;
 }
 
 void pcibios_disable_device (struct pci_dev *dev)
 {
-	if (pcibios_disable_irq)
+	if (!dev->msi_enabled && pcibios_disable_irq)
 		pcibios_disable_irq(dev);
 }
diff --git a/arch/ia64/pci/pci.c b/arch/ia64/pci/pci.c
index 474d179..f8bcccd 100644
--- a/arch/ia64/pci/pci.c
+++ b/arch/ia64/pci/pci.c
@@ -557,14 +557,18 @@ pcibios_enable_device (struct pci_dev *dev, int mask)
 	if (ret < 0)
 		return ret;
 
-	return acpi_pci_irq_enable(dev);
+	if (!dev->msi_enabled)
+		return acpi_pci_irq_enable(dev);
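
The effect of the guard in the hunks above can be checked with a small stand-alone model. This is only a sketch: the `guard_` names, the two-field device structure and the irq numbers 19 and 218 (taken from Thomas' trace) are assumptions for illustration, and the arch hooks are reduced to an assignment and an optional function pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { GUARD_PIN_IRQ = 19, GUARD_MSI_IRQ = 218 };

/* Hypothetical stand-in for the two struct pci_dev fields involved. */
struct guard_dev {
	unsigned int irq;
	bool msi_enabled;
};

/* Stand-in for the arch hook that routes the legacy pin interrupt. */
int guard_enable_irq(struct guard_dev *dev)
{
	dev->irq = GUARD_PIN_IRQ;
	return 0;
}

/* Optional hook that may be NULL, as in the i386 hunk. */
void (*guard_disable_irq_hook)(struct guard_dev *dev) = NULL;

/* Enable path with the proposed guard: leave irq alone under MSI. */
int guard_enable_device(struct guard_dev *dev)
{
	if (!dev->msi_enabled)
		return guard_enable_irq(dev);
	return 0;
}

/* Disable path with the same guard. */
void guard_disable_device(struct guard_dev *dev)
{
	if (!dev->msi_enabled && guard_disable_irq_hook)
		guard_disable_irq_hook(dev);
}

/* probe, switch to MSI, then two suspend/resume cycles. */
int guard_demo(void)
{
	struct guard_dev dev = { 0, false };
	int i;

	guard_enable_device(&dev);           /* probe: pin irq assigned */
	assert(dev.irq == GUARD_PIN_IRQ);

	dev.irq = GUARD_MSI_IRQ;             /* driver switches to MSI */
	dev.msi_enabled = true;

	for (i = 0; i < 2; i++) {
		guard_disable_device(&dev);  /* suspend */
		guard_enable_device(&dev);   /* resume */
		assert(dev.irq == GUARD_MSI_IRQ);
	}
	return (int)dev.irq;
}
```

With the guard removed from `guard_enable_device()`, the first resume would already rewrite `irq` to 19 while `msi_enabled` stays set, which is the inconsistent state the quick hack is meant to avoid.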
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Eric W. Biederman wrote:
>
> Odd. I would have thought the oops happened in the first resume, not
> the second.
>
> Hmm. It may have something to do with the ``managed'' driver
> aspect of this as well..
>
No. I don't think so. The problem is caused by this sequence: (the info
is always before entry of a function and before the exit of a function):

1.) Normal boot

[kernel] ahci :00:1f.2: version 2.1
[kernel] pci_enable_device: dev= c1a59000
[kernel] pci_enable_device: irq= 0
[kernel] pci_enable_device: msi_enabled= 0
[kernel] PCI: Enabling device :00:1f.2 (0005 -> 0007)
[kernel] ACPI: PCI Interrupt :00:1f.2[B] -> GSI 19 (level, low) -> IRQ 19
[kernel] pci_enable_device: dev= c1a59000
[kernel] pci_enable_device: irq= 19
[kernel] pci_enable_device: msi_enabled= 0

2.) msi irq 218 gets assigned

3) First suspend to disk. Consists of

3a) Suspend devices

[kernel] ahci :00:1f.2: freeze
[kernel] pci_disable_device: dev= c1a59000
[kernel] pci_disable_device: irq= 218
[kernel] pci_disable_device: msi_enabled= 1
[kernel] ACPI: PCI interrupt for device :00:1f.2 disabled
[kernel] pci_disable_device: dev= c1a59000
[kernel] pci_disable_device: irq= 218
[kernel] pci_disable_device: msi_enabled= 1

3b) Disable non-boot cpus
3c) Snapshot memory
3d) Enable non-boot cpus
3e) Resume devices (after snapshot!)

[kernel] ahci :00:1f.2: resuming
[kernel] PM: Writing back config space on device :00:1f.2 at offset 1 (was 2b00403, writing 2b00407)
[kernel] pci_enable_device: dev= c1a59000
[kernel] pci_enable_device: irq= 218
[kernel] pci_enable_device: msi_enabled= 1
[kernel] ACPI: PCI Interrupt :00:1f.2[B] -> GSI 19 (level, low) -> IRQ 19
[kernel] pci_enable_device: dev= c1a59000
[kernel] pci_enable_device: irq= 19
[kernel] pci_enable_device: msi_enabled= 1

3f) Write memory image
3g) Power down + reboot

4a) Normal start and restore memory image
4b) Enable non-boot cpus
4c) Resume devices

[kernel] ahci :00:1f.2: resuming
[kernel] PM: Writing back config space on device :00:1f.2 at offset 1 (was 2b00403, writing 2b00407)
[kernel] pci_enable_device: dev= c1a59000
[kernel] pci_enable_device: irq= 218
[kernel] pci_enable_device: msi_enabled= 1
[kernel] ACPI: PCI Interrupt :00:1f.2[B] -> GSI 19 (level, low) -> IRQ 19
[kernel] pci_enable_device: dev= c1a59000
[kernel] pci_enable_device: irq= 19
[kernel] pci_enable_device: msi_enabled= 1

Now the system is running with irq=19 and msi enabled=1. So let's
suspend again:

5) Second suspend to disk consists of

5a) Suspend devices

[kernel] ahci :00:1f.2: freeze
[kernel] pci_disable_device: dev= c1a59000
[kernel] pci_disable_device: irq= 19
[kernel] pci_disable_device: msi_enabled= 1
[kernel] ACPI: PCI interrupt for device :00:1f.2 disabled
[kernel] pci_disable_device: dev= c1a59000
[kernel] pci_disable_device: irq= 19
[kernel] pci_disable_device: msi_enabled= 1

5b) Disable non-boot cpus
5c) Snapshot memory
5d) Enable non-boot cpus
5e) Resume devices

[kernel] pci_enable_device: dev= c1a59000
[kernel] pci_enable_device: irq= 19
[kernel] pci_enable_device: msi_enabled= 1

-> OOPS in restore_msi because it tries to access msi structure for
irq 19 and not 218.

So I guess this has nothing to do with the managed pci functions?
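
The trace boils down to two fields of the device drifting out of step: `irq` is rewritten to 19 on resume while `msi_enabled` stays 1. A minimal toy model of that, assuming (the trace does not show this directly) that the MSI restore step looks its state up by `dev->irq` and runs before the enable path overwrites the field. All `model_` names are hypothetical, and the model deliberately ignores that restoring the memory image brings back the pre-resume value of `irq`, which is why the real trace needs a third resume before the oops.

```c
#include <assert.h>
#include <stdbool.h>

enum { MODEL_PIN_IRQ = 19, MODEL_MSI_IRQ = 218 };

/* Hypothetical stand-in for the two struct pci_dev fields in the trace. */
struct model_dev {
	unsigned int irq;
	bool msi_enabled;
};

/* Unguarded enable path: always rewrites irq with the pin (GSI) irq. */
void model_enable_device(struct model_dev *dev)
{
	dev->irq = MODEL_PIN_IRQ;
}

/* Step 2 of the trace: the driver switches the device to MSI. */
void model_enable_msi(struct model_dev *dev)
{
	dev->irq = MODEL_MSI_IRQ;
	dev->msi_enabled = true;
}

/* The MSI state is keyed by irq number, so the lookup only succeeds
 * while dev->irq still names the MSI irq. */
bool model_restore_msi(const struct model_dev *dev)
{
	return !dev->msi_enabled || dev->irq == MODEL_MSI_IRQ;
}

/* One resume: restore MSI state, then enable the device.
 * Returns false where the real kernel would oops. */
bool model_resume(struct model_dev *dev)
{
	bool found = model_restore_msi(dev);

	model_enable_device(dev);
	return found;
}

/* Returns the number of the first resume that fails, or 0 if none. */
int model_demo(void)
{
	struct model_dev dev = { 0, false };
	int cycle;

	model_enable_device(&dev);   /* 1.) normal boot, irq = 19 */
	model_enable_msi(&dev);      /* 2.) msi irq 218 assigned */

	for (cycle = 1; cycle <= 3; cycle++)
		if (!model_resume(&dev))
			return cycle;
	return 0;
}
```

In this model the first resume still finds its MSI state and only then clobbers `irq`; the damage is done on the first cycle but only becomes visible on the next one.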
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Thomas Meyer <[EMAIL PROTECTED]> writes:

> Eric W. Biederman wrote:
>> Thomas Meyer <[EMAIL PROTECTED]> writes:
>>
>>> Adrian Bunk wrote:
>>>> Subject    : second suspend to disk in a row results in an oops (libata?)
>>>> References : http://lkml.org/lkml/2007/3/17/43
>>>> Submitter  : Thomas Meyer <[EMAIL PROTECTED]>
>>>> Status     : unknown
>>>
>>> The problem is identified: http://lkml.org/lkml/2007/3/22/150
>>
>> Given the description above I'm a little confused. Doesn't this
>> happen every time now?
>>
> With current git head the oops happens in the second suspend to disk
> attempt in a row.

Odd. I would have thought the oops happened in the first resume, not
the second.

Hmm. It may have something to do with the ``managed'' driver
aspect of this as well..

>> Or was this happening only the second time before I started my msi
>> fixes...
>>
> So i think, that the current git head already contains your msi fixes.

Yes it does.

> I don't know if this already happened before your msi changes, but i can
> test 2.6.20 if you like to?

Sure. A data point if you boot with nomsi or have a kernel compiled
without msi support would be interesting as well, as the problem case
may not show up without msi support in the picture.

Eric
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Eric W. Biederman wrote:
> Thomas Meyer <[EMAIL PROTECTED]> writes:
>
>> Adrian Bunk wrote:
>>
>>> Subject    : second suspend to disk in a row results in an oops (libata?)
>>> References : http://lkml.org/lkml/2007/3/17/43
>>> Submitter  : Thomas Meyer <[EMAIL PROTECTED]>
>>> Status     : unknown
>>>
>> The problem is identified: http://lkml.org/lkml/2007/3/22/150
>>
>
> Given the description above I'm a little confused. Doesn't this
> happen every time now?
>
With current git head the oops happens in the second suspend to disk
attempt in a row.

> Or was this happening only the second time before I started my msi
> fixes...
>
So I think that the current git head already contains your msi fixes.
I don't know if this already happened before your msi changes, but I
can test 2.6.20 if you like?
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Thomas Meyer <[EMAIL PROTECTED]> writes:

> Adrian Bunk wrote:
>> Subject    : second suspend to disk in a row results in an oops (libata?)
>> References : http://lkml.org/lkml/2007/3/17/43
>> Submitter  : Thomas Meyer <[EMAIL PROTECTED]>
>> Status     : unknown
>>
>
> The problem is identified: http://lkml.org/lkml/2007/3/22/150

Given the description above I'm a little confused. Doesn't this
happen every time now?

Or was this happening only the second time before I started my msi
fixes...

Eric
Re: [3/5] 2.6.21-rc4: known regressions (v2)
Adrian Bunk wrote:
> Subject    : second suspend to disk in a row results in an oops (libata?)
> References : http://lkml.org/lkml/2007/3/17/43
> Submitter  : Thomas Meyer <[EMAIL PROTECTED]>
> Status     : unknown
>
The problem is identified: http://lkml.org/lkml/2007/3/22/150
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Friday, 23 March 2007 19:50, Adrian Bunk wrote:
> This email lists some known regressions in Linus' tree compared to 2.6.20.
>
> If you find your name in the Cc header, you are either submitter of one
> of the bugs, maintainer of an affected subsystem or driver, a patch
> of yours caused a breakage or I'm considering you in any other way
> possibly involved with one or more of these issues.
>
> Due to the huge amount of recipients, please trim the Cc when answering.
>
[--snip--]
> Subject    : suspend to disk hangs
> References : http://lkml.org/lkml/2007/3/16/126
> Submitter  : Maxim Levitsky <[EMAIL PROTECTED]>
> Caused-By  : Rafael J. Wysocki <[EMAIL PROTECTED]>
>              commit e3c7db621bed4afb8e231cb005057f2feb5db557
>              commit ed746e3b18f4df18afa3763155972c5835f284c5
>              commit 259130526c267550bc365d3015917d90667732f1
> Status     : unknown

The problem has been identified as the known issue related to the XFS freezable workqueues. A patch is available (http://lkml.org/lkml/2007/3/21/328) and has been merged.

Still, there is a problem with the microcode update driver that's being worked on. Reporters of the resume problems who use the microcode driver: please check whether the problems go away if you unload the driver before the suspend.

Greetings,
Rafael
Re: [3/5] 2.6.21-rc4: known regressions (v2)
On Friday 23 March 2007 20:50:22 Adrian Bunk wrote:
> Subject    : suspend to disk hangs
> References : http://lkml.org/lkml/2007/3/16/126
> Submitter  : Maxim Levitsky <[EMAIL PROTECTED]>
> Caused-By  : Rafael J. Wysocki <[EMAIL PROTECTED]>
>              commit e3c7db621bed4afb8e231cb005057f2feb5db557
>              commit ed746e3b18f4df18afa3763155972c5835f284c5
>              commit 259130526c267550bc365d3015917d90667732f1
> Status     : unknown

Hello,

It is fixed.

The problem is that cpu_up/cpu_down is now called with tasks frozen, and this can lead to a deadlock if a driver that registered a cpu up/down notifier sleeps.

On my system it froze in two places: once in XFS, due to freezable workqueues, and once in the microcode update driver, which asks the "frozen" userspace for firmware.

The fix for XFS is already in mainline, and Rafael J. Wysocki has already posted a patch that fixes the microcode issue; I will test it.

I suspect there are more drivers that can deadlock the system in the same way, but on my system S3/S4 now works perfectly :-) Even the weird hang I had has disappeared.

Big thanks to Rafael J. Wysocki.

Best regards,
Maxim Levitsky
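The deadlock described above can be sketched as a toy userspace model. It is only an illustration under stated assumptions: every toy_* name is hypothetical, nothing here is a kernel API, and the "deferring" callback is just one conceivable way to avoid the hang, not necessarily what the posted microcode patch does. The point it shows is that a cpu up/down callback which needs userspace cannot complete while tasks are frozen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; none of these names are kernel APIs. */
enum toy_result {
	TOY_DONE = 0,           /* callback finished its work */
	TOY_DEFERRED = 1,       /* callback postponed its work until tasks thaw */
	TOY_WOULD_DEADLOCK = 2  /* callback would sleep waiting on a frozen task */
};

struct toy_system {
	bool tasks_frozen;      /* true during the suspend/resume window */
};

typedef enum toy_result (*toy_cpu_notifier)(const struct toy_system *sys);

/* Always asks userspace for firmware, even when userspace cannot run. */
enum toy_result toy_naive_firmware_notifier(const struct toy_system *sys)
{
	return sys->tasks_frozen ? TOY_WOULD_DEADLOCK : TOY_DONE;
}

/* Hypothetical alternative: postpone the request while tasks are frozen. */
enum toy_result toy_deferring_firmware_notifier(const struct toy_system *sys)
{
	return sys->tasks_frozen ? TOY_DEFERRED : TOY_DONE;
}

/* Run every callback for a cpu coming up and report the worst outcome. */
enum toy_result toy_cpu_up(const struct toy_system *sys,
			   const toy_cpu_notifier *chain, size_t n)
{
	enum toy_result worst = TOY_DONE;
	size_t i;

	assert(chain != NULL || n == 0);
	for (i = 0; i < n; i++) {
		enum toy_result r = chain[i](sys);

		if (r > worst)
			worst = r;
	}
	return worst;
}
```

In this model, calling toy_cpu_up() with tasks_frozen set reports TOY_WOULD_DEADLOCK for the naive callback and TOY_DEFERRED for the deferring one; with tasks running, both simply complete.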