Re: 2.4.0-test8-pre1 is quite bad / how about integrating Rik's VM now?
Hi!

I've read the discussion about the truncate() problem and tried to understand ;) However, there's something I don't get in your code (typo? bug? misunderstanding on my side?).

Linus wrote:
> There's a really simple way to avoid this: compare the thing you're going to zero out against zero before you memset() it to zero. If it was already zero, you just unlock the page and release.

Your code does:

+	kaddr = (char*)kmap(page);
+	err = 0;
+	if (!mem_is_zero(kaddr+offset, length))
+		goto unmap;
+	memset(kaddr+offset, 0, length);
+	flush_dcache_page(page);
+	__block_commit_write(inode, page, offset, offset+length);
+unmap:
+	kunmap(page);

Which seems to be the opposite of what Linus says: you memset() the page when it's _already_ zero and exit when it's not.

Ben.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
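The check that Linus describes in the message above can be sketched in plain userspace C. This is only an illustration: mem_is_zero() here is a hypothetical stand-in for the helper used in the patch, zero_range_if_needed() is an invented name, and the kmap/flush/commit steps are reduced to a return value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the patch's mem_is_zero(): returns 1 when
 * all `len` bytes at `p` are zero, 0 otherwise. */
static int mem_is_zero(const char *p, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}

/* Zero buf[offset .. offset+length) only when it is not zero already.
 * Returns 1 if the buffer was modified, 0 if the write was skipped. */
static int zero_range_if_needed(char *buf, size_t offset, size_t length)
{
    if (mem_is_zero(buf + offset, length))
        return 0;                    /* already zero: nothing to write back */
    memset(buf + offset, 0, length);
    return 1;                        /* the caller would flush and commit here */
}
```

Compared with the quoted patch, the only difference is the sense of the test: the early exit is taken when the range is already zero, not when it is non-zero.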
Re: who maintains linux 4 the powerpc
> Can somebody please tell me, who is currently maintaining arch/ppc? The link http://www.ppc.kernel.org/ in the MAINTAINERS file is dead.

Cort Dougan ([EMAIL PROTECTED]) and Paul Mackerras ([EMAIL PROTECTED]).

There's also a SourceForge site recently created to gather pending patches and bug reports at www.sourceforge.net/projects/ppclinux

Ben.
Re: [DOC] Debugging early kernel hangs
> Hmm, good idea, but how does this work on, say, non-x86 architectures which don't have a VGA text frame buffer, or whose VGA text frame buffer is not mapped in, or whose VGA text frame buffer is not initialised. You will still end up with those "my kernel hangs during boot" messages.
>
> A lot of the problems with debugging early kernel hangs is that you don't have a display set up, or you don't have enough of the memory subsystem initialised (eg, before pci_init) to be able to access devices (eg, before paging_init).
> [.../...]

I've implemented a similar mechanism on the PPC 2.2.x kernel. It's not in the main tree since it requires a couple of lines of change to printk.c in order to correctly handle the removal of the last console.

Basically, I set up a "struct console" during very early boot (almost at the firmware level) that can display text on screen (using the firmware pre-inited fb) with a very basic engine, and is set up by default as the printk console. Then, in the main VT code, I unregister this boot console just before registering the VT one.

It's a bit hackish and so is not meant to be merged in the main tree, but it's useful when I release test kernels for new Apple hardware, to have printk work from the very beginning of boot.

I wanted to clean it up, but I didn't figure out a way to make this work without slightly hacking printk.c and vt.c (mostly for correctly handling the takeover of the boot console by the VT subsystem). It could have been simpler if I had implemented a struct consw instead of the (simpler) struct console, but the resulting code would have been way too bloated (mostly re-implementing an fb-based console ;)

Ben.
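The boot-console handover described above can be modelled in userspace roughly as follows. Everything here is an assumption made for illustration: the sketch_* names, the two fake back ends and the list handling are invented, and this is not the kernel's struct console or the actual printk.c change.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace model only: a console is a name plus a write hook. */
struct sketch_console {
    const char *name;
    void (*write)(const char *msg);
    struct sketch_console *next;
};

static struct sketch_console *console_list;    /* registered consoles */

static void sketch_register_console(struct sketch_console *con)
{
    con->next = console_list;
    console_list = con;
}

/* Returns 0 on success, -1 if the console was not registered. */
static int sketch_unregister_console(struct sketch_console *con)
{
    struct sketch_console **pp;

    for (pp = &console_list; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == con) {
            *pp = con->next;
            con->next = NULL;
            return 0;
        }
    }
    return -1;
}

/* Hand every message to all currently registered consoles. */
static void sketch_printk(const char *msg)
{
    struct sketch_console *con;

    for (con = console_list; con != NULL; con = con->next)
        con->write(msg);
}

/* Two fake back ends that record the last message they received. */
static char boot_log[64], vt_log[64];
static void boot_write(const char *msg) { strncpy(boot_log, msg, sizeof(boot_log) - 1); }
static void vt_write(const char *msg)   { strncpy(vt_log, msg, sizeof(vt_log) - 1); }

static struct sketch_console boot_console = { "boot", boot_write, NULL };
static struct sketch_console vt_console   = { "vt",   vt_write,   NULL };

/* The takeover step: drop the boot console just before the real one goes live. */
static void sketch_vt_takeover(void)
{
    sketch_unregister_console(&boot_console);
    sketch_register_console(&vt_console);
}
```

In this model the boot console receives messages until sketch_vt_takeover() runs and the VT console receives them afterwards, which is the ordering the message describes.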
Re: [DOC] Debugging early kernel hangs
> 2.4 and 2.2 PPC have progress() for writing progress messages to the screen. They're set up per board very early in the boot so we can see what's going on as soon as the MMU is turned on and lets us get around. Ben, can you just make your changes talk through that? I used to use it with BootX to write out info while setting up the early bootinfo stuff.

The progress() stuff works fine on 2.4 (I've not checked with 2.2.x lately). However, there's still a huge gap between the last progress() message and the availability of the frame buffer device. The simple console has the advantage of outputting existing printk messages (basically, it's a console using prom_printf).

Well, I believe I'll just get rid of this debug console to ease merging of my pile of 2.2.x changes. It turned out that I never had a single crash happen during this time, except when working on new HW, but then I can just add a prom_printf to printk() directly.

Ben.
removal of include/linux/openpic.h
Hi!

Is any arch other than PPC using include/linux/openpic.h?

I'm doing some cleanup work on various parts of the PPC arch, and it's now time for the openpic driver to suffer. That file exports to everybody all the functions and data structures of the driver, which is wrong given the way the driver is evolving (at least on PPC). However, our driver is in arch/ppc/kernel/open_pic.c.

So I'm considering moving the few exported symbols to arch/ppc/kernel/open_pic.h (or include/asm/open_pic.h, but I don't think it's needed there at all) and killing include/linux/openpic.h completely.

Ben.
Re: __bad_udelay in 2.2.18pre15
>> 2.2.18pre15 defines udelay as (in file include/asm-i386/delay.h):
>>
>>   extern void __bad_udelay(void);
>>   #define udelay(n) (__builtin_constant_p(n) ? \
>>           ((n) > 20000 ? __bad_udelay() : __const_udelay((n) * 0x10c6ul)) : \
>>           __udelay(n))
>>   ...
>>
>> It seems __bad_udelay is not defined anywhere in the kernel source.
>
> Correct. It's a compile-time error trap.

Well, at first, I wanted to implement it the same way on PPC. However, it dies on all occurrences where udelay is called with a non-constant expression. I spotted this case in a few PPC-specific places (fixable), but also in the sys_nanosleep code, and in the de4x5 driver.

Ben.
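The trap in the quoted macro can be tried out in userspace with the sketch below. The sketch_* names and the 20000 limit are assumptions for illustration, and there is one deliberate difference from the kernel: here the trap function is defined and aborts at run time, whereas the kernel leaves __bad_udelay() undefined so that an over-long constant delay fails at link time.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_MAX_UDELAY 20000UL    /* assumed limit for this sketch */

static unsigned long const_calls, var_calls;

/* In the kernel this function has no definition at all; the sketch
 * defines it so that the example links and can be run. */
static void sketch_bad_udelay(void) { abort(); }

static void sketch_const_udelay(unsigned long usecs) { (void)usecs; const_calls++; }
static void sketch_var_udelay(unsigned long usecs)   { (void)usecs; var_calls++; }

/* Same shape as the quoted macro: compile-time constants are range
 * checked, everything else goes straight to the run-time variant. */
#define sketch_udelay(n) (__builtin_constant_p(n) ? \
    ((n) > SKETCH_MAX_UDELAY ? sketch_bad_udelay() : sketch_const_udelay(n)) : \
    sketch_var_udelay(n))
```

A non-constant argument never reaches the trap, because __builtin_constant_p() (a GCC built-in, also provided by Clang) evaluates to 0 for it and the whole constant branch is skipped.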
Re: __bad_udelay in 2.2.18pre15
> Well, at first, I wanted to implement it the same way on PPC. However, it dies on all occurrences where udelay is called with a non-constant expression. I spotted this case in a few PPC-specific places (fixable), but also in the sys_nanosleep code, and in the de4x5 driver.

Hrm... looks like I missed the story about __builtin_constant_p(). Is this a gcc-specific built-in feature?

Ben.
Re: __bad_udelay in 2.2.18pre15
>> Well, at first, I wanted to implement it the same way on PPC. However, it dies on all occurrences where udelay is called with a non-constant expression.
>
> __builtin_constant_p means non constant expressions will always call udelay
>
>> I spotted this case in a few PPC-specific places (fixable), but also in the sys_nanosleep code, and in the de4x5 driver.
>
> I'll check these two

Forget about them. It was my misunderstanding of __builtin_constant_p() that was causing me the problem. I fixed a few >20000 udelay's (replacing them with mdelay) in some PPC-specific code. I'll send you some patches later; I have to extract them from my tree.

Well... would you accept a huge pile of PPC patches for 2.2.18? In that case, I can send you my current diffs (with a bit of cleanup). Those contain almost only pmac-specific stuff (support for new machines, sleep fixes, and a few more fixes here and there).

Ben.
Re: __bad_udelay in 2.2.18pre15
> Yep. This is a huge release patch anyway so resynching the stuff is fine. What I won't take is stuff touching core code

I do have a two-line patch to the common ide code that fixes a problem when revalidating a CD-ROM after sleep, but it was ack'ed by Andre Hedrick. I also have a two-liner to kernel/printk.c to allow takeover of my boot console by the real vt subsystem (I found no way for a struct consw to take over a struct console without this patch).

I'll make sure those are separate from the main patch set. I need some time to polish and slice all those patches so you don't get a single diff of hundreds of kb; expect something around this week-end. I'm leaving out some fbdev stuff that may cause problems with other archs.

Ben.
Problem with include/linux/fs.h vs. glibc
Hi!

Sorry if this has already been the cause of a flamewar on the list, but...

I need to compile an app with the 2.4 kernel headers and glibc (our stable glibc on PPC is based on 2.1.3). However, the compiler is barking on a change done in the 2.4 version of include/linux/fs.h:

The 2.2.x version didn't include linux/string.h and all was fine. The 2.4.x version does include linux/string.h and this is causing gcc to bark because of conflicts with glibc headers (glibc seems to #define some of the prototypes defined in linux/string.h, causing various parse errors).

So what is the solution?

 - removing the #include <linux/string.h> from linux/fs.h?
 - moving it into a #ifdef __KERNEL__ part of the file?
 - protecting linux/string.h itself with #ifdef __KERNEL__?
 - fixing glibc? (how? I mean, it's legal to include linux/fs.h from userland, but linux/string.h is obviously not meant to be exported out of the kernel)

Regards,
Ben.
Re: Problem with include/linux/fs.h vs. glibc
>> - I mean, it's legal to include linux/fs.h from userland,
>
> Everybody who thinks so will be severely disappointed.

Ok, so if it's not, then I have to fix that app.

Thanks.

Ben.
Re: dual head r128
Note to linux-kernel readers: this discussion is the Nth attempt to find a solution to handle both legacy IOs and PCI IOs on machines with several IO busses memory mapped at different locations in the CPU space.

> No please, is there anybody bloat-conscious on this damned list? Burying more and more code inside each {in,out}[bwl] is not the solution.

Well, that is pretty small overhead, and probably ridiculous compared to the overhead of the IO itself. Most fast devices use MMIO anyway. The problem is that whatever solution someone proposes, there is _always_ somebody to reject it.

> Just define a macro ISA_PORT or something like this and update the kernel to replace all the in/out to fixed ports to do in/out(ISA_PORT(n)). If you don't do it you'll get a nice panic so you'll find all the places quite fast.

That basically means making different macros for ISA in/out and PCI in/out. I've proposed this several times, but it requires changes to the common code, and all I got when talking about this was flames from x86 people.

> Drawbacks?

Taking the time to make this fit into some x86 people's heads. Also, I need something that can be ported quickly to 2.4. I'm afraid that even if we make everybody agree to it, it will be delayed to 2.5.

Linus: would you accept this change now?

  #define ISA_PORT(n) (n)

and a change to _all_ drivers doing legacy IOs to use that in their in/out macros?

I still prefer making separate macros for legacy IOs (isa_in/isa_out) and for PCI IOs (in/out), or the opposite if you prefer (in/out for ISA and pci_in/pci_out for PCI). On x86, they would resolve to the same thing, while on our platforms, they could be handled differently.

> PCI I/O resources will have to be kernel virtual, physical is impossible with PreP if we want to lift the 2Gb user space restriction (PreP I/O is from 2 to 3 Gb physical and the first thing to do is to reallocate devices which use it since most firmware use it too liberally, like one device every ... 256Mb).
> There are other and better ways to increase user available virtual space, however. And anyway I don't want any stinking add in each in/out macro.

Well, in 2.4 we can easily reassign PCI IOs if we configure the bridge with proper resources. If all goes well, my new PCI code should handle that fine (should be ready this week-end).

> Indeed, this is too awkward (is there no way to redirect only the VGA part of the legacy I/O space? That's what the PCI-PCI bridges do, but I've not yet used a single machine with AGP so I'm ignorant).

No, most bridges used on macs can't do that. In fact, AFAIK, it's not possible to access the ISA memory space either on those machines (on UniN, I can't generate memory cycles at lower address than 0x8000).

My "pet" solution would be to have all legacy drivers request an IO base this way:

  base = isa_get_IO_base(legacy_addr);

The isa_get_IO_base function could then be "tweaked" to recognize known legacy addresses and return different bases. (There might still be problems with VGA vs. parallel, I don't know the x86 world well enough to be sure.)

Ben.
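One way the "tweaked" isa_get_IO_base() idea could look is sketched below as plain C. Only the function name comes from the message; the window table, the base values, the "0 means no window" convention and the helper sketch_legacy_port_address() are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table entry: a range of legacy ports and the base to add
 * for it (standing in for an ioremap() result on some bus). */
struct legacy_io_window {
    unsigned long first_port;
    unsigned long last_port;
    unsigned long io_base;
};

/* First match wins, so the specific VGA range precedes the catch-all. */
static const struct legacy_io_window sketch_windows[] = {
    { 0x3b0, 0x3df,  0xf2000000UL },   /* VGA registers: made-up AGP bus base */
    { 0x000, 0xffff, 0xf0000000UL },   /* everything else: made-up default base */
};

/* Returns the base for a legacy port, or 0 when no window covers it. */
static unsigned long isa_get_IO_base(unsigned long legacy_addr)
{
    size_t i;

    for (i = 0; i < sizeof(sketch_windows) / sizeof(sketch_windows[0]); i++) {
        if (legacy_addr >= sketch_windows[i].first_port &&
            legacy_addr <= sketch_windows[i].last_port)
            return sketch_windows[i].io_base;
    }
    return 0;
}

/* What a converted driver would pass on to its in/out accessors. */
static unsigned long sketch_legacy_port_address(unsigned long legacy_addr)
{
    return isa_get_IO_base(legacy_addr) + legacy_addr;
}
```

A driver that is never converted keeps using the bare port number, which corresponds to the default bus in this model.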
The IO problem, ISA vs. PCI
-- RFC822 Header Follows --
From: Benjamin Herrenschmidt [EMAIL PROTECTED]
To: Gabriel Paubert [EMAIL PROTECTED], Linux/PowerPC Devel List [EMAIL PROTECTED], Linus Torvalds [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: dual head r128
Date: Thu, 12 Oct 2000 18:58:15 +0200
Message-Id: [EMAIL PROTECTED]
In-Reply-To: [EMAIL PROTECTED]
References: [EMAIL PROTECTED]
X-Mailer: CTM PowerMail 3.0.5 http://www.ctmdev.com
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Re: Any dual AGP slot motherboards?
> Apple sells a computer with dual AGP slots. I just was looking for a intel box like this. Since AGP is a port on the PCI bus it is possible to have more than one AGP port on a/each PCI bus but this requires the PCI chipset to support this.

Well, I don't know of such a Mac. To my knowledge, the only Apple boxes with an AGP slot are the ones based on the "Core99" chipset, and they provide one AGP slot.

You won't be lucky with Apple HW anyway, as there are currently issues between the AGP controller and the Linux agpgart driver that prevent using it. Those issues are tricky and I don't think a solution will be available soon. (The Apple chipset can make the AGP aperture visible to the CPU AFAIK.)

Ben.
The IO problem on multiple PCI busses
Here's the return of an old problem for which we really need a solution asap since it's now biting us in real-life configurations...

So the problem happens when you have a machine with more than one PCI host bridge. This is typically the case for all new Apple machines as they have 3 host bridges in one chip (2 of them are relevant: the AGP and the PCI). I don't think the problem exists on x86 machines with real IO cycles; at least in that case, the problem is different.

In order to generate IO cycles, the bridge provides us with a region in CPU physical memory space (a 16Mb region in our case) that translates accesses to IO cycles on the PCI bus. Our implementation of inb/outb currently relies on the kernel ioremap'ing one of these regions (the PCI one) and using the ioremap result as a base (offset) inside the inb/outb functions.

So that means that the current design won't allow access to IOs located on any bus but the one we arbitrarily chose (the PCI bus). That's fine in most cases, until you decide to put a 3dfx or nvidia card in the AGP slot. Those cards require some IO accesses to be done to the legacy VGA addresses, and of course, our inb/outb functions can't do that.

Obviously, we can hack some driver-specific thing that would use the arch-specific code to retrieve the proper io base address for a given host bridge, but that's a hack. I'm looking for a solution that would cleanly apply to all archs that may potentially face this problem. The problem potentially also exists for any PCI card that has PCI IOs on anything but the main PCI bus.

One possibility is to limit our IO space to 64k per bus (to avoid bloating) and then use a hacked ioremap to create a single virtually contiguous kernel region that appends all those IO spaces together. Accessing IOs on bus N would just be a matter of calculating an address of the type 64k*N+offset and doing normal inb/outb on the result.
The arch PCI code could then properly fix up PCI IO resources for PCI drivers, and we could add a function of the kind

  unsigned long pci_bus_io_offset(int busno);

that would return the offset to add to inb/outb when accessing IOs on the N'th PCI bus.

If we want to go a bit further, and allow ISA drivers that don't have a pci_dev structure to work on legacy devices on any bus, we could provide a set of functions of the type

  int isa_get_bus_count();
  unsigned long isa_get_bus_io_offset(int busno);

and eventually

  int isa_bus_to_pci_bus(int isa_busno);
  int pci_bus_to_isa_bus(int pci_busno);

if we want to figure out on which PCI bus a given ISA bus is located, if any (-1 meaning no mapping exists).

Of course, the same problem exists for ISA memory (used by legacy VGA modes). It's not a problem in real life currently since no powermac can produce PCI cycles in the ISA memory range today, and non-powermac PPC machines currently don't have needs for video cards on anything but the main bus, but the potential issue is there, and the need for a solution may pop up too.

I'm, of course, open to any comments about this (in fact, I'd really like some feedback). One thing is that we also need to find a way to pass this information to userland. Currently, we implement an arch-specific syscall that allows retrieving the IO physical base of a given PCI bus. That may be enough, but we may also want something that matches more closely what we do in the kernel.

Regards,
Ben.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
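The 64k*N+offset arithmetic proposed in the message above can be written out as follows. pci_bus_io_offset() is the name suggested in the message; its error convention, the bridge count and the sketch_* helper are assumptions, and a real implementation would add the result to the ioremap'ed base of the contiguous region.

```c
#include <assert.h>

#define SKETCH_IO_SPACE_PER_BUS 0x10000UL          /* 64k of IO space per bus */
#define SKETCH_NO_BUS           ((unsigned long)-1)

/* Assumed number of host bridges whose IO spaces were appended. */
static int sketch_host_bridge_count = 2;

/* Offset of bus N's IO window inside the contiguous region,
 * or SKETCH_NO_BUS when there is no such bus. */
static unsigned long pci_bus_io_offset(int busno)
{
    if (busno < 0 || busno >= sketch_host_bridge_count)
        return SKETCH_NO_BUS;
    return SKETCH_IO_SPACE_PER_BUS * (unsigned long)busno;
}

/* Value a driver would hand to inb/outb for `port` on bus `busno`. */
static unsigned long sketch_io_port(int busno, unsigned long port)
{
    unsigned long offset = pci_bus_io_offset(busno);

    if (offset == SKETCH_NO_BUS || port >= SKETCH_IO_SPACE_PER_BUS)
        return SKETCH_NO_BUS;
    return offset + port;
}
```

With these numbers, legacy VGA port 0x3c0 on bus 1 becomes 0x103c0, while bus 0 keeps the unmodified port number, so unconverted drivers still reach the default bus.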
Re: The IO problem on multiple PCI busses
> If we want to go a bit further, and allow ISA drivers that don't have a pci_dev structure to work on legacy devices on any bus, we could provide a set of functions of the type
>
>   int isa_get_bus_count();
>   unsigned long isa_get_bus_io_offset(int busno);

I would add that I'd prefer to keep it separated from the PCI layer, in the sense that it can also help handle 16-bit ISA-like IO busses on embedded hardware which may (and most of the time will) not have anything like a PCI bus.

Having the ability to map PCI-ISA bus numbers should be an option.

Ben.
Re: The IO problem on multiple PCI busses
> As a side note, Alpha has a special PCI syscall to get the "PCI controller number" a given PCI device is behind. We could add another ioctl number which does the same thing on /proc/bus/pci/*/* nodes. This way sparc64 and Alpha could have the same user visible API for this as well.

And on PPC too, since I adapted the pci controller mechanism to it in 2.4. In fact, all that is done by our various syscalls could be done by ioctl's on /proc/bus/pci/*/*.

To be generic, the pci controller number should rather be the pci bus number of the host bridge (the top of the PCI tree a given device lives on). The internal controller numbers have no real meaning to userland, I think.

Also, an ioctl to retrieve the iobase would be useful too (in addition to the mmap), especially for getting access to VGA IOs associated with a given PCI card, but also for whatever test tool one would want to write in userland that accesses legacy IOs on a given PCI bus. Having the mmap is fine, but I'd like to also have the ability to retrieve all the information via an ioctl.

I believe that if we can agree on the in-kernel format of the PCI controller structure and the function to retrieve it from a bus number, we can make this generic.

For us, the pci controller requires at least an iobase (physical and virtual, as we always ioremap the IO space during boot) for generating io cycles, the config ops, and the mem offset (some platforms don't have a 1:1 mapping of memory cycles vs. CPU bus cycles for PCI memory; for example, on PReP, a write to physical c000 ends up as a PCI memory write at a different bus address). And finally the isa memory base (it may be located differently: some bridges have 1:1 mappings and so allow only high memory addresses to go to the PCI, but do open a "window" at a different physical address to generate ISA memory cycles (low address cycles)).
Finally, we have some private data (a pointer to the OF node, for example) and the resource structures (so that we know what a given host bridge can decode and can allocate unallocated PCI resources properly).

I'm not familiar with the requirements of other archs, however.

Ben.
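The fields listed in this message could be grouped roughly as below. This is a userspace sketch, not the actual PPC structure: the type and field names, the last_busno field used for the lookup and every address value are made up, and the config ops and resource structures are left out to keep it short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-host-bridge description built from the fields named
 * in the message; not the kernel's real layout. */
struct sketch_pci_controller {
    int first_busno;                /* bus number of the host bridge */
    int last_busno;                 /* assumed: last bus below that bridge */
    unsigned long io_base_phys;     /* CPU physical window generating IO cycles */
    unsigned long io_base_virt;     /* the same window after ioremap() */
    unsigned long pci_mem_offset;   /* CPU physical = PCI memory address + offset */
    unsigned long isa_mem_base;     /* window generating low (ISA) memory cycles */
    void *arch_data;                /* private data, e.g. a firmware node */
};

/* Two made-up bridges: one with a memory offset, one mapped 1:1. */
static struct sketch_pci_controller sketch_controllers[] = {
    { 0, 0, 0x80000000UL, 0xf0000000UL, 0xc0000000UL, 0xc0000000UL, NULL },
    { 1, 2, 0x90000000UL, 0xf1000000UL, 0x00000000UL, 0xa0000000UL, NULL },
};

/* Retrieve the controller a bus number lives under, or NULL if none. */
static struct sketch_pci_controller *sketch_pci_bus_to_controller(int busno)
{
    size_t i;

    for (i = 0; i < sizeof(sketch_controllers) / sizeof(sketch_controllers[0]); i++) {
        if (busno >= sketch_controllers[i].first_busno &&
            busno <= sketch_controllers[i].last_busno)
            return &sketch_controllers[i];
    }
    return NULL;
}

/* Translate a PCI memory address into the CPU physical address to use. */
static unsigned long sketch_pci_to_cpu_mem(const struct sketch_pci_controller *ctrl,
                                           unsigned long pci_addr)
{
    return pci_addr + ctrl->pci_mem_offset;
}
```

Looking the controller up by bus number, as suggested in the message, keeps the internal controller index out of any user-visible interface.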
Re: The IO problem on multiple PCI busses
>> I'm, of course, open to any comments about this (in fact, I'd really like some feedback). One thing is that we also need to find a way to pass this information to userland. Currently, we implement an arch-specific syscall that allows retrieving the IO physical base of a given PCI bus. That may be enough, but we may also want something that matches more closely what we do in the kernel.
>
> Same problem on sparc64. Using a special PCI syscall is fine, _if_ we all end up using the same one. However, I would prefer another mechanism...

Right, I remember we discussed this some months ago. Currently, we have a syscall that is slightly different from the sparc/alpha ones but very similar.

> I think a cleaner scheme is to allow mmap() on /proc/bus/pci/${BUS}/${DEVICE} nodes, that is much cleaner and solves transparently any "different word size between userland and kernel" issues (specifically 32-bit userlands executing on 64-bit kernels).
>
> I played around with something akin to this, and some of the necessary Xfree86-4.0.x hackery needed, some time ago. But I never finished this.

I do agree with you on this. I haven't had time to really work on it so far; I remember you posted a test patch but I was busy at that time with other PCI issues we had with multiple bus systems.

Note that this is only the userland side of the story. For now, I'm more concerned about finding a good solution to the kernel side. Also, the problem of finding where the legacy ISA IOs of a given PCI bus are is a bit different from simply mmap'ing a BAR. Some video cards require some access to their VGA IOs without having a BAR covering them; in some cases it's necessary to switch the chip from VGA to MMIO mode.

I've looked at the parisc code (thanks Alan for pointing that out), and it seems they implement all inb/outb as quite big functions that decipher the address, retrieve the bus, and do the proper IO call.
Unfortunately, that's a bit bloated, and I don't think I'll ever get the other PPC maintainers to agree to such a mechanism (everybody seems to be quite concerned with IO speed, including me, I admit). Also, that wouldn't really help the case of legacy drivers or video drivers using legacy addresses for VGA. In all cases, whatever solution we end up having, those will have to be adapted.

What I'd like is a smooth path that allows unchanged drivers to still work with the default bus, while drivers can be adapted with minimum changes (mostly ending up storing an io base and creating a virtual "ISA bus number"). That way, an ISA-like bus (legacy IO bus) can be mapped to either a PCI bus, or whatever. Maybe "ISA" is not the proper word for it; it could be "basic_io_bus" maybe.

Alan also pointed out that there may be similar issues with MMIOs. In fact, as long as we are working with PCI devices, we can easily get things fixed up by munging the resource structures at fixup time. There _is_ however a similar issue with legacy ISA memory, especially since some platforms simply cannot let you access it.

Looking at those in more detail (other archs), it appears that the problem happens on most non-x86 archs and is handled differently for each of them, when it's handled at all.

So what would be the preferred way? Create that fake ISA bus number and provide functions for looking them up, getting their IO and mem bases, and eventually mapping PCI busses to ISA busses? Or does someone have a better idea? The goal is to try not to change the semantics of inb/outb and friends so that most legacy drivers can still work using the "default" IO bus if they are not upgraded to the new scheme.

Thanks for your feedback,

Regards,
Ben.
Re: The IO problem on multiple PCI busses
> I do not want an interface where the user still has to do grotty stuff like mmap() on /dev/{mem,kmem}, this was the core of the problem I had with the syscall idea, don't bring it back.
>
> Make mmap()'s on a PCI--ISA bridge do something special, for example. The user doesn't need to know anything about physical addressing of the machine, it all can and should be abstracted away.
>
> This is why I really detest the XFree86 PCI bus probing layer, it should not need to poke around at so much of the config space information of devices :-( It is the reason why, at least still today in Xfree86 CVS, it simply cannot cope with multiple PCI controllers in a machine because it assumes a flat MEM/IO space. They know about the problem and are working on fixes, but my point is that making this overly knowledgable PCI prober in the first place is what created these problems.

Ok, I see your point and I agree.

There is still the need, in the ioctl we use to "select" what needs to be mapped by the next mmap, to ask for the "legacy IO range of the bus where the card resides" (if it exists, of course). That would be the 0-64k range (or less; actually a couple of pages would probably be enough) that generates IO cycles in the "low" addresses used for VGA registers on the card.

Ben.
Re: IO issues vs. multiple busses
> Here are my comments directly responding to your mail.

Hi!

Thanks for taking the time to respond in detail.

> Large systems have problems with I/O port space and legacy devices. There just isn't enough I/O port space to support large configs and ISA aliasing and all the other crud. That's why Intel is (a) ditching all the legacy crap in IA64 and (b) strongly encouraging people to use MMIO space on PCI.

Right. We need to discourage use of IOs, definitely ;) I now tend to think that we shouldn't care about making a whole architecture to handle those IO problems, but rather the simplest possible thing that would fit our needs (still for in-kernel matters).

> If you only support one type of bridge, you could avoid the indirect function call (which parisc-linux uses) and encode the access method directly in the inb/outb macros.

We do that now, and support IOs on one bridge only. However, some PCI cards still require IO access and we do have several busses, so... The reason why I'm bringing this problem to the public place (again?) is that we are now faced with people who want to put video cards on both the AGP and PCI busses, those cards requiring accesses to some legacy VGA IOs on each of their busses.

> Just note that processor speed is so much faster (and getter faster) than the ISA bus (and PCI-1X bus), that CPU overhead is mostly irrelevant to the cost of accessing IO port space. On older x86 boxes it is relevant.

Right. That's my opinion too. But it's difficult to make everybody agree on it ;) Even the simple mechanism Paul Mackerras did so that IOs to non-existent devices don't kill the kernel (very small overhead) caused some barking ;)

> parisc-linux has solved exactly that problem.

I have to look in more detail. It's my understanding that you use the high bits of the IO address to store the HBA number and then use that to call the proper access functions.
That would solve the PCI IO problem (PCI cards requiring IOs to BAR-mapped regions), but I don't see how it can fix the problem of a card accessing legacy VGA addresses, unless you hand-fixed the video drivers to fill those high bits before doing IOs. If I understand things correctly, that means that each card, instead of accessing the legacy VGA port 0xpp, would instead access 0x00bb00pp (or whatever mangling you use to stuff the HBA number). From the driver point of view, it's exactly the same as passing an "offset" that would be added to the legacy address. So both methods (the one I describe, which would fit well for us, and yours) can end up with the same driver-side API, which is to get an "IO base" for the bus a given card resides on. The question is then to decide if all ISA busses are on a matching PCI bus, in which case a simple unsigned pci_get_bus_io_base(int bus_no) -like function would be enough, or if we want a scheme that supports other ISA-like busses ? We could eventually decide to support only PCI, and additionally declare a fake PCI bus for an ISA bus not matched to a PCI bus, whose config ops would return no device in any slot. Do we agree on this ? I don't believe such a solution exists which is "cleaner" than what parisc-linux does and meets the same objectives. Right now, it's important the install be easy in order to make it easy for people to migrate from HPUX to parisc-linux. :^) Well, from the driver point of view, I think it _does_ exist. Basically, the driver will do inb/outb and friends. Whatever those functions do in reality is arch-dependent. But we agree on the fact that in order for those functions to know on which bus to tap, an additional piece of information must be "cooked" inside the IO address passed to them. That's why I'm proposing this notion of "io base". Additionally, the same problem is true for ISA memory, when it exists, obviously.
I would indeed like to see the same kind of function for ISA memory, pci_get_legacy_mem_base(int bus_no)-like, that is allowed to return a special value informing the driver that this specific machine won't support ISA memory. With those two simple functions, we could at least - fix the fbdev's that need access to VGA regions so that they work on multiple bus systems properly - Have vgacon disable itself when there's no ISA memory (that can be handled by reserving the region and thus preventing request_region from working too, well, but that scheme would also simplify the various more/less hacked macros used on all non-x86 archs to access the VGA memory). - Eventually have vgacon work on "any" bus, possibly by providing a kernel option telling it on which bus to look for a legacy VGA device (and defaulting to whatever VGA device the PCI will find first). This way, vgacon would work properly in most cases without arch-specific hacking. Additionally, I believe it would help making other legacy drivers (if any) work on non-0 busses (I'm thinking about IDE cards using legacy addresses, those do exist), and whatever. The only thing that's annoying me is the fact that we keep tied
Re: The IO problem on multiple PCI busses
No, don't do this, it is evil. Use mappings, specify the device related info somehow when creating the mapping (in the userspace variant you do this by opening a specific device to mmap, in the kernel variant you can encode the bus/dev/etc. info in the device's resource and decode this at ioremap() time, see?). Well, except that drivers doing IOs don't ioremap... Maybe we could define an ioremap-like function for IOs, but the more we discuss this, the more I feel that for in-kernel, a simple function that returns a per-bus io base (and another one for ISA mem) is plenty enough for the few legacy things we have to deal with (mostly VGA). For PCI drivers doing IOs, we just need to have the IO resource structures properly fixed up (already including the correct iobase). That iobase can either be a mix of a real io address and a "cooking" in the high bits like parisc, or it can be an address ioremap'd in the correct bus mapping when it's possible, or whatever... Ben.
Question about IRQ_PENDING/IRQ_REPLAY
Hi Linus ! I've some questions about the behaviour of arch/i386/kernel/irq.c regarding IRQ_PENDING and IRQ_REPLAY. Especially, my question is about the code in enable_irq() which checks for IRQ_PENDING, and then "replays" the interrupt by asking the APIC to issue it again. I don't have a simple way on PPC to cause the interrupt to happen again, as you can imagine this is rather controller-specific. However, looking at the code closely, I couldn't figure out a case where having IRQ_PENDING in enable_irq() makes sense. How can IRQ_PENDING happen to be set on an IRQ_DISABLED interrupt, and why would that matter (why should we take this interrupt) ? AFAIK, IRQ_PENDING can only be set as a result of a call to do_IRQ(). Since we loop when calling the actual handler, I fail to see how we could "miss" an interrupt. If the interrupt is actually disabled, we should not get it at all, and if we did, I don't see why it would matter to resend it when it gets enabled since disabled interrupts are supposed to be ignored (well, they are by most PICs). Obviously, this matters only for an edge interrupt as level ones will stay asserted until the device is happy. I'd be glad if you could take the time to enlighten me about this as I'm trying to make the PPC code as close as possible to the i386 one, according to your comment stating that it would be generic in 2.5, and I don't like having code I don't fully understand ;) Regards, Ben.
Re: The IO problem on multiple PCI busses
I/O is not supposed to be fast, that's what MMIO is for. :) Just do void outb (u8 val, u16 port) { void *addr = ioremap (ISA_IO_BASE + port, 1); if (addr) { writeb (val, addr); iounmap (addr); } } You can map and unmap for each call :) Ugly and slow, but hey, it's I/O... Well, that would really suck ;) And I don't think it would be necessary as we can probably limit each IO bus to 64k without much problem, and have them permanently ioremap'ed. Ben.
Re: Question about IRQ_PENDING/IRQ_REPLAY
In particular, if an edge-triggered interrupt comes in on an x86 IO-APIC while that interrupt is disabled, enabling the interrupt will have caused that irq to get dropped. And if it gets dropped, it will never ever happen again: the interrupt line is now active, and there will never be another edge again. Ok, I see. We have a different issue with the old Apple IRQ controller that can lose interrupts if they are active when re-enabled. We currently rely on a hack to work around this that may re-send interrupts, but that involves hacking into __sti() to check for lost interrupts, which is bad. Basically, even a level interrupt, if active while re-enabled, will not be sent by the pic to the CPU, and so further interrupts will be blocked too. We have some code in enable_irq() that can detect this case, but re-triggering the interrupt is not really simple and requires the __sti() hack for now. I believe we may have a way to re-trigger the interrupt without having to hack __sti() by using a fake timer interrupt. I'll look into this, but in that case, the code can be mostly self-contained in enable_irq, and we will probably not need to play with the IRQ_PENDING/IRQ_REPLAY flags at all. I'd be glad if you could take the time to enlighten me about this as I'm trying to make the PPC code as close as possible to the i386 one, according to your comment stating that it would be generic in 2.5, and I don't like having code I don't fully understand ;) You likely don't have this problem at all. Most sane interrupt controllers are level-triggered, and won't show the problem. And others (like the i8259) will see a disabled-enabled transition as an edge if the interrupt is active (ie they have the edge-detection logic _after_ the disable logic), and again won't have this problem. Well, Apple now uses OpenPICs, but all slightly older macs had a home-made Apple controller that had the above issue :( In fact, it can happen with both edge and level interrupts for us. Regards, Ben.
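To make the dropped-edge case above concrete, here is a minimal user-space model of the idea, not the actual arch/i386/kernel/irq.c code. The flag names come from the thread, but their values, the struct, and every model_* function are assumptions made up for this sketch. The hardware "resend" is modeled as a direct call back into the IRQ entry point.

```c
#include <assert.h>

/* Flag names follow the thread; the values are arbitrary for this model. */
#define IRQ_DISABLED 0x1u
#define IRQ_PENDING  0x2u
#define IRQ_REPLAY   0x4u

struct irq_model {
    unsigned int status;
    int handled;   /* how many times the handler ran */
    int resent;    /* how many times a resend was requested */
};

/* An edge arrives from the (modeled) interrupt controller. */
void model_do_irq(struct irq_model *d)
{
    assert(d != 0);
    d->status &= ~IRQ_REPLAY;
    d->status |= IRQ_PENDING;
    if (d->status & IRQ_DISABLED)
        return;                 /* remembered, but not handled now */
    d->status &= ~IRQ_PENDING;
    d->handled++;               /* stands in for calling the handler */
}

void model_disable_irq(struct irq_model *d)
{
    d->status |= IRQ_DISABLED;
}

void model_enable_irq(struct irq_model *d)
{
    d->status &= ~IRQ_DISABLED;
    /* An edge was seen while disabled and has not been replayed yet. */
    if ((d->status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
        d->status |= IRQ_REPLAY;
        d->resent++;
        model_do_irq(d);        /* the controller re-issuing the irq */
    }
}
```

In this model, an edge that arrives between model_disable_irq() and model_enable_irq() is handled exactly once, at enable time. A controller that cannot re-issue the interrupt would need some other way to reach the same entry point, which is the difficulty described above for the old Apple controller.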
Re: IO issues vs. multiple busses
So once again I vote for the introduction of isa_{request,release}_mem_region(), just like we already have isa_readb() and friends. Well, it's the same problem as the IO, there may be more than one ISA mem region, especially when you put 2 video cards on 2 different PCI hosts (even without a PCI-ISA bridge). In fact, with a PCI-ISA bridge, I can imagine a config where you need 2 ISA IO regions and 2 ISA mem regions on the same PCI bus if that bridge does address translation. My concern for now is mostly to get video cards fixed, I don't care much about legacy ISA hardware as in that case, I guess we can limit ourselves to a single ISA bus and inb/outb being happy to cope with it. The problem is that we use the same macros (inb/outb) to access that ISA bus, and to access any PCI IO bus. Well, I would suggest the following: - inb/outb without offset: the ISA bus if any, or the IO space of the first PCI host - inb/outb with offset (or encoded HBA number): IO space of another bus - pci_get_bus_io_base() returns the IO offset for accessing the Nth PCI bus IO space so that the fb devs can do VGA IOs on the bus that holds their card. - pci_get_bus_isa_mem_base() returns the base address at which isa mem is available for a given PCI bus (that is the address that generates mem cycles in the range 0-64k). This is a physical address, the driver still has to ioremap it. Some PCI cards can have a BAR mapping the VGA memory elsewhere, drivers for those cards should prefer the BAR mapping of course. All IO ranges can be mapped via kernel VM tricks into a single contiguous space with the offset being something like a 64k increment, or we can have the inb/outb do a lookup of the host bus like on parisc. That's an arch implementation detail. Is that ok ? I know it's not perfect, but it would allow us to solve the most important problem for now. The PCI cards in need of IOs (like PCI IDE cards) can have their resources fixed up by the arch code in order to tap the correct bus.
Only the real legacy ISA drivers will be limited to the fixed (default) ISA bus. Ben.
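The "io base" proposal can be sketched in user space as follows. This is only an illustration under stated assumptions: each bus gets a 64k IO window, the windows sit back to back in one array standing in for the permanently ioremap'ed space, and bus 0 is the default (ISA) bus. pci_get_bus_io_base() is the name proposed in the thread, but its body here, the sim_inb/sim_outb helpers, and the constants are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define IO_WINDOW_SIZE 0x10000u  /* assumed: 64k of IO space per bus */
#define MAX_IO_BUSSES  4         /* hypothetical limit for this sketch */

/* Stand-in for the permanently mapped IO windows, one per bus. */
uint8_t io_space[MAX_IO_BUSSES * IO_WINDOW_SIZE];

/* IO offset of a given bus; bus 0 (offset 0) is the default/ISA bus. */
unsigned long pci_get_bus_io_base(int bus_no)
{
    assert(bus_no >= 0 && bus_no < MAX_IO_BUSSES);
    return (unsigned long)bus_no * IO_WINDOW_SIZE;
}

/* Simulated outb: the bus is selected purely by the offset in 'port'. */
void sim_outb(uint8_t val, unsigned long port)
{
    assert(port < sizeof(io_space));
    io_space[port] = val;
}

/* Simulated inb, same addressing rule as sim_outb. */
uint8_t sim_inb(unsigned long port)
{
    assert(port < sizeof(io_space));
    return io_space[port];
}
```

A driver for a card on bus 2 would then do something like sim_outb(val, pci_get_bus_io_base(2) + 0x3c4) to reach the legacy VGA port on its own bus, while a plain sim_outb(val, 0x3c4) keeps hitting the default bus. Whether the offset is a real mapped address or an encoded HBA number stays hidden behind the accessor, as suggested above.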
Re: Question about IRQ_PENDING/IRQ_REPLAY
We do have broken interrupt controllers in this respect. We already have a way of handling it. Ben, take a look at set_lost(). Heh, I know, thanks ;) However, our current scheme implies a hack to __sti() that I'd like to get rid of since it adds an overhead all over the place that could probably be localized if we managed to force an interrupt (using the DEC for example, or using a mac-specific device as this controller only exists on macs anyway). Also, we currently don't use the same mechanism as i386, and since Linus expressed his desire to have irq.c become generic, I'm trying to make sure I fully understand it before merging in PPC the bits that I haven't merged yet. Ben.
Re: Question about IRQ_PENDING/IRQ_REPLAY
More generic in terms of using irq_desc[] and some similar structures I can see. Making do_IRQ() and enable/disable use the same names and structures as x86 isn't sensible. They're different ports, with different design philosophies. I don't believe that the plan is a common irq.c - lets stay away from that. Why ? Except for a few things like irq probing, our irq.c is already very similar to i386 (well, mostly because I merged most of it some time ago) and I don't see why it would be a bad thing. The design of irq.c makes it perfectly adapted to our needs, there's nothing really x86-specific in it, it handles things we need to be handled correctly, does nothing more than what we need (well it does, but those parts, mostly the irq locking, got already removed), etc... I did that merge in the first place because I wanted the depth support in enable/disable_irq, and more fine-grained spinlocking. I really see nothing wrong in the way irq.c works, I really think that except the small added bit we have in our do_IRQ() to call ppc_md.get_irq(), it's perfectly adapted to our needs. Remember that it allowed us to remove the (mostly useless) post_irq() thing we had ? It also allows proper implementation of irq distribution even with controllers that could trigger the same IRQ on several CPUs, re-entrancy in the handler if we do early-eoi without masking an edge interrupt is also handled properly, enable/disable from within the handler too, all sorts of things our previous code didn't do right. The only thing I added to the core irq.c code is that IRQ_PERCPU flag that prevents IRQ_INPROGRESS from being set. It's a bit hackish but allows our IPIs to use a single desc for all CPUs without being mutually exclusive. Ben.
Re: Question about IRQ_PENDING/IRQ_REPLAY
handled correctly, does nothing more than what we need (well it does, but those parts, mostly the irq locking, got already removed), etc... Sorry, I meant mostly the irq probing Ben.
Re: Question about IRQ_PENDING/IRQ_REPLAY
And I seriously doubt that PPC SMP irq handling has gotten _nearly_ the amount of testing and hard work that the x86 counterpart has. Things like support for CPU affinity, per-irq spinlocks, etc etc. Some of those are the reason I moved part of the x86 irq.c code to PPC indeed. Now, I'm not saying that irq.c would necessarily work as-is. It probably doesn't support all the things that other architectures might need (but with three completely different irq controllers on just standard PCs alone, I bet it supports most of it), and I know ia64 wants to extend it to be more spread out over different CPU's, but most of the high-level stuff probably _can_ and should be fairly common. And I think they are. One thing is that if made "common", do_IRQ has to be split into an arch-specific function that retrieves the irq number (and does the ack on some controllers), and the actual "dispatch" function that does all the flags game and calls the handler. I've slightly extended it using the IRQ_PERCPU flag to prevent IRQ_INPROGRESS from ever being set (a bit hackish but I wanted that for IPIs since they use ordinary irq_desc structures for us in most cases). Ben.
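The split described above can be sketched as a small user-space model: an arch hook that only reports which irq fired, and a generic dispatch step that plays the flags game and calls the handler. This is an illustration, not the kernel's irq.c. The flag names come from the thread; the struct, the table, the arch_get_irq hook, the fake controller, and all function names are hypothetical, and locking is left out entirely.

```c
#include <assert.h>

#define NR_IRQS        8
#define IRQ_INPROGRESS 0x1u   /* flag names from the thread, values arbitrary */
#define IRQ_PERCPU     0x2u

struct irq_desc_model {
    unsigned int status;
    void (*handler)(int irq);
};

struct irq_desc_model irq_table[NR_IRQS];
int calls[NR_IRQS];               /* per-irq handler call counters */

void count_irq(int irq) { calls[irq]++; }

/* Generic part: flag handling and handler call, no controller knowledge. */
int dispatch_irq(int irq)
{
    struct irq_desc_model *d;

    assert(irq >= 0 && irq < NR_IRQS);
    d = &irq_table[irq];
    if (!d->handler)
        return 0;
    if (!(d->status & IRQ_PERCPU)) {
        if (d->status & IRQ_INPROGRESS)
            return 0;             /* already being handled elsewhere */
        d->status |= IRQ_INPROGRESS;
    }
    d->handler(irq);
    d->status &= ~IRQ_INPROGRESS;
    return 1;
}

/* Arch part: a hook that returns the irq number, or -1 if none. */
int (*arch_get_irq)(void);

/* Fake controller for the sketch: reports one irq, then nothing. */
int fake_next_irq = -1;
int fake_get_irq(void)
{
    int irq = fake_next_irq;
    fake_next_irq = -1;
    return irq;
}

int model_do_IRQ(void)
{
    int irq = arch_get_irq ? arch_get_irq() : -1;

    if (irq < 0)
        return 0;
    return dispatch_irq(irq);
}
```

With IRQ_PERCPU set on a descriptor, the IRQ_INPROGRESS exclusion is skipped, which is the behaviour wanted for IPIs sharing one descriptor across CPUs. In this single-threaded model that only shows up as the handler still running when IRQ_INPROGRESS would otherwise have blocked it.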
Re: Question about IRQ_PENDING/IRQ_REPLAY
We have about 12 interrupt controllers we end up using on PPC. I'm suspicious of any effort to base Linux/PPC generic interrupt control code paths on a software architecture that's been tested with 3. More to the point, we get ASIC's that roll in a standard interrupt controller and add some "improvements" at the same time. Well, I personally don't see what would be a problem... Of course, the current i386 irq.c cannot be re-used completely "as is". The bit of code that gets the actual irq number has to be arch specific. But most of the locking issues are completely platform neutral. I personally see that code as a good framework that provides many features that may or may not be necessary depending on the level of brokenness of a given interrupt controller. As for SMP, I'm sure x86 has seen a lot more testing. I'm not going to sacrifice time-tested stability so we can look just like x86 and get clean SMP locking. We've lost stability already because of some PPC folks' excitement at getting us to behave like x86 in irq.c. We lost stability ? Hrm... If we ever had a problem with SMP, it was in the openpic code, and apparently, due to a HW bug. I don't think the new irq.c code in itself caused us to lose stability. I actually do think it improved the locking, and so, stability. As for a generic irq.c, as a guiding light, I'm all for it. It'll certainly help work with RTLinux. It'll also help new architectures by giving them a snap-together port construction kit. I'm still not going to sacrifice stability in the short-term for this nice feature in the long-run. I'm pretty sure we agree on this. Well, we have been running this new irq.c which I partially based on i386 for some months now, and had enough time to iron out most problems. Again, all the stability problems we had so far were related to the openpic implementation, I don't remember seeing one stability problem reported so far that was related to irq.c.
And I've been running a couple of dual G4s without much trouble for some time now. We do (did ?) have a problem with irq distribution on SMP with openpic. I'm not sure we yet know exactly why, according to both you and IBM people, we are running into a HW bug of the openpic core. I see nothing in irq.c that can cause this. On the other hand, the new irq.c brings the irq depth handling, the ability to call enable/disable from within the handler (I've been wanting that for some time for the PMU driver), proper spinlock'ing, etc... And last, but not least, consistent semantics of enable/disable irq exposed to drivers (especially things like disable_irq() actually waiting for that irq to be completed on any other CPU).
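The "irq depth handling" mentioned above boils down to counting nested disables so that only the outermost enable really unmasks the line. A minimal user-space sketch of that counting, assuming a single irq and no locking, could look like this. The struct and function names are made up for the illustration, and the waiting-for-other-CPUs part of disable_irq() is not modeled.

```c
#include <assert.h>

struct irq_depth_model {
    unsigned int depth;  /* number of outstanding disable calls */
    int masked;          /* 1 while masked at the (modeled) controller */
};

/* The first disable masks the line; nested ones only bump the counter. */
void depth_disable_irq(struct irq_depth_model *d)
{
    if (d->depth++ == 0)
        d->masked = 1;
}

/* Only the enable matching the first disable unmasks the line. */
void depth_enable_irq(struct irq_depth_model *d)
{
    assert(d->depth > 0);  /* an unbalanced enable is a caller bug */
    if (--d->depth == 0)
        d->masked = 0;
}
```

This is what lets a driver call disable/enable from within its own handler, or from a helper that is itself called with the irq already disabled, without accidentally unmasking the line too early.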
[RFC] fbdev power management
I'm working on improving some aspects of Power Management on the PowerBooks, and among other things, I have a problem with fbdevs. Currently, each fbdev registers a power management callback to sleep/wakeup the device. We handle HW related things (shutting the backlight off, putting the chip to sleep when possible, backing up the frame buffer content, etc...) from there. We do call the video sleep last during the sleep process, and wake it up first, to avoid any problem if something is being printed to the console while the chip is suspended. However, this is not very safe. First, there's the cursor timer, which can screw us up. I have a hack in my tree where the fbdev driver calls a new routine in fbcon.c that stops/starts the cursor timer. But I'm looking toward a more generic solution. By having a way to "suspend" the entire fbcon, maybe we can have all console output blocked or buffered until the fbcon is woken up. Also, a question is should we call that fbcon_suspend()/fbcon_resume() (currently only the cursor timer stuff) from the fbdev's or should the fbcon itself register as a power management client, and then call fbdev's suspend/resume routines ? I prefer the second solution as the fbdev's are often PCI devices (and so already have the ability of having PCI suspend/resume hooks). Another solution would be to have each fbdev have its own suspend/resume hook (and maintain a "suspend" state that would tell fbcon to stop calling them or start working on a memory based backup image), and separately, fbcon's own suspend/resume (for the cursor, as it's not head-dependent but rather global to all fbdev's). Any comment ? Ben.
Re: [Linux-fbdev-devel] [RFC] fbdev power management
I think registering fbcon as a PM client and doing the above when the fbdev suspend/resume hooks are called should work. A memory backup is worked on until the resume is run and the backup is restored to the display. So the fbdev drivers would register PM with fbcon, not PCI, correct? Either that, or the fbdev would register with PCI (or whatever), _and_ fbcon would too independently. In that scenario, fbcon would only handle things like disabling the cursor timer, while fbdev's would handle HW issues. The only problem is for fbcon to know that a given fbdev is asleep, this could be an exported per-fbdev flag, an error code, or whatever. In this case, fbcon can either buffer text input, or fall back to the cfb working on the backed up fb image (that last thing can be handled entirely within the fbdev I guess). Ben.
Re: [Linux-fbdev-devel] [RFC] fbdev power management
Now for fbcon it's simpler. Things get written to the shadow buffer (vc_screenbuf). When the console gets woken up update_screen is called. While powered down the shadow buffer can be written to, which is much faster than saving an image of the framebuffer. Of course if you still want to do this, such as in the case of the X server, then copy the image of the framebuffer to regular ram. Then power down /dev/fb using some ioctl calls provided. Ok, I see. Currently, the sleep process is started from an ioctl sent to another driver, which will in turn call various notifier functions to shut down bits of hardware and finally put the machine to sleep. It's not a direct ioctl to the /dev/fb (which may not be opened). One problem I have is that my fbdev sleep routine will restore the mode on wakeup, but that of course doesn't work with X when not using useFBDev as the fbdev has no knowledge of the current mode or register settings used by X. I'm wondering if it would be possible to make X think there's a console switch (without actually switching to an active console, as we don't know if we even have one of those available for us), wait for it to reply, and then start the sleep process. One other possibility would be to implement APM-like events, I still have to study those in more detail as our sleep process is currently quite different from APM (and definitely not BIOS-based). For now, I have my hooks in fbcon that suspend/restart the cursor timer, that's enough to make sleep stable on 2.4 since we take care of shutting down the display very last (after any other driver) to make sure no printk will end up trying to display something while the chip is powered down. I'll digest your various comments and look into all this in more depth with the 2.5 console codebase. I believe some solution must be found for x86 laptops too. Ben.
RE: kernel_thread vs. zombie
Have a look at: http://www.scs.ch/~frey/linux/kernelthreads.html I have an example there that starts and stops kernel threads from init_module and never produced a zombie. I use the same code also to start threads from ioctl and it works for me. I tested it on UP and SMP, Intel and Alpha, 2.2.18 and 2.4.2. Thanks ! Could you explain to me a bit why you need the lock_kernel ? My probe thread is already protected by some atomic ops, but I'm considering changing them to semaphores. Is there any need for the bkl to be taken when calling daemonize or is this just for your own synchronisation needs ? I don't think you do more than what I currently do to prevent the zombie (except for the daemonize call, I don't see you changing anything about the parent thread or whatever). At first I thought daemonize() would do the trick, but I still see zombies in my tests. I'm running UP now so I don't think my lack of lock_kernel() could explain it. Ben.
RE: kernel_thread vs. zombie
The stuff done in daemonize() and the exit_files could need the kernel lock. At least on some 2.2.x version it does, I did not check whether it is still needed on 2.4. Well, I don't really plan to backport this to 2.2.x. I'll try to see if my problem is related to the lack of kernel lock, or maybe I just have something else wrong. On stop of the thread I need the big kernel lock to make sure the kernel thread exited (everything really done from my up() till the thread is in zombie state) before I unload the module. The comment in the code should explain it. Ok. I don't need that as I'm not in a module, no chance I ever get unloaded. At least not in 2.4. Making ADB and all the controllers and device drivers into modules would be an interesting exercise with module dependencies ;) Note that the threads themselves do not run with the kernel lock held. After setting everything up they do an unlock. Ok. Well, I just have an atomic flag that is test-and-set before starting the bus reset, and released at the end of the thread. No need to make sure the previous one is really dead before starting a new one. I could benefit from semaphores when starting it since if it's already running, I just loop scheduling waiting for the lock bit to be available. But that case will almost never happen in real life. ADB probes are quite rare. Many thanks for your help, I'll see what's wrong in my code ;) ben.
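The "atomic flag test-and-set before starting, released at the end" scheme can be illustrated in portable C11, as a user-space stand-in for the kernel's atomic bit operations. Everything here is hypothetical: the names, the counter, and the fact that the caller simply gets 0 back instead of looping and rescheduling until the flag is free.

```c
#include <assert.h>
#include <stdatomic.h>

atomic_flag probe_busy = ATOMIC_FLAG_INIT;
int probe_active;   /* bookkeeping for the sketch, only touched by the owner */
int probes_run;     /* number of completed probes */

/* Returns 1 if the caller now owns the probe, 0 if one is already running. */
int try_start_probe(void)
{
    if (atomic_flag_test_and_set(&probe_busy))
        return 0;
    probe_active = 1;
    return 1;
}

/* Called at the very end of the probe thread to release the flag. */
void finish_probe(void)
{
    assert(probe_active == 1);  /* must only be called by the owner */
    probe_active = 0;
    probes_run++;
    atomic_flag_clear(&probe_busy);
}
```

A caller that gets 0 back can either give up or retry later. Replacing the flag with a semaphore, as considered above, would let such a caller sleep instead of polling.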
Re: PCI bridge setup weirdness
No, pci_read_bridge_bases() is obsoleted by new pci setup code. ;-) You have to set up bus resources properly in pcibios_fixup_bus(). For a single root bus configuration, you don't need to do anything with the root bus itself - its resources already point to ioport_resource and iomem_resource, which should be ok. For pci-pci bridges you have to add something like this: The problem I have (and this is why I don't set up host resources properly on multi-host PPCs yet) is that some hosts can have several non-contiguous ranges (especially with memory, IO is usually a single contiguous range). There are simply not enough resource "slots" in the current structures to handle all possible cases. They basically have a host bridge register in which each low bit enables decoding of a 256Mb region in the range 0xn000 and each high bit enables decoding of a 16Mb region in the range 0xFn00 The typical setup is to have one (or more) 256Mb regions, and one 16Mb region, but that can change from model to model. Ben
Fwd: kernel oops with rm in hfs - hit BUG() in line 236 of dcache.h
Begin Forwarded Message Subject: kernel oops with rm in hfs - hit BUG() in line 236 of dcache.h Date Sent: Sunday, December 10, 2000 12:56 AM From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] CC: [EMAIL PROTECTED], [EMAIL PROTECTED] PowerCenter Pro 210mhz 604e, 224MB RAM, Linux 2.4-pre11 (rsync from Paul 12/8) I was removing multiple files from my hfs drive, when I hit the BUG() at line 236 in /usr/src/linux/include/linux/dcache.h: static __inline__ struct dentry * dget(struct dentry *dentry) { if (dentry) { if (!atomic_read(&dentry->d_count)) BUG(); atomic_inc(&dentry->d_count); } return dentry; } Dec 9 18:09:21 like kernel: kernel BUG at /usr/src/linux/include/linux/dcache.h:236! Dec 9 18:09:21 like kernel: Oops: Exception in kernel mode, sig: 7 Dec 9 18:09:21 like kernel: NIP: C00712FC XER: LR: C00712FC SP: C1087DB0 REGS: c1087d00 TRAP: 0700 Dec 9 18:09:21 like kernel: MSR: 00089032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 Dec 9 18:09:21 like kernel: TASK = c1086000[12310] 'rm' Last syscall: 10 Dec 9 18:09:21 like kernel: last math c1086000 last altivec Dec 9 18:09:21 like kernel: GPR00: C00712FC C1087DB0 C1086000 0039 1032 0001 C021 Dec 9 18:09:21 like kernel: GPR08: C01B 001F C1087CF0 22822842 1001ECE8 100302E8 1003 Dec 9 18:09:21 like kernel: GPR16: 1 003 1003 1003 1003 C96D7C20 C021 Dec 9 18:09:21 like kernel: GPR24: C291D62C C018 C018 C291D600 C4193C60 C291EE40 C291D628 C9947520 Dec 9 18:09:21 like kernel: Call backtrace: Dec 9 18:09:21 like kernel: C00712FC C0047854 C00479A8 C00048D8 10001D8C 100031D0 10001358 Dec 9 18:09:21 like kernel: 0FF0B734 From the System.map: c007122c T hfs_unlink c00476d8 T vfs_unlink c00478c0 T sys_unlink c00048d8 T ret_from_syscall_1 Thanks, Peter ** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/ - End Forwarded Message -
Re: 2.2.X patches for fbcon
--- atyfb.c Mon Dec 11 14:28:19 2000 +++ atyfb.c.orig Wed Oct 4 22:22:28 2000 @@ -2796,7 +2796,7 @@ * works on iMacs as well as the G3 powerbooks. - paulus */ if (default_vmode == VMODE_CHOOSE) { - if ((Gx == LG_CHIP_ID)||(Gx == LI_CHIP_ID)||(Gx == LP_CHIP_ID)) + if (Gx == LG_CHIP_ID) /* G3 PowerBook with 1024x768 LCD */ default_vmode = VMODE_1024_768_60; That one is wrong. The machine type must be probed differently. Also, some wallstreets have a different screen (passive matrix) which is 800x600. I'm trying to find a way to probe for it and will come up with a patch for this. In the meantime, passing the vmode is the correct solution. Ben.
aic7xxx.c vs. Adaptec 29160N
I have a 29160N card in a PowerMac G4. It used to work fine with an old UW SCSI disk I had there. Today, I swapped this drive for a real Ultra160 one, and now, the kernel won't boot. It's giving me an endless stream of SCSI reset timeouts on bus 0. Any clue ? I don't really need this disk in Linux (at least not yet), but I don't want to plug/unplug the disk each time I boot linux or MacOS either... The disk is a Quantum ATLAS_V__9_WLS rev. 0230 Anything I can do to help tracking the problem ? It's difficult to get the actual output of the driver in verbose mode as it is scrolling quite fast and I have nothing like a serial console on this box. The kernel won't boot without noprobe so I can't dump dmesg output. Ben.
Re: aic7xxx.c vs. Adaptec 29160N
Anything I can do to help tracking the problem ? It's difficult to get the actual output of the driver in verbose mode as it is scrolling quite fast and I have nothing like a serial console on this box. The kernel won't boot without noprobe so I can't dump dmesg output. I was wrong: even no_probe won't help. I have to physically disconnect the drive to get the kernel to boot. Ben.
Re: FIXED! Updated 2.4 TODO List -- new addition WAS(test9 PCIresourcecollisions (fwd)
Yes, it will break on any machine with multiple primary PCI busses, because the registers assigning bus number ranges to primary busses are chipset specific. In 2.5, I'd like to rewrite the resource + bus number assignment code to be able to re-layout the busses and resources even on i386 if it detects it's safe to do so (that is there is either only one primary bus or the host bridge is known). This will be needed for proper PCI hotplug and it will also help us to get rid of some more BIOS bugs (especially on some embedded systems). The current 2.4 code with assign_all_busses enabled works nicely on the Apple "UniNorth" 3-host mechanism. There's apparently no register to configure to set the primary bus number (the bridge doesn't care what its bus number is, it just has different mechanisms to issue type 0 and type 1 config cycles). We also have something that is not directly supported in the 2.4 PCI code, but that I implemented via fixups and pcibios_update_resource(), which is to have an offset applied to all MMIO resources. Some PPC machines don't have a 1:1 mapping of PCI memory space vs. CPU physical space, and so we must add this offset to all memory resources and subtract it in pcibios_update_resource(). I still have not implemented per-host parent resources (we currently have one parent resource for all 3 hosts); I'm still not completely sure of the best way to implement it (the bus resource management still has a few obscure zones to me). The main problem with PCI we are facing on all our platforms is related to IOs. There are way too many assumptions in the kernel based on the x86 fact that PCI IOs are more or less equivalent to ISA IOs, limited to one 64k space, and so on. We have several cases of machines with several IO busses, each one having its own IO address space 0-xxx mapped differently (elsewhere) in the CPU physical memory space (those CPUs don't have specific in/out instructions), and which can be more than 64k long. 
Also, not all host bridges can do remapping of legacy ISA addresses. That leads to several issues: If we want to support "normal" PCI IOs (devices that expose registers as IO ranges only, but that also fully support PCI 32 bits IO space) on all of these busses, and still be "compatible" with drivers that do in/out functions on legacy (64k) addresses and expect to reach either an ISA bus or legacy devices on the PCI bus, we need to do all sorts of hacking and we have not yet figured out a solution that would make everybody happy. We can put the real physical address used to generate the proper IO cycle in the PCI device resource structure and have in/out just do the same as readb/writeb. This handles PCI IOs properly on all busses, but breaks legacy crap. We can (and that's what we do today) decide that only one bus supports IOs and have a "global" IO_BASE which is added to all in/out accesses, and which is the ioremap'ed IO space of the single bus we decide supports IOs. But that means that we can't access both the VGA registers of a card in the AGP slot and the PCI IO space of another card in the main PCI slots (different busses and different IO spaces). We can use MMU tricks to "append" together all IO spaces, one of them being considered as the primary and being mapped at the bottom of this virtual IO space (for legacy in/out) and all others being appended to this one (with proper fixup of PCI resources). This was discussed on the linuxppc-dev list a lot, but not implemented yet. My personal point of view would be to either completely separate the ISA and PCI IO macros, or have a mechanism for all legacy (VGA, ISA, ...) drivers to ask for a base address for the "legacy" address they intend to use (get_legacy_base(VGA_LEGACY), for example, would return the IO base to apply to all in/out macros used to access the IO space). 
That's still not perfect since we can't support two VGA cards on separate busses (which would be theoretically possible on a Mac: one in the AGP slot, one in a PCI slot, both having different IO spaces). So I'm still open to suggestions, but I'd really like to see this problem addressed for 2.5 in a "generic" way. Currently, it's more or less choosing between supporting legacy devices on one bus and no real PCI IOs on any other bus but the first one, or supporting real PCI IOs on all busses, but no legacy IOs. Note that we don't have such a problem for MMIOs fortunately ;) Ben. -- RFC822 Header Follows -- From: Benjamin Herrenschmidt [EMAIL PROTECTED] To: Martin Mares [EMAIL PROTECTED], [EMAIL PROTECTED] Subject: Re: FIXED! Updated 2.4 TODO List -- new addition WAS(test9 PCI resourcecollisions (fwd) Date: Thu, 26 Oct 2000 14:35:42 +0200 Message-Id: [EMAIL PROTECTED] In-Reply-To: [EMAIL PROTECTED] References: [EMAIL PROTECTED] X-Mailer: CTM PowerMail 3.0.5 http://www.ctmdev.com MIME-Version: 1.0 Cont
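The get_legacy_base(VGA_LEGACY) idea from the message above could look roughly like the sketch below. Only the names get_legacy_base and VGA_LEGACY come from the message; the enum, the table, set_legacy_base() and legacy_port_addr() are assumptions made up for illustration, and the code is plain userspace C rather than kernel code.

```c
#include <stddef.h>

/* Hypothetical identifiers for the legacy regions a driver may ask for. */
enum legacy_region { VGA_LEGACY, ISA_LEGACY, LEGACY_REGION_COUNT };

/*
 * Assumed to be filled in by arch code at boot: the (already ioremap'ed)
 * base of the IO space serving each legacy region; 0 means "not available
 * on this machine".
 */
static unsigned long legacy_io_base[LEGACY_REGION_COUNT];

static void set_legacy_base(enum legacy_region r, unsigned long base)
{
    if ((unsigned int)r < LEGACY_REGION_COUNT)
        legacy_io_base[r] = base;
}

/* Base to add to legacy port numbers, or 0 if the region is unavailable. */
static unsigned long get_legacy_base(enum legacy_region r)
{
    if ((unsigned int)r >= LEGACY_REGION_COUNT)
        return 0;
    return legacy_io_base[r];
}

/* Address a legacy driver would hand to its in/out accessors. */
static unsigned long legacy_port_addr(enum legacy_region r, unsigned int port)
{
    unsigned long base = get_legacy_base(r);

    return base ? base + port : 0;
}
```

Because the table holds a single base per region type, this sketch has exactly the limitation the message admits to: two VGA cards on separate busses cannot both be reached.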
Re: B/W G3 - big IDE problems with 2.4.0-test10
On Wed, 8 Nov 2000, Andre Hedrick wrote: What is your chipset, CMD646 rev 5 Ultra DMA 33 ??? Yep. I've tried building with the CMD64x driver, and that didn't help matters, if you were wondering. Any thoughts? Did you try the BitKeeper PPC kernel (or Paul Mackerras' rsync tree)? Not all PPC patches have been merged into Linus' tree yet. There were some resource assignment issues that were fixed only recently. Ben.
About IOs, ISA, PCI, and life (WAS: VGA PCI IO port...)
One way to do this is to treat PCI IO and ISA IO as two separate address spaces. The PCI IO address space is a 14-bit address space (bits 9:8 are always zero) ranging from 0x1000 to 0xFCFF. ISA IO is a 10-bit space (bits 15:10 are available for the card to use) ranging from 0x100 to 0x3FF. VGA cards may be PCI and AGP, but still have allocations in the ISA range. I'd love to see PCI and ISA IOs treated differently too. I'm seeing more and more esoteric PCI setups (especially on some huge PPCs with several host bridges), various different ways to access PCI mem and ISA, various ways to handle the ISA "special case" for memory, etc... When you have to deal with various separate PCI IO spaces, each one having its own address space, and each one potentially having devices that want to do IOs or "legacy" ISA IOs, then you are screwed. Currently, we don't really support IOs on anything but the "primary" PCI bus (chosen arbitrarily) unless platform-specific driver hacking is done. We could use the MMU mappings to let the kernel think all those IO spaces are actually one big contiguous region, and remap them all together. This way, a simple resource fixup would at least make PCI drivers using IO resources work. But in this case, "ISA" IOs will have to be restricted to one of the IO busses, decided arbitrarily. But what about 2 video cards, one in the AGP port and one in a PCI slot of a G4 Mac? This machine, just as an example, has those on different host controllers with separate IO spaces. If those cards need to be driven with VGA accesses (for running a BIOS emulator for example, or just because you have no choice), then you are screwed. All you can do is have one bus support VGA IOs. Another issue is ISA memory space, for the same reason as above (multiple busses), but also because a lot of PCI controller setups can't forward memory cycles below 0x8000 or some such arbitrary physical address. 
Some of them (most but not all) provide a way, via a separate physical address, to access a 64k "ISA" memory space that generates low-address PCI cycles. So you can have one or more ISA IO busses, and 0 or more ISA memory busses. A solution for that would be to have VGA and other legacy ISA drivers in the kernel change the way they use the IO access macros. One idea I have would be to keep a virtual ISA "bus number" along with kernel support functions to count them, get virtual base addresses for IO memory, query about availability of those, etc... Another would be to link that more tightly with PCI by adding generic functions to request the virtual base address of each PCI IO and ISA-memory space. We already have a syscall on some platforms (PPC, Alpha) to request that information from userland (XFree). I'm not sure yet about the best way to fit that into the resource architecture. I have different problems with PCI resources for now (mostly with host bridges that provide several discontiguous decoding ranges)... Ben.
Re: PCI power management
Hi ! Glad to see things moving around Power Management ;) This was originally a private reply to Patrick Mochel, but the e-mail kept getting longer and longer :) Note: we have set up a list for PM issues: http://lists.sourceforge.net/lists/listinfo/linux-pm-devel Not very much used yet, but I, at least, plan to spam it with all sorts of things we need for PowerBook PM... I'm forwarding your message there and I suggest we continue that discussion there as well. The current state of PCI PM is this: pci_enable_device (1) enables IO and mem decoding, (2) assigns/routes the PCI IRQ, and (3) brings the device to D0 using pci_set_power_state. Linus believes the power state transition should occur before (1) and (2), and I agree. pci_set_power_state brings a device to a new D state. If the D state transition is D3->D0, then we (1) save key PCI config registers, (2) go to D0, and (3) restore saved PCI config registers. This originally comes from Donald Becker's acpi_wake function, which is used only for the case of device enabling (where he had no problems), not for the case of returning-from-suspend (where we see problems). I believe the current scheme is not enough. Here are some of my own thoughts about this: - Some devices won't properly give you their config space when in D3 state. You shouldn't save the configuration when in D3 to restore it after switching to D0; you must have saved it before originally putting the device into D3 state. - There need to be some arch "hooks" in this mechanism. Some machines have the ability (from the arch specific code, by tweaking ASIC bits) to remove clock and/or power from selected devices. That means power management can be done even with devices not supporting PCI PM, provided that the driver can recover them from a PowerOn reset. - Some devices just can't be brought back to life from D3 state without a PCI reset (ATI Rage M3 for example) and that requires some arch specific support (when it's possible at all). 
- The current scheme provides no way for the kernel to "know" if a driver can handle recovering the device from a PowerOn reset. Some drivers can, some can't (the video drivers usually can't as they require the board's PLL to be properly set up by the BIOS). Some advanced PM modes we use on pmacs will cause the motherboard ASIC to turn off power to PCI and AGP cards when putting the machine to sleep. We need a way to prevent/allow this "deep sleep" mode depending on what the card supports. - Ordering of power management may matter. On PowerBooks, we run through all notifiers first with a "sleep request" message. None of the drivers will actually put anything to sleep at this point, but they will allocate all the memory they might need for doing so (saving state, saving a framebuffer in some cases, etc...). Once all devices have accepted the request (they can refuse it), I then send a "sleep now" message. This way, I can make sure all memory allocations have been performed and disks properly sync'ed before putting the swap devices to sleep and such things. - On SMP, we need some way to stop other CPUs in the scheduler while running the last round of sleep (putting devices to sleep), at least until all IO layers in Linux can properly handle blocking of IO queues while the device sleeps. - We need a generic (not x86 APM or ACPI dependent) way of including in the loop the userland processes that request it. Some userland processes that bang hardware directly (X, but not only X) need to be properly suspended (and the kernel has to wait for an ack from them before continuing with putting devices to sleep). "apm -s" causes the apm driver to map all suspends to the ACPI D3 state. An apm suspend triggers a pm_send_all call, which in turn triggers pci_pm_suspend. This code [from Linus iirc] walks the root buses, recursively suspending downstream buses and then attached devices. The resume code does the exact opposite. 
The PCI core suspend/resume code has this comment, and we note the current requirement that -all- drivers should export suspend/resume somehow, in order for a sane PM system to work here. Yup. They should also be able to return an error (fail or just limit to a higher level like D2). They should also be able to tell the kernel if they support recovering from a power down. It is up to the drivers to implement ::suspend() and ::resume(), and few do. The few that do, even fewer work well in practice. I would have preferred that a PM node be created for each PCI node and have the PM nodes organised as a tree structure. That way, arch fixup hooks can re-arrange the tree as the PCI bus-child dependency may not be true. On some portables, some ASICs located on the PCI bus are not dependent on their parent host bridge power plane. That's the current state of things. I do not think the system -- at the PCI core level -- is poorly designed. I think it just takes a lot of grunt work with drivers at this point, plus maybe a few new pci helper functions.
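The two-pass PowerBook scheme described in the message above ("sleep request" first, "sleep now" only once everyone has accepted) can be sketched as follows. This is a userspace illustration under stated assumptions: the struct, the callback signature and the SLEEP_REJECT rollback message are hypothetical and are not the actual PPC sleep-notifier interface.

```c
#include <stddef.h>

enum sleep_msg { SLEEP_REQUEST, SLEEP_NOW, SLEEP_REJECT };

/* Hypothetical notifier: the callback returns 0 to accept, nonzero to refuse. */
struct sleep_notifier {
    int (*notify)(struct sleep_notifier *self, enum sleep_msg msg);
    int asleep;                     /* set once the device really sleeps */
};

/*
 * Pass 1 asks every driver whether it can sleep (drivers allocate what
 * they need here, nothing is powered down yet). If any driver refuses,
 * the ones that already accepted are told to roll back. Pass 2 only runs
 * when everybody accepted.
 */
static int broadcast_sleep(struct sleep_notifier **list, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (list[i]->notify(list[i], SLEEP_REQUEST) != 0) {
            while (i-- > 0)
                list[i]->notify(list[i], SLEEP_REJECT);
            return -1;
        }
    }
    for (i = 0; i < n; i++)
        list[i]->notify(list[i], SLEEP_NOW);
    return 0;
}

/* Example driver that always agrees and sleeps on SLEEP_NOW. */
static int accepting_notify(struct sleep_notifier *self, enum sleep_msg msg)
{
    if (msg == SLEEP_NOW)
        self->asleep = 1;
    return 0;
}

/* Example driver that vetoes every sleep request. */
static int refusing_notify(struct sleep_notifier *self, enum sleep_msg msg)
{
    (void)self;
    return msg == SLEEP_REQUEST ? -1 : 0;
}
```

The point of the split is that a veto costs nothing: no device has been touched by the time a driver refuses.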
Re: PCI power management
On Thu, Apr 19, 2001 at 11:19:31AM +0100, Benjamin Herrenschmidt wrote: Hi ! Glad to see things moving around Power Management ;) This was originally a private reply to Patrick Mochel, but the e-mail kept getting longer and longer :) Note: we have setup a list for PM issues http://lists.sourceforge.net/lists/listinfo/linux-pm-devel Oo *tries to subscribe* Doh! The silly thing is trying to use the From_ header on the confirm rather than the From: header and so I can't subscribe. Can this get fixed? Dunno, it's the standard sourceforge/geocrawler list stuff... Ben.
Re: PCI power management
- Some devices just can't be brought back to life from D3 state without a PCI reset (ATI Rage M3 for example) and that require some arch specific support (when it's possible at all). Putting on a driver author hat what I want is pci_power_on_generic, pci_power_off_generic, pci_power_on_null, pci_power_off_null. At which point most driver writers are having to do no thinking at all about their device. The PCI layer just requires they pick a function and stick it in the struct pci_device. Could you elaborate on the difference between the generic and null functions? I'm not sure I understand what you mean... Note that in the case of chips like the Rage M3, the driver is the only one to know whether it will be able to bring the card back from a power off state or not. It's the only one to know if it can reconfigure the card completely without having a BIOS run before it. I would suggest a call that looks like pci_power_off(uint mask); where mask is PCI_POWER_MASK_D1 = 0x0001 PCI_POWER_MASK_D2 = 0x0002 PCI_POWER_MASK_D3 = 0x0004 PCI_POWER_MASK_NOCLOCK = 0x0008 PCI_POWER_MASK_NOPOWER = 0x0010 The driver sets the mask to whatever states it supports getting the card back from. We can #define a PCI_POWER_MASK_STD (that would be D1+D2+D3) for "generic" drivers that don't really know anything but to follow the HW PCI power management capabilities. This function would be routed to an arch function, which will in turn either call the lower-level PCI code to set D1, D2 or D3 mode (the best supported) or suspend the card's clock or power if it can and the driver accepts it. Typically, on a PowerMac, this function could keep track of which cards are in D2 or D3 mode (or which drivers allowed for clock suspend) and would stop the PCI clock once they all asked for it. This doesnt help you. You need device specific support in each case where bus mastering is occuring and a bus master error could be fatal if missed. 
For example on i2o I can easily have 4Mbytes of outstanding I/O between the message layer and disk, all of which is bus mastering. Only the driver actually knows when its idle. Right. That's a driver issue. The problem would go away if all drivers properly blocked their IO queues and waited for all IO to complete when notified of sleep. X has hooks for this in XFree 4 The last time I looked at it, those were rather APM-specific. But well, I guess it's easy to update them. What I'm thinking about is the kernel side, that is a generic, non-APM and non-ACPI specific way of notifying the userland processes that request it. Some kind of interface allowing userland to register PM notifiers and have the kernel PM thread blocked until the userland code has "acked" the message. Well, maybe there is already something I missed... Ben.
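The pci_power_off(mask) proposal in the message above boils down to intersecting what the driver can recover from with what the platform can do, then taking the deepest state left. Here is a small sketch of that selection. The PCI_POWER_MASK_* names and values are the ones proposed in the message; pick_power_state(), the platform mask and the assumption that a higher bit means a deeper state are illustrative only.

```c
/* Flag values as proposed in the message above. */
#define PCI_POWER_MASK_D1       0x0001
#define PCI_POWER_MASK_D2       0x0002
#define PCI_POWER_MASK_D3       0x0004
#define PCI_POWER_MASK_NOCLOCK  0x0008
#define PCI_POWER_MASK_NOPOWER  0x0010
#define PCI_POWER_MASK_STD \
    (PCI_POWER_MASK_D1 | PCI_POWER_MASK_D2 | PCI_POWER_MASK_D3)

/*
 * Return the single deepest state bit present in both masks, or 0 when
 * driver and platform have nothing in common (the device stays in D0).
 * Assumption: D1 < D2 < D3 < NOCLOCK < NOPOWER in terms of depth.
 */
static unsigned int pick_power_state(unsigned int driver_mask,
                                     unsigned int platform_mask)
{
    unsigned int both = driver_mask & platform_mask;
    unsigned int bit;

    for (bit = PCI_POWER_MASK_NOPOWER; bit != 0; bit >>= 1)
        if (both & bit)
            return bit;
    return 0;
}
```

A driver for a chip that cannot be revived from D3 would pass PCI_POWER_MASK_D1 | PCI_POWER_MASK_D2, and the function would then settle on D2 even on a platform able to cut power entirely.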
Re: PCI power management
null = 'do absolutely nothing' generic = 'do D3 as per the specification' The idea being the PM layer would go around calling dev->power_off(dev); as a default notifier for PCI devices. Ok, I see. I didn't understand that the functions you were talking about would be defaults to put directly in the pci_dev structure. And in the case of the cards like that you would need a custom mask. So you'd do pci_set_power_handler(dev, atyfb_power_on, atyfb_power_off) to get a custom function. For most authors however they can call the power handler setup just using prerolled functions that do the right thing and know about any architecture horrors they dont. Right. However, drivers that don't at least need to know that a power management sequence is going on are rare. All bus mastering drivers, at least, must stop bus mastering (and clearing the bit in the command register is not enough on a bunch of them). Most drivers have to cleanly stop ongoing operations, refuse (or block) requests while the driver is sleeping, etc... and finally configure things back once waking up. I don't see many cases where a simple "default" function would work. My current scheme on PowerBooks doesn't do half of that... it still sorta works since I manage to stop all scheduling and shut things down in the proper order, but it's neither a clean nor a safe way to do things. I'd rather pci_dev->powerstate or similar as a set of flags in the device. Ok, agree with that one. I still consider, however, that the current suspend/resume callbacks in the pci_dev structure are not the best way to do things. I would have really preferred that each pci_dev embed a pm notifier structure. In some cases, we want to pass more than simple suspend/resume messages (suspend request, suspend now, suspend cancel, and resume are the 4 messages I use on PowerBooks). Also, this can be generalized to other types of drivers (USB, IEEE1394, ...), eventually passing bus-specific messages. Ben.
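To make the "null" versus "generic" distinction in the exchange above concrete, here is a hedged userspace sketch. The function names are the ones floated in this thread (they are proposals, not an existing kernel API), `struct fake_pci_dev` is a hypothetical stand-in for struct pci_dev, and d2_only_power_off() is an invented example of a custom handler for a chip that must not go deeper than D2.

```c
#include <stddef.h>

struct fake_pci_dev;
typedef int (*power_fn)(struct fake_pci_dev *dev);

/* Hypothetical stand-in for struct pci_dev. */
struct fake_pci_dev {
    int d_state;                    /* 0 = D0 ... 3 = D3 */
    power_fn power_on;
    power_fn power_off;
};

/* "null": do absolutely nothing. */
static int pci_power_on_null(struct fake_pci_dev *dev)  { (void)dev; return 0; }
static int pci_power_off_null(struct fake_pci_dev *dev) { (void)dev; return 0; }

/* "generic": plain D3 on the way down, D0 on the way up. */
static int pci_power_on_generic(struct fake_pci_dev *dev)  { dev->d_state = 0; return 0; }
static int pci_power_off_generic(struct fake_pci_dev *dev) { dev->d_state = 3; return 0; }

/* Invented custom handler: a chip the driver cannot revive from D3. */
static int d2_only_power_off(struct fake_pci_dev *dev) { dev->d_state = 2; return 0; }

/* Install handlers; a NULL argument falls back to the "null" variant. */
static void pci_set_power_handler(struct fake_pci_dev *dev,
                                  power_fn on, power_fn off)
{
    dev->power_on = on ? on : pci_power_on_null;
    dev->power_off = off ? off : pci_power_off_null;
}
```

This only covers the state transition itself. The objection raised in the reply still stands: a real driver also has to quiesce bus mastering and block requests, which no prerolled function can do on its behalf.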
Re: PCI power management
All devices should handle having power removed from them. And, all of the drivers should as well, since that is the only way we're going to get power management out of legacy devices and other things on the board. This involves saving the current context on suspend, and reinitializing the device, and restoring the context as much as possible when we resume. It should behave almost identically to the boot-time init code. Right. In fact, at the driver level, power management involves 2 different things: - Handling context save/restore of the device state - Properly blocking "user" requests (I mean users of the driver, which can be a kernel service). In some cases, this latter thing can be done by returning errors, provided that upper level drivers are ready to handle them. For example, the IDE layer should probably just block the IO queues while the IDE subsystem is powered off (not talking about disk sleep, but complete power off of the controller), while a USB host controller should probably return errors to URBs sent by drivers to a sleeping controller since those upper-level drivers should have been put to sleep before the host controller. That part is almost completely overlooked right now. - Some devices just can't be brought back to life from D3 state without a PCI reset (ATI Rage M3 for example) and that require some arch specific support (when it's possible at all). When a device comes out of D3[hot], the equivalent of a soft reset is performed. From D3[cold], PCI RST# is asserted, and the device must be completely reinitialized. Some devices (bad bad HW designers ;) just can't do it themselves. The Rage M3 requires the host to assert PCI RST#, and some motherboards provide no documented facility for that (it might be possible with Apple ASICs for example, it's just not documented). 
Also, still in the case of the Rage M3, we just can't bring it out of D3 for the same reason we can't bring the r128 in the AGP slot of a Cube Mac out of PowerOff: the complete init sequence of those chips is dependent on the chip revision and requires some information about undocumented registers that we don't have (at least that's my understanding from talks with ATI), so it can basically only be done by a BIOS (or OpenFirmware driver in my case), and we can't run that on wakeup (OF is dead on Macs once the kernel takes over). So we have to limit ourselves to D2 mode on machines that don't remove power from the slots (PowerBooks, iBooks, iMacs) and we can't do deep sleep at all on machines that remove power from the slot (Cube, G4s, ...), at least until we figure out the proper init sequence for those cards. So the point here, as far as the kernel is concerned, is that drivers should have a way to let the kernel know the min/max power state they support. It's not about what the device supports, it's about what the driver supports. STR and STD imply that all devices will lose power. The drivers are responsible for reinitializing the devices, regardless of what that may involve. Right. I'm typing too fast, but that's what I meant. Hmm. How about doing two walks of the device tree - the first calls a save_state() function for each device, which gives it the opportunity to allocate memory and save appropriate registers, etc. The second actually places the device in a low power state. This could give the kernel the chance to disable swap, or for the action to be cancelled before anything is actually put to sleep. Yup. That's approximately what I do with the PPC-specific "sleep notifiers" we are using. The only difference is that the real state save is done on the "sleep now" (later) message, not on the "sleep request" (earlier) message. 
The basic idea here is that the first pass will do all of the memory allocation (or whatever requires all system resources to be available; that can be sending a special power management message to the device, like enabling remote wakeup on USB, etc...). So this first pass requires system services (all other drivers if you prefer, especially the swap device) to be fully alive. The second pass will do the actual IO blocking, state save, and eventually enter device suspend mode for cases where it's controlled by the driver. - On SMP, we need some way to stop other CPUs in the scheduler while running the last round of sleep (putting devices to sleep) at least until all IO layers in Linux can properly handle blocking of IO queues while the device sleeps. Ugh. SMP. Not yet. Well, if all drivers properly handle blocking of IOs, the SMP issue will be easy to handle. Having the other CPUs run is not a problem as long as any IO triggered by processes on those is properly blocked by sleeping drivers. All that is needed is a cross-CPU function call to force the other CPU into an idle loop (or an idle/sleep loop on PPC) on the very last step of entering suspend mode. - We need a generic (non-x86 APM or ACPI dependant) way of
Re: PCI power management
Some devices (bad bad HW designers ;) just can't do it themselves. The Rage M3 requires the host to assert PCI RST#, and some motherboards provide no documented facility for that (it might be possible with Apple ASICs for example, it's just not documented). Why should we support such a non-spec device? Tell ATI to fix their hardware, and tell users (a) not to use the hardware, or (b) use the hardware with the knowledge that you are screwed when it comes to Power Management. Unless there are more cases like this, this should not factor at all into the modifications to the PCI and PM code... Well, I can tell all PowerBook and iBook users to forget about sleep... Also, that would not be the first time we have to deal with poorly documented hardware. I don't think we should refuse to handle any hardware that is out of spec... it would be like saying Linux doesn't support any x86 with a broken BIOS... It's not so complicated to have the minimum flexibility for the driver to tell its maximum supported power level, and I don't see why it would be a problem to use D2 instead of D3 when we don't support D3 for a given device (either because the HW is broken or undocumented, or because our driver just doesn't know how to bring the chip back to life). If the motherboard _requires_ it (because it will cut power from the chip), then we can refuse to enter sleep when one driver can't do it (instead of letting the user crash the box badly). In any case, I believe you are focusing on a point of detail. All I'm asking for (in this specific case) is a simple mask of flags set by the driver to tell what it can handle. It's also useful for devices that don't support PM on machines whose motherboard provides a facility to turn OFF power on selected cards. It would allow us to turn off cards for drivers that can handle recovering. Also, I don't think the problem of powering the chip back up and re-initing it from scratch is specific to those ATI chips. 
Look at XFree, it has to run a BIOS emulator to soft boot video chips. On PCs, I beleive you have the BIOS that re-init them when waking up from an APM or ACPI suspend. On non-PCs when suspend is not handled by the firmware but directly by the kernel, that's not the case. Ben. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
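The "simple mask of flags set by the driver" could look roughly like the sketch below. Every name in it (pm_state, PM_CAP, pm_pick_state) is hypothetical and made up for illustration; it is not an existing kernel interface, only one way to express "use D2 when D3 isn't supported, and refuse to sleep if the board will cut power anyway".

```c
#include <stdint.h>

/* Hypothetical names throughout; not an existing kernel API. */
enum pm_state { PM_D0 = 0, PM_D1, PM_D2, PM_D3 };

/* One bit per power state the driver knows how to recover from. */
#define PM_CAP(state) (1u << (state))

/*
 * Pick the state to use for a device.
 *   caps              - mask of PM_CAP() bits declared by the driver
 *   wanted            - deepest state the core would like to reach
 *   power_will_be_cut - non-zero if the motherboard removes power, in
 *                       which case the device must survive a D3-like state
 * Returns the chosen state, or -1 to veto the whole sleep request.
 */
int pm_pick_state(uint32_t caps, enum pm_state wanted, int power_will_be_cut)
{
    int s;

    if (power_will_be_cut)
        return (caps & PM_CAP(PM_D3)) ? PM_D3 : -1;

    /* Fall back to the deepest supported state not deeper than wanted. */
    for (s = wanted; s > PM_D0; s--)
        if (caps & PM_CAP(s))
            return s;

    return PM_D0;   /* driver supports nothing: leave the device on */
}
```

A driver that can only recover from D1 and D2 would be put in D2 when D3 is requested, and would veto sleep on a board that cuts power.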
Re: isa_read/write not available on ppc - solution suggestions ??
I would suggest the opposite approach instead: make the PPC just support isa_readx/isa_writex instead. We can certainly do that, no problem. BUT that won't get a token ring pcmcia card working in the newer powerbooks, such as the titanium G4 powerbook, because the PCI host bridge doesn't map any cpu addresses to the bottom 16MB of PCI memory space. This is not a problem as far as pcmcia cards are concerned - the pcmcia stuff just picks an appropriate address (typically in the range 0x9000 - 0x9fff) and sets the pcmcia/cardbus bridge to map that to the card. But it means that the physical addresses for the card's memory space will be above the 16MB point, so it is essential to do the ioremap. What about isa_ioremap ? The result from it is a token passed to isa_readx/isa_writex, and the arch side can be implemented with a couple of #defines on x86. It's easy to change I believe, and it paves the way for archs to add a notion of token in the high bits (as we _know_ an ISA address is small). Those tokens can be used by the arch to route to the proper PCI bus when several host bridges exist, to route to PCMCIA when the PCMCIA uses its own ISA memory space like on PPC, etc... Later on, we can see things like ulong pci_get_bus_isa_base(int busno); And the same for PCMCIA and whatever 16-bit busses can exist on embedded hardware. That way, support for multiple busses (either real ISA, embedded custom busses using legacy devices, several PCI hosts with ISA bridges, ...) can be implemented very easily, in most cases just by adjusting the drivers' probe code. I'd like to see the same kind of thing for IOs in fact, but that's another debate ;) Regards, Ben.
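To make the "token in the high bits" idea concrete, here is a small sketch. The 24-bit split, the per-bus base table and all function names are assumptions chosen for illustration (an ISA memory address fits in 16MB, so a 32-bit token has its top byte free); none of this is an existing kernel API.

```c
#include <stdint.h>

/* Hypothetical layout: low 24 bits = ISA address, high 8 bits = bus id. */
#define ISA_ADDR_BITS 24
#define ISA_ADDR_MASK ((1u << ISA_ADDR_BITS) - 1)
#define ISA_NUM_BUSES 3

typedef uint32_t isa_token_t;

/* Made-up CPU physical bases of each bus's ISA memory window.
 * Bus 0 mimics x86, where the token is simply the ISA address. */
static const uint64_t isa_mem_base[ISA_NUM_BUSES] = {
    0x00000000u, 0x80000000u, 0x90000000u
};

/* What a hypothetical isa_ioremap() could hand back to the driver. */
isa_token_t isa_make_token(unsigned bus_id, uint32_t isa_addr)
{
    return ((uint32_t)bus_id << ISA_ADDR_BITS) | (isa_addr & ISA_ADDR_MASK);
}

unsigned isa_token_bus(isa_token_t t)  { return t >> ISA_ADDR_BITS; }
uint32_t isa_token_addr(isa_token_t t) { return t & ISA_ADDR_MASK; }

/* What isa_readx/isa_writex would do before touching the hardware:
 * route the access to the right bus window. Returns 0 for a bad bus id. */
uint64_t isa_token_to_phys(isa_token_t t)
{
    unsigned bus = isa_token_bus(t);

    if (bus >= ISA_NUM_BUSES)
        return 0;
    return isa_mem_base[bus] + isa_token_addr(t);
}
```

With bus 0 the token degenerates to the plain ISA address, which is why the x86 side could stay a couple of #defines.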
pci_disable_device() vs. arch
Hi ! Would it make sense to add a pcibios_disable_device(pci_dev*) called from the end of pci_disable_device() ? I'm adding a call to it to sungem along with other pmac stuff so that the chip can be properly powered down (actually it's not really powered down but unclocked) after module removal. Of course, the arch code must be able to catch it in order to play with the various UniNorth control bits. Note that my current gmac driver does shut the chip down when the interface is down, which makes it a bit more useful for laptops as most users currently compile the driver into the kernel. I have nothing against changing the policy if you prefer, so that users will now have to rmmod the driver once done with the interface to save power. Ben.
Re: Going beyond 256 PCI buses
It's funny you mention this because I have been working on something similar recently. Basically making xfree86 int10 and VGA poking happy on sparc64. Heh, world is small ;) But this has no real use in the kernel. (actually I take this back, read below) yup, fbcon at least... You have a primary VGA device, that is the one the bios (boot firmware, whatever you want to call it) enables to respond to I/O and MEM accesses, the rest are configured to VGA palette snoop and that's it. The primary VGA device is the kernel console (unless using some fbcon driver of course), and that's that. Yup, fbcon is what I have in mind here. The secondary VGA devices are only interesting to things like the X server, and xfree86 does all the enable/disable/bridge-forward-vga magic when doing multi-head. and multihead fbcon. Perhaps, you might need to program the VGA resources of some device to use it in a fbcon driver (ie. to init it or set screen crt parameters, I believe the tdfx requires the latter which is why I'm having a devil of a time getting it to work on my sparc64 box). This would be a separate issue, and I would not mind at all seeing an abstraction for this sort of thing, let us call it:

	struct pci_vga_resource {
		struct resource io, mem;
	};
	int pci_route_vga(struct pci_dev *pdev, struct pci_vga_resource *res);
	pci_restore_vga(void);

[.../...] Well... that would work for VGA itself (note that this semaphore you are talking about should be shared some way with the /proc interface so XFree can be properly sync'ed as well). But I still think it may be useful to generalize the idea to all kinds of legacy PIOs. I definitely agree that VGA is a kind of special case, mostly because of the necessary exclusion on the VGA IO response. But what about all those legacy drivers that will issue inx/outx calls without an ioremap ? Should they call ioremap with hard-coded legacy addresses ? There are chipsets containing things like legacy timers, legacy keyboard controllers, etc... and in some (rare I admit) cases, those may be scattered (or multiplied) on various domains. If we decide we don't handle those, then well, I won't argue more (it's mostly an aesthetic rant on my side ;), but the problem of whether they should call ioremap or not is there, and since the ISA bus can be mapped anywhere in the bus space by the host bridge, there needs to be a way to retrieve the ISA resources in general for a given domain. That's why I'd suggest something like

	pci_get_isa_mem(struct resource* isa_mem);
	pci_get_isa_io(struct resource* isa_io);

(I prefer 2 different functions as some platforms like powermac just don't provide the ISA mem space at all, there's no way to generate a memory cycle in the low-address range on the PCI bus of those and they don't have a PCI-ISA bridge), so I like having the ability of one of the functions returning an error and not the other. Also, having the same ioremap() call for both mem IO and PIO means that things like 0xc cannot be interpreted. It's a valid ISA-mem address in the VGA space and a valid PIO address on a PCI bus that supports 64k of PIO space. I believe it would make things clearer (and probably the implementation simpler) to separate ioremap and pioremap. Ben.

So you'd go:

	struct pci_vga_resource vga_res;
	int err;

	err = pci_route_vga(tdfx_pdev, &vga_res);
	if (err)
		barf();
	vga_ports = ioremap(vga_res.io.start, vga_res.io.end-vga_res.io.start+1);
	program_video_crtc_params(vga_ports);
	iounmap(vga_ports);
	vga_fb = ioremap(vga_res.mem.start, vga_res.mem.end-vga_res.mem.start+1);
	clear_vga_fb(vga_fb);
	iounmap(vga_fb);
	pci_restore_vga();

pci_route_vga does several things:
1) It saves the current VGA routing information.
2) It configures busses and VGA devices such that PDEV responds to VGA accesses, and other VGA devices just VGA palette snoop.
3) Fills in the pci_vga_resources with
	io: 0x320--0x340 in domain PDEV lives, vga I/O regs
	mem: 0xa--0xc in domain PDEV lives, video ram

pci_restore_vga, as the name suggests, restores things back to how they were before the pci_route_vga() call. Maybe also some semaphore so only one driver can do this at once and you can't drop the semaphore without calling pci_restore_vga(). VC switching into the X server would need to grab this thing too.
Re: pci_disable_device() vs. arch
It's not clutter -- what you are doing is hiding pieces of the driver from the driver maintainer. pcibios_enable_device should not be cluttered up with such mess, too. Well... pcibios_enable_device() has to at least make sure the device gets powered up, as it's powered down after PCI probe. Except if we end up calling pci_set_power_state() to power it up early in the sungem driver. I point out that I recently fixed a bug where Via interrupts were being assigned incorrectly. If I had not done a global grep for Via irq-related code, I would have missed the spot where the PPC code was doing a kludge for one of the four on-board Via devices, hardcoding the USB irq number to 11. Hrm... interrupt routing on some PPC-based motherboards is quite a mess, fortunately that's not the case on pmacs. The IRQ assignment has to be part of the arch AFAIK, only the arch knows on which interrupt line of the controller a given chip is wired and how interrupt controllers are cascaded. Correct. If your driver uses the API correctly, then when/if we want to mess around with hotplug resource assignment, we can un-assign resources as we like. Since there aren't too many users of pci_disable_device so far, I want to make sure early adopters get it right. Well... at least with sungem, there's no such risk as the entire bus (up to the host bridge) where it lives is internal to the UniNorth ASIC. Can you give a -specific- example of arch code that is -not- sungem related, but needs to occur when one powers-down a sungem MAC? If the PM code is related to sungem, it belongs in sungem. So far I don't see a need for arch-specific hooks anywhere... Hrm... let me try again... Powering down individual devices can be controlled by the PCI PM capabilities, or in some cases (at least 2 cases here on UniNorth based pmacs) by other bits in the host bridge. What I suggest is for pci_bus to have an optional set_power_state function that is called when a device on that bus calls pci_set_power_state().
This function would then be able to implement those cases where power control is possible, while not done via PCI PM caps. A pci_bus structure exists for both root busses and busses under PCI-PCI bridges, so effectively, there's a pci_bus structure per bridge (being host or PCI-PCI). I believe it makes sense for the bridge to have a way to handle the child power state. Ben
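A rough sketch of the optional per-bus hook suggested above. The structures and names (fake_bus, fake_dev, fake_set_power_state, bridge_unclock_hook) are simplified stand-ins invented for illustration, not the real struct pci_bus or pci_set_power_state(); the point is only the dispatch order: let the bridge handle it if it can, otherwise fall back to the device's own PCI PM capability.

```c
#include <stddef.h>

struct fake_dev;

/* Simplified stand-in for a bus/bridge object (hypothetical). */
struct fake_bus {
    /* Optional hook: return 0 if the bridge handled the request,
     * non-zero to let the generic PCI PM path try instead. */
    int (*set_power_state)(struct fake_bus *bus, struct fake_dev *dev,
                           int state);
};

/* Simplified stand-in for a PCI device (hypothetical). */
struct fake_dev {
    struct fake_bus *bus;
    int has_pm_cap;   /* device implements the PCI PM capability */
    int state;        /* current power state, 0 = D0 */
};

/* Generic entry point a driver would call. */
int fake_set_power_state(struct fake_dev *dev, int state)
{
    struct fake_bus *bus = dev->bus;

    /* 1. Give the bridge a chance (e.g. clock-gating bits in an ASIC). */
    if (bus && bus->set_power_state &&
        bus->set_power_state(bus, dev, state) == 0)
        return 0;

    /* 2. Otherwise rely on the device's own PM capability, if any. */
    if (!dev->has_pm_cap)
        return -1;
    dev->state = state;   /* stands in for the PM register write */
    return 0;
}

/* Example hook for a bridge that can "unclock" its children. */
int bridge_unclock_hook(struct fake_bus *bus, struct fake_dev *dev, int state)
{
    (void)bus;
    dev->state = state;
    return 0;
}
```

A device with no PM capability can then still be powered down when it sits behind a bridge that provides the hook, and the driver itself contains no arch-specific code.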
VFS locking HFS problems (2.4.6pre6)
I've had a deadlock twice with 2.4.6pre6 today. It's an SMP kernel running on a UP box (a PowerBook Pismo). The deadlock happens in the HFS filesystem in hfs_cat_put(), apparently (quickly looking at addresses) in spin_lock(). I don't have the complete backtrace at hand right now, but it basically went up to kswapd without anything evidently getting that spinlock, I'll try to gather more details. So my question: Is there any document explaining the various locking requirements and re-entrancy possibilities in a filesystem ? What I think might happen after a quick look is that HFS may be causing schedule() to be called while holding the spinlock, and then gets re-entered from another process context. I have to look at it in more detail (is there an HFS maintainer ?) but some background information on VFS locking and reentrancy issues would be helpful. Ben.
Re: [RFC] I/O Access Abstractions
Last time I checked, ioremap didn't work for inb() and outb(). It should :) it doesn't need to. pci_find_device returns the io address and can return a cookie, ditto isapnp etc Yes, but doing that requires 2 annoying things: - Parsing of this cookie on each inx/outx access, which can take a bit of time (typically looking up the host bridge) - On machines with PIO mapped in CPU mem space and several (or large) IO regions, they must all be mapped all the time, which is a waste of kernel virtual space. Why not, at least for 2.5, define a kind of pioremap that would be the equivalent of ioremap for PIO ? In fact, I'd rather have all this abstracted in a ioremap_resource(struct resource *, int flags) iounmap_resource(struct resource *) (flags is just an idea that could be used to pass things like specific caching attributes, or whatever makes sense to a given arch). The distinction between inx/outx and readx/writex would still make sense at least for x86. Ben.
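The reason a resource-based mapping call can serve both spaces, where an address-only ioremap() cannot, is that the resource carries its type. The sketch below illustrates just that dispatch; the struct, the flag values and the window bases are all invented for the example and do not match the real struct resource or any real platform.

```c
#include <stdint.h>

/* Hypothetical resource type flags (not the kernel's IORESOURCE_*). */
#define RES_IO  0x1u   /* port I/O space */
#define RES_MEM 0x2u   /* memory space   */

/* Cut-down stand-in for struct resource. */
struct res {
    uint32_t start, end;
    unsigned flags;
};

/* Made-up CPU-side windows through which each bus space is reached. */
#define PIO_WINDOW_BASE 0xf0000000u
#define MEM_WINDOW_BASE 0x80000000u

/*
 * Sketch of an ioremap_resource()-like call: the same bus address maps
 * to a different CPU address depending on which space the resource
 * lives in. Returns 0 if the resource type is unknown.
 */
uint32_t map_resource(const struct res *r)
{
    if (r->flags & RES_IO)
        return PIO_WINDOW_BASE + r->start;
    if (r->flags & RES_MEM)
        return MEM_WINDOW_BASE + r->start;
    return 0;
}
```

Two resources that both start at the same numeric address but differ in flags end up at different CPU addresses, so the ambiguity of overlapping address spaces disappears.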
Re: [RFC] I/O Access Abstractions
Last time I checked, ioremap didn't work for inb() and outb(). ioremap itself cannot work for inb/outb as they are different address spaces with potentially overlapping addresses. I don't see how a single function would handle both... except if we pass it a struct resource instead of the address. Ben.
Re: [Acpi] Re: ACPI fundamental locking problems
Nope. I do not want to maintain two interfaces. If we make user space the way to do these things, then we will do pretty much most of the driver setup etc in user space. We'd have to: we'd enter user space before drivers have had a chance to initialize, exactly because features like these can change the device mappings etc. And I don't want to have two completely different bootup paths. I agree. Also, having this userland step would help for things like booting from a FireWire or USB hard disk. I hacked the SBP2 (FW) driver to be useable as a boot device, but this involved adding an ugly schedule() loop for a couple of seconds before mounting root in order to leave some time for the drive to be probed. Also, on such dynamic busses, you can't really know which device major/minor a given drive will be assigned. Having a userland mechanism here would allow waiting for all devices to be probed, reading the disk GUID (on fw at least) to figure out where the real root device is, etc... Even displaying a nice UI to let the user pick a root device if none is found, etc... So your idea fixes more than just the ACPI problems ;) Ben.
ide_revalidate_disk() fix
Hi Andre ! Any reason other than the usual programmer "too many things to remember" for 2.4 lacking the small ide_revalidate_disk() fix we did recently in 2.2 to keep the blocksize of the device intact ? (Just diff the 2 functions, it's pretty obvious) I'd be glad to send Linus a patch, but I believe he won't accept an ide.c patch that doesn't originate from you ;) Regards, Ben.
Re: PCI GART (?)
I have RTFM but on the matter of enabling DRI for the ATI Mobility video chipset, which on that notebook is a PCI model, there is practically nil information. The DRI website mentions using PCI GART, but there is no option for that in the kernel. How do I enable this? You need to get XFree86 CVS and really the right place to ask is the XFree86 folks. The standard kernel doesn't include pcigart Michel, FYI, PCI GART is a feature of the video chipset, not the host bridge, and so is not directly related to the kernel (there's no generic PCI GART driver like there is an AGP GART driver). AFAIK, the only PCI GART implementation so far is for rage 128 (or derived, like the M3), and is available in the "ati-pcigart-0-0-1-branch" DRI CVS branch. You need to compile the DRM inside this X server version, not the kernel one. Ben.
RE: kernel_thread vs. zombie
daemonize() makes calls that are all protected with the big kernel lock in do_exit(). All usages of daemonize have the big kernel lock held. So I guess it just needs it. Please let me know whether you have success if it makes a difference with having it held. After a few more experiments, I have this behaviour: (I hold the kernel lock, daemonize(), and release the kernel lock, then do my probe thing which takes a few seconds, and let the thread die by itself) - When started during boot (low PID (9)), it becomes a zombie - When started from a process that quits after sending the ioctl, it is correctly "garbage collected". - When started from a process that stays around, it becomes a zombie too So something is not working, or I'm missing something obvious, or whatever... Any clue ? Ben.
console.c unblank_screen problem
There is a problem with the power management code for console.c. The current code calls do_blank_screen(0); on PM_SUSPEND, and unblank_screen() on PM_RESUME. The problem happens when X is the current display while putting the machine to sleep. The do_blank_screen(0) code will do nothing as the console is not in KD_TEXT mode. However, unblank_screen has no such protection. That means that on wakeup, the cursor timer and console blank timers will be re-enabled while X is frontmost, causing the blinking cursor to be displayed on top of X, and other possible issues. I hacked the following patch to work around this. It appears to work fine, but since the console code is pretty complex, I'm not sure about possible side effects and I'd like some comments before submitting it to Linus: (Don't worry about the {} I added, I just noticed them and will remove them before submitting ;)

--- 1.2/drivers/char/console.c	Sat Feb 10 18:54:15 2001
+++ edited/drivers/char/console.c	Sun Mar 25 17:57:46 2001
@@ -2595,8 +2595,9 @@
 	int currcons = fg_console;
 	int i;
 
-	if (console_blanked)
+	if (console_blanked) {
 		return;
+	}
 
 	/* entering graphics mode? */
 	if (entering_gfx) {
@@ -2660,12 +2661,16 @@
 		printk("unblank_screen: tty %d not allocated ??\n", fg_console+1);
 		return;
 	}
+	currcons = fg_console;
+	if (vcmode != KD_TEXT) {
+		console_blanked = 0;
+		return;
+	}
 	console_timer.function = blank_screen;
 	if (blankinterval) {
 		mod_timer(&console_timer, jiffies + blankinterval);
 	}
 
-	currcons = fg_console;
 	console_blanked = 0;
 	if (console_blank_hook)
 		console_blank_hook(0);
[PATCH] binfmt_elf.c fix with PPC update
Hi Linus ! Enclosed is a (not big ;) patch against 2.4.4pre1 that does a few inter-dependent things, one being a bug fix for everybody, the other a mix of bug fix and cleanup on PPC: - binfmt_elf.c : fix DLINFO_ITEMS so that final alignment on the stack takes into account the AT_NULL entry (or it won't align). Remove hackish PPC addition (now re-done properly). Add a simple way (via 2 macros) for include/asm-xxx/elf.h to add platform specific entries to it while keeping the alignment right. - Remove shove_aux_table() in arch/ppc/kernel/process.c. That routine used to look up the aux table on the stack and move it up to align it to a 16 byte boundary (ABI). Now done via the ARCH_DLINFO in include/asm-ppc/elf.h - Re-implement the alignment mechanism properly, taking into account a pair of glibc bugs we had until now (not doing so results in breaking existing userland binaries). - Add 3 new aux table entries for PPC containing some cache line size information. Those are part of our PPC SysV ABI, and were never properly implemented, possibly because of a conflict in the AT_ numbers assigned to them. This is now "fixed" and the next glibc release will understand them. That information is necessary for glibc to properly handle various brands of PPC CPUs when doing cache invalidates or using cache tricks to speed up copy operations. The patch has been tested on PPC, glibc is ready for it, and it's simple enough not to damage other archs. Next are coming the PPC AT_HWCAP infos, still being worked on. Feel free to comment, not agree, whatever, I'd be glad however if you could explain why if you don't want to merge that now, as our glibc maintainer is waiting for it ;) Regards, Ben.
--- linuxppc_2_4_orig/fs/binfmt_elf.c	Wed Apr 11 20:18:59 2001
+++ linuxppc_2_4/fs/binfmt_elf.c	Wed Apr 11 18:38:47 2001
@@ -36,7 +36,7 @@
 #include <asm/param.h>
 #include <asm/pgalloc.h>
 
-#define DLINFO_ITEMS 13
+#define DLINFO_ITEMS 14
 
 #include <linux/elf.h>
 
@@ -135,12 +135,13 @@
 
 	/*
 	 * Force 16 byte _final_ alignment here for generality.
-	 * Leave an extra 16 bytes free so that on the PowerPC we
-	 * can move the aux table up to start on a 16-byte boundary.
 	 */
-	sp = (elf_addr_t *)((~15UL & (unsigned long)(u_platform)) - 16UL);
+	sp = (elf_addr_t *)(~15UL & (unsigned long)(u_platform));
 	csp = sp;
 	csp -= DLINFO_ITEMS*2 + (k_platform ? 2 : 0);
+#ifdef DLINFO_ARCH_ITEMS
+	csp -= DLINFO_ARCH_ITEMS*2;
+#endif
 	csp -= envc+1;
 	csp -= argc+1;
 	csp -= (!ibcs ? 3 : 1);	/* argc itself */
@@ -174,6 +175,13 @@
 	NEW_AUX_ENT(10, AT_EUID, (elf_addr_t) current->euid);
 	NEW_AUX_ENT(11, AT_GID, (elf_addr_t) current->gid);
 	NEW_AUX_ENT(12, AT_EGID, (elf_addr_t) current->egid);
+#ifdef ARCH_DLINFO
+	/*
+	 * ARCH_DLINFO must come last so platform specific code can enforce
+	 * special alignment requirements on the AUXV if necessary (eg. PPC).
+	 */
+	ARCH_DLINFO;
+#endif
 #undef NEW_AUX_ENT
 
 	sp -= envc+1;
--- linuxppc_2_4_orig/arch/ppc/kernel/process.c	Mon Apr 2 19:25:35 2001
+++ linuxppc_2_4/arch/ppc/kernel/process.c	Wed Apr 11 18:40:51 2001
@@ -378,45 +378,6 @@
 }
 
 /*
- * XXX ld.so expects the auxiliary table to start on
- * a 16-byte boundary, so we have to find it and
- * move it up. :-(
- */
-static inline void shove_aux_table(unsigned long sp)
-{
-	int argc;
-	char *p;
-	unsigned long e;
-	unsigned long aux_start, offset;
-
-	if (__get_user(argc, (int *)sp))
-		return;
-	sp += sizeof(int) + (argc + 1) * sizeof(char *);
-	/* skip over the environment pointers */
-	do {
-		if (__get_user(p, (char **)sp))
-			return;
-		sp += sizeof(char *);
-	} while (p != NULL);
-	aux_start = sp;
-	/* skip to the end of the auxiliary table */
-	do {
-		if (__get_user(e, (unsigned long *)sp))
-			return;
-		sp += 2 * sizeof(unsigned long);
-	} while (e != AT_NULL);
-	offset = ((aux_start + 15) & ~15) - aux_start;
-	if (offset != 0) {
-		do {
-			sp -= sizeof(unsigned long);
-			if (__get_user(e, (unsigned long *)sp)
-			    || __put_user(e, (unsigned long *)(sp + offset)))
-				return;
-		} while (sp > aux_start);
-	}
-}
-
-/*
  * Set up a thread for executing a new program
  */
 void start_thread(struct pt_regs *regs, unsigned long nip, unsigned long sp)
@@ -425,7 +386,6 @@
 	regs->nip = nip;
 	regs->gpr[1] = sp;
 	regs->msr = MSR_USER;
-	shove_aux_table(sp);
 	if (last_task_used_math == current)
 		last_task_used_math = 0;
 	if (last_task_used_altivec == current)
---
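Why an off-by-one in DLINFO_ITEMS breaks the final alignment can be checked with a little arithmetic. The function below is a simplified model, not the real create_elf_tables(): it assumes 4-byte stack words (32-bit PPC), pads the stack based on a predicted word count, then pushes the actual number of words. The name final_sp and the word counts in the note that follows are made up for the example.

```c
#include <stdint.h>

/*
 * Simplified model of the padding logic (assumption: 4-byte words).
 *   top             - address just above the area being built
 *   predicted_words - what the code thinks it will push
 *   actual_words    - what it really pushes
 * Returns the final stack pointer.
 */
uint32_t final_sp(uint32_t top, unsigned predicted_words,
                  unsigned actual_words)
{
    uint32_t sp = top & ~15u;                  /* start 16-byte aligned    */
    uint32_t csp = sp - 4u * predicted_words;  /* where we expect to end   */

    sp -= csp & 15u;                           /* pad so that end aligns   */
    return sp - 4u * actual_words;             /* then push the real thing */
}
```

With argc = 1 and two environment strings, 14 aux entries (AT_NULL included) give 28 + 3 + 2 + 1 = 34 words. Predicting 34 words ends on a 16-byte boundary; predicting 32 (AT_NULL forgotten, i.e. 13 entries) leaves the final pointer 8 bytes off.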
Re: [PATCH] macintosh/mediabay: Convert to kthread API.
Looks OK - there's no way of stopping the kernel thread anyway. It appears that nobody has tried to use this driver at the same time as software-suspend. At least, not successfully. A strategic try_to_freeze() should fix it. This will become (a little) more serious when cpu hotplug is switched to use the process freezer, and perhaps it breaks kprobes already. I'll dig out a box with that hardware and do some tests, but it looks nice. Thanks Eric ! There should be no problem with cpu hotplug, the only machines using the media bay driver are old Apple laptops with only one CPU and no HW threads. Ben.
Re: [PATCH] macintosh/therm_windtunnel.c: Convert to kthread API.
On Thu, 2007-04-19 at 16:37 -0700, Andrew Morton wrote: On Thu, 19 Apr 2007 01:58:48 -0600 Eric W. Biederman [EMAIL PROTECTED] wrote: Start the g4fand using kthread_run not a combination of kernel_thread and daemonize. This makes the code a little simpler and more maintainable. I had a bit of trouble reviewing this one because I was laughing so hard at the attempted coding-style in that driver. Oh well. Heh I continue creeping into Christoph's camp - there's quite a bit of open-coded gunk which would go away if we were to teach this driver about kthread_should_stop() and kthread_stop(), and the conversion looks awfully easy to do. It's a shame to stop here. Oh well, I guess at least this is some forward progress. My main problem with touching that driver is that I don't have the hardware to test. I'll try to find a user to play the guinea pig. Ben.
Re: [PATCH] powerpc pseries eeh: Convert to kthread API
The only reason for using threads here is to get the error recovery out of an interrupt context (where errors may be detected), and then, an hour later, decrement a counter (which is how we limit these to 6 per hour). Thread reaping is trivial, the thread just exits after an hour. In addition, it should be a thread and not done from within keventd because : - It can take a long time (well, relatively but still too long for a work queue) - The driver callbacks might need to use keventd or do flush_workqueue to synchronize with their own workqueues when doing an internal recovery. Since these events are rare, I've no particular concern about performance or resource consumption. The current code seems to work just fine. :-) I think moving to kthreads is cleaner (just a wrapper around kernel threads that mostly simplifies dealing with reaping them) and I agree with Christoph that it would be nice to be able to fire off kthreads from interrupt context.. in many cases, we abuse work queues for things that should really be done from kthreads instead (basically anything that takes more than a couple hundred microsecs or so). Ben.
Re: [PATCH] powerpc pseries eeh: Convert to kthread API
On Mon, 2007-04-23 at 20:08 -0600, Eric W. Biederman wrote: Benjamin Herrenschmidt [EMAIL PROTECTED] writes: The only reason for using threads here is to get the error recovery out of an interrupt context (where errors may be detected), and then, an hour later, decrement a counter (which is how we limit these to 6 per hour). Thread reaping is trivial, the thread just exits after an hour. In addition, it should be a thread and not done from within keventd because : - It can take a long time (well, relatively but still too long for a work queue) - The driver callbacks might need to use keventd or do flush_workqueue to synchronize with their own workqueues when doing an internal recovery. Since these events are rare, I've no particular concern about performance or resource consumption. The current code seems to work just fine. :-) I think moving to kthreads is cleaner (just a wrapper around kernel threads that mostly simplifies dealing with reaping them) and I agree with Christoph that it would be nice to be able to fire off kthreads from interrupt context.. in many cases, we abuse work queues for things that should really be done from kthreads instead (basically anything that takes more than a couple hundred microsecs or so). On that note does anyone have a problem if we manage the irq spawning safe kthreads the same way that we manage the work queue entries. i.e. by a structure allocated by the caller? Not sure... I can see places where I might want to spawn an arbitrary number of these without having to preallocate structures... and if I allocate on the fly, then I need a way to free that structure when the kthread is reaped, which I don't think we have currently, do we ? (In fact, I could use that for other things too now that I'm thinking of it ... I might have a go at providing optional kthread destructors). Ben.
Re: [PATCH] powerpc pseries eeh: Convert to kthread API
Further in general it doesn't make sense to grab a module reference and call that sufficient because we would like to request that the module exits.

Which is, btw, I think a total misdesign of our module stuff, but heh, I remember that led to some flamewars back then...

Like anything else, modules should have separated the entrypoints for

 - Initiating a removal request
 - Releasing the module

The former is "user did rmmod"; it can unregister things from subsystems, etc... (and can fail if the driver decides to refuse removal requests when it's busy doing things, or whatever policy that module wants to implement).

The latter is called when all references to the module have been dropped; it's a bit like the kref release (and could be implemented as one).

If we had done that (simple) thing back then, module refcounting would have been much less of a problem... I remember some reasons why that was veto'ed but I didn't and still don't agree.

Ben.
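A minimal userspace sketch in C of the two separate entry points described above (removal request versus release on last reference). This is purely illustrative: the kernel module loader does not work this way, and every name below is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a module with split removal entry points. */
struct model_module {
	int refcount;            /* outstanding references, starts at 1 */
	bool removal_requested;  /* set once a removal request succeeded */
	bool released;           /* set when the release step has run */
	bool busy;               /* policy knob: refuse removal while busy */
};

static void model_module_init(struct model_module *m)
{
	m->refcount = 1;         /* the "loaded" reference */
	m->removal_requested = false;
	m->released = false;
	m->busy = false;
}

static void model_module_get(struct model_module *m)
{
	assert(m->refcount > 0);
	m->refcount++;
}

/* Drop a reference; the release step runs only when the last one goes. */
static void model_module_put(struct model_module *m)
{
	assert(m->refcount > 0);
	if (--m->refcount == 0)
		m->released = true;  /* stands in for the release entry point */
}

/* "User did rmmod": may be refused; on success drops the load reference. */
static bool model_module_request_remove(struct model_module *m)
{
	if (m->busy || m->removal_requested)
		return false;
	m->removal_requested = true; /* unregistering from subsystems goes here */
	model_module_put(m);
	return true;
}
```

The point of the split is visible in the ordering: a successful removal request does not free anything by itself, and the release step only runs once the last holder calls `model_module_put()`.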
[PATCH 0/12] Pass MAP_FIXED down to get_unmapped_area
This is a first step as there are still cleanups to be done in various areas touched by that code but I think it's probably good to go as is and at least enables me to implement what I need for PowerPC.

(Andrew, this is also a candidate for 2.6.22 since I haven't had any real objection, mostly suggestions for improving further, which I'll try to do later, and I have further powerpc patches that rely on this).

The current get_unmapped_area code calls the f_ops->get_unmapped_area or the arch one (via the mm) only when MAP_FIXED is not passed. That makes it impossible for archs to impose proper constraints on regions of the virtual address space. To work around that, get_unmapped_area() then calls some hugetlbfs specific hacks.

This causes several problems, among others:

 - It makes it impossible for a driver or filesystem to do the same thing that hugetlbfs does (for example, to allow a driver to use larger page sizes to map external hardware) if that requires applying a constraint on the addresses (constraining that mapping in certain regions and other mappings out of those regions).

 - Some archs like arm, mips, sparc, sparc64, sh and sh64 already want MAP_FIXED to be passed down in order to deal with aliasing issues. The code is there to handle it... but is never called.

This series of patches moves the logic to handle MAP_FIXED down to the various arch/driver get_unmapped_area() implementations, and then changes the generic code to always call them. The hugetlbfs hacks then disappear from the generic code.

Since I need to do some special 64K pages mappings for SPEs on cell, I need to work around the first problem at least. I have further patches thus implementing a slices layer that handles multiple page sizes through slices of the address space for use by hugetlbfs, the SPE code, and possibly others, but it requires that series of patches first.

There is still a potential (but not practical) issue due to the fact that filesystems/drivers implementing g_u_a will effectively bypass all arch checks. This is not an issue in practice as the only filesystems/drivers using that hook are doing so for arch specific purposes in the first place.

There is also a problem with mremap that will completely bypass all arch checks. I'll try to address that separately, I'm not 100% certain yet how, possibly by making it not work when the vma has a file whose f_ops has a get_unmapped_area callback, and by making it use is_hugepage_only_range() before expanding into a new area.

Also, I want to turn is_hugepage_only_range() into a more generic is_normal_page_range() as that's really what it will end up meaning when used in stack grow, brk grow and mremap.

None of the above issues however are introduced by this patch, they are already there, so I think the patch can go in for 2.6.22.

(Patch is against Linus current git, I'll give a go at -mm asap)

Cheers,
Ben.
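To make the control flow described in this cover letter concrete, here is a minimal userspace model in C. It is a sketch under stated assumptions, not the kernel code: the MODEL_* constants, the reserved "huge" window and all function names are hypothetical, and errors are returned as negative errno values in a long rather than through IS_ERR_VALUE(). What it shows is the shape of the change: the top level always calls a hook, and each hook decides what MAP_FIXED means for it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAP_FIXED   0x10UL
#define MODEL_PAGE_SIZE   4096UL
#define MODEL_TASK_SIZE   (1UL << 30)
/* Hypothetical reserved window that only "huge page" mappings may use. */
#define MODEL_HUGE_START  (1UL << 28)
#define MODEL_HUGE_END    (1UL << 29)

typedef long (*get_area_fn)(unsigned long addr, unsigned long len,
			    unsigned long flags);

static int model_in_huge_window(unsigned long addr, unsigned long len)
{
	return addr < MODEL_HUGE_END && addr + len > MODEL_HUGE_START;
}

/* "Arch" hook: now sees MAP_FIXED and can veto a bad fixed address. */
static long model_arch_get_area(unsigned long addr, unsigned long len,
				unsigned long flags)
{
	if (flags & MODEL_MAP_FIXED) {
		if (model_in_huge_window(addr, len))
			return -EINVAL;
		return (long)addr;
	}
	/* Non-fixed case: trivially hand out the bottom of the space. */
	return (long)MODEL_PAGE_SIZE;
}

/* "Driver/filesystem" hook: only accepts ranges inside the window. */
static long model_huge_get_area(unsigned long addr, unsigned long len,
				unsigned long flags)
{
	if (flags & MODEL_MAP_FIXED) {
		if (addr < MODEL_HUGE_START || addr + len > MODEL_HUGE_END)
			return -EINVAL;
		return (long)addr;
	}
	return (long)MODEL_HUGE_START;
}

/* Top level: always call a hook, then apply the generic checks. */
static long model_get_unmapped_area(get_area_fn file_hook, unsigned long addr,
				    unsigned long len, unsigned long flags)
{
	get_area_fn get_area = file_hook ? file_hook : model_arch_get_area;
	long ret;

	assert(len > 0);
	if (len > MODEL_TASK_SIZE)
		return -ENOMEM;
	ret = get_area(addr, len, flags);
	if (ret < 0)
		return ret;
	if ((unsigned long)ret > MODEL_TASK_SIZE - len)
		return -ENOMEM;
	if ((unsigned long)ret & (MODEL_PAGE_SIZE - 1))
		return -EINVAL;
	return ret;
}
```

In this model a fixed mapping at an ordinary address succeeds through the arch hook, the same request inside the reserved window is refused, and only the file hook may place a fixed mapping there. No hugetlbfs-specific test remains in the top-level function.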
[PATCH 1/12] get_unmapped_area handles MAP_FIXED on powerpc
Handle MAP_FIXED in powerpc's arch_get_unmapped_area() in all 3 implementations of it.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]
Acked-by: William Irwin [EMAIL PROTECTED]

 arch/powerpc/mm/hugetlbpage.c |   21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

Index: linux-cell/arch/powerpc/mm/hugetlbpage.c
===
--- linux-cell.orig/arch/powerpc/mm/hugetlbpage.c	2007-04-24 15:10:17.0 +1000
+++ linux-cell/arch/powerpc/mm/hugetlbpage.c	2007-04-24 15:28:11.0 +1000
@@ -566,6 +566,13 @@ unsigned long arch_get_unmapped_area(str
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
+	/* handle fixed mapping: prevent overlap with huge pages */
+	if (flags & MAP_FIXED) {
+		if (is_hugepage_only_range(mm, addr, len))
+			return -EINVAL;
+		return addr;
+	}
+
 	if (addr) {
 		addr = PAGE_ALIGN(addr);
 		vma = find_vma(mm, addr);
@@ -641,6 +648,13 @@ arch_get_unmapped_area_topdown(struct fi
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
+	/* handle fixed mapping: prevent overlap with huge pages */
+	if (flags & MAP_FIXED) {
+		if (is_hugepage_only_range(mm, addr, len))
+			return -EINVAL;
+		return addr;
+	}
+
 	/* dont allow allocations above current base */
 	if (mm->free_area_cache > base)
 		mm->free_area_cache = base;
@@ -823,6 +837,13 @@ unsigned long hugetlb_get_unmapped_area(
 
 	/* Paranoia, caller should have dealt with this */
 	BUG_ON((addr + len) < addr);
 
+	/* Handle MAP_FIXED */
+	if (flags & MAP_FIXED) {
+		if (prepare_hugepage_range(addr, len, pgoff))
+			return -EINVAL;
+		return addr;
+	}
+
 	if (test_thread_flag(TIF_32BIT)) {
 		curareas = current->mm->context.low_htlb_areas;
[PATCH 2/12] get_unmapped_area handles MAP_FIXED on alpha
Handle MAP_FIXED in alpha's arch_get_unmapped_area(), simple case, just return the address as passed in.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]

 arch/alpha/kernel/osf_sys.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-cell/arch/alpha/kernel/osf_sys.c
===
--- linux-cell.orig/arch/alpha/kernel/osf_sys.c	2007-03-22 14:58:33.0 +1100
+++ linux-cell/arch/alpha/kernel/osf_sys.c	2007-03-22 14:58:44.0 +1100
@@ -1267,6 +1267,9 @@ arch_get_unmapped_area(struct file *filp
 	if (len > limit)
 		return -ENOMEM;
 
+	if (flags & MAP_FIXED)
+		return addr;
+
 	/* First, see if the given suggestion fits.
 
 	   The OSF/1 loader (/sbin/loader) relies on us returning an
[PATCH 3/12] get_unmapped_area handles MAP_FIXED on arm
ARM already had a case for MAP_FIXED in arch_get_unmapped_area() though it was not called before. Fix the comment to reflect that it will now be called.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]

 arch/arm/mm/mmap.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Index: linux-cell/arch/arm/mm/mmap.c
===
--- linux-cell.orig/arch/arm/mm/mmap.c	2007-03-22 14:59:51.0 +1100
+++ linux-cell/arch/arm/mm/mmap.c	2007-03-22 15:00:01.0 +1100
@@ -49,8 +49,7 @@ arch_get_unmapped_area(struct file *filp
 #endif
 
 	/*
-	 * We should enforce the MAP_FIXED case.  However, currently
-	 * the generic kernel code doesn't allow us to handle this.
+	 * We enforce the MAP_FIXED case.
 	 */
 	if (flags & MAP_FIXED) {
 		if (aliasing && flags & MAP_SHARED && addr & (SHMLBA - 1))
[PATCH 11/12] get_unmapped_area handles MAP_FIXED in generic code
Generic arch_get_unmapped_area() now handles MAP_FIXED. Now that all implementations have been fixed, change the toplevel get_unmapped_area() to call into arch or drivers for the MAP_FIXED case.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]

 mm/mmap.c |   25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

Index: linux-cell/mm/mmap.c
===
--- linux-cell.orig/mm/mmap.c	2007-03-22 16:29:22.0 +1100
+++ linux-cell/mm/mmap.c	2007-03-22 16:30:06.0 +1100
@@ -1199,6 +1199,9 @@ arch_get_unmapped_area(struct file *filp
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
+	if (flags & MAP_FIXED)
+		return addr;
+
 	if (addr) {
 		addr = PAGE_ALIGN(addr);
 		vma = find_vma(mm, addr);
@@ -1272,6 +1275,9 @@ arch_get_unmapped_area_topdown(struct fi
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
+	if (flags & MAP_FIXED)
+		return addr;
+
 	/* requesting a specific address */
 	if (addr) {
 		addr = PAGE_ALIGN(addr);
@@ -1360,22 +1366,21 @@ get_unmapped_area(struct file *file, uns
 		unsigned long pgoff, unsigned long flags)
 {
 	unsigned long ret;
+	unsigned long (*get_area)(struct file *, unsigned long,
+				  unsigned long, unsigned long, unsigned long);
 
-	if (!(flags & MAP_FIXED)) {
-		unsigned long (*get_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
-
-		get_area = current->mm->get_unmapped_area;
-		if (file && file->f_op && file->f_op->get_unmapped_area)
-			get_area = file->f_op->get_unmapped_area;
-		addr = get_area(file, addr, len, pgoff, flags);
-		if (IS_ERR_VALUE(addr))
-			return addr;
-	}
+	get_area = current->mm->get_unmapped_area;
+	if (file && file->f_op && file->f_op->get_unmapped_area)
+		get_area = file->f_op->get_unmapped_area;
+	addr = get_area(file, addr, len, pgoff, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
 
 	if (addr > TASK_SIZE - len)
 		return -ENOMEM;
 	if (addr & ~PAGE_MASK)
 		return -EINVAL;
+
 	if (file && is_file_hugepages(file)) {
 		/*
 		 * Check if the given range is hugepage aligned, and
[PATCH 10/12] get_unmapped_area handles MAP_FIXED in hugetlbfs
Generic hugetlb_get_unmapped_area() now handles MAP_FIXED by just calling prepare_hugepage_range().

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]
Acked-by: William Irwin [EMAIL PROTECTED]

 fs/hugetlbfs/inode.c |    6 ++++++
 1 file changed, 6 insertions(+)

Index: linux-cell/fs/hugetlbfs/inode.c
===
--- linux-cell.orig/fs/hugetlbfs/inode.c	2007-03-22 16:12:56.0 +1100
+++ linux-cell/fs/hugetlbfs/inode.c	2007-03-22 16:16:02.0 +1100
@@ -115,6 +115,12 @@ hugetlb_get_unmapped_area(struct file *f
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
+	if (flags & MAP_FIXED) {
+		if (prepare_hugepage_range(addr, len, pgoff))
+			return -EINVAL;
+		return addr;
+	}
+
 	if (addr) {
 		addr = ALIGN(addr, HPAGE_SIZE);
 		vma = find_vma(mm, addr);
[PATCH 12/12] get_unmapped_area doesn't need hugetlbfs hacks anymore
Remove the hugetlbfs specific hacks in toplevel get_unmapped_area() now that all archs and hugetlbfs itself do the right thing for both cases.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]
Acked-by: William Irwin [EMAIL PROTECTED]

 mm/mmap.c |   16 ----------------
 1 file changed, 16 deletions(-)

Index: linux-cell/mm/mmap.c
===
--- linux-cell.orig/mm/mmap.c	2007-04-12 12:14:46.0 +1000
+++ linux-cell/mm/mmap.c	2007-04-12 12:14:47.0 +1000
@@ -1381,22 +1381,6 @@ get_unmapped_area(struct file *file, uns
 	if (addr & ~PAGE_MASK)
 		return -EINVAL;
 
-	if (file && is_file_hugepages(file)) {
-		/*
-		 * Check if the given range is hugepage aligned, and
-		 * can be made suitable for hugepages.
-		 */
-		ret = prepare_hugepage_range(addr, len, pgoff);
-	} else {
-		/*
-		 * Ensure that a normal request is not falling in a
-		 * reserved hugepage range.  For some archs like IA-64,
-		 * there is a separate region for hugepages.
-		 */
-		ret = is_hugepage_only_range(current->mm, addr, len);
-	}
-	if (ret)
-		return -EINVAL;
 	return addr;
 }
 
[PATCH 9/12] get_unmapped_area handles MAP_FIXED on x86_64
Handle MAP_FIXED in x86_64 arch_get_unmapped_area(), simple case, just return the address as passed in.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]

 arch/x86_64/kernel/sys_x86_64.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-cell/arch/x86_64/kernel/sys_x86_64.c
===
--- linux-cell.orig/arch/x86_64/kernel/sys_x86_64.c	2007-03-22 16:10:10.0 +1100
+++ linux-cell/arch/x86_64/kernel/sys_x86_64.c	2007-03-22 16:11:06.0 +1100
@@ -93,6 +93,9 @@ arch_get_unmapped_area(struct file *filp
 	unsigned long start_addr;
 	unsigned long begin, end;
 
+	if (flags & MAP_FIXED)
+		return addr;
+
 	find_start_end(flags, &begin, &end);
 
 	if (len > end)
[PATCH 8/12] get_unmapped_area handles MAP_FIXED on sparc64
Handle MAP_FIXED in hugetlb_get_unmapped_area on sparc64 by just using prepare_hugepage_range().

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]
Acked-by: William Irwin [EMAIL PROTECTED]

 arch/sparc64/mm/hugetlbpage.c |    6 ++++++
 1 file changed, 6 insertions(+)

Index: linux-cell/arch/sparc64/mm/hugetlbpage.c
===
--- linux-cell.orig/arch/sparc64/mm/hugetlbpage.c	2007-03-22 16:12:57.0 +1100
+++ linux-cell/arch/sparc64/mm/hugetlbpage.c	2007-03-22 16:15:33.0 +1100
@@ -175,6 +175,12 @@ hugetlb_get_unmapped_area(struct file *f
 	if (len > task_size)
 		return -ENOMEM;
 
+	if (flags & MAP_FIXED) {
+		if (prepare_hugepage_range(addr, len, pgoff))
+			return -EINVAL;
+		return addr;
+	}
+
 	if (addr) {
 		addr = ALIGN(addr, HPAGE_SIZE);
 		vma = find_vma(mm, addr);
[PATCH 7/12] get_unmapped_area handles MAP_FIXED on parisc
Handle MAP_FIXED in parisc arch_get_unmapped_area(), just return the address. We might want to also check for possible cache aliasing issues now that we get called in that case (like ARM or MIPS), leave a comment for the maintainers to pick up.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]

 arch/parisc/kernel/sys_parisc.c |    5 +++++
 1 file changed, 5 insertions(+)

Index: linux-cell/arch/parisc/kernel/sys_parisc.c
===
--- linux-cell.orig/arch/parisc/kernel/sys_parisc.c	2007-03-22 15:28:05.0 +1100
+++ linux-cell/arch/parisc/kernel/sys_parisc.c	2007-03-22 15:29:08.0 +1100
@@ -106,6 +106,11 @@ unsigned long arch_get_unmapped_area(str
 {
 	if (len > TASK_SIZE)
 		return -ENOMEM;
+	/* Might want to check for cache aliasing issues for MAP_FIXED case
+	 * like ARM or MIPS ??? --BenH.
+	 */
+	if (flags & MAP_FIXED)
+		return addr;
 	if (!addr)
 		addr = TASK_UNMAPPED_BASE;
 
[PATCH 6/12] get_unmapped_area handles MAP_FIXED on ia64
Handle MAP_FIXED in ia64 arch_get_unmapped_area and hugetlb_get_unmapped_area(), just call prepare_hugepage_range in the latter and is_hugepage_only_range() in the former.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]
Acked-by: William Irwin [EMAIL PROTECTED]

 arch/ia64/kernel/sys_ia64.c |    7 +++++++
 arch/ia64/mm/hugetlbpage.c  |    8 ++++++++
 2 files changed, 15 insertions(+)

Index: linux-cell/arch/ia64/kernel/sys_ia64.c
===
--- linux-cell.orig/arch/ia64/kernel/sys_ia64.c	2007-03-22 15:10:45.0 +1100
+++ linux-cell/arch/ia64/kernel/sys_ia64.c	2007-03-22 15:10:47.0 +1100
@@ -33,6 +33,13 @@ arch_get_unmapped_area (struct file *fil
 	if (len > RGN_MAP_LIMIT)
 		return -ENOMEM;
 
+	/* handle fixed mapping: prevent overlap with huge pages */
+	if (flags & MAP_FIXED) {
+		if (is_hugepage_only_range(mm, addr, len))
+			return -EINVAL;
+		return addr;
+	}
+
 #ifdef CONFIG_HUGETLB_PAGE
 	if (REGION_NUMBER(addr) == RGN_HPAGE)
 		addr = 0;
Index: linux-cell/arch/ia64/mm/hugetlbpage.c
===
--- linux-cell.orig/arch/ia64/mm/hugetlbpage.c	2007-03-22 15:12:32.0 +1100
+++ linux-cell/arch/ia64/mm/hugetlbpage.c	2007-03-22 15:12:39.0 +1100
@@ -148,6 +148,14 @@ unsigned long hugetlb_get_unmapped_area(
 		return -ENOMEM;
 	if (len & ~HPAGE_MASK)
 		return -EINVAL;
+
+	/* Handle MAP_FIXED */
+	if (flags & MAP_FIXED) {
+		if (prepare_hugepage_range(addr, len, pgoff))
+			return -EINVAL;
+		return addr;
+	}
+
 	/* This code assumes that RGN_HPAGE != 0. */
 	if ((REGION_NUMBER(addr) != RGN_HPAGE) || (addr & (HPAGE_SIZE - 1)))
 		addr = HPAGE_REGION_BASE;
[PATCH 5/12] get_unmapped_area handles MAP_FIXED on i386
Handle MAP_FIXED in i386 hugetlb_get_unmapped_area(), just call prepare_hugepage_range.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]
Acked-by: William Irwin [EMAIL PROTECTED]

 arch/i386/mm/hugetlbpage.c |    6 ++++++
 1 file changed, 6 insertions(+)

Index: linux-cell/arch/i386/mm/hugetlbpage.c
===
--- linux-cell.orig/arch/i386/mm/hugetlbpage.c	2007-03-22 16:08:12.0 +1100
+++ linux-cell/arch/i386/mm/hugetlbpage.c	2007-03-22 16:14:19.0 +1100
@@ -367,6 +367,12 @@ hugetlb_get_unmapped_area(struct file *f
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
+	if (flags & MAP_FIXED) {
+		if (prepare_hugepage_range(addr, len, pgoff))
+			return -EINVAL;
+		return addr;
+	}
+
 	if (addr) {
 		addr = ALIGN(addr, HPAGE_SIZE);
 		vma = find_vma(mm, addr);
[PATCH 4/12] get_unmapped_area handles MAP_FIXED on frv
Handle MAP_FIXED in arch_get_unmapped_area on frv. Trivial case, just return the address.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]

 arch/frv/mm/elf-fdpic.c |    4 ++++
 1 file changed, 4 insertions(+)

Index: linux-cell/arch/frv/mm/elf-fdpic.c
===
--- linux-cell.orig/arch/frv/mm/elf-fdpic.c	2007-03-22 15:00:50.0 +1100
+++ linux-cell/arch/frv/mm/elf-fdpic.c	2007-03-22 15:01:06.0 +1100
@@ -64,6 +64,10 @@ unsigned long arch_get_unmapped_area(str
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
+	/* handle MAP_FIXED */
+	if (flags & MAP_FIXED)
+		return addr;
+
 	/* only honour a hint if we're not going to clobber something doing so */
 	if (addr) {
 		addr = PAGE_ALIGN(addr);
Re: [patch 2.6.20-1] radeonfb: Add support for Radeon xpress 200m
-	radeonfb_pm_init(rinfo, rinfo->is_mobility ? 1 : -1, ignore_devlist, force_sleep);
+	radeonfb_pm_init(rinfo, rinfo->is_mobility && rinfo->family != CHIP_FAMILY_RS480 ? 1 : -1, ignore_devlist, force_sleep);

I'd rather you add a check for RS480 inside radeonfb_pm_*

Ben.
Re: [PATCH] Fix atomicity of TIF update in flush_thread() for powerpc
.../...

Signed-off-by: Mathieu Desnoyers [EMAIL PROTECTED]

Acked-by: Benjamin Herrenschmidt [EMAIL PROTECTED]

Nice catch !

--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -476,8 +476,13 @@ void flush_thread(void)
 #ifdef CONFIG_PPC64
 	struct thread_info *t = current_thread_info();
 
-	if (t->flags & _TIF_ABI_PENDING)
-		t->flags ^= (_TIF_ABI_PENDING | _TIF_32BIT);
+	if (test_tsk_thread_flag(tsk, TIF_ABI_PENDING)) {
+		clear_tsk_thread_flag(tsk, TIF_ABI_PENDING);
+		if (test_tsk_thread_flag(tsk, TIF_32BIT))
+			clear_tsk_thread_flag(tsk, TIF_32BIT);
+		else
+			set_tsk_thread_flag(tsk, TIF_32BIT);
+	}
 #endif
 
 	discard_lazy_cpu_state();
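The difference between the removed "^=" on the whole flags word and the per-bit helpers can be sketched in userspace with C11 atomics. This is only an analogy, assuming the concern is that a plain read-modify-write of the word can lose a concurrent update to some other bit; the bit numbers and function names below are hypothetical and are not the kernel's TIF_* values or thread-flag helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit numbers; the kernel's TIF_* values differ per arch. */
#define MODEL_TIF_32BIT        4
#define MODEL_TIF_ABI_PENDING  7

static bool model_test_flag(atomic_ulong *flags, int bit)
{
	return (atomic_load(flags) >> bit) & 1UL;
}

/* Each helper is a single atomic read-modify-write on one bit. */
static void model_set_flag(atomic_ulong *flags, int bit)
{
	atomic_fetch_or(flags, 1UL << bit);
}

static void model_clear_flag(atomic_ulong *flags, int bit)
{
	atomic_fetch_and(flags, ~(1UL << bit));
}

/* Same decision structure as the patched flush_thread() hunk. */
static void model_flush_thread_flags(atomic_ulong *flags)
{
	assert(flags != NULL);
	if (model_test_flag(flags, MODEL_TIF_ABI_PENDING)) {
		model_clear_flag(flags, MODEL_TIF_ABI_PENDING);
		if (model_test_flag(flags, MODEL_TIF_32BIT))
			model_clear_flag(flags, MODEL_TIF_32BIT);
		else
			model_set_flag(flags, MODEL_TIF_32BIT);
	}
}
```

Because every modification touches only its own bit atomically, a bit set by someone else between two of these steps is preserved, which a non-atomic load, XOR and store of the whole word would not guarantee.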
Re: [PATCH] Complain about missing system calls.
On Fri, 2007-03-09 at 17:11 +0100, Andi Kleen wrote:
David Woodhouse [EMAIL PROTECTED] writes:

Most system calls seem to get added to i386 first. This patch automatically generates a warning for any new system call which is implemented on i386 but not the architecture currently being compiled. On PowerPC at the moment, for example, it results in these warnings:

init/missing_syscalls.h:935:3: warning: #warning syscall sync_file_range not implemented
init/missing_syscalls.h:947:3: warning: #warning syscall getcpu not implemented
init/missing_syscalls.h:950:3: warning: #warning syscall epoll_pwait not implemented

I think a better solution would be to finally switch to auto generated system call tables for newer system calls. The original reason why the architectures have different system call numbers -- compatibility with another native Unix -- is completely obsolete now. This leaves only minor differences of compat stub vs non compat stub and a few architecture specific calls.

Of course the existing syscall numbers can't be changed, but for all new calls one could just add automatically for everybody. A global table with two entries (compat and non compat) and a per arch override table should be sufficient.

We need additional gunk for syscalls that can be called from SPEs on cell.

Ben.
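One possible reading of the "global table with two entries plus a per arch override table" suggestion, as a small userspace C sketch. Nothing here corresponds to an existing kernel interface: the table layout, the lookup function and the three stand-in handlers are all hypothetical, and the handlers just return a marker value so the selection logic can be observed.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*model_syscall_fn)(void);

/* One row per new syscall: a native handler and a compat handler. */
struct model_syscall_entry {
	model_syscall_fn native;
	model_syscall_fn compat;
};

/* Stand-in handlers; each returns a marker identifying itself. */
static long model_sys_getcpu(void)        { return 1; }
static long model_compat_sys_getcpu(void) { return 2; }
static long model_arch_sys_getcpu(void)   { return 3; }

/* Global table shared by every architecture (new syscalls only). */
static const struct model_syscall_entry model_global_table[] = {
	{ model_sys_getcpu, model_compat_sys_getcpu },
};

#define MODEL_NR_SYSCALLS \
	(sizeof(model_global_table) / sizeof(model_global_table[0]))

/* A non-NULL slot in the per-arch override table wins over the global one. */
static model_syscall_fn
model_lookup(const struct model_syscall_entry *override, size_t nr,
	     int is_compat)
{
	const struct model_syscall_entry *e;

	assert(nr < MODEL_NR_SYSCALLS);
	e = &model_global_table[nr];
	if (override != NULL) {
		model_syscall_fn fn = is_compat ? override[nr].compat
						: override[nr].native;
		if (fn != NULL)
			return fn;
	}
	return is_compat ? e->compat : e->native;
}
```

An architecture with nothing special to do would pass no override table at all; one that needs its own native stub fills only that slot and inherits the compat entry from the global table. The extra handling for calls made from SPEs mentioned in the reply is not modelled here.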
Re: Make sure we populate the initroot filesystem late enough
Hmm. The crash came back after I booted into Mac OS X and back. It was however a different crash, I believe it was coming from the USB modules (as it would keep going when it happened, and get another crash, which tended to scroll away too fast for me to capture) but I believe it was still getting down into the slab code and actually dying there.

Have you tried, instead, to apply 38f3323037de22bb0089d08be27be01196e7148b ? (That is, revert 39d61db0edb34d60b83c5e0d62d0e906578cc707). I suspect this is the proper fix...

Ben.

However, reverting the reversion of 8d610dd52dd1da696e199e4b4545f33a2a5de5c6 and instead applying the following patch:

diff -ru linux-source-2.6.20.orig/arch/powerpc/mm/init_32.c linux-source-2.6.20/arch/powerpc/mm/init_32.c
--- linux-source-2.6.20.orig/arch/powerpc/mm/init_32.c	2007-02-05 05:44:54.0 +1100
+++ linux-source-2.6.20/arch/powerpc/mm/init_32.c	2007-03-10 11:03:56.0 +1100
@@ -244,7 +244,8 @@
 void free_initrd_mem(unsigned long start, unsigned long end)
 {
 	if (start < end)
-		printk ("Freeing initrd memory: %ldk freed\n", (end - start) >> 10);
+		printk ("NOT Freeing initrd memory: %ldk freed\n", (end - start) >> 10);
+	return;
 	for (; start < end; start += PAGE_SIZE) {
 		ClearPageReserved(virt_to_page(start));
 		init_page_count(virt_to_page(start));

which if I recall correctly David Woodhouse posted to this thread, seems to have fixed it.

I dunno if it's relevant, but my initrd.img is 13193315 bytes long, (ie 99 bytes over 12884k) and the above logs:

NOT Freeing initrd memory: 12888k freed

which makes sense... I of course completely failed to think to check this with the crashing kernel, if it seems relevant I can roll back to it and get the numbers.
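The size figures quoted in this message can be checked with a few lines of C. This sketch assumes 4 KiB pages and that the freed range is simply the image size rounded up to whole pages; the function name is hypothetical and the kernel computes the printed value from the start and end addresses instead.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096UL   /* assumption: 4 KiB pages */

/* Round a byte count up to whole pages and report the result in KiB. */
static unsigned long model_initrd_kib(unsigned long bytes)
{
	unsigned long pages;

	assert(bytes > 0);
	pages = (bytes + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
	return (pages * MODEL_PAGE_SIZE) >> 10;
}
```

With the 13193315 byte image, 12884k is 13193216 bytes, so the image is 99 bytes over that boundary and needs one more 4k page, which gives the 12888k in the log line.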
Re: [PATCH] BLK_DEV_IDE_CELLEB dependency fix
On Thu, 2007-03-15 at 17:30 +0300, Sergei Shtylyov wrote:

Hello.

Akira Iguchi wrote:

It's bool and it depends on BLK_DEV_IDE => should depend on BLK_DEV_IDE=y

Hm, why am I seeing module_init() in the driver? :-)

And move it to the if BLK_DEV_IDEDMA_PCI block because it depends on BLK_DEV_IDEDMA_PCI.

IMHO, that driver shouldn't be in drivers/ide/ppc/ then... Why did it get there (the same question about the PowerMac driver)?

Not sure... some reorg changed ide-pmac.c into ppc/pmac.c or such, I don't remember who did it tho.

Ben.
Re: remote debugging via FireWire
On Sat, 2007-02-10 at 20:16 +0100, Stefan Richter wrote:

[ohci1394_early]

Some remarks to the September 2006 version at http://www.suse.de/~bk/firewire/ :

 - Seems its .remove won't work properly if more than one OHCI-1394 controller is installed. And its .probe isn't reentrant, but that might be less of a problem.
 - Its functionality will be lost if there is a FireWire bus reset, e.g. when something is plugged in or out. To keep physical DMA alive, an interrupt handler would have to be installed which writes ~0 to OHCI1394_PhyReqFilter{Hi,Lo}Set. Can interrupt handlers be registered in an early setup stage?
 - There might be some register accesses in the setup which could be omitted; I'd have to look this up.
 - Could be optimized to not use ohci1394.h::struct ti_ohci.
 - PCI_CLASS_FIREWIRE_OHCI can be replaced by include/linux/pci_ids.h::PCI_CLASS_SERIAL_FIREWIRE_OHCI which was newly added in 2.6.20-git#.
 - I suppose .probe should check for PCI_CLASS_SERIAL_FIREWIRE_OHCI instead of PCI_CLASS_SERIAL_FIREWIRE.
 - How about dropping support for configuring this as a module, to simplify the code? Unless this would interfere with ohci1394; and it probably would if there was an interrupt handler...
 - "depends on X86_64" is missing in Kconfig.
 - Maybe put it into arch/x86_64/drivers/ instead of drivers/ieee1394?
 - Plus what I mentioned earlier in the thread.

I could send code to address some of this next weekend or later.

I'd like to have that on ppc as well, so I'd rather keep it in drivers/

I agree that it doesn't need to be a module. If you can load modules, then you can load the full ohci driver. Thus, if it's an early thingy initialized by arch, it can export a special takeover hook that the proper ohci module can then call to override it (important if we start having an irq handler).

Andi, also, how do you deal with iommu ? Not at all ? :-)

Ben.
[PATCH] powerpc: Fix vDSO page count calculation
The recent vDSO consolidation patches broke powerpc due to a mistake in the definition of MAXPAGES constants. This fixes it by moving to a dynamically allocated array of pages instead as I don't like much hard coded size limits. Also move the vdso initialisation to an initcall since it doesn't really need to be done -that- early.

Apologies for not catching the breakage earlier, Roland _did_ CC me on his patches a while ago, I got busy with other things and forgot to test them.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]

Index: linux-work/arch/powerpc/kernel/vdso.c
===
--- linux-work.orig/arch/powerpc/kernel/vdso.c	2007-02-12 10:42:46.0 +1100
+++ linux-work/arch/powerpc/kernel/vdso.c	2007-02-12 11:03:54.0 +1100
@@ -49,24 +49,23 @@
 /* Max supported size for symbol names */
 #define MAX_SYMNAME	64
 
-#define VDSO32_MAXPAGES	(((0x3000 + PAGE_MASK) >> PAGE_SHIFT) + 2)
-#define VDSO64_MAXPAGES	(((0x3000 + PAGE_MASK) >> PAGE_SHIFT) + 2)
-
 extern char vdso32_start, vdso32_end;
 static void *vdso32_kbase = &vdso32_start;
-unsigned int vdso32_pages;
-static struct page *vdso32_pagelist[VDSO32_MAXPAGES];
+static unsigned int vdso32_pages;
+static struct page **vdso32_pagelist;
 unsigned long vdso32_sigtramp;
 unsigned long vdso32_rt_sigtramp;
 
 #ifdef CONFIG_PPC64
 extern char vdso64_start, vdso64_end;
 static void *vdso64_kbase = &vdso64_start;
-unsigned int vdso64_pages;
-static struct page *vdso64_pagelist[VDSO64_MAXPAGES];
+static unsigned int vdso64_pages;
+static struct page **vdso64_pagelist;
 unsigned long vdso64_rt_sigtramp;
 #endif /* CONFIG_PPC64 */
 
+static int vdso_ready;
+
 /*
  * The vdso data page (aka. systemcfg for old ppc64 fans) is here.
  * Once the early boot kernel code no longer needs to muck around
@@ -182,6 +181,9 @@ int arch_setup_additional_pages(struct l
 	unsigned long vdso_base;
 	int rc;
 
+	if (!vdso_ready)
+		return 0;
+
 #ifdef CONFIG_PPC64
 	if (test_thread_flag(TIF_32BIT)) {
 		vdso_pagelist = vdso32_pagelist;
@@ -661,7 +663,7 @@ static void __init vdso_setup_syscall_ma
 }
 
 
-void __init vdso_init(void)
+static int __init vdso_init(void)
 {
 	int i;
 
@@ -716,11 +718,13 @@ void __init vdso_init(void)
 #ifdef CONFIG_PPC64
 		vdso64_pages = 0;
 #endif
-		return;
+		return 0;
 	}
 
 	/* Make sure pages are in the correct state */
-	BUG_ON(vdso32_pages + 2 > VDSO32_MAXPAGES);
+	vdso32_pagelist = kzalloc(sizeof(struct page *) * (vdso32_pages + 2),
+				  GFP_KERNEL);
+	BUG_ON(vdso32_pagelist == NULL);
 	for (i = 0; i < vdso32_pages; i++) {
 		struct page *pg = virt_to_page(vdso32_kbase + i*PAGE_SIZE);
 		ClearPageReserved(pg);
@@ -731,7 +735,9 @@ void __init vdso_init(void)
 	vdso32_pagelist[i] = NULL;
 
 #ifdef CONFIG_PPC64
-	BUG_ON(vdso64_pages + 2 > VDSO64_MAXPAGES);
+	vdso64_pagelist = kzalloc(sizeof(struct page *) * (vdso64_pages + 2),
+				  GFP_KERNEL);
+	BUG_ON(vdso64_pagelist == NULL);
 	for (i = 0; i < vdso64_pages; i++) {
 		struct page *pg = virt_to_page(vdso64_kbase + i*PAGE_SIZE);
 		ClearPageReserved(pg);
@@ -743,7 +749,13 @@ void __init vdso_init(void)
 #endif /* CONFIG_PPC64 */
 
 	get_page(virt_to_page(vdso_data));
+
+	smp_wmb();
+	vdso_ready = 1;
+
+	return 0;
 }
+arch_initcall(vdso_init);
 
 int in_gate_area_no_task(unsigned long addr)
 {
Index: linux-work/arch/powerpc/mm/mem.c
===
--- linux-work.orig/arch/powerpc/mm/mem.c	2007-02-12 10:53:02.0 +1100
+++ linux-work/arch/powerpc/mm/mem.c	2007-02-12 10:53:05.0 +1100
@@ -384,9 +384,6 @@ void __init mem_init(void)
 	       initsize >> 10);
 
 	mem_init_done = 1;
-
-	/* Initialize the vDSO */
-	vdso_init();
 }
 
 /*
Index: linux-work/include/asm-powerpc/vdso.h
===
--- linux-work.orig/include/asm-powerpc/vdso.h	2007-02-12 11:02:44.0 +1100
+++ linux-work/include/asm-powerpc/vdso.h	2007-02-12 11:03:36.0 +1100
@@ -18,16 +18,11 @@
 
 #ifndef __ASSEMBLY__
 
-extern unsigned int vdso64_pages;
-extern unsigned int vdso32_pages;
-
 /* Offsets relative to thread->vdso_base */
 extern unsigned long vdso64_rt_sigtramp;
 extern unsigned long vdso32_sigtramp;
 extern unsigned long vdso32_rt_sigtramp;
 
-extern void vdso_init(void);
-
 #else /* __ASSEMBLY__ */
 
 #ifdef __VDSO64__
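The two ideas in this patch (size the page list from the real page count instead of a compile-time maximum, and publish a "ready" flag only after the list is filled in) can be sketched in userspace C. This is an analogy under stated assumptions, not the kernel code: calloc() stands in for kzalloc(), a C11 release store stands in for smp_wmb() followed by the plain store, and all model_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

struct model_vdso {
	void **pagelist;        /* npages entries plus 2 extra slots */
	unsigned int npages;
	atomic_int ready;       /* published last, like vdso_ready */
};

/* Build the page list for `npages` pages starting at `base`. */
static int model_vdso_init(struct model_vdso *v, char *base,
			   unsigned int npages, size_t page_size)
{
	assert(v != NULL && base != NULL);
	/* calloc zero-fills, so the extra slots start out as NULL */
	v->pagelist = calloc((size_t)npages + 2, sizeof(*v->pagelist));
	if (v->pagelist == NULL)
		return -1;
	for (unsigned int i = 0; i < npages; i++)
		v->pagelist[i] = base + (size_t)i * page_size;
	v->npages = npages;
	/* Release store: the writes above are visible before ready reads 1. */
	atomic_store_explicit(&v->ready, 1, memory_order_release);
	return 0;
}

/* Consumers bail out early until init has completed. */
static void **model_vdso_pages(struct model_vdso *v)
{
	if (!atomic_load_explicit(&v->ready, memory_order_acquire))
		return NULL;
	return v->pagelist;
}
```

The early return in `model_vdso_pages()` plays the role of the `if (!vdso_ready) return 0;` check added to arch_setup_additional_pages(): since init now runs from an initcall, a process started before it completes simply gets no vDSO mapping.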
Re: remote debugging via FireWire
On Mon, 2007-02-12 at 07:49 +0100, Andi Kleen wrote: On Sunday 11 February 2007 22:35, Benjamin Herrenschmidt wrote: I'd like to have that on ppc as well, so I'd rather keep it in drivers/ This will need some abstraction at least -- there are some early mapping hacks that are x86 specific right now. Either abstraction or ifdef's .. we have ioremap working very early on ppc :-) I agree that it doesn't need to be a module. If you can load modules, then you can load the full ohci driver. Thus, if it's an early thingy initialized by arch, it can export a special takeover hook that the proper ohci module can then call to override it (important if we start having an irq handler). Andi, also, how do you deal with iommu ? Not at all ? :-) Yes -- it's really early debugging hack mostly. It's reasonable to let the iommu be disabled (or later a special bypass can be added for this) Ok. Ben.
Re: undefined symbol 'PS3_PS3AV'
On Wed, 2007-02-14 at 19:17 +0900, Paul Mundt wrote: On Wed, Feb 14, 2007 at 11:02:06AM +0100, Geert Uytterhoeven wrote: On Wed, 14 Feb 2007, Paul Mundt wrote: This would seem like a reasonable candidate for a 'depends on' instead of a select.. That's what we originally had. But for the user it's simpler if he can just enable ps3fb and/or ps3snd (sound driver not yet finished), which both select PS3_PS3AV. Why not just have PS3_PS3AV def_bool y if ps3fb || ps3snd? Or if that doesn't work, just place the PS3FB option in arch/powerpc/platforms/ps3/Kconfig. Of course if select obeyed the depends on, this wouldn't be a problem either.. I'd rather fix Kconfig to do the latter... Ben.
Re: [RFC] killing the NR_IRQS arrays.
On Fri, 2007-02-16 at 05:10 -0700, Eric W. Biederman wrote: Getting the drivers changed actually looks to be pretty straightforward, it will just be a very large mechanical change. We change the type of variables where appropriate and every once in a while introduce an irq_nr(irq) to get the actual irq number for the places that care (ISA or print statements). Dunno about that irq_nr thingy. If we go that way, I'd be tempted to remove the number completely from the public side of irq_desc... or not. On powerpc, we have this remapped thingy because we completely separate the linux virtual interrupt domain from the physical numbering domains of each PIC. Your change would turn the linux virtual domain into pointers, removing the need for an array and associated limitations, which is nice. So to a given irq_desc / irq virtual number today, I match a pair: a HW number (which is a special typedef which is currently defined as an unsigned long) and a pointer to the irq host (which is the entity that defines a HW number domain). That means that you can have multiple hosts and a given HW number can exist multiple times, once per host. Do you think the irq_hwnumber_t thingy I have should then be generalized and put into the irq_desc ? I would need an additional void * pointer to the irq host as well (it's not a 1:1 relationship to an irq chip and needs to be accessed by generic code). Having the HW number be clearly specific to a domain controller also makes a lot of sense in the embedded field with lots of cascaded interrupt controllers. It avoids having to play all sorts of tricks to assign ranges of numbers to various controllers in the system. Only the local number on a given controller matters, the rest is dynamically assigned. Another option would be to have the irq_desc be created by the arch and embedded in a larger data structure, in which case the HW number would be part of the private part of that data structure. 
Though I suppose that could be a problem with ISA... I suspect that for backward compatibility, we will need to keep something (optionally maybe via CONFIG_*) for ISA/legacy interrupts. That is a 16-entry irq_desc* array, so we can go from a legacy IRQ number to an irq_desc on platforms that have legacy/ISA crap floating around. On powerpc, what I do is that I always reserve entries 0...15 of my remapping array in such a way that linux virtual irq 0 is always reserved, and 1...15 are only ever assigned to legacy interrupts if they exist in the system, or left unassigned if they don't. I think we can make this change fairly smoothly if before the code is merged into Linus's tree we have a patchset prepared with all of the core infrastructure changes and a best effort at all of the driver changes. Then early some merge window we merge the patchset, and fixup the drivers that were missed. As long as we do things properly and not with a big DESIGNED FOR x86 hack in the middle that makes it hard for everybody else, I agree. Cheers, Ben.
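The scheme described above (each linux virtual irq matched to a pair of irq host and HW number, virtual irq 0 never handed out, 1...15 kept for legacy interrupts, everything else assigned dynamically) could be sketched roughly as follows. This is an illustrative model only, not the code in arch/powerpc/kernel/irq.c: every sketch_* name and the table size are assumptions made for the example.

```c
#include <stddef.h>

#define SKETCH_NR_VIRQS   64   /* illustrative table size */
#define SKETCH_NUM_LEGACY 16   /* virqs 0..15 are set aside */
#define SKETCH_NO_VIRQ    0    /* virq 0 is never handed out */

struct sketch_host { const char *name; };   /* one HW numbering domain */

struct sketch_map_entry {
    const struct sketch_host *host;   /* NULL means the slot is free */
    unsigned long hwirq;              /* only meaningful within host */
};

static struct sketch_map_entry sketch_map[SKETCH_NR_VIRQS];

/* Find the virq already bound to (host, hwirq), or SKETCH_NO_VIRQ. */
static unsigned int sketch_find_mapping(const struct sketch_host *host,
                                        unsigned long hwirq)
{
    if (host == NULL)
        return SKETCH_NO_VIRQ;
    for (unsigned int v = 1; v < SKETCH_NR_VIRQS; v++)
        if (sketch_map[v].host == host && sketch_map[v].hwirq == hwirq)
            return v;
    return SKETCH_NO_VIRQ;
}

/* Legacy interrupts 1..15 keep their number as the virq. */
static unsigned int sketch_map_legacy(const struct sketch_host *host,
                                      unsigned int isa_irq)
{
    if (host == NULL || isa_irq == 0 || isa_irq >= SKETCH_NUM_LEGACY)
        return SKETCH_NO_VIRQ;
    if (sketch_map[isa_irq].host != NULL && sketch_map[isa_irq].host != host)
        return SKETCH_NO_VIRQ;        /* slot owned by another host */
    sketch_map[isa_irq].host = host;
    sketch_map[isa_irq].hwirq = isa_irq;
    return isa_irq;
}

/* Any other (host, hwirq) pair gets a dynamically chosen virq >= 16. */
static unsigned int sketch_create_mapping(const struct sketch_host *host,
                                          unsigned long hwirq)
{
    unsigned int v = sketch_find_mapping(host, hwirq);

    if (host == NULL || v != SKETCH_NO_VIRQ)
        return v;
    for (v = SKETCH_NUM_LEGACY; v < SKETCH_NR_VIRQS; v++) {
        if (sketch_map[v].host == NULL) {
            sketch_map[v].host = host;
            sketch_map[v].hwirq = hwirq;
            return v;
        }
    }
    return SKETCH_NO_VIRQ;            /* table full */
}
```

With two hosts, HW number 5 on each yields two distinct virtual numbers, which is the property that lets cascaded controllers number their inputs locally.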
Re: [RFC] killing the NR_IRQS arrays.
On Fri, 2007-02-16 at 13:41 +0100, Ingo Molnar wrote: * Eric W. Biederman [EMAIL PROTECTED] wrote: So I propose we remove all assumptions from the code that we actually have an array of irqs. That will allow for irq_desc to be dynamically allocated instead of statically allocated saving memory and reducing kernel complexity. hm. I'd suggest to do this without changing request_irq() - and then we could avoid the 'massive, every driver affected' change, right? i.e. because we'll (have to) have an nr_to_desc() and desc_to_nr() mapping facility anyway, let's just not change the driver APIs massively. There don't seem to be that many drivers that assume that irq_desc[] is an array - are there? otherwise, in terms of the irqchips infrastructure and the API between genirq and the irqchip arch-level drivers, this change makes quite a bit of sense i think. or am i missing something fundamental? Well, I don't want to see anything like desc_to_nr / nr_to_desc unless the number in question is a virtual number. That is, there is no way we should go that way and keep passing a HW number through request_irq. That would just be a total nightmare for powerpc and sparc at least. What we can do is generalize the powerpc virtual irq scheme though. You can see the implementation in arch/powerpc/kernel/irq.c starting from the definition of irq_alloc_host() though for some stupid reason, I've put all the documentation in include/asm-powerpc/irq.h so you might want to start there. Once the IRQ numbers are virtualized, it becomes easier to slowly migrate things to use irq_desc_t * while still having a virtual number available. Once everything has been migrated, we can then get rid of the virtual numbers completely except maybe for an optional 16-entry array for legacy cruft. Ben. 
Re: [RFC] killing the NR_IRQS arrays.
Rather than having the job of rewriting this code during 2.6, I'd much prefer to get something sorted, even if it is ARM only before 2.6. I believe that there are some common problems with the existing API which have been hinted at over the last few days, such as large NR_IRQS. As such, I think it would be a good idea to try to thrash this issue out and get something which everyone is happy with. Additionally, I've added Alan's reserve then hook idea to the API; I seem to remember there is a case in IDE which needs something like this. You might want to have a look at the powerpc API with its remapping capabilities. It's very nice for handling multiple domain spaces. It might be of some use for you. I like your proposed API, I think that's where we want to go in the long run. Ben.
Re: [RFC] killing the NR_IRQS arrays.
On Sat, 2007-02-17 at 02:37 +0100, Arnd Bergmann wrote: On Friday 16 February 2007 23:37, Benjamin Herrenschmidt wrote: You might want to have a look at the powerpc API with its remapping capabilities. It's very nice for handling multiple domain spaces. It might be of some use for you. I don't consider the powerpc virtual IRQs a solution for the problem. While I believe you did the right thing for powerpc with generalizing this over all its platforms, it really isn't more than a workaround for the problem that we can't deal well with the static irq_desc array. It's not a solution per se, though it contains elements of a solution like the reverse mapping, which I use to map HW numbers to virtual irqs but can trivially adapt to map HW numbers to irq_desc pointers. Among other things, I want to make sure that we don't end up with just putting an irq number in a field of the irq_desc and have half of the drivers peek at it and assume we can convert between irq_desc* and number in arbitrary ways. The HW irq number should be as opaque as possible to the world outside of the PIC code and/or arch code that assigns them. That's an area where the powerpc and/or sparc code might be of use. When that problem is now getting worse on other architectures, we should try to get it right on all of them, rather than spreading the workaround further. Yes, but I'd like aspects of my remapping work to be included in whatever we come up with, which is to have the new irq_desc either hide the underlying HW number, or at least make it very clear that it's an opaque token and not guaranteed to be unique across multiple PICs in the system. In addition, if we remove the numbers, archs will need basically the exact same services provided by the powerpc irq core for reverse mapping (going from a HW irq number on a given PIC back to an irq_desc *). Either using a linear array for simple PICs or a radix tree for platforms with very big interrupt numbers (BTW. 
I think we have lockless radix trees nowadays, so I can remove the spinlocks that protect it in the powerpc remapper). Ben.
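The reverse mapping mentioned above, going from a HW irq number on a given PIC back to a descriptor pointer, might look like this for the simple linear-array case. It is a sketch under assumptions: the sketch_* types and functions are hypothetical, locking is omitted, and the radix tree variant for very large HW numbers is not shown.

```c
#include <stddef.h>
#include <stdlib.h>

struct sketch_irq_desc { const char *name; };

/* One PIC owns one HW number space and one reverse map. */
struct sketch_pic {
    size_t nr_hwirqs;                 /* size of the HW number space */
    struct sketch_irq_desc **revmap;  /* indexed by HW number */
};

/* Allocate the linear reverse map with every slot unmapped. */
static int sketch_pic_init(struct sketch_pic *pic, size_t nr_hwirqs)
{
    pic->nr_hwirqs = 0;
    pic->revmap = malloc(nr_hwirqs * sizeof(*pic->revmap));
    if (pic->revmap == NULL)
        return -1;
    for (size_t i = 0; i < nr_hwirqs; i++)
        pic->revmap[i] = NULL;
    pic->nr_hwirqs = nr_hwirqs;
    return 0;
}

/* Bind a HW number, local to this PIC, to a descriptor. */
static int sketch_pic_bind(struct sketch_pic *pic, unsigned long hwirq,
                           struct sketch_irq_desc *desc)
{
    if (hwirq >= pic->nr_hwirqs)
        return -1;
    pic->revmap[hwirq] = desc;
    return 0;
}

/* Interrupt path: HW number on this PIC back to its descriptor. */
static struct sketch_irq_desc *sketch_pic_lookup(const struct sketch_pic *pic,
                                                 unsigned long hwirq)
{
    if (hwirq >= pic->nr_hwirqs)
        return NULL;
    return pic->revmap[hwirq];
}
```

Because the map hangs off the PIC, the same HW number on two controllers resolves to two different descriptors, and nothing outside the PIC code ever needs to interpret the number.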
Re: [RFC] killing the NR_IRQS arrays.
No. I don't think we should make your irq_hwnumber_t thingy general because it is not general. I don't understand why you need it to be an unsigned long, that still puzzles me. But for the rest it actually appears that ppc has a simpler model to deal with. I think you might have misunderstood because I do believe it's actually very general :-) Let me explain below. I don't think I actually can describe x86 hardware in your hwnumber_t world. Although I can approximate. And I think it fits well... In non-legacy mode at the top of the tree I have a network of cooperating irq controllers. For each cpu there is an lapic next to each cpu that catches interrupt packets and below that I have interrupt controllers that throw interrupt packets. In the network of cooperating interrupt controllers an interrupt packet has a destination address that looks like (cpu#, vector#) where cpu# is currently at 8 bits and slowly growing and the vector# is a fixed 8 bits. The interrupt controllers that throw those packets have a fixed number of irq slots usually 24 or so. Each slot (referred to in the code as a pin) can be programmed which (cpu#, vector#) packet it throws when an interrupt occurs. Including an option to vary the cpu# between a set of cpus. So to be frank, to handle this model properly I need: #define NR_IRQS (NR_CPUS*256) There is enough flexibility in this model that hardware vectors have not found a need to cascade interrupt controllers. This is roughly similar to the cell toplevel model where interrupt messages encode the source unit/node, target and class. The chip has an interrupt controller (receiver of those messages) for each thread. In the kernel, I use a flat model, that is I create one host for all of them and my hardware numbers are made of a similar bit encoding of those routing infos. 
That is, with a remapping model like mine, the x86 non-legacy situation could be easily expressed by having one domain (I call them hosts in the code) covering the whole fabric and the hw number be your (CPU << 16) | vector thing. In addition, but you don't need that on x86, cell has an external controller cascaded on one of those interrupts, I use a separate domain for it. The reason my hwnumber thingy is a generic type is that I provide generic functions to create a linux interrupt for a domain/number pair and a generic mechanism to do the reverse mapping. That's where I think my code might be of some use as with the numbers going away, pretty much everybody will need a way to reverse map from HW numbers back to irq_desc *. I use an unsigned long because I needed to choose a type that would fit the biggest number potentially used by an interrupt controller, and that can be real big with some hypervisors for which those are tokens which are potentially 64 bits. Ben I have no problem with a number that is specific to an irq controller for dealing with the internal irq controller implementations, heck I think everyone has that to some degree. The linux irq number will remain an arbitrary software number for use by the linux system for talking about the source of the interrupt. So you do intend to keep the linux number which is what I call the virtual interrupt number on powerpc... I wouldn't have thought that to be necessary except as a special case of an array of 16 entries for ISA interrupts... Why in a sparse address space you would find it hard to allocate a range of numbers to an irq controller that only has a fixed number of irqs it can deal with is something I don't understand and I think it does a disservice to your users. But that is all it is, a quality of implementation issue. ia64 does the same foolish thing. 
It would be fairly easy to change my powerpc code to pre-allocate a full range for a given domain/pic when initializing it instead of doing lazy scattered allocation like I do, though it won't bring much I think. It's not possible for all PICs though, for example, the pSeries needs to use the radix tree reverse mapper because of how large HW interrupt numbers can be. I chose not to do it. In the long run, the only remotely meaningful way to expose interrupts to users would be to -add- columns to /proc/interrupts that provide the host and the HW number on that host, though I'm not sure that wouldn't break some userland tools. The only time it really makes sense to me to let the irq number vary arbitrarily is when things are truly dynamic, like with MSI, a hypervisor, or hot plug interrupt controllers. I don't understand why you would go to all that length to replace irq numbers with irq_desc * and ... keep the numbers :-) But again, as I said, this is in no way a fundamental limitation of the powerpc code. It could be modified easily to allocate the whole range of a given PIC that uses the linear remapping. It makes no sense for PICs that use the radix tree remapping though. Sure, and I have the same issue with
Re: [RFC] killing the NR_IRQS arrays.
On Sat, 2007-02-17 at 02:06 -0700, Eric W. Biederman wrote: Benjamin Herrenschmidt [EMAIL PROTECTED] writes: In addition, if we remove the numbers, archs will need basically the exact same services provided by the powerpc irq core for reverse mapping (going from a HW irq number on a given PIC back to an irq_desc *). Ben you seem to be under the misapprehension that except for the case of ISA (0-16) the linux IRQ number is a hardware number. It is an arbitrary software enumeration, and I think it has been that way a very long time. Did you actually mean is not a hardware number ? If not, then I don't understand your sentence... I can only tell you that my impression of this last is that all the world's not a PPC. Yeah and my grandmother is not the pope, thank you. However, PowerPC is a good example because it has such a diversity of very different hardware setups to deal with, ranging from the multiple layers of cascading controllers all over the place, to interrupt packets encoding vector/target etc... a bit like x86 on cell, to hypervisors providing a single giant number space etc etc etc... Thus, it is extremely likely that something that works well for PowerPC (or for ARM for that matter as it's probably as colorful an environment as PowerPC is) will end up being useful for others. I have a version of the x86 code with a partial conversion done and I didn't need a reverse mapping. What you call the hardware interrupt number never happens to be interesting to me after the system is setup. Because you have the ability to tell your PIC to give you your linux interrupt number when actually sending the interrupt to the processor ? You need a way to get to the irq_desc * when getting an IRQ, either you have a way to map HW numbers back to irq_desc * in software, or your HW allows you to do it. I do suspect there may be an interesting chunk of your ppc work that probably makes sense as a library so other arches could use it. 
Guess what, one of the options of my code is to not instantiate a remapper... for archs where it's not necessary. (We have the case for example of iSeries whose hypervisor can return us the number we want for an arbitrary interrupt). Now, I'm not saying we should take the PowerPC code and say hey, here's the new generic code. I'm saying that if we're going to change the IRQ stuff that deeply, it would be nice if we looked into some of that stuff I've done that I believe would be of use for other archs (though you seem to imply that it would be of no use on x86, good, still...). I found it overall very useful to have a generic remapping core and have cascaded PIC setups have a numbering domain local to a given PIC (pretty much, a domain != an irq_chip) and I'm convinced it would make life easier for archs with similar setups. The remapping core also shows its usefulness on archs with very big interrupt numbers, like sparc or pSeries ppc, and possibly others. Now, I -do- have a problem with one aspect of your proposed design which is to keep the linux interrupt number in the generic irq_desc, which I think defeats most of the purpose of moving away from those linux irq numbers. If you do so, then I'll have to keep a separate remapping layer and keep a mechanism for virtualizing linux numbers. Ben.
Re: [RFC] killing the NR_IRQS arrays.
#define NO_IRQ architecture-defined-int-constant When did you need a magic constant NO_IRQ in generic code? One of the reasons I want to convert the drivers is so we can kill the NO_IRQ nonsense. As for struct irq instead of struct irq_desc, I really don't care, although the C++ camp has not yet weighed in and mentioned how that creates a namespace conflict for them. Yeah, NO_IRQ would be NULL here... What I do on the powerpc code is since IRQ HW numbers are defined locally to a domain/PIC, when creating a new domain, the PIC code passes a value to use as an illegal value in that domain. It's not exposed outside of the core though, it's really only used to initialize the remapping table with something before any interrupt on that PIC has been mapped. We might need this. But I don't think we need reference counting in the traditional sense. For all practical purposes we already have dynamic irq allocation and it hasn't proven necessary. I would prefer to go to lengths to avoid having to expose that kind of an issue to driver code. I think we do need proper refcounting, but I also think that most drivers will not need to see it. For example, a PCI driver will most probably just do something along the lines of the existing request_irq(pdev->irq), the lifetime of pdev->irq is managed by the PCI core. Same goes with MSIs imho, the MSI core can manage the lifetime transparently. Ben.
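To illustrate the last point, here is a minimal sketch of refcounting that the driver never sees: a bus core takes the reference when the device appears and drops it on removal, and the driver just uses the pointer stored in the device. All sketch_* names are hypothetical, the counter is a plain integer (a kernel version would need an atomic one), and none of this is an existing kernel interface.

```c
#include <stdlib.h>

/* Hypothetical refcounted interrupt descriptor. */
struct sketch_irq { unsigned int refs; };

/* Allocate a descriptor holding one reference for the caller. */
static struct sketch_irq *sketch_irq_alloc(void)
{
    struct sketch_irq *irq = calloc(1, sizeof(*irq));

    if (irq != NULL)
        irq->refs = 1;
    return irq;
}

/* Take an extra reference. */
static struct sketch_irq *sketch_irq_get(struct sketch_irq *irq)
{
    irq->refs++;
    return irq;
}

/* Drop a reference; returns 1 if the descriptor was freed. */
static int sketch_irq_put(struct sketch_irq *irq)
{
    if (--irq->refs == 0) {
        free(irq);
        return 1;
    }
    return 0;
}

/* The bus core, not the driver, owns the reference. */
struct sketch_bus_dev { struct sketch_irq *irq; };

static int sketch_bus_add(struct sketch_bus_dev *dev)
{
    dev->irq = sketch_irq_alloc();
    return dev->irq != NULL ? 0 : -1;
}

static int sketch_bus_remove(struct sketch_bus_dev *dev)
{
    int freed = sketch_irq_put(dev->irq);

    dev->irq = NULL;
    return freed;
}

/* Driver side: borrow the pointer, no get/put in driver code. */
static struct sketch_irq *sketch_driver_request(struct sketch_bus_dev *dev)
{
    return dev->irq;
}
```

Only code that keeps the descriptor beyond the device's lifetime would call sketch_irq_get() itself; an ordinary driver stays on the borrow path.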
Re: [PATCH 2.6.21-rc1] powerpc: Make of_device_uevent() compatible with ibmebus
On Sat, 2007-02-17 at 17:28 +0100, Hoang-Nam Nguyen wrote: ibmebus has a fake root device that's not associated with an ofdt node. Filter out any such devices in of_device_uevent(). Doh ! You are creating an of_device with no attached device-node ? That is totally evil ! Why do you need that ? Ben. Signed-off-by: Joachim Fenkes [EMAIL PROTECTED] --- of_device.c | 4 1 files changed, 4 insertions(+) diff -urp a/arch/powerpc/kernel/of_device.c b/arch/powerpc/kernel/of_device.c --- a/arch/powerpc/kernel/of_device.c 2007-02-17 16:36:32.116368480 +0100 +++ b/arch/powerpc/kernel/of_device.c 2007-02-17 16:44:01.319366352 +0100 @@ -180,6 +180,10 @@ int of_device_uevent(struct device *dev, ofdev = to_of_device(dev); + /* e.g. ibmebus has a fake root device w/o ofdt node -- filter that */ + if (!ofdev->node) + return -ENODEV; + if (add_uevent_var(envp, num_envp, &i, buffer, buffer_size, &length, "OF_NAME=%s", ofdev->node->name)) ___ Linuxppc-dev mailing list [EMAIL PROTECTED] https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 2.6.21-rc1] powerpc: Make of_device_uevent() compatible with ibmebus
On Sat, 2007-02-17 at 19:21 -0500, Joachim Fenkes wrote: Benjamin Herrenschmidt [EMAIL PROTECTED] wrote on 17.02.2007 16:56:39: On Sat, 2007-02-17 at 17:28 +0100, Hoang-Nam Nguyen wrote: ibmebus has a fake root device that's not associated with an ofdt node. Filter out any such devices in of_device_uevent(). Doh ! You are creating an of_device with no attached device-node ? That is totally evil ! Why do you need that ? The driver creates a fake ibmebus device so all ibmebus based devices have a common parent device -- the vio bus does the same. What do you think about linking this device to the device tree / node? All ibmebus-based devices are linked to dt nodes residing directly beneath /, so the mapping would fit. No. If you do that, it shouldn't be an of_device based device. If you want them to be below a common parent, then create that parent of a basic struct device type, that sort of thing. You should never instantiate an of_device that has a NULL device node. vio is different since it's not a subclass of of_device though I tend to also disagree with the way it does things. It's a generic problem with sysfs, I agree it somewhat sucks. Ben.