Re: 2.4.0-test8-pre1 is quite bad / how about integrating Rik's VMnow?

2000-09-03 Thread Benjamin Herrenschmidt

Hi !

I've read the discussion about the truncate() problem and tried to
understand ;) However, there's something I don't get in your code (typo ?
bug ? misunderstanding on my side ?)

Linus wrote:

There's a really simple way to avoid this: compare the thing you're going
to zero out against zero before you memset() it to zero. If it was already
zero, you just unlock the page and release.

Your code does:

+  kaddr = (char*)kmap(page);
+  err = 0;
+  if (!mem_is_zero(kaddr+offset, length))
+  goto unmap;
+  memset(kaddr+offset, 0, length);
+  flush_dcache_page(page);
+  __block_commit_write(inode, page, offset, offset+length);
+unmap:
+  kunmap(page);

Which seems to be the opposite of what Linus says: you memset() the
page when it's _already_ zero and exit when it's not.
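
To make the intended logic concrete, here is a minimal userspace sketch of the check as Linus describes it. The helper below is a hypothetical stand-in for the patch's mem_is_zero(), and zero_range_if_needed() is an invented name; the kmap/flush/commit steps of the real patch are left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the patch's mem_is_zero(): returns 1 when all
 * `len` bytes at `p` are zero, 0 otherwise. */
static int mem_is_zero(const char *p, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}

/* Zero buf[offset, offset+length) only when needed. Returns 1 if the range
 * was modified (the caller would then flush and commit the page), 0 if it
 * was already zero and nothing has to be written back. */
static int zero_range_if_needed(char *buf, size_t offset, size_t length)
{
    if (mem_is_zero(buf + offset, length))
        return 0;                       /* already zero: skip the write */
    memset(buf + offset, 0, length);
    return 1;                           /* range changed: commit needed */
}
```

The only difference from the quoted hunk is the missing negation: the early exit is taken when the range is already zero, not when it contains data.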

Ben.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: who maintains linux 4 the powerpc

2000-09-08 Thread Benjamin Herrenschmidt

Can somebody please tell me, who is currently maintaining
arch/ppc?

The link

http://www.ppc.kernel.org/

in the MAINTAINERS file is dead.

Cort Dougan ([EMAIL PROTECTED]) and Paul Mackerras ([EMAIL PROTECTED])

There's also a SourceForge site recently created to gather pending
patches and bug reports at www.sourceforge.net/projects/ppclinux

Ben.




Re: [DOC] Debugging early kernel hangs

2000-09-22 Thread Benjamin Herrenschmidt

Hmm, good idea, but how does this work on, say, non-x86 architectures
which don't have a VGA text frame buffer, or whose VGA text frame buffer
is not mapped in, or whose VGA text frame buffer is not initialised.

You will still end up with those "my kernel hangs during boot" messages.

A lot of the problems with debugging early kernel hangs is that you
don't have a display set up, or you don't have enough of the memory
subsystem initialised (eg, before pci_init) to be able to access
devices (eg, before paging_init).

 [.../...]

I've implemented a similar mechanism on the PPC 2.2.x kernel. It's not in
the main tree since it requires a couple of lines of change to printk.c
in order to correctly handle the removal of the last console.

I set up a "struct console" during very early boot (almost at the
firmware level) that can display text on screen (using the firmware
pre-inited fb) with a very basic engine, and is set up by default as
the printk console.

Then, in the main VT code, I unregister this boot console just before
registering the VT one.

It's a bit hackish and so is not meant to be merged in the main tree, but
it's useful when I release test kernels for new Apple hardware, to have
printk work from the very beginning of boot.

I wanted to clean it up, but I didn't figure out a way to make this work
without slightly hacking printk.c and vt.c (mostly for correctly handling
the takeover of the boot console by the VT subsystem). It could have been
simpler if I had implemented a struct consw instead of the (simpler) struct
console, but the resulting code would have been way too bloated (mostly
re-implementing an fb-based console ;)
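
As a rough illustration of the takeover described above, here is a userspace model of a console list. Everything in it (struct model_console, model_printk, the byte counters) is hypothetical and much simpler than the kernel's struct console and printk; it only shows the ordering: boot console first, then unregister it just before the VT console registers.

```c
#include <stddef.h>
#include <string.h>

/* Simplified model of a printk console; the real struct has more fields. */
struct model_console {
    const char *name;
    void (*write)(struct model_console *con, const char *s, size_t len);
    struct model_console *next;
};

static struct model_console *model_console_list;

static void model_register_console(struct model_console *con)
{
    con->next = model_console_list;
    model_console_list = con;
}

static int model_unregister_console(struct model_console *con)
{
    struct model_console **p;

    for (p = &model_console_list; *p != NULL; p = &(*p)->next) {
        if (*p == con) {
            *p = con->next;     /* unlink; the list may become empty */
            con->next = NULL;
            return 0;
        }
    }
    return -1;                  /* console was not registered */
}

/* Send a message to every registered console. */
static void model_printk(const char *s)
{
    struct model_console *con;

    for (con = model_console_list; con != NULL; con = con->next)
        con->write(con, s, strlen(s));
}

static size_t boot_bytes, vt_bytes;     /* bytes seen by each console */

static void boot_write(struct model_console *con, const char *s, size_t len)
{
    (void)con; (void)s;
    boot_bytes += len;          /* real code would draw on the firmware fb */
}

static void vt_write(struct model_console *con, const char *s, size_t len)
{
    (void)con; (void)s;
    vt_bytes += len;
}

static struct model_console boot_console = { "boot", boot_write, NULL };
static struct model_console vt_console = { "vt", vt_write, NULL };

/* The takeover step: drop the boot console just before the VT registers. */
static void model_vt_takeover(void)
{
    model_unregister_console(&boot_console);
    model_register_console(&vt_console);
}
```

Between the two calls in model_vt_takeover() the list is briefly empty, which is the "removal of the last console" case the printk.c change has to cope with.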

Ben.



Re: [DOC] Debugging early kernel hangs

2000-09-23 Thread Benjamin Herrenschmidt

2.4 and 2.2 PPC have progress() for writing progress messages to the
screen.  They're setup in a per-board very early in the boot so we can see
what's going on as soon as the MMU is turned on and lets us get around.

Ben, can you just make your changes talk through that?  I used to use it
with BootX to write out info while setting up the early bootinfo stuff.

The progress() stuff works fine on 2.4. (I've not checked with 2.2.x lately). 

However, there's still a huge gap between the last progress() message and
availability of the frame buffer device. The simple console has the
advantage of outputting existing printk messages (basically, it's a
console using prom_printf).

Well, I believe I'll just get rid of this debug console to ease merging
of my pile of 2.2.x changes. It turned out that I never had a single crash
happen during this time, except when working on new HW, but then, I can
just add a prom_printf to printk() directly.

Ben.



removal of include/linux/openpic.h

2000-09-26 Thread Benjamin Herrenschmidt

Hi !

Is any arch other than PPC using include/linux/openpic.h ?

I'm doing some cleanup work on various parts of the PPC arch, and it's
now time for the openpic driver to suffer. That file exports all the
functions & data structures of the driver to everybody, which is wrong
given the way the driver is evolving (at least on PPC). However, our
driver is in arch/ppc/kernel/open_pic.c.
So I'm considering moving the few exported symbols to
arch/ppc/kernel/open_pic.h (or include/asm/open_pic.h, but I don't think
it's needed at all there) and killing include/linux/openpic.h completely.

Ben.



Re: __bad_udelay in 2.2.18pre15

2000-10-11 Thread Benjamin Herrenschmidt

 2.2.18pre15 defines udelay as (in file include/asm-i386/delay.h) : 
 extern void __bad_udelay(void);
 
 #define udelay(n) (__builtin_constant_p(n) ? \
 ((n) > 20000 ? __bad_udelay() : __const_udelay((n) * 0x10c6ul)) : \
 __udelay(n))
 
 ... 
 It seems __bad_udelay is not defined anywhere in the kernel source. 

Correct. Its a compile time error trap

Well, at first, I wanted to implement it the same way on PPC. However, it
dies on all occurrences where udelay is called with a non-constant expression.

I spotted this case in a few PPC specific stuffs (fixable), but also in
the sys_nanosleep code, and in the de4x5 driver.

Ben.





Re: __bad_udelay in 2.2.18pre15

2000-10-11 Thread Benjamin Herrenschmidt

Well, at first, I wanted to implement it the same way on PPC. However, it
dies on all occurrences where udelay is called with a non-constant expression.

I spotted this case in a few PPC specific stuffs (fixable), but also in
the sys_nanosleep code, and in the de4x5 driver.

Hrm... looks like I missed the story about the __builtin_constant_p(). Is
this a gcc-specific built-in feature ?

Ben.




Re: __bad_udelay in 2.2.18pre15

2000-10-11 Thread Benjamin Herrenschmidt

 
 Well, at first, I wanted to implement it the same way on PPC. However, it
 dies on all occurrences where udelay is called with a non-constant
 expression.

__builtin_constant_p means non constant expressions will always call udelay
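
For readers who, like me, missed how the trap works, here is a userspace sketch of the dispatch. The model_* names and the MODEL_UDELAY_MAX limit are assumptions for the illustration. In the kernel, __bad_udelay() is declared but deliberately never defined, so an over-limit constant fails at link time; here the trap is defined so the sketch links and the chosen path can be observed.

```c
/* Which branch of the macro was taken last. */
enum delay_path { PATH_NONE, PATH_CONST, PATH_VARIABLE, PATH_TRAP };
static enum delay_path last_path = PATH_NONE;

/* Stand-ins for __const_udelay(), __udelay() and __bad_udelay(). */
static void model_const_udelay(unsigned long scaled)
{
    (void)scaled;
    last_path = PATH_CONST;
}

static void model_udelay(unsigned long usecs)
{
    (void)usecs;
    last_path = PATH_VARIABLE;
}

static void model_bad_udelay(void)
{
    last_path = PATH_TRAP;      /* the kernel leaves this one undefined */
}

#define MODEL_UDELAY_MAX 20000UL    /* assumed limit for this sketch */

/* Compile-time constants are range-checked; anything the compiler cannot
 * prove constant goes straight to the run-time delay routine. */
#define model_udelay_dispatch(n) (__builtin_constant_p(n) ? \
    ((n) > MODEL_UDELAY_MAX ? model_bad_udelay() \
                            : model_const_udelay((n) * 0x10c6ul)) : \
    model_udelay(n))
```

__builtin_constant_p() is a GCC built-in (also supported by clang), so a call with a run-time value never references the trap and cannot trigger the link error.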

 I spotted this case in a few PPC specific stuffs (fixable), but also in
 the sys_nanosleep code, and in the de4x5 driver.

I'll check these two

Forget about them. It was my non-understanding of __builtin_constant_p()
that was causing me the problem. I fixed a few over-limit udelay's (replacing
them with mdelay) in some PPC specific code. I'll send you some patches
later, I have to extract them from my tree. Well... would you accept a
huge pile of PPC patches for 2.2.18 in this case, I can send you my
current diffs (with a bit of cleanup) ?
Those contain almost only pmac-specific stuffs (support for new machines,
sleep fixes, and a few more fixes here or there).

Ben.




Re: __bad_udelay in 2.2.18pre15

2000-10-11 Thread Benjamin Herrenschmidt


Yep. This is a huge release patch anyway so resynching the stuff is fine. 
What I wont take is stuff touching core code

I do have a two-line patch to the common ide code that fixes a problem when
revalidating a CD-ROM after sleep, but it was ack'ed by Andre Hedrick. I
also have a two-liner to kernel/printk.c to allow takeover of my boot
console by the real vt subsystem (I found no way for a struct consw to
take over a struct console without this patch). I'll make sure those are
separate from the main patch set.

I need some time to polish and slice up all those patches so you don't
get a single diff of several hundred kB; expect something around this
week-end. I'll leave out some fbdev stuff that may cause problems with
other archs.

Ben.



Problem with include/linux/fs.h vs. glibc

2000-10-12 Thread Benjamin Herrenschmidt

Hi !

Sorry if this has already been the cause of a flamewar on the list, but...

I need to compile an app with the 2.4 kernel headers & glibc (our stable
glibc on PPC is based on 2.1.3). However, the compiler is barking at a
change made in the 2.4 version of include/linux/fs.h:

The 2.2.x version didn't include linux/string.h and all was fine.

The 2.4.x version does include linux/string.h and this is causing gcc
to bark because of conflicts with glibc headers (glibc seems to #define
some of the prototypes defined in linux/string.h, causing various parse
errors).

So what is the solution ?

 - removing the #include <linux/string.h> from linux/fs.h ?
 - moving it in a #ifdef __KERNEL__ part of the file ?
 - protect linux/string.h itself with #ifdef __KERNEL__ ?
 - fix glibc ? (how ? I mean, it's legal to include linux/fs.h from userland,
   but linux/string.h is obviously not meant to be exported out of the kernel)
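
To show what the second option would look like, here is a small sketch of a header fragment. The EXAMPLE_* names are made up for the illustration and are not taken from linux/fs.h; the point is only that the kernel-internal include sits behind __KERNEL__, so a userland build never sees it.

```c
/* Sketch of a header that hides kernel-only includes from userland. */
#ifdef __KERNEL__
#include <linux/string.h>       /* only pulled in for kernel builds */
#define EXAMPLE_KERNEL_PARTS_VISIBLE 1
#else
#define EXAMPLE_KERNEL_PARTS_VISIBLE 0
#endif

/* Definitions below this point are the ones userland may see. */
#define EXAMPLE_BLOCK_SIZE_BITS 10
#define EXAMPLE_BLOCK_SIZE (1 << EXAMPLE_BLOCK_SIZE_BITS)
```

Compiled without -D__KERNEL__, the fragment never touches linux/string.h, so it cannot clash with glibc's own string prototypes.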

Regards,
Ben.




Re: Problem with include/linux/fs.h vs. glibc

2000-10-12 Thread Benjamin Herrenschmidt

  - I mean, it's legal to include linux/fs.h from userland,

Everybody who thinks so will be severely disappointed.

Ok, so if it's not, then I have to fix that app. Thanks.

Ben.




Re: dual head r128

2000-10-12 Thread Benjamin Herrenschmidt

Note to linux-kernel readers: This discussion is the Nth attempt to find
a solution to handle both legacy IOs and PCI IOs on machines with several
IO busses memory mapped at different locations in the CPU space.

No please, is there anybody bloat-conscious on this damned list ? Burying
more and more code inside each {in,out}[bwl] is not the solution.

Well, that is pretty small overhead, and probably ridiculous compared to
the overhead of the IO itself. Most fast devices use MMIO anyway.

The problem is that whatever solution someone proposes, there is _always_
somebody to reject it.

Just define a macro ISA_PORT or something like this and update the kernel
to replace all the in/out to fixed ports to do in/out(ISA_PORT(n)). If you
don't do it you'll get a nice panic so you'll find all the places quite
fast.

That basically means making different macros for ISA in/out and PCI
in/out. I've proposed this several times, but it requires changes to the
common code, and all I got when talking about this was flames from x86 people.

Drawbacks ?

The time it takes to make this fit into some x86 people's heads. Also, I need
something that can be ported quickly to 2.4. I'm afraid even if we make
everybody agree to it, it will be delayed to 2.5.

Linus: Would you accept this change now ?

#define ISA_PORT(n) (n)

And change _all_ drivers doing legacy IOs to use that in their in/out
macros ?

I still prefer making separate macros for legacy IOs (isa_in/isa_out) and
for PCI IOs (in/out), or the opposite if you prefer (in/out for isa and
pci_in/pci_out for PCI).
On x86, they would resolve to the same thing, while on our platforms,
they could be handled differently.
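
A rough sketch of what separate accessors could look like, modelled in userspace with two byte arrays standing in for the ioremap()'d windows. The model_* names are hypothetical; only the isa_in/isa_out versus pci in/out split comes from the proposal above.

```c
#include <stdint.h>

/* Two simulated IO windows standing in for ioremap()'d bridge regions. */
static uint8_t model_isa_window[0x10000];
static uint8_t model_pci_window[0x10000];

/* One base per IO space. On x86 both families would resolve to the same
 * port space; here they point at different windows. */
static volatile uint8_t *model_isa_io_base = model_isa_window;
static volatile uint8_t *model_pci_io_base = model_pci_window;

static inline uint8_t model_isa_inb(uint16_t port)
{
    return model_isa_io_base[port];
}

static inline void model_isa_outb(uint8_t val, uint16_t port)
{
    model_isa_io_base[port] = val;
}

static inline uint8_t model_pci_inb(uint16_t port)
{
    return model_pci_io_base[port];
}

static inline void model_pci_outb(uint8_t val, uint16_t port)
{
    model_pci_io_base[port] = val;
}
```

Each accessor stays a single base-plus-offset access, so there is no extra test or add per IO compared with today's single-base in/out.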

PCI I/O resources will have to be kernel virtual, physical is impossible
with PreP if we want to lift the 2Gb user space restriction (PreP I/O is
from 2 to 3 Gb physical and the first thing to do is to reallocate devices
which use it since most firmware use it too liberally, like one device
every ... 256Mb). There are other and better ways to increase user
available virtual space, however. And anyway I don't want any stinking add
in each in/out macro.

Well, in 2.4 we can easily reassign PCI IOs if we configure the bridge
with proper resources. If all goes well, my new PCI code should handle
that fine (should be ready this week-end).

Indeed, this is too awkward (is there no way to redirect only the VGA
part of the legacy I/O space ? That's what the PCI-PCI bridges do, but 
I've not yet used a single machine with AGP so I'm ignorant).

No, most bridges used on macs can't do that. In fact, AFAIK, it's not
possible to access the ISA memory space either on those machines (on
UniN, I can't generate memory cycles at lower address than 0x8000).

My "pet" solution would be to have all legacy drivers request an IO base
this way

  base = isa_get_IO_base(legacy_addr);

The isa_get_IO_base function could then be "tweaked" to recognize known
legacy addresses and return different bases. (There might still be
problems with VGA vs. parallel, I don't know x86 world well enough to be
sure).
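
As a sketch only, the "tweaked" lookup could be a small table of known legacy ranges. The function name below mirrors the proposed isa_get_IO_base but is a model, and the base addresses are made-up values, not those of any real bridge.

```c
#include <stddef.h>

/* A known legacy port range and the IO base it should be routed to. */
struct model_legacy_range {
    unsigned long start;
    unsigned long end;          /* inclusive */
    unsigned long io_base;
};

#define MODEL_DEFAULT_IO_BASE 0xf0000000UL   /* made-up default window */
#define MODEL_AGP_IO_BASE     0xf2000000UL   /* made-up second window */

/* VGA registers live at 0x3b0-0x3bb and 0x3c0-0x3df; the gap at
 * 0x3bc-0x3bf can belong to a parallel port, so it is not redirected. */
static const struct model_legacy_range model_ranges[] = {
    { 0x3b0, 0x3bb, MODEL_AGP_IO_BASE },
    { 0x3c0, 0x3df, MODEL_AGP_IO_BASE },
};

static unsigned long model_isa_get_io_base(unsigned long legacy_addr)
{
    size_t i;

    for (i = 0; i < sizeof(model_ranges) / sizeof(model_ranges[0]); i++)
        if (legacy_addr >= model_ranges[i].start &&
            legacy_addr <= model_ranges[i].end)
            return model_ranges[i].io_base;
    return MODEL_DEFAULT_IO_BASE;   /* everything else: default bus */
}
```

A driver would call this once at probe time and keep the returned base, so the lookup cost is not paid on every access.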

Ben.





The IO problem, ISA vs. PCI

2000-10-12 Thread Benjamin Herrenschmidt

-- RFC822 Header Follows --
From: Benjamin Herrenschmidt [EMAIL PROTECTED]
To: Gabriel Paubert [EMAIL PROTECTED], Linux/PowerPC Devel List
 [EMAIL PROTECTED], Linus Torvalds [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: dual head r128
Date: Thu, 12 Oct 2000 18:58:15 +0200
Message-Id: [EMAIL PROTECTED]
In-Reply-To: [EMAIL PROTECTED]
References: [EMAIL PROTECTED]
X-Mailer: CTM PowerMail 3.0.5 http://www.ctmdev.com
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
---





Re: Any dual AGP slot motherboards?

2000-10-19 Thread Benjamin Herrenschmidt


Apple sells a computer with dual AGP slots. I just was looking for a intel
box like this. Since AGP is a port on the PCI bus it is possible to have
more than one AGP port on a/each PCI bus but this requires the PCI chipset
to support this. 

Well, I don't know of such a Mac. To my knowledge, the only Apple boxes to
have an AGP slot are the ones based on the "Core99" chipset, and they
provide one AGP slot. You won't be lucky with Apple HW anyway as there
are currently issues between the AGP controller and the Linux agpgart
driver that prevent using it. Those issues are tricky and I don't
think a solution will be available soon. (Apple chipset can make the AGP
aperture visible to the CPU AFAIK).

Ben.




The IO problem on multiple PCI busses

2001-03-01 Thread Benjamin Herrenschmidt

Here's the return of an old problem for which we really need a
solution asap since it's now biting us in real life configurations...

So the problem happens when you have a machine with more than one PCI
host bridge. This is typically the case of all new Apple machines as they
have 3 host bridges in one chip (2 of them are relevant: the AGP and the
PCI). I don't think the problem exists on x86 machines with real IO
cycles; at least in that case, the problem is different.

In order to generate IO cycles, the bridge provides us with a region in
CPU physical memory space (a 16Mb region in our case) that translates
accesses to IO cycles on the PCI bus. Our implementation of inb/outb
currently relies on the kernel ioremap'ing one of these regions (the PCI
one) and using the ioremap result as a base (offset) inside the inb/outb
functions.

So that means that the current design won't allow access to IOs located on
any bus but the one we arbitrarily chose (the PCI bus). That's fine in
most cases, until you decide to put a 3dfx or nvidia card in the AGP slot.
Those cards require some IO accesses to be done to the legacy VGA
addresses, and of course, our inb/outb functions can't do that.
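
The limitation can be shown with a small userspace model of the current scheme. The arrays stand in for the two bridges' IO windows and all model_* names are invented for the illustration; the real code uses an ioremap()'d base inside inb/outb.

```c
#include <stdint.h>

/* Simulated IO windows of the two relevant host bridges. */
static uint8_t model_pci_bridge_io[0x10000];
static uint8_t model_agp_bridge_io[0x10000];

/* The single global base, fixed at boot to the PCI bridge's window. */
static volatile uint8_t *model_io_base = model_pci_bridge_io;

static inline void model_outb(uint8_t val, unsigned long port)
{
    model_io_base[port] = val;      /* base + port, nothing else */
}

static inline uint8_t model_inb(unsigned long port)
{
    return model_io_base[port];
}
```

With only one base, a write to a VGA port always lands in the PCI bridge's window; nothing a driver passes to model_outb() can make it reach the AGP bridge's window.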

Obviously, we can hack some driver specific thing that would use the
arch-specific code to retrieve the proper io base address for a given
host bridge, but that's a hack. I'm looking for a solution that would
cleanly apply to all archs that may potentially face this problem.

The problem also potentially exists for any PCI card that has PCI IOs on
anything but the main PCI bus.

One possibility is to limit our IO space to 64k per bus (to avoid
bloating) and then use a hacked ioremap to create a single virtually
contiguous kernel region that appends all those IO spaces together.
Accessing IOs on bus N would just be a matter of calculating an address
of the type 64k*N+offset and doing normal inb/outb on the result. The
arch PCI code could then properly fixup PCI IO resources for PCI drivers,
and we could add a function of the kind

 unsigned long pci_bus_io_offset(int busno);

that would return the offset to add to inb/outb when accessing IOs on the
N'th PCI bus.
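
A minimal sketch of that layout, with one array standing in for the virtually contiguous region built by the hacked ioremap. The bus count and the model_* names are assumptions; model_pci_bus_io_offset() mirrors the proposed pci_bus_io_offset().

```c
#include <stdint.h>

#define MODEL_IO_SPACE_PER_BUS 0x10000UL    /* 64k of IO space per bus */
#define MODEL_NUM_BUSES        3            /* assumed number of buses */

/* All per-bus IO spaces appended into one contiguous region. */
static uint8_t model_io_region[MODEL_NUM_BUSES * MODEL_IO_SPACE_PER_BUS];

/* Offset to add to a port number to reach bus `busno`. */
static unsigned long model_pci_bus_io_offset(int busno)
{
    return (unsigned long)busno * MODEL_IO_SPACE_PER_BUS;
}

/* inb/outb keep their single-base form; only the port value changes. */
static void model_outb(uint8_t val, unsigned long port)
{
    model_io_region[port] = val;
}

static uint8_t model_inb(unsigned long port)
{
    return model_io_region[port];
}
```

For example, VGA port 0x3c0 on bus 1 becomes 0x10000 + 0x3c0 = 0x103c0, while the same port on bus 0 stays at 0x3c0, so unchanged drivers keep hitting the default bus.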

If we want to go a bit further, and allow ISA drivers that don't have a
pci_dev structure to work on legacy devices on any bus, we could provide
a set of functions of the type

 int isa_get_bus_count();
 unsigned long isa_get_bus_io_offset(int busno);

and eventually

 int isa_bus_to_pci_bus(int isa_busno);
 int pci_bus_to_isa_bus(int pci_busno);

if we want to figure out on which PCI bus a given ISA bus is located, if
any (-1 meaning that no mapping exists).
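
Here is one way these four functions could fit together, as a table-driven userspace sketch. The table contents and the model_* names are invented for the example; -1 (MODEL_NO_BUS) marks "no mapping" as described above.

```c
#define MODEL_NO_BUS (-1)

/* One entry per ISA-like IO bus. */
struct model_isa_bus {
    unsigned long io_offset;    /* offset to add to legacy port numbers */
    int pci_busno;              /* PCI bus it sits on, or MODEL_NO_BUS */
};

static const struct model_isa_bus model_isa_buses[] = {
    { 0x00000UL, 0 },               /* legacy space behind PCI bus 0 */
    { 0x10000UL, 1 },               /* legacy space behind the AGP bus */
    { 0x20000UL, MODEL_NO_BUS },    /* ISA-like bus with no PCI parent */
};

static int model_isa_get_bus_count(void)
{
    return (int)(sizeof(model_isa_buses) / sizeof(model_isa_buses[0]));
}

static unsigned long model_isa_get_bus_io_offset(int busno)
{
    if (busno < 0 || busno >= model_isa_get_bus_count())
        return 0;                   /* fall back to the default bus */
    return model_isa_buses[busno].io_offset;
}

static int model_isa_bus_to_pci_bus(int isa_busno)
{
    if (isa_busno < 0 || isa_busno >= model_isa_get_bus_count())
        return MODEL_NO_BUS;
    return model_isa_buses[isa_busno].pci_busno;
}

static int model_pci_bus_to_isa_bus(int pci_busno)
{
    int i;

    if (pci_busno < 0)
        return MODEL_NO_BUS;
    for (i = 0; i < model_isa_get_bus_count(); i++)
        if (model_isa_buses[i].pci_busno == pci_busno)
            return i;
    return MODEL_NO_BUS;
}
```

The third table entry is there to show that an ISA bus number does not have to correspond to any PCI bus at all.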

Of course, the same problem exists for ISA memory (used by legacy VGA
modes). It's not a problem in real life currently since no powermac can
produce PCI cycles in the ISA memory range today, and non-powermac PPC
machines currently don't have needs for video cards on anything but the
main bus, but the potential issue is there, and the need for a solution
may pop up too.

I'm, of course, open to any comments about this (in fact, I'd really like
some feedback). One thing is that we also need to find a way to pass
those infos to userland. Currently, we implement an arch-specific syscall
that allows retrieving the IO physical base of a given PCI bus. That may
be enough, but we may also want something that matches more closely what we
do in the kernel.

Regards,
Ben.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: The IO problem on multiple PCI busses

2001-03-01 Thread Benjamin Herrenschmidt


If we want to go a bit further, and allow ISA drivers that don't have a
pci_dev structure to work on legacy devices on any bus, we could provide
a set of functions of the type

 int isa_get_bus_count();
 unsigned long isa_get_bus_io_offset(int busno);

I would add that I'd prefer to keep it separate from the PCI layer, in
the sense that it can also help handle 16-bit ISA-like IO busses on
embedded hardware which may (will most of the time) not have anything
like a PCI bus. Having the ability to map PCI-ISA bus numbers should be
an option.

Ben.




Re: The IO problem on multiple PCI busses

2001-03-01 Thread Benjamin Herrenschmidt

As a side note, Alpha has a special PCI syscall to get the "PCI
controller number" a given PCI device is behind.  We could add
another ioctl number which does the same thing on /proc/bus/pci/*/*
nodes.  This way sparc64 and Alpha could have the same user visible
API for this as well.

And on PPC too since I adapted the pci controller mechanism to
it in 2.4.

In fact, all that is done by our various syscalls could be done by
ioctl's on /proc/bus/pci/*/*.

To be generic, the pci controller number should rather be the pci bus
number of the host bridge (the top of the PCI tree a given device lives
on). The internal controller numbers have no real meaning I think to
userland.

Also, an ioctl to retrieve the iobase would be useful too (in addition
to the mmap), especially for getting access to VGA IOs associated with a
given PCI card, but also for whatever test tool one would want to write
in userland that accesses legacy IOs on a given PCI bus.
Having the mmap is fine, but I'd also like the ability to retrieve
all the information via an ioctl.

I believe that if we can agree on the in-kernel format of the PCI
controller structure and a function to retrieve it from a bus number, we
can make this generic.

For us, the pci controller requires at least an iobase (physical &
virtual as we always ioremap the IO space during boot) for generating
io cycles, the config ops, the mem offset (some platforms don't have
a 1:1 mapping of memory cycles vs. CPU bus cycles for PCI memory, for
example, on PReP, you write to physical c000 to get a PCI memory
write to ). And finally the isa memory base (it may be located
differently, some bridges have 1:1 mappings and so allow only high
memory addresses to go to the PCI, but do open a "window" at a different
physical address to generate ISA memory cycles (low address cycles)).

Finally, we have some private data (pointer to OF node for example),
the resource structures (so that we know what a given host bridge can
decode and can allocate unallocated PCI resources properly).
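
Put as a structure, the fields listed above could look roughly like this. It is a sketch for discussion, not the actual PPC definition: the model_* names, the field names and the lookup helper are all assumptions.

```c
#include <stddef.h>

struct model_pci_ops;                   /* config-space accessors, opaque */

struct model_resource {
    unsigned long start;
    unsigned long end;
};

struct model_pci_controller {
    int first_busno;                    /* bus number of the host bridge */
    int last_busno;                     /* highest bus number behind it */
    unsigned long io_base_phys;         /* CPU physical address of IO window */
    void *io_base_virt;                 /* same window after ioremap() */
    const struct model_pci_ops *ops;    /* config ops */
    unsigned long pci_mem_offset;       /* CPU address minus PCI mem address */
    unsigned long isa_mem_base;         /* window for low (ISA) memory cycles */
    void *private_data;                 /* e.g. pointer to the OF node */
    struct model_resource io_resource;  /* what the bridge can decode */
    struct model_resource mem_resource;
};

/* Find the controller whose bus range contains `busno`, or NULL. */
static struct model_pci_controller *
model_pci_bus_to_controller(struct model_pci_controller *hoses,
                            size_t count, int busno)
{
    size_t i;

    for (i = 0; i < count; i++)
        if (busno >= hoses[i].first_busno && busno <= hoses[i].last_busno)
            return &hoses[i];
    return NULL;
}
```

Keying the lookup on bus numbers rather than an internal controller index matches the point above that the host bridge's bus number is what means something to userland.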

I'm not familiar with the requirements of other archs however.

Ben.



Re: The IO problem on multiple PCI busses

2001-03-01 Thread Benjamin Herrenschmidt

  I'm, of course open to any comments about this (in fact, I'd really like
  some feedback). One thing is that we also need to find a way to pass
  those infos to userland. Currently, we implement an arch-specific syscall
  that allows retrieving the IO physical base of a given PCI bus. That may
  be enough, but we may also want something that matches more closely what we
  do in the kernel.

Same problem on sparc64.  Using a special PCI syscall is fine, _if_ we
all end up using the same one.  However, I would prefer another
mechanism...

Right, I remember we discussed this some months ago. Currently, we have
a syscall that is slightly different from the sparc/alpha ones but very
similar.

I think a cleaner scheme is to allow mmap() on
/proc/bus/pci/${BUS}/${DEVICE} nodes, that is much cleaner and solves
transparently any "different word size between userland and kernel"
issues (specifically 32-bit userlands executing on 64-bit kernels).

I played around with something akin to this, and some of the necessary
Xfree86-4.0.x hackery needed, some time ago.  But I never finished
this.

I do agree with you on this. I didn't have time to really work on it so
far, I remember you posted a test patch but I was busy at that time with
other PCI issues we had with multiple bus systems.

Note that this is only the userland side of the story. For now, I'm more
concerned about finding a good solution to the kernel side.

Also, the problem of finding where the legacy ISA IOs of a given PCI bus
are is a bit different than simply mmap'ing a BAR. Some video cards
require some access to their VGA IOs without having a BAR covering them,
in some cases it's necessary to switch the chip from VGA to MMIO mode.

I've looked at the parisc code (thanks Alan for pointing that out), and
it seems they implement all inb/outb as quite big functions that decipher
the address, retrieve the bus, and do the proper IO call. Unfortunately,
that's a bit bloated, and I don't think I'll ever get other PPC
maintainers to agree with such a mechanism (everybody seems to be quite
concerned with IO speed, I admit including me).

Also, that wouldn't really help the case of legacy drivers or video
drivers using legacy addresses for VGA. In all cases, whatever solution
we end up having, those will have to be adapted. What I'd like is a
smooth path that allows unchanged drivers to still work with the default
bus, while drivers can be adapted with minimal changes (mostly
ending up storing an io base and creating a virtual "ISA bus number").

That way, an ISA-like (legacy IO bus) can be mapped to either a PCI bus,
or whatever. Maybe "ISA" is not a proper word for it, it could be
"basic_io_bus" maybe.

Alan also pointed out that there may be similar issues with MMIOs. In
fact, as long as we are working with PCI devices, we can easily get
things fixed up by munging the resource structures at fixup time. There
_is_ however a similar issue with legacy ISA memory, especially since
some platforms simply cannot let you access it.

Looking at those in more detail (other archs), it appears that the
problem happens on most non-x86 archs and is handled differently for each
of them, when it's handled at all.

So what would be a preferred way ? Create that fake ISA bus number and
provide functions for looking them up, getting their IO and mem bases,
and eventually mapping PCI busses to ISA busses ? Or does someone have a
better idea ? The goal is to try not to change the semantics of inb/outb
and friends so that most legacy drivers can still work using the
"default" IO bus if they are not upgraded to the new scheme.

Thanks for your feedback,

Regards, 
Ben.



Re: The IO problem on multiple PCI busses

2001-03-02 Thread Benjamin Herrenschmidt

I do not want an interface where the user still has to do
grotty stuff like mmap() on /dev/{mem,kmem}, this was the
core of the problem I had with the syscall idea, don't bring
it back.

Make mmap()'s on a PCI--ISA bridge do something special, for
example.

The user doesn't need to know anything about physical addressing of
the machine, it all can and should be abstracted away.  This is why I
really detest the XFree86 PCI bus probing layer, it should not need to
poke around at so much of the config space information of devices :-(

It is the reason why, at least still today in Xfree86 CVS, it simply
cannot cope with multiple PCI controllers in a machine because it
assumes a flat MEM/IO space.  They know about the problem and are
working on fixes, but my point is that making this overly knowledgable
PCI prober in the first place is what created these problems.

Ok, I see your point and I agree.

There is still the need, in the ioctl we use to "select" what needs to be
mapped by the next mmap, to ask for the "legacy IO range of the bus where
the card resides" (if it exists of course). That would be the 0-64k (or less,
actually a couple of pages would probably be enough) that generates IO cycles
in the "low" addresses used for VGA registers on the card.

Ben.




Re: IO issues vs. multiple busses

2001-03-03 Thread Benjamin Herrenschmidt

Here are my comments directly responding to your mail.

Hi ! Thanks for taking the time to respond in detail.

Large systems have problems with I/O port space and legacy devices.
There just isn't enough I/O port space to support large configs
and ISA aliasing and all the other crud. That's why Intel is (a)
ditching all the legacy crap in IA64 and (b) strongly encouraging
people to use MMIO space on PCI.

Right. We need to discourage use of IOs, definitely ;)
I now tend to think that we shouldn't bother building a whole
architecture to handle those IO problems, but rather do the simplest
possible thing that fits our needs (still for in-kernel matters).

If you only support one type of bridge, you could avoid the indirect
function call (which parisc-linux uses) and encode the access method
directly in the inb/outb macros.

We do that now, and support IOs on one bridge only. However, some PCI
cards still require IO access and we do have several busses, so...
The reason why I'm bringing this problem up in public (again ?)
is that we are now faced with people who want to put video cards on both
AGP and PCI busses, those cards requiring accesses to some legacy VGA IOs
on each of their busses.

Just note that processor speed is so much faster (and getting faster)
than the ISA bus (and PCI-1X bus), that CPU overhead is mostly
irrelevant to the cost of accessing IO port space. On older x86
boxes it is relevant.

Right. That's my opinion too. But it's difficult to make everybody agree
on it ;) Even the simple mechanism Paul Mackerras did so that IOs to
non-existent devices don't kill the kernel (very small overhead) caused
some barking ;)

parisc-linux has solved exactly that problem.

I have to look at it in more detail. It's my understanding that you use the
high bits of the IO address to store the HBA number and then use that to call
the proper access functions. That would solve the PCI IO problem (PCI cards
requiring IOs to BAR-mapped regions), but I don't see how it can fix the
problem of a card accessing legacy VGA addresses, unless you hand-fixed
the video drivers to fill those high bits before doing IOs.

If I understand things correctly, that means that each card, instead of
accessing the legacy VGA port 0xpp, would instead access 0x00bb00pp
(or whatever mangling you use to stuff the HBA number).
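To make the 0x00bb00pp idea concrete, here is a minimal sketch of such an address mangling. The bit layout (HBA number in bits 16-23, a 16-bit port below it) and the io_addr_* helper names are assumptions made for illustration; this is not the actual parisc-linux encoding.

```c
#include <stdint.h>

/*
 * Hypothetical layout, for illustration only: bits 16-23 carry the host
 * bridge (HBA) number, bits 0-15 carry the legacy port number.
 */
#define IO_HBA_SHIFT  16
#define IO_HBA_MASK   0xffu
#define IO_PORT_MASK  0xffffu

/* Build a "cooked" IO address from an HBA number and a legacy port. */
static inline uint32_t io_addr_encode(unsigned int hba, uint16_t port)
{
    return ((uint32_t)(hba & IO_HBA_MASK) << IO_HBA_SHIFT) | port;
}

/* Recover the HBA number, as inb/outb would do to pick the right bus. */
static inline unsigned int io_addr_hba(uint32_t addr)
{
    return (addr >> IO_HBA_SHIFT) & IO_HBA_MASK;
}

/* Recover the port number to put on that bus. */
static inline uint16_t io_addr_port(uint32_t addr)
{
    return (uint16_t)(addr & IO_PORT_MASK);
}
```

With this layout, a driver for a card behind HBA 2 would pass io_addr_encode(2, 0x3c0) to outb instead of plain 0x3c0, and HBA 0 keeps the unmodified legacy port numbers.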

From the driver point of view, it's exactly the same as passing an "offset"
that would be added to the legacy address. So both methods (the one I describe,
which would fit well for us, and yours) can end up with the same driver-side
API, which is to get an "IO base" for the bus a given card resides on.

The question is then to decide if all ISA busses are on a matching PCI bus,
in which case a simple unsigned pci_get_bus_io_base(int bus_no) -like function
would be enough, or if we want a scheme that supports other ISA-like busses ?

We could eventually decide to support only PCI, and additionally declare a
fake PCI bus for an ISA bus not matched to a PCI bus, whose config ops would
return no device in any slot.

Do we agree on this ? 

I don't believe such a solution exists which is "cleaner" than
what parisc-linux does and meets the same objectives. Right now,
it's important the install be easy in order to make it easy for
people to migrate from HPUX to parisc-linux. :^)

Well, from the driver point of view, I think it _does_ exist. Basically, the
driver will do inb/outb and friends. Whatever those functions do in reality
is arch-dependent.
But we agree on the fact that in order for those functions to know which
bus to tap, additional information must be "cooked" into the IO address
passed to them. That's why I'm proposing this notion of "io base".

Additionally, the same problem is true for ISA memory, when it exists
obviously.
I would indeed like to see the same kind of function,
pci_get_legacy_mem_base(int bus_no)-like, that is allowed to return a special
value informing the driver that this specific machine won't support
ISA memory.

With those two simple functions, we could at least

 - fix the fbdev's that need access to VGA regions so that they work on
   multiple bus systems properly
 - Have vgacon disable itself when there's no ISA memory (that can be
   handled by reserving the region and thus preventing request_region from
   working too, well, but that scheme would also simplify the various
   more/less hacked macros used on all non-x86 archs to access the VGA
   memory).
 - Eventually have vgacon work on "any" bus, possibly by providing a kernel
   option telling it on which bus to look for a legacy VGA device (and
   defaulting to whatever VGA device the PCI will find first). This way,
   vgacon would work properly in most cases without arch-specific hacking.

Additionally, I believe it would help making other legacy drivers (if any)
work on non-0 busses (I'm thinking about IDE cards using legacy addresses,
those do exist), and whatever.

The only thing that's annoying me in the fact that we keep tied 

Re: The IO problem on multiple PCI busses

2001-03-03 Thread Benjamin Herrenschmidt

No, don't do this, it is evil.  Use mappings, specify the device
related info somehow when creating the mapping (in the userspace
variant you do this by opening a specific device to mmap, in the
kernel variant you can encode the bus/dev/etc. info in the device's
resource and decode this at ioremap() time, see?).

Well, except that drivers doing IOs don't ioremap...

Maybe we could define an ioremap-like function for IOs, but the more
we discuss this, the more I feel that for in-kernel, a simple function
that returns a per-bus io base (and another one for ISA mem) is plenty
enough for the few legacy things we have to deal with (mostly VGA).

For PCI drivers doing IOs, we just need to have the IO resource
structures properly fixed up (so they already include the correct iobase).

That iobase can either be a mix of a real io address and a "cooking" in
the high bits like parisc, or it can be an address ioremap'd in the
correct bus mapping when it's possible, or whatever...

Ben.



Question about IRQ_PENDING/IRQ_REPLAY

2001-03-03 Thread Benjamin Herrenschmidt

Hi Linus !

I have some questions about the behaviour of arch/i386/kernel/irq.c
with respect to IRQ_PENDING and IRQ_REPLAY.

In particular, my question is about the code in enable_irq() which checks
for IRQ_PENDING, and then "replays" the interrupt by asking the APIC to
issue it again.

I don't have a simple way on PPC to cause the interrupt to happen again,
as you can imagine this is rather controller-specific. However, looking
at the code closely, I couldn't figure out a case where having
IRQ_PENDING in enable_irq() makes sense.

How can IRQ_PENDING happen to be set on an IRQ_DISABLED interrupt, and
why would that matter (why should we take this interrupt) ?

AFAIK, IRQ_PENDING can only be set as a result of a call to do_IRQ().
Since we loop when calling the actual handler, I fail to see how we could
"miss" an interrupt. If the interrupt is actually disabled, we should not
get it at all, and if we did, I don't see why it would matter to resend
it when it gets enabled since disabled interrupts are supposed to be
ignored (well, they are by most PICs). Obviously, this matters only for
an edge interrupt as level ones will stay asserted until the device is happy.

I'd be glad if you could take the time to enlighten me about this as I'm
trying to make the PPC code as close as possible to the i386 one, according
to your comment stating that it would be generic in 2.5, and I don't like
having code I don't fully understand ;)

Regards,
Ben.



Re: The IO problem on multiple PCI busses

2001-03-03 Thread Benjamin Herrenschmidt

I/O is not supposed to be fast, that's what MMIO is for. :)  Just do

void outb (u8 val, u16 addr)
{
	/* map the single byte behind this port, write it, unmap it again */
	void *p = ioremap (ISA_IO_BASE + addr, 1);
	if (p) {
		writeb (val, p);
		iounmap (p);
	}
}

You can map and unmap for each call :)  Ugly and slow, but hey, it's
I/O...

Well, that would really suck ;) And I don't think it would be necessary
as we can probably limit each IO bus to 64k without much problem, and
have them permanently ioremap'ed.

Ben.



Re: Question about IRQ_PENDING/IRQ_REPLAY

2001-03-04 Thread Benjamin Herrenschmidt

In particular, if an edge-triggered interrupt comes in on an x86 IO-APIC
while that interrupt is disabled, enabling the interrupt will have caused
that irq to get dropped. And if it gets dropped, it will never ever happen
again: the interrupt line is now active, and there will never be another
edge again.
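The lost-edge scenario described above can be modelled in a few lines of user-space C. This is only a sketch of the bookkeeping, not the code in arch/i386/kernel/irq.c: the sim_* names and the struct are made up for illustration, the nested disable count is an added assumption, and calling the handler directly stands in for asking the controller to resend the interrupt.

```c
#define IRQ_DISABLED 0x1
#define IRQ_PENDING  0x2

struct sim_irq {
    unsigned int status;   /* IRQ_* flags */
    unsigned int depth;    /* nested disable count */
    unsigned int handled;  /* how many times the handler ran */
};

static void sim_handle(struct sim_irq *d)
{
    d->handled++;
}

/* An edge arrives: run the handler, or remember it if the irq is disabled. */
static void sim_do_irq(struct sim_irq *d)
{
    if (d->status & IRQ_DISABLED) {
        d->status |= IRQ_PENDING;
        return;
    }
    sim_handle(d);
}

static void sim_disable_irq(struct sim_irq *d)
{
    if (d->depth++ == 0)
        d->status |= IRQ_DISABLED;
}

static void sim_enable_irq(struct sim_irq *d)
{
    if (d->depth == 0)
        return;                    /* unbalanced enable, ignore it */
    if (--d->depth == 0) {
        d->status &= ~IRQ_DISABLED;
        if (d->status & IRQ_PENDING) {
            d->status &= ~IRQ_PENDING;
            sim_handle(d);         /* stands in for the hardware resend */
        }
    }
}
```

In this model an edge that arrives between sim_disable_irq() and the matching sim_enable_irq() is delivered exactly once, at enable time, instead of being lost.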

Ok, I see. We have a different issue with the old Apple IRQ controller that
can lose interrupts if they are active when re-enabled. We currently rely
on a hack to work around this that may re-send interrupts, but that involves
hacking into __sti() to check for lost interrupts, which is bad.

Basically, even a level interrupt, if active while re-enabled, will not be
sent by the PIC to the CPU, and so further interrupts will be blocked too.
We have some code in enable_irq() that can detect this case, but re-triggering
the interrupt is not really simple and requires the __sti() hack for now.

I believe we may have a way to re-trigger the interrupt without having to
hack __sti() by using a fake timer interrupt. I'll look into this, but in
that case, the code can be mostly self-contained in enable_irq, and we will
probably not need to play with the IRQ_PENDING and IRQ_REPLAY flags at all.

 I'd be glad if you could take the time to enlighten me about this as I'm
 trying to make the PPC code as close as the i386, according to your
 comment stating that it would be generic in 2.5, and I don't like having
 code I don't fully understand ;)

You likely don't have this problem at all. Most sane interrupt controllers
are level-triggered, and won't show the problem. And others (like the
i8259) will see a disabled-enabled transition as an edge if the interrupt
is active (ie they have the edge-detection logic _after_ the disable
logic), and again won't have this problem.

Well, Apple now uses OpenPICs, but all slightly older Macs had a home-made
Apple controller that had the above issue :( In fact, it can happen with
both edge and level interrupts for us.

Regards,
Ben.



Re: IO issues vs. multiple busses

2001-03-04 Thread Benjamin Herrenschmidt

So once again I vote for the introduction of
isa_{request,release}_mem_region(), just like we already have isa_readb() and
friends.

Well, it's the same problem as the IO, there may be more than one ISA mem
region, especially when you put 2 video cards on 2 different PCI hosts (even
without a PCI-ISA bridge).

In fact, with a PCI-ISA bridge, I can imagine a config where you need 2 ISA IO
regions and 2 ISA mem regions on the same PCI bus if that bridge does address
translation.

My concern for now is mostly to get video cards fixed, I don't care much about
legacy ISA hardware as in those cases, I guess we can limit ourselves to a
single ISA bus and inb/outb being happy to cope with it.

The problem is that we use the same macros (inb/outb) to access that ISA bus,
and to access any PCI IO bus. Well, I would suggest the following:

 - inb/outb without offset - the ISA bus if any, or the IO space of the
   first PCI host
 - inb/outb with offset (or encoded HBA number) - IO space of another bus
 - pci_get_bus_io_base() returns the IO offset for accessing the Nth PCI
   bus IO space so that the fb devs can do VGA IOs on the bus that holds
   their card.
 - pci_get_bus_isa_mem_base() returns the base address at which isa mem
   is available for a given PCI bus (that is the address that generates
   mem cycles in the range 0-64k). This is a physical address, the driver
   still has to ioremap it. Some PCI cards can have a BAR mapping the
   VGA memory elsewhere, drivers for those cards should prefer the BAR
   mapping of course.

All IO ranges can be mapped via kernel VM tricks into a single contiguous
space with the offset being something like a 64k increment, or we can have
inb/outb do a lookup of the host bus like on parisc. That's an arch
implementation detail.
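As a rough user-space sketch of the "single contiguous space with a 64k increment" variant: each bus gets a 64k window, pci_get_bus_io_base() returns the offset of that window, and the driver adds the legacy port to it. The byte array stands in for the permanently mapped IO area, and NR_IO_BUSES, sim_inb and sim_outb are names invented for this example; pci_get_bus_io_base() itself is only the function proposed in this thread, not an existing kernel API.

```c
#include <stdint.h>
#include <stddef.h>

#define NR_IO_BUSES  4          /* hypothetical number of PCI hosts */
#define IO_WINDOW    0x10000u   /* 64k of IO space per bus */

/* Stand-in for the contiguous, permanently ioremap'ed IO area. */
static uint8_t io_space[NR_IO_BUSES * IO_WINDOW];

/* Offset to add to a legacy port number to reach it on bus_no. */
static unsigned long pci_get_bus_io_base(int bus_no)
{
    if (bus_no < 0 || bus_no >= NR_IO_BUSES)
        return (unsigned long)-1;   /* no IO space behind that bus */
    return (unsigned long)bus_no * IO_WINDOW;
}

/* Simulated outb: "port" is a legacy port plus the per-bus base. */
static void sim_outb(uint8_t val, unsigned long port)
{
    if (port < sizeof(io_space))
        io_space[port] = val;
}

/* Simulated inb: reads back 0xff for ports outside any window. */
static uint8_t sim_inb(unsigned long port)
{
    return port < sizeof(io_space) ? io_space[port] : 0xff;
}
```

A fbdev for a card on bus 1 would then do sim_outb(val, pci_get_bus_io_base(1) + 0x3c0), while legacy ISA drivers keep using plain port numbers and land in the first window.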

Is that ok ? I know it's not perfect, but it would allow us to solve the most
important problem for now. The PCI cards in need of IOs (like PCI IDE cards)
can have their resources fixed up by the arch code in order to tap the correct
bus. Only the real legacy ISA drivers will be limited to the fixed (default)
ISA bus.

Ben.




Re: Question about IRQ_PENDING/IRQ_REPLAY

2001-03-04 Thread Benjamin Herrenschmidt

We do have broken interrupt controllers in this respect.  We already have a
way of handling it.  Ben, take a look at set_lost().

Heh, I know, thanks ;)

However, our current scheme implies a hack to __sti() that I'd like to get
rid of since it adds an overhead all over the place that could probably be
localized if we managed to force an interrupt (using the DEC for example,
or using a mac-specific device as this controller only exists on macs anyway).

Also, we currently don't use the same mechanism as i386, and since Linus
expressed his desire to have irq.c become generic, I'm trying to make sure
I fully understand it before merging into PPC the bits that I haven't merged
yet.

Ben.



Re: Question about IRQ_PENDING/IRQ_REPLAY

2001-03-05 Thread Benjamin Herrenschmidt

More generic in terms of using irq_desc[] and some similar structures I can
see.  Making do_IRQ() and enable/disable use the same names and structures
as x86 isn't sensible.  They're different ports, with different design
philosophies.

I don't believe that the plan is a common irq.c - lets stay away from that.

Why ? Except for a few things like irq probing, our irq.c is already very
similar to i386 (well, mostly because I merged most of it some time ago) and
I don't see why it would be a bad thing. The design of irq.c makes it
perfectly adapted to our needs, there's nothing really x86-specific in it,
it handles things we need to be handled correctly, does nothing more than
what we need (well it does, but those parts, mostly the irq locking, got
already removed), etc...

I did that merge in the first place because I wanted the depth support in
enable/disable_irq, and more fine-grained spinlocking. I really see nothing
wrong in the way irq.c works, I really think that except the small added bit
we have in our do_IRQ() to call ppc_md.get_irq(), it's perfectly adapted to
our needs.

Remember that it allowed us to remove the (mostly useless) post_irq() thing
we had ? It also allows proper implementation of irq distribution even with
controllers that could trigger the same IRQ on several CPUs, re-entrancy in
the handler if we do early-eoi without masking an edge interrupt is also
handled properly, enable/disable from within the handler too, all sorts of
things our previous code didn't do right.

The only thing I added to the core irq.c code is that IRQ_PERCPU flag that
prevents IRQ_INPROGRESS from being set. It's a bit hackish but allows our
IPIs to use a single desc for all CPUs without being mutually exclusive.

Ben.




Re: Question about IRQ_PENDING/IRQ_REPLAY

2001-03-05 Thread Benjamin Herrenschmidt

handled correctly, does nothing more than what we need (well it does, but
those parts, mostly the irq locking, got already removed), etc...

Sorry, I meant mostly the irq probing

Ben.




Re: Question about IRQ_PENDING/IRQ_REPLAY

2001-03-05 Thread Benjamin Herrenschmidt

And I seriously doubt that PPC SMP irq handling has gotten _nearly_ the
amount of testing and hard work that the x86 counterpart has. Things
like support for CPU affinity, per-irq spinlocks, etc etc.

Some of those are the reason I moved part of the x86 irq.c code to PPC
indeed.

Now, I'm not saying that irq.c would necessarily work as-is. It probably
doesn't support all the things that other architectures might need (but
with three completely different irq controllers on just standard PCs
alone, I bet it supports most of it), and I know ia64 wants to extend it
to be more spread out over different CPU's, but most of the high-level
stuff probably _can_ and should be fairly common.

And I think they are. One thing is that if made "common", do_IRQ has to
be split into an arch-specific function that retrieves the irq number (and
does the ack on some controllers), and the actual "dispatch" function that
does all the flags game and calls the handler.

I've slightly extended it using the IRQ_PERCPU flag to prevent IRQ_INPROGRESS
from ever being set (a bit hackish but I wanted that for IPIs since they
use ordinary irq_desc structures for us in most cases).
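A small user-space model of that split may help: an arch-specific hook returns the irq number, and a generic dispatch function does the flag handling, with IRQ_PERCPU skipping the IRQ_INPROGRESS bookkeeping. Everything here (the sim_* names, the struct layout, the flag values) is made up for illustration and is much simpler than the real irq.c, which also deals with locking, pending and disabled interrupts.

```c
#include <stddef.h>

#define NR_IRQS        8
#define IRQ_INPROGRESS 0x1
#define IRQ_PERCPU     0x2   /* never marked in progress (used for IPIs) */

struct sim_desc {
    unsigned int status;
    void (*handler)(int irq);
};

static struct sim_desc sim_irq_desc[NR_IRQS];

/* Arch-specific half: ask the controller which irq fired (-1 if none). */
static int (*sim_get_irq)(void);

/* Generic half: flag handling plus handler call. Returns 1 if handled. */
static int sim_dispatch_irq(int irq)
{
    struct sim_desc *desc;
    int percpu;

    if (irq < 0 || irq >= NR_IRQS)
        return 0;
    desc = &sim_irq_desc[irq];
    if (!desc->handler)
        return 0;

    percpu = (desc->status & IRQ_PERCPU) != 0;
    if (!percpu) {
        if (desc->status & IRQ_INPROGRESS)
            return 0;               /* already running elsewhere */
        desc->status |= IRQ_INPROGRESS;
    }
    desc->handler(irq);
    if (!percpu)
        desc->status &= ~IRQ_INPROGRESS;
    return 1;
}

static int sim_do_IRQ(void)
{
    int irq = sim_get_irq ? sim_get_irq() : -1;

    return sim_dispatch_irq(irq);
}

/* Example plumbing so the model can be exercised. */
static int sim_count;
static void sim_count_handler(int irq) { (void)irq; sim_count++; }
static int sim_fixed_get_irq(void) { return 3; }
```

Only sim_get_irq would differ between platforms in this model; the dispatch half is the part that could be shared.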

Ben.




Re: Question about IRQ_PENDING/IRQ_REPLAY

2001-03-05 Thread Benjamin Herrenschmidt

We have about 12 interrupt controllers we end up using on PPC.  I'm
suspicious of any effort to base Linux/PPC generic interrupt control code
paths on a software architecture that's been tested with 3.  More to the
point, we get ASIC's that roll in a standard interrupt controller and add
some "improvements" at the same time.

Well, I personally don't see what would be a problem... Of course, the
current i386 irq.c cannot be re-used completely "as is". The bit of code
that gets the actual irq number has to be arch specific. But most of the
locking issues are completely platform neutral.

I personally see that code as a good framework that provides many features
that may or may not be necessary depending on the level of brokenness of
a given interrupt controller.

As for SMP, I'm sure x86 has seen a lot more testing.  I'm not going to
sacrifice time-tested stability so we can look just like x86 and get clean
SMP locking.  We've lost stability already because of some PPC folks'
excitement at getting us to behave like x86 in irq.c.

We lost stability ? Hrm... If we ever had a problem with SMP, it was in the
openpic code, and apparently, due to a HW bug. I don't think the new irq.c
code in itself caused us to lose stability. I actually do think it improved
the locking, and so, stability.

As for a generic irq.c, as a guiding light, I'm all for it.  It'll
certainly help work with RTLinux.  It'll also help new architectures by
giving them a snap-together port construction kit.  I'm still not going to
sacrifice stability in the short-term for this nice feature in the
long-run.  I'm pretty sure we agree on this.

Well, we have been running this new irq.c which I partially based on
i386 for some months now, and had enough time to iron out most problems.
Again, all the stability problems we had so far were related to the openpic
implementation, I don't remember seeing one stability problem reported so
far that was related to irq.c. And I've been running a couple of dual
G4s without much trouble for some time now.
We do (did ?) have a problem with irq distribution on SMP with openpic. I'm
not sure we know exactly why yet; according to both you and IBM people, we
are running into a HW bug of the openpic core. I see nothing in irq.c
that can cause this.

On the other hand, the new irq.c brings the irq depth handling, the ability
to call enable/disable from within the handler (I've been wanting that for
some time for the PMU driver), proper spinlock'ing, etc...
And last, but not least, consistent semantics of enable/disable irq
exposed to drivers (especially things like disable_irq() actually waiting
for that irq to be completed on any other CPU).




[RFC] fbdev power management

2001-03-13 Thread Benjamin Herrenschmidt

I'm working on improving some aspects of Power Management on the
PowerBooks, and among other things, I have a problem with fbdevs.

Currently, each fbdev registers a power management callback to sleep/
wakeup the device. We handle HW related things (shutting the backlight
off, putting the chip to sleep when possible, backing up the frame buffer
content, etc...) from there.

We do call the video sleep last during the sleep process, and wake it up
first, to avoid any problem if something is being printed to the console
while the chip is suspended.

However, this is not very safe. First, there's the cursor timer, which
can screw us up. I have a hack in my tree where the fbdev driver calls a
new routine in fbcon.c that stops/starts the cursor timer.

But I'm looking toward a more generic solution. By having a way to
"suspend" the entire fbcon, maybe we can have all console output blocked
and buffered until the fbcon is woken up. Also, a question is should we
call that fbcon_suspend()/fbcon_resume() (currently only the cursor timer
stuff) from the fbdev's or should the fbcon itself register as a power
management client, and then call the fbdev's suspend/resume routines ? I
prefer the second solution as the fbdev's are often PCI devices (and so
already have the ability of having PCI suspend/resume hooks).

Another solution would be to have each fbdev have its own suspend/resume
hook (and maintain a "suspend" state that would tell fbcon to stop
calling it or start working on a memory based backup image), and
separately, fbcon's own suspend/resume (for the cursor, as it's not
head-dependent but rather global to all fbdev's).

Any comment ?

Ben.





Re: [Linux-fbdev-devel] [RFC] fbdev power management

2001-03-14 Thread Benjamin Herrenschmidt

I think registering fbcon as a PM client and doing the above when the
fbdev suspend/resume hooks are called should work.  A memory backup is
worked on until the resume is run and the backup is restored to the
display.

So the fbdev drivers would register PM with fbcon, not PCI, correct?

Either that, or the fbdev would register with PCI (or whatever), _and_
fbcon would too, independently. In that scenario, fbcon would only handle
things like disabling the cursor timer, while fbdev's would handle HW
issues. The only problem is for fbcon to know that a given fbdev is
asleep, this could be an exported per-fbdev flag, an error code, or
whatever. In this case, fbcon can either buffer text input, or fall back
to the cfb working on the backed up fb image (that last thing can be
handled entirely within the fbdev I guess).
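The "per-fbdev asleep flag plus buffering" option can be sketched in user space as follows. This is only a model of the idea: struct sim_console, the sim_* functions and the 16-cell screen are invented for the example, the hw array stands in for the frame buffer, and the real console code is of course more involved.

```c
#include <string.h>
#include <stddef.h>

#define SCREEN_CELLS 16

struct sim_console {
    char shadow[SCREEN_CELLS];  /* software copy, always up to date */
    char hw[SCREEN_CELLS];      /* stands in for the frame buffer */
    int  asleep;                /* set while the fbdev is suspended */
};

/* Console output: always update the shadow, touch the HW only if awake. */
static void sim_putc(struct sim_console *c, size_t pos, char ch)
{
    if (pos >= SCREEN_CELLS)
        return;
    c->shadow[pos] = ch;
    if (!c->asleep)
        c->hw[pos] = ch;
}

static void sim_suspend(struct sim_console *c)
{
    c->asleep = 1;
}

/* On wakeup, redraw the whole screen from the shadow copy. */
static void sim_resume(struct sim_console *c)
{
    c->asleep = 0;
    memcpy(c->hw, c->shadow, SCREEN_CELLS);
}
```

Anything printed while the chip is powered down only lands in the shadow copy, and shows up on screen after sim_resume().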

Ben.






Re: [Linux-fbdev-devel] [RFC] fbdev power management

2001-03-15 Thread Benjamin Herrenschmidt

  Now for fbcon it's simpler. Things get written to the shadow buffer
(vc_screenbuf). When the console gets woken up update_screen is called.
While powered down the shadow buffer can be written to, which is much faster
than saving an image of the framebuffer. Of course if you still want to do
this, such as in the case of the X server, then copy the image of the
framebuffer to regular ram. Then power down /dev/fb using some ioctl calls
provided.

Ok, I see. Currently, the sleep process is started from an ioctl sent to
another driver, which will in turn call various notifier functions to shut
down bits of hardware and finally put the machine to sleep. It's not a direct
ioctl to /dev/fb (which may not be opened).

One problem I have is that my fbdev sleep routine will restore the mode on
wakeup, but that of course doesn't work with X when not using useFBDev, as
fbdev has no knowledge of the current mode or register settings used by X.

I'm wondering if it would be possible to make X think there's a console switch
(without actually switching to an active console, as we don't know if we even
have one of those available for us), wait for it to reply, and then start the
sleep process.

One other possibility would be to implement APM-like events, I still have to
study those in more detail as our sleep process is currently quite different
from APM (and definitely not BIOS-based).

For now, I have my hooks in fbcon that suspend/restart the cursor timer,
that's enough to make sleep stable on 2.4 since we take care of shutting down
the display very last (after any other driver) to make sure no printk will
end up trying to display something while the chip is powered down.
I'll digest your various comments and look into all this in more depth with
the 2.5 console codebase. I believe some solution must be found for x86
laptops too.

Ben.







RE: kernel_thread vs. zombie

2001-03-22 Thread Benjamin Herrenschmidt

Have a look at:
http://www.scs.ch/~frey/linux/kernelthreads.html
I have an example there that starts and stops kernel threads
from init_module and never produced a zombie.
I use the same code also to start threads from ioctl and it
works for me. I tested it on UP and SMP, Intel and Alpha,
2.2.18 and 2.4.2.

Thanks !

Could you explain to me a bit why you need the lock_kernel ? My probe
thread is already protected by some atomic ops, but I'm considering
changing them to semaphores. Is there any need for the bkl to be taken
when calling daemonize or is this just for your own synchronisation needs ?

I don't think you do more than what I currently do to prevent the
zombie (except for the daemonize call, I don't see you changing anything
about the parent thread or whatever). 

At first I thought daemonize() would do the trick, but I still see
zombies in my tests. I'm running UP now so I don't think my lack
of lock_kernel() could explain it.

Ben.






RE: kernel_thread vs. zombie

2001-03-22 Thread Benjamin Herrenschmidt

The stuff done in daemonize() and the exit_files could need
the kernel lock. At least on some 2.2.x version it does,
I did not check whether it is still needed on 2.4.

Well, I don't really plan to backport this to 2.2.x. I'll
try to see if my problem is related to the lack of kernel
lock, or maybe I have just something else wrong.

On stop of the thread I need the big kernel lock to make
sure the kernel thread exited (everything really done
from my up() till the thread is in zombie state) before
I unload the module. The comment in the code should explain
it.

Ok. I don't need that as I'm not in a module, no chance I'll ever
get unloaded. At least not in 2.4. Making ADB and all the controllers
and device drivers into modules would be an interesting exercise with
module dependencies ;)

Note that the threads themselves do not run with the kernel lock
held. After setting everything up they do an unlock.

Ok. Well, I just have an atomic flag that is test-and-set before starting
the bus reset, and released at the end of the thread. No need to make sure
the previous one is really dead before starting a new one. I could
benefit from semaphores when starting it since if it's already running,
I just loop scheduling waiting for the lock bit to be available. But
that case will almost never happen in real life. ADB probes are quite
rare.
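The flag-based guard described above looks roughly like this when written with C11 atomics in user space. It is a sketch only: the kernel code would use the kernel's own bit operations rather than <stdatomic.h>, and probe_try_start(), probe_finish() and probes_run are names invented for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Set while a bus probe is running. */
static atomic_flag probe_busy = ATOMIC_FLAG_INIT;
static int probes_run;

/* Try to claim the probe; returns false if one is already running. */
static bool probe_try_start(void)
{
    return !atomic_flag_test_and_set(&probe_busy);
}

/* Called at the very end of the probe thread to release the flag. */
static void probe_finish(void)
{
    probes_run++;
    atomic_flag_clear(&probe_busy);
}
```

A caller that loses the race simply retries later (the "loop scheduling" mentioned above); a semaphore would let it sleep instead, at the cost of a bit more setup.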

Many thanks for your help,

I'll see what's wrong in my code ;)

ben.






Re: PCI bridge setup weirdness

2000-12-08 Thread Benjamin Herrenschmidt


No, pci_read_bridge_bases() is obsoleted by new pci setup code. ;-)
You have to set up bus resources properly in pcibios_fixup_bus().
For a single root bus configuration, you don't need to do anything
with the root bus itself - its resources already point to ioport_resource
and iomem_resource, which should be ok. For pci-pci bridges you have
to add something like this:

The problem I have (and this is why I don't setup host resources
properly on multi-host PPCs yet) is that some hosts can have several
non-contiguous ranges (especially with memory, IO is usually a single
contiguous range).

There are simply not enough resource "slots" in the current structures
to handle all possible cases.

They basically have a host bridge register in which each low bit enables
decoding of a 256MB region in the range 0xn0000000 and each high bit
enables decoding of a 16MB region in the range 0xFn000000.

The typical setup is to have one (or more) 256MB regions, and one 16MB
region, but that can change from model to model.
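To show why a fixed number of resource slots is a problem, here is a sketch that turns such an enable register into a list of memory windows. The exact bit layout is an assumption made for the example (bit n of the low half enables the 256MB window number n, bit n of the high half enables the 16MB window number n inside the top 256MB); struct mem_range and decode_host_ranges() are hypothetical names, not kernel code.

```c
#include <stdint.h>
#include <stddef.h>

struct mem_range {
    uint32_t start;
    uint32_t size;
};

#define SZ_256M 0x10000000u
#define SZ_16M  0x01000000u

/*
 * Assumed layout, for illustration only: bit n (0..15) enables the 256MB
 * window starting at n << 28, bit 16+n enables the 16MB window starting
 * at 0xF0000000 + (n << 24).
 * Fills out[] (up to max entries) and returns the number of windows stored.
 */
static size_t decode_host_ranges(uint32_t reg, struct mem_range *out,
                                 size_t max)
{
    size_t count = 0;
    unsigned int n;

    for (n = 0; n < 16; n++) {
        if ((reg & (1u << n)) && count < max) {
            out[count].start = (uint32_t)n << 28;
            out[count].size = SZ_256M;
            count++;
        }
    }
    for (n = 0; n < 16; n++) {
        if ((reg & (1u << (16 + n))) && count < max) {
            out[count].start = 0xF0000000u | ((uint32_t)n << 24);
            out[count].size = SZ_16M;
            count++;
        }
    }
    return count;
}
```

Adjacent windows could be merged into one resource, but nothing stops a given model from enabling several windows that are not contiguous, and each of those needs its own slot.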

Ben




Fwd: kernel oops with rm in hfs - hit BUG() in line 236 of dcache.h

2000-12-10 Thread Benjamin Herrenschmidt

 Begin Forwarded Message 
Subject: kernel oops with rm in hfs - hit BUG() in line 236 of dcache.h
Date Sent: Sunday, December 10, 2000 12:56 AM
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
CC: [EMAIL PROTECTED], [EMAIL PROTECTED]


PowerCenter Pro 210MHz 604e, 224MB RAM, Linux 2.4-pre11 (rsync from Paul 12/8)

I was removing multiple files from my hfs drive, when I hit the BUG() at
line 236 in /usr/src/linux/include/linux/dcache.h:

static __inline__ struct dentry * dget(struct dentry *dentry)
{
	if (dentry) {
		if (!atomic_read(&dentry->d_count))
			BUG();
		atomic_inc(&dentry->d_count);
	}
	return dentry;
}


Dec  9 18:09:21 like kernel: kernel BUG at /usr/src/linux/include/linux/dcache.h:236!
Dec  9 18:09:21 like kernel: Oops: Exception in kernel mode, sig: 7
Dec  9 18:09:21 like kernel: NIP: C00712FC XER:  LR: C00712FC SP: C1087DB0 REGS: c1087d00 TRAP: 0700
Dec  9 18:09:21 like kernel: MSR: 00089032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11
Dec  9 18:09:21 like kernel: TASK = c1086000[12310] 'rm' Last syscall: 10
Dec  9 18:09:21 like kernel: last math c1086000 last altivec 
Dec  9 18:09:21 like kernel: GPR00: C00712FC C1087DB0
C1086000 0039 1032 0001 C021 
Dec  9 18:09:21 like kernel: GPR08:  C01B 001F
C1087CF0 22822842 1001ECE8 100302E8 1003
Dec  9 18:09:21 like kernel: GPR16: 1003 1003 1003 1003  C96D7C20  C021
Dec  9 18:09:21 like kernel: GPR24: C291D62C C018 C018 C291D600
C4193C60 C291EE40 C291D628 C9947520
Dec  9 18:09:21 like kernel: Call backtrace:
Dec  9 18:09:21 like kernel: C00712FC C0047854 C00479A8 C00048D8 10001D8C
100031D0 10001358
Dec  9 18:09:21 like kernel: 0FF0B734 

From the System.map:
c007122c T hfs_unlink
c00476d8 T vfs_unlink
c00478c0 T sys_unlink
c00048d8 T ret_from_syscall_1
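
As a rough illustration of how the backtrace addresses map onto these System.map entries, here is a small sketch that picks the symbol with the highest address not above a given address. The table holds only the four entries quoted above, so a result is only meaningful if no other symbol sits between the match and the address; `struct ksym` and `resolve` are names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct ksym {
    unsigned long addr;
    const char *name;
};

/* Only the System.map entries quoted in the report, sorted by address. */
static const struct ksym symtab[] = {
    { 0xc00048d8UL, "ret_from_syscall_1" },
    { 0xc00476d8UL, "vfs_unlink" },
    { 0xc00478c0UL, "sys_unlink" },
    { 0xc007122cUL, "hfs_unlink" },
};

/* Return the entry with the highest address <= addr, or NULL if none. */
static const struct ksym *resolve(unsigned long addr)
{
    const struct ksym *best = NULL;
    size_t i;

    for (i = 0; i < sizeof(symtab) / sizeof(symtab[0]); i++) {
        if (symtab[i].addr <= addr)
            best = &symtab[i];
    }
    return best;
}
```

Under that caveat, NIP C00712FC resolves to hfs_unlink+0xd0, and the next backtrace entries C0047854 and C00479A8 fall inside vfs_unlink and sys_unlink. The remaining addresses below 0xc0000000 match nothing in this table.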


Thanks,
Peter


** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/


- End Forwarded Message -



Re: 2.2.X patches for fbcon

2000-12-11 Thread Benjamin Herrenschmidt

--- atyfb.c	Mon Dec 11 14:28:19 2000
+++ atyfb.c.orig   Wed Oct  4 22:22:28 2000
@@ -2796,7 +2796,7 @@
  * works on iMacs as well as the G3 powerbooks. - paulus
  */
 if (default_vmode == VMODE_CHOOSE) {
-  if ((Gx == LG_CHIP_ID)||(Gx == LI_CHIP_ID)||(Gx == LP_CHIP_ID))
+  if (Gx == LG_CHIP_ID)
   /* G3 PowerBook with 1024x768 LCD */
   default_vmode = VMODE_1024_768_60;

That one is wrong. The machine type must be probed differently. Also, some
Wallstreets have a different screen (passive matrix) which is 800x600. I'm
trying to find a way to probe for it and will come up with a patch for this.
In the meantime, passing the vmode is the correct solution.

Ben.



aic7xxx.c vs. Adaptec 29160N

2001-01-04 Thread Benjamin Herrenschmidt

I have a 29160N card in a PowerMac G4. It used to work fine with an old
UW SCSI disk I had there. Today, I swapped this drive for a real
Ultra160 one, and now the kernel won't boot. It's giving me an endless
stream of SCSI reset timeouts on bus 0.

Any clue? I don't really need this disk in Linux (at least not yet), but
I don't want to plug/unplug the disk each time I boot Linux or
MacOS either...

The disk is a Quantum ATLAS_V__9_WLS rev. 0230

Anything I can do to help tracking the problem ? It's difficult to get
the actual output of the driver in verbose mode as it is scrolling quite
fast and I have nothing like a serial console on this box. The kernel
won't boot without noprobe so I can't dump dmesg output.

Ben.





Re: aic7xxx.c vs. Adaptec 29160N

2001-01-04 Thread Benjamin Herrenschmidt

Anything I can do to help tracking the problem ? It's difficult to get
the actual output of the driver in verbose mode as it is scrolling quite
fast and I have nothing like a serial console on this box. The kernel
won't boot without noprobe so I can't dump dmesg output.

I was wrong, even no_probe won't help, I have to physically disconnect
the drive to get the kernel to boot.

Ben.




Re: FIXED! Updated 2.4 TODO List -- new addition WAS(test9 PCIresourcecollisions (fwd)

2000-10-26 Thread Benjamin Herrenschmidt


Yes, it will break on any machine with multiple primary PCI busses, because
the registers assigning bus number ranges to primary busses are chipset
specific.

In 2.5, I'd like to rewrite the resource + bus number assignment code to be
able to re-layout the busses and resources even on i386 if it detects it's
safe to do so (that is there is either only one primary bus or the host
bridge
is known). This will be needed for proper PCI hotplug and it will also help
us to get rid of some more BIOS bugs (especially on some embedded systems).

The current 2.4 code with assign_all_busses enabled works nicely on the
Apple "UniNorth" 3-host mechanism. There's apparently no register to
configure to set the primary bus number (the bridge doesn't care what
its bus number is, it just has different mechanisms to issue type 0 and
type 1 config cycles).

We also have something that is not directly supported in the 2.4 PCI
code, but that I implemented via fixups and pcibios_update_resource(),
which is to have an offset applied to all MMIO resources. Some PPC
machines don't have a 1:1 mapping of PCI memory space vs. CPU physical
space, and so we must add this offset to all memory resources and
subtract it again in pcibios_update_resource().
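
A minimal sketch of that offset handling, assuming a single fixed offset per host: the resource seen by drivers holds CPU physical addresses, and the value written back to the device's BAR is the bus address. The helper names and the offset value used below are purely illustrative and do not come from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* An inclusive address range, in the spirit of a memory resource. */
struct mem_range {
    uint32_t start;
    uint32_t end;
};

/* Fixup direction: PCI bus address -> CPU physical address (add offset). */
static struct mem_range bus_to_cpu(struct mem_range bus, uint32_t offset)
{
    struct mem_range cpu = { bus.start + offset, bus.end + offset };

    /* The shifted range must not wrap around the 32-bit space. */
    assert(cpu.start >= bus.start && cpu.end >= cpu.start);
    return cpu;
}

/* Update direction: CPU physical address -> value for the BAR (subtract). */
static struct mem_range cpu_to_bus(struct mem_range cpu, uint32_t offset)
{
    struct mem_range bus;

    assert(cpu.start >= offset && cpu.end >= cpu.start);
    bus.start = cpu.start - offset;
    bus.end = cpu.end - offset;
    return bus;
}
```

With a made-up offset of 0xC0000000, a BAR range of 0x00100000-0x001fffff appears to drivers as 0xC0100000-0xC01fffff, and converting back gives the original bus range.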

I still have not implemented per-host parent resources (we currently have
one parent resource for all 3 hosts), and I'm still not completely sure of
the best way to implement it (the bus resource management still has a
few obscure zones to me).

The main problem with PCI we are facing on all our platforms is related
to IOs. There are way too many assumptions in the kernel based on the x86
fact that PCI IOs are more or less equivalent to ISA IOs, limited to one
64k space, and so on.

We have several cases of machines with several IO busses, each one having
its own IO address space 0-xxx mapped differently (elsewhere) in
the CPU physical memory space (those CPUs don't have specific in/out
instructions), and which can be more than 64k long. Also, not all host
bridges can do remapping of legacy ISA addresses.

That leads to several issues: if we want to support "normal"
PCI IOs (devices that expose registers as IO ranges only, but that also
fully support the PCI 32-bit IO space) on all of these busses, and still be
"compatible" with drivers that do in/out functions on legacy (64k)
addresses and expect to reach either an ISA bus or legacy devices on the
PCI bus, we need to do all sorts of hacking, and we have not yet figured
out a solution that would make everybody happy.

 - We can put the real physical address used to generate the proper IO cycle
in the PCI device resource structure and have in/out just do the same as
readb/writeb. This handles PCI IOs properly on all busses, but
breaks legacy crap.

 - We can (and that's what we do today) decide that only one bus supports IOs
and have a "global" IO_BASE which is added to all in/out accesses, and
which is the ioremap'ed IO space of the single bus we decide supports
IOs. But that means that we can't access both the VGA registers of a card
in the AGP slot and the PCI IO space of another card in the main PCI slots
(different busses and different IO spaces).

 - We can use MMU tricks to "append" together all IO spaces, one of them
being considered as the primary and being mapped at the bottom of this
virtual IO space (for legacy in/out) and all others being appended to
this one (with proper fixup of PCI resources). This was discussed on the
linuxppc-dev list a lot, but not implemented yet.
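
The third option can be sketched roughly as follows, assuming each host bridge's IO space is described only by its size and that the primary (legacy) bus is placed first so that legacy port numbers stay unchanged. All names here (`struct io_space`, `layout_io_spaces`, `fixup_io_resource`, `port_to_bus`) are hypothetical, and the MMU mapping itself is left out.

```c
#include <assert.h>
#include <stddef.h>

struct io_space {
    unsigned long size;   /* length of this bus's own IO space */
    unsigned long vbase;  /* where it starts in the combined virtual space */
};

/* Lay the spaces out back to back; entry 0 is the primary (legacy) bus. */
static unsigned long layout_io_spaces(struct io_space *s, size_t n)
{
    unsigned long next = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        s[i].vbase = next;
        next += s[i].size;
    }
    return next;  /* total length of the virtual IO space */
}

/* Resource fixup: bus-local port number -> port in the combined space. */
static unsigned long fixup_io_resource(const struct io_space *s, int bus,
                                       unsigned long local)
{
    assert(local < s[bus].size);
    return s[bus].vbase + local;
}

/* Reverse lookup: which bus does a combined port belong to? -1 if none. */
static int port_to_bus(const struct io_space *s, size_t n,
                       unsigned long port, unsigned long *local)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (port >= s[i].vbase && port - s[i].vbase < s[i].size) {
            *local = port - s[i].vbase;
            return (int)i;
        }
    }
    return -1;
}
```

In this layout, legacy accesses such as port 0x3c0 land on bus 0 without any change, which is also why only one bus can serve them.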

My personal point of view would be to either completely separate the ISA and
PCI IO macros, or have a mechanism for all legacy (VGA, ISA, ...) drivers
to ask for a base address for the "legacy" address they intend to use
(get_legacy_base(VGA_LEGACY), for example, would return the IO base to
apply to all in/out macros used to access the IO space).
That's still not perfect since we can't support two VGA cards on separate
busses (which would be theoretically possible on a Mac: one in the AGP
slot, one in a PCI slot, both having different IO spaces).
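
A possible shape for such a lookup, as a sketch only: get_legacy_base() and VGA_LEGACY are the names proposed above, while ISA_LEGACY, LEGACY_NONE, set_legacy_base() and legacy_port_addr() are invented here for illustration. One base per legacy kind also shows the limitation just mentioned, since a second VGA card on another bus has nowhere to register.

```c
#include <assert.h>

enum legacy_kind { VGA_LEGACY, ISA_LEGACY, LEGACY_KINDS };

/* Returned when the platform has no bus serving that kind of legacy IO. */
#define LEGACY_NONE (~0UL)

static unsigned long legacy_base[LEGACY_KINDS];
static int legacy_present[LEGACY_KINDS];

/* Called by platform code once it knows which bus serves this kind. */
static void set_legacy_base(enum legacy_kind kind, unsigned long base)
{
    assert(kind < LEGACY_KINDS);
    legacy_base[kind] = base;
    legacy_present[kind] = 1;
}

/* Called by a legacy driver before it touches any port. */
static unsigned long get_legacy_base(enum legacy_kind kind)
{
    assert(kind < LEGACY_KINDS);
    return legacy_present[kind] ? legacy_base[kind] : LEGACY_NONE;
}

/* Address a driver would hand to its in/out macros for a legacy port. */
static unsigned long legacy_port_addr(enum legacy_kind kind, unsigned int port)
{
    unsigned long base = get_legacy_base(kind);

    return base == LEGACY_NONE ? LEGACY_NONE : base + port;
}
```

A VGA driver would then compute its register addresses from get_legacy_base(VGA_LEGACY) rather than using raw port numbers, and bail out when it gets LEGACY_NONE.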

So I'm still open to suggestions, but I'd really like to see this problem
addressed for 2.5 in a "generic" way. Currently, it's more or less a matter of
choosing between supporting legacy devices on one bus and no real PCI IOs
on any other bus but the first one, or supporting real PCI IOs on all
busses, but no legacy IOs. Note that we don't have such a problem for
MMIOs, fortunately ;)

Ben.

-- RFC822 Header Follows --
From: Benjamin Herrenschmidt [EMAIL PROTECTED]
To: Martin Mares [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Re: FIXED! Updated 2.4 TODO List -- new addition  WAS(test9 PCI
 resourcecollisions (fwd)
Date: Thu, 26 Oct 2000 14:35:42 +0200

Re: B/W G3 - big IDE problems with 2.4.0-test10

2000-11-09 Thread Benjamin Herrenschmidt

On Wed, 8 Nov 2000, Andre Hedrick wrote:

 What is your chipset, CMD646 rev 5 Ultra DMA 33 ???

Yep. I've tried building with the CMD64x driver, and that didn't help
matters, if you were wondering. Any thoughts?

Did you try the bitkeeper PPC kernel ? (or Paul Mackerras rsync tree ?)

Not all PPC patches have been merged in Linus' tree yet. There were some
resource assignment issues that were fixed only recently.

Ben.




About IOs, ISA, PCI, and life (WAS: VGA PCI IO port...)

2000-11-18 Thread Benjamin Herrenschmidt


One way to do this is to treat PCI IO and ISA IO as two separate
address spaces.  The PCI IO address space is a 14-bit address space
(bits 9:8 are always zero) ranging from 0x1000 to 0xFCFF.  ISA IO is a
10-bit space (bits 15:10 are available for the card to use) ranging
from 0x100 to 0x3FF.

VGA cards may be PCI and AGP, but still have allocations in the ISA
range.

I'd love to see PCI and ISA IOs treated differently too.

I'm seeing more and more esoteric PCI setups (especially on some huge PPCs
with several host bridges), various different ways to access PCI mem and
ISA, various ways to handle the ISA "special case" for memory, etc...

When you have to deal with various separate PCI IO spaces, each one having
its own address space, and each one potentially having devices that want
to do IOs or "legacy" ISA IOs, then you are screwed.

Currently, we don't really support IOs on anything but the "primary" PCI bus
(chosen arbitrarily) unless platform-specific driver hacking is done.

We could use the MMU mappings to let the kernel think all those IO spaces
are actually one big contiguous region, and remap them all together. This way,
a simple resource fixup would at least make PCI drivers using IO resources
work.
But in this case, "ISA" IOs will have to be restricted to one of the IO
busses, decided arbitrarily.

But what about 2 video cards on the AGP port and one PCI slot of a G4 Mac ?

This machine, just as an example, has those on different host controllers with
separate IO spaces. If those cards need to be driven with VGA accesses (for
running a BIOS emulator for example, or just because you have no choice),
then you are screwed. All you can do is have one bus support VGA IOs.

Another issue is ISA memory space, for the same reason as above (multiple
busses), but also because a lot of PCI controller setups can't forward
memory cycles below 0x8000 or some such arbitrary physical address. Some
of them (most but not all) provide a way via a separate physical address
to access a 64k "ISA" memory space that generates low-address PCI cycles.

So you can have one or more ISA IO busses, and 0 or more ISA memory busses.

A solution for that would be to have VGA and other legacy ISA drivers in
the kernel change the way they use the IO access macros.

One idea I have would be to keep a virtual ISA "bus number" along
with kernel support functions to count them, get virtual base addresses for
IO and memory, query about availability of those, etc...

Another would be to link that more tightly with PCI by adding generic functions
to request the virtual base address of each PCI IO and ISA-memory space.
We already have a syscall on some platforms (PPC, Alpha) to request that
information from userland (XFree).

I'm not sure about the best way that could fit in the resource
architecture yet. I have different problems with PCI resources for now
(mostly with host bridges that provide several discontiguous decoding
ranges)...

Ben.





Re: PCI power management

2001-04-19 Thread Benjamin Herrenschmidt

Hi ! Glad to see things moving around Power Management ;)

This was originally a private reply to Patrick Mochel, but the e-mail
kept getting longer and longer :)

Note: we have set up a list for PM issues

http://lists.sourceforge.net/lists/listinfo/linux-pm-devel

Not very much used yet, but I, at least, plan to spam it with all
sort of things we need for PowerBook PM... I'm forwarding your
message there and I suggest we continue that discussion there as well.

The current state of PCI PM is this:

pci_enable_device (1) enables IO and mem decoding, (2) assigns/routes
the PCI IRQ, and (3) brings the device to D0 using pci_set_power_state. 
Linus believes the power state transition should occur before (1) and
(2), and I agree.

pci_set_power_state brings a device to a new D state.  If the D state
transition is D3->D0, then we (1) save key PCI config registers, (2) go
to D0, and (3) restore saved PCI config registers.  This originally
comes from Donald Becker's acpi_wake function, which is used only for
the case of device enabling (where he had no problems), not for the case
of returning-from-suspend (where we see problems).

I believe the current scheme is not enough. Here are some of my own
thoughts about this:

 - Some devices won't properly give you their config space when
in D3 state. You shouldn't save the configuration when in D3 to restore
it after switching to D0; you must have saved it before
originally putting the device into D3 state.

 - There need to be some arch "hooks" in this mechanism. Some machines
have the ability (from the arch specific code, by tweaking ASIC bits)
to remove clock and/or power from selected devices. That means power
management can be done even with devices not supporting PCI PM, provided
that the driver can recover them from a PowerOn reset.

 - Some devices just can't be brought back to life from D3 state without
a PCI reset (ATI Rage M3 for example), and that requires some arch specific
support (when it's possible at all).

 - The current scheme provides no way for the kernel to "know" if a
driver can handle recovering the device from a PowerOn reset. Some
drivers can, some can't (the video drivers usually can't as they
require the board's PLL to be properly set up by the BIOS). Some
advanced PM modes we use on pmacs will cause the motherboard ASIC to
turn off power to PCI and AGP cards when putting the machine to sleep.
We need a way to prevent/allow this "deep sleep" mode depending
on what the card supports.

 - Ordering of power management may matter. On PowerBooks, we run
through all notifiers first with a "sleep request" message. None of
the drivers will actually put anything to sleep at this point, but
they will allocate all the memory they might need for doing so (saving
state, saving a framebuffer in some cases, etc...). Once all devices
have accepted the request (they can refuse it), I then send a
"sleep now" message. This way, I can make sure all memory allocations
have been performed and disks properly sync'ed before putting the swap
devices to sleep and such things.

 - On SMP, we need some way to stop other CPUs in the scheduler
while running the last round of sleep (putting devices to sleep), at least
until all IO layers in Linux can properly handle blocking of IO queues
while the device sleeps.

 - We need a generic (non-x86 APM or ACPI dependent) way of including
userland processes that request it in the loop. Some userland processes
that bang hardware directly (X, but not only X) need to be properly
suspended (and the kernel has to wait for an ack from them before continuing
with device sleep).
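
The first point in the list can be illustrated with a small userspace model, which is only a sketch: `struct fake_dev` stands in for a device whose config space reads back as all ones while in D3, and `enter_d3`/`leave_d3` are made-up helpers, not the kernel's pci_set_power_state. The point is simply that the snapshot is taken on the way down, while the registers are still readable.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CFG_WORDS 16  /* the 64-byte standard config header, as 32-bit words */

struct fake_dev {
    uint32_t config[CFG_WORDS];  /* what config reads currently return */
    uint32_t saved[CFG_WORDS];   /* snapshot taken before entering D3 */
    int saved_valid;
    int state;                   /* 0 = D0, 3 = D3 */
};

/* Going down: save first, while the config space is still readable. */
static void enter_d3(struct fake_dev *d)
{
    memcpy(d->saved, d->config, sizeof(d->saved));
    d->saved_valid = 1;
    d->state = 3;
    /* Model assumption: config space is unreadable (all ones) in D3. */
    memset(d->config, 0xff, sizeof(d->config));
}

/* Coming back: restore the earlier snapshot; fail if there is none. */
static int leave_d3(struct fake_dev *d)
{
    if (!d->saved_valid)
        return -1;  /* nothing trustworthy to restore from */
    d->state = 0;
    memcpy(d->config, d->saved, sizeof(d->config));
    return 0;
}
```

Saving at wake-up time instead would, in this model, just capture the 0xffffffff garbage and write it back.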

"apm -s" causes the apm driver to map all suspends to the ACPI D3
state.  An apm suspend triggers a pm_send_all call, which in turns
triggers pci_pm_suspend.  This code [from Linus iirc] walks the root
buses, recursively suspending downstream buses and then attached
devices.  The resume code does the exact opposite.  The PCI core
suspend/resume code has this comment, and we note the current
requirement that -all- drivers should export suspend/resume somehow, in
order for a sane PM system to work here.

Yup. They should also be able to return an error (fail or just limit
to a higher level like D2). They should also be able to tell the kernel
if they support recovering from a power down.

It is up to the drivers to implement ::suspend() and ::resume(), and few
do.  The few that do, even fewer work well in practice.

I would have preferred that a PM node be created for each PCI node and
have the PM nodes organised as a tree structure. That way, arch fixup
hooks can re-arrange the tree as the PCI bus-child dependency may not
be true. On some portables, some ASICs located on the PCI bus are not
dependent on their parent host bridge power plane.

That's the current state of things.  I do not think the system -- at the
PCI core level -- is poorly designed.  I think it just takes a lot of
grunt work with drivers at this point, plus maybe a few new pci helper
functions.

Re: PCI power management

2001-04-19 Thread Benjamin Herrenschmidt

On Thu, Apr 19, 2001 at 11:19:31AM +0100, Benjamin Herrenschmidt wrote:
 Hi ! Glad to see things moving around Power Management ;)
 
 This was originally a private reply to Patrick Mochel, but the e-mail
 kept getting longer and longer :)
 
 Note: we have setup a list for PM issues
 
 http://lists.sourceforge.net/lists/listinfo/linux-pm-devel

Oo

*tries to subscribe*

Doh! The silly thing is trying to use the From_ header on the confirm
rather then the From: header and so I can't subscribe. Can this get fixed?

Dunno, it's the standard sourceforge/geocrawler list stuff..

Ben.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: PCI power management

2001-04-19 Thread Benjamin Herrenschmidt

  - Some devices just can't be brought back to life from D3 state without
 a PCI reset (ATI Rage M3 for example) and that require some arch specific
 support (when it's possible at all).

Putting on a driver author hat what I want is

   pci_power_on_generic
   pci_power_off_generic
   pci_power_on_null
   pci_power_off_null

At which point most driver writers are having to do no thinking at all about
their device. The PCI layer just requires they pick a function and stick it
in the struct pci_device. 

Could you elaborate about the difference between generic and null
functions ? I'm not sure I understand what you mean...

Note that in the case of chips like the Rage M3, the driver is the only
one to know if it will be able to bring back the card from a power off
state or not. It's the only one to know if it can reconfigure the card
completely without having a BIOS run before it.

I would suggest a call that looks like

pci_power_off(uint mask);

where mask is

  PCI_POWER_MASK_D1 = 0x0001
  PCI_POWER_MASK_D2 = 0x0002
  PCI_POWER_MASK_D3 = 0x0004
  PCI_POWER_MASK_NOCLOCK = 0x0008
  PCI_POWER_MASK_NOPOWER = 0x0010

The driver sets the mask to whatever states it supports bringing the card
back from. We can #define a PCI_POWER_MASK_STD (that would be D1+D2+D3) for
"generic" drivers that don't really know anything but to follow the HW
PCI power management capabilities.

This function would be routed to an arch function, that will in turn
either call the lower-level PCI code to set D1, D2 or D3 mode (the best
supported) or will suspend the card's clock or power if it can and the
driver accepts it.

Typically, on a PowerMac, this function could keep track of which cards
are in D2 or D3 mode (or which drivers allowed for clock suspend) and
would stop the PCI clock once they all asked for it.
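
As a sketch of how the arch side might pick a state from such a mask: the mask values are the ones proposed above, while `pick_power_state`, the platform mask argument, and the ordering (power off deeper than clock off, deeper than D3, D2, D1) are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

#define PCI_POWER_MASK_D1      0x0001
#define PCI_POWER_MASK_D2      0x0002
#define PCI_POWER_MASK_D3      0x0004
#define PCI_POWER_MASK_NOCLOCK 0x0008
#define PCI_POWER_MASK_NOPOWER 0x0010
#define PCI_POWER_MASK_STD \
    (PCI_POWER_MASK_D1 | PCI_POWER_MASK_D2 | PCI_POWER_MASK_D3)

/*
 * Return the single deepest state allowed by both the driver and the
 * platform, or 0 if they have nothing in common (device stays awake).
 */
static unsigned int pick_power_state(unsigned int driver_mask,
                                     unsigned int platform_mask)
{
    /* Assumed ordering, deepest first. */
    static const unsigned int order[] = {
        PCI_POWER_MASK_NOPOWER,
        PCI_POWER_MASK_NOCLOCK,
        PCI_POWER_MASK_D3,
        PCI_POWER_MASK_D2,
        PCI_POWER_MASK_D1,
    };
    unsigned int both = driver_mask & platform_mask;
    size_t i;

    for (i = 0; i < sizeof(order) / sizeof(order[0]); i++) {
        if (both & order[i])
            return order[i];
    }
    return 0;
}
```

A driver for a chip that cannot come back from D3 would pass only D1 and D2, and would end up in D2 even on a platform that could cut power to the slot.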

This doesnt help you. You need device specific support in each case where
bus mastering is occuring and a bus master error could be fatal if missed.
For example on i2o I can easily have 4Mbytes of outstanding I/O between the
message layer and disk, all of which is bus mastering. Only the driver
actually
knows when its idle.

Right. That's a driver issue. The problem would go away if all drivers
properly block their IO queues and wait for all IO to complete when
notified of sleep.

X has hooks for this in XFree 4

The last time I looked at it, those were rather APM-specific. But well, I
guess it's easy to update them. What I'm thinking about is the kernel
side, that is a generic, non-APM and non-ACPI specific way of notifying
userland processes that request it. Some kind of interface allowing
userland to register PM notifiers and have the kernel PM thread be
blocked until the userland code has "acked" the message.

Well, maybe there is already something I missed...

Ben.






Re: PCI power management

2001-04-19 Thread Benjamin Herrenschmidt

null = 'do absolutely nothing'
generic = 'do D3 as per the specification'

The idea being the PM layer would go around calling

   dev->power_off(dev);

as a default notifier for PCI devices.

Ok, I see. I didn't understand that the functions you were talking about
would be defaults to put directly in the pci_dev structure.

And in the case of the cards like that you would need a custom mask. So you'd
do
   pci_set_power_handler(dev, atyfb_power_on, atyfb_power_off)

to get a custom function. For most authors however they can call the power
handler setup just using prerolled functions that do the right thing and know
about any architecture horrors they dont.

Right. However, rare are the drivers that don't need at least to know
that a power management sequence is going on. All bus mastering drivers,
at least, must stop bus mastering (and clearing the bit in the command
register is not enough on a bunch of them). Most drivers have to cleanly
stop ongoing operations, refuse (or block) requests while the driver is
sleeping, etc... and finally configure things back once waking up. I
don't see many cases where a simple "default" function would work.

My current scheme on powerbooks doesn't do half of that... it still sorta
works since I manage to stop all scheduling and shut things down in the
proper order, but it's neither a clean nor a safe way to do things.

I'd rather

   pci_dev->powerstate

or similar as a set of flags in the device.

Ok, agree with that one.

I still consider, however, that the current suspend/resume callbacks in
the pci_dev structure are not the best way to do things. I would have
really preferred that each pci_dev embed a pm notifier structure. In some
cases, we want to pass more than simple suspend/resume messages (suspend
request, suspend now, suspend cancel, and resume are the 4 messages I use
on powerbooks).

Also, this can be generalized to other types of drivers (USB, IEEE1394,
...), possibly passing bus-specific messages.

Ben.






Re: PCI power management

2001-04-20 Thread Benjamin Herrenschmidt

All devices should handle having power removed from them. And, all of the
drivers should as well, since that is the only way we're going to get
power management out of legacy devices and other things on the board. This
involves saving the current context on suspend, and reinitializing the
device, and restoring the context as much as possible when we resume. It
should behave almost identically to the boot-time init code.

Right. In fact, at the driver level, power management involves
2 different things:

 - Handling context save and restore of the device state

 - Blocking "user" requests properly (I mean users of the driver, which
   can be a kernel service). In some cases, this latter
   thing can be done by returning errors, provided that upper level
   drivers are ready to handle them. For example, the IDE layer should
   probably just block the IO queues while the IDE subsystem is powered
   off (not talking about disk sleep, but complete power off of the
   controller), while a USB host controller should probably return
   errors to URBs sent by drivers to a sleeping controller, since those
   upper-level drivers should have been put to sleep before the host
   controller.
   That part is almost completely overlooked right now.

  - Some devices just can't be brought back to life from D3 state without
 a PCI reset (ATI Rage M3 for example) and that require some arch specific
 support (when it's possible at all).

When a device comes out of D3[hot], the equivalent of a soft reset is
performed. From D3[cold], PCI RST# is asserted, and the device must be
completely reinitialized.

Some devices (bad bad HW designers ;) just can't do it themselves. The
Rage M3 requires the host to assert PCI RST#, and some motherboards
provide no documented facility for that (it might be possible with Apple
ASICs for example, it's just not documented).

Also, still in the case of the Rage M3, we just can't bring it out of
D3 for the same reason we can't bring the r128 in the AGP slot of a
Cube Mac out of PowerOff: the complete init sequence of those chips
is dependent on the chip revision and requires some information about
undocumented registers that we don't have (at least that's my understanding
from talks with ATI), and so can basically only be done by a BIOS (or
OpenFirmware driver in my case), and we can't run that on wakeup (OF
is dead on macs once the kernel takes over). So we have to limit
ourselves to D2 mode on machines that don't remove power from the
slots (powerbooks, ibooks and imacs) and we can't do deep sleep at all
on machines that remove power from the slot (Cube, G4s, ...), at least
until we figure out the proper init sequence for those cards.

So the point here, as far as the kernel is concerned, is that drivers
should have a way to let the kernel know the min/max power state they
support.

It's not about what the device supports, it's about what the driver
supports. STR and STD imply that all devices will lose power. The drivers
are responsible for reinitializing the devices, regardless of what that
may involve. 

Right. I'm typing too fast, but that's what I meant.

Hmm. How about doing two walks of the device tree - the first calls a
save_state() function for each device, which gives it the opportunity to
allocate memory and save appropriate registers, etc. The second actually
places the device in a low power state.  

This could give the kernel the chance to disable swap, or for the action 
to be cancelled before anything is actually put to sleep.

Yup. That's approximately what I do with the PPC-specific
"sleep notifiers" we are using. The only difference is that the real
save state is done on the "sleep now" (latest) request, not on the
"sleep request" (earlier) request. 

The basic idea here is that the first pass will do all of the memory
allocation (or whatever requires all system resources to be available,
that can be sending a special power management message to the device,
like enabling the remote wakeup on USB, etc...). So this first pass
requires system services (all other drivers if you prefer, especially
the swap device) to be fully alive.

The second pass will do the actual IO blocking, state save, and eventually
enter device suspend mode for cases where it's controlled by the driver.
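
As a rough illustration of the two passes described above, here is a minimal user-space sketch in C. Everything in it is hypothetical: struct pm_dev, its callbacks, pm_refuse() and pm_suspend_all() are made-up names, not existing kernel interfaces, and walking the second pass in reverse order is an assumption rather than something stated in this thread.

```c
#include <stddef.h>

/* Hypothetical device record; none of these names exist in the kernel. */
struct pm_dev {
	const char *name;
	int (*prepare)(struct pm_dev *dev);	/* pass 1: may allocate, may fail */
	void (*unprepare)(struct pm_dev *dev);	/* undo pass 1 on cancel */
	void (*suspend)(struct pm_dev *dev);	/* pass 2: block IO, save state */
	int prepared;
	int suspended;
};

/* Example pass-1 callback that always vetoes the sleep request. */
int pm_refuse(struct pm_dev *dev)
{
	(void)dev;
	return -1;
}

/*
 * Returns 0 once every device has been suspended, or the first non-zero
 * value returned by a prepare() callback. In the error case nothing has
 * been suspended and every earlier prepare() has been rolled back.
 */
int pm_suspend_all(struct pm_dev *devs, size_t n)
{
	size_t i;
	int err;

	/* Pass 1: system services (swap included) are still fully alive. */
	for (i = 0; i < n; i++) {
		err = devs[i].prepare ? devs[i].prepare(&devs[i]) : 0;
		if (err) {
			/* Cancel: undo the devices that were already prepared. */
			while (i-- > 0) {
				if (devs[i].unprepare)
					devs[i].unprepare(&devs[i]);
				devs[i].prepared = 0;
			}
			return err;
		}
		devs[i].prepared = 1;
	}

	/* Pass 2: no allocation from here on (reverse order is assumed). */
	for (i = n; i-- > 0; ) {
		if (devs[i].suspend)
			devs[i].suspend(&devs[i]);
		devs[i].suspended = 1;
	}
	return 0;
}
```

The point of the split is that a veto in pass 1 costs nothing: no device has been touched yet, so the request can simply be cancelled.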

  - On SMP, we need some way to stop other CPUs in the scheduler
 while running the last round of sleep (putting devices to sleep) at least
 until all IO layers in Linux can properly handle blocking of IO queues
 while the device sleeps.

Ugh. SMP. Not yet.

Well, if all drivers properly handle blocking of IOs, the SMP issue will
be easy to handle. Having the other CPUs run is not a problem as long as
any IO triggered by processes on those is properly blocked by sleeping
drivers. All that is needed is a cross-CPU function call to force the other
CPU into an idle loop (or an idle/sleep loop on PPC) on the very last
step of entering suspend mode.

  - We need a generic (non-x86 APM or ACPI dependant) way of 

Re: PCI power management

2001-04-20 Thread Benjamin Herrenschmidt

 Some devices (bad bad HW designers ;) just can't do it themselves. The
 Rage M3 requires the host to assert PCI RST#, and some motherboards
 provide no documented facility for that (it might be possible with Apple
 ASICs for example, it's just not documented).

Why should we support such a non-spec device?  Tell ATI to fix their
hardware, and tell users (a) not to use the hardware, or (b) use the
hardware with the knowledge that you are screwed when it comes to Power
Management.

Unless there are more cases like this, this should not factor at all
into the modifications to the PCI and PM code...

Well, I can tell all PowerBook and iBook users to forget about sleep...

Also, that would not be the first time we have to deal with poorly
documented hardware. I don't think we should refuse to handle any
hardware that is out of spec... it would be like saying Linux doesn't
support any x86 with a broken BIOS...

It's not so complicated to have the minimum flexibility for the driver
to tell its maximum supported power level, and I don't see why it would
be a problem to use D2 instead of D3 when we don't support D3 for a given
device (either because the HW is broken, undocumented, or because our
driver just doesn't know how to bring the chip back to life).

If the motherboard _requires_ it (because it will cut power from the chip),
then we can refuse to enter sleep when one driver can't do it (instead of
letting the user crash the box badly).

In any case, I believe you are focusing on a point of detail. All
I'm asking for (in this specific case) is a simple mask of flags set
by the driver to tell what it can handle. It's also useful for
devices that don't support PM on machines whose motherboards provide
a facility to turn OFF power on selected cards. It would allow us to
turn off cards for drivers that can handle recovering.
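
To make the "mask of flags" idea concrete, here is a small sketch in C. The PM_CAP_* bits and both helpers are hypothetical names invented for this illustration; nothing like them exists in the PCI code as far as this thread shows.

```c
/* Hypothetical capability bits, one per PCI power state. */
enum {
	PM_CAP_D0 = 1 << 0,
	PM_CAP_D1 = 1 << 1,
	PM_CAP_D2 = 1 << 2,
	PM_CAP_D3 = 1 << 3,
};

/*
 * Pick the deepest state (0..3) the driver says it can recover from,
 * without going deeper than 'wanted'. D0 is the fallback.
 */
int pm_pick_state(unsigned int driver_caps, int wanted)
{
	int state;

	for (state = wanted; state > 0; state--)
		if (driver_caps & (1u << state))
			return state;
	return 0;
}

/*
 * If the motherboard cuts power to the slot, only a driver that can
 * re-init from D3 may let the machine sleep. Returns 1 if sleep is OK.
 */
int pm_can_sleep(unsigned int driver_caps, int slot_loses_power)
{
	if (slot_loses_power && !(driver_caps & PM_CAP_D3))
		return 0;
	return 1;
}
```

With a driver that only advertises D0 and D2 (the Rage M3 case above), a request for D3 falls back to D2 on a PowerBook, and sleep is refused on a machine that removes power from the slot.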

Also, I don't think the problem of powering the chip back up and
re-initing it from scratch is specific to those ATI chips. Look at
XFree, it has to run a BIOS emulator to soft boot video chips. On
PCs, I believe the BIOS re-inits them when waking up
from an APM or ACPI suspend. On non-PCs, where suspend is not handled
by the firmware but directly by the kernel, that's not the case.

Ben.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: isa_read/write not available on ppc - solution suggestions ??

2001-05-02 Thread Benjamin Herrenschmidt

 I would suggest the opposite approach instead: make the PPC just support
 isa_readx/isa_writex instead.

We can certainly do that, no problem.

BUT that won't get a token ring pcmcia card working in the newer
powerbooks, such as the titanium G4 powerbook, because the PCI host
bridge doesn't map any cpu addresses to the bottom 16MB of PCI memory
space.  This is not a problem as far as pcmcia cards are concerned -
the pcmcia stuff just picks an appropriate address (typically in the
range 0x9000 - 0x9fff) and sets the pcmcia/cardbus bridge to
map that to the card.  But it means that the physical addresses for
the card's memory space will be above the 16MB point, so it is
essential to do the ioremap.

What about isa_ioremap ? The result from it is a token passed to
isa_readx/isa_writex and the arch side can be implemented with a
couple of #defines on x86.

It's easy to change I believe, and it paves the way for archs to
add a notion of token in the high bits (as we _know_ an ISA address
is small). Those tokens can be used by the arch to route to the proper PCI
bus when several host bridges exist, to route to PCMCIA when the
PCMCIA uses its own ISA memory space like on PPC, etc...
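
A sketch of what such a token could look like, in C. The 24-bit split follows from the 16MB ISA memory space mentioned earlier in the thread; the type and function names (isa_token_t, isa_make_token() and friends) are made up for the example and are not an existing API.

```c
#define ISA_ADDR_BITS	24	/* ISA memory space is 16MB */
#define ISA_ADDR_MASK	((1UL << ISA_ADDR_BITS) - 1)

/* Hypothetical cookie: bus id in the high bits, ISA address below it. */
typedef unsigned long isa_token_t;

isa_token_t isa_make_token(unsigned int bus_id, unsigned long isa_addr)
{
	return ((unsigned long)bus_id << ISA_ADDR_BITS) |
	       (isa_addr & ISA_ADDR_MASK);
}

/* Which bus (PCI host bridge, PCMCIA, ...) the arch should route to. */
unsigned int isa_token_bus(isa_token_t token)
{
	return (unsigned int)(token >> ISA_ADDR_BITS);
}

/* The plain ISA address, as a driver would have written it. */
unsigned long isa_token_addr(isa_token_t token)
{
	return token & ISA_ADDR_MASK;
}
```

With bus id 0 the token is just the ISA address, so an arch with a single ISA space (x86) could keep isa_ioremap as a trivial #define.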

Later on, we can see things like

ulong pci_get_bus_isa_base(int busno);

And the same for PCMCIA & whatever 16-bit busses can exist on
embedded hardware.

That way, support for multiple busses (either real ISA, embedded custom
busses using legacy devices, several PCI hosts with ISA bridges, ...)
can be implemented very easily, in most cases just by adjusting the
drivers' probe code.

I'd like to see the same kind of things for IOs in fact but that's
another debate ;)

Regards,
Ben.




pci_disable_device() vs. arch

2001-06-16 Thread Benjamin Herrenschmidt

Hi !

Would it make sense to add a 

pcibios_disable_device(pci_dev*) called from the end of 
pci_disable_device() ?

I'm adding a call to it to sungem along with other pmac stuff
so that the chip can be properly powered down (actually it's not
really powered down but unclocked) after module removal.
Of course, the arch code must be able to catch it in order to
play with the various UniNorth control bits.

Note that my current gmac driver does shut the chip down when
the interface is down, which makes it a bit more useful for
laptops as most users currently compile the driver in the kernel.

I have nothing against changing the policy if you prefer, so that
users will now have to rmmod the driver once done with the
interface to save power.

Ben.





Re: Going beyond 256 PCI buses

2001-06-14 Thread Benjamin Herrenschmidt

It's funny you mention this because I have been working on something
similar recently.  Basically making xfree86 int10 and VGA poking happy
on sparc64.

Heh, world is small ;)

But this has no real use in the kernel.  (actually I take this back,
read below)

yup, fbcon at least... 

You have a primary VGA device, that is the one the bios (boot
firmware, whatever you want to call it) enables to respond to I/O and
MEM accesses, the rest are configured to VGA palette snoop and that's
it.  The primary VGA device is the kernel console (unless using some
fbcon driver of course), and that's that.

Yup, fbcon is what I have in mind here

The secondary VGA devices are only interesting to things like the X
server, and xfree86 does all the enable/disable/bridge-forward-vga
magic when doing multi-head.

and multihead fbcon. 

Perhaps, you might need to program the VGA resources of some device to
use it in a fbcon driver (ie. to init it or set screen crt parameters,
I believe the tdfx requires the latter which is why I'm having a devil
of a time getting it to work on my sparc64 box).  This would be a
seperate issue, and I would not mind at all seeing an abstraction for
this sort of thing, let us call it:

   struct pci_vga_resource {
           struct resource io, mem;
   };

   int pci_route_vga(struct pci_dev *pdev, struct pci_vga_resource *res);
   void pci_restore_vga(void);

 [.../...]

Well... that would work for VGA itself (note that this semaphore
you are talking about should be shared some way with the /proc
interface so XFree can be properly sync'ed as well).

But I still think it may be useful to generalize the idea to
all kinds of legacy IO & PIOs. I definitely agree that VGA is a kind
of special case, mostly because of the necessary exclusion on
the VGA IO response.

But what about all those legacy drivers that will issue inx/outx
calls without an ioremap ? Should they call ioremap with hard-coded
legacy addresses ? There are chipsets containing things like legacy
timers, legacy keyboard controllers, etc... and in some (rare I admit)
cases, those may be scattered (or multiplied) on various domains.
If we decide we don't handle those, then well, I won't argue more
(it's mostly an aesthetic rant on my side ;), but the problem of
whether they should call ioremap or not is there, and since the
ISA bus can be mapped anywhere in the bus space by the host bridge,
there needs to be a way to retrieve the ISA resources in general for
a given domain.

That's why I'd suggest something like 

pci_get_isa_mem(struct resource* isa_mem);
pci_get_isa_io(struct resource* isa_io);

(I prefer 2 different functions as some platforms like powermac just
don't provide the ISA mem space at all, there's no way to generate
a memory cycle in the low-address range on the PCI bus of those and
they don't have a PCI-ISA bridge), so I like having the ability of
one of the functions returning an error and not the other.

Also, having the same ioremap() call for both mem IO and PIO means
that things like 0xc cannot be interpreted. It's a valid ISA-mem
address in the VGA space and a valid PIO address on a PCI bus that
supports 64k of PIO space.

I believe it would make things clearer (and probably implementation
simpler) to separate ioremap and pioremap.

Ben.

So you'd go:

   struct pci_vga_resource vga_res;
   int err;

   err = pci_route_vga(tdfx_pdev, &vga_res);

   if (err)
           barf();
   vga_ports = ioremap(vga_res.io.start, vga_res.io.end-vga_res.io.start+1);
   program_video_crtc_params(vga_ports);
   iounmap(vga_ports);
   vga_fb = ioremap(vga_res.mem.start, vga_res.mem.end-vga_res.mem.start+1);
   clear_vga_fb(vga_fb);
   iounmap(vga_fb);

   pci_restore_vga();
   
pci_route_vga does several things:

1) It saves the current VGA routing information.
2) It configures busses and VGA devices such that PDEV responds to
   VGA accesses, and other VGA devices just VGA palette snoop.
3) Fills in the pci_vga_resources with
   io: 0x320--0x340 in domain PDEV lives, vga I/O regs
   mem: 0xa--0xc in domain PDEV lives, video ram

pci_restore_vga, as the name suggests, restores things back to how
they were before the pci_route_vga() call.  Maybe also some semaphore
so only one driver can do this at once and you can't drop the
semaphore without calling pci_restore_vga().  VC switching into the X
server would need to grab this thing too.
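
The save/route/restore discipline described above can be modelled in a few lines of C. This is only a toy model under loud assumptions: struct vga_router and its functions are invented for the illustration, devices are plain integers, and a real implementation would sleep on a semaphore instead of returning "busy" and would reprogram bridges rather than a struct field.

```c
/* Hypothetical model: which device currently answers VGA cycles. */
struct vga_router {
	int owner;		/* device id decoding VGA IO/MEM */
	int saved_owner;	/* routing to put back on restore */
	int locked;		/* stands in for the semaphore */
};

/*
 * Save the current routing and give VGA decoding to 'dev'.
 * Returns 0 on success, -1 if another driver holds the route.
 */
int vga_route(struct vga_router *r, int dev)
{
	if (r->locked)
		return -1;
	r->saved_owner = r->owner;
	r->owner = dev;
	r->locked = 1;
	return 0;
}

/* Restore the saved routing; the lock can only be dropped this way. */
void vga_restore(struct vga_router *r)
{
	if (!r->locked)
		return;
	r->owner = r->saved_owner;
	r->locked = 0;
}
```

Tying the unlock to the restore is what guarantees that nobody drops the semaphore while the routing is still pointing at a secondary card.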




Re: pci_disable_device() vs. arch

2001-06-16 Thread Benjamin Herrenschmidt


It's not clutter -- what you are doing is hiding pieces of the driver
from the driver maintainer.  pcibios_enable_device should not be
cluttered up with such mess, too.

Well... pcibios_enable_device() has to at least make sure the device
gets powered up as it's powered down after PCI probe. Except if we
end up calling pci_set_power_state() to power it up early in the
sungem driver.

I point out that I recently fixed a bug where Via interrupts were being
assigned incorrectly.  If I had not done a global grep for Via
irq-related code, I would have missed the spot where the PPC code was
doing a kludge for one of the four on-board Via devices, hardcoding the
USB irq number to 11.

Hrm... interrupt routing on some PPC-based motherboards is quite a
mess, fortunately that's not the case on pmacs. The IRQ assignment
has to be part of the arch AFAIK, only the arch knows on which
interrupt line of the controller a given chip is wired and how
interrupt controllers are cascaded.

Correct.  If your driver uses the API correctly, then when/if we want to
mess around with hotplug resource assignment, we can un-assign resources
as we like.  Since there aren't too many users of pci_disable_device so
far, I want to make sure early adopters get it right.

Well... at least with sungem, there's no such risk as the entire bus
(up to the host bridge) where it lives is internal to the UniNorth
ASIC.

Can you give a -specific- example of arch code that is -not- sungem
related, but needs to occur when one powers-down a sungem MAC?

If the PM code is related to sungem, it belongs in sungem.
So far I don't see a need for arch-specific hooks anywhere...

Hrm... let me try again...

Powering down individual devices can be controlled by the PCI PM
capabilities, or in some cases (at least 2 cases here on UniNorth
based pmacs) by other bits in the host bridge.

What I suggest is for pci_bus to have an optional set_power_state
function that is called when a device on that bus calls
pci_set_power_state(). This function would then be able to implement
those cases where power control is possible, while not done
via PCI PM caps.

A pci_bus structure exists for both root busses and busses under
PCI-PCI bridges, so effectively, there's a pci_bus structure per
bridge (being host or PCI-PCI). I believe it makes sense for
the bridge to have a way to handle the child power state.
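
Roughly, the hook could look like this. It is a user-space sketch with made-up types (struct xbus, struct xdev) standing in for pci_bus and pci_dev; the return convention of the hook (0 means "handled") and the sample hook itself are assumptions of the example, not a proposal for the actual prototype.

```c
#include <stddef.h>

struct xdev;

/* Stand-in for pci_bus with the proposed optional hook. */
struct xbus {
	/* Returns 0 if the bridge handled the request itself. */
	int (*set_power_state)(struct xbus *bus, struct xdev *dev, int state);
};

/* Stand-in for pci_dev. */
struct xdev {
	struct xbus *bus;
	int has_pm_cap;		/* device implements PCI PM capabilities */
	int state;		/* current D-state, 0..3 */
};

/* Sample host-bridge hook that accepts every request (e.g. unclocking). */
int xbus_hook_accept(struct xbus *bus, struct xdev *dev, int state)
{
	(void)bus; (void)dev; (void)state;
	return 0;
}

/* Try the bridge hook first, then fall back to the PM capability. */
int x_set_power_state(struct xdev *dev, int state)
{
	struct xbus *bus = dev->bus;

	if (bus && bus->set_power_state &&
	    bus->set_power_state(bus, dev, state) == 0) {
		dev->state = state;
		return 0;
	}
	if (!dev->has_pm_cap)
		return -1;	/* nobody knows how to do it */
	dev->state = state;	/* stands in for the PM register write */
	return 0;
}
```

A driver keeps calling the one generic function; whether the work is done via PM caps or via bits in the host bridge stays hidden behind the bus.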

Ben





VFS locking HFS problems (2.4.6pre6)

2001-06-29 Thread Benjamin Herrenschmidt

I've had a deadlock twice with 2.4.6pre6 today. It's an SMP kernel
running on an UP box (a PowerBook Pismo).

The deadlock happens in the HFS filesystem in hfs_cat_put(), apparently
(quickly looking at addresses) in spin_lock().

I don't have the complete backtrace at hand right now, but it basically
went up to kswapd without anything evidently getting that spinlock,
I'll try to gather more details.

So my question: is there any document explaining the various locking
requirements & re-entrancy possibilities in a filesystem ?

What I think might happen after a quick look is that HFS may be causing
schedule() to be called while holding the spinlock, and then gets
re-entered from another process context. I have to look at it in more
detail (is there an HFS maintainer ?) but some background information
on VFS locking & reentrancy issues would be helpful.

Ben.







Re: [RFC] I/O Access Abstractions

2001-07-02 Thread Benjamin Herrenschmidt

  Last time I checked, ioremap didn't work for inb() and outb().
 
 It should :)

It doesn't need to.

pci_find_device returns the io address and can return a cookie, ditto 
isapnp etc

Yes, but doing that requires 2 annoying things:

 - Parsing of this cookie on each inx/outx access, which can
take a bit of time (typically looking up the host bridge)

 - On machines with PIO mapped in CPU mem space and several
(or large) IO regions, they must all be mapped all the time,
which is a waste of kernel virtual space.

Why not, at least for 2.5, define a kind of pioremap that
would be the equivalent of ioremap for PIO ?

In fact, I'd rather have all this abstracted in a

ioremap_resource(struct resource *, int flags)
iounmap_resource(struct resource *)

(flags is just an idea that could be used to pass things
like specific caching attributes, or whatever makes sense to
a given arch).
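
To show why passing the resource (rather than a bare address) removes the ambiguity, here is a small sketch in C. The names (struct xresource, struct xbridge, x_ioremap_resource()) and the window addresses are hypothetical; a real implementation would set up a kernel mapping rather than just add a base.

```c
/* Hypothetical resource type bits; the values are arbitrary. */
#define XRES_IO		0x0100UL	/* port IO space */
#define XRES_MEM	0x0200UL	/* memory space */

struct xresource {
	unsigned long start, end, flags;
};

/* Where one host bridge exposes each space in CPU address space. */
struct xbridge {
	unsigned long io_base;
	unsigned long mem_base;
};

/*
 * Returns the CPU address to use for the resource, or 0 if the
 * resource type is unknown. The flags select the address space.
 */
unsigned long x_ioremap_resource(const struct xbridge *br,
				 const struct xresource *res)
{
	if (res->flags & XRES_IO)
		return br->io_base + res->start;
	if (res->flags & XRES_MEM)
		return br->mem_base + res->start;
	return 0;
}
```

The same bus address in the two spaces maps to two different CPU addresses, which a single ioremap(address) call could never tell apart.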

The distinction between inx/outx & readx/writex would still
make sense at least for x86.

Ben.





Re: [RFC] I/O Access Abstractions

2001-07-02 Thread Benjamin Herrenschmidt

Last time I checked, ioremap didn't work for inb() and outb().

ioremap itself cannot work for inb/outb as they are different
address spaces with potentially overlapping addresses, I don't
see how a single function would handle both... except if we
pass it a struct resource instead of the address.

Ben.





Re: [Acpi] Re: ACPI fundamental locking problems

2001-07-06 Thread Benjamin Herrenschmidt


Nope.

I do not want to maintain two interfaces. If we make user space the way to
do these things, then we will do pretty much most of the driver setup etc
in user space. We'd have to: we'd enter user space before drivers have had
a chance to initialize, exactly because features like these can change
the device mappings etc.

And I don't want to have two completely different bootup paths.

I agree. Also, having this userland step would help for things like
booting from a FireWire or USB hard disk. I hacked the SBP2 (FW)
driver to be usable as a boot device, but this involved adding an
ugly schedule() loop for a couple of seconds before mounting root
in order to leave some time for the drive to be probed. Also, on
such dynamic busses, you can't really know which device major/minor
a given drive will be assigned.

Having a userland mechanism here would allow waiting for all devices
to be probed, reading of the disk GUID (on fw at least) to figure
out where the real root device is, etc... Even displaying a nice
UI to let the user pick a root device if none is found, etc...

So your idea fixes more than just the ACPI problems ;)

Ben.





ide_revalidate_disk() fix

2001-02-12 Thread Benjamin Herrenschmidt

Hi Andre !

Any reason other than usual programmer "too many things to remember" for
2.4 lacking the small ide_revalidate_disk() fix we did recently in 2.2 to
keep the blocksize of the device intact ? (Just diff the 2 functions,
it's pretty obvious)

I'd be glad to send Linus a patch, but I believe he won't accept an ide.c
patch that doesn't originate from you ;)

Regards,
Ben.




Re: PCI GART (?)

2001-02-13 Thread Benjamin Herrenschmidt

 I have RTFM but on the matter of enabling DRI for the
 ATI Mobility video chipset, which on that notebook is
 a PCI model, there is practically nil information. The
 DRI website mentions using PCI GART, but there is no
 option for that in the kernel. How do I enable this?

You need to get XFree86 CVS and really the right place to ask
is the XFree86 folks. The standard kernel doesn't include pcigart

Michel, FYI, PCI GART is a feature of the video chipset, not the host
bridge, and so is not directly related to the kernel (there's no generic
PCI GART driver like there is an AGP GART driver). AFAIK, the only PCI
GART implementation so far is for rage 128 (or derived, like the M3), and
is available in the "ati-pcigart-0-0-1-branch" DRI CVS branch. You need
to compile the DRM inside this X server version, not the kernel one.

Ben.






RE: kernel_thread vs. zombie

2001-03-22 Thread Benjamin Herrenschmidt

daemonize() makes calls that are all protected with the
big kernel lock in do_exit(). All usages of daemonize have
the big kernel lock held. So I guess it just needs it.

Please let me know whether you have success if it makes
a difference with having it held.

With a bit more experiments, I have this behaviour:

(I hold the kernel lock, daemonize(), and release the kernel lock, then do
my probe thing which takes a few seconds, and let the thread die by itself)

 - When started during boot (low PID (9)) It becomes a zombie
 - When started from a process that quits after sending the ioctl,
   it is correctly "garbage collected".
 - When started from a process that stays around, it becomes a zombie too

So something is not working, or I'm missing something obvious, or whatever...

Any clue ?

Ben.







console.c unblank_screen problem

2001-03-25 Thread Benjamin Herrenschmidt

There is a problem with the power management code for console.c

The current code calls do_blank_screen(0); on PM_SUSPEND, and
unblank_screen() on PM_RESUME.

The problem happens when X is the current display while putting the
machine to sleep. The do_blank_screen(0) code will do nothing as
the console is not in KD_TEXT mode.
However, unblank_screen has no such protection. That means that
on wakeup, the cursor timer & console blank timers will be re-enabled
while X is frontmost, causing the blinking cursor to be displayed on
top of X, and other possible issues.

I hacked the following patch to work around this. It appears to work
fine, but since the console code is pretty complex, I'm not sure about
possible side effects and I'd like some comments before submitting it
to Linus:

(Don't worry about the {} I added, I just noticed them and will remove
them before submitting ;)

--- 1.2/drivers/char/console.c  Sat Feb 10 18:54:15 2001
+++ edited/drivers/char/console.c   Sun Mar 25 17:57:46 2001
@@ -2595,8 +2595,9 @@
 	int currcons = fg_console;
 	int i;
 
-	if (console_blanked)
+	if (console_blanked) {
 		return;
+	}
 
 	/* entering graphics mode? */
 	if (entering_gfx) {
@@ -2660,12 +2661,16 @@
 		printk("unblank_screen: tty %d not allocated ??\n", fg_console+1);
 		return;
 	}
+	currcons = fg_console;
+	if (vcmode != KD_TEXT) {
+		console_blanked = 0;
+		return;
+	}
 	console_timer.function = blank_screen;
 	if (blankinterval) {
 		mod_timer(&console_timer, jiffies + blankinterval);
 	}
 
-	currcons = fg_console;
 	console_blanked = 0;
 	if (console_blank_hook)
 		console_blank_hook(0);





[PATCH] binfmt_elf.c fix with PPC update

2001-04-11 Thread Benjamin Herrenschmidt

Hi Linus !

Enclosed is a (not big ;) patch against 2.4.4pre1 that does a few
inter-dependent things, one being a bug fix for everybody, the other
is a mix of bug fix & cleanup on PPC:

 - binfmt_elf.c : fix DLINFO_ITEMS so that final alignment on the
   stack takes into account the AT_NULL entry (or it won't align).

   Remove hackish PPC addition (now re-done properly). Add a simple
   way (via 2 macros) for include/asm-xxx/elf.h to add platform
   specific entries to it while keeping the alignment right.

 - Remove shove_aux_table() in arch/ppc/kernel/process.c. That routine
   used to look up the aux table on the stack and move it up to align
   it to a 16-byte boundary (ABI). Now done via the ARCH_DLINFO in
   include/asm-ppc/elf.h

 - Re-implement the alignment mechanism properly, taking into account
   a pair of glibc bugs we had until now (not doing so results in breaking
   existing userland binaries).

 - Add 3 new aux table entries for PPC containing some cache line size
   information. Those are part of our PPC SysV ABI, and were never
   properly implemented, possibly because of a conflict in the AT_
   numbers assigned to them. This is now "fixed" and the next glibc
   release will understand them. That information is necessary for
   glibc to properly handle various brands of PPC CPUs when doing
   cache invalidates or using cache tricks to speed up copy operations.
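
For readers who want the arithmetic behind the DLINFO_ITEMS change, here is a simplified sketch in C. It only looks at the aux table in isolation; the real computation in binfmt_elf.c also counts argc, argv and envp, and the helper names below are made up for this example.

```c
#include <stddef.h>

/* Round an address down to a 16-byte boundary (the ~15UL mask). */
unsigned long align_down_16(unsigned long addr)
{
	return addr & ~15UL;
}

/* Each aux entry is a (type, value) pair of machine words. */
size_t auxv_bytes(size_t entries, size_t word_size)
{
	return entries * 2 * word_size;
}

/* Does a table of this size preserve 16-byte alignment below it? */
int auxv_keeps_16_alignment(size_t entries, size_t word_size)
{
	return auxv_bytes(entries, word_size) % 16 == 0;
}
```

On a 32-bit machine each entry takes 8 bytes, so 13 entries plus the terminating AT_NULL make 14 entries, or 112 bytes, a multiple of 16. Counting only 13 gives 104 bytes, which is off by half an alignment unit.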

The patch has been tested on PPC, glibc is ready for it, and it's
simple enough not to damage other archs. Next come the PPC
AT_HWCAP infos, still being worked on.

Feel free to comment, disagree, whatever. I'd be glad however if you
could explain to me why if you don't want to merge that now, as our glibc
maintainer is waiting for it ;)

Regards,
Ben.

--- linuxppc_2_4_orig/fs/binfmt_elf.c   Wed Apr 11 20:18:59 2001
+++ linuxppc_2_4/fs/binfmt_elf.c        Wed Apr 11 18:38:47 2001
@@ -36,7 +36,7 @@
 #include <asm/param.h>
 #include <asm/pgalloc.h>
 
-#define DLINFO_ITEMS 13
+#define DLINFO_ITEMS 14
 
 #include <linux/elf.h>
 
@@ -135,12 +135,13 @@
 
 	/*
 	 * Force 16 byte _final_ alignment here for generality.
-	 * Leave an extra 16 bytes free so that on the PowerPC we
-	 * can move the aux table up to start on a 16-byte boundary.
 	 */
-	sp = (elf_addr_t *)((~15UL & (unsigned long)(u_platform)) - 16UL);
+	sp = (elf_addr_t *)(~15UL & (unsigned long)(u_platform));
 	csp = sp;
 	csp -= DLINFO_ITEMS*2 + (k_platform ? 2 : 0);
+#ifdef DLINFO_ARCH_ITEMS
+	csp -= DLINFO_ARCH_ITEMS*2;
+#endif
 	csp -= envc+1;
 	csp -= argc+1;
 	csp -= (!ibcs ? 3 : 1);	/* argc itself */
@@ -174,6 +175,13 @@
 	NEW_AUX_ENT(10, AT_EUID, (elf_addr_t) current->euid);
 	NEW_AUX_ENT(11, AT_GID, (elf_addr_t) current->gid);
 	NEW_AUX_ENT(12, AT_EGID, (elf_addr_t) current->egid);
+#ifdef ARCH_DLINFO
+	/*
+	 * ARCH_DLINFO must come last so platform specific code can enforce
+	 * special alignment requirements on the AUXV if necessary (eg. PPC).
+	 */
+	ARCH_DLINFO;
+#endif
 #undef NEW_AUX_ENT
 
 	sp -= envc+1;
--- linuxppc_2_4_orig/arch/ppc/kernel/process.c Mon Apr  2 19:25:35 2001
+++ linuxppc_2_4/arch/ppc/kernel/process.c  Wed Apr 11 18:40:51 2001
@@ -378,45 +378,6 @@
 }
 
 /*
- * XXX ld.so expects the auxiliary table to start on
- * a 16-byte boundary, so we have to find it and
- * move it up. :-(
- */
-static inline void shove_aux_table(unsigned long sp)
-{
-	int argc;
-	char *p;
-	unsigned long e;
-	unsigned long aux_start, offset;
-
-	if (__get_user(argc, (int *)sp))
-		return;
-	sp += sizeof(int) + (argc + 1) * sizeof(char *);
-	/* skip over the environment pointers */
-	do {
-		if (__get_user(p, (char **)sp))
-			return;
-		sp += sizeof(char *);
-	} while (p != NULL);
-	aux_start = sp;
-	/* skip to the end of the auxiliary table */
-	do {
-		if (__get_user(e, (unsigned long *)sp))
-			return;
-		sp += 2 * sizeof(unsigned long);
-	} while (e != AT_NULL);
-	offset = ((aux_start + 15) & ~15) - aux_start;
-	if (offset != 0) {
-		do {
-			sp -= sizeof(unsigned long);
-			if (__get_user(e, (unsigned long *)sp)
-			    || __put_user(e, (unsigned long *)(sp + offset)))
-				return;
-		} while (sp > aux_start);
-	}
-}
-
-/*
  * Set up a thread for executing a new program
  */
 void start_thread(struct pt_regs *regs, unsigned long nip, unsigned long sp)
@@ -425,7 +386,6 @@
 	regs->nip = nip;
 	regs->gpr[1] = sp;
 	regs->msr = MSR_USER;
-	shove_aux_table(sp);
 	if (last_task_used_math == current)
 		last_task_used_math = 0;
 	if (last_task_used_altivec == current)
--- 

[PATCH] [resent] binfmt_elf.c fix with PPC update

2001-04-16 Thread Benjamin Herrenschmidt

Hi Linus !

Enclosed is a (not big ;) patch against 2.4.4pre1 that does a few
inter-dependant things, one beeing a bug fix for everybody, the other
is a mix of bug fix  cleanup on PPC:

 - binfmt_elf.c : fix DLINFO_ITEMS so that final alignement on the
   stack takes into account the AT_NULL entry (or it won't align).

   Remove hackish PPC addition (now re-done properly). Add a simple
   way (via 2 macros) for include/asm-xxx/elf.h to add platform
   specific entries to it while keeping the alignement right.

 - Remove shove_aux_table() in arch/ppc/kernel/process.c. That routine
   used to lookup the aux table on the stack and move it up to align
   it to a 16 bytes boundary (ABI). Now done via the ARCH_DLINFO in
   include/asm-ppc/elf.h

 - Re-implement the alignement mecanism properly, taking into account
   a pair glibc bugs we had until now (not doing so results in breaking
   existing userland binaries).

 - Add 3 new aux table entries for PPC containing some cache line size
   information. Those are part of our PPC SysV ABI, and were never
   properly implemented, possibly because of conflict in the AT_
   numbers assigned to them. This was now "fixed" and the next glibc
   release will understand them. Those informations are necessary for
   glibc to properly handle various brands of PPC CPUs when doing
   cache invalidates or using cache trick to speed up copy operations.

The patch has been tested on PPC, glibc is ready for it, and it's
simple enough not to damage other archs. Next are coming the PPC
AT_HWCAP infos, still beeing worked on.

Feel free to comment, not agree, whatever, I'd be glad however if you
could explain me if you don't want to merge that now as our glibc
maintainer is waiting for it ;)

Regards,
Ben.

--- linuxppc_2_4_orig/fs/binfmt_elf.c   Wed Apr 11 20:18:59 2001
+++ linuxppc_2_4/fs/binfmt_elf.cWed Apr 11 18:38:47 2001
@@ -36,7 +36,7 @@
 #include asm/param.h
 #include asm/pgalloc.h
 
-#define DLINFO_ITEMS 13
+#define DLINFO_ITEMS 14
 
 #include linux/elf.h
 
@@ -135,12 +135,13 @@
 
/*
 * Force 16 byte _final_ alignment here for generality.
-* Leave an extra 16 bytes free so that on the PowerPC we
-* can move the aux table up to start on a 16-byte boundary.
 */
-   sp = (elf_addr_t *)((~15UL  (unsigned long)(u_platform)) - 16UL);
+   sp = (elf_addr_t *)(~15UL  (unsigned long)(u_platform));
csp = sp;
csp -= DLINFO_ITEMS*2 + (k_platform ? 2 : 0);
+#ifdef DLINFO_ARCH_ITEMS
+   csp -= DLINFO_ARCH_ITEMS*2;
+#endif
csp -= envc+1;
csp -= argc+1;
csp -= (!ibcs ? 3 : 1); /* argc itself */
@@ -174,6 +175,13 @@
NEW_AUX_ENT(10, AT_EUID, (elf_addr_t) current-euid);
NEW_AUX_ENT(11, AT_GID, (elf_addr_t) current-gid);
NEW_AUX_ENT(12, AT_EGID, (elf_addr_t) current-egid);
+#ifdef ARCH_DLINFO
+   /* 
+* ARCH_DLINFO must come last so platform specific code can enforce
+* special alignment requirements on the AUXV if necessary (eg. PPC).
+*/
+   ARCH_DLINFO;
+#endif
 #undef NEW_AUX_ENT
 
sp -= envc+1;
--- linuxppc_2_4_orig/arch/ppc/kernel/process.c Mon Apr  2 19:25:35 2001
+++ linuxppc_2_4/arch/ppc/kernel/process.c  Wed Apr 11 18:40:51 2001
@@ -378,45 +378,6 @@
 }
 
 /*
- * XXX ld.so expects the auxiliary table to start on
- * a 16-byte boundary, so we have to find it and
- * move it up. :-(
- */
-static inline void shove_aux_table(unsigned long sp)
-{
-   int argc;
-   char *p;
-   unsigned long e;
-   unsigned long aux_start, offset;
-
-   if (__get_user(argc, (int *)sp))
-   return;
-   sp += sizeof(int) + (argc + 1) * sizeof(char *);
-   /* skip over the environment pointers */
-   do {
-   if (__get_user(p, (char **)sp))
-   return;
-   sp += sizeof(char *);
-   } while (p != NULL);
-   aux_start = sp;
-   /* skip to the end of the auxiliary table */
-   do {
-   if (__get_user(e, (unsigned long *)sp))
-   return;
-   sp += 2 * sizeof(unsigned long);
-   } while (e != AT_NULL);
-   offset = ((aux_start + 15) & ~15) - aux_start;
-   if (offset != 0) {
-   do {
-   sp -= sizeof(unsigned long);
-   if (__get_user(e, (unsigned long *)sp)
-   || __put_user(e, (unsigned long *)(sp + offset)))
-   return;
-   } while (sp > aux_start);
-   }
-}
-
-/*
  * Set up a thread for executing a new program
  */
 void start_thread(struct pt_regs *regs, unsigned long nip, unsigned long sp)
@@ -425,7 +386,6 @@
regs->nip = nip;
regs->gpr[1] = sp;
regs->msr = MSR_USER;
-   shove_aux_table(sp);
if (last_task_used_math == current)
last_task_used_math = 0;
if (last_task_used_altivec == current)
--- 
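
The alignment arithmetic the patch relies on can be sketched in plain C.
This is only an illustration of masking with ~15 and of the offset that the
removed shove_aux_table() had to compute; the demo_ helper names are
hypothetical and nothing here is kernel code.

```c
#include <stdint.h>

/* Round an address down to a 16-byte boundary (mask off the low 4 bits). */
static uintptr_t demo_align_down16(uintptr_t p)
{
    return p & ~(uintptr_t)15;
}

/* Round an address up to the next 16-byte boundary. */
static uintptr_t demo_align_up16(uintptr_t p)
{
    return (p + 15) & ~(uintptr_t)15;
}

/* Distance a misaligned table would have to be moved up so that it
 * starts on a 16-byte boundary; 0 means it is already aligned. */
static uintptr_t demo_shove_offset(uintptr_t aux_start)
{
    return demo_align_up16(aux_start) - aux_start;
}
```

If the table start is already aligned when the stack is laid out, the
offset is zero and no after-the-fact copy is needed, which is the point of
doing the alignment up front.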

Re: [PATCH] macintosh/mediabay: Convert to kthread API.

2007-04-20 Thread Benjamin Herrenschmidt

 Looks OK - there's no way of stopping the kernel thread anyway.
 
 It appears that nobody has tried to use this driver at the same time as
 software-suspend.  At least, not successfully.  A strategic try_to_freeze()
 should fix it.
 
 This will become (a little) more serious when cpu hotplug is switched to
 use the process freezer, and perhaps it breaks kprobes already.

I'll dig up a box with that hardware and do some tests, but it looks nice.

Thanks Eric !

There should be no problem with cpu hotplug, the only machines using the
media bay driver are old Apple laptops with only one CPU and no HW
threads.

Ben.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] macintosh/therm_windtunnel.c: Convert to kthread API.

2007-04-20 Thread Benjamin Herrenschmidt
On Thu, 2007-04-19 at 16:37 -0700, Andrew Morton wrote:
 On Thu, 19 Apr 2007 01:58:48 -0600
 Eric W. Biederman [EMAIL PROTECTED] wrote:
 
  Start the g4fand using kthread_run not a combination
  of kernel_thread and daemonize.  This makes the code
  a little simpler and more maintainable.
 
 I had a bit of trouble reviewing this one because I was laughing so hard at
 the attempted coding-style in that driver.  Oh well.

Heh

 I continue creeping into Christoph's camp - there's quite a bit of
 open-coded gunk which would go away if we were to teach this driver about
 kthread_should_stop() and kthread_stop(), and the conversion looks awfully
 easy to do.  It's a shame to stop here.
 
 Oh well, I guess at least this is some forward progress.

My main problem with touching that driver is that I don't have the
hardware to test. I'll try to find a user to play the guinea pig.

Ben.




Re: [PATCH] powerpc pseries eeh: Convert to kthread API

2007-04-23 Thread Benjamin Herrenschmidt

 The only reason for using threads here is to get the error recovery
 out of an interrupt context (where errors may be detected), and then,
 an hour later, decrement a counter (which is how we limit these to 
 6 per hour). Thread reaping is trivial, the thread just exits
 after an hour.

In addition, it should be a thread and not done from within keventd
because :

 - It can take a long time (well, relatively but still too long for a
work queue)

 - The driver callbacks might need to use keventd or do flush_workqueue
to synchronize with their own workqueues when doing an internal
recovery.

 Since these events are rare, I've no particular concern about
 performance or resource consumption. The current code seems 
 to work just fine. :-)

I think moving to kthreads is cleaner (just a wrapper around kernel
threads that mostly simplifies dealing with reaping them) and I agree
with Christoph that it would be nice to be able to fire off kthreads
from interrupt context.. in many cases, we abuse work queues for things
that should really be done from kthreads instead (basically anything that
takes more than a couple hundred microsecs or so).
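
The limit mentioned in the quoted text (at most 6 recoveries per hour, with
each slot given back an hour later) boils down to a small counter. A
minimal userland sketch, with hypothetical demo_ names and no locking or
timers, just to show the bookkeeping:

```c
#include <stdbool.h>

/* Limit taken from the quoted text: 6 recoveries per hour. */
#define DEMO_MAX_RECOVERIES_PER_HOUR 6

struct demo_recovery_budget {
    int in_flight;   /* recoveries started within the last hour */
};

/* Called when an error is detected: take a slot if one is left. */
static bool demo_budget_take(struct demo_recovery_budget *b)
{
    if (b->in_flight >= DEMO_MAX_RECOVERIES_PER_HOUR)
        return false;
    b->in_flight++;
    return true;
}

/* Called an hour after a recovery started: give the slot back. */
static void demo_budget_expire(struct demo_recovery_budget *b)
{
    if (b->in_flight > 0)
        b->in_flight--;
}
```

In the real code the "an hour later" part is what keeps the thread around;
here it is left to the caller.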

Ben.




Re: [PATCH] powerpc pseries eeh: Convert to kthread API

2007-04-23 Thread Benjamin Herrenschmidt
On Mon, 2007-04-23 at 20:08 -0600, Eric W. Biederman wrote:
 Benjamin Herrenschmidt [EMAIL PROTECTED] writes:
 
  The only reason for using threads here is to get the error recovery
  out of an interrupt context (where errors may be detected), and then,
  an hour later, decrement a counter (which is how we limit these to 
  6 per hour). Thread reaping is trivial, the thread just exits
  after an hour.
 
  In addition, it should be a thread and not done from within keventd
  because :
 
   - It can take a long time (well, relatively but still too long for a
  work queue)
 
   - The driver callbacks might need to use keventd or do flush_workqueue
  to synchronize with their own workqueues when doing an internal
  recovery.
 
  Since these events are rare, I've no particular concern about
  performance or resource consumption. The current code seems 
  to work just fine. :-)
 
  I think moving to kthreads is cleaner (just a wrapper around kernel
  threads that mostly simplifies dealing with reaping them) and I agree
  with Christoph that it would be nice to be able to fire off kthreads
  from interrupt context.. in many cases, we abuse work queues for things
  that should really be done from kthreads instead (basically anything that
  takes more than a couple hundred microsecs or so).
 
 On that note does anyone have a problem if we manage the irq spawning
 safe kthreads the same way that we manage the work queue entries.
 
 i.e. by a structure allocated by the caller?

Not sure... I can see places where I might want to spawn an arbitrary
number of these without having to preallocate structures... and if I
allocate on the fly, then I need a way to free that structure when the
kthread is reaped which I don't think we have currently, do we ? (In
fact, I could use that for other things too now that I'm thinking of
it ... I might have a go at providing optional kthread destructors).

Ben.




Re: [PATCH] powerpc pseries eeh: Convert to kthread API

2007-04-23 Thread Benjamin Herrenschmidt

 Further in general it doesn't make sense to grab a module reference
 and call that sufficient because we would like to request that the
 module exits.

Which is, btw, I think a total misdesign of our module stuff, but heh, I
remember that led to some flamewars back then...

Like anything else, modules should have separated the entrypoints for

 - Initiating a removal request
 - Releasing the module

The former is called when the user did rmmod; it can unregister things
from subsystems, etc... (and can fail if the driver decides to refuse
removal requests when it's busy doing things, or whatever policy that
module wants to implement).

The latter is called when all references to the module have been
dropped, it's a bit like the kref release (and could be implemented as
one).
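
To make the two entry points concrete, here is a rough userland model of
the idea. It is not how the kernel module loader works; the struct, the
demo_ functions and the "busy" policy flag are all hypothetical.

```c
#include <stdbool.h>

struct demo_module {
    int  refs;       /* starts at 1: the reference held by registration */
    bool busy;       /* policy input: module may refuse removal when busy */
    bool released;   /* set once the release entry point has run */
};

static void demo_get(struct demo_module *m)
{
    m->refs++;
}

/* Entry point 2: runs when the last reference is dropped (kref-like). */
static void demo_put(struct demo_module *m)
{
    if (--m->refs == 0)
        m->released = true;
}

/* Entry point 1: removal request (rmmod). May fail by module policy;
 * on success it drops the registration reference and nothing more. */
static int demo_request_removal(struct demo_module *m)
{
    if (m->busy)
        return -1;
    /* a real module would unregister from its subsystems here */
    demo_put(m);
    return 0;
}
```

With this split, a removal request never has to wait for users: the
release simply happens whenever the last of them goes away.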

If we had done that (simple) thing back then, module refcounting would
have been much less of a problem... I remember some reasons why that was
vetoed but I didn't and still don't agree.

Ben.




[PATCH 0/12] Pass MAP_FIXED down to get_unmapped_area

2007-04-23 Thread Benjamin Herrenschmidt
This is a first step as there are still cleanups to be done in various
areas touched by that code but I think it's probably good to go as is and
at least enables me to implement what I need for PowerPC.

(Andrew, this is also a candidate for 2.6.22 since I haven't had any real
objections, mostly suggestions for improving further, which I'll try to
do later, and I have further powerpc patches that rely on this).

The current get_unmapped_area code calls the f_ops->get_unmapped_area or
the arch one (via the mm) only when MAP_FIXED is not passed. That makes
it impossible for archs to impose proper constraints on regions of the
virtual address space. To work around that, get_unmapped_area() then
calls some hugetlbfs specific hacks.

This causes several problems, among others:

 - It makes it impossible for a driver or filesystem to do the same thing
that hugetlbfs does (for example, to allow a driver to use larger page
sizes to map external hardware) if that requires applying a constraint
on the addresses (constraining that mapping in certain regions and other
mappings out of those regions).

 - Some archs like arm, mips, sparc, sparc64, sh and sh64 already want
MAP_FIXED to be passed down in order to deal with aliasing issues.
The code is there to handle it... but is never called.

This series of patches moves the logic to handle MAP_FIXED down to the
various arch/driver get_unmapped_area() implementations, and then changes
the generic code to always call them. The hugetlbfs hacks then disappear
from the generic code.
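
As a rough illustration of the change in control flow, here is a userland
model of the dispatch before and after the series. It is a sketch only: the
DEMO_ constants, the reserved range and the stand-in arch hook are
assumptions made up for the example, not kernel code.

```c
#define DEMO_MAP_FIXED 0x10UL
#define DEMO_EINVAL    22

/* Hypothetical arch constraint: this range only takes large pages. */
#define DEMO_RESERVED_START 0x10000000UL
#define DEMO_RESERVED_END   0x20000000UL

typedef long (*demo_get_area_fn)(unsigned long addr, unsigned long len,
                                 unsigned long flags);

/* Stand-in for an arch hook that knows how to handle MAP_FIXED. */
static long demo_arch_get_area(unsigned long addr, unsigned long len,
                               unsigned long flags)
{
    if (flags & DEMO_MAP_FIXED) {
        if (addr < DEMO_RESERVED_END && addr + len > DEMO_RESERVED_START)
            return -DEMO_EINVAL;   /* overlaps the reserved range */
        return (long)addr;
    }
    return 0x40000000L;            /* stand-in for the real search */
}

/* Old behaviour: the hook is skipped entirely for MAP_FIXED. */
static long demo_old_get_unmapped_area(demo_get_area_fn hook,
                                       unsigned long addr,
                                       unsigned long len,
                                       unsigned long flags)
{
    if (!(flags & DEMO_MAP_FIXED))
        return hook(addr, len, flags);
    return (long)addr;
}

/* New behaviour: the hook is always called and sees MAP_FIXED itself. */
static long demo_new_get_unmapped_area(demo_get_area_fn hook,
                                       unsigned long addr,
                                       unsigned long len,
                                       unsigned long flags)
{
    return hook(addr, len, flags);
}
```

In the old flow a fixed mapping inside the reserved range slips past the
hook; in the new flow the hook gets the chance to reject it.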

Since I need to do some special 64K page mappings for SPEs on Cell, I need
to work around the first problem at least. I have further patches thus
implementing a slices layer that handles multiple page sizes through
slices of the address space for use by hugetlbfs, the SPE code, and possibly
others, but it requires this series of patches first.

There is still a potential (but not practical) issue due to the fact that
filesystems/drivers implementing g_u_a will effectively bypass all arch
checks. This is not an issue in practice as the only filesystems/drivers
using that hook are doing so for arch specific purposes in the first place.

There is also a problem with mremap that will completely bypass all arch
checks. I'll try to address that separately, I'm not 100% certain yet how,
possibly by making it not work when the vma has a file whose f_ops has a
get_unmapped_area callback, and by making it use is_hugepage_only_range()
before expanding into a new area.

Also, I want to turn is_hugepage_only_range() into a more generic
is_normal_page_range() as that's really what it will end up meaning
when used in stack grow, brk grow and mremap.

None of the above issues however are introduced by this patch, they are
already there, so I think the patch can go in for 2.6.22.

(Patch is against Linus current git, I'll give a go at -mm asap)

Cheers,
Ben.



[PATCH 1/12] get_unmapped_area handles MAP_FIXED on powerpc

2007-04-23 Thread Benjamin Herrenschmidt
Handle MAP_FIXED in powerpc's arch_get_unmapped_area() in all 3
implementations of it.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]
Acked-by: William Irwin [EMAIL PROTECTED]

 arch/powerpc/mm/hugetlbpage.c |   21 +
 1 file changed, 21 insertions(+)

Index: linux-cell/arch/powerpc/mm/hugetlbpage.c
===
--- linux-cell.orig/arch/powerpc/mm/hugetlbpage.c   2007-04-24 15:10:17.0 +1000
+++ linux-cell/arch/powerpc/mm/hugetlbpage.c   2007-04-24 15:28:11.0 +1000
@@ -566,6 +566,13 @@ unsigned long arch_get_unmapped_area(str
if (len > TASK_SIZE)
return -ENOMEM;
 
+   /* handle fixed mapping: prevent overlap with huge pages */
+   if (flags & MAP_FIXED) {
+   if (is_hugepage_only_range(mm, addr, len))
+   return -EINVAL;
+   return addr;
+   }
+
if (addr) {
addr = PAGE_ALIGN(addr);
vma = find_vma(mm, addr);
@@ -641,6 +648,13 @@ arch_get_unmapped_area_topdown(struct fi
if (len > TASK_SIZE)
return -ENOMEM;
 
+   /* handle fixed mapping: prevent overlap with huge pages */
+   if (flags & MAP_FIXED) {
+   if (is_hugepage_only_range(mm, addr, len))
+   return -EINVAL;
+   return addr;
+   }
+
/* dont allow allocations above current base */
if (mm->free_area_cache > base)
mm->free_area_cache = base;
@@ -823,6 +837,13 @@ unsigned long hugetlb_get_unmapped_area(
/* Paranoia, caller should have dealt with this */
BUG_ON((addr + len) < addr);
 
+   /* Handle MAP_FIXED */
+   if (flags & MAP_FIXED) {
+   if (prepare_hugepage_range(addr, len, pgoff))
+   return -EINVAL;
+   return addr;
+   }
+
if (test_thread_flag(TIF_32BIT)) {
curareas = current->mm->context.low_htlb_areas;
 


[PATCH 2/12] get_unmapped_area handles MAP_FIXED on alpha

2007-04-23 Thread Benjamin Herrenschmidt
Handle MAP_FIXED in alpha's arch_get_unmapped_area(), simple case, just
return the address as passed in

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]

 arch/alpha/kernel/osf_sys.c |3 +++
 1 file changed, 3 insertions(+)

Index: linux-cell/arch/alpha/kernel/osf_sys.c
===
--- linux-cell.orig/arch/alpha/kernel/osf_sys.c 2007-03-22 14:58:33.0 +1100
+++ linux-cell/arch/alpha/kernel/osf_sys.c  2007-03-22 14:58:44.0 +1100
@@ -1267,6 +1267,9 @@ arch_get_unmapped_area(struct file *filp
if (len > limit)
return -ENOMEM;
 
+   if (flags & MAP_FIXED)
+   return addr;
+
/* First, see if the given suggestion fits.
 
   The OSF/1 loader (/sbin/loader) relies on us returning an


[PATCH 3/12] get_unmapped_area handles MAP_FIXED on arm

2007-04-23 Thread Benjamin Herrenschmidt
ARM already had a case for MAP_FIXED in arch_get_unmapped_area() though
it was not called before. Fix the comment to reflect that it will now
be called.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]

 arch/arm/mm/mmap.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Index: linux-cell/arch/arm/mm/mmap.c
===
--- linux-cell.orig/arch/arm/mm/mmap.c  2007-03-22 14:59:51.0 +1100
+++ linux-cell/arch/arm/mm/mmap.c   2007-03-22 15:00:01.0 +1100
@@ -49,8 +49,7 @@ arch_get_unmapped_area(struct file *filp
 #endif
 
/*
-* We should enforce the MAP_FIXED case.  However, currently
-* the generic kernel code doesn't allow us to handle this.
+* We enforce the MAP_FIXED case.
 */
if (flags & MAP_FIXED) {
if (aliasing && flags & MAP_SHARED && addr & (SHMLBA - 1))
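
The aliasing check that ARM applies to fixed mappings can be modelled in a
few lines. This is a sketch under assumptions: DEMO_SHMLBA is taken to be
four 4K pages, the flag values are the usual Linux ones redefined under
DEMO_ names, and demo_fixed_mapping_ok() is a hypothetical helper.

```c
#define DEMO_MAP_SHARED 0x01UL
#define DEMO_MAP_FIXED  0x10UL
#define DEMO_SHMLBA     (4UL * 4096UL)   /* assumed: 4 pages of 4K */

/* On an aliasing cache, a shared fixed mapping must sit on an SHMLBA
 * boundary so every mapping of the page lands on the same cache colour.
 * Returns 1 if the address is acceptable, 0 if it must be refused. */
static int demo_fixed_mapping_ok(int aliasing, unsigned long addr,
                                 unsigned long flags)
{
    if (aliasing && (flags & DEMO_MAP_SHARED) && (addr & (DEMO_SHMLBA - 1)))
        return 0;
    return 1;
}
```

Private mappings and non-aliasing caches pass unconditionally; only the
shared, misaligned case is rejected.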


[PATCH 11/12] get_unmapped_area handles MAP_FIXED in generic code

2007-04-23 Thread Benjamin Herrenschmidt
generic arch_get_unmapped_area() now handles MAP_FIXED. Now that
all implementations have been fixed, change the toplevel
get_unmapped_area() to call into arch or drivers for the MAP_FIXED
case.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]

 mm/mmap.c |   25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

Index: linux-cell/mm/mmap.c
===
--- linux-cell.orig/mm/mmap.c   2007-03-22 16:29:22.0 +1100
+++ linux-cell/mm/mmap.c   2007-03-22 16:30:06.0 +1100
@@ -1199,6 +1199,9 @@ arch_get_unmapped_area(struct file *filp
if (len > TASK_SIZE)
return -ENOMEM;
 
+   if (flags & MAP_FIXED)
+   return addr;
+
if (addr) {
addr = PAGE_ALIGN(addr);
vma = find_vma(mm, addr);
@@ -1272,6 +1275,9 @@ arch_get_unmapped_area_topdown(struct fi
if (len > TASK_SIZE)
return -ENOMEM;
 
+   if (flags & MAP_FIXED)
+   return addr;
+
/* requesting a specific address */
if (addr) {
addr = PAGE_ALIGN(addr);
@@ -1360,22 +1366,21 @@ get_unmapped_area(struct file *file, uns
unsigned long pgoff, unsigned long flags)
 {
unsigned long ret;
+   unsigned long (*get_area)(struct file *, unsigned long,
+ unsigned long, unsigned long, unsigned long);
 
-   if (!(flags & MAP_FIXED)) {
-   unsigned long (*get_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
-
-   get_area = current->mm->get_unmapped_area;
-   if (file && file->f_op && file->f_op->get_unmapped_area)
-   get_area = file->f_op->get_unmapped_area;
-   addr = get_area(file, addr, len, pgoff, flags);
-   if (IS_ERR_VALUE(addr))
-   return addr;
-   }
+   get_area = current->mm->get_unmapped_area;
+   if (file && file->f_op && file->f_op->get_unmapped_area)
+   get_area = file->f_op->get_unmapped_area;
+   addr = get_area(file, addr, len, pgoff, flags);
+   if (IS_ERR_VALUE(addr))
+   return addr;
 
if (addr > TASK_SIZE - len)
return -ENOMEM;
if (addr & ~PAGE_MASK)
return -EINVAL;
+
if (file && is_file_hugepages(file))  {
/*
 * Check if the given range is hugepage aligned, and


[PATCH 10/12] get_unmapped_area handles MAP_FIXED in hugetlbfs

2007-04-23 Thread Benjamin Herrenschmidt
Generic hugetlb_get_unmapped_area() now handles MAP_FIXED by just
calling prepare_hugepage_range()

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]
Acked-by: William Irwin [EMAIL PROTECTED]

 fs/hugetlbfs/inode.c |6 ++
 1 file changed, 6 insertions(+)

Index: linux-cell/fs/hugetlbfs/inode.c
===
--- linux-cell.orig/fs/hugetlbfs/inode.c   2007-03-22 16:12:56.0 +1100
+++ linux-cell/fs/hugetlbfs/inode.c 2007-03-22 16:16:02.0 +1100
@@ -115,6 +115,12 @@ hugetlb_get_unmapped_area(struct file *f
if (len > TASK_SIZE)
return -ENOMEM;
 
+   if (flags & MAP_FIXED) {
+   if (prepare_hugepage_range(addr, len, pgoff))
+   return -EINVAL;
+   return addr;
+   }
+
if (addr) {
addr = ALIGN(addr, HPAGE_SIZE);
vma = find_vma(mm, addr);


[PATCH 12/12] get_unmapped_area doesn't need hugetlbfs hacks anymore

2007-04-23 Thread Benjamin Herrenschmidt
Remove the hugetlbfs specific hacks in toplevel get_unmapped_area() now
that all archs and hugetlbfs itself do the right thing for both cases.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]
Acked-by: William Irwin [EMAIL PROTECTED]

 mm/mmap.c |   16 
 1 file changed, 16 deletions(-)

Index: linux-cell/mm/mmap.c
===
--- linux-cell.orig/mm/mmap.c   2007-04-12 12:14:46.0 +1000
+++ linux-cell/mm/mmap.c   2007-04-12 12:14:47.0 +1000
@@ -1381,22 +1381,6 @@ get_unmapped_area(struct file *file, uns
if (addr & ~PAGE_MASK)
return -EINVAL;
 
-   if (file && is_file_hugepages(file))  {
-   /*
-* Check if the given range is hugepage aligned, and
-* can be made suitable for hugepages.
-*/
-   ret = prepare_hugepage_range(addr, len, pgoff);
-   } else {
-   /*
-* Ensure that a normal request is not falling in a
-* reserved hugepage range.  For some archs like IA-64,
-* there is a separate region for hugepages.
-*/
-   ret = is_hugepage_only_range(current->mm, addr, len);
-   }
-   if (ret)
-   return -EINVAL;
return addr;
 }
 


[PATCH 9/12] get_unmapped_area handles MAP_FIXED on x86_64

2007-04-23 Thread Benjamin Herrenschmidt
Handle MAP_FIXED in x86_64 arch_get_unmapped_area(), simple case, just
return the address as passed in

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]

 arch/x86_64/kernel/sys_x86_64.c |3 +++
 1 file changed, 3 insertions(+)

Index: linux-cell/arch/x86_64/kernel/sys_x86_64.c
===
--- linux-cell.orig/arch/x86_64/kernel/sys_x86_64.c 2007-03-22 16:10:10.0 +1100
+++ linux-cell/arch/x86_64/kernel/sys_x86_64.c  2007-03-22 16:11:06.0 +1100
@@ -93,6 +93,9 @@ arch_get_unmapped_area(struct file *filp
unsigned long start_addr;
unsigned long begin, end;

+   if (flags & MAP_FIXED)
+   return addr;
+
find_start_end(flags, &begin, &end); 
 
if (len > end)


[PATCH 8/12] get_unmapped_area handles MAP_FIXED on sparc64

2007-04-23 Thread Benjamin Herrenschmidt
Handle MAP_FIXED in hugetlb_get_unmapped_area on sparc64
by just using prepare_hugepage_range()

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]
Acked-by: William Irwin [EMAIL PROTECTED]

 arch/sparc64/mm/hugetlbpage.c |6 ++
 1 file changed, 6 insertions(+)

Index: linux-cell/arch/sparc64/mm/hugetlbpage.c
===
--- linux-cell.orig/arch/sparc64/mm/hugetlbpage.c   2007-03-22 16:12:57.0 +1100
+++ linux-cell/arch/sparc64/mm/hugetlbpage.c   2007-03-22 16:15:33.0 +1100
@@ -175,6 +175,12 @@ hugetlb_get_unmapped_area(struct file *f
if (len > task_size)
return -ENOMEM;
 
+   if (flags & MAP_FIXED) {
+   if (prepare_hugepage_range(addr, len, pgoff))
+   return -EINVAL;
+   return addr;
+   }
+
if (addr) {
addr = ALIGN(addr, HPAGE_SIZE);
vma = find_vma(mm, addr);


[PATCH 7/12] get_unmapped_area handles MAP_FIXED on parisc

2007-04-23 Thread Benjamin Herrenschmidt
Handle MAP_FIXED in parisc arch_get_unmapped_area(), just return the
address. We might want to also check for possible cache aliasing
issues now that we get called in that case (like ARM or MIPS),
leave a comment for the maintainers to pick up.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]

 arch/parisc/kernel/sys_parisc.c |5 +
 1 file changed, 5 insertions(+)

Index: linux-cell/arch/parisc/kernel/sys_parisc.c
===
--- linux-cell.orig/arch/parisc/kernel/sys_parisc.c 2007-03-22 15:28:05.0 +1100
+++ linux-cell/arch/parisc/kernel/sys_parisc.c  2007-03-22 15:29:08.0 +1100
@@ -106,6 +106,11 @@ unsigned long arch_get_unmapped_area(str
 {
if (len > TASK_SIZE)
return -ENOMEM;
+   /* Might want to check for cache aliasing issues for MAP_FIXED case
+* like ARM or MIPS ??? --BenH.
+*/
+   if (flags & MAP_FIXED)
+   return addr;
if (!addr)
addr = TASK_UNMAPPED_BASE;
 


[PATCH 6/12] get_unmapped_area handles MAP_FIXED on ia64

2007-04-23 Thread Benjamin Herrenschmidt
Handle MAP_FIXED in ia64 arch_get_unmapped_area and
hugetlb_get_unmapped_area(), just call prepare_hugepage_range
in the latter and is_hugepage_only_range() in the former.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]
Acked-by: William Irwin [EMAIL PROTECTED]

 arch/ia64/kernel/sys_ia64.c |7 +++
 arch/ia64/mm/hugetlbpage.c  |8 
 2 files changed, 15 insertions(+)

Index: linux-cell/arch/ia64/kernel/sys_ia64.c
===
--- linux-cell.orig/arch/ia64/kernel/sys_ia64.c 2007-03-22 15:10:45.0 +1100
+++ linux-cell/arch/ia64/kernel/sys_ia64.c  2007-03-22 15:10:47.0 +1100
@@ -33,6 +33,13 @@ arch_get_unmapped_area (struct file *fil
if (len > RGN_MAP_LIMIT)
return -ENOMEM;
 
+   /* handle fixed mapping: prevent overlap with huge pages */
+   if (flags & MAP_FIXED) {
+   if (is_hugepage_only_range(mm, addr, len))
+   return -EINVAL;
+   return addr;
+   }
+
 #ifdef CONFIG_HUGETLB_PAGE
if (REGION_NUMBER(addr) == RGN_HPAGE)
addr = 0;
Index: linux-cell/arch/ia64/mm/hugetlbpage.c
===
--- linux-cell.orig/arch/ia64/mm/hugetlbpage.c  2007-03-22 15:12:32.0 +1100
+++ linux-cell/arch/ia64/mm/hugetlbpage.c   2007-03-22 15:12:39.0 +1100
@@ -148,6 +148,14 @@ unsigned long hugetlb_get_unmapped_area(
return -ENOMEM;
if (len & ~HPAGE_MASK)
return -EINVAL;
+
+   /* Handle MAP_FIXED */
+   if (flags & MAP_FIXED) {
+   if (prepare_hugepage_range(addr, len, pgoff))
+   return -EINVAL;
+   return addr;
+   }
+
/* This code assumes that RGN_HPAGE != 0. */
if ((REGION_NUMBER(addr) != RGN_HPAGE) || (addr & (HPAGE_SIZE - 1)))
addr = HPAGE_REGION_BASE;
 


[PATCH 5/12] get_unmapped_area handles MAP_FIXED on i386

2007-04-23 Thread Benjamin Herrenschmidt
Handle MAP_FIXED in i386 hugetlb_get_unmapped_area(), just call
prepare_hugepage_range.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]
Acked-by: William Irwin [EMAIL PROTECTED]

 arch/i386/mm/hugetlbpage.c |6 ++
 1 file changed, 6 insertions(+)

Index: linux-cell/arch/i386/mm/hugetlbpage.c
===
--- linux-cell.orig/arch/i386/mm/hugetlbpage.c  2007-03-22 16:08:12.0 +1100
+++ linux-cell/arch/i386/mm/hugetlbpage.c   2007-03-22 16:14:19.0 +1100
@@ -367,6 +367,12 @@ hugetlb_get_unmapped_area(struct file *f
if (len > TASK_SIZE)
return -ENOMEM;
 
+   if (flags & MAP_FIXED) {
+   if (prepare_hugepage_range(addr, len, pgoff))
+   return -EINVAL;
+   return addr;
+   }
+
if (addr) {
addr = ALIGN(addr, HPAGE_SIZE);
vma = find_vma(mm, addr);


[PATCH 4/12] get_unmapped_area handles MAP_FIXED on frv

2007-04-23 Thread Benjamin Herrenschmidt
Handle MAP_FIXED in arch_get_unmapped_area on frv. Trivial case, just
return the address.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]

 arch/frv/mm/elf-fdpic.c |4 
 1 file changed, 4 insertions(+)

Index: linux-cell/arch/frv/mm/elf-fdpic.c
===
--- linux-cell.orig/arch/frv/mm/elf-fdpic.c 2007-03-22 15:00:50.0 +1100
+++ linux-cell/arch/frv/mm/elf-fdpic.c  2007-03-22 15:01:06.0 +1100
@@ -64,6 +64,10 @@ unsigned long arch_get_unmapped_area(str
if (len > TASK_SIZE)
return -ENOMEM;
 
+   /* handle MAP_FIXED */
+   if (flags & MAP_FIXED)
+   return addr;
+
/* only honour a hint if we're not going to clobber something doing so */
if (addr) {
addr = PAGE_ALIGN(addr);


Re: [patch 2.6.20-1] radeonfb: Add support for Radeon xpress 200m

2007-03-09 Thread Benjamin Herrenschmidt

 - radeonfb_pm_init(rinfo, rinfo->is_mobility ? 1 : -1, ignore_devlist, force_sleep);
 + radeonfb_pm_init(rinfo, rinfo->is_mobility && rinfo->family != CHIP_FAMILY_RS480 ? 1 : -1, ignore_devlist, force_sleep);

I'd rather you add a check for RS480 inside radeonfb_pm_*

Ben.




Re: [PATCH] Fix atomicity of TIF update in flush_thread() for powerpc

2007-03-09 Thread Benjamin Herrenschmidt
 .../...

 Signed-off-by: Mathieu Desnoyers [EMAIL PROTECTED]

Acked-by: Benjamin Herrenschmidt [EMAIL PROTECTED]

Nice catch !

 --- a/arch/powerpc/kernel/process.c
 +++ b/arch/powerpc/kernel/process.c
 @@ -476,8 +476,13 @@ void flush_thread(void)
  #ifdef CONFIG_PPC64
   struct thread_info *t = current_thread_info();
  
 - if (t->flags & _TIF_ABI_PENDING)
 - t->flags ^= (_TIF_ABI_PENDING | _TIF_32BIT);
 + if (test_tsk_thread_flag(tsk, TIF_ABI_PENDING)) {
 + clear_tsk_thread_flag(tsk, TIF_ABI_PENDING);
 + if (test_tsk_thread_flag(tsk, TIF_32BIT))
 + clear_tsk_thread_flag(tsk, TIF_32BIT);
 + else
 + set_tsk_thread_flag(tsk, TIF_32BIT);
 + }
  #endif
  
   discard_lazy_cpu_state();
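
The issue with the old code is that a plain read-modify-write of the flags
word is a separate load and store, so a bit set atomically by someone else
in between can be lost. A minimal userland sketch of the same update done
with C11 atomics follows; the bit positions and demo_ names are made up for
illustration and do not match the real TIF_* values.

```c
#include <stdatomic.h>

/* Hypothetical bit positions, for illustration only. */
#define DEMO_TIF_32BIT       (1UL << 0)
#define DEMO_TIF_ABI_PENDING (1UL << 1)

/* If an ABI switch is pending, clear the pending bit and flip the
 * 32-bit bit. Each step is an atomic read-modify-write, so other bits
 * of the word set concurrently are never overwritten. */
static void demo_flush_abi_pending(atomic_ulong *flags)
{
    if (atomic_load(flags) & DEMO_TIF_ABI_PENDING) {
        atomic_fetch_and(flags, ~DEMO_TIF_ABI_PENDING);
        atomic_fetch_xor(flags, DEMO_TIF_32BIT);
    }
}
```

The quoted patch gets the same effect with the kernel's per-bit
test/clear/set thread flag helpers rather than a word-wide XOR.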



Re: [PATCH] Complain about missing system calls.

2007-03-09 Thread Benjamin Herrenschmidt
On Fri, 2007-03-09 at 17:11 +0100, Andi Kleen wrote:
 David Woodhouse [EMAIL PROTECTED] writes:
 
  Most system calls seem to get added to i386 first. This patch
  automatically generates a warning for any new system call which is
  implemented on i386 but not the architecture currently being compiled.
  On PowerPC at the moment, for example, it results in these warnings:
  init/missing_syscalls.h:935:3: warning: #warning syscall sync_file_range not implemented
  init/missing_syscalls.h:947:3: warning: #warning syscall getcpu not implemented
  init/missing_syscalls.h:950:3: warning: #warning syscall epoll_pwait not implemented
 
 I think a better solution would be to finally switch to auto generated
 system call tables for newer system calls. The original reason why the
 architectures have different system call numbers -- compatibility with
 another native Unix -- is completely obsolete now. This leaves only
 minor differences of compat stub vs non compat stub and a few
 architecture specific calls.
 
 Of course the existing syscall numbers can't be changed, but for all new 
 calls one could just add automatically for everybody.
 
 A global table with two entries (compat and non compat) and a per arch 
 override table should be sufficient.

We need additional gunk for syscalls that can be called from SPEs on
cell

Ben.




Re: Make sure we populate the initroot filesystem late enough

2007-03-13 Thread Benjamin Herrenschmidt

 Hmm. The crash came back after I booted into Mac OS X and back. It was however
 a different crash, I believe it was coming from the USB modules (as it would
 keep going when it happened, and get another crash, which tended to scroll away
 too fast for me to capture) but I believe it was still getting down into the
 slab code and actually dying there.

Have you tried, instead, to apply
38f3323037de22bb0089d08be27be01196e7148b ? (That is revert
39d61db0edb34d60b83c5e0d62d0e906578cc707).

I suspect this is the proper fix...

Ben.

 However, reverting the reversion of
 8d610dd52dd1da696e199e4b4545f33a2a5de5c6 and instead applying
 the following patch:
 
 diff -ru linux-source-2.6.20.orig/arch/powerpc/mm/init_32.c linux-source-2.6.20/arch/powerpc/mm/init_32.c
 --- linux-source-2.6.20.orig/arch/powerpc/mm/init_32.c  2007-02-05 05:44:54.0 +1100
 +++ linux-source-2.6.20/arch/powerpc/mm/init_32.c   2007-03-10 11:03:56.0 +1100
 @@ -244,7 +244,8 @@
  void free_initrd_mem(unsigned long start, unsigned long end)
  {
 	if (start < end)
 -		printk ("Freeing initrd memory: %ldk freed\n", (end - start) >> 10);
 +		printk ("NOT Freeing initrd memory: %ldk freed\n", (end - start) >> 10);
 +	return;
 	for (; start < end; start += PAGE_SIZE) {
 		ClearPageReserved(virt_to_page(start));
 		init_page_count(virt_to_page(start));
 
 which if I recall correctly David Woodhouse posted to this thread,
 seems to have fixed it.
 
 I dunno if it's relevant, but my initrd.img is 13193315 bytes long,
 (ie 99 bytes over 12884k) and the above logs:
 NOT Freeing initrd memory: 12888k freed
 which makes sense...
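
The numbers above can be checked with a few lines of C. This is only a
sketch and it assumes 4k pages and a freed range that is rounded up to a
whole page, which is what the logged value suggests; SKETCH_PAGE_SIZE and
initrd_kib_freed are names invented for the example.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096UL  /* assumed 4k pages */

/* Round a byte count up to a whole number of pages, then report it in KiB,
 * mirroring the (end - start) >> 10 expression used by the printk. */
static unsigned long initrd_kib_freed(unsigned long bytes)
{
    unsigned long pages;

    assert(bytes > 0);
    pages = (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
    return (pages * SKETCH_PAGE_SIZE) >> 10;
}
```

For 13193315 bytes this gives 3222 pages, i.e. 12888k, while 12884k is
exactly 13193216 bytes, which leaves the 99 bytes mentioned above spilling
into one more page.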
 
 I of course completely failed to think to check this with the crashing
 kernel, if it seems relevant I can roll back to it and get the numbers.
 



Re: [PATCH] BLK_DEV_IDE_CELLEB dependency fix

2007-03-15 Thread Benjamin Herrenschmidt
On Thu, 2007-03-15 at 17:30 +0300, Sergei Shtylyov wrote:
 Hello.
 
 Akira Iguchi wrote:
 
  It's bool and it depends on BLK_DEV_IDE
   = should depend on BLK_DEV_IDE=y
 
 Hm, why I'm seeing module_init() in the driver? :-)
 
  And move it to if BLK_DEV_IDEDMA_PCI block because it depends on 
  BLK_DEV_IDEDMA_PCI.
 
 IMHO, that driver shouldn't be in drivers/ide/ppc/ then...  Why it got 
 there (the same question about PowerMac driver)?

Not sure... some reorg changed ide-pmac.c into ppc/pmac.c or such, I
don't remember who did it tho.

Ben.




Re: remote debugging via FireWire

2007-02-11 Thread Benjamin Herrenschmidt
On Sat, 2007-02-10 at 20:16 +0100, Stefan Richter wrote:
 [ohci1394_early]
 
 Some remarks to the September 2006 version at
 http://www.suse.de/~bk/firewire/ :
 
   - Seems its .remove won't work properly if more than one OHCI-1394
 controller is installed.  And its .probe isn't reentrant, but that
 might be less of a problem.
   - Its functionality will be lost if there is a FireWire bus reset,
 e.g. when something is plugged in or out.  To keep physical DMA
 alive, an interrupt handler had to be installed which writes ~0
 to OHCI1394_PhyReqFilter{Hi,Lo}Set.  Can interrupt handlers be
 registered in an early setup stage?
   - There might be some register accesses in the setup which could be
 omitted; I'd have to look this up.
   - Could be optimized to not use ohci1394.h::struct ti_ohci.
   - PCI_CLASS_FIREWIRE_OHCI can be replaced by
 include/linux/pci_ids.h::PCI_CLASS_SERIAL_FIREWIRE_OHCI which
 was newly added in 2.6.20-git#.
   - I suppose .probe should check for PCI_CLASS_SERIAL_FIREWIRE_OHCI
 instead of PCI_CLASS_SERIAL_FIREWIRE.
   - How about dropping support for configuring this as module, to
 simplify the code?  Unless this would interfere with ohci1394; and
 it probably would if there was an interrupt handler...
   - depends on X86_64 is missing in Kconfig.
   - Maybe put it into arch/x86_64/drivers/ instead of drivers/ieee1394?
   - Plus what I mentioned earlier in the thread.
 
 I could send code to address some of this at next weekend or later.

I'd like to have that on ppc as well, so I'd rather keep it in drivers/

I agree that it doesn't need to be a module. If you can load modules,
then you can load the full ohci driver. Thus, if it's an early thingy
initialized by arch, it can export a special takeover hook that the
proper ohci module can then call to override it (important if we start
having an irq handler).
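
A very rough sketch of what such a takeover hook could look like, written
as plain C so it can be compiled on its own. The names (early_stub_init,
early_stub_takeover) and the two flags are hypothetical and stand in for
whatever state the early code would really own.

```c
#include <assert.h>
#include <stdbool.h>

/* State owned by the hypothetical early debug stub. */
static bool early_stub_active;
static bool early_irq_installed;

/* Called by arch code very early in boot; must only run once. */
static void early_stub_init(bool with_irq_handler)
{
    assert(!early_stub_active);
    early_stub_active = true;
    early_irq_installed = with_irq_handler;
}

/* Takeover hook: the full driver calls this before touching the hardware.
 * Returns true if the stub was active and has now been shut down. */
static bool early_stub_takeover(void)
{
    if (!early_stub_active)
        return false;
    early_irq_installed = false;   /* release the interrupt first */
    early_stub_active = false;
    return true;
}
```

The point of returning a flag is that the proper driver can tell whether
it is inheriting a controller that already has physical DMA enabled.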

Andi, also, how do you deal with iommu ? Not at all ? :-)

Ben.




[PATCH] powerpc: Fix vDSO page count calculation

2007-02-11 Thread Benjamin Herrenschmidt
The recent vDSO consolidation patches broke powerpc due to a mistake
in the definition of MAXPAGES constants. This fixes it by moving to
a dynamically allocated array of pages instead as I don't like much
hard coded size limits. Also move the vdso initialisation to an initcall
since it doesn't really need to be done -that- early.

Apologies for not catching the breakage earlier. Roland _did_ CC me on
his patches a while ago, but I got busy with other things and forgot to test
them.

Signed-off-by: Benjamin Herrenschmidt [EMAIL PROTECTED]

Index: linux-work/arch/powerpc/kernel/vdso.c
===
--- linux-work.orig/arch/powerpc/kernel/vdso.c  2007-02-12 10:42:46.0 
+1100
+++ linux-work/arch/powerpc/kernel/vdso.c   2007-02-12 11:03:54.0 
+1100
@@ -49,24 +49,23 @@
 /* Max supported size for symbol names */
 #define MAX_SYMNAME	64
 
-#define VDSO32_MAXPAGES	(((0x3000 + PAGE_MASK) >> PAGE_SHIFT) + 2)
-#define VDSO64_MAXPAGES	(((0x3000 + PAGE_MASK) >> PAGE_SHIFT) + 2)
-
 extern char vdso32_start, vdso32_end;
 static void *vdso32_kbase = &vdso32_start;
-unsigned int vdso32_pages;
-static struct page *vdso32_pagelist[VDSO32_MAXPAGES];
+static unsigned int vdso32_pages;
+static struct page **vdso32_pagelist;
 unsigned long vdso32_sigtramp;
 unsigned long vdso32_rt_sigtramp;
 
 #ifdef CONFIG_PPC64
 extern char vdso64_start, vdso64_end;
 static void *vdso64_kbase = &vdso64_start;
-unsigned int vdso64_pages;
-static struct page *vdso64_pagelist[VDSO64_MAXPAGES];
+static unsigned int vdso64_pages;
+static struct page **vdso64_pagelist;
 unsigned long vdso64_rt_sigtramp;
 #endif /* CONFIG_PPC64 */
 
+static int vdso_ready;
+
 /*
  * The vdso data page (aka. systemcfg for old ppc64 fans) is here.
  * Once the early boot kernel code no longer needs to muck around
@@ -182,6 +181,9 @@ int arch_setup_additional_pages(struct l
unsigned long vdso_base;
int rc;
 
+   if (!vdso_ready)
+   return 0;
+
 #ifdef CONFIG_PPC64
if (test_thread_flag(TIF_32BIT)) {
vdso_pagelist = vdso32_pagelist;
@@ -661,7 +663,7 @@ static void __init vdso_setup_syscall_ma
 }
 
 
-void __init vdso_init(void)
+static int __init vdso_init(void)
 {
int i;
 
@@ -716,11 +718,13 @@ void __init vdso_init(void)
 #ifdef CONFIG_PPC64
vdso64_pages = 0;
 #endif
-   return;
+   return 0;
}
 
/* Make sure pages are in the correct state */
-   BUG_ON(vdso32_pages + 2 > VDSO32_MAXPAGES);
+   vdso32_pagelist = kzalloc(sizeof(struct page *) * (vdso32_pages + 2),
+ GFP_KERNEL);
+   BUG_ON(vdso32_pagelist == NULL);
for (i = 0; i < vdso32_pages; i++) {
struct page *pg = virt_to_page(vdso32_kbase + i*PAGE_SIZE);
ClearPageReserved(pg);
@@ -731,7 +735,9 @@ void __init vdso_init(void)
vdso32_pagelist[i] = NULL;
 
 #ifdef CONFIG_PPC64
-   BUG_ON(vdso64_pages + 2 > VDSO64_MAXPAGES);
+   vdso64_pagelist = kzalloc(sizeof(struct page *) * (vdso64_pages + 2),
+ GFP_KERNEL);
+   BUG_ON(vdso64_pagelist == NULL);
for (i = 0; i < vdso64_pages; i++) {
struct page *pg = virt_to_page(vdso64_kbase + i*PAGE_SIZE);
ClearPageReserved(pg);
@@ -743,7 +749,13 @@ void __init vdso_init(void)
 #endif /* CONFIG_PPC64 */
 
get_page(virt_to_page(vdso_data));
+
+   smp_wmb();
+   vdso_ready = 1;
+
+   return 0;
 }
+arch_initcall(vdso_init);
 
 int in_gate_area_no_task(unsigned long addr)
 {
Index: linux-work/arch/powerpc/mm/mem.c
===
--- linux-work.orig/arch/powerpc/mm/mem.c   2007-02-12 10:53:02.0 
+1100
+++ linux-work/arch/powerpc/mm/mem.c2007-02-12 10:53:05.0 +1100
@@ -384,9 +384,6 @@ void __init mem_init(void)
initsize >> 10);
 
mem_init_done = 1;
-
-   /* Initialize the vDSO */
-   vdso_init();
 }
 
 /*
Index: linux-work/include/asm-powerpc/vdso.h
===
--- linux-work.orig/include/asm-powerpc/vdso.h  2007-02-12 11:02:44.0 
+1100
+++ linux-work/include/asm-powerpc/vdso.h   2007-02-12 11:03:36.0 
+1100
@@ -18,16 +18,11 @@
 
 #ifndef __ASSEMBLY__
 
-extern unsigned int vdso64_pages;
-extern unsigned int vdso32_pages;
-
 /* Offsets relative to thread->vdso_base */
 extern unsigned long vdso64_rt_sigtramp;
 extern unsigned long vdso32_sigtramp;
 extern unsigned long vdso32_rt_sigtramp;
 
-extern void vdso_init(void);
-
 #else /* __ASSEMBLY__ */
 
 #ifdef __VDSO64__
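
For readers who want the shape of the change without the kernel context,
here is a userspace sketch of the same pattern: size the page list at init
time instead of using a fixed MAXPAGES array, then publish it behind a
ready flag. It assumes C11 atomics as a stand-in for smp_wmb(), and
page_stub, pagelist_init and pagelist_get are invented names, not the
functions in the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct page_stub { int dummy; };         /* stand-in for struct page */

static struct page_stub **pagelist;      /* sized at init time */
static atomic_int list_ready;            /* 0 until pagelist is published */

/* Allocate nr + 2 zeroed slots (the patch also keeps two extra entries),
 * fill the first nr, then publish with release ordering, which is the
 * role smp_wmb() plays before vdso_ready is set. */
static int pagelist_init(struct page_stub *pages, unsigned int nr)
{
    assert(pages != NULL);
    pagelist = calloc((size_t)nr + 2, sizeof(*pagelist));
    if (!pagelist)
        return -1;
    for (unsigned int i = 0; i < nr; i++)
        pagelist[i] = &pages[i];
    atomic_store_explicit(&list_ready, 1, memory_order_release);
    return 0;
}

/* Consumers bail out quietly until init has run, like the vdso_ready check. */
static struct page_stub **pagelist_get(void)
{
    if (!atomic_load_explicit(&list_ready, memory_order_acquire))
        return NULL;
    return pagelist;
}
```

Because init now runs from an initcall rather than from mem_init(), any
caller that can run before it has to tolerate the "not ready" answer.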




Re: remote debugging via FireWire

2007-02-11 Thread Benjamin Herrenschmidt
On Mon, 2007-02-12 at 07:49 +0100, Andi Kleen wrote:
 On Sunday 11 February 2007 22:35, Benjamin Herrenschmidt wrote:
 
  I'd like to have that on ppc as well, so I'd rather keep it in drivers/
 
 This will need some abstraction at least -- there are some early mapping hacks
 that are x86 specific right now.

Either abstraction or ifdef's .. we have ioremap working very early on
ppc :-)

  I agree that it doesn't need to be a module. If you can load modules,
  then you can load the full ohci driver. Thus, if it's an early thingy
  initialized by arch, it can export a special takeover hook that the
  proper ohci module can then call to override it (important if we start
  having an irq handler).
  
  Andi, also, how do you deal with iommu ? Not at all ? :-)
 
 Yes -- it's really early debugging hack mostly. It's reasonable to 
 let the iommu be disabled (or later a special bypass can be added for this) 

Ok.

Ben.




Re: undefined symbol 'PS3_PS3AV'

2007-02-14 Thread Benjamin Herrenschmidt
On Wed, 2007-02-14 at 19:17 +0900, Paul Mundt wrote:
 On Wed, Feb 14, 2007 at 11:02:06AM +0100, Geert Uytterhoeven wrote:
  On Wed, 14 Feb 2007, Paul Mundt wrote:
   This would seem like a reasonable candidate for a 'depends on' instead of
   a select..
  
  That's what we originally had. But for the user it's simpler if he can just
  enable ps3fb and/or ps3snd (sound driver not yet finished), which both 
  select
  PS3_PS3AV.
 
 Why not just have PS3_PS3AV def_bool y if ps3fb || ps3snd? Or if that
 doesn't work, just place the PS3FB option in 
 arch/powerpc/platforms/ps3/Kconfig.
 
 Of course if select obeyed the depends on, this wouldn't be a problem
 either..

I'd rather fix Kconfig to do the latter...

Ben.




Re: [RFC] killing the NR_IRQS arrays.

2007-02-16 Thread Benjamin Herrenschmidt
On Fri, 2007-02-16 at 05:10 -0700, Eric W. Biederman wrote:

 Getting the drivers changed actually looks to be pretty
 straightforward; it will just be a very large mechanical change.  We change
 the type of variables where appropriate and every once in a while
 introduce an irq_nr(irq) to get the actual irq number for the places
 that care (ISA or print statements).

Dunno about that irq_nr thingy. If we go that way, I'd be tempted to
remove the number completely from the public side of irq_desc... or
not.

On powerpc, we have this remapped thingy because we completely separate
the linux virtual interrupt domain from the physical numbering domains
of each PIC. Your change would turn the linux virtual domain into
pointers, removing the need for an array and associated limitations,
which is nice.

So to a given irq_desc / irq virtual number today, I match a pair HW
number (which is a special typedef which is currently defined as an
unsigned long) and a pointer to the irq host (which is the entity that
define a HW number domain).

That means that you can have multiple hosts and a given HW number can
exist multiple times, once per host.

Do you think the irq_hwnumber_t thingy I have should then be generalized
and put into the irq_desc ? I would need an additional void * pointer to
the irq host as well (it's not a 1:1 relationship to an irq chip and
need to be accessed by generic code).

Having the HW number be clearly specific to a domain controller makes
also a lot of sense in the embedded field with lots of cascaded
interrupt controllers. It avoids having to play all sorts of tricks to
assign ranges of numbers to various controllers in the system. Only the
local number on a given controller matters, the rest is dynamically
assigned.
 
Another option would be to have the irq_desc be created by the arch and
embedded in a larger data structure, in which case the HW number would
be part of the private part of that data structure. Though I suppose
that could be a problem with ISA...

I suspect that for backward compatibility, we will need to keep
something (optionally maybe via CONFIG_*) for ISA/legacy interrupts.
That is, a 16-entry irq_desc* array, so we can go from a legacy IRQ
number to an irq_desc on platforms that have legacy/ISA crap floating
around.

On powerpc, what I do is that I always reserve entries 0...15 of my
remapping array in such a way that linux virtual irq 0 is always
reserved, and 1...15 are only ever assigned to legacy interrupts if they
exist in the system, or left unassigned if they don't.
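
As an illustration of the scheme described above, here is a cut-down
sketch of a remapping table in plain C: each entry pairs a host (one HW
numbering domain) with a HW number local to that host, virq 0 is never
handed out, and 1..15 are kept aside for legacy interrupts. This is not the
code in arch/powerpc/kernel/irq.c; virq_table, virq_create_mapping and the
sizes are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define NR_VIRQS   64   /* arbitrary size for the sketch */
#define NR_LEGACY  16   /* virqs 0..15: 0 reserved, 1..15 legacy/ISA */
#define NO_VIRQ    0

struct irq_host { const char *name; };   /* one HW numbering domain */

struct virq_map {
    const struct irq_host *host;          /* NULL means slot unused */
    unsigned long hwirq;                  /* number local to that host */
};

static struct virq_map virq_table[NR_VIRQS];

/* Map (host, hwirq) to a virtual irq, reusing an existing mapping if any.
 * Returns NO_VIRQ when the table is full. */
static unsigned int virq_create_mapping(const struct irq_host *host,
                                        unsigned long hwirq)
{
    unsigned int free_slot = NO_VIRQ;

    assert(host != NULL);
    for (unsigned int v = NR_LEGACY; v < NR_VIRQS; v++) {
        if (virq_table[v].host == host && virq_table[v].hwirq == hwirq)
            return v;
        if (!virq_table[v].host && free_slot == NO_VIRQ)
            free_slot = v;
    }
    if (free_slot != NO_VIRQ) {
        virq_table[free_slot].host = host;
        virq_table[free_slot].hwirq = hwirq;
    }
    return free_slot;
}
```

Two PICs can both use HW number 5 here and still end up with distinct
virtual numbers, which is the whole point of keying on the pair.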

 I think we can make this change fairly smoothly if before the code is
 merged into Linus's tree we have a patchset prepared with a all of the
 core infrastructure changes and a best effort at all of the driver
 changes.  Then early some merge window we merge the patchset, and
 fixup the drivers that were missed.

As long as we do things properly and not with a big DESIGNED FOR x86
hack in the middle that makes it hard for everybody else, I agree.

Cheers,
Ben.



Re: [RFC] killing the NR_IRQS arrays.

2007-02-16 Thread Benjamin Herrenschmidt
On Fri, 2007-02-16 at 13:41 +0100, Ingo Molnar wrote:
 * Eric W. Biederman [EMAIL PROTECTED] wrote:
 
  So I propose we remove all assumptions from the code that we actually 
  have an array of irqs.  That will allow for irq_desc to be dynamically 
  allocated instead of statically allocated saving memory and reducing 
  kernel complexity.
 
 hm. I'd suggest to do this without changing request_irq() - and then we 
 could avoid the 'massive, every driver affected' change, right?
 
 i.e. because we'll (have to) have an nr_to_desc() and desc_to_nr() 
 mapping facility anyway, lets just not change the driver APIs massively. 
 There dont seem to be that many drivers that assume that irq_desc[] is 
 an array - are there?
 
 otherwise, in terms of the irqchips infrastructure and the API between 
 genirq and the irqchip arch-level drivers, this change makes quite a bit 
 of sense i think.
 
 or am i missing something fundamental?

Well, I don't want to see anything like desc_to_nr / nr_to_desc unless
the number in question is a virtual number. That is, there is no way we
should go that way and keep passing a HW number through request_irq.
That would just be a total nightmare for powerpc and sparc at least.

What we can do is generalize the powerpc virtual irq scheme though. You
can see the implementation in arch/powerpc/kernel/irq.c starting from
the definition of irq_alloc_host() though for some stupid reason, I've
put all the documentation in include/asm-powerpc/irq.h so you might want
to start there.

Once the IRQ numbers are virtualized, it becomes easier to slowly
migrate things to use irq_desc_t * while still having a virtual number
available.

Once everything has been migrated, we can then get rid of the virtual
numbers completely except maybe for an optional 16 entries array for
legacy cruft.

Ben.




Re: [RFC] killing the NR_IRQS arrays.

2007-02-16 Thread Benjamin Herrenschmidt

  Rather than having the job of rewriting this code during 2.6, I'd much
  prefer to get something sorted, even if it is ARM only before 2.6.
  
  I believe that there are some common problems with the existing API
  which have been hinted at over the last few days, such as large
  NR_IRQS.  As such, I think it would be a good idea to try to thrash
  this issue out and get something which everyone is happy with.
  
  Additionally, I've added Alan's reserve then hook idea to the API;
  I seem to remember there is a case in IDE which needs something like
  this.

You might want to have a look at the powerpc API with its remapping
capabilities. It's very nice for handling multiple domain spaces. It
might be of some use for you.

I like your proposed API, I think that's where we want to go in the long
run.

Ben.




Re: [RFC] killing the NR_IRQS arrays.

2007-02-16 Thread Benjamin Herrenschmidt
On Sat, 2007-02-17 at 02:37 +0100, Arnd Bergmann wrote:
 On Friday 16 February 2007 23:37, Benjamin Herrenschmidt wrote:
  You might want to have a look at the powerpc API with its remapping
  capabilities. It's very nice for handling multiple domain spaces. It
  might be of some use for you.
 
 I don't consider the powerpc virtual IRQs a solution for the problem.
 While I believe you did the right thing for powerpc with generalizing
 this over all its platforms, it really isn't more than a workaround
 for the problem that we can't deal well with the static irq_desc
 array.

It's not a solution per se, though it contains elements of a solution, like
the reverse mapping, which I use to map HW numbers to virtual irqs but
can trivially adapt to map HW numbers to irq_desc pointers.

Among other things, I want to make sure that we don't end up with just
putting an irq number in a field of the irq_desc and have half of the
drivers peek at it and assume we can convert between irq_desc* and
number in arbitrary ways.

The HW irq number should be as much opaque as possible from the world
outside of the PIC code and/or arch code that assign them. That's an
area where the powerpc and/or sparc code might be of use.


 When that problem is now getting worse on other architectures, we
 should try to get it right on all of them, rather than spreading
 the workaround further.

Yes, but I'd like aspects of my remapping work to be included in
whatever we come up with, which is to have the new irq_desc either hide
the underlying HW number, or at least make it very clear
that it's an opaque token and not guaranteed to be unique across
multiple PICs in the system.

In addition, if we remove the numbers, archs will need basically the
exact same services provided by the powerpc irq core for reverse mapping
(going from a HW irq number on a given PIC back to an irq_desc *).

Either using a linear array for simple PICs or a radix tree for
platforms with very big interrupt numbers (BTW. I think we have lockless
radix trees nowadays, I can remove the spinlocks to protect it in the
powerpc remapper).
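
For the simple-PIC case, the linear reverse map amounts to little more than
an array indexed by the PIC-local HW number. A minimal sketch, assuming
small HW numbers and using invented names (linear_revmap, revmap_set,
revmap_find, irq_desc_stub); the radix tree variant for sparse or very
large numbers is not shown.

```c
#include <assert.h>
#include <stddef.h>

struct irq_desc_stub { unsigned int handled; };   /* stand-in for irq_desc */

#define LINEAR_REVMAP_SIZE 32

/* Per-PIC reverse map: index by the PIC-local HW number, get a descriptor. */
struct linear_revmap {
    struct irq_desc_stub *desc[LINEAR_REVMAP_SIZE];
};

/* Record which descriptor a HW number belongs to (done at mapping time). */
static void revmap_set(struct linear_revmap *map, unsigned long hwirq,
                       struct irq_desc_stub *desc)
{
    assert(hwirq < LINEAR_REVMAP_SIZE);
    map->desc[hwirq] = desc;
}

/* Lookup on the interrupt path; NULL for unmapped or out-of-range numbers. */
static struct irq_desc_stub *revmap_find(const struct linear_revmap *map,
                                         unsigned long hwirq)
{
    if (hwirq >= LINEAR_REVMAP_SIZE)
        return NULL;
    return map->desc[hwirq];
}
```

The lookup touches no lock at all, which is why the linear form is the
cheap option when the HW number space is small and dense.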

Ben.




Re: [RFC] killing the NR_IRQS arrays.

2007-02-17 Thread Benjamin Herrenschmidt

 No.  I don't think we should make your irq_hwnumber_t thingy general
 because it is not general.  I don't understand why you need it to be
 an unsigned long, that still puzzles me.  But for the rest it actually
 appears that ppc has a simpler model to deal with.

I think you might have misunderstood because I do believe it's actually
very general :-)

Let me explain below.

 I don't think I actually can describe x86 hardware in you hwnumber_t
 world.  Although I can approximate.

And I think it fits well...

 In non-legacy mode at the top of the tree I have a network cooperating
 irq controllers.  For each cpu there is an lapic next to each cpu that
 catches interrupt packets and below that I have interrupt controllers
 that throw interrupt packets.  In the network of cooperating interrupt
 controllers a interrupt packet has a destination address that looks
 like (cpu#, vector#) where cpu# is currently at 8 bits and slowly
 growing and the vector# is a fixed 8 bits.
 
 The interrupt controllers that throw those packets have a fixed
 number of irq slots usually 24 or so.  Each slot (referred to in the
 code as a pin) can be programmed which (cpu#, vector#) packet it
 throws when an interrupt occurs. Including an option to vary the cpu#
 between a set of cpus.
 
 So to be frank, to handle this model properly I need:
 #define NR_IRQS (NR_CPUS*256)
 
 There is enough flexibility in this model that hardware vectors
 have not found a need to cascade interrupt controllers.

This is roughly similar to the cell toplevel model where interrupt
messages encode the source unit/node, target and class. The chip has an
interrupt controller (receiver of those messages) for each thread. In
the kernel, I use a flat model, that is I create one host for all of
them and my hardware numbers are made of a similar bit encoding of that
routing info.

That is, with a remapping model like mine, the x86 non-legacy situation
could be easily expressed by having one domain (I call them hosts in the
code) covering the whole fabric and the hw number be your (CPU << 16) |
vector thing.
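
Spelled out in C, and reading the expression as the CPU number shifted
left by 16 bits OR-ed with the 8-bit vector, the encoding would look
roughly like this. hwnumber_t, hw_encode, hw_cpu and hw_vector are names
made up for the sketch.

```c
#include <assert.h>

typedef unsigned long hwnumber_t;   /* assumed name, mirrors irq_hwnumber_t */

/* Pack a (cpu, vector) destination into one number: the cpu goes above
 * bit 16 and the 8-bit vector sits in the low bits. */
static hwnumber_t hw_encode(unsigned int cpu, unsigned int vector)
{
    assert(vector < 256);
    return ((hwnumber_t)cpu << 16) | vector;
}

/* Recover the two halves again. */
static unsigned int hw_cpu(hwnumber_t hw)    { return (unsigned int)(hw >> 16); }
static unsigned int hw_vector(hwnumber_t hw) { return (unsigned int)(hw & 0xff); }
```

The resulting number is only meaningful inside that one domain, so nothing
outside the PIC code ever needs to take it apart.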

In addition, but you don't need that on x86, cell has an external
controller cascaded on one of those interrupt, I use a separate domain
for it.

The reason my hwnumber thingy is a generic type is that I provide
generic functions to create a linux interrupt for a domain/number pair
and a generic mechanism to do the reverse mapping. That's where I think my
code might be of some use as with the numbers going away, pretty much
everybody will need a way to reverse map from HW numbers back to
irq_desc *.

I use an unsigned long because I needed to choose a type that would fit
the biggest number potentially used by an interrupt controller, and that
can be real big with some hypervisors for which those are tokens which
are potentially 64 bits.

 Ben I have no problem with a number that is specific to an irq
 controller for dealing with the internal irq controller
 implementations, heck I think everyone has that to some degree
 
 The linux irq number will remain an arbitrary software number for
 use by the linux system for talking about the source of the
 interrupt.

So you do intend to keep the linux number which is what I call the
virtual interrupt number on powerpc... I wouldn't have thought that to
be necessary except as a special case of an array of 16 entries for ISA
interrupts...

 Why in a sparse address space you would find it hard to allocate a
 range of numbers to an irq controller that only has a fixed number of
 irqs it can deal with is something I don't understand and I think
 it does a disservice to your users.  But that is all it is
 a quality of implementation issue.  ia64 does the same foolish
 thing.

It would be fairly easy to change my powerpc code to pre-allocate a full
range for a given domain/pic when initializing it instead of doing
lazy scattered allocation like I do, though it won't bring much I
think. It's not possible for all PICs though, for example, the pSeries
needs to use the radix tree reverse mapper because of how large HW
interrupt numbers can be.

I chose not to do it. In the long run, the only remotely meaningful way
to expose interrupts to users would be to -add- columns
to /proc/interrupts that provide the host and the HW number on that
host, though I'm not sure that wouldn't break some userland tools.

 The only time it really makes sense to me to let the irq number vary
 arbitrary are when things are truly dynamic, like with MSI, a
 hypervisor, or hot plug interrupt controllers.

I don't understand why you would go to all that length to replace irq
numbers with irq_desc * and ... keep the numbers :-)

But again, as I said, this is in no way a fundamental limitation of the
powerpc code. It could be modified easily to allocate the whole range of
a given PIC that uses the linear remapping. It makes no sense for PICs
that use the radix tree remapping though.

 Sure, and I have the same issue with 

Re: [RFC] killing the NR_IRQS arrays.

2007-02-17 Thread Benjamin Herrenschmidt
On Sat, 2007-02-17 at 02:06 -0700, Eric W. Biederman wrote:
 Benjamin Herrenschmidt [EMAIL PROTECTED] writes:
 
  In addition, if we remove the numbers, archs will need basically the
  exact same services provided by the powerpc irq core for reverse mapping
  (going from a HW irq number on a given PIC back to an irq_desc *).
 
 Ben you seem to be under misapprehension that except for the case of
 ISA (0-16) the linux IRQ number is a hardware number.  It is an arbitrary
 software enumeration, and I think it has been that way a very long time.

Did you actually mean "is not a hardware number" ? If not, then I don't
understand your sentence...

 I can only tell you that my impression of this last is that all the
 world's not a PPC.

Yeah and my grandmother is not the pope, thank you.

However, PowerPC is a good example because it has such a diversity of
very different hardware setups to deal with, ranging from the multiple
layers of cascading controllers all over the place, to interrupts
packets encoding vector/target etc... a bit like x86 on cell, to
hypervisors providing a single giant number space etc etc etc...

Thus, it is extremely likely that something that works well for PowerPC
(or for ARM for that matter, as it's probably as colorful an environment
as PowerPC is) will end up being useful for others.

 I have a version of the x86 code with a partial conversion done and
 I didn't need a reverse mapping.  What you call the hardware interrupt
 number never happens to be interesting to me after the system is setup.

Because you have the ability to tell your PIC to give you your linux
interrupt number when actually sending the interrupt to the processor ?
You need a way to get to the irq_desc * when getting an IRQ, either you
have a way to map HW numbers back to irq_desc * in software, or your HW
allows you to do it.

 I do suspect there may be an interesting chunk of your ppc work that
 probably makes sense as a library so other arches could use it.

Guess what, one of the options of my code is to not instantiate a
remapper... for archs where it's not necessary. (We have the case for
example of iSeries whose hypervisor can return us the number we want for
an arbitrary interrupt).

Now, I'm not saying we should take the PowerPC code and say "hey, here's
the new generic code".

I'm saying that if we're going to change the IRQ stuff that deeply, it
would be nice if we looked into some of that stuff I've done that I
believe would be of use for other archs (though you seem to imply that
it would be of no use on x86, good, still...).

I found it overall very useful to have a generic remapping core and have
cascaded PIC setups have a numbering domain local to a given PIC (pretty
much, a domain != an irq_chip) and I'm convinced it would make life
easier for archs with similar setups. The remapping core also shows its
usefulness on archs with very big interrupt numbers, like sparc or
pSeries ppc, and possibly others.

Now, I -do- have a problem with one aspect of your proposed design which
is to keep the linux interrupt number in the generic irq_desc, which I
think defeats most of the purpose of moving away from those linux irq
numbers. If you do so, then I'll have to keep a separate remapping layer
and keep a mechanism for virtualizing linux numbers.

Ben.




Re: [RFC] killing the NR_IRQS arrays.

2007-02-17 Thread Benjamin Herrenschmidt

  #define NO_IRQ  architecture-defined-int-constant
 
 When did you need a magic constant NO_IRQ in generic code.
 One of the reasons I want to convert the drivers is so we can
 kill the NO_IRQ nonsense.
 
 As for struct irq.  Instead of struct irq_desc I really don't
 care, although the C++ camp has not yet weighed in and mentioned
 how that creates a namespace conflict for them. 

Yeah, NO_IRQ would be NULL here...

What I do in the powerpc code is that, since IRQ HW numbers are defined
locally to a domain/PIC, when creating a new domain, the PIC code passes
a value to use as an illegal value in that domain. It's not exposed
outside of the core though, it's really only used to initialize the
remapping table with something before any interrupt on that PIC has been
mapped. 
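
A small sketch of that idea, with the caveat that host_sketch, host_init
and host_slot_mapped are invented for the example and the table size is
arbitrary: the PIC hands over its own "illegal" HW number when the domain
is created, and every slot starts out holding it.

```c
#include <assert.h>
#include <stddef.h>

#define HOST_MAP_SIZE 8

struct host_sketch {
    unsigned long inval_hwirq;              /* chosen by the PIC code */
    unsigned long hwirq_of[HOST_MAP_SIZE];  /* HW number held by each slot */
};

/* Create a domain: every slot starts out holding the PIC's illegal value. */
static void host_init(struct host_sketch *h, unsigned long inval_hwirq)
{
    assert(h != NULL);
    h->inval_hwirq = inval_hwirq;
    for (size_t i = 0; i < HOST_MAP_SIZE; i++)
        h->hwirq_of[i] = inval_hwirq;
}

/* A slot counts as mapped once it holds anything but the illegal value. */
static int host_slot_mapped(const struct host_sketch *h, size_t slot)
{
    return slot < HOST_MAP_SIZE && h->hwirq_of[slot] != h->inval_hwirq;
}
```

Since the sentinel is per domain, a PIC where HW number 0 is a perfectly
valid interrupt can pick something like ~0UL instead, and no global NO_IRQ
constant is needed for the table itself.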

 We might need this.  But I don't think we need reference counting in
 the traditional sense.  For all practical purpose we already have
 dynamic irq allocation and it hasn't proven necessary.  I would
 prefer to go to lengths to avoid having to expose that kind of
 an issue to driver code.

I think we do need proper refcounting, but I also think that most
drivers will not need to see it.

For example, a PCI driver will most probably just do something along the
lines of the existing request_irq(pdev->irq), the lifetime of pdev->irq
is managed by the PCI core.

Same goes with MSIs imho, the MSI core can manage the lifetime
transparently.

Ben.




Re: [PATCH 2.6.21-rc1] powerpc: Make of_device_uevent() compatible with ibmebus

2007-02-17 Thread Benjamin Herrenschmidt
On Sat, 2007-02-17 at 17:28 +0100, Hoang-Nam Nguyen wrote:
 ibmebus has a fake root device that's not associated with an ofdt node.
 Filter out any such devices in of_device_uevent().

Doh ! You are creating an of_device with no attached device-node ? That
is totally evil ! Why do you need that ?

Ben.

 
 Signed-off-by: Joachim Fenkes [EMAIL PROTECTED]
 ---
 
 
  of_device.c |4 
  1 files changed, 4 insertions(+)
 
 
 diff -urp a/arch/powerpc/kernel/of_device.c b/arch/powerpc/kernel/of_device.c
 --- a/arch/powerpc/kernel/of_device.c 2007-02-17 16:36:32.116368480 +0100
 +++ b/arch/powerpc/kernel/of_device.c 2007-02-17 16:44:01.319366352 +0100
 @@ -180,6 +180,10 @@ int of_device_uevent(struct device *dev,
  
   ofdev = to_of_device(dev);
  
 + /* e.g. ibmebus has a fake root device w/o ofdt node -- filter that */
 + if (!ofdev->node)
 + return -ENODEV;
 +
   if (add_uevent_var(envp, num_envp, &i,
  buffer, buffer_size, &length,
  "OF_NAME=%s", ofdev->node->name))
 
 ___
 Linuxppc-dev mailing list
 [EMAIL PROTECTED]
 https://ozlabs.org/mailman/listinfo/linuxppc-dev



Re: [PATCH 2.6.21-rc1] powerpc: Make of_device_uevent() compatible with ibmebus

2007-02-17 Thread Benjamin Herrenschmidt
On Sat, 2007-02-17 at 19:21 -0500, Joachim Fenkes wrote:
 Benjamin Herrenschmidt [EMAIL PROTECTED] wrote on 17.02.2007 
 16:56:39:
 
  On Sat, 2007-02-17 at 17:28 +0100, Hoang-Nam Nguyen wrote:
   ibmebus has a fake root device that's not associated with an ofdt 
 node.
   Filter out any such devices in of_device_uevent().
  
  Doh ! You are creating an of_device with no attached device-node ? That
  is totally evil ! Why do you need that ?
 
 The driver creates a fake ibmebus device so all ibmebus based devices 
 have
 a common parent device -- the vio bus does the same.
 
 What do you think about linking this device to the device tree / node? 
 All
 ibmebus-based devices are linked to dt nodes residing directly beneath 
 /,
 so the mapping would fit.

No. If you do that, it shouldn't be an of_device based device.

If you want them to be below a common parent, then create that parent as
a basic struct device type, that sort of thing. You should never
instantiate an of_device that has a NULL device node.

vio is different since it's not a subclass of of_device though I tend
to also disagree with the way it does things.

It's a generic problem with sysfs, I agree it somewhat sucks.

Ben.



