Re: [PATCH 1/2] 476: Set CCR2[DSTI] to prevent isync from flushing shadow TLB
On Fri, Sep 24, 2010 at 01:01:36PM -0500, Dave Kleikamp wrote:
> When the DSTI (Disable Shadow TLB Invalidate) bit is set in the CCR2
> register, the isync command does not flush the shadow TLB (iTLB and
> dTLB). However, since the shadow TLB does not contain context
> information, we want the shadow TLB flushed in situations where we are
> switching context. In those situations, we explicitly clear the DSTI
> bit before performing isync, and set it again afterward. We also need
> to do the same when we perform isync after explicitly flushing the TLB.
>
> The setting of the DSTI bit is dependent on
> CONFIG_PPC_47x_DISABLE_SHADOW_TLB_INVALIDATE. When we are confident
> that the feature works as expected, the option can probably be removed.

You're defaulting it to 'y' in the Kconfig. Technically someone could
turn it off, I guess, but in practice nobody mucks with the defaults.
Do you want it to default to 'n' for now if you aren't quite confident
in it yet?

(Linus also has some kind of gripe with new options being default 'y',
but I don't recall all the details and I doubt he'd care about
something in low-level PPC code.)

josh

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/2] 476: Set CCR2[DSTI] to prevent isync from flushing shadow TLB
On Mon, 2010-09-27 at 11:04 -0400, Josh Boyer wrote:
> On Fri, Sep 24, 2010 at 01:01:36PM -0500, Dave Kleikamp wrote:
> > When the DSTI (Disable Shadow TLB Invalidate) bit is set in the CCR2
> > register, the isync command does not flush the shadow TLB (iTLB and
> > dTLB). However, since the shadow TLB does not contain context
> > information, we want the shadow TLB flushed in situations where we
> > are switching context. In those situations, we explicitly clear the
> > DSTI bit before performing isync, and set it again afterward. We also
> > need to do the same when we perform isync after explicitly flushing
> > the TLB.
> >
> > The setting of the DSTI bit is dependent on
> > CONFIG_PPC_47x_DISABLE_SHADOW_TLB_INVALIDATE. When we are confident
> > that the feature works as expected, the option can probably be
> > removed.
>
> You're defaulting it to 'y' in the Kconfig. Technically someone could
> turn it off, I guess, but in practice nobody mucks with the defaults.
> Do you want it to default to 'n' for now if you aren't quite confident
> in it yet?

I think I made it a config option at Ben's request when I first started
this work last year, before being sidetracked by other priorities. I
could either remove the option or default it to 'n'. It might be best
to just hard-code the behavior to make sure it's exercised, since
there's no 47x hardware in production yet, but we can give Ben a chance
to weigh in with his opinion.

> (Linus also has some kind of gripe with new options being default 'y',
> but I don't recall all the details and I doubt he'd care about
> something in low-level PPC code.)

-- 
Dave Kleikamp
IBM Linux Technology Center
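[For reference, a default-off version of the option under discussion could look like the sketch below. Only the symbol name comes from the patch description; the prompt text, help text, and dependency are illustrative assumptions, not the actual patch.]

```kconfig
config PPC_47x_DISABLE_SHADOW_TLB_INVALIDATE
	bool "Set CCR2[DSTI] so isync does not flush the 476 shadow TLB"
	depends on PPC_47x
	default n
	help
	  When enabled, the kernel sets the DSTI bit in CCR2 so that
	  isync no longer invalidates the shadow TLB, and explicitly
	  clears the bit around context switches and TLB flushes.
	  Say N until the feature has seen wider testing.
```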
Re: Oops in trace_hardirqs_on (powerpc)
Hello Steven,

Steven Rostedt wrote on Wed, 22 Sep, 15:44 (-0400):
> Sorry for the late reply, but I was on vacation when you sent this,
> and I missed it while going through email. Do you still have this
> issue?

No. I've rebuilt my kernel without TRACE_IRQFLAGS and the problem
vanished, as expected.

The problem is that in some cases the stack is only two frames deep,
which causes the macro CALLER_ADDR1 to make an invalid access. Someone
told me there is a workaround for the problem on i386, too:

% sed -n 2p arch/x86/lib/thunk_32.S
 * Trampoline to trace irqs off. (otherwise CALLER_ADDR1 might crash)

Bye, Jörg.
-- 
Pleasant words are never true; true words are never pleasant.
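[Editorial note: a simplified userspace sketch of the mechanism being discussed. The macros are modelled on include/linux/ftrace.h; the point is that each `CALLER_ADDRn` walks n stack frames, so a call chain only two frames deep makes `CALLER_ADDR1` read a frame that does not exist — which can fault on powerpc, hence the i386 trampoline workaround quoted above.]

```c
#include <assert.h>

/*
 * Simplified model of the ftrace CALLER_ADDRn macros: each expands to
 * __builtin_return_address(n), which walks n stack frames.  Level 0 is
 * always safe inside a real (non-inlined) function; level 1 and above
 * are only safe when the stack is actually that deep.
 */
#define CALLER_ADDR0 ((unsigned long)__builtin_return_address(0))
#define CALLER_ADDR1 ((unsigned long)__builtin_return_address(1))

/* noinline guarantees a real frame, so level 0 is valid here */
__attribute__((noinline)) unsigned long caller_addr0(void)
{
	return CALLER_ADDR0;
}
```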
Re: [PATCH RFCv2 1/2] dmaengine: add support for scatterlist to scatterlist transfers
2010/9/25 Ira W. Snyder i...@ovro.caltech.edu:
> This adds support for scatterlist to scatterlist DMA transfers.

This is a good idea; we have a local function to do this in DMA40
already, stedma40_memcpy_sg().

> This is currently hidden behind a configuration option, which will
> allow drivers which need this functionality to select it individually.

Why? Isn't it better to add this as a new capability flag if you don't
want to announce it? Or is the intent to save memory footprint?

Yours,
Linus Walleij
Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support
On Thu, 23 Sep 2010, Christian Riesch wrote:
> > It implies clock tuning in userspace for a potential sub-microsecond
> > accurate clock. The clock accuracy will be limited by user space
> > latencies and noise. You won't be able to discipline the system
> > clock accurately.
>
> Noise matters, latency doesn't. Well put! That's why we need hardware
> support for PTP timestamping to reduce the noise, but get along well
> with the clock servo that is steering the PHC in user space.

Even if I buy into the catch phrase above: user space is subject to
noise that the in-kernel code is not. If you do the tuning over long
intervals then it hopefully averages out, but it still causes jitter
effects that affect the degree of accuracy (or sync) that you can
reach. And the noise varies with the load on the system.
Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support
On Thu, 23 Sep 2010, john stultz wrote:

> 3) Further, the PTP hardware counter can be simply set to a new offset
> to put it in line with the network time. This could cause trouble with
> timekeeping much like unsynced TSCs do.

You can do the same for system time.

> Settimeofday does allow CLOCK_REALTIME to jump, but the
> CLOCK_MONOTONIC time cannot jump around. Having a clocksource that is
> non-monotonic would break this.

Currently time runs at the same speed. CLOCK_MONOTONIC runs at an
offset to CLOCK_REALTIME. We are creating APIs here that allow time to
run at different speeds.

> The design actually avoids most userland induced latency.
>
> 1) On the PTP hardware syncing point, the reference packet gets
> timestamped with the PTP hardware time on arrival. This allows the
> offset calculation to be done in userland without introducing latency.

The timestamps allow the calculation of the network transmission time,
I guess, and therefore it's more accurate to calculate that effect out.
Ok, but then the overhead of getting to code in user space (that does
the proper clock adjustments) results in the addition of a relatively
long time that is subject to OS scheduling latencies and noise.

> 2) On the system syncing side, the proposal for the PPS interrupt
> allows the PTP hardware to trigger an interrupt on the second boundary
> that would take a timestamp of the system time. Then the pps interface
> allows for the timestamp to be read from userland allowing the offset
> to be calculated without introducing additional latency.

Sorry, I don't really get the whole picture here, it seems. Sounds like
one is going through additional unnecessary layers. Why would the PTP
hardware trigger an interrupt? I thought the PTP messages came in via
timestamping and are then processed by software. Then the software is
issuing a hardware interrupt that then triggers the PPS subsystem. And
that is supposed to be better than directly interfacing with the PTP?

> Additionally, even just in userland, it would be easy to bracket two
> reads of the system time around one read of the PTP clock to bound any
> userland latency fairly well. It may not be as good as the PPS
> interface (although that depends on the interrupt latency), but if the
> accesses are all local, it probably could get fairly close.

That sounds hacky.

> > Ok, maybe we need some sort of control interface to manage the clock
> > like the others have.
>
> That's what the clock_adjtime call provides.

Ummm... You are managing a hardware device with hardware (driver)
specific settings. That is currently being done via ioctls. Why
generalize it?

> > The posix clocks today assume one notion of real time in the kernel.
> > All clocks increase in lockstep (aside from offset updates).
>
> Not true. The cputime clockids do not increment at the same rate (as
> the apps don't always run). Further, CLOCK_MONOTONIC_RAW provides a
> non-freq-corrected view of CLOCK_MONOTONIC, so it increments at a
> slightly different rate.

cputime clockids are not tracking time but cpu resource use.

> Re-using the fairly nice (Alan of course disagrees :) posix interface
> seems at least a little better for application developers who actually
> have to use the hardware.

Well, it may also be confusing for others. The application developers
also will have a hard time using a generic clock interface to control
PTP device specific things like frequencies, rates, etc. So you always
need an ioctl/device-specific control interface regardless.
Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support
On Fri, 24 Sep 2010, Alan Cox wrote:
> Whether you add new syscalls or do the fd passing using flags and hide
> the ugly bits in glibc is another question.

Use device-specific ioctls instead of syscalls?
Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support
In message: alpine.deb.2.00.1009271038150.9...@router.home
            Christoph Lameter c...@linux.com writes:
: On Thu, 23 Sep 2010, john stultz wrote:
: > The design actually avoids most userland induced latency.
: >
: > 1) On the PTP hardware syncing point, the reference packet gets
: > timestamped with the PTP hardware time on arrival. This allows the
: > offset calculation to be done in userland without introducing
: > latency.
:
: The timestamps allow the calculation of the network transmission time,
: I guess, and therefore it's more accurate to calculate that effect
: out. Ok, but then the overhead of getting to code in user space (that
: does the proper clock adjustments) results in the addition of a
: relatively long time that is subject to OS scheduling latencies and
: noise.

The timestamps at the hardware level allow you to factor out variation
caused by OS scheduling, OS network stack delay, and internal buffering
on the NIC. Variation in measurements is what kills accuracy.

When steering a clock by making an error measurement of its phase and
frequency, the latency induced by OS scheduling tends to be
unimportant. It is far more important to know when you steered the
clock (called adjtime or friends) than to steer it at any fixed latency
from when the data for the measurements was made. Measuring the time of
steer can tolerate errors in the range of OS scheduling latencies
easily, since that tends to produce a very small effect. It introduces
an error in your expected phase for the next measurement on the order
of the product of the time-of-steer error and the change in fractional
frequency (abs(1 - (nu_new / nu_old))). Even if the estimate is really
bad at 100ms, most steers are on the order of one part per million.
This leads to a sub-nanosecond phase error estimate in the next
measurement cycle (a non-accumulating error). A 1ms error leads to
maybe tens of picoseconds of estimate error.

This is a common error that I've seen repeated in this thread. The only
reason that it has historically been important is because when you are
doing timestamping in software based on an interrupt, that stuff does
matter.

Warner
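[Editorial note: the error bound Warner describes is a simple product, which can be checked numerically. The sketch below is illustrative only — the input values are made up, not measurements from any servo.]

```c
#include <assert.h>
#include <math.h>

/*
 * Non-accumulating phase-estimate error caused by an uncertain time of
 * steer: it is the time-of-steer uncertainty multiplied by the size of
 * the fractional-frequency step, t_err * |1 - (nu_new / nu_old)|.
 */
double steer_phase_error(double t_err, double nu_old, double nu_new)
{
	return t_err * fabs(1.0 - nu_new / nu_old);
}
```

For example, a 1 ms time-of-steer uncertainty combined with a 1e-8 change in fractional frequency (both values assumed for illustration) contributes only about 10 picoseconds of one-off error to the next measurement, which is why this term is usually negligible.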
Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support
On Mon, 27 Sep 2010 10:56:09 -0500 (CDT)
Christoph Lameter c...@linux.com wrote:

> On Fri, 24 Sep 2010, Alan Cox wrote:
> > Whether you add new syscalls or do the fd passing using flags and
> > hide the ugly bits in glibc is another question.
>
> Use device-specific ioctls instead of syscalls?

Some of the ioctls are probably not device specific; the job of the OS
in part is to present a unified interface. We already have a mess of
HPET and RTC driver ioctls. Some of it undoubtedly is device specific.

Alan
Re: [PATCH RFCv2 1/2] dmaengine: add support for scatterlist to scatterlist transfers
On Mon, Sep 27, 2010 at 05:23:34PM +0200, Linus Walleij wrote:
> 2010/9/25 Ira W. Snyder i...@ovro.caltech.edu:
> > This adds support for scatterlist to scatterlist DMA transfers.
>
> This is a good idea, we have a local function to do this in DMA40
> already, stedma40_memcpy_sg().

I think that having two devices that want to implement this
functionality as part of the DMAEngine API is a good argument for
making it available as part of the core API. I think it would be good
to add this to struct dma_device, and add a capability (DMA_SG?) for it
as well.

I have looked at the stedma40_memcpy_sg() function, and I think we
would want to extend it slightly for the generic API. Is there any good
reason to prohibit scatterlists with different numbers of elements? For
example:

src scatterlist: 10 elements, each with 4K length (40K total)
dst scatterlist: 40 elements, each with 1K length (40K total)

The total length of both scatterlists is equal, but the number of
scatterlist entries is different. The Freescale DMA controller can
handle this just fine. I'm proposing this function signature:

struct dma_async_tx_descriptor *
dma_memcpy_sg(struct dma_chan *chan,
	      struct scatterlist *dst_sg, unsigned int dst_nents,
	      struct scatterlist *src_sg, unsigned int src_nents,
	      unsigned long flags);

> > This is currently hidden behind a configuration option, which will
> > allow drivers which need this functionality to select it
> > individually.
>
> Why? Isn't it better to add this as a new capability flag if you don't
> want to announce it? Or is the intent to save memory footprint?

Dan wanted this, probably for memory footprint. If more than one driver
is using it, I would rather have it as part of struct dma_device along
with a capability.

Thanks for the feedback,
Ira
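[Editorial note: the mismatched-entry-count case Ira describes is handled by always copying min(space left in the current dst entry, data left in the current src entry) and advancing whichever entry is exhausted. A toy userspace model — plain buffers standing in for struct scatterlist, memcpy standing in for a hardware descriptor — might look like this; `struct sg_ent` and `sg_copy` are invented names for illustration.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for struct scatterlist: just a pointer and a length. */
struct sg_ent {
	unsigned char *buf;
	size_t len;
};

/*
 * Copy from one scatterlist to another even when the entry counts
 * differ: each step copies min(remaining dst entry, remaining src
 * entry) bytes, then advances whichever entry was consumed.
 * Returns the total number of bytes copied.
 */
size_t sg_copy(struct sg_ent *dst, size_t dst_nents,
	       struct sg_ent *src, size_t src_nents)
{
	size_t di = 0, si = 0, doff = 0, soff = 0, total = 0;

	while (di < dst_nents && si < src_nents) {
		size_t dspace = dst[di].len - doff;
		size_t savail = src[si].len - soff;
		size_t chunk = dspace < savail ? dspace : savail;

		memcpy(dst[di].buf + doff, src[si].buf + soff, chunk);
		doff += chunk;
		soff += chunk;
		total += chunk;

		if (doff == dst[di].len) { di++; doff = 0; }
		if (soff == src[si].len) { si++; soff = 0; }
	}
	return total;
}
```

A hardware driver would emit one DMA descriptor per `chunk` instead of calling memcpy, but the walk over the two lists is the same.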
Re: [PATCH RFCv2 1/2] dmaengine: add support for scatterlist to scatterlist transfers
On Mon, Sep 27, 2010 at 10:23 AM, Ira W. Snyder i...@ovro.caltech.edu wrote:
> On Mon, Sep 27, 2010 at 05:23:34PM +0200, Linus Walleij wrote:
> > 2010/9/25 Ira W. Snyder i...@ovro.caltech.edu:
> > > This adds support for scatterlist to scatterlist DMA transfers.
> >
> > This is a good idea, we have a local function to do this in DMA40
> > already, stedma40_memcpy_sg().
>
> I think that having two devices that want to implement this
> functionality as part of the DMAEngine API is a good argument for
> making it available as part of the core API. I think it would be good
> to add this to struct dma_device, and add a capability (DMA_SG?) for
> it as well.
>
> > > This is currently hidden behind a configuration option, which will
> > > allow drivers which need this functionality to select it
> > > individually.
> >
> > Why? Isn't it better to add this as a new capability flag if you
> > don't want to announce it? Or is the intent to save memory footprint?
>
> Dan wanted this, probably for memory footprint. If more than one
> driver is using it,

Yes, I did not see a reason to increment the size of dmaengine.o for
everyone if only one out-of-tree user of the function existed.

> I would rather have it as part of struct dma_device along with a
> capability.

I think having this as a dma_device method makes sense now that more
than one driver would implement it, and lets drivers see the entirety
of the transaction in one call.

--
Dan
[PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections
This set of patches decouples the concept that a single memory section
corresponds to a single directory in /sys/devices/system/memory/.

On systems with large amounts of memory (1+ TB) there are performance
issues related to creating the large number of sysfs directories. For a
powerpc machine with 1 TB of memory we are creating 63,000+
directories. This results in boot times of around 45-50 minutes for
systems with 1 TB of memory and 8 hours for systems with 2 TB of
memory. With this patch set applied I am now seeing boot times of 5
minutes or less.

The root of this issue is in sysfs directory creation. Every time a
directory is created, a string compare is done against all sibling
directories to ensure we do not create duplicates. The list of
directory nodes in sysfs is kept as an unsorted list, which makes each
creation more expensive as the number of directories grows (the total
work is quadratic in the number of directories).

The solution in this patch set is to allow a single directory in sysfs
to span multiple memory sections. This is controlled by an optional
architecturally defined function, memory_block_size_bytes(). The
default definition of this routine returns a memory block size equal to
the memory section size. This maintains the current layout of sysfs
memory directories, so the view from userspace remains the same as it
is today.

For architectures that define their own version of this routine, as is
done for powerpc in this patch set, the view in userspace would change
such that each memoryXXX directory would span multiple memory sections.
The number of sections spanned depends on the value reported by
memory_block_size_bytes().

In both cases a new file, 'end_phys_index', is created in each
memoryXXX directory. This file contains the physical id of the last
memory section covered by the sysfs directory. In the default case, the
value in 'end_phys_index' will be the same as in the existing
'phys_index' file.

This version of the patch set includes an update to properly report
block_size_bytes, phys_index, and end_phys_index. Additionally, the
patch that adds the end_phys_index sysfs file is now patch 5/8 instead
of patch 2/8 as in the previous version of the patches.

-Nathan Fontenot
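[Editorial note: to put the directory counts in perspective, the arithmetic below assumes the powerpc section size of 16 MB (SECTION_SIZE_BITS = 24); the 256 MB block size is purely illustrative. It shows how raising the block size shrinks the sysfs population and hence the quadratic directory-creation cost.]

```c
#include <assert.h>

/*
 * Number of /sys/devices/system/memory/memoryXXX directories needed to
 * cover mem_bytes of RAM when each directory spans block_bytes.
 */
static unsigned long long nr_memory_dirs(unsigned long long mem_bytes,
					 unsigned long long block_bytes)
{
	return mem_bytes / block_bytes;
}
```

At one directory per 16 MB section, 1 TB of RAM needs 65,536 directories (the 63,000+ figure above, modulo holes in the address space); at 256 MB per block it drops to 4,096.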
[PATCH 1/8] v2 Move find_memory_block() routine
Move the find_memory_block() routine up to avoid needing a forward
declaration in subsequent patches.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com
---
 drivers/base/memory.c | 62 +-
 1 file changed, 31 insertions(+), 31 deletions(-)

Index: linux-next/drivers/base/memory.c
===
--- linux-next.orig/drivers/base/memory.c	2010-09-21 11:59:24.0 -0500
+++ linux-next/drivers/base/memory.c	2010-09-21 12:32:45.0 -0500
@@ -435,6 +435,37 @@ int __weak arch_get_memory_phys_device(u
 	return 0;
 }
 
+/*
+ * For now, we have a linear search to go find the appropriate
+ * memory_block corresponding to a particular phys_index. If
+ * this gets to be a real problem, we can always use a radix
+ * tree or something here.
+ *
+ * This could be made generic for all sysdev classes.
+ */
+struct memory_block *find_memory_block(struct mem_section *section)
+{
+	struct kobject *kobj;
+	struct sys_device *sysdev;
+	struct memory_block *mem;
+	char name[sizeof(MEMORY_CLASS_NAME) + 9 + 1];
+
+	/*
+	 * This only works because we know that section == sysdev->id
+	 * slightly redundant with sysdev_register()
+	 */
+	sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, __section_nr(section));
+
+	kobj = kset_find_obj(&memory_sysdev_class.kset, name);
+	if (!kobj)
+		return NULL;
+
+	sysdev = container_of(kobj, struct sys_device, kobj);
+	mem = container_of(sysdev, struct memory_block, sysdev);
+
+	return mem;
+}
+
 static int add_memory_block(int nid, struct mem_section *section,
 			unsigned long state, enum mem_add_context context)
 {
@@ -468,37 +499,6 @@ static int add_memory_block(int nid, str
 	return ret;
 }
 
-/*
- * For now, we have a linear search to go find the appropriate
- * memory_block corresponding to a particular phys_index. If
- * this gets to be a real problem, we can always use a radix
- * tree or something here.
- *
- * This could be made generic for all sysdev classes.
- */
-struct memory_block *find_memory_block(struct mem_section *section)
-{
-	struct kobject *kobj;
-	struct sys_device *sysdev;
-	struct memory_block *mem;
-	char name[sizeof(MEMORY_CLASS_NAME) + 9 + 1];
-
-	/*
-	 * This only works because we know that section == sysdev->id
-	 * slightly redundant with sysdev_register()
-	 */
-	sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, __section_nr(section));
-
-	kobj = kset_find_obj(&memory_sysdev_class.kset, name);
-	if (!kobj)
-		return NULL;
-
-	sysdev = container_of(kobj, struct sys_device, kobj);
-	mem = container_of(sysdev, struct memory_block, sysdev);
-
-	return mem;
-}
-
 int remove_memory_block(unsigned long node_id, struct mem_section *section,
 		int phys_device)
 {
[PATCH 2/8] v2 Add section count to memory_block struct
Add a section count property to the memory_block struct to track the
number of memory sections that have been added/removed from a memory
block. This allows us to know when the last memory section of a memory
block has been removed, so we can remove the memory block.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com
---
 drivers/base/memory.c  | 16 ++--
 include/linux/memory.h |  3 +++
 2 files changed, 13 insertions(+), 6 deletions(-)

Index: linux-next/drivers/base/memory.c
===
--- linux-next.orig/drivers/base/memory.c	2010-09-27 09:17:20.0 -0500
+++ linux-next/drivers/base/memory.c	2010-09-27 09:31:35.0 -0500
@@ -478,6 +478,7 @@
 	mem->phys_index = __section_nr(section);
 	mem->state = state;
+	atomic_inc(&mem->section_count);
 	mutex_init(&mem->state_mutex);
 	start_pfn = section_nr_to_pfn(mem->phys_index);
 	mem->phys_device = arch_get_memory_phys_device(start_pfn);
@@ -505,12 +506,15 @@
 	struct memory_block *mem;
 
 	mem = find_memory_block(section);
-	unregister_mem_sect_under_nodes(mem);
-	mem_remove_simple_file(mem, phys_index);
-	mem_remove_simple_file(mem, state);
-	mem_remove_simple_file(mem, phys_device);
-	mem_remove_simple_file(mem, removable);
-	unregister_memory(mem, section);
+
+	if (atomic_dec_and_test(&mem->section_count)) {
+		unregister_mem_sect_under_nodes(mem);
+		mem_remove_simple_file(mem, phys_index);
+		mem_remove_simple_file(mem, state);
+		mem_remove_simple_file(mem, phys_device);
+		mem_remove_simple_file(mem, removable);
+		unregister_memory(mem, section);
+	}
 
 	return 0;
 }
Index: linux-next/include/linux/memory.h
===
--- linux-next.orig/include/linux/memory.h	2010-09-27 09:17:20.0 -0500
+++ linux-next/include/linux/memory.h	2010-09-27 09:22:56.0 -0500
@@ -19,10 +19,13 @@
 #include <linux/node.h>
 #include <linux/compiler.h>
 #include <linux/mutex.h>
+#include <asm/atomic.h>
 
 struct memory_block {
 	unsigned long phys_index;
 	unsigned long state;
+	atomic_t section_count;
+
 	/*
 	 * This serializes all state change requests.  It isn't
 	 * held during creation because the control files are
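[Editorial note: the section_count handling above is the usual kernel atomic_inc()/atomic_dec_and_test() refcount pattern — only the caller that drops the count to zero tears the block down. A userspace C11 analogue, for illustration:]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace analogue of the kernel's atomic_dec_and_test(): decrement
 * and return true only if the counter reached zero.  Exactly one
 * caller observes the transition to zero, so exactly one caller runs
 * the teardown path (unregister_memory() in the patch above).
 */
static bool dec_and_test(atomic_int *v)
{
	/* atomic_fetch_sub returns the value *before* the decrement */
	return atomic_fetch_sub(v, 1) == 1;
}
```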
[PATCH 3/8] v2 Add mutex for adding/removing memory blocks
Add a new mutex for use in adding and removing of memory blocks. This
is needed to avoid any race conditions in which the same memory block
could be added and removed at the same time.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com
---
 drivers/base/memory.c | 7 +++
 1 file changed, 7 insertions(+)

Index: linux-next/drivers/base/memory.c
===
--- linux-next.orig/drivers/base/memory.c	2010-09-27 09:31:35.0 -0500
+++ linux-next/drivers/base/memory.c	2010-09-27 09:31:57.0 -0500
@@ -27,6 +27,8 @@
 #include <asm/atomic.h>
 #include <asm/uaccess.h>
 
+static DEFINE_MUTEX(mem_sysfs_mutex);
+
 #define MEMORY_CLASS_NAME "memory"
 
 static struct sysdev_class memory_sysdev_class = {
@@ -476,6 +478,8 @@
 	if (!mem)
 		return -ENOMEM;
 
+	mutex_lock(&mem_sysfs_mutex);
+
 	mem->phys_index = __section_nr(section);
 	mem->state = state;
 	atomic_inc(&mem->section_count);
@@ -497,6 +501,7 @@
 		ret = register_mem_sect_under_node(mem, nid);
 	}
 
+	mutex_unlock(&mem_sysfs_mutex);
 	return ret;
 }
@@ -505,6 +510,7 @@
 {
 	struct memory_block *mem;
 
+	mutex_lock(&mem_sysfs_mutex);
 	mem = find_memory_block(section);
 
 	if (atomic_dec_and_test(&mem->section_count)) {
@@ -516,6 +522,7 @@
 		unregister_memory(mem, section);
 	}
 
+	mutex_unlock(&mem_sysfs_mutex);
 	return 0;
 }
[PATCH 4/8] v2 Allow memory block to span multiple memory sections
Update the memory sysfs code such that each sysfs memory directory is
now considered a memory block that can span multiple memory sections
per memory block. The default size of each memory block is
SECTION_SIZE_BITS to maintain the current behavior of having a single
memory section per memory block (i.e. one sysfs directory per memory
section).

For architectures that want to have memory blocks span multiple memory
sections they need only define their own memory_block_size_bytes()
routine.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com
---
 drivers/base/memory.c | 155 ++
 1 file changed, 108 insertions(+), 47 deletions(-)

Index: linux-next/drivers/base/memory.c
===
--- linux-next.orig/drivers/base/memory.c	2010-09-27 09:31:57.0 -0500
+++ linux-next/drivers/base/memory.c	2010-09-27 13:50:18.0 -0500
@@ -30,6 +30,14 @@
 static DEFINE_MUTEX(mem_sysfs_mutex);
 
 #define MEMORY_CLASS_NAME "memory"
+#define MIN_MEMORY_BLOCK_SIZE	(1 << SECTION_SIZE_BITS)
+
+static int sections_per_block;
+
+static inline int base_memory_block_id(int section_nr)
+{
+	return section_nr / sections_per_block;
+}
 
 static struct sysdev_class memory_sysdev_class = {
 	.name = MEMORY_CLASS_NAME,
@@ -84,28 +92,47 @@
 /*
  * register_memory - Setup a sysfs device for a memory block
  */
 static
-int register_memory(struct memory_block *memory, struct mem_section *section)
+int register_memory(struct memory_block *memory)
 {
 	int error;
 
 	memory->sysdev.cls = &memory_sysdev_class;
-	memory->sysdev.id = __section_nr(section);
+	memory->sysdev.id = memory->phys_index / sections_per_block;
 
 	error = sysdev_register(&memory->sysdev);
 	return error;
 }
 
 static void
-unregister_memory(struct memory_block *memory, struct mem_section *section)
+unregister_memory(struct memory_block *memory)
 {
 	BUG_ON(memory->sysdev.cls != &memory_sysdev_class);
-	BUG_ON(memory->sysdev.id != __section_nr(section));
 
 	/* drop the ref. we got in remove_memory_block() */
 	kobject_put(&memory->sysdev.kobj);
 	sysdev_unregister(&memory->sysdev);
 }
 
+u32 __weak memory_block_size_bytes(void)
+{
+	return MIN_MEMORY_BLOCK_SIZE;
+}
+
+static u32 get_memory_block_size(void)
+{
+	u32 block_sz;
+
+	block_sz = memory_block_size_bytes();
+
+	/* Validate blk_sz is a power of 2 and not less than section size */
+	if ((block_sz & (block_sz - 1)) || (block_sz < MIN_MEMORY_BLOCK_SIZE)) {
+		WARN_ON(1);
+		block_sz = MIN_MEMORY_BLOCK_SIZE;
+	}
+
+	return block_sz;
+}
+
 /*
  * use this as the physical section index that this memsection
  * uses.
  */
@@ -116,7 +143,7 @@
 {
 	struct memory_block *mem =
 		container_of(dev, struct memory_block, sysdev);
-	return sprintf(buf, "%08lx\n", mem->phys_index);
+	return sprintf(buf, "%08lx\n", mem->phys_index / sections_per_block);
 }
 
 /*
@@ -125,13 +152,16 @@
 static ssize_t show_mem_removable(struct sys_device *dev,
 			struct sysdev_attribute *attr, char *buf)
 {
-	unsigned long start_pfn;
-	int ret;
+	unsigned long i, pfn;
+	int ret = 1;
 	struct memory_block *mem =
 		container_of(dev, struct memory_block, sysdev);
 
-	start_pfn = section_nr_to_pfn(mem->phys_index);
-	ret = is_mem_section_removable(start_pfn, PAGES_PER_SECTION);
+	for (i = 0; i < sections_per_block; i++) {
+		pfn = section_nr_to_pfn(mem->phys_index + i);
+		ret &= is_mem_section_removable(pfn, PAGES_PER_SECTION);
+	}
+
 	return sprintf(buf, "%d\n", ret);
 }
@@ -184,17 +214,14 @@
  * OK to have direct references to sparsemem variables in here.
  */
 static int
-memory_block_action(struct memory_block *mem, unsigned long action)
+memory_section_action(unsigned long phys_index, unsigned long action)
 {
 	int i;
-	unsigned long psection;
 	unsigned long start_pfn, start_paddr;
 	struct page *first_page;
 	int ret;
-	int old_state = mem->state;
 
-	psection = mem->phys_index;
-	first_page = pfn_to_page(psection << PFN_SECTION_SHIFT);
+	first_page = pfn_to_page(phys_index << PFN_SECTION_SHIFT);
 
 	/*
 	 * The probe routines leave the pages reserved, just
@@ -207,8 +234,8 @@
 			continue;
 
 		printk(KERN_WARNING "section number %ld page number %d "
-			"not reserved, was it already online? \n",
-			psection, i);
+			"not reserved, was it already online?\n",
+			phys_index, i);
 		return -EBUSY;
 	}
 }
@@ -219,18 +246,13 @@
 	ret = online_pages(start_pfn, PAGES_PER_SECTION);
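[Editorial note: the `block_sz & (block_sz - 1)` test in get_memory_block_size() above is the standard power-of-two check. A userspace mirror of the validation, assuming the powerpc section size of 16 MB (SECTION_SIZE_BITS = 24):]

```c
#include <assert.h>

#define MIN_MEMORY_BLOCK_SIZE (1U << 24)	/* 16 MB section, assumed */

/*
 * Mirror of the patch's sanity check: a valid block size must be a
 * power of two (x & (x - 1) == 0) and at least one section; anything
 * else falls back to the minimum.
 */
static unsigned int validate_block_size(unsigned int block_sz)
{
	if ((block_sz & (block_sz - 1)) || (block_sz < MIN_MEMORY_BLOCK_SIZE))
		return MIN_MEMORY_BLOCK_SIZE;
	return block_sz;
}
```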
[PATCH 5/8] v2 Add end_phys_index file
Update the 'phys_index' properties of a memory block to include a 'start_phys_index' which is the same as the current 'phys_index' property. The property still appears as 'phys_index' in sysfs but the memory_block struct name is updated to indicate the start and end values. This also adds an 'end_phys_index' property to indicate the id of the last section in th memory block. Signed-off-by: Nathan Fontenot nf...@austin.ibm.com --- drivers/base/memory.c | 39 ++- include/linux/memory.h |3 ++- 2 files changed, 32 insertions(+), 10 deletions(-) Index: linux-next/drivers/base/memory.c === --- linux-next.orig/drivers/base/memory.c 2010-09-27 13:50:18.0 -0500 +++ linux-next/drivers/base/memory.c2010-09-27 13:50:38.0 -0500 @@ -97,7 +97,7 @@ int error; memory-sysdev.cls = memory_sysdev_class; - memory-sysdev.id = memory-phys_index / sections_per_block; + memory-sysdev.id = memory-start_phys_index / sections_per_block; error = sysdev_register(memory-sysdev); return error; @@ -138,12 +138,26 @@ * uses. 
*/ -static ssize_t show_mem_phys_index(struct sys_device *dev, +static ssize_t show_mem_start_phys_index(struct sys_device *dev, struct sysdev_attribute *attr, char *buf) { struct memory_block *mem = container_of(dev, struct memory_block, sysdev); - return sprintf(buf, %08lx\n, mem-phys_index / sections_per_block); + unsigned long phys_index; + + phys_index = mem-start_phys_index / sections_per_block; + return sprintf(buf, %08lx\n, phys_index); +} + +static ssize_t show_mem_end_phys_index(struct sys_device *dev, + struct sysdev_attribute *attr, char *buf) +{ + struct memory_block *mem = + container_of(dev, struct memory_block, sysdev); + unsigned long phys_index; + + phys_index = mem-end_phys_index / sections_per_block; + return sprintf(buf, %08lx\n, phys_index); } /* @@ -158,7 +172,7 @@ container_of(dev, struct memory_block, sysdev); for (i = 0; i sections_per_block; i++) { - pfn = section_nr_to_pfn(mem-phys_index + i); + pfn = section_nr_to_pfn(mem-start_phys_index + i); ret = is_mem_section_removable(pfn, PAGES_PER_SECTION); } @@ -275,14 +289,15 @@ mem-state = MEM_GOING_OFFLINE; for (i = 0; i sections_per_block; i++) { - ret = memory_section_action(mem-phys_index + i, to_state); + ret = memory_section_action(mem-start_phys_index + i, + to_state); if (ret) break; } if (ret) { for (i = 0; i sections_per_block; i++) - memory_section_action(mem-phys_index + i, + memory_section_action(mem-start_phys_index + i, from_state_req); mem-state = from_state_req; @@ -330,7 +345,8 @@ return sprintf(buf, %d\n, mem-phys_device); } -static SYSDEV_ATTR(phys_index, 0444, show_mem_phys_index, NULL); +static SYSDEV_ATTR(phys_index, 0444, show_mem_start_phys_index, NULL); +static SYSDEV_ATTR(end_phys_index, 0444, show_mem_end_phys_index, NULL); static SYSDEV_ATTR(state, 0644, show_mem_state, store_mem_state); static SYSDEV_ATTR(phys_device, 0444, show_phys_device, NULL); static SYSDEV_ATTR(removable, 0444, show_mem_removable, NULL); @@ -514,17 +530,21 @@ return -ENOMEM; scn_nr = 
__section_nr(section); - mem->phys_index = base_memory_block_id(scn_nr) * sections_per_block; + mem->start_phys_index = + base_memory_block_id(scn_nr) * sections_per_block; + mem->end_phys_index = mem->start_phys_index + sections_per_block - 1; mem->state = state; atomic_inc(&mem->section_count); mutex_init(&mem->state_mutex); - start_pfn = section_nr_to_pfn(mem->phys_index); + start_pfn = section_nr_to_pfn(mem->start_phys_index); mem->phys_device = arch_get_memory_phys_device(start_pfn); ret = register_memory(mem); if (!ret) ret = mem_create_simple_file(mem, phys_index); if (!ret) + ret = mem_create_simple_file(mem, end_phys_index); + if (!ret) ret = mem_create_simple_file(mem, state); if (!ret) ret = mem_create_simple_file(mem, phys_device); @@ -571,6 +591,7 @@ if (atomic_dec_and_test(&mem->section_count)) { unregister_mem_sect_under_nodes(mem); mem_remove_simple_file(mem, phys_index); + mem_remove_simple_file(mem, end_phys_index); mem_remove_simple_file(mem, state);
[PATCH 6/8] v2 Update node sysfs code
Update the node sysfs code to be aware of the new capability for a memory block to contain multiple memory sections. This requires an additional parameter to unregister_mem_sect_under_nodes so that we know which memory section of the memory block to unregister. Signed-off-by: Nathan Fontenot nf...@austin.ibm.com --- drivers/base/memory.c | 2 +- drivers/base/node.c | 12 include/linux/node.h | 6 -- 3 files changed, 13 insertions(+), 7 deletions(-) Index: linux-next/drivers/base/node.c === --- linux-next.orig/drivers/base/node.c 2010-09-27 13:49:36.0 -0500 +++ linux-next/drivers/base/node.c 2010-09-27 13:50:43.0 -0500 @@ -346,8 +346,10 @@ return -EFAULT; if (!node_online(nid)) return 0; - sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index); - sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1; + + sect_start_pfn = section_nr_to_pfn(mem_blk->start_phys_index); + sect_end_pfn = section_nr_to_pfn(mem_blk->end_phys_index); + sect_end_pfn += PAGES_PER_SECTION - 1; for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) { int page_nid; @@ -371,7 +373,8 @@ } /* unregister memory section under all nodes that it spans */ -int unregister_mem_sect_under_nodes(struct memory_block *mem_blk) +int unregister_mem_sect_under_nodes(struct memory_block *mem_blk, + unsigned long phys_index) { NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL); unsigned long pfn, sect_start_pfn, sect_end_pfn; @@ -383,7 +386,8 @@ if (!unlinked_nodes) return -ENOMEM; nodes_clear(*unlinked_nodes); - sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index); + + sect_start_pfn = section_nr_to_pfn(phys_index); sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1; for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) { int nid; Index: linux-next/drivers/base/memory.c === --- linux-next.orig/drivers/base/memory.c 2010-09-27 13:50:38.0 -0500 +++ linux-next/drivers/base/memory.c 2010-09-27 13:50:43.0 -0500 @@ -587,9 +587,9 @@ mutex_lock(&mem_sysfs_mutex); mem = find_memory_block(section); + 
unregister_mem_sect_under_nodes(mem, __section_nr(section)); if (atomic_dec_and_test(&mem->section_count)) { - unregister_mem_sect_under_nodes(mem); mem_remove_simple_file(mem, phys_index); mem_remove_simple_file(mem, end_phys_index); mem_remove_simple_file(mem, state); Index: linux-next/include/linux/node.h === --- linux-next.orig/include/linux/node.h 2010-09-27 13:49:36.0 -0500 +++ linux-next/include/linux/node.h 2010-09-27 13:50:43.0 -0500 @@ -44,7 +44,8 @@ extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid); extern int register_mem_sect_under_node(struct memory_block *mem_blk, int nid); -extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk); +extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk, + unsigned long phys_index); #ifdef CONFIG_HUGETLBFS extern void register_hugetlbfs_with_node(node_registration_func_t doregister, @@ -72,7 +73,8 @@ { return 0; } -static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk) +static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk, + unsigned long phys_index) { return 0; } ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 7/8] v2 Define memory_block_size_bytes() for powerpc/pseries
Define a version of memory_block_size_bytes() for powerpc/pseries such that a memory block spans an entire lmb. Signed-off-by: Nathan Fontenot nf...@austin.ibm.com --- arch/powerpc/platforms/pseries/hotplug-memory.c | 66 +++- 1 file changed, 53 insertions(+), 13 deletions(-) Index: linux-next/arch/powerpc/platforms/pseries/hotplug-memory.c === --- linux-next.orig/arch/powerpc/platforms/pseries/hotplug-memory.c 2010-09-27 13:49:34.0 -0500 +++ linux-next/arch/powerpc/platforms/pseries/hotplug-memory.c 2010-09-27 13:50:45.0 -0500 @@ -17,6 +17,54 @@ #include <asm/pSeries_reconfig.h> #include <asm/sparsemem.h> +static u32 get_memblock_size(void) +{ + struct device_node *np; + unsigned int memblock_size = 0; + + np = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory"); + if (np) { + const unsigned long *size; + + size = of_get_property(np, "ibm,lmb-size", NULL); + memblock_size = size ? *size : 0; + + of_node_put(np); + } else { + unsigned int memzero_size = 0; + const unsigned int *regs; + + np = of_find_node_by_path("/mem...@0"); + if (np) { + regs = of_get_property(np, "reg", NULL); + memzero_size = regs ? regs[3] : 0; + of_node_put(np); + } + + if (memzero_size) { + /* We now know the size of mem...@0, use this to find + * the first memoryblock and get its size. + */ + char buf[64]; + + sprintf(buf, "/mem...@%x", memzero_size); + np = of_find_node_by_path(buf); + if (np) { + regs = of_get_property(np, "reg", NULL); + memblock_size = regs ? 
regs[3] : 0; + of_node_put(np); + } + } + } + + return memblock_size; +} + +u32 memory_block_size_bytes(void) +{ + return get_memblock_size(); +} + static int pseries_remove_memblock(unsigned long base, unsigned int memblock_size) { unsigned long start, start_pfn; @@ -127,30 +175,22 @@ static int pseries_drconf_memory(unsigned long *base, unsigned int action) { - struct device_node *np; - const unsigned long *lmb_size; + unsigned long memblock_size; int rc; - np = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory"); - if (!np) + memblock_size = get_memblock_size(); + if (!memblock_size) return -EINVAL; - lmb_size = of_get_property(np, "ibm,lmb-size", NULL); - if (!lmb_size) { - of_node_put(np); - return -EINVAL; - } - if (action == PSERIES_DRCONF_MEM_ADD) { - rc = memblock_add(*base, *lmb_size); + rc = memblock_add(*base, memblock_size); rc = (rc < 0) ? -EINVAL : 0; } else if (action == PSERIES_DRCONF_MEM_REMOVE) { - rc = pseries_remove_memblock(*base, *lmb_size); + rc = pseries_remove_memblock(*base, memblock_size); } else { rc = -EINVAL; } - of_node_put(np); return rc; } ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 8/8] v2 Update memory hotplug documentation
Update the memory hotplug documentation to reflect the new behavior of memory blocks shown in sysfs. Signed-off-by: Nathan Fontenot nf...@austin.ibm.com --- Documentation/memory-hotplug.txt | 46 +-- 1 file changed, 30 insertions(+), 16 deletions(-) Index: linux-next/Documentation/memory-hotplug.txt === --- linux-next.orig/Documentation/memory-hotplug.txt 2010-09-27 13:49:33.0 -0500 +++ linux-next/Documentation/memory-hotplug.txt 2010-09-27 13:50:48.0 -0500 @@ -126,36 +126,50 @@ 4 sysfs files for memory hotplug -All sections have their device information under /sys/devices/system/memory as +All sections have their device information in sysfs. Each section is part of +a memory block under /sys/devices/system/memory as /sys/devices/system/memory/memoryXXX -(XXX is section id.) +(XXX is the section id.) -Now, XXX is defined as start_address_of_section / section_size. +Now, XXX is defined as (start_address_of_section / section_size) of the first +section contained in the memory block. The files 'phys_index' and +'end_phys_index' under each directory report the beginning and end section ids +for the memory block covered by the sysfs directory. It is expected that all +memory sections in this range are present and no memory holes exist in the +range. Currently there is no way to determine if there is a memory hole, but +the existence of one should not affect the hotplug capabilities of the memory +block. For example, assume 1GiB section size. A device for a memory starting at 0x100000000 is /sys/device/system/memory/memory4 (0x100000000 / 1GiB = 4) This device covers address range [0x100000000 ... 0x140000000) -Under each section, you can see 4 files. +Under each section, you can see 5 files. 
-/sys/devices/system/memory/memoryXXX/phys_index +/sys/devices/system/memory/memoryXXX/start_phys_index +/sys/devices/system/memory/memoryXXX/end_phys_index /sys/devices/system/memory/memoryXXX/phys_device /sys/devices/system/memory/memoryXXX/state /sys/devices/system/memory/memoryXXX/removable -'phys_index' : read-only and contains section id, same as XXX. -'state' : read-write - at read: contains online/offline state of memory. - at write: user can specify online, offline command -'phys_device': read-only: designed to show the name of physical memory device. - This is not well implemented now. -'removable' : read-only: contains an integer value indicating - whether the memory section is removable or not - removable. A value of 1 indicates that the memory - section is removable and a value of 0 indicates that - it is not removable. +'phys_index' : read-only and contains section id of the first section + in the memory block, same as XXX. +'end_phys_index' : read-only and contains section id of the last section + in the memory block. +'state' : read-write +at read: contains online/offline state of memory. +at write: user can specify online, offline command +which will be performed on all sections in the block. +'phys_device' : read-only: designed to show the name of physical memory +device. This is not well implemented now. +'removable' : read-only: contains an integer value indicating +whether the memory block is removable or not +removable. A value of 1 indicates that the memory +block is removable and a value of 0 indicates that +it is not removable. A memory block is removable only if +every section in the block is removable. NOTE: These directories/files appear after physical memory hotplug phase. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/1] Add config option for batched hcalls
On Sat, 2010-09-25 at 22:49 -0500, Olof Johansson wrote: On Fri, Sep 24, 2010 at 04:44:15PM -0500, Will Schmidt wrote: Add a config option for the (batched) MULTITCE and BULK_REMOVE h-calls. By default, these options are on and are beneficial for performance and throughput reasons. If disabled, the code will fall back to using less optimal TCE and REMOVE hcalls. The ability to easily disable these options is useful for some of the PREEMPT_RT related investigation and work occurring on Power. Hi, I can see why it's useful to enable and disable, but these are all runtime-checked, wouldn't it be more useful to add a bootarg to handle it instead of adding some new config options that pretty much everyone will always go with the defaults on? The bits are set early, but from looking at where they're used, there doesn't seem to be any harm in disabling them later on when a bootarg is convenient to parse and deal with? It has the benefit of easier on/off testing, if that has any value for production debug down the road. Hi Olof, That's a good idea, let me poke at this a bit more, see if I can get bootargs for this. Thanks, -Will -Olof ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/2] 476: Set CCR2[DSTI] to prevent isync from flushing shadow TLB
On Mon, 2010-09-27 at 10:26 -0500, Dave Kleikamp wrote: I think I made it a config option at Ben's request when I first started this work last year, before being sidetracked by other priorities. I could either remove the option, or default it to 'n'. It might be best to just hard-code the behavior to make sure it's exercised, since there's no 47x hardware in production yet, but we can give Ben a chance to weigh in with his opinion. You can remove the option I suppose. It was useful to have it during early bringup but probably not anymore. Cheers, Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/2] 476: Set CCR2[DSTI] to prevent isync from flushing shadow TLB
On Tue, 2010-09-28 at 07:10 +1000, Benjamin Herrenschmidt wrote: On Mon, 2010-09-27 at 10:26 -0500, Dave Kleikamp wrote: I think I made it a config option at Ben's request when I first started this work last year, before being sidetracked by other priorities. I could either remove the option, or default it to 'n'. It might be best to just hard-code the behavior to make sure it's exercised, since there's no 47x hardware in production yet, but we can give Ben a chance to weigh in with his opinion. You can remove the option I suppose. It was useful to have it during early bringup but probably not anymore. Thanks, Ben. I'll resend it without the config option. Shaggy -- Dave Kleikamp IBM Linux Technology Center ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 1/2] v2 476: Set CCR2[DSTI] to prevent isync from flushing shadow TLB
When the DSTI (Disable Shadow TLB Invalidate) bit is set in the CCR2 register, the isync command does not flush the shadow TLB (iTLB & dTLB). However, since the shadow TLB does not contain context information, we want the shadow TLB flushed in situations where we are switching context. In those situations, we explicitly clear the DSTI bit before performing isync, and set it again afterward. We also need to do the same when we perform isync after explicitly flushing the TLB. Signed-off-by: Dave Kleikamp sha...@linux.vnet.ibm.com --- arch/powerpc/include/asm/reg_booke.h | 4 arch/powerpc/kernel/head_44x.S | 25 + arch/powerpc/mm/tlb_nohash_low.S | 14 +- arch/powerpc/platforms/44x/misc_44x.S | 26 ++ 4 files changed, 68 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/include/asm/reg_booke.h b/arch/powerpc/include/asm/reg_booke.h index 667a498..a7ecbfe 100644 --- a/arch/powerpc/include/asm/reg_booke.h +++ b/arch/powerpc/include/asm/reg_booke.h @@ -120,6 +120,7 @@ #define SPRN_TLB3CFG 0x2B3 /* TLB 3 Config Register */ #define SPRN_EPR 0x2BE /* External Proxy Register */ #define SPRN_CCR1 0x378 /* Core Configuration Register 1 */ +#define SPRN_CCR2_476 0x379 /* Core Configuration Register 2 (476) */ #define SPRN_ZPR 0x3B0 /* Zone Protection Register (40x) */ #define SPRN_MAS7 0x3B0 /* MMU Assist Register 7 */ #define SPRN_MMUCR 0x3B2 /* MMU Control Register */ @@ -188,6 +189,9 @@ #define CCR1_DPC 0x00000100 /* Disable L1 I-Cache/D-Cache parity checking */ #define CCR1_TCS 0x00000080 /* Timer Clock Select */ +/* Bit definitions for CCR2. */ +#define CCR2_476_DSTI 0x08000000 /* Disable Shadow TLB Invalidate */ + /* Bit definitions for the MCSR. 
*/ #define MCSR_MCS 0x80000000 /* Machine Check Summary */ #define MCSR_IB 0x40000000 /* Instruction PLB Error */ diff --git a/arch/powerpc/kernel/head_44x.S b/arch/powerpc/kernel/head_44x.S index 562305b..cd34afb 100644 --- a/arch/powerpc/kernel/head_44x.S +++ b/arch/powerpc/kernel/head_44x.S @@ -38,6 +38,7 @@ #include <asm/ppc_asm.h> #include <asm/asm-offsets.h> #include <asm/synch.h> +#include <asm/bug.h> #include "head_booke.h" @@ -703,8 +704,23 @@ _GLOBAL(set_context) stw r4, 0x4(r5) #endif mtspr SPRN_PID,r3 +BEGIN_MMU_FTR_SECTION + b 1f +END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_47x) isync /* Force context change */ blr +1: +#ifdef CONFIG_PPC_47x + mfspr r10,SPRN_CCR2_476 + rlwinm r11,r10,0,~CCR2_476_DSTI + mtspr SPRN_CCR2_476,r11 + isync /* Force context change */ + mtspr SPRN_CCR2_476,r10 +#else /* CONFIG_PPC_47x */ +2: trap + EMIT_BUG_ENTRY 2b,__FILE__,__LINE__,0; +#endif /* CONFIG_PPC_47x */ + blr /* * Init CPU state. This is called at boot time or for secondary CPUs @@ -861,6 +877,15 @@ skpinv: addi r4,r4,1 /* Increment */ isync #endif /* CONFIG_PPC_EARLY_DEBUG_44x */ +BEGIN_MMU_FTR_SECTION + mfspr r3,SPRN_CCR2_476 + /* With CCR2(DSTI) set, isync does not invalidate the shadow TLB */ + oris r3,r3,CCR2_476_DSTI@h + rlwinm r3,r3,0,~CCR2_476_DSTI + mtspr SPRN_CCR2_476,r3 + isync +END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_47x) + /* Establish the interrupt vector offsets */ SET_IVOR(0, CriticalInput); SET_IVOR(1, MachineCheck); diff --git a/arch/powerpc/mm/tlb_nohash_low.S b/arch/powerpc/mm/tlb_nohash_low.S index b9d9fed..f28fb52 100644 --- a/arch/powerpc/mm/tlb_nohash_low.S +++ b/arch/powerpc/mm/tlb_nohash_low.S @@ -112,7 +112,11 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_47x) clrrwi r4,r3,12 /* get an EPN for the hashing with V = 0 */ ori r4,r4,PPC47x_TLBE_SIZE tlbwe r4,r7,0 /* write it */ + mfspr r8,SPRN_CCR2_476 + rlwinm r9,r8,0,~CCR2_476_DSTI + mtspr SPRN_CCR2_476,r9 isync + mtspr SPRN_CCR2_476,r8 wrtee r10 blr #else /* CONFIG_PPC_47x */ @@ -180,7 +184,11 @@ 
END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_47x) lwz r8,0(r10) /* Load boltmap entry */ addi r10,r10,4 /* Next word */ b 1b /* Then loop */ -1: isync /* Sync shadows */ +1: mfspr r9,SPRN_CCR2_476 + rlwinm r10,r9,0,~CCR2_476_DSTI + mtspr SPRN_CCR2_476,r10 + isync /* Sync shadows */ + mtspr SPRN_CCR2_476,r9 wrtee r11 #else /* CONFIG_PPC_47x */ 1: trap @@ -203,7 +211,11 @@ _GLOBAL(_tlbivax_bcast) isync /* tlbivax 0,r3 - use .long to avoid binutils deps */ .long 0x7c000624 | (r3 << 11) + mfspr r8,SPRN_CCR2_476 + rlwinm r9,r8,0,~CCR2_476_DSTI + mtspr SPRN_CCR2_476,r9
[PATCH RFCv3 0/4] dma: add support for scatterlist to scatterlist copy
This series adds support for scatterlist to scatterlist copies to the generic DMAEngine API. Both the fsldma and ste_dma40 drivers currently implement a similar API using different, non-generic methods. This series converts both of them to the new, standardized API. By doing this as part of the core DMAEngine API, the individual drivers have control over how to chain their descriptors together. This is different to the previous implementation, which called device_prep_dma_memcpy() multiple times. Neither implementation has been tested on real hardware. I attempted a conversion of the ste_dma40 driver which should do the right thing, but the authors should check and make sure. Ira W. Snyder (4): dma: add support for scatterlist to scatterlist copy fsldma: implement support for scatterlist to scatterlist copy fsldma: remove DMA_SLAVE support ste_dma40: implement support for scatterlist to scatterlist copy arch/powerpc/include/asm/fsldma.h | 115 ++ drivers/dma/dmaengine.c |2 + drivers/dma/fsldma.c | 321 + drivers/dma/ste_dma40.c | 17 ++ include/linux/dmaengine.h |6 + 5 files changed, 185 insertions(+), 276 deletions(-) ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 1/4] dma: add support for scatterlist to scatterlist copy
This adds support for scatterlist to scatterlist DMA transfers. A similar interface is exposed by the fsldma driver (through the DMA_SLAVE API) and by the ste_dma40 driver (through an exported function). This patch paves the way for making this type of copy operation a part of the generic DMAEngine API. Further patches will add support in individual drivers. Signed-off-by: Ira W. Snyder i...@ovro.caltech.edu --- drivers/dma/dmaengine.c | 2 ++ include/linux/dmaengine.h | 6 ++ 2 files changed, 8 insertions(+), 0 deletions(-) diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c index 9d31d5e..db403b8 100644 --- a/drivers/dma/dmaengine.c +++ b/drivers/dma/dmaengine.c @@ -690,6 +690,8 @@ int dma_async_device_register(struct dma_device *device) !device->device_prep_dma_memset); BUG_ON(dma_has_cap(DMA_INTERRUPT, device->cap_mask) && !device->device_prep_dma_interrupt); + BUG_ON(dma_has_cap(DMA_SG, device->cap_mask) && + !device->device_prep_dma_sg); BUG_ON(dma_has_cap(DMA_SLAVE, device->cap_mask) && !device->device_prep_slave_sg); BUG_ON(dma_has_cap(DMA_SLAVE, device->cap_mask) diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h index c61d4ca..7c44620 100644 --- a/include/linux/dmaengine.h +++ b/include/linux/dmaengine.h @@ -64,6 +64,7 @@ enum dma_transaction_type { DMA_PQ_VAL, DMA_MEMSET, DMA_INTERRUPT, + DMA_SG, DMA_PRIVATE, DMA_ASYNC_TX, DMA_SLAVE, @@ -473,6 +474,11 @@ struct dma_device { unsigned long flags); struct dma_async_tx_descriptor *(*device_prep_dma_interrupt)( struct dma_chan *chan, unsigned long flags); + struct dma_async_tx_descriptor *(*device_prep_dma_sg)( + struct dma_chan *chan, + struct scatterlist *dst_sg, unsigned int dst_nents, + struct scatterlist *src_sg, unsigned int src_nents, + unsigned long flags); struct dma_async_tx_descriptor *(*device_prep_slave_sg)( struct dma_chan *chan, struct scatterlist *sgl, -- 1.7.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 3/4] fsldma: remove DMA_SLAVE support
Now that the generic DMAEngine API has support for scatterlist to scatterlist copying, this implementation of the DMA_SLAVE API is no longer necessary. In order to let device_control() continue to function, a stub device_prep_slave_sg() function is provided. This allows custom device configuration, such as enabling external control. Signed-off-by: Ira W. Snyder i...@ovro.caltech.edu --- arch/powerpc/include/asm/fsldma.h | 115 ++-- drivers/dma/fsldma.c | 219 +++-- 2 files changed, 48 insertions(+), 286 deletions(-) diff --git a/arch/powerpc/include/asm/fsldma.h b/arch/powerpc/include/asm/fsldma.h index debc5ed..dc0bd27 100644 --- a/arch/powerpc/include/asm/fsldma.h +++ b/arch/powerpc/include/asm/fsldma.h @@ -1,7 +1,7 @@ /* * Freescale MPC83XX / MPC85XX DMA Controller * - * Copyright (c) 2009 Ira W. Snyder i...@ovro.caltech.edu + * Copyright (c) 2009-2010 Ira W. Snyder i...@ovro.caltech.edu * * This file is licensed under the terms of the GNU General Public License * version 2. This program is licensed "as is" without any warranty of any @@ -11,127 +11,32 @@ #ifndef __ARCH_POWERPC_ASM_FSLDMA_H__ #define __ARCH_POWERPC_ASM_FSLDMA_H__ -#include <linux/slab.h> #include <linux/dmaengine.h> /* - * Definitions for the Freescale DMA controller's DMA_SLAVE implemention + * The Freescale DMA controller has several features that are not accommodated + * in the Linux DMAEngine API. Therefore, the generic structure is expanded + * to allow drivers to use these features. * - * The Freescale DMA_SLAVE implementation was designed to handle many-to-many - * transfers. An example usage would be an accelerated copy between two - * scatterlists. Another example use would be an accelerated copy from - * multiple non-contiguous device buffers into a single scatterlist. + * This structure should be passed into the DMAEngine routine device_control() + * as in this example: * - * A DMA_SLAVE transaction is defined by a struct fsl_dma_slave. 
This - * structure contains a list of hardware addresses that should be copied - * to/from the scatterlist passed into device_prep_slave_sg(). The structure - * also has some fields to enable hardware-specific features. + * chan-device-device_control(chan, DMA_SLAVE_CONFIG, (unsigned long)cfg); */ /** - * struct fsl_dma_hw_addr - * @entry: linked list entry - * @address: the hardware address - * @length: length to transfer - * - * Holds a single physical hardware address / length pair for use - * with the DMAEngine DMA_SLAVE API. - */ -struct fsl_dma_hw_addr { - struct list_head entry; - - dma_addr_t address; - size_t length; -}; - -/** * struct fsl_dma_slave - * @addresses: a linked list of struct fsl_dma_hw_addr structures + * @config: the standard Linux DMAEngine API DMA_SLAVE configuration * @request_count: value for DMA request count - * @src_loop_size: setup and enable constant source-address DMA transfers - * @dst_loop_size: setup and enable constant destination address DMA transfers * @external_start: enable externally started DMA transfers * @external_pause: enable externally paused DMA transfers - * - * Holds a list of address / length pairs for use with the DMAEngine - * DMA_SLAVE API implementation for the Freescale DMA controller. */ -struct fsl_dma_slave { +struct fsldma_slave_config { + struct dma_slave_config config; - /* List of hardware address/length pairs */ - struct list_head addresses; - - /* Support for extra controller features */ unsigned int request_count; - unsigned int src_loop_size; - unsigned int dst_loop_size; bool external_start; bool external_pause; }; -/** - * fsl_dma_slave_append - add an address/length pair to a struct fsl_dma_slave - * @slave: the struct fsl_dma_slave to add to - * @address: the hardware address to add - * @length: the length of bytes to transfer from @address - * - * Add a hardware address/length pair to a struct fsl_dma_slave. Returns 0 on - * success, -ERRNO otherwise. 
- */ -static inline int fsl_dma_slave_append(struct fsl_dma_slave *slave, - dma_addr_t address, size_t length) -{ - struct fsl_dma_hw_addr *addr; - - addr = kzalloc(sizeof(*addr), GFP_ATOMIC); - if (!addr) - return -ENOMEM; - - INIT_LIST_HEAD(&addr->entry); - addr->address = address; - addr->length = length; - - list_add_tail(&addr->entry, &slave->addresses); - return 0; -} - -/** - * fsl_dma_slave_free - free a struct fsl_dma_slave - * @slave: the struct fsl_dma_slave to free - * - * Free a struct fsl_dma_slave and all associated address/length pairs - */ -static inline void fsl_dma_slave_free(struct fsl_dma_slave *slave) -{ - struct fsl_dma_hw_addr *addr, *tmp; - - if (slave) { - list_for_each_entry_safe(addr, tmp, &slave->addresses, entry) { -
[RFC PATCH 2/2] pseries/xics: use cpu_possible_mask rather than cpu_all_mask
Current firmware only allows us to send IRQs to the first processor or all processors. We currently check to see if the passed in mask is equal to the all_mask, but the firmware is only considering whether the request is for the equivalent of the possible_mask. Thus, we think the request is for some subset of CPUs and only assign IRQs to the first CPU (on systems without irqbalance running) as evidenced by /proc/interrupts. By using possible_mask instead, we account for this and proper interleaving of interrupts occurs. Without this change and "pseries/xics: use cpu_possible_mask rather than cpu_all_mask", IRQs are all routed to CPU0 on power machines not running irqbalance. Signed-off-by: Nishanth Aravamudan n...@us.ibm.com --- arch/powerpc/platforms/pseries/xics.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/platforms/pseries/xics.c b/arch/powerpc/platforms/pseries/xics.c index 93834b0..7c1e342 100644 --- a/arch/powerpc/platforms/pseries/xics.c +++ b/arch/powerpc/platforms/pseries/xics.c @@ -178,7 +178,7 @@ static int get_irq_server(unsigned int virq, const struct cpumask *cpumask, if (!distribute_irqs) return default_server; - if (!cpumask_equal(cpumask, cpu_all_mask)) { + if (!cpumask_subset(cpu_possible_mask, cpumask)) { int server = cpumask_first_and(cpu_online_mask, cpumask); if (server < nr_cpu_ids) -- 1.7.0.4 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[RFC PATCH 0/2] Fix IRQ round-robing w/o irqbalance on pseries
We have received reports on power systems not running irqbalance where all interrupts are being routed to CPU0 rather than being interleaved by default across the system. Current firmware only allows either sending interrupts to all CPUs or sending them to one CPU. The following two patches address this issue by fixing the mask used in generic code and by fixing the check for the all setting in the pseries code. Nishanth Aravamudan (2): IRQ: use cpu_possible_mask rather than online_mask in setup_affinity pseries/xics: use cpu_possible_mask rather than cpu_all_mask arch/powerpc/platforms/pseries/xics.c |2 +- kernel/irq/manage.c |2 +- 2 files changed, 2 insertions(+), 2 deletions(-) Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-ker...@vger.kernel.org ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 4/8] v2 Allow memory block to span multiple memory sections
On Mon, 2010-09-27 at 14:25 -0500, Nathan Fontenot wrote: +static inline int base_memory_block_id(int section_nr) +{ + return section_nr / sections_per_block; +} ... - mutex_lock(&mem_sysfs_mutex); - - mem->phys_index = __section_nr(section); + scn_nr = __section_nr(section); + mem->phys_index = base_memory_block_id(scn_nr) * sections_per_block; I'm really regretting giving this variable such a horrid name. I suck. I think this is correct now: mem->phys_index = base_memory_block_id(scn_nr) * sections_per_block; mem->phys_index = section_nr / sections_per_block * sections_per_block; mem->phys_index = section_nr Since it gets exported to userspace this way: +static ssize_t show_mem_start_phys_index(struct sys_device *dev, struct sysdev_attribute *attr, char *buf) { struct memory_block *mem = container_of(dev, struct memory_block, sysdev); - return sprintf(buf, "%08lx\n", mem->phys_index / sections_per_block); + unsigned long phys_index; + + phys_index = mem->start_phys_index / sections_per_block; + return sprintf(buf, "%08lx\n", phys_index); +} The only other thing I'd say is that we need to put phys_index out of its misery and call it what it is now: a section number. I think it's OK to call them start/end_section_nr, at least inside the kernel. I intentionally used phys_index terminology in sysfs so that we _could_ eventually do this stuff and break the relationship between sections and the sysfs dirs, but I think keeping the terminology around inside the kernel is confusing now. -- Dave ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: Oops in trace_hardirqs_on (powerpc)
On Mon, 2010-09-27 at 14:50 +0200, Jörg Sommer wrote: Hello Steven, Steven Rostedt wrote on Wed 22. Sep, 15:44 (-0400): Sorry for the late reply, but I was on vacation when you sent this, and I missed it while going through email. Do you still have this issue? No. I've rebuilt my kernel without TRACE_IRQFLAGS and the problem vanished, as expected. The problem is that in some cases the stack is only two frames deep, which causes the macro CALLER_ADDR1 to make an invalid access. Someone told me there's a workaround for the problem on i386, too. % sed -n 2p arch/x86/lib/thunk_32.S * Trampoline to trace irqs off. (otherwise CALLER_ADDR1 might crash) Yes, I remember that problem. When I get back from Tokyo, I'll try to remember to fix it. Thanks! -- Steve ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 2/3 RESEND] powerpc: remove cast from void*
Unnecessary cast from void* in assignment. Signed-off-by: matt mooney m...@muteddisk.com --- arch/powerpc/platforms/pseries/hvCall_inst.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hvCall_inst.c b/arch/powerpc/platforms/pseries/hvCall_inst.c index e19ff02..f106662 100644 --- a/arch/powerpc/platforms/pseries/hvCall_inst.c +++ b/arch/powerpc/platforms/pseries/hvCall_inst.c @@ -55,7 +55,7 @@ static void hc_stop(struct seq_file *m, void *p) static int hc_show(struct seq_file *m, void *p) { unsigned long h_num = (unsigned long)p; - struct hcall_stats *hs = (struct hcall_stats *)m->private; + struct hcall_stats *hs = m->private; if (hs[h_num].num_calls) { if (cpu_has_feature(CPU_FTR_PURR)) -- 1.7.2.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev