Re: [PATCH 1/2] 476: Set CCR2[DSTI] to prevent isync from flushing shadow TLB

2010-09-27 Thread Josh Boyer
On Fri, Sep 24, 2010 at 01:01:36PM -0500, Dave Kleikamp wrote:
When the DSTI (Disable Shadow TLB Invalidate) bit is set in the CCR2
register, the isync command does not flush the shadow TLB (iTLB & dTLB).

However, since the shadow TLB does not contain context information, we
want the shadow TLB flushed in situations where we are switching context.
In those situations, we explicitly clear the DSTI bit before performing
isync, and set it again afterward.  We also need to do the same when we
perform isync after explicitly flushing the TLB.

The setting of the DSTI bit is dependent on
CONFIG_PPC_47x_DISABLE_SHADOW_TLB_INVALIDATE.  When we are confident that
the feature works as expected, the option can probably be removed.
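A minimal sketch of that clear/flush/restore sequence, modeled in plain userspace C (the mfspr_ccr2/mtspr_ccr2 accessors and the CCR2_DSTI bit position below are illustrative stand-ins, not the real 476 register definitions):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins only -- not the real 476 SPR accessors or the
 * actual CCR2[DSTI] bit position. */
static uint32_t fake_ccr2;
#define CCR2_DSTI (1u << 7)

static uint32_t mfspr_ccr2(void) { return fake_ccr2; }
static void mtspr_ccr2(uint32_t v) { fake_ccr2 = v; }
static void isync(void) { /* would be a context-synchronizing instruction */ }

/* Temporarily clear DSTI so the following isync flushes the shadow TLB,
 * then restore the saved CCR2 value afterward. */
static void shadow_tlb_flush(void)
{
	uint32_t saved = mfspr_ccr2();

	mtspr_ccr2(saved & ~CCR2_DSTI);	/* re-enable shadow TLB invalidate */
	isync();			/* this isync flushes the shadow TLB */
	mtspr_ccr2(saved);		/* put DSTI back the way it was */
}
```

The point is that CCR2 ends up unchanged after the flush, so the common fast path (isync with DSTI set) stays cheap.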

You're defaulting it to 'y' in the Kconfig.  Technically someone could
turn it off I guess, but practice mostly shows that nobody mucks with
the defaults.  Do you want it to default to 'n' for now if you aren't
confident in it just yet?

(Linus also has some kind of gripe with new options being default 'y',
but I don't recall all the details and I doubt he'd care about something
in low-level PPC code.)

josh
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 1/2] 476: Set CCR2[DSTI] to prevent isync from flushing shadow TLB

2010-09-27 Thread Dave Kleikamp
On Mon, 2010-09-27 at 11:04 -0400, Josh Boyer wrote:
 On Fri, Sep 24, 2010 at 01:01:36PM -0500, Dave Kleikamp wrote:
 When the DSTI (Disable Shadow TLB Invalidate) bit is set in the CCR2
 register, the isync command does not flush the shadow TLB (iTLB & dTLB).
 
 However, since the shadow TLB does not contain context information, we
 want the shadow TLB flushed in situations where we are switching context.
 In those situations, we explicitly clear the DSTI bit before performing
 isync, and set it again afterward.  We also need to do the same when we
 perform isync after explicitly flushing the TLB.
 
 The setting of the DSTI bit is dependent on
 CONFIG_PPC_47x_DISABLE_SHADOW_TLB_INVALIDATE.  When we are confident that
 the feature works as expected, the option can probably be removed.
 
 You're defaulting it to 'y' in the Kconfig.  Technically someone could
 turn it off I guess, but practice mostly shows that nobody mucks with
 the defaults.  Do you want it to default 'n' for now if you aren't
 confident in it just quite yet?

I think I made it a config option at Ben's request when I first started
this work last year, before being sidetracked by other priorities.  I
could either remove the option, or default it to 'n'.  It might be best
to just hard-code the behavior to make sure it's exercised, since
there's no 47x hardware in production yet, but we can give Ben a chance
to weigh in with his opinion.

 (Linus also has some kind of gripe with new options being default 'y',
 but I don't recall all the details and I doubt he'd care about something
 in low-level PPC code.)
 
 josh

-- 
Dave Kleikamp
IBM Linux Technology Center



Re: Oops in trace_hardirqs_on (powerpc)

2010-09-27 Thread Jörg Sommer
Hello Steven,

Steven Rostedt hat am Wed 22. Sep, 15:44 (-0400) geschrieben:
 Sorry for the late reply, but I was on vacation when you sent this, and
 I missed it while going through email.
 
 Do you still have this issue?

No. I've rebuilt my kernel without TRACE_IRQFLAGS and the problem
vanished, as expected. The problem is that in some cases the stack is
only two frames deep, which causes the macro CALLER_ADDR1 to make an
invalid access. Someone told me there is a workaround for the problem on
i386, too.

% sed -n 2p arch/x86/lib/thunk_32.S
 * Trampoline to trace irqs off. (otherwise CALLER_ADDR1 might crash)
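The shallow-stack failure can be illustrated with a small userspace model of a frame-pointer walk (the frame layout here is a simplification, not the real powerpc stack format):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified frame record: each frame points back to its caller's frame. */
struct frame {
	struct frame *back;	/* NULL at the bottom of the stack */
	void *return_addr;
};

/*
 * Return the return address `level` frames up, or NULL if the stack is
 * too shallow -- the depth check that a naive CALLER_ADDR1 expansion
 * effectively skips, leading to the invalid access described above.
 */
static void *caller_addr(struct frame *top, int level)
{
	struct frame *f = top;

	while (level-- > 0) {
		if (f == NULL || f->back == NULL)
			return NULL;	/* ran off the end of the stack */
		f = f->back;
	}
	return f ? f->return_addr : NULL;
}
```

With only two frames on the stack, asking for the second-level caller walks past the bottom frame; without the NULL check that is an out-of-bounds dereference.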

Bye, Jörg.
-- 
Angenehme Worte sind nie wahr,
wahre Worte sind nie angenehm.



Re: [PATCH RFCv2 1/2] dmaengine: add support for scatterlist to scatterlist transfers

2010-09-27 Thread Linus Walleij
2010/9/25 Ira W. Snyder i...@ovro.caltech.edu:

 This adds support for scatterlist to scatterlist DMA transfers.

This is a good idea; we have a local function to do this in DMA40
already, stedma40_memcpy_sg().

 This is
 currently hidden behind a configuration option, which will allow drivers
 which need this functionality to select it individually.

Why? Isn't it better to add this as a new capability flag
if you don't want to announce it? Or is the intent to save
memory footprint?

Yours,
Linus Walleij


Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support

2010-09-27 Thread Christoph Lameter
On Thu, 23 Sep 2010, Christian Riesch wrote:

   It implies clock tuning in userspace for a potential sub microsecond
   accurate clock. The clock accuracy will be limited by user space
   latencies and noise. You wont be able to discipline the system clock
   accurately.
 
  Noise matters, latency doesn't.

 Well put! That's why we need hardware support for PTP timestamping to reduce
 the noise, but get along well with the clock servo that is steering the PHC in
 user space.

Even if I buy into the catch phrase above: user space is subject to noise
that the in-kernel code is not. If you do the tuning over long intervals
then it hopefully averages out, but it still causes jitter effects that
affect the degree of accuracy (or sync) that you can reach. And the noise
varies with the load on the system.





Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support

2010-09-27 Thread Christoph Lameter
On Thu, 23 Sep 2010, john stultz wrote:

   3) Further, the PTP hardware counter can be simply set to a new offset
   to put it in line with the network time. This could cause trouble with
   timekeeping much like unsynced TSCs do.
 
  You can do the same for system time.

 Settimeofday does allow CLOCK_REALTIME to jump, but the CLOCK_MONOTONIC
 time cannot jump around. Having a clocksource that is non-monotonic
 would break this.

Currently time runs at the same speed. CLOCK_MONOTONIC runs at an offset
to CLOCK_REALTIME. We are creating APIs here that allow time to run at
different speeds.

 The design actually avoids most userland induced latency.

 1) On the PTP hardware syncing point, the reference packet gets
 timestamped with the PTP hardware time on arrival. This allows the
 offset calculation to be done in userland without introducing latency.

The timestamps allow the calculation of the network transmission time, I
guess, and therefore it's more accurate to calculate that effect out. OK, but
then the overhead of getting to the code in user space (that does the proper
clock adjustments) adds a relatively long delay that is subject to OS
scheduling latencies and noise.

 2) On the system syncing side, the proposal for the PPS interrupt allows
 the PTP hardware to trigger an interrupt on the second boundary that
 would take a timestamp of the system time. Then the pps interface allows
 for the timestamp to be read from userland allowing the offset to be
 calculated without introducing additional latency.

Sorry, I don't really get the whole picture here. It sounds like one is
going through additional unnecessary layers. Why would the PTP hardware
trigger an interrupt? I thought the PTP messages came in via
timestamping and are then processed by software. Then the software
issues a hardware interrupt that triggers the PPS subsystem. And
that is supposed to be better than directly interfacing with the PTP?


 Additionally, even just in userland, it would be easy to bracket two
 reads of the system time around one read of the PTP clock to bound any
 userland latency fairly well. It may not be as good as the PPS interface
 (although that depends on the interrupt latency), but if the accesses
 are all local, it probably could get fairly close.

That sounds hacky.

  Ok maybe we need some sort of control interface to manage the clock like
  the others have.

 That's what the clock_adjtime call provides.

Ummm... You are managing a hardware device with hardware (driver) specific
settings. That is currently being done via ioctls. Why generalize it?

  The posix clocks today assumes one notion of real time in the kernel.
  All clocks increase in lockstep (aside from offset updates).

 Not true. The cputime clockids do not increment at the same rate (as the
 apps don't always run). Further CLOCK_MONOTONIC_RAW provides a non-freq
 corrected view of CLOCK_MONOTONIC, so it increments at a slightly
 different rate.

cputime clockids are not tracking time but cpu resource use.

 Re-using the fairly nice (Alan of course disagrees :) posix interface
 seems at least a little better for application developers who actually
 have to use the hardware.

Well, it may also be confusing for others. Application developers will
also have a hard time using a generic clock interface to control PTP
device-specific things like frequencies, rates, etc. So you always need
an ioctl/device-specific control interface regardless.





Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support

2010-09-27 Thread Christoph Lameter

On Fri, 24 Sep 2010, Alan Cox wrote:

 Whether you add new syscalls or do the fd passing using flags and hide
 the ugly bits in glibc is another question.

Use device specific ioctls instead of syscalls?



Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support

2010-09-27 Thread M. Warner Losh
In message: alpine.deb.2.00.1009271038150.9...@router.home
Christoph Lameter c...@linux.com writes:
: On Thu, 23 Sep 2010, john stultz wrote:
:  The design actually avoids most userland induced latency.
: 
:  1) On the PTP hardware syncing point, the reference packet gets
:  timestamped with the PTP hardware time on arrival. This allows the
:  offset calculation to be done in userland without introducing latency.
: 
: The timestamps allows the calculation of the network transmission time I
: guess and therefore its more accurate to calculate that effect out. Ok but
: then the overhead of getting to code in user space (that does the proper
: clock adjustments) is resulting in the addition of a relatively long time
: that is subject to OS scheduling latencies and noises.

The timestamps at the hardware level allow you to factor out variation
caused by OS scheduling, OS network stack delay and internal buffering
on the NIC.  Variation in measurements is what kills accuracy.

When steering a clock by making an error measurement of the phase and
frequency of it, the latency induced by OS scheduling tends to be
unimportant.  It is far more important to know when you steered the
clock (called adjtime or friends) than to steer it at any fixed
latency to when the data for the measurements was made.  Measuring the
time of steer can tolerate errors in the range of OS scheduling
latencies easily, since that tends to produce a very small effect.  It
introduces an error in your expected phase for the next measurement on
the order of the product of the time of steer error times the change
in fractional frequency (abs( 1 - (nu_new / nu_old))).  Even if the
estimate is really bad at 100ms, most steers are on the order of
one part per million.  This leads to a sub-nanosecond phase error
estimate in the next measurement cycle (a non-accumulating error).  A
1ms error leads to maybe tens of picoseconds of estimate error.

This is a common error that I've seen repeated in this thread.  The
only reason that it has historically been important is because when
you are doing timestamping in software based on an interrupt, that
stuff does matter.

Warner


Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support

2010-09-27 Thread Alan Cox
On Mon, 27 Sep 2010 10:56:09 -0500 (CDT)
Christoph Lameter c...@linux.com wrote:

 
 On Fri, 24 Sep 2010, Alan Cox wrote:
 
  Whether you add new syscalls or do the fd passing using flags and hide
  the ugly bits in glibc is another question.
 
 Use device specific ioctls instead of syscalls?

Some of the ioctls are probably not device specific, the job of the OS in
part is to present a unified interface. We already have a mess of HPET
and RTC driver ioctls.

Some of it undoubtedly is device specific.

Alan


Re: [PATCH RFCv2 1/2] dmaengine: add support for scatterlist to scatterlist transfers

2010-09-27 Thread Ira W. Snyder
On Mon, Sep 27, 2010 at 05:23:34PM +0200, Linus Walleij wrote:
 2010/9/25 Ira W. Snyder i...@ovro.caltech.edu:
 
  This adds support for scatterlist to scatterlist DMA transfers.
 
 This is a good idea, we have a local function to do this in DMA40 already,
 stedma40_memcpy_sg().
 

I think that having two devices that want to implement this
functionality as part of the DMAEngine API is a good argument for making
it available as part of the core API. I think it would be good to add
this to struct dma_device, and add a capability (DMA_SG?) for it as
well.

I have looked at the stedma40_memcpy_sg() function, and I think we would
want to extend it slightly for the generic API. Is there any good reason
to prohibit scatterlists with different numbers of elements?

For example:
src scatterlist: 10 elements, each with 4K length (40K total)
dst scatterlist: 40 elements, each with 1K length (40K total)

The total length of both scatterlists is equal, but the number of
scatterlist entries is different. The Freescale DMA controller can
handle this just fine.

I'm proposing this function signature:
struct dma_async_tx_descriptor *
dma_memcpy_sg(struct dma_chan *chan,
  struct scatterlist *dst_sg, unsigned int dst_nents,
  struct scatterlist *src_sg, unsigned int src_nents,
  unsigned long flags);
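For what it's worth, the core of such a transfer is a walk that advances each scatterlist independently, which is why unequal entry counts are fine as long as the total lengths match. A userspace sketch of that walk (plain structs standing in for struct scatterlist, memcpy standing in for the hardware descriptors):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sg { void *buf; size_t len; };	/* stand-in for struct scatterlist */

/*
 * Copy from src to dst chunk by chunk, advancing each list
 * independently -- the entry counts need not match, only the total
 * lengths.  Returns the number of bytes copied.
 */
static size_t sg_copy(struct sg *dst, unsigned dn, struct sg *src, unsigned sn)
{
	size_t copied = 0, doff = 0, soff = 0;
	unsigned di = 0, si = 0;

	while (di < dn && si < sn) {
		/* largest chunk both current entries can accommodate */
		size_t n = dst[di].len - doff;
		if (src[si].len - soff < n)
			n = src[si].len - soff;

		memcpy((char *)dst[di].buf + doff, (char *)src[si].buf + soff, n);
		copied += n; doff += n; soff += n;

		if (doff == dst[di].len) { di++; doff = 0; }	/* next dst entry */
		if (soff == src[si].len) { si++; soff = 0; }	/* next src entry */
	}
	return copied;
}
```

In the 10x4K-to-40x1K example above, each 4K source entry simply feeds four 1K destination entries.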

  This is
  currently hidden behind a configuration option, which will allow drivers
  which need this functionality to select it individually.
 
 Why? Isn't it better to add this as a new capability flag
 if you don't want to announce it? Or is the intent to save
 memory footprint?
 

Dan wanted this, probably for memory footprint. If >1 driver is using
it, I would rather have it as part of struct dma_device along with a
capability.

Thanks for the feedback,
Ira


Re: [PATCH RFCv2 1/2] dmaengine: add support for scatterlist to scatterlist transfers

2010-09-27 Thread Dan Williams
On Mon, Sep 27, 2010 at 10:23 AM, Ira W. Snyder i...@ovro.caltech.edu wrote:
 On Mon, Sep 27, 2010 at 05:23:34PM +0200, Linus Walleij wrote:
 2010/9/25 Ira W. Snyder i...@ovro.caltech.edu:

  This adds support for scatterlist to scatterlist DMA transfers.

 This is a good idea, we have a local function to do this in DMA40 already,
 stedma40_memcpy_sg().


 I think that having two devices that want to implement this
 functionality as part of the DMAEngine API is a good argument for making
 it available as part of the core API. I think it would be good to add
 this to struct dma_device, and add a capability (DMA_SG?) for it as
 well.

 I have looked at the stedma40_memcpy_sg() function, and I think we would
 want to extend it slightly for the generic API. Is there any good reason
 to prohibit scatterlists with different numbers of elements?

 For example:
 src scatterlist: 10 elements, each with 4K length (40K total)
 dst scatterlist: 40 elements, each with 1K length (40K total)

 The total length of both scatterlists is equal, but the number of
 scatterlist entries is different. The freescale DMA controller can
 handle this just fine.

 I'm proposing this function signature:
 struct dma_async_tx_descriptor *
 dma_memcpy_sg(struct dma_chan *chan,
              struct scatterlist *dst_sg, unsigned int dst_nents,
              struct scatterlist *src_sg, unsigned int src_nents,
              unsigned long flags);

  This is
  currently hidden behind a configuration option, which will allow drivers
  which need this functionality to select it individually.

 Why? Isn't it better to add this as a new capability flag
 if you don't want to announce it? Or is the intent to save
 memory footprint?


 Dan wanted this, probably for memory footprint. If >1 driver is using
 it,

Yes, I did not see a reason to increment the size of dmaengine.o for
everyone if only one out-of-tree user of the function existed.

 I would rather have it as part of struct dma_device along with a
 capability.

I think having this as a dma_device method makes sense now that more
than one driver would implement it, and lets drivers see the entirety
of the transaction in one call.

--
Dan


[PATCH 0/8] v2 De-Couple sysfs memory directories from memory sections

2010-09-27 Thread Nathan Fontenot
This set of patches decouples the assumption that a single memory
section corresponds to a single directory in
/sys/devices/system/memory/.  On systems
with large amounts of memory (1+ TB) there are performance issues
related to creating the large number of sysfs directories.  For
a powerpc machine with 1 TB of memory we are creating 63,000+
directories.  This results in boot times of around 45-50
minutes for systems with 1 TB of memory and 8 hours for systems
with 2 TB of memory.  With this patch set applied I am now seeing
boot times of 5 minutes or less.

The root of this issue is in sysfs directory creation. Every time
a directory is created, a string compare is done against all sibling
directories to ensure we do not create duplicates.  The list of
directory nodes in sysfs is kept as an unsorted list, which makes
directory creation a quadratically longer operation as the number of
directories grows.
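A back-of-the-envelope check of the cost: if creating the Nth directory scans the N-1 existing siblings, the total number of string compares is n(n-1)/2, which for the 63,000+ directories mentioned above is roughly two billion compares:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Total sibling string-compares needed to create n sysfs directories
 * when each creation scans all existing siblings: 0 + 1 + ... + (n-1).
 */
static uint64_t total_compares(uint64_t n)
{
	return n * (n - 1) / 2;
}
```

total_compares(63000) is about 1.98e9, and doubling the directory count roughly quadruples the work, which matches the observed jump from ~45 minutes at 1 TB to ~8 hours at 2 TB.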

The solution taken by this patch set is to allow a single
directory in sysfs to span multiple memory sections.  This is
controlled by an optional architecture-defined function,
memory_block_size_bytes().  The default definition of this
routine returns a memory block size equal to the memory section
size.  This maintains the current layout of the sysfs memory
directories, so the view from userspace remains the same as it
is today.

For architectures that define their own version of this routine,
as is done for powerpc in this patchset, the view in userspace
would change such that each memoryXXX directory would span
multiple memory sections.  The number of sections spanned would
depend on the value reported by memory_block_size_bytes.
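As a concrete (hypothetical) example, with 16 MB sections and a memory_block_size_bytes() of 256 MB, each memoryXXX directory would span 16 sections, cutting the directory count accordingly; these particular sizes are illustrative, not what the powerpc patch necessarily uses:

```c
#include <assert.h>
#include <stdint.h>

#define SECTION_SIZE	(16ULL << 20)	/* example: 16 MB memory sections   */
#define BLOCK_SIZE	(256ULL << 20)	/* hypothetical block size returned
					 * by memory_block_size_bytes()     */

/* Number of memory sections each sysfs memory directory would span. */
static unsigned sections_spanned(void)
{
	return (unsigned)(BLOCK_SIZE / SECTION_SIZE);
}

/* Directory count for a given amount of memory at a given granularity. */
static uint64_t dirs(uint64_t mem_bytes, uint64_t per_dir_bytes)
{
	return mem_bytes / per_dir_bytes;
}
```

For 1 TB of memory this drops the directory count from 65,536 (one per 16 MB section, the order of the 63,000+ figure above) to 4,096.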

In both cases a new file 'end_phys_index' is created in each
memoryXXX directory.  This file will contain the physical id
of the last memory section covered by the sysfs directory.  For
the default case, the value in 'end_phys_index' will be the same
as in the existing 'phys_index' file.

This version of the patch set includes an update to properly
report block_size_bytes, phys_index, and end_phys_index.  Additionally,
the patch that adds the end_phys_index sysfs file is now patch 5/8
instead of patch 2/8 as in the previous version of the patches.

-Nathan Fontenot


[PATCH 1/8] v2 Move find_memory_block() routine

2010-09-27 Thread Nathan Fontenot
Move the find_memory_block() routine up to avoid needing a forward
declaration in subsequent patches.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

---
 drivers/base/memory.c |   62 +-
 1 file changed, 31 insertions(+), 31 deletions(-)

Index: linux-next/drivers/base/memory.c
===
--- linux-next.orig/drivers/base/memory.c   2010-09-21 11:59:24.0 -0500
+++ linux-next/drivers/base/memory.c2010-09-21 12:32:45.0 -0500
@@ -435,6 +435,37 @@ int __weak arch_get_memory_phys_device(u
return 0;
 }
 
+/*
+ * For now, we have a linear search to go find the appropriate
+ * memory_block corresponding to a particular phys_index. If
+ * this gets to be a real problem, we can always use a radix
+ * tree or something here.
+ *
+ * This could be made generic for all sysdev classes.
+ */
+struct memory_block *find_memory_block(struct mem_section *section)
+{
+   struct kobject *kobj;
+   struct sys_device *sysdev;
+   struct memory_block *mem;
+   char name[sizeof(MEMORY_CLASS_NAME) + 9 + 1];
+
+   /*
+* This only works because we know that section == sysdev->id
+* slightly redundant with sysdev_register()
+*/
+   sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, __section_nr(section));
+
+   kobj = kset_find_obj(&memory_sysdev_class.kset, name);
+   if (!kobj)
+   return NULL;
+
+   sysdev = container_of(kobj, struct sys_device, kobj);
+   mem = container_of(sysdev, struct memory_block, sysdev);
+
+   return mem;
+}
+
 static int add_memory_block(int nid, struct mem_section *section,
unsigned long state, enum mem_add_context context)
 {
@@ -468,37 +499,6 @@ static int add_memory_block(int nid, str
return ret;
 }
 
-/*
- * For now, we have a linear search to go find the appropriate
- * memory_block corresponding to a particular phys_index. If
- * this gets to be a real problem, we can always use a radix
- * tree or something here.
- *
- * This could be made generic for all sysdev classes.
- */
-struct memory_block *find_memory_block(struct mem_section *section)
-{
-   struct kobject *kobj;
-   struct sys_device *sysdev;
-   struct memory_block *mem;
-   char name[sizeof(MEMORY_CLASS_NAME) + 9 + 1];
-
-   /*
-* This only works because we know that section == sysdev->id
-* slightly redundant with sysdev_register()
-*/
-   sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, __section_nr(section));
-
-   kobj = kset_find_obj(&memory_sysdev_class.kset, name);
-   if (!kobj)
-   return NULL;
-
-   sysdev = container_of(kobj, struct sys_device, kobj);
-   mem = container_of(sysdev, struct memory_block, sysdev);
-
-   return mem;
-}
-
 int remove_memory_block(unsigned long node_id, struct mem_section *section,
int phys_device)
 {



[PATCH 2/8] v2 Add section count to memory_block struct

2010-09-27 Thread Nathan Fontenot
Add a section count property to the memory_block struct to track the number
of memory sections that have been added/removed from a memory block. This
allows us to know when the last memory section of a memory block has been
removed so we can remove the memory block.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

---
 drivers/base/memory.c  |   16 ++--
 include/linux/memory.h |3 +++
 2 files changed, 13 insertions(+), 6 deletions(-)

Index: linux-next/drivers/base/memory.c
===
--- linux-next.orig/drivers/base/memory.c   2010-09-27 09:17:20.0 -0500
+++ linux-next/drivers/base/memory.c2010-09-27 09:31:35.0 -0500
@@ -478,6 +478,7 @@
 
mem->phys_index = __section_nr(section);
mem->state = state;
+   atomic_inc(&mem->section_count);
mutex_init(&mem->state_mutex);
start_pfn = section_nr_to_pfn(mem->phys_index);
mem->phys_device = arch_get_memory_phys_device(start_pfn);
@@ -505,12 +506,15 @@
struct memory_block *mem;
 
mem = find_memory_block(section);
-   unregister_mem_sect_under_nodes(mem);
-   mem_remove_simple_file(mem, phys_index);
-   mem_remove_simple_file(mem, state);
-   mem_remove_simple_file(mem, phys_device);
-   mem_remove_simple_file(mem, removable);
-   unregister_memory(mem, section);
+
+   if (atomic_dec_and_test(&mem->section_count)) {
+   unregister_mem_sect_under_nodes(mem);
+   mem_remove_simple_file(mem, phys_index);
+   mem_remove_simple_file(mem, state);
+   mem_remove_simple_file(mem, phys_device);
+   mem_remove_simple_file(mem, removable);
+   unregister_memory(mem, section);
+   }
 
return 0;
 }
Index: linux-next/include/linux/memory.h
===
--- linux-next.orig/include/linux/memory.h  2010-09-27 09:17:20.0 -0500
+++ linux-next/include/linux/memory.h   2010-09-27 09:22:56.0 -0500
@@ -19,10 +19,13 @@
#include <linux/node.h>
#include <linux/compiler.h>
#include <linux/mutex.h>
+#include <asm/atomic.h>
 
 struct memory_block {
unsigned long phys_index;
unsigned long state;
+   atomic_t section_count;
+
/*
 * This serializes all state change requests.  It isn't
 * held during creation because the control files are




[PATCH 3/8] v2 Add mutex for adding/removing memory blocks

2010-09-27 Thread Nathan Fontenot
Add a new mutex for use in adding and removing of memory blocks.  This
is needed to avoid any race conditions in which the same memory block could
be added and removed at the same time.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

---
 drivers/base/memory.c |7 +++
 1 file changed, 7 insertions(+)

Index: linux-next/drivers/base/memory.c
===
--- linux-next.orig/drivers/base/memory.c   2010-09-27 09:31:35.0 -0500
+++ linux-next/drivers/base/memory.c2010-09-27 09:31:57.0 -0500
@@ -27,6 +27,8 @@
 #include <asm/atomic.h>
 #include <asm/uaccess.h>
 
+static DEFINE_MUTEX(mem_sysfs_mutex);
+
 #define MEMORY_CLASS_NAME  "memory"
 
 static struct sysdev_class memory_sysdev_class = {
@@ -476,6 +478,8 @@
if (!mem)
return -ENOMEM;
 
+   mutex_lock(&mem_sysfs_mutex);
+
mem->phys_index = __section_nr(section);
mem->state = state;
atomic_inc(&mem->section_count);
@@ -497,6 +501,7 @@
ret = register_mem_sect_under_node(mem, nid);
}
 
+   mutex_unlock(&mem_sysfs_mutex);
return ret;
 }
 
@@ -505,6 +510,7 @@
 {
struct memory_block *mem;
 
+   mutex_lock(&mem_sysfs_mutex);
mem = find_memory_block(section);
 
if (atomic_dec_and_test(&mem->section_count)) {
@@ -516,6 +522,7 @@
unregister_memory(mem, section);
}
 
+   mutex_unlock(&mem_sysfs_mutex);
return 0;
 }




[PATCH 4/8] v2 Allow memory block to span multiple memory sections

2010-09-27 Thread Nathan Fontenot
Update the memory sysfs code such that each sysfs memory directory is now
considered a memory block that can span multiple memory sections per
memory block.  The default size of each memory block is SECTION_SIZE_BITS
to maintain the current behavior of having a single memory section per
memory block (i.e. one sysfs directory per memory section).

For architectures that want to have memory blocks span multiple
memory sections they need only define their own memory_block_size_bytes()
routine.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

---
 drivers/base/memory.c |  155 ++
 1 file changed, 108 insertions(+), 47 deletions(-)

Index: linux-next/drivers/base/memory.c
===
--- linux-next.orig/drivers/base/memory.c   2010-09-27 09:31:57.0 -0500
+++ linux-next/drivers/base/memory.c2010-09-27 13:50:18.0 -0500
@@ -30,6 +30,14 @@
 static DEFINE_MUTEX(mem_sysfs_mutex);
 
 #define MEMORY_CLASS_NAME  memory
+#define MIN_MEMORY_BLOCK_SIZE  (1 << SECTION_SIZE_BITS)
+
+static int sections_per_block;
+
+static inline int base_memory_block_id(int section_nr)
+{
+   return section_nr / sections_per_block;
+}
 
 static struct sysdev_class memory_sysdev_class = {
.name = MEMORY_CLASS_NAME,
@@ -84,28 +92,47 @@
  * register_memory - Setup a sysfs device for a memory block
  */
 static
-int register_memory(struct memory_block *memory, struct mem_section *section)
+int register_memory(struct memory_block *memory)
 {
int error;
 
memory->sysdev.cls = &memory_sysdev_class;
-   memory->sysdev.id = __section_nr(section);
+   memory->sysdev.id = memory->phys_index / sections_per_block;
 
error = sysdev_register(memory-sysdev);
return error;
 }
 
 static void
-unregister_memory(struct memory_block *memory, struct mem_section *section)
+unregister_memory(struct memory_block *memory)
 {
BUG_ON(memory->sysdev.cls != &memory_sysdev_class);
-   BUG_ON(memory->sysdev.id != __section_nr(section));

/* drop the ref. we got in remove_memory_block() */
kobject_put(&memory->sysdev.kobj);
sysdev_unregister(&memory->sysdev);
 }
 
+u32 __weak memory_block_size_bytes(void)
+{
+   return MIN_MEMORY_BLOCK_SIZE;
+}
+
+static u32 get_memory_block_size(void)
+{
+   u32 block_sz;
+
+   block_sz = memory_block_size_bytes();
+
+   /* Validate blk_sz is a power of 2 and not less than section size */
+   if ((block_sz & (block_sz - 1)) || (block_sz < MIN_MEMORY_BLOCK_SIZE)) {
+   WARN_ON(1);
+   block_sz = MIN_MEMORY_BLOCK_SIZE;
+   }
+
+   return block_sz;
+}
+
 /*
  * use this as the physical section index that this memsection
  * uses.
@@ -116,7 +143,7 @@
 {
struct memory_block *mem =
container_of(dev, struct memory_block, sysdev);
-   return sprintf(buf, "%08lx\n", mem->phys_index);
+   return sprintf(buf, "%08lx\n", mem->phys_index / sections_per_block);
 }
 
 /*
@@ -125,13 +152,16 @@
 static ssize_t show_mem_removable(struct sys_device *dev,
struct sysdev_attribute *attr, char *buf)
 {
-   unsigned long start_pfn;
-   int ret;
+   unsigned long i, pfn;
+   int ret = 1;
struct memory_block *mem =
container_of(dev, struct memory_block, sysdev);
 
-   start_pfn = section_nr_to_pfn(mem->phys_index);
-   ret = is_mem_section_removable(start_pfn, PAGES_PER_SECTION);
+   for (i = 0; i < sections_per_block; i++) {
+   pfn = section_nr_to_pfn(mem->phys_index + i);
+   ret &= is_mem_section_removable(pfn, PAGES_PER_SECTION);
+   }
+
return sprintf(buf, "%d\n", ret);
 }
 
@@ -184,17 +214,14 @@
  * OK to have direct references to sparsemem variables in here.
  */
 static int
-memory_block_action(struct memory_block *mem, unsigned long action)
+memory_section_action(unsigned long phys_index, unsigned long action)
 {
int i;
-   unsigned long psection;
unsigned long start_pfn, start_paddr;
struct page *first_page;
int ret;
-   int old_state = mem->state;

-   psection = mem->phys_index;
-   first_page = pfn_to_page(psection << PFN_SECTION_SHIFT);
+   first_page = pfn_to_page(phys_index << PFN_SECTION_SHIFT);
 
/*
 * The probe routines leave the pages reserved, just
@@ -207,8 +234,8 @@
continue;
 
printk(KERN_WARNING "section number %ld page number %d "
-   "not reserved, was it already online? \n",
-   psection, i);
+   "not reserved, was it already online?\n",
+   phys_index, i);
return -EBUSY;
}
}
@@ -219,18 +246,13 @@
ret = online_pages(start_pfn, PAGES_PER_SECTION);
   

[PATCH 5/8] v2 Add end_phys_index file

2010-09-27 Thread Nathan Fontenot
Update the 'phys_index' properties of a memory block to include a
'start_phys_index' which is the same as the current 'phys_index' property.
The property still appears as 'phys_index' in sysfs but the memory_block
struct name is updated to indicate the start and end values.
This also adds an 'end_phys_index' property to indicate the id of the
last section in the memory block.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

---
 drivers/base/memory.c  |   39 ++-
 include/linux/memory.h |3 ++-
 2 files changed, 32 insertions(+), 10 deletions(-)

Index: linux-next/drivers/base/memory.c
===
--- linux-next.orig/drivers/base/memory.c   2010-09-27 13:50:18.0 -0500
+++ linux-next/drivers/base/memory.c2010-09-27 13:50:38.0 -0500
@@ -97,7 +97,7 @@
int error;
 
memory-sysdev.cls = memory_sysdev_class;
-   memory->sysdev.id = memory->phys_index / sections_per_block;
+   memory->sysdev.id = memory->start_phys_index / sections_per_block;
 
error = sysdev_register(memory-sysdev);
return error;
@@ -138,12 +138,26 @@
  * uses.
  */
 
-static ssize_t show_mem_phys_index(struct sys_device *dev,
+static ssize_t show_mem_start_phys_index(struct sys_device *dev,
struct sysdev_attribute *attr, char *buf)
 {
struct memory_block *mem =
container_of(dev, struct memory_block, sysdev);
-   return sprintf(buf, "%08lx\n", mem->phys_index / sections_per_block);
+   unsigned long phys_index;
+
+   phys_index = mem->start_phys_index / sections_per_block;
+   return sprintf(buf, "%08lx\n", phys_index);
+}
+
+static ssize_t show_mem_end_phys_index(struct sys_device *dev,
+   struct sysdev_attribute *attr, char *buf)
+{
+   struct memory_block *mem =
+   container_of(dev, struct memory_block, sysdev);
+   unsigned long phys_index;
+
+   phys_index = mem->end_phys_index / sections_per_block;
+   return sprintf(buf, "%08lx\n", phys_index);
 }
 
 /*
@@ -158,7 +172,7 @@
container_of(dev, struct memory_block, sysdev);
 
for (i = 0; i < sections_per_block; i++) {
-   pfn = section_nr_to_pfn(mem->phys_index + i);
+   pfn = section_nr_to_pfn(mem->start_phys_index + i);
ret = is_mem_section_removable(pfn, PAGES_PER_SECTION);
}
 
@@ -275,14 +289,15 @@
mem->state = MEM_GOING_OFFLINE;
 
for (i = 0; i < sections_per_block; i++) {
-   ret = memory_section_action(mem->phys_index + i, to_state);
+   ret = memory_section_action(mem->start_phys_index + i,
+   to_state);
if (ret)
break;
}
 
if (ret) {
for (i = 0; i < sections_per_block; i++)
-   memory_section_action(mem->phys_index + i,
+   memory_section_action(mem->start_phys_index + i,
  from_state_req);
 
mem->state = from_state_req;
@@ -330,7 +345,8 @@
return sprintf(buf, "%d\n", mem->phys_device);
 }
 
-static SYSDEV_ATTR(phys_index, 0444, show_mem_phys_index, NULL);
+static SYSDEV_ATTR(phys_index, 0444, show_mem_start_phys_index, NULL);
+static SYSDEV_ATTR(end_phys_index, 0444, show_mem_end_phys_index, NULL);
 static SYSDEV_ATTR(state, 0644, show_mem_state, store_mem_state);
 static SYSDEV_ATTR(phys_device, 0444, show_phys_device, NULL);
 static SYSDEV_ATTR(removable, 0444, show_mem_removable, NULL);
@@ -514,17 +530,21 @@
return -ENOMEM;
 
scn_nr = __section_nr(section);
-   mem->phys_index = base_memory_block_id(scn_nr) * sections_per_block;
+   mem->start_phys_index =
+   base_memory_block_id(scn_nr) * sections_per_block;
+   mem->end_phys_index = mem->start_phys_index + sections_per_block - 1;
mem->state = state;
atomic_inc(&mem->section_count);
mutex_init(&mem->state_mutex);
-   start_pfn = section_nr_to_pfn(mem->phys_index);
+   start_pfn = section_nr_to_pfn(mem->start_phys_index);
mem->phys_device = arch_get_memory_phys_device(start_pfn);
 
ret = register_memory(mem);
if (!ret)
ret = mem_create_simple_file(mem, phys_index);
if (!ret)
+   ret = mem_create_simple_file(mem, end_phys_index);
+   if (!ret)
ret = mem_create_simple_file(mem, state);
if (!ret)
ret = mem_create_simple_file(mem, phys_device);
@@ -571,6 +591,7 @@
if (atomic_dec_and_test(&mem->section_count)) {
unregister_mem_sect_under_nodes(mem);
mem_remove_simple_file(mem, phys_index);
+   mem_remove_simple_file(mem, end_phys_index);
mem_remove_simple_file(mem, state);

[PATCH 6/8] v2 Update node sysfs code

2010-09-27 Thread Nathan Fontenot
Update the node sysfs code to be aware of the new capability for a memory
block to contain multiple memory sections.  This requires an additional
parameter to unregister_mem_sect_under_nodes so that we know which memory
section of the memory block to unregister.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

---
 drivers/base/memory.c |2 +-
 drivers/base/node.c   |   12 
 include/linux/node.h  |6 --
 3 files changed, 13 insertions(+), 7 deletions(-)

Index: linux-next/drivers/base/node.c
===
--- linux-next.orig/drivers/base/node.c 2010-09-27 13:49:36.0 -0500
+++ linux-next/drivers/base/node.c  2010-09-27 13:50:43.0 -0500
@@ -346,8 +346,10 @@
return -EFAULT;
if (!node_online(nid))
return 0;
-   sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index);
-   sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
+
+   sect_start_pfn = section_nr_to_pfn(mem_blk->start_phys_index);
+   sect_end_pfn = section_nr_to_pfn(mem_blk->end_phys_index);
+   sect_end_pfn += PAGES_PER_SECTION - 1;
for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
int page_nid;
 
@@ -371,7 +373,8 @@
 }
 
 /* unregister memory section under all nodes that it spans */
-int unregister_mem_sect_under_nodes(struct memory_block *mem_blk)
+int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+   unsigned long phys_index)
 {
NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
unsigned long pfn, sect_start_pfn, sect_end_pfn;
@@ -383,7 +386,8 @@
if (!unlinked_nodes)
return -ENOMEM;
nodes_clear(*unlinked_nodes);
-   sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index);
+
+   sect_start_pfn = section_nr_to_pfn(phys_index);
sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
int nid;
Index: linux-next/drivers/base/memory.c
===
--- linux-next.orig/drivers/base/memory.c   2010-09-27 13:50:38.0 
-0500
+++ linux-next/drivers/base/memory.c2010-09-27 13:50:43.0 -0500
@@ -587,9 +587,9 @@
 
mutex_lock(&mem_sysfs_mutex);
mem = find_memory_block(section);
+   unregister_mem_sect_under_nodes(mem, __section_nr(section));
 
if (atomic_dec_and_test(&mem->section_count)) {
-   unregister_mem_sect_under_nodes(mem);
mem_remove_simple_file(mem, phys_index);
mem_remove_simple_file(mem, end_phys_index);
mem_remove_simple_file(mem, state);
Index: linux-next/include/linux/node.h
===
--- linux-next.orig/include/linux/node.h2010-09-27 13:49:36.0 
-0500
+++ linux-next/include/linux/node.h 2010-09-27 13:50:43.0 -0500
@@ -44,7 +44,8 @@
 extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int register_mem_sect_under_node(struct memory_block *mem_blk,
int nid);
-extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk);
+extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+  unsigned long phys_index);
 
 #ifdef CONFIG_HUGETLBFS
 extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
@@ -72,7 +73,8 @@
 {
return 0;
 }
-static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk)
+static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+ unsigned long phys_index)
 {
return 0;
 }


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH 7/8] v2 Define memory_block_size_bytes() for powerpc/pseries

2010-09-27 Thread Nathan Fontenot
Define a version of memory_block_size_bytes() for powerpc/pseries such that
a memory block spans an entire lmb.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

---
 arch/powerpc/platforms/pseries/hotplug-memory.c |   66 +++-
 1 file changed, 53 insertions(+), 13 deletions(-)

Index: linux-next/arch/powerpc/platforms/pseries/hotplug-memory.c
===
--- linux-next.orig/arch/powerpc/platforms/pseries/hotplug-memory.c 
2010-09-27 13:49:34.0 -0500
+++ linux-next/arch/powerpc/platforms/pseries/hotplug-memory.c  2010-09-27 
13:50:45.0 -0500
@@ -17,6 +17,54 @@
 #include <asm/pSeries_reconfig.h>
 #include <asm/sparsemem.h>
 
+static u32 get_memblock_size(void)
+{
+   struct device_node *np;
+   unsigned int memblock_size = 0;
+
+   np = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
+   if (np) {
+   const unsigned long *size;
+
+   size = of_get_property(np, "ibm,lmb-size", NULL);
+   memblock_size = size ? *size : 0;
+
+   of_node_put(np);
+   } else {
+   unsigned int memzero_size = 0;
+   const unsigned int *regs;
+
+   np = of_find_node_by_path("/mem...@0");
+   if (np) {
+   regs = of_get_property(np, "reg", NULL);
+   memzero_size = regs ? regs[3] : 0;
+   of_node_put(np);
+   }
+
+   if (memzero_size) {
+   /* We now know the size of mem...@0, use this to find
+* the first memoryblock and get its size.
+*/
+   char buf[64];
+
+   sprintf(buf, "/mem...@%x", memzero_size);
+   np = of_find_node_by_path(buf);
+   if (np) {
+   regs = of_get_property(np, "reg", NULL);
+   memblock_size = regs ? regs[3] : 0;
+   of_node_put(np);
+   }
+   }
+   }
+
+   return memblock_size;
+}
+
+u32 memory_block_size_bytes(void)
+{
+   return get_memblock_size();
+}
+
 static int pseries_remove_memblock(unsigned long base, unsigned int 
memblock_size)
 {
unsigned long start, start_pfn;
@@ -127,30 +175,22 @@
 
 static int pseries_drconf_memory(unsigned long *base, unsigned int action)
 {
-   struct device_node *np;
-   const unsigned long *lmb_size;
+   unsigned long memblock_size;
int rc;
 
-   np = of_find_node_by_path(/ibm,dynamic-reconfiguration-memory);
-   if (!np)
+   memblock_size = get_memblock_size();
+   if (!memblock_size)
return -EINVAL;
 
-   lmb_size = of_get_property(np, "ibm,lmb-size", NULL);
-   if (!lmb_size) {
-   of_node_put(np);
-   return -EINVAL;
-   }
-
if (action == PSERIES_DRCONF_MEM_ADD) {
-   rc = memblock_add(*base, *lmb_size);
+   rc = memblock_add(*base, memblock_size);
rc = (rc < 0) ? -EINVAL : 0;
} else if (action == PSERIES_DRCONF_MEM_REMOVE) {
-   rc = pseries_remove_memblock(*base, *lmb_size);
+   rc = pseries_remove_memblock(*base, memblock_size);
} else {
rc = -EINVAL;
}
 
-   of_node_put(np);
return rc;
 }



[PATCH 8/8] v2 Update memory hotplug documentation

2010-09-27 Thread Nathan Fontenot
Update the memory hotplug documentation to reflect the new behaviors of
memory blocks reflected in sysfs.

Signed-off-by: Nathan Fontenot nf...@austin.ibm.com

---
 Documentation/memory-hotplug.txt |   46 +--
 1 file changed, 30 insertions(+), 16 deletions(-)

Index: linux-next/Documentation/memory-hotplug.txt
===
--- linux-next.orig/Documentation/memory-hotplug.txt2010-09-27 
13:49:33.0 -0500
+++ linux-next/Documentation/memory-hotplug.txt 2010-09-27 13:50:48.0 
-0500
@@ -126,36 +126,50 @@
 
 4 sysfs files for memory hotplug
 
-All sections have their device information under /sys/devices/system/memory as
+All sections have their device information in sysfs.  Each section is part of
+a memory block under /sys/devices/system/memory as
 
 /sys/devices/system/memory/memoryXXX
-(XXX is section id.)
+(XXX is the section id.)
 
-Now, XXX is defined as start_address_of_section / section_size.
+Now, XXX is defined as (start_address_of_section / section_size) of the first
+section contained in the memory block.  The files 'phys_index' and
+'end_phys_index' under each directory report the beginning and end section id's
+for the memory block covered by the sysfs directory.  It is expected that all
+memory sections in this range are present and no memory holes exist in the
+range. Currently there is no way to determine if there is a memory hole, but
+the existence of one should not affect the hotplug capabilities of the memory
+block.
 
For example, assume 1GiB section size. A device for a memory starting at
0x100000000 is /sys/device/system/memory/memory4
(0x100000000 / 1Gib = 4)
This device covers address range [0x100000000 ... 0x140000000)
 
-Under each section, you can see 4 files.
+Under each section, you can see 5 files.
 
-/sys/devices/system/memory/memoryXXX/phys_index
+/sys/devices/system/memory/memoryXXX/start_phys_index
+/sys/devices/system/memory/memoryXXX/end_phys_index
 /sys/devices/system/memory/memoryXXX/phys_device
 /sys/devices/system/memory/memoryXXX/state
 /sys/devices/system/memory/memoryXXX/removable
 
-'phys_index' : read-only and contains section id, same as XXX.
-'state'  : read-write
-   at read:  contains online/offline state of memory.
-   at write: user can specify online, offline command
-'phys_device': read-only: designed to show the name of physical memory device.
-   This is not well implemented now.
-'removable'  : read-only: contains an integer value indicating
-   whether the memory section is removable or not
-   removable.  A value of 1 indicates that the memory
-   section is removable and a value of 0 indicates that
-   it is not removable.
+'phys_index'  : read-only and contains section id of the first section
+   in the memory block, same as XXX.
+'end_phys_index'  : read-only and contains section id of the last section
+   in the memory block.
+'state'   : read-write
+at read:  contains online/offline state of memory.
+at write: user can specify online, offline command
+which will be performed on all sections in the block.
+'phys_device' : read-only: designed to show the name of physical memory
+device.  This is not well implemented now.
+'removable'   : read-only: contains an integer value indicating
+whether the memory block is removable or not
+removable.  A value of 1 indicates that the memory
+block is removable and a value of 0 indicates that
+it is not removable. A memory block is removable only if
+every section in the block is removable.
 
 NOTE:
   These directories/files appear after physical memory hotplug phase.




Re: [PATCH 1/1] Add config option for batched hcalls

2010-09-27 Thread Will Schmidt
On Sat, 2010-09-25 at 22:49 -0500, Olof Johansson wrote:
 On Fri, Sep 24, 2010 at 04:44:15PM -0500, Will Schmidt wrote:
  
  Add a config option for the (batched) MULTITCE and BULK_REMOVE h-calls.
  
  By default, these options are on and are beneficial for performance and
  throughput reasons.   If disabled, the code will fall back to using less
  optimal TCE and REMOVE hcalls.   The ability to easily disable these
  options is useful for some of the PREEMPT_RT related investigation and
  work occurring on Power.
 
 Hi,
 
 I can see why it's useful to enable and disable, but these are all
 runtime-checked, wouldn't it be more useful to add a bootarg to handle
 it instead of adding some new config options that pretty much everyone
 will always go with the defaults on?
 
 The bits are set early, but from looking at where they're used, there
 doesn't seem to be any harm in disabling them later on when a bootarg
 is convenient to parse and deal with?
 
 It has the benefit of easier on/off testing, if that has any value for
 production debug down the road.

Hi Olof, 
  Thats a good idea, let me poke at this a bit more, see if I can get
bootargs for this.  

Thanks, 
-Will

 
 
 -Olof
 




Re: [PATCH 1/2] 476: Set CCR2[DSTI] to prevent isync from flushing shadow TLB

2010-09-27 Thread Benjamin Herrenschmidt
On Mon, 2010-09-27 at 10:26 -0500, Dave Kleikamp wrote:
 I think I made it a config option at Ben's request when I first started
 this work last year, before being sidetracked by other priorities.  I
 could either remove the option, or default it to 'n'.  It might be best
 to just hard-code the behavior to make sure it's exercised, since
 there's no 47x hardware in production yet, but we can give Ben a chance
 to weigh in with his opinion.

You can remove the option I suppose. It was useful to have it during
early bringup but probably not anymore.

Cheers,
Ben.




Re: [PATCH 1/2] 476: Set CCR2[DSTI] to prevent isync from flushing shadow TLB

2010-09-27 Thread Dave Kleikamp
On Tue, 2010-09-28 at 07:10 +1000, Benjamin Herrenschmidt wrote:
 On Mon, 2010-09-27 at 10:26 -0500, Dave Kleikamp wrote:
  I think I made it a config option at Ben's request when I first started
  this work last year, before being sidetracked by other priorities.  I
  could either remove the option, or default it to 'n'.  It might be best
  to just hard-code the behavior to make sure it's exercised, since
  there's no 47x hardware in production yet, but we can give Ben a chance
  to weigh in with his opinion.
 
 You can remove the option I suppose. It was useful to have it during
 early bringup but probably not anymore.

Thanks, Ben.  I'll resend it without the config option.

Shaggy
-- 
Dave Kleikamp
IBM Linux Technology Center



[PATCH 1/2] v2 476: Set CCR2[DSTI] to prevent isync from flushing shadow TLB

2010-09-27 Thread Dave Kleikamp
When the DSTI (Disable Shadow TLB Invalidate) bit is set in the CCR2
register, the isync command does not flush the shadow TLB (iTLB & dTLB).

However, since the shadow TLB does not contain context information, we
want the shadow TLB flushed in situations where we are switching context.
In those situations, we explicitly clear the DSTI bit before performing
isync, and set it again afterward.  We also need to do the same when we
perform isync after explicitly flushing the TLB.

Signed-off-by: Dave Kleikamp sha...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/reg_booke.h  |4 
 arch/powerpc/kernel/head_44x.S|   25 +
 arch/powerpc/mm/tlb_nohash_low.S  |   14 +-
 arch/powerpc/platforms/44x/misc_44x.S |   26 ++
 4 files changed, 68 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/reg_booke.h 
b/arch/powerpc/include/asm/reg_booke.h
index 667a498..a7ecbfe 100644
--- a/arch/powerpc/include/asm/reg_booke.h
+++ b/arch/powerpc/include/asm/reg_booke.h
@@ -120,6 +120,7 @@
 #define SPRN_TLB3CFG   0x2B3   /* TLB 3 Config Register */
 #define SPRN_EPR   0x2BE   /* External Proxy Register */
 #define SPRN_CCR1  0x378   /* Core Configuration Register 1 */
+#define SPRN_CCR2_476  0x379   /* Core Configuration Register 2 (476)*/
 #define SPRN_ZPR   0x3B0   /* Zone Protection Register (40x) */
 #define SPRN_MAS7  0x3B0   /* MMU Assist Register 7 */
 #define SPRN_MMUCR 0x3B2   /* MMU Control Register */
@@ -188,6 +189,9 @@
 #define	CCR1_DPC	0x0100 /* Disable L1 I-Cache/D-Cache parity checking */
 #define	CCR1_TCS	0x0080 /* Timer Clock Select */
 
+/* Bit definitions for CCR2. */
+#define CCR2_476_DSTI  0x0800 /* Disable Shadow TLB Invalidate */
+
 /* Bit definitions for the MCSR. */
 #define MCSR_MCS   0x8000 /* Machine Check Summary */
 #define MCSR_IB0x4000 /* Instruction PLB Error */
diff --git a/arch/powerpc/kernel/head_44x.S b/arch/powerpc/kernel/head_44x.S
index 562305b..cd34afb 100644
--- a/arch/powerpc/kernel/head_44x.S
+++ b/arch/powerpc/kernel/head_44x.S
@@ -38,6 +38,7 @@
 #include <asm/ppc_asm.h>
 #include <asm/asm-offsets.h>
 #include <asm/synch.h>
+#include <asm/bug.h>
 #include "head_booke.h"
 
 
@@ -703,8 +704,23 @@ _GLOBAL(set_context)
stw r4, 0x4(r5)
 #endif
mtspr   SPRN_PID,r3
+BEGIN_MMU_FTR_SECTION
+   b   1f
+END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_47x)
isync   /* Force context change */
blr
+1:
+#ifdef CONFIG_PPC_47x
+   mfspr   r10,SPRN_CCR2_476
+   rlwinm  r11,r10,0,~CCR2_476_DSTI
+   mtspr   SPRN_CCR2_476,r11
+   isync   /* Force context change */
+   mtspr   SPRN_CCR2_476,r10
+#else /* CONFIG_PPC_47x */
+2: trap
+   EMIT_BUG_ENTRY 2b,__FILE__,__LINE__,0;
+#endif /* CONFIG_PPC_47x */
+   blr
 
 /*
  * Init CPU state. This is called at boot time or for secondary CPUs
@@ -861,6 +877,15 @@ skpinv:	addi	r4,r4,1		/* Increment */
isync
 #endif /* CONFIG_PPC_EARLY_DEBUG_44x */
 
+BEGIN_MMU_FTR_SECTION
+   mfspr   r3,SPRN_CCR2_476
+   /* With CCR2(DSTI) set, isync does not invalidate the shadow TLB */
+   oris	r3,r3,ccr2_476_d...@h
+   rlwinm  r3,r3,0,~CCR2_476_DSTI
+   mtspr   SPRN_CCR2_476,r3
+   isync
+END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_47x)
+
/* Establish the interrupt vector offsets */
SET_IVOR(0,  CriticalInput);
SET_IVOR(1,  MachineCheck);
diff --git a/arch/powerpc/mm/tlb_nohash_low.S b/arch/powerpc/mm/tlb_nohash_low.S
index b9d9fed..f28fb52 100644
--- a/arch/powerpc/mm/tlb_nohash_low.S
+++ b/arch/powerpc/mm/tlb_nohash_low.S
@@ -112,7 +112,11 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_47x)
clrrwi  r4,r3,12/* get an EPN for the hashing with V = 0 */
ori r4,r4,PPC47x_TLBE_SIZE
tlbwe   r4,r7,0 /* write it */
+   mfspr   r8,SPRN_CCR2_476
+   rlwinm  r9,r8,0,~CCR2_476_DSTI
+   mtspr   SPRN_CCR2_476,r9
isync
+   mtspr   SPRN_CCR2_476,r8
wrtee   r10
blr
 #else /* CONFIG_PPC_47x */
@@ -180,7 +184,11 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_47x)
lwz r8,0(r10)   /* Load boltmap entry */
addir10,r10,4   /* Next word */
b   1b  /* Then loop */
-1: isync   /* Sync shadows */
+1: mfspr   r9,SPRN_CCR2_476
+   rlwinm  r10,r9,0,~CCR2_476_DSTI
+   mtspr   SPRN_CCR2_476,r10
+   isync   /* Sync shadows */
+   mtspr   SPRN_CCR2_476,r9
wrtee   r11
 #else /* CONFIG_PPC_47x */
 1: trap
@@ -203,7 +211,11 @@ _GLOBAL(_tlbivax_bcast)
isync
 /* tlbivax 0,r3 - use .long to avoid binutils deps */
.long 0x7c000624 | (r3  11)
+   mfspr   r8,SPRN_CCR2_476
+   rlwinm  r9,r8,0,~CCR2_476_DSTI
+   mtspr   SPRN_CCR2_476,r9

[PATCH RFCv3 0/4] dma: add support for scatterlist to scatterlist copy

2010-09-27 Thread Ira W. Snyder
This series adds support for scatterlist to scatterlist copies to the
generic DMAEngine API. Both the fsldma and ste_dma40 drivers currently
implement a similar API using different, non-generic methods. This series
converts both of them to the new, standardized API.

By doing this as part of the core DMAEngine API, the individual drivers
have control over how to chain their descriptors together. This is
different to the previous implementation, which called
device_prep_dma_memcpy() multiple times.

Neither implementation has been tested on real hardware. I attempted a
conversion of the ste_dma40 driver which should do the right thing, but the
authors should check and make sure.

Ira W. Snyder (4):
  dma: add support for scatterlist to scatterlist copy
  fsldma: implement support for scatterlist to scatterlist copy
  fsldma: remove DMA_SLAVE support
  ste_dma40: implement support for scatterlist to scatterlist copy

 arch/powerpc/include/asm/fsldma.h |  115 ++
 drivers/dma/dmaengine.c   |2 +
 drivers/dma/fsldma.c  |  321 +
 drivers/dma/ste_dma40.c   |   17 ++
 include/linux/dmaengine.h |6 +
 5 files changed, 185 insertions(+), 276 deletions(-)



[PATCH 1/4] dma: add support for scatterlist to scatterlist copy

2010-09-27 Thread Ira W. Snyder
This adds support for scatterlist to scatterlist DMA transfers. A
similar interface is exposed by the fsldma driver (through the DMA_SLAVE
API) and by the ste_dma40 driver (through an exported function).

This patch paves the way for making this type of copy operation a part
of the generic DMAEngine API. Further patches will add support in
individual drivers.

Signed-off-by: Ira W. Snyder i...@ovro.caltech.edu
---
 drivers/dma/dmaengine.c   |2 ++
 include/linux/dmaengine.h |6 ++
 2 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index 9d31d5e..db403b8 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -690,6 +690,8 @@ int dma_async_device_register(struct dma_device *device)
!device->device_prep_dma_memset);
BUG_ON(dma_has_cap(DMA_INTERRUPT, device->cap_mask) &&
!device->device_prep_dma_interrupt);
+   BUG_ON(dma_has_cap(DMA_SG, device->cap_mask) &&
+   !device->device_prep_dma_sg);
BUG_ON(dma_has_cap(DMA_SLAVE, device->cap_mask) &&
!device->device_prep_slave_sg);
BUG_ON(dma_has_cap(DMA_SLAVE, device->cap_mask) &&
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index c61d4ca..7c44620 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -64,6 +64,7 @@ enum dma_transaction_type {
DMA_PQ_VAL,
DMA_MEMSET,
DMA_INTERRUPT,
+   DMA_SG,
DMA_PRIVATE,
DMA_ASYNC_TX,
DMA_SLAVE,
@@ -473,6 +474,11 @@ struct dma_device {
unsigned long flags);
struct dma_async_tx_descriptor *(*device_prep_dma_interrupt)(
struct dma_chan *chan, unsigned long flags);
+   struct dma_async_tx_descriptor *(*device_prep_dma_sg)(
+   struct dma_chan *chan,
+   struct scatterlist *dst_sg, unsigned int dst_nents,
+   struct scatterlist *src_sg, unsigned int src_nents,
+   unsigned long flags);
 
struct dma_async_tx_descriptor *(*device_prep_slave_sg)(
struct dma_chan *chan, struct scatterlist *sgl,
-- 
1.7.1



[PATCH 3/4] fsldma: remove DMA_SLAVE support

2010-09-27 Thread Ira W. Snyder
Now that the generic DMAEngine API has support for scatterlist to
scatterlist copying, this implementation of the DMA_SLAVE API is no
longer necessary.

In order to let device_control() continue to function, a stub
device_prep_slave_sg() function is provided. This allows custom device
configuration, such as enabling external control.

Signed-off-by: Ira W. Snyder i...@ovro.caltech.edu
---
 arch/powerpc/include/asm/fsldma.h |  115 ++--
 drivers/dma/fsldma.c  |  219 +++--
 2 files changed, 48 insertions(+), 286 deletions(-)

diff --git a/arch/powerpc/include/asm/fsldma.h 
b/arch/powerpc/include/asm/fsldma.h
index debc5ed..dc0bd27 100644
--- a/arch/powerpc/include/asm/fsldma.h
+++ b/arch/powerpc/include/asm/fsldma.h
@@ -1,7 +1,7 @@
 /*
  * Freescale MPC83XX / MPC85XX DMA Controller
  *
- * Copyright (c) 2009 Ira W. Snyder i...@ovro.caltech.edu
+ * Copyright (c) 2009-2010 Ira W. Snyder i...@ovro.caltech.edu
  *
  * This file is licensed under the terms of the GNU General Public License
  * version 2. This program is licensed as is without any warranty of any
@@ -11,127 +11,32 @@
 #ifndef __ARCH_POWERPC_ASM_FSLDMA_H__
 #define __ARCH_POWERPC_ASM_FSLDMA_H__
 
-#include <linux/slab.h>
 #include <linux/dmaengine.h>
 
 /*
- * Definitions for the Freescale DMA controller's DMA_SLAVE implemention
+ * The Freescale DMA controller has several features that are not accommodated
+ * in the Linux DMAEngine API. Therefore, the generic structure is expanded
+ * to allow drivers to use these features.
  *
- * The Freescale DMA_SLAVE implementation was designed to handle many-to-many
- * transfers. An example usage would be an accelerated copy between two
- * scatterlists. Another example use would be an accelerated copy from
- * multiple non-contiguous device buffers into a single scatterlist.
+ * This structure should be passed into the DMAEngine routine device_control()
+ * as in this example:
  *
- * A DMA_SLAVE transaction is defined by a struct fsl_dma_slave. This
- * structure contains a list of hardware addresses that should be copied
- * to/from the scatterlist passed into device_prep_slave_sg(). The structure
- * also has some fields to enable hardware-specific features.
 * chan->device->device_control(chan, DMA_SLAVE_CONFIG, (unsigned long)&cfg);
  */
 
 /**
- * struct fsl_dma_hw_addr
- * @entry: linked list entry
- * @address: the hardware address
- * @length: length to transfer
- *
- * Holds a single physical hardware address / length pair for use
- * with the DMAEngine DMA_SLAVE API.
- */
-struct fsl_dma_hw_addr {
-   struct list_head entry;
-
-   dma_addr_t address;
-   size_t length;
-};
-
-/**
  * struct fsl_dma_slave
- * @addresses: a linked list of struct fsl_dma_hw_addr structures
+ * @config: the standard Linux DMAEngine API DMA_SLAVE configuration
  * @request_count: value for DMA request count
- * @src_loop_size: setup and enable constant source-address DMA transfers
- * @dst_loop_size: setup and enable constant destination address DMA transfers
  * @external_start: enable externally started DMA transfers
  * @external_pause: enable externally paused DMA transfers
- *
- * Holds a list of address / length pairs for use with the DMAEngine
- * DMA_SLAVE API implementation for the Freescale DMA controller.
  */
-struct fsl_dma_slave {
+struct fsldma_slave_config {
+   struct dma_slave_config config;
 
-   /* List of hardware address/length pairs */
-   struct list_head addresses;
-
-   /* Support for extra controller features */
unsigned int request_count;
-   unsigned int src_loop_size;
-   unsigned int dst_loop_size;
bool external_start;
bool external_pause;
 };
 
-/**
- * fsl_dma_slave_append - add an address/length pair to a struct fsl_dma_slave
- * @slave: the struct fsl_dma_slave to add to
- * @address: the hardware address to add
- * @length: the length of bytes to transfer from @address
- *
- * Add a hardware address/length pair to a struct fsl_dma_slave. Returns 0 on
- * success, -ERRNO otherwise.
- */
-static inline int fsl_dma_slave_append(struct fsl_dma_slave *slave,
-  dma_addr_t address, size_t length)
-{
-   struct fsl_dma_hw_addr *addr;
-
-   addr = kzalloc(sizeof(*addr), GFP_ATOMIC);
-   if (!addr)
-   return -ENOMEM;
-
-   INIT_LIST_HEAD(&addr->entry);
-   addr->address = address;
-   addr->length = length;
-
-   list_add_tail(&addr->entry, &slave->addresses);
-   return 0;
-}
-
-/**
- * fsl_dma_slave_free - free a struct fsl_dma_slave
- * @slave: the struct fsl_dma_slave to free
- *
- * Free a struct fsl_dma_slave and all associated address/length pairs
- */
-static inline void fsl_dma_slave_free(struct fsl_dma_slave *slave)
-{
-   struct fsl_dma_hw_addr *addr, *tmp;
-
-   if (slave) {
-   list_for_each_entry_safe(addr, tmp, &slave->addresses, entry) {
-   

[RFC PATCH 2/2] pseries/xics: use cpu_possible_mask rather than cpu_all_mask

2010-09-27 Thread Nishanth Aravamudan
Current firmware only allows us to send IRQs to the first processor or
all processors. We currently check to see if the passed in mask is equal
to the all_mask, but the firmware is only considering whether the
request is for the equivalent of the possible_mask. Thus, we think the
request is for some subset of CPUs and only assign IRQs to the first CPU
(on systems without irqbalance running) as evidenced by
/proc/interrupts. By using possible_mask instead, we account for this
and proper interleaving of interrupts occurs. Without this change and
pseries/xics: use cpu_possible_mask rather than cpu_all_mask, IRQs are
all routed to CPU0 on power machines not running irqbalance.

Signed-off-by: Nishanth Aravamudan n...@us.ibm.com
---
 arch/powerpc/platforms/pseries/xics.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/xics.c 
b/arch/powerpc/platforms/pseries/xics.c
index 93834b0..7c1e342 100644
--- a/arch/powerpc/platforms/pseries/xics.c
+++ b/arch/powerpc/platforms/pseries/xics.c
@@ -178,7 +178,7 @@ static int get_irq_server(unsigned int virq, const struct 
cpumask *cpumask,
if (!distribute_irqs)
return default_server;
 
-   if (!cpumask_equal(cpumask, cpu_all_mask)) {
+   if (!cpumask_subset(cpu_possible_mask, cpumask)) {
int server = cpumask_first_and(cpu_online_mask, cpumask);
 
if (server < nr_cpu_ids)
-- 
1.7.0.4



[RFC PATCH 0/2] Fix IRQ round-robing w/o irqbalance on pseries

2010-09-27 Thread Nishanth Aravamudan
We have received reports on power systems not running irqbalance where
all interrupts are being routed to CPU0 rather than being interleaved by
default across the system. Current firmware only allows either sending
interrupts to all CPUs or sending them to one CPU. The following two
patches address this issue by fixing the mask used in generic code and
by fixing the check for the all setting in the pseries code.

Nishanth Aravamudan (2):
  IRQ: use cpu_possible_mask rather than online_mask in setup_affinity
  pseries/xics: use cpu_possible_mask rather than cpu_all_mask

 arch/powerpc/platforms/pseries/xics.c |2 +-
 kernel/irq/manage.c   |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ker...@vger.kernel.org



Re: [PATCH 4/8] v2 Allow memory block to span multiple memory sections

2010-09-27 Thread Dave Hansen
On Mon, 2010-09-27 at 14:25 -0500, Nathan Fontenot wrote:
 +static inline int base_memory_block_id(int section_nr)
 +{
 +   return section_nr / sections_per_block;
 +}
...
 -   mutex_lock(&mem_sysfs_mutex);
 -
 -   mem->phys_index = __section_nr(section);
 +   scn_nr = __section_nr(section);
 +   mem->phys_index = base_memory_block_id(scn_nr) * sections_per_block; 

I'm really regretting giving this variable such a horrid name.  I suck.

I think this is correct now:

mem->phys_index = base_memory_block_id(scn_nr) * sections_per_block;
mem->phys_index = section_nr / sections_per_block * sections_per_block;
mem->phys_index = section_nr

Since it gets exported to userspace this way:

 +static ssize_t show_mem_start_phys_index(struct sys_device *dev,
 struct sysdev_attribute *attr, char *buf)
  {
 struct memory_block *mem =
 container_of(dev, struct memory_block, sysdev);
 -   return sprintf(buf, "%08lx\n", mem->phys_index / sections_per_block);
 +   unsigned long phys_index;
 +
 +   phys_index = mem->start_phys_index / sections_per_block;
 +   return sprintf(buf, "%08lx\n", phys_index);
 +}

The only other thing I'd say is that we need to put phys_index out of
its misery and call it what it is now: a section number.  I think it's
OK to call them start/end_section_nr, at least inside the kernel.  I
intentionally used phys_index terminology in sysfs so that we _could_
eventually do this stuff and break the relationship between sections and
the sysfs dirs, but I think keeping the terminology around inside the
kernel is confusing now.

-- Dave



Re: Oops in trace_hardirqs_on (powerpc)

2010-09-27 Thread Steven Rostedt
On Mon, 2010-09-27 at 14:50 +0200, Jörg Sommer wrote:
 Hello Steven,
 
 Steven Rostedt wrote on Wed 22 Sep, 15:44 (-0400):
  Sorry for the late reply, but I was on vacation when you sent this, and
  I missed it while going through email.
  
  Do you still have this issue?
 
 No. I've rebuilt my kernel without TRACE_IRQFLAGS and the problem
 vanished, as expected. The problem is that in some cases the stack is
 only two frames deep, which causes the macro CALLER_ADDR1 to make an
 invalid access. Someone told me there is a workaround for the problem on
 i386, too.
 
 % sed -n 2p arch/x86/lib/thunk_32.S
  * Trampoline to trace irqs off. (otherwise CALLER_ADDR1 might crash)

Yes, I remember that problem. When I get back from Tokyo, I'll try to
remember to fix it.

Thanks!

-- Steve



[PATCH 2/3 RESEND] powerpc: remove cast from void*

2010-09-27 Thread matt mooney
Unnecessary cast from void* in assignment.

Signed-off-by: matt mooney m...@muteddisk.com
---
 arch/powerpc/platforms/pseries/hvCall_inst.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hvCall_inst.c 
b/arch/powerpc/platforms/pseries/hvCall_inst.c
index e19ff02..f106662 100644
--- a/arch/powerpc/platforms/pseries/hvCall_inst.c
+++ b/arch/powerpc/platforms/pseries/hvCall_inst.c
@@ -55,7 +55,7 @@ static void hc_stop(struct seq_file *m, void *p)
 static int hc_show(struct seq_file *m, void *p)
 {
unsigned long h_num = (unsigned long)p;
-   struct hcall_stats *hs = (struct hcall_stats *)m-private;
+   struct hcall_stats *hs = m->private;
 
if (hs[h_num].num_calls) {
if (cpu_has_feature(CPU_FTR_PURR))
-- 
1.7.2.1
