Re[4]: [PATCH][v4] powerpc 44x: support for 256KB PAGE_SIZE

2009-01-29 Thread Yuri Tikhonov

On Sunday, January 18, 2009 you wrote:
 Ok, I tried this out in menuconfig.  You are right that the 'depends on'
 makes sense, as it removes the option from the config file as not
 relevant.  But right now, to enable 256K pages one has to go to platform
 setup to find this dependency, then to general setup to find
 the shmem option at the bottom of the list in the embedded/expert
 section, and then finally to the kernel options menu to choose
 the page size.

 Moving this question to just before the page size choice removes one of
 those hidden menus, so I suggest that it be moved to just before the
 option that it allows to be selected.

 Right, it'll be more convenient. My [v5] patch addresses this.

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re[4]: [PATCH 02/11][v3] async_tx: add support for asynchronous GF multiplication

2009-01-17 Thread Yuri Tikhonov
Hello Dan,

On Friday, January 16, 2009 you wrote:

 On Fri, Jan 16, 2009 at 4:41 AM, Yuri Tikhonov y...@emcraft.com wrote:
 I don't think this will work as we will be mixing Q into the new P and
 P into the new Q.  In order to support (src_cnt > device->max_pq) we
 need to explicitly tell the driver that the operation is being
 continued (DMA_PREP_CONTINUE) and to apply different coefficients to
 P and Q to cancel the effect of including them as sources.

  With DMA_PREP_ZERO_P/Q approach, the Q isn't mixed into new P, and P
 isn't mixed into new Q. For your example of max_pq=4:

  p, q = PQ(src0, src1, src2, src3, src4, COEF({01}, {02}, {04}, {08}, {10}))

  with the current implementation will be split into:

  p, q = PQ(src0, src1, src2, src3, COEF({01}, {02}, {04}, {08})
  p`,q` = PQ(src4, COEF({10}))

  which will result in the following:

  p = ((dma_flags & DMA_PREP_ZERO_P) ? 0 : old_p) + src0 + src1 + src2 + src3
  q = ((dma_flags & DMA_PREP_ZERO_Q) ? 0 : old_q) + {01}*src0 + {02}*src1 + 
 {04}*src2 + {08}*src3

  p` = p + src4
  q` = q + {10}*src4


 Huh?  Does the ppc440spe engine have some notion of flagging a source
 as old_p/old_q?  Otherwise I do not see how the engine will not turn
 this into:

 p` = p + src4 + q
 q` = q + {10}*src4 + {x}*p

 I think you missed the fact that we have passed p and q back in as
 sources.  Unless we have multiple p destinations and multiple q
 destinations, or hardware support for continuations I do not see how
 you can guarantee this split.

 I guess I've got your point. You are missing the fact that the 
destinations for 'p' and 'q' are passed to the device_prep_dma_pq() method 
separately from the sources. In your words: we do not have multiple 
destinations through the while() cycles; the destinations are the same 
in each pass.

 Please look at the do_async_pq() implementation more carefully: 'blocks' 
is a pointer to 'src_cnt' sources _plus_ two destination pages (as 
stated in the async_pq() description). Before entering the while() 
cycle we save the destinations in the dma_dest[] array, and then pass this 
to device_prep_dma_pq() in each (src_cnt/max_pq) cycle. That is, we do 
not pass the destinations as sources explicitly: we just clear the 
DMA_PREP_ZERO_P/Q flags to notify the ADMA level that it has to XOR the 
current content of the destination(s) with the result of the new operation.

  I'm afraid that the difference (13/4, 125/32) is very significant, so
 getting rid of DMA_PREP_ZERO_P/Q will eat most of the improvement
 which could be achieved with the current approach.

 Data corruption is a slightly higher cost :-).


  but at this point I do not see a cleaner alternative for engines like 
 iop13xx.

  I can't find any description of iop13xx processors at Intel's
 web-site, only 3xx:

 http://www.intel.com/design/iio/index.htm?iid=ipp_embed+embed_io

  So, it's hard for me to make any suggestions. I just wonder - doesn't
 iop13xx allow users to program destination addresses into the source
 fields of descriptors?

 Yes it does, but the engine does not know it is a destination.

 Take a look at page 496 of the following and tell me if you come to a
 different conclusion.
 http://download.intel.com/design/iio/docs/31503602.pdf

 I see. The major difference in the implementation of P+Q support 
in the ppc440spe DMA engines is that ppc440spe allows including (XORing) the 
previous content of P_Result and/or Q_Result simply by setting a 
corresponding indication in the destination (P_Result and/or Q_Result) 
address(es).

 The 5.7.5 P+Q Update Operation case won't help here since, if 
I understand it right, it doesn't allow setting up different 
multipliers for the Old and New Data.

 So, it looks like your approach:

p', q' = PQ(p, q, q, src4, COEF({00}, {01}, {00}, {10}))

 is the only possible way of including the previous P/Q content into 
the calculation.

 But I still think that this p'/q' hack should live at the 
ADMA level, not in ASYNC_TX. It looks more generic if ASYNC_TX 
assumes that ADMA is capable of p'=p+src / q'=q+{}*src. Otherwise, 
we'll impose an overhead on the DMAs which could work without this 
overhead.

 In your case, the IOP ADMA driver should handle the situation when it 
receives 4 sources to be P+Q-ed with the previous contents of the 
destinations, for example, by generating a sequence of 4 descriptors 
to process such a request.

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com



Re[4]: [PATCH 03/11][v3] async_tx: add support for asynchronous RAID6 recovery operations

2009-01-17 Thread Yuri Tikhonov
On Friday, January 16, 2009 you wrote:

 On Fri, Jan 16, 2009 at 4:51 AM, Yuri Tikhonov y...@emcraft.com wrote:
  The reason why I preferred to use async_pq() instead of async_xor()
 here is to maximize the chance that the whole D+D recovery operation
 will be handled by one ADMA device, i.e. without a channel switch and
 the latency introduced because of that.


 This should be a function of the async_tx_find_channel implementation.
  The default version tries to keep a chain of operations on one
 channel.

 struct dma_chan *
 __async_tx_find_channel(struct dma_async_tx_descriptor *depend_tx,
 enum dma_transaction_type tx_type)
 {
 /* see if we can keep the chain on one channel */
  if (depend_tx &&
  dma_has_cap(tx_type, depend_tx->chan->device->cap_mask))
  return depend_tx->chan;
 return dma_find_channel(tx_type);
 }

 Right. Then I need to update my ADMA driver, and add support for 
explicit DMA_XOR capability on channels which can process DMA_PQ.
Thanks.

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com



Re[2]: [PATCH 11/11][v2] ppc440spe-adma: ADMA driver for PPC440SP(e) systems

2009-01-16 Thread Yuri Tikhonov

 Hello David,

 Thanks a lot for review.

 The general note to be made here is that the changes to the DTS file 
made by this patch are necessary for the ppc440spe ADMA driver, which is 
a not-yet-completed arch/powerpc port from the arch/ppc branch, and which 
uses the DT (well, incorrectly) just to get interrupts. Otherwise, it's 
just a platform device driver.

 We provided this ADMA driver just as a reference driver which 
implements the RAID-6 related low-level stuff. ppc440spe ADMA in its 
current state is far from ready for merging. We'll elaborate on 
cleaning it up later (surely, taking into account all the comments made 
by the community). But, even now, the driver works, so we publish it 
so that interested people can use and test it.

 Some comments mixed in below.

On Tuesday, January 13, 2009 you wrote:

 On Tue, Jan 13, 2009 at 03:43:55AM +0300, Yuri Tikhonov wrote:
 Adds the platform device definitions and the architecture specific support
 routines for the ppc440spe adma driver.
 
 Any board equipped with PPC440SP(e) controller may utilize this driver.
 
 diff --git a/arch/powerpc/boot/dts/katmai.dts 
 b/arch/powerpc/boot/dts/katmai.dts
 index 077819b..f2f77c8 100644
 --- a/arch/powerpc/boot/dts/katmai.dts
 +++ b/arch/powerpc/boot/dts/katmai.dts
 @@ -16,7 +16,7 @@
  
  / {
   #address-cells = <2>;
 - #size-cells = <1>;
 + #size-cells = <2>;

 You've changed the root level size-cells, but haven't updated the
 sub-nodes (such as /memory) accordingly.

 Thanks, we'll fix this in the next version of this patch.

   model = "amcc,katmai";
   compatible = "amcc,katmai";
   dcr-parent = <&{/cpus/c...@0}>;
 @@ -392,6 +392,30 @@
   0x0 0x0 0x0 0x3 &UIC3 0xa 0x4 /* swizzled int C */
   0x0 0x0 0x0 0x4 &UIC3 0xb 0x4 /* swizzled int D */>;
   };
 + DMA0: dma0 {

 No 'compatible' property, which seems dubious.

 OK, we'll fix.

 + interrupt-parent = <&DMA0>;
 + interrupts = <0 1>;
 + #interrupt-cells = <1>;
 + #address-cells = <0>;
 + #size-cells = <0>;
 + interrupt-map = <
 + 0 &UIC0 0x14 4
 + 1 &UIC1 0x16 4>;
 + };
 + DMA1: dma1 {
 + interrupt-parent = <&DMA1>;
 + interrupts = <0 1>;
 + #interrupt-cells = <1>;
 + #address-cells = <0>;
 + #size-cells = <0>;
 + interrupt-map = <
 + 0 &UIC0 0x16 4
 + 1 &UIC1 0x16 4>;

 Are these interrupt-maps correct?  The second interrupt from both dma
 controllers is routed to the same line on UIC1?

 The map is correct:

- the first interrupts are 'DMAx Command Status FIFO Needs Service';
- the second interrupts are 'DMA Error'; both DMA engines share a common error IRQ.


 + };
 + xor {
 + interrupt-parent = <&UIC1>;
 + interrupts = <0x1f 4>;

 What the hell is this thing?  No compatible property, nor even a
 meaningful name.

 This is the XOR accelerator, the dedicated DMA engine of ppc440spe 
equipped with the ability to do XOR operations in h/w. I guess it 
could be named something like DMA2.

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com



Re[2]: [RFC PATCH 00/11][v3] md: support for asynchronous execution of RAID6 operations

2009-01-16 Thread Yuri Tikhonov

 Hello Dan,

On Wednesday, January 14, 2009 you wrote:

[..]

 Do you have a git tree where you can post this series?  That would
 make it easier for me to track/review.

 Yes.

 Please see the raidstuff branch in the linux-2.6-denx repository:

http://git.denx.de/?p=linux-2.6-denx.git;a=shortlog;h=refs/heads/raidstuff

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com



Re[2]: [PATCH 02/11][v3] async_tx: add support for asynchronous GF multiplication

2009-01-16 Thread Yuri Tikhonov
)
 +#define DMA_QCHECK_FAILED  (1 << 1)

 Perhaps turn these into an enum such that we can pass around an enum
 pq_check_flags pointer rather than a non-descript u32 *.

 Agree.

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com



Re[2]: [PATCH 03/11][v3] async_tx: add support for asynchronous RAID6 recovery operations

2009-01-16 Thread Yuri Tikhonov

On Thursday, January 15, 2009 Dan Williams wrote:

 On Mon, Jan 12, 2009 at 5:43 PM, Yuri Tikhonov y...@emcraft.com wrote:
 +   /* (2) Calculate Q+Qxy */
 +   lptrs[0] = ptrs[failb];
 +   lptrs[1] = ptrs[disks-1];
 +   lptrs[2] = NULL;
 +   tx = async_pq(lptrs, NULL, 0, 1, bytes, ASYNC_TX_DEP_ACK,
 + tx, NULL, NULL);
 +
 +   /* (3) Calculate P+Pxy */
 +   lptrs[0] = ptrs[faila];
 +   lptrs[1] = ptrs[disks-2];
 +   lptrs[2] = NULL;
 +   tx = async_pq(lptrs, NULL, 0, 1, bytes, ASYNC_TX_DEP_ACK,
 + tx, NULL, NULL);
 +

 These two calls convinced me that ASYNC_TX_PQ_ZERO_{P,Q} need to go.
 A 1-source async_pq operation does not make sense.

 Another source is hidden under the not-set ASYNC_TX_PQ_ZERO_{P,Q} :) 
Though, I agree, this looks rather misleading.

   These should be:

/* (2) Calculate Q+Qxy */
lptrs[0] = ptrs[disks-1];
lptrs[1] = ptrs[failb];
tx = async_xor(lptrs[0], lptrs, 0, 2, bytes,
   ASYNC_TX_XOR_DROP_DST|ASYNC_TX_DEP_ACK, tx, NULL, NULL);

 /* (3) Calculate P+Pxy */
lptrs[0] = ptrs[disks-2];
lptrs[1] = ptrs[faila];
tx = async_xor(lptrs[0], lptrs, 0, 2, bytes,
   ASYNC_TX_XOR_DROP_DST|ASYNC_TX_DEP_ACK, tx, NULL, NULL);


 The reason why I preferred to use async_pq() instead of async_xor() 
here is to maximize the chance that the whole D+D recovery operation 
will be handled by one ADMA device, i.e. without a channel switch and 
the latency introduced because of that.

 So, if we decide to stay with ASYNC_TX_PQ_ZERO_{P,Q}, then this 
should probably be kept unchanged; but if we get rid of 
ASYNC_TX_PQ_ZERO_{P,Q}, then, obviously, we'll have to use 
async_xor()s here as you suggest.


 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com



Re[2]: [PATCH 11/11][v2] ppc440spe-adma: ADMA driver for PPC440SP(e) systems

2009-01-16 Thread Yuri Tikhonov

 Hello Anton,

 Thanks for the review. Please see the general note I made in Re[2]: 
[PATCH 11/11][v2] ppc440spe-adma: ADMA driver for PPC440SP(e) 
systems.

 All your comments make sense, so we'll try to address these in the 
next version of the driver. Some comments below.

On Thursday, January 15, 2009 you wrote:

 Hello Yuri,

 On Tue, Jan 13, 2009 at 03:43:55AM +0300, Yuri Tikhonov wrote:
 Adds the platform device definitions and the architecture specific support
 routines for the ppc440spe adma driver.
 
 Any board equipped with PPC440SP(e) controller may utilize this driver.
 
 Signed-off-by: Yuri Tikhonov y...@emcraft.com
 Signed-off-by: Ilya Yanok ya...@emcraft.com
 ---

 Quite complex and interesting driver, I must say.
 Have you thought about splitting ppc440spe-adma.c into multiple
 files, btw?

 Admittedly, no. But I guess this makes sense. The driver supports two 
different types of DMA devices on ppc440spe: DMA0/1 and DMA2 [the XOR 
engine]. So, we could split the driver at least in two, which would 
definitely simplify the code. 

 A few comments down below...

 [...]
 +typedef struct ppc440spe_adma_device {

 Please avoid typedefs.

 OK.

 [...]
 +/*
 + * Descriptor of allocated CDB
 + */
 +typedef struct {
 + dma_cdb_t   *vaddr; /* virtual address of CDB */
 + dma_addr_t  paddr;  /* physical address of CDB */
 + /*
 +  * Additional fields
 +  */
 + struct list_headlink;   /* link in processing list */
 + u32 status; /* status of the CDB */
 + /* status bits:  */
 + #define DMA_CDB_DONE   (1<<0)  /* CDB processing completed */
 + #define DMA_CDB_CANCEL (1<<1)  /* waiting thread was interrupted */
 +} dma_cdbd_t;

 It seems there are no users of this struct.

 Indeed. This is a useless leftover from some old version of the 
driver. We will remove it in the next patch.

[..]

 +/**
 + * ppc440spe_desc_init_dma01pq - initialize the descriptors for PQ operation
 + * with DMA0/1
 + */
 +static inline void ppc440spe_desc_init_dma01pq(ppc440spe_desc_t *desc,
 + int dst_cnt, int src_cnt, unsigned long flags,
 + unsigned long op)
 +{

 Way too big for inline. The same for all the inlines.

 Btw, ppc_async_tx_find_best_channel() looks too big for inline
 and also too big to be in a .h file.

 OK, will be moved to the appropriate .c.

[..]

 [...]
 +static int ppc440spe_test_raid6 (ppc440spe_ch_t *chan)
 +{
 + ppc440spe_desc_t *sw_desc, *iter;
 + struct page *pg;
 + char *a;
 + dma_addr_t dma_addr, addrs[2];
 + unsigned long op = 0;
 + int rval = 0;
 +
 + /*FIXME*/

 ?

 +
 + set_bit(PPC440SPE_DESC_WXOR, &op);
 +
 + pg = alloc_page(GFP_KERNEL);
 + if (!pg)
 + return -ENOMEM;
 +

 +
 +/**
 + * ppc440spe_adma_probe - probe the asynch device
 + */
 +static int __devinit ppc440spe_adma_probe(struct platform_device *pdev)
 +{
 + struct resource *res;

 Why is this a platform driver? What's the point of describing
 DMA nodes in the device tree w/o actually using them (don't count
 interrupts)? There are a lot of hard-coded addresses in the code...
 :-/

 And arch/powerpc/platforms/44x/ppc440spe_dma_engines.c file
 reminds me arch/ppc-style bindings. ;-)

 Right. This driver is a not-completed port from the arch/ppc branch.

 + int ret=0, irq1, irq2, initcode = PPC_ADMA_INIT_OK;
 + void *regs;
 + ppc440spe_dev_t *adev;
 + ppc440spe_ch_t *chan;
 + ppc440spe_aplat_t *plat_data;
 + struct ppc_dma_chan_ref *ref;
 + struct device_node *dp;
 + char s[10];
 +

 [...]
 +static int __init ppc440spe_adma_init (void)
 +{
 + int rval, i;
 + struct proc_dir_entry *p;
 +
 + for (i = 0; i < PPC440SPE_ADMA_ENGINES_NUM; i++)
 + ppc_adma_devices[i] = -1;
 +
 + rval = platform_driver_register(ppc440spe_adma_driver);
 +
 + if (rval == 0) {
 + /* Create /proc entries */
 + ppc440spe_proot = proc_mkdir(PPC440SPE_R6_PROC_ROOT, NULL);
 + if (!ppc440spe_proot) {
 + printk(KERN_ERR "%s: failed to create %s proc "
 + "directory\n", __func__, PPC440SPE_R6_PROC_ROOT);
 + /* User will not be able to enable h/w RAID-6 */
 + return rval;
 + }

 /proc? Why /proc? The driver has nothing to do with Linux VM subsystem
 or processes. I think /sys/ interface would suit better for this, no?
 Either way, userspace interfaces should be documented somehow
 (probably Documentation/ABI/, or at least some comments in the
 code).

 Agree, we'll fix this.

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com



Re[2]: [PATCH 07/11] md: rewrite handle_stripe_dirtying6 in asynchronous way

2009-01-16 Thread Yuri Tikhonov
On Friday, January 16, 2009 you wrote:

 On Thu, Jan 15, 2009 at 2:51 PM, Dan Williams dan.j.willi...@intel.com 
 wrote:
 On Mon, Dec 8, 2008 at 2:57 PM, Yuri Tikhonov y...@emcraft.com wrote:
 What's the reasoning behind changing the logic here, i.e. removing
 must_compute and such?  I'd feel more comfortable seeing copy and
 paste where possible with cleanups separated out into their own patch.


 Ok, I now see why this change was made.  Please make this changelog
 more descriptive than "Rewrite handle_stripe_dirtying6 function to
 work asynchronously".

 Sure, how about the following:



 md: rewrite handle_stripe_dirtying6 in asynchronous way

 Processing stripe dirtying in an asynchronous way requires some changes 
to the handle_stripe_dirtying6() algorithm.

 In the synchronous implementation of stripe dirtying we processed 
the dirtying of a degraded stripe (with partially changed strip(s) located 
on the failed drive(s)) inside one handle_stripe_dirtying6() call:
- we computed the missed strips from the old parities, and thus got 
the fully up-to-date stripe, then
- we did the reconstruction using the new data to write.

 In the asynchronous case handle_stripe_dirtying6() doesn't 
process anything right inside the function (since we are under the lock), 
but only schedules the necessary operations with flags. Thus, if 
handle_stripe_dirtying6() is performed on top of a degraded array, 
we should schedule the reconstruction operation when the failed strips 
are marked (by the previously called fetch_block6()) as to be computed 
(with the R5_Wantcompute flag), and all the other strips of the stripe 
are UPTODATE. The schedule_reconstruction() function will set the 
STRIPE_OP_POSTXOR flag [for the new parity calculation], which is then 
handled in raid_run_ops() after the STRIPE_OP_COMPUTE_BLK one [which 
computes the missed data].



 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com



Re[2]: [PATCH 07/11] md: rewrite handle_stripe_dirtying6 in asynchronous way

2009-01-16 Thread Yuri Tikhonov

Hello Cheng,

On Friday, January 16, 2009 you wrote:

 Ack. Could you please make the changelog more descriptive,
 and/or add some of your benchmark results?

 Of course. We did benchmarking using the Xdd tool as follows:

# xdd -op write -kbytes $kbytes -reqsize $reqsize -dio -passes 2 -verbose 
-target $target_device

 where

$kbytes = data disks * size of disk
$reqsize= data disks * chunk size
$target_device = /dev/md0

 This way we did a write of the full array size, and thus achieved the 
maximum performance.

 The test case was a RAID-6 array built on top of 14 S-ATA drives 
connected to 2 LSI cards (7+7) inserted into an 800 MHz Katmai board 
(based on ppc440spe) equipped with 4GB of 800 MHz DRAM.

 Here are the results (Psw - write throughput with s/w RAID-6; Phw - 
write throughput with the h/w accelerated RAID-6):

 PAGE_SIZE=4KB, chunk=64/128/256 KB
Psw = 71/72/74 MBps
Phw = 128/136/139 MBps

 PAGE_SIZE=16KB, chunk=256/512/1024 KB
Psw = 81/81/82 MBps
Phw = 205/244/239 MBps

 PAGE_SIZE=64KB, chunk=1024/2048/4096 KB
Psw = 84/84/85 MBps
Phw = 258/253/258 MBps

 PAGE_SIZE=256KB, chunk=4096/8192/16384 KB
Psw = 81/83/83 MBps
Phw = 288/275/274 MBps

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com


Re[2]: [PATCH][v4] powerpc 44x: support for 256KB PAGE_SIZE

2009-01-16 Thread Yuri Tikhonov
Hello Milton,

On Friday, January 16, 2009 you wrote:

 On Jan 12, 2009, at 4:49 PM, Yuri Tikhonov wrote:

 This patch adds support for 256KB pages on ppc44x-based boards.

 Another day, another comment.  The motivation for this reply was the second
 comment below.


 diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
 index 84b8613..18f33ef 100644
 --- a/arch/powerpc/Kconfig
 +++ b/arch/powerpc/Kconfig
 @@ -443,6 +443,19 @@ config PPC_64K_PAGES
   bool "64k page size" if 44x || PPC_STD_MMU_64
   select PPC_HAS_HASH_64K if PPC_STD_MMU_64

 +config PPC_256K_PAGES
 + bool "256k page size" if 44x
 + depends on !STDBINUTILS && !SHMEM

  depends on !STDBINUTILS && (!SHMEM || BROKEN)

  to identify that it is not fundamentally incompatible, just that it has no
 chance of working without other changes.

 This makes sense.

[..]

 +config STDBINUTILS
 + bool "Using standard binutils settings"
 + depends on 44x
 + default y


 I think this should be

 config STDBINUTILS
  bool "Using standard binutils settings" if 44x
 default y

 that way we imply that all powerpc users are using the standard 
 binutils instead of only those using a 44x platform.  We still get the
 intended effect of asking the user only on 44x.

 I haven't looked at the resulting question or config order to see if it
 makes sense to leave it here or put it closer to the page size.

 I'm not sure about this. For 44x platforms the STDBINUTILS option 
is reasonable, because it's used in the PAGE_SIZE selection process. 
But for the other powerpcs the STDBINUTILS option will do nothing 
but take up a superfluous string in configs. Are you sure this 
would be better?

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com



Re[2]: [PATCH][v3] powerpc 44x: support for 256KB PAGE_SIZE

2009-01-12 Thread Yuri Tikhonov

Hello Prodyut,

On Monday, January 12, 2009 you wrote:

 On Sun, Jan 11, 2009 at 10:42 AM, Yuri Tikhonov y...@emcraft.com wrote:

 This patch adds support for 256KB pages on ppc44x-based boards.


 Hi Yuri,
 Do you still need the mm/shmem.c patch to avoid division by zero?

 Yes.

 I looked at the mm/shmem.c latest git code, and I see that it doesn't
 have the needed patch for 256KB page.

 Right. We proposed a work-around for this (which simply 
increased the sizes of the variables which hold the overflowed values) to 
LKML here:

http://lkml.org/lkml/2008/12/19/20

 If I understand Hugh right, such a fix is acceptable, but far 
from the best, so Hugh is going to implement the correct fix for 
the problem as soon as he finds some time (big thanks to him for 
this).

 I think another option would be to make 256KB compile only if CONFIG_SHMEM=n

 Agree. For the current situation it seems the better solution. I'll 
update, and re-post the patch shortly.

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com



[PATCH][v4] powerpc 44x: support for 256KB PAGE_SIZE

2009-01-12 Thread Yuri Tikhonov
This patch adds support for 256KB pages on ppc44x-based boards.

For simplification of implementation with 256KB pages we still assume
2-level paging. As a side effect this leads to wasting extra memory space
reserved for PTE tables: only 1/4 of pages allocated for PTEs are
actually used. But this may be an acceptable trade-off to achieve the
high performance we have with big PAGE_SIZEs in some applications (e.g.
RAID).

Also with 256KB PAGE_SIZE we increase THREAD_SIZE up to 32KB to minimize
the risk of stack overflows in the case of on-stack arrays whose size
depends on the page size (e.g. multipage BIOs, NTFS, etc.).

With 256KB PAGE_SIZE we need to decrease PKMAP_ORDER at least down
to 9, otherwise all high memory (2 ^ 10 * PAGE_SIZE == 256MB) will be
occupied by PKMAP addresses, leaving no place for vmalloc. We do not
separate the PKMAP_ORDER for 256K from 16K/64K PAGE_SIZE here; actually, the
value of 10 in the 16K/64K support had been selected rather intuitively.
Thus now for all cases of PAGE_SIZE on ppc44x (including the default, 4KB,
one) we have 512 pages for PKMAP.

Because the ELF standard supports only page sizes up to 64K, you should
use binutils later than 2.17.50.0.3 with '-zmax-page-size' set to 256K
for building applications which are to be run with the 256KB-page-sized
kernel. If using older binutils, you should patch them as follows:

--- binutils/bfd/elf32-ppc.c.orig
+++ binutils/bfd/elf32-ppc.c

-#define ELF_MAXPAGESIZE	0x10000
+#define ELF_MAXPAGESIZE	0x40000

One more restriction we currently have with the 256KB page size is the
inability to use shmem safely, so, for now, 256KB is available only if you
turn the CONFIG_SHMEM option off.
Though, if you need shmem with 256KB pages, you can always remove the !SHMEM
dependency in 'config PPC_256K_PAGES' and use the workaround available here:
 http://lkml.org/lkml/2008/12/19/20

Signed-off-by: Yuri Tikhonov y...@emcraft.com
Signed-off-by: Ilya Yanok ya...@emcraft.com
---
 arch/powerpc/Kconfig   |   15 +++
 arch/powerpc/include/asm/highmem.h |   10 +-
 arch/powerpc/include/asm/mmu-44x.h |2 ++
 arch/powerpc/include/asm/page.h|6 --
 arch/powerpc/include/asm/page_32.h |4 
 arch/powerpc/include/asm/thread_info.h |4 +++-
 arch/powerpc/kernel/head_booke.h   |   11 ++-
 arch/powerpc/platforms/44x/Kconfig |   12 
 8 files changed, 55 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 84b8613..18f33ef 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -443,6 +443,19 @@ config PPC_64K_PAGES
	bool "64k page size" if 44x || PPC_STD_MMU_64
select PPC_HAS_HASH_64K if PPC_STD_MMU_64
 
+config PPC_256K_PAGES
+	bool "256k page size" if 44x
+	depends on !STDBINUTILS && !SHMEM
+   help
+ Make the page size 256k.
+
+ As the ELF standard only requires alignment to support page
+ sizes up to 64k, you will need to compile all of your user
+ space applications with non-standard binutils settings
+ (see the STDBINUTILS description for details).
+
+ Say N unless you know what you are doing.
+
 endchoice
 
 config FORCE_MAX_ZONEORDER
@@ -455,6 +468,8 @@ config FORCE_MAX_ZONEORDER
	default 9 if PPC_STD_MMU_32 && PPC_16K_PAGES
	range 7 64 if PPC_STD_MMU_32 && PPC_64K_PAGES
	default 7 if PPC_STD_MMU_32 && PPC_64K_PAGES
+	range 5 64 if PPC_STD_MMU_32 && PPC_256K_PAGES
+	default 5 if PPC_STD_MMU_32 && PPC_256K_PAGES
range 11 64
default 11
help
diff --git a/arch/powerpc/include/asm/highmem.h 
b/arch/powerpc/include/asm/highmem.h
index 04e4a62..a290759 100644
--- a/arch/powerpc/include/asm/highmem.h
+++ b/arch/powerpc/include/asm/highmem.h
@@ -39,15 +39,15 @@ extern pte_t *pkmap_page_table;
  * chunk of RAM.
  */
 /*
- * We use one full pte table with 4K pages. And with 16K/64K pages pte
- * table covers enough memory (32MB and 512MB resp.) that both FIXMAP
- * and PKMAP can be placed in single pte table. We use 1024 pages for
- * PKMAP in case of 16K/64K pages.
+ * We use one full pte table with 4K pages. And with 16K/64K/256K pages pte
+ * table covers enough memory (32MB/512MB/2GB resp.), so that both FIXMAP
+ * and PKMAP can be placed in a single pte table. We use 512 pages for PKMAP
+ * in case of 16K/64K/256K page sizes.
  */
 #ifdef CONFIG_PPC_4K_PAGES
 #define PKMAP_ORDER	PTE_SHIFT
 #else
-#define PKMAP_ORDER	10
+#define PKMAP_ORDER	9
 #endif
 #define LAST_PKMAP	(1 << PKMAP_ORDER)
 #ifndef CONFIG_PPC_4K_PAGES
diff --git a/arch/powerpc/include/asm/mmu-44x.h 
b/arch/powerpc/include/asm/mmu-44x.h
index 27cc6fd..3c86576 100644
--- a/arch/powerpc/include/asm/mmu-44x.h
+++ b/arch/powerpc/include/asm/mmu-44x.h
@@ -83,6 +83,8 @@ typedef struct {
 #define PPC44x_TLBE_SIZE   PPC44x_TLB_16K
 #elif (PAGE_SHIFT == 16)
 #define

[RFC PATCH 00/11][v3] md: support for asynchronous execution of RAID6 operations

2009-01-12 Thread Yuri Tikhonov
 Hello,

 This is the next attempt at asynchronous RAID-6 support. This patch-set
has Dan Williams' comments (Dec 17) addressed, with the following
exception:

- I still think that using 'enum dma_ctrl_flags' for PQ-specific 
operations is better than introducing another group of flags and 
enhancing device_prep_dma_pq()/device_prep_dma_pqzero_sum() with one 
more parameter. If an unchanged 'enum dma_ctrl_flags' is a criterion 
of acceptance, please let me know - I'll re-implement this exactly as 
Dan suggested.


 Fearing to look like a spammer, I post only those patches which have 
been affected by the changes intended to address Dan's comments. These 
are the following five:

0002-async_tx-add-support-for-asynchronous-GF-multiplica.patch
0003-async_tx-add-support-for-asynchronous-RAID6-recover.patch
0004-md-run-RAID-6-stripe-operations-outside-the-lock.patch
0008-md-asynchronous-handle_parity_check6.patch
0011-ppc440spe-adma-ADMA-driver-for-the-PPC440SP-e-syst.patch

 As regarding the other six patches of asynchronous RAID-6 support 
patchset:

0001-async_tx-don-t-use-src_list-argument-of-async_xor.patch
0005-md-common-schedule_reconstruction-for-raid5-6.patch
0006-md-change-handle_stripe_fill6-to-work-in-asynchrono.patch
0007-md-rewrite-handle_stripe_dirtying6-in-asynchronous.patch
0009-md-change-handle_stripe6-to-work-asynchronously.patch
0010-md-remove-unused-functions.patch

 they are the same as in my 09 Dec 2008 post:

 https://kerneltrap.org/mailarchive/linux-raid/2008/12/8/4367574

 and are waiting for your comments.

 Regards, Yuri


[PATCH 02/11][v3] async_tx: add support for asynchronous GF multiplication

2009-01-12 Thread Yuri Tikhonov
This adds support for doing asynchronous GF multiplication by adding
four additional functions to async_tx API:

 async_pq() does simultaneous XOR of sources and XOR of sources
  GF-multiplied by given coefficients.

 async_pq_zero_sum() checks if results of calculations match given
  ones.

 async_gen_syndrome() does simultaneous XOR and R/S syndrome computation of sources.

 async_syndrome_zerosum() checks if the results of the XOR/syndrome calculation
  match given ones.

The latter two functions just use async_pq() with the appropriate coefficients
in the asynchronous case, but have significant optimizations in the
synchronous case.

To support this API, a dmaengine driver should set the DMA_PQ and
DMA_PQ_ZERO_SUM capabilities and provide the device_prep_dma_pq and
device_prep_dma_pqzero_sum methods in the dma_device structure.

Signed-off-by: Yuri Tikhonov y...@emcraft.com
Signed-off-by: Ilya Yanok ya...@emcraft.com
---
 crypto/async_tx/Kconfig |4 +
 crypto/async_tx/Makefile|1 +
 crypto/async_tx/async_pq.c  |  615 +++
 crypto/async_tx/async_xor.c |2 +-
 include/linux/async_tx.h|   46 +++-
 include/linux/dmaengine.h   |   30 ++-
 6 files changed, 693 insertions(+), 5 deletions(-)
 create mode 100644 crypto/async_tx/async_pq.c

diff --git a/crypto/async_tx/Kconfig b/crypto/async_tx/Kconfig
index d8fb391..cb6d731 100644
--- a/crypto/async_tx/Kconfig
+++ b/crypto/async_tx/Kconfig
@@ -14,3 +14,7 @@ config ASYNC_MEMSET
tristate
select ASYNC_CORE
 
+config ASYNC_PQ
+   tristate
+   select ASYNC_CORE
+
diff --git a/crypto/async_tx/Makefile b/crypto/async_tx/Makefile
index 27baa7d..1b99265 100644
--- a/crypto/async_tx/Makefile
+++ b/crypto/async_tx/Makefile
@@ -2,3 +2,4 @@ obj-$(CONFIG_ASYNC_CORE) += async_tx.o
 obj-$(CONFIG_ASYNC_MEMCPY) += async_memcpy.o
 obj-$(CONFIG_ASYNC_MEMSET) += async_memset.o
 obj-$(CONFIG_ASYNC_XOR) += async_xor.o
+obj-$(CONFIG_ASYNC_PQ) += async_pq.o
diff --git a/crypto/async_tx/async_pq.c b/crypto/async_tx/async_pq.c
new file mode 100644
index 000..5871651
--- /dev/null
+++ b/crypto/async_tx/async_pq.c
@@ -0,0 +1,615 @@
+/*
+ * Copyright(c) 2007 Yuri Tikhonov y...@emcraft.com
+ *
+ * Developed for DENX Software Engineering GmbH
+ *
+ * Asynchronous GF-XOR calculations ASYNC_TX API.
+ *
+ * based on async_xor.c code written by:
+ * Dan Williams dan.j.willi...@intel.com
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ */
+#include <linux/kernel.h>
+#include <linux/interrupt.h>
+#include <linux/dma-mapping.h>
+#include <linux/raid/xor.h>
+#include <linux/async_tx.h>
+
+#include "../drivers/md/raid6.h"
+
+/**
+ *  The following static variables are used in cases of synchronous
+ * zero sum to save the values to check. Two pages used for zero sum and
+ * the third one is for dumb P destination when calling gen_syndrome()
+ */
+static spinlock_t spare_lock;
+static struct page *spare_pages[3];
+
+/**
+ * do_async_pq - asynchronously calculate P and/or Q
+ */
+static struct dma_async_tx_descriptor *
+do_async_pq(struct dma_chan *chan, struct page **blocks, unsigned char *scfs,
+   unsigned int offset, int src_cnt, size_t len, enum async_tx_flags flags,
+   struct dma_async_tx_descriptor *depend_tx,
+   dma_async_tx_callback cb_fn, void *cb_param)
+{
+   struct dma_device *dma = chan->device;
+   dma_addr_t dma_dest[2], dma_src[src_cnt];
+   struct dma_async_tx_descriptor *tx = NULL;
+   dma_async_tx_callback _cb_fn;
+   void *_cb_param;
+   unsigned char *scf = NULL;
+   int i, src_off = 0;
+   unsigned short pq_src_cnt;
+   enum async_tx_flags async_flags;
+   enum dma_ctrl_flags dma_flags = 0;
+
+   /*  If we won't handle src_cnt in one shot, then the following
+* flag(s) will be set only on the first pass of prep_dma
+*/
+   if (flags & ASYNC_TX_PQ_ZERO_P)
+   dma_flags |= DMA_PREP_ZERO_P;
+   if (flags & ASYNC_TX_PQ_ZERO_Q)
+   dma_flags |= DMA_PREP_ZERO_Q;
+
+   /* DMAs use destinations as sources, so use BIDIRECTIONAL mapping */
+   if (blocks[src_cnt]) {
+   dma_dest[0] = dma_map_page(dma->dev, blocks[src_cnt

[PATCH 03/11][v3] async_tx: add support for asynchronous RAID6 recovery operations

2009-01-12 Thread Yuri Tikhonov
This patch extends the async_tx API with two operations for recovery
on a RAID-6 array with two failed disks, using the new async_pq()
operation. The patch introduces the following functions:

 async_r6_dd_recov() recovers after double data disk failure

 async_r6_dp_recov() recovers after D+P failure

Signed-off-by: Yuri Tikhonov y...@emcraft.com
Signed-off-by: Ilya Yanok ya...@emcraft.com
---
 crypto/async_tx/Kconfig |5 +
 crypto/async_tx/Makefile|1 +
 crypto/async_tx/async_r6recov.c |  286 +++
 include/linux/async_tx.h|   11 ++
 4 files changed, 303 insertions(+), 0 deletions(-)
 create mode 100644 crypto/async_tx/async_r6recov.c

diff --git a/crypto/async_tx/Kconfig b/crypto/async_tx/Kconfig
index cb6d731..0b56224 100644
--- a/crypto/async_tx/Kconfig
+++ b/crypto/async_tx/Kconfig
@@ -18,3 +18,8 @@ config ASYNC_PQ
tristate
select ASYNC_CORE
 
+config ASYNC_R6RECOV
+   tristate
+   select ASYNC_CORE
+   select ASYNC_PQ
+
diff --git a/crypto/async_tx/Makefile b/crypto/async_tx/Makefile
index 1b99265..0ed8f13 100644
--- a/crypto/async_tx/Makefile
+++ b/crypto/async_tx/Makefile
@@ -3,3 +3,4 @@ obj-$(CONFIG_ASYNC_MEMCPY) += async_memcpy.o
 obj-$(CONFIG_ASYNC_MEMSET) += async_memset.o
 obj-$(CONFIG_ASYNC_XOR) += async_xor.o
 obj-$(CONFIG_ASYNC_PQ) += async_pq.o
+obj-$(CONFIG_ASYNC_R6RECOV) += async_r6recov.o
diff --git a/crypto/async_tx/async_r6recov.c b/crypto/async_tx/async_r6recov.c
new file mode 100644
index 000..8642c14
--- /dev/null
+++ b/crypto/async_tx/async_r6recov.c
@@ -0,0 +1,286 @@
+/*
+ * Copyright(c) 2007 Yuri Tikhonov y...@emcraft.com
+ *
+ * Developed for DENX Software Engineering GmbH
+ *
+ * Asynchronous RAID-6 recovery calculations ASYNC_TX API.
+ *
+ * based on async_xor.c code written by:
+ * Dan Williams dan.j.willi...@intel.com
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 51
+ * Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/interrupt.h>
+#include <linux/dma-mapping.h>
+#include <linux/raid/xor.h>
+#include <linux/async_tx.h>
+
+#include "../drivers/md/raid6.h"
+
+/**
+ * async_r6_dd_recov - attempt to calculate two data misses using dma engines.
+ * @disks: number of disks in the RAID-6 array
+ * @bytes: size of strip
+ * @faila: first failed drive index
+ * @failb: second failed drive index
+ * @ptrs: array of pointers to strips (last two must be p and q, respectively)
+ * @flags: ASYNC_TX_ACK, ASYNC_TX_DEP_ACK
+ * @depend_tx: depends on the result of this transaction.
+ * @cb: function to call when the operation completes
+ * @cb_param: parameter to pass to the callback routine
+ */
+struct dma_async_tx_descriptor *
+async_r6_dd_recov(int disks, size_t bytes, int faila, int failb,
+   struct page **ptrs, enum async_tx_flags flags,
+   struct dma_async_tx_descriptor *depend_tx,
+   dma_async_tx_callback cb, void *cb_param)
+{
+   struct dma_async_tx_descriptor *tx = NULL;
+   struct page *lptrs[disks];
+   unsigned char lcoef[disks-4];
+   int i = 0, k = 0, fc = -1;
+   uint8_t bc[2];
+   dma_async_tx_callback lcb = NULL;
+   void *lcb_param = NULL;
+
+   /* Assume that failb > faila */
+   if (faila > failb) {
+   fc = faila;
+   faila = failb;
+   failb = fc;
+   }
+
+   /* Try to compute missed data asynchronously. */
+   if (disks == 4) {
+   /*
+* Pxy and Qxy are zero in this case so we already have
+* P+Pxy and Q+Qxy in P and Q strips respectively.
+*/
+   tx = depend_tx;
+   lcb = cb;
+   lcb_param = cb_param;
+   goto do_mult;
+   }
+
+   /*
+* (1) Calculate Qxy and Pxy:
+* Qxy = A(0)*D(0) + ... + A(n-1)*D(n-1) + A(n+1)*D(n+1) + ... +
+*   A(m-1)*D(m-1) + A(m+1)*D(m+1) + ... + A(disks-1)*D(disks-1),
+* where n = faila, m = failb.
+*/
+   for (i = 0, k = 0; i < disks - 2; i++) {
+   if (i != faila && i != failb) {
+   lptrs[k] = ptrs[i];
+   lcoef[k] = raid6_gfexp[i];
+   k++;
+   }
+   }
+
+   lptrs[k] = ptrs[faila];
+   lptrs[k+1] = ptrs[failb

[PATCH 04/11][v3] md: run RAID-6 stripe operations outside the lock

2009-01-12 Thread Yuri Tikhonov
 The raid_run_ops routine uses the asynchronous offload API and
the stripe_operations member of a stripe_head to carry out xor+pqxor+copy
operations asynchronously, outside the lock.

 The operations performed by RAID-6 are the same as in the RAID-5 case,
except that STRIPE_OP_PREXOR operations are not supported. All the others
are supported:
STRIPE_OP_BIOFILL
 - copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
 - generate missing blocks (1 or 2) in the cache from the other blocks
STRIPE_OP_BIODRAIN
 - copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
 - recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
 - verify that the parity is correct

 The flow is the same as in the RAID-5 case.

Signed-off-by: Yuri Tikhonov y...@emcraft.com
Signed-off-by: Ilya Yanok ya...@emcraft.com
---
 drivers/md/Kconfig |2 +
 drivers/md/raid5.c |  291 +++
 include/linux/raid/raid5.h |4 +-
 3 files changed, 269 insertions(+), 28 deletions(-)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 2281b50..6c9964f 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -123,6 +123,8 @@ config MD_RAID456
depends on BLK_DEV_MD
select ASYNC_MEMCPY
select ASYNC_XOR
+   select ASYNC_PQ
+   select ASYNC_R6RECOV
---help---
  A RAID-5 set of N drives with a capacity of C MB per drive provides
  the capacity of C * (N - 1) MB, and protects against a failure
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a5ba080..8110f31 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -584,18 +584,26 @@ static void ops_run_biofill(struct stripe_head *sh)
ops_complete_biofill, sh);
 }
 
-static void ops_complete_compute5(void *stripe_head_ref)
+static void ops_complete_compute(void *stripe_head_ref)
 {
struct stripe_head *sh = stripe_head_ref;
-   int target = sh->ops.target;
-   struct r5dev *tgt = &sh->dev[target];
+   int target, i;
+   struct r5dev *tgt;
 
pr_debug("%s: stripe %llu\n", __func__,
(unsigned long long)sh->sector);
 
-   set_bit(R5_UPTODATE, &tgt->flags);
-   BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags));
-   clear_bit(R5_Wantcompute, &tgt->flags);
+   /* mark the computed target(s) as uptodate */
+   for (i = 0; i < 2; i++) {
+   target = (!i) ? sh->ops.target : sh->ops.target2;
+   if (target < 0)
+   continue;
+   tgt = &sh->dev[target];
+   set_bit(R5_UPTODATE, &tgt->flags);
+   BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags));
+   clear_bit(R5_Wantcompute, &tgt->flags);
+   }
+
clear_bit(STRIPE_COMPUTE_RUN, &sh->state);
if (sh->check_state == check_state_compute_run)
sh->check_state = check_state_compute_result;
@@ -627,15 +635,155 @@ static struct dma_async_tx_descriptor 
*ops_run_compute5(struct stripe_head *sh)
 
if (unlikely(count == 1))
tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE,
-   0, NULL, ops_complete_compute5, sh);
+   0, NULL, ops_complete_compute, sh);
else
tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE,
ASYNC_TX_XOR_ZERO_DST, NULL,
-   ops_complete_compute5, sh);
+   ops_complete_compute, sh);
+
+   return tx;
+}
+
+static struct dma_async_tx_descriptor *
+ops_run_compute6_1(struct stripe_head *sh)
+{
+   /* kernel stack size limits the total number of disks */
+   int disks = sh->disks;
+   struct page *srcs[disks];
+   int target = sh->ops.target < 0 ? sh->ops.target2 : sh->ops.target;
+   struct r5dev *tgt = &sh->dev[target];
+   struct page *dest = sh->dev[target].page;
+   int count = 0;
+   int pd_idx = sh->pd_idx, qd_idx = raid6_next_disk(pd_idx, disks);
+   int d0_idx = raid6_next_disk(qd_idx, disks);
+   struct dma_async_tx_descriptor *tx;
+   int i;
+
+   pr_debug("%s: stripe %llu block: %d\n",
+   __func__, (unsigned long long)sh->sector, target);
+   BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags));
+
+   atomic_inc(&sh->count);
+
+   if (target == qd_idx) {
+   /* We are actually computing the Q drive*/
+   i = d0_idx;
+   do {
+   srcs[count++] = sh->dev[i].page;
+   i = raid6_next_disk(i, disks);
+   } while (i != pd_idx);
+   srcs[count] = NULL;
+   srcs[count+1] = dest;
+   tx = async_gen_syndrome(srcs, 0, count, STRIPE_SIZE,
+   0, NULL, ops_complete_compute, sh);
+   } else {
+   /* Compute any data- or p-drive using XOR */
+   for (i = disks; i-- ; ) {
+   if (i != target && i

[PATCH 08/11][v2] md: asynchronous handle_parity_check6

2009-01-12 Thread Yuri Tikhonov
This patch introduces the state machine for handling RAID-6 parity
check and repair functionality.

Signed-off-by: Yuri Tikhonov y...@emcraft.com
Signed-off-by: Ilya Yanok ya...@emcraft.com
---
 drivers/md/raid5.c |  164 +++-
 1 files changed, 111 insertions(+), 53 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 2e26e84..b8e37c8 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2620,91 +2620,149 @@ static void handle_parity_checks6(raid5_conf_t *conf, 
struct stripe_head *sh,
struct r6_state *r6s, struct page *tmp_page,
int disks)
 {
-   int update_p = 0, update_q = 0;
-   struct r5dev *dev;
+   int i;
+   struct r5dev *devs[2] = {NULL, NULL};
int pd_idx = sh->pd_idx;
int qd_idx = r6s->qd_idx;
 
set_bit(STRIPE_HANDLE, &sh->state);
 
BUG_ON(s->failed > 2);
-   BUG_ON(s->uptodate < disks);
+
/* Want to check and possibly repair P and Q.
 * However there could be one 'failed' device, in which
 * case we can only check one of them, possibly using the
 * other to generate missing data
 */
 
-   /* If !tmp_page, we cannot do the calculations,
-* but as we have set STRIPE_HANDLE, we will soon be called
-* by stripe_handle with a tmp_page - just wait until then.
-*/
-   if (tmp_page) {
+   switch (sh->check_state) {
+   case check_state_idle:
+   /* start a new check operation if there are < 2 failures */
if (s->failed == r6s->q_failed) {
/* The only possible failed device holds 'Q', so it
* makes sense to check P (If anything else were failed,
* we would have used P to recreate it).
*/
-   compute_block_1(sh, pd_idx, 1);
-   if (!page_is_zero(sh->dev[pd_idx].page)) {
-   compute_block_1(sh, pd_idx, 0);
-   update_p = 1;
-   }
+   sh->check_state = check_state_run;
+   set_bit(STRIPE_OP_CHECK_PP, &s->ops_request);
+   clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
+   s->uptodate--;
}
if (!r6s->q_failed && s->failed < 2) {
/* q is not failed, and we didn't use it to generate
* anything, so it makes sense to check it
*/
-   memcpy(page_address(tmp_page),
-  page_address(sh->dev[qd_idx].page),
-  STRIPE_SIZE);
-   compute_parity6(sh, UPDATE_PARITY);
-   if (memcmp(page_address(tmp_page),
-  page_address(sh->dev[qd_idx].page),
-  STRIPE_SIZE) != 0) {
-   clear_bit(STRIPE_INSYNC, &sh->state);
-   update_q = 1;
-   }
+   sh->check_state = check_state_run;
+   set_bit(STRIPE_OP_CHECK_QP, &s->ops_request);
+   clear_bit(R5_UPTODATE, &sh->dev[qd_idx].flags);
+   s->uptodate--;
}
-   if (update_p || update_q) {
-   conf->mddev->resync_mismatches += STRIPE_SECTORS;
-   if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
-   /* don't try to repair!! */
-   update_p = update_q = 0;
+   if (sh->check_state == check_state_run) {
+   break;
}
 
-   /* now write out any block on a failed drive,
-* or P or Q if they need it
-*/
+   /* we have 2-disk failure */
+   BUG_ON(s->failed != 2);
+   devs[0] = &sh->dev[r6s->failed_num[0]];
+   devs[1] = &sh->dev[r6s->failed_num[1]];
+   /* fall through */
+   case check_state_compute_result:
+   sh->check_state = check_state_idle;
 
-   if (s->failed == 2) {
-   dev = &sh->dev[r6s->failed_num[1]];
-   s->locked++;
-   set_bit(R5_LOCKED, &dev->flags);
-   set_bit(R5_Wantwrite, &dev->flags);
+   BUG_ON((devs[0] && !devs[1]) ||
+  (!devs[0] && devs[1]));
+
+   BUG_ON(s->uptodate < (disks - 1));
+
+   if (!devs[0]) {
+   if (s->failed >= 1)
+   devs[0] = &sh->dev[r6s->failed_num[0]];
+   else
+   devs[0] = &sh->dev[pd_idx];
}
-   if (s->failed >= 1

[PATCH][v3] powerpc 44x: support for 256KB PAGE_SIZE

2009-01-11 Thread Yuri Tikhonov

This patch adds support for 256KB pages on ppc44x-based boards.

To simplify the implementation with 256KB pages we still assume
2-level paging. As a side effect this wastes extra memory space
reserved for PTE tables: only 1/4 of the pages allocated for PTEs are
actually used. But this may be an acceptable trade-off to achieve the
high performance we have with big PAGE_SIZEs in some applications (e.g.
RAID).

Also with 256KB PAGE_SIZE we increase THREAD_SIZE up to 32KB to minimize
the risk of stack overflows in the cases of on-stack arrays, which size
depends on the page size (e.g. multipage BIOs, NTFS, etc.).

With 256KB PAGE_SIZE we need to decrease PKMAP_ORDER at least down
to 9, otherwise all high memory (2 ^ 10 * PAGE_SIZE == 256MB) would be
occupied by PKMAP addresses, leaving no room for vmalloc. We do not
separate PKMAP_ORDER for 256K from the 16K/64K PAGE_SIZE cases here;
the value of 10 in the 16K/64K support had been selected rather
intuitively anyway. Thus now for all cases of PAGE_SIZE on ppc44x
(including the default, 4KB, one) we have 512 pages for PKMAP.

Because the ELF standard supports page sizes only up to 64K, you should
use binutils newer than 2.17.50.0.3 with '-zmax-page-size' set to 256K
for building applications which are to be run with the 256KB-page sized
kernel. If using older binutils, you should patch them as follows:

--- binutils/bfd/elf32-ppc.c.orig
+++ binutils/bfd/elf32-ppc.c

-#define ELF_MAXPAGESIZE	0x10000
+#define ELF_MAXPAGESIZE	0x40000

Signed-off-by: Yuri Tikhonov y...@emcraft.com
Signed-off-by: Ilya Yanok ya...@emcraft.com
---
 arch/powerpc/Kconfig   |   15 +++
 arch/powerpc/include/asm/highmem.h |   10 +-
 arch/powerpc/include/asm/mmu-44x.h |2 ++
 arch/powerpc/include/asm/page.h|6 --
 arch/powerpc/include/asm/page_32.h |4 
 arch/powerpc/include/asm/thread_info.h |4 +++-
 arch/powerpc/kernel/head_booke.h   |   11 ++-
 arch/powerpc/platforms/44x/Kconfig |   12 
 8 files changed, 55 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 84b8613..ceb402c 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -443,6 +443,19 @@ config PPC_64K_PAGES
bool "64k page size" if 44x || PPC_STD_MMU_64
select PPC_HAS_HASH_64K if PPC_STD_MMU_64
 
+config PPC_256K_PAGES
+   bool "256k page size" if 44x
+   depends on !STDBINUTILS
+   help
+ Make the page size 256k.
+
+ As the ELF standard only requires alignment to support page
+ sizes up to 64k, you will need to compile all of your user
+ space applications with a non-standard binutils settings
+ (see the STDBINUTILS description for details).
+
+ Say N unless you know what you are doing.
+
 endchoice
 
 config FORCE_MAX_ZONEORDER
@@ -455,6 +468,8 @@ config FORCE_MAX_ZONEORDER
default 9 if PPC_STD_MMU_32 && PPC_16K_PAGES
range 7 64 if PPC_STD_MMU_32 && PPC_64K_PAGES
default 7 if PPC_STD_MMU_32 && PPC_64K_PAGES
+   range 5 64 if PPC_STD_MMU_32 && PPC_256K_PAGES
+   default 5 if PPC_STD_MMU_32 && PPC_256K_PAGES
range 11 64
default 11
help
diff --git a/arch/powerpc/include/asm/highmem.h 
b/arch/powerpc/include/asm/highmem.h
index 04e4a62..a290759 100644
--- a/arch/powerpc/include/asm/highmem.h
+++ b/arch/powerpc/include/asm/highmem.h
@@ -39,15 +39,15 @@ extern pte_t *pkmap_page_table;
  * chunk of RAM.
  */
 /*
- * We use one full pte table with 4K pages. And with 16K/64K pages pte
- * table covers enough memory (32MB and 512MB resp.) that both FIXMAP
- * and PKMAP can be placed in single pte table. We use 1024 pages for
- * PKMAP in case of 16K/64K pages.
+ * We use one full pte table with 4K pages. And with 16K/64K/256K pages pte
+ * table covers enough memory (32MB/512MB/2GB resp.), so that both FIXMAP
+ * and PKMAP can be placed in a single pte table. We use 512 pages for PKMAP
+ * in case of 16K/64K/256K page sizes.
  */
 #ifdef CONFIG_PPC_4K_PAGES
 #define PKMAP_ORDER	PTE_SHIFT
 #else
-#define PKMAP_ORDER	10
+#define PKMAP_ORDER	9
 #endif
 #define LAST_PKMAP (1 << PKMAP_ORDER)
 #ifndef CONFIG_PPC_4K_PAGES
diff --git a/arch/powerpc/include/asm/mmu-44x.h 
b/arch/powerpc/include/asm/mmu-44x.h
index 27cc6fd..3c86576 100644
--- a/arch/powerpc/include/asm/mmu-44x.h
+++ b/arch/powerpc/include/asm/mmu-44x.h
@@ -83,6 +83,8 @@ typedef struct {
 #define PPC44x_TLBE_SIZE   PPC44x_TLB_16K
 #elif (PAGE_SHIFT == 16)
 #define PPC44x_TLBE_SIZE   PPC44x_TLB_64K
+#elif (PAGE_SHIFT == 18)
+#define PPC44x_TLBE_SIZE   PPC44x_TLB_256K
 #else
 #error Unsupported PAGE_SIZE
 #endif
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index 197d569..32cbf16 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -19,12 +19,14 @@
 #include <asm/kdump.h>

Re[2]: [PATCH] powerpc 44x: support for 256KB PAGE_SIZE

2008-12-21 Thread Yuri Tikhonov

 Hello Milton,

 Thanks for reviewing. I'll re-post the updated patch shortly.

On Sunday, December 21, 2008 you wrote:


 On Dec 19, 2008, at 12:39 AM, Yuri Tikhonov wrote:

 This patch adds support for 256KB pages on ppc44x-based boards.


 Hi.  A couple of small comments.

 diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
 index cd8ff7c..348702c 100644
 --- a/arch/powerpc/Kconfig
 +++ b/arch/powerpc/Kconfig
 @@ -436,6 +436,14 @@ config PPC_64K_PAGES
  bool "64k page size" if 44x || PPC_STD_MMU_64
   select PPC_HAS_HASH_64K if PPC_STD_MMU_64

 +config PPC_256K_PAGES
 + bool "256k page size" if 44x
 + depends on !STDBINUTILS
 + help
 +   ELF standard supports only page sizes up to 64K so you need a 
 patched
 +   binutils in order to use 256K pages. Chose it only if you know what
 +   you are doing.
 +

 Make the page size 256k.   As the ELF standard only requires alignment
 to
 support page sizes up to 64k, you will need to compile all of your user
 space applications with a patched binutils.

 Say N unless you know what you are doing.

 (The previous text did not describe what the option actually did, and 
 did not emphasize that all of user space had to be compiled).


 diff --git a/arch/powerpc/include/asm/thread_info.h 
 b/arch/powerpc/include/asm/thread_info.h
 index 9665a26..3c8bbab 100644
 --- a/arch/powerpc/include/asm/thread_info.h
 +++ b/arch/powerpc/include/asm/thread_info.h
 @@ -15,8 +15,12 @@
  #ifdef CONFIG_PPC64
  #define THREAD_SHIFT 14
  #else
 +#ifdef CONFIG_PPC_256K_PAGES
 +#define THREAD_SHIFT 15
 +#else
  #define THREAD_SHIFT 13
  #endif
 +#endif

 Switching to #elif would remove the nested ifdef here.


  #define THREAD_SIZE  (1  THREAD_SHIFT)

 diff --git a/arch/powerpc/kernel/head_booke.h 
 b/arch/powerpc/kernel/head_booke.h
 index fce2df9..bc6a26c 100644
 --- a/arch/powerpc/kernel/head_booke.h
 +++ b/arch/powerpc/kernel/head_booke.h
 @@ -10,6 +10,14 @@
   mtspr   SPRN_IVOR##vector_number,r26;   \
   sync

 +#ifndef CONFIG_PPC_256K_PAGES

 This if should be on THREAD_SIZE or THREAD_SHIFT, not the option that 
 causes it to be large.

 +#define ALLOC_STACK_FRAME(reg, val)  addi reg,reg,val
 +#else
 +#define ALLOC_STACK_FRAME(reg, val)  \
 + addis   reg,reg,val@ha; \
 + addi    reg,reg,val@l
 +#endif
 +
  #define NORMAL_EXCEPTION_PROLOG 
   \
   mtspr   SPRN_SPRG0,r10; /* save two registers to work with */\
   mtspr   SPRN_SPRG1,r11;  \
 @@ -20,7 +28,7 @@
   beq 1f;  \
   mfspr   r1,SPRN_SPRG3;  /* if from user, start at top of   */\
   lwz r1,THREAD_INFO-THREAD(r1); /* this thread's kernel stack   */\
  - addi    r1,r1,THREAD_SIZE;   \
  + ALLOC_STACK_FRAME(r1, THREAD_SIZE);  \
   1:   subi    r1,r1,INT_FRAME_SIZE;   /* Allocate an exception frame */\
   mr  r11,r1;  \
   stw r10,_CCR(r11);  /* save various registers  */\
 diff --git a/init/Kconfig b/init/Kconfig
 index f763762..96229b9 100644
 --- a/init/Kconfig
 +++ b/init/Kconfig
 @@ -635,6 +635,16 @@ config ELF_CORE
   help
 Enable support for generating core dumps. Disabling saves about 4k.

 +config STDBINUTILS
 + bool "Using standard binutils" if EMBEDDED
 + depends on 44x
 + default y
 + help
 +   Turning this option off allows you to select 256KB PAGE_SIZE on 
 44x.
 +   Note, that kernel will be able to run only those applications,
 +   which had been compiled using the patched binutils (ELF standard
 +   supports only page sizes up to 64K).
 +

 The config variable 44x is pretty nondescript for a global config file.
   If you leave it here, I would suggest making it depend on PPC && 44x
 so that other people don't have to wonder what 44x means.  I didn't 
 look if there was a better place to put this option, somewhere 
 architecture specific might be better (eg after selecting the 44x cpu 
 type).

  config PCSPKR_PLATFORM
   bool "Enable PC-Speaker support" if EMBEDDED
   depends on ALPHA || X86 || MIPS || PPC_PREP || PPC_CHRP || 
 PPC_PSERIES


 thanks,
 milton




 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


[PATCH][v2] powerpc 44x: support for 256KB PAGE_SIZE

2008-12-21 Thread Yuri Tikhonov

This patch adds support for 256KB pages on ppc44x-based boards.

To simplify the implementation with 256KB pages we still assume
2-level paging. As a side effect this wastes extra memory space
reserved for PTE tables: only 1/4 of the pages allocated for PTEs are
actually used. But this may be an acceptable trade-off to achieve the
high performance we have with big PAGE_SIZEs in some applications (e.g.
RAID).

Also with 256KB PAGE_SIZE we increase THREAD_SIZE up to 32KB to minimize
the risk of stack overflows in the cases of on-stack arrays, which size
depends on the page size (e.g. multipage BIOs, NTFS, etc.).

With 256KB PAGE_SIZE we need to decrease PKMAP_ORDER at least down
to 9, otherwise all high memory (2 ^ 10 * PAGE_SIZE == 256MB) would be
occupied by PKMAP addresses, leaving no room for vmalloc. We do not
separate PKMAP_ORDER for 256K from the 16K/64K PAGE_SIZE cases here;
the value of 10 in the 16K/64K support had been selected rather
intuitively anyway. Thus now for all cases of PAGE_SIZE on ppc44x
(including the default, 4KB, one) we have 512 pages for PKMAP.

Because the ELF standard supports page sizes only up to 64K, you should
use a patched binutils for building applications to be run with the
256KB-page sized kernel. The patch for binutils is rather trivial, and
may look as follows:

--- binutils/bfd/elf32-ppc.c.orig
+++ binutils/bfd/elf32-ppc.c

-#define ELF_MAXPAGESIZE	0x10000
+#define ELF_MAXPAGESIZE	0x40000

Signed-off-by: Yuri Tikhonov y...@emcraft.com
Signed-off-by: Ilya Yanok ya...@emcraft.com
---
 arch/powerpc/Kconfig   |   14 ++
 arch/powerpc/include/asm/highmem.h |   10 +-
 arch/powerpc/include/asm/mmu-44x.h |2 ++
 arch/powerpc/include/asm/page.h|6 --
 arch/powerpc/include/asm/page_32.h |4 
 arch/powerpc/include/asm/thread_info.h |4 +++-
 arch/powerpc/kernel/head_booke.h   |   11 ++-
 arch/powerpc/platforms/44x/Kconfig |   10 ++
 8 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index cd8ff7c..3338e71 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -436,6 +436,18 @@ config PPC_64K_PAGES
bool "64k page size" if 44x || PPC_STD_MMU_64
select PPC_HAS_HASH_64K if PPC_STD_MMU_64
 
+config PPC_256K_PAGES
+   bool "256k page size" if 44x
+   depends on !STDBINUTILS
+   help
+ Make the page size 256k.
+
+ As the ELF standard only requires alignment to support page
+ sizes up to 64k, you will need to compile all of your user
+ space applications with a patched binutils.
+
+ Say N unless you know what you are doing.
+
 endchoice
 
 config FORCE_MAX_ZONEORDER
@@ -448,6 +460,8 @@ config FORCE_MAX_ZONEORDER
default 9 if PPC_STD_MMU_32 && PPC_16K_PAGES
range 7 64 if PPC_STD_MMU_32 && PPC_64K_PAGES
default 7 if PPC_STD_MMU_32 && PPC_64K_PAGES
+   range 5 64 if PPC_STD_MMU_32 && PPC_256K_PAGES
+   default 5 if PPC_STD_MMU_32 && PPC_256K_PAGES
range 11 64
default 11
help
diff --git a/arch/powerpc/include/asm/highmem.h 
b/arch/powerpc/include/asm/highmem.h
index 7d6bb37..0d1333f 100644
--- a/arch/powerpc/include/asm/highmem.h
+++ b/arch/powerpc/include/asm/highmem.h
@@ -39,15 +39,15 @@ extern pte_t *pkmap_page_table;
  * chunk of RAM.
  */
 /*
- * We use one full pte table with 4K pages. And with 16K/64K pages pte
- * table covers enough memory (32MB and 512MB resp.) that both FIXMAP
- * and PKMAP can be placed in single pte table. We use 1024 pages for
- * PKMAP in case of 16K/64K pages.
+ * We use one full pte table with 4K pages. And with 16K/64K/256K pages pte
+ * table covers enough memory (32MB/512MB/2GB resp.), so that both FIXMAP
+ * and PKMAP can be placed in a single pte table. We use 512 pages for PKMAP
+ * in case of 16K/64K/256K page sizes.
  */
 #ifdef CONFIG_PPC_4K_PAGES
 #define PKMAP_ORDER	PTE_SHIFT
 #else
-#define PKMAP_ORDER	10
+#define PKMAP_ORDER	9
 #endif
 #define LAST_PKMAP (1 << PKMAP_ORDER)
 #ifndef CONFIG_PPC_4K_PAGES
diff --git a/arch/powerpc/include/asm/mmu-44x.h 
b/arch/powerpc/include/asm/mmu-44x.h
index 73e1909..52a2339 100644
--- a/arch/powerpc/include/asm/mmu-44x.h
+++ b/arch/powerpc/include/asm/mmu-44x.h
@@ -81,6 +81,8 @@ typedef struct {
 #define PPC44x_TLBE_SIZE   PPC44x_TLB_16K
 #elif (PAGE_SHIFT == 16)
 #define PPC44x_TLBE_SIZE   PPC44x_TLB_64K
+#elif (PAGE_SHIFT == 18)
+#define PPC44x_TLBE_SIZE   PPC44x_TLB_256K
 #else
 #error Unsupported PAGE_SIZE
 #endif
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index 197d569..32cbf16 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -19,12 +19,14 @@
 #include <asm/kdump.h>
 
 /*
- * On regular PPC32 page size is 4K (but we support 4K/16K/64K pages
+ * On regular PPC32 page size is 4K (but we support 4K/16K/64K

Re[2]: [PATCH][v2] powerpc 44x: support for 256KB PAGE_SIZE

2008-12-21 Thread Yuri Tikhonov

Hello Andreas,

On Sunday, December 21, 2008 you wrote:

 Yuri Tikhonov y...@emcraft.com writes:

 Because ELF standard supports only page sizes up to 64K, then you should
 use patched binutils for building applications to be run with the 256KB-
 page sized kernel. The patch for binutils is rather trivial, and may
 look as follows:

 --- binutils/bfd/elf32-ppc.c.orig
 +++ binutils/bfd/elf32-ppc.c

 -#define ELF_MAXPAGESIZE0x1
 +#define ELF_MAXPAGESIZE0x4

 You don't have to patch it, it's enough to pass -zmax-page-size=0x40000
 to the linker.

 Thanks for pointing this. I guess, the -zmax-page-size option is new 
to binutils 2.17.50.0.10. Right?

 I'll remove the STDBINUTILS config option from this patch then, and 
correspondingly update the description.

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re[2]: [PATCH][v2] powerpc 44x: support for 256KB PAGE_SIZE

2008-12-21 Thread Yuri Tikhonov

On Sunday, December 21, 2008 you wrote:

 Yuri Tikhonov y...@emcraft.com writes:

  Thanks for pointing this. I guess, the -zmax-page-size option is new 
 to binutils 2.17.50.0.10. Right?

 It was added 2½ years ago.

 Yes, approximately: http://gcc.gnu.org/ml/gcc/2006-07/msg00361.html


Date: Sat, 15 Jul 2006 15:19:05 -0700
Subject: The Linux binutils 2.17.50.0.3 is released

Changes from binutils 2.17.50.0.2:
...
16. Add -z max-page-size= and -z common-page-size= to ELF linker.


  I'll remove the STDBINUTILS config option from this patch then, and 
 correspondingly update the description.

 I thought about this some more, and I guess we should keep the
dependency on the STDBINUTILS option newly introduced by this patch.
Perhaps we could change its name to, say, STD_ELF_LINKING.

 Because, in any case, we either have to patch binutils or pass some
non-default value to the linker: the resulting application doesn't match
the ELF standard, and applications which do match the standard simply
won't work with a 256K-paged kernel - so a user has to be aware of this
before turning 256K pages on.

 What do you think about this?

 Regards, Yuri


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re[2]: [PATCH][v2] fork_init: fix division by zero

2008-12-18 Thread Yuri Tikhonov

Hello Andrew,

On Friday, December 19, 2008 you wrote:
[snip]
  There is one more warning from the common code when I use 256KB pages:
 
CC  mm/shmem.o
 mm/shmem.c: In function 'shmem_truncate_range':
 mm/shmem.c:613: warning: division by zero
 mm/shmem.c:619: warning: division by zero
 mm/shmem.c:644: warning: division by zero
 mm/shmem.c: In function 'shmem_unuse_inode':
 mm/shmem.c:873: warning: division by zero
 
   The problem here is that ENTRIES_PER_PAGEPAGE becomes 0x100000000
 when PAGE_SIZE is 256K.
 
  How about the following fix ?

[snip]

 Looks sane.

 Thanks for reviewing.

 But to apply this I'd prefer a changelog, a signoff and a grunt from Hugh.

 Sure, I'll post this in a separate thread then, keeping Hugh in CC.

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


[PATCH] mm/shmem.c: fix division by zero

2008-12-18 Thread Yuri Tikhonov

The following patch fixes a division by zero which occurs in
shmem_truncate_range() and shmem_unuse_inode() when big
PAGE_SIZE values (e.g. 256KB on ppc44x) are used.

With 256KB PAGE_SIZE the ENTRIES_PER_PAGEPAGE constant becomes
too large (0x100000000), so this patch just changes the affected types
from 'ulong' to 'ullong' where necessary.

Signed-off-by: Yuri Tikhonov y...@emcraft.com
---
 mm/shmem.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 0ed0752..99d7c91 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -57,7 +57,7 @@
 #include <asm/pgtable.h>
 
 #define ENTRIES_PER_PAGE (PAGE_CACHE_SIZE/sizeof(unsigned long))
-#define ENTRIES_PER_PAGEPAGE (ENTRIES_PER_PAGE*ENTRIES_PER_PAGE)
+#define ENTRIES_PER_PAGEPAGE ((unsigned long long)ENTRIES_PER_PAGE*ENTRIES_PER_PAGE)
 #define BLOCKS_PER_PAGE  (PAGE_CACHE_SIZE/512)
 
 #define SHMEM_MAX_INDEX  (SHMEM_NR_DIRECT + (ENTRIES_PER_PAGEPAGE/2) * (ENTRIES_PER_PAGE+1))
@@ -95,7 +95,7 @@ static unsigned long shmem_default_max_inodes(void)
 }
 #endif
 
-static int shmem_getpage(struct inode *inode, unsigned long idx,
+static int shmem_getpage(struct inode *inode, unsigned long long idx,
 struct page **pagep, enum sgp_type sgp, int *type);
 
 static inline struct page *shmem_dir_alloc(gfp_t gfp_mask)
@@ -533,7 +533,7 @@ static void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end)
int punch_hole;
spinlock_t *needs_lock;
spinlock_t *punch_lock;
-   unsigned long upper_limit;
+   unsigned long long upper_limit;
 
	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
	idx = (start + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
@@ -1175,7 +1175,7 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
  * vm. If we swap it in we mark it dirty since we also free the swap
  * entry since a page cannot live in both the swap and page cache
  */
-static int shmem_getpage(struct inode *inode, unsigned long idx,
+static int shmem_getpage(struct inode *inode, unsigned long long idx,
struct page **pagep, enum sgp_type sgp, int *type)
 {
	struct address_space *mapping = inode->i_mapping;
-- 
1.6.0.4
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


[PATCH] powerpc 44x: support for 256KB PAGE_SIZE

2008-12-18 Thread Yuri Tikhonov
This patch adds support for 256KB pages on ppc44x-based boards.

To simplify the implementation with 256KB pages we still assume
2-level paging. As a side effect this leads to wasting extra memory space
reserved for PTE tables: only 1/4 of the pages allocated for PTEs are
actually used. But this may be an acceptable trade-off for the
high performance we get with big PAGE_SIZEs in some applications (e.g.
RAID).

Also, with 256KB PAGE_SIZE we increase THREAD_SIZE up to 32KB to minimize
the risk of stack overflows caused by on-stack arrays whose size
depends on the page size (e.g. multipage BIOs, NTFS, etc.).

With 256KB PAGE_SIZE we need to decrease PKMAP_ORDER at least down
to 9; otherwise all of the high virtual memory (2^10 * PAGE_SIZE == 256MB)
would be occupied by PKMAP addresses, leaving no place for vmalloc. We do
not use a separate PKMAP_ORDER for 256K vs. 16K/64K PAGE_SIZE here; the
value of 10 in the 16K/64K support had been selected rather intuitively
anyway. Thus now for all cases of PAGE_SIZE on ppc44x (including the default,
4KB, one) we have 512 pages for PKMAP.

Because the ELF standard supports only page sizes up to 64K, you need a
patched binutils to build applications to be run with a 256KB-page
kernel. The patch for binutils is rather trivial, and may look as
follows:

--- binutils/bfd/elf32-ppc.c.orig
+++ binutils/bfd/elf32-ppc.c

-#define ELF_MAXPAGESIZE	0x10000
+#define ELF_MAXPAGESIZE	0x40000

Signed-off-by: Yuri Tikhonov y...@emcraft.com
Signed-off-by: Ilya Yanok ya...@emcraft.com
---
 arch/powerpc/Kconfig   |   10 ++
 arch/powerpc/include/asm/highmem.h |   10 +-
 arch/powerpc/include/asm/mmu-44x.h |2 ++
 arch/powerpc/include/asm/page.h|6 --
 arch/powerpc/include/asm/page_32.h |4 
 arch/powerpc/include/asm/thread_info.h |4 
 arch/powerpc/kernel/head_booke.h   |   10 +-
 init/Kconfig   |   10 ++
 8 files changed, 48 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index cd8ff7c..348702c 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -436,6 +436,14 @@ config PPC_64K_PAGES
	bool "64k page size" if 44x || PPC_STD_MMU_64
select PPC_HAS_HASH_64K if PPC_STD_MMU_64
 
+config PPC_256K_PAGES
+	bool "256k page size" if 44x
+	depends on !STDBINUTILS
+	help
+	  The ELF standard supports only page sizes up to 64K, so you need a
+	  patched binutils in order to use 256K pages. Choose this only if you
+	  know what you are doing.
+
 endchoice
 
 config FORCE_MAX_ZONEORDER
@@ -448,6 +456,8 @@ config FORCE_MAX_ZONEORDER
	default 9 if PPC_STD_MMU_32 && PPC_16K_PAGES
	range 7 64 if PPC_STD_MMU_32 && PPC_64K_PAGES
	default 7 if PPC_STD_MMU_32 && PPC_64K_PAGES
+	range 5 64 if PPC_STD_MMU_32 && PPC_256K_PAGES
+	default 5 if PPC_STD_MMU_32 && PPC_256K_PAGES
range 11 64
default 11
help
diff --git a/arch/powerpc/include/asm/highmem.h b/arch/powerpc/include/asm/highmem.h
index 7d6bb37..0d1333f 100644
--- a/arch/powerpc/include/asm/highmem.h
+++ b/arch/powerpc/include/asm/highmem.h
@@ -39,15 +39,15 @@ extern pte_t *pkmap_page_table;
  * chunk of RAM.
  */
 /*
- * We use one full pte table with 4K pages. And with 16K/64K pages pte
- * table covers enough memory (32MB and 512MB resp.) that both FIXMAP
- * and PKMAP can be placed in single pte table. We use 1024 pages for
- * PKMAP in case of 16K/64K pages.
+ * We use one full pte table with 4K pages. And with 16K/64K/256K pages pte
+ * table covers enough memory (32MB/512MB/2GB resp.), so that both FIXMAP
+ * and PKMAP can be placed in a single pte table. We use 512 pages for PKMAP
+ * in case of 16K/64K/256K page sizes.
  */
 #ifdef CONFIG_PPC_4K_PAGES
 #define PKMAP_ORDER	PTE_SHIFT
 #else
-#define PKMAP_ORDER	10
+#define PKMAP_ORDER	9
 #endif
 #define LAST_PKMAP	(1 << PKMAP_ORDER)
 #ifndef CONFIG_PPC_4K_PAGES
diff --git a/arch/powerpc/include/asm/mmu-44x.h b/arch/powerpc/include/asm/mmu-44x.h
index 73e1909..52a2339 100644
--- a/arch/powerpc/include/asm/mmu-44x.h
+++ b/arch/powerpc/include/asm/mmu-44x.h
@@ -81,6 +81,8 @@ typedef struct {
 #define PPC44x_TLBE_SIZE   PPC44x_TLB_16K
 #elif (PAGE_SHIFT == 16)
 #define PPC44x_TLBE_SIZE   PPC44x_TLB_64K
+#elif (PAGE_SHIFT == 18)
+#define PPC44x_TLBE_SIZE   PPC44x_TLB_256K
 #else
 #error Unsupported PAGE_SIZE
 #endif
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index 197d569..32cbf16 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -19,12 +19,14 @@
 #include <asm/kdump.h>
 
 /*
- * On regular PPC32 page size is 4K (but we support 4K/16K/64K pages
+ * On regular PPC32 page size is 4K (but we support 4K/16K/64K/256K pages
  * on PPC44x). For PPC64 we support either 4K or 64K software
  * page size. When using 64K

Re[2]: [PATCH 02/11][v2] async_tx: add support for asynchronous GF multiplication

2008-12-18 Thread Yuri Tikhonov
, increase the stack usage and, in general, the time of function calls
by adding new parameters to ADMA methods?

   In this case can we set up a dependency chain with
 async_memset()?

 Well, we can. But wouldn't this be an overhead? For example, the
ppc440spe DMA engine supports a so-called RXOR mode which overwrites the
destinations, so it doesn't need them pre-zeroed. Thus we can do
ZERO_DST(s)+PQ in one shot on one DMA engine. Again, I'm not sure that
keeping dma_ctrl_flags unchanged is worth creating such a dependency;
it'll obviously lead both to degraded performance and increased
CPU utilization.


  /**
 @@ -299,6 +301,7 @@ struct dma_async_tx_descriptor {
  * @global_node: list_head for global dma_device_list
  * @cap_mask: one or more dma_capability flags
  * @max_xor: maximum number of xor sources, 0 if no capability
 + * @max_pq: maximum number of PQ sources, 0 if no capability
  * @refcount: reference count
  * @done: IO completion struct
  * @dev_id: unique device ID
 @@ -308,7 +311,9 @@ struct dma_async_tx_descriptor {
  * @device_free_chan_resources: release DMA channel's resources
  * @device_prep_dma_memcpy: prepares a memcpy operation
  * @device_prep_dma_xor: prepares a xor operation
 + * @device_prep_dma_pq: prepares a pq operation
  * @device_prep_dma_zero_sum: prepares a zero_sum operation
 + * @device_prep_dma_pqzero_sum: prepares a pqzero_sum operation
  * @device_prep_dma_memset: prepares a memset operation
  * @device_prep_dma_interrupt: prepares an end of chain interrupt operation
  * @device_prep_slave_sg: prepares a slave dma operation
 @@ -322,6 +327,7 @@ struct dma_device {
struct list_head global_node;
dma_cap_mask_t  cap_mask;
int max_xor;
 +   int max_pq;


 max_xor and max_pq can be changed to unsigned shorts to keep the size
 of the struct the same.

 Right.

struct kref refcount;
struct completion done;
 @@ -339,9 +345,17 @@ struct dma_device {
struct dma_async_tx_descriptor *(*device_prep_dma_xor)(
struct dma_chan *chan, dma_addr_t dest, dma_addr_t *src,
unsigned int src_cnt, size_t len, unsigned long flags);
 +   struct dma_async_tx_descriptor *(*device_prep_dma_pq)(
 +   struct dma_chan *chan, dma_addr_t *dst, dma_addr_t *src,
 +   unsigned int src_cnt, unsigned char *scf,
 +   size_t len, unsigned long flags);
struct dma_async_tx_descriptor *(*device_prep_dma_zero_sum)(
struct dma_chan *chan, dma_addr_t *src, unsigned int src_cnt,
size_t len, u32 *result, unsigned long flags);
 +   struct dma_async_tx_descriptor *(*device_prep_dma_pqzero_sum)(
 +   struct dma_chan *chan, dma_addr_t *src, unsigned int src_cnt,
 +   unsigned char *scf,
 +   size_t len, u32 *presult, u32 *qresult, unsigned long flags);

 I would rather we turn the 'result' parameter into a pointer to flags
 where bit 0 is the xor/p result and bit 1 is the q result.

 Yes, this'll be better.


 Thanks for reviewing. I'll regenerate the ASYNC_TX patch (in the parts
where I fully agreed with you) and then repost. Any comments
regarding the RAID-6 part?

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: [PATCH][v2] fork_init: fix division by zero

2008-12-17 Thread Yuri Tikhonov
Hello Paul,

On Friday 12 December 2008 03:48, Paul Mackerras wrote:
 Andrew Morton writes:
 
  +#if (8 * THREAD_SIZE) >= PAGE_SIZE
 max_threads = mempages / (8 * THREAD_SIZE / PAGE_SIZE);
   +#else
   + max_threads = mempages * (PAGE_SIZE / (8 * THREAD_SIZE));
   +#endif
  
 The expression you've chosen here can be quite inaccurate, because
 (PAGE_SIZE / (8 * THREAD_SIZE)) is a small number.  The way to
 preserve accuracy is
 
 The assumption is that THREAD_SIZE is a power of 2, as is PAGE_SIZE.
 
 I think Yuri should be increasing THREAD_SIZE for the larger page
 sizes he's implementing, because we have on-stack arrays whose size
 depends on the page size.  I suspect that having THREAD_SIZE less than
 1/8 of PAGE_SIZE risks stack overflows, and the better fix is for Yuri
 to make sure THREAD_SIZE is at least 1/8 of PAGE_SIZE.  (In fact, more
 may be needed - someone should work out what fraction is actually
 needed.)

 Right, thanks for pointing this out. I guess I was just lucky not to
run into stack overflow problems. So I agree that we should increase
THREAD_SIZE in the case of 256KB pages up to 1/8 of PAGE_SIZE, that is,
up to 32KB.

 There is one more warning from the common code when I use 256KB pages:

   CC  mm/shmem.o
mm/shmem.c: In function 'shmem_truncate_range':
mm/shmem.c:613: warning: division by zero
mm/shmem.c:619: warning: division by zero
mm/shmem.c:644: warning: division by zero
mm/shmem.c: In function 'shmem_unuse_inode':
mm/shmem.c:873: warning: division by zero

 The problem here is that ENTRIES_PER_PAGEPAGE becomes 0x100000000
when PAGE_SIZE is 256K.

 How about the following fix ?

diff --git a/mm/shmem.c b/mm/shmem.c
index 0ed0752..99d7c91 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -57,7 +57,7 @@
 #include asm/pgtable.h
 
 #define ENTRIES_PER_PAGE (PAGE_CACHE_SIZE/sizeof(unsigned long))
-#define ENTRIES_PER_PAGEPAGE (ENTRIES_PER_PAGE*ENTRIES_PER_PAGE)
+#define ENTRIES_PER_PAGEPAGE ((unsigned long long)ENTRIES_PER_PAGE*ENTRIES_PER_PAGE)
 #define BLOCKS_PER_PAGE  (PAGE_CACHE_SIZE/512)
 
 #define SHMEM_MAX_INDEX  (SHMEM_NR_DIRECT + (ENTRIES_PER_PAGEPAGE/2) * (ENTRIES_PER_PAGE+1))
@@ -95,7 +95,7 @@ static unsigned long shmem_default_max_inodes(void)
 }
 #endif
 
-static int shmem_getpage(struct inode *inode, unsigned long idx,
+static int shmem_getpage(struct inode *inode, unsigned long long idx,
 struct page **pagep, enum sgp_type sgp, int *type);
 
 static inline struct page *shmem_dir_alloc(gfp_t gfp_mask)
@@ -533,7 +533,7 @@ static void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end)
int punch_hole;
spinlock_t *needs_lock;
spinlock_t *punch_lock;
-   unsigned long upper_limit;
+   unsigned long long upper_limit;
 
	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
	idx = (start + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
@@ -1175,7 +1175,7 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
  * vm. If we swap it in we mark it dirty since we also free the swap
  * entry since a page cannot live in both the swap and page cache
  */
-static int shmem_getpage(struct inode *inode, unsigned long idx,
+static int shmem_getpage(struct inode *inode, unsigned long long idx,
struct page **pagep, enum sgp_type sgp, int *type)
 {
	struct address_space *mapping = inode->i_mapping;

Regards, Yuri
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re[2]: [PATCH][v2] fork_init: fix division by zero

2008-12-11 Thread Yuri Tikhonov

Hello Andrew,

On Thursday, December 11, 2008 you wrote:

[snip]

 The expression you've chosen here can be quite inaccurate, because
 (PAGE_SIZE / (8 * THREAD_SIZE)) is a small number.

 But why is that bad? We multiply 'mempages', we don't divide it.
All the numbers in the multiplier are powers of 2, so both
expressions:

mempages * (PAGE_SIZE / (8 * THREAD_SIZE))

and

max_threads = (mempages * PAGE_SIZE) / (8 * THREAD_SIZE)

are ultimately equal.

  The way to preserve accuracy is

 max_threads = (mempages * PAGE_SIZE) / (8 * THREAD_SIZE);

 so how about avoiding the nasty ifdefs and doing

 I'm OK with the approach below, but, while producing the same result,
it adds overhead to code that had none before this patch: e.g. your
implementation boils down to ~5 times more processor instructions than
there were before, plus stack operations for the 'm' variable.

 On the other hand, my approach with the nasty (I agree) ifdefs adds no
overhead to the code which does not need it, i.e. the most common
situation of small PAGE_SIZEs. A big PAGE_SIZE is the exception, so I
believe the more common cases should not suffer because of it.

 --- a/kernel/fork.c~fork_init-fix-division-by-zero
 +++ a/kernel/fork.c
 @@ -69,6 +69,7 @@
   #include <asm/mmu_context.h>
   #include <asm/cacheflush.h>
   #include <asm/tlbflush.h>
  +#include <asm/div64.h>
  
  /*
   * Protected counters by write_lock_irq(tasklist_lock)
 @@ -185,10 +186,15 @@ void __init fork_init(unsigned long memp
  
 /*
  * The default maximum number of threads is set to a safe
 -* value: the thread structures can take up at most half
 -* of memory.
 +* value: the thread structures can take up at most
 +* (1/8) part of memory.
  */
 -   max_threads = mempages / (8 * THREAD_SIZE / PAGE_SIZE);
 +   {
 +   /* max_threads = (mempages * PAGE_SIZE) / THREAD_SIZE / 8; */
 +   u64 m = mempages * PAGE_SIZE;
 +   do_div(m, THREAD_SIZE * 8);
 +   max_threads = m;
 +   }
  
 /*
  * we need to allow at least 20 threads to boot a system
 _

 ?


 The code is also inaccurate because it assumes that whatever allocator
is used for threads will pack the thread_structs into pages with best
 possible density, which isn't necessarily the case.  Let's not worry
 about that.




 OT:

 max_threads is wildly wrong anyway.

 - the caller passes in num_physpages, which includes highmem.  And we
   can't allocate thread structs from highmem.

 - num_physpages includes kernel pages and other stuff which can never
   be allocated via the page allocator.

 A suitable fix would be to switch the caller to the strangely-named
 nr_free_buffer_pages().

 If you grep the tree for `num_physpages', you will find a splendid
 number of similar bugs.  num_physpages should be unexported, burnt,
 deleted, etc.  It's just an invitation to write buggy code.


 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re[2]: [PATCH] fork_init: fix division by zero

2008-12-10 Thread Yuri Tikhonov

 Hello Geert,

On Wednesday, December 10, 2008 you wrote:
 On Tue, 9 Dec 2008, Yuri Tikhonov wrote:
  The following patch fixes a divide-by-zero error for the
  cases of really big PAGE_SIZEs (e.g. 256KB on ppc44x).
  Support for such big page sizes on 44x is not present in the
  current kernel yet, but is coming soon.
  
  Also this patch fixes the comment for the max_threads
  setting, as it didn't match what the code actually does.
 
 Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
 Signed-off-by: Ilya Yanok [EMAIL PROTECTED]
 ---
  kernel/fork.c |8 ++--
  1 files changed, 6 insertions(+), 2 deletions(-)
 
 diff --git a/kernel/fork.c b/kernel/fork.c
 index 2a372a0..b0ac2fb 100644
 --- a/kernel/fork.c
 +++ b/kernel/fork.c
 @@ -181,10 +181,14 @@ void __init fork_init(unsigned long mempages)
  
   /*
* The default maximum number of threads is set to a safe
 -  * value: the thread structures can take up at most half
 -  * of memory.
 +  * value: the thread structures can take up at most
 +  * (1/8) part of memory.
*/
  +#if (8 * THREAD_SIZE) >= PAGE_SIZE
   max_threads = mempages / (8 * THREAD_SIZE / PAGE_SIZE);
 +#else
 + max_threads = mempages * PAGE_SIZE / (8 * THREAD_SIZE);
   
 +#endif

 Can't this overflow, e.g. on 32-bit machines with HIGHMEM?

 The multiplier here is not PAGE_SIZE, but [PAGE_SIZE / (8 * 
THREAD_SIZE)], and this value is expected to be rather small (2, 4, or 
so).

 Furthermore, due to the #if/#endif construction multiplication is 
used only with rather big PAGE_SIZE values, and the bigger page size 
is then the smaller 'mempages' is.

 So, for example, when running with PAGE_SIZE=256KB, THREAD_SIZE=8KB, 
on 32-bit 440spe-based machine with 4GB RAM installed, here we have:

 max_threads = (4G/256K) * (256K / (8 * 8K)) = 16384 * 4 = 65536.

 And the overflow only takes place in the case of very, very big
RAM sizes: >= 256TB:

 max_threads = (256T / 256K) * (256K / (8 * 8K)) = 0x40000000 * 4.

 But I don't think that with 256TB RAM installed this code will be 
the only place of problems :)

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re[2]: [PATCH] powerpc: add 16K/64K pages support for the 44x PPC32 architectures.

2008-12-10 Thread Yuri Tikhonov
/L1_CACHE_BYTES /* Number of lines in a page */

 Same comment.

  pgd_t *pgd_alloc(struct mm_struct *mm)
  {
   pgd_t *ret;
  
 - ret = (pgd_t *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, PGDIR_ORDER);
  + ret = (pgd_t *)kzalloc(1 << PGDIR_ORDER, GFP_KERNEL);
   return ret;
  }

 We may want to consider using a slab cache. Maybe an area where we want
 to merge 32 and 64 bit code, though it doesn't have to be right now.

 Do we know the impact of using kzalloc instead of gfp for when it's
 really just a single page though ? Does it have overhead or will kzalloc
 just fallback to gfp ? If it has overhead, then we probably want to
 ifdef and keep using gfp for the 1-page case.

 This depends on the allocator: SLUB ends up calling __get_free_pages() if
size > PAGE_SIZE [note, not >= !], but SLAB doesn't. So we'll add an
ifdef here.


   __init_refok pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
  @@ -400,7 +395,7 @@ void kernel_map_pages(struct page *page, int numpages, int enable)
  #endif /* CONFIG_DEBUG_PAGEALLOC */
  
  static int fixmaps;
 -unsigned long FIXADDR_TOP = 0xf000;
 +unsigned long FIXADDR_TOP = (-PAGE_SIZE);
  EXPORT_SYMBOL(FIXADDR_TOP);
  
  void __set_fixmap (enum fixed_addresses idx, phys_addr_t phys, pgprot_t 
 flags)
  diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
 index 548efa5..73a5aa9 100644
 --- a/arch/powerpc/platforms/Kconfig.cputype
 +++ b/arch/powerpc/platforms/Kconfig.cputype
 @@ -204,7 +204,7 @@ config PPC_STD_MMU_32
  
  config PPC_MM_SLICES
   bool
 - default y if HUGETLB_PAGE || PPC_64K_PAGES
  + default y if HUGETLB_PAGE || (PPC64 && PPC_64K_PAGES)
   default n

 I would make it PPC64 && (HUGETLB_PAGE || PPC_64K_PAGES) for now,
 I don't think we want to use the existing slice code on anything else.

 Make it even PPC_STD_MMU_64

 Cheers,
 Ben.

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re[4]: [PATCH] fork_init: fix division by zero

2008-12-10 Thread Yuri Tikhonov

 Hello David,

On Wednesday, December 10, 2008 you wrote:

 Yuri Tikhonov [EMAIL PROTECTED] wrote:

  Here we believe in the preprocessor: since PAGE_SIZE, 8, and
 THREAD_SIZE are all constants, we expect it to calculate this.

 The preprocessor shouldn't be calculating this.  I believe it will _only_
 calculate expressions for #if.  In the situation you're referring to, it
 should perform a substitution and nothing more.  The preprocessor doesn't
 necessarily know how to handle the types involved.

 In any case, there's an easy way to find out: you can ask the compiler to give
 you the result of running the source through the preprocessor only. For
 instance, if you run this:

 #define PAGE_SIZE 4096
 #define THREAD_SIZE 8192
 unsigned long mempages;
 unsigned long jump(void)
 {
 unsigned long max_threads;
 max_threads = mempages * PAGE_SIZE / (8 * THREAD_SIZE);
 return max_threads;
 }

 through gcc -E, you get:

 # 1 calc.c
 # 1 built-in
 # 1 command line
 # 1 calc.c
 unsigned long mempages;
 unsigned long jump(void)
 {
  unsigned long max_threads;
  max_threads = mempages * 4096 / (8 * 8192);
  return max_threads;
 }


  In any case, adding braces as follows probably would be better:
 
 + max_threads = mempages * (PAGE_SIZE / (8 * THREAD_SIZE));

 I think you mean brackets, not braces '{}'.

 Yes, it was a typo.


  Right ?

 Definitely not.

 I added this function to the above:

 unsigned long alt(void)
 {
 unsigned long max_threads;
 max_threads = mempages * (PAGE_SIZE / (8 * THREAD_SIZE));
 return max_threads;
 }

 and ran it through gcc -S -O2 for x86_64:

 jump:
  movq	mempages(%rip), %rax
  salq	$12, %rax
  shrq	$16, %rax
  ret
  alt:
  xorl	%eax, %eax
 ret

 Note the difference?  In jump(), x86_64 first multiplies mempages by 4096, and
 _then_ divides by 8*8192.

 In alt(), it just returns 0 because the compiler realised that you're
 multiplying by 0.

 I think Geert has already commented on this: you've compiled your alt()
function with 4K PAGE_SIZE and 8K THREAD_SIZE - this case is
handled by the old code in fork_init.

 If you're going to bracket the expression, it must be:

 max_threads = (mempages * PAGE_SIZE) / (8 * THREAD_SIZE);

 which should be superfluous.

  E.g. here is the result from this line as produced by cross-gcc 
 4.2.2:
 
 lis r9,0
 rlwinm  r29,r29,2,16,29
 stw r29,0(r9)
 
   As you see - only a rotate-left, i.e. multiplication by a constant.

 Ummm...  On powerpc, I believe rotate-left would be a division as it does the
 bit-numbering and the bit direction the opposite way to more familiar CPUs
 such as x86.

 On powerpc shifting left is multiplication by 2, as this has the most 
significant bit first.

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re[2]: [PATCH] fork_init: fix division by zero

2008-12-10 Thread Yuri Tikhonov

 Hello Al,

On Wednesday, December 10, 2008 you wrote:

 On Wed, Dec 10, 2008 at 01:01:13PM +0300, Yuri Tikhonov wrote:

  + max_threads = mempages * PAGE_SIZE / (8 * THREAD_SIZE);

  +#endif
 
  Can't this overflow, e.g. on 32-bit machines with HIGHMEM?
 
  The multiplier here is not PAGE_SIZE, but [PAGE_SIZE / (8 * 
 THREAD_SIZE)], and this value is expected to be rather small (2, 4, or 
 so).

 x * y / z is parsed as (x * y) / z, not x * (y / z).

 Here we believe in the preprocessor: since PAGE_SIZE, 8, and
THREAD_SIZE are all constants, we expect it to calculate this.

 E.g. here is the result from this line as produced by cross-gcc 
4.2.2:

lis r9,0
rlwinm  r29,r29,2,16,29
stw r29,0(r9)

 As you see - only a rotate-left, i.e. multiplication by a constant.

 In any case, adding braces as follows probably would be better:

+ max_threads = mempages * (PAGE_SIZE / (8 * THREAD_SIZE));

 Right ?

 Only assignment operators (and ?:, in a sense that a ? b : c ? d : e is
 parsed as a ? b : (c ? d : e)) are right-to-left.  The rest is left-to-right.



 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re[2]: [PATCH] ASYNC_TX: async_xor mapping fix

2008-12-10 Thread Yuri Tikhonov

 Hello Dan,

On Wednesday, December 10, 2008 you wrote:

 On Mon, 2008-12-08 at 14:08 -0700, Dan Williams wrote:
 On Mon, 2008-12-08 at 12:14 -0700, Yuri Tikhonov wrote:
  The destination address may be present in the source list, so we should
  map the addresses from the source list first.
    Otherwise, if the page corresponding to the destination is not marked as
   write-through (with regard to the CPU cache), then mapping it with
   DMA_FROM_DEVICE may lead to data loss, and ultimately to an incorrect
   result of the calculations.
  
 
 Thanks Yuri.  I think we should avoid mapping the destination twice
 altogether, and for simplicity just always map it bidirectionally.

 Yuri, Saeed, can I get your acked-by's for the following 2.6.28 patch:

 I would do the src_list[i] == dest check with unlikely(), since we
know a priori that at most one of the src_cnt addresses can be the
destination.

 As for the rest:

Acked-by: Yuri Tikhonov [EMAIL PROTECTED]

 Thanks,
 Dan


snip
 async_xor: dma_map destination DMA_BIDIRECTIONAL

 From: Dan Williams [EMAIL PROTECTED]

 Mapping the destination multiple times is a misuse of the dma-api.
 Since the destination may be reused as a source, ensure that it is only
 mapped once and that it is mapped bidirectionally.  This appears to add
 ugliness on the unmap side in that it always reads back the destination
 address from the descriptor, but gcc can determine that dma_unmap is a
 nop and not emit the code that calculates its arguments.

 Cc: [EMAIL PROTECTED]
 Cc: Saeed Bishara [EMAIL PROTECTED]
 Reported-by: Yuri Tikhonov [EMAIL PROTECTED]
 Signed-off-by: Dan Williams [EMAIL PROTECTED]
 ---
  crypto/async_tx/async_xor.c |   11 +--
  drivers/dma/iop-adma.c  |   16 +---
  drivers/dma/mv_xor.c|   15 ---
  3 files changed, 34 insertions(+), 8 deletions(-)

 diff --git a/crypto/async_tx/async_xor.c b/crypto/async_tx/async_xor.c
 index c029d3e..a6faa90 100644
 --- a/crypto/async_tx/async_xor.c
 +++ b/crypto/async_tx/async_xor.c
  @@ -53,10 +53,17 @@ do_async_xor(struct dma_chan *chan, struct page *dest, struct page **src_list,
 int xor_src_cnt;
 dma_addr_t dma_dest;
  
 -   dma_dest = dma_map_page(dma->dev, dest, offset, len, DMA_FROM_DEVICE);
 -   for (i = 0; i < src_cnt; i++)
 +   /* map the dest bidirectional in case it is re-used as a source */
 +   dma_dest = dma_map_page(dma->dev, dest, offset, len, DMA_BIDIRECTIONAL);
 +   for (i = 0; i < src_cnt; i++) {
 +   /* only map the dest once */
 +   if (src_list[i] == dest) {
 +   dma_src[i] = dma_dest;
 +   continue;
 +   }
 dma_src[i] = dma_map_page(dma->dev, src_list[i], offset,
   len, DMA_TO_DEVICE);
 +   }
  
 while (src_cnt) {
 async_flags = flags;
 diff --git a/drivers/dma/iop-adma.c b/drivers/dma/iop-adma.c
 index c7a9306..6be3172 100644
 --- a/drivers/dma/iop-adma.c
 +++ b/drivers/dma/iop-adma.c
  @@ -85,18 +85,28 @@ iop_adma_run_tx_complete_actions(struct iop_adma_desc_slot *desc,
	enum dma_ctrl_flags flags = desc->async_tx.flags;
 u32 src_cnt;
 dma_addr_t addr;
 +   dma_addr_t dest;
  
 +   src_cnt = unmap->unmap_src_cnt;
 +   dest = iop_desc_get_dest_addr(unmap, iop_chan);
 if (!(flags & DMA_COMPL_SKIP_DEST_UNMAP)) {
 -   addr = iop_desc_get_dest_addr(unmap, iop_chan);
 -   dma_unmap_page(dev, addr, len, DMA_FROM_DEVICE);
 +   enum dma_data_direction dir;
 +
 +   if (src_cnt > 1) /* is xor? */
 +   dir = DMA_BIDIRECTIONAL;
 +   else
 +   dir = DMA_FROM_DEVICE;
 +
 +   dma_unmap_page(dev, dest, len, dir);
 }
  
	if (!(flags & DMA_COMPL_SKIP_SRC_UNMAP)) {
 -   src_cnt = unmap->unmap_src_cnt;
 while (src_cnt--) {
 addr = iop_desc_get_src_addr(unmap,
 iop_chan,
 src_cnt);
 +   if (addr == dest)
 +   continue;
 dma_unmap_page(dev, addr, len,
DMA_TO_DEVICE);
 }
 diff --git a/drivers/dma/mv_xor.c b/drivers/dma/mv_xor.c
 index 0328da0..bcda174 100644
 --- a/drivers/dma/mv_xor.c
 +++ b/drivers/dma/mv_xor.c
 @@ -311,17 +311,26

[PATCH] fork_init: fix division by zero

2008-12-09 Thread Yuri Tikhonov

The following patch fixes a divide-by-zero error for the
cases of really big PAGE_SIZEs (e.g. 256KB on ppc44x).
Support for such big page sizes on 44x is not present in the
current kernel yet, but is coming soon.

Also this patch fixes the comment for the max_threads
setting, as it didn't match what the code actually does.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Ilya Yanok [EMAIL PROTECTED]
---
 kernel/fork.c |8 ++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 2a372a0..b0ac2fb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -181,10 +181,14 @@ void __init fork_init(unsigned long mempages)
 
/*
 * The default maximum number of threads is set to a safe
-* value: the thread structures can take up at most half
-* of memory.
+* value: the thread structures can take up at most
+* (1/8) part of memory.
 */
+#if (8 * THREAD_SIZE) > PAGE_SIZE
max_threads = mempages / (8 * THREAD_SIZE / PAGE_SIZE);
+#else
+   max_threads = mempages * PAGE_SIZE / (8 * THREAD_SIZE);
+#endif
 
/*
 * we need to allow at least 20 threads to boot a system
-- 
1.5.6.1
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


[RFC PATCH 00/11][v2] md: support for asynchronous execution of RAID6 operations

2008-12-08 Thread Yuri Tikhonov
 Hello,

 This is the next attempt at asynchronous RAID-6 support. This patch-set
addresses Dan Williams' comments (Nov 15), which were mainly
about the ASYNC_TX part of the code.

 The following patch-set includes enhancements to the async_tx API and
modifications to md-raid6 to issue memory copies and parity calculations
asynchronously. Thus copy operations and RAID-6 calculations can be
processed on the dedicated DMA engines accessible through the ASYNC_TX API,
off-loading the CPU and improving performance.

 To reduce code duplication in the raid driver, this patch-set modifies
some raid-5 functions so that they can also be used in the raid-6 case.

 The patch-set can be broken down into the following main categories:

1) Additions to the ASYNC_TX API (patches 1-3; without patch 1, ASYNC_TX
can't be compiled for 44x in 2.6.27-rc6 or later);

2) RAID-6 implementation (patches 4-10)

3) ppc440spe ADMA driver (patch 11) (provided as a reference here)

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com


[PATCH 01/11] async_tx: don't use src_list argument of async_xor() for dma addresses

2008-12-08 Thread Yuri Tikhonov
Using the src_list argument of async_xor() as storage for dma addresses
implies the sizeof(dma_addr_t) <= sizeof(struct page *) restriction, which is
not always true (e.g. ppc440spe).

Signed-off-by: Ilya Yanok [EMAIL PROTECTED]
Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
---
 crypto/async_tx/async_xor.c |   14 ++
 1 files changed, 2 insertions(+), 12 deletions(-)

diff --git a/crypto/async_tx/async_xor.c b/crypto/async_tx/async_xor.c
index c029d3e..00c74c5 100644
--- a/crypto/async_tx/async_xor.c
+++ b/crypto/async_tx/async_xor.c
@@ -42,7 +42,7 @@ do_async_xor(struct dma_chan *chan, struct page *dest, struct 
page **src_list,
 dma_async_tx_callback cb_fn, void *cb_param)
 {
    struct dma_device *dma = chan->device;
-   dma_addr_t *dma_src = (dma_addr_t *) src_list;
+   dma_addr_t dma_src[src_cnt];
struct dma_async_tx_descriptor *tx = NULL;
int src_off = 0;
int i;
@@ -247,7 +247,7 @@ async_xor_zero_sum(struct page *dest, struct page **src_list,
    BUG_ON(src_cnt <= 1);
 
    if (device && src_cnt <= device->max_xor) {
-   dma_addr_t *dma_src = (dma_addr_t *) src_list;
+   dma_addr_t dma_src[src_cnt];
unsigned long dma_prep_flags = cb_fn ? DMA_PREP_INTERRUPT : 0;
int i;
 
@@ -296,16 +296,6 @@ EXPORT_SYMBOL_GPL(async_xor_zero_sum);
 
 static int __init async_xor_init(void)
 {
-   #ifdef CONFIG_DMA_ENGINE
-   /* To conserve stack space the input src_list (array of page pointers)
-    * is reused to hold the array of dma addresses passed to the driver.
-    * This conversion is only possible when dma_addr_t is less than the
-    * size of a pointer.  HIGHMEM64G is known to violate this
-    * assumption.
-    */
-   BUILD_BUG_ON(sizeof(dma_addr_t) > sizeof(struct page *));
-   #endif
-
return 0;
 }
 
-- 
1.5.6.1


[PATCH 02/11][v2] async_tx: add support for asynchronous GF multiplication

2008-12-08 Thread Yuri Tikhonov
This adds support for doing asynchronous GF multiplication by adding
four additional functions to the async_tx API:

 async_pq() does simultaneous XOR of sources and XOR of sources
  GF-multiplied by given coefficients.

 async_pq_zero_sum() checks if results of calculations match given
  ones.

 async_gen_syndrome() does simultaneous XOR and R/S syndrome of sources.

 async_syndrome_zerosum() checks if results of XOR/syndrome calculation
  matches given ones.

The latter two functions just use async_pq() with the appropriate
coefficients in the asynchronous case, but have significant optimizations
for the synchronous case.

To support this API, a dmaengine driver should set the DMA_PQ and
DMA_PQ_ZERO_SUM capabilities and provide the device_prep_dma_pq and
device_prep_dma_pqzero_sum methods in the dma_device structure.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Ilya Yanok [EMAIL PROTECTED]
---
 crypto/async_tx/Kconfig|4 +
 crypto/async_tx/Makefile   |1 +
 crypto/async_tx/async_pq.c |  586 
 include/linux/async_tx.h   |   45 -
 include/linux/dmaengine.h  |   16 ++-
 5 files changed, 648 insertions(+), 4 deletions(-)
 create mode 100644 crypto/async_tx/async_pq.c

diff --git a/crypto/async_tx/Kconfig b/crypto/async_tx/Kconfig
index d8fb391..cb6d731 100644
--- a/crypto/async_tx/Kconfig
+++ b/crypto/async_tx/Kconfig
@@ -14,3 +14,7 @@ config ASYNC_MEMSET
tristate
select ASYNC_CORE
 
+config ASYNC_PQ
+   tristate
+   select ASYNC_CORE
+
diff --git a/crypto/async_tx/Makefile b/crypto/async_tx/Makefile
index 27baa7d..1b99265 100644
--- a/crypto/async_tx/Makefile
+++ b/crypto/async_tx/Makefile
@@ -2,3 +2,4 @@ obj-$(CONFIG_ASYNC_CORE) += async_tx.o
 obj-$(CONFIG_ASYNC_MEMCPY) += async_memcpy.o
 obj-$(CONFIG_ASYNC_MEMSET) += async_memset.o
 obj-$(CONFIG_ASYNC_XOR) += async_xor.o
+obj-$(CONFIG_ASYNC_PQ) += async_pq.o
diff --git a/crypto/async_tx/async_pq.c b/crypto/async_tx/async_pq.c
new file mode 100644
index 000..439338f
--- /dev/null
+++ b/crypto/async_tx/async_pq.c
@@ -0,0 +1,586 @@
+/*
+ * Copyright(c) 2007 Yuri Tikhonov [EMAIL PROTECTED]
+ *
+ * Developed for DENX Software Engineering GmbH
+ *
+ * Asynchronous GF-XOR calculations ASYNC_TX API.
+ *
+ * based on async_xor.c code written by:
+ * Dan Williams [EMAIL PROTECTED]
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ */
+#include <linux/kernel.h>
+#include <linux/interrupt.h>
+#include <linux/dma-mapping.h>
+#include <linux/raid/xor.h>
+#include <linux/async_tx.h>
+
+#include "../drivers/md/raid6.h"
+
+/**
+ *  The following static variables are used in cases of synchronous
+ * zero sum to save the values to check. Two pages used for zero sum and
+ * the third one is for dumb P destination when calling gen_syndrome()
+ */
+static spinlock_t spare_lock;
+struct page *spare_pages[3];
+
+/**
+ * do_async_pq - asynchronously calculate P and/or Q
+ */
+static struct dma_async_tx_descriptor *
+do_async_pq(struct dma_chan *chan, struct page **blocks,
+   unsigned char *scf_list, unsigned int offset, int src_cnt, size_t len,
+   enum async_tx_flags flags, struct dma_async_tx_descriptor *depend_tx,
+   dma_async_tx_callback cb_fn, void *cb_param)
+{
+   struct dma_device *dma = chan->device;
+   dma_addr_t dma_dest[2], dma_src[src_cnt];
+   struct dma_async_tx_descriptor *tx = NULL;
+   dma_async_tx_callback _cb_fn;
+   void *_cb_param;
+   int i, pq_src_cnt, src_off = 0;
+   enum async_tx_flags async_flags;
+   enum dma_ctrl_flags dma_flags = 0;
+
+   /* If we won't handle src_cnt in one shot, then the following
+    * flag(s) will be set only on the first pass of prep_dma
+    */
+   if (flags & ASYNC_TX_PQ_ZERO_P)
+       dma_flags |= DMA_PREP_ZERO_P;
+   if (flags & ASYNC_TX_PQ_ZERO_Q)
+       dma_flags |= DMA_PREP_ZERO_Q;
+
+   /* DMAs use destinations as sources, so use BIDIRECTIONAL mapping */
+   dma_dest[0] = !blocks[src_cnt] ? 0 :
+       dma_map_page(dma->dev, blocks[src_cnt],
+                    offset, len, DMA_BIDIRECTIONAL);
+   dma_dest[1

[PATCH 03/11][v2] async_tx: add support for asynchronous RAID6 recovery operations

2008-12-08 Thread Yuri Tikhonov
This patch extends the async_tx API with two operations for recovery
on a RAID-6 array with two failed disks, using the new async_pq()
operation. The patch introduces the following functions:

 async_r6_dd_recov() recovers after double data disk failure

 async_r6_dp_recov() recovers after D+P failure

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Ilya Yanok [EMAIL PROTECTED]
---
 crypto/async_tx/Kconfig |5 +
 crypto/async_tx/Makefile|1 +
 crypto/async_tx/async_r6recov.c |  282 +++
 include/linux/async_tx.h|   11 ++
 4 files changed, 299 insertions(+), 0 deletions(-)
 create mode 100644 crypto/async_tx/async_r6recov.c

diff --git a/crypto/async_tx/Kconfig b/crypto/async_tx/Kconfig
index cb6d731..0b56224 100644
--- a/crypto/async_tx/Kconfig
+++ b/crypto/async_tx/Kconfig
@@ -18,3 +18,8 @@ config ASYNC_PQ
tristate
select ASYNC_CORE
 
+config ASYNC_R6RECOV
+   tristate
+   select ASYNC_CORE
+   select ASYNC_PQ
+
diff --git a/crypto/async_tx/Makefile b/crypto/async_tx/Makefile
index 1b99265..0ed8f13 100644
--- a/crypto/async_tx/Makefile
+++ b/crypto/async_tx/Makefile
@@ -3,3 +3,4 @@ obj-$(CONFIG_ASYNC_MEMCPY) += async_memcpy.o
 obj-$(CONFIG_ASYNC_MEMSET) += async_memset.o
 obj-$(CONFIG_ASYNC_XOR) += async_xor.o
 obj-$(CONFIG_ASYNC_PQ) += async_pq.o
+obj-$(CONFIG_ASYNC_R6RECOV) += async_r6recov.o
diff --git a/crypto/async_tx/async_r6recov.c b/crypto/async_tx/async_r6recov.c
new file mode 100644
index 000..403c1aa
--- /dev/null
+++ b/crypto/async_tx/async_r6recov.c
@@ -0,0 +1,282 @@
+/*
+ * Copyright(c) 2007 Yuri Tikhonov [EMAIL PROTECTED]
+ *
+ * Developed for DENX Software Engineering GmbH
+ *
+ * Asynchronous RAID-6 recovery calculations ASYNC_TX API.
+ *
+ * based on async_xor.c code written by:
+ * Dan Williams [EMAIL PROTECTED]
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ */
+#include <linux/kernel.h>
+#include <linux/interrupt.h>
+#include <linux/dma-mapping.h>
+#include <linux/raid/xor.h>
+#include <linux/async_tx.h>
+
+#include "../drivers/md/raid6.h"
+
+/**
+ * async_r6_dd_recov - attempt to calculate two data misses using dma engines.
+ * @disks: number of disks in the RAID-6 array
+ * @bytes: size of strip
+ * @faila: first failed drive index
+ * @failb: second failed drive index
+ * @ptrs: array of pointers to strips (last two must be p and q, respectively)
+ * @flags: ASYNC_TX_ACK, ASYNC_TX_DEP_ACK
+ * @depend_tx: depends on the result of this transaction.
+ * @cb: function to call when the operation completes
+ * @cb_param: parameter to pass to the callback routine
+ */
+struct dma_async_tx_descriptor *
+async_r6_dd_recov(int disks, size_t bytes, int faila, int failb,
+   struct page **ptrs, enum async_tx_flags flags,
+   struct dma_async_tx_descriptor *depend_tx,
+   dma_async_tx_callback cb, void *cb_param)
+{
+   struct dma_async_tx_descriptor *tx = NULL;
+   struct page *lptrs[disks];
+   unsigned char lcoef[disks-4];
+   int i = 0, k = 0, fc = -1;
+   uint8_t bc[2];
+   dma_async_tx_callback lcb = NULL;
+   void *lcb_param = NULL;
+
+   /* Assume that failb > faila */
+   if (faila > failb) {
+       fc = faila;
+       faila = failb;
+       failb = fc;
+   }
+
+   /* Try to compute missed data asynchronously. */
+   if (disks == 4) {
+   /* Pxy and Qxy are zero in this case so we already have
+* P+Pxy and Q+Qxy in P and Q strips respectively.
+*/
+   tx = depend_tx;
+   lcb = cb;
+   lcb_param = cb_param;
+   goto do_mult;
+   }
+
+   /* (1) Calculate Qxy and Pxy:
+* Qxy = A(0)*D(0) + ... + A(n-1)*D(n-1) + A(n+1)*D(n+1) + ... +
+*   A(m-1)*D(m-1) + A(m+1)*D(m+1) + ... + A(disks-1)*D(disks-1),
+* where n = faila, m = failb.
+*/
+   for (i = 0, k = 0; i < disks - 2; i++) {
+       if (i != faila && i != failb) {
+   lptrs[k] = ptrs[i];
+   lcoef[k] = raid6_gfexp[i];
+   k

[PATCH 04/11][v2] md: run RAID-6 stripe operations outside the lock

2008-12-08 Thread Yuri Tikhonov
 The raid_run_ops routine uses the asynchronous offload API and
the stripe_operations member of a stripe_head to carry out xor+pqxor+copy
operations asynchronously, outside the lock.

 The operations performed by RAID-6 are the same as in the RAID-5 case,
except that STRIPE_OP_PREXOR operations are not supported. All the others
are supported:
STRIPE_OP_BIOFILL
 - copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
 - generate missing blocks (1 or 2) in the cache from the other blocks
STRIPE_OP_BIODRAIN
 - copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
 - recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
 - verify that the parity is correct

 The flow is the same as in the RAID-5 case.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Ilya Yanok [EMAIL PROTECTED]
---
 drivers/md/Kconfig |2 +
 drivers/md/raid5.c |  292 
 include/linux/raid/raid5.h |6 +-
 3 files changed, 271 insertions(+), 29 deletions(-)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 2281b50..6c9964f 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -123,6 +123,8 @@ config MD_RAID456
depends on BLK_DEV_MD
select ASYNC_MEMCPY
select ASYNC_XOR
+   select ASYNC_PQ
+   select ASYNC_R6RECOV
---help---
  A RAID-5 set of N drives with a capacity of C MB per drive provides
  the capacity of C * (N - 1) MB, and protects against a failure
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a36a743..aeec3e5 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -584,18 +584,26 @@ static void ops_run_biofill(struct stripe_head *sh)
ops_complete_biofill, sh);
 }
 
-static void ops_complete_compute5(void *stripe_head_ref)
+static void ops_complete_compute(void *stripe_head_ref)
 {
    struct stripe_head *sh = stripe_head_ref;
-   int target = sh->ops.target;
-   struct r5dev *tgt = &sh->dev[target];
+   int target, i;
+   struct r5dev *tgt;
 
    pr_debug("%s: stripe %llu\n", __func__,
        (unsigned long long)sh->sector);
 
-   set_bit(R5_UPTODATE, &tgt->flags);
-   BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags));
-   clear_bit(R5_Wantcompute, &tgt->flags);
+   /* mark the computed target(s) as uptodate */
+   for (i = 0; i < 2; i++) {
+       target = (!i) ? sh->ops.target : sh->ops.target2;
+       if (target < 0)
+           continue;
+       tgt = &sh->dev[target];
+       set_bit(R5_UPTODATE, &tgt->flags);
+       BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags));
+       clear_bit(R5_Wantcompute, &tgt->flags);
+   }
+
    clear_bit(STRIPE_COMPUTE_RUN, &sh->state);
    if (sh->check_state == check_state_compute_run)
        sh->check_state = check_state_compute_result;
@@ -627,15 +635,155 @@ static struct dma_async_tx_descriptor *ops_run_compute5(struct stripe_head *sh)
 
if (unlikely(count == 1))
tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE,
-   0, NULL, ops_complete_compute5, sh);
+   0, NULL, ops_complete_compute, sh);
else
tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE,
ASYNC_TX_XOR_ZERO_DST, NULL,
-   ops_complete_compute5, sh);
+   ops_complete_compute, sh);
+
+   return tx;
+}
+
+static struct dma_async_tx_descriptor *
+ops_run_compute6_1(struct stripe_head *sh)
+{
+   /* kernel stack size limits the total number of disks */
+   int disks = sh->disks;
+   struct page *srcs[disks];
+   int target = sh->ops.target < 0 ? sh->ops.target2 : sh->ops.target;
+   struct r5dev *tgt = &sh->dev[target];
+   struct page *dest = sh->dev[target].page;
+   int count = 0;
+   int pd_idx = sh->pd_idx, qd_idx = raid6_next_disk(pd_idx, disks);
+   int d0_idx = raid6_next_disk(qd_idx, disks);
+   struct dma_async_tx_descriptor *tx;
+   int i;
+
+   pr_debug("%s: stripe %llu block: %d\n",
+       __func__, (unsigned long long)sh->sector, target);
+   BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags));
+
+   atomic_inc(&sh->count);
+
+   if (target == qd_idx) {
+       /* We are actually computing the Q drive */
+       i = d0_idx;
+       do {
+           srcs[count++] = sh->dev[i].page;
+           i = raid6_next_disk(i, disks);
+       } while (i != pd_idx);
+       srcs[count] = NULL;
+       srcs[count+1] = dest;
+       tx = async_gen_syndrome(srcs, 0, count, STRIPE_SIZE,
+               0, NULL, ops_complete_compute, sh);
+   } else {
+       /* Compute any data- or p-drive using XOR */
+       for (i = disks; i-- ; ) {
+   if (i != target  i

[PATCH 05/11] md: common schedule_reconstruction for raid5/6

2008-12-08 Thread Yuri Tikhonov
To be able to re-use the schedule_reconstruction5() code in RAID-6
case, this should handle Q-parity strip appropriately. This patch
introduces this.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Ilya Yanok [EMAIL PROTECTED]
---
 drivers/md/raid5.c |   18 ++
 1 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index aeec3e5..e31f38b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1885,10 +1885,11 @@ static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2)
 }
 
 static void
-schedule_reconstruction5(struct stripe_head *sh, struct stripe_head_state *s,
+schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 int rcw, int expand)
 {
    int i, pd_idx = sh->pd_idx, disks = sh->disks;
+   int level = sh->raid_conf->level;
 
if (rcw) {
/* if we are not expanding this is a proper write request, and
@@ -1914,10 +1915,12 @@ schedule_reconstruction5(struct stripe_head *sh, struct 
stripe_head_state *s,
s-locked++;
}
}
-   if (s->locked + 1 == disks)
+   if ((level == 5 && s->locked + 1 == disks) ||
+       (level == 6 && s->locked + 2 == disks))
        if (!test_and_set_bit(STRIPE_FULL_WRITE, &sh->state))
            atomic_inc(&sh->raid_conf->pending_full_writes);
    } else {
+       BUG_ON(level == 6);
        BUG_ON(!(test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags) ||
            test_bit(R5_Wantcompute, &sh->dev[pd_idx].flags)));
 
@@ -1949,6 +1952,13 @@ schedule_reconstruction5(struct stripe_head *sh, struct stripe_head_state *s,
        clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
        s->locked++;
 
+       if (level == 6) {
+           int qd_idx = raid6_next_disk(pd_idx, disks);
+           set_bit(R5_LOCKED, &sh->dev[qd_idx].flags);
+           clear_bit(R5_UPTODATE, &sh->dev[qd_idx].flags);
+           s->locked++;
+       }
+
    pr_debug("%s: stripe %llu locked: %d ops_request: %lx\n",
        __func__, (unsigned long long)sh->sector,
        s->locked, s->ops_request);
@@ -2410,7 +2420,7 @@ static void handle_stripe_dirtying5(raid5_conf_t *conf,
    if ((s->req_compute || !test_bit(STRIPE_COMPUTE_RUN, &sh->state)) &&
        (s->locked == 0 && (rcw == 0 || rmw == 0) &&
        !test_bit(STRIPE_BIT_DELAY, &sh->state)))
-       schedule_reconstruction5(sh, s, rcw == 0, 0);
+       schedule_reconstruction(sh, s, rcw == 0, 0);
 }
 
 static void handle_stripe_dirtying6(raid5_conf_t *conf,
@@ -3003,7 +3013,7 @@ static bool handle_stripe5(struct stripe_head *sh)
        sh->disks = conf->raid_disks;
        sh->pd_idx = stripe_to_pdidx(sh->sector, conf,
            conf->raid_disks);
-       schedule_reconstruction5(sh, s, 1, 1);
+       schedule_reconstruction(sh, s, 1, 1);
    } else if (s.expanded && !sh->reconstruct_state && s.locked == 0) {
        clear_bit(STRIPE_EXPAND_READY, &sh->state);
        atomic_dec(&conf->reshape_stripes);
-- 
1.5.6.1



[PATCH 07/11] md: rewrite handle_stripe_dirtying6 in asynchronous way

2008-12-08 Thread Yuri Tikhonov
Rewrite handle_stripe_dirtying6 function to work asynchronously.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Ilya Yanok [EMAIL PROTECTED]
---
 drivers/md/raid5.c |  113 ++--
 1 files changed, 30 insertions(+), 83 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e08ed4f..f0b47bd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2485,99 +2485,46 @@ static void handle_stripe_dirtying6(raid5_conf_t *conf,
struct stripe_head *sh, struct stripe_head_state *s,
struct r6_state *r6s, int disks)
 {
-   int rcw = 0, must_compute = 0, pd_idx = sh->pd_idx, i;
    int qd_idx = r6s->qd_idx;
+
+   set_bit(STRIPE_HANDLE, &sh->state);
    for (i = disks; i--; ) {
        struct r5dev *dev = &sh->dev[i];
-       /* Would I have to read this buffer for reconstruct_write */
-       if (!test_bit(R5_OVERWRITE, &dev->flags)
-           && i != pd_idx && i != qd_idx
-           && (!test_bit(R5_LOCKED, &dev->flags)
-           ) &&
-           !test_bit(R5_UPTODATE, &dev->flags)) {
-           if (test_bit(R5_Insync, &dev->flags)) rcw++;
-           else {
-               pr_debug("raid6: must_compute: "
-                   "disk %d flags=%#lx\n", i, dev->flags);
-               must_compute++;
+       /* check if we haven't enough data */
+       if (!test_bit(R5_OVERWRITE, &dev->flags) &&
+           i != pd_idx && i != qd_idx &&
+           !test_bit(R5_LOCKED, &dev->flags) &&
+           !(test_bit(R5_UPTODATE, &dev->flags) ||
+             test_bit(R5_Wantcompute, &dev->flags))) {
+           rcw++;
+           if (!test_bit(R5_Insync, &dev->flags))
+               continue; /* it's a failed drive */
+
+           if (test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
+               pr_debug("Read_old stripe %llu "
+                   "block %d for Reconstruct\n",
+                   (unsigned long long)sh->sector, i);
+               set_bit(R5_LOCKED, &dev->flags);
+               set_bit(R5_Wantread, &dev->flags);
+               s->locked++;
+           } else {
+               pr_debug("Request delayed stripe %llu "
+                   "block %d for Reconstruct\n",
+                   (unsigned long long)sh->sector, i);
+               set_bit(STRIPE_DELAYED, &sh->state);
+               set_bit(STRIPE_HANDLE, &sh->state);
            }
        }
    }
-   pr_debug("for sector %llu, rcw=%d, must_compute=%d\n",
-       (unsigned long long)sh->sector, rcw, must_compute);
-   set_bit(STRIPE_HANDLE, &sh->state);
-
-   if (rcw > 0)
-       /* want reconstruct write, but need to get some data */
-       for (i = disks; i--; ) {
-           struct r5dev *dev = &sh->dev[i];
-           if (!test_bit(R5_OVERWRITE, &dev->flags)
-               && !(s->failed == 0 && (i == pd_idx || i == qd_idx))
-               && !test_bit(R5_LOCKED, &dev->flags) &&
-               !test_bit(R5_UPTODATE, &dev->flags) &&
-               test_bit(R5_Insync, &dev->flags)) {
-               if (test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-                   pr_debug("Read_old stripe %llu "
-                       "block %d for Reconstruct\n",
-                       (unsigned long long)sh->sector, i);
-                   set_bit(R5_LOCKED, &dev->flags);
-                   set_bit(R5_Wantread, &dev->flags);
-                   s->locked++;
-               } else {
-                   pr_debug("Request delayed stripe %llu "
-                       "block %d for Reconstruct\n",
-                       (unsigned long long)sh->sector, i);
-                   set_bit(STRIPE_DELAYED, &sh->state);
-                   set_bit(STRIPE_HANDLE, &sh->state);
-               }
-           }
-       }
    /* now if nothing is locked, and if we have enough data, we can start a
     * write request
     */
-   if (s->locked == 0 && rcw == 0 &&
+   if ((s->req_compute || !test_bit(STRIPE_COMPUTE_RUN, &sh->state)) &&
+       s->locked == 0 && rcw == 0 &&
        !test_bit(STRIPE_BIT_DELAY, &sh->state)) {
-       if (must_compute > 0

[PATCH 06/11] md: change handle_stripe_fill6 to work in asynchronous way

2008-12-08 Thread Yuri Tikhonov
Change handle_stripe_fill6 to work asynchronously and introduce the
helper function fetch_block6 for this.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Ilya Yanok [EMAIL PROTECTED]
---
 drivers/md/raid5.c |  154 
 1 files changed, 106 insertions(+), 48 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e31f38b..e08ed4f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2224,61 +2224,119 @@ static void handle_stripe_fill5(struct stripe_head *sh,
    set_bit(STRIPE_HANDLE, &sh->state);
 }
 
-static void handle_stripe_fill6(struct stripe_head *sh,
-   struct stripe_head_state *s, struct r6_state *r6s,
-   int disks)
+/* fetch_block6 - checks the given member device to see if its data needs
+ * to be read or computed to satisfy a request.
+ *
+ * Returns 1 when no more member devices need to be checked, otherwise returns
+ * 0 to tell the loop in handle_stripe_fill6 to continue
+ */
+static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s,
+struct r6_state *r6s, int disk_idx, int disks)
 {
-   int i;
-   for (i = disks; i--; ) {
-       struct r5dev *dev = &sh->dev[i];
-       if (!test_bit(R5_LOCKED, &dev->flags) &&
-           !test_bit(R5_UPTODATE, &dev->flags) &&
-           (dev->toread || (dev->towrite &&
-            !test_bit(R5_OVERWRITE, &dev->flags)) ||
-            s->syncing || s->expanding ||
-            (s->failed >= 1 &&
-             (sh->dev[r6s->failed_num[0]].toread ||
-              s->to_write)) ||
-            (s->failed >= 2 &&
-             (sh->dev[r6s->failed_num[1]].toread ||
-              s->to_write)))) {
-           /* we would like to get this block, possibly
-            * by computing it, but we might not be able to
+   struct r5dev *dev = &sh->dev[disk_idx];
+   struct r5dev *fdev[2] = { &sh->dev[r6s->failed_num[0]],
+                 &sh->dev[r6s->failed_num[1]] };
+
+   if (!test_bit(R5_LOCKED, &dev->flags) &&
+       !test_bit(R5_UPTODATE, &dev->flags) &&
+       (dev->toread ||
+        (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
+        s->syncing || s->expanding ||
+        (s->failed >= 1 &&
+         (fdev[0]->toread || s->to_write)) ||
+        (s->failed >= 2 &&
+         (fdev[1]->toread || s->to_write)))) {
+       /* we would like to get this block, possibly by computing it,
+        * otherwise read it if the backing disk is insync
+        */
+       BUG_ON(test_bit(R5_Wantcompute, &dev->flags));
+       BUG_ON(test_bit(R5_Wantread, &dev->flags));
+       if ((s->uptodate == disks - 1) &&
+           (s->failed && (disk_idx == r6s->failed_num[0] ||
+                  disk_idx == r6s->failed_num[1]))) {
+           /* have disk failed, and we're requested to fetch it;
+            * do compute it
+            */
-           if ((s->uptodate == disks - 1) &&
-               (s->failed && (i == r6s->failed_num[0] ||
-                      i == r6s->failed_num[1]))) {
-               pr_debug("Computing stripe %llu block %d\n",
-                   (unsigned long long)sh->sector, i);
-               compute_block_1(sh, i, 0);
-               s->uptodate++;
-           } else if (s->uptodate == disks-2 && s->failed >= 2) {
-               /* Computing 2-failure is *very* expensive; only
-                * do it if failed >= 2
+           pr_debug("Computing stripe %llu block %d\n",
+               (unsigned long long)sh->sector, disk_idx);
+           set_bit(STRIPE_COMPUTE_RUN, &sh->state);
+           set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
+           set_bit(R5_Wantcompute, &dev->flags);
+           sh->ops.target = disk_idx;
+           sh->ops.target2 = -1; /* no 2nd target */
+           s->req_compute = 1;
+           s->uptodate++;
+           return 1;
+       } else if (s->uptodate == disks-2 && s->failed >= 2) {
+           /* Computing 2-failure is *very* expensive; only
+            * do it if failed >= 2
+            */
+           int other;
+           for (other = disks; other--; ) {
+               if (other == disk_idx)
+                   continue;
+               if (!test_bit(R5_UPTODATE,
+                     &sh->dev[other].flags))
+                   break;
+           }
+           BUG_ON(other < 0

[PATCH 08/11] md: asynchronous handle_parity_check6

2008-12-08 Thread Yuri Tikhonov
This patch introduces the state machine for handling the RAID-6 parity
check and repair functionality.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Ilya Yanok [EMAIL PROTECTED]
---
 drivers/md/raid5.c |  163 +++-
 1 files changed, 110 insertions(+), 53 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f0b47bd..91e5438 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2621,91 +2621,148 @@ static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh,
    struct r6_state *r6s, struct page *tmp_page,
    int disks)
 {
-   int update_p = 0, update_q = 0;
-   struct r5dev *dev;
+   int i;
+   struct r5dev *devs[2] = {NULL, NULL};
    int pd_idx = sh->pd_idx;
    int qd_idx = r6s->qd_idx;
 
    set_bit(STRIPE_HANDLE, &sh->state);
 
    BUG_ON(s->failed > 2);
-   BUG_ON(s->uptodate < disks);
+
/* Want to check and possibly repair P and Q.
 * However there could be one 'failed' device, in which
 * case we can only check one of them, possibly using the
 * other to generate missing data
 */
 
-   /* If !tmp_page, we cannot do the calculations,
-    * but as we have set STRIPE_HANDLE, we will soon be called
-    * by stripe_handle with a tmp_page - just wait until then.
-    */
-   if (tmp_page) {
+   switch (sh->check_state) {
+   case check_state_idle:
+       /* start a new check operation if there are < 2 failures */
        if (s->failed == r6s->q_failed) {
            /* The only possible failed device holds 'Q', so it
             * makes sense to check P (If anything else were failed,
             * we would have used P to recreate it).
             */
-           compute_block_1(sh, pd_idx, 1);
-           if (!page_is_zero(sh->dev[pd_idx].page)) {
-               compute_block_1(sh, pd_idx, 0);
-               update_p = 1;
-           }
+           sh->check_state = check_state_run;
+           set_bit(STRIPE_OP_CHECK_PP, &s->ops_request);
+           clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
+           s->uptodate--;
        }
        if (!r6s->q_failed && s->failed < 2) {
            /* q is not failed, and we didn't use it to generate
             * anything, so it makes sense to check it
             */
-           memcpy(page_address(tmp_page),
-              page_address(sh->dev[qd_idx].page),
-              STRIPE_SIZE);
-           compute_parity6(sh, UPDATE_PARITY);
-           if (memcmp(page_address(tmp_page),
-              page_address(sh->dev[qd_idx].page),
-              STRIPE_SIZE) != 0) {
-               clear_bit(STRIPE_INSYNC, &sh->state);
-               update_q = 1;
-           }
+           sh->check_state = check_state_run;
+           set_bit(STRIPE_OP_CHECK_QP, &s->ops_request);
+           clear_bit(R5_UPTODATE, &sh->dev[qd_idx].flags);
+           s->uptodate--;
        }
-       if (update_p || update_q) {
-           conf->mddev->resync_mismatches += STRIPE_SECTORS;
-           if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
-               /* don't try to repair!! */
-               update_p = update_q = 0;
+       if (sh->check_state == check_state_run) {
+           break;
        }
 
-       /* now write out any block on a failed drive,
-        * or P or Q if they need it
-        */
+       /* we have 2-disk failure */
+       BUG_ON(s->failed != 2);
+       devs[0] = &sh->dev[r6s->failed_num[0]];
+       devs[1] = &sh->dev[r6s->failed_num[1]];
+       /* fall through */
+   case check_state_compute_result:
+       sh->check_state = check_state_idle;
 
-       if (s->failed == 2) {
-           dev = &sh->dev[r6s->failed_num[1]];
-           s->locked++;
-           set_bit(R5_LOCKED, &dev->flags);
-           set_bit(R5_Wantwrite, &dev->flags);
+       BUG_ON((devs[0] && !devs[1]) ||
+              (!devs[0] && devs[1]));
+
+       BUG_ON(s->uptodate < (disks - 1));
+
+       if (!devs[0]) {
+           if (s->failed >= 1)
+               devs[0] = &sh->dev[r6s->failed_num[0]];
+           else
+               devs[0] = &sh->dev[pd_idx];
        }
-       if (s->failed >= 1

[PATCH 10/11] md: remove unused functions

2008-12-08 Thread Yuri Tikhonov
 Some clean-up of the replaced or already unnecessary functions.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Ilya Yanok [EMAIL PROTECTED]
---
 drivers/md/raid5.c |  246 
 1 files changed, 0 insertions(+), 246 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 47b7de3..73307a9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1645,245 +1645,6 @@ static sector_t compute_blocknr(struct stripe_head *sh, 
int i)
 }
 
 
-
-/*
- * Copy data between a page in the stripe cache, and one or more bion
- * The page could align with the middle of the bio, or there could be
- * several bion, each with several bio_vecs, which cover part of the page
- * Multiple bion are linked together on bi_next.  There may be extras
- * at the end of this list.  We ignore them.
- */
-static void copy_data(int frombio, struct bio *bio,
-		     struct page *page,
-		     sector_t sector)
-{
-	char *pa = page_address(page);
-	struct bio_vec *bvl;
-	int i;
-	int page_offset;
-
-	if (bio->bi_sector >= sector)
-		page_offset = (signed)(bio->bi_sector - sector) * 512;
-	else
-		page_offset = (signed)(sector - bio->bi_sector) * -512;
-	bio_for_each_segment(bvl, bio, i) {
-		int len = bio_iovec_idx(bio,i)->bv_len;
-		int clen;
-		int b_offset = 0;
-
-		if (page_offset < 0) {
-			b_offset = -page_offset;
-			page_offset += b_offset;
-			len -= b_offset;
-		}
-
-		if (len > 0 && page_offset + len > STRIPE_SIZE)
-			clen = STRIPE_SIZE - page_offset;
-		else clen = len;
-
-		if (clen > 0) {
-			char *ba = __bio_kmap_atomic(bio, i, KM_USER0);
-			if (frombio)
-				memcpy(pa+page_offset, ba+b_offset, clen);
-			else
-				memcpy(ba+b_offset, pa+page_offset, clen);
-			__bio_kunmap_atomic(ba, KM_USER0);
-		}
-		if (clen < len) /* hit end of page */
-			break;
-		page_offset +=  len;
-	}
-}
-
-#define check_xor()	do {						\
-			   if (count == MAX_XOR_BLOCKS) {		\
-				xor_blocks(count, STRIPE_SIZE, dest, ptr);\
-				count = 0;				\
-			   }						\
-			} while(0)
-
-static void compute_parity6(struct stripe_head *sh, int method)
-{
-	raid6_conf_t *conf = sh->raid_conf;
-	int i, pd_idx = sh->pd_idx, qd_idx, d0_idx, disks = sh->disks, count;
-	struct bio *chosen;
-	/**** FIX THIS: This could be very bad if disks is close to 256 ****/
-	void *ptrs[disks];
-
-	qd_idx = raid6_next_disk(pd_idx, disks);
-	d0_idx = raid6_next_disk(qd_idx, disks);
-
-	pr_debug("compute_parity, stripe %llu, method %d\n",
-		(unsigned long long)sh->sector, method);
-
-	switch(method) {
-	case READ_MODIFY_WRITE:
-		BUG();		/* READ_MODIFY_WRITE N/A for RAID-6 */
-	case RECONSTRUCT_WRITE:
-		for (i= disks; i-- ;)
-			if ( i != pd_idx && i != qd_idx && sh->dev[i].towrite ) {
-				chosen = sh->dev[i].towrite;
-				sh->dev[i].towrite = NULL;
-
-				if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
-					wake_up(&conf->wait_for_overlap);
-
-				BUG_ON(sh->dev[i].written);
-				sh->dev[i].written = chosen;
-			}
-		break;
-	case CHECK_PARITY:
-		BUG();		/* Not implemented yet */
-	}
-
-	for (i = disks; i--;)
-		if (sh->dev[i].written) {
-			sector_t sector = sh->dev[i].sector;
-			struct bio *wbi = sh->dev[i].written;
-			while (wbi && wbi->bi_sector < sector + STRIPE_SECTORS) {
-				copy_data(1, wbi, sh->dev[i].page, sector);
-				wbi = r5_next_bio(wbi, sector);
-			}
-
-			set_bit(R5_LOCKED, &sh->dev[i].flags);
-			set_bit(R5_UPTODATE, &sh->dev[i].flags);
-		}
-
-//	switch(method) {
-//	case RECONSTRUCT_WRITE:
-//	case CHECK_PARITY:
-//	case UPDATE_PARITY:
-	/* Note that unlike RAID-5, the ordering of the disks matters greatly. */
-	/* FIX: Is this ordering of drives even remotely optimal

[PATCH 09/11] md: change handle_stripe6 to work asynchronously

2008-12-08 Thread Yuri Tikhonov
handle_stripe6 function is changed to do things asynchronously.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Ilya Yanok [EMAIL PROTECTED]
---
 drivers/md/raid5.c |  130 
 1 files changed, 90 insertions(+), 40 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 91e5438..47b7de3 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3117,9 +3117,10 @@ static bool handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 
 	r6s.qd_idx = raid6_next_disk(pd_idx, disks);
 	pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
-		"pd_idx=%d, qd_idx=%d\n",
+		"pd_idx=%d, qd_idx=%d, check:%d, reconstruct:%d\n",
 	       (unsigned long long)sh->sector, sh->state,
-	       atomic_read(&sh->count), pd_idx, r6s.qd_idx);
+	       atomic_read(&sh->count), pd_idx, r6s.qd_idx,
+	       sh->check_state, sh->reconstruct_state);
 	memset(&s, 0, sizeof(s));
 
 	spin_lock(&sh->lock);
@@ -3139,35 +3140,24 @@ static bool handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 
 		pr_debug("check %d: state 0x%lx read %p write %p written %p\n",
 			i, dev->flags, dev->toread, dev->towrite, dev->written);
-		/* maybe we can reply to a read */
-		if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread) {
-			struct bio *rbi, *rbi2;
-			pr_debug("Return read for disc %d\n", i);
-			spin_lock_irq(&conf->device_lock);
-			rbi = dev->toread;
-			dev->toread = NULL;
-			if (test_and_clear_bit(R5_Overlap, &dev->flags))
-				wake_up(&conf->wait_for_overlap);
-			spin_unlock_irq(&conf->device_lock);
-			while (rbi && rbi->bi_sector < dev->sector + STRIPE_SECTORS) {
-				copy_data(0, rbi, dev->page, dev->sector);
-				rbi2 = r5_next_bio(rbi, dev->sector);
-				spin_lock_irq(&conf->device_lock);
-				if (!raid5_dec_bi_phys_segments(rbi)) {
-					rbi->bi_next = return_bi;
-					return_bi = rbi;
-				}
-				spin_unlock_irq(&conf->device_lock);
-				rbi = rbi2;
-			}
-		}
+		/* maybe we can reply to a read
+		 *
+		 * new wantfill requests are only permitted while
+		 * ops_complete_biofill is guaranteed to be inactive
+		 */
+		if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread &&
+		    !test_bit(STRIPE_BIOFILL_RUN, &sh->state))
+			set_bit(R5_Wantfill, &dev->flags);
 
 		/* now count some things */
 		if (test_bit(R5_LOCKED, &dev->flags)) s.locked++;
 		if (test_bit(R5_UPTODATE, &dev->flags)) s.uptodate++;
+		if (test_bit(R5_Wantcompute, &dev->flags))
+			BUG_ON(++s.compute > 2);
 
-
-		if (dev->toread)
+		if (test_bit(R5_Wantfill, &dev->flags)) {
+			s.to_fill++;
+		} else if (dev->toread)
 			s.to_read++;
 		if (dev->towrite) {
 			s.to_write++;
@@ -3208,6 +3198,11 @@ static bool handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 		blocked_rdev = NULL;
 	}
 
+	if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
+		set_bit(STRIPE_OP_BIOFILL, &s.ops_request);
+		set_bit(STRIPE_BIOFILL_RUN, &sh->state);
+	}
+
 	pr_debug("locked=%d uptodate=%d to_read=%d"
 	       " to_write=%d failed=%d failed_num=%d,%d\n",
 	       s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
@@ -3248,18 +3243,62 @@ static bool handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 	 * or to load a block that is being partially written.
 	 */
 	if (s.to_read || s.non_overwrite || (s.to_write && s.failed) ||
-	    (s.syncing && (s.uptodate < disks)) || s.expanding)
+	    (s.syncing && (s.uptodate + s.compute < disks)) || s.expanding)
 		handle_stripe_fill6(sh, &s, &r6s, disks);
 
-	/* now to consider writing and what else, if anything should be read */
-	if (s.to_write)
+	/* Now we check to see if any write operations have recently
+	 * completed
+	 */
+	if (sh->reconstruct_state == reconstruct_state_drain_result) {
+		int qd_idx = raid6_next_disk(sh->pd_idx,
+					     conf->raid_disks);
+
+		sh->reconstruct_state = reconstruct_state_idle;
+		/* All the 'written' buffers and the parity blocks are ready

Re[2]: [PATCH 01/11] async_tx: don't use src_list argument of async_xor() for dma addresses

2008-12-08 Thread Yuri Tikhonov
On Tuesday, December 9, 2008 you wrote:

 On Mon, Dec 8, 2008 at 2:55 PM, Yuri Tikhonov [EMAIL PROTECTED] wrote:
 Using the src_list argument of async_xor() as storage for dma addresses
 implies a sizeof(dma_addr_t) <= sizeof(struct page *) restriction which is
 not always true (e.g. ppc440spe).


 ppc440spe runs with CONFIG_PHYS_64BIT?

 Yep. It uses 36-bit addressing, so this CONFIG is turned on.

 If we do this then we need to also change md to limit the number of
 allowed disks based on the kernel stack size.  Because with 256 disks
 a 4K stack can be consumed by one call to async_pq ((256 sources in
 raid5.c + 256 sources async_pq.c) * 8 bytes per source on 64-bit).

 On ppc440spe we have an 8KB stack, so things are no worse than on 
32-bit archs with a 4KB stack. Thus, I guess no changes to md are 
required because of this patch. Right?
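
 [Editorial note: the stack arithmetic behind Dan's concern above can be
written out as a rough sketch; the sizes are illustrative, not measured
kernel numbers, and pq_stack_bytes() is a made-up helper:]

```c
#include <assert.h>
#include <stdint.h>

/* With the maximum of 256 raid disks, one async_pq call keeps two
 * pointer lists alive at once: the caller's list in raid5.c plus
 * async_pq.c's own copy. */
static unsigned pq_stack_bytes(unsigned raid_disks, unsigned ptr_size)
{
	/* two lists of raid_disks entries each */
	return 2 * raid_disks * ptr_size;
}
```

With 8-byte entries (a 64-bit build, or a 32-bit build whose dma_addr_t is 64-bit) the two lists alone consume 4KB, i.e. an entire 4K stack, but still fit within the 8KB stacks used on ppc440spe.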

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re[2]: [PATCH 11/11] ppc440spe-adma: ADMA driver for PPC440SP(e) systems

2008-12-08 Thread Yuri Tikhonov

 Hello Josh,

 If you are still intending to review our ppc440spe ADMA driver 
(thanks in advance if so), then please use the driver from my latest 
post as the reference:

 http://ozlabs.org/pipermail/linuxppc-dev/2008-December/065983.html

since it contains some updates relative to the November version.

On Thursday, November 13, 2008 you wrote:

 On Thu, 13 Nov 2008 20:50:43 +0300
 Ilya Yanok [EMAIL PROTECTED] wrote:

 Josh Boyer wrote:
  On Thu, Nov 13, 2008 at 06:16:04PM +0300, Ilya Yanok wrote:

  Adds the platform device definitions and the architecture specific support
  routines for the ppc440spe adma driver.
 
  Any board equipped with PPC440SP(e) controller may utilize this driver.
 
  Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
  Signed-off-by: Ilya Yanok [EMAIL PROTECTED]
  
 
  Before I really dig into reviewing this driver, I'm going to ask you as 
  simple
  question.  This looks like a 1/2 completed port of an arch/ppc driver that 
  uses
  the device tree (incorrectly) to get the interrupt resources and that's 
  about it.
  Otherwise, it's just a straight up platform device driver.  Is that 
  correct?

 
 Yep, that's correct.

 OK.

  If that is the case, I think the driver needs more work before it can be 
  merged.
  It should get the DCR and MMIO resources from the device tree as well.  It 
  should
  be binding on compatible properties and not based on device tree paths.  
  And it
  should probably be an of_platform device driver.

 
 Surely, you're right. I agree with you that this driver isn't ready
 for merging. But it works, so we'd like to publish it so that interested
 people can use it and test it.

 And that's fine.  I just wanted to see where you were headed with this
 one for now.  I'll try to do a review in the next few days.  Thanks for
 posting.

 josh
 --
 To unsubscribe from this list: send the line unsubscribe linux-raid in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


[PATCH][v2] xsysace: use resource_size_t instead of unsigned long

2008-11-27 Thread Yuri Tikhonov
Use resource_size_t for physical address of SystemACE
chip. This fixes the driver brokeness for 32 bit systems
with 64 bit resources (e.g. PPC440SPe).

Also this patch adds one more compatible string for more
clean description of the hardware, and fixes a sector_t-
related compilation warning.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Ilya Yanok [EMAIL PROTECTED]
---
 drivers/block/xsysace.c |   24 +---
 1 files changed, 13 insertions(+), 11 deletions(-)
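
[Editorial note: the printing idiom this patch adopts can be sketched in
userspace as follows; `my_resource_size_t` and `format_physaddr` are
hypothetical stand-ins for illustration, not kernel APIs:]

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the kernel's resource_size_t, which becomes u64 on
 * 32-bit parts with 64-bit physical addressing (e.g. PPC440SPe with
 * 36-bit addresses). */
typedef uint64_t my_resource_size_t;

static void format_physaddr(char *buf, size_t buflen,
			    my_resource_size_t physaddr)
{
	/* "%lx" with unsigned long would truncate above 4GB on 32-bit;
	 * the patch casts to unsigned long long and uses "%llx". */
	snprintf(buf, buflen, "0x%llx", (unsigned long long)physaddr);
}
```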

diff --git a/drivers/block/xsysace.c b/drivers/block/xsysace.c
index ecab9e6..9efd3d7 100644
--- a/drivers/block/xsysace.c
+++ b/drivers/block/xsysace.c
@@ -194,7 +194,7 @@ struct ace_device {
int in_irq;
 
/* Details of hardware device */
-   unsigned long physaddr;
+   resource_size_t physaddr;
void __iomem *baseaddr;
int irq;
int bus_width;  /* 0 := 8 bit; 1 := 16 bit */
@@ -628,8 +628,8 @@ static void ace_fsm_dostate(struct ace_device *ace)
 
/* Okay, it's a data request, set it up for transfer */
 	dev_dbg(ace->dev,
-		"request: sec=%lx hcnt=%lx, ccnt=%x, dir=%i\n",
-		req->sector, req->hard_nr_sectors,
+		"request: sec=%llx hcnt=%lx, ccnt=%x, dir=%i\n",
+		(unsigned long long)req->sector, req->hard_nr_sectors,
 		req->current_nr_sectors, rq_data_dir(req));
 
 	ace->req = req;
@@ -935,7 +935,8 @@ static int __devinit ace_setup(struct ace_device *ace)
int rc;
 
 	dev_dbg(ace->dev, "ace_setup(ace=0x%p)\n", ace);
-	dev_dbg(ace->dev, "physaddr=0x%lx irq=%i\n", ace->physaddr, ace->irq);
+	dev_dbg(ace->dev, "physaddr=0x%llx irq=%i\n",
+		(unsigned long long)ace->physaddr, ace->irq);
 
 	spin_lock_init(&ace->lock);
 	init_completion(&ace->id_completion);
@@ -1017,8 +1018,8 @@ static int __devinit ace_setup(struct ace_device *ace)
/* Print the identification */
 	dev_info(ace->dev, "Xilinx SystemACE revision %i.%i.%i\n",
 		 (version >> 12) & 0xf, (version >> 8) & 0x0f, version & 0xff);
-	dev_dbg(ace->dev, "physaddr 0x%lx, mapped to 0x%p, irq=%i\n",
-		ace->physaddr, ace->baseaddr, ace->irq);
+	dev_dbg(ace->dev, "physaddr 0x%llx, mapped to 0x%p, irq=%i\n",
+		(unsigned long long)ace->physaddr, ace->baseaddr, ace->irq);
 
 	ace->media_change = 1;
 	ace_revalidate_disk(ace->gd);
@@ -1035,8 +1036,8 @@ err_alloc_disk:
 err_blk_initq:
 	iounmap(ace->baseaddr);
 err_ioremap:
-	dev_info(ace->dev, "xsysace: error initializing device at 0x%lx\n",
-		 ace->physaddr);
+	dev_info(ace->dev, "xsysace: error initializing device at 0x%llx\n",
+		 (unsigned long long)ace->physaddr);
return -ENOMEM;
 }
 
@@ -1059,7 +1060,7 @@ static void __devexit ace_teardown(struct ace_device 
*ace)
 }
 
 static int __devinit
-ace_alloc(struct device *dev, int id, unsigned long physaddr,
+ace_alloc(struct device *dev, int id, resource_size_t physaddr,
  int irq, int bus_width)
 {
struct ace_device *ace;
@@ -1119,7 +1120,7 @@ static void __devexit ace_free(struct device *dev)
 
 static int __devinit ace_probe(struct platform_device *dev)
 {
-   unsigned long physaddr = 0;
+   resource_size_t physaddr = 0;
int bus_width = ACE_BUS_WIDTH_16; /* FIXME: should not be hard coded */
int id = dev-id;
int irq = NO_IRQ;
@@ -1165,7 +1166,7 @@ static int __devinit
 ace_of_probe(struct of_device *op, const struct of_device_id *match)
 {
struct resource res;
-   unsigned long physaddr;
+   resource_size_t physaddr;
const u32 *id;
int irq, bus_width, rc;
 
@@ -1205,6 +1206,7 @@ static struct of_device_id ace_of_match[] __devinitdata 
= {
 	{ .compatible = "xlnx,opb-sysace-1.00.b", },
 	{ .compatible = "xlnx,opb-sysace-1.00.c", },
 	{ .compatible = "xlnx,xps-sysace-1.00.a", },
+	{ .compatible = "xlnx,sysace", },
{},
 };
 MODULE_DEVICE_TABLE(of, ace_of_match);
-- 
1.5.6.1
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: [PATCH][v2] xsysace: use resource_size_t instead of unsigned long

2008-11-27 Thread Yuri Tikhonov
I'm sorry, but the patch I've just posted turned out to be corrupted. The
correct one is below.

---
Use resource_size_t for physical address of SystemACE
chip. This fixes the driver brokeness for 32 bit systems
with 64 bit resources (e.g. PPC440SPe).

Also this patch adds one more compatible string for more
clean description of the hardware, and fixes a sector_t-
related compilation warning.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Ilya Yanok [EMAIL PROTECTED]
---
 drivers/block/xsysace.c |   24 +---
 1 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/drivers/block/xsysace.c b/drivers/block/xsysace.c
index ecab9e6..9efd3d7 100644
--- a/drivers/block/xsysace.c
+++ b/drivers/block/xsysace.c
@@ -194,7 +194,7 @@ struct ace_device {
int in_irq;
 
/* Details of hardware device */
-   unsigned long physaddr;
+   resource_size_t physaddr;
void __iomem *baseaddr;
int irq;
int bus_width;  /* 0 := 8 bit; 1 := 16 bit */
@@ -628,8 +628,8 @@ static void ace_fsm_dostate(struct ace_device *ace)
 
/* Okay, it's a data request, set it up for transfer */
 	dev_dbg(ace->dev,
-		"request: sec=%lx hcnt=%lx, ccnt=%x, dir=%i\n",
-		req->sector, req->hard_nr_sectors,
+		"request: sec=%llx hcnt=%lx, ccnt=%x, dir=%i\n",
+		(unsigned long long)req->sector, req->hard_nr_sectors,
 		req->current_nr_sectors, rq_data_dir(req));
 
 	ace->req = req;
@@ -935,7 +935,8 @@ static int __devinit ace_setup(struct ace_device *ace)
int rc;
 
 	dev_dbg(ace->dev, "ace_setup(ace=0x%p)\n", ace);
-	dev_dbg(ace->dev, "physaddr=0x%lx irq=%i\n", ace->physaddr, ace->irq);
+	dev_dbg(ace->dev, "physaddr=0x%llx irq=%i\n",
+		(unsigned long long)ace->physaddr, ace->irq);
 
 	spin_lock_init(&ace->lock);
 	init_completion(&ace->id_completion);
@@ -1017,8 +1018,8 @@ static int __devinit ace_setup(struct ace_device *ace)
/* Print the identification */
 	dev_info(ace->dev, "Xilinx SystemACE revision %i.%i.%i\n",
 		 (version >> 12) & 0xf, (version >> 8) & 0x0f, version & 0xff);
-	dev_dbg(ace->dev, "physaddr 0x%lx, mapped to 0x%p, irq=%i\n",
-		ace->physaddr, ace->baseaddr, ace->irq);
+	dev_dbg(ace->dev, "physaddr 0x%llx, mapped to 0x%p, irq=%i\n",
+		(unsigned long long)ace->physaddr, ace->baseaddr, ace->irq);
 
 	ace->media_change = 1;
 	ace_revalidate_disk(ace->gd);
@@ -1035,8 +1036,8 @@ err_alloc_disk:
 err_blk_initq:
 	iounmap(ace->baseaddr);
 err_ioremap:
-	dev_info(ace->dev, "xsysace: error initializing device at 0x%lx\n",
-		 ace->physaddr);
+	dev_info(ace->dev, "xsysace: error initializing device at 0x%llx\n",
+		 (unsigned long long)ace->physaddr);
return -ENOMEM;
 }
 
@@ -1059,7 +1060,7 @@ static void __devexit ace_teardown(struct ace_device *ace)
 }
 
 static int __devinit
-ace_alloc(struct device *dev, int id, unsigned long physaddr,
+ace_alloc(struct device *dev, int id, resource_size_t physaddr,
  int irq, int bus_width)
 {
struct ace_device *ace;
@@ -1119,7 +1120,7 @@ static void __devexit ace_free(struct device *dev)
 
 static int __devinit ace_probe(struct platform_device *dev)
 {
-   unsigned long physaddr = 0;
+   resource_size_t physaddr = 0;
int bus_width = ACE_BUS_WIDTH_16; /* FIXME: should not be hard coded */
int id = dev-id;
int irq = NO_IRQ;
@@ -1165,7 +1166,7 @@ static int __devinit
 ace_of_probe(struct of_device *op, const struct of_device_id *match)
 {
struct resource res;
-   unsigned long physaddr;
+   resource_size_t physaddr;
const u32 *id;
int irq, bus_width, rc;
 
@@ -1205,6 +1206,7 @@ static struct of_device_id ace_of_match[] __devinitdata = 
{
 	{ .compatible = "xlnx,opb-sysace-1.00.b", },
 	{ .compatible = "xlnx,opb-sysace-1.00.c", },
 	{ .compatible = "xlnx,xps-sysace-1.00.a", },
+	{ .compatible = "xlnx,sysace", },
{},
 };
 MODULE_DEVICE_TABLE(of, ace_of_match);
-- 
1.5.6.1
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: [PATCH][v2] xsysace: use resource_size_t instead of unsigned long

2008-11-27 Thread Yuri Tikhonov

 Hello Grant,

On Thursday 27 November 2008 17:11, Grant Likely wrote:
 On Thu, Nov 27, 2008 at 5:21 AM, Yuri Tikhonov [EMAIL PROTECTED] wrote:
  Use resource_size_t for physical address of SystemACE
  chip. This fixes the driver brokeness for 32 bit systems
  with 64 bit resources (e.g. PPC440SPe).
 
 Hey Yuri,
 
 I actually already picked up the last version of your patch after
 fixing it up myself.  It's currently sitting in Paul's powerpc tree
 and it will be merged into mainline when Linus gets back from
 vacation.

 Oops. Indeed. Thanks.

 
 Can you please spin a new version with just the addition of the
 compatible value and base it on Paul's tree.

 Sure. I've generated the patch against the origin/merge branch of Paul's 
tree, and I'm posting it separately as "[PATCH] xsysace: add compatible
string".

 Regards, Yuri
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


[PATCH] xsysace: add compatible string

2008-11-27 Thread Yuri Tikhonov
Add one more compatible string to the table for
of_platform binding, so that the platforms, which
have the SysACE chip on board (e.g. Katmai), could
describe it in their device trees correctly.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
---
 drivers/block/xsysace.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/block/xsysace.c b/drivers/block/xsysace.c
index 29e1dfa..381d686 100644
--- a/drivers/block/xsysace.c
+++ b/drivers/block/xsysace.c
@@ -1206,6 +1206,7 @@ static struct of_device_id ace_of_match[] __devinitdata = 
{
 	{ .compatible = "xlnx,opb-sysace-1.00.b", },
 	{ .compatible = "xlnx,opb-sysace-1.00.c", },
 	{ .compatible = "xlnx,xps-sysace-1.00.a", },
+	{ .compatible = "xlnx,sysace", },
{},
 };
 MODULE_DEVICE_TABLE(of, ace_of_match);
-- 
1.5.6.1
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re[4]: [PATCH] katmai.dts: extend DMA ranges; add dma/sysace nodes

2008-11-26 Thread Yuri Tikhonov
 
0x;
 
		/* This drives busses 10 to 0x1f */
		bus-range = <0x30 0x3f>;
diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index 77c7fa0..adbeb19 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -59,7 +59,7 @@ extern int check_legacy_ioport(unsigned long base_port);
 
 extern unsigned long isa_io_base;
 extern unsigned long pci_io_base;
-extern unsigned long pci_dram_offset;
+extern resource_size_t pci_dram_offset;
 
 extern resource_size_t isa_mem_base;
 
@@ -728,14 +728,14 @@ static inline void * phys_to_virt(unsigned long address)
  */
 #ifdef CONFIG_PPC32
 
-static inline unsigned long virt_to_bus(volatile void * address)
+static inline resource_size_t virt_to_bus(volatile void * address)
 {
 if (address == NULL)
return 0;
 return __pa(address) + PCI_DRAM_OFFSET;
 }
 
-static inline void * bus_to_virt(unsigned long address)
+static inline void * bus_to_virt(resource_size_t address)
 {
 if (address == 0)
return NULL;
diff --git a/arch/powerpc/kernel/pci_32.c b/arch/powerpc/kernel/pci_32.c
index 88db4ff..5855937 100644
--- a/arch/powerpc/kernel/pci_32.c
+++ b/arch/powerpc/kernel/pci_32.c
@@ -33,7 +33,7 @@
 #endif
 
 unsigned long isa_io_base = 0;
-unsigned long pci_dram_offset = 0;
+resource_size_t pci_dram_offset = 0;
 int pcibios_assign_bus_offset = 1;
 
 void pcibios_make_OF_bus_map(void);
diff --git a/arch/powerpc/sysdev/ppc4xx_pci.c b/arch/powerpc/sysdev/ppc4xx_pci.c
index afbdd48..f748c5b 100644
--- a/arch/powerpc/sysdev/ppc4xx_pci.c
+++ b/arch/powerpc/sysdev/ppc4xx_pci.c
@@ -126,10 +126,8 @@ static int __init ppc4xx_parse_dma_ranges(struct pci_controller *hose,
 		if ((pci_space & 0x03000000) != 0x02000000)
 			continue;
 
-		/* We currently only support memory at 0, and pci_addr
-		 * within 32 bits space
-		 */
-		if (cpu_addr != 0 || pci_addr > 0xffffffff) {
+		/* We currently only support memory at 0 */
+		if (cpu_addr != 0) {
 			printk(KERN_WARNING "%s: Ignored unsupported dma range"
 			       " 0x%016llx...0x%016llx -> 0x%016llx\n",
 			       hose->dn->full_name,
@@ -179,18 +177,12 @@ static int __init ppc4xx_parse_dma_ranges(struct pci_controller *hose,
 		return -ENXIO;
 	}
 
-	/* Check that we are fully contained within 32 bits space */
-	if (res->end > 0xffffffff) {
-		printk(KERN_ERR "%s: dma-ranges outside of 32 bits space\n",
-		       hose->dn->full_name);
-		return -ENXIO;
-	}
  out:
 	dma_offset_set = 1;
 	pci_dram_offset = res->start;
 
-	printk(KERN_INFO "4xx PCI DMA offset set to 0x%08lx\n",
-	       pci_dram_offset);
+	printk(KERN_INFO "4xx PCI DMA offset set to 0x%016llx\n",
+	       (unsigned long long)pci_dram_offset);
 	return 0;
 }


 Any ideas ?

 If you really need 32-bit DMA support, you'll have to wait for swiotlb
 from Becky or work with her in bringing it to powerpc so that we can do
 bounce buffering for those devices.

 Ben.


 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re[4]: [2/2] powerpc: support for 256K pages on PPC 44x

2008-11-26 Thread Yuri Tikhonov

 Hello Milton,

On Friday, November 14, 2008 you wrote:

 On Nov 13, 2008, at 10:32 PM, Yuri Tikhonov wrote:
 On Tuesday, November 11, 2008 Milton Miller wrote:
  #ifdef CONFIG_PTE_64BIT
  typedef unsigned long long pte_basic_t;
 +#ifdef CONFIG_PPC_256K_PAGES
 +#define PTE_SHIFT   (PAGE_SHIFT - 7)

 This seems to be missing the comment on how many ptes are actually 
 in
 the page that are in the other if and else cases.

 Ok. I'll fix this. Actually it's another hack: we don't use full page
 for PTE table because we need to reserve something for PGD

 I don't understand "we need to reserve something for PGD".  Do you
 mean that you would not require a second page for the PGD because the
 full pagetable could fit in one page?
 ...
 That does imply you want to allocate the pte page from a slab instead
 of pgalloc.  Is that covered?

  Well, in case of 256K PAGE_SIZE we do not need the PGD level indeed
 (18 bits are used for offset, and remaining 14 bits are for PTE index
 inside the PTE table). Even the full 256K PTE page isn't necessary to
 cover the full range: only half of it would be enough (with 14 bits we
 can address only 16K PTEs).

  But the head_44x.S code is essentially based on the assumption of
 2-level page addressing. Also, I may guess that eliminating of the
 PGD level won't be as easy as just a re-implementation of the TLB-miss
 handlers in head_44x.S. So, the current approach for 256K-pages
 support was just a compromise between the required for the project
 functionality, and the effort necessary to achieve it.

 So are you allocating the  PAGE_SIZE levels from slabs (either kmalloc
 or dedicated) instead of allocating pages?   Or are you wasting the 
 extra space?

 Yes, wasting the extra space is what happens here.

 At a very minimum you need to comment this in the code.  If I were 
 maintiner I would say not wasting large fractions of pages when the 
 page size is 256k would be my merge requirement.  As I said, I'm fine 
 with keeping the page table two levels, but the tradeoff needs to be 
 documented.

 Agree, we'll document this fact, and re-submit the patch.
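
 [Editorial note: the address-split arithmetic discussed above can be
checked with some back-of-envelope C; this is plain arithmetic, not
kernel code, and the constant names are made up for the sketch:]

```c
/* 32-bit virtual addresses, 256K pages, 64-bit PTEs */
enum {
	VA_BITS		= 32,
	PAGE_SHIFT_256K	= 18,			/* 256K pages */
	PTE_SIZE	= 8,			/* CONFIG_PTE_64BIT */
	PTE_SHIFT_256K	= PAGE_SHIFT_256K - 7,	/* the patch's PTE_SHIFT */
};

static long full_pte_table_bytes(void)
{
	/* a single-level table covering the whole 32-bit VA space */
	return (1L << (VA_BITS - PAGE_SHIFT_256K)) * PTE_SIZE;
}

static long two_level_pte_page_used(void)
{
	/* bytes actually used in each 256K PTE page under the
	 * two-level scheme the patch keeps */
	return (1L << PTE_SHIFT_256K) * PTE_SIZE;
}
```

The numbers bear out both points in the thread: a single-level table would need only 128K (half of one 256K page), while the two-level layout uses just 16K of each 256K PTE page, which is the wasted space Milton asked to have documented.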

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re[6]: [PATCH] katmai.dts: extend DMA ranges; add dma/sysace nodes

2008-11-26 Thread Yuri Tikhonov
On Thursday, November 27, 2008 you wrote:

  I've implemented (2) (the code is below), and it works. But, 
 admittedly, this (working) looks strange to me because of the 
 following:
  To be able to use 64-bit PCI mapping on PPC32 I had to replace the
 'unsigned long' type of pci_dram_offset with 'resource_size_t', which 
 on ppc440spe is 'u64'. So, in dma_alloc_coherent() I put the 64-bit 
 value into the 'dma_addr_t' handle. I use 2.6.27 kernel for testing, 
 which has sizeof(dma_addr_t) == sizeof(u32). Thus, 
 dma_alloc_coherent() cuts the upper 32 bits of PCI address, and returns 
 only low 32-bit part of PCI address to its caller. And, regardless of 
 this fact, the PCI device does operate somehow (this is the PCI-E LSI 
 disk controller served by the drivers/message/fusion/mptbase.c + 
 mptsas.c drivers).
 
  I've verified that ppc440spe PCI-E bridge's BARs (PECFGn_BAR0L,H) are 
 configured with the new, 1TB, address value:

 Strange... when I look at pci4xx_parse_dma_ranges() I see it
 specifically avoiding PCI addresses above 4G ... That needs fixing.

 Right, it avoids them. I guess you haven't read my e-mail to the end, 
because my work-around patch, which I referenced there, fixes this :)

diff --git a/arch/powerpc/sysdev/ppc4xx_pci.c b/arch/powerpc/sysdev/ppc4xx_pci.c
index afbdd48..f748c5b 100644
--- a/arch/powerpc/sysdev/ppc4xx_pci.c
+++ b/arch/powerpc/sysdev/ppc4xx_pci.c
@@ -126,10 +126,8 @@ static int __init ppc4xx_parse_dma_ranges(struct pci_controller *hose,
 		if ((pci_space & 0x03000000) != 0x02000000)
 			continue;
 
-		/* We currently only support memory at 0, and pci_addr
-		 * within 32 bits space
-		 */
-		if (cpu_addr != 0 || pci_addr > 0xffffffff) {
+		/* We currently only support memory at 0 */
+		if (cpu_addr != 0) {
 			printk(KERN_WARNING "%s: Ignored unsupported dma range"
 			       " 0x%016llx...0x%016llx -> 0x%016llx\n",
 			       hose->dn->full_name,
@@ -179,18 +177,12 @@ static int __init ppc4xx_parse_dma_ranges(struct pci_controller *hose,
 		return -ENXIO;
 	}
 
-	/* Check that we are fully contained within 32 bits space */
-	if (res->end > 0xffffffff) {
-		printk(KERN_ERR "%s: dma-ranges outside of 32 bits space\n",
-		       hose->dn->full_name);
-		return -ENXIO;
-	}
  out:
 	dma_offset_set = 1;
 	pci_dram_offset = res->start;
 
-	printk(KERN_INFO "4xx PCI DMA offset set to 0x%08lx\n",
-	       pci_dram_offset);
+	printk(KERN_INFO "4xx PCI DMA offset set to 0x%016llx\n",
+	       (unsigned long long)pci_dram_offset);
 	return 0;
 }

 To implement that trick you definitely need to make dma_addr_t 64 bits.

 Sure. The problem here is that the LSI (the PCI device I want to DMA 
to/from 1TB PCI addresses) driver doesn't work with this (i.e. it's 
broken in, e.g., 2.6.28-rc6) on ppc440spe-based platform. It looks 
like there is no support for 32-bit CPUs with 64-bit physical 
addresses in the LSI driver. E.g. the following mix in the 
drivers/message/fusion/mptbase.h code points to the fact that the 
driver supposes 64-bit dma_addr_t on 64-bit CPUs only:

#ifdef CONFIG_64BIT
#define CAST_U32_TO_PTR(x)  ((void *)(u64)x)
#define CAST_PTR_TO_U32(x)  ((u32)(u64)x)
#else
#define CAST_U32_TO_PTR(x)  ((void *)x)
#define CAST_PTR_TO_U32(x)  ((u32)x)
#endif


#define mpt_addr_size() \
	((sizeof(dma_addr_t) == sizeof(u64)) ? MPI_SGE_FLAGS_64_BIT_ADDRESSING : \
		MPI_SGE_FLAGS_32_BIT_ADDRESSING)
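

[Editorial note: the truncation being described can be sketched like
this; `dma_addr32_t` and `store_handle` are illustrative stand-ins for
a build whose dma_addr_t is 32-bit, not fusion-driver code:]

```c
#include <stdint.h>

/* The fusion driver keys 64-bit SGE addressing off sizeof(dma_addr_t),
 * so on a 32-bit CPU built without a 64-bit dma_addr_t the upper half
 * of a >4GB bus address is silently lost. */
typedef uint32_t dma_addr32_t;

static dma_addr32_t store_handle(uint64_t bus_addr)
{
	/* what storing the handle effectively does when dma_addr_t is
	 * only 32 bits wide */
	return (dma_addr32_t)bus_addr;
}
```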



 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re[2]: [PATCH 02/11] async_tx: add support for asynchronous GF multiplication

2008-11-26 Thread Yuri Tikhonov

 Hello Dan,

On Saturday, November 15, 2008 you wrote:

 A few comments

 Thanks.

 1/ I don't see code for handling cases where the src_cnt exceeds the
 hardware maximum.

 Right, actually the ADMA devices we used (ppc440spe DMA engines) have 
no limitation on src_cnt (well, actually there is a limit - the size 
of the descriptor FIFO - but it's greater than the number of drives 
which may be handled with the current RAID-6 driver, i.e. > 256), but 
I agree - the ASYNC_TX functions should not assume that every ADMA 
device will have such a feature. So we'll implement this, and then 
re-post the patches.

 2/ dmaengine.h defines DMA_PQ_XOR but these patches should really
 change that to DMA_PQ and do s/pqxor/pq/ across the rest of the code
 base.

 OK.

 3/ In my implementation (unfinished) of async_pq I decided to make the
 prototype:

 May I ask whether you plan to finish and release your 
implementation?


 +/**
 + * async_pq - attempt to generate p (xor) and q (Reed-Solomon code) with a
 + * dma engine for a given set of blocks.  This routine assumes a field of
 + * GF(2^8) with a primitive polynomial of 0x11d and a generator of {02}.
 + * In the synchronous case the p and q blocks are used as temporary
 + * storage whereas dma engines have their own internal buffers.  The
 + * ASYNC_TX_PQ_ZERO_P and ASYNC_TX_PQ_ZERO_Q flags clear the
 + * destination(s) before they are used.
 + * @blocks: source block array ordered from 0..src_cnt with the p destination
 + * at blocks[src_cnt] and q at blocks[src_cnt + 1]
 + * NOTE: client code must assume the contents of this array are destroyed
 + * @offset: offset in pages to start transaction
 + * @src_cnt: number of source pages: 2 < src_cnt <= 255
 + * @len: length in bytes
 + * @flags: ASYNC_TX_ACK, ASYNC_TX_DEP_ACK
 + * @depend_tx: p+q operation depends on the result of this transaction.
 + * @cb_fn: function to call when p+q generation completes
 + * @cb_param: parameter to pass to the callback routine
 + */
 +struct dma_async_tx_descriptor *
 +async_pq(struct page **blocks, unsigned int offset, int src_cnt, size_t len,
 +enum async_tx_flags flags, struct dma_async_tx_descriptor *depend_tx,
 +dma_async_tx_callback cb_fn, void *cb_param)

 Where p and q are not specified separately.  This matches more closely
 how the current gen_syndrome is specified with the goal of not
 requiring any changes to existing software raid6 interface.
 Thoughts?

 Understood. Our goal was to stay closer to the ASYNC_TX interfaces, 
so we specified the destinations separately. That said, I'm fine with 
your prototype (doubling the same address is no good), so we'll 
change this. 
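
 [Editorial note: as a reference point for the prototype discussed
above, a synchronous P/Q computation over GF(2^8) with the 0x11d
polynomial and generator {02} can be sketched as below; `sw_pq` is a
hypothetical helper mirroring the proposed blocks[] layout (sources
first, then P at blocks[src_cnt], Q at blocks[src_cnt + 1]), not the
actual async_tx code:]

```c
#include <stddef.h>
#include <stdint.h>

/* GF(2^8) multiply-by-{02} with primitive polynomial 0x11d, as used
 * by the RAID-6 syndrome. */
static uint8_t gf_mul2(uint8_t x)
{
	return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1d : 0));
}

/* Reference P/Q generation: P = xor of sources, Q = sum of
 * {02}^d * D_d, evaluated by Horner's scheme from the highest-index
 * source down. */
static void sw_pq(uint8_t **blocks, int src_cnt, size_t len)
{
	uint8_t *p = blocks[src_cnt], *q = blocks[src_cnt + 1];
	size_t i;
	int d;

	for (i = 0; i < len; i++) {
		uint8_t pv = 0, qv = 0;

		for (d = src_cnt - 1; d >= 0; d--) {
			pv ^= blocks[d][i];
			qv = gf_mul2(qv) ^ blocks[d][i];
		}
		p[i] = pv;
		q[i] = qv;
	}
}
```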

 Any comments regarding the drivers/md/raid5.c part ?

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


[PATCH] katmai.dts: extend DMA ranges; add dma/sysace nodes

2008-11-13 Thread Yuri Tikhonov
Hello,

This patch extends DMA ranges for PCI(X) to 4GB, so that it could
work on Katmais with 4GB RAM installed.

Add new nodes for the PPC440SPe DMA, XOR engines to
be used in the PPC440SPe ADMA driver, and the SysACE
controller, which connects Compact Flash to Katmai.

Signed-off-by: Ilya Yanok [EMAIL PROTECTED]
Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
---
 arch/powerpc/boot/dts/katmai.dts |   46 +++--
 1 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/boot/dts/katmai.dts b/arch/powerpc/boot/dts/katmai.dts
index 077819b..7749478 100644
--- a/arch/powerpc/boot/dts/katmai.dts
+++ b/arch/powerpc/boot/dts/katmai.dts
@@ -245,8 +245,8 @@
ranges = 0x0200 0x 0x8000 0x000d 
0x8000 0x 0x8000
  0x0100 0x 0x 0x000c 
0x0800 0x 0x0001;
 
-   /* Inbound 2GB range starting at 0 */
-   dma-ranges = 0x4200 0x0 0x0 0x0 0x0 0x0 
0x8000;
+   /* Inbound 4GB range starting at 0 */
+   dma-ranges = 0x4200 0x0 0x0 0x0 0x0 0x1 
0x;
 
/* This drives busses 0 to 0xf */
bus-range = 0x0 0xf;
@@ -289,8 +289,8 @@
ranges = 0x0200 0x 0x8000 0x000e 
0x 0x 0x8000
  0x0100 0x 0x 0x000f 
0x8000 0x 0x0001;
 
-   /* Inbound 2GB range starting at 0 */
-   dma-ranges = 0x4200 0x0 0x0 0x0 0x0 0x0 
0x8000;
+   /* Inbound 4GB range starting at 0 */
+   dma-ranges = 0x4200 0x0 0x0 0x0 0x0 0x1 
0x;
 
/* This drives busses 10 to 0x1f */
bus-range = 0x10 0x1f;
@@ -330,8 +330,8 @@
ranges = 0x0200 0x 0x8000 0x000e 
0x8000 0x 0x8000
  0x0100 0x 0x 0x000f 
0x8001 0x 0x0001;
 
-   /* Inbound 2GB range starting at 0 */
-   dma-ranges = 0x4200 0x0 0x0 0x0 0x0 0x0 
0x8000;
+   /* Inbound 4GB range starting at 0 */
+   dma-ranges = 0x4200 0x0 0x0 0x0 0x0 0x1 
0x;
 
/* This drives busses 10 to 0x1f */
bus-range = 0x20 0x2f;
@@ -371,8 +371,8 @@
ranges = 0x0200 0x 0x8000 0x000f 
0x 0x 0x8000
  0x0100 0x 0x 0x000f 
0x8002 0x 0x0001;
 
-   /* Inbound 2GB range starting at 0 */
-   dma-ranges = 0x4200 0x0 0x0 0x0 0x0 0x0 
0x8000;
+   /* Inbound 4GB range starting at 0 */
+   dma-ranges = 0x4200 0x0 0x0 0x0 0x0 0x1 
0x;
 
/* This drives busses 10 to 0x1f */
bus-range = 0x30 0x3f;
@@ -392,6 +392,36 @@
0x0 0x0 0x0 0x3 UIC3 0xa 0x4 /* swizzled int C 
*/
0x0 0x0 0x0 0x4 UIC3 0xb 0x4 /* swizzled int D 
*/;
};
+   DMA0: dma0 {
+   interrupt-parent = DMA0;
+   interrupts = 0 1;
+   #interrupt-cells = 1;
+   #address-cells = 0;
+   #size-cells = 0;
+   interrupt-map = 
+   0 UIC0 0x14 4
+   1 UIC1 0x16 4;
+   };
+   DMA1: dma1 {
+   interrupt-parent = DMA1;
+   interrupts = 0 1;
+   #interrupt-cells = 1;
+   #address-cells = 0;
+   #size-cells = 0;
+   interrupt-map = 
+   0 UIC0 0x16 4
+   1 UIC1 0x16 4;
+   };
+   xor {
+   interrupt-parent = UIC1;
+   interrupts = 0x1f 4;
+   };
+   [EMAIL PROTECTED] {
+   compatible = xlnx,opb-sysace-1.00.b;
+   interrupt-parent = UIC2;
+   interrupts = 0x19 4;
+   reg = 0x0004 0xfe00 0x100;
+   };
};
 
chosen {
-- 
1.5.6.1


-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com


Re[2]: [2/2] powerpc: support for 256K pages on PPC 44x

2008-11-13 Thread Yuri Tikhonov
Hello Milton,

On Tuesday, November 11, 2008 Milton Miller wrote:

[snip]


  #ifdef CONFIG_PTE_64BIT
  typedef unsigned long long pte_basic_t;
 +#ifdef CONFIG_PPC_256K_PAGES
 +#define PTE_SHIFT   (PAGE_SHIFT - 7)

 This seems to be missing the comment on how many ptes are actually in
 the page that are in the other if and else cases.

 Ok. I'll fix this. Actually it's another hack: we don't use full page
 for PTE table because we need to reserve something for PGD

 I don't understand "we need to reserve something for PGD".  Do you 
 mean that you would not require a second page for the PGD because the 
 full pagetable could fit in one page?   My first reaction was to say 
 then create pgtable-nopgd.h like the other two.  The page walkers 
 support this with the advent of gigantic pages.  Then I realized that 
 might not be optimal:  while the page table might fit in one page, it 
 would mean you always allocate the pte space to cover the full address
 space.   Even if your processes spread out over the 3G of address space
 allocated to them (32 bit kernel), you will allocate space for 4G, 
 wasting 1/4 of the pte space.
 That does imply you want to allocate the pte page from a slab instead 
 of pgalloc.  Is that covered?

 Well, in the case of 256K PAGE_SIZE we do not need the PGD level indeed
(18 bits are used for the offset, and the remaining 14 bits are for the PTE 
index inside the PTE table). Even the full 256K PTE page isn't necessary to 
cover the full range: only half of it would be enough (with 14 bits we 
can address only 16K PTEs).

 But the head_44x.S code is essentially based on the assumption of 
2-level page addressing. Also, I guess that eliminating the PGD level 
won't be as easy as just re-implementing the TLB-miss handlers in 
head_44x.S. So, the current approach to 256K-pages support was just a 
compromise between the functionality required for the project and the 
effort necessary to achieve it.

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com



Re[2]: [PATCH] xsysace: use resource_size_t instead of unsigned long

2008-11-13 Thread Yuri Tikhonov

Hello Stephen,

On Thursday, November 13, 2008 you wrote:

 Hi Yuri,

 On Thu, 13 Nov 2008 11:43:17 +0300 Yuri Tikhonov [EMAIL PROTECTED] wrote:

 - dev_dbg(ace->dev, "physaddr=0x%lx irq=%i\n", ace->physaddr, ace->irq);
 + dev_dbg(ace->dev, "physaddr=0x%llx irq=%i\n", (u64)ace->physaddr, 
 ace->irq);

 You should cast the physaddr to unsigned long long as u64 is
 unsigned long on some architectures.  The same is needed in other
 places as well.

 Thanks for your comment. We'll fix this.
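
 Stephen's point can be shown with a tiny portable-formatting sketch (the 
type and function names below are hypothetical, not the xsysace code): 
a u64 / resource_size_t may be 'unsigned long' on 64-bit architectures and 
'unsigned long long' on 32-bit ones, so casting to unsigned long long is 
what makes "%llx" correct everywhere.

```c
#include <assert.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for resource_size_t: the underlying type of
 * a 64-bit physical address differs across architectures. */
typedef uint64_t phys_addr_example_t;

/* Casting to unsigned long long makes "%llx" portable. */
static int format_physaddr(char *buf, size_t n, phys_addr_example_t pa)
{
	return snprintf(buf, n, "physaddr=0x%llx", (unsigned long long)pa);
}
```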

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com



Re[2]: [PATCH] katmai.dts: extend DMA ranges; add dma/sysace nodes

2008-11-13 Thread Yuri Tikhonov

 Hello Josh,

On Thursday, November 13, 2008 you wrote:

[snip]

 You have no compatible property in these 3 nodes.  How are drivers
 supposed to bind to them?

 You also have no reg or dcr-reg properties.  What exactly are these
 nodes for?

 Probably we (Ilya and I) overdid it with posting katmai.dts-related 
changes to the ML, and duplicated the same patches in different posts. 
Sorry for the confusion. These nodes are necessary for the ppc440spe 
ADMA driver:

http://www.nabble.com/-PATCH-11-11--ppc440spe-adma:-ADMA-driver-for-PPC440SP(e)-td20488049.html

 + [EMAIL PROTECTED] {
 + compatible = "xlnx,opb-sysace-1.00.b";

 Odd.  This isn't a xilinx board by any means.  This should probably
 look something like:

 compatible = "amcc,sysace-440spe", "xlnx,opb-sysace-1.00.b";

 Though I'm curious about it in general.  The xilinx bindings have the
 versioning numbers on them to match particular bit-streams in the FPGAs
 if I remember correctly.  Does that really apply here?

 No, we just selected the description which looked more appropriate 
for our case: SysAce is connected to the External Bus Controller of the 
440SPe, which in turn is attached as a slave to the OPB (on-chip 
peripheral bus).

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com



Re[2]: [PATCH] katmai.dts: extend DMA ranges; add dma/sysace nodes

2008-11-13 Thread Yuri Tikhonov

Hello Grant,

On Friday, November 14, 2008 you wrote:

 On Thu, Nov 13, 2008 at 06:45:33AM -0500, Josh Boyer wrote:
 On Thu, 13 Nov 2008 11:49:14 +0300
 Yuri Tikhonov [EMAIL PROTECTED] wrote:
  +   [EMAIL PROTECTED] {
   +   compatible = "xlnx,opb-sysace-1.00.b";
 
 Odd.  This isn't a xilinx board by any means.  This should probably
 look something like:
 
  compatible = "amcc,sysace-440spe", "xlnx,opb-sysace-1.00.b";

 Actually, if there is a sysace, it is definitely a xilinx part.  It
 won't be on the SoC.  However, compatible = "xlnx,opb-sysace-1.00.b"
 isn't really accurate.  It should really be compatible = "xlnx,sysace"
 and the driver modified to accept this string.

 OK, we'll do this, and then re-post together with the cleaned-up 
[PATCH] xsysace: use resource_size_t instead of unsigned long patch.


   xlnx,opb-sysace-1.00.b
 is an FPGA block used to interface to the system ace which is
 definitely not in use here.

 So while it does get things to work, it is not a clean description of
 the hardware.

 g.




 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com



Re[2]: [PATCH] powerpc: add support for PAGE_SIZEs greater than 4KB for

2008-09-11 Thread Yuri Tikhonov

Hello,

On Thursday, September 11, 2008 you wrote:

 I was planning to post a similar patch. Good that you already posted
 it :-) I will try to finish off a similar patch for 40x processors.


 +choice
 +   prompt "Page size"
 +   depends on 44x && PPC32
 +   default PPC32_4K_PAGES
 +   help
 + The PAGE_SIZE definition. Increasing the page size may
 + improve the system performance in some dedicated cases.
 + If unsure, set it to 4 KB.
 +
 You should mention an example of dedicated cases (eg. RAID).

ACK.

 I think this help should mention that for page size 256KB, you will
 need to have a special version of binutils, since the ELF standard
 mentions page sizes only up to 64KB.

 Right. We use ELDK-4.2 for compiling applications to be run on a 256K
PAGE_SIZE kernel. This toolchain includes the necessary changes for
ELF_MAXPAGESIZE in binutils/bfd/elf32-ppc.c.

 -#ifdef CONFIG_PPC_64K_PAGES
 +#if defined(CONFIG_PPC32_256K_PAGES)
 +#define PAGE_SHIFT 18
 +#elif defined(CONFIG_PPC32_64K_PAGES) || defined(CONFIG_PPC_64K_PAGES)
  #define PAGE_SHIFT 16
 +#elif defined(CONFIG_PPC32_16K_PAGES)
 +#define PAGE_SHIFT 14
  #else
  #define PAGE_SHIFT 12
  #endif

 Why should the new defines be inside CONFIG_PPC_64K_PAGES? The
 definition CONFIG_PPC_64K_PAGES is repeated.

 We decided to introduce new CONFIG_PPC32_64K_PAGES option to
distinguish using 64K pages on PPC32 and PPC64, so PAGE_SHIFT will be
defined as 16 when the CONFIG_PPC_64K_PAGES option is set on some PPC64
platform, and as 16 when the CONFIG_PPC32_64K_PAGES option is set on
some ppc44x PPC32 platform.

 Shouldn't these defines be like this:
 #if defined(CONFIG_PPC32_256K_PAGES)
 #define PAGE_SHIFT 18
 #elif defined(CONFIG_PPC32_64K_PAGES) || defined(CONFIG_PPC_64K_PAGES)
 #define PAGE_SHIFT 16
 #elif defined(CONFIG_PPC32_16K_PAGES)
 #define PAGE_SHIFT 14
 #else
 #define PAGE_SHIFT 12
 #endif

 Admittedly, I don't see the difference between your version and
Ilya's one. Am I missing something ?

 +#elif (PAGE_SHIFT == 14)
 +/*
 + * PAGE_SIZE  16K
 + * PAGE_SHIFT 14
 + * PTE_SHIFT  11
 + * PMD_SHIFT  25
 + */
 +#define PPC44x_TLBE_SIZE   PPC44x_TLB_16K
 +#define PPC44x_PGD_OFF_SH  9  /*(32 - PMD_SHIFT + 2)*/
 +#define PPC44x_PGD_OFF_M1  23 /*(PMD_SHIFT - 2)*/
 +#define PPC44x_PTE_ADD_SH  21 /*32 - PMD_SHIFT + PTE_SHIFT + 3*/
 +#define PPC44x_PTE_ADD_M1  18 /*32 - 3 - PTE_SHIFT*/
 +#define PPC44x_RPN_M2  17 /*31 - PAGE_SHIFT*/

 Please change PPC44x_PGD_OFF_SH to PPC44x_PGD_OFF_SHIFT. SH sounds
 very confusing. I don't like the M1 and M2 names either. Change
 PPC44x_RPN_M2 to PPC44x_RPN_MASK. Change M1 to MASK in
 PPC44x_PGD_OFF_M1 and PPC44x_PTE_ADD_M1.
 Is there no way a define like
 #define PPC44x_PGD_OFF_SH  (32 - PMD_SHIFT + 2)
 can be used in the assembly file? If yes, we can avoid repeating the defines.

 I think these 44x specific defines should go to asm/mmu-44x.h since I
 am planning to post a patch for 40x. For those processors, the defines
 below will changes as:
 #define PPC44x_PTE_ADD_SH  (32 - PMD_SHIFT + PTE_SHIFT + 2)
 #define PPC44x_PTE_ADD_M1  (32 - 2 - PTE_SHIFT)
 Since these defines are not generic, they should be put in the mmu
 specific header file rather than adding a new header file. When 40x
 processors are supported, the corresponding defines can go to
 include/asm/mmu-40x.h

 +#elif (PAGE_SHIFT == 18)
 +/*
 + * PAGE_SIZE  256K
 + * PAGE_SHIFT 18
 + * PTE_SHIFT  11
 + * PMD_SHIFT  29
 + */
 +#define PPC44x_TLBE_SIZE   PPC44x_TLB_256K
 +#define PPC44x_PGD_OFF_SH  5  /*(32 - PMD_SHIFT + 2)*/
 +#define PPC44x_PGD_OFF_M1  27 /*(PMD_SHIFT - 2)*/
 +#define PPC44x_PTE_ADD_SH  17 /*32 - PMD_SHIFT + PTE_SHIFT + 3*/
 +#define PPC44x_PTE_ADD_M1  18 /*32 - 3 - PTE_SHIFT*/
 +#define PPC44x_RPN_M2  13 /*31 - PAGE_SHIFT*/

 For 256KB page size, I cannot understand why PTE_SHIFT is 11. Since
 each PTE entry is 8 byte, PTE_SHIFT should have been 15. But then
 there would be no bits in the Effective address for the 1st level
 PGDIR offset. On what basis was PTE_SHIFT of 11 chosen? This overflow
 problem happens only for 256KB page size.

 We should use a smaller PTE area in the address to free some bits for 
the PGDIR part. I guess the only impact this approach has is the 
ineffective usage of memory pages allocated for PTE tables: having a 
PTE_SHIFT of 11, we use only 1/16 of the pages holding PTEs.
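
 The 1/16 figure follows from a few lines of arithmetic. The constants below 
are the ones quoted in this thread; the EX_ prefix marks them as illustration 
rather than kernel definitions.

```c
#include <stdint.h>

/* Address split for 256K pages as discussed above. */
enum {
	EX_PAGE_SHIFT = 18,   /* 256 KB pages          */
	EX_PTE_SHIFT  = 11,   /* 2^11 PTEs per table   */
	EX_PTE_SIZE   = 8,    /* 8-byte Linux PTEs     */
	EX_PMD_SHIFT  = EX_PAGE_SHIFT + EX_PTE_SHIFT  /* = 29 */
};

/* Bytes of a PTE page actually holding PTEs: 2^11 * 8 = 16 KB. */
static inline uint32_t ex_pte_table_bytes(void)
{
	return (1u << EX_PTE_SHIFT) * EX_PTE_SIZE;
}

/* How many times the 256 KB page exceeds the used table: 16. */
static inline uint32_t ex_pte_page_waste_ratio(void)
{
	return (1u << EX_PAGE_SHIFT) / ex_pte_table_bytes();
}
```

 With PMD_SHIFT at 29, the remaining 32 - 29 = 3 address bits index the PGD, 
which is why PTE_SHIFT cannot grow to 15 without eliminating the first level.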

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com



Re[2]: [PATCH] powerpc: add support for PAGE_SIZEs greater than 4KB for

2008-09-11 Thread Yuri Tikhonov

Hello Prodyut,

Thanks for your comments. Some answers below.

On Friday, September 12, 2008 you wrote:

/*
 * Create WS1. This is the faulting address (EPN),
 * page size, and valid flag.
 */
 -   li  r11,PPC44x_TLB_VALID | PPC44x_TLB_4K
 +   li  r11,PPC44x_TLB_VALID | PPC44x_TLBE_SIZE
rlwimi  r10,r11,0,20,31 /* Insert valid and page 
 size*/
tlbwe   r10,r13,PPC44x_TLB_PAGEID   /* Write PAGEID */


 Change
rlwimi  r10,r11,0,20,31 /* Insert valid and page 
 size*/
 to
rlwimi  r10,r11,0,PPC44x_PTE_ADD_M1,31 /* Insert 
 valid and page size*/

 Agree. We'll fix this.

 I guess this works for us because we used the large EPN mask here, 
which covered more bits in the EPN field of the TLB entries than was 
required for the 16/64/256K PAGE_SIZE cases:

TLB Word 0 / bits 0..21:   EPN (Effective Page Number) [from 4 to 22 bits]
TLB Word 0 / bit 22 :  V (Valid bit) [1 bit]
TLB Word 0 / bits 24..27 : SIZE (Page Size) [4 bits]

 Thus, doing 'rlwimi' we masked our V/SIZE bits and cleared EPN for
all 4/16/64/256K PAGE_SIZE cases.
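
 The effect of that rlwimi can be modeled in C. IBM bit numbering counts from 
the MSB, so bits 20..31 are the low 12 bits of the word, which cover the V 
bit (22) and the SIZE field (24..27) listed above. This sketch models only 
the SH=0 case used here, and the 0x250 test value assumes the usual 
PPC44x_TLB_VALID (0x200) and PPC44x_TLB_256K (0x50) encodings.

```c
#include <stdint.h>

/* Model of rlwimi rA,rS,0,MB,ME: insert bits MB..ME of rS into rA,
 * with IBM numbering where bit 0 is the most significant bit. */
static uint32_t ex_rlwimi0(uint32_t ra, uint32_t rs, int mb, int me)
{
	/* Contiguous mask covering IBM bits mb..me. */
	uint32_t mask =
		(uint32_t)((((uint64_t)1 << (me - mb + 1)) - 1) << (31 - me));

	return (ra & ~mask) | (rs & mask);
}
```

 Inserting bits 20..31 therefore overwrites the valid and size bits while the 
EPN bits of r10 (everything above bit 20) are preserved.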

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com



Re[2]: [PATCH] powerpc: add support for PAGE_SIZEs greater than 4KB for

2008-09-11 Thread Yuri Tikhonov

Hi Ilya,

On Friday, September 12, 2008 you wrote:

 Hi,

 prodyut hazarika wrote:
 In file arch/powerpc/mm/pgtable_32.c, we have:

 #ifdef CONFIG_PTE_64BIT
 /* 44x uses an 8kB pgdir because it has 8-byte Linux PTEs. */
 #define PGDIR_ORDER 1
 #else
 #define PGDIR_ORDER 0
 #endif
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
 pgd_t *ret;

 ret = (pgd_t *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, PGDIR_ORDER);
 return ret;
 }

 Thus, we allocate 2 pages for 44x processors for PGD. This is needed
 only for 4K page.
 We are anyway not using the whole 64K or 256K page for the PGD. So
 there is no point to waste an additional 64K or 256KB page
   

 Ok. Not sure I'm right, but I think the 16K case doesn't need a second 
 page either. (PGDIR_SHIFT=25, so sizeof(pgd_t) << (32-PGDIR_SHIFT) < 16KB)

 ACK, no need for a second page when working with 16K pages.
Prodyut's approach addresses this too, but ...

 Change this to:
 #ifdef CONFIG_PTE_64BIT
 #if (PAGE_SHIFT == 12)
   

 I think #ifdef CONFIG_PTE_64BIT is a little bit confusing here...  
 Actually PGDIR_ORDER  should be something like max(32 + 2 - PGDIR_SHIFT
 - PAGE_SHIFT, 0)

 /* 44x uses an 8kB pgdir because it has 8-byte Linux PTEs. */
 #define PGDIR_ORDER 1
 #else
 #define PGDIR_ORDER 0
 #endif
 #else
 #define PGDIR_ORDER 0
 #endif
   

 Yuri, any comments?

 ... as for me, I like your approach more.
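
 Ilya's formula can be checked numerically. The "+ 2" is log2 of a 4-byte 
pgd_t, and the shift values below are assumptions drawn from this thread 
(4K pages with 8-byte PTEs give PGDIR_SHIFT 21; 16K gives 25; 256K gives 29).

```c
/* Sketch of the proposed PGDIR_ORDER formula:
 * order = max(32 + 2 - PGDIR_SHIFT - PAGE_SHIFT, 0). */
static int ex_pgdir_order(int pgdir_shift, int page_shift)
{
	int order = 32 + 2 - pgdir_shift - page_shift;

	return order > 0 ? order : 0;
}
```

 For 4K pages this yields order 1 (the 8 KB pgdir mentioned in the comment), 
while 16K and 256K pages fit the pgdir in a single page (order 0).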

 Regards, Yuri

 --
 Yuri Tikhonov, Senior Software Engineer
 Emcraft Systems, www.emcraft.com



Unaligned LocalPlus Bus access on MPC5200

2008-02-19 Thread Yuri Tikhonov

 Hello,

 I've encountered a problem with unaligned word accesses to external 
devices (Flash memory) connected to the LocalPlus bus of the MPC5200 
processor. Any comments on this would be very appreciated.

 And the essence of the issue is as follows:

- when I try to read a data word from LPB-connected Flash using some even 
address (0xFF00, 0xFF02, 0xFF04, etc), then everything works fine 
(it's not a typo, 2-byte-aligned word accesses pass well too);

- when I try to read a data word from LPB-connected Flash using some odd 
address (0xFF01, 0xFF03, ...) then LP returns only 1 byte of the word 
correctly (the 3 remaining bytes are filled with zeros).

(a) Here is what I have when I read from LPB using word-aligned accesses with 
MPC5200 rev.A and LPB configured in Non_Multiplexed mode + 8-bit data bus:

lwz from 0xc3082000:0xba476bc7
lwz from 0xc3082004:0xb95d77de
 ...

 Now I try the unaligned reads:
lwz from 0xc3082000:0xba476bc7
lwz from 0xc3082001:0x00b9
lwz from 0xc3082002:0x6bc7b95d
lwz from 0xc3082003:0xc700

(b) With MPC5200 rev.B the situation is similar, differing only in that an 
unaligned read returns 2 bytes for the 0x1 address cases (LPB again 
configured in Non_Multiplexed mode + 8-bit data bus):

lwz from 0xd1082000:0x2f459eaf
lwz from 0xd1082004:0x388ff68d
...

lwz from 0xd1082000:0x2f459eaf
lwz from 0xd1082001:0xaf38
lwz from 0xd1082002:0x9eaf388f
lwz from 0xd1082003:0xaf00

(c) When LPB operates in the Multiplexed mode with 32-bit data bus, the 
erroneous result is observed for the 0x3 address cases only (MPC5200 is rev.A 
in these tests):
lwz from 0xc3082000:0x7e9043a6
lwz from 0xc3082004:0x7eb143a6
...

lwz from 0xc3082000:0x7e9043a6
lwz from 0xc3082001:0x9043a67e
lwz from 0xc3082002:0x43a67eb1
lwz from 0xc3082003:0xa600

 I used the following platforms for tests:

- TQM5200 board, which is based on MPC5200rev.A CPU, and has AMD Flash 
connected to LPB configured in Multiplexed mode with 32-bit data bus;

- some custom board, which is based on the MPC5200rev.A CPU, and has Intel Flash 
connected to LPB configured in Non-Multiplexed mode with 8-bit data bus;

- Lite5200B board, which is based on MPC5200rev.B CPU, and has AMD Flash 
connected to LPB configured in Non-Multiplexed mode with 8-bit data bus.

 The Linux source tree I used is linux-2.6.23.16 (DENX linux-2.6.23-stable 
branch). The toolchain is ELDK-4.2.


 As an example, this LPB-related issue leads to the incorrect operation of 
a JFFS2 file-system created on top of an MTD device built on a Flash chip from 
Intel/Sharp (drivers/mtd/chips/cfi_cmdset_0001.c).

 With these Flash chips an implementation of the point/unpoint API is 
possible, so the cfi_cmdset_0001.c driver exports the corresponding 
point/unpoint methods for the MTD device, and JFFS2 then uses these methods 
to operate on data directly from Flash (without copying it to RAM).

 One of these operations is memcpy() in the jffs2_scan_dirent_node() function, 
which copies the file name from some address at Flash (aligned) to, 
unfortunately, unaligned destination in RAM (name field of the 
jffs2_full_dirent structure).

 The implementation of memcpy() in lib_powerpc first does byte-to-byte 
transfers to reach an aligned destination, and then does word-to-word 
transfers; but by that moment the source is unaligned, so memcpy() does lwz-s 
from unaligned addresses on the LPB.

 Just FYI, a simple work-around for the issue with the Intel/Sharp Flash chips 
connected to LPB of MPC5200 is to mark your struct map_info as .phys = 
NO_XIP, and implement read/write/copy_from/copy_to byte-to-byte functions (in 
your drivers/mtd/maps/ board file).
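
 The byte-to-byte copy_from replacement mentioned above might look like the 
following sketch. The function name and signature are illustrative, not the 
actual mtd map_info API: the point is only that every bus access is a single 
byte load, so the LPB never sees an unaligned word read.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-at-a-time copy from a (possibly XIP) flash window: avoids the
 * lwz-from-unaligned-address pattern that the generic memcpy() emits. */
static void ex_copy_from_byte(void *to, const volatile void *from, size_t len)
{
	uint8_t *d = to;
	const volatile uint8_t *s = from;

	while (len--)
		*d++ = *s++;
}
```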

 Regards, Yuri

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com


Re: [PATCH 1/1] [PPC] 8xx swap bug-fix

2008-02-04 Thread Yuri Tikhonov

 Hi Scott,

 You are right. The TLB handlers for 8xx in the arch/powerpc branch set the 
PAGE_ACCESSED flag unconditionally too. And the 
include/asm-powerpc/pgtable-ppc32.h file still includes the comment that this 
is a bug. So, probably the corresponding patch for the powerpc branch will be 
useful. Does anybody use swap with any of the 8xx-based boards supported in 
the powerpc branch?

 Regards, Yuri

On Monday 04 February 2008 21:24, Scott Wood wrote:
 On Sat, Feb 02, 2008 at 12:22:17PM +0100, Jochen Friedrich wrote:
  Hi Yuri,
  
Here is the patch which makes Linux-2.6 swap routines operate correctly 
on
   the ppc-8xx-based machines.
  
  is there any 8xx board left which isn't ported to ARCH=powerpc?
 
 More importantly, is this something that is also broken in arch/powerpc?  It
 looks like it has the same code...
 
 -Scott
 

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


[PATCH 1/1] [PPC] 8xx swap bug-fix

2008-02-01 Thread Yuri Tikhonov

 Hello,

 Here is the patch which makes Linux-2.6 swap routines operate correctly on
the ppc-8xx-based machines.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
--
diff --git a/arch/ppc/kernel/head_8xx.S b/arch/ppc/kernel/head_8xx.S
index eb8d26f..321bda2 100644
--- a/arch/ppc/kernel/head_8xx.S
+++ b/arch/ppc/kernel/head_8xx.S
@@ -329,8 +329,18 @@ InstructionTLBMiss:
mfspr   r11, SPRN_MD_TWC/* and get the pte address */
lwz r10, 0(r11) /* Get the pte */
 
+#ifdef CONFIG_SWAP
+   /* do not set the _PAGE_ACCESSED bit of a non-present page */
+   andi.   r11, r10, _PAGE_PRESENT
+   beq 4f
+   ori r10, r10, _PAGE_ACCESSED
+   mfspr   r11, SPRN_MD_TWC/* get the pte address again */
+   stw r10, 0(r11)
+4:
+#else
ori r10, r10, _PAGE_ACCESSED
stw r10, 0(r11)
+#endif
 
/* The Linux PTE won't go exactly into the MMU TLB.
 * Software indicator bits 21, 22 and 28 must be clear.
@@ -395,8 +405,17 @@ DataStoreTLBMiss:
DO_8xx_CPU6(0x3b80, r3)
mtspr   SPRN_MD_TWC, r11
 
-   mfspr   r11, SPRN_MD_TWC/* get the pte address again */
+#ifdef CONFIG_SWAP
+   /* do not set the _PAGE_ACCESSED bit of a non-present page */
+   andi.   r11, r10, _PAGE_PRESENT
+   beq 4f
+   ori r10, r10, _PAGE_ACCESSED
+4:
+   /* and update pte in table */
+#else
ori r10, r10, _PAGE_ACCESSED
+#endif
+   mfspr   r11, SPRN_MD_TWC/* get the pte address again */
stw r10, 0(r11)
 
/* The Linux PTE won't go exactly into the MMU TLB.
@@ -575,7 +594,16 @@ DataTLBError:
 
/* Update 'changed', among others.
*/
+#ifdef CONFIG_SWAP
+   ori r10, r10, _PAGE_DIRTY|_PAGE_HWWRITE
+   /* do not set the _PAGE_ACCESSED bit of a non-present page */
+   andi.   r11, r10, _PAGE_PRESENT
+   beq 4f
+   ori r10, r10, _PAGE_ACCESSED
+4:
+#else
ori r10, r10, _PAGE_DIRTY|_PAGE_ACCESSED|_PAGE_HWWRITE
+#endif
mfspr   r11, SPRN_MD_TWC/* Get pte address again */
stw r10, 0(r11) /* and update pte in table */
 
diff --git a/include/asm-ppc/pgtable.h b/include/asm-ppc/pgtable.h
index c159315..76717ff 100644
--- a/include/asm-ppc/pgtable.h
+++ b/include/asm-ppc/pgtable.h
@@ -341,14 +341,6 @@ extern unsigned long ioremap_bot, ioremap_base;
 #define _PMD_PAGE_MASK 0x000c
 #define _PMD_PAGE_8M   0x000c
 
-/*
- * The 8xx TLB miss handler allegedly sets _PAGE_ACCESSED in the PTE
- * for an address even if _PAGE_PRESENT is not set, as a performance
- * optimization.  This is a bug if you ever want to use swap unless
- * _PAGE_ACCESSED is 2, which it isn't, or unless you have 8xx-specific
- * definitions for __swp_entry etc. below, which would be gross.
- *  -- paulus
- */
 #define _PTE_NONE_MASK _PAGE_ACCESSED
 
 #else /* CONFIG_6xx */

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com


Re: [PATCH 0/2] [PPC 4xx] L2-cache synchronization for ppc44x

2008-01-11 Thread Yuri Tikhonov

 Hello, Eugene,

 The h/w snooping mechanism you are talking about is limited to the Low 
Latency (LL) segment of the PLB bus in ppc440sp and ppc440spe chips (see 
section 7.2.7 "L2 Cache Coherency" of the ppc440spe spec), whereas DMA and 
XOR engines use the High Bandwidth (HB) segment of the PLB bus (see 
section 1.1.2 "Internal Buses" of the ppc440spe spec).

 Thus, the h/w snooping mechanism is not able to trace the results of 
operations performed by DMA and XOR engines and keep L2-cache coherent with 
SDRAM, because the data flow through the HB PLB segment. This leads to, for 
example, incorrect results of RAID-parity calculations if one uses the h/w 
accelerated ppc440spe ADMA driver with L2-cache enabled.

 The s/w synchronization algorithms proposed in my patches have no LL PLB 
limitations as opposed to h/w snooping, but, probably, this is not the best 
way this might be implemented. Even though with these patches the h/w 
accelerated RAID starts to operate correctly (with L2-cache enabled), there is 
a performance degradation (induced by loops in the L2-cache synchronization 
routines) observed in most cases. So, as a result, there is no benefit 
from using the L2-cache for these RAID cases at all.

 Regards, Yuri

On Wednesday 28 November 2007 22:50, Eugene Surovegin wrote:
 On Wed, Nov 07, 2007 at 01:40:10AM +0300, Yuri Tikhonov wrote:
  
   Hello all,
  
   Here is a patch-set for support L2-cache synchronization routines for
  the ppc44x processors family. I know that the ppc branch is for 
bug-fixing only, thus
  the patch-set is just FYI [though enabled but non-coherent L2-cache may 
appear as a bug for
  someone who uses one of the boards listed below :)].
  
  [PATCH 1/2] [PPC 4xx] invalidate_l2cache_range() implementation for 
ppc44x;
  [PATCH 2/2] [PPC 44x] enable L2-cache for the following ppc44x-based 
boards: ALPR,
  Katmai, Ocotea, and Taishan.
 
 Why is this all needed?
 
 IIRC ibm440gx_l2c_enable() configures 64G snoop region for L2C.
 
 Did AMCC made non-only-coherent L2C chips recently?
 
 -- 
 Eugene
 
 

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com


Re[2]: [PATCH 1/2] [PPC 4xx] invalidate_l2cache_range() implementation for ppc44x

2007-11-07 Thread Yuri Tikhonov

 Hi Olof,

 Thanks a lot for the feedbacks. Comments below.

On 07.11.2007, 7:04:28 you wrote:

 Hi,

 Some comments below. In general this patch adds #ifdefs in common code,
 that's normally frowned upon.
 It would maybe be better to add a new call to ppc_machdeps and call it
 if set.

 Agree; this looks better indeed. 

 On Wed, Nov 07, 2007 at 01:40:28AM +0300, Yuri Tikhonov wrote:

...

 +
  /*
   * Write any modified data cache blocks out to memory.
   * Does not invalidate the corresponding cache lines (especially for
 diff --git a/include/asm-powerpc/cache.h b/include/asm-powerpc/cache.h
 index 5350704..8a2f9e6 100644
 --- a/include/asm-powerpc/cache.h
 +++ b/include/asm-powerpc/cache.h
 @@ -10,12 +10,14 @@
  #define MAX_COPY_PREFETCH  1
  #elif defined(CONFIG_PPC32)
  #define L1_CACHE_SHIFT 5
 +#define L2_CACHE_SHIFT 5
  #define MAX_COPY_PREFETCH  4
  #else /* CONFIG_PPC64 */
  #define L1_CACHE_SHIFT 7
  #endif
  
  #defineL1_CACHE_BYTES  (1  L1_CACHE_SHIFT)
 +#defineL2_CACHE_BYTES  (1  L2_CACHE_SHIFT)

 The above looks highly system dependent to me. Should maybe be a part
 of the cache info structures instead, and filled in from the device tree?

 This is the Level-2 cache line parameter. I'll see what can be done here. For 
now I've just renamed these definitions and moved them into the PPC44x-specific 
header.


 Regards,
  Yuri 

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com



Re[2]: [PATCH 2/2] [PPC 44x] enable L2-cache for ALPR, Katmai, Ocotea, and Taishan

2007-11-07 Thread Yuri Tikhonov

 Hi Olof,

On 07.11.2007, 7:06:08 you wrote:
...
 +
 +config L2_CACHE
 +   bool "Enable Level-2 Cache"
 +   depends on NOT_COHERENT_CACHE && (KATMAI || TAISHAN || OCOTEA || 
 ALPR)
 +   default y
 +   help
 + This option enables L2-cache on ppc44x controllers.
 + If unsure, say Y.

 That's a very generic config name. Maybe something like PPC_4XX_L2_CACHE?

 Having the ppc_machdep for invalidating L2-cache lines we can avoid 
introducing the new configuration options at all. See below.

 Is there ever a case where a user would NOT want l2 cache enabled (and
 disabled permanently enough to rebuild the kernel instead of giving a
 kernel command line option?)

 Theoretically - yes. The internal SRAM of the ppc44x may be used for something 
other than L2 cache. 

 Admittedly, the configuration option was necessary for me to enable or disable 
my L2-cache synchronization routine in the generic dma_sync() function. Per 
your suggestion, instead of introducing a new kernel option I now initialize 
the L2-cache sync ppc_machdep right in the L2-cache enable routine: thus, if the 
user does not enable the L2-cache (does not want internal SRAM to act as L2-cache 
and does not call the L2-cache enabling routine), then my new ppc_machdep will 
remain set to zero and will not affect SRAM used for some specific purposes.
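
 The call-the-hook-only-if-set pattern described here can be sketched in a few 
lines. The names below are illustrative, not the real ppc_md layout: a 
zero-initialized function pointer stays NULL unless board setup installs a 
handler, and generic code checks it before calling.

```c
#include <stddef.h>

/* Illustrative machdep structure with one optional hook. */
struct ex_machdep {
	void (*l2cache_inv_range)(unsigned long start, unsigned long stop);
};

static struct ex_machdep ex_md;     /* zero-initialized: hook unset */
static unsigned long ex_inv_calls;  /* counts hook invocations      */

static void ex_board_l2_inv(unsigned long start, unsigned long stop)
{
	(void)start;
	(void)stop;
	ex_inv_calls++;
}

/* Board setup installs the hook only when it enables the L2 cache. */
static void ex_board_enable_l2(void)
{
	ex_md.l2cache_inv_range = ex_board_l2_inv;
}

/* Generic code: invoke the hook only if a board provided one. */
static void ex_dma_sync(unsigned long start, unsigned long stop)
{
	if (ex_md.l2cache_inv_range)
		ex_md.l2cache_inv_range(start, stop);
}
```

 Boards that keep the SRAM for other purposes simply never install the hook, 
so the generic path stays a no-op for them.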

...

 @@ -567,7 +569,9 @@ void __init platform_init(unsigned long r3, unsigned 
 long r4,
  #ifdef CONFIG_KGDB
 ppc_md.early_serial_map = alpr_early_serial_map;
  #endif
 +#ifdef CONFIG_L2_CACHE
 ppc_md.init = alpr_init;
 +#endif

 Why do you take out the above calls if the new option is selected? Seems
 odd to remove something that worked(?) before.

 Umm.. Quite the contrary, the option selected made these calls available. 
Though it doesn't matter anymore since there is no CONFIG_L2_CACHE option 
anymore (i.e. all the four boards dealt with in this patch-set now have 
L2-cache enabled regardless of configuration, as it was initially).

 ppc_md.restart = alpr_restart;
  }
  

...

 +#ifdef CONFIG_L2_CACHE
 +static void __init katmai_init(void)
 +{
 +   ibm440gx_l2c_setup(clocks);
 +}
 +#endif
 +
  void __init platform_init(unsigned long r3, unsigned long r4,
   unsigned long r5, unsigned long r6, unsigned long 
 r7)
  {
 @@ -599,4 +607,7 @@ void __init platform_init(unsigned long r3, unsigned 
 long r4,
 ppc_md.early_serial_map = katmai_early_serial_map;
  #endif
 ppc_md.restart = katmai_restart;
 +#ifdef CONFIG_L2_CACHE
 +   ppc_md.init = katmai_init;
 +#endif

 See comment above. Should the above init be called for all configs, not just
 when L2_CACHE is enabled?

 Also, it looks like the init function is the same on every board. It would
 be better to make a common function instead of duplicating it everywhere.

 Agree, but perhaps it's not the case for the ppc branch. Will do this in the 
powerpc branch as soon as support for these boards is ported there.. by 
someone :)

Regards,
 Yuri 

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com



[PATCH] [PPC 44x] L2-cache synchronization for ppc44x

2007-11-07 Thread Yuri Tikhonov

 This is the updated patch for support synchronization of L2-Cache with the 
external memory on the ppc44x-based platforms.

 Differencies against the previous patch-set:
- remove L2_CACHE config option;
- introduce the ppc machdep to invalidate L2 cache lines;
- some code clean-up.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Pavel Kolesnikov [EMAIL PROTECTED]

--
diff --git a/arch/powerpc/lib/dma-noncoherent.c 
b/arch/powerpc/lib/dma-noncoherent.c
index 1947380..b06f05c 100644
--- a/arch/powerpc/lib/dma-noncoherent.c
+++ b/arch/powerpc/lib/dma-noncoherent.c
@@ -31,6 +31,7 @@
 #include <linux/dma-mapping.h>
 
 #include <asm/tlbflush.h>
+#include <asm/machdep.h>
 
 /*
  * This address range defaults to a value that is safe for all
 @@ -186,6 +187,8 @@ __dma_alloc_coherent(size_t size, dma_addr_t *handle, gfp_t gfp)
unsigned long kaddr = (unsigned long)page_address(page);
memset(page_address(page), 0, size);
flush_dcache_range(kaddr, kaddr + size);
+   if (ppc_md.l2cache_inv_range)
+		ppc_md.l2cache_inv_range(__pa(kaddr), __pa(kaddr + size));
}
 
/*
@@ -351,12 +354,16 @@ void __dma_sync(void *vaddr, size_t size, int direction)
BUG();
case DMA_FROM_DEVICE:   /* invalidate only */
invalidate_dcache_range(start, end);
+   if (ppc_md.l2cache_inv_range)
+   ppc_md.l2cache_inv_range(__pa(start), __pa(end));
break;
case DMA_TO_DEVICE: /* writeback only */
clean_dcache_range(start, end);
break;
case DMA_BIDIRECTIONAL: /* writeback and invalidate */
flush_dcache_range(start, end);
+   if (ppc_md.l2cache_inv_range)
+   ppc_md.l2cache_inv_range(__pa(start), __pa(end));
break;
}
 }
diff --git a/arch/ppc/kernel/misc.S b/arch/ppc/kernel/misc.S
index 46cf8fa..31c9149 100644
--- a/arch/ppc/kernel/misc.S
+++ b/arch/ppc/kernel/misc.S
@@ -25,6 +25,10 @@
 #include <asm/thread_info.h>
 #include <asm/asm-offsets.h>
 
+#ifdef CONFIG_44x
+#include <asm/ibm44x.h>
+#endif
+
 #ifdef CONFIG_8xx
 #define ISYNC_8xx isync
 #else
@@ -386,6 +390,35 @@ END_FTR_SECTION_IFSET(CPU_FTR_COHERENT_ICACHE)
sync/* additional sync needed on g4 */
isync
blr
+
+#if defined(CONFIG_44x)
+/*
+ * Invalidate the Level-2 cache lines corresponding to the address
+ * range.
+ *
+ * invalidate_l2cache_range(unsigned long start, unsigned long stop)
+ */
+_GLOBAL(invalidate_l2cache_range)
+   li  r5,PPC44X_L2_CACHE_BYTES-1  /* align on L2-cache line */
+   andcr3,r3,r5
+   subfr4,r3,r4
+   add r4,r4,r5
+   srwi.   r4,r4,PPC44X_L2_CACHE_SHIFT
+   mtctr   r4
+
+	lis	r4,L2C_CMD_INV>>16
+1: mtdcr   DCRN_L2C0_ADDR,r3   /* write address to invalidate */
+   mtdcr   DCRN_L2C0_CMD,r4/* issue the Invalidate cmd */
+
+2: mfdcr   r5,DCRN_L2C0_SR /* wait for complete */
+	andis.	r5,r5,L2C_CMD_CLR>>16
+   beq 2b
+
+   addir3,r3,PPC44X_L2_CACHE_BYTES /* next address to invalidate */
+   bdnz1b
+   blr
+#endif
+
 /*
  * Write any modified data cache blocks out to memory.
  * Does not invalidate the corresponding cache lines (especially for
diff --git a/arch/ppc/syslib/ibm440gx_common.c b/arch/ppc/syslib/ibm440gx_common.c
index 6b1a801..64c663f 100644
--- a/arch/ppc/syslib/ibm440gx_common.c
+++ b/arch/ppc/syslib/ibm440gx_common.c
@@ -12,6 +12,8 @@
  */
 #include <linux/kernel.h>
 #include <linux/interrupt.h>
+#include <asm/machdep.h>
+#include <asm/cacheflush.h>
 #include <asm/ibm44x.h>
 #include <asm/mmu.h>
 #include <asm/processor.h>
@@ -201,6 +203,7 @@ void __init ibm440gx_l2c_enable(void){
 
 	asm volatile ("sync; isync" ::: "memory");
local_irq_restore(flags);
+   ppc_md.l2cache_inv_range = invalidate_l2cache_range;
 }
 
 /* Disable L2 cache */
diff --git a/include/asm-powerpc/cacheflush.h b/include/asm-powerpc/cacheflush.h
index ba667a3..bdebfaa 100644
--- a/include/asm-powerpc/cacheflush.h
+++ b/include/asm-powerpc/cacheflush.h
 @@ -49,6 +49,7 @@ extern void flush_dcache_range(unsigned long start, unsigned long stop);
 #ifdef CONFIG_PPC32
 extern void clean_dcache_range(unsigned long start, unsigned long stop);
 extern void invalidate_dcache_range(unsigned long start, unsigned long stop);
+extern void invalidate_l2cache_range(unsigned long start, unsigned long stop);
 #endif /* CONFIG_PPC32 */
 #ifdef CONFIG_PPC64
 extern void flush_inval_dcache_range(unsigned long start, unsigned long stop);
diff --git a/include/asm-powerpc/machdep.h b/include/asm-powerpc/machdep.h
index 71c6e7e..754f416 100644
--- a/include/asm-powerpc/machdep.h
+++ b/include/asm-powerpc/machdep.h
@@ -201,6 +201,8 @@ struct machdep_calls {
 	void		(*early_serial_map)(void);
+	void		(*l2cache_inv_range)(unsigned long start,
+					     unsigned long stop);

Re[2]: [PATCH] [PPC 44x] L2-cache synchronization for ppc44x

2007-11-07 Thread Yuri Tikhonov

 Hi Ben,


On 08.11.2007, 2:19:33 you wrote:
 On Thu, 2007-11-08 at 02:12 +0300, Yuri Tikhonov wrote:
 This is the updated patch for support synchronization of L2-Cache with
 the external memory on the ppc44x-based platforms.
 
  Differencies against the previous patch-set:
 - remove L2_CACHE config option;
 - introduce the ppc machdep to invalidate L2 cache lines;
 - some code clean-up.

 Can you tell me more about how this cache operates ? I don't quite
 understand why you would invalidate it on bidirectional DMAs rather than
 flush it to memory (unless you get your terminology wrong) and why you
 wouldn't flush it on transfers to the device.. Unless it is a
 write-through cache ?

 Yes, the ppc44x Level-2 cache has a write-through design, so no kind of L2 
flush is needed.

 As far as the DMA_BIDIRECTIONAL case is concerned, flush_dcache_range() flushes 
the data over the path L1 -> L2 -> RAM, but invalidates L1 only, leaving the old 
data cached in L2. Since in the BIDIRECTIONAL case DMA may update the data in 
RAM, we have to invalidate the L2 cache manually, so that the CPU reads the new 
data transferred by DMA straight from RAM rather than the stale copy left in L2 
after flush_dcache().


Regards,
 Yuri 

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


[PATCH 1/2] [PPC 4xx] invalidate_l2cache_range() implementation for ppc44x

2007-11-06 Thread Yuri Tikhonov
 Support for L2-cache coherency synchronization routines in ppc44x
processors.


Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Pavel Kolesnikov [EMAIL PROTECTED]

--
diff --git a/arch/powerpc/lib/dma-noncoherent.c b/arch/powerpc/lib/dma-noncoherent.c
index 1947380..593a425 100644
--- a/arch/powerpc/lib/dma-noncoherent.c
+++ b/arch/powerpc/lib/dma-noncoherent.c
@@ -351,12 +351,18 @@ void __dma_sync(void *vaddr, size_t size, int direction)
BUG();
case DMA_FROM_DEVICE:   /* invalidate only */
invalidate_dcache_range(start, end);
+#ifdef CONFIG_L2_CACHE
+   invalidate_l2cache_range(__pa(start), __pa(end));
+#endif
break;
case DMA_TO_DEVICE: /* writeback only */
clean_dcache_range(start, end);
break;
case DMA_BIDIRECTIONAL: /* writeback and invalidate */
flush_dcache_range(start, end);
+#ifdef CONFIG_L2_CACHE
+   invalidate_l2cache_range(__pa(start), __pa(end));
+#endif
break;
}
 }
diff --git a/arch/ppc/kernel/misc.S b/arch/ppc/kernel/misc.S
index 46cf8fa..de62f85 100644
--- a/arch/ppc/kernel/misc.S
+++ b/arch/ppc/kernel/misc.S
@@ -386,6 +386,36 @@ END_FTR_SECTION_IFSET(CPU_FTR_COHERENT_ICACHE)
sync/* additional sync needed on g4 */
isync
blr
+
+#ifdef CONFIG_L2_CACHE
+/*
+ * Invalidate the Level-2 cache lines corresponding to the address
+ * range.
+ *
+ * invalidate_l2cache_range(unsigned long start, unsigned long stop)
+ */
+#include <asm/ibm4xx.h>
+_GLOBAL(invalidate_l2cache_range)
+   li  r5,L2_CACHE_BYTES-1 /* do l2-cache line alignment */
+   andcr3,r3,r5
+   subfr4,r3,r4
+   add r4,r4,r5
+   srwi.   r4,r4,L2_CACHE_SHIFT
+   mtctr   r4
+
+	lis	r4,L2C_CMD_INV>>16
+1: mtdcr   DCRN_L2C0_ADDR,r3   /* write address to invalidate */
+   mtdcr   DCRN_L2C0_CMD,r4/* issue the Invalidate cmd */
+
+2: mfdcr   r5,DCRN_L2C0_SR /* wait for complete */
+	andis.	r5,r5,L2C_CMD_CLR>>16
+	beq	2b
+
+   addir3,r3,L2_CACHE_BYTES/* next address to invalidate */
+   bdnz1b
+   blr
+#endif
+
 /*
  * Write any modified data cache blocks out to memory.
  * Does not invalidate the corresponding cache lines (especially for
diff --git a/include/asm-powerpc/cache.h b/include/asm-powerpc/cache.h
index 5350704..8a2f9e6 100644
--- a/include/asm-powerpc/cache.h
+++ b/include/asm-powerpc/cache.h
@@ -10,12 +10,14 @@
 #define MAX_COPY_PREFETCH  1
 #elif defined(CONFIG_PPC32)
 #define L1_CACHE_SHIFT 5
+#define L2_CACHE_SHIFT 5
 #define MAX_COPY_PREFETCH  4
 #else /* CONFIG_PPC64 */
 #define L1_CACHE_SHIFT 7
 #endif
 
 #define	L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
+#define	L2_CACHE_BYTES		(1 << L2_CACHE_SHIFT)
 
 #define	SMP_CACHE_BYTES		L1_CACHE_BYTES
 
diff --git a/include/asm-powerpc/cacheflush.h b/include/asm-powerpc/cacheflush.h
index ba667a3..bdebfaa 100644
--- a/include/asm-powerpc/cacheflush.h
+++ b/include/asm-powerpc/cacheflush.h
 @@ -49,6 +49,7 @@ extern void flush_dcache_range(unsigned long start, unsigned long stop);
 #ifdef CONFIG_PPC32
 extern void clean_dcache_range(unsigned long start, unsigned long stop);
 extern void invalidate_dcache_range(unsigned long start, unsigned long stop);
+extern void invalidate_l2cache_range(unsigned long start, unsigned long stop);
 #endif /* CONFIG_PPC32 */
 #ifdef CONFIG_PPC64
 extern void flush_inval_dcache_range(unsigned long start, unsigned long stop);
diff --git a/include/asm-ppc/ibm44x.h b/include/asm-ppc/ibm44x.h
index 8078a58..782909a 100644
--- a/include/asm-ppc/ibm44x.h
+++ b/include/asm-ppc/ibm44x.h
@@ -138,7 +138,6 @@
  * The residual board information structure the boot loader passes
  * into the kernel.
  */
-#ifndef __ASSEMBLY__
 
 /*
  * DCRN definitions
@@ -814,6 +813,5 @@
 
 #include asm/ibm4xx.h
 
-#endif /* __ASSEMBLY__ */
 #endif /* __ASM_IBM44x_H__ */
 #endif /* __KERNEL__ */ 

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


[PATCH 2/2] [PPC 44x] enable L2-cache for ALPR, Katmai, Ocotea, and Taishan

2007-11-06 Thread Yuri Tikhonov
 This patch introduces the L2_CACHE configuration option available
for the ppc44x-based boards with L2-cache enabled.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Pavel Kolesnikov [EMAIL PROTECTED]

--
diff --git a/arch/ppc/platforms/4xx/Kconfig b/arch/ppc/platforms/4xx/Kconfig
index 1d2ca42..ad6b581 100644
--- a/arch/ppc/platforms/4xx/Kconfig
+++ b/arch/ppc/platforms/4xx/Kconfig
@@ -396,4 +396,12 @@ config SERIAL_SICC_CONSOLE
bool
 	depends on SERIAL_SICC && UART0_TTYS1
default y
+
+config L2_CACHE
+	bool "Enable Level-2 Cache"
+	depends on NOT_COHERENT_CACHE && (KATMAI || TAISHAN || OCOTEA || ALPR)
+   default y
+   help
+ This option enables L2-cache on ppc44x controllers.
+ If unsure, say Y.
 endmenu
diff --git a/arch/ppc/platforms/4xx/alpr.c b/arch/ppc/platforms/4xx/alpr.c
index 3b6519f..0623801 100644
--- a/arch/ppc/platforms/4xx/alpr.c
+++ b/arch/ppc/platforms/4xx/alpr.c
@@ -537,10 +537,12 @@ static void __init alpr_setup_arch(void)
 	printk("Prodrive ALPR port (DENX Software Engineering [EMAIL PROTECTED])\n");
 }
 
+#ifdef CONFIG_L2_CACHE
 static void __init alpr_init(void)
 {
ibm440gx_l2c_setup(clocks);
 }
+#endif
 
 static void alpr_progress(char *buf, unsigned short val)
 {
 @@ -567,7 +569,9 @@ void __init platform_init(unsigned long r3, unsigned long r4,
 #ifdef CONFIG_KGDB
ppc_md.early_serial_map = alpr_early_serial_map;
 #endif
+#ifdef CONFIG_L2_CACHE
ppc_md.init = alpr_init;
+#endif
ppc_md.restart = alpr_restart;
 }
 
diff --git a/arch/ppc/platforms/4xx/katmai.c b/arch/ppc/platforms/4xx/katmai.c
index d29ebf6..01f1baf 100644
--- a/arch/ppc/platforms/4xx/katmai.c
+++ b/arch/ppc/platforms/4xx/katmai.c
@@ -219,6 +219,7 @@ katmai_show_cpuinfo(struct seq_file *m)
 {
 	seq_printf(m, "vendor\t\t: AMCC\n");
 	seq_printf(m, "machine\t\t: PPC440SPe EVB (Katmai)\n");
+   ibm440gx_show_cpuinfo(m);
 
return 0;
 }
@@ -584,6 +585,13 @@ static void katmai_restart(char *cmd)
mtspr(SPRN_DBCR0, DBCR0_RST_CHIP);
 }
 
+#ifdef CONFIG_L2_CACHE
+static void __init katmai_init(void)
+{
+   ibm440gx_l2c_setup(clocks);
+}
+#endif
+
 void __init platform_init(unsigned long r3, unsigned long r4,
  unsigned long r5, unsigned long r6, unsigned long r7)
 {
 @@ -599,4 +607,7 @@ void __init platform_init(unsigned long r3, unsigned long r4,
ppc_md.early_serial_map = katmai_early_serial_map;
 #endif
ppc_md.restart = katmai_restart;
+#ifdef CONFIG_L2_CACHE
+   ppc_md.init = katmai_init;
+#endif
 }
diff --git a/arch/ppc/platforms/4xx/ocotea.c b/arch/ppc/platforms/4xx/ocotea.c
index a7435aa..8b13811 100644
--- a/arch/ppc/platforms/4xx/ocotea.c
+++ b/arch/ppc/platforms/4xx/ocotea.c
@@ -321,10 +321,12 @@ ocotea_setup_arch(void)
 	printk("IBM Ocotea port (MontaVista Software, Inc. [EMAIL PROTECTED])\n");
 }
 
+#ifdef CONFIG_L2_CACHE
 static void __init ocotea_init(void)
 {
ibm440gx_l2c_setup(clocks);
 }
+#endif
 
 void __init platform_init(unsigned long r3, unsigned long r4,
unsigned long r5, unsigned long r6, unsigned long r7)
 @@ -345,5 +347,7 @@ void __init platform_init(unsigned long r3, unsigned long r4,
 #ifdef CONFIG_KGDB
ppc_md.early_serial_map = ocotea_early_serial_map;
 #endif
+#ifdef CONFIG_L2_CACHE
ppc_md.init = ocotea_init;
+#endif
 }
diff --git a/arch/ppc/platforms/4xx/taishan.c b/arch/ppc/platforms/4xx/taishan.c
index f4b9435..8bb6f15 100644
--- a/arch/ppc/platforms/4xx/taishan.c
+++ b/arch/ppc/platforms/4xx/taishan.c
@@ -370,10 +370,12 @@ taishan_setup_arch(void)
 	printk("AMCC PowerPC 440GX Taishan Platform\n");
 }
 
+#ifdef CONFIG_L2_CACHE
 static void __init taishan_init(void)
 {
ibm440gx_l2c_setup(clocks);
 }
+#endif
 
 void __init platform_init(unsigned long r3, unsigned long r4,
unsigned long r5, unsigned long r6, unsigned long r7)
 @@ -389,6 +391,8 @@ void __init platform_init(unsigned long r3, unsigned long r4,
 #ifdef CONFIG_KGDB
ppc_md.early_serial_map = taishan_early_serial_map;
 #endif
+#ifdef CONFIG_L2_CACHE
ppc_md.init = taishan_init;
+#endif
 }
   

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


[PATCH 0/2] [PPC 4xx] L2-cache synchronization for ppc44x

2007-11-06 Thread Yuri Tikhonov

 Hello all,

 Here is a patch-set to support L2-cache synchronization routines for
the ppc44x processor family. I know that the ppc branch is for bug-fixing only, 
so the patch-set is just FYI [though an enabled but non-coherent L2 cache may 
look like a bug to someone who uses one of the boards listed below :)].

[PATCH 1/2] [PPC 4xx] invalidate_l2cache_range() implementation for ppc44x;
[PATCH 2/2] [PPC 44x] enable L2-cache for the following ppc44x-based boards: 
ALPR,
Katmai, Ocotea, and Taishan.

 Regards, Yuri

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: [PATCH] ppc44x: support for 256K PAGE_SIZE

2007-10-22 Thread Yuri Tikhonov
On Friday 19 October 2007 17:24, Kumar Gala wrote:
 
 On Oct 18, 2007, at 6:21 PM, Paul Mackerras wrote:
 
  Yuri Tikhonov writes:
 
  The following patch adds support for 256KB PAGE_SIZE on ppc44x- 
  based boards.
  The applications to be run on the kernel with 256KB PAGE_SIZE have  
  to be
  built using the modified version of binutils, where the MAXPAGESIZE
   definition is set to 0x40000 (as opposed to the standard 0x10000).
 
  Have you measured the performance using a 64kB page size?  If so, how
  does it compare with the 256kB page size?
 
 I was wondering about this as well?  Isn't this technically in  
 violation of the ABI?

 
 No, it isn't a violation.

 As stated in the System V ABI, PowerPC Processor Supplement
(on which the Linux Standard Base Core Specification for PPC32
is based): ... Virtual addresses and file offsets for the PowerPC processor 
family segments are congruent modulo 64 Kbytes (0x10000) or larger powers of 2.

 So, 256 Kbytes is just a larger case.

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: [PATCH] ppc44x: support for 256K PAGE_SIZE

2007-10-19 Thread Yuri Tikhonov
On Friday 19 October 2007 03:21, Paul Mackerras wrote:
 Have you measured the performance using a 64kB page size?  If so, how
 does it compare with the 256kB page size?

 I measured the performance of sequential full-stripe write operations to
a RAID-5 array (P values below are in MB per second) using the h/w accelerated
RAID-5 driver.

 Here are the comparative results for the different PAGE_SIZE values:

PAGE_SIZE = 4K:
 P = 66 MBps;

PAGE_SIZE = 16K:
 P = 145 MBps;

PAGE_SIZE = 64K:
 P = 196 MBps;

PAGE_SIZE = 256K:
 P = 217 MBps.

 The 64kB page size has the attraction that no binutils changes are
 required.

 That's true, but the additional performance is an attractive thing too.

 
 Paul.
 

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: [PATCH] ppc44x: support for 256K PAGE_SIZE

2007-10-19 Thread Yuri Tikhonov
On Friday 19 October 2007 19:48, Kumar Gala wrote:
  PAGE_SIZE = 4K:
   P = 66 MBps;
 
  PAGE_SIZE = 16K:
   P = 145 MBps;
 
  PAGE_SIZE = 64K:
   P = 196 MBps;
 
  PAGE_SIZE = 256K:
   P = 217 MBps.
 
 Is this all in kernel space? or is there a user space aspect to the  
 benchmark?

 The situation here is that the Linux RAID driver does a lot of complex 
processing of the pages (strips of the array) on the CPU before submitting them 
to the h/w; this is where most of the time is spent. Thus, by increasing 
PAGE_SIZE we reduce the number of calls to these complex algorithms that are 
needed to process the whole test (writing a fixed number of MBytes to the RAID 
array). So, there are no user-space aspects.

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: [PATCH] ppc44x: support for 256K PAGE_SIZE

2007-10-18 Thread Yuri Tikhonov

 It has turned out that my mailer had corrupted my previous message (thanks to
Wolfgang Denk for pointing this out). So if you'd like to apply the patch 
without conflicts, please use the version of the patch in this mail.

 The following patch adds support for 256KB PAGE_SIZE on ppc44x-based boards. 
The applications to be run on the kernel with 256KB PAGE_SIZE have to be 
built using the modified version of binutils, where the MAXPAGESIZE 
definition is set to 0x40000 (as opposed to the standard 0x10000).

 Signed-off-by: Pavel Kolesnikov [EMAIL PROTECTED]
 Acked-by: Yuri Tikhonov [EMAIL PROTECTED]

--

diff --git a/arch/ppc/Kconfig b/arch/ppc/Kconfig
index c590b18..0ee372d 100644
--- a/arch/ppc/Kconfig
+++ b/arch/ppc/Kconfig
@@ -1223,6 +1223,9 @@ config PPC_PAGE_16K
 
 config PPC_PAGE_64K
 	bool "64 KB" if 44x
+
+config PPC_PAGE_256K
+	bool "256 KB" if 44x
 endchoice
 
 endmenu
diff --git a/arch/ppc/kernel/entry.S b/arch/ppc/kernel/entry.S
index fba7ca1..2140341 100644
--- a/arch/ppc/kernel/entry.S
+++ b/arch/ppc/kernel/entry.S
@@ -200,7 +200,7 @@ _GLOBAL(DoSyscall)
 #ifdef SHOW_SYSCALLS
bl  do_show_syscall
 #endif /* SHOW_SYSCALLS */
-   rlwinm  r10,r1,0,0,18   /* current_thread_info() */
+   rlwinm  r10,r1,0,0,(31-THREAD_SHIFT)/* current_thread_info() */
lwz r11,TI_FLAGS(r10)
andi.   r11,r11,_TIF_SYSCALL_T_OR_A
bne-syscall_dotrace
@@ -221,7 +221,7 @@ ret_from_syscall:
bl  do_show_syscall_exit
 #endif
mr  r6,r3
-   rlwinm  r12,r1,0,0,18   /* current_thread_info() */
+   rlwinm  r12,r1,0,0,(31-THREAD_SHIFT)/* current_thread_info() */
/* disable interrupts so current_thread_info()-flags can't change */
LOAD_MSR_KERNEL(r10,MSR_KERNEL) /* doesn't include MSR_EE */
SYNC
@@ -639,7 +639,7 @@ ret_from_except:
 
 user_exc_return:   /* r10 contains MSR_KERNEL here */
/* Check current_thread_info()-flags */
-   rlwinm  r9,r1,0,0,18
+   rlwinm  r9,r1,0,0,(31-THREAD_SHIFT)
lwz r9,TI_FLAGS(r9)
andi.   r0,r9,(_TIF_SIGPENDING|_TIF_RESTORE_SIGMASK|_TIF_NEED_RESCHED)
bne do_work
@@ -659,7 +659,7 @@ restore_user:
 /* N.B. the only way to get here is from the beq following ret_from_except. */
 resume_kernel:
/* check current_thread_info-preempt_count */
-   rlwinm  r9,r1,0,0,18
+   rlwinm  r9,r1,0,0,(31-THREAD_SHIFT)
lwz r0,TI_PREEMPT(r9)
cmpwi   0,r0,0  /* if non-zero, just restore regs and return */
bne restore
@@ -669,7 +669,7 @@ resume_kernel:
andi.   r0,r3,MSR_EE/* interrupts off? */
beq restore /* don't schedule if so */
 1: bl  preempt_schedule_irq
-   rlwinm  r9,r1,0,0,18
+   rlwinm  r9,r1,0,0,(31-THREAD_SHIFT)
lwz r3,TI_FLAGS(r9)
andi.   r0,r3,_TIF_NEED_RESCHED
bne-1b
@@ -875,7 +875,7 @@ recheck:
LOAD_MSR_KERNEL(r10,MSR_KERNEL)
SYNC
MTMSRD(r10) /* disable interrupts */
-   rlwinm  r9,r1,0,0,18
+   rlwinm  r9,r1,0,0,(31-THREAD_SHIFT)
lwz r9,TI_FLAGS(r9)
andi.   r0,r9,_TIF_NEED_RESCHED
bne-do_resched
diff --git a/arch/ppc/kernel/head_booke.h b/arch/ppc/kernel/head_booke.h
index f3d274c..db4 100644
--- a/arch/ppc/kernel/head_booke.h
+++ b/arch/ppc/kernel/head_booke.h
@@ -20,7 +20,9 @@
beq 1f;  \
mfspr   r1,SPRN_SPRG3;  /* if from user, start at top of   */\
lwz r1,THREAD_INFO-THREAD(r1); /* this thread's kernel stack   */\
-   addir1,r1,THREAD_SIZE;   \
+	lis	r11,THREAD_SIZE@h;					     \
+	ori	r11,r11,THREAD_SIZE@l;					     \
+	add	r1,r1,r11;						     \
 1: subir1,r1,INT_FRAME_SIZE;   /* Allocate an exception frame */\
mr  r11,r1;  \
stw r10,_CCR(r11);  /* save various registers  */\
@@ -106,7 +108,9 @@
/* COMING FROM USER MODE */  \
mfspr   r11,SPRN_SPRG3; /* if from user, start at top of   */\
lwz r11,THREAD_INFO-THREAD(r11); /* this thread's kernel stack */\
-   addir11,r11,THREAD_SIZE; \
+	lis	r11,THREAD_SIZE@h;					     \
+	ori	r11,r11,THREAD_SIZE@l;					     \
+	add	r1,r1,r11;						     \
 1: subir11,r11,INT_FRAME_SIZE; /* Allocate an exception frame */\
stw r10,_CCR(r11);  /* save various registers  */\
stw r12,GPR12(r11

Re: [PATCH] ppc44x: support for 256K PAGE_SIZE

2007-10-18 Thread Yuri Tikhonov

On Thursday 18 October 2007 14:44, you wrote:
 Sorry, this is against arch/ppc which is bug fix only.  New features
 should be done against arch/powerpc. 

 Understood. The situation here is that the boards which required these
modifications have no support in the arch/powerpc branch yet, which is 
why we made the change in arch/ppc.

 Also, I'd rather see something along the lines of hugetlbfs support instead.

 Here I agree with Benjamin. Furthermore, IIRC the hugetlb file-system is
supported for PPC64 architectures only. Here we have PPC32.

 josh

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: [PATCH] ppc44x: support for 256K PAGE_SIZE

2007-10-18 Thread Yuri Tikhonov

On Thursday 18 October 2007 15:47, Benjamin Herrenschmidt wrote:
 
   Signed-off-by: Pavel Kolesnikov [EMAIL PROTECTED]
   Acked-by: Yuri Tikhonov [EMAIL PROTECTED]
 
 Small nit...
 
 You are posting the patch, thus you should be signing off, not ack'ing.
 
 Ack'ing means you agree with the patch but you aren't in the handling
 chain for it. In this case, it seems like the author is Pavel and you
 are forwarding it, in wich case, you -are- in the handling chain and
 should should sign it off.
 
 Best would be for Pavel (if he is indeed the author) to submit it
 himself though.

  Thanks for the explanations. Will keep this in mind in the future.

 
 Ben.

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: [PATCH] ppc44x: support for 256K PAGE_SIZE

2007-10-18 Thread Yuri Tikhonov
On Thursday 18 October 2007 16:12, Benjamin Herrenschmidt wrote:

  I always reserve the right to change my mind.  If something makes sense
  and the code is decent enough then it might very well be acceptable.
  Requiring a modified binutils makes me a bit nervous though.
 
  From a kernel point of view, I totally don't care about the modified
 binutils to build userspace as long as it's not required to build the
 kernel and that option is not enabled by default (and explicitely
 documented as having that requirement).
 
 If it is necessary for building the kernel, then I'm a bit cooler about
 the whole thing indeed, the max page size needs to be added at least as
 a command line or linker script param so a different build of binutils
 isn't needed.

 No, a kernel with 256K pages is built using the standard binutils. The 
modifications are necessary only for user-space applications, and for the 
libraries as well.

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: [PATCH] ppc44x: support for 256K PAGE_SIZE

2007-10-18 Thread Yuri Tikhonov
On Thursday 18 October 2007 17:25, Josh Boyer wrote:
   Understood. The situation here is that the boards, which required these
  modifications, have no support in the arch/powerpc branch. So this is 
  why we made this in arch/ppc.
 
 Bit of a dilemma then.  What board exactly?

 These are the Katmai and Yucca PPC440SPe-based boards (from AMCC).

   Also, I'd rather see something along the lines of hugetlbfs support 
   instead.
  
   Here I agree with Benjamin. Furthermore, IIRC the hugetlb file-system is
  supported for PPC64 architectures only. Here we have PPC32.
 
 Well that needs fixing anyway, but ok.  Also, is the modified binutils
 only required for userspace to take advantage here?  Seems so, but I'd
 just like to be sure.

 You are right, for userspace only. 

 
 josh
 

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev