Re[4]: [PATCH][v4] powerpc 44x: support for 256KB PAGE_SIZE
On Sunday, January 18, 2009 you wrote:

Ok, I tried this out in menuconfig. You are right that the "depends on" makes sense, as it removes the option from the config file as not relevant. But right now, to enable 256K pages one has to go to platform setup to find this dependency, then to general setup to find the shmem option at the bottom of the list in the embedded/expert section, and finally to the kernel options menu to choose the page size. Moving this question just before the page size choice removes one of those hidden menus, so I suggest that it be moved to just before the option that it allows to be selected.

Right, it'll be more convenient. My [v5] patch addresses this.

Regards, Yuri

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
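The suggested relocation might look roughly like this in arch/powerpc/Kconfig (a sketch of the idea only; the help text and the abbreviated choice block are illustrative assumptions, not the actual v5 patch):

```
# Sketch: moving the binutils question next to the page-size choice it
# gates, so the user sees both in the same "Kernel options" menu.
config STDBINUTILS
	bool "Using standard binutils settings"
	depends on 44x
	default y
	help
	  Say N here to allow non-standard page sizes (e.g. 256KB) that
	  need binutils built with a larger maximum page size.

choice
	prompt "Page size"
	default PPC_4K_PAGES
	...
config PPC_256K_PAGES
	bool "256k page size" if 44x
	depends on !STDBINUTILS && !SHMEM
endchoice
```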
Re[4]: [PATCH 02/11][v3] async_tx: add support for asynchronous GF multiplication
Hello Dan,

On Friday, January 16, 2009 you wrote:

On Fri, Jan 16, 2009 at 4:41 AM, Yuri Tikhonov y...@emcraft.com wrote:

I don't think this will work, as we will be mixing Q into the new P and P into the new Q. In order to support (src_cnt > device->max_pq) we need to explicitly tell the driver that the operation is being continued (DMA_PREP_CONTINUE) and to apply different coefficients to P and Q to cancel the effect of including them as sources.

With the DMA_PREP_ZERO_P/Q approach, Q isn't mixed into the new P, and P isn't mixed into the new Q. For your example of max_pq=4:

	p, q = PQ(src0, src1, src2, src3, src4, COEF({01}, {02}, {04}, {08}, {10}))

with the current implementation will be split into:

	p, q   = PQ(src0, src1, src2, src3, COEF({01}, {02}, {04}, {08}))
	p`, q` = PQ(src4, COEF({10}))

which results in the following:

	p  = ((dma_flags & DMA_PREP_ZERO_P) ? 0 : old_p) + src0 + src1 + src2 + src3
	q  = ((dma_flags & DMA_PREP_ZERO_Q) ? 0 : old_q) + {01}*src0 + {02}*src1 + {04}*src2 + {08}*src3
	p` = p + src4
	q` = q + {10}*src4

Huh? Does the ppc440spe engine have some notion of flagging a source as old_p/old_q? Otherwise I do not see how the engine will not turn this into:

	p` = p + src4 + q
	q` = q + {10}*src4 + {x}*p

I think you missed the fact that we have passed p and q back in as sources. Unless we have multiple p destinations and multiple q destinations, or hardware support for continuations, I do not see how you can guarantee this split.

I guess I've got your point. You are missing the fact that the destinations for 'p' and 'q' are passed to the device_prep_dma_pq() method separately from the sources. In your words: we do not have multiple destinations through the while() cycles; the destinations are the same in each pass. Please look at the do_async_pq() implementation more carefully: 'blocks' is a pointer to 'src_cnt' sources _plus_ two destination pages (as stated in the async_pq() description).
Before coming into the while() cycle we save the destinations in the dma_dest[] array, and then pass this to device_prep_dma_pq() in each (src_cnt/max_pq) cycle. That is, we do not pass destinations as sources explicitly: we just clear the DMA_PREP_ZERO_P/Q flags to notify the ADMA level that it has to XOR the current content of the destination(s) with the result of the new operation.

I'm afraid that the difference (13/4, 125/32) is very significant, so getting rid of DMA_PREP_ZERO_P/Q will eat most of the improvement which could be achieved with the current approach.

Data corruption is a slightly higher cost :-). But at this point I do not see a cleaner alternative for engines like iop13xx.

I can't find any description of the iop13xx processors on Intel's web site, only 3xx: http://www.intel.com/design/iio/index.htm?iid=ipp_embed+embed_io So it's hard for me to make any suggestions. I just wonder - doesn't iop13xx allow users to program destination addresses into the source fields of descriptors?

Yes it does, but the engine does not know it is a destination. Take a look at page 496 of the following and tell me if you come to a different conclusion: http://download.intel.com/design/iio/docs/31503602.pdf

I see. The major difference in the implementation of P+Q support in the ppc440spe DMA engines is that ppc440spe allows including (XORing) the previous content of P_Result and/or Q_Result just by setting a corresponding indication in the destination (P_Result and/or Q_Result) address(es). The 5.7.5 "P+Q Update Operation" case won't help here since, if I understand it right, it doesn't allow setting up different multipliers for Old and New Data. So it looks like your approach:

	p', q' = PQ(p, q, q, src4, COEF({00}, {01}, {00}, {10}))

is the only possible way of including the previous P/Q content into the calculation. But I still think that this p'/q' hack should have a place at the ADMA level, not ASYNC_TX.
It looks more generic if ASYNC_TX assumes that ADMA is capable of p'=p+src / q'=q+{}*src. Otherwise we'll have an overhead for the DMAs which could work without it. In your case, the IOP ADMA driver should handle the situation when it receives 4 sources to be P+Q-ed with the previous contents of the destinations, for example, by generating a sequence of 4 descriptors to process such a request.

Regards, Yuri

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
Re[4]: [PATCH 03/11][v3] async_tx: add support for asynchronous RAID6 recovery operations
On Friday, January 16, 2009 you wrote:

On Fri, Jan 16, 2009 at 4:51 AM, Yuri Tikhonov y...@emcraft.com wrote:

The reason why I preferred to use async_pq() instead of async_xor() here is to maximize the chance that the whole D+D recovery operation will be handled in one ADMA device, i.e. without a channel switch and the latency introduced because of that.

This should be a function of the async_tx_find_channel implementation. The default version tries to keep a chain of operations on one channel:

struct dma_chan *
__async_tx_find_channel(struct dma_async_tx_descriptor *depend_tx,
			enum dma_transaction_type tx_type)
{
	/* see if we can keep the chain on one channel */
	if (depend_tx &&
	    dma_has_cap(tx_type, depend_tx->chan->device->cap_mask))
		return depend_tx->chan;
	return dma_find_channel(tx_type);
}

Right. Then I need to update my ADMA driver and add support for an explicit DMA_XOR capability on channels which can process DMA_PQ. Thanks.

Regards, Yuri

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
Re[2]: [PATCH 11/11][v2] ppc440spe-adma: ADMA driver for PPC440SP(e) systems
Hello David,

Thanks a lot for the review. The general note to be made here is that the changes to the DTS file made by this patch are necessary for the ppc440spe ADMA driver, which is a not-yet-completed arch/powerpc port from the arch/ppc branch, and which uses the DT (well, incorrectly) just to get interrupts. Otherwise, it's just a platform device driver. We provided this ADMA driver just as a reference driver which implements the RAID-6 related low-level stuff. ppc440spe ADMA in its current state is far from ready for merging. We'll elaborate on cleaning it up later (surely taking into account all the comments made by the community). But even now the driver works, so we publish it so that interested people can use and test it. Some comments mixed in below.

On Tuesday, January 13, 2009 you wrote:

On Tue, Jan 13, 2009 at 03:43:55AM +0300, Yuri Tikhonov wrote:

Adds the platform device definitions and the architecture specific support routines for the ppc440spe adma driver. Any board equipped with the PPC440SP(e) controller may utilize this driver.

diff --git a/arch/powerpc/boot/dts/katmai.dts b/arch/powerpc/boot/dts/katmai.dts
index 077819b..f2f77c8 100644
--- a/arch/powerpc/boot/dts/katmai.dts
+++ b/arch/powerpc/boot/dts/katmai.dts
@@ -16,7 +16,7 @@
 / {
 	#address-cells = <2>;
-	#size-cells = <1>;
+	#size-cells = <2>;

You've changed the root level size-cells, but haven't updated the sub-nodes (such as /memory) accordingly.

Thanks, we'll fix this in the next version of this patch.

 	model = "amcc,katmai";
 	compatible = "amcc,katmai";
 	dcr-parent = <&{/cpus/c...@0}>;
@@ -392,6 +392,30 @@
 		0x0 0x0 0x0 0x3 &UIC3 0xa 0x4 /* swizzled int C */
 		0x0 0x0 0x0 0x4 &UIC3 0xb 0x4 /* swizzled int D */;
 	};
+	DMA0: dma0 {

No 'compatible' property, which seems dubious.

OK, we'll fix.
+		interrupt-parent = <&DMA0>;
+		interrupts = <0 1>;
+		#interrupt-cells = <1>;
+		#address-cells = <0>;
+		#size-cells = <0>;
+		interrupt-map = <
+			0 &UIC0 0x14 4
+			1 &UIC1 0x16 4>;
+	};
+	DMA1: dma1 {
+		interrupt-parent = <&DMA1>;
+		interrupts = <0 1>;
+		#interrupt-cells = <1>;
+		#address-cells = <0>;
+		#size-cells = <0>;
+		interrupt-map = <
+			0 &UIC0 0x16 4
+			1 &UIC1 0x16 4>;

Are these interrupt-maps correct? The second interrupt from both dma controllers is routed to the same line on UIC1?

The map is correct:
- the first interrupts are 'DMAx Command Status FIFO Needs Service';
- the second interrupt is 'DMA Error'; both DMA engines share a common error IRQ.

+	};
+	xor {
+		interrupt-parent = <&UIC1>;
+		interrupts = <0x1f 4>;

What the hell is this thing? No compatible property, nor even a meaningful name.

This is the XOR accelerator, the dedicated DMA engine of the ppc440spe equipped with the ability to do XOR operations in hardware. I guess it could be named something like DMA2.

Regards, Yuri

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
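A node shaped along the reviewer's comments might look like the sketch below. The compatible string and reg range are invented placeholders for illustration, not values from the actual patch:

```dts
DMA0: dma0 {
	/* hypothetical binding: compatible and reg are illustrative */
	compatible = "amcc,dma-440spe";
	reg = <4 0x00100100 0x100>;	/* assumed register range */
	interrupt-parent = <&DMA0>;
	interrupts = <0 1>;
	#interrupt-cells = <1>;
	#address-cells = <0>;
	#size-cells = <0>;
	interrupt-map = <
		/* 0: "command status FIFO needs service" */
		0 &UIC0 0x14 4
		/* 1: shared "DMA error" line */
		1 &UIC1 0x16 4>;
};
```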
Re[2]: [RFC PATCH 00/11][v3] md: support for asynchronous execution of RAID6 operations
Hello Dan,

On Wednesday, January 14, 2009 you wrote:

[..] Do you have a git tree where you can post this series? That would make it easier for me to track/review.

Yes. Please see the raidstuff branch in the linux-2.6-denx repository: http://git.denx.de/?p=linux-2.6-denx.git;a=shortlog;h=refs/heads/raidstuff

Regards, Yuri

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
Re[2]: [PATCH 02/11][v3] async_tx: add support for asynchronous GF multiplication
) +#define DMA_QCHECK_FAILED	(1 << 1)

Perhaps turn these into an enum, such that we can pass around an enum pq_check_flags pointer rather than a non-descript u32 *.

Agree.

Regards, Yuri

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
Re[2]: [PATCH 03/11][v3] async_tx: add support for asynchronous RAID6 recovery operations
On Thursday, January 15, 2009 Dan Williams wrote:

On Mon, Jan 12, 2009 at 5:43 PM, Yuri Tikhonov y...@emcraft.com wrote:

+	/* (2) Calculate Q+Qxy */
+	lptrs[0] = ptrs[failb];
+	lptrs[1] = ptrs[disks-1];
+	lptrs[2] = NULL;
+	tx = async_pq(lptrs, NULL, 0, 1, bytes, ASYNC_TX_DEP_ACK,
+		tx, NULL, NULL);
+
+	/* (3) Calculate P+Pxy */
+	lptrs[0] = ptrs[faila];
+	lptrs[1] = ptrs[disks-2];
+	lptrs[2] = NULL;
+	tx = async_pq(lptrs, NULL, 0, 1, bytes, ASYNC_TX_DEP_ACK,
+		tx, NULL, NULL);
+

These two calls convinced me that ASYNC_TX_PQ_ZERO_{P,Q} need to go. A 1-source async_pq operation does not make sense.

Another source is hidden under the not-set ASYNC_TX_PQ_ZERO_{P,Q} :) Though, I agree, this looks rather misleading. These should be:

	/* (2) Calculate Q+Qxy */
	lptrs[0] = ptrs[disks-1];
	lptrs[1] = ptrs[failb];
	tx = async_xor(lptrs[0], lptrs, 0, 2, bytes,
		ASYNC_TX_XOR_DROP_DST|ASYNC_TX_DEP_ACK,
		tx, NULL, NULL);

	/* (3) Calculate P+Pxy */
	lptrs[0] = ptrs[disks-2];
	lptrs[1] = ptrs[faila];
	tx = async_xor(lptrs[0], lptrs, 0, 2, bytes,
		ASYNC_TX_XOR_DROP_DST|ASYNC_TX_DEP_ACK,
		tx, NULL, NULL);

The reason why I preferred to use async_pq() instead of async_xor() here is to maximize the chance that the whole D+D recovery operation will be handled in one ADMA device, i.e. without a channel switch and the latency introduced because of that. So, if we decide to stay with ASYNC_TX_PQ_ZERO_{P,Q}, then this should probably be kept unchanged; but if we get rid of ASYNC_TX_PQ_ZERO_{P,Q}, then obviously we'll have to use async_xor()s here as you suggest.

Regards, Yuri

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
Re[2]: [PATCH 11/11][v2] ppc440spe-adma: ADMA driver for PPC440SP(e) systems
Hello Anton,

Thanks for the review. Please note the general note I made in "Re[2]: [PATCH 11/11][v2] ppc440spe-adma: ADMA driver for PPC440SP(e) systems". All your comments make sense, so we'll try to address them in the next version of the driver. Some comments below.

On Thursday, January 15, 2009 you wrote:

Hello Yuri,

On Tue, Jan 13, 2009 at 03:43:55AM +0300, Yuri Tikhonov wrote:

Adds the platform device definitions and the architecture specific support routines for the ppc440spe adma driver. Any board equipped with the PPC440SP(e) controller may utilize this driver.

Signed-off-by: Yuri Tikhonov y...@emcraft.com
Signed-off-by: Ilya Yanok ya...@emcraft.com
---

Quite a complex and interesting driver, I must say. Have you thought about splitting ppc440spe-adma.c into multiple files, btw?

Admittedly, no. But I guess this makes sense. The driver supports two different types of DMA devices on the ppc440spe: DMA0/1 and DMA2 [the XOR engine]. So we could split the driver at least in two, which would definitely simplify the code.

A few comments down below...

[...]

+typedef struct ppc440spe_adma_device {

Please avoid typedefs.

OK.

[...]

+/*
+ * Descriptor of allocated CDB
+ */
+typedef struct {
+	dma_cdb_t *vaddr;	/* virtual address of CDB */
+	dma_addr_t paddr;	/* physical address of CDB */
+	/*
+	 * Additional fields
+	 */
+	struct list_head link;	/* link in processing list */
+	u32 status;		/* status of the CDB */
+	/* status bits: */
+	#define DMA_CDB_DONE	(1 << 0)	/* CDB processing completed */
+	#define DMA_CDB_CANCEL	(1 << 1)	/* waiting thread was interrupted */
+} dma_cdbd_t;

It seems there are no users of this struct.

Indeed. This is a useless leftover from some old version of the driver. We will remove it in the next patch.

[..]
+/**
+ * ppc440spe_desc_init_dma01pq - initialize the descriptors for PQ operation
+ * with DMA0/1
+ */
+static inline void ppc440spe_desc_init_dma01pq(ppc440spe_desc_t *desc,
+		int dst_cnt, int src_cnt, unsigned long flags,
+		unsigned long op)
+{

Way too big for inline. The same for all the inlines. Btw, ppc_async_tx_find_best_channel() looks too big for inline and also too big to be in a .h file.

OK, will be moved to the appropriate .c.

[..] [...]

+static int ppc440spe_test_raid6 (ppc440spe_ch_t *chan)
+{
+	ppc440spe_desc_t *sw_desc, *iter;
+	struct page *pg;
+	char *a;
+	dma_addr_t dma_addr, addrs[2];
+	unsigned long op = 0;
+	int rval = 0;
+
+	/*FIXME*/

?

+
+	set_bit(PPC440SPE_DESC_WXOR, &op);
+
+	pg = alloc_page(GFP_KERNEL);
+	if (!pg)
+		return -ENOMEM;
+

+/**
+ * ppc440spe_adma_probe - probe the asynch device
+ */
+static int __devinit ppc440spe_adma_probe(struct platform_device *pdev)
+{
+	struct resource *res;

Why is this a platform driver? What's the point of describing DMA nodes in the device tree without actually using them (don't count interrupts)? There are a lot of hard-coded addresses in the code... :-/ And the arch/powerpc/platforms/44x/ppc440spe_dma_engines.c file reminds me of arch/ppc-style bindings. ;-)

Right. This driver is a not-yet-completed port from the arch/ppc branch.

+	int ret = 0, irq1, irq2, initcode = PPC_ADMA_INIT_OK;
+	void *regs;
+	ppc440spe_dev_t *adev;
+	ppc440spe_ch_t *chan;
+	ppc440spe_aplat_t *plat_data;
+	struct ppc_dma_chan_ref *ref;
+	struct device_node *dp;
+	char s[10];
+

[...]
+static int __init ppc440spe_adma_init (void)
+{
+	int rval, i;
+	struct proc_dir_entry *p;
+
+	for (i = 0; i < PPC440SPE_ADMA_ENGINES_NUM; i++)
+		ppc_adma_devices[i] = -1;
+
+	rval = platform_driver_register(&ppc440spe_adma_driver);
+
+	if (rval == 0) {
+		/* Create /proc entries */
+		ppc440spe_proot = proc_mkdir(PPC440SPE_R6_PROC_ROOT, NULL);
+		if (!ppc440spe_proot) {
+			printk(KERN_ERR "%s: failed to create %s proc "
+				"directory\n", __func__, PPC440SPE_R6_PROC_ROOT);
+			/* User will not be able to enable h/w RAID-6 */
+			return rval;
+		}

/proc? Why /proc? The driver has nothing to do with the Linux VM subsystem or processes. I think a /sys interface would suit better for this, no? Either way, userspace interfaces should be documented somehow (probably in Documentation/ABI/, or at least in some comments in the code).

Agree, we'll fix this.

Regards, Yuri

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
Re[2]: [PATCH 07/11] md: rewrite handle_stripe_dirtying6 in asynchronous way
On Friday, January 16, 2009 you wrote:

On Thu, Jan 15, 2009 at 2:51 PM, Dan Williams dan.j.willi...@intel.com wrote:

On Mon, Dec 8, 2008 at 2:57 PM, Yuri Tikhonov y...@emcraft.com wrote:

What's the reasoning behind changing the logic here, i.e. removing must_compute and such? I'd feel more comfortable seeing copy and paste where possible, with cleanups separated out into their own patch.

Ok, I now see why this change was made. Please make this changelog more descriptive than "Rewrite handle_stripe_dirtying6 function to work asynchronously."

Sure, how about the following:

md: rewrite handle_stripe_dirtying6 in asynchronous way

Processing stripe dirtying in an asynchronous way requires some changes to the handle_stripe_dirtying6() algorithm. In the synchronous implementation of stripe dirtying we processed the dirtying of a degraded stripe (with partially changed strip(s) located on the failed drive(s)) inside one handle_stripe_dirtying6() call:
- we computed the missed strips from the old parities, and thus got a fully up-to-date stripe, then
- we did the reconstruction using the new data to write.

In the asynchronous case of handle_stripe_dirtying6() we don't process anything right inside this function (since we are under the lock), but only schedule the necessary operations with flags. Thus, if handle_stripe_dirtying6() is performed on top of a degraded array, we should schedule the reconstruction operation when the failed strips are marked (by the previously called fetch_block6()) as to be computed (with the R5_Wantcompute flag), and all the other strips of the stripe are UPTODATE. The schedule_reconstruction() function will set the STRIPE_OP_POSTXOR flag [for the new parity calculation], which is then handled in raid_run_ops() after the STRIPE_OP_COMPUTE_BLK one [which computes the missing data].
Regards, Yuri

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
Re[2]: [PATCH 07/11] md: rewrite handle_stripe_dirtying6 in asynchronous way
Hello Cheng,

On Friday, January 16, 2009 you wrote:

Ack; could you please make the changelog more descriptive, and/or add some of your benchmark results?

Of course. We did benchmarking using the Xdd tool as follows:

# xdd -op write -kbytes $kbytes -reqsize $reqsize -dio -passes 2 -verbose -target $target_device

where
	$kbytes        = data disks * size of disk
	$reqsize       = data disks * chunk size
	$target_device = /dev/md0

This way we wrote the full array size, and thus achieved the maximum performance. The test cases were RAID-6 built on top of 14 S-ATA drives connected to 2 LSI cards (7+7) inserted into the 800 MHz Katmai board (based on the ppc440spe) equipped with 4GB of 800 MHz DRAM. Here are the results (Psw - write throughput with s/w RAID-6; Phw - write throughput with the h/w accelerated RAID-6):

PAGE_SIZE=4KB, chunk=64/128/256 KB:	Psw = 71/72/74 MBps;	Phw = 128/136/139 MBps
PAGE_SIZE=16KB, chunk=256/512/1024 KB:	Psw = 81/81/82 MBps;	Phw = 205/244/239 MBps
PAGE_SIZE=64KB, chunk=1024/2048/4096 KB:	Psw = 84/84/85 MBps;	Phw = 258/253/258 MBps
PAGE_SIZE=256KB, chunk=4096/8192/16384 KB:	Psw = 81/83/83 MBps;	Phw = 288/275/274 MBps

Regards, Yuri

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
Re[2]: [PATCH][v4] powerpc 44x: support for 256KB PAGE_SIZE
Hello Milton,

On Friday, January 16, 2009 you wrote:

On Jan 12, 2009, at 4:49 PM, Yuri Tikhonov wrote:

This patch adds support for 256KB pages on ppc44x-based boards.

Another day, another comment. The motivation for the reply was the second comment below.

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 84b8613..18f33ef 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -443,6 +443,19 @@ config PPC_64K_PAGES
 	bool "64k page size" if 44x || PPC_STD_MMU_64
 	select PPC_HAS_HASH_64K if PPC_STD_MMU_64
+config PPC_256K_PAGES
+	bool "256k page size" if 44x
+	depends on !STDBINUTILS && !SHMEM

	depends on !STDBINUTILS && (!SHMEM || BROKEN)

to identify that it is not fundamentally incompatible, just without a chance of working without other changes.

This makes sense.

[..]

+config STDBINUTILS
+	bool "Using standard binutils settings"
+	depends on 44x
+	default y

I think this should be

config STDBINUTILS
	bool "Using standard binutils settings" if 44x
	default y

that way we imply that all powerpc users are using the standard binutils, instead of only those using a 44x platform. We still get the intended effect of asking the user only on 44x. I haven't looked at the resulting question or config order to see if it makes sense to leave it here or put it closer to the page size.

I'm not sure about this. For 44x platforms the STDBINUTILS option is reasonable, because it's used in the PAGE_SIZE selection process. But for the other powerpcs the STDBINUTILS option will do nothing but occupy a superfluous line in configs. Are you sure this will be better?

Regards, Yuri

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
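For reference, the two Kconfig variants being debated differ only in where the symbol exists, not in when the question is asked. A side-by-side sketch with the semantics annotated (illustrative, not a patch):

```
# Variant 1 (the v4 patch): the symbol only exists on 44x, so non-44x
# powerpc .configs carry no STDBINUTILS line at all.
config STDBINUTILS
	bool "Using standard binutils settings"
	depends on 44x
	default y

# Variant 2 (Milton's proposal): the prompt is shown only on 44x, but
# the symbol defaults to y everywhere, so every powerpc .config gets
# STDBINUTILS=y -- the "superfluous line" Yuri objects to.
config STDBINUTILS
	bool "Using standard binutils settings" if 44x
	default y
```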
Re[2]: [PATCH][v3] powerpc 44x: support for 256KB PAGE_SIZE
Hello Prodyut,

On Monday, January 12, 2009 you wrote:

On Sun, Jan 11, 2009 at 10:42 AM, Yuri Tikhonov y...@emcraft.com wrote:

This patch adds support for 256KB pages on ppc44x-based boards.

Hi Yuri, Do you still need the mm/shmem.c patch to avoid the division by zero?

Yes.

I looked at the latest mm/shmem.c git code, and I see that it doesn't have the needed patch for 256KB pages.

Right. We proposed a work-around for this (which simply increased the sizes of the variables which hold the overflowed values) to LKML here: http://lkml.org/lkml/2008/12/19/20 If I understand Hugh right, such a fix is acceptable, but far from the best, so Hugh is about to implement the correct fix for the problem as soon as he finds some time (big thanks to him for this).

I think another option would be to make 256KB compile only if CONFIG_SHMEM=n.

Agree. For the current situation that seems the better solution. I'll update and re-post the patch shortly.

Regards, Yuri

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
[PATCH][v4] powerpc 44x: support for 256KB PAGE_SIZE
This patch adds support for 256KB pages on ppc44x-based boards.

For simplification of the implementation with 256KB pages we still assume 2-level paging. As a side effect this leads to wasting extra memory space reserved for PTE tables: only 1/4 of the pages allocated for PTEs are actually used. But this may be an acceptable trade-off to achieve the high performance we have with big PAGE_SIZEs in some applications (e.g. RAID).

Also, with 256KB PAGE_SIZE we increase THREAD_SIZE up to 32KB to minimize the risk of stack overflows in the cases of on-stack arrays whose size depends on the page size (e.g. multipage BIOs, NTFS, etc.).

With 256KB PAGE_SIZE we need to decrease PKMAP_ORDER at least down to 9, otherwise all of high memory (2 ^ 10 * PAGE_SIZE == 256MB) would be occupied by PKMAP addresses, leaving no place for vmalloc. We do not separate PKMAP_ORDER for 256K from the 16K/64K PAGE_SIZE cases here; actually the value of 10 in the 16K/64K support had been selected rather intuitively. Thus now for all cases of PAGE_SIZE on ppc44x (including the default, 4KB, one) we have 512 pages for PKMAP.

Because the ELF standard supports only page sizes up to 64K, you should use binutils later than 2.17.50.0.3 with '-zmax-page-size' set to 256K for building applications which are to be run with the 256KB-page-sized kernel. If using older binutils, you should patch them as follows:

--- binutils/bfd/elf32-ppc.c.orig
+++ binutils/bfd/elf32-ppc.c
-#define ELF_MAXPAGESIZE		0x10000
+#define ELF_MAXPAGESIZE		0x40000

One more restriction we currently have with 256KB page sizes is the inability to use shmem safely, so, for now, 256KB is available only if you turn the CONFIG_SHMEM option off.
Though, if you need shmem with 256KB pages, you can always remove the !SHMEM dependency in 'config PPC_256K_PAGES' and use the workaround available here: http://lkml.org/lkml/2008/12/19/20

Signed-off-by: Yuri Tikhonov y...@emcraft.com
Signed-off-by: Ilya Yanok ya...@emcraft.com
---
 arch/powerpc/Kconfig                   | 15 +++
 arch/powerpc/include/asm/highmem.h     | 10 +-
 arch/powerpc/include/asm/mmu-44x.h     |  2 ++
 arch/powerpc/include/asm/page.h        |  6 --
 arch/powerpc/include/asm/page_32.h     |  4 
 arch/powerpc/include/asm/thread_info.h |  4 +++-
 arch/powerpc/kernel/head_booke.h       | 11 ++-
 arch/powerpc/platforms/44x/Kconfig     | 12 
 8 files changed, 55 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 84b8613..18f33ef 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -443,6 +443,19 @@ config PPC_64K_PAGES
 	bool "64k page size" if 44x || PPC_STD_MMU_64
 	select PPC_HAS_HASH_64K if PPC_STD_MMU_64
+config PPC_256K_PAGES
+	bool "256k page size" if 44x
+	depends on !STDBINUTILS && !SHMEM
+	help
+	  Make the page size 256k.
+
+	  As the ELF standard only requires alignment to support page
+	  sizes up to 64k, you will need to compile all of your user
+	  space applications with a non-standard binutils settings
+	  (see the STDBINUTILS description for details).
+
+	  Say N unless you know what you are doing.
+
 endchoice

 config FORCE_MAX_ZONEORDER
@@ -455,6 +468,8 @@ config FORCE_MAX_ZONEORDER
 	default 9 if PPC_STD_MMU_32 && PPC_16K_PAGES
 	range 7 64 if PPC_STD_MMU_32 && PPC_64K_PAGES
 	default 7 if PPC_STD_MMU_32 && PPC_64K_PAGES
+	range 5 64 if PPC_STD_MMU_32 && PPC_256K_PAGES
+	default 5 if PPC_STD_MMU_32 && PPC_256K_PAGES
 	range 11 64
 	default 11
 	help

diff --git a/arch/powerpc/include/asm/highmem.h b/arch/powerpc/include/asm/highmem.h
index 04e4a62..a290759 100644
--- a/arch/powerpc/include/asm/highmem.h
+++ b/arch/powerpc/include/asm/highmem.h
@@ -39,15 +39,15 @@ extern pte_t *pkmap_page_table;
  * chunk of RAM.
  */
 /*
- * We use one full pte table with 4K pages.
And with 16K/64K pages pte
- * table covers enough memory (32MB and 512MB resp.) that both FIXMAP
- * and PKMAP can be placed in single pte table. We use 1024 pages for
- * PKMAP in case of 16K/64K pages.
+ * We use one full pte table with 4K pages. And with 16K/64K/256K pages pte
+ * table covers enough memory (32MB/512MB/2GB resp.), so that both FIXMAP
+ * and PKMAP can be placed in a single pte table. We use 512 pages for PKMAP
+ * in case of 16K/64K/256K page sizes.
 */
 #ifdef CONFIG_PPC_4K_PAGES
 #define PKMAP_ORDER	PTE_SHIFT
 #else
-#define PKMAP_ORDER	10
+#define PKMAP_ORDER	9
 #endif
 #define LAST_PKMAP	(1 << PKMAP_ORDER)
 #ifndef CONFIG_PPC_4K_PAGES

diff --git a/arch/powerpc/include/asm/mmu-44x.h b/arch/powerpc/include/asm/mmu-44x.h
index 27cc6fd..3c86576 100644
--- a/arch/powerpc/include/asm/mmu-44x.h
+++ b/arch/powerpc/include/asm/mmu-44x.h
@@ -83,6 +83,8 @@ typedef struct {
 #define PPC44x_TLBE_SIZE	PPC44x_TLB_16K
 #elif (PAGE_SHIFT == 16)
 #define
[RFC PATCH 00/11][v3] md: support for asynchronous execution of RAID6 operations
Hello,

This is the next attempt at asynchronous RAID-6 support. This patch-set addresses Dan Williams' comments (Dec 17) with the following exception:
- I still think that using 'enum dma_ctrl_flags' for PQ-specific operations is better than introducing another group of flags and enhancing device_prep_dma_pq()/device_prep_dma_pqzero_sum() with one more parameter. If an unchanged 'enum dma_ctrl_flags' will be a criterion of acceptance, please let me know - I'll re-implement this exactly as Dan suggested.

Fearing to look like a spammer, I post only those patches which have been affected by the changes intended to address Dan's comments. These are the following five:

0002-async_tx-add-support-for-asynchronous-GF-multiplica.patch
0003-async_tx-add-support-for-asynchronous-RAID6-recover.patch
0004-md-run-RAID-6-stripe-operations-outside-the-lock.patch
0008-md-asynchronous-handle_parity_check6.patch
0011-ppc440spe-adma-ADMA-driver-for-the-PPC440SP-e-syst.patch

As for the other six patches of the asynchronous RAID-6 support patch-set:

0001-async_tx-don-t-use-src_list-argument-of-async_xor.patch
0005-md-common-schedule_reconstruction-for-raid5-6.patch
0006-md-change-handle_stripe_fill6-to-work-in-asynchrono.patch
0007-md-rewrite-handle_stripe_dirtying6-in-asynchronous.patch
0009-md-change-handle_stripe6-to-work-asynchronously.patch
0010-md-remove-unused-functions.patch

they are the same as in my 09 Dec 2008 post: https://kerneltrap.org/mailarchive/linux-raid/2008/12/8/4367574 and are waiting for your comments.

Regards, Yuri
[PATCH 02/11][v3] async_tx: add support for asynchronous GF multiplication
This adds support for doing asynchronous GF multiplication by adding four additional functions to the async_tx API:

async_pq() does a simultaneous XOR of sources and an XOR of sources GF-multiplied by given coefficients.
async_pq_zero_sum() checks if the results of the calculations match the given ones.
async_gen_syndrome() does a simultaneous XOR and R/S syndrome of sources.
async_syndrome_zerosum() checks if the results of the XOR/syndrome calculation match the given ones.

The latter two functions just use async_pq() with the appropriate coefficients in the asynchronous case, but have significant optimizations in the synchronous case. To support this API a dmaengine driver should set the DMA_PQ and DMA_PQ_ZERO_SUM capabilities and provide the device_prep_dma_pq and device_prep_dma_pqzero_sum methods in the dma_device structure.

Signed-off-by: Yuri Tikhonov y...@emcraft.com
Signed-off-by: Ilya Yanok ya...@emcraft.com
---
 crypto/async_tx/Kconfig     |   4 +
 crypto/async_tx/Makefile    |   1 +
 crypto/async_tx/async_pq.c  | 615 +++
 crypto/async_tx/async_xor.c |   2 +-
 include/linux/async_tx.h    |  46 +++-
 include/linux/dmaengine.h   |  30 ++-
 6 files changed, 693 insertions(+), 5 deletions(-)
 create mode 100644 crypto/async_tx/async_pq.c

diff --git a/crypto/async_tx/Kconfig b/crypto/async_tx/Kconfig
index d8fb391..cb6d731 100644
--- a/crypto/async_tx/Kconfig
+++ b/crypto/async_tx/Kconfig
@@ -14,3 +14,7 @@ config ASYNC_MEMSET
 	tristate
 	select ASYNC_CORE
+config ASYNC_PQ
+	tristate
+	select ASYNC_CORE
+

diff --git a/crypto/async_tx/Makefile b/crypto/async_tx/Makefile
index 27baa7d..1b99265 100644
--- a/crypto/async_tx/Makefile
+++ b/crypto/async_tx/Makefile
@@ -2,3 +2,4 @@ obj-$(CONFIG_ASYNC_CORE) += async_tx.o
 obj-$(CONFIG_ASYNC_MEMCPY) += async_memcpy.o
 obj-$(CONFIG_ASYNC_MEMSET) += async_memset.o
 obj-$(CONFIG_ASYNC_XOR) += async_xor.o
+obj-$(CONFIG_ASYNC_PQ) += async_pq.o

diff --git a/crypto/async_tx/async_pq.c b/crypto/async_tx/async_pq.c
new file mode 100644
index 000..5871651
--- /dev/null
+++ b/crypto/async_tx/async_pq.c
@@ -0,0 +1,615 @@
+/*
+ *
Copyright(c) 2007 Yuri Tikhonov y...@emcraft.com
+ *
+ * Developed for DENX Software Engineering GmbH
+ *
+ * Asynchronous GF-XOR calculations ASYNC_TX API.
+ *
+ * based on async_xor.c code written by:
+ *	Dan Williams dan.j.willi...@intel.com
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ */
+#include <linux/kernel.h>
+#include <linux/interrupt.h>
+#include <linux/dma-mapping.h>
+#include <linux/raid/xor.h>
+#include <linux/async_tx.h>
+
+#include "../drivers/md/raid6.h"
+
+/**
+ * The following static variables are used in cases of synchronous
+ * zero sum to save the values to check.
Two pages are used for zero sum and + * the third one is for a dumb P destination when calling gen_syndrome() + */ +static spinlock_t spare_lock; +static struct page *spare_pages[3]; + +/** + * do_async_pq - asynchronously calculate P and/or Q + */ +static struct dma_async_tx_descriptor * +do_async_pq(struct dma_chan *chan, struct page **blocks, unsigned char *scfs, + unsigned int offset, int src_cnt, size_t len, enum async_tx_flags flags, + struct dma_async_tx_descriptor *depend_tx, + dma_async_tx_callback cb_fn, void *cb_param) +{ + struct dma_device *dma = chan->device; + dma_addr_t dma_dest[2], dma_src[src_cnt]; + struct dma_async_tx_descriptor *tx = NULL; + dma_async_tx_callback _cb_fn; + void *_cb_param; + unsigned char *scf = NULL; + int i, src_off = 0; + unsigned short pq_src_cnt; + enum async_tx_flags async_flags; + enum dma_ctrl_flags dma_flags = 0; + + /* If we won't handle src_cnt in one shot, then the following +* flag(s) will be set only on the first pass of prep_dma +*/ + if (flags & ASYNC_TX_PQ_ZERO_P) + dma_flags |= DMA_PREP_ZERO_P; + if (flags & ASYNC_TX_PQ_ZERO_Q) + dma_flags |= DMA_PREP_ZERO_Q; + + /* DMAs use destinations as sources, so use BIDIRECTIONAL mapping */ + if (blocks[src_cnt]) { + dma_dest[0] = dma_map_page(dma->dev, blocks[src_cnt
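For reference, the P+Q computation that async_pq() offloads to hardware can be modeled synchronously in portable C over GF(2^8) with the RAID-6 polynomial 0x11d. This is an illustrative sketch, not the kernel code; sw_pq() and its parameters are hypothetical names that mirror the DMA_PREP_ZERO_P/Q semantics discussed in the thread above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply in GF(2^8) with the RAID-6 polynomial x^8+x^4+x^3+x^2+1 (0x11d). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;

    while (b) {
        if (b & 1)
            r ^= a;
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
    }
    return r;
}

/* Software model of a P/Q pass: XOR of sources into P, XOR of
 * coefficient-scaled sources into Q.  'zero_p'/'zero_q' mirror the
 * DMA_PREP_ZERO_P/Q flags: when clear, the old destination contents
 * are kept and the new sources are accumulated on top of them. */
static void sw_pq(uint8_t **src, const uint8_t *scf, int src_cnt,
                  size_t len, uint8_t *p, uint8_t *q,
                  int zero_p, int zero_q)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t pv = zero_p ? 0 : p[i];
        uint8_t qv = zero_q ? 0 : q[i];

        for (int s = 0; s < src_cnt; s++) {
            pv ^= src[s][i];
            qv ^= gf_mul(scf[s], src[s][i]);
        }
        p[i] = pv;
        q[i] = qv;
    }
}
```

With this model, splitting a five-source operation into a four-source pass (both zero flags set) followed by a one-source continuation (both flags clear) produces the same P and Q as a single pass, which is the split property debated above.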
[PATCH 03/11][v3] async_tx: add support for asynchronous RAID6 recovery operations
This patch extends async_tx API with two operations for recovery operations on RAID6 array with two failed disks using new async_pq() operation. Patch introduces the following functions: async_r6_dd_recov() recovers after double data disk failure async_r6_dp_recov() recovers after D+P failure Signed-off-by: Yuri Tikhonov y...@emcraft.com Signed-off-by: Ilya Yanok ya...@emcraft.com --- crypto/async_tx/Kconfig |5 + crypto/async_tx/Makefile|1 + crypto/async_tx/async_r6recov.c | 286 +++ include/linux/async_tx.h| 11 ++ 4 files changed, 303 insertions(+), 0 deletions(-) create mode 100644 crypto/async_tx/async_r6recov.c diff --git a/crypto/async_tx/Kconfig b/crypto/async_tx/Kconfig index cb6d731..0b56224 100644 --- a/crypto/async_tx/Kconfig +++ b/crypto/async_tx/Kconfig @@ -18,3 +18,8 @@ config ASYNC_PQ tristate select ASYNC_CORE +config ASYNC_R6RECOV + tristate + select ASYNC_CORE + select ASYNC_PQ + diff --git a/crypto/async_tx/Makefile b/crypto/async_tx/Makefile index 1b99265..0ed8f13 100644 --- a/crypto/async_tx/Makefile +++ b/crypto/async_tx/Makefile @@ -3,3 +3,4 @@ obj-$(CONFIG_ASYNC_MEMCPY) += async_memcpy.o obj-$(CONFIG_ASYNC_MEMSET) += async_memset.o obj-$(CONFIG_ASYNC_XOR) += async_xor.o obj-$(CONFIG_ASYNC_PQ) += async_pq.o +obj-$(CONFIG_ASYNC_R6RECOV) += async_r6recov.o diff --git a/crypto/async_tx/async_r6recov.c b/crypto/async_tx/async_r6recov.c new file mode 100644 index 000..8642c14 --- /dev/null +++ b/crypto/async_tx/async_r6recov.c @@ -0,0 +1,286 @@ +/* + * Copyright(c) 2007 Yuri Tikhonov y...@emcraft.com + * + * Developed for DENX Software Engineering GmbH + * + * Asynchronous RAID-6 recovery calculations ASYNC_TX API. 
+ * + * based on async_xor.c code written by: + * Dan Williams dan.j.willi...@intel.com + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2 of the License, or (at your option) + * any later version. + * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 51 + * Franklin St - Fifth Floor, Boston, MA 02110-1301 USA. + * + */ +#include linux/kernel.h +#include linux/interrupt.h +#include linux/dma-mapping.h +#include linux/raid/xor.h +#include linux/async_tx.h + +#include ../drivers/md/raid6.h + +/** + * async_r6_dd_recov - attempt to calculate two data misses using dma engines. + * @disks: number of disks in the RAID-6 array + * @bytes: size of strip + * @faila: first failed drive index + * @failb: second failed drive index + * @ptrs: array of pointers to strips (last two must be p and q, respectively) + * @flags: ASYNC_TX_ACK, ASYNC_TX_DEP_ACK + * @depend_tx: depends on the result of this transaction. 
+ * @cb: function to call when the operation completes + * @cb_param: parameter to pass to the callback routine + */ +struct dma_async_tx_descriptor * +async_r6_dd_recov(int disks, size_t bytes, int faila, int failb, + struct page **ptrs, enum async_tx_flags flags, + struct dma_async_tx_descriptor *depend_tx, + dma_async_tx_callback cb, void *cb_param) +{ + struct dma_async_tx_descriptor *tx = NULL; + struct page *lptrs[disks]; + unsigned char lcoef[disks-4]; + int i = 0, k = 0, fc = -1; + uint8_t bc[2]; + dma_async_tx_callback lcb = NULL; + void *lcb_param = NULL; + + /* Assume that failb faila */ + if (faila failb) { + fc = faila; + faila = failb; + failb = fc; + } + + /* Try to compute missed data asynchronously. */ + if (disks == 4) { + /* +* Pxy and Qxy are zero in this case so we already have +* P+Pxy and Q+Qxy in P and Q strips respectively. +*/ + tx = depend_tx; + lcb = cb; + lcb_param = cb_param; + goto do_mult; + } + + /* +* (1) Calculate Qxy and Pxy: +* Qxy = A(0)*D(0) + ... + A(n-1)*D(n-1) + A(n+1)*D(n+1) + ... + +* A(m-1)*D(m-1) + A(m+1)*D(m+1) + ... + A(disks-1)*D(disks-1), +* where n = faila, m = failb. +*/ + for (i = 0, k = 0; i disks - 2; i++) { + if (i != faila i != failb) { + lptrs[k] = ptrs[i]; + lcoef[k] = raid6_gfexp[i]; + k++; + } + } + + lptrs[k] = ptrs[faila]; + lptrs[k+1] = ptrs[failb
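The double-data-disk recovery that async_r6_dd_recov() performs can be written out in plain C. Given the survivors plus P and Q, the missing strips Dx and Dy satisfy Dx ^ Dy = P ^ Pxy and g^x*Dx ^ g^y*Dy = Q ^ Qxy, which solves to Dy = (g^x*(P ^ Pxy) ^ (Q ^ Qxy)) / (g^x ^ g^y) and Dx = (P ^ Pxy) ^ Dy. The sketch below is a software model under those definitions with illustrative names, not the driver code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply in GF(2^8) with the RAID-6 polynomial 0x11d. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;

    while (b) {
        if (b & 1)
            r ^= a;
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
    }
    return r;
}

static uint8_t gf_exp2(unsigned int e)      /* g^e with generator g = {02} */
{
    uint8_t r = 1;

    while (e--)
        r = gf_mul(r, 2);
    return r;
}

static uint8_t gf_inv(uint8_t a)            /* brute-force inverse, a != 0 */
{
    for (unsigned int c = 1; c < 256; c++)
        if (gf_mul(a, (uint8_t)c) == 1)
            return (uint8_t)c;
    return 0;
}

/* Recover data disks x and y from the survivors plus P and Q.
 * disk[i] for i < n are data strips, disk[n] is P, disk[n+1] is Q. */
static void r6_dd_recov(uint8_t **disk, int n, size_t len, int x, int y)
{
    uint8_t gx = gf_exp2(x);
    uint8_t c  = gf_inv(gx ^ gf_exp2(y));   /* 1 / (g^x + g^y) */

    for (size_t i = 0; i < len; i++) {
        uint8_t pd = disk[n][i];            /* accumulates P + Pxy */
        uint8_t qd = disk[n + 1][i];        /* accumulates Q + Qxy */

        for (int d = 0; d < n; d++) {
            if (d == x || d == y)
                continue;
            pd ^= disk[d][i];
            qd ^= gf_mul(gf_exp2(d), disk[d][i]);
        }
        disk[y][i] = gf_mul(c, gf_mul(gx, pd) ^ qd);
        disk[x][i] = pd ^ disk[y][i];
    }
}
```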
[PATCH 04/11][v3] md: run RAID-6 stripe operations outside the lock
The raid_run_ops routine uses the asynchronous offload api and the stripe_operations member of a stripe_head to carry out xor+pqxor+copy operations asynchronously, outside the lock. The operations performed by RAID-6 are the same as in the RAID-5 case except for no support of STRIPE_OP_PREXOR operations. All the others are supported: STRIPE_OP_BIOFILL - copy data into request buffers to satisfy a read request STRIPE_OP_COMPUTE_BLK - generate missing blocks (1 or 2) in the cache from the other blocks STRIPE_OP_BIODRAIN - copy data out of request buffers to satisfy a write request STRIPE_OP_POSTXOR - recalculate parity for new data that has entered the cache STRIPE_OP_CHECK - verify that the parity is correct The flow is the same as in the RAID-5 case. Signed-off-by: Yuri Tikhonov y...@emcraft.com Signed-off-by: Ilya Yanok ya...@emcraft.com --- drivers/md/Kconfig |2 + drivers/md/raid5.c | 291 +++ include/linux/raid/raid5.h |4 +- 3 files changed, 269 insertions(+), 28 deletions(-) diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig index 2281b50..6c9964f 100644 --- a/drivers/md/Kconfig +++ b/drivers/md/Kconfig @@ -123,6 +123,8 @@ config MD_RAID456 depends on BLK_DEV_MD select ASYNC_MEMCPY select ASYNC_XOR + select ASYNC_PQ + select ASYNC_R6RECOV ---help--- A RAID-5 set of N drives with a capacity of C MB per drive provides the capacity of C * (N - 1) MB, and protects against a failure diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index a5ba080..8110f31 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -584,18 +584,26 @@ static void ops_run_biofill(struct stripe_head *sh) ops_complete_biofill, sh); } -static void ops_complete_compute5(void *stripe_head_ref) +static void ops_complete_compute(void *stripe_head_ref) { struct stripe_head *sh = stripe_head_ref; - int target = sh-ops.target; - struct r5dev *tgt = sh-dev[target]; + int target, i; + struct r5dev *tgt; pr_debug(%s: stripe %llu\n, __func__, (unsigned long long)sh-sector); - set_bit(R5_UPTODATE, 
tgt-flags); - BUG_ON(!test_bit(R5_Wantcompute, tgt-flags)); - clear_bit(R5_Wantcompute, tgt-flags); + /* mark the computed target(s) as uptodate */ + for (i = 0; i 2; i++) { + target = (!i) ? sh-ops.target : sh-ops.target2; + if (target 0) + continue; + tgt = sh-dev[target]; + set_bit(R5_UPTODATE, tgt-flags); + BUG_ON(!test_bit(R5_Wantcompute, tgt-flags)); + clear_bit(R5_Wantcompute, tgt-flags); + } + clear_bit(STRIPE_COMPUTE_RUN, sh-state); if (sh-check_state == check_state_compute_run) sh-check_state = check_state_compute_result; @@ -627,15 +635,155 @@ static struct dma_async_tx_descriptor *ops_run_compute5(struct stripe_head *sh) if (unlikely(count == 1)) tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE, - 0, NULL, ops_complete_compute5, sh); + 0, NULL, ops_complete_compute, sh); else tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, ASYNC_TX_XOR_ZERO_DST, NULL, - ops_complete_compute5, sh); + ops_complete_compute, sh); + + return tx; +} + +static struct dma_async_tx_descriptor * +ops_run_compute6_1(struct stripe_head *sh) +{ + /* kernel stack size limits the total number of disks */ + int disks = sh-disks; + struct page *srcs[disks]; + int target = sh-ops.target 0 ? 
sh-ops.target2 : sh-ops.target; + struct r5dev *tgt = sh-dev[target]; + struct page *dest = sh-dev[target].page; + int count = 0; + int pd_idx = sh-pd_idx, qd_idx = raid6_next_disk(pd_idx, disks); + int d0_idx = raid6_next_disk(qd_idx, disks); + struct dma_async_tx_descriptor *tx; + int i; + + pr_debug(%s: stripe %llu block: %d\n, + __func__, (unsigned long long)sh-sector, target); + BUG_ON(!test_bit(R5_Wantcompute, tgt-flags)); + + atomic_inc(sh-count); + + if (target == qd_idx) { + /* We are actually computing the Q drive*/ + i = d0_idx; + do { + srcs[count++] = sh-dev[i].page; + i = raid6_next_disk(i, disks); + } while (i != pd_idx); + srcs[count] = NULL; + srcs[count+1] = dest; + tx = async_gen_syndrome(srcs, 0, count, STRIPE_SIZE, + 0, NULL, ops_complete_compute, sh); + } else { + /* Compute any data- or p-drive using XOR */ + for (i = disks; i-- ; ) { + if (i != target i
[PATCH 08/11][v2] md: asynchronous handle_parity_check6
This patch introduces the state machine for handling the RAID-6 parities check and repair functionality. Signed-off-by: Yuri Tikhonov y...@emcraft.com Signed-off-by: Ilya Yanok ya...@emcraft.com --- drivers/md/raid5.c | 164 +++- 1 files changed, 111 insertions(+), 53 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 2e26e84..b8e37c8 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -2620,91 +2620,149 @@ static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh, struct r6_state *r6s, struct page *tmp_page, int disks) { - int update_p = 0, update_q = 0; - struct r5dev *dev; + int i; + struct r5dev *devs[2] = {NULL, NULL}; int pd_idx = sh-pd_idx; int qd_idx = r6s-qd_idx; set_bit(STRIPE_HANDLE, sh-state); BUG_ON(s-failed 2); - BUG_ON(s-uptodate disks); + /* Want to check and possibly repair P and Q. * However there could be one 'failed' device, in which * case we can only check one of them, possibly using the * other to generate missing data */ - /* If !tmp_page, we cannot do the calculations, -* but as we have set STRIPE_HANDLE, we will soon be called -* by stripe_handle with a tmp_page - just wait until then. -*/ - if (tmp_page) { + switch (sh-check_state) { + case check_state_idle: + /* start a new check operation if there are 2 failures */ if (s-failed == r6s-q_failed) { /* The only possible failed device holds 'Q', so it * makes sense to check P (If anything else were failed, * we would have used P to recreate it). 
*/ - compute_block_1(sh, pd_idx, 1); - if (!page_is_zero(sh-dev[pd_idx].page)) { - compute_block_1(sh, pd_idx, 0); - update_p = 1; - } + sh-check_state = check_state_run; + set_bit(STRIPE_OP_CHECK_PP, s-ops_request); + clear_bit(R5_UPTODATE, sh-dev[pd_idx].flags); + s-uptodate--; } if (!r6s-q_failed s-failed 2) { /* q is not failed, and we didn't use it to generate * anything, so it makes sense to check it */ - memcpy(page_address(tmp_page), - page_address(sh-dev[qd_idx].page), - STRIPE_SIZE); - compute_parity6(sh, UPDATE_PARITY); - if (memcmp(page_address(tmp_page), - page_address(sh-dev[qd_idx].page), - STRIPE_SIZE) != 0) { - clear_bit(STRIPE_INSYNC, sh-state); - update_q = 1; - } + sh-check_state = check_state_run; + set_bit(STRIPE_OP_CHECK_QP, s-ops_request); + clear_bit(R5_UPTODATE, sh-dev[qd_idx].flags); + s-uptodate--; } - if (update_p || update_q) { - conf-mddev-resync_mismatches += STRIPE_SECTORS; - if (test_bit(MD_RECOVERY_CHECK, conf-mddev-recovery)) - /* don't try to repair!! */ - update_p = update_q = 0; + if (sh-check_state == check_state_run) { + break; } - /* now write out any block on a failed drive, -* or P or Q if they need it -*/ + /* we have 2-disk failure */ + BUG_ON(s-failed != 2); + devs[0] = sh-dev[r6s-failed_num[0]]; + devs[1] = sh-dev[r6s-failed_num[1]]; + /* fall through */ + case check_state_compute_result: + sh-check_state = check_state_idle; - if (s-failed == 2) { - dev = sh-dev[r6s-failed_num[1]]; - s-locked++; - set_bit(R5_LOCKED, dev-flags); - set_bit(R5_Wantwrite, dev-flags); + BUG_ON((devs[0] !devs[1]) || + (!devs[0] devs[1])); + + BUG_ON(s-uptodate (disks - 1)); + + if (!devs[0]) { + if (s-failed = 1) + devs[0] = sh-dev[r6s-failed_num[0]]; + else + devs[0] = sh-dev[pd_idx]; } - if (s-failed = 1
[PATCH][v3] powerpc 44x: support for 256KB PAGE_SIZE
This patch adds support for 256KB pages on ppc44x-based boards. To simplify the implementation with 256KB pages we still assume 2-level paging. As a side effect this wastes extra memory space reserved for PTE tables: only 1/4 of the pages allocated for PTEs are actually used. But this may be an acceptable trade-off for the high performance we get with big PAGE_SIZEs in some applications (e.g. RAID). Also, with 256KB PAGE_SIZE we increase THREAD_SIZE up to 32KB to minimize the risk of stack overflows caused by on-stack arrays whose size depends on the page size (e.g. multipage BIOs, NTFS, etc.). With 256KB PAGE_SIZE we need to decrease PKMAP_ORDER at least down to 9, otherwise all of high memory (2^10 * PAGE_SIZE == 256MB) would be occupied by PKMAP addresses, leaving no place for vmalloc. We do not separate PKMAP_ORDER for 256K from the 16K/64K PAGE_SIZE cases here; the value of 10 in the 16K/64K support had been selected rather intuitively anyway. Thus now, for all PAGE_SIZE cases on ppc44x (including the default 4KB one), we have 512 pages for PKMAP. Because the ELF standard supports only page sizes up to 64K, you should use binutils later than 2.17.50.0.3 with '-zmax-page-size' set to 256K when building applications that are to be run with the 256KB-page-sized kernel.
If using the older binutils, then you should patch them like follows: --- binutils/bfd/elf32-ppc.c.orig +++ binutils/bfd/elf32-ppc.c -#define ELF_MAXPAGESIZE0x1 +#define ELF_MAXPAGESIZE0x4 Signed-off-by: Yuri Tikhonov y...@emcraft.com Signed-off-by: Ilya Yanok ya...@emcraft.com --- arch/powerpc/Kconfig | 15 +++ arch/powerpc/include/asm/highmem.h | 10 +- arch/powerpc/include/asm/mmu-44x.h |2 ++ arch/powerpc/include/asm/page.h|6 -- arch/powerpc/include/asm/page_32.h |4 arch/powerpc/include/asm/thread_info.h |4 +++- arch/powerpc/kernel/head_booke.h | 11 ++- arch/powerpc/platforms/44x/Kconfig | 12 8 files changed, 55 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 84b8613..ceb402c 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -443,6 +443,19 @@ config PPC_64K_PAGES bool 64k page size if 44x || PPC_STD_MMU_64 select PPC_HAS_HASH_64K if PPC_STD_MMU_64 +config PPC_256K_PAGES + bool 256k page size if 44x + depends on !STDBINUTILS + help + Make the page size 256k. + + As the ELF standard only requires alignment to support page + sizes up to 64k, you will need to compile all of your user + space applications with a non-standard binutils settings + (see the STDBINUTILS description for details). + + Say N unless you know what you are doing. + endchoice config FORCE_MAX_ZONEORDER @@ -455,6 +468,8 @@ config FORCE_MAX_ZONEORDER default 9 if PPC_STD_MMU_32 PPC_16K_PAGES range 7 64 if PPC_STD_MMU_32 PPC_64K_PAGES default 7 if PPC_STD_MMU_32 PPC_64K_PAGES + range 5 64 if PPC_STD_MMU_32 PPC_256K_PAGES + default 5 if PPC_STD_MMU_32 PPC_256K_PAGES range 11 64 default 11 help diff --git a/arch/powerpc/include/asm/highmem.h b/arch/powerpc/include/asm/highmem.h index 04e4a62..a290759 100644 --- a/arch/powerpc/include/asm/highmem.h +++ b/arch/powerpc/include/asm/highmem.h @@ -39,15 +39,15 @@ extern pte_t *pkmap_page_table; * chunk of RAM. */ /* - * We use one full pte table with 4K pages. 
And with 16K/64K pages pte - * table covers enough memory (32MB and 512MB resp.) that both FIXMAP - * and PKMAP can be placed in single pte table. We use 1024 pages for - * PKMAP in case of 16K/64K pages. + * We use one full pte table with 4K pages. And with 16K/64K/256K pages pte + * table covers enough memory (32MB/512MB/2GB resp.), so that both FIXMAP + * and PKMAP can be placed in a single pte table. We use 512 pages for PKMAP + * in case of 16K/64K/256K page sizes. */ #ifdef CONFIG_PPC_4K_PAGES #define PKMAP_ORDERPTE_SHIFT #else -#define PKMAP_ORDER10 +#define PKMAP_ORDER9 #endif #define LAST_PKMAP (1 PKMAP_ORDER) #ifndef CONFIG_PPC_4K_PAGES diff --git a/arch/powerpc/include/asm/mmu-44x.h b/arch/powerpc/include/asm/mmu-44x.h index 27cc6fd..3c86576 100644 --- a/arch/powerpc/include/asm/mmu-44x.h +++ b/arch/powerpc/include/asm/mmu-44x.h @@ -83,6 +83,8 @@ typedef struct { #define PPC44x_TLBE_SIZE PPC44x_TLB_16K #elif (PAGE_SHIFT == 16) #define PPC44x_TLBE_SIZE PPC44x_TLB_64K +#elif (PAGE_SHIFT == 18) +#define PPC44x_TLBE_SIZE PPC44x_TLB_256K #else #error Unsupported PAGE_SIZE #endif diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h index 197d569..32cbf16 100644 --- a/arch/powerpc/include/asm/page.h +++ b/arch/powerpc/include/asm/page.h @@ -19,12 +19,14 @@ #include asm/kdump.h
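The address-space arithmetic behind the PKMAP_ORDER change above is easy to sanity-check: the pkmap window spans LAST_PKMAP (= 1 << PKMAP_ORDER) pages of PAGE_SIZE bytes each. A quick model (the function name is illustrative, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Address space consumed by the kernel's pkmap window:
 * (1 << pkmap_order) pages of (1 << page_shift) bytes each. */
static uint64_t pkmap_bytes(unsigned int page_shift, unsigned int pkmap_order)
{
    return (uint64_t)1 << (page_shift + pkmap_order);
}
```

With 256K pages (PAGE_SHIFT 18) the old order of 10 would consume 256MB of the kernel's high-memory window, while order 9 keeps it at 128MB; with 64K and 16K pages order 9 costs only 32MB and 8MB respectively.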
Re[2]: [PATCH] powerpc 44x: support for 256KB PAGE_SIZE
Hello Milton, Thanks for reviewing. I'll re-post the updated patch shortly. On Sunday, December 21, 2008 you wrote: On Dec 19, 2008, at 12:39 AM, Yuri Tikhonov wrote: This patch adds support for 256KB pages on ppc44x-based boards. Hi. A couple of small comments. diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index cd8ff7c..348702c 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -436,6 +436,14 @@ config PPC_64K_PAGES bool 64k page size if 44x || PPC_STD_MMU_64 select PPC_HAS_HASH_64K if PPC_STD_MMU_64 +config PPC_256K_PAGES + bool 256k page size if 44x + depends on !STDBINUTILS + help + ELF standard supports only page sizes up to 64K so you need a patched + binutils in order to use 256K pages. Chose it only if you know what + you are doing. + Make the page size 256k. As the ELF standard only requires alignment to support page sizes up to 64k, you will need to compile all of your user space applications with a patched binutils. Say N unless you know what you are doing. (The previous text did not describe what the option actually did, and did not emphasize that all of user space had to be compiled). diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h index 9665a26..3c8bbab 100644 --- a/arch/powerpc/include/asm/thread_info.h +++ b/arch/powerpc/include/asm/thread_info.h @@ -15,8 +15,12 @@ #ifdef CONFIG_PPC64 #define THREAD_SHIFT 14 #else +#ifdef CONFIG_PPC_256K_PAGES +#define THREAD_SHIFT 15 +#else #define THREAD_SHIFT 13 #endif +#endif Switching to #elif would remove the nested ifdef here. #define THREAD_SIZE (1 THREAD_SHIFT) diff --git a/arch/powerpc/kernel/head_booke.h b/arch/powerpc/kernel/head_booke.h index fce2df9..bc6a26c 100644 --- a/arch/powerpc/kernel/head_booke.h +++ b/arch/powerpc/kernel/head_booke.h @@ -10,6 +10,14 @@ mtspr SPRN_IVOR##vector_number,r26; \ sync +#ifndef CONFIG_PPC_256K_PAGES This if should be on THREAD_SIZE or THREAD_SHIFT, not the option that causes it to be large. 
+#define ALLOC_STACK_FRAME(reg, val) addi reg,reg,val +#else +#define ALLOC_STACK_FRAME(reg, val) \ + addis reg,reg,v...@ha; \ + addireg,reg,v...@l +#endif + #define NORMAL_EXCEPTION_PROLOG \ mtspr SPRN_SPRG0,r10; /* save two registers to work with */\ mtspr SPRN_SPRG1,r11; \ @@ -20,7 +28,7 @@ beq 1f; \ mfspr r1,SPRN_SPRG3; /* if from user, start at top of */\ lwz r1,THREAD_INFO-THREAD(r1); /* this thread's kernel stack */\ - addir1,r1,THREAD_SIZE; \ + ALLOC_STACK_FRAME(r1, THREAD_SIZE); \ 1: subir1,r1,INT_FRAME_SIZE; /* Allocate an exception frame */\ mr r11,r1; \ stw r10,_CCR(r11); /* save various registers */\ diff --git a/init/Kconfig b/init/Kconfig index f763762..96229b9 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -635,6 +635,16 @@ config ELF_CORE help Enable support for generating core dumps. Disabling saves about 4k. +config STDBINUTILS + bool Using standard binutils if EMBEDDED + depends on 44x + default y + help + Turning this option off allows you to select 256KB PAGE_SIZE on 44x. + Note, that kernel will be able to run only those applications, + which had been compiled using the patched binutils (ELF standard + supports only page sizes up to 64K). + The config variable 44x is pretty nondescript for a global config file. If you leave it here, I would suggest making it depend on PPC 44x so that other people don't have to wonder what 44x means. I didn't look if there was a better place to put this option, somewhere architecture specific might be better (eg after selecting the 44x cpu type). config PCSPKR_PLATFORM bool Enable PC-Speaker support if EMBEDDED depends on ALPHA || X86 || MIPS || PPC_PREP || PPC_CHRP || PPC_PSERIES thanks, milton Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
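The ALLOC_STACK_FRAME macro in the patch above exists because addi takes a 16-bit signed immediate: with 256K pages THREAD_SIZE becomes 32KB (0x8000), which does not fit, so an addis/addi pair with @ha/@l relocations is needed. The @ha/@l split can be modeled in C (illustrative names, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* @l: the low 16 bits, sign-extended by addi. */
static int32_t ppc_lo(int32_t val)
{
    return (int16_t)(val & 0xffff);
}

/* @ha: the high 16 bits, adjusted up by one when the low half is
 * negative, so that (ha << 16) + lo == val. */
static int32_t ppc_ha(int32_t val)
{
    return (val + 0x8000) >> 16;
}

/* What the "addis reg,reg,val@ha ; addi reg,reg,val@l" pair computes. */
static int32_t addis_addi(int32_t reg, int32_t val)
{
    reg += ppc_ha(val) << 16;   /* addis adds the immediate shifted left 16 */
    reg += ppc_lo(val);         /* addi adds the sign-extended immediate */
    return reg;
}
```

For val = 32768 the low half sign-extends to -32768, so @ha becomes 1 and the pair adds 65536 - 32768 = 32768, which is why the two-instruction form works where a single addi cannot.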
[PATCH][v2] powerpc 44x: support for 256KB PAGE_SIZE
This patch adds support for 256KB pages on ppc44x-based boards. To simplify the implementation with 256KB pages we still assume 2-level paging. As a side effect this wastes extra memory space reserved for PTE tables: only 1/4 of the pages allocated for PTEs are actually used. But this may be an acceptable trade-off for the high performance we get with big PAGE_SIZEs in some applications (e.g. RAID). Also, with 256KB PAGE_SIZE we increase THREAD_SIZE up to 32KB to minimize the risk of stack overflows caused by on-stack arrays whose size depends on the page size (e.g. multipage BIOs, NTFS, etc.). With 256KB PAGE_SIZE we need to decrease PKMAP_ORDER at least down to 9, otherwise all of high memory (2^10 * PAGE_SIZE == 256MB) would be occupied by PKMAP addresses, leaving no place for vmalloc. We do not separate PKMAP_ORDER for 256K from the 16K/64K PAGE_SIZE cases here; the value of 10 in the 16K/64K support had been selected rather intuitively anyway. Thus now, for all PAGE_SIZE cases on ppc44x (including the default 4KB one), we have 512 pages for PKMAP. Because the ELF standard supports only page sizes up to 64K, you should use a patched binutils for building applications to be run with the 256KB-page-sized kernel.
The patch for binutils is rather trivial, and may look as follows: --- binutils/bfd/elf32-ppc.c.orig +++ binutils/bfd/elf32-ppc.c -#define ELF_MAXPAGESIZE0x1 +#define ELF_MAXPAGESIZE0x4 Signed-off-by: Yuri Tikhonov y...@emcraft.com Signed-off-by: Ilya Yanok ya...@emcraft.com --- arch/powerpc/Kconfig | 14 ++ arch/powerpc/include/asm/highmem.h | 10 +- arch/powerpc/include/asm/mmu-44x.h |2 ++ arch/powerpc/include/asm/page.h|6 -- arch/powerpc/include/asm/page_32.h |4 arch/powerpc/include/asm/thread_info.h |4 +++- arch/powerpc/kernel/head_booke.h | 11 ++- arch/powerpc/platforms/44x/Kconfig | 10 ++ 8 files changed, 52 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index cd8ff7c..3338e71 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -436,6 +436,18 @@ config PPC_64K_PAGES bool 64k page size if 44x || PPC_STD_MMU_64 select PPC_HAS_HASH_64K if PPC_STD_MMU_64 +config PPC_256K_PAGES + bool 256k page size if 44x + depends on !STDBINUTILS + help + Make the page size 256k. + + As the ELF standard only requires alignment to support page + sizes up to 64k, you will need to compile all of your user + space applications with a patched binutils. + + Say N unless you know what you are doing. + endchoice config FORCE_MAX_ZONEORDER @@ -448,6 +460,8 @@ config FORCE_MAX_ZONEORDER default 9 if PPC_STD_MMU_32 PPC_16K_PAGES range 7 64 if PPC_STD_MMU_32 PPC_64K_PAGES default 7 if PPC_STD_MMU_32 PPC_64K_PAGES + range 5 64 if PPC_STD_MMU_32 PPC_256K_PAGES + default 5 if PPC_STD_MMU_32 PPC_256K_PAGES range 11 64 default 11 help diff --git a/arch/powerpc/include/asm/highmem.h b/arch/powerpc/include/asm/highmem.h index 7d6bb37..0d1333f 100644 --- a/arch/powerpc/include/asm/highmem.h +++ b/arch/powerpc/include/asm/highmem.h @@ -39,15 +39,15 @@ extern pte_t *pkmap_page_table; * chunk of RAM. */ /* - * We use one full pte table with 4K pages. And with 16K/64K pages pte - * table covers enough memory (32MB and 512MB resp.) 
that both FIXMAP - * and PKMAP can be placed in single pte table. We use 1024 pages for - * PKMAP in case of 16K/64K pages. + * We use one full pte table with 4K pages. And with 16K/64K/256K pages pte + * table covers enough memory (32MB/512MB/2GB resp.), so that both FIXMAP + * and PKMAP can be placed in a single pte table. We use 512 pages for PKMAP + * in case of 16K/64K/256K page sizes. */ #ifdef CONFIG_PPC_4K_PAGES #define PKMAP_ORDERPTE_SHIFT #else -#define PKMAP_ORDER10 +#define PKMAP_ORDER9 #endif #define LAST_PKMAP (1 PKMAP_ORDER) #ifndef CONFIG_PPC_4K_PAGES diff --git a/arch/powerpc/include/asm/mmu-44x.h b/arch/powerpc/include/asm/mmu-44x.h index 73e1909..52a2339 100644 --- a/arch/powerpc/include/asm/mmu-44x.h +++ b/arch/powerpc/include/asm/mmu-44x.h @@ -81,6 +81,8 @@ typedef struct { #define PPC44x_TLBE_SIZE PPC44x_TLB_16K #elif (PAGE_SHIFT == 16) #define PPC44x_TLBE_SIZE PPC44x_TLB_64K +#elif (PAGE_SHIFT == 18) +#define PPC44x_TLBE_SIZE PPC44x_TLB_256K #else #error Unsupported PAGE_SIZE #endif diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h index 197d569..32cbf16 100644 --- a/arch/powerpc/include/asm/page.h +++ b/arch/powerpc/include/asm/page.h @@ -19,12 +19,14 @@ #include asm/kdump.h /* - * On regular PPC32 page size is 4K (but we support 4K/16K/64K pages + * On regular PPC32 page size is 4K (but we support 4K/16K/64K
Re[2]: [PATCH][v2] powerpc 44x: support for 256KB PAGE_SIZE
Hello Andreas, On Sunday, December 21, 2008 you wrote: Yuri Tikhonov y...@emcraft.com writes: Because the ELF standard supports only page sizes up to 64K, you should use a patched binutils for building applications to be run with the 256KB-page-sized kernel. The patch for binutils is rather trivial, and may look as follows: --- binutils/bfd/elf32-ppc.c.orig +++ binutils/bfd/elf32-ppc.c -#define ELF_MAXPAGESIZE 0x10000 +#define ELF_MAXPAGESIZE 0x40000 You don't have to patch it, it's enough to pass -zmax-page-size=0x40000 to the linker. Thanks for pointing this out. I guess the -zmax-page-size option is new in binutils 2.17.50.0.10, right? I'll remove the STDBINUTILS config option from this patch then, and update the description correspondingly. Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com
Re[2]: [PATCH][v2] powerpc 44x: support for 256KB PAGE_SIZE
On Sunday, December 21, 2008 you wrote: Yuri Tikhonov y...@emcraft.com writes: Thanks for pointing this out. I guess the -zmax-page-size option is new in binutils 2.17.50.0.10, right? It was added 2½ years ago. Yes, approximately: http://gcc.gnu.org/ml/gcc/2006-07/msg00361.html Date: Sat, 15 Jul 2006 15:19:05 -0700 Subject: The Linux binutils 2.17.50.0.3 is released Changes from binutils 2.17.50.0.2: ... 16. Add -z max-page-size= and -z common-page-size= to ELF linker. I'll remove the STDBINUTILS config option from this patch then, and update the description correspondingly. I thought about this more, and I guess we should keep the dependency on the STDBINUTILS option newly introduced by this patch. Perhaps change its name to, say, STD_ELF_LINKING. In any case we either have to patch binutils or pass a non-default value to the linker: the resulting application doesn't match the ELF standard, and applications which do match the standard simply won't work with a 256K-paged kernel - so a user has to be aware of this before turning 256K pages on. What do you think? Regards, Yuri
Re[2]: [PATCH][v2] fork_init: fix division by zero
Hello Andrew, On Friday, December 19, 2008 you wrote: [snip] There is one more warning from the common code when I use 256KB pages: CC mm/shmem.o mm/shmem.c: In function 'shmem_truncate_range': mm/shmem.c:613: warning: division by zero mm/shmem.c:619: warning: division by zero mm/shmem.c:644: warning: division by zero mm/shmem.c: In function 'shmem_unuse_inode': mm/shmem.c:873: warning: division by zero The problem here is that ENTRIES_PER_PAGEPAGE becomes 0x100000000 when PAGE_SIZE is 256K. How about the following fix? [snip] Looks sane. Thanks for reviewing. But to apply this I'd prefer a changelog, a signoff and a grunt from Hugh. Sure, I'll post this in a separate thread then, keeping Hugh in CC. Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com
[PATCH] mm/shmem.c: fix division by zero
The following patch fixes a division by zero which we have in shmem_truncate_range() and shmem_unuse_inode() when using big PAGE_SIZE values (e.g. 256KB on ppc44x). With 256KB PAGE_SIZE the ENTRIES_PER_PAGEPAGE constant becomes too large (0x100000000), so this patch just changes the types from 'ulong' to 'ullong' where necessary. Signed-off-by: Yuri Tikhonov y...@emcraft.com --- mm/shmem.c | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index 0ed0752..99d7c91 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -57,7 +57,7 @@ #include <asm/pgtable.h> #define ENTRIES_PER_PAGE (PAGE_CACHE_SIZE/sizeof(unsigned long)) -#define ENTRIES_PER_PAGEPAGE (ENTRIES_PER_PAGE*ENTRIES_PER_PAGE) +#define ENTRIES_PER_PAGEPAGE ((unsigned long long)ENTRIES_PER_PAGE*ENTRIES_PER_PAGE) #define BLOCKS_PER_PAGE (PAGE_CACHE_SIZE/512) #define SHMEM_MAX_INDEX (SHMEM_NR_DIRECT + (ENTRIES_PER_PAGEPAGE/2) * (ENTRIES_PER_PAGE+1)) @@ -95,7 +95,7 @@ static unsigned long shmem_default_max_inodes(void) } #endif -static int shmem_getpage(struct inode *inode, unsigned long idx, +static int shmem_getpage(struct inode *inode, unsigned long long idx, struct page **pagep, enum sgp_type sgp, int *type); static inline struct page *shmem_dir_alloc(gfp_t gfp_mask) @@ -533,7 +533,7 @@ static void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end) int punch_hole; spinlock_t *needs_lock; spinlock_t *punch_lock; - unsigned long upper_limit; + unsigned long long upper_limit; inode->i_ctime = inode->i_mtime = CURRENT_TIME; idx = (start + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; @@ -1175,7 +1175,7 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo) * vm.
If we swap it in we mark it dirty since we also free the swap * entry since a page cannot live in both the swap and page cache */ -static int shmem_getpage(struct inode *inode, unsigned long idx, +static int shmem_getpage(struct inode *inode, unsigned long long idx, struct page **pagep, enum sgp_type sgp, int *type) { struct address_space *mapping = inode->i_mapping; -- 1.6.0.4
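The overflow this patch fixes is easy to reproduce: with 256K pages on a 32-bit machine, ENTRIES_PER_PAGE is 256K / sizeof(unsigned long) = 65536, and squaring it in 32-bit arithmetic wraps to zero, which is where the division-by-zero warnings come from. A minimal demonstration, with illustrative names and 32-bit 'unsigned long' modeled by 32-bit unsigned arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* ENTRIES_PER_PAGE with 256K pages and a 4-byte unsigned long. */
#define ENTRIES_PER_PAGE_256K ((256u * 1024u) / 4u)

/* The old macro's arithmetic: 32-bit unsigned multiply wraps to 0,
 * so any later division by this constant is a division by zero. */
static uint32_t pagepage_32(void)
{
    return ENTRIES_PER_PAGE_256K * ENTRIES_PER_PAGE_256K;
}

/* The patched macro's arithmetic: the cast forces a 64-bit multiply,
 * preserving the intended value 0x100000000. */
static uint64_t pagepage_64(void)
{
    return (uint64_t)ENTRIES_PER_PAGE_256K * ENTRIES_PER_PAGE_256K;
}
```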
[PATCH] powerpc 44x: support for 256KB PAGE_SIZE
This patch adds support for 256KB pages on ppc44x-based boards. To simplify the implementation with 256KB pages we still assume 2-level paging. As a side effect this wastes extra memory space reserved for PTE tables: only 1/4 of the pages allocated for PTEs are actually used. But this may be an acceptable trade-off for the high performance we get with big PAGE_SIZEs in some applications (e.g. RAID). Also, with 256KB PAGE_SIZE we increase THREAD_SIZE up to 32KB to minimize the risk of stack overflows caused by on-stack arrays whose size depends on the page size (e.g. multipage BIOs, NTFS, etc.). With 256KB PAGE_SIZE we need to decrease PKMAP_ORDER at least down to 9, otherwise all of high memory (2^10 * PAGE_SIZE == 256MB) would be occupied by PKMAP addresses, leaving no place for vmalloc. We do not separate PKMAP_ORDER for 256K from the 16K/64K PAGE_SIZE cases here; the value of 10 in the 16K/64K support had been selected rather intuitively anyway. Thus now, for all PAGE_SIZE cases on ppc44x (including the default 4KB one), we have 512 pages for PKMAP. Because the ELF standard supports only page sizes up to 64K, you should use a patched binutils for building applications to be run with the 256KB-page-sized kernel.
The patch for binutils is rather trivial, and may look as follows: --- binutils/bfd/elf32-ppc.c.orig +++ binutils/bfd/elf32-ppc.c -#define ELF_MAXPAGESIZE 0x10000 +#define ELF_MAXPAGESIZE 0x40000 Signed-off-by: Yuri Tikhonov y...@emcraft.com Signed-off-by: Ilya Yanok ya...@emcraft.com --- arch/powerpc/Kconfig | 10 ++ arch/powerpc/include/asm/highmem.h | 10 +- arch/powerpc/include/asm/mmu-44x.h | 2 ++ arch/powerpc/include/asm/page.h | 6 -- arch/powerpc/include/asm/page_32.h | 4 arch/powerpc/include/asm/thread_info.h | 4 arch/powerpc/kernel/head_booke.h | 10 +- init/Kconfig | 10 ++ 8 files changed, 48 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index cd8ff7c..348702c 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -436,6 +436,14 @@ config PPC_64K_PAGES bool "64k page size" if 44x || PPC_STD_MMU_64 select PPC_HAS_HASH_64K if PPC_STD_MMU_64 +config PPC_256K_PAGES + bool "256k page size" if 44x + depends on !STDBINUTILS + help + The ELF standard supports only page sizes up to 64K, so you need a patched + binutils in order to use 256K pages. Choose it only if you know what + you are doing. + endchoice config FORCE_MAX_ZONEORDER @@ -448,6 +456,8 @@ config FORCE_MAX_ZONEORDER default 9 if PPC_STD_MMU_32 && PPC_16K_PAGES range 7 64 if PPC_STD_MMU_32 && PPC_64K_PAGES default 7 if PPC_STD_MMU_32 && PPC_64K_PAGES + range 5 64 if PPC_STD_MMU_32 && PPC_256K_PAGES + default 5 if PPC_STD_MMU_32 && PPC_256K_PAGES range 11 64 default 11 help diff --git a/arch/powerpc/include/asm/highmem.h b/arch/powerpc/include/asm/highmem.h index 7d6bb37..0d1333f 100644 --- a/arch/powerpc/include/asm/highmem.h +++ b/arch/powerpc/include/asm/highmem.h @@ -39,15 +39,15 @@ extern pte_t *pkmap_page_table; * chunk of RAM. */ /* - * We use one full pte table with 4K pages. And with 16K/64K pages pte - * table covers enough memory (32MB and 512MB resp.) that both FIXMAP - * and PKMAP can be placed in single pte table. We use 1024 pages for - * PKMAP in case of 16K/64K pages. 
+ * We use one full pte table with 4K pages. And with 16K/64K/256K pages pte + * table covers enough memory (32MB/512MB/2GB resp.), so that both FIXMAP + * and PKMAP can be placed in a single pte table. We use 512 pages for PKMAP + * in case of 16K/64K/256K page sizes. */ #ifdef CONFIG_PPC_4K_PAGES #define PKMAP_ORDER PTE_SHIFT #else -#define PKMAP_ORDER 10 +#define PKMAP_ORDER 9 #endif #define LAST_PKMAP (1 << PKMAP_ORDER) #ifndef CONFIG_PPC_4K_PAGES diff --git a/arch/powerpc/include/asm/mmu-44x.h b/arch/powerpc/include/asm/mmu-44x.h index 73e1909..52a2339 100644 --- a/arch/powerpc/include/asm/mmu-44x.h +++ b/arch/powerpc/include/asm/mmu-44x.h @@ -81,6 +81,8 @@ typedef struct { #define PPC44x_TLBE_SIZE PPC44x_TLB_16K #elif (PAGE_SHIFT == 16) #define PPC44x_TLBE_SIZE PPC44x_TLB_64K +#elif (PAGE_SHIFT == 18) +#define PPC44x_TLBE_SIZE PPC44x_TLB_256K #else #error Unsupported PAGE_SIZE #endif diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h index 197d569..32cbf16 100644 --- a/arch/powerpc/include/asm/page.h +++ b/arch/powerpc/include/asm/page.h @@ -19,12 +19,14 @@ #include <asm/kdump.h> /* - * On regular PPC32 page size is 4K (but we support 4K/16K/64K/256K pages * on PPC44x). For PPC64 we support either 4K or 64K software * page size. When using 64K
Re[2]: [PATCH 02/11][v2] async_tx: add support for asynchronous GF multiplication
, increase the stack usage and, in general, the time of function calls by adding new parameters to ADMA methods ? In this case can we set up a dependency chain with async_memset()? Well, we can. But wouldn't this be an overhead? For example, ppc440spe DMA allows doing so-called RXOR, which overwrites, and doesn't take care of, destinations. So, we can do ZERO_DST(s)+PQ in one shot on one DMA engine. Again, I'm not sure that keeping dma_ctrl_flags unchanged is worth creating such a dependency; it'll obviously lead both to degradation of performance and to increased CPU utilization. /** @@ -299,6 +301,7 @@ struct dma_async_tx_descriptor { * @global_node: list_head for global dma_device_list * @cap_mask: one or more dma_capability flags * @max_xor: maximum number of xor sources, 0 if no capability + * @max_pq: maximum number of PQ sources, 0 if no capability * @refcount: reference count * @done: IO completion struct * @dev_id: unique device ID @@ -308,7 +311,9 @@ struct dma_async_tx_descriptor { * @device_free_chan_resources: release DMA channel's resources * @device_prep_dma_memcpy: prepares a memcpy operation * @device_prep_dma_xor: prepares a xor operation + * @device_prep_dma_pq: prepares a pq operation * @device_prep_dma_zero_sum: prepares a zero_sum operation + * @device_prep_dma_pqzero_sum: prepares a pqzero_sum operation * @device_prep_dma_memset: prepares a memset operation * @device_prep_dma_interrupt: prepares an end of chain interrupt operation * @device_prep_slave_sg: prepares a slave dma operation @@ -322,6 +327,7 @@ struct dma_device { struct list_head global_node; dma_cap_mask_t cap_mask; int max_xor; + int max_pq; max_xor and max_pq can be changed to unsigned shorts to keep the size of the struct the same. Right. 
struct kref refcount; struct completion done; @@ -339,9 +345,17 @@ struct dma_device { struct dma_async_tx_descriptor *(*device_prep_dma_xor)( struct dma_chan *chan, dma_addr_t dest, dma_addr_t *src, unsigned int src_cnt, size_t len, unsigned long flags); + struct dma_async_tx_descriptor *(*device_prep_dma_pq)( + struct dma_chan *chan, dma_addr_t *dst, dma_addr_t *src, + unsigned int src_cnt, unsigned char *scf, + size_t len, unsigned long flags); struct dma_async_tx_descriptor *(*device_prep_dma_zero_sum)( struct dma_chan *chan, dma_addr_t *src, unsigned int src_cnt, size_t len, u32 *result, unsigned long flags); + struct dma_async_tx_descriptor *(*device_prep_dma_pqzero_sum)( + struct dma_chan *chan, dma_addr_t *src, unsigned int src_cnt, + unsigned char *scf, + size_t len, u32 *presult, u32 *qresult, unsigned long flags); I would rather we turn the 'result' parameter into a pointer to flags where bit 0 is the xor/p result and bit1 is the q result. Yes, this'll be better. Thanks for reviewing. I'll re-generate ASYNC_TX patch (in the parts where I absolutely agreed with you), and then re-post. Any comments regarding RAID-6 part? Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH][v2] fork_init: fix division by zero
Hello Paul, On Friday 12 December 2008 03:48, Paul Mackerras wrote: Andrew Morton writes: +#if (8 * THREAD_SIZE) > PAGE_SIZE max_threads = mempages / (8 * THREAD_SIZE / PAGE_SIZE); +#else + max_threads = mempages * (PAGE_SIZE / (8 * THREAD_SIZE)); +#endif The expression you've chosen here can be quite inaccurate, because (PAGE_SIZE / (8 * THREAD_SIZE)) is a small number. The way to preserve accuracy is The assumption is that THREAD_SIZE is a power of 2, as is PAGE_SIZE. I think Yuri should be increasing THREAD_SIZE for the larger page sizes he's implementing, because we have on-stack arrays whose size depends on the page size (e.g. multipage BIOs, NTFS, etc.). I suspect that having THREAD_SIZE less than 1/8 of PAGE_SIZE risks stack overflows, and the better fix is for Yuri to make sure THREAD_SIZE is at least 1/8 of PAGE_SIZE. (In fact, more may be needed - someone should work out what fraction is actually needed.) Right, thanks for pointing this out. I guess I was just lucky since I didn't run into problems with stack overflows. So, I agree that we should increase THREAD_SIZE in case of 256KB pages up to 1/8 of PAGE_SIZE, that is up to 32KB. There is one more warning from the common code when I use 256KB pages: CC mm/shmem.o mm/shmem.c: In function 'shmem_truncate_range': mm/shmem.c:613: warning: division by zero mm/shmem.c:619: warning: division by zero mm/shmem.c:644: warning: division by zero mm/shmem.c: In function 'shmem_unuse_inode': mm/shmem.c:873: warning: division by zero The problem here is that ENTRIES_PER_PAGEPAGE becomes 0x1.. when PAGE_SIZE is 256K. How about the following fix ? 
diff --git a/mm/shmem.c b/mm/shmem.c index 0ed0752..99d7c91 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -57,7 +57,7 @@ #include <asm/pgtable.h> #define ENTRIES_PER_PAGE (PAGE_CACHE_SIZE/sizeof(unsigned long)) -#define ENTRIES_PER_PAGEPAGE (ENTRIES_PER_PAGE*ENTRIES_PER_PAGE) +#define ENTRIES_PER_PAGEPAGE ((unsigned long long)ENTRIES_PER_PAGE*ENTRIES_PER_PAGE) #define BLOCKS_PER_PAGE (PAGE_CACHE_SIZE/512) #define SHMEM_MAX_INDEX (SHMEM_NR_DIRECT + (ENTRIES_PER_PAGEPAGE/2) * (ENTRIES_PER_PAGE+1)) @@ -95,7 +95,7 @@ static unsigned long shmem_default_max_inodes(void) } #endif -static int shmem_getpage(struct inode *inode, unsigned long idx, +static int shmem_getpage(struct inode *inode, unsigned long long idx, struct page **pagep, enum sgp_type sgp, int *type); static inline struct page *shmem_dir_alloc(gfp_t gfp_mask) @@ -533,7 +533,7 @@ static void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end) int punch_hole; spinlock_t *needs_lock; spinlock_t *punch_lock; - unsigned long upper_limit; + unsigned long long upper_limit; inode->i_ctime = inode->i_mtime = CURRENT_TIME; idx = (start + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; @@ -1175,7 +1175,7 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo) * vm. If we swap it in we mark it dirty since we also free the swap * entry since a page cannot live in both the swap and page cache */ -static int shmem_getpage(struct inode *inode, unsigned long idx, +static int shmem_getpage(struct inode *inode, unsigned long long idx, struct page **pagep, enum sgp_type sgp, int *type) { struct address_space *mapping = inode->i_mapping; Regards, Yuri ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re[2]: [PATCH][v2] fork_init: fix division by zero
Hello Andrew, On Thursday, December 11, 2008 you wrote: [snip] The expression you've chosen here can be quite inaccurate, because (PAGE_SIZE / (8 * THREAD_SIZE)) is a small number. But why is it bad? We do multiplication on 'mempages', not division. All the numbers in the multiplier are powers of 2, so both expressions: mempages * (PAGE_SIZE / (8 * THREAD_SIZE)) and max_threads = (mempages * PAGE_SIZE) / (8 * THREAD_SIZE) are finally equal. The way to preserve accuracy is max_threads = (mempages * PAGE_SIZE) / (8 * THREAD_SIZE); so how about avoiding the nasty ifdefs and doing I'm OK with the approach below, but, while leading to the same result, it adds some overhead to code which had none before this patch: e.g. your implementation finally boils down to ~5 times more processor instructions than there were before, plus operations with the stack for the 'm' variable. On the other hand, my approach with nasty (I agree) ifdefs doesn't add overhead to the code which does not need it: i.e. the most common situation of small PAGE_SIZEs. Big PAGE_SIZE is the exception, so I believe that the more common cases should not suffer because of it. --- a/kernel/fork.c~fork_init-fix-division-by-zero +++ a/kernel/fork.c @@ -69,6 +69,7 @@ #include <asm/mmu_context.h> #include <asm/cacheflush.h> #include <asm/tlbflush.h> +#include <asm/div64.h> /* * Protected counters by write_lock_irq(tasklist_lock) @@ -185,10 +186,15 @@ void __init fork_init(unsigned long memp /* * The default maximum number of threads is set to a safe -* value: the thread structures can take up at most half -* of memory. +* value: the thread structures can take up at most +* (1/8) part of memory. */ - max_threads = mempages / (8 * THREAD_SIZE / PAGE_SIZE); + { + /* max_threads = (mempages * PAGE_SIZE) / THREAD_SIZE / 8; */ + u64 m = mempages * PAGE_SIZE; + do_div(m, THREAD_SIZE * 8); + max_threads = m; + } /* * we need to allow at least 20 threads to boot a system _ ? 
The code is also inaccurate because it assumes that whatever allocator is used for threads will pack the thread_structs into pages with the best possible density, which isn't necessarily the case. Let's not worry about that. OT: max_threads is wildly wrong anyway. - the caller passes in num_physpages, which includes highmem. And we can't allocate thread structs from highmem. - num_physpages includes kernel pages and other stuff which can never be allocated via the page allocator. A suitable fix would be to switch the caller to the strangely-named nr_free_buffer_pages(). If you grep the tree for `num_physpages', you will find a splendid number of similar bugs. num_physpages should be unexported, burnt, deleted, etc. It's just an invitation to write buggy code. Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re[2]: [PATCH] fork_init: fix division by zero
Hello Geert, On Wednesday, December 10, 2008 you wrote: On Tue, 9 Dec 2008, Yuri Tikhonov wrote: The following patch fixes divide-by-zero error for the cases of really big PAGE_SIZEs (e.g. 256KB on ppc44x). Support for such big page sizes on 44x is not present in the current kernel yet, but coming soon. Also this patch fixes the comment for the max_threads settings, as this didn't match the things actually done in the code. Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED] Signed-off-by: Ilya Yanok [EMAIL PROTECTED] --- kernel/fork.c | 8 ++++++-- 1 files changed, 6 insertions(+), 2 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index 2a372a0..b0ac2fb 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -181,10 +181,14 @@ void __init fork_init(unsigned long mempages) /* * The default maximum number of threads is set to a safe - * value: the thread structures can take up at most half - * of memory. + * value: the thread structures can take up at most + * (1/8) part of memory. */ +#if (8 * THREAD_SIZE) > PAGE_SIZE max_threads = mempages / (8 * THREAD_SIZE / PAGE_SIZE); +#else + max_threads = mempages * PAGE_SIZE / (8 * THREAD_SIZE); +#endif Can't this overflow, e.g. on 32-bit machines with HIGHMEM? The multiplier here is not PAGE_SIZE, but [PAGE_SIZE / (8 * THREAD_SIZE)], and this value is expected to be rather small (2, 4, or so). Furthermore, due to the #if/#endif construction, multiplication is used only with rather big PAGE_SIZE values, and the bigger the page size is, the smaller 'mempages' is. So, for example, when running with PAGE_SIZE=256KB, THREAD_SIZE=8KB, on a 32-bit 440spe-based machine with 4GB RAM installed, here we have: max_threads = (4G/256K) * (256K / (8 * 8K)) = 16384 * 4 = 65536. And the overflow will take place only in case of very very big sizes of RAM: >= 256TB: max_threads = (256T / 256K) * (256K / (8 * 8K)) = 0x40000000 * 4. 
But I don't think that with 256TB RAM installed this code will be the only place of problems :) Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re[2]: [PATCH] powerpc: add 16K/64K pages support for the 44x PPC32 architectures.
/L1_CACHE_BYTES /* Number of lines in a page */ Same comment. pgd_t *pgd_alloc(struct mm_struct *mm) { pgd_t *ret; - ret = (pgd_t *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, PGDIR_ORDER); + ret = (pgd_t *)kzalloc(1 << PGDIR_ORDER, GFP_KERNEL); return ret; } We may want to consider using a slab cache. Maybe an area where we want to merge 32 and 64 bit code, though it doesn't have to be right now. Do we know the impact of using kzalloc instead of gfp for when it's really just a single page though ? Does it have overhead or will kzalloc just fallback to gfp ? If it has overhead, then we probably want to ifdef and keep using gfp for the 1-page case. This depends on the allocator: SLUB looks like it calls __get_free_pages() if size > PAGE_SIZE [note, not >= !], but SLAB doesn't. So, we'll add an ifdef here. __init_refok pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address) @@ -400,7 +395,7 @@ void kernel_map_pages(struct page *page, int numpages, int enable) #endif /* CONFIG_DEBUG_PAGEALLOC */ static int fixmaps; -unsigned long FIXADDR_TOP = 0xf000; +unsigned long FIXADDR_TOP = (-PAGE_SIZE); EXPORT_SYMBOL(FIXADDR_TOP); void __set_fixmap (enum fixed_addresses idx, phys_addr_t phys, pgprot_t flags) diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype index 548efa5..73a5aa9 100644 --- a/arch/powerpc/platforms/Kconfig.cputype +++ b/arch/powerpc/platforms/Kconfig.cputype @@ -204,7 +204,7 @@ config PPC_STD_MMU_32 config PPC_MM_SLICES bool - default y if HUGETLB_PAGE || PPC_64K_PAGES + default y if HUGETLB_PAGE || (PPC64 && PPC_64K_PAGES) default n I would make it PPC_64 && (HUGETLB_PAGE || PPC_64K_PAGES) for now, I don't think we want to use the existing slice code on anything else. Make it even PPC_STD_MMU_64 Cheers, Ben. Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re[4]: [PATCH] fork_init: fix division by zero
Hello David, On Wednesday, December 10, 2008 you wrote: Yuri Tikhonov [EMAIL PROTECTED] wrote: Here we believe in the preprocessor: since all of PAGE_SIZE, 8, and THREAD_SIZE are constants, we expect it will calculate this. The preprocessor shouldn't be calculating this. I believe it will _only_ calculate expressions for #if. In the situation you're referring to, it should perform a substitution and nothing more. The preprocessor doesn't necessarily know how to handle the types involved. In any case, there's an easy way to find out: you can ask the compiler to give you the result of running the source through the preprocessor only. For instance, if you run this: #define PAGE_SIZE 4096 #define THREAD_SIZE 8192 unsigned long mempages; unsigned long jump(void) { unsigned long max_threads; max_threads = mempages * PAGE_SIZE / (8 * THREAD_SIZE); return max_threads; } through gcc -E, you get: # 1 "calc.c" # 1 "<built-in>" # 1 "<command line>" # 1 "calc.c" unsigned long mempages; unsigned long jump(void) { unsigned long max_threads; max_threads = mempages * 4096 / (8 * 8192); return max_threads; } In any case, adding braces as follows probably would be better: + max_threads = mempages * (PAGE_SIZE / (8 * THREAD_SIZE)); I think you mean brackets, not braces '{}'. Yes, it was a typo. Right ? Definitely not. I added this function to the above: unsigned long alt(void) { unsigned long max_threads; max_threads = mempages * (PAGE_SIZE / (8 * THREAD_SIZE)); return max_threads; } and ran it through gcc -S -O2 for x86_64: jump: movq mempages(%rip), %rax salq $12, %rax shrq $16, %rax ret alt: xorl %eax, %eax ret Note the difference? In jump(), x86_64 first multiplies mempages by 4096, and _then_ divides by 8*8192. In alt(), it just returns 0 because the compiler realised that you're multiplying by 0. I think Geert has already commented on this: you've compiled your alt() function with 4K PAGE_SIZE and 8K THREAD_SIZE - this case is handled by the old code in fork_init. 
If you're going to bracket the expression, it must be: max_threads = (mempages * PAGE_SIZE) / (8 * THREAD_SIZE); which should be superfluous. E.g. here is the result from this line as produced by cross-gcc 4.2.2: lis r9,0 rlwinm r29,r29,2,16,29 stw r29,0(r9) As you see - only rotate-left, i.e. multiplication to the constant. Ummm... On powerpc, I believe rotate-left would be a division as it does the bit-numbering and the bit direction the opposite way to more familiar CPUs such as x86. On powerpc shifting left is multiplication by 2, as this has the most significant bit first. Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re[2]: [PATCH] fork_init: fix division by zero
Hello Al, On Wednesday, December 10, 2008 you wrote: On Wed, Dec 10, 2008 at 01:01:13PM +0300, Yuri Tikhonov wrote: + max_threads = mempages * PAGE_SIZE / (8 * THREAD_SIZE); +#endif Can't this overflow, e.g. on 32-bit machines with HIGHMEM? The multiplier here is not PAGE_SIZE, but [PAGE_SIZE / (8 * THREAD_SIZE)], and this value is expected to be rather small (2, 4, or so). x * y / z is parsed as (x * y) / z, not x * (y / z). Here we believe in the preprocessor: since all of PAGE_SIZE, 8, and THREAD_SIZE are constants, we expect it will calculate this. E.g. here is the result from this line as produced by cross-gcc 4.2.2: lis r9,0 rlwinm r29,r29,2,16,29 stw r29,0(r9) As you see - only rotate-left, i.e. multiplication by a constant. In any case, adding brackets as follows probably would be better: + max_threads = mempages * (PAGE_SIZE / (8 * THREAD_SIZE)); Right ? Only assignment operators (and ?:, in the sense that a ? b : c ? d : e is parsed as a ? b : (c ? d : e)) are right-to-left. The rest is left-to-right. Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re[2]: [PATCH] ASYNC_TX: async_xor mapping fix
Hello Dan, On Wednesday, December 10, 2008 you wrote: On Mon, 2008-12-08 at 14:08 -0700, Dan Williams wrote: On Mon, 2008-12-08 at 12:14 -0700, Yuri Tikhonov wrote: The destination address may be present in the source list, so we should map the addresses from the source list first. Otherwise, if the page corresponding to the destination is not marked as write-through (with regards to CPU cache), then mapping it with DMA_FROM_DEVICE may lead to data loss, and finally to an incorrect result of calculations. Thanks Yuri. I think we should avoid mapping the destination twice altogether, and for simplicity just always map it bidirectionally. Yuri, Saeed, can I get your acked-by's for the following 2.6.28 patch: I would do the src_list[i] == dest check with unlikely(), since we are a priori aware that only one of all src_cnt addresses is the possible destination. As for the rest: Acked-by: Yuri Tikhonov [EMAIL PROTECTED] Thanks, Dan snip async_xor: dma_map destination DMA_BIDIRECTIONAL From: Dan Williams [EMAIL PROTECTED] Mapping the destination multiple times is a misuse of the dma-api. Since the destination may be reused as a source, ensure that it is only mapped once and that it is mapped bidirectionally. This appears to add ugliness on the unmap side in that it always reads back the destination address from the descriptor, but gcc can determine that dma_unmap is a nop and not emit the code that calculates its arguments. 
Cc: [EMAIL PROTECTED] Cc: Saeed Bishara [EMAIL PROTECTED] Reported-by: Yuri Tikhonov [EMAIL PROTECTED] Signed-off-by: Dan Williams [EMAIL PROTECTED] --- crypto/async_tx/async_xor.c | 11 +-- drivers/dma/iop-adma.c | 16 +--- drivers/dma/mv_xor.c | 15 --- 3 files changed, 34 insertions(+), 8 deletions(-) diff --git a/crypto/async_tx/async_xor.c b/crypto/async_tx/async_xor.c index c029d3e..a6faa90 100644 --- a/crypto/async_tx/async_xor.c +++ b/crypto/async_tx/async_xor.c @@ -53,10 +53,17 @@ do_async_xor(struct dma_chan *chan, struct page *dest, struct page **src_list, int xor_src_cnt; dma_addr_t dma_dest; - dma_dest = dma_map_page(dma->dev, dest, offset, len, DMA_FROM_DEVICE); - for (i = 0; i < src_cnt; i++) + /* map the dest bidirectional in case it is re-used as a source */ + dma_dest = dma_map_page(dma->dev, dest, offset, len, DMA_BIDIRECTIONAL); + for (i = 0; i < src_cnt; i++) { + /* only map the dest once */ + if (src_list[i] == dest) { + dma_src[i] = dma_dest; + continue; + } dma_src[i] = dma_map_page(dma->dev, src_list[i], offset, len, DMA_TO_DEVICE); + } while (src_cnt) { async_flags = flags; diff --git a/drivers/dma/iop-adma.c b/drivers/dma/iop-adma.c index c7a9306..6be3172 100644 --- a/drivers/dma/iop-adma.c +++ b/drivers/dma/iop-adma.c @@ -85,18 +85,28 @@ iop_adma_run_tx_complete_actions(struct iop_adma_desc_slot *desc, enum dma_ctrl_flags flags = desc->async_tx.flags; u32 src_cnt; dma_addr_t addr; + dma_addr_t dest; + src_cnt = unmap->unmap_src_cnt; + dest = iop_desc_get_dest_addr(unmap, iop_chan); if (!(flags & DMA_COMPL_SKIP_DEST_UNMAP)) { - addr = iop_desc_get_dest_addr(unmap, iop_chan); - dma_unmap_page(dev, addr, len, DMA_FROM_DEVICE); + enum dma_data_direction dir; + + if (src_cnt > 1) /* is xor? 
*/ + dir = DMA_BIDIRECTIONAL; + else + dir = DMA_FROM_DEVICE; + + dma_unmap_page(dev, dest, len, dir); } if (!(flags & DMA_COMPL_SKIP_SRC_UNMAP)) { - src_cnt = unmap->unmap_src_cnt; while (src_cnt--) { addr = iop_desc_get_src_addr(unmap, iop_chan, src_cnt); + if (addr == dest) + continue; dma_unmap_page(dev, addr, len, DMA_TO_DEVICE); } diff --git a/drivers/dma/mv_xor.c b/drivers/dma/mv_xor.c index 0328da0..bcda174 100644 --- a/drivers/dma/mv_xor.c +++ b/drivers/dma/mv_xor.c @@ -311,17 +311,26
[PATCH] fork_init: fix division by zero
The following patch fixes divide-by-zero error for the cases of really big PAGE_SIZEs (e.g. 256KB on ppc44x). Support for such big page sizes on 44x is not present in the current kernel yet, but coming soon. Also this patch fixes the comment for the max_threads settings, as this didn't match the things actually done in the code. Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED] Signed-off-by: Ilya Yanok [EMAIL PROTECTED] --- kernel/fork.c | 8 ++++++-- 1 files changed, 6 insertions(+), 2 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index 2a372a0..b0ac2fb 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -181,10 +181,14 @@ void __init fork_init(unsigned long mempages) /* * The default maximum number of threads is set to a safe - * value: the thread structures can take up at most half - * of memory. + * value: the thread structures can take up at most + * (1/8) part of memory. */ +#if (8 * THREAD_SIZE) > PAGE_SIZE max_threads = mempages / (8 * THREAD_SIZE / PAGE_SIZE); +#else + max_threads = mempages * PAGE_SIZE / (8 * THREAD_SIZE); +#endif /* * we need to allow at least 20 threads to boot a system -- 1.5.6.1 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[RFC PATCH 00/11][v2] md: support for asynchronous execution of RAID6 operations
Hello, This is the next attempt at asynchronous RAID-6 support. This patch-set has Dan Williams' comments (Nov 15) addressed. These were mainly about the ASYNC_TX part of the code. The following patch-set includes enhancements to the async_tx API and modifications to md-raid6 to issue memory copies and parity calculations asynchronously. Thus we may process copy operations and RAID-6 calculations on the dedicated DMA engines accessible with the ASYNC_TX API and, as a result, off-load the CPU and improve the performance. To reduce code duplication in the raid driver this patch-set modifies some raid-5 functions to make them usable in the raid-6 case. The patch-set can be broken down into the following main categories: 1) Additions to the ASYNC_TX API (patches 1-3; without patch 1 the ASYNC_TX code can't be compiled for 44x in 2.6.27-rc6 or later); 2) RAID-6 implementation (patches 4-10); 3) ppc440spe ADMA driver (patch 11) (provided as a reference here). -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 01/11] async_tx: don't use src_list argument of async_xor() for dma addresses
Using the src_list argument of async_xor() as storage for dma addresses implies a sizeof(dma_addr_t) <= sizeof(struct page *) restriction which is not always true (e.g. ppc440spe). Signed-off-by: Ilya Yanok [EMAIL PROTECTED] Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED] --- crypto/async_tx/async_xor.c | 14 ++ 1 files changed, 2 insertions(+), 12 deletions(-) diff --git a/crypto/async_tx/async_xor.c b/crypto/async_tx/async_xor.c index c029d3e..00c74c5 100644 --- a/crypto/async_tx/async_xor.c +++ b/crypto/async_tx/async_xor.c @@ -42,7 +42,7 @@ do_async_xor(struct dma_chan *chan, struct page *dest, struct page **src_list, dma_async_tx_callback cb_fn, void *cb_param) { struct dma_device *dma = chan->device; - dma_addr_t *dma_src = (dma_addr_t *) src_list; + dma_addr_t dma_src[src_cnt]; struct dma_async_tx_descriptor *tx = NULL; int src_off = 0; int i; @@ -247,7 +247,7 @@ async_xor_zero_sum(struct page *dest, struct page **src_list, BUG_ON(src_cnt <= 1); if (device && src_cnt <= device->max_xor) { - dma_addr_t *dma_src = (dma_addr_t *) src_list; + dma_addr_t dma_src[src_cnt]; unsigned long dma_prep_flags = cb_fn ? DMA_PREP_INTERRUPT : 0; int i; @@ -296,16 +296,6 @@ EXPORT_SYMBOL_GPL(async_xor_zero_sum); static int __init async_xor_init(void) { - #ifdef CONFIG_DMA_ENGINE - /* To conserve stack space the input src_list (array of page pointers) -* is reused to hold the array of dma addresses passed to the driver. -* This conversion is only possible when dma_addr_t is less than the -* the size of a pointer. HIGHMEM64G is known to violate this -* assumption. -*/ - BUILD_BUG_ON(sizeof(dma_addr_t) > sizeof(struct page *)); - #endif - return 0; } -- 1.5.6.1 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 02/11][v2] async_tx: add support for asynchronous GF multiplication
This adds support for doing asynchronous GF multiplication by adding four additional functions to the async_tx API: async_pq() does simultaneous XOR of sources and XOR of sources GF-multiplied by given coefficients. async_pq_zero_sum() checks if results of calculations match given ones. async_gen_syndrome() does simultaneous XOR and R/S syndrome of sources. async_syndrome_zerosum() checks if results of the XOR/syndrome calculation match given ones. The latter two functions just use async_pq() with the appropriate coefficients in the asynchronous case but have significant optimizations in the synchronous case. To support this API a dmaengine driver should set the DMA_PQ and DMA_PQ_ZERO_SUM capabilities and provide device_prep_dma_pq and device_prep_dma_pqzero_sum methods in the dma_device structure. Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED] Signed-off-by: Ilya Yanok [EMAIL PROTECTED] --- crypto/async_tx/Kconfig | 4 + crypto/async_tx/Makefile | 1 + crypto/async_tx/async_pq.c | 586 include/linux/async_tx.h | 45 - include/linux/dmaengine.h | 16 ++- 5 files changed, 648 insertions(+), 4 deletions(-) create mode 100644 crypto/async_tx/async_pq.c diff --git a/crypto/async_tx/Kconfig b/crypto/async_tx/Kconfig index d8fb391..cb6d731 100644 --- a/crypto/async_tx/Kconfig +++ b/crypto/async_tx/Kconfig @@ -14,3 +14,7 @@ config ASYNC_MEMSET tristate select ASYNC_CORE +config ASYNC_PQ + tristate + select ASYNC_CORE + diff --git a/crypto/async_tx/Makefile b/crypto/async_tx/Makefile index 27baa7d..1b99265 100644 --- a/crypto/async_tx/Makefile +++ b/crypto/async_tx/Makefile @@ -2,3 +2,4 @@ obj-$(CONFIG_ASYNC_CORE) += async_tx.o obj-$(CONFIG_ASYNC_MEMCPY) += async_memcpy.o obj-$(CONFIG_ASYNC_MEMSET) += async_memset.o obj-$(CONFIG_ASYNC_XOR) += async_xor.o +obj-$(CONFIG_ASYNC_PQ) += async_pq.o diff --git a/crypto/async_tx/async_pq.c b/crypto/async_tx/async_pq.c new file mode 100644 index 000..439338f --- /dev/null +++ b/crypto/async_tx/async_pq.c @@ -0,0 +1,586 @@ +/* + * Copyright(c) 2007 Yuri Tikhonov [EMAIL 
PROTECTED] + * + * Developed for DENX Software Engineering GmbH + * + * Asynchronous GF-XOR calculations ASYNC_TX API. + * + * based on async_xor.c code written by: + * Dan Williams [EMAIL PROTECTED] + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2 of the License, or (at your option) + * any later version. + * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 + * Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * The full GNU General Public License is included in this distribution in the + * file called COPYING. + */ +#include <linux/kernel.h> +#include <linux/interrupt.h> +#include <linux/dma-mapping.h> +#include <linux/raid/xor.h> +#include <linux/async_tx.h> + +#include "../drivers/md/raid6.h" + +/** + * The following static variables are used in cases of synchronous + * zero sum to save the values to check. 
Two pages used for zero sum and + * the third one is for dumb P destination when calling gen_syndrome() + */ +static spinlock_t spare_lock; +struct page *spare_pages[3]; + +/** + * do_async_pq - asynchronously calculate P and/or Q + */ +static struct dma_async_tx_descriptor * +do_async_pq(struct dma_chan *chan, struct page **blocks, + unsigned char *scf_list, unsigned int offset, int src_cnt, size_t len, + enum async_tx_flags flags, struct dma_async_tx_descriptor *depend_tx, + dma_async_tx_callback cb_fn, void *cb_param) +{ + struct dma_device *dma = chan-device; + dma_addr_t dma_dest[2], dma_src[src_cnt]; + struct dma_async_tx_descriptor *tx = NULL; + dma_async_tx_callback _cb_fn; + void *_cb_param; + int i, pq_src_cnt, src_off = 0; + enum async_tx_flags async_flags; + enum dma_ctrl_flags dma_flags = 0; + + /* If we won't handle src_cnt in one shot, then the following +* flag(s) will be set only on the first pass of prep_dma +*/ + if (flags ASYNC_TX_PQ_ZERO_P) + dma_flags |= DMA_PREP_ZERO_P; + if (flags ASYNC_TX_PQ_ZERO_Q) + dma_flags |= DMA_PREP_ZERO_Q; + + /* DMAs use destinations as sources, so use BIDIRECTIONAL mapping */ + dma_dest[0] = !blocks[src_cnt] ? 0 : + dma_map_page(dma-dev, blocks[src_cnt], +offset, len, DMA_BIDIRECTIONAL); + dma_dest[1
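For readers who want to sanity-check the P/Q math being discussed in this thread, here is a minimal userspace sketch of what async_gen_syndrome() computes in the synchronous fallback: P is the plain XOR of the sources, and Q is the XOR of the sources multiplied in GF(2^8) by successive powers of {02}. This is an illustration only, not the kernel code; the function names here are mine.

```c
#include <stdint.h>
#include <stddef.h>

/* GF(2^8) multiply over the RAID-6 polynomial x^8+x^4+x^3+x^2+1 (0x11d) */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
	uint8_t r = 0;

	while (b) {
		if (b & 1)
			r ^= a;
		a = (a << 1) ^ ((a & 0x80) ? 0x1d : 0);
		b >>= 1;
	}
	return r;
}

/*
 * Compute P (plain XOR of the sources) and Q (XOR of the sources
 * multiplied by the per-source coefficients {01}, {02}, {04}, ... --
 * successive powers of {02}), byte by byte over 'len' bytes.
 */
static void gen_syndrome(int src_cnt, size_t len,
			 uint8_t **src, uint8_t *p, uint8_t *q)
{
	size_t i;
	int d;
	uint8_t coef;

	for (i = 0; i < len; i++) {
		p[i] = 0;
		q[i] = 0;
		coef = 1;			/* {02}^0 = {01} */
		for (d = 0; d < src_cnt; d++) {
			p[i] ^= src[d][i];
			q[i] ^= gf_mul(coef, src[d][i]);
			coef = gf_mul(coef, 2);	/* next power of {02} */
		}
	}
}
```

With five sources the coefficients come out as {01}, {02}, {04}, {08}, {10} — exactly the COEF(...) list used in the split example earlier in the thread.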
[PATCH 03/11][v2] async_tx: add support for asynchronous RAID6 recovery operations
This patch extends the async_tx API with two operations for recovering a RAID-6 array with two failed disks, using the new async_pq() operation. The patch introduces the following functions: async_r6_dd_recov() recovers after a double data-disk failure async_r6_dp_recov() recovers after a D+P failure Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED] Signed-off-by: Ilya Yanok [EMAIL PROTECTED] --- crypto/async_tx/Kconfig |5 + crypto/async_tx/Makefile|1 + crypto/async_tx/async_r6recov.c | 282 +++ include/linux/async_tx.h| 11 ++ 4 files changed, 299 insertions(+), 0 deletions(-) create mode 100644 crypto/async_tx/async_r6recov.c diff --git a/crypto/async_tx/Kconfig b/crypto/async_tx/Kconfig index cb6d731..0b56224 100644 --- a/crypto/async_tx/Kconfig +++ b/crypto/async_tx/Kconfig @@ -18,3 +18,8 @@ config ASYNC_PQ tristate select ASYNC_CORE +config ASYNC_R6RECOV + tristate + select ASYNC_CORE + select ASYNC_PQ + diff --git a/crypto/async_tx/Makefile b/crypto/async_tx/Makefile index 1b99265..0ed8f13 100644 --- a/crypto/async_tx/Makefile +++ b/crypto/async_tx/Makefile @@ -3,3 +3,4 @@ obj-$(CONFIG_ASYNC_MEMCPY) += async_memcpy.o obj-$(CONFIG_ASYNC_MEMSET) += async_memset.o obj-$(CONFIG_ASYNC_XOR) += async_xor.o obj-$(CONFIG_ASYNC_PQ) += async_pq.o +obj-$(CONFIG_ASYNC_R6RECOV) += async_r6recov.o diff --git a/crypto/async_tx/async_r6recov.c b/crypto/async_tx/async_r6recov.c new file mode 100644 index 000..403c1aa --- /dev/null +++ b/crypto/async_tx/async_r6recov.c @@ -0,0 +1,282 @@ +/* + * Copyright(c) 2007 Yuri Tikhonov [EMAIL PROTECTED] + * + * Developed for DENX Software Engineering GmbH + * + * Asynchronous RAID-6 recovery calculations ASYNC_TX API.
+ * + * based on async_xor.c code written by: + * Dan Williams [EMAIL PROTECTED] + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2 of the License, or (at your option) + * any later version. + * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 + * Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * The full GNU General Public License is included in this distribution in the + * file called COPYING. + */ +#include linux/kernel.h +#include linux/interrupt.h +#include linux/dma-mapping.h +#include linux/raid/xor.h +#include linux/async_tx.h + +#include ../drivers/md/raid6.h + +/** + * async_r6_dd_recov - attempt to calculate two data misses using dma engines. + * @disks: number of disks in the RAID-6 array + * @bytes: size of strip + * @faila: first failed drive index + * @failb: second failed drive index + * @ptrs: array of pointers to strips (last two must be p and q, respectively) + * @flags: ASYNC_TX_ACK, ASYNC_TX_DEP_ACK + * @depend_tx: depends on the result of this transaction. 
+ * @cb: function to call when the operation completes + * @cb_param: parameter to pass to the callback routine + */ +struct dma_async_tx_descriptor * +async_r6_dd_recov(int disks, size_t bytes, int faila, int failb, + struct page **ptrs, enum async_tx_flags flags, + struct dma_async_tx_descriptor *depend_tx, + dma_async_tx_callback cb, void *cb_param) +{ + struct dma_async_tx_descriptor *tx = NULL; + struct page *lptrs[disks]; + unsigned char lcoef[disks-4]; + int i = 0, k = 0, fc = -1; + uint8_t bc[2]; + dma_async_tx_callback lcb = NULL; + void *lcb_param = NULL; + + /* Assume that failb faila */ + if (faila failb) { + fc = faila; + faila = failb; + failb = fc; + } + + /* Try to compute missed data asynchronously. */ + if (disks == 4) { + /* Pxy and Qxy are zero in this case so we already have +* P+Pxy and Q+Qxy in P and Q strips respectively. +*/ + tx = depend_tx; + lcb = cb; + lcb_param = cb_param; + goto do_mult; + } + + /* (1) Calculate Qxy and Pxy: +* Qxy = A(0)*D(0) + ... + A(n-1)*D(n-1) + A(n+1)*D(n+1) + ... + +* A(m-1)*D(m-1) + A(m+1)*D(m+1) + ... + A(disks-1)*D(disks-1), +* where n = faila, m = failb. +*/ + for (i = 0, k = 0; i disks - 2; i++) { + if (i != faila i != failb) { + lptrs[k] = ptrs[i]; + lcoef[k] = raid6_gfexp[i]; + k
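As a cross-check of the recovery algebra behind async_r6_dd_recov() (the Qxy/Pxy expressions in the comment above), here is a byte-level userspace sketch. It solves the two-erasure system directly: with P' = P ^ Pxy and Q' = Q ^ Qxy (the "P+Pxy" and "Q+Qxy" values mentioned earlier in this thread), Dx = (Q' ^ g^y·P') / (g^x ^ g^y) and Dy = P' ^ Dx, where g = {02}. This is not the kernel implementation — the names and the brute-force inverse are purely illustrative.

```c
#include <stdint.h>

/* GF(2^8) helpers over the RAID-6 polynomial 0x11d; illustrative only */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
	uint8_t r = 0;

	while (b) {
		if (b & 1)
			r ^= a;
		a = (a << 1) ^ ((a & 0x80) ? 0x1d : 0);
		b >>= 1;
	}
	return r;
}

static uint8_t gf_exp2(unsigned int e)	/* {02}^e */
{
	uint8_t r = 1;

	while (e--)
		r = gf_mul(r, 2);
	return r;
}

static uint8_t gf_inv(uint8_t a)	/* brute-force inverse, a != 0 */
{
	unsigned int c;

	for (c = 1; c < 256; c++)
		if (gf_mul(a, (uint8_t)c) == 1)
			return (uint8_t)c;
	return 0;
}

/*
 * Recover one byte of each of the failed data disks x and y (x != y)
 * from p = P ^ Pxy and q = Q ^ Qxy:
 *   Dx = (q ^ g^y * p) / (g^x ^ g^y),  Dy = p ^ Dx
 */
static void r6_dd_recov_byte(unsigned int x, unsigned int y,
			     uint8_t p, uint8_t q,
			     uint8_t *dx, uint8_t *dy)
{
	uint8_t gx = gf_exp2(x), gy = gf_exp2(y);

	*dx = gf_mul(gf_inv(gx ^ gy), q ^ gf_mul(gy, p));
	*dy = p ^ *dx;
}
```

The division by (g^x ^ g^y) is what the constant-coefficient array bc[2] in the patch precomputes, so the per-byte work in the driver reduces to two GF multiplies and two XORs.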
[PATCH 04/11][v2] md: run RAID-6 stripe operations outside the lock
The raid_run_ops routine uses the asynchronous offload api and the stripe_operations member of a stripe_head to carry out xor+pqxor+copy operations asynchronously, outside the lock. The operations performed by RAID-6 are the same as in the RAID-5 case except for no support of STRIPE_OP_PREXOR operations. All the others are supported: STRIPE_OP_BIOFILL - copy data into request buffers to satisfy a read request STRIPE_OP_COMPUTE_BLK - generate missing blocks (1 or 2) in the cache from the other blocks STRIPE_OP_BIODRAIN - copy data out of request buffers to satisfy a write request STRIPE_OP_POSTXOR - recalculate parity for new data that has entered the cache STRIPE_OP_CHECK - verify that the parity is correct The flow is the same as in the RAID-5 case. Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED] Signed-off-by: Ilya Yanok [EMAIL PROTECTED] --- drivers/md/Kconfig |2 + drivers/md/raid5.c | 292 include/linux/raid/raid5.h |6 +- 3 files changed, 271 insertions(+), 29 deletions(-) diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig index 2281b50..6c9964f 100644 --- a/drivers/md/Kconfig +++ b/drivers/md/Kconfig @@ -123,6 +123,8 @@ config MD_RAID456 depends on BLK_DEV_MD select ASYNC_MEMCPY select ASYNC_XOR + select ASYNC_PQ + select ASYNC_R6RECOV ---help--- A RAID-5 set of N drives with a capacity of C MB per drive provides the capacity of C * (N - 1) MB, and protects against a failure diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index a36a743..aeec3e5 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -584,18 +584,26 @@ static void ops_run_biofill(struct stripe_head *sh) ops_complete_biofill, sh); } -static void ops_complete_compute5(void *stripe_head_ref) +static void ops_complete_compute(void *stripe_head_ref) { struct stripe_head *sh = stripe_head_ref; - int target = sh-ops.target; - struct r5dev *tgt = sh-dev[target]; + int target, i; + struct r5dev *tgt; pr_debug(%s: stripe %llu\n, __func__, (unsigned long long)sh-sector); - set_bit(R5_UPTODATE, 
tgt-flags); - BUG_ON(!test_bit(R5_Wantcompute, tgt-flags)); - clear_bit(R5_Wantcompute, tgt-flags); + /* mark the computed target(s) as uptodate */ + for (i = 0; i 2; i++) { + target = (!i) ? sh-ops.target : sh-ops.target2; + if (target 0) + continue; + tgt = sh-dev[target]; + set_bit(R5_UPTODATE, tgt-flags); + BUG_ON(!test_bit(R5_Wantcompute, tgt-flags)); + clear_bit(R5_Wantcompute, tgt-flags); + } + clear_bit(STRIPE_COMPUTE_RUN, sh-state); if (sh-check_state == check_state_compute_run) sh-check_state = check_state_compute_result; @@ -627,15 +635,155 @@ static struct dma_async_tx_descriptor *ops_run_compute5(struct stripe_head *sh) if (unlikely(count == 1)) tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE, - 0, NULL, ops_complete_compute5, sh); + 0, NULL, ops_complete_compute, sh); else tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, ASYNC_TX_XOR_ZERO_DST, NULL, - ops_complete_compute5, sh); + ops_complete_compute, sh); + + return tx; +} + +static struct dma_async_tx_descriptor * +ops_run_compute6_1(struct stripe_head *sh) +{ + /* kernel stack size limits the total number of disks */ + int disks = sh-disks; + struct page *srcs[disks]; + int target = sh-ops.target 0 ? 
sh-ops.target2 : sh-ops.target; + struct r5dev *tgt = sh-dev[target]; + struct page *dest = sh-dev[target].page; + int count = 0; + int pd_idx = sh-pd_idx, qd_idx = raid6_next_disk(pd_idx, disks); + int d0_idx = raid6_next_disk(qd_idx, disks); + struct dma_async_tx_descriptor *tx; + int i; + + pr_debug(%s: stripe %llu block: %d\n, + __func__, (unsigned long long)sh-sector, target); + BUG_ON(!test_bit(R5_Wantcompute, tgt-flags)); + + atomic_inc(sh-count); + + if (target == qd_idx) { + /* We are actually computing the Q drive*/ + i = d0_idx; + do { + srcs[count++] = sh-dev[i].page; + i = raid6_next_disk(i, disks); + } while (i != pd_idx); + srcs[count] = NULL; + srcs[count+1] = dest; + tx = async_gen_syndrome(srcs, 0, count, STRIPE_SIZE, + 0, NULL, ops_complete_compute, sh); + } else { + /* Compute any data- or p-drive using XOR */ + for (i = disks; i-- ; ) { + if (i != target i
[PATCH 05/11] md: common schedule_reconstruction for raid5/6
To be able to re-use the schedule_reconstruction5() code in the RAID-6 case, it should handle the Q-parity strip appropriately. This patch introduces that. Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED] Signed-off-by: Ilya Yanok [EMAIL PROTECTED] --- drivers/md/raid5.c | 18 ++ 1 files changed, 14 insertions(+), 4 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index aeec3e5..e31f38b 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -1885,10 +1885,11 @@ static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2) } static void -schedule_reconstruction5(struct stripe_head *sh, struct stripe_head_state *s, +schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s, int rcw, int expand) { int i, pd_idx = sh->pd_idx, disks = sh->disks; + int level = sh->raid_conf->level; if (rcw) { /* if we are not expanding this is a proper write request, and @@ -1914,10 +1915,12 @@ schedule_reconstruction5(struct stripe_head *sh, struct stripe_head_state *s, s->locked++; } } - if (s->locked + 1 == disks) + if ((level == 5 && s->locked + 1 == disks) || + (level == 6 && s->locked + 2 == disks)) if (!test_and_set_bit(STRIPE_FULL_WRITE, &sh->state)) atomic_inc(&sh->raid_conf->pending_full_writes); } else { + BUG_ON(level == 6); BUG_ON(!(test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags) || test_bit(R5_Wantcompute, &sh->dev[pd_idx].flags))); @@ -1949,6 +1952,13 @@ schedule_reconstruction5(struct stripe_head *sh, struct stripe_head_state *s, clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags); s->locked++; + if (level == 6) { + int qd_idx = raid6_next_disk(pd_idx, disks); + set_bit(R5_LOCKED, &sh->dev[qd_idx].flags); + clear_bit(R5_UPTODATE, &sh->dev[qd_idx].flags); + s->locked++; + } + pr_debug("%s: stripe %llu locked: %d ops_request: %lx\n", __func__, (unsigned long long)sh->sector, s->locked, s->ops_request); @@ -2410,7 +2420,7 @@ static void handle_stripe_dirtying5(raid5_conf_t *conf, if ((s->req_compute || !test_bit(STRIPE_COMPUTE_RUN, &sh->state)) && (s->locked == 0 && (rcw == 0 ||
rmw == 0) !test_bit(STRIPE_BIT_DELAY, sh-state))) - schedule_reconstruction5(sh, s, rcw == 0, 0); + schedule_reconstruction(sh, s, rcw == 0, 0); } static void handle_stripe_dirtying6(raid5_conf_t *conf, @@ -3003,7 +3013,7 @@ static bool handle_stripe5(struct stripe_head *sh) sh-disks = conf-raid_disks; sh-pd_idx = stripe_to_pdidx(sh-sector, conf, conf-raid_disks); - schedule_reconstruction5(sh, s, 1, 1); + schedule_reconstruction(sh, s, 1, 1); } else if (s.expanded !sh-reconstruct_state s.locked == 0) { clear_bit(STRIPE_EXPAND_READY, sh-state); atomic_dec(conf-reshape_stripes); -- 1.5.6.1 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 07/11] md: rewrite handle_stripe_dirtying6 in asynchronous way
Rewrite handle_stripe_dirtying6 function to work asynchronously. Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED] Signed-off-by: Ilya Yanok [EMAIL PROTECTED] --- drivers/md/raid5.c | 113 ++-- 1 files changed, 30 insertions(+), 83 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index e08ed4f..f0b47bd 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -2485,99 +2485,46 @@ static void handle_stripe_dirtying6(raid5_conf_t *conf, struct stripe_head *sh, struct stripe_head_state *s, struct r6_state *r6s, int disks) { - int rcw = 0, must_compute = 0, pd_idx = sh-pd_idx, i; + int rcw = 0, pd_idx = sh-pd_idx, i; int qd_idx = r6s-qd_idx; + + set_bit(STRIPE_HANDLE, sh-state); for (i = disks; i--; ) { struct r5dev *dev = sh-dev[i]; - /* Would I have to read this buffer for reconstruct_write */ - if (!test_bit(R5_OVERWRITE, dev-flags) -i != pd_idx i != qd_idx -(!test_bit(R5_LOCKED, dev-flags) - ) - !test_bit(R5_UPTODATE, dev-flags)) { - if (test_bit(R5_Insync, dev-flags)) rcw++; - else { - pr_debug(raid6: must_compute: - disk %d flags=%#lx\n, i, dev-flags); - must_compute++; + /* check if we haven't enough data */ + if (!test_bit(R5_OVERWRITE, dev-flags) + i != pd_idx i != qd_idx + !test_bit(R5_LOCKED, dev-flags) + !(test_bit(R5_UPTODATE, dev-flags) || + test_bit(R5_Wantcompute, dev-flags))) { + rcw++; + if (!test_bit(R5_Insync, dev-flags)) + continue; /* it's a failed drive */ + + if ( + test_bit(STRIPE_PREREAD_ACTIVE, sh-state)) { + pr_debug(Read_old stripe %llu + block %d for Reconstruct\n, +(unsigned long long)sh-sector, i); + set_bit(R5_LOCKED, dev-flags); + set_bit(R5_Wantread, dev-flags); + s-locked++; + } else { + pr_debug(Request delayed stripe %llu + block %d for Reconstruct\n, +(unsigned long long)sh-sector, i); + set_bit(STRIPE_DELAYED, sh-state); + set_bit(STRIPE_HANDLE, sh-state); } } } - pr_debug(for sector %llu, rcw=%d, must_compute=%d\n, - (unsigned long long)sh-sector, rcw, must_compute); - set_bit(STRIPE_HANDLE, sh-state); - - if (rcw 
0) - /* want reconstruct write, but need to get some data */ - for (i = disks; i--; ) { - struct r5dev *dev = sh-dev[i]; - if (!test_bit(R5_OVERWRITE, dev-flags) -!(s-failed == 0 (i == pd_idx || i == qd_idx)) -!test_bit(R5_LOCKED, dev-flags) - !test_bit(R5_UPTODATE, dev-flags) - test_bit(R5_Insync, dev-flags)) { - if ( - test_bit(STRIPE_PREREAD_ACTIVE, sh-state)) { - pr_debug(Read_old stripe %llu - block %d for Reconstruct\n, -(unsigned long long)sh-sector, i); - set_bit(R5_LOCKED, dev-flags); - set_bit(R5_Wantread, dev-flags); - s-locked++; - } else { - pr_debug(Request delayed stripe %llu - block %d for Reconstruct\n, -(unsigned long long)sh-sector, i); - set_bit(STRIPE_DELAYED, sh-state); - set_bit(STRIPE_HANDLE, sh-state); - } - } - } /* now if nothing is locked, and if we have enough data, we can start a * write request */ - if (s-locked == 0 rcw == 0 + if ((s-req_compute || !test_bit(STRIPE_COMPUTE_RUN, sh-state)) + s-locked == 0 rcw == 0 !test_bit(STRIPE_BIT_DELAY, sh-state)) { - if (must_compute 0
[PATCH 06/11] md: change handle_stripe_fill6 to work in asynchronous way
Change handle_stripe_fill6 to work asynchronously and introduce helper fetch_block6 function for this. Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED] Signed-off-by: Ilya Yanok [EMAIL PROTECTED] --- drivers/md/raid5.c | 154 1 files changed, 106 insertions(+), 48 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index e31f38b..e08ed4f 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -2224,61 +2224,119 @@ static void handle_stripe_fill5(struct stripe_head *sh, set_bit(STRIPE_HANDLE, sh-state); } -static void handle_stripe_fill6(struct stripe_head *sh, - struct stripe_head_state *s, struct r6_state *r6s, - int disks) +/* fetch_block6 - checks the given member device to see if its data needs + * to be read or computed to satisfy a request. + * + * Returns 1 when no more member devices need to be checked, otherwise returns + * 0 to tell the loop in handle_stripe_fill6 to continue + */ +static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s, +struct r6_state *r6s, int disk_idx, int disks) { - int i; - for (i = disks; i--; ) { - struct r5dev *dev = sh-dev[i]; - if (!test_bit(R5_LOCKED, dev-flags) - !test_bit(R5_UPTODATE, dev-flags) - (dev-toread || (dev-towrite -!test_bit(R5_OVERWRITE, dev-flags)) || -s-syncing || s-expanding || -(s-failed = 1 - (sh-dev[r6s-failed_num[0]].toread || - s-to_write)) || -(s-failed = 2 - (sh-dev[r6s-failed_num[1]].toread || - s-to_write { - /* we would like to get this block, possibly -* by computing it, but we might not be able to + struct r5dev *dev = sh-dev[disk_idx]; + struct r5dev *fdev[2] = { sh-dev[r6s-failed_num[0]], + sh-dev[r6s-failed_num[1]] }; + + if (!test_bit(R5_LOCKED, dev-flags) + !test_bit(R5_UPTODATE, dev-flags) + (dev-toread || +(dev-towrite !test_bit(R5_OVERWRITE, dev-flags)) || +s-syncing || s-expanding || +(s-failed = 1 + (fdev[0]-toread || s-to_write)) || +(s-failed = 2 + (fdev[1]-toread || s-to_write { + /* we would like to get this block, possibly by computing it, +* 
otherwise read it if the backing disk is insync +*/ + BUG_ON(test_bit(R5_Wantcompute, dev-flags)); + BUG_ON(test_bit(R5_Wantread, dev-flags)); + if ((s-uptodate == disks - 1) + (s-failed (disk_idx == r6s-failed_num[0] || + disk_idx == r6s-failed_num[1]))) { + /* have disk failed, and we're requested to fetch it; +* do compute it */ - if ((s-uptodate == disks - 1) - (s-failed (i == r6s-failed_num[0] || - i == r6s-failed_num[1]))) { - pr_debug(Computing stripe %llu block %d\n, - (unsigned long long)sh-sector, i); - compute_block_1(sh, i, 0); - s-uptodate++; - } else if ( s-uptodate == disks-2 s-failed = 2 ) { - /* Computing 2-failure is *very* expensive; only -* do it if failed = 2 + pr_debug(Computing stripe %llu block %d\n, + (unsigned long long)sh-sector, disk_idx); + set_bit(STRIPE_COMPUTE_RUN, sh-state); + set_bit(STRIPE_OP_COMPUTE_BLK, s-ops_request); + set_bit(R5_Wantcompute, dev-flags); + sh-ops.target = disk_idx; + sh-ops.target2 = -1; /* no 2nd target */ + s-req_compute = 1; + s-uptodate++; + return 1; + } else if ( s-uptodate == disks-2 s-failed = 2 ) { + /* Computing 2-failure is *very* expensive; only +* do it if failed = 2 +*/ + int other; + for (other = disks; other--; ) { + if (other == disk_idx) + continue; + if (!test_bit(R5_UPTODATE, + sh-dev[other].flags)) + break; + } + BUG_ON(other 0
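The flattened diff above is hard to follow, so a condensed restatement of the predicate fetch_block6() evaluates may help: a block is worth fetching only if it is neither locked nor up to date, and somebody actually needs it — a pending reader, a partial writer, or a sync/expand pass. The sketch below is a simplified userspace model with hypothetical field names, omitting the failed-disk and Wantcompute cases.

```c
#include <stdbool.h>

/*
 * Simplified model of the "does this member device need its data
 * read or computed" test in fetch_block6(); field names are
 * illustrative, not the kernel's r5dev layout.
 */
struct dev_state {
	bool locked;	/* R5_LOCKED: I/O already in flight */
	bool uptodate;	/* R5_UPTODATE: cache copy is valid */
	bool toread;	/* a read request wants this block */
	bool towrite;	/* a write request touches this block */
	bool overwrite;	/* R5_OVERWRITE: the write covers it fully */
};

static bool needs_fetch(const struct dev_state *dev,
			bool syncing, bool expanding)
{
	return !dev->locked && !dev->uptodate &&
	       (dev->toread ||
		(dev->towrite && !dev->overwrite) ||	/* partial write */
		syncing || expanding);
}
```

In the real function a true result then turns into either an R5_Wantread (disk is in sync) or an R5_Wantcompute with ops.target/target2 set, which is exactly the asynchronous replacement for the old compute_block_1/compute_block_2 calls.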
[PATCH 08/11] md: asynchronous handle_parity_check6
This patch introduces the state machine for handling the RAID-6 parities check and repair functionality. Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED] Signed-off-by: Ilya Yanok [EMAIL PROTECTED] --- drivers/md/raid5.c | 163 +++- 1 files changed, 110 insertions(+), 53 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index f0b47bd..91e5438 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -2621,91 +2621,148 @@ static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh, struct r6_state *r6s, struct page *tmp_page, int disks) { - int update_p = 0, update_q = 0; - struct r5dev *dev; + int i; + struct r5dev *devs[2] = {NULL, NULL}; int pd_idx = sh-pd_idx; int qd_idx = r6s-qd_idx; set_bit(STRIPE_HANDLE, sh-state); BUG_ON(s-failed 2); - BUG_ON(s-uptodate disks); + /* Want to check and possibly repair P and Q. * However there could be one 'failed' device, in which * case we can only check one of them, possibly using the * other to generate missing data */ - /* If !tmp_page, we cannot do the calculations, -* but as we have set STRIPE_HANDLE, we will soon be called -* by stripe_handle with a tmp_page - just wait until then. -*/ - if (tmp_page) { + switch (sh-check_state) { + case check_state_idle: + /* start a new check operation if there are 2 failures */ if (s-failed == r6s-q_failed) { /* The only possible failed device holds 'Q', so it * makes sense to check P (If anything else were failed, * we would have used P to recreate it). 
*/ - compute_block_1(sh, pd_idx, 1); - if (!page_is_zero(sh-dev[pd_idx].page)) { - compute_block_1(sh, pd_idx, 0); - update_p = 1; - } + sh-check_state = check_state_run; + set_bit(STRIPE_OP_CHECK_PP, s-ops_request); + clear_bit(R5_UPTODATE, sh-dev[pd_idx].flags); + s-uptodate--; } if (!r6s-q_failed s-failed 2) { /* q is not failed, and we didn't use it to generate * anything, so it makes sense to check it */ - memcpy(page_address(tmp_page), - page_address(sh-dev[qd_idx].page), - STRIPE_SIZE); - compute_parity6(sh, UPDATE_PARITY); - if (memcmp(page_address(tmp_page), - page_address(sh-dev[qd_idx].page), - STRIPE_SIZE) != 0) { - clear_bit(STRIPE_INSYNC, sh-state); - update_q = 1; - } + sh-check_state = check_state_run; + set_bit(STRIPE_OP_CHECK_QP, s-ops_request); + clear_bit(R5_UPTODATE, sh-dev[qd_idx].flags); + s-uptodate--; } - if (update_p || update_q) { - conf-mddev-resync_mismatches += STRIPE_SECTORS; - if (test_bit(MD_RECOVERY_CHECK, conf-mddev-recovery)) - /* don't try to repair!! */ - update_p = update_q = 0; + if (sh-check_state == check_state_run) { + break; } - /* now write out any block on a failed drive, -* or P or Q if they need it -*/ + /* we have 2-disk failure */ + BUG_ON(s-failed != 2); + devs[0] = sh-dev[r6s-failed_num[0]]; + devs[1] = sh-dev[r6s-failed_num[1]]; + /* fall through */ + case check_state_compute_result: + sh-check_state = check_state_idle; - if (s-failed == 2) { - dev = sh-dev[r6s-failed_num[1]]; - s-locked++; - set_bit(R5_LOCKED, dev-flags); - set_bit(R5_Wantwrite, dev-flags); + BUG_ON((devs[0] !devs[1]) || + (!devs[0] devs[1])); + + BUG_ON(s-uptodate (disks - 1)); + + if (!devs[0]) { + if (s-failed = 1) + devs[0] = sh-dev[r6s-failed_num[0]]; + else + devs[0] = sh-dev[pd_idx]; } - if (s-failed = 1
[PATCH 10/11] md: remove unused functions
Some clean-up of the replaced or already unnecessary functions. Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED] Signed-off-by: Ilya Yanok [EMAIL PROTECTED] --- drivers/md/raid5.c | 246 1 files changed, 0 insertions(+), 246 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 47b7de3..73307a9 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -1645,245 +1645,6 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i) } - -/* - * Copy data between a page in the stripe cache, and one or more bion - * The page could align with the middle of the bio, or there could be - * several bion, each with several bio_vecs, which cover part of the page - * Multiple bion are linked together on bi_next. There may be extras - * at the end of this list. We ignore them. - */ -static void copy_data(int frombio, struct bio *bio, -struct page *page, -sector_t sector) -{ - char *pa = page_address(page); - struct bio_vec *bvl; - int i; - int page_offset; - - if (bio-bi_sector = sector) - page_offset = (signed)(bio-bi_sector - sector) * 512; - else - page_offset = (signed)(sector - bio-bi_sector) * -512; - bio_for_each_segment(bvl, bio, i) { - int len = bio_iovec_idx(bio,i)-bv_len; - int clen; - int b_offset = 0; - - if (page_offset 0) { - b_offset = -page_offset; - page_offset += b_offset; - len -= b_offset; - } - - if (len 0 page_offset + len STRIPE_SIZE) - clen = STRIPE_SIZE - page_offset; - else clen = len; - - if (clen 0) { - char *ba = __bio_kmap_atomic(bio, i, KM_USER0); - if (frombio) - memcpy(pa+page_offset, ba+b_offset, clen); - else - memcpy(ba+b_offset, pa+page_offset, clen); - __bio_kunmap_atomic(ba, KM_USER0); - } - if (clen len) /* hit end of page */ - break; - page_offset += len; - } -} - -#define check_xor()do { \ - if (count == MAX_XOR_BLOCKS) {\ - xor_blocks(count, STRIPE_SIZE, dest, ptr);\ - count = 0;\ - } \ - } while(0) - -static void compute_parity6(struct stripe_head *sh, int method) -{ - raid6_conf_t *conf = sh-raid_conf; - int i, 
pd_idx = sh-pd_idx, qd_idx, d0_idx, disks = sh-disks, count; - struct bio *chosen; - / FIX THIS: This could be very bad if disks is close to 256 / - void *ptrs[disks]; - - qd_idx = raid6_next_disk(pd_idx, disks); - d0_idx = raid6_next_disk(qd_idx, disks); - - pr_debug(compute_parity, stripe %llu, method %d\n, - (unsigned long long)sh-sector, method); - - switch(method) { - case READ_MODIFY_WRITE: - BUG(); /* READ_MODIFY_WRITE N/A for RAID-6 */ - case RECONSTRUCT_WRITE: - for (i= disks; i-- ;) - if ( i != pd_idx i != qd_idx sh-dev[i].towrite ) { - chosen = sh-dev[i].towrite; - sh-dev[i].towrite = NULL; - - if (test_and_clear_bit(R5_Overlap, sh-dev[i].flags)) - wake_up(conf-wait_for_overlap); - - BUG_ON(sh-dev[i].written); - sh-dev[i].written = chosen; - } - break; - case CHECK_PARITY: - BUG(); /* Not implemented yet */ - } - - for (i = disks; i--;) - if (sh-dev[i].written) { - sector_t sector = sh-dev[i].sector; - struct bio *wbi = sh-dev[i].written; - while (wbi wbi-bi_sector sector + STRIPE_SECTORS) { - copy_data(1, wbi, sh-dev[i].page, sector); - wbi = r5_next_bio(wbi, sector); - } - - set_bit(R5_LOCKED, sh-dev[i].flags); - set_bit(R5_UPTODATE, sh-dev[i].flags); - } - -// switch(method) { -// case RECONSTRUCT_WRITE: -// case CHECK_PARITY: -// case UPDATE_PARITY: - /* Note that unlike RAID-5, the ordering of the disks matters greatly. */ - /* FIX: Is this ordering of drives even remotely optimal
[PATCH 09/11] md: change handle_stripe6 to work asynchronously
handle_stripe6 function is changed to do things asynchronously. Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED] Signed-off-by: Ilya Yanok [EMAIL PROTECTED] --- drivers/md/raid5.c | 130 1 files changed, 90 insertions(+), 40 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 91e5438..47b7de3 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -3117,9 +3117,10 @@ static bool handle_stripe6(struct stripe_head *sh, struct page *tmp_page) r6s.qd_idx = raid6_next_disk(pd_idx, disks); pr_debug(handling stripe %llu, state=%#lx cnt=%d, - pd_idx=%d, qd_idx=%d\n, + pd_idx=%d, qd_idx=%d\n, check:%d, reconstruct:%d\n, (unsigned long long)sh-sector, sh-state, - atomic_read(sh-count), pd_idx, r6s.qd_idx); + atomic_read(sh-count), pd_idx, r6s.qd_idx, + sh-check_state, sh-reconstruct_state); memset(s, 0, sizeof(s)); spin_lock(sh-lock); @@ -3139,35 +3140,24 @@ static bool handle_stripe6(struct stripe_head *sh, struct page *tmp_page) pr_debug(check %d: state 0x%lx read %p write %p written %p\n, i, dev-flags, dev-toread, dev-towrite, dev-written); - /* maybe we can reply to a read */ - if (test_bit(R5_UPTODATE, dev-flags) dev-toread) { - struct bio *rbi, *rbi2; - pr_debug(Return read for disc %d\n, i); - spin_lock_irq(conf-device_lock); - rbi = dev-toread; - dev-toread = NULL; - if (test_and_clear_bit(R5_Overlap, dev-flags)) - wake_up(conf-wait_for_overlap); - spin_unlock_irq(conf-device_lock); - while (rbi rbi-bi_sector dev-sector + STRIPE_SECTORS) { - copy_data(0, rbi, dev-page, dev-sector); - rbi2 = r5_next_bio(rbi, dev-sector); - spin_lock_irq(conf-device_lock); - if (!raid5_dec_bi_phys_segments(rbi)) { - rbi-bi_next = return_bi; - return_bi = rbi; - } - spin_unlock_irq(conf-device_lock); - rbi = rbi2; - } - } + /* maybe we can reply to a read +* +* new wantfill requests are only permitted while +* ops_complete_biofill is guaranteed to be inactive +*/ + if (test_bit(R5_UPTODATE, dev-flags) dev-toread + !test_bit(STRIPE_BIOFILL_RUN, sh-state)) + 
set_bit(R5_Wantfill, dev-flags); /* now count some things */ if (test_bit(R5_LOCKED, dev-flags)) s.locked++; if (test_bit(R5_UPTODATE, dev-flags)) s.uptodate++; + if (test_bit(R5_Wantcompute, dev-flags)) + BUG_ON(++s.compute 2); - - if (dev-toread) + if (test_bit(R5_Wantfill, dev-flags)) { + s.to_fill++; + } else if (dev-toread) s.to_read++; if (dev-towrite) { s.to_write++; @@ -3208,6 +3198,11 @@ static bool handle_stripe6(struct stripe_head *sh, struct page *tmp_page) blocked_rdev = NULL; } + if (s.to_fill !test_bit(STRIPE_BIOFILL_RUN, sh-state)) { + set_bit(STRIPE_OP_BIOFILL, s.ops_request); + set_bit(STRIPE_BIOFILL_RUN, sh-state); + } + pr_debug(locked=%d uptodate=%d to_read=%d to_write=%d failed=%d failed_num=%d,%d\n, s.locked, s.uptodate, s.to_read, s.to_write, s.failed, @@ -3248,18 +3243,62 @@ static bool handle_stripe6(struct stripe_head *sh, struct page *tmp_page) * or to load a block that is being partially written. */ if (s.to_read || s.non_overwrite || (s.to_write s.failed) || - (s.syncing (s.uptodate disks)) || s.expanding) + (s.syncing (s.uptodate + s.compute disks)) || s.expanding) handle_stripe_fill6(sh, s, r6s, disks); - /* now to consider writing and what else, if anything should be read */ - if (s.to_write) + /* Now we check to see if any write operations have recently +* completed +*/ + if (sh-reconstruct_state == reconstruct_state_drain_result) { + int qd_idx = raid6_next_disk(sh-pd_idx, +conf-raid_disks); + + sh-reconstruct_state = reconstruct_state_idle; + /* All the 'written' buffers and the parity blocks are ready
Re[2]: [PATCH 01/11] async_tx: don't use src_list argument of async_xor() for dma addresses
On Tuesday, December 9, 2008 you wrote: On Mon, Dec 8, 2008 at 2:55 PM, Yuri Tikhonov [EMAIL PROTECTED] wrote: Using the src_list argument of async_xor() as storage for dma addresses implies a sizeof(dma_addr_t) <= sizeof(struct page *) restriction which is not always true (e.g. ppc440spe). ppc440spe runs with CONFIG_PHYS_64BIT? Yep. It uses 36-bit addressing, so this CONFIG is turned on. If we do this then we need to also change md to limit the number of allowed disks based on the kernel stack size. Because with 256 disks a 4K stack can be consumed by one call to async_pq ((256 sources in raid5.c + 256 sources in async_pq.c) * 8 bytes per source on 64-bit). On ppc440spe we have an 8KB stack, so things are no worse than on 32-bit archs with a 4KB stack. Thus, I guess no changes to md are required because of this patch. Right? Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
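The 4K figure quoted above is easy to verify: two on-stack arrays of 256 dma_addr_t entries (one in raid5.c, one in async_pq.c) at 8 bytes per entry. A trivial sketch of the arithmetic, with an illustrative helper name:

```c
#include <stddef.h>

/*
 * Worst-case stack cost of the on-stack dma address arrays for one
 * async_pq call chain: src_cnt entries in raid5.c plus src_cnt more
 * in async_pq.c. addr_size is sizeof(dma_addr_t): 8 with
 * CONFIG_PHYS_64BIT (or on 64-bit archs), 4 otherwise.
 */
static size_t pq_stack_bytes(unsigned int src_cnt, size_t addr_size)
{
	return 2 * (size_t)src_cnt * addr_size;
}
```

That is 4096 bytes for 256 disks with 8-byte dma_addr_t — an entire 4KB stack, but only half of the 8KB ppc440spe stack, which is the point being made here.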
Re[2]: [PATCH 11/11] ppc440spe-adma: ADMA driver for PPC440SP(e) systems
Hello Josh, If you are still intending to review our ppc440spe ADMA driver (thanks in advance if so), then please use the driver from my latest post as the reference: http://ozlabs.org/pipermail/linuxppc-dev/2008-December/065983.html since this has some updates relating to the November version. On Thursday, November 13, 2008 you wrote: On Thu, 13 Nov 2008 20:50:43 +0300 Ilya Yanok [EMAIL PROTECTED] wrote: Josh Boyer wrote: On Thu, Nov 13, 2008 at 06:16:04PM +0300, Ilya Yanok wrote: Adds the platform device definitions and the architecture specific support routines for the ppc440spe adma driver. Any board equipped with PPC440SP(e) controller may utilize this driver. Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED] Signed-off-by: Ilya Yanok [EMAIL PROTECTED] Before I really dig into reviewing this driver, I'm going to ask you as simple question. This looks like a 1/2 completed port of an arch/ppc driver that uses the device tree (incorrectly) to get the interrupt resources and that's about it. Otherwise, it's just a straight up platform device driver. Is that correct? Yep, that's correct. OK. If that is the case, I think the driver needs more work before it can be merged. It should get the DCR and MMIO resources from the device tree as well. It should be binding on compatible properties and not based on device tree paths. And it should probably be an of_platform device driver. Surely, you're right. I agree with you in that this driver isn't ready for merging. But it works so we'd like to publish it so interested people could use it and test it. And that's fine. I just wanted to see where you were headed with this one for now. I'll try to do a review in the next few days. Thanks for posting. 
josh Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com
[PATCH][v2] xsysace: use resource_size_t instead of unsigned long
Use resource_size_t for physical address of SystemACE chip. This fixes the driver brokeness for 32 bit systems with 64 bit resources (e.g. PPC440SPe). Also this patch adds one more compatible string for more clean description of the hardware, and fixes a sector_t- related compilation warning. Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED] Signed-off-by: Ilya Yanok [EMAIL PROTECTED] --- drivers/block/xsysace.c | 24 +--- 1 files changed, 13 insertions(+), 11 deletions(-) diff --git a/drivers/block/xsysace.c b/drivers/block/xsysace.c index ecab9e6..9efd3d7 100644 --- a/drivers/block/xsysace.c +++ b/drivers/block/xsysace.c @@ -194,7 +194,7 @@ struct ace_device { int in_irq; /* Details of hardware device */ - unsigned long physaddr; + resource_size_t physaddr; void __iomem *baseaddr; int irq; int bus_width; /* 0 := 8 bit; 1 := 16 bit */ @@ -628,8 +628,8 @@ static void ace_fsm_dostate(struct ace_device *ace) /* Okay, it's a data request, set it up for transfer */ dev_dbg(ace-dev, - request: sec=%lx hcnt=%lx, ccnt=%x, dir=%i\n, - req-sector, req-hard_nr_sectors, + request: sec=%llx hcnt=%lx, ccnt=%x, dir=%i\n, + (unsigned long long)req-sector, req-hard_nr_sectors, req-current_nr_sectors, rq_data_dir(req)); ace-req = req; @@ -935,7 +935,8 @@ static int __devinit ace_setup(struct ace_device *ace) int rc; dev_dbg(ace-dev, ace_setup(ace=0x%p)\n, ace); - dev_dbg(ace-dev, physaddr=0x%lx irq=%i\n, ace-physaddr, ace-irq); + dev_dbg(ace-dev, physaddr=0x%llx irq=%i\n, + (unsigned long long)ace-physaddr, ace-irq); spin_lock_init(ace-lock); init_completion(ace-id_completion); @@ -1017,8 +1018,8 @@ static int __devinit ace_setup(struct ace_device *ace) /* Print the identification */ dev_info(ace-dev, Xilinx SystemACE revision %i.%i.%i\n, (version 12) 0xf, (version 8) 0x0f, version 0xff); - dev_dbg(ace-dev, physaddr 0x%lx, mapped to 0x%p, irq=%i\n, - ace-physaddr, ace-baseaddr, ace-irq); + dev_dbg(ace-dev, physaddr 0x%llx, mapped to 0x%p, irq=%i\n, + (unsigned long long)ace-physaddr, 
ace-baseaddr, ace-irq); ace-media_change = 1; ace_revalidate_disk(ace-gd); @@ -1035,8 +1036,8 @@ err_alloc_disk: err_blk_initq: iounmap(ace-baseaddr); err_ioremap: - dev_info(ace-dev, xsysace: error initializing device at 0x%lx\n, - ace-physaddr); + dev_info(ace-dev, xsysace: error initializing device at 0x%llx\n, +(unsigned long long)ace-physaddr); return -ENOMEM; } @@ -1059,7 +1060,7 @@ static void __devexit ace_teardown(struct ace_device *ace) } static int __devinit -ace_alloc(struct device *dev, int id, unsigned long physaddr, +ace_alloc(struct device *dev, int id, resource_size_t physaddr, int irq, int bus_width) { struct ace_device *ace; @@ -1119,7 +1120,7 @@ static void __devexit ace_free(struct device *dev) static int __devinit ace_probe(struct platform_device *dev) { - unsigned long physaddr = 0; + resource_size_t physaddr = 0; int bus_width = ACE_BUS_WIDTH_16; /* FIXME: should not be hard coded */ int id = dev-id; int irq = NO_IRQ; @@ -1165,7 +1166,7 @@ static int __devinit ace_of_probe(struct of_device *op, const struct of_device_id *match) { struct resource res; - unsigned long physaddr; + resource_size_t physaddr; const u32 *id; int irq, bus_width, rc; @@ -1205,6 +1206,7 @@ static struct of_device_id ace_of_match[] __devinitdata = { { .compatible = xlnx,opb-sysace-1.00.b, }, { .compatible = xlnx,opb-sysace-1.00.c, }, { .compatible = xlnx,xps-sysace-1.00.a, }, + { .compatible = xlnx,sysace, }, {}, }; MODULE_DEVICE_TABLE(of, ace_of_match); -- 1.5.6.1 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH][v2] xsysace: use resource_size_t instead of unsigned long
I'm sorry, but the patch I've just posted turned out to be corrupted. The correct one is below.

---

Use resource_size_t for the physical address of the SystemACE chip. This fixes the driver brokenness for 32-bit systems with 64-bit resources (e.g. PPC440SPe). Also this patch adds one more compatible string for a cleaner description of the hardware, and fixes a sector_t-related compilation warning.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Ilya Yanok [EMAIL PROTECTED]
---
 drivers/block/xsysace.c |   24 +++++++++++++-----------
 1 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/drivers/block/xsysace.c b/drivers/block/xsysace.c
index ecab9e6..9efd3d7 100644
--- a/drivers/block/xsysace.c
+++ b/drivers/block/xsysace.c
@@ -194,7 +194,7 @@ struct ace_device {
 	int in_irq;
 
 	/* Details of hardware device */
-	unsigned long physaddr;
+	resource_size_t physaddr;
 	void __iomem *baseaddr;
 	int irq;
 	int bus_width;	/* 0 := 8 bit; 1 := 16 bit */
@@ -628,8 +628,8 @@ static void ace_fsm_dostate(struct ace_device *ace)
 
 	/* Okay, it's a data request, set it up for transfer */
 	dev_dbg(ace->dev,
-		"request: sec=%lx hcnt=%lx, ccnt=%x, dir=%i\n",
-		req->sector, req->hard_nr_sectors,
+		"request: sec=%llx hcnt=%lx, ccnt=%x, dir=%i\n",
+		(unsigned long long)req->sector, req->hard_nr_sectors,
 		req->current_nr_sectors, rq_data_dir(req));
 
 	ace->req = req;
@@ -935,7 +935,8 @@ static int __devinit ace_setup(struct ace_device *ace)
 	int rc;
 
 	dev_dbg(ace->dev, "ace_setup(ace=0x%p)\n", ace);
-	dev_dbg(ace->dev, "physaddr=0x%lx irq=%i\n", ace->physaddr, ace->irq);
+	dev_dbg(ace->dev, "physaddr=0x%llx irq=%i\n",
+		(unsigned long long)ace->physaddr, ace->irq);
 
 	spin_lock_init(&ace->lock);
 	init_completion(&ace->id_completion);
@@ -1017,8 +1018,8 @@ static int __devinit ace_setup(struct ace_device *ace)
 	/* Print the identification */
 	dev_info(ace->dev, "Xilinx SystemACE revision %i.%i.%i\n",
 		 (version >> 12) & 0xf, (version >> 8) & 0x0f, version & 0xff);
-	dev_dbg(ace->dev, "physaddr 0x%lx, mapped to 0x%p, irq=%i\n",
-		ace->physaddr, ace->baseaddr, ace->irq);
+	dev_dbg(ace->dev, "physaddr 0x%llx, mapped to 0x%p, irq=%i\n",
+		(unsigned long long)ace->physaddr, ace->baseaddr, ace->irq);
 
 	ace->media_change = 1;
 	ace_revalidate_disk(ace->gd);
@@ -1035,8 +1036,8 @@ err_alloc_disk:
 err_blk_initq:
 	iounmap(ace->baseaddr);
 err_ioremap:
-	dev_info(ace->dev, "xsysace: error initializing device at 0x%lx\n",
-		 ace->physaddr);
+	dev_info(ace->dev, "xsysace: error initializing device at 0x%llx\n",
+		 (unsigned long long)ace->physaddr);
 	return -ENOMEM;
 }
@@ -1059,7 +1060,7 @@ static void __devexit ace_teardown(struct ace_device *ace)
 }
 
 static int __devinit
-ace_alloc(struct device *dev, int id, unsigned long physaddr,
+ace_alloc(struct device *dev, int id, resource_size_t physaddr,
 	  int irq, int bus_width)
 {
 	struct ace_device *ace;
@@ -1119,7 +1120,7 @@ static void __devexit ace_free(struct device *dev)
 
 static int __devinit ace_probe(struct platform_device *dev)
 {
-	unsigned long physaddr = 0;
+	resource_size_t physaddr = 0;
 	int bus_width = ACE_BUS_WIDTH_16; /* FIXME: should not be hard coded */
 	int id = dev->id;
 	int irq = NO_IRQ;
@@ -1165,7 +1166,7 @@ static int __devinit
 ace_of_probe(struct of_device *op, const struct of_device_id *match)
 {
 	struct resource res;
-	unsigned long physaddr;
+	resource_size_t physaddr;
 	const u32 *id;
 	int irq, bus_width, rc;
@@ -1205,6 +1206,7 @@ static struct of_device_id ace_of_match[] __devinitdata = {
 	{ .compatible = "xlnx,opb-sysace-1.00.b", },
 	{ .compatible = "xlnx,opb-sysace-1.00.c", },
 	{ .compatible = "xlnx,xps-sysace-1.00.a", },
+	{ .compatible = "xlnx,sysace", },
 	{},
 };
 MODULE_DEVICE_TABLE(of, ace_of_match);
-- 
1.5.6.1
Re: [PATCH][v2] xsysace: use resource_size_t instead of unsigned long
Hello Grant, On Thursday 27 November 2008 17:11, Grant Likely wrote: On Thu, Nov 27, 2008 at 5:21 AM, Yuri Tikhonov [EMAIL PROTECTED] wrote: Use resource_size_t for the physical address of the SystemACE chip. This fixes the driver brokenness for 32-bit systems with 64-bit resources (e.g. PPC440SPe). Hey Yuri, I actually already picked up the last version of your patch after fixing it up myself. It's currently sitting in Paul's powerpc tree and it will be merged into mainline when Linus gets back from vacation. Oops. Indeed. Thanks. Can you please spin a new version with just the addition of the compatible value and base it on Paul's tree. Sure. I've generated the patch against the origin/merge branch of Paul's tree, and am posting it separately as "[PATCH] xsysace: add compatible string". Regards, Yuri
[PATCH] xsysace: add compatible string
Add one more compatible string to the table for of_platform binding, so that platforms which have the SysACE chip on board (e.g. Katmai) can describe it correctly in their device trees.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
---
 drivers/block/xsysace.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/block/xsysace.c b/drivers/block/xsysace.c
index 29e1dfa..381d686 100644
--- a/drivers/block/xsysace.c
+++ b/drivers/block/xsysace.c
@@ -1206,6 +1206,7 @@ static struct of_device_id ace_of_match[] __devinitdata = {
 	{ .compatible = "xlnx,opb-sysace-1.00.b", },
 	{ .compatible = "xlnx,opb-sysace-1.00.c", },
 	{ .compatible = "xlnx,xps-sysace-1.00.a", },
+	{ .compatible = "xlnx,sysace", },
 	{},
 };
 MODULE_DEVICE_TABLE(of, ace_of_match);
-- 
1.5.6.1
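For illustration, a device tree node that would bind through the new entry might look like the sketch below. The reg and interrupt cells are board-specific assumptions modeled on the Katmai node discussed elsewhere in this thread, not values mandated by the binding:

```dts
sysace@4,fe000000 {
	compatible = "xlnx,sysace";
	interrupt-parent = <&UIC2>;
	interrupts = <0x19 4>;
	/* chip-select 4 on the external bus, offset 0xfe000000, 256 bytes */
	reg = <0x00000004 0xfe000000 0x100>;
};
```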
Re[4]: [PATCH] katmai.dts: extend DMA ranges; add dma/sysace nodes
0x; /* This drives busses 10 to 0x1f */ bus-range = 0x30 0x3f; diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h index 77c7fa0..adbeb19 100644 --- a/arch/powerpc/include/asm/io.h +++ b/arch/powerpc/include/asm/io.h @@ -59,7 +59,7 @@ extern int check_legacy_ioport(unsigned long base_port); extern unsigned long isa_io_base; extern unsigned long pci_io_base; -extern unsigned long pci_dram_offset; +extern resource_size_t pci_dram_offset; extern resource_size_t isa_mem_base; @@ -728,14 +728,14 @@ static inline void * phys_to_virt(unsigned long address) */ #ifdef CONFIG_PPC32 -static inline unsigned long virt_to_bus(volatile void * address) +static inline resource_size_t virt_to_bus(volatile void * address) { if (address == NULL) return 0; return __pa(address) + PCI_DRAM_OFFSET; } -static inline void * bus_to_virt(unsigned long address) +static inline void * bus_to_virt(resource_size_t address) { if (address == 0) return NULL; diff --git a/arch/powerpc/kernel/pci_32.c b/arch/powerpc/kernel/pci_32.c index 88db4ff..5855937 100644 --- a/arch/powerpc/kernel/pci_32.c +++ b/arch/powerpc/kernel/pci_32.c @@ -33,7 +33,7 @@ #endif unsigned long isa_io_base = 0; -unsigned long pci_dram_offset = 0; +resource_size_t pci_dram_offset = 0; int pcibios_assign_bus_offset = 1; void pcibios_make_OF_bus_map(void); diff --git a/arch/powerpc/sysdev/ppc4xx_pci.c b/arch/powerpc/sysdev/ppc4xx_pci.c index afbdd48..f748c5b 100644 --- a/arch/powerpc/sysdev/ppc4xx_pci.c +++ b/arch/powerpc/sysdev/ppc4xx_pci.c @@ -126,10 +126,8 @@ static int __init ppc4xx_parse_dma_ranges(struct pci_controller *hose, if ((pci_space 0x0300) != 0x0200) continue; - /* We currently only support memory at 0, and pci_addr -* within 32 bits space -*/ - if (cpu_addr != 0 || pci_addr 0x) { + /* We currently only support memory at 0 */ + if (cpu_addr != 0) { printk(KERN_WARNING %s: Ignored unsupported dma range 0x%016llx...0x%016llx - 0x%016llx\n, hose-dn-full_name, @@ -179,18 +177,12 @@ static int 
__init ppc4xx_parse_dma_ranges(struct pci_controller *hose, return -ENXIO; } - /* Check that we are fully contained within 32 bits space */ - if (res-end 0x) { - printk(KERN_ERR %s: dma-ranges outside of 32 bits space\n, - hose-dn-full_name); - return -ENXIO; - } out: dma_offset_set = 1; pci_dram_offset = res-start; - printk(KERN_INFO 4xx PCI DMA offset set to 0x%08lx\n, - pci_dram_offset); + printk(KERN_INFO 4xx PCI DMA offset set to 0x%016llx\n, + (unsigned long long)pci_dram_offset); return 0; } Any ideas ? If you really need 32-bit DMA support, you'll have to wait for swiotlb from Becky or work with her in bringing it to powerpc so that we can do bounce buffering for those devices. Ben. Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re[4]: [2/2] powerpc: support for 256K pages on PPC 44x
Hello Milton, On Friday, November 14, 2008 you wrote: On Nov 13, 2008, at 10:32 PM, Yuri Tikhonov wrote: On Tuesday, November 11, 2008 Milton Miller wrote: #ifdef CONFIG_PTE_64BIT typedef unsigned long long pte_basic_t; +#ifdef CONFIG_PPC_256K_PAGES +#define PTE_SHIFT (PAGE_SHIFT - 7) This seems to be missing the comment on how many ptes are actually in the page that is in the other if and else cases. Ok. I'll fix this. Actually it's another hack: we don't use the full page for the PTE table because we need to reserve something for the PGD. I don't understand "we need to reserve something for the PGD". Do you mean that you would not require a second page for the PGD because the full pagetable could fit in one page? ... That does imply you want to allocate the pte page from a slab instead of pgalloc. Is that covered? Well, in the case of 256K PAGE_SIZE we indeed do not need the PGD level (18 bits are used for the offset, and the remaining 14 bits are for the PTE index inside the PTE table). Even the full 256K PTE page isn't necessary to cover the full range: only half of it would be enough (with 14 bits we can address only 16K PTEs). But the head_44x.S code is essentially based on the assumption of 2-level page addressing. Also, I may guess that eliminating the PGD level won't be as easy as just re-implementing the TLB-miss handlers in head_44x.S. So, the current approach for 256K-pages support was a compromise between the functionality required for the project and the effort necessary to achieve it. So are you allocating the PAGE_SIZE levels from slabs (either kmalloc or dedicated) instead of allocating pages? Or are you wasting the extra space? Wasting the extra space is what takes place here. At a very minimum you need to comment this in the code. If I were maintainer I would say not wasting large fractions of pages when the page size is 256k would be my merge requirement. As I said, I'm fine with keeping the page table two levels, but the tradeoff needs to be documented.
Agree, we'll document this fact, and re-submit the patch. Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re[6]: [PATCH] katmai.dts: extend DMA ranges; add dma/sysace nodes
On Thursday, November 27, 2008 you wrote: I've implemented (2) (the code is below), and it works. But, admittedly, this (working) looks strange to me because of the following: To be able to use 64-bit PCI mapping on PPC32 I had to replace the 'unsigned long' type of pci_dram_offset with 'resource_size_t', which on ppc440spe is 'u64'. So, in dma_alloc_coherent() I put the 64-bit value into the 'dma_addr_t' handle. I use the 2.6.27 kernel for testing, which has sizeof(dma_addr_t) == sizeof(u32). Thus, dma_alloc_coherent() cuts off the upper 32 bits of the PCI address, and returns only the low 32-bit part of the PCI address to its caller. And, regardless of this fact, the PCI device does operate somehow (this is the PCI-E LSI disk controller served by the drivers/message/fusion/mptbase.c + mptsas.c drivers). I've verified that the ppc440spe PCI-E bridge's BARs (PECFGn_BAR0L,H) are configured with the new, 1TB, address value: Strange... when I look at ppc4xx_parse_dma_ranges() I see it specifically avoiding PCI addresses above 4G ... That needs fixing. Right, it avoids them.
I guess you haven't read my e-mail to its end, because my work-around patch, which I referenced there, fixes this :) diff --git a/arch/powerpc/sysdev/ppc4xx_pci.c b/arch/powerpc/sysdev/ppc4xx_pci.c index afbdd48..f748c5b 100644 --- a/arch/powerpc/sysdev/ppc4xx_pci.c +++ b/arch/powerpc/sysdev/ppc4xx_pci.c @@ -126,10 +126,8 @@ static int __init ppc4xx_parse_dma_ranges(struct pci_controller *hose, if ((pci_space 0x0300) != 0x0200) continue; - /* We currently only support memory at 0, and pci_addr -* within 32 bits space -*/ - if (cpu_addr != 0 || pci_addr 0x) { + /* We currently only support memory at 0 */ + if (cpu_addr != 0) { printk(KERN_WARNING %s: Ignored unsupported dma range 0x%016llx...0x%016llx - 0x%016llx\n, hose-dn-full_name, @@ -179,18 +177,12 @@ static int __init ppc4xx_parse_dma_ranges(struct pci_controller *hose, return -ENXIO; } - /* Check that we are fully contained within 32 bits space */ - if (res-end 0x) { - printk(KERN_ERR %s: dma-ranges outside of 32 bits space\n, - hose-dn-full_name); - return -ENXIO; - } out: dma_offset_set = 1; pci_dram_offset = res-start; - printk(KERN_INFO 4xx PCI DMA offset set to 0x%08lx\n, - pci_dram_offset); + printk(KERN_INFO 4xx PCI DMA offset set to 0x%016llx\n, + (unsigned long long)pci_dram_offset); return 0; } To implement that trick you definitely need to make dma_addr_t 64 bits. Sure. The problem here is that the LSI (the PCI device I want to DMA to/from 1TB PCI addresses) driver doesn't work with this (i.e. it's broken in, e.g., 2.6.28-rc6) on ppc440spe-based platform. It looks like there is no support for 32-bit CPUs with 64-bit physical addresses in the LSI driver. E.g. 
the following mix in the drivers/message/fusion/mptbase.h code points to the fact that the driver assumes a 64-bit dma_addr_t on 64-bit CPUs only:

#ifdef CONFIG_64BIT
#define CAST_U32_TO_PTR(x)	((void *)(u64)x)
#define CAST_PTR_TO_U32(x)	((u32)(u64)x)
#else
#define CAST_U32_TO_PTR(x)	((void *)x)
#define CAST_PTR_TO_U32(x)	((u32)x)
#endif

#define mpt_addr_size() \
	((sizeof(dma_addr_t) == sizeof(u64)) ? MPI_SGE_FLAGS_64_BIT_ADDRESSING : \
		MPI_SGE_FLAGS_32_BIT_ADDRESSING)

Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com
Re[2]: [PATCH 02/11] async_tx: add support for asynchronous GF multiplication
Hello Dan, On Saturday, November 15, 2008 you wrote: A few comments Thanks. 1/ I don't see code for handling cases where the src_cnt exceeds the hardware maximum. Right, actually the ADMA devices we used (ppc440spe DMA engines) have no limitation on src_cnt (well, actually there is a limit, the size of the descriptor FIFO, but it's larger than the number of drives which may be handled by the current RAID-6 driver, i.e. 256), but I agree: the ASYNC_TX functions should not assume that any ADMA device will have such a feature. So we'll implement this, and then re-post the patches. 2/ dmaengine.h defines DMA_PQ_XOR but these patches should really change that to DMA_PQ and do s/pqxor/pq/ across the rest of the code base. OK. 3/ In my implementation (unfinished) of async_pq I decided to make the prototype: May I ask whether you have plans to finish and release your implementation?

+/**
+ * async_pq - attempt to generate p (xor) and q (Reed-Solomon code) with a
+ * dma engine for a given set of blocks.  This routine assumes a field of
+ * GF(2^8) with a primitive polynomial of 0x11d and a generator of {02}.
+ * In the synchronous case the p and q blocks are used as temporary
+ * storage whereas dma engines have their own internal buffers.  The
+ * ASYNC_TX_PQ_ZERO_P and ASYNC_TX_PQ_ZERO_Q flags clear the
+ * destination(s) before they are used.
+ * @blocks: source block array ordered from 0..src_cnt with the p destination
+ * at blocks[src_cnt] and q at blocks[src_cnt + 1]
+ * NOTE: client code must assume the contents of this array are destroyed
+ * @offset: offset in pages to start transaction
+ * @src_cnt: number of source pages: 2 <= src_cnt <= 255
+ * @len: length in bytes
+ * @flags: ASYNC_TX_ACK, ASYNC_TX_DEP_ACK
+ * @depend_tx: p+q operation depends on the result of this transaction.
+ * @cb_fn: function to call when p+q generation completes
+ * @cb_param: parameter to pass to the callback routine
+ */
+struct dma_async_tx_descriptor *
+async_pq(struct page **blocks, unsigned int offset, int src_cnt, size_t len,
+	enum async_tx_flags flags, struct dma_async_tx_descriptor *depend_tx,
+	dma_async_tx_callback cb_fn, void *cb_param)

Where p and q are not specified separately. This matches more closely how the current gen_syndrome is specified, with the goal of not requiring any changes to the existing software raid6 interface. Thoughts? Understood. Our goal was to stay closer to the ASYNC_TX interfaces, so we specified the destinations separately. Still, I'm fine with your prototype, since doubling the same address is no good, so we'll change this. Any comments regarding the drivers/md/raid5.c part? Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com
[PATCH] katmai.dts: extend DMA ranges; add dma/sysace nodes
Hello, This patch extends DMA ranges for PCI(X) to 4GB, so that it could work on Katmais with 4GB RAM installed. Add new nodes for the PPC440SPe DMA, XOR engines to be used in the PPC440SPe ADMA driver, and the SysACE controller, which connects Compact Flash to Katmai. Signed-off-by: Ilya Yanok [EMAIL PROTECTED] Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED] --- arch/powerpc/boot/dts/katmai.dts | 46 +++-- 1 files changed, 38 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/boot/dts/katmai.dts b/arch/powerpc/boot/dts/katmai.dts index 077819b..7749478 100644 --- a/arch/powerpc/boot/dts/katmai.dts +++ b/arch/powerpc/boot/dts/katmai.dts @@ -245,8 +245,8 @@ ranges = 0x0200 0x 0x8000 0x000d 0x8000 0x 0x8000 0x0100 0x 0x 0x000c 0x0800 0x 0x0001; - /* Inbound 2GB range starting at 0 */ - dma-ranges = 0x4200 0x0 0x0 0x0 0x0 0x0 0x8000; + /* Inbound 4GB range starting at 0 */ + dma-ranges = 0x4200 0x0 0x0 0x0 0x0 0x1 0x; /* This drives busses 0 to 0xf */ bus-range = 0x0 0xf; @@ -289,8 +289,8 @@ ranges = 0x0200 0x 0x8000 0x000e 0x 0x 0x8000 0x0100 0x 0x 0x000f 0x8000 0x 0x0001; - /* Inbound 2GB range starting at 0 */ - dma-ranges = 0x4200 0x0 0x0 0x0 0x0 0x0 0x8000; + /* Inbound 4GB range starting at 0 */ + dma-ranges = 0x4200 0x0 0x0 0x0 0x0 0x1 0x; /* This drives busses 10 to 0x1f */ bus-range = 0x10 0x1f; @@ -330,8 +330,8 @@ ranges = 0x0200 0x 0x8000 0x000e 0x8000 0x 0x8000 0x0100 0x 0x 0x000f 0x8001 0x 0x0001; - /* Inbound 2GB range starting at 0 */ - dma-ranges = 0x4200 0x0 0x0 0x0 0x0 0x0 0x8000; + /* Inbound 4GB range starting at 0 */ + dma-ranges = 0x4200 0x0 0x0 0x0 0x0 0x1 0x; /* This drives busses 10 to 0x1f */ bus-range = 0x20 0x2f; @@ -371,8 +371,8 @@ ranges = 0x0200 0x 0x8000 0x000f 0x 0x 0x8000 0x0100 0x 0x 0x000f 0x8002 0x 0x0001; - /* Inbound 2GB range starting at 0 */ - dma-ranges = 0x4200 0x0 0x0 0x0 0x0 0x0 0x8000; + /* Inbound 4GB range starting at 0 */ + dma-ranges = 0x4200 0x0 0x0 0x0 0x0 0x1 0x; /* This drives busses 10 to 0x1f */ bus-range = 
0x30 0x3f; @@ -392,6 +392,36 @@ 0x0 0x0 0x0 0x3 UIC3 0xa 0x4 /* swizzled int C */ 0x0 0x0 0x0 0x4 UIC3 0xb 0x4 /* swizzled int D */; }; + DMA0: dma0 { + interrupt-parent = DMA0; + interrupts = 0 1; + #interrupt-cells = 1; + #address-cells = 0; + #size-cells = 0; + interrupt-map = + 0 UIC0 0x14 4 + 1 UIC1 0x16 4; + }; + DMA1: dma1 { + interrupt-parent = DMA1; + interrupts = 0 1; + #interrupt-cells = 1; + #address-cells = 0; + #size-cells = 0; + interrupt-map = + 0 UIC0 0x16 4 + 1 UIC1 0x16 4; + }; + xor { + interrupt-parent = UIC1; + interrupts = 0x1f 4; + }; + [EMAIL PROTECTED] { + compatible = xlnx,opb-sysace-1.00.b; + interrupt-parent = UIC2; + interrupts = 0x19 4; + reg = 0x0004 0xfe00 0x100; + }; }; chosen { -- 1.5.6.1 -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re[2]: [2/2] powerpc: support for 256K pages on PPC 44x
Hello Milton, On Tuesday, November 11, 2008 Milton Miller wrote: [snip] #ifdef CONFIG_PTE_64BIT typedef unsigned long long pte_basic_t; +#ifdef CONFIG_PPC_256K_PAGES +#define PTE_SHIFT (PAGE_SHIFT - 7) This seems to be missing the comment on how many ptes are actually in the page that are in the other if and else cases. Ok. I'll fix this. Actually it's another hack: we don't use full page for PTE table because we need to reserve something for PGD I don't understand we need to reserve something for PGD. Do you mean that you would not require a second page for the PGD because the full pagetable could fit in one page? My first reaction was to say then create pgtable-nopgd.h like the other two. The page walkers support this with the advent of gigantic pages. Then I realized that might not be optimal: while the page table might fit in one page, it would mean you always allocate the pte space to cover the full address space. Even if your processes spread out over the 3G of address space allocated to them (32 bit kernel), you will allocate space for 4G, wasting 1/4 of the pte space. That does imply you want to allocate the pte page from a slab instead of pgalloc. Is that covered? Well, in case of 256K PAGE_SIZE we do not need the PGD level indeed (18 bits are used for offset, and remaining 14 bits are for PTE index inside the PTE table). Even the full 256K PTE page isn't necessary to cover the full range: only half of it would be enough (with 14 bits we can address only 16K PTEs). But the head_44x.S code is essentially based on the assumption of 2-level page addressing. Also, I may guess that eliminating of the PGD level won't be as easy as just a re-implementation of the TLB-miss handlers in head_44x.S. So, the current approach for 256K-pages support was just a compromise between the required for the project functionality, and the effort necessary to achieve it. 
Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re[2]: [PATCH] xsysace: use resource_size_t instead of unsigned long
Hello Stephen, On Thursday, November 13, 2008 you wrote: Hi Yuri, On Thu, 13 Nov 2008 11:43:17 +0300 Yuri Tikhonov [EMAIL PROTECTED] wrote: - dev_dbg(ace-dev, physaddr=0x%lx irq=%i\n, ace-physaddr, ace-irq); + dev_dbg(ace-dev, physaddr=0x%llx irq=%i\n, (u64)ace-physaddr, ace-irq); You should cast the physaddr to unsigned long long as u64 is unsigned long on some architectures. The same is needed in other places as well. Thanks for your comment. We'll fix this. Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re[2]: [PATCH] katmai.dts: extend DMA ranges; add dma/sysace nodes
Hello Josh, On Thursday, November 13, 2008 you wrote: [snip] You have no compatible property in these 3 nodes. How are drivers supposed to bind to them? You also have no reg or dcr-reg properties. What exactly are these nodes for? Probably we (Ilya and I) overdid it with posting katmai.dts-related changes to the ML, duplicating the same patches in different posts. Sorry for the confusion. These nodes are necessary for the ppc440spe ADMA driver: http://www.nabble.com/-PATCH-11-11--ppc440spe-adma:-ADMA-driver-for-PPC440SP(e)-td20488049.html + [EMAIL PROTECTED] { + compatible = xlnx,opb-sysace-1.00.b; Odd. This isn't a xilinx board by any means. This should probably look something like: compatible = amcc,sysace-440spe, xlnx,opb-sysace-1.00.b; Though I'm curious about it in general. The xilinx bindings have the versioning numbers on them to match particular bit-streams in the FPGAs, if I remember correctly. Does that really apply here? No, we just selected the description which looked most appropriate for our case: the SysACE is connected to the External Bus Controller of the 440SPe, which in turn is attached as a slave to the OPB (on-chip peripheral bus). Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com
Re[2]: [PATCH] katmai.dts: extend DMA ranges; add dma/sysace nodes
Hello Grant, On Friday, November 14, 2008 you wrote: On Thu, Nov 13, 2008 at 06:45:33AM -0500, Josh Boyer wrote: On Thu, 13 Nov 2008 11:49:14 +0300 Yuri Tikhonov [EMAIL PROTECTED] wrote: + [EMAIL PROTECTED] { + compatible = xlnx,opb-sysace-1.00.b; Odd. This isn't a xilinx board by any means. This should probably look something like: compatible = amcc,sysace-440spe, xlnx,opb-sysace-1.00.b; Actually, if there is a sysace, it is definitely a xilinx part. It won't be on the SoC. However, compatible = xlnx,opb-sysace-1.00.b isn't really accurate. It should really be compatible = xlnx,sysace and the driver modified to accept this string. OK, we'll do this, and then re-post together with the cleaned-up [PATCH] xsysace: use resource_size_t instead of unsigned long patch. xlnx,opb-sysace-1.00.b is an FPGA block used to interface to the system ace which is definitely not in use here. So while it does get things to work, it is not a clean description of the hardware. g. Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re[2]: [PATCH] powerpc: add support for PAGE_SIZEs greater than 4KB for
Hello, On Thursday, September 11, 2008 you wrote: I was planning to post a similar patch. Good that you already posted it :-) I will try to finish off similar patch for 40x processors. +choice + prompt Page size + depends on 44x PPC32 + default PPC32_4K_PAGES + help + The PAGE_SIZE definition. Increasing the page size may + improve the system performance in some dedicated cases. + If unsure, set it to 4 KB. + You should mention an example of dedicated cases (eg. RAID). ACK. I think this help should mention that for page size 256KB, you will need to have a special version of binutils, since the ELF standard mentions page sizes only upto 64KB. Right. We use ELDK-4.2 for compiling applications to be run on 256K PAGE_SIZE kernel. This toolchain includes necessary changes for ELF_MAXPAGESIZE in binutils/bfd/elf32-ppc.c. -#ifdef CONFIG_PPC_64K_PAGES +#if defined(CONFIG_PPC32_256K_PAGES) +#define PAGE_SHIFT 18 +#elif defined(CONFIG_PPC32_64K_PAGES) || defined(CONFIG_PPC_64K_PAGES) #define PAGE_SHIFT 16 +#elif defined(CONFIG_PPC32_16K_PAGES) +#define PAGE_SHIFT 14 #else #define PAGE_SHIFT 12 #endif Why should the new defines be inside CONFIG_PPC_64K_PAGES? The definition CONFIG_PPC_64K_PAGES is repeated. We decided to introduce new CONFIG_PPC32_64K_PAGES option to distinguish using 64K pages on PPC32 and PPC64, so PAGE_SHIFT will be defined as 16 when the CONFIG_PPC_64K_PAGES option is set on some PPC64 platform, and as 16 when the CONFIG_PPC32_64K_PAGES option is set on some ppc44x PPC32 platform. Shouldn't these defines be like this: #if defined(CONFIG_PPC32_256K_PAGES) #define PAGE_SHIFT 18 #elif defined(CONFIG_PPC32_64K_PAGES) || defined(CONFIG_PPC_64K_PAGES) #define PAGE_SHIFT 16 #elif defined(CONFIG_PPC32_16K_PAGES) #define PAGE_SHIFT 14 #else #define PAGE_SHIFT 12 #endif Admittedly, I don't see the difference between your version and Ilya's one. Am I missing something ? 
+#elif (PAGE_SHIFT == 14) +/* + * PAGE_SIZE 16K + * PAGE_SHIFT 14 + * PTE_SHIFT 11 + * PMD_SHIFT 25 + */ +#define PPC44x_TLBE_SIZE PPC44x_TLB_16K +#define PPC44x_PGD_OFF_SH 9 /*(32 - PMD_SHIFT + 2)*/ +#define PPC44x_PGD_OFF_M1 23 /*(PMD_SHIFT - 2)*/ +#define PPC44x_PTE_ADD_SH 21 /*32 - PMD_SHIFT + PTE_SHIFT + 3*/ +#define PPC44x_PTE_ADD_M1 18 /*32 - 3 - PTE_SHIFT*/ +#define PPC44x_RPN_M2 17 /*31 - PAGE_SHIFT*/ Please change PPC44x_PGD_OFF_SH to PPC44x_PGD_OFF_SHIFT. SH sounds very confusing. I don't like the MI and M2 names too. Change PPC44x_RPN_M2 to PPC44x_RPN_MASK. Change M1 to MASK in PPC44x_PGD_OFF_M1 and PPC44x_PTE_ADD_M1 . Is there no way a define like #define PPC44x_PGD_OFF_SH (32 - PMD_SHIFT + 2) be used in assembly file. If yes, we can avoid repeating the defines. I think these 44x specific defines should go to asm/mmu-44x.h since I am planning to post a patch for 40x. For those processors, the defines below will changes as: #define PPC44x_PTE_ADD_SH (32 - PMD_SHIFT + PTE_SHIFT + 2) #define PPC44x_PTE_ADD_M1 (32 - 2 - PTE_SHIFT) Since these defines are not generic, they should be put in the mmu specific header file rather than adding a new header file. When 40x processors are supported, the corresponding defines can go to include/asm/mmu-40x.h +#elif (PAGE_SHIFT == 18) +/* + * PAGE_SIZE 256K + * PAGE_SHIFT 18 + * PTE_SHIFT 11 + * PMD_SHIFT 29 + */ +#define PPC44x_TLBE_SIZE PPC44x_TLB_256K +#define PPC44x_PGD_OFF_SH 5 /*(32 - PMD_SHIFT + 2)*/ +#define PPC44x_PGD_OFF_M1 27 /*(PMD_SHIFT - 2)*/ +#define PPC44x_PTE_ADD_SH 17 /*32 - PMD_SHIFT + PTE_SHIFT + 3*/ +#define PPC44x_PTE_ADD_M1 18 /*32 - 3 - PTE_SHIFT*/ +#define PPC44x_RPN_M2 13 /*31 - PAGE_SHIFT*/ For 256KB page size, I cannot understand why PTE_SHIFT is 11. Since each PTE entry is 8 byte, PTE_SHIFT should have been 15. But then there would be no bits in the Effective address for the 1st level PGDIR offset. On what basis PTE_SHIFT of 11 is chosen? 
This overflow problem happens only for the 256KB page size. We should use a smaller PTE area in the address to free some bits for the PGDIR part. I guess the only impact of this approach is inefficient usage of the memory pages allocated for PTE tables: with a PTE_SHIFT of 11 we use only 1/16 of each page holding PTEs. Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re[2]: [PATCH] powerpc: add support for PAGE_SIZEs greater than 4KB for
Hello Prodyut, Thanks for your comments. Some answers below. On Friday, September 12, 2008 you wrote:

	/*
	 * Create WS1. This is the faulting address (EPN),
	 * page size, and valid flag.
	 */
-	li	r11,PPC44x_TLB_VALID | PPC44x_TLB_4K
+	li	r11,PPC44x_TLB_VALID | PPC44x_TLBE_SIZE
	rlwimi	r10,r11,0,20,31	/* Insert valid and page size */
	tlbwe	r10,r13,PPC44x_TLB_PAGEID	/* Write PAGEID */

Change
	rlwimi	r10,r11,0,20,31	/* Insert valid and page size */
to
	rlwimi	r10,r11,0,PPC44x_PTE_ADD_M1,31	/* Insert valid and page size */

Agree. We'll fix this. I guess this works for us because we used the large EPN mask here, which covered more bits in the EPN field of the TLB entries than were required for the 16/64/256K PAGE_SIZE cases:

TLB Word 0 / bits 0..21: EPN (Effective Page Number) [from 4 to 22 bits]
TLB Word 0 / bit 22: V (Valid bit) [1 bit]
TLB Word 0 / bits 24..27: SIZE (Page Size) [4 bits]

Thus, doing 'rlwimi' we masked in our V/SIZE bits and cleared EPN for all the 4/16/64/256K PAGE_SIZE cases. Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re[2]: [PATCH] powerpc: add support for PAGE_SIZEs greater than 4KB for
Hi Ilya, On Friday, September 12, 2008 you wrote: Hi, prodyut hazarika wrote: In file arch/powerpc/mm/pgtable_32.c, we have:

#ifdef CONFIG_PTE_64BIT
/* 44x uses an 8kB pgdir because it has 8-byte Linux PTEs. */
#define PGDIR_ORDER	1
#else
#define PGDIR_ORDER	0
#endif

pgd_t *pgd_alloc(struct mm_struct *mm)
{
	pgd_t *ret;

	ret = (pgd_t *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, PGDIR_ORDER);
	return ret;
}

Thus, we allocate 2 pages for the PGD on 44x processors. This is needed only for 4K pages. We are anyway not using the whole 64K or 256K page for the PGD, so there is no point in wasting an additional 64K or 256KB page. Ok. Not sure I'm right, but I think the 16K case doesn't need a second page either (PGDIR_SHIFT=25, so sizeof(pgd_t) << (32 - PGDIR_SHIFT) < 16KB). ACK, no need for a second page when working with 16K pages. Prodyut's approach addresses this too, but ... Change this to:

#ifdef CONFIG_PTE_64BIT
#if (PAGE_SHIFT == 12)

I think #ifdef CONFIG_PTE_64BIT is a little bit confusing here... Actually PGDIR_ORDER should be something like max(32 + 2 - PGDIR_SHIFT - PAGE_SHIFT, 0)

/* 44x uses an 8kB pgdir because it has 8-byte Linux PTEs. */
#define PGDIR_ORDER	1
#else
#define PGDIR_ORDER	0
#endif
#else
#define PGDIR_ORDER	0
#endif

Yuri, any comments? ... as for me, I like your approach more. Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Unaligned LocalPlus Bus access on MPC5200
Hello, I've encountered a problem with unaligned word accesses to external devices (Flash memory) connected to the LocalPlus bus of the MPC5200 processor. Any comments on this would be very much appreciated. The essence of the issue is as follows:
- when I try to read a data word from the LPB-connected Flash using some even address (0xFF00, 0xFF02, 0xFF04, etc), then everything works fine (it's not a typo, 2-byte-aligned word accesses pass well too);
- when I try to read a data word from the LPB-connected Flash using some odd address (0xFF01, 0xFF03, ...), then the LPB returns only 1 byte of the word correctly (the remaining 3 bytes are filled with zeros).

(a) Here is what I have when I read from the LPB using word-aligned accesses with MPC5200 rev.A and the LPB configured in Non-Multiplexed mode + 8-bit data bus:

lwz from 0xc3082000: 0xba476bc7
lwz from 0xc3082004: 0xb95d77de
...

Now I try the unaligned reads:

lwz from 0xc3082000: 0xba476bc7
lwz from 0xc3082001: 0x00b9
lwz from 0xc3082002: 0x6bc7b95d
lwz from 0xc3082003: 0xc700

(b) With MPC5200 rev.B the situation is similar, and differs only in the fact that an unaligned read results in 2 bytes being read for the +0x1 address cases (LPB again configured in Non-Multiplexed mode + 8-bit data bus):

lwz from 0xd1082000: 0x2f459eaf
lwz from 0xd1082004: 0x388ff68d
...

lwz from 0xd1082000: 0x2f459eaf
lwz from 0xd1082001: 0xaf38
lwz from 0xd1082002: 0x9eaf388f
lwz from 0xd1082003: 0xaf00

(c) When the LPB operates in the Multiplexed mode with a 32-bit data bus, the erroneous result is observed for the +0x3 address cases only (MPC5200 is rev.A in these tests):

lwz from 0xc3082000: 0x7e9043a6
lwz from 0xc3082004: 0x7eb143a6
...
lwz from 0xc3082000: 0x7e9043a6
lwz from 0xc3082001: 0x9043a67e
lwz from 0xc3082002: 0x43a67eb1
lwz from 0xc3082003: 0xa600

I used the following platforms for the tests:
- TQM5200 board, based on the MPC5200 rev.A CPU, with AMD Flash connected to the LPB configured in Multiplexed mode with a 32-bit data bus;
- a custom board, based on the MPC5200 rev.A CPU, with Intel Flash connected to the LPB configured in Non-Multiplexed mode with an 8-bit data bus;
- Lite5200B board, based on the MPC5200 rev.B CPU, with AMD Flash connected to the LPB configured in Non-Multiplexed mode with an 8-bit data bus.

The Linux source tree I used is linux-2.6.23.16 (DENX linux-2.6.23-stable branch). The toolchain is ELDK-4.2. As an example, this LPB-related issue leads to incorrect operation of a JFFS2 file-system created on top of an MTD device built on a Flash chip from Intel/Sharp (drivers/mtd/chips/cfi_cmdset_0001.c). With these Flash chips an implementation of the point/unpoint API is possible, so the cfi_cmdset_0001.c driver exports the corresponding point/unpoint methods for the MTD device, and JFFS2 then uses these methods to operate on data directly from Flash (without copying it to RAM). One of these operations is the memcpy() in the jffs2_scan_dirent_node() function, which copies the file name from some (aligned) address in Flash to an, unfortunately, unaligned destination in RAM (the name field of the jffs2_full_dirent structure). The memcpy() implementation in the powerpc lib first does byte-by-byte transfers to reach an aligned destination, and then does word-by-word transfers; but by this moment the source is unaligned, so memcpy() does lwz-s from unaligned addresses on the LPB. Just FYI, a simple work-around for the issue with the Intel/Sharp Flash chips connected to the LPB of the MPC5200 is to mark your struct map_info as .phys = NO_XIP, and implement read/write/copy_from/copy_to byte-by-byte functions (in your drivers/mtd/maps/ board file).
Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 1/1] [PPC] 8xx swap bug-fix
Hi Scott, You are right. The TLB handlers for 8xx in the arch/powerpc branch set the PAGE_ACCESSED flag unconditionally too, and the include/asm-powerpc/pgtable-ppc32.h file still includes the comment that this is a bug. So, probably, the corresponding patch for the powerpc branch will be useful. Does anybody use swap with some of the 8xx-based boards supported in the powerpc branch? Regards, Yuri On Monday 04 February 2008 21:24, Scott Wood wrote: On Sat, Feb 02, 2008 at 12:22:17PM +0100, Jochen Friedrich wrote: Hi Yuri, Here is the patch which makes Linux-2.6 swap routines operate correctly on the ppc-8xx-based machines. is there any 8xx board left which isn't ported to ARCH=powerpc? More importantly, is this something that is also broken in arch/powerpc? It looks like it has the same code... -Scott -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 1/1] [PPC] 8xx swap bug-fix
Hello, Here is the patch which makes Linux-2.6 swap routines operate correctly on the ppc-8xx-based machines.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
--
diff --git a/arch/ppc/kernel/head_8xx.S b/arch/ppc/kernel/head_8xx.S
index eb8d26f..321bda2 100644
--- a/arch/ppc/kernel/head_8xx.S
+++ b/arch/ppc/kernel/head_8xx.S
@@ -329,8 +329,18 @@ InstructionTLBMiss:
 	mfspr	r11, SPRN_MD_TWC	/* and get the pte address */
 	lwz	r10, 0(r11)	/* Get the pte */

+#ifdef CONFIG_SWAP
+	/* do not set the _PAGE_ACCESSED bit of a non-present page */
+	andi.	r11, r10, _PAGE_PRESENT
+	beq	4f
+	ori	r10, r10, _PAGE_ACCESSED
+	mfspr	r11, SPRN_MD_TWC	/* get the pte address again */
+	stw	r10, 0(r11)
+4:
+#else
 	ori	r10, r10, _PAGE_ACCESSED
 	stw	r10, 0(r11)
+#endif

 	/* The Linux PTE won't go exactly into the MMU TLB.
 	 * Software indicator bits 21, 22 and 28 must be clear.
@@ -395,8 +405,17 @@ DataStoreTLBMiss:
 	DO_8xx_CPU6(0x3b80, r3)
 	mtspr	SPRN_MD_TWC, r11

-	mfspr	r11, SPRN_MD_TWC	/* get the pte address again */
+#ifdef CONFIG_SWAP
+	/* do not set the _PAGE_ACCESSED bit of a non-present page */
+	andi.	r11, r10, _PAGE_PRESENT
+	beq	4f
+	ori	r10, r10, _PAGE_ACCESSED
+4:
+	/* and update pte in table */
+#else
 	ori	r10, r10, _PAGE_ACCESSED
+#endif
+	mfspr	r11, SPRN_MD_TWC	/* get the pte address again */
 	stw	r10, 0(r11)

 	/* The Linux PTE won't go exactly into the MMU TLB.
@@ -575,7 +594,16 @@ DataTLBError:

 	/* Update 'changed', among others. */
+#ifdef CONFIG_SWAP
+	ori	r10, r10, _PAGE_DIRTY|_PAGE_HWWRITE
+	/* do not set the _PAGE_ACCESSED bit of a non-present page */
+	andi.	r11, r10, _PAGE_PRESENT
+	beq	4f
+	ori	r10, r10, _PAGE_ACCESSED
+4:
+#else
 	ori	r10, r10, _PAGE_DIRTY|_PAGE_ACCESSED|_PAGE_HWWRITE
+#endif
 	mfspr	r11, SPRN_MD_TWC	/* Get pte address again */
 	stw	r10, 0(r11)	/* and update pte in table */
diff --git a/include/asm-ppc/pgtable.h b/include/asm-ppc/pgtable.h
index c159315..76717ff 100644
--- a/include/asm-ppc/pgtable.h
+++ b/include/asm-ppc/pgtable.h
@@ -341,14 +341,6 @@ extern unsigned long ioremap_bot, ioremap_base;
 #define _PMD_PAGE_MASK	0x000c
 #define _PMD_PAGE_8M	0x000c

-/*
- * The 8xx TLB miss handler allegedly sets _PAGE_ACCESSED in the PTE
- * for an address even if _PAGE_PRESENT is not set, as a performance
- * optimization. This is a bug if you ever want to use swap unless
- * _PAGE_ACCESSED is 2, which it isn't, or unless you have 8xx-specific
- * definitions for __swp_entry etc. below, which would be gross.
- *	-- paulus
- */
 #define _PTE_NONE_MASK	_PAGE_ACCESSED

 #else /* CONFIG_6xx */

-- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH 0/2] [PPC 4xx] L2-cache synchronization for ppc44x
Hello, Eugene, The h/w snooping mechanism you are talking about is limited to the Low Latency (LL) segment of the PLB bus in ppc440sp and ppc440spe chips (see section 7.2.7 L2 Cache Coherency of the ppc440spe spec), whereas the DMA and XOR engines use the High Bandwidth (HB) segment of the PLB bus (see section 1.1.2 Internal Buses of the ppc440spe spec). Thus, the h/w snooping mechanism is not able to trace the results of operations performed by the DMA and XOR engines and keep the L2-cache coherent with SDRAM, because the data flows through the HB PLB segment. This leads to, for example, incorrect results of RAID-parity calculations if one uses the h/w accelerated ppc440spe ADMA driver with L2-cache enabled. The s/w synchronization algorithms proposed in my patches have no LL PLB limitations, as opposed to h/w snooping, but, probably, this is not the best way it might be implemented. Even though with these patches the h/w accelerated RAID starts to operate correctly (with L2-cache enabled), there is a performance degradation (induced by loops in the L2-cache synchronization routines) observed in most cases. So, as a result, there is no benefit from using L2-cache for these RAID cases at all. Regards, Yuri On Wednesday 28 November 2007 22:50, Eugene Surovegin wrote: On Wed, Nov 07, 2007 at 01:40:10AM +0300, Yuri Tikhonov wrote: Hello all, Here is a patch-set for support L2-cache synchronization routines for the ppc44x processors family. I know that the ppc branch is for bug-fixing only, thus the patch-set is just FYI [though enabled but non-coherent L2-cache may appear as a bug for someone who uses one of the boards listed below :)]. [PATCH 1/2] [PPC 4xx] invalidate_l2cache_range() implementation for ppc44x; [PATCH 2/2] [PPC 44x] enable L2-cache for the following ppc44x-based boards: ALPR, Katmai, Ocotea, and Taishan. Why is this all needed? IIRC ibm440gx_l2c_enable() configures 64G snoop region for L2C. Did AMCC made non-only-coherent L2C chips recently?
-- Eugene -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re[2]: [PATCH 1/2] [PPC 4xx] invalidate_l2cache_range() implementation for ppc44x
Hi Olof, Thanks a lot for the feedback. Comments below. On 07.11.2007, 7:04:28 you wrote: Hi, Some comments below. In general this patch adds #ifdefs in common code, that's normally frowned upon. It would maybe be better to add a new call to ppc_machdeps and call it if set. Agree; this looks better indeed. On Wed, Nov 07, 2007 at 01:40:28AM +0300, Yuri Tikhonov wrote: ...

+
 /*
  * Write any modified data cache blocks out to memory.
  * Does not invalidate the corresponding cache lines (especially for
diff --git a/include/asm-powerpc/cache.h b/include/asm-powerpc/cache.h
index 5350704..8a2f9e6 100644
--- a/include/asm-powerpc/cache.h
+++ b/include/asm-powerpc/cache.h
@@ -10,12 +10,14 @@
 #define MAX_COPY_PREFETCH	1
 #elif defined(CONFIG_PPC32)
 #define L1_CACHE_SHIFT		5
+#define L2_CACHE_SHIFT		5
 #define MAX_COPY_PREFETCH	4
 #else /* CONFIG_PPC64 */
 #define L1_CACHE_SHIFT		7
 #endif

 #define	L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
+#define	L2_CACHE_BYTES		(1 << L2_CACHE_SHIFT)

The above looks highly system-dependent to me. Should maybe be a part of the cache info structures instead, and filled in from the device tree? This is the Level-2 cache line parameter. I'll see what can be made here. For now I've just renamed these definitions and moved them into the PPC44x-specific header. Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re[2]: [PATCH 2/2] [PPC 44x] enable L2-cache for ALPR, Katmai, Ocotea, and Taishan
Hi Olof, On 07.11.2007, 7:06:08 you wrote: ...

+
+config L2_CACHE
+	bool "Enable Level-2 Cache"
+	depends on NOT_COHERENT_CACHE && (KATMAI || TAISHAN || OCOTEA || ALPR)
+	default y
+	help
+	  This option enables L2-cache on ppc44x controllers.
+	  If unsure, say Y.

That's a very generic config name. Maybe something like PPC_4XX_L2_CACHE? Having the ppc_machdep for invalidating L2-cache lines, we can avoid introducing the new configuration option at all. See below. Is there ever a case where a user would NOT want L2 cache enabled (and disabled permanently enough to rebuild the kernel instead of giving a kernel command line option?) Theoretically - yes. The internal SRAM of ppc44x may be used for something else than L2 cache. Admittedly, the configuration option was necessary for me to enable or disable my L2-cache synchronization routine in the generic dma_sync() function. Per your suggestion, now, instead of introducing the new kernel option, I initialize the L2-cache sync ppc_machdep right in the L2-cache enable routine: thus, if the user does not enable L2-cache (does not want the internal SRAM to act as L2-cache and does not call the L2-cache enabling routine), then my new ppc_machdep will remain set to zero and will not affect SRAM used for some specific purposes. ...

@@ -567,7 +569,9 @@ void __init platform_init(unsigned long r3, unsigned long r4,
 #ifdef CONFIG_KGDB
 	ppc_md.early_serial_map = alpr_early_serial_map;
 #endif
+#ifdef CONFIG_L2_CACHE
 	ppc_md.init = alpr_init;
+#endif

Why do you take out the above calls if the new option is selected? Seems odd to remove something that worked(?) before. Umm.. Quite the contrary, the option selected made these calls available. Though it doesn't matter anymore, since there is no CONFIG_L2_CACHE option anymore (i.e. all four boards dealt with in this patch-set now have L2-cache enabled regardless of configuration, as it was initially).

 	ppc_md.restart = alpr_restart;
 }
...
+#ifdef CONFIG_L2_CACHE +static void __init katmai_init(void) +{ + ibm440gx_l2c_setup(clocks); +} +#endif + void __init platform_init(unsigned long r3, unsigned long r4, unsigned long r5, unsigned long r6, unsigned long r7) { @@ -599,4 +607,7 @@ void __init platform_init(unsigned long r3, unsigned long r4, ppc_md.early_serial_map = katmai_early_serial_map; #endif ppc_md.restart = katmai_restart; +#ifdef CONFIG_L2_CACHE + ppc_md.init = katmai_init; +#endif See comment above. Should the above init be called for all configs, not just when L2_CACHE is enabled? Also, it looks like the init function is the same on every board. It would be better to make a common function instead of duplicating it everywhere. Agree, but perhaps it's not the case for the ppc branch. Will do this in the powerpc branch as soon as support for these boards will be ported there.. by someone :) Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH] [PPC 44x] L2-cache synchronization for ppc44x
This is the updated patch for supporting synchronization of L2-cache with the external memory on the ppc44x-based platforms. Differences against the previous patch-set:
- remove L2_CACHE config option;
- introduce the ppc machdep to invalidate L2 cache lines;
- some code clean-up.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Pavel Kolesnikov [EMAIL PROTECTED]
--
diff --git a/arch/powerpc/lib/dma-noncoherent.c b/arch/powerpc/lib/dma-noncoherent.c
index 1947380..b06f05c 100644
--- a/arch/powerpc/lib/dma-noncoherent.c
+++ b/arch/powerpc/lib/dma-noncoherent.c
@@ -31,6 +31,7 @@
 #include <linux/dma-mapping.h>

 #include <asm/tlbflush.h>
+#include <asm/machdep.h>

 /*
  * This address range defaults to a value that is safe for all
@@ -186,6 +187,8 @@ __dma_alloc_coherent(size_t size, dma_addr_t *handle, gfp_t gfp)
 		unsigned long kaddr = (unsigned long)page_address(page);
 		memset(page_address(page), 0, size);
 		flush_dcache_range(kaddr, kaddr + size);
+		if (ppc_md.l2cache_inv_range)
+			ppc_md.l2cache_inv_range(__pa(kaddr), __pa(kaddr + size));
 	}

 	/*
@@ -351,12 +354,16 @@ void __dma_sync(void *vaddr, size_t size, int direction)
 		BUG();
 	case DMA_FROM_DEVICE:	/* invalidate only */
 		invalidate_dcache_range(start, end);
+		if (ppc_md.l2cache_inv_range)
+			ppc_md.l2cache_inv_range(__pa(start), __pa(end));
 		break;
 	case DMA_TO_DEVICE:	/* writeback only */
 		clean_dcache_range(start, end);
 		break;
 	case DMA_BIDIRECTIONAL:	/* writeback and invalidate */
 		flush_dcache_range(start, end);
+		if (ppc_md.l2cache_inv_range)
+			ppc_md.l2cache_inv_range(__pa(start), __pa(end));
 		break;
 	}
 }
diff --git a/arch/ppc/kernel/misc.S b/arch/ppc/kernel/misc.S
index 46cf8fa..31c9149 100644
--- a/arch/ppc/kernel/misc.S
+++ b/arch/ppc/kernel/misc.S
@@ -25,6 +25,10 @@
 #include <asm/thread_info.h>
 #include <asm/asm-offsets.h>

+#ifdef CONFIG_44x
+#include <asm/ibm44x.h>
+#endif
+
 #ifdef CONFIG_8xx
 #define ISYNC_8xx isync
 #else
@@ -386,6 +390,35 @@ END_FTR_SECTION_IFSET(CPU_FTR_COHERENT_ICACHE)
 	sync	/* additional sync needed on g4 */
 	isync
 	blr
+
+#if defined(CONFIG_44x)
+/*
+ * Invalidate the Level-2 cache lines corresponding to the address
+ * range.
+ *
+ * invalidate_l2cache_range(unsigned long start, unsigned long stop)
+ */
+_GLOBAL(invalidate_l2cache_range)
+	li	r5,PPC44X_L2_CACHE_BYTES-1	/* align on L2-cache line */
+	andc	r3,r3,r5
+	subf	r4,r3,r4
+	add	r4,r4,r5
+	srwi.	r4,r4,PPC44X_L2_CACHE_SHIFT
+	mtctr	r4
+
+	lis	r4,L2C_CMD_INV>>16
+1:	mtdcr	DCRN_L2C0_ADDR,r3	/* write address to invalidate */
+	mtdcr	DCRN_L2C0_CMD,r4	/* issue the Invalidate cmd */
+
+2:	mfdcr	r5,DCRN_L2C0_SR		/* wait for complete */
+	andis.	r5,r5,L2C_CMD_CLR>>16
+	beq	2b
+
+	addi	r3,r3,PPC44X_L2_CACHE_BYTES	/* next address to invalidate */
+	bdnz	1b
+	blr
+#endif
+
 /*
  * Write any modified data cache blocks out to memory.
  * Does not invalidate the corresponding cache lines (especially for
diff --git a/arch/ppc/syslib/ibm440gx_common.c b/arch/ppc/syslib/ibm440gx_common.c
index 6b1a801..64c663f 100644
--- a/arch/ppc/syslib/ibm440gx_common.c
+++ b/arch/ppc/syslib/ibm440gx_common.c
@@ -12,6 +12,8 @@
  */
 #include <linux/kernel.h>
 #include <linux/interrupt.h>
+#include <asm/machdep.h>
+#include <asm/cacheflush.h>
 #include <asm/ibm44x.h>
 #include <asm/mmu.h>
 #include <asm/processor.h>
@@ -201,6 +203,7 @@ void __init ibm440gx_l2c_enable(void){
 	asm volatile ("sync; isync" ::: "memory");
 	local_irq_restore(flags);
+	ppc_md.l2cache_inv_range = invalidate_l2cache_range;
 }

 /* Disable L2 cache */
diff --git a/include/asm-powerpc/cacheflush.h b/include/asm-powerpc/cacheflush.h
index ba667a3..bdebfaa 100644
--- a/include/asm-powerpc/cacheflush.h
+++ b/include/asm-powerpc/cacheflush.h
@@ -49,6 +49,7 @@ extern void flush_dcache_range(unsigned long start, unsigned long stop);
 #ifdef CONFIG_PPC32
 extern void clean_dcache_range(unsigned long start, unsigned long stop);
 extern void invalidate_dcache_range(unsigned long start, unsigned long stop);
+extern void invalidate_l2cache_range(unsigned long start, unsigned long stop);
 #endif /* CONFIG_PPC32 */
 #ifdef CONFIG_PPC64
 extern void flush_inval_dcache_range(unsigned long start, unsigned long stop);
diff --git a/include/asm-powerpc/machdep.h b/include/asm-powerpc/machdep.h
index 71c6e7e..754f416 100644
--- a/include/asm-powerpc/machdep.h
+++ b/include/asm-powerpc/machdep.h
@@ -201,6 +201,8 @@ struct machdep_calls {
 	void	(*early_serial_map)(void);
 	void
Re[2]: [PATCH] [PPC 44x] L2-cache synchronization for ppc44x
Hi Ben, On 08.11.2007, 2:19:33 you wrote: On Thu, 2007-11-08 at 02:12 +0300, Yuri Tikhonov wrote: This is the updated patch for supporting synchronization of L2-cache with the external memory on the ppc44x-based platforms. Differences against the previous patch-set: - remove L2_CACHE config option; - introduce the ppc machdep to invalidate L2 cache lines; - some code clean-up. Can you tell me more about how this cache operates? I don't quite understand why you would invalidate it on bidirectional DMAs rather than flush it to memory (unless you got your terminology wrong) and why you wouldn't flush it on transfers to the device.. Unless it is a write-through cache? Yes, the ppc44x Level-2 cache has a write-through design, so no kind of L2 flush is needed. As far as the DMA_BIDIRECTIONAL case is concerned: flush_dcache_range() flushes the data over the path L1 -> L2 -> RAM, but invalidates L1 only, so the copy in L2 remains valid. Since in the BIDIRECTIONAL case DMA may update the data in RAM, we have to invalidate the L2-cache manually, so that the CPU reads the new data transferred by DMA right from RAM rather than the old data stuck in L2 after flush_dcache(). Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 1/2] [PPC 4xx] invalidate_l2cache_range() implementation for ppc44x
Support for L2-cache coherency synchronization routines for ppc44x processors.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Pavel Kolesnikov [EMAIL PROTECTED]
--
diff --git a/arch/powerpc/lib/dma-noncoherent.c b/arch/powerpc/lib/dma-noncoherent.c
index 1947380..593a425 100644
--- a/arch/powerpc/lib/dma-noncoherent.c
+++ b/arch/powerpc/lib/dma-noncoherent.c
@@ -351,12 +351,18 @@ void __dma_sync(void *vaddr, size_t size, int direction)
 		BUG();
 	case DMA_FROM_DEVICE:	/* invalidate only */
 		invalidate_dcache_range(start, end);
+#ifdef CONFIG_L2_CACHE
+		invalidate_l2cache_range(__pa(start), __pa(end));
+#endif
 		break;
 	case DMA_TO_DEVICE:	/* writeback only */
 		clean_dcache_range(start, end);
 		break;
 	case DMA_BIDIRECTIONAL:	/* writeback and invalidate */
 		flush_dcache_range(start, end);
+#ifdef CONFIG_L2_CACHE
+		invalidate_l2cache_range(__pa(start), __pa(end));
+#endif
 		break;
 	}
 }
diff --git a/arch/ppc/kernel/misc.S b/arch/ppc/kernel/misc.S
index 46cf8fa..de62f85 100644
--- a/arch/ppc/kernel/misc.S
+++ b/arch/ppc/kernel/misc.S
@@ -386,6 +386,36 @@ END_FTR_SECTION_IFSET(CPU_FTR_COHERENT_ICACHE)
 	sync	/* additional sync needed on g4 */
 	isync
 	blr
+
+#ifdef CONFIG_L2_CACHE
+/*
+ * Invalidate the Level-2 cache lines corresponding to the address
+ * range.
+ *
+ * invalidate_l2cache_range(unsigned long start, unsigned long stop)
+ */
+#include <asm/ibm4xx.h>
+_GLOBAL(invalidate_l2cache_range)
+	li	r5,L2_CACHE_BYTES-1	/* do l2-cache line alignment */
+	andc	r3,r3,r5
+	subf	r4,r3,r4
+	add	r4,r4,r5
+	srwi.	r4,r4,L2_CACHE_SHIFT
+	mtctr	r4
+
+	lis	r4,L2C_CMD_INV>>16
+1:	mtdcr	DCRN_L2C0_ADDR,r3	/* write address to invalidate */
+	mtdcr	DCRN_L2C0_CMD,r4	/* issue the Invalidate cmd */
+
+2:	mfdcr	r5,DCRN_L2C0_SR		/* wait for complete */
+	andis.	r5,r5,L2C_CMD_CLR>>16
+	beq	2b
+
+	addi	r3,r3,L2_CACHE_BYTES	/* next address to invalidate */
+	bdnz	1b
+	blr
+#endif
+
 /*
  * Write any modified data cache blocks out to memory.
  * Does not invalidate the corresponding cache lines (especially for
diff --git a/include/asm-powerpc/cache.h b/include/asm-powerpc/cache.h
index 5350704..8a2f9e6 100644
--- a/include/asm-powerpc/cache.h
+++ b/include/asm-powerpc/cache.h
@@ -10,12 +10,14 @@
 #define MAX_COPY_PREFETCH	1
 #elif defined(CONFIG_PPC32)
 #define L1_CACHE_SHIFT		5
+#define L2_CACHE_SHIFT		5
 #define MAX_COPY_PREFETCH	4
 #else /* CONFIG_PPC64 */
 #define L1_CACHE_SHIFT		7
 #endif

 #define	L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
+#define	L2_CACHE_BYTES		(1 << L2_CACHE_SHIFT)

 #define	SMP_CACHE_BYTES		L1_CACHE_BYTES
diff --git a/include/asm-powerpc/cacheflush.h b/include/asm-powerpc/cacheflush.h
index ba667a3..bdebfaa 100644
--- a/include/asm-powerpc/cacheflush.h
+++ b/include/asm-powerpc/cacheflush.h
@@ -49,6 +49,7 @@ extern void flush_dcache_range(unsigned long start, unsigned long stop);
 #ifdef CONFIG_PPC32
 extern void clean_dcache_range(unsigned long start, unsigned long stop);
 extern void invalidate_dcache_range(unsigned long start, unsigned long stop);
+extern void invalidate_l2cache_range(unsigned long start, unsigned long stop);
 #endif /* CONFIG_PPC32 */
 #ifdef CONFIG_PPC64
 extern void flush_inval_dcache_range(unsigned long start, unsigned long stop);
diff --git a/include/asm-ppc/ibm44x.h b/include/asm-ppc/ibm44x.h
index 8078a58..782909a 100644
--- a/include/asm-ppc/ibm44x.h
+++ b/include/asm-ppc/ibm44x.h
@@ -138,7 +138,6 @@
  * The residual board information structure the boot loader passes
  * into the kernel.
  */
-#ifndef __ASSEMBLY__

 /*
  * DCRN definitions
@@ -814,6 +813,5 @@

 #include <asm/ibm4xx.h>

-#endif /* __ASSEMBLY__ */
 #endif /* __ASM_IBM44x_H__ */
 #endif /* __KERNEL__ */
___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 2/2] [PPC 44x] enable L2-cache for ALPR, Katmai, Ocotea, and Taishan
This patch introduces the L2_CACHE configuration option, available for the ppc44x-based boards with L2-cache enabled.

Signed-off-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Pavel Kolesnikov [EMAIL PROTECTED]
--
diff --git a/arch/ppc/platforms/4xx/Kconfig b/arch/ppc/platforms/4xx/Kconfig
index 1d2ca42..ad6b581 100644
--- a/arch/ppc/platforms/4xx/Kconfig
+++ b/arch/ppc/platforms/4xx/Kconfig
@@ -396,4 +396,12 @@ config SERIAL_SICC_CONSOLE
 	bool
 	depends on SERIAL_SICC && UART0_TTYS1
 	default y
+
+config L2_CACHE
+	bool "Enable Level-2 Cache"
+	depends on NOT_COHERENT_CACHE && (KATMAI || TAISHAN || OCOTEA || ALPR)
+	default y
+	help
+	  This option enables L2-cache on ppc44x controllers.
+	  If unsure, say Y.
 endmenu
diff --git a/arch/ppc/platforms/4xx/alpr.c b/arch/ppc/platforms/4xx/alpr.c
index 3b6519f..0623801 100644
--- a/arch/ppc/platforms/4xx/alpr.c
+++ b/arch/ppc/platforms/4xx/alpr.c
@@ -537,10 +537,12 @@ static void __init alpr_setup_arch(void)
 	printk("Prodrive ALPR port (DENX Software Engineering [EMAIL PROTECTED])\n");
 }

+#ifdef CONFIG_L2_CACHE
 static void __init alpr_init(void)
 {
 	ibm440gx_l2c_setup(&clocks);
 }
+#endif

 static void alpr_progress(char *buf, unsigned short val)
 {
@@ -567,7 +569,9 @@ void __init platform_init(unsigned long r3, unsigned long r4,
 #ifdef CONFIG_KGDB
 	ppc_md.early_serial_map = alpr_early_serial_map;
 #endif
+#ifdef CONFIG_L2_CACHE
 	ppc_md.init = alpr_init;
+#endif
 	ppc_md.restart = alpr_restart;
 }
diff --git a/arch/ppc/platforms/4xx/katmai.c b/arch/ppc/platforms/4xx/katmai.c
index d29ebf6..01f1baf 100644
--- a/arch/ppc/platforms/4xx/katmai.c
+++ b/arch/ppc/platforms/4xx/katmai.c
@@ -219,6 +219,7 @@ katmai_show_cpuinfo(struct seq_file *m)
 {
 	seq_printf(m, "vendor\t\t: AMCC\n");
 	seq_printf(m, "machine\t\t: PPC440SPe EVB (Katmai)\n");
+	ibm440gx_show_cpuinfo(m);
 	return 0;
 }
@@ -584,6 +585,13 @@ static void katmai_restart(char *cmd)
 	mtspr(SPRN_DBCR0, DBCR0_RST_CHIP);
 }

+#ifdef CONFIG_L2_CACHE
+static void __init katmai_init(void)
+{
+	ibm440gx_l2c_setup(&clocks);
+}
+#endif
+
 void __init platform_init(unsigned long r3, unsigned long r4,
 			  unsigned long r5, unsigned long r6, unsigned long r7)
 {
@@ -599,4 +607,7 @@ void __init platform_init(unsigned long r3, unsigned long r4,
 	ppc_md.early_serial_map = katmai_early_serial_map;
 #endif
 	ppc_md.restart = katmai_restart;
+#ifdef CONFIG_L2_CACHE
+	ppc_md.init = katmai_init;
+#endif
 }
diff --git a/arch/ppc/platforms/4xx/ocotea.c b/arch/ppc/platforms/4xx/ocotea.c
index a7435aa..8b13811 100644
--- a/arch/ppc/platforms/4xx/ocotea.c
+++ b/arch/ppc/platforms/4xx/ocotea.c
@@ -321,10 +321,12 @@ ocotea_setup_arch(void)
 	printk("IBM Ocotea port (MontaVista Software, Inc. [EMAIL PROTECTED])\n");
 }

+#ifdef CONFIG_L2_CACHE
 static void __init ocotea_init(void)
 {
 	ibm440gx_l2c_setup(&clocks);
 }
+#endif

 void __init platform_init(unsigned long r3, unsigned long r4,
 			  unsigned long r5, unsigned long r6, unsigned long r7)
@@ -345,5 +347,7 @@ void __init platform_init(unsigned long r3, unsigned long r4,
 #ifdef CONFIG_KGDB
 	ppc_md.early_serial_map = ocotea_early_serial_map;
 #endif
+#ifdef CONFIG_L2_CACHE
 	ppc_md.init = ocotea_init;
+#endif
 }
diff --git a/arch/ppc/platforms/4xx/taishan.c b/arch/ppc/platforms/4xx/taishan.c
index f4b9435..8bb6f15 100644
--- a/arch/ppc/platforms/4xx/taishan.c
+++ b/arch/ppc/platforms/4xx/taishan.c
@@ -370,10 +370,12 @@ taishan_setup_arch(void)
 	printk("AMCC PowerPC 440GX Taishan Platform\n");
 }

+#ifdef CONFIG_L2_CACHE
 static void __init taishan_init(void)
 {
 	ibm440gx_l2c_setup(&clocks);
 }
+#endif

 void __init platform_init(unsigned long r3, unsigned long r4,
 			  unsigned long r5, unsigned long r6, unsigned long r7)
@@ -389,6 +391,8 @@ void __init platform_init(unsigned long r3, unsigned long r4,
 #ifdef CONFIG_KGDB
 	ppc_md.early_serial_map = taishan_early_serial_map;
 #endif
+#ifdef CONFIG_L2_CACHE
 	ppc_md.init = taishan_init;
+#endif
 }
___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 0/2] [PPC 4xx] L2-cache synchronization for ppc44x
Hello all, Here is a patch-set supporting L2-cache synchronization routines for the ppc44x processor family. I know that the ppc branch is for bug-fixing only, thus the patch-set is just FYI [though an enabled but non-coherent L2-cache may appear as a bug for someone who uses one of the boards listed below :)]. [PATCH 1/2] [PPC 4xx] invalidate_l2cache_range() implementation for ppc44x; [PATCH 2/2] [PPC 44x] enable L2-cache for the following ppc44x-based boards: ALPR, Katmai, Ocotea, and Taishan. Regards, Yuri -- Yuri Tikhonov, Senior Software Engineer Emcraft Systems, www.emcraft.com ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] ppc44x: support for 256K PAGE_SIZE
On Friday 19 October 2007 17:24, Kumar Gala wrote:

> On Oct 18, 2007, at 6:21 PM, Paul Mackerras wrote:
> > Yuri Tikhonov writes:
> > > The following patch adds support for 256KB PAGE_SIZE on ppc44x-based boards. Applications to be run on a kernel with 256KB PAGE_SIZE have to be built using a modified version of binutils, where the MAXPAGESIZE definition is set to 0x40000 (as opposed to the standard 0x10000).
> >
> > Have you measured the performance using a 64kB page size? If so, how does it compare with the 256kB page size?
>
> I was wondering about this as well. Isn't this technically in violation of the ABI?

No, it isn't a violation. As stated in the System V ABI, PowerPC Processor Supplement (on which the Linux Standard Base Core Specification for PPC32 is based):

  "... Virtual addresses and file offsets for the PowerPC processor family segments are congruent modulo 64 Kbytes (0x10000) or larger powers of 2."

So, 256 Kbytes is just a larger case.
Re: [PATCH] ppc44x: support for 256K PAGE_SIZE
On Friday 19 October 2007 03:21, Paul Mackerras wrote:

> Have you measured the performance using a 64kB page size? If so, how does it compare with the 256kB page size?

I measured the performance of sequential full-stripe write operations to a RAID-5 array (the P values below are in MBytes per second) using the h/w accelerated RAID-5 driver. Here are the comparative results for the different PAGE_SIZE values:

 PAGE_SIZE = 4K:   P = 66 MBps;
 PAGE_SIZE = 16K:  P = 145 MBps;
 PAGE_SIZE = 64K:  P = 196 MBps;
 PAGE_SIZE = 256K: P = 217 MBps.

> The 64kB page size has the attraction that no binutils changes are required.

That's true, but the additional performance is an attractive thing too.
Re: [PATCH] ppc44x: support for 256K PAGE_SIZE
On Friday 19 October 2007 19:48, Kumar Gala wrote:

> > PAGE_SIZE = 4K:   P = 66 MBps;
> > PAGE_SIZE = 16K:  P = 145 MBps;
> > PAGE_SIZE = 64K:  P = 196 MBps;
> > PAGE_SIZE = 256K: P = 217 MBps.
>
> Is this all in kernel space? Or is there a user space aspect to the benchmark?

The situation here is that the Linux RAID driver does a lot of complex things with the pages (the strips of the array) on the CPU before submitting those pages to the h/w; this is where most of the time is spent. Thus, by increasing the PAGE_SIZE value we reduce the number of calls to these complex algorithms needed to process the whole test (writing a fixed number of MBytes to the RAID array). So, there is no user-space aspect.
Re: [PATCH] ppc44x: support for 256K PAGE_SIZE
It has turned out that my mailer had corrupted my previous message (thanks to Wolfgang Denk for pointing this out). So if you'd like to apply the patch without conflicts, please use the version of the patch in this mail.

The following patch adds support for 256KB PAGE_SIZE on ppc44x-based boards. Applications to be run on a kernel with 256KB PAGE_SIZE have to be built using a modified version of binutils, where the MAXPAGESIZE definition is set to 0x40000 (as opposed to the standard 0x10000).

Signed-off-by: Pavel Kolesnikov [EMAIL PROTECTED]
Acked-by: Yuri Tikhonov [EMAIL PROTECTED]
--
diff --git a/arch/ppc/Kconfig b/arch/ppc/Kconfig
index c590b18..0ee372d 100644
--- a/arch/ppc/Kconfig
+++ b/arch/ppc/Kconfig
@@ -1223,6 +1223,9 @@ config PPC_PAGE_16K
 config PPC_PAGE_64K
 	bool "64 KB" if 44x
+
+config PPC_PAGE_256K
+	bool "256 KB" if 44x
 endchoice
 endmenu
diff --git a/arch/ppc/kernel/entry.S b/arch/ppc/kernel/entry.S
index fba7ca1..2140341 100644
--- a/arch/ppc/kernel/entry.S
+++ b/arch/ppc/kernel/entry.S
@@ -200,7 +200,7 @@ _GLOBAL(DoSyscall)
 #ifdef SHOW_SYSCALLS
 	bl	do_show_syscall
 #endif /* SHOW_SYSCALLS */
-	rlwinm	r10,r1,0,0,18	/* current_thread_info() */
+	rlwinm	r10,r1,0,0,(31-THREAD_SHIFT)	/* current_thread_info() */
 	lwz	r11,TI_FLAGS(r10)
 	andi.	r11,r11,_TIF_SYSCALL_T_OR_A
 	bne-	syscall_dotrace
@@ -221,7 +221,7 @@ ret_from_syscall:
 	bl	do_show_syscall_exit
 #endif
 	mr	r6,r3
-	rlwinm	r12,r1,0,0,18	/* current_thread_info() */
+	rlwinm	r12,r1,0,0,(31-THREAD_SHIFT)	/* current_thread_info() */
 	/* disable interrupts so current_thread_info()->flags can't change */
 	LOAD_MSR_KERNEL(r10,MSR_KERNEL)	/* doesn't include MSR_EE */
 	SYNC
@@ -639,7 +639,7 @@ ret_from_except:
 user_exc_return:		/* r10 contains MSR_KERNEL here */
 	/* Check current_thread_info()->flags */
-	rlwinm	r9,r1,0,0,18
+	rlwinm	r9,r1,0,0,(31-THREAD_SHIFT)
 	lwz	r9,TI_FLAGS(r9)
 	andi.	r0,r9,(_TIF_SIGPENDING|_TIF_RESTORE_SIGMASK|_TIF_NEED_RESCHED)
 	bne	do_work
@@ -659,7 +659,7 @@ restore_user:
 /* N.B. the only way to get here is from the beq following ret_from_except. */
 resume_kernel:
 	/* check current_thread_info->preempt_count */
-	rlwinm	r9,r1,0,0,18
+	rlwinm	r9,r1,0,0,(31-THREAD_SHIFT)
 	lwz	r0,TI_PREEMPT(r9)
 	cmpwi	0,r0,0		/* if non-zero, just restore regs and return */
 	bne	restore
@@ -669,7 +669,7 @@ resume_kernel:
 	andi.	r0,r3,MSR_EE	/* interrupts off? */
 	beq	restore		/* don't schedule if so */
 1:	bl	preempt_schedule_irq
-	rlwinm	r9,r1,0,0,18
+	rlwinm	r9,r1,0,0,(31-THREAD_SHIFT)
 	lwz	r3,TI_FLAGS(r9)
 	andi.	r0,r3,_TIF_NEED_RESCHED
 	bne-	1b
@@ -875,7 +875,7 @@ recheck:
 	LOAD_MSR_KERNEL(r10,MSR_KERNEL)
 	SYNC
 	MTMSRD(r10)		/* disable interrupts */
-	rlwinm	r9,r1,0,0,18
+	rlwinm	r9,r1,0,0,(31-THREAD_SHIFT)
 	lwz	r9,TI_FLAGS(r9)
 	andi.	r0,r9,_TIF_NEED_RESCHED
 	bne-	do_resched
diff --git a/arch/ppc/kernel/head_booke.h b/arch/ppc/kernel/head_booke.h
index f3d274c..db4 100644
--- a/arch/ppc/kernel/head_booke.h
+++ b/arch/ppc/kernel/head_booke.h
@@ -20,7 +20,9 @@
 	beq	1f;							     \
 	mfspr	r1,SPRN_SPRG3;	/* if from user, start at top of	*/   \
 	lwz	r1,THREAD_INFO-THREAD(r1); /* this thread's kernel stack */  \
-	addi	r1,r1,THREAD_SIZE;					     \
+	lis	r11,[EMAIL PROTECTED];					     \
+	ori	r11,r11,[EMAIL PROTECTED];				     \
+	add	r1,r1,r11;						     \
 1:	subi	r1,r1,INT_FRAME_SIZE;	/* Allocate an exception frame */    \
 	mr	r11,r1;							     \
 	stw	r10,_CCR(r11);		/* save various registers */	     \
@@ -106,7 +108,9 @@
 	/* COMING FROM USER MODE */					     \
 	mfspr	r11,SPRN_SPRG3;	/* if from user, start at top of	*/   \
 	lwz	r11,THREAD_INFO-THREAD(r11); /* this thread's kernel stack */\
-	addi	r11,r11,THREAD_SIZE;					     \
+	lis	r11,[EMAIL PROTECTED];					     \
+	ori	r11,r11,[EMAIL PROTECTED];				     \
+	add	r1,r1,r11;						     \
 1:	subi	r11,r11,INT_FRAME_SIZE;	/* Allocate an exception frame */    \
 	stw	r10,_CCR(r11);		/* save various registers */	     \
 	stw	r12,GPR12(r11
Re: [PATCH] ppc44x: support for 256K PAGE_SIZE
On Thursday 18 October 2007 14:44, you wrote:

> Sorry, this is against arch/ppc, which is bug fix only. New features should be done against arch/powerpc.

Understood. The situation here is that the boards which required these modifications are not supported in the arch/powerpc branch; this is why we made the change in arch/ppc.

> Also, I'd rather see something along the lines of hugetlbfs support instead.

Here I agree with Benjamin. Furthermore, IIRC the hugetlb file-system is supported on PPC64 architectures only; here we have PPC32.
Re: [PATCH] ppc44x: support for 256K PAGE_SIZE
On Thursday 18 October 2007 15:47, Benjamin Herrenschmidt wrote:

> > Signed-off-by: Pavel Kolesnikov [EMAIL PROTECTED]
> > Acked-by: Yuri Tikhonov [EMAIL PROTECTED]
>
> Small nit... You are posting the patch, thus you should be signing off, not ack'ing. Ack'ing means you agree with the patch but you aren't in the handling chain for it. In this case, it seems like the author is Pavel and you are forwarding it, in which case you -are- in the handling chain and should sign it off. Best would be for Pavel (if he is indeed the author) to submit it himself, though.

Thanks for the explanations. I will keep this in mind in the future.
Re: [PATCH] ppc44x: support for 256K PAGE_SIZE
On Thursday 18 October 2007 16:12, Benjamin Herrenschmidt wrote:

> > I always reserve the right to change my mind. If something makes sense and the code is decent enough then it might very well be acceptable. Requiring a modified binutils makes me a bit nervous though.
>
> From a kernel point of view, I totally don't care about the modified binutils to build userspace, as long as it's not required to build the kernel and that option is not enabled by default (and is explicitly documented as having that requirement). If it is necessary for building the kernel, then I'm a bit cooler about the whole thing; indeed, the max page size needs to be added at least as a command-line or linker-script parameter so a different build of binutils isn't needed.

No, the 256K-page-sized kernel is built using the standard binutils. Modifications to binutils are necessary for user-space applications only. And for the libraries as well.
Re: [PATCH] ppc44x: support for 256K PAGE_SIZE
On Thursday 18 October 2007 17:25, Josh Boyer wrote:

> > Understood. The situation here is that the boards which required these modifications are not supported in the arch/powerpc branch; this is why we made the change in arch/ppc.
>
> Bit of a dilemma then. What board exactly?

These are the Katmai and Yucca PPC440SPe-based boards (from AMCC).

> > Also, I'd rather see something along the lines of hugetlbfs support instead.
> > Here I agree with Benjamin. Furthermore, IIRC the hugetlb file-system is supported on PPC64 architectures only; here we have PPC32.
>
> Well, that needs fixing anyway, but ok. Also, is the modified binutils only required for userspace to take advantage here? Seems so, but I'd just like to be sure.

You are right, for userspace only.