Page fault scalability patch V18: Overview
Is there any chance that this patchset could go into mm now? This has been
discussed since last August.

Changelog:

V17->V18 Rediff against 2.6.11-rc5-bk4
V16->V17 Do not increment page_count in do_wp_page. Performance data
         posted.
V15->V16 of this patch: Redesign to allow full fallback for architectures
         that do not support atomic operations.

An introduction to what this patch does and a patch archive can be found on
http://oss.sgi.com/projects/page_fault_performance. The archive also has the
results of various performance tests (LMBench, Microbenchmark and kernel
compiles).

The basic approach in this patchset is the same as used in SGI's 2.4.X based
kernels which have been in production use in ProPack 3 for a long time.

The patchset is composed of 4 patches (and was tested against
2.6.11-rc5-bk4):

1/4: ptep_cmpxchg and ptep_xchg to avoid intermittent zeroing of ptes

	The current way of synchronizing with the CPU or arch specific
	interrupts updating page table entries is to first set a pte
	to zero before writing a new value. This patch uses ptep_xchg
	and ptep_cmpxchg to avoid writing the zero for certain
	configurations.

	The patch introduces CONFIG_ATOMIC_TABLE_OPS that may be enabled
	as an experimental feature during kernel configuration if the
	hardware is able to support atomic operations and if an SMP kernel
	is being configured. A Kconfig update for i386, x86_64 and ia64
	has been provided. On i386 this option is restricted to CPUs
	better than a 486 and non-PAE mode (that way all the cmpxchg
	issues on old i386 CPUs and the problems with 64bit atomic
	operations on recent i386 CPUs are avoided).

	If CONFIG_ATOMIC_TABLE_OPS is not set then ptep_xchg and
	ptep_cmpxchg are realized by falling back to clearing a pte
	before updating it.

	The patch does not change the use of mm->page_table_lock and the
	only performance improvement is the replacement of
	xchg-with-zero-and-then-write-new-pte-value with an xchg with
	the new value for SMP on some architectures if
	CONFIG_ATOMIC_TABLE_OPS is configured. It should not do anything
	major to VM operations.

2/4: Macros for mm counter manipulation

	There are various approaches to handling mm counters if the
	page_table_lock is no longer acquired. This patch defines macros
	in include/linux/sched.h to handle these counters and makes sure
	that these macros are used throughout the kernel to access and
	manipulate rss and anon_rss. There should be no change to the
	generated code as a result of this patch.

3/4: Drop the first use of the page_table_lock in handle_mm_fault

	The patch introduces two new functions:

	page_table_atomic_start(mm), page_table_atomic_stop(mm)

	that fall back to the use of the page_table_lock if
	CONFIG_ATOMIC_TABLE_OPS is not defined.

	If CONFIG_ATOMIC_TABLE_OPS is defined those functions may
	be used to prep the CPU for atomic table ops (i386 in PAE mode
	may e.g. get the MMX register ready for 64bit atomic ops) but
	are simply empty by default.

	Two operations may then be performed on the page table without
	acquiring the page table lock:

	a) updating access bits in a pte
	b) anonymous read faults installing a mapping to the zero page.

	All counters are still protected with the page_table_lock, thus
	avoiding any issues there.

	Some additional statistics are added to /proc/meminfo. The patch
	also counts spurious faults with no effect. There is a
	surprisingly high number of those on ia64 (used to populate the
	cpu caches with the pte??)

4/4: Drop the use of the page_table_lock in do_anonymous_page

	The second acquisition of the page_table_lock is removed
	from do_anonymous_page, which makes the anonymous write fault
	possible without the page_table_lock.

	The macros for manipulating rss and anon_rss in
	include/linux/sched.h are changed if CONFIG_ATOMIC_TABLE_OPS is
	set to use atomic operations for rss and anon_rss (safest
	solution for now, other solutions may easily be implemented by
	changing those macros).

	This patch typically yields significant increases in page fault
	performance for threaded applications on SMP systems.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Page fault scalability patch V18: atomic pte ops, pte_cmpxchg and pte_xchg
The current way of updating ptes in the Linux vm includes first clearing
a pte before setting it to another value. The clearing is performed while
holding the page_table_lock to ensure that the entry will not be modified
by the CPU directly (clearing the pte clears the present bit),
by an arch specific interrupt handler or another page fault handler
running on another CPU. This approach is necessary for some
architectures that cannot perform atomic updates of page table entries.

If a page table entry is cleared then a second CPU may generate a page
fault for that entry. The fault handler on the second CPU will then
attempt to acquire the page_table_lock and wait until the first CPU has
completed updating the page table entry. The fault handler on the second
CPU will then discover that everything is ok and simply do nothing (apart
from incrementing the counters for a minor fault and marking the page
again as accessed).

However, most architectures actually support atomic operations on page
table entries. The use of atomic operations on page table entries would
allow the update of a page table entry in a single atomic operation
instead of writing to the page table entry twice. There would also be no
danger of generating a spurious page fault on other CPUs.

The following patch introduces two new atomic operations ptep_xchg and
ptep_cmpxchg that may be provided by an architecture. The fallback in
include/asm-generic/pgtable.h is to simulate both operations through the
existing ptep_get_and_clear function. So there is essentially no change if
atomic operations on ptes have not been defined. Architectures that do
not support atomic operations on ptes may continue to use the clearing of
a pte for locking type purposes.

Atomic operations may be enabled in the kernel configuration on
i386, ia64 and x86_64 if a suitable CPU is configured in SMP mode.
Generic atomic definitions for ptep_xchg and ptep_cmpxchg
have been provided based on the existing xchg() and cmpxchg() functions
that already work atomically on many platforms. It is very
easy to implement this for any architecture by adding the appropriate
definitions to arch/xx/Kconfig.

The provided generic atomic functions may be overridden as usual by
defining the appropriate __HAVE_ARCH_xxx constant and providing an
implementation.

My aim to reduce the use of the page_table_lock in the page fault handler
relies on a pte never being clear if the pte is in use, even when the
page_table_lock is not held. Clearing a pte before setting it to another
value could result in a situation in which a fault generated by
another cpu could install a pte which is then immediately overwritten by
the first CPU setting the pte to a valid value again. This patch is
important for future work on reducing the use of spinlocks in the vm.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

Index: linux-2.6.10/mm/rmap.c
===
--- linux-2.6.10.orig/mm/rmap.c	2005-02-24 19:41:50.0 -0800
+++ linux-2.6.10/mm/rmap.c	2005-02-24 19:42:12.0 -0800
@@ -575,11 +575,6 @@ static int try_to_unmap_one(struct page

 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address);
-	pteval = ptep_clear_flush(vma, address, pte);
-
-	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
-		set_page_dirty(page);

 	if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page->private };
@@ -594,11 +589,15 @@ static int try_to_unmap_one(struct page
 			list_add(&mm->mmlist, &init_mm.mmlist);
 			spin_unlock(&mmlist_lock);
 		}
-		set_pte(pte, swp_entry_to_pte(entry));
+		pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
 		mm->anon_rss--;
-	}
+	} else
+		pteval = ptep_clear_flush(vma, address, pte);

+	/* Move the dirty bit to the physical page now that the pte is gone. */
+	if (pte_dirty(pteval))
+		set_page_dirty(page);
 	mm->rss--;
 	acct_update_integrals();
 	page_remove_rmap(page);
@@ -691,15 +690,15 @@ static void try_to_unmap_cluster(unsigne
 		if (ptep_clear_flush_young(vma, address, pte))
 			continue;

-		/* Nuke the page table entry. */
 		flush_cache_page(vma, address);
-		pteval = ptep_clear_flush(vma, address, pte);

 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))
-			set_pte(pte, pgoff_to_pte(page->index));
+			pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+		else
+			pteval = ptep_clear_flush(vma, address, pte);

-		/* Move the
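
The difference between the clear-then-set fallback and a single atomic exchange can be illustrated in user space. The following is only a sketch under stated assumptions: it uses C11 atomics on a plain 64-bit integer instead of a real pte, and every name prefixed with demo_ is hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a page table entry; bit 0 plays "present". */
typedef _Atomic uint64_t demo_pte_t;
#define DEMO_PTE_PRESENT 1ULL

/*
 * Fallback style: clear the entry, then write the new value.
 * Between the two steps other CPUs can observe a zero entry.
 */
static uint64_t demo_ptep_clear_then_set(demo_pte_t *ptep, uint64_t newval)
{
	uint64_t old = atomic_exchange(ptep, 0);	/* entry reads as 0 here */

	atomic_store(ptep, newval);
	return old;
}

/* Atomic style: one exchange, the entry is never observed as zero. */
static uint64_t demo_ptep_xchg(demo_pte_t *ptep, uint64_t newval)
{
	return atomic_exchange(ptep, newval);
}

/*
 * Install newval only if the entry still holds oldval.
 * Returns nonzero on success, 0 if somebody else changed the entry.
 */
static int demo_ptep_cmpxchg(demo_pte_t *ptep, uint64_t oldval, uint64_t newval)
{
	return atomic_compare_exchange_strong(ptep, &oldval, newval);
}
```

Both update styles return the previous value, so a caller can still move the dirty bit to the page afterwards; only the single-exchange variant avoids the window in which another CPU could fault on a cleared entry.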
Re: Page fault scalability patch V18: abstract rss counter ops
This patch extracts all the operations on rss into definitions in
include/linux/sched.h. All rss operations are performed through
the following three macros:

get_mm_counter(mm, member)		- Obtain the value of a counter
set_mm_counter(mm, member, value)	- Set the value of a counter
update_mm_counter(mm, member, value)	- Add a value to a counter

The simple definitions provided in this patch result in no change to
the generated code.

With this patch it becomes easier to add new counters and it is possible
to redefine the method of counter handling (e.g. the page fault scalability
patches may want to use atomic operations or split rss).

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

Index: linux-2.6.10/include/linux/sched.h
===
--- linux-2.6.10.orig/include/linux/sched.h	2005-02-24 19:41:49.0 -0800
+++ linux-2.6.10/include/linux/sched.h	2005-02-24 19:42:17.0 -0800
@@ -203,6 +203,10 @@ arch_get_unmapped_area_topdown(struct fi
 extern void arch_unmap_area(struct vm_area_struct *area);
 extern void arch_unmap_area_topdown(struct vm_area_struct *area);

+#define set_mm_counter(mm, member, value) (mm)->member = (value)
+#define get_mm_counter(mm, member) ((mm)->member)
+#define update_mm_counter(mm, member, value) (mm)->member += (value)
+#define MM_COUNTER_T unsigned long

 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
@@ -219,7 +223,7 @@ struct mm_struct {
 	atomic_t mm_count;			/* How many references to "struct mm_struct" (users count as 1) */
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
-	spinlock_t page_table_lock;		/* Protects page tables, mm->rss, mm->anon_rss */
+	spinlock_t page_table_lock;		/* Protects page tables and some counters */

 	struct list_head mmlist;		/* List of maybe swapped mm's. These are globally strung
 						 * together off init_mm.mmlist, and are protected
@@ -229,9 +233,13 @@ struct mm_struct {
 	unsigned long start_code, end_code, start_data, end_data;
 	unsigned long start_brk, brk, start_stack;
 	unsigned long arg_start, arg_end, env_start, env_end;
-	unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm;
+	unsigned long total_vm, locked_vm, shared_vm;
 	unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes;

+	/* Special counters protected by the page_table_lock */
+	MM_COUNTER_T rss;
+	MM_COUNTER_T anon_rss;
+
 	unsigned long saved_auxv[42]; /* for /proc/PID/auxv */

 	unsigned dumpable:1;
Index: linux-2.6.10/mm/memory.c
===
--- linux-2.6.10.orig/mm/memory.c	2005-02-24 19:42:12.0 -0800
+++ linux-2.6.10/mm/memory.c	2005-02-24 19:42:17.0 -0800
@@ -313,9 +313,9 @@ copy_one_pte(struct mm_struct *dst_mm,
 		pte = pte_mkclean(pte);
 	pte = pte_mkold(pte);
 	get_page(page);
-	dst_mm->rss++;
+	update_mm_counter(dst_mm, rss, 1);
 	if (PageAnon(page))
-		dst_mm->anon_rss++;
+		update_mm_counter(dst_mm, anon_rss, 1);
 	set_pte(dst_pte, pte);
 	page_dup_rmap(page);
 }
@@ -517,7 +517,7 @@ static void zap_pte_range(struct mmu_gat
 			if (pte_dirty(pte))
 				set_page_dirty(page);
 			if (PageAnon(page))
-				tlb->mm->anon_rss--;
+				update_mm_counter(tlb->mm, anon_rss, -1);
 			else if (pte_young(pte))
 				mark_page_accessed(page);
 			tlb->freed++;
@@ -1340,13 +1340,14 @@ static int do_wp_page(struct mm_struct *
 	spin_lock(&mm->page_table_lock);
 	page_table = pte_offset_map(pmd, address);
 	if (likely(pte_same(*page_table, pte))) {
-		if (PageAnon(old_page))
-			mm->anon_rss--;
+		if (PageAnon(old_page))
+			update_mm_counter(mm, anon_rss, -1);
 		if (PageReserved(old_page)) {
-			++mm->rss;
+			update_mm_counter(mm, rss, 1);
 			acct_update_integrals();
 			update_mem_hiwater();
 		} else
+
 			page_remove_rmap(old_page);
 		break_cow(vma, new_page, address, page_table);
 		lru_cache_add_active(new_page);
@@ -1750,7 +1751,7 @@ static int do_swap_page(struct mm_struct
 	if (vm_swap_full())
 		remove_exclusive_swap_page(page);

-	mm->rss++;
+	update_mm_counter(mm, rss, 1);
 	acct_update_integrals();
 	update_mem_hiwater();

@@ -1817,7 +1818,7 @@
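
To see how call sites look once every access goes through the three macros, here is a small user-space sketch. The macro bodies follow the ones in the patch (with extra parentheses); struct demo_mm and the two demo_ helper functions are hypothetical and only stand in for mm_struct and real kernel call sites.

```c
#include <assert.h>

#define MM_COUNTER_T unsigned long

/* Hypothetical cut-down mm_struct that holds only the two counters. */
struct demo_mm {
	MM_COUNTER_T rss;
	MM_COUNTER_T anon_rss;
};

/* Plain (non-atomic) definitions, as used when a lock protects the counters. */
#define set_mm_counter(mm, member, value)	((mm)->member = (value))
#define get_mm_counter(mm, member)		((mm)->member)
#define update_mm_counter(mm, member, value)	((mm)->member += (value))

/* Hypothetical call site: an anonymous page gets mapped. */
static void demo_map_anon_page(struct demo_mm *mm)
{
	update_mm_counter(mm, rss, 1);
	update_mm_counter(mm, anon_rss, 1);
}

/* Hypothetical call site: an anonymous page gets unmapped. */
static void demo_unmap_anon_page(struct demo_mm *mm)
{
	update_mm_counter(mm, anon_rss, -1);
	update_mm_counter(mm, rss, -1);
}
```

Because the call sites never touch mm->rss directly, a later patch can switch the counter type by redefining the macros and MM_COUNTER_T in one place.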
Page fault scalability patch V18: No page table lock in do_anonymous_page
Do not use the page_table_lock in do_anonymous_page. This will significantly
increase the parallelism in the page fault handler in SMP systems. The patch
also modifies the definitions of the _mm_counter functions so that rss and
anon_rss become atomic.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

Index: linux-2.6.10/mm/memory.c
===
--- linux-2.6.10.orig/mm/memory.c	2005-02-24 19:42:21.0 -0800
+++ linux-2.6.10/mm/memory.c	2005-02-24 19:42:25.0 -0800
@@ -1832,12 +1832,12 @@ do_anonymous_page(struct mm_struct *mm,
 						 vma->vm_page_prot)),
 				      vma);
-		spin_lock(&mm->page_table_lock);
+		page_table_atomic_start(mm);

 		if (!ptep_cmpxchg(page_table, orig_entry, entry)) {
 			pte_unmap(page_table);
 			page_cache_release(page);
-			spin_unlock(&mm->page_table_lock);
+			page_table_atomic_stop(mm);
 			inc_page_state(cmpxchg_fail_anon_write);
 			return VM_FAULT_MINOR;
 		}
@@ -1855,7 +1855,7 @@ do_anonymous_page(struct mm_struct *mm,
 	update_mmu_cache(vma, addr, entry);
 	pte_unmap(page_table);
-	spin_unlock(&mm->page_table_lock);
+	page_table_atomic_stop(mm);
 	return VM_FAULT_MINOR;
 }

Index: linux-2.6.10/include/linux/sched.h
===
--- linux-2.6.10.orig/include/linux/sched.h	2005-02-24 19:42:17.0 -0800
+++ linux-2.6.10/include/linux/sched.h	2005-02-24 19:42:25.0 -0800
@@ -203,10 +203,26 @@ arch_get_unmapped_area_topdown(struct fi
 extern void arch_unmap_area(struct vm_area_struct *area);
 extern void arch_unmap_area_topdown(struct vm_area_struct *area);

+#ifdef CONFIG_ATOMIC_TABLE_OPS
+/*
+ * Atomic page table operations require that the counters are also
+ * incremented atomically
+*/
+#define set_mm_counter(mm, member, value) atomic_set(&(mm)->member, value)
+#define get_mm_counter(mm, member) ((unsigned long)atomic_read(&(mm)->member))
+#define update_mm_counter(mm, member, value) atomic_add(value, &(mm)->member)
+#define MM_COUNTER_T atomic_t
+
+#else
+/*
+ * No atomic page table operations. Counters are protected by
+ * the page table lock
+ */
 #define set_mm_counter(mm, member, value) (mm)->member = (value)
 #define get_mm_counter(mm, member) ((mm)->member)
 #define update_mm_counter(mm, member, value) (mm)->member += (value)
 #define MM_COUNTER_T unsigned long
+#endif

 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
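
The lock-free write fault combines a compare-and-exchange on the pte with atomic counter updates. The sketch below imitates that pattern in user space with C11 atomics. It is an assumption-laden illustration, not the kernel code: the demo_ names are hypothetical, the pte is a bare 64-bit integer, and where exactly the real patch updates rss and anon_rss is not visible in the hunks above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical mm with atomic counters (MM_COUNTER_T as an atomic type). */
struct demo_mm {
	atomic_long rss;
	atomic_long anon_rss;
};

/* Hypothetical statistic, modeled on cmpxchg_fail_anon_write. */
static atomic_long demo_cmpxchg_fail_anon_write;

/*
 * Try to install 'entry' where 'orig_entry' was seen earlier.
 * Returns 1 if this caller installed the entry, 0 if another CPU
 * changed the pte first. In the losing case the caller would free
 * its freshly allocated page and report a minor fault.
 */
static int demo_anon_write_fault(struct demo_mm *mm, _Atomic uint64_t *ptep,
				 uint64_t orig_entry, uint64_t entry)
{
	if (!atomic_compare_exchange_strong(ptep, &orig_entry, entry)) {
		atomic_fetch_add(&demo_cmpxchg_fail_anon_write, 1);
		return 0;
	}
	/* No lock is held here, so the counters must be updated atomically. */
	atomic_fetch_add(&mm->rss, 1);
	atomic_fetch_add(&mm->anon_rss, 1);
	return 1;
}
```

The losing path touches no counters at all, which is why a failed exchange can simply be treated as a minor fault and retried by the hardware.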
Re: Page fault scalability patch V18: Drop first acquisition of ptl
The page fault handler attempts to use the page_table_lock only for short
time periods. It repeatedly drops and reacquires the lock. When the lock is
reacquired, checks are made if the underlying pte has changed before
replacing the pte value. These locations are a good fit for the use of
ptep_cmpxchg.

The following patch removes the first acquisition of the page_table_lock
and uses atomic operations on the page table instead. A section
using atomic pte operations is begun with

	page_table_atomic_start(struct mm_struct *)

and ends with

	page_table_atomic_stop(struct mm_struct *)

Both of these become spin_lock(page_table_lock) and
spin_unlock(page_table_lock) if atomic page table operations are not
configured (CONFIG_ATOMIC_TABLE_OPS undefined).

The atomic operations with pte_xchg and pte_cmpxchg only work for the lowest
layer of the page table. Higher layers may also be populated in an atomic
way by defining pmd_test_and_populate() etc. The generic versions of these
functions fall back to the page_table_lock (populating higher level page
table entries is rare and therefore this is not likely to be performance
critical). For ia64 the definition of higher level atomic operations is
included.

This patch depends on the pte_cmpxchg patch being applied first and will
only remove the first use of the page_table_lock in the page fault handler.
This will allow the following page table operations without acquiring
the page_table_lock:

1. Updating of access bits (handle_mm_fault)
2. Anonymous read faults (do_anonymous_page)

The page_table_lock is still acquired for creating a new pte for an
anonymous write fault and therefore the problems with rss that were
addressed by splitting rss into the task structure do not yet occur.

The patch also adds some diagnostic features by counting the number of
cmpxchg failures (useful for verifying that this patch works right) and the
number of faults received that led to no change in the page table.
Statistics may be viewed via /proc/meminfo.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

Index: linux-2.6.10/mm/memory.c
===
--- linux-2.6.10.orig/mm/memory.c	2005-02-24 19:42:17.0 -0800
+++ linux-2.6.10/mm/memory.c	2005-02-24 19:42:21.0 -0800
@@ -36,6 +36,8 @@
  *		([EMAIL PROTECTED])
  *
  * Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ * Jan 2005	Scalability improvement by reducing the use and the length of time
+ *		the page table lock is held (Christoph Lameter)
  */

 #include <linux/kernel_stat.h>
@@ -1275,8 +1277,8 @@ static inline void break_cow(struct vm_a
  * change only once the write actually happens. This avoids a few races,
  * and potentially makes it more efficient.
  *
- * We hold the mm semaphore and the page_table_lock on entry and exit
- * with the page_table_lock released.
+ * We hold the mm semaphore and have started atomic pte operations,
+ * exit with pte ops completed.
  */
 static int do_wp_page(struct mm_struct *mm, struct vm_area_struct * vma,
 	unsigned long address, pte_t *page_table, pmd_t *pmd, pte_t pte)
@@ -1294,7 +1296,7 @@ static int do_wp_page(struct mm_struct *
 		pte_unmap(page_table);
 		printk(KERN_ERR "do_wp_page: bogus page at address %08lx\n",
 				address);
-		spin_unlock(&mm->page_table_lock);
+		page_table_atomic_stop(mm);
 		return VM_FAULT_OOM;
 	}
 	old_page = pfn_to_page(pfn);
@@ -1306,22 +1308,25 @@ static int do_wp_page(struct mm_struct *
 			flush_cache_page(vma, address);
 			entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)),
 					      vma);
-			ptep_set_access_flags(vma, address, page_table, entry, 1);
-			update_mmu_cache(vma, address, entry);
+			/*
+			 * If the bits are not updated then another fault
+			 * will be generated with another chance of updating.
+			 */
+			if (ptep_cmpxchg(page_table, pte, entry))
+				update_mmu_cache(vma, address, entry);
+			else
+				inc_page_state(cmpxchg_fail_flag_reuse);
 			pte_unmap(page_table);
-			spin_unlock(&mm->page_table_lock);
+			page_table_atomic_stop(mm);
 			return VM_FAULT_MINOR;
 		}
 	}
 	pte_unmap(page_table);
+	page_table_atomic_stop(mm);

 	/*
 	 * Ok, we need to copy. Oh, well..
 	 */
-	if (!PageReserved(old_page))
-		page_cache_get(old_page);
-	spin_unlock(&mm->page_table_lock);
-
 	if (unlikely(anon_vma_prepare(vma)))
 		goto no_new_page;
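
The start/stop pair described above can be pictured as two functions whose bodies depend on a configuration switch. The following user-space sketch makes several assumptions: DEMO_ATOMIC_TABLE_OPS stands in for CONFIG_ATOMIC_TABLE_OPS, a C11 atomic_flag spin loop stands in for the kernel spinlock, and the lock_acquisitions field exists only so the effect can be observed.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical mm that carries only the lock and a debug counter. */
struct demo_mm {
	atomic_flag page_table_lock;
	unsigned long lock_acquisitions;
};

#ifdef DEMO_ATOMIC_TABLE_OPS
/* Atomic pte ops configured: nothing to prepare by default. */
static void demo_page_table_atomic_start(struct demo_mm *mm) { (void)mm; }
static void demo_page_table_atomic_stop(struct demo_mm *mm) { (void)mm; }
#else
/* Fallback: the section is simply protected by the page table lock. */
static void demo_page_table_atomic_start(struct demo_mm *mm)
{
	while (atomic_flag_test_and_set_explicit(&mm->page_table_lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */
	mm->lock_acquisitions++;
}

static void demo_page_table_atomic_stop(struct demo_mm *mm)
{
	atomic_flag_clear_explicit(&mm->page_table_lock, memory_order_release);
}
#endif
```

Callers are written once against the start/stop pair, so the same fault path compiles to a locked section or to a lock-free one depending on the configuration.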
Re: [Lse-tech] Re: A common layer for Accounting packages
Just a thought - perhaps you could see if Jay can test the performance
scaling of these changes on larger systems (8 to 64 CPUs, give or take,
small for SGI, but big for some vendors.)

Things like a global lock, for example, might be harmless on smaller
systems, but hurt big time on bigger systems. I don't know if you have
any such constructs ... perhaps this doesn't matter.

At the very least, we need to know that performance and scaling are not
significantly impacted, on systems not using accounting, either because
it is obvious from the code, or because someone has tested it.

And if performance or scaling was impacted when accounting was enabled,
then at least we would want to know how much performance was impacted,
so that users would know what to expect when they use accounting.

> the process-creation/destruction performance on following three environment.

I think this is a good choice of what to measure, and where.
Thank-you.

> kernel was also locked up after 366th-fork()

I have no idea what this is -- good luck finding it.

--
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson [EMAIL PROTECTED] 1.650.933.1373, 1.925.600.0401
Re: Complicated networking problem
On Wed, 2005-03-02 at 13:27 +1000, Jarne Cook wrote:
> On Tuesday 01 March 2005 12:35, you wrote:
> > On Monday 28 February 2005 21:02, [EMAIL PROTECTED] wrote:
> > > On Mon, 28 Feb 2005 14:59:31 +1000, Jarne Cook said:
> > > > They are both using dhcp to the same simple network. That's right.
> > > > Same network. They both end up with gateway=192.168.0.1,
> > > > netmask=255.255.255.0. But of course they do not have the same IP
> > > > addresses.
> > >
> > > I don't suppose your network people would be willing to change it
> > > thusly:
> > >
> > > wired ports: gateway 192.168.0.1, netmask 255.255.128.0
> > > wireless: gateway 192.168.128.1, netmask 255.255.128.0
> > >
> > > Or move the wireless up to 192.168.1.1 if they think that would
> > > confuse things too much. There's a limit to how far we should bend
> > > over backwards to support stupid networking decisions. 192.168 *is*
> > > a /16, might as well use it. ;)
> > >
> > > If they won't, you're pretty much stuck with binding applications to
> > > one interface or another.
> >
> > If the goal is to primarily use the wired link and seamlessly switch to
> > wireless then look into the bonding driver in failover mode with the
> > wired interface as primary. This way you have only one address and
> > userspace does not notice anything.
>
> Damn
>
> Having to configure the interfaces using bonding was not really the
> answer I was expecting. I did not think linux would be that rigid.
>
> I figured if poodoze is able to do it (seamlessly mind you), surely linux
> (with some tinkering) would be able to do it also. The goal was to have
> the networking on the laptop work as perfectly as crapdoze does.
>
> Perhaps I should add this topic to my list of software issues that
> no-one else cares about. man that list is getting big. maybe one day
> I'll develop the balls to get deep into the code.

Check out NetworkManager. It will do what you want.

Daniel
[PATCH] put newly registered shrinkers at the tail of the list
This way we actually shrink dentries before inodes and thus mark more
inodes reclaimable once we shake them.

--- 1.240/mm/vmscan.c	2005-02-04 01:53:32 +01:00
+++ edited/mm/vmscan.c	2005-03-02 07:09:00 +01:00
@@ -137,7 +137,7 @@ struct shrinker *set_shrinker(int seeks,
 		shrinker->seeks = seeks;
 		shrinker->nr = 0;
 		down_write(&shrinker_rwsem);
-		list_add(&shrinker->list, &shrinker_list);
+		list_add_tail(&shrinker->list, &shrinker_list);
 		up_write(&shrinker_rwsem);
 	}
 	return shrinker;
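
The effect of the one-line change is easiest to see on a tiny circular list. The sketch below reimplements just enough of a doubly linked list in user space; all demo_ names are hypothetical, and the registration order used in the usage note (dentry cache first, inode cache second) is an assumption for the sake of the example.

```c
#include <assert.h>
#include <string.h>

/* Minimal circular doubly linked list, modeled on the kernel's list_head. */
struct demo_list_head {
	struct demo_list_head *next, *prev;
};

static void demo_list_init(struct demo_list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Link 'entry' between two known neighbours. */
static void demo_list_insert(struct demo_list_head *entry,
			     struct demo_list_head *prev,
			     struct demo_list_head *next)
{
	next->prev = entry;
	entry->next = next;
	entry->prev = prev;
	prev->next = entry;
}

/* Insert right after the head: the newest entry is walked first. */
static void demo_list_add(struct demo_list_head *entry,
			  struct demo_list_head *head)
{
	demo_list_insert(entry, head, head->next);
}

/* Insert right before the head: entries are walked in registration order. */
static void demo_list_add_tail(struct demo_list_head *entry,
			       struct demo_list_head *head)
{
	demo_list_insert(entry, head->prev, head);
}

/* Hypothetical shrinker; 'list' is the first member so a cast is enough. */
struct demo_shrinker {
	struct demo_list_head list;
	const char *name;
};

static struct demo_shrinker *demo_first_shrinker(struct demo_list_head *head)
{
	return (struct demo_shrinker *)head->next;
}
```

If a "dcache" shrinker is registered before an "icache" shrinker, demo_list_add puts "icache" at the front of the walk, while demo_list_add_tail keeps "dcache" first, which is the ordering the patch is after.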
Re: cciss CSMI via sysfs for 2.6
On Fri, Feb 18, 2005 at 12:05:52PM -0800, Greg KH wrote:
> On Fri, Feb 18, 2005 at 07:46:28PM +, Christoph Hellwig wrote:
> > > /*
> > > + * sysfs stuff
> > > + * this should be moved to it's own file, maybe cciss_sysfs.h
> > > + */
> > > +
> > > +static ssize_t cciss_firmver_show(struct device *dev, char *buf)
> > > +{
> > > +	ctlr_info_t *h = dev->driver_data;
> > > +	return sprintf(buf,"%c%c%c%c\n", h->firm_ver[0], h->firm_ver[1],
> > > +			h->firm_ver[2], h->firm_ver[3]);
> > > +}
> >
> > I really wish we had a common firmver release attribute in the driver
> > core, as mentioned in the fc transport class thread. Greg?
>
> For a device? It seems a huge overkill to add this attribute for
> _every_ device in the system, when only a small minority can actually
> use it. Just put it as a default scsi or transport class attribute
> instead.

it's not related to scsi or a transport at all. I'd rather have the
notion of optional generic attributes so that every driver that wants
to publish it does so in the same way.
Re: question about sockfd_lookup( )
I can't use sockfd_put(sock) directly.
I traced its code; the code is

extern __inline__ void sockfd_put(struct socket *sock)
{
	fput(sock->file);
}

so I use fput(sock->file), but it has problems too:

1) executing ls in the ftp session also blocks
2) the kernel prints "socki_lookup: socket file changed!"
3) executing "ftp localhost" after rmmod crashes

And why is sockfd_put needed after sockfd_lookup?

Thanks again

MingChieh Chang
Taiwan

===

On Tue, 01 Mar 2005 08:56:19 +0100, Eric Dumazet [EMAIL PROTECTED] wrote:
> Hi
>
> Try adding sockfd_put(sock) ;
>
> MingJie Chang wrote:
> > Dear all,
> >
> > I want to get socket information by the sockfd while accepting,
> > so I wrote a module to test sockfd_lookup(),
> > but I got some problems when I tested it.
> >
> > I hope someone can help me...
> > Thank you
> >
> > The following text is my code and error message
> > ===
> > === code ===
> >
> > int my_socketcall(int call,unsigned long *args)
> > {
> >	int ret,err;
> >	struct socket * sock;
> >
> >	ret = run_org_socket_call(call,args); //original sys_socketcall()
> >
> >	if(call==SYS_ACCEPT&&ret>=0)
> >	{
> >		sock=sockfd_lookup(ret,&err);
> >		printk("lookup done\n");
>
>		if (sock) sockfd_put(sock) ;
>
> >	}
> >	return ret;
> > }
>
> Eric Dumazet
[PATCH] aoe: fix printk warning (sparc64)
aoeblk: mac_addr() returns u64, coerce to unsigned long long to printk it:
(sparc64 build warning)

drivers/block/aoe/aoeblk.c:245: warning: long long unsigned int format, u64 arg (arg 2)
drivers/block/aoe/aoeblk.c:31: warning: long long unsigned int format, u64 arg (arg 4)

cross-compile results:
https://www.osdl.org/plm-cgi/plm?module=patch_info&patch_id=4239

Signed-off-by: Randy Dunlap [EMAIL PROTECTED]

diffstat:=
 drivers/block/aoe/aoeblk.c |    6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff -Naurp ./drivers/block/aoe/aoeblk.c~aoe_printk ./drivers/block/aoe/aoeblk.c
--- ./drivers/block/aoe/aoeblk.c~aoe_printk	2005-02-25 10:54:42.0 -0800
+++ ./drivers/block/aoe/aoeblk.c	2005-03-01 17:22:29.735503376 -0800
@@ -28,7 +28,8 @@ static ssize_t aoedisk_show_mac(struct g
 {
 	struct aoedev *d = disk->private_data;

-	return snprintf(page, PAGE_SIZE, "%012llx\n", mac_addr(d->addr));
+	return snprintf(page, PAGE_SIZE, "%012llx\n",
+			(unsigned long long)mac_addr(d->addr));
 }
 static ssize_t aoedisk_show_netif(struct gendisk * disk, char *page)
 {
@@ -241,7 +242,8 @@ aoeblk_gdalloc(void *vp)
 	aoedisk_add_sysfs(d);

 	printk(KERN_INFO "aoe: %012llx e%lu.%lu v%04x has %llu "
-		"sectors\n", mac_addr(d->addr), d->aoemajor, d->aoeminor,
+		"sectors\n", (unsigned long long)mac_addr(d->addr),
+		d->aoemajor, d->aoeminor,
 		d->fw_ver, (long long)d->ssize);
 }

---
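
The same pattern applies in user space whenever a fixed-width 64-bit value is printed with %llx: depending on the platform, the 64-bit type may be unsigned long rather than unsigned long long, so an explicit cast keeps the format and the argument in agreement. A minimal sketch, in which demo_format_mac is a hypothetical helper and not part of the driver:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a 48-bit MAC address held in a 64-bit integer as 12 hex
 * digits plus a newline. The cast makes the argument type match
 * the %llx conversion regardless of how uint64_t is defined.
 */
static int demo_format_mac(char *buf, size_t len, uint64_t mac)
{
	return snprintf(buf, len, "%012llx\n", (unsigned long long)mac);
}
```

Called with a 32-byte buffer and the value 0x0011223344aa, the helper writes "0011223344aa" followed by a newline and returns 13.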
O_DIRECT on 2.4 ext3
Hi,

I tried to read from a regular ext3 file opened as O_DIRECT, but got the
"Invalid argument" error. Running the same test program on a block device
succeeded. uname -a shows

Linux *** 2.4.27-2-686-smp #1 SMP Thu Jan 20 11:02:39 JST 2005 i686 GNU/Linux

My test case is

#include <sys/types.h>
#include <sys/stat.h>
#include <asm/fcntl.h>
#include <stdio.h>
#include <assert.h>

#define BLK (4096U)

main()
{
	char buf[BLK * 2];
	char *p = (char*)((((unsigned)buf) + (BLK-1)) & ~(BLK-1));
	int fd, l;

	fprintf(stderr, "buf = %p, p = %p\n", buf, p);

	if((fd=open("sbd0", O_RDONLY|O_DIRECT)) < 0) {
		perror("open");
		assert(0);
	}

	if((l=pread(fd, p, BLK, 0)) < 0) {
		perror("pread");
		assert(0);
	}
	fprintf(stderr, "pread returns %d\n", l);
	close (fd);
}

Does anyone know what's going on?

Thanks,
-Junfeng
Undefined symbols in 2.6.11-rc5-mm1
Hi everybody, I just joined the LKML! Don't worry, this is not just a test message, I do actually have something to say. I just compiled 2.6.11-rc5-mm1 and got undefined symbols match_int, match_octal, match_token, and match_strdup in several modules. This is using binutils 2.15 and gcc 3.4.4 from Debian. I grepped around and found those functions in lib/parser.c, so I just looked at the output of make V=1 and invoked ld manually, adding in lib/lib.a, and the modules work fine now. However, I don't know enough about the kernel build process to make a patch to fix this, so I'm just notifying people of the problem. BTW, I just got a new hard disk and put Reiser4 on it. It works great! Keep up the good work guys!
Re: [PATCH 2.6.11-rc3 01/11] ide: task_end_request() fix
Bartlomiej Zolnierkiewicz wrote:
> If somebody implements SG_IO ioctl and SCSI command pass-through
> from libata for IDE driver (and add possibility for discrete taskfiles),
> we can just deprecate HDIO_DRIVE_TASKFILE, forget about it
> and some time later remove this FPOS.

Can you explain what you mean by "add possibility for discrete taskfiles"?

	Jeff
Re: [patch ide-dev 8/9] make ide_task_ioctl() use REQ_DRIVE_TASKFILE
Bartlomiej Zolnierkiewicz wrote:
> Yes but it seems that you've assumed that ioctl == flagged taskfile
> and fs/internal == normal taskfile which is _not_ what I aim for.
>
> I want fully-flagged taskfile handling like flagged_taskfile() and hot
> path simpler taskfile handling like do_rw_taskfile() (at least for now -
> we can remove hot path later) where both can be used for
> fs/internal/ioctl requests (depending on the flags).

There is no effective difference in performance between

	writeb()
	writeb()
	writeb()
	writeb()

and

	if (bit 1)
		writeb()
	if (bit 2)
		writeb()
	if (bit 3)
		writeb()
	if (bit 4)
		writeb()

The cost of a repeated bit test on the same unsigned long is _zero_.
It's already in L1 cache. The I/Os are slow, and adding bit tests will
not measurably decrease performance. (this is the reason why I do not
object to using ioread32() and iowrite32()... it just adds a simple test)

Plus, it is better to have a single path for all taskfiles, to ensure
that the path is well-tested.

libata's ->tf_load() and ->tf_read() hooks should be updated to use the
more fine-grained flags that Tejun is proposing.

Note that on SATA, this is largely irrelevant. The functions
ata_tf_read() and ata_tf_load() should be updated for flagged taskfiles,
because these will be used with PATA drivers.

The hooks implemented in individual SATA drivers will not be updated.
The reason is that SATA transmits an entire copy of the taskfile to/from
the device all at once, in the form of a Frame Information Structure
(FIS) -- essentially a SATA packet.

	Jeff
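
A single flagged load path of the kind argued for above could look roughly like the following user-space sketch. Everything here is hypothetical: the flag names, the register layout and demo_writeb() are stand-ins, and real taskfile code writes to device registers rather than to an array.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-register "valid" flags for a taskfile. */
enum {
	DEMO_TF_FEATURE	= 1u << 0,
	DEMO_TF_NSECT	= 1u << 1,
	DEMO_TF_LBAL	= 1u << 2,
	DEMO_TF_LBAM	= 1u << 3,
};

struct demo_taskfile {
	unsigned long flags;	/* which registers should be written */
	uint8_t feature, nsect, lbal, lbam;
};

/* Stand-in for the device: four registers and a count of writes. */
static uint8_t demo_regs[4];
static unsigned int demo_writes;

static void demo_writeb(uint8_t val, unsigned int reg)
{
	demo_regs[reg] = val;
	demo_writes++;
}

/* One path for all taskfiles: each write is guarded by a cheap bit test. */
static void demo_tf_load(const struct demo_taskfile *tf)
{
	if (tf->flags & DEMO_TF_FEATURE)
		demo_writeb(tf->feature, 0);
	if (tf->flags & DEMO_TF_NSECT)
		demo_writeb(tf->nsect, 1);
	if (tf->flags & DEMO_TF_LBAL)
		demo_writeb(tf->lbal, 2);
	if (tf->flags & DEMO_TF_LBAM)
		demo_writeb(tf->lbam, 3);
}
```

A caller that wants the "normal" behaviour sets all four flags, so the hot path and the flagged path share the same function and differ only in the flags word.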
Re: [PATCH/RFC] I/O-check interface for driver's error handling
Linas Vepstas wrote:
> I'd prefer to see it as ioerr_clear(), ioerr_read() ...
> I'd prefer pci_io_start() and pci_io_check_err()
> The names should have "pci" in them.
> I don't like ioerr_clear because it implies we are clearing the io error; we are not; we are clearing the checker for io errors.

My intention was a clear/read pair for a checker (called iochk) that checks my I/O. (A bitmask would be better for the error flag, but the bits are not defined yet.) So I agree that ioerr_clear/read() would be a good alternative. But I'd still prefer iochk_*, because it clears the checker, not the error. iochecker_* would be a bit long.

Also, I don't think this API's target needs to be limited to PCI. The name would not fit if there is a recoverable device behind some PCI-to-XXX bridge, or on some special arch that doesn't have PCI but has another recoverable bus system, or if a future bus system isn't called "pci"... Currently we would deal only with PCI, but in the future possibly not.

> Do we really need a cookie?

Some do, some don't. For example, if an arch has only a counter of error exceptions, saving the value of that counter in the cookie would make sense.

> Yes, they should be no-ops. save/restore interrupts would be a bad idea.

I expect that we should not do any operation that requires interrupts to be enabled between iochk_clear and iochk_read.

If their defaults are no-ops, device maintainers who develop their drivers on an arch where this is not implemented should be more careful. Or is there any downside other than the wasted steps?

Thanks,
H.Seto
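The counter-in-a-cookie case mentioned above could look roughly like this. It is a userspace sketch under stated assumptions: the `iocookie` type, the one-argument signatures and `arch_io_error_count` are hypothetical and are not taken from the posted patch.

```c
/* Hypothetical arch state: a running count of I/O error exceptions. */
static unsigned long arch_io_error_count;

/* Assumption: on such an arch the cookie is just a counter snapshot. */
typedef unsigned long iocookie;

/* "Clear the checker": remember where the counter stood. */
static void iochk_clear(iocookie *cookie)
{
	*cookie = arch_io_error_count;
}

/* "Read the checker": nonzero if any error was counted since the clear. */
static int iochk_read(const iocookie *cookie)
{
	return arch_io_error_count != *cookie;
}

/* Default for an arch without support: both calls do nothing, so a
 * driver developed only on such an arch never sees an error reported. */
static void iochk_clear_noop(iocookie *cookie)
{
	(void)cookie;
}

static int iochk_read_noop(const iocookie *cookie)
{
	(void)cookie;
	return 0;
}
```

The no-op pair makes the concern in the last paragraph concrete: code written against it compiles and runs, but its error branch is never exercised.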
some /proc understandings
Hello,

1) I want to know how much I can write to a /proc entry file. Is there any limitation on file size?

2) Also, how can I call a /proc entry file's proc_read_myfile function from another kernel module? What parameters do I need to pass, and how? Say I have a read function such as:

	struct myfile_data_t {
		char value[8];
	};
	struct proc_dir_entry *myfile_file;
	struct myfile_data_t myfile_data;

	int proc_read_myfile(char *page, char **start, off_t off, int count, int *eof, void *data)
	{
		int len;
		/* cast the void pointer of data to myfile_data_t */
		struct myfile_data_t *myfile_data = (struct myfile_data_t *)data;
		/* use sprintf to fill the page array with a string */
		len = sprintf(page, "%s", myfile_data->value);
		return len;
	}

Is it possible to call proc_read_myfile from another kernel module, instead of reading the file from user level?

3) Also, is the following code valid for creating /proc files with different file names, passed in through cr_proc(fname)?

	struct proc_dir_entry *entnew;

	int cr_proc(char *fname)
	{
		if ((entnew1 = create_proc_entry(fname, S_IRUGO | S_IWUSR, NULL)) == NULL)
			return -EACCES;
		entnew1->proc_fops = &proc_file_operations;
	}

	static struct file_operations proc_file_operations = {
		open: proc_open,
		release: proc_release,
		read: proc_read,
		write: proc_write,
	};

What will happen if the dynamically named files all use the same four functions above?

regards,
linux_lover
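On the shape of question 2: the read callback is an ordinary C function, so a caller that supplies its own page buffer and data pointer can invoke it directly. The following is a userspace mock only. `MOCK_PAGE_SIZE` and `read_myfile_directly()` are hypothetical, snprintf() with a bounded "%.8s" is swapped in for the poster's sprintf(), and setting `*eof` is an addition. It says nothing about how the symbol would be made visible to another module.

```c
#include <stdio.h>
#include <sys/types.h>

#define MOCK_PAGE_SIZE 4096	/* assumption: size of the caller's page buffer */

struct myfile_data_t {
	char value[8];
};

/* Same signature as the poster's callback; body bounded for safety. */
static int proc_read_myfile(char *page, char **start, off_t off,
			    int count, int *eof, void *data)
{
	struct myfile_data_t *d = data;

	(void)start;
	(void)off;
	(void)count;
	/* "%.8s" stops at 8 bytes even if value[] has no terminating NUL */
	int len = snprintf(page, MOCK_PAGE_SIZE, "%.8s", d->value);
	*eof = 1;
	return len;
}

/* Hypothetical direct caller: supplies every argument the callback expects.
 * The data pointer selects which object is read, which is also how one
 * shared callback can serve several differently named entries. */
static int read_myfile_directly(struct myfile_data_t *d, char *page)
{
	char *start = NULL;
	int eof = 0;

	return proc_read_myfile(page, &start, 0, MOCK_PAGE_SIZE, &eof, d);
}
```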
Re: [PATCH] Possible AMD8111e free irq issue
Panagiotis Issaris wrote:
> Hi,
>
> It seems to me that if in the amd8111e_open() function dev->irq isn't zero and the irq request succeeds it might not get released anymore.
>
> Specifically, on failure of the amd8111e_restart() call the function returns -ENOMEM without releasing the irq. The amd8111e_restart() function can fail because of various pci_alloc_consistent() and dev_alloc_skb() calls in amd8111e_init_ring() which is being called by amd8111e_restart().
>
> 	if(dev->irq ==0 || request_irq(dev->irq, amd8111e_interrupt, SA_SHIRQ,
> 					 dev->name, dev))
> 		return -EAGAIN;
>
> The patch applies to 2.6.11-rc5-bk2. If I'm right about the above, I'm not sure if the free_irq() should happen before or after releasing the spinlock.
>
> With friendly regards,
> Takis
>
> diff -uprN linux-2.6.11-rc5-bk2/drivers/net/amd8111e.c linux-2.6.11-rc5-bk2-pi/drivers/net/amd8111e.c
> --- linux-2.6.11-rc5-bk2/drivers/net/amd8111e.c	2005-02-28 13:44:46.0 +0100
> +++ linux-2.6.11-rc5-bk2-pi/drivers/net/amd8111e.c	2005-02-28 13:45:09.0 +0100
> @@ -1381,6 +1381,8 @@ static int amd8111e_open(struct net_devi
>  	if(amd8111e_restart(dev)){
>  		spin_unlock_irq(&lp->lock);
> +		if (dev->irq)
> +			free_irq(dev->irq, dev);
>  		return -ENOMEM;

Yes, this is a needed fix. Thanks.

Jeff
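The leak and its fix can be modelled outside the kernel. In this sketch `mock_request_irq()`, `mock_free_irq()`, `mock_restart()` and `struct mock_dev` are hypothetical stand-ins for the driver's real calls, and the spinlock is left out entirely, so it does not address the before-or-after-unlock question.

```c
#include <errno.h>

/* Mock IRQ bookkeeping standing in for request_irq()/free_irq(). */
static int irqs_held;

static int mock_request_irq(unsigned int irq)
{
	(void)irq;
	irqs_held++;
	return 0;		/* this mock always grants the request */
}

static void mock_free_irq(unsigned int irq)
{
	(void)irq;
	irqs_held--;
}

/* Hypothetical device: an irq number plus a switch to force the failure. */
struct mock_dev {
	unsigned int irq;
	int restart_fails;
};

/* Stands in for amd8111e_restart(), which can fail on allocation. */
static int mock_restart(struct mock_dev *dev)
{
	return dev->restart_fails ? -ENOMEM : 0;
}

/* Open path with the patched unwind: every exit after a successful
 * request releases the irq again before returning an error. */
static int mock_open(struct mock_dev *dev)
{
	if (dev->irq == 0 || mock_request_irq(dev->irq))
		return -EAGAIN;

	if (mock_restart(dev)) {
		mock_free_irq(dev->irq);	/* the step the patch adds */
		return -ENOMEM;
	}
	return 0;
}
```

Without the mock_free_irq() call, a failed open would leave `irqs_held` at 1 with no owner left to release it.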
Re: [PATCH] Possible VIA-Rhine free irq issue
Panagiotis Issaris wrote:
> Hi,
>
> It seems to me that in the VIA Rhine device driver the requested irq might not be freed in case the alloc_ring() function fails. alloc_ring() can fail with an ENOMEM return value because of possible pci_alloc_consistent() failures.
>
> This patch applies to 2.6.11-rc5-bk2.
>
> diff -uprN linux-2.6.11-rc5-bk2/drivers/net/via-rhine.c linux-2.6.11-rc5-bk2-pi/drivers/net/via-rhine.c
> --- linux-2.6.11-rc5-bk2/drivers/net/via-rhine.c	2005-02-28 13:44:37.0 +0100
> +++ linux-2.6.11-rc5-bk2-pi/drivers/net/via-rhine.c	2005-02-28 13:44:31.0 +0100
> @@ -1198,7 +1198,10 @@ static int rhine_open(struct net_device
>  	rc = alloc_ring(dev);
>  	if (rc)
> +	{
> +		free_irq(rp->pdev->irq, dev);
>  		return rc;
> +	}

Yes, this is a needed fix.

Thanks,
Jeff
Re: via 6420 pata/sata controller
If I had to guess, I would try the attached patch. The via82cxxx.c driver is a bit annoying in that, here we do not talk to the ISA bridge but to the PCI device 0x4149 itself.

If this doesn't work, I could probably whip together a quick PATA driver for libata that works on this hardware.

Jeff

===== drivers/ide/pci/via82cxxx.c 1.27 vs edited =====
--- 1.27/drivers/ide/pci/via82cxxx.c	2005-02-03 02:24:29 -05:00
+++ edited/drivers/ide/pci/via82cxxx.c	2005-03-02 01:28:26 -05:00
@@ -79,6 +79,7 @@
 	u8 rev_max;
 	u16 flags;
 } via_isa_bridges[] = {
+	{ "vt6420", 0x4149, 0x00, 0x2f, VIA_UDMA_133 | VIA_BAD_AST },
 	{ "vt8237", PCI_DEVICE_ID_VIA_8237, 0x00, 0x2f, VIA_UDMA_133 | VIA_BAD_AST },
 	{ "vt8235", PCI_DEVICE_ID_VIA_8235, 0x00, 0x2f, VIA_UDMA_133 | VIA_BAD_AST },
 	{ "vt8233a", PCI_DEVICE_ID_VIA_8233A, 0x00, 0x2f, VIA_UDMA_133 | VIA_BAD_AST },
@@ -635,9 +636,10 @@
 }
 
 static struct pci_device_id via_pci_tbl[] = {
-	{ PCI_VENDOR_ID_VIA, PCI_DEVICE_ID_VIA_82C576_1, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0},
-	{ PCI_VENDOR_ID_VIA, PCI_DEVICE_ID_VIA_82C586_1, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0},
-	{ 0, },
+	{ PCI_DEVICE(PCI_VENDOR_ID_VIA, PCI_DEVICE_ID_VIA_82C576_1) },
+	{ PCI_DEVICE(PCI_VENDOR_ID_VIA, PCI_DEVICE_ID_VIA_82C586_1) },
+	{ PCI_DEVICE(PCI_VENDOR_ID_VIA, 0x4149) },
+	{ },	/* terminate list */
 };
 
 MODULE_DEVICE_TABLE(pci, via_pci_tbl);
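The second hunk relies on a device table that ends in an all-zero entry. A rough userspace model of that idea follows. `struct mock_pci_id`, `MOCK_PCI_DEVICE` and `mock_match()` are hypothetical simplifications written for this note; the kernel's real matching also handles wildcard and class fields. 0x1106 is VIA's PCI vendor ID, and 0x0571 is used here only as a second sample entry.

```c
#include <stddef.h>
#include <stdint.h>

#define MOCK_ANY_ID 0xffffu	/* stand-in for a "match anything" subsystem ID */

/* Hypothetical cut-down device ID entry. */
struct mock_pci_id {
	uint16_t vendor, device, subvendor, subdevice;
};

/* Fills vendor/device and wildcards the subsystem IDs, which is the
 * boilerplate the patch removes from each table line. */
#define MOCK_PCI_DEVICE(v, d) \
	.vendor = (v), .device = (d), \
	.subvendor = MOCK_ANY_ID, .subdevice = MOCK_ANY_ID

static const struct mock_pci_id mock_tbl[] = {
	{ MOCK_PCI_DEVICE(0x1106, 0x4149) },	/* the device ID added by the patch */
	{ MOCK_PCI_DEVICE(0x1106, 0x0571) },	/* sample second entry */
	{ 0 },					/* terminate list */
};

/* Walk the table until the all-zero sentinel; return the matching entry
 * or NULL. No separate length is needed because of the terminator. */
static const struct mock_pci_id *mock_match(const struct mock_pci_id *tbl,
					    uint16_t vendor, uint16_t device)
{
	for (; tbl->vendor != 0; tbl++)
		if (tbl->vendor == vendor && tbl->device == device)
			return tbl;
	return NULL;
}
```

In this model, a device whose ID is missing from the table is simply never matched, which is why the patch adds 0x4149 to via_pci_tbl as well as to via_isa_bridges.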
Re: O_DIRECT on 2.4 ext3
On Mar 01, 2005 21:34 -0800, Junfeng Yang wrote:
> I tried to read from a regular ext3 file opened as O_DIRECT, but got the "Invalid argument" error. Running the same test program on a block device succeeded.

ext3 doesn't support the direct_IO method in 2.4 kernels, though there was a patch at one time.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/
http://members.shaw.ca/golinux/
Re: [2.6.11-rc4-mm1 patch] fix buggy IEEE80211_CRYPT_* selects
Adrian Bunk wrote:
> +	select CRYPTO
>  	select CRYPTO_AES
>  	---help---
>  	Include software based cipher suites in support of IEEE 802.11i (aka TGi, WPA, WPA2, WPA-PSK, etc.) for use with CCMP enabled networks.
> @@ -54,10 +55,11 @@
>  	  ieee80211_crypt_ccmp.
>
>  config IEEE80211_CRYPT_TKIP
>  	tristate "IEEE 802.11i TKIP encryption"
>  	depends on IEEE80211
> +	select CRYPTO
>  	select CRYPTO_MICHAEL_MIC

'select CRYPTO_AES' should 'select CRYPTO' automatically, I would hope.

Jeff
SCSI Target Mode issue...... pls help
Hello all the gurus out there,

I have written a simple target for a SCSI device. It's at a very early stage. I started by handling simple commands from the INITIATOR like INQUIRY, READ CAPACITY and REPORT LUN. Now I am up to READ and WRITE. I respond to READ properly; the problem is with the WRITE command.

For instance, when I get multiple WRITE commands from the INITIATOR, I queue each command as I receive it. A CTIO has to be sent to the firmware for each received command; in my case I send the CTIO as I receive the command. The firmware then has to send back a response for each CTIO I sent.

Here is what's happening: I get 2 WRITE commands and send a CTIO for cmd1 and for cmd2, and what I get back from the firmware is only the response for the second command, cmd2. cmd1 times out and never gets a response.

If anyone has done the basic handshake and handled READ and WRITE for TARGET mode, please share your knowledge.

Best Regards,
--
When the going gets tough, The tough gets going...! Peace, Nauman.
[PATCH] fix module paramater permissions in radeon_base.c
You really don't want -2 for the file mode in sysfs. It creates:

	-rwsrwsrwT 1 root root 4096 Mar 1 22:59 /sys/module/radeonfb/parameters/default_dynclk

on my box. Here's a fix against a clean 2.6.11-rc5 kernel, please forward onward as you see fit.

Signed-off-by: Greg Kroah-Hartman [EMAIL PROTECTED]

--- 1.27/drivers/video/aty/radeon_base.c	2005-02-24 11:40:00 -08:00
+++ edited/drivers/video/aty/radeon_base.c	2005-03-01 23:09:12 -08:00
@@ -2551,7 +2551,7 @@
 MODULE_DESCRIPTION("framebuffer driver for ATI Radeon chipset");
 MODULE_LICENSE("GPL");
 module_param(noaccel, bool, 0);
-module_param(default_dynclk, int, -2);
+module_param(default_dynclk, int, 0);
 MODULE_PARM_DESC(default_dynclk, "int: -2=enable on mobility only,-1=do not change,0=off,1=on");
 MODULE_PARM_DESC(noaccel, "bool: disable acceleration");
 module_param(nomodeset, bool, 0);
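Why -2 shows up as -rwsrwsrwT can be checked by hand. In two's complement -2 has every bit set except bit 0; assuming the value is simply reinterpreted as a mode and only the low 12 permission bits matter, that leaves octal 07776: setuid, setgid, sticky, rwx for user and group, and rw- for others. The helper below is a hypothetical sketch that renders those bits the way ls does.

```c
/* Render the low 12 mode bits as the 9-character ls-style string.
 * out must have room for 10 bytes (9 characters plus the NUL). */
static void mode_to_string(unsigned int mode, char out[10])
{
	static const char rwx[] = "rwxrwxrwx";
	int i;

	/* Plain permission bits: bit 8 is user-read, bit 0 is other-execute. */
	for (i = 0; i < 9; i++)
		out[i] = (mode & (1u << (8 - i))) ? rwx[i] : '-';

	/* setuid, setgid and sticky overlay the three execute slots;
	 * lower case means the execute bit underneath is also set. */
	if (mode & 04000)
		out[2] = (mode & 0100) ? 's' : 'S';
	if (mode & 02000)
		out[5] = (mode & 0010) ? 's' : 'S';
	if (mode & 01000)
		out[8] = (mode & 0001) ? 't' : 'T';

	out[9] = '\0';
}
```

Calling it with `(unsigned int)-2 & 07777` yields "rwsrwsrwT", matching the listing above; the capital T appears because the sticky bit is set while other-execute (bit 0) is the one bit that -2 leaves clear.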
Re: [PATCH] fix module paramater permissions in radeon_base.c
On Tue, 2005-03-01 at 23:11 -0800, Greg KH wrote:
> You really don't want -2 for the file mode in sysfs. It creates:
>
> 	-rwsrwsrwT 1 root root 4096 Mar 1 22:59 /sys/module/radeonfb/parameters/default_dynclk
>
> on my box. Here's a fix against a clean 2.6.11-rc5 kernel, please forward onward as you see fit.
>
> Signed-off-by: Greg Kroah-Hartman [EMAIL PROTECTED]
>
> --- 1.27/drivers/video/aty/radeon_base.c	2005-02-24 11:40:00 -08:00
> +++ edited/drivers/video/aty/radeon_base.c	2005-03-01 23:09:12 -08:00
> @@ -2551,7 +2551,7 @@
>  MODULE_DESCRIPTION("framebuffer driver for ATI Radeon chipset");
>  MODULE_LICENSE("GPL");
>  module_param(noaccel, bool, 0);
> -module_param(default_dynclk, int, -2);
> +module_param(default_dynclk, int, 0);
>  MODULE_PARM_DESC(default_dynclk, "int: -2=enable on mobility only,-1=do not change,0=off,1=on");
>  MODULE_PARM_DESC(noaccel, "bool: disable acceleration");
>  module_param(nomodeset, bool, 0);

Right, that is bogus, thanks.

Ben.
Re: [PATCH 2.6.11-rc4-mm1] end-of-proces handling for acct-csa
On Tue, 2005-03-01 at 10:06 -0800, Jay Lan wrote:
> Sorry I was not clear on my point. I was trying to point out that, an exit hook for BSD and CSA is essential to save accounting data before the data is gone. That can not be done with a netlink.
>
> So, my patch was to keep acct_process as a wrapper, which would then call do_exit_csa() for CSA and call do_acct_process for BSD.

Is it possible to merge BSD and CSA? I mean with CSA, there is a part that does per-process accounting. For example, in the linux-2.6.9.acct_mm.patch the two functions update_mem_hiwater() and csa_update_integrals() update fields in the current (and parent) process. So maybe you can improve the BSD per-process accounting or maybe CSA can replace the BSD per-process accounting?

Guillaume
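The wrapper arrangement Jay describes can be pictured as below. This is only a schematic userspace mock: the function signatures, the `*_enabled` switches and the record counters are assumptions made for illustration and are not taken from the CSA or BSD accounting code.

```c
/* Hypothetical switches and counters standing in for the two back ends. */
static int csa_enabled = 1, bsd_enabled = 1;
static int csa_records, bsd_records;

/* Stand-in for do_exit_csa(): would save CSA data for the exiting task. */
static void mock_do_exit_csa(long exitcode)
{
	(void)exitcode;
	csa_records++;
}

/* Stand-in for do_acct_process(): would write the BSD accounting record. */
static void mock_do_acct_process(long exitcode)
{
	(void)exitcode;
	bsd_records++;
}

/* The wrapper: one exit hook that fans out to whichever accounting
 * back ends are active, before the task's data is gone. */
static void mock_acct_process(long exitcode)
{
	if (csa_enabled)
		mock_do_exit_csa(exitcode);
	if (bsd_enabled)
		mock_do_acct_process(exitcode);
}
```

Guillaume's question amounts to asking whether the two callees could become one, since both run at the same point and read the same per-process fields.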
Re: [PATCH] remove dead cyrix/centaur mtrr init code
On Tue, Mar 01, 2005 at 11:52:44PM +, Alan Cox wrote:
> On Llu, 2005-02-28 at 19:20, Andries Brouwer wrote:
> > One such case is the mtrr code, where struct mtrr_ops has an init field pointing at __init functions. Unless I overlook something, this case may be easy to settle, since the .init field is never used.
>
> The failure to invoke the ->init operator appears to be the bug. The centaur code definitely wants the mcr init function to be called.

Yes, I expected that to be the answer. Therefore #if 0 instead of deleting. But if calling ->init() is needed, and it has not been done the past three years, the question arises whether there are any users.

Andries
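The bug Alan points at, an ops structure whose init hook exists but is never invoked, can be shown with a small mock. All names here (`struct mock_mtrr_ops`, `mock_set_ops()`, the two ops instances) are hypothetical; the sketch only illustrates calling an optional hook when the ops are selected and does not reproduce the mtrr code.

```c
#include <stddef.h>

/* Hypothetical cut-down ops structure with an optional init hook. */
struct mock_mtrr_ops {
	const char *name;
	void (*init)(void);	/* may be NULL */
};

static int mcr_initialised;

/* Stand-in for the centaur mcr init function. */
static void mock_centaur_mcr_init(void)
{
	mcr_initialised = 1;
}

static const struct mock_mtrr_ops mock_centaur_ops = { "centaur", mock_centaur_mcr_init };
static const struct mock_mtrr_ops mock_generic_ops = { "generic", NULL };

static const struct mock_mtrr_ops *active_ops;

/* Select an ops table and run its init hook if it has one. Dropping the
 * two-line hook call here gives the situation described in the thread:
 * the field is filled in but nothing ever uses it. */
static void mock_set_ops(const struct mock_mtrr_ops *ops)
{
	active_ops = ops;
	if (ops->init)
		ops->init();
}
```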
Re: [PATCH] Possible AMD8111e free irq issue
Hi,

Jeff Garzik wrote:
> > diff -uprN linux-2.6.11-rc5-bk2/drivers/net/amd8111e.c linux-2.6.11-rc5-bk2-pi/drivers/net/amd8111e.c
> > --- linux-2.6.11-rc5-bk2/drivers/net/amd8111e.c	2005-02-28 13:44:46.0 +0100
> > +++ linux-2.6.11-rc5-bk2-pi/drivers/net/amd8111e.c	2005-02-28 13:45:09.0 +0100
> > @@ -1381,6 +1381,8 @@ static int amd8111e_open(struct net_devi
> >  	if(amd8111e_restart(dev)){
> >  		spin_unlock_irq(&lp->lock);
> > +		if (dev->irq)
> > +			free_irq(dev->irq, dev);
> >  		return -ENOMEM;
>
> Yes, this is a needed fix. Thanks.

Should the release of the irq happen before or after unlocking the spinlock? I wasn't really sure about it.

With friendly regards,
Takis
--
K.U.Leuven, Mechanical Eng., Mechatronics & Robotics Research Group
http://people.mech.kuleuven.ac.be/~pissaris/