Re: 2.6.22.6: kernel BUG at fs/locks.c:171
On Thu, 2007-09-13 at 09:51 +1000, Nick Piggin wrote: > On Thursday 13 September 2007 19:20, Soeren Sonnenburg wrote: > > Dear all, > > > > I've just seen this in dmesg on a AMD K7 / kernel 2.6.22.6 machine > > (config attached). > > > > Any ideas / which further information needed ? > > Thanks for the report. Is it reproduceable? It seems like the > locks_free_lock call that's oopsing is coming from __posix_lock_file. > The actual function looks fine, but the lock being freed could have > been corrupted if there was slab corruption, or a hardware corruption. > > You could: try running memtest86+ overnight. And try the following > patch and turn on slab debugging then try to reproduce the problem. OK so far I've run memtest86+ 1.40 from freedos for 8 hrs (v1.70 hung on startup) - nothing. Could this corruption be caused by a pci card/driver? I am asking as I am using a new dvb-t card (asus p7131) and the oops happened after 5 or 6 days of uptime just about a day after watching some movie (very bad reception/lots of errors). However this machine used to have uptimes of months before the dvb card was in there and the kernel version upgrade (don't know which version that was...). Anyway I am not sure if this is reproducible, but I will keep memtest running today and then proceed as you said... Thanks, Soeren -- Sometimes, there's a moment as you're waking, when you become aware of the real world around you, but you're still dreaming. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Thu, 13 Sep 2007, Mel Gorman wrote: > Surely, we'll be able to detect the situation where the memory is really > contiguous as a fast path and have a slower path where fragmentation was > a problem. Yes I have a draft here now of a virtual compound page solution that I am testing with SLUB. Page allocator first tries to allocate a contiguous page. If that is not possible then we string together a contiguous page through order 0 allocs and vmalloc. > The only implementation question about these patches that hasn't been > addressed > is the mmap() support. What's wrong with it in it's current form. Can it be > fixed or if it's fundamentally screwed etc. That has fallen by the > wayside. Yes I have not heard about those. Got some more ideas how to clean it up even more in the meantime. No feedback usually means no objections. > I am *very* wary of using reserve pools for anything other than > emergency situations. If nothing else pools == wasted memory + a sizing > problem. But hey, it is one option. Well we need the page pools anyways for large page sizes > MAX_ORDER. But > Are we going to agree on some sort of plan or are we just going to > handwave ourselves to death? Ill try to do the virtual mapped compound pages. Hopefully that will allow fallback for compound pages in a generic way. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [-mm] FS: file name must be unique in the same dir in procfs
> > From: Zhang Rui <[EMAIL PROTECTED]> > File name should be unique in the same directory. > > In order to keep the back-compatibility, only a warning is given > currently, but actions must be taken to fix it when such duplicates > are detected. > > Bug report and a simple fix can be found here: > http://bugzilla.kernel.org/show_bug.cgi?id=8798 > > Signed-off-by: Zhang Rui <[EMAIL PROTECTED]> > --- > fs/proc/generic.c | 12 > 1 file changed, 12 insertions(+) > > Index: linux-2.6/fs/proc/generic.c > === > --- linux-2.6.orig/fs/proc/generic.c > +++ linux-2.6/fs/proc/generic.c > @@ -523,6 +523,7 @@ static const struct inode_operations pro > > static int proc_register(struct proc_dir_entry * dir, struct > proc_dir_entry * dp) Your email client is wordwrapping the text. > { > + struct proc_dir_entry *tmp = NULL; That initialisation is unneeded - I removed it. `tmp' is always a crappy name for anything. I renamed it to `de'. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [NFS] [PATH 08/19] jfs: new export ops
On Thu, Sep 13, 2007 at 04:57:01PM -0400, J. Bruce Fields wrote: > On Thu, Sep 13, 2007 at 03:59:34PM +0200, Christoph Hellwig wrote: > > Yes, please add them to the queue. I've put a quilt series of patches > > with the whitespace fixe on http://verein.lst.de/~hch/patches.nfsd.tgz. > > Do you want to take it that way or should I resend everything to the > > list? > > Yeah, I hate to be so picky, but would it be possible to get them as > mail? It just saves me some scripting. (I tried git-quiltimport and it > doesn't get the silly subject lines right.) They don't even have to be > mailed out, an mbox I can wget would be fine. No problem, I'll do that tomorrow. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [NFS] [PATH 08/19] jfs: new export ops
On Thu, Sep 13, 2007 at 03:59:34PM +0200, Christoph Hellwig wrote: > Yes, please add them to the queue. I've put a quilt series of patches > with the whitespace fixe on http://verein.lst.de/~hch/patches.nfsd.tgz. > Do you want to take it that way or should I resend everything to the > list? Yeah, I hate to be so picky, but would it be possible to get them as mail? It just saves me some scripting. (I tried git-quiltimport and it doesn't get the silly subject lines right.) They don't even have to be mailed out, an mbox I can wget would be fine. --b. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Thursday 13 September 2007 23:03, David Chinner wrote: > On Thu, Sep 13, 2007 at 03:23:21AM +1000, Nick Piggin wrote: > > Well, it may not be easy to _fix_, but it's easy to try a few > > improvements ;) > > > > How do I make an image and run a workload that will coerce XFS into > > doing a significant number of vmaps? > > # mkfs.xfs -n size=16384 > > to create a filesystem with a 16k directory block size on a 4k page > machine. > > Then just do operations on directories with lots of files in them > (tens of thousands). Every directory operation will require at > least one vmap in this situation - e.g. a traversal will result in > lots and lots of blocks being read that will require vmap() for every > directory block read from disk and an unmap almost immediately > afterwards when the reference is dropped Ah, wow, thanks: I can reproduce it. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.22.6: kernel BUG at fs/locks.c:171
On Thursday 13 September 2007 19:20, Soeren Sonnenburg wrote: > Dear all, > > I've just seen this in dmesg on a AMD K7 / kernel 2.6.22.6 machine > (config attached). > > Any ideas / which further information needed ? Thanks for the report. Is it reproduceable? It seems like the locks_free_lock call that's oopsing is coming from __posix_lock_file. The actual function looks fine, but the lock being freed could have been corrupted if there was slab corruption, or a hardware corruption. You could: try running memtest86+ overnight. And try the following patch and turn on slab debugging then try to reproduce the problem. > > Soeren > > [ cut here ] > kernel BUG at fs/locks.c:171! > invalid opcode: [#1] > Modules linked in: ipt_iprange ipt_REDIRECT capi kernelcapi capifs > ipt_REJECT xt_tcpudp xt_state xt_limit ipt_LOG ipt_MASQUERADE > iptable_mangle iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables > x_tables b44 ohci1394 ieee1394 nf_nat_ftp nf_nat nf_conntrack_ftp > nf_conntrack lcd tda827x saa7134_dvb dvb_pll video_buf_dvb tuner tda1004x > ves1820 usb_storage usblp saa7134 compat_ioctl32 budget_ci budget_core > dvb_ttpci dvb_core saa7146_vv video_buf saa7146 ttpci_eeprom via_agp > ir_kbd_i2c videodev v4l2_common v4l1_compat ir_common agpgart CPU:0 > EIP:0060:[]Not tainted VLI > EFLAGS: 00010206 (2.6.22.6 #1) > EIP is at locks_free_lock+0xb/0x3b > eax: e1d07f9c ebx: e1d07f80 ecx: f5f5e2f0 edx: > esi: edi: ebp: esp: da3d7f04 > ds: 007b es: 007b fs: gs: 0033 ss: 0068 > Process mrtg-load (pid: 19688, ti=da3d6000 task=f5e3a030 task.ti=da3d6000) > Stack: c015972b 0002 c04889c8 c012b920 f5f5e290 c048541c > f0ed3ca0 01485414 e1d07f80 f0f39f58 44ef35f1 f62fc2ac > f5f5e290 d23106c0 c015a891 0007 > 0004 Call Trace: > [] __posix_lock_file+0x44e/0x47f > [] getnstimeofday+0x2b/0xaf > [] fcntl_setlk+0xff/0x1f6 > [] do_setitimer+0xfa/0x226 > [] sys_fcntl64+0x74/0x85 > [] syscall_call+0x7/0xb > === > Code: 74 1b 8b 15 30 93 48 c0 8d 43 04 89 53 04 89 42 04 a3 30 93 48 c0 c7 > 40 04 30 93 48 c0 5b 5e c3 53 89 c3 8d 40 1c 39 43 1c 74 04 <0f> 0b eb fe > 8d 43 0c 39 43 0c 74 04 0f 0b eb fe 8d 43 04 39 43 EIP: [] > locks_free_lock+0xb/0x3b SS:ESP 0068:da3d7f04 > BUG: unable to handle kernel paging request at virtual address 9ee420b0 > printing eip: > c014ab7d > *pde = > Oops: 0002 [#2] > Modules linked in: ipt_iprange ipt_REDIRECT capi kernelcapi capifs > ipt_REJECT xt_tcpudp xt_state xt_limit ipt_LOG ipt_MASQUERADE > iptable_mangle iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables > x_tables b44 ohci1394 ieee1394 nf_nat_ftp nf_nat nf_conntrack_ftp > nf_conntrack lcd tda827x saa7134_dvb dvb_pll video_buf_dvb tuner tda1004x > ves1820 usb_storage usblp saa7134 compat_ioctl32 budget_ci budget_core > dvb_ttpci dvb_core saa7146_vv video_buf saa7146 ttpci_eeprom via_agp > ir_kbd_i2c videodev v4l2_common v4l1_compat ir_common agpgart CPU:0 > EIP:0060:[]Not tainted VLI > EFLAGS: 00010082 (2.6.22.6 #1) > EIP is at free_block+0x61/0xfb > eax: a75b2c19 ebx: c1cf6c10 ecx: e1d070c4 edx: 9ee420ac > esi: e1d07000 edi: dfde6960 ebp: dfde7620 esp: dfd87f44 > ds: 007b es: 007b fs: gs: ss: 0068 > Process events/0 (pid: 4, ti=dfd86000 task=dfdc4a50 task.ti=dfd86000) > Stack: 0012 0018 c1cf6c10 c1cf6c10 0018 > c1cf6c00 dfde7620 c014ac86 dfde6960 dfde7620 c0521d20 > c014b869 dfde69e0 c0521d20 c014b827 c0125955 dfdc4b5c > 8f0c99c0 Call Trace: > [] drain_array+0x6f/0x89 > [] cache_reap+0x42/0xde > [] cache_reap+0x0/0xde > [] run_workqueue+0x6b/0xdf > [] worker_thread+0x0/0xbd > [] worker_thread+0xb2/0xbd > [] autoremove_wake_function+0x0/0x35 > [] kthread+0x36/0x5a > [] kthread+0x0/0x5a > [] kernel_thread_helper+0x7/0x10 > === > Code: 8b 02 25 00 40 02 00 3d 00 40 02 00 75 03 8b 52 0c 8b 02 84 c0 78 04 > 0f 0b eb fe 8b 72 1c 8b 54 24 28 8b 46 04 8b 7c 95 4c 8b 16 <89> 42 04 89 > 10 2b 4e 0c c7 06 00 01 10 00 c7 46 04 00 02 20 00 EIP: [] > free_block+0x61/0xfb SS:ESP 0068:dfd87f44 > [ cut here ] > kernel BUG at fs/locks.c:171! > invalid opcode: [#3] > Modules linked in: ipt_iprange ipt_REDIRECT capi kernelcapi capifs > ipt_REJECT xt_tcpudp xt_state xt_limit ipt_LOG ipt_MASQUERADE > iptable_mangle iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables > x_tables b44 ohci1394 ieee1394 nf_nat_ftp nf_nat nf_conntrack_ftp > nf_conntrack lcd tda827x saa7134_dvb dvb_pll video_buf_dvb tuner tda1004x > ves1820 usb_storage usblp saa7134 compat_ioctl32 budget_ci budget_core > dvb_ttpci dvb_core saa7146_vv video_buf saa7146 ttpci_eeprom via_agp > ir_kbd_i2c videodev v4l2_common v4l1_compat ir_common agpgart CPU:0 > EIP:0060:[]Not tainted VLI > EFLAGS: 00010287 (2.6.
Re: [PATCH] 9p: rename uid and gid parameters
On Thu, 2007-09-13 at 08:51 -0600, Latchesar Ionkov wrote: > Zero was the value that was used before, even though it wasn't defined > explicitly. I just defined a macro so we can see and eventually change > it to something better. I don't know if there is a good default value. > Is nfsnobody the same on all Linux distributions? Not necessarily. [] > On 9/13/07, Eric Van Hensbergen <[EMAIL PROTECTED]> wrote: > > On 9/12/07, Latchesar Ionkov <[EMAIL PROTECTED]> wrote: > > > Change the names of 'uid' and 'gid' parameters to the more appropriate > > > 'dfltuid' and 'dfltgid'. > > > > > > > ... > > > > > strcpy(v9ses->name, V9FS_DEFUSER); > > > strcpy(v9ses->remotename, V9FS_DEFANAME); > > > + v9ses->dfltuid = V9FS_DEFUID; > > > + v9ses->dfltgid = V9FS_DEFGID; > > > > > ... > > > +#define V9FS_DEFUID(0) > > > +#define V9FS_DEFGID(0) > > > > I'm not sure if there is a good solution here, but I'm uncomfortable > > with using uid=0 as the default. I'm not sure if there is a default > > uid for nobody, but anything is probably better than 0. Looks like > > nfsnobody is 65534, we could use that - even if only as a marker for On 32bit hardware. On 64bit it is (similar to 32bit): (unsigned int)-2. > > the server to map it to nobody on the target system? What do you > > think? > > > > Particularly with attach-per-user, we probably need to look at > > interacting with idmapd or create our own variant real soon. Bernd -- Firmix Software GmbH http://www.firmix.at/ mobil: +43 664 4416156 fax: +43 1 7890849-55 Embedded Linux Development and Services - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Security attributes and ducumentation update.
On Thu, Sep 13, 2007 at 04:22:59PM +0400, Evgeniy Polyakov wrote: > Hi Paul. > > On Mon, Sep 10, 2007 at 03:14:45PM -0700, Paul E. McKenney ([EMAIL > PROTECTED]) wrote: > > > Further TODO list includes: > > > * implement optional saving of mirroring/linear information on the remote > > > nodes (simple) > > > * implement netlink based setup (simple) > > > * new redundancy algorithm (complex) > > > > > > Homepage: > > > http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst > > > > A couple questions below, but otherwise looks good from an RCU viewpoint. > > > > Thanx, Paul > > Thanks for your comments, and sorry for late reply I was at KS/London > trip. > > > + if (--num) { > > > + list_for_each_entry_rcu(n, &node->shared, shared) { > > > > This function is called under rcu_read_lock() or similar, right? > > (Can't tell from this patch.) It is also OK to call it from under the > > update-side mutex, of course. > > Actually not, but it does not require it, since entry can not be removed > during this operations since appropriate reference counter for given node is > being held. It should not be RCU at all. Ah! Yes, it is OK to use _rcu in this case, but should be avoided unless doing so eliminates duplicate code or some such. So, agree with dropping _rcu in this case. > > > +static int dst_mirror_read(struct dst_request *req) > > > +{ > > > + struct dst_node *node = req->node, *n, *min_dist_node; > > > + struct dst_mirror_priv *priv = node->priv; > > > + u64 dist, d; > > > + int err; > > > + > > > + req->bio_endio = &dst_mirror_read_endio; > > > + > > > + do { > > > + err = -ENODEV; > > > + min_dist_node = NULL; > > > + dist = -1ULL; > > > + > > > + /* > > > + * Reading is never performed from the node under resync. > > > + * If this will cause any troubles (like all nodes must be > > > + * resynced between each other), this check can be removed > > > + * and per-chunk dirty bit can be tested instead. > > > + */ > > > + > > > + if (!test_bit(DST_NODE_NOTSYNC, &node->flags)) { > > > + priv = node->priv; > > > + if (req->start > priv->last_start) > > > + dist = req->start - priv->last_start; > > > + else > > > + dist = priv->last_start - req->start; > > > + min_dist_node = req->node; > > > + } > > > + > > > + list_for_each_entry_rcu(n, &node->shared, shared) { > > > > I see one call to this function that appears to be under the update-side > > mutex, but I cannot tell if the other calls are safe. (Safe as in either > > under the update-side mutex or under rcu_read_lock() and friends.) > > The same here - those processing function are called from > generic_make_request() from any lock on top of them. Each node is linked > into the list of the first added node, which reference counter is > increased in higher layer. Right now there is no way to add or remove > nodes after array was started, such functionality requires storage tree > lock to be taken and RCU can not be used (since it requires sleeping and > I did not investigate sleepable RCU for this purpose). > > So, essentially RCU is not used in DST :) Works for me! "Use the right tool for the job!" > Thanks for review, Paul. Thanx, Paul - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] 9p: rename uid and gid parameters
Zero was the value that was used before, even though it wasn't defined explicitly. I just defined a macro so we can see and eventually change it to something better. I don't know if there is a good default value. Is nfsnobody the same on all Linux distributions? Thanks, Lucho On 9/13/07, Eric Van Hensbergen <[EMAIL PROTECTED]> wrote: > On 9/12/07, Latchesar Ionkov <[EMAIL PROTECTED]> wrote: > > Change the names of 'uid' and 'gid' parameters to the more appropriate > > 'dfltuid' and 'dfltgid'. > > > > ... > > > strcpy(v9ses->name, V9FS_DEFUSER); > > strcpy(v9ses->remotename, V9FS_DEFANAME); > > + v9ses->dfltuid = V9FS_DEFUID; > > + v9ses->dfltgid = V9FS_DEFGID; > > > ... > > +#define V9FS_DEFUID(0) > > +#define V9FS_DEFGID(0) > > I'm not sure if there is a good solution here, but I'm uncomfortable > with using uid=0 as the default. I'm not sure if there is a default > uid for nobody, but anything is probably better than 0. Looks like > nfsnobody is 65534, we could use that - even if only as a marker for > the server to map it to nobody on the target system? What do you > think? > > Particularly with attach-per-user, we probably need to look at > interacting with idmapd or create our own variant real soon. > > -eric > - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] 9p: attach-per-user
On 9/12/07, Latchesar Ionkov <[EMAIL PROTECTED]> wrote: > > - allow only one user to access the tree (access=) > Only the user with uid can access the v9fs tree. Other users that attempt > to access it will get EPERM error. > While access= and dfltuid= creates an interesting flexibility in the way things can be used, I'm wondering if access= dfltuid=DEFAULT_UID is intuitive, it might be nice for the default behavior to be setting defltuid to the uid specified in access when that access option is used. This can be overridden with the dfltuid option, but I think it makes more sense to attach as the uid you are restricting access to. If that's the way we want to go, I think that can be handled in a separate patch. I've merged this stuff into my test tree, as soon as regressions pass and I confirm they compile clean on another architecture I'll push them into my devel to be picked up by -mm. -eric - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [NFS] [PATH 08/19] jfs: new export ops
On Mon, Sep 10, 2007 at 03:45:33PM -0400, J. Bruce Fields wrote: > On Thu, Aug 30, 2007 at 03:16:32PM +0200, Christoph Hellwig wrote: > > Trivial switch over to the new generic helpers. > > - result = ERR_PTR(-ESTALE); > > - goto out_iput; > > + iput(inode); > > + return ERR_PTR(-ESTALE); > > For some reason you're introducing a bunch of whitespace damage on lines > like that last one. Would you mind running these through checkpatch.pl > or whatever and then resending? > > Other than that I can't find any problem, and it seems like an obvious > improvement, though I'd like to see Neil take a look. Do you want these > added to the nfsd patches for 2.6.24? Yes, please add them to the queue. I've put a quilt series of patches with the whitespace fixe on http://verein.lst.de/~hch/patches.nfsd.tgz. Do you want to take it that way or should I resend everything to the list? - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] 9p: rename uid and gid parameters
On 9/12/07, Latchesar Ionkov <[EMAIL PROTECTED]> wrote: > Change the names of 'uid' and 'gid' parameters to the more appropriate > 'dfltuid' and 'dfltgid'. > ... > strcpy(v9ses->name, V9FS_DEFUSER); > strcpy(v9ses->remotename, V9FS_DEFANAME); > + v9ses->dfltuid = V9FS_DEFUID; > + v9ses->dfltgid = V9FS_DEFGID; > ... > +#define V9FS_DEFUID(0) > +#define V9FS_DEFGID(0) I'm not sure if there is a good solution here, but I'm uncomfortable with using uid=0 as the default. I'm not sure if there is a default uid for nobody, but anything is probably better than 0. Looks like nfsnobody is 65534, we could use that - even if only as a marker for the server to map it to nobody on the target system? What do you think? Particularly with attach-per-user, we probably need to look at interacting with idmapd or create our own variant real soon. -eric - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Thu, Sep 13, 2007 at 03:23:21AM +1000, Nick Piggin wrote: > On Thursday 13 September 2007 11:49, David Chinner wrote: > > On Wed, Sep 12, 2007 at 01:27:33AM +1000, Nick Piggin wrote: > > > > I just gave 4 things which combined might easily reduce xfs vmap overhead > > > by several orders of magnitude, all without changing much code at all. > > > > Patches would be greatly appreciately. You obviously understand this > > vm code much better than I do, so if it's easy to fix by adding some > > generic vmap cache thingy, please do. > > Well, it may not be easy to _fix_, but it's easy to try a few improvements ;) > > How do I make an image and run a workload that will coerce XFS into > doing a significant number of vmaps? # mkfs.xfs -n size=16384 to create a filesystem with a 16k directory block size on a 4k page machine. Then just do operations on directories with lots of files in them (tens of thousands). Every directory operation will require at least one vmap in this situation - e.g. a traversal will result in lots and lots of blocks being read that will require vmap() for every directory block read from disk and an unmap almost immediately afterwards when the reference is dropped Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Protect update of b_transaction by j_list_lock
Hello, At all other places, j_list_lock is being held when b_transaction is being modified. But at one place in transaction.c, this lock is not held. Also, the journal-head.h file which defines journal_head structure says that b_transaction should be protected by two locks - j_list_lock and jbd_lock_bh_state() : /* * Pointer to the compound transaction which owns this buffer's * metadata: either the running transaction or the committing * transaction (if there is one). Only applies to buffers on a * transaction's data or metadata journaling list. * [j_list_lock] [jbd_lock_bh_state()] */ transaction_t *b_transaction; This was observed while debugging a problem of 'b_transaction' getting corrupted. Here is the trace : - [c267bc50] d00e0654 .__journal_refile_buffer+0x100/0x16c [jbd] [c267bcf0] d00e48bc .journal_commit_transaction+0x136c/0x16e0 [jbd] [c267be00] d00e9968 .kjournald+0xf0/0x2e8 [jbd] [c267bee0] c0080fb8 .kthread+0x128/0x178 [c267bf90] c00264bc .kernel_thread+0x4c/0x68 3:mon> e cpu 0x3: Vector: 700 (Program Check) at [c267b930] pc: d00e02b0: .__journal_file_buffer+0x74/0x2e0 [jbd] lr: d00e0654: .__journal_refile_buffer+0x100/0x16c [jbd] sp: c267bbb0 msr: 80029032 current = 0xc241d3d0 paca= 0xc0464900 pid = 16224, comm = kjournald kernel BUG in __journal_file_buffer at fs/jbd/transaction.c:1951! 3:mon> r R00 = 0001 R16 = 41c0 R01 = c267bbb0 R17 = c0371060 R02 = d0102d20 R18 = R03 = c00038b25208 R19 = 002bf000 R04 = c001e6dfee80 R20 = c27e4680 R05 = 0002 R21 = R06 = 0283 R22 = 0fdc R07 = R23 = R08 = c0464900 R24 = c000e8ad5024 R09 = c001e6dfee80 R25 = c001b6b726e8 R10 = c00038b25208 R26 = c1f5c780 R11 = 0008 R27 = 0002 R12 = 2448 R28 = c001e6dfee80 R13 = c0464900 R29 = c00038b25208 R14 = R30 = d0101f80 R15 = c0372620 R31 = c00163ce7908 pc = d00e02b0 .__journal_file_buffer+0x74/0x2e0 [jbd] lr = d00e0654 .__journal_refile_buffer+0x100/0x16c [jbd] msr = 80029032 cr = 4442 ctr = d00e0170 xer = 2000 trap = 700 - The assertion is coming from following code (fs/jbd/transaction.c): /* * File a buffer on the given transaction list. */ void __journal_file_buffer(struct journal_head *jh, transaction_t *transaction, int jlist) { struct journal_head **list = NULL; int was_dirty = 0; struct buffer_head *bh = jh2bh(jh); J_ASSERT_JH(jh, jbd_is_locked_bh_state(bh)); assert_spin_locked(&transaction->t_journal->j_list_lock);<==HERE J_ASSERT_JH(jh, jh->b_jlist < BJ_Types); J_ASSERT_JH(jh, jh->b_transaction == transaction || jh->b_transaction == 0); : : It looks like the transaction pointer got corrupted. On code inspection, I could find a place where b_transaction is being updated without holding the j_list_lock. Tried to fix this in the following patch and the bug is no longer being discovered. Please see if this change is okay. Following patch has been tested successfully for more than two weeks: Signed-off-by: Amit Arora <[EMAIL PROTECTED]> diff -Nuarp a/fs/jbd/transaction.c b/fs/jbd/transaction.c --- a/fs/jbd/transaction.c 2007-08-10 15:34:08.0 +0530 +++ b/fs/jbd/transaction.c 2007-08-10 15:56:02.0 +0530 @@ -693,15 +693,15 @@ repeat: * sure it doesn't get written to disk before the caller actually * commits the new data */ + spin_lock(&journal->j_list_lock); if (!jh->b_transaction) { JBUFFER_TRACE(jh, "no transaction"); J_ASSERT_JH(jh, !jh->b_next_transaction); jh->b_transaction = transaction; JBUFFER_TRACE(jh, "file as BJ_Reserved"); - spin_lock(&journal->j_list_lock); __journal_file_buffer(jh, transaction, BJ_Reserved); - spin_unlock(&journal->j_list_lock); } + spin_unlock(&journal->j_list_lock); done: if (need_copy) { - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Security attributes and ducumentation update.
Hi Paul. On Mon, Sep 10, 2007 at 03:14:45PM -0700, Paul E. McKenney ([EMAIL PROTECTED]) wrote: > > Further TODO list includes: > > * implement optional saving of mirroring/linear information on the remote > > nodes (simple) > > * implement netlink based setup (simple) > > * new redundancy algorithm (complex) > > > > Homepage: > > http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst > > A couple questions below, but otherwise looks good from an RCU viewpoint. > > Thanx, Paul Thanks for your comments, and sorry for late reply I was at KS/London trip. > > + if (--num) { > > + list_for_each_entry_rcu(n, &node->shared, shared) { > > This function is called under rcu_read_lock() or similar, right? > (Can't tell from this patch.) It is also OK to call it from under the > update-side mutex, of course. Actually not, but it does not require it, since entry can not be removed during this operations since appropriate reference counter for given node is being held. It should not be RCU at all. > > +static int dst_mirror_read(struct dst_request *req) > > +{ > > + struct dst_node *node = req->node, *n, *min_dist_node; > > + struct dst_mirror_priv *priv = node->priv; > > + u64 dist, d; > > + int err; > > + > > + req->bio_endio = &dst_mirror_read_endio; > > + > > + do { > > + err = -ENODEV; > > + min_dist_node = NULL; > > + dist = -1ULL; > > + > > + /* > > +* Reading is never performed from the node under resync. > > +* If this will cause any troubles (like all nodes must be > > +* resynced between each other), this check can be removed > > +* and per-chunk dirty bit can be tested instead. > > +*/ > > + > > + if (!test_bit(DST_NODE_NOTSYNC, &node->flags)) { > > + priv = node->priv; > > + if (req->start > priv->last_start) > > + dist = req->start - priv->last_start; > > + else > > + dist = priv->last_start - req->start; > > + min_dist_node = req->node; > > + } > > + > > + list_for_each_entry_rcu(n, &node->shared, shared) { > > I see one call to this function that appears to be under the update-side > mutex, but I cannot tell if the other calls are safe. (Safe as in either > under the update-side mutex or under rcu_read_lock() and friends.) The same here - those processing function are called from generic_make_request() from any lock on top of them. Each node is linked into the list of the first added node, which reference counter is increased in higher layer. Right now there is no way to add or remove nodes after array was started, such functionality requires storage tree lock to be taken and RCU can not be used (since it requires sleeping and I did not investigate sleepable RCU for this purpose). So, essentially RCU is not used in DST :) Thanks for review, Paul. -- Evgeniy Polyakov - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Jfs-discussion] [2/4] 2.6.23-rc6: known regressions
On Wed, 2007-09-12 at 18:58 +0200, Michal Piotrowski wrote: > Subject : umount triggers a warning in jfs and takes almost a minute > References : http://lkml.org/lkml/2007/9/4/73 > Last known good : ? > Submitter : Oliver Neukum <[EMAIL PROTECTED]> > Caused-By : ? > Handled-By : ? > Status : unknown I'm still waiting to hear from Oliver whether or not this is actually a regression. Thanks, Shaggy -- David Kleikamp IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On (12/09/07 16:17), Christoph Lameter didst pronounce: > On Wed, 12 Sep 2007, Nick Piggin wrote: > > > I will still argue that my approach is the better technical solution for > > large > > block support than yours, I don't think we made progress on that. And I'm > > quite sure we agreed at the VM summit not to rely on your patches for > > VM or IO scalability. > > The approach has already been tried (see the XFS layer) and found lacking. > > Having a fake linear block through vmalloc means that a special software > layer must be introduced and we may face special casing in the block / fs > layer to check if we have one of these strange vmalloc blocks. > One of Nick's points is that to have a 100% reliable solution, that is what is required. We already have a layering between the VM and the FS but my understanding is that fsblock replaces rather than adds to it. Surely, we'll be able to detect the situation where the memory is really contiguous as a fast path and have a slower path where fragmentation was a problem. > > But you just showed in two emails that you don't understand what the > > problem is. To reiterate: lumpy reclaim does *not* invalidate my formulae; > > and antifrag does *not* isolate the issue. > > I do understand what the problem is. I just do not get what your problem > with this is and why you have this drive to demand perfection. We are > working a variety of approaches on the (potential) issue but you > categorically state that it cannot be solved. > This is going in circles. His point is that we also cannot prove it is 100% correct in all situations. Without taking additional (expensive) steps, there will be a workload that fragments physical memory. He doesn't know what it is and neither do we, but that does not mean that someone else will find it. He also has a point about the slow degredation of fragmentation that is woefully difficult to reproduce. We've had this provability of correctness problem before. His initial problem was not with the patches as such but the fact that they seemed to be presented as a 1st class feature that we fully support and is a potential solution for some VM and IO Scalability problems. This is not the case, we have to treat it as a 2nd class feature until we *know* no situation exists where it breaks down. These patches on their own would have to run for months if not a year or so before we could be really sure about it. The only implementation question about these patches that hasn't been addressed is the mmap() support. What's wrong with it in it's current form. Can it be fixed or if it's fundamentally screwed etc. That has fallen by the wayside. > > But what do you say about viable alternatives that do not have to > > worry about these "unlikely scenarios", full stop? So, why should we > > not use fs block for higher order page support? > > Because it has already been rejected in another form and adds more > layering to the filesystem and more checking for special cases in which > we only have virtual linearity? It does not reduce the number of page > structs that have to be handled by the lower layers etc. > Unless callers always use an iterator for blocks that is optimised in the physically linear case to be a simple array offset and when not physically linear it either walks chains (complex) or uses vmap (must deal with TLB flushes amoung other things). If it optimistically uses physically contiguous memory, we may find a way to use only one page struct as well. > Maybe we coud get to something like a hybrid that avoids some of these > issues? Or gee whiz, I don't know. Start with your patches as a strictly 2nd class citizen and build fsblock in while trying to keep use of physically contiguous memory where possible and it makes sense. > Add support so something like a virtual compound page can be > handled transparently in the filesystem layer with special casing if > such a beast reaches the block layer? > > > I didn't skip that. We have large page pools today. How does that give > > first class of support to those allocations if you have to have memory > > reserves? > > See my other mail. That portion is not complete yet. Sorry. > I am *very* wary of using reserve pools for anything other than emergency situations. If nothing else pools == wasted memory + a sizing problem. But hey, it is one option. Are we going to agree on some sort of plan or are we just going to handwave ourselves to death? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Thursday 13 September 2007 11:49, David Chinner wrote: > On Wed, Sep 12, 2007 at 01:27:33AM +1000, Nick Piggin wrote: > > I just gave 4 things which combined might easily reduce xfs vmap overhead > > by several orders of magnitude, all without changing much code at all. > > Patches would be greatly appreciately. You obviously understand this > vm code much better than I do, so if it's easy to fix by adding some > generic vmap cache thingy, please do. Well, it may not be easy to _fix_, but it's easy to try a few improvements ;) How do I make an image and run a workload that will coerce XFS into doing a significant number of vmaps? - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html