Re: [PATCH] AFS: Implement shared-writable mmap [try #2]

2007-05-16 Thread Nick Piggin

David Howells wrote:

Nick Piggin <[EMAIL PROTECTED]> wrote:



You can drop the lock, do the invalidation,



Hmmm...  There's a danger of incurring a race by doing that.  Consider two
processes both trying to write to a dirty page for which writeback will be
rejected:

 (1) The first process gets EKEYREJECTED from the server, drops its lock and
 is then preempted.

 (2) The second process gets EKEYREJECTED from the server, drops its lock,
 truncates the page, reloads the page and modifies it.

 (3) The first process resumes and truncates the page, thereby splatting the
 second process's write.

Or:

 (1) The first process gets EKEYREJECTED from the server, clears the writeback
 information from the page, drops its lock and is then preempted.

 (2) The second process attaches its own writeback information to the page and
 modifies it.

 (3) The first process resumes and truncates the page, thereby splatting the
 second process's write.


If there are race issues involving concurrent invalidations, then you'd
fix that up by taking a filesystem specific lock to prevent them.

Generic write path should be holding i_mutex, but I don't think you can
take that from page_mkwrite... Just add one of your own.



Really, what I want to do is pass the page lock to truncate to deal with.
Better still, I want truncate to be selective, based on whether or not a page
is still associated with the rejected writeback.  I wonder if I should call
truncate_complete_page() or invalidate_complete_page() directly.


No, you shouldn't. We could theoretically introduce a new API for this,
but I think it would be preferable if you can fix the race in the fs.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-16 Thread David Woodhouse
On Thu, 2007-05-17 at 15:12 +0900, Dongjun Shin wrote:
> The current trend of flash-based device is to hide the flash-specific details
> from the host OS. The flash memory is encapsulated in a package
> which contains a dedicated controller where a small piece of software (F/W or 
> FTL)
> runs and makes the storage shown as a block device to the host.

Yes. These things are almost always implemented _very_ badly by the same
kind of crack-smoking hobo they drag in off the streets to write BIOSen.

It's bog-roll technology; if you fancy a laugh, try doing some real
reliability tests on them some time. Powerfail testing is a good one.
 
This kind of thing is OK for disposable storage such as in digital
cameras, where it doesn't matter that it's no more reliable than a
floppy disc, but for real long-term storage it's really a bad idea.

> IMHO, for a flash-optimized filesystem to be useful and widely-used, it would 
> be better
> to run on a block device and to be designed to run efficiently on top of the 
> FTL.
> (ex. log-structured filesystem on general block device)

There's little point in optimising a file system _specifically_ for
devices which often aren't reliable enough to keep your data anyway.
You might as well use ramfs.

It's unfortunate really -- there's no _fundamental_ reason why FTL has
to be done so badly; it's just that it almost always is. Direct access
to the flash from Linux is _always_ going to be better in practice --
and that way you avoid the problems with dual journalling, along with
the problems with the underlying FTL continuing to keep (and copy around
during GC) sectors which the top-level filesystem has actually
deallocated, etc.

-- 
dwmw2



Re: [PATCH 0/2] file capabilities: Introduction

2007-05-16 Thread Suparna Bhattacharya
On Mon, May 14, 2007 at 08:00:11PM +, Pavel Machek wrote:
> Hi!
> 
> > "Serge E. Hallyn" <[EMAIL PROTECTED]> wrote:
> > 
> > > Following are two patches which have been sitting for some time in -mm.
> > 
> > Where "some time" == "nearly six months".
> > 
> > We need help considering, reviewing and testing this code, please.
> 
> I did a quick scan, and it looks ok. Plus, it means we can finally start
> using that old capabilities subsystem... so I think we should do it.

FWIW, I looked through it recently as well, and it looked reasonable enough
to me, though I'm not a security expert. I did have a question about
testing corner cases etc, which Serge has tried to address.

Serge, are you planning to post an update without STRICTXATTR? That should
simplify the second patch.

Regards
Suparna

> 
>   Pavel
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) 
> http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Jeff Zheng

Yeah, seems you've locked it down :D. I've written 600GB of data now,
and everything is still fine.
Will let it run overnight and fill the whole 11T. I'll post the result
tomorrow.

Thanks a lot though.

Jeff 

> -Original Message-
> From: Neil Brown [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, 17 May 2007 5:31 p.m.
> To: [EMAIL PROTECTED]; Jeff Zheng; Michal Piotrowski; Ingo 
> Molnar; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]; linux-fsdevel@vger.kernel.org
> Subject: RE: Software raid0 will crash the file-system, when 
> each disk is 5TB
> 
> On Thursday May 17, [EMAIL PROTECTED] wrote:
> > 
> > Uhm, I just noticed something.
> > 'chunk' is unsigned long, and when it gets shifted up, we 
> might lose 
> > bits.  That could still happen with the 4*2.75T arrangement, but is 
> > much more likely in the 2*5.5T arrangement.
> 
> Actually, it cannot be a problem with the 4*2.75T arrangement.
>   chunk << chunksize_bits
> 
> will not exceed the size of the underlying device *in kilobytes*.
> In that case that is 0xAE9EC800, which will fit in a 32-bit long.
> We don't double it to make sectors until after we add
> zone->dev_offset, which is "sector_t" and so 64bit arithmetic is used.
> 
> So I'm quite certain this bug will cause exactly the problems 
> experienced!!
> 
> > 
> > Jeff, can you try this patch?
> 
> Don't bother about the other tests I mentioned, just try this one.
> Thanks.
> 
> NeilBrown
> 
> > Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
> > 
> > ### Diffstat output
> >  ./drivers/md/raid0.c |2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
> > --- .prev/drivers/md/raid0.c2007-05-17 
> 10:33:30.0 +1000
> > +++ ./drivers/md/raid0.c2007-05-17 15:02:15.0 +1000
> > @@ -475,7 +475,7 @@ static int raid0_make_request (request_q
> > x = block >> chunksize_bits;
> > tmp_dev = zone->dev[sector_div(x, zone->nb_dev)];
> > }
> > -   rsect = (((chunk << chunksize_bits) + zone->dev_offset)<<1)
> > +   rsect = ((((sector_t)chunk << chunksize_bits) + zone->dev_offset)<<1)
> > + sect_in_chunk;
> >   
> > bio->bi_bdev = tmp_dev->bdev;
> 


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Neil Brown
On Thursday May 17, [EMAIL PROTECTED] wrote:
> 
> Uhm, I just noticed something.
> 'chunk' is unsigned long, and when it gets shifted up, we might lose
> bits.  That could still happen with the 4*2.75T arrangement, but is
> much more likely in the 2*5.5T arrangement.

Actually, it cannot be a problem with the 4*2.75T arrangement.
  chunk << chunksize_bits

will not exceed the size of the underlying device *in kilobytes*.
In that case that is 0xAE9EC800, which will fit in a 32-bit long.
We don't double it to make sectors until after we add
zone->dev_offset, which is "sector_t" and so 64bit arithmetic is used.

So I'm quite certain this bug will cause exactly the problems
experienced!!

> 
> Jeff, can you try this patch?

Don't bother about the other tests I mentioned, just try this one.
Thanks.

NeilBrown

> Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
> 
> ### Diffstat output
>  ./drivers/md/raid0.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
> --- .prev/drivers/md/raid0.c  2007-05-17 10:33:30.0 +1000
> +++ ./drivers/md/raid0.c  2007-05-17 15:02:15.0 +1000
> @@ -475,7 +475,7 @@ static int raid0_make_request (request_q
>   x = block >> chunksize_bits;
>   tmp_dev = zone->dev[sector_div(x, zone->nb_dev)];
>   }
> - rsect = (((chunk << chunksize_bits) + zone->dev_offset)<<1)
> + rsect = ((((sector_t)chunk << chunksize_bits) + zone->dev_offset)<<1)
>   + sect_in_chunk;
>   
>   bio->bi_bdev = tmp_dev->bdev;


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Jeff Zheng

> What is the nature of the corruption?  Is it data in a file 
> that is wrong when you read it back, or does the filesystem 
> metadata get corrupted?
The corruption is in fs metadata: jfs is completely destroyed; after
umount, fsck does not recognize it as jfs anymore. Xfs gives a kernel
crash, but seems still recoverable.
> 
> Can you try the configuration that works, and sha1sum the 
> files after you have written them to make sure that they 
> really are correct?
We have verified the data on the working configuration: we have written
around 900 identical 10G files and verified that the md5sum is actually
the same. The verification took two days though :)

> My thought here is "maybe there is a bad block on one device, 
> and the block is used for data in the 'working' config, and 
> for metadata in the 'broken' config."
> 
> Can you try a degraded raid10 configuration. e.g.
> 
>mdadm -C /dev/md1 --level=10 --raid-disks=4 /dev/first missing \
>/dev/second missing
> 
> That will lay out the data in exactly the same place as with 
> raid0, but will use totally different code paths to access 
> it.  If you still get a problem, then it isn't in the raid0 code.

I will try this later today, as I'm now trying different sizes of the
components: 3.4T seems to be working; testing 4.1T right now.

> Maybe try version 1 metadata (mdadm --metadata=1).  I doubt 
> that would make a difference, but as I am grasping at straws 
> already, it may be a straw worth trying.

Well, the problem may also be in the 3ware disk array, or the disk array
driver: the guy complaining about the same problem is also using a 3ware
disk array controller. But there is no way to verify that, and a single
disk array has been working fine for us.

Jeff


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Neil Brown
On Wednesday May 16, [EMAIL PROTECTED] wrote:
> On Thu, 17 May 2007, Neil Brown wrote:
> 
> > On Thursday May 17, [EMAIL PROTECTED] wrote:
> >>
> >>> The only difference of any significance between the working
> >>> and non-working configurations is that in the non-working,
> >>> the component devices are larger than 2Gig, and hence have
> >>> sector offsets greater than 32 bits.
> >>
> >> Do you mean 2T here? But in both configurations, the component devices are
> >> larger than 2T (2.25T&5.5T).
> >
> > Yes, I meant 2T, and yes, the components are always over 2T.
> 
> 2T decimal or 2T binary?
> 

Either.  The smallest is actually 2.75T (typo above).
Precisely it was
  2929641472  kilobytes
or 
  5859282944 sectors
or 
  0x15D3D9000 sectors.

So it is over 32bits already...

Uhm, I just noticed something.
'chunk' is unsigned long, and when it gets shifted up, we might lose
bits.  That could still happen with the 4*2.75T arrangement, but is
much more likely in the 2*5.5T arrangement.

Jeff, can you try this patch?

Thanks.
NeilBrown


Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/raid0.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
--- .prev/drivers/md/raid0.c2007-05-17 10:33:30.0 +1000
+++ ./drivers/md/raid0.c2007-05-17 15:02:15.0 +1000
@@ -475,7 +475,7 @@ static int raid0_make_request (request_q
x = block >> chunksize_bits;
tmp_dev = zone->dev[sector_div(x, zone->nb_dev)];
}
-   rsect = (((chunk << chunksize_bits) + zone->dev_offset)<<1)
+   rsect = ((((sector_t)chunk << chunksize_bits) + zone->dev_offset)<<1)
+ sect_in_chunk;
  
bio->bi_bdev = tmp_dev->bdev;


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread david

On Thu, 17 May 2007, Neil Brown wrote:


On Thursday May 17, [EMAIL PROTECTED] wrote:



The only difference of any significance between the working
and non-working configurations is that in the non-working,
the component devices are larger than 2Gig, and hence have
sector offsets greater than 32 bits.


Do you mean 2T here? But in both configurations, the component devices are
larger than 2T (2.25T&5.5T).


Yes, I meant 2T, and yes, the components are always over 2T.


2T decimal or 2T binary?


So I'm
at a complete loss.  The raid0 code follows the same paths and does
the same things and uses 64bit arithmetic where needed.

So I have no idea how there could be a difference between these two
cases.

I'm at a loss...

NeilBrown




RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Neil Brown
On Thursday May 17, [EMAIL PROTECTED] wrote:
> I tried the patch; the same problem shows up, but no BUG_ON report.
> 
> Is there anything else I can do?
> 

What is the nature of the corruption?  Is it data in a file that is
wrong when you read it back, or does the filesystem metadata get
corrupted?

Can you try the configuration that works, and sha1sum the files after
you have written them to make sure that they really are correct?
My thought here is "maybe there is a bad block on one device, and the
block is used for data in the 'working' config, and for metadata in
the 'broken' config."

Can you try a degraded raid10 configuration. e.g.

   mdadm -C /dev/md1 --level=10 --raid-disks=4 /dev/first missing \
   /dev/second missing

That will lay out the data in exactly the same place as with raid0,
but will use totally different code paths to access it.  If you still
get a problem, then it isn't in the raid0 code.

Maybe try version 1 metadata (mdadm --metadata=1).  I doubt that would
make a difference, but as I am grasping at straws already, it may be a
straw worth trying.

NeilBrown


Re: + knfsd-exportfs-add-exportfsh-header-fix.patch added to -mm tree

2007-05-16 Thread Neil Brown
>On Wednesday May 16, [EMAIL PROTECTED] wrote:
> > If CIFS provides some fixed-length identifier for files, then
> > you might be able to do it
> 
> Most CIFS servers (Windows on NTFS, Samba etc.) can return a "unique 
> identifier" (a 64 bit inode number), in conjunction with the volume id, 
> that is probably good enough ... right?  This can be returned on various 
> calls (level 0x03EE "file_internal_info" - returns only this number).  If 
> reverse lookup is required - ie given a "unique identifier" what is its 
> path name - there are probably a few different ways to handle this but 
> presumably local filesystems run into the same issue.

Yes, that "unique identifier" sounds like it would be suitable to put
in the filehandle.
But reverse lookup is definitely required.
Providing you can turn this into a 'struct inode *' in the filesystem,
the code in exportfs/ can help turn that into a fully connected
dentry.

NeilBrown


Re: + knfsd-exportfs-add-exportfsh-header-fix.patch added to -mm tree

2007-05-16 Thread Steven French
> If CIFS provides some fixed-length identifier for files, then
> you might be able to do it

Most CIFS servers (Windows on NTFS, Samba etc.) can return a "unique 
identifier" (a 64 bit inode number), in conjunction with the volume id, 
that is probably good enough ... right?  This can be returned on various 
calls (level 0x03EE "file_internal_info" - returns only this number).  If 
reverse lookup is required - ie given a "unique identifier" what is its 
path name - there are probably a few different ways to handle this but 
presumably local filesystems run into the same issue.


Steve French
Senior Software Engineer
Linux Technology Center - IBM Austin
phone: 512-838-2294
email: sfrench at-sign us dot ibm dot com


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Jeff Zheng
I tried the patch; the same problem shows up, but no BUG_ON report.

Is there anything else I can do?


Jeff


> Yes, I meant 2T, and yes, the components are always over 2T.  
> So I'm at a complete loss.  The raid0 code follows the same 
> paths and does the same things and uses 64bit arithmetic where needed.
> 
> So I have no idea how there could be a difference between 
> these two cases.  
> 
> I'm at a loss...
> 
> NeilBrown
> 


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Neil Brown
On Thursday May 17, [EMAIL PROTECTED] wrote:
> 
> > The only difference of any significance between the working 
> > and non-working configurations is that in the non-working, 
> > the component devices are larger than 2Gig, and hence have 
> > sector offsets greater than 32 bits.
> 
> Do you mean 2T here? But in both configurations, the component devices are
> larger than 2T (2.25T&5.5T).

Yes, I meant 2T, and yes, the components are always over 2T.  So I'm
at a complete loss.  The raid0 code follows the same paths and does
the same things and uses 64bit arithmetic where needed.

So I have no idea how there could be a difference between these two
cases.  

I'm at a loss...

NeilBrown


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Jeff Zheng

> The only difference of any significance between the working 
> and non-working configurations is that in the non-working, 
> the component devices are larger than 2Gig, and hence have 
> sector offsets greater than 32 bits.

Do you mean 2T here? But in both configurations, the component devices are
larger than 2T (2.25T&5.5T).
 
> This does cause a slightly different code path in one place, 
> but I cannot see it making a difference.  But maybe it does.
> 
> What architecture is this running on?
> What C compiler are you using?

i386 (i686)
gcc 4.0.2 20051125
Distro is Fedora Core; we've tried FC4 and FC6.

> Can you try with this patch?  It is the only thing that I can 
> find that could conceivably go wrong.
> 

OK, I will try the patch and post the result.

Best Regards
Jeff Zheng



RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Neil Brown
On Wednesday May 16, [EMAIL PROTECTED] wrote:
> Here is the information of the created raid0. Hope it is enough.

Thanks.
Everything looks fine here.

The only difference of any significance between the working and
non-working configurations is that in the non-working, the component
devices are larger than 2Gig, and hence have sector offsets greater
than 32 bits.

This does cause a slightly different code path in one place, but I
cannot see it making a difference.  But maybe it does.

What architecture is this running on?
What C compiler are you using?

Can you try with this patch?  It is the only thing that I can find
that could conceivably go wrong.

Thanks,
NeilBrown

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/raid0.c |1 +
 1 file changed, 1 insertion(+)

diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
--- .prev/drivers/md/raid0.c2007-05-17 10:33:30.0 +1000
+++ ./drivers/md/raid0.c2007-05-17 10:34:02.0 +1000
@@ -461,6 +461,7 @@ static int raid0_make_request (request_q
  
while (block >= (zone->zone_offset + zone->size)) 
zone++;
+   BUG_ON(block < zone->zone_offset);
 
sect_in_chunk = bio->bi_sector & ((chunk_size<<1) -1);
 


Re: + knfsd-exportfs-add-exportfsh-header-fix.patch added to -mm tree

2007-05-16 Thread Neil Brown
On Wednesday May 16, [EMAIL PROTECTED] wrote:
> Any ideas what are the minimum export operation(s) that cifs would need to 
> add to export under nfsd?  It was not clear to me after reading the 
> Exporting document in Documentation directory.

You need to be able to map a dentry to a filehandle (you get about 20
bytes) and back again.
If CIFS provides some fixed-length identifier for files, then you might
be able to do it.  If not, you cannot really do it at all.  And I suspect
the latter.  (There are other requirements, like get_parent, but we
could probably work around those if we really needed to).

Theoretically, you could make it work with NFSv4 and volatile file
handles, but I doubt it would really work in practice.  I don't think
the "volatile" concept quite stretch as far as you would need.

Probably the best way to nfs-export a CIFS filesystem is to use the
user-space nfs server.  It caches recently used filenames and uses a
filehandle which is a hash of the name.  It works on a best-effort
basis, and if, for example, the server restarts, you will lose
connections to open files.  While it is not perfect, it can be very
useful in some situations.

NeilBrown


Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc

2007-05-16 Thread David Chinner
On Wed, May 16, 2007 at 07:21:16AM -0500, Dave Kleikamp wrote:
> On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote:
> > On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote:
> 
> > > Following changes were made to the previous version:
> > >  1) Added description before sys_fallocate() definition.
> > >  2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
> > > posix_fallocate should return EINVAL for len <= 0.
> > >  3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
> > >  4) Do not return ENODEV for dirs (let individual file systems decide if
> > > they want to support preallocation to directories or not.
> > >  5) Check for wrap through zero.
> > >  6) Update c/mtime if fallocate() succeeds.
> > 
> > Please don't make this always happen. c/mtime updates should be dependent
> > on the mode being used and whether there is visible change to the file. If 
> > no
> > userspace visible changes to the file occurred, then timestamps should not
> > be changed.
> 
> i_blocks will be updated, so it seems reasonable to update ctime.  mtime
> shouldn't be changed, though, since the contents of the file will be
> unchanged.

That's assuming blocks were actually allocated - if the prealloc range already
has underlying blocks there is no change and so we should not be changing
mtime either. Only the filesystem will know if it has changed the file, so I
think that timestamp updates need to be driven down to that level, not done
blindly at the highest layer.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

2007-05-16 Thread David Chinner
On Wed, May 16, 2007 at 11:19:29AM +0100, David Howells wrote:
> 
> However, page_mkwrite() isn't told which bit of the page is going to be
> written to.  This means it has to ask prepare_write() to make sure the whole
> page is filled in.  In other words, offset and to must be equal (in AFS I set
> them both to 0).

The assumption is the page is already up to date and we are writing
the whole page unless EOF lands inside the page. AFAICT, we can't
get called with a page that is not uptodate and so page filling is
not something we should be doing (or want to be doing) here. All we
want to do is to be able to change the mapping from a read to a
write mapping (e.g. a read mapping of a hole needs to be changed on
write) and do the relevant space reservation/allocation and buffer
mapping needed for this change.

> However, if someone adds a syscall to punch holes in files, this may change...

We already have them - ioctl(XFS_IOC_UNRESVSP) and
madvise(MADV_REMOVE) - and another - fallocate(FA_DEALLOCATE) - is
on its way. Racing with truncates should already be handled by the
truncate code (i.e. partial page truncation does the zero filling).

/me makes note to implement ->truncate_range() in XFS for MADV_REMOVE.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Jeff Zheng
 
You will definitely meet the same problem. As very large hardware disks
become more and more popular, this will become a big issue for software
raid.


Jeff

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Thursday, 17 May 2007 6:04 a.m.
To: Andreas Dilger
Cc: Jeff Zheng; [EMAIL PROTECTED];
linux-fsdevel@vger.kernel.org
Subject: Re: Software raid0 will crash the file-system, when each disk
is 5TB


My experience is that if you don't have CONFIG_LBD enabled then the
kernel will report the larger disk as 2G and everything will work, you
just won't get all the space.

plus he seems to be crashing around 500G of data

and finally (if I am reading the post correctly) if he configures the
drives as 4x2.75TB=11TB instead of 2x5.5TB=11TB he doesn't have the same
problem.

I'm getting ready to setup a similar machine that will have 3x10TB (3 15
disk arrays with 750G drives), but won't be ready to try this for a few
more days.

David Lang


RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Jeff Zheng
Problem is that it only happens when you actually write data to the
raid. You need the actual space to reproduce the problem.

Jeff 

-Original Message-
From: Jan Engelhardt [mailto:[EMAIL PROTECTED] 
Sent: Thursday, 17 May 2007 6:17 a.m.
To: [EMAIL PROTECTED]
Cc: Andreas Dilger; Jeff Zheng; [EMAIL PROTECTED];
linux-fsdevel@vger.kernel.org
Subject: Re: Software raid0 will crash the file-system, when each disk
is 5TB


On May 16 2007 11:04, [EMAIL PROTECTED] wrote:
>
> I'm getting ready to setup a similar machine that will have 3x10TB (3 
> 15 disk arrays with 750G drives), but won't be ready to try this for a
few more days.

You could emulate it with VMware. Big disks are quite "cheap" when they
are not allocated.


Jan
-- 


Re: + knfsd-exportfs-add-exportfsh-header-fix.patch added to -mm tree

2007-05-16 Thread J. Bruce Fields
On Wed, May 16, 2007 at 12:03:57PM -0500, Steven French wrote:
> I thought that until a few days ago, a sequence like the following (two 
> nfs servers exporting the same clustered data)
> 
> on client 1 lock range A through B of file1 (exported from nfs server 1)
> on client 2 lock range A through C of file 1 (exported from nfs server 2)
> on client 1 write  A through B
> on client 2 write A through C
> on client 1 unlock A through B
> on client 2 unlock A through C
> 
> would corrupt data (theoretically could be fixed as nfsd calls lock 
> methods 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fd85b8170dabbf021987875ef7f903791f4f181e)
>  

Right.

> but the more obvious point is that with two nfsd servers exporting the 
> same file data via the same cluster fs (under nfsd), the latencies can be 
> longer and the opportunity for stale metadata (file sizes)

Hm.  How could nfsd get stale metadata?

I'm just (probably naively) assuming that a "cluster" filesystem
attempts to provide much higher cache consistency than actually
necessary to keep nfs clients happy.  But, if not, it would be nice to
understand the problem.

--b.


Re: [RFC][PATCH 8/14] Union-mount lookup

2007-05-16 Thread Serge E. Hallyn
Quoting Jan Engelhardt ([EMAIL PROTECTED]):
> 
> On May 16 2007 10:38, Bharata B Rao wrote:
> >> 
> >> >+lookup_union:
> >> >+ do {
> >> >+ struct vfsmount *mnt = find_mnt(topmost);
> >> >+ UM_DEBUG_DCACHE("name=\"%s\", inode=%p, device=%s\n",
> >> >+ topmost->d_name.name, topmost->d_inode,
> >> >+ mnt->mnt_devname);
> >> >+ mntput(mnt);
> >> >+ } while (0);
> >> 
> >> Why the extra do{}while? [elsewhere too]
> >
> >Not sure, may be to get a scope to define 'mnt' here. Jan ?
> 
> What I was implicitly suggesting was that mnt could be moved into the
> normal 'function scope'.
> 
> 
>   Jan

This code can't stay anyway so it's kind of moot.  find_mnt() is bogus,
and the topmost and overlaid mappings need to be changed from
dentry->dentry to (vfsmnt,dentry)->(vfsmnt,dentry) in order to cope with
bind mounts and mount namespaces.

-serge


Re: [RFC][PATCH 8/14] Union-mount lookup

2007-05-16 Thread Jan Engelhardt

On May 16 2007 10:38, Bharata B Rao wrote:
>> 
>> >+lookup_union:
>> >+   do {
>> >+   struct vfsmount *mnt = find_mnt(topmost);
>> >+   UM_DEBUG_DCACHE("name=\"%s\", inode=%p, device=%s\n",
>> >+   topmost->d_name.name, topmost->d_inode,
>> >+   mnt->mnt_devname);
>> >+   mntput(mnt);
>> >+   } while (0);
>> 
>> Why the extra do{}while? [elsewhere too]
>
>Not sure, may be to get a scope to define 'mnt' here. Jan ?

What I was implicitly suggesting was that mnt could be moved into the
normal 'function scope'.


Jan


Re: [PATCH] LogFS take three

2007-05-16 Thread Jörn Engel
On Wed, 16 May 2007 19:17:18 +, Pavel Machek wrote:
> 
> In kernel fsck
> 
> > --- /dev/null   2007-04-18 05:32:26.652341749 +0200
> > +++ linux-2.6.21logfs/fs/logfs/progs/fsck.c 2007-05-15 00:54:22.0 
> > +0200
> > @@ -0,0 +1,332 @@
> > +/*
> > + * fs/logfs/prog/fsck.c- filesystem check
> > + *
> > + * As should be obvious for Linux kernel code, license is GPLv2
> > + *
> > + * Copyright (c) 2005-2007 Joern Engel
> > + *
> > + * In principle this could get moved to userspace.  However it might still
> > + * make some sense to keep it in the kernel.  It is a pure checker and will
> > + * only report problems, not attempt to repair them.
> > + */
> 
> Is there version that repairs?

No.

> BUG is not right thing to do for media error.

I know.  Top 3 items of my todo list are:
- Handle system crashes
- Add second journal
- Error handling

> > +
> > +#if 0
> > +/* rootdir */
> 
> Please just delete it, not comment it out like this.

That will get resurrected, even before the move to userspace.  I had to
change the filesystem format for compression support and this is an
artifact of the transition phase.

Jörn

-- 
Ninety percent of everything is crap.
-- Sturgeon's Law


Re: [PATCH] LogFS take three

2007-05-16 Thread Pavel Machek
Hi!

In kernel fsck

> --- /dev/null 2007-04-18 05:32:26.652341749 +0200
> +++ linux-2.6.21logfs/fs/logfs/progs/fsck.c   2007-05-15 00:54:22.0 
> +0200
> @@ -0,0 +1,332 @@
> +/*
> + * fs/logfs/prog/fsck.c  - filesystem check
> + *
> + * As should be obvious for Linux kernel code, license is GPLv2
> + *
> + * Copyright (c) 2005-2007 Joern Engel
> + *
> + * In principle this could get moved to userspace.  However it might still
> + * make some sense to keep it in the kernel.  It is a pure checker and will
> + * only report problems, not attempt to repair them.
> + */

Is there version that repairs?

> + /* Some segments are reserved.  Just pretend they were all valid */
> + reserved = btree_lookup(&super->s_reserved_segments, segno);
> + if (reserved)
> + return 0;
> +
> + err = wbuf_read(sb, dev_ofs(sb, segno, 0), sizeof(sh), &sh);
> + BUG_ON(err);

BUG is not right thing to do for media error.

> +/*
> + * fs/logfs/prog/mkfs.c  - filesystem generation
> + *
> + * As should be obvious for Linux kernel code, license is GPLv2
> + *
> + * Copyright (c) 2005-2007 Joern Engel
> + *
> + * Should get moved to userspace.
> + */

Indeed. 

> +
> +#if 0
> +/* rootdir */

Please just delete it, not comment it out like this.

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [PATCH] AFS: Implement shared-writable mmap [try #2]

2007-05-16 Thread David Howells
Nick Piggin <[EMAIL PROTECTED]> wrote:

> You can drop the lock, do the invalidation,

Hmmm...  There's a danger of incurring a race by doing that.  Consider two
processes both trying to write to a dirty page for which writeback will be
rejected:

 (1) The first process gets EKEYREJECTED from the server, drops its lock and
 is then preempted.

 (2) The second process gets EKEYREJECTED from the server, drops its lock,
 truncates the page, reloads the page and modifies it.

 (3) The first process resumes and truncates the page, thereby splatting the
 second process's write.

Or:

 (1) The first process gets EKEYREJECTED from the server, clears the writeback
 information from the page, drops its lock and is then preempted.

 (2) The second process attaches its own writeback information to the page and
 modifies it.

 (3) The first process resumes and truncates the page, thereby splatting the
 second process's write.

Really, what I want to do is pass the page lock to truncate to deal with.
Better still, I want truncate to be selective, based on whether or not a page
is still associated with the rejected writeback.  I wonder if I should call
truncate_complete_page() or invalidate_complete_page() directly.

What might help is that PG_writeback is set on all pages in the rejected run,
even those that are unlocked.

Network filesystems are such fun:-)

David


Re: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Jan Engelhardt

On May 16 2007 11:04, [EMAIL PROTECTED] wrote:
>
> I'm getting ready to setup a similar machine that will have 3x10TB (3 15 disk
> arrays with 750G drives), but won't be ready to try this for a few more days.

You could emulate it with VMware. Big disks are quite "cheap" when
they are not allocated.


Jan


Re: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread david

On Wed, 16 May 2007, Bill Davidsen wrote:


Jeff Zheng wrote:

 Here is the information of the created raid0. Hope it is enough.



If I read this correctly, the problem is with JFS rather than RAID?


he had the same problem with xfs.

David Lang



Re: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread david

On Wed, 16 May 2007, Andreas Dilger wrote:


On May 16, 2007  11:09 +1200, Jeff Zheng wrote:

We are using two 3ware disk array controllers, each of them is connected
8 750GB harddrives. And we build a software raid0 on top of that. The
total capacity is 5.5TB+5.5TB=11TB

We use jfs as the file-system, we have a test application that write
data continuously to the disks. After writing 52 10GB files, jfs
crashed. And we are not able to recover it, fsck doesn't recognise it
anymore.
We then tried xfs, same application, lasted a little longer, but gives
kernel crash later.


Check if your kernel has CONFIG_LBD enabled.

The kernel doesn't check if the block layer can actually write to
a block device > 2TB.


my experience is that if you don't have CONFIG_LBD enabled then the kernel 
will report the larger disk as 2TB and everything will work, you just won't 
get all the space.


plus he seems to be crashing around 500G of data

and finally (if I am reading the post correctly) if he configures the 
drives as 4x2.2TB=11TB instead of 2x5.5TB=11TB he doesn't have the same 
problem.


I'm getting ready to setup a similar machine that will have 3x10TB (3 15 
disk arrays with 750G drives), but won't be ready to try this for a few 
more days.


David Lang


Re: [PATCH] AFS: Implement shared-writable mmap [try #2]

2007-05-16 Thread Nick Piggin

So did we just get your issues sorted? I _think_ *snip* is the
Howells code for "OK", but I can never be sure ;)

FWIW, as a rule, ClearPageUptodate should never be done by anyone,
least of all a filesystem on regular file pagecache. I need to go
through and audit this stuff... but so much backlog :P

David Howells wrote:

Nick Piggin <[EMAIL PROTECTED]> wrote:



How do you do a write-through cache for shared-writable mmap?


For shared writable mmap? I don't know...


Anyway, *snip* the side discussion.

--
SUSE Labs, Novell Inc.


Re: [patch 30/41] reiserfs convert to new aops.

2007-05-16 Thread Vladimir V. Saveliev
Hello, Nick

This is new version of the patch.

reiserfs_prepare_write and reiserfs_commit_write are still there, but they do 
not show themselves in any struct address_space_operations instance.

xattrs and ioctl use them directly.


From: Vladimir Saveliev <[EMAIL PROTECTED]>

Convert reiserfs to new aops

Signed-off-by: Vladimir Saveliev <[EMAIL PROTECTED]>



diff -puN fs/reiserfs/inode.c~reiserfs-convert-to-new-aops fs/reiserfs/inode.c
--- linux-2.6.21-mm2/fs/reiserfs/inode.c~reiserfs-convert-to-new-aops	2007-05-16 20:13:36.0 +0300
+++ linux-2.6.21-mm2-vs/fs/reiserfs/inode.c	2007-05-16 20:13:36.0 +0300
@@ -16,11 +16,12 @@
 #include 
 #include 
 #include 
+#include 
 
-static int reiserfs_commit_write(struct file *f, struct page *page,
- unsigned from, unsigned to);
-static int reiserfs_prepare_write(struct file *f, struct page *page,
-  unsigned from, unsigned to);
+int reiserfs_commit_write(struct file *f, struct page *page,
+			  unsigned from, unsigned to);
+int reiserfs_prepare_write(struct file *f, struct page *page,
+			   unsigned from, unsigned to);
 
 void reiserfs_delete_inode(struct inode *inode)
 {
@@ -2549,8 +2550,71 @@ static int reiserfs_writepage(struct pag
 	return reiserfs_write_full_page(page, wbc);
 }
 
-static int reiserfs_prepare_write(struct file *f, struct page *page,
-  unsigned from, unsigned to)
+static int reiserfs_write_begin(struct file *file,
+				struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata)
+{
+	struct inode *inode;
+	struct page *page;
+	pgoff_t index;
+	int ret;
+	int old_ref = 0;
+
+	index = pos >> PAGE_CACHE_SHIFT;
+	page = __grab_cache_page(mapping, index);
+	if (!page)
+		return -ENOMEM;
+	*pagep = page;
+
+	inode = mapping->host;
+	reiserfs_wait_on_write_block(inode->i_sb);
+	fix_tail_page_for_writing(page);
+	if (reiserfs_transaction_running(inode->i_sb)) {
+		struct reiserfs_transaction_handle *th;
+		th = (struct reiserfs_transaction_handle *)current->
+		journal_info;
+		BUG_ON(!th->t_refcount);
+		BUG_ON(!th->t_trans_id);
+		old_ref = th->t_refcount;
+		th->t_refcount++;
+	}
+	ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				reiserfs_get_block);
+	if (ret && reiserfs_transaction_running(inode->i_sb)) {
+		struct reiserfs_transaction_handle *th = current->journal_info;
+		/* this gets a little ugly.  If reiserfs_get_block returned an
+		 * error and left a transaction running, we've got to close it,
+		 * and we've got to free handle if it was a persistent transaction.
+		 *
+		 * But, if we had nested into an existing transaction, we need
+		 * to just drop the ref count on the handle.
+		 *
+		 * If old_ref == 0, the transaction is from reiserfs_get_block,
+		 * and it was a persistent trans.  Otherwise, it was nested above.
+		 */
+		if (th->t_refcount > old_ref) {
+			if (old_ref)
+				th->t_refcount--;
+			else {
+				int err;
+				reiserfs_write_lock(inode->i_sb);
+				err = reiserfs_end_persistent_transaction(th);
+				reiserfs_write_unlock(inode->i_sb);
+				if (err)
+					ret = err;
+			}
+		}
+	}
+	if (ret) {
+		unlock_page(page);
+		page_cache_release(page);
+	}
+	return ret;
+}
+
+int reiserfs_prepare_write(struct file *f, struct page *page,
+			   unsigned from, unsigned to)
 {
 	struct inode *inode = page->mapping->host;
 	int ret;
@@ -2603,8 +2667,101 @@ static sector_t reiserfs_aop_bmap(struct
 	return generic_block_bmap(as, block, reiserfs_bmap);
 }
 
-static int reiserfs_commit_write(struct file *f, struct page *page,
- unsigned from, unsigned to)
+static int reiserfs_write_end(struct file *file, struct address_space *mapping,
+			  loff_t pos, unsigned len, unsigned copied,
+			  struct page *page, void *fsdata)
+{
+	struct inode *inode = page->mapping->host;
+	int ret = 0;
+	int update_sd = 0;
+	struct reiserfs_transaction_handle *th;
+	unsigned start;
+
+
+	reiserfs_wait_on_write_block(inode->i_sb);
+	if (reiserfs_transaction_running(inode->i_sb))
+		th = current->journal_info;
+	else
+		th = NULL;
+
+	start = pos & (PAGE_CACHE_SIZE - 1);
+	if (unlikely(copied < len)) {
+		if (!PageUptodate(page))
+			copied = 0;
+
+		page_zero_new_buffers(page, start + copied, start + len);
+	}
+	flush_dcache_page(page);
+
+	reiserfs_commit_page(inode, page, start, start + copied);
+	unlock_page(page);
+	mark_page_accessed(page);
+	page_cache_release(page);
+
+	/* generic_commit_write does this for us, but does not update the
+	 ** transaction tracking stuff when the size changes.  So, we have
+	 ** to do the i_size updates here.
+	 */
+	pos += copied;
+	if (pos > inode->i_size) {
+		struct reiserfs_transaction_handle myth;
+		reiserfs_write_lock(inode->i_sb);
+		/* If the file have grown beyond the border where it
+		   can have a tail, unmark it as needing a tail
+		   packing */
+		if ((have_large_tails(inode->i_sb)
+		 && inode->i_size > i_block_size(inode) * 4)
+		|| (have_small_tails(inode->i_sb)

Re: [PATCH resend] introduce I_SYNC

2007-05-16 Thread Jörn Engel
On Wed, 16 May 2007 10:15:35 -0700, Andrew Morton wrote:
> On Wed, 16 May 2007 19:01:14 +0200 Jörn Engel <[EMAIL PROTECTED]> wrote:
> 
> > This patch introduces a new flag, I_SYNC and separates out all sync()
> > users of I_LOCK to use the new flag instead.
> 
> gack, you like sticking your head in dark and dusty holes.

I don't remember enjoying the experience too much.  For a day or two I
was thinking about quitting and becoming a monk somewhere.

> If we're going to do this then please let's get some exhaustive commentary
> in there so that others have a chance of understanding these flags without
> having to do the amount of reverse-engineering which you've been put through.

Not sure how the monasteries would cope with that.

Jörn

-- 
I've never met a human being who would want to read 17,000 pages of
documentation, and if there was, I'd kill him to get him out of the
gene pool.
-- Joseph Costello


Re: [PATCH 2/2] AFS: Implement shared-writable mmap

2007-05-16 Thread David Howells
Hugh Dickins <[EMAIL PROTECTED]> wrote:

> > VM in a way that people are not unhappy with :)
> 
> I'm hoping you intended one less negative ;)

Or did you mean one fewer negative? :-)

David


Re: [PATCH] AFS: Implement shared-writable mmap [try #2]

2007-05-16 Thread David Howells
Nick Piggin <[EMAIL PROTECTED]> wrote:

> > How do you do a write-through cache for shared-writable mmap?
> 
> For shared writable mmap? I don't know...

You can't do write-through caching for shared-writable mmap because the writes
go directly into the pagecache once the page is made writable, at least, short
of instruction emulation.

At some point in the future we'll be asked to turf the data back to the
server.

> does POSIX require mmap data to be coherent with read(2)/write(2)? ;)

I suspect so, but I don't know offhand.  I want it to be coherent anyway,
otherwise it's inconsistent with OpenAFS and Arla (or at least more so).

Note also that the choice of write-through or write-back caching also has
implications for local on-disk caching of modified data and disconnected
operation.

> I just mean more generally. simple write(2) writes, for starters.

Given that writing through an mmap'd section is write-back by its very
nature[*] and so since I have to do write-back anyway, why consider doing
write-through too?

[*] Unless we want to do instruction emulation or single stepping.

Hmmm... I wonder if O_SYNC should use write-through caching.

David


Re: [PATCH 2/2] AFS: Implement shared-writable mmap

2007-05-16 Thread Chuck Ebbert
Hugh Dickins wrote:
> On Thu, 17 May 2007, Nick Piggin wrote:
>> ... and I don't want to change the
>> VM in a way that people are not unhappy with :)
> 
> I'm hoping you intended one less negative ;)

Or one more...


Re: [PATCH] AFS: Implement shared-writable mmap [try #2]

2007-05-16 Thread Nick Piggin

David Howells wrote:

Nick Piggin <[EMAIL PROTECTED]> wrote:



I can't call invalidate_inode_pages() or similar because that might
incorrectly kill one of B's writes (or someone else's writes); besides,
the on-server file hasn't changed.


Why would that kill anyone's writes?



Because invalidate_inode_pages() forcibly removes the dirty flag from each page


It had better not. We use that sucker to nuke pagecache when we're trying to
reclaim inodes, for example...



I can't as it can/would deadlock if called from prepare_write() in two
different ways.


Which ways? Are you talking about prepare_write being called from
page_mkwrite, or anywhere?



 (1) prepare_write() is called with the target page locked and does not release
 the lock.  The truncation routines lock the page prior to invalidating it.
 Any truncation routine that skips locked pages is of no use.


You can drop the lock, do the invalidation, and return AOP_TRUNCATED_PAGE. The
new aops patches will provide a better solution, but that will work today.



 (2) Consider a run of pages that make up a single write by one user.  Two
 other writes from other users may be attempting to overwrite that run at
 the same time.  Each of them would need to invalidate the other's locked
 page(s).


See #1.



Furthermore, the caller of prepare_write() probably won't take kindly to the
page it's dealing with being evicted from the pagecache.


It's fine if you return AOP_TRUNCATED_PAGE.



More generally it sounds like a nasty thing to have a writeback cache if it
can become incoherent (due to dirty pages that subsequently cannot be
written back) without notification. Have you tried doing a write-through
one?



How do you do a write-through cache for shared-writable mmap?


I just mean more generally. simple write(2) writes, for starters.

For shared writable mmap? I don't know... does POSIX require mmap data
to be coherent with read(2)/write(2)? ;)



You may be clearing PG_uptodate, but isn't there still an underlying problem
that you can have mappings to the page at that point? If that isn't a problem
for you, then I don't know why you would have to clear PG_uptodate at all.



There might be, yes.  I guess I should ask the VM to nuke all PTEs to each of
these pages too.


That's what the invalidate / truncate routines do.

--
SUSE Labs, Novell Inc.


Re: [PATCH 2/2] AFS: Implement shared-writable mmap

2007-05-16 Thread Nick Piggin

Hugh Dickins wrote:

On Thu, 17 May 2007, Nick Piggin wrote:


... and I don't want to change the
VM in a way that people are not unhappy with :)



I'm hoping you intended one less negative ;)


Derrr... I'm an idiot!



Re: [PATCH 2/2] AFS: Implement shared-writable mmap

2007-05-16 Thread Hugh Dickins
On Thu, 17 May 2007, Nick Piggin wrote:
> 
> ... and I don't want to change the
> VM in a way that people are not unhappy with :)

I'm hoping you intended one less negative ;)


Re: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Bill Davidsen

Jeff Zheng wrote:

Here is the information of the created raid0. Hope it is enough.

  
If I read this correctly, the problem is with JFS rather than RAID? Have 
you tried not mounting the JFS filesystem but just starting the array 
which crashes, so you can read bits of it, etc, and verify that the 
array itself is working?


And can you run an fsck on the filesystem, if that makes sense? I assume 
you got to actually write a f/s at one time, and I've never used JFS 
under Linux. I spent five+ years using it on AIX, though, complex but 
robust.

The crashing one:
md: bind
md: bind
md: raid0 personality registered for level 0
md0: setting max_sectors to 4096, segment boundary to 1048575
raid0: looking at sde
raid0:   comparing sde(5859284992) with sde(5859284992)
raid0:   END
raid0:   ==> UNIQUE
raid0: 1 zones
raid0: looking at sdd
raid0:   comparing sdd(5859284992) with sde(5859284992)
raid0:   EQUAL
raid0: FINAL 1 zones
raid0: done.
raid0 : md_size is 11718569984 blocks.
raid0 : conf->hash_spacing is 11718569984 blocks.
raid0 : nb_zone is 2.
raid0 : Allocating 8 bytes for hash.
JFS: nTxBlock = 8192, nTxLock = 65536

The working one:
md: bind
md: bind
md: bind
md: bind
md0: setting max_sectors to 4096, segment boundary to 1048575
raid0: looking at sdd
raid0:   comparing sdd(2929641472) with sdd(2929641472)
raid0:   END
raid0:   ==> UNIQUE
raid0: 1 zones
raid0: looking at sdg
raid0:   comparing sdg(2929641472) with sdd(2929641472)
raid0:   EQUAL
raid0: looking at sdf
raid0:   comparing sdf(2929641472) with sdd(2929641472)
raid0:   EQUAL
raid0: looking at sde
raid0:   comparing sde(2929641472) with sdd(2929641472)
raid0:   EQUAL
raid0: FINAL 1 zones
raid0: done.
raid0 : md_size is 11718565888 blocks.
raid0 : conf->hash_spacing is 11718565888 blocks.
raid0 : nb_zone is 2.
raid0 : Allocating 8 bytes for hash.
JFS: nTxBlock = 8192, nTxLock = 65536

-Original Message-
From: Neil Brown [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 16 May 2007 12:04 p.m.

To: Michal Piotrowski
Cc: Jeff Zheng; Ingo Molnar; [EMAIL PROTECTED];
[EMAIL PROTECTED]; linux-fsdevel@vger.kernel.org
Subject: Re: Software raid0 will crash the file-system, when each disk
is 5TB

On Wednesday May 16, [EMAIL PROTECTED] wrote:
  

Anybody have a clue?

  


No...
When a raid0 array is assemble, quite a lot of message get printed
about number of zones and hash_spacing etc.  Can you collect and post
those.  Both for the failing case (2*5.5T) and the working case
(4*2.55T) is possible.
  



--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: [PATCH resend] introduce I_SYNC

2007-05-16 Thread Andrew Morton
On Wed, 16 May 2007 19:01:14 +0200 Jörn Engel <[EMAIL PROTECTED]> wrote:

> This patch introduces a new flag, I_SYNC and separates out all sync()
> users of I_LOCK to use the new flag instead.

gack, you like sticking your head in dark and dusty holes.

If we're going to do this then please let's get some exhaustive commentary
in there so that others have a chance of understanding these flags without
having to do the amount of reverse-engineering which you've been put through.


Re: + knfsd-exportfs-add-exportfsh-header-fix.patch added to -mm tree

2007-05-16 Thread Steven French
I thought that until a few days ago, a sequence like the following (two 
nfs servers exporting the same clustered data)

on client 1 lock range A through B of file1 (exported from nfs server 1)
on client 2 lock range A through C of file 1 (exported from nfs server 2)
on client 1 write  A through B
on client 2 write A through C
on client 1 unlock A through B
on client 2 unlock A through C

would corrupt data (theoretically could be fixed as nfsd calls lock methods
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fd85b8170dabbf021987875ef7f903791f4f181e)

but the more obvious point is that with two nfsd servers exporting the
same file data via the same cluster fs (under nfsd), the latencies can be
longer and the opportunity for stale metadata (file sizes) and also writes
getting reordered is higher.


Steve French
Senior Software Engineer
Linux Technology Center - IBM Austin
phone: 512-838-2294
email: sfrench at-sign us dot ibm dot com



"J. Bruce Fields" <[EMAIL PROTECTED]> 
05/16/2007 11:02 AM

To
Steven French/Austin/[EMAIL PROTECTED]
cc
Christoph Hellwig <[EMAIL PROTECTED]>, [EMAIL PROTECTED], 
[EMAIL PROTECTED], linux-fsdevel@vger.kernel.org, [EMAIL PROTECTED]
Subject
Re: + knfsd-exportfs-add-exportfsh-header-fix.patch added to -mm tree






On Wed, May 16, 2007 at 09:55:41AM -0500, Steven French wrote:
> Any ideas what are the minimum export operation(s) that cifs would need to
> add to export under nfsd?  It was not clear to me after reading the
> Exporting document in Documentation directory.
> 
> (some users had wanted to export files from Windows servers to nfs clients
> files by putting an nfs server mounted over cifs in between - I realize
> that this can corrupt data due to nfs client caching etc., as even in some
> cases could happen if you try to export a cluster file system under
> nfsd).

What cases are you thinking of?

--b.




Re: [PATCH 2/2] AFS: Implement shared-writable mmap

2007-05-16 Thread Nick Piggin

Hugh Dickins wrote:

On Wed, 16 May 2007, Nick Piggin wrote:


Andrew Morton wrote:


I need to work out what to do with

mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
mm-merge-populate-and-nopage-into-fault-fixes-nonlinear-doc-fix.patch
mm-merge-populate-and-nopage-into-fault-fixes-nonlinear-fix.patch
mm-merge-nopfn-into-fault.patch
convert-hugetlbfs-to-use-vm_ops-fault.patch
mm-remove-legacy-cruft.patch
mm-debug-check-for-the-fault-vs-invalidate-race.patch
mm-fix-clear_page_dirty_for_io-vs-fault-race.patch

Probably merge them, I guess.  Hugh had concerns, I think over small
additional overhead from the lock_page()?



That's right, the overhead of the lock_page()/unlock_page() in the common
path of faulting, and of the extra call to unmap_mapping_range() when
truncating (because page lock doesn't satisfactorily replace the old
sequence count when COWing).



Yes he did. It seems to only be noticeable in microbenchmarks.



So far, yes.  I expect it'll surface in some reallife workload
sometime, but let's not get too depressed about that.  I guess
your blithe "Scalability is not an issue" comment still irks me.



I say I believe scalability will not be a huge issue, because for
concurrent faulters on the same page, they still have cacheline
contention beginning before we lock the page (tree_lock), and
ending after we unlock it (page->_count), and a few others in the
middle for good measure. I sure don't think it is going to help,
but I don't think it would be a great impact on an already sucky
workload.

We would have contention against other sources of lock_page, but
OTOH, we want to fix up the clear_page_dirty_for_io race window
by doing a wait_for_page_locked here anyway.

We could introduce some contention for some other lockers... but
again I think they are often situations where the page fields are
going to be modified anyway, or where the fault code would be
contending on something anyway (eg. vs !uptodate read, truncate).
The lock hold time in the fault should be smaller than many.

I'm not saying it would never be contended, but it is such a
granular thing and it is so often surrounded by file level locking.



Still, I have some lock_page speedup work that eliminates that regression
anyway.



Again, rather too blithely said.  You have a deep well of ingenuity,
but I doubt it can actually wash away all of the small overhead added.


OK, I haven't proven to eliminate the overhead completely, but we can
see that options exist... And we should be able to at least help the
more heavily impacted powerpc architecture and bring the hit to a
more reasonable level there.



However, Hugh hasn't exactly said yes or no yet...



Getting a "yes" or "no" out of me is very hard work indeed.
But I didn't realize that was gating this work: if the world
had to wait on me, we'd be in trouble.

I think there are quite a few people eager to see the subsequent
->fault changes go in.  And I think I'd just like to minimize the
difference between -mm and mainline, so maximize comprehensibility,
by getting this all in.  I've not heard of any correctness issues
with it, and we all agree that the page lock there is attractively
simple.


Well I value your opinion of course, and I don't want to change the
VM in a way that people are not unhappy with :)



If I ever find a better way of handling it,
I'm free to send patches in future, after all.


That's definitely what I had in mind -- the page lock is a simple
starting point (that is hopefully correct) that we would always be
able to build on with something more fancy in future.



Did that sound something like a "yes"?


A little bit :)


--
SUSE Labs, Novell Inc.


[PATCH resend] introduce I_SYNC

2007-05-16 Thread Jörn Engel
While others are busy coming up with new silly names, here is something
substantial to worry about.

The patch fixes a deadlock problem well enough for LogFS to survive.  The
problem itself is generic and seems to be ancient.  Shaggy has code in
JFS from about 2.4.20 that seems to work around the deadlock.  Dave
Chinner indicated that this could cause latency problems (not a
deadlock) on the NFS server side.  I still suspect that NTFS has hit the
same deadlock and its current "fix" will cause data corruption instead.

After all that, I consider it important that some version of this patch
gets merged.  But before doing so, the patch needs better review and
testing.  The code it changes is quite subtle and easy to get wrong.


1. Introduction

In its write path, LogFS may have to do some Garbage Collection when
space is getting tight.  GC requires reading inodes, so the I_LOCK bit
is taken for some random inodes.

I_LOCK is also held when syncing inodes to flash, so LogFS has to wait
for those inodes.  Inodes are written by the same code path as regular
file data, which needs to acquire an fs-global mutex.  Call stacks of the
two processes will look roughly like this:

Process A:                      Process B:
  inode_wait                      [filesystem locking write path]
  __wait_on_bit                   __writeback_single_inode
  out_of_line_wait_on_bit
  ifind_fast
  [filesystem calling iget()]
  [filesystem locking write path]


2. The usage of inode_lock and I_LOCK

Almost all modifications of inodes are protected by the inode_lock, a
global spinlock.  Some modifications, however, can block for various
reasons and require the inode_lock to get dropped temporarily.  In the
meantime, the individual inode needs to get protected somehow.  Usually
this happens through the use of I_LOCK.

But I_LOCK is not a simple mutex.  It is a Janus-faced bit in the inode
that is used for several things, including mutual exclusion and
completion notification.  Most users are open-coded, so it is not easy
to follow, but it can be summarized in the table below.

In this table columns indicate events when I_LOCK is either set or
reset (or not reset but all waiters are notified anyway).  Rows
indicate code that either checks for I_LOCK and changes behaviour
depending on its state or is waiting until I_LOCK gets reset (or is
waiting even if I_LOCK is not set).

                            __sync_single_inode
                            |   get_new_inode[_fast]
                            |   |   unlock_new_inode
                            |   |   |   dispose_list
                            |   |   |   |   generic_delete_inode
                            |   |   |   |   |   generic_forget_inode
lock                        v   v   |   |   |   |
unlock/complete             v   v   v   v   v   comment
------------------------------------------------------------------
__writeback_single_inode    X   O   O   O   O   sync
write_inode_now             X   O   O   O   O   sync
clear_inode                 X   O   O   O   O   sync
__mark_inode_dirty          X   O   O   O   O   lists
generic_osync_inode         X   O   O   O   O   sync
get_new_inode[_fast]        O   X   O   O   O   mutex
find_inode[_fast]           O   O   X   X   X   I_FREEING
ifind[_fast]                O   X   O   O   O   read

jfs txCommit                ?   ?   ?   ?   ?   ?
xfs_ichgtime[_fast]         X   O   O   O   O   sync

Comments:
sync - wait for writeout to finish
lists - move inode to dirty list without racing against __sync_single_inode
mutex - protect against two concurrent get_new_inode[_fast] creating two inodes
I_FREEING - wait for inode to get freed, then repeat
read - don't return inode until it is read from medium

Now, the "X"s mark combinations where columns and rows are related.
"O"s mark combinations where afaics columns and rows share no
relationship whatsoever except that both use either I_LOCK or
wake_up_inode()/wait_on_inode() or any other of the partially open-coded
variants.

The table shows that two large usage groups exist for I_LOCK, one
dealing exclusively with the various sync() functions in
fs/fs-writeback.c and another basically confined to fs/inode.c code.
JFS has one remaining user that is unclear to me.


This patch introduces a new flag, I_SYNC, and separates out all sync()
users of I_LOCK to use the new flag instead.


 fs/fs-writeback.c   |   39 ---
 fs/xfs/linux-2.6/xfs_iops.c |4 ++--
 include/linux/fs.h  |2 ++
 include/linux/writeback.h   |7 +++
 4 files changed, 35 insertions(+), 17 deletions(-)

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -99,11 +99,11 @@ void __mark_inode_dirty(struct inode *in
inode->i_state |= flags;

Re: [PATCH] AFS: Implement shared-writable mmap [try #2]

2007-05-16 Thread David Howells
Nick Piggin <[EMAIL PROTECTED]> wrote:

> > I can't call invalidate_inode_pages() or similar because that might
> > incorrectly kill one of B's writes (or someone else's writes); besides,
> > the on-server file hasn't changed.
> 
> Why would that kill anyone's writes?

Because invalidate_inode_pages() forcibly removes the dirty flag from each page
in the inode and then calls invalidatepage() - and thus they don't get written
back, but some of those pages may contain writes from other processes.  The
whole inode isn't owned by one user at a time.

I hadn't considered invalidate_inode_pages_range(), but that suffers from the
deadlock problem.

> > I can't as it can/would deadlock if called from prepare_write() in two
> > different ways.
> 
> Which ways? Are you talking about prepare_write being called from
> page_mkwrite, or anywhere?

 (1) prepare_write() is called with the target page locked and does not release
 the lock.  The truncation routines lock the page prior to invalidating it.
 Any truncation routine that skips locked pages is of no use.

 (2) Consider a run of pages that make up a single write by one user.  Two
 other writes from other users may be attempting to overwrite that run at
 the same time.  Each of them would need to invalidate the other's locked
 page(s).

Furthermore, the caller of prepare_write() probably won't take kindly to the
page it's dealing with being evicted from the pagecache.

> More generally it sounds like a nasty thing to have a writeback cache if it
> can become incoherent (due to dirty pages that subsequently cannot be
> written back) without notification. Have you tried doing a write-through
> one?

How do you do a write-through cache for shared-writable mmap?

> You may be clearing PG_uptodate, but isn't there still an underlying problem
> that you can have mappings to the page at that point? If that isn't a problem
> for you, then I don't know why you would have to clear PG_uptodate at all.

There might be, yes.  I guess I should ask the VM to nuke all PTEs to each of
these pages too.

David


Re: [PATCH] LogFS take three

2007-05-16 Thread Jörn Engel
On Wed, 16 May 2007 23:49:55 +0800, David Woodhouse wrote:
> 
> Utility is a factor of the underlying design -- a filesystem designed
> for flash really isn't suited to block devices.

I can think of at least three examples where LogFS would indeed make
sense on block devices.

Jörn

-- 
Happiness isn't having what you want, it's wanting what you have.
-- unknown


Re: [PATCH 2/2] AFS: Implement shared-writable mmap

2007-05-16 Thread Hugh Dickins
On Wed, 16 May 2007, Nick Piggin wrote:
> Andrew Morton wrote:
> > I need to work out what to do with
> > 
> > mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
> > mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> > mm-merge-populate-and-nopage-into-fault-fixes-nonlinear-doc-fix.patch
> > mm-merge-populate-and-nopage-into-fault-fixes-nonlinear-fix.patch
> > mm-merge-nopfn-into-fault.patch
> > convert-hugetlbfs-to-use-vm_ops-fault.patch
> > mm-remove-legacy-cruft.patch
> > mm-debug-check-for-the-fault-vs-invalidate-race.patch
> > mm-fix-clear_page_dirty_for_io-vs-fault-race.patch
> > 
> > Probably merge them, I guess.  Hugh had concerns, I think over small
> > additional overhead from the lock_page()?

That's right, the overhead of the lock_page()/unlock_page() in the common
path of faulting, and of the extra call to unmap_mapping_range() when
truncating (because page lock doesn't satisfactorily replace the old
sequence count when COWing).

> Yes he did. It seems to only be noticable in microbenchmarks.

So far, yes.  I expect it'll surface in some reallife workload
sometime, but let's not get too depressed about that.  I guess
your blithe "Scalability is not an issue" comment still irks me.

> In my opinion not enough to withhold pagecache corruption bug fixes.

It is a pity to be adding overhead to a common path in order to fix
such very rare cases, but yes, we probably have to live with that.

> Still, I have some lock_page speedup work that eliminates that regression
> anyway.

Again, rather too blithely said.  You have a deep well of ingenuity,
but I doubt it can actually wash away all of the small overhead added.

> However, Hugh hasn't exactly said yes or no yet...

Getting a "yes" or "no" out of me is very hard work indeed.
But I didn't realize that was gating this work: if the world
had to wait on me, we'd be in trouble.

I think there are quite a few people eager to see the subsequent
->fault changes go in.  And I think I'd just like to minimize the
difference between -mm and mainline, so maximize comprehensibility,
by getting this all in.  I've not heard of any correctness issues
with it, and we all agree that the page lock there is attractively
simple.

If I ever find a better way of handling it,
I'm free to send patches in future, after all.

Did that sound something like a "yes"?

Hugh


Re: [PATCH] AFS: Implement shared-writable mmap [try #2]

2007-05-16 Thread Nick Piggin

David Howells wrote:

Nick Piggin <[EMAIL PROTECTED]> wrote:



In general (modulo bugs and crazy filesystems), you're not allowed to have
!uptodate pages mapped into user addresses because that implies the user
would be allowed to see garbage.



The situation I have to deal with is a tricky one.  Consider:

 (1) User A modifies a page with his key.  This change gets made in the
 pagecache, but is not written back immediately.

 (2) User B then wants to modify the same page, but with a different key.
 This means that afs_prepare_write() has to flush A's writes back to the
 server before B is permitted to write.

 (3) The flush fails because A is no longer permitted to write to that file.
 This means that the change in the page cache is now stale.  We can't just
 write it back as B because B didn't make the change.

What I've made afs_prepare_write() do in this situation is to nuke A's entire
write.  We can't write any of it back.  I can't call invalidate_inode_pages()
or similar because that might incorrectly kill one of B's writes (or someone
else's writes); besides, the on-server file hasn't changed.


Why would that kill anyone's writes?



To nuke A's write, each page that makes up that write is marked non-uptodate
and then reloaded.  Whilst I might wish to call invalidate_inode_pages_range(),
I can't as it can/would deadlock if called from prepare_write() in two
different ways.


Which ways? Are you talking about prepare_write being called from page_mkwrite,
or anywhere?

More generally it sounds like a nasty thing to have a writeback cache if it can
become incoherent (due to dirty pages that subsequently cannot be written
back) without notification. Have you tried doing a write-through one?

You may be clearing PG_uptodate, but isn't there still an underlying problem
that you can have mappings to the page at that point? If that isn't a problem
for you, then I don't know why you would have to clear PG_uptodate at all.



Minor issue: you can just check for `if (!page->mapping)` for truncation,
which is the usual signal to tell the reader you're checking for truncate.



That's inconsistent with other core code, truncate_complete_page() for
example.


Your filesystem internally moves pages between mappings like tmpfs?



You misunderstand me.  truncate_complete_page() uses this:

if (page->mapping != mapping)

not this:

if (!page->mapping)

I think that both cases should work in page_mkwrite().  But !page->mapping does
not appear to be the "usual signal" from what I've seen.


truncate_complete_page does that because it has to handle the case where
the mapping changes from one thing to something else that is non-NULL,
which tmpfs does.

This is not the case for most code in fs.

--
SUSE Labs, Novell Inc.


Re: [PATCH] AFS: Implement shared-writable mmap [try #2]

2007-05-16 Thread David Howells
Nick Piggin <[EMAIL PROTECTED]> wrote:

> In general (modulo bugs and crazy filesystems), you're not allowed to have
> !uptodate pages mapped into user addresses because that implies the user
> would be allowed to see garbage.

The situation I have to deal with is a tricky one.  Consider:

 (1) User A modifies a page with his key.  This change gets made in the
 pagecache, but is not written back immediately.

 (2) User B then wants to modify the same page, but with a different key.
 This means that afs_prepare_write() has to flush A's writes back to the
 server before B is permitted to write.

 (3) The flush fails because A is no longer permitted to write to that file.
 This means that the change in the page cache is now stale.  We can't just
 write it back as B because B didn't make the change.

What I've made afs_prepare_write() do in this situation is to nuke A's entire
write.  We can't write any of it back.  I can't call invalidate_inode_pages()
or similar because that might incorrectly kill one of B's writes (or someone
else's writes); besides, the on-server file hasn't changed.

To nuke A's write, each page that makes up that write is marked non-uptodate
and then reloaded.  Whilst I might wish to call invalidate_inode_pages_range(),
I can't as it can/would deadlock if called from prepare_write() in two
different ways.

> >>Minor issue: you can just check for `if (!page->mapping)` for truncation,
> >>which is the usual signal to tell the reader you're checking for truncate.
> >
> >
> > That's inconsistent with other core code, truncate_complete_page() for
> > example.
> 
> Your filesystem internally moves pages between mappings like tmpfs?

You misunderstand me.  truncate_complete_page() uses this:

if (page->mapping != mapping)

not this:

if (!page->mapping)

I think that both cases should work in page_mkwrite().  But !page->mapping does
not appear to be the "usual signal" from what I've seen.

However, that's a minor matter.

David


Re: + knfsd-exportfs-add-exportfsh-header-fix.patch added to -mm tree

2007-05-16 Thread J. Bruce Fields
On Wed, May 16, 2007 at 09:55:41AM -0500, Steven French wrote:
> Any ideas what are the minimum export operation(s) that cifs would need to 
> add to export under nfsd?  It was not clear to me after reading the 
> Exporting document in the Documentation directory.
> 
> (some users had wanted to export files from Windows servers to nfs clients 
> files by putting an nfs server mounted over cifs in between - I realize 
> that this can corrupt data due to nfs client caching etc., as even in some 
> cases could happen if you try to export a cluster file system under nfsd).

What cases are you thinking of?

--b.


Re: [PATCH] LogFS take three

2007-05-16 Thread David Woodhouse
On Wed, 2007-05-16 at 08:34 -0700, Andrew Morton wrote:
> Reduced testability, mainly. Also potentially reduced usefulness. 

CONFIG_MTD has never been a barrier to testability. JFFS2 depends on MTD
and had _most_ of its early testing and development done on the 'fake'
mtdram device.

Utility is a factor of the underlying design -- a filesystem designed
for flash really isn't suited to block devices.

-- 
dwmw2



Re: [PATCH] LogFS take three

2007-05-16 Thread Andrew Morton
On Wed, 16 May 2007 20:07:18 +0800 David Woodhouse <[EMAIL PROTECTED]> wrote:

> > It's strange and a bit regrettable that an fs would have dependency on MTD,
> > really.
> 
> Why? Other file systems has dependencies on BLOCK or on NET. It seems
> entirely normal to me.

Reduced testability, mainly. Also potentially reduced usefulness.


Re: [PATCH] LogFS take three

2007-05-16 Thread Artem Bityutskiy
On Wed, 2007-05-16 at 22:04 +0800, David Woodhouse wrote:
> On Wed, 2007-05-16 at 15:53 +0200, Jörn Engel wrote:
> > 
> > My experience is that no matter which name I pick, people will
> > complain
> > anyway.  Previous suggestions included:
> > jffs3
> > jefs
> > engelfs
> > poofs
> > crapfs
> > sweetfs
> > cutefs
> > dynamic journaling fs - djofs
> > tfsfkal - the file system formerly known as logfs
> 
> Can we call it jörnfs? :)

And it is essential to preserve "ö" and let Pavel enjoy :-)

-- 
Best regards,
Artem Bityutskiy (Битюцкий Артём)



Re: + knfsd-exportfs-add-exportfsh-header-fix.patch added to -mm tree

2007-05-16 Thread Steven French
Any ideas what are the minimum export operation(s) that cifs would need to 
add to export under nfsd?  It was not clear to me after reading the 
Exporting document in the Documentation directory.

(some users had wanted to export files from Windows servers to nfs clients 
files by putting an nfs server mounted over cifs in between - I realize 
that this can corrupt data due to nfs client caching etc., as even in some 
cases could happen if you try to export a cluster file system under nfsd).


Steve French
Senior Software Engineer
Linux Technology Center - IBM Austin
phone: 512-838-2294
email: sfrench at-sign us dot ibm dot com


Re: [PATCH] LogFS take three

2007-05-16 Thread CaT
On Wed, May 16, 2007 at 03:53:19PM +0200, Jörn Engel wrote:
> Imo they all suck.  LogFS also sucks, but it allows me to make a stupid
> joke and keep my logfs.org domain.

Well if stupid jokes are a goer there's always gordonfs. :)

*hides*

-- 
"To the extent that we overreact, we proffer the terrorists the
greatest tribute."
- High Court Judge Michael Kirby


Re: [PATCH] LogFS take three

2007-05-16 Thread Kevin Bowling

On 5/16/07, David Woodhouse <[EMAIL PROTECTED]> wrote:

On Wed, 2007-05-16 at 15:53 +0200, Jörn Engel wrote:
>
> My experience is that no matter which name I pick, people will
> complain
> anyway.  Previous suggestions included:
> jffs3
> jefs
> engelfs
> poofs
> crapfs
> sweetfs
> cutefs
> dynamic journaling fs - djofs
> tfsfkal - the file system formerly known as logfs

Can we call it jörnfs? :)


However if Jörn is accused of murder, it will have little chance of
being merged :-).


Re: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Andreas Dilger
On May 16, 2007  11:09 +1200, Jeff Zheng wrote:
> We are using two 3ware disk array controllers, each of them is connected
> 8 750GB harddrives. And we build a software raid0 on top of that. The
> total capacity is 5.5TB+5.5TB=11TB
> 
> We use jfs as the file-system, we have a test application that write
> data continuously to the disks. After writing 52 10GB files, jfs
> crashed. And we are not able to recover it, fsck doesn't recognise it
> anymore.
> We then tried xfs, same application, lasted a little longer, but gives
> kernel crash later.

Check if your kernel has CONFIG_LBD enabled.

The kernel doesn't check if the block layer can actually write to
a block device > 2TB.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: [PATCH] LogFS take three

2007-05-16 Thread David Woodhouse
On Wed, 2007-05-16 at 15:53 +0200, Jörn Engel wrote:
> 
> My experience is that no matter which name I pick, people will
> complain
> anyway.  Previous suggestions included:
> jffs3
> jefs
> engelfs
> poofs
> crapfs
> sweetfs
> cutefs
> dynamic journaling fs - djofs
> tfsfkal - the file system formerly known as logfs

Can we call it jörnfs? :)

-- 
dwmw2



Re: [PATCH] LogFS take three

2007-05-16 Thread Jörn Engel
On Wed, 16 May 2007 09:41:10 -0400, John Stoffel wrote:
> Jörn> On Wed, 16 May 2007 12:34:34 +0100, Jamie Lokier wrote:
> 
> Jörn> How many of you have worked for IBM before?  Vowels are not
> evil. ;)
> 
> Nope, they're not.  I just think that LogFS isn't descriptive enough,
> or more accurately, is the *wrong* description of this filesystem.  

That was the whole point.  JFFS2, the journaling flash filesystem, is a
strictly log-structured filesystem.  LogFS has a journal.

It is also the filesystem that tries to scale logarithmically, as Arnd
has noted.  Maybe I should call it Log2 to emphasize this point.  Log1
would imply horrible scalability.

> flashfs works for me.  It's longer, but hey, that's ok.  Even flshfs
> might work.  Oh wait, flesh?  flash?  flush?  Too confusing... :-)   

Maybe.  FFS or flash filesystem already exists.  And YAFFS, yet another
flash filesystem, would be older than flashfs.

My experience is that no matter which name I pick, people will complain
anyway.  Previous suggestions included:
jffs3
jefs
engelfs
poofs
crapfs
sweetfs
cutefs
dynamic journaling fs - djofs
tfsfkal - the file system formerly known as logfs

Plus today:
FFFS
flashfs
fredfs
bob
shizzle

Imo they all suck.  LogFS also sucks, but it allows me to make a stupid
joke and keep my logfs.org domain.

Jörn

-- 
There are two ways of constructing a software design: one way is to make
it so simple that there are obviously no deficiencies, and the other is
to make it so complicated that there are no obvious deficiencies.
-- C. A. R. Hoare


Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

2007-05-16 Thread Nick Piggin

David Howells wrote:

Nick Piggin <[EMAIL PROTECTED]> wrote:



Dave is using prepare_write here to ensure blocks are allocated in the
given range. The filesystem's ->nopage function must ensure it is uptodate
before allowing it to be mapped.



Which is fine... assuming it's called.  For blockdev-based filesystems, this
is probably true.  But I'm not sure you can guarantee it.

I've seen Ext3, for example, unlocking a page that isn't yet uptodate.
nopage() won't get called on it again, but prepare_write() might.  I don't
know why this happens, but it's something I've fallen over in doing
CacheFiles.  When reading, readpage() is just called on it again and again
until it is up to date.  When writing, prepare_write() is called correctly.


There are bugs in the core VM and block filesystem code where !uptodate pages
are left in pagetables. Some of these are fixed in -mm.

But they aren't a good reason to invent completely different ways to do things.



Consider that the code currently works OK today _without_ page_mkwrite.
page_mkwrite is being added to do block allocation / reservation.



Which doesn't prove anything.  All it means is that PG_uptodate being unset is
handled elsewhere.


It means that Dave's page_mkwrite function will do the block allocation
and everything else continues as it is. Your suggested change to pass in
offset == to is just completely wrong for this.

PG_uptodate being unset should be done via pagecache invalidation or truncation
APIs, which (sometimes... modulo bugs) tear down pagetables first.

--
SUSE Labs, Novell Inc.


Re: [AppArmor 39/45] AppArmor: Profile loading and manipulation, pathname matching

2007-05-16 Thread Pavel Machek
Hi!

> Pathname matching, transition table loading, profile loading and
> manipulation.

So we get small interpretter of state machines, and reason we need is
is 'apparmor is misdesigned and works with paths when it should have
worked with handles'.

If you solve the 'new file problem', aa becomes subset of selinux..
And I'm pretty sure patch will be nicer than this.

Pavel

> +int unpack_dfa(struct aa_dfa *dfa, void *blob, size_t size)
> +{
> + int hsize, i;
> + int error = -ENOMEM;
> +
> + /* get dfa table set header */
> + if (size < sizeof(struct table_set_header))
> + goto fail;
> +
> + if (ntohl(*(u32 *)blob) != YYTH_MAGIC)
> + goto fail;
> +
> + hsize = ntohl(*(u32 *)(blob + 4));
> + if (size < hsize)
> + goto fail;
> +
> + blob += hsize;
> + size -= hsize;
> +
> + error = -EPROTO;
> + while (size > 0) {
> + struct table_header *table;
> + table = unpack_table(blob, size);
> + if (!table)
> + goto fail;
> +
> + switch(table->td_id) {
> + case YYTD_ID_ACCEPT:
> + case YYTD_ID_BASE:
> + dfa->tables[table->td_id - 1] = table;
> + if (table->td_flags != YYTD_DATA32)
> + goto fail;
> + break;
> + case YYTD_ID_DEF:
> + case YYTD_ID_NXT:
> + case YYTD_ID_CHK:
> + dfa->tables[table->td_id - 1] = table;
> + if (table->td_flags != YYTD_DATA16)
> + goto fail;
> + break;
> + case YYTD_ID_EC:
> + dfa->tables[table->td_id - 1] = table;
> + if (table->td_flags != YYTD_DATA8)
> + goto fail;
> + break;
> + default:
> + kfree(table);
> + goto fail;
> + }
> +
> + blob += table_size(table->td_lolen, table->td_flags);
> + size -= table_size(table->td_lolen, table->td_flags);
> + }
> +
> + return 0;
> +
> +fail:
> + for (i = 0; i < ARRAY_SIZE(dfa->tables); i++) {
> + if (dfa->tables[i]) {
> + kfree(dfa->tables[i]);
> + dfa->tables[i] = NULL;
> + }
> + }
> + return error;
> +}
> +
> +/**
> + * verify_dfa - verify that all the transitions and states in the dfa tables
> + *  are in bounds.
> + * @dfa: dfa to test
> + *
> + * assumes dfa has gone through the verification done by unpacking
> + */
> +int verify_dfa(struct aa_dfa *dfa)
> +{
> + size_t i, state_count, trans_count;
> + int error = -EPROTO;
> +
> + /* check that required tables exist */
> + if (!(dfa->tables[YYTD_ID_ACCEPT -1 ] &&
> +   dfa->tables[YYTD_ID_DEF - 1] &&
> +   dfa->tables[YYTD_ID_BASE - 1] &&
> +   dfa->tables[YYTD_ID_NXT - 1] &&
> +   dfa->tables[YYTD_ID_CHK - 1]))
> + goto out;
> +
> + /* accept.size == default.size == base.size */
> + state_count = dfa->tables[YYTD_ID_BASE - 1]->td_lolen;
> + if (!(state_count == dfa->tables[YYTD_ID_DEF - 1]->td_lolen &&
> +   state_count == dfa->tables[YYTD_ID_ACCEPT - 1]->td_lolen))
> + goto out;
> +
> + /* next.size == chk.size */
> + trans_count = dfa->tables[YYTD_ID_NXT - 1]->td_lolen;
> + if (trans_count != dfa->tables[YYTD_ID_CHK - 1]->td_lolen)
> + goto out;
> +
> + /* if equivalence classes then its table size must be 256 */
> + if (dfa->tables[YYTD_ID_EC - 1] &&
> + dfa->tables[YYTD_ID_EC - 1]->td_lolen != 256)
> + goto out;
> +
> + for (i = 0; i < state_count; i++) {
> + if (DEFAULT_TABLE(dfa)[i] >= state_count)
> + goto out;
> + if (BASE_TABLE(dfa)[i] >= trans_count + 256)
> + goto out;
> + }
> +
> + for (i = 0; i < trans_count ; i++) {
> + if (NEXT_TABLE(dfa)[i] >= state_count)
> + goto out;
> + if (CHECK_TABLE(dfa)[i] >= state_count)
> + goto out;
> + }
> +
> + error = 0;
> +out:
> + return error;
> +}
> +
> +struct aa_dfa *aa_match_alloc(void)
> +{
> + return kzalloc(sizeof(struct aa_dfa), GFP_KERNEL);
> +}
> +
> +void aa_match_free(struct aa_dfa *dfa)
> +{
> + if (dfa) {
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(dfa->tables); i++)
> + kfree(dfa->tables[i]);
> + }
> + kfree(dfa);
> +}
> +
> +/**
> + * aa_dfa_match - match @path against @dfa starting in @state
> + * @dfa: the dfa to match @path against
> + * @state: the state to start matching in
> + * @path: the path to match against the dfa
> + *
> + * aa_dfa_match will match the full 

Re: [AppArmor 35/45] Allow permission functions to tell between parent and leaf checks

2007-05-16 Thread Pavel Machek
Hi!

> Set the LOOKUP_CONTINUE flag when checking parent permissions. This allows
> permission functions to tell between parent and leaf checks.
> 
> Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]>

I guess you should sign this off.


> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1409,6 +1409,10 @@ static int may_delete(struct inode *dir,
>   BUG_ON(victim->d_parent->d_inode != dir);
>   audit_inode_child(victim->d_name.name, victim->d_inode, dir);
>  
> +#if 0
> + if (nd)
> + nd->flags |= LOOKUP_CONTINUE;
> +#endif

We don't add disabled code to kernel.
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [AppArmor 37/45] AppArmor: Main Part

2007-05-16 Thread Pavel Machek
Hi!

> The underlying functions by which the AppArmor LSM hooks are implemented.
> 
> Signed-off-by: John Johansen <[EMAIL PROTECTED]>
> Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]>

> +#include "inline.h"

Select a better name for include?

> +static inline void aa_permerror2result(int perm_result, struct aa_audit *sa)

Betternameoffunction?

> +/**
> + * mangle -- escape special characters in str
> + * @str: string to escape
> + * @buffer: buffer containing str
> + *
> + * Escape special characters in @str, which is contained in @buffer. @str 
> must
> + * be aligned to the end of the buffer, and the space between @buffer and 
> @str
> + * may be used for escaping.
> + *
> + * Returns @str if no escaping was necessary, a pointer to the beginning of 
> the
> + * escaped string, or NULL if there was not enough space in @buffer.  When
> + * called with a NULL buffer, the return value tells whether any escaping is
> + * necessary.
> + */
> +static const char *mangle(const char *str, char *buffer)
> +{
> + static const char c_escape[] = {
> + ['\a'] = 'a',   ['\b'] = 'b',
> + ['\f'] = 'f',   ['\n'] = 'n',
> + ['\r'] = 'r',   ['\t'] = 't',
> + ['\v'] = 'v',
> + [' '] = ' ',['\\'] = '\\',
> + };
> + const char *s;
> + char *t, c;
> +
> +#define mangle_escape(c) \
> + unlikely((unsigned char)(c) < ARRAY_SIZE(c_escape) &&   \
> +  c_escape[(unsigned char)c])
> +
> + for (s = (char *)str; (c = *s) != '\0'; s++)
> + if (mangle_escape(c))
> + goto escape;
> + return str;
> +
> +escape:
> + if (!buffer)
> + return NULL;
> + for (s = str, t = buffer; (c = *s) != '\0'; s++) {
> + if (mangle_escape(c)) {
> + if (t == s)
> + return NULL;
> + *t++ = '\\';
> + *t++ = c_escape[(unsigned char)c];
> + } else
> + *t++ = c;
> + }
> + *t++ = '\0';
> +
> +#undef mangle_escape
> +
> + return buffer;
> +}

I do not think we want this in kernel.

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [AppArmor 38/45] AppArmor: Module and LSM hooks

2007-05-16 Thread Pavel Machek

> Module parameters, LSM hooks, initialization and teardown.
> 
> Signed-off-by: John Johansen <[EMAIL PROTECTED]>
> Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]>

> +/* Maximum pathname length before accesses will start getting rejected */
> +unsigned int apparmor_path_max = 2 * PATH_MAX;
> +module_param_named(path_max, apparmor_path_max, aauint, S_IRUSR | S_IWUSR);
> +MODULE_PARM_DESC(apparmor_path_max, "Maximum pathname length allowed");

WTF? Why is this configurable?
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [PATCH] AFS: Implement shared-writable mmap [try #2]

2007-05-16 Thread Nick Piggin

David Howells wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> > I would strongly suggest you used (0, PAGE_CACHE_SIZE) for the range,
>
> That tells prepare_write() that the region to be modified is the whole page -
> which is incorrect.  We're going to change a little bit of it.

You don't know how much you're going to change, but it could be anything
in the range of 0 to PAGE_CACHE_SIZE. Clearly a (0, PAGE_CACHE_SIZE)
range is the right range to pass in.

> Hmmm... Thinking about it again, I probably shouldn't be using
> afs_prepare_write() at all.  afs_prepare_write() does two things:
>
>  (1) Fills in the bits around the edges of the region to be modified if the
>      page is not up to date.
>
>  (2) Records the region of the page to be modified.
>
> If the afs_prepare_write() function is invoked by write(), then the region
> in (2) is known, and thus the edges are known.
>
> But if it's called from page_mkwrite(), we don't know where the edges are, so
> we have to refresh the entire page if it's not up to date, and then we have to
> record that the entire page needs writing back as we don't know which bits
> have changed.

Oh god you're doing ClearPageUptodate directly on pagecache pages?

In general (modulo bugs and crazy filesystems), you're not allowed to have
!uptodate pages mapped into user addresses because that implies the user
would be allowed to see garbage.

If you follow that rule, then you can never have a !uptodate page be passed
into page_mkwrite (unless it has just been truncated, in which case you
catch that case anyway).

> > and have your nopage function DTRT.
>
> That assumes nopage() will be called in all cases - which it won't.

No, just assumes that nopage brings the page uptodate and people use the
correct invalidation functions to invalidate pagecache, so you never have
!uptodate pages that are mapped.

> > Rather than add this (not always correct) comment about the VM workings, I'd
> > just add a directive in the page_mkwrite API documentation that the
> > filesystem is to return 0 if the page has been truncated.
>
> I still want to put a comment in my code to say *why* I'm doing this.

The one you put there was wrong.

> > Minor issue: you can just check for `if (!page->mapping)` for truncation,
> > which is the usual signal to tell the reader you're checking for truncate.
>
> That's inconsistent with other core code, truncate_complete_page() for
> example.

Your filesystem internally moves pages between mappings like tmpfs?

> > Then you can remove the comment...
>
> The comment stays, though I'm willing to improve it.

Well I don't really care, so I'll just concentrate on trying to get the
code fixed.

--
SUSE Labs, Novell Inc.


Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

2007-05-16 Thread David Howells
David Woodhouse <[EMAIL PROTECTED]> wrote:

> Really? Is it _really_ going to be modified?

Well, generic_file_buffered_write() doesn't check the success of the copy
before calling commit_write(), presumably because it uses
fault_in_pages_readable() first.

David


Re: [PATCH] LogFS take three

2007-05-16 Thread Jörn Engel
On Wed, 16 May 2007 15:36:44 +0300, Pekka Enberg wrote:
> On 5/16/07, Jörn Engel <[EMAIL PROTECTED]> wrote:
> >
> >More trouble?
> 
> Forgot to add (see below). Seems logfs_segment_read would be simpler
> too if you fixed this.

Would it?  I think that code would still be needed, although possibly in
a different function.

There are two minor drawbacks to using the page cache, btw:
- Indirect blocks need some mapping too.  So either I need to steal a
  bit from the inode space or from the fpos space.
- OOM handling is a bit more complicated.  I would need a mempool for
  that.

> >[ Objects are the units that get compressed.  Segments can contain both
> >compressed and uncompressed objects. ]
> >
> >It is a trade-off.  Each object has a 24 Byte header plus X Bytes of
> >data.  Whether the data is compressed or not is indicated in the header.
> 
> Was my point really. Why do segments contain both compressed and
> uncompressed objects?

Compressing random data will actually enlarge it.  If that happens I
simply store the verbatim uncompressed data instead and mark it as such.

There is also demand for a user-controlled bit in the inode to disable
compression completely.  All those .jpg, .mpg, .mp3, etc. just waste
time by trying and failing to compress them.

Jörn

-- 
Write programs that do one thing and do it well. Write programs to work
together. Write programs to handle text streams, because that is a
universal interface.
-- Doug McIlroy


Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

2007-05-16 Thread David Howells
Nick Piggin <[EMAIL PROTECTED]> wrote:

> Dave is using prepare_write here to ensure blocks are allocated in the
> given range. The filesystem's ->nopage function must ensure it is uptodate
> before allowing it to be mapped.

Which is fine... assuming it's called.  For blockdev-based filesystems, this
is probably true.  But I'm not sure you can guarantee it.

I've seen Ext3, for example, unlocking a page that isn't yet uptodate.
nopage() won't get called on it again, but prepare_write() might.  I don't
know why this happens, but it's something I've fallen over in doing
CacheFiles.  When reading, readpage() is just called on it again and again
until it is up to date.  When writing, prepare_write() is called correctly.

> Consider that the code currently works OK today _without_ page_mkwrite.
> page_mkwrite is being added to do block allocation / reservation.

Which doesn't prove anything.  All it means is that PG_uptodate being unset is
handled elsewhere.

David


Re: [PATCH] AFS: Implement shared-writable mmap [try #2]

2007-05-16 Thread David Howells
Nick Piggin <[EMAIL PROTECTED]> wrote:

> I would strongly suggest you used (0, PAGE_CACHE_SIZE) for the range,

That tells prepare_write() that the region to be modified is the whole page -
which is incorrect.  We're going to change a little bit of it.

Hmmm... Thinking about it again, I probably shouldn't be using
afs_prepare_write() at all.  afs_prepare_write() does two things:

 (1) Fills in the bits around the edges of the region to be modified if the
 page is not up to date.

 (2) Records the region of the page to be modified.

If the afs_prepare_write() function is invoked by write(), then the region
in (2) is known, and thus the edges are known.

But if it's called from page_mkwrite(), we don't know where the edges are, so
we have to refresh the entire page if it's not up to date, and then we have to
record that the entire page needs writing back as we don't know which bits
have changed.

> and have your nopage function DTRT.

That assumes nopage() will be called in all cases - which it won't.

> Rather than add this (not always correct) comment about the VM workings, I'd
> just add a directive in the page_mkwrite API documentation that the filesystem
> is to return 0 if the page has been truncated.

I still want to put a comment in my code to say *why* I'm doing this.

> Minor issue: you can just check for `if (!page->mapping)` for truncation,
> which is the usual signal to tell the reader you're checking for truncate.

That's inconsistent with other core code, truncate_complete_page() for
example.

> Then you can remove the comment...

The comment stays, though I'm willing to improve it.

David


Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

2007-05-16 Thread Chris Mason
On Wed, May 16, 2007 at 11:04:11PM +1000, Nick Piggin wrote:
> Chris Mason wrote:
> >On Wed, May 16, 2007 at 08:09:19PM +0800, David Woodhouse wrote:
> >
> >>On Wed, 2007-05-16 at 11:19 +0100, David Howells wrote:
> >>
> >>>The start and end points passed to block_prepare_write() delimit the 
> >>>region of
> >>>the page that is going to be modified.  This means that prepare_write()
> >>>doesn't need to fill it in if the page is not up to date. 
> >>
> >>Really? Is it _really_ going to be modified? Even if the pointer
> >>userspace gave to write() is bogus, and is going to fault half-way
> >>through the copy_from_user()?
> >
> >
> >This is why there are so many variations on copy_from_user that zero on
> >faults.  One way or another, the prepare_write/commit_write pair are
> >responsible for filling it in.
> 
> I'll add to David's question about David's comment on David's patch, yes
> it will be modified but in that case it would be zero-filled as Chris
> says. However I believe this is incorrect behaviour.
> 
> It is possible to easily fix that so it would only happen via a tiny race
> window (where the source memory gets unmapped at just the right time),
> however nobody seemed too interested (just by checking the return value of
> fault_in_pages_readable).
> 
> The buffered write patches I'm working on fix that (among other things) of
> course. But they do away with prepare_write and introduce new aops, and
> they indeed must not expect the full range to have been written to.

I was also wrong to say prepare_write and commit_write are
responsible, they work together with their callers to make the right
things happen.  Oh well, so much for trying to give a short answer for a
chunk of code full of corner cases ;)

-chris



Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

2007-05-16 Thread Nick Piggin

Chris Mason wrote:
> On Wed, May 16, 2007 at 08:09:19PM +0800, David Woodhouse wrote:
> > On Wed, 2007-05-16 at 11:19 +0100, David Howells wrote:
> > > The start and end points passed to block_prepare_write() delimit the
> > > region of the page that is going to be modified.  This means that
> > > prepare_write() doesn't need to fill it in if the page is not up to date.
> >
> > Really? Is it _really_ going to be modified? Even if the pointer
> > userspace gave to write() is bogus, and is going to fault half-way
> > through the copy_from_user()?
>
> This is why there are so many variations on copy_from_user that zero on
> faults.  One way or another, the prepare_write/commit_write pair are
> responsible for filling it in.

I'll add to David's question about David's comment on David's patch, yes
it will be modified but in that case it would be zero-filled as Chris
says. However I believe this is incorrect behaviour.

It is possible to easily fix that so it would only happen via a tiny race
window (where the source memory gets unmapped at just the right time),
however nobody seemed too interested (just by checking the return value of
fault_in_pages_readable).

The buffered write patches I'm working on fix that (among other things) of
course. But they do away with prepare_write and introduce new aops, and
they indeed must not expect the full range to have been written to.

--
SUSE Labs, Novell Inc.


Re: [PATCH] LogFS take three

2007-05-16 Thread Jörn Engel
On Wed, 16 May 2007 15:08:15 +0300, Pekka Enberg wrote:
> On 5/16/07, Jamie Lokier <[EMAIL PROTECTED]> wrote:
> >Given that the filesystem is still 'experimental', I'd concentrate on
> >getting it stable before worrying about immutable and xattrs unless
> >they are easy.
> 
> We will run into trouble if the on-disk format is not flexible enough
> to accommodate xattrs (think reiser3 here). So I'd worry about it
> before merging to mainline.

Adding xattrs would be fairly simple.  Inodes just need one extra
pointer for that.

Luckily inodes no longer need to be padded to 128 or 256 bytes.  They
are individually compressed, so their size is not limited to powers of
two.

Jörn

-- 
To recognize individual spam features you have to try to get into the
mind of the spammer, and frankly I want to spend as little time inside
the minds of spammers as possible.
-- Paul Graham


Re: [PATCH] LogFS take three

2007-05-16 Thread Jörn Engel
On Wed, 16 May 2007 16:29:22 +0400, Evgeniy Polyakov wrote:
> On Wed, May 16, 2007 at 01:50:03PM +0200, Jörn Engel ([EMAIL PROTECTED]) 
> wrote:
> > On Wed, 16 May 2007 12:34:34 +0100, Jamie Lokier wrote:
> > > 
> > > But if akpm can't pronounce it, how about FFFS for faster flash
> > > filesystem ;-)
> > 
> > How many of you have worked for IBM before?  Vowels are not evil. ;)
> 
> Do you think 'eieio' is a good set? IBM's work too...

I will let someone else comment on that one.

http://www.uwsg.iu.edu/hypermail/linux/kernel/0110.1/1294.html

Jörn

-- 
There are two ways of constructing a software design: one way is to make
it so simple that there are obviously no deficiencies, and the other is
to make it so complicated that there are no obvious deficiencies.
-- C. A. R. Hoare


Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

2007-05-16 Thread Chris Mason
On Wed, May 16, 2007 at 08:09:19PM +0800, David Woodhouse wrote:
> On Wed, 2007-05-16 at 11:19 +0100, David Howells wrote:
> > The start and end points passed to block_prepare_write() delimit the region 
> > of
> > the page that is going to be modified.  This means that prepare_write()
> > doesn't need to fill it in if the page is not up to date. 
> 
> Really? Is it _really_ going to be modified? Even if the pointer
> userspace gave to write() is bogus, and is going to fault half-way
> through the copy_from_user()?

This is why there are so many variations on copy_from_user that zero on
faults.  One way or another, the prepare_write/commit_write pair are
responsible for filling it in.

-chris



Re: + knfsd-exportfs-add-exportfsh-header-fix.patch added to -mm tree

2007-05-16 Thread Christoph Hellwig
On Wed, May 16, 2007 at 08:57:21AM +0200, Christoph Hellwig wrote:
> On Tue, May 15, 2007 at 02:49:14PM -0700, [EMAIL PROTECTED] wrote:
> >  fs/cifs/export.c |3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff -puN fs/cifs/export.c~knfsd-exportfs-add-exportfsh-header-fix 
> > fs/cifs/export.c
> > --- a/fs/cifs/export.c~knfsd-exportfs-add-exportfsh-header-fix
> > +++ a/fs/cifs/export.c
> > @@ -29,7 +29,8 @@
> >*/
> >  
> >  #include 
> > - 
> > +#include 
> > +
> >  #ifdef CONFIG_CIFS_EXPERIMENTAL
> 
> Looks like cifs has grown an export_operations table since I did the
> patch.  But with only a get_parent method that returns an error it's
> not useful at all, so we should rather remove the whole file:

That patch was missing the Makefile hunk, here's the proper one:
(I wish there was a way to avoid having to do quilt add every time)


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6/fs/cifs/cifsfs.c
===
--- linux-2.6.orig/fs/cifs/cifsfs.c 2007-05-16 07:55:35.0 +0200
+++ linux-2.6/fs/cifs/cifsfs.c  2007-05-16 07:55:50.0 +0200
@@ -49,10 +49,6 @@
 static struct quotactl_ops cifs_quotactl_ops;
 #endif /* QUOTA */
 
-#ifdef CONFIG_CIFS_EXPERIMENTAL
-extern struct export_operations cifs_export_ops;
-#endif /* EXPERIMENTAL */
-
 int cifsFYI = 0;
 int cifsERROR = 1;
 int traceSMB = 0;
@@ -114,10 +110,6 @@ cifs_read_super(struct super_block *sb, 
 
sb->s_magic = CIFS_MAGIC_NUMBER;
sb->s_op = &cifs_super_ops;
-#ifdef CONFIG_CIFS_EXPERIMENTAL
-   if (experimEnabled != 0)
-   sb->s_export_op = &cifs_export_ops;
-#endif /* EXPERIMENTAL */  
 /* if (cifs_sb->tcon->ses->server->maxBuf > MAX_CIFS_HDR_SIZE + 512)
sb->s_blocksize = cifs_sb->tcon->ses->server->maxBuf - 
MAX_CIFS_HDR_SIZE; */
 #ifdef CONFIG_CIFS_QUOTA
Index: linux-2.6/fs/cifs/export.c
===
--- linux-2.6.orig/fs/cifs/export.c 2007-05-16 07:55:59.0 +0200
+++ /dev/null   1970-01-01 00:00:00.0 +
@@ -1,52 +0,0 @@
-/*
- *   fs/cifs/export.c
- *
- *   Copyright (C) International Business Machines  Corp., 2007
- *   Author(s): Steve French ([EMAIL PROTECTED])
- *
- *   Common Internet FileSystem (CIFS) client
- * 
- *   Operations related to support for exporting files via NFSD
- *
- *   This library is free software; you can redistribute it and/or modify
- *   it under the terms of the GNU Lesser General Public License as published
- *   by the Free Software Foundation; either version 2.1 of the License, or
- *   (at your option) any later version.
- *
- *   This library is distributed in the hope that it will be useful,
- *   but WITHOUT ANY WARRANTY; without even the implied warranty of
- *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See
- *   the GNU Lesser General Public License for more details.
- *
- *   You should have received a copy of the GNU Lesser General Public License
- *   along with this library; if not, write to the Free Software
- *   Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
- */
- 
- /* 
-  * See Documentation/filesystems/Exporting
-  * and examples in fs/exportfs
-  */
-
-#include 
- 
-#ifdef CONFIG_CIFS_EXPERIMENTAL
- 
-static struct dentry *cifs_get_parent(struct dentry *dentry)
-{
-   /* BB need to add code here eventually to enable export via NFSD */
-   return ERR_PTR(-EACCES);
-}
- 
-struct export_operations cifs_export_ops = {
-   .get_parent = cifs_get_parent,
-/* Following five export operations are unneeded so far and can default */ 

-/* .get_dentry =
-   .get_name =
-   .find_exported_dentry =
-   .decode_fh = 
-   .encode_fs =  */
- };
- 
-#endif /* EXPERIMENTAL */
- 
-


Re: [PATCH] LogFS take three

2007-05-16 Thread Jörn Engel
On Wed, 16 May 2007 13:25:48 +0100, Jamie Lokier wrote:
> 
> Is LogFS really slower than JFFS2 in practice?

Not sure.  I ran a benchmark before adding compression support in QEMU
with a lightning-fast device.  So the results should differ quite a bit
from practice.

http://logfs.org/~joern/logfs/benchmark/benchmark_overview

LogFS was actually faster than JFFS2.  So for that particular
unrealistic benchmark, updating the LogFS tree was less expensive than
trying (and failing) to compress and calculating the CRC was for JFFS2.

With compression finished, I would expect LogFS numbers to degrade.  If
file data had checksums (not done yet, should be optional for users to
decide) even more so.

> I would have guessed reads to be a similar speed, tree updates to be a
> similar speed  to journal  updates for sustained  non-fsyncing writes,
> and the difference unimportant for tiny individual commits whose index
> updates are not merged with any other.  I've not thought about it much
> though.

LogFS isn't that good yet.  Right now, writing 10 adjacent blocks to a
file requires 10 tree updates instead of 1.  Not full updates though,
just up to the inode.

Quite surprisingly, read speed in the benchmark was significantly better
for LogFS, even after subtracting mount time.  I don't know if all of
that can be explained with CRC checks or there is more to it.

Jörn

-- 
I can say that I spend most of my time fixing bugs even if I have lots
of new features to implement in mind, but I give bugs more priority.
-- Andrea Arcangeli, 2000


Re: [patch 17/41] ext2 convert to new aops.

2007-05-16 Thread Nick Piggin
On Mon, May 14, 2007 at 04:06:36PM +1000, [EMAIL PROTECTED] wrote:
> Cc: [EMAIL PROTECTED]
> Cc: Linux Filesystems 
> Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Found a problem in ext2 pagecache directory handling. Trivial fix follows.
Longer-term, it might be better to rework these things a bit so they can
directly use the pagecache_write_begin/pagecache_write_end accessors.
---
Index: linux-2.6/fs/ext2/dir.c
===
--- linux-2.6.orig/fs/ext2/dir.c
+++ linux-2.6/fs/ext2/dir.c
@@ -70,10 +70,18 @@ static int ext2_commit_chunk(struct page
 
dir->i_version++;
block_write_end(NULL, mapping, pos, len, len, page, NULL);
+
+   if (pos+len > dir->i_size) {
+   i_size_write(dir, pos+len);
+   mark_inode_dirty(dir);
+   }
+
if (IS_DIRSYNC(dir))
err = write_one_page(page, 1);
else
unlock_page(page);
+   mark_page_accessed(page);
+
return err;
 }
 
-


Re: [PATCH] LogFS take three

2007-05-16 Thread CaT
On Wed, May 16, 2007 at 01:50:03PM +0200, Jörn Engel wrote:
> On Wed, 16 May 2007 12:34:34 +0100, Jamie Lokier wrote:
> > 
> > But if akpm can't pronounce it, how about FFFS for faster flash
> > filesystem ;-)
> 
> How many of you have worked for IBM before?  Vowels are not evil. ;)
> 
> Grouping four or more consonants to name anything will cause similar
> expressions on people's faces.  Numbers don't help much either.
> 
> Ext2 is a great name, because "ext" actually is a pronouncable syllable.
> MinixFS, ChunkFS, TileFS are great too.  XFS and JFS are ok, at least
> they only have three consonants.  But FFS exists, so I'd rather go for a
> syllable.

FlashFS?

-- 
"To the extent that we overreact, we proffer the terrorists the
greatest tribute."
- High Court Judge Michael Kirby


Re: [PATCH] LogFS take three

2007-05-16 Thread Pekka Enberg

On 5/16/07, Pekka Enberg <[EMAIL PROTECTED]> wrote:
> Forgot to add (see below). Seems logfs_segment_read would be simpler
> too if you fixed this.

Blah. Just to be clear: I forgot to add a "(see below)" text in the
original review comment.


Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc

2007-05-16 Thread Amit K. Arora
On Wed, May 16, 2007 at 07:21:16AM -0500, Dave Kleikamp wrote:
> On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote:
> > On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote:
> 
> > > Following changes were made to the previous version:
> > >  1) Added description before sys_fallocate() definition.
> > >  2) Return EINVAL for len<=0 (With the new draft that Ulrich pointed to,
> > > posix_fallocate should return EINVAL for len <= 0.)
> > >  3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
> > >  4) Do not return ENODEV for dirs (let individual file systems decide if
> > > they want to support preallocation to directories or not.)
> > >  5) Check for wrap through zero.
> > >  6) Update c/mtime if fallocate() succeeds.
> > 
> > Please don't make this always happen. c/mtime updates should be dependent
> > on the mode being used and whether there is visible change to the file. If 
> > no
> > userspace visible changes to the file occurred, then timestamps should not
> > be changed.
> 
> i_blocks will be updated, so it seems reasonable to update ctime.  mtime
> shouldn't be changed, though, since the contents of the file will be
> unchanged.

I agree. Thus the ctime should change for FA_PREALLOCATE mode also
(which does not change the file size) - if we end up having this
additional mode in near future.

--
Regards,
Amit Arora
 
> > e.g. FA_ALLOCATE that changes file size requires same semantics of 
> > ftruncate()
> > extending the file, otherwise no change in timestamps should occur.
> > 
> > Cheers,
> > 
> > Dave.
> -- 
> David Kleikamp
> IBM Linux Technology Center


Re: [PATCH] LogFS take three

2007-05-16 Thread Pekka Enberg

On 5/16/07, Jörn Engel <[EMAIL PROTECTED]> wrote:
> > > +/* FIXME: all this mess should get replaced by using the page cache */
> > > +static void fixup_from_wbuf(struct super_block *sb, struct logfs_area *area,
> > > +		void *read, u64 ofs, size_t readlen)
> > > +{
> >
> > Indeed. And I think you're getting some more trouble because of this...
>
> More trouble?

Forgot to add (see below). Seems logfs_segment_read would be simpler
too if you fixed this.

> [ Objects are the units that get compressed.  Segments can contain both
> compressed and uncompressed objects. ]
>
> It is a trade-off.  Each object has a 24 Byte header plus X Bytes of
> data.  Whether the data is compressed or not is indicated in the header.

Was my point really. Why do segments contain both compressed and
uncompressed objects?


Re: [PATCH] LogFS take three

2007-05-16 Thread Evgeniy Polyakov
On Wed, May 16, 2007 at 01:50:03PM +0200, Jörn Engel ([EMAIL PROTECTED]) wrote:
> On Wed, 16 May 2007 12:34:34 +0100, Jamie Lokier wrote:
> > 
> > But if akpm can't pronounce it, how about FFFS for faster flash
> > filesystem ;-)
> 
> How many of you have worked for IBM before?  Vowels are not evil. ;)

Do you think 'eieio' is a good set? IBM's work too...

We do not get a new filesystem every couple of months, so stealing as good
a name as 'logfs' is not that bad, I think.

-- 
Evgeniy Polyakov


Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc

2007-05-16 Thread Amit K. Arora
On Tue, May 15, 2007 at 05:42:46PM -0700, Mingming Cao wrote:
> On Wed, 2007-05-16 at 01:33 +0530, Amit K. Arora wrote:
> > This patch implements sys_fallocate() and adds support on i386, x86_64
> > and powerpc platforms.
> 
> > @@ -1137,6 +1148,8 @@ struct inode_operations {
> > ssize_t (*listxattr) (struct dentry *, char *, size_t);
> > int (*removexattr) (struct dentry *, const char *);
> > void (*truncate_range)(struct inode *, loff_t, loff_t);
> > +   long (*fallocate)(struct inode *inode, int mode, loff_t offset,
> > + loff_t len);
> >  };
> 
> Does the return value from the fallocate inode operation have to be *long*?
> It's not consistent with the ext4_fallocate() defined in patch 4/5,

I think ->fallocate() should return a "long", since sys_fallocate() has
to return what ->fallocate() returns and hence their return type should
ideally match.
 
> +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t
> len)

I will change the ext4_fallocate() to return a "long" (in patch 4/5)
in the next post.

Agree ?

Thanks!
--
Regards,
Amit Arora

> 
> thus cause compile warnings.
> 
> 
> 
> Mingming


Re: [PATCH] LogFS take three

2007-05-16 Thread Jörn Engel
On Wed, 16 May 2007 13:21:11 +0300, Pekka J Enberg wrote:
> 
> > +#define LOGFS_BUG(sb) do {   \
> > + struct super_block *__sb = sb;  \
> > + logfs_crash_dump(__sb); \
> > + BUG();  \
> > +} while(0)
> 
> Note that BUG() can be a no-op so dumping something on disk might not make 
> sense there. This seems useful, but you probably need to make this bit 
> more generic so that using BUG() proper in your filesystem code does the 
> right thing. Inventing your own wrapper should be avoided.

Hmm.  I am not sure how this could be generic and still make sense.

LogFS has some unused write-once space in the superblock segment.
Embedded people always have problems debugging and were suggesting using
this to store debug information.  That allows me to ask for a filesystem
image and get both the offending image plus a crash dump.  It also
allows me to abort mounting if I ever see an existing crash dump (not
implemented yet).  "First failure data capture" was an old IBM slogan
and the "first" really makes a difference.

So how should I deal with BUG() as a no-op?  I _think_ it should not
make a difference.  People choosing that option should know that their
kernels will continue running with errors and can potentially do any
kind of damage.  My wrapper doesn't seem to change that one way or
another.

What I _should_ do is make the filesystem read-only after this point.
That would actually be quite simple.  Onto my list.

> > +static inline struct logfs_super *LOGFS_SUPER(struct super_block *sb)
> > +{
> > + return sb->s_fs_info;
> > +}
> > +
> > +static inline struct logfs_inode *LOGFS_INODE(struct inode *inode)
> > +{
> > + return container_of(inode, struct logfs_inode, vfs_inode);
> > +}
> 
> No need for upper case in function names.

Will change.

> > +int logfs_memcpy(void *in, void *out, size_t inlen, size_t outlen)
> > +{
> > + if (outlen < inlen)
> > + return -EIO;
> > + memcpy(out, in, inlen);
> > + return inlen;
> > +}
> 
> Please drop this wrapper function. It's better to open-code the error
> handling in the callers (there are total of three of them).

Is it?  The advantage of having the wrapper is that the compression case
and the non-compressed case look identical and the code just repeats the
same pattern twice.

I'll try to open-code it and see how it looks.

> > +/* FIXME: combine with per-sb journal variant */
> > +static unsigned char compressor_buf[LOGFS_MAX_OBJECTSIZE];
> > +static DEFINE_MUTEX(compr_mutex);
> 
> This looks fishy. All reads and writes are serialized by compr_mutex 
> because they share a scratch buffer for compression and uncompression?

s/fishy/lame/

Will move to superblock.

> > +/* FIXME: all this mess should get replaced by using the page cache */
> > +static void fixup_from_wbuf(struct super_block *sb, struct logfs_area *area,
> > + void *read, u64 ofs, size_t readlen)
> > +{
> 
> Indeed. And I think you're getting some more trouble because of this... 

More trouble?

Sadly squeezing bugs out of that beast was easier than wrapping my brain
around the proper solution.  And now it works and has moved to a less
urgent position in my todo list.

> > +int logfs_segment_read(struct super_block *sb, void *buf, u64 ofs)
> > +{
> > + struct logfs_object_header *h;
> > + u16 len;
> > + int err, bs = sb->s_blocksize;
> > +
> > + mutex_lock(&compr_mutex);
> > + err = wbuf_read(sb, ofs, bs + LOGFS_HEADERSIZE, compressor_buf);
> > + if (err)
> > + goto out;
> > + h = (void*)compressor_buf;
> > + len = be16_to_cpu(h->len);
> > +
> > + switch (h->compr) {
> > + case COMPR_NONE:
> > + logfs_memcpy(compressor_buf + LOGFS_HEADERSIZE, buf, bs, bs);
> > + break;
> 
> Seems wasteful to first read the data in a scratch buffer and then 
> memcpy() it immediately for the COMPR_NONE case. Any reason why we can't 
> read a full struct page, for example, and simply use that if we don't need 
> to uncompress anything?

No technical reason.  Just a case of "make it work now and optimize
later".

> > + case COMPR_ZLIB:
> > + err = logfs_uncompress(compressor_buf + LOGFS_HEADERSIZE, buf,
> > + len, bs);
> > + BUG_ON(err);
> > + break;
> 
> Not claiming to understand your on-disk format, but wouldn't it make more 
> sense if we knew whether a given segment is compressed or not _before_ we 
> actually read it?

[ Objects are the units that get compressed.  Segments can contain both
compressed and uncompressed objects. ]

It is a trade-off.  Each object has a 24 Byte header plus X Bytes of
data.  Whether the data is compressed or not is indicated in the header.
So we have to do one of two things.  Either:
- read the complete object, then either uncompress or memcpy, or
- read just the header, then either read uncompressed data or read
>   compressed data and uncompress.

Re: [PATCH] LogFS take three

2007-05-16 Thread Jamie Lokier
Artem Bityutskiy wrote:
> On Wed, 2007-05-16 at 12:34 +0100, Jamie Lokier wrote:
> > Jörn Engel wrote:
> > > On Wed, 16 May 2007 12:54:14 +0800, David Woodhouse wrote:
> > > > Personally I'd just go for 'JFFS3'. After all, it has a better claim to
> > > > the name than either of its predecessors :)
> > > 
> > > Did you ever see akpm's facial expression when he tried to pronounce
> > > "JFFS2"?  ;)
> > 
> > JFFS3 is a good, meaningful name to anyone familiar with JFFS2.
> > 
> > But if akpm can't pronounce it, how about FFFS for faster flash
> > filesystem ;-)
> 
> The problem is that JFFS2 will always be faster in terms of I/O speed
> anyway, just because it does not have to maintain on-flash indexing
> data structures. But yes, it is slow in mount and in building big
> inodes, so the "faster" is confusing.

Is LogFS really slower than JFFS2 in practice?

I would have guessed reads to be a similar speed, tree updates to be a
similar speed  to journal  updates for sustained  non-fsyncing writes,
and the difference unimportant for tiny individual commits whose index
updates are not merged with any other.  I've not thought about it much
though.

-- Jamie


Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc

2007-05-16 Thread Dave Kleikamp
On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote:
> On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote:

> > Following changes were made to the previous version:
> >  1) Added description before sys_fallocate() definition.
> >  2) Return EINVAL for len<=0 (with the new draft that Ulrich pointed to,
> > posix_fallocate should return EINVAL for len <= 0).
> >  3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE.
> >  4) Do not return ENODEV for dirs (let individual file systems decide if
> > they want to support preallocation to directories or not).
> >  5) Check for wrap through zero.
> >  6) Update c/mtime if fallocate() succeeds.
> 
> Please don't make this always happen. c/mtime updates should be dependent
> on the mode being used and whether there is visible change to the file. If no
> userspace visible changes to the file occurred, then timestamps should not
> be changed.

i_blocks will be updated, so it seems reasonable to update ctime.  mtime
shouldn't be changed, though, since the contents of the file will be
unchanged.

> e.g. FA_ALLOCATE that changes file size requires same semantics of ftruncate()
> extending the file, otherwise no change in timestamps should occur.
> 
> Cheers,
> 
> Dave.
-- 
David Kleikamp
IBM Linux Technology Center



Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

2007-05-16 Thread David Woodhouse
On Wed, 2007-05-16 at 11:19 +0100, David Howells wrote:
> The start and end points passed to block_prepare_write() delimit the region of
> the page that is going to be modified.  This means that prepare_write()
> doesn't need to fill it in if the page is not up to date. 

Really? Is it _really_ going to be modified? Even if the pointer
userspace gave to write() is bogus, and is going to fault half-way
through the copy_from_user()?

-- 
dwmw2



Re: [PATCH] AFS: Implement shared-writable mmap [try #2]

2007-05-16 Thread Nick Piggin

David Howells wrote:

Implement shared-writable mmap for AFS.

The key with which to access the file is obtained from the VMA at the point
where the PTE is made writable by the page_mkwrite() VMA op and cached in the
affected page.

If there's an outstanding write on the page made with a different key, then
page_mkwrite() will flush it before attaching a record of the new key.

[try #2] Only flush the page if the page is still part of the mapping (truncate
may have discarded it).


Couple more issues...


Signed-off-by: David Howells <[EMAIL PROTECTED]>



+/*
+ * notification that a previously read-only page is about to become writable
+ * - if it returns an error, the caller will deliver a bus error signal
+ *
+ * we use this to make a record of the key with which the writeback should be
+ * performed and to flush any outstanding writes made with a different key
+ *
+ * the key to be used is attached to the file pinned by the VMA
+ */
+int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
+{
+   struct afs_vnode *vnode = AFS_FS_I(vma->vm_file->f_mapping->host);
+   struct key *key = vma->vm_file->private_data;
+   int ret;
+
+   _enter("{{%x:%u},%x},{%lx}",
+  vnode->fid.vid, vnode->fid.vnode, key_serial(key), page->index);
+
+   lock_page(page);
+   if (page->mapping == vma->vm_file->f_mapping)
+   ret = afs_prepare_write(vma->vm_file, page, 0, 0);


I would strongly suggest you used (0, PAGE_CACHE_SIZE) for the range, and
have your nopage function DTRT.

Minor issue: you can just check for `if (!page->mapping)` for truncation,
which is the usual signal to tell the reader you're checking for truncate.
Then you can remove the comment...



+   else
+   ret = 0; /* seems truncate interfered - let the caller deal
+ * with it (presumably the PTE changed too) */


Rather than add this (not always correct) comment about the VM workings, I'd
just add a directive in the page_mkwrite API documentation that the filesystem
is to return 0 if the page has been truncated.



+   unlock_page(page);
+
+   _leave(" = %d", ret);
+   return ret;
+}


--
SUSE Labs, Novell Inc.


Re: [PATCH] LogFS take three

2007-05-16 Thread David Woodhouse
On Tue, 2007-05-15 at 13:37 -0700, Andrew Morton wrote:
> But it includes an MTD header file.
> 
> Can this code be tested by people who don't have MTD hardware?  We used to
> ahve a fake-mtd-on-a-blockdev thing, whcih was in a state of some
> disrepair.  Maybe it got repaired.  Or removed.  I can't immediately locate
> it...

There's block2mtd, as Jörn pointed out. There's also 'nandsim' and
'mtdram', which will vmalloc some space and use that to emulate NAND or
NOR flash, respectively.

> It's strange and a bit regrettable that an fs would have dependency on MTD,
> really.

Why? Other file systems have dependencies on BLOCK or on NET. It seems
entirely normal to me.

-- 
dwmw2



Re: [PATCH] LogFS take three

2007-05-16 Thread Pekka Enberg

On 5/16/07, Jamie Lokier <[EMAIL PROTECTED]> wrote:

Given that the filesystem is still 'experimental', I'd concentrate on
getting it stable before worrying about immutable and xattrs unless
they are easy.


We will run into trouble if the on-disk format is not flexible enough
to accommodate xattrs (think reiser3 here). So I'd worry about it
before merging to mainline.


Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

2007-05-16 Thread Nick Piggin

David Howells wrote:

David Chinner <[EMAIL PROTECTED]> wrote:



+   ret = block_prepare_write(page, 0, end, get_block);



As I understand the way prepare_write() works, this is incorrect.


I think it is actually OK.



The start and end points passed to block_prepare_write() delimit the region of
the page that is going to be modified.  This means that prepare_write()
doesn't need to fill it in if the page is not up to date.  It does, however,
need to fill in the region before (if present) and the region after (if
present).  Look at it like this:

        +---+
        |   |
        |   |   <-- Filled in by prepare_write()
        |   |
to->    |:::|
        |   |
        |   |   <-- Filled in by caller
        |   |
offset->|:::|
        |   |
        |   |   <-- Filled in by prepare_write()
        |   |
page->  +---+

However, page_mkwrite() isn't told which bit of the page is going to be
written to.  This means it has to ask prepare_write() to make sure the whole
page is filled in.  In other words, offset and to must be equal (in AFS I set
them both to 0).


Dave is using prepare_write here to ensure blocks are allocated in the
given range. The filesystem's ->nopage function must ensure it is uptodate
before allowing it to be mapped.



With what you've got, if, say, 'offset' is 0 and 'to' is calculated at
PAGE_SIZE, then if the page is not up to date for any reason, then none of the
page will be updated before the page is written on by the faulting code.


Consider that the code currently works OK today _without_ page_mkwrite.
page_mkwrite is being added to do block allocation / reservation.



You probably get away with this in a blockdev-based filesystem because it's
unlikely that the page will cease to be up to date.

However, if someone adds a syscall to punch holes in files, this may change...


We have one. Strangely enough, it is done with madvise(MADV_REMOVE).

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LogFS take three

2007-05-16 Thread Jörn Engel
On Wed, 16 May 2007 12:34:34 +0100, Jamie Lokier wrote:
> 
> But if akpm can't pronounce it, how about FFFS for faster flash
> filesystem ;-)

How many of you have worked for IBM before?  Vowels are not evil. ;)

Grouping four or more consonants to name anything will cause similar
expressions on people's faces.  Numbers don't help much either.

Ext2 is a great name, because "ext" actually is a pronounceable syllable.
MinixFS, ChunkFS, TileFS are great too.  XFS and JFS are ok, at least
they only have three consonants.  But FFS exists, so I'd rather go for a
syllable.

Jörn

-- 
Joern's library part 5:
http://www.faqs.org/faqs/compression-faq/part2/section-9.html


Re: [PATCH] LogFS take three

2007-05-16 Thread Jamie Lokier
Albert Cahalan wrote:
> Please don't forget the immutable bit. ("man lsattr")
> Having both, BSD-style, would be even better.
> The immutable bit is important for working around
> software bugs and "features" that damage files.
> 
> I also can't find xattr support.

Imho,

Given that the filesystem is still 'experimental', I'd concentrate on
getting it stable before worrying about immutable and xattrs unless
they are easy.  (Immutable is easy).

In 13 years of using Linux in all sorts of ways I've never yet to use
either feature.  They would be good to have, but stability in a
filesystem is much more useful.

I'm biased of course: if LogFS were stable and well tested, I'd be
using it right now in my embedded thingy - and that doesn't even
bother with uids :-)

-- Jamie


Re: [PATCH] LogFS take three

2007-05-16 Thread Artem Bityutskiy
On Wed, 2007-05-16 at 12:34 +0100, Jamie Lokier wrote:
> Jörn Engel wrote:
> > On Wed, 16 May 2007 12:54:14 +0800, David Woodhouse wrote:
> > > Personally I'd just go for 'JFFS3'. After all, it has a better claim to
> > > the name than either of its predecessors :)
> > 
> > Did you ever see akpm's facial expression when he tried to pronounce
> > "JFFS2"?  ;)
> 
> JFFS3 is a good, meaningful name to anyone familiar with JFFS2.
> 
> But if akpm can't pronounce it, how about FFFS for faster flash
> filesystem ;-)

The problem is that JFFS2 will always be faster in terms of I/O speed
anyway, just because it does not have to maintain on-flash indexing
data structures. But yes, it is slow in mount and in building big
inodes, so the "faster" is confusing.

-- 
Best regards,
Artem Bityutskiy (Битюцкий Артём)



Re: [PATCH] LogFS take three

2007-05-16 Thread Jörn Engel
On Tue, 15 May 2007 19:37:36 -0700, Roland Dreier wrote:
> 
> There are rather a lot of of FIXME comments, including scary stuff like
> 
>  > +  /*
>  > +   * FIXME: this cannot be right but it does "fix" a bug of i_count
>  > +   * dropping too low.  Needs more thought.
>  > +   */
>  > +  atomic_inc(&old_dentry->d_inode->i_count);

Any thoughts on this would be appreciated.

> and
> 
>  > +int __logfs_write_inode(struct inode *inode)
>  > +{
>  > +  /*
>  > +   * FIXME: Those two inodes are 512 bytes in total.  Not good to
>  > +   * have on the stack.  Possibly the best solution would be to bite
>  > +   * the bullet and do another format change before release and
>  > +   * shrink the inodes.
>  > +   */
>  > +  struct logfs_disk_inode old, new;
> 
> are you going to change the format?  or fix this some other way?

I would love to put my inodes on a diet.  It is just a matter of time
and priorities.  To me the 512 bytes on the stack are unfortunate, but
not a show stopper.  Crash behaviour is, so that has priority.

> I think a sweep through the code searching for FIXME and at least
> rewriting all such comments to look like stuff that can be deferred
> would be warranted ;)

Are you asking me to hide known problems under a rug? ;)

Will see if I can easily fix some of these.  In particular the
eat-your-data FIXME that can cause LogFS to not survive a crash.

Jörn

-- 
Correctness comes second.
Features come third.
Performance comes last.
Maintainability is easily forgotten.


Re: [PATCH] LogFS take three

2007-05-16 Thread Jamie Lokier
Jörn Engel wrote:
> On Wed, 16 May 2007 12:54:14 +0800, David Woodhouse wrote:
> > Personally I'd just go for 'JFFS3'. After all, it has a better claim to
> > the name than either of its predecessors :)
> 
> Did you ever see akpm's facial expression when he tried to pronounce
> "JFFS2"?  ;)

JFFS3 is a good, meaningful name to anyone familiar with JFFS2.

But if akpm can't pronounce it, how about FFFS for faster flash
filesystem ;-)

-- Jamie


Re: [PATCH] LogFS take three

2007-05-16 Thread Jörn Engel
On Wed, 16 May 2007 12:54:14 +0800, David Woodhouse wrote:
> 
> Personally I'd just go for 'JFFS3'. After all, it has a better claim to
> the name than either of its predecessors :)

Did you ever see akpm's facial expression when he tried to pronounce
"JFFS2"?  ;)

Jörn

-- 
Fancy algorithms are slow when n is small, and n is usually small.
Fancy algorithms have big constants. Until you know that n is
frequently going to be big, don't get fancy.
-- Rob Pike


Re: [PATCH] LogFS take three

2007-05-16 Thread Jörn Engel
On Wed, 16 May 2007 07:22:54 +0200, Willy Tarreau wrote:
> 
> On hard disks, yes, but as you suggested, there are lots of other flash
> devices interfaced as block devices. CompactFlash comes to mind, USB
> keys too. And on these ones, the most important is to reduce the number
> of writes and to support large sizes. I already see LogFS as an interesting
alternative to JFFS2 on such devices, even though it does not (yet?) support
> compression.

It does support compression now.  That part was finished, erm, about two
weeks ago.

Logfs on Smartmedia format (I assume most CompactFlash, USB, SD and MMC
media use that) will currently cause a nasty interaction.  Unlike JFFS2,
logfs will alternately write to several eraseblocks.  On real flash,
alternating between half-finished blocks is free.

The Smartmedia standard requires finishing the last block before moving
to the next.  So the logfs write pattern will cause it to add an
erase/write overhead of - depending on erase size and compression - 4x
to 512x.  Pretty bad.

If Smartmedia were really stupid, there would be no overhead at all.  Any
manufacturers listening should seriously think about allowing raw flash
access to their devices. ;)

Jörn

-- 
To recognize individual spam features you have to try to get into the
mind of the spammer, and frankly I want to spend as little time inside
the minds of spammers as possible.
-- Paul Graham


Re: [PATCH] LogFS take three

2007-05-16 Thread Pekka J Enberg
Hi Joern,

> +#define LOGFS_BUG(sb) do {   \
> + struct super_block *__sb = sb;  \
> + logfs_crash_dump(__sb); \
> + BUG();  \
> +} while(0)

Note that BUG() can be a no-op so dumping something on disk might not make 
sense there. This seems useful, but you probably need to make this bit 
more generic so that using BUG() proper in your filesystem code does the 
right thing. Inventing your own wrapper should be avoided.

> +static inline struct logfs_super *LOGFS_SUPER(struct super_block *sb)
> +{
> + return sb->s_fs_info;
> +}
> +
> +static inline struct logfs_inode *LOGFS_INODE(struct inode *inode)
> +{
> + return container_of(inode, struct logfs_inode, vfs_inode);
> +}

No need for upper case in function names.

> +int logfs_memcpy(void *in, void *out, size_t inlen, size_t outlen)
> +{
> + if (outlen < inlen)
> + return -EIO;
> + memcpy(out, in, inlen);
> + return inlen;
> +}

Please drop this wrapper function. It's better to open-code the error
handling in the callers (there are total of three of them).

> +/* FIXME: combine with per-sb journal variant */
> +static unsigned char compressor_buf[LOGFS_MAX_OBJECTSIZE];
> +static DEFINE_MUTEX(compr_mutex);

This looks fishy. All reads and writes are serialized by compr_mutex 
because they share a scratch buffer for compression and uncompression?

> +/* FIXME: all this mess should get replaced by using the page cache */
> +static void fixup_from_wbuf(struct super_block *sb, struct logfs_area *area,
> + void *read, u64 ofs, size_t readlen)
> +{

Indeed. And I think you're getting some more trouble because of this... 

> +int logfs_segment_read(struct super_block *sb, void *buf, u64 ofs)
> +{
> + struct logfs_object_header *h;
> + u16 len;
> + int err, bs = sb->s_blocksize;
> +
> + mutex_lock(&compr_mutex);
> + err = wbuf_read(sb, ofs, bs + LOGFS_HEADERSIZE, compressor_buf);
> + if (err)
> + goto out;
> + h = (void*)compressor_buf;
> + len = be16_to_cpu(h->len);
> +
> + switch (h->compr) {
> + case COMPR_NONE:
> + logfs_memcpy(compressor_buf + LOGFS_HEADERSIZE, buf, bs, bs);
> + break;

Seems wasteful to first read the data in a scratch buffer and then 
memcpy() it immediately for the COMPR_NONE case. Any reason why we can't 
read a full struct page, for example, and simply use that if we don't need 
to uncompress anything?

> + case COMPR_ZLIB:
> + err = logfs_uncompress(compressor_buf + LOGFS_HEADERSIZE, buf,
> + len, bs);
> + BUG_ON(err);
> + break;

Not claiming to understand your on-disk format, but wouldn't it make more 
sense if we knew whether a given segment is compressed or not _before_ we 
actually read it?


Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2

2007-05-16 Thread David Howells
David Chinner <[EMAIL PROTECTED]> wrote:

> + ret = block_prepare_write(page, 0, end, get_block);

As I understand the way prepare_write() works, this is incorrect.

The start and end points passed to block_prepare_write() delimit the region of
the page that is going to be modified.  This means that prepare_write()
doesn't need to fill it in if the page is not up to date.  It does, however,
need to fill in the region before (if present) and the region after (if
present).  Look at it like this:

        +---+
        |   |
        |   |   <-- Filled in by prepare_write()
        |   |
to->    |:::|
        |   |
        |   |   <-- Filled in by caller
        |   |
offset->|:::|
        |   |
        |   |   <-- Filled in by prepare_write()
        |   |
page->  +---+

However, page_mkwrite() isn't told which bit of the page is going to be
written to.  This means it has to ask prepare_write() to make sure the whole
page is filled in.  In other words, offset and to must be equal (in AFS I set
them both to 0).

With what you've got, if, say, 'offset' is 0 and 'to' is calculated at
PAGE_SIZE, then if the page is not up to date for any reason, then none of the
page will be updated before the page is written on by the faulting code.

You probably get away with this in a blockdev-based filesystem because it's
unlikely that the page will cease to be up to date.

However, if someone adds a syscall to punch holes in files, this may change...

David


[PATCH] AFS: Implement shared-writable mmap [try #2]

2007-05-16 Thread David Howells
Implement shared-writable mmap for AFS.

The key with which to access the file is obtained from the VMA at the point
where the PTE is made writable by the page_mkwrite() VMA op and cached in the
affected page.

If there's an outstanding write on the page made with a different key, then
page_mkwrite() will flush it before attaching a record of the new key.

[try #2] Only flush the page if the page is still part of the mapping (truncate
may have discarded it).

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/afs/file.c |   22 --
 fs/afs/internal.h |1 +
 fs/afs/write.c|   32 
 3 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/fs/afs/file.c b/fs/afs/file.c
index 9c0e721..da2a18b 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -22,6 +22,7 @@ static int afs_readpage(struct file *file, struct page *page);
 static void afs_invalidatepage(struct page *page, unsigned long offset);
 static int afs_releasepage(struct page *page, gfp_t gfp_flags);
 static int afs_launder_page(struct page *page);
+static int afs_mmap(struct file *file, struct vm_area_struct *vma);
 
 const struct file_operations afs_file_operations = {
.open   = afs_open,
@@ -31,7 +32,7 @@ const struct file_operations afs_file_operations = {
.write  = do_sync_write,
.aio_read   = generic_file_aio_read,
.aio_write  = afs_file_write,
-   .mmap   = generic_file_readonly_mmap,
+   .mmap   = afs_mmap,
.sendfile   = generic_file_sendfile,
.fsync  = afs_fsync,
 };
@@ -54,6 +55,12 @@ const struct address_space_operations afs_fs_aops = {
.writepages = afs_writepages,
 };
 
+static struct vm_operations_struct afs_file_vm_ops = {
+   .nopage = filemap_nopage,
+   .populate   = filemap_populate,
+   .page_mkwrite   = afs_page_mkwrite,
+};
+
 /*
  * open an AFS file or directory and attach a key to it
  */
@@ -266,7 +273,6 @@ static void afs_invalidatepage(struct page *page, unsigned long offset)
 static int afs_launder_page(struct page *page)
 {
_enter("{%lu}", page->index);
-
return 0;
 }
 
@@ -293,3 +299,15 @@ static int afs_releasepage(struct page *page, gfp_t gfp_flags)
_leave(" = 0");
return 0;
 }
+
+/*
+ * memory map part of an AFS file
+ */
+static int afs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+   _enter("");
+
+   file_accessed(file);
+   vma->vm_ops = &afs_file_vm_ops;
+   return 0;
+}
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 4953ba5..bf4ef07 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -709,6 +709,7 @@ extern ssize_t afs_file_write(struct kiocb *, const struct iovec *,
  unsigned long, loff_t);
 extern int afs_writeback_all(struct afs_vnode *);
 extern int afs_fsync(struct file *, struct dentry *, int);
+extern int afs_page_mkwrite(struct vm_area_struct *, struct page *);
 
 
 /*/
diff --git a/fs/afs/write.c b/fs/afs/write.c
index a03b92a..208fbc0 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -174,6 +174,8 @@ static int afs_prepare_page(struct afs_vnode *vnode, struct page *page,
  * prepare to perform part of a write to a page
  * - the caller holds the page locked, preventing it from being written out or
  *   modified by anyone else
+ * - may be called from afs_page_mkwrite() to set up a page for modification
+ *   through shared-writable mmap
  */
 int afs_prepare_write(struct file *file, struct page *page,
  unsigned offset, unsigned to)
@@ -825,3 +827,33 @@ int afs_fsync(struct file *file, struct dentry *dentry, int datasync)
_leave(" = %d", ret);
return ret;
 }
+
+/*
+ * notification that a previously read-only page is about to become writable
+ * - if it returns an error, the caller will deliver a bus error signal
+ *
+ * we use this to make a record of the key with which the writeback should be
+ * performed and to flush any outstanding writes made with a different key
+ *
+ * the key to be used is attached to the file pinned by the VMA
+ */
+int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
+{
+   struct afs_vnode *vnode = AFS_FS_I(vma->vm_file->f_mapping->host);
+   struct key *key = vma->vm_file->private_data;
+   int ret;
+
+   _enter("{{%x:%u},%x},{%lx}",
+  vnode->fid.vid, vnode->fid.vnode, key_serial(key), page->index);
+
+   lock_page(page);
+   if (page->mapping == vma->vm_file->f_mapping)
+   ret = afs_prepare_write(vma->vm_file, page, 0, 0);
+   else
+   ret = 0; /* seems truncate interfered - let the caller deal
+ * with it (presumably the PTE changed too) */
+   unlock_page(page);
+
+   _leave(" = %d", ret);
+   return ret;
+}
