Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp
Mingming Cao wrote:
> On Tue, 2007-07-03 at 15:58 +0530, Kalpak Shah wrote:
> > On Sun, 2007-07-01 at 03:36 -0400, Mingming Cao wrote:
> > > +
> > > +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode)                  \
> > > +do {                                                                   \
> > > +    (inode)->xtime.tv_sec = le32_to_cpu((raw_inode)->xtime);           \
> > > +    if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \
> > > +        ext4_decode_extra_time(&(inode)->xtime,                        \
> > > +                               raw_inode->xtime ## _extra);            \
> > > +} while (0)
> > > +
> > > +#define EXT4_EINODE_GET_XTIME(xtime, einode, raw_inode)                \
> > > +do {                                                                   \
> > > +    if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime))                  \
> > > +        (einode)->xtime.tv_sec = le32_to_cpu((raw_inode)->xtime);      \
> > > +    if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))        \
> > > +        ext4_decode_extra_time(&(einode)->xtime,                       \
> > > +                               raw_inode->xtime ## _extra);            \
> > > +} while (0)
> > > +
> >
> > This nanosecond patch seems to be missing the fix below which is required
> > for http://bugzilla.kernel.org/show_bug.cgi?id=5079
> >
> > If the timestamp is set to before epoch i.e. a negative timestamp then the
> > file may have its date set into the future on 64-bit systems. So when the
> > timestamp is read it must be cast as signed.
>
> Missed this one. Thanks. Will update ext4 patch queue tonight with this fix.

IIRC in the conference call it was decided not to apply this patch.
Andreas may be able to update better.

-aneesh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp
On Tue, 2007-07-03 at 15:58 +0530, Kalpak Shah wrote:
> On Sun, 2007-07-01 at 03:36 -0400, Mingming Cao wrote:
> > +
> > +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode)                  \
> > +do {                                                                   \
> > +    (inode)->xtime.tv_sec = le32_to_cpu((raw_inode)->xtime);           \
> > +    if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \
> > +        ext4_decode_extra_time(&(inode)->xtime,                        \
> > +                               raw_inode->xtime ## _extra);            \
> > +} while (0)
> > +
> > +#define EXT4_EINODE_GET_XTIME(xtime, einode, raw_inode)                \
> > +do {                                                                   \
> > +    if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime))                  \
> > +        (einode)->xtime.tv_sec = le32_to_cpu((raw_inode)->xtime);      \
> > +    if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))        \
> > +        ext4_decode_extra_time(&(einode)->xtime,                       \
> > +                               raw_inode->xtime ## _extra);            \
> > +} while (0)
> > +
>
> This nanosecond patch seems to be missing the fix below which is required
> for http://bugzilla.kernel.org/show_bug.cgi?id=5079
>
> If the timestamp is set to before epoch i.e. a negative timestamp then the
> file may have its date set into the future on 64-bit systems. So when the
> timestamp is read it must be cast as signed.

Missed this one. Thanks. Will update ext4 patch queue tonight with this fix.

Mingming
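The sign problem Kalpak describes can be reproduced in user space. What follows is only a minimal sketch, not the kernel code: the helper names are made up for illustration, and uint32_t stands in for the value that le32_to_cpu() yields from the 32-bit on-disk field. It assumes the usual two's-complement result when a uint32_t is converted to int32_t.

```c
#include <stdint.h>

/*
 * Hypothetical helpers, for illustration only.  "raw" models the unsigned
 * 32-bit seconds value read from the on-disk inode; the return value
 * models a 64-bit tv_sec.
 */

/* Without a signed cast the value is zero-extended: a pre-epoch
 * timestamp such as -1 (stored as 0xFFFFFFFF) becomes a large positive
 * number, i.e. a date in the future. */
static int64_t seconds_zero_extended(uint32_t raw)
{
    return (int64_t)raw;
}

/* Casting to a signed 32-bit type first sign-extends the value, so a
 * pre-epoch timestamp stays negative.  This relies on the usual
 * two's-complement conversion from uint32_t to int32_t. */
static int64_t seconds_sign_extended(uint32_t raw)
{
    return (int64_t)(int32_t)raw;
}
```

For raw = 0xFFFFFFFF the first helper gives 4294967295 seconds after the epoch and the second gives -1, which is the behaviour the bugzilla entry asks for.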
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
Amit K. Arora wrote:

FA_FL_NO_MTIME 0x10 /* keep same mtime (default change on size, data change) */
FA_FL_NO_CTIME 0x20 /* keep same ctime (default change on size, data change) */

NACK to these as well. If i_size changes c/mtime need updates, if the size
doesn't change they don't. No need to add more flags for this.

This requirement was from the point of view of HSM applications. Hope
you saw Andreas' previous post and are keeping that in mind.

We use this capability in XFS at the moment.
I think this is mainly for DMF (HSM) but is done via the xfs handle
interface (xfs_open_by_handle) AFAICT.
This sets up a set of invisible operations (xfs_invis_file_operations).
xfs_file_ioctl_invis goes on to set IO_INVIS, which goes on to set ATTR_DMI,
which is then tested in xfs_change_file_space() (which handles
XFS_IOC_RESVSP & friends) to decide whether
xfs_ichgtime(ip, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG) is called or not.

--Tim
Re: vm/fs meetup in september?
I'd like to reference a paper titled "FASS : A Flash-Aware Swap System".
(http://kernel.kaist.ac.kr/~jinsoo/publication/iwssps05.pdf)

The paper describes a technique that uses NAND flash as a swap device
without FTL (Flash Translation Layer) or filesystem.
It is not related to XIP, however.

On 7/3/07, Jörn Engel <[EMAIL PROTECTED]> wrote:
> On Mon, 2 July 2007 17:46:40 -0700, Jared Hulbert wrote:
> >
> > Right, the solution to swap problem is identical to the rw XIP
> > filesystem problem. Jörn, that's why you're the self-appointed
> > subject matter expert!
>
> All right. I'll try to make an important face whenever the subject
> comes up. Nick, do you have a problem if LogFS occupies two brainslots
> at the meeting?
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Jul 03, 2007 18:15 -0400, J. Bruce Fields wrote:
> How will nfsd tell whether it can rely on a given filesystem's
> i_version, or whether it should fall back on ctime?

Good question.

> > As to the performance concerns raised before: the inode version counter
> > update (at least for ext4) is done inside ext4_mark_inode_dirty(), so
> > there is no extra IO work to store this counter to disk.
>
> So what's the motivation for the "noversion" mount option?

Lustre needs to be able to control the version number directly (version
number needs to be ordered between all inodes, is set by Lustre to be a
transaction number). Instead of trying to incorporate this unused code
into ext4 we just turn off the ext4 version code and let Lustre control
this directly. It may even be that NFSv4 will need to control the
version numbers itself...

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
[PATCH] dio: remove bogus refcounting BUG_ON
Linus, Andrew, please apply the bug fix patch at the end of this reply
for .22.

> >>One of our perf. team ran into this while doing some runs.
> >>I didn't see anything obvious - it looks like we converted
> >>async IO to synchronous one. I didn't spend much time digging
> >>around.

OK, I think this BUG_ON() is just broken. I wasn't able to find any
obvious bugs from reading the code which would cause the BUG_ON() to
fire. If it's reproducible I'd love to hear what the recipe is.

I did notice that this BUG_ON() is evaluating dio after having dropped
its ref :/. So it's not completely absurd to fear that it's a race with
the dio's memory being reused, but that'd be a pretty tight race.

Let's remove this stupid BUG_ON and see if that test box still has
trouble. It might just hit the valid BUG_ON a few lines down, but this
unsafe BUG_ON needs to go.

---

dio: remove bogus refcounting BUG_ON

Badari Pulavarty reported a case of this BUG_ON triggering during
testing. It's completely bogus and should be removed.

It's trying to notice if we left references to the dio hanging around in
the sync case. They should have been dropped as IO completed while this
path was in dio_await_completion(). This condition will also be
checked, via some twisty logic, by the BUG_ON(ret != -EIOCBQUEUED) a few
lines lower. So, to start, this BUG_ON() is redundant.

More fatally, it's dereferencing dio-> after having dropped its
reference. It's only safe to dereference the dio after releasing the
lock if the final reference was just dropped. Another CPU might free
the dio in bio completion and reuse the memory after this path drops the
dio lock but before the BUG_ON() is evaluated.

This patch passed aio+dio regression unit tests and aio-stress on ext3.

Signed-off-by: Zach Brown <[EMAIL PROTECTED]>
Cc: Badari Pulavarty <[EMAIL PROTECTED]>

diff -r 509ce354ae1b fs/direct-io.c
--- a/fs/direct-io.c    Sun Jul 01 22:00:49 2007 +
+++ b/fs/direct-io.c    Tue Jul 03 14:56:41 2007 -0700
@@ -1106,7 +1106,7 @@ direct_io_worker(int rw, struct kiocb *i
     spin_lock_irqsave(&dio->bio_lock, flags);
     ret2 = --dio->refcount;
     spin_unlock_irqrestore(&dio->bio_lock, flags);
-    BUG_ON(!dio->is_async && ret2 != 0);
+
     if (ret2 == 0) {
         ret = dio_complete(dio, offset, ret);
         kfree(dio);
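The rule in the changelog (only touch the object after dropping a reference if that reference was the last one) can be shown in a small user-space sketch. This is not the fs/direct-io.c code: struct obj, obj_new() and obj_put() are hypothetical names, and a C11 atomic counter stands in for the refcount that the kernel code decrements under dio->bio_lock.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct dio; only the fields the sketch needs. */
struct obj {
    atomic_int refcount;
    int is_async;
};

static struct obj *obj_new(int refs, int is_async)
{
    struct obj *o = malloc(sizeof(*o));

    if (!o)
        return NULL;
    atomic_init(&o->refcount, refs);
    o->is_async = is_async;
    return o;
}

/*
 * Drop one reference.  Returns 1 if this call dropped the final
 * reference (and freed the object), 0 otherwise.
 */
static int obj_put(struct obj *o)
{
    /* atomic_fetch_sub() returns the old value; subtract 1 for the new one. */
    int remaining = atomic_fetch_sub(&o->refcount, 1) - 1;

    /*
     * From here on, another holder may drop the last reference and free
     * the object at any moment.  Reading o->is_async now, as the removed
     * BUG_ON did with dio->is_async, would be a potential use-after-free
     * unless remaining == 0.  Only the local copy "remaining" is safe.
     */
    if (remaining == 0) {
        free(o);
        return 1;
    }
    return 0;
}
```

With two references, the first obj_put() returns 0 and must not look at the object again; the second returns 1 and frees it.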
Re: [PATCH 1/6] locks: share more common lease code
On Sat, Jun 30, 2007 at 10:20:13AM +0100, Christoph Hellwig wrote:
> On Fri, Jun 29, 2007 at 03:21:25PM -0400, J. Bruce Fields wrote:
> > From: J. Bruce Fields <[EMAIL PROTECTED]>
> >
> > Share more code between setlease (used by nfsd) and fcntl.
> >
> > Also some minor cleanup.
>
> Looks good. Fine for mainline just after 2.6.23 opens.

Thanks. (And, by the way, would it be helpful for me to translate this
kind of statement into an "acked-by: Christoph..." on the eventual
patch?)

--b.
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Mon, Jul 02, 2007 at 10:58:33AM -0400, Mingming Cao wrote:
> Trond or Bruce, can you please review this patch series and ack if you
> agree?

Thanks, looks like what we need!

How will nfsd tell whether it can rely on a given filesystem's
i_version, or whether it should fall back on ctime?

> As to the performance concerns raised before: the inode version counter
> update (at least for ext4) is done inside ext4_mark_inode_dirty(), so
> there is no extra IO work to store this counter to disk.

So what's the motivation for the "noversion" mount option?

--b.
[ANNOUNCE] util-linux-ng 2.13-rc1
The first util-linux-ng 2.13 release candidate is available at

    ftp://ftp.kernel.org/pub/linux/utils/util-linux-ng/v2.13/

Thanks to all who help with util-linux resuscitation:

    H. Peter Anvin
    Ian Kent

and contribute to this project:

    Arkadiusz Miskiewicz    Matthias Koenig
    Cliff Wickman           Mike Frysinger
    David Brownell          Pádraig Brady
    David Miller            Radek Biba
    Jason Vas Dias          Ram Pai
    Kay Sievers             Stepan Kasal
    Luciano Chavez          Steve Grubb
    Marco d'Itri            Valerie Henson
    Martin Schlemmer

Feedback and bug reports, as always, are welcomed.

    Karel


Util-linux-ng 2.13 Release Notes

Release highlights:
-------------------

 - mount(8) doesn't include NFS client code anymore. Don't forget to
   install nfs-utils 1.1.0 or newer with /sbin/[u]mount.{nfs,nfs4}.

 - mount(8) doesn't include filesystem detection code anymore. You have
   to compile --with-fsprobe={blkid,volume_id}, and libblkid (e2fsprogs)
   or libvolume_id (udev >= v110) is required.

 - mount(8) supports new relatime, context, fscontext, and defcontext
   mount options.

 - losetup(8) supports command line option "-a" to list all used loop
   devices, "-s" to print a device name if "-f" and a file argument are
   present, and "-r" to create a read-only loop device.

 - fdisk(8) Sun label support has been improved. fdisk(8) is also able
   to warn about detected GPT (fdisk doesn't support GPT).

 - taskset(1) is independent of hardcoded NR_CPUS.

 - chrt(1) supports SCHED_BATCH scheduling policy.

 - The package build system is now based on autotools. The build system
   supports separate CFLAGS and LDFLAGS for suid programs (SUID_CFLAGS,
   SUID_LDFLAGS). For more details see the README file.

 - hwclock(8) supports command line option --rtc= and /dev/rtc0 device.
   --systohc functionality has been improved, and it doesn't cause a
   500ms inaccuracy each time it is used.

 - Audit system support (--with-audit) has been added to hwclock(8) and
   login(1).

 - SELinux support (--with-selinux) has been added to mkswap(8) and
   mount(8).

 - The setarch(8) upstream has been merged with util-linux-ng.
Fixed security issues:
----------------------

 - CVE-2007-0822 - mount(8) allows local users to trigger a NULL
   dereference and an application crash
 - CVE-2006-7108 - login(1) omits PAM account validation when auth is
   skipped

Changelog:
----------

agetty:
   - add 'O' escape code to display domain name
   - check gethostname() return value

blockdev:
   - add BLKFRAGET/BLKFRASET ioctls
   - cleanup usage() and update man page

build-sys:
   - add AC_GNU_SOURCE
   - add Automake option dist-bzip2
   - add missing files
   - add SUID_CFLAGS
   - add SUID_LDFLAGS
   - add support for audit
   - amend .gitignore
   - call automake after autoconf
   - cleanup architecture conditionals
   - cleanup sys-utils/ rdev symlinks
   - configure.am selinux support cleanup
   - declare SUID_CFLAGS and SUID_LDFLAGS as precious
   - do not build convenience libraries in lib/
   - do not kick off AM_CFLAGS by SUID_CFLAGS
   - do not play with DEFS, use AM_CPPFLAGS
   - do not set with_foo twice
   - do not use internal Autoconf variables
   - do not use wildcards in EXTRA_DIST
   - factor out common parts from mount/Makefile.am
   - fix HAVE_NCURSES
   - fix ifdef ENABLE_WIDECHAR usage
   - fix linking when ncurses is built with --with-termlib=tinfo
   - fix README filenames and add missing files to EXTRA_DISTs
   - fix the example configure call in README
   - fix the final message of autogen.sh
   - in configure.ac, change "po" -> "$srcdir/po"
   - in the clean targets use "find ... | xargs rm -f"
   - let configure instantiate the misc-utils/*.pl scripts
   - make the getopt example directory relative to datadir
   - merge adjacent AC_CONFIG_HEADERS and AC_CONFIG_FUNCS calls
   - minor fixes in configure.in
   - mount/Makefile.am tiny cleanup
   - mount/Makefile.am tiny cleanup II
   - move -D flags to *_CPPFLAGS
   - move the optimization flags to AM_CFLAGS
   - --prefix defaults to /usr
   - remove aclocal.m4 from SCM
   - remove AC_PROG_RANLIB
   - remove config.h.in from VCS
   - remove config/include-Makefile.am from EXTRA_DIST
   - remove DEFAULT_INCLUDES workaround
   - remove -fomit-frame-pointer
   - remove generated autotools stuff from git
   - remove po/Makevars.template from EXTRA_DIST
   - remove swapargs.h, move the tests to main configure.ac
   - rename to -ng, change maintainer name
   - replace AC_TRY_* by AC_*_IFELSE
   - s/AC_HELP_STRING/AS_HELP_STRING/
   - set DISTCHECK_CONFIGURE_FLAGS in top-level makefile
   - simplify "clean" in tests/Makefile.am
   - update po/POTFILES.in
   - use dist_example_DATA
   - use dist_no
Re: how do versioning filesystems take snapshot of opened files?
On Tuesday July 3, [EMAIL PROTECTED] wrote:
>
> Getting a snapshot that is useful with respect to application data
> requires help from the application.

Certainly.

> The app needs to be shutdown or
> paused prior to the snapshot and then started up again after the
> snapshot is taken.

Alternately, the app needs to be able to cope with unexpected system
shutdown (aka crash) and the same ability will allow it to cope with an
atomic snapshot. It may be able to recover more efficiently from an
expected shutdown, so being able to tell the app about an impending
snapshot is probably a good idea, but it should be advisory only.

NeilBrown
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Jul 03, 2007 10:24 -0400, Trond Myklebust wrote:
> It looks OK to me, but you might want to strip out the now redundant
> i_version updates in add_dirent_to_buf(), ext4_rmdir(), ext4_rename().

Agreed, and I thought we discussed that already on the ext4 list.

> I also have some questions about how this will affect the readdir code:
> unless I missed something, the filp->f_version is still unsigned long,
> so the comparisons and assignments in ext4_readdir()/ext4_dx_readdir()
> no longer make sense.

I don't see them as any worse than existing checks. For 32-bit systems
we only ever had a 32-bit in-memory version anyway so using only the
low 32 bits of i_version in f_version is no more racy than in the past.
For 64-bit systems using the full on-disk i_version is possible.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
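Andreas's point about comparing a 32-bit cached version against a 64-bit i_version can be illustrated outside the kernel. The sketch below is an assumption-laden model, not ext4's readdir code: dir_changed_since() is a hypothetical helper, and uint32_t plays the role of an unsigned long f_version on a 32-bit system.

```c
#include <stdint.h>

/*
 * Hypothetical model: cached_version is the 32-bit value remembered at
 * the previous readdir, i_version the current 64-bit inode version.
 * Only the low 32 bits of i_version take part in the comparison.
 */
static int dir_changed_since(uint32_t cached_version, uint64_t i_version)
{
    return cached_version != (uint32_t)i_version;
}
```

A change goes unnoticed only when the low 32 bits come back to the cached value, that is, after a multiple of 2^32 updates. That is the same exposure a purely 32-bit counter had before.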
Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update
On Jul 03, 2007 12:19 +0530, Aneesh Kumar K.V wrote:
> Mingming Cao wrote:
> >Index: linux-2.6.22-rc4/fs/ext4/super.c
> >===
> >--- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-13 17:19:11.0 -0700
> >+++ linux-2.6.22-rc4/fs/ext4/super.c 2007-06-13 17:24:45.0 -0700
> >@@ -2846,8 +2846,8 @@ out:
> >         i_size_write(inode, off+len-towrite);
> >         EXT4_I(inode)->i_disksize = inode->i_size;
> >     }
> >-    inode->i_version++;
> >     inode->i_mtime = inode->i_ctime = CURRENT_TIME;
> >+    inode->i_version = 1;
> >     ext4_mark_inode_dirty(handle, inode);
> >     mutex_unlock(&inode->i_mutex);
> >     return len - towrite;
> >
> Is this correct? Why do we set the quota file inode's version to 1
> during write?

Hmm, I thought we had previously fixed this?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Re: [EXT4 set 4][PATCH 2/5] i_version: Add hi 32 bit inode version on ext4 on-disk inode
On Sun, Jul 01, 2007 at 03:37:16AM -0400, Mingming Cao wrote:
> This patch adds a 32-bit i_version_hi field to ext4_inode, which can be used
> for 64-bit inode versions. This field will store the higher 32 bits of the
> version, while Jean Noel's patch has added support to store the lower 32-bits
> in osd1.linux1.l_i_version.
>

Sorry, I'm a little lost--where's that earlier patch, and exactly what
tree should this patch series apply to?

--b.

> Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
> Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]>
> Signed-off-by: Kalpak Shah <[EMAIL PROTECTED]>
> ---
> Index: linux-2.6.21/include/linux/ext4_fs.h
> ===
> --- linux-2.6.21.orig/include/linux/ext4_fs.h
> +++ linux-2.6.21/include/linux/ext4_fs.h
> @@ -342,6 +342,7 @@ struct ext4_inode {
>      __le32 i_atime_extra;  /* extra Access time (nsec << 2 | epoch) */
>      __le32 i_crtime;       /* File Creation time */
>      __le32 i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
> +    __le32 i_version_hi;   /* high 32 bits for 64-bit version */
>  };
>
>  #define i_size_high i_dir_acl
>
>
> ___
> NFSv4 mailing list
> [EMAIL PROTECTED]
> http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4
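The patch description says the 64-bit version is stored as two 32-bit halves. A minimal sketch of joining and splitting such a value follows. It is illustrative only: struct version_halves and both helpers are hypothetical names, and the halves are assumed to be in CPU byte order already (the on-disk fields are __le32, so real code would convert them first).

```c
#include <stdint.h>

/* Hypothetical in-memory model of the two on-disk halves, already
 * converted to CPU byte order. */
struct version_halves {
    uint32_t lo;   /* models osd1.linux1.l_i_version */
    uint32_t hi;   /* models the new i_version_hi field */
};

/* Build the 64-bit version from its two 32-bit halves. */
static uint64_t version_join(struct version_halves h)
{
    return ((uint64_t)h.hi << 32) | h.lo;
}

/* Split a 64-bit version into the halves that would be stored. */
static struct version_halves version_split(uint64_t v)
{
    struct version_halves h;

    h.lo = (uint32_t)(v & 0xFFFFFFFFu);
    h.hi = (uint32_t)(v >> 32);
    return h;
}
```

Note the cast to uint64_t before the shift in version_join(): shifting a 32-bit value left by 32 would be undefined.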
Re: how do versioning filesystems take snapshot of opened files?
On 7/3/07, Bryan Henderson <[EMAIL PROTECTED]> wrote:
> >>we want a open/close consistency in snapshots.
> >
> >This depends on the transaction engine in your filesystem. None of the
> >existing linux filesystems have a way to start a transaction when the
> >file opens and finish it when the file closes, or a way to roll back
> >individual operations that have happened inside a given transaction.
> >
> >It certainly could be done, but it would also introduce a great deal of
> >complexity to the FS.
>
> And I would be opposed as a matter of architecture to making open/close
> transactional. People often read more into open/close than is there, but
> open is just about gaining access and close is just about releasing
> resources. It isn't appropriate for close to _mean_ anything.
>
> There are filesystems that have transactions. They use separate start
> transaction / end transaction system calls (not POSIX).
>
> >> Pausing apps itself
> >> does not solve this problem, because a file could be already opened
> >> and in the middle of write.
>
> Just to be clear: we're saying "pause," but we mean "quiesce." I.e., tell
> the application to reach a point where it's not in the middle of anything
> and then tell you it's there. Indeed, whether you use open/close or some
> other kind of transaction, just pausing the application doesn't help. If
> you were to implement open/close transactions, the filesystem driver would
> just wait for the application to close and in the meantime block all new
> opens.

If we want to support open/close consistency, maybe we don't really
need the help from the application. For example, suppose the filesystem
is implemented this way: when a file is opened for write, we copy the
metadata and create a CoW bitmap to keep track of what has been changed.
Before writing any new data to the file, we copy the old data and then
write the new data. As such, when we take a snapshot and encounter the
opened file, we can save the old data instead of the new data, since the
old data is in a consistent state. Of course, new file opening should
also be handled this way.

The filesystem driver cannot wait for the application to close, I think.
If the application is snapshot aware, the wait time could be tolerable.
But if the application does not provide a way to process the quiesce
request, the wait could be infinite.

What do you think?

> --
> Bryan Henderson IBM Almaden Research Center
> San Jose CA Filesystems
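The copy-before-write scheme proposed above can be sketched as a toy in-memory model. Everything here is hypothetical and only illustrates the idea: the names, the fixed block count and block size, and keeping the saved blocks inside the same structure are all assumptions, and no existing filesystem is claimed to work this way.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NBLOCKS    8   /* toy file size in blocks (fits an 8-bit bitmap) */
#define BLOCK_SIZE 4   /* toy block size in bytes */

struct cow_file {
    char live[NBLOCKS][BLOCK_SIZE];   /* current contents, updated by writes */
    char saved[NBLOCKS][BLOCK_SIZE];  /* pre-open copies of changed blocks */
    uint8_t changed;                  /* CoW bitmap: bit i set once block i is saved */
};

/* Opening for write starts with an empty bitmap: nothing changed yet. */
static void cow_open_for_write(struct cow_file *f)
{
    f->changed = 0;
}

/* Before the first write to a block, keep a copy of its old contents. */
static void cow_write_block(struct cow_file *f, unsigned blk, const char *data)
{
    assert(blk < NBLOCKS);
    if (!(f->changed & (1u << blk))) {
        memcpy(f->saved[blk], f->live[blk], BLOCK_SIZE);
        f->changed |= (uint8_t)(1u << blk);
    }
    memcpy(f->live[blk], data, BLOCK_SIZE);
}

/* A snapshot taken while the file is open sees the pre-open contents. */
static const char *cow_snapshot_block(const struct cow_file *f, unsigned blk)
{
    assert(blk < NBLOCKS);
    return (f->changed & (1u << blk)) ? f->saved[blk] : f->live[blk];
}
```

The old copy is taken once per block, on the first write after open, so repeated writes to the same block pay the copy cost only once. The performance concern raised in this thread is that first-write copy.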
Re: how do versioning filesystems take snapshot of opened files?
>>we want a open/close consistency in snapshots.
>
>This depends on the transaction engine in your filesystem. None of the
>existing linux filesystems have a way to start a transaction when the
>file opens and finish it when the file closes, or a way to roll back
>individual operations that have happened inside a given transaction.
>
>It certainly could be done, but it would also introduce a great deal of
>complexity to the FS.

And I would be opposed as a matter of architecture to making open/close
transactional. People often read more into open/close than is there, but
open is just about gaining access and close is just about releasing
resources. It isn't appropriate for close to _mean_ anything.

There are filesystems that have transactions. They use separate start
transaction / end transaction system calls (not POSIX).

>> Pausing apps itself
>> does not solve this problem, because a file could be already opened
>> and in the middle of write.

Just to be clear: we're saying "pause," but we mean "quiesce." I.e., tell
the application to reach a point where it's not in the middle of anything
and then tell you it's there. Indeed, whether you use open/close or some
other kind of transaction, just pausing the application doesn't help. If
you were to implement open/close transactions, the filesystem driver would
just wait for the application to close and in the meantime block all new
opens.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
Re: how do versioning filesystems take snapshot of opened files?
On Tue, 3 Jul 2007 13:15:06 -0400 "Xin Zhao" <[EMAIL PROTECTED]> wrote:
> OK. From discussion above, can we reach a conclusion: from the
> application perspective, it is very hard, if not impossible, to take a
> transactional consistent snapshot without the help from applications?

You definitely need help from the applications. They define what a
transaction is.

>
> Chris, you mentioned that "Many different applications support some
> form of pausing in order to facilitate live backups. " Can you provide
> some examples? I mean popular apps.

Oracle, db2, mysql, ldap, postgres, sleepycat databases...just search
for online backup and most programs that involve something
transactional have a way to do it.

>
> Finally, if we back up a little bit, say, we don't care the
> transaction level consistency ( a transaction that open/close many
> times), but we want a open/close consistency in snapshots. That is, a
> file in a snapshot must be in a single version, but it can be in a
> middle state of a transaction. Can we do that? Pausing apps itself
> does not solve this problem, because a file could be already opened
> and in the middle of write. As I mentioned earlier, some systems can
> backup old data every time new data is written, but I suspect that
> this will impact the system performance quite a bit. Any idea about
> that?
>

This depends on the transaction engine in your filesystem. None of the
existing linux filesystems have a way to start a transaction when the
file opens and finish it when the file closes, or a way to roll back
individual operations that have happened inside a given transaction.

It certainly could be done, but it would also introduce a great deal of
complexity to the FS.

-chris
Re: how do versioning filesystems take snapshot of opened files?
OK. From the discussion above, can we reach a conclusion: from the
application perspective, it is very hard, if not impossible, to take a
transactionally consistent snapshot without help from the applications?

Chris, you mentioned that "Many different applications support some
form of pausing in order to facilitate live backups." Can you provide
some examples? I mean popular apps.

Finally, if we back up a little bit, say, we don't care about
transaction-level consistency (a transaction that opens/closes many
times), but we want open/close consistency in snapshots. That is, a
file in a snapshot must be in a single version, but it can be in a
middle state of a transaction. Can we do that? Pausing apps by itself
does not solve this problem, because a file could be already opened
and in the middle of a write. As I mentioned earlier, some systems can
back up old data every time new data is written, but I suspect that
this will impact the system performance quite a bit. Any idea about
that?

Thanks.

On 7/3/07, Chris Mason <[EMAIL PROTECTED]> wrote:
> On Tue, 3 Jul 2007 12:31:49 -0400
> "Xin Zhao" <[EMAIL PROTECTED]> wrote:
>
> > That's a good point!
> >
> > But this sounds hopeless to take a real consistent snapshot from app
> > perspective unless you shutdown the computer. Right?
>
> Many different applications support some form of pausing in order to
> facilitate live backups. You just have to keep it all in mind when
> designing the total backup solution.
>
> -chris
Re: how do versioning filesystems take snapshot of opened files?
On Tue, 3 Jul 2007 12:31:49 -0400 "Xin Zhao" <[EMAIL PROTECTED]> wrote:
> That's a good point!
>
> But this sounds hopeless to take a real consistent snapshot from app
> perspective unless you shutdown the computer. Right?

Many different applications support some form of pausing in order to
facilitate live backups. You just have to keep it all in mind when
designing the total backup solution.

-chris
Re: how do versioning filesystems take snapshot of opened files?
>But you look around, you may find that many
>systems claim that they can take snapshot without shutdown the
>application.

The claim is true, because you can just pause the application and not
shut it down. While this means you can't simply add snapshot capability
and solve your copy consistency problem (you need new applications too),
this is a huge advance over what there was before. Without snapshots,
you do have to shut down the application. Often for hours, and during
that time any service request to the application fails. With snapshots,
you simply pause the application for a few seconds. During that time it
delays processing of service requests, but every request ultimately goes
through, with the requester probably not noticing any difference.

If a system claims that snapshot function in the filesystem alone gets
you consistent backups, it's wrong.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
Re: [AppArmor 00/44] AppArmor security module overview
On Monday 02 July 2007 22:15, Christoph Hellwig wrote:
> AA on the other hand just fucks up VFS layering [...]

Oh come on, this claim clearly isn't justified. How on earth is passing
vfsmounts down the lsm hooks supposed to break vfs layering? We are not
proposing to pass additional information down to file systems. There is
no barrier between the vfs and lsm hooks for vfsmounts even today -- only
look at the inode_getattr hook; it already gets a vfsmount.

Without vfsmount we cannot tell where in the namespace we are, but that
information is essential for any kind of pathname based mechanism, AA or
not, and even for plain reporting.

LSM as a framework is supposed to allow different security mechanisms to
be plugged in. It isn't flexible enough for us right now, and so we are
proposing to extend it. What can be wrong about that?

Andreas
Re: how do versioning filesystems take snapshot of opened files?
That's a good point!

But this sounds hopeless to take a real consistent snapshot from app
perspective unless you shutdown the computer. Right?

Thanks.

On 7/3/07, Bryan Henderson <[EMAIL PROTECTED]> wrote:
> > Consistent state means many different things.
>
> And, significantly, open/close has nothing to do with any of them
> (assuming we're talking about the system calls). open/close does not
> identify a transaction; a program may open and close a file multiple
> times in the course of making a "single" update. Also, data and
> metadata updates remain buffered at the kernel level after a close.
> And don't forget that a single update may span multiple files.
>
> --
> Bryan Henderson IBM Almaden Research Center
> San Jose CA Filesystems
Re: how do versioning filesystems take snapshot of opened files?
> Consistent state means many different things.

And, significantly, open/close has nothing to do with any of them
(assuming we're talking about the system calls). open/close does not
identify a transaction; a program may open and close a file multiple
times in the course of making a "single" update. Also, data and metadata
updates remain buffered at the kernel level after a close. And don't
forget that a single update may span multiple files.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
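The point that updates stay buffered in the kernel after close() is why an application that wants its data on disk has to ask for that explicitly. A minimal sketch using the POSIX open(), write(), fsync() and close() calls follows. write_durably() is a hypothetical helper name; the sketch does not retry on EINTR and does not sync the containing directory, both of which a careful application might also need.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to path and return 0 only after
 * fsync() has reported success; return -1 on any error.
 */
static int write_durably(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t n = write(fd, buf, len);

        if (n <= 0) {           /* error, or no progress: give up */
            close(fd);
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }

    /* close() alone does not push the data to stable storage. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Even with this, the file is only durable, not consistent in the application's sense: as noted above, a single logical update may span several writes and several files.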
Re: how do versioning filesystems take snapshot of opened files?
Thanks for your reply.

Sounds like one has to stop or pause the applications to get a
consistent snapshot? But if you look around, you may find that many
systems claim that they can take a snapshot without shutting down the
application. Actually, I think it is impractical to require the app to
be shut down before taking a snapshot in a commercial environment.

Pausing apps is possible from the filesystem perspective. A simple
solution is that the filesystem stops writing any data to disk from the
point that the snapshotting command is received. But as we discussed
earlier, this is not sufficient to prevent a file from containing part
of the old data and part of the new data.

That's why I am so confused about how these systems can provide a
consistent snapshotting capability without sacrificing much system
performance.

On 7/3/07, Chris Mason <[EMAIL PROTECTED]> wrote:
> On Tue, 3 Jul 2007 01:28:57 -0400
> "Xin Zhao" <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> >
> > If a file is already opened when snapshot command is issued, the file
> > itself could be in an inconsistent state already. Before the file is
> > closed, maybe part of the file contains old data, the rest contains
> > new data.
> > How does a versioning filesystem guarantee that the file snapshot is
> > in a consistent state in this case?
> >
> > I googled it but didn't find any answer. Can someone explain it a
> > little bit?
>
> It's the same answer as in most filesystem related questions...it
> depends ;) Consistent state means many different things. It may mean
> that the metadata accurately reflects the space on disk allocated to
> the file and that all data for the file is properly on disk (ie from an
> fsync).
>
> But, even this is less than useful because very few files on the
> filesystem stand alone. Applications spread their state across a
> number of files and so consistent means something different to every
> application.
>
> Getting a snapshot that is useful with respect to application data
> requires help from the application. The app needs to be shutdown or
> paused prior to the snapshot and then started up again after the
> snapshot is taken.
>
> -chris
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
On Mon, 2007-07-02 at 10:58 -0400, Mingming Cao wrote:
> Trond or Bruce, can you please review these patch series and ack if you
> agree? Thanks.
>
> As to performance concerns that raise before the inode version counter
> (at least for ext4) is done inside ext4_mark_inode_dirty), so there is
> no extra IO work to store this counter to disk.

Hi Mingming,

It looks OK to me, but you might want to strip out the now redundant i_version updates in add_dirent_to_buf(), ext4_rmdir(), ext4_rename().

I also have some questions about how this will affect the readdir code: unless I missed something, filp->f_version is still unsigned long, so the comparisons and assignments in ext4_readdir()/ext4_dx_readdir() no longer make sense.

Cheers
  Trond
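The readdir concern above can be illustrated with a small user-space sketch. It is only a model, not the ext4 code: version32_t stands in for "unsigned long" on a 32-bit kernel, and cache_version() and versions_match() are hypothetical helpers that mimic the assignment and comparison between filp->f_version and a 64-bit i_version.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: models "unsigned long" as found on a 32-bit kernel. */
typedef uint32_t version32_t;

/* Hypothetical stand-in for "filp->f_version = inode->i_version":
 * the upper 32 bits of the 64-bit counter are silently dropped. */
static version32_t cache_version(uint64_t i_version)
{
    return (version32_t)i_version;
}

/* Hypothetical stand-in for "filp->f_version == inode->i_version":
 * f_version is widened with zeroes before the comparison. */
static int versions_match(version32_t f_version, uint64_t i_version)
{
    return f_version == i_version;
}

/* Walks through the two ways the truncated copy can mislead the caller. */
void demo_version_truncation(void)
{
    uint64_t big = 0x100000005ULL;      /* counter that no longer fits in 32 bits */
    version32_t cached = cache_version(big);

    /* The directory did not change, yet the cached value never matches. */
    assert(!versions_match(cached, big));

    /* A different, older version is mistaken for the current one. */
    assert(versions_match(cached, 5ULL));
}
```

In this model the mismatch only appears once the counter exceeds 32 bits, which is why the problem would be easy to miss in testing.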
Re: vm/fs meetup in september?
On Mon, 2 July 2007 17:46:40 -0700, Jared Hulbert wrote:
>
> Right, the solution to swap problem is identical to the rw XIP
> filesystem problem. Jörn, that's why you're the self-appointed
> subject matter expert!

All right. I'll try to make an important face whenever the subject comes up. Nick, do you have a problem if LogFS occupies two brainslots at the meeting?

Jörn

--
Eighty percent of success is showing up.
-- Woody Allen
Re: how do versioning filesystems take snapshot of opened files?
On Tue, 3 Jul 2007 01:28:57 -0400 "Xin Zhao" <[EMAIL PROTECTED]> wrote:
> Hi,
>
> If a file is already opened when snapshot command is issued, the file
> itself could be in an inconsistent state already. Before the file is
> closed, maybe part of the file contains old data, the rest contains
> new data.
> How does a versioning filesystem guarantee that the file snapshot is
> in a consistent state in this case?
>
> I googled it but didn't find any answer. Can someone explain it a
> little bit?

It's the same answer as in most filesystem related questions...it depends ;) Consistent state means many different things. It may mean that the metadata accurately reflects the space on disk allocated to the file and that all data for the file is properly on disk (ie from an fsync).

But, even this is less than useful because very few files on the filesystem stand alone. Applications spread their state across a number of files and so consistent means something different to every application.

Getting a snapshot that is useful with respect to application data requires help from the application. The app needs to be shut down or paused prior to the snapshot and then started up again after the snapshot is taken.

-chris
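As a rough illustration of the "help from the application" described above, an application could flush every file that holds part of its state before the snapshot is triggered. This is only a sketch under assumptions: quiesce_for_snapshot() is a hypothetical helper, the application is assumed to have stopped writing already, and how the snapshot itself gets taken is outside the example. Only the POSIX fsync() call is real.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush all files that together make up the
 * application's state. Returns 0 if every fsync() succeeded and -1 on
 * the first failure, in which case the snapshot should not be taken.
 */
static int quiesce_for_snapshot(const int *fds, size_t nfds)
{
    assert(fds != NULL || nfds == 0);   /* an empty set needs no array */

    for (size_t i = 0; i < nfds; i++) {
        if (fsync(fds[i]) != 0)
            return -1;                  /* errno is left set by fsync() */
    }
    return 0;
}
```

The application would call this after pausing its own writers and resume them only once the snapshot has completed; fsync() alone does not stop new writes from arriving.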
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Tue, Jul 03, 2007 at 11:31:07AM +0100, Christoph Hellwig wrote:
> On Tue, Jul 03, 2007 at 03:38:48PM +0530, Amit K. Arora wrote:
> > > FA_FL_DEALLOC   0x01 /* deallocate unwritten extent (default allocate) */
> > > FA_FL_KEEP_SIZE 0x02 /* keep size for EOF {pre,de}alloc (default change size) */
> > > FA_FL_DEL_DATA  0x04 /* delete existing data in alloc range (default keep) */
> >
> > We now have two sets of flags -
> > 1) the above three with which I think no one has any issues with, and
>
> Yes, I do. FA_FL_DEL_DATA is plain stupid, a preallocation call should
> never delete data. FA_FL_DEALLOC should probably be a separate syscall
> because it's very different functionality.

Well, if you look at the modes proposed using the above flags:

#define FA_ALLOCATE     0
#define FA_DEALLOCATE   FA_FL_DEALLOC
#define FA_RESV_SPACE   FA_FL_KEEP_SIZE
#define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)

FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes for preallocation, FA_ALLOCATE and FA_RESV_SPACE, which do not use this flag. Hence preallocation will never delete data. This flag is required only for FA_UNRESV_SPACE, which is a deallocation mode, to support any existing XFS-aware applications and usage scenarios.

And, regarding FA_FL_DEALLOC being a separate syscall: I think the very purpose of the @mode argument would then not be justified. We have this argument so that we can provide more features like this. That said, I am not saying that we should make things very complicated; but at least we should provide some basic features which we expect most applications wanting preallocation to use. To start with, we need to cater to the existing applications and user base that use the XFS preallocation feature. Further advanced features, like goal-based preallocation, can be implemented as a separate syscall.

> While we're at it I also dislike the FA_ prefix because it doesn't say
> anything and is far too generic. FALLOC_ is much better.

Ok. This can be changed in the next take.

> > > FA_FL_ERR_FREE  0x08 /* free preallocation on error (default keep prealloc) */
>
> NACK on this one. We should have just one behaviour, and from the thread
> that not freeing the allocation on error.

I agree on this one.

> > > FA_FL_NO_MTIME  0x10 /* keep same mtime (default change on size, data change) */
> > > FA_FL_NO_CTIME  0x20 /* keep same ctime (default change on size, data change) */
>
> NACK to these as well. If i_size changes c/mtime need updates, if the size
> doesn't change they don't. No need to add more flags for this.

This requirement came from the point of view of HSM applications. I hope you saw Andreas' previous post and are keeping that in mind.

--
Regards,
Amit Arora
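The claim that neither preallocation mode carries FA_FL_DEL_DATA can be checked mechanically. The flag and mode values below are copied from the thread; mode_is_prealloc() and mode_deletes_data() are hypothetical helpers written only for this illustration and are not part of any proposed interface.

```c
#include <assert.h>

/* Flag values as quoted in the thread. */
#define FA_FL_DEALLOC   0x01    /* deallocate (default allocate) */
#define FA_FL_KEEP_SIZE 0x02    /* keep size (default change size) */
#define FA_FL_DEL_DATA  0x04    /* delete existing data (default keep) */

/* Modes as proposed in the thread. */
#define FA_ALLOCATE     0
#define FA_DEALLOCATE   FA_FL_DEALLOC
#define FA_RESV_SPACE   FA_FL_KEEP_SIZE
#define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)

/* Hypothetical helper: a mode preallocates unless FA_FL_DEALLOC is set. */
static int mode_is_prealloc(int mode)
{
    return !(mode & FA_FL_DEALLOC);
}

/* Hypothetical helper: does the mode ask for existing data to be deleted? */
static int mode_deletes_data(int mode)
{
    return (mode & FA_FL_DEL_DATA) != 0;
}

/* Checks that no preallocation mode can delete data. */
void demo_fallocate_modes(void)
{
    assert(mode_is_prealloc(FA_ALLOCATE) && !mode_deletes_data(FA_ALLOCATE));
    assert(mode_is_prealloc(FA_RESV_SPACE) && !mode_deletes_data(FA_RESV_SPACE));
    assert(!mode_is_prealloc(FA_UNRESV_SPACE) && mode_deletes_data(FA_UNRESV_SPACE));
}
```

Only FA_UNRESV_SPACE, a deallocation mode, sets the delete-data bit, which is the point being argued in the reply.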
Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp
On Sun, 2007-07-01 at 03:36 -0400, Mingming Cao wrote:
> +
> +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode)                        \
> +do {                                                                         \
> +       (inode)->xtime.tv_sec = le32_to_cpu((raw_inode)->xtime);              \
> +       if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra))    \
> +               ext4_decode_extra_time(&(inode)->xtime,                       \
> +                                      raw_inode->xtime ## _extra);           \
> +} while (0)
> +
> +#define EXT4_EINODE_GET_XTIME(xtime, einode, raw_inode)                      \
> +do {                                                                         \
> +       if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime))                     \
> +               (einode)->xtime.tv_sec = le32_to_cpu((raw_inode)->xtime);     \
> +       if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))           \
> +               ext4_decode_extra_time(&(einode)->xtime,                      \
> +                                      raw_inode->xtime ## _extra);           \
> +} while (0)
> +

This nanosecond patch seems to be missing the fix below which is required for http://bugzilla.kernel.org/show_bug.cgi?id=5079

If the timestamp is set to before epoch i.e. a negative timestamp then the file may have its date set into the future on 64-bit systems. So when the timestamp is read it must be cast as signed.

Index: linux-2.6.21/include/linux/ext4_fs.h
===
--- linux-2.6.21.orig/include/linux/ext4_fs.h
+++ linux-2.6.21/include/linux/ext4_fs.h
@@ -390,7 +390,7 @@ do { \
 #define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode) \
 do { \
-       (inode)->xtime.tv_sec = le32_to_cpu((raw_inode)->xtime); \
+       (inode)->xtime.tv_sec = (signed)le32_to_cpu((raw_inode)->xtime); \
        if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \
                ext4_decode_extra_time(&(inode)->xtime,\
                                       raw_inode->xtime ## _extra);\
@@ -399,7 +399,8 @@ do { \
 #define EXT4_EINODE_GET_XTIME(xtime, einode, raw_inode) \
 do { \
        if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime)) \
-               (einode)->xtime.tv_sec = le32_to_cpu((raw_inode)->xtime); \
+               (einode)->xtime.tv_sec = \
+                       (signed)le32_to_cpu((raw_inode)->xtime); \
        if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))\
                ext4_decode_extra_time(&(einode)->xtime, \
                                       raw_inode->xtime ## _extra);\

Thanks,
Kalpak.
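The effect of the missing cast can be reproduced in user space. The sketch below is an illustration rather than the kernel code: int64_t is assumed to model tv_sec on a 64-bit system, and decode_sec_unsigned() and decode_sec_signed() are hypothetical helpers that mirror the assignment without and with the (signed) cast. Converting an out-of-range uint32_t to int32_t is implementation-defined in C; the common compilers wrap to the two's complement value, which is what the cast in the patch relies on.

```c
#include <assert.h>
#include <stdint.h>

/* Without the cast: the 32-bit on-disk value is zero-extended. */
static int64_t decode_sec_unsigned(uint32_t raw)
{
    return raw;
}

/* With the cast: the value is reinterpreted as signed, then sign-extended. */
static int64_t decode_sec_signed(uint32_t raw)
{
    return (int32_t)raw;
}

/* One second before the epoch is stored on disk as 0xFFFFFFFF. */
void demo_pre_epoch_timestamp(void)
{
    uint32_t raw = 0xFFFFFFFFu;

    /* Zero-extension lands in the year 2106 instead of 1969. */
    assert(decode_sec_unsigned(raw) == 4294967295LL);

    /* Sign-extension recovers the intended value of -1. */
    assert(decode_sec_signed(raw) == -1);
}
```

On a system where tv_sec is only 32 bits wide the two paths give the same result, which matches the report that the problem shows up on 64-bit systems.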
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Tue, Jul 03, 2007 at 03:38:48PM +0530, Amit K. Arora wrote:
> > FA_FL_DEALLOC   0x01 /* deallocate unwritten extent (default allocate) */
> > FA_FL_KEEP_SIZE 0x02 /* keep size for EOF {pre,de}alloc (default change size) */
> > FA_FL_DEL_DATA  0x04 /* delete existing data in alloc range (default keep) */
>
> We now have two sets of flags -
> 1) the above three with which I think no one has any issues with, and

Yes, I do. FA_FL_DEL_DATA is plain stupid, a preallocation call should never delete data. FA_FL_DEALLOC should probably be a separate syscall because it's very different functionality.

While we're at it I also dislike the FA_ prefix because it doesn't say anything and is far too generic. FALLOC_ is much better.

> > FA_FL_ERR_FREE  0x08 /* free preallocation on error (default keep prealloc) */

NACK on this one. We should have just one behaviour, and judging from the thread that is not freeing the allocation on error.

> > FA_FL_NO_MTIME  0x10 /* keep same mtime (default change on size, data change) */
> > FA_FL_NO_CTIME  0x20 /* keep same ctime (default change on size, data change) */

NACK to these as well. If i_size changes, c/mtime need updates; if the size doesn't change, they don't. No need to add more flags for this.
Re: [PATCH 4/7][TAKE5] support new modes in fallocate
On Sat, Jun 30, 2007 at 12:52:46PM -0400, Andreas Dilger wrote:
> The @mode flags that are currently under consideration are (AFAIK):
>
> FA_FL_DEALLOC   0x01 /* deallocate unwritten extent (default allocate) */
> FA_FL_KEEP_SIZE 0x02 /* keep size for EOF {pre,de}alloc (default change size) */
> FA_FL_DEL_DATA  0x04 /* delete existing data in alloc range (default keep) */

We now have two sets of flags -
1) the above three, with which I think no one has any issues, and
2) the ones below, for which we need some discussion before finalizing them.

I would prefer fallocate going into mainline with the above three flags; the rest can be debated and discussed in parallel, and each new mode/flag can be pushed as a separate patch. This way the fallocate feature will not be held up indefinitely...

Please confirm if you find this approach ok. Otherwise, please object. Thanks!

> FA_FL_ERR_FREE  0x08 /* free preallocation on error (default keep prealloc) */
> FA_FL_NO_MTIME  0x10 /* keep same mtime (default change on size, data change) */
> FA_FL_NO_CTIME  0x20 /* keep same ctime (default change on size, data change) */

--
Regards,
Amit Arora