[RFC PATCH 5/5] Shadow directories: documentation
Documentation of the shadow directories.

Signed-off-by: Jaroslav Sykora [EMAIL PROTECTED]

 Documentation/filesystems/shadow-directories.txt |  177 +
 1 file changed, 177 insertions(+)

--- /dev/null	2007-10-18 09:34:42.624413454 +0200
+++ new/Documentation/filesystems/shadow-directories.txt	2007-10-18 17:03:06.0 +0200
@@ -0,0 +1,177 @@
+Shadow directories
+==================
+
+The Goal
+--------
+
+Let's say we have an archive file hello.zip with a hello world program source
+code. We want to do this:
+	cat hello.zip^/hello.c
+
+The '^' is an escape character and it tells the computer to treat the file
+as a directory.
+[Note: We can't do "cat hello.zip/hello.c" because of http://lwn.net/Articles/100148/ ]
+
+One way to implement the scenario above is to create a FUSE VFS server and chroot
+everything into it. This will work, but poorly. The performance will be low
+and many things, like setuid binaries, won't work in principle (if the server
+doesn't have root privileges).
+
+
+The Principle
+-------------
+
+For every process we define two VFS trees:
+(1) the standard system-wide tree, managed by mount/umount, implemented by native
+    filesystems like ext3, reiserfs, etc.;
+(2) a per-process shadow tree, usually implemented by FUSE.
+
+The main change is within the VFS lookup code: a file name is looked up in the
+standard tree and if it's found we're done. If not, the name is transparently
+looked up in the shadow tree.
+
+[Picture: A standard and a shadow tree. The shadow tree will be in fact mounted
+ on some point in the standard tree, e.g. /home/jara/.vfs/mnt. ]
+
+      Standard                        Shadow
+         /                              /
+     ,--|---,                       ,-|--,
+    bin home usr                  bin home usr
+         |                             |
+        jara                          jara
+       ,|-,                          ,|-,-,
+    tmp  hello.zip                tmp hello.zip hello.zip^
+                                                   |
+                                                ,---,
+                                           hello.c Makefile
+
+
+Generally speaking a shadow tree is a superset of a standard tree -- everything
+we can find in the standard tree can be found in the shadow tree.
+But the standard tree is faster (it's a native FS), so we want to take most
+of files from it and only the rest from the other tree (see the directory
+hello.zip^ in the picture above which is not in the std. tree).
+
+In a task the standard tree is primarily defined by its root directory
+(fs_struct.root). Secondarily it's represented by current working directory
+and by opened directory handles. To map all these directories to corresponding
+shadow directories we add shadow root, shadow current directory and shadow
+directories for all the opened directories (in the struct file).
+
+The user needs to set only the shadow root directory for his/her login shell. The
+settings will be inherited by all child processes. Although we provide a system
+call to set up shadow current directory (SHDW_FD_PWD, below) and shadow directories
+of opened directories (@@fd>=0 below), this information can be automatically
+deduced from the standard directories.
+
+Example 1: See the picture above:
+A process has root=/ and pwd=/home/jara. The user's FUSE VFS server is mounted
+on /home/jara/.vfs/mnt. We set up the shadow root directory of the process with
+a system call:
+	setshdwpath(pid, SHDW_FD_ROOT, "/home/jara/.vfs/mnt");
+The kernel knows that pwd=/home/jara, so it can deduce that shadow pwd will
+be /home/jara/.vfs/mnt/home/jara (absolute path).
+
+
+The Escape Character Mode
+-------------------------
+
+As has been said above a file name look-up is now a two stage process: first
+we try to look up the name in the standard tree and if we fail we try in the
+shadow tree. The problem is that there are hundreds of failed lookups on
+normal session start -- a few dozen per every starting process. All these
+bogus lookups will make it to the shadow root and will be processed by the user
+space VFS server implemented in FUSE. The lookups will be rejected and everything
+works as usual but it's slow.
+
+To speed things up and to be practical we define an _escape character_. It's
+simply any character which can be used in a file name but which isn't used
+very often -- like '#' or '^'. We choose the '^' in this document.
+
+The escape character is loaded by the system call described below. All the
+lookups going to the shadow tree are filtered against the escape character.
+The VFS look-up procedure is thus:
+	1. a component of the path (a name) is looked up in the standard tree.
+	   If it's found, we're done.
+	2. if the escape character mode is enabled the name is checked if it
+	   contains the escape character. If not
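The two-stage lookup with escape-character filtering described in the documentation patch can be sketched in user-space C. This is only an illustration, not the patch's kernel code: `struct tree`, `tree_has()` and `lookup_component()` are hypothetical names, the two trees are modelled as flat arrays of names, and the behaviour after the truncated "If not" is assumed to be "fail the lookup without consulting the shadow tree", which is what the preceding sentence about filtering suggests.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a VFS tree: a flat array of names. */
struct tree {
	const char *const *names;
	size_t count;
};

static bool tree_has(const struct tree *t, const char *name)
{
	for (size_t i = 0; i < t->count; i++)
		if (strcmp(t->names[i], name) == 0)
			return true;
	return false;
}

enum lookup_result { FOUND_STD, FOUND_SHADOW, NOT_FOUND };

/*
 * Two-stage lookup of one path component.
 * esc == 0 models "escape character mode disabled".
 */
static enum lookup_result lookup_component(const struct tree *std,
					   const struct tree *shadow,
					   const char *name, char esc)
{
	/* Stage 1: the standard tree always wins. */
	if (tree_has(std, name))
		return FOUND_STD;

	/*
	 * Filter (assumed behaviour): in escape mode, a name without the
	 * escape character never reaches the slow shadow tree.
	 */
	if (esc != 0 && strchr(name, esc) == NULL)
		return NOT_FOUND;

	/* Stage 2: fall back to the shadow tree. */
	return tree_has(shadow, name) ? FOUND_SHADOW : NOT_FOUND;
}
```

With the trees from the picture, "hello.zip" resolves in the standard tree, "hello.zip^" only in the shadow tree, and a bogus name without '^' is rejected before the FUSE server would ever see it.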
[RFC PATCH 0/5] Shadow directories
Hello,

Let's say we have an archive file hello.zip with a hello world program source
code. We want to do this:
	cat hello.zip^/hello.c
	gcc hello.zip^/hello.c -o hello
	etc..

The '^' is an escape character and it tells the computer to treat the file
as a directory.
[Note: We can't do "cat hello.zip/hello.c" because of http://lwn.net/Articles/100148/ ]

The kernel patch implements only a redirection of the request to another
directory (shadow directory) where a FUSE server must be mounted.
The decompression of archives is entirely handled in the user space.
More info can be found in the documentation patch in the series.

The shadow directories are used in RheaVFS project
[ http://rheavfs.sourceforge.net/ ], and it also can be used with the
original AVFS [ http://www.inf.bme.hu/~mszeredi/avfs/ ].

The patches are against vanilla 2.6.23.

This is my first bigger contribution to the kernel so please be gentle ;-)

Jara

-- 
Elves and Dragons! I says to him. Cabbages and potatoes are better
for you and me.
    -- J. R. R. Tolkien
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 1/5] Shadow directories: headers
Header file changes for shadow directories. Adds pointers to shadow dirs
to the struct file and struct fs_struct. Defines internal lookup flags and
syscall flags.

Signed-off-by: Jaroslav Sykora [EMAIL PROTECTED]

 include/linux/file.h      |    2 ++
 include/linux/fs.h        |   18 ++
 include/linux/fs_struct.h |   25 +
 include/linux/namei.h     |   16 
 4 files changed, 61 insertions(+)

--- orig/include/linux/fs.h	2007-10-07 19:00:24.0 +0200
+++ new/include/linux/fs.h	2007-10-07 13:39:08.0 +0200
@@ -266,6 +266,14 @@ extern int dir_notify_enable;
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
 
+/* sys_setshdwinfo(), sys_getshdwinfo(): */
+#define FSI_SHDW_ENABLE		1	/* enable shadow directories */
+#define FSI_SHDW_ESC_EN		2	/* enable use of escape character */
+#define FSI_SHDW_ESC_CHAR	3	/* specify escape character */
+/* sys_setshdwpath */
+#define SHDW_FD_ROOT		-1	/* pseudo FD for root shadow dir */
+#define SHDW_FD_PWD		-2	/* pseudo FD for pwd shadow dir */
+
 #ifdef __KERNEL__
 
 #include <linux/linkage.h>
@@ -752,6 +760,16 @@ struct file {
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
 	struct address_space	*f_mapping;
+
+	/* the following fields are protected by f_owner.lock */
+	/*	| f_shdw   | f_shdwmnt   | result
+		+----------+-------------+---------
+		| NULL     | NULL        | delayed
+		| NULL     | !NULL       | invalid
+		| !NULL    | NULL        | BUG
+		| !NULL    | !NULL       | valid */
+	struct dentry		*f_shdw;
+	struct vfsmount		*f_shdwmnt;
 };
 extern spinlock_t files_lock;
 #define file_list_lock() spin_lock(&files_lock);
--- orig/include/linux/fs_struct.h	2007-07-09 01:32:17.0 +0200
+++ new/include/linux/fs_struct.h	2007-10-07 13:39:08.0 +0200
@@ -10,8 +10,31 @@ struct fs_struct {
 	int umask;
 	struct dentry * root, * pwd, * altroot;
 	struct vfsmount * rootmnt, * pwdmnt, * altrootmnt;
+
+	int flags;
+	/* shadow dirs: root and pwd */
+	/*	| shdwroot | shdwrootmnt | result
+		+----------+-------------+---------
+		| NULL     | NULL        | BUG_ON(flags&SHDW_ENABLED)
+		| !NULL    | !NULL       | ok
+		+==========+=============+=========
+		| shdwpwd  | shdwpwdmnt  | result
+		+----------+-------------+---------
+		| NULL     | NULL        | delayed
+		| NULL     | !NULL       | invalid
+		| !NULL    | NULL        | BUG
+		| !NULL    | !NULL       | valid */
+	struct dentry *shdwroot, *shdwpwd;
+	struct vfsmount *shdwrootmnt, *shdwpwdmnt;
+	/* shadow dirs: escape character */
+	unsigned char shdw_escch;
 };
 
+/* bitflags for fs_struct.flags */
+#define SHDW_ENABLED	1	/* are shadow dirs enabled? */
+#define SHDW_USE_ESC	2	/* use escape char in shadow dirs? */
+
+
 #define INIT_FS {				\
 	.count		= ATOMIC_INIT(1),	\
 	.lock		= RW_LOCK_UNLOCKED,	\
@@ -24,6 +47,8 @@ extern void exit_fs(struct task_struct *
 extern void set_fs_altroot(void);
 extern void set_fs_root(struct fs_struct *, struct vfsmount *, struct dentry *);
 extern void set_fs_pwd(struct fs_struct *, struct vfsmount *, struct dentry *);
+extern void set_fs_shdwpwd(struct fs_struct *fs,
+		struct vfsmount *mnt, struct dentry *dentry);
 extern struct fs_struct *copy_fs_struct(struct fs_struct *);
 extern void put_fs_struct(struct fs_struct *);
 
--- orig/include/linux/namei.h	2007-10-07 19:00:25.0 +0200
+++ new/include/linux/namei.h	2007-10-07 20:03:11.0 +0200
@@ -22,6 +22,7 @@ struct nameidata {
 	int		last_type;
 	unsigned	depth;
 	char *saved_names[MAX_NESTED_LINKS + 1];
+	unsigned char find_char;
 
 	/* Intent data */
 	union {
@@ -54,6 +55,16 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LA
 #define LOOKUP_PARENT		16
 #define LOOKUP_NOALT		32
 #define LOOKUP_REVAL		64
+
+/* don't fallback to lookup in shadow directory */
+#define LOOKUP_NOSHDW		128
+/* try to find nameidata.find_char in pathname,
+ * set LOOKUP_CHARFOUND in nameidata.flags if found */
+#define LOOKUP_FINDCHAR		(1<<16)
+#define LOOKUP_CHARFOUND	(1<<17)
+/* (dentry,mnt) was found in shadow dir */
+#define LOOKUP_INSHDW		(1<<18)
+
 /*
  * Intent data
  */
@@ -68,6 +79,8 @@ extern int FASTCALL(__user_walk_fd(int d
 	__user_walk_fd(AT_FDCWD, name, LOOKUP_FOLLOW, nd)
 #define user_path_walk_link(name,nd) \
 	__user_walk_fd(AT_FDCWD, name, 0, nd)
+extern int FASTCALL(path_lookup_shdw(int dfd, const char *name,
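The comment tables in the header patch define four states for each (dentry, vfsmount) shadow pair. As a reading aid, the table can be written as a small classifier. This is a sketch only: `enum shdw_state` and `shdw_classify()` are hypothetical names that do not appear in the patch, and the pointers are treated as opaque. In the patch itself the "invalid" state appears to be encoded by storing an ERR_PTR value in the vfsmount field while the dentry is NULL.

```c
#include <stddef.h>

/* Hypothetical names for the four rows of the comment table. */
enum shdw_state {
	SHDW_DELAYED,	/* NULL,  NULL : not resolved yet, deduce lazily */
	SHDW_INVALID,	/* NULL,  !NULL: resolution failed earlier */
	SHDW_BUG,	/* !NULL, NULL : must never happen */
	SHDW_VALID	/* !NULL, !NULL: usable shadow directory */
};

/* Classify a (dentry, mnt) pair exactly as the table does. */
static enum shdw_state shdw_classify(const void *dentry, const void *mnt)
{
	if (dentry == NULL)
		return mnt == NULL ? SHDW_DELAYED : SHDW_INVALID;
	return mnt == NULL ? SHDW_BUG : SHDW_VALID;
}
```

The BUG row corresponds to the `BUG_ON(dentry != NULL && mnt == NULL)` checks in the setter functions of the later patches.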
[RFC PATCH 3/5] Shadow directories: chdir, fchdir
sys_chdir and sys_fchdir changes.

Signed-off-by: Jaroslav Sykora [EMAIL PROTECTED]

 fs/open.c |   79 
 1 file changed, 73 insertions(+), 6 deletions(-)

--- orig/fs/open.c	2007-10-07 19:00:19.0 +0200
+++ new/fs/open.c	2007-10-16 21:04:56.0 +0200
@@ -476,13 +476,51 @@ asmlinkage long sys_access(const char __
 	return sys_faccessat(AT_FDCWD, filename, mode);
 }
 
+static inline int read_fs_flags(void)
+{
+	int res;
+	read_lock(&current->fs->lock);
+	res = current->fs->flags;
+	read_unlock(&current->fs->lock);
+	return res;
+}
+
+void set_fs_shdwpwd(struct fs_struct *fs,
+		struct vfsmount *mnt, struct dentry *dentry)
+{
+	struct dentry *old_dentry;
+	struct vfsmount *old_mnt;
+
+	BUG_ON(dentry != NULL && mnt == NULL);
+	write_lock(&fs->lock);
+	/* set shadow pwd */
+	old_dentry = fs->shdwpwd;
+	old_mnt = fs->shdwpwdmnt;
+	fs->shdwpwd = dget(dentry);
+	if (dentry)
+		fs->shdwpwdmnt = mntget(mnt);
+	else
+		/* PTR_ERR flag */
+		fs->shdwpwdmnt = mnt;
+	write_unlock(&fs->lock);
+
+	if (old_dentry) {
+		mntput(old_mnt);
+		dput(old_dentry);
+	}
+}
+
 asmlinkage long sys_chdir(const char __user * filename)
 {
 	struct nameidata nd;
-	int error;
+	char *tmp = getname(filename);
+	int error = PTR_ERR(tmp);;
+
+	if (IS_ERR(tmp))
+		goto out_badname;
 
-	error = __user_walk(filename,
-			    LOOKUP_FOLLOW|LOOKUP_DIRECTORY|LOOKUP_CHDIR, &nd);
+	error = path_lookup(tmp, LOOKUP_FOLLOW | LOOKUP_DIRECTORY
+				| LOOKUP_CHDIR, &nd);
 	if (error)
 		goto out;
 
@@ -490,11 +528,23 @@ asmlinkage long sys_chdir(const char __u
 	if (error)
 		goto dput_and_out;
 
-	set_fs_pwd(current->fs, nd.mnt, nd.dentry);
+	if (!(read_fs_flags() & SHDW_ENABLED))
+		goto set_std;
+	if (!(nd.flags & LOOKUP_INSHDW))
+		set_fs_shdwpwd(current->fs, NULL, NULL);
+	else
+		/* shadow == std */
+		set_fs_shdwpwd(current->fs, nd.mnt, nd.dentry);
+
+set_std:
+	/* set std cwd */
+	set_fs_pwd(current->fs, nd.mnt, nd.dentry);
 
 dput_and_out:
 	path_release(&nd);
 out:
+	putname(tmp);
+out_badname:
 	return error;
 }
 
@@ -520,8 +570,25 @@ asmlinkage long sys_fchdir(unsigned int
 		goto out_putf;
 
 	error = file_permission(file, MAY_EXEC);
-	if (!error)
-		set_fs_pwd(current->fs, mnt, dentry);
+	if (error)
+		goto out_putf;
+
+	set_fs_pwd(current->fs, mnt, dentry);
+
+	if (!(read_fs_flags() & SHDW_ENABLED))
+		/* shadow dirs aren't enabled */
+		goto out_putf;
+
+	if (get_file_shdwdir(file, &dentry, &mnt))
+		/* some error occurred */
+		set_fs_shdwpwd(current->fs, NULL, NULL);
+	else {
+		/* ok */
+		set_fs_shdwpwd(current->fs, mnt, dentry);
+		mntput(mnt);
+		dput(dentry);
+	}
+
 out_putf:
 	fput(file);
 out:
[RFC PATCH 2/5] Shadow directories: core
Implements two stage lookup with escape character filtering and system calls
for i386. Changes lookup path, namely do_path_lookup. This function is split
into path_lookup_norm(), which performs standard name lookup, and
path_lookup_shdw(), which performs name lookup in an associated shadow
directory.

Signed-off-by: Jaroslav Sykora [EMAIL PROTECTED]

 arch/i386/kernel/syscall_table.S |    6 
 fs/exec.c                        |    4 
 fs/file_table.c                  |   19 
 fs/namei.c                       |  610 -
 fs/namespace.c                   |   13 
 include/linux/syscalls.h         |    6 
 kernel/exit.c                    |    8 
 kernel/fork.c                    |   20 
 8 files changed, 672 insertions(+), 14 deletions(-)

--- orig/fs/namei.c	2007-10-07 19:00:19.0 +0200
+++ new/fs/namei.c	2007-10-18 15:35:54.0 +0200
@@ -31,6 +31,7 @@
 #include <linux/file.h>
 #include <linux/fcntl.h>
 #include <linux/namei.h>
+#include <linux/ptrace.h>
 #include <asm/namei.h>
 #include <asm/uaccess.h>
 
@@ -515,6 +516,25 @@ static struct dentry * real_lookup(struc
 	return result;
 }
 
+static inline int use_shadow(struct fs_struct *fs, struct nameidata *nd)
+{
+	/* assert: fs->lock held */
+	return (fs->flags & SHDW_ENABLED) && (nd->flags & LOOKUP_INSHDW);
+}
+
+static inline struct dentry *fs_root(struct fs_struct *fs, struct nameidata *nd)
+{
+	/* assert: current->fs->lock held */
+	return (use_shadow(fs, nd)) ? fs->shdwroot : fs->root;
+}
+
+static inline struct vfsmount *fs_rootmnt(struct fs_struct *fs,
+		struct nameidata *nd)
+{
+	/* assert: current->fs->lock held */
+	return (use_shadow(fs, nd)) ? fs->shdwrootmnt : fs->rootmnt;
+}
+
 static int __emul_lookup_dentry(const char *, struct nameidata *);
 
 /* SMP-safe */
@@ -532,8 +552,8 @@ walk_init_root(const char *name, struct
 			return 0;
 		read_lock(&fs->lock);
 	}
-	nd->mnt = mntget(fs->rootmnt);
-	nd->dentry = dget(fs->root);
+	nd->mnt = mntget(fs_rootmnt(fs, nd));
+	nd->dentry = dget(fs_root(fs, nd));
 	read_unlock(&fs->lock);
 	return 1;
 }
@@ -730,9 +750,9 @@ static __always_inline void follow_dotdo
 		struct vfsmount *parent;
 		struct dentry *old = nd->dentry;
 
-                read_lock(&fs->lock);
-		if (nd->dentry == fs->root &&
-		    nd->mnt == fs->rootmnt) {
+		read_lock(&fs->lock);
+		if (nd->dentry == fs_root(fs, nd) &&
+		    nd->mnt == fs_rootmnt(fs, nd)) {
 			read_unlock(&fs->lock);
 			break;
 		}
@@ -842,6 +862,11 @@ static fastcall int __link_path_walk(con
 
 		hash = init_name_hash();
 		do {
+			if (unlikely((nd->flags & LOOKUP_FINDCHAR)
+					&& (c == nd->find_char))) {
+				/* shadow control char found */
+				nd->flags |= LOOKUP_CHARFOUND;
+			}
 			name++;
 			hash = partial_name_hash(c, hash);
 			c = *(const unsigned char *)name;
@@ -1100,8 +1125,8 @@ set_it:
 	}
 }
 
-/* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
-static int fastcall do_path_lookup(int dfd, const char *name,
+/* Lookup @name, starting at @dfd, use normal (non-shadow) root and pwd */
+static int fastcall path_lookup_norm(int dfd, const char *name,
 				unsigned int flags, struct nameidata *nd)
 {
 	int retval = 0;
@@ -1168,6 +1193,313 @@ fput_fail:
 	goto out_fail;
 }
 
+/*
+ * Set @filp->f_shdw, @filp->f_shdwmnt to @mnt,@dentry.
+ * Takes @filp->f_owner->lock.
+ * Note: if @dentry == NULL then @mnt may be ERR_PTR(-EINVAL).
+ */
+static void set_fileshdw(struct file *filp, struct vfsmount *mnt,
+		struct dentry *dentry)
+{
+	struct dentry *old_dentry;
+	struct vfsmount *old_mnt;
+
+	BUG_ON(dentry != NULL && mnt == NULL);
+	write_lock(&filp->f_owner.lock);
+	old_dentry = filp->f_shdw;
+	old_mnt = filp->f_shdwmnt;
+	filp->f_shdw = dget(dentry);
+	if (dentry)
+		filp->f_shdwmnt = mntget(mnt);
+	else
+		/* mnt is ERR_PTR */
+		filp->f_shdwmnt = mnt;
+	write_unlock(&filp->f_owner.lock);
+
+	if (old_dentry) {
+		dput(old_dentry);
+		mntput(old_mnt);
+	}
+}
+
+/*
+ * Determine @filp->f_shdw,f_shdwmnt from @filp->dentry,mnt
+ * and current->fs->shdwroot.
+ * Also check whether it's a directory and we have permission.
+ * Called only from get_file_shdwdir().
+ */
+static int validate_shdwfile(struct file *filp)
+{
+	struct nameidata nd;
+	char *buf, *name;
+	int res = -ENOMEM;
+
+	buf = (char *)__get_free_page(GFP_KERNEL);
+	if (!buf)
+
Re: [RFC PATCH 0/5] Shadow directories
On Oct 18 2007 17:21, Jaroslav Sykora wrote:
> Hello,
>
> Let's say we have an archive file hello.zip with a hello world program source
> code. We want to do this:
>	cat hello.zip^/hello.c
>	gcc hello.zip^/hello.c -o hello
>	etc..
>
> The '^' is an escape character and it tells the computer to treat the file
> as a directory.

Too bad, since ^ is a valid character in a *file*name. Everything is, with
the exception of '\0' and '/'. At the end of the day, there are no control
characters you could use.

But what you could do is: write a FUSE fs that mirrors the lower content
(lofs/fuseloop/however it was named) and expands .zip files as directories
are readdir'ed or the zip files stat'ed. That saves us from cluttering up
the Linux VFS with such stuff.
Re: [RFC PATCH 0/5] Shadow directories
Jaroslav Sykora wrote:
> Let's say we have an archive file hello.zip with a hello world program source
> code. We want to do this:
>	cat hello.zip^/hello.c
>	gcc hello.zip^/hello.c -o hello
>	etc..

Wouldn't you do this as a user space filesystem?
Re: [RFC PATCH 0/5] Shadow directories
David Newall wrote:
> Jaroslav Sykora wrote:
>> Let's say we have an archive file hello.zip with a hello world program source
>> code. We want to do this:
>>	cat hello.zip^/hello.c
>>	gcc hello.zip^/hello.c -o hello
>>	etc..
>
> Wouldn't you do this as a user space filesystem?

Which is what you were saying. *SMACK* I so stupid.
Re: [RFC PATCH 0/5] Shadow directories
David Newall wrote:
> David Newall wrote:
>> Jaroslav Sykora wrote:
>>> Let's say we have an archive file hello.zip with a hello world program source
>>> code. We want to do this:
>>>	cat hello.zip^/hello.c
>>>	gcc hello.zip^/hello.c -o hello
>>>	etc..
>>
>> Wouldn't you do this as a user space filesystem?
>
> Which is what you were saying. *SMACK* I so stupid.

On third thoughts, what's the reason for this?
[0/3] Distributed storage. Mirror algo extension for automatic recovery.
Hi.

I'm pleased to announce sixth release of the distributed storage
subsystem, which allows to form a storage on top of remote and local
nodes, which in turn can be exported to another storage as a node to
form tree-like storages.

This release includes mirroring algorithm extension, which allows
to store 'age' of the given node on the underlying media.
In this case, if failed node gets new media, which does not contain
correct 'age' (unique id assigned to the whole storage during
initialization time), the whole node will be marked as dirty and
eventually resynced.

This allows to have completely transparent failure recovery - failed
node can be just turned off, its hardware fixed and then turned on.
DST core will detect connection reset and automatically reconnect
when node is ready and resync if needed without any special
administrator's steps.

This patchset has been split into 4 parts:
 0 - this introduction
 1 - core files
 2 - network state machine
 3 - documentation and algorithms

Hope they all will find its way into mail lists.

Further TODO list includes:
* new redundancy algorithm (complex, low priority)
* some thoughts about distributed filesystem tightly connected to DST

Thank you.

Homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

-- 
	Evgeniy Polyakov
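The 'age' check described in the announcement can be illustrated with a few lines of C. This is a guess at the idea only, not code from the DST patches: `struct mirror_node`, its fields and `mirror_node_check_age()` are hypothetical, and the real implementation presumably reads the age from the node's media and schedules a resync rather than just setting a flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-node state; field names are not from the patch. */
struct mirror_node {
	uint64_t stored_age;	/* id found on the node's media */
	bool dirty;		/* node needs a full resync */
};

/*
 * Compare the age stored on the node's media with the id assigned to
 * the whole storage at initialization. A mismatch (e.g. replaced disk)
 * marks the node dirty. Returns the resulting dirty state.
 */
static bool mirror_node_check_age(struct mirror_node *n, uint64_t storage_age)
{
	if (n->stored_age != storage_age)
		n->dirty = true;
	return n->dirty;
}
```

A node that comes back with blank or foreign media therefore ends up dirty, while a node that merely lost its connection keeps its state.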
[1/3] Distributed storage. Core files.
Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dd88e33..fcf042d 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_VIODASD)		+= viodasd.o
 obj-$(CONFIG_BLK_DEV_SX8)	+= sx8.o
 obj-$(CONFIG_BLK_DEV_UB)	+= ub.o
+obj-$(CONFIG_DST)		+= dst/
diff --git a/drivers/block/dst/Kconfig b/drivers/block/dst/Kconfig
new file mode 100644
index 000..5bb9de8
--- /dev/null
+++ b/drivers/block/dst/Kconfig
@@ -0,0 +1,20 @@
+config DST
+	tristate "Distributed storage"
+	depends on NET
+	select CONNECTOR
+	---help---
+	This driver allows to create a distributed storage.
+
+config DST_ALG_LINEAR
+	tristate "Linear distribution algorithm"
+	depends on DST
+	---help---
+	This module allows to create linear mapping of the nodes
+	in the distributed storage.
+
+config DST_ALG_MIRROR
+	tristate "Mirror distribution algorithm"
+	depends on DST
+	---help---
+	This module allows to create a mirror of the nodes in the
+	distributed storage.
diff --git a/drivers/block/dst/Makefile b/drivers/block/dst/Makefile
new file mode 100644
index 000..1400e94
--- /dev/null
+++ b/drivers/block/dst/Makefile
@@ -0,0 +1,6 @@
+obj-$(CONFIG_DST) += dst.o
+
+dst-y := dcore.o kst.o
+
+obj-$(CONFIG_DST_ALG_LINEAR) += alg_linear.o
+obj-$(CONFIG_DST_ALG_MIRROR) += alg_mirror.o
diff --git a/drivers/block/dst/dcore.c b/drivers/block/dst/dcore.c
new file mode 100644
index 000..fdbfc7b
--- /dev/null
+++ b/drivers/block/dst/dcore.c
@@ -0,0 +1,1533 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/connector.h>
+#include <linux/socket.h>
+#include <linux/dst.h>
+#include <linux/device.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/buffer_head.h>
+
+#include <net/sock.h>
+
+static LIST_HEAD(dst_storage_list);
+static LIST_HEAD(dst_alg_list);
+static DEFINE_MUTEX(dst_storage_lock);
+static DEFINE_MUTEX(dst_alg_lock);
+static int dst_major;
+static struct kst_worker *kst_main_worker;
+static struct cb_id cn_dst_id = { CN_DST_IDX, CN_DST_VAL };
+
+struct kmem_cache *dst_request_cache;
+
+/*
+ * DST sysfs tree. For device called 'storage' which is formed
+ * on top of two nodes this looks like this:
+ *
+ * /sys/devices/storage/
+ * /sys/devices/storage/alg : alg_linear
+ * /sys/devices/storage/n-800/type : R: 192.168.4.80:1025
+ * /sys/devices/storage/n-800/size : 800
+ * /sys/devices/storage/n-800/start : 800
+ * /sys/devices/storage/n-0/type : R: 192.168.4.81:1025
+ * /sys/devices/storage/n-0/size : 800
+ * /sys/devices/storage/n-0/start : 0
+ * /sys/devices/storage/remove_all_nodes
+ * /sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800]
+ * /sys/devices/storage/name : storage
+ */
+
+static int dst_dev_match(struct device *dev, struct device_driver *drv)
+{
+	return 1;
+}
+
+static void dst_dev_release(struct device *dev)
+{
+}
+
+static struct bus_type dst_dev_bus_type = {
+	.name		= "dst",
+	.match		= dst_dev_match,
+};
+
+static struct device dst_dev = {
+	.bus		= &dst_dev_bus_type,
+	.release	= dst_dev_release
+};
+
+static void dst_node_release(struct device *dev)
+{
+}
+
+static struct device dst_node_dev = {
+	.release	= dst_node_release
+};
+
+static struct bio_set *dst_bio_set;
+
+static void dst_destructor(struct bio *bio)
+{
+	bio_free(bio, dst_bio_set);
+}
+
+/*
+ * Internal callback for local requests (i.e. for local disk),
+ * which are split between nodes (part with local node destination
+ * ends up with this ->bi_end_io() callback).
+ */
+static int dst_end_io(struct bio *bio, unsigned int size, int err)
+{
+	struct bio *orig_bio = bio->bi_private;
+
+	if (bio->bi_size)
+		return 0;
+
+	dprintk("%s: bio: %p, orig_bio: %p, size: %u, orig_size: %u.\n",
+		__func__, bio, orig_bio, size, orig_bio->bi_size);
+
+	bio_endio(orig_bio, size, 0);
+	bio_put(bio);
+	return 0;
+}
+
+/*
+ * This function sends processing request down to block layer (for local node)
+ * or to network state machine (for remote node).
+ */
+static int dst_node_push(struct
[2/3] Distributed storage. Network state machine.
Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/drivers/block/dst/kst.c b/drivers/block/dst/kst.c
new file mode 100644
index 000..b0608c9
--- /dev/null
+++ b/drivers/block/dst/kst.c
@@ -0,0 +1,1606 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/socket.h>
+#include <linux/kthread.h>
+#include <linux/net.h>
+#include <linux/in.h>
+#include <linux/poll.h>
+#include <linux/bio.h>
+#include <linux/dst.h>
+
+#include <net/sock.h>
+
+struct kst_poll_helper
+{
+	poll_table		pt;
+	struct kst_state	*st;
+};
+
+static LIST_HEAD(kst_worker_list);
+static DEFINE_MUTEX(kst_worker_mutex);
+
+/*
+ * This function creates bound socket for local export node.
+ */
+static int kst_sock_create(struct kst_state *st, struct saddr *addr,
+		int type, int proto, int backlog)
+{
+	int err;
+
+	err = sock_create(addr->sa_family, type, proto, &st->socket);
+	if (err)
+		goto err_out_exit;
+
+	err = st->socket->ops->bind(st->socket, (struct sockaddr *)addr,
+			addr->sa_data_len);
+
+	err = st->socket->ops->listen(st->socket, backlog);
+	if (err)
+		goto err_out_release;
+
+	st->socket->sk->sk_allocation = GFP_NOIO;
+
+	return 0;
+
+err_out_release:
+	sock_release(st->socket);
+err_out_exit:
+	return err;
+}
+
+static void kst_sock_release(struct kst_state *st)
+{
+	if (st->socket) {
+		sock_release(st->socket);
+		st->socket = NULL;
+	}
+}
+
+void kst_wake(struct kst_state *st)
+{
+	if (st) {
+		struct kst_worker *w = st->node->w;
+		unsigned long flags;
+
+		spin_lock_irqsave(&w->ready_lock, flags);
+		if (list_empty(&st->ready_entry))
+			list_add_tail(&st->ready_entry, &w->ready_list);
+		spin_unlock_irqrestore(&w->ready_lock, flags);
+
+		wake_up(&w->wait);
+	}
+}
+EXPORT_SYMBOL_GPL(kst_wake);
+
+/*
+ * Polling machinery.
+ */
+static int kst_state_wake_callback(wait_queue_t *wait, unsigned mode,
+		int sync, void *key)
+{
+	struct kst_state *st = container_of(wait, struct kst_state, wait);
+	kst_wake(st);
+	return 1;
+}
+
+static void kst_queue_func(struct file *file, wait_queue_head_t *whead,
+				 poll_table *pt)
+{
+	struct kst_state *st = container_of(pt, struct kst_poll_helper, pt)->st;
+
+	st->whead = whead;
+	init_waitqueue_func_entry(&st->wait, kst_state_wake_callback);
+	add_wait_queue(whead, &st->wait);
+}
+
+static void kst_poll_exit(struct kst_state *st)
+{
+	if (st->whead) {
+		remove_wait_queue(st->whead, &st->wait);
+		st->whead = NULL;
+	}
+}
+
+/*
+ * This function removes request from state tree and ordering list.
+ */
+void kst_del_req(struct dst_request *req)
+{
+	struct kst_state *st = req->state;
+
+	rb_erase(&req->request_entry, &st->request_root);
+	RB_CLEAR_NODE(&req->request_entry);
+	list_del_init(&req->request_list_entry);
+}
+EXPORT_SYMBOL_GPL(kst_del_req);
+
+static struct dst_request *kst_req_first(struct kst_state *st)
+{
+	struct dst_request *req = NULL;
+
+	if (!list_empty(&st->request_list))
+		req = list_entry(st->request_list.next, struct dst_request,
+				request_list_entry);
+	return req;
+}
+
+/*
+ * This function dequeues first request from the queue and tree.
+ */
+static struct dst_request *kst_dequeue_req(struct kst_state *st)
+{
+	struct dst_request *req;
+
+	mutex_lock(&st->request_lock);
+	req = kst_req_first(st);
+	if (req)
+		kst_del_req(req);
+	mutex_unlock(&st->request_lock);
+	return req;
+}
+
+static inline int dst_compare_request_id(struct dst_request *old,
+		struct dst_request *new)
+{
+	int cmd = 0;
+
+	if (old->start + to_sector(old->orig_size) <= new->start)
+		cmd = 1;
+	if (old->start >= new->start + to_sector(new->orig_size))
+		cmd = -1;
+
+	dprintk("%s: old: op: %lu, start: %llu, size: %llu, off: %u, "
+		"new: op: %lu, start: %llu, size: %llu, off: %u, cmp: %d.\n",
+		__func__, bio_rw(old->bio), old->start, old->orig_size,
+
Re: [RFC PATCH 0/5] Shadow directories
On Oct 19 2007 05:32, David Newall wrote:
>
> The claim is wrong. UNIX systems have traditionally allowed the superuser
> to create hard links to directories. See link(2) for 2.10BSD
> http://www.freebsd.org/cgi/man.cgi?query=link&sektion=2&manpath=2.10+BSD.
> Having got that wrong throws doubt on the argument; perhaps a path can
> simultaneously be a file and a directory.

But hell will break loose if you allow hardlinking directories.

	mkdir /tmp/a
	ln /tmp/a /tmp/a/b

And you would not be able to rmdir /tmp/a/b because the directory is not
empty (it contains b [full path: /tmp/a/b/b]).
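For anyone who wants to check what a present-day system does with the `ln /tmp/a /tmp/a/b` case, here is a small user-space sketch in C. It is not part of the thread: the function name `try_dir_hardlink()` and the temporary path are made up for illustration. On Linux the link(2) call is expected to fail with EPERM, as the manual page documents for a directory source; other systems may behave differently.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a scratch directory and try to hard-link it into itself,
 * mirroring "mkdir /tmp/a; ln /tmp/a /tmp/a/b".
 * Returns 0 if the link was created, the errno of link(2) if it was
 * refused, or -1 if the scratch directory could not be set up.
 */
static int try_dir_hardlink(void)
{
	char dir[64];
	char link_path[80];
	int err = 0;

	snprintf(dir, sizeof(dir), "/tmp/dirlink-demo-%ld", (long)getpid());
	if (mkdir(dir, 0700) != 0)
		return -1;

	snprintf(link_path, sizeof(link_path), "%s/b", dir);
	if (link(dir, link_path) != 0)
		err = errno;		/* EPERM on Linux */
	else
		unlink(link_path);	/* best-effort cleanup if it worked */

	rmdir(dir);
	return err;
}
```

A caller can compare the return value against EPERM to see whether the kernel refused the directory hard link.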
Re: [RFC PATCH 0/5] Shadow directories
Jaroslav Sykora wrote:
> If anybody can think of any other solution of the redirector problem,
> possibly even non-kernel based one, let me know and I'd be glad :-)

If I understand your problem, you wish to treat an archive file as if it
was a directory. Thus, in the ideal situation, you could do the following:

	cat hello.zip/hello.c
	gcc hello.zip/hello.c -o hello
	etc..

Rather than complicate matters with a second tree, use FUSE with an
explicit directory. For example, ~/expand could be your shadow, thus to
compile hello.c from ~/hello.zip:

	gcc ~/expand/hello.zip^/hello.c -o hello

I think no kernel change would be required.

I'm not keen on the caret.

One of the early claims made in http://lwn.net/Articles/100148/ is:

    Another branch, led by Al Viro, worries about the locking
    considerations of this whole scheme. Linux, like most Unix systems,
    has never allowed hard links to directories for a number of reasons;

The claim is wrong. UNIX systems have traditionally allowed the superuser
to create hard links to directories. See link(2) for 2.10BSD
http://www.freebsd.org/cgi/man.cgi?query=link&sektion=2&manpath=2.10+BSD.
Having got that wrong throws doubt on the argument; perhaps a path can
simultaneously be a file and a directory.
Re: [RFC PATCH 0/5] Shadow directories
On Fri, Oct 19, 2007 at 06:07:45AM +0930, David Newall wrote:
> > considerations of this whole scheme. Linux, like most Unix systems, has
> > never allowed hard links to directories for a number of reasons;
>
> The claim is wrong. UNIX systems have traditionally allowed the superuser
> to create hard links to directories. See link(2) for 2.10BSD
> http://www.freebsd.org/cgi/man.cgi?query=link&sektion=2&manpath=2.10+BSD.
> Having got that wrong throws doubt on the argument; perhaps a path can
> simultaneously be a file and a directory.

Learn to read. Linux has never allowed that. Most of the Unix systems
do not allow that. Original _did_ allow that, but at the cost of very
easily triggered fs corruption (and it didn't have things like rename(2) -
it _did_ have userland implementation, of course, in suid-root mv(1), but
that sucker had been extremely racy and could be easily used to screw
filesystem to hell and back; adding rename(2) to the set of primitives
combined with multiple links to directories leads to very nasty issues
on _any_ system).
Re: [RFC PATCH 0/5] Shadow directories
On Fri, Oct 19, 2007 at 12:27:16PM +0930, David Newall wrote:
> > Learn to read. Linux has never allowed that. Most of the Unix systems
> > do not allow that.
>
> I did read the claim and it is ambiguous, in that it can reasonably be
> read to mean that most UNIX systems never allowed such links, which is
> wrong. All UNIX systems allowed it until relatively recently.

FVO "relatively recently" exceeding a decade and half. In any case, it's
_trivial_ to get fs corruption on any system with such links - play with
rename() races a bit and you'll get it. And yes, it does include 4.4BSD
and quite a chunk of even later history.

Anyway, you are quite welcome to propose a sane locking scheme capable
of dealing with that mess.

As for the posted patch, AFAICS it's FUBAR in handling of .. in such
directories. Moreover, how are you going to keep that shadow tree in sync
with the main one if somebody starts doing renames in the latter? Or
mount --move, or...
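One way to see why multiple links to a directory are awkward for rename(),
even before any race is involved, is a toy in-memory model. This is not
kernel code and not the race mentioned above; it is a simpler deterministic
illustration under stated assumptions: each directory stores a single ".."
pointer, and rename() rejects a move only if walking ".." from the
destination reaches the source. All names (Dir, mkdir, link, rename,
has_cycle) are hypothetical.

```python
class Dir:
    """A directory node; 'parent' models the single '..' entry."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}


def mkdir(parent, name):
    child = Dir(name, parent)
    parent.children[name] = child
    return child


def link(existing, new_parent, name):
    """Hard-link a directory: a second name appears, '..' stays unchanged."""
    new_parent.children[name] = existing


def is_ancestor(candidate, node):
    """Walk '..' upwards from node and report whether candidate is met."""
    while node is not None:
        if node is candidate:
            return True
        node = node.parent
    return False


def rename(src_parent, name, dst_parent, new_name):
    node = src_parent.children[name]
    if is_ancestor(node, dst_parent):
        raise OSError("would move a directory into itself")
    del src_parent.children[name]
    dst_parent.children[new_name] = node
    node.parent = dst_parent


def has_cycle(node, on_path=None):
    """Depth-first walk; True if a directory is reachable from itself."""
    on_path = set() if on_path is None else on_path
    if id(node) in on_path:
        return True
    on_path.add(id(node))
    found = any(has_cycle(c, on_path) for c in node.children.values())
    on_path.discard(id(node))
    return found


def demo():
    root = Dir("/")
    a = mkdir(root, "a")
    x = mkdir(root, "x")
    b = mkdir(x, "b")           # b's '..' points at x
    link(b, a, "b")             # /a/b is now the same directory as /x/b
    rename(root, "a", b, "a")   # check passes: b -> x -> / never meets a
    return has_cycle(root)      # a contains b and b contains a
```

With only one link per directory the ".." walk is a complete ancestor check;
with two links it inspects one path and misses the other, so the tree stops
being a tree.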
Re: [RFC PATCH 0/5] Shadow directories
Al Viro wrote:
> On Fri, Oct 19, 2007 at 06:07:45AM +0930, David Newall wrote:
> > > considerations of this whole scheme. Linux, like most Unix systems,
> > > has never allowed hard links to directories for a number of reasons;
> >
> > The claim is wrong. UNIX systems have traditionally allowed the
> > superuser to create hard links to directories. See link(2) for 2.10BSD
> > http://www.freebsd.org/cgi/man.cgi?query=link&sektion=2&manpath=2.10+BSD.
> > Having got that wrong throws doubt on the argument; perhaps a path can
> > simultaneously be a file and a directory.
>
> Learn to read. Linux has never allowed that. Most of the Unix systems
> do not allow that.

I did read the claim and it is ambiguous, in that it can reasonably be
read to mean that most UNIX systems never allowed such links, which is
wrong. All UNIX systems allowed it until relatively recently.
[PATCH] fs menu: small reorg.
From: Randy Dunlap [EMAIL PROTECTED]

- move minixfs and ROMfs to the Miscellaneous filesystems menu
- move DNOTIFY config symbol so that it is adjacent to INOTIFY instead of
  being split by the QUOTA config options
- add some 'endif' annotations
- remove some whitespace (extra blank lines)

Signed-off-by: Randy Dunlap [EMAIL PROTECTED]
---
 fs/Kconfig | 93 ++---
 1 file changed, 46 insertions(+), 47 deletions(-)

--- linux-2.6.23-git13.orig/fs/Kconfig
+++ linux-2.6.23-git13/fs/Kconfig
@@ -458,40 +458,18 @@ config OCFS2_DEBUG_MASKLOG
 	  This option will enlarge your kernel, but it allows debugging of
 	  ocfs2 filesystem issues.
 
-config MINIX_FS
-	tristate "Minix fs support"
-	help
-	  Minix is a simple operating system used in many classes about OS's.
-	  The minix file system (method to organize files on a hard disk
-	  partition or a floppy disk) was the original file system for Linux,
-	  but has been superseded by the second extended file system ext2fs.
-	  You don't want to use the minix file system on your hard disk
-	  because of certain built-in restrictions, but it is sometimes found
-	  on older Linux floppy disks. This option will enlarge your kernel
-	  by about 28 KB. If unsure, say N.
-
-	  To compile this file system support as a module, choose M here: the
-	  module will be called minix. Note that the file system of your root
-	  partition (the one containing the directory /) cannot be compiled as
-	  a module.
-
-config ROMFS_FS
-	tristate "ROM file system support"
-	---help---
-	  This is a very small read-only file system mainly intended for
-	  initial ram disks of installation disks, but it could be used for
-	  other read-only media as well. Read
-	  <file:Documentation/filesystems/romfs.txt> for details.
-
-	  To compile this file system support as a module, choose M here: the
-	  module will be called romfs. Note that the file system of your
-	  root partition (the one containing the directory /) cannot be a
-	  module.
+endif # BLOCK
 
-	  If you don't know whether you need it, then you don't need it:
-	  answer N.
+config DNOTIFY
+	bool "Dnotify support"
+	default y
+	help
+	  Dnotify is a directory-based per-fd file change notification system
+	  that uses signals to communicate events to user-space. There exist
+	  superior alternatives, but some applications may still rely on
+	  dnotify.
 
-endif
+	  If unsure, say Y.
 
 config INOTIFY
 	bool "Inotify file change notification support"
@@ -572,17 +550,6 @@ config QUOTACTL
 	depends on XFS_QUOTA || QUOTA
 	default y
 
-config DNOTIFY
-	bool "Dnotify support"
-	default y
-	help
-	  Dnotify is a directory-based per-fd file change notification system
-	  that uses signals to communicate events to user-space. There exist
-	  superior alternatives, but some applications may still rely on
-	  dnotify.
-
-	  If unsure, say Y.
-
 config AUTOFS_FS
 	tristate "Kernel automounter support"
 	help
@@ -708,7 +675,7 @@ config UDF_NLS
 	depends on (UDF_FS=m && NLS) || (UDF_FS=y && NLS=y)
 
 endmenu
-endif
+endif # BLOCK
 
 if BLOCK
 menu "DOS/FAT/NT Filesystems"
@@ -891,7 +858,7 @@ config NTFS_RW
 	  It is perfectly safe to say N here.
 
 endmenu
-endif
+endif # BLOCK
 
 menu "Pseudo filesystems"
 
@@ -1412,6 +1379,24 @@ config VXFS_FS
 	  To compile this as a module, choose M here: the module will be
 	  called freevxfs. If unsure, say N.
 
+config MINIX_FS
+	tristate "Minix file system support"
+	depends on BLOCK
+	help
+	  Minix is a simple operating system used in many classes about OS's.
+	  The minix file system (method to organize files on a hard disk
+	  partition or a floppy disk) was the original file system for Linux,
+	  but has been superseded by the second extended file system ext2fs.
+	  You don't want to use the minix file system on your hard disk
+	  because of certain built-in restrictions, but it is sometimes found
+	  on older Linux floppy disks. This option will enlarge your kernel
+	  by about 28 KB. If unsure, say N.
+
+	  To compile this file system support as a module, choose M here: the
+	  module will be called minix. Note that the file system of your root
+	  partition (the one containing the directory /) cannot be compiled as
+	  a module.
+
 config HPFS_FS
 	tristate "OS/2 HPFS file system support"
@@ -1429,7 +1414,6 @@ config HPFS_FS
 	  module will be called hpfs. If unsure, say N.
 
-
 config QNX4FS_FS
 	tristate "QNX4 file system support (read only)"
 	depends on BLOCK
@@ -1456,6 +1440,22 @@ config QNX4FS_RW
 	  It's currently
Does "32.1% non-contiguous" mean severely fragmented?
Hello.

I ran e2fsck and it reported as follows.

[EMAIL PROTECTED] ~]# e2fsck -f /dev/hda1
e2fsck 1.39 (29-May-2006)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/data/VMware: 349/19546112 files (32.1% non-contiguous), 31019203/39072080 blocks

Does non-contiguous mean fragmented? If so, where is ext3defrag?

Regards.
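To put the summary line in perspective, the numbers can be pulled apart with
a few lines of Python. This is a rough sketch, assuming (as far as I
understand e2fsck's summary) that the percentage is taken relative to the
inodes in use, not to the blocks; the function name parse_summary and the
dictionary keys are made up for illustration.

```python
import re

# Summary line copied from the e2fsck run in this message.
SUMMARY = ("/data/VMware: 349/19546112 files (32.1% non-contiguous), "
           "31019203/39072080 blocks")


def parse_summary(line):
    """Extract file and block counts from an e2fsck summary line."""
    m = re.search(
        r"(\d+)/(\d+) files \(([\d.]+)% non-contiguous\), (\d+)/(\d+) blocks",
        line,
    )
    if m is None:
        raise ValueError("not an e2fsck summary line")
    files_used, files_total, pct, blocks_used, blocks_total = m.groups()
    return {
        "files_used": int(files_used),
        "files_total": int(files_total),
        # Assumption: the percentage applies to the inodes in use.
        "fragmented_files": round(int(files_used) * float(pct) / 100),
        "block_usage_pct": 100 * int(blocks_used) / int(blocks_total),
    }
```

Under that assumption, 32.1% of the 349 files in use is about 112 files with
at least one discontinuity. With so few files, a handful of large
non-contiguous ones can produce a high percentage, and the figure says
nothing about how many pieces each file is split into.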