[RFC PATCH 5/5] Shadow directories: documentation

2007-10-18 Thread Jaroslav Sykora
Documentation of the shadow directories.

Signed-off-by: Jaroslav Sykora [EMAIL PROTECTED]

 Documentation/filesystems/shadow-directories.txt |  177 +
 1 file changed, 177 insertions(+)

--- /dev/null   2007-10-18 09:34:42.624413454 +0200
+++ new/Documentation/filesystems/shadow-directories.txt	2007-10-18 17:03:06.0 +0200
@@ -0,0 +1,177 @@
+Shadow directories
+==================
+
+The Goal
+--------
+
+Let's say we have an archive file hello.zip with a hello world program source
+code. We want to do this:
+   cat hello.zip^/hello.c
+
+The '^' is an escape character and it tells the computer to treat the file
+as a directory.
+[Note: We can't do cat hello.zip/hello.c because of http://lwn.net/Articles/100148/ ]
+
+One way to implement the scenario above is to create a FUSE VFS server and chroot
+everything into it. This will work, but poorly. The performance will be low
+and many things, like setuid binaries, fundamentally won't work (unless the
+server has root privileges).
+
+
+The Principle
+-------------
+
+For every process we define two VFS trees:
+(1) the standard system-wide tree, managed by mount/umount, implemented by native
+   filesystems like ext3, reiserfs, etc.;
+(2) a per-process shadow tree, usually implemented by FUSE.
+
+The main change is within the VFS lookup code: a file name is looked up in the
+standard tree and if it's found we're done. If not, the name is transparently
+looked up in the shadow tree.
+
+[Picture: A standard and a shadow tree. The shadow tree will be in fact mounted
+ on some point in the standard tree, e.g. /home/jara/.vfs/mnt. ]
+
+        Standard                       Shadow
+           /                              /
+     ,-----|-----,               ,--------|--------,
+    bin   home   usr            bin      home      usr
+           |                              |
+          jara                           jara
+       ,---|---,               ,---------|---------,
+      tmp  hello.zip          tmp    hello.zip  hello.zip^
+                                                    |
+                                                ,-------,
+                                            hello.c  Makefile
+
+
+Generally speaking a shadow tree is a superset of a standard tree -- everything
+we can find in the standard tree can be found in the shadow tree.
+But the standard tree is faster (it's a native FS), so we want to take most
+of files from it and only the rest from the other tree (see the directory
+hello.zip^ in the picture above which is not in the std. tree).
+
+In a task the standard tree is primarily defined by its root directory 
+(fs_struct.root). Secondarily it's represented by current working directory 
+and by opened directory handles. To map all these directories to corresponding 
+shadow directories we add shadow root, shadow current directory and shadow 
+directories for all the opened directories (in the struct file).
+
+The user needs to set only the shadow root directory for his/her login shell. The
+settings will be inherited by all child processes. Although we provide a system
+call to set up the shadow current directory (SHDW_FD_PWD, below) and shadow
+directories of opened directories (@@fd=0 below), this information can be
+automatically deduced from the standard directories.
+
+Example 1: See the picture above:
+A process has root=/ and pwd=/home/jara. The user's FUSE VFS server is mounted
+on /home/jara/.vfs/mnt. We set up the shadow root directory of the process with
+a system call:
+   setshdwpath(pid, SHDW_FD_ROOT, "/home/jara/.vfs/mnt");
+The kernel knows that pwd=/home/jara, so it can deduce that the shadow pwd will
+be /home/jara/.vfs/mnt/home/jara (absolute path).
+
+
+The Escape Character Mode
+-------------------------
+
+As has been said above, a file name lookup is now a two-stage process: first
+we try to look up the name in the standard tree and if we fail we try in the
+shadow tree. The problem is that there are hundreds of failed lookups on a
+normal session start -- a few dozen for every starting process. All these
+bogus lookups will make it to the shadow root and will be processed by the user
+space VFS server implemented in FUSE. The lookups will be rejected and everything
+works as usual, but it's slow.
+
+To speed things up and to be practical we define an _escape character_. It's
+simply any character which can be used in a file name but which isn't used
+very often -- like '#' or '^'. We choose the '^' in this document.
+
+The escape character is loaded by the system call described below. All the
+lookups going to the shadow tree are filtered against the escape character.
+The VFS look-up procedure is thus:
+   1. a component of the path (a name) is looked up in the standard tree.
+  If it's found, we're done.
+   2. if the escape character mode is enabled the name is checked if it
+  contains the escape character. If not 

[RFC PATCH 0/5] Shadow directories

2007-10-18 Thread Jaroslav Sykora
Hello,

Let's say we have an archive file hello.zip with a hello world program source
code. We want to do this:
cat hello.zip^/hello.c
gcc hello.zip^/hello.c -o hello
etc..

The '^' is an escape character and it tells the computer to treat the file as a
directory.
[Note: We can't do cat hello.zip/hello.c because of http://lwn.net/Articles/100148/ ]
The kernel patch implements only a redirection of the request to another directory
(shadow directory) where a FUSE server must be mounted. The decompression of
archives is entirely handled in user space. More info can be found in the
documentation patch in the series.
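
For illustration only, the redirection can be pictured as prefixing an
absolute path with the mount point of the FUSE server. The sketch below is a
userspace toy, not code from this series: the helper name shadow_path() is
hypothetical, and the mount point /home/jara/.vfs/mnt is borrowed from
Example 1 of the documentation patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch series): build the shadow
 * counterpart of an absolute path by prefixing it with the mount point
 * of the user's FUSE server.
 * Returns 0 on success, -1 if the path is not absolute or does not fit.
 */
int shadow_path(const char *shadow_root, const char *abs_path,
                char *out, size_t outlen)
{
    int n;

    assert(shadow_root && abs_path && out);
    if (abs_path[0] != '/')
        return -1;

    n = snprintf(out, outlen, "%s%s", shadow_root, abs_path);
    if (n < 0 || (size_t)n >= outlen)
        return -1;      /* output was truncated */
    return 0;
}
```

Given the shadow root /home/jara/.vfs/mnt and pwd /home/jara, the helper
yields /home/jara/.vfs/mnt/home/jara, which matches the shadow pwd stated in
Example 1 of the documentation patch.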

The shadow directories are used in the RheaVFS project [ http://rheavfs.sourceforge.net/ ],
and they can also be used with the original AVFS [ http://www.inf.bme.hu/~mszeredi/avfs/ ].

The patches are against vanilla 2.6.23.
This is my first bigger contribution to the kernel so please be gentle ;-)

Jara

-- 
Elves and Dragons! I says to him.  Cabbages and potatoes are better
for you and me.  -- J. R. R. Tolkien
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 1/5] Shadow directories: headers

2007-10-18 Thread Jaroslav Sykora
Header file changes for shadow directories.
Adds pointers to shadows dirs to the struct file and struct fs_struct.
Defines internal lookup flags and syscall flags.

Signed-off-by: Jaroslav Sykora [EMAIL PROTECTED]

 include/linux/file.h  |2 ++
 include/linux/fs.h|   18 ++
 include/linux/fs_struct.h |   25 +
 include/linux/namei.h |   16 
 4 files changed, 61 insertions(+)

--- orig/include/linux/fs.h 2007-10-07 19:00:24.0 +0200
+++ new/include/linux/fs.h  2007-10-07 13:39:08.0 +0200
@@ -266,6 +266,14 @@ extern int dir_notify_enable;
 #define SYNC_FILE_RANGE_WRITE  2
 #define SYNC_FILE_RANGE_WAIT_AFTER 4
 
+/* sys_setshdwinfo(), sys_getshdwinfo(): */
+#define FSI_SHDW_ENABLE		1	/* enable shadow directories */
+#define FSI_SHDW_ESC_EN		2	/* enable use of escape character */
+#define FSI_SHDW_ESC_CHAR	3	/* specify escape character */
+/* sys_setshdwpath */
+#define SHDW_FD_ROOT		-1	/* pseudo FD for root shadow dir */
+#define SHDW_FD_PWD		-2	/* pseudo FD for pwd shadow dir */
+
 #ifdef __KERNEL__
 
 #include <linux/linkage.h>
@@ -752,6 +760,16 @@ struct file {
spinlock_t  f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
 	struct address_space	*f_mapping;
+
+	/* the following fields are protected by f_owner.lock */
+	/* | f_shdw   | f_shdwmnt   | result
+	   +----------+-------------+---------
+	   | NULL     | NULL        | delayed
+	   | NULL     | !NULL       | invalid
+	   | !NULL    | NULL        | BUG
+	   | !NULL    | !NULL       | valid */
+	struct dentry		*f_shdw;
+	struct vfsmount		*f_shdwmnt;
 };
 extern spinlock_t files_lock;
 #define file_list_lock() spin_lock(&files_lock);
--- orig/include/linux/fs_struct.h  2007-07-09 01:32:17.0 +0200
+++ new/include/linux/fs_struct.h   2007-10-07 13:39:08.0 +0200
@@ -10,8 +10,31 @@ struct fs_struct {
int umask;
struct dentry * root, * pwd, * altroot;
struct vfsmount * rootmnt, * pwdmnt, * altrootmnt;
+
+   int flags;
+   /* shadow dirs: root and pwd */
+	/* | shdwroot | shdwrootmnt | result
+	   +----------+-------------+---------
+	   | NULL     | NULL        | BUG_ON(flags&SHDW_ENABLED)
+	   | !NULL    | !NULL       | ok
+	   +==========+=============+=========
+	   | shdwpwd  | shdwpwdmnt  | result
+	   +----------+-------------+---------
+	   | NULL     | NULL        | delayed
+	   | NULL     | !NULL       | invalid
+	   | !NULL    | NULL        | BUG
+	   | !NULL    | !NULL       | valid */
+	struct dentry *shdwroot, *shdwpwd;
+	struct vfsmount *shdwrootmnt, *shdwpwdmnt;
+	/* shadow dirs: escape character */
+	unsigned char shdw_escch;
 };
 
+/* bitflags for fs_struct.flags */
+#define SHDW_ENABLED   1   /* are shadow dirs enabled? */
+#define SHDW_USE_ESC   2   /* use escape char in shadow dirs? */
+
+
 #define INIT_FS {  \
.count  = ATOMIC_INIT(1),   \
.lock   = RW_LOCK_UNLOCKED, \
@@ -24,6 +47,8 @@ extern void exit_fs(struct task_struct *
 extern void set_fs_altroot(void);
 extern void set_fs_root(struct fs_struct *, struct vfsmount *, struct dentry *);
 extern void set_fs_pwd(struct fs_struct *, struct vfsmount *, struct dentry *);
+extern void set_fs_shdwpwd(struct fs_struct *fs,
+  struct vfsmount *mnt, struct dentry *dentry);
 extern struct fs_struct *copy_fs_struct(struct fs_struct *);
 extern void put_fs_struct(struct fs_struct *);
 
--- orig/include/linux/namei.h  2007-10-07 19:00:25.0 +0200
+++ new/include/linux/namei.h   2007-10-07 20:03:11.0 +0200
@@ -22,6 +22,7 @@ struct nameidata {
int last_type;
 	unsigned	depth;
char *saved_names[MAX_NESTED_LINKS + 1];
+   unsigned char   find_char;
 
/* Intent data */
union {
@@ -54,6 +55,16 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LA
 #define LOOKUP_PARENT  16
 #define LOOKUP_NOALT   32
 #define LOOKUP_REVAL   64
+
+/* don't fallback to lookup in shadow directory */
+#define LOOKUP_NOSHDW  128
+/* try to find nameidata.find_char in pathname,
+ * set LOOKUP_CHARFOUND in nameidata.flags if found */
+#define LOOKUP_FINDCHAR		(1<<16)
+#define LOOKUP_CHARFOUND	(1<<17)
+/* (dentry,mnt) was found in shadow dir */
+#define LOOKUP_INSHDW		(1<<18)
+
 /*
  * Intent data
  */
@@ -68,6 +79,8 @@ extern int FASTCALL(__user_walk_fd(int d
__user_walk_fd(AT_FDCWD, name, LOOKUP_FOLLOW, nd)
 #define user_path_walk_link(name,nd) \
__user_walk_fd(AT_FDCWD, name, 0, nd)
+extern int FASTCALL(path_lookup_shdw(int dfd, const char *name,

[RFC PATCH 3/5] Shadow directories: chdir, fchdir

2007-10-18 Thread Jaroslav Sykora
sys_chdir and sys_fchdir changes.

Signed-off-by: Jaroslav Sykora [EMAIL PROTECTED]

 fs/open.c |   79 
 1 file changed, 73 insertions(+), 6 deletions(-)

--- orig/fs/open.c  2007-10-07 19:00:19.0 +0200
+++ new/fs/open.c   2007-10-16 21:04:56.0 +0200
@@ -476,13 +476,51 @@ asmlinkage long sys_access(const char __
return sys_faccessat(AT_FDCWD, filename, mode);
 }
 
+static inline int read_fs_flags(void)
+{
+   int res;
+	read_lock(&current->fs->lock);
+	res = current->fs->flags;
+	read_unlock(&current->fs->lock);
+   return res;
+}
+
+void set_fs_shdwpwd(struct fs_struct *fs,
+   struct vfsmount *mnt, struct dentry *dentry)
+{
+   struct dentry *old_dentry;
+   struct vfsmount *old_mnt;
+
+	BUG_ON(dentry != NULL && mnt == NULL);
+	write_lock(&fs->lock);
+	/* set shadow pwd */
+	old_dentry = fs->shdwpwd;
+	old_mnt = fs->shdwpwdmnt;
+	fs->shdwpwd = dget(dentry);
+	if (dentry)
+		fs->shdwpwdmnt = mntget(mnt);
+	else
+		/* PTR_ERR flag */
+		fs->shdwpwdmnt = mnt;
+	write_unlock(&fs->lock);
+
+   if (old_dentry) {
+   mntput(old_mnt);
+   dput(old_dentry);
+   }
+}
+
 asmlinkage long sys_chdir(const char __user * filename)
 {
struct nameidata nd;
-   int error;
+   char *tmp = getname(filename);
+	int error = PTR_ERR(tmp);
+
+   if (IS_ERR(tmp))
+   goto out_badname;
 
-	error = __user_walk(filename,
-			    LOOKUP_FOLLOW|LOOKUP_DIRECTORY|LOOKUP_CHDIR, &nd);
+	error = path_lookup(tmp, LOOKUP_FOLLOW | LOOKUP_DIRECTORY
+				| LOOKUP_CHDIR, &nd);
if (error)
goto out;
 
@@ -490,11 +528,23 @@ asmlinkage long sys_chdir(const char __u
if (error)
goto dput_and_out;
 
-	set_fs_pwd(current->fs, nd.mnt, nd.dentry);
+	if (!(read_fs_flags() & SHDW_ENABLED))
+		goto set_std;
 
+	if (!(nd.flags & LOOKUP_INSHDW))
+		set_fs_shdwpwd(current->fs, NULL, NULL);
+	else
+		/* shadow == std */
+		set_fs_shdwpwd(current->fs, nd.mnt, nd.dentry);
+
+set_std:
+	/* set std cwd */
+	set_fs_pwd(current->fs, nd.mnt, nd.dentry);
 dput_and_out:
 	path_release(&nd);
 out:
+   putname(tmp);
+out_badname:
return error;
 }
 
@@ -520,8 +570,25 @@ asmlinkage long sys_fchdir(unsigned int 
goto out_putf;
 
error = file_permission(file, MAY_EXEC);
-	if (!error)
-		set_fs_pwd(current->fs, mnt, dentry);
+	if (error)
+		goto out_putf;
+
+	set_fs_pwd(current->fs, mnt, dentry);
+
+	if (!(read_fs_flags() & SHDW_ENABLED))
+		/* shadow dirs aren't enabled */
+		goto out_putf;
+
+	if (get_file_shdwdir(file, &dentry, &mnt))
+		/* some error occurred */
+		set_fs_shdwpwd(current->fs, NULL, NULL);
+	else {
+		/* ok */
+		set_fs_shdwpwd(current->fs, mnt, dentry);
+		mntput(mnt);
+		dput(dentry);
+   }
+
 out_putf:
fput(file);
 out:


[RFC PATCH 2/5] Shadow directories: core

2007-10-18 Thread Jaroslav Sykora
Implements two stage lookup with escape character filtering
and system calls for i386.
Changes lookup path, namely do_path_lookup. This function is split
into path_lookup_norm(), which performs standard name lookup,
and path_lookup_shdw(), which performs name lookup in an associated shadow
directory.
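
As a rough userspace illustration of the two-stage lookup with
escape-character filtering described above (not the kernel code in this
patch), consider the sketch below. All names in it are hypothetical, the
standard tree is replaced by a toy predicate, and the behaviour for names
without the escape character is an assumption based on the statement in the
documentation patch that lookups going to the shadow tree are filtered
against the escape character.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Where a name ends up being resolved in this toy model. */
enum lookup_stage { STAGE_STANDARD, STAGE_SHADOW, STAGE_FAIL };

/* Hypothetical predicate: does `name` exist in the standard tree? */
typedef bool (*exists_fn)(const char *name);

/* Toy stand-in for the standard tree: it holds only "hello.zip". */
bool toy_standard_tree(const char *name)
{
    return strcmp(name, "hello.zip") == 0;
}

enum lookup_stage two_stage_lookup(const char *name, bool esc_enabled,
                                   char esc, exists_fn in_standard)
{
    assert(name && in_standard);

    /* Stage 1: the standard tree always wins. */
    if (in_standard(name))
        return STAGE_STANDARD;

    /* Filter: with escape mode on, names without the escape
     * character are never forwarded to the shadow tree. */
    if (esc_enabled && strchr(name, esc) == NULL)
        return STAGE_FAIL;

    /* Stage 2: hand the name over to the shadow tree. */
    return STAGE_SHADOW;
}
```

In this model a failed lookup of an ordinary name such as libfoo.so never
reaches the FUSE server while escape mode is enabled, whereas hello.zip^ is
forwarded to the shadow tree.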

Signed-off-by: Jaroslav Sykora [EMAIL PROTECTED]

 arch/i386/kernel/syscall_table.S |6 
 fs/exec.c|4 
 fs/file_table.c  |   19 
 fs/namei.c   |  610 -
 fs/namespace.c   |   13 
 include/linux/syscalls.h |6 
 kernel/exit.c|8 
 kernel/fork.c|   20 
 8 files changed, 672 insertions(+), 14 deletions(-)

--- orig/fs/namei.c 2007-10-07 19:00:19.0 +0200
+++ new/fs/namei.c  2007-10-18 15:35:54.0 +0200
@@ -31,6 +31,7 @@
 #include <linux/file.h>
 #include <linux/fcntl.h>
 #include <linux/namei.h>
+#include <linux/ptrace.h>
 #include <asm/namei.h>
 #include <asm/uaccess.h>
 
@@ -515,6 +516,25 @@ static struct dentry * real_lookup(struc
return result;
 }
 
+static inline int use_shadow(struct fs_struct *fs, struct nameidata *nd)
+{
+	/* assert: fs->lock held */
+	return (fs->flags & SHDW_ENABLED) && (nd->flags & LOOKUP_INSHDW);
+}
+
+static inline struct dentry *fs_root(struct fs_struct *fs, struct nameidata *nd)
+{
+	/* assert: current->fs->lock held */
+	return (use_shadow(fs, nd)) ? fs->shdwroot : fs->root;
+}
+
+static inline struct vfsmount *fs_rootmnt(struct fs_struct *fs,
+		struct nameidata *nd)
+{
+	/* assert: current->fs->lock held */
+	return (use_shadow(fs, nd)) ? fs->shdwrootmnt : fs->rootmnt;
+}
+
 static int __emul_lookup_dentry(const char *, struct nameidata *);
 
 /* SMP-safe */
@@ -532,8 +552,8 @@ walk_init_root(const char *name, struct 
return 0;
 		read_lock(&fs->lock);
 	}
-	nd->mnt = mntget(fs->rootmnt);
-	nd->dentry = dget(fs->root);
+	nd->mnt = mntget(fs_rootmnt(fs, nd));
+	nd->dentry = dget(fs_root(fs, nd));
 	read_unlock(&fs->lock);
return 1;
 }
@@ -730,9 +750,9 @@ static __always_inline void follow_dotdo
struct vfsmount *parent;
struct dentry *old = nd-dentry;
 
-                read_lock(&fs->lock);
-		if (nd->dentry == fs->root &&
-		    nd->mnt == fs->rootmnt) {
+		read_lock(&fs->lock);
+		if (nd->dentry == fs_root(fs, nd) &&
+		    nd->mnt == fs_rootmnt(fs, nd)) {
                         read_unlock(&fs->lock);
break;
}
@@ -842,6 +862,11 @@ static fastcall int __link_path_walk(con
 
hash = init_name_hash();
do {
+		if (unlikely((nd->flags & LOOKUP_FINDCHAR) &&
+				(c == nd->find_char))) {
+			/* shadow control char found */
+			nd->flags |= LOOKUP_CHARFOUND;
+		}
name++;
hash = partial_name_hash(c, hash);
c = *(const unsigned char *)name;
@@ -1100,8 +1125,8 @@ set_it:
}
 }
 
-/* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
-static int fastcall do_path_lookup(int dfd, const char *name,
+/* Lookup @name, starting at @dfd, use normal (non-shadow) root and pwd */
+static int fastcall path_lookup_norm(int dfd, const char *name,
unsigned int flags, struct nameidata *nd)
 {
int retval = 0;
@@ -1168,6 +1193,313 @@ fput_fail:
goto out_fail;
 }
 
+/*
+ * Set @filp->f_shdw, @filp->f_shdwmnt to @mnt,@dentry.
+ * Takes @filp->f_owner.lock.
+ * Note: if @dentry == NULL then @mnt may be ERR_PTR(-EINVAL).
+ */
+static void set_fileshdw(struct file *filp, struct vfsmount *mnt,
+   struct dentry *dentry)
+{
+   struct dentry *old_dentry;
+   struct vfsmount *old_mnt;
+
+	BUG_ON(dentry != NULL && mnt == NULL);
+	write_lock(&filp->f_owner.lock);
+	old_dentry = filp->f_shdw;
+	old_mnt = filp->f_shdwmnt;
+	filp->f_shdw = dget(dentry);
+	if (dentry)
+		filp->f_shdwmnt = mntget(mnt);
+	else
+		/* mnt is ERR_PTR */
+		filp->f_shdwmnt = mnt;
+	write_unlock(&filp->f_owner.lock);
+
+   if (old_dentry) {
+   dput(old_dentry);
+   mntput(old_mnt);
+   }
+}
+
+/*
+ * Determine @filp->f_shdw,f_shdwmnt from @filp->dentry,mnt
+ * and current->fs->shdwroot.
+ * Also check whether it's a directory and we have permission.
+ * Called only from get_file_shdwdir().
+ */
+static int validate_shdwfile(struct file *filp)
+{
+   struct nameidata nd;
+   char *buf, *name;
+   int res = -ENOMEM;
+
+   buf = (char *)__get_free_page(GFP_KERNEL);
+   if (!buf)
+   

Re: [RFC PATCH 0/5] Shadow directories

2007-10-18 Thread Jan Engelhardt

On Oct 18 2007 17:21, Jaroslav Sykora wrote:
> Hello,
>
> Let's say we have an archive file hello.zip with a hello world program source
> code. We want to do this:
>    cat hello.zip^/hello.c
>    gcc hello.zip^/hello.c -o hello
>    etc..
>
> The '^' is an escape character and it tells the computer to treat the file as
> a directory.

Too bad, since ^ is a valid character in a *file*name. Everything is, with
the exception of '\0' and '/'. At the end of the day, there are no control
characters you could use.

But what you could do is: write a FUSE fs that mirrors the lower content
(lofs/fuseloop/however it was named) and expands .zip files as
directories are readdir'ed or the zip files stat'ed. That saves us
from cluttering up the Linux VFS with such stuff.
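
To make the point about '^' concrete: the only bytes that can never appear
in a single path component are '\0' and '/'. The check below is a minimal
sketch with a hypothetical function name; it deliberately ignores length
limits such as NAME_MAX.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical check: true if `name` could be a single path component,
 * i.e. it is non-empty and contains no '/'. The terminating '\0' ends
 * the string, so it cannot occur inside the name either.
 */
bool is_valid_component(const char *name)
{
    assert(name != NULL);
    if (name[0] == '\0')
        return false;
    for (size_t i = 0; name[i] != '\0'; i++)
        if (name[i] == '/')
            return false;
    return true;
}
```

By this check "hello.zip^" is an ordinary file name, so a real file of that
name may already exist next to hello.zip.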


Re: [RFC PATCH 0/5] Shadow directories

2007-10-18 Thread David Newall

Jaroslav Sykora wrote:

> Let's say we have an archive file hello.zip with a hello world program source
> code. We want to do this:
> cat hello.zip^/hello.c
> gcc hello.zip^/hello.c -o hello
> etc..
  


Wouldn't you do this as a user space filesystem?


Re: [RFC PATCH 0/5] Shadow directories

2007-10-18 Thread David Newall

David Newall wrote:

> Jaroslav Sykora wrote:
>> Let's say we have an archive file hello.zip with a hello world program source
>> code. We want to do this:
>> cat hello.zip^/hello.c
>> gcc hello.zip^/hello.c -o hello
>> etc..
>
> Wouldn't you do this as a user space filesystem?

Which is what you were saying.

*SMACK* I so stupid.


Re: [RFC PATCH 0/5] Shadow directories

2007-10-18 Thread David Newall

David Newall wrote:

> David Newall wrote:
>> Jaroslav Sykora wrote:
>>> Let's say we have an archive file hello.zip with a hello world program source
>>> code. We want to do this:
>>> cat hello.zip^/hello.c
>>> gcc hello.zip^/hello.c -o hello
>>> etc..
>>
>> Wouldn't you do this as a user space filesystem?
>
> Which is what you were saying.
>
> *SMACK* I so stupid.


On third thoughts, what's the reason for this?


[0/3] Distributed storage. Mirror algo extension for automatic recovery.

2007-10-18 Thread Evgeniy Polyakov
Hi.

I'm pleased to announce the sixth release of the distributed storage
subsystem, which allows forming a storage on top of remote and local
nodes, which in turn can be exported to another storage as a node to
form tree-like storages.

This release includes a mirroring algorithm extension, which allows
storing the 'age' of the given node on the underlying media.

In this case, if a failed node gets new media which does not contain the
correct 'age' (a unique id assigned to the whole storage at
initialization time), the whole node will be marked as dirty and
eventually resynced.

This allows completely transparent failure recovery: a failed
node can just be turned off, its hardware fixed, and then turned on. The DST
core will detect the connection reset, automatically reconnect when the node
is ready, and resync if needed without any special administrator steps.

This patchset has been split into 4 parts:
0 - this introduction
1 - core files
2 - network state machine
3 - documentation and algorithms

I hope they will all find their way to the mailing lists.

Further TODO list includes:
* new redundancy algorithm (complex, low priority)
* some thoughts about distributed filesystem tightly connected to DST

Thank you.

Homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

-- 
Evgeniy Polyakov


[1/3] Distributed storage. Core files.

2007-10-18 Thread Evgeniy Polyakov
Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dd88e33..fcf042d 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_VIODASD) += viodasd.o
 obj-$(CONFIG_BLK_DEV_SX8)  += sx8.o
 obj-$(CONFIG_BLK_DEV_UB)   += ub.o
 
+obj-$(CONFIG_DST)  += dst/
diff --git a/drivers/block/dst/Kconfig b/drivers/block/dst/Kconfig
new file mode 100644
index 0000000..5bb9de8
--- /dev/null
+++ b/drivers/block/dst/Kconfig
@@ -0,0 +1,20 @@
+config DST
+	tristate "Distributed storage"
+	depends on NET
+	select CONNECTOR
+	---help---
+	This driver allows creating a distributed storage.
+
+config DST_ALG_LINEAR
+	tristate "Linear distribution algorithm"
+	depends on DST
+	---help---
+	This module allows creating a linear mapping of the nodes
+	in the distributed storage.
+
+config DST_ALG_MIRROR
+	tristate "Mirror distribution algorithm"
+	depends on DST
+	---help---
+	This module allows creating a mirror of the nodes in the
+	distributed storage.
diff --git a/drivers/block/dst/Makefile b/drivers/block/dst/Makefile
new file mode 100644
index 0000000..1400e94
--- /dev/null
+++ b/drivers/block/dst/Makefile
@@ -0,0 +1,6 @@
+obj-$(CONFIG_DST) += dst.o
+
+dst-y := dcore.o kst.o
+
+obj-$(CONFIG_DST_ALG_LINEAR) += alg_linear.o
+obj-$(CONFIG_DST_ALG_MIRROR) += alg_mirror.o
diff --git a/drivers/block/dst/dcore.c b/drivers/block/dst/dcore.c
new file mode 100644
index 0000000..fdbfc7b
--- /dev/null
+++ b/drivers/block/dst/dcore.c
@@ -0,0 +1,1533 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/connector.h>
+#include <linux/socket.h>
+#include <linux/dst.h>
+#include <linux/device.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/buffer_head.h>
+
+#include <net/sock.h>
+
+static LIST_HEAD(dst_storage_list);
+static LIST_HEAD(dst_alg_list);
+static DEFINE_MUTEX(dst_storage_lock);
+static DEFINE_MUTEX(dst_alg_lock);
+static int dst_major;
+static struct kst_worker *kst_main_worker;
+static struct cb_id cn_dst_id = { CN_DST_IDX, CN_DST_VAL };
+
+struct kmem_cache *dst_request_cache;
+
+/*
+ * DST sysfs tree. For device called 'storage' which is formed
+ * on top of two nodes this looks like this:
+ *
+ * /sys/devices/storage/
+ * /sys/devices/storage/alg : alg_linear
+ * /sys/devices/storage/n-800/type : R: 192.168.4.80:1025
+ * /sys/devices/storage/n-800/size : 800
+ * /sys/devices/storage/n-800/start : 800
+ * /sys/devices/storage/n-0/type : R: 192.168.4.81:1025
+ * /sys/devices/storage/n-0/size : 800
+ * /sys/devices/storage/n-0/start : 0
+ * /sys/devices/storage/remove_all_nodes
+ * /sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800]
+ * /sys/devices/storage/name : storage
+ */
+
+static int dst_dev_match(struct device *dev, struct device_driver *drv)
+{
+   return 1;
+}
+
+static void dst_dev_release(struct device *dev)
+{
+}
+
+static struct bus_type dst_dev_bus_type = {
+	.name		= "dst",
+	.match		= dst_dev_match,
+};
+
+static struct device dst_dev = {
+	.bus		= &dst_dev_bus_type,
+	.release	= dst_dev_release
+};
+
+static void dst_node_release(struct device *dev)
+{
+}
+
+static struct device dst_node_dev = {
+   .release= dst_node_release
+};
+
+static struct bio_set *dst_bio_set;
+
+static void dst_destructor(struct bio *bio)
+{
+   bio_free(bio, dst_bio_set);
+}
+
+/*
+ * Internal callback for local requests (i.e. for local disk),
+ * which are split between nodes (part with local node destination
+ * ends up with this ->bi_end_io() callback).
+ */
+static int dst_end_io(struct bio *bio, unsigned int size, int err)
+{
+	struct bio *orig_bio = bio->bi_private;
+
+	if (bio->bi_size)
+		return 0;
+
+	dprintk("%s: bio: %p, orig_bio: %p, size: %u, orig_size: %u.\n",
+		__func__, bio, orig_bio, size, orig_bio->bi_size);
+
+   bio_endio(orig_bio, size, 0);
+   bio_put(bio);
+   return 0;
+}
+
+/*
+ * This function sends processing request down to block layer (for local node)
+ * or to network state machine (for remote node).
+ */
+static int dst_node_push(struct 

[2/3] Distributed storage. Network state machine.

2007-10-18 Thread Evgeniy Polyakov
Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/drivers/block/dst/kst.c b/drivers/block/dst/kst.c
new file mode 100644
index 0000000..b0608c9
--- /dev/null
+++ b/drivers/block/dst/kst.c
@@ -0,0 +1,1606 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/socket.h>
+#include <linux/kthread.h>
+#include <linux/net.h>
+#include <linux/in.h>
+#include <linux/poll.h>
+#include <linux/bio.h>
+#include <linux/dst.h>
+
+#include <net/sock.h>
+
+struct kst_poll_helper
+{
+   poll_table  pt;
+   struct kst_state*st;
+};
+
+static LIST_HEAD(kst_worker_list);
+static DEFINE_MUTEX(kst_worker_mutex);
+
+/*
+ * This function creates bound socket for local export node.
+ */
+static int kst_sock_create(struct kst_state *st, struct saddr *addr,
+   int type, int proto, int backlog)
+{
+   int err;
+
+	err = sock_create(addr->sa_family, type, proto, &st->socket);
+	if (err)
+		goto err_out_exit;
+
+	err = st->socket->ops->bind(st->socket, (struct sockaddr *)addr,
+			addr->sa_data_len);
+	if (err)
+		goto err_out_release;
+
+	err = st->socket->ops->listen(st->socket, backlog);
+	if (err)
+		goto err_out_release;
+
+	st->socket->sk->sk_allocation = GFP_NOIO;
+
+	return 0;
+
+err_out_release:
+	sock_release(st->socket);
+err_out_exit:
+	return err;
+}
+
+static void kst_sock_release(struct kst_state *st)
+{
+	if (st->socket) {
+		sock_release(st->socket);
+		st->socket = NULL;
+	}
+}
+
+void kst_wake(struct kst_state *st)
+{
+   if (st) {
+		struct kst_worker *w = st->node->w;
+		unsigned long flags;
+
+		spin_lock_irqsave(&w->ready_lock, flags);
+		if (list_empty(&st->ready_entry))
+			list_add_tail(&st->ready_entry, &w->ready_list);
+		spin_unlock_irqrestore(&w->ready_lock, flags);
+
+		wake_up(&w->wait);
+   }
+}
+EXPORT_SYMBOL_GPL(kst_wake);
+
+/*
+ * Polling machinery.
+ */
+static int kst_state_wake_callback(wait_queue_t *wait, unsigned mode,
+   int sync, void *key)
+{
+   struct kst_state *st = container_of(wait, struct kst_state, wait);
+   kst_wake(st);
+   return 1;
+}
+
+static void kst_queue_func(struct file *file, wait_queue_head_t *whead,
+poll_table *pt)
+{
+	struct kst_state *st = container_of(pt, struct kst_poll_helper, pt)->st;
+
+	st->whead = whead;
+	init_waitqueue_func_entry(&st->wait, kst_state_wake_callback);
+	add_wait_queue(whead, &st->wait);
+}
+
+static void kst_poll_exit(struct kst_state *st)
+{
+	if (st->whead) {
+		remove_wait_queue(st->whead, &st->wait);
+		st->whead = NULL;
+   }
+}
+
+/*
+ * This function removes request from state tree and ordering list.
+ */
+void kst_del_req(struct dst_request *req)
+{
+	struct kst_state *st = req->state;
+
+	rb_erase(&req->request_entry, &st->request_root);
+	RB_CLEAR_NODE(&req->request_entry);
+	list_del_init(&req->request_list_entry);
+}
+EXPORT_SYMBOL_GPL(kst_del_req);
+
+static struct dst_request *kst_req_first(struct kst_state *st)
+{
+   struct dst_request *req = NULL;
+
+	if (!list_empty(&st->request_list))
+		req = list_entry(st->request_list.next, struct dst_request,
+				request_list_entry);
+   return req;
+}
+
+/*
+ * This function dequeues first request from the queue and tree.
+ */
+static struct dst_request *kst_dequeue_req(struct kst_state *st)
+{
+   struct dst_request *req;
+
+	mutex_lock(&st->request_lock);
+	req = kst_req_first(st);
+	if (req)
+		kst_del_req(req);
+	mutex_unlock(&st->request_lock);
+   return req;
+}
+
+static inline int dst_compare_request_id(struct dst_request *old,
+   struct dst_request *new)
+{
+   int cmd = 0;
+
+	if (old->start + to_sector(old->orig_size) <= new->start)
+		cmd = 1;
+	if (old->start >= new->start + to_sector(new->orig_size))
+		cmd = -1;
+
+	dprintk("%s: old: op: %lu, start: %llu, size: %llu, off: %u, "
+		"new: op: %lu, start: %llu, size: %llu, off: %u, cmp: %d.\n",
+		__func__, bio_rw(old->bio), old->start, old->orig_size,
+   

Re: [RFC PATCH 0/5] Shadow directories

2007-10-18 Thread Jan Engelhardt

On Oct 19 2007 05:32, David Newall wrote:

> The claim is wrong.  UNIX systems have traditionally allowed the
> superuser to create hard links to directories.  See link(2) for
> 2.10BSD
> http://www.freebsd.org/cgi/man.cgi?query=link&sektion=2&manpath=2.10+BSD.
> Having got that wrong throws doubt on the argument; perhaps a path
> can simultaneously be a file and a directory.

But hell will break loose if you allow hardlinking directories.

mkdir /tmp/a
ln /tmp/a /tmp/a/b

And you would not be able to rmdir /tmp/a/b because the directory is
not empty (it contains b [full path: /tmp/a/b/b]).
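
The loop can be mimicked with a toy model of directories that only counts
entries and links. This is a sketch under simplifying assumptions, not how
any real filesystem stores directories; struct toy_dir, toy_link() and
toy_rmdir() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>

/* Toy directory: only counts, no names. */
struct toy_dir {
    int entries;    /* names stored inside this directory */
    int links;      /* names elsewhere that refer to this directory */
};

/* Add a name inside `parent` that refers to `target`.
 * `parent` and `target` may be the same directory. */
void toy_link(struct toy_dir *target, struct toy_dir *parent)
{
    assert(target && parent);
    parent->entries++;
    target->links++;
}

/* rmdir semantics: refuse to remove a name whose directory
 * still has entries. */
int toy_rmdir(struct toy_dir *target, struct toy_dir *parent)
{
    assert(target && parent);
    if (target->entries > 0)
        return -ENOTEMPTY;
    parent->entries--;
    target->links--;
    return 0;
}
```

After the equivalent of mkdir /tmp/a and ln /tmp/a /tmp/a/b, the directory
contains an entry that refers to itself, so toy_rmdir() returns -ENOTEMPTY
for both /tmp/a/b and /tmp/a.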


Re: [RFC PATCH 0/5] Shadow directories

2007-10-18 Thread David Newall

Jaroslav Sykora wrote:

If anybody can think of any other solution of the redirector problem, possibly
even non-kernel based one, let me know and I'd be glad :-)


If I understand your problem, you wish to treat an archive file as if it 
was a directory.  Thus, in the ideal situation, you could do the following:


cat hello.zip/hello.c
gcc hello.zip/hello.c -o hello
etc..


Rather than complicate matters with a second tree, use FUSE with an 
explicit directory.  For example, ~/expand could be your shadow, thus to 
compile hello.c from ~/hello.zip:


gcc ~/expand/hello.zip^/hello.c -o hello


I think no kernel change would be required.
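
A FUSE server mounted on ~/expand would have to split such a path into the archive and the member inside it. A minimal C sketch of that split, assuming the caret only counts when it ends a path component; split_caret_path() is a hypothetical name, not a FUSE API:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: split "hello.zip^/hello.c" at the first path
 * component that ends in '^'. On success the archive path (without the
 * caret) is copied into `archive`, *member points at the remainder of
 * `path`, and 0 is returned. Returns -1 if no component ends in '^',
 * the archive part is empty, or `archive` is too small.
 */
int split_caret_path(const char *path, char *archive, size_t archive_len,
		     const char **member)
{
	const char *p = path;

	while ((p = strchr(p, '^')) != NULL) {
		/* Assumption: '^' is an escape only at the end of a component. */
		if (p[1] == '/' || p[1] == '\0') {
			size_t n = (size_t)(p - path);

			if (n == 0 || n >= archive_len)
				return -1;
			memcpy(archive, path, n);
			archive[n] = '\0';
			*member = (p[1] == '/') ? p + 2 : p + 1;
			return 0;
		}
		p++;
	}
	return -1;
}
```

For "hello.zip^/hello.c" this yields the archive "hello.zip" and the member "hello.c"; a plain "hello.zip/hello.c" is rejected, so ordinary lookups are left alone.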

I'm not keen on the caret.  One of the early claims made in 
http://lwn.net/Articles/100148/ is:
    "Another branch, led by Al Viro, worries about the locking
    considerations of this whole scheme. Linux, like most Unix systems,
    has never allowed hard links to directories for a number of reasons;"


The claim is wrong.  UNIX systems have traditionally allowed the 
superuser to create hard links to directories.  See link(2) for 2.10BSD 
http://www.freebsd.org/cgi/man.cgi?query=link&sektion=2&manpath=2.10+BSD.
Having got that wrong throws doubt on the argument; perhaps a path can 
simultaneously be a file and a directory.



Re: [RFC PATCH 0/5] Shadow directories

2007-10-18 Thread Al Viro
On Fri, Oct 19, 2007 at 06:07:45AM +0930, David Newall wrote:
> considerations of this whole scheme. Linux, like most Unix systems,
> has never allowed hard links to directories for a number of reasons;
>
> The claim is wrong.  UNIX systems have traditionally allowed the
> superuser to create hard links to directories.  See link(2) for 2.10BSD
> http://www.freebsd.org/cgi/man.cgi?query=link&sektion=2&manpath=2.10+BSD.
> Having got that wrong throws doubt on the argument; perhaps a path can
> simultaneously be a file and a directory.

Learn to read.  Linux has never allowed that.  Most of the Unix systems
do not allow that.  Original _did_ allow that, but at the cost of very
easily triggered fs corruption (and it didn't have things like rename(2) -
it _did_ have userland implementation, of course, in suid-root mv(1),
but that sucker had been extremely racy and could be easily used to
screw filesystem to hell and back; adding rename(2) to the set of primitives
combined with multiple links to directories leads to very nasty issues on
_any_ system).


Re: [RFC PATCH 0/5] Shadow directories

2007-10-18 Thread Al Viro
On Fri, Oct 19, 2007 at 12:27:16PM +0930, David Newall wrote:

> > Learn to read.  Linux has never allowed that.  Most of the Unix systems
> > do not allow that.
>
> I did read the claim and it is ambiguous, in that it can reasonably be
> read to mean that most UNIX systems never allowed such links, which is
> wrong.  All UNIX systems allowed it until relatively recently.

FVO "relatively recently" exceeding a decade and a half.  In any case,
it's _trivial_ to get fs corruption on any system with such links -
play with rename() races a bit and you'll get it.  And yes, it does
include 4.4BSD and quite a chunk of even later history.

Anyway, you are quite welcome to propose a sane locking scheme capable
of dealing with that mess.

As for the posted patch, AFAICS it's FUBAR in handling of ".." in such
directories.  Moreover, how are you going to keep that shadow tree
in sync with the main one if somebody starts doing renames in the
latter?  Or mount --move, or...


Re: [RFC PATCH 0/5] Shadow directories

2007-10-18 Thread David Newall

Al Viro wrote:

> On Fri, Oct 19, 2007 at 06:07:45AM +0930, David Newall wrote:
>> considerations of this whole scheme. Linux, like most Unix systems,
>> has never allowed hard links to directories for a number of reasons;
>>
>> The claim is wrong.  UNIX systems have traditionally allowed the
>> superuser to create hard links to directories.  See link(2) for 2.10BSD
>> http://www.freebsd.org/cgi/man.cgi?query=link&sektion=2&manpath=2.10+BSD.
>> Having got that wrong throws doubt on the argument; perhaps a path can
>> simultaneously be a file and a directory.
>
> Learn to read.  Linux has never allowed that.  Most of the Unix systems
> do not allow that.


I did read the claim and it is ambiguous, in that it can reasonably be 
read to mean that most UNIX systems never allowed such links, which is 
wrong.  All UNIX systems allowed it until relatively recently.



[PATCH] fs menu: small reorg.

2007-10-18 Thread Randy Dunlap
From: Randy Dunlap [EMAIL PROTECTED]

- move minixfs and ROMfs to the Miscellaneous filesystems menu
- move DNOTIFY config symbol so that it is adjacent to INOTIFY
  instead of being split by the QUOTA config options
- add some 'endif' annotations
- remove some whitespace (extra blank lines)

Signed-off-by: Randy Dunlap [EMAIL PROTECTED]
---
 fs/Kconfig |   93 ++---
 1 file changed, 46 insertions(+), 47 deletions(-)

--- linux-2.6.23-git13.orig/fs/Kconfig
+++ linux-2.6.23-git13/fs/Kconfig
@@ -458,40 +458,18 @@ config OCFS2_DEBUG_MASKLOG
  This option will enlarge your kernel, but it allows debugging of
  ocfs2 filesystem issues.
 
-config MINIX_FS
-   tristate "Minix fs support"
-   help
- Minix is a simple operating system used in many classes about OS's.
- The minix file system (method to organize files on a hard disk
- partition or a floppy disk) was the original file system for Linux,
- but has been superseded by the second extended file system ext2fs.
- You don't want to use the minix file system on your hard disk
- because of certain built-in restrictions, but it is sometimes found
- on older Linux floppy disks.  This option will enlarge your kernel
- by about 28 KB. If unsure, say N.
-
- To compile this file system support as a module, choose M here: the
- module will be called minix.  Note that the file system of your root
- partition (the one containing the directory /) cannot be compiled as
- a module.
-
-config ROMFS_FS
-   tristate "ROM file system support"
-   ---help---
- This is a very small read-only file system mainly intended for
- initial ram disks of installation disks, but it could be used for
- other read-only media as well.  Read
- <file:Documentation/filesystems/romfs.txt> for details.
-
- To compile this file system support as a module, choose M here: the
- module will be called romfs.  Note that the file system of your
- root partition (the one containing the directory /) cannot be a
- module.
+endif # BLOCK
 
- If you don't know whether you need it, then you don't need it:
- answer N.
+config DNOTIFY
+   bool "Dnotify support"
+   default y
+   help
+ Dnotify is a directory-based per-fd file change notification system
+ that uses signals to communicate events to user-space.  There exist
+ superior alternatives, but some applications may still rely on
+ dnotify.
 
-endif
+ If unsure, say Y.
 
 config INOTIFY
bool "Inotify file change notification support"
@@ -572,17 +550,6 @@ config QUOTACTL
depends on XFS_QUOTA || QUOTA
default y
 
-config DNOTIFY
-   bool "Dnotify support"
-   default y
-   help
- Dnotify is a directory-based per-fd file change notification system
- that uses signals to communicate events to user-space.  There exist
- superior alternatives, but some applications may still rely on
- dnotify.
-
- If unsure, say Y.
-
 config AUTOFS_FS
tristate "Kernel automounter support"
help
@@ -708,7 +675,7 @@ config UDF_NLS
depends on (UDF_FS=m && NLS) || (UDF_FS=y && NLS=y)
 
 endmenu
-endif
+endif # BLOCK
 
 if BLOCK
 menu "DOS/FAT/NT Filesystems"
@@ -891,7 +858,7 @@ config NTFS_RW
  It is perfectly safe to say N here.
 
 endmenu
-endif
+endif # BLOCK
 
 menu "Pseudo filesystems"
 
@@ -1412,6 +1379,24 @@ config VXFS_FS
  To compile this as a module, choose M here: the module will be
  called freevxfs.  If unsure, say N.
 
+config MINIX_FS
+   tristate "Minix file system support"
+   depends on BLOCK
+   help
+ Minix is a simple operating system used in many classes about OS's.
+ The minix file system (method to organize files on a hard disk
+ partition or a floppy disk) was the original file system for Linux,
+ but has been superseded by the second extended file system ext2fs.
+ You don't want to use the minix file system on your hard disk
+ because of certain built-in restrictions, but it is sometimes found
+ on older Linux floppy disks.  This option will enlarge your kernel
+ by about 28 KB. If unsure, say N.
+
+ To compile this file system support as a module, choose M here: the
+ module will be called minix.  Note that the file system of your root
+ partition (the one containing the directory /) cannot be compiled as
+ a module.
+
 
 config HPFS_FS
tristate "OS/2 HPFS file system support"
@@ -1429,7 +1414,6 @@ config HPFS_FS
  module will be called hpfs.  If unsure, say N.
 
 
-
 config QNX4FS_FS
tristate "QNX4 file system support (read only)"
depends on BLOCK
@@ -1456,6 +1440,22 @@ config QNX4FS_RW
  It's currently 

Does "32.1% non-contiguous" mean severely fragmented?

2007-10-18 Thread Tetsuo Handa
Hello.

I ran e2fsck and it reported as follows.

[EMAIL PROTECTED] ~]# e2fsck -f /dev/hda1
e2fsck 1.39 (29-May-2006)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/data/VMware: 349/19546112 files (32.1% non-contiguous), 31019203/39072080 blocks

Does non-contiguous mean fragmented?
If so, where is ext3defrag?

Regards.
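
For scale, the summary line above can be turned into absolute numbers. This is a rough C sketch that assumes the percentage is taken over the 349 files in use and rounded to one decimal. Both function names are hypothetical and nothing here is taken from the e2fsck sources:

```c
/*
 * Estimated number of non-contiguous files, given the file count and the
 * reported percentage in tenths of a percent (32.1% -> 321).
 * Rounds to the nearest whole file.
 */
unsigned long long noncontig_files(unsigned long long files,
				   unsigned int tenths_pct)
{
	return (files * tenths_pct + 500) / 1000;
}

/*
 * Block usage in tenths of a percent, rounded to nearest.
 * `total` must be non-zero.
 */
unsigned long long usage_tenths_pct(unsigned long long used,
				    unsigned long long total)
{
	return (used * 1000 + total / 2) / total;
}
```

With the figures reported, noncontig_files(349, 321) gives 112 files, and usage_tenths_pct(31019203, 39072080) gives 794, that is, about 79.4% of the blocks in use. Under the stated assumption the percentage says how many files have at least one gap. It does not say how badly any single file is scattered.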