Re: [PATCH] gfs: no need to check return value of debugfs_create functions

2019-01-23 Thread Andreas Gruenbacher
Greg,

On Tue, 22 Jan 2019 at 16:24, Greg Kroah-Hartman wrote:
> When calling debugfs functions, there is no need to ever check the
> return value.  The function can work or not, but the code logic should
> never do something different based on this.
>
> There is no need to save the dentries for the debugfs files, so drop
> those variables to save a bit of space and make the code simpler.

Looking good, pushed to for-next.

Thanks,
Andreas


[PATCH] gfs: no need to check return value of debugfs_create functions

2019-01-22 Thread Greg Kroah-Hartman
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

There is no need to save the dentries for the debugfs files, so drop
those variables to save a bit of space and make the code simpler.

Cc: Bob Peterson 
Cc: Andreas Gruenbacher 
Cc: cluster-de...@redhat.com
Signed-off-by: Greg Kroah-Hartman 
---
 fs/gfs2/glock.c  | 70 ++--
 fs/gfs2/glock.h  |  4 +--
 fs/gfs2/incore.h |  3 ---
 fs/gfs2/main.c   |  6 +
 4 files changed, 17 insertions(+), 66 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index b92740edc416..f66773c71bcd 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -2131,71 +2131,29 @@ static const struct file_operations gfs2_sbstats_fops = {
.release = seq_release,
 };
 
-int gfs2_create_debugfs_file(struct gfs2_sbd *sdp)
-{
-   struct dentry *dent;
-
-   dent = debugfs_create_dir(sdp->sd_table_name, gfs2_root);
-   if (IS_ERR_OR_NULL(dent))
-   goto fail;
-   sdp->debugfs_dir = dent;
-
-   dent = debugfs_create_file("glocks",
-  S_IFREG | S_IRUGO,
-  sdp->debugfs_dir, sdp,
-  &gfs2_glocks_fops);
-   if (IS_ERR_OR_NULL(dent))
-   goto fail;
-   sdp->debugfs_dentry_glocks = dent;
-
-   dent = debugfs_create_file("glstats",
-  S_IFREG | S_IRUGO,
-  sdp->debugfs_dir, sdp,
-  &gfs2_glstats_fops);
-   if (IS_ERR_OR_NULL(dent))
-   goto fail;
-   sdp->debugfs_dentry_glstats = dent;
-
-   dent = debugfs_create_file("sbstats",
-  S_IFREG | S_IRUGO,
-  sdp->debugfs_dir, sdp,
-  &gfs2_sbstats_fops);
-   if (IS_ERR_OR_NULL(dent))
-   goto fail;
-   sdp->debugfs_dentry_sbstats = dent;
+void gfs2_create_debugfs_file(struct gfs2_sbd *sdp)
+{
+   sdp->debugfs_dir = debugfs_create_dir(sdp->sd_table_name, gfs2_root);
 
-   return 0;
-fail:
-   gfs2_delete_debugfs_file(sdp);
-   return dent ? PTR_ERR(dent) : -ENOMEM;
+   debugfs_create_file("glocks", S_IFREG | S_IRUGO, sdp->debugfs_dir, sdp,
+   &gfs2_glocks_fops);
+
+   debugfs_create_file("glstats", S_IFREG | S_IRUGO, sdp->debugfs_dir, sdp,
+   &gfs2_glstats_fops);
+
+   debugfs_create_file("sbstats", S_IFREG | S_IRUGO, sdp->debugfs_dir, sdp,
+   &gfs2_sbstats_fops);
 }
 
 void gfs2_delete_debugfs_file(struct gfs2_sbd *sdp)
 {
-   if (sdp->debugfs_dir) {
-   if (sdp->debugfs_dentry_glocks) {
-   debugfs_remove(sdp->debugfs_dentry_glocks);
-   sdp->debugfs_dentry_glocks = NULL;
-   }
-   if (sdp->debugfs_dentry_glstats) {
-   debugfs_remove(sdp->debugfs_dentry_glstats);
-   sdp->debugfs_dentry_glstats = NULL;
-   }
-   if (sdp->debugfs_dentry_sbstats) {
-   debugfs_remove(sdp->debugfs_dentry_sbstats);
-   sdp->debugfs_dentry_sbstats = NULL;
-   }
-   debugfs_remove(sdp->debugfs_dir);
-   sdp->debugfs_dir = NULL;
-   }
+   debugfs_remove_recursive(sdp->debugfs_dir);
+   sdp->debugfs_dir = NULL;
 }
 
-int gfs2_register_debugfs(void)
+void gfs2_register_debugfs(void)
 {
gfs2_root = debugfs_create_dir("gfs2", NULL);
-   if (IS_ERR(gfs2_root))
-   return PTR_ERR(gfs2_root);
-   return gfs2_root ? 0 : -ENOMEM;
 }
 
 void gfs2_unregister_debugfs(void)
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 8949bf28b249..936b3295839c 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -243,9 +243,9 @@ extern void gfs2_glock_free(struct gfs2_glock *gl);
 extern int __init gfs2_glock_init(void);
 extern void gfs2_glock_exit(void);
 
-extern int gfs2_create_debugfs_file(struct gfs2_sbd *sdp);
+extern void gfs2_create_debugfs_file(struct gfs2_sbd *sdp);
 extern void gfs2_delete_debugfs_file(struct gfs2_sbd *sdp);
-extern int gfs2_register_debugfs(void);
+extern void gfs2_register_debugfs(void);
 extern void gfs2_unregister_debugfs(void);
 
 extern const struct lm_lockops gfs2_dlm_ops;
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index e10e0b0a7cd5..cdf07b408f54 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -853,9 +853,6 @@ struct gfs2_sbd {
 
unsigned long sd_last_warning;
struct dentry *debugfs_dir;/* debugfs directory */
-   struct dentry *debugfs_dentry_glocks;
-   struct dentry *debugfs_dentry_glstats;
-   struct dentry *debugfs_dentry_sbstats;
 };
 
 static inline void gfs2_glstats_inc(struct gfs2_glock *gl, int which)
diff 

Re: [PATCH 0/3] gfs: More logging neatening

2014-03-07 Thread Steven Whitehouse
Hi,

On Thu, 2014-03-06 at 12:10 -0800, Joe Perches wrote:
> Joe Perches (3):
>   gfs2: Use pr_ more consistently
>   gfs2: Use fs_ more often
>   gfs2: Convert gfs2_lm_withdraw to use fs_err
> 
>  fs/gfs2/dir.c| 14 
>  fs/gfs2/glock.c  |  8 +++--
>  fs/gfs2/lock_dlm.c   |  9 +++--
>  fs/gfs2/main.c   |  2 ++
>  fs/gfs2/ops_fstype.c | 25 ++---
>  fs/gfs2/quota.c  | 10 +++---
>  fs/gfs2/rgrp.c   | 24 +++--
>  fs/gfs2/super.c  | 16 -
>  fs/gfs2/sys.c|  6 ++--
>  fs/gfs2/trans.c  | 19 +-
>  fs/gfs2/util.c   | 99 
> +---
>  fs/gfs2/util.h   | 31 
>  12 files changed, 138 insertions(+), 125 deletions(-)
> 

Thanks for the patches. I've added them to the -nmw tree, and I did spot
the V2 patch of the third one,

Steve.



[PATCH 0/3] gfs: More logging neatening

2014-03-06 Thread Joe Perches
Joe Perches (3):
  gfs2: Use pr_ more consistently
  gfs2: Use fs_ more often
  gfs2: Convert gfs2_lm_withdraw to use fs_err

 fs/gfs2/dir.c| 14 
 fs/gfs2/glock.c  |  8 +++--
 fs/gfs2/lock_dlm.c   |  9 +++--
 fs/gfs2/main.c   |  2 ++
 fs/gfs2/ops_fstype.c | 25 ++---
 fs/gfs2/quota.c  | 10 +++---
 fs/gfs2/rgrp.c   | 24 +++--
 fs/gfs2/super.c  | 16 -
 fs/gfs2/sys.c|  6 ++--
 fs/gfs2/trans.c  | 19 +-
 fs/gfs2/util.c   | 99 +---
 fs/gfs2/util.h   | 31 
 12 files changed, 138 insertions(+), 125 deletions(-)

-- 
1.8.1.2.459.gbcd45b4.dirty



[PATCH 20/22] [PATCH] gfs: check nlink count

2007-02-09 Thread Dave Hansen
---

 lxc-dave/fs/gfs2/inode.c |1 +
 1 file changed, 1 insertion(+)

diff -puN fs/gfs2/inode.c~gfs-check-nlink-count fs/gfs2/inode.c
--- lxc/fs/gfs2/inode.c~gfs-check-nlink-count   2007-02-09 14:26:59.0 -0800
+++ lxc-dave/fs/gfs2/inode.c   2007-02-09 14:26:59.0 -0800
@@ -169,6 +169,7 @@ static int gfs2_dinode_in(struct gfs2_in
 * to do that.
 */
ip->i_inode.i_nlink = be32_to_cpu(str->di_nlink);
+   check_nlink(&ip->i_inode);
di->di_size = be64_to_cpu(str->di_size);
i_size_write(&ip->i_inode, di->di_size);
di->di_blocks = be64_to_cpu(str->di_blocks);
_


Re: GFS, what's remaining

2005-09-07 Thread David Teigland
On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
> +static inline void glock_put(struct gfs2_glock *gl)
> +{
> + if (atomic_read(&gl->gl_count) == 1)
> + gfs2_glock_schedule_for_reclaim(gl);
> + gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,);
> + atomic_dec(&gl->gl_count);
> +}
> 
> this code has a race

The first two lines of the function with the race are non-essential and
could be removed.  In the common case where there's no race, they just add
efficiency by moving the glock to the reclaim list immediately.
Otherwise, the scand thread would do it later when actively trying to
reclaim glocks.
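
Concretely, with those two lines removed the function reduces to just the
assertion and the decrement, i.e. something like:

static inline void glock_put(struct gfs2_glock *gl)
{
	gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,);
	atomic_dec(&gl->gl_count);
}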

> +static inline int queue_empty(struct gfs2_glock *gl, struct list_head *head)
> +{
> + int empty;
> + spin_lock(&gl->gl_spin);
> + empty = list_empty(head);
> + spin_unlock(&gl->gl_spin);
> + return empty;
> +}
> 
> that looks like a racey interface to me... if so.. why bother locking at
> all?

The spinlock protects the list but is not the primary method of
synchronizing processes that are working with a glock.

When the list is in fact empty, there will be no race, and the locking
wouldn't be necessary.  In this case, the "glmutex" in the code fragment
below is preventing any change in the list, so we can safely release the
spinlock immediately.

When the list is not empty, then a process could be adding another entry
to the list without "glmutex" locked [1], making the spinlock necessary.
In this case we quit after queue_empty() returns and don't do anything
else, so releasing the spinlock immediately was still safe.

[1] A process that already holds a glock (i.e. has a "holder" struct on
the gl_holders list) is allowed to hold it again by adding another holder
struct to the same list.  It adds the second hold without locking glmutex.

if (gfs2_glmutex_trylock(gl)) {
if (gl->gl_ops == &gfs2_inode_glops) {
struct gfs2_inode *ip = get_gl2ip(gl);
if (ip && !atomic_read(&ip->i_count))
gfs2_inode_destroy(ip);
}
if (queue_empty(gl, &gl->gl_holders) &&
gl->gl_state != LM_ST_UNLOCKED)
handle_callback(gl, LM_ST_UNLOCKED);

gfs2_glmutex_unlock(gl);
}

There is a second way that queue_empty() is used, and that's within
assertions that the list is empty.  If the assertion is correct, locking
isn't necessary; locking is only needed if there's already another bug
causing the list to not be empty and the assertion to fail.
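
Such an assertion-only caller might look like this (illustrative only, using
the gfs2_assert() form from the fragment quoted above):

	gfs2_assert(gl->gl_sbd, queue_empty(gl, &gl->gl_holders),);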

> static int gi_skeleton(struct gfs2_inode *ip, struct gfs2_ioctl *gi,
> +gi_filler_t filler)
> +{
> + unsigned int size = gfs2_tune_get(ip->i_sbd, gt_lockdump_size);
> + char *buf;
> + unsigned int count = 0;
> + int error;
> +
> + if (size > gi->gi_size)
> + size = gi->gi_size;
> +
> + buf = kmalloc(size, GFP_KERNEL);
> + if (!buf)
> + return -ENOMEM;
> +
> + error = filler(ip, gi, buf, size, &count);
> + if (error)
> + goto out;
> +
> + if (copy_to_user(gi->gi_data, buf, count + 1))
> + error = -EFAULT;
> 
> where does count get a sensible value?

from filler()
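
To make that concrete: a filler formats its output into buf and reports the
length it produced through the final argument.  A rough sketch (the
gi_filler_t signature below is inferred from the call site above, not copied
from the patch):

static int example_filler(struct gfs2_inode *ip, struct gfs2_ioctl *gi,
			  char *buf, unsigned int size, unsigned int *count)
{
	/* format some state into buf, bounded by size */
	*count = snprintf(buf, size, "example lock dump\n");
	if (*count >= size)
		return -EINVAL;	/* arbitrary "did not fit" error for this sketch */
	return 0;
}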

We'll add comments in the code to document the things above.
Thanks,
Dave



Re: GFS, what's remainingh

2005-09-06 Thread Dmitry Torokhov
On 9/6/05, Daniel Phillips <[EMAIL PROTECTED]> wrote:
> On Tuesday 06 September 2005 02:55, Dmitry Torokhov wrote:
> > On Tuesday 06 September 2005 01:48, Daniel Phillips wrote:
> > > On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> > > > do you think it is a bit premature to dismiss something even without
> > > > ever seeing the code?
> > >
> > > You told me you are using a dlm for a single-node application, is there
> > > anything more I need to know?
> >
> > I would still like to know why you consider it a "sin". On OpenVMS it is
> > fast, provides a way of cleaning up...
> 
> There is something hard about handling EPIPE?
> 

Just the fact that you want me to handle it ;)

> > and does not introduce single point
> > of failure as it is the case with a daemon. And if we ever want to spread
> > the load between 2 boxes we easily can do it.
> 
> But you said it runs on an aging Alpha, surely you do not intend to expand it
> to two aging Alphas?

You would be right if I were designing this right now. Now roll 10 - 12
years back, and I have a shiny new Alpha. Would you criticize me
then for using a mechanism that allowed easily spreading the application
across several nodes with minimal changes if needed?

What you fail to realize is that there are applications that run and will
continue to run for a long time.

>  And what makes you think that socket-based
> synchronization keeps you from spreading out the load over multiple boxes?
> 
> > Why would I not want to use it?
> 
> It is not the right tool for the job from what you have told me.  You want to
> get a few bytes of information from one task to another?  Use a socket, as
> God intended.
>

Again, when TCP/IP is not the native network stack and libc socket
routines are not readily available, a DLM starts looking much more
viable.

-- 
Dmitry


Re: GFS, what's remainingh

2005-09-06 Thread Alan Cox
On Maw, 2005-09-06 at 02:48 -0400, Daniel Phillips wrote:
> On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> > do you think it is a bit premature to dismiss something even without
> > ever seeing the code?
> 
> You told me you are using a dlm for a single-node application, is there 
> anything more I need to know?

That's standard practice for many non-Unix operating systems. It means
your code supports failover without much additional work and it provides
all the functionality for locks on a single node too



Re: GFS, what's remaining

2005-09-06 Thread Suparna Bhattacharya
On Fri, Sep 02, 2005 at 11:17:08PM +0200, Andi Kleen wrote:
> Andrew Morton <[EMAIL PROTECTED]> writes:
> 
> > 
> > > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> > > >   possibly gain (or vice versa)
> > > > 
> > > > - Relative merits of the two offerings
> > > 
> > > You missed the important one - people actively use it and have been for
> > > some years. Same reason with have NTFS, HPFS, and all the others. On
> > > that alone it makes sense to include.
> >  
> > Again, that's not a technical reason.  It's _a_ reason, sure.  But what are
> > the technical reasons for merging gfs[2], ocfs2, both or neither?
> 
> There seems to be clearly a need for a shared-storage fs of some sort
> for HA clusters and virtualized usage (multiple guests sharing a
> partition).  Shared storage can be more efficient than network file
> systems like NFS because the storage access is often more efficient
> than network access  and it is more reliable because it doesn't have a
> single point of failure in form of the NFS server.
> 
> It's also a logical extension of the "failover on failure" clusters
> many people run now - instead of only failing over the shared fs at
> failure and keeping one machine idle the load can be balanced between
> multiple machines at any time.
> 
> One argument to merge both might be that nobody really knows yet which
> shared-storage file system (GFS or OCFS2) is better. The only way to
> find out would be to let the user base try out both, and that's most
> practical when they're merged.
> 
> Personally I think ocfs2 has nicer code than GFS.
> It seems to be more or less a 64bit ext3 with cluster support, while

The "more or less" is what bothers me here - the first time I heard this,
it sounded a little misleading, as I expected to find some kind of a
patch to ext3 to make it 64 bit with extents and cluster support.
Now I understand it a little better (thanks to Joel and Mark)

And herein lies the issue where I tend to agree with Andrew on
-- it's really nice to have multiple filesystems innovating freely in
their niches and eventually proving themselves in practice, without
being bogged down by legacy etc. But at the same time, is there enough
thought and discussion about where the fragmentation/diversification is really
warranted, vs improving what is already there, or say incorporating
the best of one into another, maybe over a period of time?

The number of filesystems seems to just keep growing, and supporting
all of them isn't easy -- for users it isn't really easy to switch from
one to another, and the justifications for choosing between them are
sometimes confusing and burdensome from an administrator standpoint
- one filesystem is good in certain conditions, another in others,
stability levels may vary etc, and it's not always possible to predict
which aspect to prioritize.

Now, with filesystems that have been around in production for a long
time, the on-disk format becomes a major constraining factor, and the
reason for having various legacy support around. Likewise, for some
special purpose filesystems there really is a niche usage. But for new
and sufficiently general purpose filesystems, with new on-disk structure,
isn't it worth thinking this through and trying to get it right?

Yeah, it is a lot of work upfront ... but with double the people working
on something, it just might get much better than what they individually
can. Sometimes.

BTW, I don't know if it is worth it in this particular case, but just
something that worries me in general.

> GFS seems to reinvent a lot more things and has somewhat uglier code.
> On the other hand GFS' cluster support seems to be more aimed
> at being a universal cluster service open for other usages too,
> which might be a good thing. OCFS2s cluster seems to be more 
> aimed at only serving the file system.
> 
> But which one works better in practice is really an open question.

True, but what usually ends up happening is that this question can
never quite be answered in black and white. So both just continue
to exist and apps need to support both ... convergence becomes impossible
and long term duplication inevitable.

So at least having a clear demarcation/guideline of what situations
each is suitable for upfront would be a good thing. That might also
get some cross ocfs-gfs and ocfs-ext3 reviews in the process :)

Regards
Suparna

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



Re: GFS, what's remainingh

2005-09-06 Thread Daniel Phillips
On Tuesday 06 September 2005 02:55, Dmitry Torokhov wrote:
> On Tuesday 06 September 2005 01:48, Daniel Phillips wrote:
> > On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> > > do you think it is a bit premature to dismiss something even without
> > > ever seeing the code?
> >
> > You told me you are using a dlm for a single-node application, is there
> > anything more I need to know?
>
> I would still like to know why you consider it a "sin". On OpenVMS it is
> fast, provides a way of cleaning up...

There is something hard about handling EPIPE?

> and does not introduce single point 
> of failure as it is the case with a daemon. And if we ever want to spread
> the load between 2 boxes we easily can do it.

But you said it runs on an aging Alpha, surely you do not intend to expand it 
to two aging Alphas?  And what makes you think that socket-based 
synchronization keeps you from spreading out the load over multiple boxes?

> Why would I not want to use it?

It is not the right tool for the job from what you have told me.  You want to 
get a few bytes of information from one task to another?  Use a socket, as 
God intended.

Regards,

Daniel


Re: GFS, what's remainingh

2005-09-06 Thread Dmitry Torokhov
On Tuesday 06 September 2005 01:48, Daniel Phillips wrote:
> On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> > do you think it is a bit premature to dismiss something even without
> > ever seeing the code?
> 
> You told me you are using a dlm for a single-node application, is there 
> anything more I need to know?
>

I would still like to know why you consider it a "sin". On OpenVMS it is
fast, provides a way of cleaning up and does not introduce single point
of failure as it is the case with a daemon. And if we ever want to spread
the load between 2 boxes we easily can do it. Why would I not want to use
it?

-- 
Dmitry


Re: GFS, what's remainingh

2005-09-06 Thread Daniel Phillips
On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> do you think it is a bit premature to dismiss something even without
> ever seeing the code?

You told me you are using a dlm for a single-node application, is there 
anything more I need to know?

Regards,

Daniel


Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 19:37, Joel Becker wrote:
>  OCFS2, the new filesystem, is fully general purpose.  It
> supports all the usual stuff, is quite fast...

So I have heard, but isn't it time to quantify that?  How do you think you 
would stack up here:

   http://www.caspur.it/Files/2005/01/10/1105354214692.pdf

Regards,

Daniel


Re: GFS, what's remainingh

2005-09-05 Thread Dmitry Torokhov
On Monday 05 September 2005 23:58, Daniel Phillips wrote:
> On Tuesday 06 September 2005 00:07, Dmitry Torokhov wrote:
> > On Monday 05 September 2005 23:02, Daniel Phillips wrote:
> > > By the way, you said "alpha server" not "alpha servers", was that just a
> > > slip? Because if you don't have a cluster then why are you using a dlm?
> >
> > No, it is not a slip. The application is running on just one node, so we
> > do not really use "distributed" part. However we make heavy use of the
> > rest of lock manager features, especially lock value blocks.
> 
> Urk, so you imprinted on the clunkiest, most pathetically limited dlm feature 
> without even having the excuse you were forced to use it.  Why don't you just 
> have a daemon that sends your values over a socket?  That should be all of a 
> day's coding.
>

Umm, because when most of the code was written TCP and the rest was the
clunkiest code out there? Plus, having a daemon introduces problems with
cleanup (say a process dies for one reason or another), whereas having it in
the OS takes care of that.
 
> Anyway, thanks for sticking your head up, and sorry if it sounds aggressive. 
> But you nicely supported my claim that most who think they should be using a 
> dlm, really shouldn't.

Heh, do you think it is a bit premature to dismiss something even without
ever seeing the code?

-- 
Dmitry


Re: GFS, what's remainingh

2005-09-05 Thread Daniel Phillips
On Tuesday 06 September 2005 00:07, Dmitry Torokhov wrote:
> On Monday 05 September 2005 23:02, Daniel Phillips wrote:
> > By the way, you said "alpha server" not "alpha servers", was that just a
> > slip? Because if you don't have a cluster then why are you using a dlm?
>
> No, it is not a slip. The application is running on just one node, so we
> do not really use "distributed" part. However we make heavy use of the
> rest of lock manager features, especially lock value blocks.

Urk, so you imprinted on the clunkiest, most pathetically limited dlm feature 
without even having the excuse you were forced to use it.  Why don't you just 
have a daemon that sends your values over a socket?  That should be all of a 
day's coding.

Anyway, thanks for sticking your head up, and sorry if it sounds aggressive. 
But you nicely supported my claim that most who think they should be using a 
dlm, really shouldn't.

Regards,

Daniel


Re: GFS, what's remainingh

2005-09-05 Thread Dmitry Torokhov
On Monday 05 September 2005 23:02, Daniel Phillips wrote:
> 
> By the way, you said "alpha server" not "alpha servers", was that just a 
> slip?  
> Because if you don't have a cluster then why are you using a dlm?
>

No, it is not a slip. The application is running on just one node, so we
do not really use "distributed" part. However we make heavy use of the
rest of lock manager features, especially lock value blocks.

-- 
Dmitry


Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 22:03, Dmitry Torokhov wrote:
> On Monday 05 September 2005 19:57, Daniel Phillips wrote:
> > On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
> > > On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> > > > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > > > > On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > > > > > The only current users of dlms are cluster filesystems.  There
> > > > > > are zero users of the userspace dlm api.
> > > > >
> > > > > That is incorrect...
> > > >
> > > > Application users Lars, sorry if I did not make that clear.  The
> > > > issue is whether we need to export an all-singing-all-dancing dlm api
> > > > from kernel to userspace today, or whether we can afford to take the
> > > > necessary time to get it right while application writers take their
> > > > time to have a good think about whether they even need it.
> > >
> > > If Linux fully supported OpenVMS DLM semantics we could start thinking
> > > about moving our application onto a Linux box because our alpha server
> > > is aging.
> > >
> > > That's just my user application writer $0.02.
> >
> > What stops you from trying it with the patch?  That kind of feedback
> > would be worth way more than $0.02.
>
> We do not have such plans at the moment and I prefer spending my free
> time on tinkering with kernel, not rewriting some in-house application.
> Besides, DLM is not the only thing that does not have a drop-in
> replacement in Linux.
>
> You just said you did not know if there are any potential users for the
> full DLM and I said there are some.

I did not say "potential", I said there are zero dlm applications at the 
moment.  Nobody has picked up the prototype (g)dlm api, used it in an 
application and said "gee this works great, look what it does".

I also claim that most developers who think that using a dlm for application 
synchronization would be really cool are probably wrong.  Use sockets for 
synchronization exactly as for a single-node, multi-tasking application and 
you will end up with less code, more obviously correct code, probably more 
efficient and... you get an optimal, single-node version for free.
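
To make that concrete, here is a minimal userspace sketch of the socket
approach (illustrative only; the payload and names are made up):

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int sv[2];
	pid_t pid;

	/* one connected AF_UNIX pair is all the "lock manager" you need here */
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
		perror("socketpair");
		return 1;
	}

	pid = fork();
	if (pid < 0) {
		perror("fork");
		return 1;
	}

	if (pid == 0) {			/* "client" task: wait for the value */
		char buf[64];
		ssize_t n = read(sv[1], buf, sizeof(buf) - 1);

		if (n > 0) {
			buf[n] = '\0';
			printf("got: %s\n", buf);
		}
		/* EOF here (or EPIPE on write) is how you learn the peer went away */
		return 0;
	}

	/* "server" task: publish a few bytes of state to the other task */
	const char *val = "sequence=42";
	if (write(sv[0], val, strlen(val)) < 0)
		perror("write");
	close(sv[0]);
	waitpid(pid, NULL, 0);
	return 0;
}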

And I also claim that there is precious little reason to have a full-featured 
dlm in-kernel.  Being in-kernel has no benefit for a userspace application.  
But being in-kernel does add kernel bloat, because there will be extra 
features lathered on that are not needed by the only in-kernel user, the 
cluster filesystem.

In the case of your port, you'd be better off hacking up a userspace library 
to provide OpenVMS dlm semantics exactly, not almost.

By the way, you said "alpha server" not "alpha servers", was that just a slip?  
Because if you don't have a cluster then why are you using a dlm?

Regards,

Daniel


Re: GFS, what's remaining

2005-09-05 Thread Dmitry Torokhov
On Monday 05 September 2005 19:57, Daniel Phillips wrote:
> On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
> > On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> > > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > > > On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > > > > The only current users of dlms are cluster filesystems.  There are
> > > > > zero users of the userspace dlm api.
> > > >
> > > > That is incorrect...
> > >
> > > Application users Lars, sorry if I did not make that clear.  The issue is
> > > whether we need to export an all-singing-all-dancing dlm api from kernel
> > > to userspace today, or whether we can afford to take the necessary time
> > > to get it right while application writers take their time to have a good
> > > think about whether they even need it.
> >
> > If Linux fully supported OpenVMS DLM semantics we could start thinking
> > about moving our application onto a Linux box because our alpha server is
> > aging.
> >
> > That's just my user application writer $0.02.
> 
> What stops you from trying it with the patch?  That kind of feedback would be 
> worth way more than $0.02.
>

We do not have such plans at the moment and I prefer spending my free
time on tinkering with kernel, not rewriting some in-house application.
Besides, DLM is not the only thing that does not have a drop-in
replacement in Linux.

You just said you did not know if there are any potential users for the
full DLM and I said there are some.

-- 
Dmitry


Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
> On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > > On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > > > The only current users of dlms are cluster filesystems.  There are
> > > > zero users of the userspace dlm api.
> > >
> > > That is incorrect...
> >
> > Application users Lars, sorry if I did not make that clear.  The issue is
> > whether we need to export an all-singing-all-dancing dlm api from kernel
> > to userspace today, or whether we can afford to take the necessary time
> > to get it right while application writers take their time to have a good
> > think about whether they even need it.
>
> If Linux fully supported OpenVMS DLM semantics we could start thinking
> about moving our application onto a Linux box because our alpha server is
> aging.
>
> That's just my user application writer $0.02.

What stops you from trying it with the patch?  That kind of feedback would be 
worth way more than $0.02.

Regards,

Daniel


Re: GFS, what's remaining

2005-09-05 Thread Joel Becker
On Mon, Sep 05, 2005 at 10:24:03PM +0200, Bernd Eckenfels wrote:
> The whole point of the oracle cluster filesystem as it was described in old
> papers was about pfiles, control files and software, because you can easily
> use direct block access (with ASM) for tablespaces.

OCFS, the original filesystem, only works for datafiles,
logfiles, and other database data.  It's currently used in serious anger
by several major customers.  Oracle's websites must have a list of them
somewhere.  We're talking many terabytes of datafiles.

> Yes, I don't dispute the usefulness of OCFS for ORA_HOME (besides, I think a
> replicated filesystem makes more sense), I am just not sure if anybody sane
> would use it for tablespaces.

OCFS2, the new filesystem, is fully general purpose.  It
supports all the usual stuff, is quite fast, and is what we expect folks
to use for both ORACLE_HOME and datafiles in the future.  Customers can,
of course, use ASM or even raw devices.  OCFS2 is as fast as raw
devices, and far more manageable, so raw devices are probably not a
choice for the future.  ASM has its own management advantages, and we
certainly expect customers to like it as well.  But that doesn't mean
people won't use OCFS2 for datafiles depending on their environment or
needs.


-- 

"The first requisite of a good citizen in this republic of ours
 is that he shall be able and willing to pull his weight."
- Theodore Roosevelt

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [EMAIL PROTECTED]
Phone: (650) 506-8127


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Joel Becker
On Sun, Sep 04, 2005 at 09:37:15AM +0100, Alan Cox wrote:
> I am curious why a lock manager uses open to implement its locking
> semantics rather than using the locking API (POSIX locks etc) however.

Because it is simple (how do you fcntl(2) from a shell fd?), has no
ranges (what do you do with ranges passed in to fcntl(2) and you don't
support them?), and has a well-known fork(2)/exec(2) pattern.  fcntl(2)
has a known but less intuitive fork(2) pattern.
The real reason, though, is that we never considered fcntl(2).
We could never think of a case when a process wanted a lock fd open but
not locked.  At least, that's my recollection.  Mark might have more to
comment.

Joel

-- 

"In the room the women come and go
 Talking of Michaelangelo."

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [EMAIL PROTECTED]
Phone: (650) 506-8127



Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Andrew Morton
Alan Cox <[EMAIL PROTECTED]> wrote:
>
> On Llu, 2005-09-05 at 12:53 -0700, Andrew Morton wrote:
>  > >  - How are they ref counted
>  > >  - What are the cleanup semantics
>  > >  - How do I pass a lock between processes (AF_UNIX sockets wont work now)
>  > >  - How do I poll on a lock coming free. 
>  > >  - What are the semantics of lock ownership
>  > >  - What rules apply for inheritance
>  > >  - How do I access a lock across threads.
>  > >  - What is the permission model. 
>  > >  - How do I attach audit to it
>  > >  - How do I write SELinux rules for it
>  > >  - How do I use mount to make namespaces appear in multiple vservers
>  > > 
>  > >  and thats for starters...
>  > 
>  > Return an fd from create_lockspace().
> 
>  That only answers about four of the questions. The rest only come out if
>  create_lockspace behaves like a file system - in other words
>  create_lockspace is better known as either mkdir or mount.

But David said that "We export our full dlm API through read/write/poll on
a misc device.".  That miscdevice will simply give us an fd.  Hence my
suggestion that the miscdevice be done away with in favour of a dedicated
syscall which returns an fd.

What does a filesystem have to do with this?
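
For readers following the thread, the misc-device pattern under discussion
looks roughly like this in outline (a generic sketch, not the actual GFS2/DLM
code; the device name and handlers are invented):

#include <linux/fs.h>
#include <linux/miscdevice.h>
#include <linux/module.h>
#include <linux/poll.h>

static ssize_t ex_read(struct file *file, char __user *buf,
		       size_t count, loff_t *ppos)
{
	/* copy completed lock operations back to userspace */
	return 0;
}

static ssize_t ex_write(struct file *file, const char __user *buf,
			size_t count, loff_t *ppos)
{
	/* parse a lock/unlock request structure from userspace */
	return count;
}

static __poll_t ex_poll(struct file *file, poll_table *wait)
{
	/* report readability when a completion is queued */
	return 0;
}

static const struct file_operations ex_fops = {
	.owner	= THIS_MODULE,
	.read	= ex_read,
	.write	= ex_write,
	.poll	= ex_poll,
};

static struct miscdevice ex_misc = {
	.minor	= MISC_DYNAMIC_MINOR,
	.name	= "example_dlm",	/* open() on the device node yields the fd */
	.fops	= &ex_fops,
};
module_misc_device(ex_misc);
MODULE_LICENSE("GPL");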


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Alan Cox
On Llu, 2005-09-05 at 12:53 -0700, Andrew Morton wrote:
> >  - How are they ref counted
> >  - What are the cleanup semantics
> >  - How do I pass a lock between processes (AF_UNIX sockets wont work now)
> >  - How do I poll on a lock coming free. 
> >  - What are the semantics of lock ownership
> >  - What rules apply for inheritance
> >  - How do I access a lock across threads.
> >  - What is the permission model. 
> >  - How do I attach audit to it
> >  - How do I write SELinux rules for it
> >  - How do I use mount to make namespaces appear in multiple vservers
> > 
> >  and thats for starters...
> 
> Return an fd from create_lockspace().

That only answers about four of the questions. The rest only come out if
create_lockspace behaves like a file system - in other words
create_lockspace is better known as either mkdir or mount.

It's certainly viable to make the lock/unlock functions take an fd, it's
just not clear why the current lock/unlock functions we have won't do
the job. Being able to extend the functionality to leases later on may
be very powerful indeed and will fit the existing API



Re: GFS, what's remaining

2005-09-05 Thread Kurt Hackel
On Mon, Sep 05, 2005 at 10:24:03PM +0200, Bernd Eckenfels wrote:
> On Mon, Sep 05, 2005 at 04:16:31PM +0200, Lars Marowsky-Bree wrote:
> > That is the whole point why OCFS exists ;-)
> 
> The whole point of the oracle cluster filesystem as it was described in old
> papers was about pfiles, control files and software, because you can easily
> use direct block access (with ASM) for tablespaces.

The original OCFS was intended for use with pfiles and control files but
very definitely *not* software (the ORACLE_HOME).  It was not remotely
general purpose.  It also predated ASM by about a year or so, and the
two solutions are complementary.  Either one is a good choice for Oracle
datafiles, depending upon your needs.

> > No. Beyond the table spaces, there's also ORACLE_HOME; a cluster
> > benefits in several aspects from a general-purpose SAN-backed CFS.
> 
> Yes, I don't dispute the usefulness of OCFS for ORA_HOME (besides, I think a
> replicated filesystem makes more sense), I am just not sure if anybody sane
> would use it for tablespaces.

Too many to mention here, but let's just say that some of the largest
databases are running Oracle datafiles on top of OCFS1.  Very large
companies with very important data.

> I guess I have to correct the article in my German IT blog :) (if somebody
> can name productive customers).

Yeah you should definitely update your blog ;-)  If you need named
references, we can give you loads of those.

-kurt

Kurt C. Hackel
Oracle
[EMAIL PROTECTED]


Re: GFS, what's remaining

2005-09-05 Thread Bernd Eckenfels
On Mon, Sep 05, 2005 at 04:16:31PM +0200, Lars Marowsky-Bree wrote:
> That is the whole point why OCFS exists ;-)

The whole point of the oracle cluster filesystem as it was described in old
papers was about pfiles, control files and software, because you can easily
use direct block access (with ASM) for tablespaces.

> No. Beyond the table spaces, there's also ORACLE_HOME; a cluster
> benefits in several aspects from a general-purpose SAN-backed CFS.

Yes, I don't dispute the usefulness of OCFS for ORA_HOME (besides, I think a
replicated filesystem makes more sense), I am just not sure if anybody sane
would use it for tablespaces.

I guess I have to correct the article in my German IT blog :) (if somebody
can name productive customers).

Regards,
Bernd
-- 
http://itblog.eckenfels.net/archives/54-Cluster-Filesysteme.html


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Andrew Morton
Alan Cox <[EMAIL PROTECTED]> wrote:
>
> On Llu, 2005-09-05 at 02:19 -0700, Andrew Morton wrote:
>  > >   create_lockspace()
>  > >   release_lockspace()
>  > >   lock()
>  > >   unlock()
>  > 
>  > Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
>  > is likely to object if we reserve those slots.
> 
>  If the locks are not file descriptors then answer the following:
> 
>  - How are they ref counted
>  - What are the cleanup semantics
>  - How do I pass a lock between processes (AF_UNIX sockets wont work now)
>  - How do I poll on a lock coming free. 
>  - What are the semantics of lock ownership
>  - What rules apply for inheritance
>  - How do I access a lock across threads.
>  - What is the permission model. 
>  - How do I attach audit to it
>  - How do I write SELinux rules for it
>  - How do I use mount to make namespaces appear in multiple vservers
> 
>  and thats for starters...

Return an fd from create_lockspace().


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread kurt . hackel
On Mon, Sep 05, 2005 at 05:24:33PM +0800, David Teigland wrote:
> On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> > David Teigland <[EMAIL PROTECTED]> wrote:
> > >
> > >  We export our full dlm API through read/write/poll on a misc device.
> > >
> > 
> > inotify did that for a while, but we ended up going with a straight syscall
> > interface.
> > 
> > How fat is the dlm interface?   ie: how many syscalls would it take?
> 
> Four functions:
>   create_lockspace()
>   release_lockspace()
>   lock()
>   unlock()

FWIW, it looks like we can agree on the core interface.  ocfs2_dlm
exports essentially the same functions:
dlm_register_domain()
dlm_unregister_domain()
dlmlock()
dlmunlock()

I also implemented dlm_migrate_lockres() to explicitly remaster a lock
on another node, but this isn't used by any callers today (except for
debugging purposes).  There is also some wiring between the fs and the
dlm (eviction callbacks) to deal with some ordering issues between the
two layers, but these could go if we get stronger membership.

There are quite a few other functions in the "full" spec(1) that we
didn't even attempt, either because we didn't require direct 
user<->kernel access or we just didn't need the function.  As for the
rather thick set of parameters expected in dlm calls, we managed to get
dlmlock down to *ahem* eight, and the rest are fairly slim.

Looking at the misc device that gfs uses, it seems like there is a pretty
much complete interface to the same calls you have in the kernel, validated
on the write() calls to the misc device.  With dlmfs, we were seeking to
lock down and simplify user access by using standard ast/bast/unlockast
calls, using a file descriptor as an opaque token for a single lock,
letting the vfs lifetime on this fd help with abnormal termination, etc.
I think both the misc device and dlmfs are helpful and not necessarily
mutually exclusive, and probably both are better approaches than
exporting everything via loads of syscalls (which seems to be the 
VMS/opendlm model).
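
[Editorial note: a rough sketch of the fd-as-lock-token model described
above.  The mount point, lock name and the open-mode-to-lock-mode mapping
are assumptions modelled on ocfs2's dlmfs, not details taken from this
message.]

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Opening the per-lock file takes the lock; O_RDWR asks for an
	 * exclusive mode and O_NONBLOCK turns the acquire into a trylock. */
	int fd = open("/dlm/mydomain/mylock", O_RDWR | O_NONBLOCK);

	if (fd < 0) {
		perror("trylock failed (or lock filesystem not mounted)");
		return 1;
	}

	/* ... critical section: the lock is held for as long as the fd is
	 * open, so abnormal process exit drops it automatically ... */

	close(fd);	/* closing the fd releases the lock */
	return 0;
}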

-kurt

1. http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf


Kurt C. Hackel
Oracle
[EMAIL PROTECTED]


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Alan Cox
On Sad, 2005-09-03 at 21:46 -0700, Andrew Morton wrote:
> Actually I think it's rather sick.  Taking O_NONBLOCK and making it a
> lock-manager trylock because they're kinda-sorta-similar-sounding?  Spare
> me.  O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> acquire a clustered filesystem lock".  Not even close.

The semantics of O_NONBLOCK on many other devices are "trylock"
semantics. OSS audio has those semantics for example, as do regular
files in the presence of SYS5 mandatory locks. While the latter is "try
lock, do operation, and then drop lock", the drivers using O_NDELAY are
very definitely providing trylock semantics.

I am curious why a lock manager uses open to implement its locking
semantics rather than using the locking API (POSIX locks etc) however.

Alan



Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Alan Cox
On Llu, 2005-09-05 at 02:19 -0700, Andrew Morton wrote:
> >   create_lockspace()
> >   release_lockspace()
> >   lock()
> >   unlock()
> 
> Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
> is likely to object if we reserve those slots.

If the locks are not file descriptors then answer the following:

- How are they ref counted
- What are the cleanup semantics
- How do I pass a lock between processes (AF_UNIX sockets wont work now)
- How do I poll on a lock coming free. 
- What are the semantics of lock ownership
- What rules apply for inheritance
- How do I access a lock across threads.
- What is the permission model. 
- How do I attach audit to it
- How do I write SELinux rules for it
- How do I use mount to make namespaces appear in multiple vservers

and thats for starters...

Every so often someone decides that a deeply un-unix interface with new
syscalls is a good idea. Every time history proves them totally bonkers.
There are cases for new system calls, but this doesn't seem to be one of them.

Look at system 5 shared memory, look at system 5 ipc, and so on. You
can't use common interfaces on them, you can't select on them, you can't
sanely pass them by fd passing.

All our existing locking uses the following behaviour

fd = open(namespace, options)
fcntl(.. lock ...)
blah
flush
fcntl(.. unlock ...)
close
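
[Editorial note: for concreteness, the same pattern spelled out with
today's POSIX record locks; this sketch is not from the thread and the
file name is illustrative only.]

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	struct flock fl;
	int fd = open("/var/lock/example", O_RDWR | O_CREAT, 0644);

	if (fd < 0)
		return 1;

	memset(&fl, 0, sizeof(fl));
	fl.l_type = F_WRLCK;			/* exclusive lock */
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 0;				/* length 0 == to end of file */

	if (fcntl(fd, F_SETLKW, &fl) < 0) {	/* blocking acquire */
		perror("F_SETLKW");
		return 1;
	}

	/* ... blah; flush ... */

	fl.l_type = F_UNLCK;
	fcntl(fd, F_SETLK, &fl);		/* unlock */
	close(fd);				/* close drops our locks anyway */
	return 0;
}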

Unfortunately some people here seem to have forgotten WHY we do things
this way.

1.  The semantics of file descriptors are well understood by users and by
programs. That makes programming easier and keeps code size down
2.  Everyone knows how close() works including across fork
3.  FD passing is an obscure art but understood and just works
4.  Poll() is a standard understood interface
5.  Ownership of files is a standard model
6.  FD passing across fork/exec is controlled in a standard way
7.  The semantics for threaded applications are defined
8.  Permissions are a standard model
9.  Audit just works with the same tools
10. SELinux just works with the same tools
11. I don't need specialist applications to see the system state (the
whole point of sysfs yet someone wants to break it all again)
12. fcntl fd locking is a POSIX standard interface with precisely
defined semantics. Our extensions including leases are very powerful
13. And yes - fcntl fd locking supports mandatory locking too. That also
is standards based with precise semantics.


Everyone understands how to use the existing locking operations. So if
you use the existing interfaces, with some small extensions if necessary,
everyone understands how to use cluster locks. Isn't that neat?




Re: GFS, what's remaining

2005-09-05 Thread Dmitry Torokhov
On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > > The only current users of dlms are cluster filesystems.  There are zero
> > > users of the userspace dlm api.
> >
> > That is incorrect...
> 
> Application users Lars, sorry if I did not make that clear.  The issue is 
> whether we need to export an all-singing-all-dancing dlm api from kernel to 
> userspace today, or whether we can afford to take the necessary time to get 
> it right while application writers take their time to have a good think about 
> whether they even need it.
>

If Linux fully supported OpenVMS DLM semantics we could start thinking about
moving our application onto a Linux box, because our Alpha server is aging.

That's just my user application writer $0.02.

-- 
Dmitry


Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > The only current users of dlms are cluster filesystems.  There are zero
> > users of the userspace dlm api.
>
> That is incorrect...

Application users Lars, sorry if I did not make that clear.  The issue is 
whether we need to export an all-singing-all-dancing dlm api from kernel to 
userspace today, or whether we can afford to take the necessary time to get 
it right while application writers take their time to have a good think about 
whether they even need it.

> ...and you're contradicting yourself here:

How so?  Above talks about dlm, below talks about cluster membership.

> > What does have to be resolved is a common API for node management.  It is
> > not just cluster filesystems and their lock managers that have to
> > interface to node management.  Below the filesystem layer, cluster block
> > devices and cluster volume management need to be coordinated by the same
> > system, and above the filesystem layer, applications also need to be
> > hooked into it. This work is, in a word, incomplete.

Regards,

Daniel


Re: GFS, what's remaining

2005-09-05 Thread Lars Marowsky-Bree
On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote:

> The only current users of dlms are cluster filesystems.  There are zero users 
> of the userspace dlm api. 

That is incorrect, and you're contradicting yourself here:

> What does have to be resolved is a common API for node management.  It is not 
> just cluster filesystems and their lock managers that have to interface to 
> node management.  Below the filesystem layer, cluster block devices and 
> cluster volume management need to be coordinated by the same system, and 
> above the filesystem layer, applications also need to be hooked into it.  
> This work is, in a word, incomplete.

The Cluster Volume Management of LVM2 for example _does_ use simple
cluster-wide locks, and some OCFS2 scripts, I seem to recall, do too.

(EVMS2 in cluster-mode uses a verrry simple locking scheme which is
basically operated by the failover software and thus uses a different
model.)


Sincerely,
Lars Marowsky-Brée <[EMAIL PROTECTED]>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge" -- Charles Darwin



Re: GFS, what's remaining

2005-09-05 Thread Lars Marowsky-Bree
On 2005-09-03T09:27:41, Bernd Eckenfels <[EMAIL PROTECTED]> wrote:

> Oh that's interesting, I never thought about putting data files (tablespaces)
> in a clustered file system. Does that mean you can run supported RAC on
> shared ocfs2 files and anybody is using that?

That is the whole point why OCFS exists ;-)

> Do you see this go away with ASM?

No. Beyond the table spaces, there's also ORACLE_HOME; a cluster
benefits in several aspects from a general-purpose SAN-backed CFS.


Sincerely,
Lars Marowsky-Brée <[EMAIL PROTECTED]>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge" -- Charles Darwin



Re: GFS, what's remaining

2005-09-05 Thread Theodore Ts'o
On Mon, Sep 05, 2005 at 12:09:23AM -0700, Mark Fasheh wrote:
> Btw, I'm curious to know how useful folks find the ext3 mount options
> errors=continue and errors=panic. I'm extremely likely to implement the
> errors=read-only behavior as default in OCFS2 and I'm wondering whether the
> other two are worth looking into.

For a single-user system errors=panic is definitely very useful on the
system disk, since that's the only way that we can force an fsck, and
also abort a server that might be failing and returning erroneous
information to its clients.  Think of it as I/O fencing when you're
not sure that the system is going to be performing correctly.
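
[Editorial note: for reference, the error behaviour is selected per mount
(or set as a persistent default with tune2fs -e).  A minimal sketch using
mount(2); the device and mount point are illustrative only.]

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* equivalent to: mount -t ext3 -o errors=panic /dev/sda1 /mnt */
	if (mount("/dev/sda1", "/mnt", "ext3", 0, "errors=panic") < 0) {
		perror("mount");
		return 1;
	}
	return 0;
}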

Whether or not this is useful for ocfs2 is a different matter.  If
it's only for data volumes, and if the only way to fix filesystem
inconsistencies on a cluster filesystem is to request all nodes in the
cluster to unmount the filesystem and then arrange to run ocfs2's fsck
on the filesystem, then forcing every single node in the cluster to
panic is probably counterproductive.  :-)

- Ted


Re: real read-only [was Re: GFS, what's remaining]

2005-09-05 Thread Theodore Ts'o
On Mon, Sep 05, 2005 at 10:27:35AM +0200, Pavel Machek wrote:
> 
> There's a better reason, too. I do swsusp. Then I'd like to boot with
> / mounted read-only (so that I can read my config files, some
> binaries, and maybe suspended image), but I absolutely may not write
> to disk at this point, because I still want to resume.
> 

You could _hope_ that the filesystem is consistent enough that it is
safe to try to read config files, binaries, etc. without running the
journal, but there is absolutely no guarantee that this is the case.
I'm not sure you want to depend on that for swsusp.

One potential solution that would probably meet your needs is a dm
hack which reads in the blocks in the journal, and then uses the most
recent block in the journal in preference to the version on disk.

- Ted


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Stephen C. Tweedie
Hi,

On Sun, 2005-09-04 at 21:33, Pavel Machek wrote:

> > - read-only mount
> > - "specatator" mount (like ro but no journal allocated for the mount,
> >   no fencing needed for failed node that was mounted as specatator)
> 
> I'd call it "real-read-only", and yes, that's very usefull
> mount. Could we get it for ext3, too?

I don't want to pollute the ext3 paths with extra checks for the case
when there's no journal struct at all.  But a dummy journal struct that
isn't associated with an on-disk journal and that can never, ever go
writable would certainly be pretty easy to do.

But mount -o readonly gives you most of what you want already.  An
always-readonly option would be different in some key ways --- for a
start, it would be impossible to perform journal recovery if that's
needed, as that still needs journal and superblock write access.  That's
not necessarily a good thing.

And you *still* wouldn't get something that could act as a spectator to
a filesystem mounted writable elsewhere on a SAN, because updates on the
other node wouldn't invalidate cached data on the readonly node.  So is
this really a useful combination?

About the only combination I can think of that really makes sense in
this context is if you have a busted filesystem that somehow can't be
recovered --- either the journal is broken or the underlying device is
truly readonly --- and you want to mount without recovery in order to
attempt to see what you can find.  That's asking for data corruption,
but that may be better than getting no data at all.  

But that is something that could be done with a "-o skip-recovery" mount
option, which would necessarily imply always-readonly behaviour.

--Stephen




Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread David Teigland
On Mon, Sep 05, 2005 at 02:19:48AM -0700, Andrew Morton wrote:
> David Teigland <[EMAIL PROTECTED]> wrote:
> > Four functions:
> >   create_lockspace()
> >   release_lockspace()
> >   lock()
> >   unlock()
> 
> Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
> is likely to object if we reserve those slots.

Patrick is really the expert in this area and he's off this week, but
based on what he's done with the misc device I don't see why there'd be
more than two or three parameters for any of these.

Dave



Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 05:19, Andrew Morton wrote:
> David Teigland <[EMAIL PROTECTED]> wrote:
> > On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> > > David Teigland <[EMAIL PROTECTED]> wrote:
> > > >  We export our full dlm API through read/write/poll on a misc device.
> > >
> > > inotify did that for a while, but we ended up going with a straight
> > > syscall interface.
> > >
> > > How fat is the dlm interface?   ie: how many syscalls would it take?
> >
> > Four functions:
> >   create_lockspace()
> >   release_lockspace()
> >   lock()
> >   unlock()
>
> Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
> is likely to object if we reserve those slots.

Better take a look at the actual parameter lists to those calls before jumping 
to conclusions...

Regards,

Daniel


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Andrew Morton
David Teigland <[EMAIL PROTECTED]> wrote:
>
> On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> > David Teigland <[EMAIL PROTECTED]> wrote:
> > >
> > >  We export our full dlm API through read/write/poll on a misc device.
> > >
> > 
> > inotify did that for a while, but we ended up going with a straight syscall
> > interface.
> > 
> > How fat is the dlm interface?   ie: how many syscalls would it take?
> 
> Four functions:
>   create_lockspace()
>   release_lockspace()
>   lock()
>   unlock()

Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
is likely to object if we reserve those slots.


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread David Teigland
On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> David Teigland <[EMAIL PROTECTED]> wrote:
> >
> >  We export our full dlm API through read/write/poll on a misc device.
> >
> 
> inotify did that for a while, but we ended up going with a straight syscall
> interface.
> 
> How fat is the dlm interface?   ie: how many syscalls would it take?

Four functions:
  create_lockspace()
  release_lockspace()
  lock()
  unlock()

Dave



Re: GFS, what's remaining

2005-09-05 Thread David Teigland
On Mon, Sep 05, 2005 at 10:58:08AM +0200, Jörn Engel wrote:

> #define gfs2_assert(sdp, assertion) do {  \
>   if (unlikely(!(assertion))) {   \
>   printk(KERN_ERR "GFS2: fsid=%s\n", (sdp)->sd_fsname); \
>   BUG();  \
>   }   \
> } while (0)

OK thanks,
Dave



Re: GFS, what's remaining

2005-09-05 Thread Jörn Engel
On Mon, 5 September 2005 11:47:39 +0800, David Teigland wrote:
> 
> Joern already suggested moving this out of line and into a function (as it
> was before) to avoid repeating string constants.  In that case the
> function, file and line from BUG aren't useful.  We now have this, does it
> look ok?

Ok wrt. my concerns, but not with Greg's.  BUG() still gives you
everything that you need, except:
o fsid

Notice how this list is just one entry long? ;)

So how about


#define gfs2_assert(sdp, assertion) do {			\
	if (unlikely(!(assertion))) {				\
		printk(KERN_ERR "GFS2: fsid=%s\n", (sdp)->sd_fsname); \
		BUG();						\
	}							\
} while (0)


Or, to move the constant out of line again


void __gfs2_assert(struct gfs2_sbd *sdp) {
	printk(KERN_ERR "GFS2: fsid=%s\n", sdp->sd_fsname);
}

#define gfs2_assert(sdp, assertion) do {			\
	if (unlikely(!(assertion))) {				\
		__gfs2_assert(sdp);				\
		BUG();						\
	}							\
} while (0)


Jörn

-- 
Admonish your friends privately, but praise them openly.
-- Publilius Syrus 


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Andrew Morton
David Teigland <[EMAIL PROTECTED]> wrote:
>
>  We export our full dlm API through read/write/poll on a misc device.
>

inotify did that for a while, but we ended up going with a straight syscall
interface.

How fat is the dlm interface?   ie: how many syscalls would it take?



Re: GFS, what's remaining

2005-09-05 Thread Pekka Enberg
On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
> > +void gfs2_glock_hold(struct gfs2_glock *gl)
> > +{
> > + glock_hold(gl);
> > +}
> >
> > eh why?

On 9/5/05, David Teigland <[EMAIL PROTECTED]> wrote:
> You removed the comment stating exactly why, see below.  If that's not an
> accepted technique in the kernel, say so and I'll be happy to change it
> here and elsewhere.

Is there a reason why users of gfs2_glock_hold() cannot use
glock_hold() directly?

Pekka


Re: GFS, what's remaining

2005-09-05 Thread Theodore Ts'o
On Sun, Sep 04, 2005 at 10:33:44PM +0200, Pavel Machek wrote:
> Hi!
> 
> > - read-only mount
> > - "specatator" mount (like ro but no journal allocated for the mount,
> >   no fencing needed for failed node that was mounted as specatator)
> 
> I'd call it "real-read-only", and yes, that's very usefull
> mount. Could we get it for ext3, too?

This is a bit of a digression, but it's quite a bit different from
what ocfs2 is doing, where it is not necessary to replay the journal
in order to assure filesystem consistency.  

In the ext3 case, the only time when read-only isn't quite read-only
is when the filesystem was unmounted uncleanly and the journal needs
to be replayed in order for the filesystem to be consistent.  Mounting
the filesystem read-only without replaying the journal could and very
likely would result in the filesystem reporting filesystem consistency
problems, and if the filesystem is mounted with the reboot-on-errors
option, well

- Ted


Re: GFS, what's remaining

2005-09-05 Thread David Teigland
On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:

> +static unsigned int handle_roll(atomic_t *a)
> +{
> + int x = atomic_read(a);
> + if (x < 0) {
> + atomic_set(a, 0);
> + return 0;
> + }
> + return (unsigned int)x;
> +}
> 
> this is just plain scary.

Not really, it was just resetting atomic statistics counters when they
became negative.  Unnecessary, though, so removed.

Dave



Re: GFS, what's remaining

2005-09-05 Thread David Teigland
On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:

 +static unsigned int handle_roll(atomic_t *a)
 +{
 + int x = atomic_read(a);
 + if (x  0) {
 + atomic_set(a, 0);
 + return 0;
 + }
 + return (unsigned int)x;
 +}
 
 this is just plain scary.

Not really, it was just resetting atomic statistics counters when they
became negative.  Unecessary, though, so removed.

Dave

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Theodore Ts'o
On Sun, Sep 04, 2005 at 10:33:44PM +0200, Pavel Machek wrote:
 Hi!
 
  - read-only mount
  - specatator mount (like ro but no journal allocated for the mount,
no fencing needed for failed node that was mounted as specatator)
 
 I'd call it real-read-only, and yes, that's very usefull
 mount. Could we get it for ext3, too?

This is a bit of a degression, but it's quite a bit different from
what ocfs2 is doing, where it is not necessary to replay the journal
in order to assure filesystem consistency.  

In the ext3 case, the only time when read-only isn't quite read-only
is when the filesystem was unmounted uncleanly and the journal needs
to be replayed in order for the filesystem to be consistent.  Mounting
the filesystem read-only without replaying the journal could and very
likely would result in the filesystem reporting filesystem consistency
problems, and if the filesystem is mounted with the reboot-on-errors
option, well

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Pekka Enberg
On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
  +void gfs2_glock_hold(struct gfs2_glock *gl)
  +{
  + glock_hold(gl);
  +}
 
  eh why?

On 9/5/05, David Teigland [EMAIL PROTECTED] wrote:
 You removed the comment stating exactly why, see below.  If that's not a
 accepted technique in the kernel, say so and I'll be happy to change it
 here and elsewhere.

Is there a reason why users of gfs2_glock_hold() cannot use
glock_hold() directly?

Pekka
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Andrew Morton
David Teigland [EMAIL PROTECTED] wrote:

  We export our full dlm API through read/write/poll on a misc device.


inotify did that for a while, but we ended up going with a straight syscall
interface.

How fat is the dlm interface?   ie: how many syscalls would it take?

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Jörn Engel
On Mon, 5 September 2005 11:47:39 +0800, David Teigland wrote:
 
 Joern already suggested moving this out of line and into a function (as it
 was before) to avoid repeating string constants.  In that case the
 function, file and line from BUG aren't useful.  We now have this, does it
 look ok?

Ok wrt. my concerns, but not with Greg's.  BUG() still gives you
everything that you need, except:
o fsid

Notice how this list is just one entry long? ;)

So how about


#define gfs2_assert(sdp, assertion) do {\
if (unlikely(!(assertion))) {   \
printk(KERN_ERR GFS2: fsid=\n, (sdp)-sd_fsname); \
BUG();  \
} while (0)


Or, to move the constant out of line again


void __gfs2_assert(struct gfs2_sbd *sdp) {
printk(KERN_ERR GFS2: fsid=\n, sdp-sd_fsname);
}

#define gfs2_assert(sdp, assertion) do {\
if (unlikely(!(assertion))) {   \
__gfs2_assert(sdp); \
BUG();  \
} while (0)


Jörn

-- 
Admonish your friends privately, but praise them openly.
-- Publilius Syrus 
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread David Teigland
On Mon, Sep 05, 2005 at 10:58:08AM +0200, J?rn Engel wrote:

 #define gfs2_assert(sdp, assertion) do {  \
   if (unlikely(!(assertion))) {   \
   printk(KERN_ERR GFS2: fsid=\n, (sdp)-sd_fsname); \
   BUG();  \
 } while (0)

OK thanks,
Dave

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread David Teigland
On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
 David Teigland [EMAIL PROTECTED] wrote:
 
   We export our full dlm API through read/write/poll on a misc device.
 
 
 inotify did that for a while, but we ended up going with a straight syscall
 interface.
 
 How fat is the dlm interface?   ie: how many syscalls would it take?

Four functions:
  create_lockspace()
  release_lockspace()
  lock()
  unlock()

Dave

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Andrew Morton
David Teigland [EMAIL PROTECTED] wrote:

 On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
  David Teigland [EMAIL PROTECTED] wrote:
  
We export our full dlm API through read/write/poll on a misc device.
  
  
  inotify did that for a while, but we ended up going with a straight syscall
  interface.
  
  How fat is the dlm interface?   ie: how many syscalls would it take?
 
 Four functions:
   create_lockspace()
   release_lockspace()
   lock()
   unlock()

Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
is likely to object if we reserve those slots.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 05:19, Andrew Morton wrote:
 David Teigland [EMAIL PROTECTED] wrote:
  On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
   David Teigland [EMAIL PROTECTED] wrote:
 We export our full dlm API through read/write/poll on a misc device.
  
   inotify did that for a while, but we ended up going with a straight
   syscall interface.
  
   How fat is the dlm interface?   ie: how many syscalls would it take?
 
  Four functions:
create_lockspace()
release_lockspace()
lock()
unlock()

 Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
 is likely to object if we reserve those slots.

Better take a look at the actual parameter lists to those calls before jumping 
to conclusions...

Regards,

Daniel
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread David Teigland
On Mon, Sep 05, 2005 at 02:19:48AM -0700, Andrew Morton wrote:
 David Teigland [EMAIL PROTECTED] wrote:
  Four functions:
create_lockspace()
release_lockspace()
lock()
unlock()
 
 Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
 is likely to object if we reserve those slots.

Patrick is really the expert in this area and he's off this week, but
based on what he's done with the misc device I don't see why there'd be
more than two or three parameters for any of these.

Dave

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Stephen C. Tweedie
Hi,

On Sun, 2005-09-04 at 21:33, Pavel Machek wrote:

  - read-only mount
  - specatator mount (like ro but no journal allocated for the mount,
no fencing needed for failed node that was mounted as specatator)
 
 I'd call it real-read-only, and yes, that's very usefull
 mount. Could we get it for ext3, too?

I don't want to pollute the ext3 paths with extra checks for the case
when there's no journal struct at all.  But a dummy journal struct that
isn't associated with an on-disk journal and that can never, ever go
writable would certainly be pretty easy to do.

But mount -o readonly gives you most of what you want already.  An
always-readonly option would be different in some key ways --- for a
start, it would be impossible to perform journal recovery if that's
needed, as that still needs journal and superblock write access.  That's
not necessarily a good thing.

And you *still* wouldn't get something that could act as a spectator to
a filesystem mounted writable elsewhere on a SAN, because updates on the
other node wouldn't invalidate cached data on the readonly node.  So is
this really a useful combination?

About the only combination I can think of that really makes sense in
this context is if you have a busted filesystem that somehow can't be
recovered --- either the journal is broken or the underlying device is
truly readonly --- and you want to mount without recovery in order to
attempt to see what you can find.  That's asking for data corruption,
but that may be better than getting no data at all.  

But that is something that could be done with a -o skip-recovery mount
option, which would necessarily imply always-readonly behaviour.

--Stephen


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: real read-only [was Re: GFS, what's remaining]

2005-09-05 Thread Theodore Ts'o
On Mon, Sep 05, 2005 at 10:27:35AM +0200, Pavel Machek wrote:
 
 There's a better reason, too. I do swsusp. Then I'd like to boot with
 / mounted read-only (so that I can read my config files, some
 binaries, and maybe suspended image), but I absolutely may not write
 to disk at this point, because I still want to resume.
 

You could _hope_ that the filesystem is consistent enough that it is
safe to try to read config files, binaries, etc. without running the
journal, but there is absolutely no guarantee that this is the case.
I'm not sure you want to depend on that for swsusp.

One potential solution that would probably meet your needs is a dm
hack which reads in the blocks in the journal, and then uses the most
recent block in the journal in preference to the version on disk.

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Theodore Ts'o
On Mon, Sep 05, 2005 at 12:09:23AM -0700, Mark Fasheh wrote:
 Btw, I'm curious to know how useful folks find the ext3 mount options
 errors=continue and errors=panic. I'm extremely likely to implement the
 errors=read-only behavior as default in OCFS2 and I'm wondering whether the
 other two are worth looking into.

For a single-user system errors=panic is definitely very useful on the
system disk, since that's the only way that we can force an fsck, and
also abort a server that might be failing and returning erroneous
information to its clients.  Think of it is as i/o fencing when you're
not sure that the system is going to be performing correctly.

Whether or not this is useful for ocfs2 is a different matter.  If
it's only for data volumes, and if the only way to fix filesystem
inconsistencies on a cluster filesystem is to request all nodes in the
cluster to unmount the filesystem and then arrange to run ocfs2's fsck
on the filesystem, then forcing every single cluster in the node to
panic is probably counterproductive.  :-)

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Lars Marowsky-Bree
On 2005-09-03T09:27:41, Bernd Eckenfels [EMAIL PROTECTED] wrote:

 Oh thats interesting, I never thought about putting data files (tablespaces)
 in a clustered file system. Does that mean you can run supported RAC on
 shared ocfs2 files and anybody is using that?

That is the whole point why OCFS exists ;-)

 Do you see this go away with ASM?

No. Beyond the table spaces, there's also ORACLE_HOME; a cluster
benefits in several aspects from a general-purpose SAN-backed CFS.


Sincerely,
Lars Marowsky-Brée [EMAIL PROTECTED]

-- 
High Availability  Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
Ignorance more frequently begets confidence than does knowledge

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Lars Marowsky-Bree
On 2005-09-03T01:57:31, Daniel Phillips [EMAIL PROTECTED] wrote:

 The only current users of dlms are cluster filesystems.  There are zero users 
 of the userspace dlm api. 

That is incorrect, and you're contradicting yourself here:

 What does have to be resolved is a common API for node management.  It is not 
 just cluster filesystems and their lock managers that have to interface to 
 node management.  Below the filesystem layer, cluster block devices and 
 cluster volume management need to be coordinated by the same system, and 
 above the filesystem layer, applications also need to be hooked into it.  
 This work is, in a word, incomplete.

The Cluster Volume Management of LVM2 for example _does_ use simple
cluster-wide locks, and some OCFS2 scripts, I seem to recall, do too.

(EVMS2 in cluster-mode uses a verrry simple locking scheme which is
basically operated by the failover software and thus uses a different
model.)


Sincerely,
Lars Marowsky-Brée [EMAIL PROTECTED]

-- 
High Availability  Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
Ignorance more frequently begets confidence than does knowledge

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
 On 2005-09-03T01:57:31, Daniel Phillips [EMAIL PROTECTED] wrote:
  The only current users of dlms are cluster filesystems.  There are zero
  users of the userspace dlm api.

 That is incorrect...

Application users Lars, sorry if I did not make that clear.  The issue is 
whether we need to export an all-singing-all-dancing dlm api from kernel to 
userspace today, or whether we can afford to take the necessary time to get 
it right while application writers take their time to have a good think about 
whether they even need it.

 ...and you're contradicting yourself here:

How so?  Above talks about dlm, below talks about cluster membership.

  What does have to be resolved is a common API for node management.  It is
  not just cluster filesystems and their lock managers that have to
  interface to node management.  Below the filesystem layer, cluster block
  devices and cluster volume management need to be coordinated by the same
  system, and above the filesystem layer, applications also need to be
  hooked into it. This work is, in a word, incomplete.

Regards,

Daniel
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Dmitry Torokhov
On Monday 05 September 2005 10:49, Daniel Phillips wrote:
 On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
  On 2005-09-03T01:57:31, Daniel Phillips [EMAIL PROTECTED] wrote:
   The only current users of dlms are cluster filesystems.  There are zero
   users of the userspace dlm api.
 
  That is incorrect...
 
 Application users Lars, sorry if I did not make that clear.  The issue is 
 whether we need to export an all-singing-all-dancing dlm api from kernel to 
 userspace today, or whether we can afford to take the necessary time to get 
 it right while application writers take their time to have a good think about 
 whether they even need it.


If Linux fully supported OpenVMS DLM semantics we could start thinking asbout
moving our application onto a Linux box because our alpha server is aging.

That's just my user application writer $0.02.

-- 
Dmitry
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Alan Cox
On Llu, 2005-09-05 at 02:19 -0700, Andrew Morton wrote:
create_lockspace()
release_lockspace()
lock()
unlock()
 
 Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
 is likely to object if we reserve those slots.

If the locks are not file descriptors then answer the following:

- How are they ref counted
- What are the cleanup semantics
- How do I pass a lock between processes (AF_UNIX sockets wont work now)
- How do I poll on a lock coming free. 
- What are the semantics of lock ownership
- What rules apply for inheritance
- How do I access a lock across threads.
- What is the permission model. 
- How do I attach audit to it
- How do I write SELinux rules for it
- How do I use mount to make namespaces appear in multiple vservers

and thats for starters...

Every so often someone decides that a deeply un-unix interface with new
syscalls is a good idea. Every time history proves them totally bonkers.
There are cases for new system calls but this doesn't seem one of them.

Look at system 5 shared memory, look at system 5 ipc, and so on. You
can't use common interfaces on them, you can't select on them, you can't
sanely pass them by fd passing.

All our existing locking uses the following behaviour

fd = open(namespace, options)
fcntl(.. lock ...)
blah
flush
fcntl(.. unlock ...)
close

Unfortunately some people here seem to have forgotten WHY we do things
this way.

1.  The semantics of file descriptors are well understood by users and by
programs. That makes programming easier and keeps code size down
2.  Everyone knows how close() works including across fork
3.  FD passing is an obscure art but understood and just works
4.  Poll() is a standard understood interface
5.  Ownership of files is a standard model
6.  FD passing across fork/exec is controlled in a standard way
7.  The semantics for threaded applications are defined
8.  Permissions are a standard model
9.  Audit just works with the same tools
9.  SELinux just works with the same tools
10. I don't need specialist applications to see the system state (the
whole point of sysfs yet someone wants to break it all again)
11. fcntl fd locking is a posix standard interface with precisely
defined semantics. Our extensions including leases are very powerful
12. And yes - fcntl fd locking supports mandatory locking too. That also
is standards based with precise semantics.


Everyone understands how to use the existing locking operations. So if
you use the existing interfaces with some small extensions if neccessary
everyone understands how to use cluster locks. Isn't that neat


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Alan Cox
On Sad, 2005-09-03 at 21:46 -0700, Andrew Morton wrote:
 Actually I think it's rather sick.  Taking O_NONBLOCK and making it a
 lock-manager trylock because they're kinda-sorta-similar-sounding?  Spare
 me.  O_NONBLOCK means open this file in nonblocking mode, not attempt to
 acquire a clustered filesystem lock.  Not even close.

The semantics of O_NONBLOCK on many other devices are trylock
semantics. OSS audio has those semantics for example, as do regular
files in the presence of SYS5 mandatory locks. While the latter is try
lock , do operation and then drop lock the drivers using O_NDELAY are
very definitely providing trylock semantics.

I am curious why a lock manager uses open to implement its locking
semantics rather than using the locking API (POSIX locks etc) however.

Alan

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread kurt . hackel
On Mon, Sep 05, 2005 at 05:24:33PM +0800, David Teigland wrote:
 On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
  David Teigland [EMAIL PROTECTED] wrote:
  
We export our full dlm API through read/write/poll on a misc device.
  
  
  inotify did that for a while, but we ended up going with a straight syscall
  interface.
  
  How fat is the dlm interface?   ie: how many syscalls would it take?
 
 Four functions:
   create_lockspace()
   release_lockspace()
   lock()
   unlock()

FWIW, it looks like we can agree on the core interface.  ocfs2_dlm
exports essentially the same functions:
dlm_register_domain()
dlm_unregister_domain()
dlmlock()
dlmunlock()

I also implemented dlm_migrate_lockres() to explicitly remaster a lock
on another node, but this isn't used by any callers today (except for
debugging purposes).  There is also some wiring between the fs and the
dlm (eviction callbacks) to deal with some ordering issues between the
two layers, but these could go if we get stronger membership.

There are quite a few other functions in the full spec(1) that we
didn't even attempt, either because we didn't require direct 
user-kernel access or we just didn't need the function.  As for the
rather thick set of parameters expected in dlm calls, we managed to get
dlmlock down to *ahem* eight, and the rest are fairly slim.

Looking at the misc device that gfs uses, it seems like there is pretty
much complete interface to the same calls you have in kernel, validated
on the write() calls to the misc device.  With dlmfs, we were seeking to
lock down and simplify user access by using standard ast/bast/unlockast
calls, using a file descriptor as an opaque token for a single lock,
letting the vfs lifetime on this fd help with abnormal termination, etc.
I think both the misc device and dlmfs are helpful and not necessarily
mutually exclusive, and probably both are better approaches than
exporting everything via loads of syscalls (which seems to be the 
VMS/opendlm model).

-kurt

1. http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf


Kurt C. Hackel
Oracle
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Andrew Morton
Alan Cox [EMAIL PROTECTED] wrote:

 On Llu, 2005-09-05 at 02:19 -0700, Andrew Morton wrote:
  create_lockspace()
  release_lockspace()
  lock()
  unlock()
   
   Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
   is likely to object if we reserve those slots.
 
  If the locks are not file descriptors then answer the following:
 
  - How are they ref counted
  - What are the cleanup semantics
  - How do I pass a lock between processes (AF_UNIX sockets wont work now)
  - How do I poll on a lock coming free. 
  - What are the semantics of lock ownership
  - What rules apply for inheritance
  - How do I access a lock across threads.
  - What is the permission model. 
  - How do I attach audit to it
  - How do I write SELinux rules for it
  - How do I use mount to make namespaces appear in multiple vservers
 
  and thats for starters...

Return an fd from create_lockspace().
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Bernd Eckenfels
On Mon, Sep 05, 2005 at 04:16:31PM +0200, Lars Marowsky-Bree wrote:
 That is the whole point why OCFS exists ;-)

The whole point of the orcacle cluster filesystem as it was described in old
papers was about pfiles, control files and software, because you can easyly
use direct block access (with ASM) for tablespaces.

 No. Beyond the table spaces, there's also ORACLE_HOME; a cluster
 benefits in several aspects from a general-purpose SAN-backed CFS.

Yes, I dont dispute the usefullness of OCFS for ORA_HOME (beside I think a
replicated filesystem makes more sense), I am just nor sure if anybody sane
would use it for tablespaces.

I guess I have to correct the artile in my german it blog :) (if somebody
can name productive customers).

Gruss
Bernd
-- 
http://itblog.eckenfels.net/archives/54-Cluster-Filesysteme.html
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Kurt Hackel
On Mon, Sep 05, 2005 at 10:24:03PM +0200, Bernd Eckenfels wrote:
 On Mon, Sep 05, 2005 at 04:16:31PM +0200, Lars Marowsky-Bree wrote:
  That is the whole point why OCFS exists ;-)
 
 The whole point of the orcacle cluster filesystem as it was described in old
 papers was about pfiles, control files and software, because you can easyly
 use direct block access (with ASM) for tablespaces.

The original OCFS was intended for use with pfiles and control files but
very definitely *not* software (the ORACLE_HOME).  It was not remotely
general purpose.  It also predated ASM by about a year or so, and the
two solutions are complementary.  Either one is a good choice for Oracle
datafiles, depending upon your needs.

  No. Beyond the table spaces, there's also ORACLE_HOME; a cluster
  benefits in several aspects from a general-purpose SAN-backed CFS.
 
 Yes, I dont dispute the usefullness of OCFS for ORA_HOME (beside I think a
 replicated filesystem makes more sense), I am just nor sure if anybody sane
 would use it for tablespaces.

Too many to mention here, but let's just say that some of the largest
databases are running Oracle datafiles on top of OCFS1.  Very large
companies with very important data.

 I guess I have to correct the artile in my german it blog :) (if somebody
 can name productive customers).

Yeah you should definitely update your blog ;-)  If you need named
references, we can give you loads of those.

-kurt

Kurt C. Hackel
Oracle
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Alan Cox
On Llu, 2005-09-05 at 12:53 -0700, Andrew Morton wrote:
   - How are they ref counted
   - What are the cleanup semantics
   - How do I pass a lock between processes (AF_UNIX sockets wont work now)
   - How do I poll on a lock coming free. 
   - What are the semantics of lock ownership
   - What rules apply for inheritance
   - How do I access a lock across threads.
   - What is the permission model. 
   - How do I attach audit to it
   - How do I write SELinux rules for it
   - How do I use mount to make namespaces appear in multiple vservers
  
   and thats for starters...
 
 Return an fd from create_lockspace().

That only answers about four of the questions. The rest only come out if
create_lockspace behaves like a file system - in other words
create_lockspace is better known as either mkdir or mount.

It's certainly viable to make the lock/unlock functions take an fd; it's
just not clear why the current lock/unlock functions we have won't do
the job. Being able to extend the functionality to leases later on may
be very powerful indeed and will fit the existing API.



Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Andrew Morton
Alan Cox [EMAIL PROTECTED] wrote:

 On Llu, 2005-09-05 at 12:53 -0700, Andrew Morton wrote:
 - How are they ref counted
 - What are the cleanup semantics
 - How do I pass a lock between processes (AF_UNIX sockets won't work now)
 - How do I poll on a lock coming free. 
 - What are the semantics of lock ownership
 - What rules apply for inheritance
 - How do I access a lock across threads.
 - What is the permission model. 
 - How do I attach audit to it
 - How do I write SELinux rules for it
 - How do I use mount to make namespaces appear in multiple vservers

 and thats for starters...
   
   Return an fd from create_lockspace().
 
  That only answers about four of the questions. The rest only come out if
  create_lockspace behaves like a file system - in other words
  create_lockspace is better known as either mkdir or mount.

But David said that "We export our full dlm API through read/write/poll on
a misc device".  That miscdevice will simply give us an fd.  Hence my
suggestion that the miscdevice be done away with in favour of a dedicated
syscall which returns an fd.

What does a filesystem have to do with this?


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Joel Becker
On Sun, Sep 04, 2005 at 09:37:15AM +0100, Alan Cox wrote:
 I am curious why a lock manager uses open to implement its locking
 semantics rather than using the locking API (POSIX locks etc) however.

Because it is simple (how do you fcntl(2) from a shell fd?), has no
ranges (what do you do with ranges passed in to fcntl(2) and you don't
support them?), and has a well-known fork(2)/exec(2) pattern.  fcntl(2)
has a known but less intuitive fork(2) pattern.
The real reason, though, is that we never considered fcntl(2).
We could never think of a case when a process wanted a lock fd open but
not locked.  At least, that's my recollection.  Mark might have more to
comment.
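
For contrast, the fcntl(2) byte-range interface being passed over here looks
roughly like the minimal sketch below (the lock-file path is illustrative);
the range fields and the fork/close behaviour noted in the comment are the
awkward parts alluded to above.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	struct flock fl = {
		.l_type   = F_WRLCK,	/* exclusive lock */
		.l_whence = SEEK_SET,
		.l_start  = 0,
		.l_len    = 0,		/* 0 = to end of file: callers must think in ranges */
	};
	int fd = open("/tmp/lock1", O_CREAT | O_RDWR, 0644);	/* illustrative path */

	if (fd < 0 || fcntl(fd, F_SETLKW, &fl) < 0) {	/* block until granted */
		perror("lock");
		return 1;
	}
	/*
	 * The caveats: POSIX record locks are per-process, are not inherited
	 * across fork(2), and are dropped as soon as the process closes *any*
	 * descriptor referring to the locked file.
	 */
	close(fd);
	return 0;
}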

Joel

-- 

In the room the women come and go
 Talking of Michaelangelo.

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [EMAIL PROTECTED]
Phone: (650) 506-8127



Re: GFS, what's remaining

2005-09-05 Thread Joel Becker
On Mon, Sep 05, 2005 at 10:24:03PM +0200, Bernd Eckenfels wrote:
 The whole point of the Oracle cluster filesystem as it was described in old
 papers was about pfiles, control files and software, because you can easily
 use direct block access (with ASM) for tablespaces.

OCFS, the original filesystem, only works for datafiles,
logfiles, and other database data.  It's currently used in serious anger
by several major customers.  Oracle's websites must have a list of them
somewhere.  We're talking many terabytes of datafiles.

 Yes, I don't dispute the usefulness of OCFS for ORA_HOME (besides, I think a
 replicated filesystem makes more sense), I am just not sure if anybody sane
 would use it for tablespaces.

OCFS2, the new filesystem, is fully general purpose.  It
supports all the usual stuff, is quite fast, and is what we expect folks
to use for both ORACLE_HOME and datafiles in the future.  Customers can,
of course, use ASM or even raw devices.  OCFS2 is as fast as raw
devices, and far more manageable, so raw devices are probably not a
choice for the future.  ASM has its own management advantages, and we
certainly expect customers to like it as well.  But that doesn't mean
people won't use OCFS2 for datafiles depending on their environment or
needs.


-- 

The first requisite of a good citizen in this republic of ours
 is that he shall be able and willing to pull his weight.
- Theodore Roosevelt

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [EMAIL PROTECTED]
Phone: (650) 506-8127


Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
 On Monday 05 September 2005 10:49, Daniel Phillips wrote:
  On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
   On 2005-09-03T01:57:31, Daniel Phillips [EMAIL PROTECTED] wrote:
The only current users of dlms are cluster filesystems.  There are
zero users of the userspace dlm api.
  
   That is incorrect...
 
  Application users Lars, sorry if I did not make that clear.  The issue is
  whether we need to export an all-singing-all-dancing dlm api from kernel
  to userspace today, or whether we can afford to take the necessary time
  to get it right while application writers take their time to have a good
  think about whether they even need it.

 If Linux fully supported OpenVMS DLM semantics we could start thinking
 about moving our application onto a Linux box because our alpha server is
 aging.

 That's just my user application writer $0.02.

What stops you from trying it with the patch?  That kind of feedback would be 
worth way more than $0.02.

Regards,

Daniel


Re: GFS, what's remaining

2005-09-05 Thread Dmitry Torokhov
On Monday 05 September 2005 19:57, Daniel Phillips wrote:
 On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
  On Monday 05 September 2005 10:49, Daniel Phillips wrote:
   On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
On 2005-09-03T01:57:31, Daniel Phillips [EMAIL PROTECTED] wrote:
 The only current users of dlms are cluster filesystems.  There are
 zero users of the userspace dlm api.
   
That is incorrect...
  
   Application users Lars, sorry if I did not make that clear.  The issue is
   whether we need to export an all-singing-all-dancing dlm api from kernel
   to userspace today, or whether we can afford to take the necessary time
   to get it right while application writers take their time to have a good
   think about whether they even need it.
 
  If Linux fully supported OpenVMS DLM semantics we could start thinking
  about moving our application onto a Linux box because our alpha server is
  aging.
 
  That's just my user application writer $0.02.
 
 What stops you from trying it with the patch?  That kind of feedback would be 
 worth way more than $0.02.


We do not have such plans at the moment and I prefer spending my free
time on tinkering with the kernel, not rewriting some in-house application.
Besides, DLM is not the only thing that does not have a drop-in
replacement in Linux.

You just said you did not know if there are any potential users for the
full DLM and I said there are some.

-- 
Dmitry


Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 22:03, Dmitry Torokhov wrote:
 On Monday 05 September 2005 19:57, Daniel Phillips wrote:
  On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
   On Monday 05 September 2005 10:49, Daniel Phillips wrote:
On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
 On 2005-09-03T01:57:31, Daniel Phillips [EMAIL PROTECTED] wrote:
  The only current users of dlms are cluster filesystems.  There
  are zero users of the userspace dlm api.

 That is incorrect...
   
Application users Lars, sorry if I did not make that clear.  The
issue is whether we need to export an all-singing-all-dancing dlm api
from kernel to userspace today, or whether we can afford to take the
necessary time to get it right while application writers take their
time to have a good think about whether they even need it.
  
   If Linux fully supported OpenVMS DLM semantics we could start thinking
    about moving our application onto a Linux box because our alpha server
   is aging.
  
   That's just my user application writer $0.02.
 
  What stops you from trying it with the patch?  That kind of feedback
  would be worth way more than $0.02.

 We do not have such plans at the moment and I prefer spending my free
 time on tinkering with the kernel, not rewriting some in-house application.
 Besides, DLM is not the only thing that does not have a drop-in
 replacement in Linux.

 You just said you did not know if there are any potential users for the
 full DLM and I said there are some.

I did not say "potential", I said there are zero dlm applications at the
moment.  Nobody has picked up the prototype (g)dlm api, used it in an
application and said "gee, this works great, look what it does".

I also claim that most developers who think that using a dlm for application 
synchronization would be really cool are probably wrong.  Use sockets for 
synchronization exactly as for a single-node, multi-tasking application and 
you will end up with less code, more obviously correct code, probably more 
efficient and... you get an optimal, single-node version for free.
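
To illustrate the sockets argument, here is a minimal sketch (not from the
thread; the socket path is invented) of a "connection equals lock" server on
a UNIX socket: whoever holds an open connection holds the lock, and if that
process dies the kernel closes the socket and the next accept() grants the
lock to the next waiter.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
{
	struct sockaddr_un addr;
	char buf[64];
	int srv = socket(AF_UNIX, SOCK_STREAM, 0);

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strncpy(addr.sun_path, "/tmp/lock1.sock", sizeof(addr.sun_path) - 1);
	unlink(addr.sun_path);
	if (srv < 0 || bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(srv, 16) < 0) {
		perror("lock server");
		return 1;
	}
	for (;;) {
		int client = accept(srv, NULL, NULL);	/* grant the lock */

		if (client < 0)
			continue;
		/*
		 * The client holds the lock for as long as its connection is
		 * open.  If it exits (or is killed -9) the kernel closes the
		 * socket, read() sees EOF, and the next waiter gets the lock
		 * on the next accept().
		 */
		while (read(client, buf, sizeof(buf)) > 0)
			;
		close(client);
	}
}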

And I also claim that there is precious little reason to have a full-featured 
dlm in-kernel.  Being in-kernel has no benefit for a userspace application.  
But being in-kernel does add kernel bloat, because there will be extra 
features lathered on that are not needed by the only in-kernel user, the 
cluster filesystem.

In the case of your port, you'd be better off hacking up a userspace library 
to provide OpenVMS dlm semantics exactly, not almost.

By the way, you said alpha server not alpha servers, was that just a slip?  
Because if you don't have a cluster then why are you using a dlm?

Regards,

Daniel


Re: GFS, what's remainingh

2005-09-05 Thread Dmitry Torokhov
On Monday 05 September 2005 23:02, Daniel Phillips wrote:
 
 By the way, you said alpha server not alpha servers, was that just a 
 slip?  
 Because if you don't have a cluster then why are you using a dlm?


No, it is not a slip. The application is running on just one node, so we
do not really use the distributed part. However, we make heavy use of the
rest of the lock manager features, especially lock value blocks.

-- 
Dmitry


Re: GFS, what's remainingh

2005-09-05 Thread Daniel Phillips
On Tuesday 06 September 2005 00:07, Dmitry Torokhov wrote:
 On Monday 05 September 2005 23:02, Daniel Phillips wrote:
  By the way, you said alpha server not alpha servers, was that just a
  slip? Because if you don't have a cluster then why are you using a dlm?

 No, it is not a slip. The application is running on just one node, so we
 do not really use the distributed part. However, we make heavy use of the
 rest of the lock manager features, especially lock value blocks.

Urk, so you imprinted on the clunkiest, most pathetically limited dlm feature 
without even having the excuse you were forced to use it.  Why don't you just 
have a daemon that sends your values over a socket?  That should be all of a 
day's coding.

Anyway, thanks for sticking your head up, and sorry if it sounds aggressive. 
But you nicely supported my claim that most who think they should be using a 
dlm, really shouldn't.

Regards,

Daniel


Re: GFS, what's remainingh

2005-09-05 Thread Dmitry Torokhov
On Monday 05 September 2005 23:58, Daniel Phillips wrote:
 On Tuesday 06 September 2005 00:07, Dmitry Torokhov wrote:
  On Monday 05 September 2005 23:02, Daniel Phillips wrote:
   By the way, you said alpha server not alpha servers, was that just a
   slip? Because if you don't have a cluster then why are you using a dlm?
 
  No, it is not a slip. The application is running on just one node, so we
  do not really use the distributed part. However, we make heavy use of the
  rest of the lock manager features, especially lock value blocks.
 
 Urk, so you imprinted on the clunkiest, most pathetically limited dlm feature 
 without even having the excuse you were forced to use it.  Why don't you just 
 have a daemon that sends your values over a socket?  That should be all of a 
 day's coding.


Umm, because when most of the code was written, TCP and the rest were the
clunkiest code out there? Plus, having a daemon introduces problems with
cleanup (say the process dies for one reason or another) whereas having it
in the OS takes care of that.
 
 Anyway, thanks for sticking your head up, and sorry if it sounds aggressive. 
 But you nicely supported my claim that most who think they should be using a 
 dlm, really shouldn't.

Heh, do you think it is a bit premature to dismiss something even without
ever seeing the code?

-- 
Dmitry


Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 19:37, Joel Becker wrote:
  OCFS2, the new filesystem, is fully general purpose.  It
 supports all the usual stuff, is quite fast...

So I have heard, but isn't it time to quantify that?  How do you think you 
would stack up here:

   http://www.caspur.it/Files/2005/01/10/1105354214692.pdf

Regards,

Daniel


Re: GFS, what's remaining

2005-09-04 Thread David Teigland
On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:

> +void gfs2_glock_hold(struct gfs2_glock *gl)
> +{
> + glock_hold(gl);
> +}
> 
> eh why?

You removed the comment stating exactly why, see below.  If that's not an
accepted technique in the kernel, say so and I'll be happy to change it
here and elsewhere.
Thanks,
Dave

static inline void glock_hold(struct gfs2_glock *gl)
{
	gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0);
	atomic_inc(&gl->gl_count);
}

/**
 * gfs2_glock_hold() - As glock_hold(), but suitable for exporting
 * @gl: The glock to hold
 *
 */

void gfs2_glock_hold(struct gfs2_glock *gl)
{
	glock_hold(gl);
}



Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread David Teigland
On Sat, Sep 03, 2005 at 10:41:40PM -0700, Andrew Morton wrote:
> Joel Becker <[EMAIL PROTECTED]> wrote:
> >
> >  > What happens when we want to add some new primitive which has no
> >  > posix-file analog?
> > 
> > The point of dlmfs is not to express every primitive that the
> >  DLM has.  dlmfs cannot express the CR, CW, and PW levels of the VMS
> >  locking scheme.  Nor should it.  The point isn't to use a filesystem
> >  interface for programs that need all the flexibility and power of the
> >  VMS DLM.  The point is a simple system that programs needing the basic
> >  operations can use.  Even shell scripts.
> 
> Are you saying that the posix-file lookalike interface provides access to
> part of the functionality, but there are other APIs which are used to
> access the rest of the functionality?  If so, what is that interface, and
> why cannot that interface offer access to 100% of the functionality, thus
> making the posix-file tricks unnecessary?

We're using our dlm quite a bit in user space and require the full dlm
API.  It's difficult to export the full API through a pseudo fs like
dlmfs, so we've not found it a very practical approach.  That said, it's a
nice idea and I'd be happy if someone could map a more complete dlm API
onto it.

We export our full dlm API through read/write/poll on a misc device.  All
user space apps use the dlm through a library as you'd expect.  The
library communicates with the dlm_device kernel module through
read/write/poll and the dlm_device module talks with the actual dlm:
linux/drivers/dlm/device.c  If there's a better way to do this, via a
pseudo fs or not, we'd be pleased to try it.
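
For readers unfamiliar with that style of interface, the sketch below shows
only the general shape of a read/write/poll API from the library side: queue
a request with write(), wait for the asynchronous completion with poll(),
collect it with read().  It is not the actual dlm_device message format
(that lives in linux/drivers/dlm/device.c); the request and reply layouts
here are placeholders.

#include <poll.h>
#include <unistd.h>

ssize_t submit_and_wait(int fd, const void *req, size_t req_len,
			void *reply, size_t reply_len)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	if (write(fd, req, req_len) < 0)	/* queue the asynchronous request */
		return -1;
	if (poll(&pfd, 1, -1) < 0)		/* block until a completion is ready */
		return -1;
	return read(fd, reply, reply_len);	/* collect the completion */
}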

Dave



Re: GFS, what's remaining

2005-09-04 Thread David Teigland
On Fri, Sep 02, 2005 at 10:28:21PM -0700, Greg KH wrote:
> On Fri, Sep 02, 2005 at 05:44:03PM +0800, David Teigland wrote:
> > On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
> > 
> > > + gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,);
> > 
> > > what is gfs2_assert() about anyway? please just use BUG_ON directly
> > > everywhere
> > 
> > When a machine has many gfs file systems mounted at once it can be useful
> > to know which one failed.  Does the following look ok?
> > 
> > #define gfs2_assert(sdp, assertion)   \
> > do {  \
> > if (unlikely(!(assertion))) { \
> > printk(KERN_ERR   \
> > "GFS2: fsid=%s: fatal: assertion \"%s\" failed\n" \
> > "GFS2: fsid=%s:   function = %s\n"\
> > "GFS2: fsid=%s:   file = %s, line = %u\n" \
> > "GFS2: fsid=%s:   time = %lu\n",  \
> > sdp->sd_fsname, # assertion,  \
> > sdp->sd_fsname,  __FUNCTION__,\
> > sdp->sd_fsname, __FILE__, __LINE__,   \
> > sdp->sd_fsname, get_seconds());   \
> > BUG();\
> 
> You will already get the __FUNCTION__ (and hence the __FILE__ info)
> directly from the BUG() dump, as well as the time from the syslog
> message (turn on the printk timestamps if you want a more fine grain
> timestamp), so the majority of this macro is redundant with the BUG()
> macro...

Joern already suggested moving this out of line and into a function (as it
was before) to avoid repeating string constants.  In that case the
function, file and line from BUG aren't useful.  We now have this, does it
look ok?

void gfs2_assert_i(struct gfs2_sbd *sdp, char *assertion, const char *function,
		   char *file, unsigned int line)
{
	panic("GFS2: fsid=%s: fatal: assertion \"%s\" failed\n"
	      "GFS2: fsid=%s:   function = %s, file = %s, line = %u\n",
	      sdp->sd_fsname, assertion,
	      sdp->sd_fsname, function, file, line);
}

#define gfs2_assert(sdp, assertion) \
do { \
	if (unlikely(!(assertion))) { \
		gfs2_assert_i((sdp), #assertion, \
			      __FUNCTION__, __FILE__, __LINE__); \
	} \
} while (0)
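
As a side note, the space saving Joern was after comes from keeping a single
out-of-line copy of the format strings.  A userspace analogue of the same
pattern (invented names, not GFS2 code) is easy to play with:

#include <stdio.h>
#include <stdlib.h>

/* One out-of-line copy of the format strings. */
static void assert_fail(const char *fsname, const char *assertion,
			const char *function, const char *file,
			unsigned int line)
{
	fprintf(stderr, "fsid=%s: fatal: assertion \"%s\" failed\n"
			"fsid=%s:   function = %s, file = %s, line = %u\n",
		fsname, assertion, fsname, function, file, line);
	abort();
}

/* The macro only carries the condition and the call-site location. */
#define my_assert(fsname, assertion) \
do { \
	if (!(assertion)) \
		assert_fail((fsname), #assertion, __func__, __FILE__, __LINE__); \
} while (0)

int main(void)
{
	int count = 1;

	my_assert("example", count > 0);	/* passes silently */
	my_assert("example", count < 0);	/* prints the message and aborts */
	return 0;
}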



Re: GFS, what's remaining

2005-09-04 Thread Joel Becker
On Sun, Sep 04, 2005 at 10:33:44PM +0200, Pavel Machek wrote:
> > - read-only mount
> > - "spectator" mount (like ro but no journal allocated for the mount,
> >   no fencing needed for failed node that was mounted as spectator)
> 
> I'd call it "real-read-only", and yes, that's a very useful
> mount. Could we get it for ext3, too?

In OCFS2 we call readonly+journal+connected-to-cluster "soft
readonly".  We're a live node, other nodes know we exist, and we can
flush pending transactions during the rw->ro transition.  In addition,
we can allow a ro->rw transition.
The no-journal+no-cluster-connection mode we call "hard
readonly".  This is the mode you get when a device itself is readonly,
because you can't do *anything*.

Joel

-- 

"Lately I've been talking in my sleep.
 Can't imagine what I'd have to say.
 Except my world will be right
 When love comes back my way."

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [EMAIL PROTECTED]
Phone: (650) 506-8127


Re: GFS, what's remaining

2005-09-04 Thread Pavel Machek
Hi!

> - read-only mount
> - "spectator" mount (like ro but no journal allocated for the mount,
>   no fencing needed for failed node that was mounted as spectator)

I'd call it "real-read-only", and yes, that's a very useful
mount. Could we get it for ext3, too?
Pavel
-- 
if you have sharp zaurus hardware you don't need... you know my address


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Daniel Phillips
On Sunday 04 September 2005 03:28, Andrew Morton wrote:
> If there is already a richer interface into all this code (such as a
> syscall one) and it's feasible to migrate the open() tricksies to that API
> in the future if it all comes unstuck then OK.  That's why I asked (thus
> far unsuccessfully):
>
>Are you saying that the posix-file lookalike interface provides
>access to part of the functionality, but there are other APIs which are
>used to access the rest of the functionality?  If so, what is that
>interface, and why cannot that interface offer access to 100% of the
>functionality, thus making the posix-file tricks unnecessary?

There is no such interface at the moment, nor is one needed in the immediate 
future.  Let's look at the arguments for exporting a dlm to userspace:

  1) Since we already have a dlm in kernel, why not just export that and save
 100K of userspace library?  Answer: because we don't want userspace-only
 dlm features bulking up the kernel.  Answer #2: the extra syscalls and
 interface baggage serve no useful purpose.

  2) But we need to take locks in the same lockspaces as the kernel dlm(s)!
 Answer: only support tools need to do that.  A cut-down locking api is
 entirely appropriate for this.

  3) But the kernel dlm is the only one we have!  Answer: easily fixed, a
 simple matter of coding.  But please bear in mind that dlm-style
 synchronization is probably a bad idea for most cluster applications,
 particularly ones that already do their synchronization via sockets.

In other words, exporting the full dlm api is a red herring.  It has nothing 
to do with getting cluster filesystems up and running.  It is really just 
marketing: it sounds like a great thing for userspace to get a dlm "for 
free", but it isn't free, it contributes to kernel bloat and it isn't even 
the most efficient way to do it.

If after considering that, we _still_ want to export a dlm api from kernel, 
then can we please take the necessary time and get it right?  The full api 
requires not only syscall-style elements, but asynchronous events as well, 
similar to aio.  I do not think anybody has a good answer to this today, nor 
do we even need it to begin porting applications to cluster filesystems.

Oracle guys: what is the distributed locking API for RAC?  Is the RAC team 
waiting with bated breath to adopt your kernel-based dlm?  If not, why not?

Regards,

Daniel


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Hua Zhong
> takelock domainxxx lock1
> do stuff
> droplock domainxxx lock1
> 
> When someone kills the shell, the lock is leaked, because droplock isn't
> called.

Why not open the lock resource (or the lock space) as a file instead of
individual locks? It then looks like this:

open lock space file
takelock lockresource lock1
do stuff
droplock lockresource lock1
close lock space file

Then if you are killed, the ->release of the lock space file should take
care of cleaning up all the locks.
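
The cleanup-on-death property being described can be seen with plain
flock(2) on a local file; a minimal sketch (illustrative path, not the
proposed lockspace interface): the lock belongs to the open file, so exiting
or being killed releases it without any explicit droplock.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>

int main(int argc, char **argv)
{
	int fd = open("/tmp/lock1", O_CREAT | O_RDWR, 0644);	/* illustrative path */

	if (fd < 0 || flock(fd, LOCK_EX) < 0) {		/* blocks until granted */
		perror("lock");
		return 1;
	}
	if (argc > 1)
		system(argv[1]);	/* "do stuff" while the lock is held */
	/*
	 * Whether we return normally, call _exit(), or get killed with -9,
	 * the kernel drops the lock when the last reference to the open file
	 * goes away; there is no explicit droplock step to forget.
	 */
	close(fd);
	return 0;
}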


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Joel Becker
On Sun, Sep 04, 2005 at 02:18:36AM -0700, Andrew Morton wrote:
>   take-and-drop-lock -d domainxxx -l lock1 -e "do stuff"

Ahh, but then you have to have lots of scripts somewhere in
the path, or do massive inline scripts.  Especially if you want to take
another lock in there somewhere.
It's doable, but it's nowhere near as easy. :-)

Joel

-- 

"I always thought the hardest questions were those I could not answer.
 Now I know they are the ones I can never ask."
- Charlie Watkins

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [EMAIL PROTECTED]
Phone: (650) 506-8127



Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Andrew Morton
Joel Becker <[EMAIL PROTECTED]> wrote:
>
>   I can't see how that works easily.  I'm not worried about a
>  tarball (eventually Red Hat and SuSE and Debian would have it).  I'm
>  thinking about this shell:
> 
>	exec 7<[the lock file under the domain]
>	do stuff
>	exec 7<&-
> 
>  If someone kills the shell while "stuff" is running, the lock is unlocked
>  because fd 7 is closed.  However, if you have an application to do the
>  locking:
> 
>   takelock domainxxx lock1
>   do stuff
>   droplock domainxxx lock1
> 
>  When someone kills the shell, the lock is leaked, because droplock isn't
>  called.  And SEGV/QUIT/-9 (especially -9, folks love it too much) are
>  handled by the first example but not by the second.


take-and-drop-lock -d domainxxx -l lock1 -e "do stuff"
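
A sketch of what such a wrapper could look like, assuming dlmfs-style
semantics (opening a file under the lock domain read-write blocks until an
exclusive lock is granted; closing it, or dying, drops the lock).  The mount
point and the argument handling are illustrative only.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
	char path[4096];
	int fd, status;
	pid_t pid;

	if (argc < 4) {
		fprintf(stderr, "usage: %s <domain> <lock> <command> [args...]\n",
			argv[0]);
		return 2;
	}
	snprintf(path, sizeof(path), "/dlm/%s/%s", argv[1], argv[2]);
	fd = open(path, O_CREAT | O_RDWR, 0600);	/* take the lock */
	if (fd < 0) {
		perror(path);
		return 1;
	}
	pid = fork();
	if (pid < 0) {
		perror("fork");
		return 1;
	}
	if (pid == 0) {
		execvp(argv[3], &argv[3]);		/* "do stuff" */
		perror("execvp");
		_exit(127);
	}
	waitpid(pid, &status, 0);
	close(fd);	/* drop the lock; also happens if this process dies */
	return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}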


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Joel Becker
On Sun, Sep 04, 2005 at 01:18:05AM -0700, Andrew Morton wrote:
> > I thought I stated this in my other email.  We're not intending
> > to extend dlmfs.
> 
> Famous last words ;)

Heh, of course :-)

> I don't buy the general "fs is nice because we can script it" argument,
> really.  You can just write a few simple applications which provide access
> to the syscalls (or the fs!) and then write scripts around those.

I can't see how that works easily.  I'm not worried about a
tarball (eventually Red Hat and SuSE and Debian would have it).  I'm
thinking about this shell:

	exec 7<[the lock file under the domain]
	do stuff
	exec 7<&-

If someone kills the shell while "stuff" is running, the lock is unlocked
because fd 7 is closed.  However, if you have an application to do the
locking:

	takelock domainxxx lock1
	do stuff
	droplock domainxxx lock1

When someone kills the shell, the lock is leaked, because droplock isn't
called.  And SEGV/QUIT/-9 (especially -9, folks love it too much) are
handled by the first example but not by the second.

