Re: GFS, what's remaining

2005-09-07 Thread David Teigland
On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
> +static inline void glock_put(struct gfs2_glock *gl)
> +{
> +	if (atomic_read(&gl->gl_count) == 1)
> +		gfs2_glock_schedule_for_reclaim(gl);
> +	gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,);
> +	atomic_dec(&gl->gl_count);
> +}
> 
> this code has a race

The first two lines of the function with the race are non-essential and
could be removed.  In the common case where there's no race, they just add
efficiency by moving the glock to the reclaim list immediately.
Otherwise, the scand thread would do it later when actively trying to
reclaim glocks.
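
For what it's worth, a minimal sketch of a race-free variant, assuming the
only intent is to schedule reclaim once the final reference is dropped
(atomic_dec_and_test() instead of read-then-dec):

static inline void glock_put(struct gfs2_glock *gl)
{
	gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,);
	/* decrement and test atomically instead of reading, then decrementing */
	if (atomic_dec_and_test(&gl->gl_count))
		gfs2_glock_schedule_for_reclaim(gl);
}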

> +static inline int queue_empty(struct gfs2_glock *gl, struct list_head *head)
> +{
> +	int empty;
> +	spin_lock(&gl->gl_spin);
> +	empty = list_empty(head);
> +	spin_unlock(&gl->gl_spin);
> +	return empty;
> +}
> 
> that looks like a racey interface to me... if so.. why bother locking at
> all?

The spinlock protects the list but is not the primary method of
synchronizing processes that are working with a glock.

When the list is in fact empty, there will be no race, and the locking
wouldn't be necessary.  In this case, the "glmutex" in the code fragment
below is preventing any change in the list, so we can safely release the
spinlock immediately.

When the list is not empty, then a process could be adding another entry
to the list without "glmutex" locked [1], making the spinlock necessary.
In this case we quit after queue_empty() returns and don't do anything
else, so releasing the spinlock immediately was still safe.

[1] A process that already holds a glock (i.e. has a "holder" struct on
the gl_holders list) is allowed to hold it again by adding another holder
struct to the same list.  It adds the second hold without locking glmutex.

if (gfs2_glmutex_trylock(gl)) {
	if (gl->gl_ops == &gfs2_inode_glops) {
		struct gfs2_inode *ip = get_gl2ip(gl);
		if (ip && !atomic_read(&ip->i_count))
			gfs2_inode_destroy(ip);
	}
	if (queue_empty(gl, &gl->gl_holders) &&
	    gl->gl_state != LM_ST_UNLOCKED)
		handle_callback(gl, LM_ST_UNLOCKED);

	gfs2_glmutex_unlock(gl);
}

There is a second way that queue_empty() is used, and that's within
assertions that the list is empty.  If the assertion is correct, locking
isn't necessary; locking is only needed if there's already another bug
causing the list to not be empty and the assertion to fail.
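
For example (illustrative only -- the exact call sites may differ), such an
assertion might look like:

	gfs2_assert(gl->gl_sbd, queue_empty(gl, &gl->gl_holders),);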

> static int gi_skeleton(struct gfs2_inode *ip, struct gfs2_ioctl *gi,
> +		       gi_filler_t filler)
> +{
> +	unsigned int size = gfs2_tune_get(ip->i_sbd, gt_lockdump_size);
> +	char *buf;
> +	unsigned int count = 0;
> +	int error;
> +
> +	if (size > gi->gi_size)
> +		size = gi->gi_size;
> +
> +	buf = kmalloc(size, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	error = filler(ip, gi, buf, size, &count);
> +	if (error)
> +		goto out;
> +
> +	if (copy_to_user(gi->gi_data, buf, count + 1))
> +		error = -EFAULT;
> 
> where does count get a sensible value?

from filler()
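
(filler() gets &count and fills it in.  For illustration only, the assumed
shape of gi_filler_t would be something like:

typedef int (gi_filler_t)(struct gfs2_inode *ip, struct gfs2_ioctl *gi,
			  char *buf, unsigned int size, unsigned int *count);

so count holds the number of bytes filler() wrote into buf.)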

We'll add comments in the code to document the things above.
Thanks,
Dave



Re: GFS, what's remaining

2005-09-06 Thread Suparna Bhattacharya
On Fri, Sep 02, 2005 at 11:17:08PM +0200, Andi Kleen wrote:
> Andrew Morton <[EMAIL PROTECTED]> writes:
> 
> > 
> > > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> > > >   possibly gain (or vice versa)
> > > > 
> > > > - Relative merits of the two offerings
> > > 
> > > You missed the important one - people actively use it and have been for
> > > some years. Same reason we have NTFS, HPFS, and all the others. On
> > > that alone it makes sense to include.
> >  
> > Again, that's not a technical reason.  It's _a_ reason, sure.  But what are
> > the technical reasons for merging gfs[2], ocfs2, both or neither?
> 
> There seems to be clearly a need for a shared-storage fs of some sort
> for HA clusters and virtualized usage (multiple guests sharing a
> partition).  Shared storage can be more efficient than network file
> systems like NFS because the storage access is often more efficient
> than network access  and it is more reliable because it doesn't have a
> single point of failure in form of the NFS server.
> 
> It's also a logical extension of the "failover on failure" clusters
> many people run now - instead of only failing over the shared fs at
> failure and keeping one machine idle the load can be balanced between
> multiple machines at any time.
> 
> One argument to merge both might be that nobody really knows yet which
> shared-storage file system (GFS or OCFS2) is better. The only way to
> find out would be to let the user base try out both, and that's most
> practical when they're merged.
> 
> Personally I think ocfs2 has nicer code than GFS.
> It seems to be more or less a 64bit ext3 with cluster support, while

The "more or less" is what bothers me here - the first time I heard this,
it sounded a little misleading, as I expected to find some kind of a
patch to ext3 to make it 64 bit with extents and cluster support.
Now I understand it a little better (thanks to Joel and Mark)

And herein lies the issue on which I tend to agree with Andrew
-- it's really nice to have multiple filesystems innovating freely in
their niches and eventually proving themselves in practice, without
being bogged down by legacy etc. But at the same time, is there enough
thought and discussion about where the fragmentation/diversification is
really warranted, versus improving what is already there, or say
incorporating the best of one into another, maybe over a period of time?

The number of filesystems seems to just keep growing, and supporting
all of them isn't easy -- for users it isn't really easy to switch from
one to another, and the justification for choosing between them is
sometimes confusing and burdensome from an administrator standpoint
- one filesystem is good in certain conditions, another in others,
stability levels may vary etc., and it's not always possible to predict
which aspect to prioritize.

Now, with filesystems that have been around in production for a long
time, the on-disk format becomes a major constraining factor, and the
reason for having various legacy support around. Likewise, for some
special purpose filesystems there really is a niche usage. But for new
and sufficiently general purpose filesystems, with new on-disk structure,
isn't it worth thinking this through and trying to get it right?

Yeah, it is a lot of work upfront ... but with double the people working
on something, it just might get much better than what they could
individually achieve. Sometimes.

BTW, I don't know if it is worth it in this particular case, but just
something that worries me in general.

> GFS seems to reinvent a lot more things and has somewhat uglier code.
> On the other hand GFS' cluster support seems to be more aimed
> at being a universal cluster service open for other usages too,
> which might be a good thing. OCFS2s cluster seems to be more 
> aimed at only serving the file system.
> 
> But which one works better in practice is really an open question.

True, but what usually ends up happening is that this question can
never quite be answered in black and white. So both just continue
to exist and apps need to support both ... convergence becomes impossible
and long term duplication inevitable.

So at least having a clear demarcation/guideline of what situations
each is suitable for upfront would be a good thing. That might also
get some cross ocfs-gfs and ocfs-ext3 reviews in the process :)

Regards
Suparna

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India



Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 19:37, Joel Becker wrote:
>  OCFS2, the new filesystem, is fully general purpose.  It
> supports all the usual stuff, is quite fast...

So I have heard, but isn't it time to quantify that?  How do you think you 
would stack up here:

   http://www.caspur.it/Files/2005/01/10/1105354214692.pdf

Regards,

Daniel


Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 22:03, Dmitry Torokhov wrote:
> On Monday 05 September 2005 19:57, Daniel Phillips wrote:
> > On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
> > > On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> > > > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > > > > On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > > > > > The only current users of dlms are cluster filesystems.  There
> > > > > > are zero users of the userspace dlm api.
> > > > >
> > > > > That is incorrect...
> > > >
> > > > Application users Lars, sorry if I did not make that clear.  The
> > > > issue is whether we need to export an all-singing-all-dancing dlm api
> > > > from kernel to userspace today, or whether we can afford to take the
> > > > necessary time to get it right while application writers take their
> > > > time to have a good think about whether they even need it.
> > >
> > > If Linux fully supported OpenVMS DLM semantics we could start thinking
> > > about moving our application onto a Linux box because our alpha server
> > > is aging.
> > >
> > > That's just my user application writer $0.02.
> >
> > What stops you from trying it with the patch?  That kind of feedback
> > would be worth way more than $0.02.
>
> We do not have such plans at the moment and I prefer spending my free
> time on tinkering with kernel, not rewriting some in-house application.
> Besides, DLM is not the only thing that does not have a drop-in
> replacement in Linux.
>
> You just said you did not know if there are any potential users for the
> full DLM and I said there are some.

I did not say "potential", I said there are zero dlm applications at the 
moment.  Nobody has picked up the prototype (g)dlm api, used it in an 
application and said "gee this works great, look what it does".

I also claim that most developers who think that using a dlm for application 
synchronization would be really cool are probably wrong.  Use sockets for 
synchronization exactly as for a single-node, multi-tasking application and 
you will end up with less code, more obviously correct code, probably more 
efficient and... you get an optimal, single-node version for free.
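
Purely as an illustration of that argument (nothing here is from GFS or the
DLM patches), a trivial sketch of a lock server over a socket: the lock is
granted to one client at a time, holding the connection is holding the lock,
and closing it releases the lock.  The same code works whether the clients
are local processes or on other nodes.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in addr;
	int s = socket(AF_INET, SOCK_STREAM, 0);

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(5000);		/* arbitrary example port */

	if (s < 0 || bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		return 1;
	listen(s, 64);

	for (;;) {
		char b;
		int c = accept(s, NULL, NULL);
		if (c < 0)
			continue;
		write(c, "G", 1);		/* grant the lock to this client */
		while (read(c, &b, 1) > 0)	/* holder closing = unlock */
			;
		close(c);			/* next accept() grants the next waiter */
	}
}

A client "locks" by connecting and blocking in read() until the one-byte
grant arrives, and "unlocks" by closing its socket.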

And I also claim that there is precious little reason to have a full-featured 
dlm in-kernel.  Being in-kernel has no benefit for a userspace application.  
But being in-kernel does add kernel bloat, because there will be extra 
features lathered on that are not needed by the only in-kernel user, the 
cluster filesystem.

In the case of your port, you'd be better off hacking up a userspace library 
to provide OpenVMS dlm semantics exactly, not almost.

By the way, you said "alpha server" not "alpha servers", was that just a slip?  
Because if you don't have a cluster then why are you using a dlm?

Regards,

Daniel


Re: GFS, what's remaining

2005-09-05 Thread Dmitry Torokhov
On Monday 05 September 2005 19:57, Daniel Phillips wrote:
> On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
> > On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> > > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > > > On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > > > > The only current users of dlms are cluster filesystems.  There are
> > > > > zero users of the userspace dlm api.
> > > >
> > > > That is incorrect...
> > >
> > > Application users Lars, sorry if I did not make that clear.  The issue is
> > > whether we need to export an all-singing-all-dancing dlm api from kernel
> > > to userspace today, or whether we can afford to take the necessary time
> > > to get it right while application writers take their time to have a good
> > > think about whether they even need it.
> >
> > If Linux fully supported OpenVMS DLM semantics we could start thinking
> > about moving our application onto a Linux box because our alpha server is
> > aging.
> >
> > That's just my user application writer $0.02.
> 
> What stops you from trying it with the patch?  That kind of feedback would be 
> worth way more than $0.02.
>

We do not have such plans at the moment and I prefer spending my free
time on tinkering with kernel, not rewriting some in-house application.
Besides, DLM is not the only thing that does not have a drop-in
replacement in Linux.

You just said you did not know if there are any potential users for the
full DLM and I said there are some.

-- 
Dmitry


Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
> On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > > On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > > > The only current users of dlms are cluster filesystems.  There are
> > > > zero users of the userspace dlm api.
> > >
> > > That is incorrect...
> >
> > Application users Lars, sorry if I did not make that clear.  The issue is
> > whether we need to export an all-singing-all-dancing dlm api from kernel
> > to userspace today, or whether we can afford to take the necessary time
> > to get it right while application writers take their time to have a good
> > think about whether they even need it.
>
> If Linux fully supported OpenVMS DLM semantics we could start thinking
> about moving our application onto a Linux box because our alpha server is
> aging.
>
> That's just my user application writer $0.02.

What stops you from trying it with the patch?  That kind of feedback would be 
worth way more than $0.02.

Regards,

Daniel


Re: GFS, what's remaining

2005-09-05 Thread Joel Becker
On Mon, Sep 05, 2005 at 10:24:03PM +0200, Bernd Eckenfels wrote:
> The whole point of the Oracle cluster filesystem as it was described in old
> papers was about pfiles, control files and software, because you can easily
> use direct block access (with ASM) for tablespaces.

OCFS, the original filesystem, only works for datafiles,
logfiles, and other database data.  It's currently used in serious anger
by several major customers.  Oracle's websites must have a list of them
somewhere.  We're talking many terabytes of datafiles.

> Yes, I don't dispute the usefulness of OCFS for ORA_HOME (besides, I think a
> replicated filesystem makes more sense), I am just not sure if anybody sane
> would use it for tablespaces.

OCFS2, the new filesystem, is fully general purpose.  It
supports all the usual stuff, is quite fast, and is what we expect folks
to use for both ORACLE_HOME and datafiles in the future.  Customers can,
of course, use ASM or even raw devices.  OCFS2 is as fast as raw
devices, and far more manageable, so raw devices are probably not a
choice for the future.  ASM has its own management advantages, and we
certainly expect customers to like it as well.  But that doesn't mean
people won't use OCFS2 for datafiles depending on their environment or
needs.


-- 

"The first requisite of a good citizen in this republic of ours
 is that he shall be able and willing to pull his weight."
- Theodore Roosevelt

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [EMAIL PROTECTED]
Phone: (650) 506-8127


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Joel Becker
On Sun, Sep 04, 2005 at 09:37:15AM +0100, Alan Cox wrote:
> I am curious why a lock manager uses open to implement its locking
> semantics rather than using the locking API (POSIX locks etc) however.

Because it is simple (how do you fcntl(2) from a shell fd?), has no
ranges (what do you do with ranges passed in to fcntl(2) when you don't
support them?), and has a well-known fork(2)/exec(2) pattern.  fcntl(2)
has a known but less intuitive fork(2) pattern.
The real reason, though, is that we never considered fcntl(2).
We could never think of a case when a process wanted a lock fd open but
not locked.  At least, that's my recollection.  Mark might have more to
comment.
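
To make that concrete, a hedged userspace sketch of open()-as-lock (the
mount point below is hypothetical, and O_NONBLOCK acting as a trylock is an
assumption drawn from this thread, not a documented interface):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* blocks until the lock is granted; add O_NONBLOCK for a trylock */
	int fd = open("/dlm/mydomain/mylock", O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* the lock is held as long as the fd is open, and the fd behaves
	 * like any other descriptor across fork()/exec() */
	close(fd);	/* releases the lock */
	return 0;
}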

Joel

-- 

"In the room the women come and go
 Talking of Michaelangelo."

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [EMAIL PROTECTED]
Phone: (650) 506-8127



Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Andrew Morton
Alan Cox <[EMAIL PROTECTED]> wrote:
>
> On Llu, 2005-09-05 at 12:53 -0700, Andrew Morton wrote:
>  > >  - How are they ref counted
>  > >  - What are the cleanup semantics
>  > >  - How do I pass a lock between processes (AF_UNIX sockets wont work now)
>  > >  - How do I poll on a lock coming free. 
>  > >  - What are the semantics of lock ownership
>  > >  - What rules apply for inheritance
>  > >  - How do I access a lock across threads.
>  > >  - What is the permission model. 
>  > >  - How do I attach audit to it
>  > >  - How do I write SELinux rules for it
>  > >  - How do I use mount to make namespaces appear in multiple vservers
>  > > 
>  > >  and thats for starters...
>  > 
>  > Return an fd from create_lockspace().
> 
>  That only answers about four of the questions. The rest only come out if
>  create_lockspace behaves like a file system - in other words
>  create_lockspace is better known as either mkdir or mount.

But David said that "We export our full dlm API through read/write/poll on
a misc device.".  That miscdevice will simply give us an fd.  Hence my
suggestion that the miscdevice be done away with in favour of a dedicated
syscall which returns an fd.

What does a filesystem have to do with this?


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Alan Cox
On Llu, 2005-09-05 at 12:53 -0700, Andrew Morton wrote:
> >  - How are they ref counted
> >  - What are the cleanup semantics
> >  - How do I pass a lock between processes (AF_UNIX sockets wont work now)
> >  - How do I poll on a lock coming free. 
> >  - What are the semantics of lock ownership
> >  - What rules apply for inheritance
> >  - How do I access a lock across threads.
> >  - What is the permission model. 
> >  - How do I attach audit to it
> >  - How do I write SELinux rules for it
> >  - How do I use mount to make namespaces appear in multiple vservers
> > 
> >  and thats for starters...
> 
> Return an fd from create_lockspace().

That only answers about four of the questions. The rest only come out if
create_lockspace behaves like a file system - in other words
create_lockspace is better known as either mkdir or mount.

It's certainly viable to make the lock/unlock functions take an fd; it's
just not clear why the current lock/unlock functions we have won't do
the job. Being able to extend the functionality to leases later on may
be very powerful indeed and will fit the existing API.



Re: GFS, what's remaining

2005-09-05 Thread Kurt Hackel
On Mon, Sep 05, 2005 at 10:24:03PM +0200, Bernd Eckenfels wrote:
> On Mon, Sep 05, 2005 at 04:16:31PM +0200, Lars Marowsky-Bree wrote:
> > That is the whole point why OCFS exists ;-)
> 
> The whole point of the Oracle cluster filesystem as it was described in old
> papers was about pfiles, control files and software, because you can easily
> use direct block access (with ASM) for tablespaces.

The original OCFS was intended for use with pfiles and control files but
very definitely *not* software (the ORACLE_HOME).  It was not remotely
general purpose.  It also predated ASM by about a year or so, and the
two solutions are complementary.  Either one is a good choice for Oracle
datafiles, depending upon your needs.

> > No. Beyond the table spaces, there's also ORACLE_HOME; a cluster
> > benefits in several aspects from a general-purpose SAN-backed CFS.
> 
> Yes, I don't dispute the usefulness of OCFS for ORA_HOME (besides, I think a
> replicated filesystem makes more sense), I am just not sure if anybody sane
> would use it for tablespaces.

Too many to mention here, but let's just say that some of the largest
databases are running Oracle datafiles on top of OCFS1.  Very large
companies with very important data.

> I guess I have to correct the artile in my german it blog :) (if somebody
> can name productive customers).

Yeah you should definitely update your blog ;-)  If you need named
references, we can give you loads of those.

-kurt

Kurt C. Hackel
Oracle
[EMAIL PROTECTED]


Re: GFS, what's remaining

2005-09-05 Thread Bernd Eckenfels
On Mon, Sep 05, 2005 at 04:16:31PM +0200, Lars Marowsky-Bree wrote:
> That is the whole point why OCFS exists ;-)

The whole point of the Oracle cluster filesystem as it was described in old
papers was about pfiles, control files and software, because you can easily
use direct block access (with ASM) for tablespaces.

> No. Beyond the table spaces, there's also ORACLE_HOME; a cluster
> benefits in several aspects from a general-purpose SAN-backed CFS.

Yes, I don't dispute the usefulness of OCFS for ORA_HOME (besides, I think a
replicated filesystem makes more sense), I am just not sure if anybody sane
would use it for tablespaces.

I guess I have to correct the article in my German IT blog :) (if somebody
can name productive customers).

Gruss
Bernd
-- 
http://itblog.eckenfels.net/archives/54-Cluster-Filesysteme.html


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Andrew Morton
Alan Cox <[EMAIL PROTECTED]> wrote:
>
> On Llu, 2005-09-05 at 02:19 -0700, Andrew Morton wrote:
>  > >   create_lockspace()
>  > >   release_lockspace()
>  > >   lock()
>  > >   unlock()
>  > 
>  > Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
>  > is likely to object if we reserve those slots.
> 
>  If the locks are not file descriptors then answer the following:
> 
>  - How are they ref counted
>  - What are the cleanup semantics
>  - How do I pass a lock between processes (AF_UNIX sockets wont work now)
>  - How do I poll on a lock coming free. 
>  - What are the semantics of lock ownership
>  - What rules apply for inheritance
>  - How do I access a lock across threads.
>  - What is the permission model. 
>  - How do I attach audit to it
>  - How do I write SELinux rules for it
>  - How do I use mount to make namespaces appear in multiple vservers
> 
>  and thats for starters...

Return an fd from create_lockspace().


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread kurt . hackel
On Mon, Sep 05, 2005 at 05:24:33PM +0800, David Teigland wrote:
> On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> > David Teigland <[EMAIL PROTECTED]> wrote:
> > >
> > >  We export our full dlm API through read/write/poll on a misc device.
> > >
> > 
> > inotify did that for a while, but we ended up going with a straight syscall
> > interface.
> > 
> > How fat is the dlm interface?   ie: how many syscalls would it take?
> 
> Four functions:
>   create_lockspace()
>   release_lockspace()
>   lock()
>   unlock()

FWIW, it looks like we can agree on the core interface.  ocfs2_dlm
exports essentially the same functions:
dlm_register_domain()
dlm_unregister_domain()
dlmlock()
dlmunlock()

I also implemented dlm_migrate_lockres() to explicitly remaster a lock
on another node, but this isn't used by any callers today (except for
debugging purposes).  There is also some wiring between the fs and the
dlm (eviction callbacks) to deal with some ordering issues between the
two layers, but these could go if we get stronger membership.

There are quite a few other functions in the "full" spec(1) that we
didn't even attempt, either because we didn't require direct 
user<->kernel access or we just didn't need the function.  As for the
rather thick set of parameters expected in dlm calls, we managed to get
dlmlock down to *ahem* eight, and the rest are fairly slim.

Looking at the misc device that gfs uses, it seems like there is a pretty
much complete interface to the same calls you have in the kernel, validated
on the write() calls to the misc device.
lock down and simplify user access by using standard ast/bast/unlockast
calls, using a file descriptor as an opaque token for a single lock,
letting the vfs lifetime on this fd help with abnormal termination, etc.
I think both the misc device and dlmfs are helpful and not necessarily
mutually exclusive, and probably both are better approaches than
exporting everything via loads of syscalls (which seems to be the 
VMS/opendlm model).

-kurt

1. http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf


Kurt C. Hackel
Oracle
[EMAIL PROTECTED]


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Alan Cox
On Sad, 2005-09-03 at 21:46 -0700, Andrew Morton wrote:
> Actually I think it's rather sick.  Taking O_NONBLOCK and making it a
> lock-manager trylock because they're kinda-sorta-similar-sounding?  Spare
> me.  O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> acquire a clustered filesystem lock".  Not even close.

The semantics of O_NONBLOCK on many other devices are "trylock"
semantics. OSS audio has those semantics for example, as do regular
files in the presence of SYS5 mandatory locks. While the latter is "try
lock , do operation and then drop lock" the drivers using O_NDELAY are
very definitely providing trylock semantics.

I am curious why a lock manager uses open to implement its locking
semantics rather than using the locking API (POSIX locks etc) however.

Alan



Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Alan Cox
On Llu, 2005-09-05 at 02:19 -0700, Andrew Morton wrote:
> >   create_lockspace()
> >   release_lockspace()
> >   lock()
> >   unlock()
> 
> Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
> is likely to object if we reserve those slots.

If the locks are not file descriptors then answer the following:

- How are they ref counted
- What are the cleanup semantics
- How do I pass a lock between processes (AF_UNIX sockets wont work now)
- How do I poll on a lock coming free. 
- What are the semantics of lock ownership
- What rules apply for inheritance
- How do I access a lock across threads.
- What is the permission model. 
- How do I attach audit to it
- How do I write SELinux rules for it
- How do I use mount to make namespaces appear in multiple vservers

and thats for starters...

Every so often someone decides that a deeply un-unix interface with new
syscalls is a good idea. Every time history proves them totally bonkers.
There are cases for new system calls but this doesn't seem one of them.

Look at system 5 shared memory, look at system 5 ipc, and so on. You
can't use common interfaces on them, you can't select on them, you can't
sanely pass them by fd passing.

All our existing locking uses the following behaviour

fd = open(namespace, options)
fcntl(.. lock ...)
blah
flush
fcntl(.. unlock ...)
close
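
As a concrete (and purely illustrative) instance of that pattern with POSIX
record locks -- path and lock range are arbitrary:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	struct flock fl = {
		.l_type   = F_WRLCK,	/* exclusive lock */
		.l_whence = SEEK_SET,
		.l_start  = 0,
		.l_len    = 0,		/* 0 = lock the whole file */
	};
	int fd = open("/var/lock/example", O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return 1;
	fcntl(fd, F_SETLKW, &fl);	/* blocking lock */
	/* ... do the work, flush it ... */
	fsync(fd);
	fl.l_type = F_UNLCK;
	fcntl(fd, F_SETLK, &fl);	/* unlock */
	close(fd);			/* close also drops any remaining locks */
	return 0;
}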

Unfortunately some people here seem to have forgotten WHY we do things
this way.

1.  The semantics of file descriptors are well understood by users and by
programs. That makes programming easier and keeps code size down
2.  Everyone knows how close() works including across fork
3.  FD passing is an obscure art but understood and just works
4.  Poll() is a standard understood interface
5.  Ownership of files is a standard model
6.  FD passing across fork/exec is controlled in a standard way
7.  The semantics for threaded applications are defined
8.  Permissions are a standard model
9.  Audit just works with the same tools
10. SELinux just works with the same tools
11. I don't need specialist applications to see the system state (the
whole point of sysfs yet someone wants to break it all again)
12. fcntl fd locking is a posix standard interface with precisely
defined semantics. Our extensions including leases are very powerful
13. And yes - fcntl fd locking supports mandatory locking too. That also
is standards based with precise semantics.


Everyone understands how to use the existing locking operations. So if
you use the existing interfaces, with some small extensions where necessary,
everyone understands how to use cluster locks. Isn't that neat?




Re: GFS, what's remaining

2005-09-05 Thread Dmitry Torokhov
On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > > The only current users of dlms are cluster filesystems.  There are zero
> > > users of the userspace dlm api.
> >
> > That is incorrect...
> 
> Application users Lars, sorry if I did not make that clear.  The issue is 
> whether we need to export an all-singing-all-dancing dlm api from kernel to 
> userspace today, or whether we can afford to take the necessary time to get 
> it right while application writers take their time to have a good think about 
> whether they even need it.
>

If Linux fully supported OpenVMS DLM semantics we could start thinking about
moving our application onto a Linux box because our alpha server is aging.

That's just my user application writer $0.02.

-- 
Dmitry


Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > The only current users of dlms are cluster filesystems.  There are zero
> > users of the userspace dlm api.
>
> That is incorrect...

Application users Lars, sorry if I did not make that clear.  The issue is 
whether we need to export an all-singing-all-dancing dlm api from kernel to 
userspace today, or whether we can afford to take the necessary time to get 
it right while application writers take their time to have a good think about 
whether they even need it.

> ...and you're contradicting yourself here:

How so?  Above talks about dlm, below talks about cluster membership.

> > What does have to be resolved is a common API for node management.  It is
> > not just cluster filesystems and their lock managers that have to
> > interface to node management.  Below the filesystem layer, cluster block
> > devices and cluster volume management need to be coordinated by the same
> > system, and above the filesystem layer, applications also need to be
> > hooked into it. This work is, in a word, incomplete.

Regards,

Daniel


Re: GFS, what's remaining

2005-09-05 Thread Lars Marowsky-Bree
On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote:

> The only current users of dlms are cluster filesystems.  There are zero users 
> of the userspace dlm api. 

That is incorrect, and you're contradicting yourself here:

> What does have to be resolved is a common API for node management.  It is not 
> just cluster filesystems and their lock managers that have to interface to 
> node management.  Below the filesystem layer, cluster block devices and 
> cluster volume management need to be coordinated by the same system, and 
> above the filesystem layer, applications also need to be hooked into it.  
> This work is, in a word, incomplete.

The Cluster Volume Management of LVM2 for example _does_ use simple
cluster-wide locks, and some OCFS2 scripts, I seem to recall, do too.

(EVMS2 in cluster-mode uses a verrry simple locking scheme which is
basically operated by the failover software and thus uses a different
model.)


Sincerely,
Lars Marowsky-Brée <[EMAIL PROTECTED]>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"



Re: GFS, what's remaining

2005-09-05 Thread Lars Marowsky-Bree
On 2005-09-03T09:27:41, Bernd Eckenfels <[EMAIL PROTECTED]> wrote:

> Oh thats interesting, I never thought about putting data files (tablespaces)
> in a clustered file system. Does that mean you can run supported RAC on
> shared ocfs2 files and anybody is using that?

That is the whole point why OCFS exists ;-)

> Do you see this go away with ASM?

No. Beyond the table spaces, there's also ORACLE_HOME; a cluster
benefits in several aspects from a general-purpose SAN-backed CFS.


Sincerely,
Lars Marowsky-Brée <[EMAIL PROTECTED]>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"



Re: GFS, what's remaining

2005-09-05 Thread Theodore Ts'o
On Mon, Sep 05, 2005 at 12:09:23AM -0700, Mark Fasheh wrote:
> Btw, I'm curious to know how useful folks find the ext3 mount options
> errors=continue and errors=panic. I'm extremely likely to implement the
> errors=read-only behavior as default in OCFS2 and I'm wondering whether the
> other two are worth looking into.

For a single-user system errors=panic is definitely very useful on the
system disk, since that's the only way that we can force an fsck, and
also abort a server that might be failing and returning erroneous
information to its clients.  Think of it is as i/o fencing when you're
not sure that the system is going to be performing correctly.

Whether or not this is useful for ocfs2 is a different matter.  If
it's only for data volumes, and if the only way to fix filesystem
inconsistencies on a cluster filesystem is to request all nodes in the
cluster to unmount the filesystem and then arrange to run ocfs2's fsck
on the filesystem, then forcing every single node in the cluster to
panic is probably counterproductive.  :-)

- Ted


Re: real read-only [was Re: GFS, what's remaining]

2005-09-05 Thread Theodore Ts'o
On Mon, Sep 05, 2005 at 10:27:35AM +0200, Pavel Machek wrote:
> 
> There's a better reason, too. I do swsusp. Then I'd like to boot with
> / mounted read-only (so that I can read my config files, some
> binaries, and maybe suspended image), but I absolutely may not write
> to disk at this point, because I still want to resume.
> 

You could _hope_ that the filesystem is consistent enough that it is
safe to try to read config files, binaries, etc. without running the
journal, but there is absolutely no guarantee that this is the case.
I'm not sure you want to depend on that for swsusp.

One potential solution that would probably meet your needs is a dm
hack which reads in the blocks in the journal, and then uses the most
recent block in the journal in preference to the version on disk.

- Ted


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Stephen C. Tweedie
Hi,

On Sun, 2005-09-04 at 21:33, Pavel Machek wrote:

> > - read-only mount
> > - "specatator" mount (like ro but no journal allocated for the mount,
> >   no fencing needed for failed node that was mounted as specatator)
> 
> I'd call it "real-read-only", and yes, that's very usefull
> mount. Could we get it for ext3, too?

I don't want to pollute the ext3 paths with extra checks for the case
when there's no journal struct at all.  But a dummy journal struct that
isn't associated with an on-disk journal and that can never, ever go
writable would certainly be pretty easy to do.

But mount -o readonly gives you most of what you want already.  An
always-readonly option would be different in some key ways --- for a
start, it would be impossible to perform journal recovery if that's
needed, as that still needs journal and superblock write access.  That's
not necessarily a good thing.

And you *still* wouldn't get something that could act as a spectator to
a filesystem mounted writable elsewhere on a SAN, because updates on the
other node wouldn't invalidate cached data on the readonly node.  So is
this really a useful combination?

About the only combination I can think of that really makes sense in
this context is if you have a busted filesystem that somehow can't be
recovered --- either the journal is broken or the underlying device is
truly readonly --- and you want to mount without recovery in order to
attempt to see what you can find.  That's asking for data corruption,
but that may be better than getting no data at all.  

But that is something that could be done with a "-o skip-recovery" mount
option, which would necessarily imply always-readonly behaviour.

--Stephen




Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread David Teigland
On Mon, Sep 05, 2005 at 02:19:48AM -0700, Andrew Morton wrote:
> David Teigland <[EMAIL PROTECTED]> wrote:
> > Four functions:
> >   create_lockspace()
> >   release_lockspace()
> >   lock()
> >   unlock()
> 
> Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
> is likely to object if we reserve those slots.

Patrick is really the expert in this area and he's off this week, but
based on what he's done with the misc device I don't see why there'd be
more than two or three parameters for any of these.
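
Purely hypothetical shapes, to illustrate "two or three parameters" -- every
name, flag and struct below is invented for illustration, not the actual
proposal:

int create_lockspace(const char *name, unsigned int flags);
int release_lockspace(int ls_fd, unsigned int flags);
int dlm_lock(int ls_fd, struct dlm_lock_request *req);
int dlm_unlock(int ls_fd, struct dlm_unlock_request *req);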

Dave



Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 05:19, Andrew Morton wrote:
> David Teigland <[EMAIL PROTECTED]> wrote:
> > On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> > > David Teigland <[EMAIL PROTECTED]> wrote:
> > > >  We export our full dlm API through read/write/poll on a misc device.
> > >
> > > inotify did that for a while, but we ended up going with a straight
> > > syscall interface.
> > >
> > > How fat is the dlm interface?   ie: how many syscalls would it take?
> >
> > Four functions:
> >   create_lockspace()
> >   release_lockspace()
> >   lock()
> >   unlock()
>
> Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
> is likely to object if we reserve those slots.

Better take a look at the actual parameter lists to those calls before jumping 
to conclusions...

Regards,

Daniel


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Andrew Morton
David Teigland <[EMAIL PROTECTED]> wrote:
>
> On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> > David Teigland <[EMAIL PROTECTED]> wrote:
> > >
> > >  We export our full dlm API through read/write/poll on a misc device.
> > >
> > 
> > inotify did that for a while, but we ended up going with a straight syscall
> > interface.
> > 
> > How fat is the dlm interface?   ie: how many syscalls would it take?
> 
> Four functions:
>   create_lockspace()
>   release_lockspace()
>   lock()
>   unlock()

Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
is likely to object if we reserve those slots.


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread David Teigland
On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> David Teigland <[EMAIL PROTECTED]> wrote:
> >
> >  We export our full dlm API through read/write/poll on a misc device.
> >
> 
> inotify did that for a while, but we ended up going with a straight syscall
> interface.
> 
> How fat is the dlm interface?   ie: how many syscalls would it take?

Four functions:
  create_lockspace()
  release_lockspace()
  lock()
  unlock()

Dave



Re: GFS, what's remaining

2005-09-05 Thread David Teigland
On Mon, Sep 05, 2005 at 10:58:08AM +0200, Jörn Engel wrote:

> #define gfs2_assert(sdp, assertion) do {				\
> 	if (unlikely(!(assertion))) {					\
> 		printk(KERN_ERR "GFS2: fsid=%s\n", (sdp)->sd_fsname);	\
> 		BUG();							\
> 	}								\
> } while (0)

OK thanks,
Dave



Re: GFS, what's remaining

2005-09-05 Thread Jörn Engel
On Mon, 5 September 2005 11:47:39 +0800, David Teigland wrote:
> 
> Joern already suggested moving this out of line and into a function (as it
> was before) to avoid repeating string constants.  In that case the
> function, file and line from BUG aren't useful.  We now have this, does it
> look ok?

Ok wrt. my concerns, but not with Greg's.  BUG() still gives you
everything that you need, except:
o fsid

Notice how this list is just one entry long? ;)

So how about


#define gfs2_assert(sdp, assertion) do {				\
	if (unlikely(!(assertion))) {					\
		printk(KERN_ERR "GFS2: fsid=%s\n", (sdp)->sd_fsname);	\
		BUG();							\
	}								\
} while (0)


Or, to move the constant out of line again


void __gfs2_assert(struct gfs2_sbd *sdp)
{
	printk(KERN_ERR "GFS2: fsid=%s\n", sdp->sd_fsname);
}

#define gfs2_assert(sdp, assertion) do {				\
	if (unlikely(!(assertion))) {					\
		__gfs2_assert(sdp);					\
		BUG();							\
	}								\
} while (0)
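
A call site would then shrink to something like (illustrative only):

	gfs2_assert(sdp, atomic_read(&gl->gl_count) > 0);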


Jörn

-- 
Admonish your friends privately, but praise them openly.
-- Publilius Syrus 


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Andrew Morton
David Teigland <[EMAIL PROTECTED]> wrote:
>
>  We export our full dlm API through read/write/poll on a misc device.
>

inotify did that for a while, but we ended up going with a straight syscall
interface.

How fat is the dlm interface?   ie: how many syscalls would it take?



Re: GFS, what's remaining

2005-09-05 Thread Pekka Enberg
On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
> > +void gfs2_glock_hold(struct gfs2_glock *gl)
> > +{
> > + glock_hold(gl);
> > +}
> >
> > eh why?

On 9/5/05, David Teigland <[EMAIL PROTECTED]> wrote:
> You removed the comment stating exactly why, see below.  If that's not an
> accepted technique in the kernel, say so and I'll be happy to change it
> here and elsewhere.

Is there a reason why users of gfs2_glock_hold() cannot use
glock_hold() directly?

Pekka
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Theodore Ts'o
On Sun, Sep 04, 2005 at 10:33:44PM +0200, Pavel Machek wrote:
> Hi!
> 
> > - read-only mount
> > - "specatator" mount (like ro but no journal allocated for the mount,
> >   no fencing needed for failed node that was mounted as specatator)
> 
> I'd call it "real-read-only", and yes, that's very usefull
> mount. Could we get it for ext3, too?

This is a bit of a digression, but it's quite a bit different from
what ocfs2 is doing, where it is not necessary to replay the journal
in order to assure filesystem consistency.  

In the ext3 case, the only time when read-only isn't quite read-only
is when the filesystem was unmounted uncleanly and the journal needs
to be replayed in order for the filesystem to be consistent.  Mounting
the filesystem read-only without replaying the journal could and very
likely would result in the filesystem reporting filesystem consistency
problems, and if the filesystem is mounted with the reboot-on-errors
option, well

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread David Teigland
On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:

> +static unsigned int handle_roll(atomic_t *a)
> +{
> + int x = atomic_read(a);
> + if (x < 0) {
> + atomic_set(a, 0);
> + return 0;
> + }
> + return (unsigned int)x;
> +}
> 
> this is just plain scary.

Not really, it was just resetting atomic statistics counters when they
became negative.  Unnecessary, though, so removed.

Dave

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread David Teigland
On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
 David Teigland [EMAIL PROTECTED] wrote:
 
   We export our full dlm API through read/write/poll on a misc device.
 
 
 inotify did that for a while, but we ended up going with a straight syscall
 interface.
 
 How fat is the dlm interface?   ie: how many syscalls would it take?

Four functions:
  create_lockspace()
  release_lockspace()
  lock()
  unlock()
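
To give a rough idea of the shape, a sketch of how those four calls might
look from a userspace library (illustrative only -- these are not the real
libdlm/dlm_device prototypes, which carry more parameters, ASTs, lock value
blocks and so on):

/* Illustrative C sketch, not the actual API. */
struct dlm_lockspace;                     /* opaque handle; could be an fd */

struct dlm_lockspace *create_lockspace(const char *name, unsigned int flags);
int release_lockspace(struct dlm_lockspace *ls, int force);

/* mode would be one of the usual dlm modes (NL/CR/CW/PR/PW/EX) */
int lock(struct dlm_lockspace *ls, const char *resource,
	 int mode, unsigned int flags, unsigned int *lkid);
int unlock(struct dlm_lockspace *ls, unsigned int lkid);

/* typical sequence:
 *   ls = create_lockspace("myfs", 0);
 *   lock(ls, "resource42", exclusive mode, 0, &lkid);
 *   ... critical section ...
 *   unlock(ls, lkid);
 *   release_lockspace(ls, 0);
 */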

Dave

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Andrew Morton
David Teigland [EMAIL PROTECTED] wrote:

 On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
  David Teigland [EMAIL PROTECTED] wrote:
  
We export our full dlm API through read/write/poll on a misc device.
  
  
  inotify did that for a while, but we ended up going with a straight syscall
  interface.
  
  How fat is the dlm interface?   ie: how many syscalls would it take?
 
 Four functions:
   create_lockspace()
   release_lockspace()
   lock()
   unlock()

Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
is likely to object if we reserve those slots.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 05:19, Andrew Morton wrote:
 David Teigland [EMAIL PROTECTED] wrote:
  On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
   David Teigland [EMAIL PROTECTED] wrote:
 We export our full dlm API through read/write/poll on a misc device.
  
   inotify did that for a while, but we ended up going with a straight
   syscall interface.
  
   How fat is the dlm interface?   ie: how many syscalls would it take?
 
  Four functions:
create_lockspace()
release_lockspace()
lock()
unlock()

 Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
 is likely to object if we reserve those slots.

Better take a look at the actual parameter lists to those calls before jumping 
to conclusions...

Regards,

Daniel
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread David Teigland
On Mon, Sep 05, 2005 at 02:19:48AM -0700, Andrew Morton wrote:
 David Teigland [EMAIL PROTECTED] wrote:
  Four functions:
create_lockspace()
release_lockspace()
lock()
unlock()
 
 Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
 is likely to object if we reserve those slots.

Patrick is really the expert in this area and he's off this week, but
based on what he's done with the misc device I don't see why there'd be
more than two or three parameters for any of these.

Dave

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Stephen C. Tweedie
Hi,

On Sun, 2005-09-04 at 21:33, Pavel Machek wrote:

  - read-only mount
  - spectator mount (like ro but no journal allocated for the mount,
no fencing needed for failed node that was mounted as spectator)
 
 I'd call it real-read-only, and yes, that's very useful
 mount. Could we get it for ext3, too?

I don't want to pollute the ext3 paths with extra checks for the case
when there's no journal struct at all.  But a dummy journal struct that
isn't associated with an on-disk journal and that can never, ever go
writable would certainly be pretty easy to do.

But mount -o readonly gives you most of what you want already.  An
always-readonly option would be different in some key ways --- for a
start, it would be impossible to perform journal recovery if that's
needed, as that still needs journal and superblock write access.  That's
not necessarily a good thing.

And you *still* wouldn't get something that could act as a spectator to
a filesystem mounted writable elsewhere on a SAN, because updates on the
other node wouldn't invalidate cached data on the readonly node.  So is
this really a useful combination?

About the only combination I can think of that really makes sense in
this context is if you have a busted filesystem that somehow can't be
recovered --- either the journal is broken or the underlying device is
truly readonly --- and you want to mount without recovery in order to
attempt to see what you can find.  That's asking for data corruption,
but that may be better than getting no data at all.  

But that is something that could be done with a -o skip-recovery mount
option, which would necessarily imply always-readonly behaviour.

--Stephen


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: real read-only [was Re: GFS, what's remaining]

2005-09-05 Thread Theodore Ts'o
On Mon, Sep 05, 2005 at 10:27:35AM +0200, Pavel Machek wrote:
 
 There's a better reason, too. I do swsusp. Then I'd like to boot with
 / mounted read-only (so that I can read my config files, some
 binaries, and maybe suspended image), but I absolutely may not write
 to disk at this point, because I still want to resume.
 

You could _hope_ that the filesystem is consistent enough that it is
safe to try to read config files, binaries, etc. without running the
journal, but there is absolutely no guarantee that this is the case.
I'm not sure you want to depend on that for swsusp.

One potential solution that would probably meet your needs is a dm
hack which reads in the blocks in the journal, and then uses the most
recent block in the journal in preference to the version on disk.

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Theodore Ts'o
On Mon, Sep 05, 2005 at 12:09:23AM -0700, Mark Fasheh wrote:
 Btw, I'm curious to know how useful folks find the ext3 mount options
 errors=continue and errors=panic. I'm extremely likely to implement the
 errors=read-only behavior as default in OCFS2 and I'm wondering whether the
 other two are worth looking into.

For a single-user system errors=panic is definitely very useful on the
system disk, since that's the only way that we can force an fsck, and
also abort a server that might be failing and returning erroneous
information to its clients.  Think of it is as i/o fencing when you're
not sure that the system is going to be performing correctly.

Whether or not this is useful for ocfs2 is a different matter.  If
it's only for data volumes, and if the only way to fix filesystem
inconsistencies on a cluster filesystem is to request all nodes in the
cluster to unmount the filesystem and then arrange to run ocfs2's fsck
on the filesystem, then forcing every single node in the cluster to
panic is probably counterproductive.  :-)

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Lars Marowsky-Bree
On 2005-09-03T09:27:41, Bernd Eckenfels [EMAIL PROTECTED] wrote:

 Oh thats interesting, I never thought about putting data files (tablespaces)
 in a clustered file system. Does that mean you can run supported RAC on
 shared ocfs2 files and anybody is using that?

That is the whole point why OCFS exists ;-)

 Do you see this go away with ASM?

No. Beyond the table spaces, there's also ORACLE_HOME; a cluster
benefits in several aspects from a general-purpose SAN-backed CFS.


Sincerely,
Lars Marowsky-Brée [EMAIL PROTECTED]

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
Ignorance more frequently begets confidence than does knowledge

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Lars Marowsky-Bree
On 2005-09-03T01:57:31, Daniel Phillips [EMAIL PROTECTED] wrote:

 The only current users of dlms are cluster filesystems.  There are zero users 
 of the userspace dlm api. 

That is incorrect, and you're contradicting yourself here:

 What does have to be resolved is a common API for node management.  It is not 
 just cluster filesystems and their lock managers that have to interface to 
 node management.  Below the filesystem layer, cluster block devices and 
 cluster volume management need to be coordinated by the same system, and 
 above the filesystem layer, applications also need to be hooked into it.  
 This work is, in a word, incomplete.

The Cluster Volume Management of LVM2 for example _does_ use simple
cluster-wide locks, and some OCFS2 scripts, I seem to recall, do too.

(EVMS2 in cluster-mode uses a verrry simple locking scheme which is
basically operated by the failover software and thus uses a different
model.)


Sincerely,
Lars Marowsky-Brée [EMAIL PROTECTED]

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
Ignorance more frequently begets confidence than does knowledge

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
 On 2005-09-03T01:57:31, Daniel Phillips [EMAIL PROTECTED] wrote:
  The only current users of dlms are cluster filesystems.  There are zero
  users of the userspace dlm api.

 That is incorrect...

Application users Lars, sorry if I did not make that clear.  The issue is 
whether we need to export an all-singing-all-dancing dlm api from kernel to 
userspace today, or whether we can afford to take the necessary time to get 
it right while application writers take their time to have a good think about 
whether they even need it.

 ...and you're contradicting yourself here:

How so?  Above talks about dlm, below talks about cluster membership.

  What does have to be resolved is a common API for node management.  It is
  not just cluster filesystems and their lock managers that have to
  interface to node management.  Below the filesystem layer, cluster block
  devices and cluster volume management need to be coordinated by the same
  system, and above the filesystem layer, applications also need to be
  hooked into it. This work is, in a word, incomplete.

Regards,

Daniel
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Dmitry Torokhov
On Monday 05 September 2005 10:49, Daniel Phillips wrote:
 On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
  On 2005-09-03T01:57:31, Daniel Phillips [EMAIL PROTECTED] wrote:
   The only current users of dlms are cluster filesystems.  There are zero
   users of the userspace dlm api.
 
  That is incorrect...
 
 Application users Lars, sorry if I did not make that clear.  The issue is 
 whether we need to export an all-singing-all-dancing dlm api from kernel to 
 userspace today, or whether we can afford to take the necessary time to get 
 it right while application writers take their time to have a good think about 
 whether they even need it.


If Linux fully supported OpenVMS DLM semantics we could start thinking about
moving our application onto a Linux box because our alpha server is aging.

That's just my user application writer $0.02.

-- 
Dmitry
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Alan Cox
On Llu, 2005-09-05 at 02:19 -0700, Andrew Morton wrote:
create_lockspace()
release_lockspace()
lock()
unlock()
 
 Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
 is likely to object if we reserve those slots.

If the locks are not file descriptors then answer the following:

- How are they ref counted
- What are the cleanup semantics
- How do I pass a lock between processes (AF_UNIX sockets wont work now)
- How do I poll on a lock coming free. 
- What are the semantics of lock ownership
- What rules apply for inheritance
- How do I access a lock across threads.
- What is the permission model. 
- How do I attach audit to it
- How do I write SELinux rules for it
- How do I use mount to make namespaces appear in multiple vservers

and thats for starters...

Every so often someone decides that a deeply un-unix interface with new
syscalls is a good idea. Every time history proves them totally bonkers.
There are cases for new system calls but this doesn't seem one of them.

Look at system 5 shared memory, look at system 5 ipc, and so on. You
can't use common interfaces on them, you can't select on them, you can't
sanely pass them by fd passing.

All our existing locking uses the following behaviour

fd = open(namespace, options)
fcntl(.. lock ...)
blah
flush
fcntl(.. unlock ...)
close
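
Spelled out with the existing POSIX interfaces, that pattern is simply the
following (a minimal sketch; the path and the whole-file byte range are just
an example):

#include <fcntl.h>
#include <unistd.h>

/* Take a whole-file write lock, do the work, drop the lock.
 * Plain fcntl() record locks, nothing cluster specific. */
int with_lock(const char *path)
{
	struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
			    .l_start = 0, .l_len = 0 };	/* len 0 = whole file */
	int fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	if (fcntl(fd, F_SETLKW, &fl) < 0) {		/* blocking lock */
		close(fd);
		return -1;
	}

	/* ... do stuff, flush ... */

	fl.l_type = F_UNLCK;
	fcntl(fd, F_SETLK, &fl);			/* unlock */
	close(fd);					/* close also drops it */
	return 0;
}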

Unfortunately some people here seem to have forgotten WHY we do things
this way.

1.  The semantics of file descriptors are well understood by users and by
programs. That makes programming easier and keeps code size down
2.  Everyone knows how close() works including across fork
3.  FD passing is an obscure art but understood and just works
4.  Poll() is a standard understood interface
5.  Ownership of files is a standard model
6.  FD passing across fork/exec is controlled in a standard way
7.  The semantics for threaded applications are defined
8.  Permissions are a standard model
9.  Audit just works with the same tools
9.  SELinux just works with the same tools
10. I don't need specialist applications to see the system state (the
whole point of sysfs yet someone wants to break it all again)
11. fcntl fd locking is a posix standard interface with precisely
defined semantics. Our extensions including leases are very powerful
12. And yes - fcntl fd locking supports mandatory locking too. That also
is standards based with precise semantics.


Everyone understands how to use the existing locking operations. So if
you use the existing interfaces with some small extensions if neccessary
everyone understands how to use cluster locks. Isn't that neat


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Alan Cox
On Sad, 2005-09-03 at 21:46 -0700, Andrew Morton wrote:
 Actually I think it's rather sick.  Taking O_NONBLOCK and making it a
 lock-manager trylock because they're kinda-sorta-similar-sounding?  Spare
 me.  O_NONBLOCK means open this file in nonblocking mode, not attempt to
 acquire a clustered filesystem lock.  Not even close.

The semantics of O_NONBLOCK on many other devices are trylock
semantics. OSS audio has those semantics for example, as do regular
files in the presence of SYS5 mandatory locks. While the latter is try
lock , do operation and then drop lock the drivers using O_NDELAY are
very definitely providing trylock semantics.
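
Concretely, that trylock-at-open pattern is just (a sketch; /dev/dsp is the
classic OSS example mentioned above):

#include <fcntl.h>

/* Per the discussion above: on such devices open() itself is the lock
 * attempt, and O_NONBLOCK makes it a trylock -- a busy device fails the
 * open instead of waiting. */
int try_grab_device(void)
{
	return open("/dev/dsp", O_RDWR | O_NONBLOCK);	/* < 0 means busy/failed */
}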

I am curious why a lock manager uses open to implement its locking
semantics rather than using the locking API (POSIX locks etc) however.

Alan

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread kurt . hackel
On Mon, Sep 05, 2005 at 05:24:33PM +0800, David Teigland wrote:
 On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
  David Teigland [EMAIL PROTECTED] wrote:
  
We export our full dlm API through read/write/poll on a misc device.
  
  
  inotify did that for a while, but we ended up going with a straight syscall
  interface.
  
  How fat is the dlm interface?   ie: how many syscalls would it take?
 
 Four functions:
   create_lockspace()
   release_lockspace()
   lock()
   unlock()

FWIW, it looks like we can agree on the core interface.  ocfs2_dlm
exports essentially the same functions:
dlm_register_domain()
dlm_unregister_domain()
dlmlock()
dlmunlock()

I also implemented dlm_migrate_lockres() to explicitly remaster a lock
on another node, but this isn't used by any callers today (except for
debugging purposes).  There is also some wiring between the fs and the
dlm (eviction callbacks) to deal with some ordering issues between the
two layers, but these could go if we get stronger membership.

There are quite a few other functions in the full spec(1) that we
didn't even attempt, either because we didn't require direct 
user-kernel access or we just didn't need the function.  As for the
rather thick set of parameters expected in dlm calls, we managed to get
dlmlock down to *ahem* eight, and the rest are fairly slim.

Looking at the misc device that gfs uses, it seems like there is pretty
much complete interface to the same calls you have in kernel, validated
on the write() calls to the misc device.  With dlmfs, we were seeking to
lock down and simplify user access by using standard ast/bast/unlockast
calls, using a file descriptor as an opaque token for a single lock,
letting the vfs lifetime on this fd help with abnormal termination, etc.
I think both the misc device and dlmfs are helpful and not necessarily
mutually exclusive, and probably both are better approaches than
exporting everything via loads of syscalls (which seems to be the 
VMS/opendlm model).

-kurt

1. http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf


Kurt C. Hackel
Oracle
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Andrew Morton
Alan Cox [EMAIL PROTECTED] wrote:

 On Llu, 2005-09-05 at 02:19 -0700, Andrew Morton wrote:
  create_lockspace()
  release_lockspace()
  lock()
  unlock()
   
   Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
   is likely to object if we reserve those slots.
 
  If the locks are not file descriptors then answer the following:
 
  - How are they ref counted
  - What are the cleanup semantics
  - How do I pass a lock between processes (AF_UNIX sockets wont work now)
  - How do I poll on a lock coming free. 
  - What are the semantics of lock ownership
  - What rules apply for inheritance
  - How do I access a lock across threads.
  - What is the permission model. 
  - How do I attach audit to it
  - How do I write SELinux rules for it
  - How do I use mount to make namespaces appear in multiple vservers
 
  and thats for starters...

Return an fd from create_lockspace().
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Bernd Eckenfels
On Mon, Sep 05, 2005 at 04:16:31PM +0200, Lars Marowsky-Bree wrote:
 That is the whole point why OCFS exists ;-)

The whole point of the Oracle cluster filesystem as it was described in old
papers was about pfiles, control files and software, because you can easily
use direct block access (with ASM) for tablespaces.

 No. Beyond the table spaces, there's also ORACLE_HOME; a cluster
 benefits in several aspects from a general-purpose SAN-backed CFS.

Yes, I don't dispute the usefulness of OCFS for ORA_HOME (besides, I think a
replicated filesystem makes more sense), I am just not sure if anybody sane
would use it for tablespaces.

I guess I have to correct the article in my German IT blog :) (if somebody
can name productive customers).

Gruss
Bernd
-- 
http://itblog.eckenfels.net/archives/54-Cluster-Filesysteme.html
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Kurt Hackel
On Mon, Sep 05, 2005 at 10:24:03PM +0200, Bernd Eckenfels wrote:
 On Mon, Sep 05, 2005 at 04:16:31PM +0200, Lars Marowsky-Bree wrote:
  That is the whole point why OCFS exists ;-)
 
 The whole point of the orcacle cluster filesystem as it was described in old
 papers was about pfiles, control files and software, because you can easyly
 use direct block access (with ASM) for tablespaces.

The original OCFS was intended for use with pfiles and control files but
very definitely *not* software (the ORACLE_HOME).  It was not remotely
general purpose.  It also predated ASM by about a year or so, and the
two solutions are complementary.  Either one is a good choice for Oracle
datafiles, depending upon your needs.

  No. Beyond the table spaces, there's also ORACLE_HOME; a cluster
  benefits in several aspects from a general-purpose SAN-backed CFS.
 
 Yes, I dont dispute the usefullness of OCFS for ORA_HOME (beside I think a
 replicated filesystem makes more sense), I am just nor sure if anybody sane
 would use it for tablespaces.

Too many to mention here, but let's just say that some of the largest
databases are running Oracle datafiles on top of OCFS1.  Very large
companies with very important data.

 I guess I have to correct the artile in my german it blog :) (if somebody
 can name productive customers).

Yeah you should definitely update your blog ;-)  If you need named
references, we can give you loads of those.

-kurt

Kurt C. Hackel
Oracle
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Alan Cox
On Llu, 2005-09-05 at 12:53 -0700, Andrew Morton wrote:
   - How are they ref counted
   - What are the cleanup semantics
   - How do I pass a lock between processes (AF_UNIX sockets wont work now)
   - How do I poll on a lock coming free. 
   - What are the semantics of lock ownership
   - What rules apply for inheritance
   - How do I access a lock across threads.
   - What is the permission model. 
   - How do I attach audit to it
   - How do I write SELinux rules for it
   - How do I use mount to make namespaces appear in multiple vservers
  
   and thats for starters...
 
 Return an fd from create_lockspace().

That only answers about four of the questions. The rest only come out if
create_lockspace behaves like a file system - in other words
create_lockspace is better known as either mkdir or mount.

Its certainly viable to make the lock/unlock functions taken a fd, it's
just not clear why the current lock/unlock functions we have won't do
the job. Being able to extend the functionality to leases later on may
be very powerful indeed and will fit the existing API

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Andrew Morton
Alan Cox [EMAIL PROTECTED] wrote:

 On Llu, 2005-09-05 at 12:53 -0700, Andrew Morton wrote:
 - How are they ref counted
 - What are the cleanup semantics
 - How do I pass a lock between processes (AF_UNIX sockets wont work now)
 - How do I poll on a lock coming free. 
 - What are the semantics of lock ownership
 - What rules apply for inheritance
 - How do I access a lock across threads.
 - What is the permission model. 
 - How do I attach audit to it
 - How do I write SELinux rules for it
 - How do I use mount to make namespaces appear in multiple vservers

 and thats for starters...
   
   Return an fd from create_lockspace().
 
  That only answers about four of the questions. The rest only come out if
  create_lockspace behaves like a file system - in other words
  create_lockspace is better known as either mkdir or mount.

But David said that We export our full dlm API through read/write/poll on
a misc device..  That miscdevice will simply give us an fd.  Hence my
suggestion that the miscdevice be done away with in favour of a dedicated
syscall which returns an fd.

What does a filesystem have to do with this?
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Joel Becker
On Sun, Sep 04, 2005 at 09:37:15AM +0100, Alan Cox wrote:
 I am curious why a lock manager uses open to implement its locking
 semantics rather than using the locking API (POSIX locks etc) however.

Because it is simple (how do you fcntl(2) from a shell fd?), has no
ranges (what do you do with ranges passed in to fcntl(2) and you don't
support them?), and has a well-known fork(2)/exec(2) pattern.  fcntl(2)
has a known but less intuitive fork(2) pattern.
The real reason, though, is that we never considered fcntl(2).
We could never think of a case when a process wanted a lock fd open but
not locked.  At least, that's my recollection.  Mark might have more to
comment.

Joel

-- 

In the room the women come and go
 Talking of Michaelangelo.

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [EMAIL PROTECTED]
Phone: (650) 506-8127

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Joel Becker
On Mon, Sep 05, 2005 at 10:24:03PM +0200, Bernd Eckenfels wrote:
 The whole point of the orcacle cluster filesystem as it was described in old
 papers was about pfiles, control files and software, because you can easyly
 use direct block access (with ASM) for tablespaces.

OCFS, the original filesystem, only works for datafiles,
logfiles, and other database data.  It's currently used in serious anger
by several major customers.  Oracle's websites must have a list of them
somewhere.  We're talking many terabytes of datafiles.

 Yes, I dont dispute the usefullness of OCFS for ORA_HOME (beside I think a
 replicated filesystem makes more sense), I am just nor sure if anybody sane
 would use it for tablespaces.

OCFS2, the new filesystem, is fully general purpose.  It
supports all the usual stuff, is quite fast, and is what we expect folks
to use for both ORACLE_HOME and datafiles in the future.  Customers can,
of course, use ASM or even raw devices.  OCFS2 is as fast as raw
devices, and far more manageable, so raw devices are probably not a
choice for the future.  ASM has its own management advantages, and we
certainly expect customers to like it as well.  But that doesn't mean
people won't use OCFS2 for datafiles depending on their environment or
needs.


-- 

The first requisite of a good citizen in this republic of ours
 is that he shall be able and willing to pull his weight.
- Theodore Roosevelt

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [EMAIL PROTECTED]
Phone: (650) 506-8127
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
 On Monday 05 September 2005 10:49, Daniel Phillips wrote:
  On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
   On 2005-09-03T01:57:31, Daniel Phillips [EMAIL PROTECTED] wrote:
The only current users of dlms are cluster filesystems.  There are
zero users of the userspace dlm api.
  
   That is incorrect...
 
  Application users Lars, sorry if I did not make that clear.  The issue is
  whether we need to export an all-singing-all-dancing dlm api from kernel
  to userspace today, or whether we can afford to take the necessary time
  to get it right while application writers take their time to have a good
  think about whether they even need it.

 If Linux fully supported OpenVMS DLM semantics we could start thinking
 asbout moving our application onto a Linux box because our alpha server is
 aging.

 That's just my user application writer $0.02.

What stops you from trying it with the patch?  That kind of feedback would be 
worth way more than $0.02.

Regards,

Daniel
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Dmitry Torokhov
On Monday 05 September 2005 19:57, Daniel Phillips wrote:
 On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
  On Monday 05 September 2005 10:49, Daniel Phillips wrote:
   On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
On 2005-09-03T01:57:31, Daniel Phillips [EMAIL PROTECTED] wrote:
 The only current users of dlms are cluster filesystems.  There are
 zero users of the userspace dlm api.
   
That is incorrect...
  
   Application users Lars, sorry if I did not make that clear.  The issue is
   whether we need to export an all-singing-all-dancing dlm api from kernel
   to userspace today, or whether we can afford to take the necessary time
   to get it right while application writers take their time to have a good
   think about whether they even need it.
 
  If Linux fully supported OpenVMS DLM semantics we could start thinking
  asbout moving our application onto a Linux box because our alpha server is
  aging.
 
  That's just my user application writer $0.02.
 
 What stops you from trying it with the patch?  That kind of feedback would be 
 worth way more than $0.02.


We do not have such plans at the moment and I prefer spending my free
time on tinkering with kernel, not rewriting some in-house application.
Besides, DLM is not the only thing that does not have a drop-in
replacement in Linux.

You just said you did not know if there are any potential users for the
full DLM and I said there are some.

-- 
Dmitry
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 22:03, Dmitry Torokhov wrote:
 On Monday 05 September 2005 19:57, Daniel Phillips wrote:
  On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
   On Monday 05 September 2005 10:49, Daniel Phillips wrote:
On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
 On 2005-09-03T01:57:31, Daniel Phillips [EMAIL PROTECTED] wrote:
  The only current users of dlms are cluster filesystems.  There
  are zero users of the userspace dlm api.

 That is incorrect...
   
Application users Lars, sorry if I did not make that clear.  The
issue is whether we need to export an all-singing-all-dancing dlm api
from kernel to userspace today, or whether we can afford to take the
necessary time to get it right while application writers take their
time to have a good think about whether they even need it.
  
   If Linux fully supported OpenVMS DLM semantics we could start thinking
   asbout moving our application onto a Linux box because our alpha server
   is aging.
  
   That's just my user application writer $0.02.
 
  What stops you from trying it with the patch?  That kind of feedback
  would be worth way more than $0.02.

 We do not have such plans at the moment and I prefer spending my free
 time on tinkering with kernel, not rewriting some in-house application.
 Besides, DLM is not the only thing that does not have a drop-in
 replacement in Linux.

 You just said you did not know if there are any potential users for the
 full DLM and I said there are some.

I did not say potential, I said there are zero dlm applications at the 
moment.  Nobody has picked up the prototype (g)dlm api, used it in an 
application and said gee this works great, look what it does.

I also claim that most developers who think that using a dlm for application 
synchronization would be really cool are probably wrong.  Use sockets for 
synchronization exactly as for a single-node, multi-tasking application and 
you will end up with less code, more obviously correct code, probably more 
efficient and... you get an optimal, single-node version for free.

And I also claim that there is precious little reason to have a full-featured 
dlm in-kernel.  Being in-kernel has no benefit for a userspace application.  
But being in-kernel does add kernel bloat, because there will be extra 
features lathered on that are not needed by the only in-kernel user, the 
cluster filesystem.

In the case of your port, you'd be better off hacking up a userspace library 
to provide OpenVMS dlm semantics exactly, not almost.

By the way, you said alpha server not alpha servers, was that just a slip?  
Because if you don't have a cluster then why are you using a dlm?

Regards,

Daniel
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 19:37, Joel Becker wrote:
  OCFS2, the new filesystem, is fully general purpose.  It
 supports all the usual stuff, is quite fast...

So I have heard, but isn't it time to quantify that?  How do you think you 
would stack up here:

   http://www.caspur.it/Files/2005/01/10/1105354214692.pdf

Regards,

Daniel
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-04 Thread David Teigland
On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:

> +void gfs2_glock_hold(struct gfs2_glock *gl)
> +{
> + glock_hold(gl);
> +}
> 
> eh why?

You removed the comment stating exactly why, see below.  If that's not an
accepted technique in the kernel, say so and I'll be happy to change it
here and elsewhere.
Thanks,
Dave

static inline void glock_hold(struct gfs2_glock *gl)
{
gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0);
atomic_inc(&gl->gl_count);
}

/**
 * gfs2_glock_hold() - As glock_hold(), but suitable for exporting
 * @gl: The glock to hold
 *
 */

void gfs2_glock_hold(struct gfs2_glock *gl)
{
glock_hold(gl);
}

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread David Teigland
On Sat, Sep 03, 2005 at 10:41:40PM -0700, Andrew Morton wrote:
> Joel Becker <[EMAIL PROTECTED]> wrote:
> >
> >  > What happens when we want to add some new primitive which has no
> >  > posix-file analog?
> > 
> > The point of dlmfs is not to express every primitive that the
> >  DLM has.  dlmfs cannot express the CR, CW, and PW levels of the VMS
> >  locking scheme.  Nor should it.  The point isn't to use a filesystem
> >  interface for programs that need all the flexibility and power of the
> >  VMS DLM.  The point is a simple system that programs needing the basic
> >  operations can use.  Even shell scripts.
> 
> Are you saying that the posix-file lookalike interface provides access to
> part of the functionality, but there are other APIs which are used to
> access the rest of the functionality?  If so, what is that interface, and
> why cannot that interface offer access to 100% of the functionality, thus
> making the posix-file tricks unnecessary?

We're using our dlm quite a bit in user space and require the full dlm
API.  It's difficult to export the full API through a pseudo fs like
dlmfs, so we've not found it a very practical approach.  That said, it's a
nice idea and I'd be happy if someone could map a more complete dlm API
onto it.

We export our full dlm API through read/write/poll on a misc device.  All
user space apps use the dlm through a library as you'd expect.  The
library communicates with the dlm_device kernel module through
read/write/poll and the dlm_device module talks with the actual dlm:
linux/drivers/dlm/device.c  If there's a better way to do this, via a
pseudo fs or not, we'd be pleased to try it.
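
Roughly, the library end of that looks like the following (an illustrative
sketch only -- the request layout and op codes here are made up and are not
the real dlm_device wire format):

#include <fcntl.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Invented request/reply layout, just to show the write/poll/read flow
 * against an already-open misc device fd. */
struct fake_dlm_request { uint32_t op; uint32_t mode; char name[64]; };
struct fake_dlm_reply   { uint32_t status; uint32_t lkid; };

int request_lock_sketch(int fd, const char *resource)
{
	struct fake_dlm_request req = { .op = 1 /* "lock" */, .mode = 5 };
	struct fake_dlm_reply rep;
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	strncpy(req.name, resource, sizeof(req.name) - 1);
	if (write(fd, &req, sizeof(req)) != sizeof(req))	/* submit request */
		return -1;

	if (poll(&pfd, 1, -1) <= 0)			/* wait for the completion */
		return -1;

	if (read(fd, &rep, sizeof(rep)) != sizeof(rep) || rep.status != 0)
		return -1;

	return (int)rep.lkid;				/* lock id for later unlock */
}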

Dave

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-04 Thread David Teigland
On Fri, Sep 02, 2005 at 10:28:21PM -0700, Greg KH wrote:
> On Fri, Sep 02, 2005 at 05:44:03PM +0800, David Teigland wrote:
> > On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
> > 
> > > + gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,);
> > 
> > > what is gfs2_assert() about anyway? please just use BUG_ON directly
> > > everywhere
> > 
> > When a machine has many gfs file systems mounted at once it can be useful
> > to know which one failed.  Does the following look ok?
> > 
> > #define gfs2_assert(sdp, assertion)   \
> > do {  \
> > if (unlikely(!(assertion))) { \
> > printk(KERN_ERR   \
> > "GFS2: fsid=%s: fatal: assertion \"%s\" failed\n" \
> > "GFS2: fsid=%s:   function = %s\n"\
> > "GFS2: fsid=%s:   file = %s, line = %u\n" \
> > "GFS2: fsid=%s:   time = %lu\n",  \
> > sdp->sd_fsname, # assertion,  \
> > sdp->sd_fsname,  __FUNCTION__,\
> > sdp->sd_fsname, __FILE__, __LINE__,   \
> > sdp->sd_fsname, get_seconds());   \
> > BUG();\
> 
> You will already get the __FUNCTION__ (and hence the __FILE__ info)
> directly from the BUG() dump, as well as the time from the syslog
> message (turn on the printk timestamps if you want a more fine grain
> timestamp), so the majority of this macro is redundant with the BUG()
> macro...

Joern already suggested moving this out of line and into a function (as it
was before) to avoid repeating string constants.  In that case the
function, file and line from BUG aren't useful.  We now have this, does it
look ok?

void gfs2_assert_i(struct gfs2_sbd *sdp, char *assertion, const char *function,
   char *file, unsigned int line)
{
panic("GFS2: fsid=%s: fatal: assertion \"%s\" failed\n"
  "GFS2: fsid=%s:   function = %s, file = %s, line = %u\n",
  sdp->sd_fsname, assertion,
  sdp->sd_fsname, function, file, line);
}

#define gfs2_assert(sdp, assertion) \
do { \
if (unlikely(!(assertion))) { \
gfs2_assert_i((sdp), #assertion, \
  __FUNCTION__, __FILE__, __LINE__); \
} \
} while (0)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-04 Thread Joel Becker
On Sun, Sep 04, 2005 at 10:33:44PM +0200, Pavel Machek wrote:
> > - read-only mount
> > - "specatator" mount (like ro but no journal allocated for the mount,
> >   no fencing needed for failed node that was mounted as specatator)
> 
> I'd call it "real-read-only", and yes, that's very usefull
> mount. Could we get it for ext3, too?

In OCFS2 we call readonly+journal+connected-to-cluster "soft
readonly".  We're a live node, other nodes know we exist, and we can
flush pending transactions during the rw->ro transition.  In addition,
we can allow a ro->rw transition.
The no-journal+no-cluster-connection mode we call "hard
readonly".  This is the mode you get when a device itself is readonly,
because you can't do *anything*.

Joel

-- 

"Lately I've been talking in my sleep.
 Can't imagine what I'd have to say.
 Except my world will be right
 When love comes back my way."

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [EMAIL PROTECTED]
Phone: (650) 506-8127
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-04 Thread Pavel Machek
Hi!

> - read-only mount
> - "specatator" mount (like ro but no journal allocated for the mount,
>   no fencing needed for failed node that was mounted as specatator)

I'd call it "real-read-only", and yes, that's very usefull
mount. Could we get it for ext3, too?
Pavel
-- 
if you have sharp zaurus hardware you don't need... you know my address
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Daniel Phillips
On Sunday 04 September 2005 03:28, Andrew Morton wrote:
> If there is already a richer interface into all this code (such as a
> syscall one) and it's feasible to migrate the open() tricksies to that API
> in the future if it all comes unstuck then OK.  That's why I asked (thus
> far unsuccessfully):
>
>Are you saying that the posix-file lookalike interface provides
>access to part of the functionality, but there are other APIs which are
>used to access the rest of the functionality?  If so, what is that
>interface, and why cannot that interface offer access to 100% of the
>functionality, thus making the posix-file tricks unnecessary?

There is no such interface at the moment, nor is one needed in the immediate 
future.  Let's look at the arguments for exporting a dlm to userspace:

  1) Since we already have a dlm in kernel, why not just export that and save
 100K of userspace library?  Answer: because we don't want userspace-only
 dlm features bulking up the kernel.  Answer #2: the extra syscalls and
 interface baggage serve no useful purpose.

  2) But we need to take locks in the same lockspaces as the kernel dlm(s)!
 Answer: only support tools need to do that.  A cut-down locking api is
 entirely appropriate for this.

  3) But the kernel dlm is the only one we have!  Answer: easily fixed, a
 simple matter of coding.  But please bear in mind that dlm-style
 synchronization is probably a bad idea for most cluster applications,
 particularly ones that already do their synchronization via sockets.

In other words, exporting the full dlm api is a red herring.  It has nothing 
to do with getting cluster filesystems up and running.  It is really just 
marketing: it sounds like a great thing for userspace to get a dlm "for 
free", but it isn't free, it contributes to kernel bloat and it isn't even 
the most efficient way to do it.

If after considering that, we _still_ want to export a dlm api from kernel, 
then can we please take the necessary time and get it right?  The full api 
requires not only syscall-style elements, but asynchronous events as well, 
similar to aio.  I do not think anybody has a good answer to this today, nor 
do we even need it to begin porting applications to cluster filesystems.

Oracle guys: what is the distributed locking API for RAC?  Is the RAC team 
waiting with bated breath to adopt your kernel-based dlm?  If not, why not?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Hua Zhong
>takelock domainxxx lock1
>do stuff
>droplock domainxxx lock1
> 
> When someone kills the shell, the lock is leaked, because droplock isn't
> called.

Why not open the lock resource (or the lock space) instead of
individual locks as file? It then looks like this:

open lock space file
takelock lockresource lock1
do stuff
droplock lockresource lock1
close lock space file

Then if you are killed the ->release of lock space file should take
care of cleaning up all the locks
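
A sketch of the same idea in C (the path and the command strings are
hypothetical -- the point is only that everything taken through the lock
space fd can be torn down by that fd's ->release when the process dies):

#include <fcntl.h>
#include <unistd.h>

int do_locked_work(void)
{
	int ls = open("/dlm/domainxxx", O_RDWR);	/* the lock space file */
	if (ls < 0)
		return -1;

	write(ls, "takelock lock1", 14);		/* take the lock */

	/* ... do stuff; if we are killed here, ->release of the lock
	 * space file runs at exit and drops lock1 for us ... */

	write(ls, "droplock lock1", 14);		/* normal path */
	close(ls);					/* ->release cleans up */
	return 0;
}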
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Joel Becker
On Sun, Sep 04, 2005 at 02:18:36AM -0700, Andrew Morton wrote:
>   take-and-drop-lock -d domainxxx -l lock1 -e "do stuff"

Ahh, but then you have to have lots of scripts somewhere in
path, or do massive inline scripts.  especially if you want to take
another lock in there somewhere.
It's doable, but it's nowhere near as easy. :-)

Joel

-- 

"I always thought the hardest questions were those I could not answer.
 Now I know they are the ones I can never ask."
- Charlie Watkins

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [EMAIL PROTECTED]
Phone: (650) 506-8127

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Andrew Morton
Joel Becker <[EMAIL PROTECTED]> wrote:
>
>   I can't see how that works easily.  I'm not worried about a
>  tarball (eventually Red Hat and SuSE and Debian would have it).  I'm
>  thinking about this shell:
> 
>   exec 7< [...]
>   do stuff
>   exec 7< [...]
>  If someone kills the shell while stuff is doing, the lock is unlocked
>  because fd 7 is closed.  However, if you have an application to do the
>  locking:
> 
>   takelock domainxxx lock1
>   do stuff
>   droplock domainxxx lock1
> 
>  When someone kills the shell, the lock is leaked, because droplock isn't
>  called.  And SEGV/QUIT/-9 (especially -9, folks love it too much) are
>  handled by the first example but not by the second.


take-and-drop-lock -d domainxxx -l lock1 -e "do stuff"
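
Such a wrapper is only a few dozen lines.  A sketch against an ordinary
local lock file (using flock(2); a cluster version would take the lock via
the dlm library or dlmfs instead, but the take/exec/drop shape is the same):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>
#include <sys/types.h>
#include <sys/wait.h>

/* usage: take-and-drop-lock <lockfile> <command> [args...]
 * The lock is held for exactly the lifetime of the child; if the wrapper
 * or the child dies, the fd is closed and the lock goes away with it. */
int main(int argc, char **argv)
{
	if (argc < 3) {
		fprintf(stderr, "usage: %s <lockfile> <command> [args...]\n", argv[0]);
		return 2;
	}

	int fd = open(argv[1], O_RDWR | O_CREAT, 0600);
	if (fd < 0 || flock(fd, LOCK_EX) < 0) {		/* blocking exclusive lock */
		perror("lock");
		return 1;
	}

	pid_t pid = fork();
	if (pid < 0) {
		perror("fork");
		return 1;
	}
	if (pid == 0) {
		execvp(argv[2], &argv[2]);		/* "do stuff" */
		perror("exec");
		_exit(127);
	}

	int status = 0;
	waitpid(pid, &status, 0);
	flock(fd, LOCK_UN);				/* drop the lock */
	close(fd);
	return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}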
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Joel Becker
On Sun, Sep 04, 2005 at 01:18:05AM -0700, Andrew Morton wrote:
> > I thought I stated this in my other email.  We're not intending
> > to extend dlmfs.
> 
> Famous last words ;)

Heh, of course :-)

> I don't buy the general "fs is nice because we can script it" argument,
> really.  You can just write a few simple applications which provide access
> to the syscalls (or the fs!) and then write scripts around those.

I can't see how that works easily.  I'm not worried about a
tarball (eventually Red Hat and SuSE and Debian would have it).  I'm
thinking about this shell:

        exec 7< [...]
        do stuff
        exec 7< [...]

If someone kills the shell while stuff is doing, the lock is unlocked
because fd 7 is closed.  However, if you have an application to do the
locking:

        takelock domainxxx lock1
        do stuff
        droplock domainxxx lock1

When someone kills the shell, the lock is leaked, because droplock isn't
called.  And SEGV/QUIT/-9 (especially -9, folks love it too much) are
handled by the first example but not by the second.

Joel

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Andrew Morton
Mark Fasheh <[EMAIL PROTECTED]> wrote:
>
> On Sun, Sep 04, 2005 at 12:23:43AM -0700, Andrew Morton wrote:
> > > What would be an acceptable replacement? I admit that O_NONBLOCK -> 
> > > trylock
> > > is a bit unfortunate, but really it just needs a bit to express that -
> > > nobody over here cares what it's called.
> > 
> > The whole idea of reinterpreting file operations to mean something utterly
> > different just seems inappropriate to me.
> Putting aside trylock for a minute, I'm not sure how utterly different the
> operations are. You create a lock resource by creating a file named after
> it. You get a lock (fd) at read or write level on the resource by calling 
> open(2) with the appropriate mode (O_RDONLY, O_WRONLY/O_RDWR).
> Now that we've got an fd, lock value blocks are naturally represented as
> file data which can be read(2) or written(2).
> Close(2) drops the lock.
> 
> A really trivial usage example from shell:
> 
> node1$ echo "hello world" > mylock
> node2$ cat mylock
> hello world
> 
> I could always give a more useful one after I get some sleep :)

It isn't extensible though.  One couldn't retain this approach while adding
(random cfs ignorance exposure) upgrade-read, downgrade-write,
query-for-various-runtime-stats, priority modification, whatever.

> > You get a lot of goodies when using a filesystem - the ability for
> > unrelated processes to look things up, resource release on exit(), etc.  If
> > those features are valuable in the ocfs2 context then fine.
> Right, they certainly are and I think Joel, in another e-mail on this
> thread, explained well the advantages of using a filesystem.
> 
> > But I'd have thought that it would be saner and more extensible to add new
> > syscalls (perhaps taking fd's) rather than overloading the open() mode in
> > this manner.
> The idea behind dlmfs was to very simply export a small set of cluster dlm
> operations to userspace. Given that goal, I felt that a whole set of system
> calls would have been overkill. That said, I think perhaps I should clarify
> that I don't intend dlmfs to become _the_ userspace dlm api, just a simple
> and (imho) intuitive one which could be trivially accessed from any software
> which just knows how to read and write files.

Well, as I say.  Making it a filesystem is superficially attractive, but
once you've built a super-dooper enterprise-grade infrastructure on top of
it all, nobody's going to touch the fs interface by hand and you end up
wondering why it's there, adding baggage.

Not that I'm questioning the fs interface!  It has useful permission
management, monitoring and resource releasing characteristics.  I'm
questioning the open() tricks.  I guess from Joel's tiny description, the
filesystem's interpretation of mknod and mkdir look sensible enough.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Andrew Morton
Joel Becker <[EMAIL PROTECTED]> wrote:
>
> On Sun, Sep 04, 2005 at 12:28:28AM -0700, Andrew Morton wrote:
> > If there is already a richer interface into all this code (such as a
> > syscall one) and it's feasible to migrate the open() tricksies to that API
> > in the future if it all comes unstuck then OK.
> > That's why I asked (thus far unsuccessfully):
> 
>   I personally was under the impression that "syscalls are not
> to be added".

We add syscalls all the time.  Whichever user<->kernel API is considered to
be most appropriate, use it.

>  I'm also wary of the effort required to hook into process
> exit.

I'm not questioning the use of a filesystem.  I'm questioning this
overloading of normal filesystem system calls.  For example (and this is
just an example!  there's also mknod, mkdir, O_RDWR, O_EXCL...) it would be
more usual to do

fd = open("/sys/whatever", ...);
err = sys_dlm_trylock(fd);

I guess your current implementation prevents /sys/whatever from ever
appearing if the trylock failed.  Dunno if that's valuable.

>  Not to mention all the lifetiming that has to be written again.
>   On top of that, we lose our cute ability to shell script it.  We
> find this very useful in testing, and think others would in practice.
> 
> >Are you saying that the posix-file lookalike interface provides
> >access to part of the functionality, but there are other APIs which are
> >used to access the rest of the functionality?  If so, what is that
> >interface, and why cannot that interface offer access to 100% of the
> >functionality, thus making the posix-file tricks unnecessary?
> 
>   I thought I stated this in my other email.  We're not intending
> to extend dlmfs.

Famous last words ;)

>  It pretty much covers the simple DLM usage required of
> a simple interface.  The OCFS2 DLM does not provide any other
> functionality.
>   If the OCFS2 DLM grew more functionality, or you consider the
> GFS2 DLM that already has it (and a less intuitive interface via sysfs
> IIRC), I would contend that dlmfs still has a place.  It's simple to use
> and understand, and it's usable from shell scripts and other simple
> code.

(wonders how to do O_NONBLOCK from a script)




I don't buy the general "fs is nice because we can script it" argument,
really.  You can just write a few simple applications which provide access
to the syscalls (or the fs!) and then write scripts around those.

Yes, you suddenly need to get a little tarball into users' hands and that's
a hassle.  And I sometimes think we let this hassle guide kernel interfaces
(mutters something about /sbin/hotplug), and that's sad.  
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Mark Fasheh
On Sun, Sep 04, 2005 at 12:23:43AM -0700, Andrew Morton wrote:
> > What would be an acceptable replacement? I admit that O_NONBLOCK -> trylock
> > is a bit unfortunate, but really it just needs a bit to express that -
> > nobody over here cares what it's called.
> 
> The whole idea of reinterpreting file operations to mean something utterly
> different just seems inappropriate to me.
Putting aside trylock for a minute, I'm not sure how utterly different the
operations are. You create a lock resource by creating a file named after
it. You get a lock (fd) at read or write level on the resource by calling 
open(2) with the appropriate mode (O_RDONLY, O_WRONLY/O_RDWR).
Now that we've got an fd, lock value blocks are naturally represented as
file data which can be read(2) or written(2).
Close(2) drops the lock.

A really trivial usage example from shell:

node1$ echo "hello world" > mylock
node2$ cat mylock
hello world

I could always give a more useful one after I get some sleep :)
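
For comparison, here is the same round trip from C (a minimal sketch: the
/dlm mount point and domain name are assumed, and error checking is
omitted):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char lvb[64] = "hello world";
	int fd;

	/* Creating/opening the file takes the lock: O_RDWR gives a write
	 * (exclusive) level, O_RDONLY a read (shared) level, and open()
	 * blocks until the lock is granted. */
	fd = open("/dlm/mydomain/mylock", O_CREAT | O_RDWR, 0600);

	/* The lock value block is simply the file contents. */
	write(fd, lvb, sizeof(lvb));

	/* Closing the fd drops the lock. */
	close(fd);
	return 0;
}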

> You get a lot of goodies when using a filesystem - the ability for
> unrelated processes to look things up, resource release on exit(), etc.  If
> those features are valuable in the ocfs2 context then fine.
Right, they certainly are and I think Joel, in another e-mail on this
thread, explained well the advantages of using a filesystem.

> But I'd have thought that it would be saner and more extensible to add new
> syscalls (perhaps taking fd's) rather than overloading the open() mode in
> this manner.
The idea behind dlmfs was to very simply export a small set of cluster dlm
operations to userspace. Given that goal, I felt that a whole set of system
calls would have been overkill. That said, I think perhaps I should clarify
that I don't intend dlmfs to become _the_ userspace dlm api, just a simple
and (imho) intuitive one which could be trivially accessed from any software
which just knows how to read and write files.
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[EMAIL PROTECTED]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Joel Becker
On Sun, Sep 04, 2005 at 12:28:28AM -0700, Andrew Morton wrote:
> If there is already a richer interface into all this code (such as a
> syscall one) and it's feasible to migrate the open() tricksies to that API
> in the future if it all comes unstuck then OK.
> That's why I asked (thus far unsuccessfully):

I personally was under the impression that "syscalls are not
to be added".  I'm also wary of the effort required to hook into process
exit.  Not to mention all the lifetiming that has to be written again.
On top of that, we lose our cute ability to shell script it.  We
find this very useful in testing, and think others would in practice.

>Are you saying that the posix-file lookalike interface provides
>access to part of the functionality, but there are other APIs which are
>used to access the rest of the functionality?  If so, what is that
>interface, and why cannot that interface offer access to 100% of the
>functionality, thus making the posix-file tricks unnecessary?

I thought I stated this in my other email.  We're not intending
to extend dlmfs.  It pretty much covers the simple DLM usage required of
a simple interface.  The OCFS2 DLM does not provide any other
functionality.
If the OCFS2 DLM grew more functionality, or you consider the
GFS2 DLM that already has it (and a less intuitive interface via sysfs
IIRC), I would contend that dlmfs still has a place.  It's simple to use
and understand, and it's usable from shell scripts and other simple
code.

Joel

-- 

"The first thing we do, let's kill all the lawyers."
-Henry VI, IV:ii

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [EMAIL PROTECTED]
Phone: (650) 506-8127

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Andrew Morton
Daniel Phillips <[EMAIL PROTECTED]> wrote:
>
> If the only user is their tools I would say let it go ahead and be cute, even 
>  sickeningly so.  It is not supposed to be a general dlm api, at least that 
> is 
>  my understanding.  It is just supposed to be an interface for their tools.  
>  Of course it would help to know exactly how those tools use it.

Well I'm not saying "don't do this".   I'm saying "eww" and "why?".

If there is already a richer interface into all this code (such as a
syscall one) and it's feasible to migrate the open() tricksies to that API
in the future if it all comes unstuck then OK.  That's why I asked (thus
far unsuccessfully):

   Are you saying that the posix-file lookalike interface provides
   access to part of the functionality, but there are other APIs which are
   used to access the rest of the functionality?  If so, what is that
   interface, and why cannot that interface offer access to 100% of the
   functionality, thus making the posix-file tricks unnecessary?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Andrew Morton
Mark Fasheh <[EMAIL PROTECTED]> wrote:
>
> On Sat, Sep 03, 2005 at 09:46:53PM -0700, Andrew Morton wrote:
> > Actually I think it's rather sick.  Taking O_NONBLOCK and making it a
> > lock-manager trylock because they're kinda-sorta-similar-sounding?  Spare
> > me.  O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> > acquire a clustered filesystem lock".  Not even close.
>
> What would be an acceptable replacement? I admit that O_NONBLOCK -> trylock
> is a bit unfortunate, but really it just needs a bit to express that -
> nobody over here cares what it's called.

The whole idea of reinterpreting file operations to mean something utterly
different just seems inappropriate to me.

You get a lot of goodies when using a filesystem - the ability for
unrelated processes to look things up, resource release on exit(), etc.  If
those features are valuable in the ocfs2 context then fine.  But I'd have
thought that it would be saner and more extensible to add new syscalls
(perhaps taking fd's) rather than overloading the open() mode in this
manner.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Daniel Phillips
On Sunday 04 September 2005 00:46, Andrew Morton wrote:
> Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > The model you came up with for dlmfs is beyond cute, it's downright
> > clever.
>
> Actually I think it's rather sick.  Taking O_NONBLOCK and making it a
> lock-manager trylock because they're kinda-sorta-similar-sounding?  Spare
> me.  O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> acquire a clustered filesystem lock".  Not even close.

Now, I see the ocfs2 guys are all ready to back down on this one, but I will 
at least argue weakly in favor.

Sick is a nice word for it, but it is actually not that far off.  Normally, 
this fs will acquire a lock whenever the user creates a virtual file and the 
create will block until the global lock arrives.  With O_NONBLOCK, it will 
return, erm... ETXTBSY (!) immediately.  Is that not what O_NONBLOCK is 
supposed to accomplish?
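
Concretely (an illustrative sketch only, with the dlmfs path assumed), the
trylock from C would be just:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* O_NONBLOCK turns the blocking acquire into a trylock. */
	int fd = open("/dlm/mydomain/mylock", O_CREAT | O_RDWR | O_NONBLOCK, 0600);

	if (fd < 0 && errno == ETXTBSY) {
		fprintf(stderr, "lock busy, not waiting\n");
		return 1;
	}
	/* ... lock held, do the work ... */
	close(fd);
	return 0;
}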

> It would be much better to do something which explicitly and directly
> expresses what you're trying to do rather than this strange "lets do this
> because the names sound the same" thing.
>
> What happens when we want to add some new primitive which has no posix-file
> analog?
>
> Way too cute.  Oh well, whatever.

The explicit way is syscalls or a set of ioctls, which he already has the 
makings of.  If there is going to be a userspace api, I would hope it looks 
more like the contents of userdlm.c than the traditional Vaxcluster API, 
which sucks beyond belief.

Another explicit way is to do it with a whole set of virtual attributes 
instead of just a single file trying to capture the whole model.  That is 
really unappealing, but I am afraid that is exactly what a whole lot of 
sysfs/configfs usage is going to end up looking like.

But more to the point: we have no urgent need for a userspace dlm api at the 
moment.  Nothing will break if we just put that issue off for a few months, 
quite the contrary.

If the only user is their tools I would say let it go ahead and be cute, even 
sickeningly so.  It is not supposed to be a general dlm api, at least that is 
my understanding.  It is just supposed to be an interface for their tools.  
Of course it would help to know exactly how those tools use it.  Too sleepy 
to find out tonight...

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Mark Fasheh
On Sat, Sep 03, 2005 at 09:46:53PM -0700, Andrew Morton wrote:
> Actually I think it's rather sick.  Taking O_NONBLOCK and making it a
> lock-manager trylock because they're kinda-sorta-similar-sounding?  Spare
> me.  O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> acquire a clustered filesystem lock".  Not even close.
What would be an acceptable replacement? I admit that O_NONBLOCK -> trylock
is a bit unfortunate, but really it just needs a bit to express that -
nobody over here cares what it's called.
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[EMAIL PROTECTED]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Hua Zhong
> > 	takelock domainxxx lock1
> > 	do stuff
> > 	droplock domainxxx lock1
> 
> When someone kills the shell, the lock is leaked, because droplock isn't
> called.

Why not open the lock resource (or the lock space) instead of
individual locks as files? It then looks like this:

open lock space file
takelock lockresource lock1
do stuff
droplock lockresource lock1
close lock space file

Then if you are killed, the ->release of the lock space file should take
care of cleaning up all the locks.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Daniel Phillips
On Sunday 04 September 2005 03:28, Andrew Morton wrote:
> If there is already a richer interface into all this code (such as a
> syscall one) and it's feasible to migrate the open() tricksies to that API
> in the future if it all comes unstuck then OK.  That's why I asked (thus
> far unsuccessfully):
> 
>    Are you saying that the posix-file lookalike interface provides
>    access to part of the functionality, but there are other APIs which are
>    used to access the rest of the functionality?  If so, what is that
>    interface, and why cannot that interface offer access to 100% of the
>    functionality, thus making the posix-file tricks unnecessary?

There is no such interface at the moment, nor is one needed in the immediate 
future.  Let's look at the arguments for exporting a dlm to userspace:

  1) Since we already have a dlm in kernel, why not just export that and save
 100K of userspace library?  Answer: because we don't want userspace-only
 dlm features bulking up the kernel.  Answer #2: the extra syscalls and
 interface baggage serve no useful purpose.

  2) But we need to take locks in the same lockspaces as the kernel dlm(s)!
 Answer: only support tools need to do that.  A cut-down locking api is
 entirely appropriate for this.

  3) But the kernel dlm is the only one we have!  Answer: easily fixed, a
 simple matter of coding.  But please bear in mind that dlm-style
 synchronization is probably a bad idea for most cluster applications,
 particularly ones that already do their synchronization via sockets.

In other words, exporting the full dlm api is a red herring.  It has nothing 
to do with getting cluster filesystems up and running.  It is really just 
marketing: it sounds like a great thing for userspace to get a dlm "for 
free", but it isn't free, it contributes to kernel bloat and it isn't even 
the most efficient way to do it.

If after considering that, we _still_ want to export a dlm api from kernel, 
then can we please take the necessary time and get it right?  The full api 
requires not only syscall-style elements, but asynchronous events as well, 
similar to aio.  I do not think anybody has a good answer to this today, nor 
do we even need it to begin porting applications to cluster filesystems.

Oracle guys: what is the distributed locking API for RAC?  Is the RAC team 
waiting with bated breath to adopt your kernel-based dlm?  If not, why not?

Regards,

Daniel
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-04 Thread Pavel Machek
Hi!

> - read-only mount
> - spectator mount (like ro but no journal allocated for the mount,
>   no fencing needed for failed node that was mounted as spectator)

I'd call it real-read-only, and yes, that's a very useful
mount. Could we get it for ext3, too?
Pavel
-- 
if you have sharp zaurus hardware you don't need... you know my address
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-04 Thread Joel Becker
On Sun, Sep 04, 2005 at 10:33:44PM +0200, Pavel Machek wrote:
> > - read-only mount
> > - spectator mount (like ro but no journal allocated for the mount,
> >   no fencing needed for failed node that was mounted as spectator)
> 
> I'd call it real-read-only, and yes, that's a very useful
> mount. Could we get it for ext3, too?

In OCFS2 we call readonly+journal+connected-to-cluster "soft
readonly."  We're a live node, other nodes know we exist, and we can
flush pending transactions during the rw->ro transition.  In addition,
we can allow a ro->rw transition.
The no-journal+no-cluster-connection mode we call "hard
readonly."  This is the mode you get when a device itself is readonly,
because you can't do *anything*.

Joel

-- 

Lately I've been talking in my sleep.
 Can't imagine what I'd have to say.
 Except my world will be right
 When love comes back my way.

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [EMAIL PROTECTED]
Phone: (650) 506-8127
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GFS, what's remaining

2005-09-04 Thread David Teigland
On Fri, Sep 02, 2005 at 10:28:21PM -0700, Greg KH wrote:
> On Fri, Sep 02, 2005 at 05:44:03PM +0800, David Teigland wrote:
> > On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
> > 
> > > +	gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,);
> > 
> > > what is gfs2_assert() about anyway? please just use BUG_ON directly
> > > everywhere
> > 
> > When a machine has many gfs file systems mounted at once it can be useful
> > to know which one failed.  Does the following look ok?
> > 
> > #define gfs2_assert(sdp, assertion)                                   \
> > do {                                                                  \
> > 	if (unlikely(!(assertion))) {                                 \
> > 		printk(KERN_ERR                                       \
> > 			"GFS2: fsid=%s: fatal: assertion \"%s\" failed\n" \
> > 			"GFS2: fsid=%s:   function = %s\n"            \
> > 			"GFS2: fsid=%s:   file = %s, line = %u\n"     \
> > 			"GFS2: fsid=%s:   time = %lu\n",              \
> > 			sdp->sd_fsname, # assertion,                  \
> > 			sdp->sd_fsname,  __FUNCTION__,                \
> > 			sdp->sd_fsname, __FILE__, __LINE__,           \
> > 			sdp->sd_fsname, get_seconds());               \
> > 		BUG();                                                \
> 
> You will already get the __FUNCTION__ (and hence the __FILE__ info)
> directly from the BUG() dump, as well as the time from the syslog
> message (turn on the printk timestamps if you want a more fine grain
> timestamp), so the majority of this macro is redundant with the BUG()
> macro...

Joern already suggested moving this out of line and into a function (as it
was before) to avoid repeating string constants.  In that case the
function, file and line from BUG aren't useful.  We now have this, does it
look ok?

void gfs2_assert_i(struct gfs2_sbd *sdp, char *assertion, const char *function,
		   char *file, unsigned int line)
{
	panic("GFS2: fsid=%s: fatal: assertion \"%s\" failed\n"
	      "GFS2: fsid=%s:   function = %s, file = %s, line = %u\n",
	      sdp->sd_fsname, assertion,
	      sdp->sd_fsname, function, file, line);
}

#define gfs2_assert(sdp, assertion) \
do { \
	if (unlikely(!(assertion))) { \
		gfs2_assert_i((sdp), #assertion, \
			      __FUNCTION__, __FILE__, __LINE__); \
	} \
} while (0)
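
For reference, a call site along the lines of the fragment Arjan commented
on then stays a one-liner (illustrative; the trailing varargs of the old
form are dropped here):

	gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0);

and the format strings live in a single out-of-line copy in gfs2_assert_i().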

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

