Re: [Linux-cluster] gfs2 v. zfs?

2011-01-27 Thread Wendy Cheng
On Wed, Jan 26, 2011 at 8:59 AM, Steven Whitehouse  wrote:

> Nevertheless, I agree that it would be nice to be able to move the
> inodes around freely. I'm not sure that the cost of the required extra
> layer of indirection would be worth it though, in terms of the benefits
> gained.
>

Say the cost is a possible performance hit of y%. Now take the difference
between GFS2's performance numbers and those of the other filesystems that
users love to compare it against ... call that x%. Regardless of whether
GFS2 is better or worse, what really matters is: "does (x+y)% or (x-y)%
make any difference?" and "what will this y% buy?". If I had to guess, I
would say x is close to 20 and y is close to 3. So does "23 vs 20" or
"17 vs 20" make a difference?

On the other hand, what can this "y" buy? ... an infrastructure to
shrink the filesystem (for users not on a thin-provisioned SAN), a better
backup strategy (snapshots have their catches), a straightforward
defragmentation tool, *AND* the possibility of grouping the scattered
inodes within a directory into a sensible (disk) layout, such that
each time a directory read is issued (e.g. the "ls" command family), it
can give enough hints to the underlying SAN to trigger its own
readahead engine. ... Say you want to read the inodes in a huge directory,
but some of those inodes are out on other nodes with exclusive glocks
held. You can still read in the rest of the inodes, and the read
pattern may be good enough to trigger the readahead code within the
SAN. By the time those exclusive glocks start to sync their blocks,
the blocks are already in the SAN's cache. Many rounds of disk reads
(from the SAN's point of view) can be avoided. At the same time, if the
to-be-written inodes are close to each other in a reasonable layout, it
helps the SAN's writes as well.
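
To make the "readahead hint" idea a little more concrete: the grouping
described above would have to live inside the filesystem, but a rough
user-space analogy is asking the kernel ahead of time for file data via
posix_fadvise(POSIX_FADV_WILLNEED). A minimal sketch (the file arguments
are whatever the caller passes in; nothing here is GFS2-specific):

/*
 * Sketch only: user-space readahead hinting with posix_fadvise().
 * A real fix would group the inodes on disk inside GFS2; this just
 * illustrates what "give the lower layers a readahead hint" looks
 * like from the top of the stack.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	for (int i = 1; i < argc; i++) {
		int fd = open(argv[i], O_RDONLY);
		if (fd < 0) {
			perror(argv[i]);
			continue;
		}
		/* Hint that the whole file (offset 0, len 0 = to EOF) will be
		 * needed soon, so the page cache / storage can start fetching. */
		int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
		if (rc != 0)
			fprintf(stderr, "%s: posix_fadvise: %s\n", argv[i], strerror(rc));
		close(fd);
	}
	return 0;
}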

Something to think about ...

-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] gfs2 v. zfs?

2011-01-26 Thread Wendy Cheng
On Wed, Jan 26, 2011 at 8:36 AM, Gordan Bobic  wrote:
> Wendy Cheng wrote:
>
>> GFS2 fragments very soon and very badly ! Its blocks are all over the
>> device due to the nature of how the resource group works. That slows down
>> *every* thing, particularly for backup applications. A production deployment
>> will encounter this issue very soon and they'll find the issue more than
>> annoying.
>
> While I don't disagree that it's a problem, most people will use GFS and
> similar FS-es on a SAN. A typical SAN will allocate sparse files for backing
> the block device. That means it'll be badly fragmented very quickly on the
> back end regardless of what the FS does.

I don't know how a "typical SAN" is defined ... but I could guess
which SAN box does this. Did you check their admin guide? You may be
surprised by the tuning knobs they offer.

-- Wendy

>
> Gordan
>
> --
> Linux-cluster mailing list
> Linux-cluster@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] gfs2 v. zfs?

2011-01-26 Thread Wendy Cheng

On 01/26/2011 02:19 AM, Steven Whitehouse wrote:


I don't know of any reason why the inode number should be related to
back up. The reason why it was suggested that the inode number should be
independent of the physical block number was in order to allow
filesystem shrink without upsetting (for example) NFS which assumed that
its filehandles are valid "forever".


I'm not able to ping-pong too many emails on external mailing lists, at
least during weekdays. However, a quick note on this ...


GFS2 fragments very soon and very badly! Its blocks end up all over the
device due to the nature of how the resource groups work. That slows
down *everything*, particularly backup applications. A production
deployment will encounter this issue very soon, and they'll find it more
than annoying.


Now, educate me (though I'll probably not read it until the weekend) ... how
will you defragment the FS with the inode number tied to the physical
block number? You can *not* move these inodes.


-- Wendy

The problem with doing that is that it adds an extra layer of
indirection (and one which had not been written in gfs2 at the point in
time we took that decision). That extra layer of indirection means more
overhead on every lookup of the inode. It would also be a contention
point in a distributed filesystem, since it would be global state.

The dump command directly accesses the filesystem via the block device
which is a problem for GFS2, since there is no guarantee (and in general
it won't be) that the information read via this method will match the
actual content of the filesystem. Unlike ext2/3 etc., GFS2 caches its
metadata in per-inode address spaces which are kept coherent using
glocks. In ext2/3 etc., the metadata is cached in the block device
address space which is why dump can work with them.

With GFS2 the only way to ensure that the block device was consistent
would be to umount the filesystem on all nodes. In that case it is no
problem to simply copy the block device using dd, for example. So dump
is not required.

Ideally we want backup to be online (i.e. with the filesystem mounted),
and we also do not want it to disrupt the workload which the cluster was
designed for, so far as possible. So the best solution is to back up
files from the node which is most likely to be caching them. That also
means that the backup can proceed in parallel across the nodes, reducing
the time taken.

It does mean that a bit more thought has to go into it, since it may not
be immediately obvious what the working set of each node actually is.
Usually though, it is possible to make a reasonable approximation of it.

Steve.


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] gfs2 v. zfs?

2011-01-25 Thread Wendy Cheng
On Tue, Jan 25, 2011 at 11:34 AM, yvette hirth  wrote:
> Rafa Grimán wrote:
>
>> Yes that is true. It's a bit blurry because some file systems have
>> features others have so "classifying" them is quite difficult.
>
> i'm amazed at the conversation that has taken place by me simply asking a
> question.
>
> *Thank You* all for all of this info!

We purposely diverted your question to backup, since it is easier to
have a productive discussion there (compared to directory listing) :) ... In
general, any "walk" operation on GFS2 can become a pain for various
reasons. Directory listing is certainly one of them. It is an age-old
problem. Beyond the issues inherited from the horrible stat() system
call, it also has to do with the way GFS(1/2) likes to "distribute" its
blocks all over the device upon write contention. I don't see how GFS2
can alleviate this pain without some sort of block reallocation.
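
To see where the time goes, here is a minimal sketch of what an "ls -l"
style walk boils down to - one readdir() pass plus one stat() per entry.
On GFS2 each of those stat() calls may have to pull a glock away from
another node, which is exactly the pain described above (the directory
path is just whatever the caller passes in):

/*
 * Sketch: an "ls -l"-style directory walk.  The cost is one stat() per
 * entry; on GFS2 each stat() can require demoting an exclusive glock
 * held by another node.
 */
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <directory>\n", argv[0]);
		return 1;
	}

	DIR *dir = opendir(argv[1]);
	if (!dir) {
		perror("opendir");
		return 1;
	}

	struct dirent *de;
	unsigned long count = 0;
	while ((de = readdir(dir)) != NULL) {
		char path[4096];
		struct stat st;

		snprintf(path, sizeof(path), "%s/%s", argv[1], de->d_name);
		if (stat(path, &st) == 0)   /* one (potentially cross-node) lock per entry */
			count++;
	}
	closedir(dir);
	printf("stat()ed %lu entries\n", count);
	return 0;
}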

I'll let other capable people have another round of fun discussions
... Maybe some creative ideas will pop out as a result ...

-- Wendy

>
> we've traced the response time slowdown to "number of subdirectories that
> need to be listed when their parent directory is enumerated".
>
> btw, my usage of "enumeration" means, "list contents".  sorry for any
> confusion.
>
> we've noticed that if we do:
>
> ls -lhad /foo
> ls -lhad /foo/stuff
> ls -lhad /foo/stuff/moreStuff
>
> response time is good, because , but
>
> ls -lhad /foo/stuff/moreStuff/*
>
> is where response time increases by a magnitude, because moreStuff has ~260
> directories.  enumerating moreStuff and other "directories with many
> subdirectories" appear to be the culprits.
>
> for now, we'll be moving directories around, trying to reduce the number of
> nested levels, and number of elements in each level.
>
> in human interaction there is a rule:  as the number of people interacting
> increases linearly, the number of interactions between the people increases
> exponentially.  is it true that as the number of nodes, "n", increases
> linearly, the amount of metadata being passed around / inspected during disk
> access increases geometrically?  does this "rule" apply?  or does metadata
> processing increase linearly as well, because the querying is all done by
> one node?
>
> thanks again - what a group!
> yvette
>
> --
> Linux-cluster mailing list
> Linux-cluster@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] gfs2 v. zfs?

2011-01-25 Thread Wendy Cheng
On Tue, Jan 25, 2011 at 2:01 AM, Steven Whitehouse  wrote:

>> On Mon, Jan 24, 2011 at 6:55 PM, Jankowski, Chris
>>  wrote:
>> > A few comments, which might contrast uses of GFS2 and XFS in enterprise 
>> > class production environments:
>> >
>> > 3.
>> > GFS2 provides only tar(1) as a backup mechanism.
>> > Unfortunately, tar(1) does not cope efficiently with sparse files,
>> > which many applications create.
>> > As an exercise create a 10 TB sparse file with just one byte of non-null 
>> > data at the end.
>> > Then try to back it up to disk using tar(1).
>> > The tar image will be correctly created, but it will take many, many hours.
>> > Dump(8) would do the job in a blink, but is not available for GFS2 
>> > filesystem.
>> > However, XFS does have XFS specific dump(8) command and will backup sparse 
>> > files
>> > efficiently.
>> >
> You don't need dump in order to do this (since dump reads directly from
> the block device itself, that would be problematic on GFS/GFS2 anyway).
> All that is required is a backup too which support the FIEMAP ioctl. I
> don't know if that has made it into tar yet, I suspect probably not.
>

If cluster snapshots are in the hands of another development team (which may
not see them as a high priority), a GFS2-specific dump command could be
a good alternative. The bottom line here is that GFS2 lacks a sensible
(read: "easy to use") backup strategy, and that can significantly
jeopardize its deployment.

Of course, this depends on ... someone has to be less stubborn and
willing to move GFS2's inode number away from its physical disk block
number. Cough!
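
For reference, the 10 TB sparse-file exercise quoted above takes only a few
lines of C - a sketch with an arbitrary path and size - and it makes the
problem plain: tar(1) has to read through every hole byte by byte, while an
extent-aware dump could skip them:

/*
 * Sketch: create a ~10 TB sparse file with a single non-null byte at the
 * end.  The path and size are arbitrary and for illustration only.
 */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const off_t size = 10LL * 1024 * 1024 * 1024 * 1024;	/* ~10 TB */
	int fd = open("/mnt/gfs2/sparse-test", O_CREAT | O_WRONLY | O_TRUNC, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Seek past ~10 TB of holes, then write the one real byte. */
	if (lseek(fd, size - 1, SEEK_SET) == (off_t)-1 || write(fd, "x", 1) != 1) {
		perror("lseek/write");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}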

-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] gfs2 v. zfs?

2011-01-24 Thread Wendy Cheng
Comments in-line ...

On Mon, Jan 24, 2011 at 6:55 PM, Jankowski, Chris
 wrote:
> A few comments, which might contrast uses of GFS2 and XFS in enterprise class 
> production environments:
>
> 1.
> SAN snapshot is not a panacea, as it is only crash consistent and only within 
> a single LUN.
> If you have your data or database spread over multiple LUNs each with its own 
> filesystem,
> then you are on your own.

It depends on the SAN box. Some products have aggregate-level
snapshots that can contain multiple LUNs.

However, the argument here is correct; that is, a SAN snapshot is not a
panacea. Besides the fact that different SAN vendors may have different
setups, a snapshot restore could require specific knowledge of the
filesystem involved (e.g. how the journal is replayed). So integration
and test efforts are required for the restore to work well.

>
> 2.
> Therefore, we still need at least OS level (filesystem level) consistent 
> backup
> if the application itself does not provide a hot backup mechanism, which very 
> few do.
> The consistent filesystem level backup requires freeze and thaw commands.
> XFS offers them, GFS2 does not.

I seem to recall GFS2 having freeze/thaw patches in the past? But for
backup to work well, more than freeze/thaw is required.

>
> 3.
> GFS2 provides only tar(1) as a backup mechanism.
> Unfortunately, tar(1) does not cope efficiently with sparse files,
> which many applications create.
> As an exercise create a 10 TB sparse file with just one byte of non-null data 
> at the end.
> Then try to back it up to disk using tar(1).
> The tar image will be correctly created, but it will take many, many hours.
> Dump(8) would do the job in a blink, but is not available for GFS2 filesystem.
> However, XFS does have XFS specific dump(8) command and will backup sparse 
> files
> efficiently.
>
> 4.
> GFS2 is very convenient to use, as by its nature is clusterised.
> However, there is huge performance cost to pay for all this convenience.
> This cost stems from serialization imposed by distributed lock manager.
>
> 5.
> For these reason, for the HA applications running on one node at a time,
> I found that XFS on top of LVM gives me the best mix of performance and 
> functionality:
> - high performance
> - efficient backup of sparse files
> - backup consistency through freeze/thaw
> - zero downtime backup through use of LVM snapshots
> - short failover times due to efficient XFS transaction logs
>
> So, for this type of HA applications (failover HA) and environment,
> it makes perfect sense to use XFS in a cluster instead of GFS2.
>
> Having said that, GFS2 can, in principle, be engineered to be much better
> for failover HA applications.
>
> It would require development of:
> - GFS2 specific dump(8)
> - GFS2 specific freeze and thaw commands
> - CLVM wide snapshots
> - more efficient DLM

You did a great summary here. Looking at the list, I would imagine
CLVM snapshotting is probably the easiest, technically and politically.
It's all up to the GFS2 engineers to take note.

>
> It certainly is possible to do. Digital/Compaq/HP TruCluster Cluster File 
> System (CFS) built on top of AdvFS had all of these features and much, much 
> more by circa year 2000.
>

Yep, I met a TruCluster developer 3 years ago. Based on his
description, I was impressed. Not sure HP is still marketing it
though.

Again, a great summary !

-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] gfs2 v. zfs?

2011-01-24 Thread Wendy Cheng
On Mon, Jan 24, 2011 at 5:06 PM, Joseph L. Casale
 wrote:
>> A.  Because it breaks the flow and reads backwards.
>
>> Q.  Why is top posting considered harmful?
>
> Hope that was informative:)
> jlc
>

I don't have any intention of starting a flame and/or religious war.
However, I'm hoping people can relax a little bit about this "rule",
if it is a rule at all ... Check out
http://en.wikipedia.org/wiki/Posting_style#Top-posting to see what it
says. You may find it interesting.

At the same time, I don't see comparing performance numbers between a
parallel filesystem and a cluster filesystem as a bad practice. After
all, I see users and IT shops comparing NFS and GFS numbers from time
to time (as a way to decide which one to use). The bottom line is "I
have a storage box and I want to access it from different machines;
which one is the best solution for me, and let's get the capacity
estimated".

-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] gfs2 v. zfs?

2011-01-24 Thread Wendy Cheng
On Mon, Jan 24, 2011 at 2:26 PM, Rafa Grimán  wrote:
> On Monday 24 January 2011 22:58 Jeff Sturm wrote
>> > -Original Message-
>> > From: linux-cluster-boun...@redhat.com
>>
>> [mailto:linux-cluster-boun...@redhat.com]
>>
>> > On Behalf Of Wendy Cheng
>> > Subject: Re: [Linux-cluster] gfs2 v. zfs?
>> >
>> > I would love to get an education here. From usage model point of view,
>> > what is the
>> > difference between a "parallel file system" and a "cluster file
>> > system" ? i.e., when to
>> > use a parallel file system and when to use a cluster file system ?
>>
>> Getting off-topic but I'd also like to hear who uses a parallel
>> distributed FS, and what problem space they work well in.
>
>
> HPC where you need  very high bandwidth/throughput to disk (usually scratch
> filesystem).

You hit the right points (and thanks for the previous explanation)!
However, from a usage point of view, I think the line is blurry these
days (e.g. IBM's GPFS is said to be a cluster filesystem but has been
used well in HPC environments).

BTW, I never understood why top-posting is evil? Isn't it making some
emails hard to read?

-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] gfs2 v. zfs?

2011-01-24 Thread Wendy Cheng
I guess GFS2 is out as an "enterprise" file system? Without a workable
backup solution, it'll be seriously limited. I have been puzzled as to
why CLVM has been slow to add this feature.

-- Wendy

On Mon, Jan 24, 2011 at 1:07 PM, Nicolas Ross
 wrote:
>> I would guess this "enumeration" means "walk"; e.g. doing backup. One
>> of the most-liked features that ZFS offers is snapshots. So I would
>> suggest telling GFS2 users/customers to use LVM snapshot AND making
>> sure GFS2 works well with Linux LVMm snapshot.
>
> AFAIK, clustered volume group doesn't support LVM snapshots :
>
> 
> LVM snapshots are not supported across the nodes in a cluster. You cannot
> create a snapshot volume in a clustered volume group.
> 
>
> from
>
> http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html-single/Logical_Volume_Manager_Administration/index.html#snapshot_volumes
>
> --
> Linux-cluster mailing list
> Linux-cluster@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] gfs2 v. zfs?

2011-01-24 Thread Wendy Cheng
I would love to get an education here. From a usage-model point of view,
what is the difference between a "parallel file system" and a "cluster
file system"? I.e., when should one use a parallel file system and when
a cluster file system?

.. Wendy

On Mon, Jan 24, 2011 at 1:10 PM, Rafa Grimán  wrote:
> Hi :)
>
> On Monday 24 January 2011 21:25 Wendy Cheng wrote
>> Sometime ago, the following was advertised:
>>
>> "ZFS is not a native cluster, distributed, or parallel file system and
>> cannot provide concurrent access from multiple hosts as ZFS is a local
>> file system. Sun's Lustre distributed filesystem will adapt ZFS as
>> back-end storage for both data and metadata in version 3.0, which is
>> scheduled to be released in 2010."
>>
>> You can google "Lustre" to see whether their plan (built Lustre on top
>> of ZFS) is panned out.
>
>
> But Lustre isn't a clustered filesystem, it's a parallel filesystem. Similar 
> to
> pNFS, PanFS, ... Comparing GFS to Lustre wouldn't be quite right.
>
>   Rafa
>
> --
> "We cannot treat computers as Humans. Computers need love."
>
> Happily using KDE 4.5.4 :)
>
> --
> Linux-cluster mailing list
> Linux-cluster@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] gfs2 v. zfs?

2011-01-24 Thread Wendy Cheng
On Mon, Jan 24, 2011 at 11:48 AM, Steven Whitehouse  wrote:
>> our five-node cluster is working fine, the clustering software is great,
>> but when accessing gfs2-based files, enumeration can be very slow...
>>
> What do you mean be "enumeration can be very slow" ? It might be
> possible to slightly rearrange the I/O pattern in order to get greater
> performance. Can you explain what you are trying to achieve?
>

I would guess this "enumeration" means "walk"; e.g. doing backups. One
of the most-liked features that ZFS offers is snapshots. So I would
suggest telling GFS2 users/customers to use LVM snapshots AND making
sure GFS2 works well with Linux LVM snapshots.

-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] gfs2 v. zfs?

2011-01-24 Thread Wendy Cheng
Some time ago, the following was advertised:

"ZFS is not a native cluster, distributed, or parallel file system and
cannot provide concurrent access from multiple hosts as ZFS is a local
file system. Sun's Lustre distributed filesystem will adapt ZFS as
back-end storage for both data and metadata in version 3.0, which is
scheduled to be released in 2010."

You can google "Lustre" to see whether their plan (building Lustre on
top of ZFS) has panned out.

-- Wendy

On Mon, Jan 24, 2011 at 12:01 PM, Gordan Bobic  wrote:
> On 01/24/2011 07:51 PM, yvette hirth wrote:
>>
>> Gordan Bobic wrote:
>>>
>>> On 01/24/2011 07:16 PM, yvette hirth wrote:

 hi all,

 does anyone have any performance comparisons of gfs2 v. zfs?

 our five-node cluster is working fine, the clustering software is great,
 but when accessing gfs2-based files, enumeration can be very slow...
>>>
>>> The comparison is a bit like comparing apples and oranges. GFS is a
>>> cluster file system, ZFS is a single-machine file system.
>>>
>>> Gordan
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster@redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>> my apologies. i heard zfs was cluster-aware; thanks for the info.
>>
>> does anyone have any performance comparisons of gfs2 v. any other
>> cluster-aware filesystems?
>
> You may want to google about GFS / GFS2 / OCFS2. That's pretty much all that
> is freely available as far as cluster file systems that live directly on top
> of block devices go.
>
> The original OCFS (Oracle) and VMFS (VMware) work in a similar way but they
> are designed for few large files rather than many small files so they aren't
> suitable for generic use.
>
> Symantec Veritas Cluster also comes with a similar cluster aware file
> system, but it's heavily licenced and last time I checked it didn't provide
> any compelling reasons to use it instead of GFS, GFS2 or OCFS2.
>
> There are a few other things available that may be more suitable for what
> you want to do, but it's impossible to say without knowing more about your
> use-case. Depending on ecactly what you require you may find that SeznamFS,
> GlusterFS, Lustre or even HDFS (Hadoop) are more suitable for what you want
> to do.
>
> Gordan
>
> --
> Linux-cluster mailing list
> Linux-cluster@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Re: determining fsid for fs resource

2009-07-22 Thread Wendy Cheng

Terry wrote:

On Fri, Jul 17, 2009 at 11:05 AM, Terry wrote:
  

Hello,

When I create a fs resource using redhat's luci, it is able to find
the fsid for a fs and life is good.  However, I am not crazy about
luci and would prefer to manually create the resources from the
command line but how do I find the fsid for a filesystem?  Here's an
example of a fs resource created using luci:



Thanks!




Anyone have an idea for this?
  


IIRC, you basically have to make up the key (fsid) yourself. Just
pick any number (integer) that is less than 2**32 - but make sure it is
unique per filesystem per export while the NFS service is up and running.
That is, if you plan to export the same filesystem via two export
entries (say, export two different directories from the very same
filesystem), you need two fsids. If you have x exports (regardless of
whether they are from the same filesystem or different filesystems) at
the same time, you need x fsids. This is mostly to do with the NFS export
(internally represented by an unsigned integer) - don't confuse it with
the filesystem id (which is obtained via the stat system call family).



-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-08 Thread Wendy Cheng
 Kadlecsik Jozsef wrote:
>
>
>
> EXPORT_SYMBOL(inode_lock);
>
> line to fs/inode.c, recompiled the kernel and the modules.
>
> Starting mailman in the test environment did not produce the almost
> instant freeze. I started/stopped mailman several times and the system
> worked just fine. So I believe the patch above and the plus line in
> fs/inode.c fix the reported problem. I dunno whether modifying
> fs/inode.c is acceptable or not...
>
>
>
>  Glad to learn you got it working! Send your patch to cluster-devel to see
> whether the GFS team could take it. I personally think any other solution,
> though doable, would not be worth the effort. And people using upstream
> kernels need to recompile anyway, so that EXPORT_SYMBOL thing should not be
> too bad.
>
> -- Wendy
>
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-05 Thread Wendy Cheng




Then don't remove it yet. The ramifications need more thought ...



That generic_drop_inode() can *not* be removed.

Not sure whether my head is clear enough this time 

Based on code reading ...
1. iput() gets inode_lock (a spin lock)
2. iput() calls iput_final()
3. iput_final() calls gfs_drop_inode() that calls
   generic_drop_inode()
4. generic_drop_inode() unlocks inode_lock.

In theory, this logic violates the intended usage of a spin lock: it is
expected to be held only for a short period of time, but gfs_drop_inode()
could take a while to finish. It has a blocking page write that needs to
make sure the data gets synced to storage before it can return. Making
matters worse, inode_lock is a global lock, so it can block non-GFS
threads as well. One would think a quick fix is to drop inode_lock at the
beginning of gfs_drop_inode() and re-acquire it after gfs syncs the page.
Unfortunately, inode_lock is not an exported symbol, and GFS is an
out-of-tree filesystem that has to be compiled as a kernel module. So
this trick won't work for GFS.
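
A small user-space analogy (plain pthread spinlocks, nothing GFS-specific)
shows why that matters: while one thread blocks with the spinlock held, any
other thread that wants the lock just burns a CPU, which is roughly what
happens to the other inode_lock users while gfs_drop_inode() waits for its
sync. A sketch (build with -lpthread; the sleep() stands in for the blocking
page write):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_spinlock_t lock;

static void *holder(void *arg)
{
	pthread_spin_lock(&lock);
	sleep(5);		/* blocking work while a spinlock is held - the bug pattern */
	pthread_spin_unlock(&lock);
	return NULL;
}

static void *waiter(void *arg)
{
	pthread_spin_lock(&lock);	/* spins, burning a CPU, for the whole 5 seconds */
	pthread_spin_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
	pthread_create(&a, NULL, holder, NULL);
	sleep(1);			/* let the holder grab the lock first */
	pthread_create(&b, NULL, waiter, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	puts("done");
	return 0;
}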


With a flight to catch tomorrow and a flu-infected body, I've lost the will
to think through what the correct fix should and/or will be.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-03 Thread Wendy Cheng

Kadlecsik Jozsef wrote:

On Fri, 3 Apr 2009, Wendy Cheng wrote:

  

Kadlecsik Jozsef wrote:


On Thu, 2 Apr 2009, Wendy Cheng wrote:
  

Kadlecsik Jozsef wrote:



- commit 82d176ba485f2ef049fd303b9e41868667cebbdb
  gfs_drop_inode as .drop_inode replacing .put_inode.
  .put_inode was called without holding a lock, but .drop_inode
  is called under inode_lock held. Might it be a problem

  

Based on code reading ...
1. iput() gets inode_lock (a spin lock)
2. iput() calls iput_final()
3. iput_final() calls filesystem drop_inode(), followed by
generic_drop_inode()
4. generic_drop_inode() unlock inode_lock after doing all sorts of fun
things
with the inode

So look to me that generic_drop_inode() statement within gfs_drop_inode()
should be removed. Otherwise you would get double unlock and double list
free.


I think those function calls are right: iput_final calls either the
filesystem drop_inode function (in this case gfs_drop_inode) or
generic_drop_inode. There's no double call of generic_drop_inode. However
gfs_sync_page_i (and in turn filemap_fdatawrite and filemap_fdatawait) is
now called under inode_lock held and that was not so in previous versions.
But I'm just speculating.
  

It *is* called twice unless my eyes deceive me

static inline void iput_final(struct inode *inode)
{
const struct super_operations *op = inode->i_sb->s_op;
void (*drop)(struct inode *) = generic_drop_inode;

if (op && op->drop_inode)
drop = op->drop_inode; /* gfs call generic_drop_inode() */
drop(inode); /* second call into generic_drop_inode() again. */
}



No, the line 'drop = op->drop_inode;' is just an assignment (there's no 
  


ok, I see ... my eyes do deceive me :) - actually it is my brain that 
was not working ...


Then don't remove it yet. The ramifications need more thought ...

-- Wendy


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-03 Thread Wendy Cheng

Kadlecsik Jozsef wrote:

On Thu, 2 Apr 2009, Wendy Cheng wrote:

  

Kadlecsik Jozsef wrote:



- commit 82d176ba485f2ef049fd303b9e41868667cebbdb
  gfs_drop_inode as .drop_inode replacing .put_inode.
  .put_inode was called without holding a lock, but .drop_inode
  is called under inode_lock held. Might it be a problem
  
  

Based on code reading ...
1. iput() gets inode_lock (a spin lock)
2. iput() calls iput_final()
3. iput_final() calls filesystem drop_inode(), followed by
generic_drop_inode()
4. generic_drop_inode() unlock inode_lock after doing all sorts of fun things
with the inode

So look to me that generic_drop_inode() statement within 
gfs_drop_inode() should be removed. Otherwise you would get double 
unlock and double list free.



I think those function calls are right: iput_final calls either the 
filesystem drop_inode function (in this case gfs_drop_inode) or 
generic_drop_inode. There's no double call of generic_drop_inode. However 
gfs_sync_page_i (and in turn filemap_fdatawrite and filemap_fdatawait) is 
now called under inode_lock held and that was not so in previous versions.

But I'm just speculating.
  


It *is* called twice unless my eyes deceive me

static inline void iput_final(struct inode *inode)
{
const struct super_operations *op = inode->i_sb->s_op;
void (*drop)(struct inode *) = generic_drop_inode;

if (op && op->drop_inode)
drop = op->drop_inode; /* gfs call generic_drop_inode() */
drop(inode); /* second call into generic_drop_inode() again. */
}

 
  
In short, *remove* line #73 from gfs-kernel/src/gfs/ops_super.c in your 
source and let us know how it goes.



I won't get a chance to start a test before Monday, sorry. 

  


I'll be traveling next week as well. However, a few cautious words here:

Even this "fix" eventually solves your hang, running GFS on newer 
kernels with production system simply is *not* a good idea.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-02 Thread Wendy Cheng



Kadlecsik Jozsef wrote:


- commit 82d176ba485f2ef049fd303b9e41868667cebbdb
  gfs_drop_inode as .drop_inode replacing .put_inode.
  .put_inode was called without holding a lock, but .drop_inode
  is called under inode_lock held. Might it be a problem
  

Based on code reading ...
1. iput() gets inode_lock (a spin lock)
2. iput() calls iput_final()
3. iput_final() calls filesystem drop_inode(), followed by 
generic_drop_inode()
4. generic_drop_inode() unlocks inode_lock after doing all sorts of fun
things with the inode


So it looks to me that the generic_drop_inode() statement within
gfs_drop_inode() should be removed. Otherwise you would get a double
unlock and a double list free.


In short, *remove* line #73 from gfs-kernel/src/gfs/ops_super.c in your 
source and let us know how it goes.


-- Wendy



--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-02 Thread Wendy Cheng

Kadlecsik Jozsef wrote:

On Thu, 2 Apr 2009, Wendy Cheng wrote:

  

Kadlecsik Jozsef wrote:


- commit 82d176ba485f2ef049fd303b9e41868667cebbdb
  gfs_drop_inode as .drop_inode replacing .put_inode.
  .put_inode was called without holding a lock, but .drop_inode
  is called under inode_lock held. Might it be a problem?
  
  

I was planning to take a look over the weekend .. but this one looks very
promising. Give it a try and let us know !



But - how? .put_inode was eliminated, cannot be used anymore in recent 
kernels. And I have no idea what should be changed in gfs_drop_inode.


  
I see :) ... let me move your tarball over. Do you know about the cluster
IRC channel (check the cluster wiki for instructions if you don't)? Go
there - maybe some of the IRC folks will be able to work on this with you.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-02 Thread Wendy Cheng

Kadlecsik Jozsef wrote:

- commit 82d176ba485f2ef049fd303b9e41868667cebbdb
  gfs_drop_inode as .drop_inode replacing .put_inode.
  .put_inode was called without holding a lock, but .drop_inode
  is called under inode_lock held. Might it be a problem?

  
I was planning to take a look over the weekend .. but this one looks 
very promising. Give it a try and let us know !


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-02 Thread Wendy Cheng

Kadlecsik Jozsef wrote:


If you have any idea what to do next, please write it.

  
Do you have your kernel source somewhere (in tar ball format) so people 
can look into it ?


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-30 Thread Wendy Cheng

Kadlecsik Jozsef wrote:



You mean the part of the patch

@@ -1503,6 +1503,15 @@ gfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct
 	error = gfs_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_ANY, &gh);
 	if (!error) {
 		generic_fillattr(inode, stat);
+		if (S_ISREG(inode->i_mode) && dentry->d_parent
+		    && dentry->d_parent->d_inode) {
+			p_inode = igrab(dentry->d_parent->d_inode);
+			if (p_inode) {
+				pi = get_v2ip(p_inode);
+				pi->i_dir_stats++;
+				iput(p_inode);
+			}
+		}
 		gfs_glock_dq_uninit(&gh);
 	}
 
might cause a deadlock: if the parent directory inode is already locked, 
then this part will wait infinitely to get the lock, isn't it?


If I open a directory and then stat a file in it, is that enough to 
trigger the deadlock?



No, that's too simple and should have came out much earlier, the patch is 
from Nov 6 2008. Something like creating files in a directory by one 
process and statting at the same time by another one, in a loop?


  


It would be a shame if GFS(1/2) ends up losing you as a user - not many 
users can delve into the bits and bytes like you.


My suggestion is that you work directly with the GFS engineers, particularly
the one who submitted this patch. He is bright and hardworking - one of
the best among the young engineers within Red Hat. This patch is a good
"start" for getting to the root cause (as gfs readdir is hung in *every*
console log you generated). Maybe a bugzilla would be a good start?


I really can't keep spending time on this. As Monday arrives, I'm always
behind on a few of my tasks ...


-- Wendy





--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-29 Thread Wendy Cheng

Kadlecsik Jozsef wrote:
There are three different netconsole log recordings at 
http://www.kfki.hu/~kadlec/gfs/
One of the new console logs has a good catch (netconsole0.txt): you *do*
have a deadlock, as the CPUs are spinning waiting for a spin lock. This
seems to have more to do with the changes made via bugzilla 466645. I
think any RHEL version that has the subject patch will have the same issue
as well.


-- Wendy


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-28 Thread Wendy Cheng

Wendy Cheng wrote:
. [snip] ... There are many foot-prints of spin_lock - that's 
worrisome. Hit a couple of "sysrq-w"  next time when you have hangs, 
other than sysrq-t. This should give traces of the threads that are 
actively on CPUs at that time. Also check your kernel change log (to 
see whether GFS has any new patch that touches spin lock that doesn't 
in previous release).


I re-read your console log a few minutes ago, followed by a quick browse
of the cluster git tree. A few of the python processes (e.g. pids 4104,
4105, etc.) are blocked by locks within gfs_readdir(). This somehow relates
to a performance patch committed on 11/6/2008. gfs_getattr() has a
piece of new code that touches a vfs inode operation while the glock is
held. That's an area that needs examination. I don't have the linux kernel
source handy to see whether that iput() and igrab() can lead to deadlock,
though.


If you have the patch in your kernel and if you can, temporarily remove 
it (and rebuild the kernel) to see how it goes:

commit a71b12b692cac3a4786241927227013bf2f3bf99

Again, take my advice with a grain of salt :) ...I'll stop here. Good luck !

-- Wendy



--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-28 Thread Wendy Cheng

Kadlecsik Jozsef wrote:

I don't see a strong evidence of deadlock (but it could) from the thread
backtraces However, assuming the cluster worked before, you could have
overloaded the e1000 driver in this case. There are suspicious page faults
but memory is very "ok". So one possibility is that GFS had generated too
many sync requests that flooded the e1000. As the result, the cluster heart
beat missed its interval.



It's a possibility. But it assumes also that the node freezes >because< 
it was fenced off. So far nothing indicates that.
  


Re-read your console log. There are many footprints of spin_lock -
that's worrisome. Hit a couple of "sysrq-w"s next time you have a hang,
in addition to sysrq-t. This should give traces of the threads that are
actively on the CPUs at that time. Also check your kernel changelog (to
see whether GFS has any new patch that touches a spin lock that wasn't in
the previous release).


BTW, I do have opinions on other parts of your postings but don't have 
time to express them now. Maybe I'll say something when I finish my 
current chores :) ... Need to rush out now. Good luck on your debugging !


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-27 Thread Wendy Cheng
>
> I should get some sleep - but can't it be that I hit the potential
> deadlock mentioned here:



Please take my observation with a grain of salt (as I don't have the Linux
source code in front of me to check the exact locking sequence, nor can I
afford to spend time on this) ...

I don't see strong evidence of a deadlock (though there could be one) in the
thread backtraces. However, assuming the cluster worked before, you could have
overloaded the e1000 driver in this case. There are suspicious page faults,
but memory looks very "ok". So one possibility is that GFS generated too
many sync requests and flooded the e1000. As a result, the cluster heartbeat
missed its interval. Do you have the same ethernet card for both AOE
and cluster traffic? If yes, separate them to see how it goes. And of
course, if you don't have Ben's mmap patch (as you described in your post),
it is probably a good idea to get it into your gfs-kmod.

But honestly, I think running GFS1 on newer kernels is a bad idea.

-- Wendy


>
> commit  4787e11dc7831f42228b89ba7726fd6f6901a1e3
>
> gfs-kmod: workaround for potential deadlock. Prefault user pages
>
> The bug uncovered in 461770 does not seem fixable without a massive
> change to how gfs works.  There is a lock ordering mismatch between
> the process address space lock and the glocks. The only good way to
> avoid this in all cases is to not hold the glock for so long, which
> is what gfs2 does. This is impossible without completely changing
> how gfs does locking.  Fortunately, this is only a problem when you
> have multiple processes sharing an address space, and are doing IO
> to a gfs file with a userspace buffer that's part of an mmapped gfs
> file. In this case, prefaulting the buffer's pages immediately
> before acquiring the glocks significantly shortens the window for
> this deadlock. Closing the window any more causes a large
> performance hit.
>
> Mailman do mmap files...
>
> Best regards,
> Jozsef
> --
> E-mail : kad...@mail.kfki.hu, kad...@blackhole.kfki.hu
> PGP key: 
> http://www.kfki.hu/~kadlec/pgp_public_key.txt
> Address: KFKI Research Institute for Particle and Nuclear Physics
> H-1525 Budapest 114, POB. 49, Hungary
>
> --
> Linux-cluster mailing list
> Linux-cluster@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-27 Thread Wendy Cheng

... [snip] ...
Sigh. The pressure is mounting to 
fix the cluster at any cost, and nothing remained but to downgrade to

cluster-2.01.00/openais-0.80.3 which would be just ridiculous.

  


I have doubts that GFS (i.e. GFS1) is tuned and well-maintained on newer
versions of RHCS (as well as 2.6-based kernels). My impression is that
GFS1 is supposed to be phased out starting with RHEL 5. So if you are
running GFS1, why is downgrading RHCS ridiculous?


Should GFS2 be recommended? Did you open a Red Hat support ticket?
Linux is free, but Red Hat engineers still need to eat like any other
human being.


I have *not* looked at RHCS for more than a year now - so my impression 
(and opinion) may not be correct.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Using ext3 on SAN

2008-12-11 Thread Wendy Cheng

Manish Kathuria wrote:

I am working on a two node Active-Active Cluster using RHEL 5.2 and
the Red Hat Cluster Suite with each node running different services. A
SAN would be used as a shared storage device. We plan to partition the
SAN in such a manner that only one node will mount a filesystem at any
point of time. If there is a fail over, the partitions will be
unmounted and then mounted on the other node. We want to avoid using
GFS because of the associated performance issues. All the partitions
will be formatted with ext3 and the cluster configuration will ensure
that they are not mounted on more than one node at any given point.

Could there be any chances of data loss or corruption in such a
scenario? Is it a must to use a clustered file system if a partition
is going to be mounted at a single node only at any point of time? I
would be glad if you could share your experiences.
  


With your setup, from a failover point of view, there is no difference
between using ext3 or GFS1/2. Ext3 should work fine as long as it is
only mounted on one node at any given time - there will be no
corruption (unless there are unexpected bugs). However, there is a
possibility of data loss, regardless of whether it is ext3, GFS1, or GFS2.

All the filesystems mentioned here are journaling filesystems, which
guarantee no metadata corruption upon unclean shutdowns (with the help
of journal replay). However, none of them can ensure no data loss. Data
that is left behind in the filesystem cache can get lost upon failover.

You have to explicitly mount the filesystem with the "sync" option (with a
significant performance hit) to ensure no data loss. If you mount in
data-journaling mode (check "man mount" and look for the explanation of
"data=journal"), the possibility of data loss is lower but, still, there is
no guarantee.

Most of the proprietary NAS offerings (e.g. a Netapp filer via NFS) on the
market have embedded NVRAM hardware to avoid this issue.


-- Wendy


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Data Loss / Files and Folders "2-Node_GFS-Cluster"

2008-11-05 Thread Wendy Cheng

Doug Tucker wrote:
The changes were made on 2.6.22 kernel. I would think RHEL 4.7 has the 
same issue - but I'm not sure as I left Red Hat before 4.7 was released. 
Better to open a service ticket to Red Hat if you need the fix.


If applications are directly run on GFS nodes, instead of going thru NFS 
servers, posix locks and flocks should work *fine* across different 
nodes. The problem had existed in Linux NFS servers for years - no one 
seemed to complain about it until clusters started to get deployed more 
commonly.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



That's always been tough for me to discern, as they stay with the same
base kernel "name" while actually moving the code forward.  4.7 has
kernel:  2.6.9-78.0.1.ELsmp .  Now how that translates to the "actual"
kernel number as 2.6.21, 22, etc, I never can figure out.
  


You seem to assume that, if the service ticket is approved, the fix would
have to move the whole kernel from 2.6.9 to 2.6.22? That is a
(surprising) misunderstanding.


As with any bug fix in any operating system distribution, it can be done
across different kernels if it passes certain kinds of risk and
resource review processes. The code change has to be tailored to its
own release framework - the actual implementation may look different, but
it should implement similar logic to fix the identical problem.


Hopefully I have interpreted your comment correctly.

-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Data Loss / Files and Folders "2-Node_GFS-Cluster"

2008-11-03 Thread Wendy Cheng

Doug Tucker wrote:
  
I don't (or "didn't") have adequate involvements with RHEL5 GFS. I may 
not know enough to response. However, users should be aware of ...


Before RHEL 5.1 and community version 2.6.22 kernels, NFS locks (i.e. 
flock, posix lock, etc) is not populated into filesystem layer. It only 
reaches Linux VFS layer (local to one particular server). If your file 
access needs to get synchronized via either flock or posix locks 
*between multiple hosts (i.e. NFS servers)*,  data loss could occur. 
Newer versions of RHEL and 2.6.22-and-above kernels should have the code 
to support this new feature.


There was an old write-up in section 4.1 of
"http://people.redhat.com/wcheng/Project/nfs.htm" about this issue.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



Wendy,

To be clear, does this include RHEL 4.7, or is it specific to 5.x?
  


The changes were made in the 2.6.22 kernel. I would think RHEL 4.7 has the
same issue - but I'm not sure, as I left Red Hat before 4.7 was released.
Better to open a service ticket with Red Hat if you need the fix.


If applications run directly on the GFS nodes, instead of going through NFS
servers, posix locks and flocks should work *fine* across different
nodes. The problem had existed in Linux NFS servers for years - no one
seemed to complain about it until clusters started to be deployed more
commonly.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Data Loss / Files and Folders "2-Node_GFS-Cluster"

2008-10-31 Thread Wendy Cheng

Jason Ralph wrote:

Hello List,

We currently have in production a two node cluster with a shared SAS 
storage device.  Both nodes are running RHEL5 AP and are connected 
directly to the storage device via SAS.  We also have configured a 
high availability NFS service directory that is being exported out and 
is mounted on multiple other linux servers. 


The problem that I am seeing is:
FIle and folders that are using the GFS filesystem and live on the 
storage device are mysteriously getting lost.  My first thought was 
that maybe one of our many users has deleted them. So I have revoked 
the users privilleges and it is still happening.  My other tought was 
that a rsync script may have overwrote these files or deleted them.  I 
have stopped all scripting and crons and it has happened again.


Can someone help me with a command or a log to view that would show me 
where any of these folders may have gone?  Or has anyone else ever run 
into this type of data loss using the similar setup?




I don't (or "didn't") have adequate involvements with RHEL5 GFS. I may 
not know enough to response. However, users should be aware of ...


Before RHEL 5.1 and community version 2.6.22 kernels, NFS locks (i.e. 
flock, posix lock, etc) is not populated into filesystem layer. It only 
reaches Linux VFS layer (local to one particular server). If your file 
access needs to get synchronized via either flock or posix locks 
*between multiple hosts (i.e. NFS servers)*,  data loss could occur. 
Newer versions of RHEL and 2.6.22-and-above kernels should have the code 
to support this new feature.


There was an old write-up in section 4.1 of 
"http://people.redhat.com/wcheng/Project/nfs.htm"; about this issue.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Data Loss / Files and Folders "2-Node_GFS-Cluster"

2008-10-30 Thread Wendy Cheng

Jason Ralph wrote:

Hello List,

We currently have in production a two node cluster with a shared SAS 
storage device.  Both nodes are running RHEL5 AP and are connected 
directly to the storage device via SAS.  We also have configured a 
high availability NFS service directory that is being exported out and 
is mounted on multiple other linux servers. 


The problem that I am seeing is:
FIle and folders that are using the GFS filesystem and live on the 
storage device are mysteriously getting lost.  My first thought was 
that maybe one of our many users has deleted them. So I have revoked 
the users privilleges and it is still happening.  My other tought was 
that a rsync script may have overwrote these files or deleted them.  I 
have stopped all scripting and crons and it has happened again.


Can someone help me with a command or a log to view that would show me 
where any of these folders may have gone?  Or has anyone else ever run 
into this type of data loss using the similar setup?





I don't (or "didn't") have adequate involvements with RHEL5 GFS. I may
not know enough to response. However, ..

Before RHEL 5.1 and/or community version 2.6.22 kernels, NFS lock (via
flock, fcntl, etc from client ends) is not populated into filesystem 
layer. It only reaches Linux VFS layer (local to one particular server). 
If your file access needs to get synchronized by either flock or posix 
fcntl *between multiple hosts (NFS servers)*, data loss could occur.

Newer versions of RHEL and 2.6.22-and-after kernels should have the fixes.

There was an old write-up in section 4.1 of
"http://people.redhat.com/wcheng/Project/nfs.htm"; about this issue.


-- Wendy


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS Tunables

2008-10-16 Thread Wendy Cheng

Brandon Young wrote:

Hi all,

I currently have a GFS deployment consisting of eight servers and 
several GFS volumes.  One of my GFS servers is a dedicated backup 
server with a second replica SAN attached to it through a second HBA.  
My approach to backups has been with tools such as rsync and 
rdiff-backup, run on a nightly basis.  I am having a particular 
problem with one or two of my filesystems taking a *very* long time to 
backup.  For example, I have /home living on GFS.  Day-to-day 
performance is acceptable, but backups are hideously slow.  Every 
night, I kick off an rdiff-backup of /home from my backup server, 
which dumps the backup onto an XFS filesystem on the replica SAN.  
This backup can take days in some cases.


Not only GFS, the "getdents()" has been more than annoying on many
filesystems if entries count within the directory is high - but, yes,
GFS is particularly bloody slow with its directory read. There have been
efforts contributed by Red Hat POSIX and LIBC folks to have new
standardized light-weight directory operations. Unfortunately I lost
tracks of their progress ... On the other hand, integrating these new
calls into GFS would take time anyway (if they are available) - so
unlikely it can meet your need. There were also few experimental GFS
patches but none of them made into the production code.

Unless other GFS folks can give you more ideas, I think your best bet at
this moment is to think "outside" the box. That is, don't do
file-to-file backup if all possible. Check out other block level backup
strategies. Are Linux LVM mirroring and/or snapshots workable for you ?
Does your SAN vendor provide embedded features (e.g. Netapp SAN box
offers snapshot, snapmirror, syncmirror, etc) ?

-- Wendy



We have done some investigating, and found that it appears that 
getdents(2) calls (which give the list of filenames present in a 
directory) are spectacularly slow on GFS, irrespective of the size of 
the directory in question.  In particular, with 'strace -r', I'm 
seeing a rate below 100 filenames per second.  The filesystem /home 
has at least 10 million files in it, which doing the math means 29.5 
hours just to do the getdents calls to scan them, which is more than a 
third of wall-clock time.  And that's before we even start stat'ing.


I google'd around a bit and I can't see any discussion of slow 
getdents calls under GFS.  Is there any chance we have some sort of 
tunable turned on/off that might be causing this?  I'm not sure which 
tunables to consider tweaking, even.  This seems awfully slow, even 
with sub-optimal locking.  Is there perhaps some tunable I can try 
tweaking to improve this situation?  Any insights would be much 
appreciated.


--
Brandon


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] rhcs + gfs performance issues

2008-10-07 Thread Wendy Cheng

Hopefully the following provides some relief ...

1. Enable the lock trimming tunable. It is particularly relevant if NFS-GFS
is used for development-type workloads (editing, compiling, building,
etc.) and/or after a filesystem backup. Unlike fast statfs, this tunable is
per-node (you don't need to have the same value on each of the nodes, and a
mix of on/off within the same cluster is ok). Make the trimming very
aggressive on the backup node (> 50% where you run the backup) and moderate
on your active nodes (< 50%). Try experimenting with different values to
fit the workload. Google "gfs lock trimming wcheng" to pick up the
technical background if needed.


shell> gfs_tool settune <mount_point> glock_purge <percentage>
(e.g. gfs_tool settune /mnt/gfs1 glock_purge 50)

2. Turn on the readahead tunable. It is effective for large-file (streaming
IO) read performance. As I recall, one cluster (with an IPTV application)
used val=256 for 400-500M files. Another one with 2G file sizes used
val=2048. Again, it is per-node, so different values are ok on different
nodes.


shell> gfs_tool settune <mount_point> seq_readahead <value>
(e.g. gfs_tool settune /mnt/gfs1 seq_readahead 2048)

3. Fast statfs tunable - you have this on already? Make sure it is set the
same across the cluster nodes.


4. Understand the risks and performance implications of the NFS server's
"async" vs. "sync" options. Linux NFS server "sync" behavior is
controlled by two different mechanisms - mount and export. By default,
mount is "async" and export is "sync". Even with an explicit "async" mount
request, the Linux server uses the "sync" export as the default, which is
particularly troublesome for gfs. I don't plan to give an example and/or
suggest the exact export option here - hopefully this will force folks to
do more research to fully understand the trade-off between performance and
data liability. Most of the proprietary NFS servers on the market today
utilize hardware features to relieve this performance and data integrity
conflict. Mainline linux servers (and RHEL) are totally software-based,
so they generally have problems in this regard.


Gfs1 in general doesn't do well on "sync" performance (the journal layer is
too bulky). Gfs2 has the potential to do better (but I'm not sure).


There are also a few other things worth mentioning, but my flight is being
called for boarding ... I'll stop here ...


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Distributed Replicated GFS shared storage

2008-10-01 Thread Wendy Cheng

José Miguel Parrella Romero wrote:


Juliano Rodrigues escribió, en fecha 30/09/08 10:58:
  

Hello,

In order to design an HA project I need a solution to replicate one GFS
shared storage to another "hot" (standby) GFS "mirror", in case of my
primary shared storage permanently fail.

Inside RHEL Advanced Platform there is any supported way to accomplish
that?



I believe the whole point of GFS is avoiding you to spend twice your
storage capacity just for the sake of storage distribution. It already
enables you to have a standby server which can go live through a
resource manager whenever you need it.
  


Looks like the original subject (requirement) is to have redundant (HA) 
storage devices. GFS alone can't accomplish this since it only deals 
with server nodes - as soon as the shared storage unit is gone, the 
filesystem will be completely unusable.


Depending on the hardware, redundant storage does not necessarily consume 
a great deal of storage capacity. Though GFS itself does spread block 
allocation across the whole partition (to avoid write contention 
between multiple clustered nodes), the underlying hardware may do things 
differently. That is, GFS block numbers (and its block layout) do not 
necessarily resemble the real disk block numbers (and the physical 
layout).  So, say you have a 1TB GFS partition configured but it only 
gets half full; you may have an extra 500GB of space to spare if your SAN 
product allows this type of over-commit. If your storage vendor supports 
data de-duplication, the storage consumption can go down even further. 

-- Wendy


However, if you need to have two separate storage facilities which sync
in one way, DRBD is probably the easiest way to do so. Heartbeat can
manage DRBD resources at block- and filesystem-level easily, and other
resource managers can probably do so (though I haven't used them)

HTH,
Jose
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkjilzEACgkQUWAsjQBcO4KhwQCeM0lxhXfCwxiAigfi+39pHGog
alwAn3UilZcaPU009vaoxVhXFV6J5KqY
=IVLO
-END PGP SIGNATURE-

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
  


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Locking in GFS

2008-09-22 Thread Wendy Cheng

Wendy Cheng wrote:

Chris Joelly wrote:

Hello,

i have a question on locking issues on GFS:

how do GFS lock files on the filesystem. I have found one posting to
this list which states that locking occurs "more or less" on file 
level. Is this true? or does some kind of locking occur on directory

level too?
  


You may view GFS(1) internal lock granularity as being at the system call 
level - that is, when either a write or read (say pwrite() or pread()) is 
issued, the associated file is locked until the system call returns. 
There are a few simple things that will be helpful to keep in 
mind:


1. a write requires an exclusive lock (i.e., when there is a write 
going on, every access to that file has to wait).
2. a read needs a shared lock (i.e.  many reads to the same file will 
not be stalled).
3. a write may involve directory lock (e.g. a "create" would need a 
write lock of the parent directory).
4. local locking (two writes compete the same lock on the same node) 
is always much better than inter-node (different nodes) locking 
(ping-pong the same write lock between different nodes is very 
expensive).
5. standard APIs (such as fcntl() and flock()) take precedence over GFS(1) 
internal locking if used correctly (e.g. upon obtaining an exclusive 
flock, other access to that file will be stalled, assuming every 
instance of the executable running on different nodes has the correct 
flock implemented and honored).


One exception to (5) is posix byte-range locks. Say there are two 
processes running on different nodes, each obtaining its own byte 
range locks. Process A locks byte offset 0 to 10K; process B locks 
byte 10K+1 to 40K. When both have writes issued, one of them has to 
wait until other's write system call completes before it can continue 
- a result of its posix locking implementation that is done by a 
separate module outside its internal filesystem locking.



The question arose because I was thinking about what would happen when
using GFS on a file server with home directories for, let's say, 1000
users. How do I set up this directory structure to avoid locking issues
as best as possible?

is a directory layout like the following ok when /home is a GFS file 
system:


/home/a/b/c/abc
/home/d/e/f/def
/home/g/h/i/ghi
...
/home/x/y/z/xyz

  


Hope the above statements have helped you understand that on GFS(1),

1. A short-and-fat directory structure will work (much) better than 
tall-and-skinny ones.
2. If possible, the directory setup should avoid ping-ponging directory 
and/or write locks between different nodes


Really thinking about it, (1) is *not* the right description at all. What I 
meant to say is that, if possible, try not to put everyone in the very 
same directory if there is a lot of write activity that will cause 
directory lock contention, particularly if the lock(s) have to 
get passed around different nodes.


Sorry !

-- Wendy


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Locking in GFS

2008-09-22 Thread Wendy Cheng

Chris Joelly wrote:

Hello,

i have a question on locking issues on GFS:

how do GFS lock files on the filesystem. I have found one posting to
this list which states that locking occurs "more or less" on file 
level. Is this true? or does some kind of locking occur on directory

level too?
  


You may view GFS(1) internal lock granularity as being at the system call 
level - that is, when either a write or read (say pwrite() or pread()) is 
issued, the associated file is locked until the system call returns. 
There are a few simple things that will be helpful to keep in mind:


1. a write requires an exclusive lock (i.e., when there is a write going 
on, every access to that file has to wait).
2. a read needs a shared lock (i.e.  many reads to the same file will 
not be stalled).
3. a write may involve directory lock (e.g. a "create" would need a 
write lock of the parent directory).
4. local locking (two writes compete the same lock on the same node) is 
always much better than inter-node (different nodes) locking (ping-pong 
the same write lock between different nodes is very expensive).
5. standard APIs (such as fcntl() and flock()) take precedence over GFS(1) 
internal locking if used correctly (e.g. upon obtaining an exclusive flock, 
other access to that file will be stalled, assuming every instance of the 
executable running on different nodes has the correct flock implemented 
and honored).


One exception to (5) is posix byte-range locks. Say there are two 
processes running on different nodes, each obtaining its own byte range 
locks. Process A locks byte offset 0 to 10K; process B locks byte 10K+1 
to 40K. When both have writes issued, one of them has to wait until 
other's write system call completes before it can continue - a result of 
its posix locking implementation that is done by a separate module 
outside its internal filesystem locking.
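
To illustrate point (5), a minimal sketch using the flock(1) utility from 
util-linux (the lock file path and the command are placeholders; every node 
must agree on the same lock file for this to serialize access cluster-wide):

shell> touch /mnt/gfs1/app.lock
shell> flock -x /mnt/gfs1/app.lock -c "/usr/local/bin/update-shared-data"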



The question arose because I was thinking about what would happen when
using GFS on a file server with home directories for, let's say, 1000
users. How do I set up this directory structure to avoid locking issues
as best as possible?

is a directory layout like the following ok when /home is a GFS file 
system:


/home/a/b/c/abc
/home/d/e/f/def
/home/g/h/i/ghi
...
/home/x/y/z/xyz

  


Hope the above statements have helped you understand that on GFS(1),

1. A short-and-fat directory structure will work (much) better than 
tall-and-skinny ones.
2. If possible, the directory setup should avoid ping-ponging directory 
and/or write locks between different nodes


For GFS2, after browsing through its newest source code a few minutes ago - 
it reminds me of a "sea horse" shape with Linux page locks on its curly 
"belly" :). It is difficult (for me) to describe so I'll skip.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


[Fwd: Re: [Fwd: [Linux-cluster] fast_stafs]]

2008-09-22 Thread Wendy Cheng


--- Begin Message ---


2.6 The local change (delta) is synced to disk whenever the quota daemon 
wakes up (the interval is a tunable, defaulting to 5 seconds). It is then 
subsequently zeroed out.
Does this mean that I can't mount a GFS file system with the noquota 
option and use fast_statfs?


Don't have the GFS code in front of me at this moment. IIRC, the logic is 
triggered by a daemon that controls various tasks, including quota. 
Though both (fast statfs and quota) share the same triggering mechanism, 
they are independent of each other. In short, you should be able to use 
fast statfs even when the noquota option is on.




Also, am I correct in understanding that if there is an unclean 
shutdown that it is best to turn off fast_statfs  and then back on to 
get the statfs data in sync on existing nodes?





That's very correct.
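
A minimal sketch of that resync, run on each existing node (the mount point 
is a placeholder):

shell> gfs_tool settune /mnt/gfs1 statfs_fast 0
shell> gfs_tool settune /mnt/gfs1 statfs_fast 1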

-- Wendy

--- End Message ---
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] live & standby (primary & secondary partitions) in "multipath -ll"

2008-06-24 Thread Wendy Cheng




Anyway, assume your filers are on Data Ontap 10.x releases and they 
are clustered ? 


Sorry, didn't read the rest of the post until now and forgot that 10.x 
releases out in the field do not support FCP protocol. So apparently you 
are on "7.x" releases. The KnowledgeBase article that I referred to is 
*still* valid though.


If the "sanlun lun show" output are all from Filer2, than most likely 
your filers admin had assigned all the disks to Filer2, even Filer1 can 
see them. Check with your filer admin if you want to load balancing the 
filers.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] live & standby (primary & secondary partitions) in "multipath -ll"

2008-06-24 Thread Wendy Cheng

sunhux G wrote:
 
Thanks Wendy, that answered my original question.
 
I should have rephrased my question :
 
I received an alert email from Filer1 :

"autosupport.doit FCP PARTNER PATH MISCONFIGURED"
 
when our outsourced DBA built the Oracle ASM & ocfs2 partitions on

/dev/sdc1, /dev/sdd1, /dev/sde1, /dev/sdf1 & /dev/sdg1 & I suspected
this is related to building the partitions on the "enabled" (ie standby)
partitions as shown by "multipath -ll".  After we got the DBA to wipe
out & rebuild the ASM/ocfs2 on those partitions that are listed under
"prio 8" by "multipath -ll" (see lines indicated by ** in my 1st email),
the error of "FCP partner path misconfigured" cleared.
 
Funny that you got Netapp cluster questions answered on Linux-cluster 
mailing list :) ...


Anyway, assume your filers are on Data Ontap 10.x releases and they are 
clustered ? The action you'd taken *is* correct. For details, check out: 
https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb30541 where 
the issue is documented.  Let me know if you can't access the KB 
article.


"Hosts (linux) should, under normal circumstances, only access LUNs 
through ports on the cluster node which hosts the LUN. I/O paths that 
utilize the ports of the cluster node that host the LUN are referred to 
as primary paths or optimized paths. I/O paths that utilize the partner 
cluster node are known as secondary paths, partner paths or 
non-optimized paths. A LUN should only be accessed through the partner 
cluster node when the primary ports are unavailable. I/O access to LUNs 
using a secondary path indicates one or both of the following 
conditions: the primary path(s) between host and storage controller have 
failed, or host MPIO software is not configured correctly. These 
conditions indicate that the redundancy and performance of the SAN has 
been compromised. Corrective action should be taken immediately to 
restore primary paths to the storage controllers. "


-- Wendy



--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] live & standby (primary & secondary partitions) in "multipath -ll"

2008-06-23 Thread Wendy Cheng

sunhux G wrote:

Question 1:
a) how do we find out which of the device files /dev/sd*  
go to NetApp SAN Filer1 & which to Filer2 (we have 2
NetApp filers)? 


Contact your Netapp support or directly go to the "NOW" web site to 
download its linux host utility packages (e.g. 
netapp_linux_host_utils_3_1.tar.gz). Its installation should be 
reasonably trivial. After install, there will be a command called 
"sanlun" that can tell you the mapping between /dev/sd* and the filer LUNs:


For example:
[EMAIL PROTECTED] wendy]# sanlun lun show
  filer:    lun-pathname        device filename  adapter  protocol  lun size  lun state
  wcheng-1  /vol/cluster1/lun0  /dev/sda         host5    iSCSI     300GB     GOOD
  wcheng-2  /vol/cluster2/lun0  /dev/sdb         host6    iSCSI     300GB     GOOD


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] gfs tuning

2008-06-19 Thread Wendy Cheng

Terry wrote:

On Tue, Jun 17, 2008 at 5:22 PM, Terry <[EMAIL PROTECTED]> wrote:
  

On Tue, Jun 17, 2008 at 3:09 PM, Wendy Cheng <[EMAIL PROTECTED]> wrote:


Hi, Terry,
  

I am still seeing some high load averages.  Here is an example of a
gfs configuration.  I left statfs_fast off as it would not apply to
one of my volumes for an unknown reason.  Not sure that would have
helped anyways.  I do, however, feel that reducing scand_secs helped a
little:



Sorry I missed scand_secs (was mindless as the brain was mostly occupied by
day time work).

To simplify the view, glock states include exclusive (write), share (read),
and not-locked (in reality, there are more). Exclusive lock has to be
demoted (demote_secs) to share, then to not-locked (another demote_secs)
before it is scanned (every scand_secs) to get added into reclaim list where
it can be purged. Between exclusive and share state transition, the file
contents need to get flushed to disk (to keep file content cluster
coherent).  All of above assume the file (protected by this glock) is not
accessed (idle).

You hit an area that GFS normally doesn't perform well. With GFS1 in
maintenance mode while GFS2 seems to be so far away, ext3 could be a better
answer. However, before switching, do make sure to test it thoroughly (since
Ext3 could have the very same issue as well - check out:
http://marc.info/?l=linux-nfs&m=121362947909974&w=2 ).

Did you look (and test) GFS "nolock" protocol (for single node GFS)? It
bypasses some locking overhead and can be switched to  DLM in the future
(just make sure you reserve enough journal space - the rule of thumb is one
journal per node and know how many nodes you plan to have in the future).

-- Wendy
  

Good points.  I could try the nolock feature I suppose.  Not quite
clear on how to reserve journal space.  I forgot to post the cpu time,
check out this:

 PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
 4822 root  10  -5 000 S1  0.0   2159:15 dlm_recv
 4820 root  10  -5 000 S1  0.0 368:09.34 dlm_astd
 4821 root  10  -5 000 S0  0.0 153:06.80 dlm_scand
 3659 root  10  -5 000 S0  0.0 134:40.14 scsi_wq_4
 4823 root  11  -5 000 S1  0.0 109:33.33 dlm_send
 367 root  10  -5 000 S0  0.0 103:33.74 kswapd0

gfs_glockd is further below so not so concerned with that right now.
It appears turning on nolock would do the trick.  The times aren't
extremely accurate because I have failed this cluster between nodes
while testing.




Here is some more testing information

I created a new volume on my iscsi san of 1 TB and formatted it for
ext3. I then used dd to create a 100G file.  This yielded roughly 900
Mb/sec.  I then stopped my application and did the same thing with an
existing GFS volume.  This gave me about 850 Kb/sec.  This isn't an
iscsi issue.  This appears to be a load issue and the number of I/O
occurring on these volumes.  That said, I would expect that performing
the changes I did would result in a major performance improvement.
Since it didn't, what are my other points I could consider?   If its a
GFS issue, ext3 is the way to go.  Maybe even switch to using
active-active on my NFS cluster.   If its a backend disk issue, I
would expect to see the throughput on my iscsi link (bond1) be fully
utilized.  Its not.  Could I be thrashing the disks?  This is an iscsi
san with 30 sata disks.  Just bouncing some thoughts around to see if
anyone has any more thoughts.

  
Really need to focus on my day time job - its workload has been climbing 
... but can't help placing a quick comment here ..


The 900 MB/s vs. 850 KB/s difference looks like a caching issue - that 
is, for 900 MB/s, it looks like the data was still lingering in the 
system cache, while in the 850 KB/s case the data might have already hit disk. 
A cluster filesystem normally syncs more by its nature. In general, ext3 
does perform better in a single-node environment, but the difference should 
not be as big as above. 
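
To take the page cache out of the comparison, a minimal sketch (the mount 
points are placeholders); conv=fsync makes dd flush the data to disk before 
it reports a transfer rate:

shell> dd if=/dev/zero of=/mnt/ext3/ddtest bs=1M count=1024 conv=fsync
shell> dd if=/dev/zero of=/mnt/gfs1/ddtest bs=1M count=1024 conv=fsync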

There are certainly more tuning knobs available (such as journal size 
and/or network buffer size) to make a GFS-iscsi "dd" run better but it is 
pointless. To deploy a cluster filesystem for production usage, the 
tuning should not be driven by such a simple-minded command. You also have 
to consider the support issues when deploying a filesystem. GFS1 is a 
little bit out of date and any new development and/or significant 
performance improvements would likely be in GFS2, not in GFS1. Research 
GFS2 (google to see what other people have said about it) to understand 
whether its direction fits your needs (so you can migrate from GFS1 to 
GFS2 if you bump into any show stopper in the future). If not, ext3 
(with ext4 actively developed) is a fine choice if I read your 
configuration right from previous posts.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] gfs tuning

2008-06-17 Thread Wendy Cheng

Hi, Terry,


I am still seeing some high load averages.  Here is an example of a
gfs configuration.  I left statfs_fast off as it would not apply to
one of my volumes for an unknown reason.  Not sure that would have
helped anyways.  I do, however, feel that reducing scand_secs helped a
little:
  
Sorry I missed scand_secs (was mindless as the brain was mostly occupied 
by day time work).


To simplify the view, glock states include exclusive (write), share 
(read), and not-locked (in reality, there are more). An exclusive lock has 
to be demoted (demote_secs) to share, then to not-locked (another 
demote_secs), before it is scanned (every scand_secs) to get added into the 
reclaim list where it can be purged. During the exclusive-to-share state 
transition, the file contents need to get flushed to disk (to keep the file 
content cluster coherent).  All of the above assumes the file (protected by 
this glock) is not being accessed (idle).
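
For illustration, these knobs can be inspected and adjusted per mount point; 
a minimal sketch (the mount point and values are placeholders):

shell> gfs_tool gettune /mnt/gfs1 | egrep 'demote_secs|scand_secs'
shell> gfs_tool settune /mnt/gfs1 demote_secs 200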


You hit an area where GFS normally doesn't perform well. With GFS1 in 
maintenance mode while GFS2 seems to be so far away, ext3 could be a 
better answer. However, before switching, do make sure to test it 
thoroughly (since ext3 could have the very same issue as well - check 
out: http://marc.info/?l=linux-nfs&m=121362947909974&w=2 ).


Did you look (and test) GFS "nolock" protocol (for single node GFS)? It 
bypasses some locking overhead and can be switched to  DLM in the future 
(just make sure you reserve enough journal space - the rule of thumb is 
one journal per node and know how many nodes you plan to have in the 
future).
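
A minimal sketch of reserving journals at mkfs time and switching protocols 
later (the device, journal count and cluster/fs names are placeholders):

shell> gfs_mkfs -p lock_nolock -j 4 /dev/vg0/gfslv
# later, to go clustered (with the filesystem unmounted on all nodes):
shell> gfs_tool sb /dev/vg0/gfslv proto lock_dlm
shell> gfs_tool sb /dev/vg0/gfslv table mycluster:gfs1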


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] gfs tuning

2008-06-16 Thread Wendy Cheng

Ross Vandegrift wrote:

On Mon, Jun 16, 2008 at 11:45:51AM -0500, Terry wrote:
  

I have 4 GFS volumes, each 4 TB.  I am seeing pretty high load
averages on the host that is serving these volumes out via NFS.  I
notice that gfs_scand, dlm_recv, and dlm_scand are running with high
CPU%.  I truly believe the box is I/O bound due to high awaits but
trying to dig into root cause.  99% of the activity on these volumes
is write.  The number of files is around 15 million per TB.   Given
the high number of writes, increasing scand_secs will not help.  Any
other optimizations I can do?



  


A similar case two years ago was solved by the following two tunables:

shell> gfs_tool settune <mountpoint> demote_secs <seconds>
(e.g. "gfs_tool settune /mnt/gfs1 demote_secs 200").
shell> gfs_tool settune <mountpoint> glock_purge <percent>
(e.g. "gfs_tool settune /mnt/gfs1 glock_purge 50")

The example above will trim 50% of the inodes away every 200-second 
interval (the default is 300 seconds). Do this on all the GFS-NFS servers 
that show this issue. It can be dynamically turned on (non-zero 
percentage) and off (0 percentage).


As I recall, the customer used a very aggressive percentage (I think 
it was 100%) but please start from a middle ground (50%) to see how it goes.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS performance tuning

2008-06-10 Thread Wendy Cheng

Ross Vandegrift wrote:


1. How to use "fast statfs".



On a GFS2 filesystem, I see the following:
[EMAIL PROTECTED] ~]# gfs2_tool gettune /rrds
...
statfs_slow = 0
...

Does that indicate that my filesystem is already using this feature?
  


The fast statfs patch was a *back* port from GFS2 to GFS1. In GFS2, fast 
statfs is the *default*.



2. Disabling updatedb for GFS.
3. More considerations about the Resource Group size and the
   new "bitfit" function.
4. Designing your environment with the DLM in mind.



Do you have any specific reading material you'd suggest on this topic?
I suspect the interesting bits are related ot how GFS actually uses
the DLM to lock filesystem metadata.  I've read some of Christine's
DLM book, but there's not really anything related to GFS therein.
  


I happen to have a new write-up about GFS locking. It, unfortunately, 
also interleaves with my current employer's disk block allocation policy 
(proprietary info). Note that GFS disk block handling has been 
piggy-backed on its glock logic. Would need some time to clean it up for 
public reading. Stay tuned.


  

5. How to use "glock trimming".



Is the glock trimming patch present in the cluster suite from RH5?
  

For GFS1, yes.

For GFS2 ... it is a long story... I'll let other people comment it.

-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] gfs 6.1 superblock backups

2008-06-03 Thread Wendy Cheng

Bob Peterson wrote:

On Tue, 2008-06-03 at 13:27 -0500, Chris Adams wrote:
  
Bob and Wendy, 
Thank you for your input on this.  What I am trying to do 
is upgrade a GFS 6.0 filesystems which are attached to various 
RHEL3/CentOS3 systems.  After performing the steps which outline the 
process of going from 3 to 4, but on a CentOS 5 system, I get the problems 
mentioned in my message yesterday Re: /sbin/mount.gfs thinks fs is gfs2?  
Everyt time I reinstalled a system with CentOS 5 and tried to get gfs 
running again I got the same error.


Since I know that this is an unsupported operation, I haven't sought 
support for this.  However, I noticed that my upgraded filesystem had 
sb_fs_format = 1308.  The mount code checks for sb_fs_format == 
GFS_FORMAT_FS for gfs 6.1 and GFS2_FORMAT_FS for gfs2.  Since it was 
neither of these, it kept dying saying that it was a gfs2 fs when mounting 
it as gfs, and vice versa.  Manually modifying sb_fs_format allowed it to 
mount immediately afterward.  A subsequent gfs_fsck completes all passes 
successfully.  

Is that sufficient for upgrading the filesystem if the other steps are 
performed?  All fs operations appear to be successful at this point.


thanks,
-chris



I can't think of a good reason why my predecessors would have changed
the file system format ID unless there was something in the file system
that changed and needed reorganizing or reformatting. 

I'm not the person who added this ID but it is the *right* thing to do. As 
a rule of thumb, when moving between major releases, such as RHEL3 and 
RHEL4, a filesystem needs to have an identifier to facilitate the 
upgrade process. There should be documents, commands and/or tools to 
guide people through the upgrade - all of which require this type of "ID" 
implementation. And there should be associated testing effort allocated 
to the upgrade command as a safeguard before you can call a filesystem an 
"enterprise product". For GFS specifically, the locking protocols are 
different between GFS 6.0 and 6.1 (e.g. GULM is in RHEL3 but not in 
RHEL4) and the locking protocol is part of the superblock structure, iirc.


From a practical point of view, it is probably ok to keep going (but do 
check the RHEL manuals - there should be chapters talking about migration 
and upgrade from RHEL3 to 4 and from RHEL4 to 5).


From a process point of view, this looks like a RHEL5 bug to me.

-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] gfs 6.1 superblock backups

2008-06-03 Thread Wendy Cheng

Chris Adams wrote:

On Tue, 2008-06-03 at 11:03 -0400, Wendy Cheng wrote:
Chris Adams wrote:
  

Does GFS 6.1 have any superblock backups a la ext2/3?  If so, how
can I find them?
  

Unfortunately, no.




If that is the case, then is it safe to assume that fs_sb_format will
always be bytes 0x1001a and 0x1001b on a gfs logical volume, and that that 
is the only location on the lv where it is stored? I see 
#define GFS_FORMAT_FS (1309) /* Filesystem (all-encompassing) */

and that is the location where I see 0x051d (1309) stored.

  
Yes .. in theory (since I don't have the source code in front of me at 
this moment).


Thinking of hand-patching it, aren't you? ... There is a header file (I 
think it is gfs_ondisk.h) that describes the superblock layout.
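
For a quick read-only check first, a minimal sketch (the device path is a 
placeholder; it assumes the GFS superblock sits at 64KB into the volume with 
sb_fs_format as a big-endian 32-bit field at offset 0x18 within it):

shell> dd if=/dev/vg0/gfslv bs=1 skip=$((0x10018)) count=4 2>/dev/null | hexdump -C
# a healthy GFS 6.1 filesystem should show 00 00 05 1d (i.e. 1309) here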


-- Wendy


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Error with gfs_grow/ gfs_fsck

2008-06-03 Thread Wendy Cheng

Bob Peterson wrote:

Hi,

On Tue, 2008-06-03 at 15:53 +0200, Miolinux wrote:
  

Hi,

I tried to expand my gfs filesystem from 250Gb to 350Gb.
I run gfs_grow without any error or warnings.
But something gone wrong.

Now, i cannot mount the gfs filesystem anymore (lock computer)

When i try to do a gfs_fsck i get:

[EMAIL PROTECTED] ~]# gfs_fsck -v /dev/mapper/VolGroup_FS100-LogVol_FS100 
Initializing fsck

Initializing lists...
Initializing special inodes...
Validating Resource Group index.
Level 1 check.
371 resource groups found.
(passed)
Setting block ranges...
This file system is too big for this computer to handle.
Last fs block = 0x1049c5c47, but sizeof(unsigned long) is 4 bytes.
Unable to determine the boundaries of the file system.



You've probably hit the gfs_grow bug described in bz #434962 (436383)
and the gfs_fsck bug described in 440897 (440896).  My apologies if
you can't read them; permissions to individual bugzilla records are
out of my control.

The fixes are available in the recently released RHEL5.2, although
I don't know when they'll hit Centos.  The fixes are also available
in the latest cluster git tree if you want to compile/install them
from source code yourself.  Documentation for doing this can
be found at: http://sources.redhat.com/cluster/wiki/ClusterGit

  

This almost qualifies as an FAQ entry :) ...

-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] gfs 6.1 superblock backups

2008-06-03 Thread Wendy Cheng

Chris Adams wrote:
Does GFS 6.1 have any superblock backups a la ext2/3?  If so, how can I 
find them?


  


Unfortunately, no.

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


[Linux-cluster] Re: [linux-lvm] Distributed LVM/filesystem/storage

2008-06-01 Thread Wendy Cheng

Jan-Benedict Glaw wrote:

On Sat, 2008-05-31 23:12:21 -0500, Wendy Cheng <[EMAIL PROTECTED]> wrote:
  

Jan-Benedict Glaw wrote:


On Fri, 2008-05-30 09:03:35 +0100, Gerrard Geldenhuis <[EMAIL PROTECTED]> wrote:
  

On Behalf Of Jan-Benedict Glaw


I'm just thinking about using my friend's overly empty harddisks for a
common large filesystem by merging them all together into a single,
large storage pool accessible by everybody.
  

[...]
 
  

It would be nice to see if anybody of you did the same before (merging
the free space from a lot computers into one commonly used large
filesystem), if it was successful and what techniques
(LVM/NBD/DM/MD/iSCSI/Tahoe/Freenet/Other P2P/...) you used to get there,
and how well that worked out in the end.
  

Maybe have a look at GFS.


GFS (or GFS2 fwiw) imposes a single, shared storage as its backend. At
least I get that from reading the documentation. This would result in
merging all the single disks via NBD/LVM to one machine first and
export that merged volume back via NBD/iSCSI to the nodes. In case the
actual data is local to a client, it would still be first send to the
central machine (running LVM) and loaded back from there. Not as
distributed as I hoped, or are there other configuration possibilities
to not go that route?
  
However, with its symmetric architecture, 
nothing can prevent it running on top of a group of iscsi disks (with 
GFS node as initiator), as long as each node can see and access these 
disks. It doesn't care where the iscsi targets live, nor how many there 
are.



So I'd configure each machine's empty disk/partition as an iSCSI
target and let them show up an every "client" machine and run that
setup. How good will GFS deal with temporary (or total) outage of
single targets? Eg. 24h disconnects with ADSL connectivity etc.?

  
High availability will not work well in this particular setup - it is 
more about data and storage sharing between GFS nodes.


Note that GFS normally runs on top of CLVM (clustered lvm, in case you 
don't know about it). You might want to check current (Linux) CLVM raid 
level support to see whether it fits your needs. 



-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


[Linux-cluster] Re: [linux-lvm] Distributed LVM/filesystem/storage

2008-05-31 Thread Wendy Cheng

Jan-Benedict Glaw wrote:

On Fri, 2008-05-30 09:03:35 +0100, Gerrard Geldenhuis <[EMAIL PROTECTED]> wrote:
  

On Behalf Of Jan-Benedict Glaw


I'm just thinking about using my friend's overly empty harddisks for a
common large filesystem by merging them all together into a single,
large storage pool accessible by everybody.
  

[...]
  

It would be nice to see if anybody of you did the same before (merging
the free space from a lot computers into one commonly used large
filesystem), if it was successful and what techniques
(LVM/NBD/DM/MD/iSCSI/Tahoe/Freenet/Other P2P/...) you used to get there,
and how well that worked out in the end.
  

Maybe have a look at GFS.



GFS (or GFS2 fwiw) imposes a single, shared storage as its backend. At
least I get that from reading the documentation. This would result in
merging all the single disks via NBD/LVM to one machine first and
export that merged volume back via NBD/iSCSI to the nodes. In case the
actual data is local to a client, it would still be first send to the
central machine (running LVM) and loaded back from there. Not as
distributed as I hoped, or are there other configuration possibilities
to not go that route?
  


GFS is certainly developed and well tuned in a SAN environment where the 
shared storage and cluster nodes reside on the very same fibre 
channel switch network. However, with its symmetric architecture, 
nothing prevents it from running on top of a group of iscsi disks (with 
each GFS node as an initiator), as long as each node can see and access these 
disks. It doesn't care where the iscsi targets live, nor how many there 
are. Of course, whether it can perform well in this environment is 
another story. In short, the notion that GFS requires all disks to be 
merged into one machine first and the merged volume then exported back to 
the GFS nodes is *not* correct.


I actually have a 4-node cluster in my house. Two nodes run Linux 
iscsi initiators and form a 2-node GFS cluster. The other two nodes 
run a special version of FreeBSD as iscsi targets, each directly 
exporting its local disks to the GFS nodes. I have not put too much IO 
load on the GFS nodes though (since the cluster is mostly used to study 
storage block allocation issues - not for real data and/or applications).
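
A minimal sketch of that kind of setup with open-iscsi (target IPs and IQNs 
are placeholders); each GFS node simply logs into every target and then sees 
the exported disks as ordinary /dev/sd* devices:

shell> iscsiadm -m discovery -t sendtargets -p 192.168.1.10
shell> iscsiadm -m discovery -t sendtargets -p 192.168.1.11
shell> iscsiadm -m node -T iqn.2008-05.net.example:disk0 -p 192.168.1.10 --login
shell> iscsiadm -m node -T iqn.2008-05.net.example:disk1 -p 192.168.1.11 --login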


cc linux-cluster

-- Wendy




--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS, iSCSI, multipaths and RAID

2008-05-21 Thread Wendy Cheng

Michael O'Sullivan wrote:

Hi Alex,

We wanted an iSCSI SAN that has highly available data, hence the need 
for 2 (or more storage devices) and a reliable storage network 
(omitted from the diagram). Many of the articles I have read for iSCSI 
don't address multipathing to the iSCSI devices, in our configuration 
iSCSI Disk 1 presented as /dev/sdc and /dev/sdd on each server (and 
iSCSI Disk 2 presented as /dev/sde and /dev/sdf), but it wan't clear 
how to let the servers know that the two iSCSI portals attached to the 
same target - thus I used mdadm. Also, I wanted to raid the iSCSI 
disks to make sure the data stays highly available - thus the second 
use of mdadm. Now we had a single iSCSI raid array spread over 2 (or 
more) devices which provides the iSCSI SAN. However, I wanted to make 
sure the servers did not try to access the same data simultaneously, 
so I used GFS to ensure correct use of the iSCSI SAN. If I understand 
correctly it seems like the multipathing and raiding may be possible 
in Red Hat Cluster Suite GFS without using iSCSI? Or to use iSCSI with 
some other software to ensure proper locking happens for the iSCSI 
raid array? I am reading the link you suggested to see what other 
people have done, but as always any suggestions, etc are more than 
welcome.




Check out dm-multipath (*not* md-multi-path) to see whether you can make 
use of it:

http://www.redhat.com/docs/manuals/csgfs/browse/4.6/DM_Multipath/MPIO_description.html
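
A minimal sketch of bringing dm-multipath up on RHEL-style systems (package 
and service names may vary by release, and the stock /etc/multipath.conf 
usually blacklists everything until it is edited):

shell> modprobe dm_multipath
shell> service multipathd start
shell> multipath -ll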

-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS, iSCSI, multipaths and RAID

2008-05-21 Thread Wendy Cheng

Alex Kompel wrote:

On Mon, May 19, 2008 at 2:15 PM, Michael O'Sullivan
<[EMAIL PROTECTED]> wrote:
  

Thanks for your response Wendy. Please see a diagram of the system at
http://www.ndsg.net.nz/ndsg_cluster.jpg/view (or
http://www.ndsg.net.nz/ndsg_cluster.jpg/image_view_fullscreen for the
fullscreen view) that (I hope) explains the setup. We are not using FC as we
are building the SAN with commodity components (the total cost of the system
was less than NZ $9000). The SAN is designed to hold files for staff and
students in our department, I'm not sure exactly what applications will use
the GFS. We are using iscsi-target software although we may upgrade to using
firmware in the future. We have used CLVM on top of software RAID, I agree
there are many levels to this system, but I couldn't find the necessary is
hardware/software to implement this in a simpler way. I am hoping the list
may be helpful here.




So what do you want to get out of this configuration? iSCSI SAN, GFS
cluster or both? I don't see any reason for 2 additional servers
running GFS on top of iSCSI SAN.
  
There are advantages (for 2 additional storage servers) because serving 
data traffic over an IP network has its own overhead. They offload CPU 
as well as memory consumption away from the GFS nodes. If done right, the 
setup could emulate a high-end SAN box using commodity hardware to provide 
a low cost solution. The issue here is how to find the right set of 
software subcomponents to build this configuration. I personally never 
used the Linux iscsi target or multi-path md devices - so I can't comment on 
their features and/or performance characteristics. I was hoping folks 
well versed in these Linux modules (software raid, dm multi-path, clvm 
raid level etc) could provide their comments. Check out the linux-lvm and/or 
dm-devel mailing lists .. you may be able to find good links and/or 
ideas there, or even start to generate interesting discussions from 
scratch.


So, if this configuration will be used as a research project, I'm 
certainly interested to read the final report. Let us know what works 
and which one sucks.


If it is for a production system to store critical data, better to do 
more research to see what is available in the market (to replace the 
components grouped inside the "iscsi-raid" box in your diagram - it is 
too complicated to isolate issues if problems pop up). There should 
be plenty of options out there (e.g. Netapp has offered iscsi SAN boxes 
with additional features such as failover, data de-duplication, 
backup, performance monitoring, etc). At the same time, it would be nice 
to have a support group to call if things go wrong.


From the GFS side, I learned from previous GFS-GNBD experience that 
serving data over IP networks has its overhead and it is not as cheap 
as people would expect. The issue is further complicated by the newer 
Red Hat cluster infrastructure, which also places a non-trivial amount of 
workload on the TCP/IP stack. So separating these IP traffics 
(cluster HA, data, and/or GFS node access by applications) should be a 
priority to make the whole setup work.


-- Wendy






--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] which journaling file system is used in GFS?

2008-05-14 Thread Wendy Cheng

Ja S wrote:

Hi, All:

>From some online articles, in ext3, there are journal,
ordered, and writeback three types of journaling file
systems. Also in ext3, we can attach  the journaling
file system  to the journal block device located on a
different partition. 
  


GFS *is* a journaling filesystem, same as EXT3. Every journaling 
filesystem has journal(s), which are almost an equivalent of 
database logging. The internal logic of journaling can differ, and 
we call that the journaling "mode".

I have not yet found related information for GFS.

My questions are:

1. Does GFS also support the three types of journaling
file systems? If not, what journaling file system is
used in GFS?
  
So please don't use "journaling file system" to describe the journal. 
Practically, GFS has only one type of journaling (write-back) but it 
supports data journaling through the "gfs_tool setflag" command (see "man 
gfs_tool"). GFS2 has improved this by moving the "setflag" command into a 
mount option (so it is less confusing) and has been designed to use 
three journaling modes (write-back, ordered write, and data journaling, 
with ordered write as its default). It (GFS2), however, doesn't allow 
external journaling devices yet.
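
For GFS(1), a minimal sketch of the setflag usage (paths are placeholders; 
jdata can only be set on a zero-length regular file, and inherit_jdata on a 
directory applies to files created in it afterwards):

shell> gfs_tool setflag jdata /mnt/gfs1/somefile
shell> gfs_tool setflag inherit_jdata /mnt/gfs1/somedir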


I understand that moving the ext3 journal onto an external device and/or 
changing the journaling mode from its default (ordered write) to "writeback" 
can significantly lift its performance. These tricks can *not* be applied to 
GFS.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Locks reported by gfs_tool lockdump does not match that presented in dlm_locks. Any reason??

2008-05-13 Thread Wendy Cheng

Ja S wrote:

Hi, All:

For a given lock space, at the same time, I saved a
copy of the output of “gfs_tool lockdump” as
“gfs_locks” and a copy of dlm_locks. 


Then I checked the locks presents in the two saved
files. I realized that the number of locks in
gfs_locks is not the same as the locks presented in
dlm_locks.

For instance, 
>From dlm_locks:

9980 NL locks, where
--7984 locks are from remote nodes
--0 locks are on remote nodes
--1996 locks are processed on its own master lock
resources
0 CR locks, where
--0 locks are from remote nodes
--0 locks are on remote nodes
--0 locks are processed on its own master lock
resources
0 CW locks, where
--0 locks are from remote nodes
--0 locks are on remote nodes
--0 locks are processed on its own master lock
resources
1173 PR locks, where
--684 locks are from remote nodes
--32 locks are on remote nodes
--457 locks are processed on its own master lock
resources
0 PW locks, where
--0 locks are from remote nodes
--0 locks are on remote nodes
--0 locks are processed on its own master lock
resources
47 EX locks, where
--46 locks are from remote nodes
--0 locks are on remote nodes
--1 locks are processed on its own master lock
resources

In summary, 
11200 locks in total, where

-- 8714 locks are from remote nodes (entries with “
Remote: ”)
-- 32 locks are on remote nodes (entries with “
Master: “)
-- 2454 locks are processed on its own master lock
resources (entries with only lock ID and lock mode)

These locks are all in the granted queue. There is
nothing under the conversion and waiting queues.
==

>From gfs_locks, there are 2932 locks in total, ( grep
‘^Glock ‘ and count the entries). Then for each Glock
I got the second number which is the ID of a lock
resource, and searched the ID in dlm_locks. I then
split the searched results into two groups as shown
below:
--46 locks are associated with local copies of master
lock resources on remote nodes
--2886 locks are associated with master lock resources
on the node itself


==
Now, I tried to find the relationship between the five
numbers from two sources but ended up nowhere.
Dlm_locks:
-- 8714 locks are from remote nodes 
-- 32 locks are on remote nodes

-- 2454 locks are processed on its own master lock
resources 
Gfs_locks:

--46 locks are associated with local copies of master
lock resources on remote nodes
--2886 locks are associated with master lock resources
on the node itself

Can anyone kindly point out the relationships between
the number of locks presented in dlm_locks and
gfs_locks?


Thanks for your time on reading this long question and
look forward to your help.

  
I doubt this will help anything from a practical point of view.. 
understanding how to run OProfile and/or SystemTap will probably help 
you more in the long run. However, if you want to know .. the following 
is why they are different:


GFS locking is controlled by a subsystem called "glock". Glock is 
designed to run and interact with *different* distributed lock managers; 
e.g. in RHEL 3, other than DLM, it also works with another lock manager 
called "GULM". Only active locks have a one-to-one correspondence with 
the lock entities inside the lock manager. If a glock is in the UNLOCK state, 
the lock manager may or may not have the subject lock in its records - they 
are subject to being purged depending on memory and/or resource pressure. 
The other way around is also true. A lock may exist in the lock manager's 
database but it could have been removed from the glock subsystem. Glock 
itself doesn't know about the cluster configuration, so it relies on the 
external lock manager to do inter-node communication. On the other hand, it 
carries some other functions, such as flushing data to disk when a glock is 
demoted from exclusive (write) to shared (read).


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Why GFS is so slow? What it is waiting for?

2008-05-13 Thread Wendy Cheng

Ja S wrote:

Hi, Wendy:

Thanks for your so prompt and kind explanation. It is
very helpful. According to your comments, I did
another test. See below:
 
# stat abc/

  File: `abc/'
  Size: 8192Blocks: 6024   IO Block:
4096   directory
Device: fc00h/64512dInode: 1065226 Links: 2
Access: (0770/drwxrwx---)  Uid: (0/root)  
Gid: (0/root)

Access: 2008-05-08 06:18:58.0 +
Modify: 2008-04-15 03:02:24.0 +
Change: 2008-04-15 07:11:52.0 +

# cd abc/
# time ls | wc -l 
31764


real0m44.797s
user0m0.189s
sys 0m2.276s

The real time in this test is much shorter than the
previous one. However, it is still reasonable long. As
you said, the ‘ls’ command only reads the single
directory file. In my case, the directory file itself
is only 8192 bytes. The time spent on disk IO should
be included in “sys 0m2.276s”. Although DLM needs time
to lookup the location of the corresponding master
lock resource and to process locking, the system
should not take about 42 seconds to complete the “ls”
command. So, what is the hidden issue or is there a
way to identify possible bottlenecks? 

  
IIRC, disk IO wait time is excluded from "sys", so you really can't 
conclude that the lion's share of your wall (real) time is due to DLM locking. 
We don't know for sure unless you can provide the relevant profiling 
data (try to learn how to use OProfile and/or SystemTap to see where 
exactly your system is waiting). Latency issues like this are tricky. 
It would be foolish to conclude anything just by reading the command 
output without knowing the surrounding configuration and/or run time 
environment.


If small-file read latency is important to you, did you turn off the storage 
device's readahead? Did you try different Linux kernel elevator 
algorithms? Did you make sure your other network traffic didn't block 
DLM traffic? Be aware that latency and bandwidth are two different things. A 
big and fat network link doesn't automatically imply a quick response 
time though it may carry more bandwidth.
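
A minimal sketch of the host-side knobs mentioned above (sdc is a placeholder 
for the SAN LUN; the array's own readahead is tuned from its management 
interface, not from here):

shell> cat /sys/block/sdc/queue/scheduler        # current elevator
shell> echo deadline > /sys/block/sdc/queue/scheduler
shell> blockdev --getra /dev/sdc                 # readahead, in 512-byte sectors
shell> blockdev --setra 0 /dev/sdc               # e.g. disable it for a small-file test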


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS, iSCSI, multipaths and RAID

2008-05-13 Thread Wendy Cheng

Michael O'Sullivan wrote:

Hi everyone,

I have set up a small experimental network with a linux cluster and 
SAN that I want to have high data availability. There are 2 servers 
that I have put into a cluster using conga (thank you luci and ricci). 
There are 2 storage devices, each consisting of a basic server with 2 
x 1TB disks. The cluster servers and the storage devices each have 2 
NICs and are connected using 2 gigabit ethernet switches.


It is a little bit hard to figure out the exact configuration based on 
this description (a diagram would help if you can provide one). In general, I 
don't think GFS is tuned well for iscsi; in particular, latency could spike if 
DLM traffic gets mingled with file data traffic, regardless of your network 
bandwidth. However, I don't have enough data to support the speculation. 
It is also very application dependent. One key question is what kind of 
GFS applications you plan to dispatch in this environment?


I see you have a SAN here .. Any reason to choose iscsi over FC ?



I have created a single striped logical volume on each storage device 
using the 2 disks (to try and speed up I/O on the volume). These 
volumes (one on each storage device) are presented to the cluster 
servers using iSCSI (on the cluster servers) and iSCSI target (on the 
storage devices). Since there are multiple NICs on the storage devices 
I have set up two iSCSI portals to each logical volume. I have then 
used mdadm to ensure the volumes are accessible via multipath.


Is the iscsi target function carried out by the storage device 
(firmware), or do you use Linux's iscsi target?


Finally, since I want the storage devices to present the data in a 
highly available way I have used mdadm to create a software raid-5 
across the two multipathed volumes (I realise this is essentially 
mirroring on the 2 storage devices but I am trying to set this up to 
be extensible to extra storage devices). My next step is to present 
the raid array (of the two multipathed volumes - one on each storage 
device) as a GFS to the cluster servers to ensure that locking of 
access to the data is handled properly.


So you're going to have CLVM built on top of software RAID ? .. This 
looks cumbersome. Again, a diagram could help people understand more.


-- Wendy


I have recently read that multipathing is possible within GFS, but 
raid is not (yet). Since I want the two storage devices in a raid-5 
array and I am using iSCSI I'm not sure if I should try and use GFS to 
do the multipathing. Also, being a linux/storage/clustering newbie I'm 
not sure if my approach is the best thing to do. I want to make sure 
that my system has no single point of failure that will make any of 
the data inaccessible. I'm pretty sure our network design supports 
this. I assume (if I configure it right) the cluster will ensure 
services will keep going if one of the cluster servers goes down. Thus 
the only weak point was the storage devices which I hope I have now 
strengthened by essentially implementing network raid across iSCSI and 
then presented as a single GFS.


I would really appreciate comments/advice/constructive criticism as I 
have really been learning much of this as I go.





--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Why GFS is so slow? What it is waiting for?

2008-05-08 Thread Wendy Cheng

Ja S wrote:

Hi, All:

I used to post this question before, but have not
received any comments yet. Please allow me post it
again.

I have a subdirectory containing more than 30,000
small files on a SAN storage (GFS1+DLM, RAID10). No
user application knows the existence of the
subdirectory. In other words, the subdirectory is free
of accessing. 
  
The short answer is to remember that "ls" and "ls -la" are very different 
commands. "ls" is a directory read (it reads from one single file) but 
"ls -la" needs to get file attributes (file size, modification times, 
ownership, etc) from *each* of the files in the subject directory. In 
your case, it needs to read more than 30,000 inodes to get them. The "ls 
-la" is slower for *any* filesystem but particularly troublesome for a 
cluster filesystem such as GFS due to:


1. Cluster locking overheads (it needs read locks for *each* of the 
files involved).
2. How and when these files were created. If there were lock contentions 
at file creation time, GFS has a tendency to 
spread the file locations all over the disk.
3. Do you use iscsi such that dlm lock traffic and file block access are on 
the same fabric?  If so, you will more or less serialize the 
lock access.
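
A minimal sketch to see the difference for yourself (the directory path is a 
placeholder); the first command reads only the directory file, the second 
stats every inode and takes a cluster read lock per file:

shell> cd /mnt/gfs1/bigdir
shell> time ls > /dev/null
shell> time ls -la > /dev/null
shell> strace -c -f ls -la > /dev/null    # counts the per-file lstat() calls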


Hope the above short answer will ease your confusion.

-- Wendy

However, it took ages to list the subdirectory on an
absolute idle cluster node. See below:

# time ls -la | wc -l
31767

real3m5.249s
user0m0.628s
sys 0m5.137s

There are about 3 minutes spent on somewhere. Does
anyone have any clue what the system was waiting for?


Thanks for your time and wish to see your valuable
comments soon.

Jas


  



--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
  


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS lock cache or bug?

2008-05-08 Thread Wendy Cheng

Ja S wrote:

Hi Wendy:

Thank you very much for the kind answer.

Unfortunately, I am using Red Hat Enterprise Linux WS
release 4 (Nahant Update 5) 2.6.9-42.ELsmp.

When I ran gfs_tool gettune /mnt/ABC, I got:
  

[snip] ..



There is no glock_purge option. I will try to tune
demote_secs, but I don't think it will fix 'ls -la'
issue.
  
No, it will not. Don't waste your time. Will try to explain this more 
whenever I get a chance (but not right now).


By the way, could you please kindly direct me to a
place where I can find detailed explanations of these
tunable options?


  


There is one called readme.gfs_tune - in theory, it is in: 
http://people.redhat.com/wcheng/Patches/GFS/readme.gfs_tune.


Just checked a few minutes ago ... my people page seems to have become Bob 
Peterson's people page, but a large amount of my old write-ups and 
unpublished patches are still there. So if you type "wcheng", you will 
probably get "rpeterso" - the contents are mostly the same though. 

There are also a few GFS1/GFS2/NFS patches, as well as the detailed NFS 
over GFS documents, GFS glock write-ups, etc., in the (people page) 
"Patches" and "Project" directories. Feel free to peek and/or try them 
out (but I suspect they'll disappear soon).


On the other hand, if GFS2 is out in time, there is really no point in 
messing around with GFS1 any more - it is old and outdated anyway.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS lock cache or bug?

2008-05-08 Thread Wendy Cheng

Ja S wrote:

Hi, All:
  


I have an old write-up about GFS lock cache issues. The open-sharedroot 
folks have pulled it into their web site:

http://open-sharedroot.org/Members/marc/blog/blog-on-gfs/glock-trimming-patch/?searchterm=gfs

It should clear up some of your confusion. The tunables described in 
that write-up are formally included in RHEL 5.1 and RHEL 4.6 right now 
(so there is no need to ask for private patches).
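
If you want to double-check that your kernel carries them, something like 
this (assuming the filesystem is mounted at /mnt/gfs) should list both 
tunables:

  gfs_tool gettune /mnt/gfs | grep -E 'glock_purge|demote_secs'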


There is a long story behind GFS(1)'s "ls -la" problem - at one time I 
did plan to do something about it. Unfortunately I have a new job 
now, so the better bet is probably going with GFS2.


Will pass some thoughts about GFS1's "ls -la" when I have some spare 
time next week.


-- Wendy


I used to 'ls -la' a subdirectory, which contains more
than 30,000 small files, on a SAN storage a long time
ago, just once, from Node 5, which sits in the cluster
but does nothing. In other words, Node 5 is an idle
node.

Now when I looked at /proc/cluster/dlm_locks on the
node, I realised that there are many PR locks and the
number of PR locks is pretty much the same as the
number of files in the subdirectory I used to list. 


Then I randomly picked up some lock resources and
converted the second part (hex number) of the name of
the lock resources to decimal numbers, which are
simply the inode numbers. Then I searched the
subdirectory and confirmed that these inode numbers
match the files in the subdirectory.
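
For example, the check can be done roughly like this (the mount point and
hex value below are just placeholders):

  # the second field of the resource name is a hex block/inode number
  printf '%d\n' 0xcb5d35          # -> 13327669 (decimal)

  # confirm which file on the GFS mount owns that inode
  find /mnt/gfs -inum 13327669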


Now, my questions are:

1) how can I find out which unix command requires what
kind of locks? Does the ls command really need a PR
lock? 


2) how long GFS caches the locks?

3) whether we can configure the caching period?

4) if GFS should not cache the lock for so many days,
then does it mean this is a bug?

5) Is there a way to find out which process requires a
particular lock? Below is a typical record in
dlm_locks on Node 5. Is any piece of information
useful for identifying the process? 

Resource d95d2ccc (parent ). Name (len=24) "  
5  cb5d35"

Local Copy, Master is node 1
Granted Queue
137203da PR Master: 73980279
Conversion Queue
Waiting Queue


6) If I am sure that no processes or applications are
accessing the subdirectory, then how can I force GFS
to release these PR locks so that DLM can release the
corresponding lock resources as well?


Thank you very much for reading the questions and look
forward to hearing from you.

Jas


  



--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
  


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] dlm and IO speed problem

2008-04-11 Thread Wendy Cheng

Kadlecsik Jozsef wrote:

On Thu, 10 Apr 2008, Kadlecsik Jozsef wrote:

  
But this is a good clue to what might bite us most! Our GFS cluster is an 
almost mail-only cluster for users with Maildir. When the users experience 
temporary hangups for several seconds (even when writing a new mail), it 
might be due to the concurrent scanning for new mail on one node by the 
MUA and the delivery to the Maildir on another node by the MTA.



I personally don't know much about mail servers. But if anyone can 
explain more about what these two processes (?) do - say, how that 
"MTA" delivers its mail (by the "rename" system call?) and/or how mails are 
moved from which node to where - we may have a better chance to figure 
this puzzle out.


Note that "rename" system call is normally very expensive. Minimum 4 
exclusive locks are required (two directory locks, one file lock for 
unlink, one file lock for link), plus resource group lock if block 
allocation is required. There are numerous chances for deadlocks if not 
handled carefully. The issue is further worsen by the way GFS1 does its 
lock ordering - it obtains multiple locks based on lock name order. Most 
of the locknames are taken from inode number so their sequence always 
quite random. As soon as lock contention occurs, lock requests will be 
serialized to avoid deadlocks. So this may be a cause for these spikes 
where "rename"(s) are struggling to get lock order straight. But I don't 
know for sure unless someone explains how email server does its things. 
BTW, GFS2 has relaxed this lock order issue so it should work better.
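
If I understand Maildir correctly, a delivery is roughly the sequence below 
(just a sketch - real MTAs use unique names and fsync, so please correct me 
if your MTA does something different; the point is that the final step is a 
rename(2) into new/, i.e. exactly the expensive operation described above):

  MDIR=/gfs/home/user/Maildir                      # placeholder path
  cp /tmp/incoming.msg "$MDIR/tmp/12345.msg"       # write safely under tmp/
  mv "$MDIR/tmp/12345.msg" "$MDIR/new/12345.msg"   # same-fs mv == rename(2)

  # meanwhile the MUA, possibly on another node, scans and stats new/
  ls -l "$MDIR/new"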


I'm off on a trip (away from the internet) but I'm interested to know this 
story... Maybe by the time I get back on my laptop, someone will have figured 
this out. But please do share the story :) ...


-- Wendy

What is really strange (and disturbing) is that such "hangups" can take 
10-20 seconds, which is just too much for the users.



Yesterday we started to monitor the number of locks/held locks on two of 
the machines. The results from the first day can be found at 
http://www.kfki.hu/~kadlec/gfs/.


It looks as if Maildir is definitely the wrong choice for GFS and we should 
consider converting to mailbox format: at least I cannot explain the 
spikes any other way.
 
  
In order to look at the possible tuning options and the side effects, I 
list what I have learned so far:


- Increasing glock_purge (percent, default 0) helps to trim back the 
  unused glocks by gfs_scand itself. Otherwise glocks can accumulate and 
  gfs_scand eats more and more time at scanning the larger and 
  larger table of glocks.
- gfs_scand wakes up every scand_secs (default 5s) to scan the glocks,  
  looking for work to do. By increasing scand_secs one can lessen the load 
  produced by gfs_scand, but it'll hurt because flushing data can be 
  delayed.

- Decreasing demote_secs (seconds, default 300) helps to flush cached data
  more often by moving write locks into less restricted states. Flushing 
  often helps to avoid burstiness *and* helps other nodes' lock access.
  The question is, what are the side effects of small demote_secs values?
  (There is probably not much point in choosing a demote_secs value smaller
  than scand_secs.)

Currently we are running with 'glock_purge = 20' and 'demote_secs = 30'.
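
(These are applied per mount with gfs_tool, e.g., assuming the filesystem is
mounted at /gfs:

  gfs_tool settune /gfs glock_purge 20
  gfs_tool settune /gfs demote_secs 30

The settune values are not persistent, so they have to be re-applied after
every mount.)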



Best regards,
Jozsef
--
E-mail : [EMAIL PROTECTED], [EMAIL PROTECTED]
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
 H-1525 Budapest 114, POB. 49, Hungary

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
  



--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] dlm and IO speed problem

2008-04-11 Thread Wendy Cheng

christopher barry wrote:

On Tue, 2008-04-08 at 09:37 -0500, Wendy Cheng wrote:
  

[EMAIL PROTECTED] wrote:

  

my setup:
6 rh4.5 nodes, gfs1 v6.1, behind redundant LVS directors. I know it's
not new stuff, but corporate standards dictated the rev of rhat.


[...]
  

I'm noticing huge differences in compile times - or any home file access
really - when doing stuff in the same home directory on the gfs on
different nodes. For instance, the same compile on one node is ~12
minutes - on another it's 18 minutes or more (not running concurrently).
I'm also seeing weird random pauses in writes, like saving a file in vi,
what would normally take less than a second, may take up to 10 seconds.



Anyway, thought I would re-connect with you all and let you know how this
worked out. We ended up scrapping gfs. Not because it's not a great fs,
but because I was using it in a way that was playing to its weak
points. I had a lot of time and energy invested in it, and it was hard
to let it go. Turns out that connecting to the NetApp filer via nfs is
faster for this workload. I couldn't believe it either, as my bonnie and
dd type tests showed gfs to be faster. But for the use case of large
sets of very small files, and lots of stats going on, gfs simply cannot
compete with NetApp's nfs implementation. GFS is an excellent fs, and it
has its place in the landscape - but for a development build system,
the NetApp is simply phenomenal.
  


Assuming you run both configurations (nfs-wafl vs. gfs-san) on the very 
same netapp box (?) ...


Both configurations have their pros and cons. The wafl-nfs setup runs in 
native mode, which certainly has its advantages - you've made a good 
choice - but the latter (gfs-on-netapp san) can work well in other 
situations. The biggest problem with your original configuration is the 
load balancer. Round-robin (and its variants) scheduling will not 
work well if you have a write-intensive workload that needs to fight for 
locks between multiple GFS nodes. IIRC, there are gfs customers running 
build-compile development environments. They normally assign groups of 
users to different GFS nodes, say user ids starting with a-e on node 1, 
f-j on node 2, etc.


One piece of encouraging news from this email is that gfs-netapp-san runs 
well on bonnie. GFS1 has been struggling with bonnie (large numbers of smaller 
files within one single node) for a very long time. One of the reasons 
is that its block allocation tends to get spread across the disk whenever 
there is resource group contention. It is very difficult for the linux IO 
scheduler to merge these blocks within one single server. When the 
workload becomes IO-bound, the locks are subsequently stalled and 
everything starts to snowball after that. Netapp SAN has one more layer 
of block allocation indirection within its firmware and its write speed 
is "phenomenal" (I'm borrowing your words ;) ), mostly thanks to the 
NVRAM where it can aggressively cache write data - this helps GFS 
relieve its small-file issue quite well.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] dlm and IO speed problem

2008-04-09 Thread Wendy Cheng

Kadlecsik Jozsef wrote:

On Wed, 9 Apr 2008, Wendy Cheng wrote:

  

I have been responding to this email off the top of my head, based on folks'
descriptions. Please be aware that these are just rough thoughts and the
responses may not fit general cases. The above is mostly for the original
problem description where:

1. The system is designated for build-compile - my take is that there are many
temporary and deleted files.
2. The gfs_inode tunable was changed (to 30, instead of the default, 15).



I'll take it into account when experimenting with the different settings.

  

Isn't GFS_GL_HASH_SIZE too small for a large number of glocks? Being too
small, it results not only in long linked lists; clashing at the same
bucket will also block otherwise parallel operations. Wouldn't it help
to increase it from 8k to 65k?


Worth a try.
  

Now I remember  we did experiment with different hash sizes when this
latency issue was first reported two years ago. It didn't make much
difference. The cache flushing, on the other hand, was more significant.



What led me to suspect clashing in the hash (or some other lock-creating 
issue) was the simple test I made on our five node cluster: on one node I 
ran


find /gfs -type f -exec cat {} > /dev/null \;

and on another one just started an editor, naming a non-existent file.
It took multiple seconds while the editor "opened" the file. What else 
than creating the lock could delay the process so long?
  


Not knowing how "find" is implemented, I would guess this is caused by 
directory locks. Creating a file needs a directory lock. Your exclusive 
write lock (file create) can't be granted until the "find" releases the 
directory lock. It doesn't look like a lock query performance issue to me.


  

However, the issues involved here are more than lock searching time. It also
has to do with cache flushing. GFS currently accumulates too much dirty
cache. When it starts to flush, it will pause the system for too long.
Glock trimming helps - since the cache flush is part of the glock releasing
operation.
  


But 'flushing when releasing a glock' looks like a side effect. I mean, isn't 
there a more direct way to control the flushing?

I can easily be totally wrong, but on the one hand, it's good to keep as 
many locks cached as possible, because lock creation is expensive. But on 
the other hand, trimming locks triggers flushing, which helps to keep the 
systems running more smoothly. So a tunable to control flushing directly 
would be better than just trimming the locks, wouldn't it? 


To make a long story short, I submitted a direct cache flush patch first, 
instead of this final version of the lock trimming patch. Unfortunately, it 
was *rejected*.


-- Wendy


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] dlm and IO speed problem

2008-04-09 Thread Wendy Cheng

Wendy Cheng wrote:

Kadlecsik Jozsef wrote:


What is glock_inode? Does it exist or something equivalent in 
cluster-2.01.00?
  


Sorry, typo. What I meant is "inoded_secs" (the gfs inode daemon wake-up 
time). This is the daemon that reclaims deleted inodes. Don't set it 
too small though.


I have been responding to this email off the top of my head, based on folks' 
descriptions. Please be aware that these are just rough thoughts and the 
responses may not fit general cases. The above is mostly for the 
original problem description where:


1. The system is designated for build-compile - my take is that there 
are many temporary and deleted files.

2. The gfs_inode tunable was changed (to 30, instead of the default, 15).



 
Isn't GFS_GL_HASH_SIZE too small for a large number of glocks? Being 
too small, it results not only in long linked lists; clashing at the 
same bucket will also block otherwise parallel operations. Wouldn't it 
help to increase it from 8k to 65k?
  


Worth a try.


Now I remember  we did experiment with different hash sizes when 
this latency issue was first reported two years ago. It didn't make much 
difference. The cache flushing, on the other hand, was more significant.


-- Wendy



However, the issues involved here are more than lock searching time. 
It also has to do with cache flushing. GFS currently accumulates too 
much dirty cache. When it starts to flush, it will pause the system 
for too long.  Glock trimming helps - since the cache flush is part of 
the glock releasing operation.








--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] dlm and IO speed problem

2008-04-09 Thread Wendy Cheng

Kadlecsik Jozsef wrote:


What is glock_inode? Does it exist or something equivalent in 
cluster-2.01.00?
  


Sorry, typo. What I meant is "inoded_secs" (the gfs inode daemon wake-up 
time). This is the daemon that reclaims deleted inodes. Don't set it too 
small though.


 
Isn't GFS_GL_HASH_SIZE too small for a large number of glocks? Being too 
small, it results not only in long linked lists; clashing at the same 
bucket will also block otherwise parallel operations. Wouldn't it help 
to increase it from 8k to 65k?
  


Worth a try.

However, the issues involved here are more than lock searching time. It 
also has to do with cache flushing. GFS currently accumulates too much 
dirty cache. When it starts to flush, it will pause the system for too 
long.  Glock trimming helps - since the cache flush is part of the glock 
releasing operation.


-- Wendy



--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] dlm and IO speed problem

2008-04-08 Thread Wendy Cheng

[EMAIL PROTECTED] wrote:




my setup:
6 rh4.5 nodes, gfs1 v6.1, behind redundant LVS directors. I know it's
not new stuff, but corporate standards dictated the rev of rhat.

[...]

I'm noticing huge differences in compile times - or any home file access
really - when doing stuff in the same home directory on the gfs on
different nodes. For instance, the same compile on one node is ~12
minutes - on another it's 18 minutes or more (not running concurrently).
I'm also seeing weird random pauses in writes, like saving a file in vi,
what would normally take less than a second, may take up to 10 seconds.

* From reading, I see that the first node to access a directory will be
the lock master for that directory. How long is that node the master? If
the user is no longer 'on' that node, is it still the master? If
continued accesses are remote, will the master state migrate to the node
that is primarily accessing it? I've set LVS persistence for ssh and
telnet for 5 minutes, to allow multiple xterms fired up in a script to
land on the same node, but new ones later will land on a different node
- by design really. Do I need to make this persistence way longer to
keep people only on the first node they hit? That kind of horks my load
balancing design if so. How can I see which node is master for which
directories? Is there a table I can read somehow?

* I've bumped the wake times for gfs_scand and gfs_inoded to 30 secs, I
mount noatime,noquota,nodiratime, and David Teigland recommended I set
dlm_dropcount to '0' today on irc, which I did, and I see an improvement
in speed on the node that appears to be master for say 'find' command
runs on the second and subsequent runs of the command if I restart them
immediately, but on the other nodes the speed is awful - worse than nfs
would be. On the first run of a find, or If I wait >10 seconds to start
another run after the last run completes, the time to run is
unbelievably slower than the same command on a standalone box with ext3.
e.g. <9 secs on the standalone, compared to 46 secs on the cluster - on
a different node it can take over 2 minutes! Yet an immediate re-run on
the cluster, on what I think must be the master is sub-second. How can I
speed up the first access time, and how can I keep the speed up similar
to immediate subsequent runs. I've got a ton of memory - I just do not
know which knobs to turn.
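
(For concreteness, what I have in place looks roughly like the following -
the device and mount point are examples only, and I've left out the dlm
drop-count knob since its interface seems to vary by release:

  mount -t gfs -o noatime,nodiratime,noquota /dev/sda1 /gfs
  gfs_tool settune /gfs scand_secs 30
  gfs_tool settune /gfs inoded_secs 30
)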


It sounds like bumping up lock trimming might help, but I don't think 
the feature accessibility through /sys has been back-ported to RHEL4, 
so if you're stuck with RHEL4, you may have to rebuild the latest 
versions of the tools and kernel modules from RHEL5, or you're out of 
luck.


The glock trimming patch was mostly written and tuned on top of RHEL 4. It 
doesn't use the /sys interface. The original patch was field-tested on 
several customer production sites. Upon the CVS RHEL 4.5 check-in, it was 
revised to use a less aggressive approach and turned out to be not as 
effective as the original approach. So the original patch was re-checked 
into RHEL 4.6.


I wrote the patch.

-- Wendy




--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] dlm and IO speed problem

2008-04-08 Thread Wendy Cheng
On Mon, Apr 7, 2008 at 9:36 PM, christopher barry <
[EMAIL PROTECTED]> wrote:

> Hi everyone,
>
> I have a couple of questions about the tuning the dlm and gfs that
> hopefully someone can help me with.



There is a lot to say about this configuration... It is not a simple tuning
issue.


>
> my setup:
> 6 rh4.5 nodes, gfs1 v6.1, behind redundant LVS directors. I know it's
> not new stuff, but corporate standards dictated the rev of rhat.



Putting a load balancer in front of a cluster filesystem is tricky to get
right (to say the least). This is particularly true between GFS and LVS,
mostly because LVS is a general-purpose load balancer that is difficult to
tune to work with the existing GFS locking overhead.


The cluster is a developer build cluster, where developers login, and
> are balanced across nodes and edit and compile code. They can access via
> vnc, XDMCP, ssh and telnet, and nodes external to the cluster can mount
> the gfs home via nfs, balanced through the director. Their homes are on
> the gfs, and accessible on all nodes.



Direct login into GFS nodes (via vnc, ssh, telnet, etc) is ok, but nfs client
access in this setup will have locking issues. It is *not* only a
performance issue. It is *also* a functional issue - that is, before the
2.6.19 Linux kernel, NLM locking (used by NFS clients) doesn't get propagated
across clustered NFS servers. You'll have file corruption if different NFS
clients do file locking and expect the locks to be honored across different
clustered NFS servers. In general, people need to think *very* carefully
before putting a load balancer in front of a group of linux NFS servers using
any pre-2.6.19 kernel. It is not going to work if there are multiple clients
that invoke either posix locks and/or flocks on files that are expected to
get accessed across different linux NFS servers on top of *any* cluster
filesystem (not only GFS).


>
>
> I'm noticing huge differences in compile times - or any home file access
> really - when doing stuff in the same home directory on the gfs on
> different nodes. For instance, the same compile on one node is ~12
> minutes - on another it's 18 minutes or more (not running concurrently).
> I'm also seeing weird random pauses in writes, like saving a file in vi,
> what would normally take less than a second, may take up to 10 seconds.
>
> * From reading, I see that the first node to access a directory will be
> the lock master for that directory. How long is that node the master? If
> the user is no longer 'on' that node, is it still the master? If
> continued accesses are remote, will the master state migrate to the node
> that is primarily accessing it?



Cluster locking is expensive. As a result, GFS caches its glocks and there
is a one-to-one correspondence between GFS glocks and DLM locks. Even if a
user is no longer "on" that node, the lock stays on that node unless:

1. some other node requests exclusive access to this lock (file write); or
2. the node has memory pressure that kicks off the linux virtual memory
manager to reclaim idle filesystem structures (inodes, dentries, etc); or
3. abnormal events such as crash, umount, etc.

Check out:
http://open-sharedroot.org/Members/marc/blog/blog-on-gfs/glock-trimming-patch/?searchterm=gfs
for details.


I've set LVS persistence for ssh and
> telnet for 5 minutes, to allow multiple xterms fired up in a script to
> land on the same node, but new ones later will land on a different node
> - by design really. Do I need to make this persistence way longer to
> keep people only on the first node they hit? That kind of horks my load
> balancing design if so. How can I see which node is master for which
> directories? Is there a table I can read somehow?



You did the right thing here (by making the connections persistent). There
is a gfs glock dump command that can print out all the lock info (name,
owner, etc) but I really don't want to recommend it - automating this
process is not trivial and there is no practical way to do it by hand, i.e.
manually.
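
(For completeness, the command is "gfs_tool lockdump" - something like the
sketch below, assuming a mount at /gfs; the dump is large and the format is
not really meant for hand parsing:

  gfs_tool lockdump /gfs > /tmp/glocks.txt
  grep -c '^Glock' /tmp/glocks.txt      # rough count of cached glocks
)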


>
> * I've bumped the wake times for gfs_scand and gfs_inoded to 30 secs, I
> mount noatime,noquota,nodiratime, and David Teigland recommended I set
> dlm_dropcount to '0' today on irc, which I did, and I see an improvement
> in speed on the node that appears to be master for say 'find' command
> runs on the second and subsequent runs of the command if I restart them
> immediately, but on the other nodes the speed is awful - worse than nfs
> would be. On the first run of a find, or If I wait >10 seconds to start
> another run after the last run completes, the time to run is
> unbelievably slower than the same command on a standalone box with ext3.
> e.g. <9 secs on the standalone, compared to 46 secs on the cluster - on
> a different node it can take over 2 minutes! Yet an immediate re-run on
> the cluster, on what I think must be the master is sub-second. How can I
> speed up the first access time, and how can I keep the speed up similar
> to immedi

Re: [Linux-cluster] About GFS1 and I/O barriers.

2008-04-02 Thread Wendy Cheng
On Wed, Apr 2, 2008 at 11:17 AM, Steven Whitehouse <[EMAIL PROTECTED]>
wrote:

>
> Now I agree that it would be nice to support barriers in GFS2, but it
> won't solve any problems relating to ordering of I/O unless all of the
> underlying device supports them too. See also Alasdair's response to the
> thread: http://lkml.org/lkml/2007/5/28/81


I'm not suggesting GFS1/2 should take this patch, considering their current
states. However, you can't give people the impression, as your original reply
implies, that GFS1/2 would not have this problem.


>
> So although I'd like to see barrier support in GFS2, it won't solve any
> problems for most people and really its a device/block layer issue at
> the moment.


This part I agree with ... it is better to attack this issue from the volume
manager than from the filesystem.

-- Wendy
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] About GFS1 and I/O barriers.

2008-04-02 Thread Wendy Cheng
On Wed, Apr 2, 2008 at 5:53 AM, Steven Whitehouse <[EMAIL PROTECTED]>
wrote:

> Hi,
>
> On Mon, 2008-03-31 at 15:16 +0200, Mathieu Avila wrote:
> > Le Mon, 31 Mar 2008 11:54:20 +0100,
> > Steven Whitehouse <[EMAIL PROTECTED]> a écrit :
> >
> > > Hi,
> > >
> >
> > Hi,
> >
> > > Both GFS1 and GFS2 are safe from this problem since neither of them
> > > use barriers. Instead we do a flush at the critical points to ensure
> > > that all data is on disk before proceeding with the next stage.
> > >
> >
> > I don't think this solves the problem.
> >
> > Consider a cheap iSCSI disk (no NVRAM, no UPS) accessed by all my GFS
> > nodes; this disk has a write cache enabled, which means it will reply
> > that write requests are performed even if they are not really written
> > on the platters. The disk (like most disks nowadays) has some logic
> > that allows it to optimize writes by re-scheduling them. It is possible
> > that all writes are ACK'd before the power failure, but only a fraction
> > of them were really performed : some are before the flush, some are
> > after the flush.
> > --Not all blocks writes before the flush were performed but other
> > blocks after the flush are written -> the FS is corrupted.--
> > So, after the power failure all data in the disk's write cache are
> > forgotten. If the journal data was in the disk cache, the journal was
> > not written to disk, but other metadata have been written, so there are
> > metadata inconsistencies.
> >
> I don't agree that write caching implies that I/O must be acked before
> it has hit disk. It might well be reordered (which is ok), but if we
> wait for all outstanding I/O completions, then we ought to be able to be
> sure that all I/O is actually on disk, or at the very least that further
> I/O will not be reordered with already ACKed data. If devices are
> sending ACKs in advance of the I/O hitting disk then I think thats
> broken behaviour.


You seem to assume that when the disk subsystem acks back, the data is surely
on disk. That is not correct. You may consider it broken behavior, mostly
from firmware bugs, but it occurs more often than you would expect. The
problem is extremely difficult to debug from the host side. So I think the
proposal here is about how the filesystem should protect itself from this
situation (though I'm fuzzy about what the actual proposal is without
looking into other subsystems, particularly the volume manager, that are
involved). You cannot say "oh, then I don't have the responsibility. Please
go talk to the disk vendors". Serious implementations have been trying to
find good ways to solve this issue.

-- Wendy

Consider what happens if a device was to send an ACK for a write and
> then it discovers an uncorrectable error during the write - how would it
> then be able to report it since it had already sent an "ok"? So far as I
> can see the only reason for having the drive send an I/O completion back
> is to report the success or otherwise of the operation, and if that
> operation hasn't been completed, then we might just as well not wait for
> ACKs.
>
> > This is the problem that I/O barriers try to solve, by really forcing
> > the block device (and the block layer) to have all blocks issued before
> > the barrier to be written before any other after the barrier starts
> > begin written.
> >
> > The other solution is to completely disable the write cache of the
> > disks, but this leads to dramatically bad performances.
> >
> If its a choice between poor performance thats correct and good
> performance which might lose data, then I know which I would choose :-)
> Not all devices support barriers, so it always has to be an option; ext3
> uses the barrier=1 mount option for this reason, and if it fails (e.g.
> if the underlying device doesn't support barriers) it falls back to the
> same technique which we are using in gfs1/2.
>
> The other thing to bear in mind is that barriers, as currently
> implemented are not really that great either. It would be nice to
> replace them with something that allows better performance with (for
> example) mirrors where the only current method of implementing the
> barrier is to wait for all the I/O completions from all the disks in the
> mirror set (and thus we are back to waiting for outstanding I/O again).
>
> Steve.
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

Re: [Linux-cluster] About GFS1 and I/O barriers.

2008-03-31 Thread Wendy Cheng

Mathieu Avila wrote:

Le Mon, 31 Mar 2008 11:54:20 +0100,
Steven Whitehouse <[EMAIL PROTECTED]> a écrit :

  

Hi,




Hi,

  

Both GFS1 and GFS2 are safe from this problem since neither of them
use barriers. Instead we do a flush at the critical points to ensure
that all data is on disk before proceeding with the next stage.




I don't think this solves the problem.
  


I agree. Maybe this is one of the root causes of the customer-site 
corruption reports we've seen in the past but were never able to figure 
out and/or duplicate.


However, without fully understanding how the Linux IO layer, block device, 
and/or volume manager handle this issue, it is difficult to comment on your 
patch. It is not safe to assume that if it works on ext3, then it will work 
on gfs1/2.


Don't rush - give people some time to think about this problem. Or use a 
Netapp SAN, which has NVRAM and embedded logic to handle this :) ...


-- Wendy


Consider a cheap iSCSI disk (no NVRAM, no UPS) accessed by all my GFS
nodes; this disk has a write cache enabled, which means it will reply
that write requests are performed even if they are not really written
on the platters. The disk (like most disks nowadays) has some logic
that allows it to optimize writes by re-scheduling them. It is possible
that all writes are ACK'd before the power failure, but only a fraction
of them were really performed : some are before the flush, some are
after the flush. 
--Not all blocks writes before the flush were performed but other

blocks after the flush are written -> the FS is corrupted.--
So, after the power failure all data in the disk's write cache are
forgotten. If the journal data was in the disk cache, the journal was
not written to disk, but other metadata have been written, so there are
metadata inconsistencies.

This is the problem that I/O barriers try to solve, by really forcing
the block device (and the block layer) to have all blocks issued before
the barrier to be written before any other after the barrier starts
begin written.

The other solution is to completely disable the write cache of the
disks, but this leads to dramatically bad performances.
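
(Disabling the write cache is normally done at the device level - something
like the commands below, where the right tool depends on the transport and
/dev/sdX is a placeholder:

  hdparm -W0 /dev/sdX                  # ATA/SATA disks
  sdparm --clear=WCE --save /dev/sdX   # SCSI/iSCSI LUNs that honor mode pages
)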


  

Using barriers can improve performance in certain cases, but we've not
yet implemented them in GFS2,

Steve.

On Mon, 2008-03-31 at 12:46 +0200, Mathieu Avila wrote:


Hello all again,

More information on this topic:
http://lkml.org/lkml/2007/5/25/71

I guess the problem also applies to GFSS2.

--
Mathieu

Le Fri, 28 Mar 2008 15:34:58 +0100,
Mathieu Avila <[EMAIL PROTECTED]> a écrit :

  

Hello GFS team,

Some recent kernel developements have brought IO barriers into the
kernel to prevent corruptions that could happen when blocks are
being reordered before write, by the kernel or the block device
itself, just before an electrical power failure.
(on high-end block devices with UPS or NVRAM, those problems
cannot happen)
Some file systems implement them, notably ext3 and XFS. It seems
to me that GFS1 has no such thing.

Do you plan to implement it ? If so, could the attached patch do
the work ? It's incomplete : it would need a global tuning like
fast_stafs, and a mount option like it's done for ext3. The code
is mainly a copy-paste from JBD, and does a barrier only for
journal meta-data. (should i do it for other meta-data ?)

Thanks,

--
Mathieu



--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
  

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
  



--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Unformatting a GFS cluster disk

2008-03-30 Thread Wendy Cheng

Wendy Cheng wrote:


The problem can certainly be helped by the snapshot functions embedded 
in Netapp SAN box. However, if tape (done from linux host ?) is 
preferred as you described due to space consideration, you may want to 
take a (filer) snapshot instance and do a (filer) "lun clone" to it. 
It is then followed by a gfs mount as a separate gfs filesystem (this 
is more involved than people would expect, more on this later). After 
that, the tape backup can take place without interfering with the 
original gfs filesystem on the linux host. On the filer end, 
copy-on-write will fork disk blocks as soon as new write requests come 
in, with and without the tape backup activities.
The above is for doing the tape backup from the Linux end. I think you can 
also do the backup directly from the filer - check out:

http://now.netapp.com/NOW/knowledge/docs/ontap/rel724/pdfs/ontap/tapebkup.pdf
(Data ONTAP 7.2 Data Protection Tape Backup and Recovery Guide)

I never really tried that out myself though.

-- Wendy






--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Unformatting a GFS cluster disk

2008-03-30 Thread Wendy Cheng

christopher barry wrote:

On Fri, 2008-03-28 at 07:42 -0700, Lombard, David N wrote:

A fun feature is that the multiple snapshots of a file have the identical
inode value



  


I can't fault this statement but would prefer to think of snapshots as 
different trees with their own root inodes (check "Figure 3" of
"Network Appliance: File System Design for an NFS File Server Appliance" 
- the pdf paper can be downloaded from
http://en.wikipedia.org/wiki/Write_Anywhere_File_Layout; scroll down to 
the bottom of the page).


In the SAN environment, I also like to think of multiple snapshots as 
different trees that may share the same disk blocks for faster backup (no 
write) and less disk space consumption, but each with its own root 
inode. At recovery time, the (different) trees can be exported and 
seen by the linux host as different lun(s). The internal details could be 
quite tedious and I'm not in the position to describe them here.



So, I'm trying to understand what to takeaway from this thread:
* I should not use them?
* I can use them, but having multiple snapshots introduces a risk that a
snap-restore could wipe files completely by potentially putting a
deleted file on top of a new file?
  


Isn't that a "restore" is supposed to do ? Knowing this caveat without 
being told, you don't look like an admin who will make this mistake .. 


* I should use them - but not use multiples.
* something completely different ;)

Our primary goal here is to use snapshots to enable us to backup to tape
from the snapshot over FC - and not have to pull a massive amount of
data over GbE nfs through our NAT director from one of our cluster nodes
to put it on tape. We have thought about a dedicated GbE backup network,
but would rather use the 4Gb FC fabric we've got.

  
Check the Netapp NOW web site (http://now.netapp.com - accessible by its 
customers) to see whether other folks have good tips about this. I just 
did a quick search and found a document titled "Linux Snapshot Records 
and LUN Resizing in a SAN Environment". It is a little bit out of date 
(dated 1/27/2003, with RHEL 2.1) but still very usable in an ext3 
environment.


In general, GFS backup from the Linux side during run time has been a pain, 
mostly because of its slowness: the process has to walk through the 
whole filesystem and read every single file, which ends up accumulating a 
non-trivial amount of cached glocks and memory. For a sizable filesystem 
(say in the TB range like yours), past experience has shown that after 
backup(s), the filesystem latency can go up to an unacceptable level 
unless its glocks are trimmed. There is a tunable specifically written 
for this purpose (glock_purge - introduced via RHEL 4.5) though.


The problem can certainly be helped by the snapshot functions embedded 
in Netapp SAN box. However, if tape (done from linux host ?) is 
preferred as you described due to space consideration, you may want to 
take a (filer) snapshot instance and do a (filer) "lun clone" to it. It 
is then followed by a gfs mount as a separate gfs filesystem (this is 
more involved than people would expect, more on this later). After that, 
the tape backup can take place without interfering with the original gfs 
filesystem on the linux host. On the filer end, copy-on-write will fork 
disk blocks as soon as new write requests come in, with and without the 
tape backup activities.


The thinking here is to leverage the embedded Netapp copy-on-write 
feature to speed up the backup process with reasonable disk space 
requirement. The snapshot volume and the cloned lun shouldn't take much 
disk space and we can turn on gfs readahead and glock_purge tunables 
with minimum interruptions to the original gfs volume. The caveat here 
is GFS-mounting the cloned lun - for one, gfs itself at this moment 
doesn't allow mounting of multiple devices that have the same filesystem 
identifiers (the -t value you use during mkfs time e.g. 
"cluster-name:filesystem-name") on the same node - but it can be fixed 
(by rewriting the filesystem ID and lock protocol - I will start to test 
out the described backup script and a gfs kernel patch next week). Also, 
as with any tape backup from a linux host, you should not expect an image of 
a gfs-mountable device (when retrieving from tape) - it is basically a 
collection of all the files residing on the gfs filesystem when the backup 
events take place.
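
The filer-side half of that idea would look roughly like this on an ONTAP
7.x console (volume, lun and igroup names are made up, and the syntax should
be double-checked against your Data ONTAP release):

  filer> snap create gfsvol backup_snap
  filer> lun clone create /vol/gfsvol/backup_lun -b /vol/gfsvol/gfs_lun backup_snap
  filer> lun map /vol/gfsvol/backup_lun backup_igroup

The host-side gfs mount of the cloned lun still needs the filesystem ID /
lock protocol rewrite mentioned above.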


Will the above serve your need ? Maybe other folks have (other) better 
ideas ?


BTW, the described procedure is not well tested out yet, and more 
importantly, any statement in this email does not represent my 
ex-employer, nor my current employer's official recommendations.


-- Wendy







--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Unformatting a GFS cluster disk

2008-03-29 Thread Wendy Cheng

Lombard, David N wrote:

On Fri, Mar 28, 2008 at 04:54:22PM -0500, Wendy Cheng wrote:
  

christopher barry wrote:


On Fri, 2008-03-28 at 07:42 -0700, Lombard, David N wrote:
 
  

On Thu, Mar 27, 2008 at 03:26:55PM -0400, christopher barry wrote:
   


On Wed, 2008-03-26 at 13:58 -0700, Lombard, David N wrote:
  
A fun feature is that the multiple snapshots of a file have the 
identical

inode value

Wait ! First, the "multiple snapshots sharing one inode" 
interpretation about WAFL is not correct.



Same inode value.  I've experienced this multiple times, and, as
I noted, is a consequence of copy-on-write.
  


Yes, you're correct, if you use the Netapp filer as a NAS server via the 
NFS/CIFS protocols. However, if you use the Netapp filer as a block device 
(SAN), where the disk resources are presented to the linux host as LUNs that 
host GFS filesystem(s), then how WAFL handles its inodes is not relevant to 
this discussion, since all the user sees are GFS files (and gfs inodes).


I apologize for my terse sentence though.

-- Wendy

I've also had to help other people understand why various utilities
didn't work as expected, like gnu diff, which immediately reported
identical files as soon as it saw the identical values for st_dev
and st_ino in the two files it was asked to compare.

>From the current diffutils (2.8.1) source:

  /* Do struct stat *S, *T describe the same file?  Answer -1 if unknown.  */
  #ifndef same_file
  # define same_file(s, t) \
      ((((s)->st_ino == (t)->st_ino) && ((s)->st_dev == (t)->st_dev)) \
       || same_special_file (s, t))
  #endif
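
(You can see the same thing from a shell with GNU stat, e.g. comparing a
live file with its snapshot copy over NFS - the .snapshot path below is just
the usual NetApp convention:

  stat -c 'dev=%d ino=%i  %n' somefile .snapshot/hourly.0/somefile
)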

  
   Second, there are plenty of 
documents talking about how to do snapshots with Linux filesystems 
(e.g. ext3) on the Netapp NOW web site where its customers can get 
access.



I didn't say snapshots don't work on Linux.  I've used NetApp on
Linux and directly benefitted from snapshots.

  



--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Unformatting a GFS cluster disk

2008-03-28 Thread Wendy Cheng

christopher barry wrote:

On Fri, 2008-03-28 at 07:42 -0700, Lombard, David N wrote:
  

On Thu, Mar 27, 2008 at 03:26:55PM -0400, christopher barry wrote:


On Wed, 2008-03-26 at 13:58 -0700, Lombard, David N wrote:
...
  

Can you point me at any docs that describe how best to implement snaps
against a gfs lun?
  

FYI, the NetApp "snapshot" capability is a result of their "WAFL" filesystem
.  Basically, they use a
copy-on-write mechanism that naturally maintains older versions of disk blocks.

A fun feature is that the multiple snapshots of a file have the identical
inode value



fun as in 'May you live to see interesting times' kinda fun? Or really
fun?
  

The former.  POSIX says that two files with the identical st_dev and
st_ino must be the *identical* file, e.g., hard links.  On a snapshot,
they could be two *versions* of a file with completely different
contents.  Google suggests that this contradiction also exists
elsewhere, such as with the virtual FS provided by ClearCase's VOB.




So, I'm trying to understand what to takeaway from this thread:
* I should not use them?
* I can use them, but having multiple snapshots introduces a risk that a
snap-restore could wipe files completely by potentially putting a
deleted file on top of a new file?
* I should use them - but not use multiples.
* something completely different ;)
  


Wait ! First, the "multiple snapshots sharing one inode" interpretation 
about WAFL is not correct.  Second, there are plenty documents talking 
about how to do snapshots with Linux filesystems (e.g. ext3) on Netapp 
NOW web site where its customers can get accesses. Third, doing snapshot 
on GFS is easier than ext3 (since ext3 journal can be on different volume).


Will do a draft write-up as soon as I'm off my current task (sometime 
over this weekend).


-- Wendy

Our primary goal here is to use snapshots to enable us to backup to tape
from the snapshot over FC - and not have to pull a massive amount of
data over GbE nfs through our NAT director from one of our cluster nodes
to put it on tape. We have thought about a dedicated GbE backup network,
but would rather use the 4Gb FC fabric we've got.

If anyone can recommend a better way to accomplish that, I would love to
hear about how other people are backing up large-ish (1TB) GFS
filesystems to tape.

Regards,
-C

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
  



--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Unformatting a GFS cluster disk

2008-03-26 Thread Wendy Cheng

chris barry wrote:

On Wed, 2008-03-26 at 10:41 -0500, Wendy Cheng wrote:
  

[EMAIL PROTECTED] wrote:

..The disk was previously a GFS disk and we reformatted it with 
exactly the same mkfs command both times. Here are more details. We 
are running the cluster on a Netapp SAN device.
  
The Netapp SAN device has embedded snapshot features (and that has been the 
main reason for choosing NetApp SAN devices for most customers). 
It can restore your previous filesystem easily (just a few commands away - 
go to the console, do a "snap list", find the volume that hosts the lun 
used for gfs, then do a "snap restore"). This gfs_edit approach (searching 
through the whole device block by block) is really a brute-force way 
to do the restore. Unless you don't have a "snap restore" license?



Wendy,

We too are using a NetApp. There was talk amongst our IT group that
these snaps would not work against a raw lun.
  


Have you (or your IT group) talked to the Netapp NGS folks? I personally 
don't know whether there is a specific document talking about the interaction 
between GFS and Netapp SAN. However, remember that Netapp SAN does its 
snapshots on its volume(s). Now assume you export /vol/vol1/lun1 via FCP 
or iSCSI and it is seen by the Linux box as /dev/sda; then the volume 
snapshot list can be found via the "snap list" command, if you do have the 
proper licenses in place.


-- Wendy


Can you point me at any docs that describe how best to implement snaps
against a gfs lun?



Regards,
-C

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
  



--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Unformatting a GFS cluster disk

2008-03-26 Thread Wendy Cheng

[EMAIL PROTECTED] wrote:



..The disk was previously a GFS disk and we reformatted it with 
exactly the same mkfs command both times. Here are more details. We 
are running the cluster on a Netapp SAN device.


The Netapp SAN device has embedded snapshot features (and that has been the 
main reason for choosing NetApp SAN devices for most customers). 
It can restore your previous filesystem easily (just a few commands away - 
go to the console, do a "snap list", find the volume that hosts the lun 
used for gfs, then do a "snap restore"). This gfs_edit approach (searching 
through the whole device block by block) is really a brute-force way 
to do the restore. Unless you don't have a "snap restore" license?
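
Roughly, the console sequence is the following (volume and snapshot names
are examples; check the exact syntax for your Data ONTAP release, and
remember a volume-level restore rolls back everything on that volume):

  filer> snap list gfsvol
  filer> snap restore -t vol -s nightly.0 gfsvol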


-- Wendy


1) mkfs.gfs -J 1024 -j 4 -p lock_gulm -t aicluster:cmsgfs /dev/sda   
[100Gb device]

2) Copy lots of files to the disk
3) gfs_grow /san   [Extra 50Gb extension added to device]
4) Copy lots of files to the disk
5) mkfs.gfs -J 1024 -j 4 -p lock_gulm -t aicluster:cmsgfs /dev/sda

I have now read about resource groups and the GFS ondisk structure here..
  http://www.redhat.com/archives/cluster-devel/2006-August/msg00324.html

A couple more questions if you don't mind...

What exactly would the mkfs command have done? Would the mkfs command 
have overwritten the resource group headers from the previous disk 
structure? Or does it just wipe the superblock and journals?


If the resource group headers still exist shouldn't they have a 
characteristic structure we could identify enabling us to put 0xFF in 
only the correct places on disk?


Also, is there any way we can usefully depend on this information? Or 
would mkfs have wiped these special inodes too?


+ * A few special hidden inodes are contained in a GFS filesystem. 
 They do

+ * not appear in any directories; instead, the superblock points to them
+ * using block numbers for their location.  The special inodes are:
+ *
+ *   Root inode:  Root directory of the filesystem
+ *   Resource Group Index:  A file containing block numbers and sizes 
of all RGs
+ *   Journal Index:  A file containing block numbers and sizes of all 
journals

+ *   Quota:  A file containing all quota information for the filesystem
+ *   License:  A file containing license information

In particular there is one 11Gb complete backup tar.gz on the disk 
somewhere. I'm wondering if we could write some custom utility that 
recognizes the gfs on-disk structure and extracts very large files 
from it?


Damon.
Working to protect human rights worldwide




--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] shared ext3

2008-01-18 Thread Wendy Cheng

Brad Filipek wrote:


I know ext3 is not "cluster aware", but what if I had a SAN with an 
ext3 partition on it and one node connected to it? If I were to unmount 
the partition, physically disconnect the server from the SAN, connect 
another server to the SAN, and then mount the ext3 partition, would 
there be any issues? I am not looking to access the partition 
simultaneously, just one at a time. I am asking in case the server 
connected to the SAN dies and I need to access the data on this ext3 
volume from another server. Will it work?




Yes, that should work quite well. Actually, that's how people use ext3 
in a cluster environment.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Behavior of "statfs_fast" settune

2008-01-16 Thread Wendy Cheng




For GFS1, we can't change the disk layout, so we borrow the "license" file 
that happens to be an unused on-disk GFS1 file. There is only one per 
file system, compared to GFS2, which uses N+1 files (N being the number of 
nodes in the cluster) to handle the "df" statistics. Every node keeps 
its changes in a memory buffer and syncs its local changes to the master 
(license) file every 30 seconds. Upon an unclean shutdown (or crash), the 
local changes in the memory buffer will be lost. To re-sync the statistics, 
we need to run a real "df" (that scans the on-disk RGRP structures) to 
correct them. For details, check out one of my old write-ups in:


http://people.redhat.com/wcheng/Patches/GFS/readme.gfs_fast_statfs.R4
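
Enabling and checking the fast path looks like this (assuming the filesystem
is mounted at /gfs on a kernel that carries the patch; the settune has to be
re-run after every mount):

  gfs_tool settune /gfs statfs_fast 1
  time df -h /gfs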



BTW, I'm not very happy with this implementation and there are a few other 
ideas on the table. However, since GFS2 is imminent (I hope), we will keep 
the code as it is today. 


-- Wendy


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Behavior of "statfs_fast" settune

2008-01-16 Thread Wendy Cheng

Mathieu Avila wrote:




I am in the process of evaluating the performance gain of
the "statfs_fast" patch.
Once the FS is mounted, I perform "gfs_tool settune " and then
i measure the time to perform "df" on a partially filled FS. The
time is almost the same, "df" returns almost instantly, with a
value really near the truth, and progressively reaching the true
one.

But I have noticed that when the FS size increases, the time to
perform "gfs_tool settune " increases dramatically. In fact,
after a few measures, it appears that the time to perform "df"
without fuzzy statfs is the same as the time to activate fuzzy
statfs. 

In theory, this shouldn't happen. Are you on RHEL 4 or RHEL 5 ? And 
what is the FS size that causes this problem ?


  

I just did a quick try. It doesn't happen to me. By reading your
note, were you *repeatedly* issuing "gfs_tool settune .." then
followed by "df" ? Remember the "settune" is expected to be run
*once* right after the particular GFS filesystem is mounted. You
certainly *can* run it multiple times. It won't hurt anything.
However, each time the "settune" is invoked, the code has to perform
a regular "df" (i.e. that's the way it initializes itself). I suspect
this is the cause of your issue. Let me know either way.




I am using "cluster-1.03" with the statfs_fast patch from:
http://www.redhat.com/archives/cluster-devel/2007-March/msg00124.html
(has this been changed after ?)
All this on a Centos 5.

My use case is :
 * mkfs of a volume
 * mount on all 6 nodes 
 * timing of "settune statfs_fast 1", on all 6 nodes. 
 * timing of "df" on one node.

All commands are executed immediately one after the other.

So I issued only one "settune", on all nodes, and was expecting it to
return immediately. From what you've just said (settune performing a
real "df"), I guess this behaviour is normal.
  


yes ...


I don't understand why it's necessary to perform a real "df" in
"settune". Isn't the licence inode used to store the previous
values of "df" so that it can give an immediate answer to "df", and then
perform a real regular "df" in background to upgrade the "cached df" to
the real value ?
  


For GFS1, we can't change the disk layout, so we borrow the "license" file 
that happens to be an unused on-disk GFS1 file. There is only one per 
file system, compared to GFS2, which uses N+1 files (N being the number of 
nodes in the cluster) to handle the "df" statistics. Every node keeps 
its changes in a memory buffer and syncs its local changes to the master 
(license) file every 30 seconds. Upon an unclean shutdown (or crash), the 
local changes in the memory buffer will be lost. To re-sync the statistics, 
we need to run a real "df" (that scans the on-disk RGRP structures) to 
correct them. For details, check out one of my old write-ups in:


http://people.redhat.com/wcheng/Patches/GFS/readme.gfs_fast_statfs.R4

-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Behavior of "statfs_fast" settune

2008-01-16 Thread Wendy Cheng

Wendy Cheng wrote:

Mathieu Avila wrote:

Hello GFS developers,

I am in the process of evaluating the performance gain of
the "statfs_fast" patch.
Once the FS is mounted, I perform "gfs_tool settune " and then i
measure the time to perform "df" on a partially filled FS. The time is
almost the same, "df" returns almost instantly, with a value really
near the truth, and progressively reaching the true one.

But I have noticed that when the FS size increases, the time to
perform "gfs_tool settune " increases dramatically. In fact,
after a few measures, it appears that the time to perform "df" without
fuzzy statfs is the same as the time to activate fuzzy statfs. 
In theory, this shouldn't happen. Are you on RHEL 4 or RHEL 5 ? And 
what is the FS size that causes this problem ?


I just did a quick try. It doesn't happen to me. By reading your note, 
were you *repeatedly* issuing "gfs_tool settune .." then followed by 
"df" ? Remember the "settune" is expected to be run *once* right after 
the particular GFS filesystem is mounted. You certainly *can* run it 
multiple times. It won't hurt anything. However, each time the "settune" 
is invoked, the code has to perform a regular "df" (i.e. that's the way 
it initializes itself). I suspect this is the cause of your issue. Let 
me know either way.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Behavior of "statfs_fast" settune

2008-01-16 Thread Wendy Cheng

Mathieu Avila wrote:

Hello GFS developers,

I am in the process of evaluating the performance gain of
the "statfs_fast" patch.
Once the FS is mounted, I perform "gfs_tool settune  " and then i
measure the time to perform "df" on a partially filled FS. The time is
almost the same, "df" returns almost instantly, with a value really
near the truth, and progressively reaching the true one.

But I have noticed that when the FS size increases, the time to
perform  "gfs_tool settune  " increases dramatically. In fact,
after a few measures, it appears that the time to perform "df" without
fuzzy statfs is the same as the time to activate fuzzy statfs. 
  
In theory, this shouldn't happen. Are you on RHEL 4 or RHEL 5 ? And what 
is the FS size that causes this problem ?


-- Wendy

I've read the patch, and from what I understand, there is an inode used
to store the fuzzy statfs value, read once and updated later in the
background. So the behavior I experienced shouldn't happen. So, please,
what am I doing wrong ?

--
Mathieu







--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS tuning advice sought

2008-01-07 Thread Wendy Cheng

[EMAIL PROTECTED] wrote:


Is there any GFS tuning I can do which might help speed up access to
these mailboxes?

 


You probably need GFS2 in this case. To fix mail server issues in GFS1
would be too intrusive given the current state of the development cycle.
   



Wendy,

I noticed you mention that GFS2 might be best for this. Would this apply for 
web servers as well? I've been using GFS on RHEL4 for web server cluster 
sharing. Would I be better to look at GFS2 for performance?



 


Not sure about web servers though - I think it depends on access patterns.

-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS tuning advice sought

2008-01-07 Thread Wendy Cheng

James Fidell wrote:


I have a 3-node cluster built on CentOS 5.1, fully updated, providing
Maildir mail spool filesystems to dovecot-based IMAP servers.  As it
stands GFS is in its default configuration -- no tuning has been done
so far.

Mostly, it's working fine.  Unfortunately we do have a few people with
tens of thousands of emails in single mailboxes who are seeing fairly
significant performance problems when fetching their email and in this
instance "make your mailbox smaller" isn't an acceptable solution :(

Is there any GFS tuning I can do which might help speed up access to
these mailboxes?


 

You probably need GFS2 in this case. To fix mail server issues in GFS1 
would be too intrusive given the current state of the development cycle.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS performance

2008-01-04 Thread Wendy Cheng

Kamal Jain wrote:

Feri,

Thanks for the information.  A number of people have emailed me expressing some 
level of interest in the outcome of this, so hopefully I will soon be able to 
do some tuning and performance experiments and report back our results.

On the demote_secs tuning parameter, I see you're suggesting 600 seconds, which 
appears to be longer than the default 300 seconds as stated by Wendy Cheng at 
http://people.redhat.com/wcheng/Patches/GFS/readme.gfs_glock_trimming.R4 -- 
we're running RHEL4.5.  Wouldn't a SHORTER demote period be better for lots of 
files, whereas perhaps a longer demote period might be more efficient for a 
smaller number of files being locked for long periods of time?
  


This demote_secs tunable is a little bit tricky :) ... What happens here 
is that GFS caches glocks, which can accumulate to a huge count. Unless 
the VM releases the inodes (files) associated with these glocks, the 
current GFS internal daemons will do *fruitless* scans trying to remove 
these glocks (but never succeed). If you set demote_secs to a large 
number, it *reduces* the wake-up frequency of the daemons doing this 
fruitless work, which, in turn, leaves more CPU cycles for real work. 
Without the glock trimming patch in place, that is a way to tune a 
system that is constantly touching a large number of files (such as 
rsync). Ditto for the "scand" wake-up interval: making it larger will 
help performance in this situation.


With the *new* glock trimming patch, we actually drop the memory 
reference count so a glock can be "demoted" and subsequently removed 
from the system if it sits idle. To demote the glock, we need the 
gfs_scand daemon to wake up often - this implies a smaller demote_secs 
for it to be effective.
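
To make the two directions concrete, here is a sketch with purely
illustrative values (the mount point and numbers are placeholders; see
the readme above for the exact semantics of each tunable):

# without the trimming patch: a larger demote_secs cuts down the
# fruitless gfs_scand wake-ups
gfs_tool settune /mnt/gfs demote_secs 600

# with the trimming patch: enable glock trimming and let gfs_scand wake
# up more often so idle glocks actually get demoted
gfs_tool settune /mnt/gfs glock_purge 50
gfs_tool settune /mnt/gfs demote_secs 30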

On a related note, I converted a couple of the clusters in our lab from GULM to 
DLM and while performance is not necessarily noticeably improved (though more 
detailed testing was done after the conversion), we did notice that both 
clusters became more stable in the DLM configuration.
  
This is mostly because DLM is the current default lock manager (with 
on-going development efforts) while GULM is not actively maintained.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS performance

2008-01-02 Thread Wendy Cheng

Kamal Jain wrote:

Hi Wendy,

IOZONE v3.283 was used to generate the results I posted.

An example invocation line [for the IOPS result]:


./iozone -O -l 1 -u 8 -T -b 
/root/iozone_IOPS_1_TO_8_THREAD_1_DISK_ISCSI_DIRECT.xls -F 
/mnt/iscsi_direct1/iozone/iozone1.tmp ...


It's for 1 to 8 threads, and I provided 8 file names, though I'm only showing 
one in the line above.  The file destinations were on the same disk for a 
single disk test, and on alternating disks for a 2-disk test.  I believe IOZONE 
uses a simple random string, repeated in certain default record sizes, when 
performing its various operations.

  
Intuitively (from reading your iozone command), this is a locking issue. 
There is a lot to say about your setup, mostly because all data and lock 
traffic are funneled through the same network. Remember that locking is 
mostly about *latency*, not bandwidth. So even if your network is not 
saturated, performance can go down. It is different from the rsync 
issue (as described by Jos Vos), so the glock trimming patch is not 
helpful in this case.


However, I won't know for sure until we get the data analyzed. Thanks 
for the input.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] gfs2 hang

2008-01-01 Thread Wendy Cheng

Jos Vos wrote:



The one thing that's horribly wrong in some applications is performance.
If you need to have large amounts of files and frequent directory scans
(i.e. rsync etc.), you're lost.

 

On GFS(1) part, the glock trimming patch 
(http://people.redhat.com/wcheng/Patches/GFS/readme.gfs_glock_trimming.R4) 
was developed for customers with rsync issues. Field data have shown 
positive results. It is released on RHEL 5.1, as well on RHEL 4.6. Check 
out the usage part of above write-up.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS performance

2008-01-01 Thread Wendy Cheng

Kamal Jain wrote:

A challenge we’re dealing with is a massive number of small files, so 
there is a lot of file-level overhead, and as you saw in the 
charts…the random reads and writes were not friends of GFS.




It is expected that GFS2 would do better in this area, but this does 
*not* imply GFS(1) is not fixable. One thing that would be helpful is 
sending us the benchmark (or a test program that reasonably represents 
your application's IO patterns) you used to generate the performance 
data. Then we'll see what can be done from there.


-- Wendy


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS or Web Server Performance issues?

2007-11-28 Thread Wendy Cheng

[EMAIL PROTECTED] wrote:
I think a part of the problem is perception. Clustering in most cases 
leads to _LOWER_ performance on I/O bound processes. If it's CPU 
bound, then sure, it'll help. But on I/O it'll likely do you harm. 
It's more about redundancy and graceful degradation than performance. 
There's no way of getting away from the fact that a cluster has to do 
more work than a single node, just because it has to keep itself in sync.


The only way clustering will give you scaleable performance benefit is 
with partitioned (as opposed to shared) data. Shared data clustering 
is about convenience and redundancy, not about performance.


Well said ! This reminds me of some previous conversations with 
customers in the mid-90s, when people started to port their applications 
from supercomputers and/or big SMP boxes to clustered machines. It took 
a non-trivial amount of collaborative effort between the customers 
and the team's application enablement group to achieve the expected 
performance when moving applications between different platforms. Be 
aware that cluster management and its associated performance tuning is 
really not a trivial task. It is hard to give "catch-all" 
advice in a mailing list, particularly since we participate in these 
discussions on a spare-time basis.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS or Web Server Performance issues?

2007-11-27 Thread Wendy Cheng

[EMAIL PROTECTED] wrote:

Here is ab sending a test to an LVS server in front of a 3 node web server. 
The average load on each server was around 8.00 to 10.00. These aren't very 
good numbers and I'm wondering where to start looking.
 



Using a load balancer in front of GFS nodes is tricky. Make sure to set 
your scheduling rule (or whatever it is called in LVS) in such a way 
that it does not generate unnecessary lock traffic. For example, you 
don't want the same write lock to get rotated between three nodes. Be 
aware that moving a write lock between nodes requires many steps that 
include a disk flush.
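
Purely as an illustration (this assumes LVS is driven with ipvsadm and
direct routing; the virtual and real server addresses are placeholders),
a source-hashing scheduler keeps each client pinned to one back-end
node, so the same locks are not bounced around the cluster:

# virtual HTTP service, source-hash scheduling
ipvsadm -A -t 192.168.1.150:80 -s sh
# the three web/GFS nodes as real servers, direct routing
ipvsadm -a -t 192.168.1.150:80 -r 192.168.1.201:80 -g
ipvsadm -a -t 192.168.1.150:80 -r 192.168.1.202:80 -g
ipvsadm -a -t 192.168.1.150:80 -r 192.168.1.203:80 -g

Persistence on the virtual service (the -p option) is another way to get
a similar effect.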


-- Wendy


These are pretty much default apache RPM versions out of the box.

# ab -kc 50 -t 30 http://192.168.1.150/
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.1.150 (be patient)
Finished 130 requests


Server Software:Apache
Server Hostname:192.168.1.150
Server Port:80

Document Path:  /
Document Length:8997 bytes

Concurrency Level:  50
Time taken for tests:   30.234185 seconds
Complete requests:  130
Failed requests:0
Write errors:   0
Keep-Alive requests:0
Total transferred:  1371090 bytes
HTML transferred:   1299586 bytes
Requests per second:4.30 [#/sec] (mean)
Time per request:   11628.532 [ms] (mean)
Time per request:   232.571 [ms] (mean, across all concurrent requests)
Transfer rate:  44.25 [Kbytes/sec] received

Connection Times (ms)
 min  mean[+/-sd] median   max
Connect:01   1.4  2   4
Processing:   409 6228 4357.5   5227   16317
Waiting:  369 6146 4370.8   5199   16261
Total:409 6230 4358.1   5230   16321

Percentage of the requests served within a certain time (ms)
 50%   5230
 66%   8331
 75%   9310
 80%  10386
 90%  12820
 95%  14502
 98%  15399
 99%  15746
100%  16321 (longest request)






--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS Performance Problems (RHEL5)

2007-11-27 Thread Wendy Cheng

Paul Risenhoover wrote:



Sorry about this mis-send.

I'm guessing my problem has to do with this:

https://www.redhat.com/archives/linux-cluster/2007-October/msg00332.html

BTW: My file system is 13TB.

I found this article that talks about tuning the glock_purge setting:
http://people.redhat.com/wcheng/Patches/GFS/readme.gfs_glock_trimming.R4

But it seems to require a special kernel module that I don't have :(.  
Anybody know where I can get it?





The patch should be part of RHEL 4.6 (or RHEL 5.1) - both will be 
released soon.
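
Once the patched module is in place, the tuning itself is just a couple
of tunables (the values here are placeholders, not recommendations; the
readme above explains how to pick them):

gfs_tool settune /mnt/promise glock_purge 50
gfs_tool settune /mnt/promise demote_secs 30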


-- Wendy


Hi All,

I am experiencing some substantial performance problems on my RHEL 5 
server running GFS.  The specific symptom that I'm seeing is that the 
file system will hang for anywhere from 5 to 45 seconds on occasion.  
When this happens it stalls all processes that are attempting to 
access the file system (ie, "ls -l") such that even a ctrl-break 
can't stop it.


It also appears that gfs_scand is working extremely hard.  It runs at 
7-10% CPU almost constantly.  I did some research on this and 
discovered a discussion about cluster locking in relation to 
directories with large numbers of files, and believe it might be 
related.  I've got some directories with 5000+ files.  However, I get 
the stalling behavior even when nothing is accessing those particular 
directories.


I also tried some tuning some of the parameters:

gfs_tool settune /mnt/promise demote_secs 10
gfs_tool settune /mnt/promise scand_secs 2
gfs_tool settune /mnt/promise/ reclaim_limit 1000

But this doesn't appear to have done much. Does anybody have some 
thoughts on how I might resolve this?


Paul




--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Any thoughts on losing mount?

2007-11-27 Thread Wendy Cheng

[EMAIL PROTECTED] wrote:

be nice to have some kinds of "control node" concept where these admin
commands can be performed on one particular pre-defined node. This would
allow the tools to check and prevent mistakes like these (say fsck would



In my test setup, this is somewhat how I've been using my cluster in the past 
year or so. I try to maintain everything from just one node and it sort of 
becomes my management node.


  


Yep .. that's smart.

But I'm thinking more about a formal GFS tool set from a product point 
of view .. Anyway, glad your system is back to normal. Is it ?


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Any thoughts on losing mount?

2007-11-27 Thread Wendy Cheng

[EMAIL PROTECTED] wrote:
Thanks for the help. Your suggestion led to fixing things just fine. I went 
with reformatting the space since that is an easy option. I understand about 
making sure that all nodes are unmounted before doing any gfs_fsck work on the 
disk. 
  


Sorry... I was a little bit worried. Anyway, I'm starting to think it 
would be nice to have some kind of "control node" concept where these 
admin commands can be performed on one particular pre-defined node. This 
would allow the tools to check for and prevent mistakes like these (say, 
fsck would ssh to each node to unmount the filesystem before it starts 
to do anything). This is something to think about. After all, cluster 
system management is not a trivial task and mistakes can be plenty, 
regardless of the admins' skills.


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Any thoughts on losing mount?

2007-11-27 Thread Wendy Cheng

[EMAIL PROTECTED] wrote:

I've unmounted the partition from one node and am now running gfs_fsck on it.
  
Please *don't* do that. While running fsck (gfs_fsck), the filesystem 
must be unmounted from *all* nodes.

There were a number of problems;

Leaf(15651992) entry count in directory 15651847 doesn't match number of 
entries found - is 49, found 0

Leaf entry count updated
Leaf(15651935) entry count in directory 15651847 doesn't match number of 
entries found - is 44, found 0

Leaf entry count updated

A very long list in fact. So, rather than Updating the countless problems, 
would I be better off to simply reformat the storage with GFS?



  

Your call ...


  


A few things (for next time):
1. Any fsck should be run on an *unmounted* filesystem. In the GFS case, 
you should unmount the filesystem from all nodes first (see the sketch 
after this list).

2. Every filesystem has its own fsck, so please be aware of the difference.
3. For any tool you have never used before, please at least browse through 
the man page (in the GFS case, "man gfs_fsck").

4. If you want to redo mkfs, *unmount* the current partition from all nodes.
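
A sketch of that sequence (the node names, mount point and device below
are placeholders for your cluster):

# unmount the GFS filesystem from every node first
for n in node1 node2 node3; do ssh $n umount /mnt/web; done
# only then run the GFS-specific fsck against the block device
gfs_fsck /dev/vgcomp/web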

-- Wendy



--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] Any thoughts on losing mount?

2007-11-27 Thread Wendy Cheng

[EMAIL PROTECTED] wrote:

I'm pulling my hair out here :).
One node in my cluster has decided that it doesn't want to mount a storage 
partition which other nodes are not having a problem with. The console 
messages say that there is an inconsistency in the filesystem yet none of the 
other nodes are complaining. 

I cannot figure this one out so am hoping someone on the list can give me some 
leads on what else to look for as I do not want to cause any new problems.


  


The error message indicates a resource group (RG) may have become 
corrupted. Have you tried running an fsck (and did it fix anything) ?

Different nodes could be accessing different RGs, so the other nodes may 
not see the corruption (until one of them accesses this particular RG 
sometime later). Note that GFS normally tries to keep a node and/or 
process on the same RG it previously used whenever possible - this 
avoids a cluster-wide bottleneck (different nodes work on different RGs) 
while still keeping locality (reusing the previously accessed RG) for 
performance reasons.


Also, do you remember any abnormal event (unclean shutdown, panic, 
power loss, etc.) *before* this issue popped up ?


-- Wendy



Nov 27 10:29:26 compdev kernel: GFS: Trying to join cluster "lock_dlm", 
"vgcomp:web"
Nov 27 10:29:28 compdev kernel: GFS: fsid=vgcomp:web.3: Joined cluster. Now 
mounting FS...
Nov 27 10:29:28 compdev kernel: GFS: fsid=vgcomp:web.3: jid=3: Trying to 
acquire journal lock...
Nov 27 10:29:28 compdev kernel: GFS: fsid=vgcomp:web.3: jid=3: Looking at 
journal...

Nov 27 10:29:28 compdev kernel: GFS: fsid=vgcomp:web.3: jid=3: Done
Nov 27 10:29:28 compdev kernel: GFS: fsid=vgcomp:web.3: Scanning for log 
elements...
Nov 27 10:29:28 compdev kernel: GFS: fsid=vgcomp:web.3: Found 1 unlinked 
inodes
Nov 27 10:29:28 compdev kernel: GFS: fsid=vgcomp:web.3: Found quota changes 
for 0 IDs

Nov 27 10:29:28 compdev kernel: GFS: fsid=vgcomp:web.3: Done
Nov 27 10:29:35 compdev kernel: GFS: fsid=vgcomp:web.3: fatal: filesystem 
consistency error

Nov 27 10:29:35 compdev kernel: GFS: fsid=vgcomp:web.3:   RG = 31104599
Nov 27 10:29:35 compdev kernel: GFS: fsid=vgcomp:web.3:   function = 
gfs_setbit
Nov 27 10:29:35 compdev kernel: GFS: fsid=vgcomp:web.3:   file = 
/home/xos/gen/updates-2007-11/xlrpm29472/rpm/BUILD/gfs-kernel-2.6.9-72/up/src/

gfs/bits.c, line = 71
Nov 27 10:29:35 compdev kernel: GFS: fsid=vgcomp:web.3:   time = 1196180975
Nov 27 10:29:35 compdev kernel: GFS: fsid=vgcomp:web.3: about to withdraw from 
the cluster
Nov 27 10:29:35 compdev kernel: GFS: fsid=vgcomp:web.3: waiting for 
outstanding I/O

Nov 27 10:29:35 compdev kernel: GFS: fsid=vgcomp:web.3: telling LM to withdraw
Nov 27 10:29:37 compdev kernel: lock_dlm: withdraw abandoned memory
Nov 27 10:29:37 compdev kernel: GFS: fsid=vgcomp:web.3: withdrawn






--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS RG size (and tuning)

2007-11-02 Thread Wendy Cheng

Wendy Cheng wrote:

Jos Vos wrote:

On Fri, Nov 02, 2007 at 04:12:39PM -0400, Wendy Cheng wrote:

Also I read your previous mailing list post with the "df" issue - didn't 
have time to comment. Note that both RHEL 4.6 and RHEL 5.1 will have 
a "statfs_fast" tunable that is specifically added to speed up the 
"df" command. Give it a try. If it works well, we'll switch it from 
a tunable to the default (so people don't have to suffer from GFS1's df 
command so much).


OK, thanks, we'll try with 5.1.

In the meantime we rebuilt all fs's with larger RGs (-r 2048), which
already improved the "df" behavior seriously.
Sigh .. everything has a trade-off. Forgot to explain this .. a larger 
RG will introduce more disk reads if RG locks (which guard disk 
allocation) happen to get moved around between different nodes. You 
may also have to carry more buffer heads in the memory cache. If you 
do lots of rsync, it could contribute to the lock/memory congestion.
Also, fs performance is not that bad w.r.t. bandwidth (our 
measurements were first incorrect due to 32-bit counter troubles), 
but operations like

rsync (which we do a lot) that scan large directory trees are horrible.
For that we'll wait for 5.1.


Thanks for the patient - let us know how it goes ...


s/patient/patience/


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS RG size (and tuning)

2007-11-02 Thread Wendy Cheng

Jos Vos wrote:

On Fri, Nov 02, 2007 at 04:12:39PM -0400, Wendy Cheng wrote:

  
Also I read your previous mailing list post with the "df" issue - didn't 
have time to comment. Note that both RHEL 4.6 and RHEL 5.1 will have a 
"statfs_fast" tunable that is specifically added to speed up the "df" 
command. Give it a try. If it works well, we'll switch it from a tunable 
to the default (so people don't have to suffer from GFS1's df command so much).



OK, thanks, we'll try with 5.1.

In the meantime we rebuilt all fs's with larger RGs (-r 2048), which
already improved the "df" behavior seriously.
  
Sigh .. everything has a trade-off. Forgot to explain this .. a larger RG 
will introduce more disk reads if RG locks (which guard disk allocation) 
happen to get moved around between different nodes. You may also have to 
carry more buffer heads in the memory cache. If you do lots of rsync, it 
could contribute to the lock/memory congestion.
Also, fs performance is not that bad w.r.t. bandwidth (our measurements 
were first incorrect due to 32-bit counter troubles), but operations like

rsync (which we do a lot) that scan large directory trees are horrible.
For that we'll wait for 5.1.

  

Thanks for the patient - let us know how it goes ...

-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS RG size (and tuning)

2007-11-02 Thread Wendy Cheng

Wendy Cheng wrote:
2. The gfs_scand issue is more to do with the number of glock count. 
One way to tune this is via purge_glock tunable. There is an old 
write-up in:
http://people.redhat.com/wcheng/Patches/GFS/readme.gfs_glock_trimming.R4 
. It is for RHEL4 but should work the same way for RHEL5.
Sorry, you apparently misunderstood what I meant about "work the same 
way as in RHEL5". The logic and tunable settings are identical, but the 
code (patch) itself depends on the individual RHEL release and 
update version.


Also I read your previous mailing list post with the "df" issue - didn't 
have time to comment. Note that both RHEL 4.6 and RHEL 5.1 will have a 
"statfs_fast" tunable that is specifically added to speed up the "df" 
command. Give it a try. If it works well, we'll switch it from a tunable 
to the default (so people don't have to suffer from GFS1's df command so much).


-- Wendy

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS RG size (and tuning)

2007-11-02 Thread Wendy Cheng

Jos Vos wrote:

On Fri, Oct 26, 2007 at 07:57:18PM -0400, Wendy Cheng wrote:

  
2. The gfs_scand issue is more to do with the number of glock count. One 
way to tune this is via purge_glock tunable. There is an old write-up in:
http://people.redhat.com/wcheng/Patches/GFS/readme.gfs_glock_trimming.R4 
. It is for RHEL4 but should work the same way for RHEL5.



Unfortunately, no.  The patch applies fine (only with some offsets),
but building results in an error:
  

The patch on my people's page is a *RHEL4* patch.

Just did a quick check: RHEL 5.1 gfs-kmod-0.1.19 should have this patch 
and will be released *very soon*. Would you mind waiting a little bit 
longer to get an "official" version ? Or go to our CVS (which is open to 
everyone) and extract the source yourself ?


-- Wendy

  CC [M]  /builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/acl.o
  CC [M]  /builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/bits.o
  CC [M]  /builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/bmap.o
  CC [M]  /builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/daemon.o
  CC [M]  /builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/dio.o
/builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/dio.c: In function 
'stuck_releasepage':
/builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/dio.c:85: warning: 
format '%lu' expects type 'long unsigned int', but argument 3 has type 
'sector_t'
/builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/dio.c:96: warning: 
format '%lu' expects type 'long unsigned int', but argument 4 has type 'u64'
/builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/dio.c:122: warning: 
format '%lu' expects type 'long unsigned int', but argument 3 has type 
'uint64_t'
/builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/dio.c:122: warning: 
format '%lu' expects type 'long unsigned int', but argument 4 has type 
'uint64_t'
  CC [M]  /builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/dir.o
  CC [M]  /builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/eaops.o
  CC [M]  /builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/eattr.o
  CC [M]  /builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/file.o
  CC [M]  /builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/glock.o
/builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/glock.c: In function 
'try_purge_iopen':
/builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/glock.c:2563: error: 
implicit declaration of function 'gl2gl'
/builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/glock.c:2563: 
warning: assignment makes pointer from integer without a cast
/builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/glock.c: In function 
'dump_inode':
/builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/glock.c:2844: 
warning: format '%lu' expects type 'long unsigned int', but argument 4 has type 
'uint64_t'
/builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/glock.c:2844: 
warning: format '%lu' expects type 'long unsigned int', but argument 5 has type 
'uint64_t'
/builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/glock.c:2844: 
warning: format '%lu' expects type 'long unsigned int', but argument 2 has type 
'uint64_t'
/builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/glock.c:2844: 
warning: format '%lu' expects type 'long unsigned int', but argument 3 has type 
'uint64_t'
/builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/glock.c: In function 
'dump_glock':
/builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/glock.c:2882: 
warning: format '%lu' expects type 'long unsigned int', but argument 5 has type 
'u64'
/builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/glock.c:2882: 
warning: format '%lu' expects type 'long unsigned int', but argument 3 has type 
'u64'
make[4]: *** 
[/builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/glock.o] Error 1
make[3]: *** 
[_module_/builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs] Error 2

Regards,

  


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


Re: [Linux-cluster] GFS RG size (and tuning)

2007-10-26 Thread Wendy Cheng

Jos Vos wrote:


Hi,

The gfs_mkfs manual page (RHEL 5.0) says:

 If not  specified,  gfs_mkfs  will  choose the RG size based on the size
 of the file system: average size file systems will have 256 MB  RGs,
 and bigger file systems will have bigger RGs for better performance.

My 3 TB filesystems still seem to have 256 MB RG's (I don't know how to
see the RG size, but there are 11173 of them, so that seems to indicate
a size of 256 MB).  Is 3 TB considered to be "average size"? ;-)

Anyway, is it recommended to try rebuilding the fs's with "-r 2048" for
3 TB filesystems, each with between 1 and 2 million files on it?
Especially since gfs_scand uses *huge* amounts of CPU time and doing df
takes a *very* long time


 

1. 3TB is not "average size". Smaller RG can help with "df" command - 
but if your system is congested, it won't help much.
2. The gfs_scand issue is more to do with the number of glock count. One 
way to tune this is via purge_glock tunable. There is an old write-up in:
http://people.redhat.com/wcheng/Patches/GFS/readme.gfs_glock_trimming.R4 
. It is for RHEL4 but should work the same way for RHEL5.
3. If you don't need to know the exact disk usage and/or can tolerate 
some delays in disk usage update, there is another tunable 
"statfs_fast". The old write-up (RHEL4) is in: 
http://people.redhat.com/wcheng/Patches/GFS/readme.gfs_fast_statfs.R4 
(and should work the same way as in RHEL 5).
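
As a hedged illustration of point 1 (the lock protocol, cluster:fsname,
journal count and device are placeholders, and mkfs destroys the
existing data, so this only applies to a rebuild):

# 2048 MB resource groups => far fewer RG headers for "df" to scan
gfs_mkfs -p lock_dlm -t mycluster:myfs -j 3 -r 2048 /dev/myvg/mylv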


-- Wendy




--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

