Re: [Cluster-devel] [PATCH dlm-tool 1/4] fence: make pkg-config binary as passable make var
Hi Alex, all 4 patches look good to me.

Cheers
Fabio

On 11/04/2023 16.49, Alexander Aring wrote:
This patch defines a PKG_CONFIG make variable which can be overridden by the user, as is already the case for the dlm_controld Makefile.
---
 fence/Makefile | 5 +++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fence/Makefile b/fence/Makefile
index ee4dfb88..894f6396 100644
--- a/fence/Makefile
+++ b/fence/Makefile
@@ -19,7 +19,10 @@ CFLAGS += -D_GNU_SOURCE -O2 -ggdb \
 CFLAGS += -fPIE -DPIE
 CFLAGS += -I../include
-CFLAGS += $(shell pkg-config --cflags pacemaker-fencing)
+
+PKG_CONFIG ?= pkg-config
+
+CFLAGS += $(shell $(PKG_CONFIG) --cflags pacemaker-fencing)
 LDFLAGS += -Wl,-z,relro -Wl,-z,now -pie
 LDFLAGS += -ldl
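For anyone wondering what the `?=` change buys: it lets a build override the pkg-config binary from the make command line or environment without patching the Makefile. A minimal sketch of the same pattern, against a throwaway Makefile (the i686-linux-gnu-pkg-config name is only an illustrative cross-toolchain binary, not something this patch ships):

```shell
# Write a throwaway Makefile that uses the same ?= pattern as the patch.
printf 'PKG_CONFIG ?= pkg-config\nall:\n\t@echo "using: $(PKG_CONFIG)"\n' > /tmp/pkgconf-demo.mk

# Default: ?= falls back to plain pkg-config (env -u keeps the demo deterministic).
env -u PKG_CONFIG make -s -f /tmp/pkgconf-demo.mk
# prints: using: pkg-config

# Override from the make command line, e.g. for a cross toolchain
# (i686-linux-gnu-pkg-config is a hypothetical name here).
make -s -f /tmp/pkgconf-demo.mk PKG_CONFIG=i686-linux-gnu-pkg-config
# prints: using: i686-linux-gnu-pkg-config
```

A command-line variable assignment always wins over `?=` (and even over `=`) in make, which is why no further Makefile changes are needed for cross builds.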
Re: [Cluster-devel] [ClusterLabs] gfs2-utils 3.5.0 released
On 14/02/2023 21.17, Valentin Vidic wrote: On Tue, Feb 14, 2023 at 06:18:55AM +0100, Fabio M. Di Nitto wrote:
The process would have to look like:
(usual) apt-get install
git clone gfs2-utils
export CFLAGS/LDFLAGS/CC or whatever env var
./autogen.sh
./configure..
make
make check
Using other build tools like debbuild or mock has been problematic in the past for other projects; it might not be the case for gfs2-utils. So you can try that all in a local VM and let me know the steps, then we can add it to CI.
Sure, the commands to build and test a 32-bit version look like this for me:
dpkg --add-architecture i386
doh.. didn't think of cross compilation.
apt-get update
apt-get install --yes build-essential crossbuild-essential-i386 autoconf automake autopoint autotools-dev bison flex check:i386 libblkid-dev:i386 libbz2-dev:i386 libncurses-dev:i386 libtool pkg-config:i386 zlib1g-dev:i386
./configure --build=x86_64-linux-gnu --host=i686-linux-gnu
make
make check
ack, perfect: we already have a Debian CI builder dedicated to arm cross compilation, we can tweak it to add i386 as well. Thanks Fabio
Re: [Cluster-devel] [ClusterLabs] gfs2-utils 3.5.0 released
On 13/02/2023 10.58, Andrew Price wrote: On 11/02/2023 17:16, Valentin Vidić wrote: On Thu, Feb 09, 2023 at 01:12:58PM +, Andrew Price wrote: gfs2-utils contains the tools needed to create, check, modify and inspect gfs2 filesystems, along with support scripts needed on every gfs2 cluster node. Hi, some tests seem to be failing for the new version in Debian:
gfs2_edit tests
37: Save/restoremeta, defaults FAILED (edit.at:14)
38: Save/restoremeta, no compression FAILED (edit.at:24)
39: Save/restoremeta, min. block size FAILED (edit.at:34)
40: Save/restoremeta, 4 journals FAILED (edit.at:44)
41: Save/restoremeta, min. block size, 4 journals FAILED (edit.at:54)
42: Save metadata to /dev/null ok
It seems this is all on 32-bit architectures, more info here:
https://buildd.debian.org/status/fetch.php?pkg=gfs2-utils=armel=3.5.0-1=1676127480=0
https://buildd.debian.org/status/fetch.php?pkg=gfs2-utils=armhf=3.5.0-1=1676127632=0
https://buildd.debian.org/status/fetch.php?pkg=gfs2-utils=i386=3.5.0-1=1676127477=0
https://buildd.debian.org/status/fetch.php?pkg=gfs2-utils=mipsel=3.5.0-1=1676130593=0
Can you check?
The smoking gun is:
"stderr: Error: File system is too small to restore this metadata. File system is 524287 blocks. Restore block = 537439"
It's caused by size_t being used for a variable relating to file size, and size_t is too small in 32-bit environments. It should be fixed by this commit: https://pagure.io/fork/andyp/gfs2-utils/c/a3f3aadc789f214cd24606808f5d8a6608e10219 It's waiting for the CI queue to flush after last week's outage, but it should be in main shortly. I doubt we have any users on 32-bit architectures, but perhaps we can get a 32-bit test runner added to the CI pool to prevent these issues slipping through anyway. We had to drop i686 from CI for lack of BaseOS support for 32-bit OpenStack / Cloud images. Also, other HA tools like pacemaker dropped 32-bit support a while back. Not sure it's worth the trouble any more.
If Valentin has an easy way to set up a 64-bit Debian-based system that will build a 32-bit env with easy env-var overrides, I am happy to add it to the pool for gfs2-utils, but I am not going to build pure i686 images for that. The process would have to look like:
(usual) apt-get install
git clone gfs2-utils
export CFLAGS/LDFLAGS/CC or whatever env var
./autogen.sh
./configure..
make
make check
Using other build tools like debbuild or mock has been problematic in the past for other projects; it might not be the case for gfs2-utils. So you can try that all in a local VM and let me know the steps, then we can add it to CI. Fabio
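The 32-bit failure mode Andy describes is easy to reproduce with plain arithmetic: bash does 64-bit math, so masking to 32 bits mimics what a 32-bit size_t does to a large file size before it is divided into blocks (the 5 GiB figure below is illustrative, not the size the test suite actually uses):

```shell
fssize=$((5 * 1024 * 1024 * 1024))   # 5 GiB, comfortably past 2^32 bytes
blksz=4096

# 64-bit arithmetic gives the real block count...
echo "correct:   $((fssize / blksz)) blocks"
# prints: correct:   1310720 blocks

# ...while truncating the size to 32 bits first (what a 32-bit size_t does)
# silently wraps 5 GiB down to 1 GiB and undercounts the blocks.
echo "truncated: $(( (fssize & 0xFFFFFFFF) / blksz )) blocks"
# prints: truncated: 262144 blocks
```

This is why the symptom only shows on armel/armhf/i386/mipsel: the same code is correct wherever size_t is 64 bits wide.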
Re: [Cluster-devel] [ClusterLabs Developers] Pacemaker 2.1.0: Should we rename the master branch?
On 10/21/2020 7:25 PM, Ken Gaillot wrote: Maybe we should wait until github finishes putting its plans in place. Especially if we want to do all projects at once, there's no need to tie it to a particular Pacemaker release. Right, I don´t see any reason to tie releases with branch changes. Let´s keep operations as-is till github has all the infra in place and that will make the change much more smooth. It might give me time to start changing CI to handle main and master as if they were the same in the meantime. Cheers Fabio On Wed, 2020-10-21 at 06:10 +0200, Fabio M. Di Nitto wrote: On 10/20/2020 7:26 PM, Andrew Price wrote: [CC+ cluster-devel] On 19/10/2020 23:59, Ken Gaillot wrote: On Mon, 2020-10-19 at 07:19 +0200, Fabio M. Di Nitto wrote: Hi Ken, On 10/2/2020 8:02 PM, Digimer wrote: On 2020-10-02 1:12 p.m., Ken Gaillot wrote: Hi all, I sent a message to the us...@clusterlabs.org list about releasing Pacemaker 2.1.0 next year. Coincidentally, there is a plan in the git and Github communities to change the default git branch from "master" to "main": https://github.com/github/renaming The rationale for the change is not the specific meaning as used in branching, but rather to avoid any possibility of fostering an exclusionary environment, and to replace generic metaphors with something more obvious (especially to non-native English speakers). No objections to the change, but please let´s coordinate the change across all HA projects at once, or CI is going to break badly as the concept of master branch is embedded everywhere and not per- project. Presumably this would be all the projects built by jenkins? correct. booth corosync fence-agents fence-virt knet libqb pacemaker pcs qdevice resource-agents sbd Maintainers, do you think that's practical and desirable? I think I have super powers all repos to do the switch when github is ready to make us the switch. Practical no, there will be disruptions... 
desirable no, it's extra work, but the point is that it is doable. If the ClusterLabs projects switch together I might take the opportunity to make the switch in gfs2-utils.git at the same time, for consistency. Is there a single name that makes sense for all projects? "next", "development" or "unstable" captures how pacemaker uses master; not sure about other projects. "main" is generic enough for all projects, but so generic it doesn't give an idea of how it's used. Or we could go for something distinctive like fedora's "rawhide" or suse's "tumbleweed". "main" works for me; it seems to be the most widely adopted alternative thanks to Github, so its purpose will be clear by convention. That said, it doesn't matter too much as long as the remote HEAD is set to the new branch. I would go for main and follow github recommendations. They are putting automatic redirects in place to smooth the transition, and we can avoid spending time finding a name that won't offend some delicate soul over the internet. Another question is how to do the switch without causing confusion the next time someone pulls. It might be safest to simply create the main branch and delete the master branch (rather than, say, replacing all of the content in master with an explanatory note). That way a 'git pull' gives a hint of the change and no messy conflicts:
$ git pull
From /tmp/gittest/upstream
 * [new branch] main -> origin/main
Your configuration specifies to merge with the ref 'refs/heads/master' from the remote, but no such ref was fetched.
Maybe also push a 'master_is_now_main' tag annotated with 'use git branch -u origin/main to fix tracking branches'. Or maybe that's excessive :) Let's wait for github to put those in place for us. No point in re-inventing the wheel. The last blog post I read said they were working to do it at the infrastructure level, and that would save us a lot of headaches and complications. IIRC they will add the main branch automatically to new projects and transition old ones.
the master branch will be an automatic redirect to main. That will basically solve 99% of our issues: git pull won't break, etc. Cheers Fabio Cheers, Andy Since we are admins of all repositories, we can do it in one shot without too much pain and suffering in CI. It will probably require a day or two of CI downtime to rebuild the world as well. Fabio The change would not affect existing repositories/projects. However I am wondering if we should take the opportunity of the minor-version bump to do the same for Pacemaker. The impact on developers would be a one-time process for each checkout/fork: https://wiki.clusterlabs.org/wiki/Pacemaker_2.1_Changes#Development_changes In my opinion, this is a minor usage that many existing projects will not bother changing, but I do think that since all new projects will default to "main", sometime in the future any project still using "master" will appear outdated to young developers.
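For reference, the one-time dance each contributor ends up doing locally is small. A sketch against a throwaway repo (the path and identity below are placeholders; on a real clone you would also run `git fetch` and `git branch -u origin/main` afterwards to fix the tracking branch):

```shell
# Throwaway repo standing in for a local clone (path/identity are placeholders).
cd /tmp && rm -rf branch-demo && git init -q branch-demo && cd branch-demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'initial commit'

# Rename whatever the current branch is called to main.
git branch -m main

git symbolic-ref --short HEAD
# prints: main
```

`git branch -m` renames the current branch in place, so it works whether the checkout started on master or anything else.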
Re: [Cluster-devel] [ClusterLabs Developers] Pacemaker 2.1.0: Should we rename the master branch?
On 10/20/2020 7:26 PM, Andrew Price wrote: [CC+ cluster-devel] On 19/10/2020 23:59, Ken Gaillot wrote: On Mon, 2020-10-19 at 07:19 +0200, Fabio M. Di Nitto wrote: Hi Ken, On 10/2/2020 8:02 PM, Digimer wrote: On 2020-10-02 1:12 p.m., Ken Gaillot wrote: Hi all, I sent a message to the us...@clusterlabs.org list about releasing Pacemaker 2.1.0 next year. Coincidentally, there is a plan in the git and Github communities to change the default git branch from "master" to "main": https://github.com/github/renaming The rationale for the change is not the specific meaning as used in branching, but rather to avoid any possibility of fostering an exclusionary environment, and to replace generic metaphors with something more obvious (especially to non-native English speakers). No objections to the change, but please let's coordinate the change across all HA projects at once, or CI is going to break badly, as the concept of a master branch is embedded everywhere and not per-project. Presumably this would be all the projects built by jenkins? Correct: booth corosync fence-agents fence-virt knet libqb pacemaker pcs qdevice resource-agents sbd Maintainers, do you think that's practical and desirable? I think I have super powers on all repos to do the switch when github is ready for us to make the switch. Practical no, there will be disruptions... desirable no, it's extra work, but the point is that it is doable. If the ClusterLabs projects switch together I might take the opportunity to make the switch in gfs2-utils.git at the same time, for consistency. Is there a single name that makes sense for all projects? "next", "development" or "unstable" captures how pacemaker uses master; not sure about other projects. "main" is generic enough for all projects, but so generic it doesn't give an idea of how it's used. Or we could go for something distinctive like fedora's "rawhide" or suse's "tumbleweed".
"main" works for me; it seems to be the most widely adopted alternative thanks to Github, so its purpose will be clear by convention. That said, it doesn't matter too much as long as the remote HEAD is set to the new branch. I would go for main and follow github recommendations. They are putting automatic redirects in place to smooth the transition, and we can avoid spending time finding a name that won't offend some delicate soul over the internet. Another question is how to do the switch without causing confusion the next time someone pulls. It might be safest to simply create the main branch and delete the master branch (rather than, say, replacing all of the content in master with an explanatory note). That way a 'git pull' gives a hint of the change and no messy conflicts:
$ git pull
From /tmp/gittest/upstream
 * [new branch] main -> origin/main
Your configuration specifies to merge with the ref 'refs/heads/master' from the remote, but no such ref was fetched.
Maybe also push a 'master_is_now_main' tag annotated with 'use git branch -u origin/main to fix tracking branches'. Or maybe that's excessive :) Let's wait for github to put those in place for us. No point in re-inventing the wheel. The last blog post I read said they were working to do it at the infrastructure level, and that would save us a lot of headaches and complications. IIRC they will add the main branch automatically to new projects and transition old ones. the master branch will be an automatic redirect to main. That will basically solve 99% of our issues: git pull won't break, etc. Cheers Fabio Cheers, Andy Since we are admins of all repositories, we can do it in one shot without too much pain and suffering in CI. It will probably require a day or two of CI downtime to rebuild the world as well. Fabio The change would not affect existing repositories/projects. However I am wondering if we should take the opportunity of the minor-version bump to do the same for Pacemaker.
The impact on developers would be a one-time process for each checkout/fork: https://wiki.clusterlabs.org/wiki/Pacemaker_2.1_Changes#Development_changes In my opinion, this is a minor usage that many existing projects will not bother changing, but I do think that since all new projects will default to "main", sometime in the future any project still using "master" will appear outdated to young developers. We could use "main" or something else. Some projects are switching to names like "release", "stable", or "next" depending on how they're actually using the branch ("next" would be appropriate in Pacemaker's case). This will probably go on for years, so I am fine with either changing it with 2.1.0 (since it has bigger changes than usual, and we can get ahead of the curve) or waiting until the dust settles and future conventions are
Re: [Cluster-devel] [Linux-cluster] fence-agents-4.0.16 stable release
On 3/5/2015 12:47 PM, Marek marx Grac wrote: Welcome to the fence-agents 4.0.16 release. This release includes several bugfixes and features:
* fence_kdump has implemented a 'monitor' action that checks whether the local node is capable of working with kdump
* the path to snmp(walk|get|set) can be set at runtime
* a new operation 'validate-all' for the majority of agents that checks whether the entered parameters are sufficient, without connecting to the fence device. Be aware that some checks can only be done after we receive information from the fence device, so these are not tested.
* a new operation 'list-status' that presents CSV output (plug_number, plug_alias, plug_status) where status is ON/OFF/UNKNOWN
The Git repository was moved to https://github.com/ClusterLabs/fence-agents/ so this is the last release made from fedorahosted. The new source tarball can be downloaded here: https://fedorahosted.org/releases/f/e/fence-agents/fence-agents-4.0.16.tar.xz To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadmins or power users. There is a new IRC channel in use now: #clusterlabs on Freenode. We are slowly phasing out #linux-cluster and centralizing all cluster-related activities on the new channel. Fabio
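The point of the CSV shape of 'list-status' is that it is trivially machine-consumable; for example, a wrapper script could pick out powered-off plugs with nothing more than awk. A sketch (the sample lines below are made up, since the actual plug numbers and aliases depend on the fence device):

```shell
# Hypothetical 'list-status' output: plug_number,plug_alias,plug_status
printf '1,node1,ON\n2,node2,OFF\n3,spare,UNKNOWN\n' > /tmp/list-status.csv

# Print the aliases of all plugs reported OFF.
awk -F, '$3 == "OFF" { print $2 }' /tmp/list-status.csv
# prints: node2
```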
Re: [Cluster-devel] [ha-wg] [Planning] Organizing HA Summit 2015
All, On 1/13/2015 6:31 AM, Digimer wrote: Hi all, With Fabio away for now, I (and others) are working on the final preparations for the summit. This is your chance to speak up and influence the planning! Objections/suggestions? Speak now please. :) Digimer, I would like to thank you very much for helping with the organization of the summit. I unfortunately have to cancel my travel and won't be able to attend myself. Maybe I'll join some sessions remotely if time allows. I wish everybody a great time in Brno; make the best out of it! I am really looking forward to seeing the outcome when so many brilliant people sit in the same room. Cheers Fabio
Re: [Cluster-devel] [Pacemaker] Wiki for planning created - Re: [RFC] Organizing HA Summit 2015
On 11/28/2014 8:10 PM, Jan Pokorný wrote: On 28/11/14 00:37 -0500, Digimer wrote: On 28/11/14 12:33 AM, Fabio M. Di Nitto wrote: On 11/27/2014 5:52 PM, Digimer wrote: I just created a dedicated/fresh wiki for planning and organizing: http://plan.alteeve.ca/index.php/Main_Page [...] Awesome! Thanks for taking care of it. Do you have a chance to add an instance of etherpad to the site as well? Mostly to do collaborative editing while we all sit around the same table. Otherwise we can use a public instance and copy-paste the info into the wiki afterwards. Never tried setting up etherpad before, but if it runs on rhel 6, I should have no problem setting it up. Provided no conspiracy is to be started, there are a bunch of popular instances, e.g. http://piratepad.net/ Right, but some of them only store etherpads for 30 days. We just need to be careful which one we choose, or we make our own. Fabio
Re: [Cluster-devel] [ha-wg-technical] [Pacemaker] [ha-wg] [Linux-HA] [RFC] Organizing HA Summit 2015
On 11/27/2014 1:33 PM, Kristoffer Grönlund wrote: On 27 Nov 2014, at 2:41 am, Lars Marowsky-Bree l...@suse.com wrote: On 2014-11-25T16:46:01, David Vossel dvos...@redhat.com wrote: Okay, okay, apparently we have got enough topics to discuss. I'll grumble a bit more about Brno, but let's get the organisation of that thing on track ... Sigh. Always so much work! Will Chris Feist be at the summit? Yes :) Fabio I would be happy to have a roundtable discussion or something similar about clients, exchange ideas and so on. I don't necessarily think that there is an urgent need to unify the efforts code-wise, but I think there is a lot we could do together on the level of idea exchange without giving up our independence, so to speak ;) Of course I would be happy to talk about such things with anyone else who is interested as well.
Re: [Cluster-devel] [ha-wg-technical] [Pacemaker] [ha-wg] [Linux-HA] [RFC] Organizing HA Summit 2015
On 11/27/2014 1:33 PM, Kristoffer Grönlund wrote: On 27 Nov 2014, at 2:41 am, Lars Marowsky-Bree l...@suse.com wrote: On 2014-11-25T16:46:01, David Vossel dvos...@redhat.com wrote: Okay, okay, apparently we have got enough topics to discuss. I'll grumble a bit more about Brno, but let's get the organisation of that thing on track ... Sigh. Always so much work! Will Chris Feist be at the summit? I would be happy to have a roundtable discussion or something similar about clients, exchange ideas and so on. I don't necessarily think that there is an urgent need to unify the efforts code-wise, but I think there is a lot we could do together on the level of idea exchange without giving up our independence, so to speak ;) Of course I would be happy to talk about such things with anyone else who is interested as well. sorry, I keep replying from my private email address... Yes Chris will be there too. Fabio
Re: [Cluster-devel] Wiki for planning created - Re: [Pacemaker] [RFC] Organizing HA Summit 2015
On 11/27/2014 5:52 PM, Digimer wrote: I just created a dedicated/fresh wiki for planning and organizing: http://plan.alteeve.ca/index.php/Main_Page Other than the domain, it has no association with any existing project, so it should be a neutral enough platform. Also, it's not owned by $megacorp (I wish!), so spying/privacy shouldn't be an issue I hope. If there is concern, I can set up https. If no one else gets to it before me, I'll start collating the data from the mailing list onto that wiki tomorrow (maaaybe today, depends). The wiki requires registration, but that's it. I'm not bothering with captchas because, in my experience, spammers walk right through them anyway. I do have edits email me, so I can catch and roll back any spam quickly. Awesome! Thanks for taking care of it. Do you have a chance to add an instance of etherpad to the site as well? Mostly to do collaborative editing while we all sit around the same table. Otherwise we can use a public instance and copy-paste the info into the wiki afterwards. Fabio
Re: [Cluster-devel] [ha-wg] [Pacemaker] [Linux-HA] [RFC] Organizing HA Summit 2015
On 11/26/2014 4:41 PM, Lars Marowsky-Bree wrote: On 2014-11-25T16:46:01, David Vossel dvos...@redhat.com wrote: Okay, okay, apparently we have got enough topics to discuss. I'll grumble a bit more about Brno, but let's get the organisation of that thing on track ... Sigh. Always so much work! I'm assuming arrival on the 3rd and departure on the 6th would be the plan? Yes, that's correct. Devconf starts on the 6th. Fabio Personally I'm interested in talking about scaling - with pacemaker-remoted and/or a new messaging/membership layer. If we're going to talk about scaling, we should throw in our new docker support in the same discussion. Docker lends itself well to the pet vs cattle analogy. I see management of docker with pacemaker making quite a bit of sense now that we have the ability to scale into the cattle territory. While we're on that, I'd like to throw in a heretic thought and suggest that one might want to look at etcd and fleetd. Other design-y topics:
- SBD
Point taken. I have actually not forgotten this, Andrew, and am reading your development. I probably just need to pull the code over ...
- degraded mode
- improved notifications
- containerisation of services (cgroups, docker, virt)
- resource-agents (upstream releases, handling of pull requests, testing)
Yep, we definitely need to talk about the resource-agents. Agreed. User-facing topics could include recent features (ie. pacemaker-remoted, crm_resource --restart) and common deployment scenarios (eg. NFS) that people get wrong. Adding to the list, it would be a good idea to talk about deployment integration testing, what's going on with the phd project, and why it's important regardless of whether you're interested in what the project functionally does. OK. So QA is within scope as well. It seems the agenda will fill up quite nicely. Regards, Lars
Re: [Cluster-devel] [ha-wg] [Linux-HA] [RFC] Organizing HA Summit 2015
On 11/25/2014 10:54 AM, Lars Marowsky-Bree wrote: On 2014-11-24T16:16:05, Fabio M. Di Nitto fdini...@redhat.com wrote: Yeah, well, devconf.cz is not such an interesting event for those who do not wear the fedora ;-) That would be the perfect opportunity for you to convert users to Suse ;) I'd prefer, at least for this round, to keep the dates/location and explore the option of allowing people to join remotely. After all, there are tons of tools, between google hangouts and others, that would allow that. That is, in my experience, the absolute worst. It creates second-class participants and is a PITA for everyone. I agree; it is still a way for people to join in, though. I personally disagree. In my experience, one either does a face-to-face meeting, or a virtual one that puts everyone on the same footing. Mixing both works really badly unless the team already knows each other. I know that an in-person meeting is useful, but we have a large team in Beijing, the US, Tasmania (OK, one crazy guy), various countries in Europe etc. Yes, same here. No difference.. we have one crazy guy in Australia.. Yeah, but you're already bringing him for your personal conference. That's a bit different. ;-) OK, let's switch tracks a bit. What *topics* do we actually have? Can we fill two days? Where would we want to collect them? I'd say either a google doc or any random etherpad/wiki instance will do just fine. As for the topics:
- corosync qdevice and plugins (network, disk, integration with sbd?, others?)
- corosync RRP / libknet integration/replacement
- fence autodetection/autoconfiguration
For the user-facing topics (that is, if there are enough participants; I only got 1 user confirmation so far):
- demos, cluster 101, tutorials
- get feedback
- get feedback
- get more feedback
Fabio
Re: [Cluster-devel] [ha-wg] [RFC] Organizing HA Summit 2015
On 11/24/2014 3:39 PM, Lars Marowsky-Bree wrote: On 2014-09-08T12:30:23, Fabio M. Di Nitto fdini...@redhat.com wrote: Folks, Fabio, thanks for organizing this and getting the ball rolling. And again sorry for being late to said game; I was busy elsewhere. However, it seems that the idea for such a HA Summit in Brno/Feb 2015 hasn't exactly fallen on fertile ground, even with the suggested user/client day. (Or if there was a lot of feedback, it wasn't public.) I wonder why that is, and if/how we can make this more attractive? Frankly, as might have been obvious ;-), for me the venue is an issue. It's not easy to reach, and I'm theoretically fairly close in Germany already. I wonder if we could increase participation with a virtual meeting (on either those dates or another), similar to what the Ceph Developer Summit does? Those appear really productive and make it possible for a wide range of interested parties from all over the world to attend, regardless of travel times, or even just attend select sessions (that would otherwise make it hard to justify the travel expenses and time off). Alternatively, would a relocation to a more connected venue help, such as Vienna xor Prague? I'd love to get some more feedback from the community. I agree, some feedback would be useful. As Fabio put it, yes, I *can* suck it up and go to Brno if that's where everyone goes to play ;-), but I'd also prefer to have a broader participation. Dates and location were chosen to piggy-back on devconf.cz and allow people to travel for more than just the HA Summit. I'd prefer, at least for this round, to keep the dates/location and explore the option of allowing people to join remotely. After all, there are tons of tools, between google hangouts and others, that would allow that. Fabio
Re: [Cluster-devel] [ha-wg] [RFC] Organizing HA Summit 2015
On 11/24/2014 4:12 PM, Lars Marowsky-Bree wrote: On 2014-11-24T15:54:33, Fabio M. Di Nitto fdini...@redhat.com wrote: Dates and location were chosen to piggy-back on devconf.cz and allow people to travel for more than just the HA Summit. Yeah, well, devconf.cz is not such an interesting event for those who do not wear the fedora ;-) That would be the perfect opportunity for you to convert users to Suse ;) I'd prefer, at least for this round, to keep the dates/location and explore the option of allowing people to join remotely. After all, there are tons of tools, between google hangouts and others, that would allow that. That is, in my experience, the absolute worst. It creates second-class participants and is a PITA for everyone. I agree; it is still a way for people to join in, though. I know that an in-person meeting is useful, but we have a large team in Beijing, the US, Tasmania (OK, one crazy guy), various countries in Europe etc. Yes, same here. No difference.. we have one crazy guy in Australia.. Fabio
Re: [Cluster-devel] [ha-wg] [ha-wg-technical] [Linux-HA] [RFC] Organizing HA Summit 2015
On 11/5/2014 4:16 PM, Lars Ellenberg wrote: On Sat, Nov 01, 2014 at 01:19:35AM -0400, Digimer wrote: All the cool kids will be there. You want to be a cool kid, right? Well, no. ;-) But I'll still be there, and a few other Linbit'ers as well. Fabio, let us know what we could do to help make it happen. I appreciate the offer. Assuming we achieve quorum to do the event, I'd say that I'll take care of the meeting rooms/hotel logistics and one lunch-and-learn pizza event. It would be nice if others could organize a dinner event. Cheers Fabio Lars On 01/11/14 01:06 AM, Fabio M. Di Nitto wrote: just a kind reminder. On 9/8/2014 12:30 PM, Fabio M. Di Nitto wrote: All, it's been almost 6 years since we had a face-to-face meeting for all developers and vendors involved in Linux HA. I'd like to try and organize a new event and piggy-back on DevConf in Brno [1]. DevConf will start Friday the 6th of Feb 2015 in the Red Hat Brno offices. My suggestion would be to have a 2-day dedicated HA summit on the 4th and the 5th of February. The goal for this meeting is, besides getting to know each other and all the social aspects of those events, to tune the directions of the various HA projects and explore common areas of improvement. I am also very open to the idea of extending it to 3 days, 1 dedicated to customers/users and 2 dedicated to developers, by starting on the 3rd. Thoughts? Fabio PS Please hit reply-all or include me in CC just to make sure I'll see an answer :) [1] http://devconf.cz/ Could you please let me know by end of Nov if you are interested or not? I have heard only from a few people so far. Cheers Fabio
Re: [Cluster-devel] [ha-wg] [RFC] Organizing HA Summit 2015
just a kind reminder. On 9/8/2014 12:30 PM, Fabio M. Di Nitto wrote: All, it's been almost 6 years since we had a face-to-face meeting for all developers and vendors involved in Linux HA. I'd like to try and organize a new event and piggy-back on DevConf in Brno [1]. DevConf will start Friday the 6th of Feb 2015 in the Red Hat Brno offices. My suggestion would be to have a 2-day dedicated HA summit on the 4th and the 5th of February. The goal for this meeting is, besides getting to know each other and all the social aspects of those events, to tune the directions of the various HA projects and explore common areas of improvement. I am also very open to the idea of extending it to 3 days, 1 dedicated to customers/users and 2 dedicated to developers, by starting on the 3rd. Thoughts? Fabio PS Please hit reply-all or include me in CC just to make sure I'll see an answer :) [1] http://devconf.cz/ Could you please let me know by end of Nov if you are interested or not? I have heard only from a few people so far. Cheers Fabio
Re: [Cluster-devel] [Linux-HA] [RFC] Organizing HA Summit 2015
Hi Alan, On 09/09/2014 03:11 PM, Alan Robertson wrote: Hi Fabio, Do you know much about the Brno DevConf? It would be my first visit to DevConf so not much really :) I was wondering if the Assimilation Project might be interesting to the audience there. http://assimilationsystems.com/ http://assimproj.org/ It's related to High Availability in that we monitor systems and services with zero configuration - we even use OCF RAs ;-). Because of that, we could eventually intervene in systems - restarting services, or even migrating them. That's not in current plans, but it is technically very possible. I don't see why not. HA Summit != pacemaker ;) Having a pool of presentations from other HA-related projects would be cool. But it's so much more than that - and HUGELY scalable - 10K servers without breathing hard, and 100K servers without proxies, etc. It also discovers systems, services, dependencies, switch connections, and lots of other things. Basically everything is done with near-zero configuration. We wind up with a graph database describing everything in great detail - and it's continually up to date. Sounds interesting. Would you be willing to join us for a presentation/demo? I don't know if you know me, but I founded the Linux-HA project and led it for about 10 years. Yep, your name is very well known :) Cheers Fabio -- Alan Robertson al...@unix.sh On 09/08/2014 04:30 AM, Fabio M. Di Nitto wrote: All, it's been almost 6 years since we had a face-to-face meeting for all developers and vendors involved in Linux HA. I'd like to try and organize a new event and piggy-back on DevConf in Brno [1]. DevConf will start Friday the 6th of Feb 2015 in the Red Hat Brno offices. My suggestion would be to have a 2-day dedicated HA summit on the 4th and the 5th of February. The goal for this meeting is, besides getting to know each other and all the social aspects of those events, to tune the directions of the various HA projects and explore common areas of improvement.
I am also very open to the idea of extending to 3 days, one dedicated to customers/users and two dedicated to developers, starting on the 3rd. Thoughts? Fabio PS Please hit reply-all or include me in CC just to make sure I'll see an answer :) [1] http://devconf.cz/
Re: [Cluster-devel] [Linux-HA] [RFC] Organizing HA Summit 2015
On 09/09/2014 06:31 PM, Alan Robertson wrote: My apologies for spamming everyone. I thought I deleted all the other email addresses. I failed. Apologies :-( I think it's good that we have an open discussion with all parties involved; I fail to see that as an issue. Apologies not accepted ;) Fabio
[Cluster-devel] [RFC] Organizing HA Summit 2015
All, it's been almost 6 years since we had a face-to-face meeting for all developers and vendors involved in Linux HA. I'd like to try and organize a new event and piggy-back on DevConf in Brno [1]. DevConf will start Friday the 6th of Feb 2015 in the Red Hat Brno offices. My suggestion would be to have a dedicated 2-day HA summit on the 4th and the 5th of February. The goal for this meeting is, besides getting to know each other and all the social aspects of such events, to tune the directions of the various HA projects and explore common areas of improvement. I am also very open to the idea of extending to 3 days, one dedicated to customers/users and two dedicated to developers, starting on the 3rd. Thoughts? Fabio PS Please hit reply-all or include me in CC just to make sure I'll see an answer :) [1] http://devconf.cz/
Re: [Cluster-devel] [PATCH]fence-virtd: Fix typo in debug mesage of do_fence_request_tcp
On 05/15/2014 08:45 PM, Masatake YAMATO wrote: I'm sorry. I should post this to the linux-cluster list. nope, cluster-devel is the right place! thanks for the patch. Masatake YAMATO fence-virtd: Fix typo in debug mesage of do_fence_request_tcp Signed-off-by: Masatake YAMATO yam...@redhat.com --- server/mcast.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/server/mcast.c b/server/mcast.c index e850ec7..5fbe46a 100644 --- a/server/mcast.c +++ b/server/mcast.c @@ -250,7 +250,7 @@ do_fence_request_tcp(fence_req_t *req, mcast_info *info) fd = connect_tcp(req, info->args.auth, info->key, info->key_len); if (fd < 0) { -dbg_printf(2, "Could call back for fence request: %s\n", +dbg_printf(2, "Could not call back for fence request: %s\n", strerror(errno)); goto out; } -- 1.9.0
Re: [Cluster-devel] [cluster.git/STABLE32][PATCH] xml: ccs_update_schema: be verbose about extraction fail
ACK Fabio On 4/29/2014 11:30 PM, Jan Pokorný wrote: Previously, the distillation of resource-agents' metadata could fail for unexpected reasons without any evidence ever being made, unlike in the case of fence-agents. Also, missing metadata and an issue with their extraction will allegedly yield the same outcome, so this is reflected in the comments being emitted to the schema for both sorts of agents. Signed-off-by: Jan Pokorný jpoko...@redhat.com --- config/tools/xml/ccs_update_schema.in | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/config/tools/xml/ccs_update_schema.in b/config/tools/xml/ccs_update_schema.in index 98ed885..b63c987 100644 --- a/config/tools/xml/ccs_update_schema.in +++ b/config/tools/xml/ccs_update_schema.in @@ -215,6 +215,9 @@ generate_ras() { lecho ras: processing $(basename $i) $i meta-data 2>/dev/null | xsltproc $rngdir/ra2rng.xsl - \ >> $outputdir/resources.rng.cache 2>/dev/null + [ $? != 0 ] && \ + echo "<!-- Problem evaluating metadata for $i" \ + " -->" >> $outputdir/resources.rng.cache done cat $rngdir/resources.rng.mid >> $outputdir/resources.rng.cache lecho ras: generating ref data @@ -301,8 +304,8 @@ generate_fas() { xsltproc $rngdir/fence2rng.xsl - \ >> $outputdir/fence_agents.rng.cache 2>/dev/null [ $? != 0 ] && \ - echo "<!-- No metadata for $i -->" \ - >> $outputdir/fence_agents.rng.cache + echo "<!-- Problem evaluating metadata for $i" \ + " -->" >> $outputdir/fence_agents.rng.cache done cat $rngdir/fence.rng.tail >> $outputdir/fence_agents.rng.cache }
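The annotate-on-failure pattern in Jan's patch — run the metadata-extraction pipeline and, if it fails, append an XML comment so the generated schema itself records the problem — can be sketched as follows. This is a hypothetical illustration, not the actual ccs_update_schema code; the agent path and `extract_metadata` stand-in are invented for the demo.

```shell
#!/bin/sh
# Hypothetical sketch of the annotate-on-failure pattern: run the
# extraction pipeline; on failure, leave an XML comment in the
# generated schema cache so the failure is visible later.
outputdir=$(mktemp -d)
cache="$outputdir/resources.rng.cache"
: > "$cache"

extract_metadata() {
    # stand-in for: "$1" meta-data 2>/dev/null | xsltproc ra2rng.xsl -
    false
}

agent="/usr/share/cluster/dummy.sh"   # illustrative agent path
extract_metadata "$agent" >> "$cache" 2>/dev/null
[ $? != 0 ] && \
    echo "<!-- Problem evaluating metadata for $agent -->" >> "$cache"

grep -c 'Problem evaluating' "$cache"
```

Because the marker lands inside the cache file that later becomes cluster.rng, a broken agent leaves a visible trace in the shipped schema instead of failing silently — which is exactly the behavioural change the patch describes.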
Re: [Cluster-devel] [PATCH] fencing: Replace printing to stderr with proper logging solution
On 04/02/2014 05:06 PM, Marek 'marx' Grac wrote: This patch replaces local solutions with the standard Python logging module. The levels of messages are not final; they just reflect the previous state. So, the debug level is available only with the -v / verbose option. Hi Marek, are we keeping out-of-tree agents in sync too? specifically fence_virt and fence_sanlock. Fabio
Re: [Cluster-devel] [PATCH] fencing: Add support for ipmitool/amttool binaries during autoconf
Thanks for doing it, we still need to change the agent to use IPMITOOL_PATH & co. :) Fabio On 12/02/2013 04:39 PM, Marek 'marx' Grac wrote: Configuration of autoconf was extended to dynamically find ipmitool/amttool. If the binary is not found on the system then we will switch to default values (Fedora/RHEL). The path to the binaries is exported and replaced in fencebuild using the same process as the version number or sbin/logdir. --- configure.ac | 6 ++ make/fencebuild.mk | 2 ++ 2 files changed, 8 insertions(+), 0 deletions(-) diff --git a/configure.ac b/configure.ac index 6f4baa0..02c46b8 100644 --- a/configure.ac +++ b/configure.ac @@ -163,6 +163,9 @@ LOGDIR=${localstatedir}/log/cluster CLUSTERVARRUN=${localstatedir}/run/cluster CLUSTERDATA=${datadir}/cluster +## path to 3rd-party binaries +AC_PATH_PROG([IPMITOOL_PATH], [ipmitool], [/usr/bin/ipmitool]) +AC_PATH_PROG([AMTTOOL_PATH], [amttool], [/usr/bin/amttool]) ## do subst AC_SUBST([DEFAULT_CONFIG_DIR]) @@ -187,6 +190,9 @@ AC_SUBST([SNMPBIN]) AC_SUBST([AGENTS_LIST]) AM_CONDITIONAL(BUILD_XENAPILIB, test $XENAPILIB -eq 1) +AC_SUBST([IPMITOOL_PATH]) +AC_SUBST([AMTTOOL_PATH]) + ## *FLAGS handling ENV_CFLAGS=$CFLAGS diff --git a/make/fencebuild.mk b/make/fencebuild.mk index 15a47fd..5cbe3bd 100644 --- a/make/fencebuild.mk +++ b/make/fencebuild.mk @@ -9,6 +9,8 @@ $(TARGET): $(SRC) -e 's#@''LOGDIR@#${LOGDIR}#g' \ -e 's#@''SBINDIR@#${sbindir}#g' \ -e 's#@''LIBEXECDIR@#${libexecdir}#g' \ + -e 's#@''IPMITOOL_PATH#${IPMITOOL_PATH}#g' \ + -e 's#@''AMTTOOL_PATH#${AMTTOOL_PATH}#g' \ > $@ if [ 0 -eq `echo $(SRC) | grep fence_ > /dev/null; echo $$?` ]; then \
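The mechanism Marek's patch wires up — configure discovers the binary via AC_PATH_PROG, then the Makefile sed-substitutes an `@IPMITOOL_PATH@` placeholder in the agent source — can be re-created in a few lines. This is an illustrative sketch, not the actual fencebuild.mk rule; the value is hardcoded here where configure would normally supply it.

```shell
#!/bin/sh
# Illustrative re-creation of the fencebuild.mk substitution step:
# AC_PATH_PROG would normally set IPMITOOL_PATH; hardcoded for demo.
IPMITOOL_PATH=/usr/bin/ipmitool

src=$(mktemp)
tgt=$(mktemp)
printf 'IPMITOOL = "@IPMITOOL_PATH@"\n' > "$src"

# same s#@NAME@#value#g idea as the sed rule in make/fencebuild.mk
sed -e "s#@IPMITOOL_PATH@#${IPMITOOL_PATH}#g" "$src" > "$tgt"
cat "$tgt"
```

Note the `#` delimiter in the sed expression: the substituted value is a filesystem path containing `/`, so using `#` avoids having to escape every slash.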
Re: [Cluster-devel] [PATCH 2/3] fence_ipmilan: option --method and new option --ipmitool-path
On 11/29/2013 05:32 PM, Ondrej Mular wrote: Add support for option --method and new option --ipmitool-path --- fence/agents/ipmilan/fence_ipmilan.py | 80 +++ 1 file changed, 54 insertions(+), 26 deletions(-) diff --git a/fence/agents/ipmilan/fence_ipmilan.py b/fence/agents/ipmilan/fence_ipmilan.py index 5c32690..4d33234 100644 --- a/fence/agents/ipmilan/fence_ipmilan.py +++ b/fence/agents/ipmilan/fence_ipmilan.py @@ -11,14 +11,6 @@ REDHAT_COPYRIGHT= BUILD_DATE= #END_VERSION_GENERATION -PATHS = [/usr/local/bull/NSMasterHW/bin/ipmitool, -/usr/bin/ipmitool, -/usr/sbin/ipmitool, -/bin/ipmitool, -/sbin/ipmitool, -/usr/local/bin/ipmitool, -/usr/local/sbin/ipmitool] - def get_power_status(_, options): cmd = create_command(options, status) @@ -28,9 +20,8 @@ def get_power_status(_, options): try: process = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE, stderr=subprocess.PIPE) -except OSError, ex: -print ex -fail(EC_TOOL_FAIL) +except OSError: +fail_usage(Ipmitool not found or not accessible) process.wait() @@ -54,13 +45,31 @@ def set_power_status(_, options): process = subprocess.Popen(shlex.split(cmd), stdout=null, stderr=null) except OSError: null.close() -fail(EC_TOOL_FAIL) +fail_usage(Ipmitool not found or not accessible) process.wait() null.close() return +def reboot_cycle(_, options): +cmd = create_command(options, cycle) + +if options[log] = LOG_MODE_VERBOSE: +options[debug_fh].write(executing: + cmd + \n) + +try: +process = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE, stderr=subprocess.PIPE) +except OSError: +fail_usage(Ipmitool not found or not accessible) + +process.wait() + +out = process.communicate() +process.stdout.close() + +return bool(re.search('chassis power control: cycle', str(out).lower())) + def is_executable(path): if os.path.exists(path): stats = os.stat(path) @@ -68,13 +77,17 @@ def is_executable(path): return True return False -def get_ipmitool_path(): -for path in PATHS: -if is_executable(path): -return path +def 
get_ipmitool_path(options): +if type(options[--ipmitool-path]) == type(list()): +for path in options[--ipmitool-path]: +if is_executable(path): +return path +else: +if is_executable(options[--ipmitool-path]): +return options[--ipmitool-path] return None -def create_command(options, action): +def create_command(options, action): cmd = options[ipmitool_path] # --lanplus / -L @@ -120,7 +133,7 @@ def define_new_opts(): all_opt[lanplus] = { getopt : L, longopt : lanplus, -help : -L, --lanplusUse Lanplus to improve security of connection, +help : -L, --lanplus Use Lanplus to improve security of connection, required : 0, shortdesc : Use Lanplus to improve security of connection, order: 1 @@ -128,7 +141,7 @@ def define_new_opts(): all_opt[auth] = { getopt : A:, longopt : auth, -help : -A, --auth=[auth]IPMI Lan Auth type (md5|password|none), +help : -A, --auth=[auth] IPMI Lan Auth type (md5|password|none), required : 0, shortdesc : IPMI Lan Auth type., default : none, @@ -138,7 +151,7 @@ def define_new_opts(): all_opt[cipher] = { getopt : C:, longopt : cipher, -help : -C, --cipher=[cipher]Ciphersuite to use (same as ipmitool -C parameter), +help : -C, --cipher=[cipher] Ciphersuite to use (same as ipmitool -C parameter), required : 0, shortdesc : Ciphersuite to use (same as ipmitool -C parameter), default : 0, @@ -147,28 +160,44 @@ def define_new_opts(): all_opt[privlvl] = { getopt : P:, longopt : privlvl, -help : -P, --privlvl=[level]Privilege level on IPMI device (callback|user|operator|administrator), +help : -P, --privlvl=[level] Privilege level on IPMI device (callback|user|operator|administrator), required : 0, shortdesc : Privilege level on IPMI device, default : administrator, choices : [callback, user, operator, administrator], order: 1 } +all_opt[ipmitool_path] = { +getopt : i:, +longopt : ipmitool-path, +help : --ipmitool-path=[path] Path to ipmitool binary, +required : 0, +shortdesc : Path to ipmitool binary, +default : [/usr/local/bull/NSMasterHW/bin/ipmitool, 
+/usr/bin/ipmitool, +
Re: [Cluster-devel] [PATCH 3/3] fence_amt: option --method and new option --amttool-path
On 11/29/2013 05:32 PM, Ondrej Mular wrote: Add support for option --method and new option --amttool-path --- fence/agents/amt/fence_amt.py | 72 ++- 1 file changed, 57 insertions(+), 15 deletions(-) diff --git a/fence/agents/amt/fence_amt.py b/fence/agents/amt/fence_amt.py index 8fe2dbc..7077828 100755 --- a/fence/agents/amt/fence_amt.py +++ b/fence/agents/amt/fence_amt.py @@ -1,6 +1,6 @@ #!/usr/bin/python -import sys, subprocess, re +import sys, subprocess, re, os, stat from pipes import quote sys.path.append(@FENCEAGENTSLIBDIR@) from fencing import * @@ -21,12 +21,11 @@ def get_power_status(_, options): try: process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True) except OSError: -fail(EC_TOOL_FAIL) +fail_usage(Amttool not found or not accessible) process.wait() output = process.communicate() - process.stdout.close() match = re.search('Powerstate:[\\s]*(..)', str(output)) @@ -51,19 +50,44 @@ def set_power_status(_, options): process = subprocess.Popen(cmd, stdout=null, stderr=null, shell=True) except OSError: null.close() -fail(EC_TOOL_FAIL) +fail_usage(Amttool not found or not accessible) process.wait() null.close() return +def reboot_cycle(_, options): +cmd = create_command(options, cycle) + +if options[log] = LOG_MODE_VERBOSE: +options[debug_fh].write(executing: + cmd + \n) + +null = open('/dev/null', 'w') +try: +process = subprocess.Popen(cmd, stdout=null, stderr=null, shell=True) +except OSError: +null.close() +fail_usage(Amttool not found or not accessible) + +status = process.wait() +null.close() + +return not bool(status) + +def is_executable(path): +if os.path.exists(path): +stats = os.stat(path) +if stat.S_ISREG(stats.st_mode) and os.access(path, os.X_OK): +return True +return False + def create_command(options, action): # --password / -p cmd = AMT_PASSWORD= + quote(options[--password]) -cmd += + options[amttool_path] +cmd += + options[--amttool-path] # --ip / -a cmd += + options[--ip] @@ -77,7 +101,10 @@ def 
create_command(options, action): elif action == off: cmd = echo \y\| + cmd cmd += powerdown -if action in [on, off] and options.has_key(--boot-options): +elif action == cycle: +cmd = echo \y\| + cmd +cmd += powercycle +if action in [on, off, cycle] and options.has_key(--boot-options): cmd += options[--boot-options] # --use-sudo / -d @@ -86,25 +113,40 @@ def create_command(options, action): return cmd -def main(): - -atexit.register(atexit_handler) - -device_opt = [ ipaddr, no_login, passwd, boot_option, no_port, sudo] - +def define_new_opts(): all_opt[boot_option] = { getopt : b:, longopt : boot-option, -help:-b, --boot-option=[option] Change the default boot behavior of the machine. (pxe|hd|hdsafe|cd|diag), +help:-b, --boot-option=[option] Change the default boot behavior of the machine. (pxe|hd|hdsafe|cd|diag), required : 0, shortdesc : Change the default boot behavior of the machine., choices : [pxe, hd, hdsafe, cd, diag], order : 1 } +all_opt[amttool_path] = { +getopt : i:, +longopt : amttool-path, +help : --amttool-path=[path] Path to amttool binary, +required : 0, +shortdesc : Path to amttool binary, +default : /usr/bin/amttool, similar here. Hardcoding paths is bad. Fabio
Re: [Cluster-devel] [PATCH 2/3] fence_ipmilan: option --method and new option --ipmitool-path
On 11/29/2013 05:32 PM, Ondrej Mular wrote: @@ -147,28 +160,44 @@ def define_new_opts(): all_opt[privlvl] = { getopt : P:, longopt : privlvl, -help : -P, --privlvl=[level]Privilege level on IPMI device (callback|user|operator|administrator), +help : -P, --privlvl=[level] Privilege level on IPMI device (callback|user|operator|administrator), All the reformatting and cosmetic changes should be in a separate commit. Also, this patch assumes that the first patch you posted is applied to the tree. It's not. Sending incremental patches over patches makes it difficult to rebuild the final binary and test it (yes I have ipmi devices at home :)) Fabio
Re: [Cluster-devel] [PATCH 1/2] fence_ipmilan: port fencing agent to fencing library
On 11/22/2013 5:18 PM, Jan Pokorný wrote: On 21/11/13 16:48 +0100, Fabio M. Di Nitto wrote: On 11/21/2013 4:16 PM, Ondrej Mular wrote: +PATHS = ["/usr/local/bull/NSMasterHW/bin/ipmitool", +"/usr/bin/ipmitool", +"/usr/sbin/ipmitool", +"/bin/ipmitool", +"/sbin/ipmitool", +"/usr/local/bin/ipmitool", +"/usr/local/sbin/ipmitool"] this hard-coding is bad. Always use the OS-defined PATH and, if really necessary, allow the user to override it with an option (for example: --pathtoipmitool=/usr/local) see, e.g., http://git.engineering.redhat.com/users/jpokorny/clufter/tree/utils.py?id=d37db7470f4e44598af0b91d02221182178677ff#n22 that mimics the `which` standard utility Hope this helps I'd like to understand why we need a search path in the first place though, and why we can't just rely on the shell hitting the right tool :) Fabio
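Fabio's suggestion — honour an explicit user override, otherwise defer to the OS-defined PATH instead of walking a hardcoded directory list — could look roughly like this in shell. This is a hypothetical helper, not the agents' actual code; the function name and arguments are invented.

```shell
#!/bin/sh
# Hypothetical: prefer an explicit user override if it is executable,
# otherwise let the OS PATH decide via command -v, instead of probing
# a hardcoded list of candidate directories.
find_tool() {
    override=$1
    name=$2
    if [ -n "$override" ] && [ -x "$override" ]; then
        printf '%s\n' "$override"
        return 0
    fi
    command -v "$name"    # prints nothing and fails if not installed
}

find_tool "" sh                         # resolved via PATH
find_tool "" no-such-tool-xyz || echo "not found"
```

`command -v` is the POSIX-specified equivalent of `which`, so this keeps the lookup portable while still letting `--ipmitool-path`-style options win when the user knows better.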
Re: [Cluster-devel] [PATCH 1/2] fence_ipmilan: port fencing agent to fencing library
Hi Ondrej, On 11/21/2013 4:16 PM, Ondrej Mular wrote: This is port of fence_ipmilan to fencing library. Also added fail message to fencing library if tool (e.g. impitool, amttool...) is not accessible. --- fence/agents/ipmilan/fence_ipmilan.py | 184 ++ fence/agents/lib/fencing.py.py| 4 +- 2 files changed, 187 insertions(+), 1 deletion(-) create mode 100644 fence/agents/ipmilan/fence_ipmilan.py diff --git a/fence/agents/ipmilan/fence_ipmilan.py b/fence/agents/ipmilan/fence_ipmilan.py new file mode 100644 index 000..5c32690 --- /dev/null +++ b/fence/agents/ipmilan/fence_ipmilan.py @@ -0,0 +1,184 @@ +#!/usr/bin/python + +import sys, shlex, stat, subprocess, re, os +from pipes import quote +sys.path.append(@FENCEAGENTSLIBDIR@) +from fencing import * + +#BEGIN_VERSION_GENERATION +RELEASE_VERSION= +REDHAT_COPYRIGHT= +BUILD_DATE= +#END_VERSION_GENERATION + +PATHS = [/usr/local/bull/NSMasterHW/bin/ipmitool, +/usr/bin/ipmitool, +/usr/sbin/ipmitool, +/bin/ipmitool, +/sbin/ipmitool, +/usr/local/bin/ipmitool, +/usr/local/sbin/ipmitool] this hard-cording it bad. 
Always use OS define PATH and if really necessary allow user to override with an option (for example: --pathtoipmitool=/usr/local) Fabio + +def get_power_status(_, options): + +cmd = create_command(options, status) + +if options[log] = LOG_MODE_VERBOSE: +options[debug_fh].write(executing: + cmd + \n) + +try: +process = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE, stderr=subprocess.PIPE) +except OSError, ex: +print ex +fail(EC_TOOL_FAIL) + +process.wait() + +out = process.communicate() +process.stdout.close() + +match = re.search('[Cc]hassis [Pp]ower is [\\s]*([a-zA-Z]{2,3})', str(out)) +status = match.group(1) if match else None + +return status + +def set_power_status(_, options): + +cmd = create_command(options, options[--action]) + +if options[log] = LOG_MODE_VERBOSE: +options[debug_fh].write(executing: + cmd + \n) + +null = open('/dev/null', 'w') +try: +process = subprocess.Popen(shlex.split(cmd), stdout=null, stderr=null) +except OSError: +null.close() +fail(EC_TOOL_FAIL) + +process.wait() +null.close() + +return + +def is_executable(path): +if os.path.exists(path): +stats = os.stat(path) +if stat.S_ISREG(stats.st_mode) and os.access(path, os.X_OK): +return True +return False + +def get_ipmitool_path(): +for path in PATHS: +if is_executable(path): +return path +return None + +def create_command(options, action): +cmd = options[ipmitool_path] + +# --lanplus / -L +if options.has_key(--lanplus): +cmd += -I lanplus +else: +cmd += -I lan +# --ip / -a +cmd += -H + options[--ip] + +# --username / -l +if options.has_key(--username) and len(options[--username]) != 0: +cmd += -U + quote(options[--username]) + +# --auth / -A +if options.has_key(--auth): +cmd += -A + options[--auth] + +# --password / -p +if options.has_key(--password): +cmd += -P + quote(options[--password]) + +# --cipher / -C +cmd += -C + options[--cipher] + +# --port / -n +if options.has_key(--ipport): +cmd += -p + options[--ipport] + +if options.has_key(--privlvl): +cmd += -L + 
options[--privlvl] + +# --action / -o +cmd += chassis power + action + + # --use-sudo / -d +if options.has_key(--use-sudo): +cmd = SUDO_PATH + + cmd + +return cmd + +def define_new_opts(): +all_opt[lanplus] = { +getopt : L, +longopt : lanplus, +help : -L, --lanplusUse Lanplus to improve security of connection, +required : 0, +shortdesc : Use Lanplus to improve security of connection, +order: 1 +} +all_opt[auth] = { +getopt : A:, +longopt : auth, +help : -A, --auth=[auth]IPMI Lan Auth type (md5|password|none), +required : 0, +shortdesc : IPMI Lan Auth type., +default : none, +choices : [md5, password, none], +order: 1 +} +all_opt[cipher] = { +getopt : C:, +longopt : cipher, +help : -C, --cipher=[cipher]Ciphersuite to use (same as ipmitool -C parameter), +required : 0, +shortdesc : Ciphersuite to use (same as ipmitool -C parameter), +default : 0, +order: 1 +} +all_opt[privlvl] = { +getopt : P:, +longopt : privlvl, +help : -P, --privlvl=[level]Privilege level on IPMI device (callback|user|operator|administrator), +required :
Re: [Cluster-devel] fence-agents: master - fence_ipmilan: Better description of lanplus parameter
Hi Marek, On 7/18/2013 12:04 PM, Marek Grác wrote: Gitweb: http://git.fedorahosted.org/git/?p=fence-agents.git;a=commitdiff;h=7117a54a55aafb9f6ea97fe7b3a7b56355f609e4 Commit: 7117a54a55aafb9f6ea97fe7b3a7b56355f609e4 Parent: c61430f65c843c4e4b7b3487f378d306efe1d52a Author: Marek 'marx' Grac mg...@redhat.com AuthorDate: Thu Jul 18 12:04:01 2013 +0200 Committer: Marek 'marx' Grac mg...@redhat.com CommitterDate: Thu Jul 18 12:04:01 2013 +0200 fence_ipmilan: Better description of lanplus parameter resolves: rhbz#981086 --- fence/agents/ipmilan/ipmilan.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fence/agents/ipmilan/ipmilan.c b/fence/agents/ipmilan/ipmilan.c index 4d286ea..3561456 100644 --- a/fence/agents/ipmilan/ipmilan.c +++ b/fence/agents/ipmilan/ipmilan.c @@ -167,7 +167,7 @@ struct xml_parameter_s xml_parameters[]={ {"ipaddr","-a",1,"string",NULL,"IPMI Lan IP to talk to"}, {"passwd","-p",0,"string",NULL,"Password (if required) to control power on IPMI device"}, {"passwd_script","-S",0,"string",NULL,"Script to retrieve password (if required)"}, - {"lanplus","-P",0,"boolean",NULL,"Use Lanplus"}, + {"lanplus","-P",0,"boolean",NULL,"Use Lanplus to improve security of connection"}, Can you be just a bit more descriptive and explain what "improve security" means? thanks Fabio
Re: [Cluster-devel] [PATCH] fsck.gfs2: Don't rely on cluster.conf when rebuilding sb
You also want to get rid of this code in RHEL6 btw. It's just broken in many different ways. Fabio On 07/17/2013 01:51 PM, Andrew Price wrote: As cluster.conf no longer exists we can't sniff the locking options from it when rebuilding the superblock and in any case we shouldn't assume that fsck.gfs2 is running on the cluster the volume belongs to. This patch removes the get_lockproto_table function and instead sets the lock table name to a placeholder (unknown) and sets lockproto to lock_dlm. It warns the user at the end of the run that the locktable will need to be set before mounting. Signed-off-by: Andrew Price anpr...@redhat.com --- gfs2/fsck/initialize.c | 57 -- gfs2/fsck/main.c | 4 2 files changed, 8 insertions(+), 53 deletions(-) diff --git a/gfs2/fsck/initialize.c b/gfs2/fsck/initialize.c index b01b240..869d2de 100644 --- a/gfs2/fsck/initialize.c +++ b/gfs2/fsck/initialize.c @@ -33,6 +33,7 @@ static int was_mounted_ro = 0; static uint64_t possible_root = HIGHEST_BLOCK; static struct master_dir fix_md; static unsigned long long blks_2free = 0; +extern int sb_fixed; /** * block_mounters @@ -828,58 +829,6 @@ static int init_system_inodes(struct gfs2_sbd *sdp) return -1; } -static int get_lockproto_table(struct gfs2_sbd *sdp) -{ - FILE *fp; - char line[PATH_MAX]; - char *cluname, *end; - const char *fsname, *cfgfile = /etc/cluster/cluster.conf; - - memset(sdp-lockproto, 0, sizeof(sdp-lockproto)); - memset(sdp-locktable, 0, sizeof(sdp-locktable)); - fp = fopen(cfgfile, rt); - if (!fp) { - /* no cluster.conf; must be a stand-alone file system */ - strcpy(sdp-lockproto, lock_nolock); - log_warn(_(Lock protocol determined to be: lock_nolock\n)); - log_warn(_(Stand-alone file system: No need for a lock -table.\n)); - return 0; - } - /* We found a cluster.conf so assume it's a clustered file system */ - log_warn(_(Lock protocol assumed to be: GFS2_DEFAULT_LOCKPROTO -\n)); - strcpy(sdp-lockproto, GFS2_DEFAULT_LOCKPROTO); - - while (fgets(line, sizeof(line) - 1, fp)) { 
- cluname = strstr(line,cluster name=); - if (cluname) { - cluname += 15; - end = strchr(cluname,''); - if (end) - *end = '\0'; - break; - } - } - if (cluname == NULL || end == NULL || end - cluname 1) { - log_err(_(Error: Unable to determine cluster name from %s\n), - cfgfile); - } else { - fsname = strrchr(opts.device, '/'); - if (fsname) - fsname++; - else - fsname = repaired; - snprintf(sdp-locktable, sizeof(sdp-locktable), %.*s:%.16s, - (int)(sizeof(sdp-locktable) - strlen(fsname) - 2), - cluname, fsname); - log_warn(_(Lock table determined to be: %s\n), - sdp-locktable); - } - fclose(fp); - return 0; -} - /** * is_journal_copy - Is this a real dinode or a copy inside a journal? * A real dinode will be located at the block number in its no_addr. @@ -1256,7 +1205,8 @@ static int sb_repair(struct gfs2_sbd *sdp) } } /* Step 3 - Rebuild the lock protocol and file system table name */ - get_lockproto_table(sdp); + strcpy(sdp-lockproto, GFS2_DEFAULT_LOCKPROTO); + strcpy(sdp-locktable, unknown); if (query(_(Okay to fix the GFS2 superblock? (y/n { log_info(_(Found system master directory at: 0x%llx\n), sdp-sd_sb.sb_master_dir.no_addr); @@ -1280,6 +1230,7 @@ static int sb_repair(struct gfs2_sbd *sdp) build_sb(sdp, uuid); inode_put(sdp-md.rooti); inode_put(sdp-master_dir); + sb_fixed = 1; } else { log_crit(_(GFS2 superblock not fixed; fsck cannot proceed without a valid superblock.\n)); diff --git a/gfs2/fsck/main.c b/gfs2/fsck/main.c index 9c3b06d..f9e7166 100644 --- a/gfs2/fsck/main.c +++ b/gfs2/fsck/main.c @@ -36,6 +36,7 @@ struct osi_root dirtree = (struct osi_root) { NULL, }; struct osi_root inodetree = (struct osi_root) { NULL, }; int dups_found = 0, dups_found_first = 0; struct gfs_sb *sbd1 = NULL; +int sb_fixed = 0; /* This function is for libgfs2's sake. */ void print_it(const char *label, const char *fmt, const char *fmt2, ...) @@ -315,6 +316,9 @@ int main(int argc, char **argv) log_notice( _(Writing changes to disk\n));
Re: [Cluster-devel] [gfs2-utils PATCH 1/7] fsck.gfs2: Fix reference to uninitialized variable
On 07/16/2013 02:56 PM, Bob Peterson wrote: This patch initializes a variable so that it is no longer referenced while uninitialized. rhbz#984085 --- gfs2/fsck/initialize.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gfs2/fsck/initialize.c b/gfs2/fsck/initialize.c index b01b240..936fd5e 100644 --- a/gfs2/fsck/initialize.c +++ b/gfs2/fsck/initialize.c @@ -832,7 +832,7 @@ static int get_lockproto_table(struct gfs2_sbd *sdp) { FILE *fp; char line[PATH_MAX]; - char *cluname, *end; + char *cluname, *end = NULL; const char *fsname, *cfgfile = "/etc/cluster/cluster.conf"; Just spotted this reference to cluster.conf ^^ remember it doesn't exist anymore in the new era. Fabio
Re: [Cluster-devel] qdisk - memcpy incorrect(?)
This is already fixed in more recent releases. See commit: 8edb0d0eb31d94b8a3ba81f6d5b4c398accc950d your patch also misses another incorrect memcpy() in diskRawWrite. Fabio On 05/16/2013 08:00 PM, Neale Ferguson wrote: Hi, In diskRawRead in disk.c there is the following code: readret = posix_memalign((void **)&alignedBuf, disk->d_pagesz, disk->d_blksz); if (readret < 0) { return -1; } io_state(STATE_READ); readret = read(disk->d_fd, alignedBuf, readlen); io_state(STATE_NONE); if (readret > 0) { if (readret > len) { memcpy(alignedBuf, buf, len); readret = len; } else { memcpy(alignedBuf, buf, readret); } } free(alignedBuf); The memcpy() calls above have the src/dst operands swapped. We read into alignedBuf and are supposed to copy to buf. I'm not sure why qdiskd works sometimes and not others. --- cluster-3.0.12.1/cman/qdisk/disk.c 2013/05/16 16:45:49 1.1 +++ cluster-3.0.12.1/cman/qdisk/disk.c 2013/05/16 16:46:29 @@ -430,14 +430,14 @@ io_state(STATE_READ); readret = read(disk->d_fd, alignedBuf, readlen); io_state(STATE_NONE); if (readret > 0) { if (readret > len) { -memcpy(alignedBuf, buf, len); +memcpy(buf, alignedBuf, len); readret = len; } else { -memcpy(alignedBuf, buf, readret); +memcpy(buf, alignedBuf, readret); } } free(alignedBuf); if (readret != len) { Neale
Re: [Cluster-devel] Heads-up: retiring gfs_controld
On 02/15/2013 02:18 PM, Andrew Price wrote: Hi, Now that Fedora 16 has EOL'd we have little reason to keep gfs_controld and gfs_control in gfs2-utils. They're currently disabled by default but can be enabled with the configure option --enable-gfs_controld, which adds additional dependencies on corosynclib, clusterlib (discontinued) and openaislib (discontinued). My intention is to remove gfs_control* from gfs2-utils.git before the next release unless there are any good reasons to keep them around. Andy Just make sure it is clear from which exact kernel version it is possible to operate without gfs_control*, so that maintainers will not try to backport to Linux 1.0. Fabio
Re: [Cluster-devel] [PATCH] config/tools/xml: validate resulting cluster.rng with relaxng.rng
Hi Jan, On 2/6/2013 9:47 PM, Jan Pokorný wrote: Doing so will guarantee the file is valid RELAX NG schema, not just a valid XML. Validating schema, relaxng.rng, was obtained directly from [1] and matches directly to a version bundled with xmlcopyeditor in Fedora 17. The same (modulo VCS headers, comments and spacing details) can be obtained by combining schema as in the specification [2] and its errata [3]. [1] http://relaxng.org/relaxng.rng [2] http://relaxng.org/spec-20011203.html [3] http://relaxng.org/spec-20011203-errata.html this looks like a good idea, but i have one question. Is there a specific reason why we need to ship/embed the file with our tarball? How bad is it to require the one installed on a system? I can see it´s rather stable and hardly updated, but i prefer to avoid duplication if we can. Fabio Signed-off-by: Jan Pokorný jpoko...@redhat.com --- config/tools/xml/Makefile | 2 +- config/tools/xml/ccs_update_schema.in | 3 +- config/tools/xml/relaxng.rng | 335 ++ 3 files changed, 338 insertions(+), 2 deletions(-) create mode 100644 config/tools/xml/relaxng.rng diff --git a/config/tools/xml/Makefile b/config/tools/xml/Makefile index 3c9e97c..a86eb01 100644 --- a/config/tools/xml/Makefile +++ b/config/tools/xml/Makefile @@ -7,7 +7,7 @@ TARGET4 = cluster.rng SBINDIRT = $(TARGET1) $(TARGET2) $(TARGET3) SHAREDIRSYMT = $(TARGET4) -RELAXNGDIRT = cluster.rng.in.head cluster.rng.in.tail +RELAXNGDIRT = cluster.rng.in.head cluster.rng.in.tail relaxng.rng all: $(TARGET1) $(TARGET2) $(TARGET3) $(TARGET4) diff --git a/config/tools/xml/ccs_update_schema.in b/config/tools/xml/ccs_update_schema.in index a5aa351..16ce9f7 100644 --- a/config/tools/xml/ccs_update_schema.in +++ b/config/tools/xml/ccs_update_schema.in @@ -316,7 +316,8 @@ build_schema() { return 1 } - xmllint --noout $outputdir/cluster.rng || { + xmllint --noout --relaxng $rngdir/relaxng.rng $outputdir/cluster.rng \ + || { echo generated schema does not pass xmllint validation 2 return 1 }
Re: [Cluster-devel] [PATCH] cman: Prevent libcman from causing SIGPIPE
ACK On 12/17/2012 10:23 AM, Christine Caulfield wrote: If corosync goes down/is shut down, cman will return 0 from cman_dispatch and close the socket. However, if a cman write operation is issued before this happens then SIGPIPE can result from the writev() call to an open, but disconnected, FD. This patch changes writev() to sendmsg() so it can pass MSG_NOSIGNAL to the system call and prevent SIGPIPEs from occurring. Signed-Off-By: Christine Caulfield ccaul...@redhat.com
Re: [Cluster-devel] Fence agents - supported fence devices in next major release
On 11/25/2012 02:55 PM, Marek Grac wrote: Hi, In the next major version of fence agents we would like to include only those fence agents which are used and can be tested. We have access to various fence devices but there is still a need for more, and we would like to include you and your hardware in the testing process. The testing process will consist of creating a simple configuration file for your device (almost a copy-paste from cluster.conf) and running a simple script (< 5 minutes). We believe that with your help we will be able to test more devices and make the upstream code even better. We are looking for these fence devices (and their owners) used by the following fence agents: * fence_baytech * fence_bullpap * fence_vixel * fence_zvm * fence_cpint * fence_rackswitch * fence_brocade * fence_mcdata Thanks for your help. If you would like to help with testing please send me a mail directly. We can probably drop fence_na too from this list. The hardware has not made it to production and I don't think that will happen any time soon. Digimer? Fabio
Re: [Cluster-devel] bug reports
On 10/24/2012 12:38 PM, Heiko Nardmann wrote: Hi together! Since all (or almost all?) GFS2 developers (as far as I can tell) are employed by Red Hat, I wonder whether it makes sense to additionally post bug reports to this mailing list besides reporting them to RH support? No, please report bugs via RH support. This list is for development only. Fabio
Re: [Cluster-devel] [PATCH] rgmanager: Fix return code when a service would deadlock
ACK On 10/13/2012 03:18 AM, Ryan McCabe wrote: When we detect that starting a service would cause a deadlock, return 0 instead of -1. This fixes a crash that occurred when -1 was returned. Resolves: rhbz#861157 Signed-off-by: Ryan McCabe rmcc...@redhat.com --- rgmanager/src/daemons/rg_thread.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rgmanager/src/daemons/rg_thread.c b/rgmanager/src/daemons/rg_thread.c index 5e551c3..b888717 100644 --- a/rgmanager/src/daemons/rg_thread.c +++ b/rgmanager/src/daemons/rg_thread.c @@ -756,7 +756,7 @@ rt_enqueue_request(const char *resgroupname, int request, logt_print(LOG_DEBUG, Failed to queue %d request for %s: Would block\n, request, resgroupname); - return -1; + return 0; } ret = rq_queue_request(resgroup-rt_queue, resgroup-rt_name,
Re: [Cluster-devel] [PATCH 2/2] checkquorum.wdmd: add integration script with wdmd
On 10/10/2012 6:33 AM, Dietmar Maurer wrote: Will you add some documentation on how to use those scripts? Yes, our documentation overlord is preparing an upstream wiki page for it. It will be ready before a release. It seems those scripts do not check if the node is joined to the fence domain? It doesn't really need to. I'll put this as simply as possible: - real fencing == murder: there can only be one killer in the cluster at a time; the fence domain coordinates who can/should be killed by whom - checkquorum.wdmd == suicide: there are N nodes in the cluster that can decide to commit suicide without really caring about what the others are doing. This can run without any fencing configuration at all. Anyway, examples, setups, limitations... all in the doc as soon as it's ready. Be a bit patient :) Fabio
Re: [Cluster-devel] [PATCH 2/2] checkquorum.wdmd: add integration script with wdmd
On 10/10/2012 10:06 AM, Dietmar Maurer wrote: On 10/10/2012 6:26 AM, Dietmar Maurer wrote:

+# rpm based distros
+[ -d /etc/sysconfig ] && \
+	[ -f /etc/sysconfig/checkquorum ] && \
+	. /etc/sysconfig/checkquorum
+
+# deb based distros
+[ ! -d /etc/sysconfig ] && \
+	[ -f /etc/default/checkquorum ] && \
+	. /etc/default/checkquorum
+

FYI: Some RAID tool vendors deliver utilities for Debian which create the directory '/etc/sysconfig' on Debian boxes, so that test is not reliable. This might be a controversial argument. I just thought there are better tests to see if you run on Debian, for example: [ -f /etc/debian_version -a -d /etc/default ] that doesn't scale well for Debian derivatives that don't ship debian_version :) (see Ubuntu & co..) You can't even use something like `which dpkg` since the tool is available on rpm based distributions... or vice versa... there is rpm for Debian derivatives. Hardcoding all distributions is not optimal either, as they might change policy by version. Fabio
Re: [Cluster-devel] [PATCH 2/2] checkquorum.wdmd: add integration script with wdmd
On 10/10/2012 1:04 PM, Heiko Nardmann wrote: On 10.10.2012 10:11, Fabio M. Di Nitto wrote: [snip] that doesn't scale well for Debian derivatives that don't ship debian_version :) (see Ubuntu & co..) You can't even use something like `which dpkg` since the tool is available on rpm based distributions... or vice versa... there is rpm for Debian derivatives. Hardcoding all distributions is not optimal either, as they might change policy by version. Fabio What about 'lsb_release'? Is that executable available on all platforms? Not installed by default; it's generally shipped with a $distro-lsb metapackage that pulls in half a gazillion dependencies. I doubt it would solve anything since you still need to parse the output. It's really no different than hardcoding /etc/$distro_release, actually with a few GB of extra packages ;) Fabio
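The /etc/sysconfig-vs-/etc/default sourcing pattern debated in this thread can be distilled into a small helper. This is a hypothetical sketch for illustration only: the `load_defaults` name and the optional root prefix are inventions of the example, not part of the shipped script.

```shell
#!/bin/sh
# Hypothetical sketch of the defaults-sourcing pattern under discussion:
# prefer /etc/sysconfig when the directory exists (rpm-based distros),
# otherwise fall back to /etc/default (deb-based distros). The optional
# $root prefix exists only so the helper can be exercised without root.
load_defaults() {
    name="$1"
    root="${2:-}"
    if [ -d "$root/etc/sysconfig" ]; then
        [ -f "$root/etc/sysconfig/$name" ] && . "$root/etc/sysconfig/$name"
    else
        [ -f "$root/etc/default/$name" ] && . "$root/etc/default/$name"
    fi
    return 0    # a missing defaults file is not an error
}
```

As the thread points out, this keys off packaging layout rather than distribution identity, so it also covers derivatives that do not ship /etc/debian_version.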
[Cluster-devel] [PATCH 1/2] cman init: make sure we start after fence_sanlockd and warn users
From: Fabio M. Di Nitto fdini...@redhat.com Resolves: rhbz#509056 Signed-off-by: Fabio M. Di Nitto fdini...@redhat.com --- cman/init.d/cman.in | 13 +++-- 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/cman/init.d/cman.in b/cman/init.d/cman.in
index a88f52f..849739b 100644
--- a/cman/init.d/cman.in
+++ b/cman/init.d/cman.in
@@ -8,8 +8,8 @@
 #
 ### BEGIN INIT INFO
 # Provides:		cman
-# Required-Start:	$network $time
-# Required-Stop:	$network $time
+# Required-Start:	$network $time fence_sanlockd
+# Required-Stop:	$network $time fence_sanlockd
 # Default-Start:
 # Default-Stop:
 # Short-Description:	Starts and stops cman
@@ -740,6 +740,13 @@ stop_cmannotifyd()
 	stop_daemon cmannotifyd
 }
 
+fence_sanlock_check()
+{
+	service fence_sanlockd status >/dev/null 2>&1 && \
+		echo "fence_sanlockd detected. Unfencing might take several minutes!"
+	return 0
+}
+
 unfence_self()
 {
 	# fence_node returns 0 on success, 1 on failure, 2 if unconfigured
@@ -881,6 +888,8 @@ start()
 
 	[ "$breakpoint" = "daemons" ] && exit 0
 
+	fence_sanlock_check
+
 	runwrap unfence_self \
 		none \
 		"Unfencing self"
-- 1.7.7.6
[Cluster-devel] [PATCH 2/2] checkquorum.wdmd: add integration script with wdmd
From: Fabio M. Di Nitto fdini...@redhat.com requires wdmd >= 2.6 Resolves: rhbz#509056 Signed-off-by: Fabio M. Di Nitto fdini...@redhat.com --- cman/scripts/Makefile | 2 +- cman/scripts/checkquorum.wdmd | 104 + 2 files changed, 105 insertions(+), 1 deletions(-) create mode 100644 cman/scripts/checkquorum.wdmd

diff --git a/cman/scripts/Makefile b/cman/scripts/Makefile
index b4866c8..7950311 100644
--- a/cman/scripts/Makefile
+++ b/cman/scripts/Makefile
@@ -1,4 +1,4 @@
-SHAREDIRTEX=checkquorum
+SHAREDIRTEX=checkquorum checkquorum.wdmd
 
 include ../../make/defines.mk
 include $(OBJDIR)/make/clean.mk
diff --git a/cman/scripts/checkquorum.wdmd b/cman/scripts/checkquorum.wdmd
new file mode 100644
index 000..1d81ff6
--- /dev/null
+++ b/cman/scripts/checkquorum.wdmd
@@ -0,0 +1,104 @@
+#!/bin/bash
+# Quorum detection watchdog script
+#
+# This script will return -2 if the node had quorum at one point
+# and then subsequently lost it
+#
+# Copyright 2012 Red Hat, Inc.
+
+# defaults
+
+# Amount of time in seconds to wait after quorum is lost to fail script
+waittime=60
+
+# action to take if quorum is missing for over waittime
+# autodetect|hardreboot|crashdump|watchdog
+action=autodetect
+
+# Location of temporary file to capture timeouts
+timerfile=/var/run/cluster/checkquorum-timer
+
+# rpm based distros
+[ -d /etc/sysconfig ] && \
+	[ -f /etc/sysconfig/checkquorum ] && \
+	. /etc/sysconfig/checkquorum
+
+# deb based distros
+[ ! -d /etc/sysconfig ] && \
+	[ -f /etc/default/checkquorum ] && \
+	. /etc/default/checkquorum
+
+has_quorum() {
+	corosync-quorumtool -s 2>/dev/null | \
+		grep "^Quorate:" | \
+		grep -q "Yes$"
+}
+
+had_quorum() {
+	output=$(corosync-objctl 2>/dev/null | \
+		grep runtime.totem.pg.mrp.srp.operational_entered | cut -d = -f 2)
+	[ -n "$output" ] && {
+		[ "$output" -ge 1 ] && return 0
+		return 1
+	}
+}
+
+take_action() {
+	case $action in
+	watchdog)
+		[ -n "$wdmd_action" ] && return 1
+		;;
+	hardreboot)
+		echo 1 > /proc/sys/kernel/sysrq
+		echo b > /proc/sysrq-trigger
+		;;
+	crashdump)
+		echo 1 > /proc/sys/kernel/sysrq
+		echo c > /proc/sysrq-trigger
+		;;
+	autodetect)
+		service kdump status >/dev/null 2>&1
+		usekexec=$?
+		[ -n "$wdmd_action" ] && [ "$usekexec" != 0 ] && return 1
+		echo 1 > /proc/sys/kernel/sysrq
+		[ "$usekexec" = 0 ] && echo c > /proc/sysrq-trigger
+		echo b > /proc/sysrq-trigger
+	esac
+}
+
+# watchdog uses $1 = test or = repair
+# with no arguments we are called by wdmd
+[ -z "$1" ] && wdmd_action=yes
+
+# we don't support watchdog repair action
+[ "$1" = "repair" ] && exit 1
+
+service corosync status >/dev/null 2>&1
+ret=$?
+
+case $ret in
+	3) # corosync is not running (clean)
+		rm -f $timerfile
+		exit 0
+		;;
+	1) # corosync crashed or did exit abnormally (dirty - take action)
+		logger -t checkquorum.wdmd "corosync crashed or exited abnormally. Node will soon reboot"
+		take_action
+		;;
+	0) # corosync is running (clean)
+		# check quorum here
+		has_quorum && {
+			echo -e "oldtime=$(date +%s)" > $timerfile
+			exit 0
+		}
+		. $timerfile
+		newtime=$(date +%s)
+		delta=$((newtime - oldtime))
+		logger -t checkquorum.wdmd "Node has lost quorum. Node will soon reboot"
+		had_quorum && [ "$delta" -gt "$waittime" ] && {
+			take_action
+		}
+		;;
esac
+
+exit $?
-- 1.7.7.6
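The grace-period handling at the heart of the script above boils down to a timestamp delta. A minimal standalone distillation of that logic (the function name is hypothetical; the oldtime/newtime arithmetic mirrors the script):

```shell
#!/bin/sh
# Hypothetical distillation of the script's timer logic: quorum loss is
# acted upon only once it has persisted for more than $2 (waittime)
# seconds since $1, the epoch timestamp when quorum was last seen.
quorum_loss_expired() {
    oldtime="$1"
    waittime="$2"
    newtime=$(date +%s)
    delta=$((newtime - oldtime))
    [ "$delta" -gt "$waittime" ]
}
```

The real script persists oldtime in /var/run/cluster/checkquorum-timer, so the grace period survives across the repeated short-lived invocations made by wdmd.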
Re: [Cluster-devel] checkquorum script for self fencing
On 10/02/2012 08:07 PM, Dietmar Maurer wrote: Hi Fabio, was there any progress on that topic? As a matter of fact, yes, we are completing the first implementation and writing down the docs and howto's. I think the first cut will be available for testing within a week, maybe two. Fabio -Original Message- From: cluster-devel-boun...@redhat.com [mailto:cluster-devel- boun...@redhat.com] On Behalf Of Fabio M. Di Nitto Sent: Donnerstag, 22. Dezember 2011 06:57 To: cluster-devel@redhat.com Subject: Re: [Cluster-devel] checkquorum script for self fencing On 12/21/2011 08:28 PM, Dietmar Maurer wrote: I recently detected that checkquorum script for self fencing. That seems to work reliable, but the remaining nodes (with quorum) does not get any fence acknowledge. I wonder if it would be possible to extend the checkquorum script so that it runs fence_ack_manual on the fence master after some safety timeout? Or do you think that is a bad idea? We are already working on a similar feature based on checkquorum, but I got injured on my hand and I had to delay a bit the write up for the feature (I am incredibly slow writing with one hand, never mind the typos ;)). The way you suggest is dangerous, so no, don't take that route. The full feature proposal will come soon after this December holiday/xmas break. Fabio
[Cluster-devel] [PATCH] cman init: increase default shutdown timeouts
From: Fabio M. Di Nitto fdini...@redhat.com In some conditions, especially when shutting down all nodes at the same time, corosync takes a lot longer than 10 seconds to stabilize membership. That means that daemons will not quit fast enough before cman init declares a shutdown error. Increase the default shutdown timeouts from 10 to 30 seconds. Resolves: rhbz#854032 Signed-off-by: Fabio M. Di Nitto fdini...@redhat.com --- cman/init.d/cman.in | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/cman/init.d/cman.in b/cman/init.d/cman.in
index 1917abd..a88f52f 100644
--- a/cman/init.d/cman.in
+++ b/cman/init.d/cman.in
@@ -305,7 +305,7 @@ stop_daemon()
 	shift
 	retryforsec=$1
-	[ -z "$retryforsec" ] && retryforsec=1
+	[ -z "$retryforsec" ] && retryforsec=30
 	retries=0
 	if check_sleep; then
@@ -661,7 +661,7 @@ start_qdiskd()
 
 stop_qdiskd()
 {
-	stop_daemon qdiskd 5
+	stop_daemon qdiskd
 }
 
 start_groupd()
@@ -770,7 +770,7 @@ join_fence_domain()
 
 leave_fence_domain()
 {
 	if status fenced >/dev/null 2>&1; then
-		errmsg=$( fence_tool leave -w 10 2>&1 )
+		errmsg=$( fence_tool leave -w 30 2>&1 )
 		return $?
 	fi
 }
-- 1.7.7.6
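For context, the retryforsec behavior being tuned here is essentially a bounded wait loop. A rough standalone sketch of that idea (assumed semantics; the real stop_daemon in cman.in also handles signalling and per-daemon status checks):

```shell
#!/bin/sh
# Rough sketch (assumed semantics, not the shipped stop_daemon): wait up
# to $2 seconds for process $1 to exit, polling once per second, and fail
# if it is still alive when the budget runs out.
wait_for_exit() {
    pid="$1"
    retryforsec="${2:-30}"   # 30 is the new default the patch introduces
    retries=0
    while kill -0 "$pid" 2>/dev/null; do
        [ "$retries" -ge "$retryforsec" ] && return 1
        sleep 1
        retries=$((retries + 1))
    done
    return 0
}
```

The patch's point is simply that the old 1-second and 10-second budgets were too small for corosync membership to settle during a whole-cluster shutdown.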
Re: [Cluster-devel] cluster: RHEL6 - fsck.gfs2: Fix buffer overflow in get_lockproto_table
On 8/17/2012 11:57 AM, Andrew Price wrote: On 17/08/12 05:02, Fabio M. Di Nitto wrote: On 08/16/2012 11:01 PM, Andrew Price wrote: Gitweb: http://git.fedorahosted.org/git/?p=cluster.git;a=commitdiff;h=f796ee8752712e9e523e1516bb9165b274552753 Commit: f796ee8752712e9e523e1516bb9165b274552753 Parent: 638deec0ccbf45862eee97294f09ba9d6b3f56d0 Author: Andrew Price anpr...@redhat.com AuthorDate: Sat Jul 7 22:03:24 2012 +0100 Committer: Andrew Price anpr...@redhat.com CommitterDate: Thu Aug 16 21:54:56 2012 +0100 fsck.gfs2: Fix buffer overflow in get_lockproto_table Coverity discovered a buffer overflow in this function where an overly long cluster name in cluster.conf could cause a crash while repairing the superblock. This patch fixes the bug by making sure the lock table is composed sensibly, limiting the fsname to 16 chars as documented, and only allowing the cluster name (which doesn't seem to have a documented max size) to use the remaining space in the locktable name string. The cluster name is max 16 bytes too (including \0). It's actually verified by cman at startup, so it can't be longer than that. OK, thanks for clearing that up. There are other places in gfs2-utils which we can tighten up now that we know that the cluster name has a solid limit, so I'm going to push this patch (which fixes the overflow bug) and we'll address the limit issues separately. BTW, now that cman has disappeared upstream, is anything checking the length of the cluster name now? I am not sure. I don't think corosync enforces any limit, but best to check with Jan. Fabio
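The fix under discussion is essentially bounded string formatting. A hedged shell illustration of the idea follows; the 32-byte total field size, the helper name, and the truncation strategy are assumptions of this example only, since the real limits live in the gfs2 superblock definition and in the fsck.gfs2 patch itself:

```shell
#!/bin/sh
# Hypothetical illustration of composing a lock table "sensibly":
# a lock table looks like "clustername:fsname"; fsname is capped at
# 16 chars as documented, and the cluster name gets whatever space
# remains in the field. total=32 is an assumed field size.
build_locktable() {
    cluster="$1"
    fsname=$(printf '%.16s' "$2")            # fsname: at most 16 chars
    total="${3:-32}"
    maxclu=$((total - ${#fsname} - 2))       # room for ':' and a NUL
    cluster=$(printf '%s' "$cluster" | cut -c1-"$maxclu")
    printf '%s:%s' "$cluster" "$fsname"
}
```

The point Fabio adds in the thread is that the cluster-name half does in fact have a hard 16-byte limit as well, which lets the remaining gfs2-utils call sites be tightened the same way.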
Re: [Cluster-devel] [PATCH 0/3] minor edits of cluster.rng (fixed head part)
ACK all 3 of them. Please push them to the STABLE32 branch. Fabio On 08/16/2012 09:52 PM, Jan Pokorný wrote: Jan Pokorný (3): cluster.rng: fix trailing whitespaces in head cluster.rng: fencedevice initial non-digit note to description cluster.rng: retab the head (use space uniformly) config/tools/xml/cluster.rng.in.head | 41 ++-- 1 file changed, 21 insertions(+), 20 deletions(-)
Re: [Cluster-devel] cluster: RHEL6 - fsck.gfs2: Fix buffer overflow in get_lockproto_table
On 08/16/2012 11:01 PM, Andrew Price wrote: Gitweb: http://git.fedorahosted.org/git/?p=cluster.git;a=commitdiff;h=f796ee8752712e9e523e1516bb9165b274552753 Commit:f796ee8752712e9e523e1516bb9165b274552753 Parent:638deec0ccbf45862eee97294f09ba9d6b3f56d0 Author:Andrew Price anpr...@redhat.com AuthorDate:Sat Jul 7 22:03:24 2012 +0100 Committer: Andrew Price anpr...@redhat.com CommitterDate: Thu Aug 16 21:54:56 2012 +0100 fsck.gfs2: Fix buffer overflow in get_lockproto_table Coverity discovered a buffer overflow in this function where an overly long cluster name in cluster.conf could cause a crash while repairing the superblock. This patch fixes the bug by making sure the lock table is composed sensibly, limiting the fsname to 16 chars as documented, and only allowing the cluster name (which doesn't seem to have a documented max size) to use the remaining space in the locktable name string. cluster name is max 16 bytes too (including \0). It's actually verified by cman at startup so it can't be longer than that. Fabio
[Cluster-devel] cluster 3.1.93 release (Release Candidate)
Welcome to the cluster 3.1.93 (Release Candidate) release. This release addresses a few major issues. Users of previous releases are strongly encouraged to upgrade to this version. This release also strictly requires corosync 1.4.4 to build and run. Unless major issues are reported, the next release will be marked stable 3.2.0. The new source tarball can be downloaded here: https://fedorahosted.org/releases/c/l/cluster/cluster-3.1.93.tar.xz ChangeLog: https://fedorahosted.org/releases/c/l/cluster/Changelog-3.1.93 To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadmins and users. Thanks/congratulations to all people that contributed to this release! Happy clustering, Fabio
[Cluster-devel] [PATCH] qdiskd: backport dual socket connection to cman
From: Fabio M. Di Nitto fdini...@redhat.com Patch 76741bb2a94ae94e493c609d50f570d02e2f3029 had a not so obvious dependency on 08ae3ce147b2771c5ee6e1d364a5e48c88384427. Backport portion of 08ae3ce147b2771c5ee6e1d364a5e48c88384427 to handle dual cman socket (admin and user) and use the correct socket (user) for send/receive data. Move cman_alive check and heartbeat (for dispatch) to ch_user. Resolves: rhbz#782900 Signed-off-by: Fabio M. Di Nitto fdini...@redhat.com --- cman/qdisk/disk.h | 5 ++- cman/qdisk/disk_util.c | 7 +++-- cman/qdisk/iostate.c | 8 +++--- cman/qdisk/main.c | 69 ++- 4 files changed, 43 insertions(+), 46 deletions(-)

diff --git a/cman/qdisk/disk.h b/cman/qdisk/disk.h
index d491de1..83167ea 100644
--- a/cman/qdisk/disk.h
+++ b/cman/qdisk/disk.h
@@ -270,7 +270,8 @@ typedef struct {
 	int qc_master;		/* Master?! */
 	int qc_status_sock;
 	run_flag_t qc_flags;
-	cman_handle_t qc_ch;
+	cman_handle_t qc_ch_admin;
+	cman_handle_t qc_ch_user;
 	char *qc_device;
 	char *qc_label;
 	char *qc_status_file;
@@ -299,7 +300,7 @@ typedef struct {
 int qd_write_status(qd_ctx *ctx, int nid, disk_node_state_t state, disk_msg_t *msg, memb_mask_t mask, memb_mask_t master);
 int qd_read_print_status(target_info_t *disk, int nid);
-int qd_init(qd_ctx *ctx, cman_handle_t ch, int me);
+int qd_init(qd_ctx *ctx, cman_handle_t ch_admin, cman_handle_t ch_user, int me);
 void qd_destroy(qd_ctx *ctx);
 
 /* proc.c */
diff --git a/cman/qdisk/disk_util.c b/cman/qdisk/disk_util.c
index f5539c0..25f4013 100644
--- a/cman/qdisk/disk_util.c
+++ b/cman/qdisk/disk_util.c
@@ -312,16 +312,17 @@ generate_token(void)
 	Initialize a quorum disk context, given a CMAN handle and a nodeid.
 */
 int
-qd_init(qd_ctx *ctx, cman_handle_t ch, int me)
+qd_init(qd_ctx *ctx, cman_handle_t ch_admin, cman_handle_t ch_user, int me)
 {
-	if (!ctx || !ch || !me) {
+	if (!ctx || !ch_admin || !ch_user || !me) {
 		errno = EINVAL;
 		return -1;
 	}
 	memset(ctx, 0, sizeof(*ctx));
 	ctx->qc_incarnation = generate_token();
-	ctx->qc_ch = ch;
+	ctx->qc_ch_admin = ch_admin;
+	ctx->qc_ch_user = ch_user;
 	ctx->qc_my_id = me;
 	ctx->qc_status_sock = -1;
diff --git a/cman/qdisk/iostate.c b/cman/qdisk/iostate.c
index eb74ad2..ba7ad12 100644
--- a/cman/qdisk/iostate.c
+++ b/cman/qdisk/iostate.c
@@ -69,7 +69,7 @@ io_nanny_thread(void *arg)
 	iostate_t last_main_state = 0, current_main_state = 0;
 	int last_main_incarnation = 0, current_main_incarnation = 0;
 	int logged_incarnation = 0;
-	cman_handle_t ch = (cman_handle_t)arg;
+	cman_handle_t ch_user = (cman_handle_t)arg;
 	int32_t whine_state;
 
 	/* Start with wherever we're at now */
@@ -105,7 +105,7 @@ io_nanny_thread(void *arg)
 		/* Whine on CMAN api */
 		whine_state = (int32_t)current_main_state;
 		swab32(whine_state);
-		cman_send_data(ch, &whine_state, sizeof(int32_t), 0, CLUSTER_PORT_QDISKD, 0);
+		cman_send_data(ch_user, &whine_state, sizeof(int32_t), 0, CLUSTER_PORT_QDISKD, 0);
 
 		/* Don't log things twice */
 		if (logged_incarnation == current_main_incarnation)
@@ -125,7 +125,7 @@ io_nanny_thread(void *arg)
 
 int
-io_nanny_start(cman_handle_t ch, int timeout)
+io_nanny_start(cman_handle_t ch_user, int timeout)
 {
 	int ret;
 
@@ -135,7 +135,7 @@ io_nanny_start(cman_handle_t ch, int timeout)
 	qdisk_timeout = timeout;
 	thread_active = 1;
-	ret = pthread_create(&io_nanny_tid, NULL, io_nanny_thread, ch);
+	ret = pthread_create(&io_nanny_tid, NULL, io_nanny_thread, ch_user);
 	pthread_mutex_unlock(&state_mutex);
 
 	return ret;
diff --git a/cman/qdisk/main.c b/cman/qdisk/main.c
index 90d00ab..72a3c07 100644
--- a/cman/qdisk/main.c
+++ b/cman/qdisk/main.c
@@ -287,7 +287,7 @@ check_transitions(qd_ctx *ctx, node_info_t *ni, int max, memb_mask_t mask)
 			if (ctx->qc_flags & RF_ALLOW_KILL) {
 				clulog(LOG_DEBUG, "Telling CMAN to kill the node\n");
-				cman_kill_node(ctx->qc_ch,
+				cman_kill_node(ctx->qc_ch_admin,
 					ni[x].ni_status.ps_nodeid);
 			}
 		}
@@ -325,7 +325,7 @@ check_transitions(qd_ctx *ctx, node_info_t *ni, int max, memb_mask_t mask)
 			if (ctx->qc_flags & RF_ALLOW_KILL) {
 				clulog(LOG_DEBUG, "Telling CMAN to kill the node\n");
-				cman_kill_node(ctx->qc_ch
Re: [Cluster-devel] Fence driver for the Digital Loggers Web Power Switches
On 07/31/2012 10:24 PM, Dwight Hubbard wrote: Hopefully this is a correct patch; it's been a long while since I've generated one. Don't worry... I'll have Marek review it and send comments back. My only minor concern is the license. Do you think you can make your agent GPLv2+? Otherwise I guess it's time to fix the build system and packaging to deal with multiple licenses, though having the whole tree under the same umbrella is easier ;) Thanks Fabio On Tue, Jul 31, 2012 at 12:06 PM, Fabio M. Di Nitto fdini...@redhat.com wrote: On 07/31/2012 06:59 PM, Dwight Hubbard wrote: If I knew where to submit it I'd be happy to. Here is just fine :) either in the form of a patch to the fence-agents.git master branch or as a standalone agent, and we can help integrating it in the current tree. Fabio On Mon, Jul 23, 2012 at 11:18 PM, Fabio M. Di Nitto fdini...@redhat.com wrote: On 07/23/2012 10:12 PM, Dwight Hubbard wrote: I updated the Fence driver I wrote back in 2009 for the Digital Loggers network power switches (http://digital-loggers.com/lpc.html) to work with some additional powerswitch models and put the code in a github repo http://github.com/dwighthubbard/python-dlipower. In case it's useful for anyone else... Is there a specific reason why you don't submit the code upstream and have it part of fence-agents.git? Thanks Fabio
Re: [Cluster-devel] [PATCH] rgmanager: Exit uncleanly only when CMAN_SHUTDOWN_ANYWAY is set
ACK we will need an upstream/rhel6 equivalent too for this one. See my comment in BZ. Fabio On 07/27/2012 07:07 PM, Ryan McCabe wrote: Only exit uncleanly when the CMAN_SHUTDOWN_ANYWAY flag is set in the argument passed when handling the CMAN_REASON_TRY_SHUTDOWN event. This fixes the case where args is 2, where we want to refuse to shut down. Resolves: rhbz#769730 Signed-off-by: Ryan McCabe rmcc...@redhat.com --- rgmanager/src/clulib/msg_cluster.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/rgmanager/src/clulib/msg_cluster.c b/rgmanager/src/clulib/msg_cluster.c
index e864853..e4b6b39 100644
--- a/rgmanager/src/clulib/msg_cluster.c
+++ b/rgmanager/src/clulib/msg_cluster.c
@@ -211,7 +211,7 @@ poll_cluster_messages(int timeout)
 		if (cman_dispatch(ch, 0) < 0) {
 			process_cman_event(ch, NULL,
-					   CMAN_REASON_TRY_SHUTDOWN, 1);
+					   CMAN_REASON_TRY_SHUTDOWN, CMAN_SHUTDOWN_ANYWAY);
 		}
 		ret = 0;
 	}
@@ -987,7 +987,9 @@ process_cman_event(cman_handle_t handle, void *private, int reason, int arg)
 	printf("EVENT: %p %p %d %d\n", handle, private, reason, arg);
 #endif
 
-	if (reason == CMAN_REASON_TRY_SHUTDOWN && !arg) {
+	if (reason == CMAN_REASON_TRY_SHUTDOWN &&
+	    !(arg & CMAN_SHUTDOWN_ANYWAY))
+	{
 		cman_replyto_shutdown(handle, 0);
 		return;
 	}
Re: [Cluster-devel] Fence driver for the Digital Loggers Web Power Switches
On 07/23/2012 10:12 PM, Dwight Hubbard wrote: I updated the Fence driver I wrote back in 2009 for the Digital loggers network power switches (http://digital-loggers.com/lpc.html) to work with some additional powerswitch models and put the code in a github repo http://github.com/dwighthubbard/python-dlipower. In case it's useful for anyone else... Is there a specific reason why you don't submit the code upstream and have it part of fence-agents.git? Thanks Fabio
[Cluster-devel] [PATCH] cman init: allow dlm hash table sizes to be tunable at startup
From: Fabio M. Di Nitto fdini...@redhat.com Resolves: rhbz#842370 Signed-off-by: Fabio M. Di Nitto fdini...@redhat.com --- cman/init.d/cman.in | 28 cman/init.d/cman.init.defaults.in | 7 +++ 2 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/cman/init.d/cman.in b/cman/init.d/cman.in
index 9a0d726..9de349d 100644
--- a/cman/init.d/cman.in
+++ b/cman/init.d/cman.in
@@ -110,6 +110,13 @@ fi
 # DLM_CONTROLD_OPTS -- allow extra options to be passed to dlm_controld daemon.
 [ -z "$DLM_CONTROLD_OPTS" ] && DLM_CONTROLD_OPTS=
 
+# DLM_LKBTBL_SIZE - DLM_RSBTBL_SIZE - DLM_DIRTBL_SIZE
+# Allow tuning of DLM kernel hash table sizes.
+# do NOT change unless instructed to do so.
+[ -z "$DLM_LKBTBL_SIZE" ] && DLM_LKBTBL_SIZE=
+[ -z "$DLM_RSBTBL_SIZE" ] && DLM_RSBTBL_SIZE=
+[ -z "$DLM_DIRTBL_SIZE" ] && DLM_DIRTBL_SIZE=
+
 # FENCE_JOIN_TIMEOUT -- seconds to wait for fence domain join to
 # complete. If the join hasn't completed in this time, fence_tool join
 # exits with an error, and this script exits with an error. To wait
@@ -706,6 +713,23 @@ leave_fence_domain()
 	fi
 }
 
+tune_dlm_hash_sizes()
+{
+	dlmdir=/sys/kernel/config/dlm/cluster
+
+	[ -n "$DLM_LKBTBL_SIZE" ] && [ -f "$dlmdir/lkbtbl_size" ] && \
+		echo "$DLM_LKBTBL_SIZE" > "$dlmdir/lkbtbl_size"
+
+	[ -n "$DLM_RSBTBL_SIZE" ] && [ -f "$dlmdir/rsbtbl_size" ] && \
+		echo "$DLM_RSBTBL_SIZE" > "$dlmdir/rsbtbl_size"
+
+	[ -n "$DLM_DIRTBL_SIZE" ] && [ -f "$dlmdir/dirtbl_size" ] && \
+		echo "$DLM_DIRTBL_SIZE" > "$dlmdir/dirtbl_size"
+
+	return 0
+}
+
 start()
 {
 	currentaction=start
@@ -773,6 +797,10 @@ start()
 		none \
 		"Starting dlm_controld"
 
+	runwrap tune_dlm_hash_sizes \
+		none \
+		"Tuning DLM kernel hash tables"
+
 	runwrap start_gfs_controld \
 		none \
 		"Starting gfs_controld"
diff --git a/cman/init.d/cman.init.defaults.in b/cman/init.d/cman.init.defaults.in
index 1b7913e..bbaa049 100644
--- a/cman/init.d/cman.init.defaults.in
+++ b/cman/init.d/cman.init.defaults.in
@@ -34,6 +34,13 @@
 # DLM_CONTROLD_OPTS -- allow extra options to be passed to dlm_controld daemon.
 #DLM_CONTROLD_OPTS=
 
+# DLM_LKBTBL_SIZE - DLM_RSBTBL_SIZE - DLM_DIRTBL_SIZE
+# Allow tuning of DLM kernel hash table sizes.
+# do NOT change unless instructed to do so.
+#DLM_LKBTBL_SIZE=
+#DLM_RSBTBL_SIZE=
+#DLM_DIRTBL_SIZE=
+
 # FENCE_JOIN_TIMEOUT -- seconds to wait for fence domain join to
 # complete. If the join hasn't completed in this time, fence_tool join
 # exits with an error, and this script exits with an error. To wait
-- 1.7.7.6
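The guarded-write pattern the patch uses (write a tunable only when the admin configured it and the kernel actually exposes the configfs knob) can be factored into a tiny helper. This is a hypothetical sketch, not the shipped code:

```shell
#!/bin/sh
# Hypothetical distillation of the patch's guarded configfs writes: the
# value is written only when it was configured AND the knob file exists,
# so the init script stays a harmless no-op on kernels without these
# tunables.
tune_knob() {
    value="$1"
    knob="$2"
    [ -n "$value" ] && [ -f "$knob" ] && echo "$value" > "$knob"
    return 0    # a missing value or knob is not an error
}
```

In the init script this would be invoked as, e.g., `tune_knob "$DLM_RSBTBL_SIZE" /sys/kernel/config/dlm/cluster/rsbtbl_size`.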
Re: [Cluster-devel] [PATCH] rgmanager: Add IP interface parameter
On 07/20/2012 11:07 PM, Lon Hohberger wrote: On 07/14/2012 03:54 PM, Fabio M. Di Nitto wrote: On 07/13/2012 10:08 PM, Lon Hohberger wrote: On 07/13/2012 12:08 AM, Fabio M. Di Nitto wrote: Hi Ryan, only one comment here.. many times we have been asked to implement interface parameter to allow any random IP on any specific interface (beside the pre configured ip on that interface). We haven't done that because we might end up owning routing. However, if we make it explicit that this is not the case, then we could in theory do both. hmm right.. forgot about that. I would still prefer to avoid the use of interface= option if possible tho. Maybe something slightly less overloaded. force_interface or force_net_device. Sure; that's fine. prefer_interface= maybe? If more than one match, use this one, otherwise, use the one that matches Yes that sounds a lot better than force_* :) Thanks Fabio
Re: [Cluster-devel] [PATCH] rgmanager: Add IP interface parameter
On 07/13/2012 10:08 PM, Lon Hohberger wrote: On 07/13/2012 12:08 AM, Fabio M. Di Nitto wrote: Hi Ryan, only one comment here.. many times we have been asked to implement interface parameter to allow any random IP on any specific interface (beside the pre configured ip on that interface). We haven't done that because we might end up owning routing. However, if we make it explicit that this is not the case, then we could in theory do both. hmm right.. forgot about that. I would still prefer to avoid the use of interface= option if possible tho. Maybe something slightly less overloaded. force_interface or force_net_device. Fabio
Re: [Cluster-devel] [PATCH] rgmanager: Add IP interface parameter
Hi Ryan, only one comment here.. many times we have been asked to implement an interface parameter to allow any random IP on any specific interface (besides the pre-configured IP on that interface). Can we change the patch to simply fix both problems at once? Effectively, the fact that 2 interfaces have 2 IPs on the same subnet is simply a corner case. Maybe later on we can add something like: ifconfig iface up / down. When doing ifconfig up we need to store the output of IP addresses automatically assigned to that interface. On shutdown, we need to check if the IP we are removing is the last one on that interface _before_ issuing an ifconfig down, in case there are more IP resources associated to it. The patch looks ok, but I would probably use a different term than interface as it sounds very similar to the expected feature above. Fabio On 07/12/2012 07:23 PM, Ryan McCabe wrote: This patch adds an interface parameter for IP resources. The interface must already be configured and active. This parameter should be used only when at least two active interfaces have IP addresses on the same subnet and it's necessary to specify which particular interface should be used. Signed-off-by: Ryan McCabe rmcc...@redhat.com --- rgmanager/src/resources/ip.sh | 17 + 1 file changed, 17 insertions(+)

diff --git a/rgmanager/src/resources/ip.sh b/rgmanager/src/resources/ip.sh
index 38d1ab9..3adbb12 100755
--- a/rgmanager/src/resources/ip.sh
+++ b/rgmanager/src/resources/ip.sh
@@ -132,6 +132,15 @@ meta_data()
 			<content type="boolean"/>
 		</parameter>
 
+		<parameter name="interface">
+			<longdesc lang="en">
+			The network interface to which the IP address should be added. The interface must already be configured and active. This parameter should be used only when at least two active interfaces have IP addresses on the same subnet and it is desired to have the IP address added to a particular interface.
+			</longdesc>
+			<shortdesc lang="en">
+			Network interface
+			</shortdesc>
+			<content type="string"/>
+		</parameter>
 	</parameters>
 
 	<actions>
@@ -587,6 +596,10 @@ ipv6()
 	fi
 
 	if [ "$1" = "add" ]; then
+		if [ -n "$OCF_RESKEY_interface" ] && \
+		   [ "$OCF_RESKEY_interface" != "$dev" ]; then
+			continue
+		fi
 		ipv6_same_subnet $ifaddr_exp/$maskbits $addr_exp
 		if [ $? -ne 0 ]; then
 			continue
@@ -670,6 +683,10 @@ ipv4()
 	fi
 
 	if [ "$1" = "add" ]; then
+		if [ -n "$OCF_RESKEY_interface" ] && \
+		   [ "$OCF_RESKEY_interface" != "$dev" ]; then
+			continue
+		fi
 		ipv4_same_subnet $ifaddr/$maskbits $addr
 		if [ $? -ne 0 ]; then
 			continue
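For context, the proposed parameter would be consumed from cluster.conf roughly like this. This is a hypothetical usage sketch (the addresses and service name are invented; the attribute is named interface as in the patch as posted, though the thread later leans toward prefer_interface):

```xml
<rm>
  <service name="web" autostart="1">
    <!-- hypothetical example: pin the VIP to eth1 when two active
         interfaces carry addresses on the same subnet -->
    <ip address="192.168.1.50" monitor_link="1" interface="eth1"/>
  </service>
</rm>
```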
Re: [Cluster-devel] cluster.cman.nodename vanish on config reload
On 7/11/2012 9:37 AM, Dietmar Maurer wrote: Ok, bisected it myself. This led directly to commit f3f4499d4ace7a3bf5fe09ce6d9f04ed6d8958f6. But this is just the check you introduced. If I revert that patch, everything works as before, but I noticed that it still deletes the values from the corosync objdb after config reload - even in 3.1.8! Both cluster.cman.nodename and cluster.cman.cluster_id get removed. Testing with earlier versions now. That even happens with 3.1.4 (can't test easily with older versions). Any ideas? No, not yet, but what kind of operational problem do you get? Does it affect runtime? If so, how? Fabio
Re: [Cluster-devel] cluster.cman.nodename vanish on config reload
On 7/11/2012 10:14 AM, Fabio M. Di Nitto wrote: On 7/11/2012 9:37 AM, Dietmar Maurer wrote: Ok, bisected it myself. This led directly to commit f3f4499d4ace7a3bf5fe09ce6d9f04ed6d8958f6. But this is just the check you introduced. If I revert that patch, everything works as before, but I noticed that it still deletes the values from the corosync objdb after config reload - even in 3.1.8! Both cluster.cman.nodename and cluster.cman.cluster_id get removed. Testing with earlier versions now. That even happens with 3.1.4 (can't test easily with older versions). Any ideas? No, not yet, but what kind of operational problem do you get? Does it affect runtime? If so, how? Fabio Never mind... I answered my own question. Fabio
Re: [Cluster-devel] cluster.cman.nodename vanish on config reload
On 7/11/2012 10:21 AM, Dietmar Maurer wrote: This led directly to commit f3f4499d4ace7a3bf5fe09ce6d9f04ed6d8958f6. But this is just the check you introduced. If I revert that patch, everything works as before, but I noticed that it still deletes the values from the corosync objdb after config reload - even in 3.1.8! Both cluster.cman.nodename and cluster.cman.cluster_id get removed. Testing with earlier versions now. That even happens with 3.1.4 (can't test easily with older versions). Any ideas? No, not yet, but what kind of operational problem do you get? I cannot change/reload the configuration with commit f3f4499d4ace7a3bf5fe09ce6d9f04ed6d8958f6. When I revert that commit everything works fine. I just wonder why those values get removed from the corosync objdb? That's the root cause of the issue. Note: You added that check, so I guess it has negative side effects when there is no nodename (why did you add that check)? Well yes, it is an error if we can't determine our nodename. The issue now is to understand why it fails for you but doesn't fail for me using git. Fabio
Re: [Cluster-devel] cluster.cman.nodename vanish on config reload
On 7/11/2012 10:32 AM, Dietmar Maurer wrote: Well yes, it is an error if we can't determine our nodename. The issue now is to understand why it fails for you but doesn't fail for me using git. Oh, you can't reproduce the bug? Found it: it is triggered only when cluster.conf has a cman section. Working on a fix now. Fabio
Re: [Cluster-devel] cluster.cman.nodename vanish on config reload
On 7/11/2012 10:32 AM, Dietmar Maurer wrote: Well yes, it is an error if we can´t determine our nodename. The issue now is to understand why it fails for you but doesn´t fail for me using git. Oh, you can't reproduce the bug? Can you please try the patch I just posted to the list? it works for me, but a couple of extra eyes won´t hurt. Thanks fabio
Re: [Cluster-devel] cluster.cman.nodename vanish on config reload
If are running stable32 from git, can you please revert: commit 8975bd6341b2d94c1f89279b1b00d4360da1f5ff and see if it´s still a problem? Thanks Fabio On 7/10/2012 1:33 PM, Dietmar Maurer wrote: I just updated from 3.1.8 to latest STABLE32: I use this cluster.conf: # cat /etc/cluster/cluster.conf ?xml version=1.0? cluster config_version=235 name=test cman keyfile=/var/lib/pve-cluster/corosync.authkey transport=udpu/ clusternodes clusternode name=maui nodeid=3 votes=1/ clusternode name=cnode1 nodeid=1 votes=1/ /clusternodes rm pvevm autostart=0 vmid=100/ /rm /cluster cman service starts without problems: # /etc/init.d/cman start Starting cluster: Checking if cluster has been disabled at boot... [ OK ] Checking Network Manager... [ OK ] Global setup... [ OK ] Loading kernel modules... [ OK ] Mounting configfs... [ OK ] Starting cman... [ OK ] Waiting for quorum... [ OK ] Starting fenced... [ OK ] Starting dlm_controld... [ OK ] Starting GFS2 Control Daemon: gfs_controld. Unfencing self... [ OK ] Joining fence domain... [ OK ] And the corosync objdb contains: # corosync-objctl|grep cluster.cman cluster.cman.keyfile=/var/lib/pve-cluster/corosync.authkey cluster.cman.transport=udpu cluster.cman.nodename=maui cluster.cman.cluster_id=1678 Note: there is a value for ‘nodename’ and ‘cluster_id’ Now I simply increase the version inside cluster.conf (on both nodes): # cat /etc/cluster/cluster.conf ?xml version=1.0? cluster config_version=236 name=test cman keyfile=/var/lib/pve-cluster/corosync.authkey transport=udpu/ clusternodes clusternode name=maui nodeid=3 votes=1/ clusternode name=cnode1 nodeid=1 votes=1/ /clusternodes rm pvevm autostart=0 vmid=100/ /rm /cluster And trigger a reload: # cman_tool version -r –S cman_tool: Error loading configuration in corosync/cman And the syslog have more details: Jul 10 13:28:25 maui corosync[488675]: [CMAN ] cman was unable to determine our node name! 
Jul 10 13:28:25 maui corosync[488675]: [CMAN ] Can't get updated config version: Successfully read config from /etc/cluster/cluster.conf#012. Jul 10 13:28:25 maui corosync[488675]: [CMAN ] Continuing activity with old configuration Somehow the nodename and cluster_id values are removed from the corosync objdb: # corosync-objctl|grep cluster.cman cluster.cman.keyfile=/var/lib/pve-cluster/corosync.authkey cluster.cman.transport=udpu Any Idea why that happens? - Dietmar
Re: [Cluster-devel] cluster.cman.nodename vanish on config reload
On 7/10/2012 2:09 PM, Dietmar Maurer wrote: If are running stable32 from git, can you please revert: commit 8975bd6341b2d94c1f89279b1b00d4360da1f5ff and see if it´s still a problem? Yes, same problem. - Dietmar Ok. then please file a bugzilla. I´ll need to bisect and see when the problem has been introduced (unless you want to give bisect a shot). Fabio
[Cluster-devel] [PATCH] qdiskd: restrict master_wins to 2 node cluster
From: Fabio M. Di Nitto fdini...@redhat.com given enough mingling of cluster.conf it was possible to break quorum rule #1: there is only one quorum in a cluster at any given time. this change restricts master_wins to 2 node cluster only and provides extra feedback to the user (via logging) on why the mode is disabled. Resolves: rhbz#838047 Signed-off-by: Fabio M. Di Nitto fdini...@redhat.com --- cman/man/qdisk.5 |5 +++-- cman/qdisk/disk.h |1 + cman/qdisk/main.c | 22 +++--- 3 files changed, 19 insertions(+), 9 deletions(-) diff --git a/cman/man/qdisk.5 b/cman/man/qdisk.5 index ca974fa..938ed69 100644 --- a/cman/man/qdisk.5 +++ b/cman/man/qdisk.5 @@ -297,8 +297,9 @@ and qdiskd's timeout (interval*tko) should be less than half of Totem's token timeout. See section 3.3.1 for more information. This option only takes effect if there are no heuristics -configured. Usage of this option in configurations with more than -two cluster nodes is undefined and should not be done. +configured and it is valid only for 2 node cluster. +This option is automatically disabled if heuristics are +defined or cluster has more than 2 nodes configured. In a two-node cluster with no heuristics and no defined vote count (see above), this mode is turned by default. If enabled in diff --git a/cman/qdisk/disk.h b/cman/qdisk/disk.h index fd80fa6..1792377 100644 --- a/cman/qdisk/disk.h +++ b/cman/qdisk/disk.h @@ -249,6 +249,7 @@ typedef struct { int qc_master; /* Master?! 
*/ int qc_config; int qc_token_timeout; + int qc_auto_votes; disk_node_state_t qc_disk_status; disk_node_state_t qc_status; run_flag_t qc_flags; diff --git a/cman/qdisk/main.c b/cman/qdisk/main.c index 32677a2..e14d534 100644 --- a/cman/qdisk/main.c +++ b/cman/qdisk/main.c @@ -1444,7 +1444,7 @@ auto_qdisk_votes(int desc) logt_print(LOG_ERR, Unable to determine qdiskd votes automatically\n); else - logt_print(LOG_DEBUG, Setting votes to %d\n, ret); + logt_print(LOG_DEBUG, Setting autocalculated votes to %d\n, ret); return (ret); } @@ -1606,6 +1606,8 @@ get_dynamic_config_data(qd_ctx *ctx, int ccsfd) ctx-qc_flags = ~RF_AUTO_VOTES; } + ctx-qc_auto_votes = auto_qdisk_votes(ccsfd); + snprintf(query, sizeof(query), /cluster/quorumd/@votes); if (ccs_get(ccsfd, query, val) == 0) { ctx-qc_votes = atoi(val); @@ -1613,7 +1615,7 @@ get_dynamic_config_data(qd_ctx *ctx, int ccsfd) if (ctx-qc_votes 0) ctx-qc_votes = 0; } else { - ctx-qc_votes = auto_qdisk_votes(ccsfd); + ctx-qc_votes = ctx-qc_auto_votes; if (ctx-qc_votes 0) { if (ctx-qc_config) { logt_print(LOG_WARNING, Unable to determine @@ -1879,15 +1881,21 @@ get_config_data(qd_ctx *ctx, struct h_data *h, int maxh, int *cfh) *cfh = configure_heuristics(ccsfd, h, maxh, ctx-qc_interval * (ctx-qc_tko - 1)); - if (*cfh) { - if (ctx-qc_flags RF_MASTER_WINS) { - logt_print(LOG_WARNING, Master-wins mode disabled\n); + if (ctx-qc_flags RF_MASTER_WINS) { + if (*cfh) { + logt_print(LOG_WARNING, Master-wins mode disabled + (not compatible with heuristics)\n); + ctx-qc_flags = ~RF_MASTER_WINS; + } + if (ctx-qc_auto_votes != 1) { + logt_print(LOG_WARNING, Master-wins mode disabled + (not compatible with more than 2 nodes)\n); ctx-qc_flags = ~RF_MASTER_WINS; } } else { if (ctx-qc_flags RF_AUTO_VOTES - !(ctx-qc_flags RF_MASTER_WINS) - ctx-qc_votes == 1) { + !*cfh + ctx-qc_auto_votes == 1) { /* Two node cluster, no heuristics, 1 vote for * quorum disk daemon. Safe to enable master-wins. 
* In fact, qdiskd without master-wins in this config -- 1.7.7.6
Re: [Cluster-devel] [PATCH 1/5] rgmanager: Fix orainstance.sh error checking
ACK On 6/28/2012 9:57 PM, Ryan McCabe wrote: Pull in the fixed error checking that was added to oracledb.sh as a fix for rhbz#471066. Resolves: rhbz#723819 Signed-off-by: Ryan McCabe rmcc...@redhat.com --- rgmanager/src/resources/orainstance.sh |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/rgmanager/src/resources/orainstance.sh b/rgmanager/src/resources/orainstance.sh index 6f2ff15..a9f690d 100755 --- a/rgmanager/src/resources/orainstance.sh +++ b/rgmanager/src/resources/orainstance.sh @@ -105,7 +105,7 @@ start_db() { # If we see: # ORA-.: failure, we failed -grep -q failure $logfile +grep -q ^ORA- $logfile rv=$? rm -f $logfile @@ -155,7 +155,7 @@ stop_db() { return 1 fi - grep -q failure $logfile + grep -q ^ORA- $logfile rv=$? rm -f $logfile
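The change above replaces a match on the literal word "failure" with a match on Oracle's "ORA-" diagnostic prefix. A minimal runnable sketch of why that matters, using an invented log excerpt (the log contents here are hypothetical, not from a real Oracle instance):

```shell
#!/bin/sh
# Oracle reports errors as lines beginning with "ORA-", so matching the
# "^ORA-" prefix is more reliable than searching for the word "failure",
# which an error message may never contain.

logfile=$(mktemp)

# A startup log where an error occurred but the word "failure" never appears.
cat > "$logfile" <<'EOF'
SQL*Plus: Release 10.2.0.1.0
ORA-01034: ORACLE not available
EOF

# Old check: misses this error entirely.
grep -q failure "$logfile" && old_rv=0 || old_rv=1

# New check: catches any ORA-xxxxx diagnostic at the start of a line.
grep -q '^ORA-' "$logfile" && new_rv=0 || new_rv=1

rm -f "$logfile"
echo "old=$old_rv new=$new_rv"
# prints: old=1 new=0
```

With the old check the agent would have reported a clean start (rv=1 means "no failure string found") even though the database never came up.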
Re: [Cluster-devel] [PATCH 2/5] rgmanager: Don't exit uncleanly when cman asks us to shut down.
ACK On 6/28/2012 9:57 PM, Ryan McCabe wrote: Original patch from Lon rediffed to apply to the current tree: Previous to this, rgmanager would uncleanly exit if you issued a 'service cman stop'. This patch makes it uncleanly exit if 'cman_tool leave force' or a corosync/openais crash occurs, but in a simple cman_tool leave, rgmanager will no longer exit uncleanly. Without this patch, issuing 'service cman stop' when rgmanager is running will make it impossible to stop the cman service because rgmanager will have exited without releasing its dlm lockspace. Resolves: rhbz#769730 Signed-off-by: Ryan McCabe rmcc...@redhat.com --- rgmanager/src/clulib/msg_cluster.c |7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/rgmanager/src/clulib/msg_cluster.c b/rgmanager/src/clulib/msg_cluster.c index 8dc22d0..e864853 100644 --- a/rgmanager/src/clulib/msg_cluster.c +++ b/rgmanager/src/clulib/msg_cluster.c @@ -211,7 +211,7 @@ poll_cluster_messages(int timeout) if (cman_dispatch(ch, 0) 0) { process_cman_event(ch, NULL, -CMAN_REASON_TRY_SHUTDOWN, 0); +CMAN_REASON_TRY_SHUTDOWN, 1); } ret = 0; } @@ -987,6 +987,11 @@ process_cman_event(cman_handle_t handle, void *private, int reason, int arg) printf(EVENT: %p %p %d %d\n, handle, private, reason, arg); #endif + if (reason == CMAN_REASON_TRY_SHUTDOWN !arg) { + cman_replyto_shutdown(handle, 0); + return; + } + /* Allocate queue node */ while ((node = malloc(sizeof(*node))) == NULL) { sleep(1);
Re: [Cluster-devel] [PATCH 5/5] rgmanager: Fix a possible NULL pointer dereference
ACK On 6/28/2012 9:58 PM, Ryan McCabe wrote: Fix a NULL pointer dereference that could happen when cman_get_node_count() returns 0 with errno set to EINTR. Possibly resolves rhbz#820632 Signed-off-by: Ryan McCabe rmcc...@redhat.com --- rgmanager/src/clulib/members.c |4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/rgmanager/src/clulib/members.c b/rgmanager/src/clulib/members.c index f705297..72f4529 100644 --- a/rgmanager/src/clulib/members.c +++ b/rgmanager/src/clulib/members.c @@ -367,8 +367,10 @@ get_member_list(cman_handle_t h) do { ++tries; - if (nodes) + if (nodes) { free(nodes); + nodes = NULL; + } c = cman_get_node_count(h); if (c = 0) {
Re: [Cluster-devel] [PATCH 4/5] rgmanager: Treat exit status 16 from umount as success
ACK, but please add Masatake YAMATO suggestion to the final patch. Fabio On 6/28/2012 9:57 PM, Ryan McCabe wrote: When the filesystem /etc lives on is completely full, umount will exit with exit status 16 if the umount syscall succeeded but it was unable to write a new mtab file because the disk is full. umount won't exit with status 16 under any other circumstances. This patch changes the fs.sh, clusterfs.sh, and netfs.sh resource agents to check treat both exit status 0 and exit status 16 as success. Resolves: rhbz#819595 Signed-off-by: Ryan McCabe rmcc...@redhat.com --- rgmanager/src/resources/clusterfs.sh |3 ++- rgmanager/src/resources/fs.sh|3 ++- rgmanager/src/resources/netfs.sh |3 ++- 3 files changed, 6 insertions(+), 3 deletions(-) diff --git a/rgmanager/src/resources/clusterfs.sh b/rgmanager/src/resources/clusterfs.sh index 49eb724..eae1ee0 100755 --- a/rgmanager/src/resources/clusterfs.sh +++ b/rgmanager/src/resources/clusterfs.sh @@ -793,7 +793,8 @@ stop: Could not match $OCF_RESKEY_device with a real device ocf_log info unmounting $dev ($mp) umount $mp - if [ $? -eq 0 ]; then + retval=$? + if [ $retval -eq 0 -o $retval -eq 16 ]; then umount_failed= done=$YES continue diff --git a/rgmanager/src/resources/fs.sh b/rgmanager/src/resources/fs.sh index a98cddc..5d6bc1b 100755 --- a/rgmanager/src/resources/fs.sh +++ b/rgmanager/src/resources/fs.sh @@ -1103,7 +1103,8 @@ stop: Could not match $OCF_RESKEY_device with a real device ocf_log info unmounting $mp umount $mp - if [ $? -eq 0 ]; then + retval=$? + if [ $retval -eq 0 -o $retval -eq 16 ]; then umount_failed= done=$YES continue diff --git a/rgmanager/src/resources/netfs.sh b/rgmanager/src/resources/netfs.sh index 837a4c4..9f0daa4 100755 --- a/rgmanager/src/resources/netfs.sh +++ b/rgmanager/src/resources/netfs.sh @@ -560,7 +560,8 @@ stopNFSFilesystem() { ocf_log info unmounting $mp umount $umount_flag $mp - if [ $? -eq 0 ]; then + retval=$? 
+ if [ $retval -eq 0 -o $retval -eq 16 ]; then umount_failed= done=$YES continue
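The status handling the patch applies in all three agents can be exercised in isolation. In this sketch, umount_stub is a hypothetical stand-in for the real umount so the logic is runnable anywhere; exit status 16 means the umount(2) syscall succeeded but /etc/mtab could not be rewritten (for example because the filesystem holding /etc is full):

```shell
#!/bin/sh
# Sketch of the stop-path logic from the patch: treat umount exit status
# 0 and 16 as success, everything else as failure.

umount_stub() {
    # Hypothetical: pretend the unmount worked but mtab could not be written.
    return 16
}

try_umount() {
    umount_stub "$1"
    retval=$?
    if [ $retval -eq 0 -o $retval -eq 16 ]; then
        echo "unmounted $1"
        return 0
    fi
    echo "umount of $1 failed (rv=$retval)"
    return 1
}

try_umount /mnt/data
# prints: unmounted /mnt/data
```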
[Cluster-devel] [PATCH] qdiskd: Make multipath issues go away
From: Lon Hohberger l...@redhat.com Qdiskd historically has required significant tuning to work around delays which occur during multipath failover, overloaded I/O, and LUN trespasses in both device-mapper-multipath and EMC PowerPath environments. This patch goes a very long way towards eliminating false evictions when these conditions occur by making qdiskd whine to the other cluster members when it detects hung system calls. When a cluster member whines, it indicates the source of the problem (which system call is hung), and the act of receiving a whine from a host indicates that qdiskd is operational, but that I/O is hung. Hung I/O is different from losing storage entirely (where you get I/O errors). Possible problems: - Receive queue getting very full, causing messages to become blocked on a node where I/O is hung. 1) that would take a very long time, and 2) node should get evicted at that point anyway. Resolves: rhbz#782900 this version of the patch is a backport of: e2937eb33f224f86904fead08499a6178868ca6a 34d2872fb7e60be1594158acaaeb8acd74f78d22 There is a minor change vs original patch based on how qdiskd in RHEL5 handles cman connection. We add an extra call to cman_alive in main qdisk_loop to make sure data are not stalled on the cman port, and data_callback to qdiskd_whine executed. Signed-off-by: Lon Hohberger l...@redhat.com Signed-off-by: Fabio M.
Di Nitto fdini...@redhat.com --- cman/daemon/cnxman-socket.h |1 + cman/qdisk/Makefile |2 +- cman/qdisk/disk.h |6 cman/qdisk/iostate.c| 17 +++-- cman/qdisk/iostate.h|4 ++- cman/qdisk/main.c | 54 +++ 6 files changed, 74 insertions(+), 10 deletions(-) diff --git a/cman/daemon/cnxman-socket.h b/cman/daemon/cnxman-socket.h index 351c97c..1d01b44 100644 --- a/cman/daemon/cnxman-socket.h +++ b/cman/daemon/cnxman-socket.h @@ -79,6 +79,7 @@ #define CLUSTER_PORT_SERVICES2 #define CLUSTER_PORT_SYSMAN 10/* Remote execution daemon */ #define CLUSTER_PORT_CLVMD 11/* Cluster LVM daemon */ +#defineCLUSTER_PORT_QDISKD 178/* Quorum disk daemon */ /* Port numbers above this will be blocked when the cluster is inquorate or in * transition */ diff --git a/cman/qdisk/Makefile b/cman/qdisk/Makefile index f58806b..9bfc486 100644 --- a/cman/qdisk/Makefile +++ b/cman/qdisk/Makefile @@ -32,7 +32,7 @@ qdiskd: disk.o crc32.o disk_util.o main.o score.o bitmap.o clulog.o \ gcc -o $@ $^ -lpthread -L../lib -L${ccslibdir} -lccs -lrt mkqdisk: disk.o crc32.o disk_util.o iostate.o \ -proc.o mkqdisk.o scandisk.o clulog.o gettid.o +proc.o mkqdisk.o scandisk.o clulog.o gettid.o ../lib/libcman.a gcc -o $@ $^ -lrt %.o: %.c diff --git a/cman/qdisk/disk.h b/cman/qdisk/disk.h index b784220..d491de1 100644 --- a/cman/qdisk/disk.h +++ b/cman/qdisk/disk.h @@ -290,6 +290,12 @@ typedef struct { status_block_t ni_status; } node_info_t; +typedef struct { + qd_ctx *ctx; + node_info_t *ni; + size_t ni_len; +} qd_priv_t; + int qd_write_status(qd_ctx *ctx, int nid, disk_node_state_t state, disk_msg_t *msg, memb_mask_t mask, memb_mask_t master); int qd_read_print_status(target_info_t *disk, int nid); diff --git a/cman/qdisk/iostate.c b/cman/qdisk/iostate.c index 65b4d50..eb74ad2 100644 --- a/cman/qdisk/iostate.c +++ b/cman/qdisk/iostate.c @@ -1,10 +1,14 @@ #include pthread.h +#include libcman.h #include iostate.h #include unistd.h #include time.h #include sys/time.h #include clulog.h +#include stdint.h +#include 
platform.h #include iostate.h +#include ../daemon/cnxman-socket.h static iostate_t main_state = 0; static int main_incarnation = 0; @@ -26,7 +30,7 @@ static struct state_table io_state_table[] = { { STATE_LSEEK,seek }, { -1, NULL} }; -static const char * +const char * state_to_string(iostate_t state) { static const char *ret = unknown; @@ -65,6 +69,8 @@ io_nanny_thread(void *arg) iostate_t last_main_state = 0, current_main_state = 0; int last_main_incarnation = 0, current_main_incarnation = 0; int logged_incarnation = 0; + cman_handle_t ch = (cman_handle_t)arg; + int32_t whine_state; /* Start with wherever we're at now */ pthread_mutex_lock(state_mutex); @@ -96,6 +102,11 @@ io_nanny_thread(void *arg) continue; } + /* Whine on CMAN api */ + whine_state = (int32_t)current_main_state; + swab32(whine_state); + cman_send_data(ch, whine_state, sizeof(int32_t), 0, CLUSTER_PORT_QDISKD, 0); + /* Don't log things twice */ if (logged_incarnation == current_main_incarnation) continue; @@ -114,7 +125,7 @@ io_nanny_thread(void *arg) int -io_nanny_start(int timeout
[Cluster-devel] [PATCH] cman-preconfig: allow host aliases as valid cluster nodenames
From: Fabio M. Di Nitto fdini...@redhat.com Resolves: rhbz#786118 Signed-off-by: Fabio M. Di Nitto fdini...@redhat.com --- cman/daemon/cman-preconfig.c | 91 +++--- 1 files changed, 76 insertions(+), 15 deletions(-) diff --git a/cman/daemon/cman-preconfig.c b/cman/daemon/cman-preconfig.c index d88ff3d..68fec22 100644 --- a/cman/daemon/cman-preconfig.c +++ b/cman/daemon/cman-preconfig.c @@ -451,7 +451,7 @@ static int verify_nodename(struct objdb_iface_ver0 *objdb, char *node) struct sockaddr *sa; hdb_handle_t nodes_handle; hdb_handle_t find_handle = 0; - int error; + int found = 0; /* nodename is either from commandline or from uname */ if (nodelist_byname(objdb, cluster_parent_handle, node)) @@ -497,12 +497,11 @@ static int verify_nodename(struct objdb_iface_ver0 *objdb, char *node) } objdb-object_find_destroy(find_handle); - - /* The cluster.conf names may not be related to uname at all, - they may match a hostname on some network interface. - NOTE: This is IPv4 only */ - error = getifaddrs(ifa_list); - if (error) + /* +* The cluster.conf names may not be related to uname at all, +* they may match a hostname on some network interface. 
+*/ + if (getifaddrs(ifa_list)) return -1; for (ifa = ifa_list; ifa; ifa = ifa-ifa_next) { @@ -521,12 +520,13 @@ static int verify_nodename(struct objdb_iface_ver0 *objdb, char *node) if (sa-sa_family == AF_INET6) salen = sizeof(struct sockaddr_in6); - error = getnameinfo(sa, salen, nodename2, - sizeof(nodename2), NULL, 0, 0); - if (!error) { + if (getnameinfo(sa, salen, + nodename2, sizeof(nodename2), + NULL, 0, 0) == 0) { if (nodelist_byname(objdb, cluster_parent_handle, nodename2)) { strncpy(node, nodename2, sizeof(nodename) - 1); + found = 1; goto out; } @@ -537,27 +537,88 @@ static int verify_nodename(struct objdb_iface_ver0 *objdb, char *node) if (nodelist_byname(objdb, cluster_parent_handle, nodename2)) { strncpy(node, nodename2, sizeof(nodename) - 1); + found = 1; goto out; } } } /* See if it's the IP address that's in cluster.conf */ - error = getnameinfo(sa, sizeof(*sa), nodename2, - sizeof(nodename2), NULL, 0, NI_NUMERICHOST); - if (error) + if (getnameinfo(sa, sizeof(*sa), + nodename2, sizeof(nodename2), + NULL, 0, NI_NUMERICHOST)) continue; if (nodelist_byname(objdb, cluster_parent_handle, nodename2)) { strncpy(node, nodename2, sizeof(nodename) - 1); + found = 1; goto out; } } - error = -1; out: + if (found) { + freeifaddrs(ifa_list); + return 0; + } + + /* +* This section covers the usecase where the nodename specified in cluster.conf +* is an alias specified in /etc/hosts. For example: +* ipaddr hostname alias1 alias2 +* and clusternode name=alias2 +* the above calls use uname and getnameinfo does not return aliases. +* here we take the name specified in cluster.conf, resolve it to an address +* and then compare against all known local ip addresses. +* if we have a match, we found our nodename. In theory this chunk of code +* could replace all the checks above, but let's avoid any possible regressions +* and use it as last. 
+*/ + + nodes_handle = nodeslist_init(objdb, cluster_parent_handle, find_handle); + while (nodes_handle) { + char *dbnodename = NULL; + struct addrinfo hints; + struct addrinfo *result = NULL, *rp = NULL; + + if (objdb_get_string(objdb, nodes_handle, name, dbnodename)) { + goto next; + } + + memset(hints, 0, sizeof(struct addrinfo)); + hints.ai_family = AF_UNSPEC; + hints.ai_socktype = SOCK_DGRAM; + hints.ai_flags = 0; + hints.ai_protocol = IPPROTO_UDP; + + if (getaddrinfo(dbnodename, NULL, hints, result)) + goto next; + + for (rp
[Cluster-devel] [PATCH] rgmanager: fix nfsrestart option to be effective
From: Fabio M. Di Nitto fdini...@redhat.com The original patch e512a9ce367 was still racy in some conditions as other rpc.* and nfs* processes were holding a lock on the filesystem. stopping nfs in kernel is simply not enough in rhel5 this fixed version does stop nfs completely and re-instante nfs exports. Resolves: rhbz#822066 Signed-off-by: Fabio M. Di Nitto fdini...@redhat.com --- rgmanager/src/resources/clusterfs.sh | 31 --- rgmanager/src/resources/fs.sh| 31 --- 2 files changed, 40 insertions(+), 22 deletions(-) diff --git a/rgmanager/src/resources/clusterfs.sh b/rgmanager/src/resources/clusterfs.sh index 89b30a2..49eb724 100755 --- a/rgmanager/src/resources/clusterfs.sh +++ b/rgmanager/src/resources/clusterfs.sh @@ -681,7 +681,10 @@ stopFilesystem() { typeset -i max_tries=3 # how many times to try umount typeset -i sleep_time=2 # time between each umount failure typeset -i refs=0 - typeset nfsdthreads + typeset nfsexports= + typeset nfsexp= + typeset nfsopts= + typeset nfsacl= typeset done= typeset umount_failed= typeset force_umount= @@ -804,16 +807,22 @@ stop: Could not match $OCF_RESKEY_device with a real device if [ $OCF_RESKEY_nfsrestart = yes ] || \ [ $OCF_RESKEY_nfsrestart = 1 ]; then - if [ -f /proc/fs/nfsd/threads ]; then - ocf_log warning Restarting nfsd/nfslock - nfsdthreads=$(cat /proc/fs/nfsd/threads) - service nfslock stop - echo 0 /proc/fs/nfsd/threads - echo $nfsdthreads /proc/fs/nfsd/threads - service nfslock start - else - ocf_log err Unable to determin nfsd information. 
nfsd restart aborted - fi + ocf_log warning Restarting nfsd/nfslock + nfsexports=$(cat /var/lib/nfs/etab) + service nfslock stop + service nfs stop + service nfs start + service nfslock start + echo $nfsexports | { while read line; do + nfsexp=$(echo $line | awk '{print $1}') + nfsopts=$(echo $line | sed -e 's#.*(##g' -e 's#).*##g') + nfsacl=$(echo $line | awk '{print $2}' | sed -e 's#(.*##g') + if [ -n $nfsopts ]; then + exportfs -i -o $nfsopts $nfsacl:$nfsexp + else + exportfs -i $nfsacl:$nfsexp + fi + done; } fi else diff --git a/rgmanager/src/resources/fs.sh b/rgmanager/src/resources/fs.sh index 5724352..a98cddc 100755 --- a/rgmanager/src/resources/fs.sh +++ b/rgmanager/src/resources/fs.sh @@ -1019,7 +1019,10 @@ stopFilesystem() { typeset -i max_tries=3 # how many times to try umount typeset -i sleep_time=5 # time between each umount failure typeset -i nfslock_reclaim=0 - typeset nfsdthreads + typeset nfsexports= + typeset nfsexp= + typeset nfsopts= + typeset nfsacl= typeset done= typeset umount_failed= typeset force_umount= @@ -1126,16 +1129,22 @@ stop: Could not match $OCF_RESKEY_device with a real device if [ $OCF_RESKEY_nfsrestart = yes ] || \ [ $OCF_RESKEY_nfsrestart = 1 ]; then - if [ -f /proc/fs/nfsd/threads ]; then - ocf_log warning Restarting nfsd/nfslock - nfsdthreads=$(cat /proc/fs/nfsd/threads) - service nfslock stop - echo 0 /proc/fs/nfsd/threads
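The clusterfs.sh hunk above saves /var/lib/nfs/etab before stopping nfs and then re-exports each saved entry. An etab line has the shape `/export/path client(options)`; the field extraction used by the patch can be tested on its own against a hypothetical entry (the path, client, and options below are invented):

```shell
#!/bin/sh
# Sketch of the etab parsing done in the patch: split an export entry into
# the export path, the option string, and the client ACL, the three pieces
# needed to re-run exportfs after nfs has been restarted.

line='/srv/share 192.168.1.0/24(rw,sync,no_root_squash)'

nfsexp=$(echo "$line" | awk '{print $1}')                       # export path
nfsopts=$(echo "$line" | sed -e 's#.*(##g' -e 's#).*##g')       # option string
nfsacl=$(echo "$line" | awk '{print $2}' | sed -e 's#(.*##g')   # client ACL

echo "path=$nfsexp opts=$nfsopts acl=$nfsacl"
# prints: path=/srv/share opts=rw,sync,no_root_squash acl=192.168.1.0/24

# The patch would then re-instate the export with:
#   exportfs -i -o "$nfsopts" "$nfsacl:$nfsexp"
```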
Re: [Cluster-devel] [PATCH] rgmanager: fix nfsrestart option to be effective
On 6/21/2012 3:26 PM, Lon Hohberger wrote: On 06/21/2012 04:07 AM, Fabio M. Di Nitto wrote: From: Fabio M. Di Nittofdini...@redhat.com The original patch e512a9ce367 was still racy in some conditions as other rpc.* and nfs* processes were holding a lock on the filesystem. stopping nfs in kernel is simply not enough in rhel5 this fixed version does stop nfs completely and re-instante nfs exports. Resolves: rhbz#822066 This is okay; ideally we wouldn't have to do this in the first place, however. and I would like some ponies, rainbows and unicorns.. however. Fabio
Re: [Cluster-devel] [PATCH] mkfs.gfs2: Follow symlinks before checking device contents
Hi, On 6/20/2012 6:15 PM, Bob Peterson wrote: - Original Message - | + absname = canonicalize_file_name(sdp->device_name); Hi Andy, Thanks for the patch. I just wanted to point out that in the past we've used realpath rather than canonicalize_file_name. For example, see this patch we did a long time ago to gfs2_tool: http://git.fedorahosted.org/git/?p=cluster.git;a=commitdiff;h=e70898cfa09939a7100a057433fff3a4ad666bdd It would be nice if our use was consistent. I'm not sure if there's an advantage of one over the other. If canonicalize_file_name is now preferred upstream over realpath, we should probably replace all occurrences of that. On the other hand, if realpath is now preferred upstream, we should adjust this patch to use it instead. AFAIK, they are the same, and I don't have a personal preference; whatever is most favoured by the upstream community. :) Otherwise, the patch looks good. I don´t remember what other mkfs.* tools do, but I would prefer to see something like: # ./mkfs.gfs2 -p lock_nolock /dev/vg/test WARNING: /dev/vg/test appears to be a symlink to /dev/real/device This will destroy any data on /dev/real/device It appears to contain: RANDOM_FS_OF_DOOM (blocksize..) Fabio
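The warning Fabio sketches is built on ordinary path resolution. A hedged shell sketch of the detection step, using readlink -f on a throwaway symlink (the paths are created in a temp dir purely for illustration; the actual mkfs.gfs2 code would do this in C via realpath or canonicalize_file_name):

```shell
#!/bin/sh
# Sketch: resolve the device path the user gave and warn when it is a
# symlink pointing at a different node, before destroying any data.

tmpdir=$(mktemp -d)
touch "$tmpdir/real_device"
ln -s "$tmpdir/real_device" "$tmpdir/vg_test"

dev="$tmpdir/vg_test"
resolved=$(readlink -f "$dev")   # canonical path with symlinks resolved

if [ "$resolved" != "$dev" ]; then
    echo "WARNING: $dev appears to be a symlink to $resolved"
    echo "This will destroy any data on $resolved"
fi

rm -rf "$tmpdir"
```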
Re: [Cluster-devel] when do I need to start cpglockd
On 6/19/2012 6:23 AM, Dietmar Maurer wrote: Yes, that's a bug. cpglockd will be started from the rgmanager init script when RRP mode is enabled. Ryan Actually no, it's not a bug. cpglockd has its own init script too. Yes, and that script 'unconditionally' (always) starts cpglockd Nothing wrong with that. If you ask a daemon to start it will start :) On top of that, cpglockd is harmless if there is no RRP mode active, or forcefully disabled. The Required-Start: tells sysvinit that if cpglockd is enabled, it has to be started before rgmanager. That tells sysvinit to always start that script before rgmanager. So we end up with cpglockd always running, although it is not required at all. What do I miss? It tells sysvinit to start cpglockd before rgmanager IF cpglockd is enabled via chkconfig, otherwise it is not started. That value is used only to calculate the symlink S* K** values for rc.d/ Fabio
Re: [Cluster-devel] when do I need to start cpglockd
On 6/19/2012 8:54 AM, Dietmar Maurer wrote: Yes, and that script 'unconditionally' (always) starts cpglockd Nothing wrong with that. If you ask a daemon to start it will start :) For me this is wrong. I have to maintain a debian package, and I do not want to start unnecessary daemons. So I simply remove that dependency. If Debian handling of daemons has changed, then the change is debian specific, it doesn´t make it a bug for all distributions. Last I checked if I run: apt-get install bind9 - bind9 will start automatically. Or for that matter also apache2 or The init scripts we deliver are as generic as possible, it doesn´t mean that they fit everything everywhere. And then again, expressing an order is correct. If Required-Start behavior in Debian is different than in other distro (I can speak for Fedora/RHEL here), then clearly there needs to be some distro specific tuning. Fabio
Re: [Cluster-devel] when do I need to start cpglockd
On 6/19/2012 9:24 AM, Dietmar Maurer wrote: And then again, expressing an order is correct. If Required-Start behavior in Debian is different than in other distro (I can speak for Fedora/RHEL here), then clearly there needs to be some distro specific tuning. You simply start a daemon which is not necessary. And I guess you do that on all distros if there is a Required-Start start dependency. Fresh install on Fedora: root@fedora16-node2 ~]# chkconfig --list |grep cpg cpglockd0:off 1:off 2:off 3:off 4:off 5:off 6:off [root@fedora16-node2 ~]# chkconfig rgmanager on [root@fedora16-node2 ~]# chkconfig --list |grep rg rgmanager 0:off 1:off 2:on3:on4:on5:on6:off [root@fedora16-node2 ~]# chkconfig --list |grep cpg cpglockd0:off 1:off 2:off 3:off 4:off 5:off 6:off [reboot] [root@fedora16-node2 ~]# ps ax|grep cpglockd 3741 pts/1S+ 0:00 grep --color=auto cpglockd [root@fedora16-node2 ~]# [root@fedora16-node2 ~]# clustat [SNIP] service:vip1 fedora16-node2 started As you can see, rgmanager is on, cpglockd off. At boot rgmanager starts fine, without cpglockd running. I think the problem here is the interpretation of the LSB specifications between different distributions. I am not going to argue which one is right or wrong but the key issue is here: An init.d shell script may declare using the Required-Start: header that it shall not be run until certain boot facilities are provided. This information is used by the installation tool or the boot-time boot-script execution facility to assure that init scripts are run in the correct order. In the fedora world that means that if cpglockd is enabled (via chkconfig), the Required-Start: make sure that cpglockd is started before rgmanager, always. It is possible that other distributions might interpret that as: cpglockd must be started even if disabled when rgmanager Required-Start: cpglockd and rgmanager is enabled. So based on the platform I use for testing/development, the daemon does not start unless it is necessary :) Fabio
Re: [Cluster-devel] when do I need to start cpglockd
On 6/19/2012 10:12 AM, Dietmar Maurer wrote: At boot rgmanager starts fine, without cpglockd running. I think the problem here is the interpretation of the LSB specifications between different distributions. I am not going to argue which one is right or wrong but the key issue is here: An init.d shell script may declare using the Required-Start: header that it shall not be run until certain boot facilities are provided. This information is used by the installation tool or the boot-time boot-script execution facility to assure that init scripts are run in the correct order. In the fedora world that means that if cpglockd is enabled (via chkconfig), the Required-Start: make sure that cpglockd is started before rgmanager, always. It is possible that other distributions might interpret that as: cpglockd must be started even if disabled when rgmanager Required-Start: cpglockd and rgmanager is enabled. So based on the platform I use for testing/development, the daemon does not start unless it is necessary :) OK, I was not aware of that. Many thanks for that detailed reply! So let´s instead try to figure out the correct fix. Let´s put one minute aside the possibility that some distributions might use the second interpretation of LSB header and focus only on the ordering instead. Dropping Required-Start: might look like an easy fix in the Debian world, but that could cripple the startup order as cpglockd could theoretically land after rgmanager (i don´t think it´s possible, but let´s not take a chance). I think the correct fix should be: move the conditional start start_cpglockd function/check from rgmanager.init to cpglockd.init. move the cpglockd is up and running test from rgmanager.init to cpglockd.init (that´s a bug as-is now). cpglockd.init should return 0 (success) if it does not need to run and would allow rgmanager to start given Debian current interpretation of LSB header. 
rgmanager.init can simply fire cpglockd.init without any check, as those would be done properly by cpglockd.init. I think this should solve the issue for Debian and keep current behavior in Fedora. Fabio
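The proposed split can be sketched as an init-script pattern: the dependent script decides for itself whether it is needed and exits 0 when it is not, so callers may invoke it unconditionally. rrp_enabled and start_daemon below are hypothetical stand-ins for the real RRP-mode detection and daemon startup:

```shell
#!/bin/sh
# Sketch of the conditional-start pattern proposed for cpglockd.init:
# returning success when the daemon is not required means rgmanager.init
# can fire this script without any checks of its own, and Debian's
# Required-Start interpretation no longer forces the daemon to run.

rrp_enabled() {
    # Real code would inspect cluster.conf/corosync for RRP mode.
    [ "${RRP_MODE:-none}" != "none" ]
}

start_daemon() {
    echo "cpglockd started"
}

cpglockd_start() {
    if ! rrp_enabled; then
        echo "cpglockd not required (no RRP), nothing to do"
        return 0   # success, so dependents like rgmanager still start
    fi
    start_daemon
}

RRP_MODE=none
cpglockd_start      # prints: cpglockd not required (no RRP), nothing to do
RRP_MODE=passive
cpglockd_start      # prints: cpglockd started
```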
Re: [Cluster-devel] when do I need to start cpglockd
On 06/14/2012 06:06 PM, Ryan McCabe wrote: On Thu, Jun 14, 2012 at 03:41:39PM +, Dietmar Maurer wrote: I can't see that in the current cman init script. Instead, the rgmanager init script depends on the cpglockd unconditionally: # Required-Start: cman cpglockd So that is a bug? Hi, Yes, that's a bug. cpglockd will be started from the rgmanager init script when RRP mode is enabled. Ryan Actually no, it's not a bug. cpglockd has its own init script too. The Required-Start: tells sysvinit that if cpglockd is enabled, it has to be started before rgmanager. rgmanager snippet to start cpglockd is there only for backward compatibility mode that avoids breaking upgrades from non RRP environments to RRP. This was done so that users didn't need to enable cpglockd via chkconfig (being a new daemon and all is not known yet). A perfect install would see the user doing: chkconfig cpglockd on chkconfig rgmanager on only for RRP installations. But then again, docs are fresh, cpglockd is new.. might as well help the users not to shoot their foot with an RRP gun ;) Fabio
[Cluster-devel] [PATCH] rgmanager: add nfsdrestart option as last resource to umount fs
From: Fabio M. Di Nitto fdini...@redhat.com Resolves: rhbz#822053 Signed-off-by: Fabio M. Di Nitto fdini...@redhat.com --- rgmanager/src/resources/fs.sh.in | 26 ++ 1 files changed, 26 insertions(+), 0 deletions(-) diff --git a/rgmanager/src/resources/fs.sh.in b/rgmanager/src/resources/fs.sh.in index c43c177..404fe01 100644 --- a/rgmanager/src/resources/fs.sh.in +++ b/rgmanager/src/resources/fs.sh.in @@ -135,6 +135,18 @@ do_metadata() content type=boolean/ /parameter + parameter name=nfsrestart inherit=nfsrestart + longdesc lang=en + If set and unmounting the file system fails, the node will + try to restart nfs daemon and nfs lockd to drop all filesystem + references. Use this option as last resource. + /longdesc + shortdesc lang=en + Enable NFS daemon and lockd workaround + /shortdesc + content type=boolean/ + /parameter + parameter name=fsid longdesc lang=en File system ID for NFS exports. This can be overridden @@ -446,6 +458,20 @@ do_force_unmount() { export nfslock_reclaim=1 fi + if [ $OCF_RESKEY_nfsrestart = yes ] || \ + [ $OCF_RESKEY_nfsrestart = 1 ]; then + if [ -f /proc/fs/nfsd/threads ]; then + ocf_log warning Restarting nfsd/nfslock + nfsdthreads=$(cat /proc/fs/nfsd/threads) + service nfslock stop + rpc.nfsd 0 + rpc.nfsd $nfsdthreads + service nfslock start + else + ocf_log err Unable to determin nfsd information. nfsd restart aborted + fi + fi + # Proceed with fuser -kvm... return 1 } -- 1.7.7.6
[Cluster-devel] [PATCH] rgmanager: add nfsdrestart option as last resource to umount fs
From: Fabio M. Di Nitto fdini...@redhat.com Resolves: rhbz#822066 Signed-off-by: Fabio M. Di Nitto fdini...@redhat.com --- rgmanager/src/resources/fs.sh | 27 +++ 1 files changed, 27 insertions(+), 0 deletions(-) diff --git a/rgmanager/src/resources/fs.sh b/rgmanager/src/resources/fs.sh index 49912c2..f67f80e 100755 --- a/rgmanager/src/resources/fs.sh +++ b/rgmanager/src/resources/fs.sh @@ -202,6 +202,18 @@ meta_data() <content type="boolean"/> </parameter> + <parameter name="nfsrestart" inherit="nfsrestart"> + <longdesc lang="en"> + If set and unmounting the file system fails, the node will + try to restart the nfs daemon and nfs lockd to drop all filesystem + references. Use this option as a last resort. + </longdesc> + <shortdesc lang="en"> + Enable NFS daemon and lockd workaround + </shortdesc> + <content type="boolean"/> + </parameter> + <parameter name="fsid"> <longdesc lang="en"> File system ID for NFS exports. This can be overridden @@ -1005,6 +1017,7 @@ stopFilesystem() { typeset -i max_tries=3 # how many times to try umount typeset -i sleep_time=5 # time between each umount failure typeset -i nfslock_reclaim=0 + typeset nfsdthreads typeset done= typeset umount_failed= typeset force_umount= @@ -1108,6 +1121,20 @@ stop: Could not match $OCF_RESKEY_device with a real device notify_list_store $mp/.clumanager/statd nfslock_reclaim=1 fi + + if [ "$OCF_RESKEY_nfsrestart" = "yes" ] || \ + [ "$OCF_RESKEY_nfsrestart" = "1" ]; then + if [ -f /proc/fs/nfsd/threads ]; then + ocf_log warning "Restarting nfsd/nfslock" + nfsdthreads=$(cat /proc/fs/nfsd/threads) + service nfslock stop + echo 0 > /proc/fs/nfsd/threads + echo $nfsdthreads > /proc/fs/nfsd/threads + service nfslock start + else + ocf_log err "Unable to determine nfsd information. nfsd restart aborted" + fi + fi else fuser -kvm $mp fi -- 1.7.7.6
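The core of the workaround above is: save the nfsd thread count, drop it to zero to release the filesystem references, then restore it. A runnable sketch of that logic (the helper name and parameters are hypothetical, used here so the thread-count handling can be exercised without a real NFS server; in the agent the file is /proc/fs/nfsd/threads and the commands are "service nfslock stop/start"):

```shell
# restart_nfsd THREADS_FILE STOP_CMD START_CMD
# Saves the nfsd thread count, drops to 0 threads to release all
# filesystem references, then restores the saved count.
restart_nfsd() {
    threads_file=$1   # normally /proc/fs/nfsd/threads
    stop_cmd=$2       # normally "service nfslock stop"
    start_cmd=$3      # normally "service nfslock start"

    if [ ! -f "$threads_file" ]; then
        echo "Unable to determine nfsd thread count; restart aborted" >&2
        return 1
    fi
    nfsdthreads=$(cat "$threads_file")
    $stop_cmd
    echo 0 > "$threads_file"              # kill all nfsd threads
    echo "$nfsdthreads" > "$threads_file" # bring them back up
    $start_cmd
}
```

Writing the saved count back is equivalent to the rpc.nfsd calls used in the RHEL5 variant of this patch; both end up poking the same kernel knob.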
[Cluster-devel] [PATCH] cman init: allow sysconfig/cman to pass options to dlm_controld
From: Fabio M. Di Nitto fdini...@redhat.com DLM_CONTROLD_OPTS= can now be used to pass startup options to the daemon. Resolves: rhbz#821016 Signed-off-by: Fabio M. Di Nitto fdini...@redhat.com --- cman/init.d/cman.in | 5 ++++- cman/init.d/cman.init.defaults.in | 3 +++ 2 files changed, 7 insertions(+), 1 deletions(-) diff --git a/cman/init.d/cman.in b/cman/init.d/cman.in index a39f19f..dddfe6e 100644 --- a/cman/init.d/cman.in +++ b/cman/init.d/cman.in @@ -116,6 +116,9 @@ fi # empty or any other value (default) | cman init will start the daemons #CMAN_DAEMONS_START= +# DLM_CONTROLD_OPTS -- allow extra options to be passed to dlm_controld daemon. +[ -z "$DLM_CONTROLD_OPTS" ] && DLM_CONTROLD_OPTS="" # FENCE_JOIN_TIMEOUT -- seconds to wait for fence domain join to # complete. If the join hasn't completed in this time, fence_tool join # exits with an error, and this script exits with an error. To wait @@ -674,7 +677,7 @@ stop_fenced() start_dlm_controld() { - start_daemon dlm_controld || return 1 + start_daemon dlm_controld $DLM_CONTROLD_OPTS || return 1 if [ "$INITLOGLEVEL" = "full" ]; then ok diff --git a/cman/init.d/cman.init.defaults.in b/cman/init.d/cman.init.defaults.in index 04b3b5b..adde8d9 100644 --- a/cman/init.d/cman.init.defaults.in +++ b/cman/init.d/cman.init.defaults.in @@ -39,6 +39,9 @@ # empty or any other value (default) | cman init will start the daemons #CMAN_DAEMONS_START= +# DLM_CONTROLD_OPTS -- allow extra options to be passed to dlm_controld daemon. +#DLM_CONTROLD_OPTS= + # FENCE_JOIN_TIMEOUT -- seconds to wait for fence domain join to # complete. If the join hasn't completed in this time, fence_tool join # exits with an error, and this script exits with an error. To wait -- 1.7.7.6
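A quick sketch of how the new variable flows from sysconfig into the daemon invocation. The option string and the start_daemon stub are only illustrative assumptions, not real recommendations; in the init script start_daemon comes from the init function library and the variable is sourced from /etc/sysconfig/cman:

```shell
# Example value, as a user might set in /etc/sysconfig/cman (illustrative only):
DLM_CONTROLD_OPTS="-q 0"

# Stub standing in for the init library's start_daemon helper.
start_daemon() { echo "starting: $*"; }

# Matches the patched call site: extra options are appended unquoted,
# so a multi-option string splits into separate daemon arguments.
start_daemon dlm_controld $DLM_CONTROLD_OPTS
# prints: starting: dlm_controld -q 0
```

Leaving $DLM_CONTROLD_OPTS unquoted at the call site is deliberate here: an empty value expands to nothing, and a value with several flags expands to several arguments.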
[Cluster-devel] [PATCH] cman init: add extra documentation for FENCE_JOIN=
From: Fabio M. Di Nitto fdini...@redhat.com Related: rhbz#821016 Signed-off-by: Fabio M. Di Nitto fdini...@redhat.com --- cman/init.d/cman.in |3 +++ cman/init.d/cman.init.defaults.in |3 +++ 2 files changed, 6 insertions(+), 0 deletions(-) diff --git a/cman/init.d/cman.in b/cman/init.d/cman.in index dddfe6e..95323b4 100644 --- a/cman/init.d/cman.in +++ b/cman/init.d/cman.in @@ -135,6 +135,9 @@ fi # set to yes, then the script will attempt to join the fence domain. # If FENCE_JOIN is set to any other value, the default behavior is # to join the fence domain (equivalent to yes). +# When setting FENCE_JOIN to no, it is important to check +# DLM_CONTROLD_OPTS to reflect expected behavior regarding fencing +# and quorum. [ -z $FENCE_JOIN ] FENCE_JOIN=yes # FENCED_OPTS -- allow extra options to be passed to fence daemon. diff --git a/cman/init.d/cman.init.defaults.in b/cman/init.d/cman.init.defaults.in index adde8d9..b981bab 100644 --- a/cman/init.d/cman.init.defaults.in +++ b/cman/init.d/cman.init.defaults.in @@ -58,6 +58,9 @@ # set to yes, then the script will attempt to join the fence domain. # If FENCE_JOIN is set to any other value, the default behavior is # to join the fence domain (equivalent to yes). +# When setting FENCE_JOIN to no, it is important to check +# DLM_CONTROLD_OPTS to reflect expected behavior regarding fencing +# and quorum. #FENCE_JOIN=yes # FENCED_OPTS -- allow extra options to be passed to fence daemon. -- 1.7.7.6
Re: [Cluster-devel] GFS2: Update main gfs2 doc
On 5/10/2012 2:11 PM, Steven Whitehouse wrote: From 49f30789fc33c4516fbe123f05ea4313866381d3 Mon Sep 17 00:00:00 2001 From: Steven Whitehouse swhit...@redhat.com Date: Thu, 10 May 2012 11:45:31 +0100 Subject: [PATCH 1/2] GFS2: Update main gfs2 doc Various items were a bit out of date, so this is a refresh to the latest info. Signed-off-by: Steven Whitehouse swhit...@redhat.com diff --git a/Documentation/filesystems/gfs2.txt b/Documentation/filesystems/gfs2.txt index 4cda926..cc4f230 100644 --- a/Documentation/filesystems/gfs2.txt +++ b/Documentation/filesystems/gfs2.txt @@ -1,7 +1,7 @@ Global File System -- -http://sources.redhat.com/cluster/wiki/ +https://fedorahosted.org/cluster/wiki/HomePage GFS is a cluster file system. It allows a cluster of computers to simultaneously use a block device that is shared between them (with FC, @@ -30,7 +30,8 @@ needed, simply: If you are using Fedora, you need to install the gfs2-utils package and, for lock_dlm, you will also need to install the cman package -and write a cluster.conf as per the documentation. +and write a cluster.conf as per the documentation. For F17 and above +cman has been replaced by the dlm package. ^^^ cman has been replaced by corosync 2.0 (or higher) in combination with the votequorum provider (see votequorum.5). gfs2 still requires dlm for its dependencies, but it's not a replacement. Fabio
Re: [Cluster-devel] GFS2: Update main gfs2 doc
On 5/10/2012 3:13 PM, Steven Whitehouse wrote: Hi, On Thu, 2012-05-10 at 15:09 +0200, Fabio M. Di Nitto wrote: On 5/10/2012 2:11 PM, Steven Whitehouse wrote: From 49f30789fc33c4516fbe123f05ea4313866381d3 Mon Sep 17 00:00:00 2001 From: Steven Whitehouse swhit...@redhat.com Date: Thu, 10 May 2012 11:45:31 +0100 Subject: [PATCH 1/2] GFS2: Update main gfs2 doc Various items were a bit out of date, so this is a refresh to the latest info. Signed-off-by: Steven Whitehouse swhit...@redhat.com diff --git a/Documentation/filesystems/gfs2.txt b/Documentation/filesystems/gfs2.txt index 4cda926..cc4f230 100644 --- a/Documentation/filesystems/gfs2.txt +++ b/Documentation/filesystems/gfs2.txt @@ -1,7 +1,7 @@ Global File System -- -http://sources.redhat.com/cluster/wiki/ +https://fedorahosted.org/cluster/wiki/HomePage GFS is a cluster file system. It allows a cluster of computers to simultaneously use a block device that is shared between them (with FC, @@ -30,7 +30,8 @@ needed, simply: If you are using Fedora, you need to install the gfs2-utils package and, for lock_dlm, you will also need to install the cman package -and write a cluster.conf as per the documentation. +and write a cluster.conf as per the documentation. For F17 and above +cman has been replaced by the dlm package. ^^^ cman has been replaced by corosync 2.0 (or higher) in combination with votequorum provide (see votequorum.5). corosync was always a requirement though, it gets pulled in through the deps No disagreement on the dependency here, but cman is not replaced by dlm in terms of functionality, that would be incorrect. gfs2 still requires dlm for it´s dependencies but it´s not a replacement. 
Well, it is kind of, since that's where dlm_controld resides, and that now deals with all the recovery stuff since gfs_controld is gone; so maybe it could have been worded better, but it is at least correct in terms of what needs to be installed package-wise. Right, package-wise you are right: you install dlm and you get corosync indirectly. I was only pointing out the functionality chain here vs. the package chain. It might be better to express both in the doc, since the landscape has changed substantially. Fabio
[Cluster-devel] [PATCH] qdisk: Fix man page example (take 2)
From: Fabio M. Di Nitto fdini...@redhat.com Resolves: rhbz#745538 Signed-off-by: Fabio M. Di Nitto fdini...@redhat.com --- cman/man/qdisk.5 |6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/cman/man/qdisk.5 b/cman/man/qdisk.5 index e0b0ff6..ca974fa 100644 --- a/cman/man/qdisk.5 +++ b/cman/man/qdisk.5 @@ -479,11 +479,11 @@ by the qdiskd timeout. .br quorumd interval=1 tko=10 votes=3 label=testing .in 12 -heuristic program=ping A -c1 -t1 score=1 interval=2 tko=3/ +heuristic program=ping A -c1 -w1 score=1 interval=2 tko=3/ .br -heuristic program=ping B -c1 -t1 score=1 interval=2 tko=3/ +heuristic program=ping B -c1 -w1 score=1 interval=2 tko=3/ .br -heuristic program=ping C -c1 -t1 score=1 interval=2 tko=3/ +heuristic program=ping C -c1 -w1 score=1 interval=2 tko=3/ .br .in 8 /quorumd -- 1.7.7.6
[Cluster-devel] [PATCH] cmannotifyd: deliver cluster status at startup and fix daemon init
From: Fabio M. Di Nitto fdini...@redhat.com cmannotifyd is very often (if not always) started _after_ cman is completely settled. That means cmannotifyd does not receive/dispatch any notifications on the current cluster status at startup. change cman connection loop to generate a fake notification that config and membership have changed (we can't poll if they did) and use those information internally too, to reinit logging with new cman connection. Resolves: rhbz#819787 Signed-off-by: Fabio M. Di Nitto fdini...@redhat.com --- cman/notifyd/main.c | 14 ++ 1 files changed, 14 insertions(+), 0 deletions(-) diff --git a/cman/notifyd/main.c b/cman/notifyd/main.c index 3091d2f..4a9f868 100644 --- a/cman/notifyd/main.c +++ b/cman/notifyd/main.c @@ -189,6 +189,10 @@ static void init_logging(int reconf) ccs_read_logging(ccs_handle, cmannotifyd, debug, mode, syslog_facility, syslog_priority, logfile_priority, logfile); ccs_disconnect(ccs_handle); + } else { + if (debug) { + logfile_priority = LOG_DEBUG; + } } if (!daemonize) @@ -311,6 +315,8 @@ static void byebye_cman(void) static void setup_cman(int forever) { int init = 0, active = 0; + int quorate; + const char *str = NULL; retry_init: cman_handle = cman_init(NULL); @@ -346,6 +352,14 @@ retry_active: exit(EXIT_FAILURE); } + logt_print(LOG_DEBUG, Dispatching first cluster status\n); + init_logging(1); + str = CMAN_REASON_CONFIG_UPDATE; + dispatch_notification(str, 0); + str = CMAN_REASON_STATECHANGE; + quorate = cman_is_quorate(cman_handle); + dispatch_notification(str, quorate); + return; out: -- 1.7.7.6
Re: [Cluster-devel] [PATCH 1/2] fence_scsi: fix typos in debug messages
ACK On 04/18/2012 02:01 AM, Ryan O'Hara wrote: Resolves: rhbz#674497 Signed-off-by: Ryan O'Hara roh...@redhat.com --- fence/agents/scsi/fence_scsi.pl |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fence/agents/scsi/fence_scsi.pl b/fence/agents/scsi/fence_scsi.pl index 91f113d..84cee91 100755 --- a/fence/agents/scsi/fence_scsi.pl +++ b/fence/agents/scsi/fence_scsi.pl @@ -111,7 +111,7 @@ sub get_node_id sub get_node_name { -print [$pname]: get_hode_name = $opt_n\n if $opt_v; +print [$pname]: get_node_name = $opt_n\n if $opt_v; return $opt_n; } @@ -163,7 +163,7 @@ sub get_host_name } } -print [$pname]: get_host_nam = $host_name\n if $opt_v; +print [$pname]: get_host_name = $host_name\n if $opt_v; return $host_name; }
Re: [Cluster-devel] [PATCH 2/2] fence_scsi: remove limitations section from man page
ACK On 04/18/2012 02:02 AM, Ryan O'Hara wrote: Resolves: rhbz#753839 Signed-off-by: Ryan O'Hara roh...@redhat.com --- fence/man/fence_scsi.8 |7 --- 1 files changed, 0 insertions(+), 7 deletions(-) diff --git a/fence/man/fence_scsi.8 b/fence/man/fence_scsi.8 index 8a2d5a8..d9ab03f 100644 --- a/fence/man/fence_scsi.8 +++ b/fence/man/fence_scsi.8 @@ -99,12 +99,5 @@ Name of the node to be fenced. \fIverbose = param \fR Verbose output. -.SH LIMITATIONS -The fence_scsi fencing agent requires a minimum of three nodes in the -cluster to operate. For SAN devices connected via fiber channel, -these must be physical nodes. SAN devices connected via iSCSI may use -virtual or physical nodes. In addition, fence_scsi cannot be used in -conjunction with qdisk. - .SH SEE ALSO fence(8), fence_node(8), sg_persist(8), lvs(8), lvm.conf(5)
Re: [Cluster-devel] cluster: RHEL6 - Apply patch from John Ruemker to resolve rhbz#803474
Hi Ryan, This patch is not upstream (STABLE32 branch) and has not been reviewed/ack'ed for inclusion. The commit has been reverted from the RHEL6 branch. Please also write a more comprehensive changelog entry in the commit, because not all bugzillas are visible to the outside world. Example: Fix this or that by init var foo to NULL and compare blabla Patch from Resolves: rhbz#123456 Thanks Fabio On 04/09/2012 09:35 PM, Ryan McCabe wrote: Gitweb: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=cd9d9be98b4276c4e73eac81563f54e92a08045d Commit: cd9d9be98b4276c4e73eac81563f54e92a08045d Parent: 54a29913c5de797da6adb69e03b38487fef451b4 Author: Ryan McCabe rmcc...@redhat.com AuthorDate: Mon Apr 9 15:34:08 2012 -0400 Committer: Ryan McCabe rmcc...@redhat.com CommitterDate: Mon Apr 9 15:35:50 2012 -0400 Apply patch from John Ruemker to resolve rhbz#803474 --- rgmanager/src/daemons/main.c | 8 +++- rgmanager/src/daemons/rg_event.c | 4 ++-- 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/rgmanager/src/daemons/main.c b/rgmanager/src/daemons/main.c index 94047c3..9a1e5e9 100644 --- a/rgmanager/src/daemons/main.c +++ b/rgmanager/src/daemons/main.c @@ -456,7 +456,13 @@ dispatch_msg(msgctx_t *ctx, int nodeid, int need_close) /* Centralized processing or request is from clusvcadm */ nid = event_master(); - if (nid != my_id()) { + if (nid < 0) { + logt_print(LOG_ERR, "#40b: Unable to determine " + "event master\n"); + ret = -1; + goto out; + } + else if (nid != my_id()) { /* Forward the message to the event master */ forward_message(ctx, msg_sm, nid); } else { diff --git a/rgmanager/src/daemons/rg_event.c b/rgmanager/src/daemons/rg_event.c index 7048bc6..e6a2abd 100644 --- a/rgmanager/src/daemons/rg_event.c +++ b/rgmanager/src/daemons/rg_event.c @@ -247,7 +247,7 @@ static int find_master(void) { event_master_t *masterinfo = NULL; - void *data; + void *data = NULL; uint32_t sz; cluster_member_list_t *m; uint64_t vn; @@ -255,7 +255,7 @@ find_master(void) m = member_list(); if (vf_read(m, "Transition-Master", &vn, - (void **)(&data), &sz) < 0) { + (void **)(&data), &sz) != VFR_OK) { logt_print(LOG_ERR, "Unable to discover master status\n"); masterinfo = NULL;
[Cluster-devel] [PATCH 1/2] config: update relax ng schema to include totem miss_count_const
From: Fabio M. Di Nitto fdini...@redhat.com Resolves: rhbz#804938 Signed-off-by: Fabio M. Di Nitto fdini...@redhat.com --- config/tools/xml/cluster.rng.in.head |9 + 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/config/tools/xml/cluster.rng.in.head b/config/tools/xml/cluster.rng.in.head index c2fed3e..4e3d901 100644 --- a/config/tools/xml/cluster.rng.in.head +++ b/config/tools/xml/cluster.rng.in.head @@ -255,6 +255,15 @@ To validate your cluster.conf against this schema, run: calculated from retransmits_before_loss and token. rha:default=4 rha:sample=5/ /optional + optional +attribute name=miss_count_const + rha:description=This constant defines the maximum number of times + on receipt of a token a message is checked for retransmission before + retransmission occurs. This parameter is useful to modify for switches + that delay multicast packets compared to unicast packets. + The default setting works well for nearly all modern switches. + rha:default=5 rha:sample=10/ + /optional !-- FIXME: The following description was adapted from the man page. It may be tool long for the schema document. Consider cutting text after the second sentence and referring the reader to the openais.conf -- 1.7.7.6
[Cluster-devel] [PATCH 2/2] cman init: fix start sequence error handling
From: Fabio M. Di Nitto fdini...@redhat.com Any daemon that fails to start would leave no traces. the problem with cman init is that we need to handle multiple daemons and tools. If one in the chain fails, we never reverted to the original state of the system. This can indeed cause other issues. Fix the init script to stop cman if any error happens during start. Resolves: rhbz#806002 Signed-off-by: Fabio M. Di Nitto fdini...@redhat.com --- cman/init.d/cman.in |7 +++ 1 files changed, 7 insertions(+), 0 deletions(-) diff --git a/cman/init.d/cman.in b/cman/init.d/cman.in index d0c6f70..a39f19f 100644 --- a/cman/init.d/cman.in +++ b/cman/init.d/cman.in @@ -19,6 +19,9 @@ # set secure PATH PATH=/bin:/usr/bin:/sbin:/usr/sbin:@SBINDIR@ +# save invokation for rollback ops +thisinvokation=$0 + chkconfig2() { case $1 in @@ -199,6 +202,9 @@ nok() { echo -e $errmsg failure echo + if [ $currentaction = start ]; then + $thisinvokation stop + fi exit 1 } @@ -744,6 +750,7 @@ leave_fence_domain() start() { + currentaction=start breakpoint=$1 sshd_enabled cd @INITDDIR@ ./sshd start -- 1.7.7.6
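The rollback idea in the patch, reduced to a runnable sketch. The dispatcher function and daemon names here are hypothetical stand-ins; the real script remembers its own path in $thisinvokation and re-invokes itself with "stop" when any step of "start" fails:

```shell
# Stand-in daemon chain: the first starts fine, the second fails,
# mimicking a partial cman startup.
start_daemons() {
    true  || return 1   # e.g. first daemon comes up
    false || return 1   # e.g. dlm_controld fails to start
}

nok() {
    echo "start failed, rolling back"
    # Mirror the patch: a failed start re-enters the script in "stop"
    # mode so the daemons that did come up are torn down again.
    [ "$currentaction" = start ] && cman_init stop
}

cman_init() {
    case "$1" in
        start)
            currentaction=start
            if ! start_daemons; then
                nok
                return 1
            fi
            echo "start OK"
            ;;
        stop)
            echo "stopping daemons"
            ;;
    esac
}
```

The key point is the same as in the patch: "start" records that it is the action in flight, so the error path knows it is safe (and necessary) to run the full "stop" sequence rather than leaving half the daemon chain running.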
Re: [Cluster-devel] [Patch] GFS2: Add gfs2_lockgather script and man page
On 03/05/2012 06:51 PM, Adam Drew wrote: This is a backport of the gfs2_lockgather script and manpage from gfs2_utils upstream. I have to NACK this backport for now. I already explained to Adam what needs changing. Fabio
Re: [Cluster-devel] [PATCH] resource-agnets: Add support for using tunnelled migrations with qemu
Looks good to me. ACK Fabio On 03/05/2012 11:36 PM, Chris Feist wrote: Add support for using tunnelled migrations with qemu Resolves: rhbz#712174 Allow using the --tunnelled option when migrating with virsh
Re: [Cluster-devel] [Patch] GFS2: Add gfs2_lockgather script and man page
On 03/05/2012 07:34 PM, Steven Whitehouse wrote: Hi, On Mon, 2012-03-05 at 19:27 +0100, Fabio M. Di Nitto wrote: On 03/05/2012 06:51 PM, Adam Drew wrote: This is a backport of the gfs2_lockgather script and manpage from gfs2_utils upstream. I have to NACK this backport for now. I already explain to Adam what needs changing. Fabio What is the issue?

There are several:
- The script is GPLv3 and we can't pull it into cluster.git (GPLv2+) without some re-licensing work.
- Some parts of the script make use of /tmp in an unsafe way that can cause security problems (mostly DoS in this case).
- Execution of some cluster commands is not safe. If the cluster is hanging and you want to use this tool to gather data, the script won't work because it will hang as well, creating extra load on the cluster.
- The script needs to handle shell errors correctly, and AFAICT it doesn't. Basically it can give the impression of running correctly without collecting any data (missing set -e or error handling per call).
- (minor) The backport patch needs fixing for the Makefile or it will fail to build/install.

Fabio
Re: [Cluster-devel] [Patch] GFS2: Add gfs2_lockgather script and man page
On 03/05/2012 07:34 PM, Steven Whitehouse wrote: Hi, On Mon, 2012-03-05 at 19:27 +0100, Fabio M. Di Nitto wrote: On 03/05/2012 06:51 PM, Adam Drew wrote: This is a backport of the gfs2_lockgather script and manpage from gfs2_utils upstream. I have to NACK this backport for now. I already explain to Adam what needs changing. Fabio What is the issue?

Forgot to mention in the previous email: since this is a long-running script (tar/ssh/scp..), it needs to handle signal traps and locking differently; if a user hits ctrl+c or the script is killed for whatever reason, it doesn't clean up after itself, leaking disk space and leaving the lock file around, which would block the next run. I didn't check all the paths it uses, but an update to the SELinux policies might be necessary too. Fabio
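The temp-file and signal-handling concerns above boil down to the classic mktemp-plus-trap pattern. A minimal sketch (names are illustrative; this is not the gfs2_lockgather code):

```shell
#!/bin/sh
# Create an unpredictable work directory instead of a fixed /tmp path,
# avoiding the symlink/DoS races raised in the review.
workdir=$(mktemp -d) || exit 1

cleanup() {
    # Remove partial data (and any lock file kept inside the workdir)
    # whether we exit normally or were interrupted.
    rm -rf "$workdir"
}
# EXIT fires on any exit path; INT/TERM turn Ctrl+C or kill into a
# normal exit so the EXIT trap runs.
trap cleanup EXIT
trap 'exit 1' INT TERM

# ... the long-running collection (tar/ssh/scp) would go here ...
: > "$workdir/partial-data"
```

Keeping the lock file inside the trap-cleaned directory means an interrupted run can never block the next one.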
Re: [Cluster-devel] [PATCH] rgmanager: Retry when config is out of sync [RHEL5]
ACK. Fabio On 03/01/2012 12:53 AM, Lon Hohberger wrote: [This patch is already in RHEL5] If you add a service to rgmanager v1 or v2 and that service fails to start on the first node but succeeds in its initial stop operation, there is a chance that the remote instance of rgmanager has not yet reread the configuration, causing the service to be placed into the 'recovering' state without further action. This patch causes the originator of the request to retry the operation. Later versions of rgmanager (ex STABLE3 branch and derivatives) are unlikely to have this problem since configuration updates are not polled, but rather delivered to clients. Update 22-Feb-2012: The above is incorrect, this was reproduced a rgmanager v3 installation. Resolves: rhbz#796272 Signed-off-by: Lon Hohberger l...@redhat.com --- rgmanager/src/daemons/rg_state.c | 19 +++ 1 files changed, 19 insertions(+), 0 deletions(-) diff --git a/rgmanager/src/daemons/rg_state.c b/rgmanager/src/daemons/rg_state.c index 23a4bec..8c5af5b 100644 --- a/rgmanager/src/daemons/rg_state.c +++ b/rgmanager/src/daemons/rg_state.c @@ -1801,6 +1801,7 @@ handle_relocate_req(char *svcName, int orig_request, int preferred_target, rg_state_t svcStatus; int target = preferred_target, me = my_id(); int ret, x, request = orig_request; + int retries; get_rg_state_local(svcName, svcStatus); if (svcStatus.rs_state == RG_STATE_DISABLED || @@ -1933,6 +1934,8 @@ handle_relocate_req(char *svcName, int orig_request, int preferred_target, if (target == me) goto exhausted; + retries = 0; +retry: ret = svc_start_remote(svcName, request, target); switch (ret) { case RG_ERUN: @@ -1942,6 +1945,22 @@ handle_relocate_req(char *svcName, int orig_request, int preferred_target, *new_owner = svcStatus.rs_owner; free_member_list(allowed_nodes); return 0; + case RG_ENOSERVICE: + /* + * Configuration update pending on remote node? Give it + * a few seconds to sync up. 
rhbz#568126 + * + * Configuration updates are synchronized in later releases + * of rgmanager; this should not be needed. + */ + if (retries++ < 4) { + sleep(3); + goto retry; + } + logt_print(LOG_WARNING, "Member #%d has a different " + "configuration than I do; trying next " + "member.", target); + /* Deliberate */ case RG_EDEPEND: case RG_EFAIL: /* Uh oh - we failed to relocate to this node.
Re: [Cluster-devel] [PATCH] rgmanager: Fix clusvcadm message when run with -F [RHEL6]
ACK On 02/21/2012 07:53 PM, Lon Hohberger wrote: The new_owner was not being correctly set when enabling a service with -F when run without central processing enabled. Resolves: rhbz#727326 Signed-off-by: Lon Hohberger l...@redhat.com --- rgmanager/src/daemons/rg_state.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/rgmanager/src/daemons/rg_state.c b/rgmanager/src/daemons/rg_state.c index 5501b3f..23a4bec 100644 --- a/rgmanager/src/daemons/rg_state.c +++ b/rgmanager/src/daemons/rg_state.c @@ -2293,6 +2293,7 @@ handle_fd_start_req(char *svcName, int request, int *new_owner) switch(ret) { case RG_ESUCCESS: + *new_owner = target; ret = RG_ESUCCESS; goto out; case RG_ERUN: