Re: [Lustre-discuss] clients gets EINTR from time to time
Dear list,

I am still investigating this issue, and I am now struggling with debugging. The issue arose once more yesterday, so I started to look at it more deeply and decided that the trace debug should be written to disk using debug_daemon. Alas, with only the trace debug flag active, the clients produce more than 100 MB/s of log (yes, these are busy clients). I've tried several strategies, such as running debug_kernel from a cron job or while watching my product's error log, but even then dk would dump 70 MB of data representing less than one second of debug log. So my chances of tracing the signal seem low.

Is there any debug flag that is less verbose but would still include the signal I'm looking for? Given John's answer, could I maybe use /proc/sys/lustre/dump_on_timeout to dump the log only when a timeout happens? That will work only if my problem matches what John can reproduce.

Please also note that I've looked around for abnormal threads_started numbers: everywhere it is at the same value as threads_min, except for one mdt entry which is at threads_min+1.

Regards,
François Chassaing
Technical Director - CTO
Weborama

----- Original Message -----
From: John Hammond jhamm...@tacc.utexas.edu
To: Andreas Dilger adil...@whamcloud.com
Cc: lustre-discuss@lists.lustre.org
Sent: Friday, 25 February 2011 21:16:36 GMT +01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna
Subject: Re: [Lustre-discuss] clients gets EINTR from time to time

On 02/25/2011 11:39 AM, Andreas Dilger wrote:
> On 2011-02-25, at 6:28, Brian J. Murrell br...@whamcloud.com wrote:
>> On 11-02-25 06:18 AM, Francois wrote:
>>> I continue to parse debug logs and keep them posted.
>>
>> I don't understand why you don't just fix your application to handle a perfectly valid and expected condition (that it's currently not handling) instead of wasting time trying to find the cause of the expected condition. Even if you find it, it's likely not a bug and not something that can/will be fixed.
>>
>> It's your application that needs to be fixed.
>
> In all fairness Brian, it isn't always possible to fix an application like you suggest. It might be commercial (binary only), it might be complex code using 3rd party libraries to do the IO that would lose support if modified, etc.
>
> I think the first action to debug this is to run on the client with "lctl set_param debug=+trace" (or "=~0"), which will enable function entry/exit tracing in Lustre. Then, when the problem is hit, run "lctl dk /tmp/debug" to dump the Lustre debug log, and search for -4 (which is -EINTR) to see where this error first appears. At that point we can determine where the source of the error is, and whether it is Lustre's fault.
>
> I know at one time there was a related problem in the l_wait_event() macro that was improperly masking signals, but I thought it was fixed by 1.8.5.

Setting aside the moral question of which calls should be interruptible, I think that the handling of the LUSTRE_FATAL_SIGS (defined in lustre_lib.h to be SIGKILL, SIGINT, SIGTERM, SIGQUIT, SIGALRM) is slightly broken. Under certain situations, Lustre will return -EINTR although no signals were delivered. That's probably not the end of the world for most applications, but OTOH I don't think anybody assumes that -EINTR will be delivered spuriously.

Consider the following sequence:

1) Process P has a Lustre file F open.
2) P has SIGALRM pending (but blocked).
3) P starts writing to F and ends up sleeping in (something like): sys_write() ... ll_extent_lock() ... osc_enqueue() ... ptlrpc_queue_wait().
4) The OST does not respond to the request before the deadline, so l_wait_event() replaces the signal mask of P with the LUSTRE_FATAL_SIGS, notices that SIGALRM is now deliverable, restores the signal mask of P, and ptlrpc_queue_wait() returns -EINTR.
5) P is exiting from sys_write(); SIGALRM is blocked (but still pending), so it doesn't get delivered.
6) P spuriously returns -EINTR from sys_write().

I can reproduce this on 1.8.5/RHEL 5.5.

If the goal is to emulate NFS's interruptibility during congestion, then returning -ERESTARTSYS would be more appropriate. Also, it might be worthwhile to make this extra interruptibility a mount flag, as NFS does.

Best,
John

--
John L. Hammond, Ph.D.
TACC, The University of Texas at Austin
jhamm...@tacc.utexas.edu

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
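For readers who are able to change their application, the "handle EINTR" approach Brian describes can be sketched as a retry loop around the write call. This is only an illustration, not code from the thread: the names write_all, max_retries and write_fn are made up for the example, and it assumes that retrying an interrupted write is safe for the application. Note also that Python 3.5 and later already retry most interrupted system calls on their own (PEP 475), so the loop mainly shows the pattern that a C program would wrap around write(2).

```python
import errno
import os


def write_all(fd, data, max_retries=5, write_fn=os.write):
    """Write all of `data` to `fd`, retrying writes interrupted by EINTR.

    `write_fn` is a hypothetical hook so the loop can be exercised without
    a real interrupted system call; by default it is os.write.
    """
    view = memoryview(data)
    retries = 0
    while len(view) > 0:
        try:
            written = write_fn(fd, view)
        except InterruptedError:
            # EINTR: nothing was written by this call, so try again,
            # but give up after too many consecutive interruptions.
            retries += 1
            if retries > max_retries:
                raise
            continue
        retries = 0
        # Short writes are legal; advance past what was accepted.
        view = view[written:]
    return len(data)


if __name__ == "__main__":
    read_end, write_end = os.pipe()
    write_all(write_end, b"hello lustre\n")
    print(os.read(read_end, 64))
    os.close(read_end)
    os.close(write_end)
```

Bounding the number of consecutive retries is a design choice for the sketch: it avoids spinning forever if the call keeps being interrupted, at the cost of still surfacing the error to the caller in that case.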
Re: [Lustre-discuss] OST problem
Hi Larry,

Thank you for your answer, but I do not have the chance to use InfiniBand. This description also starts with formatting the filesystem. I don't want to format the node that is already in use; I would like to extend it online.

Is it possible to achieve a full sync of the data between the two nodes (the existing one, and the new one on the new server that is about to be attached)? The old node is filled to 50% of its capacity; the new node is completely empty.

So the question is: how do I add a failnode to an online system, and how do I get the data in sync?

Hope someone can help.

Thank you,
Lucius

----- Original Message -----
From: Larry
Sent: Tuesday, March 01, 2011 6:25 AM
To: Lucius
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] OST problem

> Hi Lucius,
>
> Lustre manual chapter 15 tells you how to do it.
>
> On Tue, Mar 1, 2011 at 1:05 PM, Lucius lucius...@hotmail.com wrote:
>> Hello everyone,
>>
>> I would like to extend an OSS which is currently in use. I would like to extend it with a server which has exactly the same HW configuration, and I'd like to extend it in an active/active mode.
>>
>> I couldn't find any documentation about this, as most of the examples show how to use failnode during formatting. However, I need to extend the currently working system without losing data. Also, the tunefs.lustre examples show only the parameter configuration, but they won't tell if you need to synchronize the file system before setting the
>>
>> How would the system know which OST mirrors should run on a given server identified by its unique IP?
>>
>> Thank you in advance,
>> Viktor
Re: [Lustre-discuss] Lustre 2.1 Release
This is really great, and thanks for keeping this open! For aspiring software engineers at my school, it would be valuable to dial in to the calls and hear professionals speak.

On Fri, Feb 25, 2011 at 8:18 AM, Diego Moreno diego.moreno-laz...@bull.net wrote:
> Hi Peter,
>
> That's great news! It's really interesting to know about the Lustre 2.1 release with all the community involved in it.
>
> I imagine it's still too soon, but is there any roadmap or any date for the end of development and testing? Is there a feature list, or is it still to be defined on the mailing list?
>
> Regards,
> Diego
>
> On 24/02/2011 23:23, Peter Jones wrote:
>> Hi there
>>
>> There has been much discussion within the Lustre community about the future of the Lustre 2.x codeline, with the following outcome.
>>
>> Roles
>> - I have taken on the role of Release Manager for the Lustre 2.1 release, and Oleg Drokin (gr...@whamcloud.com) will be the Technical Lead for this release.
>>
>> Issue Tracking
>> - Issues relating to this release will be tracked in Whamcloud's JIRA system - http://jira.whamcloud.com . Signup is open and free.
>> - To see the present list of blockers, please use the filter "Lustre 2.1 Blockers". This can be conveniently accessed by selecting "Manage Filters" and then "Popular".
>>
>> Source Control
>> - The release will be made from the code in Whamcloud's git instance - http://git.whamcloud.com/
>> - Patches contributed by engineers from third party organizations will be handled under an arrangement similar to the kernel's (see http://wiki.whamcloud.com/display/PUB/Submitting+Changes for details). The outcome will be that no single organization will own the copyright to this release.
>>
>> Testing
>> - The latest build can be downloaded from http://build.whamcloud.com/
>> - Testing results from both Whamcloud and third party organizations will be stored in Maloo, the Whamcloud test database - http://maloo.whamcloud.com/. See http://wiki.whamcloud.com/display/PUB/Using+Maloo for details on how to use Maloo either to view progress or to upload your own testing results.
>>
>> Weekly Call
>> - A weekly status call will take place Tuesday at 1:30pm PT. This call is open to any interested parties. 866-914-3976 534986#
>>
>> This was considered the most expedient plan for the Lustre 2.1 release, but a different approach may be taken for ongoing Lustre 2.x releases. This is still under consideration within the Lustre community.
>>
>> The Lustre community organizations - EOFS, HPCFS, and OpenSFS - have all expressed support for these plans, and we look forward to collaborating with the community for this release.
>>
>> A Lustre 2.1 Google group has been set up as a forum to discuss this release - http://groups.google.com/group/lustre-21. Please feel free to sign up for this mailing list whether you are interested in collaborating on this release or just observing the progress.
>>
>> Regards
>> Peter
>>
>> NB: Lustre is a trademark of the Oracle Corporation
>>
>> --
>> Peter Jones
>> Whamcloud, Inc.
>> www.whamcloud.com
>
> --
> Diego Moreno
> http://www.bull-world.com/
Re: [Lustre-discuss] Update of PDSI filesystem stats data
I hope one of the new features you are going to implement is SNS :-)

On Wed, Feb 23, 2011 at 8:37 PM, Andreas Dilger adil...@whamcloud.com wrote:
> When looking at how to implement features for Lustre (which I'm doing a lot of recently :-) I sometimes consult the PDSI filesystem statistics data at http://www.pdsi-scidac.org/fsstats/ in order to see how these large filesystems are used in real life. Information like the length of filenames, how many files have hard links, the age of files in the filesystem, etc. is useful in deciding where to optimize the implementation.
>
> Unfortunately, the filesystem surveys there are starting to get a bit dated (the most recent one is almost 3 years old, and the largest filesystems are only ~300TB in size). I want to solicit the Lustre user community to contribute some updated statistics, and have confirmed with Garth Gibson (leader of the PDSI workshops and maintainer of that site) that it is still worthwhile to send updated statistics using the http://www.pdsi-scidac.org/fsstats/questionnaire.html form.
>
> Garth will look at getting some grad students to compile the submitted data, and is particularly interested if anyone has updated data for any filesystem they previously submitted (PNNL, PSC). In the meantime, I would also appreciate an email with the results.
>
> Thanks in advance for any contributions.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Engineer
> Whamcloud, Inc.
Re: [Lustre-discuss] OST problem
Hi Lucius,

I am not exactly sure what you are trying to do here. Do you have two OSS servers: an old one currently in production (OST 50% full), and a new one (OST empty) that you want to attach to your Lustre filesystem? Do you want to create a mirror of the old OST on the new OST?

Adding a new failnode in Lustre is simple and can be done while the filesystem is running. For that you use tunefs.lustre or the lctl API; examples are in the Lustre manual. If you would like to change the current failnode configuration, for example change the IP address or network type, you will need to stop the Lustre filesystem and amend the configuration while Lustre is stopped; please see the examples in the manual.

Cheers
Wojciech
Re: [Lustre-discuss] mixed oss/ost performance question
On 2011-03-04, at 3:14 PM, Samuel Aparicio wrote:
> I have a general question about mixing OSTs with slower or faster backing storage. We have a fair number of slower legacy disk pools and a bunch of newer, faster ones. The fast and slow pools are aggregated separately to provide OST storage targets with a uniform speed characteristic (slow or fast). My question is whether it would be better to make two separate filesystems (say lustre1 and lustre2) with the slow and fast OSTs respectively, or whether it is reasonable to have these all under one filesystem.

Today it depends on the sophistication of your users. It is possible to split different storage classes with OST pools (see the commands "lctl pool_add" and "lfs setstripe -p"), but these are optional separations today. If users don't specify any pool, then the default is to use all OSTs (mixing fast and slow storage). You CAN specify default pools on a per-directory basis, but this only applies to newly-created files.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
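As a rough illustration of the pool setup Andreas mentions, the sketch below only builds the command lines one might run; it does not execute anything. The filesystem name lustre1, the pool name fast, the OST names and the directory are all hypothetical, and the helper pool_commands is invented for this example. It also assumes that the pool is first created with "lctl pool_new" (a command not named in the message above) and that the lctl pool commands are issued on the MGS while "lfs setstripe -p" is run on a client; check the Lustre manual for your version before relying on any of this.

```python
import shlex


def pool_commands(fsname, pool, osts, directory):
    """Build (but do not run) the commands that define an OST pool and
    make it the default pool for new files created under `directory`.

    All argument values are supplied by the caller; nothing here talks
    to a real Lustre filesystem.
    """
    pool_id = f"{fsname}.{pool}"
    # Assumed first step: create the empty pool.
    commands = [["lctl", "pool_new", pool_id]]
    # Add each OST to the pool, one command per OST for readability.
    commands += [["lctl", "pool_add", pool_id, ost] for ost in osts]
    # Set the directory default; affects newly created files only.
    commands.append(["lfs", "setstripe", "-p", pool, directory])
    return commands


if __name__ == "__main__":
    # Hypothetical names, chosen only for the example.
    for argv in pool_commands(
        "lustre1",
        "fast",
        ["lustre1-OST0000", "lustre1-OST0001"],
        "/mnt/lustre1/fast_dir",
    ):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Keeping the fast and slow OSTs in one filesystem with two pools, as opposed to two filesystems, matches the trade-off described above: files created outside a pool-assigned directory will still be spread across all OSTs.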
Re: [Lustre-discuss] OST problem
For clarity, Lustre does not replicate data. If you add an OST, it is unique. If you wish to do failover, this requires shared storage between two nodes. We do not replicate storage. If you wish to increase the size of your filesystem, you can add OSTs.

cliffw

--
cliffw
Support Guy
WhamCloud, Inc.
www.whamcloud.com