Re: [Lustre-discuss] Landing and tracking tools improvements
Hi,

I found it useful in the Bugzilla system to be able to 'reply' to a previous comment, with a text box pre-filled in email quoting style. I also miss the ability to very simply generate a link to a specific bug, or even to a comment in a bug, when writing a comment. For instance, when I wrote 'as explained in bug 1 comment 3' in Bugzilla, my comment was processed and 'bug 1 comment 3' was turned into a clickable link. To be usable, this would require the comments to be numbered.

Apart from that, I like the ability to edit or delete my own comments.

Keep up the good work!

Cheers,
Sebastien.

lustre-discuss-boun...@lists.lustre.org wrote on 23/05/2011 19:06:40:

From: Chris Gearing ch...@whamcloud.com
To: lustre-discuss@lists.lustre.org
Date: 23/05/2011 19:06
Subject: [Lustre-discuss] Landing and tracking tools improvements
Sent by: lustre-discuss-boun...@lists.lustre.org

We now have a whole kit of tools [Jira, Gerrit, Jenkins and Maloo] used for tracking, reviewing and testing the code being developed for Lustre. A lot of time has been spent integrating and connecting them appropriately, but as with anything, the key is to continuously look for ways to improve what we have and how it works. So my question is: what do people think of the tools as they stand today, and how can we improve them moving forwards? If people can respond to lustre-discuss then I'll collate the outcome of any discussions and create a Wiki page that can form a plan for improvement.

Please be as descriptive as possible in your replies, and take into account that I and others have no experience of Lustre's past, so if you liked something prior to the current tools you'll need to help me and them understand the details.

Thanks

Chris
---
Chris Gearing
Snr Engineer
Whamcloud, Inc.
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] [Lustre-community] Poor multithreaded I/O performance
[Moved to Lustre-discuss]

> However, if I spawn 8 threads such that all of them write to the same
> file (non-overlapping locations), without explicitly synchronizing the
> writes (i.e. I don't lock the file handle)

How exactly does your multi-threaded application write the data? Are you using pwrite to ensure non-overlapping regions, or are they all just doing unlocked write() operations on the same fd (each just transferring size/8)?

If it divides the file into N pieces, and each thread does pwrite on its piece, then what each OST sees are multiple streams at wide offsets to the same object, which could impact performance. If, on the other hand, the file is written sequentially, where each thread grabs the next piece to be written (with locking normally used for the current_offset value, so you know where each chunk is actually going), then you get a more sequential pattern at the OST.

If the number of threads maps to the number of OSTs (or some modulo, like in your case 6 OSTs per thread), and each thread owns the piece of the file that belongs to an OST, i.e.:

    for (offset = thread_num * 6MB; offset < size; offset += 48MB)
        pwrite(fd, buf, 6MB, offset);

then you've eliminated the need for application locks (assuming the use of pwrite) and ensured each OST object is being written sequentially.

It's quite possible there is some bottleneck on the shared fd. So perhaps the question is not why you aren't scaling with more threads, but why the single file is not able to saturate the client, or why the file bandwidth is not scaling with more OSTs. It is somewhat common for multiple processes (on different nodes) to write non-overlapping regions of the same file; does performance improve if each thread opens its own file descriptor?

Kevin

Wojciech Turek wrote:

Ok, so it looks like you have 64 OSTs in total and your output file is striped across 48 of them.
May I suggest that you limit the number of stripes; let's say a good number to start with would be 8 stripes. For best results, also use the OST pools feature to arrange that each stripe goes to an OST owned by a different OSS.

regards,

Wojciech

On 23 May 2011 23:09, kme...@cs.uh.edu wrote:

Actually, 'lfs check servers' returns 64 entries as well, so I presume the system documentation is out of date. Again, I am sorry the basic information had been incorrect.

- Kshitij

Run lfs getstripe your_output_file and paste the output of that command to the mailing list. A stripe count of 48 is not possible if you have at most 11 OSTs (the max stripe count would be 11). If your striping is correct, the bottleneck could be your client network.

regards,

Wojciech

On 23 May 2011 22:35, kme...@cs.uh.edu wrote:

The stripe count is 48. Just FYI, this is what my application does: a simple I/O test where threads continually write blocks of size 64 KB or 1 MB (decided at compile time) till a large file of, say, 16 GB is created.

Thanks,
Kshitij

What is your stripe count on the file? If your default is 1, you are only writing to one of the OSTs. You can check with the lfs getstripe command; you can set the stripe count bigger, and hopefully your wide-striped file with threaded writes will be faster.

Evan

-----Original Message-----
From: lustre-community-boun...@lists.lustre.org [mailto:lustre-community-boun...@lists.lustre.org] On Behalf Of kme...@cs.uh.edu
Sent: Monday, May 23, 2011 2:28 PM
To: lustre-commun...@lists.lustre.org
Subject: [Lustre-community] Poor multithreaded I/O performance

Hello, I am running a multithreaded application that writes to a common shared file on a Lustre fs, and this is what I see: if I have a single thread in my application, I get a bandwidth of approx. 250 MB/sec.
(11 OSTs, 1 MB stripe size.) However, if I spawn 8 threads such that all of them write to the same file (non-overlapping locations), without explicitly synchronizing the writes (i.e. I don't lock the file handle), I still get the same bandwidth. Now, instead of writing to a shared file, if these threads write to separate files, the bandwidth obtained is approx. 700 MB/sec. I would ideally like my multithreaded application to see similar scaling.

Any ideas why the performance is limited, and any workarounds?

Thank you,
Kshitij
Re: [Lustre-discuss] Poor multithreaded I/O performance
This is what my application does: each thread has its own file descriptor to the file. I use pwrite to ensure non-overlapping regions, as follows:

Thread 0, data_size: 1MB, offset: 0
Thread 1, data_size: 1MB, offset: 1MB
Thread 2, data_size: 1MB, offset: 2MB
Thread 3, data_size: 1MB, offset: 3MB
repeat cycle
Thread 0, data_size: 1MB, offset: 4MB
and so on.

(This happens in parallel; I don't wait for one cycle to end before the next one begins.)

I am going to try the following:

a) Instead of a round-robin distribution of offsets, test with sequential offsets:

Thread 0, data_size: 1MB, offset: 0
Thread 0, data_size: 1MB, offset: 1MB
Thread 0, data_size: 1MB, offset: 2MB
Thread 0, data_size: 1MB, offset: 3MB
Thread 1, data_size: 1MB, offset: 4MB
and so on.

(I am going to keep these as separate pwrite I/O requests instead of merging them or using writev.)

b) Map the threads to the number of OSTs using some modulo, as suggested in the email below.

c) Experiment with a smaller number of OSTs (I currently have 48).

I shall report back with my findings.

Thanks,
Kshitij
Re: [Lustre-discuss] Landing and tracking tools improvements
I have to echo some of the comments of Chris Morrone. The in-line comments in Gerrit do not appear in the expanded view, and need to be found individually in the various patches. Having 3 lines of context plus the comments is enough for most cases, and if not, a URL to the actual comment would be great.

While the linking from Gerrit back to Jira is good (due to the embedded LU-nnn link in the patch summary line), it is not so easy to find which changes in Gerrit are open from the Jira ticket. Sometimes there are multiple changes open for a single bug, either by accident (forgetting the Change-Id) or on purpose. Having a single comment in Jira for each open change would be good.

I also agree that allowing an entire change to be visible on a single page would be helpful. I used to pre-load a bunch of patches from Bugzilla into my browser before a flight, but with Gerrit that isn't really practical due to the number of tabs it would create.

That said, I like that Gerrit is a gatekeeper and ensures that what is inspected is also what is tested and landed, even if it means that patches sometimes have to go through multiple review cycles for trivial changes.

Being able to compare patches against previous versions in Gerrit speeds up the process of reviewing new versions of a change, but it is complicated if the base version of the patch is changed. At that point it will also show what has changed between the base versions as if it were part of the patch, which is confusing. It would be better to limit the output to only the code that was modified in the two changes.

Cheers,
Andreas
Re: [Lustre-discuss] SLES 11 SP1 Client rpms built but not working
Peter,

Sorry for the late response. I don't know if this will help you or not, but below are the commands I ran to build the Lustre client rpms on one of our SLES systems:

nautilus:~ # cat /etc/SuSE-release
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 1

nautilus:~ # uname -a
Linux nautilus 2.6.32.29-0.3.1.2687.3.PTF.607050.iommu-default #1 SMP 2011-02-25 13:36:59 +0100 x86_64 x86_64 x86_64 GNU/Linux

nautilus:~ # cd /usr/src/linux-2.6.32.29-0.3.1.2687.3.PTF.607050.iommu

nautilus:/usr/src/linux-2.6.32.29-0.3.1.2687.3.PTF.607050.iommu # make cloneconfig
Cloning configuration file /proc/config.gz ...

nautilus:/usr/src/linux-2.6.32.29-0.3.1.2687.3.PTF.607050.iommu # make prepare
scripts/kconfig/conf -s arch/x86/Kconfig
  CHK     include/linux/version.h
  UPD     include/linux/version.h

nautilus:/usr/src/linux-2.6.32.29-0.3.1.2687.3.PTF.607050.iommu # make scripts
  HOSTCC  scripts/genksyms/genksyms.o
  SHIPPED scripts/genksyms/lex.c

nautilus:/usr/src/linux-2.6.32.29-0.3.1.2687.3.PTF.607050.iommu # cd /root/lustre-1.8.5

nautilus:~/lustre-1.8.5 # ./configure --disable-server \
    --with-linux=/usr/src/linux-2.6.32.29-0.3.1.2687.3.PTF.607050.iommu \
    --with-linux-obj=/usr/src/linux-2.6.32.29-0.3.1.2687.3.PTF.607050.iommu-obj/x86_64/default \
    --with-linux-config=/boot/config-2.6.32.29-0.3.1.2687.3.PTF.607050.iommu-default
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu

nautilus:~/lustre-1.8.5 # make rpms

--
Rick Mohr
HPC Systems Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu/
Re: [Lustre-discuss] Client Kernel panic - not syncing. Lustre 1.8.5
Stick with 1.6.6, it's a great release! BTW, why did you decide to upgrade to 1.8.x? Is there a feature you are looking for?

On Fri, May 20, 2011 at 2:48 PM, Aaron Everett aever...@forteds.com wrote:

Thanks for the tip. I've already updated with the LU-286 patch, but I'll build new rpms with both patches and roll that out too. Since updating with the LU-286 patch, Lustre has been running cleanly. Thanks for the support and the work!

Aaron

On Fri, May 20, 2011 at 4:40 AM, Johann Lombardi joh...@whamcloud.com wrote:

On Thu, May 19, 2011 at 01:57:33PM -0400, Aaron Everett wrote:

> Sorry for the noise. I cleaned everything up, untarred a fresh copy of

np. BTW, while you are patching the Lustre client, you might also want to apply the following patch http://review.whamcloud.com/#change,457 which fixes a memory leak in the same part of the code.

Johann

--
Johann Lombardi
Whamcloud, Inc.
www.whamcloud.com
Re: [Lustre-discuss] Lustre HA Experiences
What was your conclusion? What is a good HA solution with Lustre? I am hoping SNS will be a big push in the next year.

On Wed, May 4, 2011 at 5:16 PM, Jason Rappleye jason.rappl...@nasa.gov wrote:

On May 4, 2011, at 10:05 AM, Charles Taylor wrote:

We are dipping our toes into the waters of Lustre HA using Pacemaker. We have 16 7.2 TB OSTs across 4 OSSs (4 OSTs each). The four OSSs are broken out into two dual-active pairs running Lustre 1.8.5. Mostly, the water is fine, but we've encountered a few surprises.

1. An 8-client iozone write test in which we write 64 files of 1.7 TB each seems to go well - until the end, at which point iozone seems to finish successfully and begins its cleanup. That is to say, it starts to remove all 64 large files. At this point, the ll_ost threads go bananas - consuming all available CPU cycles on all 8 cores of each server. This seems to block the corosync totem exchange long enough to initiate a STONITH request.

Running oprofile or profile.pl (possibly only included in SGI's respin of perfsuite; the original is at http://perfsuite.ncsa.illinois.edu/) is useful in situations where you have one or more threads consuming a lot of CPU. It should point to what function(s) the offending thread(s) are spending time in. From there, bugzilla/jira or the mailing list should be able to help further.

2. We have found that re-mounting the OSTs, either via the HA agent or manually, often can take a *very* long time - on the order of four or five minutes. We have not figured out why yet. An strace of the mount process has not yielded much. The mount seems to just be waiting for something, but we can't tell what.

Could be bz 18456.
Jason

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035