[Beowulf] ADMIN: Beowulf mailing list outage - now resolved
Hi all,

Unfortunately we had an unexpected outage of the Beowulf mailing list from last weekend through to today. As you may know, until a few years ago this list was run by Don Becker at Penguin Computing; I took it over after he'd left and it had been running on autopilot. It now runs on a VM I provide, and Penguin had kindly delegated the DNS to the hosting company I use, but it appears there was confusion over who owned the domain and its registration lapsed (I suspect over the weekend). I realised on Wednesday and, after some frantic emailing of people at both Penguin and in the community (thanks Doug, Lara!), contact was re-established. Penguin have kindly renewed the domain for 5 years and we're back in business.

Apologies for this! I should have realised earlier that I wasn't getting the usual spam that gets sent to the admin address, but I was out at the Slurm Users Group. :-D

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
[Beowulf] xCAT closing up shop
Hi all,

Sad to hear that the xCAT developers are moving on and having to call it a day:

https://sourceforge.net/p/xcat/mailman/xcat-user/thread/MW4PR15MB51826CAD47B7E44D0D808F01F7E4A%40MW4PR15MB5182.namprd15.prod.outlook.com/#msg37890495

> Mark Gurevich, Peter Wong, and I have been the primary xCAT maintainers
> for the past few years. This year, we have moved on to new roles
> unrelated to xCAT and can no longer continue to support the project.
> As a result, we plan to archive the project on December 1, 2023.
> xCAT 2.16.5, released on March 7, 2023, is our final planned release.
>
> We would consider transitioning responsibility for the project to a new
> group of maintainers if members of the xCAT community can develop a
> viable proposal for future maintenance.
>
> Thank you all for your support of the project over the past 20+ years.

I first came across xCAT in the mid-2000s, when IBM brought Egan Ford to Melbourne to talk to folks from VPAC (where I was) and Monash Uni about it, in light of a cluster that Monash had bought and we were going to help run.

The first system I brought up with it from scratch was at the start of 2010, when I'd moved to VLSCI and was bringing up our first machine, an SGI Altix XE cluster. I was very pleasantly surprised to find that it didn't really care that it wasn't IBM hardware. :-)

We used it on the rest of our systems from then on, both for the HPC systems (IBM iDataPlexes, and for deploying all the LPARs needed to run and use our BlueGene/Q) as well as the infrastructure side (GPFS NSD servers and TSM servers for backup and HSM use). I do seem to remember it took a bit of persuading to get the statelite configs to work with the iDataPlex nodes that had Knights Corner Xeon Phi cards in them, but because it was open source and written in Perl we got it to work.

It'll be interesting to see if others do step up to keep it going; I know there have been some noises on the ACM SIGHPC SysPro Slack from folks who seem interested.
All the best,
Chris
Re: [Beowulf] Your thoughts on the latest RHEL drama?
On 26/6/23 11:38, Joe Landman wrote:

> This was likely aimed at the other folks like Oracle who are making
> money off of rebuilds and not so much at Alma/Rocky. Those are
> collateral damage.

From memory (insert sirens, klaxons and other warning sounds here) Oracle was the target for Red Hat's obfuscating of their kernel sources, to make it harder for them to do their kernel variants. It would be horribly ironic if this move pushed more people towards using OL, given Oracle seem to give that away for free. :-/

I've not had to use RHEL since leaving Australia for the US, but my experience with their support was pretty poor up to that point. They had a nasty habit of breaking Mellanox drivers and not being able to fix them for extended periods of time (we had one in RHEL5 that was still unresolved when we went to RHEL6, and one in RHEL6 that would crash our PPC64 BG/Q management node in RHEL 6.2 through 6.4 before it got fixed - we were stuck on a RHEL 6.1 kernel until they finally fixed it). Hopefully things have improved since then.

All the best,
Chris
Re: [Beowulf] Understanding environments and libraries caching on a beowulf cluster
On 28/6/22 11:44 am, leo camilo wrote:

> My time indeed has a cost, hence I will favour a "cheap and dirty"
> solution to get the ball rolling and try something fancy later.

One thing I'd add is that some sort of cluster management system can be very handy to let you manage things as a whole. I've never used Qlustar, which Tony mentioned, but it does look interesting from a quick scan of the website.

I'm also a big fan of booting nodes from a standard image as a ramdisk and then mounting the filesystems you need, containing apps and user files, from some sort of shared storage. The benefit of using standard images is that it's very easy to keep everything in step; you don't get gradual configuration drift as changes are made to some nodes and not others (perhaps one was down for some hardware work at one point and so a change couldn't be applied, etc).

Best of luck!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[Beowulf] Administrivia: disabling of monthly password reminders
Hi all,

We've had some issues with a provider mistakenly marking Mailman password reminders as spam and (from the limited info I can glean) also causing us to get marked as being of poor reputation for a while (though the last case appeared to clear up relatively quickly). Because of this I've taken the liberty of disabling the monthly password reminders for the list itself. You should still be able to request one via the web interface should you need it.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [Beowulf] List archives
Hi John,

On Monday, 16 August 2021 12:57:20 AM PDT John Hearns wrote:

> The Beowulf list archives seem to end in July 2021.
> I was looking for Doug Eadline's post on limiting AMD power and the results
> on performance.

I just went through the archives for July and compared them with what I have in my inpile, and as far as I can tell there's nothing missing. There was a thread from June with the subject "AMD and AVX512", perhaps that's what you're thinking of?

https://www.beowulf.org/pipermail/beowulf/2021-June/thread.html

Your email from today & my earlier reply are in the archives for August.

All the best!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[Beowulf] Administrivia: fixed up issue with some people being unable to email beowulf.org
Hi all,

I had two separate people contact me today, via mutual friends, about problems either contacting the list owner address here when trying to subscribe, or sending to the list.

I went digging and found that I'd missed creating a directory for the greylist software during the transition from the old system to the new VM, which meant that some folks were getting temporary failures back, blocking their email until it eventually bounced (as the greylist software was unable to create its database). Most subscribers were not affected, as I'd modified the code to check whether the address was subscribed to the list already and bypass the greylist if so, but there appear to have been some edge cases it wasn't catching.

I believe I've fixed this now; apologies to those affected!

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[Beowulf] RIP CentOS 8
Hi folks,

It looks like the CentOS project has announced the end of CentOS 8 as a version that tracks RHEL, effective at the end of 2021; it will be replaced by CentOS Stream, which will run ahead of RHEL 8. CentOS 7 is unaffected (though RHEL 7 only has 3 more years of life left).

https://blog.centos.org/2020/12/future-is-centos-stream/

> The future of the CentOS Project is CentOS Stream, and over the
> next year we'll be shifting focus from CentOS Linux, the rebuild
> of Red Hat Enterprise Linux (RHEL), to CentOS Stream, which
> tracks just ahead of a current RHEL release. CentOS Linux 8, as
> a rebuild of RHEL 8, will end at the end of 2021. CentOS Stream
> continues after that date, serving as the upstream (development)
> branch of Red Hat Enterprise Linux.
>
> Meanwhile, we understand many of you are deeply invested in
> CentOS Linux 7, and we'll continue to produce that version through
> the remainder of the RHEL 7 life cycle.

I always thought that Fedora was meant to be that upstream for RHEL, but perhaps the arrangement now will be Fedora -> CentOS -> RHEL.

I wonder where this leaves the Lustre project - currently they only support RHEL7/CentOS7 on the server side - and, more interestingly, people who build Lustre appliances on top of CentOS.

Then there's the question of projects like OpenHPC, who've only just announced support for CentOS 8 (and OpenSuSE 15). They could choose to track CentOS Stream instead, probably without too much effort.

I do wonder if this opens the door for the return of something like Scientific Linux.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [Beowulf] pdsh
Hi Jim,

On Sunday, 29 November 2020 2:31:18 PM PST Lux, Jim (US 7140) via Beowulf wrote:

> Today,
> https://code.google.com/archive/p/pdsh/
> is where to go.

I think code.google.com is a read-only archive now, it stopped in 2016 (see https://killedbygoogle.com/ for more info). It looks like pdsh is now on GitHub here:

https://github.com/chaos/pdsh

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [Beowulf] RoCE vs. InfiniBand
On Thursday, 26 November 2020 3:14:05 AM PST Jörg Saßmannshausen wrote:

> Now, traditionally I would say that we are going for InfiniBand. However,
> for reasons I don't want to go into right now, our existing file storage
> (Lustre) will be in a different location. Thus, we decided to go for RoCE
> for the file storage and InfiniBand for the HPC applications.

I think John hinted at this, but is there a reason for not going with IB for the cluster and then using LNet routers to connect out to the Lustre storage via Ethernet (with RoCE)?

https://wiki.lustre.org/LNet_Router_Config_Guide

We use LNet routers on our Cray system to bridge between the Aries interconnect inside the XC and the IB fabric our Lustre storage sits on.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
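[For illustration - this wasn't in the original post - an LNet routing setup of this shape is normally expressed via the lnet module options, per the guide linked above. The interface names and addresses below are hypothetical, just to show the overall form:]

```
# /etc/modprobe.d/lustre.conf on a compute node on the IB (o2ib) fabric:
# reach the Ethernet/RoCE "tcp0" network via an LNet router at 10.0.0.1@o2ib0
options lnet networks="o2ib0(ib0)" routes="tcp0 10.0.0.1@o2ib0"

# and on the LNet router itself, which has a foot on both fabrics:
options lnet networks="o2ib0(ib0),tcp0(eth0)" forwarding="enabled"
```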
[Beowulf] Administrivia: Beowulf list moved to new server
Hi all,

Today I've moved the Beowulf VM across to its new home at RimuHosting (an NZ-based hosting company, using their Dallas DC). I'm hoping it should be transparent to everyone (though it'll hopefully perform a little better, as it's got 3 times the memory).

A number of you have asked about contributing to costs. It's very kind of you all, but the monthly cost is less than I'd spend on coffee in a week were I not working from home, so please don't worry. If that ever gets to be a problem then I think I'll have more to worry about than the list. :-)

Please do let me know if you come across problems! A message directly to me would be best, rather than spamming the list.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [Beowulf] Administrivia: update on the beowulf list
On 21/11/20 12:27 pm, Chris Samuel wrote:

> As part of that I'll be doing some upgrades on the current VM over the
> weekend to bring it up to the same Debian version as on the new VM to
> ease the transition so there may be some disruption and downtime to the
> list & associated webserver, I'll try and keep that to a minimum.

This work is done and we're now on Debian Buster, the current release. I'm hoping to transition the VM over to the new hosting service tomorrow, and have reduced the TTL on our DNS records so we can cut over quickly.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[Beowulf] Administrivia: update on the beowulf list
Hi all,

A quick update on where we are with the list. I've had very useful discussions with my contact at Penguin about the DNS issue, and it seems that it's not resolvable, so we've agreed that I'll be moving the list to a VM I will provide at my current hosting provider, where I will have DNS control so this doesn't happen again.

As part of that I'll be doing some upgrades on the current VM over the weekend, to bring it up to the same Debian version as on the new VM and ease the transition. There may be some disruption and downtime to the list & associated webserver; I'll try and keep that to a minimum.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [Beowulf] Administrivia: chasing down DNS problems for the beowulf list
Hi Jonathan,

On 11/8/20 9:47 pm, Jonathan Aquilina via Beowulf wrote:

> I am wondering if there is a way we can have a backup dns provider in
> the sense if there are issues like this dns resolution can be done
> through another provider?

The problem is in the authoritative server for the reverse lookup zone, and so the existing backup (secondary) DNS server just has the same incorrect info. I'll update once it's fixed (Penguin are the only ones who can).

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [Beowulf] Administrivia: chasing down DNS problems for the beowulf list
Hi Darren,

On 11/8/20 2:09 pm, Darren Wise wrote:

> No worries at all, have you dug with dig and compared to the server
> records. Hate sticking my nose in but could be why they will fix
> tomorrow while awaiting for said wherever hosting provider is to have a
> record refresh.

I'm not sure; it might just be down to the availability of the folks at Penguin who admin their DNS. As this hosting is their gift to the list (thank you!) I don't think we can expect 24x7 support. :-)

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [Beowulf] Administrivia: chasing down DNS problems for the beowulf list
Hi John,

On 11/8/20 3:40 pm, Jonathan Engwall wrote:

> I see nothing wrong with the website.

Nothing to do with the website - it's a DNS issue. The PTR record for the IP address is missing from the in-addr.arpa domain. Penguin will hopefully fix this tomorrow.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [Beowulf] Administrivia: chasing down DNS problems for the beowulf list
On 11/8/20 1:06 pm, Chris Samuel wrote:

> I've had no response back from them sadly, and we've started shedding a
> fair number of subscribers from the list because of this issue. I've
> sent a query on to their postmaster to see if they can help establish
> contact.

Had a response already (my contact has been crazy busy, which I can sympathise with a lot!); the issue will get looked at tomorrow.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [Beowulf] Administrivia: chasing down DNS problems for the beowulf list
On 10/18/20 10:07 am, Chris Samuel wrote:

> Just a quick heads up that some folks will be having issues receiving
> email from the list as beowulf.org seems to have lost its reverse DNS
> entry and many subscribers' email systems won't accept email from sites
> without that. I've just emailed the person I had contact with last at
> Penguin to see if this can be resolved (ahem), in the meantime I've
> disabled Mailman's automatic processing of bounce messages so we don't
> lose people because of this issue.

I've had no response back from them sadly, and we've started shedding a fair number of subscribers from the list because of this issue. I've sent a query on to their postmaster to see if they can help establish contact.

I'm really sorry about this! :-(

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[Beowulf] Administrivia: chasing down DNS problems for the beowulf list
Hi all,

Just a quick heads up that some folks will be having issues receiving email from the list, as beowulf.org seems to have lost its reverse DNS entry and many subscribers' email systems won't accept email from sites without that. I've just emailed the person I had contact with last at Penguin to see if this can be resolved (ahem); in the meantime I've disabled Mailman's automatic processing of bounce messages so we don't lose people because of this issue.

chris@quad:~$ host beowulf.org
beowulf.org has address 12.53.5.80
beowulf.org mail is handled by 10 beowulf.org.

chris@quad:~$ host 12.53.5.80
Host 80.5.53.12.in-addr.arpa. not found: 3(NXDOMAIN)

Looking at my home email inpile I can see this started happening a little while ago, but work and taxes have occupied all my time so I've only just noticed. :-(

Apologies for this!

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [Beowulf] experience with HPC running on OpenStack
On 29/6/20 5:09 pm, Jörg Saßmannshausen wrote:

> we are currently planning a new cluster and this time around the idea
> was to use OpenStack for the HPC part of the cluster as well. I was
> wondering if somebody has some first hand experiences on the list here.

At $JOB-2 I helped a group set up a cluster on OpenStack (they were resource constrained: they had access to OpenStack nodes and that was it). In my experience it was just another added layer of complexity for no added benefit, and it resulted in a number of outages due to failures in the OpenStack layers underneath.

Given that Slurm, which was already being used there, had mature cgroups support, there really was no advantage to them in having a layer of virtualisation on top of the hardware - especially as (if I'm remembering properly) in the early days the virtualisation layer didn't properly understand the Intel CPUs we had, and so didn't reflect the correct capabilities to the VM.

All that said, these days it's likely improved, and I know people were then thinking about OpenStack "Ironic", which was a way for it to manage bare metal nodes. But I do know the folks in question eventually managed to go to a purely bare metal solution and seemed a lot happier for it.

As for IB, I suspect that depends on the capabilities of your virtualisation layer, but I do believe it is quite possible. This cluster didn't have IB (when they started getting bare metal nodes they went RoCE instead).

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [Beowulf] Neocortex unreal supercomputer
On 13/6/20 10:11 pm, Jonathan Engwall wrote:

> There is the strange part. How to utilize such a vast cpu? Storage
> should be the back end, unless the use is an api. In this case a
> gargantuan cpu sits in back, or so it seems.

My guess is that this sits connected to the server, they load an algorithm on to it, and they shovel data at it over the vast number of network cards and eventually it comes back with an answer. Hopefully their acceptance test will say "42".

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [Beowulf] Neocortex unreal supercomputer
On 13/6/20 7:58 pm, Fischer, Jeremy wrote:

> It's my understanding that NeoCortex is going to have a petabyte or two
> of NVME disk sitting in front of it with some HPE hardware and then
> it'll utilize the queues and lustre file system on Bridges2 as its
> front end.

There's more information here:

https://www.psc.edu/3206-nsf-funds-neocortex-a-groundbreaking-ai-supercomputer-at-psc-2

# Neocortex will use the HPE Superdome Flex, an extremely powerful,
# user-friendly front-end high-performance computing (HPC) solution
# for the Cerebras CS-1 servers. This will enable flexible pre- and
# post-processing of data flowing in and out of the attached WSEs,
# preventing bottlenecks and taking full advantage of the WSE
# capability. HPE Superdome Flex will be robustly provisioned with
# 24 terabytes of memory, 205 terabytes of high-performance flash
# storage, 32 powerful Intel Xeon CPUs, and 24 network interface
# cards for 1.2 terabits per second of data bandwidth to each
# Cerebras CS-1.

The way it reads, both of these CS-1s will sit behind that single Flex.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [Beowulf] NFS over IPoIB
On Friday, 12 June 2020 2:36:26 PM PDT John McCulloch wrote:

> It is my understanding that setting MTU to 9000 is recommended but that
> seems to be applicable for 10GbE.

That depends on how you're running your IB fabric. In datagram mode (which I think is the default these days) you're limited to an MTU of 2044 bytes, but in connected mode (which used to be the default) you could get (from memory) a 64KB MTU.

Back at VLSCI we ran with connected mode and 64KB MTUs, with GPFS running on IPoIB (this was before you could run GPFS multi-homed on different IB fabrics with RDMA on both, so we just ran it over TCP/IP instead).

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
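[As an illustration of the above - this wasn't in the original post - on Linux the IPoIB mode and MTU show up under sysfs. The interface name ib0 is an assumption; the helper just reports what it finds:]

```shell
# Report the IPoIB mode (datagram/connected) and current MTU for an
# interface, falling back to a message if the interface isn't present.
show_ipoib() {
    ifc="$1"
    if [ -r "/sys/class/net/$ifc/mode" ]; then
        echo "$ifc mode: $(cat "/sys/class/net/$ifc/mode")"
        echo "$ifc mtu: $(cat "/sys/class/net/$ifc/mtu")"
    else
        echo "$ifc: no IPoIB interface found"
    fi
}

show_ipoib ib0
```

[Switching an interface to connected mode and the larger MTU is typically `echo connected > /sys/class/net/ib0/mode` followed by `ip link set ib0 mtu 65520`, 65520 being the usual connected-mode maximum.]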
[Beowulf] Sad news - RIP Rich Brueckner
Hi all,

I've learned via @HPC_Guru on Twitter tonight that Rich Brueckner (the guy in the red hat), who ran InsideHPC and InsideBigData, passed away in Portland, Oregon on Wednesday (20th May).

https://obits.oregonlive.com/obituaries/oregon/obituary.aspx?n=richard-a-brueckner=196230633

I don't think I ever had the pleasure of meeting Rich (though I certainly saw him around SC a lot, busily interviewing people for InsideHPC), but I know many people on the list will likely have done so.

His obituary says "Rich's family asks that donations be made to Multnomah County Animal Services." Two years ago Rich made a fundraising film for them as well:

http://www.oncetherewasagiant.com/

The Multnomah County Animal Services website is here:

https://multcopets.org/

You can make a donation via that site should you wish.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[Beowulf] Reframe (was Re: [External] Re: Intel Cluster Checker)
On 4/30/20 12:14 pm, John Hearns wrote:

> Thanks Chris. I worked in one place which was setting up Reframe. It
> looked to be complicated to get running. Has this changed?

To be honest I'm not sure; another team at NERSC set it up, so I just check out our local git repo and run it with:

reframe.py -c checkout

and it automagically figures out which system it's on and runs the appropriate checkout tests. It used to be more complicated to start, but they spent time configuring that to avoid the need to specify the system name, etc.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [Beowulf] [External] Re: Intel Cluster Checker
On 4/30/20 6:54 am, John Hearns wrote:

> That is a four letter abbreviation...

Ah, you mean an ETLA (Extended TLA).

I've not used ICC, but we do use Reframe (from CSCS) at work for testing, both between maintenances on our test system for changes we're making, and also after the maintenance as a checkout before opening the system back up to users. It's proved very useful.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [Beowulf] Illegal instruction (signal 4)
On 24/3/20 7:55 pm, Jonathan Engwall wrote:

> Building it was not a problem, it install a binary in /usr/local/bin,
> mpich makes a handshake...then I see Illegal instruction (signal 4).

That usually means the application is trying to execute an instruction that's not supported on your CPU. I don't know if the BSDs overload that signal in any way, but I'd be surprised if they did.

I've not touched the *BSDs since the 90s, so I don't think there's much useful advice I could offer other than to try their mailing lists (unless someone here has better ideas). Which BSD are you using?

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
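[A rough way to chase a SIGILL like this down - a sketch, not from the original thread - is to compare the instruction-set extensions the CPU advertises against what the binary was built for. On Linux the flags live in /proc/cpuinfo (the BSDs expose this differently, e.g. via dmesg or sysctl, so adjust accordingly); which flags to look for depends on the build:]

```shell
# List which of a few common x86 instruction-set extensions this CPU
# advertises; running a binary built for a flag the CPU lacks raises SIGILL.
for flag in sse4_2 avx avx2 avx512f; do
    if grep -qw "$flag" /proc/cpuinfo; then
        echo "$flag: supported"
    else
        echo "$flag: not supported"
    fi
done
```

[If, say, the binary was built with `-march=native` on a newer machine and run on an older one, a missing flag here is the likely culprit; rebuilding for the older target usually fixes it.]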
Re: [Beowulf] Have machine, will compute: ESXi or bare metal?
On 9/2/20 10:36 pm, Benson Muite wrote:

> Take a look at the bootable cluster CD here: http://www.littlefe.net/

From what I can see BCCD hasn't been updated for just over 5 years, and the last email on their developer list was Feb 2018, so it's likely a little out of date now.

http://bccd.net/downloads
http://bccd.net/pipermail/bccd-developers/

On the other hand their Trac does list some ticket updates a few months ago, so perhaps there are things going on but Skylar needs more hands?

https://cluster.earlham.edu/trac/bccd-ng/report/1?sort=created=0=1

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [Beowulf] [EXTERNAL] Re: Interactive vs batch, and schedulers
On 16/1/20 9:35 pm, Lux, Jim (US 337K) via Beowulf wrote:

> And I suppose there's no equivalent of "timeslicing" where the cores run
> job A for 99% of the time and job B, C, D, E, F, for 1% of the time.

Slurm has a gang scheduling mode which sounds a little like what you're asking for (though it looks like each job will get an equal slice of time, defined by the "SchedulerTimeSlice" parameter).

https://slurm.schedmd.com/gang_scheduling.html

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
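[For illustration - a sketch based on the Slurm docs linked above, not from the original post - enabling gang scheduling is a slurm.conf change along these lines; the partition name and node list are hypothetical:]

```
# slurm.conf (fragment)
PreemptMode=GANG          # timeslice jobs that share the same resources
SchedulerTimeSlice=30     # seconds each gang runs before being suspended
# allow two jobs to share each resource in this (hypothetical) partition
PartitionName=interactive Nodes=node[01-16] OverSubscribe=FORCE:2
```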
Re: [Beowulf] Interactive vs batch, and schedulers
On 16/1/20 3:24 pm, Lux, Jim (US 337K) via Beowulf wrote:

> What I'm interested in is the idea of jobs that, if spread across many
> nodes (dozens), can complete in seconds (<1 minute), providing
> essentially "interactive" access, in the context of large jobs taking
> days to complete. It's not clear to me that the current schedulers can
> actually do this - rather, they allocate M of N nodes to a particular
> job pulled out of a series of queues, and that job "owns" the nodes
> until it completes. Smaller jobs get run on (M-1) of the N nodes, and
> presumably complete faster, so it works down through the queue quicker,
> but ultimately, if you have a job that would take, say, 10 seconds on
> 1000 nodes, it's going to take 20 minutes on 10 nodes.

But doesn't that depend a lot on what the user asks for, or am I misunderstanding?

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [Beowulf] [EXTERNAL] Re: Is Crowd Computing the Next Big Thing?
On 30/11/19 6:27 pm, Douglas Eadline wrote:

> The most interesting thing I learned was how well some laptops
> functioned for a "users needs" while technically in a state of
> "brokenness". There is a larger lesson there.

This is why I'm a big, big fan of compute nodes booting from a set image each time. We did it at VLSCI with xCAT and its "statelite" target (so we could keep GPFS metadata & other state on an NFS mount from the mgmt node for easy booting) with our SGI and IBM hardware, and it worked really nicely. At least then everything should be identically broken. ;-) (And you only need to fix something in one place.)

We take a similar approach here at NERSC with Cray ansible (convergent evolution). We keep our recipes/definitions/etc in git and reuse them across systems (as much as possible), with config information abstracted out to define personalities for image builds and for boot.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [Beowulf] Is Crowd Computing the Next Big Thing?
On 28/11/19 5:08 am, Dernat Rémy wrote: this works only when the phone is connected via WiFi, meaning that it doesn’t chew up data-plan data ever. This assumes you live in a part of the world where you can have an unmetered Internet connection. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] [EXTERNAL] Re: Is Crowd Computing the Next Big Thing?
On 27/11/19 9:50 am, Lux, Jim (US 337K) via Beowulf wrote: Wasn't there a minor scandal a year or so ago about websites mining bitcoin in the background using user resources? And some phone apps doing the same? Javascript cryptominers are a thing, and Firefox tries to block them automatically. https://blog.mozilla.org/firefox/block-cryptominers-with-firefox/ All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] MS-DOS DOSBOX
Hi John, On Tuesday, 5 November 2019 5:22:14 PM PST Jonathan Engwall wrote: > Yesterday I raised DosBox cpu emulation to nearly 600 megahertz with frame > skipping at 10, and found DosBox still useable. Can this cpu emulation be > verified somehow? I'm curious & have to know - what are you using DosBox for on your cluster? It's not unheard of: over a decade ago, when I was at VPAC in Melbourne, we installed Wine on our clusters so that a group could run the Windows command line code "LatentGold" on our x86 Linux clusters. Apparently worked a treat! All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

Re: [Beowulf] Rsync - checksums
On 30/9/19 5:55 pm, Stu Midgley wrote: That's pretty awesome, are you going to make it available? or push it upstream? If possible it'd be good to try and get it upstream, probably worth posting on the rsync list to get advice. https://lists.samba.org/mailman/listinfo/rsync All the best! Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] Cluster stack based on Ansible. Looking for feedback.
On Thursday, 12 September 2019 2:00:00 PM PDT Oxedions wrote: > Thank you for reading this very long and boring mail. That wasn't boring at all, thanks so much for taking the time for putting it together! It sounds like a fun project, I do hope you get some interest. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] Centos 7 news
On Friday, 30 August 2019 10:27:08 PM PDT Jonathan Engwall wrote: > 1300+ packages marked for update, if this effects you. It looks like CentOS 7.7 has just come out. https://lists.centos.org/pipermail/centos/2019-August/173310.html Heaps of announcements of the new packages here: https://lists.centos.org/pipermail/centos-cr-announce/2019-August/thread.html All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] GPUs Nvidia C2050 w/OpenMP 4.5 in cluster
On Monday, 12 August 2019 9:44:43 AM PDT Tony Travis wrote: > Well, I'm still using my nVidia C2050/75's under Ubuntu-MATE 18.04 LTS: Ubuntu also has a gcc-offload-nvptx package to (apparently) make this work. For instance on my Kubuntu 19.04 desktop here at home: chris@quad:~$ apt show gcc-offload-nvptx Package: gcc-offload-nvptx Version: 4:8.3.0-1ubuntu3 Priority: optional Section: universe/devel Description: GCC offloading compiler to NVPTX This package contains libgomp plugin for offloading to NVidia PTX. The plugin needs libcuda.so.1 shared library that has to be installed separately. . This is a dependency package providing the default GNU Objective-C compiler. I can't test it as I don't have an nvidia GPU. :-) All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] Lustre on google cloud
On Saturday, 27 July 2019 10:07:14 PM PDT Jonathan Aquilina wrote: > What would be the reason for getting such large data sets back on premise? > Why not leave them in the cloud for example in an S3 bucket on amazon or > google data store. Provider independent backup? -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] Lustre on google cloud
On Friday, 26 July 2019 4:46:56 AM PDT John Hearns via Beowulf wrote: > Terabyte scale data movement into or out of the cloud is not scary in 2019. > You can move data into and out of the cloud at basically the line rate of > your internet connection as long as you take a little care in selecting and > tuning your firewalls and inline security devices. Pushing 1TB/day etc. > into the cloud these days is no big deal and that level of volume is now > normal for a ton of different markets and industries. Whilst this is true as Chris points out this does not mean that there won't be data transport costs imposed by the cloud provider (usually for egress). All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] flatpack
On 22/7/19 10:40 pm, Jonathan Aquilina wrote: So in a nut shell this is taking dockerization/ containerization and making it more for the every day Linux user instead of the HPC user? I don't think this goes as far as containers with isolation, as I think that's not what they're trying to do. But it does seem they're thinking along those lines. It would be interesting to have a distro built around such a setup. I think this is targeting cross-distro applications. With all the duplication of libraries, etc, a distro using it would be quite bulky. You may also have a similar security problem to containers, whereby when a vulnerability is found and patched in an application or library you end up with lots of people out there still running the vulnerable version. This is why distros tend to discourage "vendoring" of libraries as that tends to fossilise vulnerabilities into an application whereas if people use the version provided in the distro the maintainers only need to fix it in that one package and everyone who links against it benefits. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] Lustre on google cloud
On 22/7/19 10:31 pm, Jonathan Aquilina wrote: I am sure though that with the GUI side of things through the console I am sure it makes things a lot easier to setup and manage no? You would hope so! Although I've got to say with my limited experience of Lustre when you're running it you pretty quickly end up poking through the entrails of the Linux kernel trying to figure out what's going on when it's not behaving right. :-) All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] flatpack
On 22/7/19 10:26 pm, Jonathan Aquilina wrote: Hi Guys, I think I might be a bit tardy to the party here, but the way you describe flatpack is equivalent to the portable apps on windows is my understanding correct? It seems that way, with an element of sandboxing to try and protect the user who is using these packages. The Debian/Ubuntu package describes it thus: Flatpak installs, manages and runs sandboxed desktop application bundles. Application bundles run partially isolated from the wider system, using containerization techniques such as namespaces to prevent direct access to system resources. Resources from outside the sandbox can be accessed via "portal" services, which are responsible for access control; for example, the Documents portal displays an "Open" dialog outside the sandbox, then allows the application to access only the selected file. . Each application uses a specified "runtime", or set of libraries, which is available as /usr inside its sandbox. This can be used to run application bundles with multiple, potentially incompatible sets of dependencies within the same desktop environment. . This package contains the services and executables needed to install and launch sandboxed applications, and the portal services needed to provide limited access to resources outside the sandbox. There's also more about it here: http://docs.flatpak.org/en/latest/basic-concepts.html The downside (from the HPC point of view) is that these binaries will need to be compiled for a relatively low common denominator of architecture (or with a compiler that can do optimisations selected at runtime depending on the architecture). All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] Lustre on google cloud
On 22/7/19 10:12 pm, Jonathan Aquilina wrote: I am aware of that as I follow their youtube channel. Fair enough, others may not. :-) I think my main query is compared to managing a cluster in house is this the way forward be it AWS or google cloud? I think the answer there is likely "it depends". The reasons may not all be technical either, you may be an organisation from outside the US that cannot allow your data to reside offshore, or be held by a US company subject to US law even if data is not held in the US. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
[Beowulf] IBM alert for GPFS crashes on RHEL 7.6
Hi folks, A heads up for folks running GPFS on RHEL7.6 (and I guess derivatives) from a colleague on the Australian HPC Slack: https://www-01.ibm.com/support/docview.wss?uid=ibm10887213=s033=OCSTXKQY=E_sp=s033-_-OCSTXKQY-_-E "IBM has identified an issue in IBM Spectrum Scale (GPFS) versions that support RHEL7.6 (4.2.3.13 or later and 5.0.2.2 or later), in which a RHEL7.6 node running kernel versions 3.10.0-957.19.1 or higher, including 3.10.0-957.21.2, may encounter a kernel crash while running IO operations." Basically it looks like they've fallen foul of this hardening fix: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=9548906b All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] A careful exploit?
On 11/6/19 8:18 pm, Robert G. Brown wrote: * Are these real hosts, each with their own network interface (wired or wireless), or are these virtual hosts? In addendum to RGB's excellent advice and questions I would add to this question the network engineers maxim of "start at layer 1 and work up". In other words, first check your physical connectivity and then head up the layers. Best of luck! Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
[Beowulf] Cray job in Canberra
Hi all, A friend of mine at Cray in Australia (who I knew before they moved there) let me know that they're looking to recruit someone to work in Canberra. You've got to be an Australian though and willing to get a security clearance. https://www.cray.com/company/careers/job-details?Req_Code=19-0119 All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
[Beowulf] Performance impacts of Zombieload mitigation
Hi folks, There are initial benchmark results on the Zombieload mitigations up on Phoronix; unsurprisingly it looks like context switches & IPC take the brunt of the impact. https://www.phoronix.com/scan.php?page=news_item&px=MDS-Zombieload-Initial-Impact They've expanded their coverage here now (not had time to read yet). https://www.phoronix.com/scan.php?page=article&item=mds-zombieload-mit&num=1 They don't seem to benchmark any CPU intensive code so it'll be interesting to see how this impacts MPI & multithreaded HPC codes. All the best! Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] HPE to acquire Cray
On 17/5/19 12:37 pm, Jonathan Aquilina wrote: That is my biggest fear for centos to be fair with the IBM RH acquisition. I think IBM at least seems to get open source, especially around Linux. Now if it was Oracle who had bought them... All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] HPE to acquire Cray
On 17/5/19 8:31 am, Kilian Cavalotti wrote: Such an acquisition surely can't be done without an impact on such monumental projects, and I'm wondering what route HPE will follow there. I would guess there would be contractual obligations around these that HPE would inherit. I'd also note that Blue Waters is a counter-example of a case where the initial vendor (IBM in this case) didn't get bought but it still didn't guarantee delivery. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] Frontier Announcement
On Thursday, 9 May 2019 7:51:26 AM PDT Tony Brian Albers wrote: > Please stop, both of you. Sorry for not seeing this before, was away at the Cray User Group and so not keeping up with email (I know, what else is new). I'm disturbed by this thread, distressed at the threats that have been mentioned and concerned by the way the thread developed. I understand from close family experience how receiving threats of violence can set the person up for unexpected behaviours later on due to inoffensive triggers that can cause hurt, offence and anxiety amongst others and how distressing that can be to those on the receiving end of them. That said, I expect this to be the end of this branch of this thread. No responses please, either to the list or privately, this is the end of the matter. Thank you, Chris (list administrator) -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?
On 2/5/19 10:50 am, Faraz Hussain wrote: Thanks John. I believe we purchased the enclosure from HPe with only hardware support. I am not aware of any support contract with Mellanox. We are running RHEL 7.5 ( I may have accidentally said it was Cent OS, but that was a typo ).. Red Hat do have documentation on setting up IB too. This might be a good starting point: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/sec-infiniband_and_rdma_related_software_packages You should also be able to call on Red Hat for support with this as well. Best of luck! Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] GPFS question
On Monday, 29 April 2019 3:47:10 PM PDT Jörg Saßmannshausen wrote: > thanks for the feedback. I guess it also depends how much meta-data you have > and whether or not you have zillions of small or larger files. > At least I got an idea how long it might take. This thread might also be useful, it is a number of years old but it does have some advice on placement of the filesystem manager before the scan and also on their experience scanning a ~1PB filesystem. https://www.ibm.com/developerworks/community/forums/html/topic?id=----14834266 All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] Beowulf in the news
On 17/4/19 4:14 pm, Lux, Jim (337K) via Beowulf wrote: In other news, I note that the Event Horizon Telescope (EHT) used the well known (to beowulf list members) “station wagon full of disk drives” approach to high bandwidth, high latency data comm. This technique does have significant history in the VLBI field (station wagon full of digital or analog tapes). The effective data rate from the telescope in Hawaii to MIT was 112 Gbps (700TB of data in 50,400 seconds). Same trick the pulsar astronomers at Swinburne used to do with Apple Xserve RAID boxes taking empty drives from Melbourne to Parkes and returning with full drives with data. These days they have connectivity out to the dish thankfully! This also had an application in HEP in Australia when it was cheaper to fly someone to Japan to the KEK collider to recover data on tape than it was to ship it over the 'net back to Australian researchers. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] Large amounts of data to store and process
On Friday, 15 March 2019 6:19:14 PM PDT Lawrence Stewart wrote: > * Some punters argue that MPI memory use scales badly with huge numbers of > ranks, so a hybrid approach is best, with OpenMP on node and MPI between > nodes. I am not convinced. You get the complexities of both. I think the thing there is "it depends" - for instance on BlueGene/Q where you had 16 cores and 16 GB RAM you could run 16 ranks of an MPI application per node but only have 1GB RAM per rank, or a single rank per node with 16GB RAM (or some power of 2 in between). So for some large molecular dynamics simulations (like NAMD) going hybrid could be the difference between failing due to not enough memory (usually on rank 0) and being able to run to completion. Now that's not necessarily the case any more (especially as BlueGene has gone the way of the dodo) but it was pretty important where I used to be! All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
Re: [Beowulf] List returned to service (was Re: Administrivia: Beowulf down this weekend for OS upgrade)
On Saturday, 9 March 2019 7:29:10 PM PST Chris Samuel wrote: > The list is working, and the archives are being updated, but there's a > niggling issue that stops access to the archives via the web that I've not > been able to solve yet. Final, final email for the night. This is fixed (Apache 2.4 doesn't like the old syntax but doesn't complain). Archives are visible again. -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] List returned to service (was Re: Administrivia: Beowulf down this weekend for OS upgrade)
On Saturday, 9 March 2019 7:29:10 PM PST Chris Samuel wrote: > However, I don't want to have that stopping access so I've made it > accessible again! Final email for the night, the website now uses Lets Encrypt certificates so we finally have proper HTTPS. It also forces browsers to HTTPS. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] List returned to service (was Re: Administrivia: Beowulf down this weekend for OS upgrade)
On Saturday, 9 March 2019 5:27:08 PM PST Chris Samuel wrote: > Upgrade going well, please forgive this test message to check that the list > is still working and archives are correctly updated. The list is working, and the archives are being updated, but there's a niggling issue that stops access to the archives via the web that I've not been able to solve yet. However, I don't want to have that stopping access so I've made it accessible again! Please do let me know if anything else is broken. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Administrivia: Beowulf down this weekend for OS upgrade
On Saturday, 9 March 2019 3:36:28 PM PST Chris Samuel wrote: > Just a heads up that I'll be taking beowulf.org down shortly to do a much > needed OS upgrade. Upgrade going well, please forgive this test message to check that the list is still working and archives are correctly updated. -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
[Beowulf] Administrivia: Beowulf down this weekend for OS upgrade
Hi all, Just a heads up that I'll be taking beowulf.org down shortly to do a much needed OS upgrade. A side benefit for you folks of me being on call for NERSC this weekend and so not being able to stray too far from home. :-) All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Introduction and question
On Thursday, 28 February 2019 12:41:57 AM PST Bill Broadley wrote: > * avoid installing/fixing things with vi/apt-get/dpkg/yum/dnf, use ansible > whenever possible. Eventually you'll have to reinstall and it's painful > to manually apply months of changes. Another approach is to build a RAM disk image that gets booted on each node, and then you only make changes to that image. That way you know your nodes are in lockstep. At ${JOB-2} we used xCAT for that with its "statelite" method (so we could have some persistent state for things like GPFS config info on an NFS share), at ${JOB-1} we had an image on Lustre that was updated via some scripts from a master image that was kept in git, and where I am now we use Ansible to build boot images for our various systems. All the best! Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Password mining
On Saturday, 2 February 2019 11:42:26 AM AEDT Robert G. Brown wrote: > There is an ancient Unix/Linux application called "crack" (it's still in > at least Fedora, if not all the rest). At this point it is usually used > by sysadmins to run on their password file to detect terrible passwords > when users pick easily crackable ones. Well that's why Alec wrote it when he was at Aberystwyth, to try and find users with weak passwords. :-) > One part of the (rather > intelligent -- written by generations of mostly-white hat wizards) > program checks for common passwords, unchanged passwords (like > changeme), and then runs the entire dictionary(s) with all reasonable > permutations of things like S -> 5, E -> 3, L -> 1. Yeah, Crack has a rule based system to express all the types of munging you would want to try, as well as the ability to add dictionaries and split the run up over multiple machines. ObHPC: the "John the Ripper" password cracker includes GPU support; at ${JOB-3} one of our HPC sysadmins was running it to check our users' passwords. We found that the OpenCL version was (then) faster than the straight CUDA version. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] Thoughts on EasyBuild?
On 17/1/19 7:32 am, Faraz Hussain wrote: Some folks I work with are using EasyBuild but I am not sure what to make of it. I used easybuild for a number of years at VLSCI and at Swinburne and I can say that whilst the "easy" can be a bit of a misnomer it is very powerful and does simplify the job of managing HPC software. You want OpenFOAM? Pick the version, tell it to build it in robot mode and it'll go off and build it from the compiler all the way to the finished install. What I really like is that it codifies a lot of the knowledge about building these applications so even if you don't use it yourself the easyconfigs can help for software you may be struggling to build. The community around it is also strong and helpful. They also use checksums for the files they download to confirm nothing went wrong, which has the useful side-effect of catching projects that spot a bug in a release and then re-release it without updating the version number. :-) One complication I did find is when you build things that you want to apply custom configuration to - like ensuring that all your OpenMPI versions are built with Slurm integration enabled for example - it means modifying all those easyconfigs in advance. I don't know if that's changed recently. One other nit is that it can lead to an explosion of versions of GCC, various MPI's, etc because the easyconfigs encode all those. What we did was use custom versions and pick a GCC v6, v7, etc version to use as well as a version of Python2, Python3, Perl, etc for those modules. It's also very flexible - I really liked the way I could create an easyconfig for a bundle of Python modules for a group and they would just load that one bundle to get everything they wanted. It's not embedded into the existing Python install so you can do this over and over again and not accidentally upgrade one group's version of a module simply because another group wants a module that depends on a later version. 
They also have a sense of humour, Kenneth Hoste the lead developer has a great talk called "How to make package managers cry" which goes into detail about how to make your software hard to install, with examples. https://www.youtube.com/watch?v=NSemlYagjIU Hope that's useful! All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] USB flash drive bootable distro to check cluster health.
On 11/1/19 4:59 am, Richard Chang wrote: Anyone knows of any pre-existing distribution that will do the job ? Or know how to do it with Centos or Ubuntu ? I've used Advanced Clustering's "Breakin" bootable image for this in the past - it's open source and freely downloadable. http://www.advancedclustering.com/products/software/breakin/ It looks like their code is up on their own git server too: http://git.advancedclustering.com/cgi-bin/gitweb.cgi It looks like the downloads haven't been updated for several years, whilst the git repos are more recent. I've also just noticed a tool called "stressant" but that's not been touched for a year: https://gitlab.com/anarcat/stressant No idea what that's like! All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] ipoib routing
On 11/12/18 10:58 pm, John Hearns via Beowulf wrote: Chris, we are talking about exactly the same hardware here. If you opened one up there was a SATA DOM which contained the OS and the configuration script. Interesting, I had been pretty sure the ones we had were just spinning rust, but it was a long time ago now! -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] ipoib routing
On 11/12/18 2:40 pm, John Hearns via Beowulf wrote: Michael, yes. Panasas engineered IPOIB to Ethernet routers for their storage platform. Remember that until the latest generation of their kit they ran on BSD, which had no Infiniband capability. Panasas IB routers booted from an onboard SATA DOM, which was quite a neat solution. Sounds like they've changed a bit since we had them at VLSCI (circa 2010), they were just two SuperMicro 2-in-1U boxes, each with a QDR IB and 10gigE and with a relatively vanilla CentOS 5 install and a script to configure them. Mind you they didn't need to be complicated, they just ran and ran. We lost 1 at one point due to a hardware problem and the solution was to replace the whole unit of 2 nodes. From memory we just needed to take the routes out of the Panasas config and the cluster they were routing for (and put them back when the replacement arrived). All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC ___ Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Re: [Beowulf] No HTTPS for the mailman interface
On 2/12/18 10:43 pm, jaquil...@eagleeyet.net wrote:
> I know Chris is away, but dont you guys feel like there should be an SSL
> certificate on the mailman interface as right now it is sending all
> credentials over http.

Don't worry, I've been wanting to do this for ages. From memory (no access at the moment) port 443 is blocked by some firewall config I don't have access to. Now that I'll be in the Bay Area, making contact with someone at Penguin who can help me with this should hopefully be easier.. (crosses fingers)

Thanks for the reminder!

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
[Beowulf] Administrivia: list admin traveling for a while
Hi folks,

This won't be news to those who stalk^Wfollow me on Twitter, but today is my last day in Australia; I fly this evening to the US, where I'll be starting work at NERSC at LBL on December 10th.

I'm taking a couple of days beforehand to visit my partner in Philadelphia, and so I'll only have random email access for some time (though it's been pretty random since they packed all my computers into a container and I've been madly tidying and cleaning). Any emails that need moderation and any new subscription requests may take longer than usual to process; sorry about that.

I really want to catch up on that HPC workflows thread soon!

All the best,
Chris (back to cleaning)
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] If you can help ...
Hi Doug,

On Tuesday, 20 November 2018 9:08:49 AM AEDT Douglas Eadline wrote:
> Thanks again to all those who helped. It was a great success.

Good to hear that the community helped, sorry it couldn't cover the whole extra cost. :-(

Look forward to seeing the film!

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] PMIX and Julia?
On Tuesday, 20 November 2018 7:53:30 AM AEDT Prentice Bisbal via Beowulf wrote:
> Conceptually, I don't think there's anything preventing it from being
> used for Julia. In reality, it depends on how general the API is.

Yeah, if you're launched via Slurm (for instance) you could have access to PMIx (various versions), PMI2, PMI1 (if you're earlier than 18.08) or none of the above.. For other resource managers the choices may be different.

Much easier to layer yourself on top of MPI and let it handle that for you! :-)

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] PMIX and Julia?
On Monday, 19 November 2018 6:35:56 AM AEDT John Hearns via Beowulf wrote:
> What I am really asking is will pmix be totally necessary when running on
> near-exascale systems, or am I missing something? My thoughts are should
> the Julia world be looking at mpix adaptations? If someone with a clue
> about pmix could enlighten me I would be grateful.

My (limited) understanding of this is that PMI* is an MPI wire-up protocol, in other words a mechanism for the MPI ranks to discover each other, set up communications and also talk to the resource scheduler (if present).

There's a handy little description of PMI (v2 in this case) here:

https://wiki.mpich.org/mpich/index.php/PMI_v2_API

The PMIx website (along with standards documents etc) is here:

https://pmix.org/

My instinct is that it might be better for Julia to sit on MPI and let it handle this for it, rather than have to know about PMI2/PMIx itself..

All the best!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] If you can help ...
On Saturday, 10 November 2018 2:21:42 PM AEDT Adam DeConinck wrote:
> I won’t be able to make it this year, but just kicked in $50. Good luck,
> and I’m sad to miss it!

Likewise, donated and shared on LinkedIn and Twitter. Fingers crossed Doug!

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] SC18
On Thursday, 8 November 2018 4:16:46 AM AEDT Ryan J. Negri wrote:
> I'd love to meet anyone from the Beowulf list if you're also in town.

Sadly I can't be there this year, but please do plug the list to people who you think might benefit! :-)

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] Oh.. IBM eats Red Hat
On Monday, 5 November 2018 2:14:50 AM AEDT Gerald Henriksen wrote:
> The biggest threat to RHEL isn't lost sales to CentOS but losing
> customers and mindshare to Ubuntu (which certainly appears to have
> been an issue the last number of years based on the number of software
> projects that support Ubuntu but not Red Hat).

I don't think that's surprising, and I don't think that's going to change no matter what happens with Red Hat and IBM. From what I've seen in my time, people tend to develop on their desktops, and those tend to run Ubuntu (either natively or in a VM), not CentOS/RHEL.

This is why tools like EasyBuild, Spack and containers are important: we need to be able to cater for these wide-ranging dependencies.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] More about those underwater data centers
On Monday, 5 November 2018 3:13:29 AM AEDT John Hearns via Beowulf wrote:
> Have we faced up to the environmental impact of this?

Everywhere I've been has always tried to reuse/resell/recycle systems. Our Alpha cluster was snapped up by folks in the US, our first Intel cluster went to another university, and our Power5 cluster went somewhere I can't remember. At ${JOB-1} we had Intel clusters redirected to other parts of the university (including one to the LHC ATLAS folks there). BlueGene - well, not so much as far as I could tell. :-(

> It is often said that CPUs can be upgraded - I have only once seen an
> upgrade in place in my career.

I've only been offered (and done) this once, about a decade ago at ${JOB-2}, where we upgraded a system from dual-core to quad-core Opteron (Barcelona). That could have gone better... First of all it was all delayed because of the TLB errata (we eventually got affected chips and ran with the kernel patch before getting the rev'd silicon) and then we started to see random lock-ups. It turned out (after a lot of chasing) that whilst the mainboard was meant to be OK, the layout in the box meant the RAM next to the CPUs would sometimes overheat and take down the box. They added some heatsinks to those DIMMs and the problem went away!

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] "NNSA’s first really big heterogeneous supercomputer"
On Thursday, 1 November 2018 4:58:30 AM AEDT Prentice Bisbal via Beowulf wrote:
> I read a publication from LANL on it's architecture years ago, and I believe
> to had to program for all 3 different processors to take advantage of it's
> architecture. (If soneone knows for sure, please correct me if I'm wrong.)
> I'd say that's even more heterogeneous than what we are seeing today
> (CPU + GPU).

I think you're bang on the money; here's a presentation from LANL about Roadrunner which includes programming on slide 25.

https://www.lanl.gov/conferences/salishan/salishan2007/Roadrunner-Salishan-Ken%20Koch.pdf

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] Oh.. IBM eats Red Hat
On Tuesday, 30 October 2018 2:58:18 AM AEDT Joe Landman wrote:
> Python 2.x is dead, 3.x should be used/shipped everywhere.

[Looks at folks running Python2 apps that rely on no-longer-maintained Python2-only modules, then looks at others running 32-bit IRAF binaries. Goes and cries in corner...]

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] Oh.. IBM eats Red Hat
On Tuesday, 30 October 2018 5:04:39 AM AEDT John Hearns via Beowulf wrote:
> I just realised... I will now need an account on the IBM Support Site, a
> SiteID AND an Entitlement to file bugs on any Redhat packages.

I suspect that won't be the case; from what IBM are saying they're basically going to let Red Hat carry on with how they're doing things.

A couple of points now I've had some time to think further on this:

1) IBM has always required you to run either Red Hat or SLES for hardware support on xSeries hardware. Having better links into one of those means it becomes easier to track down issues when Red Hat stuff up a kernel feature on a point release (like breaking Mellanox IB for several releases in RHEL6 for BG/Q Power systems - it would panic your service node when you booted 4 racks at once, and we had to run the RHEL 6.2 kernel until it was fixed in 6.5). I would be a bit nervous if I was SuSE, given that potential for more tie-in. On Power I'd be worried if I was Canonical, as they had gone in hard with partnerships with IBM for Power around 2015.

2) IBM does have techies (as others have mentioned); from my local perspective they hired most/all of the OzLabs folks in Canberra in 2001 (who were stranded after LinuxCare folded) to join the Linux Technology Centre there, and they were doing a lot of PPC kernel & firmware hacking. They brought up Linux on Power5 before AIX (for the first time - AIX needed firmware support whilst they could get Linux to boot on the bare metal). Some of them you may have heard of :-) (Andrew Tridgell, Rusty Russell, Paul Mackerras, Chris Yeoh). I had the privilege of working with some IBM folks seconded to VLSCI and they were a very smart bunch (Mark Nelson moved from the LTC down to Melbourne & did a bunch of work on Slurm for us). There's a heap more information here:

https://ozlabs.org/about.html

3) xSeries support has always been a pain point for folks dealing with IBM, but the pSeries (POWER) support has been a lot better in general. As long as your IBM account manager doesn't muck up your support schedule. ;-) Red Hat's support has not been that great in my experience, and there are signs of a lack of testing in their release cycle (they released a point release of RHEL6 where rsync's parsing of remote source/destinations was broken so you couldn't rsync to/from a remote source, plus of course the recent release of a kernel where RDMA was completely broken due to a single-character typo).

4) Sierra (and presumably other IBM CORAL systems) runs RHEL7. See point 1.

All the best!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] Oh.. IBM eats Red Hat
On Monday, 29 October 2018 6:42:48 PM AEDT Tony Brian Albers wrote:
> I wonder where that places us in the not too distant future..

Yeah, it's certainly a case of interesting times. At least they do say:

https://investors.redhat.com/news-and-events/press-releases/2018/10-28-2018-184027500

# Upon closing of the acquisition, Red Hat will join IBM's Hybrid Cloud team
# as a distinct unit, preserving the independence and neutrality of Red Hat's
# open source development heritage and commitment, current product
# portfolio and go-to-market strategy, and unique development culture.
# Red Hat will continue to be led by Jim Whitehurst and Red Hat's current
# management team. Jim Whitehurst also will join IBM's senior management
# team and report to Ginni Rometty. IBM intends to maintain Red Hat's
# headquarters, facilities, brands and practices.

So (at least at first) it seems they intend to stay pretty hands off, which I think is a good thing. Fingers crossed for the future..

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] It is time ...
On Tuesday, 23 October 2018 9:48:23 PM AEDT Jeffrey Layton wrote:
> Is there a "consumption game" around the word blockchain?

Power consumption?

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] Contents of Compute Nodes Images vs. Login Node Images
On Wednesday, 24 October 2018 5:47:19 AM AEDT Prentice Bisbal via Beowulf wrote:
> For the Blue Gene/Q, they did start supporting dynamically linked
> executables, but I don't know what changed to the OS to allow that.

The CNK (from memory) just passed all I/O over to the I/O nodes anyway, so if your code dlopen()d a library it was just reading it from the image on the I/O nodes. BG/Q of course had the extra core for the kernel threads, to avoid them getting in the way of the application on the 16 cores for compute.

This was one reason that there was a preference to static link on BG/P and BG/Q, as it would put less load on the I/O nodes when starting large jobs. But of course dynamic linking is non-negotiable for some applications!

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] Contents of Compute Nodes Images vs. Login Node Images
On Wednesday, 24 October 2018 3:15:51 AM AEDT Ryan Novosielski wrote:
> I realize this may not apply to all cluster setups, but I’m curious what
> other sites do with regard to software (specifically distribution packages,
> not a shared software tree that might be remote mounted) for their login
> nodes vs. their compute nodes.

At VLSCI we had separate xCAT package lists for both, but basically the login node list was a superset of the compute node list. These built RAMdisk images, so keeping them lean (on top of what xCAT automatically strips out for you) was important.

Here at Swinburne we run the same image on both, but that's a root filesystem chroot on Lustre, so size doesn't impact memory usage (the node boots a patched oneSIS RAMdisk that brings up OPA and mounts Lustre, then pivots over onto the image there for the rest of the boot). The kernel has a patched overlayfs2 module that does clever things for that part of the tree to avoid constantly stat()ing Lustre for things it has already cached (IIRC, that's a colleague's code).

We install things into the master copy of the chroot (tracked with git), then have a script that turns the cache mode off across the cluster, rsyncs things into the actual chroot area, does a drop_caches and then turns the cache mode on again.

Hope that helps!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] Hacked MBs It was only a matter of time
On Thursday, 4 October 2018 11:47:17 PM AEDT Douglas Eadline wrote:
> https://www.bloomberg.com/news/features/2018-10-04/the-big-hack-how-china-used-a-tiny-chip-to-infiltrate-america-s-top-companies

So two weeks on it looks like this wasn't real, and I've read somewhere (though I can't find the reference now) that this isn't the first time for the person who wrote that article.

A lot of people wrote about how this sort of attack doesn't really make sense; there are far easier ways to do this sort of thing (nobbled BMC firmware probably being one of the easiest), and without the problem of possibly thousands of SM boxes trying to ping back to a C&C server, setting off alarms in a host of companies.

This sums it up nicely..

https://twitter.com/SwiftOnSecurity/status/1053102057245286401

  Two weeks since Bloomberg claimed Supermicro servers were backdoored by
  Chinese spying chips. No Evidence Whatsoever shows these claims real.
  All companies angrily deny it to Congress. Senior US intelligence
  including Rob Joyce refute it. It’s time. It’s over. This is not true.

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] If I were specifying a new custer...
On Saturday, 13 October 2018 12:38:15 AM AEDT Gerald Henriksen wrote:
> If ARM, or Power, want to move from their current positions in the
> market they really need to provide affordable developer machines,

Not sure if this comes in at a price point that makes sense for this, but there is now an ATX Power9 mainboard available.

https://raptorcs.com/TALOSIILITE/

They claim:

https://twitter.com/RaptorCompSys/status/1020371675316215809

# TalosIILite in stock and ready to ship! #POWER9 mainboard + CPU + RAM + HSF
# for under $2,000 USD, what's not to like? Supports all of our Sforza CPU
# options, from 4 core to the high end 22 core CPUs.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] If I were specifying a new custer...
On 12/10/18 08:50, Scott Atchley wrote:
> Perhaps Power9 or Naples with 8 memory channels? Also, Cavium ThunderX2.

I'm not sure if Power or ARM (yet) qualify for a general HPC workload that Doug mentions; sadly a lot of the commercial codes are only available for x86-64 these days. MATLAB dropped PowerPC support back in 2007, for instance.

All the best,
Chris (still in the UK)
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
[Beowulf] Administrivia: list admin travelling
Hi folks,

Just a heads up that I've got to return to the UK for family matters and will be back in Melbourne late on the 14th of October (and giving a short talk at a workshop at eResearch on the 16th, eep!). I should still have access to email whilst travelling, but likely with a higher latency than usual.

The list has been quiet recently; please remember it is your list, and everyone is welcome to initiate and participate in discussions no matter your level of experience. If you know people who might benefit from being on the list please let them know about it; HPC is a fairly small community in the wider IT world and so often we can only make connections with our peers through things like this. If you are going to be at events like SC (sorry, don't think I'll be there this year) please do promote the list to those who may not know about it, if it has been of benefit to you.

Finally, if you, or people you know, have had problems getting or sending emails to this list please do let me know by emailing me directly. The advent of sometimes misguided or over-enthusiastic anti-spam filters is often a problem these days [looks pointedly at current work email system].

All the best!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] Tiered RAM
On Wednesday, 26 September 2018 1:38:15 PM AEST John Hearns via Beowulf wrote:
> I had a look at the Intel pages on Optane memory. It is definitely
> being positioned as a fast file cache, ie for block oriented devices.

Interestingly there was a (non-block device) filesystem presented this year called NOVA, targeting these sorts of NVMM devices. LWN has a nice little article on it (which in turn links to earlier articles on it, and the original paper).

https://lwn.net/Articles/754505/

It's still going through rapid development and (learning from the btrfs experience) the kernel folks aren't going to let it in without a working fsck, from the sound of things.

There's also some NVDIMM documentation in the kernel tree which is pretty heavy going; I've just tried to skim it and I think I know less now than when I started. ;-)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/nvdimm/nvdimm.txt

There's also the more readable documentation for the "block translation table", which seems to be intended to provide a way to give some atomicity to storage transactions on NVDIMMs - something not otherwise present given the nature of the hardware:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/nvdimm/btt.txt

There is also the persistent memory wiki:

https://nvdimm.wiki.kernel.org/

> This worked by tiering RAM memory - ie a driver in the Linux
> kernel would move little used pages to the slower but higher capacity
> device.
> I though the same thing would apply to Optane, but it seems not.

Well, the simplest way to get what you describe there might be to use the Optane as a swap partition. :-)

Red Hat have some docs about using NVDIMMs:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/ch-persistent-memory-nvdimms

Not sure I helped much there! :-)

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] SIMD exception kernel panic on Skylake-EP triggered by OpenFOAM?
On Monday, 10 September 2018 2:23:18 PM AEST Jonathan Engwall wrote:
> If it is helpful there are a few similar bugs, generally considered
> unreproducible. One thread calls it bogus xcomp_bv...the kernel clobbers
> itself writing zeroes when that is not the state. And spectre came up. One
> suggestion is to disable IBRS; according to other sources IBRS is dangerous
> to disable and should protect against Spectre. Maybe the OpenFOAM is to
> blame.

Yeah, I suspect what we're seeing is different to that; it looks like something manages to generate a SIMD exception whilst the kernel is dealing with an APIC timer interrupt. A colleague has backported this patch that I found to our CentOS kernel, in case it helps:

https://lore.kernel.org/patchwork/patch/953364/

For now we've constrained this user's workload to a handful of nodes as they are trying to get some project work done.

All the best!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] RHEL7 kernel update for L1TF vulnerability breaks RDMA
On Tuesday, 11 September 2018 10:41:24 AM AEST Kilian Cavalotti wrote:
> Last I heard, the fix will be in 862.14.1 to be released on the 25th

Ah interesting, I wonder if that fix is already in the 3.10.0-933 kernel that's meant to be in the RHEL 7.6 beta?

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] RHEL7 kernel update for L1TF vulnerability breaks RDMA
On Tuesday, 11 September 2018 9:17:21 AM AEST Ryan Novosielski wrote:
> So we’ve learned what, here, that RedHat doesn’t test the RDMA stack at all?

It certainly does seem to be the case. Unlike other issues I've hit in the past with bugs introduced in the IB stack in 6.x -> 6.y transitions - where they'd have needed more hardware than you could reasonably expect them to have in order to spot the bug - this is a pretty fundamental failure.

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] RHEL7 kernel update for L1TF vulnerability breaks RDMA
On Tuesday, 11 September 2018 1:25:55 AM AEST Peter St. John wrote:
> I had wanted to say that such a bug would be caught by compiling with some
> reasonalbe warning level; but I think I was wrong.

Interesting - looks like it depends on your GCC version, 7.3.0 catches it with -Wall here:

chris@quad:/tmp$ gcc -Wall test.c -o test
test.c: In function ‘main’:
test.c:6:2: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation]
  if ( test );
  ^~
test.c:7:3: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘if’
   printf ( "hello\n" );
   ^~

> So I guess I have to forgive the software engineer who fat-fingered that
> semicolon. Of course I've done worse.

Oh yes, same here too! There but for... and all that. :-)

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] RHEL7 kernel update for L1TF vulnerability breaks RDMA
On Friday, 17 August 2018 2:47:37 PM AEST Chris Samuel wrote:
> Just a heads up that the 3.10.0-862.11.6.el7.x86_64 kernel from RHEL/CentOS
> that was released to address the most recent Intel CPU problem "L1TF" seems
> to break RDMA (found by a colleague here at Swinburne).

So this CentOS bug has a one-line bug fix for this problem!

https://bugs.centos.org/view.php?id=15193

It's a corker - basically it looks like someone typo'd a ; into an if statement, the fix is:

-	if (!rdma_is_port_valid_nospec(device, _attr->port_num));
+	if (!rdma_is_port_valid_nospec(device, _attr->port_num))
 		return -EINVAL;

So it always returns -EINVAL when checking the port, as the if becomes a no-op.. :-(

Patch attached...

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

>From 6353587a7efa488a4064f3661cf64bd4d74eaa73 Mon Sep 17 00:00:00 2001
From: Pablo Greco
Date: Mon, 20 Aug 2018 06:39:55 -0300
Subject: [PATCH] OMG

---
 drivers/infiniband/core/verbs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index debe718..c080eb2 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -1232,7 +1232,7 @@ int ib_resolve_eth_dmac(struct ib_device *device,
 	int ret = 0;
 	struct ib_global_route *grh;
 
-	if (!rdma_is_port_valid_nospec(device, _attr->port_num));
+	if (!rdma_is_port_valid_nospec(device, _attr->port_num))
 		return -EINVAL;
 
 	if (ah_attr->type != RDMA_AH_ATTR_TYPE_ROCE)
-- 
1.8.3.1
Re: [Beowulf] About Torque maillist
On Wednesday, 22 August 2018 12:56:21 AM AEST Dmitri Chubarov wrote:
> It looks like the list at Cluster Resources still exists but has not been
> particularly active for quite some time now.

That's pretty sad to see really, and from the archives it looks like they had an outage earlier in the year for the lists as well.

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] RHEL7 kernel update for L1TF vulnerability breaks RDMA
On Tuesday, 21 August 2018 3:27:59 AM AEST Lux, Jim (337K) wrote:
> I'd find it hard to believe that Intel's CPU designers sat around
> implementing deliberate flaws ( the Bosch engine controller for VW model).

Not to mention that Spectre variants affected AMD, ARM & IBM (at least).

This public, NSA-funded research ("The Intel 80x86 processor architecture: pitfalls for secure systems") from 1995 has an interesting section:

https://ieeexplore.ieee.org/document/398934/
https://pdfs.semanticscholar.org/2209/42809262c17b6631c0f6536c91aaf7756857.pdf

Section 3.10 - "Cache and TLB timing channels" - warns (in generalities) about the use of MSRs and the use of instruction timing as side channels.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] RHEL7 kernel update for L1TF vulnerability breaks RDMA
On Monday, 20 August 2018 6:32:26 AM AEST Jonathan Engwall wrote:
> I am not shocked that my previous message may have been removed.

To clarify: nothing has been removed to my knowledge. Your email is in the list archives:

http://beowulf.org/pipermail/beowulf/2018-August/035219.html

All the best,
Chris (just woken up)
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] RHEL7 kernel update for L1TF vulnerability breaks RDMA
On Sunday, 19 August 2018 5:19:07 AM AEST Jeff Johnson wrote:
> With the spate of security flaws over the past year and the impacts their
> fixes have on performance and functionality it might be worthwhile to just
> run airgapped.

For me, none of the HPC systems I've been involved with here in Australia would have had that option. Virtually all have external users and/or rely on external data for some of the work they are used for (and the sysadmins don't usually have control over the projects & people who get to use them).

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] RHEL7 kernel update for L1TF vulnerability breaks RDMA
On Saturday, 18 August 2018 11:55:22 PM AEST Jörg Saßmannshausen wrote:

> So I don't really understand about "Cannot make this public, as the patch
> that caused it was due to embargo'd security fix." issue.

I don't think any of us do, unless there's another fix in there for an undisclosed CVE (which seems unlikely).

-- 
 Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] RHEL7 kernel update for L1TF vulnerability breaks RDMA
On 18/8/18 8:47 pm, Jörg Saßmannshausen wrote:

> Hi Chris,

Hiya,

> these are bad news if InfiniBand will be affected here as well, as that is
> what we need to use for parallel calculations. They make use of RDMA and
> if that has a problem... well, you get the idea I guess.

Oh yes, this is why I wanted to bring it to everyone's attention; this isn't just about Lustre, it's much more widespread.

> Has anybody contacted the vendors like Mellanox or Intel regarding this?

As Kilian wrote in the Lustre bug, quoting his RHEL bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1618452

> --- Comment #3 from Don Dutile ---
> Already reported and being actively fixed.
> Cannot make this public, as the patch that caused it was due to embargo'd
> security fix.
> This issue has highest priority for resolution. Revert to
> 3.10.0-862.11.5.el7 in the mean time.
> This bug has been marked as a duplicate of bug 1616346

-- 
 Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
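[For anyone wanting to follow the "revert to 3.10.0-862.11.5.el7" advice above, a hedged sketch of the admin steps involved — it only assembles the yum/grub2 command strings rather than executing them (they'd need root on an affected box), and the grub2-set-default menu-entry title is a guess for a stock CentOS 7 install; check yours with `grep ^menuentry /boot/grub2/grub.cfg`.]

```python
# Build (but do not run) the commands to pin RHEL/CentOS 7 back to the
# kernel release prior to the RDMA-breaking update.
PREVIOUS = "3.10.0-862.11.5.el7"

def revert_commands(version: str) -> list:
    """Return the shell commands an admin would run, as root, in order."""
    return [
        # Reinstall the older kernel package if it was removed:
        "yum install -y kernel-{0}.x86_64".format(version),
        # Make it the default boot entry (title is distro-dependent):
        "grub2-set-default 'CentOS Linux ({0}.x86_64) 7 (Core)'".format(version),
        # Regenerate the grub config so the default takes effect:
        "grub2-mkconfig -o /boot/grub2/grub.cfg",
    ]
```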
Re: [Beowulf] RHEL7 kernel update for L1TF vulnerability breaks RDMA
On Saturday, 18 August 2018 12:54:03 AM AEST Kilian Cavalotti wrote:

> That's true: RH mentioned an "embargo'd security fix" but didn't refer
> to L1TF explicitly (which I think is not under embargo anymore).

Agreed, though I'm not sure any of the listed fixes are embargoed now.

> As the reporter of the issue on the Whamcloud JIRA, I also have to
> apologize for initially pointing fingers at Lustre, it didn't cross my
> mind that this kind of whole RDMA stack breakage would have slipped
> past Red Hat's QA.

Oh, I didn't read that as pointing any fingers at Lustre at all, just that the kernel update broke Lustre for you (and for us!).

All the best,
Chris
-- 
 Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [Beowulf] RHEL7 kernel update for L1TF vulnerability breaks RDMA
On Friday, 17 August 2018 2:47:37 PM AEST Chris Samuel wrote:

> Just a heads up that the 3.10.0-862.11.6.el7.x86_64 kernel from RHEL/CentOS
> that was released to address the most recent Intel CPU problem "L1TF" seems
> to break RDMA (found by a colleague here at Swinburne).

There are six CVEs addressed in that update from the look of it, so it might not be the L1TF fix itself that has triggered this:

https://access.redhat.com/errata/RHSA-2018:2384

-- 
 Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
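[Whether a given running kernel carries mitigations for these CVEs can be checked at runtime: kernels of this era (including the RHEL 7 backports) expose per-vulnerability status files under sysfs. A small sketch, assuming a Linux host — it simply returns an empty dict on kernels or systems that don't expose the interface.]

```python
from pathlib import Path

def cpu_vulnerabilities(sysfs="/sys/devices/system/cpu/vulnerabilities"):
    """Map vulnerability name -> kernel mitigation status, if exposed.

    Example entries on a patched host: 'l1tf', 'meltdown', 'spectre_v2',
    each with a one-line status such as 'Mitigation: PTE Inversion'.
    """
    root = Path(sysfs)
    if not root.is_dir():
        return {}  # interface absent: older kernel, or not Linux
    return {p.name: p.read_text().strip()
            for p in root.iterdir() if p.is_file()}
```

[Usage: `cpu_vulnerabilities().get("l1tf", "not reported")` tells you at a glance whether the L1TF mitigation from this errata is active, without parsing `uname -r` against errata pages.]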