Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-26 Thread Blomqvist Janne
Sent: Tuesday, February 26, 2019 22:25 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters But rsync -a will only help you if people are using identical or at least overlapping data sets? And you don't need rsync to prune out old

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-26 Thread Goetz, Patrick G
But rsync -a will only help you if people are using identical or at least overlapping data sets? And you don't need rsync to prune out old files. On 2/26/19 1:53 AM, Janne Blomqvist wrote: > On 22/02/2019 18.50, Will Dennis wrote: >> Hi folks, >> >> Not directly Slurm-related, but... We have a
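
For illustration, a minimal sketch of the kind of age-based pruning that needs no rsync at all; the scratch path and the 30-day cutoff are assumptions for the example, not policy from this thread.

    #!/usr/bin/env python3
    """Illustrative only: prune files from node-local scratch that have not been
    modified in the last N days, using plain os.walk instead of rsync. The path
    and the retention window are assumptions, not anything stated in the thread."""
    import os
    import time

    SCRATCH = "/mnt/local"   # node-local scratch path mentioned elsewhere in the thread
    MAX_AGE_DAYS = 30        # assumed retention window

    def prune(root=SCRATCH, max_age_days=MAX_AGE_DAYS):
        cutoff = time.time() - max_age_days * 86400
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    if os.path.getmtime(path) < cutoff:
                        os.remove(path)
                except FileNotFoundError:
                    pass  # file disappeared while walking; nothing to do

    if __name__ == "__main__":
        prune()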

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-26 Thread Ansgar Esztermann-Kirchner
Hi, I'd like to share our set-up as well, even though it's very specialized and thus probably won't work in most places. However, it's also very efficient in terms of budget when it does. Our users don't usually have shared data sets, so we don't need high bandwidth at any particular point -- the

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-26 Thread Adam Podstawka
On 26.02.19 at 09:20, Tru Huynh wrote: > On Fri, Feb 22, 2019 at 04:46:33PM -0800, Christopher Samuel wrote: >> On 2/22/19 3:54 PM, Aaron Jackson wrote: >> >>> Happy to answer any questions about our setup. >> >> > >> Email me directly to get added (I had to disable the Mailman web > Coul

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-26 Thread Raymond Wan
Hi Janne, On Tue, Feb 26, 2019 at 3:56 PM Janne Blomqvist wrote: > When reaping, it searches for these special .datasync directories (up to > a configurable recursion depth, say 2 by default), and based on the > LAST_SYNCED timestamps, deletes entire datasets starting with the oldest > LAST_SYNC
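
A minimal sketch of the reaper described in the quoted scheme, assuming each dataset directory on local scratch carries a .datasync/LAST_SYNCED marker whose mtime records the last sync; the /mnt/local root and the free-space threshold are illustrative placeholders, not from the post.

    #!/usr/bin/env python3
    """Sketch of the reaping scheme quoted above, under assumed conventions:
    each dataset directory contains ".datasync/LAST_SYNCED", and that file's
    mtime is the LAST_SYNCED timestamp. Paths and thresholds are placeholders."""
    import os
    import shutil

    SCRATCH = "/mnt/local"        # node-local scratch root (path used elsewhere in the thread)
    MAX_DEPTH = 2                 # configurable recursion depth, 2 by default per the post
    MIN_FREE_BYTES = 500 * 2**30  # assumed policy: reap until 500 GiB are free

    def find_datasets(root, max_depth):
        """Yield (last_synced, path) for every directory holding a .datasync marker,
        searching at most max_depth levels below root."""
        for dirpath, dirnames, _ in os.walk(root):
            rel = os.path.relpath(dirpath, root)
            depth = 0 if rel == "." else rel.count(os.sep) + 1
            marker = os.path.join(dirpath, ".datasync", "LAST_SYNCED")
            if os.path.exists(marker):
                yield os.path.getmtime(marker), dirpath
                dirnames[:] = []  # a dataset is reaped as a unit; don't descend into it
            elif depth >= max_depth:
                dirnames[:] = []  # respect the configured recursion depth

    def reap(root=SCRATCH, max_depth=MAX_DEPTH, min_free=MIN_FREE_BYTES):
        """Delete whole datasets, oldest LAST_SYNCED first, until enough space is free."""
        for _, dataset in sorted(find_datasets(root, max_depth)):
            if shutil.disk_usage(root).free >= min_free:
                break
            shutil.rmtree(dataset)

    if __name__ == "__main__":
        reap()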

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-26 Thread Tru Huynh
On Fri, Feb 22, 2019 at 04:46:33PM -0800, Christopher Samuel wrote: > On 2/22/19 3:54 PM, Aaron Jackson wrote: > > >Happy to answer any questions about our setup. > > > > Email me directly to get added (I had to disable the Mailman web Could you add me to that list? Thanks Tru -- Dr Tr

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-25 Thread Janne Blomqvist
On 22/02/2019 18.50, Will Dennis wrote: Hi folks, Not directly Slurm-related, but... We have a couple of research groups that have large data sets they are processing via Slurm jobs (deep-learning applications) and are presently consuming the data via NFS mounts (both groups have 10G ethernet

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-23 Thread John Hearns
Will, there are some excellent responses here. I agree that moving data to local fast storage on a node is a great idea. Regarding the NFS storage, I would look at implementing BeeGFS if you can get some new hardware or free up existing hardware. BeeGFS is a skoosh case to set up. (*) Scottish sl

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Raymond Wan
Hi Will, On 23/2/2019 1:50 AM, Will Dennis wrote: For one of my groups, on the GPU servers in their cluster, I have provided a RAID-0 md array of multi-TB SSDs (for I/O speed) mounted on a given path ("/mnt/local" for historical reasons) that they can use for local scratch space. Their othe

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Christopher Samuel
On 2/22/19 3:54 PM, Aaron Jackson wrote: Happy to answer any questions about our setup. If folks are interested in a mailing list where this discussion would be decidedly on-topic then I'm happy to add people to the Beowulf list where there's a lot of other folks with expertise in this are

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Aaron Jackson
Hi Will, I look after our GPU cluster in our vision lab. We have a similar setup - we are working from a single ZFS file server. We have two pools: /db which is about 40TB spinning SAS built out of two raidz vdevs, with 16TB of L2ARC (across 4 SSDs). This reduces the size of ARC quite significant

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Matthew BETTINGER
We stuck Avere between Isilon and a cluster to get us over the hump until the next budget cycle ... then we replaced it with Spectrum Scale for mid-level storage. Still use Lustre of course as scratch. On 2/22/19, 12:24 PM, "slurm-users on behalf of Will Dennis" wrote: (replies inline)

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Will Dennis
Yes, we've thought about using FS-Cache, but it doesn't help on the first read-in, and the cache eviction may affect subsequent read attempts... (different people are using different data sets, and the cache will probably not hold all of them at the same time...) On Friday, February 22, 2019 2

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Mark Hahn
> applications) and are presently consuming the data via NFS mounts (both groups have 10G ethernet interconnects between the Slurm nodes and the NFS servers.) They are both now complaining of "too-long loading times" for the
how about just using cachefs (backed by a local filesystem on ssd)? http

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Will Dennis
(replies inline) On Friday, February 22, 2019 1:03 PM, Alex Chekholko said: >Hi Will, > >If your bottleneck is now your network, you may want to upgrade the network. >Then the disks will become your bottleneck :) > Via network bandwidth analysis, it's not really network that's the problem...

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Alex Chekholko
Hi Will, You have bumped into the old adage: "HPC is just about moving the bottlenecks around". If your bottleneck is now your network, you may want to upgrade the network. Then the disks will become your bottleneck :) For GPU training-type jobs that load the same set of data over and over agai

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Paul Edmon
At least in our case we use a Lustre filesystem for scratch access; we have it mounted over IB, though. That said, some of our nodes only access it over 1GbE and I have never heard any complaints about performance. In general, for large-scale production work Lustre tends to be more resilient

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Will Dennis
Thanks for the reply, Ray. For one of my groups, on the GPU servers in their cluster, I have provided a RAID-0 md array of multi-TB SSDs (for I/O speed) mounted on a given path ("/mnt/local" for historical reasons) that they can use for local scratch space. Their other servers in the cluster ha
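
A minimal sketch, not the poster's actual workflow, of staging a dataset from the NFS server onto that node-local SSD scratch with rsync -a before a job runs; the NFS source path and dataset name are hypothetical, while /mnt/local is the scratch path described in the post.

    #!/usr/bin/env python3
    """Sketch of staging a dataset onto node-local SSD scratch at the start of a
    Slurm job. The NFS source path and dataset name are made up; /mnt/local is
    the scratch path described in the post."""
    import os
    import subprocess

    SRC = "/nfs/datasets/imagenet/"        # hypothetical dataset on the NFS server
    DST = "/mnt/local/datasets/imagenet/"  # per-dataset cache on node-local scratch

    os.makedirs(DST, exist_ok=True)
    # rsync -a preserves permissions/timestamps and only copies what is missing or
    # changed, so later jobs needing the same dataset mostly reuse the local copy.
    subprocess.run(["rsync", "-a", SRC, DST], check=True)
    # ... point the training job at DST instead of the NFS path ...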

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Raymond Wan
Hi Will, On 23/2/2019 12:50 AM, Will Dennis wrote: ... would be considered “scratch space”, not for long-term data storage, but for use over the lifetime of a job, or maybe perhaps a few sequential jobs (given the nature of the work.) “Permanent” storage would remain the existing NFS serve

[slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Will Dennis
Hi folks, Not directly Slurm-related, but... We have a couple of research groups that have large data sets they are processing via Slurm jobs (deep-learning applications) and are presently consuming the data via NFS mounts (both groups have 10G ethernet interconnects between the Slurm nodes and