Re: reproducibility of results
Virtualisation can help a lot. The ability to snapshot a server version along with the data it analysed is probably the right approach (and a relatively small footprint, since if you are doing it right, you are taking snapshots anyway). But you still have to set it up...

Research data is interesting: massive volumes, and a relatively short "live" time frame before it can be migrated to slower storage. Tiered storage solutions help a lot. Archiving to tape is viable, but it really needs a way to be accessed automatically (by users, without sysadmin involvement - less work, more likely to be used, and users test the archive's validity constantly...)

On 25 Dec 2016 11:44 pm, "Craig Sanders via luv-main" wrote:
> [...]
___
luv-main mailing list
luv-main@luv.asn.au
https://lists.luv.asn.au/cgi-bin/mailman/listinfo/luv-main
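The snapshot-plus-data idea above can be sketched with libvirt/QEMU. This is only an illustration: the domain and snapshot names are invented, and it assumes the guest's disks are qcow2-backed (which is what makes internal snapshots cheap):

```shell
# take a named snapshot of the analysis VM, so the exact server version
# and the data it analysed can be restored together later.
# "analysis-vm" and "paper-2016-run" are hypothetical names.
virsh snapshot-create-as analysis-vm paper-2016-run \
    --description "state used for the Dec 2016 results"

# years later: roll the VM back to exactly that state to re-run the analysis
virsh snapshot-revert analysis-vm paper-2016-run
```

If snapshots are already part of the normal backup routine, archiving one per published result is close to free.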
Re: reproducibility of results
On Sun, Dec 25, 2016 at 04:56:05PM +1100, Paul van den Bergen wrote:
> Funny, I was asked about exactly the same problem when I started
> @WEHI... only there was no attempt made to even start tackling the
> problem...

yeah, we were constantly getting individual academics and research groups asking us about storage, and then trying to do the best we could with minimal resources.

the unfortunate fact is that disks/storage arrays and file-servers and tape libraries etc. are expensive. You can replace a very large percentage of your up-front capital expense with skilled techs, which are an on-going cost (you're going to need them to look after expensive equipment anyway, and it has to be maintained & upgraded for 7+ years), but it's still going to cost a lot for huge data storage anyway, even if you avoid over-priced name-brand gear.

> Virtualisation of workload makes the problem a lot easier to tackle,
> but even so... 7 years is a long time in IT...

cheap big disks help a lot too. but you need a lot of them, plus backup - on-site and off-site.

CPU & RAM are more than adequate for pretty much any file-storage needs these days... could always use more of both for computational stuff.

craig

--
craig sanders
Re: reproducibility of results
Funny, I was asked about exactly the same problem when I started @WEHI... only there was no attempt made to even start tackling the problem...

Virtualisation of workload makes the problem a lot easier to tackle, but even so... 7 years is a long time in IT...

On 24 Dec 2016 4:19 pm, "Craig Sanders via luv-main" wrote:
> [...]
Re: reproducibility of results
On Sat, Dec 24, 2016 at 12:51:01AM +1100, russ...@coker.com.au wrote:
> https://github.com/docker/docker/issues/28705
> https://lwn.net/Articles/446528/

thanks, i'll have to read those later today.

> So the kernel command-line option might be the best option for etch.

possibly. worth a try, anyway.

> Also the Jessie kernel will have security support for another couple
> of years. There's no reason why you couldn't run a Stretch host with
> a Jessie kernel to support Etch docker images if that was necessary.
> Of course that would make ZFS kernel module support even more exciting
> than it already is...

I'd rather just run an etch VM. Or if i really needed etch in a container (or multiple containers) for some reason, then a jessie VM running docker or similar... that way i'm only running the old stuff for the things that actually need it.

> > > But I can imagine a situation where part of the tool-chain for
> > > scientific computing had a bug that was only fixed in a new
> > > upstream release that required a new compiler.
> >
> > that's one of the advantages of VMs, you can keep old software
> > alive indefinitely... and that works very nicely with the kind of
> > stuff I was doing at Nectar with openstack - basic idea was to
> > let researchers start up VMs or even entire HPC clusters of VMs
> > (e.g. a controller-node VM and a bunch of compute-node VMs, plus
> > the required private networking, scripting, configuration, etc) as
> > needed for their computational tasks.
>
> That doesn't even solve the problem.

it does for reproducing results from most scientific computing software.

in fact, a VM is about the only way to guarantee the exact same software environment for old software running on an old OS (you still have to be careful about the underlying hardware - Intel and AMD, for example, have slightly different quirks and bugs... and Intel has or had a habit of crippling code compiled with their compilers if they detect at run time that it's running on a non-Intel CPU).

To reproduce results from an old version of, say, Gaussian (a very popular commercial computational chemistry program), all i need to do is build and keep a VM image that runs it. if/when i ever need to, i just fire up the VM. How is that not solving the problem?

The requirement is to reproduce the same results from the same data using the same program (bugs and all), not to get the old software running on a newer, updated linux distro or to re-process the old data with a new improved version(*). What the old version runs on is pretty much irrelevant, as long as it runs and produces exactly the same results.

the point of the exercise is academic integrity, being able to prove that you didn't make up your results. if you can't generate exactly the same results, then that's a problem - even if the new results are better or more accurate.

(*) If updated, bug-fixed results are required then that's a completely separate issue - and material for a new paper or a separately published correction.

BTW, VMs also solve the problem for other desktop apps that only run in an old version of your distro. ditto for windows apps. For example, i've got a Win XP app(**) that won't run in Win 7. It will run partially in WINE (almost everything works except print and export), but runs fine in a Win XP VM.

(**) GURPS GURU, a GURPS RPG character generator - merely freeware, not Free Software, so source code is not available... i haven't even been able to track down the author's current contact details, he seems to have vanished off the net sometime in the mid-2000s

> Wheezy lost support for Chromium due to compiler issues. Kmail in
> Jessie never worked correctly for large mailboxes. So for me a basic
> workstation (lots of mail and web browsing) wasn't functional on
> either of those releases. I ended up running Jessie and then Unstable
> with a Wheezy image under systemd-nspawn for email which meant that I
> couldn't click on a link to launch a browser (not without more hackery
> than I had time for).

Scientific computing has little or nothing to do with chromium, other browsers, or most other desktop stuff. If there's a GUI at all, it's usually just some kind of front-end app to generate control or data files for text-mode programs (often run on multiple nodes of a cluster), and/or to visualise or post-process the results.

> It's easy to imagine a similar chain of events breaking a scientific
> workflow.

i think you don't know what a scientific computing workflow actually is. it's not like desktop app stuff (they have macs or windows or linux PCs for that), or even like systems geekery. it's a different kind of usage entirely.

scientific computing jobs tend to be batch jobs, not interactive, and can take anywhere from minutes to hours, days, months, sometimes even years to run to completion, even on large clusters (which are shared resources running other jobs for you and for other people at the same time too). You submit the job
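The "exactly the same results" requirement above lends itself to a mechanical check: keep a checksum manifest of the original run's outputs next to the archived VM image, and verify any rerun against it bit-for-bit. A toy sketch - the filenames and the "result" line are made up for illustration:

```shell
# toy illustration of verifying that a rerun reproduces the original
# results bit-for-bit. all filenames here are hypothetical.
printf 'total energy: -230.711\n' > run-output.log   # stand-in for a real result file
sha256sum run-output.log > results.sha256            # manifest from the original run

printf 'total energy: -230.711\n' > run-output.log   # a "rerun" with identical output
sha256sum -c results.sha256                          # reports OK only if bit-identical
```

A rerun that produced even a slightly different file would fail the `sha256sum -c` check, which is exactly the integrity property being argued for.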
Re: reproducibility of results
On Friday, 23 December 2016 10:50:04 PM AEDT Craig Sanders via luv-main wrote:
> On Fri, Dec 23, 2016 at 09:41:30PM +1100, russ...@coker.com.au wrote:
> > Debian/Unstable has a new version of GCC that has deprecated a lot of the
> > older STL interfaces. It also has a kernel that won't work with the amd64
> > libc from Wheezy.
>
> yeah, i know. i had to build frankenwheezy (mwahahahah! gasp in horror
> at wheezy with libc6 from jessie grafted on) for some docker images a
> while back:
>
> http://blog.taz.net.au/2016/09/16/frankenwheezy-keeping-wheezy-alive-on-a-container-host-running-libc6-2-24/
>
> AFAICT at the time, the incompatibility was with the later libc6 on
> my sid docker host, not necessarily with the kernel... IIRC it started
> failing after I upgraded libc6 on the host, without upgrading the kernel
> or rebooting.

https://github.com/docker/docker/issues/28705

Above is a Docker issue about this. It's the vsyscall interface. Apparently you can use a kernel command-line option to enable the compatibility (while weakening security). I never even tested that kernel option, I prefer to avoid things that reduce security, and a franken-wheezy setup isn't difficult to arrange. Even extracting the glibc Debian package with ar and tar (because I couldn't run dpkg in the chroot) wasn't that difficult.

https://lwn.net/Articles/446528/

Above is the explanation for the security issue. Note that it's from 2011. We have had a lot of time to deal with this. It seems that having libc in Wheezy depend on vsyscall was a mistake.

> > It should be possible to change the Wheezy libc to the newer amd64
> > system call interface without changing much and using kvm or Xen is a
> > possibility too.
>
> yep, wheezy still runs in a VM. as does etch (i also tried the same kind
> of libc6 etc upgrades on etch to get it working in docker on sid but
> couldn't get it working. gave up on that and created a VM instead. so
> etch-based images will stop working in docker when stretch is released)

Debian is designed to have everything run with the libc from the next release, there's no other way upgrades can work. It isn't designed to have a libc from 2 releases above. So the kernel command-line option might be the best option for etch.

Also the Jessie kernel will have security support for another couple of years. There's no reason why you couldn't run a Stretch host with a Jessie kernel to support Etch docker images if that was necessary. Of course that would make ZFS kernel module support even more exciting than it already is...

> > But I can imagine a situation where part of the tool-chain for
> > scientific computing had a bug that was only fixed in a new upstream
> > release that required a new compiler.
>
> that's one of the advantages of VMs, you can keep old software alive
> indefinitely... and that works very nicely with the kind of stuff I was
> doing at Nectar with openstack - basic idea was to let researchers start
> up VMs or even entire HPC clusters of VMs (e.g. a controller-node VM
> and a bunch of compute-node VMs, plus the required private networking,
> scripting, configuration, etc) as needed for their computational tasks.

That doesn't even solve the problem. Wheezy lost support for Chromium due to compiler issues. Kmail in Jessie never worked correctly for large mailboxes. So for me a basic workstation (lots of mail and web browsing) wasn't functional on either of those releases. I ended up running Jessie and then Unstable with a Wheezy image under systemd-nspawn for email which meant that I couldn't click on a link to launch a browser (not without more hackery than I had time for).

It's easy to imagine a similar chain of events breaking a scientific workflow.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
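For reference, the kernel command-line option being discussed is presumably `vsyscall=emulate` (legacy vsyscall emulation), and the dpkg-free package extraction mentioned above goes roughly like this. The package filename and chroot path below are invented for illustration:

```shell
# 1) re-enable legacy vsyscall emulation for old (wheezy-era) glibc binaries.
#    add to GRUB_CMDLINE_LINUX in /etc/default/grub, then run update-grub:
#      GRUB_CMDLINE_LINUX="vsyscall=emulate"

# 2) unpack a libc6 .deb without dpkg, e.g. into a chroot that can't run dpkg.
#    a .deb is an ar archive containing control.tar.* and data.tar.*:
ar x libc6_2.19-18+deb8u10_amd64.deb         # hypothetical jessie package filename
tar -xf data.tar.xz -C /srv/chroots/wheezy   # hypothetical chroot path
ldconfig -r /srv/chroots/wheezy              # refresh the linker cache in the chroot
```

The trade-off is as stated above: the kernel option weakens security for the whole host, whereas the libc graft only touches the one chroot or image that needs it.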
Re: reproducibility of results
On Fri, Dec 23, 2016 at 09:41:30PM +1100, russ...@coker.com.au wrote:
> Debian/Unstable has a new version of GCC that has deprecated a lot of the
> older STL interfaces. It also has a kernel that won't work with the amd64
> libc from Wheezy.

yeah, i know. i had to build frankenwheezy (mwahahahah! gasp in horror at wheezy with libc6 from jessie grafted on) for some docker images a while back:

http://blog.taz.net.au/2016/09/16/frankenwheezy-keeping-wheezy-alive-on-a-container-host-running-libc6-2-24/

AFAICT at the time, the incompatibility was with the later libc6 on my sid docker host, not necessarily with the kernel... IIRC it started failing after I upgraded libc6 on the host, without upgrading the kernel or rebooting. i don't know if further evidence invalidated my theory. either way, upgrading wheezy's libc6 to jessie's libc6 solved the problem.

> It should be possible to change the Wheezy libc to the newer amd64
> system call interface without changing much and using kvm or Xen is a
> possibility too.

yep, wheezy still runs in a VM. as does etch (i also tried the same kind of libc6 etc. upgrades on etch to get it working in docker on sid but couldn't get it working. gave up on that and created a VM instead. so etch-based images will stop working in docker when stretch is released)

> But I can imagine a situation where part of the tool-chain for
> scientific computing had a bug that was only fixed in a new upstream
> release that required a new compiler.

that's one of the advantages of VMs, you can keep old software alive indefinitely... and that works very nicely with the kind of stuff I was doing at Nectar with openstack - basic idea was to let researchers start up VMs or even entire HPC clusters of VMs (e.g. a controller-node VM and a bunch of compute-node VMs, plus the required private networking, scripting, configuration, etc) as needed for their computational tasks.
craig

--
craig sanders
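A frankenwheezy-style image like the one in the blog post above is worth archiving once built, so it can be reloaded long after the base images disappear from registries. One way to do that (container and image names are invented for illustration):

```shell
# snapshot a patched container (e.g. wheezy with jessie's libc6 grafted in)
# as a reusable image. "wheezy-patched" and "frankenwheezy" are hypothetical names.
docker commit wheezy-patched frankenwheezy:latest

# archive the image as a compressed tarball for long-term storage...
docker save frankenwheezy:latest | xz -9 > frankenwheezy-image.tar.xz

# ...and restore it later (docker load accepts xz-compressed archives)
docker load < frankenwheezy-image.tar.xz
```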
Re: reproducibility of results
On Friday, 23 December 2016 8:55:05 PM AEDT Craig Sanders via luv-main wrote:
> On Fri, Dec 23, 2016 at 08:22:45PM +1100, russ...@coker.com.au wrote:
> > I've heard a lot of scientific computing people talk about a desire to
> > reproduce calculations, but I haven't heard them talking about these
> > issues so I presume that they haven't got far in this regard.
>
> it was a big issue when i was at unimelb (where i built a HPC cluster
> for the chemistry dept and later worked on the nectar research cloud).
>
> depending on the funding source or the journal that papers were
> published in, raw data typically had to be stored for at least 7 or 12
> years, and the exact same software used to process it also had to be
> kept available and runnable (which was an ongoing problem, especially
> with some of the commercial software like gaussian... but even open
> source stuff is affected by bit-rot and also by CADT-syndrome. we had
> a source license for gaussian, but that didn't guarantee that we could
> even compile it with newer compilers.

Debian/Unstable has a new version of GCC that has deprecated a lot of the older STL interfaces. It also has a kernel that won't work with the amd64 libc from Wheezy. These are upstream issues so other distributions may have dealt with them in some way.

It should be possible to change the Wheezy libc to the newer amd64 system call interface without changing much, and using kvm or Xen is a possibility too. Also compiling against the old STL isn't that hard to do, and Debian has good support for multiple versions of GCC.

The STL isn't necessarily a trivial issue. I recall that Wheezy stopped getting security support for Chromium because upstream (Google) decided to just make new releases which depended on new C++ features that weren't in the Wheezy version of GCC.

Supporting old versions of software is the usual requirement and that's usually a lot easier. But I can imagine a situation where part of the tool-chain for scientific computing had a bug that was only fixed in a new upstream release that required a new compiler.

Speaking for myself, I'm having enough trouble making the software I'm responsible for work on all the newer versions of compilers etc. I don't give much thought to backwards compatibility.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
reproducibility of results
On Fri, Dec 23, 2016 at 08:22:45PM +1100, russ...@coker.com.au wrote:
> I've heard a lot of scientific computing people talk about a desire to
> reproduce calculations, but I haven't heard them talking about these
> issues so I presume that they haven't got far in this regard.

it was a big issue when i was at unimelb (where i built a HPC cluster for the chemistry dept and later worked on the nectar research cloud).

depending on the funding source or the journal that papers were published in, raw data typically had to be stored for at least 7 or 12 years, and the exact same software used to process it also had to be kept available and runnable (which was an ongoing problem, especially with some of the commercial software like gaussian... but even open source stuff is affected by bit-rot and also by CADT-syndrome. we had a source license for gaussian, but that didn't guarantee that we could even compile it with newer compilers. it might have changed now, but iirc it would only compile with a specific intel fortran compiler. numerous efforts to compile it with gfortran ended in failure)

and some of the data sets that had to be stored were huge - dozens or hundreds of terabytes or more. and while it wasn't something i worked on personally, i know that for some of the people working with, e.g., the synchrotron, that's a relatively piddling quantity of data.

craig

--
craig sanders