Re: reproducibility of results

2016-12-26 Thread Paul van den Bergen via luv-main
Virtualisation can help a lot. The ability to snapshot a server image along
with the data it analysed is probably the right approach (and a relatively
small footprint, since if you are doing it right, you are taking snapshots
anyway).

But you still have to set it up...
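For a KVM guest with its data on ZFS, for example, the whole capture can be
a couple of commands (a sketch only - the guest, pool, and dataset names
here are invented):

    # snapshot the analysis VM and the dataset it analysed, under one name
    virsh snapshot-create-as analysis-vm paper-2016 "state used for the paper"
    zfs snapshot tank/research/projectX@paper-2016
    # keep a copy of the data snapshot somewhere safe
    zfs send tank/research/projectX@paper-2016 | gzip \
        > /archive/projectX-paper-2016.zfs.gz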


Research data is interesting: massive volumes, and a relatively short "live"
time frame before it can be migrated to slower storage. Tiered storage
solutions help a lot. Archiving to tape is viable, but it really needs a way
to be accessed automatically (by users, without sysadmin involvement - less
work, more likely to be used, and users test the archive's validity
constantly...)
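Even something crude covers the "automatically accessed" part. A rough
sketch of the migrate-and-leave-a-pointer idea (the paths and the 180-day
cutoff are invented):

    # move files untouched for ~6 months to the archive tier, leaving a
    # symlink behind so users can still open them without a sysadmin
    find /data/hot -type f -atime +180 -print0 |
    while IFS= read -r -d '' f; do
        dest="/archive/${f#/data/hot/}"
        mkdir -p "$(dirname "$dest")"
        mv "$f" "$dest" && ln -s "$dest" "$f"
    done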

On 25 Dec 2016 11:44 pm, "Craig Sanders via luv-main" wrote:

> On Sun, Dec 25, 2016 at 04:56:05PM +1100, Paul van den Bergen wrote:
> > Funny, I was asked about exactly the same problem when I started
> > @WEHI... only there was no attempt made to even start tackling the
> > problem...
>
> yeah, we were constantly getting individual academics and research
> groups asking us about storage, and then trying to do the best we could
> with minimal resources.
>
> the unfortunate fact is that disks/storage arrays and file-servers
> and tape libraries etc are expensive. You can replace a very large
> percentage of your up-front capital expense with skilled techs, who
> are an on-going cost (you're going to need them to look after expensive
> equipment anyway, and it has to be maintained & upgraded for 7+ years),
> but it's still going to cost a lot for huge data storage, even if
> you avoid over-priced name-brand gear.
>
> > Virtualisation of workload makes the problem a lot easier to tackle,
> > but even so... 7 years is a long time in IT...
>
> cheap big disks help a lot too. but you need a lot of them, plus
> backup - on-site and off-site.
>
> CPU & RAM are more than adequate for pretty much any file-storage
> needs these days...could always use more of both for computational
> stuff.


Re: reproducibility of results

2016-12-25 Thread Craig Sanders via luv-main
On Sun, Dec 25, 2016 at 04:56:05PM +1100, Paul van den Bergen wrote:
> Funny, I was asked about exactly the same problem when I started
> @WEHI... only there was no attempt made to even start tackling the
> problem...

yeah, we were constantly getting individual academics and research
groups asking us about storage, and then trying to do the best we could
with minimal resources.

the unfortunate fact is that disks/storage arrays and file-servers
and tape libraries etc are expensive. You can replace a very large
percentage of your up-front capital expense with skilled techs, who
are an on-going cost (you're going to need them to look after expensive
equipment anyway, and it has to be maintained & upgraded for 7+ years),
but it's still going to cost a lot for huge data storage, even if
you avoid over-priced name-brand gear.

> Virtualisation of workload makes the problem a lot easier to tackle,
> but even so... 7 years is a long time in IT...

cheap big disks help a lot too. but you need a lot of them, plus
backup - on-site and off-site.
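fwiw, if you're on zfs the on-site/off-site part is scriptable in a few
lines - a sketch only (pool and host names are placeholders, and in real
life you'd use incremental sends with -i):

    SNAP="nightly-$(date +%F)"
    zfs snapshot -r tank/research@"$SNAP"
    zfs send -R tank/research@"$SNAP" |
        ssh backup.example.edu zfs receive -Fdu backup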

CPU & RAM are more than adequate for pretty much any file-storage
needs these days...could always use more of both for computational
stuff.

craig

--
craig sanders 


Re: reproducibility of results

2016-12-24 Thread Paul van den Bergen via luv-main
Funny, I was asked about exactly the same problem when I started @WEHI...
only there was no attempt made to even start tackling the problem...

Virtualisation of workload makes the problem a lot easier to tackle, but
even so... 7 years is a long time in IT...


Re: reproducibility of results

2016-12-23 Thread Craig Sanders via luv-main
On Sat, Dec 24, 2016 at 12:51:01AM +1100, russ...@coker.com.au wrote:

> https://github.com/docker/docker/issues/28705
> https://lwn.net/Articles/446528/

thanks, i'll have to read those later today.


> So the kernel command-line option might be the best option for etch.

possibly.  worth a try, anyway.


> Also the Jessie kernel will have security support for another couple
> of years.  There's no reason why you couldn't run a Stretch host with
> a Jessie kernel to support Etch docker images if that was necessary.
> Of course that would make ZFS kernel module support even more exciting
> than it already is...

I'd rather just run an etch VM. Or if i really needed etch in a container
(or multiple containers) for some reason then a jessie VM running docker
or similar...that way i'm only running the old stuff for the things that
actually need it.
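keeping an old release bootable is cheap, too. something like this is all
it takes (image name invented; etch predates virtio, so it's emulated IDE
and e1000):

    qemu-system-x86_64 -enable-kvm -m 512 \
        -drive file=etch-reproduce.qcow2,format=qcow2,if=ide \
        -net nic,model=e1000 -net user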



> > > But I can imagine a situation where part of the tool-chain for
> > > scientific computing had a bug that was only fixed in a new
> > > upstream release that required a new compiler.
> >
> > that's one of the advantages of VMs, you can keep old software
> > alive indefinitely...and that works very nicely with the kind of
> > stuff I was doing at Nectar with openstack - basic idea was to
> > let researchers start up VMs or even entire HPC clusters of VMs
> > (e.g. a controller-node VM and a bunch of compute-node VMs, plus
> > the required private networking, scripting, configuration, etc) as
> > needed for their computational tasks.
>
> That doesn't even solve the problem.

it does for reproducing results from most scientific computing software.
in fact, a VM is about the only way to guarantee the exact same software
environment for old software running on an old OS (you still have to be
careful about the underlying hardware - Intel and AMD, for example, have
slightly different quirks and bugs...and Intel has or had a habit of
crippling code compiled with their compilers if it detects at run time that
it's running on a non-Intel CPU).

To reproduce results from an old version of, say, Gaussian (a very popular
commercial computational chemistry program), all i need to do is build and
keep a VM image that runs it. if/when i ever need to, i just fire up the VM.

How is that not solving the problem?

The requirement is to reproduce the same results from the same data using
the same program (bugs and all), not to get the old software running on a
newer, updated linux distro or to re-process the old data with a new improved
version(*). What the old version runs on is pretty much irrelevant, as long as
it runs and produces exactly the same results. the point of the exercise is
academic integrity, being able to prove that you didn't make up your results.
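that proof can even be made mechanical: checksum the inputs and outputs at
publication time, then re-check after a re-run inside the archived VM. a
sketch (file layout invented, and it assumes the outputs are deterministic
- logs with embedded timestamps would need filtering first):

    sha256sum input/*.dat output/*.log > published.sha256
    # years later, after re-running the job in the archived VM:
    sha256sum -c published.sha256    # every line must say OK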

if you can't generate exactly the same results, then that's a problem - even
if the new results are better or more accurate.

(*) If updated, bug-fixed results are required, then that's a completely
separate issue - and material for a new paper or a separately published
correction.




BTW, VMs also solve the problem for other desktop apps that only run in an old
version of your distro. ditto for windows apps. For example, i've got a Win
XP app(**) that won't run in Win 7.  It will run partially in WINE (almost
everything works except print and export), but runs fine in a Win XP VM.

(**) GURPS GURU, a GURPS RPG character generator - merely freeware, not Free
Software, so source code is not available... i haven't even been able to track
down the author's current contact details; he seems to have vanished off the
net sometime in the mid-2000s.


> Wheezy lost support for Chromium due to compiler issues.  Kmail in
> Jessie never worked correctly for large mailboxes.  So for me a basic
> workstation (lots of mail and web browsing) wasn't functional on
> either of those releases.  I ended up running Jessie and then Unstable
> with a Wheezy image under systemd-nspawn for email which meant that I
> couldn't click on a link to launch a browser (not without more hackery
> than I had time for).

Scientific computing has little or nothing to do with chromium, other
browsers, or most other desktop stuff. If there's a GUI at all, it's usually
just some kind of front-end app to generate control or data files for
text-mode programs (often run on multiple nodes of a cluster), and/or to
visualise or post-process the results.
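a typical job is a control file fed to a text-mode binary via the cluster's
batch system - something like this (an invented, PBS-flavoured example):

    cat > benzene-opt.pbs <<'EOF'
    #PBS -l nodes=4:ppn=8,walltime=48:00:00
    #PBS -N benzene-opt
    cd "$PBS_O_WORKDIR"
    g09 < benzene.com > benzene.log
    EOF
    qsub benzene-opt.pbs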

> It's easy to imagine a similar chain of events breaking a scientific
> workflow.

i think you don't know what a scientific computing workflow actually is. it's
not like desktop app stuff (they have macs or windows or linux PCs for that),
or even like systems geekery. it's a different kind of usage entirely.

scientific computing jobs tend to be batch jobs, not interactive. and can take
anywhere from minutes to hours, days, months, sometimes even years to run to
completion even on large clusters (which are shared resources running other
jobs for you and for other people at the same time too).  You submit the job

Re: reproducibility of results

2016-12-23 Thread Russell Coker via luv-main
On Friday, 23 December 2016 10:50:04 PM AEDT Craig Sanders via luv-main wrote:
> On Fri, Dec 23, 2016 at 09:41:30PM +1100, russ...@coker.com.au wrote:
> > Debian/Unstable has a new version of GCC that has deprecated a lot of the
> > older STL interfaces.  It also has a kernel that won't work with the amd64
> > libc from Wheezy.
> 
> yeah, i know. i had to build frankenwheezy (mwahahahah! gasp in horror
> at wheezy with libc6 from jessie grafted on) for some docker images a
> while back:
> 
>  
> http://blog.taz.net.au/2016/09/16/frankenwheezy-keeping-wheezy-alive-on-a-container-host-running-libc6-2-24/
> 
> AFAICT at the time, the incompatibility was with the later libc6 on
> my sid docker host, not necessarily with the kernel...IIRC it started
> failing after I upgraded libc6 on the host, without upgrading the kernel
> or rebooting.

https://github.com/docker/docker/issues/28705

Above is a Docker issue about this.  It's the vsyscall interface.  Apparently
you can use a kernel command-line option to enable the compatibility (while
weakening security).
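For reference, the option in question is vsyscall=emulate on the host
kernel's command line, e.g.:

    # /etc/default/grub on the Docker host
    GRUB_CMDLINE_LINUX_DEFAULT="quiet vsyscall=emulate"
    # then run update-grub and reboot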

I never even tested that kernel option; I prefer to avoid things that reduce
security, and a franken-wheezy setup isn't difficult to arrange.  Even
extracting the glibc Debian package with ar and tar (because I couldn't run
dpkg in the chroot) wasn't that difficult.
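The manual extraction amounts to something like this (the chroot path is
just an example; older debs ship data.tar.gz rather than data.tar.xz):

    ar x libc6_*_amd64.deb data.tar.xz
    tar -C /var/lib/machines/wheezy -xJf data.tar.xz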

https://lwn.net/Articles/446528/

Above is the explanation for the security issue.  Note that it's from 2011.  
We have had a lot of time to deal with this.  It seems that having libc in 
Wheezy depend on vsyscall was a mistake.

> > It should be possible to change the Wheezy libc to the newer amd64
> > system call interface without changing much, and using kvm or Xen is a
> > possibility too.
> 
> yep, wheezy still runs in a VM. as does etch (i also tried the same kind
> of libc6 etc upgrades on etch to get it working in docker on sid but
> couldn't get it working. gave up on that and created a VM instead. so
> etch-based images will stop working in docker when stretch is released)

Debian is designed to have everything run with the libc from the next release;
there's no other way upgrades can work.  It isn't designed to have a libc from
two releases later.  So the kernel command-line option might be the best option
for etch.

Also the Jessie kernel will have security support for another couple of years.  
There's no reason why you couldn't run a Stretch host with a Jessie kernel to 
support Etch docker images if that was necessary.  Of course that would make 
ZFS kernel module support even more exciting than it already is...

> > But I can imagine a situation where part of the tool-chain for
> > scientific computing had a bug that was only fixed in a new upstream
> > release that required a new compiler.
> 
> that's one of the advantages of VMs, you can keep old software alive
> indefinitely...and that works very nicely with the kind of stuff I was
> doing at Nectar with openstack - basic idea was to let researchers start
> up VMs or even entire HPC clusters of VMs (e.g. a controller-node VM
> and a bunch of compute-node VMs, plus the required private networking,
> scripting, configuration, etc) as needed for their computational tasks.

That doesn't even solve the problem.

Wheezy lost support for Chromium due to compiler issues.  Kmail in Jessie 
never worked correctly for large mailboxes.  So for me a basic workstation 
(lots of mail and web browsing) wasn't functional on either of those releases.  
I ended up running Jessie and then Unstable with a Wheezy image under
systemd-nspawn for email which meant that I couldn't click on a link to launch
a browser (not without more hackery than I had time for).

It's easy to imagine a similar chain of events breaking a scientific workflow.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/



Re: reproducibility of results

2016-12-23 Thread Craig Sanders via luv-main
On Fri, Dec 23, 2016 at 09:41:30PM +1100, russ...@coker.com.au wrote:
> Debian/Unstable has a new version of GCC that has deprecated a lot of the 
> older STL interfaces.  It also has a kernel that won't work with the amd64 
> libc from Wheezy.  

yeah, i know. i had to build frankenwheezy (mwahahahah! gasp in horror
at wheezy with libc6 from jessie grafted on) for some docker images a
while back:

  
http://blog.taz.net.au/2016/09/16/frankenwheezy-keeping-wheezy-alive-on-a-container-host-running-libc6-2-24/

AFAICT at the time, the incompatibility was with the later libc6 on
my sid docker host, not necessarily with the kernel...IIRC it started
failing after I upgraded libc6 on the host, without upgrading the kernel
or rebooting.

i don't know if further evidence invalidated my theory. either way,
upgrading wheezy's libc6 to jessie's libc6 solved the problem.
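the graft itself boils down to pointing apt at jessie just long enough to
pull in the newer libc. very roughly (the blog post has the actual
procedure; this is just the shape of it, run inside the wheezy container):

    echo "deb http://deb.debian.org/debian jessie main" >> /etc/apt/sources.list
    apt-get update
    apt-get install -y -t jessie libc6 libc-bin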


> It should be possible to change the Wheezy libc to the newer amd64
> system call interface without changing much, and using kvm or Xen is a
> possibility too.

yep, wheezy still runs in a VM. as does etch (i also tried the same kind
of libc6 etc upgrades on etch to get it working in docker on sid but
couldn't get it working. gave up on that and created a VM instead. so
etch-based images will stop working in docker when stretch is released)


> But I can imagine a situation where part of the tool-chain for
> scientific computing had a bug that was only fixed in a new upstream
> release that required a new compiler.

that's one of the advantages of VMs, you can keep old software alive
indefinitely...and that works very nicely with the kind of stuff I was
doing at Nectar with openstack - basic idea was to let researchers start
up VMs or even entire HPC clusters of VMs (e.g. a controller-node VM
and a bunch of compute-node VMs, plus the required private networking,
scripting, configuration, etc) as needed for their computational tasks.
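with the openstack CLI, the researcher-facing end of that is only a few
commands (image, flavor and network names are all invented):

    openstack server create --image chem-controller --flavor m1.large \
        --network project-net controller-0
    for i in 1 2 3 4; do
        openstack server create --image chem-compute --flavor m1.xlarge \
            --network project-net "compute-$i"
    done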

craig

--
craig sanders 


Re: reproducibility of results

2016-12-23 Thread Russell Coker via luv-main
On Friday, 23 December 2016 8:55:05 PM AEDT Craig Sanders via luv-main wrote:
> On Fri, Dec 23, 2016 at 08:22:45PM +1100, russ...@coker.com.au wrote:
> > I've heard a lot of scientific computing people talk about a desire to
> > reproduce calculations, but I haven't heard them talking about these
> > issues so I presume that they haven't got far in this regard.
> 
> it was a big issue when i was at unimelb (where i built a HPC cluster
> for the chemistry dept and later worked on the nectar research cloud).
> 
> depending on the funding source or the journal that papers were
> published in, raw data typically had to be stored for at least 7 or 12
> years, and the exact same software used to process it also had to be
> kept available and runnable (which was an ongoing problem, especially
> with some of the commercial software like gaussian...but even open
> source stuff is affected by bit-rot and also by CADT-syndrome. we had
> a source license for gaussian, but that didn't guarantee that we could
> even compile it with newer compilers.

Debian/Unstable has a new version of GCC that has deprecated a lot of the 
older STL interfaces.  It also has a kernel that won't work with the amd64 
libc from Wheezy.  These are upstream issues so other distributions may have 
dealt with them in some ways.

It should be possible to change the Wheezy libc to the newer amd64 system call
interface without changing much, and using kvm or Xen is a possibility too.
Also, compiling against the old STL isn't that hard to do, and Debian has good
support for multiple versions of GCC.
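Debian's parallel GCC packages make the old-STL case straightforward
(version numbers here are only an example):

    apt-get install g++-4.9
    g++-4.9 -std=gnu++98 -o old-tool old-tool.cpp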

The STL isn't necessarily a trivial issue.  I recall that Wheezy stopped 
getting security support for Chromium because upstream (Google) decided to 
just make new releases which depended on new C++ features that weren't in the 
Wheezy version of GCC.  Supporting old versions of software is the usual 
requirement and that's usually a lot easier.  But I can imagine a situation 
where part of the tool-chain for scientific computing had a bug that was only 
fixed in a new upstream release that required a new compiler.

Speaking for myself, I'm having enough trouble making the software I'm
responsible for work on all the newer versions of compilers etc.  I don't give
much thought to backwards compatibility.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/



reproducibility of results

2016-12-23 Thread Craig Sanders via luv-main
On Fri, Dec 23, 2016 at 08:22:45PM +1100, russ...@coker.com.au wrote:

> I've heard a lot of scientific computing people talk about a desire to
> reproduce calculations, but I haven't heard them talking about these
> issues so I presume that they haven't got far in this regard.

it was a big issue when i was at unimelb (where i built a HPC cluster
for the chemistry dept and later worked on the nectar research cloud).

depending on the funding source or the journal that papers were
published in, raw data typically had to be stored for at least 7 or 12
years, and the exact same software used to process it also had to be
kept available and runnable (which was an ongoing problem, especially
with some of the commercial software like gaussian...but even open
source stuff is affected by bit-rot and also by CADT-syndrome. we had
a source license for gaussian, but that didn't guarantee that we could
even compile it with newer compilers. it might have changed now, but
iirc it would only compile with a specific intel fortran compiler.
numerous efforts to compile it with gfortran ended in failure)

and some of the data sets that had to be stored were huge - dozens or
hundreds of terabytes or more. and while it wasn't something i worked on
personally, i know that for some of the people working with, e.g., the
synchrotron, that's a relatively piddling quantity of data.

craig

--
craig sanders 