Re: [easybuild] TensorFlow with GPU support.
Thank you! I do not think OpenMPI with CUDA support is particularly relevant for us. I will read the docs and try to understand what --minimal-toolchains does.

Thanks for your suggestions

Jakob

> On 26 Mar 2019, at 14:41, Jack Perdue wrote:
>
> Howdy Jakob,
>
> The primary difference between fosscuda and foss+CUDA is that fosscuda has an OpenMPI built with CUDA support, whereas the latter does not.
>
> We run with:
>
> EASYBUILD_MINIMAL_TOOLCHAINS
>
> which cuts down on the number of things that have to be rebuilt here. For example, for
>
> TensorFlow/1.10.1-fosscuda-2018b-Python-3.6.6
>
> we only had to rebuild these packages:
>
> Python/3.6.6-fosscuda-2018b
> protobuf-python/3.6.0-fosscuda-2018b-Python-3.6.6
> cuDNN/7.1.4.18-fosscuda-2018b
>
> so you might want to look at the --minimal-toolchains option.
>
>
> Jack Perdue
> Lead Systems Administrator
> High Performance Research Computing
> TAMU Division of Research
> j-per...@tamu.edu    http://hprc.tamu.edu
> HPRC Helpdesk: h...@hprc.tamu.edu
>
> On 3/26/19 8:26 AM, Jakob Schiøtz wrote:
>> Dear EasyBuilders,
>>
>> I would like to build a TensorFlow module supporting GPUs. Currently, that looks to be TensorFlow-1.12.0-fosscuda-2018b-Python-3.6.6.eb, but this requires building a new toolchain (fosscuda), including rebuilding both OpenMPI and Python with GPU support. In addition, any other software that the user may need alongside TensorFlow will also have to be rebuilt with the fosscuda toolchain to prevent mixing toolchains. That seems like overkill to me: after all, little if anything is gained by rebuilding Python and other packages with GPU support unless the scripts are actually going to use it. I don’t know if numpy will begin offloading computations to the GPU, but presumably moving data back and forth will be expensive, and TensorFlow will allocate all the GPU memory anyway.
>>
>> There used to be a TensorFlow variant for the normal toolchain which just depended on CUDA explicitly. Is there a reason not to do it this way? I guess I could try to make a .eb file inspired by the existing ones doing just that, or have I overlooked something?
>>
>> Best regards
>>
>> Jakob

--
Jakob Schiøtz, professor, Ph.D.
Department of Physics
Technical University of Denmark
DK-2800 Kongens Lyngby, Denmark
http://www.fysik.dtu.dk/~schiotz/
Re: [easybuild] TensorFlow with GPU support.
Howdy Jakob,

The primary difference between fosscuda and foss+CUDA is that fosscuda has an OpenMPI built with CUDA support, whereas the latter does not.

We run with:

EASYBUILD_MINIMAL_TOOLCHAINS

which cuts down on the number of things that have to be rebuilt here. For example, for

TensorFlow/1.10.1-fosscuda-2018b-Python-3.6.6

we only had to rebuild these packages:

Python/3.6.6-fosscuda-2018b
protobuf-python/3.6.0-fosscuda-2018b-Python-3.6.6
cuDNN/7.1.4.18-fosscuda-2018b

so you might want to look at the --minimal-toolchains option.

Jack Perdue
Lead Systems Administrator
High Performance Research Computing
TAMU Division of Research
j-per...@tamu.edu    http://hprc.tamu.edu
HPRC Helpdesk: h...@hprc.tamu.edu
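Jack's point about minimal toolchains can be illustrated with a toy model. The sketch below assumes the subtoolchain chain fosscuda → gompic → gcccuda → GCC → GCCcore and a hypothetical inventory of already-available software; it is a simplification, not EasyBuild's own resolution code. With minimal toolchains, each dependency is resolved to the most minimal subtoolchain for which it already exists, so only dependencies found nowhere in that chain need a new fosscuda build. (EASYBUILD_MINIMAL_TOOLCHAINS is simply the environment-variable form of --minimal-toolchains.)

```python
"""Toy model of why --minimal-toolchains reduces rebuilds (illustrative only)."""

# Assumed subtoolchain hierarchy for fosscuda/2018b, from full to most minimal.
HIERARCHY = ["fosscuda", "gompic", "gcccuda", "GCC", "GCCcore"]


def toolchain_for(dep, available, minimal=True):
    """Return (toolchain, already_available) for one dependency.

    `available` maps a dependency name to the set of toolchains it is
    already installed (or has an easyconfig) for.
    """
    have = available.get(dep, set())
    # With minimal toolchains, try the most minimal subtoolchain first;
    # otherwise only the top-level toolchain is considered.
    candidates = list(reversed(HIERARCHY)) if minimal else [HIERARCHY[0]]
    for tc in candidates:
        if tc in have:
            return tc, True
    # Nothing reusable: the dependency has to be built with the full toolchain.
    return HIERARCHY[0], False


def rebuild_list(deps, available, minimal=True):
    """Dependencies that cannot reuse an existing installation."""
    return [d for d in deps if not toolchain_for(d, available, minimal)[1]]


# Hypothetical inventory: zlib and SWIG exist at GCCcore level,
# Python only exists for the plain foss toolchain (not in the fosscuda chain).
deps = ["zlib", "SWIG", "Python", "cuDNN"]
available = {"zlib": {"GCCcore"}, "SWIG": {"GCCcore"}, "Python": {"foss"}}

print(rebuild_list(deps, available))                 # ['Python', 'cuDNN']
print(rebuild_list(deps, available, minimal=False))  # all four
```

The contrast between the two calls mirrors Jakob's concern: without minimal toolchains every dependency is tied to fosscuda, while with them only the packages that genuinely sit above GCCcore in this chain have to be rebuilt.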
Re: [easybuild] TensorFlow with GPU support.
On 08/01/2018 21:28, Jakob Schiøtz wrote:
> By the way, could the difference be due to the compiler (Intel versus foss)? That would be an unusually large difference, but my own MD code (ASAP) displays almost a factor two difference.

Which is which? Did you install the binary wheel on top of a Python built with foss or Intel?

That could certainly matter, but to be honest I would be very surprised if it's more than 10-20%. I saw a 10% performance loss for TF 1.4 built with intel/2017b vs foss/2017b (on top of Python 3.6.3) on Haswell (so the foss build was slightly faster).

regards,

Kenneth
Re: [easybuild] TensorFlow with GPU support.
> On 8 Jan 2018, at 20:27, Kenneth Hoste wrote:
>
> On 08/01/2018 15:48, Jakob Schiøtz wrote:
>> Hi Kenneth,
>>
>> I have now tested your TensorFlow 1.4.0 eb on our machines with a real-world script. It works, but it runs three times slower than with the prebuilt TensorFlow 1.2.1 :-(
>>
>> The prebuilt version complains that it was built without AVX2 etc., so I do not really understand why the version compiled from source is so much slower, assuming of course that there is not a factor three performance loss between 1.2.1 and 1.4.0, which seems unlikely.
>
> Wow, that must be wrong somehow...
>
> Is this on the GPU systems?
> You're not comparing a GPU-enabled TF 1.2 with a CPU-only TF 1.4 built with EB, are you?
> If you are, then being only a factor 3 slower using only the CPU is actually quite impressive vs a GPU-enabled build. ;-)

No, I am comparing non-GPU-enabled versions running on a machine without a GPU, so that is not the problem. I am running a custom script training one of my students’ models. I agree the result is suspicious, and I am rerunning it now (it is in the queue). I will try the benchmark you mentioned below as well, and report back, but it may be a few days…

By the way, could the difference be due to the compiler (Intel versus foss)? That would be an unusually large difference, but my own MD code (ASAP) displays almost a factor two difference.

Jakob

>
> How are you benchmarking this exactly?
> When I was trying with the script from https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks, I saw 7x better performance when building TF 1.4.0 from source on Intel Haswell (no GPU) compared to a conda install (which is basically the same as using the binary wheel).
> On a GPU system (NVIDIA K40) with the TF 1.4.0 binary wheel I saw another 8x performance increase over the EB-installed-from-source CPU-only TF 1.4.0 installation.
>
> Here's the command I was running (don't forget to change --device when running on a GPU system):
>
>     python tf_cnn_benchmarks.py --device cpu --batch_size=32 --model=resnet50 --variable_update=parameter_server --data_format NHWC
>
>
> regards,
>
> Kenneth
Re: [easybuild] TensorFlow with GPU support.
On 08/01/2018 15:48, Jakob Schiøtz wrote:
> Hi Kenneth,
>
> I have now tested your TensorFlow 1.4.0 eb on our machines with a real-world script. It works, but it runs three times slower than with the prebuilt TensorFlow 1.2.1 :-(
>
> The prebuilt version complains that it was built without AVX2 etc., so I do not really understand why the version compiled from source is so much slower, assuming of course that there is not a factor three performance loss between 1.2.1 and 1.4.0, which seems unlikely.

Wow, that must be wrong somehow...

Is this on the GPU systems?
You're not comparing a GPU-enabled TF 1.2 with a CPU-only TF 1.4 built with EB, are you?
If you are, then being only a factor 3 slower using only the CPU is actually quite impressive vs a GPU-enabled build. ;-)

How are you benchmarking this exactly?
When I was trying with the script from https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks, I saw 7x better performance when building TF 1.4.0 from source on Intel Haswell (no GPU) compared to a conda install (which is basically the same as using the binary wheel).
On a GPU system (NVIDIA K40) with the TF 1.4.0 binary wheel I saw another 8x performance increase over the EB-installed-from-source CPU-only TF 1.4.0 installation.

Here's the command I was running (don't forget to change --device when running on a GPU system):

    python tf_cnn_benchmarks.py --device cpu --batch_size=32 --model=resnet50 --variable_update=parameter_server --data_format NHWC

regards,

Kenneth
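To compare builds the way Kenneth suggests, it helps to reduce each benchmark run to a single throughput number. A minimal sketch, assuming the benchmark script ends its output with a line of the form "total images/sec: <number>" (check this against your copy of tf_cnn_benchmarks); the sample outputs in the usage lines are made up, not measurements.

```python
import re

# Assumption: the final summary line looks like "total images/sec: 123.45".
TOTAL_RE = re.compile(r"total images/sec:\s*([0-9]+(?:\.[0-9]+)?)")


def images_per_sec(output):
    """Extract the final throughput figure from one benchmark run's output."""
    matches = TOTAL_RE.findall(output)
    if not matches:
        raise ValueError("no 'total images/sec' line found")
    # Use the last match in case several runs were concatenated.
    return float(matches[-1])


def speedup(new_output, baseline_output):
    """Throughput ratio of a new build over a baseline (>1 means faster)."""
    return images_per_sec(new_output) / images_per_sec(baseline_output)


# Hypothetical outputs, for illustration only.
print(speedup("total images/sec: 4.50", "total images/sec: 1.50"))  # 3.0
```

Because such ratios multiply, Kenneth's two figures (7x for the source build over the wheel on CPU, then 8x for the GPU wheel over the source CPU build) suggest roughly 56x for the GPU wheel over the CPU wheel on those particular systems.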
Re: [easybuild] TensorFlow with GPU support.
On 05/01/2018 17:28, Jakob Schiøtz wrote:
> Does that mean that I can call eb with something like this
>
>     eb TensorFlow-1.4.0-foss-2017b-Python-3.6.3.eb -r --cuda_compute_capabilities=Tesla
>
> or something like that (I will not be able to test it until next week)? Or do I need to make a new easyconfig which sets that extra option somehow (and depends on CUDA and friends)?

The latter: cuda_compute_capabilities is a custom easyconfig parameter for TensorFlow, not a command line option.

Although you can try something like this to avoid having to copy & edit an easyconfig file yourself:

    eb TensorFlow.eb --try-amend=cuda_compute_capabilities=x,y

Do note that you'll need to add CUDA & cuDNN as dependencies when you want to enable GPU support.

The values you provide need to be known CUDA compute capabilities though, so something like '3.7' (see https://developer.nvidia.com/cuda-gpus).

regards,

Kenneth
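Kenneth's point that the values must be known compute capabilities (so '3.7', not a product family such as 'Tesla') can be checked before starting a long build. A minimal sketch: the set below is a partial list of real compute capabilities from NVIDIA's table (K20X and K40 are 3.5, K80 is 3.7), not an exhaustive one. It also assumes, without confirmation from this thread, that the values end up in TensorFlow's configure step as the comma-separated TF_CUDA_COMPUTE_CAPABILITIES variable.

```python
# Partial list of real CUDA compute capabilities
# (see https://developer.nvidia.com/cuda-gpus); extend it for your GPUs.
KNOWN_CAPABILITIES = {"3.0", "3.5", "3.7", "5.0", "5.2", "6.0", "6.1", "7.0"}


def is_known_capability(value):
    """True for a value like '3.7'; False for e.g. 'Tesla' (a product family)."""
    return value in KNOWN_CAPABILITIES


def tf_cuda_env(values):
    """Build a TF_CUDA_COMPUTE_CAPABILITIES setting for TensorFlow's configure.

    Assumption: cuda_compute_capabilities is passed on in this
    comma-separated form.
    """
    bad = [v for v in values if not is_known_capability(v)]
    if bad:
        raise ValueError("unknown CUDA compute capabilities: " + ", ".join(bad))
    return {"TF_CUDA_COMPUTE_CAPABILITIES": ",".join(values)}


print(tf_cuda_env(["3.5", "3.7"]))  # {'TF_CUDA_COMPUTE_CAPABILITIES': '3.5,3.7'}
```

Rejecting unknown values up front is cheaper than discovering them after Bazel has been running for an hour.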
Re: [easybuild] TensorFlow with GPU support.
Hi Jakob,

On 05/01/2018 16:10, Jakob Schiøtz wrote:
>
>> On 5 Jan 2018, at 15:18, Kenneth Hoste wrote:
>>
>> On 05/01/2018 14:13, Jakob Schiøtz wrote:
>>> Hi again,
>>>
>>> Yes, I have overlooked that; I just switched my repo to your branch and tried to build :-)
>>>
>>> Now I get an error when building TensorFlow. It is a 502 Bad Gateway, indicating that some server is down somewhere. But is it not a problem that the build process itself tries to download extra stuff in addition to the source files listed in the .eb file? At least it makes the checksum checking moot.
>>
>> That's indeed a problem, but one that is hard to avoid with TensorFlow, at least in a first iteration...
>>
>> Once we're happy with the current approach, a new target could be to get TensorFlow to build "offline".
>>
>> One step at a time though... ;-)
>
> It could be a showstopper for me, though. On our cluster, only two nodes have GPUs. With the binary build, I could only install TensorFlow on those, since although CUDA and friends are available on all the nodes, you can only load the resulting TensorFlow module on a machine with a GPU. Unfortunately, these two nodes are officially compute nodes, not login nodes, and that means that they are cut off from the Internet. So no downloading is possible on these. :-(
>
> So I have two questions:
>
> 1. What do we expect to gain by building from source instead of installing from the wheel?
>
> 2. Would it be OK to have a “-bin” variant installing from the binary distribution until we get these issues ironed out?

See my previous e-mail. ;-)

1. better performance (due to targeting the correct architecture) + compatibility with more OSs (e.g. CentOS 6)

2. yes

> In my second attempt, I managed to build with foss/2017b (obviously the server was up again). I have not really tested it yet (I am only just dabbling in TensorFlow, and my main application is crashing due to another problem).
>
> Do you want me to submit the new .eb file as a PR to your PR? Or should I just wait till your stuff has converged?

That should be a separate PR I think (I'm more concerned about complicating existing PRs than about having one more PR to deal with).

regards,

Kenneth

> /Jakob
>
>> regards,
>>
>> Kenneth
>>
>>> Best regards
>>>
>>> Jakob
>>>
>>> WARNING: The lower priority option '-c opt' does not override the previous value '--compilation_mode=opt'.
>>> WARNING: The lower priority option '-c opt' does not override the previous value '--compilation_mode=opt'.
>>> Downloading https://github.com/bazelbuild/rules_closure/archive/4af89ef1db659eb41f110df189b67d4cf14073e1.tar.gz via codeload.github.com: 40,240 bytes
>>> Downloading https://github.com/bazelbuild/rules_closure/archive/4af89ef1db659eb41f110df189b67d4cf14073e1.tar.gz via codeload.github.com: 205,436 bytes
>>> Loading package: tensorflow/tools/pip_package
>>> Loading package: @bazel_tools//tools/cpp
>>> Loading package: @local_jdk//
>>> Loading package: @local_config_cc//
>>> Loading complete. Analyzing...
>>> ERROR: /home/niflheim/schiotz/easybuild_experimental/sandybridge/build/TensorFlow/1.4.0/foss-2017b-Python-3.6.3/tensorflow-1.4.0/tensorflow/tools/pip_package/BUILD:139:1: error loading package 'tensorflow': Encountered error while reading extension file 'protobuf.bzl': no such package '@protobuf_archive//': java.io.IOException: Error downloading [http://mirror.bazel.build/github.com/google/protobuf/archive/b04e5cba356212e4e8c66c61bbe0c3a20537c5b9.tar.gz] to /tmp/eb-GpWEyg/tmpfJrPWS-bazel-build/external/protobuf_archive/b04e5cba356212e4e8c66c61bbe0c3a20537c5b9.tar.gz: GET returned 502 Bad Gateway and referenced by '//tensorflow/tools/pip_package:build_pip_package'.
>>> ERROR: /home/niflheim/schiotz/easybuild_experimental/sandybridge/build/TensorFlow/1.4.0/foss-2017b-Python-3.6.3/tensorflow-1.4.0/tensorflow/tools/pip_package/BUILD:139:1: error loading package 'tensorflow': Encountered error while reading extension file 'protobuf.bzl': no such package '@protobuf_archive//': java.io.IOException: Error downloading [http://mirror.bazel.build/github.com/google/protobuf/archive/b04e5cba356212e4e8c66c61bbe0c3a20537c5b9.tar.gz] to /tmp/eb-GpWEyg/tmpfJrPWS-bazel-build/external/protobuf_archive/b04e5cba356212e4e8c66c61bbe0c3a20537c5b9.tar.gz: GET returned 502 Bad Gateway and referenced by '//tensorflow/tools/pip_package:build_pip_package'.
>>> ERROR: Analysis of target '//tensorflow/tools/pip_package:build_pip_package' failed; build aborted: error loading package 'tensorflow': Encountered error while reading extension file 'protobuf.bzl': no such package '@protobuf_archive//': java.io.IOException: Error downloading [http://mirror.bazel.build/github.com/google/protobuf/archive/b04e5cba356212e4e8c66c61bbe0c3a20537c5b9.tar.gz] to /tmp/eb-GpWEyg/tmpfJrPWS-bazel-build/external/protobuf_archive/b04e5cba356212e4e8c66c61bbe0c3a20537c5b9.tar.gz: GET returned 502 Bad Gateway.
>>> Elapsed time: 6.561s
>>> (at easybuild/tools/run.py:481 in parse_cmd_output)
>>> ==
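A first step towards the "offline" build Kenneth mentions is knowing which external archives Bazel tries to fetch. A minimal sketch that pulls those URLs out of a build log like the one above, assuming only the two message formats seen in it ("Downloading <url> via <host>" and "Error downloading [<url>]"); other Bazel versions may word these messages differently.

```python
import re

# One pattern per message format observed in the log; group 1 or 2 holds the URL.
URL_RE = re.compile(r"Downloading (\S+) via |Error downloading \[([^\]]+)\]")


def fetched_urls(log_text):
    """Return the unique URLs a Bazel build tried to download, in first-seen order."""
    urls = []
    for match in URL_RE.finditer(log_text):
        url = match.group(1) or match.group(2)
        if url not in urls:
            urls.append(url)
    return urls
```

Run against the log above, this yields the rules_closure archive from GitHub and the protobuf archive whose download from mirror.bazel.build failed with the 502 error; those are the files that would have to be made available to a node without Internet access.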
Re: [easybuild] TensorFlow with GPU support.
Hi Kenneth,

I have now tested your TensorFlow 1.4.0 easyconfig on our machines with a real-world script. It works, but it runs three times slower than the prebuilt TensorFlow 1.2.1 :-( The prebuilt version complains that it was built without AVX2 etc., so I do not really understand why the version compiled from source is so much slower. That assumes, of course, that there is not a factor of three performance loss between 1.2.1 and 1.4.0, which seems unlikely.

Best regards
Jakob
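When a source build is unexpectedly slower than a wheel, one thing worth ruling out is which SIMD extensions the build node and the run node actually advertise. The build directory in the Bazel log quoted in this thread sits under a `sandybridge` path. If that reflects the CPU the source build targeted, note that Sandy Bridge has AVX but not AVX2 or FMA. The sketch below is hypothetical: it assumes a Linux node, where the kernel lists CPU features on the `flags` lines of `/proc/cpuinfo`, and the helper names are made up, not part of EasyBuild or TensorFlow.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Collect the feature flags listed on the 'flags' lines of /proc/cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def missing_simd(flags, wanted=("sse4_2", "avx", "avx2", "fma")):
    """Return the wanted SIMD extensions that are absent, in the order given."""
    return [name for name in wanted if name not in flags]


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # Linux-specific; absent on other systems
    if cpuinfo.exists():
        missing = missing_simd(cpu_flags(cpuinfo.read_text()))
        print("missing SIMD extensions:", ", ".join(missing) or "none")
```

Running this on both the build node and the node where the benchmark ran shows whether they differ in AVX2/FMA support.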
Re: [easybuild] TensorFlow with GPU support.
Hi again, Kenneth.

It turns out that I was wrong about the lack of Internet access from the compute nodes. In principle, nothing should stop me from testing a GPU build next week, except for my lack of knowledge :-)

I see this in the easyblock:

    def extra_options():
        extra_vars = {
            # see https://developer.nvidia.com/cuda-gpus
            'cuda_compute_capabilities': [[], "List of CUDA compute capabilities to build with", CUSTOM],
            'with_mkl_dnn': [True, "Make TensorFlow use Intel MKL-DNN", CUSTOM],
        }

Does that mean that I can call eb with something like

    eb TensorFlow-1.4.0-foss-2017b-Python-3.6.3.eb -r --cuda_compute_capabilities=Tesla

or something like that (I will not be able to test it until next week)? Or do I need to make a new easyconfig which sets that extra option somehow (and depends on CUDA and friends)?

Best regards
Jakob
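For reference, the NVIDIA page linked in the `cuda_compute_capabilities` comment lists numeric compute capabilities (for example 3.5 for the K20X and K40) rather than product families such as "Tesla". TensorFlow's own `configure` script reads these capabilities from the `TF_CUDA_COMPUTE_CAPABILITIES` environment variable as a comma-separated string. The helper below is a hypothetical sketch of how such a list could be validated and joined. It does not claim to show how the easyblock under discussion handles the option.

```python
import re

# A compute capability looks like "3.5", "6.0", "7.0", ...
_CC_PATTERN = re.compile(r"^\d+\.\d+$")


def tf_compute_capabilities(capabilities):
    """Validate CUDA compute capabilities and join them into the comma-separated
    form used by TF_CUDA_COMPUTE_CAPABILITIES (duplicates dropped, order kept)."""
    cleaned = []
    for cc in capabilities:
        cc = str(cc).strip()
        if not _CC_PATTERN.match(cc):
            raise ValueError(f"not a compute capability: {cc!r} (expected e.g. '3.5')")
        if cc not in cleaned:
            cleaned.append(cc)
    if not cleaned:
        raise ValueError("at least one compute capability is required")
    return ",".join(cleaned)
```

For a cluster with only K20X/K40 cards, `tf_compute_capabilities(['3.5'])` would be the relevant value. A name like "Tesla" is rejected instead of silently passed through.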
Re: [easybuild] TensorFlow with GPU support.
> On 5 Jan 2018, at 15:18, Kenneth Hoste wrote:
>
> That's indeed a problem, but one that is hard to avoid with TensorFlow, at least in a first iteration...
>
> Once we're happy with the current approach, a new target could be to get TensorFlow to build "offline".
>
> One step at a time though... ;-)

It could be a showstopper for me, though. On our cluster, only two nodes have GPUs. With the binary build, I could only install TensorFlow on those: although CUDA and friends are available on all the nodes, the resulting TensorFlow module can only be loaded on a machine with a GPU. Unfortunately, these two nodes are officially compute nodes, not login nodes, which means that they are cut off from the Internet. So no downloading is possible on them. :-(

So I have two questions:

1. What do we expect to gain by building from source instead of installing from the wheel?

2. Would it be OK to have a “-bin” variant installing from the binary distribution until we get these issues ironed out?

In my second attempt, I managed to build with foss/2017b (obviously the server was up again). I have not really tested it yet (I am only just dabbling in TensorFlow, and my main application is crashing due to another problem). Do you want me to submit the new .eb file as a PR to your PR? Or should I just wait until your work has converged?

/Jakob
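Suppose the extra archives Bazel tries to download are instead fetched on a machine with Internet access and copied to an offline GPU node. Their integrity can then at least be re-checked locally before the build. A minimal sketch follows. It assumes you record the expected SHA-256 of each file yourself. The file names and the `expected` mapping are hypothetical, and nothing here is an EasyBuild or Bazel feature.

```python
import hashlib
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archives(directory, expected):
    """Check files in `directory` against {filename: sha256_hex}.

    Returns a sorted list of human-readable problems; an empty list means all good.
    """
    problems = []
    for name, want in sorted(expected.items()):
        path = Path(directory) / name
        if not path.is_file():
            problems.append(f"{name}: missing")
        elif sha256sum(path) != want.lower():
            problems.append(f"{name}: checksum mismatch")
    return problems
```

This only restores the checksum guarantee for files you fetch deliberately. It does not stop Bazel itself from trying to download during the build.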
Re: [easybuild] TensorFlow with GPU support.
On 05/01/2018 14:13, Jakob Schiøtz wrote:
> Now I get an error when building TensorFlow. It is a 502 Bad Gateway, indicating that some server is down somewhere. But is it not a problem that the build process itself tried to download extra stuff in addition to the source files listed in the .eb file? At least it makes the checksum checking moot.

That's indeed a problem, but one that is hard to avoid with TensorFlow, at least in a first iteration...

Once we're happy with the current approach, a new target could be to get TensorFlow to build "offline".

One step at a time though... ;-)

regards,

Kenneth
Re: [easybuild] TensorFlow with GPU support.
Hi again,

Yes, I had overlooked that - I just switched my repo to your branch and tried to build :-)

Now I get an error when building TensorFlow. It is a 502 Bad Gateway, indicating that some server is down somewhere. But is it not a problem that the build process itself tried to download extra stuff in addition to the source files listed in the .eb file? At least it makes the checksum checking moot.

Best regards
Jakob

    WARNING: The lower priority option '-c opt' does not override the previous value '--compilation_mode=opt'.
    WARNING: The lower priority option '-c opt' does not override the previous value '--compilation_mode=opt'.
    Downloading https://github.com/bazelbuild/rules_closure/archive/4af89ef1db659eb41f110df189b67d4cf14073e1.tar.gz via codeload.github.com: 40,240 bytes
    Downloading https://github.com/bazelbuild/rules_closure/archive/4af89ef1db659eb41f110df189b67d4cf14073e1.tar.gz via codeload.github.com: 205,436 bytes
    Loading package: tensorflow/tools/pip_package
    Loading package: @bazel_tools//tools/cpp
    Loading package: @local_jdk//
    Loading package: @local_config_cc//
    Loading complete. Analyzing...
    ERROR: /home/niflheim/schiotz/easybuild_experimental/sandybridge/build/TensorFlow/1.4.0/foss-2017b-Python-3.6.3/tensorflow-1.4.0/tensorflow/tools/pip_package/BUILD:139:1: error loading package 'tensorflow': Encountered error while reading extension file 'protobuf.bzl': no such package '@protobuf_archive//': java.io.IOException: Error downloading [http://mirror.bazel.build/github.com/google/protobuf/archive/b04e5cba356212e4e8c66c61bbe0c3a20537c5b9.tar.gz] to /tmp/eb-GpWEyg/tmpfJrPWS-bazel-build/external/protobuf_archive/b04e5cba356212e4e8c66c61bbe0c3a20537c5b9.tar.gz: GET returned 502 Bad Gateway and referenced by '//tensorflow/tools/pip_package:build_pip_package'.
    ERROR: /home/niflheim/schiotz/easybuild_experimental/sandybridge/build/TensorFlow/1.4.0/foss-2017b-Python-3.6.3/tensorflow-1.4.0/tensorflow/tools/pip_package/BUILD:139:1: error loading package 'tensorflow': Encountered error while reading extension file 'protobuf.bzl': no such package '@protobuf_archive//': java.io.IOException: Error downloading [http://mirror.bazel.build/github.com/google/protobuf/archive/b04e5cba356212e4e8c66c61bbe0c3a20537c5b9.tar.gz] to /tmp/eb-GpWEyg/tmpfJrPWS-bazel-build/external/protobuf_archive/b04e5cba356212e4e8c66c61bbe0c3a20537c5b9.tar.gz: GET returned 502 Bad Gateway and referenced by '//tensorflow/tools/pip_package:build_pip_package'.
    ERROR: Analysis of target '//tensorflow/tools/pip_package:build_pip_package' failed; build aborted: error loading package 'tensorflow': Encountered error while reading extension file 'protobuf.bzl': no such package '@protobuf_archive//': java.io.IOException: Error downloading [http://mirror.bazel.build/github.com/google/protobuf/archive/b04e5cba356212e4e8c66c61bbe0c3a20537c5b9.tar.gz] to /tmp/eb-GpWEyg/tmpfJrPWS-bazel-build/external/protobuf_archive/b04e5cba356212e4e8c66c61bbe0c3a20537c5b9.tar.gz: GET returned 502 Bad Gateway.
    Elapsed time: 6.561s
    (at easybuild/tools/run.py:481 in parse_cmd_output)
    == 2018-01-05 14:07:30,582 easyblock.py:2685 WARNING build failed (first 300 chars): cmd "bazel --output_base=/tmp/eb-GpWEyg/tmpfJrPWS-bazel-build build --compilation_mode=opt --config=opt --subcommands --verbose_failures --config=mkl //tensorflow/tools/pip_package:build_pip_package" exited with exit code 1 and output:

> On 5 Jan 2018, at 13:50, Kenneth Hoste wrote:
>
> The patch files are available from https://github.com/easybuilders/easybuild-easyconfigs/pull/5318 (as mentioned in the description of the PR).
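The URLs Bazel failed to fetch are embedded in the `Error downloading [...]` messages of a log like the one above, repeated once per error. To collect them for a manual download, a small regex over the log text is enough. This is a hypothetical sketch that assumes the message wording shown in this log. Other Bazel versions may phrase it differently, and the comma-splitting for multiple mirrors is an assumption too.

```python
import re

# Matches the bracketed URL list in messages such as
# "java.io.IOException: Error downloading [http://...tar.gz] to /tmp/..."
_FAILED_DOWNLOAD = re.compile(r"Error downloading \[([^\]]+)\]")


def failed_download_urls(log_text):
    """Return the unique URLs from 'Error downloading [...]' messages, first-seen order."""
    urls = []
    for match in _FAILED_DOWNLOAD.finditer(log_text):
        # Assumption: if several mirrors are listed, they are comma-separated.
        for url in match.group(1).split(","):
            url = url.strip()
            if url and url not in urls:
                urls.append(url)
    return urls
```

On the log above, this yields the single `mirror.bazel.build` protobuf URL, even though it appears in three error messages.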
Re: [easybuild] TensorFlow with GPU support.
On 04/01/2018 16:37, Jakob Schiøtz wrote:
> Dear Kenneth, Pablo and Maxime,
>
> Thanks for your feedback. Yes, I will try to see if I can build from source, but I will focus on the foss toolchain since we use that one for our Python here (we do not have the Intel MPI license, and the iomkl toolchain could not build Python last time I tried).
>
> I assume the reason for building from source is to ensure consistent library versions etc. If that proves very difficult, could we perhaps in the interim have builds (with a -bin suffix?) using the prebuilt wheels?

The main reasons for building from source are performance and compatibility with the OS.

The binary wheels that are available for TensorFlow are not compatible with older OS versions like CentOS 6, as I experienced first-hand when trying to get it to work on an older (GPU) system. Since compiling from source with CUDA support didn't work yet, I had to resort to injecting a newer glibc version into the 'python' binary, which was not fun (well...).

For CPU-only installations, you really have no other option than building from source, since the binary wheels were not built with AVX2 instructions, for example, which leads to large performance losses (some quick benchmarking showed a 7x performance increase for TF 1.4 built with foss/2017b over the binary wheel).

For GPU installations, a similar concern arises, although it may be less severe there, depending on which CUDA compute capabilities the binary wheels were built with (I only tested the wheels on old systems with NVIDIA K20x/K40 GPUs, so there I doubt you'll get much performance increase when building from source).

If it turns out to be too difficult or time-consuming to get the build from source with CUDA support to work, then we can of course proceed with the binary wheel releases for now; I'm not going to oppose that.

regards,

Kenneth
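Whether a prebuilt wheel can run on an older OS such as CentOS 6 (which ships glibc 2.12) largely comes down to the system glibc. Python's standard `platform.libc_ver()` reports the C library the interpreter is linked against. This thread does not state the minimum glibc the TensorFlow wheels require, so `MIN_GLIBC = (2, 17)` below is an assumption to verify against the wheel you actually use. The helper names are hypothetical.

```python
import platform

MIN_GLIBC = (2, 17)  # ASSUMPTION: minimum glibc required by the wheel; verify for your wheel


def parse_version(text):
    """Turn a version string such as '2.12' into a tuple of ints, e.g. (2, 12)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_ok(version_text, minimum=MIN_GLIBC):
    """Return True if the glibc version string is at least `minimum`."""
    return parse_version(version_text) >= minimum


if __name__ == "__main__":
    lib, version = platform.libc_ver()  # ('', '') if glibc cannot be detected
    if lib == "glibc" and version:
        verdict = "OK" if glibc_ok(version) else "too old"
        print(f"glibc {version}: {verdict} for the assumed minimum {MIN_GLIBC}")
    else:
        print("could not determine the glibc version")
```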
Re: [easybuild] TensorFlow with GPU support.
Hi Jakob,

On 05/01/2018 13:19, Jakob Schiøtz wrote:
> Hi Kenneth,
>
> Is it possible that you forgot to check in the patches TensorFlow-1.4.0_swig-env.patch and TensorFlow-1.4.0_no-enum34.patch in your PR? Attempting to build TensorFlow fails because it cannot find these.

The patch files are available from https://github.com/easybuilders/easybuild-easyconfigs/pull/5318 (as mentioned in the description of the PR).

regards,

Kenneth
Re: [easybuild] TensorFlow with GPU support.
Hi Kenneth,

Is it possible that you forgot to check in the patches TensorFlow-1.4.0_swig-env.patch and TensorFlow-1.4.0_no-enum34.patch in your PR? Attempting to build TensorFlow fails because it cannot find these.

Best regards
Jakob

--
Jakob Schiøtz, professor, Ph.D.
Department of Physics
Technical University of Denmark
DK-2800 Kongens Lyngby, Denmark
http://www.fysik.dtu.dk/~schiotz/
Re: [easybuild] TensorFlow with GPU support.
On 18-01-04 04:23, Jakob Schiøtz wrote:
> Hi,
>
> I made a TensorFlow easyconfig a while ago depending on Python with the foss toolchain, including a variant with GPU support (PR 4904). The latter has not yet been merged, probably because it is annoying to have something that can only build on a machine with a GPU (it fails the sanity check otherwise, as TensorFlow with GPU support cannot load on a machine without one).
>
> Since I made that PR, two newer releases of TensorFlow have appeared (1.3 and 1.4).

You're actually missing 1.5, which just came out this morning. Built with AVX support, CUDA 9 and cuDNN 7.

Maxime
Re: [easybuild] TensorFlow with GPU support.
Dear Jakob,

On 04/01/2018 10:23, Jakob Schiøtz wrote:
> Hi,
>
> I made a TensorFlow easyconfig a while ago depending on Python with the
> foss toolchain, including a variant with GPU support (PR 4904). The
> latter has not yet been merged, probably because it is annoying to have
> something that can only build on a machine with a GPU (it fails the sanity
> check otherwise, as TensorFlow with GPU support cannot load on a machine
> without it).

Not being able to test this on a non-GPU system is a bit unfortunate, but that's not the reason it hasn't been merged yet; that's mostly due to a lack of time on my side to get back to it...

> Since I made that PR, two newer releases of TensorFlow have appeared (1.3
> and 1.4). There are easyconfigs for 1.3 with the Intel toolchain. I am
> considering making easyconfigs for TensorFlow 1.4 with
> Python-3.6.3-foss-2017b (both with and without GPU support), but first I
> would like to know if anybody else is doing this - it is my impression that
> somebody who actually knows what they are doing may be working on
> TensorFlow. :-)

I have spent quite a bit of time puzzling together an easyblock that supports building TensorFlow from source, see [1].

It already works for non-GPU installations (see [2] for example), but it's not entirely finished yet because:

* building from source with CUDA support does not work yet; the build fails with strange Bazel errors...

* there are some issues when the TensorFlow easyblock is used together with --use-ccache and the Intel compilers; because two compiler wrappers are used, they end up calling each other, resulting in a "fork bomb" style situation...

I would really like to get it finished and have easyconfigs available for TensorFlow 1.4 and newer where we properly build TensorFlow from source rather than using the binary wheels...

Are you up for giving it a try, and maybe helping out with the problems mentioned above?

regards,

Kenneth

[1] https://github.com/easybuilders/easybuild-easyblocks/pull/1287
[2] https://github.com/easybuilders/easybuild-easyconfigs/pull/5499

> Best regards
>
> Jakob
>
> --
> Jakob Schiøtz, professor, Ph.D.
> Department of Physics
> Technical University of Denmark
> DK-2800 Kongens Lyngby, Denmark
> http://www.fysik.dtu.dk/~schiotz/
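Kenneth's second bullet describes two compiler wrappers (the ccache one and the one used with the Intel compilers) that end up calling each other. Assuming, as is typical for such wrappers, that each one locates "the real compiler" by searching $PATH and skipping only its own directory, wrapper A finds wrapper B and B finds A, forever. One generic way to break that cycle, sketched below in Python as an illustration and not as what EasyBuild or ccache actually do, is for every wrapper to skip the directories of all wrapper layers. `find_real_compiler` and `self_dirs` are hypothetical names.

```python
import os


def find_real_compiler(name, path=None, self_dirs=()):
    """Return the first executable called `name` on the search path that does
    not live in any of `self_dirs` (the directories holding wrapper scripts),
    or None if there is no such executable.

    Skipping *every* wrapper directory, not just the caller's own, prevents
    two wrappers from resolving to each other in an endless loop.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    # Compare resolved paths so symlinked directories are still recognised.
    skip = {os.path.realpath(d) for d in self_dirs}
    for directory in path.split(os.pathsep):
        if not directory or os.path.realpath(directory) in skip:
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

With both wrapper directories passed in `self_dirs`, the lookup falls through to the real compiler; passing only one of them reproduces the problem, because the lookup then lands on the other wrapper.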
Re: [easybuild] TensorFlow with GPU support.
Hi Jakob,

I installed TensorFlow on my cluster a few days ago by modifying your easyconfigs. I have just submitted two PRs with the two easyconfigs I installed:

https://github.com/easybuilders/easybuild-easyconfigs/pull/5590
https://github.com/easybuilders/easybuild-easyconfigs/pull/5591

I used cuDNN 6.0 as a dependency instead of cuDNN 7.x because the provided .whl is linked against 6.0. If you try 7.x, you will get a ".so lib not found" error.

regards,
Pablo.

2018-01-04 10:23 GMT+01:00 Jakob Schiøtz:
> Hi,
>
> I made a TensorFlow easyconfig a while ago depending on Python with the
> foss toolchain, including a variant with GPU support (PR 4904). The
> latter has not yet been merged, probably because it is annoying to have
> something that can only build on a machine with a GPU (it fails the sanity
> check otherwise, as TensorFlow with GPU support cannot load on a machine
> without it).
>
> Since I made that PR, two newer releases of TensorFlow have appeared (1.3
> and 1.4). There are easyconfigs for 1.3 with the Intel toolchain. I am
> considering making easyconfigs for TensorFlow 1.4 with
> Python-3.6.3-foss-2017b (both with and without GPU support), but first I
> would like to know if anybody else is doing this - it is my impression that
> somebody who actually knows what they are doing may be working on
> TensorFlow. :-)
>
> Best regards
>
> Jakob
>
> --
> Jakob Schiøtz, professor, Ph.D.
> Department of Physics
> Technical University of Denmark
> DK-2800 Kongens Lyngby, Denmark
> http://www.fysik.dtu.dk/~schiotz/

--
Pablo Escobar López
HPC systems engineer
sciCORE, University of Basel
SIB Swiss Institute of Bioinformatics
http://scicore.unibas.ch
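Pablo's ".so lib not found" error comes from the prebuilt wheel being linked against a specific cuDNN major version: it asks the dynamic loader for `libcudnn.so.6`, which a cuDNN 7.x installation (providing `libcudnn.so.7`) does not supply. A minimal Python sketch for checking which of these sonames the loader can resolve in the current environment (for instance, after loading a cuDNN module) is shown below; the helper name is hypothetical, and the default list of major versions is just an assumption matching the versions discussed here.

```python
import ctypes


def loadable_cudnn_sonames(majors=(6, 7)):
    """Return the cuDNN major versions whose versioned shared library
    (libcudnn.so.<major>) the dynamic loader can find right now."""
    found = []
    for major in majors:
        soname = "libcudnn.so.%d" % major
        try:
            ctypes.CDLL(soname)
        except OSError:
            # Not on the loader's search path (e.g. LD_LIBRARY_PATH).
            continue
        found.append(major)
    return found


if __name__ == "__main__":
    found = loadable_cudnn_sonames()
    if 6 not in found:
        print("libcudnn.so.6 not found; a wheel linked against cuDNN 6 will fail to import")
    print("loadable cuDNN major versions:", found)
```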