Hi Kenneth,

Kenneth Hoste <kenneth.ho...@ugent.be> writes:

> Hi Loris,
>
> On 29/08/2023 08:19, Loris Bennett wrote:
>> Hi,
>>
>> When I try to install 'dorado' via
>>
>>   dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb
>>
>> the tests stall at some point.  The process tree is as follows:
>>
>> ├─ /bin/bash /var/spool/slurmd/job14269886/slurm_script
>> │ └─ /usr/bin/python3.6 -m easybuild.main dorado-0.3.1-foss-2022a-CUDA-11.7.0.eb --robot --cuda-compute-capabilities=6.1,7.5 --buildpath=/dev/shm --tmpdir=/scratch/eb-bu
>> │   └─ /bin/bash -c export PYTHONPATH=/scratch/eb-build/eb-yoirmakd/tmpc14mrksg/lib/python3.10/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /trinity/shar
>> │     └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/distrib
>> │       ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v --subprocess
>> │       │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>> │       │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>> │       │ │ ...
>> │       │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>> │       │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c from multiprocessing.resource_tracker import main;main(30)
>> │       │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>> │       │ │ ├─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>> │       │ │ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v TestRPCPickler.test_case
>> │       │ └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_share_memory.py -v --subprocess
>> │       └─ /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/dist
>>
>> The problem seems to be that the following process hangs while
>> calling 'read':
>>
>>   Trace of process 95404 - /easybuild/software/Python/3.10.4-GCCcore-11.3.0/bin/python -s -c from multiprocessing.resource_tracker import main;main(30)
>>   strace: Process 95404 attached
>>   read(30,
>>
>> I have tried this twice and both times the installation has stopped
>> like this, so I assume it is not some temporary issue with the file
>> system.
>>
>> Does anyone have any ideas about what else I could look at?
>
> Not sure at first sight, but maybe it's similar to a problem I ran
> into with scipy recently, which boiled down to pytest-xdist getting
> stuck when running in cgroups (for example when running from a Slurm
> job), see https://github.com/pytest-dev/pytest-xdist/issues/658 .
>
> Doesn't look like it's exactly the same problem, but perhaps it gives
> you a push in the right direction...

Thanks for the tip.
However, even with --parallel=1 the test hangs at the same place - it
just takes longer to get there :-)

I guess I'll just try skipping the test for the moment.

Cheers,

Loris

> regards,
>
> Kenneth
>
>> Cheers,
>> Loris
>>
>

-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
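
A minimal sketch of one possible reason a read() like the one in the
strace output never returns, assuming descriptor 30 is the read end of a
pipe (this is an assumption; nothing in the thread confirms it). A read
on a pipe only sees end-of-file once every copy of the write end has been
closed, so a single leftover copy held by some other process is enough to
keep the reader waiting. The function name pipe_read_state and the use of
os.dup to stand in for a descriptor inherited by a child are purely
illustrative and not taken from the test suite:

```python
import os


def pipe_read_state(close_all_writers: bool) -> str:
    """Report what a read on an empty pipe would do: 'eof' or 'would block'."""
    read_fd, write_fd = os.pipe()

    # Hypothetical stand-in for a copy of the write end that another
    # process inherited (an assumption made for illustration only).
    extra_write_fd = os.dup(write_fd)
    os.close(write_fd)

    if close_all_writers:
        os.close(extra_write_fd)

    # Non-blocking mode so that this sketch itself can never hang.
    os.set_blocking(read_fd, False)
    try:
        data = os.read(read_fd, 1)
        # An empty result means end-of-file: no writer is left.
        state = "eof" if data == b"" else "data"
    except BlockingIOError:
        # A blocking read would sit here indefinitely.
        state = "would block"
    finally:
        os.close(read_fd)
        if not close_all_writers:
            os.close(extra_write_fd)
    return state


if __name__ == "__main__":
    print(pipe_read_state(close_all_writers=True))
    print(pipe_read_state(close_all_writers=False))
```

If this is what is happening here, listing which processes still hold the
other end of the pipe (for example via /proc/<pid>/fd) might show who is
keeping it open.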