BTW I did the Valgrind run and there is nothing there (I don't have the affected MKL, but with either OpenBLAS or the Netlib LAPACK/BLAS there are no Valgrind defects at all in the Wien2k code, just some harmless leaked memory). So yeah, confirming this is definitely MKL.
Pavel

On Thu, 2021-08-19 at 06:56 -0500, Laurence Marks wrote:
> A suggestion: check your mkl version, as there is a mkl bug that was
> recently fixed, see
> https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Problem-with-LAPACK-subroutine-ZHEEVR-input-array-quot-isuppz/td-p/1150816
> _____
> Professor Laurence Marks
> "Research is to see what everybody else has seen, and to think what
> nobody else has thought", Albert Szent-Györgyi
> www.numis.northwestern.edu
>
> On Thu, Aug 19, 2021, 06:45 Peter Blaha <[email protected]> wrote:
> > I'm still on vacation, so I cannot test myself.
> >
> > However, I have experienced such problems before. It has to do with
> > multithreading (1 thread always works fine) and the mkl routine
> > zheevr.
> >
> > In my case I could fix the problem by enlarging the workspace beyond
> > what the routine calculates itself (see the comment in hmsec on line
> > 841).
> >
> > Right below, the workspace was enlarged by a factor of 10, which
> > fixed my problem. But I can easily envision that it might not be
> > enough in some other cases.
> >
> > An alternative is to switch back to zheevx (commented out in the
> > code).
> >
> > Peter Blaha
> >
> > On 18.08.2021 at 20:01, Pavel Ondračka wrote:
> > > Right, I think the reason deallocate is failing because the memory
> > > has been corrupted at some earlier point is quite clear; the only
> > > other option why it should crash would be that it was not
> > > allocated at all, which seems not to be the case here... The
> > > question is what corrupted the memory, and even more strange is
> > > why it works if we disable MKL multithreading?
> > >
> > > It could indeed be that we are doing something wrong.
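Peter's fallback of switching the solver from zheevr to zheevx is easy to sanity-check outside of Wien2k: scipy exposes both LAPACK drivers through the `driver` keyword of `scipy.linalg.eigh` (scipy >= 1.5). A minimal sketch of such a cross-check on a random Hermitian matrix; this is not the Wien2k code path (hmsec calls the Fortran routines directly), and the test matrix here is just a stand-in:

```python
# Compare the zheevr ('evr') and zheevx ('evx') LAPACK drivers on a
# random Hermitian matrix. With a healthy BLAS/LAPACK both must agree;
# a disagreement or crash under a threaded MKL would mirror the zheevr
# bug discussed in this thread.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(42)
n = 200
a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
h = (a + a.conj().T) / 2  # Hermitian by construction

w_evr = eigh(h, eigvals_only=True, driver="evr")  # zheevr code path
w_evx = eigh(h, eigvals_only=True, driver="evx")  # zheevx code path

print(np.allclose(w_evr, w_evx))
```

Run with a threaded MKL behind numpy/scipy, this doubles as a quick probe for the bug Laurence's Intel forum link describes.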
> > > I can imagine the memory could be corrupted in some BLAS call: if
> > > the number of columns/rows passed to the specific BLAS call is
> > > more than the actual size of the matrix, then this could easily
> > > happen (and the multithreading somehow influences the final value
> > > of the corrupted memory, and depending on the final value the
> > > deallocate could fail or pass somehow). This should be possible to
> > > diagnose with valgrind as suggested.
> > >
> > > Luis, can you upload the testcase somewhere, or recompile with
> > > debuginfo as suggested by Laurence earlier, run
> > > "valgrind --track-origins=yes lapwso lapwso.def" and send the
> > > output? Just be warned, there is a massive slowdown with valgrind
> > > (up to 100x) and the logfile can get very large.
> > >
> > > Best regards
> > > Pavel
> > >
> > >
> > > On Wed, 2021-08-18 at 12:10 -0500, Laurence Marks wrote:
> > > > Correction, I was looking at an older modules.F. It looks like
> > > > it should be
> > > >
> > > > DEALLOCATE(vect,stat=IV) ; if(IV .ne. 0)write(*,*)IV
> > > >
> > > >
> > > > On Wed, Aug 18, 2021 at 11:23 AM Laurence Marks
> > > > <[email protected]> wrote:
> > > > > I do wonder about this. I suggest editing module.F and
> > > > > changing lines 118 and 119 to
> > > > > DEALLOCATE(en,stat=Ien) ; if(Ien .ne. 0)write(*,*)'Err en ',Ien
> > > > > DEALLOCATE(vnorm,stat=Ivn) ; if(Ivn .ne. 0)write(*,*)'Err vnorm ',Ivn
> > > > >
> > > > > There is every chance that the bug is not in those lines, but
> > > > > somewhere completely different. SIGSEGV often means that the
> > > > > code has been overwritten, for instance arrays going out of
> > > > > bounds.
> > > > >
> > > > > You can also recompile with -g added (don't change other
> > > > > options), and/or -C. Sometimes this is better. Or use other
> > > > > things like debuggers or valgrind.
> > > > >
> > > > > On Wed, Aug 18, 2021 at 10:47 AM Pavel Ondračka
> > > > > <[email protected]> wrote:
> > > > > > I'm CCing the list back as the crash has now been diagnosed
> > > > > > as a likely MKL problem, see below for more details.
> > > > > >
> > > > > > > > So just to be clear, explicitly setting OMP_STACKSIZE=1g
> > > > > > > > does not help to solve the issue?
> > > > > > >
> > > > > > > Right! OMP_STACKSIZE=1g with OMP_NUM_THREADS=4 does not
> > > > > > > solve the problem!
> > > > > > >
> > > > > > > > The problem is that the OpenMP code in lapwso is very
> > > > > > > > simple, so I'm having problems seeing how it could be
> > > > > > > > causing the problems.
> > > > > > > >
> > > > > > > > Could you also try to see what happens if you run with:
> > > > > > > > OMP_NUM_THREADS=1
> > > > > > > > MKL_NUM_THREADS=4
> > > > > > >
> > > > > > > It does not work with these values, but I checked and it
> > > > > > > works with them reverted:
> > > > > > > OMP_NUM_THREADS=4
> > > > > > > MKL_NUM_THREADS=1
> > > > > >
> > > > > > This was very helpful and IMO points to a problem with MKL
> > > > > > instead of Wien2k.
> > > > > >
> > > > > > Unfortunately, setting MKL_NUM_THREADS=1 globally will
> > > > > > reduce the OpenMP performance, mostly in lapw1 but also at
> > > > > > other places.
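The thread-count bisection above is easy to script. A self-contained sketch using a stand-in executable, since the real lapwso ships with Wien2k; only the mechanics of setting the environment per run are shown here:

```python
# Run a stand-in "lapwso" under the two environment combinations from
# the thread and report what the child process sees.
import os
import subprocess
import tempfile

stub = os.path.join(tempfile.mkdtemp(), "lapwso")
with open(stub, "w") as f:
    f.write('#!/bin/sh\necho "OMP=$OMP_NUM_THREADS MKL=$MKL_NUM_THREADS"\n')
os.chmod(stub, 0o755)

def run_case(omp, mkl):
    """Run the stub with explicit OpenMP/MKL thread counts."""
    env = dict(os.environ, OMP_NUM_THREADS=str(omp), MKL_NUM_THREADS=str(mkl))
    return subprocess.run([stub, "lapwso.def"], env=env,
                          capture_output=True, text=True).stdout.strip()

print(run_case(4, 1))  # the combination Luis reported to work
print(run_case(1, 4))  # the combination that crashed with the affected MKL
```

With the real binary in place of the stub, the same two calls reproduce the working and crashing configurations reported in the thread.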
> > > > > > So if you want to keep the OpenMP BLAS/LAPACK-level
> > > > > > parallelism, you have to either find some MKL version that
> > > > > > works (if you do, please report it here), link with OpenBLAS
> > > > > > (using it just for lapwso is enough), or create a simple
> > > > > > wrapper that sets MKL_NUM_THREADS=1 just for lapwso, i.e.,
> > > > > > rename the lapwso binary in WIENROOT to lapwso_bin and
> > > > > > create a new lapwso file there with:
> > > > > >
> > > > > > #!/bin/bash
> > > > > > MKL_NUM_THREADS=1 lapwso_bin $1
> > > > > >
> > > > > > and set it to executable with chmod +x lapwso.
> > > > > >
> > > > > > Or maybe MKL has a non-OpenMP version which you could link
> > > > > > with just for lapwso while using the standard one in other
> > > > > > parts, but dunno, I mostly use OpenBLAS. If you need some
> > > > > > further help, let me know.
> > > > > >
> > > > > > Reporting the issue to Intel could also be nice; however, I
> > > > > > never had any real luck there, and it is also a bit
> > > > > > problematic as you can't provide a testcase due to Wien2k
> > > > > > being proprietary code...
> > > > > >
> > > > > > Best regards
> > > > > > Pavel
> > > > > >
> > > > > > > > This should disable the Wien2k-specific OpenMP
> > > > > > > > parallelism but still keep the rest of the parallelism
> > > > > > > > at the BLAS/lapack level.
> > > > > > >
> > > > > > > So, perhaps, the problem is related to MKL!
> > > > > > >
> > > > > > > > Another option is that something is going wrong before
> > > > > > > > lapwso and the lapwso crash is just the symptom.
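Pavel's wrapper can be hardened slightly: `exec` the real binary and forward all arguments with `"$@"` instead of `$1`. A self-contained sketch that builds the whole setup in a scratch directory; in a real install, WIENROOT is the Wien2k directory and lapwso_bin is the renamed real executable rather than the stand-in used here:

```python
# Demonstrate the wrapper workaround end to end in a scratch directory.
import os
import subprocess
import tempfile

wienroot = tempfile.mkdtemp()  # stand-in for $WIENROOT

# Stand-in for the renamed real binary (mv $WIENROOT/lapwso lapwso_bin):
# it just reports the MKL thread setting it was started with.
with open(os.path.join(wienroot, "lapwso_bin"), "w") as f:
    f.write('#!/bin/sh\necho "MKL_NUM_THREADS=$MKL_NUM_THREADS args=$*"\n')

# The wrapper installed under the original name: pin MKL to one thread
# for this step only, then exec the real binary with all arguments.
with open(os.path.join(wienroot, "lapwso"), "w") as f:
    f.write('#!/bin/sh\n'
            'MKL_NUM_THREADS=1 exec "$(dirname "$0")/lapwso_bin" "$@"\n')

for name in ("lapwso", "lapwso_bin"):
    os.chmod(os.path.join(wienroot, name), 0o755)

out = subprocess.run([os.path.join(wienroot, "lapwso"), "lapwso.def"],
                     capture_output=True, text=True).stdout.strip()
print(out)
```

The `exec` keeps the Wien2k scripts talking directly to the real binary's process, and resolving lapwso_bin relative to the wrapper avoids PATH surprises.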
> > > > > > > > What happens if you run everything up to lapwso without
> > > > > > > > OpenMP (OMP_NUM_THREADS=1) and then enable it just for
> > > > > > > > lapwso?
> > > > > > >
> > > > > > > If I run lapw0 and lapw1 with OMP_NUM_THREADS=4 and then
> > > > > > > change it to 1 just before lapwso, it works.
> > > > > > > If I do the opposite, starting with OMP_NUM_THREADS=1 and
> > > > > > > then changing it to 4 just before lapwso, it does not
> > > > > > > work. So I believe that the problem is really in lapwso.
> > > > > > >
> > > > > > > If you need more information, please let me know!
> > > > > > > All the best,
> > > > > > > Luis
> > > > >
> > > > > --
> > > > > Professor Laurence Marks
> > > > > Department of Materials Science and Engineering
> > > > > Northwestern University
> > > > > http://www.numis.northwestern.edu
> > > > > "Research is to see what everybody else has seen, and to
> > > > > think what nobody else has thought" Albert Szent-Györgyi
_______________________________________________
Wien mailing list
[email protected]
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien

SEARCH the MAILING-LIST at:
http://www.mail-archive.com/[email protected]/index.html

