Re: [isabelle-dev] Nontermination in HOL-Corec_Examples on Epyc

2018-06-25 Thread Lars Hupel
> I think it works properly now, see Isabelle/f0f83ce0badd + AFP/f7e9efc65d82.

Thanks for that. I'll proceed with the systems integration process of
that machine now.

> I usually measure the CPU time, multiply it by 2, and take the ceiling
> towards the next multiple of 300. In principle this maintenance of
> options could be automated.

The "multiple of 300" part is now implemented in "afp_check_roots"
(AFP/006f6826c3a1).

Cheers
Lars
___
isabelle-dev mailing list
isabelle-...@in.tum.de
https://mailmanbroy.informatik.tu-muenchen.de/mailman/listinfo/isabelle-dev


Re: [isabelle-dev] Nontermination in HOL-Corec_Examples on Epyc

2018-06-25 Thread Makarius
On 22/06/18 16:33, Lars Hupel wrote:
>> But there are still (spurious) issues with some AFP sessions. With high
>> contention (a total of 64 worker threads at the same time, e.g. with
>> threads=2 and -j 32), some sessions will run into suspicious timeouts:
>>
>> Timing Median_Of_Medians_Selection (2 threads, 25.039s elapsed time,
>> 44.468s cpu time, 2.451s GC time, factor 1.78)
>> *** Timeout
>> Median_Of_Medians_Selection FAILED
> 
> Here are the offending sessions:
> 
> Dict_Construction
> Hidden_Markov_Models
> Median_Of_Medians_Selection
> Mason_Stothers
> 
> They fail spuriously (except for the last one, which fails reproducibly)
> when running with a total of 64 or even 128 worker threads:
> 
> ./isabelle/bin/isabelle build -cbv -a -d afp-devel/thys/ -X slow -o
> threads=2 -j 32
> 
> ./isabelle/bin/isabelle build -cbv -a -d afp-devel/thys/ -X slow -o
> threads=4 -j 32
> 
> None of this happens when using a total of just 32 worker threads.
> 
> Makarius, David, this probably requires another round of analysis.
> 
> Isabelle/b6e48841d0a5
> AFP/00e13b87d199

I think it works properly now, see Isabelle/f0f83ce0badd + AFP/f7e9efc65d82.

The failing sessions above had very small timeout, fitting to a high-end
laptop of the authors, but unfitting to a huge server with many other
things running. E.g. see now

changeset:   9452:6ab2cc180aae
user:wenzelm
date:Sun Jun 24 11:25:19 2018 +0200
files:   thys/AODV/ROOT thys/Automatic_Refinement/ROOT
thys/Buffons_Needle/ROOT thys/Category3/ROOT thys/Dict_Construction/ROOT
thys/Error_Function/ROOT thys/Hidden_Markov_Models/ROOT
thys/LLL_Factorization/ROOT
description:
more uniform timeout: discretization for 300 (5min);


I usually measure the CPU time, multiply it by 2, and take the ceiling
towards the next multiple of 300. In principle this maintenance of
options could be automated.


Also note that option -b causes a lot of file-system traffic, which I
suspect does not work so well with many parallel processes. (This is an
old-fashioned HD, not an SSD.)

There were other file-system oriented crashes due to the new
Export.export in batch build, but I have made this conditional and
disabled by default:

changeset:   68491:f0f83ce0badd
tag: tip
user:wenzelm
date:Sun Jun 24 22:13:23 2018 +0200
files:   etc/options src/Pure/Thy/thy_info.ML src/Pure/Tools/dump.scala
description:
disable export_document by default (presently unused and for
demo/testing purposes): avoid spurious IO exception in highly parallel
environment;


Here is an earlier change that improves the overall scheduling of
long-running sessions:

changeset:   68486:6984a55f3cba
user:wenzelm
date:Sat Jun 23 14:23:53 2018 +0200
files:   src/Pure/Tools/build.scala
description:
clarified queue ordering: take session descendants into account, notably
for "slow" AFP sessions;


All this together gives fairly good robustness and performance as follows:

  ML_PLATFORM="x86_64-linux"
  ML_HOME="/home/isabelle-temp/makarius/x86_64-linux"
  ML_SYSTEM="polyml-5.7.1"
  ML_OPTIONS="--minheap 3G --maxheap 30G"

isabelle build -j8 -o threads=8 -a -X slow -d afp-devel/thys -N -f
0:58:59 elapsed time, 28:11:51 cpu time, factor 28.68
0:59:18 elapsed time, 28:14:49 cpu time, factor 28.58

isabelle build -j8 -o threads=8 -a -X very_slow -d afp-devel/thys -N -f
Finished AODV (1:17:55 elapsed time, 8:01:27 cpu time, factor 6.18)
1:23:40 elapsed time, 46:19:04 cpu time, factor 33.21


The "/home/isabelle-temp/makarius/x86_64-linux" above refers to a recent
repository version of Poly/ML built on Ubuntu 18.04 LTS: that was
essentially an accident, and should have been official polyml-5.7.1-6.

It would be also interesting to test this with x86-linux, but it
requires the Ubuntu package lib32stdc++6.


Note that the above 8 * 8 arrangement follows the structure of the
hardware (according to numactl -H): it is a cluster of 8-core nodes,
with a memory access penalty of factor 1.6 or 3.2 for threads sitting on
a distant node. The isabelle build option -N avoids that.

Here are some variations that burn more CPU cycles without really
improving the performance:

isabelle build -j16 -o threads=6 -a -X slow -d afp-devel/thys -N -f
0:58:00 elapsed time, 33:37:20 cpu time, factor 34.77

isabelle build -j32 -o threads=6 -a -X slow -d afp-devel/thys -N -b -f
1:08:28 elapsed time, 48:51:32 cpu time, factor 42.81

isabelle build -j64 -o threads=1 -a -X slow -d afp-devel/thys
2:05:31 elapsed time, 28:49:42 cpu time, factor 13.78
1:59:28 elapsed time, 28:56:00 cpu time, factor 14.53


The last one demonstrates where we would be without the great
multithreading provided by Poly/ML.


Summary: this is a quite good machine, which manages to push the
standard Isabelle + AFP quasi-interactive test below 1h, but the ideal
of 45min (the "Paris commuter constant") is not met yet. It probably
requires smarter scheduling in isabelle build to achieve that.

The Isabelle + AFP slow test is 

Re: [isabelle-dev] Nontermination in HOL-Corec_Examples on Epyc

2018-06-24 Thread Makarius
On 23/06/18 19:16, Makarius wrote:
> On 22/06/18 16:33, Lars Hupel wrote:
>>
>> They fail spuriously (except for the last one, which fails reproducibly)
>> when running with a total of 64 or even 128 worker threads:
>>
>> None of this happens when using a total of just 32 worker threads.
>>
>> Makarius, David, this probably requires another round of analysis.
>>
>> Isabelle/b6e48841d0a5
>> AFP/00e13b87d199
> 
> I am still in the process of exploring the situation (with new and old
> hardware).

Which HD product is actually /dev/sda on the new hardware?

I don't have root access to query it e.g. via "sudo lshw -class disk
-class storage".


Makarius
___
isabelle-dev mailing list
isabelle-...@in.tum.de
https://mailmanbroy.informatik.tu-muenchen.de/mailman/listinfo/isabelle-dev


Re: [isabelle-dev] Nontermination in HOL-Corec_Examples on Epyc

2018-06-23 Thread Makarius
On 22/06/18 16:33, Lars Hupel wrote:
> 
> They fail spuriously (except for the last one, which fails reproducibly)
> when running with a total of 64 or even 128 worker threads:
> 
> None of this happens when using a total of just 32 worker threads.
> 
> Makarius, David, this probably requires another round of analysis.
> 
> Isabelle/b6e48841d0a5
> AFP/00e13b87d199

I am still in the process of exploring the situation (with new and old
hardware).

So far I have merely improved the overall session scheduling:
Isabelle/6984a55f3cba.


Makarius

___
isabelle-dev mailing list
isabelle-...@in.tum.de
https://mailmanbroy.informatik.tu-muenchen.de/mailman/listinfo/isabelle-dev


Re: [isabelle-dev] Nontermination in HOL-Corec_Examples on Epyc

2018-06-22 Thread Lars Hupel
> But there are still (spurious) issues with some AFP sessions. With high
> contention (a total of 64 worker threads at the same time, e.g. with
> threads=2 and -j 32), some sessions will run into suspicious timeouts:
> 
> Timing Median_Of_Medians_Selection (2 threads, 25.039s elapsed time,
> 44.468s cpu time, 2.451s GC time, factor 1.78)
> *** Timeout
> Median_Of_Medians_Selection FAILED

Here are the offending sessions:

Dict_Construction
Hidden_Markov_Models
Median_Of_Medians_Selection
Mason_Stothers

They fail spuriously (except for the last one, which fails reproducibly)
when running with a total of 64 or even 128 worker threads:

./isabelle/bin/isabelle build -cbv -a -d afp-devel/thys/ -X slow -o
threads=2 -j 32

./isabelle/bin/isabelle build -cbv -a -d afp-devel/thys/ -X slow -o
threads=4 -j 32

None of this happens when using a total of just 32 worker threads.

Makarius, David, this probably requires another round of analysis.

Isabelle/b6e48841d0a5
AFP/00e13b87d199
___
isabelle-dev mailing list
isabelle-...@in.tum.de
https://mailmanbroy.informatik.tu-muenchen.de/mailman/listinfo/isabelle-dev


Re: [isabelle-dev] Nontermination in HOL-Corec_Examples on Epyc

2018-06-22 Thread Lars Hupel
>> So far it looks good, but we still need to test all of AFP.
> 
> I will do that today and report my findings.

The machine has been busy since yesterday running the distribution and
the AFP (except for slow) multiple times using different parameters.

The distribution is working fine.

But there are still (spurious) issues with some AFP sessions. With high
contention (a total of 64 worker threads at the same time, e.g. with
threads=2 and -j 32), some sessions will run into suspicious timeouts:

Timing Median_Of_Medians_Selection (2 threads, 25.039s elapsed time,
44.468s cpu time, 2.451s GC time, factor 1.78)
*** Timeout
Median_Of_Medians_Selection FAILED

The timeout for this session is 50 seconds. So I assume that this is
again a problem where the session itself runs through fine, but
something after that is not working well.

I'll send an update when I have more data points.
___
isabelle-dev mailing list
isabelle-...@in.tum.de
https://mailmanbroy.informatik.tu-muenchen.de/mailman/listinfo/isabelle-dev


Re: [isabelle-dev] Nontermination in HOL-Corec_Examples on Epyc

2018-06-21 Thread Lars Hupel

David has worked rather quickly and produced the following commit:
https://github.com/polyml/polyml/commit/86c52cbd8f6d

I have updated the polyml component accordingly in 
Isabelle/1b8457cc4de8.


So far it looks good, but we still need to test all of AFP.


I will do that today and report my findings.

Cheers
Lars
___
isabelle-dev mailing list
isabelle-...@in.tum.de
https://mailmanbroy.informatik.tu-muenchen.de/mailman/listinfo/isabelle-dev


Re: [isabelle-dev] Nontermination in HOL-Corec_Examples on Epyc

2018-06-20 Thread Makarius
On 18/06/18 22:38, Makarius wrote:
> On 18/06/18 21:30, Lars Hupel wrote:
>>
>> The precise invocation is:
>>
>> $ ./isabelle/bin/isabelle build -cbv -a -o threads=2 -j 16
>>
>> - The "poly" process gets stuck at 100% CPU load and keeps calling "mmap"
>> - When trying to build the nonterminating session without "-b", it
>> terminates
> 
> David Matthews is the only one who can address this.

David has worked rather quickly and produced the following commit:
https://github.com/polyml/polyml/commit/86c52cbd8f6d

I have updated the polyml component accordingly in Isabelle/1b8457cc4de8.

So far it looks good, but we still need to test all of AFP.


Makarius
___
isabelle-dev mailing list
isabelle-...@in.tum.de
https://mailmanbroy.informatik.tu-muenchen.de/mailman/listinfo/isabelle-dev


Re: [isabelle-dev] Nontermination in HOL-Corec_Examples on Epyc

2018-06-18 Thread Makarius
On 18/06/18 21:30, Lars Hupel wrote:
> 
> the good news is that we have just received new hardware (Dual Epyc 7601).
> 
> The bad news is that a current development snapshot
> (Isabelle/410818a69ee3) exhibits a problem on this hardware. In
> particular, the session HOL-Corec_Examples doesn't appear to terminate.
> 
> The precise invocation is:
> 
> $ ./isabelle/bin/isabelle build -cbv -a -o threads=2 -j 16
> 
> Isabelle2017 "build -bva" works fine. This qualifies the problem as a
> regression, I suppose.

Isabelle2017 still uses Poly/ML 5.6, but for the coming release we are
on 5.7.1, which is quite different in many respects.


> - The "poly" process gets stuck at 100% CPU load and keeps calling "mmap"
> - When trying to build the nonterminating session without "-b", it
> terminates
> - GDB tells me the following stack trace when "mmap" is called (which
> corroborates that it happens during heap dumping):
> 
> #0  0x7f8dab80ba13 in __GI___mmap64 (addr=0x0, len=32768, prot=7,
> flags=34, fd=-1, offset=0) at ../sysdeps/unix/sysv/linux/mmap64.c:52
> #1  0x00879d56 in OSMem::Allocate(unsigned long&, unsigned int) ()
> #2  0x00871c6e in MemMgr::NewExportSpace(unsigned long, bool,
> bool, bool) ()
> #3  0x0085f020 in CopyScan::ScanAddressAt(PolyWord*) ()
> #4  0x0088c2b6 in
> ScanAddress::ScanAddressesInObject(PolyObject*, unsigned long) ()
> ...
> #24 0x0088c335 in
> ScanAddress::ScanAddressesInObject(PolyObject*, unsigned long) ()
> #25 0x0088c8c1 in ScanAddress::ScanAddressesInRegion(PolyWord*,
> PolyWord*) ()
> #26 0x008890a1 in SaveRequest::Perform() ()
> #27 0x00882717 in Processes::BeginRootThread(PolyObject*) ()
> #28 0x00874a9c in polymain ()
> #29 0x7f8dab711b97 in __libc_start_main (main=0x405720 ,
> argc=16, argv=0x7fffe693d4d8, init=, fini= out>, rtld_fini=, stack_end=0x7fffe693d4c8) at
> ../csu/libc-start.c:310
> #30 0x00405e01 in _start ()

David Matthews is the only one who can address this.

Maybe the problem is even related to the non-termination we've seen
recently on HOL-Proofs.


> I'm open to any suggestions for how to diagnose this. Happy to give out
> SSH access to the machine.

Of course, I have also use for such a machine for testing Isabelle + AFP
as usual.


Makarius
___
isabelle-dev mailing list
isabelle-...@in.tum.de
https://mailmanbroy.informatik.tu-muenchen.de/mailman/listinfo/isabelle-dev