Re: [racket-users] Places code not using all the CPU
On 05/10/2018 19:23, Matthew Flatt wrote:
> We should certainly update the documentation with information about the
> limits of parallelism via places.

Added PR: https://github.com/racket/racket/pull/2304

-- 
Paulo Matos

-- 
You received this message because you are subscribed to the Google Groups "Racket Users" group. To unsubscribe from this group and stop receiving emails from it, send an email to racket-users+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: [racket-users] Places code not using all the CPU
On 08/10/2018 22:12, Philip McGrath wrote:
> This is much closer to the metal than where I usually spend my time,
> but, if it turns out that multiple OS processes is better than OS
> threads in this case, Distributed Places might provide an easier path to
> move to multiple processes than using `subprocess` directly:
> http://docs.racket-lang.org/distributed-places/index.html

Sam mentioned trying that yesterday, but I developed the loci library before I got around to trying them. Looking at the API, I can only say that at the moment my library is certainly easier to use on localhost. Once I get to implementing remote loci, I will look into distributed places and try to improve on them.

-- 
Paulo Matos
Re: [racket-users] Places code not using all the CPU
Hi all,

Apologies for the delay in sending this email, but I have been trying to implement and test an alternative and wanted to be sure it works before sending this off.

So, as Matthew suggested, this problem has to do with memory allocation. The --no-alloc option in Matthew's suggested snippet does not show the delay I usually see in the thread CPU usage, although thread creation is still quite slow past around 20 places.

I started developing loci [1] yesterday to solve this problem, and I got it to a point where I can prove that subprocesses solve the problem I am seeing. No point attaching a screenshot of htop with all bars at 100%... that's what happens. Also, process creation is almost instantaneous, and there's no delay compared to threads.

In the evening, after I had almost everything sorted, Sam suggested on Slack that I try distributed places and use them locally. I haven't tried this and cannot say whether it works better or worse, but it certainly seems harder to use than loci, as my library uses the same API as places.

Part of the development was pretty quick because I noticed Matthew had been playing with this before: https://github.com/racket/racket/blob/master/pkgs/racket-benchmarks/tests/racket/benchmarks/places/place-processes.rkt (it might be worth noting that the code doesn't work with current Racket).

I will be adding contracts, tests, and documentation throughout the week and then replace places with loci in my system so I can dog-food the library. The next step is to add remote loci, at which point I will compare with distributed places and possibly improve on them.

If anyone has comments, suggestions, or complaints about the library, please let me know, but keep in mind it's barely a day old.

Paulo Matos

1: https://github.com/LinkiTools/racket-loci
   https://pkgd.racket-lang.org/pkgn/search?q=loci

On 05/10/2018 19:23, Matthew Flatt wrote:
> At Fri, 5 Oct 2018 17:55:47 +0200, Paulo Matos wrote:
>> Matthew, Sam, do you understand why this is happening?
>
> I still think it's probably allocation, and probably specifically
> contention on the process's page table. Do you see different behavior with
> a non-allocating variant (via `--no-alloc` below)?
>
> We should certainly update the documentation with information about the
> limits of parallelism via places.
>
> #lang racket
>
> (define (go n alloc?)
>   (place/context p
>     (let ([v (vector (if alloc? 0.0 0))]
>           [inc (if alloc? 1.0 1)])
>       (let loop ([i 30])
>         (unless (zero? i)
>           (vector-set! v 0 (+ (vector-ref v 0) inc))
>           (loop (sub1 i))))
>       (printf "Place ~a done~n" n)
>       n)))
>
> (module+ main
>   (define alloc? #t)
>   (define cores
>     (command-line
>      #:once-each
>      [("--no-alloc") "Non-allocating variant" (set! alloc? #f)]
>      #:args (cores)
>      (string->number cores)))
>
>   (time
>    (map place-wait
>         (for/list ([i (in-range cores)])
>           (printf "Starting core ~a~n" i)
>           (go i alloc?)))))

-- 
Paulo Matos
Re: [racket-users] Places code not using all the CPU
I just confirmed that this is due to memory allocation locking in the kernel. If your places do no allocation, then all is fine.

Paulo Matos

On 08/10/2018 21:39, James Platt wrote:
> I wonder if this has anything to do with mitigations for Spectre, Meltdown, or
> the other speculative execution vulnerabilities that have been identified
> recently. I understand that some or all of the patches affect the
> performance of multi-CPU processing in general.
>
> James
Re: [racket-users] Places code not using all the CPU
This is much closer to the metal than where I usually spend my time, but, if it turns out that multiple OS processes is better than OS threads in this case, Distributed Places might provide an easier path to move to multiple processes than using `subprocess` directly: http://docs.racket-lang.org/distributed-places/index.html

On Mon, Oct 8, 2018 at 7:39 PM James Platt wrote:
> I wonder if this has anything to do with mitigations for Spectre, Meltdown
> or the other speculative execution vulnerabilities that have been
> identified recently. I understand that some or all of the patches affect
> the performance of multi-CPU processing in general.
>
> James
Re: [racket-users] Places code not using all the CPU
I wonder if this has anything to do with mitigations for Spectre, Meltdown, or the other speculative execution vulnerabilities that have been identified recently. I understand that some or all of the patches affect the performance of multi-CPU processing in general.

James
Re: [racket-users] Places code not using all the CPU
On 10/5/2018 10:32 AM, Matthew Flatt wrote:
> At Fri, 5 Oct 2018 15:36:04 +0200, Paulo Matos wrote:
>> Again, I am really surprised that you mention that places are not
>> separate processes. Documentation does say they are separate racket
>> virtual machines; how is this accomplished if not by using separate
>> processes?
>
> Each place is an OS thread within the Racket process. The virtual
> machine is essentially instantiated once in each thread, where things
> that look like global variables at the C level are actually
> thread-local variables to make them place-specific. Still, there is
> some sharing among the threads.
>
>> My workers are really doing Z3-style work - number crunching and lots of
>> searching. No IO (writing to disk) or communication, so I would expect
>> them to really max out all CPUs.
>
> My best guess is that it's memory-allocation bottlenecks, probably at
> the point of using mmap() and mprotect(). Maybe things don't scale well
> beyond the 4-core machines that I use.
>
> On my machines, the enclosed program can max out CPU use with system
> time being a small fraction. It scales ok from 1 to 4 places (i.e.,
> real time increased only some). The machine's cores are hyperthreaded,
> and the example maxes out CPU utilization at 8 --- but it takes twice
> as long in real time, so the hardware threads don't help much in this
> case. Running two processes with 4 places takes about the same real
> time as running one process with 8 places, as does 2 processes with 2
> places.
>
> Do you see similar effects, or does this little example stop scaling
> before the number of processes matches the number of cores?

As Matthew said, this may be a case where multiple processes are better.

One thing that is likely vastly different between your two systems is the memory architecture. On Paulo's many-core machine, each group of [probably] 6 CPUs will have its own physical bank of memory which is close to it and which it uses preferentially. Access to a different bank may be very costly. Paulo's machine may be spending a much greater percentage of time moving data between VM instances that are located in different memory regions ... something Matthew can't see on his quad-core.

Paulo, you might take a look at how memory is being allocated [not sure what tools you have for this] and see what happens if you restrict the process to running on various groups of CPUs. It may be that some banks of your memory are "closer" than others.

Hope this helps,
George
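George's suggestion of restricting the process to particular groups of CPUs can be tried from the shell. This is only a sketch: `numactl` must be installed, the node and CPU numbers are machine-specific, and `p.rkt 8` stands in for the actual workload from earlier in the thread.

```shell
# Show the NUMA topology: which CPUs belong to which memory node.
numactl --hardware

# Run the benchmark pinned to the CPUs and memory of NUMA node 0,
# so all allocation stays in the "close" bank for those CPUs.
numactl --cpunodebind=0 --membind=0 racket p.rkt 8

# Alternatively, pin to an explicit CPU list with taskset
# (CPU numbering varies; check lscpu or numactl --hardware first).
taskset -c 0-7 racket p.rkt 8
```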
Re: [racket-users] Places code not using all the CPU
> if not I will have to redesign my system to use 'subprocess'

Expanding on this, for students on the list...

Having many worker host processes is not necessarily a bad thing. It can be more programmer work, but it simplifies the parallelism in a way (e.g., "let the Linux kernel worry about it" :), and it potentially gives you better isolation and resilience for some kinds of defects (in native code used via FFI, in Racket code, and even in the suspiciously sturdy Racket VM/backend).

If appropriate for your application, you can also consider a worker pool with a health metric: sometimes reusing workers to avoid process startup times, sometimes retiring them, perhaps benching workers for an induced big GC if that makes sense compared to retiring/unpooling, and maybe quarantining workers for debugging/dumps while keeping the system running. You can also spread your workers across multiple hosts, not just CPUs/cores. You can even use the worker pool to introduce new changes to a running system (being very rapid, or as an additional mechanism beyond normal testing for production), and do A/B performance/correctness testing of changes, and change rollback.

If the data to be communicated to/from a worker is relatively small and won't be a bottleneck, you can simply push it through the stdin and stdout of each process; otherwise, you can get judicious/clever with the many available host OS mechanisms.

(Students: Being able to get our hands dirty and engineer systems beyond a framework, when necessary, is one of the reasons we get CS/SE/EE/CE degrees and broad experience, rather than only collect a binder full of Certified Currently-Popular JS Framework Technician certs. Those oppressive student loans, and/or years of self-guided open source experience, might not be in vain. :)
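The stdin/stdout approach described above can be sketched in Racket. This is a minimal illustration, not a real worker pool: the worker here is a hypothetical inline `racket -e` program that doubles a number, standing in for an actual worker module, and it spawns one fresh OS process per job rather than reusing pooled workers.

```racket
#lang racket
;; Spawn a worker OS process and exchange one s-expression over its
;; stdin/stdout. The inline "-e" worker that doubles its input is a
;; hypothetical stand-in for a real worker program.
(define (run-worker job)
  (define-values (proc from-worker to-worker err)
    (subprocess #f #f 'stdout            ; pipes for stdout/stdin; stderr merged
                (find-executable-path "racket")
                "-e" "(write (* 2 (read)))"))
  (write job to-worker)                  ; send the job on the worker's stdin
  (close-output-port to-worker)          ; signal end of input
  (define result (read from-worker))     ; read the reply from its stdout
  (subprocess-wait proc)                 ; reap the worker process
  result)

(module+ main
  ;; Run a handful of jobs, one worker process per job.
  (for ([n (in-range 4)])
    (printf "worker(~a) => ~a~n" n (run-worker n))))
```

A real pool would keep the pipes open and loop jobs through long-lived workers instead of paying process startup per job.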
Re: [racket-users] Places code not using all the CPU
At Fri, 5 Oct 2018 17:55:47 +0200, Paulo Matos wrote:
> Matthew, Sam, do you understand why this is happening?

I still think it's probably allocation, and probably specifically contention on the process's page table. Do you see different behavior with a non-allocating variant (via `--no-alloc` below)?

We should certainly update the documentation with information about the limits of parallelism via places.

#lang racket

(define (go n alloc?)
  (place/context p
    (let ([v (vector (if alloc? 0.0 0))]
          [inc (if alloc? 1.0 1)])
      (let loop ([i 30])
        (unless (zero? i)
          (vector-set! v 0 (+ (vector-ref v 0) inc))
          (loop (sub1 i))))
      (printf "Place ~a done~n" n)
      n)))

(module+ main
  (define alloc? #t)
  (define cores
    (command-line
     #:once-each
     [("--no-alloc") "Non-allocating variant" (set! alloc? #f)]
     #:args (cores)
     (string->number cores)))

  (time
   (map place-wait
        (for/list ([i (in-range cores)])
          (printf "Starting core ~a~n" i)
          (go i alloc?)))))
Re: [racket-users] Places code not using all the CPU
I was trying to create a much more elaborate example when Matthew sent his tiny one, which is enough to show the problem. I started a 64-core machine on AWS to show the issue. I see a massive degradation as the number of places increases. I use this slightly modified code:

#lang racket

(define (go n)
  (place/context p
    (let ([v (vector 0.0)])
      (let loop ([i 30])
        (unless (zero? i)
          (vector-set! v 0 (+ (vector-ref v 0) 1.0))
          (loop (sub1 i))))
      (printf "Place ~a done~n" n)
      n)))

(module+ main
  (define cores
    (command-line
     #:args (cores)
     (string->number cores)))

  (time
   (map place-wait
        (for/list ([i (in-range cores)])
          (printf "Starting core ~a~n" i)
          (go i)))))

Here are the results in the video (might take a few minutes until it is live): https://youtu.be/cDe_KF6nmJM

The guide says about places: "The place form creates a place, which is effectively a new Racket instance that can run in parallel to other places, including the initial place." I think this is misleading at the moment. If this behaviour can be 'fixed', then great; if not, I will have to redesign my system to use 'subprocess' to start another racket process, and a footnote should be added to places in the documentation to alert users to this behaviour.

Matthew, Sam, do you understand why this is happening?

On 05/10/2018 16:51, Sam Tobin-Hochstadt wrote:
> I tried this same program on my desktop, which also has 4 (i7-4770)
> cores with hyperthreading. Here's what I see:
>
> [samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 1
> N: 1, cpu: 5808/5808.0, real: 5804
> [samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 2
> N: 2, cpu: 12057/6028.5, real: 6063
> [samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 3
> N: 3, cpu: 23377/7792., real: 7914
> [samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 4
> N: 4, cpu: 41155/10288.75, real: 10357
> [samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 6
> N: 6, cpu: 89932/14988., real: 15687
> [samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 8
> N: 8, cpu: 165152/20644.0, real: 21104
>
> Real time goes up about 80% from 1-4 places, and then doubles again
> from 4 to 8. System time for 8 places is also about 10x what it is for
> 2 places, but only gets up to 2 seconds.
>
> On Fri, Oct 5, 2018 at 10:32 AM Matthew Flatt wrote:
>>
>> At Fri, 5 Oct 2018 15:36:04 +0200, Paulo Matos wrote:
>>> Again, I am really surprised that you mention that places are not
>>> separate processes. Documentation does say they are separate racket
>>> virtual machines; how is this accomplished if not by using separate
>>> processes?
>>
>> Each place is an OS thread within the Racket process. The virtual
>> machine is essentially instantiated once in each thread, where things
>> that look like global variables at the C level are actually
>> thread-local variables to make them place-specific. Still, there is
>> some sharing among the threads.
>>
>>> My workers are really doing Z3-style work - number crunching and lots of
>>> searching. No IO (writing to disk) or communication, so I would expect
>>> them to really max out all CPUs.
>>
>> My best guess is that it's memory-allocation bottlenecks, probably at
>> the point of using mmap() and mprotect(). Maybe things don't scale well
>> beyond the 4-core machines that I use.
>>
>> On my machines, the enclosed program can max out CPU use with system
>> time being a small fraction. It scales ok from 1 to 4 places (i.e.,
>> real time increased only some). The machine's cores are hyperthreaded,
>> and the example maxes out CPU utilization at 8 --- but it takes twice
>> as long in real time, so the hardware threads don't help much in this
>> case. Running two processes with 4 places takes about the same real
>> time as running one process with 8 places, as does 2 processes with 2
>> places.
>>
>> Do you see similar effects, or does this little example stop scaling
>> before the number of processes matches the number of cores?

-- 
Paulo Matos
Re: [racket-users] Places code not using all the CPU
I tried this same program on my desktop, which also has 4 (i7-4770) cores with hyperthreading. Here's what I see:

[samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 1
N: 1, cpu: 5808/5808.0, real: 5804
[samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 2
N: 2, cpu: 12057/6028.5, real: 6063
[samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 3
N: 3, cpu: 23377/7792., real: 7914
[samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 4
N: 4, cpu: 41155/10288.75, real: 10357
[samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 6
N: 6, cpu: 89932/14988., real: 15687
[samth@huor:~/work/grant_parallel_compilers/nsf_submissions (master) plt] time r ~/Downloads/p.rkt 8
N: 8, cpu: 165152/20644.0, real: 21104

Real time goes up about 80% from 1-4 places, and then doubles again from 4 to 8. System time for 8 places is also about 10x what it is for 2 places, but only gets up to 2 seconds.

On Fri, Oct 5, 2018 at 10:32 AM Matthew Flatt wrote:
>
> At Fri, 5 Oct 2018 15:36:04 +0200, Paulo Matos wrote:
>> Again, I am really surprised that you mention that places are not
>> separate processes. Documentation does say they are separate racket
>> virtual machines; how is this accomplished if not by using separate
>> processes?
>
> Each place is an OS thread within the Racket process. The virtual
> machine is essentially instantiated once in each thread, where things
> that look like global variables at the C level are actually
> thread-local variables to make them place-specific. Still, there is
> some sharing among the threads.
>
>> My workers are really doing Z3-style work - number crunching and lots of
>> searching. No IO (writing to disk) or communication, so I would expect
>> them to really max out all CPUs.
>
> My best guess is that it's memory-allocation bottlenecks, probably at
> the point of using mmap() and mprotect(). Maybe things don't scale well
> beyond the 4-core machines that I use.
>
> On my machines, the enclosed program can max out CPU use with system
> time being a small fraction. It scales ok from 1 to 4 places (i.e.,
> real time increased only some). The machine's cores are hyperthreaded,
> and the example maxes out CPU utilization at 8 --- but it takes twice
> as long in real time, so the hardware threads don't help much in this
> case. Running two processes with 4 places takes about the same real
> time as running one process with 8 places, as does 2 processes with 2
> places.
>
> Do you see similar effects, or does this little example stop scaling
> before the number of processes matches the number of cores?
Re: [racket-users] Places code not using all the CPU
At Fri, 5 Oct 2018 15:36:04 +0200, Paulo Matos wrote:
> Again, I am really surprised that you mention that places are not
> separate processes. Documentation does say they are separate racket
> virtual machines; how is this accomplished if not by using separate
> processes?

Each place is an OS thread within the Racket process. The virtual machine is essentially instantiated once in each thread, where things that look like global variables at the C level are actually thread-local variables to make them place-specific. Still, there is some sharing among the threads.

> My workers are really doing Z3-style work - number crunching and lots of
> searching. No IO (writing to disk) or communication, so I would expect
> them to really max out all CPUs.

My best guess is that it's memory-allocation bottlenecks, probably at the point of using mmap() and mprotect(). Maybe things don't scale well beyond the 4-core machines that I use.

On my machines, the enclosed program can max out CPU use with system time being a small fraction. It scales ok from 1 to 4 places (i.e., real time increased only some). The machine's cores are hyperthreaded, and the example maxes out CPU utilization at 8 --- but it takes twice as long in real time, so the hardware threads don't help much in this case. Running two processes with 4 places takes about the same real time as running one process with 8 places, as does 2 processes with 2 places.

Do you see similar effects, or does this little example stop scaling before the number of processes matches the number of cores?

[attachment: p.rkt]
Re: [racket-users] Places code not using all the CPU
On 05/10/2018 14:15, Matthew Flatt wrote:
> It's difficult to be sure from your description, but it sounds like the
> problem may just be the usual one of scaling parallelism when
> communication is involved.

Matthew, thanks for the reply. The interesting thing here is that there is no communication between places _most of the time_. It works as a star topology where every worker communicates only with the master, and the master with all workers. This communication is relatively rare - a message is sent every few minutes.

> Red is probably synchronization. It might be synchronization due to the
> communication you have between places, it might be synchronization on
> Racket's internal data structures, or it might be that the OS has to
> synchronize actions from multiple places within the same process (e.g.,
> multiple places are allocating and calling OS functions like mmap and
> mprotect, which the OS has to synchronize within a process). We've
> tried to minimize sharing among places, and it's important that they
> can GC independently, but there are still various forms of sharing to
> manage internally. In contrast, running separate processes for Z3
> should scale well, especially if the Z3 task is compute-intensive with
> minimal I/O --- a best-case scenario for the OS.

So, here you have pointed out something that's surprising to me: "OS has to synchronize actions from multiple places within the same process (e.g., multiple places are allocating and calling OS functions like mmap and mprotect, which the OS has to synchronize within a process)." I thought each place was its own process, similar to issuing a call to racket itself on the body of the place. Now it seems places are all in the same process... in which case they'll probably even share mutexes, although these low-level details are a bit foggy in my mind.

> A parallel `raco setup` runs into similar issues. In recent development
> builds, you might experiment with passing `--processes` to `raco setup`
> to have it use separate processes instead of places within a single OS
> process, but I think you'll still find that it tops out well below your
> machine's compute capacity. Partly, dependencies constrain parallelism.
> Partly, the processes have to communicate more and there's a lot of
> I/O.

Again, I am really surprised that you mention that places are not separate processes. Documentation does say they are separate racket virtual machines; how is this accomplished if not by using separate processes?

My workers are really doing Z3-style work - number crunching and lots of searching. No IO (writing to disk) or communication, so I would expect them to really max out all CPUs.

-- 
Paulo Matos
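The `--processes` experiment Matthew mentions can be invoked as below. A sketch only: per the message above, the flag existed in recent development builds of Racket at the time of writing, and the worker count of 8 is an arbitrary example.

```shell
# Default parallel setup: workers are places (OS threads) inside one
# Racket process.
raco setup -j 8

# Variant suggested above: separate OS processes instead of places,
# for comparing kernel-level contention between the two modes.
raco setup -j 8 --processes
```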
Re: [racket-users] Places code not using all the CPU
It's difficult to be sure from your description, but it sounds like the problem may just be the usual one of scaling parallelism when communication is involved.

Red is probably synchronization. It might be synchronization due to the communication you have between places, it might be synchronization on Racket's internal data structures, or it might be that the OS has to synchronize actions from multiple places within the same process (e.g., multiple places are allocating and calling OS functions like mmap and mprotect, which the OS has to synchronize within a process). We've tried to minimize sharing among places, and it's important that they can GC independently, but there are still various forms of sharing to manage internally. In contrast, running separate processes for Z3 should scale well, especially if the Z3 task is compute-intensive with minimal I/O --- a best-case scenario for the OS.

A parallel `raco setup` runs into similar issues. In recent development builds, you might experiment with passing `--processes` to `raco setup` to have it use separate processes instead of places within a single OS process, but I think you'll still find that it tops out well below your machine's compute capacity. Partly, dependencies constrain parallelism. Partly, the processes have to communicate more and there's a lot of I/O.

At Fri, 5 Oct 2018 11:43:36 +0200, "'Paulo Matos' via Racket Users" wrote:
> All,
>
> A quick update on this problem which is in my critical path.
> I just noticed, in an attempt to reproduce it, that during the package
> setup part of the racket compilation procedure the same happens.
>
> I am running `make CPUS=24 in-place` on a 36-CPU machine and I see that
> not only do the racket processes sometimes go from status 'R' to 'D' (which
> also happens in my case), the CPUs are never really working at 100%, with
> a lot of the work being done at kernel level.
>
> Has anyone ever noticed this?
>
> On 01/10/2018 11:13, 'Paulo Matos' via Racket Users wrote:
>> Hi,
>>
>> I am not sure this is an issue with places or what it could be, but my
>> devops-fu is poor and I am not even sure how to debug something like
>> this, so maybe someone with more knowledge than me on this might chime in
>> to hint at a possible debug method.
>>
>> I was running some benchmarks and noticed something odd for the first
>> time (although it doesn't mean it was ok before, just that this is the
>> first time I am actually analysing this issue).
>>
>> My program (the master) will create N places (the workers), and each
>> place will start by issuing a rosette call which will trigger a call to
>> the z3 smt solver. So, N instances of Z3 will run, and after that is done
>> it will run pure racket code that implements a graph search algorithm.
>> These N worker places are actually in a sync call waiting for messages
>> from the master, and the work is being done by a thread on the worker
>> place. The master is either waiting for the timeout to arrive or for a
>> solution to be sent from a worker.
>>
>> The interesting thing is that when the Z3 instances are running, I get
>> all my 16 CPUs (on a dedicated machine) working at 100%. When the racket
>> code is running the search, they are all holding off at around 60%-80%,
>> with a huge portion of it in the kernel (red bars in htop).
>>
>> Since the Z3 calls come before the threads inside the places are started
>> and we get to the sync call, is it possible something bad is happening
>> in the sync call that uses the kernel so much? Take a look at htop
>> during Z3 and during the search - screenshots attached.
>>
>> Are there any suggestions on what the problem might be or how I could
>> start to understand why the kernel is so active?
>>
>> Kind regards,
Re: [racket-users] Places code not using all the CPU
All,

A quick update on this problem which is in my critical path. I just noticed, in an attempt to reproduce it, that during the package setup part of the racket compilation procedure the same happens.

I am running `make CPUS=24 in-place` on a 36-CPU machine and I see that not only do the racket processes sometimes go from status 'R' to 'D' (which also happens in my case), the CPUs are never really working at 100%, with a lot of the work being done at kernel level.

Has anyone ever noticed this?

On 01/10/2018 11:13, 'Paulo Matos' via Racket Users wrote:
> Hi,
>
> I am not sure this is an issue with places or what it could be, but my
> devops-fu is poor and I am not even sure how to debug something like
> this, so maybe someone with more knowledge than me on this might chime in
> to hint at a possible debug method.
>
> I was running some benchmarks and noticed something odd for the first
> time (although it doesn't mean it was ok before, just that this is the
> first time I am actually analysing this issue).
>
> My program (the master) will create N places (the workers), and each
> place will start by issuing a rosette call which will trigger a call to
> the z3 smt solver. So, N instances of Z3 will run, and after that is done
> it will run pure racket code that implements a graph search algorithm.
> These N worker places are actually in a sync call waiting for messages
> from the master, and the work is being done by a thread on the worker
> place. The master is either waiting for the timeout to arrive or for a
> solution to be sent from a worker.
>
> The interesting thing is that when the Z3 instances are running, I get
> all my 16 CPUs (on a dedicated machine) working at 100%. When the racket
> code is running the search, they are all holding off at around 60%-80%,
> with a huge portion of it in the kernel (red bars in htop).
>
> Since the Z3 calls come before the threads inside the places are started
> and we get to the sync call, is it possible something bad is happening
> in the sync call that uses the kernel so much? Take a look at htop
> during Z3 and during the search - screenshots attached.
>
> Are there any suggestions on what the problem might be or how I could
> start to understand why the kernel is so active?
>
> Kind regards,

-- 
Paulo Matos
Re: [racket-users] Places code not using all the CPU
I attach yet another example where this behaviour is much more noticeable. This is on a 64-core dedicated machine on Amazon AWS.