Re: Threads: Time to get the terminology straight

2004-01-05 Thread Nigel Sandever

05/01/04 04:51:20, Sam Vilain [EMAIL PROTECTED] wrote:

On Mon, 05 Jan 2004 15:43, Nigel Sandever wrote;

   I accept that it may not be possible on all platforms, and it may
   be too expensive on some others. It may even be undesirable in the
   context of Parrot, but I have seen no argument that goes to
   invalidate the underlying premise.

I think you missed this:

LT Different VMs can run on different CPUs. Why should we make atomic
LT instructions out of these? We have a JIT runtime performing at 1
LT Parrot instruction per CPU instruction for native integers. Why
LT should we slow down that by a magnitude of many tenths?

LT We have to lock shared data, then you have to pay the penalty, but
LT not for each piece of code.
.
So far, I have only suggested using the mechanism in conjunction with
PMCs and PMC registers. 

You can't add an in-use flag to a native integer. But then, native integers
are not a part of the VHLLs (Perl/Python/Ruby). They are a constituent part 
of scalars, but they use a different register set and opcodes. Copying the
integer value of a scalar into an I register would require locking the scalar's
PMC. Once the value is in the I register, operations performed on it would
not need to be synchronised. Once the resultant is calculated, it needs to be 
moved back to the PMC and the lock cleared. There should be no need to 
interlock on most opcodes dealing with the I and N register sets.
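
As a rough C sketch of that load/operate/store cycle (C11 atomics are
used purely for brevity; the PMC layout and all names are invented for
illustration, not actual Parrot API):

  #include <stdatomic.h>

  /* Sketch only: the PMC layout, names and spin-wait are invented.
   * Initialise in_use with ATOMIC_FLAG_INIT. */
  typedef struct PMC {
      atomic_flag in_use;   /* the proposed per-PMC in-use bit */
      long        int_val;  /* the scalar's cached native integer */
  } PMC;

  /* The proposal would park the thread on the shared mutex rather
   * than spin; a spin keeps the sketch short. */
  static void pmc_lock(PMC *p)   { while (atomic_flag_test_and_set(&p->in_use)) ; }
  static void pmc_unlock(PMC *p) { atomic_flag_clear(&p->in_use); }

  /* Something like "add I2, P0, P1": only the PMC accesses interlock;
   * the arithmetic on the I registers needs no synchronisation. */
  long op_add(PMC *dest, PMC *a, PMC *b) {
      pmc_lock(a);    long i0 = a->int_val;  pmc_unlock(a);
      pmc_lock(b);    long i1 = b->int_val;  pmc_unlock(b);
      long i2 = i0 + i1;                     /* pure I-register work */
      pmc_lock(dest); dest->int_val = i2;    pmc_unlock(dest);
      return i2;
  }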

The S registers are a different kettle of fish, and I haven't worked through 
the implications for these. My gut feeling is that the C-style strings pointed 
at by S registers would be protected by the in-use flag set on the PMCs for
the scalars from which they are derived. 

This means that when a single PMC opcode results in a sequence of non-PMC
operations, other shared threads would be blocked from operating 
until the sequence of non-PMC ops in the first shared thread was complete.
But ONLY if they attempt access to the same PMC. 

If they are processing PMC or non-PMC operations that do not involve the 
in-use PMC, then they will not be blocked and will be scheduled for their 
timeslices in the normal way.


and this:

LT I think, that you are missing multiprocessor systems totally.


If the mutex mechanism that is used to block the shared threads
is SMP, NUMA, AMP etc. safe, then the mechanism I describe is also 
safe in these environments.

You are effectively excluding true parallelism by blocking other
processors from executing Parrot ops while one has the lock. 
.
The block only occurs *IF* concurrent operations on the same data
are attempted.

 You may
as well skip the thread libraries altogether and multi-thread the ops
in a runloop like Ruby does.

But let's carry the argument through, restricting it to UP systems,
with hyperthreading switched off, and running Win32.  Is it even true
that masking interrupts is enough on these systems?
.
No masking of interrupts is involved anywhere! 
I don't know where the idea arises, but it wasn't from me.


Win32 `Critical Sections' must be giving the scheduler hints not to
run other pending threads whilst a critical section is running.  Maybe
it uses the CPU sti/cli flags for that, to avoid the overhead of
setting a memory word somewhere (bad enough) or calling the system
(crippling).  In that case, setting STI/CLI might only incur a ~50%
performance penalty for integer operations.
.
I don't have access to the sources, but I do know that when one 
thread has entered a critical section, all other threads and processes
continue to be scheduled in the normal way, except those that also try
to enter the critical section. 

Scheduling is only disabled for those threads that *ask* to be, and
no others, whether within the same process or other processes. How the 
mechanism works I can only speculate, but no CLI/STI instructions are 
involved.

total speculation 
When the first thread enters the critsec, a flag is set in the 
critsec memory.

When a second thread attempts to enter the critsec, a flag is 
set in the corresponding scheduler table to indicate that it should 
not be scheduled again until the flag is cleared. 

When the first thread leaves the critsec, the flag in the critsec
memory is cleared and the flag(s) in the scheduler tables for any 
thread(s) blocking on the critsec are also cleared. 

Whichever of the blocked threads is next scheduled, it acquires the
critsec, sets the flag in the critsec memory, and the process repeats.

/total speculation

No masking of interrupts is involved.
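
For what it's worth, a mechanism along exactly those speculative lines
can be sketched against the public Win32 interlocked/event APIs. This
is still speculation: the struct and names here are invented, and the
real CRITICAL_SECTION internals are undocumented.

  #include <windows.h>

  /* Speculative sketch of a critical-section-like object: a flag word
   * plus an auto-reset event that only *contending* threads touch. */
  typedef struct {
      LONG   owners;   /* -1 = free; >= 0 = held, count of waiters */
      HANDLE wake;     /* event that blocked threads sleep on      */
  } CritSec;

  void cs_init(CritSec *cs) {
      cs->owners = -1;
      cs->wake   = CreateEvent(NULL, FALSE, FALSE, NULL); /* auto-reset */
  }

  void cs_enter(CritSec *cs) {
      /* Uncontended: the flag goes from -1 to 0, no kernel call at all.
       * Contended: only *this* thread is descheduled until woken; every
       * other thread in the system is scheduled as normal. */
      if (InterlockedIncrement(&cs->owners) > 0)
          WaitForSingleObject(cs->wake, INFINITE);
  }

  void cs_leave(CritSec *cs) {
      /* If anyone is queued, hand the section to exactly one of them. */
      if (InterlockedDecrement(&cs->owners) >= 0)
          SetEvent(cs->wake);
  }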


but then there's this:

  NS Other internal housekeeping operations, memory allocation, garbage
  NS collection etc. are performed as sysopcodes, performed by the VMI
  NS within the auspices of the critical section, and thus secured.

UG there may be times when a GC run needs to be initiated DURING a VM
UG operation. if the op requires an immediate large chunk of ram it
UG can trigger a GC pass or allocation request. you can't force those
UG things to only happen between normal ops (which is what making
UG them into ops does). so GC and allocation both need to be able to
UG lock all shared things in their interpreter (and not just do a
UG process global lock) so those things won't be modified by the
UG other threads that share them.

Re: Threads: Time to get the terminology straight

2004-01-05 Thread Dan Sugalski
At 2:43 AM + 1/5/04, Nigel Sandever wrote:
05/01/04 01:22:32, Sam Vilain [EMAIL PROTECTED] wrote:

[STUFF] :)

In another post you mention Intel hyperthreading.
Essentially, duplicate sets of registers within a single CPU.
Do these need to apply a lock on every machine level entity that
they access?
 No.
Why not?

Because they can only modify an entity if it is loaded into a register
and the logic behind hyperthreading won't allow both register sets
to load the same entity concurrently.
This seems generally common for SMT core CPUs (though some of them 
may allow multiple references in different threads on-chip and just 
keep the data sync'd up--I've not looked) but, unfortunately, is 
untenable for Parrot. The hardware gets to throw massive amounts of 
parallelism at the problem, a luxury we, alas, don't have. (I'm 
pretty sure they also play some serious register remapping games and 
deferred & speculative execution tricks to not deadlock)

To do this would require that every time a PMC register was loaded, 
all the PMC registers of any other active interpreter in the 
interpreter group be scanned. Kind of a pricey operation.
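
To make the cost concrete, here is a sketch of what such a scan would
involve on every PMC register load; all the structures and names are
invented for illustration.

  /* Every load of a PMC register would have to walk every PMC
   * register of every other interpreter in the group. */
  #define NUM_PMC_REGS 32

  typedef struct Interp {
      struct PMC    *pmc_regs[NUM_PMC_REGS];
      struct Interp *next_in_group;
  } Interp;

  /* O(interpreters x registers) work on *every* PMC register load. */
  int pmc_loaded_elsewhere(Interp *group, Interp *self,
                           struct PMC *candidate) {
      for (Interp *i = group; i != NULL; i = i->next_in_group) {
          if (i == self)
              continue;
          for (int r = 0; r < NUM_PMC_REGS; r++)
              if (i->pmc_regs[r] == candidate)
                  return 1;   /* another interpreter has it loaded */
      }
      return 0;
  }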
--
Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Threads: Time to get the terminology straight

2004-01-04 Thread Dan Sugalski
I think some of the massive back and forth that's going on is in part 
due to terminology problems, which are in part causing some 
conceptual problems. So, for the moment, let's agree on the following 
things:

*) MUTEX - this is a low level, under the hood, not exposed to users, 
thing that can be locked. They're non-recursive, non-read/write, 
exclusive things. When a thread gets a mutex, any other attempt to 
get that mutex will block until the owning thread releases the mutex. 
The platform-native lock construct will be used for this.

*) LOCK - This is an exposed-to-HLL-code thing that can be locked. 
Only PMCs can be locked, and the lock may or may not be recursive or 
read/write.

*) CONDITION VARIABLE - the "sleep until something pings me" 
construct. Useful for queue construction, always associated with a 
MUTEX. (A minimal sketch follows this list.)

*) RENDEZVOUS POINT - A HLL version of a condition variable. *not* 
associated with a lock -- these are standalone.

Note that the mutex/condition association's a POSIX limitation, and 
POSIX threads is what we have on some platforms. If you want to 
propose abstracting it away, go for it. The separation doesn't buy us 
anything, though it's useful in other circumstances.

*) INTERPRETER - those bits of the Parrot_Interp structure that are 
absolutely required to be thread-specific. This includes the current 
register sets and stack pointers, as well as security context 
information. Basically if a continuation captures it, it's the 
interpreter.

*) INTERPRETER ENVIRONMENT - Those bits of the Parrot_Interp 
structure that aren't required to be thread-specific (though I'm not 
sure there are any) *PLUS* anything pointed to that doesn't have to 
be thread-specific.

The environment includes the global namespaces, pads, stack chunks, 
memory allocation regions, arenas, and whatnots. Just because the 
pointer to the current pad is thread-specific doesn't mean the pad 
*itself* has to be. It can be shared.

*) INDEPENDENT THREAD - A thread that has no contact *AT ALL* with 
the internal data of any other thread in the current process. 
Independent threads need no synchronization for anything other than 
what few global things we have. And the fewer the better, though alas 
we can't have none at all.

Note that independent threads may still communicate back and forth by 
passing either atomic things (ints, floats, and pointers) or static 
buffers that can become the property of the destination thread.

*) SHARED THREAD - A thread that's part of a group of threads sharing 
a common interpreter environment.
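
As a minimal sketch of the MUTEX/CONDITION VARIABLE pairing above, in
POSIX terms (illustration only, not Parrot code): a one-slot queue in
which the condition variables are always used under their mutex.

  #include <pthread.h>

  typedef struct {
      pthread_mutex_t mutex;
      pthread_cond_t  not_empty;  /* consumers sleep here until pinged */
      pthread_cond_t  not_full;   /* producers sleep here until pinged */
      void           *item;       /* NULL = empty */
  } Slot;

  Slot slot = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER,
                PTHREAD_COND_INITIALIZER, NULL };

  void slot_put(Slot *s, void *item) {
      pthread_mutex_lock(&s->mutex);
      while (s->item != NULL)                /* sleep until pinged */
          pthread_cond_wait(&s->not_full, &s->mutex);
      s->item = item;
      pthread_cond_signal(&s->not_empty);    /* ping one consumer */
      pthread_mutex_unlock(&s->mutex);
  }

  void *slot_take(Slot *s) {
      pthread_mutex_lock(&s->mutex);
      while (s->item == NULL)                /* sleep until pinged */
          pthread_cond_wait(&s->not_empty, &s->mutex);
      void *item = s->item;
      s->item = NULL;
      pthread_cond_signal(&s->not_full);     /* ping one producer */
      pthread_mutex_unlock(&s->mutex);
      return item;
  }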

Anyway, there's some terminology. It doesn't solve the design 
problem, but hopefully it'll help everyone talk the same language.

Remember that everything from the wrapped OS interface on up is up 
for grabs -- while we're not going to build our own mutexes or thread 
scheduler, everything that's been implemented or designed to date can 
be changed with sufficient good reason. (Though, again, the more you 
want to change the more spectacular the design has to be)
--
Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Threads: Time to get the terminology straight

2004-01-04 Thread Nigel Sandever
On Sun, 4 Jan 2004 15:47:35 -0500, [EMAIL PROTECTED] (Dan Sugalski) wrote:

 *) INTERPRETER - those bits of the Parrot_Interp structure that are 
 absolutely required to be thread-specific. This includes the current 
 register sets and stack pointers, as well as security context 
 information. Basically if a continuation captures it, it's the 
 interpreter.
 
 *) INTERPRETER ENVIRONMENT - Those bits of the Parrot_Interp 
 structure that aren't required to be thread-specific (though I'm not 
 sure there are any) *PLUS* anything pointed to that doesn't have to 
 be thread-specific.
 
 The environment includes the global namespaces, pads, stack chunks, 
 memory allocation regions, arenas, and whatnots. Just because the 
 pointer to the current pad is thread-specific doesn't mean the pad 
 *itself* has to be. It can be shared.
 

 *) SHARED THREAD - A thread that's part of a group of threads sharing 
 a common interpreter environment.

Ignoring the implementation of the synchronisation required, the basic
premise of my long post was that each SHARED THREAD should have its
own INTERPRETER (a VM in my terms), and that these should share a 
common INTERPRETER ENVIRONMENT.

Simplistically, 5005threads shared an INTERPRETER ENVIRONMENT 
and a single INTERPRETER. Synchronising threaded access to the shared
INTERPRETER (rather than its environment) was the biggest headache.
(I *think*).

With ithreads, each SHARED THREAD has its own INTERPRETER *and*
INTERPRETER ENVIRONMENT. This removes the contention for, and the
need to synchronise access to, the INTERPRETER, but requires the 
duplication of shared elements of the INTERPRETER ENVIRONMENT and
copy_on_read, with the inherent costs of the duplication at start-up 
and slow, indirect access to shared entities across the duplicated 
INTERPRETER ENVIRONMENTS.

My proposal was that each SHARED THREAD 
should have a separate copy of the INTERPRETER, 
but share a copy of the INTERPRETER ENVIRONMENT.

Everything else was my attempt at solving the requirements of 
synchronisation that this would require, whilst minimising the cost
of that synchronisation, by avoiding the need for a mutex on every
shared entity, and the cost of attempting to acquire a mutex except
when two SHARED THREADS attempted concurrent access to a
shared entity. 

I think that by having SHARED THREAD == INTERPRETER, sharing
a common INTERPRETER ENVIRONMENT, you can avoid (some) of 
the problems associated with 5005threads but retain the direct 
access of shared entities. 

This imposes its own set of requirements and costs, but (I believe)
the ideas that underlie the mechanisms I offered as solutions
are sound. The specific implementation is a platform-specific detail
that could be pushed down to a lower level.

 ...those bits of the Parrot_Interp 
 structure that aren't required to be thread-specific (though I'm not 
 sure there are any) 

This is where I have a different (and quite possibly incorrect) view.
My mental picture of the INTERPRETER ENVIRONMENT includes
both the implementation of all the classes in the process *plus* all
the memory of every instance of those classes. 

I think your statement above implies that these would not be a part
of the INTERPRETER ENVIRONMENT per se, but would be allocated 
from the global heap and only referenced from the bytecode that would
live in the INTERPRETER ENVIRONMENT? 

I realise that this is possible, and maybe even desirable, but the cost
of the GC walking a global heap, especially in the situation of a single 
process that contains two entirely separate instances of the 
INTERPRETER ENVIRONMENT, would be (I *think*) rather high.

I realise that this is a fairly rare occurrence on most platforms,
but in the win32 situation of emulated forks, each pseudo-process
must have an entirely separate INTERPRETER ENVIRONMENT,
potentially with each having multiple SHARED THREADS. 

If the memory for all entities in all pseudo-processes is allocated from 
a (real) process-global heap, then the multiple GCs required by the 
multiple pseudo-processes are going to be walking the same heap,
possibly concurrently.  I realise that this problem (if it is such)
does not occur on platforms that have real forks available, but it
would be useful if the high level design allowed for the use of
separate (virtual) heaps tied to the INTERPRETER ENVIRONMENTs,
which win32 has the ability to do.
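
Win32 can indeed do this. A sketch using the real
HeapCreate/HeapAlloc/HeapDestroy API (the Env struct and names are
invented for illustration):

  #include <windows.h>

  typedef struct Env {
      HANDLE heap;    /* this pseudo-process's private heap */
  } Env;

  void env_init(Env *env) {
      /* Growable private heap: its pages belong to this ENVIRONMENT
       * alone, so a GC walking it never touches another
       * pseudo-process's memory. */
      env->heap = HeapCreate(0, 0, 0);
  }

  void *env_alloc(Env *env, SIZE_T bytes) {
      return HeapAlloc(env->heap, HEAP_ZERO_MEMORY, bytes);
  }

  void env_destroy(Env *env) {
      /* Tearing down a pseudo-process frees everything in one call. */
      HeapDestroy(env->heap);
  }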


 
  Dan


Nigel.





Re: Threads: Time to get the terminology straight

2004-01-04 Thread Sam Vilain
On Mon, 05 Jan 2004 12:58, Nigel Sandever wrote;

   Everything else, were my attempts at solving the requirements of
   synchronisation that this would require, whilst minimising the
   cost of that synchronisation, by avoiding the need for a mutex on
   every shared entity, and the costs of attempting to aquire a mutex
   except when two SHARED THREADS attempted concurrent access to a
   shared entity.

This paragraph sounds like you're trying to solve an intractable problem.
Try posting some pseudocode to explain what you mean.

But it has given me an idea that could minimise the number of locks,
ie not require a mutex on each PMC, just each shared PMC.  10 points
to the person who finds a flaw in this approach :)

  Each object you share, could create a new (virtual) shared memory
  segment with its own semaphore.  This virtual memory segment is
  considered its own COW domain (ie, its own thread to the GC);
  references inserted back to non-shared memory will pull the
  structures into that virtual COW thread.

  Access to the entire structure is controlled via a multiple
  reader/single writer lock (close to what a semaphore is IIRC); locks
  for a thread are released when references to places inside the
  shared segment are no longer anywhere in any @_ on the locking
  thread's call stack, or in use by any opcode (is that good enough?),
  and are acquired for writing when anything needs to be changed.

  Virtual shared memory segments can then easily be cleaned up by
  normal GC.
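
In C terms, the locking discipline described above might look like the
following sketch, with pthreads read/write locks standing in for
whatever the real primitive would be; all names are invented.

  #include <pthread.h>

  /* One "virtual shared memory segment" guarded by a multiple-
   * reader/single-writer lock.  Note that pthreads offers no atomic
   * upgrade from read to write lock -- the very hazard discussed
   * below -- so a writer must take the write lock up front. */
  typedef struct Segment {
      pthread_rwlock_t rw;
      void            *root;    /* the shared structure living here */
  } Segment;

  void seg_read_lock(Segment *s)  { pthread_rwlock_rdlock(&s->rw); }
  void seg_write_lock(Segment *s) { pthread_rwlock_wrlock(&s->rw); }
  void seg_unlock(Segment *s)     { pthread_rwlock_unlock(&s->rw); }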

The major problem I can see is that upgrading a lock from reading to
writing can't work if there are concurrent writes (and the read lock
to be upgraded cannot sensibly be released).  But that should be OK,
since operation signatures will mark variables that need changing as
read-write as early as possible.

For example, in this sort of code (sorry for P5 code);

  sub changer {
     my $shared_object = shift;
     $shared_object->{bar} = "baz";
  }

A read lock to the segment $shared_object is in is acquired, then
released when it is `shifted' off.  As the next instruction has a
writable lvalue, it acquires a write lock.  But this code:

  sub changer {
     my $shared_object = shift;
     $shared_object->{bar} = somefunc();
  }

Will hold the write lock on $shared_object open until somefunc()
runs.

My 2¢ :).  This discussion will certainly reach a dollar soon ;).
-- 
Sam Vilain, [EMAIL PROTECTED]

  Start every day with a smile and get it over with. 
W C FIELDS




Re: Threads: Time to get the terminology straight

2004-01-04 Thread Nigel Sandever
05/01/04 01:22:32, Sam Vilain [EMAIL PROTECTED] wrote:

[STUFF] :)

In another post you mention Intel hyperthreading. 
Essentially, duplicate sets of registers within a single CPU.

Do these need to apply a lock on every machine level entity that
they access?
 No. 

Why not?

Because they can only modify an entity if it is loaded into a register
and the logic behind hyperthreading won't allow both register sets 
to load the same entity concurrently. 

( I know this is a gross simplification of the interactions 
between the on-board logic and L1/L2 caching!)

--- Not an advert or glorification of Intel. Just an example -

Hyper-Threading Technology provides thread-level-parallelism (TLP) 
on each processor resulting in increased utilization of processor 
execution resources. As a result, resource utilization yields higher 
processing throughput. Hyper-Threading Technology is a form of 
simultaneous multi-threading technology (SMT) where multiple 
threads of software applications can be run simultaneously on one
 processor. 

This is achieved by duplicating the *architectural state* on each 
processor, while *sharing one set of processor execution resources*.
--

The last paragraph is the salient one as far as I am concerned.

The basic premise of my original proposal was that multi-threaded,
machine level applications don't have to interlock on machine level 
entities, because each operation they perform is atomic. 

Whilst higher level objects, of which the machine level objects 
are a part, may have their state corrupted by two threads 
modifying things concurrently, the state of the threads
(register sets + stack) themselves cannot be corrupted. 

This is because they have their own internally consistent state
that only changes atomically, and that is completely separate,
each from the other. They only share common data (code is data
to the CPU, just as bytecode is data to a VM).

So, if you are going to emulate a (hyper)threaded CPU in a 
register-based virtual machine interpreter, and allow for
concurrent threads of execution within that VMI,
then one way of ensuring that the internal state of the 
VMI is never corrupted would be to have each thread have
its own copy of the *architectural state* of the VM, whilst
sharing *one set of processor execution resources*.

For this to work, you would need to achieve the same opcode
atomicity at the VMI level. Interlocking the threads so that
one shared thread cannot start an opcode until another shared 
thread has completed one gives this atomicity. The penalty is that
if the interlocking is done for every opcode, then shared 
threads end up with very long virtual timeslices. To prevent 
that being the case (most of the time), the interlocking should
only come into effect *if* concurrent access to a VM level 
entity is imminent. 

As the VMI cannot access (modify) the state of a VM level
entity (PMC) until it has loaded it into a VM register, the
interlocking only need come into effect *if* the entity
whose reference is being loaded into a PMC register is 
currently in use by (another) thread. 

The state of a PMC's in-useness can be flagged by a single bit
in its header. This can be detected by a shared thread when
the reference to it is loaded into the PMC register, and 
when it is, that shared thread then waits on the single,
shared mutex before proceeding.

It is only when the combination of atomised VM opcodes
and lightweight in-use detection come together that the
need for a mutex per entity can be avoided.

If the mutex used is capable of handling SMP, NUMA,
clusters etc, then the mechanism will work. 

If the lightweight bit-test-set opcode isn't available,
then a heavyweight equivalent could be used, though the
advantages would be reduced.
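
As a hedged sketch of that fast path, with C11 atomics standing in for
a native bit-test-set instruction and a pthreads mutex as the single
shared mutex; all names are invented:

  #include <pthread.h>
  #include <stdatomic.h>

  typedef struct PMC {
      atomic_uint flags;           /* bit 0 = the proposed in-use bit */
  } PMC;

  #define PMC_IN_USE 1u

  static pthread_mutex_t vm_mutex = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  vm_freed = PTHREAD_COND_INITIALIZER;

  /* Fast path: one atomic OR (a bit test-and-set).  Only when the bit
   * was already set by another shared thread do we fall back to the
   * heavyweight shared mutex and wait. */
  void pmc_acquire(PMC *p) {
      while (atomic_fetch_or(&p->flags, PMC_IN_USE) & PMC_IN_USE) {
          pthread_mutex_lock(&vm_mutex);
          /* re-test under the mutex, then sleep until a PMC is freed */
          if (atomic_load(&p->flags) & PMC_IN_USE)
              pthread_cond_wait(&vm_freed, &vm_mutex);
          pthread_mutex_unlock(&vm_mutex);
      }
  }

  void pmc_release(PMC *p) {
      atomic_fetch_and(&p->flags, ~PMC_IN_USE);
      pthread_mutex_lock(&vm_mutex);
      pthread_cond_broadcast(&vm_freed);   /* wake any blocked threads */
      pthread_mutex_unlock(&vm_mutex);
  }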



I hope that clarifies my thinking and how I arrived at it.

I accept that it may not be possible on all platforms, and
it may be too expensive on some others. It may even be 
undesirable in the context of Parrot, but I have seen no
argument that goes to invalidate the underlying premise.

Regards, Nigel





Re: Threads: Time to get the terminology straight

2004-01-04 Thread Sam Vilain
On Mon, 05 Jan 2004 15:43, Nigel Sandever wrote;

   I accept that it may not be possible on all platforms, and it may
   be too expensive on some others. It may even be undesirable in the
   context of Parrot, but I have seen no argument that goes to
   invalidate the underlying premise.

I think you missed this:

LT Different VMs can run on different CPUs. Why should we make atomic
LT instructions out of these? We have a JIT runtime performing at 1
LT Parrot instruction per CPU instruction for native integers. Why
LT should we slow down that by a magnitude of many tenths?

LT We have to lock shared data, then you have to pay the penalty, but
LT not for each piece of code.

and this:

LT I think, that you are missing multiprocessor systems totally.

You are effectively excluding true parallelism by blocking other
processors from executing Parrot ops while one has the lock.  You may
as well skip the thread libraries altogether and multi-thread the ops
in a runloop like Ruby does.

But let's carry the argument through, restricting it to UP systems,
with hyperthreading switched off, and running Win32.  Is it even true
that masking interrupts is enough on these systems?

Win32 `Critical Sections' must be giving the scheduler hints not to
run other pending threads whilst a critical section is running.  Maybe
it uses the CPU sti/cli flags for that, to avoid the overhead of
setting a memory word somewhere (bad enough) or calling the system
(crippling).  In that case, setting STI/CLI might only incur a ~50%
performance penalty for integer operations.

but then there's this:

  NS Other internal housekeeping operations, memory allocation, garbage
  NS collection etc. are performed as sysopcodes, performed by the VMI
  NS within the auspices of the critical section, and thus secured.

UG there may be times when a GC run needs to be initiated DURING a VM
UG operation. if the op requires an immediate large chunk of ram it
UG can trigger a GC pass or allocation request. you can't force those
UG things to only happen between normal ops (which is what making
UG them into ops does). so GC and allocation both need to be able to
UG lock all shared things in their interpreter (and not just do a
UG process global lock) so those things won't be modified by the
UG other threads that share them.

I *think* this means that even if we *could* use critical sections for
each op, where this works and isn't terribly inefficient, GC throws a
spanner in the works.  This could perhaps be worked around.

In any case, it won't work on the fastest known threading
implementations (Solaris, Linux NPTL, etc), as they won't know to
block all the other threads in a given process just because one of
them set a CPU flag cycles before it was pre-empted.

So, in summary - it won't work on MP, and on UP, it couldn't possibly
be as overhead-free as the other solutions.

Clear as mud ?  :-)

[back to processors]
 Do these need to apply a lock on every machine level entity that
 they access?

Yes, but the only resource that matters here is memory.  Locking
*does* take place inside the processor, but the locks are all close
enough to be inspected in under a cycle.  And misses incur a penalty
of several cycles - maybe dozens, depending on who has the memory
locked.

Registers are also locked by virtue of the fact that the
out-of-order execution and pipelining logic will not schedule/allow an
instruction to proceed until its data is ready.  Any CPU with
pipelining has this problem.

There is an interesting comparison to be drawn between the JIT
assembly happening inside the processor from the bytecode being
executed (x86) into a RISC core machine language (µ-ops) on
hyperthreading systems, and Parrot's compiling PASM to native machine
code.  In each case it is the µ-ops that are ordered to maximize
performance and fed into the execution units.

On a hyperthreading processor, it has the luxury of knowing how long
it will take to check the necessary locks for each instruction,
probably under a cycle, so that µ-ops may scream along.

With Parrot, it might have to contact another host over an ethernet
controller to acquire a lock (eg, threads running in an OpenMOSIX
cluster).  This cannot happen for every instruction!
-- 
Sam Vilain, [EMAIL PROTECTED]

  The golden rule is that there are no golden rules
GEORGE BERNARD SHAW