Re: Threads Design. A Win32 perspective.

2004-01-07 Thread Elizabeth Mattijsen
At 23:10 -0500 1/5/04, Gordon Henriksen wrote:
Data corruption unacceptable? I disagree. It depends on the contract 
put forward by the language in question. Notably, Perl makes no such 
guarantees thus far, being as how it doesn't (any longer) run in a 
traditional threaded model. Successfully threaded languages and 
virtual machines explicitly make NO such guarantees. I'm completely 
okay with the potential for inconsistent data where the user doesn't 
take precautions. If the alternative is my code bogging down to half 
of its ideal speed because parrot is wasting time: acquiring three 
mutexes with deadlock detection so that it can execute add $P10, 
$P11, $P12; acquiring a mutex every time it indexes a string 
register; acquiring a mutex so it can examine cache->int_val -- if 
those are the alternatives, I'm for the potential for inconsistent 
data. (And I say wasting because it's completely unnecessary in a 
good 99% of cases; the vast majority of data is not shared between 
threads at all.)
I agree with most of what is said here.

However, I think there should be some type of debugging mode that 
_would_ do all of the locking and deadlock detection (and whatever 
other debugging capabilities we can think of).  Occasional, 
hard-to-reproduce data corruption is the single most problematic 
thing to debug in threaded applications.  It creates Heisenbugs that 
will only show up under specific load conditions and/or on specific 
platforms.  And when that happens, it doesn't matter that your 
threaded application doesn't crash: for the users it does the 
equivalent, namely occasionally producing the wrong results, and is 
therefore not production-ready.



Liz


Re: Threads Design. A Win32 perspective.

2004-01-07 Thread Dan Sugalski
At 23:10 -0500 1/5/04, Gordon Henriksen wrote:
Data corruption unacceptable? I disagree.
I get the feeling people just aren't reading what's been written, or 
aren't keeping it all straight.

*User* and *program* data integrity is not our problem -- not only 
are we not guaranteeing that, I'd be fine with parrot actively 
corrupting the contents of shared data structures that are accessed 
without synchronization.

*Interpreter* data corruption is absolutely unacceptable, will not be 
tolerated, and the interpreters will do whatever synchronization they 
need to do to make sure it doesn't happen.
--
Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Threads Design. A Win32 perspective.

2004-01-06 Thread Gordon Henriksen
On Sunday, January 4, 2004, at 03:17 , Jeff Clites wrote:

On Jan 3, 2004, at 8:59 PM, Gordon Henriksen wrote:

On Saturday, January 3, 2004, at 04:32 , Nigel Sandever wrote:

Transparent interlocking of VHLL fat structures performed 
automatically by the VM itself. No need for :shared or lock().
Completely specious, and repeatedly proven unwise. Shouldn't even be 
pursued.

Atomic guarantees on collections (or other data structures) are rarely 
meaningful; providing them is simply a waste of time. Witness the 
well-deserved death of Java's synchronized Vector class in favor of 
ArrayList. The interpreter absolutely shouldn't crash due to threading 
errors -- it should protect itself using standard techniques -- but it would 
be a mistake for parrot to mandate that all ops and PMCs be 
thread-safe.
What are these standard techniques? The JVM spec does seem to guarantee 
that even in the absence of proper locking by user code, things won't 
go completely haywire, but I can't figure out how this is possible 
without actual locking. (That is, I'm wondering if Java is doing 
something clever.) For instance, inserting something into a collection 
will often require updating more than one memory location (especially 
if the collection is out of space and needs to be grown), and I can't 
figure out how this could be guaranteed not to completely corrupt 
internal state in the absence of locking. (And if it _does_ require 
locking, then it seems that the insertion method would in fact then be 
synchronized.)

So my question is, how do JVMs manage to protect internal state in the 
absence of locking? Or do they?
The JVM won't crash, but individual objects can easily become corrupt. 
Java makes no guarantees of the level you're proposing. All 
synchronization in Java is explicit, using either synchronized methods, 
or synchronized (...) { ... } blocks; the runtime doesn't implicitly 
burn cycles synchronizing objects for the user. It does, however, give 
the user the threading primitives which allow them to write thread-safe 
code. Thread-safety is, indeed, left entirely to the user. This is 
crucial to allowing code to execute quickly and for the optimizer to 
function. The JVM protects itself, ensures that its core is thread-safe, 
and provides a secure operating environment for user code; it is the 
responsibility of the authors of classes to protect instances of their 
classes. Remember, Java's libraries are implemented atop these 
primitives *in Java.* The analogous arrangement would be for parrot-based 
languages to be implemented primarily atop parrot's primitives in PBC, 
with the sections that are not implemented that way (the core) 
scrutinized carefully for thread-safety.

What are these threadsafe primitives which Java provides? For example:

1. Fixed-size arrays are safe to access concurrently from multiple 
threads.
2. ints, chars, pointers, etc. can be updated atomically.
3. Java's memory allocator is thread-safe.
4. Strings are immutable (and thus sharable without threadsafety 
violations).
5. The JVM does not allow memory to be resized, making bounds-checking 
threadsafe.
6. No objects can be moved (while retaining their identity) except by 
the garbage collector.
7. All threads are suspended during GC.
8. All pointers are traceable by GC.
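
In C terms -- the terms parrot lives in -- the "explicit synchronization 
left to the user" model looks roughly like the sketch below: the runtime 
supplies the mutex primitive, and only code that actually shares a 
structure pays for it.  (Names here are illustrative, not from any real 
API.)

    #include <pthread.h>

    /* Illustrative stand-in for a user-level structure shared between
     * threads; the *user* decides it needs a lock and adds one. */
    typedef struct {
        pthread_mutex_t lock;     /* init with PTHREAD_MUTEX_INITIALIZER */
        int             items[64];
        int             count;
    } shared_list;

    void shared_list_push(shared_list *l, int v) {
        pthread_mutex_lock(&l->lock);     /* caller opts in to the cost  */
        if (l->count < 64)
            l->items[l->count++] = v;
        pthread_mutex_unlock(&l->lock);   /* only where data is shared   */
    }

Unshared structures are accessed with no locking at all, which is the 
whole point.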

That parrot is largely implemented in C, and is designed to operate in 
terms of fat data structures, makes it much more difficult to guarantee 
in parrot that segfaults will not occur. I have some ideas which I'll 
share after I sync up with this fast-moving thread (of conversation). 
But one of my premises is that implicitly guaranteeing atomic PMC 
accesses will render threaded parrot dead-on-arrival. Just like the last 
2 abortive attempts at creating a threading Perl implementation. (And 
threaded parrot would be a separate executable, because the 
performance hit will be so staggering.) The sheer overhead of locking is 
an obvious deal-killer, but I think that deadlocks and prohibition of 
optimizations would quickly put a few extra nails in the coffin. I don't 
want to see the Perl community left adrift with another toy threading 
implementation unsuitable for any serious work.



Gordon Henriksen
[EMAIL PROTECTED]


Re: Threads Design. A Win32 perspective.

2004-01-06 Thread Gordon Henriksen
On Sunday, January 4, 2004, at 01:43 , Dan Sugalski wrote:

At 11:59 PM -0500 1/3/04, Gordon Henriksen wrote:
On Saturday, January 3, 2004, at 04:32 , Nigel Sandever wrote:

Transparent interlocking of VHLL fat structures performed 
automatically by the VM itself. No need for :shared or lock().
Completely specious, and repeatedly proven unwise. Shouldn't even be 
pursued.
Erm... that turns out not to be the case. A lot. (Yeah, I know, I said 
I wasn't paying attention)

An interpreter *must* lock any shared data structure, including PMCs, 
when accessing them. Otherwise they may be in an inconsistent state 
when being accessed, which will lead to data corruption or process 
crashing, which is unacceptable.

These locks do not have to correspond to HLL-level locks, though it'd 
be a reasonable argument that they share the same mutex.
Process crashing unacceptable? Absolutely. Complete agreement.

Data corruption unacceptable? I disagree. It depends on the contract put 
forward by the language in question. Notably, Perl makes no such 
guarantees thus far, being as how it doesn't (any longer) run in a 
traditional threaded model. Successfully threaded languages and virtual 
machines explicitly make NO such guarantees. I'm completely okay with 
the potential for inconsistent data where the user doesn't take 
precautions. If the alternative is my code bogging down to half of its 
ideal speed because parrot is wasting time: acquiring three mutexes with 
deadlock detection so that it can execute add $P10, $P11, $P12; 
acquiring a mutex every time it indexes a string register; acquiring a 
mutex so it can examine cache->int_val -- if those are the alternatives, 
I'm for the potential for inconsistent data. (And I say wasting because 
it's completely unnecessary in a good 99% of cases; the vast majority of 
data is not shared between threads at all.)
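
For concreteness, here is a rough C sketch (illustrative only -- not 
Parrot code) of what implicit locking for a single add would entail; it 
assumes the three registers hold distinct PMCs and takes the locks in 
address order to dodge deadlocks:

    #include <pthread.h>
    #include <stdlib.h>

    /* Toy stand-in for a PMC that carries its own mutex. */
    typedef struct toy_pmc { pthread_mutex_t lock; long int_val; } toy_pmc;

    static int by_addr(const void *a, const void *b) {
        const toy_pmc *x = *(toy_pmc * const *)a;
        const toy_pmc *y = *(toy_pmc * const *)b;
        return (x > y) - (x < y);
    }

    /* add Pdest, Psrc1, Psrc2 -- with implicit per-PMC locking */
    void op_add(toy_pmc *dest, toy_pmc *src1, toy_pmc *src2) {
        toy_pmc *order[3] = { dest, src1, src2 };
        qsort(order, 3, sizeof order[0], by_addr);   /* fixed lock order */

        pthread_mutex_lock(&order[0]->lock);      /* three acquisitions */
        pthread_mutex_lock(&order[1]->lock);
        pthread_mutex_lock(&order[2]->lock);

        dest->int_val = src1->int_val + src2->int_val;  /* the real work */

        pthread_mutex_unlock(&order[2]->lock);    /* and three releases, */
        pthread_mutex_unlock(&order[1]->lock);    /* for every single    */
        pthread_mutex_unlock(&order[0]->lock);    /* add                 */
    }

Six mutex operations to add two integers, in the 99% of cases where nobody 
else could possibly be looking at those PMCs.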

I'm not okay with parrot crashing for any reason, ever, though, so where 
locking is not provided, PMC classes *must* be coded *very* carefully so 
as to avoid crashes.



Gordon Henriksen
[EMAIL PROTECTED]


RE: Threads Design. A Win32 perspective.

2004-01-06 Thread Garrett Goebel
Gordon Henriksen wrote:
 Dan Sugalski wrote:
 
  An interpreter *must* lock any shared data structure, 
  including PMCs, when accessing them. Otherwise they may
  be in an inconsistent state when being accessed, which will 
  lead to data corruption or process crashing, which is
  unacceptable.
 
 Process crashing unacceptable? Absolutely. Complete agreement.
 
 Data corruption unacceptable? I disagree. It depends on the 
 contract put forward by the language in question. Notably,
 Perl makes no such guarantees thus far, being as how it
 doesn't (any longer) run in a traditional threaded model.
 Successfully threaded languages and virtual machines explicitly
 make NO such guarantees

Dan's non-negotiable Thread notes didn't say the VM's guarantees were
non-negotiable ;)

Are there differences in the contracts put forth by existing languages for
which we hope to have parrot implementations? 

Ironically, wouldn't setting the guarantees against data corruption this
high pretty much rule out a pure parrot implementation of an SQL-92
compliant database?  SQL-92 defines 4 transaction isolation levels in terms
of the phenomena they're meant to allow/avoid: dirty reads, non-repeatable
reads, and phantoms. 

Allowing language-level variation in concurrency vs. correctness guarantees?
Can of worms? I can already hear Dan: "Nothing to be seen here.  Move
along."

On the off chance that anyone's interested, here's a critique of the ANSI
SQL-92 isolation levels: http://citeseer.nj.nec.com/berenson95critique.html

--
Garrett Goebel
IS Development Specialist 

ScriptPro   Direct: 913.403.5261 
5828 Reeds Road   Main: 913.384.1008 
Mission, KS 66202  Fax: 913.384.2180 
www.scriptpro.com  garrett at scriptpro dot com 


Re: Threads Design. A Win32 perspective.

2004-01-05 Thread Dan Sugalski
At 4:03 PM -0800 1/4/04, Damien Neil wrote:
It's my understanding that Parrot has chosen to take the path of using
many mutable data structures at the VM level; unfortunately, this is
pretty much incompatible with a fast or elegant threading model.
Yep, generally true. The choice was made on purpose, and I was aware 
of most of the ramifications of it, both good and ill.
--
Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Threads Design. A Win32 perspective.

2004-01-04 Thread Elizabeth Mattijsen
At 00:49 +0100 1/4/04, Leopold Toetsch wrote:
Elizabeth Mattijsen [EMAIL PROTECTED] wrote:
  Indeed.  But as soon as there is something special such as a
 datastructure external to Perl between threads (which is shared
 automatically, because Perl doesn't know about
  the datastructure,
Why is it shared automatically? Do you have an example for that?
When you use an external library in Perl, such as e.g. libxml, you 
have Perl data-structures and libxml data-structures.  The Perl 
data-structures contain pointers to the libxml data-structures.

In comes the starting of an ithread and Perl clones all of the Perl 
data-structures.  But it only copies the things it knows about. 
And thus leaves the pointers to the libxml data-structures untouched. 
Now you have 2 Perl data-structures that point to the _same_ libxml 
data-structures.  Voila, instant sharing.

With disastrous results.  Because as soon as the thread ends, the 
cloned Perl object in the thread goes out of scope.  Perl then calls 
the DESTROY method on the object, which then frees up the libxml 
data-structures.  That's what it's supposed to do.  Meanwhile, back 
in the original thread, the pointers in the Perl object now point at 
freed memory, rather than a live libxml data-structure.  And chaos 
ensues sooner or later.  Of course, chaos could well ensue before the 
thread is ended, because both threads think they have exclusive 
access to the libxml data-structure.

Hope this explanation made sense.


  ... so the cloned objects point to the same memory
 address), then you're in trouble.  Simply because you now have
 multiple DESTROYs called on the same external data-structure.  If the
 function of the DESTROY is to free the memory of the external
 data-structure, you're in trouble as soon as the first thread is
  done.  ;-(
Maybe that DOD/GC can help here. A shared object can and will be
destroyed only, when the last holder of that object has released it.
But do you see now how complicated this can become if thread === interpreter?



Liz


Re: Threads Design. A Win32 perspective.

2004-01-04 Thread Gordon Henriksen
On Saturday, January 3, 2004, at 04:32 , Nigel Sandever wrote:

Transparent interlocking of VHLL fat structures performed automatically 
by the VM itself. No need for :shared or lock().
Completely specious, and repeatedly proven unwise. Shouldn't even be 
pursued.

Atomic guarantees on collections (or other data structures) are rarely 
meaningful; providing them is simply a waste of time. Witness the 
well-deserved death of Java's synchronized Vector class in favor of 
ArrayList. The interpreter absolutely shouldn't crash due to threading 
errors -- it should protect itself using standard techniques -- but it would 
be a mistake for parrot to mandate that all ops and PMCs be thread-safe.

The details of threaded programming cannot be hidden from the 
programmer. It's tempting to come up with clever ways to try, but the 
engine really has to take a back seat here. Smart programmers will 
narrow the scope of potential conflicts by reducing sharing of data 
structures in their threaded programs. Having done so, any atomicity 
guarantees on individual objects prove to be wasted effort: they will be 
resented by parrot's users as needless overhead, not praised. Consider 
the potential usage cases.

1. All objects in a non-threaded program.
2. Unshared objects in a threaded program.
3. Shared objects in a threaded program.
The first two cases will easily comprise 99% of all usage. In only the 
third case are synchronized objects even conceivably useful, and even 
then the truth of the matter is that they are of extremely limited 
utility: Their guarantees are more often than not too fine-grained to 
provide the high-level guarantees that the programmer truly needs. In 
light of this, the acquisition of a mutex (even a mutex that's 
relatively cheap to acquire) to push an element onto an array, or to 
access a string's data -- well, it stops looking so good.

That said, the interpreter can't be allowed to crash due to threading 
errors. It must protect itself. But should a PerlArray written to 
concurrently from 2 threads guarantee that its state makes sense at the 
end of the program? I say no, based upon precedent; the cost is too high.



Gordon Henriksen
[EMAIL PROTECTED]


Re: Threads Design. A Win32 perspective.

2004-01-04 Thread Leopold Toetsch
Elizabeth Mattijsen [EMAIL PROTECTED] wrote:

 When you use an external library in Perl, such as e.g. libxml, you
 have Perl data-structures and libxml data-structures.  The Perl
 data-structures contain pointers to the libxml data-structures.

 In comes the starting of an ithread and Perl clones all of the Perl
 data-structures.  But it only copies the things it knows about.
 And thus leaves the pointers to the libxml data-structures untouched.
 Now you have 2 Perl data-structures that point to the _same_ libxml
 data-structures.  Voila, instant sharing.

I see. Our library loading code should take care of that. On thread
creation we call again the _init code, so that the external lib can
prepare itself to be used from multiple threads. But don't ask me about
details ;)

 Liz

leo


Re: Threads Design. A Win32 perspective.

2004-01-04 Thread Elizabeth Mattijsen
At 14:47 +0100 1/4/04, Leopold Toetsch wrote:
Elizabeth Mattijsen [EMAIL PROTECTED] wrote:
  When you use an external library in Perl, such as e.g. libxml, you
 have Perl data-structures and libxml data-structures.  The Perl
  data-structures contain pointers to the libxml data-structures.
  In comes the starting of an ithread and Perl clones all of the Perl
 data-structures.  But it only copies the things it knows about.
 And thus leaves the pointers to the libxml data-structures untouched.
 Now you have 2 Perl data-structures that point to the _same_ libxml
  data-structures.  Voila, instant sharing.
I see. Our library loading code should take care of that. On thread
creation we call again the _init code, so that the external lib can
prepare itself to be used from multiple threads. But don't ask me about
details ;)
What you need, is basically being able to:

- register a class method to be called on cloning
- register an object method that is called whenever an _object_ is cloned
The CLONE sub that Perl5 has is the class method.  The object method 
is missing from Perl (Thread::Bless is a way to remedy this problem).

I don't know what the _init code does, but judging by its name, it's 
not giving enough info to be able to properly clone an object with 
external data structures.

Liz


Re: Threads Design. A Win32 perspective.

2004-01-04 Thread Dan Sugalski
At 11:59 PM -0500 1/3/04, Gordon Henriksen wrote:
On Saturday, January 3, 2004, at 04:32 , Nigel Sandever wrote:

Transparent interlocking of VHLL fat structures performed 
automatically by the VM itself. No need for :shared or lock().
Completely specious, and repeatedly proven unwise. Shouldn't even be pursued.
Erm... that turns out not to be the case. A lot. (Yeah, I know, I 
said I wasn't paying attention)

An interpreter *must* lock any shared data structure, including PMCs, 
when accessing them. Otherwise they may be in an inconsistent state 
when being accessed, which will lead to data corruption or process 
crashing, which is unacceptable.

These locks do not have to correspond to HLL-level locks, though it'd 
be a reasonable argument that they share the same mutex.
--
Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Threads Design. A Win32 perspective.

2004-01-04 Thread Jeff Clites
On Jan 4, 2004, at 5:47 AM, Leopold Toetsch wrote:

Elizabeth Mattijsen [EMAIL PROTECTED] wrote:

When you use an external library in Perl, such as e.g. libxml, you
have Perl data-structures and libxml data-structures.  The Perl
data-structures contain pointers to the libxml data-structures.

In comes the starting of an ithread and Perl clones all of the Perl
 data-structures.  But it only copies the things it knows about.
And thus leaves the pointers to the libxml data-structures untouched.
Now you have 2 Perl data-structures that point to the _same_ libxml
data-structures.  Voila, instant sharing.
I see. Our library loading code should take care of that. On thread
creation we call again the _init code, so that the external lib can
prepare itself to be used from multiple threads. But don't ask me about
details ;)
But I think we'll never be able to make this work as the user would 
initially expect. For instance if we have a DBI implementation, and 
some PMC is holding an external reference to a database cursor for an 
open transaction, then we can't properly duplicate the necessary state 
to make the copy of the PMC work correctly (that is, independently). 
(And I'm not saying just that we can't do it from within parrot, I'm 
saying the native database libraries can't do this.)

So some objects such as this would always have to end up shared (or 
else non-functional in the new thread), which is bad for users because 
they have to be concerned with what objects are backed by native 
libraries and which ones cannot be made to conform to each of our 
thread styles.

That seems like a major caveat.

JEff



Re: Threads Design. A Win32 perspective.

2004-01-04 Thread Dan Sugalski
At 12:05 PM -0800 1/4/04, Jeff Clites wrote:
On Jan 4, 2004, at 5:47 AM, Leopold Toetsch wrote:

Elizabeth Mattijsen [EMAIL PROTECTED] wrote:

When you use an external library in Perl, such as e.g. libxml, you
have Perl data-structures and libxml data-structures.  The Perl
data-structures contain pointers to the libxml data-structures.

In comes the starting of an ithread and Perl clones all of the Perl
data-structures.  But it only copies the things it knows about.
And thus leaves the pointers to the libxml data-structures untouched.
Now you have 2 Perl data-structures that point to the _same_ libxml
data-structures.  Voila, instant sharing.
I see. Our library loading code should take care of that. On thread
creation we call again the _init code, so that the external lib can
prepare itself to be used from multiple threads. But don't ask me about
details ;)
But I think we'll never be able to make this work as the user would 
initially expect. For instance if we have a DBI implementation, and 
some PMC is holding an external reference to a database cursor for 
an open transaction, then we can't properly duplicate the necessary 
state to make the copy of the PMC work correctly (that is, 
independently). (And I'm not saying just that we can't do it from 
within parrot, I'm saying the native database libraries can't do 
this.)
And don't forget the libraries that are picky about which thread 
calls into them -- there are some that require that the thread that 
created the handle for the library be the thread that calls into the 
library with that handle. (Though luckily those are pretty rare) And 
of course the non-reentrant libraries that require a global library 
lock for all calls otherwise the library state gets corrupted.

Aren't threads fun? :)
--
Dan
--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Threads Design. A Win32 perspective.

2004-01-04 Thread Jeff Clites
On Jan 3, 2004, at 8:59 PM, Gordon Henriksen wrote:

On Saturday, January 3, 2004, at 04:32 , Nigel Sandever wrote:

Transparent interlocking of VHLL fat structures performed 
automatically by the VM itself. No need for :shared or lock().
Completely specious, and repeatedly proven unwise. Shouldn't even be 
pursued.

Atomic guarantees on collections (or other data structures) are rarely 
meaningful; providing them is simply a waste of time. Witness the 
well-deserved death of Java's synchronized Vector class in favor of 
ArrayList. The interpreter absolutely shouldn't crash due to threading 
errors -- it should protect itself using standard techniques -- but it would 
be a mistake for parrot to mandate that all ops and PMCs be 
thread-safe.
What are these standard techniques? The JVM spec does seem to guarantee 
that even in the absence of proper locking by user code, things won't 
go completely haywire, but I can't figure out how this is possible 
without actual locking. (That is, I'm wondering if Java is doing 
something clever.) For instance, inserting something into a collection 
will often require updating more than one memory location (especially 
if the collection is out of space and needs to be grown), and I can't 
figure out how this could be guaranteed not to completely corrupt 
internal state in the absence of locking. (And if it _does_ require 
locking, then it seems that the insertion method would in fact then be 
synchronized.)

So my question is, how do JVMs manage to protect internal state in the 
absence of locking? Or do they?

JEff


Re: Threads Design. A Win32 perspective.

2004-01-04 Thread Sam Vilain
On Sat, 03 Jan 2004 20:51, Luke Palmer wrote:

   Parrot is platform-independent, but that doesn't mean we can't
   take advantage of platform-specific instructions to make it faster
   on certain machines.  Indeed, this is precisely what JIT is.  
   But a lock on every PMC is still pretty heavy for those non-x86
   platforms out there, and we should avoid it if we can.

So implement threading on architectures that don't support interrupt
masking with completely user-space threading (ie, runloop round-robin)
like Ruby does.  *That* is available on *every* platform.
-- 
Sam Vilain, [EMAIL PROTECTED]

Seeing a murder on television... can help work off one's antagonisms.
And if you haven't any antagonisms, the commercials will give you
some.
 -- Alfred Hitchcock



Re: Threads Design. A Win32 perspective.

2004-01-04 Thread Dan Sugalski
At 3:17 PM -0500 1/4/04, Uri Guttman wrote:
  DS == Dan Sugalski [EMAIL PROTECTED] writes:

  DS And don't forget the libraries that are picky about which thread calls
  DS into them -- there are some that require that the thread that created
  DS the handle for the library be the thread that calls into the library
  DS with that handle. (Though luckily those are pretty rare) And of course
  DS the non-reentrant libraries that require a global library lock for all
  DS calls otherwise the library state gets corrupted.
  DS Aren't threads fun? :)

hence my love for events and forked procs.
Forks make some of this worse. There are more libraries that don't 
work with connections forked across processes than libraries that 
don't work with calls from a different thread.
--
Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Threads Design. A Win32 perspective.

2004-01-04 Thread Dan Sugalski
At 9:27 AM +1300 1/5/04, Sam Vilain wrote:
On Sat, 03 Jan 2004 20:51, Luke Palmer wrote:

   Parrot is platform-independent, but that doesn't mean we can't
   take advantage of platform-specific instructions to make it faster
   on certain machines.  Indeed, this is precisely what JIT is. 
   But a lock on every PMC is still pretty heavy for those non-x86
   platforms out there, and we should avoid it if we can.

So implement threading on architectures that don't support interrupt
masking with completely user-space threading (ie, runloop round-robin)
like Ruby does.  *That* is available on *every* platform.
Interrupt masking and a proper threading interface can be considered 
a prerequisite for threads of any sort under Parrot, the same way an 
ANSI C89-compliant compiler is a requirement. Platforms that can't 
muster at least thread spawning, mutexes, and condition variables 
don't get threads, and don't have to be considered. (You can, it's 
just not required, and you'd be hard-pressed to find anything outside 
the embedded realm that doesn't support at least that level of 
functionality, and I'm OK if there are no threads on the Gameboy port)
--
Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Nigel Sandever
On Sat, 03 Jan 2004 01:48:07 -0500, [EMAIL PROTECTED] (Uri Guttman) wrote:
 ding! ding! ding! you just brought in a cpu specific instruction which
 is not guaranteed to be on any other arch. in fact many have such a
 beast but again, it is not accessible from c.

 you can't bring x86 centrism into this. the fact that redmond/intel
 threads can make use of this instruction to do 'critical sections' is a
 major point why it is bad for a portable system with many architectures
 to be supported.

 i disagree. it is covered for the intel case only and that is not good
 enough.

 again, intel specific and it has to be tossed out.

 but atomic operations are not available on all platforms.

 not in a unix environment. real multi-threaded processes already have
 this problem.

 that is a kernel issue and most signalling systems don't have thread
 specific arguments AFAIK.

 virtual ram is what counts on unix. you can't request some large amount
 without using real swap space. 

 again, redmond specific. 

 what win32 does is not portable and not useful at a VM level.

 ok, i can see where you got the test/set and yield stuff from now
 finally, it is redmond specific again and very unportable. i
 have never heard of this fibre concept on any unix flavor.

 and that is not possible on other platforms.

 that is a big point and one that i don't see as possible. redmond can do
 what they want with their kernel and user procs. parrot can only use
 kernel concepts that are reasonably portable across most platforms.
 kernel threads are reasonably portable (FSDO of reasonable) but anything
 beyond that such as fibres, test/set in user space, etc is not. locks
 have to be in kernel space since we can do a fibre yield in user space
 on any other platform. so this rule out user space test/set as well
 since that would need a thread to spin instead of blocking.
 
 your ideas make sense but only on redmond/intel which is not the target
 space for parrot.

That's pretty much the crux. I don't know what is available (in detail) on
other platforms. Hence I needed to express the ideas in terms I understand
and explain them sufficiently that they could be viewed, interpreted and
either related to similar concepts on other platforms, or shot down.

I accept your overall judgement, though not necessarily all the specifics.

Maybe it would be possible (for me + others) to write the core of a win32-specific,
threaded VM interpreter that would take parrot byte code and run it, thereby
utilising all the good stuff that precedes the VM interpreter, plus probably large 
chunks of the parrot VM, but providing it with a more natively compatible target. 

That's obviously not a simple project and is beyond the scope of 
this list. 

Thanks for taking the time to review this.

Nigel Sandever.




Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Luke Palmer
Nigel Sandever writes:
 Maybe it would be possible (for me + others) to write the core of a win32-specific,
 threaded VM interpreter that would take parrot byte code and run it, thereby
 utilising all the good stuff that precedes the VM interpreter, plus probably large 
 chunks of the parrot VM, but providing it with a more natively compatible target. 

You want to write a parrot?  Um, good luck.

One thing Uri mentioned was writing platform-specific macro code, which he
dismissed by saying that it's also a pain.  While I agree that it's a
pain, and it's about as maintenance-friendly as JIT, I don't think it
has to be ruled out.

Parrot is platform-independent, but that doesn't mean we can't take
advantage of platform-specific instructions to make it faster on certain
machines.  Indeed, this is precisely what JIT is.  

But a lock on every PMC is still pretty heavy for those non-x86
platforms out there, and we should avoid it if we can.

Luke

 
 That's obviously not a simple project and is beyond the scope of 
 this list. 
 
 Thanks for taking the time to review this.
 
 Nigel Sandever.
 
 


Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Leopold Toetsch
Nigel Sandever [EMAIL PROTECTED] wrote:
 On Sat, 03 Jan 2004 01:48:07 -0500, [EMAIL PROTECTED] (Uri Guttman) wrote:

 your ideas make sense but only on redmond/intel which is not the target
 space for parrot.

s/not the/by far not the only/

 Maybe it would be possible (for me + others) to write the core of a
 win32-specific, threaded VM interpreter that would take parrot byte
 code and run it, thereby utilising all the good stuff that precedes
 the VM interpreter, plus probably large chunks of the parrot VM, but
 providing it with a more natively compatible target.

I'd be glad if someone fills in the gaps in platform code. There is no
need to rewrite the interpreter or such: Defining the necessary macros
in an efficient way should be enough to start with.

 Nigel Sandever.

leo


Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Leopold Toetsch
Nigel Sandever [EMAIL PROTECTED] wrote:

[ Line length adjusted for readability ]

 VIRTUAL MACHINE INTERPRETER

 At any given point in the running of the interpreter, the VM register
 set, program counter and stack must represent the entire state for
 that thread.

That's exactly what a ParrotInterpreter is: the entire state for a
thread.

 I am completely against the idea of VM threads having a 1:1
 relationship with interpreters.

While these can be separated, it's not efficient. Please note that Parrot
is a register-based VM. Switching state is *not* switching a
stack-pointer and a PC thingy like in stack-based VMs.

 ... The runtime costs of duplicating all
 shared data between interpreters, tying all shared variables, added to
 the cost of locking, throw away almost all of the benefit of using
 threads.

No, shared resources are not duplicated, nor tied. The locking of course
remains, but that's it.

 The distinction is that a critical section does not disable time
  slicing/preemption.  It prevents any thread attempting to enter an
 owned critical section from receiving a timeslice until the current
 owner relinquishes it.

These are platform specific details. We will use whatever the
platform/OS provides. In the source code it's a LOCK() UNLOCK() pair.
The LOCK() can be any atomic operation and doesn't need to call the
kernel, if the lock is acquired.

[ Lot of Wintel stuff snipped ]

 Nigel Sandever.

leo


Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Uri Guttman
 LT == Leopold Toetsch [EMAIL PROTECTED] writes:

  LT These are platform specific details. We will use whatever the
  LT platform/OS provides. In the source code its a LOCK() UNLOCK() pair.
  LT The LOCK() can be any atomic operation and doesn't need to call the
  LT kernel, if the lock is acquired.

if it doesn't call the kernel, how can a thread be blocked? you can't
have user level locks without spinning. at some point (even with fibres)
you need to make a kernel call so other threads can run. the macro layer
will make the mainline source look better but you still need kernel
calls in their definition.

uri

-- 
Uri Guttman  --  [EMAIL PROTECTED]   http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs    http://jobs.perl.org


Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Nigel Sandever
On Sat, 3 Jan 2004 11:35:37 +0100, [EMAIL PROTECTED] (Leopold Toetsch) wrote:
 Nigel Sandever [EMAIL PROTECTED] wrote:
 
  VIRTUAL MACHINE INTERPRETER
 
  At any given point in the running of the interpreter, the VM register
  set, program counter and stack must represent the entire state for
  that thread.
 
 That's exactly what a ParrotInterpreter is: the entire state for a
 thread.

This is only true if a thread == interpreter. 
If a single interpreter can run 2 threads then that single interpreter 
cannot represent the state of both threads safely.

 
  I am completely against the idea of VM threads having a 1:1
  relationship with interpreters.
 

With 5005threads, multiple threads exist in a single interpreter.

 All VHLL level data is shared without duplication, but locking has
 to be performed on each entity. This model is more efficient than
 ithreads. 
 However it was extremely difficult to prevent unwanted interaction
 between the threads corrupting the internal state of the interpreter,
 given the internal architecture of P5, and so it was abandoned.

With ithreads, each thread is also a separate interpreter.

 This insulates the internal state of one interpreter from the other
 but also insulates *all* perl level program data in one interpreter
 from the perl level data in the other. 

 Spawning a new thread becomes a process of duplicating everything.
 The interpreter, the perl program, and all its existing data. 

 Sharing data between the threads/interpreters is implemented by 
 tieing the two copies of the variables to be shared and each time 
 a STORE is performed in one thread, the same STORE has to be 
 performed on the copy of that var held in every other thread's 
 dataspace.

 If 2 threads need to share a scalar, but the program has 10 other
 threads, then each write to the shared scalar requires the update
 of all 12 copies of that scalar. There is no way to indicate that 
 you only need to share it between threads x & y.

 With ithreads, there can be no shared references, so no shared
 objects and no shared compound data structures

 While these can be separated its not efficient. Please note that Parrot
 is a register-based VM. Switching state is *not* switching a
 stack-pointer and a PC thingy like in stack-based VMs.
  ... The runtime costs of duplicating all
  shared data between interpreters, tying all shared variables, added to
  the cost of locking, throw away almost all of the benefit of using
  threads.
 No, shared resources are not duplicated, nor tied. The locking of course
 remains, but that's it.

The issues above are what make p5 ithreads almost unusable. 

If Parrot has found a way of avoiding these costs and limitations
then everything I offered is a waste of time, because these are
the issues I was attempting to address.

However, I have seen no indication here, in the sources or anywhere 
else that this is the case. I assume that the reason Dan opened the
discussion up in the first place is that the perception by those
looking on was that the p5 ithreads model was being suggested as the
way Parrot was going to go. 

And those who have tried to make use of ithreads
under p5 are all too aware that replicating them for Parrot would
be ... [phrase deleted as too emotionally charged :)]

 leo

Nigel.





Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Elizabeth Mattijsen
At 18:20 + 1/3/04, Nigel Sandever wrote:
 Sharing data between the threads/interpreters is implemented by
 tieing the two copies of the variables to be shared and each time
  a STORE is performed in one thread, the same STORE has to be
 performed on the copy of that var held on every other threads
 dataspace.
Hmmm... is that true?  My impression was (and that's the way I've 
implemented it in Thread::Tie) that each STORE actually stores the 
value in a hidden background thread, and each FETCH obtains the 
current value from the background thread.  I don't think each STORE 
is actually cascading through all of the threads.  Not until they try 
to fetch the shared value, anyway.


 If 2 threads need to share a scalar, but the program has 10 other
 threads, then each write to the shared scalar requires the update
 of all 12 copies of that scalar. There is no way to indicate that
  you only need to share it between threads x & y.
That's true.  But there is some implicit grouping.  You can only 
create newly shared variables for the current thread and any thread 
that is started from the current thread.  This is actually a pity, as 
that precludes you from starting a bare (minimal number of Perl modules 
loaded, or with exactly the modules that _you_ want) thread at 
compile time, from which you could start other threads.


 With ithreads, there can be no shared references, so no shared
 objects and no shared compound data structures
Actually, you can bless a reference to a shared variable, but you 
can't share a blessed object (the sharing will make you lose the 
content of the object).  I think shared compound data structures 
_are_ possible, but very tricky to get right (because the CLONE sub 
is called as a package method, rather than as an object method: see 
Thread::Bless for an attempt at getting around this).


   While these can be separated, it's not efficient. Please note that Parrot
 is a register-based VM. Switching state is *not* switching a
 stack-pointer and a PC thingy like in stack-based VMs.
  ... The runtime costs of duplicating all
  shared data between interpreters, tying all shared variables, added to
  the cost of locking, throw away almost all of the benefit of using
  threads.
 No, shared resources are not duplicated, nor tied. The locking of course
  remains, but that's it.
The issues above are what make p5 ithreads almost unusable.
For more about this, see my article on Perl Monks:

  http://www.perlmonks.org/index.pl?node_id=288022


And those who have tried to make use of ithreads
under p5 are all too aware that replicating them for Parrot would
be ... [phrase deleted as too emotionally charged :)]
What can I say?   ;-)

Liz


Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Elizabeth Mattijsen
I'm trying to be constructive here.  Some passages may appear to be 
blunt.  Read at your own risk   ;-)

At 01:48 -0500 1/3/04, Uri Guttman wrote:
  NS == Nigel Sandever [EMAIL PROTECTED] writes:
  NS All that is required to protect an object from corruption through
  NS concurrent access and state change is to prevent two (or more) VMs
  NS trying to access the same object simultaneously. In order for the VM to
  NS address the PMC and must load it into a VM register, it must know where
  NS it is. Ie. It must have a pointer.  It can use this pointer to access
  NS the PMC with a Bit Test and Set operation.  (x86 BTS)
  NS [http://www.itis.mn.it/linux/quarta/x86/bts.htm] This is a CPU atomic
  NS operation. This occurs before the VM enters the VM operations critical
  NS section.
ding! ding! ding! you just brought in a cpu specific instruction which
is not guaranteed to be on any other arch. in fact many have such a
beast but again, it is not accessible from c.
I just _can't_ believe I'm hearing this.  So what if it's not 
accessible from C?  Could we just not build a little C-program that 
would create a small in whatever loadable library?  Or have a 
post-processing run through the binary image inserting the right 
machine instructions in the right places?  Not being from a *nix 
background, but more from a MS-DOS background, I've been used to 
inserting architecture specific machine codes from higher level 
languages into executable streams since 1983!  Don't tell me that's 
not done anymore?  ;-)


.. and still some archs don't
have it so it has to be emulated by dijkstra's algorithm. and that
requires two counters and a little chunk of code.
Well, I guess those architectures will have to live with that.  If 
that's what it takes?


you can't bring x86 centrism into this. the fact that redmond/intel
threads can make use of this instruction to do 'critical sections' is a
major point why it is bad for a portable system with many architectures
to be supported.
I disagree.  I'm not a redmond fan, so I agree with a lot of your 
_sentiment_, but you should also realize that a _lot_ of intel 
hardware is running Linux.  Heck, even some Solarises run on it.

I'm not saying that the intel CPU's are superior to others.  They're 
probably not.  But it's just as with cars: most of the cars get 
people from A to B.  They're not all Maserati's.  Or Rolls Royce's. 
Most of them are Volkswagen, Fiat and whatever compacts you guys 
drive in the States.  ;-)

I don't think we're making Parrot to run well on Maserati's and Rolls 
Royce's only.  We want to reach the Volkswagens.  And not even 
today's Volkswagens: by the time Perl 6 comes around, CPU's will have 
doubled in power yet _again_!

The portability is in Parrot itself: not by using the lowest common 
denominator of C runtime systems out there _today_!   It will take a 
lot of trouble to create a system that will run everywhere, but 
that's just what makes it worthwhile.  Not that it offers the same 
limited capabilities on all systems!


i am well aware about test/set and atomic instructions. my point in my
previous reply was that it can't be assumed to be on a particular
arch. so it has to be assumed to be on none and it needs a pure software
solution. sure a platform specific macro/assembler lib could be done but
that is a pain as well.
Indeed.  A pain, probably.  Maybe not so much.  I can't believe that 
Sparc processors are so badly designed that they don't offer 
something similar to what Nigel suggested for Intel platforms.


again, intel specific and it has to be tossed out.
Again, I think you're thinking too much inside the non-Wintel box. 
You stated yourself just now "...in fact many have such a beast...", so 
maybe it should first be investigated which platforms do support 
this and which don't, and then decide whether it is a good idea or an 
idea to be tossed?


virtual ram is what counts on unix. you can't request some large amount
without using real swap space. it may not allocate real ram until later
(page faulting on demand) but it is swap space that counts. it is used
up as soon as you allocate it.
So, maybe a wrapper is needed, either for *nix, or for Win32, or maybe both.


what win32 does is not portable and not useful at a VM level. kernels
and their threads can work well together. portable VMs and their threads
are another beast that can't rely on a particular architecture
instruction or kernel feature.
This sounds too much like dogma to me.  Why?  Isn't Parrot about 
borgifying all good things from all OS's and VM's, now and in the 
future?   ;-)


hardware is tossed out with portable VM design. we have a user space
process written in c with a VM and VM threads that are to be based on
kernel threads. the locking issues for shared objects is the toughest
nut to crack right now. there is no simple or fast solution to this
given the known constraints. intel/redmond specific solutions are not
applicable (though we can learn from them).
Then 

Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Jeff Clites
On Jan 3, 2004, at 11:19 AM, Elizabeth Mattijsen wrote:

At 01:48 -0500 1/3/04, Uri Guttman wrote:
  NS == Nigel Sandever [EMAIL PROTECTED] writes:
  NS All that is required to protect an object from corruption through
  NS concurrent access and state change is to prevent two (or more) VMs
  NS trying to access the same object simultaneously. In order for the VM
  NS to address the PMC and load it into a VM register, it must know where
  NS it is. Ie. It must have a pointer.  It can use this pointer to access
  NS the PMC with a Bit Test and Set operation.  (x86 BTS)
  NS [http://www.itis.mn.it/linux/quarta/x86/bts.htm] This is a CPU atomic
  NS operation. This occurs before the VM enters the VM operations critical
  NS section.
ding! ding! ding! you just brought in a cpu specific instruction which
is not guaranteed to be on any other arch. in fact many have such a
beast but again, it is not accessible from c.
I just _can't_ believe I'm hearing this.  So what if it's not 
accessible from C?  Could we just not build a little C-program that 
would create a small in whatever loadable library?  Or have a 
post-processing run through the binary image inserting the right 
machine instructions in the right places?  Not being from a *nix 
background, but more from a MS-DOS background, I've been used to 
inserting architecture specific machine codes from higher level 
languages into executable streams since 1983!  Don't tell me that's 
not done anymore?  ;-)
Yes, you are correct--we are already using bits of assembly in parrot. 
C compilers tend to allow you to insert bits of inline assembly, and we 
are taking advantage of that--for instance, look for __asm__ in the 
following files:

jit/arm/jit_emit.h
jit/ppc/jit_emit.h
src/list.c
src/malloc.c
Also, JIT is all about generating platform-specific machine 
instructions at runtime. So it's certainly do-able, and right along the 
lines of what we are already doing.
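
To make that concrete: a wrapped test-and-set of the kind Nigel described 
could look roughly like this -- x86 inline assembly where available, an 
ordinary (kernel-backed) mutex as the portable fallback.  (A sketch only; 
nothing like this is actually in the tree.)

    #include <pthread.h>

    /* Atomically set bit `bit` in *word; returns nonzero if it was
     * already set. */
    #if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    static inline int atomic_bts(volatile unsigned long *word, int bit) {
        int oldbit;
        __asm__ __volatile__("lock; bts %2,%1\n\tsbb %0,%0"
                             : "=r"(oldbit), "+m"(*word)
                             : "Ir"(bit)
                             : "memory", "cc");
        return oldbit;              /* all ones if bit was already set */
    }
    #else
    /* Portable fallback: do the same thing under a plain mutex. */
    static pthread_mutex_t bts_fallback = PTHREAD_MUTEX_INITIALIZER;
    static inline int atomic_bts(volatile unsigned long *word, int bit) {
        pthread_mutex_lock(&bts_fallback);
        int oldbit = (int)((*word >> bit) & 1UL);
        *word |= 1UL << bit;
        pthread_mutex_unlock(&bts_fallback);
        return oldbit;
    }
    #endif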

JEff



Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Leopold Toetsch
Uri Guttman [EMAIL PROTECTED] wrote:
  LT == Leopold Toetsch [EMAIL PROTECTED] writes:

  LT These are platform specific details. We will use whatever the
  LT platform/OS provides. In the source code its a LOCK() UNLOCK() pair.
  LT The LOCK() can be any atomic operation and doesn't need to call the
  LT kernel, if the lock is acquired.

 if it doesn't call the kernel, how can a thread be blocked?

I wrote, *if* the lock is acquired. That's AFAIK the fast path of a futex
or of the described Win32 behavior. The slow path is always a kernel
call (or some rounds of spinning before ...)
But anyway, we don't reinvent these locking primitives.
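
For the curious, the usual shape of that fast-path/slow-path lock (on 
Linux, using futexes and C11 atomics -- a sketch of the standard idiom, 
not anything Parrot-specific) is:

    #include <stdint.h>
    #include <stdatomic.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/futex.h>

    typedef _Atomic int32_t lock_word; /* 0=free, 1=locked, 2=locked+waiters */

    /* compare-and-swap returning the value seen *before* the operation */
    static int32_t cas(lock_word *l, int32_t oldv, int32_t newv) {
        atomic_compare_exchange_strong(l, &oldv, newv);
        return oldv;
    }

    static void LOCK(lock_word *l) {
        int32_t c = cas(l, 0, 1);            /* fast path: no kernel call */
        if (c == 0)
            return;
        do {                                 /* slow path: sleep in kernel */
            if (c == 2 || cas(l, 1, 2) != 0)
                syscall(SYS_futex, l, FUTEX_WAIT, 2, NULL, NULL, 0);
        } while ((c = cas(l, 0, 2)) != 0);
    }

    static void UNLOCK(lock_word *l) {
        if (atomic_fetch_sub(l, 1) != 1) {   /* someone is waiting: wake */
            atomic_store(l, 0);
            syscall(SYS_futex, l, FUTEX_WAKE, 1, NULL, NULL, 0);
        }
    }

The uncontended case is one atomic instruction each way; only contention 
costs a system call.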

 uri

leo


Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Jeff Clites
On Jan 3, 2004, at 10:08 AM, Elizabeth Mattijsen wrote:

At 12:15 -0500 1/3/04, Uri Guttman wrote:
  LT == Leopold Toetsch [EMAIL PROTECTED] writes:
  LT These are platform specific details. We will use whatever the
  LT platform/OS provides. In the source code it's a LOCK() UNLOCK() pair.
  LT The LOCK() can be any atomic operation and doesn't need to call the
  LT kernel, if the lock is acquired.
if it doesn't call the kernel, how can a thread be blocked?
Think out of the box!

Threads can be blocked in many ways.  My forks.pm module uses sockets 
to block threads  ;-).
IO operations which block like that end up calling into the kernel.

But I believe it is usually possible to acquire an uncontested lock 
without calling into the kernel. When you do need to block (when trying 
to acquire a lock which is already held by another thread) you may need 
to enter the kernel. But I think that Leo's point was that in the 
common case of a successful lock operation, it may not be necessary.

JEff



Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Leopold Toetsch
Nigel Sandever [EMAIL PROTECTED] wrote:
 On Sat, 3 Jan 2004 11:35:37 +0100, [EMAIL PROTECTED] (Leopold Toetsch) wrote:
 That's exactly what a ParrotInterpreter is: the entire state for a
 thread.

 This is only true if a thread == interpreter.
 If a single interpreter can run 2 threads then that single interpreter
 cannot represent the state of both threads safely.

Yep. So if a single interpreter (which is almost a thread state) should
run two threads, you have to allocate and swap all of that state. What
would the advantage of such a solution be?

 With 5005threads, multiple threads exist in a single interpreter.

These are obsolete.

 With ithreads, each thread is also a seperate interpreter.

  Spawning a new thread becomes a process of duplicating everything.
  The interpreter, the perl program, and all it existing data.

Partly yes. A new interpreter is created; the program, i.e. the opcode
stream, is *not* duplicated, but JIT or prederef information has to be
rebuilt (on demand, if that run-core is running), and existing
non-shared data items are cloned.

  Sharing data between the threads/interpreters is implemented by
  tieing

Parrot != perl5.ithreads

 If Parrot has found a way of avoiding these costs and limitations
 then everything I offered is a waste of time, because these are
 the issues  was attempting to address.

I posted a very premature benchmark result, where an unoptimized Parrot
build is 8 times faster than the equivalent perl5 code.

 And those who have tried to make use of ithreads
 under p5 are all too aware that replicating them for Parrot would
 be ... [phrase deleted as too emotionally charged :)]

I don't know how ithreads are working internally WRT the relevant issues
like object allocation and such. But threads at the OS level provide
shared code and data segments. So at the VM level you have to unshare
non-shared resources at thread creation. You can copy objects lazily and
make 2 distinct items when writing, or you copy them in the first
place. But you have these costs at thread start - and not later.
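
A rough C sketch of the lazy option (illustrative only; a real 
implementation would need the reference count handled atomically or under 
a lock, and error checking is omitted):

    #include <stdlib.h>
    #include <string.h>

    typedef struct cow_buf {
        int    refcount;
        size_t len;
        char  *data;
    } cow_buf;

    /* Thread creation: don't copy, just take another reference. */
    cow_buf *cow_clone(cow_buf *b) {
        b->refcount++;
        return b;
    }

    /* First write from either side splits it into two distinct items. */
    void cow_write(cow_buf **bp, size_t off, char c) {
        cow_buf *b = *bp;
        if (b->refcount > 1) {
            cow_buf *copy = malloc(sizeof *copy);
            copy->refcount = 1;
            copy->len      = b->len;
            copy->data     = malloc(b->len);
            memcpy(copy->data, b->data, b->len);
            b->refcount--;
            *bp = b = copy;
        }
        b->data[off] = c;
    }

Copying eagerly instead just moves the whole memcpy to thread start; 
either way the cost is paid once, not on every access.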

 Nigel.

leo


Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Uri Guttman
 EM == Elizabeth Mattijsen [EMAIL PROTECTED] writes:

   ding! ding! ding! you just brought in a cpu specific instruction which
   is not guaranteed to be on any other arch. in fact many have such a
   beast but again, it is not accessible from c.

  EM I just _can't_ believe I'm hearing this.  So what if it's not
  EM accessible from C?  Could we just not build a little C-program that
  EM would create a small in whatever loadable library?  Or have a
  EM post-processing run through the binary image inserting the right
  EM machine instructions in the right places?  Not being from a *nix
  EM background, but more from a MS-DOS background, I've been used to
  EM inserting architecture specific machine codes from higher level
  EM languages into executable streams since 1983!  Don't tell me that's
  EM not done anymore?  ;-)

it is not that it isn't done anymore but the effect has to be the same
on machines without test/set. and on top of that, it still needs to be a
kernel level operation so a thread can block on the lock. that is the
more important issue that makes using a test/set in user space a moot
problem.

  EM I disagree.  I'm not a redmond fan, so I agree with a lot of your
  EM _sentiment_, but you should also realize that a _lot_ of intel
  EM hardware is running Linux.  Heck, even some Solarises run on it.

we are talking maybe 10-20 architectures out there that we would want
parrot to run on. maybe more. how many does p5 run on now?

  EM The portability is in Parrot itself: not by using the lowest common
  EM denominator of C runtime systems out there _today_!   It will take a
  EM lot of trouble to create a system that will run everywhere, but that's
  EM just it what makes it worthwhile.  Not that it offers the same limited
  EM capabilities on all systems!

but we need a common denominator of OS features more than one for cpu
features. the fibre/thread stuff is redmond only. and they still require
system calls. so as i said the test/set is not a stopping point
(dijkstra) but the OS support is. how and where and when we lock is the
only critical factor and that hasn't been decided yet. we don't want to
lock at global thread levels and we are not sure we can lock at PMC or
object levels (GC and alloc can break that). we should be focusing on
that issue. think about how DBs did it. sybase used to do page locking
(coarse grained) since it was faster (this was 15 years ago) and they
had the fastest engine. but when multithreading and multicpu designs
came in a finer grained row locking was faster (oracle). sybase fell
behind and has not caught up. we have the same choices to make so we
need to study locking algorithms and techniques from that perspective
and not how to do a single lock (test/set vs kernel). but i will keep
reiterating that it has to be a kernel lock since we must block threads
and GC and such without spinning or manual scheduling (fibres).

   virtual ram is what counts on unix. you can't request some large amount
   without using real swap space. it may not allocate real ram until later
   (page faulting on demand) but it is swap space that counts. it is used
   up as soon as you allocate it.

  EM So, maybe a wrapper is, either for *nix, or for Win32, or maybe both.

this is very different behavior IMO and not something that can be
wrapped easily. i could be wrong but separating virtual allocation from
real allocation can't be emulated without kernel support. and we need
the same behavior on all platforms. this again brings up how we lock so
that GC/alloc will work properly with threads. do we lock a thread pool
but not the thread when we access a shared thingy? that is a medium
grain lock. can the GC/alloc break the lock if it is called inside that
operation? or could only the pool inside the active thread do that? what
about an shared object alloced from thread A's pool but it triggers an
alloc when being accessed in thread B. these are the questions that need
to be asked and answered. i was just trying to point out to nigel that
the intel/redmond solutions are not portable as they require OS
support and that all locks need to be kernel level. given that
requirement, we need to decide how to do the locks so those questions
can be answered with reasonable efficiency. of course a single global
lock would work but that stinks and we all know it. so what is the lock
granularity? how do we handle GC/alloc across shared objects?

  EM This sounds too much like dogma to me.  Why?  Isn't Parrot about
  EM borgifying all good things from all OS's and VM's, now and in the
  EM future?   ;-)

but parrot can only use a common set of features across OS's. we can't
use a redmond feature that can't be emulated on other platforms. and my
karma ran over my dogma :( :-)

   hardware is tossed out with portable VM design. we have a user space
   process written in c with a VM and VM threads that are to be based on
   kernel threads. the locking issues for shared objects is the toughest
   

Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Uri Guttman
 LT == Leopold Toetsch [EMAIL PROTECTED] writes:

  LT Uri Guttman [EMAIL PROTECTED] wrote:
LT == Leopold Toetsch [EMAIL PROTECTED] writes:

  LT These are platform specific details. We will use whatever the
  LT platform/OS provides. In the source code it's a LOCK() UNLOCK() pair.
  LT The LOCK() can be any atomic operation and doesn't need to call the
  LT kernel, if the lock is acquired.

   if it doesn't call the kernel, how can a thread be blocked?

  LT I wrote, *if* the lock is acquired. That's AFAIK the fast path of a futex
  LT or of the described Win32 behavior. The slow path is always a kernel
  LT call (or some rounds of spinning before ...)
  LT But anyway, we don't reinvent these locking primitives.

ok, i missed the 'if' there. :)

that could be workable and might be faster. it does mean that locks are
two step as well, user space test/set and fallback to kernel lock. we
can do what nigel said and wrap the test/set in macros and use assembler
to get at it on platforms that have it or fallback to dijkstra on those
that don't.
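
(for concreteness, here is a minimal sketch of that two-step lock in C --
the vm_lock_t name, the gcc __sync builtins and the pthread fallback are
my own assumptions for illustration, not anything from the parrot tree:)

    #include <pthread.h>

    typedef struct {
        volatile int    held;   /* 0 = free, 1 = taken              */
        pthread_mutex_t mtx;    /* kernel-backed slow path          */
        pthread_cond_t  cond;
    } vm_lock_t;

    static void vm_lock(vm_lock_t *l)
    {
        if (__sync_lock_test_and_set(&l->held, 1) == 0)
            return;                        /* uncontested: no kernel call */
        pthread_mutex_lock(&l->mtx);       /* contested: block in kernel  */
        while (__sync_lock_test_and_set(&l->held, 1) != 0)
            pthread_cond_wait(&l->cond, &l->mtx);
        pthread_mutex_unlock(&l->mtx);
    }

    static void vm_unlock(vm_lock_t *l)
    {
        __sync_lock_release(&l->held);      /* a real futex also skips the  */
        pthread_mutex_lock(&l->mtx);        /* kernel here when nobody waits */
        pthread_cond_signal(&l->cond);
        pthread_mutex_unlock(&l->mtx);
    }

on platforms without a usable test/set builtin the fast path would be
swapped for a pure software fallback (dijkstra-style) or dropped so the
mutex is taken straight away.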

uri

-- 
Uri Guttman  --  [EMAIL PROTECTED]   http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs    http://jobs.perl.org


Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Uri Guttman
 EM == Elizabeth Mattijsen [EMAIL PROTECTED] writes:

  EM At 12:15 -0500 1/3/04, Uri Guttman wrote:
LT == Leopold Toetsch [EMAIL PROTECTED] writes:
  LT These are platform specific details. We will use whatever the
  LT platform/OS provides. In the source code it's a LOCK() UNLOCK() pair.
  LT The LOCK() can be any atomic operation and doesn't need to call the
  LT kernel, if the lock is acquired.
   if it doesn't call the kernel, how can a thread be blocked?

  EM Think out of the box!

  EM Threads can be blocked in many ways.  My forks.pm module uses sockets
  EM to block threads  ;-).

i used that design as well. a farm of worker threads blocked on a pipe
(socketpair) to the same process. the main event loop handled the other
side. worked very well.

  EM It sucks performance wise, but it beats the current perl ithreads
  EM implementation on many platforms in many situations.

i can believe that.

  EM Therefore my motto: whatever works, works.

but i discussed that solution with dan and he shot it down for speed
reasons IIRC. i still think it is an interesting solution. it could also
be used for the main event queue and/or loop as i mention above. we are
assuming some form of sockets on all platforms IIRC, so we can use
socketpair for that. i even use socketpair on win32 to test a
(pseudo)fork thing for file::slurp.
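
(roughly the shape of that worker-farm design, sketched with socketpair()
and pthreads -- illustrative only, nothing from stem or file::slurp:)

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int job_fd[2];                  /* [0] = workers, [1] = main loop */

    static void *worker(void *arg)
    {
        char job;
        while (read(job_fd[0], &job, 1) == 1)   /* block until work arrives */
            printf("worker %ld got job %c\n", (long)arg, job);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[4];
        socketpair(AF_UNIX, SOCK_STREAM, 0, job_fd);
        for (long i = 0; i < 4; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        write(job_fd[1], "abcd", 4);       /* main side hands out four jobs */
        close(job_fd[1]);                  /* EOF lets the workers exit     */
        for (int i = 0; i < 4; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }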

   ...you can't
   have user level locks without spinning. at some point (even with fibres)
   you need to make a kernel call so other threads can run.

  EM Possibly.  I don't know enough of the specifics whether this is
  EM true or not.

i looked at the docs for fibres and they say you do a manual reschedule
by selecting the fibre to run next or i think a yield. but something has
to go to the kernel since even fibres are kernel thingys.

uri

-- 
Uri Guttman  --  [EMAIL PROTECTED]   http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs    http://jobs.perl.org


Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Dave Mitchell
On Sat, Jan 03, 2004 at 08:05:13PM +0100, Elizabeth Mattijsen wrote:
 At 18:20 + 1/3/04, Nigel Sandever wrote:
  Sharing data between the threads/interpreters is implemented by
  tieing the two copies of the variables to be shared and each time
  a STORE is performed in one thread, the same STORE has to be
  performed on the copy of that var held in every other thread's
  dataspace.
 
 Hmmm is that true?  My impression was (and that's the way I've 
 implemented it in Thread::Tie) that each STORE actually stores the 
 value in a hidden background thread, and each FETCH obtains the 
 current value from the background thread.  I don't think each STORE 
 is actually cascading through all of the threads.  Not until they try 
 to fetch the shared value, anyway.

Sharing consists of the real SV living in a shared interpreter, with each
individual thread having a lightweight proxy SV that causes the
appropriate real SV to be accessed/updated by a mixture of magic and/or
tied-ish access. A particular access by one thread does not involve any of
the other threads or their proxies.
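
As a rough C sketch of that proxy arrangement (illustrative only, not
perl's actual data layout):

    #include <pthread.h>

    typedef struct {            /* the "real SV", living in the shared interp */
        pthread_mutex_t lock;
        long            value;  /* simplified payload */
    } shared_sv;

    typedef struct {            /* per-thread proxy: little more than a pointer */
        shared_sv *real;
    } proxy_sv;

    static long proxy_fetch(proxy_sv *p)
    {
        pthread_mutex_lock(&p->real->lock);
        long v = p->real->value;
        pthread_mutex_unlock(&p->real->lock);
        return v;
    }

    static void proxy_store(proxy_sv *p, long v)
    {
        pthread_mutex_lock(&p->real->lock);
        p->real->value = v;
        pthread_mutex_unlock(&p->real->lock);
    }

Each thread's fetch or store goes straight to the one real value; the
other threads and their proxies never get involved.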

  With ithreads, there can be no shared references, so no shared
  objects and no shared compound data structures
 
 Actually, you can bless a reference to a shared variable, but you 
 can't share a blessed object (the sharing will let you lose the 
 content of the object).  I think shared compound data structures 
 _are_ possible, but very tricky to get right (because the CLONE sub 
 is called as a package method, rather than as an object method: see 
 Thread::Bless for an attempt at getting around this).

Nested shared structures work just fine, and there's no need to mess with
CLONE for plain data.

 And the reaction from those who have tried to make use of ithreads
 under p5 are all too aware that replicating them for Parrot would
 be . [phrase deleted as too emotionally charged:)]

It's the implementation of ithreads in p5 that sucks, not the concept
itself. The use of COW makes new thread creation cheap, and the use of
lock PMCs interposed on the real PMCs makes sharing easy.

Dave.

-- 
O Unicef Clearasil!
Gibberish and Drivel!
  - Bored of the Rings


Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Jeff Clites
On Jan 3, 2004, at 12:26 PM, Uri Guttman wrote:

  LT These are platform specific details. We will use whatever the
  LT platform/OS provides. In the source code it's a LOCK() UNLOCK() pair.
  LT The LOCK() can be any atomic operation and doesn't need to call the
  LT kernel, if the lock is acquired.

if it doesn't call the kernel, how can a thread be blocked?
  LT I wrote, *if* the lock is acquired. That's AFAIK the fast path of a futex
  LT or of the described Win32 behavior. The slow path is always a kernel
  LT call (or some rounds of spinning before ...)
  LT But anyway, we don't reinvent these locking primitives.

ok, i missed the 'if' there. :)

that could be workable and might be faster. it does mean that locks are
two step as well, user space test/set and fallback to kernel lock. we
can do what nigel said and wrap the test/set in macros and use assembler
to get at it on platforms that have it
This is probably already done inside of pthread (and Win32) locking 
implementations--a check in userspace before a kernel call. It's also 
important to remember that it's not a two-step process most of the 
time, since most of the locking we are talking about is likely to be 
uncontested (ie, the lock acquisition will succeed most of the time, 
and no blocking will be needed). If we _do_ have major lock contention 
in our internal locking, those will be areas calling for a redesign.

JEff



Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Nigel Sandever
On Sat, 3 Jan 2004 21:00:31 +0100, [EMAIL PROTECTED] (Leopold Toetsch) wrote:
  That's exactly, what a ParrotInterpreter is: the entire state for a
  thread.
 
  This is only true if a thread == interpreter.
  If a single interpreter can run 2 threads then that single interpreter
  cannot represent the state of both threads safely.
 
 Yep. So if a single interpreter (which is almost a thread state) should
 run two threads, you have to allocate and swap all. 

When a kernel level thread is spawned, no duplication of application memory 
is required, only a set of registers, a program counter and a stack. These 
represent the entire state of that thread.

If a VM thread mirrors this, by duplicating the VM program counter, 
VM registers and  VM stack, then this VM thread context can also
avoid the need to replicate the rest of the program data (interpreter).

 What should the
 advantage of such a solution be?

The avoidance of duplication. 
Transparent interlocking of VHLL fat structures performed automatically
by the VM itself. No need for :shared or lock().


 
  With 5005threads, multiple threads exist in a single interpreter.
 
 These are obsolete.

ONLY because they couldn't be made to work properly. The reasons 
for that are entirely due to the architecture of P5.

Dan Sugalski suggested in this list back in 2001, that he would prefer
pthreads to ithreads. 

I've used both in p5, and pthreads are vastly more efficient, but flaky and
difficult to use well. These limitations are due to the architecture upon 
which they were built. My interest is in seeing the Parrot architecture
not exclude them.

 
  With ithreads, each thread is also a seperate interpreter.
 
   Spawning a new thread becomes a process of duplicating everything.
   The interpreter, the perl program, and all it existing data.
 
 Partly yes. A new interpreter is created, the program, i.e. the opcode
 stream is *not* duplicated, but JIT or prederef information has to be
 rebuilt (on demand, if that run-core is running), and existing
 non-shared data items are cloned.
 

Only duplicating shared data on demand (COW) may work well on systems
that support COW in the kernel. But on systems that don't, this has to be
emulated in user space, with all the inherent overhead that implies.

My desire was that the VM_Spawn_Thread, VM_Share_PMC and 
VM_Lock_PMC opcodes could be coded such that, on those platforms where
kernel level COW and other native features mean that the ithreads-style
model of VMthread == kernel thread + interpreter is the best way to go,
that would be the underlying implementation.

On those platforms where VMthread == kernel thread + VMthread context
is the best way, that would be the underlying implementation.

In order for this to be possible, a certain level of support for
both must be ingrained in the design of the interpreter.

My (long) original post, with all the subjects covered and details given, 
was my attempt to describe the support required in the design for the 
latter. It would be necessary to consider all the elements, and the way 
they interact, and take these into consideration when implementing
Parrot's threading in order that this would be achievable.

Each element, the separation of the VM state from the interpreter state,
the atomisation of VM operations, the automated detection and locking of
concurrent access attempts and the serialisation of the VM threads when 
it is detected, all need support at the highest level before they may be 
implemented at the lowest (platform specific) levels.

It simply isn't possible to implement them on one platform at the lowest
levels unless the upper levels of the design are constructed with the 
possibilities in mind.

   Sharing data between the threads/interpreters is implemented by
   tieing
 
 Parrot != perl5.ithreads
 
  If Parrot has found a way of avoiding these costs and limitations
  then everything I offered is a waste of time, because these are
  the issues  was attempting to address.
 
 I posted a very premature benchmark result, where an unoptimized Parrot
 build is 8 times faster then the equivalent perl5 code.
 
  And the reaction from those wh have tried to make use of ithreads
  under p5 are all too aware that replicating them for Parrot would
  be . [phrase deleted as too emotionally charged:)]
 
 I don't know how ithreads are working internally WRT the relevant issues
 like object allocation and such. But threads at the OS level provide
 shared code and data segments. So at the VM level you have to unshare
 non-shared resources at thread creation. 

You only need to copy them if the two threads can attempt to modify
the contents of the objects concurrently. By precluding this possibility,
by atomising VM thread level operations, by preventing a new VM thread
from being scheduled until any other VM thread completes its current 
operation, and by ensuring that each VM thread's state is complete and
coherent before another VM 

Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Uri Guttman
 JC == Jeff Clites [EMAIL PROTECTED] writes:

  JC On Jan 3, 2004, at 12:26 PM, Uri Guttman wrote:

   that could be workable and might be faster. it does mean that locks
   are two step as well, user space test/set and fallback to kernel
   lock. we can do what nigel said and wrap the test/set in macros and
   use assembler to get at it on platforms that have it

  JC This is probably already done inside of pthread (and Win32)
  JC locking implementations--a check in userspace before a kernel
  JC call. It's also important to remember that it's not a two-step
  JC process most of the time, since most of the locking we are taking
  JC about is likely to be uncontested (ie, the lock acquisition will
  JC succeed most of the time, and no blocking will be needed). If we
  JC _do_ have major lock contention in our internal locking, those
  JC will be areas calling for a redesign.

i meant in the coding and not necessarily at runtime. we still need to
address when and where locking happens.

uri

-- 
Uri Guttman  --  [EMAIL PROTECTED]   http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs    http://jobs.perl.org


Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Elizabeth Mattijsen
At 21:11 + 1/3/04, Dave Mitchell wrote:
On Sat, Jan 03, 2004 at 08:05:13PM +0100, Elizabeth Mattijsen wrote:
  Actually, you can bless a reference to a shared variable, but you
 can't share a blessed object (the sharing will let you lose the
 content of the object).  I think shared compound data structures
 _are_ possible, but very tricky to get right (because the CLONE sub
 is called as a package method, rather than as an object method: see
  Thread::Bless for an attempt at getting around this).
Nested shared structures work just fine, and there's no need to mess with
CLONE for plain data.
Indeed.  But as soon as there is something special such as a 
datastructure external to Perl shared between threads (which happens 
automatically, because Perl doesn't know about 
the datastructure, so the cloned objects point to the same memory 
address), then you're in trouble.  Simply because you now have 
multiple DESTROYs called on the same external data-structure.  If the 
function of the DESTROY is to free the memory of the external 
data-structure, you're in trouble as soon as the first thread is 
done.  ;-(


  And the reaction from those who have tried to make use of ithreads
 under p5 are all too aware that replicating them for Parrot would
  be . [phrase deleted as too emotionally charged:)]
It's the implementation of ithreads in p5 that sucks, not the concept
itself. The use of COW makes new thread creation cheap, and the use of
lock PMCs interposed on the real PMCs makes sharing easy.
I agree that Perl ithreads as *a* concept are ok.

The same could be said about what are now referred to as 5.005 
threads.  Ok as *a* concept.  And closer to what many non-Perl people 
consider to be threads.

Pardon my French, but both suck in the implementation.  And it is not 
for lack of effort by the people who developed it.  It is for lack of 
a good foundation to build on.  And we're talking foundation here 
now; we all want to make sure it is the best, earthquake-proofed, 
rocking foundation we can get!

Liz


Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Leopold Toetsch
Uri Guttman [EMAIL PROTECTED] wrote:

 ok, i missed the 'if' there. :)

 that could be workable and might be faster. it does mean that locks are
 two step as well, user space test/set and fallback to kernel lock.

Yep, that is, what the OS provides. I really don't like to reinvent
wheels here - ehem, and nowhere else.

 uri

leo


Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Leopold Toetsch
Uri Guttman [EMAIL PROTECTED] wrote:
 ... this again brings up how we lock so
 that GC/alloc will work properly with threads. do we lock a thread pool
 but not the thread when we access a shared thingy?

This is the major issue, how to continue. Where are shared objects
living (alloc) and where and how are these destroyed (DOD/GC).

But first I'd like to hear some requirements defined, e.g.:

A spawns B, C threads with shared $a
B spawns D, E threads with shared $b

- is $b shared in A or C
- is $a shared in D or E
- are there any (HLL) decisions to decide what is shared
- and so on

If that isn't laid out we don't need to talk about locking a certain
thread pool.

 ...how do we handle GC/alloc across shared objects?

I posted a proposal for that :)

 uri

leo


Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Leopold Toetsch
Elizabeth Mattijsen [EMAIL PROTECTED] wrote:

 Indeed.  But as soon as there is something special such as a
 datastructure external to Perl shared between threads (which happens
 automatically, because Perl doesn't know about
 the datastructure,

Why is it shared automatically? Do you have an example for that?
But anyway, an interesting problem I hadn't considered yet - thanks :)

 ... so the cloned objects point to the same memory
 address), then you're in trouble.  Simply because you now have
 multiple DESTROYs called on the same external data-structure.  If the
 function of the DESTROY is to free the memory of the external
 data-structure, you're in trouble as soon as the first thread is
 done.  ;-(

Maybe that DOD/GC can help here. A shared object can and will be
destroyed only when the last holder of that object has released it.

[ perl5 thread concepts ]

 Pardon my French, but both suck in the implementation.  And it is not
 for lack of effort by the people who developed it.

The problem for sure was to put threads on top of a working
interpreter and a commonly used language. Parrot's design was done with
threads, events, and async IO in mind. It was surprisingly simple to
implement these first steps that are running now.

Separating the HLL layer from the engine probably helps a lot for such
rather major design changes.

 now, we all want to make sure it is the best, earth-quake proofed,
 rocking foundation we can get!

Yep. So again, your input is very welcome,

 Liz

leo


Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Leopold Toetsch
Nigel Sandever [EMAIL PROTECTED] wrote:
 On Sat, 3 Jan 2004 21:00:31 +0100, [EMAIL PROTECTED] (Leopold Toetsch) wrote:
 Yep. So if a single interpreter (which is almost a thread state) should
 run two threads, you have to allocate and swap all.

 When a kernel level thead is spawned, no duplication of application memory
 is required, Only a set of registers, program counter and stack. These
 represent the entire state of that thread.

Here is the current approach, which I've partly implemented: the state of a
thread is basically the interpreter structure - that's it. In terms of
Parrot a thread (a ParrotThread PMC) is derived from an interpreter (a
ParrotInterpreter PMC).

Please remember, Parrot is a register based VM and has a lot of
registers. The whole representation of a VM thread is bigger than and
different from that of a kernel thread.

While scheduling a kernel thread only swaps the above items, a VM
level thread scheduler would have to swap much more.

 If a VM thread mirrors this, by duplicating the VM program counter,
 VM registers and  VM stack, then this VM thread context can also
 avoid the need to replicate the rest of the program data (interpreter).

You are again missing something here: the interpreter *is* the above VM
state - the rest is almost nothing. So the interpreter := thread approach
holds. You can't run even a single - the one and only - thread without
this necessary data, and that's just called an interpreter in Parrot speak.

 ... No need for :shared or lock().

That's the - let's say - type 4 of Dan's layout of different threading
models. Everything is shared by default. That's similar to the shared
PMC type 3 model - except that no objects have to be copied. It certainly
depends on the user code whether one or the other model will have better
performance, so the user can choose. We will provide both.

 Only duplicating shared data on demand (COW) may work well on systems
 that support COW in the kernel.

No, we are dealing with VM objects and structures here - no kernel is
involved for COWed copies of e.g. strings.

[ snips ]

 Each element, the separation of the VM state from the interpreter state,

VM = Virtual machine = interpreter

These can't be separated as they are the same.

 the atomisation of VM operations,

Different VMs can run on different CPUs. Why should we make atomic
instructions out of these? We have a JIT runtime performing at 1 Parrot
instruction per CPU instruction for native integers. Why should we slow
that down by a large factor?

If we have to lock shared data, then you have to pay the penalty, but not
for each piece of code.

 You only need to copy them, if the two threads can attempt to modify
 the contents of the objects concurrently.

I think that you are missing multiprocessor systems totally.

leo


Re: Threads Design. A Win32 perspective.

2004-01-03 Thread Jeff Clites
On Jan 3, 2004, at 2:59 PM, Leopold Toetsch wrote:

Nigel Sandever [EMAIL PROTECTED] wrote:
Only duplicating shared data on demand (COW) may work well on systems
that support COW in the kernel.
No, we are dealing with VM objects and structures here - no kernel is
involved for COWed copies of e.g. strings.
And also, COW at the OS level (that is, of memory pages) doesn't help, 
because we have complex data structures filled with pointers, so 
copying them involves more than just duplicating a block of memory. We 
can use an approach similar to what we do for strings to make a COW 
copy of, for instance, the globals stash, but overall that will only be 
a speed improvement if the data structure is rarely modified. (That is, 
once it's modified, we will have paid the price. Unless we have a clever 
data structure which can be COWed in sections.)

Just adding to what Leo already said.

JEff



Re: Threads Design. A Win32 perspective.

2004-01-02 Thread Nigel Sandever
On Thu, 01 Jan 2004 21:32:22 -0500, [EMAIL PROTECTED] (Uri Guttman) wrote:

UG Uri Guttman
NS Nigel Sandever. (Mostly not reproduced here!)

  NS REENTRANCY

 UG this is true for c level threads but not necessarily true for VM level
 UG threads.  if the VM is atomic at its operation level and can't be
 UG preempted (i.e. it is not using kernel threads with time slicing), 
 
One of the major points of what I am (was?) proposing, is that there 
would be a 1:1 mapping between VM threads and kernel threads. 

The second point is that atomic operation of the VM thread does not imply
that they cannot be preempted. It only means that when one VM thread is 
preempted by the scheduler, no other thread that is waiting to enter the 
same critical section as the first thread is in, will be scheduled 
(by the scheduler), until the first thread exits that critical section.
 
 UG then
 UG it could use thread unsafe calls (as long as it keeps those silly static
 UG buffers clean). 
 
The point about avoiding calling the CRT directly from the main bulk of the 
code does not preclude the use of (suitably thread-aware) CRT calls completely.

The idea is to acknowledge that 
  A) Not all CRT's are equally well crafted or complete.
  B) Thread-aware CRTs are not available with all compilers for all platforms.
  
By writing the main bulk of the code in terms of a virtual CRT using macros, 
the main bulk of the code becomes platform independent. The only change required 
when porting the code to different platforms is to write a set of suitable expansions 
for the macros for each platform/ compiler. Where the available CRT is up to the 
job, the macros expand directly to the underlying CRT call. Where they are not, 
a suitable alternative can be written that bypasses the CRT and goes directly to 
runtime. 

As an example: Compiling P5 with Borland 5.5 for NT, I encountered the restriction
that the Borland runtime didn't support (f)seek or (f)tell on files > 4GB. The 
underlying platform does, and it was relatively trivial to replace the calls to 
the CRT functions with appropriate NT syscalls and so enable both USE_LARGE_FILES
and PERLIO. 

The only difficulty was that instead of being able to make the changes in a single
place, it was necessary to make essentially the same change in 3 or 4 different files.
To compound matters, the modifications required conditional compilation directives at
all 4 places, and these had to be interspersed with other #ifdefs for other platforms 
and other purposes. 

The idea was to avoid this morass of conditional compilation and copy/paste coding by
pushing the conditional directives in the main code to a single

 #if defined( THIS_PLATFORM )
 #include this_platform_crt.h
 
Within that header would be a simple set of #ifdefs to include compiler specific
header files. And within those, all the CRT macro expansions. Thus, I sought to 
ease both porting and maintenance. 

It would also become possible to add virtual syscall definitions to the high level
code, and expand them differently, perhaps radically so, on a platform by platform 
basis. In essence, creating a Virtual OS for the VM.
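
As a sketch of what such a header might look like (the VCRT_SEEK name and
the platform test are hypothetical, purely for illustration):

    #if defined(USE_WIN32_SEEK)       /* e.g. Borland, whose CRT stops at 4GB */
    #  include <windows.h>
       static int vcrt_seek(HANDLE fh, __int64 off)
       {
           LARGE_INTEGER li;
           li.QuadPart = off;
           return SetFilePointerEx(fh, li, NULL, FILE_BEGIN) ? 0 : -1;
       }
    #  define VCRT_SEEK(fh, off)  vcrt_seek((fh), (off))
    #else                             /* the CRT is up to the job: use it */
    #  include <stdio.h>
    #  define VCRT_SEEK(fh, off)  fseeko((fh), (off_t)(off), SEEK_SET)
    #endif

The main code would then only ever say VCRT_SEEK(fh, offset), and never
touch fseek or SetFilePointerEx directly.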
 
 UG parrot will (according to dan) use one interpreter per
 UG VM thread and those may run on kernel threads. 
 
From what I have read, the decision as to whether each VM thread also got a 
separate interpreter was one of the things Dan was opening up to discussion?

Personally, with the perspective of (the forkless) NT platform, and my experience
with trying to make good use of both p5's 5005thread and ithread implementations,
I am completely against the idea of VM threads having a 1:1 relationship with 
interpreters. The runtime costs of duplicating all shared data between interpreters,
tying all shared variables, added to the cost of locking, throw away almost all of 
the benefit of using threads. Combine that overhead with the restrictions of 

 a) No shared references therefore:
 b) No shared objects, therefore:
 c) No shared methods.
 d) No shared compound data structures.
 
And the only hope for multiprocessing becomes forking, which Win32 doesn't support 
natively. And which, without the benefits of process level COW performed by the 
kernel, results in a kludged emulation of a non-native facility that will never 
be usable in any real sense of the word.
 
 UG it may be possible to
 UG disable preemption and/ or time slicing so the VM threads will be atomic
 UG at the VM operation level and then we don't have to worry as much about
 UG thread unsafe libs. 
 
I was trying to demonstrate that there is no need to disable the scheduler in order
for VM threads to achieve atomic VM level operations.
 
 UG but i gather that people want real preemption and
 UG priorities and time slicing so that idea may be moot.
 
Yes. I am one of them:)

 UG but on most
 UG platforms that support kernel threads there are thread safe versions of
 UG most/ all the c lib stuff. 
 
As I said above. This is true, but it is also compiler 

Re: Threads Design. A Win32 perspective.

2004-01-02 Thread Uri Guttman
 NS == Nigel Sandever [EMAIL PROTECTED] writes:

  NS ATOMICITY AND CRITICAL SECTIONS

  UG that is what i mentioned above, disabling time slicing/ preemption when
  UG desired. it is not just a win32 concept. hell, turning off interrupts
  UG during interrupt handlers goes way back! redmond just likes to rename
  UG stuff and act like they invented it. :)
 
  NS I cannot emphasis enough that Critical Sections are different to
  NS both Events and Semaphores.

  NS The distinction is that a critical section does not disable time
  NS slicing/ preemption.  It prevents any thread attempting to enter an
  NS owned critical section from receiving a timeslice until the current
  NS owner relinquishes it. This means that other threads in the same process
  NS and other processes continue to timeslice preemptively.  Only the
  NS thread(s) waiting to enter the specified critical section are blocked.
  NS There is no 'spinning' involved.

it has to disable slicing/preemption if there are multiple kernel
threads running that have access to that data.

  UG in effect it sounds like a thread shared mutex. it could be implemented
  UG in kernel or process space.

  NS They are indeed a form of mutex, but the process space state and
  NS avoiding the need to switch to kernel mode, is crucial, as it is
  NS this that makes them lighter weight than a conventional IPC Mutex.

  NS A CRITICAL_SECTION is a DWORD allocated in user space. It must be
  NS allocated by the user space code, and then initialised through a
  NS syscall. Once initialised, it cannot be moved, copied or modified
  NS (directly) by the user code.

  NS If you have the time and inclination, there is a brief (25 line)
  NS description of them at
  NS 
[http://msdn.microsoft.com/library/en-us/dllproc/base/critical_section_objects.asp]

  UG that flag setting needs to be atomic or a mutex or similar. plain flag
  UG setting won't work. also the blocking has to be kernel level (so a
  UG kernel mutex/ semaphore is needed) so the kernel slicing will work.
 
  NS Indeed. A simple flag *is* all that is necessary to protect shared (fat)
  NS data structures, *if* the VMs are the only code that could achieve
  NS concurrent access to the contents of the PMC, and these are already
  NS interlocked at the operational level.

  NS A PMC is an object, and the only code that can affect the state of
  NS that object are it's methods. The methods can only be invoked by
  NS the VM. Each method constitutes an (atomic) operation at the VM
  NS level.

  NS All that is required to protect an object from corruption through
  NS concurrent access and state change is to prevent two (or more) VMs
  NS trying to access the same object simultaneously. In order for the VM to
  NS address the PMC and must load it into a VM register, it must know where
  NS it is. Ie. It must have a pointer.  It can use this pointer to access
  NS the PMC with a Bit Test and Set operation.  (x86 BTS)
  NS [http://www.itis.mn.it/linux/quarta/x86/bts.htm] This is a CPU atomic
  NS operation. This occurs before the VM enters the VM operations critical
  NS section.

ding! ding! ding! you just brought in a cpu specific instruction which
is not guaranteed to be on any other arch. in fact many have such a
beast but again, it is not accessible from c. and still some archs don't
have it so it has to be emulated by dijkstra's algorithm. and that
requires two counters and a little chunk of code.
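
(for reference, the pure software fallback looks something like the
peterson-style sketch below -- two flags and a turn variable, and on
modern cpus it also needs memory barriers, so treat it as illustration
only:)

    /* two-thread mutual exclusion with no atomic instructions */
    static volatile int flag[2];    /* flag[i]: thread i wants the lock */
    static volatile int turn;       /* whose turn it is to wait         */

    static void sw_lock(int me)
    {
        int other = 1 - me;
        flag[me] = 1;
        turn     = other;
        while (flag[other] && turn == other)
            ;                       /* busy-wait: this is the spinning  */
    }

    static void sw_unlock(int me)
    {
        flag[me] = 0;
    }

and that busy-wait is exactly the spinning problem raised elsewhere in
this thread.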

you can't bring x86 centrism into this. the fact that redmond/intel
threads can make use of this instruction to do 'critical sections' is a
major point why it is bad for a portable system with many architectures
to be supported.

  NS Once the bit has been set in the PMC header by one VM, if the scheduler
  NS intervened exactly before the next instruction in that thread, and
  NS scheduled a second VM thread to run, and that VM also attempts to access
  NS that same PMC,the first operation it tries to perform is the same BTS
  NS instruction. The next instruction then tests the CF (Carry flag) and
  NS knows that the PMC is already in use. Hence, it then relinquishes
  NS (yields) the rest of its timeslice.

i am well aware about test/set and atomic instructions. my point in my
previous reply was that it can't be assumed to be on a particular
arch. so it has to be assumed to be on none and it needs a pure software
solution. sure a platform specific macro/assembler lib could be done but
that is a pain as well.

  NS Every time the second (or subsequent) VM attempts to use a PMC
  NS that is already in-use, it finds the bit set and immediately
  NS yields. This will only happen during the interval between the
  NS first VM issuing the BTS, and that same VM entering the
  NS critsec. Once the critsec has been entered, no other VM that could
  NS load that PMC, would even be scheduled.

yield is also a kernel specific thing and what will wake up the thread
after that? it has to block on some kernel thingy and the 

Threads Design. A Win32 perspective.

2004-01-01 Thread Nigel Sandever
This is going to be extremely light on details with respect to the current state of 
the Parrot interpreter. 

It is also going to be expressed in terms of Win32 APIs.

For both of these I apologise in advance. Time, and the "or forever hold your peace" 
imperative, has overridden my desire to do otherwise. 

My attempts to get up to speed enough on the sources to work out how to apply the ideas I 
am going to express, and to become conversant with the *nix 
pthreads implementation and terminology, have moved too slowly to afford me the 
opportunity to do more.

Hoping this stimulates discussion rather than a flame war.

Regards, Nigel Sandever.

PS.  Sorry if this gets posted twice.


THE BASIS OF THE IDEA

Modern OSs succeed in having multiple threads of execution share a single copy of 
process memory without the operations of one thread being able to 
interfere with the state of another. The state of the code running in those threads 
may be corrupted through mismanagement. But the state of the 
threads themselves, their program counters, registers and stacks cannot. The 
mechanisms for this incorruptibility are: 

Each operation (opcode) performed by the thread is atomic. 
The scheduler can never interrupt a thread whilst an operation is in 
progress. Only between operations.

Before the start of an operation, and after the end of one, the state of the 
thread is entirely encapsulated within the registers and stack. 
By swapping the entire state of the CPU register set, when switching from 
one thread to another, the state of each thread is preserved
and reconstituted. No other mechanisms or interlocks are required.

By analogy, a virtual machine that wishes to have multiple threads of execution must 
achieve the same level of atomicity for each operation it performs. 

VIRTUAL MACHINE INTERPRETER

At any given point in the running of the interpreter, the VM register set, program 
counter and stack must represent the entire state for that thread. 
Once an opcode has started execution on a given thread, no other thread of execution 
within that interpreter must be allowed to start an operation until 
the first thread completes its opcode. 

NON-VMI THREADS 

ASYNCHRONOUS IO

Note that this does not mean that no other thread in the process can take a timeslice, 
only that any thread that is allowed to run should not be able to 
affect the state of the VM in question. A thread charged with performing asynchronous 
reads on behalf of the user program running within the VM 
interpreter can go ahead so long as it doesn't directly modify the VMI state. 

EVENT MANAGER THREAD

Equally, an event thread can also be run concurrently in the background to receive 
asynchronous notifications (signals, messages, asynchronous read 
completions etc.). It can then queue these events and set a flag that the VMI can 
inspect between each iteration of the opcode execution loop and action 
appropriately. This gives the benefit of safe signals along with safe and timely 
processing of other, similarly asynchronous events.
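
A rough sketch, in Win32 terms, of the dispatch loop this implies (one
critical section entered per opcode, plus the event flag checked between
opcodes; all names here are illustrative only, not Parrot source):

    #include <windows.h>

    typedef struct vm_thread { int running; /* registers, stacks, ... */ } vm_thread;

    static CRITICAL_SECTION vm_cs;      /* initialised at interpreter creation */
    static volatile LONG    event_flag; /* set by the event manager thread     */

    static void dispatch_one_opcode(vm_thread *t) { (void)t; /* ... */ }
    static void drain_event_queue(vm_thread *t)   { (void)t; event_flag = 0; }

    void run_vm(vm_thread *t)
    {
        while (t->running) {
            EnterCriticalSection(&vm_cs);   /* each opcode is atomic at VM level */
            dispatch_one_opcode(t);
            LeaveCriticalSection(&vm_cs);

            if (event_flag)                 /* asynchronously queued events are  */
                drain_event_queue(t);       /* actioned between opcodes          */
        }
    }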

GARBAGE COLLECTION

The garbage collector would need to run *synchronously* with the interpreter opcode 
loop. Effectively, it would be another (possibly long running) opcode. 
An analogy of this is all the long running syscalls that operate within a 
multi-tasking OS. Eg. synchronous IO, virtual memory allocation, swapping etc. 
Just as virtual memory operations suspend the affected processes until the operations 
are complete, so garbage collection can be seen as a virtual memory 
operation for the virtual machine that requires the VMI to be suspended until the 
operation is complete.

PREREQUISITES

The key is for the VM to be reentrant, and the use of (in win32 terms) a critical 
section.

REENTRANCY

Not only must the VMI be coded in a reentrant fashion, with all state addressed 
through pointers (references) loaded into its Virtual register set. All 
the code underlying it, including syscalls and CRT must equally be reentrant. Many 
APIs within many CRTs *are not reentrant* (eg. strtok()). All state 
must be on a per-thread not a per-process basis.

To this end, I would propose that no CRT APIs are used directly from the main code!

Instead, a full set of CRT-like macros would be inherited from a header file, where 
either the real CRT API would be called, or an alternative 
implementation. This header file would be conditionally included on the basis of the 
target platform. This concentrates the bulk, if not all, platform 
specific code into a single file (or set of files). 

EXPANDING THE VIRTUAL MACHINE

What I am suggesting here is that the virtual machine analogy be extended to encompass 
not just the VHLL-user-program view of the registers and stack, 
but also its view of the entire machine and underlying OS. 

By so doing, not only does it allow those working on the core of the VMI to be 
isolated from the 

Re: Threads Design. A Win32 perspective.

2004-01-01 Thread Uri Guttman
 NS == Nigel Sandever [EMAIL PROTECTED] writes:

  NS REENTRANCY

  NS Not only must the VMI be coded in a reentrant fashion, with all state
  NS addressed through pointers (references) loaded into its Virtual
  NS register set. All the code underlying it, including syscalls and CRT
  NS must equally be reentrant. Many APIs within many CRTs *are not
  NS reentrant* (eg. strtok()). All state must be on a per-thread not a
  NS per-process basis.

  NS To this end, I would propose that no CRT APIs are used directly from the
  NS main code!

  NS Instead, a full set of CRT-like macros would be inherited from a header
  NS file, where either the real CRT API would be called, or an alternative
  NS implementation. This header file would be conditionally included on the
  NS basic of the target platform. This concentrates the bulk, if not all,
  NS platform specific code into a single file (or set of files).

this is true for c level threads but not necessarily true for VM level
threads. if the VM is atomic at its operation level and can't be
preempted (i.e. it is not using kernel threads with time slicing), then
it could use thread unsafe calls (as long as it keeps those silly static
buffers clean). parrot will (according to dan) use one interpreter per
VM thread and those may run on kernel threads. it may be possible to
disable preemption and/or time slicing so the VM threads will be atomic
at the VM operation level and then we don't have to worry as much about
thread unsafe libs. but i gather that people want real preemption and
priorities and time slicing so that idea may be moot. but on most
platforms that support kernel threads there are thread safe versions of
most/all the c lib stuff. now, external libs that get linked in under
nci is a different story.


  NS ATOMICITY AND CRITICAL SECTIONS

  NS Atomicity of the VMI opcodes can be achieved by having the core
  NS loop of the interpreter enter and exit a critical section each
  NS time it processes an opcode. For those not familiar with critical
  NS sections, they are a (Win32) OS construct that prevents any
  NS cooperating thread within a process, from having timeslice until
  NS the thread that entered the critical section has exited.

that is what i mentioned above, disabling timeslicing/preemption when
desired. it is not just a win32 concept. hell, turning off interrupts
during interrupt handlers goes way back! redmond just likes to rename
stuff and act like they invented it. :)

  NS Unlike semaphores and events, critsecs only operate within a
  NS single process. They are relatively lightweight as there is no
  NS need to enter kernel mode for their operation. They are, in effect
  NS a CPU specific test and set operation applied to a piece of user
  NS mode memory. Their lightweight nature means that they are faster
  NS than inter-process semaphore mechanisms. When used in a process
  NS that currently has only a single thread in operation, they have
  NS almost negligible performance effect upon that thread. They also
  NS operate correctly on SMP machines.

in effect it sounds like a thread shared mutex. it could be implemented
in kernel or process space.

  NS If two threads attempt concurrent operations on the same PMC, the
  NS first thread accessing the PMC sets the flag. When a second thread
  NS attempt to access it, it finds the flag set and blocks
  NS (relinquishing it's timeslice) until the first thread has
  NS completed and cleared the flag. This doesn't solve the potential
  NS for deadlock problems arising from the user-level code, though it
  NS may be possible for the VMI to detect and diagnose these at
  NS runtime in a similar way that deep-recursion detection is done in
  NS P5.

that flag setting needs to be atomic or a mutex or similar. plain flag
setting won't work. also the blocking has to be kernel level (so a
kernel mutex/semaphore is needed) so the kernel slicing will work.

deadlock detection is a known problem in the DB world and we can do
similar things. but are those locks VM level or kernel level? if the
coder wants to prevent deadlocks they have to do a higher level lock
(analogous to a full DB lock or transaction). we can't stop stupid
coders from themselves.

  NS The cost to the single-threaded user application is for a test/set
  NS operation (a few cycles), per object-method, plus the flag, which
  NS need only be one bit, but maybe more efficiently coded as a 32-bit
  NS entity.

more than test/set is needed as it has to be atomic. some cpu's have that
instruction but we can't get at it directly from c and we have to
emulate it on platforms which don't have it. assuming it is short and
fast is not a good idea. now the atomic semaphore problem was solved
long ago by dijkstra and it is a two phase test/set gizmo. it needs
several machine instructions or a few lines of c. this is not a trivial
amount of overhead for each access to a shared object.  also this type
of lock is in user space and won't block the