Re: Threads Design. A Win32 perspective.
At 23:10 -0500 1/5/04, Gordon Henriksen wrote: Data corruption unacceptable? I disagree. It depends on the contract put forward by the language in question. Notably, Perl makes no such guarantees thus far, being as how it doesn't (any longer) run in a traditional threaded model. Successfully threaded languages and virtual machines explicitly make NO such guarantees. I'm completely okay with the potential for inconsistent data where the user doesn't take precautions. If the alternative is my code bogging down to half of its ideal speed because parrot is wasting time: acquiring three mutexes with deadlock detection so that it can execute add $P10, $P11, $P12; acquiring a mutex every time it indexes a string register; acquiring a mutex so it can examine cache->int_val -- if those are the alternatives, I'm for the potential for inconsistent data. (And I say wasting because it's completely unnecessary in a good 99% of cases; the vast majority of data is not shared between threads at all.) I agree with most of what is said here. However, I think there should be some type of debugging mode that _would_ do all of the locking and deadlock detection (and whatever other debugging capabilities we can think of). Occasional, hard-to-reproduce data corruption is the single most problematic thing to debug in threaded applications. It creates Heisenbugs that will only show up under specific load conditions and/or on specific platforms. And when that happens, it doesn't matter that your threaded application doesn't crash: for the users it does the equivalent, namely produces the wrong results occasionally, and is therefore not production ready. Liz
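The debugging mode Liz proposes could be sketched as a wrapper around the VM's lock primitive. This is a hypothetical illustration, not anything parrot defines: the `debug_lock` type and function names are invented here. In a debug build, LOCK() would route through a wrapper that records the owning thread and catches the simplest deadlock, a thread re-acquiring a mutex it already holds:

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical debug-build lock wrapper. All names here are invented
   for illustration; parrot defines no such debug_lock type. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_t       owner;
    int             held;
} debug_lock;

void debug_lock_init(debug_lock *l) {
    pthread_mutex_init(&l->mutex, NULL);
    l->held = 0;
}

/* Returns 0 on success, -1 on detected self-deadlock. The unlocked
   read of l->held is safe for this particular check: only the calling
   thread could have set owner to itself, so a positive match is never
   stale. */
int debug_lock_acquire(debug_lock *l) {
    if (l->held && pthread_equal(l->owner, pthread_self())) {
        fprintf(stderr, "deadlock: thread re-acquired its own lock\n");
        return -1;
    }
    pthread_mutex_lock(&l->mutex);
    l->owner = pthread_self();
    l->held  = 1;
    return 0;
}

void debug_lock_release(debug_lock *l) {
    l->held = 0;
    pthread_mutex_unlock(&l->mutex);
}
```

A production build would compile LOCK() straight down to the bare mutex (or to nothing, for unshared PMCs), so the bookkeeping cost is paid only when the user asks for it.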
Re: Threads Design. A Win32 perspective.
At 23:10 -0500 1/5/04, Gordon Henriksen wrote: Data corruption unacceptable? I disagree. I get the feeling people just aren't reading what's been written, or aren't keeping it all straight. *User* and *program* data integrity is not our problem -- not only are we not guaranteeing that, I'd be fine with parrot actively corrupting the contents of shared data structures that are accessed without synchronization. *Interpreter* data corruption is absolutely unacceptable and will not be tolerated; the interpreters will do whatever synchronization they need to make sure it doesn't happen. -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Threads Design. A Win32 perspective.
On Sunday, January 4, 2004, at 03:17 , Jeff Clites wrote: On Jan 3, 2004, at 8:59 PM, Gordon Henriksen wrote: On Saturday, January 3, 2004, at 04:32 , Nigel Sandever wrote: Transparent interlocking of VHLL fat structures performed automatically by the VM itself. No need for :shared or lock(). Completely specious, and repeatedly proven unwise. Shouldn't even be pursued. Atomic guarantees on collections (or other data structures) are rarely meaningful; providing them is simply a waste of time. Witness the well-deserved death of Java's synchronized Vector class in favor of ArrayList. The interpreter absolutely shouldn't crash due to threading errors -- it should protect itself using standard techniques -- but it would be a mistake for parrot to mandate that all ops and PMCs be thread-safe. What are these standard techniques? The JVM spec does seem to guarantee that even in the absence of proper locking by user code, things won't go completely haywire, but I can't figure out how this is possible without actual locking. (That is, I'm wondering if Java is doing something clever.) For instance, inserting something into a collection will often require updating more than one memory location (especially if the collection is out of space and needs to be grown), and I can't figure out how this could be guaranteed not to completely corrupt internal state in the absence of locking. (And if it _does_ require locking, then it seems that the insertion method would in fact then be synchronized.) So my question is, how do JVMs manage to protect internal state in the absence of locking? Or do they? The JVM won't crash, but individual objects can easily become corrupt. Java makes no guarantees of the level you're proposing. All synchronization in Java is explicit, using either synchronized methods, or synchronized (...) { ... } blocks; the runtime doesn't implicitly burn cycles synchronizing objects for the user.
It does, however, give the user the threading primitives which allow them to write thread-safe code. Thread-safety is, indeed, left entirely to the user. This is crucial to allowing code to execute quickly and for the optimizer to function. The JVM protects itself, ensures that its core is thread-safe, and provides a secure operating environment for user code; it is the responsibility of the authors of classes to protect instances of their classes. Remember, Java's libraries are implemented atop these primitives *in Java.* It would be analogous to propose that parrot-based languages be primarily implemented atop its primitives in PBC, and that the sections which are not so-implemented (the core) must be scrutinized carefully for threadsafety. What are these threadsafe primitives which Java provides? For example: 1. Fixed-size arrays are safe to access concurrently from multiple threads. 2. ints, chars, pointers, etc. can be updated atomically. 3. Java's memory allocator is thread-safe. 4. Strings are immutable (and thus sharable without threadsafety violations). 5. The JVM does not allow memory to be resized, making bounds-checking threadsafe. 6. No objects can be moved (while retaining their identity) except by the garbage collector. 7. All threads are suspended during GC. 8. All pointers are traceable by GC. That parrot is largely implemented in C, and is designed to operate in terms of fat data structures, makes it much more difficult to guarantee in parrot that segfaults will not occur. I have some ideas which I'll share after I sync up with this fast-moving thread (of conversation). But one of my premises is that implicitly guaranteeing atomic PMC accesses will render threaded parrot dead-on-arrival. Just like the last 2 abortive attempts at creating a threading Perl implementation. (And threaded parrot would be a separate executable, because the performance hit will be so staggering.) 
The sheer overhead of locking is an obvious deal-killer, but I think that deadlocks and prohibition of optimizations would quickly put a few extra nails in the coffin. I don't want to see the Perl community left adrift with another toy threading implementation unsuitable for any serious work. Gordon Henriksen [EMAIL PROTECTED]
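Gordon's model -- the VM supplies primitives, and the class author supplies the thread-safety -- can be sketched in C, parrot's implementation language. The names below are invented for illustration; the point is that the grow-and-insert case Jeff asks about is made safe by the *author* of the container taking an explicit mutex, exactly as a Java class author writes a synchronized method, not by the VM locking every operation:

```c
#include <pthread.h>
#include <stdlib.h>

/* Illustrative only: a growable int array whose push is made
   thread-safe by its author with an explicit mutex. Parrot's actual
   PMC layout differs; error handling is omitted for brevity. */
typedef struct {
    pthread_mutex_t mutex;
    int    *data;
    size_t  len, cap;
} locked_array;

void locked_array_init(locked_array *a) {
    pthread_mutex_init(&a->mutex, NULL);
    a->data = NULL;
    a->len = a->cap = 0;
}

/* Growing touches several fields (data, len, cap); without the mutex
   a concurrent push could observe a half-updated structure -- this is
   exactly the multi-location update that cannot be made safe without
   some form of locking. */
void locked_array_push(locked_array *a, int v) {
    pthread_mutex_lock(&a->mutex);
    if (a->len == a->cap) {
        size_t ncap = a->cap ? a->cap * 2 : 8;
        a->data = realloc(a->data, ncap * sizeof(int));
        a->cap  = ncap;
    }
    a->data[a->len++] = v;
    pthread_mutex_unlock(&a->mutex);
}
```

An unsynchronized variant would simply omit the mutex, and the VM would neither add one nor pay for one, which is the optimizer-friendly arrangement Gordon argues for.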
Re: Threads Design. A Win32 perspective.
On Sunday, January 4, 2004, at 01:43 , Dan Sugalski wrote: At 11:59 PM -0500 1/3/04, Gordon Henriksen wrote: On Saturday, January 3, 2004, at 04:32 , Nigel Sandever wrote: Transparent interlocking of VHLL fat structures performed automatically by the VM itself. No need for :shared or lock(). Completely specious, and repeatedly proven unwise. Shouldn't even be pursued. Erm... that turns out not to be the case. A lot. (Yeah, I know, I said I wasn't paying attention) An interpreter *must* lock any shared data structure, including PMCs, when accessing them. Otherwise they may be in an inconsistent state when being accessed, which will lead to data corruption or process crashing, which is unacceptable. These locks do not have to correspond to HLL-level locks, though it'd be a reasonable argument that they share the same mutex. Process crashing unacceptable? Absolutely. Complete agreement. Data corruption unacceptable? I disagree. It depends on the contract put forward by the language in question. Notably, Perl makes no such guarantees thus far, being as how it doesn't (any longer) run in a traditional threaded model. Successfully threaded languages and virtual machines explicitly make NO such guarantees. I'm completely okay with the potential for inconsistent data where the user doesn't take precautions. If the alternative is my code bogging down to half of its ideal speed because parrot is wasting time: acquiring three mutexes with deadlock detection so that it can execute add $P10, $P11, $P12; acquiring a mutex every time it indexes a string register; acquiring a mutex so it can examine cache->int_val -- if those are the alternatives, I'm for the potential for inconsistent data. (And I say wasting because it's completely unnecessary in a good 99% of cases; the vast majority of data is not shared between threads at all.)
I'm not okay with parrot crashing for any reason, ever, though, so where locking is not provided, PMC classes *must* be coded *very* carefully so as to avoid crashes. Gordon Henriksen [EMAIL PROTECTED]
RE: Threads Design. A Win32 perspective.
Gordon Henriksen wrote: Dan Sugalski wrote: An interpreter *must* lock any shared data structure, including PMCs, when accessing them. Otherwise they may be in an inconsistent state when being accessed, which will lead to data corruption or process crashing, which is unacceptable. Process crashing unacceptable? Absolutely. Complete agreement. Data corruption unacceptable? I disagree. It depends on the contract put forward by the language in question. Notably, Perl makes no such guarantees thus far, being as how it doesn't (any longer) run in a traditional threaded model. Successfully threaded languages and virtual machines explicitly make NO such guarantees Dan's non-negotiable Thread notes didn't say the VM's guarantees were non-negotiable ;) Are there differences in the contracts put forth by existing languages which we hope to have parrot implementations of? Ironically, wouldn't setting the guarantees against data corruption this high pretty much rule out a pure parrot implementation of an SQL-92 compliant database? SQL-92 defines 4 transaction isolation levels in terms of the phenomena they're meant to allow/avoid: dirty reads, non-repeatable reads, and phantoms. Allowing language level variation in concurrency vs. correctness guarantees? Can of worms? I can already hear Dan: Nothing to be seen here. Move along. On the off chance that anyone's interested, here's a critique of the ANSI SQL-92 isolation levels: http://citeseer.nj.nec.com/berenson95critique.html -- Garrett Goebel IS Development Specialist ScriptPro Direct: 913.403.5261 5828 Reeds Road Main: 913.384.1008 Mission, KS 66202 Fax: 913.384.2180 www.scriptpro.com garrett at scriptpro dot com
Re: Threads Design. A Win32 perspective.
At 4:03 PM -0800 1/4/04, Damien Neil wrote: It's my understanding that Parrot has chosen to take the path of using many mutable data structures at the VM level; unfortunately, this is pretty much incompatible with a fast or elegant threading model. Yep, generally true. The choice was made on purpose, and I was aware of most of the ramifications of it, both good and ill. -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Threads Design. A Win32 perspective.
At 00:49 +0100 1/4/04, Leopold Toetsch wrote: Elizabeth Mattijsen [EMAIL PROTECTED] wrote: Indeed. But as soon as there is something special such as a datastructure external to Perl between threads (which becomes shared automatically, because Perl doesn't know about the datastructure, Why is it shared automatically? Do you have an example for that? When you use an external library in Perl, such as e.g. libxml, you have Perl data-structures and libxml data-structures. The Perl data-structures contain pointers to the libxml data-structures. In comes the starting of an ithread and Perl clones all of the Perl data-structures. But it _only_ copies things it knows about. And thus leaves the pointers to the libxml data-structures untouched. Now you have 2 Perl data-structures that point to the _same_ libxml data-structures. Voila, instant sharing. With disastrous results. Because as soon as the thread ends, the cloned Perl object in the thread goes out of scope. Perl then calls the DESTROY method on the object, which then frees up the libxml data-structures. That's what it's supposed to do. Meanwhile, back in the original thread, the pointers in the Perl object now point at freed memory, rather than a live libxml data-structure. And chaos ensues sooner or later. Of course, chaos could well ensue before the thread is ended, because both threads think they have exclusive access to the libxml data-structure. Hope this explanation made sense. ... so the cloned objects point to the same memory address), then you're in trouble. Simply because you now have multiple DESTROYs called on the same external data-structure. If the function of the DESTROY is to free the memory of the external data-structure, you're in trouble as soon as the first thread is done. ;-( Maybe that DOD/GC can help here. A shared object can and will be destroyed only when the last holder of that object has released it.
But do you see now how complicated this can become if thread === interpreter? Liz
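The "destroy only when the last holder releases" idea Leopold floats is ordinary reference counting, and the double-DESTROY problem can be sketched in C. This is an illustration only; the `shared_handle` type and its functions are invented names, not parrot or libxml API. Wrap the external (libxml-style) structure in a refcounted handle so that DESTROY in each thread drops a reference, and the real free happens exactly once:

```c
#include <pthread.h>
#include <stdlib.h>

/* Illustrative refcounted wrapper for an external library structure.
   Names are invented for this sketch. */
typedef struct {
    pthread_mutex_t mutex;
    int   refs;
    void *external;                 /* the library's own structure  */
    void (*free_external)(void *);  /* how to free it, exactly once */
} shared_handle;

/* Demonstration freer that just counts calls, so "freed exactly once"
   is observable without real library data. */
int demo_frees = 0;
void demo_free(void *p) { (void)p; demo_frees++; }

shared_handle *handle_new(void *ext, void (*freer)(void *)) {
    shared_handle *h = malloc(sizeof *h);
    pthread_mutex_init(&h->mutex, NULL);
    h->refs = 1;
    h->external = ext;
    h->free_external = freer;
    return h;
}

/* Called when a thread clones the Perl object pointing at this handle. */
void handle_ref(shared_handle *h) {
    pthread_mutex_lock(&h->mutex);
    h->refs++;
    pthread_mutex_unlock(&h->mutex);
}

/* Called from each thread's DESTROY; frees only on the last release. */
void handle_release(shared_handle *h) {
    pthread_mutex_lock(&h->mutex);
    int last = (--h->refs == 0);
    pthread_mutex_unlock(&h->mutex);
    if (last) {
        h->free_external(h->external);
        free(h);
    }
}
```

The catch, as the thread goes on to note, is that this only fixes premature freeing; both threads still mutate the same underlying structure, so the exclusivity problem remains.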
Re: Threads Design. A Win32 perspective.
On Saturday, January 3, 2004, at 04:32 , Nigel Sandever wrote: Transparent interlocking of VHLL fat structures performed automatically by the VM itself. No need for :shared or lock(). Completely specious, and repeatedly proven unwise. Shouldn't even be pursued. Atomic guarantees on collections (or other data structures) are rarely meaningful; providing them is simply a waste of time. Witness the well-deserved death of Java's synchronized Vector class in favor of ArrayList. The interpreter absolutely shouldn't crash due to threading errors -- it should protect itself using standard techniques -- but it would be a mistake for parrot to mandate that all ops and PMCs be thread-safe. The details of threaded programming cannot be hidden from the programmer. It's tempting to come up with clever ways to try, but the engine really has to take a back seat here. Smart programmers will narrow the scope of potential conflicts by reducing sharing of data structures in their threaded programs. Having done so, any atomicity guarantees on individual objects prove to be wasted effort: They will be resented by parrot's users as needless overhead, not praised. Consider the potential usage cases. 1. All objects in a non-threaded program. 2. Unshared objects in a threaded program. 3. Shared objects in a threaded program. The first two cases will easily comprise 99% of all usage. In only the third case are synchronized objects even conceivably useful, and even then the truth of the matter is that they are of extremely limited utility: Their guarantees are more often than not too fine-grained to provide the high-level guarantees that the programmer truly needs. In light of this, the acquisition of a mutex (even a mutex that's relatively cheap to acquire) to push an element onto an array, or to access a string's data -- well, it stops looking so good. That said, the interpreter can't be allowed to crash due to threading errors. It must protect itself.
But should a PerlArray written to concurrently from 2 threads guarantee that its state makes sense at the end of the program? I say no, based upon precedent; the cost is too high. Gordon Henriksen [EMAIL PROTECTED]
Re: Threads Design. A Win32 perspective.
Elizabeth Mattijsen [EMAIL PROTECTED] wrote: When you use an external library in Perl, such as e.g. libxml, you have Perl data-structures and libxml data-structures. The Perl data-structures contain pointers to the libxml data-structures. In comes the starting of an ithread and Perl clones all of the Perl data-structures. But it _only_ copies things it knows about. And thus leaves the pointers to the libxml data-structures untouched. Now you have 2 Perl data-structures that point to the _same_ libxml data-structures. Voila, instant sharing. I see. Our library loading code should take care of that. On thread creation we call again the _init code, so that the external lib can prepare itself to be used from multiple threads. But don't ask me about details ;) Liz leo
Re: Threads Design. A Win32 perspective.
At 14:47 +0100 1/4/04, Leopold Toetsch wrote: Elizabeth Mattijsen [EMAIL PROTECTED] wrote: When you use an external library in Perl, such as e.g. libxml, you have Perl data-structures and libxml data-structures. The Perl data-structures contain pointers to the libxml data-structures. In comes the starting of an ithread and Perl clones all of the Perl data-structures. But it _only_ copies things it knows about. And thus leaves the pointers to the libxml data-structures untouched. Now you have 2 Perl data-structures that point to the _same_ libxml data-structures. Voila, instant sharing. I see. Our library loading code should take care of that. On thread creation we call again the _init code, so that the external lib can prepare itself to be used from multiple threads. But don't ask me about details ;) What you need is basically being able to: - register a class method to be called on cloning - register an object method that is called whenever an _object_ is cloned The CLONE sub that Perl5 has is the class method. The object method is missing from Perl (Thread::Bless is a way to remedy this problem). I don't know what the _init code does, but judging by its name, it's not giving enough info to be able to properly clone an object with external data structures. Liz
Re: Threads Design. A Win32 perspective.
At 11:59 PM -0500 1/3/04, Gordon Henriksen wrote: On Saturday, January 3, 2004, at 04:32 , Nigel Sandever wrote: Transparent interlocking of VHLL fat structures performed automatically by the VM itself. No need for :shared or lock(). Completely specious, and repeatedly proven unwise. Shouldn't even be pursued. Erm... that turns out not to be the case. A lot. (Yeah, I know, I said I wasn't paying attention) An interpreter *must* lock any shared data structure, including PMCs, when accessing them. Otherwise they may be in an inconsistent state when being accessed, which will lead to data corruption or process crashing, which is unacceptable. These locks do not have to correspond to HLL-level locks, though it'd be a reasonable argument that they share the same mutex. -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Threads Design. A Win32 perspective.
On Jan 4, 2004, at 5:47 AM, Leopold Toetsch wrote: Elizabeth Mattijsen [EMAIL PROTECTED] wrote: When you use an external library in Perl, such as e.g. libxml, you have Perl data-structures and libxml data-structures. The Perl data-structures contain pointers to the libxml data-structures. In comes the starting of an ithread and Perl clones all of the Perl data-structures. But it _only_ copies things it knows about. And thus leaves the pointers to the libxml data-structures untouched. Now you have 2 Perl data-structures that point to the _same_ libxml data-structures. Voila, instant sharing. I see. Our library loading code should take care of that. On thread creation we call again the _init code, so that the external lib can prepare itself to be used from multiple threads. But don't ask me about details ;) But I think we'll never be able to make this work as the user would initially expect. For instance if we have a DBI implementation, and some PMC is holding an external reference to a database cursor for an open transaction, then we can't properly duplicate the necessary state to make the copy of the PMC work correctly (that is, independently). (And I'm not saying just that we can't do it from within parrot, I'm saying the native database libraries can't do this.) So some objects such as this would always have to end up shared (or else non-functional in the new thread), which is bad for users because they have to be concerned with which objects are backed by native libraries and which ones cannot be made to conform to each of our thread styles. That seems like a major caveat. JEff
Re: Threads Design. A Win32 perspective.
At 12:05 PM -0800 1/4/04, Jeff Clites wrote: On Jan 4, 2004, at 5:47 AM, Leopold Toetsch wrote: Elizabeth Mattijsen [EMAIL PROTECTED] wrote: When you use an external library in Perl, such as e.g. libxml, you have Perl data-structures and libxml data-structures. The Perl data-structures contain pointers to the libxml data-structures. In comes the starting of an ithread and Perl clones all of the Perl data-structures. But it _only_ copies things it knows about. And thus leaves the pointers to the libxml data-structures untouched. Now you have 2 Perl data-structures that point to the _same_ libxml data-structures. Voila, instant sharing. I see. Our library loading code should take care of that. On thread creation we call again the _init code, so that the external lib can prepare itself to be used from multiple threads. But don't ask me about details ;) But I think we'll never be able to make this work as the user would initially expect. For instance if we have a DBI implementation, and some PMC is holding an external reference to a database cursor for an open transaction, then we can't properly duplicate the necessary state to make the copy of the PMC work correctly (that is, independently). (And I'm not saying just that we can't do it from within parrot, I'm saying the native database libraries can't do this.) And don't forget the libraries that are picky about which thread calls into them -- there are some that require that the thread that created the handle for the library be the thread that calls into the library with that handle. (Though luckily those are pretty rare) And of course the non-reentrant libraries that require a global library lock for all calls otherwise the library state gets corrupted. Aren't threads fun? :) -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Threads Design. A Win32 perspective.
On Jan 3, 2004, at 8:59 PM, Gordon Henriksen wrote: On Saturday, January 3, 2004, at 04:32 , Nigel Sandever wrote: Transparent interlocking of VHLL fat structures performed automatically by the VM itself. No need for :shared or lock(). Completely specious, and repeatedly proven unwise. Shouldn't even be pursued. Atomic guarantees on collections (or other data structures) are rarely meaningful; providing them is simply a waste of time. Witness the well-deserved death of Java's synchronized Vector class in favor of ArrayList. The interpreter absolutely shouldn't crash due to threading errors -- it should protect itself using standard techniques -- but it would be a mistake for parrot to mandate that all ops and PMCs be thread-safe. What are these standard techniques? The JVM spec does seem to guarantee that even in the absence of proper locking by user code, things won't go completely haywire, but I can't figure out how this is possible without actual locking. (That is, I'm wondering if Java is doing something clever.) For instance, inserting something into a collection will often require updating more than one memory location (especially if the collection is out of space and needs to be grown), and I can't figure out how this could be guaranteed not to completely corrupt internal state in the absence of locking. (And if it _does_ require locking, then it seems that the insertion method would in fact then be synchronized.) So my question is, how do JVMs manage to protect internal state in the absence of locking? Or do they? JEff
Re: Threads Design. A Win32 perspective.
On Sat, 03 Jan 2004 20:51, Luke Palmer wrote: Parrot is platform-independent, but that doesn't mean we can't take advantage of platform-specific instructions to make it faster on certain machines. Indeed, this is precisely what JIT is. But a lock on every PMC is still pretty heavy for those non-x86 platforms out there, and we should avoid it if we can. So implement threading on architectures that don't support interrupt masking with completely user-space threading (ie, runloop round-robin) like Ruby does. *That* is available on *every* platform. -- Sam Vilain, [EMAIL PROTECTED] Seeing a murder on television... can help work off one's antagonisms. And if you haven't any antagonisms, the commercials will give you some. -- Alfred Hitchcock
Re: Threads Design. A Win32 perspective.
At 3:17 PM -0500 1/4/04, Uri Guttman wrote: DS == Dan Sugalski [EMAIL PROTECTED] writes: DS And don't forget the libraries that are picky about which thread calls DS into them -- there are some that require that the thread that created DS the handle for the library be the thread that calls into the library DS with that handle. (Though luckily those are pretty rare) And of course DS the non-reentrant libraries that require a global library lock for all DS calls otherwise the library state gets corrupted. DS Aren't threads fun? :) hence my love for events and forked procs. Forks make some of this worse. There are more libraries that don't work with connections forked across processes than libraries that don't work with calls from a different thread. -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Threads Design. A Win32 perspective.
At 9:27 AM +1300 1/5/04, Sam Vilain wrote: On Sat, 03 Jan 2004 20:51, Luke Palmer wrote: Parrot is platform-independent, but that doesn't mean we can't take advantage of platform-specific instructions to make it faster on certain machines. Indeed, this is precisely what JIT is. But a lock on every PMC is still pretty heavy for those non-x86 platforms out there, and we should avoid it if we can. So implement threading on architectures that don't support interrupt masking with completely user-space threading (ie, runloop round-robin) like Ruby does. *That* is available on *every* platform. Interrupt masking and a proper threading interface can be considered a prerequisite for threads of any sort under Parrot, the same way an ANSI C89-compliant compiler is a requirement. Platforms that can't muster at least thread spawning, mutexes, and condition variables don't get threads, and don't have to be considered. (You can, it's just not required, and you'd be hard-pressed to find anything outside the embedded realm that doesn't support at least that level of functionality, and I'm OK if there are no threads on the Gameboy port) -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Threads Design. A Win32 perspective.
On Sat, 03 Jan 2004 01:48:07 -0500, [EMAIL PROTECTED] (Uri Guttman) wrote: ding! ding! ding! you just brought in a cpu specific instruction which is not guaranteed to be on any other arch. in fact many have such a beast but again, it is not accessible from c. you can't bring x86 centrism into this. the fact that redmond/intel threads can make use of this instruction to do 'critical sections' is a major point why it is bad for a portable system with many architectures to be supported. i disagree. it is covered for the intel case only and that is not good enough. again, intel specific and it has to be tossed out. but atomic operations are not available on all platforms. not in a unix environment. real multi-threaded processes already have this problem. that is a kernel issue and most signalling systems don't have thread specific arguments AFAIK. virtual ram is what counts on unix. you can't request some large amount without using real swap space. again, redmond specific. what win32 does is not portable and not useful at a VM level. ok, i can see where you got the test/set and yield stuff from now. finally, it is redmond specific again and very unportable. i have never heard of this fibre concept on any unix flavor. and that is not possible on other platforms. that is a big point and one that i don't see as possible. redmond can do what they want with their kernel and user procs. parrot can only use kernel concepts that are reasonably portable across most platforms. kernel threads are reasonably portable (FSDO of reasonable) but anything beyond that such as fibres, test/set in user space, etc. is not. locks have to be in kernel space since we can't do a fibre yield in user space on any other platform. so this rules out user space test/set as well since that would need a thread to spin instead of blocking. your ideas make sense but only on redmond/intel which is not the target space for parrot. That's pretty much the crux.
I don't know what is available (in detail) on other platforms. Hence I needed to express the ideas in terms I understand and explain them sufficiently that they could be viewed, interpreted, and either related to similar concepts on other platforms, or shot down. I accept your overall judgement, though not necessarily all the specifics. Maybe it would be possible (for me + others) to write the core of a win32 specific, threaded VM interpreter that would take parrot byte code and run it. Thereby utilising all the good stuff that precedes the VM interpreter, plus probably large chunks of the parrot VM, but providing it with a more native compatible target. That's something that is obviously not a simple project and is beyond the scope of this list. Thanks for taking the time to review this. Nigel Sandever.
Re: Threads Design. A Win32 perspective.
Nigel Sandever writes: Maybe it would be possible (for me + others) to write the core of a win32 specific, threaded VM interpreter that would take parrot byte code and run it. Thereby utilising all the good stuff that precedes the VM interpreter, plus probably large chunks of the parrot VM, but providing it with a more native compatible target. You want to write a parrot? Um, good luck. One thing Uri mentioned was writing platform specific macro code, and he dismissed it, saying that it's also a pain. While I agree that it's a pain, and it's about as maintenance-friendly as JIT, I don't think it has to be ruled out. Parrot is platform-independent, but that doesn't mean we can't take advantage of platform-specific instructions to make it faster on certain machines. Indeed, this is precisely what JIT is. But a lock on every PMC is still pretty heavy for those non-x86 platforms out there, and we should avoid it if we can. Luke That's something that is obviously not a simple project and is beyond the scope of this list. Thanks for taking the time to review this. Nigel Sandever.
Re: Threads Design. A Win32 perspective.
Nigel Sandever [EMAIL PROTECTED] wrote: On Sat, 03 Jan 2004 01:48:07 -0500, [EMAIL PROTECTED] (Uri Guttman) wrote: your ideas make sense but only on redmond/intel which is not the target space for parrot. s/not the/by far not the only/ Maybe it would be possible (for me + others) to write the core of a win32 specific, threaded VM interpreter that would take parrot byte code and run it. Thereby utilising all the good stuff that precedes the VM interpreter, plus probably large chunks of the parrot VM, but providing it with a more native compatible target. I'd be glad if someone fills the gaps in platform code. There is no need to rewrite the interpreter or such: defining the necessary macros in an efficient way should be enough to start with. Nigel Sandever. leo
Re: Threads Design. A Win32 perspective.
Nigel Sandever [EMAIL PROTECTED] wrote: [ Line length adjusted for readability ] VIRTUAL MACHINE INTERPRETER At any given point in the running of the interpreter, the VM register set, program counter and stack must represent the entire state for that thread. That's exactly what a ParrotInterpreter is: the entire state for a thread. I am completely against the idea of VM threads having a 1:1 relationship with interpreters. While these can be separated, it's not efficient. Please note that Parrot is a register-based VM. Switching state is *not* switching a stack-pointer and a PC thingy like in stack-based VMs. ... The runtime costs of duplicating all shared data between interpreters, tying all shared variables, added to the cost of locking, throw away almost all of the benefit of using threads. No, shared resources are not duplicated, nor tied. The locking of course remains, but that's it. The distinction is that a critical section does not disable time slicing/preemption. It prevents any thread attempting to enter an owned critical section from receiving a timeslice until the current owner relinquishes it. These are platform specific details. We will use whatever the platform/OS provides. In the source code it's a LOCK() UNLOCK() pair. The LOCK() can be any atomic operation and doesn't need to call the kernel, if the lock is acquired. [ Lot of Wintel stuff snipped ] Nigel Sandever. leo
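Leopold's "LOCK() can be any atomic operation and doesn't need to call the kernel if the lock is acquired" can be sketched with C11 atomics. This is an illustrative sketch, not parrot's actual LOCK() macro: the fast path is a single atomic test-and-set with no kernel involvement, and only the contended path calls into the kernel. Here the contended path is a simple `sched_yield()`; a real implementation would block on a futex or condition variable rather than spin-and-yield, which is Uri's point in the follow-up:

```c
#include <stdatomic.h>
#include <sched.h>

/* Hypothetical sketch of a fast-path LOCK()/UNLOCK() pair. */
typedef struct { atomic_flag flag; } fast_lock;

#define FAST_LOCK_INIT { ATOMIC_FLAG_INIT }

void fast_lock_acquire(fast_lock *l) {
    /* Fast path: one atomic instruction, no kernel call, when free. */
    while (atomic_flag_test_and_set_explicit(&l->flag,
                                             memory_order_acquire))
        sched_yield();  /* contended: a kernel call so others can run */
}

void fast_lock_release(fast_lock *l) {
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}
```

So both positions hold at once: the uncontended acquire never enters the kernel, yet blocking (rather than burning CPU) on contention still requires one.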
Re: Threads Design. A Win32 perspective.
LT == Leopold Toetsch [EMAIL PROTECTED] writes: LT These are platform specific details. We will use whatever the LT platform/OS provides. In the source code it's a LOCK() UNLOCK() pair. LT The LOCK() can be any atomic operation and doesn't need to call the LT kernel, if the lock is acquired. if it doesn't call the kernel, how can a thread be blocked? you can't have user level locks without spinning. at some point (even with fibres) you need to make a kernel call so other threads can run. the macro layer will make the mainline source look better but you still need kernel calls in their definition. uri -- Uri Guttman -- [EMAIL PROTECTED] http://www.stemsystems.com --Perl Consulting, Stem Development, Systems Architecture, Design and Coding- Search or Offer Perl Jobs http://jobs.perl.org
Re: Threads Design. A Win32 perspective.
On Sat, 3 Jan 2004 11:35:37 +0100, [EMAIL PROTECTED] (Leopold Toetsch) wrote: Nigel Sandever [EMAIL PROTECTED] wrote: VIRTUAL MACHINE INTERPRETER At any given point in the running of the interpreter, the VM register set, program counter and stack must represent the entire state for that thread. That's exactly what a ParrotInterpreter is: the entire state for a thread. This is only true if a thread == interpreter. If a single interpreter can run 2 threads then that single interpreter cannot represent the state of both threads safely. I am completely against the idea of VM threads having a 1:1 relationship with interpreters. With 5005threads, multiple threads exist in a single interpreter. All VHLL level data is shared without duplication, but locking has to be performed on each entity. This model is more efficient than ithreads. However, it was extremely difficult to prevent unwanted interaction between the threads from corrupting the internal state of the interpreter, given the internal architecture of P5, and so it was abandoned. With ithreads, each thread is also a separate interpreter. This insulates the internal state of one interpreter from the other but also insulates *all* perl level program data in one interpreter from the perl level data in the other. Spawning a new thread becomes a process of duplicating everything. The interpreter, the perl program, and all its existing data. Sharing data between the threads/interpreters is implemented by tying the two copies of the variables to be shared, and each time a STORE is performed in one thread, the same STORE has to be performed on the copy of that var held in every other thread's dataspace. If 2 threads need to share a scalar, but the program has 10 other threads, then each write to the shared scalar requires the update of all 12 copies of that scalar. There is no way to indicate that you only need to share it between threads x and y. 
With ithreads, there can be no shared references, so no shared objects and no shared compound data structures. While these can be separated it's not efficient. Please note that Parrot is a register-based VM. Switching state is *not* switching a stack-pointer and a PC thingy like in stack-based VMs. ... The runtime costs of duplicating all shared data between interpreters, tying all shared variables, added to the cost of locking, throw away almost all of the benefit of using threads. No, shared resources are not duplicated, nor tied. The locking of course remains, but that's it. The issues above are what make p5 ithreads almost unusable. If Parrot has found a way of avoiding these costs and limitations then everything I offered is a waste of time, because these are the issues I was attempting to address. However, I have seen no indication here, in the sources or anywhere else that this is the case. I assume that the reason Dan opened the discussion up in the first place is because the perception by those looking on was that the p5 ithreads model was being suggested as the way Parrot was going to go. And those who have tried to make use of ithreads under p5 are all too aware that replicating them for Parrot would be . [phrase deleted as too emotionally charged:)] leo Nigel.
Re: Threads Design. A Win32 perspective.
At 18:20 + 1/3/04, Nigel Sandever wrote: Sharing data between the threads/interpreters is implemented by tying the two copies of the variables to be shared, and each time a STORE is performed in one thread, the same STORE has to be performed on the copy of that var held in every other thread's dataspace. Hmmm... is that true? My impression (and that's the way I've implemented it in Thread::Tie) is that each STORE actually stores the value in a hidden background thread, and each FETCH obtains the current value from the background thread. I don't think each STORE is actually cascading through all of the threads. Not until they try to fetch the shared value, anyway. If 2 threads need to share a scalar, but the program has 10 other threads, then each write to the shared scalar requires the update of all 12 copies of that scalar. There is no way to indicate that you only need to share it between threads x and y. That's true. But there is some implicit grouping. You can only create newly shared variables for the current thread and any thread that is started from the current thread. This is actually a pity, as that precludes you from starting a bare (minimal number of Perl modules loaded, or with exactly the modules that _you_ want) thread at compile time, from which you could start other threads. With ithreads, there can be no shared references, so no shared objects and no shared compound data structures Actually, you can bless a reference to a shared variable, but you can't share a blessed object (the sharing will let you lose the content of the object). I think shared compound data structures _are_ possible, but very tricky to get right (because the CLONE sub is called as a package method, rather than as an object method: see Thread::Bless for an attempt at getting around this). While these can be separated it's not efficient. Please note that Parrot is a register-based VM. Switching state is *not* switching a stack-pointer and a PC thingy like in stack-based VMs. ... 
The runtime costs of duplicating all shared data between interpreters, tying all shared variables, added to the cost of locking, throw away almost all of the benefit of using threads. No, shared resources are not duplicated, nor tied. The locking of course remains, but that's it. The issues above are what make p5 ithreads almost unusable. For more about this, see my article on Perl Monks: http://www.perlmonks.org/index.pl?node_id=288022 And the reaction from those who have tried to make use of ithreads under p5 are all too aware that replicating them for Parrot would be . [phrase deleted as too emotionally charged:)] What can I say? ;-) Liz
Re: Threads Design. A Win32 perspective.
I'm trying to be constructive here. Some passages may appear to be blunt. Read at your own risk ;-) At 01:48 -0500 1/3/04, Uri Guttman wrote: NS == Nigel Sandever [EMAIL PROTECTED] writes: NS All that is required to protect an object from corruption through NS concurrent access and state change is to prevent two (or more) VMs NS trying to access the same object simultaneously. In order for the VM to NS address the PMC and load it into a VM register, it must know where NS it is. Ie. It must have a pointer. It can use this pointer to access NS the PMC with a Bit Test and Set operation. (x86 BTS) NS [http://www.itis.mn.it/linux/quarta/x86/bts.htm] This is a CPU atomic NS operation. This occurs before the VM enters the VM operations critical NS section. ding! ding! ding! you just brought in a cpu specific instruction which is not guaranteed to be on any other arch. in fact many have such a beast but again, it is not accessible from c. I just _can't_ believe I'm hearing this. So what if it's not accessible from C? Could we just not build a little C-program that would create a small in whatever loadable library? Or have a post-processing run through the binary image inserting the right machine instructions in the right places? Not being from a *nix background, but more from a MS-DOS background, I've been used to inserting architecture specific machine codes from higher level languages into executable streams since 1983! Don't tell me that's not done anymore? ;-) .. and still some archs don't have it so it has to be emulated by Dijkstra's algorithm. and that requires two counters and a little chunk of code. Well, I guess those architectures will have to live with that. If that's what it takes? you can't bring x86 centrism into this. the fact that redmond/intel threads can make use of this instruction to do 'critical sections' is a major point why it is bad for a portable system with many architectures to be supported. I disagree. 
I'm not a redmond fan, so I agree with a lot of your _sentiment_, but you should also realize that a _lot_ of intel hardware is running Linux. Heck, even some Solarises run on it. I'm not saying that the intel CPU's are superior to others. They're probably not. But it's just as with cars: most of the cars get people from A to B. They're not all Maseratis. Or Rolls-Royces. Most of them are Volkswagen, Fiat and whatever compacts you guys drive in the States. ;-) I don't think we're making Parrot to run well on Maseratis and Rolls-Royces only. We want to reach the Volkswagens. And not even today's Volkswagens: by the time Perl 6 comes around, CPU's will have doubled in power yet _again_! The portability is in Parrot itself: not by using the lowest common denominator of C runtime systems out there _today_! It will take a lot of trouble to create a system that will run everywhere, but that's just it what makes it worthwhile. Not that it offers the same limited capabilities on all systems! i am well aware about test/set and atomic instructions. my point in my previous reply was that it can't be assumed to be on a particular arch. so it has to be assumed to be on none and it needs a pure software solution. sure a platform specific macro/assembler lib could be done but that is a pain as well. Indeed. A pain, probably. Maybe not so much. I can't believe that Sparc processors are so badly designed that they don't offer something similar as what Nigel suggested for Intel platforms. again, intel specific and it has to be tossed out. Again, I think you're thinking too much inside the non-Wintel box. You stated yourself just now ...in fact many have such a beast..., so maybe it should first be investigated which platforms do support this and which don't, and then decide whether it is a good idea or an idea to be tossed? virtual ram is what counts on unix. you can't request some large amount without using real swap space. 
it may not allocate real ram until later (page faulting on demand) but it is swap space that counts. it is used up as soon as you allocate it. So, maybe a wrapper is in order, either for *nix, or for Win32, or maybe both. what win32 does is not portable and not useful at a VM level. kernels and their threads can work well together. portable VMs and their threads are another beast that can't rely on a particular architecture instruction or kernel feature. This sounds too much like dogma to me. Why? Isn't Parrot about borgifying all good things from all OS's and VM's, now and in the future? ;-) hardware is tossed out with portable VM design. we have a user space process written in c with a VM and VM threads that are to be based on kernel threads. the locking issues for shared objects is the toughest nut to crack right now. there is no simple or fast solution to this given the known constraints. intel/redmond specific solutions are not applicable (though we can learn from them). Then
Re: Threads Design. A Win32 perspective.
On Jan 3, 2004, at 11:19 AM, Elizabeth Mattijsen wrote: At 01:48 -0500 1/3/04, Uri Guttman wrote: NS == Nigel Sandever [EMAIL PROTECTED] writes: NS All that is required to protect an object from corruption through NS concurrent access and state change is to prevent two (or more) VMs NS trying to access the same object simultaneously. In order for the VM to NS address the PMC and load it into a VM register, it must know where NS it is. Ie. It must have a pointer. It can use this pointer to access NS the PMC with a Bit Test and Set operation. (x86 BTS) NS [http://www.itis.mn.it/linux/quarta/x86/bts.htm] This is a CPU atomic NS operation. This occurs before the VM enters the VM operations critical NS section. ding! ding! ding! you just brought in a cpu specific instruction which is not guaranteed to be on any other arch. in fact many have such a beast but again, it is not accessible from c. I just _can't_ believe I'm hearing this. So what if it's not accessible from C? Could we just not build a little C-program that would create a small in whatever loadable library? Or have a post-processing run through the binary image inserting the right machine instructions in the right places? Not being from a *nix background, but more from a MS-DOS background, I've been used to inserting architecture specific machine codes from higher level languages into executable streams since 1983! Don't tell me that's not done anymore? ;-) Yes, you are correct--we are already using bits of assembly in parrot. C compilers tend to allow you to insert bits of inline assembly, and we are taking advantage of that--for instance, look for __asm__ in the following files: jit/arm/jit_emit.h jit/ppc/jit_emit.h src/list.c src/malloc.c Also, JIT is all about generating platform-specific machine instructions at runtime. So it's certainly do-able, and right along the lines of what we are already doing. JEff
Re: Threads Design. A Win32 perspective.
Uri Guttman [EMAIL PROTECTED] wrote: LT == Leopold Toetsch [EMAIL PROTECTED] writes: LT These are platform specific details. We will use whatever the LT platform/OS provides. In the source code it's a LOCK() UNLOCK() pair. LT The LOCK() can be any atomic operation and doesn't need to call the LT kernel, if the lock is acquired. if it doesn't call the kernel, how can a thread be blocked? I wrote, *if* the lock is acquired. That's AFAIK the fast path of a futex or of the described Win32 behavior. The slow path is always a kernel call (or some rounds of spinning before ...) But anyway, we don't reinvent these locking primitives. uri leo
Re: Threads Design. A Win32 perspective.
On Jan 3, 2004, at 10:08 AM, Elizabeth Mattijsen wrote: At 12:15 -0500 1/3/04, Uri Guttman wrote: LT == Leopold Toetsch [EMAIL PROTECTED] writes: LT These are platform specific details. We will use whatever the LT platform/OS provides. In the source code it's a LOCK() UNLOCK() pair. LT The LOCK() can be any atomic operation and doesn't need to call the LT kernel, if the lock is acquired. if it doesn't call the kernel, how can a thread be blocked? Think out of the box! Threads can be blocked in many ways. My forks.pm module uses sockets to block threads ;-). IO operations which block like that end up calling into the kernel. But I believe it is usually possible to acquire an uncontested lock without calling into the kernel. When you do need to block (when trying to acquire a lock which is already held by another thread) you may need to enter the kernel. But I think that Leo's point was that in the common case of a successful lock operation, it may not be necessary. JEff
Re: Threads Design. A Win32 perspective.
Nigel Sandever [EMAIL PROTECTED] wrote: On Sat, 3 Jan 2004 11:35:37 +0100, [EMAIL PROTECTED] (Leopold Toetsch) wrote: That's exactly what a ParrotInterpreter is: the entire state for a thread. This is only true if a thread == interpreter. If a single interpreter can run 2 threads then that single interpreter cannot represent the state of both threads safely. Yep. So if a single interpreter (which is almost a thread state) should run two threads, you have to allocate and swap all. What should the advantage of such a solution be? With 5005threads, multiple threads exist in a single interpreter. These are obsolete. With ithreads, each thread is also a separate interpreter. Spawning a new thread becomes a process of duplicating everything. The interpreter, the perl program, and all its existing data. Partly yes. A new interpreter is created, the program, i.e. the opcode stream is *not* duplicated, but JIT or prederef information has to be rebuilt (on demand, if that run-core is running), and existing non-shared data items are cloned. Sharing data between the threads/interpreters is implemented by tying Parrot != perl5.ithreads If Parrot has found a way of avoiding these costs and limitations then everything I offered is a waste of time, because these are the issues I was attempting to address. I posted a very premature benchmark result, where an unoptimized Parrot build is 8 times faster than the equivalent perl5 code. And the reaction from those who have tried to make use of ithreads under p5 are all too aware that replicating them for Parrot would be . [phrase deleted as too emotionally charged:)] I don't know how ithreads are working internally WRT the relevant issues like object allocation and such. But threads at the OS level provide shared code and data segments. So at the VM level you have to unshare non-shared resources at thread creation. You can copy objects lazily and make 2 distinct items when writing, or you copy them in the first place. 
But you have these costs at thread start - and not later. Nigel. leo
Re: Threads Design. A Win32 perspective.
EM == Elizabeth Mattijsen [EMAIL PROTECTED] writes: ding! ding! ding! you just brought in a cpu specific instruction which is not guaranteed to be on any other arch. in fact many have such a beast but again, it is not accessible from c. EM I just _can't_ believe I'm hearing this. So what if it's not EM accessible from C? Could we just not build a little C-program that EM would create a small in whatever loadable library? Or have a EM post-processing run through the binary image inserting the right EM machine instructions in the right places? Not being from a *nix EM background, but more from a MS-DOS background, I've been used to EM inserting architecture specific machine codes from higher level EM languages into executable streams since 1983! Don't tell me that's EM not done anymore? ;-) it is not that it isn't done anymore but the effect has to be the same on machines without test/set. and on top of that, it still needs to be a kernel level operation so a thread can block on the lock. that is the more important issue that makes using a test/set in user space a moot problem. EM I disagree. I'm not a redmond fan, so I agree with a lot of your EM _sentiment_, but you should also realize that a _lot_ of intel EM hardware is running Linux. Heck, even some Solarises run on it. we are talking maybe 10-20 architectures out there that we would want parrot to run on. maybe more. how many does p5 run on now? EM The portability is in Parrot itself: not by using the lowest common EM denominator of C runtime systems out there _today_! It will take a EM lot of trouble to create a system that will run everywhere, but that's EM just it what makes it worthwhile. Not that it offers the same limited EM capabilities on all systems! but we need a common denominator of OS features more than one for cpu features. the fibre/thread stuff is redmond only. and they still require system calls. so as i said the test/set is not a stopping point (dijkstra) but the OS support is. 
how and where and when we lock is the only critical factor and that hasn't been decided yet. we don't want to lock at global thread levels and we are not sure we can lock at PMC or object levels (GC and alloc can break that). we should be focusing on that issue. think about how DBs did it. sybase used to do page locking (coarse grained) since it was faster (this was 15 years ago) and they had the fastest engine. but when multithreading and multicpu designs came in a finer grained row locking was faster (oracle). sybase fell behind and has not caught up. we have the same choices to make so we need to study locking algorithms and techniques from that perspective and not how to do a single lock (test/set vs kernel). but i will keep reiterating that it has to be a kernel lock since we must block threads and GC and such without spinning or manual scheduling (fibres). virtual ram is what counts on unix. you can't request some large amount without using real swap space. it may not allocate real ram until later (page faulting on demand) but it is swap space that counts. it is used up as soon as you allocate it. EM So, maybe a wrapper is, either for *nix, or for Win32, or maybe both. this is very different behavior IMO and not something that can be wrapped easily. i could be wrong but separating virtual allocation from real allocation can't be emulated without kernel support. and we need the same behavior on all platforms. this again brings up how we lock so that GC/alloc will work properly with threads. do we lock a thread pool but not the thread when we access a shared thingy? that is a medium grain lock. can the GC/alloc break the lock if it is called inside that operation? or could only the pool inside the active thread do that? what about a shared object alloced from thread A's pool that triggers an alloc when being accessed in thread B? these are the questions that need to be asked and answered. 
i was just trying to point out to nigel that the intel/redmond solutions are not portable as they require OS support and that all locks need to be kernel level. given that requirement, we need to decide how to do the locks so those questions can be answered with reasonable efficiency. of course a single global lock would work but that stinks and we all know it. so what is the lock granularity? how do we handle GC/alloc across shared objects? EM This sounds too much like dogma to me. Why? Isn't Parrot about EM borgifying all good things from all OS's and VM's, now and in the EM future? ;-) but parrot can only use a common set of features across OS's. we can't use a redmond feature that can't be emulated on other platforms. and my karma ran over my dogma :( :-) hardware is tossed out with portable VM design. we have a user space process written in c with a VM and VM threads that are to be based on kernel threads. the locking issues for shared objects is the toughest
Re: Threads Design. A Win32 perspective.
LT == Leopold Toetsch [EMAIL PROTECTED] writes: LT Uri Guttman [EMAIL PROTECTED] wrote: LT == Leopold Toetsch [EMAIL PROTECTED] writes: LT These are platform specific details. We will use whatever the LT platform/OS provides. In the source code it's a LOCK() UNLOCK() pair. LT The LOCK() can be any atomic operation and doesn't need to call the LT kernel, if the lock is acquired. if it doesn't call the kernel, how can a thread be blocked? LT I wrote, *if* the lock is acquired. That's AFAIK the fast path of a futex LT or of the described Win32 behavior. The slow path is always a kernel LT call (or some rounds of spinning before ...) LT But anyway, we don't reinvent these locking primitives. ok, i missed the 'if' there. :) that could be workable and might be faster. it does mean that locks are two step as well, user space test/set and fallback to kernel lock. we can do what nigel said and wrap the test/set in macros and use assembler to get at it on platforms that have it or fallback to dijkstra on those that don't. uri
Re: Threads Design. A Win32 perspective.
EM == Elizabeth Mattijsen [EMAIL PROTECTED] writes: EM At 12:15 -0500 1/3/04, Uri Guttman wrote: LT == Leopold Toetsch [EMAIL PROTECTED] writes: LT These are platform specific details. We will use whatever the LT platform/OS provides. In the source code it's a LOCK() UNLOCK() pair. LT The LOCK() can be any atomic operation and doesn't need to call the LT kernel, if the lock is acquired. if it doesn't call the kernel, how can a thread be blocked? EM Think out of the box! EM Threads can be blocked in many ways. My forks.pm module uses sockets EM to block threads ;-). i used that design as well. a farm of worker threads blocked on a pipe (socketpair) to the same process. the main event loop handled the other side. worked very well. EM It sucks performance wise, but it beats the current perl ithreads EM implementation on many platforms in many situations. i can believe that. EM Therefore my motto: whatever works, works. but i discussed that solution with dan and he shot it down for speed reasons IIRC. i still think it is an interesting solution. it could also be used for the main event queue and/or loop as i mention above. we are assuming some form of sockets on all platforms IIRC, so we can use socketpair for that. i even use socketpair on win32 to test a (pseudo)fork thing for file::slurp. ...you can't have user level locks without spinning. at some point (even with fibres) you need to make a kernel call so other threads can run. EM Possibly. I don't know enough of the specifics whether this is EM true or not. i looked at the docs for fibres and they say you do a manual reschedule by selecting the fibre to run next or i think a yield. but something has to go to the kernel since even fibres are kernel thingys. uri
Re: Threads Design. A Win32 perspective.
On Sat, Jan 03, 2004 at 08:05:13PM +0100, Elizabeth Mattijsen wrote: At 18:20 + 1/3/04, Nigel Sandever wrote: Sharing data between the threads/interpreters is implemented by tying the two copies of the variables to be shared, and each time a STORE is performed in one thread, the same STORE has to be performed on the copy of that var held in every other thread's dataspace. Hmmm is that true? My impression was (and that's the way I've implemented it in Thread::Tie) is that each STORE actually stores the value in a hidden background thread, and each FETCH obtains the current value from the background thread. I don't think each STORE is actually cascading through all of the threads. Not until they try to fetch the shared value, anyway. Sharing consists of the real SV living in a shared interpreter, with each individual thread having a lightweight proxy SV that causes the appropriate real SV to be accessed/updated by a mixture of magic and/or tied-ish access. A particular access by one thread does not involve any of the other threads or their proxies. With ithreads, there can be no shared references, so no shared objects and no shared compound data structures Actually, you can bless a reference to a shared variable, but you can't share a blessed object (the sharing will let you lose the content of the object). I think shared compound data structures _are_ possible, but very tricky to get right (because the CLONE sub is called as a package method, rather than as an object method: see Thread::Bless for an attempt at getting around this). Nested shared structures work just fine, and there's no need to mess with CLONE for plain data. And the reaction from those who have tried to make use of ithreads under p5 are all too aware that replicating them for Parrot would be . [phrase deleted as too emotionally charged:)] It's the implementation of ithreads in p5 that sucks, not the concept itself. 
The use of COW makes new thread creation cheap, and the use of lock PMCs interposed on the real PMCs makes sharing easy. Dave. -- O Unicef Clearasil! Gibberish and Drivel! - Bored of the Rings
Re: Threads Design. A Win32 perspective.
On Jan 3, 2004, at 12:26 PM, Uri Guttman wrote: LT These are platform specific details. We will use whatever the LT platform/OS provides. In the source code it's a LOCK() UNLOCK() pair. LT The LOCK() can be any atomic operation and doesn't need to call the LT kernel, if the lock is acquired. if it doesn't call the kernel, how can a thread be blocked? LT I wrote, *if* the lock is acquired. That's AFAIK the fast path of a futex LT or of the described Win32 behavior. The slow path is always a kernel LT call (or some rounds spinning before ...) LT But anyway, we don't reinvent these locking primitives. ok, i missed the 'if' there. :) that could be workable and might be faster. it does mean that locks are two step as well, user space test/set and fallback to kernel lock. we can do what nigel said and wrap the test/set in macros and use assembler to get at it on platforms that have it This is probably already done inside of pthread (and Win32) locking implementations--a check in userspace before a kernel call. It's also important to remember that it's not a two-step process most of the time, since most of the locking we are talking about is likely to be uncontested (ie, the lock acquisition will succeed most of the time, and no blocking will be needed). If we _do_ have major lock contention in our internal locking, those will be areas calling for a redesign. JEff
Re: Threads Design. A Win32 perspective.
On Sat, 3 Jan 2004 21:00:31 +0100, [EMAIL PROTECTED] (Leopold Toetsch) wrote: That's exactly what a ParrotInterpreter is: the entire state for a thread. This is only true if a thread == interpreter. If a single interpreter can run 2 threads then that single interpreter cannot represent the state of both threads safely. Yep. So if a single interpreter (which is almost a thread state) should run two threads, you have to allocate and swap all. When a kernel level thread is spawned, no duplication of application memory is required, only a set of registers, a program counter and a stack. These represent the entire state of that thread. If a VM thread mirrors this, by duplicating the VM program counter, VM registers and VM stack, then this VM thread context can also avoid the need to replicate the rest of the program data (interpreter). What should the advantage of such a solution be? The avoidance of duplication. Transparent interlocking of VHLL fat structures performed automatically by the VM itself. No need for :shared or lock(). With 5005threads, multiple threads exist in a single interpreter. These are obsolete. ONLY because they couldn't be made to work properly. The reasons for that are entirely due to the architecture of P5. Dan Sugalski suggested in this list back in 2001, that he would prefer pthreads to ithreads. I've used both in p5, and pthreads are vastly more efficient, but flaky and difficult to use well. These limitations are due to the architecture upon which they were built. My interest is in seeing the Parrot architecture not exclude them. With ithreads, each thread is also a seperate interpreter. Spawning a new thread becomes a process of duplicating everything. The interpreter, the perl program, and all it existing data. Partly yes. A new interpreter is created, the program, i.e. 
the opcode stream is *not* duplicated, but JIT or prederef information has to be rebuilt (on demand, if that run-core is running), and existing non-shared data items are cloned. Only duplicating shared data on demand (COW) may work well on systems that support COW in the kernel. But on systems that don't, this has to be emulated in user space, with all the inherent overhead that implies. My desire was that the VM_Spawn_Thread, VM_Share_PMC and VM_Lock_PMC opcodes could be coded such that on those platforms where the presence of kernel level COW and other native features mean that the ithreads-style model of VMthread == kernel thread + interpreter is the best way to go, then that would be the underlying implementation. On those platforms where VMthread == kernel thread + VMthread context is the best way, then that would be the underlying implementation. In order for this to be possible, it implies a certain level of support for both be engrained in the design of the interpreter. My (long) original post, with all the subjects covered and details given, was my attempt to describe the support required in the design for the latter. It would be necessary to consider all the elements, and the way they interact, and take these into consideration when implementing Parrot's threading in order that this would be achievable. Each element, the separation of the VM state from the interpreter state, the atomisation of VM operations, the automated detection and locking of concurrent access attempts and the serialisation of the VM threads when it is detected, all need support at the highest level before they may be implemented at the lowest (platform specific) levels. It simply isn't possible to implement them on one platform at the lowest levels unless the upper levels of the design are constructed with the possibilities in mind. 
Sharing data between the threads/interpreters is implemented by tieing.

Parrot != perl5 ithreads. If Parrot has found a way of avoiding these costs and limitations then everything I offered is a waste of time, because these are the issues I was attempting to address.

I posted a very premature benchmark result, where an unoptimized Parrot build is 8 times faster than the equivalent perl5 code.

And the reaction from those who have tried to make use of ithreads under p5 are all too aware that replicating them for Parrot would be . [phrase deleted as too emotionally charged:)]

I don't know how ithreads work internally WRT the relevant issues like object allocation and such. But threads at the OS level provide shared code and data segments. So at the VM level you have to unshare non-shared resources at thread creation.

You only need to copy them if the two threads can attempt to modify the contents of the objects concurrently. This possibility can be precluded by atomising VM-thread-level operations: preventing a new VM thread from being scheduled until any other VM thread completes its current operation, and ensuring that each VM thread's state is complete and coherent before another VM
Re: Threads Design. A Win32 perspective.
JC == Jeff Clites [EMAIL PROTECTED] writes:

JC On Jan 3, 2004, at 12:26 PM, Uri Guttman wrote:

that could be workable and might be faster. it does mean that locks are two step as well, user space test/set and fallback to kernel lock. we can do what nigel said and wrap the test/set in macros and use assembler to get at it on platforms that have it

JC This is probably already done inside of pthread (and Win32) locking implementations--a check in userspace before a kernel call. It's also important to remember that it's not a two-step process most of the time, since most of the locking we are talking about is likely to be uncontested (i.e., the lock acquisition will succeed most of the time, and no blocking will be needed). If we _do_ have major lock contention in our internal locking, those will be areas calling for a redesign.

i meant in the coding and not necessarily at runtime. we still need to address when and where locking happens.

uri

-- Uri Guttman -- [EMAIL PROTECTED] http://www.stemsystems.com --Perl Consulting, Stem Development, Systems Architecture, Design and Coding- Search or Offer Perl Jobs http://jobs.perl.org
Re: Threads Design. A Win32 perspective.
At 21:11 + 1/3/04, Dave Mitchell wrote: On Sat, Jan 03, 2004 at 08:05:13PM +0100, Elizabeth Mattijsen wrote: Actually, you can bless a reference to a shared variable, but you can't share a blessed object (the sharing will let you lose the content of the object). I think shared compound data structures _are_ possible, but very tricky to get right (because the CLONE sub is called as a package method, rather than as an object method: see Thread::Bless for an attempt at getting around this).

Nested shared structures work just fine, and there's no need to mess with CLONE for plain data.

Indeed. But as soon as there is something special, such as a datastructure external to Perl, between threads (which gets shared automatically, because Perl doesn't know about the datastructure, so the cloned objects point to the same memory address), then you're in trouble. Simply because you now have multiple DESTROYs called on the same external data-structure. If the function of the DESTROY is to free the memory of the external data-structure, you're in trouble as soon as the first thread is done. ;-(

And the reaction from those who have tried to make use of ithreads under p5 are all too aware that replicating them for Parrot would be . [phrase deleted as too emotionally charged:)]

It's the implementation of ithreads in p5 that sucks, not the concept itself. The use of COW makes new thread creation cheap, and the use of lock PMCs interposed on the real PMCs makes sharing easy.

I agree that Perl ithreads as *a* concept are ok. The same could be said about what are now referred to as 5.005 threads. Ok as *a* concept. And closer to what many non-Perl people consider to be threads. Pardon my French, but both suck in the implementation. And it is not for lack of effort by the people who developed it. It is for lack of a good foundation to build on.
And we're talking foundation here now, we all want to make sure it is the best, earth-quake proofed, rocking foundation we can get! Liz
Re: Threads Design. A Win32 perspective.
Uri Guttman [EMAIL PROTECTED] wrote: ok, i missed the 'if' there. :) that could be workable and might be faster. it does mean that locks are two step as well, user space test/set and fallback to kernel lock.

Yep, that is what the OS provides. I really don't like to reinvent wheels here - ehem, and nowhere else.

uri

leo
Re: Threads Design. A Win32 perspective.
Uri Guttman [EMAIL PROTECTED] wrote: ... this again brings up how we lock so that GC/alloc will work properly with threads. do we lock a thread pool but not the thread when we access a shared thingy?

This is the major issue: how to continue. Where are shared objects living (alloc) and where and how are these destroyed (DOD/GC)? But first I'd like to hear some requirements defined, e.g.:

  A spawns B, C threads with shared $a
  B spawns D, E threads with shared $b

  - is $b shared in A or C
  - is $a shared in D or E
  - are there any (HLL) decisions to decide what is shared
  - and so on

If that isn't laid out we don't need to talk about locking a certain thread pool.

...how do we handle GC/alloc across shared objects?

I posted a proposal for that :)

uri

leo
Re: Threads Design. A Win32 perspective.
Elizabeth Mattijsen [EMAIL PROTECTED] wrote: Indeed. But as soon as there is something special, such as a datastructure external to Perl, between threads (which gets shared automatically, because Perl doesn't know about the datastructure,

Why is it shared automatically? Do you have an example for that? But anyway an interesting problem I didn't consider until now - thanks :)

... so the cloned objects point to the same memory address), then you're in trouble. Simply because you now have multiple DESTROYs called on the same external data-structure. If the function of the DESTROY is to free the memory of the external data-structure, you're in trouble as soon as the first thread is done. ;-(

Maybe the DOD/GC can help here. A shared object can and will be destroyed only when the last holder of that object has released it.

[ perl5 thread concepts ] Pardon my French, but both suck in the implementation. And it is not for lack of effort by the people who developed it.

The problem for sure was to put threads on top of a working interpreter and a commonly used language. Parrot's design is based on having threads, events, and async IO in mind. It was surprisingly simple to implement these first steps that are running now. Separating the HLL layer from the engine probably helps a lot for such rather major design changes.

now, we all want to make sure it is the best, earth-quake proofed, rocking foundation we can get!

Yep. So again, your input is very welcome, Liz.

leo
Re: Threads Design. A Win32 perspective.
Nigel Sandever [EMAIL PROTECTED] wrote: On Sat, 3 Jan 2004 21:00:31 +0100, [EMAIL PROTECTED] (Leopold Toetsch) wrote: Yep. So if a single interpreter (which is almost a thread state) should run two threads, you have to allocate and swap all.

When a kernel level thread is spawned, no duplication of application memory is required, only a set of registers, program counter and stack. These represent the entire state of that thread.

Here is the current approach, which I've implemented partly: the state of a thread is basically the interpreter structure - that's it. In terms of Parrot, a thread (a ParrotThread PMC) is derived from an interpreter (a ParrotInterpreter PMC). Please remember, Parrot is a register based VM and has a lot of registers. The whole representation of a VM thread is bigger than, and different from, that of a kernel thread. While scheduling a kernel thread is only swapping the above items, a VM level thread scheduler would have to swap much more.

If a VM thread mirrors this, by duplicating the VM program counter, VM registers and VM stack, then this VM thread context can also avoid the need to replicate the rest of the program data (interpreter).

You are again missing here: the interpreter is the above VM state - the rest is almost nothing. So the interpreter := thread approach holds. You can't run even a single - the one and only - thread without these necessary data, and that's just called an interpreter in Parrot speak.

... No need for :shared or lock().

That's the - let's say - type 4 of Dan's layout of different threading models. Everything is shared by default. That's similar to the shared PMC type 3 model - except that no objects have to be copied. It for sure depends on the user code whether one or the other model will have better performance, so the user can choose. We will provide both.

Only duplicating shared data on demand (COW) may work well on systems that support COW in the kernel.
No, we are dealing with VM objects and structures here - no kernel is involved for COWed copies of e.g. strings.

[ snips ]

Each element, the separation of the VM state from the interpreter state,

VM = virtual machine = interpreter. These can't be separated as they are the same.

the atomisation of VM operations,

Different VMs can run on different CPUs. Why should we make atomic instructions out of these? We have a JIT runtime performing at 1 Parrot instruction per CPU instruction for native integers. Why should we slow that down by a factor of many tens? Where we have to lock shared data you have to pay the penalty, but not for each piece of code.

You only need to copy them, if the two threads can attempt to modify the contents of the objects concurrently.

I think that you are totally missing multiprocessor systems.

leo
Re: Threads Design. A Win32 perspective.
On Jan 3, 2004, at 2:59 PM, Leopold Toetsch wrote: Nigel Sandever [EMAIL PROTECTED] wrote: Only duplicating shared data on demand (COW) may work well on systems that support COW in the kernel.

No, we are dealing with VM objects and structures here - no kernel is involved for COWed copies of e.g. strings.

And also, COW at the OS level (that is, of memory pages) doesn't help, because we have complex data structures filled with pointers, so copying them involves more than just duplicating a block of memory. We can use an approach similar to what we do for strings to make a COW copy of, for instance, the globals stash, but overall that will only be a speed improvement if the data structure is rarely modified. (That is, once it's modified, we will have paid the price. Unless we have a clever data structure which can be COWed in sections.)

Just adding to what Leo already said.

Jeff
Re: Threads Design. A Win32 perspective.
On Thu, 01 Jan 2004 21:32:22 -0500, [EMAIL PROTECTED] (Uri Guttman) wrote: UG Uri Guttman NS Nigel Sandever. (Mostly not reproduced here!)

NS REENTRANCY

UG this is true for c level threads but not necessarily true for VM level threads. if the VM is atomic at its operation level and can't be preempted (i.e. it is not using kernel threads with time slicing),

One of the major points of what I am (was?) proposing is that there would be a 1:1 mapping between VM threads and kernel threads. The second point is that atomic operation of the VM thread does not imply that they cannot be preempted. It only means that when one VM thread is preempted by the scheduler, no other thread that is waiting to enter the same critical section as the first thread is in will be scheduled (by the scheduler) until the first thread exits that critical section.

UG then it could use thread unsafe calls (as long as it keeps those silly static buffers clean).

The point about avoiding calling the CRT directly from the main bulk of the code does not preclude the use of (suitably thread-aware) CRT calls completely. The idea is to acknowledge that:

A) Not all CRTs are equally well crafted or complete.
B) Thread-aware CRTs are not available with all compilers for all platforms.

By writing the main bulk of the code in terms of a virtual CRT using macros, the main bulk of the code becomes platform independent. The only change required when porting the code to different platforms is to write a set of suitable expansions for the macros for each platform/compiler. Where the available CRT is up to the job, the macros expand directly to the underlying CRT call. Where they are not, a suitable alternative can be written that bypasses the CRT and goes directly to the underlying platform.

As an example: compiling P5 with Borland 5.5 for NT, I encountered the restriction that the Borland runtime didn't support fseek() or ftell() on files > 4GB.
The underlying platform does, and it was relatively trivial to replace the calls to the CRT functions with appropriate NT syscalls and so enable both USE_LARGE_FILES and PERLIO. The only difficulty was that instead of being able to make the changes in a single place, it was necessary to make essentially the same change in 3 or 4 different files. To compound matters, the modifications required conditional compilation directives at all 4 places, and these had to be interspersed with other #ifdefs for other platforms and other purposes.

The idea was to avoid this morass of conditional compilation and copy/paste coding by pushing the conditional directives in the main code down to a single

  #if defined( THIS_PLATFORM )
  #  include "this_platform_crt.h"
  #endif

Within that header would be a simple set of #ifdefs to include compiler specific header files. And within those, all the CRT macro expansions. Thus, I sought to ease both porting and maintenance. It would also become possible to add virtual syscall definitions to the high level code, and expand them differently, perhaps radically so, on a platform by platform basis. In essence, creating a Virtual OS for the VM.

UG parrot will (according to dan) use one interpreter per VM thread and those may run on kernel threads.

From what I have read, the decision as to whether each VM thread also gets a separate interpreter was one of the things Dan was opening up to discussion? Personally, with the perspective of (the forkless) NT platform, and my experience with trying to make good use of both p5's 5005thread and ithread implementations, I am completely against the idea of VM threads having a 1:1 relationship with interpreters. The runtime costs of duplicating all shared data between interpreters, tying all shared variables, added to the cost of locking, throw away almost all of the benefit of using threads. Combine that overhead with the restrictions of:

a) No shared references, therefore:
b) No shared objects, therefore:
c) No shared methods.
d) No shared compound data structures.

And the only hope for multiprocessing becomes forking, which Win32 doesn't support natively. And which, without the benefits of process level COW performed by the kernel, results in a kludged emulation of a non-native facility that will never be usable in any real sense of the word.

UG it may be possible to disable preemption and/or time slicing so the VM threads will be atomic at the VM operation level and then we don't have to worry as much about thread unsafe libs.

I was trying to demonstrate that there is no need to disable the scheduler in order for VM threads to achieve atomic VM level operations.

UG but i gather that people want real preemption and priorities and time slicing so that idea may be moot.

Yes. I am one of them :)

UG but on most platforms that support kernel threads there are thread safe versions of most/all the c lib stuff.

As I said above. This is true, but it is also compiler
Re: Threads Design. A Win32 perspective.
NS == Nigel Sandever [EMAIL PROTECTED] writes:

NS ATOMICITY AND CRITICAL SECTIONS

UG that is what i mentioned above, disabling time slicing/preemption when desired. it is not just a win32 concept. hell, turning off interrupts during interrupt handlers goes way back! redmond just likes to rename stuff and act like they invented it. :)

NS I cannot emphasise enough that Critical Sections are different to both Events and Semaphores. The distinction is that a critical section does not disable time slicing/preemption. It prevents any thread attempting to enter an owned critical section from receiving a timeslice until the current owner relinquishes it. This means that other threads in the same process and other processes continue to timeslice preemptively. Only the thread(s) waiting to enter the specified critical section are blocked. There is no 'spinning' involved.

it has to disable slicing/preemption if there are multiple kernel threads running that have access to that data.

UG in effect it sounds like a thread shared mutex. it could be implemented in kernel or process space.

NS They are indeed a form of mutex, but the process space state, and avoiding the need to switch to kernel mode, is crucial, as it is this that makes them lighter weight than a conventional IPC Mutex. A CRITICAL_SECTION is a DWORD allocated in user space. It must be allocated by the user space code, and then initialised through a syscall. Once initialised, it cannot be moved, copied or modified (directly) by the user code. If you have the time and inclination, there is a brief (25 line) description of them at [http://msdn.microsoft.com/library/en-us/dllproc/base/critical_section_objects.asp]

UG that flag setting needs to be atomic or a mutex or similar. plain flag setting won't work. also the blocking has to be kernel level (so a kernel mutex/semaphore is needed) so the kernel slicing will work.

NS Indeed.
A simple flag *is* all that is necessary to protect shared (fat) data structures, *if* the VMs are the only code that could achieve concurrent access to the contents of the PMC, and these are already interlocked at the operational level.

NS A PMC is an object, and the only code that can affect the state of that object are its methods. The methods can only be invoked by the VM. Each method constitutes an (atomic) operation at the VM level. All that is required to protect an object from corruption through concurrent access and state change is to prevent two (or more) VMs trying to access the same object simultaneously. In order for the VM to address the PMC and load it into a VM register, it must know where it is. I.e. it must have a pointer. It can use this pointer to access the PMC with a Bit Test and Set operation (x86 BTS) [http://www.itis.mn.it/linux/quarta/x86/bts.htm]. This is a CPU atomic operation. This occurs before the VM enters the VM operations critical section.

ding! ding! ding! you just brought in a cpu specific instruction which is not guaranteed to be on any other arch. in fact many have such a beast but again, it is not accessible from c. and still some archs don't have it so it has to be emulated by dijkstra's algorithm. and that requires two counters and a little chunk of code. you can't bring x86 centrism into this. the fact that redmond/intel threads can make use of this instruction to do 'critical sections' is a major point why it is bad for a portable system with many architectures to be supported.

NS Once the bit has been set in the PMC header by one VM, if the scheduler intervened exactly before the next instruction in that thread, and scheduled a second VM thread to run, and that VM also attempts to access that same PMC, the first operation it tries to perform is the same BTS instruction.
The next instruction then tests the CF (Carry flag) and knows that the PMC is already in use. Hence, it then relinquishes (yields) the rest of its timeslice.

i am well aware of test/set and atomic instructions. my point in my previous reply was that it can't be assumed to be on a particular arch. so it has to be assumed to be on none and it needs a pure software solution. sure a platform specific macro/assembler lib could be done but that is a pain as well.

NS Every time the second (or subsequent) VM attempts to use a PMC that is already in-use, it finds the bit set and immediately yields. This will only happen during the interval between the first VM issuing the BTS, and that same VM entering the critsec. Once the critsec has been entered, no other VM that could load that PMC would even be scheduled.

yield is also a kernel specific thing and what will wake up the thread after that? it has to block on some kernel thingy and the
Threads Design. A Win32 perspective.
This is going to be extremely light on details with respect to the current state of the Parrot interpreter. It is also going to be expressed in terms of Win32 APIs. For both of these I apologise in advance. Time, and the 'or forever hold your peace' imperative, has overridden my desire to do otherwise. My attempts to get up to speed enough on the sources to work out how to apply the ideas I am going to express, and to become conversant with the *nix pthreads implementation and terminology, have moved too slowly to afford me the opportunity to do more. Hoping this stimulates discussion rather than a flame war. Regards, Nigel Sandever. PS. Sorry if this gets posted twice.

THE BASIS OF THE IDEA

Modern OSs succeed in having multiple threads of execution share a single copy of process memory without the operations of one thread being able to interfere with the state of another. The state of the code running in those threads may be corrupted through mismanagement. But the state of the threads themselves, their program counters, registers and stacks, cannot. The mechanisms for this incorruptibility are:

- Each operation (opcode) performed by the thread is atomic. The scheduler can never interrupt a thread whilst an operation is in progress, only between operations.
- Before the start of an operation, and after the end of one, the state of the thread is entirely encapsulated within the registers and stack.
- By swapping the entire state of the CPU register set when switching from one thread to another, the state of each thread is preserved and reconstituted. No other mechanisms or interlocks are required.

By analogy, a virtual machine that wishes to have multiple threads of execution must achieve the same level of atomicity for each operation it performs.

VIRTUAL MACHINE INTERPRETER

At any given point in the running of the interpreter, the VM register set, program counter and stack must represent the entire state for that thread.
Once an opcode has started execution on a given thread, no other thread of execution within that interpreter must be allowed to start an operation until the first thread completes its opcode.

NON-VMI THREADS

ASYNCHRONOUS IO

Note that this does not mean that no other thread in the process can take a timeslice, only that any thread that is allowed to run should not be able to affect the state of the VM in question. A thread charged with performing asynchronous reads on behalf of the user program running within the VM interpreter can go ahead so long as it doesn't directly modify the VMI state.

EVENT MANAGER THREAD

Equally, an event thread can also be run concurrently in the background to receive asynchronous notifications (signals, messages, asynchronous read completions etc.). It can then queue these events and set a flag that the VMI can inspect between each iteration of the opcode execution loop and act upon appropriately. This gives the benefit of safe signals along with safe and timely processing of other, similarly asynchronous events.

GARBAGE COLLECTION

The garbage collector would need to run *synchronously* with the interpreter opcode loop. Effectively, it would be another (possibly long running) opcode. An analogy of this are all the long running syscalls that operate within a multi-tasking OS, e.g. synchronous IO, virtual memory allocation, swapping etc. Just as virtual memory operations suspend the affected processes until the operations are complete, so garbage collection can be seen as a virtual memory operation for the virtual machine that requires the VMI to be suspended until the operation is complete.

PREREQUISITES

The key is for the VM to be reentrant, and the use of (in Win32 terms) a critical section.

REENTRANCY

Not only must the VMI be coded in a reentrant fashion, with all state addressed through pointers (references) loaded into its virtual register set; all the code underlying it, including syscalls and CRT, must equally be reentrant.
Many APIs within many CRTs *are not reentrant* (eg. strtok()). All state must be on a per-thread, not a per-process, basis. To this end, I would propose that no CRT APIs are used directly from the main code! Instead, a full set of CRT-like macros would be inherited from a header file, where either the real CRT API would be called, or an alternative implementation. This header file would be conditionally included on the basis of the target platform. This concentrates the bulk, if not all, of the platform specific code into a single file (or set of files).

EXPANDING THE VIRTUAL MACHINE

What I am suggesting here is that the virtual machine analogy be extended to encompass not just the VHLL-user-program view of the registers and stack, but also its view of the entire machine and underlying OS. By so doing, not only does it allow those working on the core of the VMI to be isolated from the
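The virtual-CRT header proposed above might look something like the following sketch. Every name in it (VM_FSEEK, VM_STRTOK, win32_large_fseek, the file name itself) is hypothetical, invented purely to illustrate the macro-expansion idea; it is not an actual Parrot header.

```c
/* this_platform_crt.h -- sketch of the proposed virtual CRT.
 * The main code calls only VM_* macros; this per-platform header
 * decides whether they expand to the real CRT or go around it. */
#ifndef THIS_PLATFORM_CRT_H
#define THIS_PLATFORM_CRT_H

#if defined(_WIN32) && defined(__BORLANDC__)
  /* Borland's CRT can't seek past 4GB: bypass it and go straight
   * to the OS, as was done for the perl5 port described above.
   * win32_large_fseek is a hypothetical helper, not a real API. */
  #define VM_FSEEK(fh, off, whence) \
      win32_large_fseek((fh), (off), (whence))
#else
  /* CRT is up to the job: expand directly to the CRT call. */
  #define VM_FSEEK(fh, off, whence)  fseeko((fh), (off), (whence))
#endif

/* Thread-unsafe CRT calls get reentrant expansions the same way. */
#define VM_STRTOK(s, sep, state)  strtok_r((s), (sep), (state))

#endif /* THIS_PLATFORM_CRT_H */
```

All the conditional compilation then lives in this one header, which is the maintenance win the paragraph above is after.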
Re: Threads Design. A Win32 perspective.
NS == Nigel Sandever [EMAIL PROTECTED] writes:

NS REENTRANCY

NS Not only must the VMI be coded in a reentrant fashion, with all state addressed through pointers (references) loaded into its virtual register set. All the code underlying it, including syscalls and CRT must equally be reentrant. Many APIs within many CRTs *are not reentrant* (eg. strtok()). All state must be on a per-thread not a per-process basis.

NS To this end, I would propose that no CRT APIs are used directly from the main code! Instead, a full set of CRT-like macros would be inherited from a header file, where either the real CRT API would be called, or an alternative implementation. This header file would be conditionally included on the basis of the target platform. This concentrates the bulk, if not all, platform specific code into a single file (or set of files).

this is true for c level threads but not necessarily true for VM level threads. if the VM is atomic at its operation level and can't be preempted (i.e. it is not using kernel threads with time slicing), then it could use thread unsafe calls (as long as it keeps those silly static buffers clean). parrot will (according to dan) use one interpreter per VM thread and those may run on kernel threads. it may be possible to disable preemption and/or time slicing so the VM threads will be atomic at the VM operation level and then we don't have to worry as much about thread unsafe libs. but i gather that people want real preemption and priorities and time slicing so that idea may be moot. but on most platforms that support kernel threads there are thread safe versions of most/all the c lib stuff. now, external libs that get linked in under nci is a different story.

NS ATOMICITY AND CRITICAL SECTIONS

NS Atomicity of the VMI opcodes can be achieved by having the core loop of the interpreter enter and exit a critical section each time it processes an opcode.
For those not familiar with critical sections, they are a (Win32) OS construct that prevents any cooperating thread within a process from having a timeslice until the thread that entered the critical section has exited.

that is what i mentioned above, disabling timeslicing/preemption when desired. it is not just a win32 concept. hell, turning off interrupts during interrupt handlers goes way back! redmond just likes to rename stuff and act like they invented it. :)

NS Unlike semaphores and events, critsecs only operate within a single process. They are relatively lightweight as there is no need to enter kernel mode for their operation. They are, in effect, a CPU specific test and set operation applied to a piece of user mode memory. Their lightweight nature means that they are faster than inter-process semaphore mechanisms. When used in a process that currently has only a single thread in operation, they have almost negligible performance effect upon that thread. They also operate correctly on SMP machines.

in effect it sounds like a thread shared mutex. it could be implemented in kernel or process space.

NS If two threads attempt concurrent operations on the same PMC, the first thread accessing the PMC sets the flag. When a second thread attempts to access it, it finds the flag set and blocks (relinquishing its timeslice) until the first thread has completed and cleared the flag. This doesn't solve the potential for deadlock problems arising from the user-level code, though it may be possible for the VMI to detect and diagnose these at runtime in a similar way that deep-recursion detection is done in P5.

that flag setting needs to be atomic or a mutex or similar. plain flag setting won't work. also the blocking has to be kernel level (so a kernel mutex/semaphore is needed) so the kernel slicing will work. deadlock detection is a known problem in the DB world and we can do similar things.
but are those locks VM level or kernel level? if the coder wants to prevent deadlocks they have to do a higher level lock (analogous to a full DB lock or transaction). we can't stop stupid coders from themselves.

NS The cost to the single-threaded user application is for a test/set operation (a few cycles) per object-method, plus the flag, which need only be one bit, but may be more efficiently coded as a 32-bit entity.

more than test/set is needed as it has to be atomic. some cpus have that instruction but we can't get at it directly from c and we have to emulate it on platforms which don't have it. assuming it is short and fast is not a good idea. now the atomic semaphore problem was solved long ago by dijkstra and it is a two phase test/set gizmo. it needs several machine instructions or a few lines of c. this is not a trivial amount of overhead for each access to a shared object. also this type of lock is in user space and won't block the