Re: Strings. Finally.
On Jun 14, 2004, at 1:54 PM, Dan Sugalski wrote: Parrot provides code points for all graphemes, even for those character sets/encodings which don't inherently do so. Most sets that have variable-length encodings use an escape sequence scheme--the value of the first byte in a character determines whether the grapheme is a one or more byte sequence. When parrot turns these into code points it does it by building up the final value. The first byte is put in the low 8 bits of the integer. If there's a second byte in the sequence the current value is shifted left 8 bits and the new byte is stuffed in the low 8 bits. If there's a third byte in the sequence everything is shifted left again 8 bits and that third byte is stuffed in the bottom, and so on. A grapheme consists of one or more code points.

Is "provides code points for all graphemes" really what is intended here? I assume not, since you can't represent every combination of combining Unicode characters (COMBINING GRAVE ACCENT + KATAKANA LETTER KA, say) in a single 32-bit code point. - Damien
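The byte-shifting scheme Dan describes can be sketched in a few lines (a hypothetical helper, not Parrot's actual code; the byte values are made up for illustration):

```python
def bytes_to_code_point(seq):
    """Pack a variable-length byte sequence into one integer code
    point: first byte in the low 8 bits, then shift left 8 bits
    and OR in each following byte of the sequence."""
    value = 0
    for byte in seq:
        value = (value << 8) | byte
    return value

print(hex(bytes_to_code_point([0x41])))        # 0x41 (one-byte sequence)
print(hex(bytes_to_code_point([0x81, 0x40])))  # 0x8140 (two-byte sequence)
```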
Re: Threads... last call
On Wed, Jan 28, 2004 at 12:53:09PM -0500, Melvin Smith wrote: At 12:27 PM 1/23/2004 -0800, Damien Neil wrote: Java Collections are a standard Java library of common data structures such as arrays and hashes. Collections are not synchronized; access involves no locks at all. Multiple threads accessing the same collection at the same time cannot, however, result in the virtual machine crashing. (They can result in data structure corruption, but this corruption is limited to surprising results rather than VM crash.) But this accomplishes nothing useful and still means the data structure is not re-entrant, nor is it corruption resistant, regardless of how we judge it.

Quite the contrary--it is most useful. Parrot must, we all agree, under no circumstances crash due to unsynchronized data access. For it to do so would be, among other things, a gross security hole when running untrusted code in a restricted environment. There is no need for any further guarantee about unsynchronized data access, however. If unsynchronized threads invariably cause an exception, that's fine. If they cause the threads involved to halt, that's fine too. If they cause what was once an integer variable to turn into a string containing the definition of mulching...well, that too falls under the heading of undefined results. Parrot cannot and should not attempt to correct for bugs in user code, beyond limiting the extent of the damage to the threads and data structures involved.

Java, when released, took the path that Parrot appears to be about to take--access to complex data structures (such as Vector) was always synchronized. This turned out to be a mistake--sufficiently so that Java programmers would often implement their own custom, unsynchronized replacements for the core classes. As a result, when the Collections library (which replaces those original data structures) was released, the classes in it were left unsynchronized.
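The "surprising results rather than VM crash" failure mode is just the classic lost update. A deterministic sketch, with generators standing in for threads so the bad interleaving is forced rather than left to the scheduler:

```python
def increment(store, key):
    # Each unsynchronized read-modify-write yields between its
    # steps, letting us interleave two "threads" by hand.
    value = store[key]       # read
    yield
    store[key] = value + 1   # write back

store = {"n": 0}
a, b = increment(store, "n"), increment(store, "n")
next(a)                      # thread A reads n == 0
next(b)                      # thread B also reads n == 0
for t in (a, b):             # both write back 0 + 1
    try:
        next(t)
    except StopIteration:
        pass

print(store["n"])  # 1 -- one increment was lost, but nothing crashed
```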
In Java's case, the problem was at the library level, not the VM level; as such, it was relatively easy to fix at a later date. Parrot's VM-level data structure locking will be less easy to change. - Damien
Re: Threads... last call
On Fri, Jan 23, 2004 at 10:07:25AM -0500, Dan Sugalski wrote: A single global lock, like python and ruby use, kill any hope of SMP-ability.

Assume, for the sake of argument, that locking almost every PMC every time a thread touches it causes Parrot to run four times slower. Assume also that all multithreaded applications are perfectly parallelizable, so overall performance scales linearly with number of CPUs. In this case, threaded Parrot will need to run on a 4-CPU machine to match the speed of a single-lock design running on a single CPU. The only people that will benefit from the multi-lock design are those using machines with more than 4 CPUs--everyone else is worse off.

This is a theoretical case, of course. We don't know exactly how much of a performance hit Parrot will incur from a lock-everything design. I think that it would be a very good idea to know for certain what the costs will be, before it becomes too late to change course. Perhaps the cost will be minimal--a 20% per-CPU overhead would almost certainly be worth the ability to take advantage of multiple CPUs. Right now, however, there is no empirical data on which to base a decision. I think that making a decision without that data is unwise.

As I said, I've seen a real-world program which was rewritten to take advantage of multiple CPUs. The rewrite fulfilled the design goals: the new version scaled with added CPUs. Unfortunately, lock overhead made it sufficiently slower that it took 2-4 CPUs to match the old performance on a single CPU--despite the fact that almost all lock attempts succeeded without contention. The current Parrot design proposal looks very much like the locking model that app used.

Corruption-resistant data structures without locking just don't exist. An existence proof: Java Collections are a standard Java library of common data structures such as arrays and hashes. Collections are not synchronized; access involves no locks at all.
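The break-even arithmetic above can be captured in a throwaway helper (idealized, exactly as the argument assumes: perfect linear scaling, uniform locking overhead):

```python
def relative_speed(cpus, locking_slowdown):
    """Idealized throughput of a multi-lock design relative to a
    single-lock build on one CPU: linear scaling across CPUs,
    divided by the per-CPU slowdown from fine-grained locking."""
    return cpus / locking_slowdown

print(relative_speed(4, 4.0))  # 1.0: with a 4x lock penalty, 4 CPUs just break even
print(relative_speed(1, 1.2))  # ~0.83: a 20% penalty would be a far easier trade
```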
Multiple threads accessing the same collection at the same time cannot, however, result in the virtual machine crashing. (They can result in data structure corruption, but this corruption is limited to surprising results rather than VM crash.) - Damien
Re: Start of thread proposal
On Wed, Jan 21, 2004 at 01:14:46PM -0500, Dan Sugalski wrote: ... seems to indicate that even whole ops like add P,P,P are atomic. Yep. They have to be, because they need to guarantee the integrity of the pmc structures and the data hanging off them (which includes buffer and string stuff) Personally, I think it would be better to use corruption-resistant buffer and string structures, and avoid locking during basic data access. While there are substantial differences in VM design--PMCs are much more complicated than any JVM data type--the JVM does provide a good example that this can be done, and done efficiently. Failing this, it would be worth investigating what the real-world performance difference is between acquiring multiple locks per VM operation (current Parrot proposal) vs. having a single lock controlling all data access (Python) or jettisoning OS threads entirely in favor of VM-level threading (Ruby). This forfeits the ability to take advantage of multiple CPUs--but Leopold's initial timing tests of shared PMCs were showing a potential 3-5x slowdown from excessive locking. I've seen software before that was redesigned to take advantage of multiple CPUs--and then required no less than four CPUs to match the performance of the older, single-CPU version. The problem was largely attributed to excessive locking of mostly-uncontested data structures. - Damien
Re: JVM as a threading example (threads proposal)
On Thu, Jan 15, 2004 at 11:58:22PM -0800, Jeff Clites wrote: On Jan 15, 2004, at 10:55 PM, Leopold Toetsch wrote: Yes, that's what I'm saying. I don't see an advantage of JVMs multi-step variable access, because it even doesn't provide such atomic access.

You're missing the point of the multi-step access. It has nothing to do with threading or atomic access to variables. The JVM is a stack machine. JVM opcodes operate on the stack, not on main memory. The stack is thread-local. In order for a thread to operate on a variable, therefore, it must first copy it from main store to thread-local store (the stack). Parrot, so far as I know, operates in exactly the same way, except that the thread-local store is a set of registers rather than a stack. Both VMs separate working-set data (the stack and/or registers) from main store to reduce symbol table lookups.

What I was expecting that the Java model was trying to do (though I didn't find this) was something along these lines: Accessing the main store involves locking, so by copying things to a thread-local store we can perform several operations on an item before we have to move it back to the main store (again, with locking). If we worked directly from the main store, we'd have to lock for each and every use of the variable.

I don't believe accesses to main store require locking in the JVM. This will all make a lot more sense if you keep in mind that Parrot--unthreaded as it is right now--*also* copies variables to working store before operating on them. This isn't some odd JVM strangeness. The JVM threading document is simply describing how the stack interacts with main memory. - Damien
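The copy-to-working-store point in the post above is easy to see in miniature. A toy sketch, with op sequences loosely modeled on JVM and Parrot style (not real opcodes):

```python
main_store = {"x": 2, "y": 3}   # shared symbol table / globals

# Stack machine (JVM-style): copy to the thread-local stack, operate there.
stack = []
stack.append(main_store["x"])   # load x onto the stack
stack.append(main_store["y"])   # load y onto the stack
b, a = stack.pop(), stack.pop()
stack.append(a + b)             # add, entirely in working store
main_store["z"] = stack.pop()   # store the result back to main store

# Register machine (Parrot-style): the same dance, with registers.
regs = [0] * 4
regs[0] = main_store["x"]       # set I0, x
regs[1] = main_store["y"]       # set I1, y
regs[2] = regs[0] + regs[1]     # add I2, I0, I1
main_store["w"] = regs[2]       # store the result back

print(main_store["z"], main_store["w"])  # 5 5
```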
Re: JVM as a threading example (threads proposal)
On Thu, Jan 15, 2004 at 09:31:39AM +0100, Leopold Toetsch wrote: I don't see any advantage of such a model. The more as it doesn't guarantee any atomic access to e.g. long or doubles. The atomic access to ints and pointers seems to rely on the architecture but is of course reasonable.

You *can't* guarantee atomic access to longs and doubles on some architectures, unless you wrap every read or write to one with a lock. The CPU support isn't there. (Why the e.g.? Longs and doubles are explicitly the only core data types which the JVM does not guarantee atomic access to.) - Damien
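What goes wrong without that CPU support is word tearing. A deterministic simulation: a 64-bit store done as two 32-bit halves, with a reader scheduled between them (a generator forces the unlucky interleaving):

```python
LOW, HIGH = 0, 1

def write64(cell, value):
    # Without atomic 64-bit stores, a write is two 32-bit stores;
    # the yield marks the point where a reader can sneak in.
    cell[LOW] = value & 0xFFFFFFFF
    yield
    cell[HIGH] = value >> 32

def read64(cell):
    return (cell[HIGH] << 32) | cell[LOW]

cell = [0, 0]                      # old value: 0
writer = write64(cell, 2**32 + 2)  # new value: 0x100000002
next(writer)                       # writer has stored only the low half...
torn = read64(cell)                # ...when the reader runs
print(hex(torn))                   # 0x2: neither the old nor the new value
```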
Re: Threads Design. A Win32 perspective.
On Sun, Jan 04, 2004 at 12:17:33PM -0800, Jeff Clites wrote: What are these standard techniques? The JVM spec does seem to guarantee that even in the absence of proper locking by user code, things won't go completely haywire, but I can't figure out how this is possible without actual locking. (That is, I'm wondering if Java is doing something clever.) For instance, inserting something into a collection will often require updating more than one memory location (especially if the collection is out of space and needs to be grown), and I can't figure out how this could be guaranteed not to completely corrupt internal state in the absence of locking. (And if it _does_ require locking, then it seems that the insertion method would in fact then be synchronized.)

My understanding is that Java Collections are generally implemented in Java. Since the underlying Java bytecode does not permit unsafe operations, Collections are therefore safe. (Of course, unsynchronized writes to a Collection will probably result in exceptions--but it won't crash the JVM.) For example, insertion into a list might be handled something like this (apologies for rusty Java skills):

    void append(Object new_entry) {
        if (a.length == size) {
            Object new_a[] = new Object[size * 2];
            for (int i = 0; i < size; i++) {
                new_a[i] = a[i];
            }
            a = new_a;
        }
        a[size++] = new_entry;
    }

If two threads call this function at the same time, they may well leave the list object in an inconsistent state--but there is no way that the above code can cause JVM-level problems.

The key decision in Java threading is to forbid modification of all bytecode-level types that cannot be atomically modified. For example, the size of an array cannot be changed, and strings are constant. If it WERE possible to resize arrays, the above code would require locks to avoid potential JVM corruption--every access to 'a' would need a lock against the possibility that another thread was in the process of resizing it.
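The design rule in the last paragraph can be sketched with an immutable backing store: growing builds a fresh structure and publishes it with a single reference swap, so a concurrent reader sees a stale snapshot at worst, never a half-resized array. (A Python sketch with hypothetical names, standing in for the JVM-level rule.)

```python
class AppendOnlyList:
    """Grow by building a new immutable tuple and swapping one
    reference; a reader holding the old tuple still sees a fully
    consistent (if stale) snapshot, never a half-resized array."""

    def __init__(self):
        self._items = ()          # immutable backing store

    def append(self, item):
        # Build the new tuple first, then publish with one swap.
        self._items = self._items + (item,)

    def snapshot(self):
        return self._items        # whichever tuple is current

lst = AppendOnlyList()
before = lst.snapshot()
lst.append("a")
after = lst.snapshot()
print(before, after)  # () ('a',) -- the old snapshot is untouched
```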
It's my understanding that Parrot has chosen to take the path of using many mutable data structures at the VM level; unfortunately, this is pretty much incompatible with a fast or elegant threading model. - Damien
Re: Events
On Tue, Jul 22, 2003 at 11:41:25PM -0400, Dan Sugalski wrote: First, to get it out of the way, I don't have to convince you of anything. You have to convince me. For better or worse I'm responsible for the design and its ultimately my decision. If you don't want async IO, it's time to make a very, very good case.

I hope that I haven't given the impression that I feel otherwise. You're the designer, and Parrot is your baby. I'm just expressing my opinion; you are of course completely free to disagree with me.

Let me restate my position, since I think it's getting lost in the general confusion: I'd be happy if Parrot contained no support at all for interrupts, in particular the traditional interrupt-based delivery of Unix signals. I think that support for interrupts will come at a cost, and I'd prefer not to have to pay that cost. I've expounded at length in an earlier message on why I think interrupts in application-level code are generally a bad idea. I won't bother repeating myself here; I don't think I said anything particularly controversial there.

I'm not arguing against non-blocking IO, event loops, a unified event queue, or internally using the aio_*() API on Unix. I think that all of these things are Nifty(tm) and I highly approve of all of them. I /am/ arguing against exposing the aio_*() API (or its equivalent) to code running atop the Parrot VM, on the grounds that it uses interrupts as a part of the API. I'd rather just have non-blocking IO calls and a good event queue.

On a somewhat related note, I'm dubious about the performance gains that code using interrupt-driven IO will see as opposed to code using event-loop driven IO. I /think/ you're telling me that I'm wrong, and that interrupt-driven IO does indeed have performance benefits; it's possible that you're actually telling me that event-loop driven code with non-blocking IO has performance benefits as compared to threaded code with blocking IO. If it's the latter, then we are in violent agreement.
- Damien
Re: Events
On Sun, Jul 20, 2003 at 11:59:00AM -0400, Dan Sugalski wrote: We're supporting interrupts at the interpreter level because we must. It doesn't matter much whether we like them, or we think they're a good idea, the languages we target require them to be there. Perl 5, Perl 6, Python, and Ruby all have support for Unix signals in pretty much the way you'd get them if you were writing C code. (That is to say, broken by design, dangerous to use, and of far less utility than they seem on the surface)

Right, which is why I said in my initial message that dropping interrupts might be politically impossible. I still think that including something that is broken by design, dangerous to use, and of questionable utility isn't a good idea, but I can accept the argument that it may be necessary.

It would be entirely possible for Parrot (or a Parrot library) to use AIO at a low level, without introducing interrupts to the VM layer. Sure. But what'd be the point? Adding in interrupts allows a number of high-performance idioms that aren't available without them. They certainly won't be required, and most of the IO you'll see done will be entirely synchronous, since that's what most compilers will be spitting out. You don't *have* to use IO callbacks, you just can if you want to.

Could you point me at a reference for these high-performance idioms? While I've heard of significant gains being realized through AIO, it was my understanding that this is generally related to disk IO, where Unix doesn't provide support for non-blocking IO. The performance gains come not from a different code flow, but from the ability to perform disk access in the background. (I'm not disputing that such idioms exist; if there's a better way to do things that I don't know of, I want to know more about it!)

Regarding AIO being faster: Beware premature optimization. I'm going to start carrying a nerf bat around and smack people who trot this one out.
The fact that it is often said does not make it any less true. You've asserted that Parrot will be faster (in at least some situations) with interrupt-driven IO than it will be with non-interrupt-driven IO. I'm unconvinced of this claim. In particular, I feel that support for interrupts will come at an overall performance penalty, and I am unconvinced that this penalty will not outweigh any benefits that interrupt-driven IO would bring. Now, you can ignore me if you want; you're the designer. Hitting me isn't going to convince me of anything, however. While it's not inappropriate to apply it to design, we're nowhere near that point. This isn't premature optimization, or optimization of any sort--it's design, and it should be done now. This is what we're *supposed* to be doing. It's certainly reasonable to posit that async IO is a bad design choice (won't get you very far, but you can posit it :) but please don't trot out the premature optimization quote. This is *exactly* the time when that quote is appropriate to apply. When a design decision is made because it'll be faster that way, it is always worth examining the question of whether it WILL be faster or not. (I am aware that there is a second reason for supporting interrupts in Parrot--Unix signals; I was addressing the argument that support for AIO is sufficient reason to include interrupts.) For example: If it turns out that Parrot, sans interrupt-driven IO, is capable of saturating the system bus when writing to a device, there is little point in optimizing Parrot's IO system. You may suspect, but you'd turn out to be incorrect--using threads to simulate a real async IO system still has performance wins. And we're going to be using native async stuff when we can. Do you know of a program that does this (simulated AIO via threads)? (Again, I'm not disputing your claim--it's just that this is completely contrary to my experience, and I'd like to know more about it.) - Damien
Re: Events
On Fri, Jul 18, 2003 at 05:42:10PM -0400, Benjamin Goldberg wrote: AIO is unpopular because it's not widely/portably supported, whereas non-blocking event-loop IO is, (one generally does have either select or poll), as is blocking threaded IO (even if the thread starting stuff may need to be different on some platform, it's *mostly* portable).

I disagree; AIO is not widely/portably supported because it is unpopular. Threading (on Unix systems, at least) is a much newer concept than AIO, and yet it is now nigh-ubiquitous; any modern OS needs to have solid threading support to be taken seriously. Portable libraries wrapping system-specific thread models are common. The only reason this hasn't happened with AIO is lack of user demand. The problem with AIO is that it has all the synchronization pain of threading combined with the code flow complexity of an event-based IO system. There are certainly occasions when AIO may prove to be the best or most elegant solution to a problem, but in most cases there are other approaches which are substantially simpler for the programmer.

If we make it a core part of parrot, it will become more popular, simply because of parrot's availability.

I'd be interested in seeing specific examples of problems that will be solved by adding AIO support to the VM layer. How will this feature be used in real-world programs?

Outside of signals and AIO, what requires async event dispatch? User events, as you pointed out above, are better handled through an explicit request for the next event. Inter-thread notification and timers? True, these *could* be considered to be user events, but IMHO, there are times when we want a user event to have the same (high) priority as a system signal.

I'd like a specific example (general pseudocode is fine) of inter-thread notification implemented using interrupts that a) doesn't include any race conditions, and b) can't be written more clearly using non-interrupt based code.
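For contrast, here is the non-interrupt version of inter-thread notification that the challenge above asks to beat: a blocking queue, which is race-free by construction (hypothetical message strings, Python's threading and queue modules standing in):

```python
import queue
import threading

requests, replies = queue.Queue(), queue.Queue()

def worker():
    # The worker picks up notifications at a well-defined point,
    # instead of being interrupted in the middle of an operation.
    msg = requests.get()              # blocks until notified
    replies.put("handled: " + msg)

t = threading.Thread(target=worker)
t.start()
requests.put("wake up")               # notify; no signal masks needed
reply = replies.get()                 # wait for the acknowledgement
t.join()
print(reply)                          # handled: wake up
```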
I think you're vastly underestimating the difficulty of writing interrupt-based code that doesn't include race conditions. Consider that Parrot itself has given up on trying to do this: internally, interrupts (signals) will simply result in an event being added to a queue for later processing. - Damien
Re: Events
On Fri, Jul 18, 2003 at 05:58:41PM -0400, Uri Guttman wrote: and event loop i/o doesn't usually support async file i/o. many people conflate the two. event i/o handles sockets, pipes and such but none support files. the issue is that the async file i/o api is so different (or non-existent) on all the platforms. dan wants to make clean async file i/o in the core by using a blocking thread on each i/o request and synchronizing them with this event queue in the main thread. it has the advantage of being easier to code and should work on most platforms which have threads. and he wants async file i/o in parrot core since it is faster and has to be in the core to be properly supported at higher levels.

Right, there are two independent issues here: Support for asynchronous IO (an OS feature distinct from non-blocking IO), and VM-level support for interrupts in Parrot. The latter is what I am questioning. It would be entirely possible for Parrot (or a Parrot library) to use AIO at a low level, without introducing interrupts to the VM layer.

The fact that most event loops do not support async file IO on Unix systems is due to a combination of deficiencies in the Unix APIs (select() and poll() don't work on files), and lack of implementation in the event library. There is certainly no reason a traditional event loop (such as Tcl's, which is an excellent example of a well-done event system) can't use AIO at a low level to support async file IO on Unix. (I specifically refer to Unix above because many non-Unix systems have perfectly good support for monitoring files in their equivalents of select(). So do some Unixes, for that matter.)

Regarding AIO being faster: Beware premature optimization. And are you referring to OS-level AIO (which often does have performance advantages), or application-level AIO using a collection of threads as you describe above (which I suspect will be slower than single-threaded non-blocking IO, owing to synchronization costs between threads)?
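The scheme Uri describes--a blocking thread per file request, with completion synchronized through the main thread's event queue--is straightforward to sketch (hypothetical event names; a temp file stands in for real IO):

```python
import os
import queue
import tempfile
import threading

events = queue.Queue()                 # the main thread's event queue

def async_read(path):
    # Simulated async file IO: a blocking read() in a worker
    # thread, with completion posted to the main event queue.
    def worker():
        with open(path, "rb") as f:
            events.put(("read-done", path, f.read()))
    threading.Thread(target=worker).start()

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    path = f.name

async_read(path)
kind, who, data = events.get()          # the main loop blocks here
os.unlink(path)
print(kind, data)                       # read-done b'hello'
```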
that is a major win since no other system i (or dan) has heard of has portable async file i/o. and it will be integrated into the core event handling so you will be able to mix and match async socket, terminal (on unix at least) and file i/o with timers. this is what i want. :) Do you want to use interrupt-based IO at the VM level, or do you want an event system which will function cleanly on sockets, terminals, and files? - Damien
Re: Events
On Thu, Jul 17, 2003 at 12:58:12PM -0400, Dan Sugalski wrote: The first is done in the case of readw or writew, for example. The second for event-driven programs that set up callbacks and park themselves forever in one big ProcessEvent call. (Tk programs come to mind) The third is to be scattered in generated code to make sure that events are occasionally checked for, and will probably drain only high-priority events and at most a single low-priority event.

While it's possibly politically impossible (many people are very attached to Unix signals), I'd really rather work with a system that does async event dispatch exclusively through threads. Interrupting a thread in the middle of its execution and sending it haring off to an interrupt handler is not only clumsy and difficult to implement, it's a recipe for buggy code. Much nicer would be if events were always dispatched in one of two ways:

- Synchronously, by calling a GetNextEvent or ProcessEvent function.
- Asynchronously, by spawning a new thread and executing the signal handler within it.

Is there any hope for rethinking the desire to expose the ugliness of Unix signals in the Parrot VM? - Damien
Re: Events
On Fri, Jul 18, 2003 at 11:29:27AM -0400, Dan Sugalski wrote: Nope, that won't work. A lot of what's done is really main thread with interrupts and that doesn't map well to doing all signal handling in separate threads. You really do want to pause the main program to handle the events that are coming in, if they're events of sufficient importance. Generally I put them in three classes--hard interrupts (signals), soft interrupts (IO completion stuff), and events (fuzzy user-level stuff). Hard and soft interrupts should get dealt with as soon as possible, events should probably wait until something explicitly decides to process an event.

In my experience, interrupt handlers in Perl code generally fall into three categories: Ones that set a flag to be checked later, ones that perform an action and terminate the program, and buggy ones subject to race conditions.

IO completion events in particular should not be handled by interrupting the main execution thread. The appropriate action required to handle these events will almost invariably require access to data structures shared between the interrupt handler and the main thread. If you place the interrupt handler in the main thread, you can't use locks to control access to these structures (as the handler will wait on the main thread's lock, while the main thread will block on the handler returning). This leads to Unix-style signal masks, where interrupts are blocked during critical sections. While this works, I strongly feel that a platform with thread support is better off dispatching interrupts to a separate thread and using the existing interthread synchronization mechanisms, rather than introducing a separate interrupt masking system.

Also, given that asynchronous IO is a fairly unpopular programming technique these days (non-blocking event-loop IO and blocking threaded IO are far more common), I would think long and hard before placing support for it as a core design goal of the VM.
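The first (and only reliably safe) category above looks like this: the handler does nothing but set a flag, and the main loop checks the flag at a point of its own choosing. (A POSIX-only sketch; SIGUSR1 is raised in-process here as a stand-in for an external kill(2).)

```python
import signal

pending = {"sigusr1": False}

def on_sigusr1(signum, frame):
    # Nothing racy in here: just note that the signal arrived.
    pending["sigusr1"] = True

signal.signal(signal.SIGUSR1, on_sigusr1)
signal.raise_signal(signal.SIGUSR1)    # stand-in for an external kill(2)

# Main loop: check the flag at a safe point, and do the work there.
handled = False
if pending["sigusr1"]:
    pending["sigusr1"] = False
    handled = True
    print("got SIGUSR1")
```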
If there is a compelling reason to use AIO, the implementation might be better handled at a lower level than Parrot; even if Parrot itself does not support AIO at the opcode level, Parrot programs could still use it by calling down to a C library.

It's not just signals, there's a lot of stuff that falls into this category. We've got to deal with it, and deal with it properly, since not dealing with it gets you an 80% solution.

Outside of signals and AIO, what requires async event dispatch? User events, as you pointed out above, are better handled through an explicit request for the next event. - Damien
Re: Streams vs. Descriptors
On Mon, Jul 15, 2002 at 08:59:40PM -0400, Melvin Smith wrote: True async IO implementations allow other things besides just notifying the process when data is available. Things like predictive seeks, or bundling up multiple read/writes, etc. aren't doable with select/poll loops. And the aioread/aiowrite/listio, etc. are a POSIX standard now, so they should be reasonably available on most UNIXen.

I'm not familiar with predictive seeks, and a quick google didn't turn up anything relevant; can you give a quick explanation? Bundling reads and writes sounds like a job for a buffered I/O layer. Are the aio* calls available on Windows? On the Macintosh? (My OS X system doesn't have a manpage for aioread, and man -k aio doesn't turn up anything obvious.) How about PalmOS? While the POSIX standard is a help, I think async I/O remains far less portable than the more traditional alternatives.

You are right, though, I blurred the concepts. Callbacks are good to have as well, for calling code blocks when data is available, and this might be done as an event loop, or a thread. However, the talks I've had with Dan always ended up in us deciding that calling an event loop between every op, or even every N ops wasn't what we wanted to do.

Certainly, calling an event loop between every op would be insane. That's not the normal way of using one, however. Consider the (excellent) Tcl event loop as an example: When a condition triggers the loop, it invokes the appropriate callback which runs to completion before returning control to the loop. This doesn't allow an event to interrupt the current thread of control, of course. The most common way of having multiple concurrent flows of control is, however, exactly that--threads. Threads can be used independently (the Java approach; all I/O is blocking) or in conjunction with an event loop (the Macintosh OS X event loop takes this approach). I really recommend taking a look at the Tcl event loop and I/O system, if you haven't already.
It's a joy to work with, and one of the best features of that language. For many things, synchronous IO is adequate, and faster, but for people that really want the aio interface, I'm not sure it is worth trying to fake it. I'm sure that there are things async I/O is very good at, but I'm not certain it makes sense to design Parrot's I/O system around them. Might it not make more sense for async I/O to be available via an alternate API? - Damien
Re: Streams vs. Descriptors
On Tue, Jul 16, 2002 at 11:35:10AM -0700, John Porter wrote: Damien Neil wrote: I'm not familiar with predictive seeks, can you give a quick explanation? It's very much like predictive loading of the instruction cache in a cpu. It makes a heuristic guess that since you just read 1000 bytes in order, you're probably going to want to read the next 1000 bytes in order, so it reads them in even before you ask for them. This can be extended to seeks in general. However, prediction is usually too strong a term. It's usually just pre-reading of the linear stream. (The program is a lazy consumer. :-)

Ah, that I'm familiar with. Surely that isn't specific to async I/O? I'm fairly certain that many OSs will do readahead on ordinary read() calls.

In the end, there should be nothing of which it can be said, "It is easier to do in Tcl than in Perl."

Hear, hear! I've been missing Tcl's event loop for years. - Damien
Re: PARROT QUESTIONS: Use the source, Luke
On Mon, Jul 15, 2002 at 12:34:52AM -0400, Melvin Smith wrote: The last four are reserved by various C and C++ standards. I always hear this, but in real life it is never much of a problem. Especially with a namespace like [Parrot] It is a good idea to avoid using the reserved identifier space, not only because it avoids conflicts with vendor libraries, but for documentation purposes. The leading underscore means system internal, do not touch; blurring this meaning doesn't help. It's also unnecessary. It isn't like there aren't perfectly good alternatives--what's wrong with Parrot__? - Damien
Re: Streams vs. Descriptors
On Mon, Jul 15, 2002 at 12:16:29AM -0400, Melvin Smith wrote: 1) Async support. The IO system needs to be asynchronous and re-entrant at the core, whether by threads or by use of the platform's async support. Other things like callbacks assume other features of Parrot to be finished, like subs/methods. Out of curiosity, what's the motivation for supporting true signal-driven async IO (which is what you seem to be referring to)? In my experience, nonblocking IO and a standard event loop is more than sufficient, and far easier to implement--especially portably. - Damien
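The alternative named in the post above--non-blocking IO plus a standard event loop--is only a few lines with a select-style interface. A sketch using Python's selectors module as a stand-in for whatever portable layer Parrot would provide:

```python
import os
import selectors

sel = selectors.DefaultSelector()
r, w = os.pipe()                       # a readable descriptor to watch
os.set_blocking(r, False)

received = []

def on_readable(fd):
    # Callbacks run to completion inside the loop; nothing is
    # interrupted mid-operation.
    received.append(os.read(fd, 64))

sel.register(r, selectors.EVENT_READ, on_readable)
os.write(w, b"ping")

# One turn of the event loop: wait for readiness, then dispatch.
for key, _ in sel.select(timeout=1):
    key.data(key.fileobj)

print(received)                        # [b'ping']
```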
Re: [CONFIGURE] New make.pl coming soon...
On Wednesday, April 24, 2002, at 04:04 PM, Robert Spier wrote: One of the keys of the system Jeff has implemented is that it's 100% real perl code and real perl objects, not a language parsed with perl. This means you can do nifty things and write perl code to modify things in a natural way. This is true for cons as well. - Damien
0.0.2 needs what?
Are there any issues holding up 0.0.2 that I (or others) could work on? Failing that, what areas of Parrot are most in need of immediate work? I'm interested in looking at the bytecode loader, if nobody else has intentions there. In particular, I'd like to see if we can get empirical data to justify some of the design decisions that are being assumed. Exactly how expensive, for example, would it be to use a single bytecode format with platform-independent encodings? - Damien
Re: Strings db
On Tue, Sep 25, 2001 at 07:29:01PM -0700, Wizard wrote: Actually, the thing that I didn't like was using an actual string as the message_id. I would have expected something more in the way of: char *err = get_text_string( THREAD_EXCEPTION_117, "THREAD EXCEPTION: Not enough handles." );

This is a far more error-prone interface in a number of ways: It is very easy for the mapping between the number and the string to be lost. Adding and removing strings is harder: the string list will become filled with holes (or must be renumbered), and the numeric order of the strings will probably not correspond with the logical order. Numerically-indexed catalogs are far more prone to failure when using out-of-date catalog files, while string-indexed ones will mostly continue to work. (With the obvious exception that messages not contained in the old catalog cannot be displayed from it.)

All these disadvantages are a significant penalty to pay for a very minor improvement in efficiency. (If there is one thing that Perl has demonstrated, it is that looking up a string in a hash is fast.) - Damien
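The string-indexed interface being defended above amounts to a hash lookup with the message itself as both key and fallback, so stale catalogs degrade gracefully instead of failing. A sketch with hypothetical catalog contents:

```python
catalog = {  # e.g. a loaded translation catalog; possibly out of date
    "THREAD EXCEPTION: Not enough handles.":
        "AUSNAHME: Nicht genug Handles.",
}

def get_text_string(default_text):
    # The English text is itself the key: a message missing from an
    # old catalog falls back to the built-in text instead of failing.
    return catalog.get(default_text, default_text)

print(get_text_string("THREAD EXCEPTION: Not enough handles."))
print(get_text_string("THREAD EXCEPTION: Deadlock detected."))  # not in catalog
```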
Re: 0.0.2 needs what?
On Tue, Sep 25, 2001 at 07:36:31PM -0400, Gregor N. Purdy wrote: I'm currently working on some assigned tasks for the bytecode stuff for 0.0.2. I need to get it to the point where we can stash NVs in the const_table. I've already got the interpreter using packfile.[hc] for its work (I posted a patch earlier today). After taking a look at the packfile code, I think the interface needs to be made more generic. I don't believe the file format should be aware of the nature of the contents. For example, rather than having functions to access the constant table, the fixup table, and the bytecode, I would rather see a single set of functions which take a section ID as a parameter. I also feel that the prior discussions on using a preexisting file format were on the right track. With a good API, however, the file format can be completely redefined, so this is a less pressing concern. (I also still think that IFF fits our needs quite closely, although its support for data structure nesting may be more than we want.) - Damien
Re: 0.0.2 needs what?
On Wed, Sep 26, 2001 at 12:38:28AM +0100, Simon Cozens wrote: But then I'm one of those weird critters who doesn't understand what all the complaining over XS is about. :) I'd be happy to do the XS coding if it came down to it. I'll take a look at making the assembler and disassembler use the C packfile routines through XS. As I just mentioned in a previous mail, however, I'm not very happy with the current packfile API... should I go ahead and use the existing one (temporarily, I hope :), or is this section not covered by the current feature freeze? - Damien
Re: Draft switch for DO_OP() :-)
On Thu, Sep 20, 2001 at 11:11:42AM -0400, Dan Sugalski wrote: Actually the ops=C conversion was conceived to do exactly what's being done now--to abstract out the body of the opcodes so that they could be turned into a switch, or turned into generated machine code, or TIL'd. If you're finding that this isn't working well it's a sign we need to change things some so they will. (Better now than in six months...) The problem is that the conversion currently done by process_opcodes.pl translates the op definitions into functions, and leaves the remainder of the file untouched. This is useful, because it allows the opcode file to include headers, declare file-scope variables, and the like. Unfortunately, when translating the ops into a switch statement in a header file, there is no place to put this non-opcode code. There are a few approaches we can take. The simplest, I think, is to ignore the problem when generating inline ops; given that these ops are going to be compiled at Perl build time (they can never be dynamically loaded for obvious reasons), we can manually put any required #includes in interpreter.c. Files containing dynamically-loaded ops can be generated in the same way that process_opcodes.pl does now, preserving the file-scope code. Another approach would be to include a means of defining information that must be included by the file implementing the ops. For example: HEADER { #include <math.h> } This would then be placed into interp_guts.h. (Possibly surrounded by a conditional guard (#ifdef PARROT_OP_IMPLEMENTATION), so no file other than interpreter.h will pick up that code.) - Damien
Re: niave question about Parrot::Opcode
On Wed, Sep 19, 2001 at 01:40:31PM -0400, Pat Eyler wrote: I realize that the $count inside the if block shown masks the $count declared outside the while loop, but (to me) this would be easier to understand if the inner $count were changed to $numParams -- it is more obvious on casual reading that $count and $count are two different things. Am I missing something? No, you aren't. That IS confusing. 2) It also appears that a second (older?) version of read_ops and an associated pile of pod is still in the Opcode.pm file; can this be trimmed (removing about 80 lines from the file)? Where on earth did that come from? Patch attached to rename the second $count, and to remove the duplicate code. - Damien

Index: Parrot/Opcode.pm
===================================================================
RCS file: /home/perlcvs/parrot/Parrot/Opcode.pm,v
retrieving revision 1.6
diff -u -r1.6 Opcode.pm
--- Parrot/Opcode.pm	2001/09/18 00:32:15	1.6
+++ Parrot/Opcode.pm	2001/09/20 07:23:44
@@ -28,9 +28,9 @@
     my($name, @params) = split /\s+/;
     if (@params && $params[0] =~ /^\d+$/) {
-        my $count = shift @params;
+        my $nparams = shift @params;
         die "$file, line $.: opcode $name parameters don't match count\n"
-            if ($count != @params);
+            if ($nparams != @params);
     }

     warn "$file, line $.: opcode $name redefined\n" if $opcode{$name};
@@ -108,91 +108,5 @@
 The fingerprint() function returns the MD5 signature (in hex) of the
 opcode table.
-
-=cut
-package Parrot::Opcode;
-
-use strict;
-use Symbol;
-
-sub read_ops {
-my $file = @_ ? shift : "opcode_table";
-
-my $fh = gensym;
-open $fh, $file or die "$file: $!\n";
-
-my %opcode;
-my $count = 1;
-while (<$fh>) {
-    s/#.*//;
-    s/^\s+//;
-    chomp;
-    next unless $_;
-
-    my($name, @params) = split /\s+/;
-    if (@params && $params[0] =~ /^\d+$/) {
-        my $count = shift @params;
-        die "$file, line $.: opcode $name parameters don't match count\n"
-            if ($count != @params);
-    }
-
-    warn "$file, line $.: opcode $name redefined\n" if $opcode{$name};
-
-    $opcode{$name}{ARGS} = @params;
-    $opcode{$name}{TYPES} = \@params;
-    $opcode{$name}{CODE} = ($name eq "end") ? 0 : $count++;
-    $opcode{$name}{FUNC} = "Parrot_op_$name";
-
-    my $num_i = () = grep {/i/} @params;
-    my $num_n = () = grep {/n/} @params;
-    $opcode{$name}{RETURN_OFFSET} = 1 + $num_i + $num_n * 2;
-}
-
-return %opcode;
-}
-
-1;
-
-
-__END__
-
-=head1 NAME
-
-Parrot::Opcode - Read opcode definitions
-
-=head1 SYNOPSIS
-
-  use Parrot::Opcode;
-
-  %opcodes = Parrot::Opcode::read_ops();
-
-=head1 DESCRIPTION
-
-The read_ops() function parses the Parrot opcode_table file, and
-returns the contents as a hash. The hash key is the opcode name;
-values are hashrefs containing the following fields:
-
-=over
-
-=item CODE
-
-The opcode number.
-
-=item ARGS
-
-The opcode argument count.
-
-=item TYPES
-
-The opcode argument types, as an arrayref.
-
-=item FUNC
-
-The name of the C function implementing this op.
-
-=back
-
-read_ops() takes an optional argument: the file to read the opcode table
-from.

 =cut
Re: [PATCH] Changes to interpreter op table and simplified DO_OP
Oops; that'll teach me to submit things before a cvs update. The generate.pl I just sent is out-of-date with regards to CVS. Attached is an updated version. (I haven't seen my prior mail go through yet; I'm guessing this is the list being slow, but it might be a problem with my local mail system. Just in case, I'm sending this from a different machine. If this arrives without my prior message, that means my mail system is screwy. :) - Damien generate.pl
Re: question about branching/returning
On Wed, Sep 19, 2001 at 10:32:18PM -0700, Dave Storrs wrote: Ok, that was pretty much what I thought. But then what is the 'end' opcode for? It does a 'RETURN 0', which would increment the PC by 0 opcodes...which either counts as an infinite loop or a no-op, and we've already got a no-op op. RETURN(0) is special-cased by process_opcodes(); it returns a literal 0, not a relative address. As other people have noted, this is irrelevant, as end is never called. - Damien
Re: Tru64
On Thu, Sep 20, 2001 at 09:06:12AM -0500, Gibbs Tanton - tgibbs wrote: Damien, is there any way we could get a similar fix for number.t? That would make us at 100% on Tru64. (Apologies if this shows up twice; something appears to be screwy with my mail system.) I'm currently getting segfaults on all tests on Tru64; I'll look into it if I get a chance, but I may not have time for a few days. (I'm flying to Connecticut for a friend's wedding tomorrow morning.) I didn't think there were any tests in number.t which would be particularly architecture-dependent...which ones are failing for you, and what output are they producing? - Damien
Re: Name lengths in C code
On Thu, Sep 20, 2001 at 05:09:52PM -0400, Dan Sugalski wrote: Just a reminder--function names shouldn't exceed 31 characters. The C standard doesn't guarantee anything past that... You think that's bad? You aren't guaranteed more than six characters, case-insensitive, for external identifiers. I've been told that Oracle actually requires conformance to this in their coding standards. I'm very happy that I don't have to write code for Oracle... - Damien
_read = read
test_main.c still seems to contain a call to _read(), rather than read(). This breaks compilation under Tru64 for me; the attached patch removes the _. - Damien

Index: test_main.c
===================================================================
RCS file: /home/perlcvs/parrot/test_main.c,v
retrieving revision 1.11
diff -u -r1.11 test_main.c
--- test_main.c	2001/09/18 21:03:27	1.11
+++ test_main.c	2001/09/20 21:17:44
@@ -94,7 +94,7 @@
 #ifndef HAS_HEADER_SYSMMAN
     program_code = (opcode_t*)mem_sys_allocate(program_size);
-    _read(fd, (void*)program_code, program_size);
+    read(fd, (void*)program_code, program_size);
 #else
     program_code = (opcode_t*)mmap(0, program_size, PROT_READ, MAP_SHARED, fd, 0);
 #endif
Re: Tru64
On Thu, Sep 20, 2001 at 09:06:12AM -0500, Gibbs Tanton - tgibbs wrote: Failed 1/5 test scripts, 80.00% okay. 7/74 subtests failed, 90.54% okay. make: *** [test] Error 2 Damien, is there any way we could get a similar fix for number.t? That would make us at 100% on Tru64. I'm currently getting segfaults on all tests on Tru64; I'll look into it if I get a chance, but I may not have time for a few days. (I'm flying to Connecticut for a friend's wedding tomorrow morning.) - Damien
Some tests
The attached file contains tests for all Parrot integer ops. - Damien #! perl -w use Parrot::Test tests = 26; output_is(CODE, OUTPUT, set_i_ic); # XXX: Need a test for writing outside the set of available # registers. Parrot doesn't check for this at the moment. set I0, 0x12345678 print I0 print \\n set I31, 0x9abcdef1 print I31 print \\n set I1, 2147483647 print I1 print \\n set I2, -2147483648 print I2 print \\n set I3, 4294967295 print I3 print \\n CODE 305419896 -1698898191 2147483647 -2147483648 -1 OUTPUT output_is(CODE, OUTPUT, set_i); set I0, 0x77665544 set I1, I0 print I1 print \\n CODE 2003195204 OUTPUT output_is(CODE, OUTPUT, add_i); set I0, 0x11223344 add I1, I0, I0 print I1 print \\n add I2, I0, I1 print I2 print \\n add I2, I2, I2 print I2 print \\n set I3, 2147483647 set I4, 1 add I5, I3, I4 print I5 print \\n set I6, -1 add I7, I5, I6 print I7 print \\n CODE 574908040 862362060 1724724120 -2147483648 2147483647 OUTPUT output_is(CODE, OUTPUT, sub_i); set I0, 0x12345678 set I1, 0x01234567 sub I2, I0, I1 print I2 print \\n CODE 286331153 OUTPUT output_is(CODE, OUTPUT, mul_i); set I0, 7 set I1, 29 mul I2, I0, I1 print I2 print \\n CODE 203 OUTPUT output_is(CODE, OUTPUT, div_i); set I0, 0x set I1, 0x div I2, I0, I1 print I2 print \\n set I0, 11 set I1, 2 div I2, I0, I1 print I2 print \\n set I0, 9 set I1, -4 div I2, I0, I1 print I2 print \\n CODE 3 5 -2 OUTPUT output_is(CODE, OUTPUT, mod_i); set I0, 17 set I1, 5 mod I2, I0, I1 print I2 print \\n set I0, -57 set I1, 10 mod I2, I0, I1 print I2 print \\n CODE 2 -7 OUTPUT output_is(CODE, OUTPUT, eq_i_ic); set I0, 0x12345678 set I1, 0x12345678 set I2, 0x76543210 eq I0, I1, ONE, ERROR print bad\\n ONE: print ok 1\\n eq I1, I2, ERROR, TWO print bad\\n TWO: print ok 2\\n end ERROR: print bad\\n CODE ok 1 ok 2 OUTPUT output_is(CODE, OUTPUT, eq_ic_ic); set I0, -42 eq I0, 42, ERROR, ONE print bad\\n ONE: print ok 1\\n eq I0, -42, TWO, ERROR print bad\\n TWO: print ok 2\\n end ERROR: print bad\\n CODE ok 1 ok 2 
OUTPUT output_is(CODE, OUTPUT, ne_i_ic); set I0, 0xa0b0c0d0 set I1, 0xa0b0c0d0 set I2, 0 ne I0, I2, ONE, ERROR print bad\\n ONE: print ok 1\\n ne I0, I1, ERROR, TWO print bad\\n TWO: print ok 2\\n end ERROR: print bad\\n CODE ok 1 ok 2 OUTPUT output_is(CODE, OUTPUT, ne_ic_ic); set I0, 427034409 ne I0, 427034409, ERROR, ONE print bad\\n ONE: print ok 1\\n ne I0, 427034408, TWO, ERROR print bad\\n TWO: print ok 2\\n end ERROR: print bad\\n CODE ok 1 ok 2 OUTPUT output_is(CODE, OUTPUT, lt_i_ic); set I0, 2147483647 set I1, -2147483648 set I2, 0 set I3, 0 lt I1, I0, ONE, ERROR print bad\\n ONE: print ok 1\\n lt I0, I1, ERROR, TWO print bad\\n TWO: print ok 2\\n lt I2, I3, ERROR, THREE print bad\\n THREE: print ok 3\\n end ERROR: print bad\\n CODE ok 1 ok 2 ok 3 OUTPUT output_is(CODE, OUTPUT, lt_ic_ic); set I0, 2147483647 set I1, -2147483648 set I2, 0 lt I0, -2147483648, ERROR, ONE print bad\\n ONE: print ok 1\\n lt I1, 2147483647, TWO, ERROR print bad\\n TWO: print ok 2\\n lt I0, 0, ERROR, THREE print bad\\n THREE: print ok 3\\n end ERROR: print bad\\n CODE ok 1 ok 2 ok 3 OUTPUT output_is(CODE, OUTPUT, le_i_ic); set I0, 2147483647 set I1, -2147483648 set I2, 0 set I3, 0 le I1, I0, ONE, ERROR print bad\\n ONE: print ok 1\\n le
Re: A task for the interested
On Tue, Sep 18, 2001 at 03:55:23PM -0400, Dan Sugalski wrote: Anyone care to take a shot at it? Having an extra overridable column in the opcode_table file (so we know which opcodes are overridable, and thus can't be in the switch) would be a good thing while you were at it... I will do this tonight, if nobody else gets to it first. - Damien
Bytecode safety
Proposed: Parrot should never crash due to malformed bytecode. When choosing between execution speed and bytecode safety, safety should always win. Careful op design and possibly a validation pass before execution will hopefully keep the speed penalty to a minimum. Yes, no? - Damien
Re: Bytecode safety
On Tue, Sep 18, 2001 at 10:40:30PM +0100, Simon Cozens wrote: On Tue, Sep 18, 2001 at 02:37:43PM -0700, Damien Neil wrote: Proposed: Parrot should never crash due to malformed bytecode. Haven't we done this argument? :) Sort of, while talking about other things. I wanted to drag it out to stand on its own. :) - Damien
Number tests
...and here are tests for the number ops. - Damien #! perl -w use Parrot::Test tests = 23; output_is(CODE, OUTPUT, set_n_nc); set N0, 1.0 set N1, 4.0 set N2, 16.0 set N3, 64.0 set N4, 256.0 set N5, 1024.0 set N6, 4096.0 set N7, 16384.0 set N8, 65536.0 set N9, 262144.0 set N10, 1048576.0 set N11, 4194304.0 set N12, 16777216.0 set N13, 67108864.0 set N14, 268435456.0 set N15, 1073741824.0 set N16, 4294967296.0 set N17, 17179869184.0 set N18, 68719476736.0 set N19, 274877906944.0 set N20, 1099511627776.0 set N21, 4398046511104.0 set N22, 17592186044416.0 set N23, 70368744177664.0 set N24, 281474976710656.0 set N25, 1.12589990684262e+15 set N26, 4.5035996273705e+15 set N27, 1.8014398509482e+16 set N28, 7.20575940379279e+16 set N29, 2.88230376151712e+17 set N30, 1.15292150460685e+18 set N31, 4.61168601842739e+18 print N0 print \\n print N1 print \\n print N2 print \\n print N3 print \\n print N4 print \\n print N5 print \\n print N6 print \\n print N7 print \\n print N8 print \\n print N9 print \\n print N10 print \\n print N11 print \\n print N12 print \\n print N13 print \\n print N14 print \\n print N15 print \\n print N16 print \\n print N17 print \\n print N18 print \\n print N19 print \\n print N20 print \\n print N21 print \\n print N22 print \\n print N23 print \\n print N24 print \\n print N25 print \\n print N26 print \\n print N27 print \\n print N28 print \\n print N29 print \\n print N30 print \\n print N31 print \\n CODE 1.00 4.00 16.00 64.00 256.00 1024.00 4096.00 16384.00 65536.00 262144.00 1048576.00 4194304.00 16777216.00 67108864.00 268435456.00 1073741824.00 4294967296.00 17179869184.00 68719476736.00 274877906944.00 1099511627776.00 4398046511104.00 17592186044416.00 70368744177664.00 281474976710656.00 1125899906842620.00 4503599627370500.00 18014398509482000.00 72057594037927904.00 288230376151712000.00 1152921504606850048.00 4611686018427389952.00 OUTPUT output_is(CODE, OUTPUT, add_n); set N0, 1.0 add N1, N0, N0 print N1 print \\n add N2, N0, N1 
print N2 print \\n add N2, N2, N2 print N2 print \\n CODE 2.00 3.00 6.00 OUTPUT output_is(CODE, OUTPUT, sub_i); set N0, 424242.0 set N1, 4200.0 sub N2, N0, N1 print N2 print \\n CODE 420042.00 OUTPUT output_is(CODE, OUTPUT, mul_i); set N0, 2.0 mul N1, N0, N0 mul N1, N1, N0 mul N1, N1, N0 mul N1, N1, N0 mul N1, N1, N0 mul N1, N1, N0 mul N1, N1, N0 print N1 print \\n CODE 256.00 OUTPUT output_is(CODE, OUTPUT, div_i); set N0, 10.0 set N1, 2.0 div N2, N0, N1 print N2 print \\n set N3, 7.0 set N4, 2.0 div N3, N3, N4 print N3 print \\n set N5, 9.0 set N6, -4.0 div N7, N5, N6 print N7 print \\n CODE 5.00 3.50 -2.25 OUTPUT output_is(CODE, OUTPUT, eq_n_ic); set N0, 5.01 set N1, 5.01 set N2, 5.02 eq N0, N1, ONE, ERROR print bad\\n ONE: print ok 1\\n eq N1, N2, ERROR, TWO print bad\\n TWO: print ok 2\\n end ERROR: print bad\\n CODE ok 1 ok 2 OUTPUT output_is(CODE, OUTPUT, eq_nc_ic); set N0, 1.01 eq N0, 1.00, ERROR, ONE print bad\\n ONE: print ok 1\\n eq N0, 1.01, TWO, ERROR print bad\\n TWO: print ok 2\\n end ERROR: print bad\\n CODE ok 1 ok 2 OUTPUT output_is(CODE,
Re: t/op/integer.t is IMHO wrong
On Wed, Sep 19, 2001 at 12:51:43AM +0200, Mattia Barbon wrote: I think that expecting 4294967295 == -1 because they have the same bit pattern ( on two's complement 32 bit machines ) is wrong I was wondering how long it would take for someone to notice that. :) If anyone feels like defining a policy on what Parrot does with out-of-range numbers, and what happens on integer overflow, I'll submit patches to the tests to check against it. I'd rather we didn't just modify the tests to never trigger overflow conditions, however; that's just sweeping the issue under the rug. - Damien
Re: Tests
On Tue, Sep 18, 2001 at 06:12:48PM -0500, Gibbs Tanton - tgibbs wrote: All the tests are great! But, could everyone please remember to put an "end" at the end of each assembly test...cygwin doesn't like it if you don't. I think I've patched all the ones up to this point. Oops. Sorry about that; I thought I had seen a patch go through to make the ends optional. - Damien
Re: naming conventions on opcodes
On Tue, Sep 18, 2001 at 07:52:06PM -0400, Dan Sugalski wrote: More to the point, it needs typing exactly twice--once in the .ops file that defines the opcode function body, and once in opcode_table. The assembler, of course, uses the smaller name. Three times: And once to name the test case. :) - Damien
Re: Difficulties
On Sat, Sep 15, 2001 at 01:15:57AM -0700, Brent Dax wrote: As for the 5.6 thing...I think we're supposed to support 5.005 and above. Can you tell what Parrot::Opcode needs it for? (And if it's for 'our', I'm going to punch someone... :^) ) Er...I think it IS for our, actually. :) I'm so used to using it, I didn't realize I was introducing a 5.6ism. The silly thing is, I deliberately avoided using open(my $fh, $file) to keep from requiring 5.6... I notice that someone did add a use 5.6.0 to Parrot::Opcode--here's a patch which removes it, and the offending ours. - Damien

Index: Parrot/Opcode.pm
===================================================================
RCS file: /home/perlcvs/parrot/Parrot/Opcode.pm,v
retrieving revision 1.3
diff -u -r1.3 Opcode.pm
--- Parrot/Opcode.pm	2001/09/15 00:57:42	1.3
+++ Parrot/Opcode.pm	2001/09/15 08:33:48
@@ -1,12 +1,11 @@
 package Parrot::Opcode;

-use 5.6.0;
 use strict;
 use Symbol;
 use Digest::MD5 qw(md5_hex);

-our %opcode;
-our $fingerprint;
+my %opcode;
+my $fingerprint;

 sub _load {
     my $file = @_ ? shift : "opcode_table";
Re: Difficulties
On Sat, Sep 15, 2001 at 01:52:26AM -0700, Brent Dax wrote: use vars qw(%opcode $fingerprint);#or strict will throw a tantrum Not necessary--the patch changes those variables to lexicals. There wasn't any strong reason for them to be package vars. - Damien
Re: Half-completed parrot/parrot.h conversion?
On Fri, Sep 14, 2001 at 11:31:20AM -0500, Gibbs Tanton - tgibbs wrote: The patch assumes that your source code directory is named parrot. This may have been an invalid assumption, but it is going to be hard to do this patch unless we agree on the name of the source directory. That may be difficult. I occasionally like to have multiple copies of the source directory around for testing--usually parrot.orig and parrot. I suspect I'm not the only one. Having the compile in the parrot.orig directory pick up the includes from ../parrot would be surprising, to say the least. - Damien
Re: RFC: Bytecode file format
On Sat, Sep 15, 2001 at 01:03:51AM +0300, Jarkko Hietaniemi wrote: Re: IFF. Being an old Amiga user, I find it appealing. Is the lack of a dictionary likely to be a significant problem? Please elaborate. IFF stores a linear series of chunks. Each chunk has a header containing the chunk id, and the size of the chunk. In order to get a listing of all chunks in an IFF file, you need to do a linear scan of the chunks. A file format with a dictionary would contain a single section with a list of all chunks in the file, eliminating the need to do numerous seeks and reads to pull in the contents. - Damien
Re: RFC: Bytecode file format
On Sat, Sep 15, 2001 at 12:39:39AM +0300, Jarkko Hietaniemi wrote: It will be hard to use one format for both native and portable. Not one format, but a set of closely related formats with well-defined transformations between them. After thinking about implementing this for a bit, I'm becoming dubious about the value of allowing any instance of Parrot to read the native bytecode of every other Parrot out there. Do we really want the non-native byteloader to be capable of reading everything from little-endian 16-bit to 64-bit mixed-endian? What about 36-bit? (PDP-6 port, anyone? :) I propose two encodings per Parrot: portable and native. Portable is big-endian 32-bit words. Native is, of course, whatever makes sense for the local machine. If you want to share bytecode between machines, you pass the "create portable bytecode" switch to the assembler. - Damien
Re: Patch: Common opcode_table parsing
On Thu, Sep 13, 2001 at 08:25:46AM +0100, Simon Cozens wrote: On Thu, Sep 13, 2001 at 12:29:18AM -0700, Damien Neil wrote: CVS changes over the past couple of days mean this patch will no longer cleanly apply. I'd be happy to update it to patch cleanly against the current CVS code, but I'd like to know first if the approach it takes is on the right track. I like it, if only because reduction of common code is always good, and reduction of common code while everything's in a lot of flux is even better. OK, I'll go through and update it again. This patch takes out the parsing of interp_guts.h, which I think is good for a variety of reasons. (Summary: it makes things simpler, and I don't think parsing it will buy us anything at this point in time.) Is this OK, or should I put it back in? Urgh, urgh, urgh. I don't *like* the idea of munging opcode function names, but I equally don't like coredumps. Isn't there a way of telling the linker to use our own symbols? Actually, the problem is that the linker IS using our symbols. :) There appears to be an "end" symbol somewhere in libc that is getting clobbered by the Parrot symbol. I think. I didn't look deeply enough to see exactly how things were going wrong, once I traced the core to a symbol clash. I *really* think we need to munge the names, though. "end" is just far too common a symbol for us to be able to pollute it. Let's learn the lesson from Perl 5: All symbols exported by the Parrot code need a prefix. - Damien
Re: Patch: Common opcode_table parsing
On Thu, Sep 13, 2001 at 08:44:44AM +0100, Simon Cozens wrote: Aiiee. Yes, I appreciate what you're saying, but the other lesson from Perl 5 is that if you want to do that, you end up with either lots of unwieldy code, or a nasty macro renaming. Which is it gonna be? I don't really like the Perl 5 approach of lots of macros. I'd rather have a short prefix attached to all symbols. par_foo() rather than foo() in all cases. For very commonly used macros (the equivalent of PUSHi()), you might relax this rule. Look at the current source: Is it really going to be any harder to always type par_string_length() rather than string_length(), or Par_Allocate_Aligned() rather than Allocate_Aligned()? The symbol names are long enough to begin with that an extra four characters isn't going to make much difference. Even a single character would do a lot to eliminate symbol clashes: Pstring_length() and PAllocate_Aligned(), for example. (Speaking of the above, someone authoritative may want to dictate whether functions are Upper_Case or lower_case.) Talking just about the opcode functions, however: Will code be calling opcode functions directly very often? Perhaps I'm wrong, but I'd think that ops being called outside the runops() loop will be rare. In fact, if ops can be embedded into a switch statement in runops() at compile time, there won't even be any assurance that there ARE any op functions to call. Unwieldy op function names shouldn't be a problem. - Damien
Re: Using int32_t instead of IV for code
On Thu, Sep 13, 2001 at 10:06:51AM +0100, Philip Kendall wrote: If we are going to keep on doing fancy stuff with pointer arithmetic (eg the Alloc_Aligned/CHUNK_BASE stuff), I think we're also going to need an integer type which is guaranteed to be the same width as a pointer, so we can freely typecast between the two. The language lawyer in me insists that I point out that this is inherently nonportable. C does not guarantee that it is possible to convert losslessly between pointers and integers; there have been systems on which this was impossible (or hugely inefficient) for hardware reasons. The correct approach to storing pointers and integers in the same value is to use a union. Personally, I would use: typedef union { int i; void *p; } IV; I realize that I'm probably in a minority of one on this. :) Also, if we've got a system with 64 bit IVs, are the arguments to Parrot opcodes going to be 32 or 64 bit? If 32 bit, is there going to be any way of loading a 64 bit constant? This reminds me of something I've been meaning to ask: Is Parrot byte code intended to be network-portable? - Damien
Re: patch: assembly listings from assembler
On Thu, Sep 13, 2001 at 06:41:00PM -0400, Dan Sugalski wrote: At 01:42 PM 9/13/2001 -0700, Benjamin Stuhl wrote: Could we please get in the habit of adding a -c or a -u to our CVS diffs, just as we would with normal patches? Yes, please! All diffs posted to the list should be either -c or -u diffs. Both can be fed to patch, and both read far more easily than the plain diff output. The following lines, placed in ~/.cvsrc, make cvs work much better:

update -dP
diff -u

The -d option to update makes cvs check out newly-created directories; without it, it will silently ignore them. -P prunes empty directories, which compensates for the fact that directories can't be deleted. And the -u to diff (or -c) is just a good idea. :) - Damien
Patch: Common opcode_table parsing, take 2
Here's an updated version of my original patch, to account for recent changes in CVS. As before, this includes opcode-munging to let Parrot run on FreeBSD. - Damien

diff -u --new-file -r parrot.orig/Parrot/Opcode.pm parrot/Parrot/Opcode.pm
--- parrot.orig/Parrot/Opcode.pm	Wed Dec 31 16:00:00 1969
+++ parrot/Parrot/Opcode.pm	Mon Sep 10 23:52:35 2001
@@ -0,0 +1,86 @@
+package Parrot::Opcode;
+
+use strict;
+use Symbol;
+
+sub read_ops {
+my $file = @_ ? shift : "opcode_table";
+
+my $fh = gensym;
+open $fh, $file or die "$file: $!\n";
+
+my %opcode;
+my $count = 1;
+while (<$fh>) {
+    s/#.*//;
+    s/^\s+//;
+    chomp;
+    next unless $_;
+
+    my($name, @params) = split /\s+/;
+    if (@params && $params[0] =~ /^\d+$/) {
+        my $count = shift @params;
+        die "$file, line $.: opcode $name parameters don't match count\n"
+            if ($count != @params);
+    }
+
+    warn "$file, line $.: opcode $name redefined\n" if $opcode{$name};
+
+    $opcode{$name}{ARGS} = @params;
+    $opcode{$name}{TYPES} = \@params;
+    $opcode{$name}{CODE} = ($name eq "end") ? 0 : $count++;
+    $opcode{$name}{FUNC} = "Parrot_op_$name";
+
+    my $num_i = () = grep {/i/} @params;
+    my $num_n = () = grep {/n/} @params;
+    $opcode{$name}{RETURN_OFFSET} = 1 + $num_i + $num_n * 2;
+}
+
+return %opcode;
+}
+
+1;
+
+
+__END__
+
+=head1 NAME
+
+Parrot::Opcode - Read opcode definitions
+
+=head1 SYNOPSIS
+
+  use Parrot::Opcode;
+
+  %opcodes = Parrot::Opcode::read_ops();
+
+=head1 DESCRIPTION
+
+The read_ops() function parses the Parrot opcode_table file, and
+returns the contents as a hash. The hash key is the opcode name;
+values are hashrefs containing the following fields:
+
+=over
+
+=item CODE
+
+The opcode number.
+
+=item ARGS
+
+The opcode argument count.
+
+=item TYPES
+
+The opcode argument types, as an arrayref.
+
+=item FUNC
+
+The name of the C function implementing this op.
+
+=back
+
+read_ops() takes an optional argument: the file to read the opcode table
+from.
+
+=cut
diff -u --new-file -r parrot.orig/assemble.pl parrot/assemble.pl
--- parrot.orig/assemble.pl	Thu Sep 13 20:45:05 2001
+++ parrot/assemble.pl	Thu Sep 13 20:33:36 2001
@@ -5,6 +5,7 @@
 # Brian Wheeler ([EMAIL PROTECTED])

 use strict;
+use Parrot::Opcode;

 my $opt_c;
 if (@ARGV and $ARGV[0] eq "-c") {
@@ -25,32 +26,10 @@
 foreach (keys(%real_type)) {
     $sizeof{$_}=length(pack($pack_type{$real_type{$_}},0));
 }
-
-# get opcodes from guts.
-open GUTS, "interp_guts.h";
-my %opcodes;
-while (<GUTS>) {
-next unless /\tx\[(\d+)\] = ([a-z_]+);/;
-$opcodes{$2}{CODE} = $1;
-}
-close GUTS;
-
-# get opcodes and their arg lists
-open OPCODES, "opcode_table" or die "Can't get opcode table, $!/$^E";
-while (<OPCODES>) {
-next if /^\s*#/;
-chomp;
-s/^\s+//;
-next unless $_;
-my ($name, $args, @types) = split /\s+/, $_;
-my @rtypes=@types;
-@types=map { $_ = $real_type{$_}} @types;
-$opcodes{$name}{ARGS} = $args;
-$opcodes{$name}{TYPES} = [@types];
-$opcodes{$name}{RTYPES}=[@rtypes];
-}
-close OPCODES;
+
+# get opcodes
+my %opcodes = Parrot::Opcode::read_ops();

 # read source and assemble
@@ -134,8 +113,8 @@
 $pc+=4;
 foreach (0..$#args) {
-my($rtype)=$opcodes{$opcode}{RTYPES}[$_];
-my($type)=$opcodes{$opcode}{TYPES}[$_];
+my($rtype)=$opcodes{$opcode}{TYPES}[$_];
+my($type)=$real_type{$opcodes{$opcode}{TYPES}[$_]};
 if($rtype eq "I" || $rtype eq "N" || $rtype eq "P" || $rtype eq "S") {
     # its a register argument
     $args[$_]=~s/^[INPS](\d+)$/$1/i;
diff -u --new-file -r parrot.orig/build_interp_starter.pl parrot/build_interp_starter.pl
--- parrot.orig/build_interp_starter.pl	Thu Sep 13 20:45:05 2001
+++ parrot/build_interp_starter.pl	Thu Sep 13 20:36:14 2001
@@ -1,10 +1,9 @@
 #!/usr/bin/perl -w
 use strict;
+use Parrot::Opcode;

 open INTERP, ">interp_guts.h" or die "Can't open interp_guts.h, $!/$^E";
-open OPCODES, "opcode_table" or die "Can't open opcode_table, $!/$^E";
-
 print INTERP <<CONST;
 /*
  *
@@ -18,17 +17,9 @@
 #define BUILD_TABLE(x) do { \\
 CONST

-my $count = 1;
-while (<OPCODES>) {
-chomp;
-s/#.*$//;
-s/^\s+//;
-next unless $_;
-my($name) = split /\s+/;
-my $num = $count;
-$num = 0 if $name eq 'end';
-print INTERP "\tx[$num] = $name; \\\n";
-$count++ unless $name eq 'end';
+my %opcodes = Parrot::Opcode::read_ops();
+for my $name (sort {$opcodes{$a}{CODE} <=> $opcodes{$b}{CODE}} keys %opcodes) {
+print INTERP "\tx[$opcodes{$name}{CODE}] = $opcodes{$name}{FUNC}; \\\n";
 }

 print INTERP "} while (0);\n";
diff -u --new-file -r parrot.orig/disassemble.pl parrot/disassemble.pl
--- parrot.orig/disassemble.pl	Thu Sep 13 20:45:05 2001
+++ parrot/disassemble.pl	Thu Sep 13 20:37:47 2001
@@ -5,6 +5,7 @@
 # Turn a parrot bytecode
Patch: Common opcode_table parsing
The following patch moves all parsing of opcode_table into a
Parrot::Opcode module.  It also removes all parsing of interp_guts.h.
This patch incorporates my earlier patches to prefix all C opcode
functions with Perl_op_.

As best I can tell, everything works the same with the patch as it did
before--the assembler and disassembler both generate identical output,
and test_prog runs as well as before.  (Or better on FreeBSD, where it
stops core dumping. :)

          - Damien

diff -r --new-file -u parrot.orig/Parrot/Opcode.pm parrot/Parrot/Opcode.pm
--- parrot.orig/Parrot/Opcode.pm	Wed Dec 31 16:00:00 1969
+++ parrot/Parrot/Opcode.pm	Mon Sep 10 23:52:35 2001
@@ -0,0 +1,86 @@
+package Parrot::Opcode;
+
+use strict;
+use Symbol;
+
+sub read_ops {
+    my $file = @_ ? shift : "opcode_table";
+
+    my $fh = gensym;
+    open $fh, $file or die "$file: $!\n";
+
+    my %opcode;
+    my $count = 1;
+    while (<$fh>) {
+        s/#.*//;
+        s/^\s+//;
+        chomp;
+        next unless $_;
+
+        my($name, @params) = split /\s+/;
+        if (@params && $params[0] =~ /^\d+$/) {
+            my $count = shift @params;
+            die "$file, line $.: opcode $name parameters don't match count\n"
+                if ($count != @params);
+        }
+
+        warn "$file, line $.: opcode $name redefined\n" if $opcode{$name};
+
+        $opcode{$name}{ARGS} = @params;
+        $opcode{$name}{TYPES} = \@params;
+        $opcode{$name}{CODE} = ($name eq "end") ? 0 : $count++;
+        $opcode{$name}{FUNC} = "Parrot_op_$name";
+
+        my $num_i = () = grep {/i/} @params;
+        my $num_n = () = grep {/n/} @params;
+        $opcode{$name}{RETURN_OFFSET} = 1 + $num_i + $num_n * 2;
+    }
+
+    return %opcode;
+}
+
+1;
+
+__END__
+
+=head1 NAME
+
+Parrot::Opcode - Read opcode definitions
+
+=head1 SYNOPSIS
+
+  use Parrot::Opcode;
+
+  %opcodes = Parrot::Opcode::read_ops();
+
+=head1 DESCRIPTION
+
+The read_ops() function parses the Parrot opcode_table file, and
+returns the contents as a hash.  The hash key is the opcode name;
+values are hashrefs containing the following fields:
+
+=over
+
+=item CODE
+
+The opcode number.
+
+=item ARGS
+
+The opcode argument count.
+
+=item TYPES
+
+The opcode argument types, as an arrayref.
+
+=item FUNC
+
+The name of the C function implementing this op.
+
+=back
+
+read_ops() takes an optional argument: the file to read the opcode table
+from.
+
+=cut
diff -r --new-file -u parrot.orig/assemble.pl parrot/assemble.pl
--- parrot.orig/assemble.pl	Mon Sep 10 14:26:08 2001
+++ parrot/assemble.pl	Mon Sep 10 23:51:34 2001
@@ -3,6 +3,7 @@
 # assemble.pl - take a parrot assembly file and spit out a bytecode file
 
 use strict;
+use Parrot::Opcode;
 
 my(%opcodes, %labels);
@@ -12,23 +13,7 @@
 );
 my $sizeof_packi = length(pack($pack_type{i},1024));
 
-open GUTS, "interp_guts.h";
-my $opcode;
-while (<GUTS>) {
-    next unless /\tx\[(\d+)\] = ([a-z_]+);/;
-    $opcodes{$2}{CODE} = $1;
-}
-
-open OPCODES, "opcode_table" or die "Can't get opcode table, $!/$^E";
-while (<OPCODES>) {
-    next if /^\s*#/;
-    chomp;
-    s/^\s+//;
-    next unless $_;
-    my ($name, $args, @types) = split /\s+/, $_;
-    $opcodes{$name}{ARGS} = $args;
-    $opcodes{$name}{TYPES} = [@types];
-}
+%opcodes = Parrot::Opcode::read_ops();
 
 my $pc = 0;
 my @code;
diff -r --new-file -u parrot.orig/build_interp_starter.pl parrot/build_interp_starter.pl
--- parrot.orig/build_interp_starter.pl	Mon Sep 10 14:26:09 2001
+++ parrot/build_interp_starter.pl	Mon Sep 10 23:53:26 2001
@@ -1,10 +1,9 @@
 #!/usr/bin/perl -w
 
 use strict;
+use Parrot::Opcode;
 
 open INTERP, ">interp_guts.h" or die "Can't open interp_guts.h, $!/$^E";
-open OPCODES, "opcode_table" or die "Can't open opcode_table, $!/$^E";
-
 print INTERP <<CONST;
 /*
 *
@@ -18,16 +17,8 @@
 #define BUILD_TABLE(x) do { \\
 CONST
 
-my $count = 1;
-while (<OPCODES>) {
-    chomp;
-    s/#.*$//;
-    s/^\s+//;
-    next unless $_;
-    my($name) = split /\s+/;
-    my $num = $count;
-    $num = 0 if $name eq 'end';
-    print INTERP "\tx[$num] = $name; \\\n";
-    $count++ unless $name eq 'end';
+my %opcodes = Parrot::Opcode::read_ops();
+for my $name (sort {$opcodes{$a}{CODE} <=> $opcodes{$b}{CODE}} keys %opcodes) {
+    print INTERP "\tx[$opcodes{$name}{CODE}] = $opcodes{$name}{FUNC}; \\\n";
 }
 
 print INTERP "} while (0);\n";
diff -r --new-file -u parrot.orig/disassemble.pl parrot/disassemble.pl
--- parrot.orig/disassemble.pl	Mon Sep 10 14:45:33 2001
+++ parrot/disassemble.pl	Mon Sep 10 23:57:36 2001
@@ -7,6 +7,7 @@
 use strict;
 
 my(%opcodes, @opcodes);
+use Parrot::Opcode;
 
 my %unpack_type;
 %unpack_type = (i => 'l',
@@ -16,28 +17,10 @@
 n => 8,
 );
 
-open GUTS, "interp_guts.h";
-my $opcode;
-while (<GUTS>) {
-    next unless /\tx\[(\d+)\] = ([a-z_]+);/;
-    $opcodes{$2}{CODE} = $1;
-}
-
-open OPCODES, "opcode_table" or die "Can't get opcode table, $!/$^E";
-while (<OPCODES>) {
-    next if /^\s*#/;
-    s/^\s+//;
Re: Speaking of namespaces...
On Mon, Sep 10, 2001 at 06:58:23PM -0400, Dan Sugalski wrote:
> At 03:52 PM 9/10/2001 -0700, Damien Neil wrote:
> > Parrot fails to work in very obscure ways on FreeBSD.  After some
> > poking around, I tracked the problem to the "end" op--this appears to
> > conflict with something inside libc.  Renaming the op fixes the
> > problem.
>
> Ah, that's what was killing the build on Nat's machine. Patch, by chance?

The following quick-and-dirty patch appears to work.  This prefixes all
opcode functions with Parrot_op_.  I'd have made the prefix configurable,
but the opcode generation is spread across three different files.

(Aside: What's the best way to generate a useful patch with cvs?  The
following comes from "cvs -q diff -u".)

          - Damien

Index: build_interp_starter.pl
===================================================================
RCS file: /home/perlcvs/parrot/build_interp_starter.pl,v
retrieving revision 1.2
diff -u -u -r1.2 build_interp_starter.pl
--- build_interp_starter.pl	2001/09/10 21:26:09	1.2
+++ build_interp_starter.pl	2001/09/10 23:07:08
@@ -27,7 +27,7 @@
 my($name) = split /\s+/;
 my $num = $count;
 $num = 0 if $name eq 'end';
-print INTERP "\tx[$num] = $name; \\\n";
+print INTERP "\tx[$num] = Parrot_op_$name; \\\n";
 $count++ unless $name eq 'end';
 }
 print INTERP "} while (0);\n";
Index: make_op_header.pl
===================================================================
RCS file: /home/perlcvs/parrot/make_op_header.pl,v
retrieving revision 1.3
diff -u -u -r1.3 make_op_header.pl
--- make_op_header.pl	2001/09/10 21:26:09	1.3
+++ make_op_header.pl	2001/09/10 23:07:08
@@ -6,7 +6,7 @@
 next if /^\s*#/ or /^\s*$/;
 chomp;
 ($name, undef) = split /\t/, $_;
-print "IV *$name(IV *, struct Perl_Interp *);\n";
+print "IV *Parrot_op_$name(IV *, struct Perl_Interp *);\n";
 }
 BEGIN {
Index: process_opfunc.pl
===================================================================
RCS file: /home/perlcvs/parrot/process_opfunc.pl,v
retrieving revision 1.3
diff -u -u -r1.3 process_opfunc.pl
--- process_opfunc.pl	2001/09/10 21:26:09	1.3
+++ process_opfunc.pl	2001/09/10 23:07:08
@@ -105,7 +105,7 @@
 my $line = shift;
 my ($name) = $line =~ /AUTO_OP\s+(\w+)/;
 
-print OUTPUT "IV *$name(IV cur_opcode[], struct Perl_Interp *interpreter) {\n";
+print OUTPUT "IV *Parrot_op_$name(IV cur_opcode[], struct Perl_Interp
+*interpreter) {\n";
 
 return($name, "return cur_opcode + " . $opcode{$name}{RETURN_OFFSET} . ";\n}\n");
 }
@@ -114,7 +114,7 @@
 my $line = shift;
 my ($name) = $line =~ /MANUAL_OP\s+(\w+)/;
 
-print OUTPUT "IV *$name(IV cur_opcode[], struct Perl_Interp *interpreter) {\n";
+print OUTPUT "IV *Parrot_op_$name(IV cur_opcode[], struct Perl_Interp
+*interpreter) {\n";
 print OUTPUT "IV return_offset = 1;\n";
 
 return($name, "return cur_opcode + return_offset;\n}\n");
 }
Index: test.pbc
===================================================================
RCS file: /home/perlcvs/parrot/test.pbc,v
retrieving revision 1.2
diff -u -u -r1.2 test.pbc
Binary files /tmp/cvsqe7MSGr3cy and test.pbc differ
Re: Speaking of namespaces...
On Mon, Sep 10, 2001 at 04:04:20PM -0700, Damien Neil wrote:
> The following quick-and-dirty patch appears to work.  This prefixes all
> opcode functions with Parrot_op_.  I'd have made the prefix
> configurable, but the opcode generation is spread across three
> different files.

Oops--that breaks the assembler.  This patch fixes the assembler to work
with the prior patch.

          - Damien

Index: assemble.pl
===================================================================
RCS file: /home/perlcvs/parrot/assemble.pl,v
retrieving revision 1.6
diff -u -u -r1.6 assemble.pl
--- assemble.pl	2001/09/10 21:26:08	1.6
+++ assemble.pl	2001/09/10 23:43:30
@@ -15,7 +15,7 @@
 open GUTS, "interp_guts.h";
 my $opcode;
 while (<GUTS>) {
-next unless /\tx\[(\d+)\] = ([a-z_]+);/;
+next unless /\tx\[(\d+)\] = Parrot_op_([a-z_]+);/;
 $opcodes{$2}{CODE} = $1;
 }
Re: Speaking of namespaces...
On Mon, Sep 10, 2001 at 08:48:48PM -0400, Dan Sugalski wrote: At 04:56 PM 9/10/2001 -0700, Brent Dax wrote: This patch seems to work on the FreeBSD box I have access to. Now to figure out what's causing all those 'use of uninitialized value at assembler.pl line 81' messages... It's the blank lines in opcode_table. The assembler (and disassembler) at some point didn't grok 'em. Patches have been applied, but you might've checked out before that happened. No, in this case, it's my fault. I didn't realize the assembler reads op name/number mappings out of interp_guts.h, so my patch broke the assembler. I'm thinking of writing something to generate a Parrot::Opcode.pm module, so code doesn't need to parse opcode_table and interp_guts.h. Sound reasonable? - Damien
Re: Speaking of namespaces...
On Mon, Sep 10, 2001 at 08:56:52PM -0400, Dan Sugalski wrote: I'm thinking of writing something to generate a Parrot::Opcode.pm module, so code doesn't need to parse opcode_table and interp_guts.h. Sound reasonable? Yes, please do. I knew we needed one the second time I needed to parse opcode_table, I just haven't stopped long enough to be lazy and still program. (In those cases I came to a full stop...) OK, I'll do that sometime tonight. Should it parse opcode_table, or should it be generated with the contents of opcode_table? - Damien
Re: Should the op dispatch loop decode?
On Tue, Jun 12, 2001 at 06:12:35PM -0400, Dan Sugalski wrote: At the moment I'm leaning towards the functions doing their own decoding, as it seems likely to be faster. (Though we'd be duplicating the decoding logic everywhere, and bigger's reasonably bad) Possibly mandating shadow functions for each opcode function, where the shadow does the decoding and calls the real functions which take real things rather than our registers. Opinions anyone? I'd say that choosing the more complicated way because it seems to be faster is almost always a bad idea. What was that quote about premature optimization? A major advantage to putting the decoding in the main loop to start with, at least, is that it makes it easier to perform major surgery on the overall opcode design without needing to touch every op. I don't know how likely such surgery is. - Damien
Re: More character matching bits
On Tue, Jun 12, 2001 at 06:44:02PM -0400, Dan Sugalski wrote:
> While that's true, KATAKANA LETTER A and HIRAGANA LETTER A are also
> referring to distinct things. (Though arguably not as distinct as
> either with LATIN CAPITAL A) If we do one, why not the other? I'm
> perfectly happy with an answer that starts "because...", but we should
> have an answer.

Because anything which treats KATAKANA LETTER A and LATIN CAPITAL A as
the same thing needs to treat KATAKANA LETTER KA and the sequence
(LATIN CAPITAL K, LATIN CAPITAL A) as the same thing.

Because this makes as much sense as allowing WHITE SMILING FACE to match
(COLON, HYPHEN, RIGHT PARENTHESIS).

Because the logical extension of this is to allow a sequence of Kanji or
other ideographic characters to match their Romanized representation (or
vice versa), which is a reasonable approximation of impossible.

> We probably also ought to answer the question "How accommodating to
> non-latin writing systems are we going to be?" It's an uncomfortable
> question, but one that needs asking. Answering by Larry, probably, but
> definitely asking. Perl's not really language-neutral now (If you think
> so, go wave locales at Jarkko and see what happens... :) but all our
> biases are sort of implicit and un (or under) stated. I'd rather they
> be explicit, though I know that's got problems in and of itself.

A fair question, and not one I can answer.  I can say that I feel that
providing a mechanism for Hiragana characters to match Katakana and
vice-versa is about as useful for a person doing Japanese text
processing as case-insensitive matching is for a person working with
English.

          - Damien
Re: More character matching bits
On Wed, Jun 13, 2001 at 01:22:32AM +0100, Simon Cozens wrote:
> I'd say it was about as useful as providing a regexp option to
> translate the search term into French and try that instead.[1] Handy,
> possibly. Essential? No. Something that should be part of the core?
> I'll leave that for you to decide.

I believe that my initial analogy is more accurate than yours.  The
ability to match Hiragana as Katakana and vice-versa is almost identical
conceptually to the ability to perform case insensitive matches on
English text.

> What next, you want to maybe add Japanese and Chinese readings for all
> the kanji and convert between them too?

That would be *considerably* more useful. :)

> [1] katakana signifies "The following text is not in Japanese", except
> when it doesn't.

This is literally accurate, but completely content-free.  The variety of
ways in which Hiragana and Katakana are used in Japanese are as
disparate as the ways in which italic and non-italic characters are used
in English.  Katakana is frequently used to write words with additional
emphasis, to convey the impression of a sentence being spoken with an
accent, to write the on-youmi of a Kanji, to write foreign loan words,
and to write onomatopoeia.  (This is not a complete list.)

          - Damien
Re: More character matching bits
On Wed, Jun 13, 2001 at 02:15:16AM +0100, Simon Cozens wrote: Or we could keep it out of core. It's up to you, really. No, it isn't. It's up to Larry, or to whoever gets the regex pumpkin. I'm withdrawing from this discussion: My intent was to clarify exactly why someone might want to treat Katakana and Hiragana as equivalent for matching purposes, not to take a stand on what features Perl should include or how these should be implemented. to write the on-youmi of a Kanji, Hrm, no, not usually; furigana are almost always hiragana, and learner's textbooks - bah, they're not real Japanese. :) I believe you are confused; kun-youmi and on-youmi have nothing to do with furigana. - Damien
Re: Unicode handling
On Tue, Mar 27, 2001 at 12:38:23PM -0500, Dan Sugalski wrote:
> I'm afraid this isn't what I'd normally think of--ord to me returns the
> integer value of the first code point in the string. That does mean
> that "A" is different for ASCII and EBCDIC, but that's just One Of
> Those Things.

My personal take is that ord and chr should be exact inverses of each
other.  chr(ord($c)) should produce the same value as $c.  (Albeit
possibly in a different internal encoding.)

I just trawled through my installed modules, looking for existing uses
of ord to support my argument.  Unfortunately for me, my conclusion is
that pretty much any code which uses ord currently will break in a world
with multibyte characters.

I do think that it would be worthwhile to come up with some examples of
intended uses of ord.  In particular, I'd be very interested in seeing
any cases where you would want it to return the value of a code point in
anything other than the current default encoding.

          - Damien
Re: Unicode handling
On Mon, Mar 26, 2001 at 11:32:46AM -0500, Dan Sugalski wrote: At 05:09 PM 3/23/2001 -0800, Damien Neil wrote: So the results of ord are dependent on a global setting for "current character set" or some such, not on the encoding of the string that is passed to it? Nope, ord is dependent on the string it gets, as those strings know what their encoding is. chr is the one dependent on the current default encoding. So $c = chr(ord($c)) could change $c? That seems odd. In what other circumstances will the encoding of a string be visible to the programmer? Not when printing the string to a file handle, I would think -- that should be controlled by the encoding on the handle. Are there any other cases where encoding matters? - Damien
Re: Unicode handling
On Mon, Mar 26, 2001 at 08:37:05PM +0000, [EMAIL PROTECTED] wrote:
> > If ord is dependent on the encoding of the string it gets, as Dan was
> > saying, then ord($e) is 0x81,
>
> It could still be 0x81 (from ebcdic) with the encoding carried along
> with the _number_ if we thought that worth the trouble.

I'm going to go away and whimper in pain for a bit, now.

"I thought chr(0x61) was 'a'."
"It is, but that's an EBCDIC number."

          - Damien
Re: Unicode handling
On Fri, Mar 23, 2001 at 12:38:04PM -0500, Dan Sugalski wrote:
>    while (<IN>) {
>      $count++ if /bar/;
>      print OUT $_;
>    }

I would find it surprising for this to have different output than input.
Other people's mileage may vary.

In general, however, I think I would prefer to be required to explicitly
normalize my data (via a function, pragma, or option set on a
filehandle) than have data change unexpectedly behind my back.

          - Damien
Re: Unicode handling
On Fri, Mar 23, 2001 at 06:16:58PM -0500, Dan Sugalski wrote: At 11:09 PM 3/23/2001 +, Simon Cozens wrote: For instance, chr() will produce Unicode codepoints. But you can pretend that they're ASCII codepoints, it's only the EBCDIC folk that'll get hurt. I hope and suspect there'll be an equivalent of "use bytes" which makes chr(256) either blow up or wrap around. Actually no it won't. If the string you're doing a chr on is tagged as EBCDIC, you'll get the EBCDIC value. Yes, it does mean that this: chr($foo) == chr($bar); could evaluate to false if one of the strings is EBCDIC and the other isn't. Odd but I don't see a good reason not to. Otherwise we'd want to force everything to Unicode, and then what do we do if one of the strings is plain binary data? Are you thinking of ord rather than chr? I can't seem to make the above make sense otherwise. chr takes a number, not a string as its argument... Your initial description of character set handling didn't mention that different strings can be tagged as having different encodings, and didn't cover the implications of this. Could you give a list of the specific occasions when the encoding of a string would be visible to a programmer? - Damien
Re: Unicode handling
On Fri, Mar 23, 2001 at 06:31:13PM -0500, Dan Sugalski wrote:
> > Err, perhaps I'm being dumb here - but surely $foo and $bar aren't
> > typed strings, they're just numbers (or strings which match
> > /^\d+$/) ???
>
> D'oh! Too much blood in my caffeine stream. Yeah, I was thinking of
> ord. chr will emit a character of the type appropriate to the current
> default string context. The default context will probably be settable
> at compile time, or be the platform native type, alterable somehow.
> Probably "use blah;" but that's a language design issue. :)

Ah, this answers the puzzlement in the message I just sent:

> So the results of ord are dependent on a global setting for "current
> character set" or some such, not on the encoding of the string that is
> passed to it?

          - Damien
Re: Please shoot down this GC idea...
On Wed, Feb 14, 2001 at 11:26:00AM -0500, Dan Sugalski wrote:
> At 11:03 AM 2/14/2001 -0500, Buddha Buck wrote:
> > [Truly profound amount of snippage]
> >
> > I'm sure this idea has flaws.  But it's an idea.  Tell me what I'm
> > missing.
>
> You've pretty much summed up the current plan.

I have a strong suspicion that this approach will lead to confusing,
hard-to-find bugs in Perl programs.  (That is, programs written in Perl,
rather than perl-the-program.)  Consider:

  sub do_stuff { ... }

  {
    my $fh = IO::File->new("file");
    do_stuff($fh);
  }

In this code, the compiler can determine that $fh has no active
references at the end of the block, and $fh->DESTROY will be called.
(The compiler can flag do_stuff() as not preserving any references to
its argument.)

Now consider:

  {
    my $fh = IO::File->new("file");
    do_stuff($fh);
  }

  sub do_stuff { ... }

In this case, the compiler hasn't seen do_stuff() when it compiles the
block in which $fh is instantiated.  Unless it performs multiple passes,
it won't be able to determine that do_stuff() does not preserve a
reference to $fh, and it won't be able to deterministically call
$fh->DESTROY at the end of the block.

This is purest action-at-a-distance.  To the programmer, there is no
difference between the two blocks.

This can occur in even more confusing fashions: consider a pair of
recursive subs, or autoloaded subs, or method calls.  For example:

  sub foo {
    my Dog $spot = shift;
    my $fh = IO::File->new("file");
    $spot->eat_homework($fh);
  }

Even with the object type declared, the compiler can make no assumptions
about whether a reference to $fh will be held or not.  Perhaps the
Poodle subclass of Dog will hold a reference, and the Bulldog subclass
will not.

I think that there will be few cases when compile-time analysis will
identify places where deterministic finalization can occur.  Worse, any
programmer who attempts to code for these cases will leave herself open
for action-at-a-distance code breakage.

Maybe I'm missing something.  I hope I am.
I think, however, that Perl will have to decide between deterministic destruction of non- circular data structures, and a modern garbage collector. - Damien
Re: Garbage collection (was Re: JWZ on s/Java/Perl/)
[trimming distribution to -internals only]

On Wed, Feb 14, 2001 at 07:44:53PM +0000, Simon Cozens wrote:
>     package NuclearReactor::CoolingRod;
>
>     sub new {
>         Reactor->decrease_core_temperature();
>         bless {}, shift
>     }
>
>     sub DESTROY {
>         Reactor->increase_core_temperature();
>     }

A better design:

  package NuclearReactor::CoolingRod;

  sub new {
      Reactor->decrease_core_temperature();
      bless { inserted => 1 }, shift;
  }

  sub insert {
      my $self = shift;
      return if $self->{inserted};
      Reactor->decrease_core_temperature();
      $self->{inserted} = 1;
  }

  sub remove {
      my $self = shift;
      return unless $self->{inserted};
      Reactor->increase_core_temperature();
      $self->{inserted} = 0;
  }

  sub DESTROY {
      my $self = shift;
      $self->remove;
  }

Using object lifetime to control state is almost never a good idea, even
if you have deterministic finalization.  A much better approach is to
have methods which allow holders of the object to control it, and a
finalizer (DESTROY method) which cleans up only if necessary.

A more real-world case is IO::Handle.  If you want to close a handle
explicitly, you call $handle->close, not $handle->DESTROY.  The concept
of closing a handle is orthogonal to the concept of the object ceasing
to exist.  User code can close a handle.  It can't make the object go
away -- only the garbage collector can do that.

I think that the biggest problem with DESTROY is that it is misnamed.
The name makes people think that $obj->DESTROY should destroy an object,
which it doesn't.  It's rather too late to rename it to ATDESTRUCTION or
WHENDESTROYED or FINALIZE, however.

          - Damien
Re: Core data types and lazy evaluation
On Wed, Dec 27, 2000 at 09:27:05PM -0500, Dan Sugalski wrote:
> While we can evaluate the list lazily, that doesn't mean that's what
> the language guarantees. Right now it's perfectly OK to do:
>
>    $foo = ($bar, $baz, $xyzzy);
>
> and if $bar and $baz are tied, that'll execute their FETCH methods. If
> that's expected (and it is with functions, though arguably function
> calls and fetches of active data are not the same thing) then we can't
> be lazy. I'd *like* to be, since it means we can optimize away things
> or defer them until never, but... (Plus it makes the dataflow analysis
> a darned sight easier, since every load and store from a variable
> wouldn't potentially be a function call...)

I'd view the function call case as being conceptually equivalent to:

  @_ = ($bar, $baz, $xyzzy);
  &snrub;

In this case, you are assigning the arguments to an array (@_), so it
makes sense for them all to be evaluated.  (Or not, depending on your
opinions of how list-to-array assignment should work. :)

In the case of assigning a list to a scalar, or a list of scalars, I
still think that it makes sense for values not assigned to not be
evaluated.  Consider this case:

  ($two, $four) = (@primes, $junk);

@primes is a lazy array containing all the primes.  Here, you would
expect only the first two values of @primes to be evaluated.  If
evaluation stops with the second element of @primes, however, why would
$junk be evaluated?

> Also one side-effect of this, if we allow it, is to have a list
> masquerade (under the hood, at least) as another variable type. We
> could, say, see this:
>
>    @foo = (@bar, @baz);
>
> but actually defer evaluating the list and doing the assignment until
> either @foo, @bar, or @baz is accessed. (Potentially holding off even
> further--things like scalar() on @foo, for example, wouldn't require
> finishing the assignment, and neither would something simple like
> $foo[12])

I dislike this.  It means that an exception occurring while evaluating
$bar[0] can be deferred indefinitely.

> This is a good argument for tagging tied variables as active data,
> since that'd require things be evaluated immediately.

If this isn't done for tied variables, I withdraw my objection.  I'm
still a bit dubious about the cost/benefit tradeoff here, though.  Hmm.

  # @primes = (2, 3, 5, 7, ...)
  @foo = (1, @primes);

I think the above constitutes an argument for something.  I'm not
certain what. :)

> > I like lazy evaluation, but I don't think it should come at the
> > expense of early detection of errors and comprehensible garbage
> > collection.
>
> We're trying really, *really* hard to decouple GC with object
> destruction. The latter should be reasonably understandable (though
> not necessarily deterministic) while the former should be considered
> Dark Magic. :)

My specific concern is that it shouldn't be easy to accidentally leave
very large data structures lying around without obvious references to
them.  GC need not be deterministic, but it should be possible to avoid
leaking memory without resorting to animal sacrifice. :)

          - Damien
Re: standard representations
On Wed, Dec 27, 2000 at 10:46:03AM -0500, Philip Newton wrote:
> So a native int could be 8 bits big? I think that's allowed according
> to ANSI.

ANSI/ISO C states:

  char <= short <= int <= long

  char  >= 8 bits
  short >= 16 bits
  int   >= 16 bits
  long  >= 32 bits

C99 adds "long long", which is >= long, and is at least 64 bits large.

I'd be in favor of defining Perl's "native int" type to be at least 32
bits long.  I would recommend against using the compiler's default int
type in all cases, as there are compilers which define int as 16 bits
for backwards compatibility reasons.  (As opposed to 16 bits being the
native word size of the architecture.)

          - Damien
Re: standard representations
On Wed, Dec 27, 2000 at 02:06:45PM -0500, Hildo Biersma wrote:
> I don't recall the bit sizes to be in ANSI C. Which paragraph is that
> in? You need to deduce the bit sizes, as the standard doesn't speak in
> terms of bits.

I don't have a copy of C89 available, but section 5.2.4.2.1 defines the
sizes of the various integers:

  -- minimum value for an object of type short int
     SHRT_MIN    -32767    // -(2 ** 15 - 1)

  -- maximum value for an object of type short int
     SHRT_MAX    +32767    // 2 ** 15 - 1

...and so forth.

> Even so, the fact that a standard may declare it, doesn't make it
> true. I would expect embedded targets to differ from this.

I seriously doubt Perl will ever run on an architecture too small to
provide a 32-bit type.  I am certain it will never run on an
architecture with no 16-bit type.

Furthermore, the fact that the standard declares a thing DOES make it
true.  If Perl is to be written in C, it makes sense that it require a
compiler which at least pretends to conform to ANSI/ISO C.  This is
hardly an onerous restriction -- most compilers are compliant, with the
exception of compilers for very-small embedded systems (ones where the
total memory available is measured in bytes) and antiquated curiosities
like the SunOS 4 compiler.  Can you name specific compilers which fail
to conform to the standard in this (or other) regards, which Perl will
need to support?

> That's eschewing efficiency to make sensible minimum guarantees. I'd
> personally rather see the C compiler's native types be used, because
> that's what the platform can do _efficiently_. Using larger types than
> that harms perl's ability to perform well on small platforms.

I am deeply dubious about Perl's ability to perform well on 80286 (or
equivalent capacity) machines under any circumstances.

          - Damien
Re: standard representations
On Wed, Dec 27, 2000 at 02:51:57PM -0500, Hildo Biersma wrote:
> This seems likely, but we must take care not to take these assumptions
> too far. For example, (and this is not related to this discussion),
> pointers may well be smaller than integers (MVS defines 32-bit ints
> and 31-bit pointers)

This is exactly the reason why standards are important.  An architecture
with 32-bit ints and 31-bit pointers is completely valid, and it is
important to not write code which assumes that ints and pointers are
interchangeable.

> I have far less trust in the standards than you have. Having said
> that, I can't actually name non-compliant compilers, so you're quite
> likely to be right.

Most compilers will violate the standard in certain small ways; complete
conformance is an ideal rarely (if ever) reached.  Few compilers (and
none, in my experience, of any quality) will commit gross violations
such as getting the guaranteed integer sizes wrong.

          - Damien
Re: Core data types and lazy evaluation
On Wed, Dec 27, 2000 at 05:17:33PM -0500, Dan Sugalski wrote:
> The part I'm waffling on (and should ultimately punt to Larry) is what
> to do with lazy data, and what exactly counts as lazy data anyway. For
> example, tied variables certainly aren't passive data, but should they
> be evaluated if they aren't used? If you do this:
>
>    ($foo, $bar) = (@baz, "12", 15, $some_tied_scalar);
>
> should the FETCH method of $some_tied_scalar be called
> unconditionally, even though we don't use it? (I'd argue yes, but
> prefer no... :)

I would argue no.  If the list is evaluated lazily, I'd only expect the
scalar FETCH method to be called when needed, not unconditionally.

> Also one side-effect of this, if we allow it, is to have a list
> masquerade (under the hood, at least) as another variable type. We
> could, say, see this:
>
>    @foo = (@bar, @baz);
>
> but actually defer evaluating the list and doing the assignment until
> either @foo, @bar, or @baz is accessed. (Potentially holding off even
> further--things like scalar() on @foo, for example, wouldn't require
> finishing the assignment, and neither would something simple like
> $foo[12])

I dislike this.  It means that an exception occurring while evaluating
$bar[0] can be deferred indefinitely.

I'm also thinking that there could be some really odd interactions with
GC...if I write

  {
    my @bar = some_creation_function();
    @foo = (@bar);
  }

I would expect @bar to be GCd (or at least become GC-able) upon exit
from the block.  With a scheme as you describe, @bar would need to hang
around for the lifetime of @foo.

I like lazy evaluation, but I don't think it should come at the expense
of early detection of errors and comprehensible garbage collection.

Other than that, I like what you describe.

          - Damien
Re: The external interface for the parser piece
On Mon, Nov 27, 2000 at 05:29:36PM -0500, Dan Sugalski wrote: int perl6_parse(PerlInterp *interp, void *source, int flags, void *extra_pointer); Count me in with the people who prefer: int perl6_parse(PerlInterp *interp, PerlIO *io); I understand the desire to reduce the number of API bits the external user needs to know about, but I think that the non-PerlIO API will lead to more complexity than it removes. Assuming the non-PerlIO interface is used, however, I believe there is a problem with the PERL_GENERATED_SOURCE option. ANSI/ISO C does not guarantee that a function pointer may be stored in a void*. I would suggest that Perl's external APIs, at the very least, should conform to standard C. - Damien