RE: GC, exceptions, and stuff
I've checked with some Sun folks. My understanding is that if you don't do a list of what I'd consider obviously stupid things like:
*) longjmp out of the middle of an interrupt handler
*) longjmp across a system call boundary (user-system-user and the inner jumps to the outer)
*) Expect POSIX's dead-stupid mutexes to magically unlock
*) Share jump destinations amongst threads
*) Use the original Solaris thread implementation in general
then you should be safe.

I think we have concluded that we only set up flags inside signal handlers, so we don't need sigsetjmp/siglongjmp at all. I think we'll be safe using longjmp as a C-level exception handler. I'm right now trying to figure out whether it's a good thing to do or not. (I'd like to unify C and Parrot level exceptions if I can.)

That is my point. Even if libc does not have a thread-safe longjmp, we can easily make one ourselves using assembly code.

Hong
RE: GC, exceptions, and stuff
Actually I'd been given dire warnings from some of the Solaris folks: Don't use setjmp with threads! I've since gotten details, and it's really: Don't use setjmp with threads *and* do Stupid Things.

I used to be at Sun. I knew those warnings too. If we use longjmp carefully, we can make it work. In the worst case, we write our own version.

Hong
RE: GC, exceptions, and stuff
I used to be at Sun. I knew those warnings too. If we use longjmp carefully, we can make it work. In the worst case, write our own version. ...Or we could use setcontext/getcontext, could we not?

The setcontext/getcontext pair will be much worse than setjmp/longjmp. They are more platform specific than longjmp, and they don't work well inside signal handlers, just like longjmp. When I was working on the HotSpot JVM, we had some problems with getcontext. They work 99.99% of the time; we added many workarounds for the 0.01% of cases. I believe the Solaris guys have been improving the code; I am not sure of the current status.

Hong
RE: GC, exceptions, and stuff
When I was working on the HotSpot JVM, we had some problems with getcontext. They work 99.99% of the time; we added many workarounds for the 0.01% of cases. I believe the Solaris guys have been improving the code; I am not sure of the current status. Was that inside of a signal handler or just in general usage?

It was inside a signal handler.

Hong
RE: GC, exceptions, and stuff
Okay, i've thought things over a bit. Here's what we're going to do to deal with infant mortality, exceptions, and suchlike things. Important given: We can *not* use setjmp/longjmp. Period. Not an option--not safe with threads. At this point, having considered the alternatives, I wish it were otherwise but it's not. Too bad for us.

I think this statement is not very accurate. The real problem is that setjmp/longjmp does not work well inside signal handlers. A thread-package-compatible setjmp/longjmp can easily be implemented in assembly code; it does not require access to any private data structures. Note that Microsoft Windows Structured Exception Handling works well with threads and signals; the assembly code of __try will show you how to do it.

However, signal-compatibility will be very difficult. It requires access to the ucontext, and most thread packages cannot provide a 100% correct ucontext for a signal. (The thread package may have the right info, but the ucontext parameter may not have the info.)

My basic suggestion is that if we need convenient and fast C-based exception handling, we can write our own setjmp/longjmp in assembly code. The functionality would be exported as magic macros, such as:

    TRY {
        ...
    } CATCH (EBADF) {
        ...
    } CATCH (ENOMEM) {
        ...
    } END;

Hong
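A minimal sketch of what that macro layer could look like if built on the standard setjmp/longjmp (the proposal above would swap a hand-written assembly pair in underneath, but the macros would read the same). Everything here is illustrative: the exc_frame chain, the THROW helper, and the GCC-style __thread slot are assumptions, not existing Parrot code, and details such as re-throwing unmatched errors and volatile locals are glossed over.

    #include <setjmp.h>
    #include <errno.h>

    /* one chain of active handlers per thread (GCC __thread assumed) */
    typedef struct exc_frame {
        jmp_buf           buf;
        struct exc_frame *prev;
    } exc_frame;

    static __thread exc_frame *exc_top;

    #define TRY       { exc_frame _f; _f.prev = exc_top; exc_top = &_f; \
                        switch (setjmp(_f.buf)) { case 0:
    #define CATCH(e)    exc_top = _f.prev; break; case (e): exc_top = _f.prev;
    #define END         exc_top = _f.prev; } }
    #define THROW(e)    longjmp(exc_top->buf, (e))   /* (e) must be non-zero */

    /* usage, mirroring the example above */
    void demo(void)
    {
        TRY {
            THROW(ENOMEM);
        } CATCH (EBADF) {
            /* handle the bad descriptor */
        } CATCH (ENOMEM) {
            /* handle out-of-memory */
        } END;
    }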
RE: GC, exceptions, and stuff
The thread-package-compatible setjmp/longjmp can be easily implemented using assembly code. It does not require access to any private data structures. Note that Microsoft Windows Structured Exception Handling works well with threads and signals. The assembly code of __try will show you how to do it. Yup, and we can use platform-specific exception handling mechanisms as well, if there are any. Except...

The stack unwinding is very basic; that is why we have setjmp/longjmp. Even though it is CPU specific, it requires only a very small piece of asm code, much less than JIT. BTW, JIT needs a similar kind of functionality, otherwise JIT will not be able to handle exceptions very fast. It will be very awkward to check every null pointer and every function return.

However, signal-compatibility will be very difficult. It requires access to the ucontext, and most thread packages cannot provide a 100% correct ucontext for a signal. (The thread package may have the right info, but the ucontext parameter may not have the info.) You hit this. And we can't universally guarantee that it'll work, either.

Parrot has to handle signals, such as SIGSEGV. I believe we have to solve this problem, no matter whether we use setjmp/longjmp as the general exception handling. In general, most libc functions do not work well inside signal handlers.

My basic suggestion is if we need convenient and fast C-based exception handling, we can write our own setjmp/longjmp in assembly code. The functionality will be exported as magic macros, such as... If we're going to do this, and believe me I dearly want to, we're going to be yanking ourselves out a bunch of levels. We'll be setting the setjmp in runops.c just outside the interpreter loop, and yank ourselves way the heck out. It's that multi-level cross-file jumping that I really worry about.

The multi-level jump should not be a problem inside Parrot code itself. The GC discipline should have handled the problem already.

1) If the Parrot code allocates anything that cannot be handled by the GC, it must set up an exception handler to release it, see sample:

    void * mem = NULL;
    TRY {
        mem = malloc(sizeof(foo));
    } FINALLY {
        free(mem);
    } END;

2) If the Parrot code allocates anything that is finalizable, there is no need to release it explicitly. When the object is no longer referenced, the next GC will finalize it. We can still use a TRY block to enforce cleanup in a timely fashion.

However, we cannot use setjmp/longjmp (even a Parrot-specific version) to unwind non-Parrot frames. If a third-party C application calls Parrot_xxx, the Parrot_xxx should catch any exception, translate it into an error code, and return it.

Implementing a Parrot-specific version of setjmp/longjmp will be trivial compared to the complexity of JIT and GC. By the time we have solved JIT, GC, threading, and signal handling, the problems with setjmp/longjmp will have been solved as well. But if we only want a simple interpreter solution, there is no need to take on this additional complexity.

Hong
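As a sketch of the last point, here is roughly what a Parrot_xxx entry point could look like so that no jump ever crosses into a foreign caller's frames. It reuses the TRY/CATCH macros sketched earlier; Parrot_run, runops, and the negative-errno return convention are placeholders, not the actual embedding API.

    #include <errno.h>

    /* placeholder declarations for the sketch */
    struct Parrot_Interp;
    extern void runops(struct Parrot_Interp *interp);   /* may THROW() */

    int Parrot_run(struct Parrot_Interp *interp)
    {
        int status = 0;
        TRY {
            runops(interp);          /* interpreter loop; exceptions unwind to here */
        } CATCH (ENOMEM) {
            status = -ENOMEM;        /* translate into a plain return code */
        } CATCH (EBADF) {
            status = -EBADF;
        } END;
        return status;               /* foreign C callers never see a longjmp */
    }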
RE: Unicode thoughts...
I think it will be relatively easy to deal with different compilers and different operating systems. However, ICU does contain some C++ code. That will make life much harder, since current Parrot only assumes ANSI C (even a subset of it). Hong

This is rather concerning to me. As I understand it, one of the goals for Parrot was to be able to have a usable subset of it which is totally platform-neutral (pure ANSI C). If we start to depend too much on another library which may not share that goal, we could have trouble with the Parrot build process (which was supposed to be shipped as Parrot bytecode).
RE: 64 bit Debian Linux/PowerPC OK but very noisy
It looks like you are running in a 32-bit environment, but using a 64-bit INTVAL. INTVAL must be the same size as void* in order to cast between them without warnings. Please try reconfiguring with a 32-bit INTVAL, or running the process in 64-bit mode.

Hong

-----Original Message-----
From: Michael G Schwern [mailto:[EMAIL PROTECTED]]
Sent: Saturday, March 16, 2002 2:54 PM
To: Hong Zhang
Cc: [EMAIL PROTECTED]
Subject: Re: 64 bit Debian Linux/PowerPC OK but very noisy

On Sat, Mar 16, 2002 at 02:36:45PM -0800, Hong Zhang wrote: Can you check what is the sizeof(INTVAL) and sizeof(void*)? Some warnings should not have happened.

(Note: Not a C programmer) INTVAL? I can't find where it's defined.

    int main (void) {
        printf("int %d, long long %d, void %d\n",
               sizeof(int), sizeof(long long), sizeof(void*));
    }

int 4, long long 8, void 4.

From perl -V:
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=87654321
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=8
ivtype='long long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=8, usemymalloc=n, prototype=define

-- Michael G. Schwern [EMAIL PROTECTED] http://www.pobox.com/~schwern/ Perl Quality Assurance [EMAIL PROTECTED] Kwalitee Is Job One The key, my friend, is hash browns. http://www.goats.com/archive/980402.html
RE: 64 bit Debian Linux/PowerPC OK but very noisy
Can you check what is the sizeof(INTVAL) and sizeof(void*)? Some warnings should not have happened. Hong -Original Message- From: Michael G Schwern [mailto:[EMAIL PROTECTED]] Sent: Saturday, March 16, 2002 10:24 AM To: [EMAIL PROTECTED] Subject: 64 bit Debian Linux/PowerPC OK but very noisy This is parrot built using a 5.6.1 with 64 bit integers. The tests pass ok, but there's a heap of warnings in the build. Here's the complete make output. perl5.6.1 vtable_h.pl perl5.6.1 make_vtable_ops.pl vtable.ops perl5.6.1 ops2c.pl C core.ops io.ops rx.ops vtable.ops include/parrot/oplib/core_ops.hperl5.6.1 ops2c.pl CPrederef core.ops io.ops rx.ops vtable.ops include/parrot/oplib/core_ops_prederef.hcc -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wstrict-prototypes -Wmissing-prototypes -Winline -Wshadow -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Wconversion -Waggregate-return -Winline -W -Wno-unused -Wsign-compare -I./include -o test_main.o -c test_main.c cc -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wstrict-prototypes -Wmissing-prototypes -Winline -Wshadow -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Wconversion -Waggregate-return -Winline -W -Wno-unused -Wsign-compare-I./include -o exceptions.o -c exceptions.c cc -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wstrict-prototypes -Wmissing-prototypes -Winline -Wshadow -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Wconversion -Waggregate-return -Winline -W -Wno-unused -Wsign-compare-I./include -o global_setup.o -c global_setup.c global_setup.c: In function `init_world': global_setup.c:23: warning: passing arg 1 of `Parrot_Array_class_init' with different width due to prototype global_setup.c:24: warning: passing arg 1 of `Parrot_PerlUndef_class_init' with different width due to prototype global_setup.c:25: warning: passing arg 1 of `Parrot_PerlInt_class_init' with different width due to prototype global_setup.c:26: warning: passing arg 1 of `Parrot_PerlNum_class_init' with different width due to prototype global_setup.c:27: warning: passing arg 1 of `Parrot_PerlString_class_init' with different width due to prototype global_setup.c:28: warning: passing arg 1 of `Parrot_PerlArray_class_init' with different width due to prototype global_setup.c:29: warning: passing arg 1 of `Parrot_PerlHash_class_init' with different width due to prototype global_setup.c:30: warning: passing arg 1 of `Parrot_ParrotPointer_class_init' with different width due to prototype global_setup.c:31: warning: passing arg 1 of `Parrot_IntQueue_class_init' with different width due to prototype cc -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wstrict-prototypes -Wmissing-prototypes -Winline -Wshadow -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Wconversion -Waggregate-return -Winline -W -Wno-unused -Wsign-compare-I./include -o interpreter.o -c interpreter.c interpreter.c: In function `make_interpreter': interpreter.c:481: warning: passing arg 1 of `mem_sys_allocate' with different width due to prototype interpreter.c:501: warning: passing arg 2 of `pmc_new' with different width due to prototype interpreter.c:577: warning: passing arg 3 of `Parrot_string_make' with different width due to prototype interpreter.c:577: warning: passing arg 5 of `Parrot_string_make' with different width due to prototype interpreter.c:579: warning: passing arg 3 of 
`Parrot_string_make' with different width due to prototype interpreter.c:579: warning: passing arg 5 of `Parrot_string_make' with different width due to prototype cc -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wstrict-prototypes -Wmissing-prototypes -Winline -Wshadow -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Wconversion -Waggregate-return -Winline -W -Wno-unused -Wsign-compare-I./include -o parrot.o -c parrot.c cc -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wstrict-prototypes -Wmissing-prototypes -Winline -Wshadow -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Wconversion -Waggregate-return -Winline -W -Wno-unused -Wsign-compare-I./include -o register.o -c register.c cc -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wstrict-prototypes -Wmissing-prototypes -Winline -Wshadow -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Wconversion -Waggregate-return -Winline -W -Wno-unused -Wsign-compare-I./include -o core_ops.o -c core_ops.c core.ops: In function `Parrot_close_i': core.ops:93: warning: cast to pointer from integer of different size core.ops: In
RE: Thread safety and interpreter safety
1) NO STATIC VARIABLES! EVER! 2) Don't hold on to pointers to memory across calls to routines that might call the GC. 3) Don't hold on to pointers to allocated PMCs that aren't accessible from the root set

I don't think rules #2 and #3 can be achieved without a systematic effort. In practice, GC can happen at any time. When I worked on the JVM, we used something called references, which is pretty much Object**. The object pointer is almost always put on a per-thread object pointer stack, and the C code always refers to the stack slot. The GC scans the entire object pointer stack, which is considered part of the root set. A couple of macros will be very helpful:

    #define ENTER \
        void* local_frame_start = current_thread->oop_stack
    #define LEAVE \
        current_thread->oop_stack = local_frame_start
    #define DEREF(ref) \
        (*(ref))
    #define REF(o) \
        (*current_thread->oop_stack = (o), current_thread->oop_stack++)

For each object pointer type, there is a reference type:

    struct Object;
    typedef struct Object *ObjectPtr;
    typedef ObjectPtr     *ObjectRef;

Only references should be used for function calls. Pointers should only be used within a function body, and should not be used across function calls. This way, we don't have to worry about which functions may cause GC.

Hong
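A hedged sketch of how code would use that discipline in practice. The thread structure, string_new, string_concat and string_print are invented for the example; only the ENTER/LEAVE/REF/DEREF shapes and the ObjectPtr/ObjectRef types come from the post above.

    /* assumed surroundings for the sketch */
    struct thread { ObjectPtr *oop_stack; };
    extern struct thread *current_thread;
    extern ObjectPtr string_new(const char *s);               /* may trigger GC */
    extern ObjectPtr string_concat(ObjectPtr a, ObjectPtr b); /* may trigger GC */
    extern void      string_print(ObjectPtr s);

    void print_greeting(void)
    {
        ENTER;                                   /* remember the handle-stack mark */
        ObjectRef a = REF(string_new("hello "));
        ObjectRef b = REF(string_new("world"));
        /* a GC inside string_concat() may move or collect objects, but the
           handles a and b stay valid: the GC scans the handle stack as roots */
        ObjectRef r = REF(string_concat(DEREF(a), DEREF(b)));
        string_print(DEREF(r));
        LEAVE;                                   /* pop this frame's handles */
    }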
RE: [PATCH] Stop win32 popping up dialogs on segfault
The following patch adds a Parrot_nosegfault() function to win32.c; after it is called, a segmentation fault will print "This process received a segmentation violation exception" instead of popping up a dialog. I think it might be useful for tinderbox clients.

Please notice, stdio is not signal/exception safe; you cannot use printf(), or even sprintf(), inside a signal handler. On Unix, you have to write something like:

    write(2, msg, strlen(msg));

On Win32, you have to write:

    {
        DWORD dummy;
        WriteFile(GetStdHandle(STD_ERROR_HANDLE), msg, strlen(msg), &dummy, NULL);
    }

The reason for this is that stdio uses a mutex to protect its internal buffers. If the mutex is already held by someone, the printf will end up deadlocking; in some cases, it will just crash. The write() and WriteFile() functions are system calls. They are atomic on almost all systems, so they do not need any lock in user space. On Win32, MSVCRT's _open() is not atomic, so it should not be used inside a signal/exception handler either. By the way, SIGINT and SIGQUIT on Win32 run in their own thread, so the restrictions there are looser.

Hong
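For what it's worth, the two calls above fold naturally into one tiny helper; a sketch only, with the name Parrot_panic_write invented for the example (the post's own caveat applies: keep everything inside a handler this minimal).

    #include <string.h>
    #ifdef _WIN32
    #  include <windows.h>
    #else
    #  include <unistd.h>
    #endif

    /* minimal message writer intended for use inside signal/exception handlers */
    static void Parrot_panic_write(const char *msg)
    {
    #ifdef _WIN32
        DWORD dummy;
        WriteFile(GetStdHandle(STD_ERROR_HANDLE), msg, (DWORD)strlen(msg), &dummy, NULL);
    #else
        (void)write(2, msg, strlen(msg));
    #endif
    }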
RE: parrot rx engine
Agh, if you go and do that, you must then be sure that rx is capable of optimizing /a/i and /[aA]/ in the same way. What I mean is that Perl's current regex engine is able to use /abc/i as a constant in a string, while it cannot do the same for /[Aa][Bb][Cc]/. Why? Because in the first case, the string being matched against has been folded, so abc will or will not be in the string. In the second case, the string has not been folded, so scanning for that constant string would require either

Please don't use the current Perl as an example. I am proposing a new algorithm for the Parrot regex engine; of course, the current Perl regex engine will not benefit from it. For things like /AbC/i, the new rx engine must be able to optimize it down to rx_opcode_ascii_match_case_insensitive "abc". If you change your example to include one-to-many and many-to-one case-folding chars, the current simple and fast Perl scheme will not work at all.

Hong
RE: I'm amazed - Is this true :)
mops tests: on perl5, python I get ~2.38 M ops/s; ruby ~1.9 M ops/s; ps ~1.5 M ops/s; parrot 20.8 M ops/s; parrot jitted 341 M ops/s, and it finishes in half a second... for most of the others I have to wait more than a minute...

Frankly speaking, these numbers are misleading. I know the Python and Ruby interpreters. They count a + b as 3 mops: load a, load b, and add the top two values of the stack. The a and b can be any type, so type check, coercion, and vtable dispatch overhead are necessary. It is equivalent to adding two PMCs and producing a third PMC. The Parrot op does not map directly to language constructs; it is more like the Java virtual machine, where operand types are known. Sometimes a compiler can compile code directly into Parrot opcodes, when the type information is available. Most of the time, we have to use generic PMCs and vtables. The difference between Perl 5 opcodes and Perl 6 opcodes + vtables would be much smaller.

Hong
RE: parrot rx engine
But as you say, case folding is expensive. And with this approach you are going to case-fold every string that is matched against an rx that has some part of it that is case-insensitive.

That is correct in general. But the regex compiler can be smarter than that. For example, rx should optimize /a+/i to /[aA]+/ to avoid case-folding. If it is too difficult for rx to do case-folding, I think it is better to use some normalizer to do full case folding.

The case-folding should be done in the rx itself, at compile time if possible. Then it is only done once, which will save a lot of time if the rx happens to be used in a loop or something.

The regular expression itself is case-folded at compile time. But I am talking about the input string here, not the regex.

Hong
RE: How Powerful Is Parrot? (A Few More Questions)
I believe the main difficulty comes from heading into uncharted waters. For example, once you've decided to make garbage collection optional, what does the following line of code mean? delete x;

If the above code is compiled to Parrot, it is probably equivalent to x->~Destructor(); i.e., the destructor is called, but the memory is left to the GC, which will most likely free it at a later time.

Or, for example, are the side effects of the following two functions different?

    void f1() {
        // On the stack
        MyClass o;
    }

    void f2() {
        // On the heap
        MyClass* o = new MyClass();
    }

If garbage collection is not 100% deterministic, these two functions could produce very different results because we do not know when or if the destructor for MyClass will execute in the case of f2().

This is exactly the same case for C++. When you compile f2 with gcc, how can you tell when the destructor is called? Even the following code does not work:

    void f3() {
        MyClass* o = new MyClass();
        ...
        delete o;
    }

If an exception happens within (...), the destructor will not be called.

If garbage collection is not 100% deterministic (and Mark and Sweep is not), we need extra language features, such as Java's finally block, to ensure things can be cleaned up, and extra training to ensure programmers are smart enough to know how to use finally blocks correctly.

That is exactly the case for C++. In your code f1() above, the C++ compiler already (behind the scenes) inserts a finally block for o's destructor. That is why the destructors of stack-allocated objects are called even when an exception happens. The only difference is that the memory deallocation is disassociated from object destruction.

Summary: object destruction with GC is as deterministic as a C++ heap-allocated object, i.e. you have to call delete x (in C++), x.close() (in Java), x.dispose() (in C#); otherwise it is 0% deterministic, period.

Hong
RE: How Powerful Is Parrot? (A Few More Questions)
This changes the way a programmer writes code. A C++ class and function that uses the class looks like this:

    class A {
    public:
        A()  { ...grab some resources... }
        ~A() { ...release the resources... }
    };

    void f() {
        A a;
        ... use a's resources ...
    }

...looks like this in Java...

    class A {
        public A() { ...grab some resources... }
    }

    void f() {
        try {
            A a;
            ... use a's resources ...
        } finally {
            ...release the resources...
        }
    }

This is exactly the right way to do things in Java. In Java, you can open hundreds of files and never trigger any GC, since each file object is very small. Unless you explicitly close files, you will be dead very quickly.

The difference between C++ and Java is that C++ provides stack-allocated objects, and the compiler does the dirty job to make sure the dtors are called at the right time. In Java, you have to do it yourself; in case you make some mistakes, the finalizer will kick in, but you should not rely on it. From the runtime point of view, the above C++ and Java are almost the same, except for the memory deallocation.

This is one of the reasons Java is so sloppy. Everyone relies on a language feature to do their job, but it is impossible for the JVM to know that there are several file objects among thousands of dead objects which need to be finalized in order to free enough file descriptors. All you need to do is treat a Java object as a C++ heap object, period.

Hong
RE: on parrot strings
But e` and e are different letters man. And re`sume` and resume are different words come to that. If the user wants something that'll match 'em both then the pattern should surely be: /r[ee`]sum[ee`]/

I disagree. The difference between 'e' and 'e`' is similar to 'c' and 'C'. The Unicode compatibility equivalences have a similar effect too, such as half-width and full-width letters. It may just be my personal preference, but I don't think it is a good idea to push this problem onto the user of the regex.

Hong
RE: on parrot strings
Yes, that's somewhat problematic. Making up a byte CEF would be Wrong, though, because there is, by definition, no CCS to map, and we would be dangerously close to conflating in CES, too... ACR-CCS-CEF-CES. Read the character model. Understand the character model. Embrace the character model. Be the character model. (And once you're it, read the relevant Unicode, XML, and Web standards.) To highlight the difference between opaque numbers and characters, the above should really be: if ($buf =~ /\x47\x49\x46\x38\x39\x61\x08\x02/) { ... } I think what needs to be done is that \xHH must not be encoded as literals (as it is now, 'A' and \x41 are identical (in ASCII)), but instead as regex nodes of their own, storing the code points. Then the regex engine can try both the right/new way (the Unicode code point), and the wrong/legacy way (the native code point).

My suggestion would be to add a binary mode, such as //b. When binary mode is in effect, only ASCII characters (0-127) still carry text properties: \p{IsLower} will only match ASCII a to z, and all of 128-255 always have false text properties. Any code point must be between 0 and 255, and regcomp can easily check that at compile time. A dedicated binary mode will simplify many issues, and the regexes will be very readable. We can make binary mode exclusive with text mode, i.e. a regex must be either binary or text, but not both. (I am not sure if it is really useful to have a mixed mode.)

Hong
RE: on parrot strings
But e` and e are different letters man. And re`sume` and resume are different words come to that. If the user wants something that'll match 'em both then the pattern should surely be: /r[ee`]sum[ee`]/ I disagree. The difference between 'e' and 'e`' is similar to 'c' and 'C'. The Unicode compatibility equivalence has a similar effect too, such as half-width and full-width letters. German to English: schon = already, schön = nice. 2 totally different words.

I am talking about similar words, where you are talking about different words. I don't mind if someone can search across languages. Some Chinese search engines can do Chinese searches using English keywords (for people who have a Chinese viewer but no Chinese input method). Of course, no one expects the regex engine to do that. But re`sume` does appear in English sentences, and the half-width and full-width letters are in the same language.

Hong
RE: on parrot strings
(1) There are 5.125 bytes in Unicode, not four. (2) I think the above would suffer from the same problem as one common suggestion, two-level bitmaps (though I think the above would suffer less, being of finer granularity): the problem is that a lot of space is wasted, since the usage patterns of Unicode character classes tend to be rather scattered and irregular. Yes, I see that you said: only the arrays that we actually used would be allocated to save space-- which reads to me: much complicated logic both in creation and access to make the data structure *look* simple. I'm a firm believer in getting the data structures right, after which the code to access them almost writes itself. I would suggest inversion lists for the first try. As long as character classes are not very dynamic once they have been created (and at least traditionally that has been the case), inversion lists should work reasonably well.

My proposal is that we should use a mixed method. The standard Unicode classes, such as \p{IsLu}, can be handled by a standard splitbin table; please see Java's java.lang.Character or Python's unicodedata_db.h. I did a measurement on it: to handle all Unicode categories, simple casing, and decimal digit values, I need about a 23KB table for Unicode 3.1 (0x0 to 0x10FFFF), about 15KB for (0x0 to 0xFFFF). For a simple character class, such as [\p{IsLu}\p{InGreek}], the regex does not need to emit an optimized bitmap. Instead, the regex just generates a union; the first part uses the standard Unicode category lookup, the second is a simple range. If the user mandates a fast bitmap, and the character class is not extremely complicated, we will probably only need several KB for each char class.

As for character encodings, we're forcing everything to UTF-32 in regular expressions. No exceptions. If you use a string in a regex, it'll be transcoded. I honestly can't think of a better way to guarantee efficient string indexing.

I don't think UTF-32 will save you much. The Unicode case map is variable length; combining characters, canonical equivalence, and many other things will require variable-length mapping. For example, if I only want to parse /[0-9]+/, why would you want to convert everything to UTF-32? Most of the time, regcomp() can find out whether a regexp will need complicated preprocessing. Another example: if I want to search for /resume/e (equivalent matching), the regex engine can normalize the case, fully decompose the input string, strip off any combining characters, and do an 8-bit Boyer-Moore search. I bet it will be simpler and faster than using UTF-32. (BTW, equivalent matching means matching English spelling against French spelling, disregarding diacritics.) I think we should explore more choices and do some experiments.

Hong
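For reference, the splitbin lookup mentioned above boils down to a two-stage table walk; the sketch below shows the shape of it. The table names, the block size, and the encoding are illustrative (a real build would generate the tables from the Unicode data files), so treat this as an outline rather than Parrot code.

    #include <stdint.h>

    #define BLOCK_SHIFT 7                      /* 128 code points per block */
    #define BLOCK_SIZE  (1 << BLOCK_SHIFT)

    extern const uint16_t cat_index[];         /* one entry per block of code points  */
    extern const uint8_t  cat_data[];          /* de-duplicated blocks of categories  */

    /* general category of a code point in 0 .. 0x10FFFF */
    static int unicode_category(uint32_t cp)
    {
        uint16_t block = cat_index[cp >> BLOCK_SHIFT];
        return cat_data[((uint32_t)block << BLOCK_SHIFT) + (cp & (BLOCK_SIZE - 1))];
    }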
RE: on parrot strings
preprocessing. Another example, if I want to search for /resume/e (equivalent matching), the regex engine can normalize the case, fully decompose the input string, strip off any combining characters, and do an 8-bit

Hmmm. The above sounds complicated, not quite what I had in mind for equivalence matching: I would have just said both the pattern and the target need to be normalized, as defined by Unicode. Then the comparison and searching reduce to the trivial cases of byte equivalence and searching (of which B-M is the most popular example).

You are right in some sense. But "normalized, as defined by Unicode" may not be simple. I looked at the Unicode regex TR18. It does not specify the equivalence of resume vs re`sume`, but a user may or may not want this kind of normalization.

Hong
RE: on parrot strings
My proposal is we should use a mixed method. The standard Unicode classes, such as \p{IsLu}, can be handled by a standard splitbin table. Please see Java's java.lang.Character or Python's unicodedata_db.h. I did a measurement on it: to handle all Unicode categories, simple casing, and decimal digit values, I need about a 23KB table for Unicode 3.1 (0x0 to 0x10FFFF), about 15KB for (0x0 to 0xFFFF).

Don't try to compete with inversion lists on size: their size is measured in bytes. For example the Latin script, which consists of 22 separate ranges sprinkled between U+0041 and U+FF5A, encodes into 44 ints, or 176 bytes. Searching for membership in an inversion list is O(log N) (binary search). Encoding the whole range is a non-issue bordering on a joke: two ints, or 8 bytes.

When I said mixed method, I did intend to include binary search. Binary search is a win for sparse character classes, but a bitmap is better for large ones. Python uses a two-level bitmap for the first 64K characters.

Hong
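To make the inversion-list side of the comparison concrete, membership testing is just a binary search over the sorted boundary array. The sketch below is illustrative only: the function name and the even/odd boundary convention are assumptions for the example, not existing Parrot or perl code.

    #include <stdint.h>
    #include <stddef.h>

    /* inv[] holds sorted code points: even indices start an "in the set"
       run, odd indices end it (exclusive).  For the Latin script example
       above, inv[] would hold the 44 boundary values. */
    static int inv_contains(const uint32_t *inv, size_t n, uint32_t cp)
    {
        size_t lo = 0, hi = n;            /* find the first element > cp */
        while (lo < hi) {
            size_t mid = (lo + hi) / 2;
            if (inv[mid] <= cp)
                lo = mid + 1;
            else
                hi = mid;
        }
        /* cp is in the set iff an odd number of boundaries are <= cp */
        return (int)(lo & 1);
    }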
RE: [PATCH] Keep comments in sync with the code...
By the way, we should not have global variable names like index in the first place. All globals should look something like GIndex.

Hong

-----Original Message-----
From: Simon Glover [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, January 08, 2002 9:56 AM
To: [EMAIL PROTECTED]
Subject: [PATCH] Keep comments in sync with the code...

We changed from index to idx in the code, but not in the comments.

Simon

--- key.c.old	Tue Jan  8 08:00:00 2002
+++ key.c	Tue Jan  8 17:52:36 2002
@@ -217,7 +217,7 @@
 /*=for api key key_element_type
-return the type of element index of KEY key
+return the type of element idx of KEY key
 =cut */
@@ -240,7 +240,7 @@
 /*=for api key key_element_value_i
-return the value of index index of KEY key
+return the value of index idx of KEY key
 =cut */
@@ -289,7 +289,7 @@
 /*=for api key key_set_element_value_i
-Set the value of index index of key key to integer value
+Set the value of index idx of key key to integer value
 =cut */
@@ -312,7 +312,7 @@
 /*=for api key key_set_element_value_s
-Set the value of index index of key key to string value
+Set the value of index idx of key key to string value
 =cut */
@@ -386,7 +386,7 @@
 /*=for api key key_inc
-Increment the type of index index of key key
+Increment the type of index idx of key key
 =cut */
RE: [PATCH] Re: Question about INTVAL vs. opcode_t sizes
That's what I thought I remembered; in that case, here's a patch:

Index: core.ops
===================================================================
RCS file: /home/perlcvs/parrot/core.ops,v
retrieving revision 1.68
diff -u -r1.68 core.ops
--- core.ops	4 Jan 2002 02:36:25 -0000	1.68
+++ core.ops	5 Jan 2002 03:58:14 -0000
@@ -463,8 +463,8 @@
 =cut
 op write(i|ic, i|ic) {
-  INTVAL * i = &($2);
-  write($1, i, sizeof(INTVAL));
+  INTVAL i = (INTVAL)$2;
+  write($1, &i, sizeof(INTVAL));
   goto NEXT();
 }

I think the above code is wrong. It should be:

    I32 i = (I32) $2;
    write($1, &i, 4);

I am not sure why you want to write all the INTVAL bytes when only the lower 32 bits are valid.

Hong
RE: 64-bit Solaris status
I am not sure why we need the U postfix in the first place. For a literal like ~0xFFF, the compiler automatically sign-extends to our expected size. Personally, I prefer using ((uintptr_t) ~0xFFF), which is more portable, so we don't have to deal with U, UL, i64. It is possible to use 32-bit address mode on a 64-bit Alpha, and the address is sign extended, not zero extended.

Hong

Passes on 64-bit Solaris. (And 32-bit Linux.) Probably more correct regardless, as longs are almost always the same size as pointers, whereas ints aren't.

--- ../parrot/Configure.pl	Wed Jan  2 22:53:29 2002
+++ ./Configure.pl	Wed Jan  2 22:53:29 2002
@@ -141,11 +141,11 @@
 debugging => $opt_debugging,
 rm_f  => 'rm -f',
 rm_rf => 'rm -rf',
-stacklow => '(~0xfff)U',
-intlow   => '(~0xfff)U',
-numlow   => '(~0xfff)U',
-strlow   => '(~0xfff)U',
-pmclow   => '(~0xfff)U',
+stacklow => '(~0xfff)UL',
+intlow   => '(~0xfff)UL',
+numlow   => '(~0xfff)UL',
+strlow   => '(~0xfff)UL',
+pmclow   => '(~0xfff)UL',
 make => $Config{make},
 make_set_make => $Config{make_set_make},
@@ -701,7 +701,7 @@
 my $vector = unpack("b*", pack("V", $_));
 my $offset = rindex($vector, 1) + 1;
 my $mask = 2**$offset - 1;
-push @returns, "(~0x" . sprintf("%x", $mask) . ")U";
+push @returns, "(~0x" . sprintf("%x", $mask) . ")UL";
 }
 return @returns;

-- Bryan C. Warnock [EMAIL PROTECTED]
RE: 64-bit Solaris status
Also, the UL[L] should probably be on the inside of the (): stacklow = '(~0xfffULL)',

I still don't see how this is safer than my proposal: ~((uintptr_t) 0xfff). Anyway, we should use some kind of macro for this purpose:

    #ifndef foo
    #define foo(a) ((uintptr_t) (a))
    #endif

or

    #ifndef foo
    #define foo(a) (a##ull)
    #endif

so that stacklow will read as stacklow = ~foo(0xfff).

Hong
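As a concrete (and purely illustrative) rendering of the first variant: PTR_MASK and CHUNK_BASE are invented names, but they show how the uintptr_t cast keeps the mask pointer-sized without any U/UL/i64 suffix games.

    #include <stdint.h>

    #define PTR_MASK(a)   ((uintptr_t) (a))
    #define CHUNK_BASE(p) ((void *) (PTR_MASK(p) & ~PTR_MASK(0xfff)))
    /* ~PTR_MASK(0xfff) is all-ones above the low 12 bits at whatever width a
       pointer happens to be, on 32-bit and 64-bit targets alike */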
RE: [PATCH] Don't count on snprintf
What we really need is our own s(n?)printf:

    Parrot_sprintf(target, "%I + %F - %I", foo, bar, baz);  /* or some such nonsense */

or even:

    target = Parrot_sprintf("%I + %F - %I");  /* like Perl's built-in */

That way, it could even handle Parrot strings natively, perhaps with a %S code. By the way, Windows seems to have an _snprintf function with the same arguments. The leading underscore is beyond me. *shrugs*

It may be a good idea to have our own version of vsnprintf(). I know the Windows version does not handle infinity and NaN well. The precision of floating point may be different on different platforms. BTW, MSVCRT has several functions with a leading _, such as _isnan, _finite, and _snprintf.

Hong
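Until such a Parrot_sprintf exists, a sketch of the usual stopgap; only the _snprintf spelling on MSVC is taken from the message above, and the macro name is made up. Note that _snprintf may not NUL-terminate on truncation, unlike C99 snprintf, so callers should force termination themselves.

    #include <stdio.h>

    #ifdef _WIN32
    #  define PARROT_SNPRINTF _snprintf   /* MSVC spelling, per the post */
    #else
    #  define PARROT_SNPRINTF snprintf
    #endif

    /* usage sketch: always terminate, whatever the platform did
         char buf[64];
         PARROT_SNPRINTF(buf, sizeof(buf) - 1, "%d", 42);
         buf[sizeof(buf) - 1] = '\0';
    */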
RE: sizeof(INTVAL), sizeof(void*), sizeof(opcode_t)
On Tue, 20 Nov 2001, Ken Fox wrote: It sounds like you want portable byte code. Is that a goal? I do indeed want portable packfiles, and I thought that was more than a goal, I thought that was a requirement. In an ideal world, I want a PVM to be integrated in a web browser the same way a JVM is now.

I think we should separate the packfile from a runtime image file. If we want the runtime to be able to run a mmapped (pack)file, the file cannot be portable: we have to deal with endianness, alignment, floating point format, etc.

I think we can get the best of both worlds. We, I think, should be able to get the bytecode format such that it is mmapable on platforms with the same endianness and sizeof(INTVAL), and nonmmapable otherwise.

There is not much problem on the bytecode side. As we discussed before, the bytecode is a stream of (aligned) 32-bit values. Most platforms can handle 32-bit values efficiently; other platforms can do a simple conversion. I think what you really need to worry about is the file format, such as the constant area, linkage table, etc. There is no need to make sizeof(opcode_t) == sizeof(INTVAL), since the constant area can hold anything you need. All you need to do is one more indirection.

Hong
RE: Beginning of dynamic loading -- platform assistance needed
Okay, here's the updated scheme.
*) There is a platform/generic.c and platform/generic.h. (OK, it'll probably really be unixy, but these days it's close enough.) If there is no platform-specific file, this is the one that gets copied to platform.c and platform.h.
*) If there *is* a platform specific file it may, and probably should unless it plans on overriding everything, include generic.c and generic.h.
*) All entries in generic.c should be bracketed with #if !defined(OVERRIDE_funcname), and any functions that the platform defines that override one in generic.c should have a corresponding #define OVERRIDE_funcname in the platform-specific .h file.
Yeah, this is definitely a pain. If someone's got a better idea I'm all ears...

Sounds like less of a pain and more forward-looking than maintaining dozens of nearly-identical unixy platform files. Looks like a good plan to me. Portability's a pain no matter how you slice it. It's just a hard problem. I don't think there's an easy solution.

I like this idea too. I think we need one generic.[ch] file for all platforms. The unix.[ch], win32.[ch], macos.[ch] will cover most of our needs. Each platform can define its own porting file. Instead of defining zillions of OVERRIDE_funcname, I would like to use the plain name, such as:

    // platform.h
    INLINE int ll_eq(int64_t a, int64_t b) {
        return memcmp(&a, &b, sizeof(a)) == 0;
    }
    #define ll_eq ll_eq

    // generic.h
    #ifndef ll_eq
    #define ll_eq(a, b) ((a) == (b))  // assuming the compiler supports 64-bit int
    #endif

The porting interface includes constants and functions. We should assume the functions may be implemented as macros, so taking their address is prohibited on the porting interface. (This is mainly for speed reasons.) Portable structures are very unlikely, such as struct sockaddr_in and struct timeval; Parrot may need to define its own structs.

Hong
RE: Building on Win32
Also, note that Hong Zhang ([EMAIL PROTECTED]) has pointed out a simplification (1 API call rather than 2)... FYI.

The GetSystemTimeAsFileTime() call takes less than 10 assembly instructions; it just reads the kernel time variable that is mapped into every address space.

...and given I think I've found a working GNU diff for Win32, I may be able to submit a real patch (but it'll be the morning before I get sorted out).

I thought cygwin contains GNU diff.

Hong
RE: Building on Win32
void gettimeofday(struct timeval* pTv, void *pDummy)
{
    SYSTEMTIME sysTime;
    FILETIME fileTime;   /* 100ns == 1 */
    LARGE_INTEGER i;
    GetSystemTime(&sysTime);
    SystemTimeToFileTime(&sysTime, &fileTime);
    /* Documented as the way to get a 64 bit from a FILETIME. */
    memcpy(&i, &fileTime, sizeof(LARGE_INTEGER));
    pTv->tv_sec  = i.QuadPart / 10000000;           /* 10e7 */
    pTv->tv_usec = (i.QuadPart / 10) % 1000000;     /* 10e6 */
}

For speed reasons, you can use GetSystemTimeAsFileTime(), which is very efficient. Win32 is a little-endian-only operating system, so you can use the following code:

void gettimeofday(struct timeval* pTv, void *pDummy)
{
    __int64 l;
    GetSystemTimeAsFileTime((LPFILETIME) &l);
    pTv->tv_sec  = (long) (l / 10000000);                  /* 10e7 */
    pTv->tv_usec = (unsigned long) ((l / 10) % 1000000);   /* 10e6 */
}

You missed the cast.

Hong
RE: moving integer constants to the constant table
This patch moves integer constants to the constant table if the size chosen for integers is not the same as the size chosen for opcodes.

It still leaves room for trouble. I suggest we move everything that cannot be held by an int32_t out of the opcode stream. The need for 64-bit constants is rare. This way, we can generate portable bytecode.

Hong
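A sketch of what that indirection looks like at the op level: the opcode stream carries only a 32-bit index, and anything wider lives in the constant table. The struct and field names here are placeholders, not the real packfile layout.

    #include <stdint.h>

    typedef double NV;              /* Parrot's float type, assumed 64-bit here */

    struct const_table {
        NV      *num_consts;        /* 64-bit floats, stored once, properly aligned */
        int64_t *int64_consts;      /* integers too wide for the 32-bit stream     */
    };

    /* an op like "set N2, <idx>" would fetch through the table */
    static NV fetch_num_const(const struct const_table *ct, uint32_t idx)
    {
        return ct->num_consts[idx];
    }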
RE: thread vs signal
Now how do you go about performing an atomic operation in MT? I understand the desire for reentrance via the exclusive use of local variables, but I'm not quite sure how you can enforce this when many operations are on shared data (manipulating elements of the interpreter / global variables).

There are two categories of global vars: ones used by the runtime and ones used by the app. For the former, the runtime will use the following schemes:
1) Reduce globals by using more per-thread variables (such as per-thread profile info instead of per-interpreter info).
2) Use atomic variables. Incrementing a profile counter does not need a lock, even if the count may occasionally be off by one (see the sketch after this message).
3) Use mutexes as needed.

I definitely agree that locking should be at a high level (let them core if they don't obey the design). I liked the perl5 idea that any scalar / array / hash could be a mutex. Prevents you from having to carry around lots of extra mutex values.

We can achieve the exact same synchronization policy as Java, or one that's finer tuned for performance. We can either let sv/av/hv carry a mutex, or let them be atomic, although it is non-trivial to make them atomic. For languages like Smalltalk, it is trivial to make the system atomic, since all complex data structures are user defined.

Hong
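A minimal sketch of points 1) and 2) above. The structures and names are invented; a real interpreter would hang the per-thread slot off whatever thread state Parrot ends up with, and the unlocked shared counter is deliberately racy.

    /* 1) per-thread counter: no sharing, so no locking at all */
    struct thread_state {
        long ops_run;
    };

    /* 2) shared profile counter bumped without a mutex: an update may
          occasionally be lost, which is acceptable for profiling data */
    static long total_ops_hint;

    static void count_op(struct thread_state *ts)
    {
        ts->ops_run++;        /* always exact, thread-private        */
        total_ops_hint++;     /* racy by design, statistically fine  */
    }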
RE: thread vs signal
On Sun, Sep 30, 2001 at 10:45:46AM -0700, Hong Zhang wrote: Python uses a global lock for multi-threading. It is reasonable for I/O threads, which block most of the time. It will be completely useless for CPU-intensive programs or large SMP machines.

It might be useless in theory. In practice it isn't, because most CPU-intensive tasks are pushed down into C code anyway, and C code can release the single interpreter lock while it's crunching away.

That does not mean Python is a high-performance MT language. It just hands the problem to C. In that sense, every language can claim the same speed, since we can just write everything in C and call it, and we are blazing fast everywhere. .NET?

Hong
RE: thread vs signal
How does python handle MT? Honestly? Really, really badly, at least from a performance point of view. There's a single global lock and anything that might affect shared state anywhere grabs it.

Python uses a global lock for multi-threading. It is reasonable for I/O threads, which block most of the time. It will be completely useless for CPU-intensive programs or large SMP machines. If Perl needs full multi-threading, we had better look at Java for reference. Java has the best language/runtime support for MT; it can run thousands of threads inside one VM on a big SMP machine.

However, Java has made many mistakes with threading. One of them is the synchronization overhead. A normal Java program can issue one million locks per second. JDK 1.0.0 spent 20-25% of its time in locking code when running HotJava. The main problem came from the fact that the core library (Vector, Hashtable, IO streams, awt, etc.) is fully synchronized, even though most of the time you don't need it to be.

The same story may happen to Perl. If Perl makes all operations on SV, AV, HV synchronized, the performance will be pathetic. Many SMP machines can only perform about 10M sync operations per second, because a sync op requires a system-wide bus lock or a global memory transaction. This situation will not change much in the future.

One way to reduce sync overhead is to make more operations atomic instead of synchronized. For example, read() and write() are atomic, so there is no need to synchronize the stream. Array get/put are atomic in Java, so we don't need sync there either. The high-level library or the app itself will be responsible for its own synchronization.

Hong
RE: NV Constants
This was failing here until I made the following change:

    PackFile_Constant_unpack_number(struct PackFile_Constant * self, char * packed, IV packed_size)
    {
        char * cursor;
        NV value;
        NV * aligned = mem_sys_allocate(sizeof(IV));

Are you sure this is correct? Or is this the code before the fix? Allocating an NV using sizeof(IV) is strange. I don't see the need for an aligned temp variable; the following code will do exactly what your code does (I believe). memcpy() can handle alignment nicely.

    PackFile_Constant_unpack_number(struct PackFile_Constant * self, char * packed, IV packed_size)
    {
        PackFile_Constant_clear(self);
        self->type = PFC_NUMBER;
        memcpy(&(self->number), packed, sizeof(NV));
        return 1;
    }

Hong
RE: NV Constants
The memcpy() can handle alignment nicely. Not always. I tried. :(

How could that be possible? The memcpy() just does a byte-by-byte copy. It does not care anything about the alignment of the source or the destination. How can it fail?

Hong
thread vs signal
In a word? Badly. :) Especially when threads were involved, though in some ways it was actually better since you were less likely to core perl. Threads and signals generally don't mix well, especially in any sort of cross-platform way. Linux, for example, deals with signals in threaded programs very differently than most other unices do. (Both ways make sense, they just aren't at all similar.)

Though what you said is largely correct, there are ways to use signals safely with threads. Signals are divided into 2 categories, sync or async. The sync signals include SIGSEGV, SIGBUS, etc.; they must be handled inside the signal handler. As long as the crash does not happen inside a mutex/condvar block, it will be safe to get out of the trouble using siglongjmp on most platforms. For async signals, it is very risky to use siglongjmp(), since the jmpbuf may not be correct; the alternative is to use the sigwait() family. See some examples:

A) To handle a sync signal:

    int sig;

    foo() {
        sig = sigsetjmp(interpreter->jmpbuf, 1);
        if (sig == 0) {
            for (;;) { DO_OP(); }
        } else if (sig == SIGSEGV) {
            // do something
        } else if (sig == SIGBUS) {
            // do something
        }
    }

    void signal_handler(int sig) {
        siglongjmp(current_interpreter()->jmpbuf, sig);
    }

The above code is safe on most platforms, but it should be used in a controlled fashion, so we can correctly recover from the error. If it does not work on some platform, we can use an alternative scheme:

    foo() {
        while (interpreter->sig == 0) {
            DO_OP();
        }
        if (interpreter->sig == SIGSEGV) { ... }
    }

    void signal_handler(int sig) {
        current_interpreter()->sig = sig;
    }

Since pthread_self() may not be available inside signal_handler(), we need to design some global data structures to find the current interpreter.

B) Wrong way to handle an async signal (was used in Java):

    mutex_lock();
    if (sigsetjmp(interpreter->jmpbuf, 1)) {
        cond_wait(...);
    } else {
        // PANIC;
    }

The above code will not work reliably on any platform. The siglongjmp will not be able to restore the mutex correctly, even though only one mutex is involved here.

C) Correct way to handle async signals such as CTRL-C:

    void async_signal_handler_thread_function() {
        while (sigwait(...)) {
            // handle signal
        }
    }

We create one thread for all async signals, and let everyone else mask the async signals off. This scheme can handle signals reliably under threads.

Hong
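Scheme C) fleshed out a little, as a hedged sketch: block the chosen async signals before any other threads are created, then let one dedicated thread collect them with sigwait(). handle_signal() and the particular signal list are stand-ins for whatever event mechanism Parrot adopts.

    #include <pthread.h>
    #include <signal.h>

    extern void handle_signal(int sig);   /* stand-in: e.g. post an interpreter event */

    static sigset_t async_set;

    static void *signal_thread(void *arg)
    {
        int sig;
        (void)arg;
        for (;;) {
            if (sigwait(&async_set, &sig) == 0)
                handle_signal(sig);
        }
        return NULL;
    }

    void start_signal_thread(void)
    {
        pthread_t tid;

        sigemptyset(&async_set);
        sigaddset(&async_set, SIGINT);
        sigaddset(&async_set, SIGHUP);
        sigaddset(&async_set, SIGTERM);

        /* block these in the calling thread; threads created afterwards
           inherit the mask, so only the dedicated thread ever sees them */
        pthread_sigmask(SIG_BLOCK, &async_set, NULL);
        pthread_create(&tid, NULL, signal_thread, NULL);
    }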
RE: thread vs signal
The fun part about async vs sync is there's no common decision on what's an async signal and what's a sync signal. :( SIGPIPE, for example, is one of those. (Tru64, at least, treats it differently than Solaris.) I generally divide signals into two groups: *) Messages from outside (i.e. SIGHUP) *) Indicators of Horrific Failure (i.e. SIGBUS)

I think another (*better*) way to put this is process-wide signals vs thread-specific signals.

Generally speaking, parrot should probably just up and die for the first type, and turn the second into events.

Have you reversed the ordering??? How can you convert SIGBUS to events?

AFAIK, almost none of the pthread functions are safe in signal handlers. There might be one or two, but I can't remember which ones. (None of the mutex or condition functions, alas, and they're rather useful.)

Keep this for the record: sem_post() is the only signal-safe thread function. I don't think mutexes and condvars are useful in this case.

If we create one thread for all async signals, and let everyone else mask async signals off, this scheme can handle signals reliably under threads. This, unfortunately, isn't portable. It only works on platforms that fully implement the POSIX threading standard. Linux is the big example of a platform that *doesn't*. Signals only get delivered to the thread that triggered them, and if the thread's got the signal masked off it gets dropped. :(

You did not get my idea. I was talking about async signals (messages from outside, process-wide signals). There is no notion of "the thread that triggered them" there; that applies to sync signals only. Linux does have sigtimedwait() etc. Also, "masked off" means different things: you can set the signal to SIG_IGN, which drops it, or you can use sigmask() to mask it off, in which case the signal will be queued.

Hong
RE: SV: Parrot multithreading?
This is fine at the target language level (e.g. perl6, python, jako, whatever), but how do we throw catchable exceptions up through six or eight levels of C code? AFAICS, this is more of why perl5 uses the JMP_BUF stuff - so that XS and functions like sv_setsv() can Perl_croak() without caring about who's above them in the call stack. This is my point exactly.

This is the wrong assumption. If you don't care about the call stack, how can you expect [sig]longjmp to successfully unwind the stack? The caller may hold a malloc'd memory block, may have entered a mutex, or may hold the file lock on the Perl cvs directory. (You probably have to call Dan or Simon for the last case.)

The alternative is that _every_ function simply return a status, which is fundamentally expensive (your real retval has to be an out parameter, to start with).

This is the only generally correct solution. If you really really really know everything between the setjmp and the longjmp, you can use it. However, the chance of that is very low.

To answer my own question (at least, with regards to Solaris), the attributes(5) man page says that 'Unsafe' is defined thus: An Unsafe library contains global and static data that is not protected. It is not safe to use unless the application arranges for only one thread at a time to execute within the library. Unsafe libraries may contain routines that are Safe; however, most of the library's routines are unsafe to call. This would imply that in the worst case (at least for Solaris) we could just wrap calls to [sig]setjmp and [sig]longjmp in a mutex. 'croak' happens relatively infrequently anyway.

This is not the point. [sig]setjmp and [sig]longjmp are generally safe outside signal handlers. Even if they were not safe, we could easily write our own thread-safe version using a very small amount of assembly code. The problem is that they cannot be used inside signal handlers under MT, and it is (almost) impossible to write a thread-safe version for that case.

Hong
RE: SV: Parrot multithreading?
This is the wrong assumption. If you don't care about the call stack, how can you expect [sig]longjmp to successfully unwind the stack? The caller may hold a malloc'd memory block, Irrelevant with a GC.

Are you serious? Do you mean I cannot use malloc in my C code?

or have entered a mutex, If they're holding a mutex over a function call without a _really_ good reason, it's their own fault.

If you don't care about the caller, why should the caller care about you? Why do callers need to present their reason for locking a mutex? You ask too much.

or acquire the file lock of the Perl cvs directory. You probably have to call Dan or Simon for the last case. The alternative is that _every_ function simply return a status, which is fundamentally expensive (your real retval has to be an out parameter, to start with). This is the only right solution generally. If you really really really know everything between setjmp and longjmp, you can use it. However, the chance is very low. It is also slow, and speed is priority #1.

If so, just use C, which checks nothing.

Signals are an event, and so don't need jumps. Under MT, it's not like there would be a lot of contention for PAR_jump_lock.

Show me how to convert SIGSEGV to an event. Please read the previous messages: some signals are events, some are not.

Hong
RE: Tru64 core dumps
# 0xf000 for 64 bit systems. With that changed Don't bother. Make the constant be ~0xfff. :)

Umm, are you sure? It's used in an integer context and masked against an IV, so you might need an 'int', a 'long', or a 'long long'. I'm unsure what type to portably assume for C preprocessor constants, but I suspect this might not do what you want if an IV is a 'long long'. (However, given that it's operating against an IV that used to be a pointer of a possibly different size, everything might just work out fine.)

There should be no need. ~0xfff is a signed int, which will be sign-extended by compilers as needed. Unless you are using a buggy compiler.

Hong
RE: Tru64 core dumps
You are using the wrong flag. The expression in the second printf is long long, so you should use the %llx flag. Since printf uses varargs, the behavior is undefined if there is a type mismatch with the argument.

Hong

Hehehe. Ok. Guess what the following will print:

    #include <stdio.h>

    int main(void)
    {
        int x = 511;
        printf("x = %x\n", x);
        printf("x & ~0xff = %x\n", x & (long long) ~0xff);
        return 0;
    }

-- Andy Dougherty [EMAIL PROTECTED] Dept. of Physics Lafayette College, Easton PA 18042
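For completeness, the corrected form of that test, applying the %llx advice from the reply above (the cast to unsigned long long simply keeps the conversion specifier and the argument type in agreement):

    #include <stdio.h>

    int main(void)
    {
        int x = 511;
        printf("x = %x\n", x);
        printf("x & ~0xff = %llx\n",
               (unsigned long long) (x & (long long) ~0xff));
        return 0;
    }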
RE: variable number of arguments
is it possible for the ops to handle a variable number of arguments? what I have in mind: print I1, ",", N2, "\n"

This should be done by a create-array opcode plus a print-array opcode. [1, 2, 3, 4, 5] The create-array opcode takes the top n entries of the stack (or n registers) and creates an array out of them. Both opcodes are very common and worth having. I don't see any further benefit from a single vararg print opcode. Print is an expensive opcode anyway.

Hong
RE: [PATCH] assemble.pl registers go from 0-31
Attached patch makes sure you don't try and use register numbers over 31. That is, this patch allows registers I0-I31 and anything else gets a: Error (foo.pasm:0): Register 32 out of range (should be 0-31) in 'set_i_ic' Oh, there's also a comment at end of line patch that has snuck in 'cos it's so darn useful. Just curious, do we need a dedicated zero register and sink register? The zero register always reads zero, and can not be written. The sink register can not be read, and write to it can be ignored. Hong
RE: [PATCH] assemble.pl registers go from 0-31
Just curious, do we need a dedicated zero register and sink register? I've been pondering that one and waffling back and forth. At the moment I don't think so, since there's no benefit to going with a zero register over a zero constant, but that could change tomorrow.

For example, once we have subcalls, we may want to provide all arguments in registers, instead of some args in regs, some in the constant pool, and some in inline literals. At least, this is a reasonable approach. The sink register can be used for in-place patching (for debugging, profiling, or whatever) without re-arranging the opcodes and offsets. It is of little use. Just a thought.

Hong
RE: Parrot multithreading?
DS I'm also seriously considering throwing *all* PerlIO code into separate DS threads (one per file) as an aid to asynchrony. but that will be hard to support on systems without threads. i still have that internals async i/o idea floating in my numb skull. it is an api that would look async on all platforms and will use the kernel async file i/o if possible. it could be made thread specific easily as my idea was that the event system was also thread specific.

I think we should have some thread abstraction layer instead of throwing PerlIO into threads. The thread abstraction layer can use either a native thread package (blocking I/O), or implement a user-level thread package with either non-blocking I/O or async I/O. The internal I/O should be sync instead of async: async is normally slower than sync (most Unixes don't have real async I/O), and threads are cheap.

Hong
RE: Parrot multithreading?
Nope. Internal I/O, at least as the interpreter will see it, is async. You can build sync from async; it's a big pain to build async from sync. Doesn't mean we actually get asynchrony, just that we can.

It is trivial to build async from sync, just using threads. Most Unix async I/O is built this way, using either user-level threads or kernel-level threads. Win32 has a real async I/O implementation, but it does not interact well with sync I/O.

Just because some systems have a really pathetic I/O system doesn't mean we should penalize those that don't...

Implementing sync on top of async is also slower. I bet most people will use sync I/O, not the async one. There is no need to build async I/O from sync; the async can be provided as a separate module. It is not about some systems, it is about most systems: very few systems have a high-performance async I/O implementation, and the semantics are not very portable. I am not sure the interpreter has to choose one over the other. The interpreter could support both interfaces, and use them as needed.

Hong
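A minimal illustration of "async from sync, just using threads": a worker thread performs the blocking read() and sets a completion flag. The aio_req layout, the names, and the polling flag are all invented for the sketch; a real design would use a condvar or the event system rather than a volatile flag.

    #include <pthread.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct aio_req {
        int          fd;
        void        *buf;
        size_t       len;
        ssize_t      result;
        volatile int done;
    };

    static void *aio_worker(void *arg)
    {
        struct aio_req *req = arg;
        req->result = read(req->fd, req->buf, req->len);  /* ordinary sync call */
        req->done   = 1;                                   /* completion flag    */
        return NULL;
    }

    /* kick off the "async" read; caller polls req->done or joins the thread */
    int aio_read_start(struct aio_req *req, pthread_t *tid)
    {
        req->done = 0;
        return pthread_create(tid, NULL, aio_worker, req);
    }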
RE: Check NV alignment for Solaris
One of the things that might be coring Solaris is the potential for embedded floats in the bytecode stream. (The more I think about that the more I regret it...) The ops do a quick and ugly cast to treat some of the opcode stream as an NV, which may trip across alignment rules and size issues. (I assume NVs are twice the size of ops, but that could be incorrect.)

I am strongly against embedding any constants (other than 32-bit literals) into the opcode stream. Float formats are very platform dependent; we should use the constant pool for them. Float literals are 64 bits wide, so there is no way to align them correctly. And once we have floats embedded in the opcode stream, it will be very difficult to patch it. There is really no obvious benefit to doing so. We should just use the constant pool, and leave the opcode stream as a stream of signed 32-bit integers. A 32-bit value can be represented using different formats in memory or in a file -- endianness and size.

Hong
RE: Bytecode safety
Proposed: Parrot should never crash due to malformed bytecode. When choosing between execution speed and bytecode safety, safety should always win. Careful op design and possibly a validation pass before execution will hopefully keep the speed penalty to a minimum.

We can use a similar model to Java bytecode. Because of poor design, Java bytecode requires an exponential-time algorithm to verify, mainly caused by the weak typing of local variables (where all other parts of the Java runtime are strongly typed) and the notorious jsr/ret bytecodes. We should avoid the same kind of mistakes. Bytecode verification should be about O(n * ln(n)).

Hong
RE: [PATCH] changing IV to opcode_t!!
Do we want the opcode to be so complicated? I thought we are going to use this kind of thing for generic pointers. The p member of opcode does not make any sense to me. Hong Earlier there was some discussion about changing typedef long IV to typedef union { IV i; void* p; } opcode_t;
RE: Bytecode file format
Offset  Length  Description
0       1       Magic Cookie (0x013155a1)
1       n       Data
n+1     m       Directory Table
m+n+1   1       Offset of beginning of directory table (i.e. n+1)
I think we need a version right after the cookie for long-term compatibility. The directory is after the data so offsets can be determined as the data is written. The directory offset is at the very end, so it can be determined before the directory is written, and easily found by loaders. Having the directory at the end may not be a good choice. It requires loading everything into memory before parsing. If the directory is in front, we can do stream parsing. Hong
RE: RFC: Bytecode file format
8-byte word: endianness (magic value 0x123456789abcdef0)
byte:        word size
byte[7]:     empty
word:        major version
word:        minor version
Where all word values are as big as the word size says they are. The magic value can be something else, but it should byteswap such that if you read it in you can tell whether it was a big-endian write or a little-endian write. Since the magic value can tell the endianness, there is really no need for a separate endianness field. Personally I don't like the word-size concept. I prefer we use a fixed 4-byte word. If we support multiple word sizes, each runtime has to deal with several bytecode data formats: 2-, 4-, 6-, and 8-byte words. I believe the 4-byte word will cover 99+% of all practical use. We should let the minority convert, instead of asking every runtime to handle everything. Hong
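A small sketch of the "magic value tells you the byte order" idea, assuming the 64-bit magic from the proposal; check_magic and bswap64 are invented names:

    #include <stdint.h>

    #define BYTECODE_MAGIC 0x123456789abcdef0ULL   /* value from the proposal */

    static uint64_t bswap64(uint64_t v) {
        v = ((v & 0x00ff00ff00ff00ffULL) << 8)  | ((v >> 8)  & 0x00ff00ff00ff00ffULL);
        v = ((v & 0x0000ffff0000ffffULL) << 16) | ((v >> 16) & 0x0000ffff0000ffffULL);
        return (v << 32) | (v >> 32);
    }

    /* Returns 0 if the file matches our byte order, 1 if it needs swapping,
     * -1 if it is not a bytecode file at all. */
    static int check_magic(uint64_t raw) {
        if (raw == BYTECODE_MAGIC)          return 0;
        if (bswap64(raw) == BYTECODE_MAGIC) return 1;
        return -1;
    }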
RE: RFC: Bytecode file format
We can't do that. There are platforms on both ends that have _no_ native 32-bit data formats (Crays, some 16-bit CPUs?). They still need to be able to load and generate bytecode without ridiculous CPU penalties (your Palm III is not running on a 700MHz Pentium III, after all!) If a platform cannot deal with 32-bit values, its runtime can convert them to its own in-memory format. Almost all platforms can deal with 32-bit values coming from a file or database. All this is based on the assumption of a portable bytecode file. If the file is just a snapshot of the runtime image, there is no need to discuss much here, since each runtime can just choose its own format without worrying about interchange. Hong
RE: RFC: Bytecode file format
There's a one-off conversion penalty at bytecode load time, and I don't consider that excessive. I want the bytecode to potentially be in platform native format (4/8 byte ints, big or little endian) with a simple and well-defined set of conversion semantics. That way the bytecode loader can manage it quickly, and the external conversion tool (To change between types) can deal with it simply as well. If you want native format, you have implement runtime specific image file format, such as Smalltalk image. It will be hard to use one format for both native and portable. Hong
RE: Using int32_t instead of IV for code
If we are going to keep on doing fancy stuff with pointer arithmetic (eg the Alloc_Aligned/CHUNK_BASE stuff), I think we're also going to need an integer type which is guaranteed to be the same width as a pointer, so we can freely typecast between the two. You are not supposed to do fancy stuff with the code stream. Also, if we've got a system with 64 bit IVs, are the arguments to Parrot opcodes going to be 32 or 64 bit? If 32 bit, is there going to be any way of loading a 64 bit constant? The arguments are always 32-bit. For larger constants, such as 64-bit ints, numbers, and bigint/bigfloat/string, you must use the constant pool. There is not much benefit to embedding 64-bit values in the code stream, since they are rarely used and bloat up everything else. Hong
RE: Using int32_t instead of IV for code
I'd have thought it made sense to define it as a bytecode_t type, or some such which could be platform specific. It is better called opcode_t, since we are not using bytecode anyway. Hong
RE: Parrot coredumps on Solaris 8
Now works on Solaris and i386, but segfaults at the GRAB_IV call in read_constants_table on my Alpha. Problems with the integer-pointer conversions in memory.c? (line 29 is giving me a warning). Line 29 is extremely wrong. It assigns an IV to a void* without casting. The alignment calculation is very wrong too. Using the classic alignment idiom, it should read: mem = (void*) (((IV)mem + mask) & ~mask); Hong
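For reference, the classic round-up-to-alignment idiom as a standalone helper (a sketch; assumes the alignment is a power of two and that mask = align - 1, and uses uintptr_t rather than IV for the cast):

    #include <stdint.h>

    /* Round a pointer up to the next 'align'-byte boundary. */
    static void *align_up(void *p, uintptr_t align) {
        uintptr_t mask = align - 1;
        return (void *)(((uintptr_t)p + mask) & ~mask);
    }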
Using int32_t instead of IV for code
I think we should use int32_t instead of IV for all code-related data. The IV is 64 bits on a 64-bit machine, which is a significant waste. The IV is also platform specific, and has caused some nasty problems so far. Hong
RE: Math functions? (Particularly transcendental ones)
Uri Guttman: we are planning automatic over/underflow to bigfloat. so there is no need for traps. they could be provided at the time of the conversion to big*. OK. But will Perl support signaling and non-signaling NaNs? I don't think we should go for automatic overflow/underflow between float and bigfloat. Float exceptions (overflow, underflow, inexact, divide by zero, ...) are very difficult to handle. Using Unix signals is expensive and very platform-specific (lots of ucontext issues). Since the C language does not support floating-point signals, we may need some assembly code to handle them, and that will be a porting nightmare. Since most floating-point code assumes IEEE semantics, adopting automatic float/bigfloat promotion will change this assumption significantly. It may affect a lot of code and algorithms. I think it is safer just to provide a BigDecimal class for developers to use, and keep the basic float semantics (close to 64-bit IEEE-754 if possible). Hong
RE: An overview of the Parrot interpreter
True, but it is easier to generate FAST code for a register machine. A stack machine forces a lot of book-keeping: either run-time inc/dec of sp, or alternatively compile-time what-is-the-offset-now stuff. The latter is a real pain if you are trying to issue multiple instructions at once. I think we need to get some initial performance characteristics of a register machine vs a stack machine before we go too far. There are not many points left to debate on the mailing list. I believe we have some misunderstanding here. The inc/dec of sp costs nothing; the sp is almost always a register variable, and the cost of arithmetic on it is most likely hidden by the dispatch loop. The main cost is useless memory copies like: push local #3. The register machine can only avoid copies between local variables and the expression stack. If a sub uses a lot of globals and fields, the register machine has to load/store them (into/out of the register file), which is exactly the same as push/pop on a stack. I think the performance gain of a register machine comes from several areas: 1) avoiding copies between locals and the stack, though it cannot speed up global/field access; 2) complex ops reduce dispatch overhead: "add i1, i2, i3" vs "push local 1; push local 2; add; pop local 3" -- this is likely the biggest gain; 3) special registers (32 ints, 32 floats, 32 strings) simplify GC and speed up common opcodes. In order to achieve this, we must enable some type system. I remember Perl 6 is going to have unified int/bigint, num/bignum, and multi-encoding/charset strings. I wonder how the special registers can handle this feature, since there may be overflow/underflow problems. Hong
RE: An overview of the Parrot interpreter
If you really want a comparison, here's one. Take this loop: i = 0; while (i < 1000) { i = i + 7; } with the ops executed in the loop marked with pipes. The corresponding Parrot code would be: getaddr P0, i store P0, 0 store I0, 1000 foo: | branchgt end, P0, I0 | add P0, P0, 7 | jump foo I think Dan gave a straightforward translation, since it does not really use the int registers. The optimized code will be faster: store i1, 0; store i2, 1000; jump L2; L1: add i1, 7 = i1; L2: branchlt i1, i2 = L1; getaddr i = P0; store i1 = P0; However, I'd like to point out that one hidden overhead of register opcodes is decoding the parameters. The add instruction of a stack machine has no args, but for a register machine it has 3 arguments. Hong
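A tiny sketch of what "decoding the parameters" costs in a register-machine handler, assuming a flat 32-bit opcode stream and an integer register file (layout and names invented):

    #include <stdint.h>

    /* Stream layout assumed: [opcode][dest][src1][src2], all 32-bit words. */
    static const int32_t *op_add_i(const int32_t *pc, int32_t *i_regs) {
        i_regs[pc[1]] = i_regs[pc[2]] + i_regs[pc[3]];  /* three operand fetches */
        return pc + 4;                                  /* vs. pc + 1 for a stack add */
    }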
RE: Final draft: Conventions and Guidelines for Perl Source Code
I believe the advantage of if (...) { ... } else { ... } is writing very dense code, especially when the block itself is a single line. This style may not be readable to some people. This style is not very consistent: if (...) { ... } else { ... } I believe it would be better as /* comment */ if (...) { ... } /* comment */ else { ... } The advantage of this style is that it is not as dense as the previous one, and it is good for comments. if (...) { ... } else { ... } The last style is very sparse, and very readable. It just wastes too much screen and paper (if you wanna print). BTW, I am not sure whether it has been mentioned already: we should enforce {} even for single-line blocks. Since we use plenty of macros that may expand to multiple lines, it is much safer and more consistent to always use {}. Hong
RE: Draft assembly PDD
The branch instruction is wrong. It should be branch #num. The offset should be part of the instruction, not come from a register. Nope, because that kills the potential for computed relative branches. (It's in there on purpose) Branches should work from both constants and registers. Even so, branch #num should have better performance, and it is part of any machine language. Since we already have a jump instruction, do we really need branch %r, which can be simulated by add %r, %pc, #num; jump %r? The register set seems too big. It reduces cache efficiency and uses too much stack. Yeah, that's something I'm worried about. 64 may be too much. 16 is too few, so we might split the difference and go with 32 to start. If we define caller-save and callee-save sets, the 64 registers may not be bad, as long as the caller-save set is small. If we don't define caller/callee save, we can still use 64 registers. However, we need to add one tag bit to each function/stack frame to indicate whether it is a big frame or a small frame. A big frame uses 64, a small one uses 16. The register set is still 64, but a small frame does not use anything beyond 16, so we don't have to save/restore them. It is not just about performance; stack size and cache locality are also big issues. Hong
RE: The internal string API
The one problem with copy-on-write is that, if we implement it in software, we end up paying the price to check it on every string write. (No free depending on the hardware, alas) Not that this should shoot down the idea of COW strings, but it is a cost that needs considering. (I suppose we could have a COW subtype of the basic scalar and string scalar) Even with a software implementation, it can come almost free. In this case, I would use two sizes for each string, readSize and writeSize. The write operation checks against writeSize as part of the normal bounds check. If a string is read-only (such as a literal), its writeSize will be 0, and we do the copy on write. The same scheme applies to string growth. So the price is just one extra word (writeSize) per string. Since this enables us to intern all literal strings without introducing another data type, I would say the overhead is minimal. Hong
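A minimal sketch of the two-size scheme described above; the struct and field names are invented, and error handling plus freeing/ownership of the old buffer are omitted:

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char   *data;
        size_t  readSize;    /* bytes valid for reading */
        size_t  writeSize;   /* bytes writable in place; 0 => copy on write */
    } PString;

    static void pstring_write_byte(PString *s, size_t pos, char c) {
        if (pos >= s->writeSize) {               /* the normal bounds check */
            size_t newsize = (pos + 1 > s->readSize) ? pos + 1 : s->readSize;
            char *copy = malloc(newsize);        /* COW or growth: take a private copy */
            memcpy(copy, s->data, s->readSize);
            s->data = copy;                      /* old buffer may be an interned literal */
            s->readSize = newsize;
            s->writeSize = newsize;
        }
        s->data[pos] = c;
        if (pos >= s->readSize)
            s->readSize = pos + 1;
    }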
RE: The internal string API
* Convert from and to UTF-32
* lengths in bytes, characters, and possibly glyphs
* character size (with the variable-length ones reporting in negative numbers)
What do you mean by character size if it does not support variable length?
* get and set the locale (This might not be the spot for this)
The locale should be context based. Each thread should have its own locale.
* normalize (a noop for non-Unicode data)
* Get the encoding name
The encoding name is tricky. Neither Java nor POSIX defines its naming scheme. I personally prefer full names in lower case, such as iso8859-1, with the API converting names to lower case automatically. The encoding name must be strict ASCII. Some common aliases may be provided. There must be an API to list all supported encodings at runtime.
* Do a substr operation by character and glyph
The byte-based operation is more useful. I have UTF-8, and I want to substr it into another UTF-8 string. It is painful to convert it or do a linear search for the character position. I don't know if we want to treat encoding and data format separately--it would seem to make sense to be able to have a string tell us it's Unicode/UTF-32/Korean rather than just UTF-32/Korean, since I don't see why it wouldn't be allowable to use the UTF-8 or UTF-16 encoding on non-Unicode data. (Not that it'd necessarily be all that useful, and I can see just not allowing it) I don't see why the core should support language/locale in this much detail. I deal with a lot of mixed Chinese/English text files. There is no way to represent that with a plain string, unless you want to make the string a rich-format-text buffer. The current locale or an explicit locale parameter will suffice for your goal. Hong
RE: The internal string API
This is the common approach to complicated text representation; the implementations I have seen include IBM's IText and SGI's rope. A rope is represented by either a simple immutable string, a simple mutable string, a simple immutable substring of another rope, or a binary node joining two other ropes. We can even add user-defined nodes for things like memory-mapped data, or #include, etc. The basic string is just one of the rope types. We can build a text package much like SGI's rope. I don't think we should make the basic string itself rope-like, for complexity and modularity reasons. Hong
The simplest tree is one node with a raw block in it. Only when you start doing things to it, substr($A, 27, 3, $B) and suchlike, does deferring the copying give a win. Say $A is 35 megabytes long and $B is 23K. Currently, and in any string representation that uses raw blocks, we have to do these things:
copy substr($A,27,3) to return value if needed
Allocate a new 36M block
copy substr($A,0,27)
copy $B
copy substr($A,30)
set $A's data pointer to the new block
free $A's old block
With a tree representation, the assign-to-middle operation becomes:
Return Value if needed is substr($A,27,3)
Create a new string-segment-list-node
Segment 1: substr($A,0,27)
Segment 2: $B (which might be another tree)
Segment 3: substr($A,30)
return $A's old top node to the node pool
set $A's data pointer to the new top node
set $B to copy-on-write mode, so future changes to $B do not affect $A
No new allocations! This kind of thing also allows us to do live interpolation in which ql this will $change might rewrite to a magic scalar that evaluates the join every time it is fetched instead of once when it is built. Mixed-type? Yes! You could even have a value that is not a string at all, hanging off your string tree.
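One possible layout for the node types described above (a sketch; the names and the exact set of node kinds are invented, and user-defined nodes are omitted):

    #include <stddef.h>

    typedef enum { ROPE_LEAF, ROPE_SUBSTR, ROPE_CONCAT } rope_kind;

    typedef struct rope {
        rope_kind kind;
        size_t    length;                 /* total characters under this node */
        union {
            struct { const char *data; }                 leaf;    /* flat block */
            struct { struct rope *base; size_t offset; } substr;  /* view into another rope */
            struct { struct rope *left, *right; }        concat;  /* binary node */
        } u;
    } rope;

    /* Length is O(1) for any shape; concatenation is O(1) too: just build a node. */
    static size_t rope_length(const rope *r) { return r->length; }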
RE: More character matching bits
We should let an external collator handle all these fancy features. People can always normalize/canonicalize/do-whatever-you-want and send the resulting text/binary to the regex. All the features we argue about here can be easily done by a customized collator. Do NOT expect the Perl regex engine to be a linguist that can understand every language in the world and be able to match my name in English and Chinese :-) (Of course, that would be a useful feature for me.) Please note that regex matching is O(n) at best; adding an external collator will make it O(2n). Putting fancy Unicode features into the regex engine will not make it any faster. My recommendation is to keep the regex engine locale independent, and have some API for handling locale-specific features, though I am not sure what the best way to do this is. Hong
RE: Should we care much about this Unicode-ish criticism?
However, I don't think this actually affects your comments, except that I'd guess that the half digits mentioned by Hong don't have the same term "case" used with them that the letters of various alphabets do. I am not sure we mean the same thing. The regular ASCII 0123456789 are called half-width digits in China, because they take about half the width of a Chinese character when displayed on screen or paper. There is another set of 012... in Chinese encodings denoting digits that look the same width as Chinese characters: full-width. The full-width characters are mainly used for formatting. This has nothing to do with lowercase/uppercase in Roman alphabets. I believe Unicode has many font characters. Is this Uppercase? Is this Lowercase? I believe Unicode already defines character categories, such as L, Lu, Ll, Lo. I prefer we just use the Unicode terms instead of extending ctype.h. The Perl 5 regex engine already supports them. Hong
RE: Unicode sorting...
I can't really believe that this would be a problem, but if they're integrated alphabets from different locales, will there be issues with sorting (if we're not planning to use the locale)? Are there instances where like characters were combined that will affect the sort orders? Yes, it is an issue. In the general case, you CANNOT sort strings of several locales/languages into a single order that would satisfy all of the locales/languages. One often-quoted example is German and Swedish/Finnish: the LATIN CAPITAL LETTER A WITH RING ABOVE comes between A and B in the former but after Z (not immediately, but that doesn't matter here) in the latter. Similarly for all the accented alphabetic characters, the rules for how they are sorted differ from one place to another, and many languages have special combinations like ch, ss, ij that require special attention. My understanding is there is NO general Unicode sorting, period. The most useful one must be locale-sensitive, as defined by the Unicode collation algorithm. In practice, the story is even worse. For example, how do you sort strings coming from different locales? Say I have an address book with names from all over the world. Which locale should I use to sort the names? Another example: Chinese has no definite sorting order, period. The commonly used schemes are phonetic-based or stroke-based, since many characters have more than one pronunciation (context sensitive) and more than one form (simplified and traditional). So if we have mixed content from China and Taiwan, it is impossible to sort in a way everyone will be happy with. Also, Chinese is space insensitive. In English, we have to use spaces to separate words. But in Chinese, there are no lexical words, only linguistic words. You can insert a space between any two Chinese characters without changing their meaning. I heard a rumor a long time ago that the Unicode consortium was working on a locale-independent collation, which could be used to sort mixed content. As for Perl, I would like to have several basic sortings: a) binary sorting b) locale-independent general sort c) locale-sensitive sort based on the Unicode collation algorithm. We could have more if possible. The general sort can be done by canonicalizing all strings, removing case info, removing diacritics, removing font/width distinctions, then using binary sort. Hong
RE: Should we care much about this Unicode-ish criticism?
What happens if unicode supported uppercase and lowercase numbers? [I had a dig about, and it doesn't seem to mention lowercase or uppercase digits. Are they just a typography distinction, and hence not enough to be worthy of codepoints?] Damned if I know; I didn't know there even was such a thing. Uppercase vs. lowercase for letters is more than a typographic distinction for many languages; there are words in English, for example, with a different meaning depending on whether they're capitalized (since capitalization indicates a proper noun). If there is some similar distinction of meaning for numbers in some language, I suppose that Unicode may add such a thing; to date, there doesn't appear to be any concept of uppercase or lowercase for anything but letters. There does exist half-width digits and full-width digits (widely used in chinese). They create similar problem. Hong
RE: Stacks, registers, and bytecode. (Oh, my!)
On Tue, Jun 05, 2001 at 11:25:09AM +0100, Dave Mitchell wrote: This is the bit that scares me about unifying perl ops and regex ops: can we really unify them without taking a performance hit? Coupl'a things: firstly, we can make Perl 6 ops as lightweight as we like. Second, Ruby uses a giant switch instead of function pointers for their op despatch loop; Matz says it doesn't make that much difference in terms of performance. Function-pointer dispatch is normally faster than or as fast as a switch. The main downside is the context. A typical regular expression engine can pre-fetch many variables into local register variables, where they can be used efficiently by all the switch cases. However, the common context for a regular expression is relatively small, so I am not sure of the performance hit. Hong
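For illustration, here are the two dispatch styles being compared, on a toy op set (all names invented; a real loop would of course carry the interpreter context rather than a single accumulator):

    /* Style 1: giant switch. */
    enum { OP_END, OP_INC, OP_DEC };

    static long run_switch(const int *pc, long acc) {
        for (;;) {
            switch (*pc++) {
            case OP_INC: acc++; break;
            case OP_DEC: acc--; break;
            case OP_END: return acc;
            }
        }
    }

    /* Style 2: function-pointer table indexed by opcode. */
    typedef struct { long acc; const int *pc; int running; } VM;
    typedef void (*op_fn)(VM *);
    static void do_inc(VM *vm) { vm->acc++; }
    static void do_dec(VM *vm) { vm->acc--; }
    static void do_end(VM *vm) { vm->running = 0; }
    static const op_fn table[] = { do_end, do_inc, do_dec };

    static long run_table(const int *pc, long acc) {
        VM vm = { acc, pc, 1 };
        while (vm.running)
            table[*vm.pc++](&vm);
        return vm.acc;
    }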
RE: Should we care much about this Unicode-ish criticism?
Courtesy of Slashdot, http://www.hastingsresearch.com/net/04-unicode-limitations.shtml I'm not sure if this is an issue for us or not, as we're generally language-neutral, and I don't see any technical issues with any of the UTF-* encodings having headroom problems. I think the author confused himself. Unicode itself is not sufficient to process human language, no matter how many characters it includes. It is just an encoding. Take Chinese as an example: only a small percentage (10%) of Chinese speakers can read more than 6000 characters. The biggest dictionary I know of includes about 65000 characters, and even linguists cannot agree on many of them. Some of the characters are essentially the research results of the dictionary's authors. It is impossible to include those characters in an international standard such as Unicode. Unicode contains surrogates for future growth. We still have about 1M code points left for allocation. Eventually it will include many more characters than anyone could care about. Hong
RE: Should we care much about this Unicode-ish criticism?
Firstly, the JIS standard defines, along with the ordering and enumeration of its characters, their glyph shape. Unicode, on the other hand, does not. This means that as far as Unicode is concerned, there is literally no distinction between two distinct shapes and hence no way to specify which should be used. This becomes particularly emotive when one is, for instance, attempting to represent a person's name - if they have a particular preferred variant character with which they write their name, there is no way to communicate that to the computer, and information is lost. This is a very common practice, nothing surprising. As you can tell, my name is "hong zhang", which has already lost the "Chinese tone" and the "glyph". "hong" has 4 tones, each tone can be any of several characters, and each character can be one of several glyphs (simplified and traditional). However, that does not really matter; it is still my name. The second objection is again related to character versus glyph issues: since Chinese, I think this problem =~ locale. For any Unicode character, you cannot properly tell its lower case or upper case without considering the locale. And Unicode does not encode locale. Finally, there is a historiographical issue; when computers are used to digitise and store historical literature containing archaic characters, specifying the exact variant character becomes an important consideration. I believe this should be handled by the application. This kind of work is needed for research. Perl should not care about it. Hong
RE: Stacks, registers, and bytecode. (Oh, my!)
There's no reason why you can't have a hybrid scheme. In fact I think it's a big win over a pure register-addressing scheme. Consider... The hybrid scheme may be a win in some cases, but I am not sure it is worth the complexity. I personally prefer strict RISC-style opcodes, mainly load, store, and ops for common operators (+, -, * etc), plus an escape opcode for complicated operators and functions. Consider the following code. $a = $x*$y+$z Suppose we have r5 and r6 available for scratch use, and that for some reason we wish to keep a pointer to $a in r1 at the end (perhaps we use $a again a couple of lines later): This might have the following bytecode with a pure register scheme:
GETSV('x',r5)   # get pointer to global $x, store in register 5
GETSV('y',r6)
MULT(r5,r5,r6)  # multiply the things pointed to by r5 and r6; store ptr to result in r5
GETSV('z',r6)
ADD(r5,r5,r6)
GETSV('a',r1)
SASSIGN(r1,r5)
Please note that most common operations will deal with locals, not globals. Since almost all locals will fit into the register set, the generated bytecode will be very small and very fast. Global access is doomed to be slower than locals, especially considering the synchronization overhead associated with threading. Hong
RE: Stacks, registers, and bytecode. (Oh, my!)
here is an idea. if we use a pure stack design but you can access the stack values with an index, then the index number can get large. so a fixed register set would allow us to limit the index to 8 bits. so the byte code could look something like this: 16 bit op (plenty of room for growth) 8 bit register index of arg 1 8 bit register index of arg 2 ... next op code ... literal data support is needed (read only) either each op code knows how many args it has, I like to do so, otherwise we will lose most of the performance gain. or we have an end marker (e.g 0xff which is never used as a register index). If we have to use variable arguments, I strongly recommend to add one argc byte immediately following the opcode. Linear scan bytecode will be very slow. the op code is stored in network endian order and the interpreter will always build a 16 bit int from the 2 bytes. The 16-bit op has both endian issue and alignment issue. Most of RISC machine can not access byte-aligned opcode, so we have to add a lot of padding. Anyway, it will be fatter and slower than 8-bit opcode. I prefer to using escape opcode. we have a simple set of load literal, push/pop (multiple) registers op codes. There should be no push/pop opcodes. They are simply register moves. each thread has its own register set. all registers point to PMC's passing lists to/from subs is via an array ref. the data list is on the stack and the array ref is in @_ or passed by return(). special registers ($_, @_, events, etc.) are indexed with a starting offset of 64, so general registers are 0-63. this can be mmapped in, executed with NO changes, fairly easily generated by the compiler front end, optimizable on or offline, (dis)assembler can be simply written, etc. simple to translate to TIL code by converting each op code to a call to the the op function itself and passing in the register indexes or even the PMC pointers themselves. Agreed. Hong
RE: Stacks registers
Register based. Untyped registers; I'm hoping that the vtable stuff can be sufficiently optimized that there'll be no major win in storing multiple copies of a PMC's data in different types knocking around. For those yet to be convinced by the benefits of registers over stacks, try grokking in fullness what op scratchpads are about. Ooh look, registers. I think stack based =~ register based. If we don't have Java-like jsr and ret, every bytecode inside one method always operates on the same stack depth, therefore we can just treat the locals + stack as a flat register file. A single pass can translate stack based code into register based code. For example: push local #3; = move #(max_local + opcode_stack_depth), #3 push local #3; push local #4; add; pop local #5; = add #5, #3, #4 push local #3; push local #4; call foo; pop #6; = call_2 #6, #3, #4 As long as stack based system is carefully designed, we can easily add linear-cost translation step to convert it into register based bytecode, and run it faster. Hong
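A minimal sketch of the linear-cost translation pass described above, keeping a "virtual stack" of register indices; the opcode names are invented, and the peephole step that folds a following pop into the add's destination is omitted:

    /* Stack-machine ops */
    enum { S_PUSH_LOCAL, S_POP_LOCAL, S_ADD };
    /* Register-machine ops */
    enum { R_MOVE, R_ADD };

    /* Translate a stack-machine word stream into a register-machine stream.
     * Local N is assumed to live in register N; stack slot D lives in
     * register max_local + D.  Returns the number of words emitted. */
    static int translate(const int *in, int in_len, int *out, int max_local) {
        int vstack[256], sp = 0, o = 0;      /* fixed depth for the sketch */
        for (int i = 0; i < in_len; ) {
            switch (in[i++]) {
            case S_PUSH_LOCAL:
                vstack[sp++] = in[i++];      /* no code emitted: already in a register */
                break;
            case S_ADD: {
                int b = vstack[--sp], a = vstack[--sp];
                int dst = max_local + sp;    /* register for this stack depth */
                out[o++] = R_ADD; out[o++] = dst; out[o++] = a; out[o++] = b;
                vstack[sp++] = dst;
                break;
            }
            case S_POP_LOCAL: {
                int src = vstack[--sp], dst = in[i++];
                out[o++] = R_MOVE; out[o++] = dst; out[o++] = src;
                break;
            }
            }
        }
        return o;
    }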
Re: Perl_foo() vs foo() etc
IIRC, ISO C says you cannot have /^_[A-Z_][A-Za-z_0-9]*$/. That's reserved for the standard. If you consider that our prefix is "_Perl_", not just "_", we will be pretty safe. Not many people follow that part of the standard anyway :-) Hong
Re: Unicode handling
I recommend to use 'u' flag, which indicates all operations are performed against unicode grapheme/glyph. By default re is performed on codepoint. U doesn't really signal "glyph" to me, but we are sort of limited in what we have left. We still need a zero-width assertion for glyph boundary within regexes themselves. The 'u' flag means "advanced unicode feature(s)", which includes "always matching against glyph/grapheme, not codepoint". What it really means is up to discussion. I think we probably still need "glyph" or "grapheme" boundary in some cases. We need the character equivalence construct, such as [[=a=]], which matches "a", "A ACUTE". Yeah, we really need a big list of these. PDD anyone? I don't think we need a big list here. The [[=a=]] is part of POSIX 1003.2 regex syntax, also [[.ch.]]. Perl 5 does not support these syntax. We can implement in Perl 6. For even advantage equivalence, we can offload the job to collation library. Hong
Re: Unicode handling
We need the character equivalence construct, such as [[=a=]], which matches "a", "A ACUTE". Yeah, we really need a big list of these. PDD anyone? But surely this is a locale issue, and not an encoding one? Not every language recognizes the same character equivalences. Let me clarify it. The "character equivalence", assuming [[~a~]] syntax, means matching a sequence of a single letter 'a' followed any number of combining characters. I believe we can handle this without considering locale. Whether it is still useful is up to discussion. At least it is trivial to implement. Hong
Re: PDD 4: Internal data types
The normalization has something to do with encoding. If you compare two strings with the same encoding, of course you don't have to care about it. Of course you do. Think about it. I said "you don't have to". You can use "==" for codepoint comparison, and something like "Normalizer.compare(a, b)" for lexical comparison, like Java. It may not be the best solution, but it is doable and acceptable. If I'm comparing "(Greek letter lower case alpha with tonos)" with "(Greek letter lower case alpha)(+tonos)" I want them to compare equal. One string is normalized, the other isn't; how they're encoded is irrelevant, you still have to care about normalization. (This is where Perl 5 currently falls over) Normalization has utterly nothing at all to do with encoding. Nothing. Please don't fight over wording. For most encodings I know of, the concept of normalization does not even exist. What is your definition of normalization? Now, since we have to normalize strings in some cases (like the comparison above) when the user hasn't explicitly asked for it, let's not make things like length() and substr() dependent on whether or not the string is normalized, eh? The *last* thing I want to happen is this:
$a = "(Greek letter lower case alpha with tonos)";
print length $a; # 1
if ($a eq "(Greek letter lower case alpha)(+tonos)") { # (Which it damned well ought to)
    print length $a; # 2! HA! Surprise! $a had to be normalized!
}
I fully understand this. This is one of the reasons I propose a single UTF-8 encoding. If length() and substr() depend on the string's internal encoding, are they still useful? Who can handle this magic length()? I still believe UTF-8 is the best choice. Random string access is just not important, at least to me. Let's not fight over string encoding. I would like to see some suggestions about how to handle normalization transparently. Making length()/substr() depend on encoding/normalization (whatever they are) does not make sense to me. Hong
Re: Idea for safe signal handling by a byte code interpreter
Here is some of my experience with the HotSpot for Linux port. I've read, in the glibc info manuals, that a similar situation exists in C programming -- you don't want to do a lot inside the signal handler; just set a flag and return, then check that flag from your main loop, and run a "bottom half". It is much more limited than what you read. Even sprintf() does not work well. sprintf() supports "%m", which means errno. errno is "#define errno *__errno_location()", which uses thread_self(). If you install a signal handler with an alternate signal stack, sprintf() will crash immediately, even if you use an empty format string. I've looked, a little, (and months ago at that) at the LibREP (ala "sawfish") virtual machine. It's a pretty good indirect threaded VM that uses techniques pioneered by Forth engines. It utilizes the GCC ability to take the address of a label to build a jump table indexed by opcode. Very efficient. It is not very portable, and I don't believe it will be any faster than a switch statement. What if, at the C level, you had a signal handler that sets or increments a flag or counter, stuffs a struct with information about the signal's context, then pushes (by "push", I mean "(cons v ls)", not "(append! ls v)" 'whatever ;-) that struct on a stack... I don't believe there is any way to push anything on the stack inside a signal handler without breaking the interpreter. Remember the signal context is not useful outside the signal handler. For synchronous signals, we can use a regular signal handler or the Win32 structured exception handler, for things like SIGSEGV etc. For asynchronous signals, we have to do some magic. If you don't need the signal context (most of the time), you can use a generic signal handler:
void perl_signal_handler(int sig) {
    /* this is thread safe, SMP safe, nested signal safe */
    atomic_increment(signal_table[sig]);
    /* set general flag for all async events */
    async_flag = 1;
}
If you really need the signal context, you have to use a dedicated thread:
void* thread_function(void* arg) {
    while (sigwaitinfo(sigset, siginfo)) {
        /* handle signal here */
    }
    /* something wrong */
}
Hong
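A fuller, self-contained sketch of both approaches, using standard POSIX calls; MAX_SIG and the function names are invented, sig_atomic_t counters stand in for the atomic_increment above, and the waited-for signal must be blocked in every thread (usually in main, before any threads are created) for the dedicated-thread approach to work:

    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>

    #define MAX_SIG 64   /* assumed upper bound; NSIG is not strictly portable */

    static volatile sig_atomic_t signal_count[MAX_SIG];
    static volatile sig_atomic_t async_flag;

    /* Generic handler: just count and set a flag, nothing else. */
    static void generic_handler(int sig) {
        if (sig >= 0 && sig < MAX_SIG)
            signal_count[sig]++;    /* no malloc, no stdio, nothing fancy */
        async_flag = 1;             /* main dispatch loop checks this flag */
    }

    /* Dedicated thread for signals whose siginfo we actually need. */
    static void *signal_thread(void *arg) {
        sigset_t set;
        siginfo_t info;
        (void)arg;
        sigemptyset(&set);
        sigaddset(&set, SIGUSR1);
        pthread_sigmask(SIG_BLOCK, &set, NULL);   /* must be blocked to wait on it */
        for (;;) {
            if (sigwaitinfo(&set, &info) > 0)
                /* handled synchronously, outside signal-handler context */
                printf("got signal %d from pid %ld\n",
                       info.si_signo, (long)info.si_pid);
        }
        return NULL;
    }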
Re: Idea for safe signal handling by a byte code interpreter
What if, at the C level, you had a signal handler that sets or increments a flag or counter, stuffs a struct with information about the signal's context, then pushes (by "push", I mean "(cons v ls)", not "(append! ls v)" 'whatever ;-) that struct on a stack... Hong> I don't believe there is any way to push anything on the stack inside Hong> signal handler without breaking the interpreter. Remember the signal Hong> context is not useful outside signal handler. I don't mean "the stack", but "a stack"; one created just for this purpose. "A stack" is still too easy to overflow, and it will be difficult to manage in a threaded environment; malloc() is not allowed inside a signal handler. A simple signal count will be much easier to deal with. I tried to give a concrete solution here. I have used this solution in the HotSpot Java virtual machine for Linux, and it works fine. Hong
Re: PDD 4: Internal data types
I was thinking maybe (length/4)*31-bit 2s complement to make portable overflow detection easier, but that would be only if there wasn't a good C library for this available to snag. I believe Python uses a (length/2)*15-bit 2's complement representation. Because bigint and bignum are complicated anyway, we should make them transparent and disallow direct field access. Some generic inline functions and/or macros must be used to access individual digits. The actual data structure should be defined by the implementation, and is irrelevant to the rest of the system. We can decide its size if we really need to. Hong
Re: PDD 4: Internal data types
For bigint, we definite need a highly portable implementation. People can do platform specific optimization on their own later. We should settle the generic implementation first, with proper encapsulation. Hong Do we need to settle on anything - can it vary by platform so that 64 bit platforms can use 64 bit, in which case the 32/31 choice could even be by platform (or always 32 if we find it works well)
Re: PDD 4: Internal data types
Unless I really, *really* misread the unicode standard (which is distinctly possible) normalization has nothing to do with encoding, I understand what you are trying to say. But it is not very easy in practice. The normalization has something to do with encoding. If you compare two strings with the same encoding, of course you don't have to care about it. But if you compare two strings with different encodings (what Perl 6 will do), you have to care about it. The 6-character "re`sume`" in Latin-1 encoding should equal the 8-character decomposed Unicode string. That is what people would expect. If the language does not handle it, some library will have to. and the encoding we choose doesn't make any difference to the character position, string length, or ord stuff if we define them to work on characters rather than bytes. Which doesn't mean it's not a problem, it's just a different problem. Anyway, that is the problem I tried to raise; a different problem is still a problem. I am not sure what definition of character you are using. The single codepoint "e`" can be expressed by two codepoints in Unicode. So ord("e`") will return a different value depending on the string's own encoding. The concepts of character position, string length, and ord() all depend on the encoding. If Perl 6 uses only one encoding, everything will be just fine. Otherwise, someone has to handle this problem. Perl users will have to face all kinds of problems when they try to deal with individual characters. Most won't, honestly. At a guess, 90% of perl's current userbase doesn't care about Unicode for any reason other than XML, I totally agree with you on this. That was not my point. What I tried to express is what Perl 6 should do for people who do care about it. I would like to see a solution, be it part of the language or some Unicode library. Hong
Re: PDD 4: Internal data types
struct perl_string { void *string_buffer; UV length; UV allocated; UV flags; } The low three bits of the flags field is reserved for the type of the string. The various types are: =over 4 =item BINARY (0) =item ASCII (1) =item EBCDIC (2) =item UTF_8 (3) =item UTF_32 (4) =item NATIVE_1 (5) through NATIVE_3 (7) Some thoughts about string encoding. Because Unicode normalization and canonical equivalent, some characters that take one codepoint in one encoding may take two or more codepoints in another encoding, mainly vowels with diacritics. In that sense, the substr() may give different results depending on its current encoding. Here is an example, "re`sume`" takes 6 characters in Latin-1, but could take 8 characters in Unicode. All Perl functions that directly deal with character position and length will be sensitive to encoding. I wonder how we should handle this case. Hong
Re: PDD 4: Internal data types
Here is an example, "re`sume`" takes 6 characters in Latin-1, but could take 8 characters in Unicode. All Perl functions that directly deal with character position and length will be sensitive to encoding. I wonder how we should handle this case. My first inclination is to force normalization on any data we manipulate. That was one of the reasons I proposed UTF-8 string encoding. If we don't do normalization (by keeping multiple encoding), we have to avoid using character position, string length, ord(), since they are encoding specific. Perl users will have to face all kinds of problem when they try to deal with individual characters. In any case, we need to make sure that regex not have any problems with normalization. Hong
Questions about PDD 4: Internal data types
Integer data types are generically referred to as CINTs. There is a CINT typedef that is guaranteed to hold any integer type. Does such a thing exist? Unless it is BIGINT. Should we scrap the buffer pointer and just tack the buffer on the end of the structure? Saves a level of indirection, but means if we need to make the buffer bigger we have to adjust anything pointing to it. It largely depends on whether these primitive types are mutable or immutable. Most languages chose immutable, such as Python or Smalltalk. I assume Perl will choose mutable semantics. Floating point data types are generically referred to as CNUMs. There is a CNUM typedef that is guaranteed to hold any floating point data type. Can you clarify this? The __float80 on x86 has very bad alignment, and not all compilers support it. =item BINARY (0) =item ASCII (1) =item EBCDIC (2) =item UTF_8 (3) =item UTF_32 (4) =item NATIVE_1 (5) through NATIVE_3 (7) Why not include UTF-16? Hong
Re: PDD 4: Internal data types
I was hoping to get us something that was guaranteed to hold an integer, no matter what it was, so you could do something like: struct thingie { UV type; INT my_int; } What is the purpose of doing this? The SV is guaranteed to hold anything. Why we need a type that can hold any integer, and a type that can hold any float. The struct/union solution does not provide much type safety. How can I tell which member is valid without external knowledge. I don't think we really need this type, using SV instead. Hong
Re: C Garbage collector
I don't quite understand what the intention is here. Most C garbage collectors are mark-and-sweep based. They have all the common problems of GC, for example non-deterministic finalization (destruction) and conservativeness. If we decide to use GC for Perl, it will be trivial to implement a simple mark-and-sweep collector or a semi-space copying collector. There is no advantage to using a C garbage collector. Hong
- Original Message from "NeonEdge", RE: C Garbage collector -
I agree with Damien that the Sun description sounds less portable, which we all know in the Perl world is crucial (80 ports) (although Sun mentions 16-bit DOS/Win). Any GC implementation needs to try to 'not break' the existing stuff. Other questions are somewhat dependent upon what language is used to implement (both GC descriptions are C or C++ dependent, which is OK by me, but I'm a masochist). I've been off the list since RFCs closed, so does anyone know if there have been any further thoughts on the implementation language? Grant M.
Re: string encoding
People in Japan/China/Korea have been using multi-byte encodings for a long time. I personally have used them for more than 10 years. I never felt much of the "pain". Do you think I am using my computer at O(n) while you are using yours at O(1)? There are 100 million people using variable-length encodings!!! Not at this level they aren't. The people actually writing the code do feel the pain, and you do pay a computational price. You can't *not* pay the price. substr($foo, 233253, 14) is going to cost significantly more with variable-sized characters than fixed-sized ones. I don't believe so. The problem is that you assume the character position at the very beginning. Where did you get the values 233253 and 14 in the first place? Here I will show an example of how to decode "Content-Length: 1000" into a name/value pair using a multi-byte encoding. The code is in C syntax.
char* str = "Content-Length: 1000";
int idx = indexof(str, ": "); /* sort of strstr() */
char* name = strndup(str, idx);
char* value = strdup(str + idx + strlen(": "));
This goes through C string functions plus XXXprintf(); most of them, if not all, are O(n). Take this example: in Chinese every character has the same width, so it is very easy to format paragraphs and lines. Most English web pages are rendered using "Times New Roman", which is a variable-width font. Do you think the English pages are rendered at O(n) while Chinese pages are rendered at O(1)? You need a better example, since that one's rather muddy. The example is not good. How about finding the cursor position when you click in the middle of a Word document? A fixed-width font will be faster than a variable-width one, right? As I said, there are many harder problems than UTF-8. If you want to support i18n and l10n, you have to live with it. No, we don't. We do *not* have to live with it at all. That UTF-8 is a variable-length representation is an implementation detail, and one we are not required to live with internally. If UTF-16 (which is also variable width, annoyingly) or UTF-32 (which doesn't officially exist as far as I can tell, but we can define by fiat) is better for us, then great. They're all just different ways of representing Unicode abstract characters. (I think--I'm only up to chapter 3 of the unicode 3.0 book) Besides, I think you're arguing a completely different point, and I think it's been missed generally. Where we're going to get bit hard, and I can't see a way around, is combining characters. My original argument is to use UTF-8 as the internal representation of strings. Given the complexity of i18n and l10n, most text processing jobs can be done as efficiently using UTF-8 as using UTF-32, unless you want to treat them as binary. Most text processing uses linear algorithms anyway. Hong
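For concreteness, this is what the O(n) cost of a by-character substr on UTF-8 looks like: a scan to the Nth character boundary, skipping continuation bytes (10xxxxxx). A sketch only; validation of the byte sequence is omitted:

    #include <stddef.h>

    /* Advance p by nchars UTF-8 characters, stopping at end. */
    static const char *utf8_skip_chars(const char *p, const char *end, size_t nchars) {
        while (nchars > 0 && p < end) {
            p++;                                   /* step past the lead byte    */
            while (p < end && (*p & 0xC0) == 0x80)
                p++;                               /* and any continuation bytes */
            nchars--;
        }
        return p;
    }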
Re: string encoding
What do you mean? Have you seen people using multi-byte encodings in Japan/China/Korea? You're talking to the wrong person. Japanese data handling is my graduate dissertation. :) The Unified Hangul/Kanji/Ha'nzi' Characters in Unicode (so-called "Unihan") occupy one and only one codepoint each. Legacy data sets (EUC and the like) can be processed internally by being converted to Unicode on entry to the core. Did it buy you much? I don't believe so. Can you give some examples of why random character access is so important? Most people process text linearly. I have been working with Java for many years. I found that Unicode is the best excuse people use for i18n and l10n. English-speaking developers, including me, want to keep their simple mental model of English text processing, so we don't have to do the real hard work. Hong
Re: string encoding
And address arithmetic and mem(cmp|cpy) is faster than array iteration. Ha Ha Ha. You must be kidding. mem(cmp|cpy) works just fine for UTF-8 string comparison and copy. But memcmp() cannot be used for UTF-32 string comparison, because of endianness issues. Hong
Re: string encoding
Did it buy you much? I don't believe so. Can you give some examples why random character access is so important? Most people are processing text linearly. Most, but not all. And as this is the internals list, we have to deal with all. We can't choose a convenient subset and ignore the rest. (No matter how much I might like to...) I believe that a larger subset of people will be more happy with UTF-8 than UTF-32. The UTF-32 is not panacea either. We have to make trade off. Unless we choose to use multi string encodings, I vote for UTF-8. I have been working with Java for many years. I found that Unicode is the best excuse people are using for i18n and l10n. English speaking developers, including me, want to keep their simple mind of english text process, so we don't have to the real hard work. Okay, this paragraph made no sense to me, but it feels like it's saying something that's important. Could you try again? Based on my previous experience with i18n and l10n, I believe UTF-32 will not help you much, if any. It just misleads people believe the Unicode processing is simple. Hong
Re: string encoding
I would like to wrap up my argument. I recommend using UTF-8 as the sole string encoding. If we end up with multiple encodings, there is absolutely no point to this argument. The benefits of UTF-8 are that it is more compact, requires less encoding conversion, and is more friendly to C APIs. UTF-16 is a variable-length encoding too, if you consider the surrogates. UTF-32 is way too big. The main disadvantage of UTF-8 is O(n) random access, which I personally believe is not very important, since most text processing requires a linear scan of the text. Multi-byte encodings have been widely used in Asian countries for years. It does not seem to be a significant problem. If Perl intends to have superior support for Unicode, i18n and l10n, the benefits of UTF-16 will fade away pretty quickly. Overall, both UTF-8 and UTF-16 are acceptable. But I believe UTF-8 is a slightly better choice. Hong
Re: Garbage collection (was Re: JWZ on s/Java/Perl/)
{ my $fh = IO::File->new("file"); print $fh "foo\n"; }
{ my $fh = IO::File->new("file"); print $fh "bar\n"; }
At present "file" will contain "foo\nbar\n". Without DF it could just as well be "bar\nfoo\n". Make no mistake, this is a major change to the semantics of perl. Alan Burlison This code should NEVER work, period. People are just asking for trouble with this kind of code. DF never really exists, even with reference counting. Can anyone show me how to deterministically collect circular references? The current semantics of perl works most of the time, but not always. What we really are talking about is "Shall Perl provide 90% or 99% of DF?" The operating system provides 0% during runtime, 100% at process exit. Hong
Re: Garbage collection (was Re: JWZ on s/Java/Perl/)
Hong Zhang wrote: This code should NEVER work, period. People are just asking for trouble with this kind of code. Actually I meant to have specified ">>" as the mode, i.e. append; then what I originally said holds true. This behaviour is predictable and dependable in the current perl implementation. Without the ">>" the file will contain just "bar\n". That was not what I meant. Your code already assumes the existence of reference counting. It does not work well with any other kind of garbage collection. If you translate the same code into C without putting in the close(), the code will not work at all. By the way, in order to use perl on real native-thread systems, we have to use atomic operations to increment/decrement the reference count. On most systems I have measured (PC and SPARC), any atomic operation takes about 0.1-0.3 microseconds, and it will be even worse on large SMP machines. The latest garbage collection algorithms (parallel and concurrent) can handle large amounts of memory pretty well. The cost will be less DF. Hong
string encoding
Hi, all, I want to give some of my thoughts about string encoding. Personally I like the UTF-8 encoding. The variable length can be handled by a special (virtual) function like:
class String { virtual UV iterate(/*inout*/ int* index); };
So a typical string iteration will look like:
for (i = 0; i < size;) { UV ch = s->iterate(&i); /* do what u want */ }
instead of:
for (i = 0; i < size; i++) { uint32 ch = s->charAt(i); /* be my guest */ }
The new style will be strange, but not very difficult to use. It also hides the internal representation. The UTF-32 suggestion is largely ignorant of internationalization. Many user-perceived characters are composed of more than one Unicode code point. If you consider Unicode normalization, canonical forms, Hangul conjoining, Indic clusters, combining characters, virama, collation, and locale, UTF-32 will not help you much, if at all. Hong
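One possible body for the iterate() idea above, written as a plain C function: decode the codepoint at *index and advance *index past it. A sketch only; error handling and the 4-byte UTF-8 form are omitted, and the function name is invented:

    #include <stdint.h>

    typedef uint32_t UV;

    static UV utf8_iterate(const unsigned char *s, int *index) {
        const unsigned char *p = s + *index;
        UV ch;
        int len;
        if (p[0] < 0x80)      { ch = p[0];        len = 1; }   /* ASCII          */
        else if (p[0] < 0xE0) { ch = p[0] & 0x1F; len = 2; }   /* 2-byte sequence */
        else                  { ch = p[0] & 0x0F; len = 3; }   /* 3-byte sequence */
        for (int i = 1; i < len; i++)
            ch = (ch << 6) | (p[i] & 0x3F);                    /* fold in continuation bytes */
        *index += len;
        return ch;
    }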
Re: string encoding
On Thu, Feb 15, 2001 at 02:31:03PM -0800, Hong Zhang wrote: Personally I like the UTF-8 encoding. The solution to the variable length can be handled by a special (virtual) function like I'm expecting that the virtual, internal representation will not be in a UTF but will simply be an array of codepoints. Manipulating UTF8 internally is horrible because it's a variable length encoding, so you need to keep track of where you are both in terms of characters and bytes. Yuck, yuck, yuck. I am not sure you have read through my email. The concept of a character has nothing to do with codepoints. Many characters are composed of more than one codepoint. The concept of character position is completely useless in many languages. Many languages just don't have the English-style "character"; see collation, Hangul conjoining, combining characters. There is just no easy way to keep track of character position. What you really meant was probably the codepoint position. The codepoint position is largely internal to the library. As long as the regular expression engine can efficiently handle UTF-8 (as it does now), most people will feel just fine with it. There are just not many people interested in the codepoint position, if they have ever heard of it. They care more about m// or s///. Even if you want to keep track of character offsets, it is still much easier than many other Unicode features I mentioned. Hong