RE: GC, exceptions, and stuff

2002-05-29 Thread Hong Zhang

 I've checked with some Sun folks. My understanding is that if you 
 don't do a list of what I'd consider obviously stupid things like:
 
 *) longjmp out of the middle of an interrupt handler
 *) longjmp across a system call boundary (user-system-user and the 
 inner jumps to the outer)
 *) Expect POSIX's dead-stupid mutexes to magically unlock
 *) Share jump destinations amongst threads
 *) Use the original Solaris thread implementation in general
 
 then you should be safe.

I think we have concluded that we only set flags inside signal
handlers, so we don't need sigsetjmp/siglongjmp at all.

 I think we'll be safe using longjmp as a C-level exception handler. 
 I'm right now trying to figure whether it's a good thing to do or 
 not. (I'd like to unify C and Parrot level exceptions if I can)

That is my point. Even if libc does not have a thread-safe longjmp, we
can easily make one ourselves using assembly code.

Hong



RE: GC, exceptions, and stuff

2002-05-29 Thread Hong Zhang

 Actually I'd been given dire warnings from some of the Solaris folks. 
 Don't use setjmp with threads!

 I've since gotten details, and it's "Don't use setjmp with threads 
 and do Stupid Things".

I used to be at Sun, and I knew those warnings too. If we use longjmp
carefully, we can make it work. In the worst case, we write our own version.

Hong



RE: GC, exceptions, and stuff

2002-05-29 Thread Hong Zhang

  I used to be at Sun, and I knew those warnings too. If we use longjmp
  carefully, we can make it work. In the worst case, we write our own version.
 
 ..Or we could use setcontext/getcontext, could we not?

The setcontext/getcontext will be much worse than setjmp/longjmp.
They are more platform-specific than longjmp, and they don't work
well inside signal handlers, just like longjmp.

When I was working on the HotSpot JVM, we had some problems with
getcontext. It worked 99.99% of the time; we added many workarounds
for the 0.01% of cases. I believe the Solaris guys have been improving
the code. I am not sure of the current status.

Hong



RE: GC, exceptions, and stuff

2002-05-29 Thread Hong Zhang

  When I was working on the HotSpot JVM, we had some problems with getcontext.
  It worked 99.99% of the time; we added many workarounds for the 0.01% of
  cases. I believe the Solaris guys have been improving the code. I am not
  sure of the current status.
 
 Was that inside of a signal handler or just in general usage?

It was inside signal handler.

Hong



RE: GC, exceptions, and stuff

2002-05-28 Thread Hong Zhang

 Okay, i've thought things over a bit. Here's what we're going to do 
 to deal with infant mortality, exceptions, and suchlike things.
 
 Important given: We can *not* use setjmp/longjmp. Period. Not an 
 option--not safe with threads. At this point, having considered the 
 alternatives, I wish it were otherwise but it's not. Too bad for us.

I think this statement is not quite accurate. The real problem is that
setjmp/longjmp do not work well inside signal handlers.

A thread-package-compatible setjmp/longjmp can easily be implemented
in assembly code; it does not require access to any private data
structures. Note that the Microsoft Windows Structured Exception
Handler works well with both threads and signals; the assembly code
generated for __try will show you how to do it.

However, a signal-compatible version will be very difficult. It
requires access to the ucontext, and most thread packages cannot
provide a 100% correct ucontext for signals. (The thread package may
have the right information, but the ucontext parameter may not.)

My basic suggestion is that if we need convenient and fast C-based
exception handling, we can write our own setjmp/longjmp in assembly
code. The functionality would be exported as magic macros, such as:

TRY {
  ...
} CATCH (EBADF) {
  ...
} CATCH (ENOMEM) {
  ...
} END;
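For illustration, here is a minimal single-threaded sketch of how such macros could be layered over the standard setjmp/longjmp rather than the custom assembly version proposed above. TRY, CATCH, THROW, END, and demo() are made-up names; a real implementation would keep the handler pointer in thread-local storage and rethrow unhandled codes to the enclosing handler.

```c
#include <setjmp.h>
#include <errno.h>
#include <string.h>
#include <assert.h>

/* Innermost active handler; thread-local in a real implementation. */
static jmp_buf *current_handler;

#define TRY do {                                \
        jmp_buf _env;                           \
        jmp_buf *_prev = current_handler;       \
        int _code = setjmp(_env);               \
        current_handler = &_env;                \
        if (_code == 0) {

#define CATCH(e)  } else if (_code == (e)) {

/* Restores the enclosing handler; unhandled codes are swallowed here,
 * where a real version would rethrow to _prev. */
#define END       } current_handler = _prev; } while (0)

#define THROW(e)  longjmp(*current_handler, (e))

static void risky(int err)
{
    if (err)
        THROW(err);
}

const char *demo(int err)
{
    /* volatile: the value must survive the longjmp back into setjmp */
    const char * volatile msg = "unreachable";
    TRY {
        risky(err);
        msg = "ok";
    } CATCH (EBADF) {
        msg = "bad descriptor";
    } CATCH (ENOMEM) {
        msg = "out of memory";
    } END;
    return msg;
}
```

One design note: because CATCH runs with the *enclosing* handler not yet restored, a THROW from inside a CATCH body in this sketch would loop back to the same frame; the custom assembly version would need to address that.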

Hong



RE: GC, exceptions, and stuff

2002-05-28 Thread Hong Zhang

 The thread-package-compatible setjmp/longjmp can be easily implemented
 using assembly code. It does not require access to any private data
 structures. Note that Microsoft Windows Structured Exception Handler
 works well under thread and signal. The assembly code of __try will
 show you how to do it.
 
 Yup, and we can use platform-specific exception handling mechanisms 
 as well, if there are any. Except...

Stack unwinding is very basic; that is why we have setjmp/longjmp.
Even though it is CPU-specific, it requires only a very small piece of
asm code, much less than JIT. BTW, JIT needs a similar kind of
functionality, otherwise it will not be able to handle exceptions very
fast: it would be very awkward to check every null pointer and every
function return.

 However, signal-compatible will be very difficult. It requries access
 to ucontext, and most of thread package can not provide 100% correct
 ucontext for signal. (The thread package may have the right info, but
 the ucontext parameter may not have the info.)
 
 You hit this. And we can't universally guarantee that it'll work, either.

Parrot has to handle signals such as SIGSEGV. I believe we have to
solve this problem whether or not we use setjmp/longjmp for general
exception handling. In general, most libc functions do not work well
inside a signal handler.

 My basic suggestion is if we need convenient and fast C-based exception
 handling, we can write our own setjmp/longjmp in assembly code. The
 functionality will be exported as magic macros. Such as
 
 If we're going to do this, and believe me I dearly want to, we're 
 going to be yanking ourselves out a bunch of levels. We'll be setting 
 the setjmp in runops.c just outside the interpreter loop, and yank 
 ourselves way the heck out. It's that multi-level cross-file jumping 
 that I really worry about.

The multi-level jump should not be a problem inside the Parrot code
itself; the GC discipline should have handled the problem already.

1) If the Parrot code allocates anything that cannot be handled by GC,
it must set up an exception handler to release it. See this sample:

  void *mem = NULL;
  TRY {
      mem = malloc(sizeof(foo));
  } FINALLY {
      free(mem);
  } END;

2) If the Parrot code allocates anything that is finalizable, there is
no need to release it: when the object is no longer referenced, the
next GC will finalize it. We can still use a TRY block to enforce
cleanup in a timely fashion.

However, we cannot use setjmp/longjmp (even a Parrot-specific version)
to unwind non-Parrot frames. If a third-party C application calls
Parrot_xxx, then Parrot_xxx should catch any exception, translate it
into an error code, and return that.
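A sketch of that boundary rule, under assumed names (Parrot_run and internal_op are hypothetical, and the jmp_buf would be per-interpreter or per-thread in real code): the exported entry point plants the outermost handler, so a longjmp-based exception never unwinds into the embedder's frames and instead comes back as a plain status code.

```c
#include <setjmp.h>
#include <assert.h>

/* Outermost handler for this (single-threaded) sketch. */
static jmp_buf boundary;

/* Internal code raises by jumping to the boundary with an error code. */
static void internal_op(int bad)
{
    if (bad)
        longjmp(boundary, bad);
}

/* Entry point callable from third-party C code: returns 0 on success,
 * otherwise the code of the exception caught at the boundary. */
int Parrot_run(int bad)
{
    int err = setjmp(boundary);
    if (err != 0)
        return err;          /* exception translated into an error code */
    internal_op(bad);
    return 0;
}
```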

Implementing a Parrot-specific version of setjmp/longjmp will be
trivial compared to the complexity of JIT and GC. By the time we have
solved JIT, GC, threading, and signal handling, the problems with
setjmp/longjmp will have been solved as well. But if we only want a
simple interpreter solution, there is no need to take on this
additional complexity.

Hong



RE: Unicode thoughts...

2002-03-25 Thread Hong Zhang


I think it will be relatively easy to deal with different compilers
and different operating systems. However, ICU does contain some C++
code, which will make life much harder, since current Parrot assumes
only ANSI C (in fact, a subset of it).

Hong

 This is rather concerning to me.  As I understand it, one of 
 the goals for 
 parrot was to be able to have a usable subset of it which is totally 
 platform-neutral (pure ANSI C).   If we start to depend too much on 
 another library which may not share that goal, we could have trouble 
 with the parrot build process (which was supposed to be 
 shipped as parrot bytecode)



RE: 64 bit Debian Linux/PowerPC OK but very noisy

2002-03-17 Thread Hong Zhang


It looks like you are running in a 32-bit environment but using a
64-bit INTVAL. INTVAL must be the same size as void* in order to cast
between them without warnings. Please try reconfiguring with a 32-bit
INTVAL, or running the process in 64-bit mode.
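That size constraint can be checked at build time. A sketch, with INTVAL typedef'd locally to intptr_t (the pointer-sized choice) purely for illustration:

```c
#include <stdint.h>
#include <assert.h>

typedef intptr_t INTVAL;   /* pointer-sized by definition */

/* C89-style compile-time check: the array size becomes -1 (a compile
 * error) if INTVAL and void* ever differ in width. */
typedef char intval_size_check[sizeof(INTVAL) == sizeof(void *) ? 1 : -1];

/* Round-tripping a pointer through INTVAL is lossless exactly when the
 * sizes match, which is what the warnings above are complaining about. */
void *roundtrip(void *p)
{
    INTVAL v = (INTVAL)p;
    return (void *)v;
}
```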

Hong

 -Original Message-
 From: Michael G Schwern [mailto:[EMAIL PROTECTED]]
 Sent: Saturday, March 16, 2002 2:54 PM
 To: Hong Zhang
 Cc: [EMAIL PROTECTED]
 Subject: Re: 64 bit Debian Linux/PowerPC OK but very noisy
 
 
 On Sat, Mar 16, 2002 at 02:36:45PM -0800, Hong Zhang wrote:
  
  Can you check what is the sizeof(INTVAL) and sizeof(void*)?
  Some warnings should not have happened.
 
 (Note: Not a C programmer)
 
 INTVAL?  I can't find where its defined.
 
 int main (void) {
 printf("int %d, long long %d, void %d\n", 
sizeof(int), sizeof(long long), sizeof(void*));
 }
 
 int 4, long long 8, void 4.
 
 From perl -V:
 
 intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=87654321
 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=8
 ivtype='long long', ivsize=8, nvtype='double', nvsize=8, 
 Off_t='off_t', lseeksize=8
 alignbytes=8, usemymalloc=n, prototype=define
 
 
 -- 
 
 Michael G. Schwern   [EMAIL PROTECTED]
 http://www.pobox.com/~schwern/
 Perl Quality Assurance[EMAIL PROTECTED] 
 Kwalitee Is Job One
 The key, my friend, is hash browns.
   http://www.goats.com/archive/980402.html
 



RE: 64 bit Debian Linux/PowerPC OK but very noisy

2002-03-16 Thread Hong Zhang


Can you check what is the sizeof(INTVAL) and sizeof(void*)?
Some warnings should not have happened.

Hong

 -Original Message-
 From: Michael G Schwern [mailto:[EMAIL PROTECTED]]
 Sent: Saturday, March 16, 2002 10:24 AM
 To: [EMAIL PROTECTED]
 Subject: 64 bit Debian Linux/PowerPC OK but very noisy
 
 
 This is parrot built using a 5.6.1 with 64 bit integers.  The tests
 pass ok, but there's a heap of warnings in the build.  Here's the
 complete make output.
 
 
 perl5.6.1 vtable_h.pl
 perl5.6.1 make_vtable_ops.pl  vtable.ops
 perl5.6.1 ops2c.pl C core.ops io.ops rx.ops vtable.ops
 include/parrot/oplib/core_ops.h
 perl5.6.1 ops2c.pl CPrederef 
 core.ops io.ops rx.ops vtable.ops
 include/parrot/oplib/core_ops_prederef.h
 cc 
 -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE 
 -D_FILE_OFFSET_BITS=64  -Wall -Wstrict-prototypes 
 -Wmissing-prototypes -Winline -Wshadow -Wpointer-arith 
 -Wcast-qual -Wcast-align -Wwrite-strings -Wconversion 
 -Waggregate-return -Winline -W -Wno-unused -Wsign-compare
 -I./include  -o test_main.o -c test_main.c
 cc -fno-strict-aliasing -I/usr/local/include 
 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64  -Wall 
 -Wstrict-prototypes -Wmissing-prototypes -Winline -Wshadow 
 -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings 
 -Wconversion -Waggregate-return -Winline -W -Wno-unused 
 -Wsign-compare-I./include  -o exceptions.o -c exceptions.c
 cc -fno-strict-aliasing -I/usr/local/include 
 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64  -Wall 
 -Wstrict-prototypes -Wmissing-prototypes -Winline -Wshadow 
 -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings 
 -Wconversion -Waggregate-return -Winline -W -Wno-unused 
 -Wsign-compare-I./include  -o global_setup.o -c global_setup.c
 global_setup.c: In function `init_world':
 global_setup.c:23: warning: passing arg 1 of 
 `Parrot_Array_class_init' with different width due to prototype
 global_setup.c:24: warning: passing arg 1 of 
 `Parrot_PerlUndef_class_init' with different width due to prototype
 global_setup.c:25: warning: passing arg 1 of 
 `Parrot_PerlInt_class_init' with different width due to prototype
 global_setup.c:26: warning: passing arg 1 of 
 `Parrot_PerlNum_class_init' with different width due to prototype
 global_setup.c:27: warning: passing arg 1 of 
 `Parrot_PerlString_class_init' with different width due to prototype
 global_setup.c:28: warning: passing arg 1 of 
 `Parrot_PerlArray_class_init' with different width due to prototype
 global_setup.c:29: warning: passing arg 1 of 
 `Parrot_PerlHash_class_init' with different width due to prototype
 global_setup.c:30: warning: passing arg 1 of 
 `Parrot_ParrotPointer_class_init' with different width due to 
 prototype
 global_setup.c:31: warning: passing arg 1 of 
 `Parrot_IntQueue_class_init' with different width due to prototype
 cc -fno-strict-aliasing -I/usr/local/include 
 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64  -Wall 
 -Wstrict-prototypes -Wmissing-prototypes -Winline -Wshadow 
 -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings 
 -Wconversion -Waggregate-return -Winline -W -Wno-unused 
 -Wsign-compare-I./include  -o interpreter.o -c interpreter.c
 interpreter.c: In function `make_interpreter':
 interpreter.c:481: warning: passing arg 1 of 
 `mem_sys_allocate' with different width due to prototype
 interpreter.c:501: warning: passing arg 2 of `pmc_new' with 
 different width due to prototype
 interpreter.c:577: warning: passing arg 3 of 
 `Parrot_string_make' with different width due to prototype
 interpreter.c:577: warning: passing arg 5 of 
 `Parrot_string_make' with different width due to prototype
 interpreter.c:579: warning: passing arg 3 of 
 `Parrot_string_make' with different width due to prototype
 interpreter.c:579: warning: passing arg 5 of 
 `Parrot_string_make' with different width due to prototype
 cc -fno-strict-aliasing -I/usr/local/include 
 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64  -Wall 
 -Wstrict-prototypes -Wmissing-prototypes -Winline -Wshadow 
 -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings 
 -Wconversion -Waggregate-return -Winline -W -Wno-unused 
 -Wsign-compare-I./include  -o parrot.o -c parrot.c
 cc -fno-strict-aliasing -I/usr/local/include 
 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64  -Wall 
 -Wstrict-prototypes -Wmissing-prototypes -Winline -Wshadow 
 -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings 
 -Wconversion -Waggregate-return -Winline -W -Wno-unused 
 -Wsign-compare-I./include  -o register.o -c register.c
 cc -fno-strict-aliasing -I/usr/local/include 
 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64  -Wall 
 -Wstrict-prototypes -Wmissing-prototypes -Winline -Wshadow 
 -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings 
 -Wconversion -Waggregate-return -Winline -W -Wno-unused 
 -Wsign-compare-I./include  -o core_ops.o -c core_ops.c
 core.ops: In function `Parrot_close_i':
 core.ops:93: warning: cast to pointer from integer of different size
 core.ops: In 

RE: Thread safety and interpreter safety

2002-03-16 Thread Hong Zhang

 1) NO STATIC VARIABLES! EVER!
 2) Don't hold on to pointers to memory across calls to routines that 
 might call the GC.
 3) Don't hold on to pointers to allocated PMCs that aren't accessible 
 from the root set

I don't think rules #2 and #3 can be achieved without systematic
effort; in practice, GC can happen at any time. When I worked on the
JVM, we used something called references, which are pretty much
Object**. The object pointer is almost always put on a per-thread
object pointer stack, and the C code always refers to the stack slot.
The GC scans the entire object pointer stack, which is considered part
of the root set.

A couple of macros would be very helpful:

#define ENTER \
    void** local_frame_start = current_thread->oop_stack

#define LEAVE \
    current_thread->oop_stack = local_frame_start

#define DEREF(ref) \
    (*(ref))

#define REF(o) \
    (*current_thread->oop_stack++ = (o), current_thread->oop_stack - 1)

For each object pointer type, there is a reference type.

typedef struct Object Object;
typedef Object* ObjectPtr;
typedef ObjectPtr* ObjectRef;

Only references should be used for function calls. Pointers should be
used only within function bodies, and should not be held across
function calls. This way, we don't have to worry about which functions
may cause GC.
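A self-contained sketch of that discipline (single-threaded for brevity: the oop stack is a global here rather than hanging off a current_thread structure; the ENTER/LEAVE/REF/DEREF names follow the post):

```c
#include <assert.h>

typedef struct Object { int field; } Object;
typedef Object **ObjectRef;

/* Per-thread object-pointer stack; the GC scans it as part of the
 * root set and may rewrite slots when it moves objects. */
#define OOP_STACK_SIZE 64
static Object *oop_stack[OOP_STACK_SIZE];
static Object **oop_top = oop_stack;

#define ENTER    Object **local_frame_start = oop_top
#define LEAVE    (oop_top = local_frame_start)
#define REF(o)   (*oop_top++ = (o), oop_top - 1)  /* push, return slot */
#define DEREF(r) (*(r))

/* C code holds only the ObjectRef; DEREF always reads the stack slot,
 * so it sees the current location even if a GC moved the object in
 * the meantime. */
int use_object(Object *obj)
{
    ENTER;
    ObjectRef r = REF(obj);
    int v = DEREF(r)->field;
    LEAVE;
    return v;
}
```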

Hong



RE: [PATCH] Stop win32 popping up dialogs on segfault

2002-02-08 Thread Hong Zhang

 The following patch adds a Parrot_nosegfault() function
 to win32.c; after it is called, a segmentation fault will print
 This process received a segmentation violation exception
 instead of popping up a dialog. I think it might be useful
 for tinderbox clients.

Please note, stdio is not signal/exception safe: you cannot use
printf(), or even sprintf(), inside a signal handler. On Unix, you
have to write something like:

write(2, msg, strlen(msg));

On win32, you have to write:
{
    DWORD dummy;
    WriteFile(GetStdHandle(STD_ERROR_HANDLE), msg, strlen(msg), &dummy, NULL);
}

The reason for this is that stdio uses a mutex to protect its internal
buffers. If the mutex is already held by someone, the printf will end
up deadlocking; in some cases it will simply crash. write() and
WriteFile() are system calls, atomic on almost all systems, so they
need no lock in user space. On win32, MSVCRT's _open() is not atomic,
so it should not be used inside a signal/exception handler either.
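On the POSIX side the pattern looks like this (a sketch; safe_report and on_segv are illustrative names, and a fully strict handler would avoid even strlen by using compile-time message lengths):

```c
#include <unistd.h>
#include <string.h>
#include <assert.h>

/* Emit a message using only the async-signal-safe write(2) system
 * call: no stdio, so no hidden mutex to deadlock on. */
ssize_t safe_report(int fd, const char *msg)
{
    return write(fd, msg, strlen(msg));
}

/* A handler in this style: report, then exit with the
 * async-signal-safe _exit(), never exit()/printf(). */
void on_segv(int sig)
{
    (void)sig;
    safe_report(2, "This process received a segmentation violation exception\n");
    _exit(1);
}
```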

By the way, the SIGINT and SIGQUIT handlers on win32 run in their own
thread, so the restrictions there are less severe.

Hong



RE: parrot rx engine

2002-02-04 Thread Hong Zhang

 Agh, if you go and do that, you must then be sure that rx is capable of
 optimizing /a/i and /[aA]/ in the same way.  What I mean is that Perl's
 current regex engine is able to use /abc/i as a constant in a string,
 while it cannot do the same for /[Aa][Bb][Cc]/.  Why?  Because in the
 first case, the string being matched against has been folded, so abc
 will or will not be in the string.  In the second case, the string has not
 been folded, so scanning for that constant string would 
 require either

Please don't use the current Perl as an example. I am proposing a new
algorithm for the Parrot regex engine; of course, the current Perl
regex engine will not benefit from it. For things like /AbC/i, the new
rx engine must be able to optimize it down to 

  rx_opcode_ascii_match_case_insensitive "abc"

If you change your example to include one-to-many and many-to-one
case-folding characters, the current simple and fast Perl scheme will
not work at all.

Hong



RE: I'm amazed - Is this true :)

2002-02-04 Thread Hong Zhang

 mops tests :
 
 on perl5,python I get - 2.38 M/ops
 ruby ~ 1.9 M/ops
 ps ~ 1.5 M/ops
 
 parrot - 20.8 M/s
 parrot jitted - 341 M/ops and it finish in half second ... for most of
 the other I have to wait more that a minute ..

Frankly speaking, this number is misleading. I know the Python and
Ruby interpreters: they count a + b as 3 ops (load a, load b, and add
the top two values of the stack). Since a and b can be any type, the
type check, coercion, and vtable dispatch overhead are necessary; it
is equivalent to adding two PMCs and producing a third PMC. A Parrot
op does not map directly to language constructs; it is more like the
Java virtual machine, where operand types are known. Sometimes a
compiler can compile code directly into Parrot opcodes, when the type
information is available, but most of the time we have to use generic
PMCs and vtables. The difference between Perl 5 opcodes and Perl 6
opcodes + vtables would be much smaller.
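The cost difference can be sketched in a few lines (PMC, VTable, and the function names here are simplified stand-ins, not Parrot's real structures): the generic path pays a type-dependent indirect call per operand, while the typed op is a bare add.

```c
#include <assert.h>

typedef struct PMC PMC;

typedef struct {
    long (*get_int)(const PMC *);   /* per-type accessor */
} VTable;

struct PMC {
    const VTable *vtable;
    long value;
};

static long int_get(const PMC *p) { return p->value; }
static const VTable int_vtable = { int_get };

/* Generic path: each operand is dispatched through its vtable,
 * because the operand types are only known at run time. */
long pmc_add(const PMC *a, const PMC *b)
{
    return a->vtable->get_int(a) + b->vtable->get_int(b);
}

/* Typed path: operand types known up front, as in JVM-style typed
 * opcodes; no dispatch at all. */
long typed_add(long a, long b) { return a + b; }
```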

Hong



RE: parrot rx engine

2002-01-31 Thread Hong Zhang

 But as you say, case folding is expensive. And with this approach you
 are going to case-fold every string that is matched against an rx
 that has some part of it that is case-insensitive.

That is correct in general, but the regex compiler can be smarter than
that. For example, rx should optimize /a+/i to /[aA]+/ to avoid
case-folding the input. If it is too difficult for rx to do that, I
think it is better to use some normalizer to do full case folding.

 The case-folding should be done in the rx itself, at compile time if
possible.
 Then it is only done once, which will save a lot of time if the rx happens
 to be used in a loop or something.

The regular expression itself is case-folded at compile time. But I am
talking about the input string here, not the re.

Hong



RE: How Powerful Is Parrot? (A Few More Questions)

2002-01-25 Thread Hong Zhang

 I believe the main difficulty comes from heading into uncharted waters.
 For example, once you've decided to make garbage collection optional,
 what does the following line of code mean?
 
  delete x;

If the above code is compiled to Parrot, it is probably equivalent to

  x->~Destructor();

i.e., the destructor is called, but the memory is left to the GC,
which will most likely handle the free at a later time.

 Or, for example, are the side effects of the following two functions
 different?
 
 void f1()
 {
  // On the stack
  MyClass o;
 }
 
 void f2()
 {
  // On the heap
  MyClass *o = new MyClass();
 }
 
 If garbage collection is not 100% deterministic, these two functions could
 produce very different results because we do not know when or if the
 destructor for MyClass will execute in the case of f2().

This is exactly the same case for C++. When you compile f2 with gcc,
how can you tell when the destructor is called? Even the following
code does not work:

  void f3()
  {
    MyClass *o = new MyClass();
    ...
    delete o;
  }

If an exception happens within (...), the destructor will not be
called.

 If garbage collection is
 not 100% deterministic (and Mark and Sweep is not), we need extra language
 features, such as Java's finally block, to ensure things can be cleaned
 up, and extra training to ensure programmers are smart enough to know how
 to use finally blocks correctly.

That is exactly the case for C++. In your code f1() above, the C++
compiler already (behind the scenes) inserts a finally block for o's
destructor; that is why the destructor of a stack-allocated object is
called even when an exception happens. The only difference is that the
memory deallocation is disassociated from the object destruction.

Summary: object destruction with GC is as deterministic as a C++
heap-allocated object, i.e. you have to call delete x (in C++),
x.close() (in Java), or x.Dispose() (in C#); otherwise it is 0%
deterministic, period.

Hong



RE: How Powerful Is Parrot? (A Few More Questions)

2002-01-25 Thread Hong Zhang

 This changes the way a programmer writes code. A C++ class 
 and function that uses the class looks like this:
 
 class A
 {
 public:
  A(){...grab some resources...}
  ~A(){...release the resources...}
 }
 
 void f()
 {
  A a;
  ... use a's resources ...
 }
 
 ...looks like this in Java...
 
 class A
 {
 public:
  A(){...grab some resources...}
 }
 
 void f()
 {
  try
  {
   A a;
   ... use a's resources ...
  }
  finally
  {
   ...release the resources...
  }
 }

This is exactly the right way to do things in Java. In Java, you can
open hundreds of files and never trigger any GC, since each file
object is very small. Unless you explicitly close files, you will be
dead very quickly.

The difference between C++ and Java is that C++ provides
stack-allocated objects, and the compiler does the dirty job of making
sure the dtors are called at the right time; in Java, you have to do
it yourself. In case you make a mistake, the finalizer will kick in,
but you should not rely on it. From the runtime point of view, the
above C++ and Java are almost the same, except for the memory
deallocation.

This is one of the reasons Java is so sloppy. Everyone relies on
language features to do their job, but it is impossible for the JVM to
know that there are several file objects among thousands of dead
objects which need to be finalized in order to free enough file
descriptors.

All you need to do is treat a Java object as a C++ heap object, period.

Hong



RE: on parrot strings

2002-01-21 Thread Hong Zhang

 But e` and e are different letters man. And re`sume` and resume are
 different words come to that. If the user wants something that'll
 match 'em both then the pattern should surely be:
 
/r[ee`]sum[ee`]/

I disagree. The difference between 'e' and 'e`' is similar to 'c' and
'C'. Unicode compatibility equivalence has a similar effect too, such
as half-width and full-width letters.

It may just be my personal preference, but I don't think it is a good
idea to push this problem onto the user of the regex.

Hong



RE: on parrot strings

2002-01-21 Thread Hong Zhang

 Yes, that's somewhat problematic.  Making up a byte CEF would be
 Wrong, though, because there is, by definition, no CCS to map, and
 we would be dangerously close to conflating in CES, too...
 ACR-CCS-CEF-CES.  Read the character model.  Understand the character
 model.  Embrace the character model.  Be the character model.  (And
 once you're it, read the relevant Unicode, XML, and Web standards.)
 
 To highlight the difference between opaque numbers and characters,
 the above should really be:
 
   if ($buf =~ /\x47\x49\x46\x38\x39\x61\x08\x02/) { ... }
 
 I think what needs to be done is that \xHH must not be encoded as
 literals (as it is now, 'A' and \x41 are identical (in ASCII)), but
 instead as regex nodes of their own, storing the code points.  Then
 the regex engine can try both the right/new way (the Unicode code
 point), and the wrong/legacy way (the native code point).

My suggestion would be to add a binary mode, such as //b. When binary
mode is in effect, only the ASCII characters (0 - 127) still carry
text properties; \p{IsLower} will match only ASCII a to z, and code
points 128 - 255 always have false text properties. All code points
must be between 0 and 255, which regcomp can easily check at
compilation.

A dedicated binary mode would simplify many issues, and the regexes
would be very readable. We can make binary mode exclusive with text
mode, i.e. a regex expression must be either binary or text, but not
both. (I am not sure whether a mixed mode is really useful.)

Hong



RE: on parrot strings

2002-01-21 Thread Hong Zhang

  But e` and e are different letters man. And re`sume` and resume are 
  different words come to that. If the user wants something that'll 
  match 'em both then the pattern should surely be: 
  
 /r[ee`]sum[ee`]/ 
 
 I disagree. The difference between 'e' and 'e`' is similar to 'c' 
 and 'C'. The Unicode compability equivalence has similar effect 
 too, such as half width letter and full width letter. 

German to English 
 schon = already 
 schön = nice 

2 totally different words. 

I am talking about similar words, whereas you are talking about
different words. I don't mind if someone can search across languages.
Some Chinese search engines can do Chinese searches using English
keywords (for people who have a Chinese viewer but no Chinese input
method). Of course, no one expects a regex engine to do that.

The re`sume` does appear in English sentences, and the half-width and
full-width letters are in the same language.

Hong



RE: on parrot strings

2002-01-18 Thread Hong Zhang

 (1) There are 5.125 bytes in Unicode, not four.
 (2) I think the above would suffer from the same problem as one common
 suggestion, two-level bitmaps (though I think the above would suffer
 less, being of finer granularity): the problem is that a lot of
 space is wasted, since the usage patterns of Unicode character
 classes tend to be rather scattered and irregular.  Yes, I see
 that you said: only the arrays that we actually used would be
 allocated to save space-- which reads to me: much complicated
 logic both in creation and access to make the data 
 structure *look*
 simple.  I'm a firm believer in getting the data structures right,
 after which the code to access them almost writes itself.
 
 I would suggest the inversion lists for the first try.  As long as
 character classes are not very dynamic once they have been created
 (and at least traditionally that has been the case), inversion lists
 should work reasonably well.

My proposal is that we should use a mixed method. The standard Unicode
classes, such as \p{IsLu}, can be handled by a standard splitbin
table; see Java's java.lang.Character or Python's unicodedata_db.h. I
did measurements on it: to handle all Unicode categories, simple
casing, and decimal digit values, I need about a 23KB table for
Unicode 3.1 (0x0 to 0x10FFFF), about 15KB for the BMP (0x0 to 0xFFFF).
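A toy splitbin (two-stage) lookup, shrunk to code points 0..255 and a single made-up property so the mechanism fits here: stage 1 maps the high bits of the code point to a block number, stage 2 stores the per-character data, and identical blocks are shared, which is where the space savings come from.

```c
#include <assert.h>

#define BLOCK_BITS 4                 /* 16 code points per block */
#define BLOCK_SIZE (1 << BLOCK_BITS)

/* Stage 2: the distinct property blocks (1 = ASCII lowercase letter). */
static const unsigned char blocks[][BLOCK_SIZE] = {
    { 0 },                                   /* shared all-false block */
    { 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 },     /* 0x60-0x6F: `, a-o */
    { 1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0 },     /* 0x70-0x7F: p-z, {|}~ */
};

/* Stage 1: block number for each 16-code-point range of 0x00-0xFF;
 * every range with no lowercase letters shares block 0. */
static const unsigned char index1[16] = { 0,0,0,0, 0,0,1,2, 0,0,0,0, 0,0,0,0 };

int is_ascii_lower(unsigned cp)
{
    if (cp > 0xFF)
        return 0;
    return blocks[index1[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)];
}
```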

For a simple character class, such as [\p{IsLu}\p{InGreek}], the regex
does not need to emit an optimized bitmap. Instead, it just generates
a union: the first part uses the standard Unicode category lookup, and
the second is a simple range.

If the user mandates a fast bitmap, and the character class is not
extremely complicated, we will probably need only a few KB for each
char class.

  As for character encodings, we're forcing everything to UTF-32 in
  regular expressions.  No exceptions.  If you use a string in a regex,
  it'll be transcoded.  I honestly can't think of a better way to
  guarantee efficient string indexing.

I don't think UTF-32 will save you much. The Unicode case map is
variable-length; combining characters, canonical equivalence, and many
other things require variable-length mappings. For example, if I only
want to parse /[0-9]+/, why would you convert everything to UTF-32?
Most of the time, regcomp() can find out whether a regexp will need
complicated preprocessing. Another example: if I want to search for
/resume/e (equivalence matching), the regex engine can normalize the
case, fully decompose the input string, strip off any combining
characters, and do an 8-bit Boyer-Moore search. I bet it will be
simpler and faster than using UTF-32. (BTW, equivalence matching here
means matching English spelling against French spelling, disregarding
diacritics.)

I think we should explore more choices and do some experiments.

Hong



RE: on parrot strings

2002-01-18 Thread Hong Zhang

  preprocessing. Another example, if I want to search for /resume/e,
  (equivalent matching), the regex engine can normalize the case, fully 
  decompose input string, strip off any combining character, and do 8-bit
 
 Hmmm.  The above sounds complicated not quite what I had in mind
 for equivalence matching: I would have just said both the pattern
 and the target need to normalized, as defined by Unicode.  Then 
 the comparison and searching reduce to the trivial cases of byte
 equivalence and searching (of which B-M is the most popular example).

You are right in some sense. But "normalized, as defined by Unicode"
may not be simple. I looked at the Unicode regex TR18; it does not
specify the equivalence of resume vs re`sume`, but a user may or may
not want this kind of normalization.
Hong



RE: on parrot strings

2002-01-18 Thread Hong Zhang

  My proposal is we should use mix method. The Unicode standard class,
  such as \p{IsLu}, can be handled by a standard splitbin table. Please
  see Java java.lang.Character or Python unicodedata_db.h. I did 
  measurement on it, to handle all unicode category, simple casing,
  and decimal digit value, I need about 23KB table for Unicode 3.1
  (0x0 to 0x10), about 15KB for (0x0 to 0x).
 
 Don't try to compete with inversion lists on the size: their size is
 measured in bytes.  For example Latin script, which consists of 22
 separate ranges sprinkled between U+0041 and U+FF5A, encodes into 44
 ints, or 176 bytes. Searching for membership in an inversion list is
 O(N log N) (binary search).  Encoding the whole range is a non-issue
 bordering on a joke: two ints, or 8 bytes.

When I said mixed method, I did intend to include binary search.
Binary search is a win for sparse character classes, but a bitmap is
better for large ones. Python uses a two-level bitmap for the first
64K characters.
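For reference, inversion-list membership really is just a binary search over the flip points; a sketch with the class [A-Za-z] (the real win, as the quoted text notes, is that even the whole Latin script is only 44 ints):

```c
#include <assert.h>

/* Sorted boundaries where membership flips: an even index opens a
 * range (inclusive), an odd index closes it (exclusive).
 * This list encodes [A-Z] plus [a-z]. */
static const unsigned inv[] = { 0x41, 0x5B, 0x61, 0x7B };
static const int inv_len = sizeof(inv) / sizeof(inv[0]);

int in_class(unsigned cp)
{
    int lo = 0, hi = inv_len;
    while (lo < hi) {                 /* first boundary greater than cp */
        int mid = (lo + hi) / 2;
        if (inv[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* In the class iff an odd number of boundaries lie at or below cp. */
    return lo & 1;
}
```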

Hong



RE: [PATCH] Keep comments in sync with the code...

2002-01-08 Thread Hong Zhang


By the way, we should not have global variable names like index in
the first place. All globals should look something like GIndex.

Hong

 -Original Message-
 From: Simon Glover [mailto:[EMAIL PROTECTED]]
 Sent: Tuesday, January 08, 2002 9:56 AM
 To: [EMAIL PROTECTED]
 Subject: [PATCH] Keep comments in sync with the code...
 
 
 
  We changed from index to idx in the code, but not in the comments.
 
  Simon
 
 --- key.c.old Tue Jan  8 08:00:00 2002
 +++ key.c Tue Jan  8 17:52:36 2002
 @@ -217,7 +217,7 @@
  
  /*=for api key key_element_type
  
 -return the type of element index of KEY key
 +return the type of element idx of KEY key
  
  =cut
  */
 @@ -240,7 +240,7 @@
  
  /*=for api key key_element_value_i
  
 -return the value of index index of KEY key
 +return the value of index idx of KEY key
  
  =cut
  */
 @@ -289,7 +289,7 @@
  
  /*=for api key key_set_element_value_i
  
 -Set the value of index index of key key to integer value
 +Set the value of index idx of key key to integer value
  
  =cut
  */
 @@ -312,7 +312,7 @@
  
  /*=for api key key_set_element_value_s
  
 -Set the value of index index of key key to string value
 +Set the value of index idx of key key to string value
  
  =cut
  */
 @@ -386,7 +386,7 @@
  
  /*=for api key key_inc
  
 -Increment the type of index index of key key
 +Increment the type of index idx of key key
  
  =cut
  */
  
 



RE: [PATCH] Re: Question about INTVAL vs. opcode_t sizes

2002-01-06 Thread Hong Zhang

 That's what I thought I remembered; in that case, here's a patch:
 
 Index: core.ops
 ===
 RCS file: /home/perlcvs/parrot/core.ops,v
 retrieving revision 1.68
 diff -u -r1.68 core.ops
 --- core.ops  4 Jan 2002 02:36:25 -   1.68
 +++ core.ops  5 Jan 2002 03:58:14 -
 @@ -463,8 +463,8 @@
  =cut
 
  op write(i|ic, i|ic) {
 -  INTVAL * i = &($2);
 -  write($1, i, sizeof(INTVAL));
 +  INTVAL i = (INTVAL)$2;
 +  write($1, &i, sizeof(INTVAL));
goto NEXT();
  }

I think the above code is wrong. It should be

I32 i = (I32) $2;
write($1, &i, 4);

I am not sure why you want to write all INTVAL bytes when only
the lower 32 bits are valid.

Hong



RE: 64-bit Solaris status

2002-01-03 Thread Hong Zhang


I am not sure why we need the U postfix in the first place. For a literal
like ~0xFFF, the compiler automatically sign-extends it to our
expected size. Personally, I prefer using ([u]intptr_t) ~0xFFF,
which is more portable, so we don't have to deal with U, UL, i64.
It is possible to use 32-bit address mode on 64-bit Alpha, where
the address is sign-extended, not zero-extended.
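The sign-extension claim is easy to check; a minimal sketch, assuming nothing beyond standard C:

```c
#include <assert.h>
#include <stdint.h>

/* ~0xFFF is a (signed) int with value -4096; converting it to a wider
 * signed type sign-extends, so the mask keeps its intended meaning on
 * 64-bit pointers without any U/UL/i64 suffix games. */
void check_mask(void) {
    intptr_t mask_from_int  = ~0xFFF;            /* int -4096, then widened */
    intptr_t mask_from_cast = ~(intptr_t)0xFFF;  /* computed at full width */
    assert(mask_from_int == mask_from_cast);

    /* Masking an address clears only the low 12 bits, as intended. */
    uintptr_t p = (uintptr_t)0x12345ABC;
    assert((p & (uintptr_t)mask_from_int) == (uintptr_t)0x12345000);
}
```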

Hong

 Passes on 64-bit Solaris.  (And 32-bit Linux.)  Probably more correct 
 regardless, as longs are almost always the same size as 
 pointers, whereas 
 ints aren't.
 
 --- ../parrot/Configure.pl  Wed Jan  2 22:53:29 2002
 +++ ./Configure.pl  Wed Jan  2 22:53:29 2002
 @@ -141,11 +141,11 @@
  debugging = $opt_debugging,
  rm_f  = 'rm -f',
  rm_rf = 'rm -rf',
 -stacklow  = '(~0xfff)U',
 -intlow= '(~0xfff)U',
 -numlow= '(~0xfff)U',
 -strlow= '(~0xfff)U',
 -pmclow= '(~0xfff)U',
 +stacklow  = '(~0xfff)UL',
 +intlow= '(~0xfff)UL',
 +numlow= '(~0xfff)UL',
 +strlow= '(~0xfff)UL',
 +pmclow= '(~0xfff)UL',
  make  = $Config{make},
  make_set_make = $Config{make_set_make},
 
 @@ -701,7 +701,7 @@
  my $vector = unpack("b*", pack("V", $_));
  my $offset = rindex($vector, 1)+1;
  my $mask = 2**$offset - 1;
 -push @returns, ("~0x".sprintf("%x", $mask)."U");
 +push @returns, ("~0x".sprintf("%x", $mask)."UL");
  }
 
  return @returns;
 
 
 
 -- 
 Bryan C. Warnock
 [EMAIL PROTECTED]
 



RE: 64-bit Solaris status

2002-01-03 Thread Hong Zhang


 Also, the UL[L] should probably be on the inside of the ():
 
 stacklow = '(~0xfffULL)',

I still don't see how this one is safer than my proposal.

~((uintptr_t) 0xfff);

Anyway, we should use some kind of macro for this purpose.

#ifndef foo
#define foo(a) ((uintptr_t) (a))
#endif

or 

#ifndef foo
#define foo(a) (a##ull)
#endif

so stacklow will read as

stacklow = ~foo(0xfff)

Hong



RE: [PATCH] Don't count on snprintf

2001-11-30 Thread Hong Zhang

 What we really need is our own s(n?)printf:
 
   Parrot_sprintf(target, "%I + %F - %I", foo, bar, baz);
   /* or some such nonsense */
 or even:
   target=Parrot_sprintf("%I + %F - %I"); /* like Perl's built-in */
 
 That way, it could even handle Parrot strings natively, perhaps with a
 %S code.
 
 By the way, Windows seems to have an _snprintf function with the same
 arguments.  The leading underscore is beyond me.  *shrugs*

It may be a good idea to have our own version of vsnprintf(). I know
the Windows version does not handle infinity and NaN well. The precision
of floating point may differ across platforms.

BTW, MSVCRT has several functions with a leading _, such as _isnan,
_finite, and _snprintf.

Hong



RE: sizeof(INTVAL), sizeof(void*), sizeof(opcode_t)

2001-11-20 Thread Hong Zhang

 On Tue, 20 Nov 2001, Ken Fox wrote:
  It sounds like you want portable byte code. Is that a goal?
 I do indeed want portable packfiles, and I thought that was more then a
 goal, I thought that was a requirement.  In an ideal world, I want a
 PVM to be intergrated in a webbrowser the same way a JVM is now.

I think we should separate the packfile from the runtime image file. If we
want the runtime to run a mmapped (pack)file, the file can not be portable.
We have to deal with endianness, alignment, floating point format,
etc.

 I think we can get the best of both worlds.  We, I think, should be able
 to get the bytecode format such that it is mmapable on platforms with the
 same endiannes and sizeof(INTVAL), and nonmmapable otherwise.

There is not much of a problem on the bytecode side. As we discussed before,
the bytecode is a stream of (aligned) 32-bit values. Most platforms can
handle 32-bit values efficiently. Other platforms can do a simple conversion.

I think what you really need to worry about is the file format, such as
the constant area, linkage table, etc. There is no need to make
sizeof(opcode) == sizeof(INTVAL), since the constant area can hold anything
you need. All you need to do is one more indirection.
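The one-more-indirection idea can be sketched like this; the opcode value and table layout are hypothetical, but the shape is the point: the stream stays flat 32-bit, and wide values live elsewhere:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: the opcode stream stays a flat int32_t array;
 * anything wider than 32 bits lives in a constant table and is
 * referenced by index -- one extra indirection, no alignment games. */
typedef struct {
    double  num_consts[4];   /* NV constants */
    int64_t int_consts[4];   /* 64-bit integer constants */
} ConstTable;

static double fetch_num(const int32_t *pc, const ConstTable *ct) {
    /* pc[0] = opcode, pc[1] = index into the NV constant table */
    return ct->num_consts[pc[1]];
}
```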

Hong



RE: Beginning of dynamic loading -- platform assistance needed

2001-11-02 Thread Hong Zhang

  Okay, here's the updated scheme.
  
  *) There is a platform/generic.c and platform/generic.h. (OK, it'll 
  probably really be unixy, but these days it's close enough) If there is
no 
  pltform-specific file, this is the one that gets copied to platform.c
and 
  platform.h
  
  *) If there *is* a platform specific file it may, and probably should 
  unless it plans on overriding everything, include generic.c and
generic.h.
  
  *) All entries in generic.c should be bracketed with #if 
  !defined(OVERRIDE_funcname) and any functions that the platform defines

  that override one in generic.c should have a corresponding #define 
  OVERRIDE_function in the platform-specific .h file
  
  Yeah, this is definitely a pain. If someone's got a better idea I'm all
ears...
 
 Sounds like less of a pain and more forward-looking than maintaining
 dozens of nearly-identical unixy platform files.
 
 Looks like a good plan to me.  Portability's a pain no matter how you
 slice it.  It's just a hard problem.  I don't think there's an easy
 solution.

I like this idea too. I think we need one generic.[ch] file for all
platforms. The unix.[ch], win32.[ch], macos.[ch] will cover most of our
needs. Each platform can define its own porting file.

Instead of defining zillions of OVERRIDE_funcname, I like to use the plain
name, such as

// platform.h
INLINE int ll_eq(int64_t a, int64_t b) {
    return memcmp(&a, &b, sizeof(a)) == 0;
}

#define ll_eq ll_eq

// generic.h
#ifndef ll_eq
#define ll_eq(a, b) ((a) == (b)) // assuming the compiler supports 64-bit int
#endif

The porting interface includes constants and functions. We should assume
the functions may be implemented as macros, so taking their address with &
is prohibited on the porting interface. (This is mainly for speed reasons.)
Portable structures are very unlikely, such as struct sockaddr_in and
struct timeval. Parrot may need to define its own structs.

Hong



RE: Building on Win32

2001-11-02 Thread Hong Zhang

 Also, note that Hong Zhang ([EMAIL PROTECTED]) has pointed out a 
 simplification (1 API call rather than 2)...

FYI. The GetSystemTimeAsFileTime() takes less than 10 assembly instructions.
It just reads the kernel time variable that maps into every address space.

 and given I think I've found a working Gnu Diff for Win32 I may be able 
 to submit a real patch (but it'll be the morning before I get sorted out).

I thought Cygwin includes GNU diff.

Hong



RE: Building on Win32

2001-11-01 Thread Hong Zhang

 void gettimeofday(struct timeval* pTv, void *pDummy)
 {
 SYSTEMTIME sysTime;
 FILETIME fileTime;/* 100ns == 1 */
 LARGE_INTEGER i;
 
 GetSystemTime(&sysTime);
 SystemTimeToFileTime(&sysTime, &fileTime);
 /* Documented as the way to get a 64 bit from a FILETIME. */
 memcpy(&i, &fileTime, sizeof(LARGE_INTEGER));
 
 pTv->tv_sec = i.QuadPart / 10000000; /*10e7*/
 pTv->tv_usec = (i.QuadPart / 10) % 1000000; /*10e6*/
 
 }

For speed reasons, you can use GetSystemTimeAsFileTime(), which is
very efficient. Win32 is a little-endian-only operating system.
You can use the following code.

void gettimeofday(struct timeval* pTv, void *pDummy)
{
    __int64 l;
    GetSystemTimeAsFileTime((LPFILETIME) &l);

    pTv->tv_sec = (long) (l / 10000000); /*10e7*/
    pTv->tv_usec = (unsigned long) ((l / 10) % 1000000); /*10e6*/
}

You missed the cast.

Hong



RE: moving integer constants to the constant table

2001-10-04 Thread Hong Zhang

 This patch moves integer constants to the constant table if the size
chosen
 for integers is not the same as the size chosen for opcodes.

It still leaves room for trouble. I suggestion we move everything that can
not be hold by int32_t out of opcode stream. The need for 64-bit constant
are rare. This way, we can generate portable bytecode.

Hong



RE: thread vs signal

2001-10-01 Thread Hong Zhang

 Now how do you go about performing an atomic operation in MT?  I
 understand the desire for reentrance via the exclusive use of local
 variables, but I'm not quite sure how you can enforce this when many
 operations are on shared data (manipulating elements of the
 interpreter / global variables).

There are two categories of global vars: ones used by the runtime and ones
used by the app.

For the former, the runtime will use the following schemes:
1) Reduce globals by using more per-thread variables (such as per-thread
profile info instead of per-interpreter info).
2) Use atomic variables. Incrementing a profile counter does not need a
lock, even though it may occasionally be off by one.
3) Use a mutex as needed.
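Scheme 2 can be sketched with C11 atomics. This is an assumption on my part — the mail predates C11, and the original intent may well have been a plain unsynchronized increment that tolerates lost ticks — but atomics make the tolerated race explicit:

```c
#include <assert.h>
#include <stdatomic.h>

/* A profile counter bumped with a relaxed atomic add needs no mutex.
 * Relaxed ordering is enough: we only want an eventually-accurate
 * count, not synchronization with other data. */
static atomic_long profile_counter;

void count_op(void) {
    atomic_fetch_add_explicit(&profile_counter, 1, memory_order_relaxed);
}
```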

 I definately agree that locking should be at a high level (let them core
 if they don't obey design it well).  I liked the perl5 idea that any
 scalar / array / hash could be a mutex.  Prevents you from having to
 carry around lots of extra mutex-values.  We can achieve the exact
 same synchronization policy of java or one that's finer 
 tuned for performance.

We can either let sv/av/hv carry a mutex, or let them be atomic, although it
is non-trivial to make them atomic. For languages like Smalltalk, it is
trivial to make the system atomic, since all complex data structures are
user defined.

Hong



RE: thread vs signal

2001-10-01 Thread Hong Zhang

 On Sun, Sep 30, 2001 at 10:45:46AM -0700, Hong Zhang wrote:
 Python uses global lock for multi-threading. It is reasonable for io
thread,
 which blocks most of time. It will completely useless for CPU intensive
 programs or large SMP machines.
 
 It might be useless in theory.  In practice it isn't, because most
 CPU-intensive tasks are pushed down into C code anyway, and C code can
 release the single interpreter lock while it's crunching away.

That does not mean Python is a high-performance MT language. It just hands
the problem to C. In that sense, every language can claim the same speed,
since we can just write everything in C and call it, and we are blazing
fast everywhere. .NET?

Hong



RE: thread vs signal

2001-09-30 Thread Hong Zhang

 How does python handle MT?

 Honestly? Really, really badly, at least from a performance point of view.
 There's a single global lock and anything that might affect shared state
 anywhere grabs it.

Python uses a global lock for multi-threading. It is reasonable for io
threads, which block most of the time. It is completely useless for CPU
intensive programs or large SMP machines.

If Perl needs to have full multi-threading, we should look to Java for
reference. Java has the best language/runtime support for MT. It can run
thousands of threads inside one VM on a big SMP machine.

However, Java has made many mistakes with threading. One of them is the
synchronization overhead. A normal Java program can issue one million locks
per second. JDK 1.0.0 spent 20-25% of its time in locking code when running
HotJava. The main problem came from the fact that the core library (Vector,
Hashtable, IO streams, awt etc) is fully synchronized, even though most
of the time you don't need it to be synced.

The same story may happen to Perl. If Perl makes all operations on SV, AV,
HV synchronized, the performance will be pathetic. Many SMP machines can
only perform about 10M sync operations per second, because a sync op
requires a system-wide bus lock or global memory transaction. This
situation will not change much in the future.

One way to reduce sync overhead is to make more operations atomic instead
of synchronized. For example, read() and write() are atomic, so there is
no need to sync the stream. Array get/put are atomic in Java, so we don't
need sync there either. The high level library or the app itself will be
responsible for the sync.

Hong




RE: NV Constants

2001-09-30 Thread Hong Zhang

 This was failing here until I made the following change:
 
 PackFile_Constant_unpack_number(struct PackFile_Constant * 
 self, char * packed, IV packed_size) {
 char * cursor;
 NV value;
 NV *   aligned = mem_sys_allocate(sizeof(IV));

Are you sure this is correct? Or is this before the fix?

Allocating an NV using sizeof(IV) is strange. I don't see the
need for an aligned temp variable. The following code
will do exactly what your code does (I believe). The memcpy() can
handle alignment nicely.

PackFile_Constant_unpack_number(struct PackFile_Constant * self, char *
packed, IV packed_size) {
    PackFile_Constant_clear(self);

    self->type   = PFC_NUMBER;
    memcpy(&(self->number), packed, sizeof(NV));

    return 1;
}

Hong



RE: NV Constants

2001-09-30 Thread Hong Zhang


  The memcpy() can handle alignment nicely.
 
 Not always. I tried. :(

How could that be possible? memcpy() just does a byte-by-byte
copy. It does not care about the alignment of the source
or dest. How can it fail?
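A minimal sketch of the point, assuming only standard C: copy a double out of a deliberately misaligned position in a byte buffer, as the packfile reader does.

```c
#include <assert.h>
#include <string.h>

/* memcpy does not care about alignment: it copies byte by byte into a
 * properly aligned local, so no alignment trap can occur. */
double read_unaligned_nv(const char *packed) {
    double value;
    memcpy(&value, packed, sizeof value);
    return value;
}
```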

Hong



thread vs signal

2001-09-28 Thread Hong Zhang


 In a word? Badly. :) Especially when threads were involved, though in some

 ways it was actually better since you were less likely to core perl.
 
 Threads and signals generally don't mix well, especially in any sort of 
 cross-platform way. Linux, for example, deals with signals in threaded 
 programs very differently than most other unices do. (Both ways make
sense, 
 they just aren't at all similar)

Though what you said is largely correct, there are ways to use signals
safely with threads.

Signals are divided into 2 categories, sync or async. The sync signals
include SIGSEGV, SIGBUS etc. They must be handled inside a signal handler.
As long as the crash does not happen inside a mutex/condvar block, it will
be safe to get out of the trouble using siglongjmp on most platforms. For
async signals, it is very risky to use siglongjmp(), since the jmpbuf may
not be correct. The alternative is to use the sigwait() family. See some
examples.

A) to handle sync signal

int sig;

foo() {
    sig = sigsetjmp(interpreter->jmpbuf, 1);

    if (sig == 0) {
        for (;;) { DO_OP(); }
    } else if (sig == SIGSEGV) {
        // do something
    } else if (sig == SIGBUS) {
        // do something
    }
}

void signal_handler(int sig) {
    siglongjmp(current_interpreter()->jmpbuf, sig);
}

The above code is safe on most platforms. But it should be used in a
controlled fashion, so we can correctly recover from the error. If it
does not work on some platform, we can use an alternative scheme.

foo() {
    while (interpreter->sig == 0) {
        DO_OP();
    }
    if (interpreter->sig == SIGSEGV) {
        ...
    }
}

void signal_handler(int sig) {
    current_interpreter()->sig = sig;
}

Since pthread_self() may not be available inside signal_handler(), we
need to design some global data structures to find the current interpreter.

B) wrong way to handle async signals (was used in Java)

mutex_lock();
if (sigsetjmp(interpreter->jmpbuf, 1) == 0) {
    cond_wait(...);
} else {
    // PANIC;
}

The above code will not work reliably on any platform. The siglongjmp
will not be able to restore the mutex correctly, even though only one
mutex is involved here.

C) correct way to handle async signals such as CTRL-C

void async_signal_handler_thread_function() {
    int sig;
    for (;;) {
        sigwait(&async_set, &sig);
        // handle signal
    }
}

We create one thread for all async signals, and let everyone else mask
async signals off. This scheme can handle signals reliably under threads.

Hong



RE: thread vs signal

2001-09-28 Thread Hong Zhang

 The fun part about async vs sync is there's no common decision on what's
an 
 async signal and what's a sync signal. :( SIGPIPE, for example, is one of 
 those. (Tru64, at least, treats it differently than Solaris)
 
 I generally divide signals into two groups:
 
   *) Messages from outside (i.e. SIGHUP)
   *) Indicators of Horrific Failure (i.e. SIGBUS)

I think another (*better*) way to put this is process-wide signals
vs thread-specific signals.
 
 Generally speaking, parrot should probably just up and die for the first 
 type, and turn the second into events.

Have you reversed the ordering??? How can you convert SIGBUS to events?

 AFAIK, almost none of the pthread functions are safe in signal handlers. 
 There might be one or two, but I can't remember which ones. (None of the 
 mutex or condition functions, alas, and they're rather useful)

Keep this for the record: sem_post() is the only async-signal-safe thread
function.

I don't think mutex and condvar are useful in this case.
 
 We create one thread for all async signal, and let everyone else mask
async
 signal off. This scheme can handle signal reliably under threads.
 
 This, unfortunately, isn't portable. It only works on platforms that fully

 implement the POSIX threading standard. Linux is the big example of a 
 platform that *doesn't*. Signals only get delivered to the thread that 
 triggered them, and if the thread's got the signal masked off it gets 
 dropped. :(

You did not get my idea. I was talking about async signals (messages from
outside, process-wide signals). There is no notion of the thread that
triggered them here; that applies to sync signals only. Linux does have
sigtimedwait() etc.

Masking off has different definitions. You can set the handler to SIG_IGN,
which drops the signal. Or you can use sigmask() to block it, and the
signal will be enqueued.

Hong



RE: SV: Parrot multithreading?

2001-09-28 Thread Hong Zhang


  This is fine at the target language level (e.g. perl6, python, jako,
  whatever), but how do we throw catchable exceptions up through six or
  eight levels of C code? AFAICS, this is more of why perl5 uses the
  JMP_BUF stuff - so that XS and functions like sv_setsv() can
  Perl_croak() without caring about who's above them in the call stack.
 
 This is my point exactly.

This is the wrong assumption. If you don't care about the call stack,
how can you expect [sig]longjmp to successfully unwind the stack?
The caller may have a malloc'd memory block, or have entered a mutex,
or acquired the file lock of the Perl cvs directory. You probably have
to call Dan or Simon for the last case.

 The alternative is that _every_ function simply return a status, which
 is fundamentally expensive (your real retval has to be an out
 parameter, to start with).

This is the only generally right solution. If you really, really, really
know everything between setjmp and longjmp, you can use it. However,
the chance is very low.

 To answer my own question (at least, with regards to Solaris), the
 attributes(5) man page says that 'Unsafe' is defined thus:
 
  An Unsafe library contains global and static data that is not
  protected.  It is not safe to use unless the application arranges for
  only one thread at time to execute within the library. Unsafe
  libraries may contain routines that are Safe;  however, most of the
  library's routines are unsafe to call.
 
 This would imply that in the worst case (at least for Solaris) we could
 just wrap calls to [sig]setjmp and [sig]longjmp in a mutex.  'croak'
 happens relatively infrequently anyway.

This is not the point. [sig]setjmp and [sig]longjmp are generally
safe outside a signal handler. Even if they were not safe, we could easily
write our own thread-safe version using a very small amount of assembly
code. The problem is they can not be used inside a signal handler under
MT, and it is (almost) impossible to write a thread-safe version there.

Hong



RE: SV: Parrot multithreading?

2001-09-28 Thread Hong Zhang

  This is the wrong assumption. If you don't care about the call stack, 
  how can you expect the [sig]longjmp can successfully unwind stack?
  The caller may have a malloc memory block, 
 
 Irrelevant with a GC.

Are you serious? Do you mean I can not use malloc in my C code?

  or have entered a mutex,
 
 If they're holding a mutex over a function call without a
 _really_ good reason, it's their own fault.

If you don't care about the caller, why should the caller care about you?
Why should callers need to present their reason for locking a
mutex? You ask too much.

  or acquire the file lock of Perl cvs directory. You
  probably have
  to call Dan or Simon for the last case.
  
   The alternative is that _every_ function simply return
  a status, which
   is fundamentally expensive (your real retval has to be
  an out
   parameter, to start with).
  
  This is the only right solution generally. If you really
  really really
  know everything between setjmp and longjmp, you can use
  it. However,
  the chance is very low.
 
 It is also slow, and speed is priority #1.

If so, just use C, which checks nothing.

 Signals are an event, and so don't need jumps. Under MT,
 it's not like there would be a lot of contention for
 PAR_jump_lock.

Show me how to convert SIGSEGV to an event. Please read the previous
messages. Some signals are events; some are not.

Hong



RE: Tru64 core dumps

2001-09-26 Thread Hong Zhang

  #  0xf000 for 64 bit systems.  With that changed
 
  Don't bother. Make the constant be ~0xfff. :)
 
 Umm, are you sure?  It's used in an integer context and masked against an 
 IV, so you might need an 'int', a 'long', or a 'long long'.  I'm unsure
 what type to portably assume for C preprocessor constants, but I suspect
 this might not do what you want if an IV is a 'long long'.  (However,
 given that it's operating against an IV that used to be a pointer of a
 possibly different size, everything might just work out fine.)

There should be no need. ~0xfff is a signed int, which will be
sign-extended by compilers as needed, unless you are using a buggy
compiler.

Hong



RE: Tru64 core dumps

2001-09-26 Thread Hong Zhang


You are using the wrong format flag. The expression in the second printf
is long long, so you should use %llx. Since printf uses varargs, the
behavior is undefined if there is a type mismatch with an argument.
Hong

 Hehehe.  Ok.  Guess what the following will print:
 
 #include <stdio.h>
 int main(void) {
 int x = 511;
 printf("x = %x\n", x);
 printf("x & ~0xff = %x\n", x & (long long) ~0xff);
 return 0;
 }
 
 
 -- 
 Andy Dougherty[EMAIL PROTECTED]
 Dept. of Physics
 Lafayette College, Easton PA 18042
 
 
 



RE: variable number of arguments

2001-09-24 Thread Hong Zhang

 is it possible the ops to handle variable number of arguments, what I have
 in mind :
 
 print I1,,,N2,\n

This should be done by a create-array opcode plus a print-array opcode.

[1, 2, 3, 4, 5]

The create-array opcode takes the top n stack entries (or n registers)
and creates an array out of them. Both opcodes are very popular and worth
their existence. I don't see further benefit in a single vararg
print opcode. The print is an expensive opcode anyway.

Hong



RE: [PATCH] assemble.pl registers go from 0-31

2001-09-24 Thread Hong Zhang

 Attached patch makes sure you don't try and use register numbers over
 31. That is, this patch allows registers I0-I31 and anything else gets
 a: Error (foo.pasm:0): Register 32 out of range (should be 
 0-31) in 'set_i_ic'
 
 Oh, there's also a comment at end of line patch that has snuck in 'cos
 it's so darn useful.

Just curious, do we need a dedicated zero register and sink register?
The zero register always reads zero and can not be written. The sink
register can not be read, and writes to it are ignored.
Hong



RE: [PATCH] assemble.pl registers go from 0-31

2001-09-24 Thread Hong Zhang


 Just curious, do we need a dedicated zero register and sink register?
 
 I've been pondering that one and waffling back and forth. At the moment I 
 don't think so, since there's no benefit to going with a zero register
over 
 a zero constant, but that could change tomorrow.

For example, once we have subcalls, we want to provide all arguments in
registers, instead of some args in regs, some in the constant pool, and
some in inline literals. At least, this is a reasonable approach.

The sink register can be used for in-place patching (for debugging,
profiling, or whatever) without rearranging the opcodes and offsets.
It is of little use. Just a thought.

Hong



RE: Parrot multithreading?

2001-09-20 Thread Hong Zhang


   DS I'm also seriously considering throwing *all* PerlIO code into
 separate 
   DS threads (one per file) as an aid to asynchrony.
 
 but that will be hard to support on systems without threads. i still
 have that internals async i/o idea floating in my numb skull. it is an
 api that would look async on all platforms and will use the kernel async
 file i/o if possible. it could be made thread specific easily as my idea
 was that the event system was also thread specific.
 
I think we should have some thread abstraction layer instead of throwing
PerlIO into threads. The thread abstraction layer can use either the
native thread package (blocking io), or implement a user-level thread
package with either non-blocking io or async io. The internal io should
be sync instead of async. Async is normally slower than sync (most unixes
don't have real async io), and threads are cheap.

Hong



RE: Parrot multithreading?

2001-09-20 Thread Hong Zhang


 Nope. Internal I/O, at least as the interpreter will see it is async. You 
 can build sync from async, it's a big pain to build async from sync. 
 Doesn't mean we actually get asynchrony, just that we can.
 
It is trivial to build async from sync, just using threads. Most Unix
async io is built this way, using either user-level threads or
kernel-level threads. Win32 has a really async io implementation, but it
does not interact well with sync io.

 Just because some systems have a really pathetic I/O system doesn't mean
 we should penalize those that don't...
 
Implementing sync on top of async is also slower. I bet most people will
use sync io, not async. There is no need to build async io from sync;
async can be provided by a separate module.

It is not about some systems, it is about most systems. Very few systems
have a high performance async io implementation, and the semantics are
not very portable.

I am not sure the interpreter has to choose one over the other. The
interpreter could support both interfaces, and use them as needed.
Hong



RE: Check NV alignment for Solaris

2001-09-19 Thread Hong Zhang

 One of the things that might be coring solaris is the potential for 
 embedded floats in the bytecode stream. (The more I think about that the 
 more I regret it...) The ops do a quick and ugly cast to treat some of the

 opcode stream as an NV which may trip across alignment rules and size 
 issues. (I assume NVs are twice the size of ops, but that could be
incorrect)

I am strongly against embedding any constants (other than 32-bit literals)
into the opcode stream. Float formats are very platform dependent. We
should use the constant pool for them. Float literals are 64 bits wide;
there is no way to align them correctly. Once we have floats embedded in
the opcode stream, it will be very difficult to patch them.

There is really no obvious benefit to doing so. We should just use the
constant pool, and leave the opcode stream as a signed 32-bit integer
stream. A 32-bit value can be represented using different formats in
memory or in a file -- endianness and size.
Hong



RE: Bytecode safety

2001-09-18 Thread Hong Zhang

 Proposed: Parrot should never crash due to malformed bytecode.  When
 choosing between execution speed and bytecode safety, safety should
 always win.  Careful op design and possibly a validation pass before
 execution will hopefully keep the speed penalty to a minimum.

We can use a similar model to Java bytecode. Because of poor design, Java
bytecode requires an exponential algorithm to verify, mainly caused by
weak typing of local variables (where all other parts of the Java runtime
are strongly typed), and the notorious jsr/ret bytecodes. We should avoid
the same kind of mistakes. Bytecode verification should be about
O(n * ln(n)).

Hong



RE: [PATCH] changing IV to opcode_t!!

2001-09-18 Thread Hong Zhang


Do we want the opcode to be so complicated? I thought we were
going to use this kind of thing for generic pointers. The p
member of opcode_t does not make any sense to me.

Hong

 Earlier there was some discussion about changing typedef long IV
 to
 typedef union {
   IV i;
   void* p;
 } opcode_t;



RE: Bytecode file format

2001-09-14 Thread Hong Zhang


 Offset    Length  Description
 0         1       Magic Cookie (0x013155a1)
 1         n       Data
 n+1       m       Directory Table
 m+n+1     1       Offset of beginning of directory table (i.e. n+1)

I think we need a version number right after the cookie for long term
compatibility.

 The directory is after the data so offsets can be determined as the data
 is written.  The directory offset is at the very end, so it can be
 determined before the directory is written, and easily found by loaders.

Having the directory at the end may not be a good choice. It requires
loading everything into memory before parsing. If the directory is at the
front, we can do stream parsing.

Hong



RE: RFC: Bytecode file format

2001-09-14 Thread Hong Zhang

 8-byte word:endianness (magic value 0x123456789abcdef0)
 byte:   word size
 byte[7]:empty
 word:   major version
 word:   minor version
 
 Where all word values are as big as the word size says they are.
 
 The magic value can be something else, but it should byteswap such that if

 you read it in you can tell whether it was a big-endian write or a 
 little-endian write.

Since the magic value can tell the endianness, there is really no need
for the endianness field.
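A sketch of detecting the writer's endianness from the magic alone, using the magic value from the quoted proposal (the asymmetry under byteswap is what makes this work):

```c
#include <assert.h>
#include <stdint.h>

#define MAGIC UINT64_C(0x123456789abcdef0)

static uint64_t swap64(uint64_t v) {
    /* Reverse byte order: swap bytes, then 16-bit, then 32-bit halves. */
    v = ((v & UINT64_C(0x00ff00ff00ff00ff)) << 8)  | ((v >> 8)  & UINT64_C(0x00ff00ff00ff00ff));
    v = ((v & UINT64_C(0x0000ffff0000ffff)) << 16) | ((v >> 16) & UINT64_C(0x0000ffff0000ffff));
    return (v << 32) | (v >> 32);
}

/* 1 = file matches host order, -1 = byteswapped file, 0 = not a bytecode file. */
static int classify_magic(uint64_t first_word) {
    if (first_word == MAGIC)         return 1;
    if (swap64(first_word) == MAGIC) return -1;
    return 0;
}
```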

Personally I don't like the word-size concept. I prefer we use a fixed
4-byte word. If we support multiple word sizes, each runtime has to
deal with several bytecode data formats: 2, 4, 6, and 8-byte words. I
believe the 4-byte word will cover 99+% of all practical use. We should
let the minority convert, instead of asking every runtime to handle
everything.
Hong



RE: RFC: Bytecode file format

2001-09-14 Thread Hong Zhang

 We can't do that. There are platforms on both ends that
 have _no_ native 32-bit data formats (Crays, some 16-bit
 CPUs?). They still need to be able to load and generate
 bytecode without ridiculuous CPU penalties (your Palm III
 is not running on a 700MHz Pentium III, after all!)

If the platform can not deal with 32-bit values, the runtime
can convert them to its own in-memory format. Almost all
platforms can deal with 32-bit values from a file or database.

All this is based on the assumption of a portable bytecode
file. If the file is just a snapshot of the runtime image, there
is no need to discuss much here, since each runtime can
just choose its own format without worrying about interchange.

Hong



RE: RFC: Bytecode file format

2001-09-14 Thread Hong Zhang

 There's a one-off conversion penalty at bytecode load time, and I don't 
 consider that excessive. I want the bytecode to potentially be in platform

 native format (4/8 byte ints, big or little endian) with a simple and 
 well-defined set of conversion semantics. That way the bytecode loader can

 manage it quickly, and the external conversion tool (To change between 
 types) can deal with it simply as well.

If you want a native format, you have to implement a runtime-specific
image file format, such as a Smalltalk image.

It will be hard to use one format for both native and portable bytecode.

Hong



RE: Using int32_t instead of IV for code

2001-09-13 Thread Hong Zhang

 If we are going to keep on doing fancy stuff with pointer arithmetic (eg
 the Alloc_Aligned/CHUNK_BASE stuff), I think we're also going to need an
 integer type which is guaranteed to be the same width as a pointer, so
 we can freely typecast between the two.

You are not supposed to do fancy stuff with the code stream.
 
 Also, if we've got a system with 64 bit IVs, are the arguments to Parrot
 opcodes going to be 32 or 64 bit? If 32 bit, is there going to be any
 way of loading a 64 bit constant?

The arguments are always 32-bit. For larger constants, such as 64-bit ints,
numbers, and bigint/bigfloat/string, you must use the constant pool. There is
not much benefit to embedding 64-bit values into the code stream, since they
are rarely used and bloat up everything else.

Hong



RE: Using int32_t instead of IV for code

2001-09-13 Thread Hong Zhang


 I'd have thought it made sense to define it as a bytecode_t type, or
 some such which could be platform specific.

It is better called opcode_t, since we are not using bytecode anyway.

Hong



RE: Parrot coredumps on Solaris 8

2001-09-12 Thread Hong Zhang

 Now works on Solaris and i386, but segfaults at the GRAB_IV call in
 read_constants_table on my Alpha. Problems with the integer-pointer
 conversions in memory.c? (line 29 is giving me a warning).

Line 29 is extremely wrong. It assigns an IV to a void* without casting.
The alignment calculation is very wrong too. Using the classic alignment
idiom, it should read:

mem = (void*) (((IV)mem + mask) & ~mask);

Hong



Using int32_t instead of IV for code

2001-09-12 Thread Hong Zhang


I think we should use int32_t instead of IV for all code-related
data. The IV is 64-bit on a 64-bit machine, which is a significant
waste. The IV is also platform specific, and has caused some
nasty problems so far.

Hong



RE: Math functions? (Particularly transcendental ones)

2001-09-10 Thread Hong Zhang

 Uri Guttman  
  we are planning automatic over/underflow to bigfloat. so there is no
  need for traps. they could be provided at the time of the 
  conversion to big*.
 
 OK. But will Perl support signaling and non-signaling NANs?

I don't think we should go for automatic overflow/underflow between
float and bigfloat. The float exception (overflow, underflow, inexact,
divide zero, ...) is very difficult to handle. Using Unix signal is 
expensive and very platform-specific (lots of ucontext issues). Since
C language does not support floating-point signal, we may use some
assembly code to handle it, it will be porting nightmare.

Since most of floating-point assumes IEEE-semantics, taking automatic
float/bigfloat will change this assumption significantly. It may a
lot of code and algorithm. I think it is safer just to provide a
BigDecimal class for developers to use, and keep the basic float
semantics (close to 64-bit IEEE-754 if possible).

Hong



RE: An overview of the Parrot interpreter

2001-09-05 Thread Hong Zhang


 True, but it is easier to generate FAST code for a register machine.
 A stack machine forces a lot of book-keeping: either run-time inc/dec of sp,
 or alternatively compile-time what-is-offset-now stuff. The latter is a real
 pain if you are trying to issue multiple instructions at once.

I think we need to get some initial performance characteristics of a register
machine vs a stack machine before we go too far. There is not much point left
debating on the mailing list.

I believe we have some misunderstanding here. The inc/dec of sp costs
nothing: sp is almost always a register variable, and the cost of arithmetic
on it is most likely hidden by the dispatch loop. The main cost is useless
memory copies like: push local #3. The register machine can only avoid copies
between local variables and the expression stack. If a sub uses a lot of
globals and fields, the register machine has to load/store them (into/out of
the register file), which is exactly the same as push/pop on a stack.

I think the performance gain of a register machine comes from several areas:

1) avoiding copies between locals and the stack; but it cannot speed up
  global/field access.

2) complex ops reduce dispatch overhead:
    add i1, i2, i3;
  vs
    push local 1;
    push local 2;
    add
    pop local 3
  This is likely the biggest gain.

3) specialized registers (32 ints, 32 floats, 32 strings) simplify GC and
  speed up common opcodes. In order to achieve this, we must enable some type
  system. I remember Perl 6 is going to have unified int/bigint, num/bignum,
  and multi-encoding/charset strings. I wonder how the specialized registers
  can handle this feature, since there may be overflow/underflow problems.

Hong



RE: An overview of the Parrot interpreter

2001-09-05 Thread Hong Zhang


 If you really want a comparison, here's one. Take this loop:
 
  i = 0;
  while (i < 1000) {
i = i + 7;
  }
 
 with the ops executed in the loop marked with pipes. The corresponding 
 parrot code would be:
 
 getaddr P0, i
 store   P0, 0
 store   I0, 1000
 foo: | branchgt end, P0, I0
   | add P0, P0, 7
   | jump foo

I think Dan gave a straightforward translation, since it does not really
use the int registers. The optimized code will be faster.

store i1, 0;
store i2, 1000;
jump L2;
L1:
add i1, 7 => i1;
L2:
branchlt i1, i2 => L1;
getaddr i => P0;
store i1 => P0;

However, I'd like to point out that one hidden overhead of register opcodes
is decoding the parameters. The add instruction of a stack machine has no
args, but on a register machine it has 3 arguments.

Hong



RE: Final draft: Conventions and Guidelines for Perl Source Code

2001-08-13 Thread Hong Zhang


I believe the advantage of

 if (...) {
   ...
 } else {
   ...
 }

is to write very dense code, especially when the block itself is a single
line. This style may not be readable to some people.

This style is not very consistent:
 if (...) {
   ...
 }
 else
 {
   ...
 }

I believe it would be better as

/* comment */
if (...) {
  ...
}
/* comment */
else {
  ...
}

This style is not as dense as the previous one, and leaves
room for comments.

if (...)
{
  ...
}
else
{
  ...
}

The last style is very sparse, and very readable. It just wastes
too much screen and paper (if you want to print).

BTW, I am not sure if it has been mentioned already. We should enforce
{} even for single-line blocks. Since we use plenty of macros that
may expand to multiple lines, it is much safer and more consistent
to always use {}.

Hong



RE: Draft assembly PDD

2001-08-06 Thread Hong Zhang


 The branch instruction is wrong. It should be branch #num.
 The offset should be part of instruction, not from register.
 
 Nope, because that kills the potential for computed relative 
 branches. (It's in there on purpose) Branches should work from 
 both constants and registers.

Even so, branch #num should have better performance, and
it is part of any machine language. Since we already have a jump
instruction, do we really need branch %r, which can be
simulated by add %r, %pc, #num; jump %r?

 The register set seems too big. It reduces cache efficiency
 and uses too much stack.
 
 Yeah, that's something I'm worried about. 64 may be too much. 
 16 is too few, so we might split the difference and go with 32
 to start.

If we define caller-save and callee-save sets, the 64 registers may
not be bad, as long as the caller-save set is small.

If we don't define caller/callee save, we can still use 64
registers. However, we need to add one tag bit to each function/
stack frame to indicate whether it is a big frame or a small frame.
A big frame uses 64, a small one uses 16. The register set is still
64, but a small frame does not use anything beyond 16, so
we don't have to save/restore the rest.

It is not just about performance; stack size and cache
locality are also big issues.

Hong



RE: The internal string API

2001-06-20 Thread Hong Zhang


 The one problem with copy-on-write is that, if we implement it in
software, 
 we end up paying the price to check it on every string write. (No free 
 depending on the hardware, alas)
 
 Not that this should shoot down the idea of COW strings, but it is a cost 
 that needs considering. (I suppose we could have a COW subtype of the
basic 
 scalar and string scalar)

Even with a software implementation, it can come almost free. In this case,
I would use two sizes for each string, readSize and writeSize. The write
operation checks against writeSize as part of the normal bounds check.
If a string is read-only (such as a literal), the writeSize will be 0, and we
do copy-on-write. The same scheme applies to string growth. So the price
is just one extra word (writeSize) per string. Since this enables us to
intern all literal strings without introducing another data type, I would say
the overhead is minimal.

Hong



RE: The internal string API

2001-06-19 Thread Hong Zhang


 * Convert from and to UTF-32
 * lengths in bytes, characters, and possibly glyphs
 * character size (with the variable length ones reporting in negative
numbers)

What do you mean by character size if it does not support variable length?

 * get and set the locale (This might not be the spot for this)

The locale should be context based. Each thread should have its own
locale.

 * normalize (a noop for non-Unicode data)
 * Get the encoding name

The encoding name is tricky. Neither Java nor POSIX defines a
naming scheme. I personally prefer full names in lower case,
such as iso8859-1, with the API converting names to lowercase automatically.
The encoding name must be strict ASCII. Some common aliases
may be provided. There must be an API to list all supported encodings
at runtime.

 * Do a substr operation by character and glyph

Byte-based is more useful. If I have UTF-8 and I want to substr it
to another UTF-8 string, it is painful to convert it or to linear-search
for the character position.

 I don't know if we want to treat encoding and data format separately--it 
 would seem to make sense to be able to have a string tell us it's 
 Unicode/UTF-32/Korean rather than just UTF-32/Korean, since I 
 don't see why it wouldn't be allowable to use the UTF-8 or UTF-16 encoding

 on non-Unicode data. (Not that it'd necessarily be all that useful, and I 
 can see just not allowing it)

I don't see why the core should support language/locale in this detail.
I deal with a lot of mixed Chinese/English text files. There is no way to
represent that using a plain string, unless you want to make the string a
rich-format-text buffer. The current locale or an explicit locale parameter
will suffice for your goal.

Hong



RE: The internal string API

2001-06-19 Thread Hong Zhang


This is the common approach to complicated text representation;
the implementations I have seen include IBM IText and SGI
rope. For rope, each rope is represented by either a simple
immutable string, a simple mutable string, an immutable
substring of another rope, or a binary node joining two other ropes.
We can even add user-defined nodes for things like memory-mapped
files, #include, etc.

The basic string is just one of the rope node types. We can build a
text package much like SGI rope. I don't think we should make
the basic string itself rope-like, for reasons of complexity and modularity.

Hong

 the simplest tree is one node with a raw block in it.  Only 
 when you start
 doing
 things to it
 
 
   substr($A, 27, 3, $B)
 
 and suchlike
 does deferring the copying give a win.
 
 Say $A is 35 megabytes long and $B is 23K.  Currently, and in any
 string representation that uses raw blocks, we have to do 
 these things:
 
   copy substr($A,27,3) to return value if needed
   Allocate a new 36M block
   copy substr($A,0,27)
   copy $B
   copy substr($A,30)
 
   set $A's data pointer to the new block
   free $A's old block
 
 
 With a tree representation, the assign-to-middle operation becomes:
 
   Return Value if needed is substr($A,27,3)
   Create a new string-segment-list-node
   Segment 1: substr($A,0,27)
   Segment 2: $B (which might be another tree)
   Segment 3: substr($A,30)
   return $A's old top node to the node pool
   set $A's data pointer to the new top node
   set $B to copy-on-write mode, so future changes to $B 
 do not affect $A
 
 no new allocations!
 
 
 
 This kind of thing also allows us to do live interpolation in which
 
   ql this will $change 
 
 might rewrite to a magic scalar that evaluates the join every time
 it is fetched instead of once when it is built.
 
 Mixed-type?  Yes!  You could even have a value that is not a 
 string at all,
 hanging off your string tree.
 



RE: More character matching bits

2001-06-12 Thread Hong Zhang


We should let an external collator handle all these fancy features.
People can always normalize/canonicalize/do-whatever-you-want
and send the resulting text/binary to the regex. All the features we
argue about here can be easily done by a customized collator.

Do NOT expect the Perl regex engine to be a linguist that understands
every language in the world and is able to match my name in
English and Chinese :-) (Of course, that would be a useful
feature for me.)

Please note regex is O(n) at best; adding an external collator
will make it O(2n). Putting fancy Unicode features into the regex
engine will not make it any faster.

My recommendation is to keep regex locale independent, and to
have some API for handling locale-specific features, though
I am not sure what is the best way to do this.

Hong



RE: Should we care much about this Unicode-ish criticism?

2001-06-11 Thread Hong Zhang


 However, I don't think this actually affects your comments, except that
 I'd guess that the half digits mentioned by Hong don't have the same
 term case used with them that the letters of various alphabets do.

I am not sure if we mean the same thing. The regular ASCII 0123456789
are called half-width digits in China, because they take about half
of the width of a Chinese character to display on screen or
paper. There is another set of digits in Chinese encodings,
full-width, that display at the same width as Chinese characters.
The full-width characters are mainly used for formatting. They have nothing
to do with lowercase/uppercase in Roman languages. I believe Unicode
has many such font-variant characters.

 Is this Uppercase?
 Is this Lowercase?

I believe Unicode already defines character categories, such as
L, Lu, Ll, Lo. I prefer we just use the Unicode terms instead of extending
ctype.h. The Perl 5 regex engine already supports them.

Hong



RE: Unicode sorting...

2001-06-08 Thread Hong Zhang


  I can't really believe that this would be a problem, but if they're
  integrated alphabets from different locales, will there be issues
  with sorting (if we're not planning to use the locale)? Are there
  instances where like characters were combined that will affect the
  sort orders?
 
 Yes, it is an issue.  In the general case, you CANNOT sort strings of
 several locales/languages into a single order that would satisfy all
 of the locales/languages.  One often quoted example is German and
 Swedish/Finnish: the LATIN CAPITAL LETTER A WITH RING ABOVE comes
 between A and B in the former but after Z (not immediately, but
 doesn't matter here) in the latter.  Similarly for all the accented
 alphabetic characters, the rules how they are sorted differ from one
 place to another , and many languages have special combinations like
 ch, ss, ij that require special attention.

My understanding is there is NO general Unicode sorting, period.

The most useful one must be locale-sensitive, as defined by the Unicode
collation. In practice, the story is even worse. For example, how do
you sort strings coming from different locales? Say I have an address
book with names from all over the world: which locale should I use
to sort the names? Another example: Chinese has no definite
sorting order, period. The commonly used schemes are phonetic-based
or stroke-based, but many characters have more than one pronunciation
(context sensitive) and more than one form (simplified and traditional).
So if we have mixed content from China and Taiwan, it is impossible to
sort in a way everyone will be happy with. Also, Chinese is space
insensitive. In English, we have to use spaces to separate words. But in
Chinese, there are no lexical words, only linguistic words. You can insert
a space between any two Chinese characters without changing their meaning.

I heard a rumor a long time ago that the Unicode consortium was working on
a locale-independent collation, which could be used to sort mixed content.
As for Perl, I would like to have several basic sortings:
a) binary sort
b) locale-independent general sort
c) locale-sensitive sort based on the Unicode collation

We could have more if possible. The general sort can be done by
canonicalizing all strings, removing case info, removing diacritics,
removing font/width distinctions, then using binary sort.

Hong



RE: Should we care much about this Unicode-ish criticism?

2001-06-08 Thread Hong Zhang

  What happens if unicode supported uppercase and lowercase numbers?
 
  [I had a dig about, and it doesn't seem to mention lowercase or
  uppercase digits. Are they just a typography distinction, 
 and hence not
  enough to be worthy of codepoints?]
 
 Damned if I know; I didn't know there even was such a thing.
 
 Uppercase vs. lowercase for letters is more than a 
 typographic distinction
 for many languages; there are words in English, for example, with a
 different meaning depending on whether they're capitalized (since
 capitalization indicates a proper noun).  If there is some similar
 distinction of meaning for numbers in some language, I suppose that
 Unicode may add such a thing; to date, there doesn't appear to be any
 concept of uppercase or lowercase for anything but letters.

There do exist half-width digits and full-width digits (widely used in
Chinese). They create a similar problem.

Hong



RE: Stacks, registers, and bytecode. (Oh, my!)

2001-06-05 Thread Hong Zhang

 On Tue, Jun 05, 2001 at 11:25:09AM +0100, Dave Mitchell wrote:
  This is the bit that scares me about unifying perl ops and regex ops:
  can we really unify them without taking a performance hit?
 
 Coupl'a things: firstly, we can make Perl 6 ops as lightweight as we like.
 
 Second, Ruby uses a giant switch instead of function pointers for their
 op despatch loop; Matz says it doesn't make that much difference in
 terms of performance.

Function-pointer dispatch is normally faster than or as fast as a switch.
The main downside is the context: a typical regular expression engine can
pre-fetch many variables into register locals, which can be used efficiently
by all switch cases. However, the common context for regular expressions is
relatively small, so I am not sure about the performance hit.

Hong



RE: Should we care much about this Unicode-ish criticism?

2001-06-05 Thread Hong Zhang

 Courtesy of Slashdot, 
 http://www.hastingsresearch.com/net/04-unicode-limitations.shtml
 
 I'm not sure if this is an issue for us or not, as we're generally 
 language-neutral, and I don't see any technical issues with any of the 
 UTF-* encodings having headroom problems.

I think the author confused himself. Unicode itself is not sufficient
to process human language, no matter how many characters it includes.
It is just an encoding.

Just take Chinese as an example: only a small percentage (10%) of Chinese
can read more than 6000 characters. The biggest dictionary I know of
includes about 65000 characters, many of which even linguists cannot agree
on with each other. Some of the characters are essentially the research
output of the dictionary's authors. It is impossible to include those
characters in an international standard such as Unicode.

Unicode contains surrogates for future growth. We still have about 1M
code points left for allocation. Eventually it will include far more
characters than anyone can care about.

Hong



RE: Should we care much about this Unicode-ish criticism?

2001-06-05 Thread Hong Zhang

 Firstly, the JIS standard defines, along with the ordering and
 enumeration of its characters, their glyph shape. Unicode, on  the other
 hand does not. This means that as far as Unicode is concerned, there is
 literally no distinction between two distinct shapes and hence no way to
 specify which should be used. This becomes particularly emotive when one
 is, for instance, attempting to represent a person's name - 
 if they have a particular preferred variant character with which they
write their
 name, there is no way to communicate that to the computer, and
 information is lost. 

This is a very common practice, nothing surprising. As you can tell,
my name is "hong zhang", which has already lost its Chinese tones and
glyphs. "hong" has 4 tones; each tone can be any of several
characters, and each character can be one of several glyphs (simplified and
traditional). However, it is still perfectly fine to call it my name.

 The second objection is again related to character versus  glyph issues:
 since Chinese,

I think this problem =~ locale. For any Unicode character, you cannot
properly tell its lowercase or uppercase form without considering locale.
And Unicode does not encode locale.

 Finally, there is a historiographical issue; when computers are used to
 digitise and store historical literature containing archaic characters,
 specifying the exact variant character becomes an important
 consideration.

I believe this should be handled by the application. This kind of work is
needed only by researchers. Perl should not care about it.

Hong


RE: Stacks, registers, and bytecode. (Oh, my!)

2001-05-30 Thread Hong Zhang


 There's no reason why you can't have a hybrid scheme. In fact I think
 it's a big win over a pure register-addressing scheme. Consider...

The hybrid scheme may be a win in some cases, but I am not sure it is
worth the complexity. I personally prefer strict RISC-style opcodes:
mainly load, store, and ops for common operators (+, -, *, etc.), plus
an escape opcode for complicated operators and functions.

 Consider the following code.
 
 $a = $x*$y+$z
 
 Suppose we have r5 and r6 available for scratch use, and that for some
 reason we wish to keep a pointer to $a in r1 at the end 
 (perhaps we use $a again a couple of lines later):
 
 
 This might have the following bytecode with a pure register scheme:
 
 GETSV('x',r5) # get pointer to global $x, store in register 5
 GETSV('y',r6)
 MULT(r5,r5,r6)  # multiply the things pointed to by r5 and 
 r6; store ptr to
   # result in r5
 GETSV('z',r6)
 ADD(r5,r5,r6)
 GETSV('a',r1)
 SASSIGN(r1,r5)

Please note most common operations will deal with locals, not
globals. Since almost all locals will fit into the register set, the
generated bytecode will be very small and very fast.

Global access is doomed to be slower than locals, especially
considering the synchronization overhead associated with threading.

Hong



RE: Stacks, registers, and bytecode. (Oh, my!)

2001-05-29 Thread Hong Zhang


 here is an idea. if we use a pure stack design but you can access the
 stack values with an index, then the index number can get large. so a
 fixed register set would allow us to limit the index to 8 bits. so the
 byte code could look something like this:
 
   16 bit op   (plenty of room for growth)
   8 bit register index of arg 1
   8 bit register index of arg 2
   ...
 
   next op code ...
 
   literal data support is needed (read only)

 either each op code knows how many args it has,

I would like to do so; otherwise we will lose most of the performance gain.

 or we have an end marker (e.g  0xff which is never used as a register
index).

If we have to use variable arguments, I strongly recommend adding one argc
byte immediately following the opcode. Linearly scanning the bytecode for
an end marker will be very slow.

 the op code is stored in network endian order and the interpreter will
 always build a 16 bit int from the 2 bytes.

The 16-bit op has both endianness issues and alignment issues. Most RISC
machines cannot access byte-aligned opcodes, so we would have to add a lot
of padding. Either way, it will be fatter and slower than an 8-bit opcode.
I prefer using an escape opcode.

 we have a simple set of load literal, push/pop (multiple) registers op
 codes. 

There should be no push/pop opcodes. They are simply register moves.
 
 each thread has its own register set.
 
 all registers point to PMC's
 
 passing lists to/from subs is via an array ref. the data list 
 is on the
 stack and the array ref is in @_ or passed by return().
 
 special registers ($_, @_, events, etc.) are indexed with a starting
 offset of 64, so general registers are 0-63.
 
 this can be mmapped in, executed with NO changes, fairly easily
 generated by the compiler front end, optimizable on or offline,
 (dis)assembler can be simply written, etc.
 
 simple to translate to TIL code by converting each op code to a call to
 the the op function itself and passing in the register indexes or even
 the PMC pointers themselves.

Agreed.

Hong



RE: Stacks registers

2001-05-23 Thread Hong Zhang


 Register based. Untyped registers; I'm hoping that the vtable stuff can be
 sufficiently optimized that there'll be no major win in 
 storing multiple copies of a PMC's data in different types knocking
around. 
 
 For those yet to be convinced by the benefits of registers over stacks,
try
 grokking in fullness what op scratchpads are about. Ooh look, registers.

I think stack based =~ register based. If we don't have Java-like jsr
and ret, every bytecode position inside one method always operates at the
same stack depth, therefore we can just treat the locals + stack as a flat
register file. A single pass can translate stack-based code into
register-based code.

For example:
  push local #3; => move #(max_local + opcode_stack_depth), #3

  push local #3; push local #4; add; pop local #5; => add #5, #3, #4

  push local #3; push local #4; call foo; pop #6; => call_2 #6, #3, #4

As long as the stack-based system is carefully designed, we can easily add a
linear-cost translation step to convert it into register-based bytecode,
and run it faster.

Hong



Re: Perl_foo() vs foo() etc

2001-04-12 Thread Hong Zhang

IIRC, ISO C says you cannot have /^_[A-Z_][A-Za-z_0-9]*$/. That's reserved
for the standard.

If you consider that our prefix is "_Perl_", not just "_", we will be pretty
safe. Not many people follow the standard anyway :-)

Hong




Re: Unicode handling

2001-03-23 Thread Hong Zhang

 I recommend to use 'u' flag, which indicates all operations are performed
 against unicode grapheme/glyph. By default re is performed on codepoint.

 U doesn't really signal "glyph" to me, but we are sort of limited in what
 we have left. We still need a zero-width assertion for glyph boundary
 within regexes themselves.

The 'u' flag means "advanced Unicode feature(s)", which includes "always
matching against glyphs/graphemes, not codepoints". What it really means is
open to discussion. I think we probably still need a "glyph" or "grapheme"
boundary assertion in some cases.

 We need the character equivalence construct, such as [[=a=]], which
 matches "a", "A ACUTE".

 Yeah, we really need a big list of these. PDD anyone?

I don't think we need a big list here. The [[=a=]] is part of the POSIX
1003.2 regex syntax, as is [[.ch.]]. Perl 5 does not support this syntax.
We can implement it in Perl 6.

For even more advanced equivalence, we can offload the job to a collation
library.

Hong




Re: Unicode handling

2001-03-23 Thread Hong Zhang

  We need the character equivalence construct, such as [[=a=]], which
  matches "a", "A ACUTE".
 
  Yeah, we really need a big list of these. PDD anyone?
 
 
 But surely this is a locale issue, and not an encoding one?  Not every 
 language recognizes the same character equivalences.

Let me clarify. The "character equivalence", assuming [[~a~]] syntax,
means matching a sequence of a single letter 'a' followed by any number of
combining characters. I believe we can handle this without considering
locale. Whether it is still useful is open to discussion. At least it is
trivial to implement.

Hong




Re: PDD 4: Internal data types

2001-03-22 Thread Hong Zhang

  The normalization has something to do with encoding. If you compare two
  strings with the same encoding, of course you don't have to care about
it.

 Of course you do. Think about it.

I said "you don't have to". You can use "==" for codepoint comparison, and
something like "Normalizer.compare(a, b)" for lexical comparison, like Java.
It may not be the best solution, but it is doable and acceptable.

 If I'm comparing "(Greek letter lower case alpha with tonos)" with "(Greek
 letter lower case alpha)(+tonos)" I want them to compare equal. One string
is
 normalized, the other isn't; how they're encoded is irrelevant, you still
have
 to care about normalization. (This is where Perl 5 currently falls over)

 Normalization has utterly nothing at all to do with encoding. Nothing.

Let's not fight over wording. For most encodings I know of, the concept of
normalization does not even exist. What is your definition of normalization?

 Now, since we have to normalize strings in some cases (like the comparison
 above) when the user hasn't explicitly asked for it, let's not make things
 like length() and substr() dependent on whether or not the string is
 normalized, eh? The *last* thing I want to happen is this:

 $a = "(Greek letter lower case alpha with tonos)"
 print length $a; # 1
 if ($a eq "(Greek letter lower case alpha)(+tonos)") {
 # (Which it damned well ought to)

 print length $a; # 2! HA! Surprise! $a had to be normalized!
 }

I fully understand this. This is one of the reasons I propose a single UTF-8
encoding. If length() and substr() depend on the string's internal encoding,
are they still useful? Who could handle such a magic length()?

I still believe UTF-8 is the best choice. Random string access is just
not important, at least to me.

Let's not fight over string encoding. I would like to see some suggestions
about how to handle normalization transparently. Making length()/substr()
depend on encoding/normalization (whatever they are) does not make sense
to me.

Hong




Re: Idea for safe signal handling by a byte code interpreter

2001-03-22 Thread Hong Zhang

Here is some of my experience with HotSpot for Linux port.

  I've read, in the glibc info manuals, the the similar situation
  exists in C programming -- you don't want to do a lot inside the
  signal handler; just set a flag and return, then check that flag from
  your main loop, and run a "bottom half".

It is much more limited than you read. Even sprintf() does not
work well. sprintf() supports "%m", which means errno. The errno
is "#define errno (*__errno_location())", which uses thread_self().
If you install a signal handler with an alternate signal stack,
sprintf() will crash immediately, even with an empty format string.

  I've looked, a little, (and months ago at that) at the LibREP (ala
  "sawfish") virtual machine.  It's a pretty good indirect threaded VM
  that uses techniques pioneered by Forth engines.  It utilizes the GCC
  ability to take the address of a label to build a jump table indexed
  by opcode.  Very efficient.

It is not very portable, and I don't believe it will be any faster than a
switch.

  What if, at the C level, you had a signal handler that sets or
  increments a flag or counter, stuffs a struct with information about
  the signal's context, then pushes (by "push", I mean "(cons v ls)",
  not "(append! ls v)" 'whatever ;-) that struct on a stack...

I don't believe there is any way to push anything on a stack inside a
signal handler without breaking the interpreter. Remember the signal
context is not useful outside the signal handler.

For synchronous signals, we can use a regular signal handler or Win32
structured exception handling, for things like SIGSEGV etc.

For asynchronous signals, we have to do some magic.
If you don't need the signal context (most of the time), you can use a
generic signal handler:

void perl_signal_handler(int sig) {
/* this is thread safe, SMP safe, nested signal safe */
atomic_increment(signal_table[sig]);
/* set general flag for all async event */
async_flag = 1;
}

If you really need signal context, you have to use a dedicated thread

void* thread_function(void* arg) {
    siginfo_t info;
    while (sigwaitinfo(&sigset, &info) >= 0) {
        /* handle signal here */
    }
    /* something went wrong */
}

Hong




Re: Idea for safe signal handling by a byte code interpreter

2001-03-22 Thread Hong Zhang

  What if, at the C level, you had a signal handler that sets or
  increments a flag or counter, stuffs a struct with information
about
  the signal's context, then pushes (by "push", I mean "(cons v ls)",
  not "(append! ls v)" 'whatever ;-) that struct on a stack...

 Hong I don't believe there is any way to push anything on the stack inside
 Hong signal handler without breaking the interpreter. Remember the signal
 Hong context is not useful outside signal handler.

  I don't mean "the stack", but "a stack"; one created just for this
purpose.

"a stack" is still too easy to get overflow. And will be difficult to manage
in threaded environment, malloc() is not allowed inside signal handler.
A simple signal count will be much easier to deal with.

I tried to give a concrete solution here. I used this solution in the
HotSpot Java virtual machine on Linux, and it works fine.

Hong




Re: PDD 4: Internal data types

2001-03-08 Thread Hong Zhang

 I was thinking maybe (length/4)*31-bit 2s complement to make portable
 overflow detection easier, but that would be only if there wasn't a good C
 library for this available to snag.

I believe Python uses (length/2)*15-bit 2's complement representation.

Because bigint and bignum are complicated anyway, we should make them
opaque and disallow direct field access. Generic inline functions and/or
macros should be used to access individual digits. The actual data
structure should be defined by the implementation and is irrelevant to the
rest of the system; we can decide its size later if we really need to.
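To make that concrete, here is one way the encapsulation could look; a hedged
sketch with made-up names, using 15-bit digits in 16-bit storage the way
Python's bigints do, where callers only ever go through the accessor macros:

```c
#include <stdint.h>

/* Hypothetical encapsulated bigint: the digit layout is private; all
 * access goes through BIG_DIGIT/BIG_SET_DIGIT so the representation can
 * change without touching callers. */
#define BIG_SHIFT 15
#define BIG_MASK  ((1u << BIG_SHIFT) - 1)

typedef struct {
    int ndigits;
    uint16_t digit[8];                 /* fixed capacity for this sketch */
} bigint;

#define BIG_DIGIT(b, i)        ((b)->digit[(i)] & BIG_MASK)
#define BIG_SET_DIGIT(b, i, v) ((b)->digit[(i)] = (uint16_t)((v) & BIG_MASK))

/* value = sum of digit[i] * 2^(15*i), little-endian digit order */
static uint64_t big_to_u64(const bigint *b)
{
    uint64_t v = 0;
    for (int i = b->ndigits - 1; i >= 0; i--)
        v = (v << BIG_SHIFT) | BIG_DIGIT(b, i);
    return v;
}
```

If the digit width later changes to 31-bit digits in 32-bit storage, only
BIG_SHIFT and the storage type change; no caller is touched.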

Hong




Re: PDD 4: Internal data types

2001-03-08 Thread Hong Zhang


For bigint, we definitely need a highly portable implementation. People can
do platform-specific optimizations on their own later. We should settle the
generic implementation first, with proper encapsulation.

Hong

 Do we need to settle on anything - can it vary by platform so that 64 bit
 platforms can use 64 bit, in which case the 32/31 choice could even be by
 platform (or always 32 if we find it works well)





Re: PDD 4: Internal data types

2001-03-06 Thread Hong Zhang

 Unless I really, *really* misread the unicode standard (which is distinctly
 possible) normalization has nothing to do with encoding,

I understand what you are trying to say, but it is not so easy in practice.
Normalization does have something to do with encoding. If you compare two
strings in the same encoding, of course you don't have to care about it.
But if you compare two strings in different encodings (which Perl 6 will
do), you have to care. The 6-character "re`sume`" in Latin-1 should compare
equal to the 8-codepoint decomposed Unicode string. That is what people
would expect, and if the language does not handle it, some library will
have to.

 and the encoding
 we choose doesn't make any difference to the character position, string
 length, or ord stuff if we define them to work on characters rather than
 bytes. Which doesn't mean it's not a problem, it's just a different
problem.

Anyway, that is the problem I tried to raise; a different problem is still a
problem. I am not sure which definition of "character" you are using. The
single codepoint "e`" can also be expressed as two codepoints in Unicode, so
ord("e`") will return a different value depending on the encoding. All the
concepts of character position, string length, and ord() depend on the
encoding. If Perl 6 uses only one encoding, everything will be just fine.
Otherwise, someone has to handle this problem.

 Perl users will have to face all kinds of problem when they try to deal
 with individual characters.

 Most won't, honestly. At a guess, 90% of perl's current userbase doesn't
 care about Unicode for any reason other than XML,

I totally agree with you on this; that was not my point. What I tried to
express is what Perl 6 should do for the people who do care about it. I
would like to see the solution, be it part of the language or some Unicode
library.

Hong





Re: PDD 4: Internal data types

2001-03-05 Thread Hong Zhang

struct perl_string {
  void *string_buffer;
  UV length;
  UV allocated;
  UV flags;
}
 
 The low three bits of the flags field is reserved for the type of the
 string. The various types are:
 
 =over 4
 
 =item BINARY (0)
 
 =item ASCII (1)
 
 =item EBCDIC (2)
 
 =item UTF_8 (3)
 
 =item UTF_32 (4)
 
 =item NATIVE_1 (5) through NATIVE_3 (7)

Some thoughts about string encoding. Because of Unicode normalization and
canonical equivalence, some characters that take one codepoint in one
encoding may take two or more codepoints in another encoding, mainly vowels
with diacritics. In that sense, substr() may give different results
depending on the current encoding.

Here is an example: "re`sume`" takes 6 characters in Latin-1, but could
take 8 characters in decomposed Unicode. All Perl functions that directly
deal with character position and length will be sensitive to the encoding.
I wonder how we should handle this case.
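The 6-versus-8 discrepancy is easy to demonstrate by counting codepoints in
the two Unicode forms of the same word. A small sketch (utf8_codepoints is a
made-up helper, not from the post; it counts bytes that are not UTF-8
continuation bytes):

```c
/* Count codepoints in a UTF-8 string: a byte of the form 10xxxxxx
 * continues the previous codepoint; every other byte starts a new one. */
static int utf8_codepoints(const char *s)
{
    int n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

The precomposed form "r\xC3\xA9sum\xC3\xA9" has 6 codepoints while the
decomposed form "re\xCC\x81sume\xCC\x81" (e + combining acute) has 8, even
though both spell the same word.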

Hong





Re: PDD 4: Internal data types

2001-03-05 Thread Hong Zhang

 Here is an example, "re`sume`" takes 6 characters in Latin-1, but
 could take 8 characters in Unicode. All Perl functions that directly
 deal with character position and length will be sensitive to encoding.
 I wonder how we should handle this case.
 
 My first inclination is to force normalization on any data we manipulate.

That was one of the reasons I proposed UTF-8 as the string encoding. If we
don't do normalization (by keeping multiple encodings), we have to avoid
using character position, string length, and ord(), since they are
encoding-specific. Perl users will face all kinds of problems when they try
to deal with individual characters.

In any case, we need to make sure the regex engine does not have any
problems with normalization.

Hong




Questions about PDD 4: Internal data types

2001-03-02 Thread Hong Zhang

 Integer data types are generically referred to as CINTs. There is an
 CINT typedef that is guaranteed to hold any integer type.

Does such a thing exist? It would have to be a BIGINT.

 Should we scrap the buffer pointer and just tack the buffer on the end
 of the structure? Saves a level of indirection, but means if we need
 to make the buffer bigger we have to adjust anything pointing to it.

It largely depends on whether these primitive types are mutable or
immutable. Most languages chose immutable, such as Python or Smalltalk.
I assume Perl will choose mutable semantics.

 Floating point data types are generically reffered to as
 CNUMs. There is a CNUM typedef that is guaranteed to hold any
 floating point data type.

Can you clarify this? The __float80 on x86 has very bad alignment,
and not all compilers support it.

 =item BINARY (0)
 
 =item ASCII (1)
 
 =item EBCDIC (2)
 
 =item UTF_8 (3)
 
 =item UTF_32 (4)
 
 =item NATIVE_1 (5) through NATIVE_3 (7)

Why not include UTF-16?

Hong




Re: PDD 4: Internal data types

2001-03-02 Thread Hong Zhang

 I was hoping to get us something that was guaranteed to hold an integer, no
 matter what it was, so you could do something like:

    struct thingie {
        UV type;
        INT my_int;
    }

What is the purpose of doing this? An SV is already guaranteed to hold
anything. Why do we need one type that can hold any integer and another
that can hold any float? The struct/union solution does not provide much
type safety: how can I tell which member is valid without external
knowledge? I don't think we really need this type; use an SV instead.
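For contrast, the usual fix for "which member is valid?" is to pair the
union with a discriminant and guard every access through it; a minimal
sketch with invented names, not a proposal for the actual SV layout:

```c
#include <assert.h>

/* Tagged union: the type field is the "external knowledge" made explicit. */
typedef enum { T_INT, T_NUM } sv_type;

typedef struct {
    sv_type type;              /* says which union member is live */
    union {
        long   iv;
        double nv;
    } u;
} sv;

static long sv_as_int(const sv *v)
{
    assert(v->type == T_INT);  /* tag check guards the access */
    return v->u.iv;
}
```

Without the tag, reading u.iv when u.nv was last stored is silently wrong;
with it, a misuse trips the assertion instead.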

Hong




Re: C Garbage collector

2001-02-23 Thread Hong Zhang


I don't quite understand the intention here. Most C garbage collectors are
mark-sweep based. They have all the common problems of GC, for example
non-deterministic finalization (destruction) and conservativeness. If we
decide to use GC for Perl, it will be trivial to implement a simple
mark-sweep collector or a semi-space copying collector. There is no
advantage to using an off-the-shelf C garbage collector.

Hong

- Original Message -
From: "NeonEdge" [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, February 21, 2001 3:32 AM
Subject: RE: C Garbage collector


 I agree with Damien that the Sun description sounds less portable, which we
 all know in the Perl world is crucial (80 ports) (although Sun mentions
 16-bit DOS/Win). Any GC implementation needs to try to 'not break' the
 existing stuff. Other questions are somewhat dependent upon what language
 is used to implement (both GC descriptions are C or C++ dependent, which is
 ok by me, but I'm a masochist). I've been off the list since RFCs closed,
 so does anyone know if there's been any further thoughts on the
 implementation language?
 Grant M.





Re: string encoding

2001-02-16 Thread Hong Zhang

 People in Japan/China/Korea have been using multi-byte encoding for
 long time. I personally have used it for more 10 years. I never feel
 much of the "pain". Do you think I are using my computer with O(n)
 while you are using it with O(1)? There are 100 million people using
 variable-length encoding!!!

 Not at this level they aren't. The people actually writing the code do feel
 the pain, and you do pay a computational price. You can't *not* pay the
 price.

    substr($foo, 233253, 14)

 is going to cost significantly more with variable sized characters than
 fixed sized ones.

I don't believe so. The problem is that you assume a character position from
the very beginning. Where did you get the values 233253 and 14? Here is an
example of how to decode "Content-Length: 1000" into a name/value pair
using a multi-byte encoding. The code is in C syntax:

   char* str = "Content-Length: 1000";
   char* sep = strstr(str, ": ");            /* find the separator */
   char* name = strndup(str, sep - str);     /* "Content-Length" */
   char* value = strdup(sep + strlen(": ")); /* "1000" */

If you go through the C string functions plus XXXprintf(), most of them, if
not all, are O(n).

 Take this example, in Chinese every character has the same width, so
 it is very easy to format paragraphs and lines. Most English web pages
 are rendered using "Times New Roman", which is a variable-width font.
 Do you think the English pages are rendered O(n) while Chinese page
 are rendered O(1)?

 You need a better example, since that one's rather muddy.

The example is not good. How about finding the cursor position when you
click in the middle of a Word document? A fixed-width font will be faster
than a variable-width one, right?

 As I said there are many more hard problems than UTF-8. If you want
 to support i18n and l10n, you have to live with it.

 No, we don't. We do *not* have to live with it at all. That UTF-8 is a
 variable-length representation is an implementation detail, and one we are
 not required to live with internally. If UTF-16 (which is also variable
 width, annoyingly) or UTF-32 (which doesn't officially exist as far as I
 can tell, but we can define by fiat) is better for us, then great. They're
 all just different ways of representing Unicode abstract characters. (I
 think--I'm only up to chapter 3 of the unicode 3.0 book)

 Besides, I think you're arguing a completely different point, and I think
 it's been missed generally. Where we're going to get bit hard, and I can't
 see a way around, is combining characters.

My original argument is to use UTF-8 as the internal representation of
strings. Given the complexity of i18n and l10n, most text processing jobs
can be done as efficiently using UTF-8 as using UTF-32, unless you want to
treat them as binary. Most text processing uses linear algorithms anyway.

Hong





Re: string encoding

2001-02-16 Thread Hong Zhang

  What do you mean? Have you seen people using multi-byte encoding
  in Japan/China/Korea?

 You're talking to the wrong person. Japanese data handling is my graduate
 dissertation. :)

 The Unified Hangul/Kanji/Ha'nzi' Characters in Unicode (so-called "Unihan")
 occupy one and only one codepoint each. Legacy data sets (EUC and the like)
 can be processed internally by being converted to Unicode on entry to the
 core.

Did it buy you much? I don't believe so. Can you give some examples of why
random character access is so important? Most people process text linearly.

I have been working with Java for many years. I found that Unicode is the
best excuse people use for claiming i18n and l10n support. English-speaking
developers, including me, want to keep our simple mindset of English text
processing, so we don't have to do the real hard work.

Hong




Re: string encoding

2001-02-16 Thread Hong Zhang

 And address arithmetic and mem(cmp|cpy) is faster than array iteration.

Ha ha ha. You must be kidding.

The mem(cmp|cpy) functions work just fine for UTF-8 string comparison and
copying. But memcmp() cannot be used for UTF-32 string comparison, because
of endianness issues.
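To make the endianness point concrete: byte-wise memcmp() on UTF-8 agrees
with codepoint order, while on little-endian UTF-32 it does not. The arrays
below are hand-encoded forms of U+00E9 and U+0100 (illustrative data, not
from the original post):

```c
/* U+00E9 < U+0100 as codepoints. */
static const unsigned char u8_e9[]   = { 0xC3, 0xA9 };     /* UTF-8    */
static const unsigned char u8_100[]  = { 0xC4, 0x80 };     /* UTF-8    */
static const unsigned char u32_e9[]  = { 0xE9, 0, 0, 0 };  /* UTF-32LE */
static const unsigned char u32_100[] = { 0x00, 1, 0, 0 };  /* UTF-32LE */
```

Comparing the UTF-8 bytes gives the correct codepoint order because UTF-8
byte sequences sort the same as the codepoints they encode; the UTF-32LE
bytes compare in the wrong order because the least significant byte comes
first.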

Hong




Re: string encoding

2001-02-16 Thread Hong Zhang

 Did it buy you much? I don't believe so. Can you give some examples why
 random character access is so important? Most people are processing text
 linearly.

 Most, but not all. And as this is the internals list, we have to deal with
 all. We can't choose a convenient subset and ignore the rest. (No matter
 how much I might like to...)

I believe a larger subset of people will be happier with UTF-8 than with
UTF-32. UTF-32 is not a panacea either; we have to make trade-offs. Unless
we choose to support multiple string encodings, I vote for UTF-8.

 I have been working with Java for many years. I found that Unicode is the
 best excuse people are using for i18n and l10n. English speaking
 developers, including me, want to keep their simple mind of english text
 process, so we don't have to the real hard work.

 Okay, this paragraph made no sense to me, but it feels like it's saying
 something that's important. Could you try again?

Based on my previous experience with i18n and l10n, I believe UTF-32 will
not help you much, if at all. It just misleads people into believing that
Unicode processing is simple.

Hong




Re: string encoding

2001-02-16 Thread Hong Zhang

I would like to wrap up my argument.

I recommend using UTF-8 as the sole string encoding. If we end up with
multiple encodings, there is absolutely no point to this argument.

The benefits of UTF-8: it is more compact, requires fewer encoding
conversions, and is more friendly to C APIs. UTF-16 is a variable-length
encoding too, once you consider surrogates. UTF-32 is way too big.

The main disadvantage of UTF-8 is O(n) random access, which I personally
believe is not very important, since most text processing requires a
linear scan of the text. Multi-byte encodings have been widely used in
Asian countries for years, and they do not seem to be a significant
problem.

If Perl intends to have superior Unicode, i18n, and l10n support, the
benefits of UTF-16 will fade away pretty quickly.

Overall, both UTF-8 and UTF-16 are acceptable, but I believe UTF-8 is the
slightly better choice.

Hong




Re: Garbage collection (was Re: JWZ on s/Java/Perl/)

2001-02-15 Thread Hong Zhang

   {
     my $fh = IO::File->new(">file");
     print $fh "foo\n";
   }
   {
     my $fh = IO::File->new(">file");
     print $fh "bar\n";
   }
 
 At present "file" will contain "foo\nbar\n".  Without DF it could just
 as well be "bar\nfoo\n".  Make no mistake, this is a major change to the
 semantics of perl.
 
 Alan Burlison

This code should NEVER work, period. People are just asking for trouble
with this kind of code.

DF never truly exists, even with reference counting. Can anyone show me how
to deterministically collect a circular reference? The current semantics of
Perl work most of the time, but not always.

What we are really talking about is "Shall Perl provide 90% or 99% of DF?"
The operating system provides 0% during runtime and 100% at process exit.

Hong






Re: Garbage collection (was Re: JWZ on s/Java/Perl/)

2001-02-15 Thread Hong Zhang

 Hong Zhang wrote:
 
  This code should NEVER work, period. People will just ask for trouble
  with this kind of code.
 
 Actually I meant to have specified ">>" as the mode, i.e. append, then
 what I originally said holds true.  This behaviour is predictable and
 dependable in the current perl implementation.  Without the ">>" the file
 will contain just "bar\n".

That was not what I meant. Your code already assumes the existence of
reference counting; it does not work well with any other kind of garbage
collection. If you translate the same code into C without putting in the
close(), the code will not work at all.

By the way, in order to use Perl on real native-thread systems, we have to
use atomic operations to increment/decrement the reference count. On most
systems I have measured (PC and SPARC), an atomic operation takes about
0.1-0.3 microseconds, and it will be even worse on large SMP machines. The
latest garbage collection algorithms (parallel and concurrent) can handle
large heaps pretty well. The cost will be less DF.
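For concreteness, the operation being costed here is just the
reference-count bump itself. In today's C11 atomics (which did not exist in
2001; this is purely a sketch with invented names, not Perl source) it
would look like:

```c
#include <stdatomic.h>

/* Every cross-thread share/release of a value pays for one atomic RMW;
 * that RMW is the 0.1-0.3 microsecond operation measured above. */
typedef struct { atomic_int refcnt; } sv_head;

static void sv_ref(sv_head *sv)
{
    atomic_fetch_add(&sv->refcnt, 1);
}

static int sv_unref(sv_head *sv)   /* returns 1 when the last ref dies */
{
    return atomic_fetch_sub(&sv->refcnt, 1) == 1;
}
```

A tracing collector avoids these per-operation atomics entirely, which is
the trade-off the paragraph above is pointing at: cheaper mutator
operations in exchange for less deterministic finalization.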

Hong




string encoding

2001-02-15 Thread Hong Zhang

Hi, All,

I want to give some of my thoughts about string encoding.

Personally I like the UTF-8 encoding. The variable-length problem can be
handled by a special (virtual) function like:

class String {
    virtual UV iterate(/* inout */ int* index);
};

So a typical string iteration will look like:

    for (i = 0; i < size;) {
        UV ch = s->iterate(&i);
        /* do what you want */
    }

instead of:

    for (i = 0; i < size; i++) {
        uint32 ch = s->charAt(i);
        /* be my guest */
    }

The new style will be strange, but not very difficult to use. It also hides
the internal representation.
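In plain C, the iterate() idea might look like the following sketch;
utf8_iterate is a made-up name, it handles 1-4 byte sequences, and real
code would also validate the input:

```c
/* Decode the codepoint at byte index *idx and advance *idx past it. */
static unsigned utf8_iterate(const unsigned char *s, int *idx)
{
    unsigned char c = s[(*idx)++];
    if (c < 0x80)
        return c;                       /* ASCII fast path */
    /* Lead byte 110xxxxx -> 1 trailing byte, 1110xxxx -> 2, 11110xxx -> 3 */
    int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
    unsigned cp = c & (0x3F >> extra);  /* payload bits of the lead byte */
    while (extra--)
        cp = (cp << 6) | (s[(*idx)++] & 0x3F);
    return cp;
}
```

The caller's loop is exactly the iterate() style above: the index advances
by however many bytes the codepoint occupied, so the variable length stays
hidden inside the function.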

The UTF-32 suggestion largely ignores internationalization. Many
user-perceived characters are composed of more than one Unicode code
point. Once you consider Unicode normalization, canonical forms, Hangul
conjoining, Indic clusters, combining characters, virama, collation, and
locales, UTF-32 will not help you much, if at all.

Hong




Re: string encoding

2001-02-15 Thread Hong Zhang

 On Thu, Feb 15, 2001 at 02:31:03PM -0800, Hong Zhang wrote:
  Personally I like the UTF-8 encoding. The solution to the
  variable length can be handled by a special (virtual)
  function like
 
 I'm expecting that the virtual, internal representation will not
 be in a UTF but will simply be an array of codepoints. Manipulating
 UTF8 internally is horrible because it's a variable length encoding,
 so you need to keep track of where you are both in terms of characters
 and bytes. Yuck, yuck, yuck.

I am not sure you have read through my email.

The concept of a character has little to do with codepoints; many
characters are composed of more than one codepoint.

The concept of character position is completely useless in many languages.
Many languages just don't have an English-style "character"; see collation,
Hangul conjoining, and combining characters. There is just no easy way to
keep track of character position. What you really meant was probably the
codepoint position, and the codepoint position is largely internal to the
library. As long as regular expressions can handle UTF-8 efficiently (as
they do now), most people will feel just fine with it.

There are just not many people interested in the codepoint position, if
they have ever heard of it. They care more about m// or s///.

Even if you want to keep track of character offsets, that is still much
easier than many of the other Unicode features I mentioned.

Hong



