RE: Unicode thoughts...
At 4:32 PM -0800 3/25/02, Brent Dax wrote:
> I *really* strongly suggest we include ICU in the distribution. I recently had to turn off mod_ssl in the Apache 2 distro because I couldn't get OpenSSL downloaded and configured.

FWIW, ICU in the distribution is a given if we use it. Parrot will require a C compiler and link tools (maybe make, but maybe not) to build on a target platform and nothing else. If we rely on ICU we must ship with it.

--
Dan

--it's like this---
Dan Sugalski          even samurai
[EMAIL PROTECTED]     have teddy bears and even
                      teddy bears get drunk
Re: Unicode thoughts...
Someone said that ICU requires a C++ compiler. That's concerning to me, as is the issue of how we bootstrap our build process. We were planning on a platform-neutral miniparrot, and IMHO that can't include ICU (as I'm sure it's not going to be written in pure ANSI C).

--Josh

At 8:45 on 03/30/2002 EST, Dan Sugalski [EMAIL PROTECTED] wrote:
> At 4:32 PM -0800 3/25/02, Brent Dax wrote:
> > I *really* strongly suggest we include ICU in the distribution. I recently had to turn off mod_ssl in the Apache 2 distro because I couldn't get OpenSSL downloaded and configured.
>
> FWIW, ICU in the distribution is a given if we use it. Parrot will require a C compiler and link tools (maybe make, but maybe not) to build on a target platform and nothing else. If we rely on ICU we must ship with it.
Re: Unicode thoughts...
At 10:07 AM -0500 3/30/02, Josh Wilmes wrote:
> Someone said that ICU requires a C++ compiler. That's concerning to me, as is the issue of how we bootstrap our build process. We were planning on a platform-neutral miniparrot, and IMHO that can't include ICU (as i'm sure it's not going to be written in pure ansi C)

If the C++ bits are redoable as C, I'm OK with it. I've not taken a good look at it to know how much it depends on C++. If it's mostly // comments and such we can work around the issues easily enough. If it's objects, well, I suppose it depends on how much it relies on them.

--
Dan

--it's like this---
Dan Sugalski          even samurai
[EMAIL PROTECTED]     have teddy bears and even
                      teddy bears get drunk
Re: Unicode thoughts...
Dan Sugalski wrote:
> At 10:07 AM -0500 3/30/02, Josh Wilmes wrote:
> > Someone said that ICU requires a C++ compiler. That's concerning to me, as is the issue of how we bootstrap our build process. We were planning on a platform-neutral miniparrot, and IMHO that can't include ICU (as i'm sure it's not going to be written in pure ansi C)
>
> If the C++ bits are redoable as C, I'm OK with it. I've not taken a good look at it to know how much it depends on C++. If it's mostly // comments and such we can work around the issues easily enough. If its objects, well, I suppose it depends on how much it relies on them.

Looking at icu/common I see more .c files than .cpp files, and what .cpp files there are look somewhat like wrappers around the C code. In addition, the .cpp files appear to do such things as create iterator wrappers, bidirectional display and normalization. Some of the files are thicker than wrappers, but I think there's enough code behind this C++ veneer that we can use at least the C part, if not the entire library.

--
Jeff [EMAIL PROTECTED]
RE: Unicode thoughts...
Jeff:
# This will likely open yet another can of worms, but Unicode has been delayed for too long, I think. It's time to add the Unicode libraries (in our case, the ICU libraries at http://oss.software.ibm.com/icu/, which Larry has now blessed) to Parrot. string.c already has (admittedly unavoidable, due to the library not being included) assumptions such as isdigit(). So, I have a few thoughts (that may have already been shot down by people wiser than I in such matters) to explicate, and some questions to ask.
#
# ICU should be added as a note in the README, and maybe to 'INSTALL' if we ever create one. Let's not add it to CVS, as it's not under our control. If we have to patch ICU to make it work correctly with Parrot, the patches should be submitted back to the ICU team. And I'm joining the appropriate mailing lists to keep apprised of development.

I *really* strongly suggest we include ICU in the distribution. I recently had to turn off mod_ssl in the Apache 2 distro because I couldn't get OpenSSL downloaded and configured.

We also need to make sure ICU will work everywhere. And I do mean *everywhere*. Will it work on VMS? Palm OS? Crays?

# Before Unicode goes into full swing, I need some idea of how we're going to deploy the libraries. On this note, I defer to the Configure master, Brent. I've already done some work with ICU, so I'm reasonably comfortable with migrating in one Unicode bit at a time, until we're ready for full UTF-16 compliance.
#
# The RE engine should (I'm speaking without having recently read the source, so feel free to correct me) not need to be migrated, as it's already using UTF-32 internally, which leaves just the string internals. These can be migrated to using ICU macros fairly easily (I've already done some of the work locally), so I think the main focus should be on encodings, as we'll have to eventually support the more common wide-character encodings such as KOI-8 and BIG5.
There are a few things that need to change, but they aren't big issues. Mostly it's just places where character sets have been presumed. However, I'm seriously thinking about a major re-architecture of the regex engine, which would probably help these sorts of issues.

# I still have some questions about using UTF-16 internally for string representation (as mentioned in http:[EMAIL PROTECTED]/msg07856.html), but I've resolved most of those. It's an excellent match for the ICU library, as it uses UTF-16 internally. My only question is if we're going to incur a performance hit every time a scalar is transferred to the RE engine, as it uses UTF-32 internally.

That can change. However, utf32 seems like the best match, as it would allow us to reach into a string's guts for speed. (We don't currently do that, but if I do redesign the engine, I'll probably be able to.)

# Also, once we have UTF-16 running internally, I'd be interested in seeing what memory consumption looks like vs. UTF-32, because I'd like to see if it makes sense to add a compile-time switch between UTF-8 and UTF-32 to let the installer decide on memory tradeoffs. ICU has an internal macro that defines its own internal representation, and that could conflict with our intended usage as well.
#
# Performance would suffer in the UTF-8 case, naturally, but the difference in memory usage might be significant enough that we'd want to leave the decision up to the installer. Having said that, the headache of testing multiple versions of Perl6 might not be worth it.
#
# So, to wrap up, I'm soliciting thoughts on how best to start the Unicode migration, and deal with the inevitable problems that will come up. I'm hoping that most of the magic will be hidden in string.c, where we won't have to worry about it, but we'll have to see.
#
# Now, this is admittedly being composed at 2:00 A.M., so my thoughts may not be the most coherent, and for that I apologize.
# Most of my concern stems from how best to add build steps to the various platforms without ending up with a completely broken Parrot for weeks and developers screaming about "What the *HELL* is this error? Where is this library?" brane explodes. If these issues have already been beaten to death and we've moved on to more interesting issues, of course I'll be interested there as well.

Overall you seem to be pretty on target. Of course, my brain isn't really built for character sets and stuff like that.

Also note that I went to bed at one, was rudely awakened by a screaming toddler at two, didn't fall asleep again till four, and woke up at nine, so I'm probably not very coherent. I feel a little dizzy--I'm gonna take a nap.

--Brent Dax [EMAIL PROTECTED]
@roles=map {"Parrot $_"} qw(embedding regexen Configure)

#define private public
    --Spotted in a C++ program just before a #include
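The UTF-16 versus UTF-32 question in the message above comes down to what happens at the boundary between the string internals and the RE engine. As a rough illustration only (this is neither Parrot nor ICU code; the type names `utf16_unit` and `utf32_unit` and the function `utf16_to_utf32` are invented for this sketch), a hand-off from a UTF-16 buffer to a UTF-32 buffer in plain ANSI C could look something like this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for this sketch; ANSI C only guarantees minimum widths. */
typedef unsigned short utf16_unit;  /* at least 16 bits */
typedef unsigned long  utf32_unit;  /* at least 32 bits */

/*
 * Decode `len` UTF-16 code units from `src` into `dst` and return the
 * number of code points written. `dst` must have room for `len` entries
 * (worst case: one code point per code unit). Unpaired surrogates are
 * replaced with U+FFFD.
 */
size_t utf16_to_utf32(const utf16_unit *src, size_t len, utf32_unit *dst)
{
    size_t i = 0;
    size_t out = 0;

    assert(src != NULL && dst != NULL);

    while (i < len) {
        utf32_unit u = src[i++];

        if (u >= 0xD800UL && u <= 0xDBFFUL
            && i < len && src[i] >= 0xDC00 && src[i] <= 0xDFFF) {
            /* High surrogate followed by a low surrogate: combine them. */
            u = 0x10000UL + ((u - 0xD800UL) << 10) + (src[i] - 0xDC00UL);
            i++;
        } else if (u >= 0xD800UL && u <= 0xDFFFUL) {
            /* Surrogate without a partner. */
            u = 0xFFFDUL;
        }
        dst[out++] = u;
    }
    return out;
}
```

Every hand-off of this kind is a linear pass over the string plus a second buffer, which is the performance hit Jeff asks about. Using the same representation on both sides of the boundary avoids the pass entirely, at the price of four bytes per code point where UTF-16 needs two for most characters.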
RE: Unicode thoughts...
> We also need to make sure ICU will work everywhere. And I do mean *everywhere*. Will it work on VMS? Palm OS? Crays?

Nope, nope, and nope. From their site -

Operating system      Compiler                   Testing frequency
Windows 98/NT/2000    Microsoft Visual C++ 6.0   Reference platform
Red Hat Linux 6.1     gcc 2.95.2                 Reference platform
AIX 4.3.3             xlC 3.6.4                  Reference platform
Solaris 2.6           Workshop Pro CC 4.2        Reference platform
HP/UX 11.01           aCC A.12.10                Reference platform
AIX 5.1.0 L           Visual Age C++ 5.0         Regularly tested
Solaris 2.7           Workshop Pro CC 6.0        Regularly tested
Solaris 2.6           gcc 2.91.66                Regularly tested
FreeBSD 4.4           gcc 2.95.3                 Regularly tested
HP/UX 11.01           CC A.03.10                 Regularly tested
OS/390 (zSeries)      CC r10                     Regularly tested
AS/400 (iSeries)      V5R1 iCC                   Rarely tested
NetBSD, OpenBSD                                  Rarely tested
SGI/IRIX                                         Rarely tested
PTX                                              Rarely tested
OS/2                  Visual Age                 Rarely tested
Macintosh                                        Needs help to port

-(MBrod)-
Re: Unicode thoughts...
This is rather concerning to me. As I understand it, one of the goals for parrot was to be able to have a usable subset of it which is totally platform-neutral (pure ANSI C). If we start to depend too much on another library which may not share that goal, we could have trouble with the parrot build process (which was supposed to be shipped as parrot bytecode).

--Josh

At 17:02 on 03/25/2002 PST, Charles Bunders [EMAIL PROTECTED] wrote:
> > We also need to make sure ICU will work everywhere. And I do mean *everywhere*. Will it work on VMS? Palm OS? Crays?
>
> Nope, nope, and nope.
RE: Unicode thoughts...
I think it will be relatively easy to deal with different compilers and different operating systems. However, ICU does contain some C++ code. That will make life much harder, since Parrot currently assumes only ANSI C (even a subset of it).

Hong

> This is rather concerning to me. As I understand it, one of the goals for parrot was to be able to have a usable subset of it which is totally platform-neutral (pure ANSI C). If we start to depend too much on another library which may not share that goal, we could have trouble with the parrot build process (which was supposed to be shipped as parrot bytecode)
Re: Unicode thoughts...
Hong Zhang wrote:
> I think it will be relative easy to deal with different compiler and different operating system. However, ICU does contain some C++ code. It will make life much harder, since current Parrot only assume ANSI C (even a subset of it).
>
> Hong
>
> > This is rather concerning to me. As I understand it, one of the goals for parrot was to be able to have a usable subset of it which is totally platform-neutral (pure ANSI C). If we start to depend too much on another library which may not share that goal, we could have trouble with the parrot build process (which was supposed to be shipped as parrot bytecode)

I guess it's obvious that I hadn't looked at the target platforms for ICU as closely as I probably should have. C vs. C++ doesn't concern me, as it can always be rewritten, but lack of platforms like OS X does.

Given that, I think we want an interim solution consisting of the basic Unicode utilities we'll need, such as Unicode_isdigit(). This can be a simple wrapper around isdigit() for the moment, until I sort out which files we need from the Unicode database, and what support functions/data structures will be required.

Given that we're dedicated to either UTF-16 or UTF-32 for internal string representation (undecided as of yet, and not affected by this), we can get away with creating a simple unicode.{c,h} suite of functions that looks like:

    Parrot_Int Parrot_isDigit(char* glyph);

We can get away with the simplicity here because the character array should already be a valid UTF-{16,32} string, and responsibility for making sure there's a valid glyph at that offset can be safely offloaded to the caller, if not higher up the calling chain. Also, it should be in a separate file because, assuming the final internal representation matches that of the RE engine, the engine can use these utilities as well.
Now, admittedly this is only slightly better-thought-out than the original proposal, but I think it has a much better chance of being implemented, and in a fairly short amount of time. (He said, knowing full well that there's always one more problem.) ASCII versions of the functions should be almost trivial, and can be left in there as a compile-time switch should we choose to do an ASCII-only or UTF-8-only version.

In conclusion, this approach feels more workable, and the full UTF-16 implementation details can be rolled out incrementally, rather than in a single mass migration. If this suggestion flies, I'll rewrite strings.pdd and post it in the next few days.

--
Jeff [EMAIL PROTECTED]
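To make the proposal above a little more concrete, here is a minimal sketch of what such a function might look like, under several loudly stated assumptions: the `Parrot_Int` typedef is a stand-in, the `PARROT_ASCII_ONLY` macro is a made-up name for the compile-time switch, the glyph is taken to be a single UTF-32 code unit stored big-endian, and the digit table is a tiny hand-picked subset rather than anything generated from the Unicode database. The parameter is also `const char *` here rather than the `char*` of the proposal. None of this is taken from the Parrot sources.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in; Parrot defines its own integer type elsewhere. */
typedef long Parrot_Int;

#ifndef PARROT_ASCII_ONLY
/*
 * First code point of a few runs of ten consecutive Unicode decimal
 * digits. Illustrative subset only: ASCII, Arabic-Indic, Extended
 * Arabic-Indic, Devanagari and fullwidth digits.
 */
static const unsigned long digit_run_start[] = {
    0x0030UL, 0x0660UL, 0x06F0UL, 0x0966UL, 0xFF10UL
};
#endif

/*
 * Assumption for this sketch: `glyph` points at one UTF-32 code unit
 * stored big-endian, and the caller has already validated the string.
 * Returns 1 for a digit, 0 otherwise.
 */
Parrot_Int Parrot_isDigit(const char *glyph)
{
    const unsigned char *p = (const unsigned char *)glyph;
    unsigned long cp;

    assert(glyph != NULL);

    /* Assemble the code point byte by byte to avoid alignment issues. */
    cp = ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16)
       | ((unsigned long)p[2] << 8)  |  (unsigned long)p[3];

#ifdef PARROT_ASCII_ONLY
    /* ASCII-only build: a thin wrapper around isdigit(). */
    return (cp < 0x80UL && isdigit((int)cp)) ? 1 : 0;
#else
    {
        size_t i;
        size_t n = sizeof digit_run_start / sizeof digit_run_start[0];

        for (i = 0; i < n; i++) {
            if (cp >= digit_run_start[i] && cp < digit_run_start[i] + 10UL)
                return 1;
        }
        return 0;
    }
#endif
}
```

Compiling with `-DPARROT_ASCII_ONLY` selects the isdigit() wrapper; without it the table lookup is used. Keeping both behind one signature is what lets the Unicode details be rolled out incrementally without touching callers.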
Re: Unicode thoughts...
Jeff wrote:
> Given that, I think an interim solution consisting of basic Unicode utilities we'll need, such as Unicode_isdigit(). This can be a simple wrapper around isdigit() for the moment, until I sort out which files we need from the Unicode database, and what support functions/data structures will be required.
>
> Given that we're dedicated to either UTF-16 or UTF-32 for internal string representation (undecided as of yet, and isn't affected by this), we can get away with creating a simple unicode.{c,h} suite of functions that looks like:
>
>     Parrot_Int Parrot_isDigit(char* glyph);

Okay, now I feel utterly silly, having just looked at chartypes/unicode.c. Well, that approach'll work. Wonder why nobody thought...greps for isdigit()...uh...never mind. I'll be over here, with the dunce cap on.

--
Jeff [EMAIL PROTECTED]