RE: Unicode thoughts...

2002-03-30 Thread Dan Sugalski

At 4:32 PM -0800 3/25/02, Brent Dax wrote:
I *really* strongly suggest we include ICU in the distribution.  I
recently had to turn off mod_ssl in the Apache 2 distro because I
couldn't get OpenSSL downloaded and configured.

FWIW, ICU in the distribution is a given if we use it.

Parrot will require a C compiler and link tools (maybe make, but 
maybe not) to build on a target platform and nothing else. If we rely 
on ICU we must ship with it.
-- 
 Dan

--it's like this---
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk



Re: Unicode thoughts...

2002-03-30 Thread Josh Wilmes


Someone said that ICU requires a C++ compiler.  That's concerning to me, 
as is the issue of how we bootstrap our build process.  We were planning 
on a platform-neutral miniparrot, and IMHO that can't include ICU (as i'm 
sure it's not going to be written in pure ansi C)

--Josh

At 8:45 on 03/30/2002 EST, Dan Sugalski [EMAIL PROTECTED] wrote:

 At 4:32 PM -0800 3/25/02, Brent Dax wrote:
 I *really* strongly suggest we include ICU in the distribution.  I
 recently had to turn off mod_ssl in the Apache 2 distro because I
 couldn't get OpenSSL downloaded and configured.
 
 FWIW, ICU in the distribution is a given if we use it.
 
 Parrot will require a C compiler and link tools (maybe make, but 
 maybe not) to build on a target platform and nothing else. If we rely 
 on ICU we must ship with it.
 -- 
  Dan
 
 --it's like this---
 Dan Sugalski                          even samurai
 [EMAIL PROTECTED]                     have teddy bears and even
                                       teddy bears get drunk





Re: Unicode thoughts...

2002-03-30 Thread Dan Sugalski

At 10:07 AM -0500 3/30/02, Josh Wilmes wrote:
Someone said that ICU requires a C++ compiler.  That's concerning to me,
as is the issue of how we bootstrap our build process.  We were planning
on a platform-neutral miniparrot, and IMHO that can't include ICU (as i'm
sure it's not going to be written in pure ansi C)

If the C++ bits are redoable as C, I'm OK with it. I've not taken a 
good look at it to know how much it depends on C++. If it's mostly // 
comments and such we can work around the issues easily enough.

If it's objects, well, I suppose it depends on how much it relies on them.

At 8:45 on 03/30/2002 EST, Dan Sugalski [EMAIL PROTECTED] wrote:

  At 4:32 PM -0800 3/25/02, Brent Dax wrote:
  I *really* strongly suggest we include ICU in the distribution.  I
  recently had to turn off mod_ssl in the Apache 2 distro because I
  couldn't get OpenSSL downloaded and configured.

  FWIW, ICU in the distribution is a given if we use it.

  Parrot will require a C compiler and link tools (maybe make, but
  maybe not) to build on a target platform and nothing else. If we rely
   on ICU we must ship with it.

-- 
 Dan

--it's like this---
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk



Re: Unicode thoughts...

2002-03-30 Thread Jeff

Dan Sugalski wrote:
 
 At 10:07 AM -0500 3/30/02, Josh Wilmes wrote:
 Someone said that ICU requires a C++ compiler.  That's concerning to me,
 as is the issue of how we bootstrap our build process.  We were planning
 on a platform-neutral miniparrot, and IMHO that can't include ICU (as i'm
 sure it's not going to be written in pure ansi C)
 
 If the C++ bits are redoable as C, I'm OK with it. I've not taken a
 good look at it to know how much it depends on C++. If it's mostly //
 comments and such we can work around the issues easily enough.
 
 If it's objects, well, I suppose it depends on how much it relies on them.

Looking at icu/common I see more .c files than .cpp files, and what .cpp
files there are look somewhat like wrappers around the C code. In
addition, the .cpp files appear to do such things as create iterator
wrappers, bidirectional display and normalization.

Some of the files are thicker than wrappers, but I think there's enough
code behind this C++ veneer that we can at least use it if not the
entire library.
--
Jeff [EMAIL PROTECTED]



RE: Unicode thoughts...

2002-03-25 Thread Brent Dax

Jeff:
# This will likely open yet another can of worms, but Unicode has been
# delayed for too long, I think. It's time to add the Unicode libraries
# (In our case, the ICU libraries at http://oss.software.ibm.com/icu/,
# which Larry has now blessed) to Parrot. string.c already has
# (admittedly unavoidable, due to the library not being included)
# assumptions such as isdigit(). So, I have a few thoughts (that may
# have already been shot down by people wiser than I in such matters)
# to explicate, and some questions to ask.
#
# ICU should be added as a note in the README, and maybe to 'INSTALL' if
# we ever create one. Let's not add it to CVS, as it's not under our
# control. If we have to patch ICU to make it work correctly with
# Parrot, the patches should be submitted back to the ICU team. And I'm
# joining the appropriate mailing lists to keep apprised of development.

I *really* strongly suggest we include ICU in the distribution.  I
recently had to turn off mod_ssl in the Apache 2 distro because I
couldn't get OpenSSL downloaded and configured.

We also need to make sure ICU will work everywhere.  And I do mean
*everywhere*.  Will it work on VMS?  Palm OS?  Crays?

# Before Unicode goes into full swing, I need some idea of how we're
# going to deploy the libraries. On this note, I defer to the Configure
# master, Brent. I've already done some work with ICU, so I'm reasonably
# comfortable with migrating in one Unicode bit at a time, until we're
# ready for full UTF-16 compliance.
#
# The RE engine should (I'm speaking without having recently read the
# source, so feel free to correct me) not need to be migrated, as it's
# already using UTF-32 internally, which leaves just the string internals.
# These can be migrated to using ICU macros fairly easily (I've already
# done some of the work locally), so I think the main focus should be on
# encodings, as we'll have to eventually support the more common
# wide-character encodings such as KOI-8 and BIG5.

There are a few things that need to change, but they aren't big issues.
Mostly it's just places where character sets have been presumed.

However, I'm seriously thinking about a major re-architecture of the
regex engine, which would probably help these sorts of issues.

# I still have some questions about using UTF-16 internally for string
# representation (as mentioned in
# http:[EMAIL PROTECTED]/msg07856.html),
# but I've resolved most of those. It's an excellent match for the ICU
# library, as it uses UTF-16 internally. My only question is if we're
# going to incur a performance hit every time a scalar is transferred to
# the RE engine, as it uses UTF-32 internally.

That can change.  However, utf32 seems like the best match, as it would
allow us to reach into a string's guts for speed.  (We don't currently
do that, but if I do redesign the engine, I'll probably be able to.)
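
To make the transfer cost concrete, here is a minimal sketch of what handing a UTF-16 string to a UTF-32 engine involves: one pass over the code units, folding surrogate pairs into single code points. The function name utf16_to_utf32 and its error convention are assumptions for illustration only; this is neither Parrot nor ICU API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Parrot or ICU API): decode UTF-16 code units
 * into UTF-32 code points. Returns the number of code points written,
 * or (size_t)-1 on an unpaired surrogate or when dst is too small. */
size_t utf16_to_utf32(const uint16_t *src, size_t src_len,
                      uint32_t *dst, size_t dst_cap)
{
    size_t i = 0;   /* index into src (code units)  */
    size_t n = 0;   /* index into dst (code points) */

    while (i < src_len) {
        uint32_t cp = src[i++];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            uint32_t lo;
            if (i >= src_len)
                return (size_t)-1;
            lo = src[i];
            if (lo < 0xDC00 || lo > 0xDFFF)
                return (size_t)-1;
            i++;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            /* Low surrogate with no preceding high surrogate. */
            return (size_t)-1;
        }

        if (n >= dst_cap)
            return (size_t)-1;
        dst[n++] = cp;
    }
    return n;
}
```

The conversion is linear in the string length, so the hit is paid once per transfer rather than per character access; after that the engine can index code points directly.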

# Also, once we have UTF-16 running internally, I'd be interested in
# seeing what memory consumption looks like vs. UTF-32, because I'd like
# to see if it makes sense to add a compile-time switch between UTF-8
# and UTF-32 to let the installer decide on memory tradeoffs. ICU has an
# internal macro that defines its own internal representation, and that
# could conflict with our intended usage as well.
#
# Performance would suffer in the UTF-8 case, naturally, but the
# difference in memory usage might be significant enough that we'd want
# to leave the decision up to the installer. Having said that, the
# headache of testing multiple versions of Perl6 might not be worth it.
#
# So, to wrap up, I'm soliciting thoughts on how best to start the
# Unicode migration, and deal with the inevitable problems that will
# come up. I'm hoping that most of the magic will be hidden in string.c,
# where we won't have to worry about it, but we'll have to see.
#
# Now, this is admittedly being composed at 2:00 A.M., so my thoughts may
# not be the most coherent, and for that I apologize. Most of my concern
# stems from how best to add build steps to the various platforms
# without ending up with a completely broken Parrot for weeks and
# developers screaming about "What the *HELL* is this error? Where is
# this library?" brane explodes. If these issues have already been
# beaten to death and we've moved on to more interesting issues, of
# course I'll be interested there as well.

Overall you seem to be pretty on target.  Of course, my brain isn't
really built for character sets and stuff like that.

Also note that I went to bed at one, was rudely awakened by a screaming
toddler at two, didn't fall asleep again till four, and woke up at nine,
so I'm probably not very coherent.  I feel a little dizzy--I'm gonna
take a nap.

--Brent Dax [EMAIL PROTECTED]
@roles=map {Parrot $_} qw(embedding regexen Configure)

#define private public
--Spotted in a C++ program just before a #include




RE: Unicode thoughts...

2002-03-25 Thread Charles Bunders

 
 We also need to make sure ICU will work everywhere.  And I do mean
 *everywhere*.  Will it work on VMS?  Palm OS?  Crays?

Nope, nope, and nope.

From their site -

Operating system      Compiler                   Testing frequency
Windows 98/NT/2000    Microsoft Visual C++ 6.0   Reference platform
Red Hat Linux 6.1     gcc 2.95.2                 Reference platform
AIX 4.3.3             xlC 3.6.4                  Reference platform
Solaris 2.6           Workshop Pro CC 4.2        Reference platform
HP/UX 11.01           aCC A.12.10                Reference platform
AIX 5.1.0 L           Visual Age C++ 5.0         Regularly tested
Solaris 2.7           Workshop Pro CC 6.0        Regularly tested
Solaris 2.6           gcc 2.91.66                Regularly tested
FreeBSD 4.4           gcc 2.95.3                 Regularly tested
HP/UX 11.01           CC A.03.10                 Regularly tested
OS/390 (zSeries)      CC r10                     Regularly tested
AS/400 (iSeries)      V5R1 iCC                   Rarely tested
NetBSD, OpenBSD       -                          Rarely tested
SGI/IRIX              -                          Rarely tested
PTX                   -                          Rarely tested
OS/2                  Visual Age                 Rarely tested
Macintosh             -                          Needs help to port

-(MBrod)-




Re: Unicode thoughts...

2002-03-25 Thread Josh Wilmes


This is rather concerning to me.  As I understand it, one of the goals for 
parrot was to be able to have a usable subset of it which is totally 
platform-neutral (pure ANSI C).   If we start to depend too much on 
another library which may not share that goal, we could have trouble 
with the parrot build process (which was supposed to be shipped as parrot 
bytecode)

--Josh

At 17:02 on 03/25/2002 PST, Charles Bunders [EMAIL PROTECTED] wrote:

  
  We also need to make sure ICU will work everywhere.  And I do mean
  *everywhere*.  Will it work on VMS?  Palm OS?  Crays?
 
 Nope, nope, and nope.
 
 From their site -
 
 Operating system      Compiler                   Testing frequency
 Windows 98/NT/2000    Microsoft Visual C++ 6.0   Reference platform
 Red Hat Linux 6.1     gcc 2.95.2                 Reference platform
 AIX 4.3.3             xlC 3.6.4                  Reference platform
 Solaris 2.6           Workshop Pro CC 4.2        Reference platform
 HP/UX 11.01           aCC A.12.10                Reference platform
 AIX 5.1.0 L           Visual Age C++ 5.0         Regularly tested
 Solaris 2.7           Workshop Pro CC 6.0        Regularly tested
 Solaris 2.6           gcc 2.91.66                Regularly tested
 FreeBSD 4.4           gcc 2.95.3                 Regularly tested
 HP/UX 11.01           CC A.03.10                 Regularly tested
 OS/390 (zSeries)      CC r10                     Regularly tested
 AS/400 (iSeries)      V5R1 iCC                   Rarely tested
 NetBSD, OpenBSD       -                          Rarely tested
 SGI/IRIX              -                          Rarely tested
 PTX                   -                          Rarely tested
 OS/2                  Visual Age                 Rarely tested
 Macintosh             -                          Needs help to port
 
 -(MBrod)-
 





RE: Unicode thoughts...

2002-03-25 Thread Hong Zhang


I think it will be relatively easy to deal with different compilers
and different operating systems. However, ICU does contain some
C++ code. It will make life much harder, since current Parrot
only assumes ANSI C (even a subset of it).

Hong

 This is rather concerning to me.  As I understand it, one of 
 the goals for 
 parrot was to be able to have a usable subset of it which is totally 
 platform-neutral (pure ANSI C).   If we start to depend too much on 
 another library which may not share that goal, we could have trouble 
 with the parrot build process (which was supposed to be 
 shipped as parrot bytecode)



Re: Unicode thoughts...

2002-03-25 Thread Jeff

Hong Zhang wrote:
 
 I think it will be relatively easy to deal with different compilers
 and different operating systems. However, ICU does contain some
 C++ code. It will make life much harder, since current Parrot
 only assumes ANSI C (even a subset of it).
 
 Hong
 
  This is rather concerning to me.  As I understand it, one of
  the goals for
  parrot was to be able to have a usable subset of it which is totally
  platform-neutral (pure ANSI C).   If we start to depend too much on
  another library which may not share that goal, we could have trouble
  with the parrot build process (which was supposed to be
  shipped as parrot bytecode)

I guess it's obvious that I hadn't looked at the target platforms for
ICU as closely as I probably should have. C vs. C++ doesn't concern me,
as it can always be rewritten, but lack of platforms like OS X does.
Given that, I propose an interim solution consisting of the basic Unicode
utilities we'll need, such as Unicode_isdigit(). This can be a simple
wrapper around isdigit() for the moment, until I sort out which files we
need from the Unicode database, and what support functions/data
structures will be required.

Given that we're dedicated to either UTF-16 or UTF-32 for internal
string representation (undecided as of yet, and isn't affected by this),
we can get away with creating a simple unicode.{c,h} suite of functions
that looks like:

Parrot_Int Parrot_isDigit(char* glyph);

We can get away with the simplicity here because the character array
should already be a valid UTF-{16,32} string, and responsibility for
making sure there's a valid glyph at that offset can be safely offloaded
to the caller, if not higher up the calling chain. Also, it should be in
a separate file because, assuming the final internal representation
matches that of the RE engine, the engine can use these utilities as
well.
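
A minimal sketch of that interim function, assuming a UTF-32 internal representation in native byte order. The Parrot_Int typedef below is a stand-in (an assumption, not Parrot's real definition), and anything outside ASCII is simply reported as "not a digit" until the Unicode database is wired in.

```c
#include <ctype.h>
#include <stdint.h>
#include <string.h>

typedef long Parrot_Int;   /* assumption: stand-in for Parrot's integer type */

/* Interim sketch: glyph points at one UTF-32 code unit inside a string the
 * caller has already validated. No bounds or validity checks happen here;
 * that responsibility sits with the caller, as described above. */
Parrot_Int Parrot_isDigit(const char *glyph)
{
    uint32_t cp;

    /* memcpy avoids alignment assumptions about the char buffer. */
    memcpy(&cp, glyph, sizeof cp);

    /* For the moment only ASCII is handled, by deferring to isdigit(). */
    if (cp > 0x7F)
        return 0;
    return isdigit((int)cp) != 0;
}
```

A caller walking a UTF-32 buffer would pass the address of each code unit, cast to const char *.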

Now, admittedly this is only slightly better-thought-out than the
original proposal, but I think it has a much better chance of being
implemented, and in a fairly short amount of time. (He said, knowing
full well that there's always one more problem.) ASCII versions of the
functions should be almost trivial, and can be left in there as a
compile-time switch should we choose to do an ASCII-only or UTF-8-only
version.
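
The compile-time switch could look roughly like the sketch below. The flag name PARROT_ASCII_ONLY, the function name unicode_is_digit, and the hand-written range table are all assumptions for illustration; a real version would generate its table from the Unicode database rather than list a few blocks by hand.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical build flag: define PARROT_ASCII_ONLY for the trivial version. */
#ifdef PARROT_ASCII_ONLY

int unicode_is_digit(uint32_t cp)
{
    return cp >= '0' && cp <= '9';
}

#else

struct cp_range {
    uint32_t first;
    uint32_t last;
};

/* A deliberately tiny sample of decimal-digit ranges, not the full set. */
static const struct cp_range digit_ranges[] = {
    { 0x0030, 0x0039 },   /* ASCII digits        */
    { 0x0660, 0x0669 },   /* Arabic-Indic digits */
    { 0x0966, 0x096F },   /* Devanagari digits   */
    { 0xFF10, 0xFF19 }    /* fullwidth digits    */
};

int unicode_is_digit(uint32_t cp)
{
    size_t i;

    for (i = 0; i < sizeof digit_ranges / sizeof digit_ranges[0]; i++) {
        if (cp >= digit_ranges[i].first && cp <= digit_ranges[i].last)
            return 1;
    }
    return 0;
}

#endif
```

Both branches expose the same signature, so callers such as the RE engine would not need to know which build they are linked against.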

In conclusion, this approach feels more workable, and the full UTF-16
implementation details can be rolled out incrementally, rather than in a
single mass migration. If this suggestion flies, I'll rewrite
strings.pdd and post it in the next few days.
--
Jeff [EMAIL PROTECTED]



Re: Unicode thoughts...

2002-03-25 Thread Jeff

Jeff wrote:
 
 Hong Zhang wrote:
 
  I think it will be relatively easy to deal with different compilers
  and different operating systems. However, ICU does contain some
  C++ code. It will make life much harder, since current Parrot
  only assumes ANSI C (even a subset of it).
 
  Hong
 
   This is rather concerning to me.  As I understand it, one of
   the goals for
   parrot was to be able to have a usable subset of it which is totally
   platform-neutral (pure ANSI C).   If we start to depend too much on
   another library which may not share that goal, we could have trouble
   with the parrot build process (which was supposed to be
   shipped as parrot bytecode)
 
 I guess it's obvious that I hadn't looked at the target platforms for
 ICU as closely as I probably should have. C vs. C++ doesn't concern me,
 as it can always be rewritten, but lack of platforms like OS X does.
 Given that, I propose an interim solution consisting of the basic Unicode
 utilities we'll need, such as Unicode_isdigit(). This can be a simple
 wrapper around isdigit() for the moment, until I sort out which files we
 need from the Unicode database, and what support functions/data
 structures will be required.
 
 Given that we're dedicated to either UTF-16 or UTF-32 for internal
 string representation (undecided as of yet, and isn't affected by this),
 we can get away with creating a simple unicode.{c,h} suite of functions
 that looks like:
 
 Parrot_Int Parrot_isDigit(char* glyph);
 
 We can get away with the simplicity here because the character array
 should already be a valid UTF-{16,32} string, and responsibility for
 making sure there's a valid glyph at that offset can be safely offloaded
 to the caller, if not higher up the calling chain. Also, it should be in
 a separate file because, assuming the final internal representation
 matches that of the RE engine, the engine can use these utilities as
 well.
 
 Now, admittedly this is only slightly better-thought-out than the
 original proposal, but I think it has a much better chance of being
 implemented, and in a fairly short amount of time. (He said, knowing
 full well that there's always one more problem.) ASCII versions of the
 functions should be almost trivial, and can be left in there as a
 compile-time switch should we choose to do an ASCII-only or UTF-8-only
 version.
 
 In conclusion, this approach feels more workable, and the full UTF-16
 implementation details can be rolled out incrementally, rather than in a
 single mass migration. If this suggestion flies, I'll rewrite
 strings.pdd and post it in the next few days.
 --
 Jeff [EMAIL PROTECTED]

Okay, now I feel utterly silly, having just looked at
chartypes/unicode.c. Well, that approach'll work. Wonder why nobody
thought...greps for isdigit()...uh...never mind. I'll be over here,
with the dunce cap on.
--
Jeff [EMAIL PROTECTED]