Re: [EPIC] What is a space? And who is using a non-C locale?

2004-11-29 Thread Ben Winslow
I'm more than a little late here (I've been rather busy lately), and the
decision has already been made; I don't have any problems with that
description, but I'll finally weigh in here and make some comments to
share the information (since I'm apparently the EPIC Unicode expert. ;)

On Mon, 2004-11-08 at 15:35 -0600, Jeremy Nelson wrote:
 So one of the things that has come up during the epic4 vote is how
 spaces are handled inconsistently throughout epic.  I had originally
 intended to address this in epic5, but it looks like this is a real
 problem for /xdebug extractw users.
 
 In the C locale, characters 9 (^I), 10 (^J), 11 (^K), 12 (^L), 13 (^M),
 and 32 (space) are considered spaces.  That means isspace(x) returns
 nonzero for any x = one of those characters, and 0 for everything else.
 
 EPIC has three ways to determine spaces:
 
 1) Use the system's isspace(), which could be locale-dependent.
 2) Use epic's my_isspace(), which behaves exactly the same as isspace()
    does in the C locale.
 3) Just compare the character against character 32 (space).
 
 It would be best if we just had one way, because that would be less 
 confusing.  But this is a problem because in some places in epic, a
 tab is a space, and in other places, it isn't.  So if I just switch
 everything to use isspace(), then some scripts might break in ways
 that I can't anticipate.
 
 I have two questions:
 
 *) Are any of you using tabs, newlines, carriage returns, etc, in any
    way that depends on them not being a space character in some context?
    If you are, you need to be a participant in this discussion!
 
 *) Are any of you using a non-C locale?  (If you don't know, then you are
    not)  Does your locale have a different set of space characters?

I'm using en_US.UTF-8 (US English, UTF-8 charset, of course); in this
locale, with glibc at least, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, and 0x20 are
all returned as spaces by isspace() (this matches the C locale.)
Unicode has several additional spacing characters (especially since it
adds several typographic spaces), but isspace() seems to ignore these
(well, only 0xa0 (non-breaking space) falls within the range isspace()
can handle), presumably so that old applications won't see different
behavior.
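
A check along these lines can be sketched in C.  This is only a minimal
sketch, not code from EPIC: list_space_bytes() is a hypothetical helper
name, and whether a locale such as "en_US.UTF-8" is installed depends on
the system.

```c
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper (not part of EPIC): collect every byte value
 * 0..255 that isspace() accepts under the named LC_CTYPE locale.
 * Returns the number of values stored in out[], or -1 if the locale
 * is not available.  The previous locale is restored on success. */
int list_space_bytes(const char *locale_name, int out[256])
{
    char saved[128];
    const char *cur = setlocale(LC_CTYPE, NULL);
    int n = 0;

    /* setlocale() may reuse its return buffer, so copy the old name. */
    snprintf(saved, sizeof saved, "%s", cur ? cur : "C");

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    for (int c = 0; c < 256; c++)
        if (isspace(c))        /* c is already in unsigned char range */
            out[n++] = c;

    setlocale(LC_CTYPE, saved);
    return n;
}

/* Print the space bytes for one locale, e.g. "C" or "en_US.UTF-8". */
void print_space_bytes(const char *locale_name)
{
    int bytes[256];
    int n = list_space_bytes(locale_name, bytes);

    if (n < 0) {
        printf("%s: locale not available\n", locale_name);
        return;
    }
    printf("%s:", locale_name);
    for (int i = 0; i < n; i++)
        printf(" 0x%02x", bytes[i]);
    printf("\n");
}
```

Calling print_space_bytes("C") and then print_space_bytes() with your own
locale name gives two lists to compare.  Note that isspace() only sees
single bytes, so it cannot classify multibyte characters at all.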

 Feedback for these two questions is greatly appreciated and may reduce
 near term pain and suffering. 

Part of the underlying problem here is that different people use
different character sets everywhere; for example, the author of a script
pack you want to use may be using Shift-JIS for Japanese, while you're
using the near-ubiquitous-for-English ISO-8859-1.  Obviously, a script
shouldn't (mal)function differently just because you're using a
different locale, so script behavior needs to be consistent.

The second part of the underlying problem is that there is no difference
between the character behavior for scripts and the rest of EPIC,
including user input; if you said 'Hello<EN SPACE>world!' to a channel,
and a script tried to parse that, there's no way for EPIC to decide
"this is a user's input, so we should treat it according to the locale's
rules" or "this is an internal part of a script, so we want to be
certain it functions the same way everywhere."

If you wanted to address this problem (which is obviously something
that's not in the scope of EPIC4), there are a couple of approaches that
are immediately obvious:

1) Define a certain character set in which all scripts are parsed.  I
imagine that any system on which EPIC still compiles has sufficient
locale support to handle this.
Pro: Existing functions could be made locale-aware without scripts
breaking between different locales.
Con: Moderately inconvenient if your favorite text editor doesn't
support on-the-fly character translation.
Con: May be a bit cumbersome to implement--it's not as simple as using
iconv() to translate the script from some known charset to the current
locale's charset, since the script locale may have characters that
cannot be represented in the current locale.

2) When/if EPIC is ever properly locale aware, add separate locale-aware
versions of several string functions, enabling scripts to decide on
their own where locale-specific spaces should have meaning.
Pro: Fairly simple to implement.  Basically approach #1, with the script
charset being US-ASCII and all bytes above 127 being left as-is.
Con: Scripters have to specifically support the locale-specific
functions in their scripts for things to Just Work for an end-user.
Con: The addition of several locale-aware text functions would enlarge
the function table and increase lookup times.
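
To make approach #2 a bit more concrete, here is a rough sketch in C of
what "separate locale-aware versions" could look like for one operation,
counting words.  All of the names here (count_words_fixed(),
count_words_locale(), and the helpers) are made up for illustration and
are not existing EPIC functions.

```c
#include <ctype.h>
#include <stddef.h>

/* Fixed rule: the six C-locale spaces, same answer in every locale.
 * Bytes above 127 are never spaces, so they are left as-is. */
static int fixed_space(int c)
{
    return c == ' ' || (c >= 9 && c <= 13);
}

/* Locale rule: defer to the C library's isspace(). */
static int locale_space(int c)
{
    return isspace(c) != 0;
}

/* Count runs of non-space bytes, using the supplied space test. */
static size_t count_words(const char *s, int (*is_sp)(int))
{
    size_t words = 0;
    int in_word = 0;

    for (; *s != '\0'; s++) {
        /* Cast so that bytes above 127 are not passed as negatives. */
        if (is_sp((unsigned char)*s)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            words++;
        }
    }
    return words;
}

/* What existing scripts would keep getting. */
size_t count_words_fixed(const char *s)
{
    return count_words(s, fixed_space);
}

/* The opt-in variant a script would call for user-visible text. */
size_t count_words_locale(const char *s)
{
    return count_words(s, locale_space);
}
```

With this split, a script that parses its own data keeps calling the
fixed version and behaves the same everywhere, while a script that wants
the user's locale rules has to ask for them explicitly.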

 Jeremy

As I said, this is all pretty well out of the scope of EPIC4; however, I
hope that this information will be helpful during the development of
EPIC5, especially if proper locale support is decided to be a Good Idea.

-- 
Ben Winslow [EMAIL PROTECTED]



[EPIC] What is a space? And who is using a non-C locale?

2004-11-08 Thread Jeremy Nelson
So one of the things that has come up during the epic4 vote is how
spaces are handled inconsistently throughout epic.  I had originally
intended to address this in epic5, but it looks like this is a real
problem for /xdebug extractw users.

In the C locale, characters 9 (^I), 10 (^J), 11 (^K), 12 (^L), 13 (^M),
and 32 (space) are considered spaces.  That means isspace(x) returns
nonzero for any x = one of those characters, and 0 for everything else.

EPIC has three ways to determine spaces:

1) Use the system's isspace(), which could be locale-dependent.
2) Use epic's my_isspace(), which behaves exactly the same as isspace()
   does in the C locale.
3) Just compare the character against character 32 (space).
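
For anyone who wants to see the difference side by side, the three tests
look roughly like this in C.  This is an illustrative sketch only: the
function names are invented here, and space_c_locale() is a stand-in that
mimics the described behavior of my_isspace(), not its actual source.

```c
#include <ctype.h>

/* 1) Locale-dependent: whatever the C library says.  The cast keeps
 *    bytes above 127 from being passed as negative values. */
int space_by_locale(int c)
{
    return isspace((unsigned char)c) != 0;
}

/* 2) Fixed C-locale set: characters 9 through 13, plus 32. */
int space_c_locale(int c)
{
    return c == 32 || (c >= 9 && c <= 13);
}

/* 3) Character 32 only: a tab is NOT a space here. */
int space_only(int c)
{
    return c == 32;
}
```

Test (2) and test (3) disagree on tab, newline, vertical tab, form feed,
and carriage return, which is exactly where the inconsistency shows up.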

It would be best if we just had one way, because that would be less 
confusing.  But this is a problem because in some places in epic, a
tab is a space, and in other places, it isn't.  So if I just switch
everything to use isspace(), then some scripts might break in ways
that I can't anticipate.

I have two questions:

*) Are any of you using tabs, newlines, carriage returns, etc, in any
   way that depends on them not being a space character in some context?
   If you are, you need to be a participant in this discussion!

*) Are any of you using a non-C locale?  (If you don't know, then you are
   not)  Does your locale have a different set of space characters?

Feedback for these two questions is greatly appreciated and may reduce
near term pain and suffering. 

Jeremy