Re: unicode
On 17/09/16 13:34, Moritz Lenz wrote:
>> Searching further I found the ucd2c.pl program in the Moarvm tools
>> directory. This generates the unicode_db.c somewhere else in the
>> rakudo tree. I run this program myself on the Unicode 9.0.0
>> database and comparing the generated files shows many differences
>> between the one in the rakudo tree and the generated one.
>
> Please make a rakudo spectest with those changes, and if it passes,
> submit your patch as a pull request.

Unicode support is more than just having the data from the text files in our own unicode database. In Unicode 9, the Zero Width Joiner is now explicitly supported for emoji. If we don't change the algorithm that creates individual graphemes from streams of codepoints to consider this, we'll end up with improper support for 8 (because new stuff is in there) and improper support for 9 (because some stuff is missing) at the same time; I suspect that'll help nobody.

I expect Jnthn will do the full & proper update during the coming month, and running ucd2c.pl is the least time-consuming step of that, but perhaps a pull request for this is still welcome.
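The ZWJ point can be illustrated outside Perl 6. This Python sketch (Python's str is codepoint-indexed, so it exposes the raw codepoint stream) shows that a ZWJ emoji family is several codepoints glued together by U+200D, which a Unicode-9-aware grapheme segmenter should treat as a single grapheme:

```python
import unicodedata

# "Family: woman, woman, girl": three emoji joined by ZERO WIDTH JOINER.
family = "\U0001F469\u200D\U0001F469\u200D\U0001F467"

# At the codepoint level this is five characters...
print(len(family))  # 5
print([unicodedata.name(c) for c in family])

# ...two of which are U+200D ZERO WIDTH JOINER, the signal (per UAX #29
# as amended for Unicode 9 emoji) that the pieces should segment and
# render as ONE grapheme. A segmenter that ignores ZWJ sees 3 graphemes.
print(family.count("\u200D"))  # 2
```

So updating ucd2c.pl's data tables alone would leave the grapheme-building algorithm splitting this into three graphemes instead of one.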
Re: unicode
Hi,

I am looking forward to it.

Thanks,
Marcel

On Sat, Sep 17, 2016 at 01:34:45PM +0200, Moritz Lenz wrote:
> Hi,
>
> On 17.09.2016 13:12, MT wrote:
> > The date found in the file unicode_db.c file is 2012-07-20 which is
> > about Unicode version 6.1.0

So the content in that file is not getting updated when the shipped Unicode version is updated? If so, is there a tool that needs fixing to automate that?

> docs/ChangeLog in MoarVM says
>
> + Updated to Unicode 8
>
> in the section of the 2015.07 release, so it's not that bad :-)

I believe that the plan is to update to Unicode 9 just after this month's release (to give a whole month to iron out any instabilities or bugs). So it might be a little bit bad this month, but next month will be awesome. Allegedly :-)

Nicholas Clark
Re: unicode
On Sat, Sep 17, 2016 at 01:34:45PM +0200, Moritz Lenz wrote:
> Hi,
>
> On 17.09.2016 13:12, MT wrote:
> > The date found in the file unicode_db.c file is 2012-07-20 which is
> > about Unicode version 6.1.0

So the content in that file is not getting updated when the shipped Unicode version is updated? If so, is there a tool that needs fixing to automate that?

> docs/ChangeLog in MoarVM says
>
> + Updated to Unicode 8
>
> in the section of the 2015.07 release, so it's not that bad :-)

I believe that the plan is to update to Unicode 9 just after this month's release (to give a whole month to iron out any instabilities or bugs). So it might be a little bit bad this month, but next month will be awesome. Allegedly :-)

Nicholas Clark
Re: unicode
> > Searching further I found the ucd2c.pl program in the Moarvm tools
> > directory. This generates the unicode_db.c somewhere else in the
> > rakudo tree. I run this program myself on the Unicode 9.0.0 database
> > and comparing the generated files shows many differences between the
> > one in the rakudo tree and the generated one.
>
> Please make a rakudo spectest with those changes, and if it passes,
> submit your patch as a pull request.
>
> > The date found in the file unicode_db.c file is 2012-07-20 which is
> > about Unicode version 6.1.0

How do I proceed from here? Do I pull in the newest rakudo version, make another git branch, then change it, and then push the branch after the tests have run successfully? This way I am not able to cripple the rakudo code. Other people can check the changes too before merging.

> docs/ChangeLog in MoarVM says
>
> + Updated to Unicode 8
>
> in the section of the 2015.07 release, so it's not that bad :-)

I have seen it now; indeed not that old, but it also means the Unicode data changes a lot between versions.

Greets,
Marcel
Re: unicode
Hi,

On 17.09.2016 13:12, MT wrote:
> Searching further I found the ucd2c.pl program in the Moarvm tools
> directory. This generates the unicode_db.c somewhere else in the rakudo
> tree. I run this program myself on the Unicode 9.0.0 database and
> comparing the generated files shows many differences between the one in
> the rakudo tree and the generated one.

Please make a rakudo spectest with those changes, and if it passes, submit your patch as a pull request.

> The date found in the file unicode_db.c file is 2012-07-20 which is
> about Unicode version 6.1.0

docs/ChangeLog in MoarVM says

+ Updated to Unicode 8

in the section of the 2015.07 release, so it's not that bad :-)

Cheers,
Moritz

--
Moritz Lenz
https://deploybook.com/ -- https://perlgeek.de/ -- https://perl6.org/
Re: Unicode Categories
Tom Christiansen wrote:
> Patrick wrote:
> : * Almost. E.g. isL would be nice to have as well.
> :
> : Those exist also:
> :
> : $ ./perl6
> : say 'abCD34' ~~ / <isL> /
> : a
> : say 'abCD34' ~~ / <isN> /
> : 3
>
> They may exist, but I'm not certain it's a good idea to encourage the
> Is_XXX approach on *anything* except Script=XXX properties. They
> certainly don't work on everything, you know. Also, I can't for the
> life of me see why one would ever write isL when Letter is so much more
> obvious; similarly, for isN over Number. Just because you can do so,
> doesn't mean you necessarily should.
>
> http://unicode.org/reports/tr18/#Categories
>
>     The recommended names for UCD properties and property values are in
>     PropertyAliases.txt [Prop] and PropertyValueAliases.txt [PropValue].
>     There are both abbreviated names and longer, more descriptive names.
>     It is strongly recommended that both names be recognized, and that
>     loose matching of property names be used, whereby the case
>     distinctions, whitespace, hyphens, and underbar are ignored.
>
> Furthermore, be aware that the Number property is *NOT* the same as the
> Decimal_Number property. In perl5, if one wants [0-9], then one
> expresses it exactly that way, since that's a lot shorter than writing
> (?=\p{ASCII})\p{Nd}, where Nd can also be Decimal_Number. Again, please
> note that Number is far broader than even Decimal_Number, which is
> itself almost certainly broader than you're thinking.
>
> Here's a trio of little programs specifically designed to help scout
> out Unicode characters and their properties. They work best on 5.12+,
> but should be ok on 5.10, too.
>
> --tom

The 'Is' prefix can be used on any property in 5.12 for which there is no naming conflict. The only naming conflicts are certain of the block properties, such as Arabic: IsArabic means the Arabic script, while InArabic means the base Arabic block. Personally, I find Is and In unintuitive, and prefer to write sc=arabic or blk=arabic instead.
When Unicode proposed to add some properties in 5.2 that started with 'Is', there was significant enough protest that they backed off, and promised never to do it again, adding a stability policy to 6.0 to that effect. Apparently a number of languages use 'Is' as a prefix.
Re: Unicode Categories
> The 'Is' prefix can be used on any property in 5.12 for which there is
> no naming conflict. The only naming conflicts are certain of the block
> properties, such as Arabic. IsArabic means the Arabic script. InArabic
> means the base Arabic block. Personally, I find Is and In unintuitive,
> and prefer to write sc=arabic or blk=arabic instead.

I agree.

> When Unicode proposed to add some properties in 5.2 that started with
> 'Is', there was significant enough protest that they backed off, and
> promised never to do it again, adding a stability policy to 6.0 to that
> effect. Apparently a number of languages use 'Is' as a prefix.

Yes, that's right. Even worse, there are languages that are very very bad about Is vs In, giving the wrong sense to them.

--tom
Re: Unicode Categories
On Wed, Nov 10, 2010 at 01:03:26PM -0500, Chase Albert wrote:
> Sorry if this is the wrong forum. I was wondering if there was a way to
> specify unicode categories
> (http://www.fileformat.info/info/unicode/category/index.htm) in a
> regular expression (and hence a grammar), or if there would be any
> consideration for adding support for that (requiring some kind of
> special syntax).

Unicode categories are done using assertion syntax with "is" followed by the category name. Thus <isLu> (uppercase letter), <isNd> (decimal digit), <isZs> (space separator), etc. This even works in Rakudo today:

    $ ./perl6
    say 'abcdEFG' ~~ / <isLu> /
    E

They can also be combined, as in <+isLu+isLt> (uppercase+titlecase). The relevant section of the spec is in Synopsis 5; search for "Unicode properties are always available with a prefix".

Hope this helps!

Pm
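The Lu/Nd/Zs codes the Rakudo assertions match on are Unicode General Category values, which can be inspected from any language with Unicode tables. A Python sketch of the same lookup (Python here only illustrates the category data, not the Perl 6 regex API):

```python
import unicodedata

# General Category of each character: the same Lu/Nd/Zs codes
# that Rakudo's isLu / isNd / isZs assertions test.
for ch in "abcdEFG":
    print(ch, unicodedata.category(ch))

# First uppercase letter (category Lu), mirroring
#   say 'abcdEFG' ~~ / <isLu> /    which matches 'E'
first_upper = next(c for c in "abcdEFG" if unicodedata.category(c) == "Lu")
print(first_upper)  # E
```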
Re: Unicode Categories
That's exactly what I was looking for*. Awesome, thank you.

~Cheers

* Almost. E.g. isL would be nice to have as well.

On Wed, Nov 10, 2010 at 13:15, Patrick R. Michaud pmich...@pobox.com wrote:
> Unicode properties are always available with a prefix
Re: Unicode Categories
On Wed, Nov 10, 2010 at 01:21:57PM -0500, Chase Albert wrote:
> That's exactly what I was looking for*. Awesome, thank you.
>
> * Almost. E.g. isL would be nice to have as well.

Those exist also:

    $ ./perl6
    say 'abCD34' ~~ / <isL> /
    a
    say 'abCD34' ~~ / <isN> /
    3

Pm
Re: Unicode Categories
Even awesomer, thank you again.

On Wed, Nov 10, 2010 at 13:28, Patrick R. Michaud pmich...@pobox.com wrote:
> On Wed, Nov 10, 2010 at 01:21:57PM -0500, Chase Albert wrote:
> > That's exactly what I was looking for*. Awesome, thank you.
> >
> > * Almost. E.g. isL would be nice to have as well.
>
> Those exist also:
>
>     $ ./perl6
>     say 'abCD34' ~~ / <isL> /
>     a
>     say 'abCD34' ~~ / <isN> /
>     3
>
> Pm
Re: Unicode Categories
Patrick wrote at 12:15pm CST on Wednesday, 10 November 2010:

> > Sorry if this is the wrong forum. I was wondering if there was a way
> > to specify unicode categories
> > (http://www.fileformat.info/info/unicode/category/index.htm) in a
> > regular expression (and hence a grammar), or if there would be any
> > consideration for adding support for that (requiring some kind of
> > special syntax).
>
> Unicode categories are done using assertion syntax with "is" followed
> by the category name. Thus <isLu> (uppercase letter), <isNd> (decimal
> digit), <isZs> (space separator), etc. This even works in Rakudo today:
>
>     $ ./perl6
>     say 'abcdEFG' ~~ / <isLu> /
>     E
>
> They can also be combined, as in <+isLu+isLt> (uppercase+titlecase).
> The relevant section of the spec is in Synopsis 5; search for "Unicode
> properties are always available with a prefix".
>
> Hope this helps!

Actually, that quote from Synopsis raises more questions than it answers. Below I've annotated the three output groups with (letters):

    % uniprops -a A
    U+0041 ‹A› \N{ LATIN CAPITAL LETTER A }:
    (A) \w \pL \p{LC} \p{L_} \p{L} \p{Lu}
    (B) AHex ASCII_Hex_Digit All Any Alnum Alpha Alphabetic ASCII Assigned
        Cased Cased_Letter LC Changes_When_Casefolded CWCF
        Changes_When_Casemapped CWCM Changes_When_Lowercased CWL
        Changes_When_NFKC_Casefolded CWKCF Lu L Gr_Base Grapheme_Base
        Graph GrBase Hex XDigit Hex_Digit ID_Continue IDC ID_Start IDS
        Letter L_ Latin Latn Uppercase_Letter PerlWord PosixAlnum
        PosixAlpha PosixGraph PosixPrint PosixUpper Print Upper Uppercase
        Word XID_Continue XIDC XID_Start XIDS
    (C) Age:1.1 Block=Basic_Latin Bidi_Class:L Bidi_Class=Left_To_Right
        Bidi_Class:Left_To_Right Bc=L Block:ASCII Block:Basic_Latin
        Blk=ASCII Canonical_Combining_Class:0
        Canonical_Combining_Class=Not_Reordered
        Canonical_Combining_Class:Not_Reordered Ccc=NR
        Canonical_Combining_Class:NR Decomposition_Type:None Dt=None
        East_Asian_Width:Na East_Asian_Width=Narrow
        East_Asian_Width:Narrow Ea=Na Grapheme_Cluster_Break:Other GCB=XX
        Grapheme_Cluster_Break:XX Grapheme_Cluster_Break=Other
        Hangul_Syllable_Type:NA Hangul_Syllable_Type=Not_Applicable
        Hangul_Syllable_Type:Not_Applicable Hst=NA
        Joining_Group:No_Joining_Group Jg=NoJoiningGroup
        Joining_Type:Non_Joining Jt=U Joining_Type:U
        Joining_Type=Non_Joining Script=Latin Line_Break:AL
        Line_Break=Alphabetic Line_Break:Alphabetic Lb=AL
        Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1
        Age=1.1 In=1.1 Present_In:2.0 In=2.0 Present_In:2.1 In=2.1
        Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2 In=3.2
        Present_In:4.0 In=4.0 Present_In:4.1 In=4.1 Present_In:5.0 In=5.0
        Present_In:5.1 In=5.1 Present_In:5.2 In=5.2 Script:Latin Sc=Latn
        Script:Latn Sentence_Break:UP Sentence_Break=Upper
        Sentence_Break:Upper SB=UP Word_Break:ALetter WB=LE Word_Break:LE
        Word_Break=ALetter

What that means is that the B properties are properties from the *General* category. They may all be referred to as \p{X} or \p{IsX}, \p{General_Category=X} or \p{General_Category:X}, and \p{GC=X} or \p{GC:X}. I have a feeling that your synopsis quote is referring only to type B properties alone. It is not talking about type C properties, which must also be accounted for.

--tom
Re: Unicode Categories
Patrick wrote:

: * Almost. E.g. isL would be nice to have as well.
:
: Those exist also:
:
: $ ./perl6
: say 'abCD34' ~~ / <isL> /
: a
: say 'abCD34' ~~ / <isN> /
: 3

They may exist, but I'm not certain it's a good idea to encourage the Is_XXX approach on *anything* except Script=XXX properties. They certainly don't work on everything, you know. Also, I can't for the life of me see why one would ever write isL when Letter is so much more obvious; similarly, for isN over Number. Just because you can do so, doesn't mean you necessarily should.

http://unicode.org/reports/tr18/#Categories

    The recommended names for UCD properties and property values are in
    PropertyAliases.txt [Prop] and PropertyValueAliases.txt [PropValue].
    There are both abbreviated names and longer, more descriptive names.
    It is strongly recommended that both names be recognized, and that
    loose matching of property names be used, whereby the case
    distinctions, whitespace, hyphens, and underbar are ignored.

Furthermore, be aware that the Number property is *NOT* the same as the Decimal_Number property. In perl5, if one wants [0-9], then one expresses it exactly that way, since that's a lot shorter than writing (?=\p{ASCII})\p{Nd}, where Nd can also be Decimal_Number. Again, please note that Number is far broader than even Decimal_Number, which is itself almost certainly broader than you're thinking.

Here's a trio of little programs specifically designed to help scout out Unicode characters and their properties. They work best on 5.12+, but should be ok on 5.10, too.

--tom

unitrio.tar.gz
Description: application/tar
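Tom's Number-vs-Decimal_Number distinction is easy to see with any Unicode property database. A Python illustration (Python only supplies the category data here; the \p{...} names are Perl's):

```python
import unicodedata

# U+0663 ARABIC-INDIC DIGIT THREE is Nd (Decimal_Number) but not [0-9]:
print(unicodedata.category("\u0663"))  # Nd
print(int("\u0663"))                   # 3 -- int() accepts any Nd digit

# U+216B ROMAN NUMERAL TWELVE is a Number (category Nl, Letter_Number)
# but NOT a Decimal_Number:
print(unicodedata.category("\u216b"))  # Nl
print(unicodedata.numeric("\u216b"))   # 12.0

# So \p{Number} matches both, \p{Nd} matches only the first,
# and [0-9] matches neither.
```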
Re: Unicode in 'NFG' formation ?
Larry Wall wrote:
> On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:
> > 2) Can I use Unicode property matching safely with graphemes? If yes,
> > who or what maintains the necessary tables?
>
> Good question. My assumption is that adding marks to a character
> doesn't change its fundamental nature. What needs to be provided other
> than pass-through to the base character's properties?

This will work in most cases, but e.g. not with the property ASCII_Hex_Digit. LATIN SMALL LETTER A is ASCII_Hex_Digit, but GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE is _not_ ASCII_Hex_Digit.

I will try to generate some millions of cases based on nfc(nfd($string)) to find out the best inheritance rules.

> > 4) Should the definition of graphemes conform to Unicode Standard
> > Annex #29 'grapheme clusters'? Which level - legacy, extended or
> > tailored?
>
> No opinion, other than that we're aiming for the most modern
> formulation that doesn't implicitly cede declarational control to
> something out of the control of Perl 6 declarations. (See locales for
> an example of something Perl 6 ignores in the absence of an explicit
> declaration to pay attention to them.) So just guessing from the names
> without reading the Annex in question, not legacy, but probably
> extended, with explicit tailoring allowed by declaration. (Unless
> extended has some dire performance or policy consequences that would be
> contraindicative...)

Will look into ICU what's supported.

> So as long as we stay inside these fundamental Perl 6 design
> principles, feel free to whack on the specs.

OK. Hopefully some Indic, Arabic and Asian natives review this.

Helmut Wollmersdorfer
Re: Unicode in 'NFG' formation ?
Larry Wall larry-at-wall.org |Perl 6| wrote:
> On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:
> > [1] Open questions:
> > 1) Will graphemes have an unique charname? e.g.
> > GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE
>
> Yes, presumably that comes with the normalization part of NFG. We're
> not aiming for round-tripping of synthetic codepoints, just as NFC
> doesn't do round-tripping of sequences that have precomposed
> codepoints. We're really just extending the NFC notion a bit further
> to encompass temporary precomposed codepoints.

Unique for asking for the name, not when specifying the name. Just as with the code-point order, any combination that means the same should give the same grapheme, just as if you had created the code point sequence first. Perhaps you are not realizing that the different classes of modifiers are independent. You could say DOT ABOVE AND DOT BELOW and get the same thing as DOT BELOW AND DOT ABOVE.

> > 2) Can I use Unicode property matching safely with graphemes? If yes,
> > who or what maintains the necessary tables?
>
> Good question. My assumption is that adding marks to a character
> doesn't change its fundamental nature. What needs to be provided other
> than pass-through to the base character's properties?

Depends on the property! Being a modifier, for example. A detailed look would be needed to decide which properties just pass through to the base char, which are enhanced (e.g. letter becomes letter with modifiers), which don't make sense, which are mostly OK but change sometimes, etc.
Re: Unicode in 'NFG' formation ?
John M. Dlugosz wrote:
> I was going over S02, and found it opens with, "By default Perl
> presents Unicode in NFG formation, where each grapheme counts as one
> character." I looked up NFG, and found it to be an invention of this
> group, but didn't find any details when I tried to chase down the
> links.

As Darren already wrote, the only definition is in http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html which references 'Unicode Normalization Forms' http://www.unicode.org/reports/tr15/.

Also there is a reference to "The Unicode Standard defines a grapheme cluster (commonly simplified to just grapheme)". IMHO the authors meant this document: Unicode Standard Annex #29, Unicode Text Segmentation, http://unicode.org/reports/tr29/

> This opens a whole bunch of questions for me.

I have many unanswered questions [1] about graphemes.

> If you mean that the default for what the individual items in a string
> are is graphemes, OK, but what does that have to do with parsing source
> code?

First - nothing. S01: Perl 6 is written in Unicode. Developers can choose one of the encodings (UTF-8, UTF-16 etc.) for files with Perl source code. Characters outside the ASCII range can be used for identifiers, literals, and syntactic punctuation (e.g. 'bracketing pairs'). It's a problem of the parser to handle it correctly.

> Even so, that's not something that would be called a Normalization
> Form.

Not in Unicode, but it can be called Grapheme Composition. Thus

    \c[LATIN SMALL LETTER A, COMBINING DOT ABOVE, COMBINING DOT BELOW]
    \c[LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING DOT ABOVE]
    \c[LATIN SMALL LETTER A WITH DOT ABOVE, COMBINING DOT BELOW]
    \c[LATIN SMALL LETTER A WITH DOT BELOW, COMBINING DOT ABOVE]

should all lead to the same grapheme (my personal assumption).

> Character set encodings and stuff is one of my strengths. I'd like to
> straighten this out, and can certainly straighten out the wording, but
> first need to know what you meant by that.

What's specified:
1) A grapheme is 1 character, thus has 'length' 1.
2) A grapheme has a unique internal representation as an integer for some life-time (process), outside the Unicode codepoints.
3) Graphemes can be normalized to NFD, NFC etc.

[1] Open questions:
1) Will graphemes have an unique charname? e.g. GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE
2) Can I use Unicode property matching safely with graphemes? If yes, who or what maintains the necessary tables?
3) Details of 'life-time', round-trip.
4) Should the definition of graphemes conform to Unicode Standard Annex #29 'grapheme clusters'? Which level - legacy, extended or tailored?

Helmut Wollmersdorfer
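Helmut's assumption that all four \c[...] spellings are the same grapheme is exactly what Unicode canonical equivalence guarantees, and can be checked with a stock normalizer. A Python demonstration (using the precomposed codepoints U+0227 a-with-dot-above and U+1EA1 a-with-dot-below):

```python
import unicodedata

a, above, below = "a", "\u0307", "\u0323"   # COMBINING DOT ABOVE / BELOW
precomp_above = "\u0227"                    # LATIN SMALL LETTER A WITH DOT ABOVE
precomp_below = "\u1ea1"                    # LATIN SMALL LETTER A WITH DOT BELOW

forms = [
    a + above + below,        # a, COMBINING DOT ABOVE, COMBINING DOT BELOW
    a + below + above,        # a, COMBINING DOT BELOW, COMBINING DOT ABOVE
    precomp_above + below,    # a WITH DOT ABOVE, COMBINING DOT BELOW
    precomp_below + above,    # a WITH DOT BELOW, COMBINING DOT ABOVE
]

# All four spellings are canonically equivalent; NFC maps them to one form,
# which is what an NFG implementation would then assign ONE synthetic id to.
nfc = {unicodedata.normalize("NFC", f) for f in forms}
print(len(nfc))   # 1
```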
Re: Unicode in 'NFG' formation ?
Darren Duncan wrote:
> Since you seem eager, I recommend you start with porting the Parrot
> PDD 28 to a new Perl 6 Synopsis 15, and continue from there.

IMHO we need some people for a broad discussion on the details first.

Helmut Wollmersdorfer
Re: Unicode in 'NFG' formation ?
Do we really need to be able to map arbitrary graphemes to integers, or is it enough to have an opaque value returned by ord() that, when fed to chr(), returns the same grapheme? If the latter, a list of code points (in one of the official Normalization Forms) would seem to be sufficient.

On 5/18/09, Helmut Wollmersdorfer hel...@wollmersdorfer.at wrote:
> Darren Duncan wrote:
> > Since you seem eager, I recommend you start with porting the Parrot
> > PDD 28 to a new Perl 6 Synopsis 15, and continue from there.
>
> IMHO we need some people for a broad discussion on the details first.
>
> Helmut Wollmersdorfer

--
Sent from my mobile device

Mark J. Reed markjr...@gmail.com
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 9:11 AM, Austin Hastings austin_hasti...@yahoo.com wrote:
> If you haven't read the PDD, it's a good start.

<snip useful summary>

I get all that, really. I still question the necessity of mapping each grapheme to a single integer. A single *value*, sure. length($weird_grapheme) should always be 1, absolutely. But why does ord($weird_grapheme) have to be a *numeric* value?

If you convert to, say, normalization form C and return a list of the scalar values so obtained, that can be used in any context to reproduce the same grapheme, with no worries about different processes coming up with different assignments of arbitrary negative numbers to graphemes.

If you're doing arithmetic with the code points or scalar values of characters, then the specific numbers would seem to matter. I'm looking for the use case where the fact that it's an integer matters but the specific value doesn't.

--
Mark J. Reed markjr...@gmail.com
Re: Unicode in 'NFG' formation ?
Mark J. Reed wrote:
> On Mon, May 18, 2009 at 9:11 AM, Austin Hastings
> austin_hasti...@yahoo.com wrote:
> > If you haven't read the PDD, it's a good start.
>
> <snip useful summary>
>
> I get all that, really. I still question the necessity of mapping each
> grapheme to a single integer. A single *value*, sure.
> length($weird_grapheme) should always be 1, absolutely. But why does
> ord($weird_grapheme) have to be a *numeric* value? If you convert to,
> say, normalization form C and return a list of the scalar values so
> obtained, that can be used in any context to reproduce the same
> grapheme, with no worries about different processes coming up with
> different assignments of arbitrary negative numbers to graphemes. If
> you're doing arithmetic with the code points or scalar values of
> characters, then the specific numbers would seem to matter. I'm
> looking for the use case where the fact that it's an integer matters
> but the specific value doesn't.

There's a couple of cases. First of all, it doesn't have to be an integer. It needs to be a fixed size, and it needs to be orderable, so that we can store a bunch of them in an intelligent fashion, thus making it easy to sort them. With that said, integers meet the need exactly. Plus, there's the benefit that unicode already has an escape hatch built in to it for user-defined stuff. And that escape hatch is an integer.

The benefits are documented in the pod: they're fixed size, so we can scan over them forward and backward at low cost. They're easily distinguished (high bit set) so string code can special-case them quickly. They're orderable, comparable, etc. And best of all they contain no trans fat!

=Austin
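The scheme Austin describes can be sketched in a few lines. This is a hypothetical Python model of the NFG idea from PDD 28, not Parrot's actual implementation; the function names and table layout are mine. Single-codepoint graphemes keep their codepoint, multi-codepoint graphemes get a synthetic negative id from a per-process table:

```python
import unicodedata

_table = {}    # grapheme string -> negative synthetic id
_reverse = []  # index i holds the grapheme for id -(i+1)

def nfg_ord(grapheme: str) -> int:
    """ord() analogue: fixed-size integer per grapheme (hypothetical)."""
    g = unicodedata.normalize("NFC", grapheme)
    if len(g) == 1:
        return ord(g)                      # real codepoint, non-negative
    if g not in _table:                    # allocate a synthetic codepoint;
        _table[g] = -(len(_reverse) + 1)   # negative = "high bit set", so
        _reverse.append(g)                 # string code can special-case it
    return _table[g]

def nfg_chr(n: int) -> str:
    """chr() analogue: valid only within the process that built the table."""
    return chr(n) if n >= 0 else _reverse[-n - 1]

g = "a\u0300\u0323"        # a + grave + dot below: one grapheme, 2 cp after NFC
n = nfg_ord(g)
print(n < 0)                # True: synthetic codepoint
print(nfg_chr(n))           # round-trips to the NFC form of g
print(nfg_ord("A"))         # 65: plain codepoints pass through
```

This also makes Mark's objection concrete: the id is stable within one process (same grapheme, same id) but carries no meaning to another process, exactly the "handlish" property discussed above.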
Re: Unicode in 'NFG' formation ?
On May 18, 2009, at 09:21, Mark J. Reed wrote:
> If you're doing arithmetic with the code points or scalar values of
> characters, then the specific numbers would seem to matter. I'm

I would argue that if you are working with a grapheme cluster (grapheme), arithmetic on individual grapheme values is undefined. What is the meaning of ord(\c[LATIN LETTER T WITH DOT ABOVE, COMBINING DOT BELOW]) + 1? If you say it increments the base character (a reasonable-looking initial stance), what happens if I add an amount which changes the base character to a combining character? And what happens if the original grapheme doesn't have a base character?

In short, I think the only remotely sane result of ord() on a grapheme is an opaque value meaningful to chr() but to very little, if anything, else. If you want to represent it as an integer, fine, but it should be obscured such that math isn't possible on it. Conversely, if you want ord() values you can manipulate, you must work at the codepoint level.

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 12:37:49PM -0400, Brandon S. Allbery KF8NH wrote:
> On May 18, 2009, at 09:21, Mark J. Reed wrote:
> > If you're doing arithmetic with the code points or scalar values of
> > characters, then the specific numbers would seem to matter. I'm
>
> I would argue that if you are working with a grapheme cluster
> (grapheme), arithmetic on individual grapheme values is undefined.
> What is the meaning of ord(\c[LATIN LETTER T WITH DOT ABOVE, COMBINING
> DOT BELOW]) + 1? If you say it increments the base character (a
> reasonable-looking initial stance), what happens if I add an amount
> which changes the base character to a combining character? And what
> happens if the original grapheme doesn't have a base character?
>
> In short, I think the only remotely sane result of ord() on a grapheme
> is an opaque value meaningful to chr() but to very little, if anything,
> else. If you want to represent it as an integer, fine, but it should
> be obscured such that math isn't possible on it. Conversely, if you
> want ord() values you can manipulate, you must work at the codepoint
> level.

Sure, but this is a weak argument, since you can already write complete ord/chr nonsense at the codepoint level (even in ASCII), and all we're doing here is making graphemes work more like codepoints in terms of storage and indexing. If people abuse it, they have only themselves to blame for relying on what is essentially an implementation detail. The whole point of ord is to cheat, so if they get caught cheating, well, they just have to take their lumps. In the age of Unicode, ord and chr are pretty much irrelevant to most normal text processing anyway except for encoders and decoders, so there's not a great deal of point in labeling the integers as an opaque type, in my opinion.

As an implementation detail however, it's important to note that the signed/unsigned distinction gives us a great deal of latitude in how to store a particular sequence of integers. Latin-1 will (by definition) fit in a *uint8, while ASCII plus (no more than 128) NFG negatives will fit into *int8. Most European languages will fit into *int16 with up to 32768 synthetic chars. Most Asian text still fits into *uint16 as long as they don't synthesize codepoints. And we can always resort to *uint32 and *int32 knowing that the Unicode consortium isn't going to use the top bit any time in the foreseeable future. (Unless, of course, they endorse something resembling NFG. :)

Note also that uint8 has nothing to do with UTF-8, and uint16 has nothing to do with UTF-16. Surrogate pairs are represented by a single integer in NFG. That is, NFG is always abstract codepoints of some sort without regard to the underlying representation. In that sense it's not important that synthetic codepoints are negative, of course.

Larry
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 12:37:49PM -0400, Brandon S. Allbery KF8NH wrote:
> I would argue that if you are working with a grapheme cluster
> (grapheme), arithmetic on individual grapheme values is undefined.

Yup, that was exactly what I was arguing.

> In short, I think the only remotely sane result of ord() on a grapheme
> is an opaque value meaningful to chr() but to very little, if anything,
> else.

Which is what we have with the negative integer spec. What I dislike is the transient, handlish nature of those values: like a handle, you can't store the value and then use it to reconstruct the grapheme later. But since actually storing the grapheme itself should be no great feat, I guess that's not much of a hardship.

On Mon, May 18, 2009 at 1:37 PM, Larry Wall la...@wall.org wrote:
> you can already write complete ord/chr nonsense at the codepoint level
> (even in ASCII)

Sorry, could you clarify what you mean by that?

> And we can always resort to *uint32 and *int32 knowing that the
> Unicode consortium isn't going to use the top bit any time in the
> foreseeable future.

s/top bit/top 11 bits/...

> Note also that uint8 has nothing to do with UTF-8, and uint16 has
> nothing to do with UTF-16. Surrogate pairs are represented by a single
> integer in NFG.

They are also represented by a single value in UTF-8; that is, the full scalar value is encoded directly, rather than being first encoded into UTF-16 surrogates which are then encoded as UTF-8...

> That is, NFG is always abstract codepoints of some sort

Barely-relevant terminology nit: "abstract code points" sounds like something that would be associated with abstract characters, which as defined in Unicode are formally distinct from graphemes, which is what we're talking about here. Also, the term "code points" includes the surrogates, which can only appear in UTF-16; I imagine the scalar values we deal with most of the time at the character/grapheme level would be the subset of code points excluding surrogates, which are called "Unicode scalar values".

Surrogates are just weird, since they have assigned code points even though they're purely an encoding mechanism. As such, they straddle the line between abstract characters and an encoding form. I assume that if text comes in as UTF-16, the surrogates will disappear as far as character-level P6 code is concerned. So is there any way for P6 to manipulate surrogates as characters? Maybe an adverb or trait? Or does one have to descend to the bytewise layer for that? (As you said, that *normally* shouldn't be necessary outside encoding and decoding, where you need to do things bytewise anyway; just trying to cover all the bases...)

--
Mark J. Reed markjr...@gmail.com
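The surrogate mechanics under discussion are easy to demonstrate concretely. A Python illustration (Python's str, like NFG, works in whole scalar values, so surrogates only appear at the encoding layer):

```python
# U+1F600 GRINNING FACE is one scalar value above the BMP.
s = "\U0001F600"
print(len(s))  # 1 -- a single character at the scalar-value level

# UTF-16 must split it into a surrogate pair (two 16-bit code units)...
print(s.encode("utf-16-be").hex())  # d83dde00 -> U+D83D, U+DE00

# ...while UTF-8 encodes the scalar value directly; no surrogates involved:
print(s.encode("utf-8").hex())  # f09f9880

# The surrogate code points themselves are encoding machinery, not
# characters: encoding a lone surrogate as UTF-8 is an error.
try:
    "\ud83d".encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate rejected")
```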
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:
> [1] Open questions:
>
> 1) Will graphemes have an unique charname? e.g.
> GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE

Yes, presumably that comes with the normalization part of NFG. We're not aiming for round-tripping of synthetic codepoints, just as NFC doesn't do round-tripping of sequences that have precomposed codepoints. We're really just extending the NFC notion a bit further to encompass temporary precomposed codepoints.

> 2) Can I use Unicode property matching safely with graphemes? If yes,
> who or what maintains the necessary tables?

Good question. My assumption is that adding marks to a character doesn't change its fundamental nature. What needs to be provided other than pass-through to the base character's properties?

> 3) Details of 'life-time', round-trip.

Which is a very interesting topic, with connections to type theory, scope/domain management, and security issues (such as the possibility of a DoS attack on the translation tables).

> 4) Should the definition of graphemes conform to Unicode Standard
> Annex #29 'grapheme clusters'? Which level - legacy, extended or
> tailored?

No opinion, other than that we're aiming for the most modern formulation that doesn't implicitly cede declarational control to something out of the control of Perl 6 declarations. (See locales for an example of something Perl 6 ignores in the absence of an explicit declaration to pay attention to them.) So just guessing from the names without reading the Annex in question, not legacy, but probably extended, with explicit tailoring allowed by declaration. (Unless extended has some dire performance or policy consequences that would be contraindicative...)

So as long as we stay inside these fundamental Perl 6 design principles, feel free to whack on the specs.

Larry
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 02:16:17PM -0400, Mark J. Reed wrote: : Surrogates are just weird, since they have assigned code points even : though they're purely an encoding mechanism. As such, they straddle : the line between abstract characters and an encoding form. I assume : that if text comes in as UTF-16, the surrogates will disappear as far : as character-level P6 code is concerned. I devoutly hope so. UTF-8 is much cleaner than UTF-16 in this regard. (And it's why I qualified my code point with abstract earlier, to mean the UTF-8 interpretation rather than the UTF-16 interpretation.) : So is there any way for P6 : to manipulate surrogates as characters? Maybe an adverb or trait? : Or does one have to descend to the bytewise layer for that? (As you : said, that *normally* shouldn't be necessary outside encoding and : decoding, where you need to do things bytewise anyway; just trying to : cover all the bases...) Buf16 should work for raw UTF-16 just fine. That's one of the main reasons we have buffers in sizes other than 8, after all. Larry
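Larry's Buf16 suggestion can be mimicked in Python with a raw buffer of 16-bit units (a hypothetical stand-in; Buf16 itself is the Perl 6 type): you assemble and inspect the code units, surrogates included, below the character level, and only an explicit decode collapses them.

```python
# Python sketch of working at the Buf16 level: raw 16-bit code units,
# including an explicit surrogate pair, assembled by hand.
import struct

# Surrogate pair for U+1F600 followed by the unit for 'A'
data = struct.pack("<3H", 0xD83D, 0xDE00, 0x0041)

s = data.decode("utf-16-le")
assert s == "\U0001F600A"    # surrogates collapsed into one character
assert len(s) == 2           # two characters, though three 16-bit units
```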
Re: Unicode in 'NFG' formation ?
On May 18, 2009, at 14:16 , Larry Wall wrote: On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote: 3) Details of 'life-time', round-trip. Which is a very interesting topic, with connections to type theory, scope/domain management, and security issues (such as the possibility of a DoS attack on the translation tables). I find myself wondering if they might need to be standardized anyway; specifically I'm contemplating Erlang-style services. -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH
Re: Unicode in 'NFG' formation ?
Brandon S. Allbery KF8NH wrote: On May 18, 2009, at 14:16 , Larry Wall wrote: On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote: 3) Details of 'life-time', round-trip. Which is a very interesting topic, with connections to type theory, scope/domain management, and security issues (such as the possibility of a DoS attack on the translation tables). I find myself wondering if they might need to be standardized anyway; specifically I'm contemplating Erlang-style services. Why wouldn't a marshalling of an NFG string automatically include the grapheme table? That way you can realize it and immediately use it in fast mode. Alternatively, if you were providing a persistent string service, a post-marshalling step could re-normalize it in local NFG. The response in NFG could either use the same table you sent (if the response is a subset of the original string) or could attach its own table for translation at your end. =Austin
Re: Unicode in 'NFG' formation ?
Larry Wall wrote: Which is a very interesting topic, with connections to type theory, scope/domain management, and security issues (such as the possibility of a DoS attack on the translation tables). I think that a DoS attack on Unicode would be called IBM/Windows Code Pages. The rest of the world has been suffering this attack for the last 40 years. I'm not sure anyone would notice, at this point. :-)
Re: Unicode in 'NFG' formation ?
Mark J. Reed markjreed-at-gmail.com |Perl 6| wrote: On Mon, May 18, 2009 at 9:11 AM, Austin Hastings austin_hasti...@yahoo.com wrote: If you haven't read the PDD, it's a good start. <snip useful summary> I get all that, really. I still question the necessity of mapping each grapheme to a single integer. A single *value*, sure. length($weird_grapheme) should always be 1, absolutely. But why does ord($weird_grapheme) have to be a *numeric* value? If you convert to, say, normalization form C and return a list of the scalar values so obtained, that can be used in any context to reproduce the same grapheme, with no worries about different processes coming up with different assignments of arbitrary negative numbers to graphemes. My feelings, in general. It appears that the concept of mapping total graphemes to integers, negative, etc. is an implementation decision. Perl 6 strings have a concept of graphemes, and functions that work with them. But the core language specification should keep that as general as possible, and allow implementation freedom. The statement that "base + moda + modb" produces the same grapheme value as "base + modb + moda" is at the correct level. The statement "the grapheme is an Int" is not only at the wrong level, but not right, as they should be their own distinct type. I think that the PDD details of assigning negative values as encountered AND the idea of being a list of code points in some normalized form, AND the idea of it being a buffer of bytes in UTF8 with that list of code points encoded therein, are all *allowed* as correct implementations. So is having a type whose instance data stores it in however many forms it wants, and for the Perl end of things you just let the === operator take its natural course. If you're doing arithmetic with the code points or scalar values of characters, then the specific numbers would seem to matter. I'm looking for the use case where the fact that it's an integer matters but the specific value doesn't.
Well, you can view a string as bytes of UTF8, code points, or graphemes. If you want numbers you probably wanted the first two. A grapheme object should in some ways behave as a string of 1 grapheme and allow you to obtain bytes of UTF8 or code points, easily. Now object identity, the address of an object, is not mandated to be an Int or even numeric. Different types can return different things even. The only thing we know is that infix:<===> uses them. Should graphemes be any different? A grapheme object has observed behavior (encode it as...) and internal unobserved behavior. Perhaps we need more assertions such as saying that it can serve as hash keys properly, rather than going all the way to saying that they must be numbered. Especially with an internal numbering system that changes from run to run! Meanwhile... that's what the Str class does. It still has nothing to do with how source code is parsed. To that extent, mentioning it in S02, at least in that section, is a mistake. A see-also to general Perl Unicode documentation would not be objectionable. Also, I described more detailed, formal handling of the input stream to the Perl 6 parser last year: http://www.dlugosz.com/Perl6/specdoc.pdf in Section 3.1. It was discussed on this mailing list when I was starting it. --John
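Mark's alternative to a synthetic integer per grapheme, the NFC sequence of scalar values, is easy to demonstrate. Here is a Python sketch (unicodedata is Python's property database, not anything Perl 6 would use), built on the "a with dot below and dot above" example from earlier in the thread:

```python
# Python sketch: represent a grapheme by its NFC scalar values, which any
# process recomputes identically (no per-process negative-number table).
import unicodedata

g = "a\u0323\u0307"                    # a + combining dot below + dot above
nfc = unicodedata.normalize("NFC", g)

scalars = [ord(c) for c in nfc]
# NFC composes a + dot below into U+1EA1; the dot above stays combining.
assert scalars == [0x1EA1, 0x0307]

# Rebuilding from the scalar list reproduces exactly the same grapheme.
assert "".join(map(chr, scalars)) == nfc
```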
Re: Unicode in 'NFG' formation ?
Larry Wall larry-at-wall.org |Perl 6| wrote: Sure, but this is a weak argument, since you can already write complete ord/chr nonsense at the codepoint level (even in ASCII), and all we're doing here is making graphemes work more like codepoints in terms of storage and indexing. If people abuse it, they have only themselves to blame for relying on what is essentially an implementation detail. The whole point of ord is to cheat, so if they get caught cheating, well, they just have to take their lumps. In the age of Unicode, ord and chr are pretty much irrelevant to most normal text processing anyway except for encoders and decoders, so there's not a great deal of point in labeling the integers as an opaque type, in my opinion. Playing the Devil's Advocate here, some other discussion on this thread made me think of something. People already write code that expects ord's to be ordered. Instead of saying, well, use code points if you want to do that we can encourage people to embrace graphemes and say don't use code points or bytes! Use graphemes! if they behave in a familiar enough manner. So on one hand I say viva la revolution!, graphemes are modeled after the object identity, which is totally opaque except for equality testing. But on the other hand, I want to say they may be funky inside, but you can still _use_ them in the ways you want... and assure that they work as hash keys and are not only ordered but include ASCII ordering as a subgroup. But, still not disallow any good implementation ideas that befit totally different implementations. Of course, that's not a problem unique to graphemes. The object identity keys, for example. Any forward-thinking that replaces old values with magic cookies. Perhaps we need a general class that will assign orderable tags to arbitrary values and remember the mapping, and use that for more general cases. 
It can be explicitly specialized to use any implementation-dependent ordering that actually exists on that type, and the general case would just be to memo-ize an int mapping. --John
Re: Unicode in 'NFG' formation ?
Larry Wall larry-at-wall.org |Perl 6| wrote: into *uint16 as long as they don't synthesize codepoints. And we can always resort to *uint32 and *int32 knowing that the Unicode consortium isn't going to use the top bit any time in the foreseeable future. (Unless, of course, they endorse something resembling NFG. :) No, a few million code points in the Unicode standard can produce an arbitrary number of unique grapheme clusters, since you can apply as many modifiers as you like to each different base character. If you allow multiples, the total is unbounded. A small program, which ought to go into the test suite <g>, can generate 4G distinct grapheme clusters, one at a time. How many implementations will that break? If they want fixed size, 64 bits should do for now. Also, if the spec doesn't list a requirement for a minimum implementation limit, *any* fixed-size implementation will be incorrect even if untestable as such. --John
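John's unboundedness point is easy to verify on a small scale. A Python sketch (pure illustration, not the NFG algorithm itself):

```python
# Python sketch: a handful of code points yields combinatorially many
# distinct grapheme clusters once combining marks can stack.
from itertools import product

# All four marks have canonical combining class 230, so different orders
# are canonically distinct and are not reshuffled by normalization.
MARKS = ["\u0300", "\u0301", "\u0302", "\u0308"]  # grave, acute, circumflex, diaeresis

clusters = {"a" + "".join(seq) for seq in product(MARKS, repeat=3)}
assert len(clusters) == 4 ** 3    # 64 clusters from just 5 code points
```

Raising `repeat` makes the count grow exponentially, which is exactly why a fixed-size synthetic-codepoint table can always be exhausted.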
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 07:59:31PM -0500, John M. Dlugosz wrote: No, a few million code points in the Unicode standard can produce an arbitrary number of unique grapheme clusters, since you can apply as many modifiers as you like to each different base character. If you allow multiples, the total is unbounded. A small program, which ought to go into the test suite <g>, can generate 4G distinct grapheme clusters, one at a time. That precise behavior is what I was characterizing as a DoS attack. :) So in my head it falls into the Doctor-it-hurts-when-I-do-this category. Larry
Re: Unicode in 'NFG' formation ?
On May 18, 2009, at 21:54 , Larry Wall wrote: On Mon, May 18, 2009 at 07:59:31PM -0500, John M. Dlugosz wrote: No, a few million code points in the Unicode standard can produce an arbitrary number of unique grapheme clusters, since you can apply as many modifiers as you like to each different base character. If you allow multiples, the total is unbounded. A small program, which ought to go into the test suite <g>, can generate 4G distinct grapheme clusters, one at a time. That precise behavior is what I was characterizing as a DoS attack. :) So in my head it falls into the Doctor-it-hurts-when-I-do-this category. If you're working with externally generated Unicode, you may not have that option. I've gotten some bizarre combinations out of Word in Hebrew with nikudot, then saved as UTF8 text (so bizarre, in fact, that in the end I used gedit on FreeBSD). -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH
Re: Unicode in 'NFG' formation ?
John M. Dlugosz wrote: I was going over S02, and found it opens with, By default Perl presents Unicode in NFG formation, where each grapheme counts as one character. I looked up NFG, and found it to be an invention of this group, but didn't find any details when I tried to chase down the links. This opens a whole bunch of questions for me. If you mean that the default for what the individual items in a string are is graphemes, OK, but what does that have to do with parsing source code? Even so, that's not something that would be called a Normalization Form. Character set encodings and stuff is one of my strengths. I'd like to straighten this out, and can certainly straighten out the wording, but first need to know what you meant by that. Can someone catch me up on the particulars? I noticed and asked about this a few months ago. As you say, NFG was invented for Perl 6 and/or Parrot. See http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html for all the formal details that exist to my knowledge. Back at the time I raised the issue, it was said that we need to take that Parrot PDD 28 and derive the initial Perl 6 Synopsis 15 from it. Such a Synopsis could basically just start out as a clone of the Parrot document. I said that someday I might have the round-tuit for this, but as yet I didn't. Since you seem eager, I recommend you start with porting the Parrot PDD 28 to a new Perl 6 Synopsis 15, and continue from there. -- Darren Duncan
Re: Unicode bracketing spec question
On Thu, 23 Apr 2009, Helmut Wollmersdorfer wrote: Timothy S. Nelson wrote: I note that S02 says that the unicode classes Ps/Pe are blessed to act as opening and closing quotes. Is there a reason that we can't have Pi/Pf blessed too? I ask because there are quotation marks in the Pi/Pf set that are called Substitution and Transposition which I thought might be cool quotes for s/// and tr/// :). You mean 2E00 - 2E2F Supplemental Punctuation New Testament editorial symbols [...] 2E02 LEFT SUBSTITUTION BRACKET 2E03 RIGHT SUBSTITUTION BRACKET [...] 2E09 LEFT TRANSPOSITION BRACKET 2E0A RIGHT TRANSPOSITION BRACKET That sounds like them. But if you really want to use these characters, your source will be hard to read without exotic fonts. You have been warned;-) My fonts don't show them either. But we could call it job protection ;). :) - | Name: Tim Nelson | Because the Creator is,| | E-mail: wayl...@wayland.id.au| I am | - BEGIN GEEK CODE BLOCK Version 3.12 GCS d+++ s+: a- C++$ U+++$ P+++$ L+++ E- W+ N+ w--- V- PE(+) Y+++ PGP-+++ R(+) !tv b++ DI D G+ e++ h! y- -END GEEK CODE BLOCK-
Re: Unicode bracketing spec question
Timothy S. Nelson wrote: I note that S02 says that the unicode classes Ps/Pe are blessed to act as opening and closing quotes. Is there a reason that we can't have Pi/Pf blessed too? I ask because there are quotation marks in the Pi/Pf set that are called Substitution and Transposition which I thought might be cool quotes for s/// and tr/// :). You mean 2E00 - 2E2F Supplemental Punctuation New Testament editorial symbols [...] 2E02 LEFT SUBSTITUTION BRACKET 2E03 RIGHT SUBSTITUTION BRACKET [...] 2E09 LEFT TRANSPOSITION BRACKET 2E0A RIGHT TRANSPOSITION BRACKET Cool idea. But if you really want to use these characters, your source will be hard to read without exotic fonts. You have been warned;-) Helmut Wollmersdorfer
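The general categories under discussion can be checked directly. A Python sketch using unicodedata (Python's copy of the UCD, not Perl 6's tables):

```python
# Python sketch: the substitution brackets are Pi/Pf, not Ps/Pe, which is
# why the thread asks for Pi/Pf to be blessed as quoting pairs too.
import unicodedata

assert unicodedata.category("\u2E02") == "Pi"  # LEFT SUBSTITUTION BRACKET
assert unicodedata.category("\u2E03") == "Pf"  # RIGHT SUBSTITUTION BRACKET
assert unicodedata.category("(") == "Ps"       # ordinary open bracket
assert unicodedata.category(")") == "Pe"       # ordinary close bracket
```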
[perl #61394] Re: unicode and macosx
# New Ticket Created by Stephane Payrard # Please include the string: [perl #61394] # in the subject line of all future correspondence about this issue. # URL: http://rt.perl.org/rt3/Ticket/Display.html?id=61394 my $s = ; say $s.chars # now returns 1 Note : the bug was reported on macintel 32 bits which died. I am now testing on a macintel 64 bits. I don't know if it can affect the test. On Mon, May 19, 2008 at 6:28 PM, Stéphane Payrard cognomi...@gmail.com wrote: On a macintel 10.5 I have some problem with unicode. unicode characters are not recognized as such. See the rakudo test below The configuring phase gives : Determining whether ICU is installed...yes. The compiling phase finishes with an error but it apparently causes no problems except I can't run 'make test' because of the dependence on a successful compilation. ar: blib/lib/libparrot.a is a fat file (use libtool(1) or lipo(1) and ar(1) on it) ar: blib/lib/libparrot.a: Inappropriate file type or format make: *** [blib/lib/libparrot.a] Error 1 rakudo is generated without problem But the following test fails. I pasted the content of the literal string with a character that emacs says to be #x8a0 my $s = ; say $s.chars # $s == \x8a0 2 I expected one. -- cognominal stef
[Patch] Re: Unicode Operators cheatsheet, please!
Rob Kinyon wrote: xOn 5/31/05, Sam Vilain [EMAIL PROTECTED] wrote: Rob Kinyon wrote: I would love to see a document (one per editor) that describes the Unicode characters in use and how to make them. The Set implementation in Pugs uses (at last count) 20 different Unicode characters as operators. I have updated the unicode quickref, and started a Perlmonks discussion node for this to be explored - see http://www.perlmonks.org/index.pl?node_id=462246 As I replied on Perlmonks, it would be more helpful if the Compose keys were listed and not just the ASCII versions. Plus, a quick primer on how to enable Unicode in your favorite editor. I don't know about Emacs, but the Vim documentation on multibyte is difficult to work with, at best. Well, :help digraph isn't particularly bad, though the included table only covers latin-1. The canonical source is RFC1345. But I've attached a patch for the set symbols that have them. Thanks, Rob Index: docs/quickref/unicode === --- docs/quickref/unicode (revision 4305) +++ docs/quickref/unicode (working copy) @@ -21,6 +21,10 @@ Note that the compose combinations here are an X11R6 standard, and do not necessarily correspond to the compose combinations available when you use your compose key. + +The digraphs used in vim come from Character Mnemonics Character Sets, +RFC1345 (http://www.ietf.org/rfc/rfc1345.txt). After doing :set digraph, +the digraph ^k A B may also be entered as A BS B. Unicode ASCIIkey sequence charfallbackVimEmacs Unix Compose Key combination @@ -30,22 +34,22 @@ ¥ Y ^k Y e C-x 8 Y Compose Y = Set.pm operators (included for reference): -≠ != -∩ * -∪ + +≠ != ^k ! 
= +∩ * ^k ( U +∪ + ^k ) U ∖ - -⊂ -⊃ -⊆ = -⊇ = -⊄ !( $a $b ) +⊂ ^k ( C +⊃ ^k ) C +⊆ = ^k ( _ +⊇ = ^k ) _ +⊄ !( $a $b ) ⊅ !( $a $b ) ⊈ !( $a = $b ) ⊉ !( $a = $b ) -⊊ +⊊ ⊋ -∋/∍ $a.includes($b) -∈/∊ $b.includes($a) +∋/∍ $a.includes($b) ^k ) - +∈/∊ $b.includes($a) ^k ( - ∌!$a.includes($b) ∉!$b.includes($a) @@ -58,20 +62,20 @@ So, these *might* be considered not too awful; -× * -¬ ! +× * ^k * X +¬ ! ^k N O ∕ / ≡ =:= ≔ := ⩴ or ≝ ::= - ≈ or ≊~~ + ≈ or ≊~~ ^k ? 2 … ... -√ sqrt() -∧ -∨ || +√ sqrt() ^k R T +∧ ^k A N +∨ || ^k O R ∣ mod (? bit of a stretch, perhaps) - ⌈$x⌉ceil($x) - ⌊$x⌋floor($x) + ⌈$x⌉ceil($x) ^k / 7 + ⌊$x⌋floor($x)^k 7 / 7 However I think it is a BAD idea that the following unicode characters
Re: Unicode Operators cheatsheet, please!
On 5/31/05, Sam Vilain [EMAIL PROTECTED] wrote: Rob Kinyon wrote: I would love to see a document (one per editor) that describes the Unicode characters in use and how to make them. The Set implementation in Pugs uses (at last count) 20 different Unicode characters as operators. I have updated the unicode quickref, and started a Perlmonks discussion node for this to be explored - see http://www.perlmonks.org/index.pl?node_id=462246 As I replied on Perlmonks, it would be more helpful if the Compose keys were listed and not just the ASCII versions. Plus, a quick primer on how to enable Unicode in your favorite editor. I don't know about Emacs, but the Vim documentation on multibyte is difficult to work with, at best. Thanks, Rob
Re: Unicode Operators cheatsheet, please!
Rob Kinyon wrote: I would love to see a document (one per editor) that describes the Unicode characters in use and how to make them. The Set implementation in Pugs uses (at last count) 20 different Unicode characters as operators. I have updated the unicode quickref, and started a Perlmonks discussion node for this to be explored - see http://www.perlmonks.org/index.pl?node_id=462246 Sam.
Re: Unicode Operators cheatsheet, please!
On Fri, May 27, 2005 at 10:29:39AM -0400, Rob Kinyon wrote: I would love to see a document (one per editor) that describes the Unicode characters in use and how to make them. The Set implementation in Pugs uses (at last count) 20 different Unicode characters as operators. Good idea. A modest start is at docs/quickref/unicode . -- Gaal Yahas [EMAIL PROTECTED] http://gaal.livejournal.com/
Re: Unicode Support - ICU Optional
On Wed, Aug 04, 2004 at 04:10:56AM -0700, Joshua Gatcomb wrote: WRT improving the ease of use of ICU. My suggestion is that a representative from each platform that Parrot is currently being built on download the latest stable version of ICU source, build it, and note anything special they needed to do to get it working. Those things should make putting a newer version into CVS a realistic possibility. I am volunteering for Cygwin (yeah I know - big surprise there). OK. Solaris, Sun C compilers. Notionally a supported platform. /usr/include/sys/feature_tests.h, line 277: #error: Compiler or options invalid for pre-UNIX 03 X/Open applications and pre-2001 POSIX applications cc: acomp failed for putil.c make[1]: *** [putil.d] Error 2 make[1]: Leaving directory `/export/home/nick/Ponie/ponie-clean/icu/source/common' make: *** [all-recursive] Error 2 It's this one again. Solaris 10 seems too new for it. OK, Solaris 10 is in beta but this is the same pain as before. I should report this to the ICU people. 
Note also that it will only build as is on this list of platforms for 3.0 The following names can be supplied as the argument for platform: AIX4.3xlC Use IBM's xlC on AIX 4.3 AIX4.3xlC_nothreads Use IBM's xlC on AIX 4.3 with no multithreading AIX4.3VAUse IBM's Visual Age xlC_r compiler on AIX 4.3 AIXGCC Use GCC on AIX ALPHA/LINUXGCC Use GCC on Alpha/Linux systems ALPHA/LINUXCCC Use Compaq C compiler on Alpha/Linux systems BeOSUse the GNU C++ compiler on BeOS Cygwin Use the GNU C++ compiler on Cygwin Cygwin/MSVC Use the Microsoft Visual C++ compiler on Cygwin FreeBSD Use the GNU C++ compiler on Free BSD HP-UX11ACC Use the Advanced C++ compiler on HP-UX 11 HP-UX11CC Use HP's C++ compiler on HP-UX 11 LinuxRedHat Use the GNU C++ compiler on Linux LINUX/ECC Use the Intel ECC compiler on Linux LINUX/ICC Use the Intel ICC compiler on Linux MacOSX Use the GNU C++ compiler on MacOS X (Darwin) QNX Use QNX's QCC compiler on QNX/Neutrino SOLARISCC Use Sun's CC compiler on Solaris SOLARISCC/W4.2 Use Sun's Workshop 4.2 CC compiler on Solaris SOLARISGCC Use the GNU C++ compiler on Solaris SOLARISX86 Use Sun's CC compiler on Solaris x86 TRU64V5.1/CXX Use Compaq's cxx compiler on Tru64 (OSF) zOS Use IBM's cxx compiler on z/OS (os/390) zOSV1R2 Use IBM's cxx compiler for z/OS 1.2 OS390V2R10 Use IBM's cxx compiler for OS/390 2.10 ie I'm being forced to build LP64 on Solaris. I will now try to scrape enough disk space on a friends Debian Sparc box to see what LinuxRedHat works like there. I won't have time to try Linux on ARM until Friday (at least) and I no longer have access to any other architectures running Debian. It may make sense to work with ICU initially, and it does support some of the more esoteric platforms that perl5 does (QNX, BeOS, EBCDIC mainframes) but I don't even see Irix in their list of supported platforms, let alone some of our more fun friends such as Unicos and NEC SuperUnix (or whatever the pain is called. 
Nice hardware; evil Unix) Heck, even OpenBSD isn't there. We would have to work with them quite a lot to bring ICU to the level of portability of Perl 5. Nicholas Clark
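For reference, the platform names in Nicholas's list are arguments to ICU's runConfigureICU wrapper script. A hedged sketch of the build steps these reports are exercising (the platform name and paths are examples, assuming an unpacked ICU 3.0 source tree):

```shell
# Sketch: building ICU from source with an explicit platform selection.
cd icu/source
./runConfigureICU SOLARISCC     # pick the entry matching your toolchain
gmake                           # GNU make is required by ICU's makefiles
gmake check                     # run ICU's own test suite before installing
```

The Solaris 10 workaround reported later in this thread amounts to overriding the compiler invocation at the make stage instead (make CC=cc\ -D_XPG6).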
Re: Unicode Support - ICU Optional
On Thu, Aug 05, 2004 at 10:51:46AM +0100, Nicholas Clark wrote: On Wed, Aug 04, 2004 at 04:10:56AM -0700, Joshua Gatcomb wrote: WRT improving the ease of use of ICU. My suggestion is that a representative from each platform that Parrot is currently being built on download the latest stable version of ICU source, build it, and note anything special they needed to do to get it working. Those things should make putting a newer version into CVS a realistic possibility. I am volunteering for Cygwin (yeah I know - big surprise there). OK. Solaris, Sun C compilers. Notionally a supported platform. OK. AIX, gcc. Notionally a supported platform. $ /usr/bin/gmake rm -rf config/icu-config /opt/freeware/bin/install -c -m 644 ./config/icu-config-top config/icu-config sed -f ./config/make2sh.sed ./config/Makefile.inc | grep -v '#M#' | uniq config/icu-config sed -f ./config/make2sh.sed ./config/mh-aix-gcc | grep -v '#M#' | uniq config/icu-config cat ./config/icu-config-bottom config/icu-config echo # Rebuilt on `date` config/icu-config /bin/sh ./mkinstalldirs lib mkdir lib /bin/sh ./mkinstalldirs bin mkdir bin /usr/bin/gmake[0]: Making `all' in `stubdata' gmake[1]: Entering directory `/data_vx/nick-sandpit/build/icu/source/stubdata' generating dependency information for stubdata.c gmake[1]: Leaving directory `/data_vx/nick-sandpit/build/icu/source/stubdata' gmake[1]: Entering directory `/data_vx/nick-sandpit/build/icu/source/stubdata' /opt/freeware/GNUPro/bin/gcc -I../common -I../common -DHAVE_CONFIG_H -O3 -c -o stubdata.o stubdata.c rm -f libicudata30.0.a ; /opt/freeware/GNUPro/bin/gcc -O3 -Wl,-bbigtoc -shared -Wl,-bexpall -o libicudata30.0.a stubdata.o /opt/freeware/GNUPro/lib/gcc-lib/powerpc-ibm-aix5.1.0.0/2.9-aix51-020209/real-ld: target expall not found collect2: ld returned 1 exit status gmake[1]: *** [libicudata30.0.a] Error 1 gmake[1]: Leaving directory `/data_vx/nick-sandpit/build/icu/source/stubdata' gmake: *** [all-recursive] Error 2 This looks terminal. 
OTOH I know how to work around Solaris 10, so I'll report on that when it's finished. If I sound rude about this, it's because I know how portable the people who came before me managed to make Perl5, and I try to keep it that way. Nicholas Clark
Re: Unicode Support - ICU Optional
On Thu, Aug 05, 2004 at 10:51:46AM +0100, Nicholas Clark wrote: It's this one again. Solaris 10 seems too new for it. OK, Solaris 10 is in beta but this is the same pain as before. I should report this to the ICU people. Reported as bug #4047 ICU 3 will build, pass all tests and install if make is invoked as make CC=cc\ -D_XPG6 Libs are 32 bit. Heck, even OpenBSD isn't there. Or VMS. How could I miss VMS? Nicholas Clark
Re: Unicode Support - ICU Optional
On Wed, Aug 04, 2004 at 04:10:56AM -0700, Joshua Gatcomb wrote: WRT improving the ease of use of ICU. My suggestion is that a representative from each platform that Parrot is currently being built on download the latest stable version of ICU source, build it, and note anything special they needed to do to get it working. Those things should make putting a newer Builds OK on FreeBSD 5.2, but make check goes boom: /ucmptst/ ---[OK] ---/ucmptst/TestUCMP8API /tsformat/ /tsformat/ccaltst/ Segmentation fault (core dumped) gmake[2]: *** [check-local] Error 139 gmake[2]: Leaving directory `/home/nick/build/icu/source/test/cintltst' gmake[1]: *** [check-recursive] Error 2 gmake[1]: Leaving directory `/home/nick/build/icu/source/test' gmake: *** [check-recursive] Error 2 I've not looked into why yet. Nicholas Clark
Re: Unicode Support - ICU Optional
On Wed, Aug 04, 2004 at 04:10:56AM -0700, Joshua Gatcomb wrote: WRT improving the ease of use of ICU. My suggestion is that a representative from each platform that Parrot is currently being built on download the latest stable version of ICU source, build it, and note x86 Debian builds and tests just fine when ICU is configured with the platform LinuxRedHat The Sparc Debian box is down, so I can't see if that's LinuxRedHat too. Nicholas Clark
Re: Unicode Support - ICU Optional
On Thu, 5 Aug 2004, Nicholas Clark wrote: On Wed, Aug 04, 2004 at 04:10:56AM -0700, Joshua Gatcomb wrote: WRT improving the ease of use of ICU. My suggestion is that a representative from each platform that Parrot is currently being built on download the latest stable version of ICU source, build it, and note x86 Debian builds and tests just fine when ICU is configured with the platform LinuxRedHat The Sparc Debian box is down, so I can't see if that's LinuxRedHat too. I just successfully built icu-3.0 on a Debian/UltraSPARC system. The LinuxRedHat bit doesn't actually do anything beyond what a plain run of ./configure would do. I'm not sure quite what to think about ICU at the moment. I certainly agree ICU is complex and when it goes awry, it looks quite daunting to fix. Part of the issue is certainly that ICU is trying to do some hard things: 1. It builds 8 different shared libraries along the way. Presumably, as part of the build/test/install/use cycle, it needs to use those libraries, and not other versions of the icu libraries. As we know from dealing with shared libperl.so libraries, this is hard to do, and requires platform-specific information that's often impossible to guess. Similarly, it has to find data files, with a way to determine at run time where to find them. 2. It generates some stuff on-the-fly. Doing so portably (while correctly propagating all the various environment variables to get the right shared libraries and build tools) is, again, hard. 3. Correct implementation of some Unicode stuff is hard. Any competing system would likely have to address many of the same issues. I haven't dug into the build system deeply enough to have a sense of how much work would be involved in making it more portable. Some of the complexity is probably necessary because the problem itself is complex, but some of it has probably just evolved that way. perl5's Configure system certainly has a lot of both types of complexity. 
Finally, I note that ICU is still rapidly evolving. The 2.6 version in parrot right now that is obsolete is only 6 months old. That's both good and bad. It's good in that fixes could quickly flow both ways between the parrot and icu developers. It's bad, though, if the parrot version ends up forking significantly from the standard one, because it will be a lot of work to keep parrot's version in sync. So I don't know what to do with it at the moment; any alternative looks like a lot of work. -- Andy Dougherty [EMAIL PROTECTED]
Re: Unicode Support - ICU Optional
At 4:10 AM -0700 8/4/04, Joshua Gatcomb wrote: All: After speaking with Dan in #parrot last night, I either had originally misunderstood his position or he has changed it (paraphrased): We will ship Parrot with unicode support, but:. A. The unicode support does not necessarily need to be limited to a single library or ICU specifically. B. Just because CVS will have unicode support, does not mean the user will be forced to use it. C. Configure should detect a system unicode library and do the right thing in choosing which one it uses. Yup, you've got it. I *thought* that having ICU in would be more a win than a loss. Given the hell this has been putting people through I'm seriously changing my mind. We have a single requirement -- Parrot, as shipped, *must* have a working Unicode solution. It won't have to be configured when parrot's built, but it must at least be configurable. Right now, that solution's ICU. Longer term, well... longer term I dunno. So, here's the plan. 1) We beat up Configure to probe for and use the system ICU, if available. (Switches are needed now, it should be automagic) 2) I spec out the encoding and charset APIs for the loadable encoding and charset modules. (This is step one of teasing ICU out of the core) 3) We make Parrot's string system use the loadable encoding and charset system 4) We get non-unicode encodings and charsets in 5) We make ICU a loadable module tied into the proper encodings and charset Step 1 can be done by anyone willing to poke at the configure perl code. Step 2 needs me, and I'll get that done when I'm waiting for the train today. (No, don't ask) Step 3 is the biggie here, as it touches a lot of string.c. 4's relatively easy (7-bit ASCII and binary'll be first :) and 5 may or may not be straightforward, depending on how the design goes. I'd like to get work on steps 3 and 4 going quickly -- the sooner the better -- once the API design's done. 
And yes, the API will support doing this in bytecode, though there'll be the obligatory performance penalty, so if someone later comes along and wants to reimplement the Unicode support in a parrot language, well... that'd be keen and we could toss ICU from the distribution entirely. (Though still use it if there's a system version installed) -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Unicode step by step
On Saturday 10 April 2004 15:13, Leopold Toetsch wrote: There is of course still the question: Should we really have ICU in the tree. This needs tracking updates and patching (again) to make it build and so on. In the sake of platform independence I'd say to keep it there. It's far easier if you have only the usual build dependencies and the one special thing inside the tree to quickly test on different platforms. What I want to say is that you'll find a sane build environment and a Perl on most of the machines, but even I don't have ICU installed. BTW, it doesn't compile on any platform at the moment; after a realclean, on the first make it complains about:

../data/locales/ja.txt:15: parse error. Stopped parsing with U_INVALID_FORMAT_ERROR
couldn't parse the file ja.txt. Error:U_INVALID_FORMAT_ERROR
make[1]: *** [../data/out/build/icudt26l_ja.res] Error 3

If you do a make at this point again, it skips these steps and tries to link parrot, failing on many undefined symbols, I believe from the non-existent ICU. Thanks, leo Have fun, Marcus -- :: Marcus Thiesen :: www.thiesen.org :: ICQ#108989768 :: 0x754675F2 :: I can resist anything but temptation - Oscar Wilde
Re: Unicode step by step
BTW, it doesn't compile on any platform at the moment, after a realclean on the first make it complains about:

../data/locales/ja.txt:15: parse error. Stopped parsing with U_INVALID_FORMAT_ERROR
couldn't parse the file ja.txt. Error:U_INVALID_FORMAT_ERROR
make[1]: *** [../data/out/build/icudt26l_ja.res] Error 3

Try a make realclean first--Dan checked in a fix for this, and it seems to require this to force everything to start fresh. If you do a make at this point again, it skips these steps and tries to link parrot, failing on many undefined symbols, I believe from the non-existent ICU. At this point I'd expect it to link, but maybe not run well--that failure comes when packaging up the data files, and at that point the libraries themselves should already be built and in the right place. But you are detecting some loose behavior in the Makefile, which was done in part so that ICU wouldn't rebuild unless you make clean. JEff
Re: Unicode step by step
just a confirmation... my i386 debian linux gives the same error repeatedly after make realclean, if i make again, it compiles a broken parrot which fails (too) many tests... also it seems (to me) that parrot's configured choice of compiler, linker, ... is not used in building icu? does icu have some non-ubiquitous dependencies? LF

../data/locales/ja.txt:15: parse error. Stopped parsing with U_INVALID_FORMAT_ERROR
couldn't parse the file ja.txt. Error:U_INVALID_FORMAT_ERROR
make[1]: *** [../data/out/build/icudt26l_ja.res] Error 3

Try a make realclean first--Dan checked in a fix for this, and it seems to require this to force everything to start fresh. If you do a make at this point again, it skips these steps and tries to link parrot, failing on many undefined symbols, I believe from the non-existent ICU. At this point I'd expect it to link, but maybe not run well--that failure comes when packaging up the data files, and at that point the libraries themselves should already be built and in the right place. But you are detecting some loose behavior in the Makefile, which was done in part so that ICU wouldn't rebuild unless you make clean.
Re: Unicode step by step
On Tuesday 13 April 2004 13:28, luka frelih wrote: just a confirmation... my i386 debian linux gives the same error repeatedly after make realclean, if i make again, it compiles a broken parrot which fails (too) many tests... also it seems (to me) that parrot's configured choice of compiler, linker, ... is not used in building icu? does icu have some non-ubiquitous dependencies? As I said yesterday, it worked on a machine of mine which I hadn't touched for quite some while. On my notebook, where I do daily builds, I ran into the same problem, even after having made a realclean. So I did a make clean in the icu subdir directly, deleted all files which are listed in .cvsignore, and ran the realclean configure build test all over, and now it works. Seems as if something doesn't get cleaned up in icu with a parrot realclean. Have fun, Marcus -- :: Marcus Thiesen :: www.thiesen.org :: ICQ#108989768 :: 0x754675F2 :: Do something every day that you don't want to do; this is the golden rule for acquiring the habit of doing your duty without pain - Mark Twain
Re: Unicode step by step
Marcus Thiesen wrote: Seems as if something doesn't get cleaned up in icu with a parrot realclean. Yep. I've removed cleaning icu from clean/realclean[1].

$ make help | grep clean
...
icu.clean:
...

And there is always make cvsclean. leo [1] If anyone puts that in again he might also send a lot faster PC to me (and possibly other developers ;)
Re: Unicode step by step
At 6:22 PM +0200 4/13/04, Leopold Toetsch wrote: Marcus Thiesen wrote: . Seems as if something doesn't get cleaned up in icu with a parrot realclean. Yep. I've removed cleaning icu from clean/realclean[1]. I think we need to put that back for a bit, but with this: [1] If anyone puts that in again he might also send a lot faster PC to me (and possibly other developers ;) We're also likely going to be well-off if we get configure to detect a system ICU install and use that instead. It shouldn't be that tough, but I've not had a chance to poke around in the icu part of our config system to find out what we need to do. -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Unicode step by step
Dan Sugalski [EMAIL PROTECTED] wrote: At 6:22 PM +0200 4/13/04, Leopold Toetsch wrote: Marcus Thiesen wrote: Seems as if something doesn't get cleaned up in icu with a parrot realclean. Yep. I've removed cleaning icu from clean/realclean[1]. I think we need to put that back for a bit, I did list two alternatives. The normal way of changes doesn't include changes to ICU source (and honestly shouldn't). Currently building is still a bit in flux, which does mandate a make icu.clean. And there is of course already a new ICU version on *their* website, but we still try to get/keep 2.6 running. I'm still not sure that this lib should be part of *our* tree ... ... but with this: [1] If anyone puts that in again he might also send a lot faster PC to me (and possibly other developers ;) We're also likely going to be well-off if we get configure to detect a system ICU install and use that instead. There are several issues: First one is MANIFEST and CVS and patches. Config steps should be simple. But - of course - I'd appreciate this alternative as already laid out. leo
Re: [PATCH] Re: Unicode step by step
Jeff Clites [EMAIL PROTECTED] wrote: Here's a patch to src/pf_items.c, and a ppc t/native_pbc/number_3.pbc. Works. If it's working correctly, the attached strings-and-byte-order.* should both do the same thing--output the Angstrom symbol. If it's wrong, then the pbc version should output junk on a little-endian system. (If your terminal emulator isn't prepared to handle UTF-8, then pipe the output through 'less', and you should see something like E284AB.)

$ parrot string_1.pbc
Å
$ parrot string_1.pbc | od -tx1
000 e2 84 ab 0a
004

JEff Thanks - I'll apply it RSN. leo
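[Editorial aside: the expected output of that test is easy to verify independently. The Angstrom sign is U+212B, and its UTF-8 encoding is the byte sequence E2 84 AB -- exactly the bytes od shows. A quick check, sketched in Python:]

```python
# The Angstrom sign, U+212B, encodes to three UTF-8 bytes: E2 84 AB.
angstrom = "\u212b"
utf8 = angstrom.encode("utf-8")
print(utf8.hex().upper())   # prints "E284AB"
```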
Re: Unicode step by step
On Apr 10, 2004, at 6:13 AM, Leopold Toetsch wrote: 2) String PBC layout. The internal string type has changed. This currently breaks native_pbc tests (that have strings) as well as some parrot xx.pbc tests related to strings. These are working for me (which tests are failing for you?)--I did update the PF_* API to match the changes to string internals. Of course, since the internals changed the pbc layout changed also, so the native_pbc test files need to be regenerated on the various platforms--but the ppc one I submitted (see other post, or original patch submission) should work. But if that one fails for you, it's probably b/c of byte order, and I need to look and find where we do the endianness correction for integers in pbc files, and hook in to do something similar for certain string cases. If someone can send me a number_X.pbc file generated on an i386 platform, that will help me test. But, it's correct that there's no backward-compatibility code in place, to allow reading old pbc files. Do we want to have that sort of thing at this stage? (Certainly, I'd think that after 1.0 we'd want backward compatibility with any format changes, but do we need it at this stage?) But let me know which parrot xx.pbc tests are failing for you. The layout seems to depend somehow on the supported Unicode levels (or not). So before fixing the PBC issues, I'd just have a statement: parrot_string_t looks such and such or of course as is now. Could you rephrase? I'm not understanding what you are saying. The only real change in the pbc format (if I'm recalling correctly--I'll have to go back and look) is that rather than serializing the encoding/chartype/language triple, we are writing out the s->representation (still followed by s->bufused and then the contents of the buffer). The only other wrinkle is that for cases where s->representation is 2 or 4, we need to endianness correct when we use the bytecode.
This is probably a separate discussion, but we _could_ decide instead to represent strings in pbc files always in UTF-8. Advantage: Simpler, no endianness correction needed, probably durable to further changes in string internals, could isolate s->representation awareness to string.c and string_primitives.c. Disadvantages: De-serializing a string from a pbc file will always involve a copy, and could result in larger files in some cases. I could argue it either way--one's cleaner, the other is probably faster. There is of course still the question: Should we really have ICU in the tree. This needs tracking updates and patching (again) to make it build and so on. One consideration is that I may need to patch ICU in a few places--there's at least one API which they only expose in C++, so I need to wrap it in C and it's cleaner to do that as a patch to ICU rather than having C++ code in the core of parrot. Other than that, I think it boils down to convenience, and (possibly) consistency in being able to say that parrot version foo corresponds to ICU version bar (but maybe we don't need to be able to say that). JEff
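[Editorial aside: the endianness correction described for 16-bit code units amounts to swapping every byte pair in the buffer when the bytecode was written on a machine of the other byte order. A minimal sketch in Python for illustration (the real fix lives in C in pf_items.c; `swap16` is a made-up helper name):]

```python
import array

def swap16(buf: bytes) -> bytes:
    """Swap the byte order of a buffer of 16-bit code units."""
    a = array.array("H", buf)   # view the buffer as 16-bit units
    a.byteswap()                # reverse the two bytes of each unit
    return a.tobytes()

le = b"\x2b\x21"                # U+212B stored little-endian
print(swap16(le).hex())         # prints "212b"
```

The same idea extends to 32-bit code units (representation 4) by swapping four-byte groups instead of pairs.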
[PATCH] Re: Unicode step by step
On Apr 10, 2004, at 1:12 PM, Jeff Clites wrote: On Apr 10, 2004, at 6:13 AM, Leopold Toetsch wrote: 2) String PBC layout. The internal string type has changed. This currently breaks native_pbc tests (that have strings) as well as some parrot xx.pbc tests related to strings. These are working for me (which tests are failing for you?)--I did update the PF_* API to match the changes to string internals. Of course, since the internals changed the pbc layout changed also, so the native_pbc test files need to be regenerated on the various platforms--but the ppc one I submitted (see other post, or original patch submission) should work. But if that one fails for you, it's probably b/c of byte order, and I need to look and find where we do the endianness correction for integers in pbc files, and hook in to do something similar for certain string cases. Here's a patch to src/pf_items.c, and a ppc t/native_pbc/number_3.pbc. If it's working correctly, the attached strings-and-byte-order.* should both do the same thing--output the Angstrom symbol. If it's wrong, then the pbc version should output junk on a little-endian system. (If your terminal emulator isn't prepared to handle UTF-8, then pipe the output through 'less', and you should see something like E284AB.) (PS--I had to give the pbc file a fake extension, to keep the develooper mail server from rejecting it.) JEff [Attachments: pf_items_c.patch, number_3.pbc, strings-and-byte-order.pasm, strings-and-byte-order.pbc.file]
Re: Unicode support in Emacs
>>>>> "Karl" == Karl Brodowsky [EMAIL PROTECTED] writes: I get the impression that Unicode-support has kind of gone on top of this stuff and I must admit that the way I am currently using Unicode is to edit the stuff with \ucafe\ubabe-kind of replacements and run perlscripts to convert for example my private html-format into WWW-html. Um. That sounds like a lot of work... XEmacs handles Unicode and UTF-8 quite well, and has for the last couple of years[1]. It may have problems that I don't know of if you dig sufficiently far down, and it may not cooperate flawlessly with all possible major and minor modes, but it's at least good enough for me to edit XML documents in UTF-8 and to read UTF-8-encoded News postings and mail without problems. The most difficult bit has been to find a Unicode font that isn't butt-ugly. [1] In the 21.4.x series you need to install a lisp module (which the XEmacs package system will do for you if you ask it; it's seriously inspired by CPAN) and add three lines to your .emacs. In 21.5.x it should all Just Work. -- Calle Dybedahl [EMAIL PROTECTED] http://www.livejournal.com/users/cdybedahl/ Last week was a nightmare, never to be repeated - until this week -- Tom, a.s.r
Re: Unicode in Emacs (was: Semantics of vector operations)
On Feb 03, David Wheeler wrote: On Feb 3, 2004, at 7:13 AM, Kurt Starsinic wrote: No joke. You'll need to have the mule-ucs module installed. A quick Google search turns up plenty of sources. Oh, I have Emacs 21.3.50. Mule is gone. I'm afraid you're on your own, then. I'm using 21.3.1. If you work it out, please post. - Kurt
Re: Unicode under Windows (was RE: Semantics of vector operations)
Austin Hastings wrote: From: Rod Adams [mailto:[EMAIL PROTECTED] Question in all this: What does one do when they have to _debug_ some code that was written with these lovely Unicode ops, all while stuck in an ASCII world? That's why I suggested a standard script for Unicode2Ascii be shipped with the distro. Good idea, which would also beg an ASCII2Unicode script to reverse the process. Also, isn't it a pain to type all these characters when they are not on your keyboard? As a predominately Win2k/XP user in the US, I see all these glyphs just fine, but having to remember Alt+0171 for a « is going to get old fast... I'd much sooner go ahead and write E<raquo> and be done with it. Thoughts? This has been discussed a bunch of times, but for Windows users the very best thing in the US is to change your Start > Settings > Control Panel > Keyboard > Input Locales so that you have the option of switching over to a United States-International IME. Once you've got that available (I used the Left-Alt+Shift hotkey) you can make a map of the keys. The only significant drawback is the behavior of the quote character, since it is used to encode accent marks. It takes getting used to the quote+space behavior, or defining a macro key (hint, hint). (Links Snipped) Thanks for the pointers. I've now set up Win2k so I can easily switch between US and United States International. Works nicely. Now I have to go beat up the Thunderbird guys for trapping the keyboard directly and not allowing me to type the chars here. Thanks Again -- Rod
RE: Unicode under Windows (was RE: Semantics of vector operations)
-Original Message- From: Austin Hastings [mailto:[EMAIL PROTECTED] From: Rod Adams [mailto:[EMAIL PROTECTED] Question in all this: What does one do when they have to _debug_ some code that was written with these lovely Unicode ops, all while stuck in an ASCII world? That's why I suggested a standard script for Unicode2Ascii be shipped with the distro. Also, isn't it a pain to type all these characters when they are not on your keyboard? As a predominately Win2k/XP user in the US, I see all these glyphs just fine, but having to remember Alt+0171 for a « is going to get old fast... I'd much sooner go ahead and write E<raquo> and be done with it. Thoughts? This has been discussed a bunch of times, but for Windows users the very best thing in the US is to change your Start > Settings > Control Panel > Keyboard > Input Locales so that you have the option of switching over to a United States-International IME. Once you've got that available (I used the Left-Alt+Shift hotkey) you can make a map of the keys. The only significant drawback is the behavior of the quote character, since it is used to encode accent marks. It takes getting used to the quote+space behavior, or defining a macro key (hint, hint). Sorry for the self-reply, but here's some links for you:

These guys sell an overlay, and include a picture of the overlay, for US-Int keyboarding: http://www.datacal.com/dce/catalog/us-international-layout.htm
Some extra information with a Francophonic spin to it: http://www.lehman.cuny.edu/depts/langlit/labs/keyboard.htm
A more complete keyboard diagram: http://www.worldnames.net/ML_input/InternationalKeyboard.cfm
From the horse's mouth there is an interesting applet: http://www.microsoft.com/globaldev/reference/keyboards.aspx
Finally, you could define your OWN keyboard layout using this tool (requires .NET install): http://www.microsoft.com/globaldev/tools/msklc.mspx

=Austin
Re: Unicode, internationalization, C++, and ICU
Maybe we can use someone else's solution... http://lists.ximian.com/archives/public/mono-list/2003-November/ 016731.html On 16 Jan 2004, at 00:33, Jonathan Worthington wrote: - Original Message - From: Dan Sugalski [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, January 15, 2004 8:09 PM Subject: Unicode, internationalization, C++, and ICU Now, assuming there's still anyone left reading this message... We've been threatening to build ICU into parrot, and it's time for that to start happening. Unfortunately there's a problem--it doesn't work right now. So, what we need is some brave soul to track ICU development and keep us reasonably up to date. What I'd really like is: 1) ICU building and working 2) ICU not needing any C++ I've done some testing, and I hate to be the bearer of bad news but I believe we have something of a problem. :-( The configure script turns out to be a shell script which, unless I'm mistaken, means we're currently unable to build ICU anywhere we don't have bash or similar. Win32 for starters, which is where I'm testing. A possible solution might be to re-write the configure script in Perl - though we'd have to keep it maintained as we do ICU updates. Another one, for Win32 at least, is that we *might* be able to use UNIX Services For Win32 and run configure under that, generate a Win32 makefile and just copy it in place with the configure script. Less portable to other places with the same problem, though, and again we have to maintain it as we update ICU. There is also a problem with the configure stage on Win32, but that's an aside until the above issue is sorted out. I also gave it a spin in cygwin, where the configure script for ICU runs OK, but there's no C++ compiler so it doesn't get built. Thoughts? Jonathan
Re: Unicode, internationalization, C++, and ICU
At 10:40 AM +0100 1/16/04, Michael Scott wrote: Maybe we can use someone else's solution... http://lists.ximian.com/archives/public/mono-list/2003-November/016731.html Could be handy. We really ought to detect a system-installed ICU and use that rather than our local copy at configure time, if it's of an appropriate version. That'd at least avoid having two copies, and potentially get us some system-wide runtime memory savings. - Original Message - From: Dan Sugalski [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, January 15, 2004 8:09 PM Subject: Unicode, internationalization, C++, and ICU Now, assuming there's still anyone left reading this message... We've been threatening to build ICU into parrot, and it's time for that to start happening. Unfortunately there's a problem--it doesn't work right now. So, what we need is some brave soul to track ICU development and keep us reasonably up to date. What I'd really like is: 1) ICU building and working 2) ICU not needing any C++ I've done some testing, and I hate to be the bearer of bad news but I believe we have something of a problem. :-( The configure script turns out to be a shell script which, unless I'm mistaken, means we're currently unable to build ICU anywhere we don't have bash or similar. Win32 for starters, which is where I'm testing. A possible solution might be to re-write the configure script in Perl - though we'd have to keep it maintained as we do ICU updates. Another one, for Win32 at least, is that we *might* be able to use UNIX Services For Win32 and run configure under that, generate a Win32 makefile and just copy it in place with the configure script. Less portable to other places with the same problem, though, and again we have to maintain it as we update ICU. There is also a problem with the configure stage on Win32, but that's an aside until the above issue is sorted out. 
I also gave it a spin in cygwin, where the configure script for ICU runs OK, but there's no C++ compiler so it doesn't get built. Thoughts? Jonathan -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Unicode, internationalization, C++, and ICU
On Jan 15, 2004, at 3:33 PM, Jonathan Worthington wrote: - Original Message - From: Dan Sugalski [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, January 15, 2004 8:09 PM Subject: Unicode, internationalization, C++, and ICU Now, assuming there's still anyone left reading this message... We've been threatening to build ICU into parrot, and it's time for that to start happening. Unfortunately there's a problem--it doesn't work right now. So, what we need is some brave soul to track ICU development and keep us reasonably up to date. What I'd really like is: 1) ICU building and working 2) ICU not needing any C++ I've done some testing, and I hate to be the bearer of bad news but I believe we have something of a problem. :-( The configure script turns out to be a shell script which, unless I'm mistaken, means we're currently unable to build ICU anywhere we don't have bash or similar. Win32 for starters, which is where I'm testing. This page gives instructions for building on Windows--it doesn't seem to require installing bash or anything: http://oss.software.ibm.com/cvs/icu/~checkout~/icu/readme.html#HowToBuildWindows I assume that on Windows you don't need to run the configure script. JEff
Re: Unicode, internationalization, C++, and ICU
snip This page gives instructions for building on Windows--it doesn't seem to require installing bash or anything: http://oss.software.ibm.com/cvs/icu/~checkout~/icu/readme.html#HowToBuildWindows I assume that on Windows you don't need to run the configure script. Thanks for that, I'll work on and test a patch for the Configure script to do this on Win32 later. It won't help with any compiler other than MSVC++, but it certainly helps. Thanks, Jonathan
Re: Unicode, internationalization, C++, and ICU
Well I did originally have this in mind, but the more I looked into it the more I thought it needed someone with unicode experience. It seems to me that the unicode world is full of "ah, but in North Icelandic Yiddish aleph is considered to be an infinitely composite character" and other such arcane exceptions that make the inexperienced the natural victims of their own rational assumptions. Also, given the icu-not-building problem (http://www.mail-archive.com/[EMAIL PROTECTED]/msg17477.html) maybe what we need is an icu person per platform. This might have the benefit of making the task seem less onerous. I did manage to get it building on OS X (still does, I just checked). I wonder on what systems it is actually failing? I'll include this wiki page again because it contains a few links that unicode-savvy lurkers might find useful. http://www.vendian.org/parrot/wiki/bin/view.cgi/Main/ParrotDistributionUnicodeSupport Mike On 15 Jan 2004, at 21:09, Dan Sugalski wrote: Now, assuming there's still anyone left reading this message... We've been threatening to build ICU into parrot, and it's time for that to start happening. Unfortunately there's a problem--it doesn't work right now. So, what we need is some brave soul to track ICU development and keep us reasonably up to date. What I'd really like is: 1) ICU building and working 2) ICU not needing any C++ I'd also like a pony, too, so I can live if we don't get #2, at least for a bit (as it means that we now require a C++ compiler to build parrot). Anyone care to volunteer? -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Unicode, internationalization, C++, and ICU
- Original Message - From: Dan Sugalski [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, January 15, 2004 8:09 PM Subject: Unicode, internationalization, C++, and ICU Now, assuming there's still anyone left reading this message... We've been threatening to build ICU into parrot, and it's time for that to start happening. Unfortunately there's a problem--it doesn't work right now. So, what we need is some brave soul to track ICU development and keep us reasonably up to date. What I'd really like is: 1) ICU building and working 2) ICU not needing any C++ I've done some testing, and I hate to be the bearer of bad news but I believe we have something of a problem. :-( The configure script turns out to be a shell script which, unless I'm mistaken, means we're currently unable to build ICU anywhere we don't have bash or similar. Win32 for starters, which is where I'm testing. A possible solution might be to re-write the configure script in Perl - though we'd have to keep it maintained as we do ICU updates. Another one, for Win32 at least, is that we *might* be able to use UNIX Services For Win32 and run configure under that, generate a Win32 makefile and just copy it in place with the configure script. Less portable to other places with the same problem, though, and again we have to maintain it as we update ICU. There is also a problem with the configure stage on Win32, but that's an aside until the above issue is sorted out. I also gave it a spin in cygwin, where the configure script for ICU runs OK, but there's no C++ compiler so it doesn't get built. Thoughts? Jonathan
Re: Unicode operators
At 1:27 PM -0800 11/6/02, Brad Hughes wrote: Flaviu Turean wrote: [...] 5. if you want to wait for the computing platforms before programming in p6, then there is quite a wait ahead. how about platforms which will never catch up? VMS, anyone? Not to start an OS war thread or anything, but why do people still have this mistaken impression of VMS? We have compilers and hard drives and networking and everything. We even have color monitors. Sure, we lack a decent c++ compiler, but we consider that a feature. :-) Lacking a decent C++ compiler isn't necessarily a strike against VMS--to be a strike against, there'd actually have to *be* a decent C++ compiler... -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
vote no - Re: Unicode operators [Was: Re: UTF-8 and Unicode FAQ, demos]
The first message had many of the following characters viewable in my telnet window, but the repost introduced a 0xC2 prefix to the 0xA7 character. I have this feeling that many people would vote against posting all these funny characters, as it does make reading the perl6 mailing lists difficult in some contexts. Ever since introducing these UTF-8 >127 characters into this mailing list, I can never be sure of what the posting author intended to send. I'm all for supporting UTF-8 characters in strings, and perhaps even in variable names, but do we really have to have perl6 programs with core operators in UTF-8? I'd like to see all the perl6 code that had UTF-8 operators start with use non_portable_utf8_operators. As it stands now, I'm going to have to find new tools for my linux platform that has been performing fine since 1995 (perl5.9 still supports libc5!), and I don't yet know how I am going to be able to telnet in from win98, and I'll bet that the dos kermit that I use when I dial up won't support UTF-8 characters either. David ps. I just read how many people will need to upgrade their operating systems if they want to upgrade to MS Word11. Do we want to require operating system and/or many support tools to be upgraded before we can share perl6 scripts via email? On Tue, 5 Nov 2002 at 09:56 -0800, Michael Lazzaro [EMAIL PROTECTED]:

Code  Symbol  Comment
167   §       Could be used
169   ©       Could be used
171   «       May well be used
172   ¬       Not?
174   ®       Could be used
176   °       Could be used
177   ±       Introduces an interesting level of uncertainty? Useable
181   µ       Could be used
182   ¶       Could be used
186   º       Could be used (but I dislike it as it is alphabetic)
187   »       May well be used
191   ¿       Could be used
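[Editorial aside: the stray 0xC2 prefix David mentions is not random corruption, it is UTF-8 at work. Every Latin-1 character in the range 0xA0-0xBF (section sign, guillemets, ...) becomes a two-byte sequence starting with 0xC2 when re-encoded as UTF-8, so the section sign 0xA7 turns into C2 A7. A quick demonstration, sketched in Python:]

```python
# The section sign is 0xA7 in Latin-1; re-encoded as UTF-8 it gains a
# 0xC2 prefix byte, which is exactly what the repost showed.
section = b"\xa7".decode("latin-1")
print(section.encode("utf-8").hex())   # prints "c2a7"
```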
Re: Unicode operators [Was: Re: UTF-8 and Unicode FAQ, demos]
This UTF discussion has got silly. I am sitting at a computer that is operating in native Latin-1 and is quite happy - there is no likelihood that UTF* is ever likely to reach it. The guillemets are coming through fine, but most of the other hieroglyphs leave a lot to be desired. Let's consider the coding comparisons. Chars in the range 128-159 are not defined in Latin-1 (issue 1) and are used differently by Windows to Latin-1 (later issues), so should be avoided. Chars in the range 160-191 (which include the guillemets) are coming through fine if encoded by the sender as UTF8. Anything in the range 192-255 is encoded differently and thus should be avoided. Therefore the only additional characters that could be used, that will work under UTF8 and Latin-1 and Windows, are:

Code  Symbol  Comment
160           Non-breaking space (map to normal whitespace)
161   ¡       Could be used
162   ¢       Could be used
163   £       Could be used
164   ¤       Could be used
165   ¥       Could be used
166   ¦       Could be used
167   §       Could be used
168   ¨       Could be used though risks confusion with
169   ©       Could be used
170   ª       Could be used (but I dislike it as it is alphabetic)
171   «       May well be used
172   ¬       Not?
173           Nonbreaking - treat as the same
174   ®       Could be used
175   ¯       May cause confusion with _ and -
176   °       Could be used
177   ±       Introduces an interesting level of uncertainty? Useable
178   ²       To the power of 2 (squaring?) Otherwise best avoided
179   ³       Cubing? Otherwise best avoided
180   ´       Too confusing with ' and `
181   µ       Could be used
182   ¶       Could be used
183   ·       Dot product? though likely to be confused with .
184   ¸       treat as ,
185   ¹       To the power 1? Probably best avoided
186   º       Could be used (but I dislike it as it is alphabetic)
187   »       May well be used
188   ¼       Could be used
189   ½       Could be used
190   ¾       Could be used
191   ¿       Could be used

Richard -- Personal [EMAIL PROTECTED] http://www.waveney.org Telecoms [EMAIL PROTECTED] http://www.WaveneyConsulting.com Web services [EMAIL PROTECTED] http://www.wavwebs.com Independent Telecomms Specialist, ATM expert, Web Analyst Services
Re: Unicode operators [Was: Re: UTF-8 and Unicode FAQ, demos]
Thanks, I've been hoping for someone to post that list. Taking it one step further, we can assume that the only chars that can be used are those which:
-- don't have an obvious meaning that needs to be reserved
-- appear decently on all platforms
-- are distinct and recognizable in the tiny font sizes used when programming

Comparing your list with mine, with some subjective editing based on my small courier font, that chops the list of usable operators down to only a handful:

Code  Symbol  Comment
167   §       Could be used
169   ©       Could be used
171   «       May well be used
172   ¬       Not?
174   ®       Could be used
176   °       Could be used
177   ±       Introduces an interesting level of uncertainty? Useable
181   µ       Could be used
182   ¶       Could be used
186   º       Could be used (but I dislike it as it is alphabetic)
187   »       May well be used
191   ¿       Could be used

That's all. A shame, because some of the others have very interesting possibilities: • ≠ ø † ∑ ∂ ƒ ∆ ≤ ≥ ∫ ≈ Ω ‡ ± ˇ ∏ Æ But if Windows can't easily do them, that's a pretty big problem. Thanks for the list. MikeL
Re: Unicode operators [Was: Re: UTF-8 and Unicode FAQ, demos]
I'm all for one or two Unicode operators if they're chosen properly (and I trust Larry to do that since he's done a stellar job so far), but what's the mechanism to generate Unicode operators if you don't have access to a unicode-aware editor/terminal/font/etc.? Is the only recourse to use the named versions? Or will there be some sort of digraph/trigraph/whatever sequence that always gives us the operator we need? Something like \x[263a] but in regular code and not just quote-ish contexts:

$campers = $a \x[263a] $b    # make $a and $b happy

-Scott -- Jonathan Scott Duff [EMAIL PROTECTED]
Re: Unicode operators [Was: Re: UTF-8 and Unicode FAQ, demos]
Dan Kogai wrote: We already have source filters in perl5 and I'm pretty much sure someone will just invent yet another 'use operators = ascii;' kind of stuff in perl6.

I think it's backwards to have operators be funny characters by default while requiring explicit declaration to use well-known ASCII characters. Doing it t'other way round would mean that you can always write fully portable code fragments in pure ASCII, something that'd be helpful on mailing lists and the like. There could be an alias syntax for people in an environment where they'd prefer to have a non-ASCII character in place of a conglomerate of ASCII symbols, maybe:

treat '»...«' as '[...]';

That has the documentational advantage that any non-ASCII character used in code must be declared earlier in that file. And even if the non-ASCII character gets warped in the post and displays oddly for you, you can still see what the author intended it to do. This has the risk that Damian described of everybody defining their own operators, but I think that's unlikely. There's likely to be a convention used by many people, at least those who operate in a given character set. This way also permits those who live in a Latin-2 (or whatever) world to have their own convention using characters that make sense to them. Smylers
Re: Unicode operators [Was: Re: UTF-8 and Unicode FAQ, demos]
Richard Proctor wrote: I am sitting at a computer that is operating in native Latin-1 and is quite happy - there is no likelihood that UTF* will ever reach it. ... Therefore the only additional characters that could be used, that will work under UTF-8 and Latin-1 and Windows ...

What about people who don't use Latin-1, perhaps because their native language uses Latin-2 or some other character set mutually exclusive with Latin-1? I don't have a Latin-2 ('Central and East European languages') typeface handy, but its manpage includes:

253   171   AB   LATIN CAPITAL LETTER T WITH CARON
273   187   BB   LATIN SMALL LETTER T WITH CARON

Caron is sadly missing from my dictionary so I'm not sure what those would look like, but I suspect they wouldn't be great symbols for vector operators.

171 « May well be used

Also I wonder how similar to doubled less-than or greater-than signs guillemets would look. In this font they're fine, but I'm concerned about my ability to make them sufficiently distinguishable on a whiteboard, and whether publishers will cope with them (compare a recent discussion on 'use Perl' regarding curly quotes and fi ligatures appearing in code samples). Smylers
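The collision Smylers describes is literal: the very same byte names a guillemet in Latin-1 and a T-with-caron in Latin-2. A Python sketch using the stdlib codecs (my addition):

```python
# Byte 0xAB (decimal 171) is LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
# in ISO 8859-1 but LATIN CAPITAL LETTER T WITH CARON in ISO 8859-2,
# exactly the cross-charset ambiguity discussed above.
b = b'\xab'
print(b.decode('iso8859-1'))   # «
print(b.decode('iso8859-2'))   # Ť
```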
Re: Unicode operators [Was: Re: UTF-8 and Unicode FAQ, demos]
On Tue 05 Nov, Smylers wrote: Richard Proctor wrote: I am sitting at a computer that is operating in native Latin-1 and is quite happy - there is no likelihood that UTF* will ever reach it. ... Therefore the only additional characters that could be used, that will work under UTF-8 and Latin-1 and Windows ... What about people who don't use Latin-1, perhaps because their native language uses Latin-2 or some other character set mutually exclusive with Latin-1?

Once you go beyond Latin-1 there is nothing common anyway. The guillemets become T and t with inverted hats under Latin-2, oe and G with an inverted hat under Latin-3, oe and G with a squiggle under it under Latin-4, no meaning and a stylised K for Latin-5 (can't find Latin-6), guillemets under Latin-7, nothing under Latin-8.

Richard
--
Personal     [EMAIL PROTECTED]  http://www.waveney.org
Telecoms     [EMAIL PROTECTED]  http://www.WaveneyConsulting.com
Web services [EMAIL PROTECTED]  http://www.wavwebs.com
Independent Telecomms Specialist, ATM expert, Web Analyst Services
Re: Unicode operators [Was: Re: UTF-8 and Unicode FAQ, demos]
As one of the instigators of this thread, I submit that we've probably argued about the Unicode stuff enough. The basic issues are now known, and it's known that there's no general agreement on any of this stuff, nor will there ever be. To wit:

-- Extended glyphs might be extremely useful in extending the operator table in non-ambiguous ways, especially for advanced things like «op».
-- Many people loathe the idea, and predict newcomers will too.
-- Many mailers and older platforms tend to react badly, for both viewing and inputting.
-- If extended characters are used at all, the decision needs to be made whether they shall be least-common-denominator Latin-1, UTF-8, or full Unicode, and whether there are backup spellings so that everyone can play.

It's up to Larry, and he knows where we're all coming from. Unless anyone has any _new_ observations, I propose we pause the debate until a decision is reached? MikeL
Re: Unicode operators [Was: Re: UTF-8 and Unicode FAQ, demos]
Scott Duff wrote: I'm all for one or two unicode operators if they're chosen properly (and I trust Larry to do that since he's done a stellar job so far), but what's the mechanism to generate unicode operators if you don't have access to a unicode-aware editor/terminal/font/etc.? Is the only recourse to use the named versions? Or will there be some sort of digraph/trigraph/whatever sequence that always gives us the operator we need? Something like \x[263a] but in regular code and not just quote-ish contexts: $campers = $a \x[263a] $b # make $a and $b happy

That would probably be:

$campers = $a \c[263a] $b    # make $a and $b happy

if it were allowed (which I suspect it mightn't be, since it looks rather like an index on a reference to the value returned by a call to the subroutine C<c>). Incidentally, this is why I previously suggested that we might allow POD escapes in code as well. Thus:

$campers = $a E<263a> $b    # make $a and $b happy

Damian
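For comparison, later languages did settle on exactly this kind of in-source escape. A Python sketch of the idea (Python syntax, not any proposed Perl 6 syntax):

```python
# Numeric and named escapes spell a character without needing a
# Unicode-capable editor -- the same job \x[263a] or E<263a> would do.
by_number = "\u263a"
by_name = "\N{WHITE SMILING FACE}"
assert by_number == by_name
print(by_number)
```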
Re: Unicode operators [Was: Re: UTF-8 and Unicode FAQ, demos]
Michael Lazzaro proposed: It's up to Larry, and he knows where we're all coming from. Unless anyone has any _new_ observations, I propose we pause the debate until a decision is reached? I second the motion! Damian
RE: Unicode thoughts...
At 4:32 PM -0800 3/25/02, Brent Dax wrote: I *really* strongly suggest we include ICU in the distribution. I recently had to turn off mod_ssl in the Apache 2 distro because I couldn't get OpenSSL downloaded and configured. FWIW, ICU in the distribution is a given if we use it. Parrot will require a C compiler and link tools (maybe make, but maybe not) to build on a target platform and nothing else. If we rely on ICU we must ship with it. -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Unicode thoughts...
Someone said that ICU requires a C++ compiler. That's concerning to me, as is the issue of how we bootstrap our build process. We were planning on a platform-neutral miniparrot, and IMHO that can't include ICU (as I'm sure it's not going to be written in pure ANSI C). --Josh

At 8:45 on 03/30/2002 EST, Dan Sugalski [EMAIL PROTECTED] wrote: FWIW, ICU in the distribution is a given if we use it. Parrot will require a C compiler and link tools (maybe make, but maybe not) to build on a target platform and nothing else. If we rely on ICU we must ship with it. -- Dan
Re: Unicode thoughts...
At 10:07 AM -0500 3/30/02, Josh Wilmes wrote: Someone said that ICU requires a C++ compiler. That's concerning to me, as is the issue of how we bootstrap our build process. We were planning on a platform-neutral miniparrot, and IMHO that can't include ICU (as I'm sure it's not going to be written in pure ANSI C)

If the C++ bits are redoable as C, I'm OK with it. I've not taken a good look at it to know how much it depends on C++. If it's mostly // comments and such we can work around the issues easily enough. If it's objects, well, I suppose it depends on how much it relies on them.

Dan
--it's like this---
Dan Sugalski even samurai
[EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Unicode thoughts...
Dan Sugalski wrote: At 10:07 AM -0500 3/30/02, Josh Wilmes wrote: Someone said that ICU requires a C++ compiler. That's concerning to me, as is the issue of how we bootstrap our build process. We were planning on a platform-neutral miniparrot, and IMHO that can't include ICU (as i'm sure it's not going to be written in pure ansi C) If the C++ bits are redoable as C, I'm OK with it. I've not taken a good look at it to know how much it depends on C++. If it's mostly // comments and such we can work around the issues easily enough. If its objects, well, I suppose it depends on how much it relies on them. Looking at icu/common I see more .c files than .cpp files, and what .cpp files there are look somewhat like wrappers around the C code. In addition, the .cpp files appear to do such things as create iterator wrappers, bidirectional display and normalization. Some of the files are thicker than wrappers, but I think there's enough code behind this C++ veneer that we can at least use it if not the entire library. -- Jeff [EMAIL PROTECTED]
RE: Unicode thoughts...
Jeff: # This will likely open yet another can of worms, but Unicode has been # delayed for too long, I think. It's time to add the Unicode libraries # (In our case, the ICU libraries at http://oss.software.ibm.com/icu/, # which Larry has now blessed) to Parrot. string.c already has # (admittedly # unavoidable, due to the library not being included) # assumptions such as # isdigit(). So, I have a few thoughts (that may have already been shot # down by people wiser than I in such matters) to explicate, and some # questions to ask. # # ICU should be added as a note in the README, and maybe to 'INSTALL' if # we ever create one. Let's not add it to CVS, as it's not under our # control. If we have to patch ICU to make it work correctly # with Parrot, # the patches should be submitted back to the ICU team. And I'm joining # the appropriate mailing lists to keep appraised of development. I *really* strongly suggest we include ICU in the distribution. I recently had to turn off mod_ssl in the Apache 2 distro because I couldn't get OpenSSL downloaded and configured. We also need to make sure ICU will work everywhere. And I do mean *everywhere*. Will it work on VMS? Palm OS? Crays? # Before Unicode goes into full swing, I need some idea of how # we're going # to deploy the libraries. On this note, I defer to the # Configure master, # Brent. I've already done some work with ICU, so I'm reasonably # comfortable with migrating in one Unicode bit at a time, until we're # ready for full UTF-16 compliance. # # The RE engine should (I'm speaking without having recently read the # source, so feel free to correct me) not need to be migrated, as it's # already using UTF-32 internally, which leaves just the string # internals. # These can be migrated to using ICU macros fairly easily (I've already # done some of the work locally), so I think the main focus should be on # encodings, as we'll have to eventually support the more common # wide-character encodings such as KOI-8 and BIG5. 
There are a few things that need to change, but they aren't big issues. Mostly it's just places where character sets have been presumed. However, I'm seriously thinking about a major re-architecture of the regex engine, which would probably help these sorts of issues. # I still have some questions about using UTF-16 internally for string # representation (as mentioned in # http:[EMAIL PROTECTED]/msg07856.html), # but I've resolved most of those. It's an excellent match for the ICU # library, as it uses UTF-16 internally. My only question is if we're # going to incur a performance hit every time a scalar is transferred to # the RE engine, as it uses UTF-32 internally. That can change. However, UTF-32 seems like the best match, as it would allow us to reach into a string's guts for speed. (We don't currently do that, but if I do redesign the engine, I'll probably be able to.) # Also, once we have UTF-16 running internally, I'd be interested in # seeing what memory consumption looks like vs. UTF-32, because I'd like to # see if it makes sense to add a compile-time switch between UTF-8 and # UTF-32 to let the installer decide on memory tradeoffs. ICU has an # internal macro that defines its own internal representation, and that # could conflict with our intended usage as well. # # Performance would suffer in the UTF-8 case, naturally, but the # difference in memory usage might be significant enough that we'd want to # leave the decision up to the installer. Having said that, the headache # of testing multiple versions of Perl6 might not be worth it. # # So, to wrap up, I'm soliciting thoughts on how best to start the Unicode # migration, and deal with the inevitable problems that will come up. I'm # hoping that most of the magic will be hidden in string.c, where we won't # have to worry about it, but we'll have to see. # # Now, this is admittedly being composed at 2:00 A.M, so my thoughts may # not be the most coherent, and for that I apologize. 
Most of my concern # stems from how best to add build steps to the various platforms without # ending up with a completely broken Parrot for weeks and developers # screaming about What the *HELL* is this error? Where is this library? # brane explodes. If these issues have already been beaten to death and # we've moved on to more interesting issues, of course I'll be interested # there as well. Overall you seem to be pretty on target. Of course, my brain isn't really built for character sets and stuff like that. Also note that I went to bed at one, was rudely awakened by a screaming toddler at two, didn't fall asleep again till four, and woke up at nine, so I'm probably not very coherent. I feel a little dizzy--I'm gonna take a nap. --Brent Dax [EMAIL PROTECTED] @roles=map {Parrot $_} qw(embedding regexen Configure) #define private public --Spotted in a C++ program just before a #include
RE: Unicode thoughts...
We also need to make sure ICU will work everywhere. And I do mean *everywhere*. Will it work on VMS? Palm OS? Crays?

Nope, nope, and nope. From their site -

Operating system     Compiler                   Testing frequency
Windows 98/NT/2000   Microsoft Visual C++ 6.0   Reference platform
Red Hat Linux 6.1    gcc 2.95.2                 Reference platform
AIX 4.3.3            xlC 3.6.4                  Reference platform
Solaris 2.6          Workshop Pro CC 4.2        Reference platform
HP/UX 11.01          aCC A.12.10                Reference platform
AIX 5.1.0L           Visual Age C++ 5.0         Regularly tested
Solaris 2.7          Workshop Pro CC 6.0        Regularly tested
Solaris 2.6          gcc 2.91.66                Regularly tested
FreeBSD 4.4          gcc 2.95.3                 Regularly tested
HP/UX 11.01          CC A.03.10                 Regularly tested
OS/390 (zSeries)     CC r10                     Regularly tested
AS/400 (iSeries)     V5R1 iCC                   Rarely tested
NetBSD, OpenBSD                                 Rarely tested
SGI/IRIX                                        Rarely tested
PTX                                             Rarely tested
OS/2                 Visual Age                 Rarely tested
Macintosh                                       Needs help to port

-(MBrod)-
Re: Unicode thoughts...
This is rather concerning to me. As I understand it, one of the goals for parrot was to be able to have a usable subset of it which is totally platform-neutral (pure ANSI C). If we start to depend too much on another library which may not share that goal, we could have trouble with the parrot build process (which was supposed to be shipped as parrot bytecode). --Josh

At 17:02 on 03/25/2002 PST, Charles Bunders [EMAIL PROTECTED] wrote: We also need to make sure ICU will work everywhere. And I do mean *everywhere*. Will it work on VMS? Palm OS? Crays? Nope, nope, and nope.
RE: Unicode thoughts...
I think it will be relatively easy to deal with different compilers and different operating systems. However, ICU does contain some C++ code. That will make life much harder, since current Parrot assumes only ANSI C (even a subset of it). Hong

This is rather concerning to me. As I understand it, one of the goals for parrot was to be able to have a usable subset of it which is totally platform-neutral (pure ANSI C). If we start to depend too much on another library which may not share that goal, we could have trouble with the parrot build process (which was supposed to be shipped as parrot bytecode)
Re: Unicode thoughts...
Hong Zhang wrote: I think it will be relatively easy to deal with different compilers and operating systems. However, ICU does contain some C++ code. That will make life much harder, since current Parrot assumes only ANSI C (even a subset of it). Hong This is rather concerning to me. As I understand it, one of the goals for parrot was to be able to have a usable subset of it which is totally platform-neutral (pure ANSI C). If we start to depend too much on another library which may not share that goal, we could have trouble with the parrot build process (which was supposed to be shipped as parrot bytecode)

I guess it's obvious that I hadn't looked at the target platforms for ICU as closely as I probably should have. C vs. C++ doesn't concern me, as it can always be rewritten, but lack of platforms like OS X does. Given that, I think the interim solution is a set of the basic Unicode utilities we'll need, such as Unicode_isdigit(). This can be a simple wrapper around isdigit() for the moment, until I sort out which files we need from the Unicode database, and what support functions/data structures will be required. Given that we're dedicated to either UTF-16 or UTF-32 for internal string representation (undecided as of yet, and not affected by this), we can get away with creating a simple unicode.{c,h} suite of functions that looks like:

Parrot_Int Parrot_isDigit(char* glyph);

We can get away with the simplicity here because the character array should already be a valid UTF-{16,32} string, and responsibility for making sure there's a valid glyph at that offset can be safely offloaded to the caller, if not higher up the calling chain. Also, it should be in a separate file because, assuming the final internal representation matches that of the RE engine, the engine can use these utilities as well. 
Now, admittedly this is only slightly better thought out than the original proposal, but I think it has a much better chance of being implemented, and in a fairly short amount of time. (He said, knowing full well that there's always one more problem.) ASCII versions of the functions should be almost trivial, and can be left in there as a compile-time switch should we choose to do an ASCII-only or UTF-8-only version. In conclusion, this approach feels more workable, and the full UTF-16 implementation details can be rolled out incrementally, rather than in a single mass migration. If this suggestion flies, I'll rewrite strings.pdd and post it in the next few days. -- Jeff [EMAIL PROTECTED]
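The proposed Parrot_isDigit() boils down to classifying by Unicode general category rather than by ASCII isdigit(). A rough Python equivalent of that behaviour (the function name is mine, not Parrot's):

```python
import unicodedata

def unicode_isdigit(ch: str) -> bool:
    # True for any character in general category Nd (decimal digit),
    # not just ASCII '0'-'9' -- the behaviour a Unicode-aware
    # isdigit() wrapper would need.
    return unicodedata.category(ch) == 'Nd'

print(unicode_isdigit('7'))        # ASCII digit
print(unicode_isdigit('\u0967'))   # DEVANAGARI DIGIT ONE
print(unicode_isdigit('A'))        # not a digit
```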
Re: Unicode thoughts...
Jeff wrote: Given that, I think an interim solution consisting of basic Unicode utilities we'll need, such as Unicode_isdigit(). [...] If this suggestion flies, I'll rewrite strings.pdd and post it in the next few days.

Okay, now I feel utterly silly, having just looked at chartypes/unicode.c. Well, that approach'll work. Wonder why nobody thought...greps for isdigit()...uh...never mind. I'll be over here, with the dunce cap on. -- Jeff [EMAIL PROTECTED]
RE: Unicode sorting...
I can't really believe that this would be a problem, but if they're integrated alphabets from different locales, will there be issues with sorting (if we're not planning to use the locale)? Are there instances where like characters were combined that will affect the sort orders?

Yes, it is an issue. In the general case, you CANNOT sort strings of several locales/languages into a single order that would satisfy all of the locales/languages. One often-quoted example is German and Swedish/Finnish: the LATIN CAPITAL LETTER A WITH RING ABOVE comes between A and B in the former but after Z (not immediately, but that doesn't matter here) in the latter. Similarly for all the accented alphabetic characters, the rules for how they are sorted differ from one place to another, and many languages have special combinations like ch, ss, ij that require special attention.

My understanding is that there is NO general Unicode sorting, period. The most useful one must be locale-sensitive, as defined by the Unicode collation. In practice, the story is even worse. For example, how do you sort strings coming from different locales? Say I have an address book with names from all over the world; which locale should I use to sort the names? Another example: Chinese has no definite sorting order, period. The commonly used schemes are phonetic-based or stroke-based. Since many characters have more than one pronunciation (context sensitive) and more than one form (simplified and traditional), if we have mixed content from China and Taiwan it is impossible to sort in a way everyone will feel happy. Also, Chinese is space insensitive. In English we have to use spaces to separate words, but in Chinese there are no lexical words, only linguistic words; you can insert a space between any two Chinese characters without changing their meaning. I heard a rumor a long time ago that the Unicode consortium was working on a locale-independent collation, which can be used to sort mixed content. 
As for Perl, I'd like to have several basic sortings:

a) binary sorting
b) locale-independent general sorting
c) locale-sensitive sorting based on the Unicode collation

We could have more if possible. The general sort can be done by canonicalizing all strings, removing case info, removing diacritics, removing font/width distinctions, then using binary sort. Hong
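The "locale-independent general sort" Hong describes (canonicalize, drop case, drop diacritics, then compare binary) can be sketched in a few lines of Python with the stdlib character database (function name and word list are mine):

```python
import unicodedata

def general_key(s: str) -> str:
    # Canonical decomposition plus case folding, then strip combining
    # marks: the normalization steps proposed for sort (b) above.
    d = unicodedata.normalize('NFD', s).casefold()
    return ''.join(c for c in d if not unicodedata.combining(c))

words = ['Zebra', 'éclair', 'apple', 'Ångström']
print(sorted(words, key=general_key))
# Accented and cased words interleave where a plain binary sort
# would push them apart.
```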
RE: Unicode sorting...
Another example: Chinese has no definite sorting order, period. The commonly used schemes are phonetic-based or stroke-based. Since many characters have more than one pronunciation (context sensitive) and more than one form (simplified and traditional), if we have mixed content from China and Taiwan it is impossible to sort in a way everyone will feel happy.

If this is the case, how would a regex like ^[a-zA-Z] work (or other, more sensitive characters)? If just about anything can come between A and Z, and letters that might be there in a particular locale aren't in another locale, then how will the regex engine make the distinction? Will it have to create its own locale-specific character table? Grant M. (is it just me, or is this looking more and more painful?)
RE: Unicode sorting...
If this is the case, how would a regex like ^[a-zA-Z] work (or other, more sensitive characters)? If just about anything can come between A and Z, and letters that might be there in a particular locale aren't in another locale, then how will the regex engine make the distinction?

This syntax was designed for English. It just does not make any sense in Chinese. The Chinese simply didn't have a sorting order for most of their history; the phonetic order and stroke order were introduced only a couple of hundred years ago. I don't really care how regex handles it. If I do need to search a range or sort, I will create my own collator to convert the string into a normalized form, and hand it to regex or qsort. It is up to me to define the collator. The regex does not even need to care about the order. Of course, the regex will support some basic ordering for optimization. Hong
Re: Unicode sorting...
If this is the case, how would a regex like ^[a-zA-Z] work (or other, more sensitive characters)? If just about anything can come between A and Z, and letters that might be there in a particular locale aren't in another locale, then how will regex engine make the distinction? This syntax was designed for English. It just does not make any sense in Chinese. It actually is rather faulty for English, too. -- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen
Re: Unicode sorting...
I can't really believe that this would be a problem, but if they're integrated alphabets from different locales, will there be issues with sorting (if we're not planning to use the locale)? Are there instances where like characters were combined that will affect the sort orders? Yes, it is an issue. In the general case, you CANNOT sort strings of several locales/languages into a single order that would satisfy all of the locales/languages. One often quoted example is German and Swedish/Finnish: the LATIN CAPITAL LETTER A WITH RING ABOVE comes between A and B in the former but after Z (not immediately, but doesn't matter here) in the latter. Similarly for all the accented alphabetic characters, the rules how they are sorted differ from one place to another , and many languages have special combinations like ch, ss, ij that require special attention. Unicode defines a canonical ordering which has hooks for locale specific rules: http://www.unicode.org/unicode/reports/tr10/ -- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen
RE: Unicode sorting...
At 11:29 AM 6/8/2001 -0700, Hong Zhang wrote: If this is the case, how would a regex like ^[a-zA-Z] work (or other, more sensitive characters)? If just about anything can come between A and Z, and letters that might be there in a particular locale aren't in another locale, then how will regex engine make the distinction? This syntax was designed for English. It just does not make any sense in Chinese. The Chinese just don't have sorting order for most of history. The phonetic order and stroke order was introduced only couple of hundred years ago. The A-Z syntax is really a shorthand for All the uppercase letters. (Originally at least) I won't argue the problems with sorting various sets of characters in various locales, but for regexes at least it's not an issue, because the point isn't sorting or ordering, it's identifying groups. We just need to make sure there's a named group for the different languages we know of--things like [[:kanji]] or [[:hiragana]] for example. (They should also be named in the language they represent, but I'm going to take a miss on trying to wedge an example in here, as I've a hard enough time getting letters with umlauts in) Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Unicode sorting...
The A-Z syntax is really a shorthand for All the uppercase letters. (Originally at least) I won't argue the problems with sorting various sets of characters in various locales, but for regexes at least it's not an issue, because the point isn't sorting or ordering, it's identifying groups. We just need to make sure there's a named group for the different languages we know of--things like [[:kanji]] or [[:hiragana]] for example.

It's spelled \p{...} (after I fixed a silly typo in bleadperl)

$ ./perl -Ilib -wle 'print "a" if "\x{30a1}" =~ /\p{InKatakana}/'
a
$ grep 30A1 lib/unicode/Unicode.txt
30A1;KATAKANA LETTER SMALL A;Lo;0;L;;;;;N;;;;;
3301;SQUARE ARUHUA;So;0;L;<square> 30A2 30EB 30D5 30A1;;;;N;SQUARED ARUHUA;;;;
3332;SQUARE HUARADDO;So;0;L;<square> 30D5 30A1 30E9 30C3 30C9;;;;N;SQUARED HUARADDO;;;;
FF67;HALFWIDTH KATAKANA LETTER SMALL A;Lo;0;L;<narrow> 30A1;;;;N;;;;;
$

-- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen
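Python's stdlib re module has no \p{...}, but the same membership test can be approximated from the character database; a sketch (mine, using the Katakana block range U+30A0..U+30FF as a crude stand-in for InKatakana):

```python
import unicodedata

ch = '\u30a1'   # KATAKANA LETTER SMALL A
# Name lookup confirms the character's identity...
print(unicodedata.name(ch))
# ...and a block-range check is a crude InKatakana property test.
print('\u30a0' <= ch <= '\u30ff')
```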
Re: Unicode handling
Dan Sugalski writes: : Fair enough. I think there are some cases where there's a base/combining : pair of codepoints that don't map to a single combined-character code : point. Not matching on a glyph boundary could make things really odd, but : I'd hate to have the checking code on by default, since that'd slow down : the common case where the string in NFC won't have those. Assume that in practice most of the normalization will be done by the input disciplines. Then we might have a pragma that says to try to enforce level 1, level 2, level 3 if your data doesn't match your expectations. Then hopefully the expected semantics of the operators will usually (I almost said "normally" :-) match the form of the data coming in, and forced conversions will be rare. That's how I see it currently. But the smarter I get the less I know. Larry
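The base/combining pair versus precomposed code point situation Dan and Larry are discussing is exactly what NFC/NFD normalization mediates. A Python illustration (my addition, not from the thread):

```python
import unicodedata

composed = '\u00e9'       # é as a single precomposed code point (NFC)
decomposed = 'e\u0301'    # 'e' + COMBINING ACUTE ACCENT (NFD)

# Binary-unequal strings, but canonically equivalent: normalizing
# either form maps between the two representations.
print(composed == decomposed)
print(unicodedata.normalize('NFC', decomposed) == composed)
print(unicodedata.normalize('NFD', composed) == decomposed)
```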
Re: Unicode handling
Garrett Goebel writes:
: Someone please clue me in. A pointer to an RFC which defines the use of
: colons in Perl6, among other things, would help.

Heh. If you read the RFCs, you'll discover one of the basic rules of language redesign: everybody wants the colon. And it never seems to occur to people that we'll actually have to break Perl 5's ?: operator in order to give them the colon. :-)

Larry
Re: Unicode handling
At 07:21 AM 3/27/2001 -0800, Larry Wall wrote:
>Assume that in practice most of the normalization will be done by the
>input disciplines. Then we might have a pragma that says to try to
>enforce level 1, level 2, or level 3 if your data doesn't match your
>expectations. Then hopefully the expected semantics of the operators
>will usually (I almost said "normally" :-) match the form of the data
>coming in, and forced conversions will be rare.

The only problem with that is that it means we'll potentially be altering the data as it comes in, which leads back to the problem of input and output files not matching for simple filter programs. (Plus it means we spend CPU cycles altering data that we might not actually need to.)

It might turn out that deferred conversions don't save anything, and if that's so then I can live with that. And we may feel comfortable declaring that we preserve equivalency in Unicode data only, and that's OK too. (Though *you* get to call that one... :)

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk
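Dan's filter-program concern is concrete: normalizing on input changes the bytes, so a pure pass-through no longer writes out what it read. A sketch in Python for convenience (the encodings, not the Perl I/O layer, are what matter here):

```python
import unicodedata

raw = 'e\u0301'  # decomposed é arrives on input: 3 bytes in UTF-8
normalized = unicodedata.normalize('NFC', raw)  # precomposed: 2 bytes

print(raw.encode('utf-8'))         # b'e\xcc\x81'
print(normalized.encode('utf-8'))  # b'\xc3\xa9'

# A "do nothing" filter that normalized on read would emit different
# bytes than it consumed, even though the text is equivalent:
print(raw.encode('utf-8') == normalized.encode('utf-8'))  # False
```

The two outputs are canonically equivalent text, so Larry's "preserve equivalency only" stance is coherent; it just gives up byte-for-byte fidelity.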
Re: Unicode handling
At 08:37 PM 3/26/2001 +, [EMAIL PROTECTED] wrote:
>Damien Neil <[EMAIL PROTECTED]> writes:
>>So $c = chr(ord($c)) could change $c? That seems odd.
>
>It changes its _representation_ (e.g. from 0x41,ASCII to 0xC1,EBCDIC) but
>not its "fundamental" 'LATIN CAPITAL LETTER A'-ness. Then of course
>someone will want it to be the number 0x41 and not do that 'cos they are
>using chr/ord to mess with JPEG image data... So there needs to be a
>'binary' encoding which they can use.
>
>>That doesn't seem to be what Dan was saying, however.
>
>And Dan is the one "in charge" on this list - so my perl5.7-ish view may
>be wrong.

"In charge" is such a strong phrase. (And not what I thought the job originally was, but that's a separate issue...)

>>It would make perfect sense to me for chr(ord($c)) to return $c in a
>>different encoding. (Assuming, of course, that $c is a single
>>character.) Assume ord is dependent on the current default encoding.
>>
>>    use utf8; # set default encoding.
>>    my $e : ebcdic = 'a';
>>    my $u = chr(ord($e));
>>
>>If ord is dependent on the current default encoding, I would expect the
>>above to leave the UTF-8 string "a" in $u. This makes sense to me.
>
>Good.

I'm afraid this isn't what I'd normally think of--ord to me returns the integer value of the first code point in the string. That does mean that A is different for ASCII and EBCDIC, but that's just One Of Those Things. The alternative is for us to do data conversions some times (when we're pulling data out of an EBCDIC or Shift-JIS string in a Unicode block) but not others (when we're pulling binary data out in a Unicode or EBCDIC block). That seems a little off to me, but I could well be wrong. It also means we may well mangle data that's incorrectly tagged--if, for example, an input filter tagged binary data with a non-binary type, which isn't that unlikely.
>>If ord is dependent on the encoding of the string it gets, as Dan was
>>saying, then ord($e) is 0x81,
>
>It could still be 0x81 (from ebcdic) with the encoding carried along with
>the _number_, if we thought that worth the trouble. (It isn't too bad for
>assignment, but it is far from clear what 2 (ebcdic) * 0xA1 (iso_8859_7)
>might mean - perhaps we drop the tag if anything other than + or -
>happens.) Or what we do with it if it's stringified.

The only thing I can see keeping the tag around for would be later chr() and pack() calls, and that doesn't seem like it'd happen often enough to justify the overhead. Could be wrong, though.

>>and $u is "\x81". This seems strange.
>>
>>Hmm. It suddenly occurs to me that I may have been misinterpreting: ord
>>is dependent on both the encoding of its argument (to determine the
>>logical character contained in that argument) and the current default
>>encoding (to determine the value in the current character set
>>representing that character).

That wasn't my intention. I was thinking that chr was bound to the current default encoding, and ord was bound to the string type of the scalar being ord-ed.

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk
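The 0x81 in the thread is the EBCDIC byte for lowercase 'a', which is why the answer to "what does ord return?" depends on whether ord sees the stored representation or the logical character. A sketch in Python for convenience, using its stdlib EBCDIC codec cp037 (the Perl `:ebcdic` attribute above was only a proposal):

```python
# Unicode/ASCII: 'a' is code point 0x61.
print(hex(ord('a')))                  # 0x61

# The same logical character stored in EBCDIC (code page 037) is the
# byte 0x81 -- this is the ord($e) == 0x81 case from the thread:
print(hex('a'.encode('cp037')[0]))    # 0x81

# Decoding that byte back recovers the same 'a'-ness, the
# "representation changed, character didn't" point:
print(bytes([0x81]).decode('cp037'))  # a
```

So "ord bound to the string's own encoding" yields 0x81 for an EBCDIC 'a', while "ord bound to the logical character in the default encoding" yields 0x61; the code above shows both numbers name one character.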