[OT] Re: [MacPerl] Re: problem with Japanese text

2003-03-30 Thread Joel Rees
 ... In English the singular nominative pronoun is nothing but I, 
 no matter how old or young you are or whether you are a boy or a girl 
 (or a computer).  But in Japanese it can be Watashi or Boku or 
 Ore 

Boku. Hmm. What would it mean if an educational program intended for
kindergarten use said Boku no shippai desu ka?

 ...

 Well, 
 the total richness of Japanese is, I believe, increasing but 
 ironically per-capita richness might not be.  But I believe this 
 phenomenon is not unique to Japanese; it could be even more prevalent in 
 English.  If you don't believe me just compare the Two Bushes in the White 
 House :)

Heh. West Texas is a country and a language to itself. 

-- 
Joel Rees [EMAIL PROTECTED]
(Amusing myself by imagining the elder bush saying Kwansai and the
younger saying Kansai. Not sure I could imagine the old guy saying
tefu-tefu, though.)



Re: [MacPerl] Re: problem with Japanese text

2003-03-29 Thread Robin
On Saturday, March 29, 2003, at 10:19  am, Jeff Lowrey wrote:
I don't think we can use 'camel' unless we're willing to admit that we 
also have not evolved to smell good.
and spit when we're unhappy :-)

On Friday, March 28, 2003, at 02:00  am, Nicholas G. Thornton wrote:

 in Japanese you can have different kanji (or groups of kana) that 
spell out the same thing, in so far as pronunciation goes..or you 
want to rewrite all the kanji as kana
English has a couple of variants of this problem: 'queue' and 'cue', 
which sound the same and look the same in IPA, and words like 'record' 
(noun) and 'record' (verb) or 'read' (present) and 'read' (past), which 
look the same to a regex.

Of course we use the context to understand the difference when reading, 
and there are perl modules which can parse English grammar, helping to 
identify what kind of word you're dealing with:

http://new.brians.org/Projects/Technology/Papers/LinkParser/

Having said all that, Japanese people usually put the phonetic reading 
(hiragana) alongside their given name and surname (which are usually in 
kanji). This is one situation where the kanji are contextless and it is 
necessary to know the correct pronunciation. I haven't yet seen 
software which can do this, so in most address book type apps they still 
have to enter both kinds of data manually.



Robin



Re: [MacPerl] Re: problem with Japanese text

2003-03-28 Thread Robin
On Friday, March 28, 2003, at 02:14  pm, Dan Kogai wrote:

On the other hand, counting can be tricky even for natives.  The very 
name of numbers changes depending on what you count.
Parallels for this in English can be seen in English group names: a 
gaggle of geese, a troop of monkeys, a knot of toads, a pack of dogs. 
Anyone care to suggest a good one for a group of perl programmers? A 
larry, a wall, a camel... ;-)

In Japanese the very notion of a word is often moot.
Linguistically I tend to look at ASCII words as being like kanji: a 
combination of symbols that stands for a concept. In English we use 
symbols which were intended to represent phonetic sounds (Great 
Vowel Shift, anyone?), while kanji are combinations of symbols 
representing concepts. So in a way the word 'mentality' is really a 
multibyte character, and the kanji for 'kangaekata' stands for the same 
mental idea as the word 'mentality'. What you call the container for 
that packet of data is up to you :-)

Programmatically the encoding delineates how the data can be chunked up: 
ASCII uses whitespace to separate words and a 7 bit envelope per 
character, while EUC-JP uses an 8 bit envelope and Shift-JIS uses a 
'stand on the suitcase while I lock it' system to pack the same 8 bit 
data into a 7 bit envelope (but it was developed by Microsoft). Perl 
was developed by mostly native English speakers, so in text processing 
it takes advantage of recurring patterns of 7 bit ASCII data to 
determine how the data is chunked. And chunked it must be or it is 
incoherent, yet this article:

http://www.perl.com/pub/a/2000/05/cobol.html

talks of a perler meeting a bizarre group of programmers to whom


the idea of variable-length, \n-terminated records was new and strange

implying that data is chunked into fixed length records which aren't 
separated by tokens. Japanese? No, COBOL ;-)

Robin



Re: [MacPerl] Re: problem with Japanese text

2003-03-28 Thread Jeff Lowrey
At 6:47 PM +0900 3/28/03, Robin wrote:
On Friday, March 28, 2003, at 02:14  pm, Dan Kogai wrote:

 On the other hand, counting can be tricky even for natives.  The 
very name of numbers changes depending on what you count.
parallels for this in English can be seen in English group names - a 
gaggle of geese, a troop of monkeys, a knot of toads, a pack of 
dogs. Anyone care to suggest a good one for a group of perl 
programmers ?  a larry, a wall, a camel  . ;-)
I don't think we can use 'camel' unless we're willing to admit that 
we also have not evolved to smell good.

I'd like to go with 'pathologically eclectic list', but that doesn't 
roll off the tongue very well.

We could go with something resembling a collective form of 'japh' - 
perhaps 'japher' - to be delightfully recursive.

I don't think we can use 'wall', as it is too connotative of 
'preventing things from being done', and there's always TMTOWTDI.

Ergo, I'm going to suggest 'hash'.  We're all unique, we're all 
addressed by name, and we're not guaranteed to be returned in any 
particular order.

-Jeff


Re: [MacPerl] Re: problem with Japanese text

2003-03-27 Thread Robin
On Thursday, March 27, 2003, at 01:31  am, Chris Nandor wrote:

[EMAIL PROTECTED] (Robin) wrote:
MacPerl per se historically has not been aware of locale outside of
ascii defined ones (not sure about the latest version).
Is there a reason for MacJPerl when MacPerl 5.8.x is released?
While the 5.8 perl interpreter has built-in Unicode support, displaying, 
editing or even using a perl script containing Japanese characters on 
OS 9 (even with the Japanese Language Kit) is no small task (try a 
simple regex substitution and you'll see what I mean). OS X makes it 
potentially easier, but most software is lagging far behind the promise, 
still coming from a monolingual mindset rather than a multilingual one, 
and all that this entails.
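
To make the regex remark concrete, here is a minimal sketch, assuming perl 5.8 or later with the core Encode module (so not the older MacPerl discussed above). The kanji U+8868 is encoded in Shift_JIS as the two bytes 0x95 0x5C, and 0x5C is also the ASCII backslash, so a byte-oriented match finds a backslash that is not really there. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One kanji (U+8868) as raw Shift_JIS bytes: 0x95 0x5C.
my $sjis_bytes = "\x95\x5C";

# Matching on raw bytes: the trailing byte 0x5C looks like a backslash.
my $byte_hit = ($sjis_bytes =~ /\\/) ? 1 : 0;

# Matching on decoded characters: there is no backslash in the text.
my $chars    = decode('shiftjis', $sjis_bytes);
my $char_hit = ($chars =~ /\\/) ? 1 : 0;

printf "bytes: %d, characters: %d\n", length($sjis_bytes), length($chars);
printf "backslash found in bytes: %d, in characters: %d\n", $byte_hit, $char_hit;
```

The same trap applies to substitutions: anything that rewrites the byte 0x5C can split a two-byte character in half.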

Anyway, for anyone interested in more info about the history and 
development of Japanese text encodings, here's a link to one of the 
best pages I have found so far on the web:
http://tronweb.super-nova.co.jp/characcodehist.html

Robin



Re: [MacPerl] Re: problem with Japanese text

2003-03-27 Thread Rich Morin
Character set difficulties are still a real problem, but so is dynamic
text.  Damian Conway's paper
  An Algorithmic Approach to English Pluralization
  http://www.csse.monash.edu.au/~damian/papers/HTML/Plurals.html
contains some fairly complicated tools for generating dynamically-
pluralized English.  Now generalize that tool set for multiple
languages and/or more complex variations.  Right.
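
As a rough illustration of what dynamically pluralized English involves, here is a deliberately tiny sketch. It is not the algorithm from Conway's paper, only a few suffix rules plus a hypothetical irregular-noun table, and the function names pluralize and count_phrase are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical, very incomplete table of irregular plurals.
my %irregular = (child => 'children', person => 'people', mouse => 'mice');

# Return a plural form using a few common English suffix rules.
sub pluralize {
    my ($noun) = @_;
    return $irregular{$noun} if exists $irregular{$noun};
    return $noun . 'es'      if $noun =~ /(?:s|x|ch|sh)$/;
    if ($noun =~ /[^aeiou]y$/) {           # consonant + y -> ies
        (my $plural = $noun) =~ s/y$/ies/;
        return $plural;
    }
    return $noun . 's';
}

# Build "N noun(s)" with the noun agreeing with the count.
sub count_phrase {
    my ($n, $noun) = @_;
    return "$n " . ($n == 1 ? $noun : pluralize($noun));
}

print count_phrase($_->[0], $_->[1]), "\n"
    for ([1, 'file'], [3, 'directory'], [2, 'box'], [2, 'child']);
```

Even this toy version needs an exception table, which hints at how much work a multi-language version would be.
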
In my current work, I am generating user-specific explanations for the
permission and ownership information in (roughly) an ls -al listing.
That is, the user gets three paragraphs, saying (a) what the effect of
these permissions is on the user, (b) how this was derived, and (c)
what the item's permissions are, as a whole.  For example:
  permission bits  mode  owner  group  type       name
  ---------------  ----  -----  -----  ---------  -----
  rwx  rwx  r-xt   1775  root   wheel  directory  Users

  /Users
  ------

  You have read, write, and execute (rwx) permissions for this
  directory.  This allows you to inspect, change (e.g., add to), and
  access its contents.  Because the sticky bit (t) is set, you may not
  remove other users' files.

  The user id for this node does not match your effective user id, but
  the group id matches one of your effective group ids.  Consequently,
  your access is controlled by the node's group permissions (rwx, as
  shown in the second field of the permission bits column).

  The node's owner (root) has read, write, and execute permissions
  (rwx).  Members of group wheel have read, write, and execute
  permissions (rwx).  Other users have read and execute permissions
  (r-x).  The sticky bit (t in the third field) is set; files in this
  directory may only be removed or renamed by a user if the user has
  write permission for the directory and the user is the owner of the
  file, the owner of the directory, or the super-user.  See sticky(8)
  for more information.
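
The third paragraph of that output could be generated along these lines. This is only a sketch of the general idea, not the actual generation code; the helper names english_list and describe_bits are hypothetical, and the mode 1775 is taken from the listing above.

```perl
use strict;
use warnings;

# Join one or more items into English, with a serial comma for three or more.
sub english_list {
    my @items = @_;
    return $items[0]                 if @items == 1;
    return "$items[0] and $items[1]" if @items == 2;
    my $last = pop @items;
    return join(', ', @items) . ", and $last";
}

# Describe one three-bit permission field (0 through 7).
sub describe_bits {
    my ($bits) = @_;
    my (@names, $rwx);
    $rwx = '';
    for my $flag ([4, 'r', 'read'], [2, 'w', 'write'], [1, 'x', 'execute']) {
        my ($mask, $letter, $name) = @$flag;
        if ($bits & $mask) { push @names, $name; $rwx .= $letter; }
        else               { $rwx .= '-'; }
    }
    my $what = @names ? english_list(@names) : 'no';
    return "$what permissions ($rwx)";
}

my $mode   = oct('1775');
my $owner  = describe_bits(($mode >> 6) & 7);
my $group  = describe_bits(($mode >> 3) & 7);
my $other  = describe_bits($mode & 7);
my $sticky = ($mode & oct('1000')) ? 1 : 0;

print "The node's owner has $owner.\n";
print "Members of the group have $group.\n";
print "Other users have $other.\n";
print "The sticky bit is set.\n" if $sticky;
```

Note how much English grammar is baked into english_list alone, which is the part that would not carry over to another language.
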
As I was writing generation code for the text above, the prospect of
modifying the code for multiple languages crossed my mind.  I quickly
decided, however, that this was unlikely to be my problem.  Even if I
weren't firmly monolingual, the process of generalizing this code is
going to be quite language-specific (and doing it automagically is
AI-complete :-).
-r
--
email: [EMAIL PROTECTED]; phone: +1 650-873-7841
http://www.cfcl.com/rdm    - my home page, resume, etc.
http://www.cfcl.com/Meta   - The FreeBSD Browser, Meta Project, etc.
http://www.ptf.com/dossier - Prime Time Freeware's DOSSIER series
http://www.ptf.com/tdc     - Prime Time Freeware's Darwin Collection


Re: [MacPerl] Re: problem with Japanese text

2003-03-27 Thread Joel Rees
Not sure if my comments are relevant, just feeling inclined to expose my
ignorance --

 Character set difficulties are still a real problem, but so is dynamic
 text.  Damian Conway's paper
 
An Algorithmic Approach to English Pluralization
http://www.csse.monash.edu.au/~damian/papers/HTML/Plurals.html
 
 contains some fairly complicated tools for generating dynamically-
 pluralized English.  Now generalize that tool set for multiple
 languages and/or more complex variations.  Right.

Japanese is one of those languages that has relatively few specifically
plural forms. To get the pluralizations right in Japanese, the program
would have to consult a dictionary.

 In my current work, I am generating user-specific explanations for the
 permission and ownership information in (roughly) an ls -al listing.
 That is, the user gets three paragraphs, saying (a) what the effect of
 these permissions is on the user, (b) how this was derived, and (c)
 what the item's permissions are, as a whole. 

I see the reason for the interest in automatic pluralization there.

Pluralization could probably be ignored for this purpose for Japanese,
but, if the purpose is to produce text that the technically un-inclined
can parse reasonably effortlessly, there are all sorts of other context
related issues, most of which would require not just vocabulary
dictionaries, but idiom dictionaries as well. And your locale machinery would
have to include some sensitivity to dialect issues and social status
issues, to make the generated text natural and non-offending.

Japanese is becoming more egalitarian, more homogenized, and less
colorful, so those who work on such things are aiming at a moving target.

Thinking about the recognizer side, did anyone mention that Japanese
text does not use word delimiters? Space has a somewhat different
meaning for Japanese.

-- 
Joel Rees [EMAIL PROTECTED]



Re: [MacPerl] Re: problem with Japanese text

2003-03-27 Thread Dan Kogai
On Friday, Mar 28, 2003, at 11:37 Asia/Tokyo, Joel Rees wrote:
Not sure if my comments are relevant, just feeling inclined to expose my
ignorance --
And here is mine.

Japanese is one of those languages that has relatively few specifically
plural forms. To get the pluralizations right in Japanese, the program
would have to consult a dictionary.
More exactly speaking, Japanese has no plural form in the sense of 
Indo-European languages.  Japanese totally lacks subject-verb agreement, 
so you don't have to delete the "es" in "does" when you change the 
subject from "s/he" to "they".

On the other hand, counting can be tricky even for natives.  The very 
name of numbers changes depending on what you count.  When you count 
people it goes hito-ri, futa-ri, san-nin, but when you count objects it 
goes hito-tsu, futa-tsu (or ik-ko, ni-ko), and the list goes on (I 
think this number-object agreement came from Chinese).
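
In code, this number-object agreement ends up as a lookup keyed on what is being counted as well as on the number. The sketch below uses only the romanized forms given above; the category names (person, thing, thing_ko) and the function count_word are hypothetical labels for illustration.

```perl
use strict;
use warnings;

# Counting words keyed by category, then by number.
# Only the forms mentioned in the text are listed.
my %count_word = (
    person   => { 1 => 'hito-ri',  2 => 'futa-ri', 3 => 'san-nin' },
    thing    => { 1 => 'hito-tsu', 2 => 'futa-tsu' },
    thing_ko => { 1 => 'ik-ko',    2 => 'ni-ko' },
);

# Return the counting word, or undef if this sketch does not know it.
sub count_word {
    my ($category, $n) = @_;
    return undef unless exists $count_word{$category};
    return $count_word{$category}{$n};
}

for my $category (qw(person thing thing_ko)) {
    my @words = grep { defined } map { count_word($category, $_) } 1 .. 3;
    print "$category: @words\n";
}
```

A real implementation would need a much larger table and rules for which counter applies to which noun.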

But when the number is not an issue, you can totally forget if a 
subject is singular or plural.

Pluralization could probably be ignored for this purpose for Japanese,
but, if the purpose is to produce text that the technically un-inclined
can parse reasonably effortlessly, there are all sorts of other context
related issues, most of which would require not just vocabulary
dictionaries, but idiom dictionaries as well. And your locale machinery would
have to include some sensitivity to dialect issues and social status
issues, to make the generated text natural and non-offending.
I feel Japanese is a hard language to compose in because of that, but that 
also makes Japanese easier to read, because Japanese text tends to convey 
not only what is said but also in what situation and by what kind of 
person it is said.  In English the singular nominative pronoun is nothing 
but "I", no matter how old or young you are or whether you are a boy or a 
girl (or a computer).  But in Japanese it can be "Watashi" or "Boku" or 
"Ore" or "Maro" or "Warawa" or "Sessha" or "Jibun" or "Ware"; even 
English "me" can be used.

Maybe to compensate for this complexity, Japanese grammar seems much 
simpler.  No subject-verb agreement, very few irregular verbs...  It 
is far easier to compose grammatically correct Japanese.  It gets 
darn hard once you aim for social and political correctness.

Japanese is becoming more egalitarian, more homogenized, and less
colorful, so those who work on such things are aiming at a moving target.
Less colorful, I am not sure: while newer, simpler, and more boring 
expressions become pervasive, the older and more complex expressions 
hardly die.  So in total Japanese is getting richer.  Well, 
the total richness of Japanese is, I believe, increasing but 
ironically per-capita richness might not be.  But I believe this 
phenomenon is not unique to Japanese; it could be even more prevalent in 
English.  If you don't believe me just compare the Two Bushes in the White 
House :)

Thinking about the recognizer side, did anyone mention that Japanese
text does not use word delimiters? Space has a somewhat different
meaning for Japanese.
Japanese tokenization is anything but a trivial issue.  In Japanese the 
very notion of a word is often moot.  Nevertheless, we do have good 
enough tokenizers to implement input methods and search engines.  Of 
course they are not perfect, but the Japanese are very frank about the 
lack of perfection.  After all, we don't even have a de jure standard 
Japanese to compare against.
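
To give a feel for tokenizing text without word delimiters, here is a toy greedy longest-match tokenizer. It is only a sketch under strong assumptions: the four-entry dictionary is hypothetical, the source file is assumed to be saved as UTF-8, and real tokenizers for input methods and search engines rely on large lexicons and grammatical analysis rather than this simple rule.

```perl
use strict;
use warnings;
use utf8;    # the literals below are UTF-8 source text

# Hypothetical toy dictionary of known words.
my %dict = map { $_ => 1 } ('私', 'は', '学生', 'です');

# Length of the longest dictionary entry, in characters.
my $max_len = 0;
for my $word (keys %dict) {
    $max_len = length($word) if length($word) > $max_len;
}

# Split text by always taking the longest dictionary word at the
# current position; unknown characters become one-character tokens.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    my $pos = 0;
    while ($pos < length($text)) {
        my $token = substr($text, $pos, 1);    # fallback: one character
        for (my $len = $max_len; $len > 1; $len--) {
            next if $pos + $len > length($text);
            my $candidate = substr($text, $pos, $len);
            if ($dict{$candidate}) { $token = $candidate; last; }
        }
        push @tokens, $token;
        $pos += length($token);
    }
    return @tokens;
}

my @tokens = tokenize('私は学生です');
binmode STDOUT, ':encoding(UTF-8)';
print join(' | ', @tokens), "\n";
```

Greedy matching picks the wrong split whenever a longer dictionary word happens to straddle a real boundary, which is one reason such tools are "good enough" rather than perfect.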

Dan the Man with Too Many Languages to Deal with



Re: [MacPerl] Re: problem with Japanese text

2003-03-26 Thread Robin
While I realise this is diverging slightly from the original posting, I 
think some background info is useful for dealing with Japanese text. 
There are several text encoding formats, the most widely used being 
Shift-JIS and EUC-JP. Without going into too many details, Shift-JIS 
encoding was created by Microsoft to its usual exacting (lack of) 
standards, which makes it ticklish to deal with, so in the past when 
processing Japanese text, Japanese perlers used a four step conversion 
solution:

(1) input converted from Shift_JIS to EUC_JP
(2) EUC_JP encoded data processed
(3) EUC_JP data converted back to Shift_JIS
(4) output

perl 5.8.0 has built-in Unicode support; however, the same 4 step 
process is still required for Shift-JIS data:

(1) input converted from Shift_JIS to UTF8 (unicode)
(2)  UTF8 encoded data processed
(3) UTF8 data converted back to Shift_JIS
(4) output
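
The four steps above can be sketched with the Encode module that ships with perl 5.8. This is a minimal example under assumptions: the input is a hard-coded Shift_JIS byte string rather than a real file, and the "processing" is a single illustrative substitution.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# (1) input: three kanji as raw Shift_JIS bytes, decoded to characters
my $sjis_in = "\x93\xFA\x96\x7B\x8C\xEA";
my $text    = decode('shiftjis', $sjis_in);

# (2) process the decoded text: replace U+8A9E with U+4EBA
$text =~ s/\x{8A9E}/\x{4EBA}/;

# (3) convert the result back to Shift_JIS bytes
my $sjis_out = encode('shiftjis', $text);

# (4) output: shown here as hex so it is readable on any terminal
printf "%d characters, output bytes: %s\n",
    length($text), join(' ', map { sprintf '%02X', ord } split //, $sjis_out);
```

Note that the second byte of the middle character is 0x7B, an ASCII '{', which is exactly the kind of byte that trips up processing done directly on the Shift_JIS data.
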
MacPerl per se historically has not been aware of locales outside of 
ASCII-defined ones (not sure about the latest version). Which is why, of 
course, there is MacJPerl.

http://world.std.com/~habilis/macjperl



HTH

Robin





On Wednesday, March 19, 2003, at 05:58  am, Scott R. Godin wrote:

Jon Reinsch wrote:

I use a simple MacPerl program to archive my email: I save each message to
a text file, then run the program to append the messages to a text file in
date/time order. Omitting some details, the heart of the program is just:

if (open (inhandle,$infilename))
{
while(<inhandle>)
{ print $outhandle $_; }
}

My problem is that some of my email contains Japanese text. I'm running OS
9.2.1 with the Japanese Language Kit installed. But when Japanese text
goes through the program it comes out as garbage like
bvwirQ[^. Obviously the encoding is being lost,
but I don't have the slightest idea how to fix this. Is there a module out
there that would provide a simple answer to this problem? Maybe it's just
a fantasy, but I'm hoping for something simple like
print $outhandle convertJapaneseText($_);


This might seem very simple but have you looked into

use locale

at all ? try looking at perldoc perllocale for some informative text.

dunno if this will help but it's where my instincts pointed me...




Re: [MacPerl] Re: problem with Japanese text

2003-03-26 Thread Chris Nandor
In article [EMAIL PROTECTED],
 [EMAIL PROTECTED] (Robin) wrote:

 MacPerl per se historically has not been aware of locale outside of 
 ascii defined ones (not sure about the latest version). Which is why of 
 course there is MacJPerl.
 
 http://world.std.com/~habilis/macjperl

Is there a reason for MacJPerl when MacPerl 5.8.x is released?

-- 
Chris Nandor  [EMAIL PROTECTED]  http://pudge.net/
Open Source Development Network  [EMAIL PROTECTED]  http://osdn.com/


Re: [MacPerl] Re: problem with Japanese text

2003-03-26 Thread Dan Kogai
On Thursday, Mar 27, 2003, at 01:31 Asia/Tokyo, Chris Nandor wrote:
Is there a reason for MacJPerl when MacPerl 5.8.x is released?
At first I thought none, but on second thought: the built-in text editor 
may not support multibyte characters.  But even that is moot, since 
there are many text editors which can be used with MacPerl, some of which 
are even free (I use mi when I have to type in Japanese, 
http://www.asahi-net.or.jp/~gf6d-kmym/, which is free, perl-savvy, and 
supports all major Japanese encodings including UTF-8).

I wonder how many of you have ever tried 5.8 features such as Encode 
and PerlIO in MacPerl (besides make test, of course).  I don't even 
launch Classic these days...
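
For anyone who wants something concrete to try, here is a small sketch of PerlIO encoding layers in perl 5.8: it reads a Shift_JIS file and writes the same text as UTF-8. The temporary file names and the sample bytes are assumptions made only so the example is self-contained, and it has not been tried under MacPerl.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my ($in_path, $out_path) = ("$dir/in.txt", "$dir/out.txt");

# Create a sample input file containing raw Shift_JIS bytes.
open my $raw, '>:raw', $in_path or die "Cannot write $in_path: $!";
print {$raw} "\x93\xFA\x96\x7B\x8C\xEA\n";
close $raw or die "Cannot close $in_path: $!";

# Decode on input and encode on output through PerlIO layers.
open my $in,  '<:encoding(shiftjis)', $in_path  or die "Cannot read $in_path: $!";
open my $out, '>:encoding(UTF-8)',    $out_path or die "Cannot write $out_path: $!";
my $char_count = 0;
while (my $line = <$in>) {
    (my $copy = $line) =~ s/\n\z//;
    $char_count += length($copy);    # counts characters, not bytes
    print {$out} $line;
}
close $in;
close $out or die "Cannot close $out_path: $!";

# Read the result back as raw bytes to see what was written.
open my $check, '<:raw', $out_path or die "Cannot read $out_path: $!";
my $utf8_bytes = do { local $/; <$check> };
close $check;

printf "%d characters read, %d bytes written as UTF-8\n",
    $char_count, length($utf8_bytes);
```

With the layers in place, the loop body sees ordinary character strings, so regexes and length behave per character.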

Dan the ex-user of MacOS



Re: [MacPerl] Re: problem with Japanese text

2003-03-26 Thread Chris Nandor
At 13:52 +0900 2003.03.27, Dan Kogai wrote:
I wonder how many of you have ever tried 5.8 features such as Encode
and PerlIO in MacPerl (besides make test, of course).  I don't even
launch Classic these days...

Give me some examples to run and I can give it a shot.  :)

My greatest reason to run Classic is for Mac::Glue programs.  I have
everything I need for Mac::Glue ported, though, so I expect that to change
soon after I get back from vacation (I'll get to work on Mac::Glue after a
new release of Mac::Carbon, plus Mac::Apps::Launch and
Mac::AppleEvents::Simple, and probably a Bundle::Mac::Carbon ...).  But I
still plan on releasing MacPerl 5.8.x, which is mostly all there and
working now (I did a test build of the latest code a week or so ago).

-- 
Chris Nandor  [EMAIL PROTECTED]  http://pudge.net/
Open Source Development Network  [EMAIL PROTECTED]  http://osdn.com/