[OT] Re: [MacPerl] Re: problem with Japanese text
> ... In English the singular nominative pronoun is nothing but I, no matter how old or young you are or whether you are a boy or a girl (or a computer). But in Japanese it can be Watashi or Boku or Ore

Boku. Hmm. What would it mean if an educational program intended for kindergarten use said Boku no shippai desu ka?

> ... Well, the total richness of Japanese is, I believe, increasing, but ironically per-capita richness might not be. But I believe this phenomenon is not unique to Japanese; it could be even more prevalent in English. If you don't believe me just compare the Two Bushes in White House :)

Heh. West Texas is a country and a language to itself.

--
Joel Rees [EMAIL PROTECTED]
(Amusing myself by imagining the elder Bush saying Kwansai and the younger saying Kansai. Not sure I could imagine the old guy saying tefu-tefu, though.)
Re: [MacPerl] Re: problem with Japanese text
On Saturday, March 29, 2003, at 10:19 am, Jeff Lowrey wrote:
> I don't think we can use 'camel' unless we're willing to admit that we also have not evolved to smell good.

and spit when we're unhappy :-)

On Friday, March 28, 2003, at 02:00 am, Nicholas G. Thornton wrote:
> in Japanese you can have different kanji (or groups of kana) that spell out the same thing, in so far as pronunciation goes..or you want to rewrite all the kanji as kana

English has a couple of variants of this problem - 'queue' and 'cue', which sound the same and look the same in IPA, and words like 'record' (noun) and 'record' (verb) or 'read' (present) and 'read' (past), which look the same to a regex. Of course we use the context to understand the difference when reading, and there are perl modules which can parse English grammar, helping to identify what kind of word you're dealing with:
http://new.brians.org/Projects/Technology/Papers/LinkParser/

Having said all that, Japanese people usually put the phonetic pronunciation (hiragana) alongside their name and surname (which are usually in kanji). This is one situation where the kanji are contextless and it is necessary to know the correct pronunciation. I haven't yet seen software which can do this, so in most address book type apps they still have to enter both kinds of data manually.

Robin
Re: [MacPerl] Re: problem with Japanese text
On Friday, March 28, 2003, at 02:14 pm, Dan Kogai wrote:
> On the other hand, counting can be tricky even for natives. The very name of numbers changes depending on what you count.

Parallels for this in English can be seen in English group names - a gaggle of geese, a troop of monkeys, a knot of toads, a pack of dogs. Anyone care to suggest a good one for a group of perl programmers? a larry, a wall, a camel. ;-)

> In Japanese the very notion of a word is often moot.

Linguistically I tend to look at ASCII words as being like Kanji - a combination of symbols to stand for a concept - in English we use symbols which were intended to represent the phonetic sound (Great Vowel Shift anyone?), while Kanji are combinations of symbols representing concepts. So in a way the word 'mentality' is really a multibyte character, and the kanji for 'kangaikata' stands for the same mental idea as the word 'mentality' - what you call the container for that packet of data is up to you :-)

Programmatically the encoding delineates how the data can be chunked up - ASCII uses whitespace to separate words and a 7 bit envelope per character, while EUC-JP uses an 8 bit envelope and Shift-JIS uses a 'stand on the suitcase while I lock it' system to pack the same 8 bit data into a 7 bit envelope (but it was developed by Microsoft).

Perl was developed by mostly native English speakers, so in text processing it takes advantage of recurring patterns of 7 bit ASCII data to determine how the data is chunked. And chunked it must be, or it is incoherent. Yet this article:
http://www.perl.com/pub/a/2000/05/cobol.html
talks of a perler meeting a bizarre group of programmers to whom the idea of variable-length, \n-terminated records was new and strange, implying that their data is chunked into fixed length records which aren't separated by tokens. Japanese? no, COBOL ;-)

Robin
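As a rough check on the envelope sizes described above, the byte layout of each encoding can be inspected with the Encode module that ships with perl 5.8. This is only a sketch: the sample word and the helper names `hex_bytes` and `is_seven_bit` are made up for illustration. Run on a stock perl, it should show EUC-JP and Shift_JIS both using 8 bit bytes, with ISO-2022-JP (not mentioned in the thread) being the encoding that stays inside a 7 bit envelope.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "nihongo" (the Japanese language), written with code point escapes so that
# this source file itself stays plain ASCII.
my $text = "\x{65E5}\x{672C}\x{8A9E}";

# Return the encoded bytes as space-separated hex pairs.
sub hex_bytes {
    my ($encoding, $string) = @_;
    my $bytes = encode($encoding, $string);
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# True if no byte of the encoded form has its high bit set.
sub is_seven_bit {
    my ($encoding, $string) = @_;
    return encode($encoding, $string) !~ /[\x80-\xFF]/;
}

for my $encoding (qw(euc-jp shiftjis iso-2022-jp)) {
    printf "%-12s %-6s %s\n", $encoding,
        is_seven_bit($encoding, $text) ? '7-bit' : '8-bit',
        hex_bytes($encoding, $text);
}
```

Note that a Shift_JIS trail byte can fall in the ASCII range (0x7B here), which is one reason byte-oriented text processing on Shift_JIS data is ticklish.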
Re: [MacPerl] Re: problem with Japanese text
At 6:47 PM +0900 3/28/03, Robin wrote:
> On Friday, March 28, 2003, at 02:14 pm, Dan Kogai wrote:
>> On the other hand, counting can be tricky even for natives. The very name of numbers changes depending on what you count.
>
> parallels for this in English can be seen in English group names - a gaggle of geese, a troop of monkeys, a knot of toads, a pack of dogs. Anyone care to suggest a good one for a group of perl programmers ? a larry, a wall, a camel . ;-)

I don't think we can use 'camel' unless we're willing to admit that we also have not evolved to smell good.

I'd like to go with 'pathologically eclectic list', but that doesn't fall off the tongue very well.

We could go with something resembling a collective form of 'japh' - perhaps 'japher' - to be delightfully recursive.

I don't think we can use 'wall', as it is too connotative of 'preventing things from being done', and there's always TMTOWTDI.

Ergo, I'm going to suggest 'hash'. We're all unique, we're all addressed by name, and we're not guaranteed to be returned in any particular order.

-Jeff
Re: [MacPerl] Re: problem with Japanese text
On Thursday, March 27, 2003, at 01:31 am, Chris Nandor wrote:
> [EMAIL PROTECTED] (Robin) wrote:
>> MacPerl per se historically has not been aware of locale outside of ascii defined ones (not sure about the latest version).
>
> Is there a reason for MacJPerl when MacPerl 5.8.x is released?

While the 5.8 perl interpreter has built in unicode support, displaying, editing or even using a perl script containing Japanese characters on OS9 (even with the Japanese Language Kit) is no small task (try a simple regex substitution and you'll see what I mean). OSX makes it potentially easier, but most software is lagging far behind the promise, still coming from a monolingual mindset rather than a multilingual one, with all that this entails.

Anyway, for anyone interested in more info about the history and development of Japanese text encodings, here's a link to one of the best pages I have found so far on the web:
http://tronweb.super-nova.co.jp/characcodehist.html

Robin
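To make the "try a simple regex substitution" remark concrete, here is a minimal sketch using perl 5.8's Encode module. It assumes the data arrives as raw Shift_JIS bytes; the sample word and the two helper names are invented for the example. The second byte of the kanji for "hon" in Shift_JIS happens to be 0x7B, the ASCII code for "{", so a substitution applied to the raw bytes damages the kanji, while the same substitution applied to decoded characters leaves it alone.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "nihon" (Japan) as characters, then as raw Shift_JIS bytes: 93 FA 96 7B.
my $chars = "\x{65E5}\x{672C}";
my $sjis  = encode('shiftjis', $chars);

# Byte-level substitution: hits the trail byte of a kanji by accident.
sub braces_to_parens_bytes {
    my ($bytes) = @_;
    $bytes =~ s/\{/(/g;
    return $bytes;
}

# Character-level substitution: decode, substitute, encode again.
sub braces_to_parens_chars {
    my ($bytes) = @_;
    my $string = decode('shiftjis', $bytes);
    $string =~ s/\{/(/g;
    return encode('shiftjis', $string);
}

printf "byte-level: %s\n",
    braces_to_parens_bytes($sjis) eq $sjis ? 'intact' : 'corrupted';
printf "char-level: %s\n",
    braces_to_parens_chars($sjis) eq $sjis ? 'intact' : 'corrupted';
```

Whether MacPerl on OS 9 behaves the same way as a stock perl 5.8 here is not something this sketch can confirm.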
Re: [MacPerl] Re: problem with Japanese text
> Character set difficulties are still a real problem, but so is dynamic text. Damian Conway's paper An Algorithmic Approach to English Pluralization http://www.csse.monash.edu.au/~damian/papers/HTML/Plurals.html contains some fairly complicated tools for generating dynamically-pluralized English. Now generalize that tool set for multiple languages and/or more complex variations.

Right. In my current work, I am generating user-specific explanations for the permission and ownership information in (roughly) an ls -al listing. That is, the user gets three paragraphs, saying (a) what the effect of these permissions is on the user, (b) how this was derived, and (c) what the item's permissions are, as a whole. For example:

  permission bits   mode   owner   group   type        name
  ---------------   ----   -----   -----   ---------   -----
  rwx rwx r-xt      1775   root    wheel   directory   Users

  /Users
  ------

  You have read, write, and execute (rwx) permissions for this directory. This allows you to inspect, change (e.g., add to), and access its contents. Because the sticky bit (t) is set, you may not remove other users' files.

  The user id for this node does not match your effective user id, but the group id matches one of your effective group ids. Consequently, your access is controlled by the node's group permissions (rwx, as shown in the second field of the permission bits column).

  The node's owner (root) has read, write, and execute permissions (rwx). Members of group wheel have read, write, and execute permissions (rwx). Other users have read and execute permissions (r-x). The sticky bit (t in the third field) is set; files in this directory may only be removed or renamed by a user if the user has write permission for the directory and the user is the owner of the file, the owner of the directory, or the super-user. See sticky(8) for more information.

As I was writing generation code for the text above, the prospect of modifying the code for multiple languages crossed my mind. I quickly decided, however, that this was unlikely to be my problem. Even if I weren't firmly monolingual, the process of generalizing this code is going to be quite language-specific (and doing it automagically is AI-complete :-).

-r
--
email: [EMAIL PROTECTED]; phone: +1 650-873-7841
http://www.cfcl.com/rdm     - my home page, resume, etc.
http://www.cfcl.com/Meta    - The FreeBSD Browser, Meta Project, etc.
http://www.ptf.com/dossier  - Prime Time Freeware's DOSSIER series
http://www.ptf.com/tdc      - Prime Time Freeware's Darwin Collection
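For readers wondering what generation code of this kind might look like, here is a small sketch for paragraph (c) only. It is not the code described in the message above: the function names, the wording templates and the hand-rolled singular/plural switch are all assumptions made for illustration, and every English-specific choice (word order, the serial comma, "has" versus "have") is hard-wired, which is exactly what makes generalizing to other languages difficult.

```perl
use strict;
use warnings;

# English names for the three permission bits, in rwx order.
my @BIT_NAMES = ([4, 'read'], [2, 'write'], [1, 'execute']);

# Join a list as "a", "a and b", or "a, b, and c".
sub join_list {
    my @items = @_;
    return $items[0] if @items == 1;
    return "$items[0] and $items[1]" if @items == 2;
    my $last = pop @items;
    return join(', ', @items) . ", and $last";
}

# Describe one rwx triplet (a value 0..7) for a given subject and verb.
sub describe_triplet {
    my ($subject, $verb, $bits) = @_;
    my @names = map { $_->[1] } grep { $bits & $_->[0] } @BIT_NAMES;
    return "$subject $verb no permissions." unless @names;
    my $noun = @names == 1 ? 'permission' : 'permissions';
    return sprintf '%s %s %s %s.', $subject, $verb, join_list(@names), $noun;
}

# Produce the "item as a whole" paragraph for a numeric mode such as 01775.
sub describe_mode {
    my ($mode, $owner, $group) = @_;
    my @sentences = (
        describe_triplet("The node's owner ($owner)", 'has',  ($mode >> 6) & 7),
        describe_triplet("Members of group $group",   'have', ($mode >> 3) & 7),
        describe_triplet('Other users',               'have', $mode & 7),
    );
    push @sentences, 'The sticky bit is set.' if $mode & 01000;
    return join ' ', @sentences;
}

print describe_mode(01775, 'root', 'wheel'), "\n";
```

A real implementation would more likely lean on Damian Conway's pluralization tools than on the two-way `permission`/`permissions` switch used here.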
Re: [MacPerl] Re: problem with Japanese text
Not sure if my comments are relevant, just feeling inclined to expose my ignorance --

>> Character set difficulties are still a real problem, but so is dynamic text. Damian Conway's paper An Algorithmic Approach to English Pluralization http://www.csse.monash.edu.au/~damian/papers/HTML/Plurals.html contains some fairly complicated tools for generating dynamically-pluralized English. Now generalize that tool set for multiple languages and/or more complex variations.
>
> Right.

Japanese is one of those languages that has relatively few specifically plural forms. To get the pluralizations right in Japanese, the program would have to consult a dictionary.

> In my current work, I am generating user-specific explanations for the permission and ownership information in (roughly) an ls -al listing. That is, the user gets three paragraphs, saying (a) what the effect of these permissions is on the user, (b) how this was derived, and (c) what the item's permissions are, as a whole.

I see the reason for the interest in automatic pluralization there.

Pluralization could probably be ignored for this purpose for Japanese, but, if the purpose is to produce text that the technically un-inclined can parse reasonably effortlessly, there are all sorts of other context related issues, most of which would require not just vocabulary dictionaries, but idiom dictionaries as well. And your locale machinery would have to include some sensitivity to dialect issues and social status issues, to make the generated text natural and non-offending.

Japanese is becoming more egalitarian, more homogenized, and less colorful, so those who work on such things are aiming at a moving target.

Thinking about the recognizer side, did anyone mention that Japanese text does not use word delimiters? Space has a somewhat different meaning for Japanese.

--
Joel Rees [EMAIL PROTECTED]
Re: [MacPerl] Re: problem with Japanese text
On Friday, Mar 28, 2003, at 11:37 Asia/Tokyo, Joel Rees wrote:
> Not sure if my comments are relevant, just feeling inclined to expose my ignorance --

And here is mine.

> Japanese is one of those languages that has relatively few specifically plural forms. To get the pluralizations right in Japanese, the program would have to consult a dictionary.

More exactly speaking, Japanese has no plural form in the sense of Indo-European languages. Japanese totally lacks subject-verb agreement, so you don't have to delete the es in does when you change the subject from s/he to they.

On the other hand, counting can be tricky even for natives. The very name of numbers changes depending on what you count. When you count people it goes hito-ri, futa-ri, san-nin, but when you count objects it goes hito-tsu, futa-tsu (or ik-ko, ni-ko,) and the list goes on (I think this number-object agreement came from Chinese). But when the number is not an issue, you can totally forget whether a subject is singular or plural.

> Pluralization could probably be ignored for this purpose for Japanese, but, if the purpose is to produce text that the technically un-inclined can parse reasonably effortlessly, there are all sorts of other context related issues, most of which would require not just vocabulary dictionaries, but idiom dictionaries as well. And your locale machinery would have to include some sensitivity to dialect issues and social status issues, to make the generated text natural and non-offending.

I feel Japanese is a hard language to compose in because of that, but that also makes Japanese easier to read, because Japanese tends to convey not only what is said but also in what situation and by what kind of person it is said. In English the singular nominative pronoun is nothing but I, no matter how old or young you are or whether you are a boy or a girl (or a computer). But in Japanese it can be Watashi or Boku or Ore or Maro or Warawa or Sessha or Jibun or Ware; even English me can be used.

Maybe to compensate for this complexity, Japanese grammar seems much simpler. No subject-verb agreement, very few irregular verbs. It is far easier to compose grammatically correct Japanese. It gets darn hard once you aim for social and political correctness.

> Japanese is becoming more egalitarian, more homogenized, and less colorful, so those who work on such things are aiming at a moving target.

Less colorful I am not sure, because while the newer, simpler, and more boring expressions are pervasive, the old and more complex expressions hardly die. So in total Japanese is getting richer.

Well, the total richness of Japanese is, I believe, increasing, but ironically per-capita richness might not be. But I believe this phenomenon is not unique to Japanese; it could be even more prevalent in English. If you don't believe me just compare the Two Bushes in White House :)

> Thinking about the recognizer side, did anyone mention that Japanese text does not use word delimiters? Space has a somewhat different meaning for Japanese.

Japanese tokenization is anything but a trivial issue. In Japanese the very notion of a word is often moot. Nevertheless, we do have good enough tokenizers to implement input methods and search engines. Of course they are not perfect, but the Japanese are very frank about the lack of perfection. After all we don't even have a de jure standard Japanese to compare against.

Dan the Man with Too Many Languages to Deal with
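The counting point above suggests why text generation for Japanese ends up dictionary-driven rather than rule-driven. A toy sketch of such a lookup follows; the table holds only the romanized forms for one to three, the category names `people`, `things` and `pieces` are labels invented here, and `mit-tsu` and `san-ko` are filled in beyond the forms quoted in the message.

```perl
use strict;
use warnings;

# Romanized number words for 1..3 under three counters. This is a tiny
# illustrative sample, not a complete description of Japanese counters.
my %COUNT_WORDS = (
    people => ['hito-ri',  'futa-ri',  'san-nin'],
    things => ['hito-tsu', 'futa-tsu', 'mit-tsu'],
    pieces => ['ik-ko',    'ni-ko',    'san-ko'],
);

# Look up the word used when counting $n items of the given kind.
sub count_word {
    my ($kind, $n) = @_;
    my $words = $COUNT_WORDS{$kind}
        or die "no counter known for '$kind'\n";
    die "only 1..3 are in this sample table\n" if $n < 1 || $n > @$words;
    return $words->[$n - 1];
}

print join(', ', map { count_word('people', $_) } 1 .. 3), "\n";
print join(', ', map { count_word('things', $_) } 1 .. 3), "\n";
```

The lookup key is the kind of thing being counted, not the number alone, so no single rule like "append s" can stand in for the table.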
Re: [MacPerl] Re: problem with Japanese text
While I realise this is diverging slightly from the original posting, I think some background info is useful for dealing with Japanese text.

There are several text encoding formats - the most widely used being Shift_JIS and EUC-JP. Without going into too many details, Shift_JIS encoding was created by Microsoft to its usual exacting (lack of) standards, which makes it ticklish to deal with, so in the past when processing Japanese text, Japanese perlers used a four step conversion solution:

(1) input converted from Shift_JIS to EUC_JP
(2) EUC_JP encoded data processed
(3) EUC_JP data converted back to Shift_JIS
(4) output

perl 5.8.0 has built in Unicode support, however the same 4 step process is still required for Shift_JIS data:

(1) input converted from Shift_JIS to UTF8 (unicode)
(2) UTF8 encoded data processed
(3) UTF8 data converted back to Shift_JIS
(4) output

MacPerl per se historically has not been aware of locale outside of ascii defined ones (not sure about the latest version). Which is why of course there is MacJPerl.
http://world.std.com/~habilis/macjperl

HTH

Robin

On Wednesday, March 19, 2003, at 05:58 am, Scott R. Godin wrote:
> Jon Reinsch wrote:
>> I use a simple MacPerl program to archive my email: I save each message to a text file, then run the program to append the messages to a text file in date/time order. Omitting some details, the heart of the program is just:
>>
>>     if (open(inhandle, $infilename)) {
>>         while (<inhandle>) {
>>             print $outhandle $_;
>>         }
>>     }
>>
>> My problem is that some of my email contains Japanese text. I'm running OS 9.2.1 with the Japanese Language Kit installed. But when Japanese text goes through the program it comes out as garbage like bvwirQ[^. Obviously the encoding is being lost, but I don't have the slightest idea how to fix this. Is there a module out there that would provide a simple answer to this problem? Maybe it's just a fantasy, but I'm hoping for something simple like
>>
>>     print $outhandle convertJapaneseText($_);
>
> This might seem very simple but have you looked into use locale at all? Try looking at perldoc perllocale for some informative text. Dunno if this will help but it's where my instincts pointed me...
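The four step process Robin describes at the top of this message can be sketched with perl 5.8's Encode module roughly as follows. The wrapper name `process_sjis` and the callback interface are inventions for this example, and nothing here has been checked against MacPerl or MacJPerl; it only shows the decode, process, encode shape on a stock perl.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Apply $process (a code ref that works on decoded characters) to Shift_JIS
# bytes and return Shift_JIS bytes again: steps (1) to (3) of the list.
sub process_sjis {
    my ($sjis_bytes, $process) = @_;
    my $string = decode('shiftjis', $sjis_bytes);   # (1) Shift_JIS -> characters
    $string = $process->($string);                  # (2) work on characters
    return encode('shiftjis', $string);             # (3) characters -> Shift_JIS
}

# Example processing step: append the character count to the text.
my $input  = encode('shiftjis', "\x{65E5}\x{672C}\x{8A9E}");
my $output = process_sjis($input, sub { my ($s) = @_; return $s . length($s) });

# (4) output; here only the sizes are reported.
printf "%d bytes in, %d bytes out\n", length($input), length($output);
```

Keeping the conversion at the edges means the processing step sees one character per kanji, so `length`, `substr` and regexes count what a reader would count.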
Re: [MacPerl] Re: problem with Japanese text
In article [EMAIL PROTECTED], [EMAIL PROTECTED] (Robin) wrote:
> MacPerl per se historically has not been aware of locale outside of ascii defined ones (not sure about the latest version). Which is why of course there is MacJPerl.
> http://world.std.com/~habilis/macjperl

Is there a reason for MacJPerl when MacPerl 5.8.x is released?

--
Chris Nandor [EMAIL PROTECTED] http://pudge.net/
Open Source Development Network [EMAIL PROTECTED] http://osdn.com/
Re: [MacPerl] Re: problem with Japanese text
On Thursday, Mar 27, 2003, at 01:31 Asia/Tokyo, Chris Nandor wrote:
> Is there a reason for MacJPerl when MacPerl 5.8.x is released?

My first thought was none, but on second thought: the built-in text editor may not support multibyte characters. But even that is moot since there are many text editors which can use MacPerl, some of which are even free (I use mi when I have to type in Japanese: http://www.asahi-net.or.jp/~gf6d-kmym/, free, perl-savvy, and supports all major Japanese encodings including UTF-8).

I wonder how many of you have ever tried 5.8 features such as Encode and PerlIO in MacPerl (besides make test, of course). I don't even launch Classic these days...

Dan the ex-user of MacOS
Re: [MacPerl] Re: problem with Japanese text
At 13:52 +0900 2003.03.27, Dan Kogai wrote:
> I wonder how many of you have ever tried 5.8 features such as Encode and PerlIO in MacPerl (besides make test, of course). I don't even launch Classic these days...

Give me some examples to run and I can give it a shot. :)

My greatest reason to run Classic is for Mac::Glue programs. I have everything I need for Mac::Glue ported, though, so I expect that to change soon after I get back from vacation (I'll get to work on Mac::Glue after a new release of Mac::Carbon, plus Mac::Apps::Launch and Mac::AppleEvents::Simple, and probably a Bundle::Mac::Carbon ...).

But I still plan on releasing MacPerl 5.8.x, which is mostly all there and working now (I did a test build of the latest code a week or so ago).

--
Chris Nandor [EMAIL PROTECTED] http://pudge.net/
Open Source Development Network [EMAIL PROTECTED] http://osdn.com/
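One candidate for the kind of example requested above is a file copy that uses PerlIO encoding layers, so that decoding and encoding happen inside `open` rather than in explicit Encode calls. This is a sketch for a stock perl 5.8 or later; the function name `copy_sjis_to_utf8` and the file names are hypothetical, and whether the `:encoding` layer behaves the same under MacPerl is exactly what would need testing.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempdir);

# Copy $in_path to $out_path, decoding Shift_JIS on input and writing UTF-8.
# Returns the number of lines copied.
sub copy_sjis_to_utf8 {
    my ($in_path, $out_path) = @_;
    open my $in,  '<:encoding(shiftjis)', $in_path  or die "open $in_path: $!";
    open my $out, '>:encoding(UTF-8)',    $out_path or die "open $out_path: $!";
    my $lines = 0;
    while (my $line = <$in>) {
        print {$out} $line;
        $lines++;
    }
    close $out or die "close $out_path: $!";
    close $in;
    return $lines;
}

# Build a small Shift_JIS sample file so the sketch runs on its own.
my $dir    = tempdir(CLEANUP => 1);
my $sample = "$dir/sample-sjis.txt";   # hypothetical file names
my $result = "$dir/sample-utf8.txt";
open my $fh, '>:raw', $sample or die "open $sample: $!";
print {$fh} encode('shiftjis', "\x{65E5}\x{672C}\x{8A9E}\n");
close $fh or die "close $sample: $!";

printf "copied %d line(s)\n", copy_sjis_to_utf8($sample, $result);
```

Swapping the output layer for `>:encoding(shiftjis)` would turn this into a plain Shift_JIS to Shift_JIS copy whose loop body still sees decoded characters.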