Re: extracting words
On Sat, 10 Feb 2001, Edward Cherlin wrote:

> At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote:
> > I'm writing a C program called Blacklist. Its purpose is to accept a (Unicode) string, extract words from it, hash the extracted words according to a hashing algorithm, and see if each word is in the blacklist hash table. This is all very straightforward, but the problem is the extracting of words from this string. How do I determine what a word is in Japanese or Korean or whatever other language? { a space ? }
>
> No. Chinese and Japanese almost never have spaces between words, and they are not required in Korean.

I'm afraid this is a little misleading. In modern Korean orthography, every word is delimited by a space (Korean Orthographic Rules, article 2: 1988-01-19, Ministry of Education, ROK). The exception to that rule is that particles (josa) have to follow the preceding word without a space (ibid., article 41). There are also some minor exceptions (ibid., articles 43, 47, and 49), so you might say you're correct in that spaces are *not required* in Korean, but the principle of delimiting every pair of words with a space is still there.

> Yes, we have had it for a long time; no, nobody has solved it entirely; and yes, this approach is wrong. Breaking a string into words may require a thorough understanding of the vocabulary and grammar of the language, and even that may not be enough.

I absolutely agree with you on this point.

> An example from Korean: Abeojigabangeisseoyo. Should this be segmented as Abeojiga bange isseoyo (Father is in the room), or as Abeoji gabange isseoyo (Father is in the bag)?

I don't think this is such a good example of the enormous difficulties and complexities involved in extracting words (which I agree exist), because the original question was how to extract words out of supposedly orthographically correct sentences.

Your example (Abeojigabangeisseoyo) clearly violates the (modern) orthographic rule by gluing together all the words without spaces (nobody would write that way). One of the reasons that spaces are used to separate words in Korean writing is precisely to break this kind of degeneracy (as taught in first-grade Korean class). It would have been more appropriate if you had come up with an example from Japanese or Chinese, where spaces are rarely used to separate words.

Jungshik Shin
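The ambiguity under discussion (one unspaced string, two valid word divisions) can be sketched with a toy dictionary-driven segmenter. The romanized dictionary below contains only the words from the example, and `all_segmentations` is a hypothetical helper for illustration, not part of any real segmentation library:

```python
def all_segmentations(text, words):
    """Return every way to split `text` into a sequence of dictionary words."""
    if not text:
        return [[]]
    results = []
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        if prefix in words:
            # Segment the remainder recursively and prepend the matched word.
            for rest in all_segmentations(text[i:], words):
                results.append([prefix] + rest)
    return results

# Toy romanized dictionary containing only the words from the example above.
WORDS = {"abeoji", "abeojiga", "gabange", "bange", "isseoyo"}
print(all_segmentations("abeojigabangeisseoyo", WORDS))
```

With no spaces, both readings survive; written with the spaces modern Korean orthography requires, only one segmentation is possible.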
Re: extracting words
> ... in first-grade Korean class). It would have been more appropriate if you had come up with an example from Japanese or Chinese, where spaces are rarely used to separate words.

From Japanese, how about:

kokodehakimonowonuidekudasai

This could be:

koko de hakimono wo nuide kudasai (take your shoes off here)

or:

koko deha kimono wo nuide kudasai (take your clothes off here; in this case "ha" is pronounced "wa")

Actually, this example was given at an AAMT meeting last year to show that writing words in kanji + kana, rather than just kana, makes it easier for MT software to break down strings into words accurately and therefore produce better translations.

Best wishes,
Jonathan Lewis
Tokyo Denki University
Re: Unicode collation algorithm - interpretation]
Jim,

Thanks for the reply, which Hugh had indeed alerted me to expect. See interpolations below. I particularly want to respond to this statement that you made:

> It has been suggested that SQL collation name should instead identify both collation element table and maximum level. I believe that the "maximum level" is built into the collation element table inseparably.

I think you misunderstand me. The "maximum level" I was referring to is the one mentioned in UTR #10, section 4, "Main algorithm", 4.3 "Form a sort key for each string", para 2, which reads:

> An implementation may allow the maximum level to be set to a smaller level than the available levels in the collation element array. For example, if the maximum level is set to 2, then level 3 and higher weights (including the normalized Unicode string) are not appended to the sort key. Thus any differences at levels 3 and higher will be ignored, leveling any such differences in string comparison.

There is, of course, an upper limit to the number of levels provided for in the Collation Element Table, and 14651 requires that "The number of levels that the process supports ... shall be at least three." So I think it's fair to say we are discussing whether, and if so how, these levels should be made visible to the SQL user.

We can safely assume that at least some users will require sometimes exact, sometimes inexact comparisons (at least for pseudo-equality, and to a lesser extent for sorting). We can also safely assume that users will wish to get the performance benefit of some preprocessing. It is clearly possible to preprocess as far as the end of step 2 of the Unicode collation algorithm without committing to a level. I understand you to say that several implementors have concluded that this level of preprocessing is not cost-effective, in comparison to going all the way to the sort key. I am in no position to dispute that conclusion.
> I monitored the email discussions rather a lot during the development of ISO 14651, and it seemed awfully likely as a result of those discussions (plus conversations that I've had with implementors in at least 3 companies) that a specific collation would be built by constructing the collation element table (as you mentioned in your note) and then "compiling" it into the code that actually does the collation. That code would *inherently* have built into it the levels that were specified in the collation table that was constructed. It's not like the code can pick and choose which of the levels it wishes to honor.

I'm afraid I don't understand what this is saying. I've seen both the 14651 "Common Template Table" and the Unicode "Default Unicode Collation Element Table", and assume them to be equivalent, but have not verified that they are. Neither of them looks particularly "compilable" to me but, in view of your quotes, I'm not at all clear what you mean by '"compiling" it into the code that actually does the collation.' I'm also unclear what an SQL implementor is likely to supply as "a collation", though I imagine (only!) that it might be just the part of the CTT/CET appropriate to the script used by a particular culture, with appropriate tailoring. But I have no reason to expect the executable ("compiled"?) code that implements the algorithm to vary depending on the collation, or on the level (case-blind &c.) specified by the user for a particular comparison. I find it easier to imagine differences in code depending on whether a COLLATE clause is in a column definition or in, say, WHERE C1 = C2 COLLATE collation_name.

> Of course, if you really want to specify an SQL collation name that somehow identifies 2 or 3 or 4 (or more) collations built in conformance with ISO 14651 and then use an additional parameter to choose between them, I guess that's possible (but not, IMHO, desirable).

Unless you mean for performance reasons, I'd be interested to know why it is not desirable.
> However, it would be very difficult to enforce a rule that says that the collection of collations so identified are "the same" except for the level chosen. One could be oriented towards, say, French, and the other towards German or Thai, and it would be very hard for the SQL engine to know that it was being misled.

I can see a problem in ensuring that COLLATE (Collate_Fr_Fr, 2) bears the same relation to COLLATE (Collate_Fr_Fr, 1) as COLLATE (Collate_Thai, 2) bears to COLLATE (Collate_Thai, 1), but I honestly don't know how significant that is, or even what "the same" ought to mean if Thai has no cases or diacritics anyway. This seems almost to be questioning the usefulness of levels. Perhaps they have value for some cultures but not others. If that's the case, I don't see that my suggestion is completely invalidated, though its value might be so seriously reduced as to make it negligible.

Mike.
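For what it's worth, the "maximum level" truncation quoted from UTR #10 above is easy to sketch in code. The weight table below is invented purely for illustration (it is not DUCET data), and `sort_key` is a simplified model that ignores variable weighting and normalization:

```python
# Invented 3-level weights: (primary, secondary, tertiary).
# A real implementation derives these from a collation element table.
WEIGHTS = {
    "a": (1, 1, 1),
    "b": (2, 1, 1),
    "A": (1, 1, 2),  # same letter, different case: differs only at level 3
}

def sort_key(s, max_level=3):
    """Concatenate weights level by level, stopping at max_level (UTR #10, 4.3)."""
    key = []
    for level in range(max_level):
        for ch in s:
            w = WEIGHTS[ch][level]
            if w:
                key.append(w)
        key.append(0)  # level separator
    return tuple(key)

# With max_level=2, case differences are "leveled": the keys compare equal.
assert sort_key("a", 2) == sort_key("A", 2)
# With the full 3 levels, the case difference reappears.
assert sort_key("a", 3) != sort_key("A", 3)
```

This is the sense in which one table can serve several strengths: the table always carries all the levels, and truncation happens when the key is formed.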
FW: extracting words
> Yes, we have had it for a long time; no, nobody has solved it entirely; and yes, this approach is wrong. Breaking a string into words may require a thorough understanding of the vocabulary and grammar of the language, and even that may not be enough.

But how can we then ever have a reliable word-break algorithm? It cannot be that, say, for a simple editor (be it written in Java or whatever) you have to supply a database of language-specific details just to do automatic word wrap.

Ciao, Mike
Re: FW: extracting words
If you are willing to give up precision, then you can use heuristics. The grossest heuristics are not really word breaking at all, but they give users who do not know the language a consistent way of working with the text. For example, some vendors have extended their Western European software, which did word breaking on spaces, to simply break after each ideograph when moving their software to CJK markets. Although this is in no way "word" breaking, it gives users predictable behavior for "control-right-arrow" functions that execute "next word". Although it gives some kind of upward and "global" compatibility, it does mean that next-character and next-word do pretty much the same thing for ideographs. It's ugly, but perhaps OK for a simple editor. You can improve the precision with better heuristics and more data, so you get to decide how much is good enough...

tex

Mike Lischke wrote:
> > Yes, we have had it for a long time; no, nobody has solved it entirely; and yes, this approach is wrong. Breaking a string into words may require a thorough understanding of the vocabulary and grammar of the language, and even that may not be enough.
>
> But how can we then ever have a reliable word-break algorithm? It cannot be that, say, for a simple editor (be it written in Java or whatever) you have to supply a database with language-specific details just to do automatic word wrap.
>
> Ciao, Mike

--
According to Murphy, nothing goes according to Hoyle.
Tex Texin, Director, International Business
mailto:[EMAIL PROTECTED] +1-781-280-4271 Fax: +1-781-280-4655
Progress Software Corp., 14 Oak Park, Bedford, MA 01730
http://www.Progress.com #1 Embedded Database
Globalization Program: http://www.Progress.com/partners/globalization.htm
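Tex's "break after each ideograph" heuristic might be sketched as below. The character ranges cover only the basic CJK Unified Ideographs blocks (a rough assumption for illustration; a real implementation would consult the full Unicode character properties):

```python
import re

# Rough heuristic range: CJK Unified Ideographs (URO plus Extension A) only.
IDEOGRAPH = "[\u3400-\u4dbf\u4e00-\u9fff]"

def rough_tokens(text):
    """Spaces delimit words; every ideograph is treated as its own 'word'."""
    return re.findall(IDEOGRAPH + "|[^\\s\u3400-\u4dbf\u4e00-\u9fff]+", text)

print(rough_tokens("hello 世界 world"))  # each ideograph becomes its own token
```

This gives exactly the behavior described: "next word" over the ideographs degenerates into "next character", which is crude but predictable.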
Re: extracting words
Word break is *very* different from line break; see Chapter 5 of TUS and the Line Break TR. For line break the only tricky language is Thai, since it requires a dictionary lookup (much like hyphenation in English). Java (and ICU) supply line-break mechanisms as part of the standard API. They also supply word break, but it is recognized that those are purely heuristic for languages such as Chinese and Japanese; the APIs are intended for functions like double-click, not for dividing text into terms for searching. The latter is a very complex problem, since doing it well requires both division into words and extraction of roots: e.g. "go" from "went" and "gone". It is important to keep these very different processes straight:

- line break (wrapping lines on the screen)
- word break (for selection)
- word/root extraction (for search)

BTW, someone on this thread made this topic out to be even more complex than it is, claiming that Devanagari and Korean are written without spaces. While that may have been the case historically, I believe that modern text does use spaces. Chinese, Japanese, and Thai are the main languages written without spaces.

Mark

----- Original Message -----
From: "Mike Lischke" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Sunday, February 11, 2001 09:47
Subject: FW: extracting words

> > Yes, we have had it for a long time; no, nobody has solved it entirely; and yes, this approach is wrong. Breaking a string into words may require a thorough understanding of the vocabulary and grammar of the language, and even that may not be enough.
>
> But how can we then ever have a reliable word-break algorithm? It cannot be that, say, for a simple editor (be it written in Java or whatever) you have to supply a database with language-specific details just to do automatic word wrap.
>
> Ciao, Mike
Re: Unicode collation algorithm - interpretation]
Mike, Jim,

I am confused by this thread, so I will offer my perspective. The collation algorithm is small and can be written to work flexibly with different levels of sorting. It is easy to have a parameterized table format so that tables can have different levels. I find I need the ability to do the following:

1) Users have different requirements, and being able to choose case-insensitivity, accent-insensitivity, etc. is important. Therefore having an API for comparison that allows "strength" variations such as these is important. Users do want to change their comparisons or sorting dynamically, so it should be built into the API rather than handled by having separate tables. Yes, the code picks and chooses the number of levels. Personally, I don't like to build selection of the strength into the table names, but as there is an equivalence between a multi-component name and a name coupled with additional options, the choice is discretionary.

2) I doubt many applications compile the tables into their code. Most applications want the flexibility to change tables for different languages, so the tables are externalized, and probably parameterized for efficiency. This also allows the tables to be field-upgradable and easily improved or modified. Certainly tables are compiled into binary formats for efficiency. It is possible that an application that does not need strong precision in its sorting might limit its capabilities to fewer levels. Although there might be some minimal savings in code and processing for the algorithm, probably the true benefit is a smaller memory or disk footprint for the tables. If developers said they had compiled limitations into their programs, I would guess they were referring to memory and disk limitations they imposed on themselves.

3) I don't see a reason to build separate tables that are the same except for a different number of levels, for use by the same software.
I would just have the maximum-level table needed by the software and let the algorithm choose the number of levels to use. Having different tables for different software does make sense, since the software might have a different maximum number of levels and could benefit from lower space requirements, etc. Also, as the TR points out, whether or not the software supports combining marks, and whether a normalization pass is made first, might affect the tables, so these things vary from application to application.

hth
tex

J M Sykes wrote:
> [...]
[OT] RE: FW: extracting words
On Sun, 11 Feb 2001, Mike Lischke wrote:
> > If you are willing to give up precision, then you can use heuristics. It's ugly but perhaps ok for a simple editor. You can improve the precision with better heuristics and more data, so you get to decide how much is good enough...
>
> So using white spaces for general word breaking and ideographs for CJK would be an acceptable approach? What I wonder about is how to handle

No, that is not acceptable for Chinese. Chinese text does not use white space anywhere.[1] What was described was that it is tolerable (but not perfect; e.g., punctuation is not handled properly) to break *lines* in Chinese text between Chinese characters. To break *words* properly in Chinese text, you really need a dictionary.[2]

[1] There is some Chinese text with spaces, where a space is inserted after each Chinese character, but that is a hack to make word wrapping behave properly in Chinese-unaware software (which would otherwise treat an entire paragraph of Chinese text as a single "word").

[2] You might get away with treating each Chinese character as a "word", but this is technically wrong from a linguistic standpoint, despite cultural claims to the contrary, and will have implications.

The handling of Japanese and Korean text is different from that of Chinese (lumping them together as "CJK" is inappropriate in this context), but I will leave them for others to provide a better treatment. (Jungshik Shin has already explained the Korean case.)

Thomas Chan [EMAIL PROTECTED]
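Thomas's footnote [2] is why practical Chinese segmenters carry a dictionary. A minimal sketch of the classic greedy longest-match ("maximum matching") approach, using a toy three-entry dictionary, might look like this:

```python
def longest_match_segment(text, dictionary):
    """Greedy left-to-right segmentation: take the longest dictionary word
    starting at the current position; fall back to a single character."""
    max_len = max(len(w) for w in dictionary)
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary or length == 1:
                words.append(candidate)
                i += length
                break
    return words

# "中国人民" with a toy dictionary: the greedy pass picks 中国 + 人民,
# even though 国人 is also a dictionary word.
print(longest_match_segment("中国人民", {"中国", "人民", "国人"}))
```

Greedy matching is itself only a heuristic: it fails on genuinely ambiguous strings, which is why production segmenters layer statistics or grammar on top of the dictionary.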
Re: extracting words
Please read TUS Chapter 5 and the Line Break TR before proceeding, as I recommended in my last message. The Unicode Standard is online, as is the TR; both can be found by going to www.unicode.org and selecting the right topic. The TR in particular discusses the recommended approach to line break in great detail.

However, as with all Unicode functionality, you should not try to reinvent the wheel. See if you can use the services of the OS/platform, or get a Unicode library (such as ICU or Basis) to cover your requirements. You mentioned Java; it has an API for line break.

Mark

P.S. It also helps communication if we use the same terms, e.g. "line break", not "word wrapping".

P.P.S. As to the list settings: if we changed them to please you, we would annoy someone else. We cannot simultaneously please everyone. And please, nobody start another thread on this topic.

----- Original Message -----
From: "Mike Lischke" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Sunday, February 11, 2001 11:32
Subject: re: extracting words

> > - line break (wrapping lines on the screen)
> > - word break (for selection)
> > - word/root extraction (for search)
>
> I recognize that the second and third cases are really difficult to handle. But for word wrapping I assume line breaking is sufficient. But when I don't have spaces to use for wrapping, and/or don't know whether the actual text uses spaces at all (what about exotic languages like Ogham or Anglo-Saxon?), then how can I implement word wrapping? Simply do it character by character?
>
> Ciao, Mike
>
> PS: Sorry for sending this mail to you privately first, but those impractical list settings always make me send to the wrong place first. It is difficult for me to get used to these strange settings. I'm answering about 50 mails per day with a simple "reply", so I simply forget all the time that I have to "reply all" (and the out-of-office bounces I get to my private mail whenever I send a message to the Unicode list don't make the task easier).
Re: Unicode collation algorithm - interpretation]
I agree with Tex that the algorithm is small, if implemented in the straightforward way. I also agree with his #1, #2, and #3. I will add two things:

1. Where performance is important, and where people start adding options (e.g. uppercase-before-lowercase vs. the reverse), the implementation of collation becomes rather tricky. Not unexpectedly: doing any programming task under memory and performance constraints is what separates the adults from the children. People interested in collation may want to take a look at the open-source ICU design currently being implemented. The document is posted on the ICU site; I have a copy of the latest version at http://www.macchiato.com/uca/ICU_collation_design.htm. Feedback is welcome, of course.

2. There seems to be some confusion between the data tables used to support collation and the sort keys (say, for an index) that are generated from those tables. As Tex said, in the data tables you would go ahead and include the level data all the time; there is not much value in producing different tables, except for very limited environments. On the other hand, it may well be useful to generate sort-key indices with different levels, since that can reduce the size of your index where the less significant levels are not important. When searching an index using sort keys, you definitely need to use the same parameters in generating your query sort key as you used when generating the sort-key index; certain combinations of parameters will produce incomparable sort keys. If you used a 3-level sort with SHIFTED alternates in your index, for example, then you need to reproduce that in your query. (BTW, L3_SHIFTED is probably the most useful combination in general; for the vast majority of applications, more than 3 levels simply bloats the index with little value to end users.)
A loose match can use a query sort key with fewer levels than the sort index used, but this needs a small piece of logic to generate the upper and lower bounds for the loose match.

Mark

----- Original Message -----
From: "Tex Texin" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Cc: "Unicode List" [EMAIL PROTECTED]; "Fred Zemke" [EMAIL PROTECTED]
Sent: Sunday, February 11, 2001 11:42
Subject: Re: Unicode collation algorithm - interpretation]

> [...]
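Mark's last point, that a lower-strength query key needs upper and lower bounds to probe a full-strength index, can be sketched with sort keys modeled as tuples of integer weights. `loose_match_range` is a hypothetical helper for illustration; real engines do the equivalent on binary sort keys:

```python
import bisect

def loose_match_range(index_keys, query_prefix):
    """Return the entries of a sorted sort-key index whose keys start with
    a truncated (lower-strength) query key."""
    lo = bisect.bisect_left(index_keys, query_prefix)
    # Anything starting with the prefix sorts below prefix + (infinity,).
    hi = bisect.bisect_left(index_keys, query_prefix + (float("inf"),))
    return index_keys[lo:hi]

# Full 3-level keys, one weight per level with 0 as the level separator:
# (primary, 0, secondary, 0, tertiary, 0) for three single-character strings.
index = sorted([(1, 0, 1, 0, 1, 0),   # e.g. "a"
                (1, 0, 1, 0, 2, 0),   # e.g. "A": differs only at level 3
                (1, 0, 2, 0, 1, 0)])  # e.g. an accented variant: level 2
# A level-2 query key matches both entries that agree on levels 1 and 2:
print(loose_match_range(index, (1, 0, 1, 0)))
```

The bounds logic is the "small piece of logic" in question: the lower bound is the truncated key itself, and the upper bound is the smallest key that cannot begin with it.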
Re: [OT] RE: FW: extracting words
On Sun, 11 Feb 2001, Thomas Chan wrote:
> On Sun, 11 Feb 2001, Mike Lischke wrote:
> > > If you are willing to give up precision, then you can use heuristics. It's ugly but perhaps ok for a simple editor. You can improve the precision with better heuristics and more data, so you get to decide how much is good enough...
> >
> > So using white spaces for general word breaking and ideographs for CJK would be an acceptable approach? What I wonder about is how to handle
>
> The handling of Japanese and Korean text is different from that of Chinese (lumping them together as "CJK" is inappropriate in this context), but I

I'm glad to see this. Lumping them together as "CJK" is inappropriate not only in this context but in other cases as well. To be sure, Chinese, Japanese, and Korean text processing have a lot in common; however, there are a lot of differences as well. In the case of Korean, the Korean writing system Hangul is not just syllabic (as Japanese kana is) but also alphabetic (which means it sometimes needs to be dealt with the way Thai and Indic scripts are treated), and this point should not be overlooked if half-baked Korean support is to be avoided.

The other day, somebody wrote to this list that most morphemes in CJK might be monosyllabic. That's true of Chinese (as far as I can tell), but it could not be farther from true in Japanese and Korean (although it does hold for Chinese loanwords in Korean). Chinese is an isolating language; Japanese and Korean, on the other hand, are agglutinative languages. (Geographic closeness does not necessarily imply linguistic closeness: the distance between Chinese on the one hand and Japanese and Korean on the other is much, much greater than that between English and Sanskrit, both of which belong to the Indo-European language family.) IMHO, this difference makes it harder to extract word roots (for search engines, databases, etc.) from Japanese and Korean text (and from highly inflected languages generally) than from Chinese text.

Jungshik Shin