Re: extracting words

2001-02-11 Thread Jungshik Shin




On Sat, 10 Feb 2001, Edward Cherlin wrote:

 At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote:

 I'm writing a C program called Blacklist. Its purpose is to accept
 a string (Unicode) and extract words from it, then hash the found words
 according to a hashing algorithm and see if each word is in the blacklist
 hashtable.
 
 This is all very straightforward, but the problem is the extracting of
 words from this string.
 How do I determine what a word is in Japanese or Korean or whatever other
 language? { a space ? }

 No. Chinese and Japanese almost never have spaces between words, and
 they are not required in Korean.

I'm afraid this is a little bit misleading.  In modern Korean orthography,
every word is delimited by a space (Korean Orthographic Rules, article 2:
1988-01-19, Ministry of Education, ROK). The exception to that rule is
that particles (Josa) have to follow the preceding word without a space
(ibid., article 41). There are also some minor exceptions (ibid., articles
43, 47, and 49), so you might say you're correct in that
spaces are *not required* in Korean, but the principle of delimiting
every pair of words with a space is still there.


 Yes, we have had it for a long time; no, nobody has solved it
 entirely; and yes, this approach is wrong. Breaking a string into
 words may require a thorough understanding of the vocabulary and
 grammar of the language, and even that may not be enough.

I absolutely agree with you on this point.

 An example from Korean: Abeojigabangeisseoyo. Should this be segmented as
 Abeojiga bange isseoyo (Father is in the room), or as Abeoji gabange
 isseoyo (Father is in the bag)?

I don't think this is such a good example to make your case about the
enormous difficulties and complexities involved in extracting words
(on which I agree), because the original question was how to extract words
out of 'supposedly orthographically correct sentences'. Your example
(Abeojigabangeisseoyo) clearly violates the (modern) orthographic rule
by gluing together all the words without spaces (nobody would write
that way).  One of the reasons that spaces are used to separate words
in Korean writing is to break/lift this kind of degeneracy (as taught
in the first-grade Korean class). It would have been more appropriate
if you had come up with an example from Japanese or Chinese, where spaces
are rarely used to separate words.

Jungshik Shin




Re: extracting words

2001-02-11 Thread Jonathan Lewis

 in the first-grade Korean class). It would have been more appropriate
 if you had come up with an example from Japanese or Chinese, where spaces
 are rarely used to separate words.

From Japanese, how about:

kokodehakimonowonuidekudasai

This could be

koko de hakimono wo nuide kudasai (take your shoes off here)

or

koko deha kimono wo nuide kudasai (take your clothes off here; in this case
"ha" is pronounced "wa")


Actually, this example was given at an AAMT meeting last year to show that
writing words in kanji + kana rather than just kana makes it easier for MT
software to break down strings into words accurately and therefore produce
better translations.

Best wishes,

Jonathan Lewis

Tokyo Denki University




Re: Unicode collation algorithm - interpretation]

2001-02-11 Thread J M Sykes

Jim,

Thanks for the reply, which Hugh had indeed alerted me to expect. See
interpolations below.

 I particularly want to respond to the statement that you made:

 It has been suggested that SQL collation name should instead identify
 both collation element table and maximum level.

 I believe that the "maximum level" is built into the
 collation element table inseparably.

I think you misunderstand me. The "maximum level" I was referring to is that
mentioned in UTR#10, section 4, "Main algorithm", 4.3 "Form a sort key for
each string", para 2, which reads:

   "An implementation may allow the maximum level to be set to a smaller
   level than the available levels in the collation element array. For
   example, if the maximum level is set to 2, then level 3 and higher
   weights (including the normalized Unicode string) are not appended to
   the sort key. Thus any differences at levels 3 and higher will be
   ignored, leveling any such differences in string comparison."

There is, of course, an upper limit to the number of levels provided for in
the Collation Element Table and 14651 requires that "The number of levels
that the process supports ... shall be at least three." So I think it's fair
to say we are discussing whether, and if so how, these levels should be made
visible to the SQL user.
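
For concreteness: this user-selectable maximum level is what Java's
java.text.Collator (one existing implementation of multilevel collation)
exposes as a "strength" setting. A minimal sketch, assuming nothing beyond
the standard JDK API; it is an illustration, not a proposed SQL mechanism:

    import java.text.Collator;
    import java.util.Locale;

    public class StrengthDemo {
        public static void main(String[] args) {
            Collator c = Collator.getInstance(Locale.FRENCH);

            // Maximum level 2: only primary and secondary differences
            // count, so the tertiary (case) difference is levelled out.
            c.setStrength(Collator.SECONDARY);
            System.out.println(c.compare("resume", "RESUME") == 0); // true
            System.out.println(c.compare("resume", "résumé") == 0); // false: accents are level 2

            // Maximum level 3: tertiary weights included, so case now counts.
            c.setStrength(Collator.TERTIARY);
            System.out.println(c.compare("resume", "RESUME") == 0); // false
        }
    }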

We can safely assume that at least some users will require sometimes exact,
sometimes inexact comparisons (at least for pseudo-equality, to a lesser
extent for sorting).

We can also safely assume that users will wish to get the performance
benefit of some preprocessing.

It is clearly possible to preprocess as far as the end of step 2 of the
Unicode collation algorithm without committing to a level. I understand you
to say that several implementors have concluded that this level of
preprocessing is not cost-effective, in comparison to going all the way to
the sort key. I am in no position to dispute that conclusion.

 I monitored the email discussions rather a lot during the development of
 ISO 14651 and it seemed awfully likely as a result of the
 discussions (plus conversations that I've had with implementors in
 at least 3 companies) that
 a specific collation would be built by constructing the collation
 element table (as you mentioned in your note) and then "compiling"
 it into the code that actually does the collation.
 That code would *inherently* have built
 into it the levels that were specified in the collation table that was
 constructed.  It's not like the code can pick and choose which of the
 levels it wishes to honor.

I'm afraid I don't understand what this is saying. I've seen both the 14651
"Common Template Table" and the Unicode "Default Unicode Collation Element
Table", and assume them to be equivalent, but have not verified that they
are. Neither of them looks particularly "compilable" to me but, in view of
your quotes, I'm not at all clear what you mean by '"compiling" it into the
code that actually does the collation.'

I'm also unclear what an SQL implementor is likely to supply as "a
collation", though I imagine (only!) that it might be just the part of the
CTT/CET appropriate to the script used by a particular culture, with
appropriate tailoring. But I have no reason to expect the executable
("compiled"?) code that implements the algorithm to vary depending on the
collation, or on the level (case-blind, etc.) specified by the user for a
particular comparison.

I find it easier to imagine differences in code depending on whether a
COLLATE clause is in a column definition or in, say,
WHERE C1 = C2 COLLATE <collation name>.

 Of course, if you really want to specify an SQL collation name that
 somehow identifies 2 or 3 or 4 (or more) collations built in
 conformance with ISO
 14651 and then use an additional parameter to choose between them, I guess
 that's possible (but not, IMHO, desirable).

Unless you mean for performance reasons, I'd be interested to know why not
desirable.

 However, it would be very
 difficult to enforce a rule that says that the collection of collations so
 identified are "the same" except for the level chosen.  One could be
 oriented towards, say, French, and the other towards German or Thai and it
 would be very hard for the SQL engine to know that it was being misled.

I can see a problem in ensuring that COLLATE (Collate_Fr_Fr, 2) bears the
same relation to COLLATE (Collate_Fr_Fr, 1) as COLLATE (Collate_Thai, 2)
bears to COLLATE (Collate_Thai, 1), but I honestly don't know how
significant that is, or even what "the same" ought to mean if Thai has no
cases or diacritics anyway.

This seems almost to be questioning the usefulness of levels. Perhaps they
have value for some cultures but not others. If that's the case, I don't
see that my suggestion is completely invalidated, though its value might be
so seriously reduced as to make it negligible.

Mike.






FW: extracting words

2001-02-11 Thread Mike Lischke

 
 Yes, we have had it for a long time; no, nobody has solved it 
 entirely; and yes, this approach is wrong. Breaking a string into 
 words may require a thorough understanding of the vocabulary and 
 grammar of the language, and even that may not be enough.

But how can we then ever have a reliable word-break algorithm? It cannot be that, say,
for a simple editor (be it written in Java or whatever) you have to supply a database
with language-specific details just to do automatic word wrap.

Ciao, Mike





Re: FW: extracting words

2001-02-11 Thread Tex Texin

If you are willing to give up precision, then you can use heuristics.

The grossest heuristics are not really word breaking at all, but
give users who do not know the language a workable way of dealing
with the text. For example, some software has extended its western
European language support, which did word breaking on spaces, to
simply break after each ideograph when moving to CJK
markets. Although this is in no way "word" breaking, it gives users
a predictable behavior for "control-right-arrow" functions that
execute "next word".

Although it gives some kind of upward and "global" compatibility,
it does mean that next-character and next-word do pretty much the
same thing for ideographs.

It's ugly but perhaps OK for a simple editor. You can improve the
precision with better heuristics and more data, so you get to decide
how much is good enough...
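
To make that heuristic concrete, here is a minimal sketch in Java. It is
purely illustrative (not from any shipping product), and it deliberately
checks only the main CJK Unified Ideographs block, ignoring the extension
blocks, kana, hangul and punctuation:

    import java.util.ArrayList;
    import java.util.List;

    public class CrudeBreaker {
        // True for characters in the main CJK Unified Ideographs block only;
        // a real implementation would need far more than this.
        static boolean isIdeograph(int cp) {
            return Character.UnicodeBlock.of(cp)
                    == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
        }

        // Break on white space, and treat each ideograph as its own "word".
        static List<String> crudeWords(String s) {
            List<String> words = new ArrayList<>();
            StringBuilder run = new StringBuilder();
            for (int i = 0; i < s.length();
                    i += Character.charCount(s.codePointAt(i))) {
                int cp = s.codePointAt(i);
                if (Character.isWhitespace(cp) || isIdeograph(cp)) {
                    if (run.length() > 0) {
                        words.add(run.toString());
                        run.setLength(0);
                    }
                    if (isIdeograph(cp)) {
                        words.add(new String(Character.toChars(cp)));
                    }
                } else {
                    run.appendCodePoint(cp);
                }
            }
            if (run.length() > 0) {
                words.add(run.toString());
            }
            return words;
        }
    }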

tex

Mike Lischke wrote:
 
 
  Yes, we have had it for a long time; no, nobody has solved it
  entirely; and yes, this approach is wrong. Breaking a string into
  words may require a thorough understanding of the vocabulary and
  grammar of the language, and even that may not be enough.
 
 But how can we then ever have a reliable word-break algorithm? It cannot be
 that, say, for a simple editor (be it written in Java or whatever) you have
 to supply a database with language-specific details just to do automatic
 word wrap.
 
 Ciao, Mike

-- 
According to Murphy, nothing goes according to Hoyle.
--
Tex Texin                  Director, International Business
mailto:[EMAIL PROTECTED]   +1-781-280-4271  Fax: +1-781-280-4655
Progress Software Corp.    14 Oak Park, Bedford, MA 01730

http://www.Progress.com    #1 Embedded Database

Globalization Program
http://www.Progress.com/partners/globalization.htm
---



Re: extracting words

2001-02-11 Thread Mark Davis

Word break is *very* different from line break; see Chapter 5 of TUS, and the
Linebreak TR. For line break the only tricky language is Thai, since it
requires a dictionary lookup (much like hyphenation in English). Java (and
ICU) supply line-break mechanisms as a part of the standard API. They also
supply word break, but it is recognized that those are purely heuristic for
languages such as Chinese and Japanese; the APIs are intended for functions
like double-click, not for dividing text into terms for searching. The
latter is a very complex problem, since doing it well requires both division
into words and extraction of roots: e.g. "go" from "went" and "gone". It is
important to keep these very different processes straight:

- line break (wrapping lines on the screen)
- word break (for selection)
- word/root extraction (for search)
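
For the first two, the JDK's java.text.BreakIterator does provide distinct
services (getLineInstance vs. getWordInstance). A minimal sketch,
illustrative only:

    import java.text.BreakIterator;
    import java.util.Locale;

    public class Breaks {
        // Print each segment between adjacent boundaries.
        static void dump(BreakIterator bi, String text) {
            bi.setText(text);
            int start = bi.first();
            for (int end = bi.next(); end != BreakIterator.DONE;
                    start = end, end = bi.next()) {
                System.out.println("[" + text.substring(start, end) + "]");
            }
        }

        public static void main(String[] args) {
            String text = "Hello, world. How are you?";
            // Line-break opportunities (where a line may legally wrap):
            dump(BreakIterator.getLineInstance(Locale.ENGLISH), text);
            // Word boundaries (what double-click/ctrl-arrow should use):
            dump(BreakIterator.getWordInstance(Locale.ENGLISH), text);
        }
    }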

BTW, someone on this thread made this topic out to be even more complex than
it is: that Devanagari and Korean are written without spaces. While that may
have been the case historically, I believe that modern text does use
spaces. Chinese, Japanese and Thai are the main languages written without
spaces.

Mark
- Original Message -
From: "Mike Lischke" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Sunday, February 11, 2001 09:47
Subject: FW: extracting words



 Yes, we have had it for a long time; no, nobody has solved it
 entirely; and yes, this approach is wrong. Breaking a string into
 words may require a thorough understanding of the vocabulary and
 grammar of the language, and even that may not be enough.

But how can we then ever have a reliable word-break algorithm? It cannot be
that, say, for a simple editor (be it written in Java or whatever) you have
to supply a database with language-specific details just to do automatic
word wrap.

Ciao, Mike






Re: Unicode collation algorithm - interpretation]

2001-02-11 Thread Tex Texin

Mike, Jim,

I am confused by this thread so I will offer my perspective.

The collation algorithm is small and can be written to work
flexibly with different levels of sorting.

It is easy to have a parameterized table format so that
tables can have different levels.

I find I need to have the ability to do the following:

1) Users have different requirements, and being able to choose
case-insensitivity, accent-insensitivity, etc. is important.
Therefore having an API for comparison that allows "strength" variations
such as these is important. Users do want to change their comparisons
or sorting dynamically, so it should be built into the API, not done by
having separate tables. Yes, the code picks and chooses the
number of levels.

Personally, I don't like to build selection of the strength
into the table names, but as there is an equivalency between a
multicomponent name and a name coupled with additional options,
the choice is discretionary.

2) I doubt many applications compile the tables into their
code. Most applications want to have the flexibility to change
tables for different languages. Therefore the tables are externalized
and probably parameterized for efficiency. This also allows the
tables to be field-upgradable and easily improved or modified.

Certainly tables are compiled into binary formats for efficiency.

It is possible that an application that does not need strong
precision in its sorting might limit its capabilities to fewer
levels. Although there might be some minimal savings in code
and processing for the algorithm, probably the true benefit is
smaller memory or disk footprints for the tables. If developers
said they had compiled limitations into their programs, I would
guess they were referring to memory and disk limitations they
imposed on themselves.

3) I don't see a reason to build separate tables that are the
same except for different number of levels, for use by the same
software.

I would just have the maximum level table needed by the software
and let the algorithm choose the number of levels to use.

Having different tables for different software does make sense
since the software might have a different max number of levels
and could benefit from lower space requirements, etc.

Also, as the TR points out, whether or not the software
supports combining marks and whether a normalization pass
is made first might impact the tables, so these things vary
from application to application.

hth
tex


J M Sykes wrote:
 
 Jim,
 
 Thanks for the reply, which Hugh had indeed alerted me to expect. See
 interpolations below.
 
  I particularly want to respond to the statement that you made:
 
  It has been suggested that SQL collation name should instead identify
  both collation element table and maximum level.
 
  I believe that the "maximum level" is built into the
  collation element table inseparably.
 
 I think you misunderstand me. The "maximum level" I was referring to is that
 mentioned in UTR#10, section 4, "Main algorithm", 4.3 "Form a sort key for
 each string", para 2, which reads:
 
    "An implementation may allow the maximum level to be set to a smaller
    level than the available levels in the collation element array. For
    example, if the maximum level is set to 2, then level 3 and higher
    weights (including the normalized Unicode string) are not appended to
    the sort key. Thus any differences at levels 3 and higher will be
    ignored, leveling any such differences in string comparison."
 
 There is, of course, an upper limit to the number of levels provided for in
 the Collation Element Table and 14651 requires that "The number of levels
 that the process supports ... shall be at least three." So I think it's fair
 to say we are discussing whether, and if so how, these levels should be made
 visible to the SQL user.
 
 We can safely assume that at least some users will require sometimes exact,
 sometimes inexact comparisons (at least for pseudo-equality, to a lesser
 extent for sorting).
 
 We can also safely assume that users will wish to get the performance
 benefit of some preprocessing.
 
 It is clearly possible to preprocess as far as the end of step 2 of the
 Unicode collation algorithm without committing to a level. I understand you
 to say that several implementors have concluded that this level of
 preprocessing is not cost-effective, in comparison to going all the way to
 the sort key. I am in no position to dispute that conclusion.
 
  I monitored the email discussions rather a lot during the development of
  ISO 14651 and it seemed awfully likely as a result of the
  discussions (plus conversations that I've had with implementors in
  at least 3 companies) that
  a specific collation would be built by constructing the collation
  element table (as you mentioned in your note) and then "compiling"
  it into the code that actually does the collation.
  That code would *inherently* have built
  into it the levels that were specified in the collation table 

[OT] RE: FW: extracting words

2001-02-11 Thread Thomas Chan

On Sun, 11 Feb 2001, Mike Lischke wrote:

  If you are willing to give up precision, then you can use heuristics.
 
  It's ugly but perhaps OK for a simple editor. You can improve the
  precision with better heuristics and more data, so you get to decide
  how much is good enough...
 
 So using white spaces for general word breaking and ideographs for CJK
 would be an acceptable approach? What I wonder about is how to handle

No, that is not acceptable for Chinese.  Chinese text does not use white 
space anywhere.[1]  What was described was that it is tolerable (but not
perfect--e.g., punctuation is not handled properly) to break *lines* in
Chinese text between Chinese characters.  To break *words* properly in
Chinese text, you really need a dictionary.[2]

[1] There is some Chinese text with spaces, where a space is inserted
after each Chinese character, but that is a hack to make word-wrapping
behave properly on Chinese-unaware software (which would otherwise treat
an entire paragraph of Chinese text as a single "word").

[2] You might get away with treating each Chinese character as a "word",
but this is technically wrong from a linguistic standpoint, despite cultural
claims to the contrary, and will have implications.
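
As a concrete picture of what "you really need a dictionary" means in its
very simplest form, here is a greedy longest-match ("maximum matching")
sketch in Java. This is a common baseline, not a serious segmenter; the
dictionary is a toy stand-in, and real systems also need ambiguity
resolution and unknown-word handling:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    public class MaxMatch {
        // Greedy longest-match segmentation: at each position, take the
        // longest dictionary word; fall back to a single character.
        static List<String> segment(String text, Set<String> dict, int maxLen) {
            List<String> words = new ArrayList<>();
            int i = 0;
            while (i < text.length()) {
                int len = Math.min(maxLen, text.length() - i);
                while (len > 1 && !dict.contains(text.substring(i, i + len))) {
                    len--;
                }
                words.add(text.substring(i, i + len));
                i += len;
            }
            return words;
        }
    }

Note that greedy matching is exactly what the Abeojigabangeisseoyo and
kokodehakimonowonuidekudasai examples earlier in this thread defeat: a
dictionary alone cannot resolve such ambiguities.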


The handling of Japanese and Korean text is different from that of Chinese
(lumping them together as "CJK" is inappropriate in this context), but I
will leave them for others to provide a better treatment.  (Jungshik Shin
has already explained the Korean case.)


Thomas Chan
[EMAIL PROTECTED]




Re: extracting words

2001-02-11 Thread Mark Davis

Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
recommended in my last message. The Unicode standard is online, as is the
TR. Both can be found by going to www.unicode.org, and selecting the right
topic. The TR in particular discusses the recommended approach to line break
in great detail.

However, as with all Unicode functionality, you should not try to reinvent
the wheel. See if you can use the services of the OS/Platform, or get a
Unicode library (such as ICU or Basis), to cover your requirements. You
mentioned Java; it has an API for line break.

Mark

P.S. It also helps communication if we use the same terms, e.g. "line
break", not "word wrapping".
P.P.S. As to the list settings: if we change it to please you, we would
annoy someone else. We cannot simultaneously please everyone. And please,
nobody start another thread on this topic.

- Original Message -
From: "Mike Lischke" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Sunday, February 11, 2001 11:32
Subject: re: extracting words




 - line break (wrapping lines on the screen)
 - word break (for selection)
 - word/root extraction (for search)

I recognize that the second and third cases are really difficult to handle.
But for word wrapping I assume line breaking is sufficient. When I don't
have spaces to use for wrapping, and/or don't know whether the actual text
uses spaces at all (what about exotic languages like Ogham or
Anglo-Saxon?), how can I go about implementing word wrapping? Simply do it
character by character?
Ciao, Mike

PS: sorry for sending this mail first to you privately, but those
impractical list settings always make me send to the wrong place first.
It is difficult for me to get used to these strange settings. I'm answering
about 50 mails per day with a simple "reply", so I simply forget all the
time that I have to "reply all" (and the out-of-office bounces I get to my
private mail whenever I send a message to the Unicode list don't make the
task easier).





Re: Unicode collation algorithm - interpretation]

2001-02-11 Thread Mark Davis

I agree with Tex that the algorithm is small, if implemented in the
straightforward way. I also agree with his #1, #2, and #3. I will add two
things:

1. Where performance is important, and where people start adding options
(e.g. uppercase < lowercase vs. the reverse), the implementation of collation
becomes rather tricky. Not unexpectedly -- doing any programming task under
memory and performance constraints is what separates the adults from the
children.

People interested in collation may want to take a look at the open-source
ICU design currently being implemented. The document is posted on the ICU
site; I have a copy of the latest version on
http://www.macchiato.com/uca/ICU_collation_design.htm. Feedback is welcome,
of course.

2. There seems to be confusion between the data tables used to support
collation, and the sort keys -- say for an index -- that are generated from
those tables. As Tex said, in the data tables you would go ahead and include
the level data all the time. There is not much value in producing different
tables, except for very limited environments.

On the other hand, it may well be useful to generate sort key indices with
different levels, since that can reduce the size of your index where the
less significant levels are not important. When searching an index using
sort keys, you definitely need to use the same parameters in generating your
query sort key as you used when generating the sort key index; certain
combinations of parameters will have incomparable sort keys.
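
A small sketch of that constraint using the JDK's java.text.Collator and
CollationKey, which expose the level choice as "strength" (though not the
SHIFTED alternate handling mentioned below); illustrative only:

    import java.text.CollationKey;
    import java.text.Collator;
    import java.util.Arrays;
    import java.util.Locale;

    public class KeyDemo {
        public static void main(String[] args) {
            Collator c = Collator.getInstance(Locale.ENGLISH);

            // Build an "index" key with all three levels (tertiary strength).
            c.setStrength(Collator.TERTIARY);
            CollationKey indexKey = c.getCollationKey("Résumé");

            // A query key generated with the same parameters is comparable:
            System.out.println(
                indexKey.compareTo(c.getCollationKey("Résumé")) == 0); // true

            // A key generated at a different maximum level is a different
            // byte sequence; comparing it against the index is meaningless.
            c.setStrength(Collator.PRIMARY);
            byte[] primary = c.getCollationKey("Résumé").toByteArray();
            System.out.println(
                Arrays.equals(primary, indexKey.toByteArray())); // false
        }
    }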

If you used a 3-level sort with SHIFTED alternates in your index, for
example, then you need to reproduce that in your query. (BTW, L3_SHIFTED is
probably the most useful combination to use in general; for the vast
majority of applications more than 3 levels simply bloats the index with
little value to end users.)

A loose match can use a query sort key with fewer levels than the sort index
used, but this needs a small piece of logic to generate the upper and lower
bounds for the loose match.

Mark

- Original Message -
From: "Tex Texin" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Cc: "Unicode List" [EMAIL PROTECTED]; "Fred Zemke"
[EMAIL PROTECTED]
Sent: Sunday, February 11, 2001 11:42
Subject: Re: Unicode collation algorithm - interpretation]


 Mike, Jim,

 I am confused by this thread so I will offer my perspective.

 The collation algorithm is small and can be written to work
 flexibly with different levels of sorting.

 It is easy to have a parameterized table format so that
 tables can have different levels.

 I find I need to have the ability to do the following:

 1) Users have different requirements, and being able to choose
 case-insensitivity, accent-insensitivity, etc. is important.
 Therefore having an API for comparison that allows "strength" variations
 such as these is important. Users do want to change their comparisons
 or sorting dynamically, so it should be built into the API, not done by
 having separate tables. Yes, the code picks and chooses the
 number of levels.

 Personally, I don't like to build selection of the strength
 into the table names, but as there is an equivalency between a
 multicomponent name and a name coupled with additional options,
 the choice is discretionary.

 2) I doubt many applications compile the tables into their
 code. Most applications want to have the flexibility to change
 tables for different languages. Therefore the tables are externalized
 and probably parameterized for efficiency. This also allows the
 tables to be field-upgradable and easily improved or modified.

 Certainly tables are compiled into binary formats for efficiency.

 It is possible that an application that does not need strong
 precision in its sorting might limit its capabilities to fewer
 levels. Although there might be some minimal savings in code
 and processing for the algorithm, probably the true benefit is
 smaller memory or disk footprints for the tables. If developers
 said they had compiled limitations into their programs, I would
 guess they were referring to memory and disk limitations they
 imposed on themselves.

 3) I don't see a reason to build separate tables that are the
 same except for different number of levels, for use by the same
 software.

 I would just have the maximum level table needed by the software
 and let the algorithm choose the number of levels to use.

 Having different tables for different software does make sense
 since the software might have a different max number of levels
 and could benefit from lower space requirements, etc.

 Also, as the TR points out, whether or not the software
 supports combining marks and whether a normalization pass
 is made first might impact the tables, so these things vary
 from application to application.

 hth
 tex


 J M Sykes wrote:
 
  Jim,
 
  Thanks for the reply, which Hugh had indeed alerted me to expect. See
  interpolations below.
 
   I particularly want to respond to the statement that you made:
  
   It 

Re: [OT] RE: FW: extracting words

2001-02-11 Thread Jungshik Shin

On Sun, 11 Feb 2001, Thomas Chan wrote:

 On Sun, 11 Feb 2001, Mike Lischke wrote:

   If you are willing to give up precision, then you can use heuristics.
  
   It's ugly but perhaps OK for a simple editor. You can improve the
   precision with better heuristics and more data, so you get to decide
   how much is good enough...
 
  So using white spaces for general word breaking and ideographs for CJK
  would be an acceptable approach? What I wonder about is how to handle

 The handling of Japanese and Korean text is different from that of Chinese
 (lumping them together as "CJK" is inappropriate in this context), but I

I'm glad to see this. Lumping them together as "CJK" is inappropriate not
only in this context but in other cases as well. To be sure, Chinese,
Japanese and Korean text processing have a lot in common.  However, there
are a lot of differences as well. In the case of Korean, the Korean writing
system, Hangul, is not just syllabic (as Japanese Kana is) but also
alphabetic (which means that in some cases it needs to be dealt with the
way Thai and Indic scripts are treated), and this point should not be
overlooked if half-baked Korean support is to be avoided.

The other day, somebody wrote to this list that most morphemes in CJK
might be monosyllabic. That's true of Chinese (as far as I can tell),
but could not be further from the truth for Japanese and Korean (although it
holds true for Chinese loan-words in Korean). Chinese is an isolating
language. On the other hand, Japanese and Korean are agglutinating
languages (geographic closeness doesn't necessarily lead to
linguistic closeness. The distance between Chinese on the one hand and
Japanese and Korean on the other is much, much greater than that
between English and Sanskrit, both of which belong to the Indo-European
language family).  IMHO, this difference makes it harder to extract
word roots (for search engines, DBs, etc.) out of Japanese and Korean text
(and highly inflected languages in general) than out of Chinese text.


Jungshik Shin