Re: extracting words

2001-02-11 Thread Jungshik Shin
On Sat, 10 Feb 2001, Edward Cherlin wrote: At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote: I'm writing a C program that is called Blacklist. Its purpose is to accept a string (Unicode) and extract words from it, then hash the found words according to a hashing algorithm and see if the
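[Editor's sketch, not part of the original message.] The pipeline described above (extract words, hash them, look the hashes up) might look roughly like the following. It is written in Python rather than the poster's C, and everything in it is an assumption made for illustration: the function names, the use of SHA-256, and the idea that the hashes are compared against a set of blacklisted hashes (the excerpt breaks off before saying what the hashes are checked against). The regex-based extraction only works for text whose words are delimited by spaces or punctuation, which is the limitation the rest of this thread discusses.

```python
import hashlib
import re


def extract_words(text):
    """Naive extraction: runs of Unicode word characters.

    Assumption: words are delimited by spaces or punctuation. This does
    not hold for Japanese, Chinese, or Thai text.
    """
    return re.findall(r"\w+", text)


def word_hash(word):
    """Hash a case-folded word. SHA-256 is a stand-in for the unspecified hash."""
    return hashlib.sha256(word.casefold().encode("utf-8")).hexdigest()


def blacklisted_words(text, blacklist_hashes):
    """Return the words in text whose hash appears in blacklist_hashes."""
    return [w for w in extract_words(text) if word_hash(w) in blacklist_hashes]


# Hypothetical usage: one blacklisted word, matched case-insensitively.
blacklist = {word_hash("spam")}
print(blacklisted_words("Buy SPAM now, spam!", blacklist))
```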

Re: extracting words

2001-02-11 Thread Jonathan Lewis
in the first grade Korean class). It would have been more appropriate if you had come up with an example from Japanese or Chinese where spaces are rarely used to separate words. From Japanese, how about: kokodehakimonowonuidekudasai This could be koko de hakimono wo nuide kudasai (take
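[Editor's sketch, not part of the original message.] The ambiguity in the romanized example can be made concrete by enumerating every split of the string into words from a lexicon. The lexicon below is a hypothetical toy list chosen for this one sentence, and the second reading it produces (koko deha kimono wo nuide kudasai) is the editor's addition; the excerpt above is cut off before it gives any alternative. A real segmenter would need a full dictionary plus some way of ranking the candidates.

```python
from typing import Iterator, List, Set

# Hypothetical toy lexicon of romanized words, chosen for illustration only.
LEXICON = {"koko", "de", "deha", "hakimono", "kimono", "wo", "nuide", "kudasai"}


def segmentations(text: str, lexicon: Set[str]) -> Iterator[List[str]]:
    """Yield every way to split text into words that all occur in lexicon."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Recurse on the remainder and prepend the matched word.
            for rest in segmentations(text[end:], lexicon):
                yield [head] + rest


results = [" ".join(words)
           for words in segmentations("kokodehakimonowonuidekudasai", LEXICON)]
for reading in results:
    print(reading)
```

With this lexicon the string has two complete segmentations, so the lexicon alone cannot decide between them.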

Re: Unicode collation algorithm - interpretation]

2001-02-11 Thread J M Sykes
Jim, Thanks for the reply, which Hugh had indeed alerted me to expect. See interpolations below. I particularly want to respond to the statement that you made: It has been suggested that SQL collation name should instead identify both collation element table and maximum level. I believe

FW: extracting words

2001-02-11 Thread Mike Lischke
Yes, we have had it for a long time; no, nobody has solved it entirely; and yes, this approach is wrong. Breaking a string into words may require a thorough understanding of the vocabulary and grammar of the language, and even that may not be enough. But how can we then ever have a

Re: FW: extracting words

2001-02-11 Thread Tex Texin
If you are willing to give up precision, then you can use heuristics. The grossest heuristics are not really word breaking at all, but give users who do not know the language a compatible way of working with the text. For example, some software has extended its western European language

Re: extracting words

2001-02-11 Thread Mark Davis
Word break is *very* different than linebreak; see Chapter 5 of TUS, and the Linebreak TR. For linebreak the only tricky language is Thai, since it requires a dictionary lookup (much like hyphenation in English). Java (and ICU) supply linebreak mechanisms as a part of the standard API. They also

Re: Unicode collation algorithm - interpretation]

2001-02-11 Thread Tex Texin
Mike, Jim, I am confused by this thread so I will offer my perspective. The collation algorithm is small and can be written to work flexibly with different levels of sorting. It is easy to have a parameterized table format so that tables can have different levels. I find I need to have the

[OT] RE: FW: extracting words

2001-02-11 Thread Thomas Chan
On Sun, 11 Feb 2001, Mike Lischke wrote: If you are willing to give up precision, then you can use heuristics. It's ugly but perhaps ok for a simple editor. You can improve the precision with better heuristics and more data, so you get to decide how much is good enough... So using

Re: extracting words

2001-02-11 Thread Mark Davis
Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I recommended in my last message. The Unicode standard is online, as is the TR. Both can be found by going to www.unicode.org, and selecting the right topic. The TR in particular discusses the recommended approach to line break

Re: Unicode collation algorithm - interpretation]

2001-02-11 Thread Mark Davis
I agree with Tex that the algorithm is small, if implemented in the straightforward way. I also agree with his #1, #2, and #3. I will add two things: 1. Where performance is important, and where people start adding options (e.g. uppercase before lowercase vs. the reverse), the implementation of

Re: [OT] RE: FW: extracting words

2001-02-11 Thread Jungshik Shin
On Sun, 11 Feb 2001, Thomas Chan wrote: On Sun, 11 Feb 2001, Mike Lischke wrote: If you are willing to give up precision, then you can use heuristics. It's ugly but perhaps ok for a simple editor. You can improve the precision with better heuristics and more data, so you get to