On Sat, 10 Feb 2001, Edward Cherlin wrote:
At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote:
I'm writing a C program called Blacklist. Its purpose is to accept
a string (Unicode) and extract words from it, then hash the found words
according to a hashing algorithm and see if the
in the first grade Korean class). It would have been more appropriate
if you had come up with an example from Japanese or Chinese where spaces
are rarely used to separate words.
From Japanese, how about:
kokodehakimonowonuidekudasai
This could be
koko de hakimono wo nuide kudasai (take
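To make the ambiguity concrete, here is a minimal sketch that lists every way the romanized string above can be split using a small word list. The function name `segmentations` and the contents of `LEXICON` are hypothetical, chosen only for illustration. In particular, the entries "ha" and "kimono" are an assumption added here so that a second reading appears; they are not taken from the message.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # one way to split the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this prefix and split the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Toy word list (an assumption for this sketch, not a real dictionary).
LEXICON = {"koko", "de", "ha", "hakimono", "kimono", "wo", "nuide", "kudasai"}

readings = segmentations("kokodehakimonowonuidekudasai", LEXICON)
for words in readings:
    print(" ".join(words))
```

With this word list the string has more than one valid split, so a dictionary alone cannot pick the intended one. Choosing between the candidates takes knowledge of grammar or context.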
Jim,
Thanks for the reply, which Hugh had indeed alerted me to expect. See
interpolations below.
I particularly want to respond to the statement that you made:
It has been suggested that SQL collation name should instead identify
both collation element table and maximum level.
I believe
Yes, we have had it for a long time; no, nobody has solved it
entirely; and yes, this approach is wrong. Breaking a string into
words may require a thorough understanding of the vocabulary and
grammar of the language, and even that may not be enough.
But how can we then ever have a
If you are willing to give up precision, then you can use heuristics.
The grossest heuristics are not really word breaking at all, but
give users who do not know the language a compatible way of working
with the text. For example, some software has extended its Western
European language
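As an illustration of how gross such a heuristic can be, the sketch below splits on anything that is not a letter or digit and treats every Han or kana character as a word of its own. The function names and the three code point ranges are assumptions made for this example. It is not the method used by any particular product, and it does no real word breaking for Chinese or Japanese.

```python
def is_cjk(ch):
    """Rough test for Hiragana, Katakana, or CJK Unified Ideographs."""
    cp = ord(ch)
    return (0x3040 <= cp <= 0x309F      # Hiragana
            or 0x30A0 <= cp <= 0x30FF   # Katakana
            or 0x4E00 <= cp <= 0x9FFF)  # CJK Unified Ideographs


def crude_words(text):
    """Split text with a crude heuristic; CJK characters stand alone."""
    words, current = [], []

    def flush():
        if current:
            words.append("".join(current))
            current.clear()

    for ch in text:
        if is_cjk(ch):
            flush()
            words.append(ch)    # each CJK character becomes its own "word"
        elif ch.isalnum():
            current.append(ch)  # extend the current run of letters/digits
        else:
            flush()             # whitespace and punctuation end a word
    flush()
    return words


print(crude_words("Hello, \u4e16\u754c foo"))
```

The result is predictable and cheap, which is what makes it usable by someone who cannot read the text, but it gives up precision entirely for the CJK part.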
Word break is *very* different from linebreak; see Chapter 5 of TUS, and the
Linebreak TR. For linebreak the only tricky language is Thai, since it
requires a dictionary lookup (much like hyphenation in English). Java (and
ICU) supply linebreak mechanisms as a part of the standard API. They also
Mike, Jim,
I am confused by this thread so I will offer my perspective.
The collation algorithm is small and can be written to work
flexibly with different levels of sorting.
It is easy to have a parameterized table format so that
tables can have different levels.
I find I need to have the
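A minimal sketch of the idea that one small algorithm can serve different levels: the key builder below takes the table and the maximum level as parameters. `TOY_TABLE`, its weights, and the name `sort_key` are hypothetical and cover only five characters. This shows the general multi-level technique, not the full Unicode Collation Algorithm (no contractions, expansions, or ignorable elements).

```python
# Toy collation element table: char -> (primary, secondary, tertiary).
# The weights are invented for this sketch.
TOY_TABLE = {
    "a": (1, 1, 1), "A": (1, 1, 2),
    "\u00e1": (1, 2, 1),            # a with acute: differs at level 2
    "b": (2, 1, 1), "B": (2, 1, 2),
}


def sort_key(text, table, max_level):
    """Build a sort key from the first `max_level` weight levels.

    Characters missing from `table` raise KeyError in this sketch.
    """
    key = []
    for level in range(max_level):
        key.extend(table[ch][level] for ch in text)
        key.append(0)  # level separator, lower than any real weight
    return tuple(key)


words = ["B", "A", "\u00e1", "a"]
print(sorted(words, key=lambda w: sort_key(w, TOY_TABLE, 1)))
print(sorted(words, key=lambda w: sort_key(w, TOY_TABLE, 3)))
```

At level 1 the three forms of "a" compare equal and keep their input order; at level 3 accents and case break the tie. Passing the table and level separately is one way to read the suggestion that a collation name identify both.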
On Sun, 11 Feb 2001, Mike Lischke wrote:
If you are willing to give up precision, then you can use heuristics.
It's ugly but perhaps OK for a simple editor. You can improve the
precision with better heuristics and more data, so you get to decide
how much is good enough...
So using
Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
recommended in my last message. The Unicode standard is online, as is the
TR. Both can be found by going to www.unicode.org, and selecting the right
topic. The TR in particular discusses the recommended approach to line break
I agree with Tex that the algorithm is small, if implemented in the
straightforward way. I also agree with his #1, #2, and #3. I will add two
things:
1. Where performance is important, and where people start adding options
(e.g. uppercase before lowercase vs. the reverse), the implementation of
On Sun, 11 Feb 2001, Thomas Chan wrote:
On Sun, 11 Feb 2001, Mike Lischke wrote:
If you are willing to give up precision, then you can use heuristics.
It's ugly but perhaps OK for a simple editor. You can improve the
precision with better heuristics and more data, so you get to