At 09:42 AM 4/18/2002, Markus Scherer wrote:
>Doug Ewell wrote:
>
>>The ICU package includes a sorted Thai word list in a UTF-8 file called
>>th18057.txt. Since you may not wish to download the whole package and I
>>don't know if the Thai file is available separately, I have uploaded it
>>(for a limited time only) to:
>
>
>Note that ICU has CVS and WebCVS, so you can get any of our files separately.
>For this one:
>http://oss.software.ibm.com/cvs/icu/~checkout~/icu/source/test/testdata/th18057.txt
>
>(ICU uses the X license. See http://oss.software.ibm.com/icu/)
>
>We use this word list for word break iteration, for which we have APIs.

That file is used to test Thai collation; there is a separate, binary
dictionary file that's used for word breaking. The dictionary is built using
ICU4J. You can pick up the source file here:

http://oss.software.ibm.com/cvs/icu/~checkout~/icu4j/src/com/ibm/icu/dev/data/thai6.ucs?rev=1.4&content-type=text/plain&cvsroot=ICU4J

This file is UTF-16 with a BOM at the front. Each word is followed by
^M ^J (CR LF).

>PS: For details about CVS for ICU see
>http://oss.software.ibm.com/icu/develop/cvs.html

Eric Mader
IBM GCoC - San Jose
5600 Cottle Rd. M/S 50-2/B11
San Jose, CA 95193
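
For anyone who wants to read the word list outside of ICU, here is a minimal sketch in Python, assuming only the layout described above: UTF-16 with a BOM at the front and CR LF after each word. The function name read_word_list and the sample words are made up for illustration; nothing here is ICU API.

```python
import codecs


def read_word_list(data):
    """Decode a UTF-16 word list that starts with a BOM and return its words.

    Hypothetical helper for illustration only; not part of ICU or ICU4J.
    """
    # Python's "utf-16" codec uses the BOM to pick the byte order
    # and removes it from the decoded text.
    text = data.decode("utf-16")
    # Every word is followed by CR LF, so splitting leaves an empty
    # string after the last terminator; drop empty entries.
    return [word for word in text.split("\r\n") if word]


# In-memory sample standing in for the real file (assumed contents).
sample = codecs.BOM_UTF16_BE + "กา\r\nขา\r\n".encode("utf-16-be")
words = read_word_list(sample)
```

To read a downloaded copy, pass the file's bytes, for example the result of open(path, "rb").read(), where path is wherever you saved thai6.ucs.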

