People sometimes ask whether SenseClusters can process languages other than English. In fact, it is very easy to change the kinds of text SenseClusters can handle: you simply need to change the tokenization file.
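As a rough illustration of how much the token pattern matters, here is a minimal sketch in Python (SenseClusters itself is Perl, so this is only a stand-in, not its actual tokenizer). It assumes the text is in a single-byte encoding such as ISO-8859-1 and compares an ASCII-only \w+ with a class extended by the bytes hex 80 to hex ff. The sample sentences and the helper name tokenize are made up for the example, and joining the two patterns with | is just an approximation of trying the token file's patterns in order.

```python
import re

# Assumption: the text is in a single-byte encoding such as ISO-8859-1,
# so every accented letter is exactly one byte in the range 0x80-0xff.
text = "El niño comió en el café".encode("latin-1")

# On bytes, Python's \w matches only 0-9, A-Z, a-z and _.
ascii_token = re.compile(rb"\w+")

# The same class extended with the high bytes 0x80-0xff.
word = rb"[\w\x80-\xff]+"
extended_token = re.compile(word)

# Head tags first, then plain words (alternation tries the left side first).
head = rb"<head[^<]*>\s*" + word + rb"\s*</head>"
head_or_word = re.compile(head + rb"|" + word)


def tokenize(pattern, data):
    """Return every non-overlapping match of pattern, decoded for display."""
    return [m.decode("latin-1") for m in pattern.findall(data)]


ascii_tokens = tokenize(ascii_token, text)
extended_tokens = tokenize(extended_token, text)
head_tokens = tokenize(head_or_word, "el <head>niño</head> comió".encode("latin-1"))

print(ascii_tokens)     # accented words are split at the accented byte
print(extended_tokens)  # accented words stay whole
print(head_tokens)      # the tagged head word is kept as one token
```

With the ASCII-only pattern the accented letters act as separators, so "niño" comes out as "ni" and "o". With the extended class each word is a single token.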
The default tokenization file uses \w+ as the main token pattern, which matches the characters 0-9, A-Z, a-z and _. It does not include accented characters and the like. So, if you are working with a language that uses, for example, extended ASCII characters (a single-byte encoding such as ISO-8859-1), you can modify the token file to look more like this:

  /<head[^<]*>\s*[\w\x80-\xff]+\s*<\/head>/
  /[\w\x80-\xff]+/

The + belongs outside the brackets; inside a character class it would match a literal plus sign.

This will recognize head tags and strings made up of the standard \w class of characters as well as those in the range of hex 80 to hex ff in the extended ASCII table. You can see what those characters are by looking at a table like this:

http://www.cdrummond.qc.ca/cegep/informat/Professeurs/Alain/files/ascii.htm

It's probably the case that we should make the extended ASCII token file the default for SenseClusters, so that will be coming in a future release. But until then it's very easy to adjust the token file to process your favorite extended ASCII text.

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
