> On 3 Oct 2016, at 19:17, Jean-Denis Muys via swift-users 
> <swift-users@swift.org> wrote:
> 
> You are right: I don’t know much about Asian languages.
> 
> How would you go about counting consonants, vowels (and tone-marks?) in the 
> most general way?

Iterate over unicodeScalars in the most general case; Swift Characters are 
probably fine for European languages.

For each unicodeScalar (a.k.a. code point) you can use the ICU function:
        int8_t  chrTyp = u_charType (codepoint) 
This returns the general category value for the code point, e.g. 
U_OTHER_PUNCTUATION, U_MATH_SYMBOL, U_OTHER_LETTER etc.
See enum UCharCategory in 
<http://icu-project.org/apiref/icu4c-latest/uchar_8h.html>

For European languages, ignore code points whose category is 
U_NON_SPACING_MARK (combining accents and the like).
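
As a rough sketch of the same idea without calling ICU directly: newer Swift 
versions (5 and later) expose the general category in the standard library as 
Unicode.Scalar.Properties.generalCategory. The function name categoryCounts 
and the sample string are made up for illustration, and grouping the five 
letter categories together is my assumption, not something ICU prescribes.

```swift
// Sketch: count letters and nonspacing marks per Unicode scalar, using the
// standard library's general category instead of ICU's u_charType.
// `categoryCounts` is a hypothetical helper name.
func categoryCounts(of text: String) -> (letters: Int, nonspacingMarks: Int, other: Int) {
    var letters = 0, marks = 0, other = 0
    for scalar in text.unicodeScalars {
        switch scalar.properties.generalCategory {
        case .uppercaseLetter, .lowercaseLetter, .titlecaseLetter,
             .modifierLetter, .otherLetter:
            letters += 1            // counterpart of U_*_LETTER
        case .nonspacingMark:
            marks += 1              // counterpart of U_NON_SPACING_MARK
        default:
            other += 1              // punctuation, symbols, digits, ...
        }
    }
    return (letters, marks, other)
}

// "e" + combining acute, "t", precomposed "é", "!"
let counts = categoryCounts(of: "e\u{0301}t\u{00E9}!")
print(counts)
```

Note that the decomposed "e" + U+0301 yields one letter plus one mark, while 
the precomposed "é" is a single letter, so ignoring the marks keeps both 
spellings at one letter each.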

NSString has a compare:options: method (Swift String probably has something 
similar). With the options NSCaseInsensitiveSearch and 
NSDiacriticInsensitiveSearch it treats ‘E’, ‘e’ and è, é, Ĕ etc. as equal.
That is: compare each character (or unicodeScalar) to a, e, i, o, u with 
these options.

import Foundation

func isVowel( _ char: Character ) -> Bool
{
        let s1 = "\(char)"
        let s2 = s1 as NSString
        let opt: NSString.CompareOptions = [.diacriticInsensitive, .caseInsensitive]

        //      no idea how to do this with Strings:
        if s2.compare("a", options: opt) == .orderedSame {return true}
        if s2.compare("e", options: opt) == .orderedSame {return true}
        //      … (likewise for "i", "o", "u")
        return false
}

let str = "HaÁÅǺáXeëẽêèâàZ"

for char in str         // in Swift 3: str.characters
{
        let vowel = isVowel( char )
        print("\(char) is \(vowel ? "vowel" : "consonant")")
}
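
For what it is worth, here is one way this might be done with a plain Swift 
String, assuming Foundation is available: folding(options:locale:) strips 
case and diacritics, and the result can be compared against the five base 
vowels. The name isVowelFolded is made up; this is a sketch, not a 
linguistically complete test.

```swift
import Foundation

// Sketch: fold away case and diacritics, then compare with the base vowels.
// `isVowelFolded` is a hypothetical name; only a, e, i, o, u are considered.
func isVowelFolded(_ char: Character) -> Bool {
    let folded = String(char).folding(
        options: [.diacriticInsensitive, .caseInsensitive],
        locale: nil)
    return ["a", "e", "i", "o", "u"].contains(folded)
}

for char in "HaÁXeëZ" {
    print("\(char) is \(isVowelFolded(char) ? "vowel" : "consonant")")
}
```

Anything that is not one of the five vowels after folding is reported as a 
consonant here, including digits and punctuation, so a real version would 
first check that the character is a letter.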


If you really want to handle Thai, then do NOT ignore U_NON_SPACING_MARKs, 
because some vowels are classified as such.
U+0E01 … U+0E2E are consonants, U+0E30 … U+0E39 and U+0E40 … U+0E44 are vowels.
But then: ‘อ’ is sometimes a (silent) consonant (อยาก), sometimes a vowel (บอ), 
sometimes part of a vowel (มือ), sometimes part of a diphthong (เบื่อ).
Similar for ย: normal consonant (ยาก), part of vowel (ไทย) or diphthong (เมีย).
In the last case (เมีย) only ม is a consonant; the rest is one single diphthong, and 
ี is a U_NON_SPACING_MARK which really is a vowel.
Oh, and don't forget the ligatures ฤ, ฤๅ, ฦ, ฦๅ. These are both a consonant and 
a vowel. Same for ำ: not a ligature but a vowel + consonant.
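
To make the code point ranges above concrete, a naive per-scalar classifier 
could look like the sketch below. ThaiClass and thaiClass(of:) are made-up 
names, and the classifier deliberately knows nothing about context, so it 
gets exactly the cases described above wrong.

```swift
// Sketch: classify Thai scalars purely by the code point ranges given above.
// `ThaiClass` and `thaiClass(of:)` are hypothetical names for illustration.
enum ThaiClass { case consonant, vowel, other }

func thaiClass(of scalar: Unicode.Scalar) -> ThaiClass {
    switch scalar.value {
    case 0x0E01...0x0E2E:
        return .consonant           // includes อ, ย and the ligatures ฤ, ฦ
    case 0x0E30...0x0E39, 0x0E40...0x0E44:
        return .vowel               // includes nonspacing marks such as U+0E35
    default:
        return .other               // tone marks, digits, non-Thai, ...
    }
}

// เมีย spelled out as scalars: U+0E40, U+0E21, U+0E35, U+0E22
let word = "\u{0E40}\u{0E21}\u{0E35}\u{0E22}"
let classes = word.unicodeScalars.map { thaiClass(of: $0) }
print(classes)
```

For เมีย this reports ย as a consonant, although it is really part of the 
diphthong. Getting that right needs knowledge of Thai spelling rules, not 
just character categories.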


But to talk about German:
What about diphthongs? “neu” has one consonant + one vowel sound (but 2 vowel 
characters).
What if some silly users don’t know how to type umlauts and write “ueber” 
(instead of the correct “über”)? That “ue” is really one vowel (+diaeresis).
But beware: “aktuell” is definitely not a misspelling of “aktüll”; its “ue” 
really is two vowels.

Gerriet.
