On Mon, Dec 09, 2013 at 01:28:46PM +0100, Keith J. Schultz wrote:
> Hi Khaled,
>
> I would agree with you if the text was not encoded in unicode!
> A properly encoded utf-8 string should contain everything you need!

No, it doesn’t. Otherwise, please prove me wrong and tell me how you can
programmatically identify the language of this paragraph using Unicode
properties.

> Unfortunately, for efficiency reasons, utf-8 strings are not properly
> encoded and programs assume a particular language, to save space.
> In multi-language environments methods are used for efficiency to make
> sure the system uses the correct language!
>
> It is not the fault of utf-8, but the way it is implemented.

Encoding has nothing to do with language identification; you can always
convert text to Unicode prior to processing it.

> As far as the methods you point to, they are for identify texts of unknown
> origine and possibly of unknown encoding or an encoding that already has not
> identified
> the language.

If the language of the text is already known (i.e. properly tagged text),
we don’t need to identify it.

> On 09.12.2013 at 10:38, Khaled Hosny <[email protected]> wrote:
>
> > On Mon, Dec 09, 2013 at 09:22:10AM +0100, Keith J. Schultz wrote:
> >> Hi Khaled,
> >>
> >> your question can not be serious!
> >
> > No, it is.
> >
> >> It is pretty much in the standard!
> >
> > No.
> >
> >> True enough that for most western languages american, english, spanish,
> >> german, austrian, etc. this is somewhat difficult. Yet, these are not
> >> causing the problems.
> >
> > You can’t identify the language of a Unicode string just by examining
> > the Unicode properties for the characters in that string, simply because
> > such Unicode property does not exist. Language identifications involves
> > quite some statistical analysis[1]. You can identify scripts using
> > Unicode properties quite reliably, though.
> >
> > 1.
> > https://en.wikipedia.org/wiki/Language_identification#Statistical_approaches
> >
> > Regards,
> > Khaled
> [snip, snip]
>
>
> --------------------------------------------------
> Subscriptions, Archive, and List information, etc.:
> http://tug.org/mailman/listinfo/xetex
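
To illustrate the point made in the thread, that character properties can tell you the script but not the language, here is a minimal Python sketch. It rests on an assumption: Python's standard `unicodedata` module does not expose the Unicode Script property, so the first word of each character's Unicode name is used as a rough stand-in for it. The function name `rough_scripts` and the sample strings are made up for this example.

```python
import unicodedata
from collections import Counter


def rough_scripts(text):
    """Count letters per script.

    Assumption: the first word of a character's Unicode name (e.g. "LATIN",
    "ARABIC") is used as a rough proxy for the Script property, which the
    standard unicodedata module does not provide directly.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


english = "The quick brown fox"
german = "Der schnelle braune Fuchs"
# "marhaba" written in Arabic letters, given as escapes to avoid bidi issues
arabic = "\u0645\u0631\u062d\u0628\u0627"

for label, sample in [("english", english), ("german", german), ("arabic", arabic)]:
    print(label, dict(rough_scripts(sample)))
```

The English and German samples both come out as Latin-script text and nothing more; the character data alone gives no way to tell them apart. Separating them would need the kind of statistical analysis mentioned in the quoted message, or explicit language tagging.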
