RE: UNICODE character identification

2015-02-10 Thread Kool,Wouter
What encoding is your data in? UTF-8? A single-byte encoding? MARC-8? That information matters a lot in determining whether your idea would work. If it is in a single-byte encoding there is often no way to determine the script the character belongs to. Wouter Kool Metadata Specialist · OCLC B.V.

Re: UNICODE character identification

2015-02-10 Thread George Milten
Yes, this is probably where I was heading too, but I thought there might be a more clever way. Also, is there a good Perl normaliser? I have no experience with: http://search.cpan.org/~sadahiro/Unicode-Normalize-1.18/Normalize.pm For starters, if I could spot only the odd letters between Latin
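For reference, a minimal sketch of what Unicode::Normalize does, assuming the module is installed (it ships with core Perl). The sample strings are illustrative only. Note that normalization composes or decomposes accented characters; it does not map a Greek omicron onto a Latin "o", so on its own it would not repair the typo described in this thread.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# 'e' followed by COMBINING ACUTE ACCENT (two code points)
my $decomposed = "e\x{301}";

# NFC composes the pair into the single code point U+00E9
my $composed = NFC($decomposed);
printf "NFC: %d code point(s), first is U+%04X\n", length($composed), ord($composed);

# NFD splits it back into base letter plus combining mark
printf "NFD: %d code point(s)\n", length(NFD($composed));

# GREEK SMALL LETTER OMICRON has no decomposition, so it is left unchanged
my $omicron = "\x{3BF}";
print "omicron unchanged by NFC\n" if NFC($omicron) eq $omicron;
```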

Re: UNICODE character identification

2015-02-10 Thread George Milten
Looks good, though I guess it is a deprecated module. Thank you for the info; I will look further into the machine learning approach, but I guess my use case is simpler: check whether a character belongs to a certain set (= language), and see whether it is odd, based on the language of the word
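The simpler use case described above could be sketched roughly as follows. This is only one possible approach, not something proposed in the thread: it uses a word's most frequent script as a stand-in for its language and flags letters from any other script. `odd_letters` is a hypothetical helper name; `charscript` comes from the core module Unicode::UCD.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Return the dominant script of $word, followed by [char, script] pairs
# for every letter that belongs to a different script.
sub odd_letters {
    my ($word) = @_;
    my (%count, @letters);
    for my $ch (split //, $word) {
        next unless $ch =~ /\p{L}/;    # ignore digits, punctuation, spaces
        my $script = charscript(ord($ch)) // 'Unknown';
        $count{$script}++;
        push @letters, [$ch, $script];
    }
    return (undef) unless @letters;

    # Most frequent script wins; ties are broken alphabetically (an arbitrary choice).
    my ($dominant) = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    my @odd = grep { $_->[1] ne $dominant } @letters;
    return ($dominant, @odd);
}

# "Plato" with the final letter typed as GREEK SMALL LETTER OMICRON
my ($script, @odd) = odd_letters("Plat\x{3BF}");
printf "dominant script: %s\n", $script;
printf "odd letter U+%04X (%s)\n", ord($_->[0]), $_->[1] for @odd;
```

A majority vote will misfire on very short words or on words that are legitimately mixed, so the result is better treated as a list of candidates for review than as a list of confirmed errors.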

Re: UNICODE character identification

2015-02-10 Thread George Milten
UTF-8, thank you.

2015-02-10 16:54 GMT+02:00 Kool,Wouter wouter.k...@oclc.org:
> What encoding is your data in? utf8? Single-byte encoding? Marc8? That information matters a lot to determine whether your idea would work. If it is in a single-byte encoding there is often no way to determine the
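With UTF-8 data, one practical detail is that Perl needs the raw bytes decoded into characters before Unicode properties such as `\p{Greek}` can be applied. A minimal sketch, assuming the record arrives as a byte string (the sample value is made up for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 octets as they might be read from a file: "Plat" + 0xCE 0xBF (Greek omicron)
my $bytes = "Plat\xCE\xBF";

# Decode into a character string; only now do \p{...} properties see real characters
my $text = decode('UTF-8', $bytes);

printf "%d bytes, %d characters\n", length($bytes), length($text);
print "contains a Greek letter\n" if $text =~ /\p{Greek}/;
```

When reading from a file, opening the handle with the `:encoding(UTF-8)` layer performs the same decoding step.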

RE: UNICODE character identification

2015-02-10 Thread Kool,Wouter
Apologies, I missed the subject line... Then you might use the regex character classes. For instance, $text =~ m/\p{Hiragana}/; matches any Japanese Hiragana character. I have not tested it, but I suppose /[^\p{Latin}]/ would match any non-Latin character. So you find the character class that
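A small sketch of the character-class idea, under the assumption that the input is already a decoded character string. `non_latin_letters` is a hypothetical helper name. One caveat: a bare `/[^\p{Latin}]/` also matches spaces, digits and punctuation, since those are not Latin-script characters either, so this version restricts the check to letters with `\p{L}`.

```perl
use strict;
use warnings;

# Return every letter in $text that is not in the Latin script.
# \p{L} limits the check to letters, so spaces, digits and punctuation are not reported.
sub non_latin_letters {
    my ($text) = @_;
    return grep { /\p{L}/ && !/\p{Latin}/ } split //, $text;
}

# "Plato" with the final letter typed as GREEK SMALL LETTER OMICRON
my $word = "Plat\x{3BF}";
for my $ch (non_latin_letters($word)) {
    printf "suspicious letter U+%04X at position %d\n", ord($ch), index($word, $ch);
}
```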

UNICODE character identification

2015-02-10 Thread George Milten
Hello friendly folks, here is what I am trying to do, and I am looking for your help to find the cleverest way to achieve it: we have records that include typos like this: take a word, say Plato, where the last o was typed with the keyboard set to Greek, so we need