What encoding is your data in? UTF-8? A single-byte encoding? MARC-8? That
information matters a lot for determining whether your idea would work. If it
is in a single-byte encoding, there is often no way to determine the script a
character belongs to.
Wouter Kool
Metadata Specialist · OCLC B.V.
Yes, probably this is where I was also heading, but I thought there might be a
more clever way. Also, is there a good Perl normaliser? I have not had any
experience with:
http://search.cpan.org/~sadahiro/Unicode-Normalize-1.18/Normalize.pm
For starters, it would be enough if I could spot only the odd letters mixed in among the Latin ones.
It looks good, though I guess it is a deprecated module.
Thank you for the info, though; I will further investigate the machine
learning approach, but I guess my use case is simpler: check whether a
character belongs to a certain set (i.e. a language), and flag it as odd based
on the language of the word.
UTF-8, thank you!
2015-02-10 16:54 GMT+02:00 Kool,Wouter wouter.k...@oclc.org:
Apologies, I missed the subject line...
Then you might use the regex character classes. For instance, $text =~
m/\p{Hiragana}/ matches any Japanese Hiragana character. I have not tested
it, but I suppose /[^\p{Latin}]/ would match any non-Latin character. So you
find the character class that fits the script you expect and flag anything
outside it.
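The \p{Latin} idea above can be sketched in Python using only the standard library (Python's built-in `re` has no \p{...} script classes, so this uses the Unicode character *name* prefix as a rough stand-in for the script property; `odd_script_chars` is a hypothetical helper name):

```python
import unicodedata

def odd_script_chars(word, expected_script="LATIN"):
    """Return the letters in `word` whose Unicode character name does
    not begin with the expected script name (e.g. 'LATIN')."""
    return [ch for ch in word
            if ch.isalpha()
            and not unicodedata.name(ch, "").startswith(expected_script)]

# 'Plato' typed with a Greek omicron (U+03BF) as the final letter
print(odd_script_chars("Plat\u03bf"))  # the omicron is flagged
print(odd_script_chars("Plato"))       # clean Latin word: []
```

This is only a heuristic sketch: the name prefix happens to match the script for common cases like Latin and Greek, but the real Unicode Script property (as exposed by Perl's \p{Latin}) is the authoritative check.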
Hello friendly folks,
below follows what I am trying to do, and I am looking for your help to
find the most clever way to achieve it:
We have records that include typos like this: in a word, say Plato, the last
o was typed with the keyboard set to the Greek language, so we need a way to
detect such characters.