Re: UNICODE character identification
Looks good, though I guess it is a deprecated module. Thank you for the info; I will investigate the machine learning approach further. I guess my use case is simpler, though: check whether a character belongs to a certain set (= language), and see whether it is odd given the language of the word.

2015-02-10 17:17 GMT+02:00 Kool,Wouter:
> You might also take a machine learning approach, like Naïve Bayesian
> Classification. For instance:
> http://search.cpan.org/~kwilliams/Algorithm-NaiveBayes-0.04/lib/Algorithm/NaiveBayes.pm
> You build test sets from records in various scripts and use the classifier
> to find hybrid cases. I have had quite satisfactory results with this
> approach in a slightly different use case.
RE: UNICODE character identification
You might also take a machine learning approach, like Naïve Bayesian Classification. For instance:
http://search.cpan.org/~kwilliams/Algorithm-NaiveBayes-0.04/lib/Algorithm/NaiveBayes.pm
You build test sets from records in various scripts and use the classifier to find hybrid cases. I have had quite satisfactory results with this approach in a slightly different use case.

From: George Milten [mailto:george.mil...@gmail.com]
Sent: Tuesday, 10 February 2015 16:09
To: Kool,Wouter
Cc: perl4lib@perl.org
Subject: Re: UNICODE character identification

Yes, this is probably where I was heading too, but I thought there might be a cleverer way. Also, is there a good Perl normaliser? I have no experience with:
http://search.cpan.org/~sadahiro/Unicode-Normalize-1.18/Normalize.pm
For starters, if I could spot just the odd letters between the Latin and Greek regex character classes, I would be more than happy.
Re: UNICODE character identification
Yes, this is probably where I was heading too, but I thought there might be a cleverer way. Also, is there a good Perl normaliser? I have no experience with:
http://search.cpan.org/~sadahiro/Unicode-Normalize-1.18/Normalize.pm
For starters, if I could spot just the odd letters between the Latin and Greek regex character classes, I would be more than happy.

2015-02-10 17:04 GMT+02:00 Kool,Wouter:
> Apologies, I missed the subject line...
>
> Then you might use the regex character classes. For instance, $text =~
> m/\p{Hiragana}/; matches any Japanese Hiragana character. I have not tested
> it, but I suppose /[^\p{Latin}]/ would match any non-Latin character. So you
> find the character class that most characters match and look for the
> exceptions. Would that help?
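On the normaliser question, a minimal sketch of what Unicode::Normalize does, assuming a Perl installation where the module is available (it is bundled with Perl itself). The sample strings are made up for illustration. It shows that normalisation unifies different spellings of the same letter but does not turn a Greek letter into its Latin look-alike, so it complements a script check rather than replacing one.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD NFKC);

# Greek alpha with tonos, precomposed (U+03AC) ...
my $precomposed = "\x{03AC}";
# ... and the same letter spelled as alpha (U+03B1) + combining acute (U+0301).
my $decomposed  = "\x{03B1}\x{0301}";

# The two spellings differ code point by code point,
# but compare equal once both are normalised to the same form.
print "raw equal:  ", ($precomposed eq $decomposed ? "yes" : "no"), "\n";
print "NFC equal:  ", (NFC($precomposed) eq NFC($decomposed) ? "yes" : "no"), "\n";
print "NFD length: ", length(NFD($precomposed)), "\n";

# Normalisation does not cross scripts: Greek omicron stays Greek omicron,
# even under the compatibility form NFKC.
printf "NFKC(U+03BF) = U+%04X\n", ord(NFKC("\x{03BF}"));
```

Normalising to one form (for example NFC) before counting characters per script keeps combining accents from being counted as separate characters.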
RE: UNICODE character identification
Apologies, I missed the subject line...

Then you might use the regex character classes. For instance, $text =~ m/\p{Hiragana}/; matches any Japanese Hiragana character. I have not tested it, but I suppose /[^\p{Latin}]/ would match any non-Latin character. So you find the character class that most characters match and look for the exceptions. Would that help?

From: George Milten [mailto:george.mil...@gmail.com]
Sent: Tuesday, 10 February 2015 15:56
To: Kool,Wouter
Cc: perl4lib@perl.org
Subject: Re: UNICODE character identification

UTF-8, thank you
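The "find the majority class, then look for the exceptions" idea can be sketched as follows for the Latin/Greek case. This is only an illustration: the function name odd_chars is hypothetical, and it assumes the text has already been decoded into Perl characters. Note that /[^\p{Latin}]/ on its own also matches spaces, digits and punctuation, so the sketch counts characters per script instead of negating one class.

```perl
use strict;
use warnings;

# Return the characters of $word that fall outside its majority script.
# Only the Latin and Greek scripts are considered; spaces, digits and
# punctuation match neither class and are therefore ignored.
sub odd_chars {
    my ($word) = @_;
    my $latin = () = $word =~ /\p{Latin}/g;    # number of Latin-script characters
    my $greek = () = $word =~ /\p{Greek}/g;    # number of Greek-script characters
    return () if $latin == 0 || $greek == 0;   # word uses a single script
    return () if $latin == $greek;             # tie: no majority to rely on
    return $latin > $greek
        ? $word =~ /(\p{Greek})/g              # Greek characters are the exceptions
        : $word =~ /(\p{Latin})/g;             # Latin characters are the exceptions
}

# "Plato" typed with a Greek omicron (U+03BF) as its last letter.
my $word = "Plat\x{03BF}";
printf "U+%04X\n", ord($_) for odd_chars($word);   # prints U+03BF
```

Skipping ties is a deliberate choice: in a two-letter word with one Latin and one Greek character there is no way to tell which one is the typo.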
Re: UNICODE character identification
UTF-8, thank you

2015-02-10 16:54 GMT+02:00 Kool,Wouter:
> What encoding is your data in? UTF-8? A single-byte encoding? MARC-8? That
> information matters a lot in determining whether your idea would work. If it
> is in a single-byte encoding, there is often no way to determine the script
> a character belongs to.
RE: UNICODE character identification
What encoding is your data in? UTF-8? A single-byte encoding? MARC-8? That information matters a lot in determining whether your idea would work. If it is in a single-byte encoding, there is often no way to determine the script a character belongs to.

Wouter Kool
Metadata Specialist · OCLC B.V.
Schipholweg 99 · P.O. Box 876 · 2300 AW Leiden · The Netherlands
t +31-(0)71-524 6500
wouter.k...@oclc.org · www.oclc.org

From: George Milten [mailto:george.mil...@gmail.com]
Sent: Tuesday, 10 February 2015 13:27
To: perl4lib@perl.org
Subject: UNICODE character identification

Hello friendly folks,

Here is what I am trying to do; I am looking for your help to find the cleverest way to achieve it.

We have records that include typos like this: take a word, say Plato, where the last o was typed with the keyboard set to Greek. So we need something that would parse all metadata character by character, determine the script that the majority of the characters in each word belong to, and return the odd characters, the script they belong to, and the identifier of the record they were found in, so that we can correct them.

Thank you in advance
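To illustrate why the encoding question matters in Perl, here is a small sketch using the core Encode module, with a made-up byte string. It assumes the input really is UTF-8. Until the bytes are decoded into characters, each byte is treated as its own code point and the Greek letter cannot be recognised.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for "Plat" followed by GREEK SMALL LETTER OMICRON
# (U+03BF is encoded as the two bytes CE BF).
my $bytes = "Plat\xCE\xBF";

# Undecoded: six "characters", none of them Greek.
printf "bytes: length %d, Greek found: %s\n",
    length($bytes), ($bytes =~ /\p{Greek}/ ? "yes" : "no");

# Decode the bytes into Perl characters before testing scripts.
my $chars = decode('UTF-8', $bytes);

# Decoded: five characters, the last one Greek.
printf "chars: length %d, Greek found: %s\n",
    length($chars), ($chars =~ /\p{Greek}/ ? "yes" : "no");
```

When reading from a file, opening it with an encoding layer such as open(my $fh, '<:encoding(UTF-8)', $path) performs the same decoding on input.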
UNICODE character identification
Hello friendly folks,

Here is what I am trying to do; I am looking for your help to find the cleverest way to achieve it.

We have records that include typos like this: take a word, say Plato, where the last o was typed with the keyboard set to Greek. So we need something that would parse all metadata character by character, determine the script that the majority of the characters in each word belong to, and return the odd characters, the script they belong to, and the identifier of the record they were found in, so that we can correct them.

Thank you in advance
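One possible shape for such a report, sketched under several assumptions: the records are held in a hash keyed by record identifier (the %records data and the function name find_odd_chars are invented for illustration), the text is already decoded from UTF-8, and words are separated by whitespace. It uses charscript() from the core Unicode::UCD module to name each letter's script, so it is not limited to Latin and Greek.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Hypothetical input: record identifier => already-decoded metadata text.
my %records = (
    'rec-001' => "Plat\x{03BF} Republic",   # last letter of "Plato" is Greek omicron
    'rec-002' => "Aristotle Politics",
);

# For every word, find the script most of its letters belong to and
# return the letters from any other script as [id, word, char, script].
sub find_odd_chars {
    my ($id, $text) = @_;
    my @found;
    for my $word (split /\s+/, $text) {
        my (%count, @letters);
        for my $ch (grep { /\p{L}/ } split //, $word) {   # letters only
            my $script = charscript(ord($ch));
            $count{$script}++;
            push @letters, [$ch, $script];
        }
        next if scalar(keys %count) < 2;                  # single-script word
        my ($first, $second) =
            sort { $count{$b} <=> $count{$a} } keys %count;
        next if $count{$first} == $count{$second};        # no clear majority
        for my $letter (@letters) {
            my ($ch, $script) = @$letter;
            push @found, [$id, $word, $ch, $script] if $script ne $first;
        }
    }
    return @found;
}

# Report: record identifier, code point of the odd character, its script.
for my $id (sort keys %records) {
    for my $hit (find_odd_chars($id, $records{$id})) {
        my ($rec, $word, $ch, $script) = @$hit;
        printf "%s\tU+%04X\t%s\n", $rec, ord($ch), $script;
    }
}
```

With the sample data only rec-001 is reported, with U+03BF and script Greek. Printing the code point rather than the character itself avoids needing an output encoding layer in this sketch.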