RE: UNICODE character identification

2015-02-10 Thread Kool,Wouter
What encoding is your data in? UTF-8? A single-byte encoding? MARC-8? That information matters a lot in determining whether your idea would work. If it is in a single-byte encoding, there is often no way to determine the script a character belongs to. Wouter Kool Metadata Specialist · OCLC B.V. Sch
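
To illustrate the point about single-byte encodings, here is a minimal Python sketch (not from the thread; the function name `decode_candidates` and the choice of candidate encodings are assumptions made for illustration). It decodes one and the same byte under three single-byte encodings and reports the Unicode character name each one yields, which shows why the byte alone does not reveal the script.

```python
import unicodedata

def decode_candidates(raw, encodings=("latin-1", "iso8859_7", "iso8859_5")):
    """Decode a single byte under several single-byte encodings.

    Returns a dict mapping each encoding name to the Unicode name of the
    character that the byte represents in that encoding.
    """
    return {enc: unicodedata.name(raw.decode(enc)) for enc in encodings}

if __name__ == "__main__":
    # The same byte value is a valid letter in all three encodings.
    for enc, name in decode_candidates(b"\xe1").items():
        print(f"{enc:10s} {name}")
```

Without knowing the encoding in advance, nothing in the byte itself says which of these readings was intended.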

RE: UNICODE character identification

2015-02-10 Thread Kool,Wouter
most characters match and you look for the exceptions. Would that help? From: George Milten [mailto:george.mil...@gmail.com] Sent: Tuesday, 10 February 2015 15:56 To: Kool,Wouter Cc: perl4lib@perl.org Subject: Re: UNICODE character identification utf-8, thank you 2015-02-10 16:54 GMT+02:00
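
One way to read "most characters match and you look for the exceptions" for UTF-8 data is sketched below in Python. This is an assumption about the idea, not the code used in the thread: the helper names `script_of` and `exceptions` are hypothetical, and the script is only approximated by the first word of the Unicode character name (for example `LATIN`, `GREEK`, `CYRILLIC`), which is a heuristic rather than the official Unicode Script property.

```python
import unicodedata
from collections import Counter

def script_of(ch):
    """Approximate a character's script by the first word of its Unicode name."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        # Unassigned code points and most control characters have no name.
        return "UNKNOWN"

def exceptions(text):
    """Return (index, character, script) for letters outside the dominant script."""
    letters = [(i, ch, script_of(ch)) for i, ch in enumerate(text) if ch.isalpha()]
    if not letters:
        return []
    dominant = Counter(script for _, _, script in letters).most_common(1)[0][0]
    return [item for item in letters if item[2] != dominant]

if __name__ == "__main__":
    # The "a" in "and" is replaced by a Cyrillic look-alike (U+0430).
    for index, ch, script in exceptions("Perl \u0430nd MARC"):
        print(index, repr(ch), script)
```

The input must already be decoded text; for a file, that means reading it with the UTF-8 encoding first.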

RE: UNICODE character identification

2015-02-10 Thread Kool,Wouter
quite satisfactory results with this approach in a slightly different use case. From: George Milten [mailto:george.mil...@gmail.com] Sent: Tuesday, 10 February 2015 16:09 To: Kool,Wouter Cc: perl4lib@perl.org Subject: Re: UNICODE character identification yes probably this is where i was also

RE: Options for translating languages within perl scripts

2015-02-27 Thread Kool,Wouter
Perhaps the WWW::Babelfish module helps. It seems to support connecting to the Google and Yahoo services as well. I haven't tried it, but it seems interesting: http://search.cpan.org/~durist/WWW-Babelfish-0.16/Babelfish.pm Wouter From: Eileen Pinto [mailto:epi...@library.berkeley.edu] Sent: Thursday

RE: identify encoding from a file

2016-02-08 Thread Kool,Wouter
At OCLC we have had some good results detecting frequent encodings and recurring encoding problems using Naïve Bayesian classification. You need training data for the classes you want to detect. Language also comes into play, because the distribution of characters depends on it. No silve
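
As a rough illustration of the technique named above, here is a minimal Naïve Bayes classifier over byte frequencies, written in Python. It is a sketch under stated assumptions and not OCLC's implementation: the class name `ByteNaiveBayes`, the use of single bytes as features, add-one smoothing, and the tiny Greek training sample are all choices made for this example. Real use would need much more training data per class, and per language, as the message notes.

```python
import math
from collections import Counter

class ByteNaiveBayes:
    """Naive Bayes over single-byte frequencies with add-one smoothing."""

    def __init__(self):
        self.byte_counts = {}        # label -> Counter of byte values
        self.doc_counts = Counter()  # label -> number of training samples

    def train(self, label, data):
        """Add one training sample (a bytes object) for the given class label."""
        self.byte_counts.setdefault(label, Counter()).update(data)
        self.doc_counts[label] += 1

    def classify(self, data):
        """Return the label with the highest log-probability for the bytes."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.byte_counts.items():
            total_bytes = sum(counts.values())
            # Class prior plus the sum of smoothed per-byte log-likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            for byte in data:
                score += math.log((counts[byte] + 1) / (total_bytes + 256))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

if __name__ == "__main__":
    sample = "Καλημέρα κόσμε"  # illustrative training text only
    clf = ByteNaiveBayes()
    clf.train("utf-8", sample.encode("utf-8"))
    clf.train("iso-8859-7", sample.encode("iso8859_7"))
    print(clf.classify("μέρα".encode("utf-8")))
    print(clf.classify("μέρα".encode("iso8859_7")))
```

Because the byte distribution depends on the language of the text, a model trained on Greek samples like this one would say little about, for example, Cyrillic data in a different single-byte encoding.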