What encoding is your data in? UTF-8? A single-byte encoding? MARC-8? That
information matters a lot in determining whether your idea would work. If it is
in a single-byte encoding, there is often no way to determine which script a
character belongs to.
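For UTF-8 data, a rough per-character script guess might look like the sketch below. It is only an approximation: Python's standard library does not expose the Unicode Script property, so the first word of each character's Unicode name stands in for the script. The helper names `rough_script` and `script_counts` and the sample string are made up for illustration.

```python
import unicodedata
from collections import Counter


def rough_script(ch):
    """Approximate a character's script by the first word of its Unicode name."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        # Code points without a name (e.g. control characters).
        return "UNKNOWN"


def script_counts(text):
    """Count letters per rough script, skipping digits, punctuation and spaces."""
    return Counter(rough_script(ch) for ch in text if ch.isalpha())


# Hypothetical sample: a Greek word followed by a Latin one, stored as UTF-8 bytes.
raw = "\u0391\u03b8\u03ae\u03bd\u03b1 Athens".encode("utf-8")

# Decode first; the script can only be looked up on code points, not raw bytes.
counts = script_counts(raw.decode("utf-8"))
print(counts)
```

The decode step is the point: once the bytes are known to be UTF-8, each code point carries its own identity, which is exactly what a single-byte encoding cannot guarantee.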
Wouter Kool
Metadata Specialist · OCLC B.V.
Sch
most characters match and you look for the
exceptions. Would that help?
From: George Milten [mailto:george.mil...@gmail.com]
Sent: Tuesday 10 February 2015 15:56
To: Kool,Wouter
Cc: perl4lib@perl.org
Subject: Re: UNICODE character identification
utf-8,
thank you
2015-02-10 16:54 GMT+02:00
quite satisfactory results with this approach in a
slightly different use case.
From: George Milten [mailto:george.mil...@gmail.com]
Sent: Tuesday 10 February 2015 16:09
To: Kool,Wouter
Cc: perl4lib@perl.org
Subject: Re: UNICODE character identification
yes probably this is where i was also
Perhaps the WWW::Babelfish module helps. It seems to support connecting to the
Google and Yahoo services as well. I haven't tried it, but it seems interesting:
http://search.cpan.org/~durist/WWW-Babelfish-0.16/Babelfish.pm
Wouter
From: Eileen Pinto [mailto:epi...@library.berkeley.edu]
Sent: Thursday
At OCLC we have some good results detecting frequent encodings and recurring
encoding problems using Naïve Bayesian classification. You have to have
training data for the classes you want to detect. And language comes into play,
because the distribution of characters is dependent on it. No silve
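As a toy illustration of the general idea (not OCLC's implementation), a Naive Bayes classifier over byte frequencies could be sketched as follows. The class name `ByteNaiveBayes`, the two labels, and the handful of training strings are all assumptions made up for this example; real training data would need to be far larger and drawn from the languages actually present in the records.

```python
import math
from collections import Counter


class ByteNaiveBayes:
    """Multinomial Naive Bayes over byte values with add-one smoothing."""

    def __init__(self):
        self.byte_counts = {}        # label -> Counter of byte values seen
        self.doc_counts = Counter()  # label -> number of training samples

    def train(self, label, data):
        # Iterating over bytes yields integers 0..255, which Counter tallies.
        self.byte_counts.setdefault(label, Counter()).update(data)
        self.doc_counts[label] += 1

    def classify(self, data):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.byte_counts.items():
            total = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods per byte.
            score = math.log(self.doc_counts[label] / total_docs)
            for b in data:
                score += math.log((counts[b] + 1) / (total + 256))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training text, encoded once per class.
samples = [
    "caf\u00e9 na\u00efve r\u00e9sum\u00e9",
    "\u00fcber stra\u00dfe m\u00fcnchen",
    "se\u00f1or ni\u00f1o a\u00f1o",
]
clf = ByteNaiveBayes()
for s in samples:
    clf.train("utf-8", s.encode("utf-8"))
    clf.train("latin-1", s.encode("latin-1"))

unseen = "cr\u00e8me br\u00fbl\u00e9e"
guess_latin1 = clf.classify(unseen.encode("latin-1"))
guess_utf8 = clf.classify(unseen.encode("utf-8"))
print(guess_latin1, guess_utf8)
```

The training strings here are deliberately language-mixed; since character distributions depend on language, a classifier trained on one language's text may do worse on another's.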