On 2005.3.31, at 10:18 AM, Avi Rappoport wrote:
Hi old friends (and new),
I'm quite enjoying getting back to scripting, and like Perl a lot,
especially with Affrus. While I'm probably inefficient, it's nice to
have a language actually designed for text processing (search engine
logs, in my case). However, I've got some Unicode issues and that
seems to be platform-specific, so thought I'd ask here.
Have you done perldoc perlunicode and used that as a lullaby for
several afternoon naps in a row? Used the stuff referred to there for a
few more afternoon naps? (perldoc always seems to put me to sleep, but
if I don't open it up and stare at it in spite of the soporific effect,
nothing seeps in at all.) Have you gone to unicode.org and scanned what
they have to offer relevant to the character ranges (languages) you
need to be parsing? Have you looked up the traditional encodings for
your language/locale, particularly the Microsoft (bleaugh) code pages?
(Google or your other favorite search engines can help.)
I've done enough research to know that I should avoid hardcoded
counting with positions and use the perl functions which will
automatically handle utf8 characters properly. That's cool. I'm
pretty sure I'm reading in utf8 and comparisons seem to work.
Comparisons can seem to work when the encoding is all off, as long as
the input is being munged the same way in all inputs. That doesn't mean
it will work for all valid input, however.
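To make that concrete, here is a rough sketch of how undecoded input can compare "correctly" and still be wrong. It assumes Perl 5.8 or later with the core Encode module; the sample string is made up for illustration and is not from your logs.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 octets for "café", written out explicitly (illustrative sample).
my $bytes = "caf\xC3\xA9";

# Treated as raw octets, length counts bytes.
my $as_bytes = length $bytes;

# Decoded as UTF-8, length counts characters.
my $text     = decode('UTF-8', $bytes);
my $as_chars = length $text;

# Two undecoded copies still compare equal, which is why comparisons
# can appear to work while the encoding is off.
my $same_raw = ($bytes eq "caf\xC3\xA9") ? 1 : 0;

# The raw octets do not match the properly decoded string, though.
my $same_mixed = ($bytes eq $text) ? 1 : 0;

print "bytes=$as_bytes chars=$as_chars raw_eq=$same_raw mixed_eq=$same_mixed\n";
```

If the byte count and the character count differ for your own input, that is a hint the data is being handled as octets somewhere along the way.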
What I can't do is generate readable cross-platform output to show my
clients.
Nothing necessarily surprising there. It takes quite a bit of tuning
your brain to get the code right. (I speak from experience with
Japanese encodings. ;)
Even opening the output in BBEdit as UTF8 doesn't convert the codes
into properly rendered extended characters, and by the time it gets
into Excel on their Windows workstation, all hope is pretty much gone.
BBEdit, IIRC, handles some of the traditional encodings fairly well.
(Does quite well with the Japanese encodings, at any rate.) So if you
are opening UTF-8 and it isn't looking right, your output is probably
not UTF-8. If you check the options in the file opening dialogs, you
may find a way to convert from the actual encoding you're writing out.
And/or you should be able to adjust your perl, but we can't help you
with that unless we see some code and have some idea what
encoding/language/locale you're trying to write out.
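In the meantime, one way to check whether what you write out really is UTF-8 is sketched below. It assumes Perl 5.8 or later; the file name and sample word are hypothetical, and I am guessing that you want UTF-8 on output rather than one of the traditional encodings.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $file = 'report-utf8.txt';   # hypothetical output file name
my $word = "caf\x{E9}";         # sample character string, not from the logs

# Encode explicitly on the way out instead of printing internal strings as-is.
open(my $out, '>:encoding(UTF-8)', $file) or die "Cannot write $file: $!";
print {$out} "$word\n";
close($out) or die "Cannot close $file: $!";

# Read the file back as raw octets.
open(my $in, '<:raw', $file) or die "Cannot read $file: $!";
my $octets = do { local $/; <$in> };
close($in);

# A strict decode dies on malformed input, so eval tells us whether
# the octets are well-formed UTF-8. LEAVE_SRC keeps $octets untouched.
my $decoded = eval {
    decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
};
my $is_utf8 = defined $decoded ? 1 : 0;

print $is_utf8 ? "valid UTF-8\n" : "not valid UTF-8\n";
```

The same strict decode can be pointed at a file you already produced; if it fails, the output was written in some other encoding.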
Incidentally, in many of the traditional encodings, the basic Latin
will be in the same positions (same code points) as UTF-8 Unicode basic
Latin.
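A small sketch of that point, assuming the core Encode module and its names for these encodings; the two sample words are only for illustration. Plain ASCII comes out as the same octets in every encoding listed, while an accented letter does not.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $ascii  = "siemens";      # basic Latin only
my $accent = "caf\x{E9}";    # contains U+00E9

# A few encodings to compare (an arbitrary selection).
my @encodings = ('UTF-8', 'iso-8859-1', 'MacRoman', 'cp1252');

# Collect the distinct octet strings each word encodes to.
my (%ascii_forms, %accent_forms);
for my $enc (@encodings) {
    $ascii_forms{ encode($enc, $ascii) }   = 1;
    $accent_forms{ encode($enc, $accent) } = 1;
}

my $ascii_variants  = scalar keys %ascii_forms;
my $accent_variants = scalar keys %accent_forms;

print "distinct encodings of ASCII word:    $ascii_variants\n";
print "distinct encodings of accented word: $accent_variants\n";
```

This is why the plain English parts of a log can look fine while everything else is scrambled.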
The stuff that looks like HTML entities is fine when viewed in a
browser:
&#1575;&#1604;&#1578;&#1593;&#1575;&#1585;&#1601;
s&#305;emens
And if necessary, I can deliver in HTML.
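If you would rather have real characters than entities, something like the following might do as a starting point. It is a minimal sketch that assumes the log lines carry decimal numeric character references of the form &#NNNN; and that UTF-8 is what you want on output; the sample line is illustrative.

```perl
use strict;
use warnings;

my $line = 's&#305;emens';   # sample line with one decimal reference

# Replace each decimal reference with the character it names.
(my $text = $line) =~ s/&#(\d+);/chr($1)/ge;

# Encode on output so the wide character is written as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');
print "$text\n";
```

Hexadecimal references (&#x...;) and named entities are not handled here and would need their own treatment.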
But my logs have characters like this in them:
(from BBEdit as UTF8:)
atualizao
carreo
(from BBEdit as Mac Roman)
atualizao
torunn tmmervold
lschen
I can tell they mean something, but I can't figure out how to make
them readable. Help?
TIA,
Avi