Re: dealing with UTF8 text

2005-03-31 Thread Joel Rees
On 2005.3.31, at 10:18 AM, Avi Rappoport wrote:
> Hi old friends (and new),
> I'm quite enjoying getting back to scripting, and like Perl a lot,
> especially with Affrus.  While I'm probably inefficient, it's nice to
> have a language actually designed for text processing (search engine
> logs, in my case).  However, I've got some Unicode issues and that
> seems to be platform-specific, so thought I'd ask here.

Have you done perldoc perlunicode and used that as a lullaby for 
several afternoon naps in a row? Used the stuff referred to there for a 
few more afternoon naps? (perldoc always seems to put me to sleep, but 
if I don't open it up and stare at it in spite of the soporific effect, 
nothing seeps in at all.) Have you gone to unicode.org and scanned what 
they have to offer relevant to the character ranges (languages) you 
need to be parsing? Have you looked up the traditional encodings for 
your language/locale, particularly the Microsoft (bleaugh) code pages? 
(Google or your other favorite search engines can help.)

> I've done enough research to know that I should avoid hardcoded
> counting with positions and use the perl functions which will
> automatically handle utf8 characters properly.  That's cool.  I'm
> pretty sure I'm reading in utf8 and comparisons seem to work.

Comparisons can seem to work when the encoding is all off, as long as 
the input is being munged the same way in all inputs. That doesn't mean 
it will work for all valid input, however.
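
A minimal sketch of how that can happen, assuming the Encode module that ships with perl 5.8 (the sample word and variable names are only illustrations, not anything from your logs):

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for "löschen" (ö is the two bytes C3 B6), as they would
# arrive from a filehandle that has no encoding layer on it.
my $from_log   = "l\xC3\xB6schen";
my $from_query = "l\xC3\xB6schen";

# Both strings are undecoded in the same way, so eq compares the raw
# bytes and succeeds.
my $same = $from_log eq $from_query;

# But perl counts bytes, not characters, until the data is decoded.
my $raw_length     = length $from_log;                   # 8 bytes
my $decoded_length = length decode('UTF-8', $from_log);  # 7 characters

print "equal: ", ($same ? "yes" : "no"), "\n";
print "raw length: $raw_length, decoded length: $decoded_length\n";
```

The comparison passes either way; it is length, substr, regex character classes and output where the undecoded data starts to bite.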

> What I can't do is generate readable cross-platform output to show my
> clients.

Nothing necessarily surprising there. It takes quite a bit of tuning 
your brain to get the code right. (I speak from experience with 
Japanese encodings. ;)

> Even opening the output in BBEdit as UTF8 doesn't convert the codes
> into properly rendered extended characters, and by the time it gets
> into Excel on their Windows workstation, all hope is pretty much gone.

BBEdit, IIRC, handles some of the traditional encodings fairly well. 
(Does quite well with the Japanese encodings, at any rate.) So if you 
are opening UTF-8 and it isn't looking right, your output is probably 
not UTF-8. If you check the options in the file opening dialogs, you 
may find a way to convert from the actual encoding you're writing out. 
And/or you should be able to adjust your perl, but we can't help you 
with that unless we see some code and have some idea what 
encoding/language/locale you're trying to write out.
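
For what it's worth, here is a sketch of one way to check what is really being written, assuming perl 5.8 or later with PerlIO layers. The file name report.txt and the sample string are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical output file name, for illustration only.
my $path = 'report.txt';

# A character string in which ö is the single character U+00F6.
my $word = "l\x{F6}schen";

# The :encoding(UTF-8) layer turns characters into UTF-8 bytes on output.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} "$word\n";
close($out) or die "Cannot close $path: $!";

# Read the file back as raw bytes to see what actually landed on disk.
open(my $in, '<:raw', $path) or die "Cannot read $path: $!";
my $bytes = do { local $/; <$in> };
close($in);

# Dump the bytes in hex so the encoding can be checked by eye.
printf "%d bytes: %s\n", length($bytes),
    join(' ', map { sprintf '%02X', ord } split //, $bytes);
```

If the ö shows up as the two bytes C3 B6, the file is UTF-8 and the problem is in how it is being opened. If it shows up as something else, the problem is on the writing side.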

Incidentally, in many of the traditional encodings, the basic Latin 
will be in the same positions (same code points) as UTF-8 Unicode basic 
Latin.

> The stuff that looks like HTML entities is fine when viewed in a
> browser:
>
> &#1575;&#1604;&#1578;&#1593;&#1575;&#1585;&#1601;
> s&#305;emens
> And if necessary, I can deliver in HTML.
> But my logs have characters like this in them:
> (from BBEdit as UTF8:)
> ˆáˆáˆáˆáˆáˆáˆáˆáˆáˆáˆáˆáˆáˆâ ˆ‚ˆáˆ°ˆüˆì ˆ¶ˆèˆ¨ ˆáˆîˆ¶ˆùˆâ
> atualiza§£o
> carreo
> (from BBEdit as Mac Roman)
> É íáßÓ  Ô¯É
> atualizaˆÉ¬ßˆÉ¬£o
> torunn tømmervold
> löschen
> I can tell they mean something, but I can't figure out how to make
> them readable.  Help?
>
> TIA,
> Avi



dealing with UTF8 text

2005-03-30 Thread Avi Rappoport
Hi old friends (and new),
I'm quite enjoying getting back to scripting, and like Perl a lot, 
especially with Affrus.  While I'm probably inefficient, it's nice to 
have a language actually designed for text processing (search engine 
logs, in my case).  However, I've got some Unicode issues and that 
seems to be platform-specific, so thought I'd ask here.

I've done enough research to know that I should avoid hardcoded 
counting with positions and use the perl functions which will 
automatically handle utf8 characters properly.  That's cool.  I'm 
pretty sure I'm reading in utf8 and comparisons seem to work.

What I can't do is generate readable cross-platform output to show my 
clients.  Even opening the output in BBEdit as UTF8 doesn't convert 
the codes into properly rendered extended characters, and by the time 
it gets into Excel on their Windows workstation, all hope is pretty 
much gone.

The stuff that looks like HTML entities is fine when viewed in a browser:
&#1575;&#1604;&#1578;&#1593;&#1575;&#1585;&#1601;
s&#305;emens
And if necessary, I can deliver in HTML.
But my logs have characters like this in them:
(from BBEdit as UTF8:)
ˆáˆáˆáˆáˆáˆáˆáˆáˆáˆáˆáˆáˆáˆâ ˆ‚ˆáˆ°ˆüˆì ˆ¶ˆèˆ¨ ˆáˆîˆ¶ˆùˆâ
atualiza§£o
carreo
(from BBEdit as Mac Roman)
É íáßÓ  Ô¯É
atualizaˆÉ¬ßˆÉ¬£o
torunn tømmervold
löschen
I can tell they mean something, but I can't figure out how to make 
them readable.  Help?

TIA,
Avi

--
Complete Guide to Search Engines for Web Sites and Intranets
   http://www.searchtools.com