Managing non-ASCII filenames in Python

2009-07-19 Thread pdenize
I created the following filename in Windows just as a test:
“Dönåld’s™ Néphêws” deg°.txt
The quotes are non-ASCII, and it has many non-English characters, a
long hyphen, etc.

Now in DOS I can do a directory listing, and it translates them all to
something close:
Dönåld'sT Néphêws deg°.txt

I thought the correct way to do this in Python would be to scan the
directory:
files = os.listdir(os.path.dirname(os.path.realpath(__file__)))

then print the filenames
for filename in files:
  print filename

but, as expected, the filename is not correct, so I tried to correct it
using the file system's encoding:

  print filename.decode(sys.getfilesystemencoding())

But I get:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2014'
in position 6: character maps to <undefined>
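One possible way around this error is to encode the name for the console with errors="replace", so that unmappable characters degrade to '?' instead of raising. This is a minimal sketch written for Python 3 (the post itself uses Python 2 syntax). The helper name console_safe and the sample filename are hypothetical, and falling back to ASCII when stdout reports no encoding is an assumption.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # Assumption: fall back to ASCII when stdout does not report an encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # errors="replace" substitutes '?' for unmappable characters instead of
    # raising UnicodeEncodeError; decoding again yields printable text.
    return text.encode(encoding, errors="replace").decode(encoding)


# Hypothetical sample name containing an em dash (U+2014).
name = u"D\u00f6n\u00e5ld\u2019s \u2014 deg\u00b0.txt"
print(console_safe(name))
```

The result is lossy, so it only suits on-screen messages; the original name should still be used for file operations.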

All was working well until these characters came along.

I need to be able to write (a representation of) the name to the
screen (and I don't see why I should not get something as good as DOS
shows), write it to an XML file in UTF-8, and write it to a text file
and be able to read it back in.
Again, I was surprised that this was also difficult: it appears that
the file also wanted ASCII.  Should I have to open the file in binary
mode for writing (I expect so), and if so, what encoding should I
write in?
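A sketch of one way to do the text-file round trip: open the file in text mode with an explicit encoding rather than in binary mode. It uses io.open, which is available in Python 2.6+ and Python 3. UTF-8 is an assumed choice here, and the temporary path and sample name are hypothetical.

```python
import io
import os
import tempfile

# Hypothetical sample data: the test filename from the post.
names = [u"\u201cD\u00f6n\u00e5ld\u2019s\u2122 N\u00e9ph\u00eaws\u201d deg\u00b0.txt"]

# Hypothetical output location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "names.txt")

# Text mode with an explicit encoding: io.open encodes to UTF-8 on write.
with io.open(path, "w", encoding="utf-8") as out:
    for name in names:
        out.write(name + u"\n")

# Reading with the same encoding recovers the original text.
with io.open(path, "r", encoding="utf-8") as src:
    read_back = [line.rstrip(u"\n") for line in src]
```

UTF-8 can represent every Unicode character, so no filename should fail to encode on write.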

I have been beating myself up over this for weeks: I get it working,
then come across some other character that causes it all to stop
again.
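One thing that may reduce these surprises is to ask for text names in the first place. In Python 2, os.listdir returns unicode names when it is given a unicode path, and in Python 3 it returns text for any text path, so no manual decode step is needed. The helper name list_text_names is hypothetical, and listing the current directory is only for illustration.

```python
import os


def list_text_names(directory=u"."):
    """Return the directory's entries as text strings, sorted."""
    # Passing a text (unicode) path makes os.listdir return text names,
    # so there is no filename.decode(...) step to get wrong.
    return sorted(os.listdir(directory))


names = list_text_names()
```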

Please help.
-- 
http://mail.python.org/mailman/listinfo/python-list


Help needed with filenames

2009-05-10 Thread pdenize
I have a program that reads files using glob and puts them into an XML
file in UTF-8 using
  unicode(file, sys.getfilesystemencoding()).encode('UTF-8')
This all works fine, including all the odd characters like accents.
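For comparison, a minimal Python 3 sketch of writing names to UTF-8 XML with the standard library's xml.etree.ElementTree, which handles both encoding and escaping. The element names "files" and "file", the function name names_to_xml, and the sample filename are hypothetical; the post does not say how its XML is built.

```python
import io
import xml.etree.ElementTree as ET


def names_to_xml(names):
    """Serialise a list of text filenames to UTF-8 encoded XML bytes."""
    root = ET.Element("files")  # hypothetical element names
    for name in names:
        ET.SubElement(root, "file").text = name
    buf = io.BytesIO()
    # encoding="utf-8" makes ElementTree encode the text itself and
    # declare the encoding in the XML header.
    ET.ElementTree(root).write(buf, encoding="utf-8", xml_declaration=True)
    return buf.getvalue()


# Hypothetical sample name containing an en dash (U+2013).
sample = u"N\u00e9ph\u00eaws \u2013 deg\u00b0.txt"
xml_bytes = names_to_xml([sample])
```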

However I also print what it is doing and someone pointed out that
many characters are not printing correctly in the Windows command
window.

I have tried to figure this out but simply get lost in the translation
stuff.
If I just use print filename, it shows characters that don't match the
ones in the filename (I sort of expected that).
So I tried print unicode(file, sys.getfilesystemencoding()), expecting
the correct result, but no:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013'

I did notice that when a Windows command window does a directory
listing of these files, the characters seem to be translated into close
approximations (long dash to minus, special double quotes to simple
double quotes), while many of the accented characters are retained.  I
looked at translate to do this but did not know how to determine which
characters to map.
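A rough sketch of the translate approach described above, in Python 3: map a few typographic characters to close ASCII approximations, then replace anything the console code page still cannot encode. The table is hand-picked and hypothetical, not the mapping Windows itself uses, and cp437 is an assumed console code page.

```python
# Hypothetical, hand-picked approximations keyed by code point.
APPROXIMATIONS = {
    0x2013: u"-",   # en dash
    0x2014: u"-",   # em dash
    0x2018: u"'",   # left single quotation mark
    0x2019: u"'",   # right single quotation mark
    0x201C: u'"',   # left double quotation mark
    0x201D: u'"',   # right double quotation mark
    0x2122: u"T",   # trade mark sign
}


def approximate(text, encoding="cp437"):
    """Return a console-friendly approximation of text."""
    # First apply the explicit approximations from the table.
    text = text.translate(APPROXIMATIONS)
    # Then turn any remaining unmappable character into '?' rather than
    # raising UnicodeEncodeError; accented letters in the code page survive.
    return text.encode(encoding, errors="replace").decode(encoding)


shown = approximate(u"\u201cD\u00f6n\u00e5ld\u2019s\u2122 N\u00e9ph\u00eaws\u201d deg\u00b0.txt")
```

Characters missing from the table fall through to the '?' replacement, so the table can grow as new characters turn up without the program stopping.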

Can anyone tell me what I should be doing here?