Re: Making safe file names

2013-05-28 Thread Albert van der Horst
In article lvydneajg7lxnhtmnz2dnuvz_rkdn...@westnet.com.au,
Neil Hodgson  nhodg...@iinet.net.au wrote:
>Andrew Berg:
>
>> This is not a Unicode issue since (modern) file systems will happily
>> accept it. The issue is that certain characters (which are ASCII) are
>> not allowed on some file systems:
>>   \ / : * ? " < > | @ and the NUL character
>> The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow,
>> and NUL and / are not allowed on pretty much any file system. Locale
>> settings and encodings aside, these 11 characters will need to be escaped.
>
>There's also the Windows device name hole. There may be trouble with
>artists named 'COM4', 'CLOCK$', 'Con', or similar.
>
>http://support.microsoft.com/kb/74496

That applies to MS-DOS names. God forbid that this still holds on more modern
Microsoft operating systems?

>http://en.wikipedia.org/wiki/Nul_%28band%29
>
>Neil
-- 
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spearc.xs4all.nl =n http://home.hccnet.nl/a.w.m.van.der.horst

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Making safe file names

2013-05-28 Thread Chris Angelico
On Tue, May 28, 2013 at 11:44 PM, Albert van der Horst
alb...@spenarnc.xs4all.nl wrote:
 In article lvydneajg7lxnhtmnz2dnuvz_rkdn...@westnet.com.au,
 Neil Hodgson  nhodg...@iinet.net.au wrote:
There's also the Windows device name hole. There may be trouble with
artists named 'COM4', 'CLOCK$', 'Con', or similar.

http://support.microsoft.com/kb/74496

 That applies to MS-DOS names. God forbid that this still holds on more modern
 Microsoft operating systems?

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> open("com1","w").write("Test\n")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'com1'
>>> open("con","w").write("Test\n")
Test
5


ChrisA


Re: Making safe file names

2013-05-28 Thread Grant Edwards
On 2013-05-28, Albert van der Horst alb...@spenarnc.xs4all.nl wrote:

 There's also the Windows device name hole. There may be trouble with
 artists named 'COM4', 'CLOCK$', 'Con', or similar.

http://support.microsoft.com/kb/74496

 That applies to MS-DOS names. God forbid that this still holds on
 more modern Microsoft operating systems?

There are no more modern Microsoft operating systems.  Only more
recent ones.  There are still lots of reserved filenames in recent
versions of Windows.

-- 
Grant Edwards               grant.b.edwards        Yow! I've got an IDEA!!
                                  at               Why don't I STARE at you
                              gmail.com            so HARD, you forget your
                                                   SOCIAL SECURITY NUMBER!!


Re: Making safe file names

2013-05-11 Thread Andrew Berg
On 2013.05.08 18:37, Dennis Lee Bieber wrote:
   And now you've seen why music players don't show the user the
 physical file name, but maintain a database mapping the internal data
 (name, artist, track#, album, etc.) to whatever mangled name was needed
 to satisfy the file system.
Tags are used mainly for organization but a nice benefit of tags is that they 
are not subject to file system or URL or whatever other
limits. If an audio file has no metadata, most players will show the file name.

-- 
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1


Re: Making safe file names

2013-05-11 Thread Chris Angelico
On Thu, May 9, 2013 at 1:08 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 I suspect that the only way to be completely ungoogleable would be to
 name yourself something common, not something obscure. Say, if you called
 yourself Hard Rock Band, and did hard rock. But then, googling for
 Heavy Metal alone brings up the magazine as the fourth hit, so if you
 get famous enough, even that won't work.

Yeah, so why are ubergeneric domain names worth so much? Whatevs.

The best way to be findable in a web search is to have content on your
web site. Real crawlable content. I guarantee you'll be found. Even if
you're some tiny thing tucked away in a corner of teh interwebs, you
can be found.

http://www.google.com/search?q=minstrel+hall

The song is there, but so is an obscure little D&D MUD.

ChrisA


Re: Making safe file names

2013-05-09 Thread Roy Smith
In article 518b133b$0$29997$c3e8da3$54964...@news.astraweb.com,
 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:

 I suspect that the only way to be completely ungoogleable would be to 
 name yourself something common, not something obscure.

http://en.wikipedia.org/wiki/The_band


Re: Making safe file names

2013-05-09 Thread Gregory Ewing

Roy Smith wrote:

In article 518b133b$0$29997$c3e8da3$54964...@news.astraweb.com,
 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:

I suspect that the only way to be completely ungoogleable would be to 
name yourself something common, not something obscure.


http://en.wikipedia.org/wiki/The_band


Nope... googling for the band brings that up as the
very first result.

The Google knows all. You cannot escape The Google...

--
Greg


Re: Making safe file names

2013-05-09 Thread Tim Chase
On 2013-05-10 12:04, Gregory Ewing wrote:
 Roy Smith wrote:
  http://en.wikipedia.org/wiki/The_band
 
 Nope... googling for the band brings that up as the
 very first result.
 
 The Google knows all. You cannot escape The Google...

That does it.  I'm naming my band Google. :-)

-tkc




Re: Making safe file names

2013-05-08 Thread Roy Smith
In article mailman.1465.1368056269.3114.python-l...@python.org,
 Dennis Lee Bieber wlfr...@ix.netcom.com wrote:

 On Tue, 07 May 2013 18:10:25 -0500, Andrew Berg
 bahamutzero8...@gmail.com declaimed the following in
 gmane.comp.python.general:
 
  None of these would work because I would have no idea which file stores 
  data for which artist without writing code to figure it out. If I
  were to end up writing a bug that messed up a few of my cache files and 
  noticed it with a specific artist (e.g., doing a now playing and
  seeing the wrong tags), I would either have to manually match up the hash 
  or base64 encoding in order to delete just that file so that it
  gets regenerated or nuke and regenerate my entire cache.
 
   And now you've seen why music players don't show the user the
 physical file name, but maintain a database mapping the internal data
 (name, artist, track#, album, etc.) to whatever mangled name was needed
 to satisfy the file system.

Yup.  At Songza, we deal with this crap every day.  It usually bites us 
the worst when trying to do keyword searches.  When somebody types in 
Blue Oyster Cult, they really mean Blue Oyster Cult, and our search 
results need to reflect that.  Likewise for Ke$ha, Beyonce, and I don't 
even want to think about the artist formerly known as an unpronounceable 
glyph.

Pro-tip, guys.  If you want to form a band, and expect people to be able 
to find your stuff in a search engine some day, don't play cute with 
your name.


Re: Making safe file names

2013-05-08 Thread Chris Angelico
On Thu, May 9, 2013 at 10:16 AM, Roy Smith r...@panix.com wrote:
 Pro-tip, guys.  If you want to form a band, and expect people to be able
 to find your stuff in a search engine some day, don't play cute with
 your name.

It's the modern equivalent of names like Catherine Withekay.

ChrisA


Re: Making safe file names

2013-05-08 Thread Steven D'Aprano
On Wed, 08 May 2013 20:16:25 -0400, Roy Smith wrote:

 Yup.  At Songza, we deal with this crap every day.  It usually bites us
 the worst when trying to do keyword searches.  When somebody types in
 Blue Oyster Cult, they really mean Blue Oyster Cult, 

Surely they really mean Blue Öyster Cult.


 and our search
 results need to reflect that.  Likewise for Ke$ha, Beyonce, and I don't
 even want to think about the artist formerly known as an unpronounceable
 glyph.

Dropped or incorrect accents are no different from any other misspelling, 
and good search engines (whether online or in a desktop application) 
should be able to deal with a tolerable number of misspellings.
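
One common way to make a search tolerant of dropped accents is to compare
accent-folded, case-folded keys rather than the raw strings. The sketch
below is only an illustration using the standard unicodedata module; the
function names are invented here, and it says nothing about how Google,
Bing, or Songza actually match names.

```python
import unicodedata

def fold_accents(text):
    """Drop combining marks so that accented letters match plain ones."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def search_key(text):
    """Key used for comparison: accents removed, case ignored."""
    return fold_accents(text).casefold()

for name in ("Blue Öyster Cult", "Beyoncé", "Ke$ha"):
    print(name, "->", search_key(name))
```

Folding only handles accents and case; real misspellings such as
transposed letters would need some form of fuzzy matching on top of this.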

Googling for "Blue Oyster Cult" brings up four of the top ten hits 
spelled correctly with the accent, "Blue Öyster Cult". Even misspelled as 
"blew oytser cult", Google does the right thing.

Even Bing manages to find Ke$ha's Wikipedia page, her official website, 
YouTube channel, Facebook and MySpace pages from the misspelling "kehsha".



 Pro-tip, guys.  If you want to form a band, and expect people to be able
 to find your stuff in a search engine some day, don't play cute with
 your name.

Googling for "the the" (including quotes) brings up 145 million hits, 
nine of the first ten hits being relevant to the band. 

On the other hand, I wouldn't want to be in a band called The Beetles.


-- 
Steven


Re: Making safe file names

2013-05-08 Thread Roy Smith
In article 518b00a2$0$29997$c3e8da3$54964...@news.astraweb.com,
 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:

  When somebody types in
  Blue Oyster Cult, they really mean Blue Oyster Cult, 
 
 Surely they really mean Blue Öyster Cult.

Yes.  The oomlaut was there when I typed it.  Who knows what happened to 
it by the time it hit the wire.


Re: Making safe file names

2013-05-08 Thread Andrew Berg
On 2013.05.08 19:16, Roy Smith wrote:
 Yup.  At Songza, we deal with this crap every day.  It usually bites us 
 the worst when trying to do keyword searches.  When somebody types in 
 Blue Oyster Cult, they really mean Blue Oyster Cult, and our search 
 results need to reflect that.  Likewise for Ke$ha, Beyonce, and I don't 
 even want to think about the artist formerly known as an unpronounceable 
 glyph.

 Pro-tip, guys.  If you want to form a band, and expect people to be able 
 to find your stuff in a search engine some day, don't play cute with 
 your name.
It's a thing (especially in witch house) to make names with odd glyphs in order 
to be harder to find and be more underground. Very silly.
Try doing searches for these artists with names like these:
http://www.last.fm/music/%E2%96%BC%E2%96%A1%E2%96%A0%E2%96%A1%E2%96%A0%E2%96%A1%E2%96%A0
http://www.last.fm/music/ki%E2%80%A0%E2%80%A0y+c%E2%96%B2t
-- 
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1


Re: Making safe file names

2013-05-08 Thread Steven D'Aprano
On Wed, 08 May 2013 21:11:28 -0500, Andrew Berg wrote:

 It's a thing (especially in witch house) to make names with odd glyphs
 in order to be harder to find and be more underground. Very silly. Try
 doing searches for these artists with names like these:

Challenge accepted.

 http://www.last.fm/music/%E2%96%BC%E2%96%A1%E2%96%A0%E2%96%A1%E2%96%A0%E2%96%A1%E2%96%A0
 http://www.last.fm/music/ki%E2%80%A0%E2%80%A0y+c%E2%96%B2t


The second one is trivial. Googling for "kitty cat witch house" 
(including quotes) gives at least 3 relevant links out of the top 
4 hits. (I'm not sure about the YouTube page.) That gets you 
the correct spelling, ki††y c△t, and googling for that brings up many 
more hits.

The first one is a tad trickier, since googling for ▼□■□■□■ brings up 
nothing at all, and "mourning star" doesn't give any relevant hits on the 
first page. But "mourning star witch house" (inc. quotes) is successful.

I suspect that the only way to be completely ungoogleable would be to 
name yourself something common, not something obscure. Say, if you called 
yourself Hard Rock Band, and did hard rock. But then, googling for 
Heavy Metal alone brings up the magazine as the fourth hit, so if you 
get famous enough, even that won't work.



-- 
Steven


Making safe file names

2013-05-07 Thread Andrew Berg
Currently, I keep Last.fm artist data caches to avoid unnecessary API calls and 
have been naming the files using the artist name. However,
artist names can have characters that are not allowed in file names for most 
file systems (e.g., C/A/T has forward slashes). Are there any
recommended strategies for naming such files while avoiding conflicts (I 
wouldn't want to run into problems for an artist named C-A-T or
CAT, for example)? I'd like to make the files easily identifiable, and there 
really are no limits on what characters can be in an artist name.
-- 
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1


Re: Making safe file names

2013-05-07 Thread Terry Jan Reedy

On 5/7/2013 3:58 PM, Andrew Berg wrote:

Currently, I keep Last.fm artist data caches to avoid unnecessary API calls and 
have been naming the files using the artist name. However,
artist names can have characters that are not allowed in file names for most 
file systems (e.g., C/A/T has forward slashes). Are there any
recommended strategies for naming such files while avoiding conflicts (I 
wouldn't want to run into problems for an artist named C-A-T or
CAT, for example)? I'd like to make the files easily identifiable, and there 
really are no limits on what characters can be in an artist name.


Sounds like you want something like the html escape or urlencode 
functions, which serve the same purpose of encoding special chars. 
Rather than invent a new transformation, you could use the same scheme 
used for html entities. (Sorry, I forget the details.) It is possible 
that one of the functions would work for you as is, or with little 
modification.


Terry





Re: Making safe file names

2013-05-07 Thread Fábio Santos
I suggest Base64. b64encode
(http://docs.python.org/2/library/base64.html#base64.b64encode) and
b64decode take an argument which allows you to eliminate the pesky /
character. It's reversible and simple.
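
As a rough illustration of the Base64 idea (a sketch, not anything from
the poster's own code): the standard library's base64.urlsafe_b64encode
uses an alphabet with '-' and '_' in place of '+' and '/', which has the
same effect as passing altchars=b'-_' to b64encode. The helper names
below are made up for the example, and UTF-8 is assumed as the text
encoding.

```python
import base64

def name_to_b64(artist):
    """Encode an artist name into a Base64 string that contains no '/'."""
    return base64.urlsafe_b64encode(artist.encode("utf-8")).decode("ascii")

def b64_to_name(encoded):
    """Reverse name_to_b64()."""
    return base64.urlsafe_b64decode(encoded.encode("ascii")).decode("utf-8")

for artist in ("C/A/T", "C-A-T", "CAT"):
    encoded = name_to_b64(artist)
    print(artist, "->", encoded, "->", b64_to_name(encoded))
```

Note that Base64 output is case-sensitive, so on a case-insensitive file
system two different artist names could in principle still end up with
colliding file names.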

More suggestions: how about a hash? Or just use IDs from the database?

On Tue, May 7, 2013 at 8:58 PM, Andrew Berg bahamutzero8...@gmail.com wrote:
 Currently, I keep Last.fm artist data caches to avoid unnecessary API calls 
 and have been naming the files using the artist name. However,
 artist names can have characters that are not allowed in file names for most 
 file systems (e.g., C/A/T has forward slashes). Are there any
 recommended strategies for naming such files while avoiding conflicts (I 
 wouldn't want to run into problems for an artist named C-A-T or
 CAT, for example)? I'd like to make the files easily identifiable, and there 
 really are no limits on what characters can be in an artist name.
 --
 CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1



--
Fábio Santos


Re: Making safe file names

2013-05-07 Thread MRAB

On 07/05/2013 20:58, Andrew Berg wrote:

Currently, I keep Last.fm artist data caches to avoid unnecessary API calls and 
have been naming the files using the artist name. However,
artist names can have characters that are not allowed in file names for most 
file systems (e.g., C/A/T has forward slashes). Are there any
recommended strategies for naming such files while avoiding conflicts (I 
wouldn't want to run into problems for an artist named C-A-T or
CAT, for example)? I'd like to make the files easily identifiable, and there 
really are no limits on what characters can be in an artist name.


Conflicts won't occur if:

1. All of the characters of the artist's name are mapped to an encoding.

2. Different characters map to different encodings.

3. No encoding is a prefix of another encoding.

In practice, you'll be mapping most characters to themselves.
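
The three conditions above can be checked mechanically. The following is
a minimal sketch, assuming a small made-up escape table (ESCAPES) in which
every character not listed maps to itself; the table and the function
names are illustrative only.

```python
# Hypothetical escape table: every character not listed maps to itself.
ESCAPES = {"%": "%25", "/": "%2F", ":": "%3A"}

def encode_char(ch):
    """Return the encoding of a single character (condition 1)."""
    return ESCAPES.get(ch, ch)

def encode(name):
    """Encode a whole name by concatenating per-character encodings."""
    return "".join(encode_char(ch) for ch in name)

def is_prefix_free(codes):
    """Check conditions 2 and 3: no code equals or is a prefix of another."""
    ordered = sorted(codes)
    # After sorting, a prefix is always immediately followed by a string
    # that starts with it, so comparing neighbours is enough.
    return all(not b.startswith(a) for a, b in zip(ordered, ordered[1:]))

alphabet = sorted(set("C/A/T C-A-T 100% 12:30"))
codes = [encode_char(ch) for ch in alphabet]
print(is_prefix_free(codes))                    # scheme with '%' escaped
print(is_prefix_free(["%", "%2F", "C", "A"]))   # '%' left unescaped
```

The second call shows why the escape character itself has to be encoded:
if '%' mapped to itself, it would be a prefix of '%2F' and decoding would
become ambiguous.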



Re: Making safe file names

2013-05-07 Thread Dan Stromberg
On 5/7/13, Andrew Berg bahamutzero8...@gmail.com wrote:
 Currently, I keep Last.fm artist data caches to avoid unnecessary API calls
 and have been naming the files using the artist name. However,
 artist names can have characters that are not allowed in file names for most
 file systems (e.g., C/A/T has forward slashes). Are there any
 recommended strategies for naming such files while avoiding conflicts (I
 wouldn't want to run into problems for an artist named C-A-T or
 CAT, for example)? I'd like to make the files easily identifiable, and there
 really are no limits on what characters can be in an artist name.

You might consider:
http://stromberg.dnsalias.org/svn/backshift/trunk/escape_mod.py
http://stromberg.dnsalias.org/svn/backshift/trunk/test-escape_mod

It doubles the length of the string, but it produces safe, easily
readable escaped strings - which tends to make debugging easier.

It requires a couple of other modules (easily obtained from the same
SVN repo) though.


Re: Making safe file names

2013-05-07 Thread Jens Thoms Toerring
Andrew Berg bahamutzero8...@gmail.com wrote:
 Currently, I keep Last.fm artist data caches to avoid unnecessary API calls
 and have been naming the files using the artist name. However, artist names
 can have characters that are not allowed in file names for most file systems
 (e.g., C/A/T has forward slashes). Are there any recommended strategies for
 naming such files while avoiding conflicts (I wouldn't want to run into
 problems for an artist named C-A-T or CAT, for example)? I'd like to make
 the files easily identifiable, and there really are no limits on what
 characters can be in an artist name. --

It's not clear what context you need this for. You
could e.g. replace all characters not allowed by the file
system by their hexadecimal (ASCII) values, preceded by a
'%' (so '/' would be changed to '%2F', and also encode a '%'
itself in a name by '%25'). Then you have a well-defined
two-way mapping (isomorphic, if I remember my math-learning
days correctly) between the original name and the
way you store it. E.g.

  C/A/T  would become  C%2FA%2FT

and

  C%2FA/T  would become  C%252FA%2FT

You can translate back and forth between them with not too
much effort.

Of course, that assumes that '%' is a character allowed by
your file system - otherwise pick some other one, any one
will do in principle. It's a bit harder for a human to
interpret but rather likely not that much of a problem. You
probably will have seen that kind of scheme used in URLs.
The concept is rather old and called 'escape character',
i.e. have one character that assumes some special meaning
and also escape it.
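
A minimal sketch of this escape-character scheme in Python follows. The
set of characters in FORBIDDEN is an assumption (the ASCII characters
named elsewhere in this thread, plus '%' itself), and the function names
are invented for the example. Everything else, including non-ASCII
characters, is left untouched.

```python
import re

# Assumption: characters to escape, plus '%' because it is the escape
# character. Adjust this set for the file systems actually targeted.
FORBIDDEN = '\\/:*?"<>|@\0%'

_ESCAPE_RE = re.compile("[" + re.escape(FORBIDDEN) + "]")
_UNESCAPE_RE = re.compile(r"%([0-9A-F]{2})")

def escape_name(name):
    """Replace each forbidden character with '%' plus two hex digits."""
    return _ESCAPE_RE.sub(lambda m: "%%%02X" % ord(m.group()), name)

def unescape_name(escaped):
    """Reverse escape_name()."""
    return _UNESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), escaped)

for name in ("C/A/T", "C%2FA/T", "C-A-T", "CAT"):
    stored = escape_name(name)
    print(name, "->", stored, "->", unescape_name(stored))
```

Because '%' is always escaped, every '%' in a stored name starts exactly
one escape sequence, which is what makes the mapping reversible.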

If, on the other hand, those names are never to be translated back
to the original name, another strategy would be to use the SHA1
hash value of the artist's name. Since clashes between SHA1 hash
values are rather hard to produce it's a rather safe method of
converting something (i.e. the artist's name) to a number. The
drawback, of course, is that you can't translate back from the
hash value to the original name (if that would be simple the
whole thing wouldn't work;-)
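
The hash variant could look roughly like this. It is a sketch only: the
'.cache' extension and the function name are assumptions, and UTF-8 is
assumed for turning the name into bytes before hashing.

```python
import hashlib

def hashed_name(artist, ext=".cache"):
    """Return a fixed-length hex file name derived from the artist name."""
    digest = hashlib.sha1(artist.encode("utf-8")).hexdigest()
    return digest + ext

for artist in ("C/A/T", "C-A-T", "CAT"):
    print(artist, "->", hashed_name(artist))
```

The resulting names contain only hex digits, so they are safe on any file
system, but finding the file for a given artist by eye would require a
separate index from artist name to hash.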

   Regards, Jens
-- 
  \   Jens Thoms Toerring  ___  j...@toerring.de
   \__  http://toerring.de


Re: Making safe file names

2013-05-07 Thread Chris Angelico
On Wed, May 8, 2013 at 8:18 AM, Fábio Santos fabiosantos...@gmail.com wrote:
 I suggest Base64. b64encode
 (http://docs.python.org/2/library/base64.html#base64.b64encode) and
 b64decode take an argument which allows you to eliminate the pesky /
 character. It's reversible and simple.

But it doesn't look anything like the original.

I'd be inclined to go for something like quoted-printable or
URL-encoding; special characters become much longer, but ordinary
characters (mostly) stay as themselves.

ChrisA


Re: Making safe file names

2013-05-07 Thread Andrew Berg
On 2013.05.07 17:18, Fábio Santos wrote:
 I suggest Base64. b64encode
 (http://docs.python.org/2/library/base64.html#base64.b64encode) and
 b64decode take an argument which allows you to eliminate the pesky /
 character. It's reversible and simple.
 
 More suggestions: how about a hash? Or just use IDs from the database?
None of these would work because I would have no idea which file stores data 
for which artist without writing code to figure it out. If I
were to end up writing a bug that messed up a few of my cache files and noticed 
it with a specific artist (e.g., doing a now playing and
seeing the wrong tags), I would either have to manually match up the hash or 
base64 encoding in order to delete just that file so that it
gets regenerated or nuke and regenerate my entire cache.

-- 
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1


Re: Making safe file names

2013-05-07 Thread Andrew Berg
On 2013.05.07 17:01, Terry Jan Reedy wrote:
 Sounds like you want something like the html escape or urlencode 
 functions, which serve the same purpose of encoding special chars. 
 Rather than invent a new transformation, you could use the same scheme 
 used for html entities. (Sorry, I forget the details.) It is possible 
 that one of the functions would work for you as is, or with little 
 modification.
This has the problem of mangling non-ASCII characters (and artist names with 
non-ASCII characters are not rare). I most definitely want to
keep as many characters untouched as possible so that the files are easy to 
identify by looking at the file name. Ideally, only characters
that file systems don't like would be transformed.

-- 
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1


Re: Making safe file names

2013-05-07 Thread Dave Angel

On 05/07/2013 03:58 PM, Andrew Berg wrote:

Currently, I keep Last.fm artist data caches to avoid unnecessary API calls and 
have been naming the files using the artist name. However,
artist names can have characters that are not allowed in file names for most 
file systems (e.g., C/A/T has forward slashes). Are there any
recommended strategies for naming such files while avoiding conflicts (I 
wouldn't want to run into problems for an artist named C-A-T or
CAT, for example)? I'd like to make the files easily identifiable, and there 
really are no limits on what characters can be in an artist name.



So what you need first is a list of allowable characters for all your 
target OS versions.  And don't forget that the allowable characters may 
vary depending on the particular file system(s) mounted on a given OS.


You also need to decide how to handle Unicode characters, since they're 
different for different OS.  In Windows on NTFS, filenames are in 
Unicode, while on Unix, filenames are bytes.  So on one of those, you 
will be encoding/decoding if your code is to be mostly portable.


Don't forget that ls and rm may not use the same encoding you're using. 
 So you may not consider it adequate to make the names legal, but you 
may also want them to be easily typeable in the shell.


--
DaveA


Re: Making safe file names

2013-05-07 Thread Roy Smith
In article mailman.1428.1367972114.3114.python-l...@python.org,
 Dave Angel da...@davea.name wrote:

 On 05/07/2013 03:58 PM, Andrew Berg wrote:
  Currently, I keep Last.fm artist data caches to avoid unnecessary API calls 
  and have been naming the files using the artist name. However,
  artist names can have characters that are not allowed in file names for 
  most file systems (e.g., C/A/T has forward slashes). Are there any
  recommended strategies for naming such files while avoiding conflicts (I 
  wouldn't want to run into problems for an artist named C-A-T or
  CAT, for example)? I'd like to make the files easily identifiable, and 
  there really are no limits on what characters can be in an artist name.
 
 
 So what you need first is a list of allowable characters for all your 
 target OS versions.  And don't forget that the allowable characters may 
 vary depending on the particular file system(s) mounted on a given OS.
 
 You also need to decide how to handle Unicode characters, since they're 
 different for different OS.  In Windows on NTFS, filenames are in 
 Unicode, while on Unix, filenames are bytes.  So on one of those, you 
 will be encoding/decoding if your code is to be mostly portable.
 
 Don't forget that ls and rm may not use the same encoding you're using. 
   So you may not consider it adequate to make the names legal, but you 
 may also want them to be easily typeable in the shell.

One possible tool that may help you here is unidecode 
(https://pypi.python.org/pypi/Unidecode).  It doesn't solve your whole 
problem, but it does help get unicode text into a form which is both 
7-bit clean and human readable.


Re: Making safe file names

2013-05-07 Thread Andrew Berg
On 2013.05.07 17:37, Jens Thoms Toerring wrote:
 You
 could e.g. replace all characters not allowed by the file
 system by their hexadecimal (ASCII) values, preceded by a
 '%' (so '/' would be changed to '%2F', and also encode a '%'
 itself in a name by '%25'). Then you have a well-defined
 two-way mapping (isomorphic, if I remember my math-learning
 days correctly) between the original name and the
 way you store it. E.g.
 
   C/A/T  would become  C%2FA%2FT
 
 and
 
   C%2FA/T  would become  C%252FA%2FT
 
 You can translate back and forth between them with not too
 much effort.
 
 Of course, that assumes that '%' is a character allowed by
 your file system - otherwise pick some other one, any one
 will do in principle. It's a bit harder for a human to
 interpret but rather likely not that much of a problem.
Yes, something like this is what I am trying to achieve. Judging by the 
responses I've gotten so far, I think I'll have to roll my own
transformation scheme since URL encoding and the like transform Unicode 
characters. I can memorize that 植松伸夫 is a Japanese composer who
is well-known for his works in the Final Fantasy series of video games. Trying 
to match up the URL-encoded version to an artist would be
almost impossible when I have several other artist names that have no ASCII 
characters.
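
To make the objection concrete, here is a small sketch of what stock URL
encoding does to such a name, using urllib.parse.quote from the standard
library. The wrapper name url_escape is invented for the example, and
safe="" is passed so that '/' is escaped as well.

```python
from urllib.parse import quote, unquote

def url_escape(name):
    """Percent-encode everything except unreserved ASCII characters."""
    return quote(name, safe="")

ascii_name = url_escape("C/A/T")
cjk_name = url_escape("植松伸夫")
print(ascii_name)   # only the slashes are changed
print(cjk_name)     # every character becomes a run of %XX escapes
print(unquote(cjk_name))
```

The encoding is reversible, but each of the four characters is replaced
by the percent-escaped bytes of its UTF-8 form, so the stored name is no
longer recognisable at a glance.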

-- 
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1


Re: Making safe file names

2013-05-07 Thread Andrew Berg
On 2013.05.07 19:14, Dave Angel wrote:
 You also need to decide how to handle Unicode characters, since they're 
 different for different OS.  In Windows on NTFS, filenames are in 
 Unicode, while on Unix, filenames are bytes.  So on one of those, you 
 will be encoding/decoding if your code is to be mostly portable.
Characters outside whatever sys.getfilesystemencoding() returns won't be 
allowed. If the user's locale settings don't support Unicode, my
program will be far from the only one to have issues with it. Any problem 
reports that arise from a user moving between legacy encodings
will generally be ignored. I haven't yet decided how I will handle artist names 
with characters outside UTF-8, but inside UTF-16/32 (UTF-16
is just fine on Windows/NTFS, but on Unix(-ish) systems, many use UTF-8 in 
their locale settings).
 Don't forget that ls and rm may not use the same encoding you're using. 
 So you may not consider it adequate to make the names legal, but you 
 may also want them to be easily typeable in the shell.
I don't understand. I have no intention of changing Unicode characters.


This is not a Unicode issue since (modern) file systems will happily accept it. 
The issue is that certain characters (which are ASCII) are
not allowed on some file systems:
 \ / : * ? " < > | @ and the NUL character
The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow, and NUL 
and / are not allowed on pretty much any file system. Locale
settings and encodings aside, these 11 characters will need to be escaped.
-- 
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1


Re: Making safe file names

2013-05-07 Thread Dave Angel

On 05/07/2013 08:51 PM, Andrew Berg wrote:

> On 2013.05.07 19:14, Dave Angel wrote:
>> You also need to decide how to handle Unicode characters, since they're
>> different for different OS.  In Windows on NTFS, filenames are in
>> Unicode, while on Unix, filenames are bytes.  So on one of those, you
>> will be encoding/decoding if your code is to be mostly portable.
>
> Characters outside whatever sys.getfilesystemencoding() returns won't be
> allowed. If the user's locale settings don't support Unicode, my
> program will be far from the only one to have issues with it. Any problem
> reports that arise from a user moving between legacy encodings
> will generally be ignored. I haven't yet decided how I will handle artist names
> with characters outside UTF-8,

There aren't any characters outside UTF-8.  But a character is not "in"
UTF-8; it can be encoded by UTF-8.

> but inside UTF-16/32 (UTF-16

Nor outside UTF-16 or 32.

> is just fine on Windows/NTFS, but on Unix(-ish) systems, many use UTF-8 in
> their locale settings).
>
>> Don't forget that ls and rm may not use the same encoding you're using.
>> So you may not consider it adequate to make the names legal, but you
>> may also want them to be easily typeable in the shell.
>
> I don't understand. I have no intention of changing Unicode characters.

So you're comfortable typing arbitrary characters?  What about all the
characters that have identical displays in your font? What about viewing
0x07 in the terminal window?  Or 0x04?

> This is not a Unicode issue since (modern) file systems will happily accept it.
> The issue is that certain characters (which are ASCII) are
> not allowed on some file systems:
>   \ / : * ? " < > | @ and the NUL character
> The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow, and NUL
> and / are not allowed on pretty much any file system. Locale
> settings and encodings aside, these 11 characters will need to be escaped.



As soon as you have a small, finite list of invalid characters, writing 
an escape system is pretty easy.



--
DaveA


Re: Making safe file names

2013-05-07 Thread Neil Hodgson

Andrew Berg:


This is not a Unicode issue since (modern) file systems will happily accept it. 
The issue is that certain characters (which are ASCII) are
not allowed on some file systems:
  \ / : * ? " < > | @ and the NUL character
The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow, and NUL 
and / are not allowed on pretty much any file system. Locale
settings and encodings aside, these 11 characters will need to be escaped.


   There's also the Windows device name hole. There may be trouble with 
artists named 'COM4', 'CLOCK$', 'Con', or similar.


http://support.microsoft.com/kb/74496
http://en.wikipedia.org/wiki/Nul_%28band%29

   Neil


Re: Making safe file names

2013-05-07 Thread Dave Angel

On 05/07/2013 09:28 PM, Neil Hodgson wrote:

Andrew Berg:


This is not a Unicode issue since (modern) file systems will happily
accept it. The issue is that certain characters (which are ASCII) are
not allowed on some file systems:
  \ / : * ? " < > | @ and the NUL character
The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow,
and NUL and / are not allowed on pretty much any file system. Locale
settings and encodings aside, these 11 characters will need to be
escaped.


There's also the Windows device name hole. There may be trouble with
artists named 'COM4', 'CLOCK$', 'Con', or similar.



In MSDOS 2, there was a switch that would tell the OS to ignore such 
names unless they were prefixed by \DEV.  But like the switchar switch, 
it was largely ignored by the ignorant, and probably doesn't exist in 
current versions of M$OS



http://support.microsoft.com/kb/74496
http://en.wikipedia.org/wiki/Nul_%28band%29

Neil


While we're looking for trouble, there's also case insensitivity.
Unclear if the user cares, but "tom" and "TOM" are the same file in most
configurations of NT.


--
DaveA


Re: Making safe file names

2013-05-07 Thread Andrew Berg
On 2013.05.07 20:28, Neil Hodgson wrote:
> http://support.microsoft.com/kb/74496
> http://en.wikipedia.org/wiki/Nul_%28band%29
I can indeed confirm that at least 'nul' cannot be used as a filename. However, 
I add an extension to the file names to identify them as caches.

-- 
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1


Re: Making safe file names

2013-05-07 Thread Andrew Berg
On 2013.05.07 20:45, Dave Angel wrote:
> While we're looking for trouble, there's also case insensitivity.
> Unclear if the user cares, but "tom" and "TOM" are the same file in most
> configurations of NT.
Artist names on Last.fm cannot differ only in case. This does remind me to make 
sure to update the case of the artist name as necessary,
though. For example, if Sam becomes SAM again (I have seen Last.fm change the 
case for artist names), I need to make sure that I don't end
up with two file names differing only in case.
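
One way to check for such collisions, sketched under the assumption that str.casefold() (new in Python 3.3) is a close enough stand-in for the file system's own case rules, which it only approximates. case_collisions is a hypothetical helper name.

```python
from collections import defaultdict


def case_collisions(names):
    """Group names that differ only in case; return groups with 2+ members."""
    groups = defaultdict(list)
    for name in names:
        # casefold() is a more aggressive lower(), intended for caseless
        # comparison of Unicode text.
        groups[name.casefold()].append(name)
    return [group for group in groups.values() if len(group) > 1]
```

Calling it on the list of cached artist names before writing a new file would flag a pair such as 'Sam' and 'SAM', so the old file can be renamed rather than duplicated.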

-- 
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1


Re: Making safe file names

2013-05-07 Thread Andrew Berg
On 2013.05.07 20:13, Dave Angel wrote:
> So you're comfortable typing arbitrary characters?  what about all the
> characters that have identical displays in your font?
Identification is more important than typing. I can copy and paste into a 
terminal if necessary. I don't foresee typing out one of the
filenames being anything more than a rare occurrence, but I will occasionally 
just read the list.
> What about viewing
> 0x07 in the terminal window?  Or 0x04?
I don't think Last.fm will even send those characters. In any case, control 
characters in artist names are rare enough that it's not worth
the trouble to write the code to avoid the problems associated with them.
> As soon as you have a small, finite list of invalid characters, writing
> an escape system is pretty easy.
Probably. I was just hoping there was an existing system that would work, but 
as I said in a different reply, it would seem I need to roll
my own.

-- 
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1


Re: Making safe file names

2013-05-07 Thread Roy Smith
In article mailman.1435.1367977523.3114.python-l...@python.org,
 Dave Angel da...@davea.name wrote:

> While we're looking for trouble, there's also case insensitivity.
> Unclear if the user cares, but "tom" and "TOM" are the same file in most
> configurations of NT.

OSX, too.


Re: Making safe file names

2013-05-07 Thread Steven D'Aprano
On Tue, 07 May 2013 19:51:24 -0500, Andrew Berg wrote:

> On 2013.05.07 19:14, Dave Angel wrote:
>> You also need to decide how to handle Unicode characters, since they're
>> different for different OS.  In Windows on NTFS, filenames are in
>> Unicode, while on Unix, filenames are bytes.  So on one of those, you
>> will be encoding/decoding if your code is to be mostly portable.
>
> Characters outside whatever sys.getfilesystemencoding() returns won't be
> allowed. If the user's locale settings don't support Unicode, my program
> will be far from the only one to have issues with it. Any problem
> reports that arise from a user moving between legacy encodings will
> generally be ignored. I haven't yet decided how I will handle artist
> names with characters outside UTF-8, but inside UTF-16/32 (UTF-16 is
> just fine on Windows/NTFS, but on Unix(-ish) systems, many use UTF-8 in
> their locale settings).

There aren't any characters outside of UTF-8 :-) UTF-8 covers the entire 
Unicode range, unlike other encodings like Latin-1 or ASCII.
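
A quick way to see the difference in Python. can_encode is a hypothetical helper written for this sketch; the only real API used is str.encode(). (Lone surrogate code points, which are not characters, are the one thing Python's strict UTF-8 codec will refuse.)

```python
def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


snowman = '\u2603'        # SNOWMAN, well outside Latin-1
highest = chr(0x10FFFF)   # the last code point in the Unicode range

print(can_encode(snowman + highest, 'utf-8'))    # True
print(can_encode(snowman, 'latin-1'))            # False
print(can_encode(snowman, 'ascii'))              # False
```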

Well, that is to say, there may be characters that are not (yet) handled 
at all by Unicode, but there are no known legacy encodings that support 
such characters.

To a first approximation, Unicode covers the entire set of characters in 
human use, and for those which it does not, there is always the private 
use area. So for example, if you wish to record the Artist Formerly Known 
As The Artist Formerly Known As Prince as "Love Symbol", you could pick 
an arbitrary private use code point, declare that for your application 
that code point means "Love Symbol", and use that code point as the artist 
name. You could even come up with a custom font that includes a rendition 
of that character glyph.
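
As a small illustration of the private use idea, assuming U+E000 as the arbitrarily chosen code point (any private use code point would do). LOVE_SYMBOL and is_private_use are names invented for this sketch.

```python
import unicodedata

# Arbitrary choice for this sketch: U+E000, the first code point of the
# Basic Multilingual Plane private use area (U+E000..U+F8FF).
LOVE_SYMBOL = '\ue000'


def is_private_use(ch):
    """True if ch is in a private use area (general category 'Co')."""
    return unicodedata.category(ch) == 'Co'


artist = LOVE_SYMBOL
# A private use code point encodes to UTF-8 like any other character.
assert artist.encode('utf-8').decode('utf-8') == artist
```

The meaning of the code point exists only inside the application; other programs will show whatever glyph (if any) their font assigns to it.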

However, there are byte combinations which are not valid UTF-8, which is 
a different story. If you're receiving bytes from (say) a file name, they 
may not necessarily make up a valid UTF-8 string. But this is not an 
issue if you are receiving data from something guaranteed to be valid 
UTF-8.
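
A short sketch of the bytes case. The sample bytes are an invented example (the Latin-1 encoding of a name containing o-umlaut), and is_valid_utf8 is a hypothetical helper. The surrogateescape error handler is a real part of Python 3 and is what os.fsdecode() relies on for undecodable file names on Unix.

```python
def is_valid_utf8(raw):
    """Return True if the bytes object raw decodes cleanly as UTF-8."""
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


raw = b'Bj\xf6rk'   # Latin-1 bytes; 0xF6 cannot start a UTF-8 sequence

print(is_valid_utf8(raw))    # False

# surrogateescape smuggles the undecodable byte through as a lone
# surrogate, so the original bytes can be recovered exactly.
name = raw.decode('utf-8', errors='surrogateescape')
assert name.encode('utf-8', errors='surrogateescape') == raw
```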


>> Don't forget that ls and rm may not use the same encoding you're using.
>> So you may not consider it adequate to make the names legal, but you
>> may also want them to be easily typeable in the shell.
>
> I don't understand. I have no intention of changing Unicode characters.

Of course you do. You even talk below about Unicode characters like * 
and ? not being allowed on NTFS systems.

Perhaps you are thinking that there are a bunch of characters over here 
called "plain text ASCII characters", and a *different* bunch of 
characters with funny accents and stuff called "Unicode characters". If 
so, then you are labouring under a misapprehension, and you should start 
off by reading this:

http://www.joelonsoftware.com/articles/Unicode.html


then come back with any questions.


> This is not a Unicode issue since (modern) file systems will happily
> accept it. The issue is that certain characters (which are ASCII) are
> not allowed on some file systems:
>   \ / : * ? " < > | @ and the NUL character

These are all Unicode characters too. Unicode is a subset of ASCII, so 
anything which is ASCII is also Unicode.


> The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow,
> and NUL and / are not allowed on pretty much any file system. Locale
> settings and encodings aside, these 11 characters will need to be
> escaped.

If you have an artist with control characters in their name, like newline 
or carriage return or NUL, I think it is fair to just drop the control 
characters and then give the artist a thorough thrashing with a halibut.

Does your mapping really need to be guaranteed reversible? If you have an 
artist called "JoeBlow", and another artist called "Joe\0Blow", and a 
third called "Joe\nBlow", does it *really* matter if your application 
conflates them?
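
For concreteness, a sketch of the lossy approach: dropping every character in Unicode general category Cc (control characters). drop_controls is a hypothetical helper; unicodedata.category() is the real standard-library call.

```python
import unicodedata


def drop_controls(name):
    """Remove control characters (general category 'Cc') from name."""
    return ''.join(ch for ch in name if unicodedata.category(ch) != 'Cc')


names = ['JoeBlow', 'Joe\0Blow', 'Joe\nBlow']
# All three collapse to the same string, so the mapping is not reversible.
print({drop_controls(n) for n in names})    # {'JoeBlow'}
```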


-- 
Steven


Re: Making safe file names

2013-05-07 Thread Dave Angel

On 05/07/2013 10:06 PM, Andrew Berg wrote:

On 2013.05.07 20:28, Neil Hodgson wrote:

http://support.microsoft.com/kb/74496
http://en.wikipedia.org/wiki/Nul_%28band%29

I can indeed confirm that at least 'nul' cannot be used as a filename. However, 
I add an extension to the file names to identify them as caches.



Won't help.  NUL.txt is just as reserved as NUL is.  Extensions are 
ignored in this particular piece of historical nonsense.
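
A sketch of a check that accounts for this. The set below is an assumption: CON, PRN, AUX, NUL, COM1 to COM9 and LPT1 to LPT9 as listed in Microsoft's file naming documentation, plus CLOCK$ as mentioned earlier in the thread. is_reserved is a hypothetical helper name.

```python
# Assumed list of reserved device names; see the note above.
RESERVED = {'CON', 'PRN', 'AUX', 'NUL', 'CLOCK$'}
RESERVED.update('COM%d' % i for i in range(1, 10))
RESERVED.update('LPT%d' % i for i in range(1, 10))


def is_reserved(filename):
    """True if the part before the first dot is a reserved device name."""
    # Everything from the first dot onward is ignored by Windows here,
    # and the comparison is case-insensitive.
    stem = filename.split('.', 1)[0]
    return stem.upper() in RESERVED


print(is_reserved('NUL.txt'))       # True
print(is_reserved('Nulls.cache'))   # False
```

Since a suffix does not help, one workaround is to put a fixed prefix in front of every cache file name instead, so that no artist name can ever form the stem on its own.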



--
DaveA


Re: Making safe file names

2013-05-07 Thread Dave Angel

On 05/07/2013 11:40 PM, Steven D'Aprano wrote:


<SNIP>

These are all Unicode characters too. Unicode is a subset of ASCII, so
anything which is ASCII is also Unicode.




Typo.  You meant "Unicode is a superset of ASCII."


--
DaveA


Re: Making safe file names

2013-05-07 Thread Steven D'Aprano
On Wed, 08 May 2013 00:13:20 -0400, Dave Angel wrote:

> On 05/07/2013 11:40 PM, Steven D'Aprano wrote:
>
> <SNIP>
>
>> These are all Unicode characters too. Unicode is a subset of ASCII, so
>> anything which is ASCII is also Unicode.
>
> Typo.  You meant "Unicode is a superset of ASCII."

Damn. Yes, you're right. I was thinking "superset", but my fingers typed 
"subset".

Thanks for the correction.


-- 
Steven


Re: Making safe file names

2013-05-07 Thread Andrew Berg
On 2013.05.07 22:40, Steven D'Aprano wrote:
> There aren't any characters outside of UTF-8 :-) UTF-8 covers the entire
> Unicode range, unlike other encodings like Latin-1 or ASCII.
You are correct. I'm not sure what I was thinking.

>> I don't understand. I have no intention of changing Unicode characters.
>
> Of course you do. You even talk below about Unicode characters like *
> and ? not being allowed on NTFS systems.
I worded that incorrectly. What I meant, of course, is that I intend to 
preserve as many characters as possible and have no need to stay
within ASCII.

> If you have an artist with control characters in their name, like newline
> or carriage return or NUL, I think it is fair to just drop the control
> characters and then give the artist a thorough thrashing with a halibut.
While the thrashing with a halibut may be warranted (though I personally would 
use a rubber chicken), conflicts are problematic.

> Does your mapping really need to be guaranteed reversible? If you have an
> artist called "JoeBlow", and another artist called "Joe\0Blow", and a
> third called "Joe\nBlow", does it *really* matter if your application
> conflates them?
Yes and yes. Some artists like to be real cute with their names and make witch 
house artist names look tame in comparison, and some may
choose to use names similar to some very popular artists. I've also seen people 
scrobble fake artists with names that look like real artist
names (using things like a non-breaking space instead of a regular space) with 
different artist pictures in order to confuse and troll
people. If I could remember the user profiles with this, I'd link them. Last.fm 
is a silly place.
As I said before though, I don't think control characters are even allowed in 
artist names (likely for technical reasons).
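
A sketch of how such lookalike names could be detected, assuming NFKC normalization is an acceptable notion of "looks the same". It catches compatibility variants such as the non-breaking space but not, for example, lookalike letters from another script. confusable is a hypothetical helper name.

```python
import unicodedata


def confusable(a, b):
    """True if a and b are different strings that are equal under NFKC."""
    if a == b:
        return False
    return unicodedata.normalize('NFKC', a) == unicodedata.normalize('NFKC', b)


real = 'Joe Blow'
fake = 'Joe\u00a0Blow'   # NO-BREAK SPACE instead of a regular space

print(real == fake)             # False: two distinct artist names
print(confusable(real, fake))   # True: equal after normalization
```

Normalization is only for detection here. The file names themselves have to stay distinct, which is why the names are not normalized before being stored.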
-- 
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1