Re: Making safe file names
In article lvydneajg7lxnhtmnz2dnuvz_rkdn...@westnet.com.au, Neil Hodgson nhodg...@iinet.net.au wrote: Andrew Berg: This is not a Unicode issue since (modern) file systems will happily accept it. The issue is that certain characters (which are ASCII) are not allowed on some file systems: \ / : * ? " < > | @ and the NUL character. The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow, and NUL and / are not allowed on pretty much any file system. Locale settings and encodings aside, these 11 characters will need to be escaped. There's also the Windows device name hole. There may be trouble with artists named 'COM4', 'CLOCK$', 'Con', or similar. http://support.microsoft.com/kb/74496 That applies to MS-DOS names. God forbid that this still holds on more modern Microsoft operating systems? http://en.wikipedia.org/wiki/Nul_%28band%29 Neil -- Albert van der Horst, UTRECHT,THE NETHERLANDS Economic growth -- being exponential -- ultimately falters. albert@spearc.xs4all.nl =n http://home.hccnet.nl/a.w.m.van.der.horst -- http://mail.python.org/mailman/listinfo/python-list
Re: Making safe file names
On Tue, May 28, 2013 at 11:44 PM, Albert van der Horst alb...@spenarnc.xs4all.nl wrote: In article lvydneajg7lxnhtmnz2dnuvz_rkdn...@westnet.com.au, Neil Hodgson nhodg...@iinet.net.au wrote: There's also the Windows device name hole. There may be trouble with artists named 'COM4', 'CLOCK$', 'Con', or similar. http://support.microsoft.com/kb/74496 That applies to MS-DOS names. God forbid that this still holds on more modern Microsoft operating systems?

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> open("com1","w").write("Test\n")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'com1'
>>> open("con","w").write("Test\n")
Test
5

ChrisA
Re: Making safe file names
On 2013-05-28, Albert van der Horst alb...@spenarnc.xs4all.nl wrote: There's also the Windows device name hole. There may be trouble with artists named 'COM4', 'CLOCK$', 'Con', or similar. http://support.microsoft.com/kb/74496 That applies to MS-DOS names. God forbid that this still holds on more modern Microsoft operating systems? There are no more modern Microsoft operating systems. Only more recent ones. There are still lots of reserved filenames in recent versions of Windows. -- Grant Edwards, grant.b.edwards at gmail.com. Yow! I've got an IDEA!! Why don't I STARE at you so HARD, you forget your SOCIAL SECURITY NUMBER!!
Re: Making safe file names
On 2013.05.08 18:37, Dennis Lee Bieber wrote: And now you've seen why music players don't show the user the physical file name, but maintain a database mapping the internal data (name, artist, track#, album, etc.) to whatever mangled name was needed to satisfy the file system. Tags are used mainly for organization but a nice benefit of tags is that they are not subject to file system or URL or whatever other limits. If an audio file has no metadata, most players will show the file name. -- CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1 -- http://mail.python.org/mailman/listinfo/python-list
Re: Making safe file names
On Thu, May 9, 2013 at 1:08 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: I suspect that the only way to be completely ungoogleable would be to name yourself something common, not something obscure. Say, if you called yourself Hard Rock Band, and did hard rock. But then, googling for Heavy Metal alone brings up the magazine as the fourth hit, so if you get famous enough, even that won't work. Yeah, so why are ubergeneric domain names worth so much? Whatevs. The best way to be findable in a web search is to have content on your web site. Real crawlable content. I guarantee you'll be found. Even if you're some tiny thing tucked away in a corner of teh interwebs, you can be found. http://www.google.com/search?q=minstrel+hall The song is there, but so is an obscure little D&D MUD. ChrisA
Re: Making safe file names
In article 518b133b$0$29997$c3e8da3$54964...@news.astraweb.com, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: I suspect that the only way to be completely ungoogleable would be to name yourself something common, not something obscure. http://en.wikipedia.org/wiki/The_band -- http://mail.python.org/mailman/listinfo/python-list
Re: Making safe file names
Roy Smith wrote: In article 518b133b$0$29997$c3e8da3$54964...@news.astraweb.com, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: I suspect that the only way to be completely ungoogleable would be to name yourself something common, not something obscure. http://en.wikipedia.org/wiki/The_band Nope... googling for the band brings that up as the very first result. The Google knows all. You cannot escape The Google... -- Greg -- http://mail.python.org/mailman/listinfo/python-list
Re: Making safe file names
On 2013-05-10 12:04, Gregory Ewing wrote: Roy Smith wrote: http://en.wikipedia.org/wiki/The_band Nope... googling for the band brings that up as the very first result. The Google knows all. You cannot escape The Google... That does it. I'm naming my band Google. :-) -tkc -- http://mail.python.org/mailman/listinfo/python-list
Re: Making safe file names
In article mailman.1465.1368056269.3114.python-l...@python.org, Dennis Lee Bieber wlfr...@ix.netcom.com wrote: On Tue, 07 May 2013 18:10:25 -0500, Andrew Berg bahamutzero8...@gmail.com declaimed the following in gmane.comp.python.general: None of these would work because I would have no idea which file stores data for which artist without writing code to figure it out. If I were to end up writing a bug that messed up a few of my cache files and noticed it with a specific artist (e.g., doing a now playing and seeing the wrong tags), I would either have to manually match up the hash or base64 encoding in order to delete just that file so that it gets regenerated or nuke and regenerate my entire cache. And now you've seen why music players don't show the user the physical file name, but maintain a database mapping the internal data (name, artist, track#, album, etc.) to whatever mangled name was needed to satisfy the file system. Yup. At Songza, we deal with this crap every day. It usually bites us the worst when trying to do keyword searches. When somebody types in Blue Oyster Cult, they really mean Blue Oyster Cult, and our search results need to reflect that. Likewise for Ke$ha, Beyonce, and I don't even want to think about the artist formerly known as an unpronounceable glyph. Pro-tip, guys. If you want to form a band, and expect people to be able to find your stuff in a search engine some day, don't play cute with your name. -- http://mail.python.org/mailman/listinfo/python-list
Re: Making safe file names
On Thu, May 9, 2013 at 10:16 AM, Roy Smith r...@panix.com wrote: Pro-tip, guys. If you want to form a band, and expect people to be able to find your stuff in a search engine some day, don't play cute with your name. It's the modern equivalent of names like Catherine Withekay. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Making safe file names
On Wed, 08 May 2013 20:16:25 -0400, Roy Smith wrote: Yup. At Songza, we deal with this crap every day. It usually bites us the worst when trying to do keyword searches. When somebody types in Blue Oyster Cult, they really mean Blue Oyster Cult, Surely they really mean Blue Öyster Cult. and our search results need to reflect that. Likewise for Ke$ha, Beyonce, and I don't even want to think about the artist formerly known as an unpronounceable glyph. Dropped or incorrect accents are no different from any other misspelling, and good search engines (whether online or in a desktop application) should be able to deal with a tolerable number of misspellings. Googling for "Blue Oyster Cult" brings up four of the top ten hits spelled correctly with the accent, "Blue Öyster Cult". Even misspelled as "blew oytser cult", Google does the right thing. Even Bing manages to find Ke$ha's wikipedia page, her official website, youtube channel, facebook and myspace pages from the misspelling "kehsha". Pro-tip, guys. If you want to form a band, and expect people to be able to find your stuff in a search engine some day, don't play cute with your name. Googling for "the the" (including quotes) brings up 145 million hits, nine of the first ten hits being relevant to the band. On the other hand, I wouldn't want to be in a band called The Beetles. -- Steven
Re: Making safe file names
In article 518b00a2$0$29997$c3e8da3$54964...@news.astraweb.com, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: When somebody types in Blue Oyster Cult, they really mean Blue Oyster Cult, Surely they really mean Blue Öyster Cult. Yes. The oomlaut was there when I typed it. Who knows what happened to it by the time it hit the wire.
Re: Making safe file names
On 2013.05.08 19:16, Roy Smith wrote: Yup. At Songza, we deal with this crap every day. It usually bites us the worst when trying to do keyword searches. When somebody types in Blue Oyster Cult, they really mean Blue Oyster Cult, and our search results need to reflect that. Likewise for Ke$ha, Beyonce, and I don't even want to think about the artist formerly known as an unpronounceable glyph. Pro-tip, guys. If you want to form a band, and expect people to be able to find your stuff in a search engine some day, don't play cute with your name. It's a thing (especially in witch house) to make names with odd glyphs in order to be harder to find and be more underground. Very silly. Try doing searches for these artists with names like these: http://www.last.fm/music/%E2%96%BC%E2%96%A1%E2%96%A0%E2%96%A1%E2%96%A0%E2%96%A1%E2%96%A0 http://www.last.fm/music/ki%E2%80%A0%E2%80%A0y+c%E2%96%B2t -- CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1 -- http://mail.python.org/mailman/listinfo/python-list
Re: Making safe file names
On Wed, 08 May 2013 21:11:28 -0500, Andrew Berg wrote: It's a thing (especially in witch house) to make names with odd glyphs in order to be harder to find and be more underground. Very silly. Try doing searches for these artists with names like these: Challenge accepted. http://www.last.fm/music/%E2%96%BC%E2%96%A1%E2%96%A0%E2%96%A1%E2%96%A0%E2%96%A1%E2%96%A0 http://www.last.fm/music/ki%E2%80%A0%E2%80%A0y+c%E2%96%B2t The second one is trivial. Googling for kitty cat witch house (including quotes) gives at least 3 relevant links out of the top 4 hits. (I'm not sure about the Youtube page.) That gets you the correct spelling, ki††y c△t, and googling for that brings up many more hits. The first one is a tad trickier, since googling for ▼□■□■□■ brings up nothing at all, and mourning star doesn't give any relevant hits on the first page. But mourning star witch house (inc. quotes) is successful. I suspect that the only way to be completely ungoogleable would be to name yourself something common, not something obscure. Say, if you called yourself Hard Rock Band, and did hard rock. But then, googling for Heavy Metal alone brings up the magazine as the fourth hit, so if you get famous enough, even that won't work. -- Steven
Making safe file names
Currently, I keep Last.fm artist data caches to avoid unnecessary API calls and have been naming the files using the artist name. However, artist names can have characters that are not allowed in file names for most file systems (e.g., C/A/T has forward slashes). Are there any recommended strategies for naming such files while avoiding conflicts (I wouldn't want to run into problems for an artist named C-A-T or CAT, for example)? I'd like to make the files easily identifiable, and there really are no limits on what characters can be in an artist name. -- CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1 -- http://mail.python.org/mailman/listinfo/python-list
Re: Making safe file names
On 5/7/2013 3:58 PM, Andrew Berg wrote: Currently, I keep Last.fm artist data caches to avoid unnecessary API calls and have been naming the files using the artist name. However, artist names can have characters that are not allowed in file names for most file systems (e.g., C/A/T has forward slashes). Are there any recommended strategies for naming such files while avoiding conflicts (I wouldn't want to run into problems for an artist named C-A-T or CAT, for example)? I'd like to make the files easily identifiable, and there really are no limits on what characters can be in an artist name. Sounds like you want something like the html escape or urlencode functions, which serve the same purpose of encoding special chars. Rather than invent a new transformation, you could use the same scheme used for html entities. (Sorry, I forget the details.) It is possible that one of the functions would work for you as is, or with little modification. Terry
Re: Making safe file names
I suggest Base64. b64encode (http://docs.python.org/2/library/base64.html#base64.b64encode) and b64decode take an argument which allows you to eliminate the pesky / character. It's reversible and simple. More suggestions: how about a hash? Or just use IDs from the database? On Tue, May 7, 2013 at 8:58 PM, Andrew Berg bahamutzero8...@gmail.com wrote: Currently, I keep Last.fm artist data caches to avoid unnecessary API calls and have been naming the files using the artist name. However, artist names can have characters that are not allowed in file names for most file systems (e.g., C/A/T has forward slashes). Are there any recommended strategies for naming such files while avoiding conflicts (I wouldn't want to run into problems for an artist named C-A-T or CAT, for example)? I'd like to make the files easily identifiable, and there really are no limits on what characters can be in an artist name. -- CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1 -- http://mail.python.org/mailman/listinfo/python-list -- Fábio Santos -- http://mail.python.org/mailman/listinfo/python-list
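For comparison, here is a minimal sketch of the Base64 and hash suggestions, assuming Python 3 and UTF-8 as the byte encoding. The helper names (b64_name, b64_artist, sha1_name) are made up for illustration; altchars is the b64encode/b64decode argument that replaces + and /.

```python
import base64
import hashlib

# Hypothetical helpers, not part of any library.

def b64_name(artist):
    """Reversible file name: Base64 of the UTF-8 bytes, with '-' and '_'
    substituted for '+' and '/' so no path separator can appear."""
    data = artist.encode("utf-8")
    return base64.b64encode(data, altchars=b"-_").decode("ascii")

def b64_artist(name):
    """Inverse of b64_name."""
    data = base64.b64decode(name.encode("ascii"), altchars=b"-_")
    return data.decode("utf-8")

def sha1_name(artist):
    """One-way file name: 40 hex digits, with no route back to the artist."""
    return hashlib.sha1(artist.encode("utf-8")).hexdigest()

for artist in ("C/A/T", "C-A-T", "CAT"):
    print(artist, b64_name(artist), sha1_name(artist))
```

Both produce names that are safe to store, but neither can be read at a glance. Base64 output is also case-sensitive, which may matter on a case-insensitive file system.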
Re: Making safe file names
On 07/05/2013 20:58, Andrew Berg wrote: Currently, I keep Last.fm artist data caches to avoid unnecessary API calls and have been naming the files using the artist name. However, artist names can have characters that are not allowed in file names for most file systems (e.g., C/A/T has forward slashes). Are there any recommended strategies for naming such files while avoiding conflicts (I wouldn't want to run into problems for an artist named C-A-T or CAT, for example)? I'd like to make the files easily identifiable, and there really are no limits on what characters can be in an artist name. Conflicts won't occur if: 1. All of the characters of the artist's name are mapped to an encoding. 2. Different characters map to different encodings. 3. No encoding is a prefix of another encoding. In practice, you'll be mapping most characters to themselves. -- http://mail.python.org/mailman/listinfo/python-list
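Conditions 2 and 3 can be checked mechanically for any proposed escape table. The sketch below is only an illustration: is_conflict_free, the sample alphabet, and both tables are hypothetical, and any character missing from a table is assumed to map to itself (which is how condition 1 is satisfied here).

```python
from itertools import permutations

def is_conflict_free(table, alphabet):
    """Return True if the encodings of the characters in `alphabet` are all
    different and none of them is a prefix of another one."""
    # Characters without an entry in the table map to themselves.
    codes = [table.get(ch, ch) for ch in sorted(set(alphabet))]
    if len(set(codes)) != len(codes):
        return False  # two characters share an encoding
    return not any(b.startswith(a) for a, b in permutations(codes, 2))

alphabet = "ACT-/%25F"

# '/' is escaped and so is the escape character itself.
good = {"/": "%2F", "%": "%25"}

# '/' is escaped, but '%' is left alone, so "%" is a prefix of "%2F".
bad = {"/": "%2F"}

print(is_conflict_free(good, alphabet))
print(is_conflict_free(bad, alphabet))
```

With the second table, the names C/A and C%2FA would both be stored as C%2FA, which is the kind of conflict the three conditions rule out.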
Re: Making safe file names
On 5/7/13, Andrew Berg bahamutzero8...@gmail.com wrote: Currently, I keep Last.fm artist data caches to avoid unnecessary API calls and have been naming the files using the artist name. However, artist names can have characters that are not allowed in file names for most file systems (e.g., C/A/T has forward slashes). Are there any recommended strategies for naming such files while avoiding conflicts (I wouldn't want to run into problems for an artist named C-A-T or CAT, for example)? I'd like to make the files easily identifiable, and there really are no limits on what characters can be in an artist name. You might consider: http://stromberg.dnsalias.org/svn/backshift/trunk/escape_mod.py http://stromberg.dnsalias.org/svn/backshift/trunk/test-escape_mod It doubles the length of the string, but it produces safe, easily readable escaped strings - which tends to make debugging easier. It requires a couple of other modules (easily obtained from the same SVN repo) though. -- http://mail.python.org/mailman/listinfo/python-list
Re: Making safe file names
Andrew Berg bahamutzero8...@gmail.com wrote: Currently, I keep Last.fm artist data caches to avoid unnecessary API calls and have been naming the files using the artist name. However, artist names can have characters that are not allowed in file names for most file systems (e.g., C/A/T has forward slashes). Are there any recommended strategies for naming such files while avoiding conflicts (I wouldn't want to run into problems for an artist named C-A-T or CAT, for example)? I'd like to make the files easily identifiable, and there really are no limits on what characters can be in an artist name. -- It's not clear what context you need this for. You could e.g. replace all characters not allowed by the file system by their hexadecimal (ASCII) values, preceded by a '%' (so '/' would be changed to '%2F', and also encode a '%' itself in a name by '%25'). Then you have a well-defined two-way mapping (isomorphic if I remember my math-learning days correctly) between the original name and the way you store it. E.g. C/A/T would become C%2FA%2FT and C%2FA/T would become C%252FA%2FT. You can translate back and forth between them with not too much effort. Of course, that assumes that '%' is a character allowed by your file system - otherwise pick some other one, any one will do in principle. It's a bit harder for a human to interpret but rather likely not that much of a problem. You probably will have seen that kind of scheme used in URLs. The concept is rather old and called 'escape character', i.e. have one character that assumes some special meaning and is itself escaped as well. If, on the other hand, those names are never to be translated back to the original name another strategy would be to use the SHA1 hash value of the artist's name. Since clashes between SHA1 hash values are rather hard to produce it's a rather safe method of converting something (i.e. the artist's name) to a number.
The drawback, of course, is that you can't translate back from the hash value to the original name (if that would be simple the whole thing wouldn't work;-) Regards, Jens -- Jens Thoms Toerring, j...@toerring.de, http://toerring.de
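The escape-character scheme described above might be sketched roughly as follows. This is a sketch under assumptions, not a definitive implementation: the names escape_name and unescape_name are invented here, and the FORBIDDEN set is an assumption (the characters NTFS rejects, plus @ and NUL as raised elsewhere in the thread). Everything else, including non-ASCII text, is left untouched.

```python
import re

# Assumed set of characters to escape; adjust for the file systems you target.
FORBIDDEN = '\\/:*?"<>|@\0'
ESCAPE = "%"

def escape_name(artist):
    """Replace each forbidden character, and the escape character itself,
    with '%' followed by two uppercase hex digits."""
    out = []
    for ch in artist:
        if ch == ESCAPE or ch in FORBIDDEN:
            out.append("%{:02X}".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)

def unescape_name(name):
    """Inverse of escape_name: turn every %XX sequence back into its character."""
    return re.sub(r"%([0-9A-F]{2})", lambda m: chr(int(m.group(1), 16)), name)

for artist in ("C/A/T", "C%2FA/T", "C-A-T", "CAT", "植松伸夫"):
    stored = escape_name(artist)
    print(artist, "->", stored, "->", unescape_name(stored))
```

Because '%' is escaped too, C/A/T and C%2FA/T end up with different stored names, and names without forbidden characters are stored unchanged.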
Re: Making safe file names
On Wed, May 8, 2013 at 8:18 AM, Fábio Santos fabiosantos...@gmail.com wrote: I suggest Base64. b64encode (http://docs.python.org/2/library/base64.html#base64.b64encode) and b64decode take an argument which allows you to eliminate the pesky / character. It's reversible and simple. But it doesn't look anything like the original. I'd be inclined to go for something like quoted-printable or URL-encoding; special characters become much longer, but ordinary characters (mostly) stay as themselves. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Making safe file names
On 2013.05.07 17:18, Fábio Santos wrote: I suggest Base64. b64encode (http://docs.python.org/2/library/base64.html#base64.b64encode) and b64decode take an argument which allows you to eliminate the pesky / character. It's reversible and simple. More suggestions: how about a hash? Or just use IDs from the database? None of these would work because I would have no idea which file stores data for which artist without writing code to figure it out. If I were to end up writing a bug that messed up a few of my cache files and noticed it with a specific artist (e.g., doing a now playing and seeing the wrong tags), I would either have to manually match up the hash or base64 encoding in order to delete just that file so that it gets regenerated or nuke and regenerate my entire cache. -- CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1 -- http://mail.python.org/mailman/listinfo/python-list
Re: Making safe file names
On 2013.05.07 17:01, Terry Jan Reedy wrote: Sounds like you want something like the html escape or urlencode functions, which serve the same purpose of encoding special chars. Rather than invent a new transformation, you could use the same scheme used for html entities. (Sorry, I forget the details.) It is possible that one of the functions would work for you as is, or with little modification. This has the problem of mangling non-ASCII characters (and artist names with non-ASCII characters are not rare). I most definitely want to keep as many characters untouched as possible so that the files are easy to identify by looking at the file name. Ideally, only characters that file systems don't like would be transformed. -- CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1
Re: Making safe file names
On 05/07/2013 03:58 PM, Andrew Berg wrote: Currently, I keep Last.fm artist data caches to avoid unnecessary API calls and have been naming the files using the artist name. However, artist names can have characters that are not allowed in file names for most file systems (e.g., C/A/T has forward slashes). Are there any recommended strategies for naming such files while avoiding conflicts (I wouldn't want to run into problems for an artist named C-A-T or CAT, for example)? I'd like to make the files easily identifiable, and there really are no limits on what characters can be in an artist name. So what you need first is a list of allowable characters for all your target OS versions. And don't forget that the allowable characters may vary depending on the particular file system(s) mounted on a given OS. You also need to decide how to handle Unicode characters, since they're different for different OS. In Windows on NTFS, filenames are in Unicode, while on Unix, filenames are bytes. So on one of those, you will be encoding/decoding if your code is to be mostly portable. Don't forget that ls and rm may not use the same encoding you're using. So you may not consider it adequate to make the names legal, but you may also want them to be easily typeable in the shell. -- DaveA
Re: Making safe file names
In article mailman.1428.1367972114.3114.python-l...@python.org, Dave Angel da...@davea.name wrote: On 05/07/2013 03:58 PM, Andrew Berg wrote: Currently, I keep Last.fm artist data caches to avoid unnecessary API calls and have been naming the files using the artist name. However, artist names can have characters that are not allowed in file names for most file systems (e.g., C/A/T has forward slashes). Are there any recommended strategies for naming such files while avoiding conflicts (I wouldn't want to run into problems for an artist named C-A-T or CAT, for example)? I'd like to make the files easily identifiable, and there really are no limits on what characters can be in an artist name. So what you need first is a list of allowable characters for all your target OS versions. And don't forget that the allowable characters may vary depending on the particular file system(s) mounted on a given OS. You also need to decide how to handle Unicode characters, since they're different for different OS. In Windows on NTFS, filenames are in Unicode, while on Unix, filenames are bytes. So on one of those, you will be encoding/decoding if your code is to be mostly portable. Don't forget that ls and rm may not use the same encoding you're using. So you may not consider it adequate to make the names legal, but you may also want them to be easily typeable in the shell. One possible tool that may help you here is unidecode (https://pypi.python.org/pypi/Unidecode). It doesn't solve your whole problem, but it does help get unicode text into a form which is both 7-bit clean and human readable.
Re: Making safe file names
On 2013.05.07 17:37, Jens Thoms Toerring wrote: You could e.g. replace all characters not allowed by the file system by their hexadecimal (ASCII) values, preceded by a '%' (so '/' would be changed to '%2F', and also encode a '%' itself in a name by '%25'). Then you have a well-defined two-way mapping (isomorphic if I remember my math-learning days correctly) between the original name and the way you store it. E.g. C/A/T would become C%2FA%2FT and C%2FA/T would become C%252FA%2FT. You can translate back and forth between them with not too much effort. Of course, that assumes that '%' is a character allowed by your file system - otherwise pick some other one, any one will do in principle. It's a bit harder for a human to interpret but rather likely not that much of a problem. Yes, something like this is what I am trying to achieve. Judging by the responses I've gotten so far, I think I'll have to roll my own transformation scheme since URL encoding and the like transform Unicode characters. I can memorize that 植松伸夫 is a Japanese composer who is well-known for his works in the Final Fantasy series of video games. Trying to match up the URL-encoded version to an artist would be almost impossible when I have several other artist names that have no ASCII characters. -- CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1
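To illustrate the objection, here is what the standard library's URL quoting does to such a name. This is a small demonstration assuming Python 3's urllib.parse; quote() always percent-encodes non-ASCII characters (as UTF-8 bytes by default), and its safe argument can only exempt ASCII characters.

```python
from urllib.parse import quote, unquote

artist = "植松伸夫"

# safe="" also encodes '/', which quote() would otherwise leave alone.
encoded = quote(artist, safe="")
print(encoded)

# The mapping is reversible, just not readable.
print(unquote(encoded) == artist)

# An ASCII-only name with a slash comes out much as in the scheme quoted above.
print(quote("C/A/T", safe=""))
```

So URL quoting is a valid two-way mapping, but every character of a name like this one is replaced by three escape sequences.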
Re: Making safe file names
On 2013.05.07 19:14, Dave Angel wrote: You also need to decide how to handle Unicode characters, since they're different for different OS. In Windows on NTFS, filenames are in Unicode, while on Unix, filenames are bytes. So on one of those, you will be encoding/decoding if your code is to be mostly portable. Characters outside whatever sys.getfilesystemencoding() returns won't be allowed. If the user's locale settings don't support Unicode, my program will be far from the only one to have issues with it. Any problem reports that arise from a user moving between legacy encodings will generally be ignored. I haven't yet decided how I will handle artist names with characters outside UTF-8, but inside UTF-16/32 (UTF-16 is just fine on Windows/NTFS, but on Unix(-ish) systems, many use UTF-8 in their locale settings). Don't forget that ls and rm may not use the same encoding you're using. So you may not consider it adequate to make the names legal, but you may also want them to be easily typeable in the shell. I don't understand. I have no intention of changing Unicode characters. This is not a Unicode issue since (modern) file systems will happily accept it. The issue is that certain characters (which are ASCII) are not allowed on some file systems: \ / : * ? " < > | @ and the NUL character. The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow, and NUL and / are not allowed on pretty much any file system. Locale settings and encodings aside, these 11 characters will need to be escaped. -- CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1
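Whether a given name fits the file system encoding can be tested up front. The following is a rough sketch only: encodable is a hypothetical helper, and the check says nothing about which characters the file system itself rejects.

```python
import sys

def encodable(name, encoding=None):
    """Return True if `name` can be encoded in `encoding`, which defaults
    to whatever sys.getfilesystemencoding() reports for this system."""
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

print(sys.getfilesystemencoding())
for artist in ("CAT", "Blue Öyster Cult", "植松伸夫"):
    print(artist, encodable(artist), encodable(artist, "latin-1"))
```

Passing an explicit encoding makes it possible to see what would happen under a legacy locale without changing the system settings.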
Re: Making safe file names
On 05/07/2013 08:51 PM, Andrew Berg wrote: On 2013.05.07 19:14, Dave Angel wrote: You also need to decide how to handle Unicode characters, since they're different for different OS. In Windows on NTFS, filenames are in Unicode, while on Unix, filenames are bytes. So on one of those, you will be encoding/decoding if your code is to be mostly portable. Characters outside whatever sys.getfilesystemencoding() returns won't be allowed. If the user's locale settings don't support Unicode, my program will be far from the only one to have issues with it. Any problem reports that arise from a user moving between legacy encodings will generally be ignored. I haven't yet decided how I will handle artist names with characters outside UTF-8, There aren't any characters outside UTF-8. But a character is not in utf-8, it can be encoded by utf-8. but inside UTF-16/32 (UTF-16 Nor outside UTF-16 or 32. is just fine on Windows/NTFS, but on Unix(-ish) systems, many use UTF-8 in their locale settings). Don't forget that ls and rm may not use the same encoding you're using. So you may not consider it adequate to make the names legal, but you may also want them to be easily typeable in the shell. I don't understand. I have no intention of changing Unicode characters. So you're comfortable typing arbitrary characters? What about all the characters that have identical displays in your font? What about viewing 0x07 in the terminal window? Or 0x04? This is not a Unicode issue since (modern) file systems will happily accept it. The issue is that certain characters (which are ASCII) are not allowed on some file systems: \ / : * ? " < > | @ and the NUL character. The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow, and NUL and / are not allowed on pretty much any file system. Locale settings and encodings aside, these 11 characters will need to be escaped. As soon as you have a small, finite list of invalid characters, writing an escape system is pretty easy.
-- DaveA
Re: Making safe file names
Andrew Berg: This is not a Unicode issue since (modern) file systems will happily accept it. The issue is that certain characters (which are ASCII) are not allowed on some file systems: \ / : * ? " < > | @ and the NUL character. The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow, and NUL and / are not allowed on pretty much any file system. Locale settings and encodings aside, these 11 characters will need to be escaped. There's also the Windows device name hole. There may be trouble with artists named 'COM4', 'CLOCK$', 'Con', or similar. http://support.microsoft.com/kb/74496 http://en.wikipedia.org/wiki/Nul_%28band%29 Neil
Re: Making safe file names
On 05/07/2013 09:28 PM, Neil Hodgson wrote: Andrew Berg: This is not a Unicode issue since (modern) file systems will happily accept it. The issue is that certain characters (which are ASCII) are not allowed on some file systems: \ / : * ? " < > | @ and the NUL character. The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow, and NUL and / are not allowed on pretty much any file system. Locale settings and encodings aside, these 11 characters will need to be escaped. There's also the Windows device name hole. There may be trouble with artists named 'COM4', 'CLOCK$', 'Con', or similar. In MSDOS 2, there was a switch that would tell the OS to ignore such names unless they were prefixed by \DEV. But like the switchar switch, it was largely ignored by the ignorant, and probably doesn't exist in current versions of M$OS http://support.microsoft.com/kb/74496 http://en.wikipedia.org/wiki/Nul_%28band%29 Neil While we're looking for trouble, there's also case insensitivity. Unclear if the user cares, but tom and TOM are the same file in most configurations of NT. -- DaveA
Re: Making safe file names
On 2013.05.07 20:28, Neil Hodgson wrote: http://support.microsoft.com/kb/74496 http://en.wikipedia.org/wiki/Nul_%28band%29 I can indeed confirm that at least 'nul' cannot be used as a filename. However, I add an extension to the file names to identify them as caches. -- CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1 -- http://mail.python.org/mailman/listinfo/python-list
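A check for the device names might look something like the sketch below. The RESERVED set is an assumption: it combines the names mentioned in this thread with the commonly documented CON, PRN, AUX, NUL, COM1-COM9 and LPT1-LPT9, and may not be complete for every Windows version. The function is_reserved is hypothetical. It deliberately ignores any extension, on the cautious assumption that a name such as nul.cache can still be treated as the device.

```python
# Assumed list of Windows device names; compared case-insensitively.
RESERVED = (
    {"CON", "PRN", "AUX", "NUL", "CLOCK$"}
    | {"COM%d" % i for i in range(1, 10)}
    | {"LPT%d" % i for i in range(1, 10)}
)

def is_reserved(filename):
    """Return True if the part of `filename` before the first dot,
    with trailing spaces removed, is a reserved device name."""
    stem = filename.partition(".")[0].rstrip(" ")
    return stem.upper() in RESERVED

for name in ("nul", "NUL.cache", "Con", "COM4", "CLOCK$", "Converge", "CAT.cache"):
    print(name, is_reserved(name))
```

A name that matches could then be run through whatever escaping is used for forbidden characters, for example by escaping its first letter.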
Re: Making safe file names
On 2013.05.07 20:45, Dave Angel wrote: While we're looking for trouble, there's also case insensitivity. Unclear if the user cares, but tom and TOM are the same file in most configurations of NT. Artist names on Last.fm cannot differ only in case. This does remind me to make sure to update the case of the artist name as necessary, though. For example, if Sam becomes SAM again (I have seen Last.fm change the case for artist names), I need to make sure that I don't end up with two file names differing only in case. -- CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1 -- http://mail.python.org/mailman/listinfo/python-list
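Names that differ only in case can be spotted before anything is written to disk. A minimal sketch, assuming str.casefold() is a good enough stand-in for the file system's own case rules (it is an approximation; NTFS and HFS+ each have their own tables). The helper case_collisions is hypothetical.

```python
from collections import defaultdict

def case_collisions(names):
    """Group names that become identical once case is ignored and
    return only the groups with more than one member."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return [group for group in groups.values() if len(group) > 1]

print(case_collisions(["Sam", "SAM", "CAT", "tom", "Tim"]))
```

Running this over the cache directory listing plus the name about to be written would show whether a rename is needed after Last.fm changes an artist's capitalisation.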
Re: Making safe file names
On 2013.05.07 20:13, Dave Angel wrote:
> So you're comfortable typing arbitrary characters? What about all the characters that have identical displays in your font?

Identification is more important than typing. I can copy and paste into a terminal if necessary. I don't foresee typing out one of the filenames being anything more than a rare occurrence, but I will occasionally just read the list.

> What about viewing 0x07 in the terminal window? Or 0x04?

I don't think Last.fm will even send those characters. In any case, control characters in artist names are rare enough that it's not worth the trouble to write the code to avoid the problems associated with them.

> As soon as you have a small, finite list of invalid characters, writing an escape system is pretty easy.

Probably. I was just hoping there was an existing system that would work, but as I said in a different reply, it would seem I need to roll my own.

--
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1
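For the "small, finite list of invalid characters" case, one possible roll-your-own escape system is percent-escaping, sketched below. Nothing here comes from the thread itself: UNSAFE, escape_name and unescape_name are made-up names, and the character set is an assumption (the nine characters Windows rejects, \ / : * ? " < > |, plus @ and NUL, plus % so that the escape character stays unambiguous).

```python
# Assumed set of characters to escape; '%' is included because it is the
# escape character and must itself be escaped to keep the mapping reversible.
UNSAFE = set('\\/:*?"<>|@\0%')


def escape_name(name):
    """Replace each unsafe character with %XX (two uppercase hex digits)."""
    return ''.join(
        '%{:02X}'.format(ord(c)) if c in UNSAFE else c
        for c in name
    )


def unescape_name(escaped):
    """Reverse escape_name(): turn every %XX back into its character."""
    out = []
    i = 0
    while i < len(escaped):
        if escaped[i] == '%':
            # The two characters after '%' are the hex code of the original.
            out.append(chr(int(escaped[i + 1:i + 3], 16)))
            i += 3
        else:
            out.append(escaped[i])
            i += 1
    return ''.join(out)
```

Because every unsafe character gets its own escape and everything else passes through unchanged, two different artist names never map to the same file name. The sketch does nothing about reserved device names or case-insensitive file systems.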
Re: Making safe file names
In article mailman.1435.1367977523.3114.python-l...@python.org, Dave Angel da...@davea.name wrote:
> While we're looking for trouble, there's also case insensitivity. Unclear if the user cares, but tom and TOM are the same file in most configurations of NT.

OSX, too.
Re: Making safe file names
On Tue, 07 May 2013 19:51:24 -0500, Andrew Berg wrote:
> On 2013.05.07 19:14, Dave Angel wrote:
>> You also need to decide how to handle Unicode characters, since they're different for different OS. In Windows on NTFS, filenames are in Unicode, while on Unix, filenames are bytes. So on one of those, you will be encoding/decoding if your code is to be mostly portable.
>
> Characters outside whatever sys.getfilesystemencoding() returns won't be allowed. If the user's locale settings don't support Unicode, my program will be far from the only one to have issues with it. Any problem reports that arise from a user moving between legacy encodings will generally be ignored. I haven't yet decided how I will handle artist names with characters outside UTF-8, but inside UTF-16/32 (UTF-16 is just fine on Windows/NTFS, but on Unix(-ish) systems, many use UTF-8 in their locale settings).

There aren't any characters outside of UTF-8 :-) UTF-8 covers the entire Unicode range, unlike other encodings like Latin-1 or ASCII.

Well, that is to say, there may be characters that are not (yet) handled at all by Unicode, but there are no known legacy encodings that support such characters. To a first approximation, Unicode covers the entire set of characters in human use, and for those which it does not, there is always the private use area.

So for example, if you wish to record the Artist Formerly Known As The Artist Formerly Known As Prince as Love Symbol, you could pick an arbitrary private use code point, declare that for your application that code point means Love Symbol, and use that code point as the artist name. You could even come up with a custom font that includes a rendition of that character glyph.

However, there are byte combinations which are not valid UTF-8, which is a different story. If you're receiving bytes from (say) a file name, they may not necessarily make up a valid UTF-8 string. But this is not an issue if you are receiving data from something guaranteed to be valid UTF-8.

>> Don't forget that ls and rm may not use the same encoding you're using. So you may not consider it adequate to make the names legal, but you may also want them easily typeable in the shell.
>
> I don't understand. I have no intention of changing Unicode characters.

Of course you do. You even talk below about Unicode characters like * and ? not being allowed on NTFS systems.

Perhaps you are thinking that there are a bunch of characters over here called plain text ASCII characters, and a *different* bunch of characters with funny accents and stuff called Unicode characters. If so, then you are labouring under a misapprehension, and you should start off by reading this:

http://www.joelonsoftware.com/articles/Unicode.html

then come back with any questions.

> This is not a Unicode issue since (modern) file systems will happily accept it. The issue is that certain characters (which are ASCII) are not allowed on some file systems:
> \ / : * ? " < > | @ and the NUL character

These are all Unicode characters too. Unicode is a subset of ASCII, so anything which is ASCII is also Unicode.

> The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow, and NUL and / are not allowed on pretty much any file system. Locale settings and encodings aside, these 11 characters will need to be escaped.

If you have an artist with control characters in their name, like newline or carriage return or NUL, I think it is fair to just drop the control characters and then give the artist a thorough thrashing with a halibut.

Does your mapping really need to be guaranteed reversible? If you have an artist called JoeBlow, and another artist called Joe\0Blow, and a third called Joe\nBlow, does it *really* matter if your application conflates them?

--
Steven
Re: Making safe file names
On 05/07/2013 10:06 PM, Andrew Berg wrote:
> On 2013.05.07 20:28, Neil Hodgson wrote:
>> http://support.microsoft.com/kb/74496
>> http://en.wikipedia.org/wiki/Nul_%28band%29
>
> I can indeed confirm that at least 'nul' cannot be used as a filename. However, I add an extension to the file names to identify them as caches.

Won't help. NUL.txt is just as reserved as NUL is. Extensions are ignored in this particular piece of historical nonsense.

--
DaveA
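A rough sketch of a check for the device-name hole, extension included. RESERVED and is_reserved are hypothetical names; the list is the commonly documented set of Windows device names (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9) plus CLOCK$ from the older MS-DOS list mentioned in the thread, and it may well be incomplete for any particular Windows version.

```python
# Assumed list of reserved device names; compared case-insensitively.
RESERVED = {'CON', 'PRN', 'AUX', 'NUL', 'CLOCK$'}
RESERVED.update('COM{}'.format(n) for n in range(1, 10))
RESERVED.update('LPT{}'.format(n) for n in range(1, 10))


def is_reserved(filename):
    """Return True if filename would hit a reserved device name.

    Everything from the first dot onward is ignored, so 'NUL.txt' is
    treated the same as 'NUL'.
    """
    stem = filename.split('.', 1)[0]
    return stem.upper() in RESERVED
```

A name that trips the check would need a prefix or some other change to the part before the first dot, since adding or changing the extension makes no difference.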
Re: Making safe file names
On 05/07/2013 11:40 PM, Steven D'Aprano wrote:

SNIP

> These are all Unicode characters too. Unicode is a subset of ASCII, so anything which is ASCII is also Unicode.

Typo. You meant Unicode is a superset of ASCII.

--
DaveA
Re: Making safe file names
On Wed, 08 May 2013 00:13:20 -0400, Dave Angel wrote:
> On 05/07/2013 11:40 PM, Steven D'Aprano wrote:
>
> SNIP
>
>> These are all Unicode characters too. Unicode is a subset of ASCII, so anything which is ASCII is also Unicode.
>
> Typo. You meant Unicode is a superset of ASCII.

Damn. Yes, you're right. I was thinking superset, but my fingers typed subset. Thanks for the correction.

--
Steven
Re: Making safe file names
On 2013.05.07 22:40, Steven D'Aprano wrote:
> There aren't any characters outside of UTF-8 :-) UTF-8 covers the entire Unicode range, unlike other encodings like Latin-1 or ASCII.

You are correct. I'm not sure what I was thinking.

>> I don't understand. I have no intention of changing Unicode characters.
>
> Of course you do. You even talk below about Unicode characters like * and ? not being allowed on NTFS systems.

I worded that incorrectly. What I meant, of course, is that I intend to preserve as many characters as possible and have no need to stay within ASCII.

> If you have an artist with control characters in their name, like newline or carriage return or NUL, I think it is fair to just drop the control characters and then give the artist a thorough thrashing with a halibut.

While the thrashing with a halibut may be warranted (though I personally would use a rubber chicken), conflicts are problematic.

> Does your mapping really need to be guaranteed reversible? If you have an artist called JoeBlow, and another artist called Joe\0Blow, and a third called Joe\nBlow, does it *really* matter if your application conflates them?

Yes and yes. Some artists like to be real cute with their names and make witch house artist names look tame in comparison, and some may choose to use names similar to some very popular artists. I've also seen people scrobble fake artists with names that look like real artist names (using things like a non-breaking space instead of a regular space) with different artist pictures in order to confuse and troll people. If I could remember the user profiles with this, I'd link them. Last.fm is a silly place.

As I said before though, I don't think control characters are even allowed in artist names (likely for technical reasons).

--
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1
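When two artist names look identical on screen, spelling out each character by its Unicode name is one way to see where they differ. A small sketch using the standard unicodedata module; describe is a hypothetical helper, and the U+XXXX fallback for characters without a name (control characters, for instance) is just a convention chosen here.

```python
import unicodedata


def describe(name):
    """List the Unicode name of every character in an artist name.

    Characters that have no name get a 'U+XXXX' placeholder instead.
    """
    return [
        unicodedata.name(c, 'U+{:04X}'.format(ord(c)))
        for c in name
    ]
```

With this, a name containing a regular space and a look-alike containing a non-breaking space produce visibly different lists (SPACE versus NO-BREAK SPACE), even though the two strings render the same.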