Re: [OSM-dev] Disallowing certain characters in tag keys
On Tue, Oct 19, 2010 at 11:52:09AM +0100, Andy Allan wrote: On Tue, Oct 19, 2010 at 10:25 AM, Jochen Topf joc...@remote.org wrote: On Tue, Oct 19, 2010 at 10:06:15AM +0100, Tom Hughes wrote: On 16/10/10 19:44, Jochen Topf wrote: I am currently fighting some issues where tags with strange characters in them need to be represented in a URL for Taginfo. Lots of other websites probably will have similar issues. Characters like /, ?,, etc. have special meaning in URLs so if they appear in tags I can't have those tags in URLs. Sometimes escaping characters as %XX helps, sometimes not. And those problems are not confined to web pages and URLs only. Special characters that need escaping are often a problem. I really don't understand the problem here - as far as I know all characters can be used in URLs so long as they are properly escaped. If your server software is not coping with that for some reason then I think it's a bug. That might well be a bug. But those bugs creep up all the time, because these things are hard to do and because the specs are not as clear as they should be. I am not saying these things can't be done right, but wouldn't it be nice if we can get rid of that problem instead of everybody writing software for OSM having to make sure all those cases are handled properly? As a test I just created a file called '+?#;%.html' in an apache served directory and then asked Firefox to fetch: http://server/%3c%3e%26%2b%3f%23%3b%25.html and it was retrieved just fine. And now try the same thing again creating a filename with a '/' in it and see whether it works this time. It doesn't, because '/' is special for Unix (and HTTP) and you need to create a directory with the first part of your name and then the second as file. If you would actually want to create one file for every key in the OSM database in your filesystem, you'd have a problem. You example basically proves my point. :-) No, it really doesn't. Obviously I haven't made my point clear enough. 
I am saying, those special characters don't work like normal characters in many cases. They have special meanings: as directory separators, for instance, or in URLs, HTML code or programming languages. So whenever you do anything where those characters can appear, you have to take special care that your code doesn't break. And programmers are notoriously bad at taking that special care.
> Let's put it this way - there is a subset[1] of unicode code points that is valid for both keys and values. If you find any characters emitted by OSM that lie outwith that range, then do let us know[3] But we've taken great care to permit all other code points in both keys and values alike, since we've no idea when someone is going to need them. Your example of why we need > (and presumably <) is actually a great example to undermine your point.
It's really a matter of weighing the different cases. On the one hand it makes sense to allow everything, because you never know what you will need. But on the other hand it makes sense to restrict what you allow to make handling easier. We have restricted the number of characters in keys and values, for instance. There are certainly cases where it would be nice to have more characters, but for practical reasons they are restricted. We have put in a restriction that a key can only appear once on an object. That's also for practical purposes. I am arguing that there are other things we can do to make working with OSM tags more convenient at, I think, no extra cost.
Look at what happened with email addresses: you can have nearly every ASCII character in email addresses (spaces and double quotes are allowed, for instance), but you have to escape them in the right way. Real mail programs can generally handle that. But most scripts that people write don't. The result is that in practice you can't use all those characters in email addresses, because they work only half the time.
If you send programmers to the RFC and ask them to implement it properly, they can't figure out how to do that and give up. And each one implements his own system, each having his own list of characters that work and that don't work. The end-result is a rather small list of characters that always work and some that work sometimes. (See the details at http://www.remote.org/jochen/mail/info/chars.html ) I argue that if we disallow some characters we can actually expect developers to implement our spec, if we leave the spec open too much, people will ignore the difficult parts. If too many programs don't work with the difficult bits those tags will in practice not be usable anyway, so why not forbid them outright and all have an easier life? Some of these characters need escaping for particular purposes. If you find a unicode character that cannot be URLencoded[2], then do let us know. Or if you find another encoding scenario which can only encode a
Re: [OSM-dev] Disallowing certain characters in tag keys
On Sun, Oct 17, 2010 at 09:48:31AM +0200, Ulf Lamping wrote: Am 16.10.2010 20:44, schrieb Jochen Topf: Hi! I am currently fighting some issues where tags with strange characters in them need to be represented in a URL for Taginfo. Lots of other websites probably will have similar issues. Characters like /, ?,, etc. have special meaning in URLs so if they appear in tags I can't have those tags in URLs. Sometimes escaping characters as %XX helps, sometimes not. And those problems are not confined to web pages and URLs only. Special characters that need escaping are often a problem. Yes, special characters can cause headaches. I remember this from my own tag analyzing experiments and other software projects ;-) I agree with you that most (all?) of them are (usually unintended) bugs. For example: Not long ago, it was a common tag problem that keys started or ended with a space char. IIRC the xybot regularly fixes those bugs now. However, as those characters can be used in the name values (and elsewhere), you have to deal with the correct handling of special characters in your software anyway, so I'm not sure if disallowing special characters in the key will really help us in that regard. The problem with disallowing special characters is that you close the door. Software writers will then write software that depends on them not being there (or not caring which is probably the common case today). If we later find out that - for whatever reasons - we want to use one of those chars this will become extremely difficult, as it will cause trouble at many places in existing software. Thats absolutely true. Thats why I am only proposing a very small list and don't include characters like {} that are not used now, but might make sense in the future. What we currently don't have is a guideline for mappers. I'm missing (and thinking to write for some time) a: How to write good tags. 
> To my knowledge we don't have a written guide (not rule) saying that we tend to use lower case chars, underscores instead of spaces and all those unwritten rules. Of course, this could include: don't use special chars like /, ?, ... in keys - because this makes it hard for software writers.
I agree that we should come up with these guidelines, but that's really a different issue.
> I tend to simply ignore keys with special chars - as we do it today ...
Which works well for lots of software (like renderers, which don't care about the things they don't understand). Unfortunately it doesn't work for editors or something like Taginfo, which needs to work with *all* legal tags.
Jochen -- Jochen Topf joc...@remote.org http://www.remote.org/jochen/ +49-721-388298 ___ dev mailing list dev@openstreetmap.org http://lists.openstreetmap.org/listinfo/dev
Re: [OSM-dev] Disallowing certain characters in tag keys
On Sun, Oct 17, 2010 at 04:57:33PM -0400, Anthony wrote:
> On Sat, Oct 16, 2010 at 2:44 PM, Jochen Topf joc...@remote.org wrote:
> > Technically this would mean changing the API to check for those characters, removing any that are already in the database (can be done with normal manual edits because there are so few cases) and adding checks to the editors so that they can give meaningful error messages.
> To be clear, they'd still be in the database, in the history. Which is one implementation problem, because it means putting checks in more than one place. At the very least, the regular API, and the Potlatch API, but there are probably multiple places within the regular API where things would need to be checked.
But that's far fewer places than all the software out there. The whole point of an API is that it's a sort of choke point, a single place where things can be checked.
> And then any software which relies on these changes wouldn't work with historical data.
That's a problem, you are right. We could solve that by faking the history. It would not be the first time this has been done, so it would be possible. But most software out there only deals with current data. So even if we keep the history, that software would be made easier.
> It could be done, but to do all that work just to make it easier to code Taginfo would be, in my opinion, a waste. Especially when there are plenty of simple solutions within taginfo. If URL encoding is too painful, use a modified base64 encoding of the unicode string (using - and _ instead of + and /).
It's not only Taginfo. Every piece of software out there would be made easier. If this were a Taginfo-only problem I wouldn't propose it. One of the biggest problems is that Taginfo doesn't work alone, but wants to work with other tools. If I use base64 encoding then people would need to link to something like http://taginfo.openstreetmap.de/keys/aGlnaHdheQo= instead of http://taginfo.openstreetmap.de/keys/highway.
And the link to XAPI would then not be http://www.informationfreeway.org/api/0.6/*[highway=*] but http://www.informationfreeway.org/api/0.6/*[aGlnaHdheQo==*] instead, which is not very user friendly. And then every service would probably use a different encoding scheme... I have actually thought about that and might offer a secondary interface to Taginfo using base64 or something like it if I can't avoid it. But that's really ugly and probably nobody would use it anyway, because nobody wants to write special cases for the few keys that use those characters and are bogus anyway.
> For cleaning up the keys, I'd want to strip down to as few characters as possible. There's no point supporting most unicode characters - keys are supposed to be in English.
No. English people should be allowed to use their own language if they want to. So should speakers of every other language on the planet, too.
Jochen -- Jochen Topf joc...@remote.org http://www.remote.org/jochen/ +49-721-388298
Re: [OSM-dev] Disallowing certain characters in tag keys
On 16/10/10 19:44, Jochen Topf wrote:
> I am currently fighting some issues where tags with strange characters in them need to be represented in a URL for Taginfo. Lots of other websites probably will have similar issues. Characters like /, ?, &, etc. have special meaning in URLs so if they appear in tags I can't have those tags in URLs. Sometimes escaping characters as %XX helps, sometimes not. And those problems are not confined to web pages and URLs only. Special characters that need escaping are often a problem.
I really don't understand the problem here - as far as I know all characters can be used in URLs so long as they are properly escaped. If your server software is not coping with that for some reason then I think it's a bug.
As a test I just created a file called '<>&+?#;%.html' in an Apache-served directory and then asked Firefox to fetch http://server/%3c%3e%26%2b%3f%23%3b%25.html and it was retrieved just fine.
Tom -- Tom Hughes (t...@compton.nu) http://compton.nu/
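To illustrate the escaping Tom describes, here is a minimal Python sketch using the standard library's urllib.parse. It decodes the URL from his test and re-encodes the result. The variable names are made up for this example, and passing safe="" (so that '/' is escaped as well) is an assumption about what a key-in-URL scheme would want, not something stated in the thread.

```python
from urllib.parse import quote, unquote

# The percent-encoded file name taken from the test URL in the message above.
encoded = "%3c%3e%26%2b%3f%23%3b%25.html"

# Decoding exactly once gives back the literal file name.
decoded = unquote(encoded)

# Re-encoding with safe="" escapes every reserved character, including "/".
# Python emits upper-case hex digits; %3c and %3C are equivalent in a URL.
reencoded = quote(decoded, safe="")

print(decoded)
print(reencoded)
```

Whether the server on the other end treats the escaped form correctly is a separate question, which is what the rest of the thread argues about.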
Re: [OSM-dev] Disallowing certain characters in tag keys
On Tue, Oct 19, 2010 at 10:06:15AM +0100, Tom Hughes wrote:
> On 16/10/10 19:44, Jochen Topf wrote:
> > I am currently fighting some issues where tags with strange characters in them need to be represented in a URL for Taginfo. Lots of other websites probably will have similar issues. Characters like /, ?, &, etc. have special meaning in URLs so if they appear in tags I can't have those tags in URLs. Sometimes escaping characters as %XX helps, sometimes not. And those problems are not confined to web pages and URLs only. Special characters that need escaping are often a problem.
> I really don't understand the problem here - as far as I know all characters can be used in URLs so long as they are properly escaped. If your server software is not coping with that for some reason then I think it's a bug.
That might well be a bug. But those bugs crop up all the time, because these things are hard to do and because the specs are not as clear as they should be. I am not saying these things can't be done right, but wouldn't it be nice if we could get rid of that problem instead of everybody writing software for OSM having to make sure all those cases are handled properly?
> As a test I just created a file called '<>&+?#;%.html' in an Apache-served directory and then asked Firefox to fetch http://server/%3c%3e%26%2b%3f%23%3b%25.html and it was retrieved just fine.
And now try the same thing again, creating a filename with a '/' in it, and see whether it works this time. It doesn't, because '/' is special for Unix (and HTTP) and you need to create a directory named after the first part of your name and then a file named after the second. If you actually wanted to create one file for every key in the OSM database in your filesystem, you'd have a problem. Your example basically proves my point. :-)
Jochen -- Jochen Topf joc...@remote.org http://www.remote.org/jochen/ +49-721-388298
Re: [OSM-dev] Disallowing certain characters in tag keys
On Tue, Oct 19, 2010 at 10:25 AM, Jochen Topf joc...@remote.org wrote: On Tue, Oct 19, 2010 at 10:06:15AM +0100, Tom Hughes wrote: On 16/10/10 19:44, Jochen Topf wrote: I am currently fighting some issues where tags with strange characters in them need to be represented in a URL for Taginfo. Lots of other websites probably will have similar issues. Characters like /, ?,, etc. have special meaning in URLs so if they appear in tags I can't have those tags in URLs. Sometimes escaping characters as %XX helps, sometimes not. And those problems are not confined to web pages and URLs only. Special characters that need escaping are often a problem. I really don't understand the problem here - as far as I know all characters can be used in URLs so long as they are properly escaped. If your server software is not coping with that for some reason then I think it's a bug. That might well be a bug. But those bugs creep up all the time, because these things are hard to do and because the specs are not as clear as they should be. I am not saying these things can't be done right, but wouldn't it be nice if we can get rid of that problem instead of everybody writing software for OSM having to make sure all those cases are handled properly? As a test I just created a file called '+?#;%.html' in an apache served directory and then asked Firefox to fetch: http://server/%3c%3e%26%2b%3f%23%3b%25.html and it was retrieved just fine. And now try the same thing again creating a filename with a '/' in it and see whether it works this time. It doesn't, because '/' is special for Unix (and HTTP) and you need to create a directory with the first part of your name and then the second as file. If you would actually want to create one file for every key in the OSM database in your filesystem, you'd have a problem. You example basically proves my point. :-) No, it really doesn't. Let's put it this way - there is a subset[1] of unicode code points that is valid for both keys and values. 
If you find any characters emitted by OSM that lie outwith that range, then do let us know[3] But we've taken great care to permit all other code points in both keys and values alike, since we've no idea when someone is going to need them. Your example of why we need > (and presumably <) is actually a great example to undermine your point.
Some of these characters need escaping for particular purposes. If you find a unicode character that cannot be URLencoded[2], then do let us know. Or if you find another encoding scenario which can only encode a sub-set of unicode code points, let us know.
Your application should be able to handle every valid input. You've found that your application is buggy, and now you're asking for the input to be changed. But just the keys, not the values, and only current data, not historical data. It seems a bit ... weird. And your original list of characters is completely arbitrary, not based on any formal specification as far as I can see.
If your editor can't handle all necessary characters, fix the editor. If your application can't handle all the characters, fix the application. And if you find dealing with or = or in a key to be hard, it's probably worth taking some time to test with non-BMP characters.
(If you later find that having ');DROP DATABASE;-- in a key or value is breaking your database inserts, then please don't ask for these characters to be banned too!)
Thanks, Andy
[1] See http://www.w3.org/TR/2008/REC-xml-20081126/#charsets
[2] http://en.wikipedia.org/wiki/Urlencode - / is %2f, by the way.
[3] But you shouldn't rely on it, and defensively program anyway. Not all OSM files are generated by the API.
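Andy's parenthetical about ');DROP DATABASE;-- points at the usual fix on the application side: pass keys and values to the database as bound parameters instead of building SQL strings. The following is a minimal sketch using Python's standard sqlite3 module; the table layout and the function names store_tags and load_tags are hypothetical and are not taken from any OSM software.

```python
import sqlite3


def store_tags(conn: sqlite3.Connection, tags: dict) -> None:
    """Insert key/value pairs using bound parameters (no string building)."""
    conn.execute("CREATE TABLE IF NOT EXISTS tags (k TEXT, v TEXT)")
    # The "?" placeholders keep keys and values out of the SQL text entirely,
    # so quotes, semicolons and comment markers are stored as plain data.
    conn.executemany("INSERT INTO tags (k, v) VALUES (?, ?)", tags.items())


def load_tags(conn: sqlite3.Connection) -> dict:
    """Read all stored pairs back into a dictionary."""
    return dict(conn.execute("SELECT k, v FROM tags"))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    sample = {
        "');DROP DATABASE;--": "hostile-looking key",
        "name:\U00010348": "key with a non-BMP character",
        "a=b": "<&>",
    }
    store_tags(conn, sample)
    print(load_tags(conn) == sample)
```

This only covers the database layer; it says nothing about URLs or HTML output, which need their own escaping.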
Re: [OSM-dev] Disallowing certain characters in tag keys
On 19/10/10 10:25, Jochen Topf wrote:
> On Tue, Oct 19, 2010 at 10:06:15AM +0100, Tom Hughes wrote:
> > As a test I just created a file called '<>&+?#;%.html' in an Apache-served directory and then asked Firefox to fetch http://server/%3c%3e%26%2b%3f%23%3b%25.html and it was retrieved just fine.
> And now try the same thing again creating a filename with a '/' in it and see whether it works this time. It doesn't, because '/' is special for Unix (and HTTP) and you need to create a directory with the first part of your name and then the second as file. If you would actually want to create one file for every key in the OSM database in your filesystem, you'd have a problem.
Sure, if you have a slash then, for static files served from Unix, that would have to correspond to a directory separator. That's a Unix file naming limitation though. In a dynamic application where you are decoding the path information yourself and deciding what it means, there is no such restriction.
Tom -- Tom Hughes (t...@compton.nu) http://compton.nu/
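A rough sketch of what "decoding the path information yourself" could look like in Python: split the still-encoded path on '/' first and percent-decode each segment afterwards, so an escaped slash inside a key is not mistaken for a separator. The /keys/<key> layout and the function names are assumptions for illustration. The sketch also assumes the application receives the raw, undecoded path, which, as Jochen's Apache experience elsewhere in the thread suggests, a proxy or framework may not guarantee.

```python
from urllib.parse import quote, unquote


def key_url(key: str) -> str:
    """Build a path for a tag key, escaping every reserved character."""
    return "/keys/" + quote(key, safe="")


def key_from_path(raw_path: str) -> str:
    """Recover the key from a raw (still percent-encoded) request path."""
    # Split BEFORE decoding: "%2F" is still opaque at this point, so only
    # real path separators produce segment boundaries.
    segments = raw_path.split("/")
    if len(segments) != 3 or segments[0] != "" or segments[1] != "keys":
        raise ValueError("unexpected path: " + raw_path)
    # Decode exactly once, after the structure has been parsed.
    return unquote(segments[2])


if __name__ == "__main__":
    path = key_url("maxspeed/hgv")
    print(path)
    print(key_from_path(path))
```

Decoding first and splitting second would turn the same request into a four-segment path, which is the failure mode the static-file example runs into.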
Re: [OSM-dev] Disallowing certain characters in tag keys
On Tue, Oct 19, 2010 at 6:52 AM, Andy Allan gravityst...@gmail.com wrote:
> Let's put it this way - there is a subset[1] of unicode code points that is valid for both keys and values. If you find any characters emitted by OSM that lie outwith that range, then do let us know[3]
Even if they're only in the history? Last I checked (a couple of months ago), there were quite a few invalid characters in the history (1). Would you like the list (it seems like something which would be easy for you to generate yourself)? If so, is something going to be done about them?
(1) For example, see the last character in the comment at http://www.openstreetmap.org/api/0.6/changeset/936207
Re: [OSM-dev] Disallowing certain characters in tag keys
Am 16.10.2010 20:44, schrieb Jochen Topf: Hi! I am currently fighting some issues where tags with strange characters in them need to be represented in a URL for Taginfo. Lots of other websites probably will have similar issues. Characters like /, ?,, etc. have special meaning in URLs so if they appear in tags I can't have those tags in URLs. Sometimes escaping characters as %XX helps, sometimes not. And those problems are not confined to web pages and URLs only. Special characters that need escaping are often a problem. Yes, special characters can cause headaches. I remember this from my own tag analyzing experiments and other software projects ;-) I agree with you that most (all?) of them are (usually unintended) bugs. For example: Not long ago, it was a common tag problem that keys started or ended with a space char. IIRC the xybot regularly fixes those bugs now. However, as those characters can be used in the name values (and elsewhere), you have to deal with the correct handling of special characters in your software anyway, so I'm not sure if disallowing special characters in the key will really help us in that regard. The problem with disallowing special characters is that you close the door. Software writers will then write software that depends on them not being there (or not caring which is probably the common case today). If we later find out that - for whatever reasons - we want to use one of those chars this will become extremely difficult, as it will cause trouble at many places in existing software. What we currently don't have is a guideline for mappers. I'm missing (and thinking to write for some time) a: How to write good tags. To my knowledge we don't have a written guide (not rule) that we tend to used lower case chars, underscores instead of spaces and all that unwritten rules. Of course, this could include: don't use special chars like /, ?, ... in keys - because this makes it hard for software writers. 
I tend to simply ignore keys with special chars - as we do it today ... Regards, ULFL
Re: [OSM-dev] Disallowing certain characters in tag keys
> The problem with disallowing special characters is that you close the door. Software writers will then write software that depends on them not being there (or not caring which is probably the common case today). If we later find out that - for whatever reasons - we want to use one of those chars this will become extremely difficult, as it will cause trouble at many places in existing software.
This is an instance of what is called The Expression Problem.
-- Tony Morris http://tmorris.net/
Re: [OSM-dev] Disallowing certain characters in tag keys
On 16.10.2010 22:31, Andreas Kalsch wrote:
> I agree with whitespace - this can be very confusing => To add: Make keys lowercase (or even remove diacritics), because keys are always simple names.
I've added a Character section to the page http://wiki.openstreetmap.org/wiki/Any_tags_you_like that tries to summarise this discussion. We might not want to disallow characters, but letting the mappers know which characters to avoid is a good idea IMHO. Please improve it if you like ...
Regards, ULFL
Re: [OSM-dev] Disallowing certain characters in tag keys
On Sat, Oct 16, 2010 at 2:44 PM, Jochen Topf joc...@remote.org wrote:
> Technically this would mean changing the API to check for those characters, removing any that are already in the database (can be done with normal manual edits because there are so few cases) and adding checks to the editors so that they can give meaningful error messages.
To be clear, they'd still be in the database, in the history. Which is one implementation problem, because it means putting checks in more than one place. At the very least, the regular API, and the Potlatch API, but there are probably multiple places within the regular API where things would need to be checked. And then any software which relies on these changes wouldn't work with historical data.
It could be done, but to do all that work just to make it easier to code Taginfo would be, in my opinion, a waste. Especially when there are plenty of simple solutions within taginfo. If URL encoding is too painful, use a modified base64 encoding of the unicode string (using - and _ instead of + and /).
For cleaning up the keys, I'd want to strip down to as few characters as possible. There's no point supporting most unicode characters - keys are supposed to be in English.
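The "modified base64" suggested above (using - and _ instead of + and /) is available in Python's standard library as the URL-safe alphabet. A minimal sketch follows; the function names are hypothetical, and encoding the key as UTF-8 first is an assumption, since the message only says "the unicode string".

```python
import base64


def encode_key(key: str) -> str:
    """Encode a tag key with the URL-safe base64 alphabet ('-' and '_')."""
    return base64.urlsafe_b64encode(key.encode("utf-8")).decode("ascii")


def decode_key(token: str) -> str:
    """Reverse encode_key()."""
    return base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8")


if __name__ == "__main__":
    for key in ["highway", ">>>", "maxspeed/hgv"]:
        token = encode_key(key)
        print(key, "->", token, "->", decode_key(token))
```

Note that the output still carries '=' padding, so a real scheme would have to decide whether to keep it, strip it, or percent-encode it. The trade-off is the one raised in the reply to this message: the resulting URLs are no longer readable by humans.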
Re: [OSM-dev] Disallowing certain characters in tag keys
Hi, what is the problem with escaping problematic characters? There should be built-in functions for most languages, like uri_escape in Perl and URLEncoder.encode in Java.
This proposal [1] moves values into the key to describe conditions. (Although you could argue, it should be done like that anyway...)
[1] http://wiki.openstreetmap.org/wiki/Proposed_features/Extended_conditions_for_access_tags
Sebastian
Jochen Topf wrote:
> Hi! I am currently fighting some issues where tags with strange characters in them need to be represented in a URL for Taginfo. Lots of other websites probably will have similar issues. Characters like /, ?, &, etc. have special meaning in URLs so if they appear in tags I can't have those tags in URLs. Sometimes escaping characters as %XX helps, sometimes not. And those problems are not confined to web pages and URLs only. Special characters that need escaping are often a problem.
> We can't really do anything about that with regard to tag values; they must be allowed to contain all those characters. But it would help at least a little if we knew those characters can never appear in tag keys. And I can't really see a legitimate reason why we need those characters in keys. Looking at the database, almost all cases where they appear in keys are obvious errors. Out of the about 2 different keys, there are only about 190 keys with problematic characters in them (another about 800 with whitespace). Really the only case that I can't immediately rule out as an error, or for which I can't see an alternative tagging, is tag keys like maxspeed:weight>7.5. And with those you can already see the problems: some of them have &gt; instead of the >. So I'd like us to think about whether we can disallow a few characters from appearing in tag keys.
> Technically this would mean changing the API to check for those characters, removing any that are already in the database (can be done with normal manual edits because there are so few cases) and adding checks to the editors so that they can give meaningful error messages. Shouldn't be too hard.
> So, what characters am I talking about? I haven't drawn up a complete list and we certainly would need to discuss this further. Here is a preliminary list:
>   Whitespace   Should use '_' instead of whitespace in keys; whitespace is also very confusing for users, especially at the beginning and end of a text.
>   /+?#;%'      Special characters in XML, HTML and/or URLs.
>   \'           Characters often used for quoting.
>   =            Because it's used in many places as the separation character between tag key and tag value. If we disallow this, we can always treat one string like foo=bar as k:foo, v:bar without any ambiguities.
> This is a small list of special characters; all other characters should still be allowed. That means tag keys can still be in Chinese or whatever. We'd just disallow a few characters of which we know that they will cause problems again and again.
> And to emphasize this again: I am only talking about tag keys. Tag values must be allowed to contain the full Unicode set of characters.
> Jochen
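The kind of check the proposal has in mind could be sketched as follows in Python. This is only an illustration: the forbidden set below is an assumption, combining the characters that survive in this archive's copy of the preliminary list with <, >, & and the double quote, which the descriptions "special characters in XML, HTML" and "used for quoting" imply. The function names are hypothetical, and nothing here reflects how the OSM API actually validates keys.

```python
# Assumed character set; the thread's list was explicitly preliminary.
FORBIDDEN = set("<>&/+?#;%'\"\\=")


def invalid_chars(key: str) -> list:
    """Return the sorted list of disallowed characters found in a tag key."""
    return sorted({ch for ch in key if ch in FORBIDDEN or ch.isspace()})


def split_tag(text: str) -> tuple:
    """Split 'key=value' at the first '='.

    If '=' can never occur in a key, the first '=' is always the separator,
    even when the value itself contains further '=' characters.
    """
    key, sep, value = text.partition("=")
    if not sep:
        raise ValueError("no '=' in " + repr(text))
    return key, value


if __name__ == "__main__":
    print(invalid_chars("highway"))
    print(invalid_chars("max speed"))
    print(split_tag("name=a=b"))
```

An editor could use the list returned by invalid_chars to tell the mapper exactly which characters to remove, which is the "meaningful error message" the proposal asks for.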
Re: [OSM-dev] Disallowing certain characters in tag keys
I agree with whitespace - this can be very confusing = To add: Make keys lowercase (or even remove diacritics), because keys are always simple names. Am 16.10.10 20:44, schrieb Jochen Topf: Hi! I am currently fighting some issues where tags with strange characters in them need to be represented in a URL for Taginfo. Lots of other websites probably will have similar issues. Characters like /, ?,, etc. have special meaning in URLs so if they appear in tags I can't have those tags in URLs. Sometimes escaping characters as %XX helps, sometimes not. And those problems are not confined to web pages and URLs only. Special characters that need escaping are often a problem. We can't really do anything about that with regard to tag values, they must be allowed to contain all those characters. But it would help at least a little if we knew those characters can never appear in tag keys. And I can't really see a legitimate reason why we need those characters in keys. Looking at the database almost all cases where they appear in keys are obvious errors. Out of the about 2 different keys, there are only about 190 keys with problematic characters in them (another about 800 with whitespace). Really the only case that I can't immediately rule out as errors or see an alternative tagging are tag keys like maxspeed:weight7.5. And with those you can already see the problems: Some of them have gt; instead of the . So I'd like us to think about whether we can disallow a few characters from appearing in tag keys. Technically this would mean changing the API to check for those characters, removing any that are already in the database (can be done with normal manual edits because there are so few cases) and adding checks to the editors so that they can give meaningful error messages. Shouldn't be too hard. So, what characters am I talking about? I haven't drawn up a complete list and we certainly would need to discuss this further. 
Here is a preliminary list: Whitespace Should use '_' instead of whitespace in keys, whitespace are also very confusing for users, especially at beginning and end of a text. /+?#;%' Special characters in XML, HTML and/or URLs. \' Characters often used for quoting. =Because its used in many places as the separation character between tag key and tag value. If we disallow this, we can always treat one string like foo=bar as k:foo, v:bar without any ambiguities. This is a small list of special characters, all other characters should still be allowed. That means tag keys can still be in Chinese or whatever. We'd just disallow a few characters of which we know that they will make problems again and again. And to emphasize this again: I am only talking about tag keys. Tag values must be allowed to contain the full Unicode set of characters. Jochen ___ dev mailing list dev@openstreetmap.org http://lists.openstreetmap.org/listinfo/dev
Re: [OSM-dev] Disallowing certain characters in tag keys
On Sat, Oct 16, 2010 at 09:55:41PM +0200, Sebastian Klein wrote:
> what is the problem with escaping problematic characters? There should be built-in functions for most languages, like uri_escape in Perl and URLEncoder.encode in Java.
The problem is that it's very hard to get this right. I have just spent an hour debugging a problem where the semicolon (;) character in a URL was mishandled by Apache. The mod_proxy module I use in Taginfo silently cuts off everything after a ; in the URL, even if it's escaped. That's probably because the ; is handled specially for some reason in the RFC defining URLs. I found an option to fix this, but it doesn't work with mod_rewrite, so that had to be worked around, too. I got it to work, but I don't want to know what the next problem will be. It just goes to show that even software like Apache has problems dealing with these things, not to speak of scripts somebody just hacked together. I already mentioned the keys in the database with &gt; or so in there, probably from some software escaping once too often.
Special characters must be escaped exactly once. If they are not escaped, or escaped more than once, you get broken results. And on the other side you have to de-escape exactly once. This is difficult to get right. And the penalty for not getting it right might just be an SQL injection or a cross-site-scripting attack vector.
Not allowing those characters in the first place makes software easier to write and understand, and more robust. And it even makes it friendlier for humans, because if they use those characters you can immediately give them an error message instead of creating broken data. And all of that without any cost, really. I don't see that we ever need those characters in tag keys. Of course, if we do need those characters then we have to get all of this right, and right every time.
> This proposal [1] moves values into the key to describe conditions. (Although you could argue, it should be done like that anyway...)
[1] http://wiki.openstreetmap.org/wiki/Proposed_features/Extended_conditions_for_access_tags, I think thats a misguided use of tag keys probably invented by people who have never actually tried to write code that tries to interpret OSM tags. Jochen -- Jochen Topf joc...@remote.org http://www.remote.org/jochen/ +49-721-388298 ___ dev mailing list dev@openstreetmap.org http://lists.openstreetmap.org/listinfo/dev
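The "escaped exactly once" point in the message above can be illustrated with a minimal sketch. It assumes Python's standard urllib.parse functions as a stand-in for the uri_escape and Java helpers mentioned in the thread; the function names escape_once and unescape_once are hypothetical, and the sample key is the one discussed in this thread.

```python
from urllib.parse import quote, unquote


def escape_once(key: str) -> str:
    """Percent-encode every reserved character in a tag key (assumed helper)."""
    # safe="" means that even '/' is encoded; by default quote() leaves it alone.
    return quote(key, safe="")


def unescape_once(text: str) -> str:
    """Undo exactly one level of percent-encoding (assumed helper)."""
    return unquote(text)


key = "maxspeed:weight>7.5"

once = escape_once(key)     # the correct form to put into a URL path segment
twice = escape_once(once)   # the bug: the '%' signs themselves get encoded again

print(once)   # maxspeed%3Aweight%3E7.5
print(twice)  # maxspeed%253Aweight%253E7.5

# One round of de-escaping recovers the key only if it was escaped once.
print(unescape_once(once) == key)    # True
print(unescape_once(twice) == key)   # False: one level of escaping is left over
```

A doubly escaped key that is de-escaped only once ends up stored with the escape sequences in it, which is the kind of damage described above for the keys containing &gt;.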
Re: [OSM-dev] Disallowing certain characters in tag keys
On Sat, Oct 16, 2010 at 10:31:47PM +0200, Andreas Kalsch wrote:
> I agree with whitespace - this can be very confusing. To add: Make keys lowercase (or even remove diacritics), because keys are always simple names.

That's a different issue. I agree that keys normally should be lowercase. But that's just a nice convention. There might be good reasons for uppercase keys, for instance when the key name was used in upper case in some other system the data was imported from. There is no need to force people into a convention here. That's different from the issue I have been talking about, where there are real problems with some characters.

Jochen

> On 16.10.10 20:44, Jochen Topf wrote:
>> Hi!
>>
>> I am currently fighting some issues where tags with strange characters in them need to be represented in a URL for Taginfo. Lots of other websites probably will have similar issues. Characters like /, ?, &, etc. have special meaning in URLs, so if they appear in tags I can't have those tags in URLs. Sometimes escaping characters as %XX helps, sometimes not. And those problems are not confined to web pages and URLs only. Special characters that need escaping are often a problem.
>>
>> We can't really do anything about that with regard to tag values; they must be allowed to contain all those characters. But it would help at least a little if we knew those characters can never appear in tag keys. And I can't really see a legitimate reason why we need those characters in keys. Looking at the database, almost all cases where they appear in keys are obvious errors. Out of the about 2 different keys, there are only about 190 keys with problematic characters in them (another about 800 with whitespace). Really the only cases that I can't immediately rule out as errors, or see an alternative tagging for, are tag keys like maxspeed:weight>7.5. And with those you can already see the problems: Some of them have &gt; instead of the >.
>>
>> So I'd like us to think about whether we can disallow a few characters from appearing in tag keys. Technically this would mean changing the API to check for those characters, removing any that are already in the database (can be done with normal manual edits because there are so few cases) and adding checks to the editors so that they can give meaningful error messages. Shouldn't be too hard.
>>
>> So, what characters am I talking about? I haven't drawn up a complete list and we certainly would need to discuss this further. Here is a preliminary list:
>>
>> Whitespace
>>   Should use '_' instead of whitespace in keys; whitespace is also very confusing for users, especially at the beginning and end of a text.
>>
>> /+?#;%'
>>   Special characters in XML, HTML and/or URLs.
>>
>> \'
>>   Characters often used for quoting.
>>
>> =
>>   Because it's used in many places as the separation character between tag key and tag value. If we disallow this, we can always treat a string like foo=bar as k:foo, v:bar without any ambiguity.
>>
>> This is a small list of special characters; all other characters should still be allowed. That means tag keys can still be in Chinese or whatever. We'd just disallow a few characters that we know will cause problems again and again.
>>
>> And to emphasize this again: I am only talking about tag keys. Tag values must be allowed to contain the full Unicode set of characters.
>>
>> Jochen
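The kind of check proposed above, and the unambiguous foo=bar split it would allow, can be sketched roughly as follows. This is only an illustration under stated assumptions: the forbidden set is whitespace, '=', and the preliminary characters exactly as they appear in this archived copy of the message (the list was explicitly not final), and the names FORBIDDEN, invalid_chars and split_tag are hypothetical, not part of the OSM API or any editor.

```python
import re
from typing import List, Tuple

# Assumed character set: whitespace, '=', and the preliminary list / + ? # ; % ' \
# as shown in the message. Everything else, including non-Latin scripts, is allowed.
FORBIDDEN = re.compile(r"""[\s/+?#;%'\\=]""")


def invalid_chars(key: str) -> List[str]:
    """Return the disallowed characters found in key, in order of first appearance."""
    found: List[str] = []
    for ch in FORBIDDEN.findall(key):
        if ch not in found:
            found.append(ch)
    return found


def split_tag(text: str) -> Tuple[str, str]:
    """Split 'key=value' at the first '='.

    This is unambiguous only because keys are assumed never to contain '=';
    the value may contain any character, including further '=' signs.
    """
    key, sep, value = text.partition("=")
    if not sep:
        raise ValueError(f"no '=' in {text!r}")
    bad = invalid_chars(key)
    if bad:
        raise ValueError(f"key {key!r} contains disallowed characters: {bad}")
    return key, value


print(invalid_chars("addr:street"))  # []
print(invalid_chars("name 1"))       # [' ']
print(split_tag("foo=bar"))          # ('foo', 'bar')
print(split_tag("note=a=b"))         # ('note', 'a=b')
```

Rejecting a key at input time, with the offending characters listed, is what lets an editor show a meaningful error message instead of storing broken data.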
Re: [OSM-dev] Disallowing certain characters in tag keys
+1. much sanity ensues.

On Sat, Oct 16, 2010 at 7:44 PM, Jochen Topf joc...@remote.org wrote:
> So I'd like us to think about whether we can disallow a few characters from appearing in tag keys.
Re: [OSM-dev] Disallowing certain characters in tag keys
Jochen Topf wrote:
>> This proposal [1] moves values into the key to describe conditions. (Although you could argue, it should be done like that anyway...)
>>
>> [1] http://wiki.openstreetmap.org/wiki/Proposed_features/Extended_conditions_for_access_tags
>
> I think that's a misguided use of tag keys, probably invented by people who have never actually tried to write code that tries to interpret OSM tags.

No speculation required, I'm the one who is to blame for that proposal.

Before I finished the text that's still in the wiki, however, I /did/ write an experimental implementation* for this syntax, as well as for an alternative syntax, to find out whether the ideas could work from a developer's perspective. I didn't encounter any significant problems.

In retrospect ... well, maybe it wasn't the best of ideas to write that implementation based on the GraphView plugin for JOSM. After all, I figure that working on a comparatively small in-memory dataset makes a significant difference to the way you would deal with keys. Add to this that I didn't have to deal with any web issues, or in fact any interface between programs at all (-> no encoding issues), and it probably wasn't remotely a representative development experience.

Unfortunately, it seems this will produce exactly the outcome I didn't intend at all: more variety in tagging. People will continue to use those parts of the proposal that don't require special characters (maxspeed:wet, :forward and the like), so we will use completely different solutions for simple and for more complex cases.

Well, I'll file this as "failed attempt of overzealous newbie to build tagging cathedral".

Tobias Knerr

* I ended up never publishing the implementation, though - not due to condition handling itself, but because I never got around to implementing a proper opening_hours parser. Turns out that's actually more work than the entire rest of the syntax.