Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-19 Thread Jochen Topf
On Sun, Oct 17, 2010 at 09:48:31AM +0200, Ulf Lamping wrote:
 Am 16.10.2010 20:44, schrieb Jochen Topf:
 Hi!

 I am currently fighting some issues where tags with strange characters in them
 need to be represented in a URL for Taginfo. Lots of other websites probably
 will have similar issues. Characters like /, ?,, etc. have special meaning
 in URLs so if they appear in tags I can't have those tags in URLs. Sometimes
 escaping characters as %XX helps, sometimes not. And those problems are not
 confined to web pages and URLs only. Special characters that need escaping
 are often a problem.

 Yes, special characters can cause headaches. I remember this from my own  
 tag analyzing experiments and other software projects ;-)

 I agree with you that most (all?) of them are (usually unintended) bugs.  
 For example: Not long ago, it was a common tag problem that keys started  
 or ended with a space char. IIRC the xybot regularly fixes those bugs now.

 However, as those characters can be used in the name values (and  
 elsewhere), you have to deal with the correct handling of special  
 characters in your software anyway, so I'm not sure if disallowing  
 special characters in the key will really help us in that regard.

 The problem with disallowing special characters is that you close the  
 door. Software writers will then write software that depends on them  
 not being there (or not caring which is probably the common case today).  
 If we later find out that - for whatever reasons - we want to use one of  
 those chars this will become extremely difficult, as it will cause  
 trouble at many places in existing software.

That's absolutely true. That's why I am only proposing a very small list and
don't include characters like {} that are not used now, but might make
sense in the future.

 What we currently don't have is a guideline for mappers. I'm missing
 (and have been thinking of writing for some time) a "How to write good tags"
 guide. To my knowledge we don't have a written guide (not a rule) saying that
 we tend to use lower-case chars, underscores instead of spaces, and all those
 unwritten rules. Of course, this could include: don't use special chars like /,
 ?, ... in keys - because this makes it hard for software writers.

I agree that we should come up with these guidelines, but that's really
a different issue.

 I tend to simply ignore keys with special chars - as we do it today ...

Which works well for lots of software (like renderers who don't care about
the things they don't understand). Unfortunately it doesn't work for editors
or something like Taginfo which needs to work with *all* legal tags.

Jochen
-- 
Jochen Topf  joc...@remote.org  http://www.remote.org/jochen/  +49-721-388298


___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-19 Thread Jochen Topf
On Sun, Oct 17, 2010 at 04:57:33PM -0400, Anthony wrote:
 On Sat, Oct 16, 2010 at 2:44 PM, Jochen Topf joc...@remote.org wrote:
  Technically this would mean changing the API to check
  for those characters, removing any that are already in the database (can be
  done with normal manual edits because there are so few cases) and adding checks
  to the editors so that they can give meaningful error messages.
 
 To be clear, they'd still be in the database, in the history.
 
 Which is one implementation problem, because it means putting checks
 in more than one different place.  At the very least, the regular API,
 and the Potlatch API, but there are probably multiple places within
 the regular API where things would need to be checked.

But that's far fewer places than all the software out there. The whole point
of an API is that it's a sort of choke point, a single place where things
can be checked.

 And then any software which relies on these changes wouldn't work with
 historical data.

That's a problem, you are right. We could solve that by faking the history. It
wouldn't be the first time this has been done, so it would be possible. But
most software out there only deals with current data. So even if we keep the
history, that software would become simpler.

 It could be done, but to do all that work just to make it easier to
 code Taginfo would be, in my opinion, a waste.  Especially when there
 are plenty of simple solutions within taginfo.  If URL encoding is too
 painful, use a modified base64 encoding of the unicode string (using
 - and _ instead of + and /).

It's not only Taginfo. All software out there would become simpler. If this
were a Taginfo-only problem I wouldn't propose it. One of the biggest
problems is that Taginfo doesn't work alone, but wants to work with other
tools. If I use base64 encoding then people would need to link to something
like http://taginfo.openstreetmap.de/keys/aGlnaHdheQo= instead of
http://taginfo.openstreetmap.de/keys/highway. And the link then to XAPI
would not be http://www.informationfreeway.org/api/0.6/*[highway=*] but
http://www.informationfreeway.org/api/0.6/*[aGlnaHdheQo==*] . Not very user
friendly. And then every service would probably use different encoding
schemes...

I have actually thought about that and might offer a secondary interface
to Taginfo using base64 or something like it if I can't avoid it. But that's
really ugly and probably nobody would use it anyway, because nobody wants
to write special cases for the few keys that use those characters and are
bogus anyway.
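
For illustration, here is a minimal Python sketch of the modified base64
scheme discussed above ('-' and '_' instead of '+' and '/'). The function
names encode_key and decode_key are made up for this example and are not
part of Taginfo or any other tool; only the standard-library base64 module
is assumed.

```python
import base64


def encode_key(key: str) -> str:
    """Encode a tag key as URL-safe base64 ('-' and '_' replace '+' and '/')."""
    return base64.urlsafe_b64encode(key.encode("utf-8")).decode("ascii")


def decode_key(token: str) -> str:
    """Reverse encode_key()."""
    return base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8")


print(encode_key("highway"))                    # aGlnaHdheQ==
print(encode_key("highway\n"))                  # aGlnaHdheQo=
print(decode_key(encode_key("addr/street?")))   # addr/street?
```

Note that the token aGlnaHdheQo= used in the links above decodes to
"highway" followed by a newline; the bare key encodes to aGlnaHdheQ==.
Either way the result is opaque to a human reader, which is the usability
concern raised here.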

 For cleaning up the keys, I'd want to strip down to as few characters
 as possible.  There's no point supporting most unicode characters -
 keys are supposed to be in English.

No. English people should be allowed to use their own language if they
want to. So should speakers of every other language on the planet, too.

Jochen
-- 
Jochen Topf  joc...@remote.org  http://www.remote.org/jochen/  +49-721-388298




Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-19 Thread Tom Hughes

On 16/10/10 19:44, Jochen Topf wrote:


I am currently fighting some issues where tags with strange characters in them
need to be represented in a URL for Taginfo. Lots of other websites probably
will have similar issues. Characters like /, ?,, etc. have special meaning
in URLs so if they appear in tags I can't have those tags in URLs. Sometimes
escaping characters as %XX helps, sometimes not. And those problems are not
confined to web pages and URLs only. Special characters that need escaping
are often a problem.


I really don't understand the problem here - as far as I know all 
characters can be used in URLs so long as they are properly escaped. If 
your server software is not coping with that for some reason then I 
think it's a bug.


As a test I just created a file called '<>&+?#;%.html' in an apache 
served directory and then asked Firefox to fetch:


  http://server/%3c%3e%26%2b%3f%23%3b%25.html

and it was retrieved just fine.
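
The escaping in that URL can be reproduced with a short Python sketch. It
uses only the standard library; the key "addr/street" is a made-up example
to show how a slash is treated, not a key taken from this thread.

```python
from urllib.parse import quote, unquote

# Path component from the test URL above, still percent-encoded.
encoded_path = "%3c%3e%26%2b%3f%23%3b%25.html"

# Decoding recovers the literal file name on disk.
filename = unquote(encoded_path)
print(filename)  # <>&+?#;%.html

# Re-encoding with safe="" escapes every reserved character, including "/".
# quote() emits upper-case hex digits; %3c and %3C are equivalent.
reencoded = quote(filename, safe="")
print(reencoded)  # %3C%3E%26%2B%3F%23%3B%25.html

# A slash is only escaped when it is removed from the "safe" set.
print(quote("addr/street", safe=""))  # addr%2Fstreet
print(quote("addr/street"))           # addr/street (default safe="/")
```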

Tom

--
Tom Hughes (t...@compton.nu)
http://compton.nu/



Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-19 Thread Jochen Topf
On Tue, Oct 19, 2010 at 10:06:15AM +0100, Tom Hughes wrote:
 On 16/10/10 19:44, Jochen Topf wrote:

 I am currently fighting some issues where tags with strange characters in them
 need to be represented in a URL for Taginfo. Lots of other websites probably
 will have similar issues. Characters like /, ?,, etc. have special meaning
 in URLs so if they appear in tags I can't have those tags in URLs. Sometimes
 escaping characters as %XX helps, sometimes not. And those problems are not
 confined to web pages and URLs only. Special characters that need escaping
 are often a problem.

 I really don't understand the problem here - as far as I know all  
 characters can be used in URLs so long as they are properly escaped. If  
 your server software is not coping with that for some reason then I  
 think it's a bug.

That might well be a bug. But those bugs crop up all the time, because these
things are hard to do and because the specs are not as clear as they should be.
I am not saying these things can't be done right, but wouldn't it be nice if
we could get rid of that problem instead of everybody writing software for OSM
having to make sure all those cases are handled properly?

 As a test I just created a file called '<>&+?#;%.html' in an apache  
 served directory and then asked Firefox to fetch:

   http://server/%3c%3e%26%2b%3f%23%3b%25.html

 and it was retrieved just fine.

And now try the same thing again, creating a filename with a '/' in it, and see
whether it works this time. It doesn't, because '/' is special for Unix (and
HTTP) and you need to create a directory with the first part of your name and
then the second part as a file. If you actually wanted to create one file for
every key in the OSM database in your filesystem, you'd have a problem.

Your example basically proves my point. :-)

Jochen
-- 
Jochen Topf  joc...@remote.org  http://www.remote.org/jochen/  +49-721-388298




Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-19 Thread Andy Allan
On Tue, Oct 19, 2010 at 10:25 AM, Jochen Topf joc...@remote.org wrote:
 On Tue, Oct 19, 2010 at 10:06:15AM +0100, Tom Hughes wrote:
 On 16/10/10 19:44, Jochen Topf wrote:

 I am currently fighting some issues where tags with strange characters in them
 need to be represented in a URL for Taginfo. Lots of other websites probably
 will have similar issues. Characters like /, ?,, etc. have special meaning
 in URLs so if they appear in tags I can't have those tags in URLs. Sometimes
 escaping characters as %XX helps, sometimes not. And those problems are not
 confined to web pages and URLs only. Special characters that need escaping
 are often a problem.

 I really don't understand the problem here - as far as I know all
 characters can be used in URLs so long as they are properly escaped. If
 your server software is not coping with that for some reason then I
 think it's a bug.

 That might well be a bug. But those bugs creep up all the time, because these
 things are hard to do and because the specs are not as clear as they should be.
 I am not saying these things can't be done right, but wouldn't it be nice if
 we can get rid of that problem instead of everybody writing software for OSM
 having to make sure all those cases are handled properly?

 As a test I just created a file called '<>&+?#;%.html' in an apache
 served directory and then asked Firefox to fetch:

   http://server/%3c%3e%26%2b%3f%23%3b%25.html

 and it was retrieved just fine.

 And now try the same thing again creating a filename with a '/' in it and see
 whether it works this time. It doesn't, because '/' is special for Unix (and
 HTTP) and you need to create a directory with the first part of your name and
 then the second as file. If you would actually want to create one file for
 every key in the OSM database in your filesystem, you'd have a problem.

 Your example basically proves my point. :-)

No, it really doesn't.

Let's put it this way - there is a subset[1] of unicode code points
that is valid for both keys and values. If you find any characters
emitted by OSM that lie outwith that range, then do let us know[3]. But
we've taken great care to permit all other code points in both keys
and values alike, since we've no idea when someone is going to need
them. Your example of why we need  (and presumably ) is actually a
great example to undermine your point.
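
A rough way to check a string against that subset, assuming footnote [1]
refers to the Char production of XML 1.0 (tab, LF, CR, U+0020 to U+D7FF,
U+E000 to U+FFFD, U+10000 to U+10FFFF). The helper names are invented for
this sketch and are not part of the OSM API.

```python
def is_xml_char(ch: str) -> bool:
    """True if ch is allowed by the XML 1.0 'Char' production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def invalid_chars(text: str) -> list:
    """Return (index, code point) pairs for characters outside the XML range."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if not is_xml_char(ch)
    ]


print(invalid_chars("addr/street?"))  # []
print(invalid_chars("name\x00"))      # [(4, 'U+0000')]
```

Note that '/', '?' and the other characters under discussion all pass this
check; only control characters, surrogates and U+FFFE/U+FFFF are rejected.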

Some of these characters need escaping for particular purposes. If you
find a unicode character that cannot be URLencoded[2], then do let us
know. Or if you find another encoding scenario which can only encode a
sub-set of unicode code points, let us know.
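
As a quick sanity check of that claim, a Python sketch that round-trips a
few strings through percent-encoding. The sample keys are arbitrary
illustrations, including one non-BMP character (U+10348); none of them is
taken from the OSM database.

```python
from urllib.parse import quote, unquote

# Arbitrary sample keys: plain ASCII, reserved characters, a space,
# non-Latin text, and a character outside the Basic Multilingual Plane.
samples = ["highway", "addr/street", "a b", "what?", "名前", "\U00010348"]

for key in samples:
    encoded = quote(key, safe="")      # UTF-8 bytes, then %XX escapes
    assert unquote(encoded) == key     # decoding restores the original
    print(repr(key), "->", encoded)
```

The slash comes out as %2F and the non-BMP character as %F0%90%8D%88, its
four UTF-8 bytes.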

Your application should be able to handle every valid input. You've
found that your application is buggy, and now you're asking for the
input to be changed. But just the keys, not the values, and only
current data, not historical data. It seems a bit ... weird. And your
original list of characters is completely arbitrary, not based on any
formal specification as far as I can see.

If your editor can't handle all necessary characters, fix the editor.
If your application can't handle all the characters, fix the
application. And if you find dealing with  or = or  in a key to be
hard, it's probably worth taking some time to test with non-BMP
characters.

(If you later find that having ');DROP DATABASE;-- in a key or value
is breaking your database inserts, then please don't ask for these
characters to be banned too!)
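
A minimal sketch of the defensive approach for that case, using Python's
standard sqlite3 module with bound parameters. The table name and schema
are hypothetical and only serve the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema for illustration only.
conn.execute("CREATE TABLE tags (k TEXT NOT NULL, v TEXT NOT NULL)")

nasty_key = "');DROP DATABASE;--"

# The key is passed as a bound parameter, never spliced into the SQL text,
# so the database treats it as plain data.
conn.execute("INSERT INTO tags (k, v) VALUES (?, ?)", (nasty_key, "yes"))

stored = conn.execute("SELECT k FROM tags").fetchone()[0]
print(stored == nasty_key)  # True
```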

Thanks,
Andy

[1] See http://www.w3.org/TR/2008/REC-xml-20081126/#charsets
[2] http://en.wikipedia.org/wiki/Urlencode - / is %2f, by the way.
[3] But you shouldn't rely on it, and defensively program anyway. Not
all OSM files are generated by the API.



Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-19 Thread Tom Hughes

On 19/10/10 10:25, Jochen Topf wrote:

On Tue, Oct 19, 2010 at 10:06:15AM +0100, Tom Hughes wrote:


As a test I just created a file called '<>&+?#;%.html' in an apache
served directory and then asked Firefox to fetch:

   http://server/%3c%3e%26%2b%3f%23%3b%25.html

and it was retrieved just fine.


And now try the same thing again creating a filename with a '/' in it and see
whether it works this time. It doesn't, because '/' is special for Unix (and
HTTP) and you need to create a directory with the first part of your name and
then the second as file. If you would actually want to create one file for
every key in the OSM database in your filesystem, you'd have a problem.


Sure, if you have a slash then, for static files served from unix, that 
would have to correspond to a directory separator. That's a unix file 
naming limitation though.


In a dynamic application where you are decoding the path information 
yourself and deciding what it means there is no such restriction.
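
A sketch of what decoding the path yourself can look like in Python. The
function name key_from_path and the /keys/<key> layout are assumptions
modeled on the Taginfo URLs quoted earlier in the thread, and the sketch
assumes the application receives the raw, still-encoded path, which depends
on how the server and framework are configured.

```python
from urllib.parse import unquote


def key_from_path(raw_path: str) -> str:
    """Extract the key from a raw (still percent-encoded) path /keys/<key>."""
    # Split on literal slashes BEFORE decoding, so an escaped slash (%2F)
    # inside the key is not mistaken for a path separator.
    segments = raw_path.split("/")
    if len(segments) != 3 or segments[0] != "" or segments[1] != "keys":
        raise ValueError(f"unexpected path: {raw_path!r}")
    return unquote(segments[2])


print(key_from_path("/keys/highway"))          # highway
print(key_from_path("/keys/addr%2Fstreet"))    # addr/street
```

Decoding first and splitting afterwards would turn the second path into
three segments, which is the failure mode described for static files.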


Tom

--
Tom Hughes (t...@compton.nu)
http://compton.nu/



Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-19 Thread Anthony
On Tue, Oct 19, 2010 at 6:52 AM, Andy Allan gravityst...@gmail.com wrote:
 Let's put it this way - there is a subset[1] of unicode code points
 that is valid for both keys and values. If you find any characters
 emitted by OSM that lie outwith that range, then do let us know[3]

Even if they're only in the history? Last I checked (a couple of months
ago), there were quite a few invalid characters in the history (1).
Would you like the list (it seems like something that would be easy for
you to generate yourself)?  If so, is something going to be done
about them?

(1) For example, see the last character in the comment at
http://www.openstreetmap.org/api/0.6/changeset/936207
