Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-20 Thread Jochen Topf
On Tue, Oct 19, 2010 at 11:52:09AM +0100, Andy Allan wrote:
 On Tue, Oct 19, 2010 at 10:25 AM, Jochen Topf joc...@remote.org wrote:
  On Tue, Oct 19, 2010 at 10:06:15AM +0100, Tom Hughes wrote:
  On 16/10/10 19:44, Jochen Topf wrote:
 
  I am currently fighting some issues where tags with strange characters in
  them need to be represented in a URL for Taginfo. Lots of other websites
  probably will have similar issues. Characters like /, ?, &, etc. have
  special meaning in URLs so if they appear in tags I can't have those tags
  in URLs. Sometimes escaping characters as %XX helps, sometimes not. And
  those problems are not confined to web pages and URLs only. Special
  characters that need escaping are often a problem.
 
  I really don't understand the problem here - as far as I know all
  characters can be used in URLs so long as they are properly escaped. If
  your server software is not coping with that for some reason then I
  think it's a bug.
 
  That might well be a bug. But those bugs creep up all the time, because
  these things are hard to do and because the specs are not as clear as they
  should be. I am not saying these things can't be done right, but wouldn't
  it be nice if we can get rid of that problem instead of everybody writing
  software for OSM having to make sure all those cases are handled properly?
 
  As a test I just created a file called '<>&+?#;%.html' in an apache
  served directory and then asked Firefox to fetch:
 
    http://server/%3c%3e%26%2b%3f%23%3b%25.html
 
  and it was retrieved just fine.
 
  And now try the same thing again creating a filename with a '/' in it and
  see whether it works this time. It doesn't, because '/' is special for Unix
  (and HTTP) and you need to create a directory with the first part of your
  name and then the second as file. If you would actually want to create one
  file for every key in the OSM database in your filesystem, you'd have a
  problem.
 
  Your example basically proves my point. :-)
 
 No, it really doesn't.

Obviously I haven't made my point clear enough. I am saying, those special
characters don't work like normal characters in many cases. They have special
meanings. For instance as directory separators. Or in URLs or HTML code or
programming languages. So whenever you do anything where those characters
can appear, you have to take special care that your code doesn't break. And
programmers are notoriously bad at taking that special care.

 Let's put it this way - there is a subset[1] of unicode code points
 that is valid for both keys and values. If you find any characters
 emitted by OSM that lie outwith that range, then do let us know[3]. But
 we've taken great care to permit all other code points in both keys
 and values alike, since we've no idea when someone is going to need
 them. Your example of why we need > (and presumably <) is actually a
 great example to undermine your point.

It's really a case of weighing the different cases. On the one hand it
makes sense to allow everything, because you never know what you will
need. But on the other hand it makes sense to restrict what you allow
to make handling easier. We have restricted the number of characters
in keys and values for instance. There are certainly cases where it
would be nice to have more characters, but for practical reasons they
are restricted. We have put in a restriction that a key can only appear
once on an object. That's also for practical purposes. I am arguing that
there are other things we can do to make working with OSM tags more
convenient at, I think, no extra cost.

Look at what happened with email addresses: You can have nearly every ASCII
character in email addresses, spaces and double quotes are allowed for
instance, but you have to escape them in the right way. Real mail programs
can generally handle that. But most scripts that people write don't. The result
is that in practice you can't use all those characters in email addresses,
because they work only half the time. If you send programmers to the RFC
and ask them to implement it properly, they can't figure out how to do that
and give up. And each one implements his own system, each having his own
list of characters that work and that don't work. The end result is a rather
small list of characters that always work and some that work sometimes.
(See the details at http://www.remote.org/jochen/mail/info/chars.html )

I argue that if we disallow some characters we can actually expect developers
to implement our spec. If we leave the spec too open, people will ignore the
difficult parts. If too many programs don't work with the difficult bits,
those tags will in practice not be usable anyway, so why not forbid them
outright and all have an easier life?

 Some of these characters need escaping for particular purposes. If you
 find a unicode character that cannot be URLencoded[2], then do let us
 know. Or if you find another encoding scenario which can only encode a
 

Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-19 Thread Jochen Topf
On Sun, Oct 17, 2010 at 09:48:31AM +0200, Ulf Lamping wrote:
 On 16.10.2010 20:44, Jochen Topf wrote:
 Hi!

 I am currently fighting some issues where tags with strange characters in
 them need to be represented in a URL for Taginfo. Lots of other websites probably
 will have similar issues. Characters like /, ?, &, etc. have special meaning
 in URLs so if they appear in tags I can't have those tags in URLs. Sometimes
 escaping characters as %XX helps, sometimes not. And those problems are not
 confined to web pages and URLs only. Special characters that need escaping
 are often a problem.

 Yes, special characters can cause headaches. I remember this from my own  
 tag analyzing experiments and other software projects ;-)

 I agree with you that most (all?) of them are (usually unintended) bugs.
 For example: Not long ago, it was a common tag problem that keys started
 or ended with a space char. IIRC the xybot regularly fixes those bugs now.

 However, as those characters can be used in the name values (and  
 elsewhere), you have to deal with the correct handling of special  
 characters in your software anyway, so I'm not sure if disallowing  
 special characters in the key will really help us in that regard.

 The problem with disallowing special characters is that you close the  
 door. Software writers will then write software that depends on them  
 not being there (or not caring which is probably the common case today).  
 If we later find out that - for whatever reasons - we want to use one of  
 those chars this will become extremely difficult, as it will cause  
 trouble at many places in existing software.

That's absolutely true. That's why I am only proposing a very small list and
don't include characters like {} that are not used now, but might make
sense in the future.

 What we currently don't have is a guideline for mappers. I'm missing
 (and have been thinking of writing for some time) a "How to write good
 tags" guide. To my knowledge we don't have a written guide (not rule)
 saying that we tend to use lower case chars, underscores instead of spaces
 and all those unwritten rules. Of course, this could include: don't use
 special chars like /, ?, ... in keys - because this makes it hard for
 software writers.

I agree that we should come up with these guidelines, but that's really
a different issue.

 I tend to simply ignore keys with special chars - as we do it today ...

Which works well for lots of software (like renderers who don't care about
the things they don't understand). Unfortunately it doesn't work for editors
or something like Taginfo which needs to work with *all* legal tags.

Jochen
-- 
Jochen Topf  joc...@remote.org  http://www.remote.org/jochen/  +49-721-388298


___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-19 Thread Jochen Topf
On Sun, Oct 17, 2010 at 04:57:33PM -0400, Anthony wrote:
 On Sat, Oct 16, 2010 at 2:44 PM, Jochen Topf joc...@remote.org wrote:
  Technically this would mean changing the API to check
  for those characters, removing any that are already in the database (can be
  done with normal manual edits because there are so few cases) and adding
  checks to the editors so that they can give meaningful error messages.
 
 To be clear, they'd still be in the database, in the history.
 
 Which is one implementation problem, because it means putting checks
 in more than one different place.  At the very least, the regular API,
 and the Potlatch API, but there are probably multiple places within
 the regular API where things would need to be checked.

But that's far fewer places than all the software out there. The whole point
of an API is that it's a sort of choke point, a single place where things
can be checked.

 And then any software which relies on these changes wouldn't work with
 historical data.

That's a problem, you are right. We could solve that by faking the history. It
would not be the first time this has been done, so it would be possible. But
most software out there only deals with current data. So even if we keep the
history, that software would become easier to write.

 It could be done, but to do all that work just to make it easier to
 code Taginfo would be, in my opinion, a waste.  Especially when there
 are plenty of simple solutions within taginfo.  If URL encoding is too
 painful, use a modified base64 encoding of the unicode string (using
 - and _ instead of + and /).

It's not only Taginfo. All software out there would be made easier. If this
were a Taginfo-only problem I wouldn't propose it. One of the biggest
problems is that Taginfo doesn't work alone, but wants to work with other
tools. If I use base64 encoding then people would need to link to something
like http://taginfo.openstreetmap.de/keys/aGlnaHdheQo= instead of
http://taginfo.openstreetmap.de/keys/highway. And the link then to XAPI
would not be http://www.informationfreeway.org/api/0.6/*[highway=*] but
http://www.informationfreeway.org/api/0.6/*[aGlnaHdheQo==*] . Not very user
friendly. And then every service would probably use different encoding
schemes...
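
For illustration, a minimal Python sketch of the modified base64 variant
suggested above (- and _ instead of + and /), using the standard library's
base64.urlsafe_b64encode. The helper names key_to_token and token_to_key are
made up for this example and are not part of Taginfo.

```python
import base64


def key_to_token(key: str) -> str:
    """Encode a tag key as URL-safe base64 ('-' and '_' replace '+' and '/')."""
    return base64.urlsafe_b64encode(key.encode("utf-8")).decode("ascii")


def token_to_key(token: str) -> str:
    """Reverse key_to_token()."""
    return base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8")


print(key_to_token("highway"))
print(token_to_key(key_to_token("name/alt?")))
```

Note that the token aGlnaHdheQo= in the URLs above decodes to "highway"
followed by a newline; without the newline the token is aGlnaHdheQ==. The
'=' padding also remains in the output, so the token is still awkward next to
a key=value separator.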

I have actually thought about that and might offer a secondary interface
to Taginfo using base64 or something like it if I can't avoid it. But that's
really ugly and probably nobody would use it anyway, because nobody wants
to write special cases for the few keys that use those characters and are
bogus anyway.

 For cleaning up the keys, I'd want to strip down to as few characters
 as possible.  There's no point supporting most unicode characters -
 keys are supposed to be in English.

No. English people should be allowed to use their own language if they
want to. So should speakers of every other language on the planet, too.

Jochen
-- 
Jochen Topf  joc...@remote.org  http://www.remote.org/jochen/  +49-721-388298




Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-19 Thread Tom Hughes

On 16/10/10 19:44, Jochen Topf wrote:


I am currently fighting some issues where tags with strange characters in them
need to be represented in a URL for Taginfo. Lots of other websites probably
will have similar issues. Characters like /, ?, &, etc. have special meaning
in URLs so if they appear in tags I can't have those tags in URLs. Sometimes
escaping characters as %XX helps, sometimes not. And those problems are not
confined to web pages and URLs only. Special characters that need escaping
are often a problem.


I really don't understand the problem here - as far as I know all 
characters can be used in URLs so long as they are properly escaped. If 
your server software is not coping with that for some reason then I 
think it's a bug.


As a test I just created a file called '<>&+?#;%.html' in an apache
served directory and then asked Firefox to fetch:


  http://server/%3c%3e%26%2b%3f%23%3b%25.html

and it was retrieved just fine.

Tom

--
Tom Hughes (t...@compton.nu)
http://compton.nu/



Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-19 Thread Jochen Topf
On Tue, Oct 19, 2010 at 10:06:15AM +0100, Tom Hughes wrote:
 On 16/10/10 19:44, Jochen Topf wrote:

 I am currently fighting some issues where tags with strange characters in
 them need to be represented in a URL for Taginfo. Lots of other websites probably
 will have similar issues. Characters like /, ?, &, etc. have special meaning
 in URLs so if they appear in tags I can't have those tags in URLs. Sometimes
 escaping characters as %XX helps, sometimes not. And those problems are not
 confined to web pages and URLs only. Special characters that need escaping
 are often a problem.

 I really don't understand the problem here - as far as I know all  
 characters can be used in URLs so long as they are properly escaped. If  
 your server software is not coping with that for some reason then I  
 think it's a bug.

That might well be a bug. But those bugs creep up all the time, because these
things are hard to do and because the specs are not as clear as they should be.
I am not saying these things can't be done right, but wouldn't it be nice if
we can get rid of that problem instead of everybody writing software for OSM
having to make sure all those cases are handled properly?

 As a test I just created a file called '<>&+?#;%.html' in an apache
 served directory and then asked Firefox to fetch:

   http://server/%3c%3e%26%2b%3f%23%3b%25.html

 and it was retrieved just fine.

And now try the same thing again creating a filename with a '/' in it and see
whether it works this time. It doesn't, because '/' is special for Unix (and
HTTP) and you need to create a directory with the first part of your name and
then the second as file. If you would actually want to create one file for
every key in the OSM database in your filesystem, you'd have a problem.

Your example basically proves my point. :-)

Jochen
-- 
Jochen Topf  joc...@remote.org  http://www.remote.org/jochen/  +49-721-388298




Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-19 Thread Andy Allan
On Tue, Oct 19, 2010 at 10:25 AM, Jochen Topf joc...@remote.org wrote:
 On Tue, Oct 19, 2010 at 10:06:15AM +0100, Tom Hughes wrote:
 On 16/10/10 19:44, Jochen Topf wrote:

 I am currently fighting some issues where tags with strange characters in
 them need to be represented in a URL for Taginfo. Lots of other websites probably
 will have similar issues. Characters like /, ?, &, etc. have special meaning
 in URLs so if they appear in tags I can't have those tags in URLs. Sometimes
 escaping characters as %XX helps, sometimes not. And those problems are not
 confined to web pages and URLs only. Special characters that need escaping
 are often a problem.

 I really don't understand the problem here - as far as I know all
 characters can be used in URLs so long as they are properly escaped. If
 your server software is not coping with that for some reason then I
 think it's a bug.

 That might well be a bug. But those bugs creep up all the time, because these
 things are hard to do and because the specs are not as clear as they should be.
 I am not saying these things can't be done right, but wouldn't it be nice if
 we can get rid of that problem instead of everybody writing software for OSM
 having to make sure all those cases are handled properly?

 As a test I just created a file called '<>&+?#;%.html' in an apache
 served directory and then asked Firefox to fetch:

   http://server/%3c%3e%26%2b%3f%23%3b%25.html

 and it was retrieved just fine.

 And now try the same thing again creating a filename with a '/' in it and see
 whether it works this time. It doesn't, because '/' is special for Unix (and
 HTTP) and you need to create a directory with the first part of your name and
 then the second as file. If you would actually want to create one file for
 every key in the OSM database in your filesystem, you'd have a problem.

 Your example basically proves my point. :-)

No, it really doesn't.

Let's put it this way - there is a subset[1] of unicode code points
that is valid for both keys and values. If you find any characters
emitted by OSM that lie outwith that range, then do let us know[3]. But
we've taken great care to permit all other code points in both keys
and values alike, since we've no idea when someone is going to need
them. Your example of why we need > (and presumably <) is actually a
great example to undermine your point.

Some of these characters need escaping for particular purposes. If you
find a unicode character that cannot be URLencoded[2], then do let us
know. Or if you find another encoding scenario which can only encode a
sub-set of unicode code points, let us know.

Your application should be able to handle every valid input. You've
found that your application is buggy, and now you're asking for the
input to be changed. But just the keys, not the values, and only
current data, not historical data. It seems a bit ... weird. And your
original list of characters is completely arbitrary, not based on any
formal specification as far as I can see.

If your editor can't handle all necessary characters, fix the editor.
If your application can't handle all the characters, fix the
application. And if you find dealing with  or = or  in a key to be
hard, it's probably worth taking some time to test with non-BMP
characters.

(If you later find that having ');DROP DATABASE;-- in a key or value
is breaking your database inserts, then please don't ask for these
characters to be banned too!)
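
To make the point about database inserts concrete, here is a small Python
sketch using the standard sqlite3 module with bound parameters. The table
name "tags" and its columns are assumptions for the example, not the OSM
schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (k TEXT, v TEXT)")

hostile_key = "');DROP DATABASE;--"

# The '?' placeholders pass the strings as data, never as SQL text,
# so no manual quoting or escaping of the key is needed.
conn.execute("INSERT INTO tags (k, v) VALUES (?, ?)", (hostile_key, "yes"))

stored = conn.execute("SELECT k FROM tags").fetchone()[0]
print(stored == hostile_key)
```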

Thanks,
Andy

[1] See http://www.w3.org/TR/2008/REC-xml-20081126/#charsets
[2] http://en.wikipedia.org/wiki/Urlencode - / is %2f, by the way.
[3] But you shouldn't rely on it, and defensively program anyway. Not
all OSM files are generated by the API.



Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-19 Thread Tom Hughes

On 19/10/10 10:25, Jochen Topf wrote:

On Tue, Oct 19, 2010 at 10:06:15AM +0100, Tom Hughes wrote:


As a test I just created a file called '<>&+?#;%.html' in an apache
served directory and then asked Firefox to fetch:

   http://server/%3c%3e%26%2b%3f%23%3b%25.html

and it was retrieved just fine.


And now try the same thing again creating a filename with a '/' in it and see
whether it works this time. It doesn't, because '/' is special for Unix (and
HTTP) and you need to create a directory with the first part of your name and
then the second as file. If you would actually want to create one file for
every key in the OSM database in your filesystem, you'd have a problem.


Sure if you have a slash then, for static files served from unix, that 
would have to correspond to a directory separator. That's a unix file 
naming limitation though.


In a dynamic application where you are decoding the path information 
yourself and deciding what it means there is no such restriction.
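
A sketch of what that can look like in Python, assuming the application
receives the raw, still percent-encoded path (some servers and gateways
decode it first, in which case this approach does not apply). The /keys/
prefix and the function name key_from_path are hypothetical.

```python
from urllib.parse import quote, unquote


def key_from_path(raw_path: str) -> str:
    """Extract the tag key from a raw (still percent-encoded) '/keys/<key>' path."""
    prefix = "/keys/"
    if not raw_path.startswith(prefix):
        raise ValueError("not a key URL")
    encoded_key = raw_path[len(prefix):]
    if "/" in encoded_key:
        raise ValueError("unexpected extra path segment")
    # Decode only after the path has been split on literal slashes,
    # so an escaped %2F stays part of the key.
    return unquote(encoded_key)


path = "/keys/" + quote("name/alt", safe="")
print(path)
print(key_from_path(path))
```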


Tom

--
Tom Hughes (t...@compton.nu)
http://compton.nu/



Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-19 Thread Anthony
On Tue, Oct 19, 2010 at 6:52 AM, Andy Allan gravityst...@gmail.com wrote:
 Let's put it this way - there is a subset[1] of unicode code points
 that is valid for both keys and values. If you find any characters
 emitted by OSM that lie outwith that range, then do let us know[3]

Even if they're only in the history? Last I checked (a couple months
ago), there were quite a few invalid characters in the history (1).
Would you like the list (seems like something which would be easy for
you to generate yourself)?  If so, is there something going to be done
about them?

(1) For example, see the last character in the comment at
http://www.openstreetmap.org/api/0.6/changeset/936207



Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-17 Thread Ulf Lamping

On 16.10.2010 20:44, Jochen Topf wrote:

Hi!

I am currently fighting some issues where tags with strange characters in them
need to be represented in a URL for Taginfo. Lots of other websites probably
will have similar issues. Characters like /, ?, &, etc. have special meaning
in URLs so if they appear in tags I can't have those tags in URLs. Sometimes
escaping characters as %XX helps, sometimes not. And those problems are not
confined to web pages and URLs only. Special characters that need escaping
are often a problem.


Yes, special characters can cause headaches. I remember this from my own 
tag analyzing experiments and other software projects ;-)


I agree with you that most (all?) of them are (usually unintended) bugs. 
For example: Not long ago, it was a common tag problem that keys started 
or ended with a space char. IIRC the xybot regularly fixes those bugs now.


However, as those characters can be used in the name values (and 
elsewhere), you have to deal with the correct handling of special 
characters in your software anyway, so I'm not sure if disallowing 
special characters in the key will really help us in that regard.


The problem with disallowing special characters is that you close the 
door. Software writers will then write software that depends on them 
not being there (or not caring which is probably the common case today). 
If we later find out that - for whatever reasons - we want to use one of 
those chars this will become extremely difficult, as it will cause 
trouble at many places in existing software.



What we currently don't have is a guideline for mappers. I'm missing
(and have been thinking of writing for some time) a "How to write good
tags" guide. To my knowledge we don't have a written guide (not rule)
saying that we tend to use lower case chars, underscores instead of spaces
and all those unwritten rules. Of course, this could include: don't use
special chars like /, ?, ... in keys - because this makes it hard for
software writers.



I tend to simply ignore keys with special chars - as we do it today ...

Regards, ULFL



Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-17 Thread Tony Morris

 The problem with disallowing special characters is that you close the
 door. Software writers will then write software that depends on them
 not being there (or not caring which is probably the common case
 today). If we later find out that - for whatever reasons - we want to
 use one of those chars this will become extremely difficult, as it
 will cause trouble at many places in existing software.

This is an instance of what is called The Expression Problem.

-- 
Tony Morris
http://tmorris.net/





Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-17 Thread Ulf Lamping

On 16.10.2010 22:31, Andreas Kalsch wrote:

I agree with

whitespace - this can be very confusing

=

To add:

Make keys lowercase (or even remove diacritics), because keys are always
simple names.


I've added a Character section to the page:

http://wiki.openstreetmap.org/wiki/Any_tags_you_like

... that tries to summarise this discussion.


We might not want to disallow characters, but letting the mappers know 
which characters to avoid is a good idea IMHO.


Please improve it if you like ...

Regards, ULFL



Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-17 Thread Anthony
On Sat, Oct 16, 2010 at 2:44 PM, Jochen Topf joc...@remote.org wrote:
 Technically this would mean changing the API to check
 for those characters, removing any that are already in the database (can be
 done with normal manual edits because there are so few cases) and adding
 checks to the editors so that they can give meaningful error messages.

To be clear, they'd still be in the database, in the history.

Which is one implementation problem, because it means putting checks
in more than one different place.  At the very least, the regular API,
and the Potlatch API, but there are probably multiple places within
the regular API where things would need to be checked.

And then any software which relies on these changes wouldn't work with
historical data.

It could be done, but to do all that work just to make it easier to
code Taginfo would be, in my opinion, a waste.  Especially when there
are plenty of simple solutions within taginfo.  If URL encoding is too
painful, use a modified base64 encoding of the unicode string (using
- and _ instead of + and /).

For cleaning up the keys, I'd want to strip down to as few characters
as possible.  There's no point supporting most unicode characters -
keys are supposed to be in English.



Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-16 Thread Sebastian Klein

Hi,

What is the problem with escaping problematic characters? There should
be built-in functions for most languages, like uri_escape in Perl and
URLEncoder.encode in Java.
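
The Python equivalent, as a short sketch, is urllib.parse.quote. The sample
key is made up; the one non-obvious detail is that quote() leaves '/'
unescaped unless safe="" is passed.

```python
from urllib.parse import quote, unquote

key = "a/b?c;d=e"  # made-up key containing several URL-special characters

# safe="" is needed because quote() treats '/' as safe by default.
encoded = quote(key, safe="")
print(encoded)
print(unquote(encoded) == key)

print(quote("a/b"))           # '/' is kept as-is
print(quote("a/b", safe=""))  # '/' is percent-encoded
```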


This proposal [1] moves values into the key to describe conditions. 
(Although you could argue, it should be done like that anyway...)
[1] http://wiki.openstreetmap.org/wiki/Proposed_features/Extended_conditions_for_access_tags



Sebastian


Jochen Topf wrote:

Hi!

I am currently fighting some issues where tags with strange characters in them
need to be represented in a URL for Taginfo. Lots of other websites probably
will have similar issues. Characters like /, ?, &, etc. have special meaning
in URLs so if they appear in tags I can't have those tags in URLs. Sometimes
escaping characters as %XX helps, sometimes not. And those problems are not
confined to web pages and URLs only. Special characters that need escaping
are often a problem.

We can't really do anything about that with regard to tag values, they must be
allowed to contain all those characters. But it would help at least a little if
we knew those characters can never appear in tag keys. And I can't really see a
legitimate reason why we need those characters in keys. Looking at the database
almost all cases where they appear in keys are obvious errors. Out of the about
2 different keys, there are only about 190 keys with problematic characters
in them (another about 800 with whitespace). Really the only case that I can't
immediately rule out as errors or see an alternative tagging are tag keys like
maxspeed:weight>7.5. And with those you can already see the problems: Some of
them have &gt; instead of the >.

So I'd like us to think about whether we can disallow a few characters from
appearing in tag keys. Technically this would mean changing the API to check
for those characters, removing any that are already in the database (can be
done with normal manual edits because there are so few cases) and adding checks
to the editors so that they can give meaningful error messages. Shouldn't be
too hard.

So, what characters am I talking about? I haven't drawn up a complete list
and we certainly would need to discuss this further.

Here is a preliminary list:

Whitespace   Should use '_' instead of whitespace in keys. Whitespace is
             also very confusing for users, especially at the beginning and
             end of a text.

/+?#;%'      Special characters in XML, HTML and/or URLs.

\'           Characters often used for quoting.

=            Because it's used in many places as the separation character
             between tag key and tag value. If we disallow this, we can always
             treat one string like foo=bar as k:foo, v:bar without any
             ambiguities.

This is a small list of special characters, all other characters should still
be allowed. That means tag keys can still be in Chinese or whatever. We'd just
disallow a few characters of which we know that they will make problems again
and again.
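
A rough Python sketch of what such a check could look like. The exact
character set is an assumption taken from the preliminary list as it appears
above; the names FORBIDDEN, is_valid_key and split_tag are hypothetical and
not part of any OSM API.

```python
# Assumed set of disallowed key characters (preliminary, see list above).
FORBIDDEN = set("/+?#;%'\\=")


def is_valid_key(key: str) -> bool:
    """True if the key is non-empty, has no whitespace and no FORBIDDEN character."""
    return bool(key) and not any(ch.isspace() or ch in FORBIDDEN for ch in key)


def split_tag(tag: str) -> tuple:
    """Split 'key=value' at the first '='; unambiguous if keys never contain '='."""
    key, sep, value = tag.partition("=")
    if not sep:
        raise ValueError("no '=' in tag")
    return key, value


print(is_valid_key("highway"))
print(is_valid_key("max speed"))
print(split_tag("name=a=b"))
```

Keys in other scripts still pass this check, since only whitespace and the
listed ASCII characters are rejected.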

And to emphasize this again: I am only talking about tag keys. Tag values must
be allowed to contain the full Unicode set of characters.

Jochen





Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-16 Thread Andreas Kalsch

I agree with

whitespace - this can be very confusing

=

To add:

Make keys lowercase (or even remove diacritics), because keys are always simple 
names.






Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-16 Thread Jochen Topf
On Sat, Oct 16, 2010 at 09:55:41PM +0200, Sebastian Klein wrote:
 What is the problem with escaping problematic characters? There should
 be built-in functions for most languages, like uri_escape in Perl and
 URLEncoder.encode in Java.

The problem is that it's very hard to get this right. I have just spent an hour
debugging a problem where the semicolon (;) character in a URL was mishandled
by Apache. The mod_proxy module I use in Taginfo silently cuts off everything
after a ; in the URL, even if it's escaped. That's probably because the ; is
handled specially for some reason in the RFC defining URLs. I found an option
to fix this, but it doesn't work with mod_rewrite, so that had to be worked
around, too. I got it to work, but I don't want to know what the next problem
will be. It just goes to show that even software like Apache has problems
dealing with these things, not to speak of scripts somebody just hacked
together.
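
For reference, a minimal Python sketch of putting a key into a URL as a single
path segment. This is not Taginfo's code; the "/keys/" prefix and the function
names are hypothetical. It only covers the encoding side: a proxy or rewrite
rule in between can still decode or truncate the result, as described above.

```python
from urllib.parse import quote, unquote

# Hypothetical route prefix, used only for illustration.
KEY_URL_PREFIX = "/keys/"


def key_url(key: str) -> str:
    """Build a URL path in which the whole key is one path segment."""
    # safe="" matters: by default quote() leaves '/' unescaped.
    return KEY_URL_PREFIX + quote(key, safe="")


def key_from_url(path: str) -> str:
    """Inverse of key_url(): strip the prefix and unescape exactly once."""
    return unquote(path[len(KEY_URL_PREFIX):])


url = key_url("a/b;c?d")
print(url)                 # /keys/a%2Fb%3Bc%3Fd
print(key_from_url(url))   # a/b;c?d
```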

I already mentioned the keys in the database with &gt; or so in them,
probably from some software escaping once too often. Special characters must
be escaped exactly once. If they are not escaped or escaped more than once, you
get broken results. And on the other side you have to unescape exactly once.
This is difficult to get right. And the penalty for not getting it right might
just be an SQL injection or a cross-site scripting attack vector.
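
The "exactly once" rule can be demonstrated with Python's standard html module.
This is only an illustrative sketch, not part of the original message, and the
key "a>b" is a made-up example.

```python
from html import escape, unescape

key = "a>b"            # made-up key containing a special character

once = escape(key)     # correct: escaped exactly once
twice = escape(once)   # bug: escaped once too often

print(once)            # a&gt;b
print(twice)           # a&amp;gt;b

# Unescaping once recovers the key only if it was escaped once.
print(unescape(once) == key)    # True
print(unescape(twice) == key)   # False, the result is still 'a&gt;b'
```

A key that went through the second path and was then stored would end up in
the database with a literal &gt; in it.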

Not allowing those characters in the first place makes software easier to
write and understand, and more robust. It even makes things friendlier for
humans, because if they use those characters you can immediately give them an
error message instead of creating broken data.

And all of that without any cost, really. I don't see that we ever need those
characters in tag keys. Of course, if we do need those characters, then we have
to get all of this right, and right every time.

 This proposal [1] moves values into the key to describe conditions.  
 (Although you could argue, it should be done like that anyway...)
 [1]  
 http://wiki.openstreetmap.org/wiki/Proposed_features/Extended_conditions_for_access_tags,

I think that's a misguided use of tag keys, probably invented by people who have
never actually tried to write code that tries to interpret OSM tags.

Jochen
-- 
Jochen Topf  joc...@remote.org  http://www.remote.org/jochen/  +49-721-388298


Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-16 Thread Jochen Topf
On Sat, Oct 16, 2010 at 10:31:47PM +0200, Andreas Kalsch wrote:
 I agree with

 whitespace - this can be very confusing

 =

 To add:

 Make keys lowercase (or even remove diacritics), because keys are always 
 simple names.

That's a different issue. I agree that keys normally should be lowercase. But
that's just a nice convention. There might be good reasons for uppercase keys,
for instance when the key name was used in upper case in some other system
that the data was imported from.

There is no need to force people into a convention here. That's different from
the issue I have been talking about, where there are real problems with some
characters.

Jochen


-- 
Jochen Topf  joc...@remote.org  http://www.remote.org/jochen/  +49-721-388298


Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-16 Thread OJ W
+1. much sanity ensues.

On Sat, Oct 16, 2010 at 7:44 PM, Jochen Topf joc...@remote.org wrote:
 So I'd like us to think about whether we can disallow a few characters from
 appearing in tag keys.


Re: [OSM-dev] Disallowing certain characters in tag keys

2010-10-16 Thread Tobias Knerr
Jochen Topf wrote:
 This proposal [1] moves values into the key to describe conditions.  
 (Although you could argue, it should be done like that anyway...)
 [1]  
 http://wiki.openstreetmap.org/wiki/Proposed_features/Extended_conditions_for_access_tags,
 
 I think that's a misguided use of tag keys, probably invented by people who have
 never actually tried to write code that tries to interpret OSM tags.

No speculation required, I'm the one who is to blame for that proposal.

Before I finished the text that's still in the wiki, however, I /did/
write an experimental implementation* for this syntax, as well as an
alternative syntax, to find out whether the ideas could work from a
developer's perspective. I didn't encounter any significant problems.

In retrospect ... well, maybe it wasn't the best of ideas to write that
implementation based on the GraphView plugin for JOSM. After all, I
figure that working on a comparably small in-memory dataset makes a
significant difference for the way you would deal with keys. Add to this
that I didn't have to deal with any web issues, or in fact any interface
between programs at all (-> no encoding issues), and it probably wasn't
remotely a representative development experience.

Unfortunately, it seems this will produce exactly the outcome I didn't
intend at all: more variety in tagging. People will continue to use
those parts of the proposal that don't require special characters
(maxspeed:wet, :forward and the like), so we will use completely
different solutions for simple and for more complex cases.

Well, I'll file this as failed attempt of overzealous newbie to build
tagging cathedral.

Tobias Knerr


* I ended up never publishing the implementation, though - not due to
condition handling itself, but because I never got around to implementing a
proper opening_hours parser. Turns out that's actually more work than
the entire rest of the syntax.
