Re: ffs and utf8
Dmitrij had some questions about my intent; I'll try to clarify.

2014/12/02 18:57 Joel Rees <joel.r...@gmail.com>:
> (apologies for the html.)
> 2014/12/02 9:52 Dmitrij D. Czarkoff <czark...@gmail.com>:

[... and others. Snipped context: There was some discussion of what kind of file names should be allowed to be stored. There was something I read as a suggestion for using a normal form based in Unicode as a target for enforced file name conversion. There were some attempts to discuss reasons why file names should not be forcibly converted. And then communication seemed to really break down when I tried to present a semi-obvious example of why seemingly innocuous conversions turn out to be not so innocuous after all. And, since that didn't work, I tried with an example closer to the suggested normal form:]

> > Joel Rees said:
> > > Now, what would you do with this?
> > >
> > > ジョエル
> > >
> > > Why not decompose it to the following?
> > >
> > > ｼﾞｮｴﾙ

Which didn't communicate the problem, either.

> > Because it is not what Unicode normalization is.
>
> Well, it definitely isn't Unicode normalization. And there is a reason
> it isn't, even though there were many who thought the Unicode standard
> shouldn't include code points for wide-form glyphs.
>
> Let's try one more. I think you have said enough that I can infer that
> your preferred normal form is the decomposed form. So, given that your
> normalization has resulted in a file named ジョエルの歌, and given the
> necessity to send it back where it came from, how do you know whether
> or not it should be restored to ジョエルの歌 before you send it back?

[...]

But normalization is a red herring in this context. You may personally have no problems with filename conversions improperly done, but I am not willing to take them lightly where my data is concerned. I may have a NAS device that I'm using for backup without compression/amalgamation (i.e., tar/zip), and if I have a file with a decomposed name backed up on the NAS, I don't want it automatically converted to the composed form when it is restored, the existence of normal forms notwithstanding.

Unix file names can handle UTF-8 encoded Unicode file names without losing data because no conversion is necessary. There may be issues with displaying them, but the file name itself is safe, because '/' is always '/' and '\0' is always '\0'. You can even handle broken UTF-8, and unconverted UTF-16/32 of whatever byte order spit into the file name, as a sequence of bytes, if and only if you escape NUL, slash, and your escape character properly, restoring the escaped characters when putting the file names on the network.

Normalization alone does not know how to restore a potentially normalized name. It needs some sort of flag character that says this name was normalized, and a way to choose between de-normalized forms when more than one denormalized form maps to one particular normal form. The last time I looked, the Unicode standard itself stated that this was the case, and that normalized forms were not recommended for such purposes. The craziness currently infecting the entire industry leaves me with no confidence that such is still the case.

I haven't used Apple OSes since around 10.4, but Mac OS X was doing a thing where certain well-known directory names were aliased according to the current locale. For instance, the user's music directory was shown as 「音楽」 when the locale was set to ja_JP.UTF-8. This is useful to desktop users, but is sometimes confusing when you log in via ssh from a terminal that does not display Japanese and fails to declare itself as such. It's convenient, but even this can cause problems when backing up the entire home or user directory, if the backup software doesn't know to ask for the OS canonical name.

Again, apologies for using my (erk) Android device and spitting html at the list.
Joel Rees Computer memory is just fancy paper, CPUs just fancy pens. All is a stream of text flowing from the past into the future.
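[Editorial aside: Joel's byte-preserving escape idea can be sketched in a few lines. This is a hypothetical scheme, not anything specified in the thread: the '%' escape byte, the hex encoding, and the function names are all invented for illustration. The point is only that NUL, '/', and the escape byte itself are reversibly encoded, so even broken UTF-8 survives a round trip onto the network and back.]

```python
# Hypothetical byte-preserving escape for shipping raw Unix filenames
# over a protocol: NUL, '/', and the escape byte '%' are hex-escaped;
# every other byte passes through untouched, so invalid UTF-8 survives.
ESCAPE = ord('%')
SPECIAL = {0x00, ord('/'), ESCAPE}

def escape_name(raw: bytes) -> bytes:
    out = bytearray()
    for b in raw:
        if b in SPECIAL:
            out += b'%%%02X' % b      # e.g. '/' becomes b'%2F'
        else:
            out.append(b)
    return bytes(out)

def unescape_name(esc: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(esc):
        if esc[i] == ESCAPE:
            out.append(int(esc[i + 1:i + 3], 16))  # decode the two hex digits
            i += 3
        else:
            out.append(esc[i])
            i += 1
    return bytes(out)
```

The escaped form is safe to embed in a path or a NUL-terminated string, and unescaping restores the original bytes exactly.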
Re: ffs and utf8
Joel Rees writes:
> You can even handle broken UTF-8 and unconverted UTF-16/32 of whatever
> byte order spit into the file name as a sequence of bytes if and only
> if you escape NUL, slash, and your escape character properly, restoring
> the escaped characters when putting the file names on the network.

This is just asking for security issues. It's the same kind of thinking that caused the designers of Java to allow embedding NUL in strings as 0xc0 0x80, or CESU-8, where you can encode astral characters with surrogate pairs instead of just writing the character directly: the kinds of things that make people think Unicode is complex and prone to security issues, even though neither is allowed by the UTF-8 spec!

> Normalization alone does not know how to restore a potentially
> normalized name. It needs some sort of flag character that says this
> name was normalized, and a way to choose between de-normalized forms
> when more than one denormalized form maps to one particular normal
> form.

Once you start stacking multiple accents this becomes unworkable.

> I haven't used Apple OSes since around 10.4, but Mac OS X was doing a
> thing where certain well-known directory names were aliased according
> to the current locale. For instance, the user's music directory was
> shown as 「音楽」 when the locale was set to ja_JP.UTF-8.

IMO this is totally crazy behavior and unrelated to the Unicode issue.

--
Anthony J. Bentley
Re: ffs and utf8
Anthony J. Bentley said:
> I haven't used Apple OSes since around 10.4, but Mac OS X was doing a
> thing where certain well-known directory names were aliased according
> to the current locale. For instance, the user's music directory was
> shown as 「音楽」 when the locale was set to ja_JP.UTF-8.
>
> IMO this is totally crazy behavior and unrelated to the Unicode issue.

GNOME does this too. It goes even further: it proposes to rename the XDG directories if the locale changes. Most amusingly, if you happen to run GNOME and Firefox with an English locale and then switch to a non-English locale, GNOME will rename the XDG directories to the new locale's defaults, and Firefox will re-create ~/Desktop.

I rarely have to deal with systems with non-English locales, but each and every time I have to, I am terrified by the changes since the last time.

--
Dmitrij D. Czarkoff
Re: ffs and utf8
On Wed, Dec 3, 2014 at 9:09 PM, Dmitrij D. Czarkoff <czark...@gmail.com> wrote:
> Anthony J. Bentley said:
> > I haven't used Apple OSes since around 10.4, but Mac OS X was doing a
> > thing where certain well-known directory names were aliased according
> > to the current locale. For instance, the user's music directory was
> > shown as 「音楽」 when the locale was set to ja_JP.UTF-8.
> >
> > IMO this is totally crazy behavior and unrelated to the Unicode issue.
>
> GNOME does this too. It goes even further: it proposes to rename the
> XDG directories if the locale changes. Most amusingly, if you happen to
> run GNOME and Firefox with an English locale and then switch to a
> non-English locale, GNOME will rename the XDG directories to the new
> locale's defaults, and Firefox will re-create ~/Desktop.
>
> I rarely have to deal with systems with non-English locales, but each
> and every time I have to, I am terrified by the changes since the last
> time.

8-/

One of the reasons I quit using GNOME.

If there were a way of specifying the initial locale when you create a new login id, that locale could specify the language to create these directory names in, and then they should never change. My memory is that you have to log in once to do that, however.

Maybe it would be better just to not make those directories until they are needed by an application, and then ask the user to name them instead of providing standard names.

--
Joel Rees

Be careful when you look at conspiracy. Look first in your own heart, and ask yourself if you are not your own worst enemy. Arm yourself with knowledge of yourself, as well.
Re: ffs and utf8
First of all, I really don't believe that preservation of non-canonical form should be a consideration for any software. There is no single reason to allow non-canonical forms to exist at all, while there are several reasons to avoid them. More so for foreign encodings in filenames: if you are trying to store UTF-16 names on a system with a UTF-8 locale, you should be converting, not escaping. Doing otherwise is just asking for trouble.

Next, I assume that the ability to enter filenames trumps the ability to preserve the original filename on Unix-like systems. In most cases right now these two values don't clash, because user input is normalized from the very beginning in the IME. That said, there may be exceptions. E.g. several mail clients won't normalize the filename if the input encoding matches the encoding of the attachment. Thus, having received a file with a non-ASCII filename from a Mac, you'll end up being unable to address it from the shell even if it was typed using exactly the same keyboard layout you use. I don't see how this situation may be justified. The rare cases when original filenames must be preserved byte for byte warrant some special handling (e.g. storing the filenames separately elsewhere, or preserving the whole files with names and attributes in some archive or other form of special database).

Finally, provided that both ends of a network communication use canonical forms for Unicode, the matter of storing a file remotely and then receiving it back with the filename intact is simply a matter of normalization on the receiver's side. That is: if you prefer your local files in NFD, and your NAS uses NFC, you should simply normalize filenames when you receive files back. The only potential problem here is compatibility normalizations, but these are already problematic enough to be avoided in all cases where NFD or NFC do the job.

--
Dmitrij D. Czarkoff
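[Editorial aside: Dmitrij's last point, that receiving a file back intact is "simply a matter of normalization on the receiver's side", is easy to demonstrate with Python's unicodedata. This is a sketch of the principle only; the filename is invented for illustration.]

```python
import unicodedata

# An NFC-preferring client stores a file on an NFD-preferring NAS.
local = unicodedata.normalize('NFC', 'cafe\u0301')   # 'café', é is one codepoint (U+00E9)
stored = unicodedata.normalize('NFD', local)         # NAS spelling: 'e' + combining U+0301

assert local != stored                               # the codepoint sequences differ,
assert unicodedata.normalize('NFC', stored) == local # but renormalizing restores the local name
```

The round trip works precisely because NFC and NFD are canonical forms: each is a deterministic respelling of the same abstract name, so the receiver needs no extra information to convert back.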
Re: ffs and utf8
2014/12/03 22:23 Dmitrij D. Czarkoff <czark...@gmail.com>:
> First of all, I really don't believe that preservation of non-canonical
> form should be a consideration for any software.

There is no particular canonical form for some kinds of software. Unix, in particular, happens to have file name limitations that are compatible with all versions of Unicode past 2.0, at least, in UTF-8, but it has no native encoding. Most of the tools support ASCII, and many now support Unicode. But there is no native encoding. That's one of the strengths of Unix.

> There is no single reason to allow non-canonical forms to exist at all,

Non-canonical forms in what context?

> while there are several reasons to avoid them.

Which non-canonical forms?

> More so for foreign encodings in filenames -

Define "foreign encoding", too. Make sure your definition works for my context. Now, if you don't mind keeping my data away from your machine, maybe it's okay if your definition doesn't work for my context. For some 7 billion definitions of "me".

> if you are trying to store UTF-16 names on a system with a UTF-8
> locale, you should be converting, not escaping.

Not much argument with that. Many things that can be done should not necessarily be done. Most of the time, anyway. There may be some special cases, but you are talking about file names, and I can't think of any, right off the bat.

> Doing otherwise is just asking for trouble.

Oh, I just thought of a couple of exceptions. Theoretical at this point, but definitely exceptions. There's no rule that an OS has to use byte-string file names. (And you don't have to do the stupid things a certain well-known OS does, which uses UTF-16 as its native transform and Unicode as its native encoding.) But you know that.

> Next, I assume that the ability to enter filenames trumps the ability
> to preserve the original filename on Unix-like systems.

Entering file names is a function of the tools, not of the OS. And if you want tools that are limited to NFD, you are free to build and use them.

> In most cases right now these two values don't clash, because user
> input is normalized from the very beginning in the IME.

Choice, function, and construction of the input stack (and output stack) is nearly completely independent of the OS (for any decent OS).

> That said, there may be exceptions. E.g. several mail clients won't
> normalize the filename if the input encoding matches the encoding of
> the attachment.

Mail clients are also pretty independent of the OS.

> Thus, having received a file with a non-ASCII filename from a Mac,
> you'll end up being unable to address it from the shell even if it was
> typed using exactly the same keyboard layout you use.

Keyboard layout is independent of the OS. And it is actually possible to set up an OpenBSD keyboard and input method that closely mimics a Macintosh.

> I don't see how this situation may be justified.

It doesn't need to be. It only needs to be worked around.

> The rare cases when original filenames must be preserved byte for byte
> warrant some special handling (e.g. storing the filenames separately
> elsewhere, or preserving the whole files with names and attributes in
> some archive or other form of special database).

Actually, the contexts in which data handling should be orthogonal to filename encodings are the more common ones. The OS has to do a lot that the user never sees, and those internal functions just start fighting each other when they start making assumptions like encodings.

> Finally, provided that both ends of a network communication use
> canonical forms for Unicode, the matter of storing a file remotely and
> then receiving it back with the filename intact is simply a matter of
> normalization on the receiver's side.

As long as you don't drop bytes somehow on the way from here to there.

> That is: if you prefer your local files in NFD, and your NAS uses NFC,
> you should simply normalize filenames when you receive files back.

Not OS issues. Application issues. Maybe tool issues, for a limited subset of tools.

> The only potential problem here is compatibility normalizations, but
> these are already problematic enough to be avoided in all cases where
> NFD or NFC do the job.

Broken compatibility normalizations get invented precisely because OS architects think an OS needs a native encoding. Remember, the Universal Transformation Formats were invented independently of Unicode. They were adopted by the Unicode Consortium about the time the Consortium finally became convinced that there really are more than 65,536 character-like objects that need a code point in a modern information encoding scheme. UTF-8 and Unicode are not equivalent.

Joel Rees

Computer memory is just fancy paper, CPUs just fancy pens. All is a stream of text flowing from the past into the future.
Re: ffs and utf8
Joel Rees writes:
> 2014/12/03 22:23 Dmitrij D. Czarkoff <czark...@gmail.com>:
> > First of all, I really don't believe that preservation of
> > non-canonical form should be a consideration for any software.
>
> There is no particular canonical form for some kinds of software. Unix,
> in particular, happens to have file name limitations that are
> compatible with all versions of Unicode past 2.0, at least, in UTF-8,
> but it has no native encoding.

To me, the current state of affairs--where filenames can contain anything and the same filename can and does get interpreted differently by different programs--feels extremely dangerous. Moving to a single, well-defined encoding for filenames would make things simpler and safer. Well, it *might*. That's why we're discussing this carefully, to figure out whether something like this is actually workable.

There are two kinds of features being discussed:

1) Unicode normalization. This is analogous to case insensitivity: multiple filenames map to the same (normalized) filename.

2) Disallowing particular characters. Bytes 1-31 and invalid UTF-8 sequences are popular examples.

Maybe one is workable. Maybe both are, or neither.

Say I have a hypothetical machine with the above two features (normalizing to NFC, disallowing 1-31/invalid UTF-8). Now I log into a typical Unix "anything but \0 or /" machine, via SFTP or whatever. What are the failure modes?

The first kind is that I could type "get x" followed by "get y", where x and y are canonically the same in Unicode but represented differently because they're not normalized on the remote host. I would expect this to work smoothly: first I download x to NFC(x), and then y overwrites it.

The second kind is that I could type "get z", where z contains an invalid character. How should my system handle this? Error as if I had asked for a filename that's too long? Come up with a new errno? I don't know, but on this hypothetical machine it should fail somehow.

But creating new files is only part of the problem. If we still allow such names in existing files, we lose all the security/robustness benefits and just annoy ourselves by adding restrictions with no point. So say I mount a filesystem containing the same files x, y, and z. What happens?

- Fail to mount? (Simultaneously simplest, safest, and least useful)
- Hide the files? (Seems potentially unsafe)
- Try to escape the filenames? (Seems crazy)

Is it currently possible to take a hex editor and add / to a filename (as opposed to a pathname) inside a disk image? If that's possible, how do systems currently deal with it? Because it's the same problem.

FAT32 has both case insensitivity and disallowed characters. How well does OpenBSD handle those restrictions? If not optimally, then how can they be made better? If it already handles them with aplomb, then is it applicable to the above scenarios?

--
Anthony J. Bentley
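[Editorial aside: the "get x followed by get y" collision is concrete: two remote names that differ codepoint-for-codepoint collapse to one local name on a normalizing client. A sketch with Python's unicodedata; the filenames are invented for illustration.]

```python
import unicodedata

x = '\u00c5.txt'    # 'Å.txt' with precomposed U+00C5
y = 'A\u030a.txt'   # 'A' followed by combining ring above (U+030A)

assert x != y       # two distinct directory entries on the lax remote host

# On the NFC-normalizing client, both names collapse to the same filename,
# so downloading y silently overwrites the earlier download of x.
assert unicodedata.normalize('NFC', x) == unicodedata.normalize('NFC', y)
```

This is exactly the case-insensitivity analogy from the message: the remote host distinguishes the names, the normalizing client cannot.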
Re: ffs and utf8
> Joel Rees writes:
> > There is no particular canonical form for some kinds of software.
> > Unix, in particular, happens to have file name limitations that are
> > compatible with all versions of Unicode past 2.0, at least, in UTF-8,
> > but it has no native encoding.
>
> To me, the current state of affairs--where filenames can contain
> anything and the same filename can and does get interpreted differently
> by different programs--feels extremely dangerous. Moving to a single,
> well-defined encoding for filenames would make things simpler and
> safer. Well, it *might*. That's why we're discussing this carefully, to
> figure out whether something like this is actually workable.
>
> There are two kinds of features being discussed:
>
> 1) Unicode normalization. This is analogous to case insensitivity:
> multiple filenames map to the same (normalized) filename.
>
> 2) Disallowing particular characters. Bytes 1-31 and invalid UTF-8
> sequences are popular examples.
>
> Maybe one is workable. Maybe both are, or neither.
>
> Say I have a hypothetical machine with the above two features
> (normalizing to NFC, disallowing 1-31/invalid UTF-8). Now I log into a
> typical Unix "anything but \0 or /" machine, via SFTP or whatever. What
> are the failure modes?
>
> The first kind is that I could type "get x" followed by "get y", where
> x and y are canonically the same in Unicode but represented differently
> because they're not normalized on the remote host. I would expect this
> to work smoothly: first I download x to NFC(x), and then y overwrites
> it.
>
> The second kind is that I could type "get z", where z contains an
> invalid character. How should my system handle this? Error as if I had
> asked for a filename that's too long? Come up with a new errno? I don't
> know, but on this hypothetical machine it should fail somehow.
>
> But creating new files is only part of the problem. If we still allow
> such names in existing files, we lose all the security/robustness
> benefits and just annoy ourselves by adding restrictions with no point.
> So say I mount a filesystem containing the same files x, y, and z. What
> happens?
>
> - Fail to mount? (Simultaneously simplest, safest, and least useful)
> - Hide the files? (Seems potentially unsafe)
> - Try to escape the filenames? (Seems crazy)
>
> Is it currently possible to take a hex editor and add / to a filename
> (as opposed to a pathname) inside a disk image? If that's possible, how
> do systems currently deal with it? Because it's the same problem. FAT32
> has both case insensitivity and disallowed characters. How well does
> OpenBSD handle those restrictions? If not optimally, then how can they
> be made better? If it already handles them with aplomb, then is it
> applicable to the above scenarios?

http://en.wikipedia.org/wiki/Where%27s_the_beef%3F

I mean, where are the diffs for all these issues? Oh. There is no beef. This is idle chatter hoping someone supplies some secret sauce that makes a disparate audience with different demands all happy.

Why don't you guys go write some code and prove your points? Maybe this is simply a very hard problem, and not one that is going to be solved by people who simply talk about it?
Re: ffs and utf8
Joel Rees said:
> Maybe it would be better just to not make those directories until they
> are needed by an application, and then ask the user to name them
> instead of providing standard names.

Actually, it is still workable if you carry your ~/.config/user-dirs.dirs around, so that you can install it before you first log into GNOME. I used this approach to sanitize the structure of my home directory when I needed a working GNOME desktop.

--
Dmitrij D. Czarkoff
Re: ffs and utf8
(apologies for the html.)

2014/12/02 9:52 Dmitrij D. Czarkoff <czark...@gmail.com>:
> Joel Rees said:
> > Now, what would you do with this?
> >
> > ジョエル
> >
> > Why not decompose it to the following?
> >
> > ｼﾞｮｴﾙ
>
> Because it is not what Unicode normalization is.

Well, it definitely isn't Unicode normalization. And there is a reason it isn't, even though there were many who thought the Unicode standard shouldn't include code points for wide-form glyphs.

Let's try one more. I think you have said enough that I can infer that your preferred normal form is the decomposed form. So, given that your normalization has resulted in a file named ジョエルの歌, and the necessity to send it back where it came from, how do you know whether or not it should be restored to ジョエルの歌 before you send it back?

[...]

--
Joel Rees
Re: ffs and utf8
Hi Ingo,

Ingo Schwarze writes:
> While the article is old, the essence of what Schneier said here still
> stands, and it is not likely to fall in the future:
>
>   https://www.schneier.com/crypto-gram-0007.html#9
>
> The most interesting sentence here is: "Unicode is just too complex to
> ever be secure."

This is sort of valid, and it's why the only sane way to handle UTF-8 is to ignore the complexities and escape methods he alluded to. Codepoints should be represented with the shortest possible sequence. Surrogate pairs should not be encoded in UTF-8. Byte order marks should not exist in UTF-8. UTF-8 parsers should handle encoding errors in the same well-defined way: abort decoding on an invalid sequence and retry starting with the second byte.

I like how Plan 9 handled Unicode. Aside from inventing UTF-8 (an encoding scheme that actually makes sense with C strings, unlike the disastrous designs-by-committee that were UCS-2 and UTF-16), they basically used it as just a way to have more than 256 characters. Most parts of Unicode proper, like collation or canonical equivalence, were simply dropped. Noncompliant? Sure, but it made things dramatically simpler. In other words, divorce UTF-8 the encoding from Unicode the standard.

Homograph attacks are a real concern with any large character set. But:

1) I've been tricked by... well, not attacks, but simply badly written filenames with plain old ASCII: e instead of a, spaces instead of underscores, 0/O or l/I/1. It's easy to fool the human mind by feeding it something that sort of looks like what's expected.

2) Given that filenames can contain literally anything except / and \0, there are so many other attacks that enforcing valid UTF-8 in filenames would be a hypothetical improvement (not that I'm necessarily advocating doing that in OpenBSD). Spaces are bad enough. How many shell scripts handle *newlines* correctly? What about VT100 escape sequences? This whole thing is a security nightmare already.
I happily use UTF-8 filenames on OpenBSD, and have done so for years. -- Anthony J. Bentley
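[Editorial aside: the strict rules Anthony lists (shortest-form only, no encoded surrogates) are exactly what a conforming UTF-8 decoder enforces. Python's codec, for instance, rejects all of the following; the byte sequences are standard examples, not anything from the thread.]

```python
# Sequences a strict UTF-8 decoder must reject:
bad = [
    b'\xc0\x80',          # overlong encoding of NUL (Java's "modified UTF-8")
    b'\xed\xa0\x80',      # UTF-8-encoded surrogate U+D800 (the CESU-8 trick)
    b'\xf0\x82\x82\xac',  # overlong encoding of U+20AC (euro sign)
    b'\xff',              # a byte that can never appear in UTF-8
]

for seq in bad:
    try:
        seq.decode('utf-8')
        raise AssertionError('decoder accepted invalid sequence %r' % (seq,))
    except UnicodeDecodeError:
        pass  # rejected, as the spec requires
```

A decoder that accepts any of these is exactly the kind of "escape method" that turns Unicode handling into a security problem; the spec-conforming behavior is to fail.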
Re: ffs and utf8
On Sat, Nov 29, 2014 at 09:48:53PM +0100, Dmitrij D. Czarkoff wrote:
> That said, the standard provides just enough facilities to make
> filesystem-related aspects of Unicode work nicely, particularly in the
> case of utf-8. E.g. the ability to enforce NFD for all operations on
> file names could actually make several things more secure by preventing
> homograph attacks.

How do you 'enforce' NFD? Let the kernel normalize (ie /destructively/ transform) the file names behind the user's back, so that a file will be listed with a different name than that with which it was created? That's very nice and secure, indeed.

Reject file names that are not in NFD? But if you're into preventing people from using file names they want to use and have used without problems until now, why not just go all the way back to uppercase + the dot?

And btw, normalization won't do much about 'homographs':

$ echo ∕еtс∕раsswd
$ rm ∕еtс∕раsswd
$
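[Editorial aside: the shell example above works because normalization never touches confusables. They are distinct characters, not alternate encodings of the same characters. A quick check with Python's unicodedata, spelling the lookalike path with explicit escapes:]

```python
import unicodedata

# DIVISION SLASH plus Cyrillic letters that render like their Latin lookalikes:
spoof = '\u2215\u0435t\u0441\u2215\u0440\u0430sswd'   # looks like /etc/passwd
real = '/etc/passwd'

assert spoof != real
assert unicodedata.name(spoof[0]) == 'DIVISION SLASH'
assert 'CYRILLIC' in unicodedata.name(spoof[1])        # U+0435, not Latin 'e'

# No normalization form folds the confusables into their ASCII lookalikes:
for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    assert unicodedata.normalize(form, spoof) != real
```

Detecting this kind of spoofing requires a confusables table (Unicode's "skeleton" mapping), which is an entirely separate mechanism from canonical or compatibility normalization.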
Re: ffs and utf8
pizdel...@gmail.com said:
> How do you 'enforce' NFD? Let the kernel normalize (ie /destructively/
> transform) the file names behind the user's back, so that a file will
> be listed with a different name than that with which it was created?
> That's very nice and secure, indeed.

I would enforce normalization at filename access time (open(), fopen(), readdir(), etc.). Yes, destructively transform. I would reject filenames that won't decode. If this is documented, I just don't see how it is behind the user's back, and it at least partially solves the problem of accessing the right files.

FWIW, I stopped using Unicode filenames after I found that I couldn't type in the name of a file containing only glyphs that I could type, just because at that time I used a keyboard layout with combining diacritical marks instead of dead keys, so my input was NFD, while the name of the file I had gotten from somewhere was NFC.

> And btw, normalization won't do much about 'homographs':
>
> $ echo ∕еtс∕раsswd
> $ rm ∕еtс∕раsswd
> $

This is a separate problem. My suggestion does not help here, which does not render it useless for other cases.

--
Dmitrij D. Czarkoff
Re: ffs and utf8
2014-12-01 10:20 GMT+01:00 Dmitrij D. Czarkoff <czark...@gmail.com>:
> pizdel...@gmail.com said:
> > How do you 'enforce' NFD? Let the kernel normalize (ie
> > /destructively/ transform) the file names behind the user's back, so
> > that a file will be listed with a different name than that with which
> > it was created? That's very nice and secure, indeed.
>
> I would enforce normalization at filename access time (open(), fopen(),
> readdir(), etc.). Yes, destructively transform. I would reject
> filenames that won't decode. If this is documented, I just don't see
> how it is behind the user's back, and it at least partially solves the
> problem of accessing the right files.

I don't know if I read this wrong, but a new list of rules on how a filename can and must look would override the currently allowed charsets and mangle names that my programs would want to write? No please.

--
May the most significant bit of your life be positive.
Re: ffs and utf8
On Mon, Dec 01, 2014 at 10:38:40AM +0200, pizdel...@gmail.com wrote:
> On Sat, Nov 29, 2014 at 09:48:53PM +0100, Dmitrij D. Czarkoff wrote:
> > That said, the standard provides just enough facilities to make
> > filesystem-related aspects of Unicode work nicely, particularly in
> > the case of utf-8. E.g. the ability to enforce NFD for all operations
> > on file names could actually make several things more secure by
> > preventing homograph attacks.
>
> How do you 'enforce' NFD?

Anything that stores filenames outside the filesystem (e.g. in a database) for later use will have problems if NFD or NFC is enforced by the filesystem. Version control systems are particularly prone to this issue.

Apple HFS+ does such normalisation behind the application's back. Put a file with a funky name on disk, read the containing directory back, and you might not find any directory entry matching the byte sequence you wrote. Not a smart idea if you ask me, since it breaks applications which weren't written with normalization in mind.

Example: http://subversion.tigris.org/issues/show_bug.cgi?id=2464

Git suffers from the same problem (and ended up committing a patch that simply ignores compatibility with existing repositories from version 1.7.12 onwards?!?):
http://mail-archives.apache.org/mod_mbox/subversion-users/201208.mbox/%3C501D29CF.6000308%40web.de%3E

The only VCS I know of which normalized from day one is Veracity. Not because the developers were experts on Unicode, but because they had the benefit of hindsight; they link to the above SVN bug from a comment in their code.

Design-by-committee giving us standards that ignore existing realities.
Re: ffs and utf8
On Mon, Dec 01, 2014 at 10:20:08AM +0100, Dmitrij D. Czarkoff wrote:
> I would enforce normalization at filename access time (open(), fopen(),
> readdir(), etc.). Yes, destructively transform. I would reject
> filenames that won't decode. If this is documented, I just don't see
> how it is behind the user's back, and it at least partially solves the
> problem of accessing the right files.

Bad idea. See my other post. Apple did this and broke existing applications.
Re: ffs and utf8
Stefan Sperling said:
> Bad idea. See my other post. Apple did this and broke existing
> applications.

OpenBSD changed time_t and broke existing applications, but hardly anyone thinks that was a bad idea. Fancy filenames have long been known to be problematic, so filename policy enforcement is a breakage of the same sort. Apple have taken the lead here, and they may eventually do the same thing to the industry as OpenBSD did by changing time_t. FWIW, it is rather safe to normalize filenames now, as the related problems are already being solved due to the breakage on OS X.

Although I might be missing something, an additional function which takes a desired filename and outputs the normalized filename could probably solve this problem on the applications' side. Such a function, if implemented in libc, could even allow system administrators to enforce a local file naming preference as system-wide policy.

P.S.: I don't actually propose to implement filename normalization in OpenBSD right now. I've merely thrown this idea out to generate a potentially fruitful discussion. Don't mistake it for a feature request or a demand of some kind.

--
Dmitrij D. Czarkoff
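[Editorial aside: the helper Dmitrij gestures at, one that takes a desired filename and returns the policy-normalized spelling while rejecting undecodable input, might look like the sketch below. The function name, the NFD default, and the bytes-in/bytes-out interface are all assumptions for illustration, not anything actually proposed for libc.]

```python
import unicodedata

def normalize_filename(name: bytes, form: str = 'NFD') -> bytes:
    """Return the filename respelled in the given Unicode normalization
    form; reject names that are not valid UTF-8.  The normalization form
    stands in for a hypothetical system-wide naming policy."""
    text = name.decode('utf-8')   # raises UnicodeDecodeError on invalid input
    return unicodedata.normalize(form, text).encode('utf-8')
```

For example, normalize_filename('café'.encode()) returns the decomposed spelling b'cafe\xcc\x81', and a name containing an invalid byte such as 0xff is rejected with an exception rather than silently mangled.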
Re: ffs and utf8
2014-12-01 12:05 GMT+01:00 Dmitrij D. Czarkoff <czark...@gmail.com>:
> Stefan Sperling said:
> > Bad idea. See my other post. Apple did this and broke existing
> > applications.
>
> OpenBSD changed time_t and broke existing applications, but hardly
> anyone thinks that was a bad idea. Fancy filenames have long been known
> to be problematic, so filename policy enforcement is a breakage of the
> same sort.

Well, even if the implementation broke old assumptions about how large the storage of a time_t is, it did not make previously valid dates invalid. There is quite a bit of difference between changing the storage format and making some dates impossible that previously did work.

--
May the most significant bit of your life be positive.
Re: ffs and utf8
Janne Johansson said:
> There is quite a bit of difference between changing the storage format
> and making some dates impossible that previously did work.

I don't think so. Something got changed, things got broken and need to be fixed. The only real question is: is the change worth the trouble? I think it is, although the unanimous negative reaction hints that I am probably missing something important.

--
Dmitrij D. Czarkoff
Re: ffs and utf8
On Mon, Dec 1, 2014 at 8:43 PM, Dmitrij D. Czarkoff <czark...@gmail.com> wrote:
> Janne Johansson said:
> > There is quite a bit of difference between changing the storage
> > format and making some dates impossible that previously did work.
>
> I don't think so. Something got changed, things got broken and need to
> be fixed. The only real question is: is the change worth the trouble? I
> think it is, although the unanimous negative reaction hints that I am
> probably missing something important.

Hmm. What would you suggest doing with the following file name?

／ｅｔｃ

(You may need a Japanese font to display it.)

If you try to normalize it on a *nix box, it will hopefully conflict with your system file permissions. But, then what do you do with it? If you throw it away because it's non-normal, and it happens to have the parts of the new marketing plan that didn't fit under some other category, will the boss be okay with that?

Maybe your company has a set of normalization rules that works okay for your company. Maybe my company doesn't work well with those rules. That's the problem.

--
Joel Rees

Be careful when you look at conspiracy. Look first in your own heart, and ask yourself if you are not your own worst enemy. Arm yourself with knowledge of yourself, as well.
Re: ffs and utf8
Joel Rees said: Hmm. What would you suggest doing with the following file name? ／ｅｔｃ (You may need a Japanese font to display it.) If you try to normalize it on a *nix box, it will hopefully conflict with your system file permissions. But, then what do you do with it? If you throw it away because it's non-normal, and it happens to have the parts of the new marketing plan that didn't fit under some other category, will the boss be okay with that? I am not sure I get you. I proposed using NFD for filenames. In a system implementing that, the filename you provide as an example above would be stored as-is, as it is already NFD and can't be further decomposed. I never suggested NFKD, as your message implies. -- Dmitrij D. Czarkoff
Re: ffs and utf8
On Mon, Dec 01, 2014 at 12:43, Dmitrij D. Czarkoff wrote: Janne Johansson said: There is quite a bit of difference between changing the storage format and making some dates impossible that previously did work. Don't think so. Something got changed, things got broken and need to be fixed. The only real question is: is the change worth the trouble. I think it is, although unanimous negative reaction hints that I am probably missing something important. Fixing time_t did not suddenly make OpenBSD systems unable to communicate with other systems with other time_t sizes. It was an implementation detail, but the various protocols and formats that embed dates and times in them were not changed. Your proposed change changes an important protocol: the one that lets me save files I receive from others to my filesystem. When I can no longer save web pages or email attachments and send them back to the sender with the same name, you have broken the protocol. You may also think of it this way. 64-bit time_t permitted more times to be represented. Long after the tar format itself cannot handle the current date, I will still be able to unpack old existing archives. You are proposing that *fewer* filenames be represented. My existing archives with forbidden filenames will no longer work.
Re: ffs and utf8
Joel Rees, 01 Dec 2014 22:04: Hmm. What would you suggest doing with the following file name? ／ｅｔｃ (You may need a Japanese font to display it.) If you try to normalize it on a *nix box, it will hopefully conflict with your system file permissions. But, then what do you do with it? this example has potential to confuse, of course; however, NFD normalizing will not turn it into '/etc', as these are not composite characters: '／': unicode cat=Po 'ｅ': unicode cat=Ll 'ｔ': unicode cat=Ll 'ｃ': unicode cat=Ll (Po=other punctuation, Ll=lowercase letter) -f -- i quit drinking/smoking/ sex once. very boring 15 minutes.
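The distinction frantisek draws here can be checked mechanically. A minimal Python sketch using the standard unicodedata module (the fullwidth string "／ｅｔｃ" is the example under discussion; the assertions are mine, not from the thread):

```python
import unicodedata

# "／ｅｔｃ": FULLWIDTH SOLIDUS plus fullwidth e, t, c
fullwidth = "\uff0f\uff45\uff54\uff43"

# NFD applies only canonical decompositions; the fullwidth forms have
# none, so an NFD-enforcing filesystem would store this name unchanged.
assert unicodedata.normalize("NFD", fullwidth) == fullwidth

# NFKD additionally applies compatibility decompositions, which fold
# the fullwidth forms into plain ASCII -- including a path separator,
# which is exactly the collision Joel was worried about.
assert unicodedata.normalize("NFKD", fullwidth) == "/etc"
```

So the example is only dangerous under NFKD/NFKC, which is why Czarkoff stresses that he proposed NFD.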
Re: ffs and utf8
Stefan Sperling, 29 Nov 2014 18:17: Are you aware of 'detox' package? There's also converters/convmv $ touch »´ÁÉǑÄ« $ convmv * wrong/unknown from encoding! $ convmv -f utf8 -t latin1 * Starting a dry run without changes... iso-8859-1 doesn't cover all needed characters for: ./»´ÁÉǑÄ« To prevent damage to your files, we won't continue. First fix this or correct options! convmv is a precision tool. it needs to know the source and target encoding, and is very careful. in contrast my tool is much more blunt, because i dont care about an exact 1:1 mapping as on my disks punctuation and diacritics are not welcome. -f -- life is like... an analogy.
Re: ffs and utf8
On Mon, Dec 1, 2014 at 11:13 PM, Dmitrij D. Czarkoff czark...@gmail.com wrote: Joel Rees said: Hmm. What would you suggest doing with the following file name? ／ｅｔｃ (You may need a Japanese font to display it.) If you try to normalize it on a *nix box, it will hopefully conflict with your system file permissions. But, then what do you do with it? If you throw it away because it's non-normal, and it happens to have the parts of the new marketing plan that didn't fit under some other category, will the boss be okay with that? I am not sure I get you. I proposed using NFD for filenames. In a system implementing that, the filename you provide as an example above would be stored as-is, as it is already NFD and can't be further decomposed. I never suggested NFKD, as your message implies. Very good. Now, what would you do with this? ジョエル Why not decompose it to the following? ｼﾞｮｴﾙ I know what the Unicode rules say, but my boss says, if I'm going to play with file names, he wants it done his way. And the company across the hall has a policy just a little different, but still not matching Unicode rules. You have to keep rules about making file names for internal use separate from rules about storing filenames received, or the internal system loses its meaning. What use are systems if you have to resort to meaninglessness to use them? ;-P -- Joel Rees All truth is independent in that sphere in which God has placed it, to act for itself, as all intelligence also; otherwise there is no existence.
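Joel's example turns on the difference between fullwidth katakana (ジョエル) and their halfwidth compatibility forms (ｼﾞｮｴﾙ) -- my reading of the decomposition he had in mind. A short Python sketch with the standard unicodedata module, showing that this folding is outside the scope of canonical normalization:

```python
import unicodedata

composed  = "\u30b8\u30e7\u30a8\u30eb"        # ジョエル, fullwidth katakana
halfwidth = "\uff7c\uff9e\uff6e\uff74\uff99"  # ｼﾞｮｴﾙ, halfwidth + voiced mark

# Canonical normalization (NFD/NFC) never touches the halfwidth forms...
assert unicodedata.normalize("NFD", halfwidth) == halfwidth

# ...but compatibility normalization (NFKC) folds them into the
# fullwidth name. After folding, nothing records that the original
# was halfwidth, so the name cannot be restored for the sender.
assert unicodedata.normalize("NFKC", halfwidth) == composed
```

This is the round-trip problem in miniature: more than one de-normalized form maps onto one normal form.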
Re: ffs and utf8
Ted Unangst writes: On Mon, Dec 01, 2014 at 12:43, Dmitrij D. Czarkoff wrote: Janne Johansson said: There is quite a bit of difference between changing the storage format and making some dates impossible that previously did work. Don't think so. Something got changed, things got broken and need to be fixed. The only real question is: is the change worth the trouble. I think it is, although unanimous negative reaction hints that I am probably missing something important. Fixing time_t did not suddenly make OpenBSD systems unable to communicate with other systems with other time_t sizes. It was an implementation detail, but the various protocols and formats that embed dates and times in them were not changed. Your proposed change changes an important protocol: the one that lets me save files I receive from others to my filesystem. When I can no longer save web pages or email attachments and send them back to the sender with the same name, you have broken the protocol. Should I be able to save web pages or email attachments with filenames containing newlines? How about backspaces? What about terminal escape sequences, or ASCII control codes? Yes, these have been possible in Unix since time immemorial. And the fact that to this day there's no way for me to sanitize them terrifies me. -- Anthony J. Bentley
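Anthony's point is easy to demonstrate: the filesystem accepts control characters that a terminal will happily interpret. The filename below is hypothetical, and the sanitize() helper is only a sketch of the '?'-substitution that ls(1) performs with -q:

```python
import os
import re
import tempfile

# A hypothetical attachment name containing a newline, a backspace,
# and the start of an ANSI escape sequence -- all legal in a Unix name.
name = "report\n\x08\x1b[31mfinal.txt"

d = tempfile.mkdtemp()
open(os.path.join(d, name), "w").close()   # the filesystem accepts it
assert name in os.listdir(d)

def sanitize(s):
    # Replace C0 controls and DEL with '?', as ls -q does for display.
    return re.sub(r"[\x00-\x1f\x7f]", "?", s)

assert sanitize(name) == "report???[31mfinal.txt"
```

Without some such filter, piping raw names to a terminal hands the sender control of your escape sequences.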
Re: ffs and utf8
Joel Rees said: Now, what would you do with this? ジョエル Why not decompose it to the following? ｼﾞｮｴﾙ Because that is not what Unicode normalization is. I know what the Unicode rules say, but my boss says, if I'm going to play with file names, he wants it done his way. And now you suggest that the idea of enforcing a local filename policy is bad because the local policy might not be sane. Ok. First, let's decouple the NFD suggestion from local policy. Again, no problems with NFD here. I don't really see any sense in a local policy that demands this conversion, but if your boss needs it, it is not my business. I can't see why you mention it, though: it is a completely unrelated problem. You have to keep rules about making file names for internal use separate from rules about storing filenames received, or the internal system loses its meaning. Are you now speaking of normalization, or of local policy? At any rate, any incoming file has a name, which is encoded somehow. It may be encoded in utf-16le, for example. Now, either you store a filename that you can't read without using iconv or another tool of that kind, or you convert the name to your locale. If your locale happens to use utf-8, you still have to convert one byte sequence to another byte sequence. The conversion I proposed would be destructive, but would maintain Unicode equivalence, so aside from a subtle technicality (the choice of canonical form) the set of glyphs that makes up the filename would remain exactly the same. This is not even a policy, just consistent representation. -- Dmitrij D. Czarkoff
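A minimal illustration of the canonical-equivalence point Czarkoff is making, and of the round-trip question Joel keeps raising (standard unicodedata module; the example character is mine):

```python
import unicodedata

nfc = "\u00e9"      # 'e-acute' as one precomposed code point (NFC)
nfd = "e\u0301"     # 'e' + COMBINING ACUTE ACCENT (NFD)

# Different byte sequences on disk...
assert nfc.encode("utf-8") != nfd.encode("utf-8")

# ...but canonically equivalent: the user sees the same glyph, and
# normalization maps each onto the other form deterministically.
assert unicodedata.normalize("NFD", nfc) == nfd
assert unicodedata.normalize("NFC", nfd) == nfc

# The catch for backup/restore: once a name is stored in NFD, nothing
# records whether the sender originally used the precomposed bytes.
```

Czarkoff's point is that this conversion preserves the glyphs; Joel's point is that it does not preserve the bytes, and nothing flags that a conversion ever happened.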
Re: ffs and utf8
Joel Rees said: That said, the standard provides just enough facilities to make filesystem-related aspects of Unicode work nicely, particularly in the case of utf-8. E.g. the ability to enforce NFD for all operations on file names could actually make several things more secure by preventing homograph attacks. I think this assertion is a bit optimistic, and not just given your following caveat. Given that I have to cope with Unicode file names every day, I just can't see a more pessimistic approach than allowing arbitrary Unicode codepoints with no sanitization whatsoever. Every now and then I have to use printf(1) and xclip(1x) just because there is no other way to address a file or identify all codepoints of its name. From here I don't see the ability to enforce policy on Unicode strings as being as useless as you put it. -- Dmitrij D. Czarkoff
Re: ffs and utf8
Thomas Bohl said: # ls | cat will display the characters right. Not entirely sure why, though. From the ls(1) manual: | -q Force printing of non-graphic characters in file names as the | character `?'; this is the default when output is to a terminal. -- Dmitrij D. Czarkoff
Re: ffs and utf8
On Sun, Nov 30, 2014 at 6:31 PM, Dmitrij D. Czarkoff czark...@gmail.com wrote: Joel Rees said: That said, the standard provides just enough facilities to make filesystem-related aspects of Unicode work nicely, particularly in the case of utf-8. E.g. the ability to enforce NFD for all operations on file names could actually make several things more secure by preventing homograph attacks. I think this assertion is a bit optimistic, and not just given your following caveat. Given that I have to cope with Unicode file names every day, Same here, FWIW, Japanese. (And then there are the times I have to work on file names encoded in shift-JIS. Fun stuff.) I just can't see a more pessimistic approach than allowing arbitrary Unicode codepoints with no sanitization whatsoever. Pessimistic? Optimistic? Asking for trouble, yes. I generally try to use Romaji (latinized phonetic Japanese, all ASCII, if I avoid the overbar approach to lengthened vowels) when I know a file is going to move to another machine. If file names are strictly phonetic, you can set up a round-trip mapping from Romaji to kana, but most of the time Japanese file names include Kanji, and there is no round-trip mapping that can be meaningfully read by a human. There are ASCII-encoded JIS codes which could be used to produce a round-trip mapping, but I'd need to run the output of ls through some sort of custom filter to make sense of the names. Might be a useful thing to build. Every now and then I have to use printf(1) and xclip(1x) just because there is no other way to address a file or identify all codepoints of its name. From here I don't see the ability to enforce policy on Unicode strings as being as useless as you put it. Not saying it's useless to have a policy. What I'm saying is that Unicode utf-8 has parsing problems independent of issues like characters that appear the same but have separate code points. utf-8 is pretty simple until you start mapping it to real characters.
Getting the mapping right is difficult, which is why you have your policy, I think. One of these days I want to build a ctype library that gives meaningful results for the Japanese subset of the CJK subset of Unicode. But that's only going to help with some of the problems. -- Joel Rees Be careful when you look at conspiracy. Look first in your own heart, and ask yourself if you are not your own worst enemy. Arm yourself with knowledge of yourself, as well.
Re: ffs and utf8
On 2014-11-29, Ingo Schwarze schwa...@usta.de wrote: But Unicode must never be allowed near anything that might get executed as program code, including scripts in interpreted languages, including, but not limited to, the shell. In particular, that means trying to handle Unicode in filenames is a bad idea. Why filenames at all? Just use inode numbers. -- Christian naddy Weisgerber na...@mips.inka.de
ffs and utf8
i have written for myself a small python3 script that removes accented characters and all utf8 symbols from filenames, a kind of utf-8 to ascii sanitizer. while working on it, i created some strange test cases (e.g. »´ÁÉǑÄ«) for filenames and i was pleasantly surprised that the files were created/read/renamed/deleted without problems. is it true to say then, that ffs is entirely utf8 safe, and/or that ffs is actually an utf-8 encoded filesystem as IIRC Mac OS is? or is it some kind of happy accident that it works? :) -f -- mips = meaningless index of processor speed
Re: ffs and utf8
Hello, On 29 November 2014 at 14:02, frantisek holop min...@obiit.org wrote: i have written for myself a small python3 script that removes accented characters and all utf8 symbols from filenames, a kind of utf-8 to ascii sanitizer. Are you aware of 'detox' package? -- Regards, Ville
Re: ffs and utf8
frantisek holop, 29 Nov 2014 13:02: while working on it, i created some strange test cases (e.g. »´ÁÉǑÄ«) for filenames and i was pleasantly surprised that the files were created/read/renamed/deleted without problems. i think i should clarify this a bit: they show up perfectly in midnight commander, not in the shell. $ touch »´ÁÉǑÄ« $ ls ?? -f -- to every rule there's an exception vice versa.
Re: ffs and utf8
Ville Valkonen, 29 Nov 2014 14:08: Are you aware of 'detox' package? $ touch »´ÁÉǑÄ« $ detox * $ ls A_A_A_A_C_A_A_ $ touch »´ÁÉǑÄ« $ my_silly_script $ ls aeoa perhaps with some massaging detox can be made to work like my script, i dont know. but that is actually beside the point. i wrote my own 128 lines python script to have fun, see how some stuff works, learn about utf-8, etc. as an added bonus, it is also a mass renamer that can remove and add strings from/to filenames. and when it grows up, it will also autotag music files, create art thumbnails and feed them to cmus! :) where is your detox now? :) -f -- dinner: dead animals and some stuff out of the ground.
Re: ffs and utf8
Shouldn't the aim in 2014 be to have everything working in utf-8?
Re: ffs and utf8
Paolo Aglialoro, 29 Nov 2014 13:56: Shouldn't in 2014 the aim having all working in utf-8? sure. but i like my filenames ascii and whitespaceless. shows my age. -f -- what a nice night for an evening. -- steven wright
Re: ffs and utf8
frantisek holop said: is it true to say then, that ffs is entirely utf8 safe, and/or that ffs is actually an utf-8 encoded filesystem as IIRC Mac OS is? or is it some kind of happy accident that it works? :) As I get it, ffs is entirely utf8 safe because it is not encoding-aware. With whatever locale, the commands $ touch `printf '\aabb\bc'` and $ touch `printf '\201\202\203'` both succeed. (Interestingly, ls | cat in the presence of a filename with an ASCII bell does not actually ring the bell, although backspace works as expected.) These octet arrays may happen to be valid utf-8 as well. -- Dmitrij D. Czarkoff
Re: ffs and utf8
Hi, On 29.11.2014 13:20, frantisek holop wrote: i think i should clarify this a bit: they show perfect in midnight commander, not in shell. $ touch »´ÁÉǑÄ« $ ls ?? -f I had a similar problem some time ago and have been told that the ls tool is not aware of UTF-8. See here for some details: http://marc.info/?l=openbsd-ports&m=135345716931800&w=2 regards Lars
Re: ffs and utf8
On Sat, Nov 29, 2014 at 13:02, frantisek holop wrote: is it true to say then, that ffs is entirely utf8 safe, and/or that ffs is actually an utf-8 encoded filesystem as IIRC Mac OS is? or is it some kind of happy accident that it works? :) FFS stores filenames as bytes.
Re: ffs and utf8
On 2014-11-29, frantisek holop min...@obiit.org wrote: is it true to say then, that ffs is entirely utf8 safe, and/or that ffs is actually an utf-8 encoded filesystem as IIRC Mac OS is? The former. Unix filesystems accept all bytes for filenames with the exception of 0x2f, which serves as the directory separator, and 0x00, which terminates the string. Any encoding that doesn't conflict with these two restrictions is valid. UTF-8 was invented on Plan 9, a Unix offspring, so no surprise there. -- Christian naddy Weisgerber na...@mips.inka.de
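The bytes-only rule naddy states can be demonstrated from Python by using byte paths (the particular byte values below are arbitrary test data, not from the thread):

```python
import os
import tempfile

d = tempfile.mkdtemp().encode()

# Any byte except 0x00 and 0x2f is legal in a Unix filename --
# including sequences that are not valid UTF-8, plus an ASCII BEL.
raw = b"\x81\x82\x83\x07"
os.close(os.open(os.path.join(d, raw), os.O_CREAT | os.O_WRONLY))

# The kernel stored the bytes verbatim; no encoding was involved.
assert raw in os.listdir(d)

# The two reserved bytes really are special: an embedded NUL is
# rejected before it ever reaches the filesystem.
try:
    os.open(os.path.join(d, b"a\x00b"), os.O_CREAT | os.O_WRONLY)
    raise AssertionError("NUL in a filename should not be accepted")
except ValueError:
    pass
```

This is why "ffs is utf8 safe" is really "ffs never looks": validity of the encoding is purely the application's problem.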
Re: ffs and utf8
On 2014-11-29, frantisek holop min...@obiit.org wrote: $ touch »´ÁÉǑÄ« $ ls ?? If you need a locale-aware ls(1), use the one from the colorls package. (Don't worry, colored output is entirely optional.) -- Christian naddy Weisgerber na...@mips.inka.de
Re: ffs and utf8
Hi, Paolo Aglialoro wrote on Sat, Nov 29, 2014 at 01:56:23PM +0100: Shouldn't in 2014 the aim having all working in utf-8? Most definitely not; that would directly run contrary to some of OpenBSD's most important project goals: correctness, simplicity, security. While the article is old, the essence of what Schneier said here still stands, and it is not likely to fall in the future: https://www.schneier.com/crypto-gram-0007.html#9 In conclusion, Unicode can be used for any text where it is a safe assumption that the only thing ever to be done with it is display it to the user, so the only risk would be displaying slightly garbled text to the user. That's why mandoc(1) isn't very worried about handling Unicode characters in manual content. But Unicode must never be allowed near anything that might get executed as program code, including scripts in interpreted languages, including, but not limited to, the shell. In particular, that means trying to handle Unicode in filenames is a bad idea. System tools (like ls(1) and sh(1)) must never attempt to interpret input as Unicode for essentially the same reasons why they must never attempt to encode output in XML. Yours, Ingo
Re: ffs and utf8
On Nov 29 13:02:34, min...@obiit.org wrote: is it true to say then, that ffs is entirely utf8 safe, and/or that ffs is actually an utf-8 encoded filesystem The file names are just strings of bytes. There is nothing UTF8 about them. On Nov 29 14:23:35, czark...@gmail.com wrote: (Interestingly, ls | cat in presence of filename with ASCII bell does not actually ring the bell, although backspace works as expected.) This can't but remind one of Mr. Malcolm Peter Brian Telescope Adrian Umbrella Stand Jasper Wednesday (pops mouth twice) Stoatgobbler John Raw Vegetable (whinnying) Arthur Norman Michael (blows squeaker) Featherstone Smith (whistle) Northcott Edwards Harris (fires pistol, then 'whoop') Mason (chuff-chuff-chuff-chuff) Frampton Jones Fruitbat Gilbert (sings) 'We'll keep a welcome in the' (three shots) Williams If I Could Walk That Way Jenkin (squeaker) Tiger-drawers Pratt Thompson (sings) 'Raindrops Keep Falling On My Head' Darcy Carter (horn) Pussycat (sings) 'Don't Sleep In The Subway' Barton Mainwaring (hoot, 'whoop') Smith
Re: ffs and utf8
On Sat, Nov 29, 2014 at 02:08:32PM +0200, Ville Valkonen wrote: Hello, On 29 November 2014 at 14:02, frantisek holop min...@obiit.org wrote: i have written for myself a small python3 script that removes accented characters and all utf8 symbols from filenames, a kind of utf-8 to ascii sanitizer. Are you aware of 'detox' package? There's also converters/convmv
Re: ffs and utf8
Ingo Schwarze said: While the article is old, the essence of what Schneier said here still stands, and it is not likely to fall in the future: https://www.schneier.com/crypto-gram-0007.html#9 Sorry, but this article is mostly based on a lack of understanding of Unicode. that would directly run contrary to some of OpenBSD's most important project goals: Correctness, simplicity, security. Yes, Unicode is very complex. Just complex enough that there is (to my knowledge) no single application that does it right in every aspect. That said, the standard provides just enough facilities to make filesystem-related aspects of Unicode work nicely, particularly in the case of utf-8. E.g. the ability to enforce NFD for all operations on file names could actually make several things more secure by preventing homograph attacks. Unfortunately, there is no realistic hope that NFD will be enforced by every OS and filesystem out there any time soon, so at this stage file names with bytes outside the printable ASCII range will cause problems at some point. On my systems I limit filenames to the [0-9A-Za-z~._/-] range. -- Dmitrij D. Czarkoff
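Czarkoff's allow-list is easy to state as code. A sketch of such a check (his exact character class, but my invented examples; not his actual tooling), which incidentally also rejects homographs, since a Cyrillic lookalike letter falls outside the class:

```python
import re

# Czarkoff's stated policy: filenames limited to [0-9A-Za-z~._/-].
SAFE = re.compile(r"^[0-9A-Za-z~._/-]+$")

assert SAFE.match("reports/plan-2014_v2.txt")        # plain ASCII: ok
assert not SAFE.match("rep\u043erts.txt")            # Cyrillic 'о' homograph
assert not SAFE.match("\u30b8\u30e7\u30a8\u30eb")    # katakana: rejected too
```

The trade-off is visible in the last assertion: the policy that defeats homographs also forbids every legitimate non-ASCII name, which is exactly what the rest of the thread is arguing about.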
Re: ffs and utf8
On Sun, Nov 30, 2014 at 5:48 AM, Dmitrij D. Czarkoff czark...@gmail.com wrote: Ingo Schwarze said: While the article is old, the essence of what Schneier said here still stands, and it is not likely to fall in the future: https://www.schneier.com/crypto-gram-0007.html#9 Sorry, but this article is mostly based on a lack of understanding of Unicode. Sometimes I have found myself wondering whether Bruce Schneier's lack of erudition is studied. At any rate, I've found that, when he says "I see smoke", there is often fire somewhere in the vicinity. that would directly run contrary to some of OpenBSD's most important project goals: Correctness, simplicity, security. Yes, Unicode is very complex. Just complex enough that there is (to my knowledge) no single application that does it right in every aspect. Considering that making a universal character encoding scheme is, in and of itself, a self-contradictory project, they've done moderately well, I think. That said, the standard provides just enough facilities to make filesystem-related aspects of Unicode work nicely, particularly in the case of utf-8. E.g. the ability to enforce NFD for all operations on file names could actually make several things more secure by preventing homograph attacks. I think this assertion is a bit optimistic, and not just given your following caveat. Unfortunately, there is no realistic hope that NFD will be enforced by every OS and filesystem out there any time soon, so at this stage file names with bytes outside the printable ASCII range will cause problems at some point. On my systems I limit filenames to the [0-9A-Za-z~._/-] range. Warning! Rambling ahead: And now I find myself bemused again by my own regular tendency to be confused by the conflation of the file name database with more general purpose database indexes.
Fifteen years ago, I said to someone that the useful life of the current encoding scheme in Unicode was about twenty-five years, and that they/we should be looking for good ways to restructure it. I had trouble then figuring out a way to disentangle the various requirements, and I still don't see a clear way to do it. But I'm inclined to think the original idea of a 16-bit encoding, while not correctly seeing the reality of the actual characters in use, was almost seeing the requirements of the system correctly. I think we need an international encoding that uses a restricted subset of the actual characters in use, and a structure that allows for simpler parsing of the international encoding part. (And from here my thoughts get even less coherent. Sorry for the interruption.) -- Joel Rees Be careful when you look at conspiracy. Look first in your own heart, and ask yourself if you are not your own worst enemy. Arm yourself with knowledge of yourself, as well.
Re: ffs and utf8
On 29.11.2014 at 13:20, frantisek holop wrote: i think i should clarify this a bit: they show perfect in midnight commander, not in shell. $ touch »´ÁÉǑÄ« $ ls ?? # ls | cat will display the characters right. Not entirely sure why, though.