Re: On KIO and non-unicode compatible paths

2018-04-09 Thread Christoph Feck

On 08.04.2018 13:59, Inkane wrote:

I recently had a look at Bug 173097 (Cannot delete a file with "invalid"
characters in its name), and unfortunately, this seems to be a
surprisingly difficult issue to fix with how KIO is currently designed.
[...]
The root of the issue here is basically the way Qt handles file paths,


Since QFile::setEncodingFunction() no longer works, another way to 
"hack" the conversion is to use QTextCodec::setCodecForLocale() within 
our platform plugin. A specially crafted codec could replace non-UTF8 
bytes with other UTF-16 code units.


From some minor investigations, we could either use U+DC80...U+DCFF 
(what Python3 uses), or U+EF80...U+EFFF (what MirOS uses). The latter 
code range is also mentioned as "reserved for encoding hacks" in the 
Under-ConScript Unicode Registry http://www.kreativekorp.com/ucsur/
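
To make the idea concrete, here is a minimal sketch in Python (not Qt code) of a decoder that maps each undecodable byte 0x80...0xFF to U+EF80...U+EFFF, plus the matching encoder. The handler name "pua_escape" and the functions decode_name/encode_name are made up for illustration; a QTextCodec-based version would have to do the equivalent in C++.

```python
import codecs

ESCAPE_BASE = 0xEF00  # byte 0x80 maps to U+EF80, byte 0xFF to U+EFFF


def _escape_bad_bytes(exc):
    """Decode-error handler: turn each undecodable byte into one escape code point."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(ESCAPE_BASE + b) for b in bad), exc.end


# "pua_escape" is a hypothetical handler name chosen for this sketch.
codecs.register_error("pua_escape", _escape_bad_bytes)


def decode_name(raw: bytes) -> str:
    """Decode as UTF-8, escaping invalid bytes instead of replacing them."""
    return raw.decode("utf-8", errors="pua_escape")


def encode_name(text: str) -> bytes:
    """Inverse of decode_name: escape code points become single raw bytes again."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEF80 <= cp <= 0xEFFF:
            out.append(cp - ESCAPE_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


raw = b"caf\xe9.txt"          # Latin-1 encoded name, invalid as UTF-8
text = decode_name(raw)
restored = encode_name(text)
```

One caveat of this sketch: a name that legitimately contains a code point in U+EF80...U+EFFF would be encoded back as a single byte, so a real codec would need a rule for that case.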


https://docs.python.org/3.3/howto/unicode.html says:
"Files in an Unknown Encoding

What can you do if you need to make a change to a file, but don’t know 
the file’s encoding? If you know the encoding is ASCII-compatible and 
only want to examine or modify the ASCII parts, you can open the file 
with the surrogateescape error handler[...] The surrogateescape error 
handler will decode any non-ASCII bytes as code points in the Unicode 
Private Use Area ranging from U+DC80 to U+DCFF. These private code 
points will then be turned back into the same bytes when the 
surrogateescape error handler is used when encoding the data and writing 
it back out."
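
The round trip described in that quote can be tried directly in Python. This is only an illustration of the existing surrogateescape handler; the file name is an arbitrary example.

```python
raw = b"report-\xff\xfe.txt"   # two bytes that are not valid UTF-8

# Decoding keeps the bad bytes as lone surrogates U+DCFF and U+DCFE.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes.
restored = text.encode("utf-8", errors="surrogateescape")

# Without the handler, the escaped string cannot be encoded at all.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```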


I can no longer find the MirOS/MirBSD reference, though.

--
Christoph Feck



On KIO and non-unicode compatible paths

2018-04-08 Thread Inkane

Abstract
--------

I recently had a look at Bug 173097 (Cannot delete a file with "invalid"
characters in its name), and unfortunately, this seems to be a
surprisingly difficult issue to fix with how KIO is currently designed.

The following should document the current state to the best of my
understanding, and ideally spur discussion on how to improve the
situation.

The root of the issue here is basically the way Qt handles file paths,
but this leads to two kinds of issues: a) How we use Qt for file
operations inside of KIO and b) restrictions inherent to KIO's API.

Qt and file paths
-----------------

Now, why is Qt causing issues? Well, paths are in general represented as
QStrings. For instance, QFile's constructor takes a QString as an
argument, and QDir::entryList returns a QStringList. According to the
documentation “QString stores a string of 16-bit QChars, where each
QChar corresponds one Unicode 4.0 character.” However, for file paths,
this is overly restrictive: Most Unix file systems allow arbitrary file
names, as long as they do not contain '\0' or '/'.  NTFS has more
restrictions, but still fewer than “valid Unicode string” as far as I can
tell. Note also that functions like QFile::encodeName don't help at all
here, because QString upon construction has already replaced invalid
sequences with a “replacement character”, and there is no way to get the
original byte sequence back.
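
The information loss can be illustrated with a small Python sketch. It is an analogy, not Qt code: it assumes the conversion behaves like a UTF-8 decoder that substitutes U+FFFD for invalid sequences, and lossy_round_trip is a made-up helper name.

```python
def lossy_round_trip(raw: bytes) -> bytes:
    """Decode with a replacement character, then encode again."""
    text = raw.decode("utf-8", errors="replace")
    return text.encode("utf-8")


name_a = b"caf\xe9.txt"   # Latin-1 "e acute"
name_b = b"caf\xe8.txt"   # Latin-1 "e grave"

# Both names collapse to the same bytes, and neither original is recoverable.
result_a = lossy_round_trip(name_a)
result_b = lossy_round_trip(name_b)
```

Two distinct files on disk end up with the same string, so no later encoding step can pick the right one.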

This leads me to the conclusion that Qt is currently inadequate for
handling arbitrary file names, as opposed to e.g. Boost.Filesystem or
the new std::filesystem, or functionality in other languages like
Python's os module and Rust's std::path.
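
For comparison, here is a minimal sketch of how Python's os module handles such a name when paths are passed as bytes. It assumes a POSIX file system that accepts arbitrary bytes in names (typical on Linux); other platforms may reject the name. The helper create_list_delete is made up for this example.

```python
import os
import tempfile


def create_list_delete(raw_name: bytes) -> list:
    """Create, list and delete a file whose name is given as raw bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_bytes = os.fsencode(tmp)
        path = os.path.join(tmp_bytes, raw_name)
        with open(path, "wb"):
            pass
        # A bytes argument makes os.listdir return names as bytes, unmodified.
        listed = os.listdir(tmp_bytes)
        os.remove(path)
        # The directory is empty again, so the delete really hit our file.
        assert os.listdir(tmp_bytes) == []
        return listed


listed = create_list_delete(b"caf\xe9.txt")
```

The name never passes through a Unicode string here, so nothing can be replaced along the way.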

How this affects KIO
--------------------

Implementation-wise, KIO naturally uses Qt functionality for the most
part, except in some low level layers like the new polkit integration,
where platform native functionality is used. As an example, we can look
at the Trash KIO Slave.  Paths are generally stored as QStrings and
TrashImpl::listDir uses QDir::entryList internally. To fix this, we
would need to replace the usage of QString with something that preserves
arbitrary data, like a QByteArray.  Furthermore, Qt's file handling
functionality would need to be replaced with something else, like the
platform native functions, or some abstractions provided by another
library. As you can guess, this alone already creates quite a lot of
code churn.

However, even if we did this, it wouldn't actually help all too much.
The reason for this is KIO's API. If, for instance, we take a look at
KIO::SlaveBase::listDir, we see that it takes a QUrl as its argument.
And, surprise, surprise, QUrl's constructor takes a QString. So even if
a slave internally only works with a byte preserving representation of
paths, clients like Dolphin cannot currently tell it to display the
correct path. Without changing the API, the only way I see out here
would be to use something like base64 encoding to transmit the path and
for usage in UDS_NAME, and have the decoded string in UDS_DISPLAY_NAME.
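
A rough sketch of that base64 idea, written in Python for brevity. The dictionary keys only borrow the names UDS_NAME and UDS_DISPLAY_NAME from the text; this is not the KIO UDSEntry API, and make_entry/raw_name_from_entry are hypothetical helpers.

```python
import base64


def make_entry(raw_name: bytes) -> dict:
    """Build an entry with a transport-safe name and a human-readable one."""
    return {
        # URL-safe alphabet, so the encoded name never contains '/'.
        "UDS_NAME": base64.urlsafe_b64encode(raw_name).decode("ascii"),
        # Lossy on purpose: only shown to the user, never used for file access.
        "UDS_DISPLAY_NAME": raw_name.decode("utf-8", errors="replace"),
    }


def raw_name_from_entry(entry: dict) -> bytes:
    """Recover the exact on-disk bytes from the transport-safe name."""
    return base64.urlsafe_b64decode(entry["UDS_NAME"])


raw = b"caf\xe9.txt"
entry = make_entry(raw)
recovered = raw_name_from_entry(entry)
```

The URL-safe alphabet is chosen because the standard base64 alphabet includes '/', which would be misread as a path separator.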

Why we should care
------------------

Some might argue that non-Unicode-compatible file names are a rare edge
case, and in the grand scheme of things, this might even be true.
However, Bug 206761 had 101 votes, and 173097 has accumulated 45 votes
over its lifetime, which IMHO indicates that some of our users are
affected by this. (Nota Bene: Some of the issues in 206761 might already
have been mitigated by the usage of QFile::encodeName in appropriate
places, but this only covers a subset of cases).

What can we do
--------------

Well, we can complain about Qt (I'm not sure if there already is a bug
report about Qt's path handling), but even if they care and want to
change this, we can assume that this probably won't happen before Qt 6,
considering that this touches an essential part of Qt Core.

Besides that, I actually have no idea what the correct course of action
is. As outlined above, porting KIO away from Qt's file handling is a
large scale task, for which I alone realistically have neither the time
nor the energy.  Also, even then there's still the question of how to
maintain API and ABI compatibility regarding KIO's current QUrl based
interface.

So does anyone else have brilliant ideas or provocative thoughts, has
anyone found a flaw or mistake in what I've written, or does anyone
want to tell us their favourite bike shed colour?