[ark] [Bug 378904] Ark should use charset auto-detection for filenames

2024-03-02 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=378904

gudvinr+...@gmail.com changed:

   What|Removed |Added

 CC||gudvinr+...@gmail.com

-- 
You are receiving this mail because:
You are watching all bug changes.

[ark] [Bug 378904] Ark should use charset auto-detection for filenames

2022-12-04 Thread Elvis Angelaccio
https://bugs.kde.org/show_bug.cgi?id=378904

--- Comment #16 from Elvis Angelaccio  ---
Update on this issue: I played a bit with encoding probing using both
KEncodingProber and ICU.

The biggest issue with this approach is that filenames are usually very short,
so the prober does not have enough data to properly guess the correct encoding.

One possible solution could be the following: we add KEncodingProber support in
the libzip plugin (Ark's default plugin for zip files). If KEncodingProber
detects one or more non-unicode encodings, Ark would show a notification to the
user asking if they want to attempt to fix garbled filenames, if any. If the
user confirms, the libzip plugin would then reload the archive and convert the
filenames from the detected encoding to the standard UTF-16 encoding used by
Qt. This "opt-in" step is required because if we do it automatically we could
break the normal workflow for valid zip archives that only contain UTF-8
filenames (since again, the probing is not precise and could detect a wrong
encoding for a valid UTF-8 filename, and there is the additional overhead problem
mentioned in previous comments).
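
A rough sketch of the opt-in flow described above, written in Python rather than against the libzip plugin. It is only an illustration: `is_suspect`, `load_names` and the `user_confirms` callback are hypothetical names, and a strict UTF-8 validity check stands in for KEncodingProber. Names that are already valid UTF-8 are never touched, so a normal archive behaves as before.

```python
def is_suspect(raw: bytes) -> bool:
    """Return True if the raw filename bytes are not valid UTF-8."""
    try:
        raw.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


def load_names(raw_names, detected_encoding, user_confirms):
    """Decode filenames, converting suspect ones only if the user opts in.

    raw_names:         list of filename byte strings from the archive
    detected_encoding: encoding suggested by some prober (assumed given)
    user_confirms:     callback standing in for the notification prompt
    """
    suspects = [n for n in raw_names if is_suspect(n)]
    convert = bool(suspects) and user_confirms(suspects)
    names = []
    for raw in raw_names:
        if convert and is_suspect(raw):
            # Opt-in path: reinterpret the bytes in the detected encoding.
            names.append(raw.decode(detected_encoding, errors="replace"))
        else:
            # Default path: treat the name as UTF-8, as before.
            names.append(raw.decode("utf-8", errors="replace"))
    return names
```

If the user declines, the suspect names still load, just with replacement characters in place of the undecodable bytes.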


[ark] [Bug 378904] Ark should use charset auto-detection for filenames

2022-12-04 Thread Elvis Angelaccio
https://bugs.kde.org/show_bug.cgi?id=378904

Elvis Angelaccio  changed:

   What|Removed |Added

 CC||frank...@goodhorse.idv.tw

--- Comment #15 from Elvis Angelaccio  ---
*** Bug 324978 has been marked as a duplicate of this bug. ***


[ark] [Bug 378904] Ark should use charset auto-detection for filenames

2022-11-13 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=378904

1900011...@pku.edu.cn changed:

   What|Removed |Added

 CC||1900011...@pku.edu.cn

--- Comment #14 from 1900011...@pku.edu.cn ---
In reply to leohea...@leohearts.com:

You can install `unarchiver`, which provides the `unar` command. I can confirm
this issue on Arch Linux with Ark 22.08.3.

This happens with GBK-encoded zips from some of my teachers. The extracted
filenames come out garbled, like
`▒╛┐╞-88-218412-▓╠░╪│σ-í╢▓╗┐╔─µ╡─╝╙║═ú║▓╩╔½íó║┌░╫╝░╞Σ╦√í¬í¬╙░╩╙╓╨╜¿╓■╡─╔½╙δ╣Γí╖.docx`.
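
The box-drawing characters in that name look like what you get when GBK bytes are read as CP437 (IBM 437). A small Python sketch of how that garbling arises and how it can be reversed, assuming the real encoding is GBK and the wrongly assumed one is CP437; `garble` and `repair` are illustrative helpers, not part of any tool mentioned here.

```python
def garble(name: str, real: str = "gbk", assumed: str = "cp437") -> str:
    """Encode a name in its real charset, then misread the bytes."""
    return name.encode(real).decode(assumed)


def repair(garbled: str, real: str = "gbk", assumed: str = "cp437") -> str:
    """Undo the misreading: recover the bytes, then decode them properly."""
    return garbled.encode(assumed).decode(real)


if __name__ == "__main__":
    original = "测试文件.docx"
    broken = garble(original)
    print(broken)          # mojibake made of CP437 symbols
    print(repair(broken))  # the original name again
```

The round trip works because CP437 assigns a character to every byte value, so no information is lost by the wrong decode. That stops being true once an extractor has replaced or dropped bytes.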

PS: I think Ark has too many plugins; it would be better to have one best
plugin per archive format.


[ark] [Bug 378904] Ark should use charset auto-detection for filenames

2022-01-20 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=378904

leohea...@leohearts.com changed:

   What|Removed |Added

 CC||leohea...@leohearts.com

--- Comment #13 from leohea...@leohearts.com ---
I'm also having trouble with this problem. I often get archives with GBK
encoding and end up having to use 7z.exe with wine to unzip them. Maybe adding
a command line option that provides encoding auto-detection would be acceptable?


[ark] [Bug 378904] Ark should use charset auto-detection for filenames

2020-06-24 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=378904

un...@mail.ru changed:

   What|Removed |Added

 CC||un...@mail.ru

--- Comment #12 from un...@mail.ru ---
I recently wrote patches to p7zip and unzip for OEM charset detection based on
the system locale. This is exactly what the Windows internal zip encoder does.

https://sourceforge.net/p/infozip/patches/29/
https://sourceforge.net/p/p7zip/bugs/187/

To get correct file names you just need to install the patched p7zip and set
your system locale correctly. Or do something like
alias 7z='LC_ALL=el_GR.UTF-8 7z'
if you prefer opening archives with a locale different from the system one.

Alkis Georgopoulos is planning to package the patched p7zip as .debs and upload
them to a ppa:
https://github.com/mate-desktop/engrampa/issues/5#issuecomment-648410042


[ark] [Bug 378904] Ark should use charset auto-detection for filenames

2019-10-27 Thread Nicolas Frattaroli
https://bugs.kde.org/show_bug.cgi?id=378904

--- Comment #11 from Nicolas Frattaroli  ---
>and there only to files created by legacy software (does anyone know which 
>programs produce these zip files actually?)

Windows creates ZIP files with filenames encoded in the system locale's
charset. So any ZIP file created on Windows with "send to->zip compressed
folder" by someone using a locale that doesn't map to utf8 is affected.


[ark] [Bug 378904] Ark should use charset auto-detection for filenames

2019-08-05 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=378904

--- Comment #10 from 20l8kxxl8...@opayq.com ---
The problem can usually be solved by using `unar` with the `-e` parameter (see
Bug 324978 Comment 13 for examples). Ark has a cliunarchiverplugin that uses
`unar`, but apparently it is only used for RAR archives.


[ark] [Bug 378904] Ark should use charset auto-detection for filenames

2019-08-05 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=378904

20l8kxxl8...@opayq.com changed:

   What|Removed |Added

   See Also||https://bugs.kde.org/show_bug.cgi?id=324978


[ark] [Bug 378904] Ark should use charset auto-detection for filenames

2019-07-28 Thread Zeno Endemann
https://bugs.kde.org/show_bug.cgi?id=378904

--- Comment #9 from Zeno Endemann  ---
After skimming the zip format spec
(https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT) a little, there
are actually a few flags and optional extra fields that have influence on the
character encoding that should be used. Unfortunately, investigating how to
deal with this 'properly' looks like a lot of work, and most likely would need
changes to libzip as well. I won't have time for that after all, sorry.


[ark] [Bug 378904] Ark should use charset auto-detection for filenames

2019-07-16 Thread Zeno Endemann
https://bugs.kde.org/show_bug.cgi?id=378904

--- Comment #8 from Zeno Endemann  ---
Oh, and in response to the sentence "If we could assume that all archive
entries have the same encoding, we could only probe the first entry": It would
not make any sense for a zip file to have multiple entries with different
encodings, since no one would be able to decompress such a file reliably. So we
don't need to worry about this case: once we have detected an encoding, it
should be used for the whole file. But only probing the first entry would be
less reliable, since character encoding probing gets more reliable the more
text it sees. There needs to be a balance between performance and reliability,
that's why I said 30 entries or so.
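
To make the "pool the first 30 entries" idea concrete, here is a minimal Python sketch. It is not KEncodingProber: as a stand-in it simply keeps every candidate encoding that can strictly decode the pooled bytes. The candidate list and the name `plausible_encodings` are assumptions made for the example.

```python
CANDIDATES = ("utf-8", "shift_jis", "gbk", "cp866")  # assumed candidate list


def plausible_encodings(raw_names, max_entries=30, candidates=CANDIDATES):
    """Return the candidates that can decode the first max_entries names.

    All sampled names are pooled into one buffer so that a single decision
    is made for the whole archive instead of one per entry.
    """
    pooled = b"\n".join(raw_names[:max_entries])
    survivors = []
    for encoding in candidates:
        try:
            pooled.decode(encoding)  # strict: any invalid byte rejects it
        except UnicodeDecodeError:
            continue
        survivors.append(encoding)
    return survivors
```

Several candidates usually survive (a single-byte charset such as CP866 accepts any byte sequence), so a real prober still has to rank them by character statistics. The sketch only shows the pooling and the entry limit, including the downside that a non-ASCII name beyond the limit is never looked at.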


[ark] [Bug 378904] Ark should use charset auto-detection for filenames

2019-07-16 Thread Zeno Endemann
https://bugs.kde.org/show_bug.cgi?id=378904

--- Comment #7 from Zeno Endemann  ---
Right, probing should probably be limited to maybe the first 30 entries or so.

But as it is pretty clear that this auto detection won't always work I'd really
like to have a manual override as well. On the other hand I can definitely
understand not wanting to make the UI more complex for this corner case that
only applies to the zip format, and there only to files created by legacy
software (does anyone know which programs produce these zip files actually?),
so the best compromise I can come up with is a command line flag (think
"--libzip-plugin-force-char-encoding=SHIFT-JIS").

One more thing, in the zip spec there is a file global flag that, if set,
requires the zip file to be utf8. If that flag is set and we encounter a
non-valid utf8 string that probably should be treated as an error.
Unfortunately I haven't seen any way to get the value of the flag via the
libzip API.
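
For illustration only, this is how the UTF-8 flag can be read with Python's standard `zipfile` module (not libzip). As far as I can tell from APPNOTE, it is bit 11 of the general purpose bit flag and is stored in each entry's header, so the sketch reports it per entry. `utf8_flags` and `make_demo_zip` are names made up for the example.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # bit 11 of the general purpose bit flag


def utf8_flags(zip_bytes: bytes) -> dict:
    """Map each entry name to whether its UTF-8 flag is set."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            info.filename: bool(info.flag_bits & UTF8_FLAG)
            for info in archive.infolist()
        }


def make_demo_zip() -> bytes:
    """Build a small in-memory archive with one ASCII and one non-ASCII name."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("plain.txt", b"ascii name")
        archive.writestr("日本語.txt", b"non-ascii name")
    return buffer.getvalue()
```

Python's writer sets the flag only for names that are not pure ASCII, which is why the two demo entries differ.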


[ark] [Bug 378904] Ark should use charset auto-detection for filenames

2019-07-16 Thread Ragnar Thomsen
https://bugs.kde.org/show_bug.cgi?id=378904

--- Comment #6 from Ragnar Thomsen  ---
Created attachment 121560
  --> https://bugs.kde.org/attachment.cgi?id=121560&action=edit
MARU164.zip opened using KEncodingProber


[ark] [Bug 378904] Ark should use charset auto-detection for filenames

2019-07-16 Thread Ragnar Thomsen
https://bugs.kde.org/show_bug.cgi?id=378904

--- Comment #5 from Ragnar Thomsen  ---
I tried using KEncodingProber with the libzip-plugin to open the attached
Japanese zip archive. It seems like it could correctly detect the encoding for
all the files (see attached screenshot), so this seems like a promising
approach.
I also tried using the uchardet library but it detected ASCII encoding for all
the files.

One concern is the overhead of probing for the encoding of each archive entry.
Opening the linux kernel source in zip format took 106 secs with probing vs 5
secs without, so there is significant overhead to this approach.
I think we either need to be smart and only probe when needed (can't see how
though) or we add a menu item in the GUI to reload the archive with probing of
filename encodings. If we could assume that all archive entries have the same
encoding, we could only probe the first entry, but I think this assumption
doesn't hold in real life, e.g. in the attached archive the first entry is
detected as UTF8 since it doesn't contain Japanese characters.


[ark] [Bug 378904] Ark should use charset auto-detection for filenames

2019-07-14 Thread Zeno Endemann
https://bugs.kde.org/show_bug.cgi?id=378904

Zeno Endemann  changed:

   What|Removed |Added

 CC||zeno.endem...@googlemail.com

--- Comment #4 from Zeno Endemann  ---
I've recently run into this problem as well. So I've looked at the Ark sources,
but I don't see a good way to add this feature, both from the coding (no other
archive plugin needs special open-time options, so there is understandably no
infrastructure) as well as the UI standpoint (there is no dialog when opening
an archive via the main navigation, and adding one would be weird). I would
very much understand if an intrusive code change would not be acceptable just
for this problem.

Thus I would propose the following solution: Use auto detection of the encoding
(via KEncodingProber) by default (which should hopefully work for most people,
it worked at least on my zip file with Japanese encoding), but also have an
override via command line switch or environment variable that would force a
given encoding for all zip files opened by the running Ark process. While not
ideal, that would be good enough for me, and would only require minimal changes
and no changes to the UI.
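
A minimal Python sketch of the precedence this proposal implies: command line switch first, then environment variable, then whatever auto detection returned. The flag name `--force-char-encoding` and the variable `ARK_ZIP_FILENAME_ENCODING` are hypothetical; Ark has no such options as far as this thread shows.

```python
import argparse
import codecs

ENV_VAR = "ARK_ZIP_FILENAME_ENCODING"  # hypothetical variable name


def resolve_encoding(argv, environ, detected):
    """Pick the filename encoding: CLI flag > environment > auto detection."""
    parser = argparse.ArgumentParser(add_help=False)
    # Hypothetical flag; unrelated arguments are passed through untouched.
    parser.add_argument("--force-char-encoding", default=None)
    args, _unknown = parser.parse_known_args(argv)

    forced = args.force_char_encoding or environ.get(ENV_VAR)
    chosen = forced if forced else detected
    # codecs.lookup raises LookupError for unknown encoding names,
    # so a typo in the override fails early instead of garbling names.
    return codecs.lookup(chosen).name
```

Because the override is resolved once per process, it applies to every zip file opened by that process, which matches the behaviour proposed above.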

There is one risk though, in that using encoding auto detection could
potentially introduce regressions for other users. Note though that some kind
of encoding auto detection is already in use (see ZIP_FL_ENC_GUESS flag here:
https://libzip.org/documentation/zip_name_locate.html), but apparently that is
not working well enough.

Anyway, if my suggested approach is acceptable, I could prepare a patch.


[ark] [Bug 378904] Ark should use charset auto-detection for filenames

2019-04-22 Thread Alexander Trufanov
https://bugs.kde.org/show_bug.cgi?id=378904

Alexander Trufanov  changed:

   What|Removed |Added

 CC||trufano...@gmail.com

--- Comment #3 from Alexander Trufanov  ---
As I found out, this is a very old problem rooted in the ZIP specification. A
ZIP can contain non-UTF filenames, UTF-8 filenames, or non-UTF filenames with
an additional field that contains the UTF-8 filename (since 2007). The same
applies to ZIP archive comments.
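
The "additional field" is, as far as I know, the Info-ZIP Unicode Path extra field (header ID 0x7075): a version byte, a CRC-32 of the legacy filename bytes, then the UTF-8 name. A small Python sketch of reading it from an entry's raw extra data; `unicode_path_from_extra` is a hypothetical helper, not a libzip or Ark function.

```python
import struct
import zlib

UNICODE_PATH_ID = 0x7075  # Info-ZIP Unicode Path extra field


def unicode_path_from_extra(extra: bytes, raw_name: bytes):
    """Return the UTF-8 name stored in the extra data, or None.

    The stored CRC-32 must match the legacy filename bytes; otherwise the
    field is stale (the entry was renamed by an unaware tool) and is ignored.
    """
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        data = extra[pos + 4:pos + 4 + size]
        pos += 4 + size
        if header_id != UNICODE_PATH_ID or len(data) < 5:
            continue
        version, name_crc = struct.unpack_from("<BI", data, 0)
        if version == 1 and name_crc == zlib.crc32(raw_name):
            return data[5:].decode("utf-8")
    return None
```

When this field is present and valid, no guessing is needed at all; detection only matters for archives that carry neither the field nor the UTF-8 flag.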

The problem is that by design the non-UTF charset is IBM 437, which does not
support non-Western languages.
In practice Windows encodes filenames with one of its DOS charsets (CP*); for
Russian, for example, it is CP866 (IBM 866). And there is no field in ZIP to
specify exactly which charset was used.
Even worse, many Windows archivers default to this DOS encoding rather than
UTF-8.

As I understand it, the ZIP authors don't want to fix this and suggest that
everyone switch to UTF-8 on non-English systems.

Developers proposed several different patches, libraries and tools to work
around the problem a decade ago.

Maintainers of some Linux distributions also patch the zip/unzip tools to work
around it. For example, here is a discussion about an unzip patch for Ubuntu:
https://bugs.launchpad.net/ubuntu/+source/unzip/+bug/580961/

It ended with a patch that was accepted into Ubuntu's main branch, but that
took years.
I think this is a mirror of that patch:
https://github.com/zip-i18n/unzip/blob/master/debian/patches/20-unzip60-alt-iconv-utf8

As far as I can see from the code, they request the locale from the system and
try to match it to a DOS charset using a hardcoded table. Additionally, they
provide command line arguments that let the user specify the filename encoding.
I would say their predefined encoding list is rather small and oriented towards
Russian speakers. Or perhaps that's the wrong patch.

Anyway, no GUI archiver has implemented anything like that yet.

I don't have much faith in automatic encoding detection, at least unless one
bets on all non-UTF encodings coming from Windows being CPxxx and not
Windows-12xx. Even for Russian alone there are 4-5 charsets, and some of them
are very hard to distinguish without a dictionary or text statistics. So it may
serve as a heuristic, but not as a 100% reliable method.

But I think Ark can do something like what Ubuntu's unzip does:

1. A small prebuilt table to match the current locale to the encoding expected
in Windows-created ZIPs (like here:
https://github.com/zip-i18n/unzip/blob/master/debian/patches/20-unzip60-alt-iconv-utf8#L36),
on the assumption that the Linux and Windows users speak the same language.

2. Ark can copy the nice menu from Kate (Tools/Encodings), which would let the
user switch in the GUI to any of the encodings available on their system. Ark
would use this choice to display filenames and archive comments in the GUI, as
well as for I/O operations while extracting files. This would allow the user to
find the proper charset and get the files extracted.
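
A sketch of what the table from point 1 could look like, in Python. The entries are a small hand-picked sample of well-known Windows OEM code pages, not the table from the unzip patch, and `oem_codepage_for_locale` is a hypothetical name. Anything not listed falls back to CP437, the charset the ZIP format assumes by design.

```python
OEM_BY_LOCALE = {
    # Illustrative sample only; a real table would need many more entries.
    "ru": "cp866",
    "el": "cp737",
    "ja": "cp932",
    "ko": "cp949",
    "zh_CN": "cp936",
    "zh_TW": "cp950",
}
DEFAULT_OEM = "cp437"


def oem_codepage_for_locale(locale_name: str) -> str:
    """Guess the DOS/OEM code page a Windows user of this locale would produce."""
    # "ru_RU.UTF-8@modifier" -> "ru_RU" -> "ru"
    base = locale_name.split(".")[0].split("@")[0]
    language = base.split("_")[0]
    return OEM_BY_LOCALE.get(base) or OEM_BY_LOCALE.get(language) or DEFAULT_OEM
```

The result would only be a default; the menu from point 2 is still needed for archives that come from someone using a different language.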


[ark] [Bug 378904] Ark should use charset auto-detection for filenames

2019-03-01 Thread Patrick Silva
https://bugs.kde.org/show_bug.cgi?id=378904

Patrick Silva  changed:

   What|Removed |Added

 CC||bugsefor...@gmx.com

--- Comment #2 from Patrick Silva  ---
ark 18.12.2 has the same problem on Arch Linux.

Operating System: Arch Linux 
KDE Plasma Version: 5.15.2
KDE Frameworks Version: 5.55.0
Qt Version: 5.12.1


[ark] [Bug 378904] Ark should use charset auto-detection for filenames

2018-09-28 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=378904

sowi...@dukun.de changed:

   What|Removed |Added

 CC||sowi...@dukun.de

--- Comment #1 from sowi...@dukun.de ---
I can confirm this behaviour; it's really annoying when working together with
non-UTF-8 systems.
I tried other software, but none of it seems to handle this correctly. Unzip,
file-roller, p7zip (sadly dead), peazip (crashed) all failed.
Looks like there is currently no software for Linux that can do it.
