Re: a splitting script

2026-01-30 Thread Pádraig Brady

On 27/01/2026 18:10, Pádraig Brady wrote:

On 26/01/2026 16:01, [email protected] wrote:

Can that highlighting not be made to work when there is only
one space between option and description?


I'll apply the attached to our highlighting matcher,
which makes --long-option matching a bit more strict,
thus avoiding the issue.


We also need similar tightening of the highlighting in
various translated dd --help output, which is done in the attached.

cheers,
Padraig
From cd841c54b8b41700cd635311db3dd87340c8fbda Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= 
Date: Fri, 30 Jan 2026 17:34:28 +
Subject: [PATCH] doc: improve highlighting of dd --help translations

* src/system.h (oputs_): Ensure we're not matching '-' in
translated descriptions.  Also support highlighting only
dd "foo=bar" when the description is separated with a single space.
---
 src/system.h | 30 +-
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/src/system.h b/src/system.h
index 6a84d1cc3..e52990d44 100644
--- a/src/system.h
+++ b/src/system.h
@@ -567,28 +567,48 @@ oputs_ (MAYBE_UNUSED char const *program, char const *option)
   return;
 }
 
+  bool double_space = true;
   char const *first_word = option + strspn (option, " \t\n");
   char const *option_text = strchr (option, '-');
-  if (!option_text)
-option_text = first_word;  /* for dd option syntax.  */
+  if (!option_text)/* for dd(1) option syntax.  */
+{
+  option_text = first_word;
+  /* Just match first word to support single spaced
+ translated dd "foo=bar description" format.  */
+  double_space = false;
+}
+  else if (option_text != first_word)  /* for test(1) option syntax.  */
+{
+  /* Ensure only a single space before '-', to avoid matching
+ within descriptions for translated dd option syntax.  */
+  char const *s = first_word;
+  while (s < option_text && !(isspace (*s) && isspace (*(s + 1))))
+s++;
+  if (s < option_text)
+{
+  /* Probably mismatched dd format.  */
+  option_text = first_word;
+  double_space = false;
+}
+}
+
   size_t anchor_len = strcspn (option_text, ",=[ \n");
 
   /* Set highlighted text up to spacing after the full option text.
  Any single space is included in highlighted text,
  double space or TAB or newline terminates the option text.  */
-  bool long_option = false;
   char const *desc_text = option_text + anchor_len;
   while (*desc_text && *desc_text != '\n')
 {
   if (*desc_text == '-' && *(desc_text + 1) == '-')
-long_option = true;
+double_space = false;
   if (isspace (*desc_text))
 {
   if (*desc_text == '\t' || isspace (*(desc_text + 1)))
 break;
   /* With long options we restrict the match as some translations
  delimit a long option and description with a single space.  */
-  if (long_option && *(desc_text + 1) != '-')
+  if (!double_space && *(desc_text + 1) != '-')
 break;
 }
 
-- 
2.52.0
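
For readers who find the matching rules easier to follow outside of C, here is a rough Python transliteration of the patched logic as it appears in the diff above. It is only a sketch, not the coreutils code: `match_option` and `at` are names made up for illustration, and the returned indexes correspond to the diff's `option_text` and `desc_text` pointers.

```python
def match_option(line):
    """Return (option_text, desc_text) as indexes into LINE."""
    def at(i):
        # Reading past the end behaves like C's NUL: never a space or '-'.
        return line[i] if i < len(line) else ""

    first_word = len(line) - len(line.lstrip(" \t\n"))
    dash = line.find("-")
    double_space = True

    if dash < 0:
        # dd(1) style "foo=bar description": only match the first word.
        option_text = first_word
        double_space = False
    else:
        option_text = dash
        if dash != first_word:
            # Look for a double space before the '-'; if there is one,
            # the '-' probably sits inside a (translated) description.
            s = first_word
            while s < dash and not (at(s).isspace() and at(s + 1).isspace()):
                s += 1
            if s < dash:
                option_text = first_word
                double_space = False

    # Equivalent of strcspn (option_text, ",=[ \n").
    desc_text = option_text
    while desc_text < len(line) and line[desc_text] not in ",=[ \n":
        desc_text += 1

    while desc_text < len(line) and line[desc_text] != "\n":
        if line[desc_text] == "-" and at(desc_text + 1) == "-":
            double_space = False  # seen a long option: be strict from here
        if line[desc_text].isspace():
            if line[desc_text] == "\t" or at(desc_text + 1).isspace():
                break
            if not double_space and at(desc_text + 1) != "-":
                break
        desc_text += 1

    return option_text, desc_text


for sample in ("  -a, --all  show all\n",
               "  -a, --all show all\n",
               "  if=FILE read from FILE\n"):
    start, end = match_option(sample)
    print(repr(sample[start:end]))
```

With these inputs the single-spaced and double-spaced `--all` lines give the same span, and the dd style line stops after `if=FILE`.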



Re: a splitting script, and requesting a new snapshot

2026-01-29 Thread Pádraig Brady

On 29/01/2026 13:29, [email protected] wrote:


Op 27-01-2026 om 16:09 schreef Pádraig Brady:

I've attached an updated split.py that wraps iff splitting, and also auto
excludes the commands that don't wrap.


Thanks.


Note sk.po has an invalid utf8 char which stops processing,
so I manually edited sk.po so that the non utf8 ç
in Fran.*Pinard was replaced, _before_ I ran the script.


Yes, I had noticed that too and edited the file manually.


Note also af.po and gl.po should be run with
LC_ALL=en_US.iso-8859-1 or equivalent.


Thanks for reporting.  I will use `msgcat --to=UTF-8` instead.


Can a new coreutils-ss.tar.xz snapshot file be generated?  Because I see
that the synopsis for `date` has been changed since coreutils-9.9.272.

(In general: please refrain from changing translatable strings during the
preparations for a release.  It's not nice for translators to see that their
fully translated PO file at the TP does not result in a fully translated PO
file in the release tarball.)


I've just updated https://www.pixelbeat.org/cu/coreutils-ss.tar.xz

thank you,
Padraig



Re: a splitting script, and requesting a new snapshot

2026-01-29 Thread coordinator



Op 27-01-2026 om 16:09 schreef Pádraig Brady:
I've attached an updated split.py that wraps iff splitting, and also auto
excludes the commands that don't wrap.


Thanks.


Note sk.po has an invalid utf8 char which stops processing,
so I manually edited sk.po so that the non utf8 ç
in Fran.*Pinard was replaced, _before_ I ran the script.


Yes, I had noticed that too and edited the file manually.


Note also af.po and gl.po should be run with
LC_ALL=en_US.iso-8859-1 or equivalent.


Thanks for reporting.  I will use `msgcat --to=UTF-8` instead.


Can a new coreutils-ss.tar.xz snapshot file be generated?  Because I see
that the synopsis for `date` has been changed since coreutils-9.9.272.

(In general: please refrain from changing translatable strings during the
preparations for a release.  It's not nice for translators to see that their
fully translated PO file at the TP does not result in a fully translated PO
file in the release tarball.)


--
Regards,

Benno




Re: a splitting script

2026-01-27 Thread Pádraig Brady

On 26/01/2026 16:01, [email protected] wrote:


Op 25-01-2026 om 22:08 schreef Pádraig Brady:

Another existing issue I noticed in a few places in hu.po
(and some other translations), is inadequate separation
between the --option and description.


Well, I sometimes choose to use just one space when otherwise
there is not enough room for the description to fit (in 80 columns)
or to align nicely with the surrounding ones.  So I think it should
be the prerogative of the translator to lay out their texts as they
see fit.




(With the new format -- option and description on separate lines --
there will seldom be a need for the translator to choose an adjusted
layout, and the problem with too few spaces will probably go away.)


OK I agree.


This will impact man page layout if one was generating
those for various languages.


Are there actually any man pages generated from --help texts?


Well all the default ones of course.
But they're not impacted.
I presume there are lang specific man pages available,
though I've not looked TBH.
Anyway given help2man will follow our generated highlighting
the man pages will also be fine following this patch.


Also it would impact
the new option highlighting in --help.


Can that highlighting not be made to work when there is only
one space between option and description?


I'll apply the attached to our highlighting matcher,
which makes --long-option matching a bit more strict,
thus avoiding the issue.

thanks,
Padraig
From f2af245ebaf88c15f236217b6d43cefcff0eda43 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= 
Date: Tue, 27 Jan 2026 17:45:02 +
Subject: [PATCH] doc: improve highlighting of single spaced translations

* src/system.h (oputs_): Translations sometimes use a single space
between an option and its description.  They only do this though
for long options since they result in less available screen space.
Therefore be more strict with option matching once we've encountered
a long option, which supports the more varied formats often
associated with short options.
---
 src/system.h | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/src/system.h b/src/system.h
index 3695954db..7e1fdbff0 100644
--- a/src/system.h
+++ b/src/system.h
@@ -567,7 +567,6 @@ oputs_ (MAYBE_UNUSED char const* program, char const *option)
   return;
 }
 
-
   char const* first_word = option + strspn (option, " \t\n");
   char const *option_text = strchr (option, '-');
   if (!option_text)
@@ -577,11 +576,24 @@ oputs_ (MAYBE_UNUSED char const* program, char const *option)
   /* Set highlighted text up to spacing after the full option text.
  Any single space is included in highlighted text,
  double space or TAB or newline terminates the option text.  */
+  bool long_option = false;
   char const *desc_text = option_text + anchor_len;
-  while (*desc_text && *desc_text != '\n'
- && (! isspace (*desc_text)
- || (*desc_text != '\t' && ! isspace (*(desc_text + 1)))))
-desc_text++;
+  while (*desc_text && *desc_text != '\n')
+{
+  if (*desc_text == '-' && *(desc_text + 1) == '-')
+long_option = true;
+  if (isspace (*desc_text))
+{
+  if (*desc_text == '\t' || isspace (*(desc_text + 1)))
+break;
+  /* With long options we restrict the match as some translations
+ delimit a long option and description with a single space.  */
+  if (long_option && *(desc_text + 1) != '-')
+break;
+}
+
+  desc_text++;
+}
 
   /* write spaces before option text. */
   fwrite (option, 1, first_word - option, stdout);
-- 
2.52.0
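
To see why the original loop over-matched single-spaced translations, the removed `while` condition can be mimicked in a few lines of Python. This is an illustrative sketch only; `old_desc_start` is a made-up name and the sample strings are invented.

```python
def old_desc_start(line, start):
    """Mimic the removed loop: only TAB, newline or a double space stops it."""
    i = start
    while (i < len(line) and line[i] != "\n"
           and (not line[i].isspace()
                or (line[i] != "\t" and not line[i + 1:i + 2].isspace()))):
        i += 1
    return i


double = "  --all  show all\n"
single = "  --all show all\n"
print(repr(double[2:old_desc_start(double, 2)]))
print(repr(single[2:old_desc_start(single, 2)]))
```

With a double space the match ends after `--all`; with a single space it runs on to the end of the line, which is the case the patch tightens.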



Re: a splitting script

2026-01-27 Thread Pádraig Brady

On 26/01/2026 15:47, [email protected] wrote:


Op 25-01-2026 om 19:30 schreef Pádraig Brady:

Oh...  Can you give two examples of commands for which option
descriptions aren't on the next line?


cat, ptx, truncate at least
as the descriptions on those are succinct enough.


For `truncate` it would be just two options that would get wrapped
when they shouldn't be -- that would still be acceptable.  But for
`cat` it would be ten and for `ptx` sixteen.  :/

Well, the script could check for "src/cat.c" and "src/ptx.c" in
the preceding line and skip the wrapping when the relevant bools
are set.  So... please implement the wrapping and I'll implement
the exceptions.

(That is: the wrapping should only happen when options are split,
not for any options that are already single.  This will not prevent
all valid translations from becoming fuzzy when msgmerged, but a
good amount.)



I've attached an updated split.py that wraps iff splitting, and also auto
excludes the commands that don't wrap.

It does result in a lot less fuzzy:
  $ diff pl-new-orig.po pl-new.po | grep -- '-#, fuzzy' | wc -l
  233
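
A similar count can be taken per file without going through diff. The helper below is a small sketch; `count_fuzzy` is a made-up name and it assumes flags live on `#,` comment lines, as is usual in PO files.

```python
def count_fuzzy(po_text):
    """Count PO flag lines ("#, ...") that carry the fuzzy flag."""
    count = 0
    for line in po_text.splitlines():
        if line.startswith("#,"):
            flags = [flag.strip() for flag in line[2:].split(",")]
            if "fuzzy" in flags:
                count += 1
    return count


sample = (
    '#, fuzzy, c-format\n'
    'msgid "a"\n'
    'msgstr "b"\n'
    '\n'
    '#, c-format\n'
    'msgid "c"\n'
    'msgstr "d"\n'
)
print(count_fuzzy(sample))
```

Comparing the counts for the two merged files gives the reduction in fuzzy entries directly.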

Note sk.po has an invalid utf8 char which stops processing,
so I manually edited sk.po so that the non utf8 ç
in Fran.*Pinard was replaced, _before_ I ran the script.

Note also af.po and gl.po should be run with
LC_ALL=en_US.iso-8859-1 or equivalent.
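
One way a script could cope with both problems without manual edits or locale changes is to decode each file according to the charset declared in its PO header, falling back to Latin-1 when that fails. The following is a sketch under that assumption; `decode_po` is a hypothetical helper and is not part of the attached split.py.

```python
import re

# The PO header normally carries a line such as
#   "Content-Type: text/plain; charset=ISO-8859-1\n"
CHARSET_RE = re.compile(rb'charset=([A-Za-z0-9_.:-]+)')


def decode_po(raw, default="utf-8"):
    """Decode the bytes of a PO file using its declared charset."""
    match = CHARSET_RE.search(raw)
    charset = match.group(1).decode("ascii") if match else default
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or stray bytes: Latin-1 never fails to decode.
        return raw.decode("latin-1")


latin1 = b'"Content-Type: text/plain; charset=ISO-8859-1\\n"\n# Fran\xe7ois\n'
broken = b'"Content-Type: text/plain; charset=UTF-8\\n"\n# Fran\xe7ois\n'
print(decode_po(latin1).splitlines()[-1])
print(decode_po(broken).splitlines()[-1])
```

The Latin-1 fallback is a guess about what a stray byte means; it happens to be right for a lone 0xE7 (ç), but it should not be relied on in general.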

thanks,
Padraig
#!/usr/bin/env python3

import sys
import re

# Files to exclude from msgid wrapping
EXCLUDED_FILES_PATTERN = re.compile(r'src/(cat|nl|ptx|realpath|runcon|shuf|stdbuf|stty|sync|tac|truncate|uname|who)\.c')

def wrap_msgid_line(line):
    """Wrap a single msgid line by splitting option from description.

    Returns a list of lines after wrapping.
    """
    # Remove trailing newline for processing
    content = line.rstrip('\n')

    # Content should be like: "  -a, --multiple   description\n"
    if not content.startswith('"') or not content.endswith('"'):
        return [line]

    # Get the inner content (without outer quotes)
    inner = content[1:-1]

    # Check if it ends with \n
    has_trailing_newline = inner.endswith('\\n')
    if has_trailing_newline:
        inner_no_newline = inner[:-2]
    else:
        inner_no_newline = inner

    # Pattern to match option followed by 2+ spaces and description
    # Options: leading spaces, optional short opt (-X, ), long opt (--something)
    match = re.match(r'^(\s+(?:-\S,\s+)?--?[^\s]+)\s{2,}(.+)$', inner_no_newline)
    if not match:
        return [line]

    option = match.group(1)
    description = match.group(2)

    # Build wrapped lines
    option_line = '"' + option + '\\n"\n'
    desc_line = '" ' + description + ('\\n"\n' if has_trailing_newline else '"\n')

    return [option_line, desc_line]

def split_po_entries(lines):
    i = 0
    fuzzy = False
    current_files = []
    prev_was_location = False

    while i < len(lines):
        line = lines[i]

        # Track current files from location comments (can span multiple consecutive #: lines)
        if line.startswith('#:'):
            if not prev_was_location:
                # Start of a new entry's location comments - reset
                current_files = []
            current_files.append(line)
            prev_was_location = True
        else:
            prev_was_location = False

        if "#, fuzzy" in line:
            fuzzy = True

        if line.strip() == 'msgid ""':
            start_i = i
            msgid_lines = []
            i += 1
            while i < len(lines) and lines[i].startswith('"'):
                msgid_lines.append(lines[i])
                i += 1

            if i < len(lines) and lines[i].strip() == 'msgstr ""':
                msgstr_lines = []
                i += 1
                while i < len(lines) and lines[i].startswith('"'):
                    msgstr_lines.append(lines[i])
                    i += 1

                def is_option(line):
                    if line.startswith('"  --'):
                        return True
                    if line.startswith('"  -'):
                        text = line[4:]
                        if text.startswith('M '):
                            return False
                        if len(text) > 0 and text[0] != ' ':
                            return True
                    if re.match(r'^"  \S+ -\S\S \S+  ', line):
                        return True
                    if re.match(r'^"  [a-z]+=\S+  ', line):
                        return True
                    return False

                def is_option_relaxed(line):
                    if re.match(r'^" {1,6}--', line):
                        return True
                    return is_option(line)

                has_options = any(is_option(line) for line in msgid_lines)

                # Check if wrapping should be excluded for any of the tagged files
                should_wrap = True
                if any(EXCLUDED_FILES_PATTERN.search(f) for f in current_fil

Re: a splitting script

2026-01-26 Thread Pádraig Brady

On 26/01/2026 15:47, [email protected] wrote:


Op 25-01-2026 om 19:30 schreef Pádraig Brady:

Oh...  Can you give two examples of commands for which option
descriptions aren't on the next line?


cat, ptx, truncate at least
as the descriptions on those are succinct enough.


For `truncate` it would be just two options that would get wrapped
when they shouldn't be -- that would still be acceptable.  But for
`cat` it would be ten and for `ptx` sixteen.  :/

Well, the script could check for "src/cat.c" and "src/ptx.c" in
the preceding line and skip the wrapping when the relevant bools
are set.  So... please implement the wrapping and I'll implement
the exceptions.

(That is: the wrapping should only happen when options are split,
not for any options that are already single.  This will not prevent
all valid translations from becoming fuzzy when msgmerged, but a
good amount.)


Will do.

thanks,
Padraig




Re: a splitting script

2026-01-26 Thread Pádraig Brady

On 26/01/2026 15:32, Benno Schulenberg wrote:


Op 25-01-2026 om 17:19 schreef Egmont Koblinger:

Another random find:

9.9 tarball's hu.po line 3252.  The translation is outdated, it misses
the "+" flag.  Accordingly, it's marked as fuzzy.


This made me think: when a msgid-msgstr pair is marked as fuzzy,
then the split.py script should _not_ split any options that are
in the msgid and msgstr, because whatever is in the msgstr does
not correspond to what is in the msgid -- or at least not fully.

One can observe this problem when running ./split.py on sk.po
and then searching for "--no-dereference".  Oops.

(In hu.po at the TP there are no fuzzies, so I didn't notice the
problem there.)

So split.py has to check for the "#, fuzzy" marker, and skip the
splitting of the subsequent msgid-msgstr pair.  I've implemented
that in the attached updated script.

(But maybe it is better to split them anyway and mark every
resulting pair as fuzzy?)


Since it's relatively rare, it's probably best to skip as you've done now.
I'll rebase on your updated script.

thanks!
Padraig



Re: a splitting script

2026-01-26 Thread coordinator



Op 25-01-2026 om 22:08 schreef Pádraig Brady:

Another existing issue I noticed in a few places in hu.po
(and some other translations), is inadequate separation
between the --option and description.


Well, I sometimes choose to use just one space when otherwise
there is not enough room for the description to fit (in 80 columns)
or to align nicely with the surrounding ones.  So I think it should
be the prerogative of the translator to lay out their texts as they
see fit.

(With the new format -- option and description on separate lines --
there will seldom be a need for the translator to choose an adjusted
layout, and the problem with too few spaces will probably go away.)


This will impact man page layout if one was generating
those for various languages.


Are there actually any man pages generated from --help texts?


Also it would impact
the new option highlighting in --help.


Can that highlighting not be made to work when there is only
one space between option and description?


The following sed tweaks all the po files appropriately.
Benno, is this something you might run when updating the po set?


I do not want to change any formatting that the translators
probably consciously chose.


Or maybe this is more appropriate to enforce in coreutils
when importing new translations?


Same thing: I don't think coreutils should change anything
in the PO files provided by translators.


--
Regards,

Benno




Re: a splitting script

2026-01-26 Thread coordinator



Op 25-01-2026 om 19:30 schreef Pádraig Brady:

Oh...  Can you give two examples of commands for which option
descriptions aren't on the next line?


cat, ptx, truncate at least
as the descriptions on those are succinct enough.


For `truncate` it would be just two options that would get wrapped
when they shouldn't be -- that would still be acceptable.  But for
`cat` it would be ten and for `ptx` sixteen.  :/

Well, the script could check for "src/cat.c" and "src/ptx.c" in
the preceding line and skip the wrapping when the relevant bools
are set.  So... please implement the wrapping and I'll implement
the exceptions.

(That is: the wrapping should only happen when options are split,
not for any options that are already single.  This will not prevent
all valid translations from becoming fuzzy when msgmerged, but a
good amount.)


--
Regards,

Benno




Re: a splitting script

2026-01-26 Thread Benno Schulenberg


Op 25-01-2026 om 17:19 schreef Egmont Koblinger:

Another random find:

9.9 tarball's hu.po line 3252.  The translation is outdated, it misses
the "+" flag.  Accordingly, it's marked as fuzzy.


This made me think: when a msgid-msgstr pair is marked as fuzzy,
then the split.py script should _not_ split any options that are
in the msgid and msgstr, because whatever is in the msgstr does
not correspond to what is in the msgid -- or at least not fully.

One can observe this problem when running ./split.py on sk.po
and then searching for "--no-dereference".  Oops.

(In hu.po at the TP there are no fuzzies, so I didn't notice the
problem there.)

So split.py has to check for the "#, fuzzy" marker, and skip the
splitting of the subsequent msgid-msgstr pair.  I've implemented
that in the attached updated script.

(But maybe it is better to split them anyway and mark every
resulting pair as fuzzy?)
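
The per-entry check described above could look roughly like the sketch below. It is not the attached script: `entries`, `is_fuzzy` and `splittable` are names invented here, and the sketch assumes entries are separated by blank lines, which holds for PO files written by the gettext tools.

```python
def entries(po_text):
    """Yield the entries of a PO file, assuming blank-line separation."""
    for chunk in po_text.split("\n\n"):
        if chunk.strip():
            yield chunk.strip("\n")


def is_fuzzy(entry):
    """True when the entry has a "#," flag line that includes fuzzy."""
    for line in entry.splitlines():
        if line.startswith("#,"):
            flags = [flag.strip() for flag in line[2:].split(",")]
            if "fuzzy" in flags:
                return True
    return False


def splittable(po_text):
    """Entries that are safe to split: everything not marked fuzzy."""
    return [entry for entry in entries(po_text) if not is_fuzzy(entry)]


sample = (
    '#, fuzzy\n'
    'msgid "  --old  text\\n"\n'
    'msgstr "  --old  tekst\\n"\n'
    '\n'
    'msgid "  --new  text\\n"\n'
    'msgstr "  --new  tekst\\n"\n'
)
print(len(splittable(sample)))
```

Deciding fuzziness per entry, rather than with a flag carried across the whole file, keeps one fuzzy entry from suppressing the splitting of later ones.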


--
Regards,

Benno
#!/usr/bin/env python3

import sys
import re

def split_po_entries(lines):
    i = 0
    fuzzy = False

    while i < len(lines):
        line = lines[i]

        if "#, fuzzy" in line:
            fuzzy = True

        if line.strip() == 'msgid ""':
            start_i = i
            msgid_lines = []
            i += 1
            while i < len(lines) and lines[i].startswith('"'):
                msgid_lines.append(lines[i])
                i += 1

            if i < len(lines) and lines[i].strip() == 'msgstr ""':
                msgstr_lines = []
                i += 1
                while i < len(lines) and lines[i].startswith('"'):
                    msgstr_lines.append(lines[i])
                    i += 1

                def is_option(line):
                    if line.startswith('"  --'):
                        return True
                    if line.startswith('"  -'):
                        text = line[4:]
                        if text.startswith('M '):
                            return False
                        if len(text) > 0 and text[0] != ' ':
                            return True
                    if re.match(r'^"  \S+ -\S\S \S+  ', line):
                        return True
                    if re.match(r'^"  [a-z]+=\S+  ', line):
                        return True
                    return False

                def is_option_relaxed(line):
                    if re.match(r'^" {1,6}--', line):
                        return True
                    if line.startswith('"  -'):
                        text = line[4:]
                        if text.startswith('M '):
                            return False
                        if len(text) > 0 and text[0] != ' ':
                            return True
                    if re.match(r'^"  \S+ -\S\S \S+  ', line):
                        return True
                    if re.match(r'^"  [a-z]+=\S+  ', line):
                        return True
                    return False

                has_options = any(is_option(line) for line in msgid_lines)

                if has_options and not fuzzy:
                    first_non_empty = None
                    for j, line in enumerate(msgid_lines):
                        if line.strip() not in ('""', '"\\n"'):
                            first_non_empty = j
                            break

                    if first_non_empty is not None and is_option(msgid_lines[first_non_empty]):
                        msgid_lines = msgid_lines[first_non_empty:]
                        msgstr_lines = msgstr_lines[first_non_empty:] if first_non_empty < len(msgstr_lines) else msgstr_lines

                    msgid_groups = []
                    msgstr_groups = []

                    msgid_indices = [0]
                    for j in range(1, len(msgid_lines)):
                        if is_option(msgid_lines[j]):
                            msgid_indices.append(j)
                    msgid_indices.append(len(msgid_lines))

                    msgstr_indices = [0]
                    for j in range(1, len(msgstr_lines)):
                        if is_option_relaxed(msgstr_lines[j]):
                            msgstr_indices.append(j)
                    msgstr_indices.append(len(msgstr_lines))

                    for k in range(len(msgid_indices) - 1):
                        msgid_groups.append(msgid_lines[msgid_indices[k]:msgid_indices[k+1]])

                    for k in range(len(msgstr_indices) - 1):
                        msgstr_groups.append(msgstr_lines[msgstr_indices[k]:msgstr_indices[k+1]])

                    for msgid_group, msgstr_group in zip(msgid_groups, msgstr_groups):
                        print('msgid ""')
                        for line in msgid_group:
                            print(line, end='')
                        print('msgstr ""')
                        for line in msgstr_group:
                            print(line, end='')
                        print()

                    continue

            for j in range(start_i, i):
   

Re: a splitting script

2026-01-25 Thread Pádraig Brady

On 25/01/2026 15:29, [email protected] wrote:


Op 25-01-2026 om 15:41 schreef Egmont Koblinger:

"simply"?  It sounds anything but simple to me.

[...]

your only anchors are the "-a", "-b" and "--color" words in this example,
which you presumably locate using some heuristics (needs to begin with a
dash? or needs to begin after no more than a certain amount of leading
spaces? does it stop at the first '=' or '[' sign?), and then there are
exceptions to these rules (like the output of `[ --help` contains many
options not beginning with a dash)...


You apparently haven't looked at the script.  Claude did a fine job
creating the heuristics for catching all options -- it even included
the `dd` options that don't start with a dash _and_ excluded the two
"-M " cases that aren't an option.


I don't have too much trust that it can properly split _all_ of them
without producing a single faulty translation.


Okay.  Here is the result of the first version of the split.py script
run on the latest coreutils.hu.po (from 2018):

 https://translationproject.org/latest/coreutils/HU.po

Compare with the lowercase version to see the changes.  Let us know
if you spot any splitting mistake.


Another existing issue I noticed in a few places in hu.po
(and some other translations), is inadequate separation
between the --option and description.

This will impact man page layout if one was generating
those for various languages.  Also it would impact
the new option highlighting in --help.

The following sed tweaks all the po files appropriately.
Benno, is this something you might run when updating the po set?

  # Add an extra space if there is only one
  sed -i -E 's/^(msgid )?"(  -., | {4,6}--)([^ \]+) ([^ -\])/\1"\2\3  \4/' *.po
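
For anyone who would rather try the substitution on a single line than on the whole po set, here is a rough Python equivalent of the sed expression above. It is a sketch: `widen` is a made-up name, and the pattern is copied as literally as possible, including `[^ -\]`, which (as far as I understand bracket expressions) is a range from space to backslash, so descriptions starting with an uppercase letter or a digit are left alone.

```python
import re

# Literal port of the sed pattern; backslashes are doubled inside the
# character classes because Python, unlike sed, treats them as escapes there.
ONE_SPACE = re.compile(r'^(msgid )?"(  -., | {4,6}--)([^ \\]+) ([^ -\\])')


def widen(line):
    """Add a second space between option and description if there is one."""
    def repl(m):
        prefix = m.group(1) or ""
        return prefix + '"' + m.group(2) + m.group(3) + "  " + m.group(4)
    return ONE_SPACE.sub(repl, line, count=1)


print(widen('"  -a, --all show all\\n"'))
print(widen('msgid "      --help display this help\\n"'))
print(widen('"  -a, --all  show all\\n"'))
```

The third line already has two spaces and passes through unchanged.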

Or maybe this is more appropriate to enforce in coreutils
when importing new translations?

thanks,
Padraig



Re: a splitting script

2026-01-25 Thread Pádraig Brady

On 25/01/2026 17:57, [email protected] wrote:


Op 25-01-2026 om 17:57 schreef Pádraig Brady:

Attaching latest split.py script,
with your dd fix, and my 5 leading dashes fix.


Thanks.  I've updated the HU.po file by using that script.


I've also attached a defuzz.py script which
removes fuzzy tags from all msgids matching an option,
so could be used after msgmerge stage.
Not ideal, but also wrapping all option msgids is not ideal,
as we don't wrap them for all commands.


Oh...  Can you give two examples of commands for which option
descriptions aren't on the next line?



cat, ptx, truncate at least
as the descriptions on those are succinct enough.


(I'll be busy the next three days.  So my responses will be
slow in arriving.)

thanks,
Padraig



Re: a splitting script

2026-01-25 Thread Egmont Koblinger
> Right?  That translation is not outdated -- not in 2018.
>
> When msgmerging it against the current POT file, it will
> get marked as fuzzy -- because it is outdated now.

Indeed, I overlooked this, you're right.  In HU.po msgid is the old version.



e.



Re: a splitting script

2026-01-25 Thread Benno Schulenberg



Op 25-01-2026 om 17:19 schreef Egmont Koblinger:

Another random find:

9.9 tarball's hu.po line 3252.  The translation is outdated, it misses
the "+" flag.  Accordingly, it's marked as fuzzy.

HU.po line 3020.  Same outdated translation.  The fuzzy flag is removed.


You mean the msgid that starts like this:

  "The following optional flags may follow '%':\n"
   "\n"
   "  -  (hyphen) do not pad the field\n"
   [...]

Right?  That translation is not outdated -- not in 2018.

When msgmerging it against the current POT file, it will
get marked as fuzzy -- because it is outdated now.


--
Regards,

Benno




Re: a splitting script

2026-01-25 Thread coordinator



Op 25-01-2026 om 17:57 schreef Pádraig Brady:

Attaching latest split.py script,
with your dd fix, and my 5 leading dashes fix.


Thanks.  I've updated the HU.po file by using that script.


I've also attached a defuzz.py script which
removes fuzzy tags from all msgids matching an option,
so could be used after msgmerge stage.
Not ideal, but also wrapping all option msgids is not ideal,
as we don't wrap them for all commands.


Oh...  Can you give two examples of commands for which option
descriptions aren't on the next line?


(I'll be busy the next three days.  So my responses will be
slow in arriving.)


--
Regards,

Benno




Re: a splitting script

2026-01-25 Thread Pádraig Brady

On 25/01/2026 16:27, Egmont Koblinger wrote:

Next random find:

#: src/nproc.c:63
msgid "  --all  print the number of installed processors\n"
msgstr ""
" --all   a beépített processzorok számának kiírása\n"
" --ignore=N  ha lehetséges, N feldolgozóegység figyelmen kívül hagyása\n"


Heh, 5 leading spaces.
I've relaxed my local script to allow that.
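
The difference between the strict and the relaxed test can be shown in a few lines. This is an illustration only; the two function names are invented, the `{1,6}` bound is the one that appears in the split.py posted elsewhere in this thread, and the sample line is a stand-in with five leading spaces.

```python
import re


def is_long_option_strict(line):
    # Exactly two spaces before the "--".
    return line.startswith('"  --')


def is_long_option_relaxed(line):
    # Anything from one to six spaces before the "--".
    return re.match(r'^" {1,6}--', line) is not None


msgstr_line = '"     --all   description\\n"'
print(is_long_option_strict(msgstr_line), is_long_option_relaxed(msgstr_line))
```

Applying the relaxed test only to msgstr lines keeps the msgid matching strict while tolerating translator layout choices.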

cheers,
Padraig



Re: a splitting script

2026-01-25 Thread Egmont Koblinger
Next random find:

#: src/nproc.c:63
msgid "  --all  print the number of installed processors\n"
msgstr ""
" --all   a beépített processzorok számának kiírása\n"
" --ignore=N  ha lehetséges, N feldolgozóegység figyelmen kívül hagyása\n"


e.

On Sun, Jan 25, 2026 at 5:19 PM Egmont Koblinger  wrote:
>
> Independently from you guys, I've also found the incorrect splits at
> dd's options that don't begin with a hyphen.
>
> I hope you understand now what I mean by these auto-split translations
> needing a thorough review.
>
> (And even if they're fine in one language doesn't guarantee that
> they're fine in all; I think they should be thoroughly reviewed in all
> languages, for which marking them as fuzzy is the right thing to do.)
>
>
> Another random find:
>
> 9.9 tarball's hu.po line 3252.  The translation is outdated, it misses
> the "+" flag.  Accordingly, it's marked as fuzzy.
>
> HU.po line 3020.  Same outdated translation.  The fuzzy flag is removed.
>
> Was this flag incorrectly removed by this script, or was it a
> translator mistake?  PO-Revision-Date of 2018 kinda excludes the
> latter.
>
>
> Would it make sense to have the AI-script run first, producing tons of
> split translations into a temporary file; followed by a second,
> manually written step (perhaps using python's polib as I did) that
> makes sure to do nothing more to the .po files than add a bunch of
> fuzzy entries?  To absolutely surely guarantee that no old translation
> is un-fuzzied or broken in some other way, and also the new
> translations are all marked fuzzy?
>
>
> e.
>
> On Sun, Jan 25, 2026 at 5:06 PM Benno Schulenberg
>  wrote:
> >
> >
> > > Op 25-01-2026 om 15:41 schreef Egmont Koblinger:
> > >> I don't have too much trust that it can properly split _all_ of them
> > >> without producing a single faulty translation.
> >
> > Well, you're right.  It starts to go wrong here:
> >
> > msgid "  if=FILE read from FILE instead of stdin\n"
> > msgstr ""
> > "  if=FÁJL olvasás a FÁJLBÓL a szabványos bemenet helyett\n"
> > "  iflag=JELÖLŐK   olvasás a vesszővel elválasztott szimbólumlistának megfelelően\n"
> > "  obs=BÁJTegyszerre BÁJT bájt kiírása\n"
> > "  of=FÁJL a FÁJLBA ír a szabványos kimenet helyett\n"
> > "  oflag=JELÖLŐK   a vesszővel elválasztott szimbólumlistának megfelelően ír\n"
> >
> > It worked fine on the Dutch file.  So... I'm guessing one of the regexes doesn't
> > work for the accented letters.  Going to try with just an \S instead...  Yes,
> > that fixes it and does not introduce any new changes. So I've recreated the
> > HU.po file with the modified script.
> >
> >https://translationproject.org/latest/coreutils/HU.po
> >
> >
> > --
> > Regards,
> >
> > Benno



Re: a splitting script

2026-01-25 Thread coordinator



Op 25-01-2026 om 16:45 schreef Pádraig Brady:

The dd word=word cases aren't matched up correctly.
I've just removed them from my split.py for now.


It works fine when using this fragment of code instead:

if re.match(r'^"  [a-z]+=\S+  ', line):
return True

before the `return False`.
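
A quick way to see why `\S+` is the safer choice for translated dd operands: the sketch below compares it with an ASCII-only class. The ASCII-only pattern is purely hypothetical (the pattern that was replaced is not shown in this thread), and the sample lines are written with two spaces after the operand.

```python
import re

ASCII_ONLY = re.compile(r'^"  [a-z]+=[A-Z]+  ')   # hypothetical earlier form
NON_SPACE = re.compile(r'^"  [a-z]+=\S+  ')       # the fragment given above

english = '"  if=FILE  read from FILE instead of stdin\\n"'
hungarian = '"  if=FÁJL  olvasás a FÁJLBÓL a szabványos bemenet helyett\\n"'

for line in (english, hungarian):
    print(bool(ASCII_ONLY.match(line)), bool(NON_SPACE.match(line)))
```

The ASCII-only class stops at the accented letter in the Hungarian line, so the line is not recognized as the start of a new option and the msgstr groups no longer line up with the msgid groups.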


--
Regards,

Benno




Re: a splitting script

2026-01-25 Thread Egmont Koblinger
Independently from you guys, I've also found the incorrect splits at
dd's options that don't begin with a hyphen.

I hope you understand now what I mean by these auto-split translations
needing a thorough review.

(And even if they're fine in one language doesn't guarantee that
they're fine in all; I think they should be thoroughly reviewed in all
languages, for which marking them as fuzzy is the right thing to do.)


Another random find:

9.9 tarball's hu.po line 3252.  The translation is outdated, it misses
the "+" flag.  Accordingly, it's marked as fuzzy.

HU.po line 3020.  Same outdated translation.  The fuzzy flag is removed.

Was this flag incorrectly removed by this script, or was it a
translator mistake?  PO-Revision-Date of 2018 kinda excludes the
latter.


Would it make sense to have the AI-script run first, producing tons of
split translations into a temporary file; followed by a second,
manually written step (perhaps using python's polib as I did) that
makes sure to do nothing more to the .po files than add a bunch of
fuzzy entries?  To absolutely surely guarantee that no old translation
is un-fuzzied or broken in some other way, and also the new
translations are all marked fuzzy?
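
A second step of that kind might be as small as the following sketch. It does not use polib; it just treats blank-line separated chunks as entries, and all names in it are made up for illustration. It reports old entries that went missing or changed, and new entries that are not marked fuzzy.

```python
def entries(po_text):
    """Split PO text into entries, assuming blank-line separation."""
    return [c.strip("\n") for c in po_text.split("\n\n") if c.strip()]


def is_fuzzy(entry):
    for line in entry.splitlines():
        if line.startswith("#,"):
            if "fuzzy" in [flag.strip() for flag in line[2:].split(",")]:
                return True
    return False


def check_only_fuzzy_added(old_text, new_text):
    """Return a list of (problem, entry) pairs; empty means all is well."""
    old = entries(old_text)
    new = entries(new_text)
    problems = []
    for entry in old:
        if entry not in new:
            problems.append(("missing or changed", entry))
    for entry in new:
        if entry not in old and not is_fuzzy(entry):
            problems.append(("new but not fuzzy", entry))
    return problems


old_po = 'msgid "a"\nmsgstr "b"\n'
good_po = old_po + '\n#, fuzzy\nmsgid "c"\nmsgstr "d"\n'
bad_po = old_po + '\nmsgid "c"\nmsgstr "d"\n'
print(check_only_fuzzy_added(old_po, good_po))
print(len(check_only_fuzzy_added(old_po, bad_po)))
```

Because the check only compares text, it stays independent of whatever heuristics produced the split entries.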


e.

On Sun, Jan 25, 2026 at 5:06 PM Benno Schulenberg
 wrote:
>
>
> > Op 25-01-2026 om 15:41 schreef Egmont Koblinger:
> >> I don't have too much trust that it can properly split _all_ of them
> >> without producing a single faulty translation.
>
> Well, you're right.  It starts to go wrong here:
>
> msgid "  if=FILE read from FILE instead of stdin\n"
> msgstr ""
> "  if=FÁJL olvasás a FÁJLBÓL a szabványos bemenet helyett\n"
> "  iflag=JELÖLŐK   olvasás a vesszővel elválasztott szimbólumlistának megfelelően\n"
> "  obs=BÁJTegyszerre BÁJT bájt kiírása\n"
> "  of=FÁJL a FÁJLBA ír a szabványos kimenet helyett\n"
> "  oflag=JELÖLŐK   a vesszővel elválasztott szimbólumlistának megfelelően ír\n"
>
> It worked fine on the Dutch file.  So... I'm guessing one of the regexes doesn't
> work for the accented letters.  Going to try with just an \S instead...  Yes,
> that fixes it and does not introduce any new changes. So I've recreated the
> HU.po file with the modified script.
>
>https://translationproject.org/latest/coreutils/HU.po
>
>
> --
> Regards,
>
> Benno



Re: a splitting script

2026-01-25 Thread Benno Schulenberg




Op 25-01-2026 om 15:41 schreef Egmont Koblinger:
I don't have too much trust that it can properly split _all_ of them 
without producing a single faulty translation.


Well, you're right.  It starts to go wrong here:

msgid "  if=FILE read from FILE instead of stdin\n"
msgstr ""
"  if=FÁJL olvasás a FÁJLBÓL a szabványos bemenet helyett\n"
"  iflag=JELÖLŐK   olvasás a vesszővel elválasztott szimbólumlistának megfelelően\n"
"  obs=BÁJTegyszerre BÁJT bájt kiírása\n"
"  of=FÁJL a FÁJLBA ír a szabványos kimenet helyett\n"
"  oflag=JELÖLŐK   a vesszővel elválasztott szimbólumlistának megfelelően ír\n"

It worked fine on the Dutch file.  So... I'm guessing one of the regexes doesn't
work for the accented letters.  Going to try with just an \S instead...  Yes,
that fixes it and does not introduce any new changes. So I've recreated the
HU.po file with the modified script.

  https://translationproject.org/latest/coreutils/HU.po
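
The regexes in split.py are not shown in this thread, so the following is
only a sketch of the kind of failure described above.  ASCII_VALUE,
ANY_VALUE and split_dd_line are hypothetical names, not taken from the
script.  An ASCII-only class such as [A-Z] stops at the first accented
letter, while \S accepts any non-whitespace character:

```python
import re

# Hypothetical patterns; the actual split.py regexes may look different.
ASCII_VALUE = re.compile(r"^\s*([a-z]+=[A-Z]+)\s+(.*)$")  # ASCII capitals only
ANY_VALUE = re.compile(r"^\s*([a-z]+=\S+)\s+(.*)$")       # any non-space run


def split_dd_line(pattern, line):
    """Return (option, description), or None if the line does not match."""
    m = pattern.match(line)
    return (m.group(1), m.group(2)) if m else None


english = "  if=FILE read from FILE instead of stdin"
hungarian = "  if=FÁJL olvasás a FÁJLBÓL a szabványos bemenet helyett"

# The ASCII-only pattern handles the msgid but fails on the accented msgstr,
# because [A-Z]+ cannot consume the 'Á' in "FÁJL".
ascii_on_english = split_dd_line(ASCII_VALUE, english)
ascii_on_hungarian = split_dd_line(ASCII_VALUE, hungarian)

# The \S-based pattern splits both the same way.
any_on_hungarian = split_dd_line(ANY_VALUE, hungarian)
```

In Python 3, \S on a str pattern is Unicode-aware by default, which is
presumably why the change did not alter the results for the other files.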


--
Regards,

Benno



Re: a splitting script

2026-01-25 Thread Pádraig Brady

On 25/01/2026 15:29, [email protected] wrote:


On 25-01-2026 at 15:41, Egmont Koblinger wrote:

"simply"?  It sounds anything but simple to me.

[...]

your only anchors are the "-a", "-b" and "--color" words in this example,
which you presumably locate using some heuristics (needs to begin with a
dash? or needs to begin after no more than a certain amount of leading
spaces? does it stop at the first '=' or '[' sign?), and then there are
exceptions to these rules (like the output of `[ --help` contains many
options not beginning with a dash)...


You apparently haven't looked at the script.  Claude did a fine job
creating the heuristics for catching all options -- it even included
the `dd` options that don't start with a dash _and_ excluded the two
"-M " cases that aren't an option.


I don't have too much trust that it can properly split _all_ of them
without producing a single faulty translation.


Okay.  Here is the result of the first version of the split.py script
run on the latest coreutils.hu.po (from 2018):

 https://translationproject.org/latest/coreutils/HU.po

Compare with the lowercase version to see the changes.  Let us know
if you spot any splitting mistake.


The dd word=word cases aren't matched up correctly.
I've just removed them from my split.py for now.

cheers,
Padraig



Re: a splitting script

2026-01-25 Thread coordinator



On 25-01-2026 at 15:41, Egmont Koblinger wrote:

"simply"?  It sounds anything but simple to me.

[...]

your only anchors are the "-a", "-b" and "--color" words in this example,
which you presumably locate using some heuristics (needs to begin with a
dash? or needs to begin after no more than a certain amount of leading
spaces? does it stop at the first '=' or '[' sign?), and then there are
exceptions to these rules (like the output of `[ --help` contains many
options not beginning with a dash)...


You apparently haven't looked at the script.  Claude did a fine job
creating the heuristics for catching all options -- it even included
the `dd` options that don't start with a dash _and_ excluded the two
"-M " cases that aren't an option.

I don't have too much trust that it can properly split _all_ of them 
without producing a single faulty translation.


Okay.  Here is the result of the first version of the split.py script
run on the latest coreutils.hu.po (from 2018):

   https://translationproject.org/latest/coreutils/HU.po

Compare with the lowercase version to see the changes.  Let us know
if you spot any splitting mistake.


--
Regards,

Benno



Re: a splitting script

2026-01-25 Thread Egmont Koblinger
On Sun, Jan 25, 2026 at 1:09 PM  wrote:

> The translations are not generated.  What is generated is the script.

I perfectly understand this.


> And the script simply splits compounded strings into separate strings.

"simply"?  It sounds anything but simple to me.

Given a source string that looks roughly like e.g.:

  -a  one-line description
  -b  a longer description that
  spans across two lines
  -c, --color[=WHEN]  an even longer description
  describing the possible values
  spanning across even more lines

and its translation where the number of lines might change for each
option, the word "WHEN" is also translated, your only anchors are the
"-a", "-b" and "--color" words in this example, which you presumably
locate using some heuristics (needs to begin with a dash? or needs to
begin after no more than a certain amount of leading spaces? does it
stop at the first '=' or '[' sign?), and then there are exceptions to
these rules (like the output of `[ --help` contains many options not
beginning with a dash)...
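
To make the heuristics in question concrete, here is a minimal sketch of
one such rule.  It is not the actual split.py; OPTION_START and
split_options are hypothetical names, and the rule (an option starts a line
after one to four spaces and begins with '-') is an assumption chosen for
illustration:

```python
import re

# Assumed rule: an option line has 1-4 leading spaces followed by '-'.
# Anything else is treated as a continuation of the previous option.
OPTION_START = re.compile(r"^ {1,4}-")


def split_options(text):
    """Group lines into chunks, starting a new chunk at each option line."""
    chunks = []
    for line in text.splitlines():
        if OPTION_START.match(line) or not chunks:
            chunks.append([line])       # new option (or very first line)
        else:
            chunks[-1].append(line)     # continuation of the description
    return ["\n".join(chunk) for chunk in chunks]


help_text = (
    "  -a  one-line description\n"
    "  -b  a longer description that\n"
    "        spans across two lines\n"
    "  -c, --color[=WHEN]  an even longer description\n"
    "        describing the possible values\n"
)

chunks = split_options(help_text)
```

A rule this simple would miss exactly the exceptions listed above: the `dd`
operands and most of the `[ --help` operators do not start with '-', so
they would be glued onto the preceding chunk.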

I have a pretty high trust that a reasonable helper script, whether
manually engineered or AI-written, can properly split the vast
majority of the source strings and their translations.

I don't have too much trust that it can properly split _all_ of them
without producing a single faulty translation.


e.



Re: a splitting script

2026-01-25 Thread coordinator



On 25-01-2026 at 12:43, Egmont Koblinger wrote:

The question is: Can we be confident that all these auto-generated
translations are perfectly fine?


The translations are not generated.  What is generated is the script.
And the script simply splits compounded strings into separate strings.
The script doesn't pull anything from a hat -- check the script.


--
Regards,

Benno




Re: a splitting script

2026-01-25 Thread Egmont Koblinger
On Sun, Jan 25, 2026 at 12:48 PM Pádraig Brady  wrote:

> I diffed the old and new pl.po and it all looked good.
> It is a very mechanical change after all.
>
> I'll do some extra checking when we've a more finalized script.

I'd be happy to do another round of review on the autogenerated hu.po
bits if you send that file to me (not on the wording choices of old
translations, just to make sure the splits happened at the right
places).
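
Part of such a review could be automated.  The sketch below assumes the
split entries are available as (msgid, msgstr) pairs and flags pairs whose
leading option token differs; option_key and mismatched are hypothetical
helpers, and the tokenizing rule is an assumption, not something taken from
split.py:

```python
import re


def option_key(entry):
    """Leading token of an entry, cut at whitespace, '=' or '['.

    Assumed rule: the option name itself is never translated, only the
    placeholder after '=' (FILE -> FÁJL) and the description are.
    """
    m = re.match(r"\s*([^\s=\[]+)", entry)
    return m.group(1) if m else ""


def mismatched(pairs):
    """Return the (msgid, msgstr) pairs whose option keys differ."""
    return [(i, s) for i, s in pairs if option_key(i) != option_key(s)]


pairs = [
    ("  if=FILE read from FILE", "  if=FÁJL olvasás a FÁJLBÓL"),
    ("  obs=BYTES write BYTES bytes", "  of=FÁJL a FÁJLBA ír"),  # wrong split
    ("  -c, --color[=WHEN]  colorize", "  -c, --color[=MIKOR]  színez"),
]

suspects = mismatched(pairs)
```

This only catches splits that land on the wrong option; it says nothing
about descriptions that were cut short or merged, so it would complement a
manual read-through rather than replace it.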


e.



Re: a splitting script

2026-01-25 Thread Pádraig Brady

On 25/01/2026 11:43, Egmont Koblinger wrote:

On Sun, Jan 25, 2026 at 12:28 PM  wrote:


Please save the translators the painful
busywork of having to unfuzzy perfectly fine msgstrs.


The question is: Can we be confident that all these auto-generated
translations are perfectly fine?

This is where I have strong doubts. I'm afraid a manual review is inevitable.


I diffed the old and new pl.po and it all looked good.
It is a very mechanical change after all.

I'll do some extra checking when we've a more finalized script.

thanks,
Padraig



Re: a splitting script

2026-01-25 Thread Pádraig Brady

On 25/01/2026 11:39, [email protected] wrote:


On 25-01-2026 at 12:23, Pádraig Brady wrote:

So 2 questions.
1. Do we really want to wrap the msgstrs too?


Maybe not.  Maybe there are translators that don't want to waste
so much vertical space.  So... let's just wrap the msgids.  The
script can then be provided to the translators and they can run
a modified version if they want the msgstrs to be wrapped too.


2. If we don't, do we still want to wrap the msgid to avoid the fuzzy tag?


Yes please.  The fuzzy tag is a nuisance when the translation
is perfectly correct.


Cool, agreed.

I'm away from the computer for a couple of hours,
but it should be a few mins work when I get back.

cheers,
Padraig



Re: a splitting script

2026-01-25 Thread Egmont Koblinger
On Sun, Jan 25, 2026 at 12:28 PM  wrote:

> Please save the translators the painful
> busywork of having to unfuzzy perfectly fine msgstrs.

The question is: Can we be confident that all these auto-generated
translations are perfectly fine?

This is where I have strong doubts. I'm afraid a manual review is inevitable.


e.



Re: a splitting script

2026-01-25 Thread coordinator



On 25-01-2026 at 12:23, Pádraig Brady wrote:

So 2 questions.
1. Do we really want to wrap the msgstrs too?


Maybe not.  Maybe there are translators that don't want to waste
so much vertical space.  So... let's just wrap the msgids.  The
script can then be provided to the translators and they can run
a modified version if they want the msgstrs to be wrapped too.


2. If we don't, do we still want to wrap the msgid to avoid the fuzzy tag?


Yes please.  The fuzzy tag is a nuisance when the translation
is perfectly correct.


--
Regards,

Benno




Re: a splitting script

2026-01-25 Thread coordinator



On 25-01-2026 at 12:17, Pádraig Brady wrote:

On 25/01/2026 11:01, Egmont Koblinger wrote:

In this regard, I can't see how updating the formatting of msgid would
improve the game.


It improves the game a lot: translators will have to remove the
"#, fuzzy" mark from maybe around 50 translations, instead of
from 562.  That saves a lot of work.


Yes the only reason to wrap the msgid would be to avoid the fuzzy tag.


Right.  But the fuzzy tag is spurious as the only thing that
changed (in most cases) is that the description was moved to
a separate line.  Please save the translators the painful
busywork of having to unfuzzy perfectly fine msgstrs.
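
For anyone who wants to check such counts on their own PO file, a minimal
sketch follows.  It relies only on the standard PO convention that flags
live on a "#," comment line; count_fuzzy is a hypothetical helper, not part
of any script mentioned in this thread:

```python
def count_fuzzy(po_text):
    """Count entries whose '#,' flag line includes the 'fuzzy' flag."""
    count = 0
    for line in po_text.splitlines():
        if not line.startswith("#,"):
            continue
        # A flag line looks like "#, fuzzy, c-format".
        flags = [flag.strip() for flag in line[2:].split(",")]
        if "fuzzy" in flags:
            count += 1
    return count


sample = "\n".join([
    "#, fuzzy",
    'msgid "first"',
    'msgstr "eerste"',
    "",
    "#, c-format",
    'msgid "second %s"',
    'msgstr "tweede %s"',
    "",
    "#, fuzzy, c-format",
    'msgid "third %s"',
    'msgstr "derde %s"',
])

fuzzy_entries = count_fuzzy(sample)
```

Running this on a PO file before and after reformatting the msgids would
show how many entries msgmerge still considers inexact.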


--
Regards,

Benno




Re: a splitting script

2026-01-25 Thread Pádraig Brady

On 25/01/2026 10:18, [email protected] wrote:


On 24-01-2026 at 23:39, Pádraig Brady wrote:

I asked claude sonnet 4.5 to code a script to split
the combined options in the existing po files
to separate translations per option.
It basically one shotted it in about 10 seconds.


Nice!


At that stage there are only whitespace differences
and msgmerge is able to fuzzy match those fine.


Hmm.  It results in around 560 fuzzy matches, of which the vast
majority would be exact matches if the msgid had been formatted
the same way as in current coreutils.pot: the option description
on a separate line preceded by nine spaces.  It would be nice if
Claude could print out both msgid and msgstr in that format.  It
will save fourteen translators a lot of busywork.


Right, the new wrapped msgid will happen automatically
with the next msgmerge but will be tagged as fuzzy.
We could easily enough wrap the msgid to avoid that.

I was wondering about whether translators would prefer
having these auto adjusted translations marked as fuzzy or not.
I'll defer to your preference on this so.

As for wrapping the msgstr, I was a bit more wary as languages
have different wrapping attributes.  For example we don't wrap
some utils like ptx where the descriptions currently fit.
Also some languages like "zh" are naturally more concise and
thus would prefer not to wrap.

So 2 questions.
1. Do we really want to wrap the msgstrs too?
2. If we don't, do we still want to wrap the msgid to avoid the fuzzy tag?


So I presume we could just update all latest po files with this script,
then upload all these "split" po files without involving translators?


When a script with the above improvement becomes available, I will
run it at the TP and create a new series of PO files (9.9.258) that
will supersede the current 9.9-pre1, to avoid overwriting the actual
work of the translators.


Sounds good.

thanks,
Padraig.



Re: a splitting script

2026-01-25 Thread Pádraig Brady

On 25/01/2026 11:01, Egmont Koblinger wrote:

Hi,

I might be misunderstanding the situation again, so take this comment
with a grain of salt.

I'd highly appreciate if AI-generated translations would all be marked
as fuzzy, requiring manual verification and manual removal of the
fuzzy flag, no matter how boring that is. Maybe it's just me, but I
wouldn't trust the AI-generated script to get all the new approx. 14 *
560 translations correct; or to be smart enough to know which ones it's
super-duper confident in and mark those as non-fuzzy.


"AI-generated translations" is a stretch :)
It's just a tool to split existing translations.

Wrapping is a little more involved/subjective though,
as mentioned in the other mail...


In this regard, I can't see how updating the formatting of msgid would
improve the game. They will be properly formatted after a msgmerge,
presumably maintaining the fuzzy matches generated by AI.


Yes the only reason to wrap the msgid would be to avoid the fuzzy tag.

cheers,
Padraig



Re: a splitting script

2026-01-25 Thread Egmont Koblinger
Hi,

I might be misunderstanding the situation again, so take this comment
with a grain of salt.

I'd highly appreciate if AI-generated translations would all be marked
as fuzzy, requiring manual verification and manual removal of the
fuzzy flag, no matter how boring that is. Maybe it's just me, but I
wouldn't trust the AI-generated script to get all the new approx. 14 *
560 translations correct; or to be smart enough to know which ones it's
super-duper confident in and mark those as non-fuzzy.

In this regard, I can't see how updating the formatting of msgid would
improve the game. They will be properly formatted after a msgmerge,
presumably maintaining the fuzzy matches generated by AI.


e.

On Sun, Jan 25, 2026 at 11:18 AM  wrote:
>
>
> On 24-01-2026 at 23:39, Pádraig Brady wrote:
> > I asked claude sonnet 4.5 to code a script to split
> > the combined options in the existing po files
> > to separate translations per option.
> > It basically one shotted it in about 10 seconds.
>
> Nice!
>
> > At that stage there are only whitespace differences
> > and msgmerge is able to fuzzy match those fine.
>
> Hmm.  It results in around 560 fuzzy matches, of which the vast
> majority would be exact matches if the msgid had been formatted
> the same way as in current coreutils.pot: the option description
> on a separate line preceded by nine spaces.  It would be nice if
> Claude could print out both msgid and msgstr in that format.  It
> will save fourteen translators a lot of busywork.
>
> > So I presume we could just update all latest po files with this script,
> > then upload all these "split" po files without involving translators?
>
> When a script with the above improvement becomes available, I will
> run it at the TP and create a new series of PO files (9.9.258) that
> will supersede the current 9.9-pre1, to avoid overwriting the actual
> work of the translators.
>
>
> --
> Regards,
>
> Benno
>



Re: a splitting script

2026-01-25 Thread coordinator



On 24-01-2026 at 23:39, Pádraig Brady wrote:

I asked claude sonnet 4.5 to code a script to split
the combined options in the existing po files
to separate translations per option.
It basically one shotted it in about 10 seconds.


Nice!


At that stage there are only whitespace differences
and msgmerge is able to fuzzy match those fine.


Hmm.  It results in around 560 fuzzy matches, of which the vast
majority would be exact matches if the msgid had been formatted
the same way as in current coreutils.pot: the option description
on a separate line preceded by nine spaces.  It would be nice if
Claude could print out both msgid and msgstr in that format.  It
will save fourteen translators a lot of busywork.
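
As an illustration of the format meant here, the following sketch moves a
description onto its own line with nine spaces of indentation.  It assumes
the option and its description are separated by two or more spaces;
OPTION_LINE and move_description are hypothetical names, and the real
script may need more cases than this:

```python
import re

# Assumption: option text and description are separated by 2+ spaces.
# Group 1 is the (indented) option text, group 2 the description.
OPTION_LINE = re.compile(r"^(\s*\S(?:.*?\S)?) {2,}(\S.*)$")


def move_description(line, indent=9):
    """Put the description on a separate line, indented by `indent` spaces.

    Lines that do not look like 'option  description' are returned as-is.
    """
    m = OPTION_LINE.match(line)
    if not m:
        return line
    return f"{m.group(1)}\n{' ' * indent}{m.group(2)}"


before = "  -a, --all  do not ignore entries starting with ."
after = move_description(before)
```

The single space inside "-a, --all" is not mistaken for the separator,
because the pattern only splits at a run of at least two spaces.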


So I presume we could just update all latest po files with this script,
then upload all these "split" po files without involving translators?


When a script with the above improvement becomes available, I will
run it at the TP and create a new series of PO files (9.9.258) that
will supersede the current 9.9-pre1, to avoid overwriting the actual
work of the translators.


--
Regards,

Benno