Re: [Freeipa-devel] i18n infrastructure improvements

2013-01-11 Thread Petr Viktorin

Hello list,
This discussion was started in private; I'll continue it here.

On 01/10/2013 05:41 PM, John Dennis wrote:

On 01/10/2013 04:27 AM, Petr Viktorin wrote:

On 01/09/2013 03:55 PM, John Dennis wrote:



And I could work on improving the i18n/translations infrastructure,
starting by writing up an RFE+design.



Could you elaborate as to what you perceive as the current problems and
what this work would address.



Here are my notes:



- Use fake translations for tests


We already do (but perhaps not sufficiently).


I mean use it in *all* tests, to ensure all the right things are 
translated and weird characters are handled well.

See https://www.redhat.com/archives/freeipa-devel/2012-October/msg00278.html


- Split up huge strings so the entire text doesn't have to be
retranslated each time something changes/is added


Good idea. But one question I have is should we be optimizing for our
programmers' time or the translators' time? The Transifex tool should make
available to translators similar existing translations (in fact it
might, I seem to recall some functionality in this area). Wouldn't it be
better to address this issue in Transifex where all projects would benefit?

Also the exact same functionality is needed to support release versions.
The strings between releases are often close but not identical. The
Transifex tool should make available a close match from a previous
version to the translator working on a new version (or vice versa). See
your issue below concerning versions.

IMHO this is a Transifex issue which needs to be solved there, not
something we should be investing precious IPA programmers' time on. Plus
if it's solved in Transifex it's a *huge* win for *everyone*, not just IPA.


Huh? Splitting the strings provides additional information 
(paragraph/context boundaries) that Transifex can't get otherwise. From 
what I hear it's a pretty standard technique when working with gettext.


For typos, gettext has the fuzzy functionality that we explicitly turn 
off. I think we're on our own here.



- Keep a history/repo of the translations, since Transifex only stores
the latest version


We already do keep a history, it's in git.


It's not updated often enough. If I mess something up before a release 
and Transifex gets wiped, or if a rogue translator deletes some 
translations, the work is gone.



- Update the source strings on Transifex more often (ideally as soon as
patches are pushed)


Yes, great idea, this would be really useful and is necessary.


- Break Git dependencies: make it possible to generate the POT in an
unpacked tarball


Are you talking about the fact our scripts invoke git to determine what
files to process? If so then yes, this would be a good dependency to get
rid of. However it does mean we somehow have to maintain a manifest list
of some sort somewhere.


A directory listing is fine IMO. We use it for more critical things, 
like loading plugins, without any trouble.
Also, when run in a Git repo the Makefile can compare the file list with 
what Git says and warn accordingly.
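
The directory-listing approach described above could be sketched roughly as follows. This is only an illustration: the function names and the choice to look only at `.py` files are assumptions and are not taken from the IPA build scripts. The Git comparison is optional and is skipped when the tree is not a Git checkout, for example in an unpacked tarball.

```python
import os
import subprocess


def list_python_files(root):
    """Walk root and return the relative paths of all .py files, sorted."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into hidden directories such as .git.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            if name.endswith(".py"):
                full = os.path.join(dirpath, name)
                found.append(os.path.relpath(full, root).replace(os.sep, "/"))
    return sorted(found)


def git_tracked_python_files(root):
    """Return tracked .py files, or None when Git or a checkout is missing."""
    try:
        out = subprocess.check_output(
            ["git", "ls-files", "-z", "--", "*.py"],
            cwd=root,
            stderr=subprocess.DEVNULL,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return sorted(p for p in out.decode("utf-8").split("\0") if p)


def compare_with_git(root):
    """Return (only_on_disk, only_in_git); both are empty without Git."""
    tracked = git_tracked_python_files(root)
    if tracked is None:
        return [], []
    on_disk = set(list_python_files(root))
    tracked = set(tracked)
    return sorted(on_disk - tracked), sorted(tracked - on_disk)
```

A build step could print a warning when either list returned by `compare_with_git` is non-empty, and carry on with the directory listing regardless.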



- Figure out how to best share messages across versions (2.x vs. 3.x) so
they only have to be translated once


There is a crying need for this, but isn't this a Transifex issue? Why
would we be solving this in IPA? What about SSSD and every other project,
they all have identical issues. As far as I can tell Transifex has never
addressed this issue sufficiently (see above) and the onus is on them to
do so.


I don't think waiting for Transifex will solve the problem.


- Clean up checked-in PO files even more, for nicer diffs


A nice feature, but I'm wondering to what extent we're currently suffering
because of this. It's rare that we have to compare PO files. Plus diff
is not well suited for comparing PO files because PO files with equivalent
data can be formatted differently. That's why I wrote some tools to read
PO files, normalize the contents and then do a comparison. Anyway, my
top-level question is: is this something we really need at this point?


You're right that files have to be normalized to diff well. That's
actually the point here :)
Anyway I'm just thinking of sorting the PO alphabetically - an extra 
option to msgattrib should do it.
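
To illustrate why sorting helps diffs, here is a minimal sketch that writes a catalog with its entries ordered by msgid. It assumes the catalog is already loaded as a plain dict mapping msgid to msgstr. The helper names are hypothetical, and comments, flags, plurals and contexts are ignored. With the real tools, I believe `msgattrib --sort-output` is the kind of option meant above.

```python
def po_escape(text):
    """Escape a string for use inside a double-quoted PO string."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\t", "\\t")
    )


def dump_sorted_po(catalog):
    """Return PO text for a {msgid: msgstr} dict, sorted by msgid.

    The header entry (empty msgid) sorts first because the empty
    string compares lower than any other string.
    """
    lines = []
    for msgid in sorted(catalog):
        lines.append('msgid "%s"' % po_escape(msgid))
        lines.append('msgstr "%s"' % po_escape(catalog[msgid]))
        lines.append("")
    return "\n".join(lines)
```

Because the output order depends only on the msgids, two catalogs with the same content produce identical text, so a plain diff shows only real changes.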



- Automate and document the process so any dev can do it


Excellent goal, we're not too far from it now, but of all the things on
the list this is the most important.


--
Petr³

___
Freeipa-devel mailing list
Freeipa-devel@redhat.com
https://www.redhat.com/mailman/listinfo/freeipa-devel

Re: [Freeipa-devel] i18n infrastructure improvements

2013-01-11 Thread John Dennis

On 01/11/2013 10:04 AM, Petr Viktorin wrote:

Hello list,
This discussion was started in private; I'll continue it here.

On 01/10/2013 05:41 PM, John Dennis wrote:

On 01/10/2013 04:27 AM, Petr Viktorin wrote:

On 01/09/2013 03:55 PM, John Dennis wrote:



And I could work on improving the i18n/translations infrastructure,
starting by writing up an RFE+design.



Could you elaborate as to what you perceive as the current problems and
what this work would address.



Here are my notes:



- Use fake translations for tests


We already do (but perhaps not sufficiently).


I mean use it in *all* tests, to ensure all the right things are
translated and weird characters are handled well.
See https://www.redhat.com/archives/freeipa-devel/2012-October/msg00278.html


Ah yes, I like the idea of a test domain for strings, this is a good 
idea. Not only would it exercise our i18n code more but it could 
insulate the tests from string changes (the test would look for a 
canonical string in the test domain)
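
A minimal sketch of the test-domain idea, assuming the fake translation simply wraps each msgid in non-ASCII markers. The marker characters and function names are hypothetical and are not taken from the IPA test suite.

```python
# Non-ASCII markers make encoding problems visible and make it easy
# to tell whether a string went through the translation machinery.
PREFIX = "\u2192"  # rightwards arrow
SUFFIX = "\u2190"  # leftwards arrow


def fake_translate(msgid):
    """Return a recognizable pseudo-translation of msgid."""
    return PREFIX + msgid + SUFFIX


def find_untranslated(outputs):
    """Return the strings that do not carry the fake-translation markers."""
    return [
        text
        for text in outputs
        if not (text.startswith(PREFIX) and text.endswith(SUFFIX))
    ]
```

A test run under the fake locale could collect every user-visible string and fail if `find_untranslated` returns anything, which would point at strings that were never marked for translation.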





- Split up huge strings so the entire text doesn't have to be
retranslated each time something changes/is added


Good idea. But one question I have is should we be optimizing for our
programmers' time or the translators' time? The Transifex tool should make
available to translators similar existing translations (in fact it
might, I seem to recall some functionality in this area). Wouldn't it be
better to address this issue in Transifex where all projects would benefit?

Also the exact same functionality is needed to support release versions.
The strings between releases are often close but not identical. The
Transifex tool should make available a close match from a previous
version to the translator working on a new version (or vice versa). See
your issue below concerning versions.

IMHO this is a Transifex issue which needs to be solved there, not
something we should be investing precious IPA programmers' time on. Plus
if it's solved in Transifex it's a *huge* win for *everyone*, not just IPA.


Huh? Splitting the strings provides additional information
(paragraph/context boundaries) that Transifex can't get otherwise. From
what I hear it's a pretty standard technique when working with gettext.


I'm not sure how splitting text into smaller units gives more context 
but I can see the argument for each msgid being a logical paragraph. We 
don't have too many multi-paragraph strings now so it shouldn't be too 
involved.
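
As a rough sketch of what "each msgid is a logical paragraph" could mean in code, assuming paragraphs are separated by blank lines. The function names are hypothetical, and `translate` stands in for whatever gettext lookup is in use.

```python
import re


def split_paragraphs(text):
    """Split text on blank lines and return the non-empty paragraphs."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return [p.strip() for p in paragraphs if p.strip()]


def translate_by_paragraph(text, translate):
    """Translate each paragraph separately and rejoin with blank lines."""
    return "\n\n".join(translate(p) for p in split_paragraphs(text))
```

With this scheme, a change to one paragraph of a long help text invalidates only that paragraph's translation; the other msgids stay untouched.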




For typos, gettext has the fuzzy functionality that we explicitly turn
off. I think we're on our own here.


Be very afraid of turning on fuzzy matching. Before we moved to TX we 
used the entire gnu tool chain. I discovered a number of our PO files 
were horribly corrupted. With a lot of work I traced this down to fuzzy 
matches. If memory serves me right here is what happened.


When a msgstr was absent a fuzzy match was performed and inserted as a 
candidate msgstr. Somehow the fuzzy candidates got accepted as actual 
msgstr's. I'm not sure if we ever figured out how this happened. The two 
most likely explanations were 1) a known bug in TX that stripped the 
fuzzy flag off the msgstr or 2) a translator who blindly accepted all 
TX suggestions. (A suggestion in TX comes from a fuzzy match).


But the real problem is the fuzzy matching is horribly bad. Most of the 
fuzzy suggestions (primarily on short strings) were wildly incorrect.


I had to go back to a number of PO files and manually locate all fuzzy 
suggestions that had been promoted to legitimate msgstr's. A tedious 
process I hope to never repeat.


BTW, if memory serves me correctly the fuzzy suggestions got into the PO 
files in the first place because we were running the full gnu tool chain 
(sorry off the top of my head I don't recall exactly which component 
inserts the fuzzy suggestion), but I think we've since turned that off, 
for a very good reason.
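
For what it's worth, entries that still carry the fuzzy flag can be located mechanically. The sketch below assumes standard PO syntax, where flags sit on a `#,` comment line directly above the entry. The function name is hypothetical, and it cannot find suggestions whose fuzzy flag has already been stripped, which was the painful case described above.

```python
def find_fuzzy_msgids(po_text):
    """Return the quoted msgid line of every entry flagged as fuzzy.

    For multi-line msgids only the first line (often '""') is returned.
    """
    fuzzy = []
    flagged = False
    for line in po_text.splitlines():
        line = line.strip()
        if line.startswith("#,"):
            # A flag line looks like "#, fuzzy, python-format".
            flags = [flag.strip() for flag in line[2:].split(",")]
            if "fuzzy" in flags:
                flagged = True
        elif line.startswith("msgid "):
            if flagged:
                fuzzy.append(line[len("msgid "):])
            flagged = False
    return fuzzy
```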






- Keep a history/repo of the translations, since Transifex only stores
the latest version


We already do keep a history, it's in git.


It's not updated often enough. If I mess something up before a release
and Transifex gets wiped, or if a rogue translator deletes some
translations, the work is gone.


Yes, updating more frequently is an excellent goal.




- Update the source strings on Transifex more often (ideally as soon as
patches are pushed)


Yes, great idea, this would be really useful and is necessary.


- Break Git dependencies: make it possible to generate the POT in an
unpacked tarball


Are you talking about the fact our scripts invoke git to determine what
files to process? If so then yes, this would be a good dependency to get
rid of. However it does mean we somehow have to maintain a manifest list
of some sort somewhere.


A directory listing is fine IMO. We use it for more critical things,
like loading plugins, without any trouble.
Also, when run in a Git repo the Makefile can compare the file list with
what Git says and warn accordingly.


How do you 

Re: [Freeipa-devel] i18n infrastructure improvements

2013-01-11 Thread Jérôme Fenal
2013/1/11 John Dennis jden...@redhat.com

 On 01/11/2013 10:04 AM, Petr Viktorin wrote:

 Hello list,
 This discussion was started in private; I'll continue it here.

 On 01/10/2013 05:41 PM, John Dennis wrote:

 On 01/10/2013 04:27 AM, Petr Viktorin wrote:

 On 01/09/2013 03:55 PM, John Dennis wrote:


  And I could work on improving the i18n/translations infrastructure,
 starting by writing up an RFE+design.


  Could you elaborate as to what you perceive as the current problems and
 what this work would address.


  Here are my notes:


  - Use fake translations for tests


 We already do (but perhaps not sufficiently).


 I mean use it in *all* tests, to ensure all the right things are
 translated and weird characters are handled well.
 See https://www.redhat.com/archives/freeipa-devel/2012-October/msg00278.html


 Ah yes, I like the idea of a test domain for strings, this is a good idea.
 Not only would it exercise our i18n code more but it could insulate the
 tests from string changes (the test would look for a canonical string in
 the test domain)


FWIW, KDE also uses an empty .po (i.e. one with empty translated messages) to
make it easier to spot strings not marked for translation.



 - Split up huge strings so the entire text doesn't have to be
 retranslated each time something changes/is added


 Good idea. But one question I have is should we be optimizing for our
 programmers' time or the translators' time? The Transifex tool should make
 available to translators similar existing translations (in fact it
 might, I seem to recall some functionality in this area). Wouldn't it be
 better to address this issue in Transifex where all projects would
 benefit?

 Also the exact same functionality is needed to support release versions.
 The strings between releases are often close but not identical. The
 Transifex tool should make available a close match from a previous
 version to the translator working on a new version (or vice versa). See
 your issue below concerning versions.

 IMHO this is a Transifex issue which needs to be solved there, not
 something we should be investing precious IPA programmers' time on. Plus
 if it's solved in Transifex it's a *huge* win for *everyone*, not just
 IPA.


 Huh? Splitting the strings provides additional information
 (paragraph/context boundaries) that Transifex can't get otherwise. From
 what I hear it's a pretty standard technique when working with gettext.


 I'm not sure how splitting text into smaller units gives more context but
 I can see the argument for each msgid being a logical paragraph. We don't
 have too many multi-paragraph strings now so it shouldn't be too involved.


One issue also discussed on this list is the problem of 100+ line strings
in man pages generated from __doc__ tags in scripts.
Those are a _real_ pain for translators to maintain when only one line is
changed.

I haven't had the time yet to explore splitting those strings; I need to take
some time to do so.



 For typos, gettext has the fuzzy functionality that we explicitly turn
 off. I think we're on our own here.


 Be very afraid of turning on fuzzy matching. Before we moved to TX we used
 the entire gnu tool chain. I discovered a number of our PO files were
 horribly corrupted. With a lot of work I traced this down to fuzzy matches.
 If memory serves me right here is what happened.

 When a msgstr was absent a fuzzy match was performed and inserted as a
 candidate msgstr. Somehow the fuzzy candidates got accepted as actual
 msgstr's. I'm not sure if we ever figured out how this happened. The two
 most likely explanations were 1) a known bug in TX that stripped the fuzzy
 flag off the msgstr or 2) a translator who blindly accepted all TX
 suggestions. (A suggestion in TX comes from a fuzzy match).

 But the real problem is the fuzzy matching is horribly bad. Most of the
 fuzzy suggestions (primarily on short strings) were wildly incorrect.

 I had to go back to a number of PO files and manually locate all fuzzy
 suggestions that had been promoted to legitimate msgstr's. A tedious
 process I hope to never repeat.

 BTW, if memory serves me correctly the fuzzy suggestions got into the PO
 files in the first place because we were running the full gnu tool chain
 (sorry off the top of my head I don't recall exactly which component
 inserts the fuzzy suggestion), but I think we've since turned that off, for
 a very good reason.



  - Keep a history/repo of the translations, since Transifex only stores
 the latest version


 We already do keep a history, it's in git.


 It's not updated often enough. If I mess something up before a release
 and Transifex gets wiped, or if a rogue translator deletes some
 translations, the work is gone.


 Yes, updating more frequently is an excellent goal.


Yes, please!

Having nothing to translate for months on Transifex is not fun.
Having a mass of new strings to translate every once 

Re: [Freeipa-devel] i18n infrastructure improvements

2013-01-11 Thread John Dennis

On 01/11/2013 02:44 PM, Jérôme Fenal wrote:

2013/1/11 John Dennis jden...@redhat.com


Thank you Jérôme for your insights as a translator. We have a lop-sided 
perspective mostly from the developer point of view. We need to better 
understand the translator's perspective.



I'm not sure how splitting text into smaller units gives more
context but I can see the argument for each msgid being a logical
paragraph. We don't have too many multi-paragraph strings now so it
shouldn't be too involved.


One issue also discussed on this list is the problem of 100+ line
strings in man pages generated from __doc__ tags in scripts.
Those are a _real_ pain for translators to maintain when only one line
is changed.


I still think TX should attempt to match the msgid from a previous pot 
with an updated pot and show the *word* differences between the strings 
along with an edit window for the original translation. That would be so 
useful to translators I can't believe TX does not have that feature. All 
you would have to do is make a few trivial edits in the translation and 
save it.
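
To make the idea concrete, here is a small sketch of a word-level diff between an old and a new msgid using Python's standard `difflib`. It only illustrates the kind of display meant above. It is not a Transifex feature, and the function name and the `[-...-]` / `{+...+}` markers are arbitrary choices.

```python
import difflib


def word_diff(old_msgid, new_msgid):
    """Return the new msgid with removed and added words marked inline."""
    old_words = old_msgid.split()
    new_words = new_msgid.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(old_words[i1:i2])
            continue
        if i2 > i1:
            # Words present only in the old msgid.
            out.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if j2 > j1:
            # Words present only in the new msgid.
            out.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(out)
```

For example, `word_diff("Add a new user", "Add a new group")` gives `Add a new [-user-] {+group+}`, so the translator sees at a glance that only one word needs attention.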


But heck, I'm not a translator and I haven't used the translator's part 
of the TX tool much other than to explore how it works (and that was a 
while ago).




I'd see a few remarks here:
- this massive .po file would grow wildly, especially when a typo is
corrected in huge strings (__doc__), when additional sentences are
added to those, etc.
- breaking down bigger strings in smaller ones will certainly help here
in avoiding duplicated content,



- in Transifex, it is easy to upload a .po onto another branch, and only
untranslated matching strings would be updated. I used it on anaconda
where there are multiple branches between Fedora, RHEL5/6 and master,
that worked easily without breaking anything.
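
The behaviour described in that last remark (only untranslated, matching strings get filled in) can be modelled in a few lines. This is a sketch of the rule as stated, not of how Transifex implements it. Catalogs are assumed to be plain dicts of msgid to msgstr, and the function name is hypothetical.

```python
def fill_untranslated(target, source):
    """Copy translations from source into target without overwriting.

    Only msgids that already exist in target and have an empty msgstr
    there are updated; everything else is left as it is.
    """
    merged = dict(target)
    for msgid, msgstr in source.items():
        if msgid in merged and not merged[msgid] and msgstr:
            merged[msgid] = msgstr
    return merged
```

Existing translations in the target win, and msgids that appear only in the source are ignored, which is why such an upload should not break anything.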


When you say easy to upload a .po onto another branch I assume you don't 
mean branch (TX has no such concept) but rather another TX resource. 
Anyway this is good to know, perhaps the way TX handles versions is not 
half as bad as it would appear.


--
John Dennis jden...@redhat.com

Looking to carve out IT costs?
www.redhat.com/carveoutcosts/


Re: [Freeipa-devel] i18n infrastructure improvements

2013-01-11 Thread Jérôme Fenal
2013/1/11 John Dennis jden...@redhat.com

 On 01/11/2013 02:44 PM, Jérôme Fenal wrote:

 2013/1/11 John Dennis jden...@redhat.com


 Thank you Jérôme for your insights as a translator. We have a lop-sided
 perspective mostly from the developer point of view. We need to better
 understand the translator's perspective.


You're welcome.
I'm not an expert at Transifex though.
I've yet to schedule a lunch with Kevin Raymond (he works a few kms away
from the French Red Hat office) who is coordinating the whole Fedora
translation effort, but customers first, yada yada... :)






 I'm not sure how splitting text into smaller units gives more
 context but I can see the argument for each msgid being a logical
 paragraph. We don't have too many multi-paragraph strings now so it
 shouldn't be too involved.


 One issue also discussed on this list is the problem of 100+ line
 strings in man pages generated from __doc__ tags in scripts.
 Those are a _real_ pain for translators to maintain when only one line
 is changed.


I still think TX should attempt to match the msgid from a previous pot with
 an updated pot and show the *word* differences between the strings along
 with an edit window for the original translation. That would be so useful
 to translators I can't believe TX does not have that feature. All you would
 have to do is make a few trivial edits in the translation and save it.


I agree with you.
But transifex developers seem to be overloaded at the moment.
I can check with Kevin (and internally) if Zanata would provide a better
home to host the translation effort.


 But heck, I'm not a translator and I haven't used the translator's part of
 the TX tool much other than to explore how it works (and that was a while
 ago).


I can understand that... :)
Hopefully, the IPA dev team is multilingual ;)

 I'd see a few remarks here:
 - this massive .po file would grow wildly, especially when a typo is
 corrected in huge strings (__doc__), when additional sentences are
 added to those, etc.
 - breaking down bigger strings in smaller ones will certainly help here
 in avoiding duplicated content,


 - in Transifex, it is easy to upload a .po onto another branch, and only
 untranslated matching strings would be updated. I used it on anaconda
 where there are multiple branches between Fedora, RHEL5/6 and master,
 that worked easily without breaking anything.


When you say easy to upload a .po onto another branch I assume you don't
 mean branch (TX has no such concept) but rather another TX resource. Anyway
 this is good to know, perhaps the way TX handles versions is not half as
 bad as it would appear.


You're right. See how anaconda is organized, for instance:
 https://fedora.transifex.com/projects/p/fedora/language/en/?project=2059

Regards,

J.
-- 
Jérôme Fenal

Re: [Freeipa-devel] i18n infrastructure improvements

2013-01-11 Thread John Dennis

On 01/11/2013 04:00 PM, Jérôme Fenal wrote:

When you say easy to upload a .po onto another branch I assume you
don't mean branch (TX has no such concept) but rather another TX
resource. Anyway this is good to know, perhaps the way TX handles
versions is not half as bad as it would appear.


You're right. See how anaconda is organized, for instance:
https://fedora.transifex.com/projects/p/fedora/language/en/?project=2059


We follow the same model as anaconda, a new TX resource per version.

--
John Dennis jden...@redhat.com


Re: [Freeipa-devel] i18n infrastructure improvements

2013-01-11 Thread Jérôme Fenal
2013/1/11 John Dennis jden...@redhat.com

 On 01/11/2013 04:00 PM, Jérôme Fenal wrote:

 When you say easy to upload a .po onto another branch I assume you
 don't mean branch (TX has no such concept) but rather another TX
 resource. Anyway this is good to know, perhaps the way TX handles
 versions is not half as bad as it would appear.


 You're right. See how anaconda is organized, for instance:
 https://fedora.transifex.com/projects/p/fedora/language/en/?project=2059


 We follow the same model as anaconda, a new TX resource per version.


Yup.
Minus the frequent updates on master/head resource ipa (and no IPA  3.x
resource, but that is not a problem for IPA, given its fast pace and no
long maintenance on older branches).

-- 
Jérôme Fenal