Re: [Freeipa-devel] i18n infrastructure improvements
Hello list,

This discussion was started in private; I'll continue it here.

On 01/10/2013 05:41 PM, John Dennis wrote:
On 01/10/2013 04:27 AM, Petr Viktorin wrote:
On 01/09/2013 03:55 PM, John Dennis wrote:

And I could work on improving the i18n/translations infrastructure, starting by writing up an RFE+design.

Could you elaborate as to what you perceive as the current problems and what this work would address?

Here are my notes:

- Use fake translations for tests

We already do (but perhaps not sufficiently).

I mean use it in *all* tests, to ensure all the right things are translated and weird characters are handled well. See https://www.redhat.com/archives/freeipa-devel/2012-October/msg00278.html

- Split up huge strings so the entire text doesn't have to be retranslated each time something changes/is added

Good idea. But one question I have is should we be optimizing for our programmers' time or the translators' time? The Transifex tool should make available to translators similar existing translations (in fact it might, I seem to recall some functionality in this area). Wouldn't it be better to address this issue in Transifex where all projects would benefit?

Also the exact same functionality is needed to support release versions. The strings between releases are often close but not identical. The Transifex tool should make available a close match from a previous version to the translator working on a new version (or vice versa). See your issue below concerning versions.

IMHO this is a Transifex issue which needs to be solved there, not something we should be investing precious IPA programmers' time on. Plus if it's solved in Transifex it's a *huge* win for *everyone*, not just IPA.

Huh? Splitting the strings provides additional information (paragraph/context boundaries) that Transifex can't get otherwise. From what I hear it's a pretty standard technique when working with gettext. For typos, gettext has the fuzzy functionality that we explicitly turn off. I think we're on our own here.

- Keep a history/repo of the translations, since Transifex only stores the latest version

We already do keep a history, it's in git.

It's not updated often enough. If I mess something up before a release and Transifex gets wiped, or if a rogue translator deletes some translations, the work is gone.

- Update the source strings on Transifex more often (ideally as soon as patches are pushed)

Yes, great idea, this would be really useful and is necessary.

- Break Git dependencies: make it possible to generate the POT in an unpacked tarball

Are you talking about the fact that our scripts invoke git to determine what files to process? If so then yes, this would be a good dependency to get rid of. However it does mean we somehow have to maintain a manifest list of some sort somewhere.

A directory listing is fine IMO. We use it for more critical things, like loading plugins, without any trouble. Also, when run in a Git repo the Makefile can compare the file list with what Git says and warn accordingly.

- Figure out how to best share messages across versions (2.x vs. 3.x) so they only have to be translated once

There is a crying need for this, but isn't this a Transifex issue? Why would we be solving this in IPA? What about SSSD and every other project, they all have identical issues. As far as I can tell Transifex has never addressed this issue sufficiently (see above) and the onus is on them to do so.

I don't think waiting for Transifex will solve the problem.

- Clean up checked-in PO files even more, for nicer diffs

A nice feature, but I'm wondering to what extent we're currently suffering because of this. It's rare that we have to compare PO files. Plus diff is not well suited for comparing PO files because PO files with equivalent data can be formatted differently. That's why I wrote some tools to read PO files, normalize the contents and then do a comparison. Anyway my top-level question is: is this something we really need at this point?

You're right that files have to be normalized to diff well. That's actually the point here :) Anyway I'm just thinking of sorting the PO alphabetically - an extra option to msgattrib should do it.

- Automate and document the process so any dev can do it

Excellent goal, we're not too far from it now, but of all the things on the list this is the most important.

-- 
Petr³

___
Freeipa-devel mailing list
Freeipa-devel@redhat.com
https://www.redhat.com/mailman/listinfo/freeipa-devel
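The directory-listing idea from the message above (build the file list without Git, and only warn about differences when a Git checkout happens to be available) could look roughly like the following. This is a minimal sketch, not the project's actual tooling: the function names `list_sources`, `git_tracked` and `compare`, and the choice of `.py` as the only suffix, are assumptions made for illustration.

```python
import os
import subprocess


def list_sources(root, suffixes=(".py",)):
    """Walk root and return the relative paths of candidate source files."""
    found = set()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden directories such as .git so they are never scanned.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            if name.endswith(suffixes):
                full = os.path.join(dirpath, name)
                found.add(os.path.relpath(full, root).replace(os.sep, "/"))
    return found


def git_tracked(root, suffixes=(".py",)):
    """Return the files Git tracks under root, or None outside a checkout."""
    try:
        out = subprocess.run(
            ["git", "ls-files"], cwd=root, check=True,
            capture_output=True, text=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        # No git binary, or not a repository (e.g. an unpacked tarball).
        return None
    return {line for line in out.splitlines() if line.endswith(suffixes)}


def compare(listed, tracked):
    """Return (files on disk but not in Git, files in Git but not on disk)."""
    return sorted(listed - tracked), sorted(tracked - listed)
```

A caller would always use `list_sources()` as the input for POT generation, and treat a non-`None` result from `git_tracked()` purely as a cross-check that prints warnings, so that the build never depends on Git being present.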
Re: [Freeipa-devel] i18n infrastructure improvements
On 01/11/2013 10:04 AM, Petr Viktorin wrote:

Hello list,

This discussion was started in private; I'll continue it here.

On 01/10/2013 05:41 PM, John Dennis wrote:
On 01/10/2013 04:27 AM, Petr Viktorin wrote:
On 01/09/2013 03:55 PM, John Dennis wrote:

And I could work on improving the i18n/translations infrastructure, starting by writing up an RFE+design.

Could you elaborate as to what you perceive as the current problems and what this work would address?

Here are my notes:

- Use fake translations for tests

We already do (but perhaps not sufficiently).

I mean use it in *all* tests, to ensure all the right things are translated and weird characters are handled well. See https://www.redhat.com/archives/freeipa-devel/2012-October/msg00278.html

Ah yes, I like the idea of a test domain for strings, this is a good idea. Not only would it exercise our i18n code more but it could insulate the tests from string changes (the test would look for a canonical string in the test domain).

- Split up huge strings so the entire text doesn't have to be retranslated each time something changes/is added

Good idea. But one question I have is should we be optimizing for our programmers' time or the translators' time? The Transifex tool should make available to translators similar existing translations (in fact it might, I seem to recall some functionality in this area). Wouldn't it be better to address this issue in Transifex where all projects would benefit?

Also the exact same functionality is needed to support release versions. The strings between releases are often close but not identical. The Transifex tool should make available a close match from a previous version to the translator working on a new version (or vice versa). See your issue below concerning versions.

IMHO this is a Transifex issue which needs to be solved there, not something we should be investing precious IPA programmers' time on. Plus if it's solved in Transifex it's a *huge* win for *everyone*, not just IPA.

Huh? Splitting the strings provides additional information (paragraph/context boundaries) that Transifex can't get otherwise. From what I hear it's a pretty standard technique when working with gettext.

I'm not sure how splitting text into smaller units gives more context but I can see the argument for each msgid being a logical paragraph. We don't have too many multi-paragraph strings now so it shouldn't be too involved.

For typos, gettext has the fuzzy functionality that we explicitly turn off. I think we're on our own here.

Be very afraid of turning on fuzzy matching. Before we moved to TX we used the entire GNU tool chain. I discovered a number of our PO files were horribly corrupted. With a lot of work I traced this down to fuzzy matches. If memory serves me right here is what happened. When a msgstr was absent a fuzzy match was performed and inserted as a candidate msgstr. Somehow the fuzzy candidates got accepted as actual msgstrs. I'm not sure if we ever figured out how this happened. The two most likely explanations were 1) a known bug in TX that stripped the fuzzy flag off the msgstr or 2) a translator who blindly accepted all TX suggestions. (A suggestion in TX comes from a fuzzy match.)

But the real problem is that the fuzzy matching is horribly bad. Most of the fuzzy suggestions (primarily on short strings) were wildly incorrect. I had to go back to a number of PO files and manually locate all fuzzy suggestions that had been promoted to legitimate msgstrs. A tedious process I hope to never repeat.

BTW, if memory serves me correctly the fuzzy suggestions got into the PO files in the first place because we were running the full GNU tool chain (sorry, off the top of my head I don't recall exactly which component inserts the fuzzy suggestion), but I think we've since turned that off, for a very good reason.

- Keep a history/repo of the translations, since Transifex only stores the latest version

We already do keep a history, it's in git.

It's not updated often enough. If I mess something up before a release and Transifex gets wiped, or if a rogue translator deletes some translations, the work is gone.

Yes, updating more frequently is an excellent goal.

- Update the source strings on Transifex more often (ideally as soon as patches are pushed)

Yes, great idea, this would be really useful and is necessary.

- Break Git dependencies: make it possible to generate the POT in an unpacked tarball

Are you talking about the fact that our scripts invoke git to determine what files to process? If so then yes, this would be a good dependency to get rid of. However it does mean we somehow have to maintain a manifest list of some sort somewhere.

A directory listing is fine IMO. We use it for more critical things, like loading plugins, without any trouble. Also, when run in a Git repo the Makefile can compare the file list with what Git says and warn accordingly.

How do you
Re: [Freeipa-devel] i18n infrastructure improvements
2013/1/11 John Dennis jden...@redhat.com

On 01/11/2013 10:04 AM, Petr Viktorin wrote:

Hello list,

This discussion was started in private; I'll continue it here.

On 01/10/2013 05:41 PM, John Dennis wrote:
On 01/10/2013 04:27 AM, Petr Viktorin wrote:
On 01/09/2013 03:55 PM, John Dennis wrote:

And I could work on improving the i18n/translations infrastructure, starting by writing up an RFE+design.

Could you elaborate as to what you perceive as the current problems and what this work would address?

Here are my notes:

- Use fake translations for tests

We already do (but perhaps not sufficiently).

I mean use it in *all* tests, to ensure all the right things are translated and weird characters are handled well. See https://www.redhat.com/archives/freeipa-devel/2012-October/msg00278.html

Ah yes, I like the idea of a test domain for strings, this is a good idea. Not only would it exercise our i18n code more but it could insulate the tests from string changes (the test would look for a canonical string in the test domain).

FWIW, KDE also uses an empty .po (i.e. one with empty translated messages) to more easily spot strings not marked for translation.

- Split up huge strings so the entire text doesn't have to be retranslated each time something changes/is added

Good idea. But one question I have is should we be optimizing for our programmers' time or the translators' time? The Transifex tool should make available to translators similar existing translations (in fact it might, I seem to recall some functionality in this area). Wouldn't it be better to address this issue in Transifex where all projects would benefit?

Also the exact same functionality is needed to support release versions. The strings between releases are often close but not identical. The Transifex tool should make available a close match from a previous version to the translator working on a new version (or vice versa). See your issue below concerning versions.

IMHO this is a Transifex issue which needs to be solved there, not something we should be investing precious IPA programmers' time on. Plus if it's solved in Transifex it's a *huge* win for *everyone*, not just IPA.

Huh? Splitting the strings provides additional information (paragraph/context boundaries) that Transifex can't get otherwise. From what I hear it's a pretty standard technique when working with gettext.

I'm not sure how splitting text into smaller units gives more context but I can see the argument for each msgid being a logical paragraph. We don't have too many multi-paragraph strings now so it shouldn't be too involved.

One issue also discussed on this list is the problem of 100+ line strings in man pages generated from __doc__ tags in scripts. Those are a _real_ pain for translators to maintain when only one line is changed. I haven't had the time yet to explore splitting those strings; I need to take some time to do so.

For typos, gettext has the fuzzy functionality that we explicitly turn off. I think we're on our own here.

Be very afraid of turning on fuzzy matching. Before we moved to TX we used the entire GNU tool chain. I discovered a number of our PO files were horribly corrupted. With a lot of work I traced this down to fuzzy matches. If memory serves me right here is what happened. When a msgstr was absent a fuzzy match was performed and inserted as a candidate msgstr. Somehow the fuzzy candidates got accepted as actual msgstrs. I'm not sure if we ever figured out how this happened. The two most likely explanations were 1) a known bug in TX that stripped the fuzzy flag off the msgstr or 2) a translator who blindly accepted all TX suggestions. (A suggestion in TX comes from a fuzzy match.)

But the real problem is that the fuzzy matching is horribly bad. Most of the fuzzy suggestions (primarily on short strings) were wildly incorrect. I had to go back to a number of PO files and manually locate all fuzzy suggestions that had been promoted to legitimate msgstrs. A tedious process I hope to never repeat.

BTW, if memory serves me correctly the fuzzy suggestions got into the PO files in the first place because we were running the full GNU tool chain (sorry, off the top of my head I don't recall exactly which component inserts the fuzzy suggestion), but I think we've since turned that off, for a very good reason.

- Keep a history/repo of the translations, since Transifex only stores the latest version

We already do keep a history, it's in git.

It's not updated often enough. If I mess something up before a release and Transifex gets wiped, or if a rogue translator deletes some translations, the work is gone.

Yes, updating more frequently is an excellent goal.

Yes, please! Having nothing to translate for months on Transifex is not fun. Having a mass of new strings to translate every once
Re: [Freeipa-devel] i18n infrastructure improvements
On 01/11/2013 02:44 PM, Jérôme Fenal wrote:
2013/1/11 John Dennis jden...@redhat.com

Thank you Jérôme for your insights as a translator. We have a lopsided perspective, mostly from the developer point of view. We need to better understand the translator's perspective.

I'm not sure how splitting text into smaller units gives more context but I can see the argument for each msgid being a logical paragraph. We don't have too many multi-paragraph strings now so it shouldn't be too involved.

One issue also discussed on this list is the problem of 100+ line strings in man pages generated from __doc__ tags in scripts. Those are a _real_ pain for translators to maintain when only one line is changed.

I still think TX should attempt to match the msgid from a previous pot with an updated pot and show the *word* differences between the strings along with an edit window for the original translation. That would be so useful to translators I can't believe TX does not have that feature. All you would have to do is make a few trivial edits in the translation and save it. But heck, I'm not a translator and I haven't used the translator's part of the TX tool much other than to explore how it works (and that was a while ago).

I'd see a few remarks here:

- this massive .po file would grow wildly, especially when a typo is corrected in huge strings (__doc__), when additional sentences are added to those, etc.
- breaking down bigger strings into smaller ones will certainly help here in avoiding duplicated content,
- in Transifex, it is easy to upload a .po onto another branch, and only untranslated matching strings would be updated. I used it on anaconda, where there are multiple branches between Fedora, RHEL5/6 and master; that worked easily without breaking anything.

When you say it is easy to upload a .po onto another branch I assume you don't mean a branch (TX has no such concept) but rather another TX resource. Anyway this is good to know, perhaps the way TX handles versions is not half as bad as it would appear.

-- 
John Dennis jden...@redhat.com

Looking to carve out IT costs?
www.redhat.com/carveoutcosts/
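The word-level comparison between an old and a new msgid that John describes above can be illustrated with Python's standard `difflib` module. This is only a sketch of the idea under simple assumptions (words are whitespace-separated, and the helper name `word_diff` is made up for this example); it says nothing about how Transifex works internally.

```python
import difflib


def word_diff(old_msgid, new_msgid):
    """Return (tag, old_words, new_words) for each run of changed words."""
    old_words = old_msgid.split()
    new_words = new_msgid.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only the insert/delete/replace runs; "equal" runs are unchanged.
        if tag != "equal":
            changes.append((tag, old_words[i1:i2], new_words[j1:j2]))
    return changes


if __name__ == "__main__":
    old = "Add a new user to the group"
    new = "Add a new host to the group"
    for tag, before, after in word_diff(old, new):
        print(tag, " ".join(before), "->", " ".join(after))
```

With a diff like this shown next to the previous translation, a translator would only need to touch the words that actually changed in the source string.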
Re: [Freeipa-devel] i18n infrastructure improvements
2013/1/11 John Dennis jden...@redhat.com

On 01/11/2013 02:44 PM, Jérôme Fenal wrote:
2013/1/11 John Dennis jden...@redhat.com

Thank you Jérôme for your insights as a translator. We have a lopsided perspective, mostly from the developer point of view. We need to better understand the translator's perspective.

You're welcome. I'm not an expert at Transifex though. I've yet to schedule a lunch with Kevin Raymond (he works a few km away from the French Red Hat office), who is coordinating the whole Fedora translation effort, but customers first, yada yada... :)

I'm not sure how splitting text into smaller units gives more context but I can see the argument for each msgid being a logical paragraph. We don't have too many multi-paragraph strings now so it shouldn't be too involved.

One issue also discussed on this list is the problem of 100+ line strings in man pages generated from __doc__ tags in scripts. Those are a _real_ pain for translators to maintain when only one line is changed.

I still think TX should attempt to match the msgid from a previous pot with an updated pot and show the *word* differences between the strings along with an edit window for the original translation. That would be so useful to translators I can't believe TX does not have that feature. All you would have to do is make a few trivial edits in the translation and save it.

I agree with you. But the Transifex developers seem to be overloaded at the moment. I can check with Kevin (and internally) whether Zanata would provide a better home to host the translation effort.

But heck, I'm not a translator and I haven't used the translator's part of the TX tool much other than to explore how it works (and that was a while ago).

I can understand that... :) Hopefully, the IPA dev team is multilingual ;)

I'd see a few remarks here:

- this massive .po file would grow wildly, especially when a typo is corrected in huge strings (__doc__), when additional sentences are added to those, etc.
- breaking down bigger strings into smaller ones will certainly help here in avoiding duplicated content,
- in Transifex, it is easy to upload a .po onto another branch, and only untranslated matching strings would be updated. I used it on anaconda, where there are multiple branches between Fedora, RHEL5/6 and master; that worked easily without breaking anything.

When you say it is easy to upload a .po onto another branch I assume you don't mean a branch (TX has no such concept) but rather another TX resource. Anyway this is good to know, perhaps the way TX handles versions is not half as bad as it would appear.

You're right. See how anaconda is organized, for instance: https://fedora.transifex.com/projects/p/fedora/language/en/?project=2059

Regards,

J.
-- 
Jérôme Fenal
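The point made in this thread about breaking big `__doc__` strings into smaller ones can be made concrete with a small sketch. It assumes that paragraphs are separated by blank lines and that each paragraph would become its own msgid; the helper names `split_paragraphs` and `changed_msgids` are hypothetical, and joining wrapped lines with a single space is a simplification that would not suit pre-formatted example blocks.

```python
def split_paragraphs(doc):
    """Split a long help text into paragraphs, one candidate msgid each."""
    paragraphs = []
    current = []
    for line in doc.strip().splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            # A blank line closes the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def changed_msgids(old_doc, new_doc):
    """Return the paragraphs of new_doc that have no exact match in old_doc."""
    known = set(split_paragraphs(old_doc))
    return [p for p in split_paragraphs(new_doc) if p not in known]
```

When a typo is fixed in one paragraph of a long docstring, `changed_msgids()` returns only that paragraph, which is the amount of text a translator would have to revisit instead of the whole string.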
Re: [Freeipa-devel] i18n infrastructure improvements
On 01/11/2013 04:00 PM, Jérôme Fenal wrote:

When you say it is easy to upload a .po onto another branch I assume you don't mean a branch (TX has no such concept) but rather another TX resource. Anyway this is good to know, perhaps the way TX handles versions is not half as bad as it would appear.

You're right. See how anaconda is organized, for instance: https://fedora.transifex.com/projects/p/fedora/language/en/?project=2059

We follow the same model as anaconda, a new TX resource per version.

-- 
John Dennis jden...@redhat.com
Re: [Freeipa-devel] i18n infrastructure improvements
2013/1/11 John Dennis jden...@redhat.com

On 01/11/2013 04:00 PM, Jérôme Fenal wrote:

When you say it is easy to upload a .po onto another branch I assume you don't mean a branch (TX has no such concept) but rather another TX resource. Anyway this is good to know, perhaps the way TX handles versions is not half as bad as it would appear.

You're right. See how anaconda is organized, for instance: https://fedora.transifex.com/projects/p/fedora/language/en/?project=2059

We follow the same model as anaconda, a new TX resource per version.

Yup. Minus the frequent updates on the master/head resource ipa (and there is no IPA 3.x resource, but that is not a problem for IPA, given its fast pace and the lack of long maintenance on older branches).

-- 
Jérôme Fenal