Re: [native-lang] Status update season!
Hi, Thanks for the information. I suppose my overall question is, Can we use this dictionary with OpenOffice.org? yes, but you can't (yet) distribute it together. This is the purpose of this issue. Ask on [EMAIL PROTECTED] how to add your dictionary into DicOOo... -- Pavel Janík [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[native-lang] ping
Ping, sorry I have some email problems... -- Charles-H.Schulz.
Re: [native-lang] Very Easy QA How To
Dear all, do you think we can consider the Easy How To page on the wiki final? If so, Maho would like to move it and link to it from the QA project, and I'd like to make an ODT document out of it so that we can hand it quickly to newcomers. Thanks, Charles.
Re: [native-lang] Status update season!
Andrea, Andrea Pescetti wrote: On 12/12/2006 Charles-H.Schulz wrote: ... a status update on your project would be of course very nice! Here you are. This is a status update for the Italian Native-Lang project (PLIO) in the period September-December 2006. Thank you very much for this detailed update. My congratulations and my personal wishes for Christmas and the New Year go to the Italian Native-Language community. Congratulations to both of you, Davide and Andrea! Regards and thank you, Charles.
Re: [native-lang] Status update season!
Thanks, Pavel. I have now subscribed to my 15th OpenOffice.org mailing list. Great - it will help you to get faster answers from people actually responsible for their part of work. /me sometimes thinks that this list is a bit more like [EMAIL PROTECTED] or at least that some people want it to be as such ;-) -- Pavel Janík [EMAIL PROTECTED]
[native-lang] Dzongkha as the part of the NLC project
Hi, I am coordinating the localization work for our Government and am presently the head of the Research Unit at the Department of Information Technology, Bhutan. Our team has been working on localizing OpenOffice.org for the past two years, and we have completed most of the localization work for our language, Dzongkha (dz). In this regard we would like to submit a proposal to be part of the Native Language project. Please find attached a document describing our team and what we have been doing for the past two years. Looking forward to hearing from you... Many thanks, Pema Geyleg DIT, MoIC Bhutan NLCproposal.odt Description: application/vnd.oasis.opendocument.text
[native-lang] Re: [lingu-dev] Spell checking metrics; was:[native-lang] Status update season!
eleonora46 wrote: If both the above are true, then the spell checker did a really good work. Did you try to compute these numbers for your own German dictionary, and compare it to the other German dictionaries from Björn Jacke or Franz Michael Baumann? German is one of the few languages where more than one free dictionary is available, so it could be a good test case. Since you continue to work in parallel, I guess each of you is convinced that you do a better job than the others? How do you measure or compare this? German is a good test case also for another reason: Many people in Europe (such as me) know it as their 3rd language, after their native language and English. The recognition of obscure words is more the area of grammar checkers, they should mark obscure words being similar to often used, misspelled words. This note on obscure words connects to what Kevin wrote: cases, like the obscure word "yor" in English, should clearly not be included since they are most likely to be a misspelling of a common word. It seems we would need statistics on how common "yor" (or should that be "yore"?) is in its right use and how common it is as a misspelling of "your" (or "you're"). It is easy enough to find statistics on word frequencies, but how or where can we find stats on errors? A simple Google search finds 2.59 billion "your" and 4.17 million "yore", but I cannot tell which of the "yore" occurrences are errors. There are also 4.37 million (!) hits for "yor" but they seem to be a film title, a surname, various company names and the ISO language code for Yoruba. The first obvious error usage I find is "all yor base r blong 2 us", which is apparently stylistic and not a mistake. One idea for finding stats on errors is to compare changes made to Wikipedia articles. The complete text revision history is available from download.wikimedia.org. All you need is to step through the changes and make statistics for all the small changes such as "yor" being changed to "your". Has anybody done this?
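The Wikipedia idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested tool: it assumes the revisions of one article have already been extracted from the dump as plain strings in chronological order, and the function names (`word_substitutions`, `looks_like_typo_fix`, `correction_counts`) and the 0.6 similarity threshold are made up for this example. It uses Python's standard `difflib` to find places where exactly one word was replaced by one similar-looking word.

```python
import difflib
import re
from collections import Counter

# Letters only, with an optional apostrophe part ("you're"). Hypothetical tokenizer.
WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")


def word_substitutions(old_text, new_text):
    """Yield (old_word, new_word) wherever one word was replaced by one word."""
    old = WORD.findall(old_text)
    new = WORD.findall(new_text)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and i2 - i1 == 1 and j2 - j1 == 1:
            yield old[i1], new[j1]


def looks_like_typo_fix(old_word, new_word, threshold=0.6):
    """Crude filter: keep only replacements where the two words look alike."""
    a, b = old_word.lower(), new_word.lower()
    if a == b:
        return False  # only a change of capitalization
    return difflib.SequenceMatcher(a=a, b=b).ratio() >= threshold


def correction_counts(revisions):
    """Count (misspelling, correction) pairs over consecutive revisions."""
    counts = Counter()
    for old_text, new_text in zip(revisions, revisions[1:]):
        for old_word, new_word in word_substitutions(old_text, new_text):
            if looks_like_typo_fix(old_word, new_word):
                counts[(old_word.lower(), new_word.lower())] += 1
    return counts


if __name__ == "__main__":
    history = ["I like yor hat", "I like your hat", "I like your red hat"]
    for pair, n in correction_counts(history).most_common():
        print(pair, n)
```

A similarity filter like this cannot tell a spelling fix from a deliberate change of wording between two similar words, so the counts would still need manual review.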
Another idea is to make OpenOffice.org report all corrections made by users worldwide to some centralized database. I guess this would conflict with users' interest in their own privacy. -- Lars Aronsson ([EMAIL PROTECTED]) Aronsson Datateknik - http://aronsson.se
Re: [native-lang] Status update season!
Christian Lohmaier wrote: This could also mean that these are just dumb wordlists that don't make use of affix transformations. Not really suitable for comparing quality then, even when the languages are closely related. This is not the case, though. Swedish, Danish and Norwegian are closely related and have the same language structure. An expanded wordlist is 5...6 times longer than a well-compressed one using the right ispell flags. That factor is smaller for English and German and a lot larger for Finnish and Hungarian. The current German dictionary maintained by Björn Jacke has 80,000 basic forms which expand to 300,000 variations, for a factor of 3.75. Swedish/Danish/Norwegian have the same way to form basic words (with compounds) as German. Basic words can often be translated syllable by syllable, so the number of basic forms should be about the same. But the Scandinavian languages use endings instead of the definite article (the/der/die/das), resulting in a larger number of expanded variations. The current da_DK.dic has 108,400 basic forms and expands to 380,199 variations. The two versions of Norwegian have 133,242 (nb_NO) and 102,578 (nn_NO) basic forms, respectively, and expand to 556,600 and 295,306 variations. However, the currently used Swedish dictionary (which is from 2003, but almost unchanged since 1997) has 24,489 basic forms and expands to 118,270 variations. This is clearly inferior. Of course, if the Swedish dictionary contained 24,000 relevant words and the other languages had many highly specialized words which are only rarely used, we'd still stand a chance. However, this is not the case either. Fortunately, my friend who maintains the Swedish dictionary has recently published a new version (DSSO 1.22) that expands to 242,611 variations, so he's making great progress. I hope this will be included in future versions of OpenOffice.org. We're catching up on the Danes and Norwegians, but they are still ahead. 
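The expansion factors behind these figures can be worked out with a few lines of Python. This is just the arithmetic on the numbers quoted in the message; the dictionary labels for German and Swedish are informal names chosen for this example, and DSSO 1.22 is left out because its number of basic forms is not given.

```python
# (basic forms, expanded variations), as quoted in the message above.
DICTIONARIES = {
    "German (Jacke)": (80_000, 300_000),
    "da_DK": (108_400, 380_199),
    "nb_NO": (133_242, 556_600),
    "nn_NO": (102_578, 295_306),
    "Swedish (2003)": (24_489, 118_270),
}


def expansion_factor(basic_forms, expanded_forms):
    """How many expanded variations one basic form yields on average."""
    return expanded_forms / basic_forms


if __name__ == "__main__":
    for name, (basic, expanded) in DICTIONARIES.items():
        factor = expansion_factor(basic, expanded)
        print(f"{name:15s} {basic:>8,d} -> {expanded:>8,d}  factor {factor:.2f}")
```

For the German dictionary this reproduces the factor of 3.75 mentioned above. Comparing factors alone says nothing about whether the added variations are relevant words, which is the point taken up in the rest of the message.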
Yesterday I found this paper by two Hungarian authors, who discuss Zipf's law and the minimum number of words in a dictionary required to cover some percentage of a given corpus of text, http://www.nslij-genetics.org/wli/zipf/nemeth02.pdf Their most important observation is that a decent spelling dictionary needs to contain 20,000 words (variations) for English and 80,000 words for German, but 400,000 for Hungarian. The right number for Scandinavian languages should thus be somewhere between 80,000 and 400,000. However, that is only counting the most frequent words from a language. When I add home to an ispell/hunspell dictionary, I also add homes and homely because of how the flags work, even though homely isn't necessarily among the very common and relevant words. So I add a lot of less relevant words, which don't contribute much to the dictionary's usefulness. When I add one basic word and thus 5..6 variations (for Swedish), perhaps I only add 2..3 useful variations. It is hard to know just how much the numbers are inflated. I don't think there is a way to measure this at all. You feel that it is good or bad, but you cannot really measure it. You can give examples, but that's about it. (IMHO) In the case of OpenOffice.org, what really matters is what people feel about Microsoft Word's spell checker. If that was really useless, we wouldn't have to bother. But now we have to bother. -- Lars Aronsson ([EMAIL PROTECTED]) Aronsson Datateknik - http://aronsson.se
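The kind of measurement discussed in that paper, the minimum number of distinct word forms needed to cover a given share of a corpus, can be illustrated with a small sketch. This is not the authors' method or code; it simply takes the most frequent forms first and counts how many are needed. The function name `words_needed_for_coverage` and the whitespace tokenization in the usage lines are assumptions made for this example.

```python
from collections import Counter


def words_needed_for_coverage(tokens, target):
    """Smallest number of distinct word forms, taken most frequent first,
    whose occurrences make up at least `target` (0 < target <= 1) of all tokens."""
    if not 0 < target <= 1:
        raise ValueError("target must be in the interval (0, 1]")
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= target * total:
            return rank
    return len(counts)  # only reached for an empty token list


if __name__ == "__main__":
    corpus = "the cat and the dog and the bird".split()
    for target in (0.5, 0.75, 1.0):
        print(target, words_needed_for_coverage(corpus, target))
```

Run over a large corpus, a count like this gives a lower bound on dictionary size for a target coverage, but as noted above, affix flags add many rarer variations on top of it.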
[native-lang] Re: [lingu-dev] Re: [native-lang] Status update season!
Hi Lars, and all, The current German dictionary maintained by Björn Jacke has 80,000 basic forms which expand to 300,000 variations, for a factor of 3.75. Swedish/Danish/Norwegian have the same way to form basic words (with compounds) as German. Basic words can often be translated syllable by syllable, so the number of basic forms should be about the same. But the Scandinavian languages use endings instead of the definite article (the/der/die/das), resulting in a larger number of expanded variations. If we're into statistics, then the Polish dictionary has something like 3.5 million expanded forms, and about 300,000 base forms. The quality of the dictionary is excellent. How was that achieved? Simple: set up a local Scrabble-like community and develop a Scrabble dictionary using the Scrabble players' linguistic competence. It's incredibly efficient. Then you simply tweak the Scrabble dictionary to your needs (like removing rare and confusing forms). I recommend this kind of technique to all l10n teams and dictionary developers. Look at www.kurnik.pl to see how the site is managed, and at www.kurnik.pl/dictionary there is some info on the dictionary. Best regards, and happy holidays, Marcin