Re: [native-lang] Status update season!

2006-12-22 Thread Pavel Janík

Hi,

 Thanks for the information. I suppose my overall question is: can
 we use this dictionary with OpenOffice.org?


Yes, but you can't (yet) distribute it together with OpenOffice.org.
That is the purpose of this issue. Ask on [EMAIL PROTECTED] how to add
your dictionary to DicOOo...

--
Pavel Janík
[EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[native-lang] ping

2006-12-22 Thread Charles-H.Schulz
Ping, sorry I have some email problems...

-- 
Charles-H.Schulz.




Re: [native-lang] Very Easy QA How To

2006-12-22 Thread Charles-H.Schulz
Dear all,

do you think we can consider the Easy How To page on the wiki final? In
that case, Maho would like to move it and link to it from the QA
project, and I'd like to make an ODT document out of it in order to hand
it quickly to the newcomers.

Thanks,

Charles.






Re: [native-lang] Status update season!

2006-12-22 Thread Charles-H.Schulz
Andrea,

Andrea Pescetti a écrit :
 On 12/12/2006 Charles-H.Schulz wrote: ...
 a status update on your project would be of course very nice!
 
 Here you are. This is a status update for the Italian Native-Lang
 project (PLIO) in the period September-December 2006.
 
Thank you very much for this detailed update. My congratulations and my
personal wishes for Christmas and the New Year go to the Italian
Native-Language community.

Congratulations to both of you, Davide and Andrea!

Regards and thank you,

Charles




Re: [native-lang] Status update season!

2006-12-22 Thread Pavel Janík
 Thanks, Pavel. I have now subscribed to my 15th OpenOffice.org
 mailing list.


Great - it will help you to get faster answers from the people actually
responsible for their part of the work.


/me sometimes thinks that this list is a bit more like  
[EMAIL PROTECTED] or at least that some people want it to be  
as such ;-)

--
Pavel Janík
[EMAIL PROTECTED]






[native-lang] Dzongkha as the part of the NLC project

2006-12-22 Thread Pema Geyleg

Hi,
 
I am coordinating the localization work for our Government 
and am presently the head of the Research Unit at the 
Department of Information Technology, Bhutan.


Our team has been working on localizing OpenOffice.org for 
the past two years, and we have completed most of the 
localization work for our language, Dzongkha (dz).


In this regard we would like to submit a proposal to be 
part of the Native Language project. Please find attached a 
document describing our team and what we have been doing 
for the past two years.


Looking forward to hearing from you...

Many Thanks
Pema Geyleg
DIT,MoIC
Bhutan



NLCproposal.odt
Description: application/vnd.oasis.opendocument.text

[native-lang] Re: [lingu-dev] Spell checking metrics; was:[native-lang] Status update season!

2006-12-22 Thread Lars Aronsson
eleonora46 wrote:

 If both of the above are true, then the spell checker 
 did a really good job.

Did you try to compute these numbers for your own German 
dictionary, and compare it to the other German dictionaries from 
Björn Jacke or Franz Michael Baumann?  German is one of few 
languages where more than one free dictionary is available, so it 
could be a good test case.  Since you continue to work in 
parallel, I guess each of you is convinced that you do a better 
job than the others?  How do you measure or compare this?

German is a good test case also for another reason: Many people in 
Europe (such as me) know it as their 3rd language, after their 
native language and English.

 The recognition of obscure words is more the area of grammar checkers;
 they should mark obscure words that are similar to often-used,
 misspelled words.

This note on obscure words connects to what Kevin wrote:

  cases, like the obscure word "yor" in English, should clearly 
  not be included since they are most likely to be a misspelling 
  of a common word.

It seems we would need statistics on how common "yor" (or should 
that be "yore"?) is in its correct use and how common it is as a 
misspelling of "your" (or "you're").  It is easy enough to find 
statistics on word frequencies, but how or where can we find stats 
on errors?  A simple Google search finds 2.59 billion hits for 
"your" and 4.17 million for "yore", but I cannot tell which of the 
"yore" occurrences are errors.  There are also 4.37 million (!) 
hits for "yor", but they seem to be a film title, a surname, 
various company names and the ISO language code for Yoruba.  The 
first obvious error usage I find is "all yor base r blong 2 us", 
which is apparently stylistic and not a mistake.

One idea for finding stats on errors is to compare changes made to 
Wikipedia articles.  The complete text revision history is 
available from download.wikimedia.org.  All you need to do is step 
through the changes and compile statistics on all the small changes, 
such as "yor" being changed to "your".  Has anybody done this?
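
A minimal sketch of the revision-comparison idea, in Python. It assumes 
the revision texts of one article have already been extracted into a 
list of strings, oldest first; reading the actual dump files is not 
shown. The function names and the sample sentences are made up for 
illustration. Only one-word-for-one-word replacements are counted, on 
the assumption that those are the most likely spelling corrections.

```python
import difflib
import re
from collections import Counter

# Words are runs of letters, optionally with one internal apostrophe ("you're").
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")


def tokenize(text):
    """Split text into lowercase words, dropping punctuation and digits."""
    return WORD_RE.findall(text.lower())


def word_substitutions(old_text, new_text):
    """Yield (old_word, new_word) for every single-word replacement."""
    old_words = tokenize(old_text)
    new_words = tokenize(new_text)
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only edits where exactly one word was swapped for one word.
        if tag == "replace" and i2 - i1 == 1 and j2 - j1 == 1:
            yield old_words[i1], new_words[j1]


def count_corrections(revisions):
    """Count single-word replacements between consecutive revisions."""
    counts = Counter()
    for old_text, new_text in zip(revisions, revisions[1:]):
        counts.update(word_substitutions(old_text, new_text))
    return counts


# Hypothetical revision history of one article, oldest first.
revisions = [
    "In days of yor the castle stood here.",
    "In days of yore the castle stood here.",
    "In days of yore the castle stood on this hill.",
]
corrections = count_corrections(revisions)
for (old_word, new_word), n in corrections.most_common():
    print(f"{old_word} -> {new_word}: {n}")
```

A count like this cannot tell a spelling correction from a deliberate 
change of wording, so the pairs would still need filtering, for example 
by edit distance.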

Another idea is to make OpenOffice.org report all corrections made 
by users worldwide to some centralized database.  I guess this 
would conflict with users' interest in their own privacy.


-- 
  Lars Aronsson ([EMAIL PROTECTED])
  Aronsson Datateknik - http://aronsson.se




Re: [native-lang] Status update season!

2006-12-22 Thread Lars Aronsson
Christian Lohmaier wrote:

 This could also mean that these are just dumb wordlists that don't
 make use of affix transformations. Not really suitable for comparing
 quality then, even when the languages are closely related.

This is not the case, though.  Swedish, Danish and Norwegian are 
closely related and have the same language structure.  An expanded 
wordlist is 5...6 times longer than a well-compressed one using 
the right ispell flags.  That factor is smaller for English and 
German and a lot larger for Finnish and Hungarian.

The current German dictionary maintained by Björn Jacke has 80,000 
basic forms which expand to 300,000 variations, for a factor of 
3.75.  Swedish/Danish/Norwegian have the same way to form basic 
words (with compounds) as German.  Basic words can often be 
translated syllable by syllable, so the number of basic forms 
should be about the same. But the Scandinavian languages use 
endings instead of the definite article (the/der/die/das), 
resulting in a larger number of expanded variations.

The current da_DK.dic has 108,400 basic forms and expands to 
380,199 variations.  The two versions of Norwegian have 133,242 
(nb_NO) and 102,578 (nn_NO) basic forms, respectively, and expand 
to 556,600 and 295,306 variations. However, the currently used 
Swedish dictionary (which is from 2003, but almost unchanged since 
1997) has 24,489 basic forms and expands to 118,270 variations.  
This is clearly inferior.
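
For comparison, the expansion factors for the counts quoted above can 
be computed with a few lines of Python. This is only arithmetic on the 
numbers given in this message; the labels "German (Jacke)" and 
"Swedish (2003)" are ad hoc names, not dictionary file names.

```python
# (basic forms, expanded variations) as quoted in the message above.
dictionaries = {
    "German (Jacke)": (80_000, 300_000),
    "da_DK": (108_400, 380_199),
    "nb_NO": (133_242, 556_600),
    "nn_NO": (102_578, 295_306),
    "Swedish (2003)": (24_489, 118_270),
}


def expansion_factor(basic_forms, variations):
    """Number of expanded variations per basic form."""
    return variations / basic_forms


factors = {
    name: expansion_factor(basic, expanded)
    for name, (basic, expanded) in dictionaries.items()
}

for name, (basic, expanded) in dictionaries.items():
    print(f"{name:14s} {basic:>8,d} {expanded:>8,d} {factors[name]:5.2f}")
```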

Of course, if the Swedish dictionary contained 24,000 relevant 
words and the other languages had many highly specialized words 
which are only rarely used, we'd still stand a chance.  However, 
this is not the case either.

Fortunately, my friend who maintains the Swedish dictionary has 
recently published a new version (DSSO 1.22) that expands to 
242,611 variations, so he's making great progress.  I hope this 
will be included in future versions of OpenOffice.org.  We're 
catching up with the Danes and Norwegians, but they are still ahead.

Yesterday I found this paper by two Hungarian authors, who discuss 
Zipf's law and the minimum number of words in a dictionary 
required to cover some percentage of a given corpus of text,
http://www.nslij-genetics.org/wli/zipf/nemeth02.pdf

Their most important observation is that a decent spelling 
dictionary needs to contain 20,000 words (variations) for English 
and 80,000 words for German, but 400,000 for Hungarian.  The right 
number for Scandinavian languages should thus be somewhere between 
80,000 and 400,000.
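
The coverage question can be illustrated with a small Python sketch: 
given a tokenized corpus, how many of the most frequent word forms are 
needed to cover a given share of the running text? This is not the 
method from the paper, only a simple illustration of the idea; the 
function name and the toy corpus are made up.

```python
from collections import Counter


def words_needed_for_coverage(tokens, target):
    """Return how many of the most frequent distinct words are needed
    so that they account for at least `target` (0..1) of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= target * total:
            return rank
    return len(counts)


# Toy corpus; a real measurement would need a large text collection.
corpus = "the cat sat on the mat and the dog sat on the log".split()
needed_half = words_needed_for_coverage(corpus, 0.5)
needed_all = words_needed_for_coverage(corpus, 1.0)
print(needed_half, needed_all)
```

On a real corpus the same function could be run for each language at 
the same target percentage, which would give comparable numbers.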

However, that is only counting the most frequent words from a 
language.  When I add "home" to an ispell/hunspell dictionary, I 
also add "homes" and "homely" because of how the flags work, even 
though "homely" isn't necessarily among the very common and 
relevant words.  So I add a lot of less relevant words, which 
don't contribute much to the dictionary's usefulness.  When I add 
one basic word and thus 5..6 variations (for Swedish), perhaps I 
only add 2..3 useful variations.  It is hard to know just how much 
the numbers are inflated.
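
The inflation effect can be shown with a toy model in Python. The 
"word/FLAGS" notation loosely mirrors ispell/hunspell dictionary 
entries, but the flag letters "S" and "Y", the rule table and the list 
of common words are all invented for this example; real affix files 
are more involved.

```python
# Hypothetical suffix flags; not taken from any real affix file.
SUFFIX_RULES = {
    "S": "s",   # home -> homes
    "Y": "ly",  # home -> homely
}


def expand(entry):
    """Expand a 'stem/FLAGS' entry into the stem plus all flagged forms."""
    stem, _, flags = entry.partition("/")
    return [stem] + [stem + SUFFIX_RULES[flag] for flag in flags]


forms = expand("home/SY")

# Assumed list of forms that are frequent in real text.
common_words = {"home", "homes"}
useful = [form for form in forms if form in common_words]

print(f"{len(forms)} forms generated, {len(useful)} of them common")
```

With a real word-frequency list in place of `common_words`, the ratio 
of useful to generated forms would give a rough estimate of how much 
the expanded counts are inflated.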

 I don't think there is a way to measure this at all. You feel that it
 is good or bad, but you cannot really measure it.
 You can give examples, but that's about it. (IMHO)

In the case of OpenOffice.org, what really matters is what people 
feel about Microsoft Word's spell checker.  If that was really 
useless, we wouldn't have to bother.  But now we have to bother.


-- 
  Lars Aronsson ([EMAIL PROTECTED])
  Aronsson Datateknik - http://aronsson.se




[native-lang] Re: [lingu-dev] Re: [native-lang] Status update season!

2006-12-22 Thread Marcin Miłkowski

Hi Lars, and all,


 The current German dictionary maintained by Björn Jacke has 80,000 
 basic forms which expand to 300,000 variations, for a factor of 
 3.75.  Swedish/Danish/Norwegian have the same way to form basic 
 words (with compounds) as German.  Basic words can often be 
 translated syllable by syllable, so the number of basic forms 
 should be about the same. But the Scandinavian languages use 
 endings instead of the definite article (the/der/die/das), 
 resulting in a larger number of expanded variations.


If we're into statistics, then the Polish dictionary has something like 
3.5 million expanded forms and about 300,000 base forms. The quality of 
the dictionary is excellent.


How was that achieved? Simple: set up a local Scrabble-like community 
and develop a Scrabble dictionary using the Scrabble players' linguistic 
competence. It's incredibly efficient.


Then you simply tweak the Scrabble dict to your needs (like removing 
rare and confusing forms).


I recommend this kind of technique to all l10n teams and dict developers. 
Look at www.kurnik.pl to see how the site is managed; at 
www.kurnik.pl/dictionary there is some info on the dict.


Best regards, and happy holidays,
Marcin
