Re: Help with unmunch and Icelandic + Galician

2015-01-22 Thread Adrián Chaves Fernández
Those are perfectly valid words; they are just rarely used. Those word terminations are valid Galician but not too frequent in written texts. Of course, I bet you would never hear "anglicizariámosllesnola", but if there is anyone out there that uses the verb "anglicizar", which I did not know

Re: Help with unmunch and Icelandic + Galician

2015-01-13 Thread R.J. Baars
By the way, don't trust Google too much. There are words that are valid, but too infrequent for Google to absorb in their indexes. For Dutch, I found lots of words in documents found using Google, containing words that will not result in Google showing the same document when searching with the word

Re: Help with unmunch and Icelandic + Galician

2015-01-13 Thread Daniel Naber
On 2015-01-08 22:18, Adrián Chaves Fernández wrote: > I uploaded it to SF: > http://sourceforge.net/projects/hunspell-gl/files/tmp/hunspell-words-clean.txt/download > [4] (uncompressed, though, where’s the fun otherwise?!) Thanks. Are all of these real words that actually occur or do many of the

Re: Help with unmunch and Icelandic + Galician

2015-01-08 Thread Adrián Chaves Fernández
I uploaded it to SF: http://sourceforge.net/projects/hunspell-gl/files/tmp/hunspell-words-clean.txt/download (uncompressed, though, where’s the fun otherwise?!) 2015-01-05 14:38 GMT+01:00 Daniel Naber : > On 2014-12-10 18:51, Adrián Chaves Fernández wrote: > > Hi Adrián, > > > You can download pr

Re: Help with unmunch and Icelandic + Galician

2015-01-05 Thread Daniel Naber
On 2014-12-10 18:51, Adrián Chaves Fernández wrote: Hi Adrián, > You can download prebuilt snapshots from > https://sourceforge.net/projects/hunspell-gl/files/instantaneas/ sorry, I only now found time to look at this again. When I run: python unmunch.py -a hunspell-gl-drag-20141115/gl_ES.af' -d

Re: Help with unmunch and Icelandic + Galician

2014-12-10 Thread Adrián Chaves Fernández
You can download prebuilt snapshots from https://sourceforge.net/projects/hunspell-gl/files/instantaneas/ You can alternatively generate the gl_ES.utf8 locale in your system: https://wiki.archlinux.org/index.php/locale#Generating_locales 2014-12-08 15:05 GMT+01:00 Daniel Naber : > On 2014-12-05 0

Re: Help with unmunch and Icelandic + Galician

2014-12-08 Thread Daniel Naber
On 2014-12-05 08:33, Adrián Chaves Fernández wrote: > The repository where the unmunch script is located is part of a > network of repositories. Your output suggests that you did not include > submodules when you cloned the repository, see > http://stackoverflow.com/questions/3796927/how-to-git-cl

Re: Help with unmunch and Icelandic + Galician

2014-12-05 Thread Miguel Solla
Hi, I have found a modified script I made when I was working on Galician hunspell (see attachment). I can't remember if it was entirely finished or not, sorry about that. Daniel, you are right about "recursivity" in the Galician affix file. You can find some documentation at http://linguamatica.com/i

Re: Help with unmunch and Icelandic + Galician

2014-12-05 Thread Marco A.G.Pinto
Hello Adrián, I didn't clone unmunch, I did my own version based on what I was told. What I came across is that when I began working on PTG in 2013, the documentation and help from mailing lists regarding Hunspell, didn't mention what I discovered recently. The examples and help I had access

Re: Help with unmunch and Icelandic + Galician

2014-12-04 Thread Adrián Chaves Fernández
The repository where the unmunch script is located is part of a network of repositories. Your output suggests that you did not include submodules when you cloned the repository, see http://stackoverflow.com/questions/3796927/how-to-git-clone-including-submodules Any feedback about the script is ap

Re: Help with unmunch and Icelandic + Galician

2014-12-01 Thread Daniel Naber
On 2014-11-15 07:06, Adrián Chaves Fernández wrote: > As I explain in that Hunspell bug report, I ended up writing a Python > script to unmunch Galician files. Could you explain how this can be used? I'm not very familiar with Python, and when I call 'python2.7 unmunch.py' I get: File "unmu

Re: Help with unmunch and Icelandic + Galician

2014-11-14 Thread R.J. Baars
Continuation flags can also be used for 'compounding' and have the same issue of possibly having an endless loop. I guess that is why Hunspell is time-limited for every lookup. Ruud > 2014-11-05 10:49 GMT+01:00 R.J. Baars : > >> There will never be a new unmunch that supports all new Hunspell >>

Re: Help with unmunch and Icelandic + Galician

2014-11-14 Thread Adrián Chaves Fernández
2014-11-05 10:49 GMT+01:00 R.J. Baars : > There will never be a new unmunch that supports all new Hunspell > functions, since the compounding (or continuation, which is much the same) > makes the list unlimited in size. > In Galician we only use compounds for number-related constructs (e.g. “1.ª”),

Re: Help with unmunch and Icelandic + Galician

2014-11-14 Thread Adrián Chaves Fernández
I found out that the unmunch.sh script, which turns out to be from the Hunspell 1.2.8 version (available in the folder for that version in SourceForge) is a bit buggy. See https://sourceforge.net/p/hunspell/bugs/147/ As I explain in that Hunspell bug report, I ended up writing a Python script to

Re: Help with unmunch and Icelandic + Galician

2014-11-05 Thread R.J. Baars
Like I said, Tatoeba is much too small. There will never be a new unmunch that supports all new Hunspell functions, since the compounding (or continuation, which is much the same) makes the list unlimited in size. Ruud > On 2014-11-04 13:29, R.J. Baars wrote: > >> I put a script generating icelan
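To make the "unlimited in size" point concrete, here is a minimal Python sketch. It assumes, purely for illustration, that any stem may be concatenated with any other stem; real dictionaries restrict this with their compounding rules, and the function names are hypothetical, not part of any unmunch tool. It only shows why an expander needs some cap on the number of parts.

```python
from itertools import product


def compounds(stems, max_parts):
    """Yield every concatenation of 1..max_parts stems (illustrative only)."""
    for n in range(1, max_parts + 1):
        for parts in product(stems, repeat=n):
            yield "".join(parts)


def compound_count(stem_count, max_parts):
    """Number of forms compounds() would yield for stem_count stems."""
    return sum(stem_count ** n for n in range(1, max_parts + 1))
```

Without the max_parts cap the generator never finishes, and even with a small cap the count grows as a power of the stem count.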

Re: Help with unmunch and Icelandic + Galician

2014-11-05 Thread Daniel Naber
On 2014-11-04 13:29, R.J. Baars wrote: > I put a script generating Icelandic and the data here: > > www.taaltik.nl/daniel/ice.zip I'm not sure if this approach is viable, at least for Icelandic. Just too many words are missing. For example, I just needed to check a single paragraph to find the

Re: Help with unmunch and Icelandic + Galician

2014-11-04 Thread R.J. Baars
I got 2.7 Mb, 229699 lines. Try to download again and give it another try. Ruud > On 2014-11-04 14:10, Adrián Chaves Fernández wrote: > >> I have not read the whole conversation, but for Galician I recently >> needed to unmunch the Hunspell files to generate a Morfologik >> dictionary, and I m

Re: Help with unmunch and Icelandic + Galician

2014-11-04 Thread Daniel Naber
On 2014-11-04 14:10, Adrián Chaves Fernández wrote: > I have not read the whole conversation, but for Galician I recently > needed to unmunch the Hunspell files to generate a Morfologik > dictionary, and I managed to do it with: > > https://github.com/eitsl/hunspell/blob/master/utils/unmunch.sh [

Re: Help with unmunch and Icelandic + Galician

2014-11-04 Thread R.J. Baars
On my system, it just gives an error: gensub not defined. > PS: I did it for the "drag" version, not the "comunidade" version. I am > assuming that unmunch.sh would work with "comunidade" as well, but I did > not try it as of today (and I'm at work right now). > > 2014-11-04 14:10 GMT+01:00 Adrián

Re: Help with unmunch and Icelandic + Galician

2014-11-04 Thread Adrián Chaves Fernández
PS: I did it for the "drag" version, not the "comunidade" version. I am assuming that unmunch.sh would work with "comunidade" as well, but I did not try it as of today (and I'm at work right now). 2014-11-04 14:10 GMT+01:00 Adrián Chaves Fernández : > I have not read the whole conversation, but f

Re: Help with unmunch and Icelandic + Galician

2014-11-04 Thread Adrián Chaves Fernández
I have not read the whole conversation, but for Galician I recently needed to unmunch the Hunspell files to generate a Morfologik dictionary, and I managed to do it with: https://github.com/eitsl/hunspell/blob/master/utils/unmunch.sh A script which I found at: https://github.com/kscanne/hunspell

Re: Help with unmunch and Icelandic + Galician

2014-11-04 Thread R.J. Baars
Daniel, I put a script generating Icelandic and the data here: www.taaltik.nl/daniel/ice.zip Read the script ice.sh to see how it works. I might give Galician a try as well. Ruud

Re: Help with unmunch and Icelandic + Galician

2014-11-04 Thread R.J. Baars
That suggestion does not work for Icelandic. > > I could upload a result, but you needed it to come from sources like > Tatoeba and Wikipedia. I have no export routines for those, and currently > no time to make them. > > Maybe in a few weeks. > Ruud > >> On 2014-11-02 11:30, R.J. Baars wrote: >> >

Re: Help with unmunch and Icelandic + Galician

2014-11-04 Thread R.J. Baars
I could upload a result, but you needed it to come from sources like Tatoeba and Wikipedia. I have no export routines for those, and currently no time to make them. Maybe in a few weeks. Ruud > On 2014-11-02 11:30, R.J. Baars wrote: > >> The most effective way to generate Icelandic is to throw a

Re: Help with unmunch and Icelandic + Galician

2014-11-04 Thread Daniel Naber
On 2014-10-31 01:42, Anton Meixome wrote: > For Galician > > you can test with the new version > > http://sourceforge.net/projects/hunspell-gl/files/instantaneas/20141025/hunspell-gl-comunidade-20141025.tar.xz/download Unfortunately, this makes unmunch (from hunspell 1.3.3) crash. I guess Ruud

Re: Help with unmunch and Icelandic + Galician

2014-11-04 Thread Daniel Naber
On 2014-11-02 11:30, R.J. Baars wrote: > The most effective way to generate Icelandic is to throw a large word > list at Hunspell, since the dictionary supports compounding. Could you upload the result somewhere? To what extent does Icelandic support compounding, other than NOSPLITSUGS I canno

Re: Help with unmunch and Icelandic + Galician

2014-11-02 Thread R.J. Baars
Daniel, The most effective way to generate Icelandic is to throw a large word list at Hunspell, since the dictionary supports compounding. Just applying the bag of tricks results in 0.8 MB of words, using a large word list 2.8 MB. Quite a difference. Ruud

Re: Help with unmunch and Icelandic + Galician

2014-10-30 Thread Anton Meixome
For Galician you can test with the new version http://sourceforge.net/projects/hunspell-gl/files/instantaneas/20141025/hunspell-gl-comunidade-20141025.tar.xz/download Also we have lexical lists from a variety of sources (Wikipedia included) http://sourceforge.net/projects/hunspell-gl/files/inst

Re: Help with unmunch and Icelandic + Galician

2014-10-30 Thread R.J. Baars
Yes, it is much faster without suggestions. It is faster to use a large corpus. Tatoeba and Wikipedia are not very big however. But it is a way to do it. Feel free to; we could compare the results later. All in all, it would be better if Icelandic were maintained. Why is it not? Is the rules par

Re: Help with unmunch and Icelandic + Galician

2014-10-30 Thread Daniel Naber
On 2014-10-30 15:08, R.J. Baars wrote: > My bag of tricks is still running. So there might still be a good result > after some time. I estimate it to take another week. Do the suggestions really help that much? Don't we get the same result if we have a large list of words, e.g. the complete conte

Re: Help with unmunch and Icelandic + Galician

2014-10-30 Thread R.J. Baars
Daniel, My bag of tricks is still running. So there might still be a good result after some time. I estimate it to take another week. I noticed Icelandic seems to be a compounding language, at least parts of it. The word list is not at all encoded like that. I am tempted to rearrange the spellch

Re: Help with unmunch and Icelandic + Galician

2014-10-28 Thread R.J. Baars
Yes, and FLAG num means any valid number. FLAG long makes it possible to use longer (string) flags. Ruud > On 2014-10-28 12:58, Marco A.G.Pinto wrote: > >> I believe that if I change the code of Proofing Tool GUI to have >> numbers with more than one character I would break other dictionaries >> :

Re: Help with unmunch and Icelandic + Galician

2014-10-28 Thread R.J. Baars
The affix flag length is determined by the Hunspell FLAG clause. There is number, char, and string (long). Myspell only knew about char, I think. > Dear Ruud and Daniel, > > I believe I have a clue: > Usually suffixes and prefixes only have one character for the rule. > > But this .AFF has more charact
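As a rough illustration of the FLAG modes mentioned above, the following Python sketch splits the flag field of a .dic entry. It assumes the usual Hunspell conventions as I understand them: one character per flag by default, two characters per flag with "FLAG long", and comma-separated decimal numbers with "FLAG num". The function names and the sample entry "hestur/12,34" are hypothetical, and escaped slashes and morphological fields are ignored.

```python
def split_flags(flag_field, mode="char"):
    """Split the flag field of a .dic entry according to the FLAG mode."""
    if mode == "num":
        # FLAG num: decimal numbers separated by commas, e.g. "12,34"
        return [f for f in flag_field.split(",") if f]
    if mode == "long":
        # FLAG long: every flag is exactly two characters
        return [flag_field[i:i + 2] for i in range(0, len(flag_field), 2)]
    # default mode: every single character is a flag of its own
    return list(flag_field)


def parse_dic_entry(line, mode="char"):
    """Return (word, flags) for a simple line such as 'hestur/12,34'."""
    word, _, flag_field = line.strip().partition("/")
    return word, split_flags(flag_field, mode)
```

Reading a "FLAG num" dictionary in the default mode turns "12,34" into five one-character flags, which would explain a tool that only expects single characters going wrong on this .aff.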

Re: Help with unmunch and Icelandic + Galician

2014-10-28 Thread Daniel Naber
On 2014-10-28 12:58, Marco A.G.Pinto wrote: > I believe that if I change the code of Proofing Tool GUI to have > numbers with more than one character I would break other dictionaries > :'( I think this number mode gets turned on by "FLAG num" in is_IS.aff. Regards Daniel ---

Re: Help with unmunch and Icelandic + Galician

2014-10-28 Thread Marco A.G.Pinto
Dear Ruud and Daniel, I believe I have a clue: Usually suffixes and prefixes only have one character for the rule. But this .AFF has more characters and this is the problem. If you take a look at the English AFFs they too have letters and numbers, but only one character long. See the en_GB wo

Re: Help with unmunch and Icelandic + Galician

2014-10-28 Thread R.J. Baars
I edited the .aff so that it at least no longer crashes. Looks like it has been edited with an editor inserting tabs wherever. Since tab is a special char to Hunspell, it causes the dump when unmunching. The new aff does not dump, but still adds / to words. Looks like unmunch is not able to proce

Re: Help with unmunch and Icelandic + Galician

2014-10-28 Thread R.J. Baars
Some of the rules in the Icelandic affix file are wrong. There are lots of lines like: SFX 1 ur 0 , ending in a 0, causing a 0 to be added to the word by unmunch or Marco's tool. Hunspell furthermore accepts words containing a number without any check by default. So I added the numbers to the word
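The stray 0 can be reproduced with a small Python sketch. It assumes, as far as I understand the Hunspell affix format, that a 0 in the strip or affix field of an SFX line stands for an empty string. The function name and the sample word "hestur" are hypothetical, and the optional condition field is ignored.

```python
def apply_suffix(word, strip, add):
    """Apply one SFX rule to word; '0' means 'nothing' in both fields."""
    strip = "" if strip == "0" else strip
    add = "" if add == "0" else add
    if strip and not word.endswith(strip):
        return None  # the rule does not apply to this word
    stem = word[:len(word) - len(strip)] if strip else word
    return stem + add
```

For a rule like "SFX 1 ur 0", apply_suffix("hestur", "ur", "0") gives "hest", whereas a tool that appends the affix field literally would produce "hest0".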

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread Marco A.G.Pinto
Hello! The "Tags" is the extra information which I only found in the pt_PT dictionary, which tells if each word is masculine, feminine, singular, plural, etc. As for the rules, they are here: For each wrong result you get in the extracted list, you can check here if it is a rule or a tool

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
The first thing I notice is that flags and word are not separated on the screen. I added a picture to show that. When I click edit, it is the same. The / is apparently not seen as a flag indicator in the dictionary. In the dic, you can find flags after the / , comments after # and extra data afte

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread Marco A.G.Pinto
Dear Ruud, To see if it is a tool bug or a rule bug, just edit the word(s) in the "Dictionary" tab of my tool and it will show a tab containing each rule that generates the derivatives. You can edit the words with a double-click or with right-click+EDIT. :-P I am feeling so eager! PS->You mu

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
The output of the tool also contains unmunch errors. Ab0 as the derivative of Abel e.g. After exporting and processing into a word list, out of the 2.7 Mb, 2.3 Mb was accepted as a correct word by the same spellchecker. So the 'bag of tricks' might still be useful after unmunching using this tool,

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
The tool seems to work. I will check if it is better than the bag of tricks. Looks very promising. Requires further processing though. Ruud > You have to use V3.0 build 64. From the menu "Dictionary Tools", choose > "Extract wordlist". It worked for me. > > Am 27.10.2014 16:38, schrieb Daniel N

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread Jan Schreiber
You have to use V3.0 build 64. From the menu "Dictionary Tools", choose "Extract wordlist". It worked for me. Am 27.10.2014 16:38, schrieb Daniel Naber: > On 2014-10-27 13:48, Marco A.G.Pinto wrote: > >> To unmunch .DIC + .AFF use my tool, Proofing Tool GUI: >> http://marcoagpinto.cidadevirtual

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
Below is the full bag of tricks:
#!/bin/bash
# set the language id (name of hunspell dic without extension)
if [ ! $1 ] ; then
  echo "ENTER THE NAME OF THE DICTIONARY FILE WITHOUT .DIC AS A PARAMETER"
else
  if [ -f $1.dic ] ; then
    if [ -f $1.aff ] ; then
      LANG=$1
      # try to unmunch

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
Apart from the trick I am applying now, a good option for more valid output could be to use the words from Wikipedia and Tatoeba as an extra input. If the language is in those databases. Galician grew to > 3 Mb fast enough when Spanish and Portuguese were used as input. These could also be found i

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread Daniel Naber
On 2014-10-27 13:48, Marco A.G.Pinto wrote: > To unmunch .DIC + .AFF use my tool, Proofing Tool GUI: > http://marcoagpinto.cidadevirtual.pt/proofingtoolgui.html [3] How does it work? I couldn't find an "unmunch" menu item or similar. What does it do differently to unmunch command line program?

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread Marco A.G.Pinto
Daniels and friends, To unmunch .DIC + .AFF use my tool, Proofing Tool GUI: http://marcoagpinto.cidadevirtual.pt/proofingtoolgui.html But please notice that the files must be in UTF-8 and not obfuscated. Thanks! Kind regards, >Marco A.G.Pinto -- On 27/10/2014

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
If you don't want the words from my own list added, I will leave them out. No issue. But it will mean, since the source is not unmunchable, you might be missing quite common Icelandic words, because the other tricks did not generate them. But it is already running, without other input than the hun

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread Daniel Naber
On 2014-10-27 11:37, R.J. Baars wrote: > That is what these tricks do. There is no word added that is not > accepted > by the spellchecker. I understand that, I'd also like to understand where 'virkar' and 'texta' come from: from your unmunch output or from the step you call "Then I added my ow

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
Anyway, the words you wanted checked were in the dictionary before unmunch. Ruud > There is no list to go from, so how should I know? If there was such a > list, there was no need to use unmunch, right? > > Doing an unmunch, you add lots of words to the dictionary, being all > derivatives. That

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
There is no list to go from, so how should I know? If there was such a list, there was no need to use unmunch, right? Doing an unmunch, you add lots of words to the dictionary, being all derivatives. That is what makes them that big. When the source list is not there, the only thing you can do is

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread Daniel Naber
On 2014-10-27 10:53, R.J. Baars wrote: > I first changed it into utf-8; > I removed the po: flags > I changed the tab chars into spaces > Then I unmunched. > I used sed to remove the trailing flags, which are created, as well as > trailing numbers > Then I added my own collection of Icelandic word
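The clean-up steps quoted above can be sketched in Python roughly as follows. This is not the sed command that was actually used (it is not shown in the thread); the regular expressions, function names and sample entries are assumptions, and the conversion to UTF-8 is left out because the original encoding is not stated.

```python
import re


def clean_dic_line(line):
    """Prepare one .dic line for unmunch: tabs to spaces, drop po: fields."""
    line = line.replace("\t", " ")
    line = re.sub(r"\s+po:\S+", "", line)
    return line.rstrip()


def strip_unmunch_residue(word):
    """Remove a trailing /flags part and trailing digits from unmunch output."""
    word = word.split("/", 1)[0]
    return re.sub(r"\d+$", "", word)
```

clean_dic_line would be applied before unmunching and strip_unmunch_residue to every line of the unmunch output. Note that stripping trailing digits also damages any legitimate entry that ends in a number.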

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
Galician will be doable as well. It accepts a lot of Spanish and Portuguese words (already 4Mb). Add the suggestions to it, and it will be a workable list. My computer will be doing that for the next few days (suggestion is slow). By the way, would it not be a good idea to have the full dictionari

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
> On 2014-10-27 10:26, R.J. Baars wrote: > >> I was able to make a file though. It is 3 Mb uncompressed. >> >> You can download it from dev.taaltik.nl/is.okay.zip > > Thanks, what was the exact command you used to create this list? Multiple. And manual editing. I first changed it into utf-8; I re

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread Daniel Naber
On 2014-10-27 10:26, R.J. Baars wrote: > I was able to make a file though. It is 3 Mb uncompressed. > > You can download it from dev.taaltik.nl/is.okay.zip Thanks, what was the exact command you used to create this list? Regards Daniel ---

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
Icelandic really creates a lot of junk using unmunch, even after removing some newer attributes from the .dict. Looks like unmunch is not capable of using the number flags as well. I was able to make a file though. It is 3 Mb uncompressed. You can download it from dev.taaltik.nl/is.okay.zip Ruud

Re: Help with unmunch and Icelandic + Galician

2014-10-27 Thread R.J. Baars
Unmunch does not support the newer features of Hunspell. It might even generate rubbish. There are ways to do this, more or less. Generating the list using unmunch is still an option, even when it generates rubbish. Add a list of found Icelandic words to that list. Then use hunspell with th
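One way to sketch the "generate candidates, then let Hunspell decide" idea in Python is shown below. It assumes the hunspell command-line program is installed and relies on its -l option, which prints the misspelled words it finds on standard input. The function names are hypothetical and the dictionary name is whatever you would pass to -d. Because hunspell tokenizes its input, candidates containing hyphens or apostrophes may not match exactly.

```python
import subprocess


def misspelled(words, dictionary):
    """Return the words that `hunspell -d dictionary -l` reports as wrong."""
    result = subprocess.run(
        ["hunspell", "-d", dictionary, "-l"],
        input="\n".join(words) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return set(result.stdout.split())


def keep_accepted(candidates, rejected):
    """Keep the candidates that were not reported as misspelled."""
    rejected = set(rejected)
    return [word for word in candidates if word not in rejected]
```

A typical use would be keep_accepted(candidates, misspelled(candidates, "is_IS")), where candidates is the raw unmunch output plus any extra word list.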