Those are perfectly valid words; they are just rarely used. Those word
terminations are valid Galician but not very frequent in written texts. Of
course, I bet you would not ever hear "anglicizariámosllesnola", but if
there is anyone out there that uses the verb "anglicizar", which I did not
know
By the way, don't trust Google too much. There are words that are valid,
but too infrequent for Google to absorb in their indexes.
For Dutch, I found lots of words in documents found using Google,
containing words that will not result in Google showing the same document
when searching with the word
On 2015-01-08 22:18, Adrián Chaves Fernández wrote:
> I uploaded it to SF:
> http://sourceforge.net/projects/hunspell-gl/files/tmp/hunspell-words-clean.txt/download
> (uncompressed, though, where’s the fun otherwise?!)
Thanks. Are all of these real words that actually occur or do many of
the
I uploaded it to SF:
http://sourceforge.net/projects/hunspell-gl/files/tmp/hunspell-words-clean.txt/download
(uncompressed, though, where’s the fun otherwise?!)
2015-01-05 14:38 GMT+01:00 Daniel Naber :
> On 2014-12-10 18:51, Adrián Chaves Fernández wrote:
>
> Hi Adrián,
>
> > You can download pr
On 2014-12-10 18:51, Adrián Chaves Fernández wrote:
Hi Adrián,
> You can download prebuilt snapshots from
> https://sourceforge.net/projects/hunspell-gl/files/instantaneas/
sorry, I only now found time to look at this again. When I run:
python unmunch.py -a hunspell-gl-drag-20141115/gl_ES.af' -d
You can download prebuilt snapshots from
https://sourceforge.net/projects/hunspell-gl/files/instantaneas/
You can alternatively generate the gl_ES.utf8 locale in your system:
https://wiki.archlinux.org/index.php/locale#Generating_locales
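Before generating anything, a quick check like the one below could tell whether the gl_ES.utf8 locale is already available. This is only a sketch: the helper name locale_available is made up for illustration, and I am assuming the collation category is the one that matters, which the thread does not say.

```python
import locale


def locale_available(name):
    """Return True if the C library accepts the given locale name."""
    # Remember the current collation locale so it can be restored.
    saved = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        # The locale has not been generated on this system.
        return False
    else:
        return True
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


print(locale_available("gl_ES.utf8"))
```

If it prints False, the locale still has to be generated as described on the wiki page above.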
2014-12-08 15:05 GMT+01:00 Daniel Naber :
> On 2014-12-05 0
On 2014-12-05 08:33, Adrián Chaves Fernández wrote:
> The repository where the unmunch script is located is part of a
> network of repositories. Your output suggests that you did not include
> submodules when you cloned the repository, see
> http://stackoverflow.com/questions/3796927/how-to-git-cl
Hi,
I have found a modified script I made when I was working on Galician
Hunspell (see attachment). I can't remember if it was entirely finished or
not, sorry about that.
Daniel, you are right about "recursivity" in the Galician affix file. You
can find some documentation at
http://linguamatica.com/i
Hello Adrián,
I didn't clone unmunch, I did my own version based on what I was told.
What I came across is that when I began working on PTG in 2013, the
documentation and help from mailing lists regarding Hunspell didn't
mention what I discovered recently.
The examples and help I had access
The repository where the unmunch script is located is part of a network of
repositories. Your output suggests that you did not include submodules when
you cloned the repository, see
http://stackoverflow.com/questions/3796927/how-to-git-clone-including-submodules
Any feedback about the script is ap
On 2014-11-15 07:06, Adrián Chaves Fernández wrote:
> As I explain in that Hunspell bug report, I ended up writing a Python
> script to unmunch Galician files.
Could you explain how this can be used? I'm not very familiar with
Python, and when I call 'python2.7 unmunch.py' I get:
File "unmu
Continuation flags can also be used for 'compounding' and have the same
issue of possibly having an endless loop.
I guess that is why Hunspell is time-limited for every lookup.
Ruud
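To illustrate the endless-loop issue, here is a minimal sketch with a made-up rule format (a dict mapping a flag to a list of (suffix, continuation flags) pairs). It is not how Hunspell or any unmunch script stores its rules; the point is only that a rule whose continuation refers back to itself never stops generating unless the expansion is cut off, here by max_depth.

```python
def expand(stem, flags, rules, max_depth=3):
    """Collect all forms reachable within max_depth rule applications."""
    forms = {stem}
    frontier = [(stem, tuple(flags), 0)]
    while frontier:
        word, word_flags, depth = frontier.pop()
        if depth == max_depth:
            # Without this cut-off a self-referencing rule loops forever.
            continue
        for flag in word_flags:
            for suffix, continuation in rules.get(flag, ()):
                derived = word + suffix
                forms.add(derived)
                frontier.append((derived, tuple(continuation), depth + 1))
    return forms


# Hypothetical rule: flag "A" adds "s" and allows flag "A" again.
looping_rules = {"A": [("s", ["A"])]}
print(sorted(expand("x", ["A"], looping_rules, max_depth=3)))
```

Raising max_depth simply makes the list longer, which is the "unlimited" list mentioned in this thread.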
> 2014-11-05 10:49 GMT+01:00 R.J. Baars :
>
>> There will never be a new unmunch that supports all new Hunspell
>>
2014-11-05 10:49 GMT+01:00 R.J. Baars :
> There will never be a new unmunch that supports all new Hunspell
> functions, since the compounding (or continuation, which is much the same)
> makes the list unlimited in size.
>
In Galician we only use compounds for number-related constructs (e.g.
“1.ª”),
I found out that the unmunch.sh script, which turns out to be from the
Hunspell 1.2.8 version (available in the folder for that version in
SourceForge) is a bit buggy. See
https://sourceforge.net/p/hunspell/bugs/147/
As I explain in that Hunspell bug report, I ended up writing a Python
script to
Like I said, Tatoeba is much too small.
There will never be a new unmunch that supports all new Hunspell
functions, since the compounding (or continuation, which is much the same)
makes the list unlimited in size.
Ruud
> On 2014-11-04 13:29, R.J. Baars wrote:
>
>> I put a script generating icelan
On 2014-11-04 13:29, R.J. Baars wrote:
> I put a script generating Icelandic and the data here:
>
> www.taaltik.nl/daniel/ice.zip
I'm not sure if this approach is viable, at least for Icelandic. Just
too many words are missing. For example, I just needed to check a single
paragraph to find the
I got 2.7 Mb, 229699 lines.
Try to download again and give it another try.
Ruud
> On 2014-11-04 14:10, Adrián Chaves Fernández wrote:
>
>> I have not read the whole conversation, but for Galician I recently
>> needed to unmunch the Hunspell files to generate a Morfologik
>> dictionary, and I m
On 2014-11-04 14:10, Adrián Chaves Fernández wrote:
> I have not read the whole conversation, but for Galician I recently
> needed to unmunch the Hunspell files to generate a Morfologik
> dictionary, and I managed to do it with:
>
> https://github.com/eitsl/hunspell/blob/master/utils/unmunch.sh [
On my system, it just gives error: gensub not defined.
> PS: I did it for the "drag" version, not the "comunidade" version. I am
> assuming that unmunch.sh would work with "comunidade" as well, but I did
> not try it as of today (and I'm at work right now).
>
> 2014-11-04 14:10 GMT+01:00 Adrián
PS: I did it for the "drag" version, not the "comunidade" version. I am
assuming that unmunch.sh would work with "comunidade" as well, but I did
not try it as of today (and I'm at work right now).
2014-11-04 14:10 GMT+01:00 Adrián Chaves Fernández :
> I have not read the whole conversation, but f
I have not read the whole conversation, but for Galician I recently needed
to unmunch the Hunspell files to generate a Morfologik dictionary, and I
managed to do it with:
https://github.com/eitsl/hunspell/blob/master/utils/unmunch.sh
A script which I found at:
https://github.com/kscanne/hunspell
Daniel,
I put a script generating Icelandic and the data here:
www.taaltik.nl/daniel/ice.zip
Read the script ice.sh to see how it works.
I might give a try for Galician as well.
Ruud
--
_
That suggestion does not work for Icelandic.
>
> I could upload a result, but you needed it to come from sources like
> Tatoeba and Wikipedia. I have no export routines for those, and currently
> no time to make them.
>
> Maybe in a few weeks.
> Ruud
>
>> On 2014-11-02 11:30, R.J. Baars wrote:
>>
>
I could upload a result, but you needed it to come from sources like
Tatoeba and Wikipedia. I have no export routines for those, and currently
no time to make them.
Maybe in a few weeks.
Ruud
> On 2014-11-02 11:30, R.J. Baars wrote:
>
>> The most effective way to generate Icelandic is to throw a
On 2014-10-31 01:42, Anton Meixome wrote:
> For Galician
>
> you can test with the new version
>
> http://sourceforge.net/projects/hunspell-gl/files/instantaneas/20141025/hunspell-gl-comunidade-20141025.tar.xz/download
Unfortunately, this makes unmunch (from hunspell 1.3.3) crash. I guess
Ruud
On 2014-11-02 11:30, R.J. Baars wrote:
> The most effective way to generate Icelandic is to throw a large words
> list to Hunspell, since the dictionary is supporting compounding.
Could you upload the result somewhere? To what extent does Icelandic support
compounding, other than NOSPLITSUGS I canno
Daniel,
The most effective way to generate Icelandic is to throw a large words
list to Hunspell, since the dictionary is supporting compounding.
Just applying the bag of tricks results in 0.8 MB of words; using a large
words list, 2.8 MB. Quite a difference.
Ruud
For Galician
you can test with the new version
http://sourceforge.net/projects/hunspell-gl/files/instantaneas/20141025/hunspell-gl-comunidade-20141025.tar.xz/download
Also we have lexical lists from a variety of sources (Wikipedia included)
http://sourceforge.net/projects/hunspell-gl/files/inst
Yes, it is much faster without suggestions. It is faster to use a large
corpus. Tatoeba and Wikipedia are not very big however. But it is a way to
do it. Feel free to; we could compare the results later ..
All in all, it would be better if Icelandic were maintained. Why is it not?
Is the rules par
On 2014-10-30 15:08, R.J. Baars wrote:
> My bag of tricks is still running. So there might still be a good result
> after some time. I estimate it to take another week.
Do the suggestions really help that much? Don't we get the same result
if we have a large list of words, e.g. the complete conte
Daniel,
My bag of tricks is still running. So there might still be a good result
after some time. I estimate it to take another week.
I noticed Icelandic seems to be a compounding language, at least parts of
it. The words list is not at all encoded like that.
I am tempted to rearrange the spellch
Yes, and flag num means any valid number.
FLAG long makes it possible to use longer (string) flags.
Ruud
> On 2014-10-28 12:58, Marco A.G.Pinto wrote:
>
>> I believe that if I change the code of Proofing Tool GUI to have
>> numbers with more than one character I would break other dictionaries
>> :
The affix flag length is determined by the Hunspell FLAG clause.
There are number, char, and string (long) types.
Myspell only knew about char, I think.
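As a rough illustration of the three flag types, the sketch below splits the flag part of a dictionary entry depending on the FLAG mode. The function name split_flags is made up for this example, and UTF-8 flags and flag aliases are ignored; it only shows why a tool that assumes one character per flag misreads a "FLAG num" dictionary.

```python
def split_flags(flag_field, mode="char"):
    """Split the flag part of a .dic entry according to the FLAG mode."""
    if not flag_field:
        return []
    if mode == "num":
        # FLAG num: decimal numbers separated by commas, e.g. "12,345"
        return [part for part in flag_field.split(",") if part]
    if mode == "long":
        # FLAG long: every flag is two characters wide, e.g. "AaBb"
        return [flag_field[i:i + 2] for i in range(0, len(flag_field), 2)]
    # Default: every single character is one flag.
    return list(flag_field)


print(split_flags("12,345", "num"))   # two numeric flags
print(split_flags("12,345"))          # what a char-only tool would see
```

With the default mode the comma itself ends up as a flag, which would explain odd results from tools that only know single-character flags.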
> Dear Ruud and Daniel,
>
> I believe I have a clue:
> Usually suffixes and prefixes only have one character for the rule.
>
> But this .AFF has more charact
On 2014-10-28 12:58, Marco A.G.Pinto wrote:
> I believe that if I change the code of Proofing Tool GUI to have
> numbers with more than one character I would break other dictionaries
> :'(
I think this number mode gets turned on by "FLAG num" in is_IS.aff.
Regards
Daniel
---
Dear Ruud and Daniel,
I believe I have a clue:
Usually suffixes and prefixes only have one character for the rule.
But this .AFF has more characters and this is the problem.
If you take a look at the English AFFs they too have letters and
numbers, but only one character long.
See the en_GB wo
I edited the .aff so that it at least no longer crashes.
Looks like it has been edited with an editor that inserts tabs everywhere. Since
tab is a special char to Hunspell, it causes the dump when unmunching.
The new aff does not dump, but still adds / to words.
Looks like unmunch is not able to proce
Some of the rules in the Icelandic affix file are wrong.
There are lots of lines like:
SFX 1 ur 0 , ending in a 0, causing a 0 to be added to the word by unmunch
or Marco's tool.
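For what it is worth, a tool could avoid the literal 0 by treating a lone "0" in the stripping or affix field as "nothing", which as far as I know is how the Hunspell affix format defines it. A minimal sketch, with a made-up function name and ignoring conditions and continuation flags:

```python
def apply_suffix(word, strip, add):
    """Apply one SFX rule; a lone "0" means empty strip or empty affix."""
    strip = "" if strip == "0" else strip
    add = "" if add == "0" else add
    if strip and not word.endswith(strip):
        # The rule does not apply to this word.
        return None
    stem = word[:len(word) - len(strip)]
    return stem + add


# Illustrative word only, using the rule "SFX 1 ur 0" from above.
print(apply_suffix("hestur", "ur", "0"))
```

A tool that appends the affix field verbatim would produce the form with a trailing 0 instead.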
Hunspell furthermore accepts words containing a number without any check
by default. So I added the numbers to the word
Hello!
The "Tags" is the extra information which I only found in the pt_PT
dictionary, which tells if each word is masculine, feminine, singular,
plural, etc.
As for the rules, they are here:
For each wrong result you get in the extracted list, you can check here
if it is a rule or a tool
The first thing I notice is that flags and word are not separated on the
screen.
I added a picture to show that.
When I click edit, it is the same. The / is apparently not seen as a flag
indicator in the dictionary.
In the dic, you can find flags after the / , comments after # and extra
data afte
Dear Ruud,
To see if it is a tool bug or a rule bug, just edit the word(s) in the
"Dictionary" tab of my tool and it will show a tab containing each rule
that generates the derivates.
You can edit the words with a double-click or with right-click+EDIT.
:-P
I am feeling so eager!
PS->You mu
In the output of the tool are also unmunch errors.
Ab0 as the derivative of Abel, e.g.
After exporting and processing into a words list, out of the 2.7 Mb, 2.3
Mb was accepted as a correct word by the same spellchecker.
So the 'bag of tricks' might still be useful after unmunching using this
tool,
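The "accepted by the same spellchecker" step can be sketched as a simple filter. The names below are hypothetical, and the is_accepted predicate is a stand-in: in practice it would have to ask Hunspell itself, which is left out here.

```python
def keep_accepted(candidates, is_accepted):
    """Keep each candidate once, and only if the checker accepts it."""
    seen = set()
    kept = []
    for word in candidates:
        word = word.strip()
        if word and word not in seen and is_accepted(word):
            seen.add(word)
            kept.append(word)
    return kept


# Stand-in checker for the example; a real one would call the spellchecker.
known = {"Abel", "virkar", "texta"}
print(keep_accepted(["Abel", "Ab0", "virkar", "Abel"], known.__contains__))
```

Forms such as Ab0 drop out this way, but only if the spellchecker itself rejects them.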
The tool seems to work.
I will check if it is better than the bag of tricks. Looks very promising.
Requires further processing though.
Ruud
> You have to use V3.0 build 64. From the menu "Dictionary Tools", choose
> "Extract wordlist". It worked for me.
>
> Am 27.10.2014 16:38, schrieb Daniel N
You have to use V3.0 build 64. From the menu "Dictionary Tools", choose
"Extract wordlist". It worked for me.
Am 27.10.2014 16:38, schrieb Daniel Naber:
> On 2014-10-27 13:48, Marco A.G.Pinto wrote:
>
>> To unmunch .DIC + .AFF use my tool, Proofing Tool GUI:
>> http://marcoagpinto.cidadevirtual
Below is the full bag of tricks:
#!/bin/bash
# set the language id (name of hunspell dic without extension)
if [ -z "$1" ] ; then
echo "ENTER THE NAME OF THE DICTIONARY FILE WITHOUT .DIC AS A PARAMETER"
else
if [ -f "$1.dic" ] ; then
if [ -f "$1.aff" ] ; then
LANG=$1
# try to unmunch
Apart from the trick I am applying now, a good option for more valid
output could be to use the words from Wikipedia and Tatoeba as an extra
input. If the language is in those databases.
Galician grew to > 3 Mb fast enough when Spanish and Portuguese were used
as input. These could also be found i
On 2014-10-27 13:48, Marco A.G.Pinto wrote:
> To unmunch .DIC + .AFF use my tool, Proofing Tool GUI:
> http://marcoagpinto.cidadevirtual.pt/proofingtoolgui.html
How does it work? I couldn't find an "unmunch" menu item or similar.
What does it do differently from the unmunch command-line program?
Daniels and friends,
To unmunch .DIC + .AFF use my tool, Proofing Tool GUI:
http://marcoagpinto.cidadevirtual.pt/proofingtoolgui.html
But please notice that the files must be in UTF-8 and not obfuscated.
Thanks!
Kind regards,
>Marco A.G.Pinto
--
On 27/10/2014
If you don't want the words from my own list added, I will leave them out.
No issue. But it will mean, since the source is not unmunchable, you might
be missing quite common Icelandic words, because the other tricks did not
generate them.
But it is already running, without other input than the hun
On 2014-10-27 11:37, R.J. Baars wrote:
> That is what these tricks do. There is no word added that is not
> accepted
> by the spellchecker.
I understand that, I'd also like to understand where 'virkar' and
'texta' come from: from your unmunch output or from the step you call
"Then I added my ow
Anyway, the words you wanted checked were in the dictionary before unmunch.
Ruud
> There is no list to go from, so how should I know? If there was such a
> list, there was no need to use unmunch, right?
>
> Doing an unmunch, you add lots of words to the dictionary, being all
> derivatives. That
There is no list to go from, so how should I know? If there was such a
list, there was no need to use unmunch, right?
Doing an unmunch, you add lots of words to the dictionary, being all
derivatives. That is what makes them that big.
When the source list is not there, the only thing you can do is
On 2014-10-27 10:53, R.J. Baars wrote:
> I first changed it into utf-8;
> I removed the po: flags
> I changed the tab chars into spaces
> Then I unmunched.
> I used sed to remove the trailing flags, which are created, as well as
> trailing numbers
> Then I added my own collection of Icelandic word
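The clean-up steps quoted above could look roughly like this in Python. This is a stand-in for the sed commands, which are not shown in the thread, so the patterns are my assumptions: po: fields are taken to be whitespace-separated tokens, and "trailing flags" is taken to mean everything from the first / onwards.

```python
import re


def clean_dic_line(line):
    """Prepare one .dic line before unmunching."""
    # Tabs are special to Hunspell, so turn them into spaces.
    line = line.replace("\t", " ")
    # Drop morphological fields of the (assumed) form "po:xxx".
    line = re.sub(r"\s+po:\S+", "", line)
    return line.rstrip()


def clean_unmunched_word(word):
    """Remove leftover flags and trailing digits from unmunch output."""
    word = re.sub(r"/.*$", "", word.strip())
    return re.sub(r"\d+$", "", word)


print(clean_dic_line("Abel/12\tpo:name"))
print(clean_unmunched_word("Ab0"))
```

The cleaned words still need to be checked against the spellchecker afterwards, since stripping a trailing digit does not make the remainder a valid word.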
Galician will be doable as well.
It accepts a lot of Spanish and Portuguese words (already 4 Mb). Add the
suggestions to it, and it will be a workable list.
My computer will be doing that for the next few days (suggestion is slow).
By the way, would it not be a good idea to have the full dictionari
> On 2014-10-27 10:26, R.J. Baars wrote:
>
>> I was able to make a file though. It is 3 Mb uncompressed.
>>
>> You can download it from dev.taaltik.nl/is.okay.zip
>
> Thanks, what was the exact command you used to create this list?
Multiple. And manual editing.
I first changed it into utf-8;
I re
On 2014-10-27 10:26, R.J. Baars wrote:
> I was able to make a file though. It is 3 Mb uncompressed.
>
> You can download it from dev.taaltik.nl/is.okay.zip
Thanks, what was the exact command you used to create this list?
Regards
Daniel
---
Icelandic really creates a lot of junk with unmunch, even after removing
some newer attributes from the .dic.
Looks like unmunch is not capable of using the number flags either.
I was able to make a file though. It is 3 Mb uncompressed.
You can download it from dev.taaltik.nl/is.okay.zip
Ruud
Unmunch does not support the newer functionalities of Hunspell. It might
generate rubbish even.
There are ways to do this, more or less.
Generating the list using unmunch is still an option, even when it
generates rubbish. Add a list of found Icelandic words to that list.
Then use hunspell with th