On 25 May 2010 03:04, Arno Teigseth <arnot...@gmail.com> wrote:
> On Mon, 2010-05-24 at 16:41 +0100, James Le Cuirot wrote:
>
>> Because until a few days ago, development had ground to a halt. There
>> had been no new commits since August (I think) and not even any
>> communication from the lead maintainer. It's nice to see that there
>> have been some new commits but some communication would make the future look
>> less uncertain.
>
> OK
> You probably have understood by now that I don't know much about
> committing changes, so I guess this bonus question won't hurt: Can't
> other developers commit changes? Only the lead maintainer can?
>

Y'know, I don't recall seeing in this exchange anything along the
lines of 'I asked for commit access but was refused/got no
answer/etc.'

Of course, there's a catch-22 in that, without the ability to commit
(or have work committed), there's no incentive to continue; but
without continuous additions, there are no obvious candidates to add
to the pool of people with commit rights.

> I'd like to suggest changes, too. Not that I'd have any idea how, but
> I'd like sometime to make tesseract use hunspell in the "what is this
> word really" decision process. The spellchecker could maybe predict
> compound words not in tesseract's "known words list" and help fix small
> character mixups...
>

Hunspell does not 'predict'. It has a dictionary of words with 'affix
flags', and an affix file which 1) (like ispell, etc.) defines the
prefixes/suffixes that can be applied to those words, and 2) (unlike
ispell) defines how words may be compounded, which is more or less the
same as 1) plus the concatenation of a second (or third? I'm not up to
date on it) word. To put it another way, it's a set of text-encoded
transformations for words.
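
To make the 'flags plus rules' idea concrete, here is a rough sketch in
Python. It is a toy, not hunspell's real file format or API: the
SuffixRule type, the RULES table, the flag letters and the sample words
are all made up for illustration, and it only handles suffixes (no
prefixes, no compounding).

```python
from typing import NamedTuple


class SuffixRule(NamedTuple):
    """Hypothetical, simplified stand-in for one suffix rule."""
    strip: str      # characters removed from the end of the stem
    add: str        # characters appended after stripping
    ends_with: str  # rule applies only if the stem ends with this


# Made-up rules keyed by a made-up flag letter.
RULES = {
    "S": [SuffixRule(strip="", add="s", ends_with="")],
    "G": [SuffixRule(strip="e", add="ing", ends_with="e")],
}


def expand(stem, flags):
    """Return the stem plus every form its flags allow."""
    forms = [stem]
    for flag in flags:
        for rule in RULES.get(flag, []):
            if stem.endswith(rule.ends_with):
                cut = len(stem) - len(rule.strip)
                forms.append(stem[:cut] + rule.add)
    return forms


def full_form_list(dictionary):
    """Expand every entry of a {stem: flags} dict into a sorted word list."""
    words = set()
    for stem, flags in dictionary.items():
        words.update(expand(stem, flags))
    return sorted(words)


if __name__ == "__main__":
    for word in full_form_list({"console": "SG", "walk": "S"}):
        print(word)
```

The point is only that the affix file describes transformations applied
to dictionary entries; nothing is guessed or predicted.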

The DAWG files used by tesseract are a binary representation of the
words as a graph; common parts of words are shared - if 'consolation'
and 'confrontation' are in the graph, 'con' and 'ation' are shared
between them. The important part is, DAWGs are fast.
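
A small sketch of that sharing, again in Python. This is not
tesseract's DAWG code or its on-disk format: it builds a plain trie
(which shares prefixes only) and then merges identical subtrees so that
suffixes are shared as well. The Node class and the function names are
invented for the example.

```python
class Node:
    """One state in the graph; edges are labelled with single characters."""

    def __init__(self):
        self.children = {}
        self.is_word = False


def build_trie(words):
    """Build a trie: common prefixes are shared, suffixes are not."""
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.is_word = True
    return root


def minimize(node, registry):
    """Merge identical subtrees bottom-up so common suffixes are shared."""
    for ch in list(node.children):
        node.children[ch] = minimize(node.children[ch], registry)
    # Two nodes are interchangeable if they agree on is_word and have the
    # same labelled edges to the same (already merged) children.
    key = (node.is_word,
           tuple(sorted((ch, id(child)) for ch, child in node.children.items())))
    return registry.setdefault(key, node)


def count_nodes(root):
    """Count distinct nodes reachable from root."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(node.children.values())
    return len(seen)


def contains(root, word):
    """Look a word up by following one edge per character."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word


if __name__ == "__main__":
    words = ["consolation", "confrontation"]
    print("trie nodes:", count_nodes(build_trie(words)))
    dawg = minimize(build_trie(words), {})
    print("dawg nodes:", count_nodes(dawg))
    print(contains(dawg, "consolation"), contains(dawg, "con"))
```

Lookup cost depends only on the length of the word, not on the size of
the word list, which is where the speed comes from.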

You could always use the tools that come with hunspell to export the
dictionary as a full form list of words, and run that through
wordlist2dawg, which would give you the benefit of hunspell's larger
wordlist without the slowdown.

> ok just my two cents - but if I can't ever get to commit any changes, I
> get your idea of a fork...
>
> best
> Arno
>



-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
