On Mar 8, 3:53 am, zdenko podobny <[email protected]> wrote:
> On Wed, Mar 7, 2012 at 11:51 PM, Falke <[email protected]> wrote:
> > I did search this group but found only old posts regarding multiple
> > languages (regarding 2.0), but, looking forward to the new features in
> > 3.01...
>
> > I am assuming it's still impossible, even in 3.01, to recognize a
> > mixture of languages (distinct alphabets), per scan.  If my assumption
> > is correct, then, the next best thing would/could be to combine
> > multiple traineddata files into one superset...
>
> This feature is available in version 3.02 [1] (already in svn).
>
> [1] http://groups.google.com/group/tesseract-ocr/msg/29413aef63ee5977
>
> > But is that even feasible??
>
> > Any other solutions for multilingual (multi-alphabetic) documents?
>
> > (ABBYY does it -- why can't we?? :-))
>
> > TIA
>
> > --
> > You received this message because you are subscribed to the Google
> > Groups "tesseract-ocr" group.
> > To post to this group, send email to [email protected]
> > To unsubscribe from this group, send email to
> > [email protected]
> > For more options, visit this group at
> >http://groups.google.com/group/tesseract-ocr?hl=en

Wonderful; a great start.

Just one small piece of feedback for now, suggesting an algorithmic tweak:

If an apostrophe has no space on either side of it, it is probably part
of a contraction rather than a quotation mark.  So the two letters on
either side of the apostrophe should almost always belong to the same
alphabet.  As it is now, the svn version allows for something like:

    п's    (Cyrillic "п" and Latin "s")

or

    l'я    (Latin "l" and Cyrillic "я")


There may be exceptions, but I think it is safe to assume that, in
practice, they occur less than 5% of the time.
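
To make the suggestion concrete, here is a minimal Python sketch of the
check as a post-processing step over recognized words.  It is not
Tesseract code: the helper names (script_of, mixed_script_apostrophes)
are made up for illustration, and taking the first word of the Unicode
character name as the "alphabet" is a simplifying assumption.

```python
import unicodedata

# Characters treated as apostrophes (ASCII and the typographic one).
APOSTROPHES = {"'", "\u2019"}

def script_of(ch):
    """Rough script label from the Unicode name, e.g. 'LATIN' or 'CYRILLIC'.

    Assumption: the first word of the character name identifies the alphabet.
    Returns None for non-letters.
    """
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None

def mixed_script_apostrophes(word):
    """Return (left, right) letter pairs that flank an apostrophe directly
    (no space on either side) but belong to different scripts."""
    suspects = []
    # Skip the first and last character: an apostrophe there has no letter
    # on one side, so it is more likely a quote mark than a contraction.
    for i in range(1, len(word) - 1):
        if word[i] in APOSTROPHES:
            left = script_of(word[i - 1])
            right = script_of(word[i + 1])
            if left and right and left != right:
                suspects.append((word[i - 1], word[i + 1]))
    return suspects
```

A word for which mixed_script_apostrophes() returns a non-empty list
could then be penalized or re-recognized with a single language, while
ordinary contractions such as "don't" pass through untouched.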

Thanks for your hard work; it's an amazing product.

To Speedy:  It looks like it works at the word level (the above "bug"
notwithstanding :-))

