Tesseract 3.02.02 Released

zdenko podobny Sat, 03 Nov 2012 07:51:14 -0700

Hello all,

Tesseract OCR 3.02 has been released (as 3.02.02). You can find it in the
download section [1] or on the Project page in the "Featured" section.

*Tesseract release notes - V3.02*

- Moved ResultIterator/PageIterator to ccmain.
- Added Right-to-left/Bidi capability in the output iterators for
Hebrew/Arabic.
- Added paragraph detection in layout analysis/post OCR.
- Fixed inconsistent xheight during training and over-chopping.
- Added simultaneous multi-language capability.
- Refactored top-level word recognition module.
- Added experimental equation detector.
- Improved handling of resolution from input images.
- Blamer module added for error analysis.
- Cleaned up externally used namespace by removing includes from
baseapi.h.
- Removed dead memory management code.
- Tidied up constraints on control parameters.
- Added support for ShapeTable in classifier and training.
- Refactored class pruner.
- Fixed training leaks and randomness.
- Major improvements to layout analysis for better image detection,
diacritic detection, better textline finding, better tabstop finding.
- Improved line detection and removal.
- Added fixed pitch chopper for CJK.
- Added UNICHARSET to WERD_CHOICE to make multi-language handling easier.
- Fixed problems with internally scaled images.
- Added page and bbox to string in tr files to identify source of
training data better.
- Fixes to Hindi Shirorekha splitter.
- Added word bigram correction.
- Reduced stack memory consumption and eliminated some ugly typedefs.
- Added new uniform classifier API.
- Added new training error counter.
- Fixed endian bug in dawg reader.
- C API (thanks to Tobias Müller)
- New solution for VS 2008 (thanks to Tom Powers)
- Many other fixes, including the way in which the chopper finds chops
and messes with the outline while it does so.
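
For the simultaneous multi-language capability listed above, here is a
minimal sketch of how a caller might build the command line. It assumes
the "-l" option accepts several language codes joined with "+" (as
described in the Tesseract man page); the helper name and the file names
are hypothetical and only for illustration.

```python
def build_tesseract_command(image_path, output_base, languages):
    """Return an argv list for a tesseract run with one or more languages.

    Assumption: language codes are joined with '+' and passed via '-l'.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


# Hypothetical input and output names; nothing is executed here.
command = build_tesseract_command("page.tif", "page", ["eng", "ita_old"])
print(" ".join(command))
```

The resulting list can be passed to subprocess.run() once the matching
language data files are installed.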

The Windows installer was built on Windows XP SP3 with the NSIS tool.
Tesseract.exe (and the training tools) is a 32-bit static build made with
VC++ 2008 Express, so you may need the Microsoft Visual C++ 2008 SP1
Redistributable Package (x86) [2].

All Google-generated language data files were updated (community language
data files have not been updated yet).
New languages available from Google: afr, aze, bel, ben, chr, enm, epo,
est, eus, frm, glg, ita_old, kan, mal, mkd, mlt, msa, spa_old, sqi, swa,
tam, tel.
Cube data files are available for ita, fra, rus, spa too.
Added experimental equation detector (equ).
There is also a new community language, Ancient Greek (grc) - thanks to
Nick White.

Language data files created for 3.00 and 3.01 can be used in 3.02. Language
data files created with Tesseract OCR 3.02 will not work in previous
versions.

Thank you to all who shared your know-how and tested Tesseract 3.02 in svn.
Thanks Google for supporting this project!

[1] http://code.google.com/p/tesseract-ocr/downloads/list
[2] http://www.microsoft.com/en-us/download/details.aspx?id=5582&WT.mc_id=MSCOM_EN_US_DLC_DETAILS_121LSUS007998

--
Zdenko Podobný
Community project contributor
