Forwarding update by Ray.
---------- Forwarded message ----------
From: theraysmith <notificati...@github.com>
Date: Wed, Jul 12, 2017 at 5:55 AM
Subject: Re: [tesseract-ocr/tesseract] Tag a new version for LSTM 4.0 (#995)
To: tesseract-ocr/tesseract <tesser...@noreply.github.com>

I'm about ready to update the traineddatas. I have a training run almost complete, and with accuracy that meets with my satisfaction. There are a few regressions, but not too serious.

First though, I have to get some code reviewed in Google, and then make some commits to github to match the new traineddatas. Before that, there is the matter of a major pull...

Here's what's coming:

- Fix to issue 653: New components in traineddata file for the unicharset, recoder and version string. Backwards compatible change, so the LSTM component can still read older files.
- Change in training system. The above change makes open source training impossible. Will add a new program to build a starter traineddata from a unicharset and optional word lists.
- New "normalization" code to clean corpus text in all languages. That was a big part of the work.
- Improvements to the trained networks to improve accuracy on single characters and single words.
- 2 parallel sets of tessdata. "best" and "fast". "Fast" will exceed the speed of legacy Tesseract in real time, provided you have the required parallelism components, and in total CPU only slightly slower for English. Way faster for most non-latin languages, while being <5% worse than "best". Only "best" will be retrainable, as "fast" will be integer.

I have other stuff that is still incomplete, but that is a good list for now.

BTW, in case you hadn't noticed, there was a breaking change that made old lstmf files unusable. That was needed to fix LSTM for OSD. It has to know the language of each training sample.

The new traineddatas will mostly be smaller than the older ones, as they won't contain the legacy components, and no bigram dawgs are needed.

-- Ray.
View the original comment on GitHub: https://github.com/tesseract-ocr/tesseract/issues/995#issuecomment-314609036