Actually, postprocessing with a replace for Æ will be your best bet, as 4.0 is slower than the legacy Tesseract engine for Latin-based scripts.
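A minimal sketch of that kind of replace step in Python, assuming the OCR output contains the two letters AE where Æ was intended. The function name and the mapping are hypothetical and not part of Tesseract. A blind replace also changes words where AE really is two letters, so keep the mapping as narrow as your documents allow:

```python
# Hypothetical post-processing step for OCR output.
# Assumption: the engine emits "AE" where the page shows "Æ".
DEFAULT_FIXES = {"AE": "Æ"}


def fix_ocr_text(text, fixes=None):
    """Return text with each misread sequence replaced by the intended character."""
    if fixes is None:
        fixes = DEFAULT_FIXES
    for wrong, right in fixes.items():
        # str.replace substitutes every occurrence, with no regard for context.
        text = text.replace(wrong, right)
    return text


if __name__ == "__main__":
    print(fix_ocr_text("AERE VAERE DEN SOM AERES BOR"))
```

Passing a different mapping, for example fix_ocr_text(text, {"ae": "æ"}), covers lowercase if your scans need it.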
You can experiment with 4.0.0alpha. See https://github.com/tesseract-ocr/tesseract/wiki/Compiling. You will also need to compile the latest version of leptonica before that.

Sources are at:
https://github.com/DanBloomberg/leptonica.git
https://github.com/tesseract-ocr/tesseract.git

There is no separate src directory for tesseract. I used git clone to get the master branch and then use git pull origin to update it. You can also download a zip of the current master.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jan 9, 2017 at 1:18 PM, Ludvig F Aarstad <lud...@aarstad.org> wrote:

> No worries, I will play around and see what I can get working. For now I
> am using a simple replace in my script to handle the Æ.
> How would I go about compiling tesseract 4.0 alpha using git and cmake?
> The wiki says the 4.0 alpha source code is available in the master branch
> of the repository, but I have yet to find it... The compiling part seems
> straightforward enough, but I need the source ;).
>
> I tried installing gimagereader, hoping that it would give me the dll for
> tesseract 4.0, but no.
>
> On Monday, 9 January 2017 at 08:34:18 UTC+1, shree wrote:
>
>> Sorry, I am not familiar with powershell and nuget.
>>
>> If you are on Windows, you can try the experimental binaries for
>> 4.0.0alpha for gimagereader, a GUI front-end to Tesseract-ocr. You can
>> OCR a pdf directly or load multiple images at the same time.
>>
>> - excuse the brevity, sent from mobile
>>
>> On 09-Jan-2017 12:49 PM, "Ludvig F Aarstad" <lud...@aarstad.org> wrote:
>>
>>> Thanks Shree :D. Really appreciate it. Will this work with v3.03 too? I
>>> am basing my code on this:
>>> https://github.com/jourdant/powershell-paperless
>>> and there is a script to initialize the environment that gets the
>>> tesseract files from here:
>>> https://nuget.org/api/v2/package/tesseract-ocr.
>>> Would you be able to point me in the right direction on how to move
>>> this from 3.03 to the 4.0alpha?
>>>
>>> On Friday, 6 January 2017 at 13:50:38 UTC+1, shree wrote:
>>>
>>>> I have uploaded a modified nor.traineddata at
>>>>
>>>> https://github.com/Shreeshrii/tessdata4alpha/blob/master/nor.traineddata
>>>>
>>>> See the attached log and info file for the commands used in training.
>>>> It took about 9 hours on my PC for only about 1700 iterations, and then
>>>> my PC froze, so I rebooted and created the traineddata for
>>>> norlayer0.853_1615.lstm, i.e. 0.853% character error rate at iteration
>>>> number 1615.
>>>>
>>>> ShreeDevi
>>>>
>>>> On Fri, Jan 6, 2017 at 5:59 PM, ShreeDevi Kumar <shree...@gmail.com>
>>>> wrote:
>>>>
>>>>> @Peter, have you tried the 4.0.0alpha version yet?
>>>>>
>>>>> @Ludvig F. Aarstad - The "add a layer" training worked for adding
>>>>> 'Æ' - I will upload the new traineddata so that you can test. You will
>>>>> need the 4.0.0alpha version for testing.
>>>>>
>>>>> Here are a couple of the training tifs and the OCRed text.
>>>>>
>>>>> ShreeDevi
>>>>>
>>>>> On Fri, Jan 6, 2017 at 5:01 PM, Peter <pe...@peterkrantz.se> wrote:
>>>>>
>>>>>> On Thursday, 5 January 2017 at 04:39:01 UTC+1, shree wrote:
>>>>>>>
>>>>>>> Ray is planning to retrain the languages for the new 4.0.0 version
>>>>>>> sometime in January. So it would be helpful if you could open an
>>>>>>> issue on https://github.com/tesseract-ocr/langdata/issues with this
>>>>>>> information.
>>>>>>
>>>>>> Is it possible to contribute training data for this effort?
>>>>>> I realise Swedish will not be on top of the list, but I think it
>>>>>> would be easy to involve some of the research community here in
>>>>>> contributing training data if it could improve the language model.
>>>>>>
>>>>>> /Peter
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/9788db26-bb8a-4861-b29e-80db2b5a687f%40googlegroups.com.
>>>>>> For more options, visit https://groups.google.com/d/optout.