Actually, postprocessing with a replace for Æ is probably your best bet, as
the 4.0 LSTM engine is slower than the legacy tesseract engine for
Latin-based scripts.
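
A minimal sketch of such a replace step, written in Python purely for
illustration (the script discussed in this thread appears to be
PowerShell-based). The function name and the mapping are hypothetical: the
thread does not say which characters the OCR actually emits in place of Æ, so
substitute whatever you see in your own output.

```python
def apply_replacements(text, replacements):
    """Apply literal substring replacements to OCR output, in mapping order."""
    for wrong, right in replacements.items():
        text = text.replace(wrong, right)
    return text


# Hypothetical mapping: assumes the OCR output contains "AE" where the
# scanned page had "Æ". Adjust to the misreads you actually observe.
REPLACEMENTS = {"AE": "Æ"}

print(apply_replacements("VAERE", REPLACEMENTS))
```

A blind replace like this will also rewrite any legitimate "AE" in the text, so
it is only safe when that sequence does not otherwise occur in your documents.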

You can experiment with 4.0.0alpha.

See https://github.com/tesseract-ocr/tesseract/wiki/Compiling
You will also need to compile the latest version of leptonica before that.

Sources are at:
https://github.com/DanBloomberg/leptonica.git
https://github.com/tesseract-ocr/tesseract.git

There is no separate src directory for tesseract.

I used git clone to get the master branch, and I use git pull origin to
update it. You can also download a zip of the current master.
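
The clone-then-pull routine can be sketched as below, again in Python only to
keep the examples in one language. The helper names, the target directory
layout and the default branch name "master" (taken from this message) are
assumptions; check that the branch still exists under that name before
relying on it.

```python
import os
import subprocess

# Repository URLs as given in this message. leptonica is listed first
# because it has to be built before tesseract.
REPOS = {
    "leptonica": "https://github.com/DanBloomberg/leptonica.git",
    "tesseract": "https://github.com/tesseract-ocr/tesseract.git",
}


def sync_command(name, url, workdir=".", branch="master"):
    """Return the git command that clones the repo, or updates an existing clone."""
    target = os.path.join(workdir, name)
    if os.path.isdir(os.path.join(target, ".git")):
        # Existing clone: update it in place.
        return ["git", "-C", target, "pull", "origin", branch]
    # No clone yet: fetch the requested branch into workdir/name.
    return ["git", "clone", "--branch", branch, url, target]


def sync_all(workdir="."):
    """Clone or update every repository in REPOS (needs git and network access)."""
    for name, url in REPOS.items():
        subprocess.run(sync_command(name, url, workdir), check=True)


# sync_all()  # uncomment to actually clone or update both repositories
```

Building still follows the Compiling wiki page linked above; this only fetches
the sources.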



ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jan 9, 2017 at 1:18 PM, Ludvig F Aarstad <lud...@aarstad.org> wrote:

> No worries, I will play around and see what I can get working. For now I
> am using a simple replace in my script to handle the Æ.
> How would I go about compiling tesseract 4.0 alpha using git and cmake?
> The wiki says the 4.0 alpha source code is available in the master branch
> of the repository, but I have yet to find it... The compiling part seems
> straightforward enough, but I need the source ;).
>
> I tried installing gImageReader, hoping that it would give me the DLL for
> tesseract 4.0, but no luck.
>
> On Monday, 9 January 2017 at 08:34:18 UTC+1, shree wrote:
>
>> Sorry, I am not familiar with powershell and nuget.
>>
>> If you are on Windows, you can try the experimental 4.0.0alpha binaries
>> for gImageReader, a GUI front end to Tesseract-ocr. You can OCR a PDF
>> directly or load multiple images at the same time.
>>
>> - excuse the brevity, sent from mobile
>>
>> On 09-Jan-2017 12:49 PM, "Ludvig F Aarstad" <lud...@aarstad.org> wrote:
>>
>>> Thanks Shree :D. Really appreciate it. Will this work with v3.03 too? I
>>> am basing my code on https://github.com/jourdant/powershell-paperless
>>> and it has a script to initialize the environment that gets the
>>> tesseract files from https://nuget.org/api/v2/package/tesseract-ocr.
>>> Would you be able to point me in the right direction on how to move
>>> this from 3.03 to the 4.0alpha?
>>>
>>>
>>>
>>> On Friday, 6 January 2017 at 13:50:38 UTC+1, shree wrote:
>>>
>>>> I have uploaded modified nor.traineddata at
>>>>
>>>> https://github.com/Shreeshrii/tessdata4alpha/blob/master/nor.traineddata
>>>>
>>>> See the attached log and info file for the commands used in training.
>>>> It ran for about 9 hours on my PC, only about 1700 iterations, and then
>>>> my PC froze. I rebooted and created the traineddata from
>>>> norlayer0.853_1615.lstm, i.e. a 0.853% character error rate at
>>>> iteration 1615.
>>>>
>>>>
>>>> ShreeDevi
>>>>
>>>> On Fri, Jan 6, 2017 at 5:59 PM, ShreeDevi Kumar <shree...@gmail.com>
>>>> wrote:
>>>>
>>>>> @Peter, Have you tried the 4.0.0alpha version yet?
>>>>>
>>>>> @Ludvig F. Aarstad - "Add a layer" training worked for adding 'Æ'. I
>>>>> will upload the new traineddata so that you can test. You will need
>>>>> the 4.0.0alpha version for testing.
>>>>>
>>>>> Here are a couple of the training TIFs and the OCRed text.
>>>>>
>>>>> ShreeDevi
>>>>>
>>>>> On Fri, Jan 6, 2017 at 5:01 PM, Peter <pe...@peterkrantz.se> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Thursday, 5 January 2017 at 04:39:01 UTC+1, shree wrote:
>>>>>>>
>>>>>>> Ray is planning to retrain the languages for the new 4.0.0 version
>>>>>>> sometime in January. So it would be helpful if you could open an
>>>>>>> issue on https://github.com/tesseract-ocr/langdata/issues with this
>>>>>>> information.
>>>>>>>
>>>>>>
>>>>>> Is it possible to contribute training data for this effort? I realise
>>>>>> Swedish will not be at the top of the list, but I think it would be
>>>>>> easy to involve some of the research community here in contributing
>>>>>> training data if it could improve the language model.
>>>>>>
>>>>>> /Peter
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/9788db26-bb8a-4861-b29e-80db2b5a687f%40googlegroups.com.
>>>>>>
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>>>

