The best way would be to try it ;-)
AFAIR there were similar approaches, e.g. [1], [2], [3]. IMHO using GNU
Parallel was quite popular; a Google search for "tesseract parallel"
returns about 1.28 million results. But please be aware of this open issue [4]...

[1] https://appliedmachinelearning.blog/2018/06/30/performing-ocr-by-running-parallel-instances-of-tesseract-4-0-python/
[2] https://marketplace.uipath.com/listings/parallel-ocr-processing-using-testract
[3] https://stackoverflow.com/questions/47958163/tesseract-ocr-large-number-of-files
[4] https://github.com/tesseract-ocr/tesseract/issues/3109
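
For illustration only, a minimal Python sketch of the "several instances,
results collected in a map" idea. It is not taken from any of the links above.
It assumes the `tesseract` executable is on PATH and that the TIFF has already
been split into one image file per page. The names `ocr_page` and `ocr_pages`
and the file names are made up for this example. Setting OMP_THREAD_LIMIT=1 is
also an assumption: it is the standard OpenMP variable for capping threads and
only matters if the build uses OpenMP.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ocr_page(path):
    """Run one tesseract process on one page image and return its text."""
    # Assumption: cap OpenMP threads so parallel instances do not compete.
    env = dict(os.environ, OMP_THREAD_LIMIT="1")
    done = subprocess.run(
        ["tesseract", path, "stdout"],  # "stdout" as output base prints the text
        capture_output=True, text=True, check=True, env=env,
    )
    return done.stdout


def ocr_pages(paths, ocr=ocr_page, workers=None):
    """OCR the given page images concurrently and return {path: text}."""
    paths = list(paths)
    if workers is None:
        # Half of the logical CPUs, as proposed in the question; at least 1.
        workers = max(1, (os.cpu_count() or 2) // 2)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps input order, so each path is paired with its own result.
        return dict(zip(paths, pool.map(ocr, paths)))
```

Usage would be something like `ocr_pages(["page-01.tif", "page-02.tif"])`.
Python threads are enough here because each thread only waits for its own
tesseract process; the OCR work itself runs in separate processes.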

Zdenko


On Thu, 19 May 2022 at 13:39, Krzysztof J <[email protected]> wrote:

> Hello zdenop,
>
> My idea is to use multithreading for multi-page TIFFs, e.g. one containing
> 30 pages. Currently Tesseract works on 1 thread and does not use its full
> potential on Windows. We could take, for example, half of the available
> system threads and automatically assign each of them some of the pages of
> the TIFF file as independent images. The results would be collected into
> 1 structure, for example a map of results. The implementation could be
> carried out at the level of a wrapper class that has been prepared for
> communication with the OCR engine. An example functional diagram for a
> 4-core processor is shown in the schema below. Is running several
> Tesseract OCR instances simultaneously a good direction?
>
> [image: Test.png]
>
> On Monday, 9 May 2022 at 18:44:05 UTC+2, zdenop wrote:
>
>> Hello,
>>
>> 1) Search the issue tracker for OpenMP reports [1] for more details.
>> Experiences differ. To me it seems that it does not help on
>> Linux (and Mac?); it just consumes CPU. My experience [2] is that it helps
>> on Windows, but maybe it is a question of HW and SW configuration. To be on
>> the safe side, OpenMP is turned off by default, so if somebody turns it
>> on, that user/developer is responsible for the consequences ;-)
>>
>> 2) I ran some tests with multithreading of tesserocr in Python and it
>> does not work for me. It works only with 1 thread (I never use
>> multithreading, so maybe the problem is on my side).
>>
>> Anyway, any expertise or contribution in this area (OpenMP) is warmly welcome.
>>
>> [1] https://github.com/tesseract-ocr/tesseract/issues?q=is%3Aissue+openmp
>> [2] https://github.com/tesseract-ocr/tessdoc/blob/main/Benchmarks.md
>>
>> Zdenko
>>
>>
>> On Sun, 8 May 2022 at 14:24, Krzysztof J <[email protected]> wrote:
>>
>>> I have the following problems and questions:
>>>
>>> 1). Question 1: While preparing the build, I noticed that the
>>> "OPENMP_BUILD" setting is not included when building the solution; see below:
>>>
>>> [image: configuration_tesseract.png]
>>>
>>> Can anyone say more about it? Is using multiprocessing recommended at
>>> the moment? What is its current state? I only saw issue #1662
>>> <https://github.com/tesseract-ocr/tesseract/issues/1662>, where it
>>> was turned off, but that was 4 years ago :o
>>>
>>> 2). Question 2: Are there any other ways to take advantage of
>>> multithreading in Tesseract 5.1.0 besides OpenMP? Does anyone have
>>> experience with this topic? For now I am working on 1 thread, but ultimately
>>> I would like to switch to multiple threads.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/190a9765-e00c-4f1b-b784-b81851d2a0c4n%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/190a9765-e00c-4f1b-b784-b81851d2a0c4n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>

