The best way would be to try it ;-) AFAIR there have been similar approaches, e.g. [1], [2], [3]. IMHO using GNU Parallel was quite popular; a Google search for "tesseract parallel" returns about 1.28 million results. But please be aware of this open issue [4]...
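
A minimal Python sketch of the idea discussed in this thread: run several independent Tesseract processes at once and collect the results into a map. It assumes the multi-page TIFF has already been split into single-page images and that the `tesseract` executable is on PATH. The names `run_tesseract` and `ocr_pages` are hypothetical helpers, not part of any Tesseract API. Setting `OMP_THREAD_LIMIT=1` is an assumption that only matters for OpenMP-enabled builds, and "half of the CPUs" is just the starting point suggested in the quoted message, not a tuned value.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_tesseract(image_path, lang="eng"):
    """OCR one single-page image with the tesseract CLI and return its text."""
    # Assumption: keep each process to one OpenMP thread so that the
    # parallel instances do not compete for the same cores.
    env = dict(os.environ, OMP_THREAD_LIMIT="1")
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", lang],
        capture_output=True, text=True, check=True, env=env,
    )
    return result.stdout


def ocr_pages(image_paths, workers=None, ocr=run_tesseract):
    """Return {image_path: text}, running up to `workers` OCR jobs at once."""
    if workers is None:
        # Half of the logical CPUs, as proposed in the quoted message.
        workers = max(1, (os.cpu_count() or 2) // 2)
    paths = list(image_paths)
    # Threads are enough here: each one only waits for an external
    # tesseract process, so the real work runs outside the Python interpreter.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = pool.map(ocr, paths)  # results come back in input order
        return dict(zip(paths, texts))
```

Usage would look like `results = ocr_pages(["page-01.tif", "page-02.tif"])`, where the file names are placeholders. Because every page goes to a separate process, this sidesteps thread-safety questions inside the library, at the cost of loading the language model once per page.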

[1] https://appliedmachinelearning.blog/2018/06/30/performing-ocr-by-running-parallel-instances-of-tesseract-4-0-python/
[2] https://marketplace.uipath.com/listings/parallel-ocr-processing-using-testract
[3] https://stackoverflow.com/questions/47958163/tesseract-ocr-large-number-of-files
[4] https://github.com/tesseract-ocr/tesseract/issues/3109

Zdenko

On Thu, 19 May 2022 at 13:39, Krzysztof J <[email protected]> wrote:

> Hello zdenop,
>
> My idea is to use multithreading for multi-page TIFFs, e.g. one containing
> 30 pages. Currently Tesseract works on 1 thread and does not use its full
> potential on Windows. We could take, for example, half of the available
> system threads and automatically hand out the pages of the TIFF file to
> them as independent images. The results would be collected into 1
> structure, for example a map of results. The implementation could be
> carried out at the level of a wrapper class that has been prepared for
> communication with the OCR engine. An example functional diagram for a
> 4-core processor is shown in the schema below. Is this a good direction
> for running several Tesseract OCR instances simultaneously?
>
> [image: Test.png]
>
> On Monday, 9 May 2022 at 18:44:05 UTC+2, zdenop wrote:
>
>> Hello,
>>
>> 1) Search the issue tracker for OpenMP reports [1] for more details.
>> Experiences differ. To me it seems that it does not help on Linux
>> (and Mac?), it just consumes CPU. My experience [2] is that it helps
>> on Windows, but maybe it is a question of HW & SW configuration. To be
>> on the safe side, OpenMP is turned off by default, so if somebody turns
>> it on, that user/developer is responsible for the consequences ;-)
>>
>> 2) I ran some tests with multithreading of tesserocr in Python and it
>> does not work for me. It works only with 1 thread (I never use
>> multithreading, so maybe the problem is on my side).
>>
>> Anyway, any contribution in this area (OpenMP) is warmly welcome.
>>
>> [1] https://github.com/tesseract-ocr/tesseract/issues?q=is%3Aissue+openmp
>> [2] https://github.com/tesseract-ocr/tessdoc/blob/main/Benchmarks.md
>>
>> Zdenko
>>
>> On Sun, 8 May 2022 at 14:24, Krzysztof J <[email protected]> wrote:
>>
>>> I have the following problems & questions:
>>>
>>> 1) Question 1: While preparing the build, I noticed that the
>>> "OPENMP_BUILD" setting is not included when building the solution,
>>> see below:
>>>
>>> [image: configuration_tesseract.png]
>>>
>>> Can anyone say something more about it? Is using multiprocessing
>>> recommended at the moment? What is its current state? I only saw
>>> issue #1662 <https://github.com/tesseract-ocr/tesseract/issues/1662>,
>>> where it was turned off, but that was 4 years ago :o
>>>
>>> 2) Question 2: Are there any other ways to take advantage of
>>> multithreading in Tesseract 5.1.0 besides OpenMP? Does anyone have
>>> experience with this topic? For now I am working on 1 thread, but
>>> ultimately I would like to switch to multiple threads.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8yfk9-bCSY2y6hY7cTbOHOxZB36us3O-hd7F_HzsaSYUw%40mail.gmail.com.

