More of a resolution - it looks like the issue was caused by my using 
pytesseract 0.3.1; 0.3.2 includes a bug fix for properly cleaning up 
temp files: https://github.com/madmaze/pytesseract/releases. So upgrading 
pytesseract is likely the best course of action.

On Tuesday, 7 April 2020 12:45:18 UTC-4, Michael Keenan wrote:
>
> Hello,
>
> I was actually planning to post here to ask for help, but I've recently 
> figured out the problem. I'm posting for anyone who comes across this 
> issue in the future, since it took me a few days of trial and error to 
> work out (and Python multiprocessing constantly finds new ways to present 
> a challenge).
>
> *Background:* I'm running a massive data transformation in Python using 
> multiprocessing on 48-96 CPU AWS EC2 machines. My goal is to use 
> pytesseract to transform millions of images into OCR data for machine 
> learning model training. Each process receives its own batch of images, 
> and the plan is for them to spend a few days OCRing. A couple of notable 
> items: I'm using the recommended environment setting 
> os.environ["OMP_THREAD_LIMIT"] = "1", and I'm setting 
> multiprocessing.pool.Pool's maxtasksperchild to ~1000 in hopes of keeping 
> each process's environment clean as it runs pytesseract.image_to_data() 
> over many images.
>
> *Problem:* After kicking off the script, I watched CPU utilization and 
> was baffled by the left two sessions in the chart below, where 
> utilization slowly drops off over time until I kill the job to 
> troubleshoot. At first glance it looked as if tesseract was getting tired 
> and slowing down over time.
>
> *Solution:* After much trial and error, I finally figured out that my 
> temp directory (/tmp/) was getting so full (100k+ files) that it seemed 
> to cause I/O overhead somewhere, presumably when writing image files 
> locally for OCR, which slowly tanked my CPU utilization over time. The 
> solution is to periodically run 
> pytesseract.pytesseract.cleanup("/tmp/tess*") throughout my script so 
> that this directory stays reasonably sized. In the bottom-right session 
> of the chart, you can see how this improved CPU utilization (albeit 
> temporarily) until the directory started to fill again.
>
> I think Linux normally cleans temp directories only after some number of 
> days, so it is necessary to do this periodically in the script.
>
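The periodic cleanup can also be sketched with just the standard library. `purge_tess_tempfiles` below is a hypothetical helper similar in spirit to the pytesseract cleanup call mentioned above, not pytesseract's own API:

```python
import glob
import os
import tempfile

def purge_tess_tempfiles(pattern="tess*"):
    # Delete leftover tesseract temp files so the temp directory stays small
    tmpdir = tempfile.gettempdir()  # normally /tmp on Linux
    removed = 0
    for path in glob.iglob(os.path.join(tmpdir, pattern)):
        try:
            os.remove(path)
            removed += 1
        except OSError:
            pass  # file may already be gone, or owned by another user
    return removed
```

Calling something like this every few thousand images should keep the directory from growing into the six-figure file counts described above.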
> Hope this solves the issue for someone in the future. Happy OCRing 
> everyone!
>
> [image: Screen Shot 2020-04-07 at 11.49.25 AM.png]
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/d1e27edc-da11-4a9d-92f0-4ec02e1f6790%40googlegroups.com.
