More of a resolution: it looks like the issue arose because I was using
0.3.1, and there was a bug fix in 0.3.2 for properly cleaning up temp
files: https://github.com/madmaze/pytesseract/releases. So upgrading
pytesseract is likely the best course of action.
On Tuesday, 7 April 2020 12:45:18 UTC-4, Michael Keenan wrote:
>
> Hello,
>
> I was actually planning to post here to ask for help, but I've recently
> figured out the problem. I'm posting for any future individuals who come
> across this problem, since it took me a few days of trial and error to
> figure out (and Python multiprocessing constantly finds new ways to
> present a challenge).
>
> *Background:* I'm running a massive data transformation in Python using
> multiprocessing on 48-96 CPU AWS EC2 machines. My goal is to use
> pytesseract to transform millions of images into OCR data for machine
> learning model training. Each process receives its own batch of images,
> and the plan is that they go to work OCRing for a few days. A couple of
> notable items: I'm using the recommended environment setting
> os.environ["OMP_THREAD_LIMIT"] = "1", and I'm also setting
> multiprocessing.pool.Pool's maxtasksperchild to ~1000 in hopes of
> keeping the process environment clean as it runs
> pytesseract.image_to_data() over many images.
>
> *Problem:* After kicking off the script, I watched CPU utilization and
> was baffled by the left two sessions on the chart below, where
> utilization slowly drops off over time before I kill the job to
> troubleshoot. At first glance, it was as if tesseract was getting tired
> and just slowing down over time.
>
> *Solution:* After much trial and error, I finally figured out that my
> temp directory (/tmp/) was getting so full (100k+ files) that it seemed
> to cause some I/O overhead somewhere, possibly when writing the image
> files locally for OCR, which slowly tanked my CPU utilization over time.
> The solution is to periodically run
> pytesseract.pytesseract.cleanup("/tmp/tess*") throughout my script so
> that this directory stays reasonably sized. In the bottom-right session
> on the chart, you can see how this improved CPU utilization (albeit
> temporarily) until the directory started to fill again.
>
> I believe Linux normally only cleans temp directories after some number
> of days (or on reboot), so it is necessary to run this cleanup
> periodically within the script.
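A self-contained sketch of that periodic cleanup, for anyone who wants to see what it amounts to. This helper (`clean_tess_tmp` is my own name for illustration) mimics the effect of the `pytesseract.pytesseract.cleanup("/tmp/tess*")` call above using only the standard library:

```python
import glob
import os


def clean_tess_tmp(pattern="/tmp/tess*"):
    """Remove leftover tesseract temp files matching the glob pattern.

    Same spirit as pytesseract.pytesseract.cleanup("/tmp/tess*"), but
    self-contained. Returns the number of files removed.
    """
    removed = 0
    for path in glob.glob(pattern):
        try:
            os.remove(path)
            removed += 1
        except OSError:
            pass  # another worker may have removed it already
    return removed
```

In the main loop you would call this every few thousand images (the exact interval is a tuning choice) so /tmp/ never grows large enough to hurt throughput.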
>
> Hope this solves the issue for someone in the future. Happy OCRing
> everyone!
>
> [image: Screen Shot 2020-04-07 at 11.49.25 AM.png]
>
>