On Saturday, February 8, 2020 at 3:54:56 PM UTC-5, farhad khalafi wrote:
>
> I used the official Tesseract 5.0 alpha build for 64-bits under Windows to
> do this test. The document is a single page TIFF image of a noisy
> engineering drawing. Using segmentation mode 6, the file was processed in
> 30 minutes. I tried mode 11 to look for sparse text next. The processing
> time increased to over one hour.
>
> Normally, I wouldn't attempt to OCR a file like this. However, we have a
> project that has a large number of scanned images and it is impractical to
> examine files individually.
>
> Is there a way to set a timeout or get some preliminary data during
> segmentation so that we can detect and skip such noisy files?
>
> ... The actual file is about 3.5MB TIFF G4 compressed.
>
It seems like you could do some simple frequency analysis in a preprocessing program to detect the high-frequency noise. If the volume of images is large enough to justify the engineering effort, you could probably even do the analysis in the domain of the G4 codewords, without decompressing the image at all.

Also, do you need to OCR the entire image? Most engineering drawings are well structured, with the important information in specific corners of the image. Could you just OCR those blocks?

Tom
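
A rough sketch of both ideas, in case it helps. It is not a tested recipe: it works on already-decoded pixels (not on the G4 codewords), it uses the density of black/white transitions along each row as a crude stand-in for real frequency analysis, and every name and number in it (`transition_density`, `looks_noisy`, `corner_box`, the 0.2 threshold, the corner fractions) is a made-up placeholder that would need tuning on your own scans.

```python
from typing import List, Tuple

# A bilevel page as rows of pixels: 0 = white, 1 = black.
Bitmap = List[List[int]]


def transition_density(bitmap: Bitmap) -> float:
    """Fraction of horizontally adjacent pixel pairs whose colours differ.

    Clean line art has long runs and a low value; speckle noise breaks
    the runs up and pushes the value towards 1.0.
    """
    changes = 0
    pairs = 0
    for row in bitmap:
        for left, right in zip(row, row[1:]):
            changes += left != right
            pairs += 1
    return changes / pairs if pairs else 0.0


def looks_noisy(bitmap: Bitmap, threshold: float = 0.2) -> bool:
    """Flag a page for skipping. The threshold is an assumed value."""
    return transition_density(bitmap) > threshold


def corner_box(width: int, height: int,
               frac_w: float = 0.35,
               frac_h: float = 0.25) -> Tuple[int, int, int, int]:
    """(left, upper, right, lower) box covering the lower-right corner.

    The fractions are guesses at where a title block might sit; other
    drawings may keep it in a different corner.
    """
    left = width - round(width * frac_w)
    upper = height - round(height * frac_h)
    return (left, upper, width, height)
```

The idea would be to run `looks_noisy` on a downsampled copy of each page first and only hand the page (or just the `corner_box` region) to Tesseract when it passes. The box is in the same left/upper/right/lower order that most imaging libraries expect for a crop.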

