On Saturday, February 8, 2020 at 3:54:56 PM UTC-5, farhad khalafi wrote:
>
> I used the official 64-bit Windows build of Tesseract 5.0 alpha for this 
> test. The document is a single-page TIFF image of a noisy 
> engineering drawing. Using segmentation mode 6, the file was processed in 
> 30 minutes. I tried mode 11 to look for sparse text next. The processing 
> time increased to over one hour. 
>
> Normally, I wouldn't attempt to OCR a file like this. However, we have a 
> project that has a large number of scanned images and it is impractical to 
> examine files individually. 
>
> Is there a way to set a timeout or get some preliminary data during 
> segmentation so that we can detect and skip such noisy files? 
>
> ... The actual file is a G4-compressed TIFF of about 3.5 MB.
>

It seems like a preprocessing program could detect the high-frequency noise 
with some simple frequency analysis. If the volume of images is enough to 
justify the engineering effort, you could probably even do the analysis 
directly on the G4 codewords, without decompressing the image.
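
A minimal sketch of one such check, assuming the page has already been 
decoded into rows of 0/1 pixels: count how often the pixel value flips along 
each row. Noisy scans flip far more often than clean line art. The function 
names and the 0.2 threshold are hypothetical and would need tuning on known 
good and bad scans. This is a crude stand-in for a real frequency analysis, 
not a Tesseract feature.

```python
def transition_density(rows):
    """Return the number of black/white flips per pixel, counted along rows.

    `rows` is a sequence of equal-length lists of 0/1 pixel values
    (an assumed in-memory form of the decoded bilevel image).
    """
    transitions = 0
    pixels = 0
    for row in rows:
        pixels += len(row)
        # Compare each pixel with its right-hand neighbour.
        transitions += sum(1 for a, b in zip(row, row[1:]) if a != b)
    return transitions / pixels if pixels else 0.0


def looks_noisy(rows, threshold=0.2):
    """Flag a page whose flip density exceeds a hypothetical threshold."""
    return transition_density(rows) > threshold


# Tiny synthetic pages: one clean edge per row vs. alternating speckle.
clean_page = [[0] * 8 + [1] * 8 for _ in range(4)]
noisy_page = [[0, 1] * 8 for _ in range(4)]
print(looks_noisy(clean_page), looks_noisy(noisy_page))
```

Files flagged this way could be skipped or queued for manual review before 
they ever reach Tesseract.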

Also, do you need to OCR the entire image? Most engineering drawings are 
well structured, with the important information (the title block, for 
example) in specific corners of the image. Could you just OCR those blocks?
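
A rough sketch of how those blocks might be located, assuming the text of 
interest sits within a fixed fraction of the page at each corner. The 
function name and the default fractions are hypothetical; the real layout 
depends on the drawing template. Each box is a (left, upper, right, lower) 
tuple, the same order Pillow's Image.crop expects, so the crops can be 
saved and passed to Tesseract one at a time.

```python
def corner_boxes(width, height, frac_w=0.25, frac_h=0.25):
    """Return crop boxes for the four corners of a width x height page.

    frac_w and frac_h are assumed fractions of the page covered by each
    corner block; adjust them to match the actual drawing template.
    """
    bw = int(width * frac_w)   # block width in pixels
    bh = int(height * frac_h)  # block height in pixels
    return {
        "top_left": (0, 0, bw, bh),
        "top_right": (width - bw, 0, width, bh),
        "bottom_left": (0, height - bh, bw, height),
        "bottom_right": (width - bw, height - bh, width, height),
    }


# Example: boxes for a hypothetical 1000 x 800 pixel scan.
for name, box in corner_boxes(1000, 800, frac_w=0.5, frac_h=0.25).items():
    print(name, box)
```

Restricting OCR to these regions also shrinks the area the segmenter has to 
process, which should help with the run time on noisy pages.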

Tom
