Using Tesseract 5.0 on Windows with psm=6, oem=1, and the 2018 eng.traineddata (LSTM + legacy).
As shown in the attached example files, Tesseract *sometimes* adds characters to the output stream out of thin air. Attached are:

Invented Characters Input.png - file input to Tesseract
Invented Characters Output.txt - Tesseract text output

Look at the sixth non-blank line of the output, which begins with "STYLE". After "PRODUCTION DATE" on that line there are two tildes "~~" followed by the date "06/21/15". In the input .png file, the image is completely blank between "PRODUCTION DATE" and the date.

So why and how is Tesseract inventing the in-between characters? I have seen other cases like this, more frequently when using the FAST version of eng.traineddata.
STOCK# ACTVY. DATE 12/11/19 VIN 1FTEW1EG8FKD90892 YEAR 15 MAKE FT FORD TRUCK MODEL MAINT CODE FOZZ LICENSE# MODEL NUMBER WI1E MODEL F-150 SERIES DESCRIPTION 4WD SS CRW STYLE 4WD SS CRW PRODUCTION DATE ~~ 06/21/15 HR TRIM LEVEL DELIVERY MILEAGE 16 DELIVERY DATE 07/25/15 DEMO DATE IN-SERV DATE SERVICE DAYS 12 DEMO MILEAGE ADVISOR
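
One way to investigate tokens like the "~~" above is to ask Tesseract for TSV output (for example by adding the `tsv` config name to the command line), which reports a confidence value for each word, and then look at the low-confidence words. The sketch below is only an illustration: the sample rows, their coordinates and confidences, and the threshold of 60 are made up, and nothing in this report confirms that the invented characters actually receive a low confidence.

```python
import csv
import io

# Column names used by Tesseract's TSV output.
TSV_COLUMNS = ["level", "page_num", "block_num", "par_num", "line_num",
               "word_num", "left", "top", "width", "height", "conf", "text"]


def filter_low_confidence(tsv_text, min_conf=60.0):
    """Split word-level TSV rows into (kept, dropped) lists of (text, conf).

    min_conf is an arbitrary, hypothetical threshold; tune it on real data.
    """
    kept, dropped = [], []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are words; other levels describe pages, blocks, lines.
        if row["level"] != "5" or not text:
            continue
        conf = float(row["conf"])
        (kept if conf >= min_conf else dropped).append((text, conf))
    return kept, dropped


# Hypothetical sample rows (invented values, not taken from the real files).
sample_rows = [
    ["4", "1", "1", "1", "6", "0", "10", "200", "900", "20", "-1", ""],
    ["5", "1", "1", "1", "6", "1", "400", "200", "110", "20", "96.1", "PRODUCTION"],
    ["5", "1", "1", "1", "6", "2", "515", "200", "45", "20", "95.4", "DATE"],
    ["5", "1", "1", "1", "6", "3", "600", "205", "20", "8", "21.5", "~~"],
    ["5", "1", "1", "1", "6", "4", "700", "200", "80", "20", "93.0", "06/21/15"],
]
sample_tsv = "\n".join("\t".join(r) for r in [TSV_COLUMNS] + sample_rows)

kept, dropped = filter_low_confidence(sample_tsv)
print("kept:", [t for t, _ in kept])
print("dropped:", dropped)
```

If the spurious characters turn out to have confidences similar to the real words, this kind of filter will not help, and the page segmentation mode or image preprocessing would be the next things to look at.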

