Using Tesseract 5.0 on Windows with psm=6, oem=1, and the 2018 
eng.traineddata (LSTM + legacy).


As shown in the attached example files, Tesseract *sometimes* just adds 
characters out of thin air into the output stream. Attached are:


Invented Characters Input.png - file input to Tesseract
Invented Characters Output.txt - Tesseract text output


If you look at the sixth non-blank line in the output, which begins with 
"STYLE", you will see that after "PRODUCTION DATE" on that line there are 
two tildes "~~" followed by the date "06/21/15".


If you look at the input .png file you will see that the image is 
completely and entirely blank between "PRODUCTION DATE" and the date. So 
why and how is Tesseract essentially inventing the in-between characters 
out of thin air?


I have seen other cases like this, more frequently when using the FAST 
version of the eng.traineddata.
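
For anyone who wants to check cases like this, here is a minimal sketch that 
lists low-confidence words from Tesseract's TSV output. It assumes the TSV 
was produced with something like 
`tesseract "Invented Characters Input.png" out --psm 6 --oem 1 tsv`. 
The function name `suspicious_words`, the threshold of 60, and every number 
in `SAMPLE_TSV` are made up for illustration; they are not the real values 
from my run, and I do not know that the invented tildes would actually score 
low.

```python
import csv
import io

# Column layout of Tesseract's TSV renderer.
HEADER = ["level", "page_num", "block_num", "par_num", "line_num", "word_num",
          "left", "top", "width", "height", "conf", "text"]

# Hypothetical rows for the "STYLE" line; positions and confidences are
# invented for illustration, not taken from the real output.
_ROWS = [
    HEADER,
    ["4", "1", "1", "1", "6", "0", "10", "200", "900", "20", "-1", ""],
    ["5", "1", "1", "1", "6", "1", "10", "200", "60", "20", "96.1", "STYLE"],
    ["5", "1", "1", "1", "6", "2", "400", "200", "120", "20", "95.0", "PRODUCTION"],
    ["5", "1", "1", "1", "6", "3", "530", "200", "50", "20", "94.2", "DATE"],
    ["5", "1", "1", "1", "6", "4", "600", "205", "30", "8", "21.5", "~~"],
    ["5", "1", "1", "1", "6", "5", "700", "200", "90", "20", "93.0", "06/21/15"],
]
SAMPLE_TSV = "\n".join("\t".join(r) for r in _ROWS) + "\n"


def suspicious_words(tsv_text, min_conf=60.0):
    """Return (text, conf, left, top, width, height) for each word row
    whose confidence is below min_conf."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    flagged = []
    for row in reader:
        if row["level"] != "5":  # level 5 rows are individual words
            continue
        text = (row["text"] or "").strip()
        if not text:
            continue
        conf = float(row["conf"])
        if conf < min_conf:
            flagged.append((text, conf, int(row["left"]), int(row["top"]),
                            int(row["width"]), int(row["height"])))
    return flagged


if __name__ == "__main__":
    for text, conf, left, top, width, height in suspicious_words(SAMPLE_TSV):
        print(f"{text!r} conf={conf} box=({left},{top},{width},{height})")
```

The bounding box of each flagged word can then be compared against the same 
region of the input image to see whether anything is actually there.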

Invented Characters Output.txt:

STOCK# ACTVY. DATE 12/11/19

VIN 1FTEW1EG8FKD90892 YEAR 15

MAKE FT FORD TRUCK MODEL MAINT CODE FOZZ

LICENSE# MODEL NUMBER WI1E

MODEL F-150 SERIES DESCRIPTION 4WD SS CRW
STYLE 4WD SS CRW PRODUCTION DATE ~~ 06/21/15 HR
TRIM LEVEL DELIVERY MILEAGE 16
DELIVERY DATE 07/25/15 DEMO DATE

IN-SERV DATE SERVICE DAYS 12

DEMO MILEAGE ADVISOR
