[tesseract-ocr] tesseract 4 skips over some text

Chris Hawley Tue, 18 Jul 2017 13:35:19 -0700

The file that I am running OCR on:

https://drive.google.com/file/d/0B-iKKP8eIvdgZkhObUVXUVJ1N28/view?usp=sharing

Before anyone asks, it's part of the CIA's CREST dataset. I noticed that
tesseract seems to skip over some text. The command that I am using is:

E:\Tesseract\build\bin\Release\tesseract.exe --psm 1 --oem 1 "D:\split\Folder 001\1946-06-21.tiff" test.txt
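
For anyone who wants to reproduce this, here is a minimal sketch of how the same run could be scripted to compare a few page segmentation modes and check whether a known phrase shows up. It assumes Python 3.7+ and uses "stdout" as the output base so the text can be captured directly. The helper names (build_command, missing_phrases, run_ocr) and the choice of modes to try are just for illustration; I have not verified that any particular mode recovers the missing line.

```python
import subprocess


def build_command(exe, image, psm, oem=1):
    """Build the argument list for one tesseract run that prints text to stdout."""
    # Documented order: image, output base, then options.
    return [exe, image, "stdout", "--psm", str(psm), "--oem", str(oem)]


def missing_phrases(text, phrases):
    """Return the phrases that do not occur in text (case-insensitive)."""
    lowered = text.lower()
    return [p for p in phrases if p.lower() not in lowered]


def run_ocr(exe, image, psm, oem=1):
    """Run tesseract once and return the recognized text."""
    result = subprocess.run(
        build_command(exe, image, psm, oem),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Paths taken from the command above.
    exe = r"E:\Tesseract\build\bin\Release\tesseract.exe"
    image = r"D:\split\Folder 001\1946-06-21.tiff"
    # 1 = auto with OSD, 3 = auto without OSD, 4 = single column, 6 = single block.
    for psm in (1, 3, 4, 6):
        text = run_ocr(exe, image, psm)
        print("psm", psm, "missing:", missing_phrases(text, ["weight"]))
```

The phrase check is only a rough substring test, so a misread such as "welght" would still be reported as missing.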

The output is:

21 June 1946

MEMORANDUM For SUPERVISING AGENT,
U. S. SECRET SERVICE,
WHITE Hous®.

1. - It is requested that a White House pass be issued to
Lieutenant General Hoyt S. VANDENBERG, Director of Central Intel-

ligence.

2. - In connection with his official duties, it is necessary
for General Vandenberg to visit the White House frequently,.

3% His physical description is:

Height =-- 6 feet.
Hair «-- _ @FAY ,
Eyes -- _- blue.

Enclosed herewith is his photograph.

THOMAS F, CULLEN
Captain, USNR
Asgistant to the Director.

As you can see, it skips over the "weight -- 165 lbs" line. I'm not sure
whether this qualifies as a bug. Is there anything that I can do to improve
the results so that the line is included?

