The file that I am running OCR on:

https://drive.google.com/file/d/0B-iKKP8eIvdgZkhObUVXUVJ1N28/view?usp=sharing

Before anyone asks, it's part of the CIA's CREST dataset. I noticed that
Tesseract seems to skip over some text. The command that I am using is:

E:\Tesseract\build\bin\Release\tesseract.exe --psm 1 --oem 1 "D:\split\Folder 001\1946-06-21.tiff" test.txt

The output is:

21 June 1946

MEMORANDUM For SUPERVISING AGENT,
U. S. SECRET SERVICE,
WHITE Hous®.

 

1. - It is requested that a White House pass be issued to
Lieutenant General Hoyt S. VANDENBERG, Director of Central Intel-

ligence.

 

2. - In connection with his official duties, it is necessary
for General Vandenberg to visit the White House frequently,.

 

 

 

3% His physical description is:

Height =-- 6 feet.
Hair «-- _ @FAY ,
Eyes -- _- blue.

Enclosed herewith is his photograph.

THOMAS F, CULLEN
Captain, USNR
Asgistant to the Director.

 

If you look at the output, it skips over the "weight -- 165 lbs" line. I
wasn't sure whether this qualifies as a bug. Is there anything that I can do
to improve the results so that the line is included?
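
One way to narrow this down is to check whether the page segmentation mode is what drops the line. The following is a minimal sketch, not a confirmed fix: it assumes Python 3.7+ is available, and the helper names (`build_command`, `contains_line`, `sweep_psm`) and the list of modes tried are illustrative choices, not anything Tesseract prescribes. It reruns the same image under several `--psm` values and reports which ones produce the missing text.

```python
import subprocess


def build_command(image, psm, oem=1, exe="tesseract"):
    """Build a Tesseract CLI call that prints the recognized text to stdout."""
    # Passing "stdout" as the output base makes Tesseract write the text
    # to standard output instead of creating a file.
    return [exe, image, "stdout", "--psm", str(psm), "--oem", str(oem)]


def contains_line(text, needle):
    """Case-insensitive search that ignores all whitespace differences."""
    def squash(s):
        return "".join(s.lower().split())
    return squash(needle) in squash(text)


def sweep_psm(image, needle, modes=(1, 3, 4, 6, 11), exe="tesseract"):
    """Run OCR once per page segmentation mode; map each mode to found/not found."""
    results = {}
    for psm in modes:
        proc = subprocess.run(
            build_command(image, psm, exe=exe),
            capture_output=True,
            encoding="utf-8",
            errors="replace",
        )
        results[psm] = contains_line(proc.stdout, needle)
    return results
```

With the paths from the command above, a call might look like `sweep_psm(r"D:\split\Folder 001\1946-06-21.tiff", "weight", exe=r"E:\Tesseract\build\bin\Release\tesseract.exe")`. Searching for just "weight" rather than the whole line is deliberate, since the digits and dashes may be misrecognized even when the line is detected.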
