[tesseract-ocr] 4.0 "Textline Level"

2018-12-19 Thread tcs49
While reading the training wiki for 4.0, I was confused by this line about boxfile creation: "The boxes only need to be at the *textline level.* It is thus *far easier* to make training data from existing image data." What does "textline level" mean? Would that be an entire line of text on a

[tesseract-ocr] Expected output of LSTMTRAINING

2019-01-07 Thread tcs49
Hey all, After some wrangling, I've been able to get Tesseract to successfully train on my dataset (i.e. the lstmtraining application runs to completion without critical errors). However, it's not clear in the wiki what exactly the output of lstmtraining is. In the output directory I set for

Re: [tesseract-ocr] Expected output of LSTMTRAINING

2019-01-07 Thread tcs49
Never mind. It seems like it wasn't working because I wasn't explicitly setting the --tessdata-dir flag to the correct /tessdata/ on my system. On Monday, January 7, 2019 at 12:58:36 PM UTC-5, tc...@zips.uakron.edu wrote: > > So I was able to successfully get a traineddata file from lstmtraining
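A minimal sketch of the fix described above, assuming the same invocation as in the original question (tesseract ../test.png out -l lso --oem 1 --psm 7). The helper name build_tesseract_command and the path /usr/local/share/tessdata are hypothetical; substitute the directory that actually holds the trained lso.traineddata.

```python
def build_tesseract_command(image, output_base, lang, tessdata_dir, oem=1, psm=7):
    """Build a tesseract argv list that passes --tessdata-dir explicitly."""
    return [
        "tesseract", image, output_base,
        "--tessdata-dir", tessdata_dir,  # directory containing <lang>.traineddata
        "-l", lang,
        "--oem", str(oem),  # 1 = LSTM engine
        "--psm", str(psm),  # 7 = treat the image as a single text line
    ]


# Hypothetical tessdata location; adjust for your system.
cmd = build_tesseract_command("../test.png", "out", "lso", "/usr/local/share/tessdata")
print(" ".join(cmd))
```

The list can be handed to subprocess.run(cmd, check=True) once Tesseract is installed; it is only printed here so the sketch runs without it.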

Re: [tesseract-ocr] Expected output of LSTMTRAINING

2019-01-07 Thread tcs49
So I was able to successfully get a traineddata file from lstmtraining but have encountered a new error. When I try to run Tesseract against an image as follows: tesseract ../test.png out -l lso --oem 1 --psm 7 I get the following error: Failed to read boxes from ../test.png Any

[tesseract-ocr] Tesstrain.sh fails when provided > 7 tif/box pairs

2019-01-04 Thread tcs49
Hey all, I'm currently working on a program that explores the handwritten OCR capabilities of Tesseract. I have ~1400 images with ~8 handwritten textlines per image and accompanying BOX files. Additionally, I've got a couple of handwritten fonts that I'm using to bootstrap the

Re: [tesseract-ocr] Tesstrain.sh fails when provided > 7 tif/box pairs

2019-01-04 Thread tcs49
Yeah, I gave it quite a while to complete and it was still stuck on the same text2image call. Upon inspection, I see that it's hanging after the eighth call to text2image during Phase I, when the synthetic images are being generated. I'm getting the same behavior using the unmodified tesstrain

[tesseract-ocr] Re: Tesstrain.sh fails when provided > 7 tif/box pairs

2019-01-04 Thread tcs49
Disregard my last question. I figured out how to modify the batch size and found that it will hang indefinitely after processing the first batch of files if the specified batch size is smaller than the number of files I want to process. I set the batch size to and everything seems to be

Re: [tesseract-ocr] Re: Tesstrain.sh fails when provided > 7 tif/box pairs

2019-01-04 Thread tcs49
I'm using Tesseract v4.0.0.20181030 which I cloned from the main GitHub page two days ago. I built Tesseract and the training tools from source with the Autotools and Make files. Tesseract and the training tools are being run on a WSL install of Ubuntu v18.04.1 LTS on a VirtualBox VM running

[tesseract-ocr] --eval_listfile question

2019-01-08 Thread tcs49
Hey all, I've got a few questions regarding eval_listfile: 1) The listed files are .lstmf, right? 2) Should these be generated in the same tesstrain.sh process as the training files, or should they be obtained from a tesstrain.sh process independently? I ask this since based on my

Re: [tesseract-ocr] Box file layout for training tesseract4

2019-01-30 Thread tcs49
Here's a Google Drive link to a few examples of mine: https://drive.google.com/file/d/1Bhl8nv6rRx2xu5tQx_T1Ru9dvbCyAu6H/view?usp=sharing Each textline in the image has a line in the boxfile for each character in the textline. The box dimensions following a single character are not for a single
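A minimal sketch of the layout described above: one boxfile entry per character, each carrying the bounding box of its textline. The function name line_box_entries and the sample coordinates are hypothetical. The trailing tab entry that marks the end of a textline follows the convention used by the tesstrain helper scripts as far as I know; treat it as an assumption and check it against your own files.

```python
def line_box_entries(text, left, bottom, right, top, page=0):
    """Return boxfile entries for one textline.

    Every character gets the same box: that of the whole textline.
    Coordinates are pixels with the origin at the bottom-left of the image.
    """
    entries = [f"{ch} {left} {bottom} {right} {top} {page}" for ch in text]
    # Assumed convention: a tab-character entry closes the textline.
    entries.append(f"\t {right} {top} {right + 1} {top + 1} {page}")
    return entries


# Hypothetical textline "Hi" spanning x = 10..300 and y = 20..60.
entries = line_box_entries("Hi", 10, 20, 300, 60)
for entry in entries:
    print(entry)
```

Repeating the call for each textline in an image and concatenating the results gives one boxfile for that image.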

Re: [tesseract-ocr] How to increase tesseract model accuracy

2019-05-03 Thread tcs49
How did you add a blacklist? On Monday, April 29, 2019 at 11:32:14 PM UTC-4, Jonathan wrote: > > If you know you won't have numbers, what worked for me is blacklisting > numbers. Otherwise you will have to improve the image quality (like > resizing to a bigger size and sharpening the edges) > > On
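One way a digit blacklist is commonly set is through the tessedit_char_blacklist config variable, passed with -c. This is a sketch, not necessarily what the quoted poster did. The helper name build_blacklist_command and the file names are hypothetical, and the variable may not be honored by the LSTM engine in Tesseract 4.0, so verify the behavior on your build.

```python
def build_blacklist_command(image, output_base, blacklist="0123456789", lang="eng"):
    """Build a tesseract argv list that blacklists the given characters."""
    return [
        "tesseract", image, output_base,
        "-l", lang,
        # -c sets a config variable; tessedit_char_blacklist lists characters to suppress.
        "-c", f"tessedit_char_blacklist={blacklist}",
    ]


# Hypothetical input image and output base name.
blacklist_cmd = build_blacklist_command("page.png", "out")
print(" ".join(blacklist_cmd))
```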