While I do not appreciate the prickliness of your response, you did help 
me see what the problem was by providing a working example built from my 
files, so thank you nonetheless.  The y-coordinates are swapped in the 
auto-generated box file produced by TesseractTrainer.  It should have been 
obvious from what I was seeing in cowboxer; I was staring at it too long, I 
guess. 
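
For anyone who hits the same thing, here is a minimal sketch of the flip. It assumes the boxes were measured from the top-left corner of the image (y growing downward), while Tesseract box files count y from the bottom-left corner. The function name and arguments are made up for illustration; this is not TesseractTrainer's actual code.

```python
def to_tesseract_box(char, left, top, right, bottom, image_height, page=0):
    """Build one box-file line from a box measured in top-left coordinates.

    Hypothetical helper: `top` and `bottom` are pixel rows counted downward
    from the top edge of the image. A Tesseract box-file line is
    "<char> <left> <bottom> <right> <top> <page>" with y counted upward
    from the bottom edge, so both y values are mirrored.
    """
    tess_bottom = image_height - bottom  # lower edge, measured from the bottom
    tess_top = image_height - top        # upper edge, measured from the bottom
    return f"{char} {left} {tess_bottom} {right} {tess_top} {page}"


# A glyph spanning rows 20..50 of a 100-pixel-tall image:
print(to_tesseract_box("A", 10, 20, 30, 50, image_height=100))
# prints: A 10 50 30 80 0
```

If the third number on a line comes out larger than the fifth, the two y values are still swapped.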

The preprocessing, image size, cropping and compression had nothing to do 
with it: training works just as well on an uncompressed .tif as on a 
compressed .png.

I also discovered that TesseractTrainer ignores the baseline location 
of the font in the boxes it generates.  So I modified it to subtract 
that offset, and presto: working auto-generated training imagery.  Pretty 
slick.  Kudos go to Balthazar Rouberol, who wrote this little gem.  

PS: Last I checked, broadband connections and Google's file servers are 
easily capable of handling a 1.6MB tiff, which I included, because that is 
natively what TesseractTrainer generates...(uncompressed .tif).  If you are 
still on a 56k modem, then you have my apologies for eating your 
bandwidth...


On Monday, October 28, 2013 2:58:27 PM UTC-6, zdenop wrote:
>
> Well, if you searched this forum profusely, you would already know that 
> training for a "common" font is a waste of time ;-)
> Nobody (including the most experienced members of this forum) has got 
> better OCR results by re-training a "common" font (like arial, times). If 
> this statement is not true, then let me know and send the proof :-)  I will 
> create a tesseract hall-of-fame for you ;-)
>
> I would suggest you focus on image pre-processing (=> making it optimal 
> for OCR) rather than tesseract training.
>
> Next: if you get strange output, check whether the input is the cause. 
> See what simple cropping of the image can do: eng.arial.exp1.png
>
> Then:
>
>    1. tesseract eng.arial.exp1.png eng.arial.exp1 makebox
>    2. check&edit box file
>    3. tesseract eng.arial.exp1.png eng.arial.exp1 -psm 7 box.train
>    
> And here we are:
> Tesseract Open Source OCR Engine v3.02 with Leptonica
> APPLY_BOXES:
>    Boxes read from boxfile:      31
>    Found 31 good blobs.
> TRAINING ... Font name = arial
> Generated training data for 8 words
>
>
> PS: Sending a 1.6 MB image containing one short line of text is not a very 
> good idea. Compression or a better image format would be more efficient. See 
> the size of eng.arial.exp1.png
>
> Zdenko
>
>
> On Mon, Oct 28, 2013 at 3:29 AM, Jonathan Nikkel <[email protected]> 
> wrote:
>
>> Hey there,
>>
>> I am a Tesseract novice, and would like to solicit some help/advice from 
>> you smart folks.  I will preface by saying that I have read the FAQ, 
>> searched this forum profusely, read all of the topics, and tried all the 
>> suggestions/advice I found, with no luck so far.  This is probably not a 
>> difficult one, I assume I must be missing something stupid, but hey, that 
>> is why we have forums like these =).
>>
>> What I am using: 
>> Windows 7 box
>> Tesseract v3.02
>> TesseractTrainer (auto-generated .tif's based on input training text, 
>> automates the training process)
>>
>> I am able to successfully train the off-the-shelf arial training data 
>> included with the Tesseract dev files.
>>
>> I am now trying to train a custom data set with the Arial font (no mods, 
>> standard installed with windows) using this setup to make sure I understand 
>> this training process/code, and am setting things up correctly, before 
>> moving on to more complex fonts.
>>
>> I am getting 100% failures in blob recognition/box resegmentation, and am 
>> puzzled as to why.  I have tried numerous combinations of character 
>> spacing, line spacing, font size, image bit depth (I am now using a binary 
>> image), DPI (using 300 dpi, 3600x3600 now, to be consistent with the 
>> example trainings), and am trying to home in using a font size that 
>> achieves an xheight of 25 pixels.  I have checked the box file accuracy 
>> using cowboxer, and am getting accurate boxes it appears.
>>
>> Attached are some example files; I have tried alternative character 
>> spacings from nearly touching, up to about double what you see here.  I 
>> have tried all of the pageseg modes, using the {prefix}.tif {prefix} 
>> nobatch box.train parameters.  Pageseg mode 4 crashes; the rest 
>> generate 100% resegmentation errors.
>>
>> Where am I going wrong?  Does anyone have a working example setup with 
>> TesseractTrainer they can share?
>>
>> Regards,
>>
>> -Jon
>>
>>
>>
>>
>>  -- 
>> -- 
>> You received this message because you are subscribed to the Google
>> Groups "tesseract-ocr" group.
>> To post to this group, send email to [email protected]
>> To unsubscribe from this group, send email to
>> [email protected]
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en
>>  
>> --- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> For more options, visit https://groups.google.com/groups/opt_out.
>>
>
>
