[tesseract-ocr] Tesseract 3.02 does not detect inter-word spacing for Bengali language.

Tawfiq Chowdhury Fri, 15 May 2015 13:13:04 -0700

 

I am building a traineddata file for the Bengali language. The problem is 
that Tesseract does not recognize most of the spaces in the input image and 
runs almost all of the characters together into one long word instead of 
separate words and sentences. That is with a large training set, where it 
still detects some spaces; with a small training set it detects none. To 
check, I built an English traineddata from only the 26 letters of the 
English alphabet, and Tesseract detects word spacing for English but not 
for Bengali. I am using the 3.02.02 Windows installer. Please tell me which 
configuration to edit to make this work. Here are some Bengali characters 
as an example:


আ মা দে র দে শে র না ম বা লা দে শ

The text in an input image might look like this: আমাদের দেশের নাম বালাদেশ

However, Tesseract produces output like this: আমাদেরদেশেরনামবালাদেশ
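
To put a number on the symptom, a small Python sketch along these lines could count how many word gaps are lost between a ground-truth line and the OCR output. The function name `lost_spaces` is hypothetical and is not part of Tesseract. The sketch assumes the two strings differ only in whitespace, as in the example above, and raises an error otherwise.

```python
def lost_spaces(expected: str, actual: str) -> int:
    """Count the word gaps present in `expected` but missing from `actual`.

    Assumes both strings contain the same characters apart from whitespace;
    raises ValueError if they differ in anything else.
    """
    # Strip all whitespace and compare, so only spacing differences are counted.
    if "".join(expected.split()) != "".join(actual.split()):
        raise ValueError("texts differ in more than spacing")
    # Each missing gap merges two words, so the word-count difference
    # equals the number of lost gaps.
    return len(expected.split()) - len(actual.split())


if __name__ == "__main__":
    ground_truth = "আমাদের দেশের নাম বালাদেশ"
    # Simulate the reported output by removing every space.
    ocr_output = ground_truth.replace(" ", "")
    print(lost_spaces(ground_truth, ocr_output))
```

Running a check like this over a whole test page would show whether a change to the training data or configuration recovers more word gaps.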

I am doing my thesis on this and need help urgently. Thanks in advance. 
Is there a 3.03 or 3.04 build for Windows? I have heard there is a 3.03 
beta version.
