Hi,
I am trying to train Tesseract 4 with the *ocrd-train* Makefile
(<https://github.com/OCR-D/ocrd-train>), except for the box file
generation step. The box files are generated with the command below:

text2image --text "/traintext.txt" --outputbase "/traintext" \
    --fontconfig_tmpdir "/fontconfig" --fonts_dir "/usr/share/fonts" \
    --font "Jameel Noori Nastaleeq" --leading 32

I set the number of iterations to 25000 by editing the Makefile, and used
the following command to generate the training data:

sudo make training MODEL_NAME=urd1 START_MODEL=urd TESSDATA_REPO=_best \
    WORD_LIST=urd1.worldlist.clean

The start model is the existing *Urdu* model from the *_best* repository.

All related files are linked below:

   1. Text <https://drive.google.com/open?id=17Sb37gwKEf4QquQW9sR6H9OSRSt6m4gc>
   2. Tif <https://drive.google.com/open?id=182f5nkH2XOCJciNSB_uZeFUT2VRGJrX2>
   3. Box <https://drive.google.com/open?id=10TU2fMwg9wn4Ku9jYh-1hB0_PNHpIbLx>
   4. Font <https://drive.google.com/open?id=1ZEMCa-GSOmvd07__8HgRwM57ZCuTWu4n>

The "Jameel Noori Nastaleeq" font is ligature based. Training completes
successfully, but when I run OCR on the image below with the following
code, the spaces between a few words are missing from the output.


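// Imports assume the Tess4J wrapper; the class name is arbitrary.
import java.io.File;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class OcrTest {
    public static void main(String[] args) {
        File imageFile = new File("testing.png");
        ITesseract instance = new Tesseract();
        // Directory that contains urd1.traineddata
        instance.setDatapath("tessdata");

        try {
            // Use the newly trained model
            instance.setLanguage("urd1");
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}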

[image: testing3.png]
OCR result:
ہفتوںکی منصوبہ بندی حقیقتکا روپ دھار نےلگی

This is the same line that was used during training. The OCR output is fine
except for the missing space between a few words, i.e.

[image: result.png]
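
To make the merged words easier to spot in longer outputs, a small check like the one below could be run on the OCR result. This is only a rough sketch, not part of Tesseract or Tess4J: the class `MergedWordCheck`, the method `findMergedTokens`, and the tiny lexicon are all hypothetical, and in practice the lexicon would probably be loaded from the word list used for training. It reports every token that is not a known word but splits cleanly into two known words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Tesseract or Tess4J.
public class MergedWordCheck {

    // Returns "left right" for each token that is not in the lexicon
    // but can be split into two words that both are in the lexicon.
    static List<String> findMergedTokens(String ocrLine, Set<String> lexicon) {
        List<String> merged = new ArrayList<>();
        for (String token : ocrLine.trim().split("\\s+")) {
            if (token.isEmpty() || lexicon.contains(token)) {
                continue;
            }
            // Try every split position and keep the first that works.
            for (int i = 1; i < token.length(); i++) {
                String left = token.substring(0, i);
                String right = token.substring(i);
                if (lexicon.contains(left) && lexicon.contains(right)) {
                    merged.add(left + " " + right);
                    break;
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Assumed mini lexicon; a real one would come from the training word list.
        Set<String> lexicon = new HashSet<>(Arrays.asList(
                "ہفتوں", "کی", "منصوبہ", "بندی", "حقیقت", "کا", "روپ"));
        String ocrLine = "ہفتوںکی منصوبہ بندی حقیقتکا روپ دھار نےلگی";
        for (String split : findMergedTokens(ocrLine, lexicon)) {
            System.out.println("Probable missing space: " + split);
        }
    }
}
```

The check only flags tokens whose two halves are both in the lexicon, so its usefulness depends entirely on how complete the word list is.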

I believe this is because the space between words in the Urdu Nastaleeq
writing style (with kerning) is diagonal rather than vertical:

[image: diagonalspace.PNG]

Is there any way to resolve this issue in Tesseract 4?
