Hi, I am trying to train Tesseract 4 with the ocrd-train Makefile (<https://github.com/OCR-D/ocrd-train>), except for the box file generation step. The box files are generated with the command below:

    text2image --text "/traintext.txt" --outputbase "/traintext" --fontconfig_tmpdir "/fontconfig" --fonts_dir "/usr/share/fonts" --font "Jameel Noori Nastaleeq" --leading 32

I set the number of iterations to 25000 by updating the Makefile, and ran the training with the following command:

    sudo make training MODEL_NAME=urd1 START_MODEL=urd TESSDATA_REPO=_best WORD_LIST=urd1.worldlist.clean

The start model is the existing *Urdu* model from the *_best* repository. All related files are linked below:

1. Text <https://drive.google.com/open?id=17Sb37gwKEf4QquQW9sR6H9OSRSt6m4gc>
2. Tif <https://drive.google.com/open?id=182f5nkH2XOCJciNSB_uZeFUT2VRGJrX2>
3. Box <https://drive.google.com/open?id=10TU2fMwg9wn4Ku9jYh-1hB0_PNHpIbLx>
4. Font <https://drive.google.com/open?id=1ZEMCa-GSOmvd07__8HgRwM57ZCuTWu4n>

The "Jameel Noori Nastaleeq" font is ligature based.

Training completes successfully, but when I run OCR on the image below with the following code, the space between some words is missing:

    import java.io.File;
    import net.sourceforge.tess4j.ITesseract;
    import net.sourceforge.tess4j.Tesseract;
    import net.sourceforge.tess4j.TesseractException;

    public static void main(String[] args) {
        File imageFile = new File("testing.png");
        ITesseract instance = new Tesseract();
        // Directory that contains the trained urd1.traineddata
        instance.setDatapath("tessdata");
        try {
            // Use the newly trained model
            instance.setLanguage("urd1");
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }

[image: testing3.png]

OCR result:

ہفتوںکی منصوبہ بندی حقیقتکا روپ دھار نےلگی

This is the same line that was used during training. The OCR output is fine except for the missing space between a few words, i.e.

[image: result.png]

I believe this happens because, in the Urdu Nastaleeq writing style (with kerning), the space between words is diagonal instead of vertical.

[image: diagonalspace.PNG]

Is there any way to resolve this issue in Tesseract 4?
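
In case it helps to check more lines than the single one shown, here is a rough sketch of how the merged words could be listed by comparing a ground-truth line with the OCR output. It is not part of Tess4J or ocrd-train: the class name `SpaceDiff` and the methods `mergedPairs` and `breakOffsets` are hypothetical, and the sketch assumes the characters themselves are recognized correctly so that only the spacing differs. The sample strings are English placeholders, not my training data.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: reports ground-truth word pairs whose separating
// space is missing in the OCR output. Assumes the non-space characters of
// both strings are identical (only spacing differs).
class SpaceDiff {

    // Offsets (counted in non-space UTF-16 chars) at which the text has a space.
    static Set<Integer> breakOffsets(String text) {
        Set<Integer> offsets = new HashSet<>();
        int nonSpace = 0;
        for (char c : text.trim().toCharArray()) {
            if (Character.isWhitespace(c)) {
                offsets.add(nonSpace);
            } else {
                nonSpace++;
            }
        }
        return offsets;
    }

    // Adjacent ground-truth words that the OCR output ran together.
    static List<String> mergedPairs(String truth, String ocr) {
        Set<Integer> ocrBreaks = breakOffsets(ocr);
        String[] words = truth.trim().split("\\s+");
        List<String> merged = new ArrayList<>();
        int offset = 0;
        for (int i = 0; i + 1 < words.length; i++) {
            offset += words[i].length();
            if (!ocrBreaks.contains(offset)) {
                merged.add(words[i] + " + " + words[i + 1]);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Placeholder strings; replace with a training line and its OCR result.
        String truth = "weeks of planning became real";
        String ocr = "weeksof planning becamereal";
        System.out.println(mergedPairs(truth, ocr));
    }
}
```

Running this over each training line and the matching `doOCR` result would show whether the missing spaces always fall next to the same ligatures.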

