Re: [tesseract-ocr] Tessercat 4.0 korean detecting chinese

2018-04-09 Thread Fanatico
The conf from kor already had it:
#Fixes https://github.com/tesseract-ocr/tesseract/issues/1009
preserve_interword_spaces 1
-- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it,

[tesseract-ocr] Extract Header and Footer text separately from document image

2018-04-09 Thread Mohit Jain
Is there a way to extract the header and footer content on a document page separately using Tesseract OCR? I tried the hOCR output but it doesn't seem to have any such tags associated with the output. Regards, Mohit

Re: [tesseract-ocr] How to created training text as provided in langdata for any new language if i have just just have a wordlist.

2018-04-09 Thread Romil Mehla
Thanks Shree, but if Tesseract is open source, why can't the developers answer questions? If I train my model on random text, how can I arrive at a reliable accuracy figure? My model's accuracy will then also be random. I want the reason for the conditions imposed on the training text, how much

Re: [tesseract-ocr] How to created training text as provided in langdata for any new language if i have just just have a wordlist.

2018-04-09 Thread ShreeDevi Kumar
For Tesseract 3.05, random text will work; it is suggested to use combos similar to the English training text. It is unlikely you will get answers to your questions from the developers. You can search past issues/questions in the forum and on GitHub. 3.05 training does not take long, run a few experiments

Re: [tesseract-ocr] Tessercat 4.0 korean detecting chinese

2018-04-09 Thread ShreeDevi Kumar
For Korean, please check whether adding the following lines to the config improves your results further.
#Fixes https://github.com/tesseract-ocr/tesseract/issues/1009
preserve_interword_spaces 1
ShreeDevi भजन - कीर्तन - आरती @

Re: [tesseract-ocr] How to created training text as provided in langdata for any new language if i have just just have a wordlist.

2018-04-09 Thread Romil Mehla
Hi Shree, thanks for replying. For Tesseract *3.05.00* I had already checked that link; it says: *"Make sure there are a minimum number of samples of each character. 10 is good, but 5 is OK for rare characters.* *There should be more samples of the more frequent characters - at least
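
The sample-count guideline quoted in this message (10 samples per character is good, 5 is acceptable for rare ones) can be checked against a candidate training text before training. The following is only a minimal sketch: the function name `undersampled_chars` and the default threshold are illustrative assumptions and are not part of Tesseract or its training tools.

```python
from collections import Counter


def undersampled_chars(text, min_count=5):
    """Return {char: count} for characters that occur fewer than min_count times.

    Whitespace is ignored. min_count=5 mirrors the "5 is OK for rare
    characters" guideline quoted in the thread; it is an assumption,
    not a value enforced by Tesseract.
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    return {ch: n for ch, n in counts.items() if n < min_count}


if __name__ == "__main__":
    # Hypothetical training text: 'a' occurs 6 times, 'b' 5 times, 'c' twice.
    sample = "aaaaaa bbbbb cc"
    print(undersampled_chars(sample))  # {'c': 2}
```

Characters reported by such a check are candidates for adding more lines to the training text, for example lines built from wordlist entries that contain them.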

Re: [tesseract-ocr] Tessercat 4.0 korean detecting chinese

2018-04-09 Thread ShreeDevi Kumar
Leftover from 3.04, my guess.
On Mon 9 Apr, 2018, 12:52 PM Fanatico, wrote:
> It worked, thanks.
>
> Any reason for this chi_tra there?
>
> On Monday, 9 April 2018 03:24:44 UTC-3, shree wrote:
>> Please remove the sub language line from config file, and use combine

Re: [tesseract-ocr] Tessercat 4.0 korean detecting chinese

2018-04-09 Thread Fanatico
It worked, thanks. Any reason for this chi_tra there?
On Monday, 9 April 2018 03:24:44 UTC-3, shree wrote:
> Please remove the sub language line from config file, and use
> combine_tessdata to overwrite it.
>
> Right now it seems to be using chi_tra also.
>
> On Mon 9 Apr, 2018, 11:48 AM

Re: [tesseract-ocr] Tessercat 4.0 korean detecting chinese

2018-04-09 Thread ShreeDevi Kumar
Please remove the sub language line from the config file, and use combine_tessdata to overwrite it. Right now it seems to be using chi_tra also.
On Mon 9 Apr, 2018, 11:48 AM Fanatico, wrote:
> I used one traineddata that I created on removing the top layer from the

Re: [tesseract-ocr] Tessercat 4.0 korean detecting chinese

2018-04-09 Thread Fanatico
I used a traineddata that I created by removing the top layer from the kor.traineddata in "tessdata_best". After this I replaced it with the original one from "tessdata_best" and got the same problem. Yes, it includes chi_tra as a sublanguage:
tessedit_load_sublangs chi_tra
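
The fix suggested in this thread (remove the sub-language line from the config, then write the config back with combine_tessdata) can be sketched roughly as follows. This is an illustration under assumptions, not a tested recipe: the helper name `strip_sublangs` and the file name `kor.config` are hypothetical, and the combine_tessdata invocations in the comments should be checked against the installed version.

```python
def strip_sublangs(config_text):
    """Return config_text without any tessedit_load_sublangs line.

    All other lines are kept in their original order.
    """
    kept = [
        line
        for line in config_text.splitlines()
        if not line.strip().startswith("tessedit_load_sublangs")
    ]
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    # Assumed workflow around this helper (run outside Python):
    #   combine_tessdata -u kor.traineddata kor.          # unpack components
    #   ... rewrite kor.config with strip_sublangs() ...
    #   combine_tessdata -o kor.traineddata kor.config    # overwrite the config
    original = (
        "tessedit_load_sublangs chi_tra\n"
        "preserve_interword_spaces 1\n"
    )
    print(strip_sublangs(original), end="")  # preserve_interword_spaces 1
```

Filtering by prefix rather than deleting a fixed line number keeps the sketch usable when the config contains other settings before or after the sub-language entry.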