[tesseract-ocr] Recognizing page numbers

2017-01-06 Thread Nickolas Pohilets
How can I recognize the page number printed on a page? Page numbers can be Arabic (23) or Roman (XXIII) numerals, can appear in any corner or at the center of the top or bottom of the page, can be placed differently on even and odd pages, and can have decoration or a chapter name near them. I need to do this
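
One part of this problem, deciding whether a piece of OCR'd text is a plausible page number, can be sketched independently of Tesseract. The following is only a minimal illustration: the function name `parse_page_number` is hypothetical, it assumes the candidate text has already been extracted from a header or footer region, and it does nothing to locate that region on the page.

```python
import re
from typing import Optional

# Strict pattern for Roman numerals from 1 to 3999 (uppercase).
_ROMAN = re.compile(r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")
_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def parse_page_number(token: str) -> Optional[int]:
    """Return the page number encoded by token, or None if it is not one.

    Hypothetical helper: accepts Arabic digits or Roman numerals,
    optionally wrapped in decoration such as "- 23 -".
    """
    # Drop decoration (dashes, dots, brackets, spaces) at both ends.
    text = re.sub(r"^\W+|\W+$", "", token).upper()
    if re.fullmatch(r"[0-9]+", text):
        return int(text)
    if text and _ROMAN.match(text):
        total = 0
        for i, ch in enumerate(text):
            value = _VALUES[ch]
            following = _VALUES[text[i + 1]] if i + 1 < len(text) else 0
            # A smaller symbol before a larger one is subtracted (IV = 4).
            total += value if value >= following else -value
        return total
    return None
```

Short words such as "MIX" or "I" are also valid Roman numerals, so a check like this would likely need to be combined with position on the page and consistency across consecutive pages.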

[tesseract-ocr] Swedish language

2017-01-06 Thread ShreeDevi Kumar
Peter, Please see https://github.com/tesseract-ocr/langdata/blob/master/swe/swe.training_text You can provide additional training text if any needed characters are missing from it. I can do a test training with it. - excuse the brevity, sent from mobile On 06-Jan-2017 5:01 PM, "Peter"

Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-06 Thread ShreeDevi Kumar
I have uploaded a modified nor.traineddata at https://github.com/Shreeshrii/tessdata4alpha/blob/master/nor.traineddata See the attached log and info file for the commands used in training. It took about 9 hours on my PC, for only about 1700 iterations, and then my PC froze, so I rebooted and created the

Re: [tesseract-ocr] Re: Tesseract v3.03 and norwegian language

2017-01-06 Thread Peter
On Thursday, 5 January 2017 at 04:39:01 UTC+1, shree wrote: > > Ray is planning to retrain the languages for the new 4.0.0 version > sometime in January. So it would be helpful if you could open an issue on > https://github.com/tesseract-ocr/langdata/issues with this information. > Is it

[tesseract-ocr] Ground Truth from Box Files

2017-01-06 Thread ShreeDevi Kumar
Does anyone know of any utilities to convert a box file to a ground truth text file? I am using tesstrain.sh, which uses text2image, to try out LSTM training. However, because unrenderable words are not included in the tifs, it is not possible to use the training_text as ground truth. Thanks!
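
A conversion of this kind can be approximated with a short script. The sketch below is not an existing Tesseract utility. It assumes the per-character box format (`glyph left bottom right top page`, one glyph per line, in reading order), left-to-right text, and no explicit space entries in the file. The function names and the `gap_ratio` threshold are hypothetical, and the word-gap heuristic would need tuning per font.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    glyph: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so the glyph may be more than one character.
    glyph, left, bottom, right, top, page = line.rstrip("\r\n").rsplit(" ", 5)
    return Box(glyph, int(left), int(bottom), int(right), int(top), int(page))


def group_lines(boxes: List[Box]) -> List[List[Box]]:
    """Start a new text line on a page change or when x jumps back left."""
    lines: List[List[Box]] = []
    for box in boxes:
        prev = lines[-1][-1] if lines else None
        if prev is None or box.page != prev.page or box.left < prev.left:
            lines.append([box])
        else:
            lines[-1].append(box)
    return lines


def line_to_text(line: List[Box], gap_ratio: float = 0.3) -> str:
    """Join glyphs, inserting a space where the horizontal gap is large."""
    height = max(b.top for b in line) - min(b.bottom for b in line)
    threshold = gap_ratio * height  # assumed heuristic, not a Tesseract rule
    parts = [line[0].glyph]
    for prev, cur in zip(line, line[1:]):
        if cur.left - prev.right > threshold:
            parts.append(" ")
        parts.append(cur.glyph)
    return "".join(parts)


def box_to_text(box_text: str, gap_ratio: float = 0.3) -> str:
    boxes = [parse_box_line(l) for l in box_text.splitlines() if l.strip()]
    return "\n".join(line_to_text(l, gap_ratio) for l in group_lines(boxes))
```

Usage would be along the lines of `box_to_text(open("page.box", encoding="utf-8").read())`, where `page.box` is a placeholder file name. Because only rendered glyphs appear in the box file, the recovered text would omit the words that text2image skipped.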