[tesseract-ocr] Mathematical equation detection & recognition

2020-05-18 Thread Владимир Калачихин
What is the current state of this subject? I can find only "A Simple Equation Region Detector for Printed Document Images in Tesseract

[tesseract-ocr] Re: Mathematical equation detection & recognition

2020-05-20 Thread Владимир Калачихин
As noted at https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html, the "equ Math / equation detection module" is not present in Tesseract 4, but the traineddata is present. Does this mean that I must retrain the equ module from scratch? -- You received this message because you are

[tesseract-ocr] Re: Mathematical equation detection & recognition

2020-05-27 Thread Владимир Калачихин
Heh, the "equ" language is not present in language-specific.sh, so training Tesseract 4 on math symbols is impossible. A general question: is there a real way to create a language model from scratch, for a new, unknown language?

[tesseract-ocr] Re: Mathematical equation detection & recognition

2020-05-27 Thread Владимир Калачихин
Hi Weslley. On Wednesday, May 27, 2020 at 18:02:59 UTC+3, Weslley Torres wrote: > Did you manage to detect the area of equations in a picture? I did it with a naive approach, by consolidating areas with badly recognized symbols: [image: Снимок экрана в 2020-05-18 00-10-39.png] It is not so
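The message above is truncated, so the exact method is not known. As a rough sketch of what "consolidating areas with badly recognized symbols" could look like, the following merges the bounding boxes of low-confidence symbols that lie close together. The function names, the confidence threshold, and the pixel gap are all hypothetical choices for illustration, not taken from the thread.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def boxes_close(a: Box, b: Box, gap: int) -> bool:
    """Return True if two boxes overlap or lie within `gap` pixels of each other."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2]
            and a[1] - gap <= b[3] and b[1] - gap <= a[3])


def union(a: Box, b: Box) -> Box:
    """Smallest box that contains both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def consolidate_low_confidence(symbols: List[Tuple[Box, float]],
                               conf_threshold: float = 60.0,
                               gap: int = 10) -> List[Box]:
    """Merge nearby low-confidence symbol boxes into candidate regions.

    `symbols` is a list of (box, confidence) pairs from any OCR pass.
    The threshold and gap are assumed values and need tuning per document.
    """
    regions = [box for box, conf in symbols if conf < conf_threshold]
    merged = True
    while merged:  # repeat until no pair of regions can be merged
        merged = False
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                if boxes_close(regions[i], regions[j], gap):
                    regions[i] = union(regions[i], regions[j])
                    del regions[j]
                    merged = True
                    break
            if merged:
                break
    return regions
```

Each returned box is a candidate equation region that could be cropped and handled separately from the surrounding text.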

[tesseract-ocr] Re: Mathematical equation detection & recognition

2020-05-27 Thread Владимир Калачихин
This is not production code, just a sketch.

[tesseract-ocr] Re: Creating trainneddata from box files

2020-05-28 Thread Владимир Калачихин
Hi! On Thursday, May 28, 2020 at 8:04:03 UTC+3, Piyush Chandra wrote: > Hope below information helps: :) Some questions, please: are "--words...", "--numbers..." and "--puncs" required? Why is "--net_spec..." needed?

[tesseract-ocr] Re: Mathematical equation detection & recognition

2020-05-28 Thread Владимир Калачихин
Hi Weslley! On Thursday, May 28, 2020 at 2:42:23 UTC+3, Weslley Torres wrote: > probably you have done it already, but in any case.. Yes, I did. The equations are recognized very badly, with or without textord_equation_detect=1. This works with the legacy engine only; LSTM does not
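For reference, a minimal sketch of how such a run could be launched from Python. It only assembles the command line: `--oem 0` selects the legacy engine and `-c textord_equation_detect=1` sets the parameter named above. The function names and the default language are assumptions for illustration; the run also assumes a traineddata file that contains a legacy model and, as far as I know, an equ.traineddata in the tessdata directory.

```python
import subprocess
from typing import List


def equation_detect_cmd(image: str, outputbase: str, lang: str = "eng") -> List[str]:
    """Build a tesseract command that enables equation detection.

    Assumes the legacy engine (--oem 0) is available for `lang`.
    """
    return [
        "tesseract", image, outputbase,
        "--oem", "0",                       # legacy engine only
        "-l", lang,
        "-c", "textord_equation_detect=1",  # parameter mentioned in the thread
    ]


def run_equation_detect(image: str, outputbase: str, lang: str = "eng") -> None:
    """Run the command; raises CalledProcessError if tesseract fails."""
    subprocess.run(equation_detect_cmd(image, outputbase, lang), check=True)
```

Running the same image with and without the `-c` option makes it easy to compare the two outputs side by side.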

[tesseract-ocr] Re: Mathematical equation detection & recognition

2020-05-28 Thread Владимир Калачихин
On Thursday, May 28, 2020 at 14:59:05 UTC+3, Weslley Torres wrote: > I thought we should use "equ" instead of "eng" for equation detection. I mean, how would "eng" recognise Greek letters? And Greek letters are commonly used in equations. No. The basic concept of my naive

[tesseract-ocr] Re: Creating trainneddata from box files

2020-05-28 Thread Владимир Калачихин
Hi! Another question. On Thursday, May 28, 2020 at 8:04:03 UTC+3, Piyush Chandra wrote: > Create box files: tesseract /path/to/image.tif path/and/nameof/boxfile/image lstmbox Does Tesseract recognize the image at this step? What if it does so badly? Can I specify what text is

[tesseract-ocr] Re: Creating trainneddata from box files

2020-05-28 Thread Владимир Калачихин
On Thursday, May 28, 2020 at 14:46:10 UTC+3, Piyush Chandra wrote: > Read about --net_spec here: https://tesseract-ocr.github.io/tessdoc/VGSLSpecs Yes, but why use a custom net configuration for a common task? And which net configuration is well suited for training on math symbols?

[tesseract-ocr] Troubles with TessTutorial

2020-05-25 Thread Владимир Калачихин
I'm trying to follow https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html#tesstutorial I repeated all the steps as given. At src/training/tesstrain.sh... I get an error: ERROR: /tmp/eng-2020-05-25.QY7/eng.Century_Schoolbook_L_Bold.exp0.lstmf does not exist or is not readable Both

Re: [tesseract-ocr] Re: Creating trainneddata from box files

2020-05-28 Thread Владимир Калачихин
On Thursday, May 28, 2020 at 16:36:14 UTC+3, shree wrote: > Alternately you can use wordstrbox config file. What is the "wordstrbox config file"?

Re: [tesseract-ocr] Re: Creating trainneddata from box files

2020-05-28 Thread Владимир Калачихин
creates line level box files. > If using wordstrbox, please use the groundtruth text for creating unicharset instead of the box files. > On Thu, May 28, 2020, 20:49 Владимир Калачихин wrote: >> On Thursday, May 28, 2020 at 16:36:14 UTC+3, shree

Re: [tesseract-ocr] Re: Creating trainneddata from box files

2020-05-31 Thread Владимир Калачихин
Hi! I still don't understand. On Friday, May 29, 2020 at 15:02:22 UTC+3, shree wrote: > Input Files > myfile1.png > myfile1.gt.txt Is "myfile1.png" the picture with the training text? What is "myfile1.gt.txt"?
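In the image-plus-ground-truth convention quoted above, each image is paired with a `.gt.txt` file of the same base name that holds the exact text shown in that image. A small sketch of how such pairs could be collected and checked before training follows; the function name and the directory layout are assumptions for illustration, not part of any Tesseract tool.

```python
from pathlib import Path
from typing import List, Tuple


def find_training_pairs(directory: str) -> Tuple[List[Tuple[Path, Path]], List[Path]]:
    """Pair each PNG with its ground-truth transcription.

    Returns (pairs, missing): `pairs` holds (image, gt_file) tuples and
    `missing` lists images that have no matching .gt.txt file.
    """
    pairs: List[Tuple[Path, Path]] = []
    missing: List[Path] = []
    for image in sorted(Path(directory).glob("*.png")):
        # myfile1.png -> myfile1.gt.txt
        gt_file = image.with_name(image.stem + ".gt.txt")
        if gt_file.is_file():
            pairs.append((image, gt_file))
        else:
            missing.append(image)
    return pairs, missing
```

Images reported in `missing` still need a transcription before they can serve as training data.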

Re: [tesseract-ocr] Re: Creating trainneddata from box files

2020-05-31 Thread Владимир Калачихин
OK, I want to train from training text and fonts. Which method should I use?

Re: [tesseract-ocr] Re: Creating trainneddata from box files

2020-05-31 Thread Владимир Калачихин
On Sunday, May 31, 2020 at 19:16:55 UTC+3, shree wrote: > Use tesstrain.sh or tesstrain.py > On Sun, May 31, 2020 at 6:45 PM Владимир Калачихин wrote: >> OK, I want to train from training text and fonts. >> Which method should I use?

Re: [tesseract-ocr] Re: Creating trainneddata from box files

2020-06-01 Thread Владимир Калачихин
Hi! On Monday, June 1, 2020 at 11:23:39 UTC+3, shree wrote: > ### create tif and box using fonts and training text > text2image --fonts_dir=/home/ubuntu/.fonts --outputbase=/mylang.myfont.exp0 --max_pages=0 --font=myfont --text=../langdata/mylang/mylang.training_text
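When several fonts are involved, the quoted text2image call can be wrapped in a small helper. The sketch below only assembles the argument list from the flags that appear in the quoted command; the function name and parameter names are hypothetical, and the rest of the quoted message is truncated, so any later steps are not covered here.

```python
from typing import List


def text2image_cmd(fonts_dir: str, outputbase: str, font: str,
                   training_text: str, max_pages: int = 0) -> List[str]:
    """Build the text2image command quoted in the thread for one font."""
    return [
        "text2image",
        f"--fonts_dir={fonts_dir}",
        f"--outputbase={outputbase}",
        f"--max_pages={max_pages}",
        f"--font={font}",
        f"--text={training_text}",
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)` once per font, changing only `font` and `outputbase`.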

[tesseract-ocr] Re: Troubles with TessTutorial

2020-05-26 Thread Владимир Калачихин
I don't see any problems.

Re: [tesseract-ocr] Re: Creating trainneddata from box files

2020-06-02 Thread Владимир Калачихин
On Monday, June 1, 2020 at 19:36:07 UTC+3, shree wrote: > This is for Latin script not Latin language. wget the file from https://github.com/tesseract-ocr/langdata_lstm/blob/master/Latin.unicharset OK, I did it, and some of the next steps. At the step ### Train: > lstmtraining .

Re: [tesseract-ocr] Re: Creating trainneddata from box files

2020-06-02 Thread Владимир Калачихин
On Monday, June 1, 2020 at 19:37:25 UTC+3, shree wrote: > You may find this repo useful > https://github.com/UYousafzai/easy_train_tesseract You don't understand. I don't want to train new fonts for an existing language. I want a new language.

[tesseract-ocr] How to exclude some symbols from recognizing?

2020-07-13 Thread Владимир Калачихин
See the subject. Numbers, for example.

Re: [tesseract-ocr] Re: Creating trainneddata from box files

2020-06-22 Thread Владимир Калачихин
I have returned to this task. On Thursday, June 4, 2020 at 19:13:58 UTC+3, Piyush Chandra wrote: > This is what is missing: --net_spec. Check the line below that I mentioned before. > lstmtraining --traineddata ./out/own/own.traineddata --model_output ./output/own --net_spec

[tesseract-ocr] Re: Digits reading optimalisation.

2021-01-30 Thread Владимир Калачихин
Heh. It's an old issue. For 100% accuracy, you must use a digit-only language model. But there is no such thing. Besides, a trivial perceptron shows good results on digit recognition. On Saturday, January 30, 2021 at 18:41:13 UTC+3, Benek wrote: > Hello! I'm trying to read some digits and I thought it was

[tesseract-ocr] Re: Digits reading optimalisation.

2021-01-30 Thread Владимир Калачихин
Digits are included in the language model together with letters, and the model is mostly trained for phrase recognition, not separate digits. Mistakes on digits are unavoidable. On Saturday, January 30, 2021 at 19:12:39 UTC+3, Benek wrote: > I still need to read the dot in the correct place which makes it a bit harder. So you

[tesseract-ocr] Digit recognition again.

2021-11-30 Thread Владимир Калачихин
Are there any examples of the recognition of code-stamped digits, such as ZIP codes? Or a real approach to recognize handwritten digits?