Re: [tesseract-ocr] Re: tesseract4.0 - Tesseract couldn't load any languages!

2017-12-25 Thread ShreeDevi Kumar
Looks like the traineddata files are corrupted or did not download correctly, and hence you are getting file-not-found errors. Check the file sizes - you may want to download using wget or curl. root@All-in-1-Touch:/mnt/c/Users/User/shree/tessdata_best# ll *.traineddata -rwxrwxrwx 1 root root 13077423
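A minimal sketch of the size check suggested above, assuming a POSIX shell. The function name and the default threshold of 1000000 bytes are illustrative assumptions, not documented values; a file far smaller than its neighbours is only a hint that the download was incomplete.

```sh
# Print every *.traineddata file in a directory with its size in bytes
# and flag those below a threshold (hypothetical default: 1000000).
check_traineddata() (
  dir="$1"
  min_bytes="${2:-1000000}"
  for f in "$dir"/*.traineddata; do
    [ -e "$f" ] || continue                 # the glob matched nothing
    size=$(wc -c < "$f" | tr -d ' ')        # portable byte count
    if [ "$size" -lt "$min_bytes" ]; then
      echo "SUSPECT $size $f"
    else
      echo "OK $size $f"
    fi
  done
)
# Example: check_traineddata ./tessdata_best
```

Files reported as SUSPECT are candidates for downloading again with wget or curl.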

Re: [tesseract-ocr] chi_* language selected but no Chinese characters are recognized.

2018-01-06 Thread ShreeDevi Kumar
Have you tried with chi_sim which has both chinese and english? On 07-Jan-2018 12:03 PM, "林博仁" wrote: > I unable to extract a document with Chinese characters properly, please > help. > Input File > https://drive.google.com/file/d/16j21iuXVwrxplGGtJZhxeTf0ziXPY >

Re: [tesseract-ocr] Tesseract 4.0 outputs empty string after fine-tuning on different unicharset

2018-01-09 Thread ShreeDevi Kumar
On Tue, Jan 9, 2018 at 7:57 PM, Yang Yu wrote: > Thanks for giving more insight! > > Sorry for another question: is there any "dropping" logic in tesseract > (say, if the certainty of recognized character < threshold, the result will > not be used thus an empty string is

Re: [tesseract-ocr] Tesseract 4.0 outputs empty string after fine-tuning on different unicharset

2018-01-08 Thread ShreeDevi Kumar
> results are by design not some meaningful words. My training data has 5000 > such plate numbers, one line for each as text. The reason why I want to > retrain is the fact that the number of possible Chinese character at > position 0 is limited to ~30. > > Am I doing anything wr

Re: [tesseract-ocr] Which OS is easiest to install tessearact

2018-01-08 Thread ShreeDevi Kumar
ubuntu ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Mon, Jan 8, 2018 at 6:41 PM, Gary Evensen wrote: > I am going from VPS to my own server and I am debating Centos7 vs ubuntu > Which OS is

Re: [tesseract-ocr] Re: How can I do the training using my own image in Tesseract 4.0

2018-01-11 Thread ShreeDevi Kumar
Currently, Ray/Google has NOT released info on how to train Tesseract 4 (LSTM) with real life images. The only supported option is to use synthetic training data created by tesstrain.sh script using training text and unicode fonts. To train an LSTM model from scratch requires a large amount of

Re: [tesseract-ocr] Re: How to use tesseract4.0 to only recognize the digits??

2018-01-04 Thread ShreeDevi Kumar
I will have to look for the exact commands and training text I used at that time. You should be able to recreate the training by following instructions given at https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters I had modified the english

Re: [tesseract-ocr] Traineddata always ended in same size and did not match with wordlist

2018-01-10 Thread ShreeDevi Kumar
On Wed, Jan 10, 2018 at 3:56 PM, wrote: > It works !! > I modified your bash script and executed it. Finally I get different > traineddata size. > > But, can I train it from scratch? > It needs starting traineddata which I can get from combine_lang_model, > isn't it? > >

Re: [tesseract-ocr] Re: Need Help with extracting info from Invoice

2018-01-10 Thread ShreeDevi Kumar
See https://github.com/tesseract-ocr/tesseract/wiki/APIExample for an example of using tesseract in a program. The training tutorial you refer to is old. See tesstrain.sh for creating synthetic training data. On 10-Jan-2018 2:54 PM, "saumitra mallick" wrote: > Hello

Re: [tesseract-ocr] Re: Need Help with extracting info from Invoice

2018-01-10 Thread ShreeDevi Kumar
On Wed, Jan 10, 2018 at 8:07 PM, Afreen Ferdoash wrote: > I am trying to solve a similar problem, that of reading forms. Tesseract > 4 is doing well but is DROPPING lots of words withing boxes. I thought > this problem of dropping words existed with Indic languages but

Re: [tesseract-ocr] Traineddata always ended in same size and did not match with wordlist

2018-01-09 Thread ShreeDevi Kumar
1. If you use tesstrain.sh, it will create the starter traineddata; you do NOT need to run combine_lang_model. If you want to change the version string, look at tesstrain_utils.sh and modify the command in it. 2. If you are always getting the same size file, it looks like you are probably copying

Re: [tesseract-ocr] Tesseract 4.0 outputs empty string after fine-tuning on different unicharset

2018-01-09 Thread ShreeDevi Kumar
> On Tue, Jan 9, 2018 at 1:17 PM, ShreeDevi Kumar <shreesh...@gmail.com> > wrote: > >> Fine-tune plus-minus will work for few character changes. >> >> You want to delete thousands of characters. >> >> Maybe you need replace top layer type of trainin

Re: [tesseract-ocr] Traineddata always ended in same size and did not match with wordlist

2018-01-09 Thread ShreeDevi Kumar
> > > My reason for using combine_lang_data is to make my punc, wordlist, and > numbers effects the trainned data.. Or, it doesn't work like that? > If you update the files in langdata folder and then run tesstrain.sh, it will automatically use your files. > > Now, I will try your shell

Re: [tesseract-ocr] Re: I Need help getting Tesseract 4.0 C# .Net Wrapper working please!

2018-01-08 Thread ShreeDevi Kumar
please see https://github.com/charlesw/tesseract/issues/306 maybe the fix there will help. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Mon, Jan 8, 2018 at 3:33 PM, James Q wrote: >

Re: [tesseract-ocr] Re: I Need help getting Tesseract 4.0 C# .Net Wrapper working please!

2018-01-08 Thread ShreeDevi Kumar
tesseract 4 alpha does not support whitelist/blacklist. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Mon, Jan 8, 2018 at 4:52 PM, ShreeDevi Kumar <shreesh...@gmail.com> wrote: > please see https://g

Re: [tesseract-ocr] Re: I Need help getting Tesseract 4.0 C# .Net Wrapper working please!

2018-01-08 Thread ShreeDevi Kumar
ee wrote: >> >> tesseract 4 alpha does not support whitelist/blacklist. >> >> ShreeDevi >> ____ >> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com >> >> On Mon, Jan 8, 2018 at 4:52 PM, ShreeDevi Kumar <shree...@gmail.

Re: [tesseract-ocr] Traineddata always ended in same size and did not match with wordlist

2018-01-08 Thread ShreeDevi Kumar
Did you use --stop_training flag at the end? ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Mon, Jan 8, 2018 at 5:51 PM, wrote: > Hi all, > > I am doing my project using Tesseract v4.00, and

Re: [tesseract-ocr] Tesseract 4.0 outputs empty string after fine-tuning on different unicharset

2018-01-08 Thread ShreeDevi Kumar
How many iterations did you use for training? You can unpack HanS.traineddata and then run the dawg2wordlist program to get the wordlists used in it. Try using these for langdata in addition to your training text. ShreeDevi भजन - कीर्तन -

[tesseract-ocr] Re: BUG : Can't encode transcription error with Sinhala language

2018-01-19 Thread ShreeDevi Kumar
, 2018 at 5:37 PM, ShreeDevi Kumar <shreesh...@gmail.com> wrote: > > Sumedhe, > > I tried to do training with Sinhala just now and ran into the same problem. > > Looks like a bug. > > I have added info on https://github.com/tesseract-ocr/tesseract/ > issues/1012

[tesseract-ocr] BUG : Can't encode transcription error with Sinhala language

2018-01-19 Thread ShreeDevi Kumar
Sumedhe, I tried to do training with Sinhala just now and ran into the same problem. Looks like a bug. I have added info on https://github.com/tesseract-ocr/tesseract/issues/1012 -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To

Re: [tesseract-ocr] Improving text recognition in musical scores

2018-01-22 Thread ShreeDevi Kumar
You could try tesseract4.0.0alpha(latest commit from master branch) which will allow you to use 'Latin' traineddata which supports most languages written in Latin script. See if that gives you better recognition for the text. ShreeDevi

Re: [tesseract-ocr] What difference are there between jpn.traineddata and Japanese.traineddata?

2018-01-17 Thread ShreeDevi Kumar
On Wed, Jan 17, 2018 at 3:10 PM, ShreeDevi Kumar <shreesh...@gmail.com> wrote: > Please see https://github.com/tesseract-ocr/tessdata/issues/ > 62#issuecomment-319442674 > > Initial capitals indicate the one model for all langs in that script, so > eg Latin is all latin-based

Re: [tesseract-ocr] What difference are there between jpn.traineddata and Japanese.traineddata?

2018-01-17 Thread ShreeDevi Kumar
Please see https://github.com/tesseract-ocr/tessdata/issues/62#issuecomment-319442674 Initial capitals indicate the one model for all langs in that script, so eg Latin is all latin-based languages except vie, which has its own Vietnamese. Most of the script models include English training data as

Re: [tesseract-ocr] Re: Can't encode transcription error with Sinhala language

2018-01-17 Thread ShreeDevi Kumar
What version of the software are you using? Preferably use the latest version from github. Ray changed the LSTM training process sometime last year. If you use an older version of the code, the new instructions will not work.

Re: [tesseract-ocr] Can't encode transcription error with Sinhala language

2018-01-13 Thread ShreeDevi Kumar
You are using an old version of the training command. Please review the wiki page regarding training again. The command for training from scratch will be similar to the following. training/lstmtraining \ --debug_interval -1 \ --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \ --net_spec

Re: [tesseract-ocr] Empty result with images taken as marginally low resolution - Nepali

2018-01-12 Thread ShreeDevi Kumar
Please file an issue with full details. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Fri, Jan 12, 2018 at 8:04 PM, Nirajan Pant wrote: > --psm 3 also not working. > > On Friday, 12 January

Re: [tesseract-ocr] Empty result with images taken as marginally low resolution - Nepali

2018-01-12 Thread ShreeDevi Kumar
It seems some bug has crept in the processing of diff psm modes. OCR worked only for psm 4 and 6 ./nepali.png oem 1** psm 1 Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica Warning. Invalid resolution 0 dpi. Using 70

Re: [tesseract-ocr] Empty result with images taken as marginally low resolution - Nepali

2018-01-12 Thread ShreeDevi Kumar
psm 1 is 1 Automatic page segmentation with OSD. psm 3 is 3 Fully automatic page segmentation, but no OSD. (Default) see https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage ShreeDevi भजन - कीर्तन - आरती @
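To see how the page segmentation modes differ on one image, the same file can be run through several of them in a loop. This is a sketch only: the function name is made up here, and the image name, language code and mode list in the example are placeholders to replace with your own.

```sh
# Run one image through several page segmentation modes and print each
# result to the terminal so the outputs can be compared side by side.
# Usage: compare_psm IMAGE LANG PSM [PSM...]
compare_psm() (
  img="$1"
  lang="$2"
  shift 2
  for psm in "$@"; do
    echo "=== psm $psm ==="
    # "stdout" as the output base sends the recognized text to the terminal
    tesseract "$img" stdout --oem 1 --psm "$psm" -l "$lang"
  done
)
# Example: compare_psm ./nepali.png nep 1 3 4 6
```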

Re: [tesseract-ocr] Empty result with images taken as marginally low resolution - Nepali

2018-01-12 Thread ShreeDevi Kumar
Nirajan, Please check with 'best' traineddata for nep. That seemed to work.

Re: [tesseract-ocr] Empty result with images taken as marginally low resolution - Nepali

2018-01-11 Thread ShreeDevi Kumar
Works fine for me. What traineddata and options did you use? Attaching the output from the following, I did not change dpi of image. #!/bin/bash img_files=$(ls ./nepali*.png) for img_file in ${img_files}; do echo "" ${img_file} oem

Re: [tesseract-ocr] Re: Where to find the LSTM network architecture used in Tesseract?

2018-01-12 Thread ShreeDevi Kumar
See https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00 https://github.com/tesseract-ocr/tesseract/wiki/VGSLSpecs https://github.com/tesseract-ocr/docs/tree/master/das_tutorial2016

Re: [tesseract-ocr] Re: Criminal record JPGs: Improving image quality

2018-01-30 Thread ShreeDevi Kumar
Thanks for your response and the link to leptonica's table detection routines. Yes, my query was generic in nature, because I have seen many posts related to OCR of tables, but hadn't come across any method addressing the same. You have correctly pointed out the reasons why it is so. On

Re: [tesseract-ocr] Re: tessdata_best traineddata FIles

2018-02-01 Thread ShreeDevi Kumar
You are correct. Latin script is available only for LSTM mode --oem 1, with traineddata files in tessdata_best and tessdata_fast. Similarly for all other script traineddata files - names starting with CAPITAL letters. ShreeDevi भजन -

Re: [tesseract-ocr] How to get all the tesseract fonts in tiff format

2018-02-01 Thread ShreeDevi Kumar
Gimagereader offers HOCR to pdf output with tesseract as the OCR engine. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Thu, Feb 1, 2018

Re: [tesseract-ocr] tessdata_best traineddata FIles

2018-02-01 Thread ShreeDevi Kumar
Latin - for Latin script including languages such as eng, deu, spa etc lat - for Latin language ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Thu, Feb 1, 2018 at 4:31 PM, James Q wrote:

Re: [tesseract-ocr] ERROR: Could not find training text file

2018-01-29 Thread ShreeDevi Kumar
You need to give the path based on where you have the files. Eg. Change langdata dir from ../langdata to ../home/adarsh/tes1/tesseract/langdata. Make sure it has the other required files. On 30-Jan-2018 12:14 PM, wrote: > Do we need to have the langdata folder in some

Re: [tesseract-ocr] Re: Training the Tesseract-OCR for Kannada Language

2018-02-06 Thread ShreeDevi Kumar
Have you tried tesseract with traineddata from tessdata_fast and tessdata_best On 06-Feb-2018 9:44 PM, "Cisa Anand" wrote: > Hi guys, > I am working on a project involving Kannada text extraction. I used the > kan.traineddata available in the tesseract website but there

Re: [tesseract-ocr] Re: Tamil Trained data; Tesseract 3.01- its strange ways of using the box file.

2018-02-12 Thread ShreeDevi Kumar
That is a really old email regarding traineddata for 3.01. You might get better results using the latest version of files from github. On 12-Feb-2018 9:09 PM, wrote: Hi.. can i get the box file for those tif files and trained data also for latha font... On Sunday,

Re: [tesseract-ocr] Brand new Windows 7 Tesserat user

2018-02-06 Thread ShreeDevi Kumar
Your command syntax is incorrect. It should be -l eng; the (lang) is not required. Then it will use eng.traineddata for doing the OCR. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, Feb 6, 2018 at 4:43 AM, Garry

Re: [tesseract-ocr] Been able to create tessdata from a text and a font, but can I do it from an image?

2018-02-15 Thread ShreeDevi Kumar
Depends on what version of tesseract you are using. tesseract can be used to make box files which work well with 3.0x. Training with images is not supported for 4.0alpha. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Re: [tesseract-ocr] Error in training Tesseract 4.0. Training gets completed somehow but then the output it gives after reading the pdf is incorrect.

2018-02-15 Thread ShreeDevi Kumar
You are missing langdata files: Failed to load script unicharset from:/home/adarsh/tesseract/langdata/Latin.unicharset Failed to read data from: /home/adarsh/tesseract/langdata/radical-stroke.txt Error reading radical code table /home/adarsh/tesseract/langdata/radical-stroke.txt Even after you
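Before starting a training run, it can help to confirm that the langdata files named in errors like these are actually present. The sketch below checks only the three files that come up in this thread (the training text, radical-stroke.txt and Latin.unicharset); the function name is invented for illustration, the list is not a complete set of training requirements, and Latin.unicharset applies to Latin-script languages only.

```sh
# Report which of a few langdata files are missing.
# Usage: check_langdata LANGDATA_DIR LANG   (returns 1 if anything is missing)
check_langdata() (
  langdata="$1"
  lang="$2"
  missing=0
  for f in "$langdata/$lang/$lang.training_text" \
           "$langdata/radical-stroke.txt" \
           "$langdata/Latin.unicharset"; do
    if [ ! -f "$f" ]; then
      echo "Missing: $f"
      missing=1
    fi
  done
  return "$missing"
)
# Example: check_langdata /home/adarsh/tesseract/langdata eng
```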

Re: [tesseract-ocr] Re: Tesseract recognition accuracy is low

2018-02-15 Thread ShreeDevi Kumar
Read wiki pages about improving quality of your input images. Also try with the latest tesseract code and traineddata files from github. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Thu, Feb 15, 2018 at 10:35 AM,

Re: [tesseract-ocr] When using text2image for training, I get the error: Could not find font named... how can I know the correct name of a font?

2018-02-15 Thread ShreeDevi Kumar
You can check the fonts available on your system by using --find_fonts with text2image, which lists the font names as used by tesseract. Example command with output (please modify the paths to match your setup): text2image --find_fonts --text ./langdata/eng/eng.training_text --outputbase ./langdata/eng/

Re: [tesseract-ocr] Error in training Tesseract 4.0. Training gets completed somehow but then the output it gives after reading the pdf is incorrect.

2018-02-15 Thread ShreeDevi Kumar
> I have fixed the Langdata folder now. And also the previous files are different from the file now. Look at the error messages. Search for 'Failed'. You now have more langdata related errors.

Re: [tesseract-ocr] Creating wordlist from high confidence words

2018-02-22 Thread ShreeDevi Kumar
Take a look at --user-words and the commands combine_tessdata, dawg2wordlist and wordlist2dawg. You can change the wordlist and it may improve the chances of a word being recognised, but I don't think recognition is limited to the list. It also depends on the version of tesseract that you are using. On

Re: [tesseract-ocr] Tesseract is giving column data on the last line of file

2018-02-22 Thread ShreeDevi Kumar
What --psm are you using? Tesseract might be treating the last portion as a different column. Try PSM 4 or 6. On 22-Feb-2018 3:48 PM, wrote: > >

Re: [tesseract-ocr] Read Local Charter (Hindi , Tamil, Sinhala)

2018-02-21 Thread ShreeDevi Kumar
What operating system are you on? Which version of tesseract are you currently using? ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Wed, Feb 21, 2018 at 10:09 AM, Aruna Gamage wrote: > Dear Sir,

Re: [tesseract-ocr] Tesseract is giving column data on the last line of file

2018-02-26 Thread ShreeDevi Kumar
Try -c page_separator="\n" or the code for CRLF.

Re: [tesseract-ocr] Error when doing the set_unicharset_properties command on Windows

2018-02-23 Thread ShreeDevi Kumar
Please open this as an issue in github repo - https://github.com/tesseract-ocr/tesseract/issues > the "/" is added without taking care if the command is used on Windows or Linux. Found a couple of places in that file where this is the case. // Load the unicharset for the script if

Re: [tesseract-ocr] Error when doing the set_unicharset_properties command on Windows

2018-02-23 Thread ShreeDevi Kumar
I have used git bash for running tesseract. Not tried for training. You can use the ppa from the link below, rather than trying to build it. https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr/+packages

Re: [tesseract-ocr] Tesseract is giving column data on the last line of file

2018-02-23 Thread ShreeDevi Kumar
Probably FF. Tesseract adds a page break (normally form feed) by default. It is still possible to suppress page breaks by setting an empty page_separator. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Fri, Feb 23,
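Both routes can be sketched in shell: suppress the separator when running OCR, or strip form feed characters from text that has already been produced. The file names are placeholders and the helper name is made up for this example.

```sh
# Option 1 (needs tesseract): set an empty page separator at OCR time.
#   tesseract input.png output -c page_separator=""

# Option 2: remove form feed characters (0x0C) from existing output.
strip_form_feed() {
  tr -d '\f'
}
# Example: strip_form_feed < output.txt > output_clean.txt
```

Option 2 leaves every other character untouched, so it is safe to run on text that contains no form feeds at all.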

Re: [tesseract-ocr] Error when doing the set_unicharset_properties command on Windows

2018-02-23 Thread ShreeDevi Kumar
I use mobaxterm and WSL (bash under windows) on Windows 10. If you are training for legacy tesseract engine (not LSTM) you can use Jtessboxeditor for training. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Fri, Feb

Re: [tesseract-ocr] How to train unclear license plate using Tesseract 4

2017-12-26 Thread ShreeDevi Kumar
Take a look at https://github.com/openalpr/openalpr https://github.com/laddng/LiPlate https://stackoverflow.com/questions/.../using-tesseract-to-recognize-license-plates You might get better results using tesseract 3.x - ShreeDevi

Re: [tesseract-ocr] Re: tesseract4.0 - Tesseract couldn't load any languages!

2017-12-22 Thread ShreeDevi Kumar
As per your message above, the files are in /usr/local/share/tessdata/ but the program is looking for them at /usr/local/share/. You can set TESSDATA_PREFIX and try, OR specify the directory as part of the command line. I have found that to be the easiest way, specially when using/comparing diff
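A short sketch of the command-line route, using the --tessdata-dir option. The wrapper function and the file names in the example are hypothetical. For the environment-variable route, note that whether TESSDATA_PREFIX should name the tessdata directory itself or its parent has differed between tesseract versions, so follow the path shown in the error message.

```sh
# Point tesseract at an explicit traineddata directory for one run.
# Usage: ocr_with_tessdata_dir IMAGE OUTPUT_BASE TESSDATA_DIR [LANG]
ocr_with_tessdata_dir() {
  tesseract "$1" "$2" --tessdata-dir "$3" -l "${4:-eng}"
}
# Example:
#   ocr_with_tessdata_dir page.png page /usr/local/share/tessdata eng
```

Passing the directory per command also makes it easy to compare results from different sets of traineddata files.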

Re: [tesseract-ocr] Re: How to use tesseract4.0 to only recognize the digits??

2018-01-04 Thread ShreeDevi Kumar
ould be awesome if you could find > back the command line ;) > BR > > Envoyé de mon iPhone > > Le 4 janv. 2018 à 10:08, ShreeDevi Kumar <shreesh...@gmail.com> a écrit : > > I will have to look for the exact commands and training text I used at > that time. > > Yo

Re: [tesseract-ocr] Can Tesseract OCR work on Linux machine?

2018-01-17 Thread ShreeDevi Kumar
Yes, it is available as a package on linux. If you want the latest version, you can build it. Or use the ppa from https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Re: [tesseract-ocr] Re: Can't encode transcription error with Sinhala language

2018-01-18 Thread ShreeDevi Kumar
best model for the font you need. On 19-Jan-2018 7:23 AM, "ShreeDevi Kumar" <shreesh...@gmail.com> wrote: > Take a look at the lines that are getting the error and check that all > characters are in the unicharset generated by training. > > The size of lstm-unichars

Re: [tesseract-ocr] Re: Can't encode transcription error with Sinhala language

2018-01-18 Thread ShreeDevi Kumar
>I am using the latest version (from the github). Have you cloned the master branch of the tesseract-ocr repository and built it? Which commit number? If you are using https://github.com/tesseract-ocr/tesseract/releases/tag/4.00.00alpha , that will not work - that is from Nov 8, 2016. ShreeDevi

Re: [tesseract-ocr] Re: Can't encode transcription error with Sinhala language

2018-01-18 Thread ShreeDevi Kumar
The tags have NOT been updated, hence version showing 4.00.00alpha is meaningless, since there have been hundreds of commits to the code after that tag. Please build using latest commit from master branch, or use the ppa by

Re: [tesseract-ocr] Re: Can't encode transcription error with Sinhala language

2018-01-18 Thread ShreeDevi Kumar
Also see https://github.com/tesseract-ocr/tesseract/search?q=Can%27t+encode+transcription+error&type=Issues

Re: [tesseract-ocr] Re: Can't encode transcription error with Sinhala language

2018-01-18 Thread ShreeDevi Kumar
Take a look at the lines that are getting the error and check that all characters are in the unicharset generated by training. The size of lstm-unicharset is different than the one generated by the training text, note the message shown at beginning of training. Check github issues, one of the

Re: [tesseract-ocr] @shree / Fianlly I made the customzied (fine tuned) traineddata

2018-03-08 Thread ShreeDevi Kumar
Please look at the kor.config file in langdata. Maybe it is loading chi_tra. The langdata files are from 3.04. On Thu 8 Mar, 2018, 2:27 PM 이경준, wrote: > Hi > > Fianlly I made the customzied (fine tuned) traineddata - korean > > > But, Run tesseract > > I have a problem. >

Re: [tesseract-ocr] Tesseract tsv output not working

2018-03-11 Thread ShreeDevi Kumar
1. Please check that your tessdata/configs folder has a file called tsv. 2. Try giving a different output file name (NOT out). 3. Do hocr and pdf outputs work for you? ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
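The first check can be scripted. This is a sketch under the assumption that the config files live in a configs folder inside the tessdata directory, as described above; the helper name and the example paths are placeholders.

```sh
# Succeed if the named config file exists under TESSDATA_DIR/configs.
# Usage: has_config TESSDATA_DIR CONFIG_NAME
has_config() {
  [ -f "$1/configs/$2" ]
}
# Example:
#   has_config /usr/local/share/tessdata tsv || echo "tsv config is missing"
#   tesseract input.png result tsv     # should write result.tsv
```

The same helper works for the hocr and pdf configs mentioned in the third point.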

Re: [tesseract-ocr] Tesseract 4 for old languages

2018-03-12 Thread ShreeDevi Kumar
Please try tesseract 4.0.0beta.1 with languages such as *enm* (English, Middle (1100-1500)) and Fraktur script Also, look at the following project from a few years back http://emop.tamu.edu/outcomes/Franken-Plus ShreeDevi भजन -

Re: [tesseract-ocr] Tesseract 4 for old languages

2018-03-12 Thread ShreeDevi Kumar
files in it have not been updated for 4.0.0 ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Mon, Mar 12, 2018 at 2:00 PM, ShreeDevi Kumar <shreesh...@gmail.com> wrote: > Please try tesseract 4.0.0beta.1

Re: [tesseract-ocr] Re: tesseract 4.00 beta is released ? I saw the who use the tesseract 4.00 beta

2018-03-12 Thread ShreeDevi Kumar
Master branch in github repo at commit 40f4311 has been tagged as tesseract4.0.0beta.1 - Please see https://github.com/tesseract-ocr/tesseract/releases/tag/4.0.0-beta.1 That commit is the one which has

Re: [tesseract-ocr] Warning. Invalid resolution 0 dpi. Using 70 instead

2018-02-27 Thread ShreeDevi Kumar
Which version of tesseract are you using? On 27-Feb-2018 7:52 PM, "Terry Bryant" wrote: > Hello everyone. I'm facing this above problem when my input image is the > attached file. > > My os: ubuntu14.04 > My input image: in attached file(which is a .png

Re: [tesseract-ocr] Re: Read Local Charter (Hindi , Tamil, Sinhala)

2018-02-27 Thread ShreeDevi Kumar
Yes, it is possible to use tesseract for sinhala. Please mention the type of computer operating system you use and its version so that I can send appropriate links for you to use. On 27-Feb-2018 4:07 PM, "Aruna Gamage" wrote: > Dear sir, > > Mainly I need sinhala

Re: [tesseract-ocr] Re: I'm reading Using tesstrain (tesseract 4.0) wiki passage _ I have a question

2018-02-28 Thread ShreeDevi Kumar
On Thu, Mar 1, 2018 at 9:21 AM, 이경준 wrote: > Thank U reply my question. > > But my system is operated by Ubuntu 16.04. 03 LTS > > I think that that path is not working ? Am I false? > > > 2018년 2월 28일 수요일 오후 6시 18분 41초 UTC+9, shree 님의 말: >> >> Try with following - make

Re: [tesseract-ocr] Re: I'm reading Using tesstrain (tesseract 4.0) wiki passage _ I have a question

2018-02-28 Thread ShreeDevi Kumar
>we don't understand each otehr saying. Sorry about that. Please give the following commands and let me know the result. tesseract -v tesseract --list-langs combine_tessdata -u kor.traineddata I do not know Korean, but feedback from other users has been that tesseract4 and the latest

Re: [tesseract-ocr] Re: I'm reading Using tesstrain (tesseract 4.0) wiki passage _ I have a question

2018-02-28 Thread ShreeDevi Kumar
> my system is operated by Ubuntu 16.04. 03 LTS > Yes .I tried tessdata - kor.trainnedata /// But it is not good enough. sorry .ㅜㅜ i can not use tesseract 4.0 tessdata-kor.trainnedata. in bussiness .. I will suggest that you uninstall your old tesseract version.(3.0x) sudo apt-get remove

Re: [tesseract-ocr] Hindi language version not working. VietOCR.NET-4.5_64

2018-03-01 Thread ShreeDevi Kumar
That document is for an old version of tesseract. Please use vietocr version which supports tesseract 4.00alpha. Download traineddata files for 4.00alpha from tessdata_fast You can try OCR with both hin and Devanagari traineddata files. On 01-Mar-2018 3:23 PM, "Sohan Shekhawat"

Re: [tesseract-ocr] Re: I'm reading Using tesstrain (tesseract 4.0) wiki passage _ I have a question

2018-03-01 Thread ShreeDevi Kumar
eeDevi >> ____ >> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com >> >> On Thu, Mar 1, 2018 at 6:36 PM, ShreeDevi Kumar <shree...@gmail.com> >> wrote: >> >>> > combine_tessdata -u kor.train

Re: [tesseract-ocr] Re: I'm reading Using tesstrain (tesseract 4.0) wiki passage _ I have a question

2018-03-01 Thread ShreeDevi Kumar
> combine_tessdata -u kor.traineddata What is that meaning ? Could you explain for me ? That command will show and unpack the components of your traineddata file. eg. from tessdata_fast combine_tessdata -u ./tessdata_fast/kor.traineddata ./tessdata_fast/kor. Extracting tessdata components from

Re: [tesseract-ocr] Re: I'm reading Using tesstrain (tesseract 4.0) wiki passage _ I have a question

2018-03-01 Thread ShreeDevi Kumar
what version of tesseract program you are using. I have already sent you the bash script that you can modify for training. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Thu, Mar 1, 2018 at 6:36 PM, ShreeDevi Kumar <shre

Re: [tesseract-ocr] message from runnig tesseract from my tuned traineddata(korean)

2018-03-13 Thread ShreeDevi Kumar
> > > 2) I'm using my korean tuned fine tuned traineddata but, always give > message like that " Error opening data file /chi_tra.trainddata " > please make sure the TESSDATA_PREFIX environment varialbe > > Is it OK? > > shree you teach me ///refer to kor.config > > and I saw the

Re: [tesseract-ocr] How to replace top LSTM top layer ?

2018-03-13 Thread ShreeDevi Kumar
That info is given in the training wiki page. On Tue 13 Mar, 2018, 12:53 PM 이경준, wrote: > There is no way about replacing top layer ... ㅜㅜ > > 2018년 3월 13일 화요일 오후 4시 22분 8초 UTC+9, shree 님의 말: >> >> https://github.com/tesseract-ocr/tesseract/issues/1009 >> >> Link works

Re: [tesseract-ocr] How to replace top LSTM top layer ?

2018-03-13 Thread ShreeDevi Kumar
https://github.com/tesseract-ocr/tesseract/issues/1009 Link works ok On Tue 13 Mar, 2018, 12:37 PM 이경준, wrote: > Shreeshrii commented on 29 Jun 2017 > > • >

Re: [tesseract-ocr] Training tesseract 4.0 with large training text

2018-03-13 Thread ShreeDevi Kumar
You have to look in the file it calls, tesstrain_utils.sh. On Tue 13 Mar, 2018, 12:22 PM 이경준, wrote: > Hi Shree . I saw the tesstrain.sh file. > > But I cannot point to max-pages to 3 ??? where ??? > > Could you tell me about it more details > > 2018년 3월 13일 화요일 오전

Re: [tesseract-ocr] How to replace top LSTM top layer ?

2018-03-13 Thread ShreeDevi Kumar
That command applies to an older version of the source code. Now you need a starter traineddata. Please see the wiki page at https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#training-just-a-few-layers ShreeDevi

Re: [tesseract-ocr] Training tesseract 4.0 with large training text

2018-03-12 Thread ShreeDevi Kumar
Please look at tesstrain.sh It is setting max-pages to 3 for text2image invocation. You can change it there. On Tue 13 Mar, 2018, 6:54 AM , wrote: > Dear all, > > I'm trying to train lstm using a large training text, different fonts, > colors etc. I'm trying to use

Re: [tesseract-ocr] Re: pango library doesn't recognize my font .

2018-03-13 Thread ShreeDevi Kumar
Give the following command - after changing directories to match your setup text2image --find_fonts \ --fonts_dir /usr/share/fonts \ --text ../langdata/kor/kor.training_text \ --min_coverage .9 \ --render_per_font false \ --outputbase ../langdata/kor/kor \ |& grep raw | sed -e 's/ :.*/" \\/g' |

Re: [tesseract-ocr] Re: pango library doesn't recognize my font .

2018-03-13 Thread ShreeDevi Kumar
remove these two lines and try --fonts_dir $fonts_dir \ --fontlist $fonts_for_training \ this overrides what is given in language-specific.sh ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, Mar 13, 2018

Re: [tesseract-ocr] Re: pango library doesn't recognize my font .

2018-03-13 Thread ShreeDevi Kumar
Did you use the fonts_dir where they are installed??? On Tue 13 Mar, 2018, 9:32 PM 이경준, wrote: > Thank U . I have a fontslist file > > but vim fontlist.txt > > There are no fonts ?? > > It means that I cannot use korena fonts?? > > 2018년 3월 13일 화요일 오후 9시 9분 45초 UTC+9,

Re: [tesseract-ocr] Re: pango library doesn't recognize my font .

2018-03-13 Thread ShreeDevi Kumar
change double quote to single quote " to ' ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, Mar 13, 2018 at 10:05 PM, 이경준 wrote: > >

Re: [tesseract-ocr] Different output by tesseract for same image

2018-03-13 Thread ShreeDevi Kumar
Please send the sample image for testing. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, Mar 13, 2018 at 5:13 PM, Preeti Pandey wrote: > Hi all, > Using tesserect-OCR, different

Re: [tesseract-ocr] Depending on OS, tesseract (4.0) performance is different?

2018-03-15 Thread ShreeDevi Kumar
No. You can use Alex's PPA and install for your version of Ubuntu. On Thu 15 Mar, 2018, 9:16 PM 이경준, wrote: > Now Im installing ubuntu 18.04 for tesseract4.00 beta.1 > > Is it right? > > -- > You received this message because you are subscribed to the Google Groups >

Re: [tesseract-ocr] Bad results on simple code image

2018-03-09 Thread ShreeDevi Kumar
Try adding a small white border around the image and see if that gives better results. Which version of tesseract, which traineddata file, and which OS are you using? On Sat 10 Mar, 2018, 1:59 AM Benno Fünfstück, wrote: > Hi, > > I've tried to get tesseract to recognize a (in my

Re: [tesseract-ocr] Re: I do not include 'chi_tra' in my tessdata folder . What is it ? I have seen language-specific.sh

2018-03-10 Thread ShreeDevi Kumar
Lang1+lang2 should work. If it does not, please open an issue with an example image. If lang2 is English, you may want to try the script-level traineddata, which includes English with the other languages. Please take a look at the readme file in tessdata_fast which explains about script level

Re: [tesseract-ocr] I do not include 'chi_tra' in my tessdata folder . What is it ? I have seen language-specific.sh

2018-03-09 Thread ShreeDevi Kumar
I hope someone who knows Korean can answer your questions. On Sat 10 Mar, 2018, 12:48 PM 이경준, wrote: > Hi, I'm sorry to ask so often, and to ask so many questions. > > But I must use tesseract 4.0 for my business. > > Please understand my situation. I have lots of family

Re: [tesseract-ocr] Depending on OS, tesseract (4.0) performance is different?

2018-03-15 Thread ShreeDevi Kumar
> 1) how to replace tesseract 4.00 alpha with tesseract 4.00 Beta ? How did you install tesseract 4.00alpha?

Re: [tesseract-ocr] Depending on OS, tesseract (4.0) performance is different?

2018-03-15 Thread ShreeDevi Kumar
> tesseract 4.0 Alpha on Ubuntu 16.04.03 LTS Please use the latest version, beta.1, or build from source on GitHub. > They are operated by Windows, I think. No, they are not operated by Windows. They run on 'bash under Windows', which provides Ubuntu 14.04. It can use fonts installed under Windows.

Re: [tesseract-ocr] Depending on OS, tesseract (4.0) performance is different?

2018-03-15 Thread ShreeDevi Kumar
sudo apt-get purge packagename, or sudo apt-get remove --purge packagename, will remove about *everything* regarding the package packagename, [...] Particularly useful when you want to 'start all over' with an application.
sudo apt-get autoremove
ShreeDevi

Re: [tesseract-ocr] tesseract (4.0) criterion

2018-03-09 Thread ShreeDevi Kumar
From the wiki home page: Various types of training data can be found on GitHub. Unpack and copy the .traineddata file into a 'tessdata' directory. The exact directory will depend both on the type of training data, and your Linux distribution. Possibilities are
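The copy step described on the wiki could look roughly like the sketch below. The function name install_traineddata and both paths in the example call are placeholders made up for illustration; the actual tessdata directory depends on your distribution and on how tesseract was installed, so check that first. The size check is there because a failed download often leaves an empty file behind.

```shell
#!/bin/sh
# Hypothetical helper: copy one .traineddata file into a tessdata directory.
# Usage: install_traineddata FILE TESSDATA_DIR
install_traineddata() {
    src=$1
    dest_dir=$2
    # Refuse missing or zero-byte files instead of copying them.
    if [ ! -s "$src" ]; then
        echo "missing or empty file: $src" >&2
        return 1
    fi
    # Create the target directory if needed, then copy the file into it.
    mkdir -p "$dest_dir" && cp "$src" "$dest_dir/"
}

# Example call (paths are placeholders, adjust to your setup):
# install_traineddata ./kor.traineddata "$HOME/tessdata"
```

Afterwards, tesseract --list-langs should show the language if that directory is the one tesseract reads. Setting TESSDATA_PREFIX or passing --tessdata-dir are the usual ways to point tesseract at a different directory.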

Re: [tesseract-ocr] Checkbox Extraction as text after Fine tuning for new characters .

2018-04-03 Thread ShreeDevi Kumar
Try to train with a large number of fonts and see if that improves the result. On Tue 3 Apr, 2018, 2:29 PM Apoorv Khanna, wrote: > Hi all, > > I am able to extract few check boxes after fine tuning the English model > but tesseract is not able to extract all the check

Re: [tesseract-ocr] Error at training 4.0

2018-04-04 Thread ShreeDevi Kumar
Training tesseract 4.0.0 is different from the process for 3.0x. Training using images is not supported for tesseract 4.0.0. See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 On Thu 5 Apr, 2018, 1:36 AM Fanatico, wrote: > Hi, I'm new to tesseract

Re: [tesseract-ocr] Traineed non unicode font with tesseract

2018-04-04 Thread ShreeDevi Kumar
Training tesseract is only supported using Unicode fonts. On Thu 5 Apr, 2018, 12:25 AM gopal bhalala, wrote: > Hi, I am new to tesseract-ocr. I want to train a non-Unicode font using > tesseract. I tried to train it with jTextboxeditor to train that > data but did

Re: [tesseract-ocr] Traineed non unicode font with tesseract

2018-04-05 Thread ShreeDevi Kumar
quick response, is there any way to train a non-Unicode font > PDF and image? > I have a non-Unicode PDF file and image for OCR. Shall I box it and assign > the Unicode font character? Is that the right way to OCR a non-Unicode PDF or image? > > On 05-Apr-2018 7:25 AM, "ShreeDe

Re: [tesseract-ocr] ERROR: exp0.box does not exist or is not readable

2018-04-06 Thread ShreeDevi Kumar
Is your langdata in --langdata_dir ../../langdata? On Sat 7 Apr, 2018, 4:51 AM Fanatico, wrote: > I'm trying to execute the training from the 4.0 tutorial, but I'm getting > an error, can someone help with this? > > Platform: Mac OS X 10.13.3 > Tesseract: 4.0.0-beta.1

Re: [tesseract-ocr] How to created training text as provided in langdata for any new language if i have just just have a wordlist.

2018-04-07 Thread ShreeDevi Kumar
Just a word list is not enough for training text. For tesseract 4.0.0 it needs to be representative of the text to be recognized. On Sat 7 Apr, 2018, 2:50 PM Romil Mehla, wrote: > Is there any program to generate it? I see ambiguous_words.cpp > generating dictionary words
