Re: [tesseract-ocr] Tesseract 4 training related issue

2018-06-15 Thread ShreeDevi Kumar
Are you using images and box files? Does your box file have boxes for spaces between words? ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Fri, Jun 15, 2018 at 12:42 PM pranaya mhatre wrote: > Hi, > > I trained

Re: [tesseract-ocr] Can :traineddata" for Tesseract 3 be used for Tesseract 4

2018-06-13 Thread ShreeDevi Kumar
If you have box/tiff pairs in tesseract4 format, you can generate the lstmf files by running: tesseract lang.file.exp0.tif lang.file.exp0 lstm.train (lstm.train is a config file.) ShreeDevi भजन - कीर्तन - आरती @
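
For a folder with many box/tiff pairs, the command quoted above can be generated per image. The following is only a sketch: the helper name lstmf_commands is made up for illustration, and it assumes the output base is simply the image name without its extension, as in the lang.file.exp0 example.

```python
from pathlib import Path


def lstmf_commands(file_names):
    """Build one `tesseract <image> <output base> lstm.train` command per TIFF name."""
    commands = []
    for name in sorted(file_names):
        path = Path(name)
        if path.suffix.lower() not in (".tif", ".tiff"):
            continue  # skip box files and anything that is not a TIFF image
        base = str(path.with_suffix(""))  # lang.file.exp0.tif -> lang.file.exp0
        # lstm.train is the config file named in the reply above
        commands.append(["tesseract", str(path), base, "lstm.train"])
    return commands
```

Each returned list can be passed to subprocess.run() once the matching .box file sits next to the image.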

Re: [tesseract-ocr] How to assess the quality of Tesseract OCR output programmatically?

2018-06-13 Thread ShreeDevi Kumar
You can compare OCRed text with groundtruth text. If creating pdf, you will have to extract text from it to compare. There are two options: https://github.com/impactcentre/ocrevalUAtion or https://github.com/eddieantonio/isri-ocr-evaluation-tools
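
If the linked evaluation tools are more than you need, comparing OCR output with ground truth can be approximated by a character error rate. This is a minimal sketch and not what either tool does internally: it uses a plain Levenshtein distance, and the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance divided by the length of the ground truth text."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)
```

A rate of 0.0 means the OCR text matches the ground truth exactly. Whitespace and Unicode normalization are left out here and usually matter in practice.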

Re: [tesseract-ocr] Tesseract 4 for old languages

2018-06-12 Thread ShreeDevi Kumar
Please also see http://doc-creator.labri.fr/ which makes it easy to create synthetic data similar to manuscript pages. On Tue, Jun 12, 2018 at 9:03 PM ShreeDevi Kumar wrote: > Please see the project https://github.com/OCR-D/ocrd-train > > It has support for training tesseract if yo

Re: [tesseract-ocr] Tesseract 4 for old languages

2018-06-12 Thread ShreeDevi Kumar
are used by lstmtraining. >> >> langdata refers to the langdata repository under tesseract-ocr github >> repo. The files in it have not been updated for 4.0.0 >> >> >> >> >> ShreeDevi >> ___

Re: [tesseract-ocr] Re: use multi threads in tesseract

2018-06-12 Thread ShreeDevi Kumar
Thank you for the info. The following link also has helpful info. https://www.ibm.com/support/knowledgecenter/SSGH2K_13.1.2/com.ibm.xlc131.aix.doc/compiler_ref/omp_thread_limit.html ShreeDevi भजन - कीर्तन - आरती @

Re: [tesseract-ocr] Image DPI restriction

2018-06-11 Thread ShreeDevi Kumar
For better recognition 300 dpi is recommended. You can use a program like imagemagick to change dpi if needed. On Mon, Jun 11, 2018 at 8:30 PM Vidur Malhotra wrote: > Hi, > I was going through tesseract tutorials wherein it is mentioned that for > Tesseract to do OCR, image should have
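
As a rough illustration of the 300 dpi advice, the scale factor for an image of known resolution can be computed and handed to ImageMagick. This is a sketch under assumptions: the helper names are invented, the current DPI must be known by the caller, and the choice of the convert options -resize, -units and -density is mine, not something stated in the thread.

```python
def resize_percent(current_dpi, target_dpi=300):
    """Percentage by which to scale an image so it reaches target_dpi."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return 100.0 * target_dpi / current_dpi


def imagemagick_command(src, dst, current_dpi, target_dpi=300):
    """Build a convert command that scales the pixels and records the new DPI."""
    percent = resize_percent(current_dpi, target_dpi)
    return ["convert", src,
            "-resize", f"{percent:g}%",   # scale the pixel dimensions
            "-units", "PixelsPerInch",
            "-density", str(target_dpi),  # store the resolution in the output metadata
            dst]
```

Upscaling does not add detail, so rescanning at a higher resolution is preferable when that is possible.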

Re: [tesseract-ocr] [SOLVED] Re: tess4j: NullPointerException while reading text in rectangle of image.

2018-06-09 Thread ShreeDevi Kumar
For tess4j see https://github.com/nguyenq/tess4j/blob/master/src/test/java/net/sourceforge/tess4j/TessAPI1Test.java On Sun 10 Jun, 2018, 12:51 AM Dattatraya Tembare, wrote: > I have used another method, and it worked perfectly. > > public static void main(String[] args) { > String fileStr =

Re: [tesseract-ocr] error

2018-06-09 Thread ShreeDevi Kumar
You are probably using a wrong traineddata file i.e. 3.0x version file with latest 4.0x code from master branch. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Sat, Jun 9, 2018 at 3:33 PM Vishal Jha wrote: > 1,

Re: [tesseract-ocr] Unrecognized argument --linedata_only

2018-06-09 Thread ShreeDevi Kumar
gt; \ > --max_iterations 5000 > &>/home/kddlab/Desktop/tesseract-master/1MyData/testfasout/basetrain.log > and i have this *error now* > > *Segmentation fault (core dumped)* > > > Could you please help me again? > > On Sat, Jun 9, 2018 at 11:33 AM, ShreeDevi Kuma

Re: [tesseract-ocr] Unrecognized argument --linedata_only

2018-06-09 Thread ShreeDevi Kumar
--linedata_only should work. > tesseract 4.0.0-beta.1 Do you know which commit? Please try with latest code. > i am using src/training/tesstrain.sh The command you used was: > sudo tesstrain.sh Why do you need sudo? Please run the script with bash -x src/training/tesstrain.sh etc

Re: [tesseract-ocr] Unrecognized argument --linedata_only

2018-06-08 Thread ShreeDevi Kumar
Are you using the correct version of tesstrain.sh? It should be in src/training/tesstrain.sh ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Fri, Jun 8, 2018 at 6:49 PM Zohreh Khosrobeygi wrote: > Hi, > I have

Re: [tesseract-ocr] Suggestion for the API

2018-06-07 Thread ShreeDevi Kumar
You can provide this info as a Pull Request in GitHub repo for easier review and search. On Wed, Jun 6, 2018 at 2:24 PM Paul TOTH wrote: > Hello, > > I'm not a C++ developer and I'm new to the project so I don't want to > disturb the repository with my code...but, I've made some changes that >

Re: [tesseract-ocr] Re: Preprocess Image

2018-06-04 Thread ShreeDevi Kumar
Take a look at http://www.fmwconcepts.com/imagemagick/textcleaner/ and other scripts by Fred ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Mon, Jun 4, 2018 at 10:52 PM, Hongguo An wrote: > Can anybody help? thanks

Re: [tesseract-ocr] How to train by tesseract 4.00

2018-06-03 Thread ShreeDevi Kumar
If you want to train using fonts, use tesstrain.sh. See the wiki pages regarding training. If you want to use scanned images, then see https://github.com/OCR-D/ocrd-train for using line images and their ground truth transcriptions to create box files, lstmf files and training. ShreeDevi

Re: [tesseract-ocr] error in lstm training

2018-06-02 Thread ShreeDevi Kumar
> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244 You can only continue_from models in tessdata_best repo which are float models. The integer models in tessdata and tessdata_fast can not be used for that purpose. ShreeDevi

Re: [tesseract-ocr] lstmeval gives a perfect result but tesseract fails

2018-06-01 Thread ShreeDevi Kumar
From what I understand from the documentation provided by Ray Smith regarding LSTM training, the models have been trained on hundreds of thousands of lines and hundreds of fonts. The network spec used for training from scratch will therefore be optimized for such large models. You seem to have

Re: [tesseract-ocr] Not able install tesseract ocr on ubuntu 17.04

2018-06-01 Thread ShreeDevi Kumar
Please see the email from Alex and follow instructions in that. On Fri 1 Jun, 2018, 10:08 AM RT-Rakesh, wrote: > > Hi ShreeDevi, > > Thanks for your response. > > I am still getting this error when trying with the command that you shared. > Please assist me how to go about here. > > Thank you

Re: [tesseract-ocr] lstmeval gives a perfect result but tesseract fails

2018-05-31 Thread ShreeDevi Kumar
>I've trained a LSTM model for a custom language from scratch as explained here . >The language only has about 100 words and 17 characters, so it's pretty simple. For such a small model, try to build the legacy version

Re: [tesseract-ocr] Help for training tesseract to recognize a new (dead) language

2018-05-30 Thread ShreeDevi Kumar
le at it, is there a tool I could use to split book pages into > separate lines so that I can give it as part of training (along with it's > text of course) > > > > On 05/30/2018 12:44 PM, ShreeDevi Kumar wrote: > > I am trying a test training for coptic for tess4, will let yo

Re: [tesseract-ocr] Help for training tesseract to recognize a new (dead) language

2018-05-30 Thread ShreeDevi Kumar
PM, ShreeDevi Kumar wrote: > I am trying a test training for coptic for tess4, will let you know where > to access traineddata. > > You can train using utf-8 text and unicode Coptic fonts. > > 1. collect utf-8 text in Coptic > 2. Find Coptic unicode fonts, if you can find one sim

Re: [tesseract-ocr] Help for training tesseract to recognize a new (dead) language

2018-05-30 Thread ShreeDevi Kumar
e can use the other link you > provided > https://github.com/OCR-D/ocrd-train > To train Tesseract 4.00 > > Thank you very much > > > On 05/30/2018 06:31 AM, ShreeDevi Kumar wrote: > > See http://www.moheb.de/ocr.html > > It provides a traineddata

Re: [tesseract-ocr] Help for training tesseract to recognize a new (dead) language

2018-05-29 Thread ShreeDevi Kumar
See http://www.moheb.de/ocr.html It provides a traineddata file for Coptic for use with tesseract version 3. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, May 29, 2018 at 9:57 PM, wrote: > Hi, > I belong to a

Re: [tesseract-ocr] Help for training tesseract to recognize a new (dead) language

2018-05-29 Thread ShreeDevi Kumar
please see https://github.com/OCR-D/ocrd-train you can use it with image files and matching ground truth text - in utf-8. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, May 29, 2018 at 9:57 PM, wrote: > Hi, >

Re: [tesseract-ocr] Some spaces are not recognized

2018-05-29 Thread ShreeDevi Kumar
Set the config variable "preserve_interword_spaces" to 1 in one run and to 0 in another, and see if that makes any difference. On Tue 29 May, 2018, 4:30 PM ShreeDevi Kumar, wrote: > >The traineddata from tesseract does not have a spacing problem, > > Then the problem
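
The two test runs suggested here could be prepared like this. It is a sketch only: the helper name, the output naming scheme and the default language are placeholders, and the variable is passed with the command-line option -c.

```python
def spacing_test_commands(image, output_base, lang="eng"):
    """Build two tesseract commands that differ only in preserve_interword_spaces."""
    commands = []
    for value in (0, 1):
        commands.append([
            "tesseract", image, f"{output_base}-spaces{value}",
            "-l", lang,
            "-c", f"preserve_interword_spaces={value}",
        ])
    return commands
```

Running both commands and diffing the two text outputs shows whether the variable affects the missing spaces.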

Re: [tesseract-ocr] Some spaces are not recognized

2018-05-29 Thread ShreeDevi Kumar
>The traineddata from tesseract does not have a spacing problem, Then the problem is related to training. On Tue 29 May, 2018, 4:16 PM Sumedhe Dissanayake, < sumedhedissanay...@gmail.com> wrote: > > > On Friday, May 18, 2018 at 6:32:44 PM UTC+5:30, shree wrote: >> >> image is not visible. >>

Re: [tesseract-ocr] Re: use multi threads in tesseract

2018-05-28 Thread ShreeDevi Kumar
Also see https://github.com/tesseract-ocr/tesseract/issues/1317 ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Mon, May 28, 2018 at 2:45 PM, ShreeDevi Kumar <shreesh...@gmail.com> wrote: > Please

Re: [tesseract-ocr] Re: use multi threads in tesseract

2018-05-28 Thread ShreeDevi Kumar
Please see https://github.com/tesseract-ocr/tesseract/wiki/FAQ#can-i-increase-speed-of-ocr Set the maximum number of threads using the environment variable OMP_THREAD_LIMIT. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
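
When tesseract is started from a script, the OMP_THREAD_LIMIT variable can be set for the child process only. A minimal sketch, with an invented helper name:

```python
import os


def tesseract_env(max_threads, base_env=None):
    """Return a copy of the environment with OMP_THREAD_LIMIT set."""
    if max_threads < 1:
        raise ValueError("max_threads must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(max_threads)  # environment values must be strings
    return env
```

For example, subprocess.run(["tesseract", "in.png", "out"], env=tesseract_env(1)) limits that one run to a single thread, which may help when several tesseract processes run in parallel.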

Re: [tesseract-ocr] Re: how to install this

2018-05-24 Thread ShreeDevi Kumar
On Thu, May 24, 2018 at 6:41 PM, Hiren Motwani wrote: > thank you so much .. can you guide me how to use ? > https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage If you want a GUI, try https://github.com/manisandro/gImageReader/releases > > > On

Re: [tesseract-ocr] how to install this

2018-05-24 Thread ShreeDevi Kumar
https://github.com/UB-Mannheim/tesseract/wiki ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Thu, May 24, 2018 at 6:10 PM, Hiren Motwani wrote: > how to install this tesseract-ocr in

Re: [tesseract-ocr] Tesseract doesnt read tiff files correctly

2018-05-23 Thread ShreeDevi Kumar
tesseract uses leptonica. You can try that for preprocessing. See an example at http://tpgit.github.io/UnOfficialLeptDocs/leptonica/line-removal.html ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Wed, May 23, 2018 at

Re: [tesseract-ocr] missing a line in OCR persian

2018-05-21 Thread ShreeDevi Kumar
Seems related to open issue https://github.com/tesseract-ocr/tesseract/issues/1339 Entire lines of text missing. Different missing when psm = 3, 6, 11 ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com 2018-05-22 10:45

Re: [tesseract-ocr] Re: Training Tesseract4.0 (LSTM) on word level bounding boxes

2018-05-21 Thread ShreeDevi Kumar
You can see if generate_line_box.py from https://github.com/OCR-D/ocrd-train is helpful. It requires single line images and matching ground truth to create the box files. ShreeDevi

Re: [tesseract-ocr] run training and testing on gpu

2018-05-19 Thread ShreeDevi Kumar
Regarding LSTM training, please see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 > Basically it will still run on anything with enough memory, but the higher-end your processor is, the faster it will go. No *GPU* is needed. (No support.) ShreeDevi

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-18 Thread ShreeDevi Kumar
Hi Reza, Attached are two scripts and one log file. You will need to change the directories in the scripts. finetune.sh and finetune log file are for a sample finetuning for eng. By changing the language code you can run it for fas. You can use that as a test. plus-fas.sh is for plusminus type

Re: [tesseract-ocr] Re: How can JTessBoxEditor generate lstm files ?

2018-05-18 Thread ShreeDevi Kumar
I use WSL with MobaXterm on Windows 10. On Fri 18 May, 2018, 11:33 PM Joshua Willmot, wrote: > I am using Windows Subsystem for Linux (Ubuntu). It works in exactly the > same way as it would on normal Ubuntu. > > On Thursday, May 17, 2018 at 11:11:54 PM UTC+2, Quan

Re: [tesseract-ocr] Re: Error in executing new .traineddata file

2018-05-18 Thread ShreeDevi Kumar
>Tesseract Beta 4.00, and do the same copy the .traineddata inside tessdata, If you have created your traineddata for 3.05, it may not be compatible with 4.0.0beta. On Sat 19 May, 2018, 2:26 AM Quan Nguyen, wrote: > The error message indicated Tesseract was looking for

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-18 Thread ShreeDevi Kumar
I have posted a couple of test models for Farsi at https://github.com/Shreeshrii/tessdata_shreetest These have not been trained on text with diacritics as the normalization and training process was giving error on the combining marks. Please give them a try and see if they provide better

Re: [tesseract-ocr] Some spaces are not recognized

2018-05-18 Thread ShreeDevi Kumar
image is not visible. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Fri, May 18, 2018 at 5:39 PM, Sumedhe Dissanayake < sumedhedissanay...@gmail.com> wrote: > Sometimes spaces between words are ignored when

Re: [tesseract-ocr] tesseract version - Ubuntu 16.04 PPA vs compiling from tesseract-ocr github source (master-branch)

2018-05-17 Thread ShreeDevi Kumar
> Which traineddata (english) is installed when tesseract is installed using the Ubuntu PPA? tessdata_fast. > Is the Ubuntu PPA version in sync with the Github master branch? Not necessarily, but it should be pretty close. You can look at the commit number and date in the files at the PPA. >

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-15 Thread ShreeDevi Kumar
I will try to put together complete steps. I am doing a test run for training persian. Are the following fonts ok for it? '55_Sarchia_Kurdish' \ '56_Sarchia_Kurdish_Bold Bold' \ 'Amiri' \ 'Arabic Typesetting' \ 'Arial' \ 'Arial Unicode MS' \ 'B Nazanin' \ 'B Nazanin Bold' \

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-15 Thread ShreeDevi Kumar
Please use the latest Windows binaries from https://github.com/UB-Mannheim/tesseract/wiki provided by @stweil. How do you run a bash script on Windows 10? @stweil I have not tried training on Windows. Do you have feedback from others who have tried it? ShreeDevi

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-15 Thread ShreeDevi Kumar
What o/s are you running it on? Which version of tesseract? > ICU ERROR: U_FILE_ACCESS_ERROR ERROR: /tmp/tmp.6m4B2TUln1/eng/eng.unicharset does not exist or is not readable Which version of the ICU library? ShreeDevi भजन - कीर्तन - आरती @

Re: [tesseract-ocr] Re: Problem reading text in two columns

2018-05-11 Thread ShreeDevi Kumar
> I used the tessdata_fast file for English - are these different from tessdata-ocr-eng that comes with Ubuntu? The ppa has traineddata files from tessdata_fast. Ubuntu 18.04 will have the same. Older versions of ubuntu (without ppa) will have traineddata files for tesseract 3.0x. You can try

Re: [tesseract-ocr] Tesseract crashed on blank page

2018-05-10 Thread ShreeDevi Kumar
which version? which o/s? which language? what command did you use? ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Thu, May 10, 2018 at 10:59 PM, kvc wrote: > Hi everyone, > > I launch

Re: [tesseract-ocr] Problem reading text in two columns

2018-05-06 Thread ShreeDevi Kumar
Which version of tesseract are you using? Which traineddata (from which repo)? Try with --psm 6 if using tesseract 4 beta. It will recognise the whole line, rather than a column. On Mon 7 May, 2018, 1:21 AM Brooks Johnson, wrote: > >

Re: [tesseract-ocr] Tesseract 4.0 extracting multiple columns where one is wanted

2018-05-03 Thread ShreeDevi Kumar
Try with --psm 6 ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Wed, May 2, 2018 at 9:26 PM, wrote: > I am using Tesseract 4.0 to extract text from scanned PDF documents. I >

Re: [tesseract-ocr] Trained font - always one letter wrong

2018-05-02 Thread ShreeDevi Kumar
Your image has text in German. You will get better results using language `deu` out of the box. Attached are OCR results using deu.traineddata from tessdata_best and tessdata_fast using tesseract-4.0.0-beta.1 run via command line. #tesseract sample.tif sample-deu-fast -l deu --tessdata-dir

Re: [tesseract-ocr] Trained font - always one letter wrong

2018-05-02 Thread ShreeDevi Kumar
Please provide a small sample image to test. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Wed, May 2, 2018 at 11:26 AM, wrote: > Training doesn't work. If i use the characters "ä, ö, ü"

Re: [tesseract-ocr] Do I need to call Init before every rectangle?

2018-05-01 Thread ShreeDevi Kumar
See https://github.com/tesseract-ocr/tesseract/wiki/FAQ#there-are-inconsistent-results-from-tesseract-when-the-same-tessbaseapi-object-is-used-for-decoding-multiple-images On Tue 1 May, 2018, 12:53 PM Ben Rogall, wrote: > > I am using the baseapi to OCR a large number of

Re: [tesseract-ocr] Trained font - always one letter wrong

2018-04-30 Thread ShreeDevi Kumar
Use the latest version, 4.0.0beta. On Sun 29 Apr, 2018, 1:51 PM , wrote: > I did. Unfortunately they don't answer... > Have you any advice for me, to improve the > training process? How many training texts should i use? Or is it possible > that there is a problem with

Re: [tesseract-ocr] tesseract performs wrong auto-correction sometimes : how to disable it?

2018-04-29 Thread ShreeDevi Kumar
Please provide a sample image to test. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Thu, Apr 26, 2018 at 1:35 PM, Youcef wrote: > > I'm using master branch with tessdata_fast models > > Le

Re: [tesseract-ocr] Tesseract config for simple single words text and questions about learning

2018-04-29 Thread ShreeDevi Kumar
Try tesseract-4.0.0-beta I get correct results with it from command line # tesseract numbers-test.png numbers-test --tessdata-dir ./tessdata_fast -l eng --oem 1 --psm 6 Tesseract Open Source OCR Engine v4.0.0-beta.1-200-g37d20 with Leptonica Warning. Invalid resolution 0 dpi. Using 70 instead.

Re: [tesseract-ocr] Trained font - always one letter wrong

2018-04-29 Thread ShreeDevi Kumar
Check that your training text has enough samples for d. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Sun, Apr 29, 2018 at 1:51 PM, wrote: > I did. Unfortunately they don't answer... > Have

Re: [tesseract-ocr] tesseract 4 beta: openCL useage

2018-04-28 Thread ShreeDevi Kumar
a neural net is about the engine parts, not the image >> characterisation rendering method, am I right? because I see many >> presentations, and most of them talk about the history of tesseract, but >> that's not what I need >> >> 2018-04-27 14:27 GMT+00:00 ShreeDevi Kuma

Re: [tesseract-ocr] tesseract 4 beta: openCL useage

2018-04-27 Thread ShreeDevi Kumar
Please see https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM for info about the neural nets used by tesseract. On Fri 27 Apr, 2018, 7:48 PM Janpieter Sollie, wrote: > I had a quick thought about what you could offload to opencl. I will need > some help from

Re: [tesseract-ocr] Problem facing with tessearct training 4 with arabic

2018-04-25 Thread ShreeDevi Kumar
You are trying to train only digits but then using the unicharset which has these numbers only for compressing the wordlist (which uses Arabic alphabet) to a 'dawg'. The command you have used only creates the starter traineddata for LSTM training. Please follow the instructions given in the wiki

Re: [tesseract-ocr] tesseract performs wrong auto-correction sometimes : how to disable it?

2018-04-25 Thread ShreeDevi Kumar
Which version of tesseract are you using? ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Wed, Apr 25, 2018 at 8:29 PM, Youcef wrote: > Hi, > > > Tesseract seems to post process its

Re: [tesseract-ocr] Re: Box file generator combines vertical lines across rows of text

2018-04-24 Thread ShreeDevi Kumar
Please provide a sample tiff, single page will do, for testing. On 25-Apr-2018 2:00 AM, "Cameron McSweeney" wrote: Yes, and the box files 4.0 made still had the same problem. The accuracy with 4.0 was much better but it still needs some tweaking, so I figured I would be

Re: [tesseract-ocr] Re: Box file generator combines vertical lines across rows of text

2018-04-24 Thread ShreeDevi Kumar
Have you tried the latest version, tesseract 4.0.0beta? On Wed 25 Apr, 2018, 12:03 AM Cameron McSweeney, wrote: > Tesseract seems to be much too willing to find vertical lines. For > example, Ds will be divided so that the straight, left portion is separate > from the

Re: [tesseract-ocr] Install Tesseract 4 on CentOS and Red Hat [SOLVED!]

2018-04-24 Thread ShreeDevi Kumar
I have never used equ.traineddata. From feedback in the forum I don't think it works very well. Maybe equ has not been trained via LSTM training, I have no way of knowing. Only Ray Smith or other developers from Google can answer that. Only LSTM models exist in tessdata_best and tessdata_fast.

Re: [tesseract-ocr] Install Tesseract 4 on CentOS and Red Hat [SOLVED!]

2018-04-23 Thread ShreeDevi Kumar
Thanks for the script to install tesseract on CentOS. I would suggest using traineddata files from tessdata_fast or tessdata_best repos for better accuracy and speed. On Mon 23 Apr, 2018, 11:52 PM Eugene Huang, wrote: > Hello! Most people are probably running Tesseract 4

Re: [tesseract-ocr] Unsure why tesseract isn't returning the correct text

2018-04-22 Thread ShreeDevi Kumar
Yes, please use the latest code from github master branch for building. That way you will have all the bug fixes and updates. On Sun 22 Apr, 2018, 2:42 AM 'DR' via tesseract-ocr, < tesseract-ocr@googlegroups.com> wrote: > I double checked, there seems to be a 4.0.0-beta.1 tag. I assume you >

Re: [tesseract-ocr] "jav" language -- is it Javanese Script or Latin-based text?

2018-04-22 Thread ShreeDevi Kumar
Seems to be in Latin script see https://github.com/tesseract-ocr/langdata/blob/master/jav/jav.training_text ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Sun, Apr 22, 2018 at 2:58 PM, Christopher Imantaka Halim <

Re: [tesseract-ocr] Unsure why tesseract isn't returning the correct text

2018-04-21 Thread ShreeDevi Kumar
BLAZIKEN-M RAPIDASH-M VICTREEBEL-M SHRRPEDO-M PORYGON-I-M RAZELF-M with tesseract -v tesseract 4.0.0-beta.1-133-g5435c leptonica-1.76.0 libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : libopenjp2 2.3.0 Found AVX Found SSE tesseract names.png -

Re: [tesseract-ocr] Train Tesseract 4.0 on Windows 8

2018-04-19 Thread ShreeDevi Kumar
tesstrain.sh is a bash shell script. You don't need python for it. Try the following (give the correct path): bash ./tesstrain.sh ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Thu, Apr 19, 2018 at 8:01 PM,

Re: [tesseract-ocr] How can I know whichever file format types Tesseract will recognize and able to process them ?

2018-04-18 Thread ShreeDevi Kumar
It depends on which image libraries leptonica was built with. tesseract -v will show the list On Thu 19 Apr, 2018, 10:46 AM abdu, wrote: > How do we get information for the file types in that Tesseract would > capable of processing ? > > -- > You received this message
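
To read that list from a script, the output of tesseract -v can be parsed. A sketch only: it assumes the layout seen in the 2018-04-21 post in this listing, where the image libraries appear on one line separated by " : ". Other builds may print the information differently.

```python
def image_libraries(version_output):
    """Extract library names from the ' : '-separated line of `tesseract -v` output."""
    libraries = []
    for line in version_output.splitlines():
        if " : " not in line:
            continue  # version, leptonica and "Found AVX" lines carry no separator
        for entry in line.split(" : "):
            parts = entry.split()
            if parts:
                libraries.append(parts[0])  # e.g. "libtiff" from "libtiff 4.0.3"
    return libraries
```

If libtiff or libpng is missing from the result, the matching image format is probably not readable by that build.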

Re: [tesseract-ocr] Training tessract 4.0 using images?

2018-04-15 Thread ShreeDevi Kumar
Please take a look at tesstrain_utils.sh and language-specific.sh in training directory for more details about how training works. As mentioned before training with box/tiff pairs is not supported. On Mon 16 Apr, 2018, 8:19 AM , wrote: > Hi Shree, > > Thanks for

Re: [tesseract-ocr] Training tessract 4.0 using images?

2018-04-15 Thread ShreeDevi Kumar
Hi Dennis, 1. Copy 4.0 format box/tiff pairs to langdata/$lang directory or any other folder of your choice. 2. Modify tesstrain.sh to copy these files to your /tmp directory - see following for where the lines need to be added source "$(dirname $0)/tesstrain_utils.sh" ARGV=("$@") parse_flags

Re: [tesseract-ocr] Training tessract 4.0 using images?

2018-04-13 Thread ShreeDevi Kumar
Training Tesseract 4.0 from images is not officially supported. Different people have had success in doing LSTM training with box/tiff pairs, but it requires hacks/programming on their part to create 4.0.0 compatible box files. tesstrain.sh creates box/tiff files in the /tmp directory, these

Re: [tesseract-ocr] Re: Change unicharset

2018-04-12 Thread ShreeDevi Kumar
1. concatenate the two training texts cat ./langdata/kor/kor.training_text ./langdata/chi_tra/chi_tra.training_text > ./langdata/kor/kor-chi_tra.training_text 2. run tesstrain.sh with (update for your paths, run with just one font which supports both languages as a test)

Re: [tesseract-ocr] Change unicharset

2018-04-12 Thread ShreeDevi Kumar
You cannot just overwrite the lstm.unicharset in a traineddata file, the unicharset has to be in sync with the other files in it i.e. lstm, dawgs, recoder etc. > I'm merging the ```kor.training_text``` with the ```chi_tra.training_text``` for tests You need to go through the complete training

Re: [tesseract-ocr] Column splitting failed around fuzzy line

2018-04-11 Thread ShreeDevi Kumar
Try to look at leptonica sample programs about column splitting to see if you can preprocess the image better, before giving to tesseract On Wed 11 Apr, 2018, 11:46 AM Ewan Mellor, wrote: > Hi, > > > I am using Tesseract 4 (git 10f4998a) to process a file with two

Re: [tesseract-ocr] Re: Error opening traineddata files on Mac High Sierra

2018-04-11 Thread ShreeDevi Kumar
https://github.com/tesseract-ocr/tesseract/issues/660 Regarding pdf On Wed 11 Apr, 2018, 1:28 PM ShreeDevi Kumar, <shreesh...@gmail.com> wrote: > 1. Check the output tif and adjust convert command if needed > > 2. Depending on your tesseract version you could try -l frk also.

Re: [tesseract-ocr] Re: Error opening traineddata files on Mac High Sierra

2018-04-11 Thread ShreeDevi Kumar
1. Check the output tif and adjust convert command if needed 2. Depending on your tesseract version you could try -l frk also. 3. Yes, you can get a pdf as output. Search Github issues, there is a long discussion thread regarding best ways to create a pdf output. Look for pdf and invisible

Re: [tesseract-ocr] Re: Doubt on "--eval_listfile"

2018-04-10 Thread ShreeDevi Kumar
Yes, and you can use different text files for training and eval. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, Apr 10, 2018 at 10:01 PM, Fanatico wrote: > wen I asked about passing

Re: [tesseract-ocr] Doubt on "--eval_listfile"

2018-04-10 Thread ShreeDevi Kumar
To make sure that the model is not overfitted to training data, your eval set should be different. You can use a different text file, different fonts from the training set to check that the model performs well on text and fonts it has not seen earlier. On Tue 10 Apr, 2018, 8:16 PM Fanatico,
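
One simple safeguard is to check that the list files used for training and evaluation do not name the same entries. A minimal sketch with an invented helper name. It only detects identical lines and cannot tell whether the underlying text or fonts differ.

```python
def shared_entries(train_lines, eval_lines):
    """Return entries that appear in both the training list and the eval list."""
    train = {line.strip() for line in train_lines if line.strip()}
    evaluation = {line.strip() for line in eval_lines if line.strip()}
    return sorted(train & evaluation)
```

An empty result means no line is shared between the two lists. With real list files, the lines can come from open(path).read().splitlines().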

Re: [tesseract-ocr] How to created training text as provided in langdata for any new language if i have just just have a wordlist.

2018-04-09 Thread ShreeDevi Kumar
act developer answer my question. Please tell me > the way > > Thanks again for your timely reply and help . > > > > > On Sat, Apr 7, 2018 at 6:21 PM, ShreeDevi Kumar <shreesh...@gmail.com> > wrote: > >> see https://github.com/tesseract-ocr/tesseract/wi

Re: [tesseract-ocr] Tessercat 4.0 korean detecting chinese

2018-04-09 Thread ShreeDevi Kumar
://bhajans.ramparivar.com On Mon, Apr 9, 2018 at 1:45 PM, ShreeDevi Kumar <shreesh...@gmail.com> wrote: > Leftover from 3.04, my guess. > > On Mon 9 Apr, 2018, 12:52 PM Fanatico, <fanatico.s...@gmail.com> wrote: > >> It worked, thanks. >> >> Any reason for this

Re: [tesseract-ocr] Tessercat 4.0 korean detecting chinese

2018-04-09 Thread ShreeDevi Kumar
Leftover from 3.04, my guess. On Mon 9 Apr, 2018, 12:52 PM Fanatico, wrote: > It worked, thanks. > > Any reason for this chi_tra there? > > > On Monday, 9 April 2018 03:24:44 UTC-3, shree wrote: >> >> Please remove the sub language line from config file, and use combine

Re: [tesseract-ocr] Tessercat 4.0 korean detecting chinese

2018-04-09 Thread ShreeDevi Kumar
Please remove the sub language line from the config file, and use combine_tessdata to overwrite it. Right now it seems to be using chi_tra also. On Mon 9 Apr, 2018, 11:48 AM Fanatico, wrote: > I used one traineddata that I created on removing the top layer from the >

Re: [tesseract-ocr] Tessercat 4.0 korean detecting chinese

2018-04-08 Thread ShreeDevi Kumar
Which traineddata are you using? Use combine_tessdata and extract the config file to see if chinese is included as sub language. Also look at the lstm-unicharset to see if the Chinese characters are included in it. On Mon 9 Apr, 2018, 11:09 AM Fanatico, wrote: > I'm
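
Once the config file has been extracted with combine_tessdata, the check for a sub language can be scripted. This sketch assumes the sub language is declared through the variable tessedit_load_sublangs, written as "name value" on one line with languages joined by "+". That variable name is my assumption and is not stated in the thread.

```python
def sub_languages(config_text):
    """Return the languages listed under tessedit_load_sublangs, if any."""
    for line in config_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "tessedit_load_sublangs":
            return parts[1].split("+")
    return []
```

A result such as ["chi_tra"] for a Korean traineddata file would explain Chinese characters showing up in the output.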

Re: [tesseract-ocr] Install and run tesseract 4.0 on MAC OSX step by step

2018-04-08 Thread ShreeDevi Kumar
Thank you. On Sun 8 Apr, 2018, 3:20 PM Fanatico, wrote: > I just posted at the repo issues a step to step that I needed to do so I > could use tessercat 4.0 from my MAC, so I'm just sharing the link in case > someone has the same problems I got. > Obs.: It can save a

Re: [tesseract-ocr] Failed to build ScrollView.jar on MAC OSX

2018-04-07 Thread ShreeDevi Kumar
Please try from the main tesseract folder. On Sat 7 Apr, 2018, 11:50 PM Fanatico, wrote: > from the java folder "cd ~/projects/tesseract/java" in my case > > On Saturday, 7 April 2018 12:40:29 UTC-3, shree wrote: >> >> Please see >>

Re: [tesseract-ocr] Failed to build ScrollView.jar on MAC OSX

2018-04-07 Thread ShreeDevi Kumar
Please see https://github.com/tesseract-ocr/tesseract/blob/master/Makefile.am From which dir did you try make ScrollView.jar? ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Sat, Apr 7, 2018 at 7:42 PM, Fanatico

Re: [tesseract-ocr] How to created training text as provided in langdata for any new language if i have just just have a wordlist.

2018-04-07 Thread ShreeDevi Kumar
see https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.03%E2%80%933.05 ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Sat, Apr 7, 2018 at 4:02 PM, Romil Mehla wrote: > Thanks for

Re: [tesseract-ocr] How to created training text as provided in langdata for any new language if i have just just have a wordlist.

2018-04-07 Thread ShreeDevi Kumar
Just a word list is not enough for training text. For tesseract 4.0.0 it needs to be representative of the text to be recognized. On Sat 7 Apr, 2018, 2:50 PM Romil Mehla, wrote: > Is there any program to generate it ? i see ambiguous_words.cpp > generating dictionary words

Re: [tesseract-ocr] ERROR: exp0.box does not exist or is not readable

2018-04-07 Thread ShreeDevi Kumar
Look in your tmp directory, in the subfolders referred to in the console output. Check the log file and other files there. On Sat 7 Apr, 2018, 11:00 AM Fanatico, wrote: > Yes the location is correct, I tried to put the full path to the folder > and go the same error. > >

Re: [tesseract-ocr] ERROR: exp0.box does not exist or is not readable

2018-04-06 Thread ShreeDevi Kumar
Is your langdata in --langdata_dir ../../langdata On Sat 7 Apr, 2018, 4:51 AM Fanatico, wrote: > I'm trying to execute the training from the 4.o tutorial, but I'm getting > an error, can someone help with this? > > Platform: MAC OS X 10.13.3 > Tesseract: 4.0.0-beta.1

Re: [tesseract-ocr] Traineed non unicode font with tesseract

2018-04-06 Thread ShreeDevi Kumar
me any guidance on how to do that? > > Best Regards & Thanking you, > Gopal Dhanjibhai Bhalala > > On Fri, Apr 6, 2018 at 1:20 AM, ShreeDevi Kumar <shreesh...@gmail.com> > wrote: > >> Are you trying to recognize the text from a pdf or image with non unicode

Re: [tesseract-ocr] Traineed non unicode font with tesseract

2018-04-05 Thread ShreeDevi Kumar
quick response, is there any way to train non unicode font > PDF AND IMAGE? > I have a non-Unicode PDF file and image for OCR. Shall I box it and assign > the Unicode font character? Is that the right way to OCR a non-Unicode PDF or image? > > On 05-Apr-2018 7:25 AM, "ShreeDe

Re: [tesseract-ocr] Traineed non unicode font with tesseract

2018-04-04 Thread ShreeDevi Kumar
Training Tesseract is only supported using Unicode fonts. On Thu 5 Apr, 2018, 12:25 AM gopal bhalala, wrote: > Hi, I am new to tesseract-ocr. I want to train a non-Unicode font using > tesseract. I tried to train it with jTextboxeditor to train that > data but did

Re: [tesseract-ocr] Error at training 4.0

2018-04-04 Thread ShreeDevi Kumar
Training Tesseract 4.0.0 is different from the process for 3.0x. Training using images is not supported for Tesseract 4.0.0. See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 On Thu 5 Apr, 2018, 1:36 AM Fanatico, wrote: > Hi, I'm new to tesseract

Re: [tesseract-ocr] Checkbox Extraction as text after Fine tuning for new characters .

2018-04-03 Thread ShreeDevi Kumar
Try training with a large number of fonts and see if that improves the result. On Tue 3 Apr, 2018, 2:29 PM Apoorv Khanna, wrote: > Hi all, > > I am able to extract a few check boxes after fine tuning the English model > but tesseract is not able to extract all the check

Re: [tesseract-ocr] does it make sense to train existing languages? how to fix repeatedly wrong letters?

2018-04-02 Thread ShreeDevi Kumar
My suggestion would be to post-process the OCR output. On Mon 2 Apr, 2018, 6:09 PM JP T, wrote: > Hi > > I don't really have an understanding of the consequences of training. > > My problem: > I've got tons of pages with a special format. ("one place study"
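The post-processing suggested in this reply could look roughly like the sketch below. It is only an illustration: the function name fix_ocr_text and the correction table are hypothetical and not from the thread, and a real table would be built from the misreadings actually seen in the OCR output.

```python
import re

def fix_ocr_text(text, corrections):
    """Replace each known misreading with its correction (whole words only)."""
    if not corrections:
        return text
    # Try longer misreadings first so they win over shorter overlapping ones.
    keys = sorted(corrections, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: corrections[m.group(1)], text)

# Hypothetical correction table: misreading -> intended word.
CORRECTIONS = {
    "0CR": "OCR",
    "Tesseraet": "Tesseract",
}
```

Calling fix_ocr_text(ocr_output, CORRECTIONS) on the recognized text leaves everything outside the table untouched, so the table can grow one entry at a time as recurring errors are found.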

Re: [tesseract-ocr] Extracting pristine rasterized text

2018-04-02 Thread ShreeDevi Kumar
Thank you for the detailed info. My suggestion is to try recognition with eng.traineddata from the tessdata_fast repository with --oem 1. On Tue 3 Apr, 2018, 3:13 AM Patrick Ramsey, wrote: > Answers below inline. And thank you very much for your help :) > > |PTR
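The suggestion in this reply could be run roughly as sketched below. This assumes the tesseract executable is on PATH and that eng.traineddata from the tessdata_fast repository has already been downloaded into a local folder; the function names and any file or folder names passed to them are placeholders, not from the thread.

```python
import subprocess

def build_tesseract_cmd(image, out_base, tessdata_dir, lang="eng"):
    """Build the argument list for a tesseract run with --oem 1 (LSTM engine only)."""
    return [
        "tesseract", image, out_base,
        "--tessdata-dir", tessdata_dir,  # folder holding the tessdata_fast model
        "-l", lang,
        "--oem", "1",
    ]

def run_ocr(image, out_base, tessdata_dir, lang="eng"):
    """Run tesseract; the recognized text is written to out_base + '.txt'."""
    subprocess.run(build_tesseract_cmd(image, out_base, tessdata_dir, lang), check=True)
```

For example, run_ocr("page.png", "page", "tessdata_fast") would write page.txt, given that a file page.png exists and the folder tessdata_fast contains eng.traineddata.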

Re: [tesseract-ocr] Any suggestions for more accurate Text conversion?

2018-03-27 Thread ShreeDevi Kumar
Version mismatch. That traineddata is for 4.0. Wiki has pages for training. Look for one appropriate for your version of tesseract. On Wed 28 Mar, 2018, 1:23 AM , wrote: > Hi Shree, > > I just tried using the training data file you provided but it seems that > there is some

Re: [tesseract-ocr] Unable to use tesseract api installed with a nuget pkg

2018-03-27 Thread ShreeDevi Kumar
I don't use visual studio. However I know that we support vs installation via cppan cmake. Please follow those directions. On Tue 27 Mar, 2018, 9:24 PM sonu sainju, wrote: > Hey Shree, Thanks for replying. No I didn't build using cppan and cmake. I > used vcpkg install

Re: [tesseract-ocr] How to merge 2 traineddata into 1 traineddata

2018-03-26 Thread ShreeDevi Kumar
Please look at https://github.com/tesseract-ocr/tessdata_fast/tree/master/script Look at all the Han* files; maybe Hangul is the one you need. See https://github.com/tesseract-ocr/tessdata_fast/blob/master/README.md for more details. ShreeDevi

Re: [tesseract-ocr] How to merge 2 traineddata into 1 traineddata

2018-03-26 Thread ShreeDevi Kumar
Try the script-level traineddata files from tessdata_fast/script. Han probably has eng+chi* ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Mon, Mar 26, 2018 at 12:01 PM, wrote: > Hi I'm
