Re: Thoughts on having the training process take font files directly

2013-10-15 Thread shree
Hi jozef, Did your company publish the article regarding tesseract including training and noise adding where our experience and expertise will be described in more detail On Tuesday, October 16, 2012 9:54:48 PM UTC+5:30, jm wrote: On Tuesday, October 16, 2012 12:27:43 PM UTC+2, TP

Re: [tesseract-ocr] Re: Font Limit = 64 fonts in traineddata, really ??

2014-07-08 Thread shree
My information IS dated - I haven't followed the recent changes. Please see this thread - almost a year old which talked of the upcoming changes for training https://groups.google.com/forum/#!searchin/tesseract-dev/fonts/tesseract-dev/4lxGjCGLBSw/CH1cZsovPjIJ On Wednesday, July 9,

[tesseract-ocr] Odia Characters Recognition by Training Tesseract OCR Engine

2014-07-25 Thread shree
http://www.ijcaonline.org/proceedings/icdcit2014/number1/14381-1306 http://research.ijcaonline.org/icdcit2014/number1/icdcit1306.pdf Mamata Nayak and Ajit Kumar Nayak. Article: Odia Characters Recognition by Training Tesseract OCR Engine. *IJCA Proceedings on International Conference on

[tesseract-ocr] Re: [tesseract-dev] Re: Training tools linking failure, icu_48::*

2014-07-31 Thread Shree
It may be helpful to add these instructions for compiling Tesseract on Ubuntu with training tools to the wiki. On Thursday, July 31, 2014 9:05:08 PM UTC+5:30, Jeff Breidenbach wrote: For me the errors came from some debris in the training directory, and make clean in that directory took

[tesseract-ocr] Re: Tessearct 3.03 timeline (especially Windows)

2014-08-01 Thread shree
see http://vorba.ch/2014/tesseract-3.03-vs2013.html How to build Tesseract 3.03 with Visual Studio 2013 by Paul Vorbach, 2014-04-10 On Friday, May 2, 2014 2:48:35 PM UTC+5:30, Michael wrote: Hi altogether, is there a timeline for 3.03 for Windows resp. the Visual Studio project? All

[tesseract-ocr] Makefile:372: recipe for target 'all' failed - using current version with leptonica 1.71 on cygwin

2014-08-21 Thread shree
hi, I tried compiling the current version downloaded using git with leptonica 1.71 on cygwin and am getting the following error. Please let me know what I need to do to fix this. Thanks, - User@HP /opt/src $ git clone https://code.google.com/p/tesseract-ocr/ Cloning into

[tesseract-ocr] Re: Tesseract compilation on code blocks (gcc + mingw)

2014-08-21 Thread shree
zdenko, the current problem also seems related to strtok_r please see http://stackoverflow.com/questions/12973750/fatal-error-strtok-r-h-no-such-file-or-directory-while-compiling-tesseract-oc http://sourceforge.net/p/mingw/feature-requests/64/ On Tuesday, March 22, 2011 2:06:58 PM UTC+5:30,

Re: [tesseract-ocr] Makefile:372: recipe for target 'all' failed - using current version with leptonica 1.71 on cygwin

2014-08-22 Thread shree
/google/tesseract-ocr/vs2010/port/strtok_r.h Would it also be available during compile under mingw/msys or cygwin? On Thursday, August 21, 2014 6:39:41 PM UTC+5:30, Nick White wrote: On Thu, Aug 21, 2014 at 01:41:23PM +0530, Shree Devi Kumar wrote: Hi Zdenko, ./ confusing for me

Re: [tesseract-ocr] Makefile:372: recipe for target 'all' failed - using current version with leptonica 1.71 on cygwin

2014-08-23 Thread shree
://code.google.com/p/tesseract-ocr/issues/detail?id=1288&start=100 https://code.google.com/p/tesseract-ocr/issues/detail?id=1289&start=100 On Friday, August 22, 2014 5:51:02 PM UTC+5:30, shree wrote: Thanks for the explanation regarding ./, Nick. I get it, now. The actual error during compile that I got

[tesseract-ocr] 3.04

2014-08-25 Thread Shree
want to add a tag to git so that it is easily identifiable rather than the rev identifier such as *Revision:* *298e31465a44* Thanks. Shree -- You received this message because you are subscribed to the Google Groups tesseract-ocr group. To unsubscribe from this group and stop receiving emails from

[tesseract-ocr] Re: [tesseract-dev] Re: tesseract 3.04 can be downloaded as a package for msys2 (will work on windows)

2014-08-26 Thread shree
/tesseract-ocr/issues/list]) FYI, training tools did compile under msys2 on windows8. Thanks, Shree On Tuesday, August 26, 2014 4:39:47 PM UTC+5:30, zdenop wrote: Please stop with this releases!!! 3.04 was not released! We are skipping 3.03 release because some people decided to spread 3.03

[tesseract-ocr] Re: I got some error during training regular(?) box-tiff data. [tesseract 2.04 version]

2014-08-31 Thread shree
2.04 is an old version of tesseract. You should download a newer version. What platform/operating system are you using? On Thursday, August 28, 2014 12:05:43 PM UTC+5:30, Choi wrote: I found that the problem character is only %. And I modified mftraining source code that if protos are

[tesseract-ocr] Re: [tesseract-dev] Re: tesseract 3.04 can be downloaded as a package for msys2 (will work on windows)

2014-08-31 Thread Shree
project (there are reasons why there is no new release) that we should remove public tesseract repository. Zdenko On Wed, Aug 27, 2014 at 3:46 AM, shree shree...@gmail.com wrote: Zdenko, Sorry it was not meant to be a 'release' of 3.04, I just wanted to get the latest code

[tesseract-ocr] Re: compile error under ubuntu 14.04

2014-09-09 Thread shree
Also filed as an issue with additional information and log files: https://code.google.com/p/tesseract-ocr/issues/detail?id=1307&start=100

[tesseract-ocr] Re: hindi data missing characters

2014-09-24 Thread shree
I have uploaded my traineddata files for hindi and sanskrit at https://code.google.com/r/shreeshrii-tessdata/ I have also added traineddata file for iast - sanskrit transliteration using english alphabet On Friday, September 7, 2012 9:01:09 AM UTC+5:30, cc wrote: Hi, I exploded the

Re: [tesseract-ocr] Re: [Clarification request] Is it possible to let Tesseract generate three output files i) text ii) hOCR iii) PDF in a *single* run ?

2014-09-27 Thread shree
17, 2014 at 4:08 AM, Shree Devi Kumar shree...@gmail.com wrote: Quan, Can it also be done in commandline version? Shree Shree Devi Kumar भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Wed, Sep 17, 2014 at 7

[tesseract-ocr] Re: Reading Device labels to get model number

2014-11-13 Thread shree
also take a look at the pre-processing method mentioned at https://github.com/tleyden/open-ocr/wiki/Stroke-Width-Transform-In-Action On Thursday, November 13, 2014 3:30:03 AM UTC+5:30, Bill Garrison wrote: So if someone sends in labels like the attached ones, I need to grab the model number.

Re: [tesseract-ocr] Covering ASCII Extended range.

2014-11-18 Thread shree
0 = Orientation and script detection (OSD) only. 1 = Automatic page segmentation with OSD. 2 = Automatic page segmentation, but no OSD, or OCR. 3 = Fully automatic page segmentation, but no OSD. (Default) See whether using OSD to detect the script helps you choose the correct traineddata.

Re: [tesseract-ocr] building training tools on cygwin

2016-04-03 Thread shree
Marco, Thanks for the patches. I wasn't able to build with dev. Please provide the patches as a pull request for the project, when you build it next time. Thanks. On Tuesday, March 29, 2016 at 11:42:01 PM UTC+5:30, marco atzeri wrote: > > > > > > > This is with source as of latest commit: >

[tesseract-ocr] Can not open the input file[Ubuntu]

2017-02-18 Thread shree
Your input file needs to be in your PATH, not alongside the tesseract traineddata files. Check that your file can be found using ls filename. Use the full path to the input file, if required.

[tesseract-ocr] Re: Tesseract Installation

2017-04-11 Thread shree
On Tuesday, April 11, 2017 at 4:10:26 PM UTC+5:30, Ibr wrote: > > > Note: I'm using windows 10 bash > I use it too, but via mobaxterm, which makes it easier to use. See http://mobaxterm.mobatek.net/download-home-edition.html

Re: [tesseract-ocr] Re: Standalone Self-contained Tesseract-OCR for Mac

2017-04-20 Thread shree
> > Glad you got it to work. > > I have added issue with link to this discussion at https://github.com/tesseract-ocr/tesseract/issues/830

Re: [tesseract-ocr] Re: Standalone Self-contained Tesseract-OCR for Mac

2017-04-23 Thread shree
Hi Peter, Stefan Weil has made changes to the 3.05 branch to address this issue. Please give it a try using the latest commit and preferably provide your feedback in the issue tracker where I have added this.

[tesseract-ocr] Re: Difference trained data for Chinese

2017-08-11 Thread shree
Please see https://github.com/tesseract-ocr/tessdata/issues/72 On Friday, August 11, 2017 at 2:26:55 PM UTC+5:30, Yang Yu wrote: > > Good day! > > Recently I was using tesseract (4.0 alpha) to do Chinese OCR and it works > really great. Now I want to pick up a best model to use but I find

[tesseract-ocr] Re: Recognition of trademark symbol

2017-07-24 Thread shree
Martin, Please test again with the latest code from github. Ray has posted a fix for this. See https://github.com/tesseract-ocr/tesseract/commit/b0ead95d64a3667339775b2f99ac37e97e90c2a0 On Monday, March 13, 2017 at 9:33:59 PM UTC+5:30, Martin Fadrhons wrote: > > Hi, > > I was trying to train

[tesseract-ocr] Re: need add around 1000 char in to tesseract traineddata. can use Fine Tuning for Impact?

2017-08-07 Thread shree
Please see https://github.com/tesseract-ocr/langdata/issues/81 and respond there. On Monday, August 7, 2017 at 12:38:08 PM UTC+5:30, Hoang Vu wrote: > > Hi guys! > I'm try to add around 1000 char to my japanese trainneddata. > have a new feature like in here : > >

[tesseract-ocr] Re: unicharset_extractor extracting zero values

2017-06-19 Thread shree
See https://github.com/tesseract-ocr/tesseract/issues/318 regarding the unicharset format I was able to do regular tesseract training (not lstm) using tesseract 4.00.00 version from github master and create new unicharset and traineddata with your box/tiff pair. The output on the same tiff file

Re: [tesseract-ocr] Re: unicharset_extractor extracting zero values

2017-06-21 Thread shree
On Tuesday, June 20, 2017 at 9:09:53 PM UTC+5:30, shree wrote: > > I got the same error building 3.05.01 and have filed it as an issue - > https://github.com/tesseract-ocr/tesseract/issues/1000 > This has been fixed by @stweil via https://github.com/tesseract-ocr/tesseract/pull/

Re: [tesseract-ocr] Re: Need help training Simplified Chinese.

2017-06-25 Thread shree
See https://github.com/tesseract-ocr/tesseract/pull/515 for when this option was implemented (after the https://github.com/tesseract-ocr/tesseract/releases/tag/4.00.00alpha ) You should install using the latest code on github. On Sunday, June 25, 2017 at 8:23:34 PM UTC+5:30, shree wrote

Re: [tesseract-ocr] Re: unicharset_extractor extracting zero values

2017-06-20 Thread shree
I got the same error building 3.05.01 and have filed it as an issue - https://github.com/tesseract-ocr/tesseract/issues/1000

Re: [tesseract-ocr] Re: Standalone Self-contained Tesseract-OCR for Mac

2017-05-01 Thread shree
Peter, Please see Stefan's reply at https://github.com/tesseract-ocr/tesseract/issues/830#issuecomment-298298139 On Sunday, April 30, 2017 at 8:53:57 PM UTC+5:30, Peter Reid wrote: > > Hi Shree > > Sorry for the delay in replying but I'm struggling to get a successful >

[tesseract-ocr] Re: train a new font for language of persian

2017-05-05 Thread shree
There is already farsi/persian traineddata for tesseract-ocr 4.0-alpha at https://github.com/tesseract-ocr/tessdata/raw/master/fas.traineddata Have you given it a try? Which font do you want to add to it? On Thursday, May 4, 2017 at 6:06:09 PM UTC+5:30, Ava Nimaee wrote: > > hi every one. i

[tesseract-ocr] Re: ERROR: /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.unicharset does not exist or is not readable

2017-09-16 Thread shree
https://github.com/tesseract-ocr/tesseract/pull/1134/files should fix it. On Thursday, September 14, 2017 at 1:50:26 PM UTC+5:30, roberty...@gmail.com wrote: > > Hello, > > I'm trying to train my traineddata model with Tess4.0, following the > commands in the* TrainingTesseract 4.00

Re: [tesseract-ocr] ERROR: /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.unicharset does not exist or is not readable

2017-09-15 Thread shree
On Thursday, September 14, 2017 at 2:17:27 PM UTC+5:30, roberty...@gmail.com wrote: > > Shree, thanks for your reply. > > > But I have another problem in the project which needs your helpness: > > Some italicized characters in my data need to be identified, but these >

[tesseract-ocr] Re: ERROR: /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.unicharset does not exist or is not readable

2017-09-17 Thread shree
On Saturday, September 16, 2017 at 2:22:47 PM UTC+5:30, shree wrote: > > https://github.com/tesseract-ocr/tesseract/pull/1134/files > should fix it. > > >> Sorry, that is not the correct fix.

[tesseract-ocr] Re: Tesseract OCR 4.0.0 Alpha how to train a new font

2017-09-05 Thread shree
Try san_latn.traineddata from https://github.com/Shreeshrii/tessdata4alpha/tree/master/best On Tuesday, August 29, 2017 at 12:19:10 PM UTC+5:30, Anand Akella wrote: > > Hi, > Im new to tesseract and have a pdf

[tesseract-ocr] Re: How to use tesseract4.0 to only recognize the digits??

2017-10-03 Thread shree
You can try the plus-minus type of training if you just want a digits type of traineddata. Your training_text can contain numbers in the format you need and you can train with a font matching your images. For proof of concept you can try my experimental version at

[tesseract-ocr] Re: Trying to add chars to tesseract 4.0

2017-12-15 Thread shree
On Friday, December 8, 2017 at 5:46:01 PM UTC+5:30, Fahad Al-Saidi wrote: > > > I have the same problem, why not the new fine tuned traineddata include > the old wordlist? It suppose to do so. I followed the instructions in the > wiki but I got the same issue. Any help? > If you want the

[tesseract-ocr] Re: Problem reading text in two columns

2018-05-09 Thread shree
> > Please try by building the latest version of tesseract from github > or install from links given in https://github.com/tesseract-ocr/tesseract/wiki I get the following output using the default eng.traineddata from the three repos - tessdata, tessdata_best, tessdata_fast, without any

[tesseract-ocr] Re: Tesseract couldn't load any languages!

2018-05-17 Thread shree
It is possible that you have not downloaded eng.traineddata or it is in a different location. Try running tesseract on command line, check --list-langs. On Friday, May 18, 2018 at 9:27:59 AM UTC+5:30, Dattatraya Tembare wrote: > > > *[SOLVED] changed the language from 'hin+eng' to 'hin'In this

Re: [tesseract-ocr] Re: Training Tesseract4.0 (LSTM) on word level bounding boxes

2018-05-22 Thread shree
On Wednesday, May 23, 2018 at 8:45:13 AM UTC+5:30, nick wrote: > > hi > how can we train the tesseract 4 beta, with our lines dataset? > See https://github.com/OCR-D/ocrd-train

[tesseract-ocr] Re: German "Straße" is often "StraBe" (tesseract 4.0)

2018-05-24 Thread shree
Please try with script/Latin traineddata to see if you get better results. I have added your comment to issue at https://github.com/tesseract-ocr/langdata/pull/54 On Thursday, May 24, 2018 at 5:05:55 PM UTC+5:30, Thomas Güttler wrote: > > I use tesseract 4.0 via docker

[tesseract-ocr] Re: Training error "Couldn't find a matching blob"

2018-05-31 Thread shree
This has been an issue for long. Thanks for finding the problem. Please submit a PR on github. On Friday, June 1, 2018 at 1:55:25 AM UTC+5:30, Paul Kitchen wrote: > > After a lot of stepping through tesseract code, I found the problem. > > 1) In file coutln.cpp, function

Re: [tesseract-ocr] Help for training tesseract to recognize a new (dead) language

2018-06-01 Thread shree
Please see https://github.com/Shreeshrii/tessdata_coptic for the traineddata files. On Friday, June 1, 2018 at 10:45:11 PM UTC+5:30, Ramast wrote: > > Impressive! I thought we would need to do a lot of work in order to reach > that stage. > > > ⲁⲩⲱ ⲟⲛ ⲁⲓ̈ⲧⲣⲉⲩ ⲣ̄ ⲥⲟⲟⲩ ⲛ̄ ⲉⲃⲟⲧ ⲉⲩⲕⲏⲧ ⲉ ϩⲃⲟⲩⲣ >

[tesseract-ocr] Re: Problem compiling tesseract 4.0 on macOS

2018-06-04 Thread shree
please see https://github.com/tesseract-ocr/tesseract/issues/1028#issuecomment-394415918 On Monday, June 4, 2018 at 7:33:19 PM UTC+5:30, Ning Zhao wrote: > > I'm trying to install tesseract 4.0 on macOS High Sierra from source > following

[tesseract-ocr] Re: hocr bbox set to 0,0,xmax,ymax

2018-04-26 Thread shree
Try 4.0.0-beta and see if you get same results. If the problem persists, please post an issue along with a sample image. I have linked this from https://github.com/tesseract-ocr/tesseract/issues/538 On Thursday, April 26, 2018 at 12:17:09 PM UTC+5:30, Sreenath BH wrote: > > Hi > We are using

Re: [tesseract-ocr] tesseract 4 beta: openCL useage

2018-04-29 Thread shree
> > Please see https://github.com/tesseract-ocr/tesseract/issues/837 > This discussion is better held there.

Re: [tesseract-ocr] tesseract performs wrong auto-correction sometimes : how to disable it?

2018-04-30 Thread shree
Added to issue on GitHub https://github.com/tesseract-ocr/tesseract/issues/733 On Thursday, April 26, 2018 at 1:35:30 PM UTC+5:30, Youcef wrote: > > > I'm using master branch with tessdata_fast models > > On Wednesday, April 25, 2018 at 18:49:22 UTC+2, shree wrote: > >> Wh

Re: [tesseract-ocr] Assert failed:in file weightmatrix.cpp, line 244

2018-07-24 Thread shree
> > * --continue_from >> >> /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/tessdata/eng.lstm >> >> \* >> * --old_traineddata >> /home/tulipan1637/Documents/Emiliano/OCR/OCRtraining/ocrd-train/tessdata/eng.traineddata >> >> \* >> > Use eng.traineddata from tessdata_best

[tesseract-ocr] Re: Hindi language version not working. VietOCR.NET-4.5_64

2018-03-01 Thread shree
Use latest version vietocr https://sourceforge.net/projects/vietocr/files/vietocr/5.0alpha/ or https://sourceforge.net/projects/vietocr/files/vietocr.net/5.0alpha/ Use vietocr to download the traineddata from https://github.com/tesseract-ocr/tessdata_fast On Thursday, March 1, 2018 at

[tesseract-ocr] Re: Creating a new language pack for Javanese Script

2018-04-23 Thread shree
Please see https://github.com/tesseract-ocr/langdata/issues/126 Replying there. On Monday, April 23, 2018 at 2:16:06 AM UTC+5:30, Christopher Imantaka Halim wrote: > > Hi, > > I want to develop an OCR for Javanese Script / Aksara. > https://en.wikipedia.org/wiki/Javanese_script > > Plan on

[tesseract-ocr] Re: Install Tesseract 4 on CentOS and Red Hat [SOLVED!]

2018-04-25 Thread shree
Thanks for the rpm package, Alex. I have added the info to https://github.com/tesseract-ocr/tesseract/wiki On Tuesday, April 24, 2018 at 10:04:55 PM UTC+5:30, Александр Поздняков wrote: > > Hi. I compiled an rpm package with tesseract-ocr for CentOS, Fedora, > ScientificLinux, OpenSuse. It

Re: [tesseract-ocr] Any suggestions for more accurate Text conversion?

2018-03-28 Thread shree
Yes, for 4.0 you can try finetune training. You can download license plate specific fonts to easily make training data.

[tesseract-ocr] Re: lstmtraining command line related

2018-03-28 Thread shree
PLEASE DO NOT SHOUT - Sending messages in Large fontsize, RED color etc is not appreciated. You have used a 0-zero instead of a CAPITAL O in your network spec, it should be O1c105 On Wednesday, March 28, 2018 at 12:24:02 PM UTC+5:30, notorio...@gmail.com wrote: > > > > *Invalid network
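
A quick way to catch this kind of typo before starting lstmtraining is to scan the network spec for an output layer typed with a zero. This is only a sketch: the helper name and the sample spec are made up for illustration, and it assumes the output layer looks like O1c105 (capital O, one digit, one lowercase letter, class count).

```python
import re
from typing import List

# Assumed shape of a mistyped output layer: a zero where the capital O
# should be, followed by a digit, a lowercase letter and the class count.
_ZERO_FOR_O = re.compile(r"^0(\d[a-z]\d+)$")


def find_zero_for_o(net_spec: str) -> List[str]:
    """Return the corrected token for every output layer typed with a zero."""
    tokens = net_spec.strip("[]").split()
    fixes = []
    for tok in tokens:
        match = _ZERO_FOR_O.match(tok)
        if match:
            # Swap the leading zero for a capital O, keep the rest as typed.
            fixes.append("O" + match.group(1))
    return fixes


# Illustrative spec only, not the one from the original post.
print(find_zero_for_o("[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 01c105]"))
```

An empty list means no zero-for-O token was found; it does not validate the rest of the spec.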

[tesseract-ocr] Re: Train License plate alphabet for tesseract 4

2018-03-29 Thread shree
There are existing license plates fonts, you could use those also for training See http://www.fontspace.com/category/license%20plate On Thursday, March 8, 2018 at 2:40:35 AM UTC+5:30, Diego Menescal wrote: > > Hello everybody, > > I am starting to get along with tesseract, and when i saw that

Re: [tesseract-ocr] Tesseract convert image to gibberish

2018-03-03 Thread shree
Sure, if you are comfortable building software on Linux. You have to make sure you have all the dependencies etc.

[tesseract-ocr] Re: Server performance is 3x as slow versus local machine

2018-10-17 Thread shree
Added to issue at https://github.com/tesseract-ocr/tesseract/issues/1278#issuecomment-430827712

[tesseract-ocr] Server performance is 3x as slow versus local machine

2018-10-18 Thread shree
Reply by @stweil in issue tracker. Please continue further discussion there. It looks like the local machine is rather new hardware, while the server is older. So it could be AVX / SSE none at all. The user can run tesseract --version on both machines to see whether SSE and AVX are found.
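
To compare the two machines, the saved output of tesseract --version from each can be checked for the "Found ..." lines. A minimal sketch, assuming the output contains lines such as "Found AVX" and "Found SSE" (as in other posts on this list); the function name is hypothetical and the text is expected to be captured by the caller.

```python
# Only names starting with these prefixes are treated as SIMD features;
# other "Found ..." lines (for example libarchive) are ignored.
SIMD_PREFIXES = ("AVX", "SSE")


def simd_found(version_text):
    """Return the set of SIMD feature names reported on 'Found ...' lines."""
    found = set()
    for line in version_text.splitlines():
        words = line.split()
        if len(words) >= 2 and words[0] == "Found" and words[1].startswith(SIMD_PREFIXES):
            found.add(words[1])
    return found


local = simd_found(" Found AVX\n Found SSE\n Found libarchive 3.4.0\n")
server = simd_found(" Found SSE\n")
# Features present on the local machine but missing on the server.
print(sorted(local - server))
```

A non-empty difference would be consistent with the older server lacking the faster code path, but it is not proof that this explains the slowdown.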

[tesseract-ocr] Re: Install Tesseract 4 on CentOS and Red Hat [SOLVED!]

2018-10-25 Thread shree
ur stuff; it is very extensive. If all the >> installations work, this should be front-paged! I have never used openSUSE. >> Could you point me to some resources to figure out how use your >> installation packages? >> >> >> @shree >> Thanks for t

Re: [tesseract-ocr] How to restrict OCR character set.

2019-03-31 Thread shree
March 29, 2019 at 11:12:44 PM UTC-7, shree wrote: >> >> This was finetuned with 20+ monospaced fonts for 400 iterations to error >> rate of 0.242%. >> >> At iteration 44/400/400, Mean rms=0.258%, delta=0.076%, char >> train=0.242%, word train=0.761%, skip rat
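
When following a training run, it can help to pull the numbers out of progress lines like the one quoted above. The sketch below assumes the line layout shown in that snippet (with the mail quote markers removed); the function name and dictionary keys are my own, and the three iteration counters are returned as-is without interpretation.

```python
import re

# Pattern for the progress line layout quoted in the post above.
_PROGRESS = re.compile(
    r"At iteration (\d+)/(\d+)/(\d+), Mean rms=([\d.]+)%, delta=([\d.]+)%, "
    r"char train=([\d.]+)%, word train=([\d.]+)%"
)


def parse_progress(line):
    """Return the figures from one progress line, or None if it does not match."""
    match = _PROGRESS.search(line)
    if match is None:
        return None
    iterations = tuple(int(x) for x in match.group(1, 2, 3))
    rms, delta, char_train, word_train = (float(x) for x in match.group(4, 5, 6, 7))
    return {
        "iterations": iterations,
        "rms": rms,
        "delta": delta,
        "char_train": char_train,
        "word_train": word_train,
    }


line = ("At iteration 44/400/400, Mean rms=0.258%, delta=0.076%, "
        "char train=0.242%, word train=0.761%")
print(parse_progress(line))
```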

Re: [tesseract-ocr] Trainning tesseract for a new language from scratch that does not exist in Tesseract

2019-03-31 Thread shree
On Sunday, March 31, 2019 at 9:55:23 AM UTC+5:30, haru...@gmail.com wrote: > > > Ok. thanks. > > Could you guide me on how to train in tesseract 4? > > >> > See https://groups.google.com/forum/#!topic/tesseract-ocr/Rz0J27855hQ

Re: [tesseract-ocr] Training Tesseract 4 from Scratch

2019-04-07 Thread shree
atch \ *--pass_through_recoder \* --lang foo On Thursday, April 4, 2019 at 7:10:04 PM UTC+5:30, shree wrote: > > #=== CHECK THAT TESSERACT AND TRAINING TOOLS ARE INSTALLED > > tesseract -v > text2image -v > unicharset_extractor -v > set_unicharset_properties -v > combine_lan

Re: [tesseract-ocr] Re: tesseract 4 box files format

2019-02-28 Thread shree
> > https://github.com/tesseract-ocr/tesseract/pull/2231 implements the > Wordstr box file option. > These box files are for each textline and can be easily edited for non-RTL languages. example usage to create box files for english language images p001.png to p015.png for i in $(seq -f

Re: [tesseract-ocr] Making custom traineddata

2019-04-09 Thread shree
Correction: fast version is *ocrb_int (not ocrb-int).*

Re: [tesseract-ocr] Making custom traineddata

2019-04-09 Thread shree
//github.com/tesseract-ocr/tessdata_contrib On Monday, April 8, 2019 at 10:45:29 PM UTC+5:30, shree wrote: > > If you can provide another 40-50 lines of training data (text file) I will > rerun the training > > > On Mon, 8 Apr 2019, 22:11 Jankees Korstanje, wrote: > >

[tesseract-ocr] Re: I have a problem with the current tesseract

2019-06-15 Thread shree
In case that file doesn't work with tesseract4, you can try MICR.traineddata from https://github.com/Shreeshrii/tessdata_MICR/tree/master/MICR-legacy I was able to OCR the three images posted earlier in this thread: ⑈000144⑈ 400756051⑆ 23⑈ 11 ⑈069565⑈ 364013051⑆ 79⑈ 11 ⑈420360⑈ 36400

[tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread shree
combine_tessdata -u new.traineddata new. will unpack the traineddata file. check new.lstm-unicharset in it On Monday, June 17, 2019 at 8:20:24 PM UTC+5:30, Jingjing Lin wrote: > > I tried to fine tune the model and add a new character via training, but > it seems it still couldn't recognize
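
Once new.lstm-unicharset is unpacked, checking for the new character can be scripted. This is a rough sketch, assuming the usual text layout where the first line holds the entry count and every following line starts with the symbol and then its properties; the function name and the sample content are illustrative only.

```python
def unicharset_symbols(text):
    """Return the symbols listed in the text of an unpacked unicharset file."""
    lines = text.splitlines()
    # Assumption: the first line is the number of entries that follow.
    count = int(lines[0].split()[0])
    # The symbol is taken as everything before the first space on each line.
    return [ln.split(" ", 1)[0] for ln in lines[1:1 + count] if ln]


# Made-up sample, shortened; real entries carry more property fields.
sample = "3\nNULL 0\na 3\n€ 0\n"
symbols = unicharset_symbols(sample)
print("€" in symbols, "z" in symbols)
```

For a real file, read it with encoding="utf-8" and pass the text in. If the character is missing from the list, the training run did not add it to the model's unicharset.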

[tesseract-ocr] Re: lstmeval shows good result but visualized result looks bad

2019-06-17 Thread shree
Your files have prefix of jpn, so I assume you are training for Japanese, but the image in question has only numbers in it. Getting good results on eval data but bad results on OCR could be the result of overfitting the model, if you have used a small sample and trained for large number of

Re: [tesseract-ocr] Choice Iterator only shows one choice for each character

2019-07-04 Thread shree
11:55:02 UTC+2, shree wrote: >> >> Take a look at >> https://github.com/tesseract-ocr/tesseract/blob/ab09b09da66f458002f01d0bc4ffeee8eff58f6e/src/ccmain/tesseractclass.cpp#L524 >> >> On Mon, Jul 1, 2019 at 2:45 PM Jochen Naumann >> wrote: >>

Re: [tesseract-ocr] How to train tesseract with new script?

2019-06-30 Thread shree
ing an error "Failed to load the language". > If possible kindly share your language data. > > Thanks for your cooperation.. > > On Mon, Apr 8, 2019 at 10:29 AM Shree Devi Kumar > wrote: > >> Tesseract 4 LSTM training is done using tesseract, not tensowf

Re: [tesseract-ocr] how to train tesseract to detect superscripts and subscripts

2019-07-14 Thread shree
You can try training from scratch. Use training text and font similar to what you need to recognize. Alternately, try ocrd-train with line images with ground truth.

[tesseract-ocr] Re: Error: Deserialize header failed while fine-tuning Tesseract

2019-09-03 Thread shree
Your box files shows Windows CRLF rather than Unix LF. Try opening in notepad++ and check. On Tuesday, September 3, 2019 at 5:10:28 PM UTC+5:30, Pranav Budhwant wrote: > > I tried the same with Tesseract 4.1, and I generated all the files on > Ubuntu instead of creating them on Windows and then
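
If notepad++ is not at hand, the line endings can also be checked and converted with a few lines of Python. A minimal sketch: the helper names are hypothetical, and the sample bytes are an illustrative box line, not taken from the poster's files.

```python
def has_crlf(data: bytes) -> bool:
    """True if the data contains any Windows CRLF line ending."""
    return b"\r\n" in data


def to_unix_newlines(data: bytes) -> bytes:
    """Replace every CRLF with a plain LF, leaving all other bytes untouched."""
    return data.replace(b"\r\n", b"\n")


# Illustrative box file content with Windows line endings.
box = b"a 10 20 30 40 0\r\nb 35 20 55 40 0\r\n"
print(has_crlf(box), has_crlf(to_unix_newlines(box)))
```

To apply it to a file, read the bytes in binary mode ("rb"), convert, and write them back in binary mode ("wb") so Python does not translate the newlines again.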

Re: [tesseract-ocr] Tutorial for fine-tuning Tesseract 4 for a new font?

2019-09-18 Thread shree
tutorial, so if anyone reads > this and is also interested: > https://www.youtube.com/watch?v=TpD76k2HYms&t=314s > > > On Wed, Sept 18, 2019 at 12:29, Shree Devi Kumar < > shree...@gmail.com > wrote: > >> Please search forum archive >> >> Ther

Re: [tesseract-ocr] Support for alto - option in Tesseract for linux

2019-08-08 Thread shree
I hope other members who use tesseract with python will provide the needed guidance.

Re: [tesseract-ocr] understading lstmeval and use it on pretrained models for comparison

2019-07-21 Thread shree
ssue : even with a lot of iterations (475k) I > still do not see any log message with the error on the evaluation set. > At iteration 61235/475300/475526, Mean rms=0.521%, delta=2.073%, char > train=9.379%, word train=9.669%, skip ratio=0.1%, New worst char error = >

[tesseract-ocr] Re: Recognizing blurred dots as CJK characters

2019-11-26 Thread shree
tesseract image001.png - --psm 0 Warning: Invalid resolution 0 dpi. Using 70 instead. Estimating resolution as 625 Warning. Invalid resolution 0 dpi. Using 70 instead. Page number: 0 Orientation in degrees: 0 Rotate: 0 Orientation confidence: 5.30 *Script: Latin* Script confidence: 3.64 On

[tesseract-ocr] Re: Adding Modi Script to Tesseract

2020-01-30 Thread shree
Please see https://github.com/Shreeshrii/tesstrain-modi for finetune training for Modi from Marathi using synthetic training data in 2 unicode fonts. However since Modi documents are mostly handwritten in cursive style, the training should preferably be done using images. On Sunday, January

[tesseract-ocr] Re: Announcement: Python package pytesstrain (Tesseract training helpers)

2020-01-28 Thread shree
code points)) and found that `ocrevalutf8 accuracy` does not work well for it. Any suggestions ... Shree On Sunday, January 5, 2020 at 2:22:50 AM UTC+5:30, Wincent Balin wrote: > > Hi all, > > I would like to announce pytesstrain, a collection of Tesseract training >

Re: [tesseract-ocr] TrainingTesseract 4.00

2019-12-30 Thread shree
Jaspreet, Please see https://github.com/Shreeshrii/tesstrain-hindi-impact for a *Demo of Tesseract4 Finetune for Impact training for Hindi using images and their GT transcription * using the images and GT that you have sent. On Monday, December 9, 2019 at 3:29:10 PM UTC+5:30, shree wrote

Re: [tesseract-ocr] Re: Failed loading language 'eng'

2020-03-11 Thread shree
I suggest you file an issue with Sikulix Also see https://github.com/RaiMan/SikuliX1/issues/246 On Wednesday, March 11, 2020 at 10:04:40 PM UTC+5:30, Jeremiah wrote: > > So I did download the latest version of the trained data file and tried > but it didn't work. In the actual Java code a

Re: [tesseract-ocr] OMP_THREAD_LIMIT=1 gives improvement in 4.1 version

2020-10-01 Thread shree
Related discussion at https://github.com/tesseract-ocr/tesseract/issues/3109

Re: [tesseract-ocr] add new characters

2020-10-27 Thread shree
>> Found AVX >> Found SSE >> Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 >> liblz4/1.9.2 libzstd/1.4.4 >> >> Many thanks again for your fast help >> >> On Saturday, October 24, 2020 at 3:12:15 PM UTC+2 shree wrote: >> >>

Re: [tesseract-ocr] Training tessract 4.0 using images?

2020-06-16 Thread shree
To those who come across this old thread: Training from single line images and their groundtruth is now possible using the makefile in tesstrain repo. https://stackoverflow.com/questions/43352918/how-do-i-train-tesseract-4-with-image-data-instead-of-a-font-file The above link has a good

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-11 Thread shree
ext but I want to pass in my > own unicharset file. > On Friday, January 8, 2021 at 12:58:27 AM UTC-6 shree wrote: > >> Are any of these vertical fonts? >> >> Encoding errors could be if the characters in training text are not in >> the unicharset.

Re: [tesseract-ocr] Diacriticals Training

2020-12-03 Thread shree
he last command which starts training, change the TESSDATA directory to point to wherever you have the tessdata_best/san.traineddata model. On Monday, November 30, 2020 at 8:55:54 PM UTC+5:30 advoca...@gmail.com wrote: > Shree I have gone through it, but I might need proper workflow to

Re: [tesseract-ocr] Diacriticals Training

2020-12-11 Thread shree
nks in advance to all those training Tesseract along these lines. > > Greg > > On Thursday, December 3, 2020 at 3:16:19 AM UTC-10 shree wrote: > >> 1. git clone https://github.com/Shreeshrii/tesstrain-sanPlusMinus >> 2. cd tesstrain-sanPlusMinus >> 3. nohup make train

Re: [tesseract-ocr] Japanese - Problems with vertical words

2021-01-08 Thread shree
ed jpn_vert >> >> https://github.com/zodiac3539/jpn_vert >> >> >> On Mon, Jun 3, 2019 at 11:31 AM Shree Devi Kumar >> wrote: >> >>> tesseract 4 has been trained on line images and hence gives better >>> results for lines, as far a

[tesseract-ocr] Re: advice for OCR'ing 9-pin dot matrix BASIC code

2021-01-01 Thread shree
Please see old thread at https://groups.google.com/g/tesseract-ocr/c/ApM_TqwV7aE/m/z5jZV0I0AgAJ for link to a completed project for dot matrix On Monday, December 14, 2020 at 12:11:00 PM UTC+5:30 Keith M wrote: > Hi there, > > I've been circling a problem with OCR'ing 90-pages of 30 year old

[tesseract-ocr] Re: Tesseract v 5.0 on Linux

2021-01-02 Thread shree
In case you can install debian packages - see https://notesalexp.org/tesseract-ocr/ On Friday, January 1, 2021 at 12:59:46 AM UTC+5:30 peter.kr...@torch.ai wrote: > > Is there a way to get Tesseract 5.0 on Linux without building it myself? > I'm on Alpine Linux. apk add only gets me 4.0 > >

Re: [tesseract-ocr] Detecting language automatically

2021-03-25 Thread shree
See https://github.com/tesseract-ocr/tessdoc/blob/master/examples/OSD_example.cc //Get OSD - new code int orient_deg; float orient_conf; const char* script_name; float script_conf; api->DetectOrientationScript(&orient_deg, &orient_conf, &script_name, &script_conf); printf("\n Orientation
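The same orientation and script detection can also be tried from the command line before writing any C++. This is a minimal sketch, assuming the tesseract binary and osd.traineddata are installed; the image name is a placeholder. Note that OSD reports the script (for example Latin or Devanagari), not the language itself.

```shell
# Hypothetical input image; replace with your own scan.
IMG="page.png"

detect_osd() {
    # --psm 0 runs orientation and script detection only;
    # "stdout" sends the report to standard output instead of a file.
    tesseract "$1" stdout --psm 0
}

# Run only when both the tool and the image are available.
if command -v tesseract >/dev/null 2>&1 && [ -f "$IMG" ]; then
    detect_osd "$IMG"
else
    echo "tesseract or $IMG not found; skipping OSD"
fi
```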

Re: [tesseract-ocr] Training Tessearct for custom data --Urgent Help Required

2021-03-15 Thread shree
See attached image from a screenshot of Malayalam wiki and the OCRed text using traineddata from tessdata_best, tessdata_fast and tessdata. To me it seems like recognition is 90+% correct. On Sunday, March 14, 2021 at 6:09:17 AM UTC+5:30 shree wrote: > You have not stated the vers

[tesseract-ocr] Re: To make traineddata file non-traineable

2021-02-23 Thread shree
You can create an integer/fast version of traineddata which cannot be used as START_MODEL for further training. `combine_tessdata -c myfile.traineddata` On Monday, February 22, 2021 at 3:58:19 PM UTC+5:30 thiyam...@gmail.com wrote: > Does anyone have any idea about making the traineddata file
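A minimal sketch of that conversion, with placeholder file names. As far as I know, combine_tessdata -c rewrites the file it is given, so the sketch works on a copy and leaves the float model untouched; treat that in-place behaviour as an assumption and check it on your version.

```shell
# Hypothetical file names; replace with your own model.
SRC="myfile.traineddata"        # float model you trained
DST="myfile_int.traineddata"    # integer copy to distribute

make_integer_copy() {
    # Copy first, then compact the LSTM component of the copy to integer.
    cp "$1" "$2" && combine_tessdata -c "$2"
}

# Run only when both the tool and the source model are available.
if command -v combine_tessdata >/dev/null 2>&1 && [ -f "$SRC" ]; then
    make_integer_copy "$SRC" "$DST"
else
    echo "combine_tessdata or $SRC not found; nothing done"
fi
```

Keeping the original float file is what lets you continue training later; only the integer copy is meant for distribution.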

Re: [tesseract-ocr] Re: How to use the "latin sanskrit" language?

2021-02-23 Thread shree
> > Please try the models from https://github.com/Shreeshrii/tesstrain-Sanskrit-IAST >>> >>>

[tesseract-ocr] Re: New release for tessdata_{fast,best}?

2021-02-23 Thread shree
>There is now a 4.1.0 release available for tessdata_fast, tessdata and tessdata_best. See https://github.com/tesseract-ocr/tessdata_fast/issues/26#issuecomment-780127901 @Merlijn Wajer archive.org has many books which use English with diacritics for Sanskrit (IAST). You could try the models

Re: Hindi training data - unicharset_extractor error

2013-04-17 Thread Shree Devi Kumar
Thanks. I did follow the training wiki. However, since Hindi uses Cube mode, it is not possible to train for that. I am trying to train for san (Sanskrit), which uses the same Devanagari script, in non-Cube mode. On Thu, Apr 18, 2013 at 1:34 AM, Sven Pedersen sven.peder...@gmail.com wrote:

Re: Hindi training data - unicharset_extractor error

2013-04-17 Thread Shree Devi Kumar
Thanks, Zdenko! I think it would be helpful to add this to the training pages wiki in the next update. If possible, also add a list of the languages that use the Cube mode. On Thu, Apr 18, 2013 at 3:05 AM, zdenko podobny zde...@gmail.com wrote: I remember one user post, that he

concatenating tr files

2013-04-18 Thread Shree Devi Kumar
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 says: An alternative to multi-page tiffs is to create many single-page tiffs for a single font, and then you must cat together the tr files for each font into several single-font tr files. In any case, the input tr files to

Re: concatenating tr files

2013-04-19 Thread Shree Devi Kumar
Thanks, Zdenko. Will do and post the link here. Shree Devi Kumar भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Thu, Apr 18, 2013 at 11:46 PM, zdenko podobny zde...@gmail.com wrote: post somewhere your files, so we can test

Re: tesseract testing suite

2013-04-19 Thread Shree Devi Kumar
On Thu, Apr 18, 2013 at 11:02 PM, Nick White nick.wh...@durham.ac.uk wrote: Hi Shree, I'm glad you found my article helpful. Apologies for the delay in my reply to you. I'll answer your questions below. Thanks, Nick! I have found that trying to improve recognition by adding more
