Re: [tesseract-ocr] How to regenerate the training text

2017-06-15 Thread ShreeDevi Kumar
>Where are these scripts, or how can I otherwise generate training text from dictionary/corpus data? These are (most probably) internal scripts at Google which have not been open sourced. Please see

Re: [tesseract-ocr] traineddata file size too small, error clue ?

2017-06-14 Thread ShreeDevi Kumar
Traineddata size will depend on many things, not just number of images. If your unicharset and number of fonts haven't changed, then the size may be similar. Traineddata file also has the wordlists in it, so if you are using a smaller wordlist compared to the one in original eng.traineddata, size

Re: [tesseract-ocr] oem Detection

2017-06-14 Thread ShreeDevi Kumar
check that the file is there ls -l */home/ibr/tesstutorial/impact_from_full/jpn.lstm* ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Wed, Jun 14, 2017 at 7:28 PM, Ibr wrote: > yes I already

Re: [tesseract-ocr] Re: Font List

2017-06-14 Thread ShreeDevi Kumar
> what is the difference between the engtrain and engeval? It will depend on what fonts and training text you use for each. one is used for training, the other is for evaluation of the training. ShreeDevi भजन - कीर्तन - आरती @

Re: [tesseract-ocr] oem Detection

2017-06-14 Thread ShreeDevi Kumar
You need to extract .lstm from traineddata eg. (change foldernames to match ur setup) combine_tessdata -e ../tessdata/jpn.traineddata jpn.lstm Extracting tessdata components from ../tessdata/jpn.traineddata Wrote jpn.lstm 0:config:size=2573, offset=168 1:unicharset:size=280627, offset=2741

Re: [tesseract-ocr] oem Detection

2017-06-13 Thread ShreeDevi Kumar
combine_tessdata -e extracts the lstm file from the traineddata provided from original training by google. - tesstrain.sh it will create .lstmf files yes. these are created from the box-tiff pairs created from the training text and fonts ---

Re: [tesseract-ocr] oem Detection

2017-06-13 Thread ShreeDevi Kumar
you have to be clear on what files you are combining. the command you have given is overwriting japanese traineddata - is that what you want to do? > *training/combine_tessdata -o tessdata/jpn.traineddata* *Look at help for all options of combine_tessdata* *Figure out which files (lstm, dawg

Re: [tesseract-ocr] oem Detection

2017-06-13 Thread ShreeDevi Kumar
*tesseract image results -l ara --tessdata-dir ./tessdata --oem 1* *uses the LSTM files that are there in ara.traineddata in your tessdata directory.* *Just placing lstm files in tesseract folder is not going to change anything.* *You need to create a new traineddata with the new lstm files and

Re: [tesseract-ocr] Tesseract on Bitmap images giving error - Error: "Failed to create pix, this normally occurs because...

2017-06-12 Thread ShreeDevi Kumar
Hari, Please also look in the leptonica program directory for pdf2tiff pdf2mtiff etc -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to

Re: [tesseract-ocr] Tesseract on Bitmap images giving error - Error: "Failed to create pix, this normally occurs because...

2017-06-12 Thread ShreeDevi Kumar
>Pix pix = b.Convert(bitmap); > > This is not leptonica code. It shouldn't compile, with b being a ptr > that is dereferenced with a ".". This is then set equal to a pix which is > (as written) not a ptr either, causing a copy if it were correct. > > > On Mon, Jun 12, 201

Re: [tesseract-ocr] Detect Multiple Images by Command Line

2017-06-12 Thread ShreeDevi Kumar
see https://github.com/tesseract-ocr/tesseract/issues/928 ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Mon, Jun 12, 2017 at 3:58 PM, Ibr wrote: > Hi, > > When I want to detect an image on

Re: [tesseract-ocr] Tesseract on Bitmap images giving error - Error: "Failed to create pix, this normally occurs because...

2017-06-12 Thread ShreeDevi Kumar
image processing within tesseract is done by leptonica. https://github.com/DanBloomberg/leptonica + dan bloomberg ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Mon, Jun 12, 2017 at 11:25 AM, Hari.K

Re: [tesseract-ocr] Re: What is the "Confidence"value returned by Tesseract and how it is calculated?

2017-06-09 Thread ShreeDevi Kumar
Technical documentation links https://github.com/tesseract-ocr/tesseract/wiki/Technical-Documentation

Re: [tesseract-ocr] Tesseract on Bitmap images giving error - Error: "Failed to create pix, this normally occurs because...

2017-06-09 Thread ShreeDevi Kumar
+ quan Quan will be better able to advise regarding .net also see https://sourceforge.net/projects/vietocr/files/vietocr.net/5.0alpha/ ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Fri, Jun 9, 2017 at 10:44 AM,

Re: [tesseract-ocr] Tesseract on Bitmap images giving error - Error: "Failed to create pix, this normally occurs because...

2017-06-08 Thread ShreeDevi Kumar
Have you tried using ghostscript to convert pdf to tif files instead? Example commands gs -r600x600 -sDEVICE=tiffg4 -dFirstPage=106 -dLastPage=109 -o ./tulasi/tulasikrishna%00d.tif "TulasiPuja.pdf" for one tif per page gs -r600x600 -sDEVICE=tiffg4 -dFirstPage=126 -dLastPage=131
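The ghostscript call quoted above can also be assembled from a script when many page ranges need converting. A minimal Python sketch that only builds the argument list, using the same flags as the message; the function name and the `page%03d.tif` output pattern are assumptions for illustration, not part of the original post.

```python
def gs_tiff_argv(pdf_path, output_pattern, first_page, last_page, dpi=600):
    """Build a ghostscript argument list for PDF -> G4 TIFF conversion.

    Uses only the flags shown in the message: -r, -sDEVICE, -dFirstPage,
    -dLastPage and -o. output_pattern is a hypothetical name such as
    "page%03d.tif"; ghostscript substitutes the page number for the %d part.
    """
    return [
        "gs",
        f"-r{dpi}x{dpi}",
        "-sDEVICE=tiffg4",
        f"-dFirstPage={first_page}",
        f"-dLastPage={last_page}",
        "-o", output_pattern,
        pdf_path,
    ]


# Print the command line for pages 106-109 of the PDF named in the message.
print(" ".join(gs_tiff_argv("TulasiPuja.pdf", "page%03d.tif", 106, 109)))
```

The list can be passed to `subprocess.run` as-is; building it separately makes the flag order easy to inspect before anything is executed.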

Re: [tesseract-ocr] Re: How can I convert font data from ver 3.02 to 3.05

2017-06-06 Thread ShreeDevi Kumar
As far as I know, the traineddata files for 3.04 (also usable for 3.05) are github versions of the files posted on code.google.com for 3.02. So, I would think 3.02 traineddata files will work with 3.05 but newer files will not work with 3.02. Best is to give it a try and report your results.

Re: [tesseract-ocr] Does any parameter to control ocr region?

2017-06-06 Thread ShreeDevi Kumar
try latest code from http://www.emgu.com/wiki/index.php/Version_History#Emgu.CV-3.2.0 I converted the bmp to png and tried with command line tesseract 4 and get correct result. $ tesseract I.png stdout --oem 1 --psm 6 D $ tesseract I.png stdout --oem 0 --psm 6 D original .bmp also works. $

Re: [tesseract-ocr] Re: Italian - Missing special-words

2017-06-05 Thread ShreeDevi Kumar
Yes, it should be there in tessdata like eng.user-words Please open an issue with details and link to this thread also, so that it can be added. Thanks! ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Mon, Jun 5, 2017

Re: [tesseract-ocr] Re: Italian - Missing special-words

2017-06-05 Thread ShreeDevi Kumar
File is there in langdata https://github.com/tesseract-ocr/langdata/blob/master/ita/ita.special-words and is referred to in the language config file https://github.com/tesseract-ocr/langdata/blob/master/ita/ita.config ShreeDevi भजन

Re: [tesseract-ocr] Detection Using LSTM Files

2017-06-05 Thread ShreeDevi Kumar
tes a combined version to use for recognition ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Mon, Jun 5, 2017 at 7:05 PM, ShreeDevi Kumar <shreesh...@gmail.com> wrote: > Comments from Ray regarding t

Re: [tesseract-ocr] Same Font with Multible Styles

2017-06-01 Thread ShreeDevi Kumar
text2image --list_available_fonts --fonts_dir /mnt/c/Windows/Fonts replace the fonts directory with your fonts location eg. 633: Times New Roman, 634: Times New Roman, Bold 635: Times New Roman, Bold Italic 636: Times New Roman, Italic 637: Trajan Pro 638: Trajan Pro Bold 639: Trebuchet MS 640:

Re: [tesseract-ocr] How to add space between strings of document. Punjabi (Gurmukhi) language have a space issue, after ocr the image it is showing no space b/w the text.

2017-06-01 Thread ShreeDevi Kumar
Read https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 Follow the tutorials.

Re: [tesseract-ocr] How to add space between strings of document. Punjabi (Gurmukhi) language have a space issue, after ocr the image it is showing no space b/w the text.

2017-06-01 Thread ShreeDevi Kumar
Are you training for 3.0 or 4.0? Do you have spaces between the letters in your training text? Read https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Re: [tesseract-ocr] How to add space between strings of document. Punjabi (Gurmukhi) language have a space issue, after ocr the image it is showing no space b/w the text.

2017-06-01 Thread ShreeDevi Kumar
https://github.com/tesseract-ocr/tessdata has the traineddata for 4.0.

Re: [tesseract-ocr] How to add space between strings of document. Punjabi (Gurmukhi) language have a space issue, after ocr the image it is showing no space b/w the text.

2017-06-01 Thread ShreeDevi Kumar
>>>>> binaries from https://github.com/UB-Mannheim/tesseract/wiki >>>>> >>>>> Use for GUI - look for tesseract 4.0 versions >>>>> >>>>> gImages

Re: [tesseract-ocr] How to add space between strings of document. Punjabi (Gurmukhi) language have a space issue, after ocr the image it is showing no space b/w the text.

2017-06-01 Thread ShreeDevi Kumar
>>> VietOCR https://sourceforge.net/projects/vietocr/files/vietocr/5.0alpha/ >>> >>> >>> >>> ShreeDevi >>> >>> भजन - कीर्तन - आरती @ http://bhaj

Re: [tesseract-ocr] Unable to find reference to C++ standard functions when building tesseract 4.00alpha

2017-06-01 Thread ShreeDevi Kumar
Does configure need any change?? See earlier messages for details. >> i can't manage to get an option for ./configure to use g++ instead of gcc. If somebody knows how, i would be grateful.

Re: [tesseract-ocr] Unable to find reference to C++ standard functions when building tesseract 4.00alpha

2017-05-31 Thread ShreeDevi Kumar
Supported Compilers - GCC 4.8 and above - Clang 3.4 and above - MSVC 2015, 2017 Other compilers might work, but are not officially supported. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Wed, May 31, 2017

Re: [tesseract-ocr] Unable to find reference to C++ standard functions when building tesseract 4.00alpha

2017-05-31 Thread ShreeDevi Kumar
*git pull origin* to get the latest source. I have built it today without any problems. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Wed, May 31, 2017 at 6:32 PM, Youcef wrote: > Hi, > >

Re: [tesseract-ocr] How to add space between strings of document. Punjabi (Gurmukhi) language have a space issue, after ocr the image it is showing no space b/w the text.

2017-05-31 Thread ShreeDevi Kumar
/manisandro/ gImageReader/releases VietOCR https://sourceforge.net/projects/vietocr/files/vietocr/5.0alpha/ ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Wed, May 31, 2017 at 5:05 PM, ShreeDevi

Re: [tesseract-ocr] How to add space between strings of document. Punjabi (Gurmukhi) language have a space issue, after ocr the image it is showing no space b/w the text.

2017-05-31 Thread ShreeDevi Kumar
https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage https://github.com/tesseract-ocr/tesseract/wiki https://github.com/UB-Mannheim/tesseract/wiki https://github.com/manisandro/gImageReader/releases ShreeDevi भजन -

Re: [tesseract-ocr] How to add space between strings of document. Punjabi (Gurmukhi) language have a space issue, after ocr the image it is showing no space b/w the text.

2017-05-31 Thread ShreeDevi Kumar
The output you posted, is it using the 3.04 traineddata from repo? What PSM did you use? Try using the experimental tesseract4 version for windows , see wiki for links. On May 31, 2017 3:47 PM, "Mandeep Singh" wrote: > I am using Window 8.1 and tesseract version 3.04. >

Re: [tesseract-ocr] Re: user-words

2017-05-31 Thread ShreeDevi Kumar
Samuel, Do the user-words work as expected after making this change? Which version of tesseract are you using? ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Wed, May 31, 2017 at 2:35 AM, Samuel backus

Re: [tesseract-ocr] How recognize footnotes

2017-05-30 Thread ShreeDevi Kumar
Try the `hocr` output and see if it provides some of what you need. I don't think tesseract will link to footnotes though it may recognize the text. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, May 30, 2017 at

Re: [tesseract-ocr] Fine-turning LSTM for Japanese

2017-05-28 Thread ShreeDevi Kumar
Ray is the best person to answer your questions. I can only share my experience trying to train using Devanagari script. Fine Tune will work if all you want to change is a font, with the same unicharset. This works well for Latin script based languages but not complex scripts. eg. for

Re: [tesseract-ocr] Fine-turning LSTM for Japanese

2017-05-28 Thread ShreeDevi Kumar
Please see inline replies: On Sun, May 28, 2017 at 4:53 PM, Akira Hayakawa wrote: > I am new to tesseract. My aim is to use this software to analyze Japanese > doc. The idea in my mind is to start from existing model and fine-tune it > by new words that weren't correctly

Re: [tesseract-ocr] Re: Cube training tools

2017-05-26 Thread ShreeDevi Kumar
>> be found in https://github.com/tesseract-ocr/tessdata/tree/3.04.00 >> >> Zdenko >> >> On Wed, May 24, 2017 at 2:54 PM, ShreeDevi Kumar <shree...@gmail.com> >> wrote: >> >>> cube traini

Re: [tesseract-ocr] How to extend the output format

2017-05-25 Thread ShreeDevi Kumar
tesseract writes the file names to console, you can try the following: tesseract list.txt stdout > output.txt 2>&1 or tesseract list.txt stdout -c include_page_breaks=1 > output.txt 2>&1 ShreeDevi भजन - कीर्तन - आरती @
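The `> output.txt 2>&1` redirection above sends both stdout and stderr to the same file. When tesseract is called from a script instead of a shell, the same effect can be had as in the following Python sketch; `run_to_file` is a hypothetical helper name, and the tesseract call in the trailing comment is simply the command from the message.

```python
import subprocess


def run_to_file(argv, output_path):
    """Run argv and write stdout and stderr to one file (like `> file 2>&1`)."""
    with open(output_path, "wb") as out:
        # stderr=subprocess.STDOUT merges stderr into the stdout stream.
        completed = subprocess.run(argv, stdout=out, stderr=subprocess.STDOUT)
    return completed.returncode


# Example call, mirroring the command in the message (requires tesseract):
# run_to_file(["tesseract", "list.txt", "stdout"], "output.txt")
```

The return code is handed back so the caller can tell a failed run from an empty result.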

Re: [tesseract-ocr] Re: Cube training tools

2017-05-24 Thread ShreeDevi Kumar
cube training is not supported, no information is available for it. It has been deleted from the latest code. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Wed, May 24, 2017 at 2:51 PM, Merlin ArulPrakash <

Re: [tesseract-ocr] How to add space between strings of document. Punjabi (Gurmukhi) language have a space issue, after ocr the image it is showing no space b/w the text.

2017-05-24 Thread ShreeDevi Kumar
Which O/S? Which version of Tesseract? How are you training? Have you tried the packaged traineddata for Punjabi? What result do you get with that? ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Wed, May 24, 2017 at

Re: [tesseract-ocr] Neural networks in tesseract 4.0

2017-05-22 Thread ShreeDevi Kumar
https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Mon, May 22, 2017 at 8:31 PM,

Re: [tesseract-ocr] Generating a PDF with Tesseract C++

2017-05-22 Thread ShreeDevi Kumar
Look at the examples in https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/2ArchitectureAndDataStructures.pdf ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Mon, May 22, 2017 at 7:34 PM, Saliaj Adrian

Re: [tesseract-ocr] Training from scratch

2017-05-20 Thread ShreeDevi Kumar
also see https://github.com/tesseract-ocr/tesseract/blob/master/contrib/genlangdata.pl ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Sat, May 20, 2017 at 10:12 AM, ShreeDevi Kumar <shreesh...@gmail.com>

Re: [tesseract-ocr] Training from scratch

2017-05-19 Thread ShreeDevi Kumar
Google has not shared its method of training with complete scripts etc. The training instructions on wiki are only a tutorial for learning about LSTM training. Please also see https://github.com/tesseract-ocr/tesseract/issues/644 ShreeDevi

Re: [tesseract-ocr] Training from scratch

2017-05-19 Thread ShreeDevi Kumar
As per Ray, 4500 fonts and 40 lines of text were used to create the models of Latin script based languages. So I am not sure whether you can replicate the model. For language specific exposure settings etc see

Re: [tesseract-ocr] Tesseract 4 new Font

2017-05-17 Thread ShreeDevi Kumar
1. Which --oem are you using with tesseract 4, legacy engine or lstm? --oem 0 or --oem 1 2. Is Brazilian Portuguese very different from Portuguese? Please see the trainingtext and wordlists on https://github.com/tesseract-ocr/langdata/tree/master/por 3. Provide a sample image with its ground

Re: [tesseract-ocr] include tesseract ocr in visual studio 2010

2017-05-15 Thread ShreeDevi Kumar
Which version of tesseract, which source? Tesseract 4, master branch does not support visual studio 2010, please check the changelog. You can try the 3.05 branch or newer visual studio. On May 15, 2017 8:10 PM, "emna ouerteni" wrote: > include tesseract ocr in

Re: [tesseract-ocr] Tesseract 4: Shuffling training instances and unicharset compression at the same time?

2017-05-12 Thread ShreeDevi Kumar
Please see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 80 is the default. I think it means both 64 and 16 are applied. train_mode int 80 Flags from TrainingFlags in lstmrecognizer.h Possible values= 64 for Compress unicharset, 16 for round-robin training. ShreeDevi
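The reading that "both 64 and 16 are applied" follows from the arithmetic: 80 = 64 + 16, so both bits are set in the default. A minimal Python sketch of that check; the constant and flag names are labels chosen for illustration and are not claimed to match the identifiers in lstmrecognizer.h.

```python
# Flag values as quoted in the message; the names are illustrative only.
COMPRESS_UNICHARSET = 64
ROUND_ROBIN_TRAINING = 16


def decode_train_mode(train_mode):
    """Return the names of the flag bits set in a train_mode value."""
    flags = []
    if train_mode & COMPRESS_UNICHARSET:
        flags.append("compress_unicharset")
    if train_mode & ROUND_ROBIN_TRAINING:
        flags.append("round_robin_training")
    return flags


# The default value of 80 has both bits set.
print(decode_train_mode(80))
```

Under the same reading, a value of 64 would select only unicharset compression and 16 only round-robin training.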

Re: [tesseract-ocr] Training Tesseract 4.0 for Nepali Language

2017-05-10 Thread ShreeDevi Kumar
. Please note that so far I have not had success in improving the accuracy of hindi traineddata with my experiments. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com 2017-05-10 22:07 GMT+05:30 ShreeDevi Kumar <shre

Re: [tesseract-ocr] Training Tesseract 4.0 for Nepali Language

2017-05-10 Thread ShreeDevi Kumar
shree wrote: >> >> Attached is the output I get with >> >> tesseract nep_text_11.png nep_text_11 --oem 1 --psm 6 -l hin >> >> >> ShreeDevi >> >> भजन - कीर्तन - आरती @ http://bhajans.ra

Re: [tesseract-ocr] How to append eng.traindata with new font. ?

2017-05-09 Thread ShreeDevi Kumar
try option for multiple languages -l eng+ ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, May 9, 2017 at 9:47 PM, wrote: > Hi Community, > > Can someone please tell me how to

Re: [tesseract-ocr] Training Tesseract 4.0 for Nepali Language

2017-05-09 Thread ShreeDevi Kumar
Attached is the output I get with tesseract nep_text_11.png nep_text_11 --oem 1 --psm 6 -l hin ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com 2017-05-09 21:11 GMT+05:30 ShreeDevi Kumar <shreesh...@gmail.com>: &g

Re: [tesseract-ocr] Training Tesseract 4.0 for Nepali Language

2017-05-09 Thread ShreeDevi Kumar
Thanks. Please provide the 'ground truth' ie the original accurate text for the image. Have you tried to OCR the same image with options --oem 1 --psm 6 -l hin? Sometimes hindi traineddata gives better results. On May 9, 2017 9:05 PM, "Nirajan Pant" wrote: > Here is a sample

Re: [tesseract-ocr] Re: Tesseract 4.0 Neural Network

2017-05-09 Thread ShreeDevi Kumar
https://github.com/tesseract-ocr/tesseract/wiki/Compiling master branch on github is for 4.0.0alpha ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, May 9, 2017 at 7:35 PM, sfo wrote: >

Re: [tesseract-ocr] tesseract 4.0 documentation

2017-05-09 Thread ShreeDevi Kumar
https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, May 9, 2017 at 7:29 PM, sfo wrote: > hello! where can i find tesseract 4.0

Re: [tesseract-ocr] How to automatically generate .box files when using tesstrain.sh?

2017-05-09 Thread ShreeDevi Kumar
Box files are generated after the tif. The script works on 8 fonts at a time. ls -l /tmp/tmp.Vu25eURnxk/eng/*.* will show you all generated files. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, May 9, 2017

Re: [tesseract-ocr] Training Tesseract 4.0 for Nepali Language

2017-05-09 Thread ShreeDevi Kumar
see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 for info about training. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, May 9, 2017 at 12:38 PM, ShreeDevi Kumar <shreesh...@gmail.

Re: [tesseract-ocr] Training Tesseract 4.0 for Nepali Language

2017-05-09 Thread ShreeDevi Kumar
Please provide sample of 'not giving good results' and samples of lines not being recognized correctly. Images and ground truth files will be helpful. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, May 9, 2017 at

Re: [tesseract-ocr] Re: got "undefined symbol omp_get_thread_num" while try example "extracting orientation from Tesseract 4.0"

2017-05-07 Thread ShreeDevi Kumar
Most probably the API example has not been updated for tesseract 4. There have been many changes - Please see https://abi-laboratory.pro/tracker/timeline/tesseract/ ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Sun,

Re: [tesseract-ocr] Re: Fine tuning with existing box/tiff pairs in Tesseract 4.0

2017-05-06 Thread ShreeDevi Kumar
When using pre-existing box tiff pairs, you have to add a box with tab character to mark end of line and also add boxes with spaces after every word. You then need to generate the .lstmf files - please see training/tesstrain.sh for details. ShreeDevi
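One way to picture the extra boxes described above is the rough Python sketch below. It assumes the usual box line layout of `symbol left bottom right top page`, places a space box in the gap between consecutive words, and closes each text line with a tab box. The coordinates chosen for the space and tab boxes, and the choice to put the tab box directly after the last word, are guesses made for illustration; compare against box files produced by text2image before relying on it.

```python
def box_line(symbol, left, bottom, right, top, page=0):
    """Format one box file line: symbol, four coordinates, page number."""
    return f"{symbol} {left} {bottom} {right} {top} {page}"


def lstm_box_lines(text_lines, page=0):
    """Add space and end-of-line boxes to per-character boxes.

    text_lines is a list of text lines; each text line is a list of words;
    each word is a list of (symbol, left, bottom, right, top) tuples.
    """
    out = []
    for words in text_lines:
        for index, word in enumerate(words):
            for symbol, left, bottom, right, top in word:
                out.append(box_line(symbol, left, bottom, right, top, page))
            _, _, bottom, right, top = word[-1]
            if index + 1 < len(words):
                # Space box spanning the gap up to the next word (assumed geometry).
                next_left = words[index + 1][0][1]
                out.append(box_line(" ", right, bottom, next_left, top, page))
            else:
                # Tab box marking the end of the text line (assumed geometry).
                out.append(box_line("\t", right, bottom, right + 1, top, page))
    return out
```

The resulting lines would still need to go through the .lstmf generation step mentioned in the message.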

Re: [tesseract-ocr] Wrong or missing Segmentation of Words

2017-05-04 Thread ShreeDevi Kumar
Please provide your original image for testing. Thanks! ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Thu, May 4, 2017 at 5:36 PM, 'Thomas Zipproth' via tesseract-ocr < tesseract-ocr@googlegroups.com> wrote: > We

Re: [tesseract-ocr] Re: How to make training for Arabic in Tesseract 4.0

2017-05-04 Thread ShreeDevi Kumar
Ibr, You are incorrect in your description of LSTM training. What you are doing will use the ara.traineddata provided in the repo, there will be no change in output. Once lstmf files are created, you have to run lstmtraining which will run for days/weeks to give you a good result. Please read

Re: [tesseract-ocr] Converting Handwritten image to text format

2017-05-02 Thread ShreeDevi Kumar
tesseract is not meant for OCR of handwriting. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, May 2, 2017 at 1:02 PM, Jaya Kumar wrote: > Hi , > > I have a image document and I am trying

Re: [tesseract-ocr] Re: Standalone Self-contained Tesseract-OCR for Mac

2017-05-01 Thread ShreeDevi Kumar
Stefan, Please make the mac binaries available for both 3.05 and 4.00 similar to windows. I noticed that you have posted the test version for standalone Tess. Thanks! PS: Are the Travis created binaries available for download by users? On May 1, 2017 7:30 PM, "'Stefan Weil' via tesseract-ocr" <

Re: [tesseract-ocr] not reading the image properly in tesseract OCR

2017-04-27 Thread ShreeDevi Kumar
See https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage - excuse the brevity, sent from mobile On 27-Apr-2017 9:04 PM, "ShreeDevi Kumar" <shreesh...@gmail.com> wrote: > tesseract output is plain text only, you will not get rich text with fonts &

Re: [tesseract-ocr] not reading the image properly in tesseract OCR

2017-04-27 Thread ShreeDevi Kumar
tesseract output is plain text only, you will not get rich text with fonts etc. ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Thu, Apr 27, 2017 at 7:25 PM, Jaya Kumar wrote: > Hi > I am

Re: [tesseract-ocr] pb install on redhat PKG_CHECK_MODULES(LEPTONICA

2017-04-25 Thread ShreeDevi Kumar
I built both from source yesterday. Try the following for building tesseract ./autogen.sh ./configure LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make sudo make install sudo ldconfig As given in compiling page on wiki - excuse the brevity, sent from mobile On 25-Apr-2017 2:14 PM,

Re: [tesseract-ocr] Absolute beginner requesting help for getting started with Tesseract in C++ application.

2017-04-25 Thread ShreeDevi Kumar
See https://github.com/tesseract-ocr/tesseract/wiki/User-App-Example https://github.com/tesseract-ocr/tesseract/wiki/APIExample - excuse the brevity, sent from mobile On 25-Apr-2017 12:11 PM, "Dhairya Shah" wrote: > Dear All, > I am absolute complete beginner with

Re: [tesseract-ocr] Re: issue with simple reading of numbers 9 and 8

2017-04-23 Thread ShreeDevi Kumar
362b68e) ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Sun, Apr 23, 2017 at 9:25 AM, ShreeDevi Kumar <shreesh...@gmail.com> wrote: > Try training using more samples of 8, 9, B etc. > > What res

Re: [tesseract-ocr] Re: issue with simple reading of numbers 9 and 8

2017-04-22 Thread ShreeDevi Kumar
Try training using more samples of 8, 9, B etc. What results do you get with the provided eng.traineddata? Are they better or worse? Have you tried changing DPI of image to 300? - excuse the brevity, sent from mobile On 22-Apr-2017 10:29 PM, "James Abney" wrote: > Oh yes

Re: [tesseract-ocr] Re: issue with simple reading of numbers 9 and 8

2017-04-21 Thread ShreeDevi Kumar
Which version of Tesseract. Which o/s? If all your text is in tungsten-semibold, have you tried training with just that font? - excuse the brevity, sent from mobile On 22-Apr-2017 12:50 AM, "James Abney" wrote: The font is tungsten semibold On Friday, April 21, 2017 at

Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2017-04-21 Thread ShreeDevi Kumar
If you want to OCR an invoice like the sample you posted, just use the eng.traineddata and OCR the page. You do not need to do any training. Here is the output I get 8633 0410 NO RP 11 07122015 NYNN 01 01 0001 Page 2 Of 3 Did you know? Your Comcast Business Internet service gives

Re: [tesseract-ocr] Re: Tesseract Installation

2017-04-19 Thread ShreeDevi Kumar
You can check that these are installed by entering the following which text2image The above will show u the location it is installed If you don't have training tools, you will need to build them separately - see https://github.com/tesseract-ocr/tesseract/wiki/Compiling make training sudo make
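The `which text2image` check can be repeated for several training tools at once. A small Python sketch using the standard library's `shutil.which`; the helper name is hypothetical, and the tool list is just the set of programs mentioned elsewhere in these threads.

```python
import shutil


def missing_tools(names):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Training tools mentioned in these threads; an empty list means all were found.
print(missing_tools(["text2image", "combine_tessdata", "lstmtraining"]))
```

Any name reported as missing points to the training tools not having been built or installed yet.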

Re: [tesseract-ocr] Re: Standalone Self-contained Tesseract-OCR for Mac

2017-04-18 Thread ShreeDevi Kumar
I haven't built 3.05 so cannot help. I would suggest that you try with older commits of tesseract 3.05 branch to see which one works. Hope that those who have built 3.05 on mac will help.

Re: [tesseract-ocr] Re: Standalone Self-contained Tesseract-OCR for Mac

2017-04-18 Thread ShreeDevi Kumar
Please see https://github.com/tesseract-ocr/tesseract/wiki/Compiling If you are building tesseract 4.0, you need Lept 1.74 ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Tue, Apr 18, 2017 at 2:25 PM, Peter Reid

Re: [tesseract-ocr] Re: Standalone Self-contained Tesseract-OCR for Mac

2017-04-18 Thread ShreeDevi Kumar
Use latest version of leptonica - 1.74.1 https://github.com/DanBloomberg/leptonica ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com On Mon, Apr 17, 2017 at 8:18 PM, Peter Reid wrote: > I've done

Re: [tesseract-ocr] Re: segmentation fault with tesseract 4

2017-04-12 Thread ShreeDevi Kumar
Please open as issue, as problem related to --psm 0. - excuse the brevity, sent from mobile On 13-Apr-2017 9:29 AM, "Pritam Dodeja" wrote: > Find below - I can also ship my docker container to you if you want so you > can see my exact setup, it's about 1.15GB > >

Re: [tesseract-ocr] Re: segmentation fault with tesseract 4

2017-04-12 Thread ShreeDevi Kumar
See https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage Follow correct order of variables tesseract imagename|stdin outputbase|stdout [options...] [configfile...] ShreeDevi भजन - कीर्तन - आरती @
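When the command is generated by a program, the argument order above is easy to get wrong. A minimal Python sketch that fixes the order in one place: image first, then output base, then options, then config files. The function name and the file name `page.png` are placeholders for illustration.

```python
def tesseract_argv(image, outputbase, options=(), configfiles=()):
    """Build an argument list in the order: image, outputbase, options, configs."""
    return ["tesseract", image, outputbase, *options, *configfiles]


# Image and output base come first; --oem/--psm follow as options.
argv = tesseract_argv("page.png", "stdout", options=["--oem", "1", "--psm", "6"])
print(" ".join(argv))
```

Keeping options and config files as separate parameters means a config name such as `hocr` always lands at the end of the command.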

Re: [tesseract-ocr] Train Tesseract 4.0 LSTM based on images

2017-04-12 Thread ShreeDevi Kumar
LSTM training is not like legacy training. Please read the wiki pages regarding 4.0 training. I have given all sample commands there. There are 3 different ways of training. Read the bash scripts regarding training to know more. tesstrain.sh with --linedata_only creates the box tiff pairs but

Re: [tesseract-ocr] Train Tesseract 4.0 LSTM based on images

2017-04-12 Thread ShreeDevi Kumar
Arabic was never trained with the legacy tesseract engine and I doubt you will get any improvement over existing traineddata using cube or lstm. You are free to experiment and see what you come up with. I have pointed to the bash scripts for training. Please refer to them for the correct

Re: [tesseract-ocr] Re: Tesseract (4 alpha ) Amibiguos Situation while Correcting Chars in box file

2017-04-12 Thread ShreeDevi Kumar
You can use jtessboxeditor to edit the box files. Make sure to mark EOL if you are trying to train using scanned images. Also note that this part of code is untested - training 4.0 using pre-existing images and box files. Ray has only explained method for using images created by text2image.

Re: [tesseract-ocr] Train Tesseract 4.0 LSTM based on images

2017-04-12 Thread ShreeDevi Kumar
see https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh if ((LINEDATA)); then phase_E_extract_features "lstm.train" 8 "lstmf" make__lstmdata else phase_E_extract_features "box.train" 8 "tr" phase_C_cluster_prototypes "${TRAINING_DIR}/${LANG_CODE}.normproto" if

Re: [tesseract-ocr] Train Tesseract 4.0 LSTM based on images

2017-04-12 Thread ShreeDevi Kumar
Read the bash scripts in tesstrain.sh tesstrain_utils.sh language_specific.sh In training directory To understand more detail about lstm training - excuse the brevity, sent from mobile On 12-Apr-2017 10:47 AM, "Ahmad Moawad" wrote: > this is the part from

Re: [tesseract-ocr] Help in TrainingTesseract 4.00 Finetune

2017-04-12 Thread ShreeDevi Kumar
--linedata_only means that it will only try to create lstmf files and not the files for 3.0x training - excuse the brevity, sent from mobile On 12-Apr-2017 10:39 AM, "Ahmad Moawad" wrote: > Hello All, > > I want help in trainingTesseract 4.00 Finetune >

Re: [tesseract-ocr] Re: Tesseract Installation

2017-04-11 Thread ShreeDevi Kumar
Also, if you want training tools, you need to build them separately - see https://github.com/tesseract-ocr/tesseract/wiki/Compiling make training sudo make training-install ShreeDevi भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Re: [tesseract-ocr] Tesseract Installation

2017-04-11 Thread ShreeDevi Kumar
You can ignore it. I get it too when using sudo 2nd time. Host name must be the id for your computer under windows10. Have u tried running tesseract after that? - excuse the brevity, sent from mobile On 11-Apr-2017 4:10 PM, "Ibr" wrote: Hi, I'm trying to install the

Re: [tesseract-ocr] How to add Armenian language support to tesseract

2017-04-11 Thread ShreeDevi Kumar
I have added this at https://github.com/tesseract-ocr/langdata/issues/67 Please add more information there: Which language code - arm or hye Modern Armenian or Classical Armenian Sources for primary texts in unicode the Armenian language to use for training Freely available unicode fonts to

Re: [tesseract-ocr] Tesseract 4.0 doesn't see the changes after Arabic traning

2017-04-08 Thread ShreeDevi Kumar
Arabic traineddata for 3.0x uses cube engine. Training process for that was never shared. Now the cube engine has been removed for lstm 4.0, which is still in alpha stage. There is 4.0alpha traineddata for Arabic and you can train for it , but accuracy is not great. Ray is doing another training

Re: [tesseract-ocr] (Advise needed) Command Output Fails and gives error in Tesseract 4 during fine tuning

2017-04-06 Thread ShreeDevi Kumar
You must be using an old version of traineddata which does not have LSTM. - excuse the brevity, sent from mobile On 07-Apr-2017 2:13 AM, wrote: > I am following this link https://github.com/tesseract-ocr/tesseract/wiki/ > TrainingTesseract-4.00---Finetune > > For genaerating

Re: [tesseract-ocr] Read 2 column Image Horizontally (line by line) rather than Vertically (column by column)

2017-04-06 Thread ShreeDevi Kumar
Normally, for text output, the other config files should not have any impact. - excuse the brevity, sent from mobile On 07-Apr-2017 2:18 AM, "Mike Hall" wrote: > Yes, we are using the -psm 6 command line argument. And it was not > working. > > But I figured out the issue. > >

Re: [tesseract-ocr] Read 2 column Image Horizontally (line by line) rather than Vertically (column by column)

2017-04-06 Thread ShreeDevi Kumar
Have you tried --psm 6? - excuse the brevity, sent from mobile On 06-Apr-2017 11:06 PM, "Mike Hall" wrote: > We have a C# .Net app that is using Tesseract to do Optical Character > Recognition (OCR) on .tiff files. I've attached a sample tiff file. > > We are then
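Page segmentation mode 6 tells Tesseract to assume a single uniform block of text, which keeps it from splitting the page into columns. A small sketch follows, assuming the 3.0x command line spells the option `-psm` and 4.0 spells it `--psm`; `psm_flag` and `ocr_single_block` are illustrative names.

```shell
# psm_flag VERSION
# Prints the option spelling assumed for the given Tesseract version.
psm_flag() {
  case $1 in
    3|3.*) echo "-psm" ;;
    *)     echo "--psm" ;;
  esac
}

# ocr_single_block IMAGE OUTBASE VERSION
# Runs OCR treating the whole page as one block, so rows are read left to right.
ocr_single_block() {
  image=$1
  outbase=$2
  version=$3
  tesseract "$image" "$outbase" "$(psm_flag "$version")" 6
}
```

With a 4.0 build, `ocr_single_block sample.tif out 4.0` would run `tesseract sample.tif out --psm 6` and write `out.txt`.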

Re: [tesseract-ocr] Re: train tesseract OCR 4.0

2017-04-05 Thread ShreeDevi Kumar
You do not have the LSTM.train config file. - excuse the brevity, sent from mobile On 05-Apr-2017 1:55 PM, wrote: > After u have said, > > I tried in two ways and i am stuck at lstm step: > > Training > > command used: > >

Re: [tesseract-ocr] Re: train tesseract OCR 4.0

2017-04-05 Thread ShreeDevi Kumar
4.0 is alpha software. Please use an older released version. - excuse the brevity, sent from mobile On 05-Apr-2017 1:55 PM, wrote: > After u have said, > > I tried in two ways and i am stuck at lstm step: > > Training > > command used: > >

Re: [tesseract-ocr] Tesseract (4 alpha ) Amibiguos Situation while Correcting Chars in box file

2017-04-05 Thread ShreeDevi Kumar
Have you tried just using the eng.traineddata directly with Tesseract 3.04 / 3.05 / 4.0? You don't need to train unless it is a very special case. With Tesseract 3.0x you can try replacing the dictionary dawg files. ShreeDevi भजन - कीर्तन -
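Swapping the dictionary in a 3.0x traineddata file can be sketched as unpack, rebuild the word dawg, and overwrite the component. This assumes the usual argument order of `wordlist2dawg` (wordlist, dawg, unicharset) and that `combine_tessdata -o` replaces components in place, so work on a copy; `replace_word_dawg` is a hypothetical helper name.

```shell
# replace_word_dawg TRAINEDDATA LANG WORDLIST
# Rebuilds LANG.word-dawg from WORDLIST and writes it into TRAINEDDATA.
replace_word_dawg() {
  traineddata=$1
  lang=$2
  wordlist=$3
  work=$(mktemp -d) || return 1
  # Unpack to get the unicharset, build the new dawg, then overwrite it.
  combine_tessdata -u "$traineddata" "$work/$lang." &&
  wordlist2dawg "$wordlist" "$work/$lang.word-dawg" "$work/$lang.unicharset" &&
  combine_tessdata -o "$traineddata" "$work/$lang.word-dawg"
  status=$?
  rm -rf "$work"
  return $status
}
```

Usage would look like `cp eng.traineddata my.traineddata && replace_word_dawg my.traineddata eng words.txt`, where `words.txt` holds one word per line.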

Re: [tesseract-ocr] Re: train tesseract OCR 4.0

2017-04-04 Thread ShreeDevi Kumar
Read
https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune

Re: [tesseract-ocr] train tesseract OCR 4.0

2017-04-04 Thread ShreeDevi Kumar
Tesstrain.sh generates a file called eng.training_files.txt. You are using the command without the .txt extension. Check the name of the generated file and use that. I have found that editing that file also gives errors. - excuse the brevity, sent from mobile On 04-Apr-2017 7:01 PM,
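A small guard can catch a wrong list-file name before training starts. This sketch assumes tesstrain.sh wrote `<lang>.training_files.txt` into its output directory; `find_listfile` is a hypothetical helper.

```shell
# find_listfile DIR LANG
# Prints the path of the generated list file, or lists DIR and fails.
find_listfile() {
  dir=$1
  lang=$2
  f="$dir/$lang.training_files.txt"
  if [ -f "$f" ]; then
    printf '%s\n' "$f"
  else
    echo "missing $f; files present in $dir:" >&2
    ls "$dir" >&2
    return 1
  fi
}
```

The result can then be passed on unedited, for example `lstmtraining ... --train_listfile "$(find_listfile ~/tesstutorial/engtrain eng)"`, with the remaining options as on the training wiki.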

Re: [tesseract-ocr] train tesseract OCR 4.0

2017-04-04 Thread ShreeDevi Kumar
See
https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh
https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain_utils.sh
https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh
-- You received this message because you are

Re: [tesseract-ocr] train tesseract OCR 4.0

2017-04-03 Thread ShreeDevi Kumar
Saurabh, It depends on what you want to do with the bash script. Here is a sample of a script I used to compare results using different tessdata files by looping through a set of image files. Google the bash commands to figure out what they do! #!/bin/bash set -vx export
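The script itself is cut off in this archive, so the following is an independent sketch of the same idea rather than a reconstruction: run each image through two tessdata directories and save a diff of the text output. The function name, the `.png` extension and the output naming are assumptions.

```shell
# compare_tessdata IMG_DIR TESSDATA_A TESSDATA_B LANG OUT_DIR
# OCRs every PNG in IMG_DIR with both tessdata directories and diffs the text.
compare_tessdata() {
  img_dir=$1
  data_a=$2
  data_b=$3
  lang=$4
  out_dir=$5
  mkdir -p "$out_dir"
  for img in "$img_dir"/*.png; do
    [ -e "$img" ] || continue          # no matching images
    base=$(basename "$img" .png)
    # tesseract appends .txt to the output base name
    tesseract "$img" "$out_dir/$base.a" --tessdata-dir "$data_a" -l "$lang"
    tesseract "$img" "$out_dir/$base.b" --tessdata-dir "$data_b" -l "$lang"
    # diff exits 1 when the files differ; keep looping either way
    diff "$out_dir/$base.a.txt" "$out_dir/$base.b.txt" > "$out_dir/$base.diff" || true
  done
}
```

An empty `.diff` file means both tessdata sets produced identical text for that image.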

Re: [tesseract-ocr] Error while creating training data for Japanese

2017-04-03 Thread ShreeDevi Kumar
jpn.config in langdata/jpn is loading jpn_vert as a sublanguage: tessedit_load_sublangs jpn_vert. You can try without that. Also look at the settings for jpn in training/language-specific.sh. You may need to change the following also .. # The following fonts will be rendered vertically in phase

Re: [tesseract-ocr] VietOCR 5.0 alpha availability

2017-04-03 Thread ShreeDevi Kumar
You need to get vietocr 5.0 alpha for tesseract 4.0 alpha https://sourceforge.net/projects/vietocr/files/vietocr.net/5.0alpha/ https://sourceforge.net/projects/vietocr/files/vietocr/5.0alpha/ ShreeDevi भजन - कीर्तन - आरती @
